# Week 6 - Classification

This week's assignment will give you an opportunity to practice your classification skills with the scikit-learn package and learn to implement your own sklearn-like classifier from scratch.

In the first component, you will apply logistic regression to a dataset of wine quality. The idea is to predict wine quality based on its chemical properties. Read more about the wine dataset here - https://archive.ics.uci.edu/ml/datasets/wine+quality.

In the second component, you will use multinomial naive Bayes to predict wine qualities.

In the third component, you will implement your own version of KNN.

Note: Because this assignment relies heavily on randomization, you may not reproduce the exact performance metrics (sometimes even with the random seed set). The assert statements therefore validate your solutions only approximately.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

%matplotlib inline

## Task 1 (1 point) - Logistic Regression

In this task you will use RepeatedStratifiedKFold cross-validation to evaluate the performance of logistic regression on the wine dataset, predicting wine quality.

In this dataset, wine quality takes integer values from 3 to 8. For this task, however, you need to convert these qualities to a binary value so that your model predicts whether a specific wine is "high quality". We define a wine with a quality value of __7__ or above as high-quality.
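
As a minimal sketch of the thresholding idea, the snippet below binarizes a small, made-up table. The `toy_wines` frame and its values are assumptions for illustration only; the column names `alcohol` and `quality` follow the wine dataset.

```python
import pandas as pd

# Hypothetical toy data, not the real wine dataset.
toy_wines = pd.DataFrame(
    {"alcohol": [9.4, 10.5, 12.8, 11.2], "quality": [5, 7, 8, 6]}
)

# A boolean comparison cast to int gives 1 for quality >= 7 and 0 otherwise.
is_high_quality = (toy_wines["quality"] >= 7).astype(int)

print(is_high_quality.tolist())  # [0, 1, 1, 0]
```

The comparison uses `>=` so that a quality of exactly 7 counts as high-quality.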

To estimate how well your model is doing, you will use RepeatedStratifiedKFold cross-validation. This is an sklearn cross-validation iterator that runs stratified K-fold cross-validation multiple times.

We will assess how well we are doing using __accuracy__. To do this using the cross_val_score function, set the "scoring" parameter to 'accuracy' (scoring='accuracy').

Store the wine features in a variable named "X", and the response in a variable named "y". Store the cross-validation scores in a "scores" variable. We will check these variables in our asserts.

Additional instructions:
* Use the following from sklearn: RepeatedStratifiedKFold, LogisticRegression and cross_val_score.
* We will load the wine dataset for you. You will need to divide the wine dataset into X and y.
* X should be a matrix of all the features in the dataset, __excluding wine quality__.
* y should be a vector of 0s and 1s, with 0 denoting low-quality wines and 1 denoting high-quality wines.
* Initialize the RepeatedStratifiedKFold cross-validation object to have 5 folds and 1000 repetitions. __Set random_state to 42__.
* For a visual aid, you can plot the distribution of scores across the cross-validation runs.
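
The general pattern can be sketched on synthetic data as follows. This is an illustration under stated assumptions, not the task solution: the data comes from `make_classification`, the `_demo` names are hypothetical, and the number of repetitions is kept small so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in data (an assumption for illustration, not the wine data).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# 5 folds repeated 3 times yields 5 * 3 = 15 scores.
cv_demo = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
model_demo = LogisticRegression(max_iter=1000)
scores_demo = cross_val_score(
    model_demo, X_demo, y_demo, scoring="accuracy", cv=cv_demo
)

print(scores_demo.shape)   # one score per fold per repetition
print(scores_demo.mean())  # average accuracy across all runs
```

`cross_val_score` returns one score per fold per repetition, so the length of the result is `n_splits * n_repeats`.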

In [None]:
# Here we are loading the wine data for you.
import numpy as np
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wines = pd.read_csv(url, sep=";")

# YOUR CODE HERE
raise NotImplementedError()

# To view scores you can run: pd.Series(scores).hist()

In [None]:
assert y.shape == (1599,)
assert X.shape == (1599, 11)
assert pd.Series(scores).mean() > 0.86

## Task 2 (1 point) - Multinomial Naive Bayes

In this task you will use RepeatedStratifiedKFold cross-validation to evaluate the performance of multinomial naive Bayes (NB) on the wine dataset, predicting wine quality. This is similar to the previous task, except that here we use multinomial NB and do not convert the wine qualities to binary.

For this task, we've decided we want to only use wines with qualities between __5__ and __7__ (inclusive). You will have to exclude wines that are not in this range.
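
One way to express an inclusive range filter in pandas is `Series.between`, sketched below on made-up rows. The `toy_wines` frame is hypothetical; only the column names `pH` and `quality` come from the wine dataset.

```python
import pandas as pd

# Hypothetical toy data with qualities both inside and outside the range.
toy_wines = pd.DataFrame(
    {"pH": [3.5, 3.2, 3.3, 3.4, 3.1], "quality": [4, 5, 6, 7, 8]}
)

# between() includes both endpoints by default, so 5 and 7 are kept.
in_range = toy_wines["quality"].between(5, 7)
filtered = toy_wines[in_range]

print(filtered["quality"].tolist())  # [5, 6, 7]
```

Filtering the whole frame first, before splitting it into features and response, keeps the rows of both aligned.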

To estimate how well your model is doing, you will use RepeatedStratifiedKFold cross-validation. This is an sklearn cross-validation iterator that runs stratified K-fold cross-validation multiple times.

We will assess how well we are doing using __f1_macro__. f1_macro is an evaluation based on the f1 score, but it can be used in multiclass (non-binary) problems. To do this using the cross_val_score function, set the "scoring" parameter to 'f1_macro' (scoring='f1_macro').
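
To make the metric concrete: macro-averaged F1 is the unweighted mean of the per-class F1 scores. The short check below uses made-up labels (an assumption for illustration) and sklearn's `f1_score`.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical true and predicted labels for a three-class problem.
y_true_demo = [5, 5, 6, 6, 7, 7]
y_pred_demo = [5, 6, 6, 6, 7, 5]

# average=None returns one F1 score per class, in sorted label order.
per_class_f1 = f1_score(y_true_demo, y_pred_demo, average=None)
macro_f1 = f1_score(y_true_demo, y_pred_demo, average="macro")

print(per_class_f1)
print(macro_f1, np.mean(per_class_f1))  # the two values agree
```

Because every class contributes equally to the mean, a rare class that is predicted poorly lowers the score as much as a common one would.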

Store the wine features in a variable named "X", and the response in a variable named "y". Store the cross-validation scores in a "scores" variable. We will check these variables in our asserts.

Additional instructions:
* Use the following from sklearn: RepeatedStratifiedKFold, MultinomialNB and cross_val_score.
* We will load the wine dataset for you. You will need to divide the wine dataset into X and y.
* X should be a matrix of all the features in the dataset, __excluding the wine quality feature__ and __excluding wines with qualities that are less than 5 or greater than 7__.
* y should be a vector of wine qualities ranging from 5 to 7 (inclusive).
* Initialize the RepeatedStratifiedKFold cross-validation object to have 5 splits and 1000 repetitions. __Set random_state to 42__.
* For a visual aid, you can plot the distribution of scores across the cross-validation runs.
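
The same evaluation pattern with a multiclass target can be sketched as below. Everything here is an assumption for illustration: the data is synthetic, the `_demo` names are hypothetical, and only 2 repetitions are used. Note that MultinomialNB expects non-negative feature values, which the synthetic counts satisfy.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)

# Three synthetic classes labelled 5, 6 and 7, with 40 samples each.
y_demo = np.repeat([5, 6, 7], 40)

# Non-negative count features; two columns shift upward with the class label.
class_shift = (y_demo[:, None] - 5) * np.array([2, 0, 1, 0])
X_demo = rng.poisson(lam=3.0, size=(120, 4)) + class_shift

# 5 splits repeated 2 times yields 10 scores.
cv_demo = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)
scores_demo = cross_val_score(
    MultinomialNB(), X_demo, y_demo, scoring="f1_macro", cv=cv_demo
)

print(scores_demo.shape)
print(scores_demo.mean())
```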

In [None]:
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wines = pd.read_csv(url, sep=";")

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert y.shape == (1518,)
assert X.shape == (1518, 11)
assert 0.4 < pd.Series(scores).mean() < 0.5

## Task 3 (2 points) - KNN

Implement your own version of a K-nearest neighbors (KNN) classifier as a class named MyKNN. The only hyperparameter passed to \_\_init\_\_(self, K) is K, the number of neighbors.

Implement a .fit(X, y) method. There is no real training; the model simply memorizes all the data samples and their class labels.

Implement a .predict(X_new) method; this is where all the calculations happen. You need to compare each sample in X_new to the memorized data and choose the K nearest neighbors using Euclidean distance. Then predict the most frequent label among those K neighbors and return an array of predicted labels. The problem may be binary or multiclass; this does not matter for the implementation, but you should predict exactly one label for each sample.
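
The neighbor selection and voting step for a single sample can be sketched as follows. The distances and labels are made-up values, and using `np.argsort` with `collections.Counter` is just one possible approach, not a required one.

```python
from collections import Counter

import numpy as np

# Hypothetical distances from one new sample to five memorized samples,
# together with the labels of those memorized samples.
distances_demo = np.array([0.9, 0.1, 0.4, 0.7, 0.2])
labels_demo = np.array([0, 1, 1, 0, 2])
K = 3

# argsort orders indices from smallest to largest distance.
nearest_idx = np.argsort(distances_demo)[:K]
nearest_labels = labels_demo[nearest_idx]

# most_common(1) returns a list holding the (label, count) pair with the top count.
predicted_label = Counter(nearest_labels.tolist()).most_common(1)[0][0]

print(nearest_idx.tolist())     # [1, 4, 2]
print(nearest_labels.tolist())  # [1, 2, 1]
print(predicted_label)          # 1
```

When two labels tie, `Counter.most_common` returns the one encountered first, which here means the label of the closer neighbor. Other tie-breaking rules are equally valid.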

Do not worry about memory or speed optimization, we will not use MyKNN on large datasets.

Hints:
* Your final class should have at least three methods: \_\_init\_\_, fit, predict
* You will need to implement a function that can calculate the Euclidean distance between two points
* The formula for Euclidean distance between two points a and b (as NumPy arrays) in Python is: np.sum((a - b)\*\*2)\*\*(1/2)
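
As a quick sanity check of the distance formula, the snippet below evaluates it on two made-up points and compares the result with `np.linalg.norm`. The points `a` and `b` are assumptions chosen so the answer is easy to verify by hand.

```python
import numpy as np

# Two hypothetical points; the differences are (-3, -4, 0).
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Square the differences, sum them, then take the square root.
dist = np.sum((a - b) ** 2) ** (1 / 2)

print(dist)                   # 5.0
print(np.linalg.norm(a - b))  # 5.0
```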

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

##### Iris dataset ##############################################
# 150 samples, 4 features
# 3 classes, 50 samples per class
from sklearn.datasets import load_iris

X_iris, y_iris = load_iris(return_X_y=True)
X_iris_train, X_iris_test, y_iris_train, y_iris_test = train_test_split(X_iris, y_iris, test_size=0.3, stratify=y_iris)
cls_iris = MyKNN(5)
cls_iris.fit(X_iris_train, y_iris_train)

ref_cls_iris = KNeighborsClassifier(5)
ref_cls_iris.fit(X_iris_train, y_iris_train)
assert accuracy_score(y_iris_test, ref_cls_iris.predict(X_iris_test)) >= 0.9
assert accuracy_score(y_iris_test, cls_iris.predict(X_iris_test)) >= 0.9