Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE".

# Week 8 - Classification

This week's assignment will give you an opportunity to practice your classification skills with the scikit-learn package. You will have to apply several different classification methods and learn about their parameters.

In sklearn, the different options (called hyperparameters) for each classifier are specified in the constructor when you initialize an object. The method __.fit(X, y)__ then runs the training procedure, in which the classifier learns from a set of samples and their features in __X__ and the corresponding class labels in __y__. Once the classifier model is trained, you can use it to predict the labels of previously unseen data with __.predict(X_new)__.
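As a minimal sketch of this workflow, the following uses a tiny made-up dataset. The toy data and the choice of `KNeighborsClassifier` are illustrative assumptions, not part of the assignment.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data (illustrative): 4 samples, 1 feature, two well-separated classes
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
y_toy = np.array([0, 0, 1, 1])

# Hyperparameters are passed to the constructor
clf = KNeighborsClassifier(n_neighbors=1)

# Training: the classifier learns from the samples and their labels
clf.fit(X_toy, y_toy)

# Prediction on previously unseen samples
X_new = np.array([[0.4], [10.6]])
y_pred = clf.predict(X_new)
print(y_pred)
```

The same three steps (construct, fit, predict) apply to every classifier used in this assignment.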

You will need to use various classification metrics to __evaluate the performance of the trained model__. See http://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics for more details.

A balanced F1 score, which combines precision and recall, is a good default performance measure for binary classifiers. However, some problems are multiclass: the handwritten digits example has 10 classes, one for each digit. To extend a binary metric to a multiclass or multilabel problem, the data is treated as a collection of binary problems, one for each class, and the per-class results are then averaged. There are a number of ways to average binary metric calculations across the set of classes, each of which may be useful in some scenario. Where available, you should select among these using the `average` parameter. Read more about __"micro" and "macro" averaging of classification metrics__ in the scikit-learn documentation.
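To see how the averaging choice changes the reported number, here is a small sketch on made-up, imbalanced labels (the label lists are illustrative assumptions). It computes the per-class F1 scores and then the macro and micro averages with `f1_score`.

```python
from sklearn.metrics import f1_score

# Imbalanced toy labels (illustrative): class 0 is frequent, class 1 is rare
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# One F1 score per class, no averaging
f1_per_class = f1_score(y_true, y_pred, average=None)

# Macro: unweighted mean of the per-class scores
f1_macro = f1_score(y_true, y_pred, average="macro")

# Micro: computed from the total counts over all classes
f1_micro = f1_score(y_true, y_pred, average="micro")

print(f1_per_class, f1_macro, f1_micro)
```

On this toy example the single mistake on the rare class pulls the macro average below the micro average.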

"macro" simply calculates the mean of the binary metrics, giving equal weight to each class. In problems where infrequent classes are nonetheless important, macro-averaging may be a means of highlighting their performance. On the other hand, the assumption that all classes are equally important is often untrue, such that macro-averaging will over-emphasize the typically low performance on an infrequent class.

Thus, if we make sure that each digit is represented by an approximately equal number of samples, the performance of a classifier is best evaluated using F1 with macro averaging across the classes. The __StratifiedKFold__ class or a stratified train_test_split() can be used to make sure each class keeps approximately the same proportion of samples in every random split as in the full dataset.
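The following sketch shows what a stratified split does to the class counts. The toy labels (80 samples of one class, 20 of the other) and the fixed `random_state` are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels (illustrative): 80 samples of class 0, 20 of class 1
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy asks for the class proportions of y_toy in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0
)

# Number of samples per class in each subset
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print(train_counts, test_counts)
```

Both subsets keep the 80/20 ratio of the full label vector.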

The final, and probably one of the most important, aspects of machine learning covered by this assignment is to __avoid overfitting the models while they train__. Overfitting leads to a classifier that shows excellent performance on the dataset used for training, while its performance on a previously unseen dataset could be average if not poor. __Cross-validation__ is an important technique to master. Fortunately, even complex cross-validation scenarios are already implemented and available for use in sklearn.
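A quick way to see overfitting is to compare the score on the training data with a cross-validated score. The sketch below assumes an unpruned `DecisionTreeClassifier` on the digits dataset as the example model; this choice is only for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_d, y_d = load_digits(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0)

# Accuracy on the same data that was used for training
train_acc = tree.fit(X_d, y_d).score(X_d, y_d)

# Mean accuracy over 5 folds, each evaluated on data unseen during training
cv_acc = cross_val_score(tree, X_d, y_d, cv=5).mean()

print(train_acc, cv_acc)
```

The gap between the two numbers is the part of the training performance that does not carry over to unseen data.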

Often we need to choose between several classification models based on their performance, for instance between classifiers initialized with different hyperparameters. In that case we need to make sure the hyperparameters themselves are not overfitted: the choice of hyperparameters can be optimal for the training dataset but not optimal in general. The __GridSearchCV__ class simplifies the hyperparameter optimization procedure, but you need to make sure to put aside a subset of your data for the final validation. __train_test_split()__ is a convenient way to split the dataset in sklearn.
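A minimal sketch of this procedure is shown below, assuming a `DecisionTreeClassifier` on the iris dataset and a small grid over `max_depth`. The model, the grid, and the variable names are illustrative choices, not the ones required by the tasks.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_i, y_i = load_iris(return_X_y=True)

# Put aside 30% of the data for the final validation
X_search, X_val, y_search, y_val = train_test_split(
    X_i, y_i, test_size=0.3, stratify=y_i, random_state=0
)

# Hyperparameter values to try (illustrative)
param_grid = {"max_depth": [1, 2, 3, 4, 5]}

# The grid search runs 5-fold cross-validation on the search portion only
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="f1_macro",
)
grid.fit(X_search, y_search)

# Score of the refitted best model on data the search never saw
val_score = grid.score(X_val, y_val)
print(grid.best_params_, val_score)
```

Because the validation portion is never used during the search, `val_score` is an estimate that is not biased by the hyperparameter selection.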

There is a lot to explore in the scikit-learn documentation. Take this opportunity, and ask your questions on the Slack channel #week8 if something is unclear.

Due to the heavy use of randomization techniques, the exact performance metrics may not be reproduced (sometimes even with the random seed set), so the tests in the assert statements validate your solutions only roughly. Don't be surprised to get slightly different results when you restart your calculations.


## Assignments

There are 5 graded assignments worth a total of 6 points: the first task is worth 2 points, the other four are worth 1 point each. Additionally, there is an ungraded assignment that we highly encourage you to solve.


In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

%matplotlib inline

## Task 1 (2 points) - KNN

Implement your own version of Nearest Neighbor classifier as a class MyKNN. The only hyperparameter to \_\_init\_\_(self, K) is K - the number of neighbors.

Implement  .fit(X, y) method. There is no real training as the model simply memorizes all the data samples and their class labels.

Implement the .predict(X_new) method; this is where all the calculations happen. You need to compare each sample in X_new to the memorized data and choose the K nearest neighbors using Euclidean distance. Then predict the label as the most frequent label among the K neighbors and return an array of predicted labels. The problem could be binary or multiclass; it does not matter for the implementation. However, you predict only one label for each sample, so it is not a multilabel classification problem.

Do not worry about memory or speed optimization, we will not use MyKNN on large datasets.
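As a hint, the sketch below works through the neighbor search and the vote for a single query point using NumPy. The toy data and variable names are illustrative assumptions, and this is not a complete MyKNN class; you still need to wrap the logic in a class and handle every row of X_new.

```python
import numpy as np

# Toy memorized data (illustrative): 5 samples, 2 features
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
y_train = np.array([0, 0, 0, 1, 1])

# A single new sample and the number of neighbors
x_query = np.array([0.2, 0.1])
K = 3

# Euclidean distance from the query to every memorized sample
distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))

# Indices of the K smallest distances
nearest = np.argsort(distances)[:K]

# Most frequent label among the K neighbors
labels, counts = np.unique(y_train[nearest], return_counts=True)
predicted = labels[np.argmax(counts)]
print(predicted)
```

Note that `np.argmax` picks the smallest label when two labels are tied; other tie-breaking rules are possible.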

In [None]:
# YOUR CODE HERE
pass

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

###### Synthetic example  ###############################
# 6 samples, 2 features, binary class labels

X = np.array([
    [1.0, 2.5],
    [3.5, 4.5],
    [6.5, 6.5],
    [4.5, 1.5],
    [5.5, 3.0],
    [7.5, 3.5]])
y = np.array([0,0,0,1,1,1])

#### K = 1 ###############

cls = MyKNN(1)
cls.fit(X, y)

ref_cls = KNeighborsClassifier(1)
ref_cls.fit(X, y)

#### Prediction for the same dataset with K=1 should give accuracy 100%
assert accuracy_score(y, ref_cls.predict(X)) == 1.
assert accuracy_score(y, cls.predict(X)) == 1.


#### K = 3 ###############
cls = MyKNN(3)
cls.fit(X, y)

ref_cls = KNeighborsClassifier(3)
ref_cls.fit(X, y)

#### Prediction for the same dataset as used in training may not give you excellent results for K > 1
assert accuracy_score(y, ref_cls.predict(X)) == 5./6.
assert accuracy_score(y, cls.predict(X)) == 5./6.

#### In this test we create a perturbation in input data
# that should affect prediction accuracy
assert accuracy_score(y, ref_cls.predict(X * 0.5)) == 3./6.
assert accuracy_score(y, cls.predict(X * 0.5)) == 3./6.

##### Iris dataset ##############################################
# 150 samples, 4 features
# 3 classes, 50 samples per class
from sklearn.datasets import load_iris

X_iris, y_iris = load_iris(return_X_y=True)
X_iris_train, X_iris_test, y_iris_train, y_iris_test = train_test_split(X_iris, y_iris, test_size=0.3, stratify=y_iris)
cls_iris = MyKNN(5)
cls_iris.fit(X_iris_train, y_iris_train)

ref_cls_iris = KNeighborsClassifier(5)
ref_cls_iris.fit(X_iris_train, y_iris_train)
assert accuracy_score(y_iris_test, ref_cls_iris.predict(X_iris_test)) >= 0.9
assert accuracy_score(y_iris_test, cls_iris.predict(X_iris_test)) >= 0.9

## Task 2  (1 point) - Sampling

Analyze the stability of features in subsamples produced by train_test_split().
Take the wines dataset (loaded below) and plot the distribution of the "quality" feature. You will see that for wine quality __below 5 and above 7__ we only have a few samples; exclude these samples from the wines dataset. Define __y__ as a vector of wine quality class labels. Define __X__ as the matrix of samples with the rest of the features (excluding quality).

Run 1000 iterations of train_test_split() subsampling from __X__ and __y__ stratified by __y__ with 30% test size.

Create __avg_alcohol_values_train__ list and append mean values of "alcohol" feature in the training subset on each iteration.

Create __avg_alcohol_values_test__ list and append mean values of "alcohol" feature in the testing subset on each iteration.

Analyze the mean and standard deviation of values in each of the lists. Plot histograms for visual aid.
Think about why standard deviations of mean values from test and train samples differ and why the means are equal.
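One way to explore this question is a small simulation on synthetic data. The sketch below assumes a normally distributed feature with 1000 values (an assumption, not the wine data) and uses an unstratified split for brevity; it records the mean of each subset over repeated 70/30 splits so you can compare their spreads.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic feature values (illustrative assumption, not the wine data)
rng = np.random.RandomState(0)
values = rng.normal(loc=10.0, scale=1.0, size=1000)

train_means, test_means = [], []
for seed in range(200):
    # 70% of the values go to train, 30% to test
    v_train, v_test = train_test_split(values, test_size=0.3, random_state=seed)
    train_means.append(v_train.mean())
    test_means.append(v_test.mean())

# Spread of the subset means across the repeated splits
std_train = np.std(train_means)
std_test = np.std(test_means)
print(np.mean(train_means), np.mean(test_means), std_train, std_test)
```

Try changing `test_size` and observe how the two standard deviations respond.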

The automatic tests will be checking X, y, and the avg_alcohol... lists.

In [None]:
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wines = pd.read_csv(url, sep=";")
wines["quality"].hist()
wines.head()

In [None]:
# YOUR CODE HERE
pass

In [None]:
assert y.shape == (1518,)
assert X.shape == (1518, 11)

assert 0.034 <= round(pd.Series(avg_alcohol_values_test).std(), 3) <= 0.037
assert round(pd.Series(avg_alcohol_values_test).mean(), 1) == round(pd.Series(avg_alcohol_values_train).mean(), 1)

assert 0.014 <= round(pd.Series(avg_alcohol_values_train).std(), 3) <= 0.016
assert round(pd.Series(avg_alcohol_values_train).mean(), 1) == round(wines['alcohol'].mean(), 1)

## Task 3 (1 point) - Multinomial NB

Use RepeatedStratifiedKFold cross-validation to evaluate the performance of Multinomial Naive Bayes on the wine dataset, predicting wine quality. As in the previous task, exclude wines whose quality is not 5, 6 or 7, and prepare the sample-feature matrix X and the vector of class labels y accordingly.

Use the following from sklearn: the RepeatedStratifiedKFold and MultinomialNB classes and the cross_val_score function.

Initialize a repeated 5-fold cross-validation object (5 splits) with 1000 repetitions.
Initialize a Multinomial NB classifier.
Calculate cross-validation scores for the NB classifier with the k-fold cross-validation object that we created (use the cv parameter to pass a custom cross-validation object to the cross_val_score function). For scoring, use the F1 score with macro averaging, 'f1_macro'. If you want to speed things up, set n_jobs=-1, which uses multiple CPU cores on your computer for the calculation.

Store the cross-validation scores in the "scores" variable. The assert tests will check it, along with X and y.

For visual aid plot a distribution of scores across cross validation runs. Now you have a robust estimate of Multinomial NB classifier performance in predicting wine quality.
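The general pattern of passing a custom cross-validation object to `cross_val_score` can be sketched as follows. The example assumes the iris dataset, a `GaussianNB` classifier, and only 3 repetitions to keep it fast; these are illustrative choices and differ from what the task asks for.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X_i, y_i = load_iris(return_X_y=True)

# 5 stratified folds, repeated 3 times with different shuffles
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)

# One score per fold per repetition: 5 * 3 = 15 values
cv_scores = cross_val_score(GaussianNB(), X_i, y_i, cv=rskf, scoring="f1_macro")

print(len(cv_scores), cv_scores.mean(), cv_scores.std())
```

The returned array has one entry per fold and repetition, so its mean and spread describe how stable the estimate is.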

In [None]:
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wines = pd.read_csv(url, sep=";")

In [None]:
# YOUR CODE HERE
pass

In [None]:
assert y.shape == (1518,)
assert X.shape == (1518, 11)
assert  .4 < pd.Series(scores).mean() < .5

## Task 4 (1 point) - Logistic Regression

Use RepeatedStratifiedKFold cross-validation to evaluate the performance of Logistic Regression on the wine dataset, predicting wine quality.

This time let us turn it into a binary classification problem and prepare a vector with class labels y that will only have 1 (for high-quality wine where quality >= 7) and 0 for the rest of wine samples.
Initialize sample-feature matrix X (don't forget to exclude quality column).
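Turning a numeric column into binary labels with pandas can be sketched on a tiny made-up table. The `toy` DataFrame and its values are illustrative assumptions; only the column names mirror the wine data.

```python
import pandas as pd

# Tiny made-up table (illustrative), with the same column names as the wine data
toy = pd.DataFrame({
    "alcohol": [9.4, 10.5, 12.8, 11.0],
    "quality": [5, 6, 8, 7],
})

# 1 where quality >= 7, otherwise 0
y_bin = (toy["quality"] >= 7).astype(int).values

# All remaining columns form the sample-feature matrix
X_feat = toy.drop(columns="quality").values

print(y_bin, X_feat.shape)
```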

Use the following from sklearn: the RepeatedStratifiedKFold and LogisticRegression classes and the cross_val_score function.

As in the previous example: 

Initialize a repeated 5-fold cross-validation object (5 splits) with 1000 repetitions.
Initialize a Logistic Regression classifier.
Calculate cross-validation scores for the Logistic Regression classifier with the k-fold cross-validation object that we created (use the cv parameter to pass a custom cross-validation object to the cross_val_score function).

However, for scoring let us use accuracy this time; since it is a binary classification problem, no averaging across classes is required. If you want to speed things up, set n_jobs=-1, which uses multiple CPU cores on your computer for the calculation.

Store the cross-validation scores in the "scores" variable. The assert tests will check it, along with X and y.

For visual aid plot a distribution of scores across cross validation runs.

In [None]:
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wines = pd.read_csv(url, sep=";")

In [None]:
# YOUR CODE HERE
pass

In [None]:
assert y.shape == (1599,)
assert X.shape == (1599, 11)
assert  pd.Series(scores).mean() > 0.86

## Task 5 (1 point) - Random Forest

Let us use the digits dataset and apply Random Forest to predict the digit given a low resolution handwritten image.

Use the following classes from sklearn:  RandomForestClassifier

You will need to split the data into training and testing subsets, with 30% for testing. Use the training subset to fit RandomForestClassifier, and use the testing subset to compute the F1 score with macro averaging: first, predict the labels for the test set using .predict(), then compare the predicted labels to the actual labels and calculate the score. Store this score in the "score" variable; we will test it.

Now let us compare the results to a robust estimate of performance using 5-fold cross validation (5 splits) with 100 repetitions (more repetitions will take time). Calculate 'f1_macro'  scores with cross validation for the whole dataset of X and y (not the training or testing subsets). Store the results of cross validation scores in "scores" variable (it will be tested). Let us now take the mean of "scores" and compare it to "score" that we calculated previously. What do you see?
For visual aid plot distributions of scores across cross validation runs.

In [None]:
import sklearn.datasets

digits = sklearn.datasets.load_digits()

# X - how digits are handwritten
X = digits['data']

# y - what these digits actually are
y = digits['target']

In [None]:
# YOUR CODE HERE
pass

In [None]:
assert X.shape == (1797, 64)
assert y.shape == (1797, )
assert score > 0.91
assert np.percentile(scores, 5) < score < np.percentile(scores, 95)

## Task 6 (not graded) - GridSearchCV

Apply a KNN classifier to the digits dataset. You need to find the optimal parameters for KNN.
The number of neighbors is not the only parameter that can be changed. As you can see from the first exercise, the distance metric is an important parameter. Also, when choosing the class label by voting between the K nearest neighbors, different weights can be assigned to samples. A popular choice is to assign higher weights to neighbors that are closer.

Split the dataset into train_test and validation and only use the train_test portion for grid search CV.
Use Grid Search cross validation to find the optimal combination of parameters listed below:

n_neighbors:  3, 5, 7, 9, 11, 13, 15
weights: uniform, distance
metric: euclidean, manhattan, chebyshev

In order to define what is an optimal solution you need to choose a score. The choice is up to you, but remember that digits dataset involves multiple classes. There is no assert test.

You will need grid.best_params_ and grid.best_estimator_

Use the best performing classifier grid.best_estimator_ to predict values for the validation portion of your dataset. Print a confusion matrix (use confusion_matrix function) and analyze which digits are easily confused with the other ones by the classifier.
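Reading a confusion matrix is easier with a small example. The sketch below uses made-up true and predicted labels for three classes (illustrative assumptions); in scikit-learn, rows correspond to true labels and columns to predicted labels.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for three classes (illustrative)
y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 2, 2, 2, 1, 0, 1]

# cm[i, j] counts samples whose true label is i and predicted label is j
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

Off-diagonal entries show which pairs of classes are confused: here one sample of class 1 was predicted as class 2, and one sample of class 2 as class 1.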

If you have questions about the solution, mark them clearly and we will answer in the okpy grading system.

In [None]:
import sklearn.datasets

digits = sklearn.datasets.load_digits()

# X - how digits are handwritten
X = digits['data']

# y - what these digits actually are
y = digits['target']

In [None]:
# YOUR CODE HERE
pass