Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE".

# Week 5 Assignment - Cross-Validation.

For this assignment, you will use a generated dataset to practice cross-validation and parameter optimization.

In the first component, you will run linear regression using cross-validation.

In the second component, you will run linear regression using Lasso and cross-validation.

In the third component, you will run linear regression with nested cross-validation.

All the exercises are designed so that the solutions will need only one or a few lines of code.

Do not hesitate to contact the instructors and TAs via the #week5 channel on Slack if you get stuck. Join the channel first by clicking on Channels.

## Part A. Run linear regression with cross-validation.

In this component you will run linear regression with 5-fold cross-validation.

We have provided you with features X and response y. We've also provided a premade cross-validation iterator.
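To make the idea of a cross-validation iterator concrete, here is a minimal sketch of what a `KFold` object yields when it splits data. The names `demo_kfold` and `demo_X` are hypothetical, and the 10-sample array is made up for illustration; it is not the assignment's data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 10 samples, 2 features (not the assignment's X).
demo_X = np.arange(20).reshape(10, 2)

# Same kind of iterator as the one provided in the assignment.
demo_kfold = KFold(n_splits=5, shuffle=True, random_state=10)

fold_sizes = []
seen_test = []
for fold, (train_idx, test_idx) in enumerate(demo_kfold.split(demo_X)):
    # Each fold holds out a different subset of row indices for testing.
    fold_sizes.append((len(train_idx), len(test_idx)))
    seen_test.extend(test_idx.tolist())
    print(f"fold {fold}: {len(train_idx)} train rows, test rows {test_idx}")
```

Across the five folds, every row is used for testing exactly once and for training four times.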

Build a linear model and use scikit-learn's `cross_val_score` function to assess how well your model generalizes. Save the array returned by `cross_val_score` to a variable named `cv_score`.

When you run cross_val_score, make sure to:
* Use the cross-validation iterator that we provide (cv_iterator)
* Set the scoring function to "neg_mean_squared_error"
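The two settings above can be sketched on a separate toy problem. This is only an illustration under assumed data: `X_demo`, `y_demo`, `demo_cv`, and `demo_scores` are hypothetical names, and the dataset is smaller than the one used in the assignment.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical toy dataset (assumption: not the assignment's data).
X_demo, y_demo = make_regression(
    n_samples=60, n_features=5, noise=1.0, random_state=0
)

# A 3-fold iterator, passed explicitly through the cv argument.
demo_cv = KFold(n_splits=3, shuffle=True, random_state=0)

# One score per fold; the scorer returns the NEGATIVE mean squared error,
# so values closer to zero indicate a better fit.
demo_scores = cross_val_score(
    LinearRegression(),
    X_demo,
    y_demo,
    cv=demo_cv,
    scoring="neg_mean_squared_error",
)

print(demo_scores.shape)    # one entry per fold
print(-demo_scores.mean())  # flip the sign to read it as an ordinary MSE
```

The sign is flipped because scikit-learn scorers follow a "higher is better" convention, so an error metric is reported as its negative.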


In [None]:
import pandas as pd
import numpy as np

from sklearn.datasets import make_regression
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# Generate data
X, y = make_regression(n_samples=100, n_features=100, n_informative=10, random_state=10)

# Iterator setup
cv_iterator = KFold(n_splits=5, shuffle=True, random_state=10)

# YOUR CODE HERE
pass

In [None]:
assert type(cv_score) == np.ndarray
assert cv_score.shape == (5,)
assert np.isclose(cv_score.mean(), -6327, atol=200)

## Part B. Run linear regression with Lasso and cross-validation.

This part is the same as Part A, except that you use Lasso regression with the hyperparameter alpha set to 0.5.
You can read more about Lasso regularization here: https://scikit-learn.org/stable/modules/linear_model.html#lasso
Note the role of hyperparameter alpha.

Also, this Datacamp tutorial may be helpful: https://www.datacamp.com/community/tutorials/tutorial-ridge-lasso-elastic-net

Again, save the array from cross_val_score to a variable named cv_score.
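As a rough illustration of the role of alpha, the sketch below fits Lasso on a toy dataset with a few penalty strengths and counts the non-zero coefficients. The data, the variable names, and the alpha values (including the deliberately extreme `1e4`) are assumptions chosen for demonstration, not values from the assignment.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Hypothetical toy dataset (assumption: not the assignment's data).
X_demo, y_demo = make_regression(
    n_samples=60, n_features=5, noise=1.0, random_state=0
)

# Illustrative penalty strengths; 1e4 is deliberately extreme.
demo_alphas = [0.01, 0.5, 1e4]

nonzero_by_alpha = {}
for alpha in demo_alphas:
    model = Lasso(alpha=alpha).fit(X_demo, y_demo)
    # The L1 penalty can set coefficients exactly to zero.
    nonzero_by_alpha[alpha] = int(np.count_nonzero(model.coef_))
    print(f"alpha={alpha}: {nonzero_by_alpha[alpha]} non-zero coefficients")
```

A very large alpha shrinks every coefficient to zero, while a small alpha leaves the fit close to ordinary least squares. The useful range lies in between, which is why alpha is treated as a hyperparameter to tune.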

In [None]:
from sklearn.linear_model import Lasso

cv_iterator = KFold(n_splits=5, shuffle=True, random_state=10)

# YOUR CODE HERE
pass

In [None]:
assert type(cv_score) == np.ndarray
assert np.isclose(cv_score.mean(), -4, atol=1)

## Part C. Run nested cross-validation.

Run nested cross-validation while optimizing the hyperparameter alpha. We have provided the grid of alpha values and the CV iterators. Save the final array from cross_val_score into a variable named cv_score.

You can read more about nested cross-validation in this tutorial: https://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html

Explore the documentation for GridSearchCV (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), in particular its arguments estimator, param_grid, and cv.

You may also need the documentation for cross_val_score (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html), which also takes estimator and cv arguments, along with the feature matrix X and the response y.

For consistency with the earlier parts, use "neg_mean_squared_error" scoring in this exercise as well.
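The structure of nested cross-validation can be sketched on a toy problem as follows. This is an illustration under stated assumptions, not a reference solution: the dataset, the grid `demo_grid`, the iterators, and the helper `make_search` are all hypothetical. The explicit loop is included only to show what the outer level does with the inner search.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Hypothetical toy dataset, grid, and iterators (assumptions for illustration).
X_demo, y_demo = make_regression(
    n_samples=60, n_features=5, noise=1.0, random_state=0
)
demo_inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
demo_outer_cv = KFold(n_splits=3, shuffle=True, random_state=1)
demo_grid = {"alpha": [0.01, 0.1, 1.0]}


def make_search():
    """Build the inner search: picks alpha using only the data it is fit on."""
    return GridSearchCV(
        estimator=Lasso(),
        param_grid=demo_grid,
        cv=demo_inner_cv,
        scoring="neg_mean_squared_error",
    )


# Compact form: the whole search is treated as one estimator and scored
# on each outer fold.
nested_scores = cross_val_score(
    make_search(), X_demo, y_demo, cv=demo_outer_cv,
    scoring="neg_mean_squared_error",
)

# Explicit form of the same procedure, written out fold by fold.
manual_scores = []
for train_idx, test_idx in demo_outer_cv.split(X_demo):
    search = make_search()
    search.fit(X_demo[train_idx], y_demo[train_idx])  # inner CV tunes alpha
    predictions = search.predict(X_demo[test_idx])    # uses the refit best model
    manual_scores.append(-mean_squared_error(y_demo[test_idx], predictions))
manual_scores = np.array(manual_scores)

print(nested_scores)
print(manual_scores)
```

Because alpha is chosen without ever seeing the outer test fold, the outer scores estimate how well the whole tuning procedure generalizes, rather than how well one particular alpha does.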

In [None]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score, KFold, GridSearchCV
from sklearn.metrics import mean_squared_error

# CV iterators:
# we will use "inner" cross-validation to find optimal hyperparameters
inner_cv_iterator = KFold(n_splits=5, shuffle=True, random_state=10)
# the "outer" cross-validation iterator will give us an estimate of generalization performance
outer_cv_iterator = KFold(n_splits=5, shuffle=True, random_state=10)


# (hyper)parameter grid for Lasso()
p_grid = {
    "alpha": [0.1, 0.5, 1, 1.5]
}

# YOUR CODE HERE
pass

In [None]:
assert type(cv_score) == np.ndarray
assert cv_score.shape == (5,)
assert np.isclose(-0.167, cv_score.mean(), atol=0.05)