"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"sb.scatterplot(x=X12_test.flatten(), y=y_test)\n",
"sb.scatterplot(x=X12_test.flatten(), y=m10_test_predicted)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Let us summarize the models' errors"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"M1 train 39.0\n",
"M1 test 38.0\n",
"M2 train 29.0\n",
"M2 test 33.0\n",
"M10 train 26.0\n",
"M10 test 29.0\n"
]
}
],
"source": [
"print(\"M1 train\", round(mean_squared_error(y_train, m1_train_predicted)))\n",
"print(\"M1 test\", round(mean_squared_error(y_test, m1_test_predicted)))\n",
" \n",
"print(\"M2 train\", round(mean_squared_error(y_train, m2_train_predicted)))\n",
"print(\"M2 test\", round(mean_squared_error(y_test, m2_test_predicted)))\n",
"\n",
"print(\"M10 train\", round(mean_squared_error(y_train, m10_train_predicted)))\n",
"print(\"M10 test\", round(mean_squared_error(y_test, m10_test_predicted)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will talk about model selection and feature selection in more detail in one of the next classes."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Is there only one way to split the dataset? Cross-validation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cross-validation reuses the dataset by creating multiple train/holdout subset pairs.\n",
"\n",
"The major assumption is that the whole dataset is a representative sample. By repeatedly taking random subsamples of it for training and evaluating on the held-out remainder, we can estimate the performance of the model on previously unseen data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### LeaveOneOut\n",
"\n",
"LeaveOneOut (or LOO) is a simple cross-validation scheme. Each learning set is created by taking all the samples except one, the test set being the sample left out. Thus, for n samples, we have n different training sets and n different test sets. This cross-validation procedure does not waste much data, as only one sample is removed from the training set."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### KFold\n",
"\n",
"KFold divides all the samples into k groups of samples, called folds, of equal sizes (if possible). If k = n, this is equivalent to the LeaveOneOut strategy. The prediction function is learned using k−1 folds, and the fold left out is used for testing.\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### ShuffleSplit\n",
"\n",
"The ShuffleSplit iterator generates a user-defined number of independent train/test dataset splits. Samples are first shuffled and then split into a pair of train and test sets.\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### TimeSeriesSplit\n",
"\n",
"TimeSeriesSplit is a variation of k-fold which returns the first k folds as the train set and the (k+1)th fold as the test set. Note that, unlike standard cross-validation methods, successive training sets are supersets of those that come before them. Also, it adds all surplus data to the first training partition, which is always used to train the model.\n",
"\n",
""
]
},
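{
"cell_type": "markdown",
"metadata": {},
"source": [
"The four splitters described above share the same `split(X)` interface. The following is a minimal sketch on a toy array of six samples that prints the train/test indices each one yields. The names `X_toy`, `collect_splits`, `splitters` and `splits` are illustrative assumptions and not part of this notebook's data.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.model_selection import KFold, LeaveOneOut, ShuffleSplit, TimeSeriesSplit\n",
"\n",
"# Toy data: six samples with one feature (illustrative only).\n",
"X_toy = np.arange(6).reshape(-1, 1)\n",
"\n",
"def collect_splits(cv, X):\n",
"    # Return the (train_indices, test_indices) pairs produced by a splitter.\n",
"    return [(train.tolist(), test.tolist()) for train, test in cv.split(X)]\n",
"\n",
"splitters = {\n",
"    'LeaveOneOut': LeaveOneOut(),\n",
"    'KFold': KFold(n_splits=3),\n",
"    'ShuffleSplit': ShuffleSplit(n_splits=3, test_size=2, random_state=0),\n",
"    'TimeSeriesSplit': TimeSeriesSplit(n_splits=3),\n",
"}\n",
"splits = {name: collect_splits(cv, X_toy) for name, cv in splitters.items()}\n",
"\n",
"for name, pairs in splits.items():\n",
"    print(name)\n",
"    for train, test in pairs:\n",
"        print('  train', train, 'test', test)\n",
"```\n",
"\n",
"On this toy input, `KFold(n_splits=6)` should produce the same splits as `LeaveOneOut()`, and every `TimeSeriesSplit` training set should end before its test set begins."
]
},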
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.716098217736928"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"KNeighborsRegressor().fit(X, y).score(X, y)"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.7079649368669326"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAD8CAYAAABn919SAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAADupJREFUeJzt3W+MHPV9x/HPJyFRI841Tp2sLOOyqURoLdxCvWoj5UH2SltRLEFIqgirRaCQXFQVkgduJTd9UFSE5EoJPGme0IJAkcKJRmnjYFSKKBuUikQ9hz8GLCCll9aOA4GAxaH0j6NvH9xQX8ydZ25nZ/f2O++XdGJmdvY336/X+2E8O/s7R4QAANPvHZMuAAAwGgQ6ACRBoANAEgQ6ACRBoANAEgQ6ACRBoANAEgQ6ACRBoANAEueM82Bbt26Nbrdbe5w333xT5557bv2Cpkxb+5ba23tb+5ba2/tqfR8+fPiViHhf2XPHGujdblcLCwu1xxkMBur3+/ULmjJt7Vtqb+9t7Vtqb++r9W37+1WeyyUXAEiCQAeAJAh0AEiCQAeAJAh0AEiCQAeAJAh0AEiCQAeAJAh0AEhirN8UBbBxdPcfmtixFw/smdixM+MMHQCSINABIAkCHQCSINABIAkCHQCSINABIAkCHQCSINABIAkCHQCSINABIAkCHQCSINABIAkCHQCSINABIAkCHQCSINABIAkCHQCSKA102ztsP2L7WdvP2P5csf1m28dtP1H8XNF8uQCAtVT5FXSnJO2LiO/a3iTpsO2Hisduj4gvNFceAKCq0kCPiBOSThTLb9g+Kml704UBANZnXdfQbXclXSrpO8WmG20/Zfsu21tGXBsAYB0cEdV2tGckfVPSrRHxNdsdSa9ICkm3SNoWEZ9c5XlzkuYkqdPp7J6fn69d9NLSkmZmZmqPM23a2rfU3t6b7PvI8ZONjFvFru2bS/fhNT9tdnb2cET0yp5bKdBtv0vS/ZIejIjbVnm8K+n+iLj4bOP0er1YWFgoPV6ZwWCgfr9fe5xp09a+pfb23mTf3f2HGhm3isUDe0r34TU/zXalQK9yl4sl3Snp6Mowt71txW5XS3q6asEAgNGrcpfLhyVdK+mI7SeKbZ+XtNf2JVq+5LIo6TONVAgAqKTKXS7fkuRVHnpg9OUAAIbFN0UBIAkCHQCSINABIAkCHQCSINABIAkCHQCSINABIAkCHQCSINABIAkCHQCSINABIAkCHQCSINABIAkCHQCSINABIAkCHQCSINABIAkCHQCSINABIAkCHQCSINABIAkCHQCSINABIAkCHQCSINABIAkCHQCSINABIAkCHQCSKA102ztsP2L7WdvP2P5csf29th+y/ULx3y3NlwsAWEuVM/RTkvZFxE5JH5L0x7Z3Stov6eGIuFDSw8U6AGBCSgM9Ik5ExHeL5TckHZW0XdJVku4pdrtH0kebKhIAUG5d19BtdyVdKuk7kjoRcaJ46IeSOiOtDACwLo6IajvaM5K+KenWiPia7dcj4rwVj78WEW+7jm57TtKcJHU6nd3z8/O1i15aWtLMzEztcaZNW/uW2tt7k30fOX6ykXGr2LV9c+k+vOanzc7OHo6IXtlzKwW67XdJul/SgxFxW7HtOUn9iDhhe5ukQURcdLZxer1eLCwslB6vzGAwUL/frz3OtGlr31J7e2+y7+7+Q42MW8XigT2l+/Can2a7UqBXucvFku6UdPStMC8clHRdsXydpK9XLRgAMHrnVNjnw5KulXTE9hPFts9LOiDpPts3SPq+pE80UyIAoIrSQI+Ib0nyGg9fNtpyAADD4puiAJAEgQ4ASRDoAJAEgQ4ASRDoAJAEgQ4ASRDoAJAEgQ4ASRDoAJAEgQ4ASRDoAJAEgQ4ASRDoAJAEgQ4ASRDoAJAEgQ4ASRDoAJAEgQ4ASRDoAJAEgQ4ASRDoAJAEgQ4ASRDoAJAEgQ4ASR
DoAJAEgQ4ASRDoAJAEgQ4ASZQGuu27bL9s++kV2262fdz2E8XPFc2WCQAoU+UM/W5Jl6+y/faIuKT4eWC0ZQEA1qs00CPiUUk/HkMtAIAa6lxDv9H2U8UlmS0jqwgAMBRHRPlOdlfS/RFxcbHekfSKpJB0i6RtEfHJNZ47J2lOkjqdzu75+fnaRS8tLWlmZqb2ONOmrX1L7e29yb6PHD/ZyLhV7Nq+uXQfXvPTZmdnD0dEr+y5QwV61cfO1Ov1YmFhofR4ZQaDgfr9fu1xpk1b+5ba23uTfXf3H2pk3CoWD+wp3YfX/DTblQJ9qEsutretWL1a0tNr7QsAGI9zynawfa+kvqStto9J+gtJfduXaPmSy6KkzzRYIwCggtJAj4i9q2y+s4FaAAA18E1RAEiCQAeAJEovuQBo1tnuNtm365Sun+DdKJgunKEDQBIEOgAkQaADQBIEOgAkQaADQBIEOgAkQaADQBLchw5osjMPtlGVP+8m7sGvMsvjNOMMHQCSINABIAkCHQCSINABIAkCHQCSINABIAluW8SGstbtbEwjC5TjDB0AkiDQASAJAh0AkiDQASAJAh0AkiDQASAJAh0AkiDQASAJAh0AkiDQASCJ0kC3fZftl20/vWLbe20/ZPuF4r9bmi0TAFCmyhn63ZIuP2PbfkkPR8SFkh4u1gEAE1Qa6BHxqKQfn7H5Kkn3FMv3SProiOsCAKzTsNfQOxFxolj+oaTOiOoBAAzJEVG+k92VdH9EXFysvx4R5614/LWIWPU6uu05SXOS1Ol0ds/Pz9cuemlpSTMzM7XHmTZt6PvI8ZOrbu+8R3rpJ2MuZgNoa99SM73v2r55tAM2YLX3+ezs7OGI6JU9d9j50F+yvS0iTtjeJunltXaMiDsk3SFJvV4v+v3+kIc8bTAYaBTjTJs29L3WnOf7dp3SF4+0b/r+tvYtNdP74h/0RzpeE+q8z4e95HJQ0nXF8nWSvj7kOACAEaly2+K9kh6TdJHtY7ZvkHRA0u/YfkHSbxfrAIAJKv33TETsXeOhy0ZcCwCgBr4pCgBJEOgAkASBDgBJEOgAkASBDgBJEOgAkASBDgBJEOgAkASBDgBJEOgAkEQ7p3ED0ErdNWbzHIfFA3saPwZn6ACQBIEOAEkQ6ACQBIEOAEkQ6ACQBIEOAEkQ6ACQBIEOAEkQ6ACQBIEOAEkQ6ACQBIEOAEkQ6ACQBIEOAEkQ6ACQBIEOAEkQ6ACQRK3fWGR7UdIbkn4q6VRE9EZRFABg/UbxK+hmI+KVEYwDAKiBSy4AkETdQA9J/2T7sO25URQEABiOI2L4J9vbI+K47fdLekjSTRHx6Bn7zEmak6ROp7N7fn6+Tr2SpKWlJc3MzNQeZ9qMq+8jx082foz16rxHeuknk65i/Nrat5Sv913bN1fab7X3+ezs7OEqn1HWCvSfGci+WdJSRHxhrX16vV4sLCzUPtZgMFC/3689zrQZV9/d/YcaP8Z67dt1Sl88MoqPfKZLW/uW8vW+eGBPpf1We5/brhToQ19ysX2u7U1vLUv6XUlPDzseAKCeOv/760j6e9tvjfOViPjHkVQFAFi3oQM9Il6U9GsjrAUAUAO3LQJAEgQ6ACSR5yPkhM6822TfrlO6fgPegQJgY+AMHQCSINABIAkCHQCSINABIAkCHQCSINABIAkCHQCSINABIAkCHQCSINABIAkCHQCSINABIAkCHQCSINABIImpmT535VSy455GtuovdwWASeIMHQCSINABIAkCHQCSINABIAkCHQCSINABIImpuW1xkrpjvEUSAIbFGToAJEGgA0ASBDoAJFEr0G1fbvs529+zvX9URQEA1m/oQLf9TklfkvR7knZK2mt756gKAwCsT50z9N+Q9L2IeDEi/kfSvKSrRlMWAGC96gT6dkn/uWL9WLENADABjojhnmj/vqTLI+JTxfq1kn4zIm48Y785SXPF6kWSnhu+3P+3VdIrIxhn2rS1b6m9vbe1b6
m9va/W9wUR8b6yJ9b5YtFxSTtWrJ9fbPsZEXGHpDtqHOdtbC9ERG+UY06DtvYttbf3tvYttbf3On3XueTyr5IutP0B2++WdI2kgzXGAwDUMPQZekScsn2jpAclvVPSXRHxzMgqAwCsS625XCLiAUkPjKiW9RjpJZwp0ta+pfb23ta+pfb2PnTfQ38oCgDYWPjqPwAksaEDverUArY/bjtsp/hEvKxv29fb/pHtJ4qfT02iziZUec1tf8L2s7afsf2VcdfYhAqv+e0rXu/nbb8+iTqbUKH3X7T9iO3HbT9l+4pJ1DlqFfq+wPbDRc8D2+eXDhoRG/JHyx+0/pukX5L0bklPStq5yn6bJD0q6duSepOuexx9S7pe0l9PutYJ9X6hpMclbSnW3z/pusfR9xn736TlmxAmXvuYXvM7JP1RsbxT0uKk6x5T338n6bpi+bckfbls3I18hl51aoFbJP2VpP8aZ3ENavOUClV6/7SkL0XEa5IUES+PucYmrPc13yvp3rFU1rwqvYekny+WN0v6wRjra0qVvndK+udi+ZFVHn+bjRzopVML2P51STsiItOvFKo6pcLHi3+KfdX2jlUen0ZVev+gpA/a/hfb37Z9+diqa07laTRsXyDpAzr9Rp92VXq/WdIf2j6m5bvqbhpPaY2q0veTkj5WLF8taZPtXzjboBs50M/K9jsk3SZp36RrmYBvSOpGxK9KekjSPROuZ5zO0fJll76Wz1T/xvZ5E61ovK6R9NWI+OmkCxmjvZLujojzJV0h6cvF+z+7P5H0EduPS/qIlr+Jf9bXfSP/oZRNLbBJ0sWSBrYXJX1I0sEEH4yWTqkQEa9GxH8Xq38rafeYamtalekkjkk6GBH/GxH/Lul5LQf8NKs0jUbhGuW53CJV6/0GSfdJUkQ8JunntDzfyTSr8j7/QUR8LCIulfTnxbazfhi+kQP9rFMLRMTJiNgaEd2I6Gr5Q9ErI2JhMuWOTOmUCra3rVi9UtLRMdbXpCrTSfyDls/OZXurli/BvDjOIhtQaRoN278saYukx8ZcX5Oq9P4fki6TJNu/ouVA/9FYqxy9Ku/zrSv+JfJnku4qG3TDBnpEnJL01tQCRyXdFxHP2P5L21dOtrrmVOz7s8Ute09K+qyW73qZehV7f1DSq7af1fIHRX8aEa9OpuLRWMff9WskzUdx20MGFXvfJ+nTxd/3eyVdP+1/BhX77kt6zvbzkjqSbi0bl2+KAkASG/YMHQCwPgQ6ACRBoANAEgQ6ACRBoANAEgQ6ACRBoANAEgQ6ACTxf2wUJgD1YBP9AAAAAElFTkSuQmCC\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"from sklearn.model_selection import ShuffleSplit\n",
"from sklearn.model_selection import cross_val_score\n",
"\n",
"reg = LinearRegression()\n",
"cv = ShuffleSplit(n_splits=100, test_size=0.1, random_state=0)\n",
"\n",
"# no `scoring` is passed, so cross_val_score uses the estimator's default score (R^2 for regressors)\n",
"# sklearn scorers follow a higher-is-better convention, which is why error metrics\n",
"# are exposed as e.g. neg_mean_squared_error (essentially, score = - cost_function)\n",
"s = cross_val_score(reg, X, y, cv=cv)\n",
"pd.Series(s).hist()\n",
"s.mean() # R^2"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"['accuracy',\n",
" 'adjusted_mutual_info_score',\n",
" 'adjusted_rand_score',\n",
" 'average_precision',\n",
" 'balanced_accuracy',\n",
" 'brier_score_loss',\n",
" 'completeness_score',\n",
" 'explained_variance',\n",
" 'f1',\n",
" 'f1_macro',\n",
" 'f1_micro',\n",
" 'f1_samples',\n",
" 'f1_weighted',\n",
" 'fowlkes_mallows_score',\n",
" 'homogeneity_score',\n",
" 'jaccard',\n",
" 'jaccard_macro',\n",
" 'jaccard_micro',\n",
" 'jaccard_samples',\n",
" 'jaccard_weighted',\n",
" 'max_error',\n",
" 'mutual_info_score',\n",
" 'neg_log_loss',\n",
" 'neg_mean_absolute_error',\n",
" 'neg_mean_squared_error',\n",
" 'neg_mean_squared_log_error',\n",
" 'neg_median_absolute_error',\n",
" 'normalized_mutual_info_score',\n",
" 'precision',\n",
" 'precision_macro',\n",
" 'precision_micro',\n",
" 'precision_samples',\n",
" 'precision_weighted',\n",
" 'r2',\n",
" 'recall',\n",
" 'recall_macro',\n",
" 'recall_micro',\n",
" 'recall_samples',\n",
" 'recall_weighted',\n",
" 'roc_auc',\n",
" 'v_measure_score']"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
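],
"source": [
"import sklearn.metrics\n",
"# Note: SCORERS was removed in newer scikit-learn releases; there, use sklearn.metrics.get_scorer_names() instead\n",
"sorted(sklearn.metrics.SCORERS.keys())"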
]
},
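{
"cell_type": "markdown",
"metadata": {},
"source": [
"The sign convention behind the `neg_*` scorers listed above can be checked directly. The following is a minimal sketch on synthetic data: `X_demo`, `y_demo` and `cv_demo` are assumptions standing in for this notebook's `X`, `y` and CV iterator. It compares the scores returned for `neg_mean_squared_error` with the mean squared error computed by hand on the same folds.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.datasets import make_regression\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.metrics import mean_squared_error\n",
"from sklearn.model_selection import KFold, cross_val_score\n",
"\n",
"# Synthetic data standing in for the notebook's X and y (an assumption).\n",
"X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)\n",
"cv_demo = KFold(n_splits=5, shuffle=True, random_state=0)\n",
"\n",
"# Scorer-based result: negated MSE, so that higher is better.\n",
"neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv_demo,\n",
"                          scoring='neg_mean_squared_error')\n",
"\n",
"# Manual per-fold MSE on the same splits, for comparison.\n",
"manual_mse = []\n",
"for train_idx, test_idx in cv_demo.split(X_demo):\n",
"    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])\n",
"    fold_pred = model.predict(X_demo[test_idx])\n",
"    manual_mse.append(mean_squared_error(y_demo[test_idx], fold_pred))\n",
"\n",
"print(-neg_mse)\n",
"print(np.array(manual_mse))\n",
"```\n",
"\n",
"Negating the returned scores recovers the usual (non-negative) MSE, which is why later cells wrap `cross_val_score` in a minus sign or `abs`."
]
},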
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Bias - Variance Tradeoff\n",
"\n",
"The **bias** is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).\n",
"\n",
"\n",
"The **variance** is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Let's plot some learning curves"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"#From http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html\n",
"\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.naive_bayes import GaussianNB\n",
"from sklearn.svm import SVC\n",
"from sklearn.datasets import load_digits\n",
"from sklearn.model_selection import learning_curve\n",
"from sklearn.model_selection import ShuffleSplit\n",
"\n",
"def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,\n",
" n_jobs=None, train_sizes=np.linspace(.3, 1.0, 10)):\n",
" plt.figure()\n",
" plt.title(title)\n",
" if ylim is not None:\n",
" plt.ylim(*ylim)\n",
" plt.xlabel(\"Training examples\")\n",
" plt.ylabel(\"Score\")\n",
" train_sizes, train_scores, test_scores = learning_curve(\n",
" estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, scoring=\"neg_mean_squared_error\")\n",
" train_scores_mean = np.mean(train_scores, axis=1)\n",
" train_scores_std = np.std(train_scores, axis=1)\n",
" test_scores_mean = np.mean(test_scores, axis=1)\n",
" test_scores_std = np.std(test_scores, axis=1)\n",
" plt.grid()\n",
"\n",
" plt.fill_between(train_sizes, train_scores_mean - train_scores_std,\n",
" train_scores_mean + train_scores_std, alpha=0.1,\n",
" color=\"r\")\n",
" plt.fill_between(train_sizes, test_scores_mean - test_scores_std,\n",
" test_scores_mean + test_scores_std, alpha=0.1, color=\"g\")\n",
" plt.plot(train_sizes, train_scores_mean, 'o-', color=\"r\",\n",
" label=\"Training score\")\n",
" plt.plot(train_sizes, test_scores_mean, 'o-', color=\"g\",\n",
" label=\"Cross-validation score\")\n",
"\n",
" plt.legend(loc=\"best\")\n",
" return plt"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import KFold"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html\n",
"title = \"Learning Curves\"\n",
"\n",
"# Create the CV iterator\n",
"cv_iterator = KFold(n_splits=5, shuffle=True, random_state=10)\n",
"model = LinearRegression()\n",
"# model = KNeighborsRegressor(n_neighbors=4)\n",
"\n",
"plot_learning_curve(model, title, X, y, cv=cv_iterator, n_jobs=4)\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Lasso Regression\n",
"\n",
"“When you have two competing theories that make exactly the same predictions, the simpler one is the better.” (a common paraphrase of Occam's razor, attributed to William of Ockham)\n",
"\n",
"For a regression model, LASSO (least absolute shrinkage and selection operator), i.e. linear regression with L1 regularization, can be used to penalize a large number of parameters.\n",
"\n",
"* L1 regularization (the last term of the equation) favors a sparse model, in which many features have coefficients equal to zero or close to zero:\n",
"\n",
"$$ Loss = ||y - Xw||^2_2 + \\alpha * ||w||_1 $$\n",
"\n",
"The L1 norm $||w||_1$ is simply the sum of the absolute values of the coefficients, and $\\alpha$ regulates the strength of the regularization. A zero coefficient for a feature essentially means that the feature is eliminated.\n",
"\n"
]
},
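{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see the feature-elimination effect described above, here is a minimal sketch on synthetic data (an assumption, not this notebook's dataset) in which only 3 of 10 features influence the target. The names `X_demo`, `y_demo`, `ols` and `lasso` are illustrative. Note that scikit-learn's `Lasso` scales the squared-error term by `1 / (2 * n_samples)`, so its `alpha` is not numerically identical to the $\\alpha$ in the formula above.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.datasets import make_regression\n",
"from sklearn.linear_model import Lasso, LinearRegression\n",
"\n",
"# Synthetic data (an assumption, not the notebook's dataset):\n",
"# 10 features, of which only 3 actually influence the target.\n",
"X_demo, y_demo = make_regression(n_samples=200, n_features=10, n_informative=3,\n",
"                                 noise=1.0, random_state=0)\n",
"\n",
"ols = LinearRegression().fit(X_demo, y_demo)\n",
"lasso = Lasso(alpha=1.0).fit(X_demo, y_demo)\n",
"\n",
"# Count coefficients that are exactly zero in each model.\n",
"n_zero_ols = int(np.sum(ols.coef_ == 0))\n",
"n_zero_lasso = int(np.sum(lasso.coef_ == 0))\n",
"\n",
"print('OLS coefficients:  ', np.round(ols.coef_, 2))\n",
"print('Lasso coefficients:', np.round(lasso.coef_, 2))\n",
"print('exact zeros (OLS, Lasso):', n_zero_ols, n_zero_lasso)\n",
"```\n",
"\n",
"Ordinary least squares typically assigns a small non-zero weight to every uninformative feature, while the L1 penalty sets most of those weights to exactly zero."
]
},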
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"from sklearn.linear_model import Lasso, LinearRegression\n",
"from sklearn.model_selection import cross_val_score, KFold\n",
"from sklearn.metrics import mean_squared_error"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"26.183440497117296"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"llr = Lasso(alpha=0.5)\n",
"llr.fit(X, y)\n",
"preds = llr.predict(X)\n",
"\n",
"# Create the CV iterator\n",
"cv_iterator = KFold(n_splits=5, shuffle=True, random_state=10)\n",
"\n",
"# Note: default in sklearn: higher return values are better than lower return values\n",
"np.mean(-cross_val_score(llr, X, y, cv=cv_iterator, scoring=\"neg_mean_squared_error\"))"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"26.183440497117296"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cross_val_score(llr, X, y, cv=5, scoring=\"neg_mean_squared_error\")\n",
"abs(np.mean(cross_val_score(llr, X, y, cv=cv_iterator, scoring=\"neg_mean_squared_error\")))"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html\n",
"title = \"Learning Curves\"\n",
"\n",
"# Create the CV iterator\n",
"cv_iterator = KFold(n_splits=5, shuffle=True, random_state=10)\n",
"llr = Lasso(alpha=0.5)\n",
"\n",
"plot_learning_curve(llr, title, X, y, cv=cv_iterator, n_jobs=4)\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Ridge Regression\n",
"\n",
"Ridge regression addresses some of the problems of Ordinary Least Squares by imposing a penalty on the size of coefficients.\n",
"\n",
"The ridge coefficients minimize a penalized residual sum of squares:\n",
"\n",
"$$ Loss = ||y - Xw||^2_2 + \\alpha * ||w||^2_2$$\n",
"\n",
"Here, $\\alpha \\geq 0$ is a complexity parameter that controls the amount of shrinkage: the larger the value of $\\alpha$, the greater the amount of shrinkage and thus the more robust the coefficients become to collinearity."
]
},
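{
"cell_type": "markdown",
"metadata": {},
"source": [
"Without an intercept term, the ridge loss above has the closed-form minimizer $w = (X^T X + \\alpha I)^{-1} X^T y$. The following is a minimal sketch, on synthetic data that is an assumption and not this notebook's dataset, that compares this formula with scikit-learn's `Ridge` and shows the coefficient norm shrinking as $\\alpha$ grows. The helper `ridge_closed_form` and the names `X_demo`, `y_demo` are illustrative.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.datasets import make_regression\n",
"from sklearn.linear_model import Ridge\n",
"\n",
"# Synthetic data (an assumption, not the notebook's dataset).\n",
"X_demo, y_demo = make_regression(n_samples=50, n_features=4, noise=5.0, random_state=0)\n",
"\n",
"def ridge_closed_form(X, y, alpha):\n",
"    # Solve (X^T X + alpha * I) w = X^T y, the minimizer of the ridge loss without intercept.\n",
"    n_features = X.shape[1]\n",
"    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)\n",
"\n",
"alpha = 10.0\n",
"w_manual = ridge_closed_form(X_demo, y_demo, alpha)\n",
"w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X_demo, y_demo).coef_\n",
"\n",
"# Coefficient norm for increasing regularization strength.\n",
"norms = [np.linalg.norm(ridge_closed_form(X_demo, y_demo, a)) for a in (0.1, 10.0, 1000.0)]\n",
"\n",
"print('closed form:', w_manual)\n",
"print('sklearn:    ', w_sklearn)\n",
"print('norms:      ', norms)\n",
"```\n",
"\n",
"`fit_intercept=False` is used only so that the comparison with the formula is exact; with an intercept, the same formula applies to centered data."
]
},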
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"