{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Week 3\n", "## Data retrieval, preprocessing, and normalization for ML\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Basic Outline\n", " \n", "* Where do data come from? Data retrieval.\n", "* Ideal datasets and data types\n", "* Common wrangling needs and implementations" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Where did you get your data?\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Generated in-house (stored as CSVs, TSVs, SQL, proprietary formats, etc.)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Collaborators" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Public sources" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Scripting data retrieval improves reproducibility" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# you may need to:\n", "# !pip install requests" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "('brca_protein_expression.tsv.gz', )" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Downloading a data file from a remote repository\n", "# urllib.request must be imported explicitly; a bare 'import urllib' does not load the submodule\n", "import urllib.request\n", "\n", "URL = \"https://dcc.icgc.org/api/v1/download?fn=/release_18/Projects/BRCA-US/protein_expression.BRCA-US.tsv.gz\"\n", "FILENAME = \"brca_protein_expression.tsv.gz\"\n", "\n", "urllib.request.urlretrieve(URL, FILENAME)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Scraping 
tools such as Mechanize and BeautifulSoup allow extraction of data from websites\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TTCTTGACACTGATTGATCTGCCAAAAGGGGAAGAATGAGTCCAGCTAGAATCCAGGACTAACCAGCGGGTGAGCTTCAAGGAACAAAGGGCTTCCGCTGG\n" ] } ], "source": [ "import requests\n", "# Retrieving data from a remote web service in JSON format that gets converted to a Python structure:\n", "def get_genome_sequence_ensembl(chromosome, start, end):\n", "    \"\"\" API described here http://rest.ensembl.org/documentation/info/sequence_region\"\"\"\n", "    url = 'https://rest.ensembl.org/sequence/region/human/{0}:{1}..{2}:1?content-type=application/json'.format(\n", "        chromosome, start, end)\n", "    r = requests.get(url, headers={\"Content-Type\": \"application/json\"}, timeout=10)\n", "    # Returns None when the request fails\n", "    if r.ok:\n", "        return r.json()['seq']\n", "\n", "print(get_genome_sequence_ensembl(7, 200000, 200100))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "## Pandas covers most data retrieval needs" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " icgc_donor_id project_code icgc_specimen_id icgc_sample_id \\\n", "0 DO4143 BRCA-US SP8807 SA11426 \n", "\n", " submitted_sample_id analysis_id antibody_id gene_name \\\n", "0 TCGA-A1-A0SK-01A-21-A13A-20 10694 PAI-1 SERPINE1 \n", "\n", " gene_stable_id gene_build_version normalized_expression_level \\\n", "0 NaN NaN 1.769954 \n", "\n", " verification_status verification_platform \\\n", "0 not tested NaN \n", "\n", " platform \\\n", "0 M.D. Anderson Reverse Phase Protein Array Core \n", "\n", " experimental_protocol raw_data_repository \\\n", "0 MDA_RPPA_Core http://tcga-data.nci.nih.gov/tcg... TCGA \n", "\n", " raw_data_accession \n", "0 TCGA-A1-A0SK-01A-21-A13A-20 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "# Let's read with pandas\n", "# Note that we do not even need to unzip the file before opening!\n", "brca_data = pd.read_table(FILENAME, sep=\"\\t\")\n", "brca_data.head(1)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Pandas can even retrieve from an SQL database directly" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# you may need to \n", "# !pip install sqlalchemy\n", "# !pip install pymysql" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " bin chrom chromStart chromEnd name score strand refNCBI refUCSC \\\n", "0 585 chrY 10020 10020 rs745593600 0 + b'-' b'-' \n", "1 585 chrY 10034 10036 rs201278642 0 + b'CC' b'CC' \n", "2 585 chrY 10051 10052 rs186434315 0 + b'T' b'T' \n", "\n", " observed ... locType weight exceptions \\\n", "0 -/A/AAC ... between 1 MixedObserved \n", "1 -/CC ... range 1 MixedObserved \n", "2 A/T ... exact 1 \n", "\n", " submitterCount submitters alleleFreqCount alleles \\\n", "0 1 b'1000GENOMES,' 3 b'-,A,AAC,' \n", "1 2 b'1000GENOMES,SSMP,' 2 b'-,CC,' \n", "2 2 b'1000GENOMES,SSMP,' 2 b'A,T,' \n", "\n", " alleleNs alleleFreqs \\\n", "0 b'4906.000000,10.000000,92.000000,' b'0.979633,0.001997,0.018371,' \n", "1 b'369.000000,4637.000000,' b'0.073711,0.926288,' \n", "2 b'1582.000000,3426.000000,' b'0.315895,0.684105,' \n", "\n", " bitfields \n", "0 \n", "1 maf-5-some-pop \n", "2 maf-5-some-pop,maf-5-all-pops \n", "\n", "[3 rows x 26 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import sqlalchemy as sa\n", "# Connect to UCSC genomic database\n", "engine = sa.create_engine('mysql+pymysql://genome@genome-mysql.cse.ucsc.edu/hg38', poolclass=sa.pool.NullPool)\n", "# select 3 SNPs from Chromosome Y\n", "pd.read_sql(\"SELECT * FROM snp147Common WHERE chrom='chrY' LIMIT 3\", engine)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "## Pandas dataframes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Dataframes are convenient containers for mixed data types" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Pandas is *incredibly useful* for data wrangling" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* sklearn is happy to accept Pandas dataframes as input" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": 
"fragment" } }, "source": [ "* Pandas is built for exploratory analysis, visualization and stat tests / ML " ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYcAAAD8CAYAAACcjGjIAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvIxREBQAAFRpJREFUeJzt3X+M3Hed3/Hn6xzCRaRcwuWyjWyrzglfj4CPAHtJKtRqIdfECSeSk0BKlIIDqXxFCQXVVTGcqlAgUmgb6KEDJN/FJfTomSjAxSLmcm4uW3TS5SeEOI6PZhtcYpImRx1+GFqi5d79Y76upv7M7o5ndz278fMhjWbm/f18P/P+2rN+7ffHjFNVSJLU7xfG3YAkaeUxHCRJDcNBktQwHCRJDcNBktQwHCRJDcNBktQwHCRJDcNBktQ4ZdwNjOqss86qDRs2jLTuT37yE172spctbUMnkP2P12rvH1b/Ntj/6B5++OHvV9WvLDRu1YbDhg0beOihh0Zad3p6mqmpqaVt6ASy//Fa7f3D6t8G+x9dkv8xzDgPK0mSGoaDJKlhOEiSGoaDJKlhOEiSGoaDJKlhOEiSGoaDJKlhOEiSGqv2E9LSQjZsv2tZ5t22aZZr55n74M1vWZbXlU4k9xwkSQ3DQZLUMBwkSQ3DQZLUMBwkSY0FwyHJLyZ5IMm3kuxP8m+6+rlJ7k/yRJIvJjm1q7+0ez7TLd/QN9cHu/q3k1zaV9/c1WaSbF/6zZQkHY9h9hx+Bry5ql4LnA9sTnIR8HHgk1W1EXgeuK4bfx3wfFW9EvhkN44k5wFXAa8GNgOfSbImyRrg08BlwHnA1d1YSdKYLBgO1XOke/qS7lbAm4E7uvptwJXd4yu653TLL06Srr6rqn5WVd8BZoALuttMVT1ZVS8Au7qxkqQxGepDcN1v9w8Dr6T3W/5/B35QVbPdkEPA2u7xWuApgKqaTfJD4Je7+n190/av89Qx9Qvn6GMrsBVgYmKC6enpYdpvHDlyZOR1VwL7H862TbMLDxrBxGnzz70a/m58D43Xauh/qHCoqp8D5yc5A/gK8KpBw7r7zLFsrvqgvZcaUKOqdgA7ACYnJ2vU/4PV/392vE5U//N9inkxtm2a5ZZ9c//oHLxmalledyn5Hhqv1dD/cV2tVFU/AKaBi4Azkhz9CVkHPN09PgSsB+iW/xJwuL9+zDpz1SVJYzLM1Uq/0u0xkOQ04LeAA8C9wNu6YVuAO7vHu7vndMv/oqqqq1/VXc10LrAReAB4ENjYXf10Kr2T1ruXYuMkSaMZ5rDSOcBt3XmHXwBur6qvJnkc2JXkY8A3gVu78bcC/ynJDL09hqsAqmp/ktuBx4FZ4PrucBVJbgDuBtYAO6tq/5JtoSTpuC0YDlX1KPC6AfUn6V1pdGz9/wBvn2Oum4CbBtT3AHuG6FeSdAL4CWlJUsNwkCQ1DAdJUsNwkCQ1DAdJUsNwkCQ1DAdJUsNwkCQ1DAdJUsNwkCQ1DAdJUsNwkCQ1DAdJUsNwkCQ1DAdJUsNwkCQ1DAdJUsNwkCQ1DAdJUsNwkCQ1DAdJUsNwkCQ1DAdJUsNwkCQ1FgyHJOuT3JvkQJL9Sd7X1T+c5HtJHulul/e
t88EkM0m+neTSvvrmrjaTZHtf/dwk9yd5IskXk5y61BsqSRreMHsOs8C2qnoVcBFwfZLzumWfrKrzu9segG7ZVcCrgc3AZ5KsSbIG+DRwGXAecHXfPB/v5toIPA9ct0TbJ0kawYLhUFXPVNU3usc/Bg4Aa+dZ5QpgV1X9rKq+A8wAF3S3map6sqpeAHYBVyQJ8Gbgjm7924ArR90gSdLipaqGH5xsAL4OvAb4F8C1wI+Ah+jtXTyf5A+A+6rqj7t1bgW+1k2xuar+aVd/B3Ah8OFu/Cu7+nrga1X1mgGvvxXYCjAxMfGGXbt2Hd/Wdo4cOcLpp58+0rorgf0PZ9/3frgs806cBs/+77mXb1r7S8vyukvJ99B4jbP/N73pTQ9X1eRC404ZdsIkpwNfAt5fVT9K8lngo0B197cA7wYyYPVi8F5KzTO+LVbtAHYATE5O1tTU1LDt/3+mp6cZdd2VwP6Hc+32u5Zl3m2bZrll39w/OgevmVqW111KvofGazX0P1Q4JHkJvWD4QlV9GaCqnu1b/ofAV7unh4D1fauvA57uHg+qfx84I8kpVTV7zHhJ0hgMc7VSgFuBA1X1ib76OX3Dfgd4rHu8G7gqyUuTnAtsBB4AHgQ2dlcmnUrvpPXu6h3Xuhd4W7f+FuDOxW2WJGkxhtlzeCPwDmBfkke62ofoXW10Pr1DQAeB3wWoqv1Jbgcep3el0/VV9XOAJDcAdwNrgJ1Vtb+b7wPAriQfA75JL4wkSWOyYDhU1V8y+LzAnnnWuQm4aUB9z6D1qupJelczSZJWAD8hLUlqGA6SpIbhIElqGA6SpIbhIElqGA6SpIbhIElqGA6SpIbhIElqGA6SpIbhIElqGA6SpIbhIElqGA6SpIbhIElqGA6SpIbhIElqGA6SpIbhIElqGA6SpIbhIElqGA6SpIbhIElqGA6SpMaC4ZBkfZJ7kxxIsj/J+7r6K5LsTfJEd39mV0+STyWZSfJoktf3zbWlG/9Eki199Tck2det86kkWY6NlSQNZ5g9h1lgW1W9CrgIuD7JecB24J6q2gjc0z0HuAzY2N22Ap+FXpgANwIXAhcANx4NlG7M1r71Ni9+0yRJo1owHKrqmar6Rvf4x8ABYC1wBXBbN+w24Mru8RXA56vnPuCMJOcAlwJ7q+pwVT0P7AU2d8teXlV/VVUFfL5vLknSGJxyPIOTbABeB9wPTFTVM9ALkCRnd8PWAk/1rXaoq81XPzSgPuj1t9Lbw2BiYoLp6enjaf//OXLkyMjrrgT2P5xtm2aXZd6J0+afezX83fgeGq/V0P/Q4ZDkdOBLwPur6kfznBYYtKBGqLfFqh3ADoDJycmamppaoOvBpqenGXXdlcD+h3Pt9ruWZd5tm2a5Zd/cPzoHr5laltddSr6Hxms19D/U1UpJXkIvGL5QVV/uys92h4To7p/r6oeA9X2rrwOeXqC+bkBdkjQmw1ytFOBW4EBVfaJv0W7g6BVHW4A7++rv7K5augj4YXf46W7gkiRndieiLwHu7pb9OMlF3Wu9s28uSdIYDHNY6Y3AO4B9SR7pah8CbgZuT3Id8F3g7d2yPcDlwAzwU+BdAFV1OMlHgQe7cR+pqsPd4/cAnwNOA77W3SRJY7JgOFTVXzL4vADAxQPGF3D9HHPtBHYOqD8EvGahXiRJJ4afkJYkNQwHSVLDcJAkNQwHSVLDcJAkNQwHSVLDcJAkNQwHSVLDcJAkNQwHSVLDcJAkNQwHSVLDcJAkNQwHSVLDcJAkNQwHSVLDcJAkNQwHSVLDcJAkNQwHSVLDcJAkNQwHSVLDcJAkNRYMhyQ7kzyX5LG+2oeTfC/JI93t8r5lH0wyk+TbSS7tq2/uajNJtvfVz01yf5InknwxyalLuYGSpOM3zJ7D54DNA+qfrKrzu9segCTnAVcBr+7W+UySNUnWAJ8GLgPOA67uxgJ8vJtrI/A8cN1iNkiStHgLhkNVfR04POR8VwC7qupnVfUdYAa4oLv
NVNWTVfUCsAu4IkmANwN3dOvfBlx5nNsgSVpiiznncEOSR7vDTmd2tbXAU31jDnW1ueq/DPygqmaPqUuSxuiUEdf7LPBRoLr7W4B3AxkwthgcQjXP+IGSbAW2AkxMTDA9PX1cTR915MiRkdddCex/ONs2zS48aAQTp80/92r4u/E9NF6rof+RwqGqnj36OMkfAl/tnh4C1vcNXQc83T0eVP8+cEaSU7q9h/7xg153B7ADYHJysqampkZpn+npaUZddyWw/+Fcu/2uZZl326ZZbtk394/OwWumluV1l5LvofFaDf2PdFgpyTl9T38HOHol027gqiQvTXIusBF4AHgQ2NhdmXQqvZPWu6uqgHuBt3XrbwHuHKUnSdLSWXDPIcmfAFPAWUkOATcCU0nOp3cI6CDwuwBVtT/J7cDjwCxwfVX9vJvnBuBuYA2ws6r2dy/xAWBXko8B3wRuXbKtkySNZMFwqKqrB5Tn/Ae8qm4CbhpQ3wPsGVB/kt7VTJKkFcJPSEuSGoaDJKlhOEiSGoaDJKlhOEiSGoaDJKlhOEiSGoaDJKlhOEiSGoaDJKlhOEiSGoaDJKlhOEiSGoaDJKlhOEiSGoaDJKlhOEiSGoaDJKlhOEiSGoaDJKlhOEiSGoaDJKlhOEiSGoaDJKmxYDgk2ZnkuSSP9dVekWRvkie6+zO7epJ8KslMkkeTvL5vnS3d+CeSbOmrvyHJvm6dTyXJUm+kJOn4DLPn8Dlg8zG17cA9VbURuKd7DnAZsLG7bQU+C70wAW4ELgQuAG48GijdmK196x37WpKkE2zBcKiqrwOHjylfAdzWPb4NuLKv/vnquQ84I8k5wKXA3qo6XFXPA3uBzd2yl1fVX1VVAZ/vm0uSNCajnnOYqKpnALr7s7v6WuCpvnGHutp89UMD6pKkMTpliecbdL6gRqgPnjzZSu8QFBMTE0xPT4/QIhw5cmTkdVcC+x/Otk2zyzLvxGnzz70a/m58D43Xauh/1HB4Nsk5VfVMd2joua5+CFjfN24d8HRXnzqmPt3V1w0YP1BV7QB2AExOTtbU1NRcQ+c1PT3NqOuuBPY/nGu337Us827bNMst++b+0Tl4zdSyvO5S8j00Xquh/1EPK+0Gjl5xtAW4s6/+zu6qpYuAH3aHne4GLklyZnci+hLg7m7Zj5Nc1F2l9M6+uSRJY7LgnkOSP6H3W/9ZSQ7Ru+roZuD2JNcB3wXe3g3fA1wOzAA/Bd4FUFWHk3wUeLAb95GqOnqS+z30rog6Dfhad5MkjdGC4VBVV8+x6OIBYwu4fo55dgI7B9QfAl6zUB+SpBPHT0hLkhqGgySpYThIkhqGgySpYThIkhpL/Qlp6aS3YZk+fDeMgze/ZWyvrRcX9xwkSQ3DQZLUMBwkSQ3DQZLUMBwkSQ3DQZLUMBwkSQ3DQZLUMBwkSQ3DQZLUMBwkSQ3DQZLUMBwkSQ3DQZLUMBwkSQ3DQZLUMBwkSQ3DQZLUWFQ4JDmYZF+SR5I81NVekWRvkie6+zO7epJ8KslMkkeTvL5vni3d+CeSbFncJkmSFmsp9hzeVFXnV9Vk93w7cE9VbQTu6Z4DXAZs7G5bgc9CL0yAG4ELgQuAG48GiiRpPJbjsNIVwG3d49uAK/vqn6+e+4AzkpwDXArsrarDVfU8sBfYvAx9SZKGtNhwKODPkzycZGtXm6iqZwC6+7O7+lrgqb51D3W1ueqSpDE5ZZHrv7Gqnk5yNrA3yV/PMzYDajVPvZ2gF0BbASYmJpienj7OdnuOHDky8rorgf0PZ9um2WWZd+K05Zt7sYb9c/U9NF6rof9FhUNVPd3dP5fkK/TOGTyb5JyqeqY7bPRcN/wQsL5v9XXA01196pj69ByvtwPYATA5OVlTU1ODhi1oenqaUdddCex/ONduv2tZ5t22aZZb9i3296rlcfCaqaHG+R4ar9XQ/8iHlZK8LMnfOfoYuAR4DNgNHL3iaAtwZ/d4N/DO7qqli4Afdoed7gYuSXJ
mdyL6kq4mSRqTxfz6MwF8JcnRef5zVf1ZkgeB25NcB3wXeHs3fg9wOTAD/BR4F0BVHU7yUeDBbtxHqurwIvqSJC3SyOFQVU8Crx1Q/1/AxQPqBVw/x1w7gZ2j9iJJWlp+QlqS1DAcJEkNw0GS1DAcJEkNw0GS1DAcJEkNw0GS1DAcJEkNw0GS1DAcJEkNw0GS1DAcJEkNw0GS1DAcJEkNw0GS1DAcJEkNw0GS1DAcJEkNw0GS1Bj5/5CWhrFh+11NbdumWa4dUJe0crjnIElqGA6SpIbhIElqeM5BehEZdI5nkKU+73Pw5rcs2VxaGVbMnkOSzUm+nWQmyfZx9yNJJ7MVEQ5J1gCfBi4DzgOuTnLeeLuSpJPXiggH4AJgpqqerKoXgF3AFWPuSZJOWivlnMNa4Km+54eAC8fUy4vSsMeipVGc6PfX0XMmnutYPislHDKgVs2gZCuwtXt6JMm3R3y9s4Dvj7juSrCq+//n9j92q30bjvafj4+7k5GN88//7w0zaKWEwyFgfd/zdcDTxw6qqh3AjsW+WJKHqmpysfOMi/2P12rvH1b/Ntj/8lsp5xweBDYmOTfJqcBVwO4x9yRJJ60VsedQVbNJbgDuBtYAO6tq/5jbkqST1ooIB4Cq2gPsOUEvt+hDU2Nm/+O12vuH1b8N9r/MUtWc95UkneRWyjkHSdIKctKGQ5L3dl/XsT/Jvx13P6NK8i+TVJKzxt3L8Ujy75L8dZJHk3wlyRnj7mkYq/lrXpKsT3JvkgPd+/594+5pFEnWJPlmkq+Ou5dRJDkjyR3d+/9Akn8w7p4GOSnDIcmb6H0C+zeq6tXAvx9zSyNJsh74x8B3x93LCPYCr6mq3wD+G/DBMfezoBfB17zMAtuq6lXARcD1q6z/o94HHBh3E4vw+8CfVdWvA69lhW7LSRkOwHuAm6vqZwBV9dyY+xnVJ4F/xYAPDK50VfXnVTXbPb2P3mdbVrpV/TUvVfVMVX2je/xjev8orR1vV8cnyTrgLcAfjbuXUSR5OfCPgFsBquqFqvrBeLsa7GQNh18D/mGS+5P81yS/Oe6GjleStwLfq6pvjbuXJfBu4GvjbmIIg77mZVX943pUkg3A64D7x9vJcfsP9H4h+ttxNzKiXwX+BviP3aGxP0rysnE3NciKuZR1qSX5L8DfHbDo9+ht95n0dq1/E7g9ya/WCrt0a4Ft+BBwyYnt6PjM139V3dmN+T16hzu+cCJ7G9FQX/Oy0iU5HfgS8P6q+tG4+xlWkt8Gnquqh5NMjbufEZ0CvB54b1Xdn+T3ge3Avx5vW60XbThU1W/NtSzJe4Avd2HwQJK/pfddJ39zovobxlzbkGQTcC7wrSTQOyTzjSQXVNX/PIEtzmu+vwOAJFuA3wYuXmnBPIehvuZlJUvyEnrB8IWq+vK4+zlObwTemuRy4BeBlyf546r6J2Pu63gcAg5V1dE9tjvohcOKc7IeVvpT4M0ASX4NOJVV9CVkVbWvqs6uqg1VtYHeG+71KykYFpJkM/AB4K1V9dNx9zOkVf01L+n9JnErcKCqPjHufo5XVX2wqtZ17/mrgL9YZcFA9zP6VJK/35UuBh4fY0tzetHuOSxgJ7AzyWPAC8CWVfKb64vJHwAvBfZ2ez/3VdU/G29L83sRfM3LG4F3APuSPNLVPtR9O4FOnPcCX+h+wXgSeNeY+xnIT0hLkhon62ElSdI8DAdJUsNwkCQ1DAdJUsNwkCQ1DAdJUsNwkCQ1DAdJUuP/AveqEz7u8CuUAAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plotting boilerplate\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "brca_data['normalized_expression_level'].hist()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Pre-processing a dataset: when are ready for ML?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Ideally, data are organized as a table: examples-vs-features" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Data from multiple sources are combined" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Missing data are handled" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Features have been combined and manipulated as needed" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Any data that need to be normalized have been normalized" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Data are of the correct type (e.g. categorical vs continuous, boolean vs int)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Let's have a look at Boston housing prices" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Id LotFrontage CentralAir 1stFlrSF SaleCondition SalePrice\n", "0 1 65.0 Y 856 Normal 208500\n", "1 2 80.0 Y 1262 Normal 181500\n", "2 3 68.0 Y 920 Normal 223500\n", "3 4 60.0 Y 961 Abnorml 140000\n", "4 5 84.0 Y 1145 Normal 250000" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "boston = pd.read_table(\"https://biof509.github.io/spring2019/_downloads/boston_data.csv\", sep=\",\")\n", "boston.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Pre-processing a dataset: when are ready for ML?\n", "* ~~Ideally, data are organized as a table: examples-vs-features~~\n", "* Data from multiple sources are combined\n", "* Missing data are handled\n", "* Features have been combined and manipulated as needed\n", "* Any data that need to be normalized have been normalized\n", "* Data are of correct type (e.g. categorical vs continuous, boolean vs int)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Combining data from multiple sources" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Id 2ndFlrSF\n", "0 2 0\n", "1 1 854\n", "2 3 866\n", "3 4 756\n", "4 5 1053" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "boston_second_floor = pd.read_table(\"https://biof509.github.io/spring2019/_downloads/boston_second_floor.csv\", sep=\",\")\n", "boston_second_floor.head()\n", "#boston.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Combining data from multiple sources" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Id LotFrontage CentralAir 1stFlrSF SaleCondition SalePrice 2ndFlrSF\n", "0 1 65.0 Y 856 Normal 208500 854\n", "1 2 80.0 Y 1262 Normal 181500 0\n", "2 3 68.0 Y 920 Normal 223500 866\n", "3 4 60.0 Y 961 Abnorml 140000 756\n", "4 5 84.0 Y 1145 Normal 250000 1053" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's combine boston and boston second floor\n", "boston = pd.merge(boston, boston_second_floor, on=\"Id\")\n", "boston.head()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Id CentralAir 1stFlrSF SaleCondition SalePrice LotFrontage\n", "0 44 Y 938 Normal 130250 NaN\n", "1 45 Y 1150 Normal 141000 70.0\n", "2 46 Y 1752 Normal 319900 61.0" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's add some additional data\n", "boston3 = pd.read_table(\"https://biof509.github.io/spring2019/_downloads/boston_additional.csv\", sep=\",\")\n", "boston3.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Thus far" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "scrolled": true, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Id LotFrontage CentralAir 1stFlrSF SaleCondition SalePrice 2ndFlrSF\n", "33 34 70.0 Y 1700 Normal 165500 0\n", "34 35 60.0 Y 1561 Normal 277500 0\n", "35 36 108.0 Y 1132 Normal 309000 1320\n", "36 37 112.0 Y 1097 Normal 145000 0\n", "37 38 74.0 Y 1297 Normal 153000 0\n", "38 39 68.0 Y 1057 Abnorml 109000 0\n", "39 40 65.0 N 1152 AdjLand 82000 0\n", "40 41 84.0 Y 1324 Abnorml 160000 0\n", "41 42 115.0 Y 1328 Normal 170000 0\n", "42 43 NaN Y 884 Normal 144000 0" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "boston.tail(10)\n", "#boston.shape\n", "#boston.tail()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Pre-processing a dataset: when are ready for ML?\n", "* ~~Ideally, data are organized as a table: examples-vs-features~~\n", "* ~~Data from multiple sources are combined~~\n", "* Missing data are handled\n", "* Features have been combined and manipulated as needed\n", "* Any data that need to be normalized have been normalized\n", "* Data are of correct type (e.g. 
categorical vs continuous, boolean vs int)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Missing data\n", "There are a number of ways to handle missing data:\n", "\n", "* Drop all records with a value missing (simplest, but can lead to bias)\n", "* Substitute all missing values with an approximated value (usually depends on data and algorithm)\n", "* Add additional feature indicating when a value is missing" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Missing data" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "Id 0\n", "LotFrontage 0\n", "CentralAir 0\n", "1stFlrSF 0\n", "SaleCondition 0\n", "SalePrice 0\n", "2ndFlrSF 0\n", "dtype: int64" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Drop all records with missing data\n", "#boston.isnull().tail()\n", "# boston.isnull().sum()\n", "# boston.isnull().sum().sum()\n", "#boston.tail()\n", "#boston.dropna().tail()\n", "boston.dropna().isnull().sum()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Id LotFrontage CentralAir 1stFlrSF SaleCondition SalePrice 2ndFlrSF\n", "38 39 68 Y 1057 Abnorml 109000 0\n", "39 40 65 N 1152 AdjLand 82000 0\n", "40 41 84 Y 1324 Abnorml 160000 0\n", "41 42 115 Y 1328 Normal 170000 0\n", "42 43 Value2! Y 884 Normal 144000 0" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Substitute missing values\n", "# boston.fillna(\"Value!\").tail()\n", "boston.fillna({\"2ndFlrSF\": \"Value1!\", \"LotFrontage\": \"Value2!\"}).tail()" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# Substitute missing values with mean\n", "# print(boston.mean())\n", "#boston.fillna(boston.mean()).tail()\n", "#boston.fillna(boston.median()).tail()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# Add column indicating missing values\n", "# boston[\"2ndFlrSF\"].isnull()\n", "#boston[\"missing_second_floor\"] = boston[\"2ndFlrSF\"].isnull()\n", "# boston.tail()\n", "# boston = boston.fillna(boston.mean())\n", "# boston.tail()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "# You may need to upgrade Scikit-learn (and restart Jupyter kernel afterwards) to use Imputer\n", "# !pip install scikit-learn --upgrade" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 65. ],\n", " [ 80. ],\n", " [ 68. ],\n", " [ 60. ],\n", " [ 84. ],\n", " [ 85. ],\n", " [ 75. ],\n", " [ 74.05555556],\n", " [ 51. ],\n", " [ 50. ],\n", " [ 70. ],\n", " [ 85. ],\n", " [ 74.05555556],\n", " [ 91. ],\n", " [ 74.05555556],\n", " [ 51. ],\n", " [ 74.05555556],\n", " [ 72. ],\n", " [ 66. ],\n", " [ 70. ],\n", " [101. ],\n", " [ 57. ],\n", " [ 75. ],\n", " [ 44. ],\n", " [ 74.05555556],\n", " [110. ],\n", " [ 60. ],\n", " [ 98. ],\n", " [ 47. 
],\n", " [ 60. ],\n", " [ 50. ],\n", " [ 74.05555556],\n", " [ 85. ],\n", " [ 70. ],\n", " [ 60. ],\n", " [108. ],\n", " [112. ],\n", " [ 74. ],\n", " [ 68. ],\n", " [ 65. ],\n", " [ 84. ],\n", " [115. ],\n", " [ 74.05555556]])" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Imputation is a general technique for \"guessing\" appropriate missing values\n", "# It could be implemented as a complex ML regression algorithm or a simple 'take an average' strategy.\n", "from sklearn.impute import SimpleImputer\n", "\n", "imputer = SimpleImputer(strategy='mean')\n", "imputer.fit_transform(boston[[\"LotFrontage\"]])" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 65.],\n", " [ 80.],\n", " [ 68.],\n", " [ 60.],\n", " [ 84.],\n", " [ 85.],\n", " [ 75.],\n", " [ 60.],\n", " [ 51.],\n", " [ 50.],\n", " [ 70.],\n", " [ 85.],\n", " [ 60.],\n", " [ 91.],\n", " [ 60.],\n", " [ 51.],\n", " [ 60.],\n", " [ 72.],\n", " [ 66.],\n", " [ 70.],\n", " [101.],\n", " [ 57.],\n", " [ 75.],\n", " [ 44.],\n", " [ 60.],\n", " [110.],\n", " [ 60.],\n", " [ 98.],\n", " [ 47.],\n", " [ 60.],\n", " [ 50.],\n", " [ 60.],\n", " [ 85.],\n", " [ 70.],\n", " [ 60.],\n", " [108.],\n", " [112.],\n", " [ 74.],\n", " [ 68.],\n", " [ 65.],\n", " [ 84.],\n", " [115.],\n", " [ 60.]])" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "imputer = SimpleImputer(strategy='most_frequent')\n", "imputer.fit_transform(boston[[\"LotFrontage\"]])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## How to decide how to treat missing data?\n", "* Very data-dependent!\n", "* Decisions need to be justified and documented\n", "* Implement missing data preprocessing in a reproducible way (python script)\n", "* Don't create data from nothing\n", "* Iris example" ] }, { "cell_type": "markdown", "metadata": { 
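"slideshow": { "slide_type": "slide" } }, "source": [ "The options listed under *Missing data* can be combined: fill the gaps, but keep a flag recording which values were filled. The block below is a minimal plain-Python sketch of that idea, assuming NaN marks a missing value. The helper name `impute_with_flag` and the toy numbers are hypothetical; they are not part of the course data or of any library.\n", "\n", "
```python
import math
from statistics import median


def impute_with_flag(values, strategy=median):
    # values: list of floats in which NaN marks a missing entry.
    # strategy: any function mapping the observed values to one fill value.
    observed = [v for v in values if not math.isnan(v)]
    if not observed:
        raise ValueError('cannot impute a column with no observed values')
    fill = strategy(observed)
    # Record the missingness before overwriting it.
    was_missing = [math.isnan(v) for v in values]
    filled = [fill if math.isnan(v) else v for v in values]
    return filled, was_missing


# Toy lot-frontage column (illustrative values, not taken from the dataset)
lot_frontage = [65.0, 80.0, float('nan'), 60.0, float('nan')]
filled, was_missing = impute_with_flag(lot_frontage)
print(filled)
print(was_missing)
```
", "\n", "In pandas the same two steps would roughly correspond to `isnull()` for the flag followed by `fillna(...)`, both of which appear in the cells above. Keeping the logic in one function makes the chosen strategy explicit and easy to document.\n" ] }, { "cell_type": "markdown", "metadata": { 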
"slideshow": { "slide_type": "slide" } }, "source": [ "## Pre-processing a dataset: when are ready for ML?\n", "* ~~Ideally, data are organized as a table: examples-vs-features~~\n", "* ~~Data from multiple sources are combined~~\n", "* ~~Missing data are handled~~\n", "* Features have been combined and manipulated as needed\n", "* Any data that need to be normalized have been normalized\n", "* Data are of correct type (e.g. categorical vs continuous, boolean vs int)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>Id</th><th>LotFrontage</th><th>CentralAir</th><th>1stFlrSF</th><th>SaleCondition</th><th>SalePrice</th><th>2ndFlrSF</th><th>total_sf</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>65.0</td><td>Y</td><td>856</td><td>Normal</td><td>208500</td><td>854</td><td>1710</td></tr>\n",
"    <tr><th>1</th><td>2</td><td>80.0</td><td>Y</td><td>1262</td><td>Normal</td><td>181500</td><td>0</td><td>1262</td></tr>\n",
"    <tr><th>2</th><td>3</td><td>68.0</td><td>Y</td><td>920</td><td>Normal</td><td>223500</td><td>866</td><td>1786</td></tr>\n",
"    <tr><th>3</th><td>4</td><td>60.0</td><td>Y</td><td>961</td><td>Abnorml</td><td>140000</td><td>756</td><td>1717</td></tr>\n",
"    <tr><th>4</th><td>5</td><td>84.0</td><td>Y</td><td>1145</td><td>Normal</td><td>250000</td><td>1053</td><td>2198</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Id LotFrontage CentralAir 1stFlrSF SaleCondition SalePrice 2ndFlrSF \\\n", "0 1 65.0 Y 856 Normal 208500 854 \n", "1 2 80.0 Y 1262 Normal 181500 0 \n", "2 3 68.0 Y 920 Normal 223500 866 \n", "3 4 60.0 Y 961 Abnorml 140000 756 \n", "4 5 84.0 Y 1145 Normal 250000 1053 \n", "\n", " total_sf \n", "0 1710 \n", "1 1262 \n", "2 1786 \n", "3 1717 \n", "4 2198 " ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# boston.head()\n", "boston[\"total_sf\"] = boston[\"1stFlrSF\"] + boston[\"2ndFlrSF\"]\n", "boston.head()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>Id</th><th>LotFrontage</th><th>CentralAir</th><th>1stFlrSF</th><th>SaleCondition</th><th>SalePrice</th><th>2ndFlrSF</th><th>total_sf</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>65.0</td><td>Y</td><td>856</td><td>normal</td><td>208500</td><td>854</td><td>1710</td></tr>\n",
"    <tr><th>1</th><td>2</td><td>80.0</td><td>Y</td><td>1262</td><td>normal</td><td>181500</td><td>0</td><td>1262</td></tr>\n",
"    <tr><th>2</th><td>3</td><td>68.0</td><td>Y</td><td>920</td><td>normal</td><td>223500</td><td>866</td><td>1786</td></tr>\n",
"    <tr><th>3</th><td>4</td><td>60.0</td><td>Y</td><td>961</td><td>abnormal</td><td>140000</td><td>756</td><td>1717</td></tr>\n",
"    <tr><th>4</th><td>5</td><td>84.0</td><td>Y</td><td>1145</td><td>normal</td><td>250000</td><td>1053</td><td>2198</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Id LotFrontage CentralAir 1stFlrSF SaleCondition SalePrice 2ndFlrSF \\\n", "0 1 65.0 Y 856 normal 208500 854 \n", "1 2 80.0 Y 1262 normal 181500 0 \n", "2 3 68.0 Y 920 normal 223500 866 \n", "3 4 60.0 Y 961 abnormal 140000 756 \n", "4 5 84.0 Y 1145 normal 250000 1053 \n", "\n", " total_sf \n", "0 1710 \n", "1 1262 \n", "2 1786 \n", "3 1717 \n", "4 2198 " ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "boston.head()\n", "boston = boston.replace({\"Abnorml\": \"abnormal\", \"Normal\": \"normal\"})\n", "boston.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Pre-processing a dataset: when are ready for ML?\n", "* ~~Ideally, data are organized as a table: examples-vs-features~~\n", "* ~~Data from multiple sources are combined~~\n", "* ~~Missing data are handled~~\n", "* ~~Features have been combined and manipulated as needed~~\n", "* Any data that need to be normalized have been normalized\n", "* Data are of correct type (e.g. categorical vs continuous, boolean vs int)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Normalization\n", "* What is it?\n", "* Why do it? (data sources, feature distributions)\n", "* Types?" 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Normalization\n", "\n", "Many machine learning algorithms expect features to have similar distributions and scales.\n", "\n", "A classic example is gradient descent: if features are on different scales, some weights update faster than others, because the feature values scale the weight updates.\n", "\n", "There are two common approaches to normalization:\n", "\n", "* Z-score standardization\n", "* Min-max scaling\n", "\n", "#### Z-score standardization\n", "\n", "Z-score standardization rescales values so that they have a mean of 0 and a standard deviation of 1. Specifically, we perform the following transformation:\n", "\n", "$$z = \\frac{x - \\mu}{\\sigma}$$\n", "\n", "where $\\mu$ is the mean of the feature and $\\sigma$ is its standard deviation.\n", "\n", "#### Min-max scaling\n", "\n", "An alternative is min-max scaling, which transforms data into the range 0 to 1. Specifically:\n", "\n", "$$x_{norm} = \\frac{x - x_{min}}{x_{max} - x_{min}}$$\n", "\n", "Min-max scaling is less commonly used but can be useful for image data and in some neural networks." 
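, "\n", "\n", "As a small worked illustration of both formulas (the three values are made up for this example and do not come from the dataset), take $x = (2, 4, 6)$. The mean is $\\mu = 4$. Using the population standard deviation, which is what scikit-learn's `StandardScaler` computes, $\\sigma = \\sqrt{8/3} \\approx 1.633$, so:\n", "\n", "$$z = \\left(\\frac{2 - 4}{1.633},\\ \\frac{4 - 4}{1.633},\\ \\frac{6 - 4}{1.633}\\right) \\approx (-1.22,\\ 0,\\ 1.22)$$\n", "\n", "For the same values, $x_{min} = 2$ and $x_{max} = 6$, so min-max scaling gives:\n", "\n", "$$x_{norm} = \\left(\\frac{2 - 2}{6 - 2},\\ \\frac{4 - 2}{6 - 2},\\ \\frac{6 - 2}{6 - 2}\\right) = (0,\\ 0.5,\\ 1)$$\n", "\n", "Note that the sample standard deviation (dividing by $n - 1$, the default of pandas `.std()`) would give slightly different z-scores."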
] }, { "cell_type": "code", "execution_count": 32, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[-1.11717197]\n", " [ 0.30978064]\n", " [-0.89223363]\n", " [-0.7481325 ]\n", " [-0.10143477]\n", " [-1.32805167]\n", " [ 1.82811445]\n", " [-0.23499191]\n", " [-0.53373814]\n", " [-0.34043176]\n", " [-0.47047424]\n", " [ 0.02860771]\n", " [-0.92035092]\n", " [ 1.12518213]\n", " [ 0.27814868]\n", " [-1.12420129]\n", " [-0.59700205]\n", " [ 0.42927913]\n", " [-0.21038928]\n", " [ 0.58040958]\n", " [-0.05574417]\n", " [-0.23147725]\n", " [ 2.18309527]\n", " [-0.400181 ]\n", " [-0.400181 ]\n", " [ 1.49773626]\n", " [-0.96252686]\n", " [ 1.86326106]\n", " [ 1.49773626]\n", " [-2.29809827]\n", " [-1.84470692]\n", " [ 0.19028214]\n", " [ 0.21137011]\n", " [ 1.84920242]\n", " [ 1.36066446]\n", " [-0.14712537]\n", " [-0.27013853]\n", " [ 0.43279379]\n", " [-0.41072499]\n", " [-0.07683214]\n", " [ 0.52768966]\n", " [ 0.5417483 ]\n", " [-1.01876145]]\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/Users/agoncear/anaconda/envs/jupyter/lib/python3.7/site-packages/sklearn/preprocessing/data.py:617: DataConversionWarning: Data with input dtype int64 were all converted to float64 by StandardScaler.\n", " return self.partial_fit(X, y)\n", "/Users/agoncear/anaconda/envs/jupyter/lib/python3.7/site-packages/sklearn/base.py:462: DataConversionWarning: Data with input dtype int64 were all converted to float64 by StandardScaler.\n", " return self.fit(X, **fit_params).transform(X)\n" ] } ], "source": [ "# a = (boston['1stFlrSF'] - boston['1stFlrSF'].mean()) / boston['1stFlrSF'].std()\n", "# boston['1stFlrSF'].hist()\n", "# boston.head()\n", "## boston.total_sf.hist()\n", "from sklearn.preprocessing import scale, StandardScaler, MinMaxScaler\n", "\n", "scaler = StandardScaler()\n", "print(scaler.fit_transform(boston[['1stFlrSF']]))\n", "#scaled_size = pd.Series(scale(boston.total_sf))\n", 
"#scaled_size.hist()\n", "#scaled_size.mean()\n", "#scaled_size.std(ddof=0)\n", "#boston[\"normalized_total_sf\"] = scaled_size" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[0.26352941]\n", " [0.58196078]\n", " [0.31372549]\n", " [0.34588235]\n", " [0.49019608]\n", " [0.21647059]\n", " [0.92078431]\n", " [0.46039216]\n", " [0.39372549]\n", " [0.43686275]\n", " [0.40784314]\n", " [0.51921569]\n", " [0.30745098]\n", " [0.76392157]\n", " [0.57490196]\n", " [0.26196078]\n", " [0.37960784]\n", " [0.60862745]\n", " [0.46588235]\n", " [0.64235294]\n", " [0.50039216]\n", " [0.46117647]\n", " [1. ]\n", " [0.42352941]\n", " [0.42352941]\n", " [0.84705882]\n", " [0.29803922]\n", " [0.92862745]\n", " [0.84705882]\n", " [0. ]\n", " [0.10117647]\n", " [0.55529412]\n", " [0.56 ]\n", " [0.9254902 ]\n", " [0.81647059]\n", " [0.48 ]\n", " [0.45254902]\n", " [0.60941176]\n", " [0.42117647]\n", " [0.49568627]\n", " [0.63058824]\n", " [0.63372549]\n", " [0.2854902 ]]\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/Users/agoncear/anaconda/envs/jupyter/lib/python3.7/site-packages/sklearn/preprocessing/data.py:323: DataConversionWarning: Data with input dtype int64 were all converted to float64 by MinMaxScaler.\n", " return self.partial_fit(X, y)\n" ] } ], "source": [ "scaler = MinMaxScaler()\n", "print(scaler.fit_transform(boston[['1stFlrSF']]))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Other preprocessing / normalization techniques and thoughts\n", "* http://scikit-learn.org/stable/modules/preprocessing.html\n", "* http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-16.html" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Pre-processing a dataset: when are ready for ML?\n", "* ~~Ideally, data are organized as a table: examples-vs-features~~\n", "* ~~Data 
from multiple sources are combined~~\n", "* ~~Missing data are handled~~\n", "* ~~Features have been combined and manipulated as needed~~\n", "* ~~Any data that need to be normalized have been normalized~~\n", "* Data are of correct type (e.g. categorical vs continuous, boolean vs int)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "CategoricalDtype(categories=['AdjLand', 'Partial', 'abnormal', 'normal'], ordered=False)" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#boston.head()\n", "import numpy as np\n", "\n", "# boston[\"1stFlrSF\"].mean(skipna=False)\n", "boston[\"CentralAir_bool\"] = boston[\"CentralAir\"] == \"Y\"\n", "# boston.head()\n", "# boston[\"SaleCondition\"].dtype\n", "#boston[\"SaleCondition\"].head()\n", "boston[\"SaleCondition\"].astype(\"category\").dtype\n", "#boston[\"SaleCondition\"] = boston[\"SaleCondition\"].astype(\"category\")\n", "#boston[\"SaleCondition\"].dtype" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[1],\n", " [1],\n", " [0],\n", " [0]])" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.preprocessing import OneHotEncoder, LabelBinarizer\n", "\n", "lb = LabelBinarizer()\n", "lb.fit_transform(['yes', 'yes', 'no', 'no']) " ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0, 0, 1],\n", " [0, 0, 1],\n", " [0, 1, 0],\n", " [0, 1, 0],\n", " [1, 0, 0]])" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lb.fit_transform(['yes', 'yes', 'no', 'no', 'maybe'])" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "matrix([[0., 0., 0., 1., 0., 1.],\n", " [0., 0., 0., 1., 0., 1.],\n", " [0., 0., 0., 1., 
0., 1.],\n", " [0., 0., 1., 0., 0., 1.],\n", " [0., 0., 0., 1., 0., 1.],\n", " [0., 0., 0., 1., 0., 1.],\n", " [0., 0., 0., 1., 0., 1.],\n", " [0., 0., 0., 1., 0., 1.],\n", " [0., 0., 1., 0., 0., 1.],\n", " [0., 0., 0., 1., 0., 1.],\n", " [0., 0., 0., 1., 0., 1.],\n", " [0., 1., 0., 0., 0., 1.],\n", " [0., 0., 0., 1., 0., 1.],\n", " [0., 1., 0., 0., 0., 1.],\n", " [0., 0., 0., 1., 0., 1.],\n", " [0., 0., 0., 1., 0., 1.],\n", " [0., 0., 0., 1., 0., 1.],\n", " [0., 0., 0., 1., 0., 1.],\n", " [0., 0., 0., 1., 0., 1.],\n", " [0., 0., 1., 0., 0., 1.],\n", " [0., 1., 0., 0., 0., 1.],\n", " [0., 0., 0., 1., 0., 1.],\n", " [0., 0., 0., 1., 0., 1.],\n", " [0., 0., 0., 1., 0., 1.],\n", " [0., 0., 0., 1., 0., 1.],\n", " [0., 0., 0., 1., 0., 1.],\n", " [0., 0., 0., 1., 0., 1.],\n", " [0., 0., 0., 1., 0., 1.],\n", " [0., 0., 0., 1., 0., 1.],\n", " [0., 0., 0., 1., 1., 0.],\n", " [0., 0., 0., 1., 1., 0.],\n", " [0., 0., 0., 1., 0., 1.],\n", " [0., 0., 0., 1., 0., 1.],\n", " [0., 0., 0., 1., 0., 1.],\n", " [0., 0., 0., 1., 0., 1.],\n", " [0., 0., 0., 1., 0., 1.],\n", " [0., 0., 0., 1., 0., 1.],\n", " [0., 0., 0., 1., 0., 1.],\n", " [0., 0., 1., 0., 0., 1.],\n", " [1., 0., 0., 0., 1., 0.],\n", " [0., 0., 1., 0., 0., 1.],\n", " [0., 0., 0., 1., 0., 1.],\n", " [0., 0., 0., 1., 0., 1.]])" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ohe = OneHotEncoder()\n", "sparse_matrix = ohe.fit_transform(boston[['SaleCondition', 'CentralAir_bool']])\n", "sparse_matrix.todense()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Another example of categorical data conversion to boolean features" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>A</th><th>B</th><th>C</th><th>D</th><th>E</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>0</td><td>2</td><td>0.1</td><td>100</td><td>Green</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>NaN</td><td>0.12</td><td>120</td><td>Red</td></tr>\n",
"    <tr><th>2</th><td>2</td><td>7</td><td>0.11</td><td>NaN</td><td>Blue</td></tr>\n",
"    <tr><th>3</th><td>3</td><td>4</td><td>0.15</td><td>127</td><td>Blue</td></tr>\n",
"    <tr><th>4</th><td>4</td><td>9</td><td>0.16</td><td>130</td><td>Green</td></tr>\n",
"    <tr><th>5</th><td>5</td><td>1</td><td>0.11</td><td>121</td><td>Red</td></tr>\n",
"    <tr><th>6</th><td>6</td><td>3</td><td>0.14</td><td>124</td><td>Green</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " A B C D E\n", "0 0 2 0.1 100 Green\n", "1 1 NaN 0.12 120 Red\n", "2 2 7 0.11 NaN Blue\n", "3 3 4 0.15 127 Blue\n", "4 4 9 0.16 130 Green\n", "5 5 1 0.11 121 Red\n", "6 6 3 0.14 124 Green" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = pd.DataFrame([[0,1,2,3,4,5,6],\n", " [2,np.nan,7,4,9,1,3],\n", " [0.1,0.12,0.11,0.15,0.16,0.11,0.14],\n", " [100,120,np.nan,127,130,121,124],\n", " ['Green','Red','Blue','Blue','Green','Red','Green']], ).T\n", "x.columns = ['A', 'B', 'C', 'D', 'E']\n", "x" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>A</th><th>B</th><th>C</th><th>D</th><th>E</th><th>E_Green</th><th>E_Red</th><th>E_Blue</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>0</td><td>2</td><td>0.1</td><td>100</td><td>Green</td><td>True</td><td>False</td><td>False</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>NaN</td><td>0.12</td><td>120</td><td>Red</td><td>False</td><td>True</td><td>False</td></tr>\n",
"    <tr><th>2</th><td>2</td><td>7</td><td>0.11</td><td>NaN</td><td>Blue</td><td>False</td><td>False</td><td>True</td></tr>\n",
"    <tr><th>3</th><td>3</td><td>4</td><td>0.15</td><td>127</td><td>Blue</td><td>False</td><td>False</td><td>True</td></tr>\n",
"    <tr><th>4</th><td>4</td><td>9</td><td>0.16</td><td>130</td><td>Green</td><td>True</td><td>False</td><td>False</td></tr>\n",
"    <tr><th>5</th><td>5</td><td>1</td><td>0.11</td><td>121</td><td>Red</td><td>False</td><td>True</td><td>False</td></tr>\n",
"    <tr><th>6</th><td>6</td><td>3</td><td>0.14</td><td>124</td><td>Green</td><td>True</td><td>False</td><td>False</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " A B C D E E_Green E_Red E_Blue\n", "0 0 2 0.1 100 Green True False False\n", "1 1 NaN 0.12 120 Red False True False\n", "2 2 7 0.11 NaN Blue False False True\n", "3 3 4 0.15 127 Blue False False True\n", "4 4 9 0.16 130 Green True False False\n", "5 5 1 0.11 121 Red False True False\n", "6 6 3 0.14 124 Green True False False" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x_cat = x.copy()\n", "for val in x['E'].unique():\n", " x_cat['E_{0}'.format(val)] = x_cat['E'] == val\n", "x_cat" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th>E</th><th>Blue</th><th>Green</th><th>Red</th></tr>\n",
"    <tr><th>A</th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>NaN</td><td>0.1</td><td>NaN</td></tr>\n",
"    <tr><th>1</th><td>NaN</td><td>NaN</td><td>0.12</td></tr>\n",
"    <tr><th>2</th><td>0.11</td><td>NaN</td><td>NaN</td></tr>\n",
"    <tr><th>3</th><td>0.15</td><td>NaN</td><td>NaN</td></tr>\n",
"    <tr><th>4</th><td>NaN</td><td>0.16</td><td>NaN</td></tr>\n",
"    <tr><th>5</th><td>NaN</td><td>NaN</td><td>0.11</td></tr>\n",
"    <tr><th>6</th><td>NaN</td><td>0.14</td><td>NaN</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ "E Blue Green Red\n", "A \n", "0 NaN 0.1 NaN\n", "1 NaN NaN 0.12\n", "2 0.11 NaN NaN\n", "3 0.15 NaN NaN\n", "4 NaN 0.16 NaN\n", "5 NaN NaN 0.11\n", "6 NaN 0.14 NaN" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Another option to have one feature per color is to use Pivot\n", "# Note that it will create missing data:\n", "x.pivot(index='A', columns='E', values='C')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Pre-processing a dataset: when are ready for ML?\n", "* ~~Ideally, data are organized as a table: examples-vs-features~~\n", "* ~~Data from multiple sources are combined~~\n", "* ~~Missing data are handled~~\n", "* ~~Features have been combined and manipulated as needed~~\n", "* ~~Any data that need to be normalized have been normalized~~\n", "* ~~Data are of correct type (e.g. categorical vs continuous, boolean vs int)~~" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Other types of data storage\n", "* Image\n", "* Text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Image\n", "\n", "Datasets with images also need to follow samples-by-features format.\n", "Features in this case are pixels and their intensities. For black and white images intensities are binary. For grayscale they could be integer or floating point numbers. Color images are usually represented as multiple images - one for each color channel (e.g. red / green / blue).\n", "\n", "Thus each image is represented as a one dimensional array, which is exactly what's needed for ML applications. To visualize it, however, we need to change its shape." 
] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "downloading Olivetti faces from https://ndownloader.figshare.com/files/5976027 to /Users/agoncear/scikit_learn_data\n", "Dimensionality samples x features (400, 4096)\n" ] }, { "data": { "text/plain": [ "array([0.30991736, 0.3677686 , 0.41735536, ..., 0.15289256, 0.16115703,\n", " 0.1570248 ], dtype=float32)" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.datasets import fetch_olivetti_faces\n", "dataset = fetch_olivetti_faces() \n", "print(\"Dimensionality samples x features\", dataset.data.shape)\n", "\n", "# first image - pixel intensities\n", "dataset.data[0]" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAP4AAAD8CAYAAABXXhlaAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvIxREBQAAIABJREFUeJztnW3MXVeV3//LdkwCSfBbXhw7r2DSJLwEZEHAZfDAgAIdTb5ANcyoSqtI/kIrRp1qElqpYqpWAj4M9EOFZBVKPtAA88IkQTAzUUooDSXBKS+J84IT4ySO7dh5cTAhEOLsfnjuPfM//zz7/+z7+HnuTTjrJ1k+95599t5nn7Ofu9Zea68VpRQkSTIsVsy6A0mSTJ+c+EkyQHLiJ8kAyYmfJAMkJ36SDJCc+EkyQHLiJ8kAOaGJHxFXRsQDEfFgRFy3VJ1KkmR5icU68ETESgA/BfB+APsB/ADAR0sp9y5d95IkWQ5WncC1bwfwYCllLwBExFcAXAWgOvFXr15dTjnllLmGV/WbXrlyZXe8YkVfEDnppJO4jnmvGfVh3mPFneM/hFqu9Y+kq38xaH3uPlvH4MUXX2yqw90zn9NytTq0Le7HUtD6bCdhMdfpfXEdeu6FF17ojo8fP94d/+Y3v+mV489ax/jzc889h+eff37BF/BEJv4mAI/S5/0A3uEuOOWUU7Bt2zYAwLp163rn1qxZ0yvXa2jTpu74nHPOmfcaAHjVq17VHesfBR54/uPhHpC+RPxQ+I+T1uH+iDHuOm5b74U/61jxH0nX9q9//et52wL648jltB/8wj7//PO9c7Xx0T796le/qvbRPQuGz2n9/Jn7qOXcHzHuvz4zbpvfDx0PHke958OHD3fHx44d644PHjzYK3fo0KF5y3Gd3/ve99DCiUz8+Z7ES/40RsQOADsA4OSTTz6B5pIkWSpOZOLvB3Aufd4M4IAWKqXsBLATANauXVte/epXAwBOPfXUXjn+FdZfM
f6Dwb9oqi64XziG/zIv9te6JkEA/tfJSQr8me9Nf2m5Dm2Lz/Gx/orxr7r2g6/jsedfeO2Xk7AYHvuFaFWZnBhdq0+fLfdL2+VnoeJ37T5b+wH058LTTz+9YN3z1TGpenkiq/o/ALAlIi6MiNUA/hDATSdQX5IkU2LRv/illBci4l8D+HsAKwF8sZSye8l6liTJsnEioj5KKd8E8M0l6kuSJFPihCb+CTUs+jnr7qeddlrvHOujbuXe6XC1corTn7m9pQhg4nT31rWGVv3f6ZzjdZf5yrLu61b1+bkA/ZVrt2K+FLix4nvhcm513un/+t4yzz77bLWOWn+V9evXd8e6qu/qHD+L1vFNl90kGSA58ZNkgExV1H/xxRc7cUhFLRYVVZyqifdOfHUijxO1Wj3fatdo285TTUU3vm/nPcflnMccm0XVjMaiuHNicuKlE3traob2g9UFVve0H+57V39trNx9aR38vui5VicjxjlMPfHEE92xmryPHj3aHevY1+ZVjfzFT5IBkhM/SQZITvwkGSBT1fFXrFiB2u481nvYpAEAp59+enfM103iFlkzKbnNGq5Ot4nG1edcVmv6v9uko3D9fKz9YFdcPcduqby5RHVw7qPeV02Hds/Fmdjc964fizEf6vi26s08Pq27FYH6+paON6Nu4medddaC1zD5i58kAyQnfpIMkKmb88ZmpF/+8pe9c2vXru2O3R5zNv+ouqC7x2rnnBnKeaoxzuznRM/FBIpw4pvbWefUCh4D3TuuYmStf06VcEEjarBaAdTv24nRSk19mkTlcG1NstuwBtfBJrzHH3+8eo3ux3/ta18LoH2Hav7iJ8kAyYmfJANkZpt0NKABr+TzKj7QF6+cOO/ELhaBnKjv6qtdp6KyW9114iZTC+mkOOuIE18Xs9o9SRxDHgN+1vrca1YIoC5yTxKggsu2rngrrSoCv5uTjG9NHXGbp1QNHQfwaFU98hc/SQZITvwkGSA58ZNkgExVxz9+/DieeeYZAC8Nje0i8LLuVItBDnjzG+vCLqQze1E5rz5uS9cdnJmLdUQXLJSvcyZBPcc6tDPtcDkdNx6fWkhxoH/fqo+zaW6xwTe4ztaw5+4dcDq4C+LKYzVJPogarSZBNWsz2o/xzr3cnZckSZWc+EkyQKYq6pdSOtOXiscs1qh5rCb2qmjFHmd6jkXPp556qjtW02HNaw1o90Zz2Wec2Mt1uow4ahJrqb/Vuw146fjX+tFqOmoNitIqpjqTmsvo48zCLq4evxOqJnLZWk4DpTkuntTBaqj2Y/zOpaifJEmVnPhJMkBy4ifJAJmqjh8Rnb70i1/8onfO6Xo1XVJ1IKfD/fznP++OWT9St0gXhLIWdFF1NrcT0Lnzamz6Wjmnx3G/uC13L07frV2jdao7bK2O1oAd2i9nOnQmzFoGYmfOU9xY8fPltQB9lrz28Nxzz/XO1Z6ntsV16lrG+P1eMpfdiPhiRByOiHvou3URcUtE7Bn9v9bVkSTJy4sWUf9LAK6U764DcGspZQuAW0efkyR5hbCgqF9K+d8RcYF8fRWA7aPj6wHcBuDahepasWJFVZxlMcylInYBMFj8UVGOPzuRictpPzh4CIthGkCCUVH/Na95TXespkP+7MaDxVLtf218dKy4Tn0mLvY/4+LN1zzo9LmwajX26hzDceQ5PZXiTHHcDx43l9pc49k78xufY+9T7UctBTrQv7ea6gD0nzWn0wb+cRyXO4XWWaWUg6OGDgI4c5H1JEkyA5Z9cS8idgDYAdQXr5IkmS6LnfiPR8TGUsrBiNgI4HCtYCllJ4CdAHD66aeXseilfwRaN1rwOV0ddav63B6LZBr7z60esyjK4r2K4twP3XzE9WtWYK6HxVIniitc1ol9tXRdQF+85/t0K+YK30vNExDoj78T9dkqo8+M0fGu3SerXEB/vFVtcWPKZZ2qyXWoCF9TmdSrdN26dd2xjsG0UmjdBODq0fHVAG5cZD1Jk
syAFnPeDQD+L4CLI2J/RFwD4FMA3h8RewC8f/Q5SZJXCC2r+h+tnHrfEvclSZIpMbNgm24XnDOFsI7sdmI5MxTvBFRdjOvUfnDsfxf00wX6YLSO1kAcLoBkLWW0K9cap1776/RJPsdjqs/MBdvk8WATm747vNajwStY56/lZwD696xrKi44K/fR3Seja0I8Vi41G9+bjsH4GS63OS9JklcwOfGTZIBMPVvueFOM2/ChImTNxKZiDZtoVNTncyw2solE21KPvJoJT01IfE7FfhZnVaRmEdPF/mORVUVbrpPFQRUNuR8qetZUJhcApDWzq4rYfN04DdQYl9GX4XvR+6zF0nOqmpr6eIxdPD42OeomNH533EYfly2XP4+z447Zv38/gEyhlSSJISd+kgyQnPhJMkBmZs7THVCs27h8c64O1oHOPLO/b4jdY7kt1eNrMeX1Ok5TrDvH2NWUj7VOXRvQ9Yb52gX69+n0QA4yomPI9+n0Xca5q+p6RW2tQdd22BSnawj8bLicPrNW0yfr6rqewGsqLpiFPrPaGOt4szuyC9jp1jL4/dB1iPG5w4er3vM98hc/SQZITvwkGSBTj6s/Fuc4LTbQF41UHGTxlUVIjZfHIqXubGJRjoMYqGjEItmRI0d65w4dOtQd/+xnP+uONR3Y448/3h2zSgAAGzZsmLe/QH9MNm/e3B2fccYZvXIsmjtRvzUFtRPha9coKn7zM3TxFPm6J554oneOx59NZapa8fjs27evWgejKhirjfruXHbZZd3x2Wef3Tt3zjnnzHudqgTcf5cay6VH43dHRf3xuKbnXpIkVXLiJ8kAmdmqPm94AfqiuIqvKtKPUXGHy+nK6SOPPNId33fffd2xinxsDWBxHuiLkbzK/PrXv75XjtUFXal+4xvf2B2rqL93795521bR9g1veANq8H2z2KciNp9zor4LtsHXqWWA++Hq4LEae5+NYTWAn+3555/fK7dly5Zq/dz2eeed1x3fcMMNvXKsru3Zs6d3jkX9u+66q3fu7rvv7o63bt3aHbNYrv3Xd7NmtdJx48+qXo7HP0X9JEmq5MRPkgGSEz9JBsjUU2iN9XcNNMn6i4s3z2sBzuPs4Ycf7n1mXYxNPG9605t65d785jd3x6973et652688R9DC7Jp7/bbb++VY5OPrk/w2samTZt65zh4w2OPPdYd632yXqy6JK8HsNnIBZdQU1wt9bM+F96BpmsIujutxpNPPlntI6+P8G40XR/iMVbTJ+vrvIaiz+WDH/xgd7x79+7euW3btnXHDz30UO/cd7/73e74zjvv7I7f9ra39cqxp6BL7+50dL5Odfzxfet7X62rqVSSJL9V5MRPkgEyVVF/1apVnYiimyRYvFSRrybqazn2yFMzHXtSsSiuotXFF1/cHesmoAsvvLA7/sxnPtMdq9jI/XKbhZ566qneORZha3EGgb6Hm3pAsrrAnmQuQIOal3is2GTq1AX1gKxtAlLzJnu0uZh7rN7opiX2jrz00kur/WCT7rnnntsrx6Y+9prUfulmJH4nuB+sqgH9567jWEvppuoTf964cWPv3PiZ1eIlKvmLnyQDJCd+kgyQnPhJMkCmquOfeuqpeM973gPABztQfbQW81x3ObHuq7o7m724bTV/3Hzzzd3x9u3be+dYz2Q3XY2hzjqymsBY59T8Z+yqzOYaF6OdzWFAX/91OQK5nD4Lbq8W61/7qCZBXmNxQUt4HUJ1Xx4PrkPXPNjVV4OZ8HPn9QV9d9hMpybBgwcPVtvmMeD+u0CtunZUyzGhbXH/1Zw3Xpdw+SqYlhRa50bEtyPivojYHREfH32/LiJuiYg9o//XLlRXkiQvD1pE/RcA/Gkp5RIAVwD4WERcCuA6ALeWUrYAuHX0OUmSVwAtufMOAjg4Oj4WEfcB2ATgKgDbR8WuB3AbgGtdXStXruzMGi6FkTNJ8DlNk83ilIqNLNqxCKzlWJz/1re+Ve0Hm7lU7GLxV
cVoFvVdSmc+p15wLC7rGLDZyInpLhBHLYiGBhVxMeD13saoCMz91yAXPMYs5uq7w+Ktmki5Pa5P3zFWb7SPbK5VE1stnbmqNC4nAffFpS/n+9ZndskllwDw7ywz0eJeRFwA4K0A7gBw1uiPwviPw5n1K5MkeTnRPPEj4lQAfw3gT0opP1+oPF23IyJ2RcQudtZIkmR2NE38iDgJc5P+y6WUvxl9/XhEbByd3whg3ri+pZSdpZStpZStGsssSZLZsKCOH3OKxhcA3FdK+Qs6dROAqwF8avT/jfNcXkWji7Ce5vKwsZ6p+i3rPS4PG+tsukOOdTPVaWv5z1TvY31O9UXWA1WH4/67uPe1fHB6jse41ZVT++HqqEX7AfrPk81XOlbsZq3BNvm+a2sG2g81BdfWc3RXY82UCvTHQM2nXD/fm+4gbB1/N958bzqO4zWs1tx5LXb8bQD+BYC7I+JHo+/+PeYm/Nci4hoAjwD4SFOLSZLMnJZV/f8DoPbn6n1L250kSabB1ANxjEVTFQ2dKMRiDYt16tHmzEssejpTHO92U7GORX/2hFOVg/urHmKsBqgpkdUMvk8V52u7FYG+6MljoCIg91HHnk2QunOP4XNaR228ncemmi25rEtP5Uxg3I9aWi/A757j69TrznlV1urQtmvvvt4nqwHa7rjPuTsvSZIqOfGTZIBMPa5+bdXRZTx1cdlr5VSV4M8sbqt4WSsH9MUoLqfiVc3zTa9TcY3L1jwNgb4o2rqqr+Il35tLr8WejHovLv5+Te3Sciw6q4jNFhG3waZmhQDq75s+WxcAw62U11QQraM1P4HzYOU61NIzHu8U9ZMkqZITP0kGSE78JBkgU9Xxjx8/3gVXnMR9l/UW1nOciUrhc6w7qp7tdlHVdkc5/c0FFXFtOb3Y6fi19QXV41vjt7tybM5T/bzWX813yM9C74XXNmpek0B//cKZwBgdDzbPqp7Mz8IFuuC2nX7u9H+31uX099aceV1fJyqdJMlvBTnxk2SATN2cV/Pcch5dzCRidQ0nXroY6jVPMleHiqXcfyems3ipYhxfp3UwLhU248Rjl++AcenMnEclPwu3cYvrcPeidXB7zvzo1Bsnftf65ep36p+7N+dtOckmLCB/8ZNkkOTET5IBkhM/SQbI1HfnjfVEp9M7/YXNb6ovOnNeTd9Vk5rTmbk9l9vO6dbOFNfSFuDNRq3rHHxd6zUu2IZSWyvRdQKN91/rowtgwuPj3gG+Tt2g+Rm6MXW7St36kzvHdfC4uQCptbWjdNlNkqRKTvwkGSBTF/Vr8fScSaMmvqnY6OKO14IYqLjK51RsYrWgNU66U1ucWccFbmBUXaj1a7EqAYuerWK09oufiz5/jn+oOyW5jlZRWUVg/sz37LziWp8LUN9Zp/11nnutnpKtJu8W8hc/SQZITvwkGSBTFfVXrFjRraZqnDonetY81VTMbQ2YULMSAH2xWkNjcx0cDKJ1s43iRHi+TutgcVD7WBMHVbx0seK4PZd2ylkXaiKrUwk0JHqraOtUAl695+fkNtG498gFHHGedYyr36kLrR6hLeQvfpIMkJz4STJAcuInyQCZuo4/1hnVY8uZilgXZt2xVVfS6/jY6dluNxr3V+vgcy5YiFujcKm8jx492h1rWuia6Ul1QDad6RpCbTdd625F17bqphzMY926db1z/Jx43HR9wgU+re2ec+VaTZ1aT+tYtQbeVJxX3/hca0COBe8wIk6OiDsj4scRsTsi/nz0/YURcUdE7ImIr0ZEfQYlSfKyouVP268BvLeU8hYAlwO4MiKuAPBpAJ8tpWwB8DSAa5avm0mSLCUtufMKgLFceNLoXwHwXgB/NPr+egCfBPB5V9fx48c7MdWJwC4gA5ebJCZZrZxTCZwprnWTkfPcc6mxWNRX8ZizympGXx5X7r9uSuHPLniFwz2LWmCOVhUM6JsS1WzZ0pa2x2Oz2Ht2QVGcuuq882rxIBUup/0dq0xLJ
uoDQESsHGXKPQzgFgAPAThaShmP+H4Am2rXJ0ny8qJp4pdSjpdSLgewGcDbAVwyX7H5ro2IHRGxKyJ2jSPsJkkyWyYy55VSjgK4DcAVANZExFjO2QzgQOWanaWUraWUrZOE1E6SZPlYUMePiDMA/KaUcjQiTgHwe5hb2Ps2gA8D+AqAqwHcuFBdx48f73KxLUVcfYe60dYCWapetpi23C4+t0urdQeXSkpsAnMx5p2Jyq01cD671l1rTrd0QS74PdB7qe1Q1H64dN21NZvWHHuAX3+qofeyGP3fBXhRxu9cq47fYsffCOD6iFiJOQnha6WUb0TEvQC+EhH/GcAPAXyhqcUkSWZOy6r+TwC8dZ7v92JO30+S5BXGVD33Vq1ahTPPPBPASz33XMAHFnmcmavVJKN9qlHzjtK23Y4trd+Jzlw/e6epiYf7xYEsgL6Xn/MkW0wqMvWYa92NVospN1/btfZ4N2RrW1qHU7NacWPlxHl+FqqG8rvE78uzzz7bK+fUgEyTnSTJguTET5IBMvUUWmMmWSFm8dCtmDO6caa22um8/1o9whytKZG0LLftVBpVb3hF3gXK4M+qSvDY8bGK+k5tqWWA5RV4wG9oqsVG1DF1HnlcR2uAjUm8LWtBWLQfzlrkArkwPFa1MUhRP0mSKjnxk2SA5MRPkgEys7j6qtfwZ6dbu4AGrH+p7s66sIvz7mC9qhY3XsspLpgC40xlrMezmQt46S68Gmz2U7MR74rjtlxgTx1v1q1d8BQeO2fq4/dDn1ktGAbQN3fWAnsoOt6t6btcHYzb2dmarnvStNgvaeeErk6S5BVJTvwkGSBTFfVXr16N8847DwDw05/+tHfOZQJdzCYJF8zDmVZcIISaGKZ18DkV61zgCe6XE/V5Y8vatWt75zhfAYvw2g+OYa/qAgf3YBFbxVoWgdUTk+vgcqqK8JhqHWz64+enMQjdu1MLiuI8Nl08PlfWpcly6g5/diY757k3Htc05yVJUiUnfpIMkJz4STJAph5Xf6xPOrdIl+qYdUSnz6jLLutRLkBiLQ77fJ/HOPOMC6zYmutPA3GsX7++O9Y1iZ/85CfdMevZ559/fq8cj8/ZZ5/dO8c6tNs1+eSTT857DPQDgnIegHe/+929cvwMx0Fa5mvPuR+3xsF3a0XO7ZfHQPX9Wm6+1uCj2i8XwNSZsk877bR5v6+Rv/hJMkBy4ifJAJmqqH/gwAF88pOfBNAXVwHgne98Z3fsvKNY3HGpn5ypzIldNe88xYl/LsYc43a0ORH49ttv745VTOe2N2zYUO0Hqw8qHrKo73bWcR2ckkv7cfjw4e744Ycf7pVjL0FNB8amSn7WboecG28nBrtzLk9Cq+rGtHooulRhqnaN1brW1OL5i58kAyQnfpIMkKmK+seOHcN3vvMdAC8VG9/1rndVr2NRiEVI9eByIlltRdSpBA4WN3XDkfMQcyJlTbXYvHlz7/O+ffu6Y/V24001fKx1s2iu97xmzZrumMdHN/M4dYFF+AsuuKA7PnToUK8cew3q8+R+uPiBrSnLnPecs164jVyt4aydCM7n2Erj3s1du3b1zu3evRsAcOTIkab+5C9+kgyQnPhJMkBy4ifJAJl6sM2x/nTgQD/V3mOPPdYdX3zxxb1ziw2cwdRSKbXqaFrW6YSM8zLTtQFe92CdWc06vDtP4+qzRx7XoZ6MLr0Wf+a1ANXxa0E5gf69OBMp6/i6c4/vs7a7Uut355znngv26sy6LvgL48x+tba1j2zWvffee3vn9u7dC+Cl70qN5l/8UarsH0bEN0afL4yIOyJiT0R8NSJWL1RHkiQvDyYR9T8O4D76/GkAny2lbAHwNIBrlrJjSZIsH02ifkRsBvDPAPwXAP825mST9wL4o1GR6wF8EsDnXT2llE4c4kAQAPDNb36zO77ssst651iEavWOUmqi4iSifi2FluuTi8en17Epx8Wzq12j1DzwgL4Y6eL08XVsogN8+qjap
ig12bkAGzWzqDO3uTFtVcmcZ2Br/gBV41wsPe6XE9XZA/LgwYO9c+P3rPV9bp1FnwPwZwDGPVwP4GgpZfxW7wewqbGuJElmzIITPyJ+H8DhUspd/PU8Ref9UxMROyJiV0Tscn9xkySZHi2i/jYAfxARHwJwMoDTMScBrImIVaNf/c0ADsx3cSllJ4CdALB69ep2uTpJkmVjwYlfSvkEgE8AQERsB/DvSil/HBF/CeDDAL4C4GoANy5UV0R0OpEGibznnnu6Y9WLWY9y+m6r66Yrx207fYl1Nqfja/0ueGWtnEPr4OtYf3Y6p1sncGYoNtm5oBG1drWPi81H2GpuczhTH9/LYuPlc50uTbb7fmyyA15qWp107etEHHiuxdxC34OY0/m/cAJ1JUkyRSZy4Cml3AbgttHxXgBvX/ouJUmy3MzMc089vThm2/79+3vnLrroou6YxR8nKqu4tph4/K2pjtzONyfWuRjwtT4B/bFzXneM3guPnZrRnMmxVqfbScb161i5PAO1sZpErK0FN9ExZbVFvUN1fBg2v7mY+Ny2muxq6cY0uMlDDz3UHet7NX4nMq5+kiRVcuInyQCZqqhfSulEGRVzebPG/fff3ztXE/VV3HGeaoxbwXUbeFiMcp51btPIYgJ9qKjJn12qJid6Om9IViXcqj6PlfM4G4d+BnyWZO0jWxucp2Grt1rrRhwXE0/Hu3XF36k0NZVVVV723NNnMa5zqT33kiT5LSInfpIMkJz4STJApm7O6xoWsw7rlZpC+wMf+MC8dTj9VqkF22zVuRUXp9+ZjVyqplbTE6O6dc0zUD29uM+c4gqoB9hQEyyj6xD8fJ13HvfX5SdwawjOtMrP3fXfvRMunn1Nx3fx99VTkuvg8XjwwQd75Ti4qd7nuL005yVJUiUnfpIMkKmK+rxJx20McWYMjjHXahbRslxukhjtrem1+JyL7aaiZ80LzHmqOdHTBX/gtjTHAasFLMJrIA6+FzWx1YJ7tKoEQD1NWWuMeqA+Bk5NdCqH84B0gTjc5qxaujRNN+ZMsONzac5LkqRKTvwkGSA58ZNkgEzdZXesi6juy3qPpkvmXHGch02p5dhzaD+ca2XtOhck0unnugus1n+tg8+17jrkPHqTwPem9+lcn2trIPo9j4Grw62ptLpns56taxCtwTCV2jPTtQBeU3HBXx599NHumNe2gP746PvtdhDOR/7iJ8kAyYmfJANkZuY8F4edTRpAf7feFVdc0R2r+YdTRnOKZaDuuae41Fi161Rkdx5nLG7qdXw/bGJjjy2gbwp97rnneufUNFdDr6vB9bk4gNou3xvnUFAx14nzLHLz2LTGXdQ6uP9qTubdoS4XQmuaLC3nPD35Mwfb0Oe+lOQvfpIMkJz4STJAZrZJx6126wolb9o5dOhQd6yeZCwqqljKoqcLyOA20dRW07Ut9nzTDTBHjhxBDV5pZnVH63AiK4vw3EfOPKt1uKAUPG61jSHzwWPFq9NqXWCVTNOq8XiwKuEyEDtPSfd+uHRg7n2pbbBReDxULeLYejxWqi64zU6tVqwx+YufJAMkJ36SDJCc+EkyQGam46v+wrqZmukeeOCB7njPnj3dsXrxOa+7mjnPBcpUnbamI6rZhdchdKchn1Ndr2ZSUj2eP6tuxyZN7u8TTzzRK9eqm/I6yiS74hgebzVhsh5/xhln9M6tX7++O3ax+XltQNcyatc577zWIKv62aVfc0FReD2H34laQE3t03yfF6Jp4kfEPgDHABwH8EIpZWtErAPwVQAXANgH4J+XUp6eqPUkSWbCJKL+75ZSLi+lbB19vg7AraWULQBuHX1OkuQVwImI+lcB2D46vh5zOfWuXeiisfgyifmBxcObb765O37HO97RK8cin9u04FJEtWbcdaIyi3Xr1q3rnWNxXlMksZjH46MmMO6/UwNYxHYehFoHl+VjvU/nJcj9Z3Fex5THUc2Wjz322Lx1qNnv7LPPnrddANiwY
UN3rGoG0xqkw4nwLlDGsWPHumN97rWgKJOI8+N+LHUgjgLgHyLirojYMfrurFLKwVFjBwGc2VhXkiQzpvUXf1sp5UBEnAngloi4f8GbwUpkAAALRElEQVQrRoz+UOwA2nOVJ0myvDT94pdSDoz+Pwzg65hLj/14RGwEgNH/hyvX7iylbC2lbJ0ky2mSJMvHgj/BEfEaACtKKcdGxx8A8J8A3ATgagCfGv1/Y0uDY11kkj8CbN679957u+Pvf//7vXJOn2Md0e2Qc+6fNZOVmh85V5wzL6k+yjouH7uY8i6oI6NtsV6vOj7rrRzcVHVf1lt1bFy8fIav03UZbpufn655uD7WAmC0pv/WOt04sn7t3HKdSbAV7cd4jJtzMzaUOQvA10cVrgLwP0spfxcRPwDwtYi4BsAjAD7S3OskSWbKghO/lLIXwFvm+f5JAO9bjk4lSbK8TH21rRa/zMU1Y9iU86Uvfal3bvv27d0xm82AvjjOYp4LmNAac1/FeW5LxXS324091WpiP9DfgedEWxYH3fg67z8XtITHWFUMrpPHw4n9GgePvQb5nKoEfK41Nt8kcfVaA6swGkyGn5nrY2tq8xNdKM/VtiQZIDnxk2SA5MRPkgEy9bj6Y72lNeAl0NdnWLfWHWePPPJId8wmHqCuWzsXUhedh/VMvRfWOXUnltsJx7ow71RT8xXXqWPF+jmX03adCYnHquZGrOiaCo8JPws1fXJb+sx4jFtTXDu3XD6n98LjoXo8m+z0XC2WvguU6dKjs46vbfH4qDmydY2su36i0kmS/FaQEz9JBshURf0VK1Z04tskogmLUE6UY08yVQNYVGSxVMUuFqFc7H8XzKMm5gJ9UVzrr6XG1ntmM5eqKnxv7EGonmTcZz3H9ddSdwPeA7IW5FL722qmcwFSam1pH50ax6gawKY4Z7Zk1UrFdH7/XEBQl5bMxeaf1LyXv/hJMkBy4ifJAJnZPlkVTVxc81rgDF3tdh55nIGXPeRU9HQrvzXxfhIPK65f75PLOpGv1fLgAmC4jLv8ma+bpB8sEvOxrkZzH/VcbbXbpSxTUZxFbhdzj69T1cfVz2oAe1jqc+d703M1C9diLV8t5C9+kgyQnPhJMkBy4ifJAJm6514tEIfTgWq6r+o5HOxAdTjWxdgTS/vh0jGzuckFq+T+qonK6W18nfOsc3105jeG9VbnxebytbmgIrW1ATfeCpd1uyb5WasZrZYm23k8ah38WdOL8zvH9WteR6b1/XbrMrrOkZ57SZIsSE78JBkgUxX1I6IT31TkYzHGbUBwZh0XL4+vYw8/bYvrdPXX+q7lVOVwwSBqgTO0nBOdWfxms5Qz+7XmD3Bx6lwdtSAoC1ETX1UU5/tUEb41fwOL8E5NVFGf3yUOEqNBRVgddGK6M0kzLh9EC/mLnyQDJCd+kgyQnPhJMkBm5rLrdqa14vRFp0exvqUBE5xZsWbmcrvFHC4IiBuPxcRhd7i00C5gJ6816JpKzc3ataV1sHmsNY11ax0aIMXlQuQxUB2f2+Odkc79WE2YXNaty/B12sfx+77UufOSJPktIid+kgyQqZvzxmKxM4Gp6MxlW2OSqzjF4r0TmXiHlcbLr6VgcqZJZ+ZS0bnmneY869Rr0AUqYZwqwWPCIqqL26f3WTNLtapPQN0UN0mOAP7sYue5MWXvPD3HJjy3+8+ZNLlt9+4wtSAdrWpm0y9+RKyJiL+KiPsj4r6IeGdErIuIWyJiz+j/tU0tJkkyc1pF/f8K4O9KKf8Ec+m07gNwHYBbSylbANw6+pwkySuAlmy5pwP4HQD/EgBKKc8DeD4irgKwfVTsegC3Abi2tWHnnefOuY0sLrabbpYZo2Ghn3zyyXnbBfqiP4ty6qXlVrF51dZt2GFUfHUbcWrhsFVEdRtWaqKnivo8pk6lqW1CAfoisPOyq3kTar+0jto5FyZbV+55A4/GUKyl73JWDqfSLCaOITCZRyTQ9ot/EYAjAP5HR
PwwIv77KF32WaWUgwAw+v/MiVpOkmRmtEz8VQDeBuDzpZS3AngWE4j1EbEjInZFxK5Wv+kkSZaXlom/H8D+Usodo89/hbk/BI9HxEYAGP1/eL6LSyk7SylbSylbTzTDZ5IkS8OCM7GUcigiHo2Ii0spDwB4H4B7R/+uBvCp0f83tjRY00VcamnWozhopurWrUEjXOx81uPVq68WKFP1eO6X6mmsg+u6Q+0Po9bP46N6dy1QifNGc/Hya6Y9vc55qjlzrMuZ0Lprjfvogmiyrq5rHjymzkSqAV5rZjpnVnWBRFwQVGcWnXR3XutP8L8B8OWIWA1gL4B/hTlp4WsRcQ2ARwB8ZKKWkySZGU0Tv5TyIwBb5zn1vqXtTpIk02DqMffGoswkgTj43NNPP90db9q0qVfOxW9jWmPFKezBxddpHPZ169Z1x04kc/H4namPVQnnqVbzNNT6tY5auio3Vs6MVssyDPTvWU1lro+1/uqzYHXNZf51XogcP0/7yOPDx/ouqvpaa9t57rkgLpNu3Epf/SQZIDnxk2SA5MRPkgHysjGs19xyAeDIkSPdMetAalpR8x5T202n8c9ZD9R+sJ7GgRxUH+c+btiwoXfOmRy5PWe+4ut4dxhQdzl2bTn92ZmXGO0j66Bs2tJn5HYr1kyTWo7Nb88880zvHOv4LkgF91H1c37P9J3g+6nlHAS8S21rkFW3/jS+LgNxJElSJSd+kgyQaBUNlqSxiCMAHgawAcATU2t4fl4OfQCyH0r2o8+k/Ti/lHLGQoWmOvG7RiN2lVLmcwgaVB+yH9mPWfUjRf0kGSA58ZNkgMxq4u+cUbvMy6EPQPZDyX70WZZ+zETHT5JktqSonyQDZKoTPyKujIgHIuLBiJhaVN6I+GJEHI6Ie+i7qYcHj4hzI+LboxDluyPi47PoS0ScHBF3RsSPR/3489H3F0bEHaN+fHUUf2HZiYiVo3iO35hVPyJiX0TcHRE/iohdo+9m8Y5MJZT91CZ+RKwE8N8AfBDApQA+GhGXTqn5LwG4Ur6bRXjwFwD8aSnlEgBXAPjYaAym3ZdfA3hvKeUtAC4HcGVEXAHg0wA+O+rH0wCuWeZ+jPk45kK2j5lVP363lHI5mc9m8Y5MJ5R9KWUq/wC8E8Df0+dPAPjEFNu/AMA99PkBABtHxxsBPDCtvlAfbgTw/ln2BcCrAfw/AO/AnKPIqvme1zK2v3n0Mr8XwDcAxIz6sQ/ABvluqs8FwOkAfobR2tty9mOaov4mAI/S5/2j72bFTMODR8QFAN4K4I5Z9GUkXv8Ic0FSbwHwEICjpZTxjpFpPZ/PAfgzAOPdKetn1I8C4B8i4q6I2DH6btrPZWqh7Kc58efbWjRIk0JEnArgrwH8SSnl5wuVXw5KKcdLKZdj7hf37QAuma/YcvYhIn4fwOFSyl389bT7MWJbKeVtmFNFPxYRvzOFNpUTCmU/CdOc+PsBnEufNwM4MMX2labw4EtNRJyEuUn/5VLK38yyLwBQSjmKuSxIVwBYExHjfafTeD7bAPxBROwD8BXMifufm0E/UEo5MPr/MICvY+6P4bSfywmFsp+EaU78HwDYMlqxXQ3gDwHcNMX2lZswFxYcmCA8+IkQcxuqvwDgvlLKX8yqLxFxRkSsGR2fAuD3MLeI9G0AH55WP0opnyilbC6lXIC59+F/lVL+eNr9iIjXRMRp42MAHwBwD6b8XEophwA8GhEXj74ah7Jf+n4s96KJLFJ8CMBPMadP/ocptnsDgIMAfoO5v6rXYE6XvBXAntH/66bQj3+KObH1JwB+NPr3oWn3BcCbAfxw1I97APzH0fcXAbgTwIMA/hLAq6b4jLYD+MYs+jFq78ejf7vH7+aM3pHLAewaPZu/BbB2OfqRnntJMkDScy9JBkhO/CQZIDnxk2SA5MRPkgGSEz9JBkhO/CQZIDnxk2SA5MRPkgHy/wFIRd8hKb+OlQAAAABJRU5ErkJggg==\n", 
"text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# reshaping to visualize\n", "plt.imshow(dataset.data[0].reshape(64, 64), cmap=plt.cm.gray)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAP4AAAD8CAYAAABXXhlaAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvIxREBQAAD7FJREFUeJzt3V+sHOV5x/HvrzYOaRJkDDaybKixZKWgKpj4iDhyVRHnj1waBS6gIkorq7J6bqhE1EipaaW2qVSp3AR6UVWygMYXacAlf4x80cRyzEVvDOcESOw4jp3UBcun2BVYSXqBavL0Yuek6+05Z+fszr89z+8jrc7OeHbm8c48+77vzDvvKCIws1x+re0AzKx5TnyzhJz4Zgk58c0ScuKbJeTEN0vIiW+W0FiJL2mPpDOSzknaX1VQZlYvjdqBR9Iq4MfAJ4ELwMvAZyPih9WFZ2Z1WD3GZ+8BzkXETwEkPQvcDyya+JLG7ia4Y8eOcVdh1imzs7OVri8iNGyZcRJ/E/BG3/QF4CNjrK+UmZmZujdh1ihpaJ5WbpzEXyja/1eiS5oGpsfYjplVbJzEvwDc2je9Gbg4uFBEHAAOQDVVfbOVpv88W1Ol/zhn9V8Gtkm6XdIa4GHghWrCMrM6jVziR8RVSX8CfBtYBTwTEacqi8zMajPy5byRNlZBVd/jB9hKVkVVv8xZfffcM0vIiW+WkBPfLCEnvllCTnyzhJz4Zgk58c0SGqfLbmN87d6sWi7xzRJy4psl1Mmqvqv2ZvVyiW+WkBPfLCEnvllCTnyzhJz4Zgk58c0ScuKbJeTEN0vIiW+WkBPfLCEnvllCTnyzhJz4Zgl18u68uvU/tMB3AlpGQ0t8Sc9IuiTpZN+8dZKOSjpb/L2x3jDNrEplqvpfAfYMzNsPHIuIbcCxYtrMJkSpZ+dJ2gIciYjfKqbPAPdGxJykjcCLEfHBEuspVa+uu/o9alW/6kcYu5mxtKYeGT2o6f2y2PE46v+/zmfn3RIRc8VG5oANI67HzFpQ+8k9SdPAdN3bMbPyRk38NyVt7KvqX1pswYg4AByAah6TXYWyVbm6q5pLrT9jM6Ctqv2gwTiW2hdVXCGqonq/XKNW9V8A9hbv9wKHqwnHzJow9OSepK8B9wI3A28CfwV8CzgE3Aa8DjwUEW8N3VhHTu6V1WYJ1JXvoEldKfEH1V3iL7a+UZU5uVfqrH5VnPjldeU7aJITv7nEn+iee3W3kQfX0dUD06q1nGOnbPu87DrL/siMy331zRJy4pslNHFV/TYvgWVsdzepimpum/uo6mp6nf8Xl/hmCTnxzRJy4pslNHFt/EloZ49xV1XFkeQzCecCusAlvllCTnyzhDpT1e9i1avpnnqLba+L300dmvy+q9rWKD3yqmgKjhu/S3yzhJz4Zgl1pqrfpEm72WY5A0Nk1MZAFsO2t9Q+6sL+c4lvlpAT3ywhJ75ZQmna+FX06Kp7XP2y619JjwDr2l1rw7Y3yj4aXEcXuMQ3S8iJb5bQiq3q13FZp2u9r7qszv/bUtXoUZtPXauK180lvllCTnyzhJz4Zgl15oEaVcdR0YMJal1/3brSbm3yu6pjwM66
j6UqDFziHf8x2ZJulXRc0mlJpyQ9WsxfJ+mopLPF3xvHitzMGlOmqn8V+EJE3AHsBB6RdCewHzgWEduAY8W0mU2AoYkfEXMR8b3i/c+B08Am4H7gYLHYQeCBYevasWMHEdFIFXR+O8NeS5G06KvstquIP6tRvoNR99lS66hiX4wSR52WdXJP0hbgbuAEcEtEzEHvxwHYUHVwZlaP0okv6f3A14HPR8TPlvG5aUkzkmYuX748SoxmVrFSiS/pOnpJ/9WI+EYx+01JG4t/3whcWuizEXEgIqYiYmr9+vVVxGxmYxraZVe9RsnTwOmI+HLfP70A7AX+rvh7eDkb7krbtStxDPLY/PXfDdnWOrqgTF/9XcAfAj+Q9Gox78/pJfwhSfuA14GH6gnRzKo2NPEj4t+AxX56P15tOGbWhBV7d15XLNULrIoeYpNQ9WzzjsRJ+H7a4L76Zgk58c0SclW/YaNUPV1dvdZymk+2MJf4Zgk58c0ScuKbJdTaQBxui9kwHiizvMoH4jCzlceJb5aQL+dZZ7kKXx+X+GYJOfHNEnLimyXkNr7ZBBr3LkeX+GYJOfHNEmo08ZscV9/MFucS3ywhJ75ZQk58s4Sc+GYJOfHNEnLimyXkxDdLaGjiS7pe0kuSXpN0StKXivm3Szoh6ayk5yStqT9cM6tCmRL/HWB3RNwFbAf2SNoJPA48ERHbgLeBffWFaWZVGpr40fOLYvK64hXAbuD5Yv5B4IFaIjSzypVq40taVTwp9xJwFPgJcCUirhaLXAA21ROimVWtVOJHxLsRsR3YDNwD3LHQYgt9VtK0pBlJM5cvXx49UjOrzLLO6kfEFeBFYCewVtL8/fybgYuLfOZARExFxNT69evHidXMKlLmrP56SWuL9+8FPgGcBo4DDxaL7QUOD1vX7Owskhp/VLKZXWvoAzUkfYjeybtV9H4oDkXE30jaCjwLrANeAf4gIt4Zsi4/UMOsAksVnmUeqOEn6ZhNoHET3wNxmCXkLrtmCTnxzRJy4psl5MQ3S8iJb5aQE98sIT9Cy2xCVNnj1SW+WUJOfLOEnPhmCTnxzRJy4psl5MQ3S8iX81aw/ss/Td8RudilJ9+ZWY/573VqaqrU8i7xzRJy4psl1NoIPINcBVzckNFWFl22ze90lKr+4Gd8TFyrbM+9zo3AY2bd4MQ3S8hn9SfAUtX5Yct2jYdWL6/O78olvllCTnyzhJz4Zgl15nJev663U215unKJcdK1cjmveFT2K5KOFNO3Szoh6ayk5yStKbsuM2vXcqr6j9J7WOa8x4EnImIb8Dawr8rAzKw+pRJf0mbg94CnimkBu4Hni0UOAg/UEaBNvvnHprma3x1lS/wngS8CvyymbwKuRMTVYvoCsKni2MysJkMTX9KngUsRMds/e4FFF/w5lzQtaUbSzIgxmlnFyvTc2wV8RtJ9wPXADfRqAGslrS5K/c3AxYU+HBEHgANQ/qy+mdVraIkfEY9FxOaI2AI8DHw3Ij4HHAceLBbbCxyuLUqzhCRd86rSOB14/gz4U0nn6LX5n64mJDOrmzvwmHXUqKV8mQ48nbw7zwMymNXLffXNEnLimyXkxDdLyIlvlpAT3ywhJ75ZQp28nGeWVVODkbrEN0vIiW+W0ERU9T1mW72WMZZbzZFYU1zimyXkxDdLyIlvltBEtPGtXm675+MS3ywhJ75ZQq7q25J8KbV+bTw63CW+WUJOfLOEJq6qn7Hq2UZVcCFVxJFln3WdS3yzhJz4Zgk58c0Smrg2/qTpSvu8K8Z4SETFkbSnC8dEqcSXdB74OfAucDUipiStA54DtgDngd+PiLfrCdPMqrScqv7HImJ7REwV0/uBYxGxDThWTJvZBBinjX8/cLB4fxB4YPxwVp6IWPRl5Q0+Obaup8hmUTbxA/iOpFlJ08W8WyJiDqD4u6GOAM2semVP7u2KiIuSNgBHJf2o7AaKH4rpoQuaWWOW/ZhsSX8N/AL4Y+DeiJiTtBF4
MSI+OOSzldZvJ7267GpqNSbtOKh7v5d5TPbQqr6k90n6wPx74FPASeAFYG+x2F7g8OihjmbS23pLtf99LmBl6dpxOrTEl7QV+GYxuRr454j4W0k3AYeA24DXgYci4q0h66r1KF6pSdKVg6WrJmG/N7kPy5T4y67qj8OJPxon/tImYb93LfHdc28CVHFgT+KPxyQk9KRyX32zhJz4Zgk58c0SWlFt/Iyj85S1nO+j6vMBGfdF18+puMQ3S8iJb5bQiqrqWzUyVs2r0PXqfT+X+GYJOfHNEnLimyXkxDdLyIlvlpAT3ywhX84zG8MkXcLr5xLfLCEnvllCruqbLUPZqv1g78elbiBro7ngEt8sISe+WUJOfLOE3MY3G2JSL9ktxSW+WUJOfLOEXNU3GzBq1X6pAUy6NrhJqRJf0lpJz0v6kaTTkj4qaZ2ko5LOFn9vrDtYM6tG2ar+3wP/GhG/CdwFnAb2A8ciYhtwrJg2swlQ5qGZNwCvAVujb2FJZ2j5MdlL6VrVyrqtijP3ox5zNQxnPv5jsoGtwGXgnyS9Iump4nHZt0TEXLGhOWDDWNGaWWPKJP5q4MPAP0bE3cB/s4xqvaRpSTOSZkaM0cwqVibxLwAXIuJEMf08vR+CN4sqPsXfSwt9OCIORMRURExVEbCZjW9o4kfEfwJvSJpvv38c+CHwArC3mLcXOFxLhGYVkbToa1QR8avXJBl6cg9A0nbgKWAN8FPgj+j9aBwCbgNeBx6KiLeGrMcn96w1dXS9reI4a+PkXqnEr4oT39rkxP8/7rlnK5qf/Lsw99U3S8iJb5aQE98sIbfxbUXp6gm8rnGJb5aQE98soaar+v8F/Adwc/G+NiWqfLXHUJLjuFbn4mh5zL3lfh+/UWahRjvw/Gqj0kzbffe7EIPjcBxtxeGqvllCTnyzhNpK/AMtbbdfF2IAxzHIcVyrljhaaeObWbtc1TdLqNHEl7RH0hlJ5yQ1NiqvpGckXZJ0sm9e48ODS7pV0vFiiPJTkh5tIxZJ10t6SdJrRRxfKubfLulEEcdzktbUGUdfPKuK8RyPtBWHpPOSfiDp1flh4lo6RhoZyr6xxJe0CvgH4HeBO4HPSrqzoc1/BdgzMK+N4cGvAl+IiDuAncAjxXfQdCzvALsj4i5gO7BH0k7gceCJIo63gX01xzHvUXpDts9rK46PRcT2vstnbRwjzQxl3z90UJ0v4KPAt/umHwMea3D7W4CTfdNngI3F+43AmaZi6YvhMPDJNmMBfh34HvAReh1FVi+0v2rc/ubiYN4NHAHUUhzngZsH5jW6X4AbgH+nOPdWZxxNVvU3AW/0TV8o5rWl1eHBJW0B7gZOtBFLUb1+ld4gqUeBnwBXIuJqsUhT++dJ4IvAL4vpm1qKI4DvSJqVNF3Ma3q/NDaUfZOJv1C/x5SXFCS9H/g68PmI+FkbMUTEuxGxnV6Jew9wx0KL1RmDpE8DlyJitn9203EUdkXEh+k1RR+R9DsNbHPQWEPZL0eTiX8BuLVvejNwscHtDyo1PHjVJF1HL+m/GhHfaDMWgIi4ArxI75zDWknz9280sX92AZ+RdB54ll51/8kW4iAiLhZ/LwHfpPdj2PR+GWso++VoMvFfBrYVZ2zXAA/TG6K7LY0PD67e3R5PA6cj4sttxSJpvaS1xfv3Ap+gdxLpOPBgU3FExGMRsTkittA7Hr4bEZ9rOg5J75P0gfn3wKeAkzS8X6LJoezrPmkycJLiPuDH9NqTf9Hgdr8GzAH/Q+9XdR+9tuQx4Gzxd10Dcfw2vWrr94FXi9d9TccCfAh4pYjjJPCXxfytwEvAOeBfgPc0uI/uBY60EUexvdeK16n5Y7OlY2Q7MFPsm28BN9YRh3vumSXknntmCTnxzRJy4psl5MQ3S8iJb5aQE98sISe+WUJOfLOE/hc6yJ7EGUtf0wAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Example of normalization of an image\n", "from sklearn.preprocessing import Binarizer\n", "\n", "image = dataset.data[0].reshape(64, 64)\n", "normalized_image = Binarizer(threshold=0.6).fit_transform(image)\n", "plt.imshow(normalized_image, cmap=plt.cm.gray)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Text\n", "\n", "Text has also to be transformed to samples-by-features format.\n", "In the simplest case each document is a sample and ocurrence of words are its features." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Downloading 20news dataset. This may take a few minutes.\n", "Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Number of documents 594\n", "Beginning of the first document From: ron.roth@rose.com (ron roth)\n", "Subject: HYPOGLYCEMIA\n", "X-Gated-By: Usenet <==> RoseMail Gateway (v1.70)\n", "Organization: Rose Media Inc, Toronto, Ontario.\n", "Lines: 31\n", "\n", " anello@adcs00.fnal.gov (Anthony Anello) writes:\n", "\n", "A(> Can anyone tell me if a bloodcount of 40 when diagnosed as hypoglycemic is\n", "A(> dangerous, i.e. indicates a possible pancreatic problem? One Dr. says no, the\n", "A(> other (not his specialty) says the first is negligent and that another blood\n", "A(> test should be done. 
Also, wh\n" ] } ], "source": [ "from sklearn.datasets import fetch_20newsgroups\n", "emails = fetch_20newsgroups(subset='train', categories=['sci.med'], shuffle=True, random_state=0)\n", "print(\"Number of documents\", len(emails.data))\n", "print(\"Beginning of the first document\", emails.data[0][:500])" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(594, 16257)\n" ] } ], "source": [ "# For every document, we count word occurrences:\n", "\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "emails_in_ML_format = CountVectorizer().fit_transform(emails.data)\n", "print(emails_in_ML_format.shape)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "matrix([[1, 0, 0, ..., 0, 0, 0]])" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This is what the first document now looks like:\n", "emails_in_ML_format[0].todense()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" }, "livereveal": { "start_slideshow_at": "selected" } }, "nbformat": 4, "nbformat_minor": 2 }