Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE".

# Week3 Assignment - Data retrieval and dataset preprocessing.

For this assignment, you will need the iris flowers dataset available on the course website, under Week3 files. This is a dataset of 150 observations of flowers, with petal lengths and widths, and sepal lengths and widths.

The basic goal of this assignment is to preprocess the iris dataset in preparation for a machine-learning algorithm.

In the first component, you will load the dataset into a Pandas dataframe.

In the second component, you will impute missing values in one of the columns. In this case, you will assign the average value in the column to the missing data points.

In the third component, you will create two new columns that will approximate the sepal and petal sizes: One will equal the petal length multiplied by the petal width, and the other will equal the sepal length multiplied by the sepal width.

In the fourth component, you will normalize the sepal and petal sizes.

In the fifth component, you will add a column with a boolean value representing whether a flower belongs to the setosa species.

All the exercises are designed so that the solutions will need only one or a few lines of code.

Do not hesitate to contact the instructors and TAs via the #week3 channel on Slack if you get stuck. Join the channel first by clicking on Channels.

## Part A. Read in the iris dataset. 

In this component you will read the iris dataset into a Pandas data frame. Make sure you download the iris.csv file from the course website. Do not download or load a different instance of the iris dataset: use the one from the course website as we have modified it for this exercise, and when we test your code we will test it against our version of the dataset. Also, do not manually modify the iris dataset.

Pay attention to which directory you save the file in so that you can load it by its path.
If you prefer to load the dataset by its URL, you can do that: https://biof509.github.io/spring2019/_downloads/Iris.csv


Once downloaded, use Pandas to read the file into a data frame.

Save the data frame to a variable named iris_data.
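As a rough sketch of how `pd.read_csv` behaves, the snippet below parses a tiny in-memory CSV rather than the course file. The column names and values here are made up for illustration and are not the iris data. Note that empty fields are read as missing values (NaN).

```python
import io

import pandas as pd

# Hypothetical miniature CSV standing in for a real file; not the course data.
csv_text = "Id,LengthCm,Kind\n1,5.1,a\n2,,b\n3,4.7,a\n"

# read_csv accepts a local path, a URL, or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)           # (number of rows, number of columns)
print(toy.isnull().sum())  # count of missing values per column
```

Passing a path string or a URL in place of the `io.StringIO` object works the same way.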

In [None]:
import pandas as pd
import numpy as np

# Load the iris dataset into a pandas dataframe
# Make sure to save it as "iris_data"

# YOUR CODE HERE
pass

# You can check that it loaded correctly by looking at its first few rows with the .head() method
# print(iris_data.head())

In [None]:
assert isinstance(iris_data, pd.core.frame.DataFrame)
assert iris_data.shape == (150, 6)
assert np.isclose(iris_data["SepalLengthCm"].mean(), 5.8153)
assert iris_data.isnull().sum().sum() == 7
assert iris_data.isnull().values.any()

## Part B. Impute missing data.

Unfortunately it appears that there are some missing data points in the PetalLengthCm column.

To resolve this, write a function that can find the mean of the PetalLengthCm column and replace the missing data points in the PetalLengthCm column with this mean.

The function takes as input a Pandas data frame and a column name. The function will return a new Pandas data frame, with the missing data points replaced with the column mean.

Run the function on the iris_data data frame and save the new data frame in the same variable (iris_data).
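To illustrate the idea of mean imputation on its own, here is a minimal sketch on a toy data frame. The column names `length` and `width` are hypothetical, and this is one possible approach (copy, then `fillna`), not necessarily the intended solution.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in each column (made-up data).
toy = pd.DataFrame({"length": [1.0, np.nan, 3.0], "width": [0.5, 0.6, np.nan]})

# Work on a copy so the original frame is left untouched.
filled = toy.copy()

# Series.mean() skips NaN by default, so the mean uses only observed values.
filled["length"] = filled["length"].fillna(filled["length"].mean())

print(filled)
```

Only the named column is filled; the missing value in `width` is still there, and `toy` itself is unchanged.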

In [None]:
def impute_with_mean(df, column_name):
    """
    Accepts a Pandas data frame and a column name as input.
    
    Returns a new Pandas data frame with missing data points in the column
    replaced with the mean value for that column.
    """
    # YOUR CODE HERE
    pass

# Let's impute in our data frame
iris_data = impute_with_mean(iris_data, "PetalLengthCm")

In [None]:
assert not iris_data.isnull().values.any()
assert iris_data["PetalLengthCm"].median() == 4.25
assert iris_data.equals(impute_with_mean(iris_data, "SepalLengthCm"))

## Part C. Approximate full sizes.

In this component you will create two new columns that approximate the sepal and petal sizes.

The first new column will be named sepal_size and will be equal to: SepalLengthCm \* SepalWidthCm

The second new column will be named petal_size and will be equal to: PetalLengthCm \* PetalWidthCm

Add the appropriate columns below.
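As a small illustration of element-wise column arithmetic in Pandas, the sketch below multiplies two columns of a toy data frame. The names `length`, `width`, and `area` are hypothetical and chosen only for this example.

```python
import pandas as pd

# Toy frame with made-up measurements.
toy = pd.DataFrame({"length": [2.0, 3.0], "width": [0.5, 1.0]})

# Multiplying two Series pairs values up row by row.
toy["area"] = toy["length"] * toy["width"]

print(toy)
```

Assigning to a column name that does not exist yet creates the column.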

In [None]:
# YOUR CODE HERE
pass

In [None]:
assert iris_data.shape[1] > 6
assert "sepal_size" in iris_data.columns
assert "petal_size" in iris_data.columns
assert np.isclose(np.round(iris_data["petal_size"].sum()), 868.)
assert np.isclose(np.round(iris_data["sepal_size"].sum()), 2656.)

## Part D. Normalize sizes.

For some machine-learning algorithms, we need to normalize data. This is generally done so that different features are on the same scale.

In our case, sepal_size and petal_size are on different scales. Normalize them by subtracting the mean from each and dividing by the standard deviation. You can do this manually if you prefer, or you can use the scale function from sklearn.preprocessing.

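Note: If you do this manually, make sure to compute the standard deviation with 0 delta degrees of freedom (ddof=0), i.e. the population standard deviation. The Pandas .std() method defaults to ddof=1, so you need to pass ddof=0 explicitly.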

Since you need to perform this operation twice, you may consider writing a function to handle this, though that is not a requirement.

Save the new columns as: sepal_size_normalized, petal_size_normalized
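The manual normalization described above can be sketched on a toy Series as follows. The values are made up so that the mean is 5 and the population standard deviation is 2; treat this as an illustration of the arithmetic rather than the required solution.

```python
import numpy as np
import pandas as pd

# Made-up values: mean is 5.0, population standard deviation is 2.0.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Subtract the mean, then divide by the ddof=0 standard deviation.
z = (values - values.mean()) / values.std(ddof=0)

print(z.tolist())
# Pandas defaults to ddof=1, which gives a slightly larger value than ddof=0.
print(values.std(), values.std(ddof=0))
```

After this transformation the result has mean 0 and a ddof=0 standard deviation of 1.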

In [None]:
# YOUR CODE HERE
pass

In [None]:
assert np.isclose(iris_data["petal_size_normalized"].mean(), 0)
assert np.isclose(iris_data["petal_size_normalized"].std(ddof=0), 1)
assert np.isclose(iris_data["sepal_size_normalized"].mean(), 0)
assert np.isclose(iris_data["sepal_size_normalized"].std(ddof=0), 1)

## Part E. Add a boolean column.

We are specifically interested in whether a given flower is from the setosa species.

Add a column named "is_setosa" to the data frame that is True if the flower is a setosa and False otherwise.
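As a generic illustration, comparing a column to a value yields a boolean Series that can be stored as a new column. The sketch below uses a toy frame with hypothetical names `kind` and `is_a`.

```python
import pandas as pd

# Toy frame with a made-up categorical column.
toy = pd.DataFrame({"kind": ["a", "b", "a"]})

# The comparison is evaluated row by row and produces a bool Series.
toy["is_a"] = toy["kind"] == "a"

print(toy)
print(toy["is_a"].dtype)
```

Because True counts as 1, summing the column gives the number of matching rows.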

In [None]:
# YOUR CODE HERE
pass

In [None]:
assert sum(iris_data["is_setosa"]) == 50
assert iris_data.loc[iris_data["Species"] == "Iris-setosa", "is_setosa"].all()
assert iris_data["is_setosa"].dtype == "bool"