# Week 2

## Data manipulation and visualization with Python

### What are the keystones of data analysis?

* Formulating Questions

* Data wrangling: gather, access, clean, transform

* Exploratory data analysis

* Making conclusions and predictions: modeling, machine learning

* Reporting and communication

### Working with data in Python

Data types and data structures: containers to hold, access, and modify data efficiently.

Our options:
- Python built-in data types
- Python built-in data structures and functions
- Python packages that extend the built-in capabilities (from the standard library or installed with pip)
- Third-party (non-Python) tools that we can run from Python

### Built-in data types, structures and functions

* int, float, complex, bool
* dict, list, set and frozenset, tuple, str, bytes
* https://docs.python.org/3/library/stdtypes.html
* https://docs.python.org/3/library/datatypes.html
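
As a rough illustration of how the built-in containers listed above differ, here is a minimal sketch. The sample values and variable names are hypothetical and chosen only for demonstration.

```python
# Hypothetical sample values, chosen only for illustration.
point = (3.0, 4.0)                   # tuple: fixed-size, immutable
readings = [7.4, 7.8, 7.8, 11.2]     # list: ordered, mutable, allows duplicates
unique_readings = set(readings)      # set: unordered, duplicates removed
wine = {"pH": 3.51, "quality": 5}    # dict: key -> value lookup

readings.append(7.9)                 # lists grow in place
wine["alcohol"] = 9.4                # dicts accept new keys

print(len(readings), len(unique_readings))
print(sorted(wine))                  # sorted() over a dict yields its keys
```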


### List

In [None]:
a = [1,2,3,4,5]
print(a[0])
print(a[-1])

In [None]:
print(a[:])
print(a[:3], a[3:])
print(a[0:3], a[3:5])

In [None]:
print(a[slice(0, 3)])
print(type(slice(0, 3)))

In [None]:
print(a[::2])

In [None]:
del a[::2]
print(a)

In [None]:
from math import sin
print([sin(x) for x in range(4)])

In [None]:
list(map(sin, range(4)))

In [None]:
a.append(6)
a

In [None]:
a.extend(a)
a

### Arrays in Python

* list
* array 1-dimensional https://docs.python.org/3/library/array.html
* numpy supports multidimensional arrays

In [None]:
from array import array
from statistics import mean

al = array('l', [1, 2, 3, 4, 5])
print(al)
print(sum(al), mean(al))

In [None]:
ad = array('d', [1.0, 2.0, 3.14])
print(ad)
print(sum(ad), mean(ad))

In [None]:
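# However, array.array does not do element-wise arithmetic:
print(2 * ad)   # repeats the array twice

print(ad + ad)  # concatenates the two arrays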
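
With `array.array`, `*` and `+` behave as they do for lists (repetition and concatenation). If element-wise results are wanted without numpy, one possible approach is to build a new array from a generator. This is only a sketch; the names `repeated`, `doubled`, and `summed` are illustrative.

```python
from array import array

ad = array('d', [1.0, 2.0, 3.14])

repeated = 2 * ad                                      # repetition: length doubles
doubled = array('d', (2 * x for x in ad))              # element-wise doubling
summed = array('d', (x + y for x, y in zip(ad, ad)))   # element-wise sum

print(len(repeated), len(doubled))
print(doubled == summed)
```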

### Other numeric issues

In [None]:
0.1 + 0.1 + 0.1 == 0.3

In [None]:
from decimal import Decimal

Decimal('0.1') + Decimal('0.1') + Decimal('0.1') == Decimal('0.3')

In [None]:
import numpy as np

np.isclose(0.1 + 0.1 + 0.1, 0.3)
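
The same tolerance-based comparison is available in the standard library. The sketch below uses `math.isclose` and, as an exact alternative, `fractions.Fraction`; the variable names are illustrative.

```python
import math
from fractions import Fraction

total = 0.1 + 0.1 + 0.1

exact_equal = total == 0.3                 # exact comparison of binary floats
close_enough = math.isclose(total, 0.3)    # comparison within a relative tolerance
rational_equal = Fraction(1, 10) * 3 == Fraction(3, 10)  # exact rational arithmetic

print(exact_equal, close_enough, rational_equal)
```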

### Overview of packages

* __numpy__  - N-dimensional arrays and algebra
* __scipy__ - scientific computing (uses numpy)
* __pandas__  - data structures & analysis (uses numpy)
* __matplotlib__, __seaborn__ - plotting
* __jupyter__ - notebook, integration with pandas and plotting
* __scikit-learn (sklearn)__  - Machine learning algorithms (uses numpy and scipy)
* statistics - standard package - basic descriptive statistics
* statsmodels - statistical modeling, hypothesis testing

Make sure you know where to find documentation for these packages

### Datasets:

* https://catalog.data.gov/dataset
* http://mlr.cs.umass.edu/ml/datasets.html
* https://www.kaggle.com/datasets
* https://opendata.socrata.com


__Tabular data__: database tables, Excel, CSV


### Accessing data

* Example dataset: "red wine quality"
* Download CSV from https://archive.ics.uci.edu/ml/datasets/wine+quality

In [None]:
winequality_file = "winequality-red.csv"

In [None]:
from itertools import islice

with open(winequality_file) as f:
    for line in islice(f, 0, 5):
        print(line.split(","))

# exclude header and line endings, convert to float
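
One way to handle the header, the line endings, and the float conversion mentioned in the comment above is sketched here. It assumes a comma-separated file; the in-memory `sample` with two rows is a hypothetical stand-in for the real file so that the sketch is self-contained.

```python
import io

# Hypothetical two-row sample standing in for the real CSV file.
sample = io.StringIO("fixed acidity,pH,quality\n7.4,3.51,5\n7.8,3.2,5\n")

# The first line is the header: strip the line ending, then split on commas.
header = next(sample).rstrip("\n").split(",")

# The remaining lines are data: strip, split, and convert each value to float.
rows = [[float(v) for v in line.rstrip("\n").split(",")] for line in sample]

print(header)
print(rows)
```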

In [None]:
# Python csv.reader
import csv

with open(winequality_file) as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in islice(reader, 0, 5):
        print(', '.join(row))

In [None]:
# Python csv.DictReader
import csv

with open(winequality_file) as csvfile:
    reader = csv.DictReader(csvfile, delimiter=',')
    for row in islice(reader, 0, 3):
        print(', '.join(row.values()))
        
print(row.keys())
print(row['pH'])
# Limitation: each row is a dict of strings, which is not a convenient structure for data analysis

In [None]:
# Numpy: read CSV
from numpy import genfromtxt
wine_np = genfromtxt(winequality_file, delimiter=',', skip_header=1)
print(wine_np)
print(wine_np.shape)

In [None]:
# Pandas: read CSV
from pandas import read_csv
wine_df = read_csv(winequality_file, sep=',',header=0)
print(wine_df.shape)
wine_df.head()

In [None]:
# numpy array operations:
print(wine_np[0,0])  # first element
print(wine_np[0,...]) # row
print(wine_np[...,0]) # column

pH = wine_np[...,8]
print(pH.min(), pH.mean(), pH.max())

In [None]:
# filtering
print(wine_np[pH < 3.2, ...])
print(wine_np[pH < 3.2, ...].shape)

In [None]:
import numpy as np

zeros_array = np.zeros((3,4,2))  # a 3x4x2 array filled with zeros
zeros_array

In [None]:
np.random.rand(3,4,2)

In [None]:
# Seeding the random number generator makes the random samples below reproducible:
# the same seed always yields the same samples.

np.random.seed?

# In IPython/Jupyter, a trailing ? shows the docstring help

In [None]:
help(np.random.seed)

In [None]:
np.random.seed(0)
print(np.random.rand(3))

np.random.seed(0)
print(np.random.rand(3))

np.random.seed(1000)
print(np.random.rand(3))

In [None]:
wine_np[0:3, 8]  # pH values for the first 3 wines in the array

In [None]:
# changing values
wine_np[0:3, 8] = [3., 3., 3.]
wine_np[0:3, 8]

In [None]:
wine_np.dtype.name

# convert with wine_np.astype(int), which returns a new array

In [None]:
print(pH)
print(2*pH)
print(pH + pH)
print(np.exp(pH))

In [None]:
M = wine_np[0:2, 0:3]
print(M)
print()
print(M.T)

In [None]:
M.dot(M.T) # matrix multiplication

In [None]:
M.T.dot(M) # matrix multiplication

In [None]:
x = np.array([3,3,3])
M*x  # element-wise: each row of M is multiplied by x (a broadcasting operation)

### Broadcasting in numpy:
 - Dimensions are compared pairwise, starting from the last (trailing) dimension of each array.
 - If the two lengths are equal, or one of them is 1, then we keep going.
 - If the two lengths are not equal and neither of them is 1, then there's an error.
 - Continue checking dimensions until the array with fewer dimensions runs out; its missing dimensions are treated as length 1.
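
The rules above can be tried on small arrays. This is a minimal sketch with hypothetical values; `M`, `x`, and `col` here are illustrative and independent of the wine data.

```python
import numpy as np

M = np.arange(6.0).reshape(2, 3)   # shape (2, 3): [[0, 1, 2], [3, 4, 5]]
x = np.array([3.0, 3.0, 3.0])      # shape (3,)
col = np.array([[10.0], [20.0]])   # shape (2, 1)

scaled = M * x      # trailing lengths 3 and 3 are equal
shifted = M + col   # trailing lengths 3 and 1: col is stretched across columns

failed = False
try:
    M * np.array([1.0, 2.0])       # trailing lengths 3 and 2: incompatible
except ValueError as err:
    failed = True
    print("broadcast error:", err)

print(scaled.shape, shifted.shape)
```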

In [None]:
M.dot(x) # matrix multiplication

In [None]:
# but not this:
x.dot(M)  # try to fix it
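
`x.dot(M)` fails because a length-3 vector cannot be multiplied with a matrix that has 2 rows. One possible fix, sketched here on a hypothetical 2x3 matrix, is to transpose `M` so that the inner dimensions match.

```python
import numpy as np

M = np.arange(6.0).reshape(2, 3)   # hypothetical 2x3 matrix
x = np.array([3.0, 3.0, 3.0])      # length-3 vector

right = M.dot(x)     # (2, 3) . (3,)   -> shape (2,)
left = x.dot(M.T)    # (3,)  . (3, 2)  -> shape (2,)

print(right, left)
```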

In [None]:
print(type(wine_np)) # ndarray object
print(wine_np.shape) # note, shape is attribute
wine_np.sum()  # sum() is method

In [None]:
wine_np.sum(axis=0)  # axis 0 (rows) is collapsed: one total per column

In [None]:
wine_np.sum(axis=1)  # axis 1 (columns) is collapsed: one total per row
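
The effect of the `axis` argument is easier to see on a small array. A minimal sketch with hypothetical values:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])          # shape (2, 3)

col_sums = A.sum(axis=0)           # collapse rows: one value per column
row_sums = A.sum(axis=1)           # collapse columns: one value per row
total = A.sum()                    # collapse everything into a scalar

print(col_sums, row_sums, total)
```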

In [None]:
# Pandas DataFrame (similar to a data.frame in R)
wine_df.info()

In [None]:
wine_df.dtypes

In [None]:
wine_df.describe()

In [None]:
wine_df['pH'].head()

In [None]:
wine_df['pH'].head(10)

In [None]:
wine_df['pH'][:5]

In [None]:
wine_df[:5]

In [None]:
wine_df[:5]['pH']

In [None]:
wine_df[:5][['chlorides', 'pH']]

In [None]:
wine_df[['chlorides', 'pH']][:5]

In [None]:
wine_df['quality'].unique()

In [None]:
wine_df['quality'].nunique()

In [None]:
# a histogram by quality
wine_df['quality'].value_counts()

In [None]:
%matplotlib inline
wine_df['quality'].value_counts().plot(kind='bar')

In [None]:
wine_df['quality'] == 3
# This is a big Series of True and False values, one for each row in our dataframe. When we index our dataframe with it, we get just the rows where the condition is True.

In [None]:
# You can also combine more than one condition with the & operator like this:
bad_wine = wine_df['quality'] == 3
acidic_wine = wine_df['pH'] < 3.3

wine_df[bad_wine & acidic_wine]
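
Boolean masks can also be combined with `|` (or) and negated with `~`. The sketch below uses a small hypothetical DataFrame rather than the wine data, so the column values are made up.

```python
import pandas as pd

# Hypothetical four-row frame, for illustration only.
df = pd.DataFrame({"pH": [3.51, 3.20, 3.26, 3.16],
                   "quality": [5, 3, 3, 6]})

bad = df["quality"] == 3
acidic = df["pH"] < 3.25

both = df[bad & acidic]      # rows where both conditions hold
either = df[bad | acidic]    # rows where at least one condition holds
not_bad = df[~bad]           # rows where the first condition does not hold

print(both)
print(either)
print(not_bad)
```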

In [None]:
# pandas columns (Series) are typically backed by numpy arrays
import pandas as pd

pd.Series([1,2,3])

In [None]:
pd.Series([1,2,3]).values

In [None]:
np.mean(wine_df['pH'].values)

In [None]:
# group by 
print(wine_df.groupby('quality'))

In [None]:
wine_df.groupby('quality')['alcohol'].mean()
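
A grouped column can be summarized with several statistics at once using `agg`. This is a sketch on a hypothetical five-row frame; the values are made up.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({"quality": [5, 5, 6, 6, 6],
                   "alcohol": [9.0, 10.0, 10.0, 11.0, 12.0]})

# One row per quality value, one column per statistic.
summary = df.groupby("quality")["alcohol"].agg(["mean", "min", "max", "count"])

print(summary)
```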

In [None]:
wine_sorted = wine_df.sort_values(['alcohol'], ascending=False)
wine_sorted.head()

In [None]:
wine_sorted.tail()

In [None]:
# loc gets rows (or columns) with particular labels from the index. Label slices include the end label.
wine_df.loc[544:545]

In [None]:
# iloc gets rows (or columns) at particular positions in the index (so it only takes integers). Position slices exclude the end position.
wine_df.iloc[544:545]
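
The difference between label-based and position-based slicing shows up most clearly when the index labels are not 0, 1, 2, and so on. A minimal sketch with a hypothetical index:

```python
import pandas as pd

# Hypothetical frame whose index labels start at 10.
df = pd.DataFrame({"pH": [3.5, 3.2, 3.3, 3.1]}, index=[10, 11, 12, 13])

by_label = df.loc[11:12]       # labels 11 and 12: the end label is included
by_position = df.iloc[1:2]     # position 1 only: the end position is excluded

print(by_label)
print(by_position)
```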

In [None]:
wine_df.loc[:3, 'pH']

In [None]:
wine_df.iloc[:3, 8]

In [None]:
wine_df.at[0, 'pH']

In [None]:
wine_df.iat[0, 8]

In [None]:
wine_df.at[0, 'pH'] = 3.50
wine_df.at[0, 'pH']

In [None]:
wine_df.at[0, 'body'] = 'full'
wine_df.at[1, 'body'] = 'light'
wine_df.head()

In [None]:
print(wine_df['body'].isna().head())
wine_df.loc[wine_df['body'].isna(), 'quality'] = 0
# wine_df.head()

In [None]:
# transpose
wine_df.T

In [None]:
wine_df.T.to_csv('transposed.csv', index=False)

In [None]:
pd.merge?
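
As a starting point for reading the `pd.merge` help, here is a small sketch that joins two hypothetical tables on a shared key. The frames, the `wine_id` column, and the values are made up for illustration.

```python
import pandas as pd

# Two hypothetical tables sharing the key column "wine_id".
wines = pd.DataFrame({"wine_id": [1, 2, 3], "pH": [3.5, 3.2, 3.3]})
labels = pd.DataFrame({"wine_id": [1, 3, 4], "body": ["full", "light", "full"]})

inner = pd.merge(wines, labels, on="wine_id", how="inner")  # keys present in both
left = pd.merge(wines, labels, on="wine_id", how="left")    # all keys from wines

print(inner)
print(left)
```

In the left join, the wine without a matching label gets a missing value in `body`.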

## Section 2: Visualization

In [None]:
# Given a numpy array we can create a pandas data frame

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn import datasets
iris = datasets.load_iris()
print(iris)

In [None]:
columns = ("sepal_length", "sepal_width", "petal_length", "petal_width", "species")

iris_df = pd.DataFrame(data= np.c_[iris["data"], iris["target"]], columns=columns)
iris_df["species_name"] = iris_df["species"].map({0.0: "Setosa", 1.0: "Versicolour", 2.0: "Virginica"})

iris_df.head()

In [None]:
# convention is to rename pyplot to plt
import matplotlib.pyplot as plt

# magic command to display matplotlib plots inline
%matplotlib inline

# Plot
plt.scatter(
    iris_df["sepal_length"], # X axis is the sepal length
    iris_df["sepal_width"],  # Y axis is the sepal width
    c=iris_df["species"]     # Color is the species
)

# create labels
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

# show
plt.show()

In [None]:
# Optional styling parameters (uncomment one at a time to try them)
plt.scatter(
    iris_df["sepal_length"], iris_df["sepal_width"],
    c=iris_df["species"],
    # cmap=plt.cm.Set1, # Set2, etc
    # alpha=0.5,
    # s=10,  # or s=iris_df["petal_length"] * 20
    # marker="x",
)

# name the labels, for clarity
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

# Remove ticks
#plt.xticks(())
#plt.yticks(())

plt.show()

In [None]:
# Histograms:  https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html

plt.hist(iris_df["sepal_width"])
plt.xlabel('Sepal width')
plt.ylabel('Count')
plt.show()

In [None]:
# Bar plots https://matplotlib.org/api/_as_gen/matplotlib.pyplot.bar.html

species = ("Setosa", "Versicolour", "Virginica")
y_pos = np.arange(len(species))
mean_widths = tuple(np.mean(iris_df.loc[iris_df["species_name"] == x, "sepal_width"]) for x in species)
 
plt.bar(y_pos, mean_widths)
plt.xticks(y_pos, species)
plt.ylabel('Mean Sepal Width')
plt.title('Sepal Widths Per Species')
 
plt.show()

In [None]:
plt.bar(y_pos, mean_widths, color="red")
# other options to try: color=("red", "yellow", "green"), edgecolor=("blue", "green", "black"),
# bottom=0, align='edge', width=(.1, .2, .3), alpha=0.5
# for horizontal bars: barh / yticks
plt.xticks(y_pos, species)
plt.ylabel('Mean Sepal Width')
plt.title('Sepal Widths Per Species')
 
plt.show()

In [None]:
# Box plots:
# https://matplotlib.org/api/_as_gen/matplotlib.pyplot.boxplot.html

data_to_plot = list(iris_df.loc[iris_df["species_name"] == x, "sepal_width"] for x in species)
plt.boxplot(data_to_plot)
plt.xticks((1, 2, 3), species)
plt.show()

In [None]:
plt.boxplot(data_to_plot)
# other options to try: notch=True, sym="x", vert=False, widths=.9, patch_artist=True, labels=species, showmeans=True
# lie with medians, set outliers, etc
plt.show()

## Generic plot commands and subplots
* Doc: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.html

### Figure
The canvas we are using.
### Subplot
Subsection of figure.
### Axes
Where we are plotting.
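
To make the three terms concrete, the sketch below creates one figure holding two axes with `plt.subplots` and draws on each of them. The data and titles are illustrative. The figure is written to an in-memory buffer so that the sketch also runs without a display; in a notebook with inline plotting, the figure is shown under the cell.

```python
import io

import matplotlib.pyplot as plt
import numpy as np

t = np.arange(0.0, 5.0, 0.2)

# One figure (the canvas) holding a 1-row, 2-column grid of axes.
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))

# Each axes object is plotted on independently.
ax_left.plot(t, t, "r--")
ax_left.set_title("linear")
ax_right.plot(t, t**2, "bs")
ax_right.set_title("quadratic")

# Render the whole figure as PNG bytes.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```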

### plt.plot
Generic plot function that accepts x, y, and styling parameters.

In [None]:
# evenly sampled values at 0.2 intervals
t = np.arange(0., 5., 0.2)

# default line style (uncomment the options to change it)
plt.plot(
    t, t, #linewidth=10, color="green",
)
plt.show()

In [None]:
# evenly sampled values at 0.2 intervals
t = np.arange(0., 5., 0.2)

# red dots
plt.plot(
    t, t, 'ro',
)
plt.show()

In [None]:
# red dashes and blue squares
plt.plot(
    t, t, 'r--',
)

plt.plot(
    t, t**2, 'bs',
)

plt.show()

In [None]:
# subplots: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplots.html
fig = plt.figure(figsize=(5, 5))
ax1 = fig.add_subplot(2, 1, 1)  # top panel of a 2-row, 1-column grid
ax2 = fig.add_subplot(2, 1, 2)  # bottom panel of the same grid

ax1.plot(
    t, t, 'r--',
)

ax2.plot(
    t, t**2, 'bs',
)

plt.show()

### Trend lines

In [None]:
z = np.polyfit(iris_df["petal_length"], iris_df["petal_width"], 1)
p = np.poly1d(z)
print(z)
print(p)
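
To see what `np.polyfit` and `np.poly1d` return, it can help to fit data whose answer is known in advance. The sketch below uses hypothetical points that lie exactly on the line y = 2x + 1.

```python
import numpy as np

# Hypothetical points on the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

coeffs = np.polyfit(x, y, 1)   # degree 1; coefficients come highest power first
line = np.poly1d(coeffs)       # callable polynomial built from the coefficients
predicted = line(4.0)          # evaluate the fitted line at a new x value

print(coeffs, predicted)
```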

In [None]:
# scatter plot with a linear trend line
plt.scatter(iris_df["petal_length"], iris_df["petal_width"])
z = np.polyfit(iris_df["petal_length"], iris_df["petal_width"], 1)
p = np.poly1d(z)
plt.plot(iris_df["petal_length"], p(iris_df["petal_length"]),"r--")
# name the labels, for clarity
plt.xlabel('Petal length')
plt.ylabel('Petal width')

In [None]:
# scatter plot with a linear trend line, saved to a file
plt.scatter(iris_df["petal_length"], iris_df["petal_width"])
z = np.polyfit(iris_df["petal_length"], iris_df["petal_width"], 1)
p = np.poly1d(z)
plt.plot(iris_df["petal_length"], p(iris_df["petal_length"]),"r--")
# name the labels, for clarity
plt.xlabel('Petal length')
plt.ylabel('Petal width')
plt.savefig("x.png")

### Links
* https://matplotlib.org/index.html
* https://www.python-course.eu/matplotlib.php
* https://matplotlib.org/gallery.html

### Styles and other packages
* Matplotlib styles
* Pandas plotting based on matplotlib
* Seaborn
* Bokeh

### Pandas plotting based on matplotlib

* https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html



In [None]:
plt.style.use("ggplot")
plt.figure()
ax = iris_df.plot(x="petal_length", y="petal_width", style="ro", legend=False)
ax.set_xlim(0.5, 8.0)
ax.set_ylim(0, 5.0)
plt.show()

In [None]:
plt.style.use("classic")
plt.figure()
ax = iris_df.hist(column="sepal_length", color="red", alpha=0.5)
plt.title("Sepal Lengths")
plt.show()

### Seaborn
* https://seaborn.pydata.org/



In [None]:
import seaborn as sns
sns.lmplot(x="petal_length", y="petal_width", data=iris_df) # , hue="species", fit_reg=False)


In [None]:
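# import seaborn as sns
# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a comparable plot
sns.histplot(iris_df["sepal_length"], kde=True)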