Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE".

# Week4 Assignments - Functions, NumPy and Pandas


This assignment has four components. Each of the first three components receives one point if all the tests pass.
The last component, on Pandas, may require more effort. It consists of three subtasks, each worth 1 point.

All the exercises are designed so that the solutions will need only one or a few lines of code.

Some concepts may be new to you and may require digging into the Python, NumPy, or Pandas documentation; links are provided.

Do not hesitate to contact the instructors and TA via the #week4 channel on Slack if you get stuck. Join the channel first by clicking on Channels.

## Part A. Create a missing function (1 point)

In this exercise you need to create a function __missing_link(x)__ that is passed to another function as an argument in order to perform a calculation.

We know the final result (see the assert statement), but we do not know the intermediate calculation leading to that result.

Read about Python built-in functions __all()__ and __zip()__
https://docs.python.org/3.3/library/functions.html

and about the iterators and generators here:
https://docs.python.org/3.3/library/stdtypes.html#typeiter
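
As a minimal sketch of how these pieces fit together (the names __squares__, __expected__ and __pairs_match__ are illustrative and not part of the assignment): a generator produces its items lazily, __zip()__ pairs items from two iterables and stops at the shorter one, and __all()__ checks that every pair satisfies a condition.

```python
# A generator expression: values are produced lazily, one at a time.
squares = (n * n for n in range(4))
expected = [0, 1, 4, 9]

# zip() pairs items positionally; all() is True only if every comparison holds.
pairs_match = all(a == b for a, b in zip(squares, expected))
print(pairs_match)

# A generator can be consumed only once, so it is exhausted at this point.
leftover = list(squares)
print(leftover)
```

Because a generator is single-use, convert it to a list first if you want to inspect the values more than once.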

In [None]:
def calculate(func, it):
    """
    Performs calculation by applying *func* to each item of iterator *it*
    Returns a generator as a result.
    """
    return (2 * func(a) for a in it)


In [None]:
def missing_link(x):
    """Define a function that will be passed to calculate() as an argument"""
    # YOUR CODE HERE
    raise NotImplementedError()

## You can check the result of the missing_link() function and of calculate() if you wish:
# print(list(map(missing_link, range(5))))
# print(list(calculate(missing_link, range(5))))

In [None]:
_observed_results = calculate(missing_link, range(7))
_expected_results = [0, 2, 8, 18, 32, 50, 72]

assert all(a == b for a, b in zip(_observed_results, _expected_results))

## Part B. Create a filter function (1 point)

In this exercise you need to create a filter function __filter_DNA(c)__ that accepts one character as an argument and returns True if it belongs to the DNA alphabet "ACGT" and False otherwise. The function should be case-insensitive, i.e. both "A" and "a" are valid.

__filter_DNA(c)__ will be applied to a string in order to exclude all characters not belonging to the DNA alphabet.

Read more about __filter()__ -- a Python built-in function https://docs.python.org/3/library/functions.html#filter
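
For illustration only, here is how __filter()__ behaves with an unrelated predicate. The function __is_digit(c)__ and the sample string are hypothetical and are not part of the assignment.

```python
def is_digit(c):
    """Hypothetical predicate: True if the character is a decimal digit."""
    return c in "0123456789"

# filter() keeps only the items for which the predicate returns True.
# It returns a lazy iterator, so join it (or call list()) to see the result.
digits = "".join(filter(is_digit, "a1b22c333"))
print(digits)  # 122333
```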


In [None]:
def filter_DNA(c):
    """
    Accepts one character as an argument
    Returns True if it belongs to the DNA alphabet "ACGT" and False otherwise
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
assert ", ".join(filter(filter_DNA, "ACGT")) == "A, C, G, T"
assert ", ".join(filter(filter_DNA, "acgt")) == "a, c, g, t"
assert ", ".join(filter(filter_DNA, "#*UEOHaSDNKcSDPUgDNBH#tSBDHe")) == "a, c, g, t"
assert ", ".join(filter(filter_DNA, "aTGXXAxGxCXT")) == "a, T, G, A, G, C, T"

## Part C. NumPy (1 point)

Define __x__ as a subdivision of the interval from -4 PI to 4 PI into 32 equal parts, i.e. with a PI/4 step. Including both endpoints, that should give 33 points.
Using NumPy, calculate __cos()__ and __sin()__, find the values of __x__ where __cos(x)__ is equal to __sin(x)__, and store these values in the variable __y__. Use NumPy vector operations.

Use __np.pi__ constant and __np.linspace()__ function: 
https://docs.scipy.org/doc/numpy/reference/generated/numpy.linspace.html

Note that because of the way floating-point numbers are stored in memory, exact equality comparisons are unreliable. Use __np.isclose()__ instead, which allows some room for floating-point error.
https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.isclose.html
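
A small sketch of the difference between exact and tolerant comparison. The arrays __a__ and __b__ are made-up values, unrelated to the exercise.

```python
import numpy as np

a = np.array([0.1 + 0.2, 0.5])
b = np.array([0.3, 0.5])

# Exact comparison fails for the first pair because of rounding error.
exact = a == b
print(exact)

# np.isclose() compares element-wise within a small tolerance.
close = np.isclose(a, b)
print(close)

# A boolean array can be used as a mask to select matching elements.
selected = a[close]
print(selected.shape[0])  # 2
```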

This plot may be helpful:
http://www.wolframalpha.com/input/?i=plot+sinx+and+cosx+from+-4pi+to+4pi

In [None]:
import numpy as np

# define x and y variables

# YOUR CODE HERE
raise NotImplementedError()


In [None]:
assert x.shape[0] == 33
assert -4*np.pi in x
assert 0.0 in x
assert 4*np.pi in x

assert y.shape[0] == 8
assert np.all(np.isclose(y/np.pi, np.array([-3.75, -2.75, -1.75, -0.75,  0.25,  1.25,  2.25,  3.25])))

## Part D. Working with Pandas dataframes (3 points)

We will explore FBI reports on gun checks provided by the National Instant Criminal Background Check System (NICS)
https://www.fbi.gov/services/cjis/nics

Before ringing up the sale, cashiers call in a check to the FBI or to other designated agencies to ensure that each customer does not have a criminal record or isn't otherwise ineligible to make a purchase. More than 230 million such checks have been made, leading to more than 1.3 million denials.

NICS and background checks are a hot topic, and it is important to be able to do some basic fact-checking using the data available. https://www.washingtonpost.com/news/fact-checker/wp/2018/02/23/fact-checking-trump-nra-claims-on-gun-background-checks/?utm_term=.3e0284ad3774

The FBI NICS provides the data as PDF reports, which is a poor way to distribute data.
A community-developed parser extracts the data from the PDF files. The parsed dataset that we will be using is available here: 
https://github.com/BuzzFeedNews/nics-firearm-background-checks/blob/master/README.md

Note that the number of background checks cannot be directly interpreted as the number of guns sold because the actual sale protocols vary from state to state.

In [None]:
import pandas as pd

# NICS parsed dataset url
url = "https://github.com/BuzzFeedNews/nics-firearm-background-checks/blob/master/data/nics-firearm-background-checks.csv?raw=true"
guns = pd.read_csv(url)


In [None]:
# Use .head(), .info(), and .describe() to explore the dataset

### Part D. Subtask 1 (1 point)

First, use __pd.to_datetime()__ with the argument __yearfirst=True__ to convert the column __"month"__ to a Pandas Series of datetime objects. Then add a new column __"year"__ to the __guns__ dataframe that holds the year of each converted date.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html

You can access Python __datetime.date__ objects via the __.dt__ property of Pandas Series:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.dt.date.html

Look up the attributes of the __datetime.date__ class; we will need the attribute __.year__
https://docs.python.org/3/library/datetime.html
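
A minimal sketch of the __.dt__ accessor on a toy Series. The strings below are hypothetical and assume a "YYYY-MM" layout; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical date strings in "YYYY-MM" form.
months = pd.Series(["2001-03", "2001-04", "2002-01"])
dates = pd.to_datetime(months, yearfirst=True)

# The .dt accessor exposes date components for the whole Series at once.
month_numbers = dates.dt.month.tolist()
year_numbers = dates.dt.year.tolist()
print(month_numbers)  # [3, 4, 1]
print(year_numbers)   # [2001, 2001, 2002]
```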


In [None]:
# YOUR CODE HERE
raise NotImplementedError()


In [None]:
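# Parentheses are needed; otherwise the comparison becomes the assert message and is never checked.
assert (guns['year'].min(), guns['year'].max()) == (1998, 2018)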

### Part D. Subtask 2 (1 point)

Group the __guns__ dataframe by year and sum up the __totals__ column (all states together). Use the variables
__totals_2000__ and __totals_2017__ to store the corresponding results.

You will need https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html
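
As an illustration of the general pattern, here is __groupby()__ followed by a sum on made-up data. The dataframe __sales__ and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "units": [10, 5, 7, 3],
})

# Group rows by key, select a column, and aggregate within each group.
units_by_region = sales.groupby("region")["units"].sum()

# The result is a Series indexed by the group key.
print(units_by_region["north"])  # 17
print(units_by_region["south"])  # 8
```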

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert totals_2000 == 8427096
assert totals_2017 == 24955919

### Part D. Subtask 3 (1 point)

Group data by state (regardless of year) and calculate the mean value of __long_gun__ and __handgun__ checks separately for each state. Calculate the number of states that had more long gun background checks on average over the years than handgun checks. Calculate the number of states with more handgun checks. Store these results in __states_with_more_long_guns__ and __states_with_more_handguns__ variables, respectively.

Hint: Use vector operations. No for loops are needed. A result of comparison of two vectors is a vector of booleans. You can sum up the vector of booleans to calculate the number of True values in it.
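
A short sketch of counting True values in a comparison result, using two made-up Series __a__ and __b__:

```python
import pandas as pd

a = pd.Series([3, 8, 5, 1])
b = pd.Series([4, 2, 5, 0])

# Element-wise comparison yields a boolean Series.
a_greater = a > b

# True counts as 1 and False as 0, so sum() gives the number of True values.
count_a_greater = int(a_greater.sum())
count_b_greater = int((a < b).sum())
print(count_a_greater)  # 2
print(count_b_greater)  # 1
```

Note that ties (here the pair 5 and 5) are counted in neither total.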

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert states_with_more_long_guns == 37
assert states_with_more_handguns == 18