Introduction to Data Analysis in Python

Filtering data

In the last section we looked at how to act on entire columns at once. For example when we did:

tips["total_bill"] * 100

it applied the multiplication to every row, multiplying each number by 100.

Sometimes we don't want to have to deal with entire columns at once, we might only want to grab a subset of the data and look in just that part. For example, with the tips data, we might think that the day of the week will affect the data so we just want to grab the data for Saturdays.

In Pandas there are two steps to asking a question like this.

  1. create a filter which describes the question you want to ask
  2. apply that filter to the data to get just the bits you are interested in

You create a filter by performing some operation on your DataFrame or a column within it. To ask about only those rows which refer to Saturday, you grab the day column and compare it to "Sat":

In [1]:
import pandas as pd
tips = pd.read_csv("https://milliams.com/courses/data_analysis_python/tips.csv")
In [2]:
tips["day"] == "Sat"
Out[2]:
0      False
1      False
2      False
3      False
4      False
       ...  
239     True
240     True
241     True
242     True
243    False
Name: day, Length: 244, dtype: bool

This has created a filter object (sometimes called a mask or a boolean array) which has True set for the rows where the day is Saturday and False elsewhere.

We could save this filter as a variable:

In [3]:
sat_filter = tips["day"] == "Sat"

We can use this to filter the DataFrame as a whole. tips["day"] == "Sat" has returned a Series containing booleans. Passing it back into tips as an indexing operation will use it to filter based on the day column, only keeping those rows which contained True in the filter:

In [4]:
tips[sat_filter]
Out[4]:
total_bill tip day time size
19 20.65 2.34 Sat Dinner 3
20 17.92 2.86 Sat Dinner 2
21 20.29 1.92 Sat Dinner 2
22 15.77 1.56 Sat Dinner 2
23 39.42 5.31 Sat Dinner 4
... ... ... ... ... ...
238 35.83 3.27 Sat Dinner 3
239 29.03 4.14 Sat Dinner 3
240 27.18 1.40 Sat Dinner 2
241 22.67 1.40 Sat Dinner 2
242 17.82 1.22 Sat Dinner 2

87 rows × 5 columns

Notice that it now says that the table only has 87 rows, down from 244. However, the index has been maintained. This is because the row labels are connected to the row, they're not just row numbers.

It is more common to do this in one step, rather than creating and naming a filter object. So the code becomes:

In [5]:
tips[tips["day"] == "Sat"]
Out[5]:
total_bill tip day time size
19 20.65 2.34 Sat Dinner 3
20 17.92 2.86 Sat Dinner 2
21 20.29 1.92 Sat Dinner 2
22 15.77 1.56 Sat Dinner 2
23 39.42 5.31 Sat Dinner 4
... ... ... ... ... ...
238 35.83 3.27 Sat Dinner 3
239 29.03 4.14 Sat Dinner 3
240 27.18 1.40 Sat Dinner 2
241 22.67 1.40 Sat Dinner 2
242 17.82 1.22 Sat Dinner 2

87 rows × 5 columns

This has given us back our subset of data as another DataFrame which can used in exactly the same way as the previous one (further filtering, summarising etc.).

Exercise 1

  • Select the data for only Thursdays.
  • Calculate the mean of the tip column for Thursdays
  • Compare this with the mean of the tip column for Saturdays

answer

Other filters

As well as filtering with the == operator (which only checks for exact matches), you can do other types of comparisons. Any of the standard Python comparisons will work (i.e. ==, !=, <, <=, >, >=).

To grab only the rows where the total bill is less than £8 we can use <:

In [6]:
tips[tips["total_bill"] < 8]
Out[6]:
total_bill tip day time size
67 3.07 0.70 Sat Dinner 1
92 5.75 0.70 Fri Dinner 2
111 7.25 0.70 Sat Dinner 1
149 7.51 1.40 Thur Lunch 2
172 7.25 3.60 Sun Dinner 2
195 7.56 1.01 Thur Lunch 2
218 7.74 1.01 Sat Dinner 2

Exercise 2

Filter the data to only include parties of 5 or more people.

answer

Combining filters

If you want to apply multiple filters, for example to select only "Saturdays with small total bills" you can do it in one of two different ways. Either split the question into multiple steps, or ask it all at once.

Let's do it multiple steps first since we already have tools we need for that:

In [7]:
sat_tips = tips[tips["day"] == "Sat"]  # First grab the Saturday data and save it as a variable
sat_tips[sat_tips["total_bill"] < 8]  # Then act on the new DataFrame as use it as before
Out[7]:
total_bill tip day time size
67 3.07 0.70 Sat Dinner 1
111 7.25 0.70 Sat Dinner 1
218 7.74 1.01 Sat Dinner 2

Or, you can combine the questions together using the & operator with a syntax like:

df[(filter_1) & (filter_2)]

so in our case filter 1 is tips["day"] == "Sat" and filter 2 is tips["total_bill"] < 8 so it becomes:

In [8]:
tips[(tips["day"] == "Sat") & (tips["total_bill"] < 8)]
Out[8]:
total_bill tip day time size
67 3.07 0.70 Sat Dinner 1
111 7.25 0.70 Sat Dinner 1
218 7.74 1.01 Sat Dinner 2

If you want to do an "or" operation, then instead of & you can use |.

Exercise 3

Filter the data to only include parties of 4 or more people which happened at lunch time.

Hint: The size and time columns are what you want to use here.

answer

DataFrame indexing

When we use the square bracket syntax on a DataFrame directly there are a few different types of object that can be passed:

A single string
This will select a single column form the DataFrame, returning a Series object.
A list of strings
This will select those columns by name, returning a DataFrame.
A filter (a Series of True/False)
This will filter the table as a whole, returning a DataFrame with only the rows matching Trueincluded.

These are provided as shortcuts as they are the most common operations to do an a DataFrame. This is why some of them operate on columns and other on rows.

If you want to be explicit about which axis you are acting on, you can pass these same types of objects to the .loc[rows, columns] attribute with one argument per axis. This means that

tips[sat_filter]

is equivalent to

tips.loc[sat_filter]

and that

tips["size"]

is equivalent to

tips.loc[:, "size"]

The full set of rules for DataFrame.loc are in the documentation.