Introduction to Data Analysis in Python

Querying your data

Once your data is read in and available as a DataFrame, Pandas provides a whole suite of tools for extracting information from it.

Let's start by looking at some example data which contains information about the amounts that people at a restaurant paid and tipped for their meals:

In [1]:
import pandas as pd

tips = pd.read_csv("https://milliams.com/courses/data_analysis_python/tips.csv")
tips
Out[1]:
total_bill tip day time size
0 16.99 0.71 Sun Dinner 2
1 10.34 1.16 Sun Dinner 3
2 21.01 2.45 Sun Dinner 3
3 23.68 2.32 Sun Dinner 2
4 24.59 2.53 Sun Dinner 4
... ... ... ... ... ...
239 29.03 4.14 Sat Dinner 3
240 27.18 1.40 Sat Dinner 2
241 22.67 1.40 Sat Dinner 2
242 17.82 1.22 Sat Dinner 2
243 18.78 2.10 Thur Dinner 2

244 rows × 5 columns

The first thing that you usually want to be able to do is to pull apart the overall table to get at specific bits of data from inside.

When using lists and dicts in Python, the square-bracket syntax was used to fetch an item from the container. In Pandas we can use the same syntax but it's a much more powerful tool.

If you pass a single string to the square brackets of a DataFrame it will return to you just that one column:

In [2]:
tips["total_bill"]
Out[2]:
0      16.99
1      10.34
2      21.01
3      23.68
4      24.59
       ...  
239    29.03
240    27.18
241    22.67
242    17.82
243    18.78
Name: total_bill, Length: 244, dtype: float64

Accessing a column like this returns an object called a Series which is the second of the two main Pandas data types. Don't worry too much about these just yet but think of them as being a single column of the DataFrame, along with the index of the DataFrame.

If you pass a list of column names to the square brackets then you can grab out just those columns:

In [3]:
tips[["total_bill", "tip"]]
Out[3]:
total_bill tip
0 16.99 0.71
1 10.34 1.16
2 21.01 2.45
3 23.68 2.32
4 24.59 2.53
... ... ...
239 29.03 4.14
240 27.18 1.40
241 22.67 1.40
242 17.82 1.22
243 18.78 2.10

244 rows × 2 columns

In this case it gives you back another DataFrame, just with only the required columns present.

Getting rows

If you want to select a row from a DataFrame then you can use the .loc (short for "location") attribute which allows you to pass index values like:

In [4]:
tips.loc[2]
Out[4]:
total_bill     21.01
tip             2.45
day              Sun
time          Dinner
size               3
Name: 2, dtype: object

If you want to grab a single value from the table, you can follow the row label with the column name that you want:

In [5]:
tips.loc[2, "total_bill"]
Out[5]:
21.01

Exercise 1

The size column in the data is the number of people in the dining party. Extract this column from the DataFrame.

answer

Descriptive statistics

Now that we know how to refer to individual columns, we can start asking questions about the data therein. If you've worked with columns of data in Excel for example, you've probably come across the SUM() and AVERAGE() functions to summarise data. We can do the same thing in pandas by calling the sum() or mean() methods on a column:

In [6]:
tips["total_bill"].sum()
Out[6]:
4827.77
In [7]:
tips["total_bill"].mean()
Out[7]:
19.78594262295082

You can see a list of all the possible functions you can call in the documentation for Series. So for example, you can also ask for the maximum value from a column with the max() method.

In [8]:
tips["tip"].max()
Out[8]:
7.0

In some situations, you don't just want to get the value of the maximum, but rather to find out which row it came from. In cases like that there is the idxmax() method which give you the index label of the row with the maximum:

In [9]:
tips["total_bill"].idxmax()
Out[9]:
170

So we know that the value of the maximum bill was £7 and it was found in the row with the label 170.

You can then use this information with the .loc attribute to get the rest of the information for that row:

In [10]:
index_of_max_bill = tips["total_bill"].idxmax()
tips.loc[index_of_max_bill]
Out[10]:
total_bill     50.81
tip              7.0
day              Sat
time          Dinner
size               3
Name: 170, dtype: object

Exercise 2

Find the value of the tip that was paid for the smallest total bill.

Hint: Have a look at the documentation page for Series. There's a function which works like idxmax() but finds the minimum.

answer

Acting on columns

Functions like sum() and max() summarise down the column to a single value. In some situations we instead want to manipulate a column to create a new column.

For example, the data in the table is in British pounds. If we wanted to convert it into the number of pennies then we need to multiply each value by 100. In Pandas you can refer to an entire column and perform mathematical operations on it and it will apply the operation to each row:

In [11]:
tips["total_bill"] * 100
Out[11]:
0      1699.0
1      1034.0
2      2101.0
3      2368.0
4      2459.0
        ...  
239    2903.0
240    2718.0
241    2267.0
242    1782.0
243    1878.0
Name: total_bill, Length: 244, dtype: float64

The data in row 0 was previously 16.99 but the result here is 1699.0, and likewise for every other row.

You can do any mathematical operation that Python supports, such as +, - and /.

Combining columns

As well as operating on individual columns, you can combine together multiple columns. Any operation you do between two columns will be done row-wise, that is adding two columns will add together the two values from the first row of each, then the second row from each etc.

For example if we wanted to find out, for each entry in our table what the ratio between tip amount and total bill was, we could divide one column by the other:

In [12]:
tips["tip"] / tips["total_bill"]
Out[12]:
0      0.041789
1      0.112186
2      0.116611
3      0.097973
4      0.102887
         ...   
239    0.142611
240    0.051508
241    0.061756
242    0.068462
243    0.111821
Length: 244, dtype: float64

Of course, if we want the tip percentage so we need to multiply the value by 100:

In [13]:
(tips["tip"] / tips["total_bill"])*100
Out[13]:
0       4.178929
1      11.218569
2      11.661114
3       9.797297
4      10.288735
         ...    
239    14.261109
240     5.150846
241     6.175562
242     6.846240
243    11.182109
Length: 244, dtype: float64

It can get messy and hard-to-read doing too many things on one line, so it's a good idea to split each part of your calculation onto its own line, giving each step its own variable name along the way.

In [14]:
tip_fraction = tips["tip"] / tips["total_bill"]
tip_percent = tip_fraction*100
tip_percent
Out[14]:
0       4.178929
1      11.218569
2      11.661114
3       9.797297
4      10.288735
         ...    
239    14.261109
240     5.150846
241     6.175562
242     6.846240
243    11.182109
Length: 244, dtype: float64

Exercise 3

The total_bill column give the total amount for the entire dining party. Calculate the amount spent per person for each row in the DataFrame.

Extra: calculate the average and the standard deviation of this data. You might need to take a look at the documentation page for the Series type.

answer

Adding new columns

New columns can be added to a DataFrame by assigning them by index (as you would for a Python dict):

In [15]:
tips["percent_tip"] = (tips["tip"] / tips["total_bill"])*100
tips
Out[15]:
total_bill tip day time size percent_tip
0 16.99 0.71 Sun Dinner 2 4.178929
1 10.34 1.16 Sun Dinner 3 11.218569
2 21.01 2.45 Sun Dinner 3 11.661114
3 23.68 2.32 Sun Dinner 2 9.797297
4 24.59 2.53 Sun Dinner 4 10.288735
... ... ... ... ... ... ...
239 29.03 4.14 Sat Dinner 3 14.261109
240 27.18 1.40 Sat Dinner 2 5.150846
241 22.67 1.40 Sat Dinner 2 6.175562
242 17.82 1.22 Sat Dinner 2 6.846240
243 18.78 2.10 Thur Dinner 2 11.182109

244 rows × 6 columns

Exercise 4

Take the "bill per person" result you calculated in the last exercise and add it as a new column, bill_per_person, in the DataFrame.

answer