Introduction to Data Analysis in Python

DataFrame

While you can think of a Series as a one-dimensional list of data, pandas' DataFrame is a two-dimensional table of data, with rows and columns. You can think of each column in the table as being a Series.

In [1]:
from pandas import DataFrame

There are many ways of creating a DataFrame, but if you already have your data in Python, the simplest is to pass in a dictionary:

In [2]:
data = {
    "city": ["Paris", "Paris", "Paris", "Paris",
             "London", "London", "London", "London",
             "Rome", "Rome", "Rome", "Rome"],
    "year": [2001, 2008, 2009, 2010,
             2001, 2006, 2011, 2015,
             2001, 2006, 2009, 2012],
    "pop": [2.148, 2.211, 2.234, 2.244,
            7.322, 7.657, 8.174, 8.615,
            2.547, 2.627, 2.734, 2.627]
}
census = DataFrame(data)

This has created a DataFrame from the dictionary data. The keys of the dictionary will become the column headers and the dictionary values will be the values in each column. As with the Series, an index will be created automatically.

In [3]:
census
Out[3]:
      city  year    pop
0    Paris  2001  2.148
1    Paris  2008  2.211
2    Paris  2009  2.234
3    Paris  2010  2.244
4   London  2001  7.322
5   London  2006  7.657
6   London  2011  8.174
7   London  2015  8.615
8     Rome  2001  2.547
9     Rome  2006  2.627
10    Rome  2009  2.734
11    Rome  2012  2.627

If you only want a peek at the data, you can grab the first few rows with the head method:

In [4]:
census.head(3)
Out[4]:
    city  year    pop
0  Paris  2001  2.148
1  Paris  2008  2.211
2  Paris  2009  2.234
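
The mapping from dictionary keys to column headers, and the automatically created index, can be checked directly. The following is a minimal sketch using a three-row subset of the census data; the name census_small is only illustrative and is not used elsewhere in this tutorial.

```python
import pandas as pd

# A three-row subset of the census data (the 2001 rows).
data = {
    "city": ["Paris", "London", "Rome"],
    "year": [2001, 2001, 2001],
    "pop": [2.148, 7.322, 2.547],
}
census_small = pd.DataFrame(data)

columns = list(census_small.columns)  # the dictionary keys, in insertion order
index = list(census_small.index)      # the automatically created integer index
shape = census_small.shape            # (number of rows, number of columns)

print(columns)
print(index)
print(shape)
```

The shape attribute is a quick way to confirm that every list in the dictionary contributed one value per row.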

When we indexed a Series, we selected an element by its row label. A DataFrame, by contrast, indexes by column by default, so you can access any column directly by name:

In [5]:
census["city"]
Out[5]:
0      Paris
1      Paris
2      Paris
3      Paris
4     London
5     London
6     London
7     London
8       Rome
9       Rome
10      Rome
11      Rome
Name: city, dtype: object

Accessing a column like this returns a Series, which behaves in the same way as those we were using earlier. We can confirm this by checking its type:

In [6]:
type(census["city"])
Out[6]:
pandas.core.series.Series

Note that the output of census["city"] above has one additional part, Name: city. Pandas has remembered that this Series was created from the 'city' column of the DataFrame.
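
The remembered name is also available programmatically through the Series' name attribute. A small sketch, again using an illustrative three-row subset of the data (census_small is a hypothetical name):

```python
import pandas as pd

census_small = pd.DataFrame({
    "city": ["Paris", "London", "Rome"],
    "year": [2001, 2001, 2001],
    "pop": [2.148, 7.322, 2.547],
})

city = census_small["city"]

# Selecting a single column gives back a Series...
is_series = isinstance(city, pd.Series)
# ...whose name is the label of the column it came from.
city_name = city.name

print(is_series, city_name)
```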

Querying

We can start to ask questions of our data in the same way as we did with Series. For example, we can grab a column from the DataFrame and apply a comparison operation to it:

In [7]:
census["city"] == "Paris"
Out[7]:
0      True
1      True
2      True
3      True
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
Name: city, dtype: bool

This has created a new Series which is True where the city is Paris and False elsewhere.

We can use a boolean Series like this to filter the DataFrame as a whole. Passing census['city'] == 'Paris' back into census as an indexing operation keeps only the rows where the Series is True, that is, the rows whose 'city' value is Paris.

In [8]:
census[census["city"] == "Paris"]
Out[8]:
    city  year    pop
0  Paris  2001  2.148
1  Paris  2008  2.211
2  Paris  2009  2.234
3  Paris  2010  2.244

You can then carry on and select a single column from the filtered result:

In [9]:
census[census["city"] == "Paris"]["year"]
Out[9]:
0    2001
1    2008
2    2009
3    2010
Name: year, dtype: int64
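
Two variations on this pattern are worth knowing. Boolean Series can be combined with & (and) or | (or), with each comparison wrapped in parentheses, and .loc accepts a row filter and a column name in a single step. The sketch below uses the Paris and London rows of the census data; the variable names mask, recent_paris and paris_years are illustrative only.

```python
import pandas as pd

census = pd.DataFrame({
    "city": ["Paris", "Paris", "Paris", "Paris",
             "London", "London", "London", "London"],
    "year": [2001, 2008, 2009, 2010,
             2001, 2006, 2011, 2015],
    "pop": [2.148, 2.211, 2.234, 2.244,
            7.322, 7.657, 8.174, 8.615],
})

# Combine two conditions; the parentheses are required because
# & binds more tightly than == and >.
mask = (census["city"] == "Paris") & (census["year"] > 2008)
recent_paris = census[mask]

# Filter rows and pick a column in one indexing operation.
paris_years = census.loc[census["city"] == "Paris", "year"]

print(recent_paris)
print(paris_years)
```

The single-step .loc form selects the same values as filtering and then indexing by column, and it is generally the safer choice if you later want to assign to the selection.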

Getting rows

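If you want to select a row from a DataFrame, you can use the .loc attribute, which selects rows by their index label. The row comes back as a Series indexed by the column names, so a single value can then be looked up by column name: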

In [10]:
census.loc[2]
Out[10]:
city    Paris
year     2009
pop     2.234
Name: 2, dtype: object

In [11]:
census.loc[2]["city"]
Out[11]:
'Paris'
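
Because .loc works with index labels rather than positions, the two can differ once a DataFrame has been filtered. The sketch below illustrates this with the Paris and London rows of the census data, and also shows that .loc accepts a row label and a column name together; .iloc is the position-based counterpart. The variable names are illustrative only.

```python
import pandas as pd

census = pd.DataFrame({
    "city": ["Paris", "Paris", "Paris", "Paris",
             "London", "London", "London", "London"],
    "year": [2001, 2008, 2009, 2010,
             2001, 2006, 2011, 2015],
    "pop": [2.148, 2.211, 2.234, 2.244,
            7.322, 7.657, 8.174, 8.615],
})

# Filtering keeps the original index labels (4 to 7 for London).
london = census[census["city"] == "London"]
london_labels = list(london.index)

# .loc selects by label: the first London row has label 4, not 0.
by_label = london.loc[4, "year"]

# .iloc selects by position: the first London row is at position 0.
by_position = london.iloc[0]["year"]

print(london_labels, by_label, by_position)
```

Here london.loc[0] would raise a KeyError, since no row with the label 0 survives the filter.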

Adding new columns

New columns can be added to a DataFrame by assigning to a new column name (as you would assign to a key in a Python dict), and can be deleted with the del keyword in the same way:

In [12]:
census["continental"] = census["city"] != "London"
census
Out[12]:
      city  year    pop  continental
0    Paris  2001  2.148         True
1    Paris  2008  2.211         True
2    Paris  2009  2.234         True
3    Paris  2010  2.244         True
4   London  2001  7.322        False
5   London  2006  7.657        False
6   London  2011  8.174        False
7   London  2015  8.615        False
8     Rome  2001  2.547         True
9     Rome  2006  2.627         True
10    Rome  2009  2.734         True
11    Rome  2012  2.627         True

In [13]:
del census["continental"]
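
A new column can also be computed from existing numeric columns. As a rough sketch, assuming the pop values are in millions (as the exercise below suggests), the following adds a hypothetical pop_thousands column and then removes it with the drop method, which returns a new DataFrame instead of modifying the original as del does:

```python
import pandas as pd

census = pd.DataFrame({
    "city": ["Paris", "London", "Rome"],
    "year": [2001, 2001, 2001],
    "pop": [2.148, 7.322, 2.547],
})

# Arithmetic on a column applies to every row at once.
# "pop_thousands" is an illustrative name, not part of the original data.
census["pop_thousands"] = census["pop"] * 1000

# drop returns a copy without the column; census itself is unchanged.
trimmed = census.drop(columns=["pop_thousands"])

print(list(census.columns))
print(list(trimmed.columns))
```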

Exercise

  • Create the DataFrame containing the census data for the three cities.
  • Select the data for the year 2001. Which city had the smallest population that year?
  • Find all the cities which had a population smaller than 2.6 million.