DataFrame¶

While you can think of the Series as a one-dimensional list of data, pandas' DataFrame is a two-dimensional table of data. You can think of each column in the table as being a Series.

from pandas import DataFrame

There are many ways of creating a DataFrame, but if you already have your data in Python then the simplest is to pass in a dictionary:

data = {
    "city": ["Paris", "Paris", "Paris", "Paris",
             "London", "London", "London", "London",
             "Rome", "Rome", "Rome", "Rome"],
    "year": [2001, 2008, 2009, 2010,
             2001, 2006, 2011, 2015,
             2001, 2006, 2009, 2012],
    "pop": [2.148, 2.211, 2.234, 2.244,
            7.322, 7.657, 8.174, 8.615,
            2.547, 2.627, 2.734, 2.627]
}
census = DataFrame(data)

This has created a DataFrame from the dictionary data. The keys of the dictionary will become the column headers and the dictionary values will be the values in each column. As with the Series, an index will be created automatically.
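
To check how a dictionary maps onto the table, you can inspect the column labels, the automatic index and the overall shape. This is a minimal sketch on a three-row subset of the census data (the 2001 figures); the names small_data and small are illustrative and not part of the tutorial.

```python
from pandas import DataFrame

# A three-row subset of the census data (the 2001 figures only)
small_data = {
    "city": ["Paris", "London", "Rome"],
    "year": [2001, 2001, 2001],
    "pop": [2.148, 7.322, 2.547],
}
small = DataFrame(small_data)

# The dictionary keys have become the column labels, in insertion order
columns = list(small.columns)

# The automatically created index counts the rows from zero
index_values = list(small.index)

# shape is (number of rows, number of columns)
shape = small.shape

print(columns)
print(index_values)
print(shape)
```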

census

Or, if you only want a peek at the data, you can grab the first few rows with:

census.head(3)
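
As a small sketch of the related calls: head() with no argument returns the first five rows, and tail() does the same from the other end of the table. The variable names first_five and last_two are illustrative only.

```python
from pandas import DataFrame

census = DataFrame({
    "city": ["Paris"] * 4 + ["London"] * 4 + ["Rome"] * 4,
    "year": [2001, 2008, 2009, 2010, 2001, 2006, 2011, 2015,
             2001, 2006, 2009, 2012],
    "pop": [2.148, 2.211, 2.234, 2.244, 7.322, 7.657, 8.174, 8.615,
            2.547, 2.627, 2.734, 2.627],
})

# With no argument, head() returns the first five rows
first_five = census.head()

# tail(n) returns the last n rows, keeping their original index labels
last_two = census.tail(2)

print(first_five)
print(last_two)
```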

When we accessed elements of a Series, indexing selected an element by row. By default, however, DataFrames index primarily by column. You can access any column directly:

census["city"]

0      Paris
1      Paris
2      Paris
3      Paris
4     London
5     London
6     London
7     London
8       Rome
9       Rome
10      Rome
11      Rome
Name: city, dtype: object

Accessing a column like this returns a Series, which behaves in the same way as those we were using earlier. We can confirm this with:

type(census["city"])

pandas.core.series.Series

Note that the output of census["city"] has one additional part, Name: city. Pandas has remembered that this Series was created from the "city" column in the DataFrame.
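
Because a column is an ordinary Series, its name attribute and the usual Series methods are available on it. The following is a short sketch of that; the variable names are illustrative.

```python
from pandas import DataFrame

census = DataFrame({
    "city": ["Paris"] * 4 + ["London"] * 4 + ["Rome"] * 4,
    "year": [2001, 2008, 2009, 2010, 2001, 2006, 2011, 2015,
             2001, 2006, 2009, 2012],
    "pop": [2.148, 2.211, 2.234, 2.244, 7.322, 7.657, 8.174, 8.615,
            2.547, 2.627, 2.734, 2.627],
})

cities = census["city"]

# The Series carries the label of the column it came from
column_name = cities.name

# unique() lists the distinct values in order of first appearance
unique_cities = list(cities.unique())

# Numeric columns support the usual Series reductions
largest_pop = census["pop"].max()

print(column_name)
print(unique_cities)
print(largest_pop)
```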

Querying¶

We can start to ask questions of our data in the same way as we did with Series. For example, we can grab a column from the DataFrame and do a comparison operation on it:

census["city"] == "Paris"

0      True
1      True
2      True
3      True
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
Name: city, dtype: bool

This has created a new Series which has True set where the city is Paris and False elsewhere.
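
A boolean Series can also be summarised directly, since True counts as 1 when summed. This is a minimal sketch; is_paris, n_paris and any_rome are illustrative names.

```python
from pandas import DataFrame

census = DataFrame({
    "city": ["Paris"] * 4 + ["London"] * 4 + ["Rome"] * 4,
    "year": [2001, 2008, 2009, 2010, 2001, 2006, 2011, 2015,
             2001, 2006, 2009, 2012],
    "pop": [2.148, 2.211, 2.234, 2.244, 7.322, 7.657, 8.174, 8.615,
            2.547, 2.627, 2.734, 2.627],
})

is_paris = census["city"] == "Paris"

# Summing a boolean Series counts the True entries
n_paris = int(is_paris.sum())

# any() reports whether at least one entry is True
any_rome = bool((census["city"] == "Rome").any())

print(n_paris)
print(any_rome)
```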

We can use a boolean Series like this to filter the DataFrame as a whole. census["city"] == "Paris" returns a Series of booleans. Passing it back into census as an indexing operation keeps only the rows where the Series is True, which here means filtering on the "city" column.

census[census["city"] == "Paris"]

You can then carry on and grab another column after that filter:

census[census["city"] == "Paris"]["year"]

0    2001
1    2008
2    2009
3    2010
Name: year, dtype: int64
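
The same selection can be written in a single step by giving .loc both the boolean Series and a column label, and several conditions can be combined with & (and) or | (or). This is a sketch of one possible approach, with illustrative variable names; note that each condition needs its own parentheses.

```python
from pandas import DataFrame

census = DataFrame({
    "city": ["Paris"] * 4 + ["London"] * 4 + ["Rome"] * 4,
    "year": [2001, 2008, 2009, 2010, 2001, 2006, 2011, 2015,
             2001, 2006, 2009, 2012],
    "pop": [2.148, 2.211, 2.234, 2.244, 7.322, 7.657, 8.174, 8.615,
            2.547, 2.627, 2.734, 2.627],
})

is_paris = census["city"] == "Paris"

# Row filter and column selection in one indexing operation
paris_years = census.loc[is_paris, "year"]

# Combine two conditions element-wise with &
recent_paris = census[(census["city"] == "Paris") & (census["year"] > 2008)]

print(paris_years)
print(recent_paris)
```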

Getting rows¶

If you want to select a row from a DataFrame, you can use the .loc attribute, which lets you look rows up by their index value:

census.loc[2]

city    Paris
year     2009
pop     2.234
Name: 2, dtype: object

The row is returned as a Series indexed by the column names, so you can index it again to get a single value:

census.loc[2]["city"]

'Paris'
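
.loc also accepts a row label and a column label together, and it can take a slice of labels. The sketch below also shows .iloc, which selects by position rather than by label; with the default index the two happen to coincide. The variable names are illustrative.

```python
from pandas import DataFrame

census = DataFrame({
    "city": ["Paris"] * 4 + ["London"] * 4 + ["Rome"] * 4,
    "year": [2001, 2008, 2009, 2010, 2001, 2006, 2011, 2015,
             2001, 2006, 2009, 2012],
    "pop": [2.148, 2.211, 2.234, 2.244, 7.322, 7.657, 8.174, 8.615,
            2.547, 2.627, 2.734, 2.627],
})

# Row label and column label in one lookup
value = census.loc[2, "city"]

# Label slices with .loc include both end points
rows_by_label = census.loc[2:4]

# .iloc works by position and, like Python lists, excludes the end point
rows_by_position = census.iloc[0:2]

print(value)
print(rows_by_label)
print(rows_by_position)
```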

Adding new columns¶

New columns can be added to a DataFrame simply by assigning them by index (as you would for a Python dict) and can be deleted with the del keyword in the same way:

census["continental"] = census["city"] != "London"
census

del census["continental"]
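
A new column can also be computed from an existing one. As a sketch, assuming the pop column is in millions (as the exercise wording suggests), the hypothetical column pop_thousands below rescales it. DataFrame.drop is shown as an alternative to del that returns a new table instead of modifying census in place.

```python
from pandas import DataFrame

census = DataFrame({
    "city": ["Paris"] * 4 + ["London"] * 4 + ["Rome"] * 4,
    "year": [2001, 2008, 2009, 2010, 2001, 2006, 2011, 2015,
             2001, 2006, 2009, 2012],
    "pop": [2.148, 2.211, 2.234, 2.244, 7.322, 7.657, 8.174, 8.615,
            2.547, 2.627, 2.734, 2.627],
})

# Arithmetic on a column applies to every row at once.
# "pop_thousands" is an illustrative name, assuming "pop" is in millions.
census["pop_thousands"] = census["pop"] * 1000

# drop() returns a copy without the column; census itself is unchanged
trimmed = census.drop(columns=["pop_thousands"])

print(census.head(3))
print(trimmed.head(3))
```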

Exercise¶

Create the DataFrame containing the census data for the three cities.
Select the data for the year 2001. Which city had the smallest population that year?
Find all the cities which had a population smaller than 2.6 million.


The full census DataFrame:

      city  year    pop
0    Paris  2001  2.148
1    Paris  2008  2.211
2    Paris  2009  2.234
3    Paris  2010  2.244
4   London  2001  7.322
5   London  2006  7.657
6   London  2011  8.174
7   London  2015  8.615
8     Rome  2001  2.547
9     Rome  2006  2.627
10    Rome  2009  2.734
11    Rome  2012  2.627

Introduction to Data Analysis in Python
