Melt your data with Pandas

Posted on Wed 27 May 2020 by Matt Williams in python

While trying to find some example data for a new course I'm writing, I came across a dataset in an unusual format and had to learn some new Pandas tricks to deal with it.

If you're new to pandas and would like to learn about it, you can access my free online teaching material or watch the full two-hour YouTube video of the course.

I've used monthly data from the UK Met Office's Hadley Centre before which came in a nice format with a row for each year and a column for each month. It was not quite normalised but was easy enough to work with since it was at least a regular shape.

I found that they also provide daily data records for days going back to January 1772. This data is in a format which was much more difficult to work with, as I will show.

Looking at the data

I started at the HadCET Data download page and grabbed their daily data. It comes with a format description which is:

Column 1: year
Column 2: day
Columns 3-14: daily CET values expressed in tenths of a degree. There are 12 columns; one for each of the 12 months.

and the first few rows of it look like:

1772    1   32  -15   18   25   87  128  187  177  105  111   78  112
1772    2   20    7   28   38   77  138  154  158  143  150   85   62
1772    3   27   15   36   33   84  170  139  153  113  124   83   60

We can see the first column is indeed the year, the second counts up (1, 2, 3 etc.) and so represents the day of the month, and the remaining 12 columns hold the data for each month. It's also worth noting that the file is whitespace-separated, so we will use delim_whitespace when calling read_csv.

That's enough information to load the file into Pandas with read_csv and get to work:

import pandas as pd
df = pd.read_csv(
    "cetdl1772on.dat",
    delim_whitespace=True,
    index_col=False,
    names=["YEAR", "DAY", *range(1, 13)],
)

I set index_col=False as otherwise Pandas will interpret the first column (the year) as the index, but since the year is repeated over multiple rows it is not very useful as one. The data file does not have column headings so we set these manually, choosing "YEAR" for the year column, "DAY" for the day of the month and the numeric value of the month (1-12) for the remaining columns:

YEAR DAY 1 2 3 4 5 6 7 8 9 10 11 12
0 1772 1 32 -15 18 25 87 128 187 177 105 111 78 112
1 1772 2 20 7 28 38 77 138 154 158 143 150 85 62
2 1772 3 27 15 36 33 84 170 139 153 113 124 83 60

This is looking good, but if we look a little further down the table we will see where the data breaks down. Since each row represents a day of the month in each year, there must be a row for day 31 for those months that have one. However, since we're dealing with a rectangular frame, all the month columns need some value in that cell.

Let's have a look at that slice:

df.iloc[28:32]
YEAR DAY 1 2 3 4 5 6 7 8 9 10 11 12
28 1772 29 -33 56 83 50 113 131 170 182 135 140 63 12
29 1772 30 -10 -999 66 77 121 122 179 163 143 143 55 15
30 1772 31 -8 -999 46 -999 108 -999 168 144 -999 145 -999 22
31 1773 1 20 0 79 13 93 174 104 151 171 131 68 55

For column 1 (January) all the cells are filled as January has 31 days. Looking at the next column (February), however, shows how the non-existent days are represented. They are using -999 to stand in for missing data. We therefore need to update our read_csv call to account for it:

import pandas as pd
df = pd.read_csv(
    "cetdl1772on.dat",
    delim_whitespace=True,
    index_col=False,
    names=["YEAR", "DAY", *range(1, 13)],
    na_values=["-999"],
)
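
If you don't have the data file to hand, the effect of na_values can be checked on just those two awkward rows with an inline string. This is a minimal sketch using sep=r"\s+", which is equivalent to delim_whitespace=True:

```python
import io

import pandas as pd

# The two awkward rows from the slice above, pasted as an inline string so
# the behaviour can be checked without downloading the file.
sample = """\
1772   30  -10 -999   66   77  121  122  179  163  143  143   55   15
1772   31   -8 -999   46 -999  108 -999  168  144 -999  145 -999   22
"""

df = pd.read_csv(
    io.StringIO(sample),
    sep=r"\s+",  # equivalent to delim_whitespace=True
    index_col=False,
    names=["YEAR", "DAY", *range(1, 13)],
    na_values=["-999"],
)

# All six -999 sentinels become proper missing values (NaN).
print(int(df.isna().sum().sum()))  # 6
```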

We now have all the data read in correctly but it's still not in a nice normalised form.

Melt

What we currently have is a row for each year and day but the months are split over multiple columns with the temperature values under each. This means that if we want to read each day in turn through all the years, we first read down some of the first column, then back to the top to read down some of the second column etc. Once we've read all twelve columns we have completed one year and to read the next year we go back to the first column, skip to where we stopped previously and do the same thing again. We're having to zig-zag across the table just to read sequentially.

We want to transform this data so that we have one column for each of year, month and day and then a single column for the relevant temperature values, effectively straightening out the zig-zag.

For demonstration purposes, let's first look at a smaller subset of the data (subset = df[["YEAR", "DAY", 1, 2]].iloc[0:2]) and see what we can do with it. That is, we want to go from something like:

YEAR DAY 1 2
0 1772 1 32 -15
1 1772 2 20 7

which contains the data for the first and second of January and February, to something like:

YEAR DAY MONTH temp
0 1772 1 1 32
1 1772 2 1 20
2 1772 1 2 -15
3 1772 2 2 7

which contains the same temperature values but having a single measurement per row. The data was previously zig-zagging (down column 1 and then down column 2) but it has now been straightened.
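
If you're following along without the data file, the subset can be built by hand instead (the values are copied from the table above):

```python
import pandas as pd

# Hand-built copy of the two-day, two-month subset, with the same column
# labels as the full frame: "YEAR", "DAY", then the month numbers.
subset = pd.DataFrame({
    "YEAR": [1772, 1772],
    "DAY": [1, 2],
    1: [32, 20],
    2: [-15, 7],
})
print(subset)
```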

To do this, pandas provides a function called melt. The way to use melt is first to identify which columns in your DataFrame you want to keep in the result. In our case, we want to keep "YEAR" and "DAY".

The values in the cells in the rest of the table (32, 20, -15 and 7) are then going to be melted. Each of those values will be given their own row in the result with the column header that they are currently under moved into a new column to keep track of where they came from.

melt has a number of arguments so let's have a look through them and see how they apply to our situation:

id_vars
The documentation says "Column(s) to use as identifier variables." and this means "the columns which should not be melted". In our case this is "YEAR" and "DAY" so we pass in a list of the labels: ["YEAR", "DAY"]
value_vars
These are the columns which should be melted. If left unset, it will melt every column that is not in id_vars. In our case we want to melt 1, 2, etc., so we can simply leave this argument out.

First, let's try it with just id_vars used:

subset.melt(id_vars=["YEAR", "DAY"])

which gives us

YEAR DAY variable value
0 1772 1 1 32
1 1772 2 1 20
2 1772 1 2 -15
3 1772 2 2 7
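
As a quick check of the value_vars behaviour described above, listing the columns explicitly gives the same result as leaving the argument unset. A sketch, with the subset rebuilt by hand:

```python
import pandas as pd

# The same hand-built subset (values from the table above).
subset = pd.DataFrame({
    "YEAR": [1772, 1772],
    "DAY": [1, 2],
    1: [32, 20],
    2: [-15, 7],
})

# Explicit value_vars versus the default of "everything not in id_vars".
explicit = subset.melt(id_vars=["YEAR", "DAY"], value_vars=[1, 2])
implicit = subset.melt(id_vars=["YEAR", "DAY"])
print(explicit.equals(implicit))  # True
```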

We can see that the four temperature values now each live on their own row under the column "value", and what were the column headings for those values have moved into the column "variable". These are not useful names for these columns, but luckily the melt function allows us to name them when performing the melt with two more arguments:

var_name
This argument lets us set the name of the column that holds what used to be the column headings. For us, we would like this to be "MONTH"
value_name
This is used to give the name to the new column which holds the data. Since this is all temperature data, this should be set to "temp"
subset.melt(id_vars=["YEAR", "DAY"], var_name="MONTH", value_name="temp")

to give us

YEAR DAY MONTH temp
0 1772 1 1 32
1 1772 2 1 20
2 1772 1 2 -15
3 1772 2 2 7

If we now run this on the full data, df, it works as we expect:

df.melt(id_vars=["YEAR", "DAY"], var_name="MONTH", value_name="temp").head(3)
YEAR DAY MONTH temp
0 1772 1 1 32
1 1772 2 1 20
2 1772 3 1 27

However, let's take a look towards the end of some months to see what's happened to our NaNs:

df.melt(id_vars=["YEAR", "DAY"], var_name="MONTH", value_name="temp").query("MONTH == 2 and DAY == 30").head(3)
YEAR DAY MONTH temp
7748 1772 30 2 NaN
7779 1773 30 2 NaN
7810 1774 30 2 NaN

The NaNs are still there. Since these missing data are no longer providing a useful padding function, we can drop them:

melted = df.melt(id_vars=["YEAR", "DAY"], var_name="MONTH", value_name="temp").dropna(subset=["temp"])

Setting a useful index

We're done with the melting now but our index is still the default one and our date information is spread over multiple columns:

YEAR DAY MONTH temp
0 1772 1 1 32

Fortunately, pandas provides a function to_datetime which can accept a variety of forms of data; one of them is a DataFrame of columns with predictable names ("year", "month" and "day", matched case-insensitively) which together contain the date information.

pd.to_datetime(melted[["YEAR", "MONTH", "DAY"]])

We can then set this as the index of our data frame and drop the columns it came from:

melted = melted.set_index(pd.to_datetime(melted[["YEAR", "MONTH", "DAY"]])).drop(["YEAR", "MONTH", "DAY"], axis=1)
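
The column-name convention that to_datetime relies on can be seen in isolation. This sketch builds a tiny frame with YEAR/MONTH/DAY columns (invented dates, just for illustration):

```python
import pandas as pd

# to_datetime accepts a DataFrame whose columns are named year/month/day
# (matched case-insensitively), assembling one timestamp per row.
parts = pd.DataFrame({
    "YEAR": [1772, 1772],
    "MONTH": [1, 2],
    "DAY": [1, 1],
})

dates = pd.to_datetime(parts)
print(dates.tolist())
```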

Finally, since the measurements are in tenths of a degree, we want to scale them to degrees Celsius:

melted = melted / 10
melted.head()
temp
1772-01-01 00:00:00 3.2
1772-01-02 00:00:00 2.0
1772-01-03 00:00:00 2.7
1772-01-04 00:00:00 2.7
1772-01-05 00:00:00 1.5

which we can then plot with

ax = melted["temp"].resample("Y").mean().plot()
ax.set_xlabel("Year")
ax.set_ylabel("Mean annual temperature (°C)")
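
The resample step can also be checked on its own. A sketch with an invented daily series over two years; resample("Y") groups the daily values into calendar-year buckets and mean() takes one average per bucket:

```python
import pandas as pd

# An invented daily series covering 1772-1773, just to show the resample.
idx = pd.date_range("1772-01-01", "1773-12-31", freq="D")
temps = pd.Series(range(len(idx)), index=idx, name="temp")

# One mean per calendar year (newer pandas spells this frequency "YE").
annual = temps.resample("Y").mean()
print(annual)
```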