While trying to find some example data for a new course I'm writing, I came across a dataset in an unusual format and had to learn some new Pandas tricks to deal with it.
I've used monthly data from the UK Met Office's Hadley Centre before which came in a nice format with a row for each year and a column for each month. It was not quite normalised but was easy enough to work with since at least it was a regular shape.
I found that they also provide daily data records going back to January 1772. This data is in a format which was much more difficult to work with, as I will show.
Looking at the data
Column 1: year
Column 2: day
Columns 3-14: daily CET values expressed in tenths of a degree. There are 12 columns; one for each of the 12 months.
and the first few rows of it look like:
```
1772   1   32  -15   18   25   87  128  187  177  105  111   78  112
1772   2   20    7   28   38   77  138  154  158  143  150   85   62
1772   3   27   15   36   33   84  170  139  153  113  124   83   60
```
We can see the first column is indeed the year, the second is counting up (1, 2, 3 etc.) and so represents the day of the month, and the remaining 12 columns hold the data for each month. It's also worth noting that it is whitespace-separated so we will use delim_whitespace when calling read_csv.
That's enough information to load the file into Pandas with read_csv and get to work:
```python
import pandas as pd

df = pd.read_csv(
    "cetdl1772on.dat",
    delim_whitespace=True,
    index_col=False,
    names=["YEAR", "DAY", *range(1, 13)],
)
```
I set index_col=False as otherwise Pandas will interpret the first column (the year) as the index, but since the year is repeated over multiple rows it is not very useful. The data file does not have column headings so we set these manually, choosing "YEAR" for the year column, "DAY" for the day of the month and then the numeric value of the month (1-12) for the remaining columns.
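As a quick, self-contained check, the three rows shown above can be fed straight to read_csv via io.StringIO (here with sep=r"\s+", which is equivalent to delim_whitespace=True):

```python
import io
import pandas as pd

# The first three rows of the file, exactly as shown above
sample = """\
1772 1 32 -15 18 25 87 128 187 177 105 111 78 112
1772 2 20 7 28 38 77 138 154 158 143 150 85 62
1772 3 27 15 36 33 84 170 139 153 113 124 83 60
"""

df = pd.read_csv(
    io.StringIO(sample),
    sep=r"\s+",  # equivalent to delim_whitespace=True
    index_col=False,
    names=["YEAR", "DAY", *range(1, 13)],
)
print(df.shape)  # (3, 14): YEAR, DAY and one column per month
```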
This is looking good but if we look a little further down the table we will see where the data breaks down a little. Since each row is representing a day of the month in each year, there must be a row for day 31 in those months that have one. However, since we're dealing with a rectangular frame, all the month columns will need to have some value for that cell:
Let's have a look at that slice:
For column 1 (January) all the cells are filled as January has 31 days. Looking at the next column (February), however, shows how the non-existent days are represented. They are using -999 to stand in for missing data. We therefore need to update our read_csv call to account for it:
```python
import pandas as pd

df = pd.read_csv(
    "cetdl1772on.dat",
    delim_whitespace=True,
    index_col=False,
    names=["YEAR", "DAY", *range(1, 13)],
    na_values=["-999"],
)
```
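To see the effect of na_values in isolation, here is a sketch with a single made-up day-31 row (the temperatures are invented; only the layout matches the real file, where months without a day 31 are padded with -999):

```python
import io
import pandas as pd

# Hypothetical day-31 row: Feb, Apr, Jun, Sep and Nov have no day 31,
# so those cells are padded with -999 (values here are made up)
sample = "1772 31 45 -999 12 -999 90 -999 150 140 -999 87 -999 33\n"

df = pd.read_csv(
    io.StringIO(sample),
    sep=r"\s+",  # equivalent to delim_whitespace=True
    index_col=False,
    names=["YEAR", "DAY", *range(1, 13)],
    na_values=["-999"],
)
print(df.loc[0, 2])  # nan: February has no day 31
```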
We now have all the data read in correctly, but it's still not in a nice normalised form.
What we currently have is a row for each year and day but the months are split over multiple columns with the temperature values under each. This means that if we want to read each day in turn through all the years, we first read down some of the first column, then back to the top to read down some of the second column etc. Once we've read all twelve columns we have completed one year and to read the next year we go back to the first column, skip to where we stopped previously and do the same thing again. We're having to zig-zag across the table just to read sequentially.
We want to transform this data so that we have one column for each of year, month and day and then a single column for the relevant temperature values, effectively straightening out the zig-zag.
For demonstration purposes, let's first look at a smaller subset of the data (subset = df[["YEAR", "DAY", 1, 2]].iloc[0:2]) and see what we can do with it. That is, we want to go from something like:
which contains the data for the first and second of January and February, to something like:
which contains the same temperature values but having a single measurement per row. The data was previously zig-zagging (down column 1 and then down column 2) but it has now been straightened.
To do this, pandas provides a function called melt. The way to use melt is to first identify which columns in your DataFrame you want to keep in the result. In our case, we want to keep "YEAR" and "DAY".
The values in the cells in the rest of the table (32, 20, -15 and 7) are then going to be melted. Each of those values will be given their own row in the result with the column header that they are currently under moved into a new column to keep track of where they came from.
melt has a number of arguments so let's have a look through them and see how they apply to our situation:
- id_vars: The documentation says "Column(s) to use as identifier variables." and this means "the columns which should not be melted". In our case this is "YEAR" and "DAY" so we pass in a list of the labels: ["YEAR", "DAY"]
- value_vars: These are the columns which should be melted. If left unset, it will melt everything that is not in id_vars. In our case we want to melt 1, 2, etc. so we can leave this argument out.
First, let's try it with just id_vars used:
which gives us
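Since the intermediate output isn't shown here, this self-contained sketch reconstructs the subset from the values above and performs the same melt:

```python
import pandas as pd

# The subset built above: days 1 and 2 of January (1) and February (2), 1772
subset = pd.DataFrame(
    {"YEAR": [1772, 1772], "DAY": [1, 2], 1: [32, 20], 2: [-15, 7]}
)

melted = subset.melt(id_vars=["YEAR", "DAY"])
print(melted)
# Four rows: the old column headings (1 and 2) move into "variable",
# and the temperatures (32, 20, -15, 7) into "value"
```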
We can see that the four temperature values now each live on their own row under the column "value" and what were the column headings for those values has now moved into the column "variable". These are not useful names for these columns but luckily the melt function allows us to name them when performing the melt with the arguments:
- var_name: This argument lets us set the name of the column that holds what used to be the column headings. For us, we would like this to be "MONTH"
- value_name: This is used to give a name to the new column which holds the data. Since this is all temperature data, this should be set to "temp"
```python
subset.melt(id_vars=["YEAR", "DAY"], var_name="MONTH", value_name="temp")
```
to give us
If we now run this on the full data, df, it works as we expect:
```python
df.melt(id_vars=["YEAR", "DAY"], var_name="MONTH", value_name="temp").head(3)
```
However, let's take a look towards the end of some months to see what's happened to our NaNs:
```python
df.melt(
    id_vars=["YEAR", "DAY"], var_name="MONTH", value_name="temp"
).query("MONTH == 2 and DAY == 30").head(3)
```
The NaNs are still there. Since these missing data are no longer providing a useful padding function, we can drop them:
```python
melted = df.melt(
    id_vars=["YEAR", "DAY"], var_name="MONTH", value_name="temp"
).dropna(subset=["temp"])
```
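The effect of the dropna can be seen on a toy single-row frame (temperature values made up for illustration):

```python
import pandas as pd

# A made-up day-30 row: February (2) has no day 30, so its cell is NaN
# (which is what -999 became thanks to na_values)
df30 = pd.DataFrame({"YEAR": [1772], "DAY": [30], 1: [4.1], 2: [float("nan")]})

melted30 = df30.melt(id_vars=["YEAR", "DAY"], var_name="MONTH", value_name="temp")
cleaned = melted30.dropna(subset=["temp"])
print(len(melted30), len(cleaned))  # 2 1 -- the padding row is gone
```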
Setting a useful index
We're done with the melting now but our index is still the default one and our date information is spread over multiple columns:
Fortunately, pandas provides a function to_datetime which can accept a variety of forms of data; one of these is a set of columns with predictable names ("YEAR", "MONTH", "DAY") which contain the date information:
```python
pd.to_datetime(melted[["YEAR", "MONTH", "DAY"]])
```
we can then set this as the index of our data frame and drop the columns it came from:
```python
melted = melted.set_index(
    pd.to_datetime(melted[["YEAR", "MONTH", "DAY"]])
).drop(["YEAR", "MONTH", "DAY"], axis=1)
```
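As a self-contained check of this step (toy temperature values again), to_datetime recognises the year/month/day column names case-insensitively and builds a DatetimeIndex:

```python
import pandas as pd

melted = pd.DataFrame(
    {"YEAR": [1772, 1772], "MONTH": [1, 2], "DAY": [1, 2], "temp": [32, 7]}
)

# to_datetime assembles a date from the YEAR/MONTH/DAY columns
idx = pd.to_datetime(melted[["YEAR", "MONTH", "DAY"]])
melted = melted.set_index(idx).drop(["YEAR", "MONTH", "DAY"], axis=1)
print(melted.index[0])  # 1772-01-01 00:00:00
```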
Finally, since the measurements are in tenths of a degree, we want to scale them all to degrees Celsius:
```python
melted = melted / 10
melted.head()
```
which we can then plot with
```python
ax = melted["temp"].resample("Y").mean().plot()
ax.set_xlabel("Year")
ax.set_ylabel("Mean annual temperature (°C)")
```