Applied Data Analysis in Python

scikit-learn requires the X parameter of the fit() function to be two-dimensional and the y parameter to be one-dimensional.

X must be two-dimensional, even if there is only one feature (column) present in your data. This can be confusing because, to humans, there's little difference between a table with one column and a simple list of values. Computers, however, are very explicit about this difference, so we need to make sure we pass the right structure.
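As a rough illustration of that difference, here is a minimal sketch using only plain Python lists. The names and numbers are made up for this example and are not taken from the course data.

```python
# A simple list of values: one level of nesting, so one dimension
values = [3.7, 9.5, 7.3]

# A table with a single column: a list of rows, each row holding one value
one_column_table = [[3.7], [9.5], [7.3]]

# One index reaches a number in the list,
# but only reaches a whole row in the table
first_value = values[0]               # a float
first_row = one_column_table[0]       # a list containing one float
first_cell = one_column_table[0][0]   # a second index is needed to reach the number
```

The numbers are identical in both cases; only the nesting differs, and that nesting is what a library means when it asks about dimensions.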

First, let's grab the data we were working with:

In [1]:
from pandas import read_csv

data = read_csv("https://milliams.com/courses/applied_data_analysis/linear.csv")

2D DataFrames

If we look at it, we see it's a pandas DataFrame, which is always two-dimensional:

In [2]:
data.head()
Out[2]:
x y
0 3.745401 3.229269
1 9.507143 14.185654
2 7.319939 9.524231
3 5.986585 6.672066
4 1.560186 -3.358149

To get a more specific idea of the shape of the data structure, we can use the shape attribute:

In [3]:
data.shape
Out[3]:
(50, 2)

This tells us that it's a $(50 \times 2)$ structure, so it is two-dimensional.

To be explicit, we can also query its dimensionality directly with ndim:

In [4]:
data.ndim
Out[4]:
2
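The two attributes are related: shape holds one entry per dimension, so its length should match ndim. The sketch below shows this on a small made-up table (the name toy and its values are assumptions for illustration, not the course data).

```python
import pandas as pd

# Hypothetical three-row table standing in for the course data
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [2.0, 4.0, 6.0]})

# shape is a tuple with one entry per dimension: (rows, columns)
n_rows, n_cols = toy.shape

# so the number of entries in shape matches ndim
n_dims = len(toy.shape)
```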

1D Series

If we ask a DataFrame for one of its columns, it returns it to us as a pandas Series. These objects are always one-dimensional (ignoring the potential for multi-indexes):

In [5]:
data["x"].head()
Out[5]:
0    3.745401
1    9.507143
2    7.319939
3    5.986585
4    1.560186
Name: x, dtype: float64
In [6]:
type(data["x"])
Out[6]:
pandas.core.series.Series
In [7]:
data["x"].shape
Out[7]:
(50,)

Note that the shape is (50,). The trailing comma might make it look as though a second value is missing, but this is just how Python writes a tuple with one value. To check the dimensionality explicitly, we can peek at ndim again:

In [8]:
data["x"].ndim
Out[8]:
1
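The one-value tuple notation can be checked in plain Python, independently of pandas. This is a small sketch with made-up variable names:

```python
one_value_tuple = (50,)   # the trailing comma makes this a tuple of length 1
just_an_int = (50)        # without the comma, the parentheses only group

tuple_length = len(one_value_tuple)
```

A shape tuple with one entry therefore describes a one-dimensional object, which agrees with what ndim reports.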

2D subsets of DataFrames

If we ask a DataFrame for a subset of its columns, it returns the answer as another DataFrame, as this is the only way to represent data with multiple columns.

We can ask for multiple columns by passing a list of column names to the DataFrame indexing operator.

Pay attention here: the outer pair of square brackets is the indexing operator being called, while the inner pair creates the list.

In [9]:
data[["x", "y"]].head()
Out[9]:
x y
0 3.745401 3.229269
1 9.507143 14.185654
2 7.319939 9.524231
3 5.986585 6.672066
4 1.560186 -3.358149
In [10]:
data[["x", "y"]].shape
Out[10]:
(50, 2)

We can see here that when we ask the DataFrame for multiple columns by passing a list of column names, it returns a two-dimensional object.
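To make the roles of the two pairs of brackets more visible, the list can be built on its own line first. This is a minimal sketch on a made-up table (toy and columns are hypothetical names for this example).

```python
import pandas as pd

# Hypothetical three-row table standing in for the course data
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [2.0, 4.0, 6.0]})

columns = ["x", "y"]           # inner brackets: an ordinary Python list
subset = toy[columns]          # outer brackets: the indexing operator
same_thing = toy[["x", "y"]]   # both steps written in a single expression
```

Both forms give the same two-dimensional result; the double brackets are simply the two steps written together.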

If we want to extract just one column but still maintain the dimensionality, we can pass a list with only one column name:

In [11]:
data[["x"]].head()
Out[11]:
x
0 3.745401
1 9.507143
2 7.319939
3 5.986585
4 1.560186

If we check the shape and dimensionality of this, we see that it is a $(50 \times 1)$ structure with two dimensions:

In [12]:
data[["x"]].shape
Out[12]:
(50, 1)
In [13]:
data[["x"]].ndim
Out[13]:
2

Final comparison

Finally, to reiterate, the difference between

In [14]:
data["x"].head()
Out[14]:
0    3.745401
1    9.507143
2    7.319939
3    5.986585
4    1.560186
Name: x, dtype: float64

and

In [15]:
data[["x"]].head()
Out[15]:
x
0 3.745401
1 9.507143
2 7.319939
3 5.986585
4 1.560186

is not really in the data itself, but in the mathematical structure. One is a vector and the other is a matrix. One is one-dimensional and the other is two-dimensional.

In [16]:
data["x"].ndim
Out[16]:
1
In [17]:
data[["x"]].ndim
Out[17]:
2
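This is why the distinction matters for the fit() function mentioned at the start. The sketch below assumes scikit-learn's LinearRegression as the model and uses a small made-up table where y = 2x - 1 exactly, rather than the course's linear.csv. It passes a one-column DataFrame as X and a Series as y, then shows that a Series passed as X is rejected.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in data (not linear.csv): y = 2x - 1 with no noise
toy = pd.DataFrame({"x": np.arange(10, dtype=float)})
toy["y"] = 2.0 * toy["x"] - 1.0

# Works: X is a one-column DataFrame (2D) and y is a Series (1D)
model = LinearRegression()
model.fit(toy[["x"]], toy["y"])
slope = model.coef_[0]
intercept = model.intercept_

# Fails: X is a Series (1D), which scikit-learn refuses with a ValueError
try:
    LinearRegression().fit(toy["x"], toy["y"])
    rejected = False
except ValueError:
    rejected = True
```

The same numbers are supplied in both calls; only the number of dimensions of X changes.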