Applied Data Analysis in Python
scikit-learn requires the X parameter of the fit() function to be two-dimensional and the y parameter to be one-dimensional.
X must be two-dimensional, even if there is only one feature (column) present in your data. This can be confusing because, to a human, there is little difference between a table with one column and a simple list of values. Computers, however, are very explicit about this difference, so we need to make sure we pass the right structure.
First, let's grab the data we were working with:
from pandas import read_csv
data = read_csv("https://milliams.com/courses/applied_data_analysis/linear.csv")
2D DataFrames
If we look at it, we see it's a pandas DataFrame, which is always inherently two-dimensional:
data.head()
To get a more specific idea of the shape of the data structure, we can use the shape attribute:
data.shape
This tells us that it's a $(50 \times 2)$ structure, so it is two-dimensional.
To be explicit, we can also query its dimensionality directly with ndim:
data.ndim
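As a quick check of how these two attributes relate, here is a minimal sketch. It uses a small made-up DataFrame (the name toy and its values are illustrative and not part of the course data) so that it runs without downloading anything.

```python
import pandas as pd

# A small made-up table standing in for the course data (hypothetical values)
toy = pd.DataFrame({"x": [0.0, 1.0, 2.0], "y": [1.0, 3.0, 5.0]})

print(toy.shape)  # a tuple of (rows, columns)
print(toy.ndim)   # the number of dimensions

# ndim is simply the number of entries in the shape tuple
assert toy.ndim == len(toy.shape)
```

The same relationship holds for the Series objects discussed next, whose shape has a single entry.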
1D Series
If we ask a DataFrame for one of its columns, it returns it to us as a pandas Series. These objects are always one-dimensional (ignoring the potential for multi-indexes):
data["x"].head()
type(data["x"])
data["x"].shape
Note that the shape is (50,). This might look like it could hold multiple values, but this is just how Python represents a tuple with one value. To check the dimensionality explicitly, we can peek at ndim again:
data["x"].ndim
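The one-value tuple notation can be seen in plain Python, without pandas. In this short sketch the variable names are purely illustrative:

```python
# A tuple with a single value needs a trailing comma;
# without the comma, the parentheses are only grouping.
one_value = (50,)
not_a_tuple = (50)

print(type(one_value))    # a tuple
print(type(not_a_tuple))  # a plain int
print(len(one_value))     # one entry, which corresponds to one dimension
```

So a shape of (50,) has one entry, which corresponds to one dimension of length 50.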
2D subsets of DataFrames
If we ask a DataFrame for a subset of its columns, it will return the answer to us as another DataFrame, as this is the only way to represent data with multiple columns.
We can ask for multiple columns by passing a list of column names to the DataFrame indexing operator.
Pay attention here: the outer pair of square brackets denotes the indexing operator being called, while the inner pair denotes the list being created.
data[["x", "y"]].head()
data[["x", "y"]].shape
We can see here that when we asked the DataFrame for multiple columns by passing a list of column names, it returned a two-dimensional object.
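The roles of the two pairs of brackets may be easier to see if the list is given a name first. This is a minimal sketch using a small made-up DataFrame (toy, columns and subset are illustrative names, not part of the course material):

```python
import pandas as pd

# A small made-up table standing in for the course data (hypothetical values)
toy = pd.DataFrame({"x": [0.0, 1.0, 2.0], "y": [1.0, 3.0, 5.0]})

columns = ["x", "y"]   # the inner brackets: an ordinary Python list
subset = toy[columns]  # the outer brackets: the indexing operator

# Writing both steps at once gives the same result
assert subset.equals(toy[["x", "y"]])
print(type(subset).__name__, subset.shape)
```

Splitting the expression this way changes nothing about the result. It only makes it visible that the indexing operator receives a single argument, which happens to be a list.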
If we want to extract just one column but still maintain the dimensionality, we can pass a list with only one column name:
data[["x"]].head()
If we check the shape and dimensionality of this, we see that it is a $(50 \times 1)$ structure with two dimensions:
data[["x"]].shape
data[["x"]].ndim
Final comparison
Finally, to reiterate, the difference between
data["x"].head()
and
data[["x"]].head()
is not really in the data itself, but in the mathematical structure. One is a vector and the other is a matrix. One is one-dimensional and the other is two-dimensional.
data["x"].ndim
data[["x"]].ndim
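To connect this back to fit(), the sketch below passes both forms to a scikit-learn model. It makes some assumptions: it uses a small made-up DataFrame rather than the course data, and LinearRegression is chosen only as a convenient example estimator. The variable names are illustrative.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data following y = 2x + 1 (hypothetical values, not the course data)
toy = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0], "y": [1.0, 3.0, 5.0, 7.0]})

X_2d = toy[["x"]]  # DataFrame: two-dimensional, one column
X_1d = toy["x"]    # Series: one-dimensional
y = toy["y"]       # Series: one-dimensional, as expected for y

# The values are identical; only the structure differs
assert X_2d["x"].equals(X_1d)

model = LinearRegression()
model.fit(X_2d, y)  # a two-dimensional X is accepted

try:
    LinearRegression().fit(X_1d, y)  # a one-dimensional X is not
    rejected = False
except ValueError:
    rejected = True

print("1D X rejected:", rejected)
```

The values passed in are the same in both calls, so the failure in the second call comes purely from the dimensionality of X.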