Applied Data Analysis in Python

Let's find the most negative and the most positive correlations between the features of the California housing data, ignoring each feature's correlation with itself.

In [1]:
from pandas import DataFrame
from sklearn.datasets import fetch_california_housing

housing_data = fetch_california_housing()
housing = DataFrame(housing_data.data, columns=housing_data.feature_names)

corr = housing.corr()

corr
Out[1]:
               MedInc   HouseAge   AveRooms  AveBedrms Population   AveOccup   Latitude  Longitude
MedInc       1.000000  -0.119034   0.326895  -0.062040   0.004834   0.018766  -0.079809  -0.015176
HouseAge    -0.119034   1.000000  -0.153277  -0.077747  -0.296244   0.013191   0.011173  -0.108197
AveRooms     0.326895  -0.153277   1.000000   0.847621  -0.072213  -0.004852   0.106389  -0.027540
AveBedrms   -0.062040  -0.077747   0.847621   1.000000  -0.066197  -0.006181   0.069721   0.013344
Population   0.004834  -0.296244  -0.072213  -0.066197   1.000000   0.069863  -0.108785   0.099773
AveOccup     0.018766   0.013191  -0.004852  -0.006181   0.069863   1.000000   0.002366   0.002476
Latitude    -0.079809   0.011173   0.106389   0.069721  -0.108785   0.002366   1.000000  -0.924664
Longitude   -0.015176  -0.108197  -0.027540   0.013344   0.099773   0.002476  -0.924664   1.000000

Most negative correlation

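Find the most negative correlation in each column. The 1.0 values on the diagonal are the largest possible correlation, so they can never be a column minimum and need no special handling here: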

In [2]:
corr.min()
Out[2]:
MedInc       -0.119034
HouseAge     -0.296244
AveRooms     -0.153277
AveBedrms    -0.077747
Population   -0.296244
AveOccup     -0.006181
Latitude     -0.924664
Longitude    -0.924664
dtype: float64

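Find the column that contains the lowest of these values. Latitude and Longitude tie because the matrix is symmetric, and idxmin returns the first matching label: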

In [3]:
corr.min().idxmin()
Out[3]:
'Latitude'

Extract the Latitude column and get the index of the most negative value in it:

In [4]:
corr[corr.min().idxmin()].idxmin()
Out[4]:
'Longitude'

The most negative correlation is therefore between:

In [5]:
corr.min().idxmin(), corr[corr.min().idxmin()].idxmin()
Out[5]:
('Latitude', 'Longitude')

with the value:

In [6]:
corr.min().min()
Out[6]:
-0.9246644339150366
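
The chained lookups above can also be collapsed into a single step. The following is a minimal sketch under stated assumptions: it uses a small made-up DataFrame (the hypothetical names toy and toy_corr are not part of the housing example) so that it runs without downloading any data. Calling unstack() on the correlation matrix gives a Series indexed by (column, row) label pairs, so idxmin() returns the pair directly:

```python
import numpy as np
from pandas import DataFrame

# Hypothetical toy data for illustration, not the housing dataset
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = DataFrame({
    "a": x,
    "b": -x + rng.normal(scale=0.1, size=200),  # strongly anti-correlated with a
    "c": rng.normal(size=200),                  # unrelated noise
})

toy_corr = toy.corr()

# Flatten the square matrix to a Series keyed by (column, row) label pairs
pairs = toy_corr.unstack()

most_negative_pair = pairs.idxmin()   # a tuple of two labels
most_negative_value = pairs.min()
print(most_negative_pair, most_negative_value)
```

Because the matrix is symmetric, each pair appears twice in the flattened Series, and idxmin() reports the first of the two.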

Most positive correlation

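First we need to replace the 1.0 values on the diagonal with NaN, since a self-correlation would otherwise always be the column maximum. max and idxmax skip NaN by default: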

In [7]:
import numpy as np

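# Mask the diagonal with NaN. mask() returns a new DataFrame, so this does
# not depend on corr.values being writable (it may be read-only under copy-on-write)
corr = corr.mask(np.eye(len(corr), dtype=bool))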
corr
Out[7]:
               MedInc   HouseAge   AveRooms  AveBedrms Population   AveOccup   Latitude  Longitude
MedInc            NaN  -0.119034   0.326895  -0.062040   0.004834   0.018766  -0.079809  -0.015176
HouseAge    -0.119034        NaN  -0.153277  -0.077747  -0.296244   0.013191   0.011173  -0.108197
AveRooms     0.326895  -0.153277        NaN   0.847621  -0.072213  -0.004852   0.106389  -0.027540
AveBedrms   -0.062040  -0.077747   0.847621        NaN  -0.066197  -0.006181   0.069721   0.013344
Population   0.004834  -0.296244  -0.072213  -0.066197        NaN   0.069863  -0.108785   0.099773
AveOccup     0.018766   0.013191  -0.004852  -0.006181   0.069863        NaN   0.002366   0.002476
Latitude    -0.079809   0.011173   0.106389   0.069721  -0.108785   0.002366        NaN  -0.924664
Longitude   -0.015176  -0.108197  -0.027540   0.013344   0.099773   0.002476  -0.924664        NaN
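
The same approach as before, with max and idxmax in place of min and idxmin, gives the most positive correlation, which is between:

In [8]: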
corr.max().idxmax(), corr[corr.max().idxmax()].idxmax()
Out[8]:
('AveRooms', 'AveBedrms')

with the value:

In [9]:
corr.max().max()
Out[9]:
0.8476213257130424
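
As an alternative to overwriting the diagonal, both extremes can be read from a single ranked list of unique pairs. This is only a sketch under stated assumptions: the data is again a small made-up DataFrame (toy, toy_corr, upper and ranked are hypothetical names introduced here), and the strict upper triangle of the matrix is selected so that each pair appears once and self-correlations are excluded:

```python
import numpy as np
from pandas import DataFrame

# Hypothetical toy data for illustration, not the housing dataset
rng = np.random.default_rng(1)
x = rng.normal(size=200)
toy = DataFrame({
    "a": x,
    "b": x + rng.normal(scale=0.1, size=200),  # strongly correlated with a
    "c": rng.normal(size=200),                 # unrelated noise
})
toy_corr = toy.corr()

# True strictly above the diagonal: every pair once, no self-correlations
upper = np.triu(np.ones(toy_corr.shape, dtype=bool), k=1)

# where() keeps the selected cells and sets the rest to NaN, unstack()
# flattens to a Series keyed by (column, row), dropna() removes masked cells
ranked = toy_corr.where(upper).unstack().dropna().sort_values()

most_negative = ranked.index[0], ranked.iloc[0]
most_positive = ranked.index[-1], ranked.iloc[-1]
print(most_positive)
```

The ranked Series also makes it easy to look past the single extreme value, for example with ranked.tail(3) for the three most positive pairs.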