import pandas as pd
import seaborn as sns
titanic = pd.read_csv("https://milliams.com/courses/data_analysis_python/titanic.csv")
titanic
titanic["age"].mean()
all_males = titanic[titanic["gender"] == "male"]
all_males["age"].mean()
titanic[titanic["class"] == "3rd"]
The technique shown in class was to combine multiple selectors with |:
passengers = titanic[
    (titanic["class"] == "1st") |
    (titanic["class"] == "2nd") |
    (titanic["class"] == "3rd")
]
However, it is also possible to use the isin method to select from a list of matching options:
passengers = titanic[titanic["class"].isin(["1st", "2nd", "3rd"])]
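As a quick check that the two approaches select the same rows, here is a minimal sketch on a small made-up table. The data and the names toy, with_or and with_isin are invented for illustration and are not part of the Titanic dataset.

```python
import pandas as pd

# Hypothetical toy data, standing in for the real titanic table
toy = pd.DataFrame({
    "class": ["1st", "2nd", "3rd", "deck crew", "1st"],
    "age": [29.0, 35.0, 22.0, 40.0, 51.0],
})

# Selector built by chaining comparisons with |
with_or = toy[
    (toy["class"] == "1st") |
    (toy["class"] == "2nd") |
    (toy["class"] == "3rd")
]

# Same selection expressed with isin
with_isin = toy[toy["class"].isin(["1st", "2nd", "3rd"])]

# Compare the two results row by row, including the index
same_rows = with_or.equals(with_isin)
print(same_rows)
```

The isin form is usually easier to read and to extend once the list of options grows beyond two or three values.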
Using displot with age as the main variable shows the distribution. You can overlay the two genders using hue="gender". To simplify the view, you can set kind="kde". Since KDE mode smooths the data, the curves can extend past the observed range, so you can also set cut=0 to truncate them at the data limits and avoid showing negative ages:
sns.displot(
    data=passengers,
    x="age",
    hue="gender",
    kind="kde",
    cut=0
)
All that has changed from the last plot is the addition of a split by class over multiple columns:
sns.displot(
    data=passengers,
    x="age",
    hue="gender",
    kind="kde",
    cut=0,
    col="class",
    col_order=["1st", "2nd", "3rd"]
)
To reduce the duplication of effort here, I create a function which, given a set of data, calculates the fraction of passengers in it who survived. It is then called three times, once for each class:
def survived_ratio(df):
    # Fraction of rows in df whose "survived" column is "yes"
    yes = df[df["survived"] == "yes"]
    return len(yes) / len(df)
ratio_1st = survived_ratio(passengers[passengers["class"] == "1st"])
ratio_2nd = survived_ratio(passengers[passengers["class"] == "2nd"])
ratio_3rd = survived_ratio(passengers[passengers["class"] == "3rd"])
print(ratio_1st, ratio_2nd, ratio_3rd)
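An alternative, sketched here rather than taken from the course material, is to let pandas do the per-class split with groupby. The mean of a boolean column is the fraction of True values, so it gives the same number as len(yes) / len(df). The toy table and the names is_survivor and ratios below are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the titanic table
toy = pd.DataFrame({
    "class": ["1st", "1st", "2nd", "2nd", "3rd", "3rd", "3rd", "3rd"],
    "survived": ["yes", "yes", "yes", "no", "yes", "no", "no", "no"],
})

# True where the passenger survived; the mean of booleans is the fraction of True
is_survivor = toy["survived"] == "yes"

# Group the boolean column by class and average within each group
ratios = is_survivor.groupby(toy["class"]).mean()
print(ratios)
```

This avoids repeating the filter for each class and picks up any additional classes present in the data automatically, at the cost of being a little less explicit than the function-based version.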