数据清洗¶
导入数据¶
导入 Titanic
数据集
[1]:
import numpy as np
import pandas as pd
import random as rnd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
titanic = pd.read_csv('../data/titanic.csv')
titanic.head()
[1]:
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
查看数据集¶
通过 pd.info()
方法能够访问数据集的简明摘要:
[2]:
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
通过 pd.describe()
方法能够房屋数据集的描述性统计数据:
[3]:
titanic.describe()
[3]:
PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
观察发现,数据集中 Ticket
缺乏实际意义,Cabin
存在大量缺失值,可以考虑删除这两列数据,在 Pandas 中可以用过 pd.drop()
实现:
[4]:
titanic = titanic.drop(['Ticket', 'Cabin'], axis=1)
titanic.head()
[4]:
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | 7.2500 | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | 71.2833 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | 7.9250 | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 53.1000 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 8.0500 | S |
缺失值补全¶
观察发现,数据集中 Embraked
存在部分缺失值,可以考虑使用 Embraked
中出现次数最多的元素进行补全,缺失值补全可以通过 pd.fillna()
方法实现,pd.dropna()
返回删除缺失值的数据序列,pd.mode()
以列表形式返回序列中出现次数最多的元素。
[5]:
titanic.Embarked = titanic.Embarked.fillna(titanic.Embarked.dropna().mode()[0])
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Fare 891 non-null float64
Embarked 891 non-null object
dtypes: float64(2), int64(5), object(3)
memory usage: 69.7+ KB
用 ‘Age’ 数据中均值补全缺失值,结果如下:
[6]:
titanic.Age = titanic.Age.fillna(titanic.Age.dropna().mean())
titanic.Age = titanic.Age.astype(int)
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 891 non-null int64
SibSp 891 non-null int64
Parch 891 non-null int64
Fare 891 non-null float64
Embarked 891 non-null object
dtypes: float64(1), int64(6), object(3)
memory usage: 69.7+ KB
多列数据合为一列¶
观察发现,数据集中 SibSp
和 Parch
都是反应家庭人数的特征变量,可以将二者相加生成新的特征变量:
[7]:
titanic['FamilySize'] = titanic['SibSp'] + titanic['Parch']
titanic = titanic.drop(['SibSp', 'Parch'], axis=1)
titanic.head()
[7]:
PassengerId | Survived | Pclass | Name | Sex | Age | Fare | Embarked | FamilySize | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 7.2500 | S | 1 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 71.2833 | C | 1 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 7.9250 | S | 0 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 53.1000 | S | 1 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 8.0500 | S | 0 |
文本数据提取¶
观察发现,数据集中 Name
数据中存在称呼称呼,可以考虑提取出来,具体方法如下:
[8]:
def Call(x):
return x.split(',')[1].split('.')[0][1:]
titanic['Call'] = titanic.Name.apply(Call)
titanic = titanic.drop(['Name'], axis=1)
titanic.head()
[8]:
PassengerId | Survived | Pclass | Sex | Age | Fare | Embarked | FamilySize | Call | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | male | 22 | 7.2500 | S | 1 | Mr |
1 | 2 | 1 | 1 | female | 38 | 71.2833 | C | 1 | Mrs |
2 | 3 | 1 | 3 | female | 26 | 7.9250 | S | 0 | Miss |
3 | 4 | 1 | 1 | female | 35 | 53.1000 | S | 1 | Mrs |
4 | 5 | 0 | 3 | male | 35 | 8.0500 | S | 0 | Mr |
通过 pd.unique()
返回数据集中 Call
的唯一值:
[9]:
titanic.Call.unique()
[9]:
array(['Mr', 'Mrs', 'Miss', 'Master', 'Don', 'Rev', 'Dr', 'Mme', 'Ms',
'Major', 'Lady', 'Sir', 'Mlle', 'Col', 'Capt', 'the Countess',
'Jonkheer'], dtype=object)
观察发现,Call
数据中唯一值过多,不利于模型的建立,使用 pd.groupdy
方法对 Call
进行分组统计数量,使用 pd.sort_values()
对统计数据进行排序,结构如下:
[10]:
titanic[['Call','Survived']].groupby('Call').count().sort_values(by='Survived', ascending=False)
[10]:
Survived | |
---|---|
Call | |
Mr | 517 |
Miss | 182 |
Mrs | 125 |
Master | 40 |
Dr | 7 |
Rev | 6 |
Major | 2 |
Col | 2 |
Mlle | 2 |
Sir | 1 |
Ms | 1 |
Capt | 1 |
Mme | 1 |
Lady | 1 |
Jonkheer | 1 |
Don | 1 |
the Countess | 1 |
观察发现,Call
数据中存在部分标记错误,['Mr','Miss','Mrs','Master']
等元素出现的次数远高于其他元素,可以考虑将其他元素归为 other
,可通过 pd.replace()
方法进行替换,结果如下:
[11]:
titanic.Call = titanic.Call.replace('Mlle', 'Miss')
titanic.Call = titanic.Call.replace('Ms', 'Miss')
titanic.Call = titanic.Call.replace('Mme', 'Mrs')
titanic.Call = titanic.Call.replace(titanic.Call.unique()[4:],'other')
titanic[['Call','Survived']].groupby('Call').mean().sort_values(by='Survived', ascending=False)
[11]:
Survived | |
---|---|
Call | |
Mrs | 0.793651 |
Miss | 0.702703 |
Master | 0.575000 |
other | 0.347826 |
Mr | 0.156673 |
文本数据数值化¶
文本数据不利于模型的建立,可通过 pd.map()
方法将文本数据数值化,结果如下:
[12]:
titanic['Call'] = titanic.Call.map(dict(zip(titanic.Call.unique().tolist(),range(1,6))))
titanic['Sex'] = titanic.Sex.map(dict(zip(titanic.Sex.unique().tolist(),range(2))))
titanic['Embarked'] = titanic.Embarked.map(dict(zip(titanic.Embarked.unique().tolist(),range(1,4))))
titanic.head()
[12]:
PassengerId | Survived | Pclass | Sex | Age | Fare | Embarked | FamilySize | Call | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | 0 | 22 | 7.2500 | 1 | 1 | 1 |
1 | 2 | 1 | 1 | 1 | 38 | 71.2833 | 2 | 1 | 2 |
2 | 3 | 1 | 3 | 1 | 26 | 7.9250 | 1 | 0 | 3 |
3 | 4 | 1 | 1 | 1 | 35 | 53.1000 | 1 | 1 | 2 |
4 | 5 | 0 | 3 | 0 | 35 | 8.0500 | 1 | 0 | 1 |
连续数据离散化¶
连续性数据不利于分类模型的建立,可通过 pd.cut()
方法将连续数据离散化,将 Age
数据离散化结果如下:
[13]:
titanic['Age_Band'] = pd.cut(titanic.Age, bins=5)
titanic[['Age_Band', 'Survived']].groupby(['Age_Band'], as_index=False).mean().sort_values(by='Age_Band', ascending=True)
[13]:
Age_Band | Survived | |
---|---|---|
0 | (-0.08, 16.0] | 0.550000 |
1 | (16.0, 32.0] | 0.344762 |
2 | (32.0, 48.0] | 0.403226 |
3 | (48.0, 64.0] | 0.434783 |
4 | (64.0, 80.0] | 0.090909 |
通过 pd.cat.codes
方法将分段数据数值化,结果如下:
[14]:
titanic.Age = titanic.Age_Band.cat.codes
titanic[['Age', 'Survived']].groupby(['Age'], as_index=False).mean().sort_values(by='Age', ascending=True)
[14]:
Age | Survived | |
---|---|---|
0 | 0 | 0.550000 |
1 | 1 | 0.344762 |
2 | 2 | 0.403226 |
3 | 3 | 0.434783 |
4 | 4 | 0.090909 |
同理可将 Fare
数据离散化,结果如下:
[15]:
titanic['Fare_Band'] = pd.cut(titanic.Fare, bins=2)
titanic.Fare = titanic.Fare_Band.cat.codes
titanic[['Fare', 'Survived']].groupby(['Fare'], as_index=False).mean().sort_values(by='Survived', ascending=False)
[15]:
Fare | Survived | |
---|---|---|
1 | 1 | 0.777778 |
0 | 0 | 0.379819 |
最终结果¶
经过上述步骤后,结果如下:
[16]:
titanic = titanic.drop(['Age_Band', 'Fare_Band'], axis=1)
titanic.head()
[16]:
PassengerId | Survived | Pclass | Sex | Age | Fare | Embarked | FamilySize | Call | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | 0 | 1 | 0 | 1 | 1 | 1 |
1 | 2 | 1 | 1 | 1 | 2 | 0 | 2 | 1 | 2 |
2 | 3 | 1 | 3 | 1 | 1 | 0 | 1 | 0 | 3 |
3 | 4 | 1 | 1 | 1 | 2 | 0 | 1 | 1 | 2 |
4 | 5 | 0 | 3 | 0 | 2 | 0 | 1 | 0 | 1 |