数据清洗

导入数据

导入 Titanic 数据集

[1]:
import numpy as np
import pandas as pd
import random as rnd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

titanic = pd.read_csv('../data/titanic.csv')
titanic.head()
[1]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

查看数据集

通过 pd.info() 方法能够访问数据集的简明摘要:

[2]:
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

通过 pd.describe() 方法能够房屋数据集的描述性统计数据:

[3]:
titanic.describe()
[3]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

观察发现,数据集中 Ticket 缺乏实际意义,Cabin 存在大量缺失值,可以考虑删除这两列数据,在 Pandas 中可以用过 pd.drop() 实现:

[4]:
titanic = titanic.drop(['Ticket', 'Cabin'], axis=1)

titanic.head()
[4]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Fare Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 7.2500 S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 71.2833 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 7.9250 S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 53.1000 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 8.0500 S

缺失值补全

观察发现,数据集中 Embraked 存在部分缺失值,可以考虑使用 Embraked 中出现次数最多的元素进行补全,缺失值补全可以通过 pd.fillna() 方法实现,pd.dropna() 返回删除缺失值的数据序列,pd.mode() 以列表形式返回序列中出现次数最多的元素。

[5]:
titanic.Embarked = titanic.Embarked.fillna(titanic.Embarked.dropna().mode()[0])

titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Embarked       891 non-null object
dtypes: float64(2), int64(5), object(3)
memory usage: 69.7+ KB

用 ‘Age’ 数据中均值补全缺失值,结果如下:

[6]:
titanic.Age = titanic.Age.fillna(titanic.Age.dropna().mean())

titanic.Age = titanic.Age.astype(int)
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null int64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Embarked       891 non-null object
dtypes: float64(1), int64(6), object(3)
memory usage: 69.7+ KB

多列数据合为一列

观察发现,数据集中 SibSpParch 都是反应家庭人数的特征变量,可以将二者相加生成新的特征变量:

[7]:
titanic['FamilySize'] = titanic['SibSp'] + titanic['Parch']
titanic = titanic.drop(['SibSp', 'Parch'], axis=1)

titanic.head()
[7]:
PassengerId Survived Pclass Name Sex Age Fare Embarked FamilySize
0 1 0 3 Braund, Mr. Owen Harris male 22 7.2500 S 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 71.2833 C 1
2 3 1 3 Heikkinen, Miss. Laina female 26 7.9250 S 0
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 53.1000 S 1
4 5 0 3 Allen, Mr. William Henry male 35 8.0500 S 0

文本数据提取

观察发现,数据集中 Name 数据中存在称呼称呼,可以考虑提取出来,具体方法如下:

[8]:
def Call(x):
    return x.split(',')[1].split('.')[0][1:]
titanic['Call'] = titanic.Name.apply(Call)
titanic = titanic.drop(['Name'], axis=1)

titanic.head()
[8]:
PassengerId Survived Pclass Sex Age Fare Embarked FamilySize Call
0 1 0 3 male 22 7.2500 S 1 Mr
1 2 1 1 female 38 71.2833 C 1 Mrs
2 3 1 3 female 26 7.9250 S 0 Miss
3 4 1 1 female 35 53.1000 S 1 Mrs
4 5 0 3 male 35 8.0500 S 0 Mr

通过 pd.unique() 返回数据集中 Call 的唯一值:

[9]:
titanic.Call.unique()
[9]:
array(['Mr', 'Mrs', 'Miss', 'Master', 'Don', 'Rev', 'Dr', 'Mme', 'Ms',
       'Major', 'Lady', 'Sir', 'Mlle', 'Col', 'Capt', 'the Countess',
       'Jonkheer'], dtype=object)

观察发现,Call 数据中唯一值过多,不利于模型的建立,使用 pd.groupdy方法对 Call 进行分组统计数量,使用 pd.sort_values() 对统计数据进行排序,结构如下:

[10]:
titanic[['Call','Survived']].groupby('Call').count().sort_values(by='Survived', ascending=False)
[10]:
Survived
Call
Mr 517
Miss 182
Mrs 125
Master 40
Dr 7
Rev 6
Major 2
Col 2
Mlle 2
Sir 1
Ms 1
Capt 1
Mme 1
Lady 1
Jonkheer 1
Don 1
the Countess 1

观察发现,Call 数据中存在部分标记错误,['Mr','Miss','Mrs','Master'] 等元素出现的次数远高于其他元素,可以考虑将其他元素归为 other,可通过 pd.replace() 方法进行替换,结果如下:

[11]:
titanic.Call = titanic.Call.replace('Mlle', 'Miss')
titanic.Call = titanic.Call.replace('Ms', 'Miss')
titanic.Call = titanic.Call.replace('Mme', 'Mrs')
titanic.Call = titanic.Call.replace(titanic.Call.unique()[4:],'other')

titanic[['Call','Survived']].groupby('Call').mean().sort_values(by='Survived', ascending=False)
[11]:
Survived
Call
Mrs 0.793651
Miss 0.702703
Master 0.575000
other 0.347826
Mr 0.156673

文本数据数值化

文本数据不利于模型的建立,可通过 pd.map() 方法将文本数据数值化,结果如下:

[12]:
titanic['Call'] = titanic.Call.map(dict(zip(titanic.Call.unique().tolist(),range(1,6))))
titanic['Sex'] = titanic.Sex.map(dict(zip(titanic.Sex.unique().tolist(),range(2))))
titanic['Embarked'] = titanic.Embarked.map(dict(zip(titanic.Embarked.unique().tolist(),range(1,4))))

titanic.head()
[12]:
PassengerId Survived Pclass Sex Age Fare Embarked FamilySize Call
0 1 0 3 0 22 7.2500 1 1 1
1 2 1 1 1 38 71.2833 2 1 2
2 3 1 3 1 26 7.9250 1 0 3
3 4 1 1 1 35 53.1000 1 1 2
4 5 0 3 0 35 8.0500 1 0 1

连续数据离散化

连续性数据不利于分类模型的建立,可通过 pd.cut() 方法将连续数据离散化,将 Age 数据离散化结果如下:

[13]:
titanic['Age_Band'] = pd.cut(titanic.Age, bins=5)

titanic[['Age_Band', 'Survived']].groupby(['Age_Band'], as_index=False).mean().sort_values(by='Age_Band', ascending=True)
[13]:
Age_Band Survived
0 (-0.08, 16.0] 0.550000
1 (16.0, 32.0] 0.344762
2 (32.0, 48.0] 0.403226
3 (48.0, 64.0] 0.434783
4 (64.0, 80.0] 0.090909

通过 pd.cat.codes 方法将分段数据数值化,结果如下:

[14]:
titanic.Age = titanic.Age_Band.cat.codes

titanic[['Age', 'Survived']].groupby(['Age'], as_index=False).mean().sort_values(by='Age', ascending=True)
[14]:
Age Survived
0 0 0.550000
1 1 0.344762
2 2 0.403226
3 3 0.434783
4 4 0.090909

同理可将 Fare 数据离散化,结果如下:

[15]:
titanic['Fare_Band'] = pd.cut(titanic.Fare, bins=2)
titanic.Fare = titanic.Fare_Band.cat.codes

titanic[['Fare', 'Survived']].groupby(['Fare'], as_index=False).mean().sort_values(by='Survived', ascending=False)
[15]:
Fare Survived
1 1 0.777778
0 0 0.379819

最终结果

经过上述步骤后,结果如下:

[16]:
titanic = titanic.drop(['Age_Band', 'Fare_Band'], axis=1)

titanic.head()
[16]:
PassengerId Survived Pclass Sex Age Fare Embarked FamilySize Call
0 1 0 3 0 1 0 1 1 1
1 2 1 1 1 2 0 2 1 2
2 3 1 3 1 1 0 1 0 3
3 4 1 1 1 2 0 1 1 2
4 5 0 3 0 2 0 1 0 1