数据清洗¶

导入数据¶

导入 Titanic 数据集

[1]:

import numpy as np
import pandas as pd
import random as rnd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

titanic = pd.read_csv('../data/titanic.csv')
titanic.head()

[1]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

查看数据集¶

通过 pd.info() 方法能够访问数据集的简明摘要：

[2]:

titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

通过 pd.describe() 方法能够房屋数据集的描述性统计数据：

[3]:

titanic.describe()

[3]:

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

观察发现，数据集中 Ticket 缺乏实际意义，Cabin 存在大量缺失值，可以考虑删除这两列数据，在 Pandas 中可以用过 pd.drop() 实现：

[4]:

titanic = titanic.drop(['Ticket', 'Cabin'], axis=1)

titanic.head()

[4]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Fare	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	7.2500	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	71.2833	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	7.9250	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	53.1000	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	8.0500	S

缺失值补全¶

观察发现，数据集中 Embraked 存在部分缺失值，可以考虑使用 Embraked 中出现次数最多的元素进行补全，缺失值补全可以通过 pd.fillna() 方法实现，pd.dropna() 返回删除缺失值的数据序列，pd.mode() 以列表形式返回序列中出现次数最多的元素。

[5]:

titanic.Embarked = titanic.Embarked.fillna(titanic.Embarked.dropna().mode()[0])

titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Embarked       891 non-null object
dtypes: float64(2), int64(5), object(3)
memory usage: 69.7+ KB

用 ‘Age’ 数据中均值补全缺失值，结果如下：

[6]:

titanic.Age = titanic.Age.fillna(titanic.Age.dropna().mean())

titanic.Age = titanic.Age.astype(int)
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null int64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Embarked       891 non-null object
dtypes: float64(1), int64(6), object(3)
memory usage: 69.7+ KB

多列数据合为一列¶

观察发现，数据集中 SibSp 和 Parch 都是反应家庭人数的特征变量，可以将二者相加生成新的特征变量：

[7]:

titanic['FamilySize'] = titanic['SibSp'] + titanic['Parch']
titanic = titanic.drop(['SibSp', 'Parch'], axis=1)

titanic.head()

[7]:

	PassengerId	Survived	Pclass	Name	Sex	Age	Fare	Embarked	FamilySize
0	1	0	3	Braund, Mr. Owen Harris	male	22	7.2500	S	1
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38	71.2833	C	1
2	3	1	3	Heikkinen, Miss. Laina	female	26	7.9250	S	0
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	53.1000	S	1
4	5	0	3	Allen, Mr. William Henry	male	35	8.0500	S	0

文本数据提取¶

观察发现，数据集中 Name 数据中存在称呼称呼，可以考虑提取出来，具体方法如下：

[8]:

def Call(x):
    return x.split(',')[1].split('.')[0][1:]
titanic['Call'] = titanic.Name.apply(Call)
titanic = titanic.drop(['Name'], axis=1)

titanic.head()

[8]:

	PassengerId	Survived	Pclass	Sex	Age	Fare	Embarked	FamilySize	Call
0	1	0	3	male	22	7.2500	S	1	Mr
1	2	1	1	female	38	71.2833	C	1	Mrs
2	3	1	3	female	26	7.9250	S	0	Miss
3	4	1	1	female	35	53.1000	S	1	Mrs
4	5	0	3	male	35	8.0500	S	0	Mr

通过 pd.unique() 返回数据集中 Call 的唯一值：

[9]:

titanic.Call.unique()

[9]:

array(['Mr', 'Mrs', 'Miss', 'Master', 'Don', 'Rev', 'Dr', 'Mme', 'Ms',
       'Major', 'Lady', 'Sir', 'Mlle', 'Col', 'Capt', 'the Countess',
       'Jonkheer'], dtype=object)

观察发现，Call 数据中唯一值过多，不利于模型的建立，使用 pd.groupdy方法对 Call 进行分组统计数量，使用 pd.sort_values() 对统计数据进行排序，结构如下：

[10]:

titanic[['Call','Survived']].groupby('Call').count().sort_values(by='Survived', ascending=False)

[10]:

	Survived
Call
Mr	517
Miss	182
Mrs	125
Master	40
Dr	7
Rev	6
Major	2
Col	2
Mlle	2
Sir	1
Ms	1
Capt	1
Mme	1
Lady	1
Jonkheer	1
Don	1
the Countess	1

观察发现，Call 数据中存在部分标记错误，['Mr','Miss','Mrs','Master'] 等元素出现的次数远高于其他元素，可以考虑将其他元素归为 other，可通过 pd.replace() 方法进行替换，结果如下：

[11]:

titanic.Call = titanic.Call.replace('Mlle', 'Miss')
titanic.Call = titanic.Call.replace('Ms', 'Miss')
titanic.Call = titanic.Call.replace('Mme', 'Mrs')
titanic.Call = titanic.Call.replace(titanic.Call.unique()[4:],'other')

titanic[['Call','Survived']].groupby('Call').mean().sort_values(by='Survived', ascending=False)

[11]:

	Survived
Call
Mrs	0.793651
Miss	0.702703
Master	0.575000
other	0.347826
Mr	0.156673

文本数据数值化¶

文本数据不利于模型的建立，可通过 pd.map() 方法将文本数据数值化，结果如下：

[12]:

titanic['Call'] = titanic.Call.map(dict(zip(titanic.Call.unique().tolist(),range(1,6))))
titanic['Sex'] = titanic.Sex.map(dict(zip(titanic.Sex.unique().tolist(),range(2))))
titanic['Embarked'] = titanic.Embarked.map(dict(zip(titanic.Embarked.unique().tolist(),range(1,4))))

titanic.head()

[12]:

	PassengerId	Survived	Pclass	Sex	Age	Fare	Embarked	FamilySize	Call
0	1	0	3	0	22	7.2500	1	1	1
1	2	1	1	1	38	71.2833	2	1	2
2	3	1	3	1	26	7.9250	1	0	3
3	4	1	1	1	35	53.1000	1	1	2
4	5	0	3	0	35	8.0500	1	0	1

连续数据离散化¶

连续性数据不利于分类模型的建立，可通过 pd.cut() 方法将连续数据离散化，将 Age 数据离散化结果如下：

[13]:

titanic['Age_Band'] = pd.cut(titanic.Age, bins=5)

titanic[['Age_Band', 'Survived']].groupby(['Age_Band'], as_index=False).mean().sort_values(by='Age_Band', ascending=True)

[13]:

	Age_Band	Survived
0	(-0.08, 16.0]	0.550000
1	(16.0, 32.0]	0.344762
2	(32.0, 48.0]	0.403226
3	(48.0, 64.0]	0.434783
4	(64.0, 80.0]	0.090909

通过 pd.cat.codes 方法将分段数据数值化，结果如下：

[14]:

titanic.Age = titanic.Age_Band.cat.codes

titanic[['Age', 'Survived']].groupby(['Age'], as_index=False).mean().sort_values(by='Age', ascending=True)

[14]:

	Age	Survived
0	0	0.550000
1	1	0.344762
2	2	0.403226
3	3	0.434783
4	4	0.090909

同理可将 Fare 数据离散化，结果如下：

[15]:

titanic['Fare_Band'] = pd.cut(titanic.Fare, bins=2)
titanic.Fare = titanic.Fare_Band.cat.codes

titanic[['Fare', 'Survived']].groupby(['Fare'], as_index=False).mean().sort_values(by='Survived', ascending=False)

[15]:

	Fare	Survived
1	1	0.777778
0	0	0.379819

最终结果¶

经过上述步骤后，结果如下：

[16]:

titanic = titanic.drop(['Age_Band', 'Fare_Band'], axis=1)

titanic.head()

[16]:

	PassengerId	Survived	Pclass	Sex	Age	Embarked	FamilySize	Call
0	1	0	3	0	1	1	1	1
1	2	1	1	1	2	2	1	2
2	3	1	3	1	1	1	0	3
3	4	1	1	1	2	1	1	2
4	5	0	3	0	2	1	0	1