数据来源:House Prices - Advanced Regression Techniques
参考文献:
1. 导入数据
import pandas as pd
import warnings
warnings.filterwarnings('ignore') # 忽略警告
df_train = pd.read_csv('./house-prices-advanced-regression-techniques/train.csv')
df_train.columns
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
'SaleCondition', 'SalePrice'],
dtype='object')
df_train['SalePrice'].describe()
count 1460.000000
mean 180921.195890
std 79442.502883
min 34900.000000
25% 129975.000000
50% 163000.000000
75% 214000.000000
max 755000.000000
Name: SalePrice, dtype: float64
2. 变量分析
import seaborn as sns
# 绘制售价的直方图
sns.distplot(df_train['SalePrice']);
# 输出偏度和峰度
print("Skewness: %f" % df_train['SalePrice'].skew())
print("Kurtosis: %f" % df_train['SalePrice'].kurt())
Skewness: 1.882876
Kurtosis: 6.536282
- 偏度(Skewness)是一种衡量随机变量概率分布的偏斜方向和程度的度量,是统计数据分布非对称程度的数字特征。偏度可以用来反映数据分布相对于对称分布的偏斜程度。偏度的取值范围为 (?∞,+∞),完全服从正态分布的数据的偏度值为0,偏度值越大,表示数据分布的右侧尾部较长和较厚,称为右偏态或正偏态;偏度值越小,表示数据分布的左侧尾部较长和较厚,称为左偏态或负偏态。
- 峰度(Kurtosis)是一种衡量随机变量概率分布的峰態的指标。峰度高就意味着方差增大是由低频度的大于或小于平均值的极端差值引起的。峰度可以用来度量数据分布的平坦度(flatness),即数据取值分布形态陡缓程度的统计量。
# 绘制GrLivArea(生活面积)的散点图
var = 'GrLivArea'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0, 800000));
# 绘制TotalBsmtSF(地下室面积)的散点图
var = 'TotalBsmtSF'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0, 800000));
import matplotlib.pyplot as plt
# 绘制OverallQual(房屋整体材料和装修质量)的箱线图
var = 'OverallQual'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y='SalePrice', data=data)
fig.axis(ymin=0, ymax=800000);
# 绘制YearBuilt(建造年份)的箱线图
var = 'YearBuilt'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(x=var, y='SalePrice', data=data)
fig.axis(ymin=0, ymax=800000)
plt.xticks(rotation=90); # 旋转x轴标签90度
GrLivArea
和TotalBsmtSF
与SalePrice
呈线性关系。TotalBsmtSF
的斜率更大,说明地下室面积对售价的影响更大。OverallQual
和YearBuilt
与SalePrice
呈正相关关系。