Examples of how to work with missing data (NaN or NULL values) in a pandas DataFrame:
Table of contents
- Create a DataFrame with Pandas
- Find columns with missing data
- Get a list of columns with missing data
- Get the number of missing data per column
- Get the column with the maximum number of missing data
- Get the total number of missing data in the DataFrame
- Remove columns that contain more than 50% of missing data
- Find rows with missing data
- Get a list of rows with missing data
- Get the number of missing data per row
- Get the number of missing data for a given row
- Get the row with the largest number of missing data
- Remove rows with missing data
- References
Create a DataFrame with Pandas
Let's consider the csv file train.csv (which can be downloaded from Kaggle). To read the file, a solution is to use read_csv():
>>> import pandas as pd
>>> df = pd.read_csv('train.csv')
Get the DataFrame shape:
>>> df.shape
(1460, 81)
Get an overview of the first rows of the DataFrame:
>>> df.head()
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape \
0 1 60 RL 65.0 8450 Pave NaN Reg
1 2 20 RL 80.0 9600 Pave NaN Reg
2 3 60 RL 68.0 11250 Pave NaN IR1
3 4 70 RL 60.0 9550 Pave NaN IR1
4 5 60 RL 84.0 14260 Pave NaN IR1
LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal \
0 Lvl AllPub ... 0 NaN NaN NaN 0
1 Lvl AllPub ... 0 NaN NaN NaN 0
2 Lvl AllPub ... 0 NaN NaN NaN 0
3 Lvl AllPub ... 0 NaN NaN NaN 0
4 Lvl AllPub ... 0 NaN NaN NaN 0
MoSold YrSold SaleType SaleCondition SalePrice
0 2 2008 WD Normal 208500
1 5 2007 WD Normal 181500
2 9 2008 WD Normal 223500
3 2 2006 WD Abnorml 140000
4 12 2008 WD Normal 250000
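If train.csv is not at hand, the same commands can also be tried on a small DataFrame built by hand (a minimal sketch with made-up values, only for experimenting):
>>> import pandas as pd
>>> import numpy as np
>>> small_df = pd.DataFrame({'LotFrontage': [65.0, np.nan, 68.0],
...                          'Alley': [np.nan, 'Grvl', np.nan]})
>>> small_df.isnull().sum()
LotFrontage    1
Alley          2
dtype: int64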
Find columns with missing data
To find columns with missing data (with NaN or NULL values), a solution is to use isnull() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isnull.html) combined with any(). Example:
>>> df.isnull().any()
Id False
MSSubClass False
MSZoning False
LotFrontage True
LotArea False
Street False
Alley True
LotShape False
LandContour False
Utilities False
LotConfig False
LandSlope False
Neighborhood False
Condition1 False
Condition2 False
BldgType False
HouseStyle False
OverallQual False
OverallCond False
YearBuilt False
YearRemodAdd False
RoofStyle False
RoofMatl False
Exterior1st False
Exterior2nd False
MasVnrType True
MasVnrArea True
ExterQual False
ExterCond False
Foundation False
...
BedroomAbvGr False
KitchenAbvGr False
KitchenQual False
TotRmsAbvGrd False
Functional False
Fireplaces False
FireplaceQu True
GarageType True
GarageYrBlt True
GarageFinish True
GarageCars False
GarageArea False
GarageQual True
GarageCond True
PavedDrive False
WoodDeckSF False
OpenPorchSF False
EnclosedPorch False
3SsnPorch False
ScreenPorch False
PoolArea False
PoolQC True
Fence True
MiscFeature True
MiscVal False
MoSold False
YrSold False
SaleType False
SaleCondition False
SalePrice False
dtype: bool
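The same test can be applied to a single column, which is handy to check one feature at a time (a short sketch reusing a column flagged True above):
>>> # Alley is one of the columns that contains NaN values
>>> df['Alley'].isnull().any()
True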
Get a list of columns with missing data
It is then possible to get a list of columns with missing values:
>>> df.columns[df.isnull().any()]
Index(['LotFrontage', 'Alley', 'MasVnrType', 'MasVnrArea', 'BsmtQual',
'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
'Electrical', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence',
'MiscFeature'],
dtype='object')
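If a plain Python list is preferred (for example to iterate over it later), the Index returned above can be converted with tolist():
>>> # same 19 column names as above, but as a regular Python list
>>> nan_columns = df.columns[df.isnull().any()].tolist()
>>> len(nan_columns)
19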
Get the number of missing data per column
To get the number of missing data per column, a solution is to use sum():
>>> df.isnull().sum()
Id 0
MSSubClass 0
MSZoning 0
LotFrontage 259
LotArea 0
Street 0
Alley 1369
LotShape 0
LandContour 0
Utilities 0
LotConfig 0
LandSlope 0
Neighborhood 0
Condition1 0
Condition2 0
BldgType 0
HouseStyle 0
OverallQual 0
OverallCond 0
YearBuilt 0
YearRemodAdd 0
RoofStyle 0
RoofMatl 0
Exterior1st 0
Exterior2nd 0
MasVnrType 8
MasVnrArea 8
ExterQual 0
ExterCond 0
Foundation 0
...
BedroomAbvGr 0
KitchenAbvGr 0
KitchenQual 0
TotRmsAbvGrd 0
Functional 0
Fireplaces 0
FireplaceQu 690
GarageType 81
GarageYrBlt 81
GarageFinish 81
GarageCars 0
GarageArea 0
GarageQual 81
GarageCond 81
PavedDrive 0
WoodDeckSF 0
OpenPorchSF 0
EnclosedPorch 0
3SsnPorch 0
ScreenPorch 0
PoolArea 0
PoolQC 1453
Fence 1179
MiscFeature 1406
MiscVal 0
MoSold 0
YrSold 0
SaleType 0
SaleCondition 0
SalePrice 0
Get the column with the maximum number of missing data
To get the column with the largest number of missing data, a solution is to use nlargest(1):
>>> df.isnull().sum().nlargest(1)
PoolQC 1453
dtype: int64
Another example: the 3 columns with the largest number of missing data:
>>> df.isnull().sum().nlargest(3)
PoolQC 1453
MiscFeature 1406
Alley 1369
dtype: int64
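If only the name of that column is needed (rather than the count), idxmax() gives it directly:
>>> # name of the column with the most missing values
>>> df.isnull().sum().idxmax()
'PoolQC'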
Get the total number of missing data in the DataFrame
>>> df.isnull().sum().sum()
6965
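To put that total in perspective, it can be divided by the total number of cells in the DataFrame (df.size, here 1460 * 81); a quick sketch:
>>> total_nan = df.isnull().sum().sum()
>>> print(f"{100.0 * total_nan / df.size:.1f}% of the cells are missing")
5.9% of the cells are missing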
Remove columns that contain more than 50% of missing data
Display columns with missing data:
>>> column_with_nan = df.columns[df.isnull().any()]
>>> df.shape
(1460, 81)
>>> for column in column_with_nan:
...     print(column, df[column].isnull().sum())
...
LotFrontage 259
Alley 1369
MasVnrType 8
MasVnrArea 8
BsmtQual 37
BsmtCond 37
BsmtExposure 38
BsmtFinType1 37
BsmtFinType2 38
Electrical 1
FireplaceQu 690
GarageType 81
GarageYrBlt 81
GarageFinish 81
GarageQual 81
GarageCond 81
PoolQC 1453
Fence 1179
MiscFeature 1406
Remove the columns with more than 50% of missing data from the DataFrame using drop():
>>> for column in column_with_nan:
...     if df[column].isnull().sum() * 100.0 / df.shape[0] > 50:
...         df.drop(column, axis=1, inplace=True)
...
>>> df.shape
(1460, 77)
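An alternative sketch for the same filtering is dropna() with its thresh parameter, which keeps only the columns having at least a given number of non-missing values (here at least 50% of the rows). Starting again from the csv file, it should lead to the same shape:
>>> df2 = pd.read_csv('train.csv')
>>> # keep columns with at least 50% non-NaN values, i.e. drop those with more than 50% NaN
>>> df2 = df2.dropna(axis=1, thresh=int(0.5 * len(df2)))
>>> df2.shape
(1460, 77)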
Find rows with missing data
>>> df.isnull().any(axis=1)
0 True
1 False
2 False
3 False
4 False
5 True
6 False
7 True
8 False
9 False
10 True
11 False
12 True
13 False
14 True
15 True
16 True
17 True
18 True
19 True
20 False
21 False
22 False
23 False
24 True
25 False
26 True
27 False
28 False
29 True
...
1430 False
1431 True
1432 True
1433 False
1434 False
1435 False
1436 True
1437 False
1438 True
1439 False
1440 False
1441 True
1442 False
1443 True
1444 True
1445 True
1446 True
1447 False
1448 True
1449 True
1450 True
1451 False
1452 True
1453 True
1454 True
1455 False
1456 False
1457 False
1458 True
1459 True
dtype: bool
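The boolean Series above can be used directly as a mask to select the rows that contain missing data (a short sketch; the shape is consistent with the counts below):
>>> rows_with_nan = df[df.isnull().any(axis=1)]
>>> rows_with_nan.shape
(877, 77)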
Get a list of rows with missing data
>>> df.index[df.isnull().any(axis=1)]
Int64Index([ 0, 5, 7, 10, 12, 14, 15, 16, 17, 18,
...
1445, 1446, 1448, 1449, 1450, 1452, 1453, 1454, 1458, 1459],
dtype='int64', length=877)
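Since True counts as 1, summing the same mask gives the number of rows with at least one missing value:
>>> df.isnull().any(axis=1).sum()
877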
Get the number of missing data per row
>>> df.isnull().sum(axis=1)
0 1
1 0
2 0
3 0
4 0
5 1
6 0
7 1
8 0
9 0
10 1
11 0
12 2
13 0
14 1
15 1
16 1
17 6
18 1
19 1
20 0
21 0
22 0
23 0
24 1
25 0
26 1
27 0
28 0
29 1
..
1430 0
1431 2
1432 1
1433 0
1434 0
1435 0
1436 1
1437 0
1438 1
1439 0
1440 0
1441 1
1442 0
1443 1
1444 1
1445 1
1446 2
1447 0
1448 1
1449 6
1450 6
1451 0
1452 1
1453 6
1454 1
1455 0
1456 0
1457 0
1458 1
1459 1
dtype: int64
Get the number of missing data for a given row
>>> df.iloc[1453,:].isnull().sum()
6
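Since the DataFrame still has its default integer index at this point, the same count can also be obtained by label with loc (a sketch assuming no rows have been dropped yet):
>>> df.loc[1453].isnull().sum()
6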
Get the row with the largest number of missing data
>>> df.isnull().sum(axis=1).nlargest(1)
39 11
dtype: int64
>>> df.isnull().sum(axis=1).nlargest(3)
39 15
520 11
533 11
dtype: int64
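To get only the index of the row with the most missing data, idxmax() can be used, as for the columns:
>>> # label of the row with the largest number of missing values
>>> df.isnull().sum(axis=1).idxmax()
39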
Remove rows with missing data
>>> index_with_nan = df.index[df.isnull().any(axis=1)]
>>> index_with_nan.shape
(877,)
>>> df.drop(index_with_nan, axis=0, inplace=True)
>>> df.shape
(583, 77)
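A more compact alternative is dropna(), which by default removes every row containing at least one NaN. Redoing the whole pipeline from the csv file as a sketch, it should give the same final shape:
>>> df3 = pd.read_csv('train.csv')
>>> df3 = df3.dropna(axis=1, thresh=int(0.5 * len(df3)))  # drop columns with more than 50% NaN
>>> df3 = df3.dropna()                                     # drop rows with any remaining NaN
>>> df3.shape
(583, 77)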
References
Links | Site |
---|---|
How do I get a summary count of missing/NaN data by column in 'pandas'? | stackoverflow |
How to count nan values in a pandas DataFrame? | stackoverflow |
How to count the NaN values in a column in pandas DataFrame | stackoverflow |
How to find which columns contain any NaN value in Pandas dataframe (python) | stackoverflow |
isnull | pandas doc |
any | pandas doc |