How to filter missing data (NaN or NULL values) in a pandas DataFrame?

Published: October 28, 2019

Tags: Python; Pandas; DataFrame;


Examples of how to work with missing data (NaN or NULL values) in a pandas DataFrame:

Create a DataFrame with Pandas

Let's consider the csv file train.csv (which can be downloaded from Kaggle). To read the file, a solution is to use read_csv():

>>> import pandas as pd
>>> df = pd.read_csv('train.csv')

Get DataFrame shape

>>> df.shape
(1460, 81)

Get an overview of the first rows of the DataFrame:

>>> df.head()
   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg   
1   2          20       RL         80.0     9600   Pave   NaN      Reg   
2   3          60       RL         68.0    11250   Pave   NaN      IR1   
3   4          70       RL         60.0     9550   Pave   NaN      IR1   
4   5          60       RL         84.0    14260   Pave   NaN      IR1

  LandContour Utilities    ...     PoolArea PoolQC Fence MiscFeature MiscVal  \
0         Lvl    AllPub    ...            0    NaN   NaN         NaN       0   
1         Lvl    AllPub    ...            0    NaN   NaN         NaN       0   
2         Lvl    AllPub    ...            0    NaN   NaN         NaN       0   
3         Lvl    AllPub    ...            0    NaN   NaN         NaN       0   
4         Lvl    AllPub    ...            0    NaN   NaN         NaN       0

  MoSold YrSold  SaleType  SaleCondition  SalePrice  
0      2   2008        WD         Normal     208500  
1      5   2007        WD         Normal     181500  
2      9   2008        WD         Normal     223500  
3      2   2006        WD        Abnorml     140000  
4     12   2008        WD         Normal     250000

Find columns with missing data

To find the columns with missing data (NaN or NULL values), a solution is to use isnull() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isnull.html) together with any(), for example:

>>> df.isnull().any()
Id               False
MSSubClass       False
MSZoning         False
LotFrontage       True
LotArea          False
Street           False
Alley             True
LotShape         False
LandContour      False
Utilities        False
LotConfig        False
LandSlope        False
Neighborhood     False
Condition1       False
Condition2       False
BldgType         False
HouseStyle       False
OverallQual      False
OverallCond      False
YearBuilt        False
YearRemodAdd     False
RoofStyle        False
RoofMatl         False
Exterior1st      False
Exterior2nd      False
MasVnrType        True
MasVnrArea        True
ExterQual        False
ExterCond        False
Foundation       False
                 ...  
BedroomAbvGr     False
KitchenAbvGr     False
KitchenQual      False
TotRmsAbvGrd     False
Functional       False
Fireplaces       False
FireplaceQu       True
GarageType        True
GarageYrBlt       True
GarageFinish      True
GarageCars       False
GarageArea       False
GarageQual        True
GarageCond        True
PavedDrive       False
WoodDeckSF       False
OpenPorchSF      False
EnclosedPorch    False
3SsnPorch        False
ScreenPorch      False
PoolArea         False
PoolQC            True
Fence             True
MiscFeature       True
MiscVal          False
MoSold           False
YrSold           False
SaleType         False
SaleCondition    False
SalePrice        False
dtype: bool

Get a list of columns with missing data

It is then possible to get the list of columns with missing values:

>>> df.columns[df.isnull().any()]
Index(['LotFrontage', 'Alley', 'MasVnrType', 'MasVnrArea', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Electrical', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence',
       'MiscFeature'],
      dtype='object')
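To get a plain Python list instead of a pandas Index, tolist() can be chained at the end. A minimal sketch on a toy DataFrame (the column names here are illustrative, not from train.csv):

```python
import pandas as pd
import numpy as np

# Toy DataFrame with missing values in two columns (illustrative data)
df = pd.DataFrame({'a': [1, 2, np.nan],
                   'b': [np.nan, 5, np.nan],
                   'c': [7, 8, 9]})

# Columns containing at least one missing value, as a plain Python list
cols_with_nan = df.columns[df.isnull().any()].tolist()
print(cols_with_nan)  # → ['a', 'b']
```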

Get the number of missing data per column

To get the number of missing values per column, a solution is to use sum():

>>> df.isnull().sum()
Id                  0
MSSubClass          0
MSZoning            0
LotFrontage       259
LotArea             0
Street              0
Alley            1369
LotShape            0
LandContour         0
Utilities           0
LotConfig           0
LandSlope           0
Neighborhood        0
Condition1          0
Condition2          0
BldgType            0
HouseStyle          0
OverallQual         0
OverallCond         0
YearBuilt           0
YearRemodAdd        0
RoofStyle           0
RoofMatl            0
Exterior1st         0
Exterior2nd         0
MasVnrType          8
MasVnrArea          8
ExterQual           0
ExterCond           0
Foundation          0
                 ... 
BedroomAbvGr        0
KitchenAbvGr        0
KitchenQual         0
TotRmsAbvGrd        0
Functional          0
Fireplaces          0
FireplaceQu       690
GarageType         81
GarageYrBlt        81
GarageFinish       81
GarageCars          0
GarageArea          0
GarageQual         81
GarageCond         81
PavedDrive          0
WoodDeckSF          0
OpenPorchSF         0
EnclosedPorch       0
3SsnPorch           0
ScreenPorch         0
PoolArea            0
PoolQC           1453
Fence            1179
MiscFeature      1406
MiscVal             0
MoSold              0
YrSold              0
SaleType            0
SaleCondition       0
SalePrice           0

Get the column with the maximum number of missing data

To get the column with the largest number of missing values, there is the function nlargest(1):

>>> df.isnull().sum().nlargest(1)
PoolQC    1453
dtype: int64

Another example: the 3 columns with the largest number of missing values:

>>> df.isnull().sum().nlargest(3)
PoolQC         1453
MiscFeature    1406
Alley          1369
dtype: int64
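Counts can also be turned into percentages by dividing by the number of rows, which makes columns easier to compare. A sketch on a toy DataFrame (illustrative data, not train.csv), using sort_values() to rank the columns:

```python
import pandas as pd
import numpy as np

# Toy DataFrame (illustrative data)
df = pd.DataFrame({'a': [1, np.nan, np.nan, np.nan],   # 3/4 missing
                   'b': [np.nan, 2, 3, 4],             # 1/4 missing
                   'c': [5, 6, 7, 8]})                 # none missing

# Percentage of missing values per column, largest first
pct_missing = (df.isnull().sum() * 100.0 / len(df)).sort_values(ascending=False)
print(pct_missing)
# a is 75%, b is 25%, c is 0%
```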

Get the total number of missing values in the DataFrame

>>> df.isnull().sum().sum()
6965

Remove columns that contain more than 50% missing data

Display columns with missing data:

>>> column_with_nan = df.columns[df.isnull().any()]
>>> df.shape
(1460, 81)
>>> for column in column_with_nan:
...     print(column, df[column].isnull().sum())
... 
LotFrontage 259
Alley 1369
MasVnrType 8
MasVnrArea 8
BsmtQual 37
BsmtCond 37
BsmtExposure 38
BsmtFinType1 37
BsmtFinType2 38
Electrical 1
FireplaceQu 690
GarageType 81
GarageYrBlt 81
GarageFinish 81
GarageQual 81
GarageCond 81
PoolQC 1453
Fence 1179
MiscFeature 1406

Remove from the DataFrame the columns with more than 50% missing data, using drop():

>>> for column in column_with_nan:
...     if df[column].isnull().sum() * 100.0 / df.shape[0] > 50:
...             df.drop(column, axis=1, inplace=True)
... 
>>> df.shape
(1460, 77)
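The same result can be obtained without an explicit loop, by keeping only the columns whose fraction of missing values is at most 50%. A sketch on a toy DataFrame (illustrative data, not train.csv):

```python
import pandas as pd
import numpy as np

# Toy DataFrame (illustrative data)
df = pd.DataFrame({'a': [1, np.nan, np.nan, np.nan],   # 75% missing
                   'b': [np.nan, 2, 3, 4],             # 25% missing
                   'c': [5, 6, 7, 8]})                 # 0% missing

# isnull().mean() gives the fraction of missing values per column;
# the boolean mask keeps only columns with at most 50% missing
df_filtered = df.loc[:, df.isnull().mean() <= 0.5]
print(df_filtered.columns.tolist())  # → ['b', 'c']
```

A similar effect is possible with dropna(axis=1, thresh=...), where thresh is the minimum number of non-missing values a column must have to be kept.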

Find rows with missing data

>>> df.isnull().any(axis=1)
0        True
1       False
2       False
3       False
4       False
5        True
6       False
7        True
8       False
9       False
10       True
11      False
12       True
13      False
14       True
15       True
16       True
17       True
18       True
19       True
20      False
21      False
22      False
23      False
24       True
25      False
26       True
27      False
28      False
29       True
        ...  
1430    False
1431     True
1432     True
1433    False
1434    False
1435    False
1436     True
1437    False
1438     True
1439    False
1440    False
1441     True
1442    False
1443     True
1444     True
1445     True
1446     True
1447    False
1448     True
1449     True
1450     True
1451    False
1452     True
1453     True
1454     True
1455    False
1456    False
1457    False
1458     True
1459     True
dtype: bool

Get a list of rows with missing data

>>> df.index[df.isnull().any(axis=1)]
Int64Index([   0,    5,    7,   10,   12,   14,   15,   16,   17,   18,
            ...
            1445, 1446, 1448, 1449, 1450, 1452, 1453, 1454, 1458, 1459],
           dtype='int64', length=877)
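The same boolean mask can also be used to select the rows themselves rather than their index, and inverting it with ~ keeps only the complete rows. A sketch on a toy DataFrame (illustrative data, not train.csv):

```python
import pandas as pd
import numpy as np

# Toy DataFrame (illustrative data)
df = pd.DataFrame({'a': [1, np.nan, 3],
                   'b': [4, 5, np.nan]})

rows_with_nan = df[df.isnull().any(axis=1)]    # rows containing at least one NaN
complete_rows = df[~df.isnull().any(axis=1)]   # rows with no missing value

print(rows_with_nan.index.tolist())  # → [1, 2]
print(complete_rows.index.tolist())  # → [0]
```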

Get the number of missing data per row

>>> df.isnull().sum(axis=1)
0       1
1       0
2       0
3       0
4       0
5       1
6       0
7       1
8       0
9       0
10      1
11      0
12      2
13      0
14      1
15      1
16      1
17      6
18      1
19      1
20      0
21      0
22      0
23      0
24      1
25      0
26      1
27      0
28      0
29      1
       ..
1430    0
1431    2
1432    1
1433    0
1434    0
1435    0
1436    1
1437    0
1438    1
1439    0
1440    0
1441    1
1442    0
1443    1
1444    1
1445    1
1446    2
1447    0
1448    1
1449    6
1450    6
1451    0
1452    1
1453    6
1454    1
1455    0
1456    0
1457    0
1458    1
1459    1
dtype: int64

Get the number of missing data for a given row

>>> df.iloc[1453,:].isnull().sum()
6

Get the row with the largest number of missing data

>>> df.isnull().sum(axis=1).nlargest(1)
39    11
dtype: int64
>>> df.isnull().sum(axis=1).nlargest(3)
39     11
520    11
533    11
dtype: int64

Remove rows with missing data

>>> index_with_nan = df.index[df.isnull().any(axis=1)]
>>> index_with_nan.shape
(877,)
>>> df.drop(index_with_nan, axis=0, inplace=True)
>>> df.shape
(583, 77)
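Dropping the flagged index is equivalent to pandas' built-in dropna(), which by default removes every row containing at least one NaN. A sketch on a toy DataFrame (illustrative data, not train.csv):

```python
import pandas as pd
import numpy as np

# Toy DataFrame (illustrative data)
df = pd.DataFrame({'a': [1, np.nan, 3],
                   'b': [4, 5, np.nan]})

# Drop rows via the index of rows containing NaN ...
index_with_nan = df.index[df.isnull().any(axis=1)]
df_dropped = df.drop(index_with_nan)

# ... which is equivalent to the built-in shortcut
df_dropna = df.dropna()

print(df_dropped.equals(df_dropna))  # → True
```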
>>>
