How to drop (e.g remove) one or multiple columns in a pandas DataFrame in python ?

Published: November 01, 2019

Tags: Python; Pandas; DataFrame; Drop;

DMCA.com Protection Status

Examples of how to drop one or multiple columns in a pandas DataFrame in python:

Remove one column

Lets create a simple dataframe

>>> import pandas as pd
>>> import numpy as np
>>> data = np.random.randint(100, size=(10,10))
>>> data
array([[55, 38, 97, 91, 22, 12, 87, 55, 38, 91],
       [44, 39, 64, 45, 33, 26, 28,  9, 86, 40],
       [10, 98, 55,  3, 99, 41, 25,  8, 36, 47],
       [84, 82, 55, 73, 29, 28, 85, 99,  7, 92],
       [84, 38, 45, 75,  9, 45, 89, 19, 62, 57],
       [75, 62, 86, 46, 82, 25, 83, 61, 74, 17],
       [69, 23, 50, 67,  6, 36, 28, 30,  4, 23],
       [86, 32, 97, 10, 30, 99, 42, 57, 11, 83],
       [31, 47, 89, 91, 34, 94, 34, 93, 32, 78],
       [78, 74, 95, 45, 24, 22, 89, 61, 60, 85]])
>>> df = pd.DataFrame(data=data,columns=['c1','c2','c3','c4','c5','c6','c7','c8','c9','c10'])
>>> df
   c1  c2  c3  c4  c5  c6  c7  c8  c9  c10
0  55  38  97  91  22  12  87  55  38   91
1  44  39  64  45  33  26  28   9  86   40
2  10  98  55   3  99  41  25   8  36   47
3  84  82  55  73  29  28  85  99   7   92
4  84  38  45  75   9  45  89  19  62   57
5  75  62  86  46  82  25  83  61  74   17
6  69  23  50  67   6  36  28  30   4   23
7  86  32  97  10  30  99  42  57  11   83
8  31  47  89  91  34  94  34  93  32   78
9  78  74  95  45  24  22  89  61  60   85

Remove the column 7

>>> df.drop('c7',1,inplace=True)

returns

>>> df
   c1  c2  c3  c4  c5  c6  c8  c9  c10
0  55  38  97  91  22  12  55  38   91
1  44  39  64  45  33  26   9  86   40
2  10  98  55   3  99  41   8  36   47
3  84  82  55  73  29  28  99   7   92
4  84  38  45  75   9  45  19  62   57
5  75  62  86  46  82  25  61  74   17
6  69  23  50  67   6  36  30   4   23
7  86  32  97  10  30  99  57  11   83
8  31  47  89  91  34  94  93  32   78
9  78  74  95  45  24  22  61  60   85

Remove a list of columns

>>> df = pd.DataFrame(data=data,columns=['c1','c2','c3','c4','c5','c6','c7','c8','c9','c10'])
>>> df.drop(['c1','c3','c7'],1,inplace=True)
>>> df
   c2  c4  c5  c6  c8  c9  c10
0  38  91  22  12  55  38   91
1  39  45  33  26   9  86   40
2  98   3  99  41   8  36   47
3  82  73  29  28  99   7   92
4  38  75   9  45  19  62   57
5  62  46  82  25  61  74   17
6  23  67   6  36  30   4   23
7  32  10  30  99  57  11   83
8  47  91  34  94  93  32   78
9  74  45  24  22  61  60   85

Remove multiple consecutive columns

>>> df = pd.DataFrame(data=data,columns=['c1','c2','c3','c4','c5','c6','c7','c8','c9','c10'])
>>> df.drop(df.iloc[:,3:7],1,inplace=True)
>>> df
   c1  c2  c3  c8  c9  c10
0  55  38  97  55  38   91
1  44  39  64   9  86   40
2  10  98  55   8  36   47
3  84  82  55  99   7   92
4  84  38  45  19  62   57
5  75  62  86  61  74   17
6  69  23  50  30   4   23
7  86  32  97  57  11   83
8  31  47  89  93  32   78
9  78  74  95  61  60   85

Remove columns with misssing data (NAN ou NULL)

Lets consider the following dataset train.csv (that can be downloaded on kaggle). To read the file a solution is to use read_csv():

>>> import pandas as pd
>>> data = pd.read_csv('train.csv')

Get dataframe shape

>>> data.shape
(1460, 81)

Get a dataset preview:

>>> df.head()
   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg   
1   2          20       RL         80.0     9600   Pave   NaN      Reg   
2   3          60       RL         68.0    11250   Pave   NaN      IR1   
3   4          70       RL         60.0     9550   Pave   NaN      IR1   
4   5          60       RL         84.0    14260   Pave   NaN      IR1

  LandContour Utilities    ...     PoolArea PoolQC Fence MiscFeature MiscVal  \
0         Lvl    AllPub    ...            0    NaN   NaN         NaN       0   
1         Lvl    AllPub    ...            0    NaN   NaN         NaN       0   
2         Lvl    AllPub    ...            0    NaN   NaN         NaN       0   
3         Lvl    AllPub    ...            0    NaN   NaN         NaN       0   
4         Lvl    AllPub    ...            0    NaN   NaN         NaN       0

  MoSold YrSold  SaleType  SaleCondition  SalePrice  
0      2   2008        WD         Normal     208500  
1      5   2007        WD         Normal     181500  
2      9   2008        WD         Normal     223500  
3      2   2006        WD        Abnorml     140000  
4     12   2008        WD         Normal     250000

Print the number of missing data per columns:

>>> column_with_nan = df.columns[df.isnull().any()]
>>> df.shape
(1460, 81)
>>> for column in column_with_nan:
...     print(column, df[column].isnull().sum())
... 
LotFrontage 259
Alley 1369
MasVnrType 8
MasVnrArea 8
BsmtQual 37
BsmtCond 37
BsmtExposure 38
BsmtFinType1 37
BsmtFinType2 38
Electrical 1
FireplaceQu 690
GarageType 81
GarageYrBlt 81
GarageFinish 81
GarageQual 81
GarageCond 81
PoolQC 1453
Fence 1179
MiscFeature 1406

Remove columns that have more than 50% of missing data:

>>> for column in column_with_nan:
...     if df[column].isnull().sum()*100.0/df_shape[0] > 50:
...             df.drop(column,1, inplace=True)
... 
>>> df.shape
(1460, 77)

References