Examples of how to drop one or multiple columns in a pandas DataFrame in python:
Remove one column
Lets create a simple dataframe
>>> import pandas as pd>>> import numpy as np>>> data = np.random.randint(100, size=(10,10))>>> dataarray([[55, 38, 97, 91, 22, 12, 87, 55, 38, 91],[44, 39, 64, 45, 33, 26, 28, 9, 86, 40],[10, 98, 55, 3, 99, 41, 25, 8, 36, 47],[84, 82, 55, 73, 29, 28, 85, 99, 7, 92],[84, 38, 45, 75, 9, 45, 89, 19, 62, 57],[75, 62, 86, 46, 82, 25, 83, 61, 74, 17],[69, 23, 50, 67, 6, 36, 28, 30, 4, 23],[86, 32, 97, 10, 30, 99, 42, 57, 11, 83],[31, 47, 89, 91, 34, 94, 34, 93, 32, 78],[78, 74, 95, 45, 24, 22, 89, 61, 60, 85]])>>> df = pd.DataFrame(data=data,columns=['c1','c2','c3','c4','c5','c6','c7','c8','c9','c10'])>>> dfc1 c2 c3 c4 c5 c6 c7 c8 c9 c100 55 38 97 91 22 12 87 55 38 911 44 39 64 45 33 26 28 9 86 402 10 98 55 3 99 41 25 8 36 473 84 82 55 73 29 28 85 99 7 924 84 38 45 75 9 45 89 19 62 575 75 62 86 46 82 25 83 61 74 176 69 23 50 67 6 36 28 30 4 237 86 32 97 10 30 99 42 57 11 838 31 47 89 91 34 94 34 93 32 789 78 74 95 45 24 22 89 61 60 85
Remove the column 7
>>> df.drop('c7',1,inplace=True)
returns
>>> dfc1 c2 c3 c4 c5 c6 c8 c9 c100 55 38 97 91 22 12 55 38 911 44 39 64 45 33 26 9 86 402 10 98 55 3 99 41 8 36 473 84 82 55 73 29 28 99 7 924 84 38 45 75 9 45 19 62 575 75 62 86 46 82 25 61 74 176 69 23 50 67 6 36 30 4 237 86 32 97 10 30 99 57 11 838 31 47 89 91 34 94 93 32 789 78 74 95 45 24 22 61 60 85
Remove a list of columns
>>> df = pd.DataFrame(data=data,columns=['c1','c2','c3','c4','c5','c6','c7','c8','c9','c10'])>>> df.drop(['c1','c3','c7'],1,inplace=True)>>> dfc2 c4 c5 c6 c8 c9 c100 38 91 22 12 55 38 911 39 45 33 26 9 86 402 98 3 99 41 8 36 473 82 73 29 28 99 7 924 38 75 9 45 19 62 575 62 46 82 25 61 74 176 23 67 6 36 30 4 237 32 10 30 99 57 11 838 47 91 34 94 93 32 789 74 45 24 22 61 60 85
Remove multiple consecutive columns
>>> df = pd.DataFrame(data=data,columns=['c1','c2','c3','c4','c5','c6','c7','c8','c9','c10'])>>> df.drop(df.iloc[:,3:7],1,inplace=True)>>> dfc1 c2 c3 c8 c9 c100 55 38 97 55 38 911 44 39 64 9 86 402 10 98 55 8 36 473 84 82 55 99 7 924 84 38 45 19 62 575 75 62 86 61 74 176 69 23 50 30 4 237 86 32 97 57 11 838 31 47 89 93 32 789 78 74 95 61 60 85
Remove columns with misssing data (NAN ou NULL)
Lets consider the following dataset train.csv (that can be downloaded on kaggle). To read the file a solution is to use read_csv():
>>> import pandas as pd>>> data = pd.read_csv('train.csv')
Get dataframe shape
>>> data.shape(1460, 81)
Get a dataset preview:
>>> df.head()Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape \0 1 60 RL 65.0 8450 Pave NaN Reg1 2 20 RL 80.0 9600 Pave NaN Reg2 3 60 RL 68.0 11250 Pave NaN IR13 4 70 RL 60.0 9550 Pave NaN IR14 5 60 RL 84.0 14260 Pave NaN IR1LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal \0 Lvl AllPub ... 0 NaN NaN NaN 01 Lvl AllPub ... 0 NaN NaN NaN 02 Lvl AllPub ... 0 NaN NaN NaN 03 Lvl AllPub ... 0 NaN NaN NaN 04 Lvl AllPub ... 0 NaN NaN NaN 0MoSold YrSold SaleType SaleCondition SalePrice0 2 2008 WD Normal 2085001 5 2007 WD Normal 1815002 9 2008 WD Normal 2235003 2 2006 WD Abnorml 1400004 12 2008 WD Normal 250000
Print the number of missing data per columns:
>>> column_with_nan = df.columns[df.isnull().any()]>>> df.shape(1460, 81)>>> for column in column_with_nan:... print(column, df[column].isnull().sum())...LotFrontage 259Alley 1369MasVnrType 8MasVnrArea 8BsmtQual 37BsmtCond 37BsmtExposure 38BsmtFinType1 37BsmtFinType2 38Electrical 1FireplaceQu 690GarageType 81GarageYrBlt 81GarageFinish 81GarageQual 81GarageCond 81PoolQC 1453Fence 1179MiscFeature 1406
Remove columns that have more than 50% of missing data:
>>> for column in column_with_nan:... if df[column].isnull().sum()*100.0/df_shape[0] > 50:... df.drop(column,1, inplace=True)...>>> df.shape(1460, 77)
References
| Links | Site |
|---|---|
| Delete column from pandas DataFrame | stackoverflow |
| How do I get a summary count of missing/NaN data by column in 'pandas'? | stackoverflow |
| How to count nan values in a pandas DataFrame?) | stackoverflow |
| How to count the NaN values in a column in pandas DataFrame) | stackoverflow |
| How to find which columns contain any NaN value in Pandas dataframe (python) | stackoverflow |
| isnull | pandas doc |
| any | pandas doc |
