Example of how to extract one or several columns of data in a pandas dataframe
Create a dataframe with pandas
Let's consider the csv file train.csv (that can be downloaded on kaggle). To read the file a solution is to use read_csv():
>>> import pandas as pd
>>> df = pd.read_csv('train.csv')
>>> df.shape
(1460, 81)
Get column names
>>> df.columns
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
'SaleCondition', 'SalePrice'],
dtype='object')
Print column names and associated indexes:
>>> for idx,column in enumerate(df.columns):
... print(idx,column)
...
0 Id
1 MSSubClass
2 MSZoning
3 LotFrontage
4 LotArea
5 Street
6 Alley
7 LotShape
8 LandContour
9 Utilities
10 LotConfig
11 LandSlope
12 Neighborhood
13 Condition1
14 Condition2
15 BldgType
16 HouseStyle
17 OverallQual
18 OverallCond
19 YearBuilt
20 YearRemodAdd
21 RoofStyle
22 RoofMatl
23 Exterior1st
24 Exterior2nd
25 MasVnrType
26 MasVnrArea
27 ExterQual
28 ExterCond
29 Foundation
30 BsmtQual
31 BsmtCond
32 BsmtExposure
33 BsmtFinType1
34 BsmtFinSF1
35 BsmtFinType2
36 BsmtFinSF2
37 BsmtUnfSF
38 TotalBsmtSF
39 Heating
40 HeatingQC
41 CentralAir
42 Electrical
43 1stFlrSF
44 2ndFlrSF
45 LowQualFinSF
46 GrLivArea
47 BsmtFullBath
48 BsmtHalfBath
49 FullBath
50 HalfBath
51 BedroomAbvGr
52 KitchenAbvGr
53 KitchenQual
54 TotRmsAbvGrd
55 Functional
56 Fireplaces
57 FireplaceQu
58 GarageType
59 GarageYrBlt
60 GarageFinish
61 GarageCars
62 GarageArea
63 GarageQual
64 GarageCond
65 PavedDrive
66 WoodDeckSF
67 OpenPorchSF
68 EnclosedPorch
69 3SsnPorch
70 ScreenPorch
71 PoolArea
72 PoolQC
73 Fence
74 MiscFeature
75 MiscVal
76 MoSold
77 YrSold
78 SaleType
79 SaleCondition
80 SalePrice
Extract one column of data
Example 1 of how to extract one column of data (for example "SalePrice"):
>>> df['SalePrice']
0 208500
1 181500
2 223500
3 140000
4 250000
5 143000
6 307000
7 200000
8 129900
9 118000
10 129500
11 345000
12 144000
13 279500
14 157000
15 132000
16 149000
17 90000
18 159000
19 139000
20 325300
21 139400
22 230000
23 129900
24 154000
25 256300
26 134800
27 306000
28 207500
29 68500
...
1430 192140
1431 143750
1432 64500
1433 186500
1434 160000
1435 174000
1436 120500
1437 394617
1438 149700
1439 197000
1440 191000
1441 149300
1442 310000
1443 121000
1444 179600
1445 129000
1446 157900
1447 240000
1448 112000
1449 92000
1450 136000
1451 287090
1452 145000
1453 84500
1454 185000
1455 175000
1456 210000
1457 266500
1458 142125
1459 147500
Name: SalePrice, Length: 1460, dtype: int64
Example 2 using the pandas function loc:
>>> df.loc[:,'SalePrice']
0 208500
1 181500
2 223500
3 140000
4 250000
5 143000
6 307000
7 200000
8 129900
9 118000
10 129500
11 345000
12 144000
13 279500
14 157000
15 132000
16 149000
17 90000
18 159000
19 139000
20 325300
21 139400
22 230000
23 129900
24 154000
25 256300
26 134800
27 306000
28 207500
29 68500
...
1430 192140
1431 143750
1432 64500
1433 186500
1434 160000
1435 174000
1436 120500
1437 394617
1438 149700
1439 197000
1440 191000
1441 149300
1442 310000
1443 121000
1444 179600
1445 129000
1446 157900
1447 240000
1448 112000
1449 92000
1450 136000
1451 287090
1452 145000
1453 84500
1454 185000
1455 175000
1456 210000
1457 266500
1458 142125
1459 147500
Name: SalePrice, dtype: int64
Example 3 using the pandas function iloc (Note: 'SalePrice' corresponds to 80):
>>> df.iloc[:,80]
0 208500
1 181500
2 223500
3 140000
4 250000
5 143000
6 307000
7 200000
8 129900
9 118000
10 129500
11 345000
12 144000
13 279500
14 157000
15 132000
16 149000
17 90000
18 159000
19 139000
20 325300
21 139400
22 230000
23 129900
24 154000
25 256300
26 134800
27 306000
28 207500
29 68500
...
1430 192140
1431 143750
1432 64500
1433 186500
1434 160000
1435 174000
1436 120500
1437 394617
1438 149700
1439 197000
1440 191000
1441 149300
1442 310000
1443 121000
1444 179600
1445 129000
1446 157900
1447 240000
1448 112000
1449 92000
1450 136000
1451 287090
1452 145000
1453 84500
1454 185000
1455 175000
1456 210000
1457 266500
1458 142125
1459 147500
Name: SalePrice, dtype: int64
Extract several columns of data
>>> mycolumns = ['GrLivArea','SalePrice']
>>> df[mycolumns]
GrLivArea SalePrice
0 1710 208500
1 1262 181500
2 1786 223500
3 1717 140000
4 2198 250000
5 1362 143000
6 1694 307000
7 2090 200000
8 1774 129900
9 1077 118000
10 1040 129500
11 2324 345000
12 912 144000
13 1494 279500
14 1253 157000
15 854 132000
16 1004 149000
17 1296 90000
18 1114 159000
19 1339 139000
20 2376 325300
21 1108 139400
22 1795 230000
23 1060 129900
24 1060 154000
25 1600 256300
26 900 134800
27 1704 306000
28 1600 207500
29 520 68500
... ... ...
1430 1838 192140
1431 958 143750
1432 968 64500
1433 1792 186500
1434 1126 160000
1435 1537 174000
1436 864 120500
1437 1932 394617
1438 1236 149700
1439 1725 197000
1440 2555 191000
1441 848 149300
1442 2007 310000
1443 952 121000
1444 1422 179600
1445 913 129000
1446 1188 157900
1447 2090 240000
1448 1346 112000
1449 630 92000
1450 1792 136000
1451 1578 287090
1452 1072 145000
1453 1140 84500
1454 1221 185000
1455 1647 175000
1456 2073 210000
1457 2340 266500
1458 1078 142125
1459 1256 147500
[1460 rows x 2 columns]
Example 2 using loc:
>>> df.loc[:,mycolumns]
GrLivArea SalePrice
0 1710 208500
1 1262 181500
2 1786 223500
3 1717 140000
4 2198 250000
5 1362 143000
6 1694 307000
7 2090 200000
8 1774 129900
9 1077 118000
10 1040 129500
11 2324 345000
12 912 144000
13 1494 279500
14 1253 157000
15 854 132000
16 1004 149000
17 1296 90000
18 1114 159000
19 1339 139000
20 2376 325300
21 1108 139400
22 1795 230000
23 1060 129900
24 1060 154000
25 1600 256300
26 900 134800
27 1704 306000
28 1600 207500
29 520 68500
... ... ...
1430 1838 192140
1431 958 143750
1432 968 64500
1433 1792 186500
1434 1126 160000
1435 1537 174000
1436 864 120500
1437 1932 394617
1438 1236 149700
1439 1725 197000
1440 2555 191000
1441 848 149300
1442 2007 310000
1443 952 121000
1444 1422 179600
1445 913 129000
1446 1188 157900
1447 2090 240000
1448 1346 112000
1449 630 92000
1450 1792 136000
1451 1578 287090
1452 1072 145000
1453 1140 84500
1454 1221 185000
1455 1647 175000
1456 2073 210000
1457 2340 266500
1458 1078 142125
1459 1256 147500
[1460 rows x 2 columns]
Example 3 using iloc:
>>> df.iloc[:,[46,80]]
GrLivArea SalePrice
0 1710 208500
1 1262 181500
2 1786 223500
3 1717 140000
4 2198 250000
5 1362 143000
6 1694 307000
7 2090 200000
8 1774 129900
9 1077 118000
10 1040 129500
11 2324 345000
12 912 144000
13 1494 279500
14 1253 157000
15 854 132000
16 1004 149000
17 1296 90000
18 1114 159000
19 1339 139000
20 2376 325300
21 1108 139400
22 1795 230000
23 1060 129900
24 1060 154000
25 1600 256300
26 900 134800
27 1704 306000
28 1600 207500
29 520 68500
... ... ...
1430 1838 192140
1431 958 143750
1432 968 64500
1433 1792 186500
1434 1126 160000
1435 1537 174000
1436 864 120500
1437 1932 394617
1438 1236 149700
1439 1725 197000
1440 2555 191000
1441 848 149300
1442 2007 310000
1443 952 121000
1444 1422 179600
1445 913 129000
1446 1188 157900
1447 2090 240000
1448 1346 112000
1449 630 92000
1450 1792 136000
1451 1578 287090
1452 1072 145000
1453 1140 84500
1454 1221 185000
1455 1647 175000
1456 2073 210000
1457 2340 266500
1458 1078 142125
1459 1256 147500
[1460 rows x 2 columns]
The advantage of the iloc() function is to easily select several consecutive columns, example:
>>> df.iloc[:,76:80]
MoSold YrSold SaleType SaleCondition
0 2 2008 WD Normal
1 5 2007 WD Normal
2 9 2008 WD Normal
3 2 2006 WD Abnorml
4 12 2008 WD Normal
5 10 2009 WD Normal
6 8 2007 WD Normal
7 11 2009 WD Normal
8 4 2008 WD Abnorml
9 1 2008 WD Normal
10 2 2008 WD Normal
11 7 2006 New Partial
12 9 2008 WD Normal
13 8 2007 New Partial
14 5 2008 WD Normal
15 7 2007 WD Normal
16 3 2010 WD Normal
17 10 2006 WD Normal
18 6 2008 WD Normal
19 5 2009 COD Abnorml
20 11 2006 New Partial
21 6 2007 WD Normal
22 9 2008 WD Normal
23 6 2007 WD Normal
24 5 2010 WD Normal
25 7 2009 WD Normal
26 5 2010 WD Normal
27 5 2010 WD Normal
28 12 2006 WD Normal
29 5 2008 WD Normal
... ... ... ... ...
1430 7 2006 WD Normal
1431 10 2009 WD Normal
1432 8 2007 WD Normal
1433 5 2008 WD Normal
1434 5 2006 WD Normal
1435 7 2008 COD Abnorml
1436 5 2007 WD Normal
1437 11 2008 New Partial
1438 4 2010 WD Normal
1439 11 2007 WD Normal
1440 9 2008 WD Normal
1441 5 2008 WD Normal
1442 4 2009 WD Normal
1443 5 2009 WD Normal
1444 11 2007 WD Normal
1445 5 2007 WD Normal
1446 4 2010 WD Normal
1447 12 2007 WD Normal
1448 5 2007 WD Normal
1449 8 2006 WD Abnorml
1450 9 2009 WD Normal
1451 5 2009 New Partial
1452 5 2006 WD Normal
1453 7 2006 WD Abnorml
1454 10 2009 WD Normal
1455 8 2007 WD Normal
1456 2 2010 WD Normal
1457 5 2010 WD Normal
1458 4 2010 WD Normal
1459 6 2008 WD Normal
[1460 rows x 4 columns]
References
Links | Site |
---|---|
Selecting Subsets of Data in Pandas: Part 1 | medium.com |
Select Rows & Columns by Name or Index in DataFrame using loc & iloc Python Pandas | thispointer.com |
pandas.DataFrame.loc | pandas doc |
pandas.DataFrame.iloc | pandas doc |