Examples of how to randomly select (sample) rows of a dataframe using pandas in Python:
1 -- Create a simple dataframe
Let's create a simple dataframe with 5 columns and 20 rows:
>>> import pandas as pd
>>> import numpy as np
>>> data = np.arange(1,101)
>>> data = data.reshape(20,5)
>>> df = pd.DataFrame(data=data,columns=['a','b','c','d','e'])
>>> df
a b c d e
0 1 2 3 4 5
1 6 7 8 9 10
2 11 12 13 14 15
3 16 17 18 19 20
4 21 22 23 24 25
5 26 27 28 29 30
6 31 32 33 34 35
7 36 37 38 39 40
8 41 42 43 44 45
9 46 47 48 49 50
10 51 52 53 54 55
11 56 57 58 59 60
12 61 62 63 64 65
13 66 67 68 69 70
14 71 72 73 74 75
15 76 77 78 79 80
16 81 82 83 84 85
17 86 87 88 89 90
18 91 92 93 94 95
19 96 97 98 99 100
2 -- Randomly select rows using the function sample()
To sample a dataframe using pandas, a solution is to use pandas.DataFrame.sample. Example: let's randomly select 5 rows from the dataframe df defined above:
>>> df_sub_cutoff = df.sample(n=5)
>>> df_sub_cutoff
a b c d e
11 56 57 58 59 60
0 1 2 3 4 5
18 91 92 93 94 95
15 76 77 78 79 80
9 46 47 48 49 50
Let's create another sample of size n=5. Since no seed is set, the selected rows differ from the previous call:
>>> df_sub_cutoff = df.sample(n=5)
>>> df_sub_cutoff
a b c d e
0 1 2 3 4 5
4 21 22 23 24 25
12 61 62 63 64 65
5 26 27 28 29 30
16 81 82 83 84 85
or of size n=2:
>>> df_sub_cutoff = df.sample(n=2)
>>> df_sub_cutoff
a b c d e
0 1 2 3 4 5
15 76 77 78 79 80
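Instead of a fixed number of rows, sample() also accepts a fraction of the rows through the frac parameter. The sketch below is a minimal illustration, assuming the same 20-row dataframe as above; the variable name df_sub_frac is only chosen for this example. Note that n and frac cannot be used together in the same call.

```python
import numpy as np
import pandas as pd

# Rebuild the 20-row, 5-column dataframe used in the examples above
data = np.arange(1, 101).reshape(20, 5)
df = pd.DataFrame(data=data, columns=['a', 'b', 'c', 'd', 'e'])

# frac=0.25 asks for 25% of the rows: 0.25 * 20 = 5 rows
# (df_sub_frac is a hypothetical name used only for this sketch)
df_sub_frac = df.sample(frac=0.25)

# The number of rows is fixed by frac; which rows are picked changes per call
print(len(df_sub_frac))
```

By default the rows are drawn without replacement, so the same row should not appear twice in the result.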
Note: to always get the same sample, a solution is to use the option "random_state" (with random_state=42 for example):
>>> df_sub_cutoff = df.sample(n=5, random_state=42)
>>> df_sub_cutoff
a b c d e
0 1 2 3 4 5
17 86 87 88 89 90
15 76 77 78 79 80
1 6 7 8 9 10
8 41 42 43 44 45
Let's do it again using random_state=42, to check that we get the same sample:
>>> df_sub_cutoff = df.sample(n=5, random_state=42)
>>> df_sub_cutoff
a b c d e
0 1 2 3 4 5
17 86 87 88 89 90
15 76 77 78 79 80
1 6 7 8 9 10
8 41 42 43 44 45
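Rather than comparing the two printed outputs by eye, the check can also be done programmatically. The following is a minimal sketch, assuming the same dataframe as above; the names first, second and same are introduced only for this illustration, and DataFrame.equals is used to compare the two samples.

```python
import numpy as np
import pandas as pd

# Rebuild the 20-row, 5-column dataframe used in the examples above
data = np.arange(1, 101).reshape(20, 5)
df = pd.DataFrame(data=data, columns=['a', 'b', 'c', 'd', 'e'])

# Draw two samples with the same seed (names are hypothetical)
first = df.sample(n=5, random_state=42)
second = df.sample(n=5, random_state=42)

# equals() compares values, index and columns of both dataframes
same = first.equals(second)
print(same)
```

With an identical seed and an identical input dataframe, both calls are expected to return the same rows in the same order.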