Examples of how to create a random sample with a reservoir using pandas in python
Create a list of dataframes
To draw a sample from a dataframe, the straightforward solution is the pandas method sample() (see the previous article: How to select randomly (sample) the rows of a dataframe using pandas in python). However, that requires the whole dataset to fit in memory. Suppose, for example, that we want a sample from a list of files, each containing a lot of data: stacking all the files into a single dataframe may exhaust the available memory and crash the Python process. An alternative is to draw the random sample with a reservoir (see the Wikipedia article Reservoir sampling), which processes one file at a time and only keeps k rows in memory. Let's create a fake list of dataframes:
import pandas as pd
import numpy as np
import random
list_of_files = []
nb_files = 200
for f in range(nb_files):
    data = np.random.randn(100000)
    df = pd.DataFrame(data=data,columns=['x'])
    list_of_files.append(df)
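For reference, the textbook version of reservoir sampling (Algorithm R in the Wikipedia article) can be sketched in plain Python before moving to pandas. This is only a minimal illustration: the function name reservoir_sample and its rng parameter are made up for this example and are not part of pandas or of the code used in the rest of this article.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Return up to k items drawn uniformly at random from an iterable."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):  # i = 1-based position of the item
        if i <= k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Draw j uniformly from 0..i-1; the item is kept with probability k/i.
            j = rng.randrange(i)
            if j < k:
                reservoir[j] = item
    return reservoir

# The stream is read once and only k items are held in memory.
sample = reservoir_sample(range(1_000_000), k=5, rng=random.Random(0))
print(sample)
```

The pandas steps below follow the same idea, but draw all the j values of a file at once instead of looping over the rows.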
Create a random sample with a reservoir with pandas
Step 1: create a reservoir of size k, filled with a placeholder value (-9999.0):
k = 10000
res = np.full((k,1), -9999.0)
res_df = pd.DataFrame(data=res,columns=['x'])
Step 2: select one file ("dataframe" here)
df = list_of_files[0]
Step 3: add a column called 'i': index of the item currently under consideration
i_start = 1
file_nb_rows = df.shape[0]
col = np.arange(0,file_nb_rows)
col = col + i_start
df['i'] = col
print(df)
gives for example
x i
0 1.031251 1
1 -0.337949 2
2 0.191543 3
3 1.026738 4
4 0.402292 5
... ... ...
99995 1.226392 99996
99996 -0.777223 99997
99997 -0.185092 99998
99998 -0.861963 99999
99999 0.993446 100000
Step 4: add a new column called 'j': for each row, the algorithm generates a random integer j between 0 and i-1 (both included)
def myfunc(i):
    return random.randrange(0,i)
df['j'] = df['i'].apply(myfunc)
print(df)
gives for example
x i j
0 1.031251 1 0
1 -0.337949 2 1
2 0.191543 3 0
3 1.026738 4 3
4 0.402292 5 4
... ... ... ...
99995 1.226392 99996 43950
99996 -0.777223 99997 10858
99997 -0.185092 99998 75163
99998 -0.861963 99999 632
99999 0.993446 100000 92049
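Note that calling a Python function once per row with apply() can be slow on large files. As a possible alternative, NumPy's random Generator accepts an array as upper bound in integers(), so all the j values can be drawn in one call. The sketch below uses a small hypothetical dataframe named df_demo and a seeded generator only to be reproducible:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible

df_demo = pd.DataFrame({'x': rng.standard_normal(1000)})
i_start = 1
df_demo['i'] = np.arange(i_start, i_start + len(df_demo))

# One draw per row: j is uniform over 0, 1, ..., i-1 (the upper bound is exclusive).
df_demo['j'] = rng.integers(0, df_demo['i'].to_numpy())

print(df_demo.head())
```

The values differ from those produced by random.randrange() since another random generator is used, but each j follows the same distribution.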
Step 5: select only rows with j < k:
df = df[ df['j'] < k ]
Step 6: set column j as index
df = df.set_index('j')
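Step 7: replace the rows of the reservoir at the index values of df. If several rows drew the same j, only the last one is kept, as it would be if the rows were processed one by one:
df = df[ ~df.index.duplicated(keep='last') ]
res_df.loc[df.index, 'x'] = df['x'].to_numpy()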
print( res_df )
which gives, for example, the following reservoir:
x
0 -1.596012
1 0.720636
2 -1.125773
3 0.234868
4 -0.141145
... ...
9995 1.158669
9996 -1.503172
9997 0.216314
9998 -0.243413
9999 1.908080
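Within a single file, several rows can draw the same value of j. In the row-by-row algorithm the last of those rows is the one that stays in the reservoir. A minimal sketch of how to enforce that with pandas, using a tiny hypothetical batch and a reservoir of 6 slots:

```python
import pandas as pd

# Hypothetical batch: the rows at positions 1 and 3 both drew slot j = 2.
batch = pd.DataFrame({'x': [0.1, 0.2, 0.3, 0.4], 'j': [0, 2, 5, 2]})
batch = batch.set_index('j')

# Keep only the last row seen for each slot, as a row-by-row loop would.
last_wins = batch[~batch.index.duplicated(keep='last')]

reservoir = pd.DataFrame({'x': [-9999.0] * 6})
# .to_numpy() makes the assignment positional, so no index alignment is involved.
reservoir.loc[last_wins.index, 'x'] = last_wins['x'].to_numpy()
print(reservoir)
```

Slot 2 ends up with 0.4 (the later row), and the slots that were not drawn keep the placeholder value.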
Step 8: update i_start
i_start += file_nb_rows
Finally let's combine all steps together:
k = 10000
res = np.full((k,1), -9999.0)
res_df = pd.DataFrame(data=res,columns=['x'])
i_start = 1
def myfunc(i):
    return random.randrange(0,i)
for df in list_of_files:
    file_nb_rows = df.shape[0]
    col = np.arange(0,file_nb_rows)
    col = col + i_start
    df['i'] = col
    df['j'] = df['i'].apply(myfunc)
    df = df[ df['j'] < k ]
    df = df.set_index('j')
    # if several rows drew the same j, keep only the last one
    df = df[ ~df.index.duplicated(keep='last') ]
    res_df.loc[df.index, 'x'] = df['x'].to_numpy()
    i_start += file_nb_rows
We then get the following sample with a reservoir:
x
0 1.934902
1 1.882526
2 0.019944
3 1.217078
4 -0.320754
... ...
9995 -0.080966
9996 3.036373
9997 0.876503
9998 -0.152433
9999 0.932511
Done !
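Note that the loop above draws j even for the first k rows, so a slot of the reservoir can, with a small probability, never be written and keep its placeholder value. A possible variant, sketched below, stores the first k rows directly (as in the textbook algorithm) and only then starts drawing j. The function name reservoir_sample_frames and its seed parameter are assumptions made for this example, and only the column 'x' is handled:

```python
import numpy as np
import pandas as pd

def reservoir_sample_frames(frames, k, seed=None):
    """Uniform sample of up to k values of column 'x' from an iterable of dataframes."""
    rng = np.random.default_rng(seed)
    reservoir = np.empty(k, dtype=float)
    seen = 0  # number of rows processed so far
    for frame in frames:
        x = frame['x'].to_numpy()
        # Rows that arrive while the reservoir is not yet full are stored directly.
        n_fill = min(max(k - seen, 0), len(x))
        reservoir[seen:seen + n_fill] = x[:n_fill]
        rest = x[n_fill:]
        if rest.size > 0:
            # 1-based positions of the remaining rows in the overall stream.
            i = np.arange(seen + n_fill + 1, seen + len(x) + 1)
            j = rng.integers(0, i)  # uniform over 0..i-1
            keep = j < k
            slots, values = j[keep], rest[keep]
            # If a slot was drawn several times, keep the last row that drew it.
            _, first_in_reversed = np.unique(slots[::-1], return_index=True)
            last = len(slots) - 1 - first_in_reversed
            reservoir[slots[last]] = values[last]
        seen += len(x)
    return pd.DataFrame({'x': reservoir[:min(k, seen)]})

# Five fake files of 1000 rows each, with distinct values to make checks easy.
frames = [pd.DataFrame({'x': np.arange(n * 1000, (n + 1) * 1000, dtype=float)})
          for n in range(5)]
sample = reservoir_sample_frames(frames, k=100, seed=0)
print(sample.shape)
```

In a real use case, frames could be a generator that reads one file at a time, so that only the current file and the reservoir are in memory.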
Create a weighted random sample with a reservoir with pandas
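In the previous example, each row has the same probability of being selected. There are several ways to implement weighted random sampling (see for example the following research paper: Weighted random sampling with a reservoir).
Here is an example of how to implement a weighted random sample. Each row gets a key ki = ui**(1/w), where ui is a uniform random number between 0 and 1 and w is the weight of the row, and the reservoir keeps the k rows with the largest keys.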
Step 1: create fake data:
import pandas as pd
import numpy as np
import random
data = np.random.uniform(0,100,100000)
df = pd.DataFrame(data=data,columns=['x'])
df.hist()
Step 2: Assign a weight to each row (w = x**2 for example):
Note that rows with a larger w have a higher probability of being selected.
def weights(i):
    return i**2
df['w'] = df['x'].apply(weights)
print(df)
gives for example
x w
0 87.345129 7629.171630
1 1.802819 3.250155
2 18.481825 341.577844
3 85.596719 7326.798226
4 51.299716 2631.660854
... ... ...
99995 5.409944 29.267489
99996 96.474256 9307.281975
99997 89.549894 8019.183497
99998 95.647101 9148.367968
99999 66.882506 4473.269595
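Since the weight here is a simple arithmetic expression, it can also be computed on the whole column at once, which is usually much faster than apply() on large files. A small sketch with a hypothetical dataframe df_w:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible
df_w = pd.DataFrame({'x': rng.uniform(0, 100, 1000)})

# Column arithmetic computes all the weights in one NumPy operation.
df_w['w'] = df_w['x'] ** 2

# Comparison with the row-by-row version.
w_apply = df_w['x'].apply(lambda v: v ** 2)
print(np.allclose(df_w['w'], w_apply))
```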
Step 3: Create a list of fake files:
list_of_files = []
nb_files = 100
def weights(i):
    return i**2
for f in range(nb_files):
    data = np.random.uniform(0,100,100000)
    df = pd.DataFrame(data=data,columns=['x'])
    df['w'] = df['x'].apply(weights)
    list_of_files.append(df)
print(list_of_files[1])
Step 4: Create the reservoir for the weighted random sample:
k = 10000
res = np.full((k,4), -9999.0)
res_df = pd.DataFrame(data=res,columns=['x', 'w', 'ui', 'ki'])
Iterate over each dataframe:
def myfunc(c):
    return c['ui']**(1.0/c['w'])
for df in list_of_files:
    file_nb_rows = df.shape[0]
    # one uniform random number per row
    col = np.random.uniform(0,1,file_nb_rows)
    df['ui'] = col
    # key of each row: ui**(1/w)
    df['ki'] = df.apply(myfunc, axis=1)
    #print(df)
    for index, row in df.iterrows():
        if res_df[ res_df['ki'] < 0.0 ].shape[0] > 0:
            # the reservoir still has unfilled slots (ki = -9999.0): fill the first one
            fillv_idx = res_df[ res_df['ki'] < 0.0 ].index[0]
            res_df.iloc[fillv_idx,:] = row
        else:
            # otherwise replace the row with the smallest key if the new key is larger
            idxmin = res_df['ki'].idxmin()
            if row['ki'] > res_df['ki'].iloc[idxmin]:
                res_df.iloc[idxmin] = row
print( res_df )
gives for example
x w ui ki
0 53.454493 2857.382844 0.998571 0.999999
1 89.968302 8094.295344 0.989442 0.999999
2 70.193962 4927.192322 0.947424 0.999989
3 91.303779 8336.380103 0.919453 0.999990
4 67.163507 4510.936621 0.937989 0.999986
... ... ... ... ...
9995 71.289684 5082.219003 0.929686 0.999986
9996 21.750293 473.075265 0.998853 0.999998
9997 85.458925 7303.227943 0.921676 0.999989
9998 76.983555 5926.467703 0.973879 0.999996
9999 85.240184 7265.888955 0.922595 0.999989
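The row-by-row loop above scans the reservoir for every row, which can take a long time with millions of rows. Since the method only needs the k rows with the largest keys, a possible faster variant is to merge each file with the current reservoir and keep the k largest keys with nlargest(). The sketch below also ranks rows by log(ui)/w instead of ui**(1/w): both give the same ordering, and the logarithm keeps more floating-point precision when the weights are large. The function name weighted_reservoir, the column name log_ki and the seed parameter are assumptions made for this example:

```python
import numpy as np
import pandas as pd

def weighted_reservoir(frames, k, seed=None):
    """Keep the k rows with the largest keys log(u)/w across all dataframes."""
    rng = np.random.default_rng(seed)
    reservoir = None
    for frame in frames:
        batch = frame[['x', 'w']].copy()   # leave the caller's dataframe untouched
        batch = batch[batch['w'] > 0]      # zero-weight rows can never be selected
        u = 1.0 - rng.random(len(batch))   # uniform on (0, 1], so log(u) is finite
        # log(u)/w orders the rows exactly like u**(1/w).
        batch['log_ki'] = np.log(u) / batch['w'].to_numpy()
        if reservoir is None:
            merged = batch
        else:
            merged = pd.concat([reservoir, batch], ignore_index=True)
        # Only the k largest keys seen so far are kept in memory.
        reservoir = merged.nlargest(k, 'log_ki').reset_index(drop=True)
    return reservoir

# Three small fake files with w = x**2, as in the example above.
data_rng = np.random.default_rng(1)
frames = []
for _ in range(3):
    x = data_rng.uniform(0, 100, 1000)
    frames.append(pd.DataFrame({'x': x, 'w': x ** 2}))

res = weighted_reservoir(frames, k=50, seed=0)
print(res.head())
```

As expected with w = x**2, the sample is dominated by rows with large x values.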