Examples of how to create a random sample with a reservoir using pandas in python
Create a list of dataframes
To create a sample from a dataframe, a straightforward solution is to use the pandas method sample() (see the previous article: How to select randomly (sample) the rows of a dataframe using pandas in python). However, sample() needs the whole dataset loaded in a single dataframe. Suppose, for example, that we want to create a sample from a list of files, each of which contains a lot of data: if you try to stack all the files together, your Python code is likely to run out of memory and crash. Another solution is to implement a random sample with a reservoir (see the Wikipedia article Reservoir sampling). Let's create a fake list of dataframes:
import pandas as pd
import numpy as np
import random

list_of_files = []
nb_files = 200

for f in range(nb_files):
    data = np.random.randn(100000)
    df = pd.DataFrame(data=data,columns=['x'])
    list_of_files.append(df)
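For reference, the sequential form of the algorithm described in the Wikipedia article (often called Algorithm R) can be sketched in plain Python for any iterable. This is a minimal sketch and not the pandas implementation built in the next section; the function name reservoir_sample and the seed parameter are chosen here for illustration only.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return up to k items chosen uniformly at random from an iterable."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream, start=1):  # i = number of items seen so far
        if i <= k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Draw a slot in [0, i); the item is kept with probability k / i.
            j = rng.randrange(i)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), k=5, seed=0)
print(sample)
```

Only the k items of the reservoir are held in memory, whatever the length of the stream.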
Create a random sample with a reservoir with pandas
Step 1: create a reservoir of size k (filled with the value -9999.0):
k = 10000
res = np.full((k,1), -9999.0)
res_df = pd.DataFrame(data=res,columns=['x'])
Step 2: select one file ("dataframe" here)
df = list_of_files[0]
Step 3: add a column called 'i': index of the item currently under consideration
i_start = 1
file_nb_rows = df.shape[0]

col = np.arange(0,file_nb_rows)
col = col + i_start

df['i'] = col
print(df)
gives for example
              x       i
0      1.031251       1
1     -0.337949       2
2      0.191543       3
3      1.026738       4
4      0.402292       5
...         ...     ...
99995  1.226392   99996
99996 -0.777223   99997
99997 -0.185092   99998
99998 -0.861963   99999
99999  0.993446  100000
Step 4: add a new column called 'j': for each row, the algorithm generates a random integer j between 0 (included) and i (excluded)
def myfunc(i):
    return random.randrange(0,i)

df['j'] = df['i'].apply(myfunc)
print(df)
gives for example
              x       i      j
0      1.031251       1      0
1     -0.337949       2      1
2      0.191543       3      0
3      1.026738       4      3
4      0.402292       5      4
...         ...     ...    ...
99995  1.226392   99996  43950
99996 -0.777223   99997  10858
99997 -0.185092   99998  75163
99998 -0.861963   99999    632
99999  0.993446  100000  92049
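Calling random.randrange once per row is slow on large dataframes. Assuming a NumPy version that provides np.random.default_rng, the same draw can be vectorized, because Generator.integers accepts an array as its exclusive upper bound. The variable names mirror the steps above; the seed is only there to make the sketch reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only for reproducibility

i_start = 1
file_nb_rows = 100000

# 1-based counters of the items in the current file
i = np.arange(i_start, i_start + file_nb_rows)

# One integer per row, each drawn from its own range [0, i)
j = rng.integers(0, i)
```

The resulting array can be assigned directly with df['j'] = j.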
Step 5: select only rows with j < k:
df = df[ df['j'] < k ]
Step 6: set column j as index
df = df.set_index('j')
Step 7: replace rows of the reservoir with the corresponding index of df:
res_df.loc[df.index, :] = df[:]
print( res_df )
gives, for example, the following reservoir
             x
0    -1.596012
1     0.720636
2    -1.125773
3     0.234868
4    -0.141145
...        ...
9995  1.158669
9996 -1.503172
9997  0.216314
9998 -0.243413
9999  1.908080
Step 8: update i_start
i_start += file_nb_rows
Finally let's combine all steps together:
k = 10000
res = np.full((k,1), -9999.0)
res_df = pd.DataFrame(data=res,columns=['x'])

i_start = 1

for df in list_of_files:
    file_nb_rows = df.shape[0]
    col = np.arange(0,file_nb_rows)
    col = col + i_start
    df['i'] = col
    def myfunc(i):
        return random.randrange(0,i)
    df['j'] = df['i'].apply(myfunc)
    df = df[ df['j'] < k ]
    df = df.set_index('j')
    res_df.loc[df.index, :] = df[:]
    i_start += file_nb_rows
We then get the following sample with a reservoir:
             x
0     1.934902
1     1.882526
2     0.019944
3     1.217078
4    -0.320754
...        ...
9995 -0.080966
9996  3.036373
9997  0.876503
9998 -0.152433
9999  0.932511

Done!
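The combined steps can also be wrapped in a reusable function. The sketch below is one possible variant, not the code above: the name reservoir_from_frames and its parameters are hypothetical, the draws are vectorized with NumPy, and rows that draw the same slot within a file are resolved explicitly so that the last one wins, as it would in a row-by-row loop. Note that because j is drawn from [0, i) for every row, including the first k, slot s is never hit with probability s/n over n rows, so a few slots may keep the fill value when n is not much larger than k.

```python
import numpy as np
import pandas as pd

def reservoir_from_frames(frames, k, column="x", fill_value=-9999.0, seed=None):
    """Uniform reservoir of size k over the rows of an iterable of dataframes."""
    rng = np.random.default_rng(seed)
    reservoir = np.full(k, fill_value)
    i_start = 1
    for df in frames:
        n = len(df)
        i = np.arange(i_start, i_start + n)   # 1-based counters for this file
        j = rng.integers(0, i)                # one draw in [0, i) per row
        keep = j < k
        slots = j[keep]
        values = df[column].to_numpy()[keep]
        # Keep only the last row that drew each slot, as a sequential loop would.
        uniq, first_in_reversed = np.unique(slots[::-1], return_index=True)
        last = len(slots) - 1 - first_in_reversed
        reservoir[uniq] = values[last]
        i_start += n
    return pd.DataFrame({column: reservoir})

# Usage: the generator yields one dataframe at a time, so the files are never stacked.
data_rng = np.random.default_rng(0)
frames = (pd.DataFrame({"x": data_rng.standard_normal(100000)}) for _ in range(20))
res_df = reservoir_from_frames(frames, k=1000, seed=1)

# Number of slots still holding the fill value
unfilled = int((res_df["x"] == -9999.0).sum())
print(len(res_df), unfilled)
```

Passing a generator instead of a list means each file can be read, sampled, and discarded in turn.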
Create a weighted random sample with a reservoir with pandas
In the previous example, each row has the same probability of being selected. There are multiple ways to implement a weighted random sampling (see for example the following research paper: Weighted random sampling with a reservoir):
Here is an example of how to implement a weighted random sampling:
Step 1: create fake data:
import pandas as pd
import numpy as np
import random

data = np.random.uniform(0,100,100000)
df = pd.DataFrame(data=data,columns=['x'])
df.hist()
Step 2: assign a weight to each row (w = x**2 for example):
Note that rows with a larger w have a higher probability of being selected.
def weights(i):
    return i**2

df['w'] = df['x'].apply(weights)
print(df)
gives for example
               x            w
0      87.345129  7629.171630
1       1.802819     3.250155
2      18.481825   341.577844
3      85.596719  7326.798226
4      51.299716  2631.660854
...          ...          ...
99995   5.409944    29.267489
99996  96.474256  9307.281975
99997  89.549894  8019.183497
99998  95.647101  9148.367968
99999  66.882506  4473.269595
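Since x**2 is an element-wise operation, the weight column can also be computed on the whole column at once instead of calling a Python function for each row. A minimal sketch (the smaller array size is only there to keep the example quick):

```python
import numpy as np
import pandas as pd

data = np.random.uniform(0, 100, 1000)
df = pd.DataFrame(data=data, columns=['x'])

# Vectorized equivalent of df['x'].apply(weights) with weights(i) = i**2
df['w'] = df['x'] ** 2
```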
Step 3: Create a list of fake files:
list_of_files = []
nb_files = 100

def weights(i):
    return i**2

for f in range(nb_files):
    data = np.random.uniform(0,100,100000)
    df = pd.DataFrame(data=data,columns=['x'])
    df['w'] = df['x'].apply(weights)
    list_of_files.append(df)

print(list_of_files[1])
Step 4: Create a random sampling with a reservoir:
k = 10000
res = np.full((k,4), -9999.0)
res_df = pd.DataFrame(data=res,columns=['x', 'w', 'ui', 'ki'])
Iterate over the dataframes:
for df in list_of_files:
    file_nb_rows = df.shape[0]
    # one uniform random number ui per row
    col = np.random.uniform(0,1,file_nb_rows)
    df['ui'] = col
    # key of each row: ki = ui**(1/w)
    def myfunc(c):
        return c['ui']**(1.0/c['w'])
    df['ki'] = df.apply(myfunc, axis=1)
    #print(df)
    for index, row in df.iterrows():
        if res_df[ res_df['ki'] < 0.0 ].shape[0] > 0:
            # the reservoir still has an unfilled slot (fill value -9999.0)
            fillv_idx = res_df[ res_df['ki'] < 0.0 ].index[0]
            res_df.iloc[fillv_idx,:] = row
        else:
            # replace the row with the smallest key if the new key is larger
            idxmin = res_df['ki'].idxmin()
            if row['ki'] > res_df['ki'].iloc[idxmin]:
                res_df.iloc[idxmin] = row

print( res_df )
gives for example
              x            w        ui        ki
0     53.454493  2857.382844  0.998571  0.999999
1     89.968302  8094.295344  0.989442  0.999999
2     70.193962  4927.192322  0.947424  0.999989
3     91.303779  8336.380103  0.919453  0.999990
4     67.163507  4510.936621  0.937989  0.999986
...         ...          ...       ...       ...
9995  71.289684  5082.219003  0.929686  0.999986
9996  21.750293   473.075265  0.998853  0.999998
9997  85.458925  7303.227943  0.921676  0.999989
9998  76.983555  5926.467703  0.973879  0.999996
9999  85.240184  7265.888955  0.922595  0.999989
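The loop above scans the whole reservoir for every row, which becomes slow when k is large. The same rule (compute a key ki = ui**(1/w) per row and keep the k rows with the largest keys) can be sketched with a min-heap, whose smallest key is always available at position 0. This is an alternative sketch rather than the code above: the function name weighted_reservoir and the seed parameter are hypothetical, the keys are computed in a vectorized way, and all weights are assumed to be strictly positive.

```python
import heapq
import numpy as np
import pandas as pd

def weighted_reservoir(frames, k, seed=None):
    """Keep the k rows with the largest keys ui**(1/w) over an iterable of dataframes."""
    rng = np.random.default_rng(seed)
    heap = []      # min-heap of (ki, counter, x, w, ui); heap[0] has the smallest key
    counter = 0    # tie-breaker so that equal keys never compare further fields
    for df in frames:
        x = df["x"].to_numpy()
        w = df["w"].to_numpy()
        u = rng.uniform(0.0, 1.0, len(df))
        keys = u ** (1.0 / w)   # assumes w > 0
        for xi, wi, ui, ki in zip(x, w, u, keys):
            entry = (ki, counter, xi, wi, ui)
            counter += 1
            if len(heap) < k:
                heapq.heappush(heap, entry)      # reservoir not full yet
            elif ki > heap[0][0]:
                heapq.heapreplace(heap, entry)   # evict the smallest key
    rows = [(xi, wi, ui, ki) for ki, _, xi, wi, ui in heap]
    return pd.DataFrame(rows, columns=["x", "w", "ui", "ki"])

# Usage with a few small fake files and w = x**2
data_rng = np.random.default_rng(0)
frames = []
for _ in range(5):
    x = data_rng.uniform(0, 100, 20000)
    frames.append(pd.DataFrame({"x": x, "w": x ** 2}))

sample = weighted_reservoir(frames, k=500, seed=1)
print(sample["x"].mean())
```

With w = x**2, large values of x are favored, so the mean of x in the sample is expected to be well above the unweighted mean of 50.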
