How to create a random sample using a reservoir with pandas in Python?

Published: November 03, 2021

Tags: Python; Pandas; DataFrame; Big Data;


Examples of how to create a random sample with a reservoir using pandas in python

Create a list of dataframes

To create a sample from a dataframe, a straightforward solution is to use the pandas function sample() (see the previous article: How to select randomly (sample) the rows of a dataframe using pandas in python). However, sample() needs the whole dataset in memory, so it does not work if you have a lot of data. For example, let's assume we want to create a sample from a list of files, each of which contains a lot of data. If you try to stack all the files together, your Python code is likely to run out of memory and crash. Another solution is to implement a random sample with a reservoir (see the Wikipedia article Reservoir sampling), which reads the files one at a time and only keeps k rows in memory. Let's create a fake list of dataframes:

import pandas as pd
import numpy as np
import random

list_of_files = []

nb_files = 200

for f in range(nb_files):
        data = np.random.randn(100000)
        df = pd.DataFrame(data=data,columns=['x'])
        list_of_files.append(df)
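
For reference, the classic one-item-at-a-time form of reservoir sampling (Algorithm R in the Wikipedia article) can be sketched in plain Python as follows. The function name reservoir_sample and its parameters are only illustrative; the pandas steps in the next section follow the same idea, applied to whole dataframes instead of single items.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    reservoir = []
    for i, item in enumerate(stream, start=1):   # i = number of items seen so far
        if i <= k:
            reservoir.append(item)               # the first k items fill the reservoir
        else:
            j = rng.randrange(0, i)              # integer with 0 <= j <= i - 1
            if j < k:
                reservoir[j] = item              # item i is kept with probability k / i
    return reservoir

sample = reservoir_sample(range(1_000_000), k=5)
print(sample)
```

Only the k items of the reservoir are held in memory, whatever the length of the stream.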

Create a random sample with a reservoir with pandas

Step 1: create a reservoir of size k (with a fill value = -9999.0):

k = 10000

res = np.full((k,1), -9999.0)

res_df = pd.DataFrame(data=res,columns=['x'])

Step 2: select one file ("dataframe" here)

df = list_of_files[0]

Step 3: add a column called 'i': index of the item currently under consideration

i_start = 1

file_nb_rows = df.shape[0]

col = np.arange(0,file_nb_rows)

col = col + i_start

df['i'] = col

print(df)

gives for example

            x         i
0      1.031251       1
1     -0.337949       2
2      0.191543       3
3      1.026738       4
4      0.402292       5
...         ...     ...
99995  1.226392   99996
99996 -0.777223   99997
99997 -0.185092   99998
99998 -0.861963   99999
99999  0.993446  100000

Step 4: add a new column called 'j': the slot of the reservoir proposed for each row. The first k rows fill the reservoir in order (j = i - 1), otherwise some slots could keep the fill value. For each later row, the algorithm generates a random integer j between 0 and i - 1 (both included)

def myfunc(i):
        if i <= k:
                return i - 1
        return random.randrange(0,i)

df['j'] = df['i'].apply(myfunc)

print(df)

gives for example

              x       i      j
0      1.031251       1      0
1     -0.337949       2      1
2      0.191543       3      2
3      1.026738       4      3
4      0.402292       5      4
...         ...     ...    ...
99995  1.226392   99996  43950
99996 -0.777223   99997  10858
99997 -0.185092   99998  75163
99998 -0.861963   99999    632
99999  0.993446  100000  92049
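
The random draw in this step can also be done for all rows at once instead of calling a Python function per row. The sketch below assumes numpy's Generator.integers, which accepts an array as upper bound (the bound itself is excluded); the seed is only there to make the example repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)       # seeded only to make the sketch repeatable

i = np.arange(1, 100001)             # same 1-based counter as the column 'i'
j = rng.integers(0, i)               # one draw per row, with 0 <= j <= i - 1

print(j[:5], j[-5:])
```

On a dataframe, the result could then be stored with df['j'] = j.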

Step 5: select only rows with j < k:

df = df[ df['j'] < k ]

Step 6: set column j as index

df = df.set_index('j')

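Step 7: overwrite the rows of the reservoir whose index matches the index of df. When several rows have the same j, only the last one is kept, as it would be if the rows were processed one at a time:

df = df[ ~df.index.duplicated(keep='last') ]

res_df.loc[df.index, 'x'] = df['x']

print( res_df )

gives for example the following reservoir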

              x
0    -1.596012
1     0.720636
2    -1.125773
3     0.234868
4    -0.141145
...        ...
9995  1.158669
9996 -1.503172
9997  0.216314
9998 -0.243413
9999  1.908080

Step 8: update i_start

i_start += file_nb_rows

Finally let's combine all steps together:

k = 10000

res = np.full((k,1), -9999.0)

res_df = pd.DataFrame(data=res,columns=['x'])

i_start = 1

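def myfunc(i):
        # the first k rows fill the reservoir in order,
        # later rows draw a slot between 0 and i - 1
        if i <= k:
                return i - 1
        return random.randrange(0,i)

for df in list_of_files:

        file_nb_rows = df.shape[0]

        # 1-based position of each row, counted over all files
        df['i'] = np.arange(0,file_nb_rows) + i_start

        df['j'] = df['i'].apply(myfunc)

        # keep the rows that land in the reservoir, indexed by their slot j
        df = df[ df['j'] < k ].set_index('j')

        # when several rows share a slot, the last one wins
        df = df[ ~df.index.duplicated(keep='last') ]

        res_df.loc[df.index, 'x'] = df['x']

        i_start += file_nb_rows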

We then get the following sample with a reservoir:

            x
0     1.934902
1     1.882526
2     0.019944
3     1.217078
4    -0.320754
...        ...
9995 -0.080966
9996  3.036373
9997  0.876503
9998 -0.152433
9999  0.932511

Done !
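
The same steps can also be wrapped in a function that does not modify the input dataframes. This is only a sketch under a few assumptions: the name reservoir_from_chunks and the seed parameter are illustrative, empty slots are marked with NaN instead of -9999.0, and only the column 'x' is sampled.

```python
import numpy as np
import pandas as pd

def reservoir_from_chunks(chunks, k, seed=None):
    """Uniform sample of k values of column 'x' from an iterable of dataframes."""
    rng = np.random.default_rng(seed)
    res = np.full(k, np.nan)                 # NaN marks a slot that is still empty
    i_start = 1
    for df in chunks:
        n = len(df)
        i = np.arange(i_start, i_start + n)  # 1-based position of each row overall
        # rows 1..k fill slots 0..k-1 in order; later rows draw j in [0, i-1]
        j = np.where(i <= k, i - 1, rng.integers(0, i))
        keep = j < k
        slots = pd.Series(df['x'].to_numpy()[keep], index=j[keep])
        # if several rows hit the same slot, the last one wins,
        # as it would when rows are processed one at a time
        slots = slots[~slots.index.duplicated(keep='last')]
        res[slots.index.to_numpy()] = slots.to_numpy()
        i_start += n
    return pd.DataFrame({'x': res})

# 20 small fake "files" holding the distinct values 0, 1, ..., 19999
chunks = [pd.DataFrame({'x': np.arange(c * 1000, (c + 1) * 1000, dtype=float)})
          for c in range(20)]
sample = reservoir_from_chunks(chunks, k=100, seed=1)
print(sample)
```

If the files contain fewer than k rows in total, the remaining slots stay at NaN.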

Create a weighted random sample with a reservoir with pandas

In the previous example, each row has the same probability of being selected. To implement weighted random sampling, there are multiple solutions (see for example the following research paper: Weighted random sampling with a reservoir).

Here is an example of how to implement weighted random sampling:

Step 1: create fake data:

import pandas as pd
import numpy as np
import random

data = np.random.uniform(0,100,100000)

df = pd.DataFrame(data=data,columns=['x'])

df.hist()

Step 2: assign a weight to each row (w = x**2 for example):

Note that rows with a larger w have a higher probability of being selected.

def weights(x):
        return x**2

df['w'] = df['x'].apply(weights)

print(df)

gives for example

               x            w
0      87.345129  7629.171630
1       1.802819     3.250155
2      18.481825   341.577844
3      85.596719  7326.798226
4      51.299716  2631.660854
...          ...          ...
99995   5.409944    29.267489
99996  96.474256  9307.281975
99997  89.549894  8019.183497
99998  95.647101  9148.367968
99999  66.882506  4473.269595
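
As a point of comparison, when the data does fit in memory, the pandas function sample() can draw a weighted sample directly through its weights argument. The sketch below uses a smaller fake dataframe (the names small and in_memory are only illustrative) and computes the weights with a vectorised expression instead of apply.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
small = pd.DataFrame({'x': rng.uniform(0, 100, 10000)})
small['w'] = small['x'] ** 2                  # vectorised form of w = x**2

# weighted sample without replacement, only possible when the data fits in memory
in_memory = small.sample(n=500, weights='w', random_state=0)

print(small['x'].mean(), in_memory['x'].mean())
```

The mean of x in the weighted sample should be clearly larger than in the full dataframe, because rows with a large x are favoured.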

Step 3: Create a list of fake files:

list_of_files = []

nb_files = 100

def weights(x):
        return x**2

for f in range(nb_files):
        data = np.random.uniform(0,100,100000)
        df = pd.DataFrame(data=data,columns=['x'])
        df['w'] = df['x'].apply(weights)
        list_of_files.append(df)

print(list_of_files[1])

Step 4: create the reservoir (4 columns, with a fill value = -9999.0):

k = 10000

res = np.full((k,4), -9999.0)

res_df = pd.DataFrame(data=res,columns=['x', 'w', 'ui', 'ki'])

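Iterate over each dataframe. For each row, draw a uniform random number ui and compute the key ki = ui**(1/wi); the reservoir keeps the k rows with the largest keys:

def myfunc(c):
        return c['ui']**(1.0/c['w'])

for df in list_of_files:

        file_nb_rows = df.shape[0]

        # one uniform random number ui per row
        df['ui'] = np.random.uniform(0,1,file_nb_rows)

        # key ki = ui**(1/wi): rows with a larger weight tend to get a larger key
        df['ki'] = df.apply(myfunc, axis=1)

        for index, row in df.iterrows():

                empty = res_df.index[ res_df['ki'] < 0.0 ]

                if len(empty) > 0:
                        # the reservoir still has an unused slot (fill value -9999.0)
                        res_df.loc[empty[0], :] = row
                else:
                        # otherwise replace the smallest key if the new key is larger
                        idxmin = res_df['ki'].idxmin()
                        if row['ki'] > res_df.loc[idxmin, 'ki']:
                                res_df.loc[idxmin, :] = row

print( res_df )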

gives for example

              x            w        ui        ki
0     53.454493  2857.382844  0.998571  0.999999
1     89.968302  8094.295344  0.989442  0.999999
2     70.193962  4927.192322  0.947424  0.999989
3     91.303779  8336.380103  0.919453  0.999990
4     67.163507  4510.936621  0.937989  0.999986
...         ...          ...       ...       ...
9995  71.289684  5082.219003  0.929686  0.999986
9996  21.750293   473.075265  0.998853  0.999998
9997  85.458925  7303.227943  0.921676  0.999989
9998  76.983555  5926.467703  0.973879  0.999996
9999  85.240184  7265.888955  0.922595  0.999989
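
The row-by-row loop above scans the reservoir for every row, which can become slow with many large files. A possible alternative, sketched here under the assumption that one file plus the reservoir fits in memory, is to compute the keys for a whole file at once and keep the k largest keys with the pandas function nlargest(). The function name weighted_reservoir and the seed parameter are illustrative, and at least one dataframe is expected.

```python
import numpy as np
import pandas as pd

def weighted_reservoir(chunks, k, seed=None):
    """Keep the k rows with the largest keys ki = ui**(1/wi) across all chunks."""
    rng = np.random.default_rng(seed)
    res = None
    for df in chunks:
        cand = df[['x', 'w']].copy()          # leave the caller's dataframe untouched
        cand['ui'] = rng.uniform(0, 1, len(cand))
        cand['ki'] = cand['ui'] ** (1.0 / cand['w'])
        # merge the file with the current reservoir and keep the k largest keys
        pool = cand if res is None else pd.concat([res, cand], ignore_index=True)
        res = pool.nlargest(k, 'ki')
    return res.reset_index(drop=True)

# 5 small fake "files" with the same weights as above (w = x**2)
rng = np.random.default_rng(0)
files = []
for _ in range(5):
    f = pd.DataFrame({'x': rng.uniform(0, 100, 4000)})
    f['w'] = f['x'] ** 2
    files.append(f)

weighted = weighted_reservoir(files, k=200, seed=1)
print(weighted)
```

Keeping the k largest keys after each file gives the same reservoir as comparing every row with the current minimum key, since a row stays in the reservoir only while its key is among the k largest seen so far.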
