How to Randomly Shuffle DataFrame Rows and Create Samples with Pandas?

Introduction

This article explains how to shuffle rows and sample data from a Pandas DataFrame using the sample method and how these operations work internally. We’ll explore examples, understand the underlying mechanics, and even reimplement the functionality from scratch.

Generating a Sample Dataset

Let’s begin by creating a simple DataFrame using random integers. This dataset will serve as the base for our sampling operations.

import numpy as np
import pandas as pd

data = np.random.randint(0, 100, (10, 2))

df = pd.DataFrame(data=data, columns=['Column A', 'Column B'])
print(df)

Example output:

   Column A  Column B
0        33        34
1        52         7
2        97        69
3        50        42
4        69        21
5        15        96
6         9        55
7        48        20
8        99        63
9        54        67
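
Because np.random.randint draws from NumPy's global random state, the table above changes on every run. The following is a minimal sketch of a reproducible variant, assuming a seeded Generator is acceptable for your use case. The seed value 0 and the name rng are arbitrary choices for illustration, and the generated numbers will not match the table above.

```python
import numpy as np
import pandas as pd

# Seeded generator: the same seed yields the same data on every run
# (the seed value 0 is an arbitrary choice for illustration)
rng = np.random.default_rng(0)

# integers() excludes the upper bound by default, like np.random.randint
data = rng.integers(0, 100, size=(10, 2))

df = pd.DataFrame(data=data, columns=['Column A', 'Column B'])
print(df)
```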

Using the sample Method

The Pandas sample method provides a straightforward way to shuffle rows or extract a subset of rows from a DataFrame.

Shuffle All Rows

To shuffle all rows of the DataFrame:

df.sample(frac=1.0)

Example shuffled output (only the first four of the ten rows are shown):

   Column A  Column B
3        50        42
9        54        67
6         9        55
1        52         7
  • frac=1.0 samples all rows. Because sampling is without replacement by default, each row appears exactly once, in random order.
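
One way to convince yourself that a full shuffle only reorders rows is to compare the shuffled result with the original. The sketch below uses a small stand-in DataFrame (not the dataset from this article) and checks that no row is lost or duplicated.

```python
import numpy as np
import pandas as pd

# Stand-in data for illustration
df = pd.DataFrame({'Column A': np.arange(10), 'Column B': np.arange(10, 20)})

shuffled = df.sample(frac=1.0)

# Same number of rows, and the same index labels once sorted
same_length = len(shuffled) == len(df)
same_labels = sorted(shuffled.index) == list(df.index)

# Sorting by index restores the original row order
restored = shuffled.sort_index()

print(same_length, same_labels, restored.equals(df))
```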

Sample a Fraction of Rows

You can specify the fraction of rows to sample using frac:

df.sample(frac=0.4)

Example output:

   Column A  Column B
4        69        21
6         9        55
  • Here, frac=0.4 randomly selects 40% of the rows. For the 10-row DataFrame above that is 4 rows; the listing shows only two of them.
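
If you want an exact number of rows rather than a proportion, sample also accepts n. The sketch below, on a small stand-in DataFrame, compares the two and shows that passing both at once is rejected.

```python
import numpy as np
import pandas as pd

# Stand-in data for illustration
df = pd.DataFrame({'Column A': np.arange(10), 'Column B': np.arange(10, 20)})

by_fraction = df.sample(frac=0.4)  # 40% of 10 rows
by_count = df.sample(n=3)          # exactly 3 rows

print(len(by_fraction), len(by_count))  # 4 3

# n and frac are mutually exclusive
try:
    df.sample(n=3, frac=0.4)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```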

Adding a Seed for Reproducibility

For consistent results, use random_state to set a seed:

df.sample(frac=0.4, random_state=42)

Example output:

   Column A  Column B
8        99        63
1        52         7
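
A quick way to check the effect of random_state is to draw the same sample twice and compare the results. This sketch uses a small stand-in DataFrame; the seed 42 is the one used above.

```python
import numpy as np
import pandas as pd

# Stand-in data for illustration
df = pd.DataFrame({'Column A': np.arange(10), 'Column B': np.arange(10, 20)})

first = df.sample(frac=0.4, random_state=42)
second = df.sample(frac=0.4, random_state=42)

# Identical seed: identical rows in identical order
print(first.equals(second))  # True
```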

Resetting the Index

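Sampled rows keep their original index labels. Resetting the index after sampling gives a clean, sequential index; drop=True discards the old labels instead of keeping them as a new column: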

df.sample(frac=0.4, random_state=42).reset_index(drop=True)

Example output:

   Column A  Column B
0        99        63
1        52         7
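
As an alternative to chaining reset_index, sample has an ignore_index parameter that relabels the result directly. The sketch below assumes pandas 1.3 or later, where this parameter is available, and uses a small stand-in DataFrame.

```python
import numpy as np
import pandas as pd

# Stand-in data for illustration
df = pd.DataFrame({'Column A': np.arange(10), 'Column B': np.arange(10, 20)})

chained = df.sample(frac=0.4, random_state=42).reset_index(drop=True)

# ignore_index=True relabels the result 0..n-1 (assumes pandas >= 1.3)
direct = df.sample(frac=0.4, random_state=42, ignore_index=True)

print(chained.equals(direct))
print(list(direct.index))  # [0, 1, 2, 3]
```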

Shuffling Using NumPy

While pandas provides the convenient sample method for shuffling rows, you can also achieve this using NumPy for more control. This approach involves directly manipulating the indices of the DataFrame.

Step 1: Get the DataFrame Indices
First, copy the row indices of the DataFrame into a NumPy array using the index.values attribute. Working on a copy means the in-place shuffle in Step 2 does not touch the DataFrame's own index.

sampled_indices = df.index.values.copy()
sampled_indices

Output:

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Step 2: Shuffle the Indices
Use np.random.shuffle to randomly rearrange the array of indices from Step 1 (sampled_indices) in place. The function returns None, so inspect the array itself afterwards.

np.random.shuffle(sampled_indices)
sampled_indices

Output:

array([0, 9, 7, 2, 5, 4, 8, 6, 3, 1])

Step 3: Select a Subset of Indices (Optional)
To sample only a fraction of the rows, slice the shuffled array. For example, keeping the first 4 shuffled indices samples 40% of the 10 rows:

sampled_indices = sampled_indices[:4]
sampled_indices

Output:

array([0, 9, 7, 2])

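Step 4: Use the Shuffled Indices to Access Rows
Finally, use the .iloc indexer to fetch the rows at the shuffled positions. This works here because the default index labels (0 to 9) coincide with row positions; for any other index, select by label with .loc instead:

df.iloc[sampled_indices]

Output:

   Column A  Column B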
0        33        34
9        54        67
7        48        20
2        97        69

Explanation
- np.random.shuffle: Randomly rearranges the elements of an array in place.
- iloc: Accesses rows by their integer location in the DataFrame, which is ideal for working with custom shuffled indices.
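
The four steps above can be condensed into a few lines. The sketch below is one possible variant, not the only way to do it: it swaps np.random.shuffle for Generator.permutation, which returns a new shuffled array instead of modifying one in place, and it shuffles row positions rather than index labels so that it also works with a non-default index. The string index, the seed 0, and the 40% fraction are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# String labels, to show that positions and labels are different things
df = pd.DataFrame(
    {'Column A': [33, 52, 97, 50, 69]},
    index=['a', 'b', 'c', 'd', 'e'],
)

rng = np.random.default_rng(0)  # arbitrary seed for illustration

# permutation returns a new shuffled array of the positions 0..len(df)-1
positions = rng.permutation(len(df))

# Keep the first 40% of the shuffled positions
n_rows = round(len(df) * 0.4)
sampled = df.iloc[positions[:n_rows]]

print(sampled)
```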

Recreating the sample Method

To better understand how sample works, let’s reimplement it:

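def sample_function(df, n=None, frac=None, replace=False, weights=None, random_state=None):
    # Create a NumPy random generator. pandas handles integer seeds differently,
    # so the same seed is not expected to reproduce df.sample row for row.
    rng = np.random.default_rng(random_state)

    # Calculate the number of samples
    if n is not None and frac is not None:
        raise ValueError("Specify either `n` or `frac`, not both.")
    if frac is not None:
        # round() avoids float truncation, e.g. int(100 * 0.29) gives 28
        n_samples = round(len(df) * frac)
    elif n is not None:
        n_samples = n
    else:
        # pandas itself falls back to n=1 here; this version asks for a value
        raise ValueError("Either `n` or `frac` must be specified.")

    # rng.choice needs probabilities that sum to 1, so normalize any weights
    if weights is not None:
        weights = np.asarray(weights, dtype=float)
        weights = weights / weights.sum()

    # Draw row positions rather than labels, so duplicate labels are safe
    sampled_positions = rng.choice(
        a=len(df),
        size=n_samples,
        replace=replace,
        p=weights
    )

    # Return sampled DataFrame
    return df.iloc[sampled_positions]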

Example Usage

sample_function(df, n=2)

Output:

   Column A  Column B
9        54        67
5        15        96
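
The replace and weights parameters that the reimplementation accepts mirror parameters of pandas' own sample. As a point of comparison, the sketch below shows how they behave in df.sample on a small stand-in DataFrame; the weights and the seed 0 are illustrative values.

```python
import pandas as pd

# Stand-in data for illustration
df = pd.DataFrame({'Column A': [33, 52, 97, 50]})

# Rows with weight 0 can never be drawn. The weights need not sum to 1,
# because pandas normalizes them.
weights = [0, 0, 3, 1]

# replace=True allows more draws than there are rows
drawn = df.sample(n=8, replace=True, weights=weights, random_state=0)

# Only rows 2 and 3 have a nonzero weight, so only they can appear
print(sorted(set(drawn.index)))
```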

References

- pandas.DataFrame.sample (pandas.pydata.org)
- numpy.random.randint (numpy.org)
- numpy.random.shuffle (numpy.org)