Introduction
This article explains how to shuffle rows and sample data from a Pandas DataFrame using the `sample` method, and how these operations work internally. We’ll explore examples, understand the underlying mechanics, and even reimplement the functionality from scratch.
Generating a Sample Dataset
Let’s begin by creating a simple DataFrame using random integers. This dataset will serve as the base for our sampling operations.
```python
import numpy as np
import pandas as pd

# 10 rows x 2 columns of random integers in [0, 100)
data = np.random.randint(0, 100, (10, 2))
df = pd.DataFrame(data=data, columns=['Column A', 'Column B'])

print(df)
```
Example output:
```
   Column A  Column B
0        33        34
1        52         7
2        97        69
3        50        42
4        69        21
5        15        96
6         9        55
7        48        20
8        99        63
9        54        67
```
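Because the dataset above is generated without a seed, your values will differ from the example output on every run. If you want a dataset you can regenerate exactly, one option is a seeded NumPy generator. This is a minimal sketch: the seed value `0` and the name `df_seeded` are arbitrary choices for illustration, and the values it produces will not match the example output shown above.

```python
import numpy as np
import pandas as pd

# Seeded generator; the seed 0 is an arbitrary choice for illustration.
rng = np.random.default_rng(0)

# Same shape and value range as the dataset above:
# 10 rows, 2 columns, integers in [0, 100).
data = rng.integers(0, 100, size=(10, 2))
df_seeded = pd.DataFrame(data=data, columns=['Column A', 'Column B'])

print(df_seeded)
```

Re-running this snippet with the same seed produces the same DataFrame each time.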
Using the sample Method
The Pandas `sample` method provides a straightforward way to shuffle rows or extract a subset of rows from a DataFrame.
Shuffle All Rows
To shuffle all rows of the DataFrame:
```python
df.sample(frac=1.0)
```
Example shuffled output (only four of the ten returned rows are shown):
```
   Column A  Column B
3        50        42
9        54        67
6         9        55
1        52         7
```
`frac=1.0` means all rows are sampled, but their order is randomized.
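To convince yourself that `frac=1.0` only reorders rows and never drops or duplicates them, you can compare the shuffled result with the original. The sketch below uses a small stand-in DataFrame with fixed values (an assumption made so the snippet runs on its own), not the random dataset from earlier.

```python
import numpy as np
import pandas as pd

# Stand-in data with fixed values so the check is self-contained.
df = pd.DataFrame(np.arange(20).reshape(10, 2), columns=['Column A', 'Column B'])

shuffled = df.sample(frac=1.0)

# The shuffled frame has the same number of rows as the original...
same_length = len(shuffled) == len(df)

# ...and sorting by index restores the original frame exactly.
same_rows = shuffled.sort_index().equals(df)

print(same_length, same_rows)
```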
Sample a Fraction of Rows
You can specify the fraction of rows to sample using `frac`:
```python
df.sample(frac=0.4)
```
Example output (excerpt):
```
   Column A  Column B
4        69        21
6         9        55
```
Here, `frac=0.4` randomly selects 40% of the rows. With the 10-row DataFrame above, that is 4 rows; the excerpt shows two of them.
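If you would rather ask for an exact row count than a fraction, `sample` also accepts `n`. The sketch below, which uses a stand-in DataFrame with fixed values, shows that `frac=0.4` and `n=4` return the same number of rows for a 10-row frame, and that pandas rejects a call that passes both.

```python
import numpy as np
import pandas as pd

# Stand-in data with fixed values so the snippet is self-contained.
df = pd.DataFrame(np.arange(20).reshape(10, 2), columns=['Column A', 'Column B'])

by_frac = df.sample(frac=0.4)  # 40% of 10 rows
by_n = df.sample(n=4)          # an explicit row count

print(len(by_frac), len(by_n))

# `n` and `frac` are mutually exclusive; passing both raises ValueError.
try:
    df.sample(n=4, frac=0.4)
except ValueError as err:
    print("ValueError:", err)
```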
Adding a Seed for Reproducibility
For consistent results, use `random_state` to set a seed:
```python
df.sample(frac=0.4, random_state=42)
```
Example output (excerpt; two of the four sampled rows are shown):
```
   Column A  Column B
8        99        63
1        52         7
```
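A quick way to check the effect of the seed is to sample twice with the same `random_state` and compare the results. This is a small sketch on a stand-in DataFrame with fixed values; the seed `42` simply mirrors the example above.

```python
import numpy as np
import pandas as pd

# Stand-in data with fixed values so the snippet is self-contained.
df = pd.DataFrame(np.arange(20).reshape(10, 2), columns=['Column A', 'Column B'])

first = df.sample(frac=0.4, random_state=42)
second = df.sample(frac=0.4, random_state=42)

# Same seed, same input: both calls pick the same rows in the same order.
identical = first.equals(second)
print(identical)
```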
Resetting the Index
Resetting the index after sampling ensures a clean, sequential index:
```python
df.sample(frac=0.4, random_state=42).reset_index(drop=True)
```
Example output (excerpt; two of the four sampled rows are shown):
```
   Column A  Column B
0        99        63
1        52         7
```
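The `drop=True` argument matters here: without it, `reset_index` keeps the old row labels as a new column instead of discarding them. The sketch below contrasts the two calls on a stand-in DataFrame with fixed values.

```python
import numpy as np
import pandas as pd

# Stand-in data with fixed values so the snippet is self-contained.
df = pd.DataFrame(np.arange(20).reshape(10, 2), columns=['Column A', 'Column B'])

sampled = df.sample(frac=0.4, random_state=42)

# Default behaviour: the old labels move into a new column named 'index'.
kept = sampled.reset_index()

# drop=True: the old labels are discarded entirely.
dropped = sampled.reset_index(drop=True)

print(kept.columns.tolist())
print(dropped.columns.tolist())
print(dropped.index.tolist())
```

Keeping the old labels can be useful when you later need to trace sampled rows back to the original DataFrame.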
Shuffling Using NumPy
While pandas provides the convenient `sample` method for shuffling rows, you can also achieve this using NumPy for more control. This approach involves directly manipulating the indices of the DataFrame.
Step 1: Get the DataFrame Indices
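First, extract the indices of the DataFrame using the `index.values` attribute, and copy them into a new array. The copy means the in-place shuffle in the next step works on its own array rather than on data that may be shared with the DataFrame's index.
```python
sampled_indices = df.index.values.copy()
sampled_indices
```
Output:
```
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
```
Step 2: Shuffle the Indices
Use `np.random.shuffle` to randomly rearrange these indices in place. The function returns `None`, so display the array afterwards to see the new order.
```python
np.random.shuffle(sampled_indices)
sampled_indices
```
Output:
```
array([0, 9, 7, 2, 5, 4, 8, 6, 3, 1])
```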
Step 3: Select a Subset of Indices (Optional)
To sample only a fraction of the rows, slice the shuffled indices array. For example, selecting the first 4 rows from the shuffled indices:
```python
sampled_indices = sampled_indices[:4]
sampled_indices
```
Output:
```
array([0, 9, 7, 2])
```
Step 4: Use the Shuffled Indices to Access Rows
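Finally, pass the shuffled indices to the `.iloc` indexer to fetch the corresponding rows. This works here because the DataFrame has the default integer index, so each label is also the row's position; with a custom index, select by label with `.loc` instead.
```python
df.iloc[sampled_indices]
```
Output:
```
   Column A  Column B
0        33        34
9        54        67
7        48        20
2        97        69
```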
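The four steps can be collected into one small function. The sketch below is one possible arrangement, not a canonical recipe: the helper name `shuffle_with_numpy` and its `seed` parameter are hypothetical, and it swaps the in-place `np.random.shuffle` for `Generator.permutation`, which returns a new shuffled array of row positions and can be seeded.

```python
import numpy as np
import pandas as pd


def shuffle_with_numpy(df, n=None, seed=None):
    """Hypothetical helper: shuffle row positions, optionally keep the first n."""
    rng = np.random.default_rng(seed)

    # Steps 1 and 2: a shuffled array of the positions 0 .. len(df) - 1.
    positions = rng.permutation(len(df))

    # Step 3 (optional): keep only the first n shuffled positions.
    if n is not None:
        positions = positions[:n]

    # Step 4: select the rows by position.
    return df.iloc[positions]


# Stand-in data with fixed values so the snippet is self-contained.
df = pd.DataFrame(np.arange(20).reshape(10, 2), columns=['Column A', 'Column B'])

print(shuffle_with_numpy(df, n=4, seed=0))
```

Because the helper works with positions rather than labels, it does not depend on the DataFrame having the default integer index.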
Explanation
- `np.random.shuffle`: Randomly rearranges the elements of an array in place.
- `iloc`: Accesses rows by their integer position in the DataFrame. It suits shuffled indices whenever those indices are positions, as they are with the default index used here.
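The distinction between labels and positions becomes visible as soon as the index is not the default `0..n-1` range. The sketch below uses a hypothetical three-row DataFrame with string labels to show the two consistent pairings: shuffled labels with `.loc`, and shuffled positions with `.iloc`.

```python
import numpy as np
import pandas as pd

# Hypothetical data with string labels instead of the default integer index.
df = pd.DataFrame({'Column A': [10, 20, 30]}, index=['x', 'y', 'z'])

# Option 1: shuffle a copy of the labels and select by label with .loc.
labels = df.index.to_numpy(copy=True)
np.random.shuffle(labels)
by_label = df.loc[labels]

# Option 2: shuffle integer positions and select by position with .iloc.
positions = np.random.permutation(len(df))
by_position = df.iloc[positions]

print(by_label)
print(by_position)
```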
Recreating the sample Method
To better understand how `sample` works, let’s reimplement a simplified version of it. It uses NumPy's `default_rng`, so for a given seed it will not necessarily return the same rows as `df.sample`.
```python
import numpy as np


def sample_function(df, n=None, frac=None, replace=False, weights=None, random_state=None):
    # Build a random number generator (seeded when random_state is given)
    rng = np.random.default_rng(random_state)

    # Work out how many rows to draw
    if n is not None and frac is not None:
        raise ValueError("Specify either `n` or `frac`, not both.")
    if frac is not None:
        n_samples = round(len(df) * frac)
    elif n is not None:
        n_samples = n
    else:
        raise ValueError("Either `n` or `frac` must be specified.")

    # Draw row positions; `weights`, if given, must sum to 1
    sampled_positions = rng.choice(
        a=len(df),
        size=n_samples,
        replace=replace,
        p=weights
    )

    # Select by position so duplicate index labels cannot add extra rows
    return df.iloc[sampled_positions]
```
Example Usage
```python
sample_function(df, n=2)
```
Output:
```
   Column A  Column B
9        54        67
5        15        96
```
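The `replace` and `weights` parameters in the reimplementation's signature mirror parameters of the same names on `DataFrame.sample`. As a point of comparison, the sketch below shows how they behave in pandas itself, using a stand-in DataFrame with fixed values; the specific weights and seed are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in data with fixed values so the snippet is self-contained.
df = pd.DataFrame(np.arange(20).reshape(10, 2), columns=['Column A', 'Column B'])

# replace=True samples with replacement, so n may exceed len(df)
# and the same row can appear more than once.
with_replacement = df.sample(n=15, replace=True, random_state=0)

# weights gives each row a selection probability. Here all the weight
# sits on the first row, so it is the only row that can be drawn.
weights = [1] + [0] * 9
only_first = df.sample(n=1, weights=weights, random_state=0)

print(len(with_replacement))
print(only_first.index.tolist())
```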
References
Links | Site |
---|---|
pandas.DataFrame.sample | pandas.pydata.org |
numpy.random.randint | numpy.org |
numpy.random.shuffle | numpy.org |