Introduction
In data analysis with Python and pandas, one of the most common tasks is grouping data and summarizing it. While summing or averaging numeric columns is straightforward, sometimes you need to collect all the values in a column into a list—for example, to group rows by a timestamp and get a list of associated filenames.
This tutorial shows you how to use pandas’ `groupby()` along with custom aggregation to:
- Sum numeric values
- Aggregate a column into a list of unique values (like filenames or IDs)
What We'll Achieve
Starting with a DataFrame like this:
| time_coverage | filename | frp |
|---|---|---|
| 2025-06-09T12:00Z | file_001.nc | 15.2 |
| 2025-06-09T12:00Z | file_002.nc | 17.5 |
| 2025-06-09T12:00Z | file_001.nc | 5.1 |
| 2025-06-09T13:00Z | file_003.nc | 9.8 |
We want to group by `time_coverage`, then:
- Sum the `frp` values
- Collect a unique list of filenames for each group
Step 1: Create the DataFrame
```python
import pandas as pd

# Sample data
df = pd.DataFrame({
    'time_coverage': [
        '2025-06-09T12:00Z',
        '2025-06-09T12:00Z',
        '2025-06-09T12:00Z',
        '2025-06-09T13:00Z'
    ],
    'filename': [
        'file_001.nc',
        'file_002.nc',
        'file_001.nc',
        'file_003.nc'
    ],
    'frp': [15.2, 17.5, 5.1, 9.8]
})
```
Step 2: Group and Aggregate
Use `groupby()` and `agg()` to compute the sum and collect the filenames into a list:
```python
df_grouped = df.groupby('time_coverage').agg({
    'frp': 'sum',
    'filename': list  # collect all filenames (may include duplicates)
}).reset_index()
```
Output:
| time_coverage | frp | filename |
|---|---|---|
| 2025-06-09T12:00Z | 37.8 | ['file_001.nc', 'file_002.nc', 'file_001.nc'] |
| 2025-06-09T13:00Z | 9.8 | ['file_003.nc'] |
You can see that `file_001.nc` appears twice in the list for `2025-06-09T12:00Z`.
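As a side note, if you prefer explicit control over the output column names, the same aggregation can be written with pandas’ named-aggregation syntax (available in pandas 0.25+); this is an equivalent way to express Step 2:

```python
# Named aggregation: each output column is a (source column, function) pair
df_grouped = df.groupby('time_coverage').agg(
    frp=('frp', 'sum'),
    filename=('filename', list)
).reset_index()
```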
Step 3: Get Unique Filenames Only
To remove duplicates from the list of filenames, convert each group’s values to a `set()` inside a lambda:
```python
df_grouped = df.groupby('time_coverage').agg({
    'frp': 'sum',
    'filename': lambda x: list(set(x))
}).reset_index()
```
Output:
| time_coverage | frp | filename |
|---|---|---|
| 2025-06-09T12:00Z | 37.8 | ['file_001.nc', 'file_002.nc'] |
| 2025-06-09T13:00Z | 9.8 | ['file_003.nc'] |
Note that `set()` does not guarantee any particular ordering, so use `sorted(set(x))` if you want filenames in a consistent alphabetical order:
```python
df_grouped = df.groupby('time_coverage').agg({
    'frp': 'sum',
    'filename': lambda x: sorted(set(x))
}).reset_index()
```
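As an alternative to the lambda, pandas also accepts the string `'unique'` as an aggregation, which dispatches to `Series.unique()`. The sketch below should produce a similar result, with the caveat that it returns NumPy arrays (in first-seen order) rather than plain Python lists:

```python
df_grouped = df.groupby('time_coverage').agg({
    'frp': 'sum',
    'filename': 'unique'  # NumPy array of unique filenames, first-seen order
}).reset_index()
```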
Step 4: Loop Through and Print the Results
To print each group and its unique filenames:
```python
for idx, row in df_grouped.iterrows():
    print(f"Time coverage: {row['time_coverage']}")
    print("Unique filenames:")
    for fname in row['filename']:
        print(f" - {fname}")
    print()
```
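With the sorted variant from Step 3, this prints:

```
Time coverage: 2025-06-09T12:00Z
Unique filenames:
 - file_001.nc
 - file_002.nc

Time coverage: 2025-06-09T13:00Z
Unique filenames:
 - file_003.nc
```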
Why Removing Duplicates Matters
Removing duplicates in group aggregation is useful when:
- You want a clean, deduplicated report of files or identifiers
- You’re counting distinct items per group (see the sketch after this list)
- You want to avoid redundant entries in summaries or visualizations
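For the counting case, you don’t even need a lambda: pandas provides `nunique()` as a built-in group aggregation. A minimal sketch using the sample `df` from Step 1:

```python
# Count distinct filenames per time_coverage group
distinct_counts = df.groupby('time_coverage')['filename'].nunique()
print(distinct_counts)
# 2025-06-09T12:00Z    2
# 2025-06-09T13:00Z    1
```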
Summary
- Use `groupby()` to group your DataFrame by key columns.
- Use `agg()` to apply summary functions:
  - `'sum'`, `'mean'`, etc., for numeric data
  - `lambda x: list(set(x))` to collect unique values in a list
- Use `sorted(set(x))` for readability and consistent ordering.
Use Cases
- Grouping by date/time and collecting associated filenames
- Aggregating unique user IDs per session or category (see the sketch after this list)
- Compiling tags or keywords per article or record
- Deduplicating sources associated with an event
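For instance, the user-ID case follows the exact same pattern; a minimal sketch with hypothetical `session_id` and `user_id` columns:

```python
# Hypothetical data: collect unique user IDs per session
sessions = pd.DataFrame({
    'session_id': ['s1', 's1', 's2'],
    'user_id': ['u1', 'u1', 'u2']
})
unique_users = sessions.groupby('session_id')['user_id'].agg(
    lambda x: sorted(set(x))
)
print(unique_users)
# session_id
# s1    [u1]
# s2    [u2]
# Name: user_id, dtype: object
```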