How to Use Pandas groupby to Aggregate Values into Lists

Introduction

In data analysis with Python and pandas, one of the most common tasks is grouping data and summarizing it. While summing or averaging numeric columns is straightforward, sometimes you need to collect all the values in a column into a list—for example, to group rows by a timestamp and get a list of associated filenames.

This tutorial shows you how to use pandas’ groupby() along with custom aggregation to:

  • Sum numeric values
  • Aggregate a column into a list of unique values (like filenames or IDs)

What We'll Achieve

Starting with a DataFrame like this:

time_coverage      filename     frp
2025-06-09T12:00Z  file_001.nc  15.2
2025-06-09T12:00Z  file_002.nc  17.5
2025-06-09T12:00Z  file_001.nc   5.1
2025-06-09T13:00Z  file_003.nc   9.8

We want to group by time_coverage, then:

  • Sum the frp values
  • Collect a unique list of filenames for each group

Step 1: Create the DataFrame

import pandas as pd

# Sample data
df = pd.DataFrame({
    'time_coverage': [
        '2025-06-09T12:00Z',
        '2025-06-09T12:00Z',
        '2025-06-09T12:00Z',
        '2025-06-09T13:00Z'
    ],
    'filename': [
        'file_001.nc',
        'file_002.nc',
        'file_001.nc',
        'file_003.nc'
    ],
    'frp': [15.2, 17.5, 5.1, 9.8]
})
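
The time_coverage values are plain strings here. If you'd rather work with real datetimes (optional; assuming the ISO 8601 format shown above), you can parse the column first. The groupby steps below work the same either way:

# Optional: parse the ISO 8601 strings into timezone-aware timestamps
df['time_coverage'] = pd.to_datetime(df['time_coverage'])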

Step 2: Group and Aggregate

Use groupby() and agg() to compute the sum and collect the filenames into a list:

df_grouped = df.groupby('time_coverage').agg({
    'frp': 'sum',
    'filename': list  # collect all filenames (may include duplicates)
}).reset_index()

Output:

time_coverage      frp   filename
2025-06-09T12:00Z  37.8  ['file_001.nc', 'file_002.nc', 'file_001.nc']
2025-06-09T13:00Z   9.8  ['file_003.nc']

You can see that file_001.nc appears twice in the list for 2025-06-09T12:00Z.
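
If you prefer pandas' named-aggregation syntax (available since pandas 0.25), the same grouping can be written without a dict; this sketch is equivalent to the agg() call above:

df_grouped = df.groupby('time_coverage').agg(
    frp=('frp', 'sum'),           # sum the frp values
    filename=('filename', list)   # collect all filenames (may include duplicates)
).reset_index()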

Step 3: Get Unique Filenames Only

To remove duplicates, pass a lambda that converts each group's values to a set before turning them back into a list:

df_grouped = df.groupby('time_coverage').agg({
    'frp': 'sum',
    'filename': lambda x: list(set(x))
}).reset_index()

Output:

time_coverage      frp   filename
2025-06-09T12:00Z  37.8  ['file_001.nc', 'file_002.nc']
2025-06-09T13:00Z   9.8  ['file_003.nc']

Note that set() does not preserve any particular order, so the list order can vary between runs. Use sorted(set(x)) if you want filenames in a consistent alphabetical order:

df_grouped = df.groupby('time_coverage').agg({
    'frp': 'sum',
    'filename': lambda x: sorted(set(x))
}).reset_index()
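
Recent pandas versions also offer a built-in 'unique' aggregator, which deduplicates while keeping first-appearance order. It returns a NumPy array per group rather than a Python list, so convert afterwards if you need list semantics:

df_grouped = df.groupby('time_coverage').agg({
    'frp': 'sum',
    'filename': 'unique'  # NumPy array of unique values, first-appearance order
}).reset_index()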

Step 4: Loop Through and Print the Results

To print each group and its unique filenames:

for idx, row in df_grouped.iterrows():
    print(f"Time coverage: {row['time_coverage']}")
    print("Unique filenames:")
    for fname in row['filename']:
        print(f"  - {fname}")
    print()
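
With the sorted variant from Step 3, this prints:

Time coverage: 2025-06-09T12:00Z
Unique filenames:
  - file_001.nc
  - file_002.nc

Time coverage: 2025-06-09T13:00Z
Unique filenames:
  - file_003.nc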

Why Removing Duplicates Matters

Removing duplicates in group aggregation is useful when:

  • You want a clean, deduplicated report of files or identifiers
  • You’re counting distinct items per group (see the nunique() sketch after this list)
  • You want to avoid redundant entries in summaries or visualizations
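
For the distinct-count case in particular, you don't need to build lists at all; groupby supports nunique() directly:

# Count distinct filenames per time_coverage
print(df.groupby('time_coverage')['filename'].nunique())
# 2025-06-09T12:00Z    2
# 2025-06-09T13:00Z    1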

Summary

  • Use groupby() to group your DataFrame by key columns.
  • Use agg() to apply summary functions:
      • 'sum', 'mean', etc., for numeric data
      • lambda x: list(set(x)) to collect unique values in a list
  • Use sorted(set(x)) for readability and consistent ordering.

Use Cases

  • Grouping by date/time and collecting associated filenames
  • Aggregating unique user IDs per session or category
  • Compiling tags or keywords per article or record
  • Deduplicating sources associated with an event
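
As one more illustration, the same pattern covers the user-ID case. Here is a minimal sketch with hypothetical session_id and user_id columns:

import pandas as pd

sessions = pd.DataFrame({
    'session_id': ['s1', 's1', 's2', 's2', 's2'],
    'user_id':    ['u1', 'u1', 'u2', 'u3', 'u2']
})

# Unique user IDs per session, sorted for a stable order
per_session = sessions.groupby('session_id')['user_id'].agg(lambda x: sorted(set(x)))
print(per_session)
# s1 -> ['u1'],  s2 -> ['u2', 'u3']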