How to Spatially and Temporally Colocate Two Datasets in Python ?

Introduction

In Earth observation and environmental data analysis, it's often necessary to combine observations from different sources — for example, satellite products from different instruments, ground-based measurements, or model outputs. These datasets may differ in spatial resolution, geolocation accuracy, time coverage, or sampling strategy, making straightforward comparisons or fusions non-trivial.

This article walks through practical methods to co-locate two datasets in both space and time using Python. We'll cover:

  • Merging data when observations share exact geolocation and time
  • Using H3 spatial indexing for approximate spatial joins
  • Computing spatial and temporal proximity with precise geodesic distances
  • Accelerating co-location with nearest-neighbor search using KDTree and BallTree

Each method is applicable to specific scenarios, from high-resolution fire detection data fusion to satellite-to-ground validation or multi-sensor integration.

By the end of this guide, you'll be able to:

  • Choose the right strategy for co-location depending on your data
  • Implement scalable matching routines for large geospatial datasets
  • Incorporate both spatial and temporal tolerances in your fusion logic

Scenario 1: Exact Match on Lat/Lon/Time

In many remote sensing or Earth observation workflows, we often need to combine information from two different datasets that represent the same time and place. When the datasets share identical geolocation (latitude/longitude) and timestamps, co-location can be as simple as a direct merge.

Problem Setup

Assume we have two pandas DataFrames df1 and df2 with the following common columns:

  • lat: Latitude (float)
  • lon: Longitude (float)
  • time_coverage: A timestamp (e.g., ISO 8601 string or datetime object)

We want to combine them into a single DataFrame, df3, that contains matched entries from both datasets.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
import pandas as pd

# Sample data
df1 = pd.DataFrame({
    'lat': [34.5, 34.6, 34.7],
    'lon': [-120.3, -120.4, -120.5],
    'time_coverage': pd.to_datetime([
        '2023-10-01T10:00:00Z',
        '2023-10-01T10:05:00Z',
        '2023-10-01T10:10:00Z'
    ]),
    'product1_value': [0.1, 0.2, 0.3]
})

df2 = pd.DataFrame({
    'lat': [34.5, 34.6, 34.7],
    'lon': [-120.3, -120.4, -120.5],
    'time_coverage': pd.to_datetime([
        '2023-10-01T10:00:00Z',
        '2023-10-01T10:05:00Z',
        '2023-10-01T10:10:00Z'
    ]),
    'product2_value': [1.1, 1.2, 1.3]
})

Merge on Spatial and Temporal Keys

To colocate the two datasets, we simply merge them using pandas.merge() on the lat, lon, and time_coverage columns:

1
df3 = pd.merge(df1, df2, on=['lat', 'lon', 'time_coverage'], how='inner')

This will return rows that are present in both datasets with identical coordinates and time.

1
print(df3)

Output

1
2
3
4
lat     lon       time_coverage  product1_value  product2_value
0  34.5 -120.3 2023-10-01 10:00:00             0.1             1.1
1  34.6 -120.4 2023-10-01 10:05:00             0.2             1.2
2  34.7 -120.5 2023-10-01 10:10:00             0.3             1.3

Notes

  • This method only works when lat/lon and timestamps exactly match (i.e., from the same grid or sampling system).
  • In many real-world applications (different resolutions, instruments, or sampling rates), we need to allow for approximate matching using tolerances or nearest-neighbor searches. We'll cover that in the next steps of the article.

Approximate Spatial Merge Using H3

In this initial step, we assume that two datasets (df1 and df2) share the same spatial resolution or footprint. If the latitude, longitude, and time match exactly, we can use a basic merge() on these keys. But what if the locations are close but not exactly the same?

When lat/lon don't match exactly but are nearby, you can convert both datasets into a common H3 resolution and merge on that.

What is H3?: H3 is a hexagonal hierarchical geospatial indexing system developed by Uber. It converts latitude/longitude to a unique hexagon ID at a given resolution (from \~1000 km² down to \~1 m²), enabling:

  • Spatial binning
  • Fast joins
  • Spatial aggregation and summarization

Code

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
import h3
import pandas as pd

# Define H3 resolution (e.g., 6 is ~1 km²)
resolution = 6

# Compute H3 index for each dataset
df1['h3_index'] = df1.apply(lambda row: h3.geo_to_h3(row['lat'], row['lon'], resolution), axis=1)
df2['h3_index'] = df2.apply(lambda row: h3.geo_to_h3(row['lat'], row['lon'], resolution), axis=1)

# Optional: round time to common granularity if needed (e.g., minute)
df1['time_rounded'] = df1['time_coverage'].dt.floor('min')
df2['time_rounded'] = df2['time_coverage'].dt.floor('min')

# Merge on h3_index and (optional) time_rounded
df3 = pd.merge(df1, df2, on=['h3_index', 'time_rounded'], how='inner', suffixes=('_1', '_2'))

Example

1
print(df3[['h3_index', 'time_rounded', 'product1_value', 'product2_value']])

Output

1
2
3
h3_index        time_rounded  product1_value  product2_value
0  862a1072ffff 2023-10-01 10:00:00             0.1             1.1
1  862a10c4ffff 2023-10-01 10:05:00             0.2             1.2

When to Use H3?

  • Datasets use slightly different lat/lon but have overlapping footprints
  • You want to aggregate data spatially (e.g., average FRP per hex)
  • You're working with large-scale Earth observation data, especially from different sources or resolutions

Tips

  • Choose the H3 resolution based on the spatial uncertainty of your data (e.g., use resolution 6–8 for satellite swath pixels).
  • You can extend this to temporal bins using pd.Grouper, floor('15min'), etc.

Scenario 2: Approximate Matching Using Distance and Time Difference

When two datasets use different geolocation and/or time grids, a direct merge is no longer sufficient. Instead, we want to associate each observation in one dataset (df1) with its closest observation in another (df2) — both spatially and temporally.

To do that, we'll compute:

  • distance: the great-circle distance between two points (in meters)
  • time_diff: the absolute time difference between the two timestamps (in seconds or minutes)

Let’s use the pyproj.Geod class for precise geodesic distance and simple datetime subtraction for time difference.

Sample DataFrames

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
import pandas as pd
from pyproj import Geod
from datetime import timedelta

# Reference ellipsoid
geod = Geod(ellps="WGS84")

# Example: df1 (e.g., satellite A)
df1 = pd.DataFrame({
    'lat': [34.52, 34.65],
    'lon': [-120.33, -120.45],
    'time_coverage': pd.to_datetime([
        '2023-10-01T10:00:00Z',
        '2023-10-01T10:05:00Z'
    ]),
    'product1_value': [0.1, 0.2]
})

# Example: df2 (e.g., satellite B, coarser resolution)
df2 = pd.DataFrame({
    'lat': [34.51, 34.70],
    'lon': [-120.30, -120.47],
    'time_coverage': pd.to_datetime([
        '2023-10-01T10:01:30Z',
        '2023-10-01T10:08:00Z'
    ]),
    'product2_value': [1.1, 1.3]
})

Step-by-Step Matching Logic

We compute the pairwise distance and time difference between all df1 and df2 combinations. Then, for each row in df1, we find the closest observation in df2 based on a chosen distance metric.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
import numpy as np

matches = []

for i, row1 in df1.iterrows():
    best_match = None
    min_score = np.inf  # Customize this if you want to weigh time vs distance

    for j, row2 in df2.iterrows():
        # Geodesic distance (meters)
        _, _, distance = geod.inv(row1['lon'], row1['lat'], row2['lon'], row2['lat'])

        # Time difference (seconds)
        time_diff = abs((row1['time_coverage'] - row2['time_coverage']).total_seconds())

        # You can combine distance and time with a weighted score if needed
        score = distance + time_diff  # naive sum — adjust weights if necessary

        if score < min_score:
            min_score = score
            best_match = {
                'df1_index': i,
                'df2_index': j,
                'distance': distance,
                'time_diff_sec': time_diff,
                'product1_value': row1['product1_value'],
                'product2_value': row2['product2_value']
            }

    matches.append(best_match)

# Convert match list to DataFrame
df_matches = pd.DataFrame(matches)

Example

1
print(df_matches[['df1_index', 'df2_index', 'distance', 'time_diff_sec', 'product1_value', 'product2_value']])

Output

1
2
3
df1_index  df2_index     distance  time_diff_sec  product1_value  product2_value
0          0          0   2711.04689           90.0             0.1             1.1
1          1          1   2709.92131          180.0             0.2             1.3

Notes

  • The distance is computed along the Earth's ellipsoid — more accurate than simple Euclidean distance.
  • The time_diff is in seconds; you can convert it to minutes if preferred.
  • This method is computationally expensive for large datasets (O(N*M)); in the next step, we’ll optimize it using spatial data structures like KDTree and BallTree.

Scenario 3: Efficient Nearest-Neighbor Matching Using KDTree

The pairwise matching approach in Step 2 becomes computationally expensive when working with large Earth observation datasets. Instead of computing distances between all pairs, we can use scalable nearest-neighbor search algorithms.

Why KDTree?

  • scipy.spatial.KDTree or sklearn.neighbors.BallTree can quickly find the nearest observation in a second dataset.
  • When spatial coordinates are dense and uniformly distributed, KDTree is very efficient.
  • BallTree is better for angular (haversine) distances on a sphere — we’ll demonstrate both options.

Step 3A: KDTree for Small-Scale/Planar Coordinates

When the data spans a small area (local scale), converting latitude and longitude to a Cartesian projection (like UTM or simple x/y meters) is acceptable.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
import pandas as pd
from scipy.spatial import cKDTree
import numpy as np

# Assume df1 and df2 from previous steps
# Convert lat/lon to approximate meters (equirectangular projection for simplicity)
def latlon_to_xy(lat, lon):
    R = 6371000  # Earth radius in meters
    x = np.radians(lon) * R * np.cos(np.radians(lat))
    y = np.radians(lat) * R
    return np.column_stack((x, y))

# Prepare coordinates
coords1 = latlon_to_xy(df1['lat'].values, df1['lon'].values)
coords2 = latlon_to_xy(df2['lat'].values, df2['lon'].values)

# Build KDTree on df2
tree = cKDTree(coords2)

# Query nearest neighbor for each df1 point
distances, indices = tree.query(coords1, k=1)  # You can set distance_upper_bound=...

# Combine into new DataFrame
df_nn = df1.copy()
df_nn['nearest_index'] = indices
df_nn['distance_m'] = distances
df_nn['matched_product2_value'] = df2.loc[indices, 'product2_value'].values
df_nn['time_diff_sec'] = abs((df1['time_coverage'].values - df2.loc[indices, 'time_coverage'].values) / np.timedelta64(1, 's'))

Why Use 3D (x, y, z) Instead of 2D (x, y)?

Translating lat/lon to 3D Cartesian coordinates (x, y, z) can indeed be more appropriate in many Earth observation applications, particularly when:

Aspect 2D Projection (latlon_to_xy) 3D Cartesian (latlon_to_xyz)
Accuracy Approximate, planar (flat Earth) Accurate over global scales (spherical Earth)
Use case Local areas (small distances, <100 km) Global matching or large areas
Projection distortion Introduces error with increasing area No distortion (unit sphere or Earth radius)
Efficiency Slightly faster (fewer dimensions) Slightly more costly, but still fast

Recommended: 3D Cartesian Coordinates for Global Matching

Here's how to modify your code:
Step 1: Convert lat/lon to 3D Cartesian Coordinates

1
2
3
4
5
6
7
8
def latlon_to_xyz(lat, lon):
    R = 6371000  # Earth radius in meters
    lat_rad = np.radians(lat)
    lon_rad = np.radians(lon)
    x = R * np.cos(lat_rad) * np.cos(lon_rad)
    y = R * np.cos(lat_rad) * np.sin(lon_rad)
    z = R * np.sin(lat_rad)
    return np.column_stack((x, y, z))

Step 2: Build KDTree and Match

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
# Prepare 3D coordinates
coords1 = latlon_to_xyz(df1['lat'].values, df1['lon'].values)
coords2 = latlon_to_xyz(df2['lat'].values, df2['lon'].values)

# Build KDTree on df2
tree = cKDTree(coords2)

# Query nearest neighbors
distances, indices = tree.query(coords1, k=1)

# Add to result DataFrame
df_nn = df1.copy()
df_nn['nearest_index'] = indices
df_nn['distance_m'] = distances
df_nn['matched_product2_value'] = df2.loc[indices, 'product2_value'].values
df_nn['time_diff_sec'] = abs((df1['time_coverage'].values - df2.loc[indices, 'time_coverage'].values) / np.timedelta64(1, 's'))

Note on Interpretation:

  • The 3D Euclidean distance in Cartesian coordinates is still an approximation of great-circle (haversine) distance — but it’s very close, and more accurate than a planar (x, y) projection for large areas.
  • If you need exact great-circle distances, you can use BallTree with haversine metric — but you’ll have to convert lat/lon to radians.

Step 3B: BallTree for Global/Geodesic Coordinates (Haversine Distance)

When working globally, it’s better to use BallTree with haversine distance. This requires converting lat/lon to radians.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
from sklearn.neighbors import BallTree

# Convert lat/lon to radians
radians1 = np.radians(df1[['lat', 'lon']].values)
radians2 = np.radians(df2[['lat', 'lon']].values)

# BallTree expects [lat, lon] in radians
tree = BallTree(radians2, metric='haversine')

# Query the nearest neighbor (returns radians)
dist_rad, indices = tree.query(radians1, k=1)

# Convert haversine distance to meters
earth_radius = 6371000  # in meters
dist_m = dist_rad[:, 0] * earth_radius

# Build result DataFrame
df_nn = df1.copy()
df_nn['nearest_index'] = indices[:, 0]
df_nn['distance_m'] = dist_m
df_nn['matched_product2_value'] = df2.loc[indices[:, 0], 'product2_value'].values
df_nn['time_diff_sec'] = abs((df1['time_coverage'].values - df2.loc[indices[:, 0], 'time_coverage'].values) / np.timedelta64(1, 's'))

Example

1
print(df_nn[['lat', 'lon', 'distance_m', 'time_diff_sec', 'product1_value', 'matched_product2_value']])

Output:

1
2
3
lat     lon  distance_m  time_diff_sec  product1_value  matched_product2_value
0  34.52 -120.33   2711.046         90.0              0.1                     1.1
1  34.65 -120.45   2709.921        180.0              0.2                     1.3

Things to Consider

  • BallTree is preferred for global or high-latitude data.
  • Both methods assume a 1:1 nearest neighbor match, but you can extend this to radius-based matching or find all matches within distance using .query_radius().
  • You can apply a temporal tolerance filter afterward to ensure both spatial and temporal proximity.