Introduction
In Earth observation and environmental data analysis, it's often necessary to combine observations from different sources — for example, satellite products from different instruments, ground-based measurements, or model outputs. These datasets may differ in spatial resolution, geolocation accuracy, time coverage, or sampling strategy, making straightforward comparisons or fusions non-trivial.
This article walks through practical methods to co-locate two datasets in both space and time using Python. We'll cover:
- Merging data when observations share exact geolocation and time
- Using H3 spatial indexing for approximate spatial joins
- Computing spatial and temporal proximity with precise geodesic distances
- Accelerating co-location with nearest-neighbor search using KDTree and BallTree
Each method is applicable to specific scenarios, from high-resolution fire detection data fusion to satellite-to-ground validation or multi-sensor integration.
By the end of this guide, you'll be able to:
- Choose the right strategy for co-location depending on your data
- Implement scalable matching routines for large geospatial datasets
- Incorporate both spatial and temporal tolerances in your fusion logic
Scenario 1: Exact Match on Lat/Lon/Time
In many remote sensing and Earth observation workflows, we need to combine information from two different datasets that represent the same time and place. When the datasets share identical geolocation (latitude/longitude) and timestamps, co-location can be as simple as a direct merge.
Problem Setup
Assume we have two pandas DataFrames, `df1` and `df2`, with the following common columns:

- `lat`: Latitude (float)
- `lon`: Longitude (float)
- `time_coverage`: A timestamp (e.g., ISO 8601 string or `datetime` object)

We want to combine them into a single DataFrame, `df3`, that contains matched entries from both datasets.
```python
import pandas as pd

# Sample data
df1 = pd.DataFrame({
    'lat': [34.5, 34.6, 34.7],
    'lon': [-120.3, -120.4, -120.5],
    'time_coverage': pd.to_datetime([
        '2023-10-01T10:00:00Z',
        '2023-10-01T10:05:00Z',
        '2023-10-01T10:10:00Z'
    ]),
    'product1_value': [0.1, 0.2, 0.3]
})

df2 = pd.DataFrame({
    'lat': [34.5, 34.6, 34.7],
    'lon': [-120.3, -120.4, -120.5],
    'time_coverage': pd.to_datetime([
        '2023-10-01T10:00:00Z',
        '2023-10-01T10:05:00Z',
        '2023-10-01T10:10:00Z'
    ]),
    'product2_value': [1.1, 1.2, 1.3]
})
```
Merge on Spatial and Temporal Keys
To co-locate the two datasets, we simply merge them using `pandas.merge()` on the `lat`, `lon`, and `time_coverage` columns:
```python
df3 = pd.merge(df1, df2, on=['lat', 'lon', 'time_coverage'], how='inner')
```
This will return rows that are present in both datasets with identical coordinates and time.
```python
print(df3)
```
Output
```
    lat    lon       time_coverage  product1_value  product2_value
0  34.5 -120.3 2023-10-01 10:00:00             0.1             1.1
1  34.6 -120.4 2023-10-01 10:05:00             0.2             1.2
2  34.7 -120.5 2023-10-01 10:10:00             0.3             1.3
```
Notes
- This method only works when lat/lon and timestamps exactly match (i.e., from the same grid or sampling system).
- In many real-world applications (different resolutions, instruments, or sampling rates), we need to allow for approximate matching using tolerances or nearest-neighbor searches. A quick time-tolerance sketch follows below; the remaining scenarios in this article cover spatial approximation.
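As a quick illustration of a temporal tolerance, `pandas.merge_asof` can pair each row in `df1` with the nearest-in-time row in `df2` within a chosen window while still requiring exact coordinates. This is a minimal sketch, assuming the example frames above and a hypothetical 2-minute tolerance; it is not the only way to do this.

```python
# Minimal sketch: exact spatial match, nearest-in-time match within a tolerance.
# merge_asof requires both frames to be sorted on the 'on' key.
df1_sorted = df1.sort_values('time_coverage')
df2_sorted = df2.sort_values('time_coverage')

df3_tol = pd.merge_asof(
    df1_sorted, df2_sorted,
    on='time_coverage',              # nearest match on this key
    by=['lat', 'lon'],               # still require identical coordinates
    tolerance=pd.Timedelta('2min'),  # hypothetical tolerance; drop matches farther apart in time
    direction='nearest'
)
print(df3_tol)
```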
Approximate Spatial Merge Using H3
In the previous scenario, we assumed that the two datasets (`df1` and `df2`) share the same spatial resolution or footprint. If the latitude, longitude, and time match exactly, we can use a basic `merge()` on these keys. But what if the locations are close but not exactly the same?
When lat/lon don't match exactly but are nearby, you can index both datasets at a common H3 resolution and merge on the resulting cell IDs.
What is H3? H3 is a hexagonal hierarchical geospatial indexing system developed by Uber. It converts a latitude/longitude pair to a unique hexagon ID at a chosen resolution (average cell areas range from millions of km² at resolution 0 down to under 1 m² at resolution 15), enabling:
- Spatial binning
- Fast joins
- Spatial aggregation and summarization
Code
```python
import h3
import pandas as pd

# Define H3 resolution (resolution 6 hexagons average roughly 36 km²;
# pick a higher resolution for finer cells)
resolution = 6

# Compute H3 index for each dataset
# (h3-py v3 API; in h3-py v4 the equivalent call is h3.latlng_to_cell)
df1['h3_index'] = df1.apply(lambda row: h3.geo_to_h3(row['lat'], row['lon'], resolution), axis=1)
df2['h3_index'] = df2.apply(lambda row: h3.geo_to_h3(row['lat'], row['lon'], resolution), axis=1)

# Optional: round time to a common granularity if needed (e.g., minute)
df1['time_rounded'] = df1['time_coverage'].dt.floor('min')
df2['time_rounded'] = df2['time_coverage'].dt.floor('min')

# Merge on h3_index and (optionally) time_rounded
df3 = pd.merge(df1, df2, on=['h3_index', 'time_rounded'], how='inner', suffixes=('_1', '_2'))
```
Example
```python
print(df3[['h3_index', 'time_rounded', 'product1_value', 'product2_value']])
```
Output
```
       h3_index        time_rounded  product1_value  product2_value
0  862a1072ffff 2023-10-01 10:00:00             0.1             1.1
1  862a10c4ffff 2023-10-01 10:05:00             0.2             1.2
```
When to Use H3?
- Datasets use slightly different lat/lon but have overlapping footprints
- You want to aggregate data spatially (e.g., average FRP per hex); see the aggregation sketch after this list
- You're working with large-scale Earth observation data, especially from different sources or resolutions
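To illustrate the aggregation use case, here is a minimal sketch that averages a value per H3 cell. It assumes the `h3_index` column computed in the code block above and uses `product1_value` from the sample data; an FRP column from a fire product would work the same way.

```python
# Minimal aggregation sketch: mean value per H3 cell
# (assumes the 'h3_index' column computed above; swap in your own value column)
hex_means = (
    df1.groupby('h3_index', as_index=False)['product1_value']
       .mean()
       .rename(columns={'product1_value': 'mean_value'})
)
print(hex_means)
```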
Tips
- Choose the H3 resolution based on the spatial uncertainty of your data (e.g., use resolution 6–8 for satellite swath pixels).
- You can extend this to temporal bins using `pd.Grouper`, `floor('15min')`, etc., as sketched below.
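For example, a coarser temporal bin can be applied before the merge. This is a minimal sketch assuming the frames from the H3 example above; the 15-minute window is an arbitrary choice.

```python
# Minimal sketch: bin timestamps to 15-minute windows before merging
# (assumes df1/df2 with 'h3_index' and 'time_coverage' from the H3 example)
df1['time_bin'] = df1['time_coverage'].dt.floor('15min')
df2['time_bin'] = df2['time_coverage'].dt.floor('15min')

df3_binned = pd.merge(df1, df2, on=['h3_index', 'time_bin'],
                      how='inner', suffixes=('_1', '_2'))

# Equivalent grouping view: aggregate one dataset per hex and 15-minute bin
per_bin = df1.groupby(
    ['h3_index', pd.Grouper(key='time_coverage', freq='15min')]
)['product1_value'].mean()
print(per_bin)
```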
Scenario 2: Approximate Matching Using Distance and Time Difference
When two datasets use different geolocation and/or time grids, a direct merge is no longer sufficient. Instead, we want to associate each observation in one dataset (`df1`) with its closest observation in another (`df2`), both spatially and temporally.

To do that, we'll compute:

- `distance`: the geodesic distance between two points (in meters)
- `time_diff`: the absolute time difference between the two timestamps (in seconds or minutes)

Let's use the `pyproj.Geod` class for precise geodesic distance and simple datetime subtraction for the time difference.
Sample DataFrames
```python
import pandas as pd
from pyproj import Geod

# Reference ellipsoid
geod = Geod(ellps="WGS84")

# Example: df1 (e.g., satellite A)
df1 = pd.DataFrame({
    'lat': [34.52, 34.65],
    'lon': [-120.33, -120.45],
    'time_coverage': pd.to_datetime([
        '2023-10-01T10:00:00Z',
        '2023-10-01T10:05:00Z'
    ]),
    'product1_value': [0.1, 0.2]
})

# Example: df2 (e.g., satellite B, coarser resolution)
df2 = pd.DataFrame({
    'lat': [34.51, 34.70],
    'lon': [-120.30, -120.47],
    'time_coverage': pd.to_datetime([
        '2023-10-01T10:01:30Z',
        '2023-10-01T10:08:00Z'
    ]),
    'product2_value': [1.1, 1.3]
})
```
Step-by-Step Matching Logic
We compute the pairwise distance and time difference between all `df1` and `df2` combinations. Then, for each row in `df1`, we find the closest observation in `df2` based on a chosen distance metric.
```python
import numpy as np

matches = []

for i, row1 in df1.iterrows():
    best_match = None
    min_score = np.inf  # Customize this if you want to weigh time vs distance

    for j, row2 in df2.iterrows():
        # Geodesic distance (meters)
        _, _, distance = geod.inv(row1['lon'], row1['lat'], row2['lon'], row2['lat'])

        # Time difference (seconds)
        time_diff = abs((row1['time_coverage'] - row2['time_coverage']).total_seconds())

        # You can combine distance and time with a weighted score if needed
        score = distance + time_diff  # naive sum; adjust weights if necessary

        if score < min_score:
            min_score = score
            best_match = {
                'df1_index': i,
                'df2_index': j,
                'distance': distance,
                'time_diff_sec': time_diff,
                'product1_value': row1['product1_value'],
                'product2_value': row2['product2_value']
            }

    matches.append(best_match)

# Convert match list to DataFrame
df_matches = pd.DataFrame(matches)
```
Example
```python
print(df_matches[['df1_index', 'df2_index', 'distance', 'time_diff_sec', 'product1_value', 'product2_value']])
```
Output
```
   df1_index  df2_index    distance  time_diff_sec  product1_value  product2_value
0          0          0  2711.04689           90.0             0.1             1.1
1          1          1  2709.92131          180.0             0.2             1.3
```
Notes
- The `distance` is computed along the Earth's ellipsoid, which is more accurate than a simple Euclidean distance.
- The `time_diff` is in seconds; you can convert it to minutes if preferred.
- This method is computationally expensive for large datasets (`O(N*M)`); in the next step, we'll optimize it using spatial data structures like KDTree and BallTree.
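One refinement worth noting: the naive score in the loop above adds meters to seconds, so whichever quantity happens to be numerically larger dominates. A common approach is to normalize each term by a tolerance before combining them. The sketch below shows one possible weighting; `max_dist_m` and `max_time_s` are hypothetical tolerances you would tune for your data.

```python
# One possible weighting sketch: normalize distance and time by chosen tolerances
# (max_dist_m and max_time_s are hypothetical values, not part of the original example)
max_dist_m = 5000.0   # spatial tolerance in meters
max_time_s = 300.0    # temporal tolerance in seconds

def combined_score(distance_m, time_diff_s):
    """Return a unitless score; each term equals 1 when it exactly hits its tolerance."""
    return distance_m / max_dist_m + time_diff_s / max_time_s

# Inside the loop above you would replace the naive sum with:
# score = combined_score(distance, time_diff)
```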
Scenario 3: Efficient Nearest-Neighbor Matching Using KDTree
The pairwise matching approach in Scenario 2 becomes computationally expensive when working with large Earth observation datasets. Instead of computing distances between all pairs, we can use scalable nearest-neighbor search algorithms.
Why KDTree?
- `scipy.spatial.KDTree` or `sklearn.neighbors.BallTree` can quickly find the nearest observation in a second dataset.
- When spatial coordinates are dense and uniformly distributed, KDTree is very efficient.
- BallTree is better suited to angular (haversine) distances on a sphere; we'll demonstrate both options.
Step 3A: KDTree for Small-Scale/Planar Coordinates
When the data spans a small area (local scale), converting latitude and longitude to a Cartesian projection (like UTM or simple x/y meters) is acceptable.
```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

# Assume df1 and df2 from the previous steps

# Convert lat/lon to approximate meters (equirectangular projection for simplicity)
def latlon_to_xy(lat, lon):
    R = 6371000  # Earth radius in meters
    x = np.radians(lon) * R * np.cos(np.radians(lat))
    y = np.radians(lat) * R
    return np.column_stack((x, y))

# Prepare coordinates
coords1 = latlon_to_xy(df1['lat'].values, df1['lon'].values)
coords2 = latlon_to_xy(df2['lat'].values, df2['lon'].values)

# Build KDTree on df2
tree = cKDTree(coords2)

# Query nearest neighbor for each df1 point
distances, indices = tree.query(coords1, k=1)  # You can set distance_upper_bound=...

# Combine into a new DataFrame
# (indices are positional, so use positional access rather than .loc)
df_nn = df1.copy()
df_nn['nearest_index'] = indices
df_nn['distance_m'] = distances
df_nn['matched_product2_value'] = df2['product2_value'].to_numpy()[indices]
df_nn['time_diff_sec'] = np.abs(
    (df1['time_coverage'].values - df2['time_coverage'].values[indices]) / np.timedelta64(1, 's')
)
```
Why Use 3D (x, y, z) Instead of 2D (x, y)?
Translating lat/lon to 3D Cartesian coordinates (x, y, z) can be more appropriate in many Earth observation applications, particularly at larger scales. The table below compares the two approaches:
| Aspect | 2D Projection (`latlon_to_xy`) | 3D Cartesian (`latlon_to_xyz`) |
|---|---|---|
| Accuracy | Approximate, planar (flat Earth) | Accurate over global scales (spherical Earth) |
| Use case | Local areas (small distances, <100 km) | Global matching or large areas |
| Projection distortion | Introduces error with increasing area | No distortion (unit sphere or Earth radius) |
| Efficiency | Slightly faster (fewer dimensions) | Slightly more costly, but still fast |
Recommended: 3D Cartesian Coordinates for Global Matching
Here's how to modify your code:
Step 1: Convert lat/lon to 3D Cartesian Coordinates
```python
def latlon_to_xyz(lat, lon):
    R = 6371000  # Earth radius in meters
    lat_rad = np.radians(lat)
    lon_rad = np.radians(lon)
    x = R * np.cos(lat_rad) * np.cos(lon_rad)
    y = R * np.cos(lat_rad) * np.sin(lon_rad)
    z = R * np.sin(lat_rad)
    return np.column_stack((x, y, z))
```
Step 2: Build KDTree and Match
```python
# Prepare 3D coordinates
coords1 = latlon_to_xyz(df1['lat'].values, df1['lon'].values)
coords2 = latlon_to_xyz(df2['lat'].values, df2['lon'].values)

# Build KDTree on df2
tree = cKDTree(coords2)

# Query nearest neighbors
distances, indices = tree.query(coords1, k=1)

# Add to result DataFrame (indices are positional)
df_nn = df1.copy()
df_nn['nearest_index'] = indices
df_nn['distance_m'] = distances
df_nn['matched_product2_value'] = df2['product2_value'].to_numpy()[indices]
df_nn['time_diff_sec'] = np.abs(
    (df1['time_coverage'].values - df2['time_coverage'].values[indices]) / np.timedelta64(1, 's')
)
```
Note on Interpretation:
- The 3D Euclidean distance in Cartesian coordinates is the straight-line chord through the Earth, not the great-circle (haversine) arc. The two are monotonically related, so nearest-neighbor rankings are preserved, and for nearby points the values are nearly identical; either way it is more accurate than a planar (x, y) projection over large areas.
- If you need great-circle distances, you can use `BallTree` with the haversine metric, but you'll have to convert lat/lon to radians.
Step 3B: BallTree for Global/Geodesic Coordinates (Haversine Distance)
When working globally, it's better to use `BallTree` with the haversine distance. This requires converting lat/lon to radians.
```python
from sklearn.neighbors import BallTree

# Convert lat/lon to radians
radians1 = np.radians(df1[['lat', 'lon']].values)
radians2 = np.radians(df2[['lat', 'lon']].values)

# BallTree expects [lat, lon] in radians
tree = BallTree(radians2, metric='haversine')

# Query the nearest neighbor (distances are returned in radians)
dist_rad, indices = tree.query(radians1, k=1)

# Convert haversine distance to meters
earth_radius = 6371000  # in meters
dist_m = dist_rad[:, 0] * earth_radius

# Build result DataFrame (indices are positional)
nearest = indices[:, 0]
df_nn = df1.copy()
df_nn['nearest_index'] = nearest
df_nn['distance_m'] = dist_m
df_nn['matched_product2_value'] = df2['product2_value'].to_numpy()[nearest]
df_nn['time_diff_sec'] = np.abs(
    (df1['time_coverage'].values - df2['time_coverage'].values[nearest]) / np.timedelta64(1, 's')
)
```
Example
```python
print(df_nn[['lat', 'lon', 'distance_m', 'time_diff_sec', 'product1_value', 'matched_product2_value']])
```
Output:
```
     lat     lon  distance_m  time_diff_sec  product1_value  matched_product2_value
0  34.52 -120.33    2711.046           90.0             0.1                     1.1
1  34.65 -120.45    2709.921          180.0             0.2                     1.3
```
Things to Consider
- `BallTree` is preferred for global or high-latitude data.
- Both methods assume a 1:1 nearest-neighbor match, but you can extend this to radius-based matching and find all matches within a distance using `.query_radius()` (see the sketch below).
- You can apply a temporal tolerance filter afterward to ensure both spatial and temporal proximity.
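To illustrate those last two points, here is a minimal sketch, assuming the `tree` and `radians1` built in the BallTree example above: it collects all `df2` candidates within a spatial radius of each `df1` point and then keeps only those within a temporal tolerance. The 5 km radius and 10-minute tolerance are hypothetical values; choose them to match your data's uncertainties.

```python
# Minimal sketch: radius-based matching plus a temporal tolerance filter
# (assumes 'tree', 'radians1', df1 and df2 from the BallTree example above;
#  the 5 km radius and 10-minute tolerance are hypothetical choices)
earth_radius = 6371000               # meters
radius_rad = 5000 / earth_radius     # 5 km expressed in radians on the unit sphere
max_time_diff = pd.Timedelta('10min')

candidate_lists = tree.query_radius(radians1, r=radius_rad)

pairs = []
for i, candidates in enumerate(candidate_lists):
    for j in candidates:
        time_diff = abs(df1['time_coverage'].iloc[i] - df2['time_coverage'].iloc[j])
        if time_diff <= max_time_diff:
            pairs.append({'df1_index': i, 'df2_index': j,
                          'time_diff_sec': time_diff.total_seconds()})

df_pairs = pd.DataFrame(pairs)
print(df_pairs)
```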