Introduction
Global geospatial datasets are commonly used in Earth observation, climate science, and environmental monitoring. However, many projects require focusing on a specific region such as the United States.
This article presents a practical workflow to filter a global dataset and retain only observations located within U.S. territory using GeoPandas. The example is based on a real use case involving VIIRS fire data aggregated using H3 spatial indexing.
GeoPandas DataFrame
Use case
A global dataset is available as a Pandas DataFrame containing H3 indices. The objective is to extract only observations located within U.S. territory for further analysis.
The dataset used in this example is an annual aggregation of VIIRS EFIRE thermal anomaly data.
Load dataset
1 2 3 4 5 6 | import pandas as pd import geopandas as gpd df_viirs_annual_agg = pd.read_parquet( f'/Volumes/HD15TB/Datasets/VIIRS_EFIRE_Thermal_Anomaly_Frequency_Distribution/{platform}/{year}_EFIRE_Thermal_Anomaly_Frequency_Distribution.parquet' ) |
Convert H3 indices to latitude / longitude
The dataset contains H3 cells but no geographic coordinates. Conversion to latitude and longitude is required.
1 2 3 4 | import h3 df_viirs_annual_agg['lat'] = df_viirs_annual_agg['h3_cell'].apply(lambda h: h3.cell_to_latlng(h)[0]) df_viirs_annual_agg['lon'] = df_viirs_annual_agg['h3_cell'].apply(lambda h: h3.cell_to_latlng(h)[1]) |
Convert to GeoDataFrame
1 2 3 4 5 | gdf = gpd.GeoDataFrame( df_viirs_annual_agg, geometry=gpd.points_from_xy(df_viirs_annual_agg.lon, df_viirs_annual_agg.lat), crs="EPSG:4326" ) |

At this stage, the dataset is geospatially enabled and ready for spatial operations.
Load U.S. Boundary Polygon
Using Natural Earth (built-in dataset)
1 2 3 | world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres")) usa = world[world['name'] == 'United States of America'] |
Using higher-resolution Natural Earth data
For more accurate boundaries (recommended for research applications), download higher-resolution data.
Download source:
Example using a local shapefile:
1 2 3 4 5 | world = gpd.read_file( "/Volumes/HD15TB/Datasets/Natural_Earth_Data/ne_110m_admin_0_countries/ne_110m_admin_0_countries.shp" ) usa = world[world['ADMIN'] == 'United States of America'] |
Important notes
Included by default:
- Continental U.S.
- Alaska
- Hawaii
Potential limitations:
- Overseas territories (Puerto Rico, Guam, etc.) may be missing depending on dataset resolution
For full coverage, use:
1 2 | world = gpd.read_file("ne_10m_admin_0_countries.shp") usa = world[world['ADMIN'] == 'United States of America'] |
Spatial Filtering (Point-in-Polygon)
The filtering step uses a geometric containment test.
1 | gdf_us = gdf[gdf.within(usa.unary_union)] |
This operation keeps only points located inside the U.S. polygon.
Performance note
For large datasets (millions of points), consider:
- Using spatial indexing (
rtreeorpygeos) - Pre-filtering with bounding boxes
- Converting H3 cells to polygons for higher accuracy near borders
Visual Validation with Folium
A map is useful to verify that the filtering behaves as expected.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | import folium usa = usa.to_crs(epsg=4326) m = folium.Map(location=[39.5, -98.5], zoom_start=4, tiles="CartoDB positron") folium.GeoJson( usa, name="United States", style_function=lambda feature: { "fillColor": "blue", "color": "blue", "weight": 2, "fillOpacity": 0.2, }, tooltip=folium.GeoJsonTooltip(fields=["ADMIN"], aliases=["Country:"]) ).add_to(m) folium.LayerControl().add_to(m) m |

Optional: overlay filtered points
1 2 3 4 5 6 7 8 | for _, row in gdf_us.sample(10000).iterrows(): folium.CircleMarker( location=[row["lat"], row["lon"]], radius=2, color="red", fill=True, fill_opacity=0.7 ).add_to(m) |

Faster Alternative (Approximate Filtering)
A simple bounding box can be used for quick filtering.
1 2 3 4 | df_us = df_viirs_annual_agg[ (df_viirs_annual_agg['lat'] >= 24) & (df_viirs_annual_agg['lat'] <= 72) & (df_viirs_annual_agg['lon'] >= -170) & (df_viirs_annual_agg['lon'] <= -65) ] |
Limitations
- Includes ocean regions
- Includes parts of Canada and Mexico
- Not suitable for scientific analysis
Best Practices
- Use GeoPandas spatial filtering for accuracy
- Use high-resolution Natural Earth data for better boundaries
- Use H3 polygon intersection instead of centroids for precise edge handling
- Cache results (
is_usflag) to avoid recomputation
References
| Topic | Link |
|---|---|
| GeoPandas documentation | https://geopandas.org |
| Natural Earth datasets | https://www.naturalearthdata.com/downloads/ |
| H3 documentation | https://h3geo.org/docs |
| Folium documentation | https://python-visualization.github.io/folium/ |
| Shapely (geometry operations) | https://shapely.readthedocs.io |
