Introduction
When working with publicly accessible data on AWS S3, such as NOAA environmental satellite products, it's often useful to programmatically list either all the files or subdirectories within a specific path (also called a "prefix") of a bucket.
In this guide, we show how to use Python and the boto3 library to retrieve:
- All file names under a specified prefix
- All "folder" names (i.e., common prefixes) under a given path
As an example, we'll use NOAA's public noaa-nesdis-snpp-pds bucket, which contains a wide variety of satellite products.
Installation
Install boto3 if you haven't already:
```
pip install boto3
```
List All Files in a Given Directory
Use the S3 resource interface and filter by a prefix to retrieve the file keys under a "directory". The example keeps only keys ending in .h5; drop that condition to list every key:
```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Initialize anonymous S3 resource
s3 = boto3.resource('s3', config=Config(signature_version=UNSIGNED))

bucket_name = "noaa-nesdis-snpp-pds"
folder = "VIIRS-IMG-GEO-TC/2023/05/30/"  # Example prefix

s3_bucket = s3.Bucket(bucket_name)

# List all .h5 files under the prefix
files_in_s3 = [
    obj.key for obj in s3_bucket.objects.filter(Prefix=folder)
    if obj.key.endswith(".h5")
]

print(f"Found {len(files_in_s3)} files")
print("First 10 files:")
for f in files_in_s3[:10]:
    print(f)
```
List All Subdirectories (Folders)
If you'd rather get a list of all top-level folders (or subfolders under a given prefix), use the S3 client interface with the Delimiter parameter:
```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous S3 client
s3_client = boto3.client('s3', config=Config(signature_version=UNSIGNED))

bucket_name = "noaa-nesdis-snpp-pds"
prefix = ""  # Root level; change to e.g., 'VIIRS-IMG-GEO-TC/2023/' to go deeper

response = s3_client.list_objects_v2(
    Bucket=bucket_name,
    Prefix=prefix,
    Delimiter="/"  # Important: groups by "folders"
)

# Extract folder names
folders = [cp['Prefix'] for cp in response.get('CommonPrefixes', [])]

print("Folders found:")
for folder in folders:
    print(folder)
```
Example Output
Running the above folder-listing code on the root of the noaa-nesdis-snpp-pds bucket gives:
```
Folders found:
ATMS-SCIENCE-RDR/
ATMS-SDR-GEO/
ATMS-SDR/
ATMS-SFR/
ATMS-TDR/
ATMS_BUFR/
CRIS-SCIENCE-RDR/
CrIS-FS-SDR/
CrIS-SDR-GEO/
GRIDDED_VIIRS_LSA_DLY/
...
VIIRS-MOD-GEO-TC/
VIIRS-NCC-EDR/
VIIRS_SurfaceReflectance_EDR/
VIIRS_VFM_MWS_MOSAIC/
VI_BWKL_GLB/
```
These are the "subdirectories" or data product categories under the top-level prefix.
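To build intuition for what the delimiter does, the grouping can be emulated locally without calling S3. The sketch below is a simplified model, not boto3 code: the helper name `common_prefixes` and the sample keys are hypothetical, chosen only to resemble the bucket layout.

```python
def common_prefixes(keys, prefix="", delimiter="/"):
    """Simplified local model of how S3 rolls keys up into common prefixes.

    Returns (folders, files): keys that contain the delimiter after the
    prefix are collapsed into one "folder" entry; the rest are plain files.
    """
    folders = set()
    files = []
    for key in keys:
        if not key.startswith(prefix):
            continue  # key is outside the requested prefix
        rest = key[len(prefix):]
        if delimiter in rest:
            # Keep everything up to and including the first delimiter
            folders.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            files.append(key)
    return sorted(folders), files


# Hypothetical keys, for illustration only
sample_keys = [
    "ATMS-SDR/2023/a.h5",
    "ATMS-SDR/2024/b.h5",
    "ATMS-TDR/2023/c.h5",
    "readme.txt",
]

root_folders, root_files = common_prefixes(sample_keys)
sdr_folders, sdr_files = common_prefixes(sample_keys, prefix="ATMS-SDR/")
```

With an empty prefix the two product names come back as folders and `readme.txt` as a file; with `prefix="ATMS-SDR/"` the year levels come back instead, which mirrors what changing `prefix` does in the real call.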
Notes:
- Delimiter="/" tells S3 to group keys by "directory levels".
- CommonPrefixes in the response contains the "subfolder names" under your prefix.
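- A single list_objects_v2 call returns at most 1,000 entries. If a prefix has more folders than that, the response has IsTruncated set to true; pass its NextContinuationToken as ContinuationToken in the next call to fetch the rest.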
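The continuation-token loop described in the notes might look like the following minimal sketch. The function name `list_all_folders` is hypothetical; it assumes it is handed a client such as the `s3_client` created earlier.

```python
def list_all_folders(client, bucket, prefix=""):
    """Collect every common prefix under `prefix`, following continuation tokens."""
    folders = []
    kwargs = {"Bucket": bucket, "Prefix": prefix, "Delimiter": "/"}
    while True:
        response = client.list_objects_v2(**kwargs)
        folders.extend(cp["Prefix"] for cp in response.get("CommonPrefixes", []))
        # IsTruncated signals that more results are waiting
        if not response.get("IsTruncated"):
            break
        # Ask for the next page on the following iteration
        kwargs["ContinuationToken"] = response["NextContinuationToken"]
    return folders
```

Calling `list_all_folders(s3_client, bucket_name)` should return the same list as the single-call version whenever everything fits in one page, and keeps going when it does not.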
Tips and Troubleshooting
- Public Buckets: Ensure you're using Config(signature_version=UNSIGNED) for public, unauthenticated access.
- Prefix Format: Always include a trailing slash (/) in prefixes when simulating folders.
- Pagination: If there are more than 1000 objects or folders, use ContinuationToken to handle pagination (boto3's paginator can help).
- Check Bucket Access: If access is denied, double-check that you're accessing a public bucket or that your AWS credentials are set up.
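Two of these tips, the trailing slash and the paginator, can be combined in one small helper. This is a sketch under assumptions rather than a definitive implementation: `normalize_prefix` and `iter_keys` are hypothetical names, and the client passed in is expected to be an S3 client like the anonymous one created earlier.

```python
def normalize_prefix(prefix):
    """Append a trailing slash to a non-empty prefix that lacks one."""
    if prefix and not prefix.endswith("/"):
        return prefix + "/"
    return prefix


def iter_keys(client, bucket, prefix=""):
    """Yield every object key under `prefix`, one page at a time."""
    # The paginator issues follow-up requests with continuation tokens for us
    paginator = client.get_paginator("list_objects_v2")
    pages = paginator.paginate(Bucket=bucket, Prefix=normalize_prefix(prefix))
    for page in pages:
        # A page with no matching objects has no "Contents" entry
        for obj in page.get("Contents", []):
            yield obj["Key"]
```

Without the trailing slash, a prefix such as `VIIRS-IMG-GEO` would also match keys under `VIIRS-IMG-GEO-TC/`, because S3 prefixes are plain string prefixes rather than directory names. For example, `list(iter_keys(s3_client, bucket_name, "VIIRS-IMG-GEO-TC/2023/05/30"))` restricts the listing to that one "directory".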
References
| Resource | Link |
|---|---|
| Boto3 Docs | https://boto3.amazonaws.com/v1/documentation/api/latest/index.html |
| NOAA SNPP S3 Bucket | https://noaa-nesdis-snpp-pds.s3.amazonaws.com/index.html |
| Boto3 list_objects_v2 | Docs |
