As part of a research project on Canadian wildfires, I was looking for a way to retrieve a list of all file names stored in a particular directory within an AWS S3 bucket. As an example, I use the following NOAA directory.
Using boto3
First, create an instance of the boto3 resource object, which is used to interact with the AWS S3 service. Since the NOAA bucket is public, we can make unsigned (anonymous) requests, so no AWS credentials are needed:
import boto3
from botocore import UNSIGNED
from botocore.config import Config
s3 = boto3.resource('s3', config=Config(signature_version=UNSIGNED))
Next, retrieve a reference to the bucket in which your files are located. You can do this by specifying the name of the bucket:
bucket_name = "noaa-nesdis-n21-pds"
s3_bucket = s3.Bucket(bucket_name)
Now, we can use s3_bucket.objects.filter(Prefix=folder) to iterate over all objects whose keys start with the given prefix. A list comprehension collects the object keys:
folder = 'VIIRS-IMG-GEO-TC/2023/05/30/'
files_in_s3 = [f.key for f in s3_bucket.objects.filter(Prefix=folder)]
In this example,
len(files_in_s3)
gives
1013
and
files_in_s3[:10]
returns
['VIIRS-IMG-GEO-TC/2023/05/30/GITCO_j02_d20230530_t0000174_e0001403_b02847_c20230530001703169412_oeac_ops.h5',
'VIIRS-IMG-GEO-TC/2023/05/30/GITCO_j02_d20230530_t0001416_e0003062_b02847_c20230530001707439275_oeac_ops.h5',
'VIIRS-IMG-GEO-TC/2023/05/30/GITCO_j02_d20230530_t0003075_e0004321_b02847_c20230530001731082801_oeac_ops.h5',
'VIIRS-IMG-GEO-TC/2023/05/30/GITCO_j02_d20230530_t0004334_e0005562_b02847_c20230530001744657321_oeac_ops.h5',
'VIIRS-IMG-GEO-TC/2023/05/30/GITCO_j02_d20230530_t0005575_e0007222_b02847_c20230530002443133327_oeac_ops.h5',
'VIIRS-IMG-GEO-TC/2023/05/30/GITCO_j02_d20230530_t0007234_e0008481_b02847_c20230530003708469736_oeac_ops.h5',
'VIIRS-IMG-GEO-TC/2023/05/30/GITCO_j02_d20230530_t0008493_e0010122_b02847_c20230530003658835766_oeac_ops.h5',
'VIIRS-IMG-GEO-TC/2023/05/30/GITCO_j02_d20230530_t0010134_e0011381_b02847_c20230530003510984930_oeac_ops.h5',
'VIIRS-IMG-GEO-TC/2023/05/30/GITCO_j02_d20230530_t0011393_e0013040_b02847_c20230530003512044584_oeac_ops.h5',
'VIIRS-IMG-GEO-TC/2023/05/30/GITCO_j02_d20230530_t0013052_e0014281_b02847_c20230530003524393947_oeac_ops.h5']
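The keys returned above include the full directory prefix. If you only want the bare file names, you can split each key on its last "/". A small sketch using plain string handling (the sample keys below are taken from the listing above):

```python
# Keys as returned by the listing above (truncated sample).
keys = [
    'VIIRS-IMG-GEO-TC/2023/05/30/GITCO_j02_d20230530_t0000174_e0001403_b02847_c20230530001703169412_oeac_ops.h5',
    'VIIRS-IMG-GEO-TC/2023/05/30/GITCO_j02_d20230530_t0001416_e0003062_b02847_c20230530001707439275_oeac_ops.h5',
]

# rsplit("/", 1) splits once from the right, so [-1] is the part after the last "/".
file_names = [key.rsplit("/", 1)[-1] for key in keys]
print(file_names[0])
```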
By following these steps, you can list all file names in an AWS S3 bucket directory using Python, which makes it easy to keep track of the files stored under a given prefix.
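As an alternative to the resource API, the same listing can be done with boto3's low-level client and its list_objects_v2 paginator, which makes the 1,000-keys-per-page limit explicit. This is a minimal sketch against the same public bucket; keys_from_pages is a small helper introduced here for illustration, not part of boto3 (the boto3 imports are kept inside the main guard so the helper can be used on its own):

```python
def keys_from_pages(pages):
    """Flatten the 'Contents' entries of list_objects_v2 pages into a list of keys."""
    return [obj["Key"] for page in pages for obj in page.get("Contents", [])]

if __name__ == "__main__":
    # Imports for the network call; anonymous access, since the NOAA bucket is public.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    s3_client = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    paginator = s3_client.get_paginator("list_objects_v2")
    pages = paginator.paginate(
        Bucket="noaa-nesdis-n21-pds",
        Prefix="VIIRS-IMG-GEO-TC/2023/05/30/",
    )
    print(len(keys_from_pages(pages)))
```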
References
| Links | Site |
|---|---|
| boto3 | pypi.org |