How to get all file names in a specific AWS S3 bucket directory using Python?

Published: August 22, 2023

Tags: Python, AWS, boto3


As part of a research project about Canada wildfires (details can be found here), I was looking for a way to retrieve a list of all the file names stored in a particular directory within an AWS S3 bucket. As an example, I will use a directory from NOAA's public noaa-nesdis-n21-pds bucket.


Using boto3

First, create an instance of the boto3 resource object. This object is used to interact with the AWS S3 service. Since the NOAA bucket used in this example is public, we configure botocore to send unsigned (anonymous) requests, so no AWS credentials are needed:

import boto3

from botocore import UNSIGNED
from botocore.config import Config

# Unsigned requests work here because the bucket is publicly readable
s3 = boto3.resource('s3', config=Config(signature_version=UNSIGNED))
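
If your bucket is private instead, drop the UNSIGNED configuration and let boto3 pick up your AWS credentials as usual (from environment variables, ~/.aws/credentials, or an attached IAM role):

# For a private bucket, rely on your configured AWS credentials
s3 = boto3.resource('s3')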

Next, retrieve a reference to the bucket in which your files are located. You can do this by specifying the name of the bucket:

bucket_name = "noaa-nesdis-n21-pds"

s3_bucket = s3.Bucket(bucket_name)
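
Optionally, you can verify that the bucket exists and is reachable before listing anything. Here is a small sanity-check sketch using the underlying client's head_bucket call, which raises a ClientError if the bucket is missing or inaccessible:

from botocore.exceptions import ClientError

try:
    # head_bucket succeeds only if the bucket exists and we can access it
    s3.meta.client.head_bucket(Bucket=bucket_name)
except ClientError as e:
    print(f"Bucket {bucket_name} is not accessible: {e}")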

Now, we can use s3_bucket.objects.filter(Prefix=folder).all() to iterate over all the objects whose keys start with a given prefix. This returns an iterable collection of object summaries, and boto3 transparently fetches additional pages as you iterate. You can use a Python list comprehension to collect each object's key:

folder = 'VIIRS-IMG-GEO-TC/2023/05/30/'

files_in_s3 = [f.key for f in s3_bucket.objects.filter(Prefix=folder).all()]

In this example,

len(files_in_s3)

gives

1013

and

files_in_s3[:10]

returns

['VIIRS-IMG-GEO-TC/2023/05/30/GITCO_j02_d20230530_t0000174_e0001403_b02847_c20230530001703169412_oeac_ops.h5',
 'VIIRS-IMG-GEO-TC/2023/05/30/GITCO_j02_d20230530_t0001416_e0003062_b02847_c20230530001707439275_oeac_ops.h5',
 'VIIRS-IMG-GEO-TC/2023/05/30/GITCO_j02_d20230530_t0003075_e0004321_b02847_c20230530001731082801_oeac_ops.h5',
 'VIIRS-IMG-GEO-TC/2023/05/30/GITCO_j02_d20230530_t0004334_e0005562_b02847_c20230530001744657321_oeac_ops.h5',
 'VIIRS-IMG-GEO-TC/2023/05/30/GITCO_j02_d20230530_t0005575_e0007222_b02847_c20230530002443133327_oeac_ops.h5',
 'VIIRS-IMG-GEO-TC/2023/05/30/GITCO_j02_d20230530_t0007234_e0008481_b02847_c20230530003708469736_oeac_ops.h5',
 'VIIRS-IMG-GEO-TC/2023/05/30/GITCO_j02_d20230530_t0008493_e0010122_b02847_c20230530003658835766_oeac_ops.h5',
 'VIIRS-IMG-GEO-TC/2023/05/30/GITCO_j02_d20230530_t0010134_e0011381_b02847_c20230530003510984930_oeac_ops.h5',
 'VIIRS-IMG-GEO-TC/2023/05/30/GITCO_j02_d20230530_t0011393_e0013040_b02847_c20230530003512044584_oeac_ops.h5',
 'VIIRS-IMG-GEO-TC/2023/05/30/GITCO_j02_d20230530_t0013052_e0014281_b02847_c20230530003524393947_oeac_ops.h5']
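
If you only want the bare file names without the directory prefix, one simple option (a small sketch, assuming every key starts with folder) is to slice the prefix off each key:

# Strip the common directory prefix, keeping just the file names
file_names = [key[len(folder):] for key in files_in_s3]

print(file_names[0])
# GITCO_j02_d20230530_t0000174_e0001403_b02847_c20230530001703169412_oeac_ops.h5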

By following these steps, you'll be able to get all the file names in an AWS S3 bucket directory using Python. This is a handy way to keep track of which files are stored under a given prefix, or to build a download list for further processing.
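
As a final aside, bucket.objects.filter transparently handles S3's pagination (the API returns at most 1,000 keys per request). If you prefer the lower-level client API, here is a sketch of the same listing using a list_objects_v2 paginator:

import boto3
from botocore import UNSIGNED
from botocore.config import Config

client = boto3.client('s3', config=Config(signature_version=UNSIGNED))
paginator = client.get_paginator('list_objects_v2')

keys = []
for page in paginator.paginate(Bucket=bucket_name, Prefix=folder):
    # 'Contents' is absent from pages that match no keys
    for obj in page.get('Contents', []):
        keys.append(obj['Key'])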

References

boto3 on pypi.org: https://pypi.org/project/boto3/