Introduction
Filenames often contain far more information than just a label. In scientific computing, Earth-observation (EO) workflows, log processing, and automated pipelines, filenames frequently embed: (1) Timestamps, (2) Product or sensor IDs, (3) Platform or satellite identifiers, (4) Processing levels, (5) Versions, (6) Location codes or unique IDs.
Table of contents
- Introduction
- Extracting the Filename and Extension
- Removing Known Prefixes
- Splitting the Filename Into Components
- Parsing Encoded Dates into Python datetime
- When Basic String Methods Are Not Enough
- Extracting Information Using Regular Expressions (re)
- Full Example: Parsing a GOES Filename
- When to Use What?
- References
Python provides a broad set of tools to help with this:
- Basic string methods (split, slicing, .removeprefix()…)
- Path utilities (os.path, pathlib)
- Regular expressions (re)
- The datetime module
- Full custom parsers
In this tutorial, we start with the basics and build up to a real-world example: parsing GOES ABI Level-2 filenames such as:

```
OR_ABI-L2-FDCC-M6_G18_s20250090526174_e20250090528547_c20250090529107.nc
```
Extracting the Filename and Extension
The easiest way is with os.path:
```python
import os

filename = "OR_ABI-L2-FDCC-M6_G18_s20250090526174_e20250090528547_c20250090529107.nc"

basename = os.path.basename(filename)
stem, ext = os.path.splitext(basename)

print(stem)  # filename without extension
print(ext)   # .nc
```
Or with pathlib:
```python
from pathlib import Path

file = Path(filename)

file.stem    # 'OR_ABI-L2-FDCC-M6_G18_s20250090526174_e20250090528547_c20250090529107'
file.suffix  # '.nc'
file.name    # full filename
```
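One case the snippets above do not cover is a compound extension, for example a compressed copy of the file. The sketch below uses a hypothetical path (the /data/goes directory and the .gz compression are assumptions for illustration) to show how suffix, suffixes, and os.path.splitext behave:

```python
import os
from pathlib import Path

# Hypothetical path: the directory and the ".gz" compression are illustrative assumptions.
path = Path(
    "/data/goes/OR_ABI-L2-FDCC-M6_G18_s20250090526174_e20250090528547_c20250090529107.nc.gz"
)

last_suffix = path.suffix     # only the final extension
all_suffixes = path.suffixes  # every extension, in order

# os.path.splitext also splits off only the final extension.
root, ext = os.path.splitext(path.name)

# Strip every extension by cutting the joined suffixes off the end of the name.
bare_name = path.name[: -len("".join(all_suffixes))]

print(last_suffix)   # .gz
print(all_suffixes)  # ['.nc', '.gz']
print(ext)           # .gz
print(bare_name)     # OR_ABI-L2-FDCC-M6_G18_s20250090526174_e20250090528547_c20250090529107
```

If compressed files are part of your workflow, stripping all suffixes before parsing keeps the rest of the logic unchanged.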
Removing Known Prefixes
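Python 3.9 added the string method .removeprefix():

```python
clean = stem.removeprefix("OR_")
print(clean)
```

This is cleaner than slicing by hand and safer than replace, which would remove every occurrence of "OR_" in the name, not just the leading one. If the prefix is absent, .removeprefix() returns the string unchanged.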
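To make the difference concrete, here is a small comparison on two made-up names (SENSOR_A and ORBIT-L2 are invented for illustration, not real GOES products). It also includes a hypothetical strip_prefix helper as a fallback for interpreters older than 3.9:

```python
def strip_prefix(text, prefix):
    """Fallback for Python < 3.9: drop `prefix` only if `text` starts with it."""
    return text[len(prefix):] if text.startswith(prefix) else text

name_a = "OR_SENSOR_A"  # invented name that contains "OR_" twice
name_b = "OR_ORBIT-L2"  # invented name whose body starts with the letters O and R

print(name_a.removeprefix("OR_"))   # SENSOR_A
print(name_a.replace("OR_", ""))    # SENSA    (the inner "OR_" is removed too)
print(name_b.lstrip("OR_"))         # BIT-L2   (lstrip removes characters, not a prefix)
print(strip_prefix(name_b, "OR_"))  # ORBIT-L2
```

The lstrip line is a common trap: its argument is a set of characters to strip, so it keeps eating leading O, R, and _ characters past the intended prefix.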
Splitting the Filename Into Components
Simple filenames can be parsed with .split("_"):
```python
parts = clean.split("_")
parts
```

Result:

```
['ABI-L2-FDCC-M6', 'G18', 's20250090526174', 'e20250090528547', 'c20250090529107']
```

Assign variables:

```python
product = parts[0]
platform = parts[1]
start_code = parts[2]  # ex: 's20250090526174'
end_code = parts[3]
creation_code = parts[4]
```
This works great for predictable, strict formats.
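Index-based access raises a bare IndexError when a filename is malformed, so it can help to check the number of parts before unpacking. The sketch below also splits the product field on hyphens. The names prefix, instrument, level, product_name, and scan_mode are illustrative labels, and the four-part layout is an assumption that holds for this sample but may not hold for every GOES product.

```python
stem = "OR_ABI-L2-FDCC-M6_G18_s20250090526174_e20250090528547_c20250090529107"

parts = stem.split("_")
if len(parts) != 6:
    raise ValueError(f"Expected 6 underscore-separated fields, got {len(parts)}: {stem!r}")

# Tuple unpacking names every field in one step.
prefix, product, platform, start_code, end_code, creation_code = parts

# Assumed layout for this sample: instrument-level-product-mode.
instrument, level, product_name, scan_mode = product.split("-")

print(instrument, level, product_name, scan_mode)  # ABI L2 FDCC M6
print(platform)                                    # G18
```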
Parsing Encoded Dates into Python datetime
Each GOES timestamp is a 14-digit string that encodes:

```
YYYY + DDD + HHMM + SS + S
```

Where DDD is the day of year (001 to 366) and the final S is tenths of a second.
First, remove the leading identifier (s, e, c):
```python
start_str = start_code[1:]
end_str = end_code[1:]
creation_str = creation_code[1:]
```
Then convert:
```python
from datetime import datetime

def parse_goes_timestamp(ts):
    """Parse a 14-digit GOES timestamp; the trailing tenths digit is ignored."""
    year = ts[0:4]
    doy = ts[4:7]
    hour = ts[7:9]
    minute = ts[9:11]
    second = ts[11:13]

    # %j parses the zero-padded day of year
    return datetime.strptime(
        f"{year}-{doy} {hour}:{minute}:{second}",
        "%Y-%j %H:%M:%S"
    )

start_dt = parse_goes_timestamp(start_str)
end_dt = parse_goes_timestamp(end_str)
creation_dt = parse_goes_timestamp(creation_str)
```
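The function above reads only the first 13 digits, so the final digit is discarded. If sub-second precision matters, one option is to parse those 13 digits in a single strptime call and add the last digit separately. This is a sketch that assumes the input is always exactly 14 digits; the function name parse_goes_timestamp_tenths is hypothetical.

```python
from datetime import datetime, timedelta

def parse_goes_timestamp_tenths(ts):
    """Parse YYYYDDDHHMMSSs, keeping the trailing tenths-of-a-second digit."""
    if len(ts) != 14 or not ts.isdigit():
        raise ValueError(f"Expected 14 digits, got {ts!r}")
    # The fields are fixed-width and zero-padded, so no separators are needed.
    whole_seconds = datetime.strptime(ts[:13], "%Y%j%H%M%S")
    # The last digit counts tenths of a second, i.e. 100 ms each.
    return whole_seconds + timedelta(milliseconds=100 * int(ts[13]))

print(parse_goes_timestamp_tenths("20250090526174"))  # 2025-01-09 05:26:17.400000
```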
When Basic String Methods Are Not Enough
split() and slicing are great when:
- Format is fixed
- Everything is separated by known delimiters
- No optional fields or variable length
But what if:
- Some parts vary in length?
- Some fields are optional?
- You want to validate filenames?
- You need a more generalizable method?
This is where regular expressions (re) are ideal.
Extracting Information Using Regular Expressions (re)
Regular expressions let you define a pattern describing what the filename looks like.
Python’s re module supports:
- Named groups
- Optional fields
- Validation
- Capture of multiple numeric sequences
Example: Extract platform, product, timestamps
```python
import re
from pathlib import Path

filename = "OR_ABI-L2-FDCC-M6_G18_s20250090526174_e20250090528547_c20250090529107.nc"

pattern = (
    r"OR_(?P<product>[^_]+)_"   # ABI-L2-FDCC-M6
    r"(?P<platform>G\d+)_"      # G18
    r"s(?P<start>\d{14})_"      # sYYYYDDDHHMMSSs (14 digits)
    r"e(?P<end>\d{14})_"
    r"c(?P<creation>\d{14})"
)

match = re.match(pattern, Path(filename).stem)
match.groupdict()
```
Output:
```
{
    'product': 'ABI-L2-FDCC-M6',
    'platform': 'G18',
    'start': '20250090526174',
    'end': '20250090528547',
    'creation': '20250090529107'
}
```
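One caveat when a pattern doubles as a validator: re.match only anchors at the start of the string, so any text after the last group is silently accepted. The sketch below contrasts it with re.fullmatch on a deliberately corrupted stem (the _extra suffix is made up). It uses \d{14} because each timestamp in the sample filename is 14 digits long.

```python
import re

GOES_PATTERN = re.compile(
    r"OR_(?P<product>[^_]+)_"
    r"(?P<platform>G\d+)_"
    r"s(?P<start>\d{14})_"
    r"e(?P<end>\d{14})_"
    r"c(?P<creation>\d{14})"
)

good = "OR_ABI-L2-FDCC-M6_G18_s20250090526174_e20250090528547_c20250090529107"
bad = good + "_extra"  # made-up trailing text

print(GOES_PATTERN.match(bad) is not None)       # True: a prefix of the string matches
print(GOES_PATTERN.fullmatch(bad) is not None)   # False: the whole string must match
print(GOES_PATTERN.fullmatch(good) is not None)  # True
```

Compiling the pattern once with re.compile is also worthwhile when the same check runs over many filenames.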
Extracting numbers only
```python
re.findall(r"\d+", filename)
```
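Keep in mind that \d+ picks up every run of digits, including those inside the product and platform fields. A quick check on the sample filename shows the difference between the broad pattern and one restricted to 14-digit runs:

```python
import re

filename = "OR_ABI-L2-FDCC-M6_G18_s20250090526174_e20250090528547_c20250090529107.nc"

all_numbers = re.findall(r"\d+", filename)    # every run of digits
timestamps = re.findall(r"\d{14}", filename)  # only runs of exactly 14 digits here

print(all_numbers)
# ['2', '6', '18', '20250090526174', '20250090528547', '20250090529107']
print(timestamps)
# ['20250090526174', '20250090528547', '20250090529107']
```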
Optional fields example
```python
pattern = r"(?P<product>.+?)_(?P<platform>G\d+)(_v(?P<version>\d+))?"
```
This matches filenames with or without a _vXX version suffix.
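As a quick check, the same pattern can be run against two shortened names. Both names are invented for illustration, and the _v02 suffix is not part of the GOES filename shown earlier. When the optional group is absent, its named group comes back as None:

```python
import re

pattern = re.compile(r"(?P<product>.+?)_(?P<platform>G\d+)(_v(?P<version>\d+))?")

# Invented names: one without and one with a version suffix.
names = ["ABI-L2-FDCC_G18", "ABI-L2-FDCC_G18_v02"]

results = [pattern.fullmatch(name).groupdict() for name in names]
for fields in results:
    print(fields)
# {'product': 'ABI-L2-FDCC', 'platform': 'G18', 'version': None}
# {'product': 'ABI-L2-FDCC', 'platform': 'G18', 'version': '02'}
```

Downstream code should therefore handle a None version explicitly, for example by substituting a default.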
Full Example: Parsing a GOES Filename
Here’s a complete parser function that:
- Extracts filename info
- Removes prefixes
- Uses regex for validation
- Converts timestamps to datetime
- Returns a structured dictionary
Example:
```python
import re
from datetime import datetime
from pathlib import Path

def parse_goes_filename(fname):
    file = Path(fname)
    stem = file.stem.removeprefix("OR_")

    # Each timestamp is 14 digits: YYYYDDDHHMMSSs
    pattern = (
        r"(?P<product>[^_]+)_"
        r"(?P<platform>G\d+)_"
        r"s(?P<start>\d{14})_"
        r"e(?P<end>\d{14})_"
        r"c(?P<creation>\d{14})"
    )

    # fullmatch rejects names with trailing text after the creation time
    match = re.fullmatch(pattern, stem)
    if not match:
        raise ValueError(f"Invalid GOES filename format: {file.name!r}")

    info = match.groupdict()

    def parse_ts(ts):
        # The final digit (tenths of a second) is dropped
        return datetime.strptime(
            f"{ts[0:4]}-{ts[4:7]} {ts[7:9]}:{ts[9:11]}:{ts[11:13]}",
            "%Y-%j %H:%M:%S"
        )

    return {
        "product": info["product"],
        "platform": info["platform"],
        "start_time": parse_ts(info["start"]),
        "end_time": parse_ts(info["end"]),
        "creation_time": parse_ts(info["creation"]),
        "extension": file.suffix,
        "filename_no_ext": file.stem,  # full stem, including the OR_ prefix
        "filename_full": file.name
    }
```
Using the Parser
```python
info = parse_goes_filename(
    "OR_ABI-L2-FDCC-M6_G18_s20250090526174_e20250090528547_c20250090529107.nc"
)

for k, v in info.items():
    print(k, ":", v)
```
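In a pipeline you usually have many files rather than one. A minimal, self-contained sketch, assuming the same naming scheme: pull the start timestamp out with a regex and use it as a sort key, skipping names that do not match. The first two entries in the list are invented for the example, and the helper name start_time is hypothetical.

```python
import re
from datetime import datetime
from pathlib import Path

# 13 digits down to whole seconds, followed by the tenths digit.
START_RE = re.compile(r"_s(\d{13})\d_")

def start_time(fname):
    """Return the scan start as a datetime, or None if the name does not match."""
    m = START_RE.search(Path(fname).name)
    return datetime.strptime(m.group(1), "%Y%j%H%M%S") if m else None

files = [
    "OR_ABI-L2-FDCC-M6_G18_s20250090531174_e20250090533547_c20250090534107.nc",  # invented
    "notes.txt",                                                                  # invented
    "OR_ABI-L2-FDCC-M6_G18_s20250090526174_e20250090528547_c20250090529107.nc",
]

# Drop names that do not follow the scheme, then order by scan start.
valid = [f for f in files if start_time(f) is not None]
ordered = sorted(valid, key=start_time)

for f in ordered:
    print(start_time(f), f)
```

Returning None instead of raising keeps the filtering step simple; a stricter pipeline might prefer to log or reject unexpected names.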
When to Use What?
| Task | Best Tools |
|---|---|
| Extract extension, stem, name | pathlib, os.path |
| Remove fixed prefixes/suffixes | .removeprefix(), .removesuffix() |
| Split predictable filename structures | .split("_") |
| Extract fixed-width fields | slicing |
| Extract variable-length text or numbers | regex |
| Validate filename format | re.fullmatch() (re.match() only anchors at the start) |
| Extract multiple timestamp fields | regex + datetime |
| Build general-purpose parsers | regex + helper functions |
References
| Links | Site |
|---|---|
| https://docs.python.org/3/library/pathlib.html | Python Docs — pathlib module |
| https://docs.python.org/3/library/os.path.html | Python Docs — os.path utilities |
| https://docs.python.org/3/library/stdtypes.html#string-methods | Python Docs — String methods (split, removeprefix, slicing…) |
| https://docs.python.org/3/library/re.html | Python Docs — re (Regular Expressions) |
| https://docs.python.org/3/library/datetime.html | Python Docs — datetime parsing |
| https://docs.python.org/3/library/time.html | Python Docs — Time utilities |
| https://regex101.com/ | Regex testing tool (not official Python docs but widely used) |
