How to Extract Text Information from Filenames in Python?

Introduction

Filenames often contain far more information than just a label. In scientific computing, Earth-observation (EO) workflows, log processing, and automated pipelines, filenames frequently embed timestamps, product or sensor IDs, platform or satellite identifiers, processing levels, versions, and location codes or unique IDs.

Python provides a broad set of tools to help with this:

  • Basic string methods (split, slicing, .removeprefix()…)
  • Path utilities (os.path, pathlib)
  • Regular expressions (re)
  • The datetime module
  • Full custom parsers

In this tutorial, we start with the basics and build up to a real-world example: parsing GOES ABI Level-2 filenames, such as:

OR_ABI-L2-FDCC-M6_G18_s20250090526174_e20250090528547_c20250090529107.nc

Extracting the Filename and Extension

The easiest way is with os.path:

import os

filename = "OR_ABI-L2-FDCC-M6_G18_s20250090526174_e20250090528547_c20250090529107.nc"

basename = os.path.basename(filename)
stem, ext = os.path.splitext(basename)

print(stem)  # filename without extension
print(ext)   # .nc

Or with pathlib:

from pathlib import Path

file = Path(filename)
file.stem      # 'OR_ABI-L2-FDCC-M6_G18_s20250090526174_e20250090528547_c20250090529107'
file.suffix    # '.nc'
file.name      # full filename
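
Filenames often arrive as full paths, and some carry more than one extension. The sketch below shows a few more pathlib attributes for those cases. The directory names and the .nc.gz file are made up for illustration, and PurePosixPath is used only so that the result is the same on every operating system.

```python
from pathlib import PurePosixPath

# Hypothetical location, used only for illustration
p = PurePosixPath(
    "/data/goes/2025/009/"
    "OR_ABI-L2-FDCC-M6_G18_s20250090526174_e20250090528547_c20250090529107.nc"
)

print(p.name)         # filename with extension, directories dropped
print(p.parent.name)  # '009', the immediate parent directory
print(p.suffix)       # '.nc'

# A compressed copy has two suffixes; .suffix only reports the last one
gz = PurePosixPath("/data/archive/sample_product.nc.gz")
print(gz.suffix)      # '.gz'
print(gz.suffixes)    # ['.nc', '.gz']
print(gz.stem)        # 'sample_product.nc'

# Strip every suffix to get the bare name (str.removesuffix needs Python 3.9+)
bare = gz.name.removesuffix("".join(gz.suffixes))
print(bare)           # 'sample_product'
```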

Removing Known Prefixes

Python 3.9 introduced str.removeprefix():

clean = stem.removeprefix("OR_")
print(clean)

This is cleaner than slicing by hand, and safer than str.replace(), which would also remove any later occurrence of the prefix inside the name.
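
Two look-alike methods are easy to reach for by mistake, so a short comparison may help. The sketch below also includes strip_prefix, a hypothetical helper (not part of the standard library) that mimics removeprefix() on Python versions older than 3.9. The sample strings other than the GOES stem are invented.

```python
stem = "OR_ABI-L2-FDCC-M6_G18"

# lstrip() takes a set of characters, not a prefix: it strips any leading
# 'O', 'R' or '_' characters, so it can eat into the name itself.
print("OR_ROI-sample".lstrip("OR_"))        # 'I-sample' (too much removed)

# replace() removes every occurrence, not just the leading one
print("OR_ABI_OR_copy".replace("OR_", ""))  # 'ABI_copy'

def strip_prefix(text, prefix):
    """Hypothetical helper: removeprefix() behaviour for Python < 3.9."""
    if prefix and text.startswith(prefix):
        return text[len(prefix):]
    return text

print(strip_prefix(stem, "OR_"))        # 'ABI-L2-FDCC-M6_G18'
print(strip_prefix("G18_only", "OR_"))  # unchanged: 'G18_only'
```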

Splitting the Filename Into Components

Simple filenames can be parsed with .split("_"):

parts = clean.split("_")
parts

Result:

['ABI-L2-FDCC-M6', 'G18', 's20250090526174', 'e20250090528547', 'c20250090529107']

Assign variables:

product = parts[0]
platform = parts[1]
start_code = parts[2]  # e.g. 's20250090526174'
end_code = parts[3]
creation_code = parts[4]

This works great for predictable, strict formats.
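
Indexing into the list fails with an unhelpful IndexError when a name has the wrong number of fields. One possible refinement, sketched below, is to check the field count first and then unpack everything in one step. The product field can be split again on "-"; the names given to its four pieces (instrument, level, product_name, scan_mode) are assumptions made for this example.

```python
clean = "ABI-L2-FDCC-M6_G18_s20250090526174_e20250090528547_c20250090529107"

parts = clean.split("_")
if len(parts) != 5:
    raise ValueError(f"expected 5 fields, got {len(parts)}: {clean!r}")

# Tuple unpacking names every field in one step
product, platform, start_code, end_code, creation_code = parts

# The product field is itself hyphen-delimited. The meanings given to the
# four pieces here are an assumption made for this example.
instrument, level, product_name, scan_mode = product.split("-")

print(instrument, level, product_name, scan_mode)  # ABI L2 FDCC M6
print(platform)                                    # G18
```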

Parsing Encoded Dates into Python datetime

GOES timestamps encode:

YYYY + DDD + HHMM + SS + S

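Where DDD is the day of the year (001 to 366) and the final S is tenths of a second, giving 14 digits in total. The times are in UTC.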

First, remove the leading identifier (s, e, c):

start_str = start_code[1:]
end_str = end_code[1:]
creation_str = creation_code[1:]

Then convert:

from datetime import datetime

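def parse_goes_timestamp(ts):
    # ts is 14 digits: YYYY DDD HH MM SS plus one digit of tenths of a second
    year   = ts[0:4]
    doy    = ts[4:7]    # day of year, 001-366
    hour   = ts[7:9]
    minute = ts[9:11]
    second = ts[11:13]  # the trailing tenths digit (ts[13]) is ignored here

    # %j parses the zero-padded day of year
    return datetime.strptime(
        f"{year}-{doy} {hour}:{minute}:{second}",
        "%Y-%j %H:%M:%S"
    )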

start_dt = parse_goes_timestamp(start_str)
end_dt = parse_goes_timestamp(end_str)
creation_dt = parse_goes_timestamp(creation_str)
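
The function above discards the final tenths-of-a-second digit and returns a naive datetime. If sub-second precision or an explicit UTC timezone matters, a variant along the following lines could be used. The name parse_goes_timestamp_utc is hypothetical, and the range checking is deliberately minimal (for example, it does not reject day 366 in a non-leap year).

```python
from datetime import datetime, timedelta, timezone

def parse_goes_timestamp_utc(ts):
    """Hypothetical variant: keep the tenths digit and attach UTC."""
    if len(ts) != 14 or not ts.isdigit():
        raise ValueError(f"expected 14 digits, got {ts!r}")

    year = int(ts[0:4])
    doy = int(ts[4:7])       # day of year, 1-366
    hour = int(ts[7:9])
    minute = int(ts[9:11])
    second = int(ts[11:13])
    tenths = int(ts[13])     # tenths of a second

    if not 1 <= doy <= 366:
        raise ValueError(f"day of year out of range: {doy}")

    # Start at 1 January and add the offsets; day 001 is 1 January itself
    return datetime(year, 1, 1, tzinfo=timezone.utc) + timedelta(
        days=doy - 1,
        hours=hour,
        minutes=minute,
        seconds=second,
        milliseconds=tenths * 100,
    )

start_dt = parse_goes_timestamp_utc("20250090526174")
end_dt = parse_goes_timestamp_utc("20250090528547")

print(start_dt.isoformat())   # 2025-01-09T05:26:17.400000+00:00
print(end_dt - start_dt)      # 0:02:37.300000
```

Subtracting two of these values gives a timedelta, which is a convenient way to get the scan duration.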

When Basic String Methods Are Not Enough

split() and slicing are great when:

  • Format is fixed
  • Everything is separated by known delimiters
  • No optional fields or variable length

But what if:

  • Some parts vary in length?
  • Some fields are optional?
  • You want to validate filenames?
  • You need a more generalizable method?

This is where regular expressions (re) are ideal.

Extracting Information Using Regular Expressions (re)

Regular expressions let you define a pattern describing what the filename looks like.

Python’s re module supports:

  • Named groups
  • Optional fields
  • Validation
  • Capture of multiple numeric sequences

Example: Extract platform, product, timestamps

import re
from pathlib import Path

filename = "OR_ABI-L2-FDCC-M6_G18_s20250090526174_e20250090528547_c20250090529107.nc"

# Each timestamp is 14 digits: YYYY DDD HH MM SS + tenths of a second
pattern = (
    r"OR_(?P<product>[^_]+)_"       # ABI-L2-FDCC-M6
    r"(?P<platform>G\d+)_"          # G18
    r"s(?P<start>\d{14})_"          # sYYYYDDDHHMMSSs (scan start)
    r"e(?P<end>\d{14})_"            # scan end
    r"c(?P<creation>\d{14})"        # file creation
)

match = re.match(pattern, Path(filename).stem)
match.groupdict()

Output:

{
 'product': 'ABI-L2-FDCC-M6',
 'platform': 'G18',
 'start': '20250090526174',
 'end': '20250090528547',
 'creation': '20250090529107'
}
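
For validation, it can help to compile the pattern once and require that it cover the whole name, since re.match() only anchors at the start and would accept trailing text. The sketch below is one way to do that. GOES_NAME_RE and is_goes_name are illustrative names, and the pattern assumes 14-digit timestamps and a .nc extension, as in the sample filename.

```python
import re

# Compiled once, reused for every filename. Names are illustrative.
GOES_NAME_RE = re.compile(
    r"OR_(?P<product>[^_]+)_"
    r"(?P<platform>G\d+)_"
    r"s(?P<start>\d{14})_"
    r"e(?P<end>\d{14})_"
    r"c(?P<creation>\d{14})"
    r"\.nc"
)

def is_goes_name(name):
    """Return True only if the whole name fits the pattern."""
    return GOES_NAME_RE.fullmatch(name) is not None

good = "OR_ABI-L2-FDCC-M6_G18_s20250090526174_e20250090528547_c20250090529107.nc"
print(is_goes_name(good))                        # True
print(is_goes_name(good + ".tmp"))               # False: trailing text
print(is_goes_name(good.replace("G18", "X18")))  # False: bad platform
print(is_goes_name("notes.txt"))                 # False
```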

Extracting numbers only

re.findall(r"\d+", filename)
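
Note that \d+ picks up every run of digits, including the ones inside L2, M6 and G18. The sketch below shows the result for the sample filename, followed by a narrower pattern (an illustrative choice, not the only option) that keeps just the three 14-digit timestamps.

```python
import re

filename = "OR_ABI-L2-FDCC-M6_G18_s20250090526174_e20250090528547_c20250090529107.nc"

# Every run of digits, including the ones inside 'L2', 'M6' and 'G18'
all_numbers = re.findall(r"\d+", filename)
print(all_numbers)
# ['2', '6', '18', '20250090526174', '20250090528547', '20250090529107']

# Only the 14-digit timestamps, keyed by their leading letter.
# With two groups, findall() returns (letter, digits) tuples.
stamps = dict(re.findall(r"_([sec])(\d{14})", filename))
print(stamps)
# {'s': '20250090526174', 'e': '20250090528547', 'c': '20250090529107'}
```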

Optional fields example

pattern = r"(?P<product>.+?)_(?P<platform>G\d+)(_v(?P<version>\d+))?"

This matches filenames with or without a _vXX version suffix.
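
A quick check of that pattern against two names, one with and one without the version suffix, could look like the sketch below. Both names are made up for this example. When the optional group does not take part in the match, its value is None.

```python
import re

pattern = r"(?P<product>.+?)_(?P<platform>G\d+)(_v(?P<version>\d+))?"

# Both names are made up for this example
results = {}
for name in ["ABI-L2-FDCC_G18_v02", "ABI-L2-FDCC_G18"]:
    m = re.fullmatch(pattern, name)
    results[name] = m.groupdict()
    print(name, "->", results[name])
# ABI-L2-FDCC_G18_v02 -> {'product': 'ABI-L2-FDCC', 'platform': 'G18', 'version': '02'}
# ABI-L2-FDCC_G18 -> {'product': 'ABI-L2-FDCC', 'platform': 'G18', 'version': None}
```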

Full Example: Parsing a GOES Filename

Here’s a complete parser function that:

  • Extracts filename info
  • Removes prefixes
  • Uses regex for validation
  • Converts timestamps to datetime
  • Returns a structured dictionary

Example:

import re
from datetime import datetime
from pathlib import Path

def parse_goes_filename(fname):
    file = Path(fname)
    # Drop the "OR_" prefix before matching
    stem = file.stem.removeprefix("OR_")

    # Each timestamp is 14 digits: YYYY DDD HH MM SS + tenths of a second
    pattern = (
        r"(?P<product>[^_]+)_"
        r"(?P<platform>G\d+)_"
        r"s(?P<start>\d{14})_"
        r"e(?P<end>\d{14})_"
        r"c(?P<creation>\d{14})"
    )

    # fullmatch() also rejects names with trailing text after the creation time
    match = re.fullmatch(pattern, stem)
    if not match:
        raise ValueError(f"Invalid GOES filename format: {fname!r}")

    info = match.groupdict()

    def parse_ts(ts):
        # YYYY-DDD HH:MM:SS; the tenths digit (ts[13]) is dropped
        return datetime.strptime(
            f"{ts[0:4]}-{ts[4:7]} {ts[7:9]}:{ts[9:11]}:{ts[11:13]}",
            "%Y-%j %H:%M:%S"
        )

    return {
        "product": info["product"],
        "platform": info["platform"],
        "start_time": parse_ts(info["start"]),
        "end_time": parse_ts(info["end"]),
        "creation_time": parse_ts(info["creation"]),
        "extension": file.suffix,
        "filename_no_ext": file.stem,
        "filename_full": file.name
    }

Using the Parser

info = parse_goes_filename(
    "OR_ABI-L2-FDCC-M6_G18_s20250090526174_e20250090528547_c20250090529107.nc"
)

for k, v in info.items():
    print(k, ":", v)
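
In practice a parser like this is usually run over many files, some of which will not fit the pattern. The sketch below shows one way to sort a listing by scan start and collect the names that were skipped. To keep the block runnable on its own it uses a compact stand-in (GOES_RE and start_time, both hypothetical names) rather than the full parser, and the last two entries in the listing are made up.

```python
import re
from datetime import datetime

# Compact stand-in for the parser above so this block runs on its own
GOES_RE = re.compile(
    r"OR_(?P<product>[^_]+)_(?P<platform>G\d+)_"
    r"s(?P<start>\d{14})_e(?P<end>\d{14})_c(?P<creation>\d{14})\.nc"
)

def start_time(name):
    """Return the scan start as a datetime, or None if the name does not fit."""
    m = GOES_RE.fullmatch(name)
    if m is None:
        return None
    ts = m["start"]
    # YYYY-DDD HH:MM:SS; the tenths digit is dropped
    return datetime.strptime(
        f"{ts[0:4]}-{ts[4:7]} {ts[7:9]}:{ts[9:11]}:{ts[11:13]}",
        "%Y-%j %H:%M:%S",
    )

# Hypothetical directory listing; the last two names are made up
names = [
    "OR_ABI-L2-FDCC-M6_G18_s20250090526174_e20250090528547_c20250090529107.nc",
    "OR_ABI-L2-FDCC-M6_G18_s20250090521174_e20250090523547_c20250090524107.nc",
    "README.txt",
]

parsed = [(start_time(n), n) for n in names]
valid = sorted((t, n) for t, n in parsed if t is not None)
skipped = [n for t, n in parsed if t is None]

for t, n in valid:
    print(t, n)
print("skipped:", skipped)
```

With a real directory, the list of names could come from something like Path(some_dir).glob("*.nc"), taking the .name of each result.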

When to Use What?

Task | Best tools
--- | ---
Extract extension, stem, name | pathlib, os.path
Remove fixed prefixes/suffixes | .removeprefix(), .removesuffix()
Split predictable filename structures | .split("_")
Extract variable-length text or numbers | regex (slicing only when positions are fixed)
Validate filename format | re.fullmatch() (or re.match() when trailing text is acceptable)
Extract multiple timestamp fields | regex + datetime
Build general-purpose parsers | regex + helper functions

References

Link | Site
--- | ---
https://docs.python.org/3/library/pathlib.html | Python Docs: pathlib module
https://docs.python.org/3/library/os.path.html | Python Docs: os.path utilities
https://docs.python.org/3/library/stdtypes.html#string-methods | Python Docs: string methods (split, removeprefix, slicing…)
https://docs.python.org/3/library/re.html | Python Docs: re (regular expressions)
https://docs.python.org/3/library/datetime.html | Python Docs: datetime parsing
https://docs.python.org/3/library/time.html | Python Docs: time utilities
https://regex101.com/ | Regex testing tool (not official Python docs, but widely used)