How to use Python's re module to find all content between two HTML tags

Published: March 18, 2024

Updated: March 18, 2024

Tags: Python;


Introduction

In this tutorial, we'll explore how to use the Python re module to find all content located between two specific HTML tags, namely <pre><code> and </code></pre>. This is particularly useful for extracting code snippets from HTML documents, among other applications.

Case study

Take the problem I encountered as a case in point. I created content with Python Markdown and now intend to extract sections delimited by two HTML tags:

content = """

some text

<pre><code>code_1</code></pre>

<pre><code>code_2</code></pre>

some text

<pre><code>
code_3
</code></pre>

some text

Warning: This content contains some html tags
"""

Using re.findall()

The content above contains several code sections. To extract them, we can use the re.findall() function, which returns a list of all non-overlapping matches of a regular expression pattern in a string.

First, we'll need to import the re module:

import re

Next, we'll define a regular expression pattern that will match our desired content:

pattern = r"(?<=<pre><code>).*?(?=</code></pre>)"

Let's break this down. We're using (?<=...) and (?=...) to specify lookbehind and lookahead assertions, respectively. This allows us to match content that is preceded by <pre><code> and followed by </code></pre>, without including these tags in our result.

The .*? part matches any run of characters (by default, any character except a newline) non-greedily, that is, it consumes as few characters as needed to reach the next closing tags.
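To see why the non-greedy quantifier matters, here is a minimal sketch comparing .* with .*? on a single line that holds two code blocks. The sample string and the variable names are made up for illustration; they are not part of the case study.

```python
import re

# Hypothetical one-line sample containing two code blocks.
sample = "<pre><code>a</code></pre> text <pre><code>b</code></pre>"

# Greedy: .* runs as far as it can, so it stops at the LAST closing tags.
greedy = re.findall(r"(?<=<pre><code>).*(?=</code></pre>)", sample)

# Non-greedy: .*? stops at the FIRST closing tags after each opening pair.
lazy = re.findall(r"(?<=<pre><code>).*?(?=</code></pre>)", sample)

print(greedy)  # ['a</code></pre> text <pre><code>b']
print(lazy)    # ['a', 'b']
```

With the greedy version, both blocks and the text between them end up in a single match, which is rarely what we want when extracting snippets.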

Now we can use re.findall() on our content string:

matches = re.findall(pattern, content)

print(matches)

The output will be:

['code_1', 'code_2']

The result is missing the last code section, which spans multiple lines. This happens because, by default, the dot (.) does not match newline characters. To fix this, pass the re.DOTALL flag, which makes the dot match any character, including newlines. This is crucial for handling multi-line code blocks:

re.findall(pattern, content, re.DOTALL)

The output will be:

['code_1', 'code_2', '\ncode_3\n']
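Multi-line matches keep the newlines that sit next to the tags. If those are not wanted, one possible approach is to compile the pattern once with re.DOTALL and strip each match afterwards. The sketch below uses a shortened copy of the sample content; the names code_block, raw, and cleaned are illustrative.

```python
import re

# Shortened copy of the sample content, for illustration only.
content = """
<pre><code>code_1</code></pre>

<pre><code>
code_3
</code></pre>
"""

# Compile once with re.DOTALL so that "." also matches newlines.
code_block = re.compile(r"(?<=<pre><code>).*?(?=</code></pre>)", re.DOTALL)

raw = code_block.findall(content)
# Remove leading and trailing whitespace, including the surrounding newlines.
cleaned = [block.strip() for block in raw]

print(raw)      # ['code_1', '\ncode_3\n']
print(cleaned)  # ['code_1', 'code_3']
```

Note that strip() also removes meaningful leading indentation on the first line of a snippet, so it may not suit every kind of code block.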

An alternative is to use [\s\S]. Whereas the dot skips newlines unless re.DOTALL is set, the character class [\s\S] matches any whitespace or non-whitespace character, which means any character at all, so no flag is needed. This pattern can also be used with re.findall():

pattern = r"(?<=<pre><code>)[\s\S]*?(?=</code></pre>)"

re.findall(pattern, content)

Either approach will yield the same results:

['code_1', 'code_2', '\ncode_3\n']
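A third option, not used in the walkthrough above, is a capturing group instead of lookaround assertions. When a pattern contains exactly one group, re.findall() returns the text of that group rather than the whole match, so the tags are still left out of the result. A minimal sketch on a shortened copy of the sample content:

```python
import re

# Shortened copy of the sample content, for illustration only.
content = """
<pre><code>code_1</code></pre>

<pre><code>code_2</code></pre>

<pre><code>
code_3
</code></pre>
"""

# The parentheses capture only the text between the tags.
pattern = r"<pre><code>(.*?)</code></pre>"

matches = re.findall(pattern, content, re.DOTALL)
print(matches)  # ['code_1', 'code_2', '\ncode_3\n']
```

Some readers find this form easier to follow than lookbehind and lookahead, at the cost of the full match now including the tags themselves.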

Iterating Over Matches

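To process each block individually, store the result of re.findall() in matches and loop over it:

matches = re.findall(pattern, content)

for match in matches: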
    print(match)
    print('----------')

This loop prints each matched content block:

code_1
----------
code_2
----------

code_3

----------
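When the position of each block in the document also matters, re.finditer() may be a better fit than re.findall(), since it yields match objects instead of plain strings. The following sketch assumes the capturing-group form of the pattern and a small made-up input; blocks is an illustrative name.

```python
import re

# Small made-up input, for illustration only.
content = "intro\n<pre><code>code_1</code></pre>\n<pre><code>\ncode_3\n</code></pre>\n"

pattern = re.compile(r"<pre><code>(.*?)</code></pre>", re.DOTALL)

blocks = []
for m in pattern.finditer(content):
    # m.start(1) is the index where the captured text begins in content.
    blocks.append((m.start(1), m.group(1).strip()))

for start, text in blocks:
    print(f"offset {start}: {text}")
    print('----------')
```

Each tuple pairs the offset of a block with its stripped text, which can help when mapping snippets back to their place in the source.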

Conclusion

The Python re module offers a suite of tools for working with regular expressions, which are patterns designed to match character combinations in strings. Regular expressions can handle simple HTML or XML tasks, but for more complex or robust parsing it is generally recommended to use a specialized library such as BeautifulSoup. For our simple need of extracting content between two specific tags, re suffices.
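To give a rough idea of what parser-based extraction looks like, here is a sketch that uses the standard library's html.parser module rather than BeautifulSoup, so that it runs without installing anything. The class name CodeBlockExtractor and the sample HTML are hypothetical, and the sketch assumes code blocks are not nested.

```python
from html.parser import HTMLParser


class CodeBlockExtractor(HTMLParser):
    """Collect the text found inside <pre><code>...</code></pre> blocks."""

    def __init__(self):
        super().__init__()  # character references are decoded by default
        self.blocks = []
        self._in_pre = False
        self._in_code = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._in_pre = True
        elif tag == "code" and self._in_pre:
            self._in_code = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "code" and self._in_code:
            self.blocks.append("".join(self._buffer))
            self._in_code = False
        elif tag == "pre":
            self._in_pre = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so accumulate it.
        if self._in_code:
            self._buffer.append(data)


# Made-up sample: note the class attribute and the escaped "<".
html = (
    '<p>text</p>\n'
    '<pre><code class="python">x &lt; 1</code></pre>\n'
    '<pre><code>\ncode_3\n</code></pre>'
)

parser = CodeBlockExtractor()
parser.feed(html)
parser.close()
print(parser.blocks)  # ['x < 1', '\ncode_3\n']
```

Unlike the literal <pre><code> pattern, this version still finds a block whose code tag carries attributes, and it decodes entities such as &lt; along the way.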
