How to Find All Hyperlinks from an HTML Page Using Python?

Introduction

Extracting hyperlinks from a web page is a common task in web scraping, data mining, and automation workflows. In this article, you'll learn how to use Python to extract all hyperlinks from an HTML page, and even how to download files (like .hdf) from NASA using an access token.

Here’s a simple example:

<a href="https://en.moonbooks.org">Visit moonbooks</a>

Explanation:

* `<a>` is the anchor tag used to define a hyperlink.
* `href` is the attribute that contains the URL.
* The text inside the tags is what users click; in this case: **Visit moonbooks**.
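
If you want to see those three pieces in code, here is a tiny sketch that parses the snippet above (it uses the beautifulsoup4 library installed in the Requirements section below):

# A tiny sketch showing the same three pieces extracted programmatically.
from bs4 import BeautifulSoup

snippet = '<a href="https://en.moonbooks.org">Visit moonbooks</a>'
tag = BeautifulSoup(snippet, 'html.parser').a

print(tag.name)        # a
print(tag['href'])     # https://en.moonbooks.org
print(tag.get_text())  # Visit moonbooks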

Requirements

You’ll need:

  • Python 3.x
  • beautifulsoup4 and requests libraries

Install them with:

pip install beautifulsoup4 requests

Step-by-Step Guide

Step 1: Import the Required Libraries

import requests
from bs4 import BeautifulSoup

Step 2: Fetch the Web Page

url = 'https://example.com'
response = requests.get(url)
html_content = response.text

Use response.content instead of response.text if you're dealing with binary or non-UTF-8 data.
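
Optionally, you can make the fetch a bit more robust by adding a timeout and failing early on HTTP errors. A minimal sketch (the 10-second timeout is an arbitrary choice, not something the page requires):

import requests

url = 'https://example.com'
response = requests.get(url, timeout=10)  # avoid hanging forever on a slow server
response.raise_for_status()               # raise an exception on 4xx/5xx responses

html_text = response.text       # decoded string (uses the detected encoding)
html_bytes = response.content   # raw bytes, useful for binary or non-UTF-8 data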

Step 3: Parse the HTML with BeautifulSoup

soup = BeautifulSoup(html_content, 'html.parser')

Step 4: Extract All the Hyperlinks

anchor_tags = soup.find_all('a')

for tag in anchor_tags:
    href = tag.get('href')
    if href:
        print(href)

This will print all hyperlinks found in the page.
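
If you also want the visible link text next to each URL, the same tag exposes it through get_text(). A small optional addition:

# Print the anchor text together with its URL.
for tag in anchor_tags:
    href = tag.get('href')
    if href:
        text = tag.get_text(strip=True)  # visible text inside <a>...</a>
        print(f'{text or "(no text)"} -> {href}')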

Example: Resolve Full URLs with urljoin

Many links on websites are relative (e.g., "/page.html"). Use urljoin to convert them into absolute URLs:

from urllib.parse import urljoin

for tag in anchor_tags:
    href = tag.get('href')
    if href:
        full_url = urljoin(url, href)
        print(full_url)
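
To clean the list up further, you can deduplicate the resolved URLs and keep only http(s) links, skipping things like mailto: or javascript: targets. A sketch building on the loop above:

from urllib.parse import urljoin, urlparse

unique_links = set()
for tag in anchor_tags:
    href = tag.get('href')
    if href:
        full_url = urljoin(url, href)
        if urlparse(full_url).scheme in ('http', 'https'):  # drop mailto:, javascript:, etc.
            unique_links.add(full_url)

for link in sorted(unique_links):
    print(link)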

Filter & Download Specific Files

If you only want .pdf, .hdf, or other file types:

for tag in anchor_tags:
    href = tag.get('href')
    if href and href.endswith('.hdf'):
        print(href)
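
Since str.endswith() also accepts a tuple of suffixes, you can match several file types at once and resolve them to absolute URLs in the same pass. A sketch (the extension list is just an example):

from urllib.parse import urljoin

wanted = ('.pdf', '.hdf', '.zip')  # example extensions, adjust as needed

matching = []
for tag in anchor_tags:
    href = tag.get('href')
    if href and href.lower().endswith(wanted):
        matching.append(urljoin(url, href))

for file_url in matching:
    print(file_url)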

Example: Accessing Protected NASA URLs Using a Bearer Token

To download .hdf files from NASA's ASDC (CALIPSO), you first need a NASA Earthdata Login token.

Step-by-Step

  1. Go to https://urs.earthdata.nasa.gov/ and create an account.
  2. Generate a bearer token under Applications > Generate Token (see the sketch below for one way to store it).
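
Rather than pasting the token directly into your script, you can read it from an environment variable so it never ends up in version control. A minimal sketch (the variable name EARTHDATA_TOKEN is just an example, not an official name):

import os

# Read the token from the environment; the variable name is arbitrary.
token = os.environ.get('EARTHDATA_TOKEN')
if not token:
    raise SystemExit('Set the EARTHDATA_TOKEN environment variable first.')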

Sample Code to List the Links in the Directory

import urllib.request
from bs4 import BeautifulSoup

token = 'YOUR_TOKEN_HERE'

opener = urllib.request.build_opener()
opener.addheaders = [('Authorization', f'Bearer {token}')]
urllib.request.install_opener(opener)

url = 'https://asdc.larc.nasa.gov/data/CALIPSO/LID_L2_05kmALay-Standard-V4-51/2011/01/'
html_page = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html_page, 'html.parser')

# Extract all links from the directory listing page
links = [a['href'] for a in soup.find_all('a', href=True)]

for link in links:
    print(link)

Note:

  • build_opener(): Creates a custom URL opener.
  • addheaders: Adds the Authorization header to all HTTP requests (using your bearer token).
  • install_opener(): Globally sets this opener as the default, so that urlopen() will use it.

Without this part, you’d get a 403 Forbidden when accessing protected Earthdata URLs.
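
If you prefer the requests library, the same request can be made by passing the Authorization header explicitly instead of installing a global opener. A sketch of that approach (redirect handling may differ slightly from urllib, so treat it as an alternative, not a drop-in guarantee):

import requests
from bs4 import BeautifulSoup

token = 'YOUR_TOKEN_HERE'
url = 'https://asdc.larc.nasa.gov/data/CALIPSO/LID_L2_05kmALay-Standard-V4-51/2011/01/'

headers = {'Authorization': f'Bearer {token}'}
response = requests.get(url, headers=headers, timeout=60)
response.raise_for_status()  # a 403 here usually means a missing or expired token

soup = BeautifulSoup(response.text, 'html.parser')
links = [a['href'] for a in soup.find_all('a', href=True)]
print(f'{len(links)} links found')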

Output:

https://forum.earthdata.nasa.gov/viewforum.php?f=7&sid=6967a8430c1959276aac21d89b560ed5&DAAC=1&Discipline=&Projects=&ServicesUsage=15,16&keywords=&author=&startDate=&endDate=&bestAnswer=&tagMatch=all&searchWithin=all
https://nasa.github.io/earthdata-download/
CAL_LID_L2_05kmALay-Standard-V4-51.2011-01-01T00-41-58ZN.hdf
CAL_LID_L2_05kmALay-Standard-V4-51.2011-01-01T00-41-58ZN.hdf
CAL_LID_L2_05kmALay-Standard-V4-51.2011-01-01T01-28-18ZD.hdf
CAL_LID_L2_05kmALay-Standard-V4-51.2011-01-01T01-28-18ZD.hdf
CAL_LID_L2_05kmALay-Standard-V4-51.2011-01-01T02-20-48ZN.hdf
CAL_LID_L2_05kmALay-Standard-V4-51.2011-01-01T02-20-48ZN.hdf
CAL_LID_L2_05kmALay-Standard-V4-51.2011-01-01T03-07-08ZD.hdf
CAL_LID_L2_05kmALay-Standard-V4-51.2011-01-01T03-07-08ZD.hdf
.
.
.

Download the Files

To download only files ending with .hdf, you can add a simple if condition inside your loop to filter them. Here's a complete version of the download loop:

from urllib.parse import urljoin
import urllib.request

for file_name in links:
    if file_name.endswith('.hdf'):
        full_url = urljoin(url, file_name)
        print(f"Downloading {file_name}...")
        urllib.request.urlretrieve(full_url, file_name)

Explanation:

  • file_name.endswith('.hdf') ensures that only files with the .hdf extension are downloaded.
  • urljoin(url, file_name) creates the full absolute URL.
  • urllib.request.urlretrieve() downloads the file to your current working directory (a re-runnable variation is sketched below).
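
If the loop gets interrupted, you may not want to download everything again on the next run. A small variation that skips files already present on disk (it reuses the links and url variables from the listing code above):

import os
import urllib.request
from urllib.parse import urljoin

for file_name in links:
    if file_name.endswith('.hdf') and not os.path.exists(file_name):
        full_url = urljoin(url, file_name)
        print(f"Downloading {file_name}...")
        urllib.request.urlretrieve(full_url, file_name)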

SSL Certificate Error Fix

If you get this error:

SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed

This means your Python environment can't verify the website's SSL certificate. Here's how to fix it:

Option 1: macOS Certificate Installer

Run this if you're on macOS and installed Python from python.org:

/Applications/Python\ 3.x/Install\ Certificates.command

Option 2: Disable SSL Verification (Not Recommended)

If you just need a quick workaround and you trust the site, you can skip certificate verification. Keep in mind that this disables a security check:

import requests
from bs4 import BeautifulSoup
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

url = 'https://en.moonbooks.org'
response = requests.get(url, verify=False)
soup = BeautifulSoup(response.text, 'html.parser')

links = [a['href'] for a in soup.find_all('a', href=True)]
for link in links:
    print(link)

Option 3: Use certifi to Provide Trusted CAs

Installation

pip install certifi

Code

import requests
import certifi
from bs4 import BeautifulSoup

url = 'https://en.moonbooks.org'
response = requests.get(url, verify=certifi.where())
soup = BeautifulSoup(response.text, 'html.parser')

links = [a['href'] for a in soup.find_all('a', href=True)]
for link in links:
    print(link)
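
If you make many requests, you can set the CA bundle once on a requests.Session instead of passing verify= on every call (requests also honors the REQUESTS_CA_BUNDLE environment variable). A short sketch:

import certifi
import requests

session = requests.Session()
session.verify = certifi.where()  # applied to every request made through this session

response = session.get('https://en.moonbooks.org')
print(response.status_code)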

Conclusion

With just a few lines of Python and the BeautifulSoup + requests combo, you can:

  • Extract hyperlinks from web pages
  • Filter and download specific file types
  • Handle authentication using bearer tokens
  • Fix SSL issues if needed