Introduction
Extracting hyperlinks from a web page is a common task in web scraping, data mining, and automation workflows. In this article, you'll learn how to use Python to extract all hyperlinks from an HTML page, and even download files (like `.hdf`) from NASA using an access token.
What is a Hyperlink in HTML?
A hyperlink connects one page to another resource and is created in HTML with the anchor (`<a>`) tag. Here's a simple example:
```html
<a href="https://en.moonbooks.org">Visit moonbooks</a>
```
Explanation:
- `<a>` is the anchor tag used to define a hyperlink.
- `href` is the attribute that contains the URL.
- The text inside the tags is what users click; in this case, **Visit moonbooks**.
Requirements
You’ll need:
- Python 3.x
- The `beautifulsoup4` and `requests` libraries
Install them with:
```bash
pip install beautifulsoup4 requests
```
Step-by-Step Guide
Step 1: Import the Required Libraries
```python
import requests
from bs4 import BeautifulSoup
```
Step 2: Fetch the Web Page
```python
url = 'https://example.com'
response = requests.get(url)
html_content = response.text
```
Use `response.content` instead of `response.text` if you're dealing with binary or non-UTF-8 data.
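For instance, here's a minimal sketch of saving a binary file with `response.content` (the URL is a placeholder):

```python
import requests

# Hypothetical URL; replace it with the file you actually want.
file_url = 'https://example.com/data/sample.hdf'
response = requests.get(file_url)
response.raise_for_status()  # Fail fast on HTTP errors.

# .content is raw bytes, which is safe for binary files;
# .text would decode them as text and could corrupt the data.
with open('sample.hdf', 'wb') as f:
    f.write(response.content)
```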
Step 3: Parse the HTML with BeautifulSoup
```python
soup = BeautifulSoup(html_content, 'html.parser')
```
Step 4: Extract All Anchor (`<a>`) Tags
```python
anchor_tags = soup.find_all('a')

for tag in anchor_tags:
    href = tag.get('href')
    if href:
        print(href)
```
This will print every hyperlink found on the page.
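If you also want the clickable text of each link, BeautifulSoup exposes it through `get_text()`. A small variation on the loop above:

```python
for tag in anchor_tags:
    href = tag.get('href')
    if href:
        # get_text(strip=True) returns the link's visible text
        # with surrounding whitespace removed.
        print(f"{tag.get_text(strip=True)} -> {href}")
```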
Example: Resolve Full URLs with `urljoin`
Many links on websites are relative (e.g., `/page.html`). Use `urljoin` to convert them into absolute URLs:
```python
from urllib.parse import urljoin

for tag in anchor_tags:
    href = tag.get('href')
    if href:
        full_url = urljoin(url, href)
        print(full_url)
```
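A few illustrative cases of how `urljoin` resolves different kinds of links (the base URL here is just an example):

```python
from urllib.parse import urljoin

base = 'https://example.com/articles/index.html'

print(urljoin(base, 'page.html'))    # https://example.com/articles/page.html
print(urljoin(base, '/page.html'))   # https://example.com/page.html
print(urljoin(base, 'https://other.org/x'))  # absolute URLs pass through unchanged
```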
Filter & Download Specific Files
If you only want `.pdf`, `.hdf`, or other file types, filter on the suffix:
```python
for tag in anchor_tags:
    href = tag.get('href')
    if href and href.endswith('.hdf'):
        print(href)
```
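Note that `str.endswith()` also accepts a tuple of suffixes, so a single check can match several file types:

```python
for tag in anchor_tags:
    href = tag.get('href')
    # One endswith() call covers all three extensions.
    if href and href.endswith(('.hdf', '.pdf', '.nc')):
        print(href)
```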
Example: Accessing Protected NASA URLs Using a Bearer Token
To download `.hdf` files from NASA's ASDC (CALIPSO), you first need a NASA Earthdata Login token.
Step-by-Step
- Go to https://urs.earthdata.nasa.gov/ and create an account.
- Generate a bearer token under Applications > Generate Token.
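Rather than hardcoding the token in your script, you can read it from an environment variable (the name `EARTHDATA_TOKEN` is just a convention chosen here):

```python
import os

# Export the token once in your shell, e.g.:
#   export EARTHDATA_TOKEN='...'
token = os.environ['EARTHDATA_TOKEN']
```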
Sample Code to List the Available Files
```python
import urllib.request
from bs4 import BeautifulSoup

token = 'YOUR_TOKEN_HERE'

# Build an opener that sends the bearer token with every request.
opener = urllib.request.build_opener()
opener.addheaders = [('Authorization', f'Bearer {token}')]
urllib.request.install_opener(opener)

url = 'https://asdc.larc.nasa.gov/data/CALIPSO/LID_L2_05kmALay-Standard-V4-51/2011/01/'
html_page = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html_page, 'html.parser')

# Extract every link on the directory listing page
links = [a['href'] for a in soup.find_all('a', href=True)]

for link in links:
    print(link)
```
Note:
- `build_opener()` creates a custom URL opener.
- `addheaders` adds the `Authorization` header (carrying your bearer token) to every request made through this opener.
- `install_opener()` installs the opener globally, so that `urlopen()` uses it by default.

Without this setup, you'd get a `403 Forbidden` when accessing protected Earthdata URLs.
Output:

```
https://forum.earthdata.nasa.gov/viewforum.php?f=7&sid=6967a8430c1959276aac21d89b560ed5&DAAC=1&Discipline=&Projects=&ServicesUsage=15,16&keywords=&author=&startDate=&endDate=&bestAnswer=&tagMatch=all&searchWithin=all
https://nasa.github.io/earthdata-download/
CAL_LID_L2_05kmALay-Standard-V4-51.2011-01-01T00-41-58ZN.hdf
CAL_LID_L2_05kmALay-Standard-V4-51.2011-01-01T00-41-58ZN.hdf
CAL_LID_L2_05kmALay-Standard-V4-51.2011-01-01T01-28-18ZD.hdf
CAL_LID_L2_05kmALay-Standard-V4-51.2011-01-01T01-28-18ZD.hdf
CAL_LID_L2_05kmALay-Standard-V4-51.2011-01-01T02-20-48ZN.hdf
CAL_LID_L2_05kmALay-Standard-V4-51.2011-01-01T02-20-48ZN.hdf
CAL_LID_L2_05kmALay-Standard-V4-51.2011-01-01T03-07-08ZD.hdf
CAL_LID_L2_05kmALay-Standard-V4-51.2011-01-01T03-07-08ZD.hdf
...
```
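If you'd rather not change global state with `install_opener()`, you can attach the header to a single request instead. Here's a sketch of the same listing using a per-request `Request` object:

```python
import urllib.request
from bs4 import BeautifulSoup

token = 'YOUR_TOKEN_HERE'
url = 'https://asdc.larc.nasa.gov/data/CALIPSO/LID_L2_05kmALay-Standard-V4-51/2011/01/'

# The Authorization header applies to this request only;
# nothing is installed globally.
request = urllib.request.Request(url, headers={'Authorization': f'Bearer {token}'})
html_page = urllib.request.urlopen(request).read()

soup = BeautifulSoup(html_page, 'html.parser')
links = [a['href'] for a in soup.find_all('a', href=True)]
```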
Download the Files
To download only the files ending with `.hdf`, add a simple `if` condition inside your loop to filter them. Here's the complete version:
```python
from urllib.parse import urljoin
import urllib.request

for file_name in links:
    if file_name.endswith('.hdf'):
        full_url = urljoin(url, file_name)
        print(f"Downloading {file_name}...")
        urllib.request.urlretrieve(full_url, file_name)
```
Explanation:
- `file_name.endswith('.hdf')` ensures that only files with the `.hdf` extension are downloaded.
- `urljoin(url, file_name)` builds the full absolute URL.
- `urllib.request.urlretrieve()` downloads the file to your current working directory. It calls `urlopen()` internally, so the opener installed earlier still supplies the token.
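`urlretrieve()` also accepts an optional `reporthook` callback, which you can use for a rough progress indicator. A minimal sketch:

```python
def progress(block_num, block_size, total_size):
    # total_size is -1 when the server doesn't send Content-Length.
    if total_size > 0:
        percent = min(100, block_num * block_size * 100 // total_size)
        print(f"\rDownloading... {percent}%", end='')

# Drop-in replacement for the urlretrieve() call in the loop above.
urllib.request.urlretrieve(full_url, file_name, reporthook=progress)
```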
SSL Certificate Error Fix
If you get this error:
```
SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed
```
it means your Python environment can't verify the website's SSL certificate. Here's how to fix it:
Option 1: macOS Certificate Installer
Run this if you're on macOS and installed Python from python.org (replace `3.x` with your installed version):
```bash
/Applications/Python\ 3.x/Install\ Certificates.command
```
Option 2: Temporarily Disable Verification (⚠️ Not Recommended)
```python
import requests
from bs4 import BeautifulSoup
import urllib3

# Suppress the warning that verify=False would otherwise print on every request.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

url = 'https://en.moonbooks.org'
response = requests.get(url, verify=False)

soup = BeautifulSoup(response.text, 'html.parser')
links = [a['href'] for a in soup.find_all('a', href=True)]

for link in links:
    print(link)
```
Option 3: Use `certifi` to Provide Trusted CAs
Installation
```bash
pip install certifi
```
Code
```python
import requests
import certifi
from bs4 import BeautifulSoup

url = 'https://en.moonbooks.org'
response = requests.get(url, verify=certifi.where())

soup = BeautifulSoup(response.text, 'html.parser')
links = [a['href'] for a in soup.find_all('a', href=True)]

for link in links:
    print(link)
```
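Instead of passing `verify=certifi.where()` on every call, you can point `requests` at the bundle once via the `REQUESTS_CA_BUNDLE` environment variable, which it honors at request time:

```python
import os
import certifi
import requests

# requests reads REQUESTS_CA_BUNDLE when making a request, so every
# later call in this process verifies against certifi's CA bundle.
os.environ['REQUESTS_CA_BUNDLE'] = certifi.where()

response = requests.get('https://en.moonbooks.org')
```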
Conclusion
With just a few lines of Python and the `BeautifulSoup` + `requests` combo, you can:
- Extract hyperlinks from web pages
- Filter and download specific file types
- Handle authentication using bearer tokens
- Fix SSL issues if needed