How to read an HDF4 file in Python using the xarray library ?


What are the benefits of using Xarray ?

Xarray offers several advantages for working with labeled multidimensional arrays in Python, making it a powerful tool for data analysis, manipulation, and visualization. Some of the key advantages of Xarray include:

  • Labeled Data Structures: Xarray introduces labeled data structures, such as DataArray and Dataset, which provide metadata along with the array values. This allows users to attach descriptive labels to their data, making it easier to understand and work with complex datasets.

  • Dimensional Awareness: Xarray is designed to handle multidimensional arrays with ease, supporting labeled dimensions (e.g., time, latitude, longitude) and coordinate variables. This enables users to perform operations along specific dimensions without needing to manually handle indices.

  • Integration with Pandas: Xarray seamlessly integrates with Pandas, another popular Python library for data manipulation. Users can convert Xarray objects to Pandas DataFrame or Series and vice versa, facilitating interoperability between the two libraries and enabling a wide range of data analysis techniques.

  • Broadcasting and Alignment: Xarray supports broadcasting and alignment operations, allowing arrays with different dimensions and coordinates to be combined and operated on efficiently. This simplifies the process of working with heterogeneous datasets and performing complex computations across multiple arrays.

  • IO Operations: Xarray provides built-in support for reading and writing data from/to various file formats, including NetCDF, HDF5, GRIB, and CSV. This makes it easy to load data from external sources, manipulate it using Xarray's functionality, and save the results back to disk.

  • Data Visualization: Xarray integrates seamlessly with popular data visualization libraries such as Matplotlib and Seaborn. Users can quickly create informative plots and visualizations of their data, leveraging Xarray's labeled dimensions and coordinate information to produce clear and insightful visualizations.

  • Time Series Analysis: Xarray includes specialized functionality for handling time series data, such as resampling, rolling windows, and group-by operations based on time coordinates. This makes it well-suited for analyzing temporal datasets commonly encountered in climate science, finance, and other domains.

  • Parallel Processing: Xarray supports parallel processing and distributed computing through integration with Dask, a parallel computing library in Python. This allows users to scale their data analysis workflows to larger datasets and take advantage of multi-core processors or distributed computing clusters.

Setting up a Conda environment for reading HDF4 files

To read an HDF4 file and utilize Xarray effectively, let's first set up an environment:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
# Create a new environment named worklab
conda create -n worklab python numpy pandas -c conda-forge

# Activate the newly created environment
conda activate worklab

# Install the netCDF4 library for handling NetCDF files
conda install netcdf4 -c conda-forge

# Install Xarray for working with labeled multidimensional arrays
conda install xarray -c conda-forge

# Additionally, install Zarr for handling chunked, compressed, and parallelizable arrays
conda install zarr -c conda-forge

Explanation of Commands:

  • conda create -n worklab python numpy pandas -c conda-forge: This command creates a new Conda environment named worklab and installs Python, NumPy, and Pandas packages into it. These are foundational libraries for data manipulation and analysis.

  • conda activate worklab: This command activates the newly created worklab environment, ensuring that subsequent installations and commands are executed within this isolated environment.

  • conda install netcdf4 -c conda-forge: This command installs the netCDF4 library, which provides tools for reading and writing NetCDF files, a common format in scientific data.

  • conda install xarray -c conda-forge: This command installs Xarray, a powerful library for working with labeled multidimensional arrays in Python. Xarray integrates seamlessly with netCDF4 and provides advanced functionality for data manipulation and analysis.

  • conda install zarr -c conda-forge: This command installs Zarr, a library for working with chunked, compressed, and parallelizable arrays. While not strictly necessary for reading HDF4 files with Xarray, Zarr complements Xarray's capabilities and can be useful for handling large datasets efficiently.

Reading an HDF4 file in Python using an xarray

To demonstrate how to read an HDF4 file, let's delve into an example by exploring the contents of an AQUA MODIS Level 2 cloud optical product file, hereafter referred to as MYD06_L2.A2007219.2010.006.2014053202546.hdf.

To read an HDF4 file and create an Xarray dataset, we can utilize the following Python code:

1
2
3
import xarray as xr

ds=xr.open_dataset('MYD06_L2.A2007219.2010.006.2014053202546.hdf',engine='netcdf4')

It's noteworthy that Xarray provides support for reading HDF4 files by leveraging the netCDF4 engine. This engine seamlessly handles the conversion from HDF4 to the NetCDF format, allowing for easy integration with Xarray's capabilities.

By executing

1
print(ds)

, you'll obtain a comprehensive overview of the Xarray Dataset ds, including its size, dimensions, coordinates, and data variables. Here's a sample output:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
<xarray.Dataset> Size: 2GB
Dimensions:                                   (Cell_Along_Swath_5km:mod06: 406,
                                               Cell_Across_Swath_5km:mod06: 270,
                                               Band_Number:mod06: 7,
                                               Band_Forcing:mod06: 5,
                                               Band_Ratio:mod06: 5,
                                               Cell_Along_Swath_1km:mod06: 2030,
                                               ...
                                               RadTran_NRE_Liq:mod06: 18,
                                               SPI_nband:mod06: 2,
                                               RFM_nband:mod06: 3,
                                               ACR_nband:mod06: 6,
                                               QA_Parameter_1km:mod06: 9,
                                               fakeDim17: 17)
Dimensions without coordinates: Cell_Along_Swath_5km:mod06,
                                Cell_Across_Swath_5km:mod06, Band_Number:mod06,
                                Band_Forcing:mod06, Band_Ratio:mod06,
                                Cell_Along_Swath_1km:mod06,
                                Cell_Across_Swath_1km:mod06,
                                Cloud_Mask_5km_Num_Bytes:mod06,
                                QA_Parameter_5km:mod06,
                                Cloud_Mask_1km_Num_Bytes:mod06,
                                RadTran_NRE_Ice:mod06, RadTran_NWL:mod06,
                                RadTran_NRE_Liq:mod06, SPI_nband:mod06,
                                RFM_nband:mod06, ACR_nband:mod06,
                                QA_Parameter_1km:mod06, fakeDim17
Data variables: (12/127)
    Latitude                                  (Cell_Along_Swath_5km:mod06, Cell_Across_Swath_5km:mod06) float64 877kB ...
    Longitude                                 (Cell_Along_Swath_5km:mod06, Cell_Across_Swath_5km:mod06) float64 877kB ...
    Scan_Start_Time                           (Cell_Along_Swath_5km:mod06, Cell_Across_Swath_5km:mod06) datetime64[ns] 877kB ...
    Solar_Zenith                              (Cell_Along_Swath_5km:mod06, Cell_Across_Swath_5km:mod06) float64 877kB ...
    Solar_Zenith_Day                          (Cell_Along_Swath_5km:mod06, Cell_Across_Swath_5km:mod06) float64 877kB ...
    Solar_Zenith_Night                        (Cell_Along_Swath_5km:mod06, Cell_Across_Swath_5km:mod06) float64 877kB ...
    ...                                        ...
    Retrieval_Failure_Metric_16               (Cell_Along_Swath_1km:mod06, Cell_Across_Swath_1km:mod06, RFM_nband:mod06) float64 66MB ...
    Retrieval_Failure_Metric_37               (Cell_Along_Swath_1km:mod06, Cell_Across_Swath_1km:mod06, RFM_nband:mod06) float64 66MB ...
    Retrieval_Failure_Metric_1621             (Cell_Along_Swath_1km:mod06, Cell_Across_Swath_1km:mod06, RFM_nband:mod06) float64 66MB ...
    Atm_Corr_Refl                             (Cell_Along_Swath_1km:mod06, Cell_Across_Swath_1km:mod06, ACR_nband:mod06) float64 132MB ...
    Quality_Assurance_1km                     (Cell_Along_Swath_1km:mod06, Cell_Across_Swath_1km:mod06, QA_Parameter_1km:mod06) float64 198MB ...
    Statistics_1km_sds                        (fakeDim17) float32 68B ...
Attributes: (12/14)
    HDFEOSVersion:                     HDFEOS_V2.17
    StructMetadata.0:                  GROUP=SwathStructure\n\tGROUP=SWATH_1\...
    Number_of_Instrument_Scans:        2030
    Maximum_Number_of_1km_Frames:      1354
    history:                           $Id: MOD06_L2.CDL.fs,v 1.13 2013/06/19...
    title:                             MODIS Level 2 Cloud Properties        ...
    ...                                ...
    Clear_Sky_Restoral_Status:         y
    Collection_4_Phase_Used:           n
    Ice_Phase_Forced:                  n
    Water_Phase_Forced:                n
    identifier_product_doi:            10.5067/MODIS/MYD06_L2.006
    identifier_product_doi_authority:  http://dx.doi.org

Note that the type(ds) function returns an Xarray Dataset object:

1
<class 'xarray.core.dataset.Dataset'>

After successfully creating an Xarray containing all the data stored in the file, we can initiate data exploration. To start, we can generate a list of all variable names using:

1
list(ds.variables)

This command returns a list of variable names, showcasing the breadth of data available in the dataset. For instance:

1
['Latitude', 'Longitude', 'Scan_Start_Time', 'Solar_Zenith', 'Solar_Zenith_Day', 'Solar_Zenith_Night', 'Solar_Azimuth', 'Solar_Azimuth_Day', 'Solar_Azimuth_Night', 'Sensor_Zenith', 'Sensor_Zenith_Day', 'Sensor_Zenith_Night', 'Sensor_Azimuth', 'Sensor_Azimuth_Day', 'Sensor_Azimuth_Night', 'Brightness_Temperature', 'Surface_Temperature', 'Surface_Pressure', 'Cloud_Height_Method', 'Cloud_Top_Height', 'Cloud_Top_Height_Nadir', 'Cloud_Top_Height_Nadir_Day', 'Cloud_Top_Height_Nadir_Night', 'Cloud_Top_Pressure', 'Cloud_Top_Pressure_Nadir', 'Cloud_Top_Pressure_Night', 'Cloud_Top_Pressure_Nadir_Night', 'Cloud_Top_Pressure_Day', 'Cloud_Top_Pressure_Nadir_Day', 'Cloud_Top_Temperature', 'Cloud_Top_Temperature_Nadir', 'Cloud_Top_Temperature_Night', 'Cloud_Top_Temperature_Nadir_Night', 'Cloud_Top_Temperature_Day', 'Cloud_Top_Temperature_Nadir_Day', 'Tropopause_Height', 'Cloud_Fraction', 'Cloud_Fraction_Nadir', 'Cloud_Fraction_Night', 'Cloud_Fraction_Nadir_Night', 'Cloud_Fraction_Day', 'Cloud_Fraction_Nadir_Day', 'Cloud_Effective_Emissivity', 'Cloud_Effective_Emissivity_Nadir', 'Cloud_Effective_Emissivity_Night', 'Cloud_Effective_Emissivity_Nadir_Night', 'Cloud_Effective_Emissivity_Day', 'Cloud_Effective_Emissivity_Nadir_Day', 'Cloud_Top_Pressure_Infrared', 'Spectral_Cloud_Forcing', 'Cloud_Top_Pressure_From_Ratios', 'Radiance_Variance', 'Cloud_Phase_Infrared', 'Cloud_Phase_Infrared_Night', 'Cloud_Phase_Infrared_Day', 'Cloud_Phase_Infrared_1km', 'IRP_CTH_Consistency_Flag_1km', 'os_top_flag_1km', 'cloud_top_pressure_1km', 'cloud_top_height_1km', 'cloud_top_temperature_1km', 'cloud_emissivity_1km', 'cloud_top_method_1km', 'surface_temperature_1km', 'cloud_emiss11_1km', 'cloud_emiss12_1km', 'cloud_emiss13_1km', 'cloud_emiss85_1km', 'Cloud_Effective_Radius', 'Cloud_Effective_Radius_PCL', 'Cloud_Effective_Radius_16', 'Cloud_Effective_Radius_16_PCL', 'Cloud_Effective_Radius_37', 'Cloud_Effective_Radius_37_PCL', 'Cloud_Optical_Thickness', 'Cloud_Optical_Thickness_PCL', 'Cloud_Optical_Thickness_16', 'Cloud_Optical_Thickness_16_PCL', 'Cloud_Optical_Thickness_37', 'Cloud_Optical_Thickness_37_PCL', 'Cloud_Effective_Radius_1621', 'Cloud_Effective_Radius_1621_PCL', 'Cloud_Optical_Thickness_1621', 'Cloud_Optical_Thickness_1621_PCL', 'Cloud_Water_Path', 'Cloud_Water_Path_PCL', 'Cloud_Water_Path_1621', 'Cloud_Water_Path_1621_PCL', 'Cloud_Water_Path_16', 'Cloud_Water_Path_16_PCL', 'Cloud_Water_Path_37', 'Cloud_Water_Path_37_PCL', 'Cloud_Effective_Radius_Uncertainty', 'Cloud_Effective_Radius_Uncertainty_16', 'Cloud_Effective_Radius_Uncertainty_37', 'Cloud_Optical_Thickness_Uncertainty', 'Cloud_Optical_Thickness_Uncertainty_16', 'Cloud_Optical_Thickness_Uncertainty_37', 'Cloud_Water_Path_Uncertainty', 'Cloud_Effective_Radius_Uncertainty_1621', 'Cloud_Optical_Thickness_Uncertainty_1621', 'Cloud_Water_Path_Uncertainty_1621', 'Cloud_Water_Path_Uncertainty_16', 'Cloud_Water_Path_Uncertainty_37', 'Above_Cloud_Water_Vapor_094', 'IRW_Low_Cloud_Temperature_From_COP', 'Cloud_Phase_Optical_Properties', 'Cloud_Multi_Layer_Flag', 'Cirrus_Reflectance', 'Cirrus_Reflectance_Flag', 'Cloud_Mask_5km', 'Quality_Assurance_5km', 'Cloud_Mask_1km', 'Extinction_Efficiency_Ice', 'Asymmetry_Parameter_Ice', 'Single_Scatter_Albedo_Ice', 'Extinction_Efficiency_Liq', 'Asymmetry_Parameter_Liq', 'Single_Scatter_Albedo_Liq', 'Cloud_Mask_SPI', 'Retrieval_Failure_Metric', 'Retrieval_Failure_Metric_16', 'Retrieval_Failure_Metric_37', 'Retrieval_Failure_Metric_1621', 'Atm_Corr_Refl', 'Quality_Assurance_1km', 'Statistics_1km_sds']

It's noteworthy that the dataset comprises 127 variables, as indicated by:

1
len(list(ds))

Moving forward, if we wish to extract a specific variable, such as 'Cloud_Effective_Radius', we can accomplish this by:

1
ds['Cloud_Effective_Radius']

This returns a DataArray object, providing details such as its dimensions, size, and attributes:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
<xarray.DataArray 'Cloud_Effective_Radius' (Cell_Along_Swath_1km:mod06: 2030,
                                            Cell_Across_Swath_1km:mod06: 1354)> Size: 22MB
[2748620 values with dtype=float64]
Dimensions without coordinates: Cell_Along_Swath_1km:mod06,
                                Cell_Across_Swath_1km:mod06
Attributes:
    valid_range:                 [    0 10000]
    long_name:                   Cloud Particle Effective Radius two-channel ...
    units:                       micron
    Parameter_Type:              Output
    Cell_Along_Swath_Sampling:   [   1 2030    1]
    Cell_Across_Swath_Sampling:  [   1 1354    1]
    Geolocation_Pointer:         External MODIS geolocation product

To access the attributes of this variable, we can use:

1
ds['Cloud_Effective_Radius'].attrs

This returns a dictionary containing attributes such as 'valid_range', 'long_name', and 'units'. For instance:

1
{'valid_range': array([    0, 10000], dtype=int16), 'long_name': 'Cloud Particle Effective Radius two-channel retrieval using band 7(2.1um) and either band 1(0.65um), 2(0.86um), or 5(1.2um)  (specified in Quality_Assurance_1km)from best points: not failed in any way, not marked for clear sky restoral', 'units': 'micron', 'Parameter_Type': 'Output', 'Cell_Along_Swath_Sampling': array([   1, 2030,    1], dtype=int32), 'Cell_Across_Swath_Sampling': array([   1, 1354,    1], dtype=int32), 'Geolocation_Pointer': 'External MODIS geolocation product'}

To extract a specific attribute, such as 'units', we can utilize:

1
ds['Cloud_Effective_Radius'].attrs['units']

This retrieves the unit information:

1
'micron'

Furthermore, to obtain the values of the 'Cloud_Effective_Radius' variable, we can use:

1
ds['Cloud_Effective_Radius'].values

This returns an array containing the actual data values:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
array([[        nan,         nan, 34.93999922, ..., 29.73999934,
                nan,         nan],
       [        nan,         nan, 11.99999973, ...,         nan,
                nan,         nan],
       [        nan,         nan,  9.99999978, ...,         nan,
                nan,         nan],
       ...,
       [        nan,         nan, 15.89999964, ..., 11.89999973,
                nan,         nan],
       [        nan,         nan, 18.87999958, ..., 11.29999975,
                nan,         nan],
       [        nan,         nan, 18.60999958, ..., 11.29999975,
                nan,         nan]])

These operations enable us to navigate and extract valuable information from the Xarray Dataset efficiently, facilitating subsequent data analysis and visualization tasks.

References

Links Site
xarray.open_dataset docs.xarray.dev
xarray.DataArray.values docs.xarray.dev
xarray.Dataset.attrs docs.xarray.dev
xarray Installation docs.xarray.dev