How to save a large dataset in an HDF5 file using Python (Quick Guide)

Published: October 22, 2019

Examples of how to store a large dataset in an HDF5 file using Python:

Create arrays of data

Let's consider the following arrays of integers (they will be stored below with dtype='i'), starting with a matrix A of dimensions (4,4):

>>> import numpy as np
>>> A = np.random.randint(100, size=(4,4))
>>> A
array([[ 1, 99, 79, 46],
       [69,  4, 29, 60],
       [56, 94, 16, 16],
       [52, 13, 37, 86]])

and a three-dimensional array B of dimensions (5,3,3):

>>> B = np.random.randint(100, size=(5,3,3))
>>> B
array([[[60, 89, 24],
        [ 4, 98, 48],
        [19, 39, 69]],

       [[72,  8, 80],
        [70, 96, 81],
        [50, 78, 85]],

       [[21, 91,  5],
        [43, 18, 58],
        [93, 48, 72]],

       [[25,  0, 45],
        [ 0, 21,  4],
        [ 5, 60, 39]],

       [[73, 28, 87],
        [97, 89, 87],
        [ 4, 76, 62]]])

Create an HDF5 file

Now, let's store those arrays in an HDF5 file. First, import the h5py module (note: h5py is installed by default in Anaconda):

>>> import h5py

Create an HDF5 file (for example, data.hdf5). The "w" mode creates the file and overwrites any existing file with the same name:

>>> f1 = h5py.File("data.hdf5", "w")

Save data in the HDF5 file

Store matrix A in the HDF5 file:

>>> dset1 = f1.create_dataset("dataset_01", (4,4), dtype='i', data=A)

Store array B in the HDF5 file:

>>> dset2 = f1.create_dataset("dataset_02", (5,3,3), dtype='i', data=B)
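For a dataset that is too large to build in memory at once, one option is to create a chunked, compressed dataset and fill it block by block. The following is a minimal sketch under stated assumptions: the dataset name "big", the sizes, the block length, and the temporary file path are all hypothetical and are not part of the example above.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file in a temporary directory, so data.hdf5 is left untouched
path = os.path.join(tempfile.mkdtemp(), "example_large.hdf5")

# Illustrative sizes (assumptions): 1000 rows written 100 rows at a time
n_rows, n_cols, block = 1000, 50, 100

with h5py.File(path, "w") as f:
    # chunks sets the on-disk block layout; compression="gzip" compresses each chunk
    dset = f.create_dataset(
        "big",
        shape=(n_rows, n_cols),
        dtype="i4",
        chunks=(block, n_cols),
        compression="gzip",
    )
    for start in range(0, n_rows, block):
        # Each block stands in for data produced or loaded piece by piece
        rows = np.arange(start, start + block, dtype="i4").reshape(block, 1)
        dset[start:start + block, :] = np.tile(rows, (1, n_cols))

# Reopen the file to check what was stored
with h5py.File(path, "r") as f:
    stored_shape = f["big"].shape
    stored_chunks = f["big"].chunks
    stored_compression = f["big"].compression
    first_col = f["big"][:, 0]

print(stored_shape, stored_chunks, stored_compression)
```

Only one block is held in memory at a time, and matching the chunk shape to the write pattern keeps each write aligned with a single chunk on disk.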

Add metadata

Metadata can be attached to a dataset through its attrs property:

>>> dset1.attrs['scale'] = 0.01
>>> dset1.attrs['offset'] = 15

Close the file

>>> f1.close()
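The create, write, and close steps can also be written with a context manager, which closes the file automatically even if an error occurs. This is a minimal sketch: the temporary file path is an assumption made so that the example does not overwrite data.hdf5, and the shape and dtype are inferred from the array rather than passed explicitly.

```python
import os
import tempfile

import h5py
import numpy as np

A = np.random.randint(100, size=(4, 4))

# Hypothetical path in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "example_with.hdf5")

# The file is closed automatically when the with block exits
with h5py.File(path, "w") as f:
    dset = f.create_dataset("dataset_01", data=A)
    dset.attrs["scale"] = 0.01
    dset.attrs["offset"] = 15

# Reopen in read mode and copy everything needed out of the file
with h5py.File(path, "r") as f:
    restored = f["dataset_01"][:]
    scale = f["dataset_01"].attrs["scale"]
    offset = f["dataset_01"].attrs["offset"]

print(np.array_equal(A, restored))
```

Note that the data and attributes are copied into regular variables inside the with block, since a dataset object can no longer be read once its file is closed.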

Read an HDF5 file

Let's retrieve our data from the HDF5 file. Open the file in read mode:

>>> f2 = h5py.File('data.hdf5', 'r')

Print dataset names:

>>> list(f2.keys())
['dataset_01', 'dataset_02']

Retrieve the data

>>> dset1 = f2['dataset_01']

>>> data = dset1[:]
>>> data
array([[ 1, 99, 79, 46],
       [69,  4, 29, 60],
       [56, 94, 16, 16],
       [52, 13, 37, 86]], dtype=int32)

>>> list(dset1.attrs.keys())
['scale', 'offset']

>>> dset1.attrs['scale']
0.01
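To collect all attributes of a dataset at once, the attrs object can be converted to a regular dictionary. The sketch below is self-contained and writes its own small file first; the temporary path and the placeholder array are assumptions for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "example_attrs.hdf5")

with h5py.File(path, "w") as f:
    dset = f.create_dataset("dataset_01", data=np.zeros((4, 4), dtype="i4"))
    dset.attrs["scale"] = 0.01
    dset.attrs["offset"] = 15

with h5py.File(path, "r") as f:
    # attrs behaves like a mapping, so dict() copies every name/value pair
    attrs = dict(f["dataset_01"].attrs)

for name, value in attrs.items():
    print(name, value)
```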

Retrieve the second dataset:

>>> dset2 = f2['dataset_02']
>>> dset2
<HDF5 dataset "dataset_02": shape (5, 3, 3), type "<i4">
>>> data = dset2[:]
>>> data
array([[[60, 89, 24],
        [ 4, 98, 48],
        [19, 39, 69]],

       [[72,  8, 80],
        [70, 96, 81],
        [50, 78, 85]],

       [[21, 91,  5],
        [43, 18, 58],
        [93, 48, 72]],

       [[25,  0, 45],
        [ 0, 21,  4],
        [ 5, 60, 39]],

       [[73, 28, 87],
        [97, 89, 87],
        [ 4, 76, 62]]], dtype=int32)
>>> type(data)
<class 'numpy.ndarray'>
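With a large dataset, reading everything with [:] may not be desirable. An h5py dataset can be sliced like a NumPy array, and only the requested part is read from disk. The sketch below uses a small stand-in array and a temporary file path, both of which are assumptions for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Small stand-in for a large array, with predictable values
B = np.arange(45).reshape(5, 3, 3)

# Hypothetical file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "example_slice.hdf5")

with h5py.File(path, "w") as f:
    f.create_dataset("dataset_02", data=B)

with h5py.File(path, "r") as f:
    dset = f["dataset_02"]
    # Shape and dtype are available without reading any data
    shape, dtype = dset.shape, dset.dtype
    # Read only the first two (3,3) blocks
    first_two = dset[0:2]
    # Read a single element
    corner = dset[3, 1, 2]

print(shape, first_two.shape, corner)
```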

Example using a pandas data frame

Create a data frame with pandas:

import pandas as pd
import numpy as np

data = np.arange(1,13)
data = data.reshape(3,4)

columns = ['Home','Car','Sport','Food']
index = ['Alice','Bob','Emma']

df = pd.DataFrame(data=data,index=index,columns=columns)

Then store the data using HDFStore, which requires the PyTables package (see Save additional attributes in Pandas Dataframe):

store = pd.HDFStore('data.hdf5')

store.put('dataset_01', df)

metadata = {'scale':0.1,'offset':15}

store.get_storer('dataset_01').attrs.metadata = metadata

store.close()

Read the file:

import pandas as pd

with pd.HDFStore('data.hdf5') as store:
    data = store['dataset_01']
    metadata = store.get_storer('dataset_01').attrs.metadata

print(data)

print(metadata)

This prints:

       Home  Car  Sport  Food
Alice     1    2      3     4
Bob       5    6      7     8
Emma      9   10     11    12
{'scale': 0.1, 'offset': 15}
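When no extra metadata is needed, a shorter route is the DataFrame.to_hdf method together with pd.read_hdf. This is a minimal sketch, assuming PyTables is installed; the temporary file path is hypothetical and is used so that data.hdf5 is not modified.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=np.arange(1, 13).reshape(3, 4),
    index=["Alice", "Bob", "Emma"],
    columns=["Home", "Car", "Sport", "Food"],
)

# Hypothetical file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "example_pandas.hdf5")

# mode="w" creates a new file; key is the name of the stored object
df.to_hdf(path, key="dataset_01", mode="w")

# Read the data frame back by its key
restored = pd.read_hdf(path, key="dataset_01")

print(restored)
```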

References