Examples of how to store a large dataset in an HDF5 file using Python:
Create arrays of data
Let's consider the following matrices of integers (dtype='i'):
>>> import numpy as np
>>> A = np.random.randint(100, size=(4,4))
>>> A
array([[ 1, 99, 79, 46],
       [69,  4, 29, 60],
       [56, 94, 16, 16],
       [52, 13, 37, 86]])
and a three-dimensional array B of dimensions (5,3,3):
>>> B = np.random.randint(100, size=(5,3,3))
>>> B
array([[[60, 89, 24],
        [ 4, 98, 48],
        [19, 39, 69]],

       [[72,  8, 80],
        [70, 96, 81],
        [50, 78, 85]],

       [[21, 91,  5],
        [43, 18, 58],
        [93, 48, 72]],

       [[25,  0, 45],
        [ 0, 21,  4],
        [ 5, 60, 39]],

       [[73, 28, 87],
        [97, 89, 87],
        [ 4, 76, 62]]])
Create an HDF5 file
Now, let's store those matrices in an HDF5 file. As a first step, let's import the h5py module (note: h5py is included by default in the Anaconda distribution):
>>> import h5py
Create an HDF5 file (for example, one called data.hdf5):
>>> f1 = h5py.File("data.hdf5", "w")
Save data in the HDF5 file
Store matrix A in the HDF5 file:
>>> dset1 = f1.create_dataset("dataset_01", (4,4), dtype='i', data=A)
Store array B in the HDF5 file:
>>> dset2 = f1.create_dataset("dataset_02", (5,3,3), dtype='i', data=B)
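Because `data` already carries a shape and a dtype, both arguments can be left out and h5py takes them from the array. A minimal sketch, assuming a throwaway file in a temporary directory (the file name `inferred.hdf5` is hypothetical and chosen so the example does not touch data.hdf5):

```python
import os
import tempfile

import h5py
import numpy as np

A = np.random.randint(100, size=(4, 4))

# Hypothetical scratch file, separate from the data.hdf5 used in this tutorial
path = os.path.join(tempfile.mkdtemp(), "inferred.hdf5")

with h5py.File(path, "w") as f:
    # No explicit shape or dtype: both are taken from the NumPy array
    dset = f.create_dataset("dataset_01", data=A)
    inferred_shape = dset.shape
    inferred_dtype = dset.dtype

print(inferred_shape, inferred_dtype)
```

Note that the inferred dtype follows the array, so it may be a 64-bit integer rather than the 32-bit integer requested with dtype='i'.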
Add metadata
Attach metadata to the first dataset as attributes:
>>> dset1.attrs['scale'] = 0.01
>>> dset1.attrs['offset'] = 15
Close the file
>>> f1.close()
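The explicit close() call can be avoided by opening the file in a with block, which closes it automatically when the block exits. A minimal sketch of the same write steps, assuming a hypothetical scratch file named `with_example.hdf5`:

```python
import os
import tempfile

import h5py
import numpy as np

A = np.random.randint(100, size=(4, 4))

# Hypothetical scratch file, separate from the data.hdf5 used in this tutorial
path = os.path.join(tempfile.mkdtemp(), "with_example.hdf5")

# The file is closed when the block exits, even if an exception is raised
with h5py.File(path, "w") as f:
    dset = f.create_dataset("dataset_01", data=A)
    dset.attrs["scale"] = 0.01
    dset.attrs["offset"] = 15

# A closed h5py.File object evaluates as False
file_is_open = bool(f)

# Reopen read-only to confirm the dataset was written
with h5py.File(path, "r") as f:
    names = list(f.keys())

print(file_is_open, names)
```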
Read an HDF5 file
Let's retrieve our data from the HDF5 file. Open the file in read mode:
>>> f2 = h5py.File('data.hdf5', 'r')
Print dataset names:
>>> list(f2.keys())
['dataset_01', 'dataset_02']
Retrieve the data
>>> dset1 = f2['dataset_01']
>>> data = dset1[:]
>>> data
array([[ 1, 99, 79, 46],
       [69,  4, 29, 60],
       [56, 94, 16, 16],
       [52, 13, 37, 86]], dtype=int32)
>>> list(dset1.attrs.keys())
['scale', 'offset']
>>> dset1.attrs['scale']
0.01
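Since attrs behaves like a dictionary, all attributes of a dataset can be collected in one pass. A minimal sketch, assuming a hypothetical scratch file `attrs_example.hdf5` that is written first so the example is self-contained:

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file, separate from the data.hdf5 used in this tutorial
path = os.path.join(tempfile.mkdtemp(), "attrs_example.hdf5")

# Write a small dataset with two attributes
with h5py.File(path, "w") as f:
    dset = f.create_dataset("dataset_01", data=np.arange(16).reshape(4, 4))
    dset.attrs["scale"] = 0.01
    dset.attrs["offset"] = 15

# Read every attribute back into a plain dictionary
with h5py.File(path, "r") as f:
    metadata = {name: value for name, value in f["dataset_01"].attrs.items()}

print(metadata)
```

The values come back as NumPy scalars, so convert them with float() or int() if plain Python types are needed.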
Examine the second dataset:
>>> dset2 = f2['dataset_02']
>>> dset2
<HDF5 dataset "dataset_02": shape (5, 3, 3), type "<i4">
>>> data = dset2[:]
>>> data
array([[[60, 89, 24],
        [ 4, 98, 48],
        [19, 39, 69]],

       [[72,  8, 80],
        [70, 96, 81],
        [50, 78, 85]],

       [[21, 91,  5],
        [43, 18, 58],
        [93, 48, 72]],

       [[25,  0, 45],
        [ 0, 21,  4],
        [ 5, 60, 39]],

       [[73, 28, 87],
        [97, 89, 87],
        [ 4, 76, 62]]], dtype=int32)
>>> type(data)
<class 'numpy.ndarray'>
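Indexing with [:] loads the whole dataset into memory. For a large dataset it can be preferable to slice the dataset object directly, so that only the requested part is read from disk. A minimal sketch, assuming a hypothetical scratch file `slice_example.hdf5` written first so the example is self-contained:

```python
import os
import tempfile

import h5py
import numpy as np

B = np.random.randint(100, size=(5, 3, 3))

# Hypothetical scratch file, separate from the data.hdf5 used in this tutorial
path = os.path.join(tempfile.mkdtemp(), "slice_example.hdf5")

with h5py.File(path, "w") as f:
    f.create_dataset("dataset_02", data=B)

with h5py.File(path, "r") as f:
    dset2 = f["dataset_02"]
    # Only the selected elements are read; each result is a NumPy array
    first_block = dset2[0]        # first 3x3 block
    first_rows = dset2[1:3, 0, :] # first row of blocks 1 and 2

print(first_block.shape, first_rows.shape)
```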
Example using a pandas data frame
Create a data frame with pandas:
import pandas as pd
import numpy as np

data = np.arange(1,13)
data = data.reshape(3,4)
columns = ['Home','Car','Sport','Food']
index = ['Alice','Bob','Emma']
df = pd.DataFrame(data=data,index=index,columns=columns)
Then store the data, together with some metadata, using HDFStore (see "Save additional attributes in Pandas Dataframe" in the references):
store = pd.HDFStore('data.hdf5')
store.put('dataset_01', df)
metadata = {'scale':0.1,'offset':15}
store.get_storer('dataset_01').attrs.metadata = metadata
store.close()
Read the file:
import pandas as pd

with pd.HDFStore('data.hdf5') as store:
    data = store['dataset_01']
    metadata = store.get_storer('dataset_01').attrs.metadata

print(data)
print(metadata)
which prints:
       Home  Car  Sport  Food
Alice     1    2      3     4
Bob       5    6      7     8
Emma      9   10     11    12
{'scale': 0.1, 'offset': 15}
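When no extra metadata is needed, the same round trip can also be written with DataFrame.to_hdf and pandas.read_hdf (the latter is listed in the references). A minimal sketch, assuming PyTables is installed and using a hypothetical scratch file `frame_example.hdf5`:

```python
import os
import tempfile

import numpy as np
import pandas as pd

data = np.arange(1, 13).reshape(3, 4)
df = pd.DataFrame(
    data=data,
    index=["Alice", "Bob", "Emma"],
    columns=["Home", "Car", "Sport", "Food"],
)

# Hypothetical scratch file, separate from the data.hdf5 used in this tutorial
path = os.path.join(tempfile.mkdtemp(), "frame_example.hdf5")

# mode="w" starts from an empty file instead of appending to an existing one
df.to_hdf(path, key="dataset_01", mode="w")

# Read the frame back by its key
restored = pd.read_hdf(path, key="dataset_01")

print(restored)
```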
References
| Links | Site |
|---|---|
| Quick Start Guide | docs.h5py.org |
| How to store a dataframe using Pandas | stackoverflow |
| Attributes | docs.h5py |
| How to use HDF5 files in Python | pythonforthela |
| How to list dataset in h5py file? | stackoverflow |
| How to add meta_data to Pandas dataframe? | stackoverflow |
| Adding meta-information/metadata to pandas DataFrame | stackoverflow |
| Using HDFStore | riptutorial |
| Save additional attributes in Pandas Dataframe | stackoverflow |
| pandas.read_hdf | pandas.pydata.org |
| How to: Get the DataFrame metadata | kite.com |
