Examples of how to store a large dataset in an HDF5 file using Python:
Create arrays of data
Let's consider two arrays of random integers. Matrix A of dimensions (4,4):
>>> import numpy as np
>>> A = np.random.randint(100, size=(4,4))
>>> A
array([[ 1, 99, 79, 46],
       [69,  4, 29, 60],
       [56, 94, 16, 16],
       [52, 13, 37, 86]])
Matrix B of dimensions (5,3,3):
>>> B = np.random.randint(100, size=(5,3,3))
>>> B
array([[[60, 89, 24],
        [ 4, 98, 48],
        [19, 39, 69]],

       [[72,  8, 80],
        [70, 96, 81],
        [50, 78, 85]],

       [[21, 91,  5],
        [43, 18, 58],
        [93, 48, 72]],

       [[25,  0, 45],
        [ 0, 21,  4],
        [ 5, 60, 39]],

       [[73, 28, 87],
        [97, 89, 87],
        [ 4, 76, 62]]])
Create an HDF5 file
Now, let's store those matrices in an HDF5 file. First, import the h5py module (note: h5py is included by default in Anaconda):
>>> import h5py
Create an HDF5 file (for example called data.hdf5). The "w" mode creates the file and overwrites any existing file with the same name:
>>> f1 = h5py.File("data.hdf5", "w")
Save data in the hdf5 file
Store matrix A in the hdf5 file:
>>> dset1 = f1.create_dataset("dataset_01", (4,4), dtype='i', data=A)
Store matrix B in the hdf5 file:
>>> dset2 = f1.create_dataset("dataset_02", (5,3,3), dtype='i', data=B)
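When data= is passed, h5py can work out the shape and dtype from the array, so the explicit (4,4) and dtype='i' arguments are optional. The following is a minimal sketch of that behaviour; the file name inferred.hdf5 and the dataset names "inferred" and "explicit" are hypothetical and only used for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

A = np.random.randint(100, size=(4, 4))

# Hypothetical scratch file, so the example does not touch data.hdf5
path = os.path.join(tempfile.mkdtemp(), "inferred.hdf5")

with h5py.File(path, "w") as f:
    # Shape and dtype are taken from the array itself
    inferred = f.create_dataset("inferred", data=A)
    # An explicit dtype converts the values on write
    # ('i' is a C int, typically a 32-bit integer)
    explicit = f.create_dataset("explicit", data=A, dtype="i")

    inferred_shape = inferred.shape
    inferred_dtype = inferred.dtype
    explicit_dtype = explicit.dtype

print(inferred_shape, inferred_dtype, explicit_dtype)
```

Passing dtype explicitly is mostly useful to control the size of the file on disk, since the default integer type of a NumPy array depends on the platform.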
Add metadata
Metadata can be attached to a dataset through its attributes (attrs):
>>> dset1.attrs['scale'] = 0.01
>>> dset1.attrs['offset'] = 15
Close the file
>>> f1.close()
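An alternative to calling close() by hand is to open the file in a with block, which closes it automatically even if an error occurs while writing. A minimal sketch, assuming a hypothetical scratch file data_with.hdf5:

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file, so the example does not touch data.hdf5
path = os.path.join(tempfile.mkdtemp(), "data_with.hdf5")

with h5py.File(path, "w") as f:
    dset = f.create_dataset("dataset_01", data=np.arange(16).reshape(4, 4))
    dset.attrs["scale"] = 0.01
    dset.attrs["offset"] = 15

# Leaving the with block closes the file; a closed File object is falsy
file_is_closed = not f
print(file_is_closed)
```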
Read an HDF5 file
Let's retrieve our data from the HDF5 file. Open the file in read mode:
>>> f2 = h5py.File('data.hdf5', 'r')
Print dataset names:
>>> list(f2.keys())
['dataset_01', 'dataset_02']
Retrieve the data
>>> dset1 = f2['dataset_01']
>>> data = dset1[:]
>>> data
array([[ 1, 99, 79, 46],
       [69,  4, 29, 60],
       [56, 94, 16, 16],
       [52, 13, 37, 86]], dtype=int32)
>>> list(dset1.attrs.keys())
['scale', 'offset']
>>> dset1.attrs['scale']
0.01
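The attributes are just stored values; h5py does not apply them to the data. The sketch below shows one way they might be used after reading. The convention value = raw * scale + offset is an assumption made for illustration (the text does not say how scale and offset are meant to be applied), and the file name scaled.hdf5 is hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file with a small dataset and the two attributes
path = os.path.join(tempfile.mkdtemp(), "scaled.hdf5")
raw = np.array([[1, 99], [69, 4]], dtype="i")

with h5py.File(path, "w") as f:
    dset = f.create_dataset("dataset_01", data=raw)
    dset.attrs["scale"] = 0.01
    dset.attrs["offset"] = 15

with h5py.File(path, "r") as f:
    dset = f["dataset_01"]
    # Copy the attributes into a plain dict so they outlive the open file
    attrs = dict(dset.attrs)
    # Assumed convention (not stated in the text): value = raw * scale + offset
    scaled = dset[:] * attrs["scale"] + attrs["offset"]

print(attrs)
print(scaled)
```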
Examine the second dataset:
>>> dset2 = f2['dataset_02']
>>> dset2
<HDF5 dataset "dataset_02": shape (5, 3, 3), type "<i4">
>>> data = dset2[:]
>>> data
array([[[60, 89, 24],
        [ 4, 98, 48],
        [19, 39, 69]],

       [[72,  8, 80],
        [70, 96, 81],
        [50, 78, 85]],

       [[21, 91,  5],
        [43, 18, 58],
        [93, 48, 72]],

       [[25,  0, 45],
        [ 0, 21,  4],
        [ 5, 60, 39]],

       [[73, 28, 87],
        [97, 89, 87],
        [ 4, 76, 62]]], dtype=int32)
>>> type(data)
<class 'numpy.ndarray'>
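Indexing with [:] loads the whole dataset into memory. For a large dataset it is often enough to read only a part of it, since an h5py dataset accepts NumPy-style slices and returns only the requested elements. A minimal sketch, assuming a hypothetical scratch file partial.hdf5 and an array filled with arange values instead of random numbers so the result is reproducible:

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file, so the example does not touch data.hdf5
path = os.path.join(tempfile.mkdtemp(), "partial.hdf5")
B = np.arange(45, dtype="i").reshape(5, 3, 3)

with h5py.File(path, "w") as f:
    f.create_dataset("dataset_02", data=B)

with h5py.File(path, "r") as f:
    dset2 = f["dataset_02"]
    # Read a single (3, 3) block instead of the full (5, 3, 3) array
    first_block = dset2[0]
    # Integers and slices can be mixed, as with a NumPy array
    corner = dset2[1:3, 0, :2]

print(first_block.shape)
print(corner)
```

Both results are ordinary NumPy arrays that remain usable after the file is closed.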
Example using a pandas data frame
Create a data frame with pandas:
import pandas as pd
import numpy as np
data = np.arange(1,13)
data = data.reshape(3,4)
columns = ['Home','Car','Sport','Food']
index = ['Alice','Bob','Emma']
df = pd.DataFrame(data=data,index=index,columns=columns)
Then store the data frame, together with a metadata dictionary, using HDFStore
(see "Save additional attributes in Pandas Dataframe" in the references):
store = pd.HDFStore('data.hdf5')
store.put('dataset_01', df)
metadata = {'scale':0.1,'offset':15}
store.get_storer('dataset_01').attrs.metadata = metadata
store.close()
Read the file:
import pandas as pd
with pd.HDFStore('data.hdf5') as store:
    data = store['dataset_01']
    metadata = store.get_storer('dataset_01').attrs.metadata
print(data)
print(metadata)
Output:
       Home  Car  Sport  Food
Alice     1    2      3     4
Bob       5    6      7     8
Emma      9   10     11    12
{'scale': 0.1, 'offset': 15}
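When no extra metadata is needed, the same round trip can be written more compactly with DataFrame.to_hdf and pandas.read_hdf. This is a minimal sketch, assuming the PyTables package (tables) is installed, as it is for HDFStore; the file name frame.hdf5 is hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=np.arange(1, 13).reshape(3, 4),
    index=['Alice', 'Bob', 'Emma'],
    columns=['Home', 'Car', 'Sport', 'Food'],
)

# Hypothetical scratch file, so the example does not touch data.hdf5
path = os.path.join(tempfile.mkdtemp(), "frame.hdf5")

# mode="w" starts a fresh file; key names the object inside the file
df.to_hdf(path, key="dataset_01", mode="w")

# Read the data frame back under the same key
restored = pd.read_hdf(path, key="dataset_01")
print(restored)
```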
References
Links | Site |
---|---|
Quick Start Guide | docs.h5py.org |
How to store a dataframe using Pandas | stackoverflow |
Attributes | docs.h5py |
How to use HDF5 files in Python | pythonforthela |
How to list dataset in h5py file? | stackoverflow |
How to add meta_data to Pandas dataframe? | stackoverflow |
Adding meta-information/metadata to pandas DataFrame | stackoverflow |
Using HDFStore | riptutorial |
Save additional attributes in Pandas Dataframe | stackoverflow |
pandas.read_hdf | pandas.pydata.org |
How to: Get the DataFrame metadata | kite.com |