How to save and compress a numpy array efficiently?


Introduction

A numpy array is a data structure in Python that is used to store large amounts of homogeneous data. It can efficiently handle mathematical operations and is widely used in scientific computing, machine learning, and data analysis.

Numpy arrays are faster and more memory-efficient than traditional Python lists, and they offer a wide range of functions for performing operations on the array elements.
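A quick way to see the memory difference for yourself (the exact byte counts below are illustrative and vary with platform and Python version):

```python
import sys
import numpy as np

values = list(range(1000))
arr = np.arange(1000, dtype=np.int64)

# A list stores a pointer per element plus a boxed int object for each value;
# the array stores the raw 8-byte integers contiguously.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
array_bytes = arr.nbytes

print(list_bytes, array_bytes)
```

On CPython the list side typically comes out several times larger than the 8000 bytes used by the array.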

In this tutorial, we will be discussing how to save and compress a numpy array efficiently in Python.

Creating a NumPy array

To showcase the procedure of saving and compressing a NumPy array, let's start by generating a 2D array with synthetic data:

import numpy as np

data = np.random.randint(0, 2, (61198, 8))
print(data)

The code above generates a two-dimensional array of random integers equal to 0 or 1 (the upper bound of randint() is exclusive). For example:

[[1 0 0 ... 1 1 0]
 [0 0 0 ... 0 0 0]
 [1 0 0 ... 1 1 0]
 ...
 [0 1 1 ... 0 1 0]
 [1 1 1 ... 0 0 1]
 [0 1 0 ... 1 1 0]]

To facilitate comparisons, we will begin by saving our data in an uncompressed file using the savetxt() function from the numpy library:

np.savetxt('data.txt', data, fmt='%i', delimiter=';')

On my computer, this creates a file of approximately 979 KB.
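You can check the size programmatically with os.path.getsize(). Since every value here is a single digit, each of the 61198 rows takes exactly 16 bytes:

```python
import os
import numpy as np

data = np.random.randint(0, 2, (61198, 8))
np.savetxt('data.txt', data, fmt='%i', delimiter=';')

# 8 digits + 7 semicolons + 1 newline = 16 bytes per row
size_kb = os.path.getsize('data.txt') / 1000
print(f'{size_kb:.0f} KB')  # 979 KB
```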

Saving and compressing a numpy array

Using gzip module

Saving a numpy array is essential when you need to keep the data for future use or share it with others; once the data is safely on disk, it can also be dropped from memory and reloaded later.

Compressing the saved array further reduces its size on disk, making it cheaper to store and faster to transfer over a network. This is especially useful for large datasets that would otherwise take up a lot of space.

One possible solution for compressing a numpy array is to utilize the gzip module in Python. This module allows for efficient compression of data, thereby reducing the size of the array when it is saved.

To use it, import the gzip module, open a GzipFile for writing, and pass the file handle to np.save():

import gzip

f = gzip.GzipFile("data_02.npy.gz", "w")
np.save(file=f, arr=data)
f.close()

On my computer, this generates a file of around 115 KB, a significant improvement over the previous 979 KB.
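Note that NumPy also ships a built-in alternative, np.savez_compressed(), which writes a zlib-compressed .npz archive and avoids the explicit gzip handling. A minimal sketch (the file name data_03.npz is just an example):

```python
import numpy as np

data = np.random.randint(0, 2, (61198, 8))

# writes a zip archive whose members are DEFLATE-compressed .npy files
np.savez_compressed('data_03.npz', data=data)

# np.load returns a dict-like NpzFile keyed by the names passed above
loaded = np.load('data_03.npz')['data']
print(np.array_equal(data, loaded))  # True
```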

Accessing compressed file

To access the compressed file, simply follow these instructions:

f = gzip.GzipFile('data_02.npy.gz', "r")
data_uploaded = np.load(f)
print(data_uploaded)
f.close()

The above code will print:

[[1 0 0 ... 1 1 0]
 [0 0 0 ... 0 0 0]
 [1 0 0 ... 1 1 0]
 ...
 [0 1 1 ... 0 1 0]
 [1 1 1 ... 0 0 1]
 [0 1 0 ... 1 1 0]]
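After loading, it is worth verifying that the round trip preserved the data exactly. A small check, using with-blocks so the file handles are closed automatically:

```python
import gzip
import numpy as np

data = np.random.randint(0, 2, (61198, 8))

with gzip.GzipFile('data_02.npy.gz', 'w') as f:
    np.save(file=f, arr=data)

with gzip.GzipFile('data_02.npy.gz', 'r') as f:
    data_uploaded = np.load(f)

# shape and values should both survive compression
print(data_uploaded.shape == data.shape)    # True
print(np.array_equal(data, data_uploaded))  # True
```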

Saving and compressing a pandas dataframe

For comparison, let's store the same data in a pandas dataframe and then save it to a file:

import pandas as pd

df = pd.DataFrame(data)

Using to_csv()

df.to_csv('df.csv.gz', compression='gzip')

On my computer, this generates a file of around 239 KB.
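To read it back, pd.read_csv() handles the gzip decompression. Keep in mind that to_csv() above also wrote the row index as a first column (no index=False was passed), so it needs to be skipped on load:

```python
import numpy as np
import pandas as pd

data = np.random.randint(0, 2, (61198, 8))
df = pd.DataFrame(data)
df.to_csv('df.csv.gz', compression='gzip')

# index_col=0 drops the index column written by to_csv
df_loaded = pd.read_csv('df.csv.gz', compression='gzip', index_col=0)
print(df_loaded.shape)  # (61198, 8)
```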

Using HDFStore

An alternative method for saving a dataframe is to store it in an HDF5 file.

store = pd.HDFStore('data.hdf5', complevel=9, complib='blosc')
store.put('dataset', df)

# optionally attach some metadata to the stored dataset
metadata = {}
store.get_storer('dataset').attrs.metadata = metadata

store.close()

On my computer, this generates a file of around 307 KB.
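Reading the dataframe back can be done in one call with pd.read_hdf() (this, like HDFStore itself, requires the PyTables package to be installed):

```python
import numpy as np
import pandas as pd

data = np.random.randint(0, 2, (61198, 8))
df = pd.DataFrame(data)

store = pd.HDFStore('data.hdf5', complevel=9, complib='blosc')
store.put('dataset', df)
store.close()

# read_hdf opens the file, reads the given key and closes it again
df_loaded = pd.read_hdf('data.hdf5', 'dataset')
print(df_loaded.shape)  # (61198, 8)
```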

Saving an array with dimensions greater than two

If you have an array that consists of more than two dimensions, for example:

import numpy as np

data = np.random.randint(0, 2, (61198, 8, 3))
print(data.shape)

which returns

(61198, 8, 3)

attempting to save it with a function that only accepts one- or two-dimensional data will raise an error. For instance, with savetxt():

np.savetxt('data.txt', data, fmt='%i', delimiter=';')

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [5], in <cell line: 1>()
----> 1 np.savetxt('data.txt', data, fmt='%i', delimiter=';')

File <__array_function__ internals>:180, in savetxt(*args, **kwargs)

File ~/opt/anaconda3/envs/worklab/lib/python3.9/site-packages/numpy/lib/npyio.py:1397, in savetxt(fname, X, fmt, delimiter, newline, header, footer, comments, encoding)
   1395 # Handle 1-dimensional arrays
   1396 if X.ndim == 0 or X.ndim > 2:
-> 1397     raise ValueError(
   1398         "Expected 1D or 2D array, got %dD array instead" % X.ndim)
   1399 elif X.ndim == 1:
   1400     # Common case -- 1d array of numbers
   1401     if X.dtype.names is None:

ValueError: Expected 1D or 2D array, got 3D array instead

You have two options: use an HDF5 file, which can handle multidimensional arrays (see How to save a large dataset in a hdf5 file using python ? (Quick Guide)), or reshape the array before saving it, for example with the flatten() function:

data = data.flatten()

f = gzip.GzipFile("data.npy.gz", "w")
np.save(file=f, arr=data)
f.close()

To retrieve the data, load the file and restore the original shape (make sure you keep a record of that shape somewhere safe):

f = gzip.GzipFile('data.npy.gz', "r")
data = np.load(f)
f.close()

data = data.reshape((61198, 8, 3))
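The complete flatten-and-restore round trip can be sketched as follows; here the original shape is kept in a variable, but in practice you may want to save it to disk alongside the data:

```python
import gzip
import numpy as np

original = np.random.randint(0, 2, (61198, 8, 3))
shape = original.shape  # remember this to rebuild the array later

with gzip.GzipFile('data.npy.gz', 'w') as f:
    np.save(file=f, arr=original.flatten())

with gzip.GzipFile('data.npy.gz', 'r') as f:
    restored = np.load(f).reshape(shape)

print(np.array_equal(original, restored))  # True
```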
