What Comes After HDF5? Seeking a Data Storage Format for Deep Learning

HDF5 is one of the most popular and reliable formats for non-tabular, numerical data, but it was not designed for deep learning work. This article examines where the format falls short and outlines what an ML-native data format should look like to truly serve the needs of modern data scientists.



By Davit Buniatyan, CEO of Activeloop


HDF5 for unstructured data

 
 
HDF5 is one of the most popular and reliable formats for non-tabular, numerical data.1 Let me give you an example. NASA’s Earth Observing System satellites gather around 16 TBs of data a day in the HDF5 format.

Features like named datasets, hierarchically organized groups, user-defined metadata and attributes, compression filters, lazy loading2, and array-like slicing across different axes make it suitable for many data processing tasks.
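For readers who have not used h5py, here is a minimal sketch of those features; the file and dataset names are purely illustrative, and the 1024-dim embeddings are borrowed from note 1.

import h5py
import numpy as np

# Write: named datasets inside hierarchical groups, with user-defined
# attributes, a chunked layout, and a compression filter.
with h5py.File("embeddings.h5", "w") as f:
    grp = f.create_group("images/train")
    dset = grp.create_dataset(
        "embeddings",
        shape=(1_000_000, 1024),
        dtype="float32",
        chunks=(1024, 1024),
        compression="gzip",
    )
    dset.attrs["model"] = "illustrative-encoder"
    dset[:1024] = np.random.rand(1024, 1024).astype("float32")

# Read: array-like slicing is lazy, so only the requested rows are
# pulled from disk.
with h5py.File("embeddings.h5", "r") as f:
    batch = f["images/train/embeddings"][128:256, :]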

Even though it was created over twenty years ago, HDF remains popular with the PyData community through well-maintained open source libraries like h5py.

 

HDF5 is not optimized for Deep Learning

 
 
In HDF5, large datasets are typically stored as “chunks”; that is, regular partitions of the array. While this design decision means HDF can store petascale, unstructured numerical data like images and video, the format was created before cloud object stores and deep learning existed. As a result, it has a number of shortcomings when it comes to DL workloads:

  1. The layout of HDF files makes them difficult to query efficiently on cloud storage systems (like Amazon’s S33), where ML datasets increasingly live.4 The HDF5 library cannot read directly from HDF5 files stored as S3 objects. In practice, the entire file, which can easily be gigabytes in size, must be copied to local storage before the first byte can be read (see the sketch after this list).
  2. It is difficult to apply transformations to the data in parallel. Transformations, whether data augmentation for computer vision or tokenization for NLP, are core to any deep learning experiment.
  3. The Python implementation of HDF5 is essentially single-threaded, meaning only one core can read or write to a dataset at a given time. It does not readily support concurrent reads, which limits the ability of HDF5 data to feed multiple workers.5
  4. It is not integrated with modern frameworks like Keras and PyTorch, which means researchers often need to write custom DataLoaders (a minimal example appears below). As DataLoaders take on more responsibilities, such as sharding datasets for parallel training, they become more difficult to write and maintain.
  5. Because HDF metadata is strewn throughout the file, it is difficult to access a random chunk of data. A reader has to make many small reads just to gather the metadata necessary to locate and pull out a single chunk.
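To make point 1 concrete, here is a sketch of the workaround this usually forces (bucket, key, and dataset names are hypothetical): the whole object is copied down with boto3 before h5py can read a single byte.

import boto3
import h5py

s3 = boto3.client("s3")
# The entire (often multi-gigabyte) object must land on local disk first...
s3.download_file("my-ml-bucket", "datasets/train.h5", "/tmp/train.h5")

# ...and only then can the first byte be read.
with h5py.File("/tmp/train.h5", "r") as f:
    first_image = f["images"][0]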

Due to these challenges, most users convert their data to another format (e.g. CSV or TFRecords) before a single epoch can be run.
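As an illustration of point 4, here is a minimal sketch of the kind of custom PyTorch Dataset that typically gets written around an HDF5 file (file and key names are made up); real versions also have to deal with sharding, shuffling, and multi-process loading.

import h5py
import torch
from torch.utils.data import Dataset, DataLoader

class H5ImageDataset(Dataset):
    def __init__(self, path):
        self.path = path
        with h5py.File(path, "r") as f:
            self.length = len(f["images"])
        self._file = None  # opened lazily, once per worker process

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # h5py file handles don't survive forking into DataLoader workers,
        # so each worker opens its own handle on first access.
        if self._file is None:
            self._file = h5py.File(self.path, "r")
        image = torch.from_numpy(self._file["images"][idx])
        label = int(self._file["labels"][idx])
        return image, label

loader = DataLoader(H5ImageDataset("/tmp/train.h5"), batch_size=32, num_workers=4)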

 

The ML community needs a new data storage format

 
 
Rather than rewriting or repurposing HDF for the cloud, we think there should be a modern data storage format designed for distributed deep learning. While there are packages and managed services that allow users to read HDF data directly from the cloud6, they are not designed for large collections of unstructured data.

With training on petascale datasets over cloud infrastructure becoming the norm in both academia and industry, a deep-learning-native format is one that is:

  • Cloud native. It should play well with cloud object storage systems (such as S3 and GCS) without sacrificing GPU utilization.7
  • Parallelizable. As most ML datasets are write-once, read many, it should support concurrent reads by multiple workers.
  • Random access. Since researchers rarely need the entire dataset upfront, the format should support random access to individual samples or slices.
  • Transformations. It should be able to apply arbitrary transformations to data at runtime with minimal code overhead. That’s because downloading TBs of data is often impractical: most users don’t have access to a big data cluster to slice and dice the data as they need after it’s downloaded.
  • Integration. It should be integrated with popular deep learning frameworks like Keras and PyTorch, as well as other PyData libraries such as NumPy and Pandas. Ideally, researchers should be able to access these huge cloud-hosted datasets with the same scripts they have developed for smaller datasets stored locally (a hypothetical sketch of such an API follows this list).
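To make the wish list concrete, here is a purely hypothetical sketch of how working with such a format could feel; mlstore and every call on it are invented for illustration and do not refer to any existing library.

import mlstore  # hypothetical library, named here only for illustration

# Cloud native + random access: point at an object-store URL and read lazily.
ds = mlstore.open("s3://my-ml-bucket/imagenet-train")   # no upfront download
sample = ds["images"][123_456]                          # fetch one sample on demand

# Transformations: declared once, applied lazily as data is read.
ds = ds.map(lambda item: {**item, "images": item["images"] / 255.0})

# Integration + parallel reads: the same object plugs into PyTorch directly.
loader = ds.to_pytorch(batch_size=32, num_workers=8)
for images, labels in loader:
    break  # train as usual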

Although designing and promoting a new data format is not a simple task (there have been several unsuccessful “HDF for the cloud” libraries), we think a format designed for ML purposes and for unstructured data such as images, videos, and audio could help researchers and engineers do more with less.

 
Notes

  1. I recently used HDF5 to store 1024-dim embeddings for roughly one million images.
  2. Keep in mind that the actual data lives on disk; when slicing is applied to an HDF5 dataset, the appropriate data is found and loaded into memory. Slicing in this fashion leverages the underlying subsetting capabilities of HDF5 and is consequently very fast.
  3. Amazon has something called the HDF5 virtual file layer.
  4. That’s because HDF5 needs to read its table of indices and then make range requests for each chunk. The slowdown is significant because the HDF library makes many small (roughly 4 kB) reads. Each of those tiny reads made sense when the data was local, but each one now becomes a separate web request, so users can sit for minutes just waiting for a file to open.
  5. I often see attempts at concurrent writes to a single HDF5 file. This is not recommended, because the file can easily be corrupted when multiple processes write to it at the same time.
  6. Even Pandas can read HDF5 files directly from S3.
  7. Of course, the biggest disadvantage of cloud object stores is egress.

 
Bio: Davit Buniatyan (@DBuniatyan) started his Ph.D. at Princeton University at 20. His research involved reconstructing the connectome of the mouse brain under the supervision of Sebastian Seung. Trying to solve the hurdles he faced analyzing large datasets in the neuroscience lab, Davit became the founding CEO of Activeloop, a Y Combinator alum startup. He is also a recipient of the Gordon Wu Fellowship and an AWS Machine Learning Research Award. Davit is the creator of the open-source package Hub, the dataset format for AI.
