Data Storage with Python


I've spent the last few years working with numerical data. Most of these data are gathered with various instruments, saved to a file on a lab computer, transferred to my laptop via SSH, FTP, email, Dropbox, an HTTP server, or a USB stick, and then analyzed with either a Python or a MATLAB script. I've worked with a lot of different file types, and I wanted to write up the pros and cons of each.

ASCII (Plain Text) Files

Plain text is nice when you want to edit the data file in a text editor. These formats are best for small, light-weight datasets or for configurations/settings. They also work well when sending a file to someone who doesn't share your specific workflow.

  • .txt Raw text files: Unstructured text. Good for jotting down some notes, but easily lost and difficult to read programmatically.
  • .yaml YAML Ain't Markup Language: Easy-to-learn data serialization language. Great for writing down configurations and settings.
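  • .json, .bson JavaScript Object Notation: The data interchange format of the internet. Parsers exist in a wide variety of languages, most famously JavaScript. Maps closely onto Python's dict and list types. Note that .bson is a binary encoding of JSON-like documents, so it can't be edited in a text editor.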
  • .csv, .tsv Comma-separated values, and more rarely tab-separated values: Rows of data with fields separated by commas (or tabs). Very portable, but not very storage efficient, since every number is written out as text.
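
As a rough illustration of the JSON and CSV entries above, here is a minimal sketch using only Python's standard library. The file names, settings, and measurement values are made up for the example; the point is that JSON hands back the same dict you stored, while CSV stores everything as text and leaves the number conversion to you.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical settings and measurements, invented for illustration.
settings = {"instrument": "spectrometer", "gain": 2.5, "channels": [1, 2, 3]}
rows = [(0.0, 1.25), (0.1, 1.5), (0.2, 1.75)]

workdir = Path(tempfile.mkdtemp())

# JSON: a dict goes in, an equal dict comes back out.
json_path = workdir / "settings.json"
json_path.write_text(json.dumps(settings, indent=2))
loaded_settings = json.loads(json_path.read_text())

# CSV: every field is stored as text, so numbers are converted on read.
csv_path = workdir / "measurements.csv"
with csv_path.open("w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["time_s", "voltage_v"])  # header row
    writer.writerows(rows)

with csv_path.open(newline="") as f:
    reader = csv.DictReader(f)
    loaded_rows = [(float(r["time_s"]), float(r["voltage_v"])) for r in reader]
```

Both files can be opened and tweaked in any text editor, which is the main reason to reach for these formats.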

Binary Files

These are somewhat niche data formats. They're great for the tasks they're designed for, but don't have all the features of something like HDF5.

  • .npy, .npz NumPy Files: NumPy's native data format. Portable between Python environments, but not readily opened outside of Python (e.g., in Microsoft Excel or MATLAB).
  • .xls, .xlsx Excel Spreadsheets: Portable to folks who like spreadsheets. Works well for medium-sized datasets.
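
For the NumPy formats, a small sketch (assuming NumPy is installed; the array names and values are invented) shows the difference between the two extensions: a .npy file holds a single array, while a .npz file bundles several named arrays into one archive.

```python
import tempfile
from pathlib import Path

import numpy as np

workdir = Path(tempfile.mkdtemp())

# Hypothetical sweep data, invented for illustration.
time_s = np.linspace(0.0, 1.0, 5)
voltage_v = np.array([0.0, 0.5, 1.0, 0.5, 0.0])

# .npy holds exactly one array, including its dtype and shape.
np.save(workdir / "voltage.npy", voltage_v)
voltage_loaded = np.load(workdir / "voltage.npy")

# .npz bundles several named arrays into one compressed archive.
np.savez_compressed(workdir / "sweep.npz", time_s=time_s, voltage_v=voltage_v)
with np.load(workdir / "sweep.npz") as archive:
    names = sorted(archive.files)    # names of the stored arrays
    time_loaded = archive["time_s"]  # arrays are read on access
```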

HDF5-Based Binary Files

HDF5 has all the features you could ever dream of in a data format. While all of the features are available in the HDF5 base library, there are several formats that build on top of the base library to make writing/reading data and attributes a little more straightforward. If you and your team can use this format, you should use it, though be warned that it has a steep learning curve.

  • .h5, .hdf5 Hierarchical Data Format 5: The Holy Grail of data storage. Self-describing multi-dimensional hierarchical datasets of arbitrary/mixed types. Supports compression.
  • .mat MATLAB Data Files: MATLAB's native data format. Version 7.3 files are based on HDF5 but don't make use of all its features; older versions use a separate binary format.
  • PyTables: Python package that adds a table-oriented API layer on top of the HDF5 library. Ideally suited for time series, or other datasets where rows of data are added sequentially as time progresses.
  • .nc NetCDF4 (Network Common Data Form): Adds a bit more structure and APIs on top of the HDF5 format. Fantastic for saving arrays of data with annotations and coordinates. Gold-standard format for saving instrument characterization or plot data.
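
To give a feel for the "self-describing" and "hierarchical" parts, here is a minimal sketch using the h5py package (assumed to be installed alongside NumPy). The group name, dataset name, and attribute values are hypothetical; the idea is that the data, its folder-like location, and its metadata all live in one file.

```python
import tempfile
from pathlib import Path

import h5py
import numpy as np

path = Path(tempfile.mkdtemp()) / "experiment.h5"

# Hypothetical detector image, invented for illustration.
image = np.arange(12, dtype=np.float64).reshape(3, 4)

with h5py.File(path, "w") as f:
    # Groups behave like folders; "run_001" is created automatically here.
    dset = f.create_dataset("run_001/image", data=image, compression="gzip")
    # Attributes keep the metadata right next to the data it describes.
    dset.attrs["units"] = "counts"
    f["run_001"].attrs["operator"] = "hypothetical user"

with h5py.File(path, "r") as f:
    dset = f["run_001/image"]
    image_loaded = dset[...]      # read the whole dataset into memory
    units = dset.attrs["units"]
    group_names = list(f.keys())  # top-level groups in the file
```

Slicing the dataset (for example `dset[0, :]`) reads only that part from disk, which is a large part of the appeal for big files.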

SQL

  • .sqlite3, .db SQLite: A light-weight relational database management system all wrapped up in a single portable file. This is great when you have multiple threads/workers running at the same time, reading job requests from and writing results back to a single file. Keep in mind that SQLite allows many simultaneous readers but only one writer at a time.
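
The job-queue pattern described above might look roughly like the following sketch, which uses Python's built-in sqlite3 module. The table layout and the "square the input" job are invented for the example, and everything runs in a single process here; in practice each worker would open its own connection to the same file.

```python
import sqlite3
import tempfile
from pathlib import Path

db_path = Path(tempfile.mkdtemp()) / "jobs.sqlite3"

conn = sqlite3.connect(db_path)
conn.execute(
    "CREATE TABLE jobs ("
    "id INTEGER PRIMARY KEY, "
    "x REAL NOT NULL, "  # job request (hypothetical input value)
    "result REAL)"       # NULL until a worker fills it in
)

# Queue up some job requests; "with conn" commits the transaction.
with conn:
    conn.executemany("INSERT INTO jobs (x) VALUES (?)", [(1.0,), (2.0,), (3.0,)])

# A worker reads the pending jobs and writes the results back.
pending = conn.execute(
    "SELECT id, x FROM jobs WHERE result IS NULL ORDER BY id"
).fetchall()
with conn:
    for job_id, x in pending:
        conn.execute("UPDATE jobs SET result = ? WHERE id = ?", (x * x, job_id))

results = conn.execute("SELECT x, result FROM jobs ORDER BY id").fetchall()
conn.close()
```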