I've spent the last few years working with numerical data. Most of it is gathered with various instruments, saved to a file on a lab computer, transferred to my laptop via SSH, FTP, email, Dropbox, an HTTP server, or a USB stick, and then analyzed with either a Python or a MATLAB script. I've worked with a lot of different file types, and I wanted to write up the pros and cons of each.
Plain text is nice when you want to edit the data file in a text editor. Plain-text formats are best for lightweight or small datasets, or for configurations and settings. They also work well when sending a file to someone who doesn't use your specific workflow.
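Here's a minimal sketch of the two most common plain-text jobs, saving settings and saving a small table, using only Python's standard library. The setting names, column names, and helper functions are made up for illustration.

```python
import csv
import io
import json


def settings_roundtrip(settings):
    """Serialize a dict of settings to JSON text and parse it back."""
    text = json.dumps(settings, indent=2)  # human-readable, editable by hand
    return text, json.loads(text)


def table_to_csv(header, rows):
    """Write a header and numeric rows to CSV text."""
    buf = io.StringIO(newline="")  # let the csv module control line endings
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def csv_to_table(text):
    """Parse CSV text back into a header and rows of floats."""
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)
    # CSV stores everything as text, so numbers must be converted on load.
    rows = [[float(value) for value in row] for row in reader]
    return header, rows


# Hypothetical instrument settings and a tiny dataset.
settings = {"instrument": "spectrometer", "gain": 2.5, "channels": [1, 2, 3]}
settings_text, loaded_settings = settings_roundtrip(settings)

csv_text = table_to_csv(["time_s", "voltage_V"], [[0.0, 1.5], [0.1, 1.7]])
header, rows = csv_to_table(csv_text)
```

Note that the CSV reader hands back strings, so the types have to be restored by hand. JSON keeps numbers, strings, and lists distinct, which is part of why it suits settings better than tabular text does.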
.txt
Raw text files: Unstructured text. Good for jotting down some notes, but easily lost and difficult to read programmatically.
.yaml
YAML Ain't Markup Language: Easy-to-learn data serialization language. Great for writing down configurations and settings.
.json, .bson
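JavaScript Object Notation: The data-interchange format of the internet. Parsers exist for a wide variety of languages, most famously JavaScript. Its objects map closely onto Python's dict class. BSON is a binary encoding of JSON-like documents, so unlike JSON it can't be edited in a text editor.
.csv, .tsv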
Comma-separated values and, more rarely, tab-separated values: Rows of data with fields separated by commas (or tabs). Very portable, but not very memory efficient, since every number is stored as text.
These are somewhat niche data formats. They're great for the task they're designed for, but don't have all the features of something like HDF5.
.npy, .npz
NumPy files: Default data format for NumPy. Portable between Python environments, but not easily opened outside of Python (e.g., in Microsoft Excel or MATLAB).
.xls, .xlsx
Excel spreadsheets: Portable to folks who like spreadsheets. Works well for medium-sized datasets.
HDF5 has all the features you could ever dream of in a data format. While all of the features are available in the HDF5 base library, several formats build on top of it to make writing and reading data and attributes a little more straightforward. If you and your team can use this format, you should, but expect a steep learning curve.
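To give a feel for what storing "data and attributes" looks like in practice, here is a minimal sketch using h5py, one Python wrapper around the HDF5 library. Using h5py is my choice for illustration, and the group, dataset, and attribute names are made up.

```python
import os
import tempfile

import h5py
import numpy as np


def write_measurement(path, voltage, sample_rate_hz):
    """Store an array plus its metadata in a single HDF5 file."""
    with h5py.File(path, "w") as f:
        run = f.create_group("run_001")  # hypothetical group name
        run.attrs["operator"] = "example"
        # Datasets can be compressed transparently; gzip ships with HDF5.
        dset = run.create_dataset("voltage", data=voltage, compression="gzip")
        # Attributes travel with the dataset, which makes the file self-describing.
        dset.attrs["units"] = "V"
        dset.attrs["sample_rate_hz"] = sample_rate_hz


def read_measurement(path):
    """Load the array and its attributes back from the file."""
    with h5py.File(path, "r") as f:
        dset = f["run_001/voltage"]  # groups are addressed like file paths
        return dset[...], dict(dset.attrs)


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.h5")
    write_measurement(demo_path, np.linspace(0.0, 1.0, 100), 1000.0)
    data, attrs = read_measurement(demo_path)
```

The units and sample rate live next to the numbers they describe, so whoever opens the file later doesn't need a separate notes file to interpret it.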
.h5, .hdf5
Hierarchical Data Format 5: The Holy Grail of data storage. Self-describing, multi-dimensional, hierarchical datasets of arbitrary or mixed types. Supports compression.
.mat
MATLAB data files: MATLAB's native data format. MAT-files saved as version 7.3 are based on HDF5, but don't make use of all of its features.
.nc
NetCDF4 (Network Common Data Form): Adds a bit more structure and APIs on top of the HDF5 format. Fantastic for saving arrays of data with annotations and coordinates. Gold-standard format for saving instrument characterization or plot data.
.sqlite3, .db
SQLite: A lightweight relational database management system all wrapped up in a single portable file. This is great when you have multiple threads or workers running at the same time, reading job requests from and writing results back to a single file.