Tidy Data
Tidy data is a standardized way of organizing data to make it easier to analyze. Applications expect tidy data so that they can create plots and run analyses in a standardized way. Sphinx helps you create tidy data both during the data import process and by applying transformations to your data within an analysis.
The key principles of tidy data are:
- Each variable forms a column: Each measured variable is placed in its own column.
- Each observation forms a row: Each different observation of that variable is placed in its own row.
- Each type of observational unit forms a table: Each dataset is organized into a table.
Messy vs Tidy Data
This table contains messy data, as the observations for a given wavelength and sample type are split across multiple columns. The experimental variables might be the gene and the treatment.
Here are the data in a tidy format.
Notice that the variables encoded in column names like ABS_280_Control are now represented in their own columns, obtained by splitting the column name on the _ character.
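The reshaping described above can be sketched in a few lines of plain Python. This is a minimal illustration and not how Sphinx performs the transformation: the gene names, the ABS_280_Treated column, the numeric values, and the output column names (measurement, wavelength, sample_type, value) are all assumptions made up for the example. Only the ABS_280_Control naming pattern comes from the text.

```python
# Hypothetical wide ("messy") table: one row per gene, with one column per
# measurement_wavelength_sample combination. All values are made up.
messy = [
    {"gene": "geneA", "ABS_280_Control": 0.12, "ABS_280_Treated": 0.30},
    {"gene": "geneB", "ABS_280_Control": 0.15, "ABS_280_Treated": 0.41},
]


def tidy_rows(rows, id_column="gene"):
    """Turn each wide row into one tidy row per measured value."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is copied into every tidy row below
            # Split the encoded column name on "_":
            # "ABS_280_Control" -> ("ABS", "280", "Control")
            measurement, wavelength, sample_type = column.split("_")
            tidy.append({
                id_column: row[id_column],
                "measurement": measurement,
                "wavelength": int(wavelength),
                "sample_type": sample_type,
                "value": value,
            })
    return tidy


tidy = tidy_rows(messy)
for record in tidy:
    print(record)
```

Each input row with two measurement columns becomes two output rows, so every variable (gene, measurement, wavelength, sample type, value) ends up in its own column and every observation in its own row. The sketch assumes that every non-identifier column name has exactly three parts separated by _; names that do not follow this pattern would need separate handling.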
Further Reading
You can read more about tidy data and its impact.
- Wickham, Hadley. “Tidy data.” Journal of Statistical Software 59 (2014): 1-23. https://doi.org/10.18637/jss.v059.i10
- Wang, Earo, Dianne Cook, and Rob J. Hyndman. “A new tidy data structure to support exploration and modeling of temporal data.” Journal of Computational and Graphical Statistics 29.3 (2020): 466-478. https://doi.org/10.48550/arXiv.1901.10257