A Dataset Collection is two or more Datasets that have been combined by matching their columns. This lets you use related Datasets together in an Analysis, with corresponding columns from each Dataset mapped into a single column of the Dataset Collection. The Dataset Library lists the Dataset Collections you have access to. Click the Dataset Collection button to view only Dataset Collections, or the All button to see both Datasets and Dataset Collections. Dataset Collections are indicated by the ‘folder’ icon to the left of their name.

How Dataset Collections Work

A Dataset Collection contains two or more Datasets. The original instance of each Dataset remains in the library, and a copy is added to the Dataset Collection. Within the Dataset Collection, columns from each Dataset are mapped to columns of the Dataset Collection. This allows you to combine Datasets with different column names and data types into one consolidated collection.
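As an illustration of the mapping idea only, the sketch below combines two small Datasets whose columns have different names. It is not Sphinx's implementation or API: the `combine` function, the `source_dataset` field, and the sample column names are all hypothetical, and each Dataset is modeled as a plain list of rows.

```python
from typing import Dict, List, Optional

Row = Dict[str, object]
# Hypothetical structure: collection column -> source column name (None = unmapped).
Mapping = Dict[str, Optional[str]]


def combine(datasets: Dict[str, List[Row]],
            mappings: Dict[str, Mapping]) -> List[Row]:
    """Stack rows from every dataset under the collection's column names."""
    combined: List[Row] = []
    for name, rows in datasets.items():
        mapping = mappings[name]
        for row in rows:
            # Keep track of which dataset each row came from (an assumption).
            out: Row = {"source_dataset": name}
            for collection_col, source_col in mapping.items():
                out[collection_col] = (
                    row.get(source_col) if source_col is not None else None
                )
            combined.append(out)
    return combined


# Two datasets describing the same measurement with different column names.
plate_a = [{"sample": "S1", "od600": 0.42}]
plate_b = [{"Sample ID": "S2", "OD": 0.57}]

combined = combine(
    {"plate_a": plate_a, "plate_b": plate_b},
    {
        "plate_a": {"sample_id": "sample", "absorbance": "od600"},
        "plate_b": {"sample_id": "Sample ID", "absorbance": "OD"},
    },
)
```

Both rows end up under the shared `sample_id` and `absorbance` columns, while the original `plate_a` and `plate_b` lists are left unchanged.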

Dataset Collections are automatically created when Sphinx detects a Dataset that matches an existing Dataset in your library. This is done by comparing the Datasets’ columns and associated data types. If the columns and data types match exactly, the Datasets are added to a Dataset Collection. These ‘automatically created’ Dataset Collections have a placeholder name in the form of <first dataset name>_<unique id>.
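The exact-match check described above can be pictured roughly as follows. This is a sketch under assumptions, not Sphinx's code: a schema is modeled as a dictionary from column name to data type, the function names are hypothetical, and the format of the unique id (eight hex characters here) is invented for illustration.

```python
import uuid
from typing import Dict

# Hypothetical schema model: column name -> data type name.
Schema = Dict[str, str]


def is_exact_match(a: Schema, b: Schema) -> bool:
    """True when both schemas have the same columns with the same data types.

    Dictionary equality ignores column order, which is an assumption here.
    """
    return a == b


def placeholder_name(first_dataset_name: str) -> str:
    """Build a name of the form <first dataset name>_<unique id>."""
    # The id format is an assumption; the source only says it is unique.
    return f"{first_dataset_name}_{uuid.uuid4().hex[:8]}"


run_1 = {"sample_id": "string", "concentration": "float"}
run_2 = {"concentration": "float", "sample_id": "string"}
run_3 = {"sample_id": "string", "concentration": "integer"}

same_schema = is_exact_match(run_1, run_2)       # same columns and types
different_type = is_exact_match(run_1, run_3)    # concentration type differs
name = placeholder_name("run_1")
```

Under this model, a Dataset whose column has a different data type does not count as a match, even when every column name is the same.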

Example use cases for Dataset Collections include:

  1. Combining multiple Datasets for one assay to compare results over time.
  2. Adding together two Datasets that had different column names so you can compare results in one Analysis.
  3. Documenting and defining related Datasets created by you and your team.

How to Create a Dataset Collection

To create a Dataset Collection you will need at least two Datasets in your library. Read more about creating Datasets here.

1. Creating a Dataset Collection

From the Dataset Library, select two Datasets and then choose the “Create Collection” option in the upper toolbar. This takes you to the Dataset Collection detail page.

2. Matching Columns

Sphinx matches columns in the Dataset Collection based on name and data type. To change a mapping, locate the row for the Dataset and the column for the Dataset Collection, then select the appropriate column from the Dataset. If you have many columns to map, you can select the ‘Magic Map Columns’ option, which attempts to match columns by comparing the similarity of column names across all Datasets and the columns of the Dataset Collection.
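To give a feel for name-similarity matching, here is a minimal sketch using Python's standard `difflib` module. How ‘Magic Map Columns’ actually scores names is not documented here, so the normalization step, the greedy assignment, the `0.6` threshold, and all function names are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional


def normalize(name: str) -> str:
    """Lowercase and drop spaces and punctuation so 'Sample ID' ~ 'sample_id'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def similarity(a: str, b: str) -> float:
    """Similarity ratio between 0.0 and 1.0 on the normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def suggest_mapping(collection_columns: List[str],
                    dataset_columns: List[str],
                    threshold: float = 0.6) -> Dict[str, Optional[str]]:
    """Greedily pair each collection column with its most similar dataset column."""
    pairs = sorted(
        ((similarity(c, d), c, d)
         for c in collection_columns for d in dataset_columns),
        key=lambda pair: pair[0],
        reverse=True,
    )
    mapping: Dict[str, Optional[str]] = {c: None for c in collection_columns}
    used = set()
    for score, collection_col, dataset_col in pairs:
        if score < threshold:
            break  # remaining pairs are too dissimilar to suggest
        if mapping[collection_col] is None and dataset_col not in used:
            mapping[collection_col] = dataset_col
            used.add(dataset_col)
    return mapping


suggested = suggest_mapping(
    ["sample_id", "concentration"],
    ["Sample ID", "Concentration (nM)", "notes"],
)
```

Columns with no sufficiently similar counterpart, such as `notes` in this example, are left unassigned and would still need to be mapped by hand.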

Each column in a Dataset must be used exactly once. Columns in the Dataset Collection that are left completely empty will not be saved.
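The two rules above can be expressed as simple checks. The sketch below reads "used exactly once" as "every Dataset column is mapped to exactly one Dataset Collection column", which is one interpretation of the rule; the function names and the mapping structure are hypothetical and do not reflect Sphinx's internals.

```python
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical structure: collection column -> dataset column (None = empty).
Mapping = Dict[str, Optional[str]]


def mapping_errors(dataset_columns: List[str], mapping: Mapping) -> List[str]:
    """List every dataset column that is unmapped or mapped more than once."""
    counts = Counter(src for src in mapping.values() if src is not None)
    errors = []
    for col in dataset_columns:
        if counts[col] == 0:
            errors.append(f"{col!r} is not mapped")
        elif counts[col] > 1:
            errors.append(f"{col!r} is mapped {counts[col]} times")
    return errors


def drop_empty_columns(mappings: Dict[str, Mapping]) -> Dict[str, Mapping]:
    """Remove collection columns that no dataset maps anything into."""
    kept = {
        collection_col
        for mapping in mappings.values()
        for collection_col, src in mapping.items()
        if src is not None
    }
    return {
        name: {c: src for c, src in mapping.items() if c in kept}
        for name, mapping in mappings.items()
    }


errors = mapping_errors(["c1", "c2"], {"x": "c1", "y": "c1"})
saved = drop_empty_columns({
    "a": {"x": "c1", "unused": None},
    "b": {"x": "c2", "unused": None},
})
```

Here `errors` reports that `c1` is mapped twice and `c2` not at all, and `saved` no longer contains the `unused` column because it was empty for every Dataset.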

3. Adding Additional Datasets

After you save a Dataset Collection, you can revisit it and add more Datasets. To do this, navigate to the Dataset Collection in the Dataset Library and open “Manage Datasets”.

After adding more Datasets, you can perform additional column matching.

4. Defining Dataset Collection Details

On the page for a Dataset Collection, you can access the preview, lineage, schema, related analyses, and settings. A brief overview of these options:

  1. “Preview Dataset” shows the first 50 rows, Bio Entity relationships, values, and column data types.
  2. “Analyses” shows any related Analyses where the Dataset is used.
  3. “Manage Datasets” allows you to add or remove Datasets from the Dataset Collection in addition to changing column mappings.
  4. “Dataset Settings” lets you update the name, description, ELN entry, and tags for the Dataset.