Matched Tumor Samples
A relatively simple schema for the analysis of matched tumor/normal samples from cancer studies. The assumed setting is as follows.
- Each bio entity is a patient/donor.
- Each donor gives one normal (bio) sample (e.g., blood or saliva) and at least one (bio) sample from the cancer (e.g., primary tumor or metastasis).
- For each tumor and non-tumor sample, there is at least one DNA HTS library sequenced.
- For each tumor sample, there can be RNA HTS libraries.
- Only the first-seen DNA/RNA library is considered for each sample (the “primary one”).
Note
The requirements of one DNA HTS library for each sample and of RNA libraries only for tumor samples may be dropped in the future.
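The assumptions above can be checked with a few lines of Python. This is a minimal sketch and not part of any library: the function names and the tuple/dict input shapes are hypothetical, and it reads "one normal sample" as exactly one per donor.

```python
def primary_libraries(libraries):
    """Return the first-seen library per (sample, extraction type).

    `libraries` is an iterable of (sample, extraction_type, library) tuples
    in sheet order; later libraries for the same key are ignored.
    """
    primary = {}
    for sample, extraction_type, library in libraries:
        # setdefault keeps the value stored on first sight of the key
        primary.setdefault((sample, extraction_type), library)
    return primary


def donor_is_valid(samples):
    """Check one donor: exactly one normal and at least one tumor sample.

    `samples` maps sample name -> True if the sample is from tumor.
    """
    tumor_count = sum(1 for is_tumor in samples.values() if is_tumor)
    normal_count = len(samples) - tumor_count
    return normal_count == 1 and tumor_count >= 1


# Hypothetical input: sample T1 of one donor has two DNA libraries.
libraries = [
    ("P002-T1", "DNA", "WES1"),
    ("P002-T1", "DNA", "WES2"),  # second DNA library: not the primary one
    ("P002-T1", "RNA", "mRNAseq1"),
]
print(primary_libraries(libraries))
print(donor_is_valid({"N1": False, "T1": True, "T2": True}))
```

Keying on the extraction type as well as the sample means a tumor sample keeps one primary DNA library and one primary RNA library.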
Matched Tumor Fields
The following fields must be present for matched tumor sample sheets.
- BioSample
- isTumor – a boolean defining whether the sample was taken from tumor cells
Matched Tumor TSV Schema
As an alternative to defining matched tumor sample sheets in JSON format, a TSV-based schema can be used.
Optionally, the schema can contain metadata, starting with a [Metadata] INI-style section header (the data section then has to start with [Data]).
[Metadata]
schema cancer_matched
schema_version v1
title Example matched cancer tumor/normal study
description The study has two patients, P001 has one tumor sample, P002 has two
[Data]
The schema and schema_version lines are optional.
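Splitting such a file into its two sections could look like the sketch below. It is illustrative only: `split_sections` is a hypothetical helper, and it assumes that each metadata line is a key and a value separated by a single tab.

```python
def split_sections(text):
    """Split sheet text into a metadata dict and a list of data lines."""
    metadata, data_lines = {}, []
    # A file without any section header starts directly with the data.
    section = "data"
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if stripped == "[Metadata]":
            section = "metadata"
        elif stripped == "[Data]":
            section = "data"
        elif section == "metadata":
            # Assumption: "key<TAB>value"; a line without a tab gets an empty value.
            key, _, value = stripped.partition("\t")
            metadata[key] = value
        else:
            data_lines.append(line)  # keep tabs intact for later TSV parsing
    return metadata, data_lines


sheet_text = (
    "[Metadata]\n"
    "schema\tcancer_matched\n"
    "schema_version\tv1\n"
    "[Data]\n"
    "patientName\tsampleName\n"
    "P001\tN1\n"
)
metadata, data_lines = split_sections(sheet_text)
print(metadata)
print(data_lines)
```

Because the parser starts in the data section, a sheet with no [Metadata] header yields an empty metadata dict and all of its lines as data.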
If the file does not start with an INI-style section header, it starts with tab-separated column names. An example is shown below:
patientName sampleName isTumor extractionType libraryType folderName
P001 N1 N DNA WES P001-N1-DNA1-WES1
P001 T1 Y DNA WES P001-T1-DNA1-WES1
P001 T1 Y RNA mRNA-seq P001-T1-RNA1-mRNAseq1
P002 N1 N DNA WES P002-N1-DNA1-WES1
P002 T1 Y DNA WES P002-T1-DNA1-WES1
P002 T1 Y RNA mRNA-seq P002-T1-RNA1-mRNAseq1
P002 T2 Y DNA WES P002-T2-DNA1-WES1
P002 T2 Y RNA mRNA-seq P002-T2-RNA1-mRNAseq1
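A data section like this one can be read with Python's standard csv module. The sketch below is only an illustration: the helper names are hypothetical, and it uses the three P001 rows from the example with explicit tab separators.

```python
import csv
import io
from collections import defaultdict

SHEET = (
    "patientName\tsampleName\tisTumor\textractionType\tlibraryType\tfolderName\n"
    "P001\tN1\tN\tDNA\tWES\tP001-N1-DNA1-WES1\n"
    "P001\tT1\tY\tDNA\tWES\tP001-T1-DNA1-WES1\n"
    "P001\tT1\tY\tRNA\tmRNA-seq\tP001-T1-RNA1-mRNAseq1\n"
)


def read_rows(text):
    """Parse tab-separated text into one dict per data row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def folders_by_patient(rows):
    """Group folder names as patient -> sample -> list of folders."""
    grouped = defaultdict(dict)
    for row in rows:
        samples = grouped[row["patientName"]]
        samples.setdefault(row["sampleName"], []).append(row["folderName"])
    return dict(grouped)


rows = read_rows(SHEET)
print(folders_by_patient(rows))
```

The keys of each row dictionary are the column names from the header line.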
The columns are as follows:
- patientName – name of the patient, used to identify the patient in the sample sheet. This value will be used as the secondary ID of the BioEntity of the patient.
- sampleName – name of the sample, used to identify the sample for the patient in the sample sheet. The secondary ID of the BioSample will be generated from the patientName and sampleName.
- isTumor – a flag identifying a sample as being from tumor, one of {Y, N, 1, 0}.
- extractionType – a valid extraction type as in the JSON schema, which is one of DNA, RNA, or other. Based on the extractionType (and the secondary ID of the BioSample), the secondary ID of the TestSample will be generated.
- libraryType – a valid library type as in the JSON schema, e.g., WES or mRNA-seq. Based on the libraryType (and the secondary ID of the TestSample), the secondary ID of the NGSLibrary will be generated.
- folderName – name of the folder in which to search for the library’s FASTQ files. The list of base folders to search is given in the configuration, and each base folder is searched for a folder with the name given here. Thus, no absolute path is given here, only the folder name.
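The column semantics above can be sketched in Python as follows. The exact format of the generated secondary IDs is not stated here, so this sketch assumes a hyphen-joined pattern with a running index of 1 and with hyphens removed from the library type, modeled on the folder names in the example; `parse_is_tumor` and `secondary_ids` are hypothetical helpers.

```python
TUMOR_FLAGS = {"Y": True, "1": True, "N": False, "0": False}


def parse_is_tumor(value):
    """Map the isTumor flag (Y, N, 1, 0) to a boolean."""
    try:
        return TUMOR_FLAGS[value]
    except KeyError:
        raise ValueError(f"isTumor must be one of Y, N, 1, 0; got {value!r}") from None


def secondary_ids(row):
    """Derive secondary IDs for one sheet row (assumed naming pattern)."""
    bio_entity = row["patientName"]
    bio_sample = f"{bio_entity}-{row['sampleName']}"
    # Assumption: index 1, because only the primary library is considered.
    test_sample = f"{bio_sample}-{row['extractionType']}1"
    library_type = row["libraryType"].replace("-", "")
    ngs_library = f"{test_sample}-{library_type}1"
    return {
        "BioEntity": bio_entity,
        "BioSample": bio_sample,
        "TestSample": test_sample,
        "NGSLibrary": ngs_library,
    }


row = {
    "patientName": "P001",
    "sampleName": "T1",
    "isTumor": "Y",
    "extractionType": "RNA",
    "libraryType": "mRNA-seq",
}
print(parse_is_tumor(row["isTumor"]), secondary_ids(row))
```

Each ID is built from the one above it in the hierarchy, mirroring the description: BioSample from patient and sample name, TestSample from BioSample and extraction type, NGSLibrary from TestSample and library type.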
Optionally, the following fields can be added:
- seqPlatform – can be one of Illumina or PacBio; the default is Illumina.
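Handling the optional column could be as simple as the sketch below. The `seq_platform` helper is hypothetical, and treating an empty cell the same as a missing column is an assumption.

```python
VALID_PLATFORMS = ("Illumina", "PacBio")
DEFAULT_PLATFORM = "Illumina"


def seq_platform(row):
    """Return the row's sequencing platform, falling back to the default."""
    # Assumption: a missing column and an empty cell both mean "use the default".
    value = row.get("seqPlatform") or DEFAULT_PLATFORM
    if value not in VALID_PLATFORMS:
        raise ValueError(f"seqPlatform must be one of {VALID_PLATFORMS}; got {value!r}")
    return value


print(seq_platform({"patientName": "P001"}))
print(seq_platform({"patientName": "P001", "seqPlatform": "PacBio"}))
```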