WGS Tutorial | Sequins

Homepage › WGS Tutorial

WGS Sequins Tutorial

This is an example workflow for processing data that have had Sequins spiked in. It highlights the places where Sequins specific steps must be taken, and it is not intended as an example of a production workflow. You should adapt each step to your own needs.

Install dependencies

Option 1 Self-serve comprehensive

We provide a Docker container that has all dependencies pre-installed, this is the recommended way to run this tutorial. You can download the latest version of the image with:

docker pull ghcr.io/sequinsbio/wgs_tutorial:1.0.0

Copy to clipboard

Note:

The Docker container only supports x86_64 architectures. If you’re running this tutorial on an ARM64 architecture (e.g., Apple Silicon), you should set the DOCKER_DEFAULT_PLATFORM environment variable to linux/amd64 before running any Docker commands. This ensures that the container runs in an x86_64 emulation mode, which is necessary for compatibility with the tools included in the container.

export DOCKER_DEFAULT_PLATFORM=linux/amd64

Copy to clipboard

Alternatively, you can add the option –platform linux/amd64 to every docker command.

All subsequent commands in this tutorial can be run with the docker image. Alternatively, if you are unable to run Docker or would prefer to run the tools natively, you can install each dependency locally.

Option 2 Independent tools

This workflow uses a number of popular bioinformatics tools that need to be installed.

bwa (https://github.com/lh3/bwa)
samtools (https://github.com/samtools/samtools)
GATK (https://github.com/broadinstitute/gatk)
RTG Tools (https://github.com/RealTimeGenomics/rtg-tools)

Sequintools is the only Sequins specific tool needed for the workflow. There are multiple ways
you can install it, see details at https://github.com/sequinsbio/sequintools.

Running the workflow

The following sections step through the workflow in detail, so you can follow along either
running 1) inside the Docker container or 2) on your local machine. You can start the Docker
container with:

docker run -it --rm -v "$PWD":"$PWD" -w "$PWD" -u "$(id -u)":"$(id -g)" \ghcr.io/sequinsbio/wgs_tutorial:1.0.0

Copy to clipboard

1) Obtain the Sequins resource bundle

The Sequins resource bundle contains Sequins specific files that are needed.

curl -sSLf -O https://sequins.blob.core.windows.net/resources/wgs_core_control_set_v1.1.zip

unzip wgs_core_control_set_v1.1.zip

Copy to clipboard

The following files are provided:

sequin_sequences-mirror.fa: FASTA file containing the sequence of every Sequin
in the control set.
sequin_decoy.chrQ_mirror.fa: The decoy chromosome that is concatenated to the
standard reference genome.
sequin_regions.chrQ_mirror.bed: A BED file containing the location of each
Sequin in the decoy chromosome.
sequin_variants.chrQ_mirror.vcf.gz: A VCF file detailing the variants in the
Sequins in the decoy chromosome coordinates.
sequin_regions.hg38.bed: A BED file containing the locations in GRCh38 that each
Sequin is based on.
sequin_variants.hg38.vcf.gz: A VCF file detailing the variants in the Sequins
converted to their GRCh38 coordinates.

2) Build a Sequins augmented reference genome

This step only needs to be done once, and the resulting FASTA file can be used for all
subsequent analyses.

The first step is to concatenate the decoy chromosome to your reference genome. Here we are
downloading a copy of GRCh38, but you can use your own reference genome if you prefer.

pushd wgs_core_control_set_v1.1 || exit 1
curl -sSLf -O https://ftp-
trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/GRCh38/GCA_000001405.15_GRC
h38_no_alt_analysis_set.fasta.gz
   gzip -d GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz
cat \
    GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta \
sequin_decoy.chrQ_mirror.fa \
>grch38_with_sequins.fasta
samtools faidx grch38_with_sequins.fasta
samtools dict grch38_with_sequins.fasta >grch38_with_sequins.dict
bwa index grch38_with_sequins.fasta
popd || exit 1

Copy to clipboard

Download the tutorial example data

We provide some example data you can use to test the Sequins workflow. It’s taken from whole
genome sequencing of the reference sample HG002, and has been subsetted to reduce its size.

Note: subsetted regions of the Human genome are sequences that inspired the corresponding
Sequin mirrored molecules.

curl -sSLf -O https://sequins.blob.core.windows.net/example-data/wgs_tutorial_data_v1.zip
unzip wgs_tutorial_data_v1.zip

Copy to clipboard

The following steps will walk you through a basic NGS workflow.

4) Align data

We begin by aligning our data with bwa to the Sequins augmented reference genome we
created earlier.

FASTA=wgs_core_control_set_v1.1/grch38_with_sequins.fasta
R1=wgs_tutorial_data_v1/wgs_tutorial_data_v1_R1.fastq.gz
R2=wgs_tutorial_data_v1/wgs_tutorial_data_v1_R2.fastq.gz
READGROUP="@RG\\tID:A\\tSM:A\\tLB:libA\\tPU:puA\\tPL:ILLUMINA"
bwa mem -M -t 4 -R "$READGROUP" "$FASTA" "$R1" "$R2" |
samtools fixmate -u -m - - |
samtools sort -u - |
samtools markdup -u - - |
samtools view -b -o wgs_tutorial_data_v1.bam --write-index

Copy to clipboard

5) Calibrate the Sequins in the aligned BAM

Before calling variants, we need to calibrate the Sequins in the aligned BAM.

We do this to make comparisons between the Sequins and the sample data more comparable –
Sequins will typically be sequenced to a much higher coverage than the WGS sample, making it
more likely that variants can be called. The calibration step reduces the coverage of each
Sequin region to match the coverage in the sample at the corresponding location. This gives a
more realistic estimation of the true power to call variants in the sample.

a) Docker container tutorial option:

IMAGE=ghcr.io/sequinsbio/sequintools:0.7.1
docker --rm -v -v "$PWD":"$PWD" -w "$PWD" -u "$(id -u)":"$(id -g)" "$IMAGE" \
calibrate \
--sample-bed wgs_core_control_set_v1.1/sequin_regions.hg38.bed \
--bed wgs_core_control_set_v1.1/sequin_regions.chrQ_mirror.bed \
--write-index \
--summary-report wgs_tutorial_data_v1.calibrated.summary_report.csv \
-o wgs_tutorial_data_v1.calibrated.bam \
wgs_tutorial_data_v1.bam

Copy to clipboard

b) Independent tools tutorial option

sequintools calibrate \
--sample-bed wgs_core_control_set_v1.1/sequin_regions.hg38.bed \
--bed wgs_core_control_set_v1.1/sequin_regions.chrQ_mirror.bed \
--write-index \
--summary-report wgs_tutorial_data_v1.calibrated.summary_report.csv \
-o wgs_tutorial_data_v1.calibrated.bam \
wgs_tutorial_data_v1.bam

Copy to clipboard

6) Call variants

We can now call variants in the calibrated BAM file using GATK’s HaplotypeCaller.

gatk HaplotypeCaller \
-R "$FASTA" \
-I wgs_tutorial_data_v2.calibrated.bam \
-O wgs_tutorial_data_v2.calibrated.vcf.gz \
-L wgs_core_control_set_v1.1/sequin_regions.chrQ_mirror.bed

Copy to clipboard

Sequins are compatible with any variant caller, so you can use your preferred caller at this step;
however, your results may vary depending on the caller you choose.

7) Evaluate the results

To evaluate the variants with RTG Tools, we first need to create the required RTG Sequence
Data File (SDF) from the reference FASTA:

pushd wgs_core_control_set_v1.1 || exit 1
rtg format -o grch38_with_sequins.sdf grch38_with_sequins.fasta
popd || exit 1

Copy to clipboard

We can then perform the evaluation with rtg vcfeval.

rtg vcfeval \
-t wgs_core_control_set_v1.1/grch38_with_sequins.sdf \
-b wgs_core_control_set_v1.1/sequin_variants.chrQ_mirror.vcf.gz \
-c wgs_tutorial_data_v1.calibrated.vcf.gz \
-e wgs_core_control_set_v1.1/sequin_regions.chrQ_mirror.bed \
--decompose \
--no-roc \
-o vcfeval_calibrated

Copy to clipboard

True-pos-call	False-pos	False-neg
102	14	1

Precision	Sensitivity	F-measure
0.8793	0.9903	0.9315

To simplify this tutorial, we will only be evaluating SNVs in this tutorial and NOT SVs.

The false positives and false negative all occur in Sequins that represent structural variants. If
we create a BED file that excludes these Sequins and re-evaluate the results, we can see that
the results are much better.

grep -f wgs_tutorial_data_v1/sv_groups.txt \
-v wgs_core_control_set_v1.1/sequin_regions.chrQ_mirror.bed \
>sequin_regions.chrQ_mirror.no_sv.bed
rtg vcfeval \
-t wgs_core_control_set_v1.1/grch38_with_sequins.sdf \
-b wgs_core_control_set_v1.1/sequin_variants.chrQ_mirror.vcf.gz \
-c wgs_tutorial_data_v1.calibrated.vcf.gz \
-e sequin_regions.chrQ_mirror.no_sv.bed \
--decompose \
--no-roc \
-o vcfeval_calibrated_no_sv

Copy to clipboard

True-pos-call	False-pos	False-neg
62	0	0

Precision	Sensitivity	F-measure
1.0000	1.0000	1.0000

Here we can see that Sequins have proven that the sequencing experiment was successful,
and the data is of good quality. If, however, we start to see false negatives or false positives, we
can investigate which types of Sequins are failing: low/high GC, repeats etc. and make informed
decisions about how to proceed with reporting variants in the sample.

Request access to tutorial datasets

"*" indicates required fields

Table of Contents

Request access to the datasets used in this tutorial.

Request Access