WGS Sequins Tutorial

This is an example workflow for processing data that have had Sequins spiked in. It highlights the places where Sequins specific steps must be taken, and it is not intended as an example of a production workflow. You should adapt each step to your own needs.

Install dependencies

Option 1 Self-serve comprehensive

We provide a Docker container that has all dependencies pre-installed, this is the recommended way to run this tutorial. You can download the latest version of the image with:

docker pull ghcr.io/sequinsbio/wgs_tutorial:1.0.0
Copy to clipboard

Note:

The Docker container only supports x86_64 architectures. If you’re running this tutorial on an ARM64 architecture (e.g., Apple Silicon), you should set the DOCKER_DEFAULT_PLATFORM environment variable to linux/amd64 before running any Docker commands. This ensures that the container runs in an x86_64 emulation mode, which is necessary for compatibility with the tools included in the container.

export DOCKER_DEFAULT_PLATFORM=linux/amd64
Copy to clipboard

Alternatively, you can add the option –platform linux/amd64 to every docker command.

All subsequent commands in this tutorial can be run with the docker image. Alternatively, if you are unable to run Docker or would prefer to run the tools natively, you can install each dependency locally.

 

Option 2 Independent tools

This workflow uses a number of popular bioinformatics tools that need to be installed.

Sequintools is the only Sequins specific tool needed for the workflow. There are multiple ways
you can install it, see details at https://github.com/sequinsbio/sequintools.

Running the workflow

The following sections step through the workflow in detail, so you can follow along either
running 1) inside the Docker container or 2) on your local machine. You can start the Docker
container with:

docker run -it --rm -v "$PWD":"$PWD" -w "$PWD" -u "$(id -u)":"$(id -g)" \ghcr.io/sequinsbio/wgs_tutorial:1.0.0
Copy to clipboard

1) Obtain the Sequins resource bundle

The Sequins resource bundle contains Sequins specific files that are needed.

curl -sSLf -O https://sequins.blob.core.windows.net/resources/wgs_core_control_set_v1.1.zip

unzip wgs_core_control_set_v1.1.zip
Copy to clipboard

The following files are provided:

  • sequin_sequences-mirror.fa: FASTA file containing the sequence of every Sequin
    in the control set.
  • sequin_decoy.chrQ_mirror.fa: The decoy chromosome that is concatenated to the
    standard reference genome.
  • sequin_regions.chrQ_mirror.bed: A BED file containing the location of each
    Sequin in the decoy chromosome.
  • sequin_variants.chrQ_mirror.vcf.gz: A VCF file detailing the variants in the
    Sequins in the decoy chromosome coordinates.
  • sequin_regions.hg38.bed: A BED file containing the locations in GRCh38 that each
    Sequin is based on.
  • sequin_variants.hg38.vcf.gz: A VCF file detailing the variants in the Sequins
    converted to their GRCh38 coordinates.

 

2) Build a Sequins augmented reference genome

This step only needs to be done once, and the resulting FASTA file can be used for all
subsequent analyses.

The first step is to concatenate the decoy chromosome to your reference genome. Here we are
downloading a copy of GRCh38, but you can use your own reference genome if you prefer.

pushd wgs_core_control_set_v1.1 || exit 1
curl -sSLf -O https://ftp-
trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/GRCh38/GCA_000001405.15_GRC
h38_no_alt_analysis_set.fasta.gz
   gzip -d GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz
cat \
    GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta \
sequin_decoy.chrQ_mirror.fa \
>grch38_with_sequins.fasta
samtools faidx grch38_with_sequins.fasta
samtools dict grch38_with_sequins.fasta >grch38_with_sequins.dict
bwa index grch38_with_sequins.fasta
popd || exit 1
Copy to clipboard

Download the tutorial example data

We provide some example data you can use to test the Sequins workflow. It’s taken from whole
genome sequencing of the reference sample HG002, and has been subsetted to reduce its size.

Note: subsetted regions of the Human genome are sequences that inspired the corresponding
Sequin mirrored molecules.

curl -sSLf -O https://sequins.blob.core.windows.net/example-data/wgs_tutorial_data_v1.zip
unzip wgs_tutorial_data_v1.zip
Copy to clipboard

The following steps will walk you through a basic NGS workflow.

4) Align data

We begin by aligning our data with bwa to the Sequins augmented reference genome we
created earlier.

FASTA=wgs_core_control_set_v1.1/grch38_with_sequins.fasta
R1=wgs_tutorial_data_v1/wgs_tutorial_data_v1_R1.fastq.gz
R2=wgs_tutorial_data_v1/wgs_tutorial_data_v1_R2.fastq.gz
READGROUP="@RG\\tID:A\\tSM:A\\tLB:libA\\tPU:puA\\tPL:ILLUMINA"
bwa mem -M -t 4 -R "$READGROUP" "$FASTA" "$R1" "$R2" |
samtools fixmate -u -m - - |
samtools sort -u - |
samtools markdup -u - - |
samtools view -b -o wgs_tutorial_data_v1.bam --write-index
Copy to clipboard

5) Calibrate the Sequins in the aligned BAM

Before calling variants, we need to calibrate the Sequins in the aligned BAM.

We do this to make comparisons between the Sequins and the sample data more comparable –
Sequins will typically be sequenced to a much higher coverage than the WGS sample, making it
more likely that variants can be called. The calibration step reduces the coverage of each
Sequin region to match the coverage in the sample at the corresponding location. This gives a
more realistic estimation of the true power to call variants in the sample.

a) Docker container tutorial option:

IMAGE=ghcr.io/sequinsbio/sequintools:0.7.1
docker --rm -v -v "$PWD":"$PWD" -w "$PWD" -u "$(id -u)":"$(id -g)" "$IMAGE" \
calibrate \
--sample-bed wgs_core_control_set_v1.1/sequin_regions.hg38.bed \
--bed wgs_core_control_set_v1.1/sequin_regions.chrQ_mirror.bed \
--write-index \
--summary-report wgs_tutorial_data_v1.calibrated.summary_report.csv \
-o wgs_tutorial_data_v1.calibrated.bam \
wgs_tutorial_data_v1.bam
Copy to clipboard

b) Independent tools tutorial option

sequintools calibrate \
--sample-bed wgs_core_control_set_v1.1/sequin_regions.hg38.bed \
--bed wgs_core_control_set_v1.1/sequin_regions.chrQ_mirror.bed \
--write-index \
--summary-report wgs_tutorial_data_v1.calibrated.summary_report.csv \
-o wgs_tutorial_data_v1.calibrated.bam \
wgs_tutorial_data_v1.bam
Copy to clipboard

6) Call variants

We can now call variants in the calibrated BAM file using GATK’s HaplotypeCaller.

gatk HaplotypeCaller \
-R "$FASTA" \
-I wgs_tutorial_data_v2.calibrated.bam \
-O wgs_tutorial_data_v2.calibrated.vcf.gz \
-L wgs_core_control_set_v1.1/sequin_regions.chrQ_mirror.bed
Copy to clipboard

Sequins are compatible with any variant caller, so you can use your preferred caller at this step;
however, your results may vary depending on the caller you choose.

 

7) Evaluate the results

To evaluate the variants with RTG Tools, we first need to create the required RTG Sequence
Data File (SDF) from the reference FASTA:

pushd wgs_core_control_set_v1.1 || exit 1
rtg format -o grch38_with_sequins.sdf grch38_with_sequins.fasta
popd || exit 1
Copy to clipboard

We can then perform the evaluation with rtg vcfeval.

rtg vcfeval \
-t wgs_core_control_set_v1.1/grch38_with_sequins.sdf \
-b wgs_core_control_set_v1.1/sequin_variants.chrQ_mirror.vcf.gz \
-c wgs_tutorial_data_v1.calibrated.vcf.gz \
-e wgs_core_control_set_v1.1/sequin_regions.chrQ_mirror.bed \
--decompose \
--no-roc \
-o vcfeval_calibrated
Copy to clipboard
True-pos-call False-pos False-neg
102 14 1
Precision Sensitivity F-measure
0.8793 0.9903 0.9315

To simplify this tutorial, we will only be evaluating SNVs in this tutorial and NOT SVs.

The false positives and false negative all occur in Sequins that represent structural variants. If
we create a BED file that excludes these Sequins and re-evaluate the results, we can see that
the results are much better.

grep -f wgs_tutorial_data_v1/sv_groups.txt \
-v wgs_core_control_set_v1.1/sequin_regions.chrQ_mirror.bed \
>sequin_regions.chrQ_mirror.no_sv.bed
rtg vcfeval \
-t wgs_core_control_set_v1.1/grch38_with_sequins.sdf \
-b wgs_core_control_set_v1.1/sequin_variants.chrQ_mirror.vcf.gz \
-c wgs_tutorial_data_v1.calibrated.vcf.gz \
-e sequin_regions.chrQ_mirror.no_sv.bed \
--decompose \
--no-roc \
-o vcfeval_calibrated_no_sv
Copy to clipboard
True-pos-call False-pos False-neg
62 0 0
Precision Sensitivity F-measure
1.0000 1.0000 1.0000

Here we can see that Sequins have proven that the sequencing experiment was successful,
and the data is of good quality. If, however, we start to see false negatives or false positives, we
can investigate which types of Sequins are failing: low/high GC, repeats etc. and make informed
decisions about how to proceed with reporting variants in the sample.

Request access to tutorial datasets

"*" indicates required fields

Area(s) of Interest*
You agree to our friendly Privacy Policy
Table of Contents

Request access to the datasets used in this tutorial.