Tutorials

RADseq

This part will deal with processing of a Restriction site Associated DNA sequencing (RADseq), which is a form of reduced representation sequencing. This allows for cost-effective analyses of a larger number of samples, making this a useful strategy when funding is limited (and it always is šŸ™).

This section relies heavily on resources created for RADcamp, so a big thank you to Dr. Isaac Overcast and all RADcamp contributors over the years! We will use data from cheetahs from Prost et al. (2022). Note that the data are downsampled, so these steps can be run quickly, for educational purposes only.

You can find more information about the raw data and steps for quality control here.
The steps for using the actual Ipyrad pipeline are here.
Note that there are exercises with additional information here.

šŸ“· Cheetah coalition in the Masai Mara, Kenya cheetahs Ā©Laura Bertola

Important note on why not to use a WGS pipeline on RADseq data

You technically can, but it’s usually a bad idea unless you’re very careful. Here’s why:

1. WGS tools assume (somewhat) even, genome-wide coverage
Tools in standard WGS pipelines (e.g., GATK best practices):

→ RADseq violates all of these assumptions. Note that there is a difference between randomly missing data (e.g. low depth WGS) and systematically missing data (e.g. RADseq).

2. High missingness confuses variant callers
Variant callers like GATK HaplotypeCaller, FreeBayes, etc., assume:

→ In RADseq, many loci are missing completely from many individuals, and coverage can be low and variable, which leads to poor genotype calling:

3. Duplicate handling & local realignment are overkill or misleading
WGS pipelines include steps like:

→ These aren’t helpful (and may be harmful) for RADseq because:

However, you can use a reference genome to map RADseq reads (e.g., with bwa mem) if you want:

But after mapping, you should still use a RADseq-aware pipeline (e.g., ipyrad, Stacks) to cluster loci and call variants. Note that mapping to a reference genome is integrated in the Ipyrad pipeline.

In summary

WGS and RADseq are fundamentally different.

pakeeza
FeatureĀ Ā Ā Ā Ā Ā Ā Ā Ā Ā Ā Ā Ā Ā Ā Ā Ā Ā Ā Ā  Low-Depth WGS Pipelines RADseq Pipelines
Typical Tools GATK, ANGSD, bcftools, FreeBayes ipyrad, Stacks, pyRAD
Data Structure Reads spread across genome (sparse but random) Reads concentrated at restriction sites (locus-level)
Missing Data Pattern Randomly distributed Systematic (entire loci missing in individuals)
Causes of Missingness Low coverage, sequencing errors Restriction site dropout, uneven digestion, low coverage
Genotype Calling Approach Genotype likelihoods (GLs), population priors (e.g., in ANGSD, GATK’s HaplotypeCaller) Clustering-based consensus calling per locus
Handling of Missing Data Statistical models infer genotypes/frequencies even with low depth Filtering by minimum samples per locus (e.g., min_samples_locus)
Assumption About Coverage Generally assumes uniform or randomly missing Explicitly expects high missingness and dropout
Effect of Hard Filtering Can be tuned by depth/quality thresholds Over-stringent filtering removes most loci (bad)
Works Without Reference Genome? Yes, but reference improves results Yes (de novo mode common)
SNP Position Resolution Precise, genome-wide Clustered within RAD loci (may lack genomic coordinates if de novo)

Of course, there are different arguments for using a specific pipeline, and consistency with previously generated datasets is one of them. But RADseq pipelines exist for a reason.