How NGS index design works
NGS indexes (barcodes) are short synthetic DNA sequences added to each sample library, by adapter ligation or PCR, before pooling. During sequencing, a separate "index read" sequences the index so that each fragment can be traced back to its sample of origin.
Key design principles:
1. Sufficient Hamming distance: Every pair of indexes in a pool should differ at ≥ 3 positions, so that a single sequencing error in the index read can still be corrected unambiguously during demultiplexing.
2. Color balance: Every cycle of the index read should have adequate representation of all four bases (or at minimum, signal in both fluorescent channels) for the sequencer to calibrate correctly.
3. No long homopolymer runs: Stretches of the same base repeated several times in a row (for example, GGGG) tend to reduce base-call quality.
4. Balanced GC content: Extreme GC bias in indexes can cause secondary structures or biased amplification.
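The sequence-level checks in principles 1, 3, and 4 can be sketched in a few lines of Python. This is a minimal illustration, not a validated design tool: the function names, the example pool, and the thresholds (a maximum run of 3 and a GC range of 25–75%) are assumptions chosen for the example, not recommendations from any vendor.

```python
from itertools import combinations

# Hypothetical thresholds, chosen only for illustration.
MAX_HOMOPOLYMER = 3
GC_RANGE = (0.25, 0.75)


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("indexes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(indexes: list[str]) -> int:
    """Smallest Hamming distance over all pairs of indexes in the pool."""
    return min(hamming(a, b) for a, b in combinations(indexes, 2))


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of one repeated base (0 for an empty string)."""
    if not seq:
        return 0
    longest = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest


def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def check_index(seq: str) -> list[str]:
    """Return a list of human-readable problems found in one index."""
    problems = []
    if longest_homopolymer(seq) > MAX_HOMOPOLYMER:
        problems.append("homopolymer run too long")
    low, high = GC_RANGE
    if not low <= gc_fraction(seq) <= high:
        problems.append("GC content out of range")
    return problems


# Made-up 8-base pool used only to exercise the functions.
pool = ["ATCACGTT", "CGATGTCA", "TTAGGCAG", "GCCAATGC"]
print("minimum pairwise distance:", min_pairwise_distance(pool))
for index in pool:
    print(index, check_index(index) or "ok")
```

Checking the minimum over all pairs, rather than an average, matters because a single close pair is enough to cause misassignment.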
Understanding color balance by chemistry type
4-channel SBS (MiSeq, HiSeq):
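Each of the four bases carries a distinct fluorescent dye, read with two lasers: the red laser reads A and C, the green laser reads G and T. Base calling is most reliable when all four bases are represented at every cycle; at minimum, each index cycle needs at least one A or C and at least one G or T. When only one or two bases are present at a cycle, intensity normalization and phasing/pre-phasing estimation become unreliable.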
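2-channel SBS (e.g., NextSeq 500/550, NovaSeq 6000):
Two dye channels (green and red). A = both channels, C = red only, T = green only, G = dark (no signal). The sequencer needs signal in both channels at every index cycle: at least one A or T (for green) and at least one A or C (for red). A pool containing only G and T at a given cycle would leave the red channel empty. Newer 2-channel instruments such as the NovaSeq X use different dyes and a different base-to-channel assignment, so confirm the encoding in the instrument documentation before applying these rules.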
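The 2-channel rule above can be checked cycle by cycle. The sketch below assumes the red/green encoding just described (A = both, C = red, T = green, G = dark) and that all indexes in the pool have the same length; the names `CHANNELS` and `unbalanced_cycles` are made up for this example.

```python
# Channels lit by each base under the red/green encoding described above.
CHANNELS = {
    "A": {"green", "red"},
    "C": {"red"},
    "T": {"green"},
    "G": set(),  # dark: no signal in either channel
}


def unbalanced_cycles(indexes: list[str]) -> list[tuple[int, list[str]]]:
    """Return (cycle, missing_channels) for every cycle lacking a channel.

    Cycles are numbered from 1. zip() walks the pool column by column,
    so each iteration sees the bases sequenced together in one cycle.
    """
    problems = []
    for cycle, bases in enumerate(zip(*indexes), start=1):
        lit = set().union(*(CHANNELS[base] for base in bases))
        missing = {"green", "red"} - lit
        if missing:
            problems.append((cycle, sorted(missing)))
    return problems


# A pool with only G and T at each cycle never lights the red channel.
print(unbalanced_cycles(["GT", "TG"]))
# Adding an index with A or C at each cycle restores the balance.
print(unbalanced_cycles(["GT", "TG", "AC"]))
```

Because the mapping lives in a single dictionary, the same function can be reused for another chemistry by swapping in that instrument's documented encoding.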
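1-channel SBS (iSeq 100):
Each cycle is imaged twice, with a chemistry step between the two images. Image 1: A and T lit. Image 2: C and T lit. T is therefore lit in both images, A only in the first, C only in the second, and G is dark in both. The balance requirements are analogous to 2-channel chemistry: every index cycle needs at least one base that is lit in image 1 (A or T) and at least one that is lit in image 2 (C or T).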
Demultiplexing and error correction
After sequencing, bcl2fastq or DRAGEN demultiplexes reads by matching each index read against the expected index sequences listed in the sample sheet. The number of mismatches tolerated is configurable:
- 0 mismatches: Strictest setting. Only exact matches are assigned. Highest confidence but loses reads with sequencing errors in the index.
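- 1 mismatch (default): Allows one base difference. Requires Hamming distance ≥ 3 between all index pairs to ensure unambiguous assignment, because a read with one error is then still at least 2 away from every other index.
- 2 mismatches: Allows two differences. Requires Hamming distance ≥ 5; in general, tolerating k mismatches requires a minimum distance of 2k + 1. Rarely used.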
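The relationship between the mismatch setting and the required Hamming distance can be illustrated with a simplified assignment routine. This is a sketch of the idea only, not how bcl2fastq or DRAGEN is implemented; the function names and the sample sheet contents are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def max_safe_mismatches(min_distance: int) -> int:
    """Largest k such that 2k + 1 <= min_distance (never below 0)."""
    return max((min_distance - 1) // 2, 0)


def assign(read_index: str, sample_indexes: dict[str, str], mismatches: int = 1) -> str:
    """Return the matching sample name, or "Undetermined".

    A read is assigned only if exactly one expected index lies within the
    allowed number of mismatches; zero or several candidates are ambiguous.
    """
    hits = [
        name
        for name, seq in sample_indexes.items()
        if hamming(read_index, seq) <= mismatches
    ]
    return hits[0] if len(hits) == 1 else "Undetermined"


# Hypothetical sample sheet with two well-separated 6-base indexes.
samples = {"sample_A": "ATCACG", "sample_B": "CGATGT"}
print(assign("ATCACT", samples))                 # one error relative to sample_A
print(assign("ATCAGT", samples))                 # two errors, beyond the default
print(assign("ATCAGT", samples, mismatches=2))   # accepted when two are allowed

# Indexes only 2 apart: a read with one error can be equally close to both.
too_close = {"s1": "AAAA", "s2": "AATT"}
print(assign("AAAT", too_close))
```

The last call shows why distance 2 is not enough for the 1-mismatch setting: the read AAAT is one mismatch from both indexes, so it cannot be assigned.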
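Reads that cannot be unambiguously assigned are placed in the "Undetermined" bin. High undetermined rates (> 5–10%) suggest index problems, such as incorrect or reverse-complemented index sequences in the sample sheet, or poor index read quality.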