Speaker
Description
We present a mathematical framework for quantifying basecalling error at multiple scales in single-molecule (nanopore) sequencing, from individual bases to whole-sequence classification.
We define hierarchical Phred-like quality scores (per-base, per-read, and per-sequence) and prove via Jensen's inequality that averaging in the Phred domain systematically overestimates accuracy relative to the Phred transform of the mean error. This bias, driven by the concavity of the logarithm, has direct consequences for quality reporting. We address it with an alignment-based correctness score that combines predicted basecaller confidence and empirical accuracy into a single Phred-scale summary.
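As a minimal illustration of the Jensen gap, the following Python sketch compares the mean of per-base Phred scores with the Phred transform of the mean error probability. It assumes the standard definition Q = -10 log10(p); the function names and the two-base example are hypothetical and are not taken from the talk.

```python
import math

def phred(p):
    """Standard Phred transform of an error probability p in (0, 1]."""
    return -10.0 * math.log10(p)

def mean_of_phred(error_probs):
    """Average taken in the Phred domain (the biased summary)."""
    return sum(phred(p) for p in error_probs) / len(error_probs)

def phred_of_mean(error_probs):
    """Phred transform of the arithmetic mean error probability."""
    return phred(sum(error_probs) / len(error_probs))

# Hypothetical read with one Q10 base and one Q30 base.
errors = [0.1, 0.001]
naive = mean_of_phred(errors)   # about 20.0
honest = phred_of_mean(errors)  # about 12.97, since the mean error is 0.0505
print(f"mean of Q: {naive:.2f}, Q of mean error: {honest:.2f}")
```

Because -log10 is convex, the first quantity can never be smaller than the second, and the gap widens as per-base error rates become more heterogeneous.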
We then lift the analysis to sequences by constructing a confusion matrix indexed by true targets and basecaller assignments. Row-normalization produces a stochastic matrix of pairwise misclassification probabilities; its Frobenius distance from the identity yields a Phred-like scalar of classifier fidelity. Clopper–Pearson confidence intervals, computed for each entry of a multinomial row, together with minimum read-count bounds, ensure reliable estimation of rare confusions. Per-base reliability zones bridge the two scales: under conditional independence, the product of position-specific error rates at discriminating sites predicts which off-diagonal entries dominate, enabling anticipation of sequence-level misclassification from basewise profiles.
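The sequence-level quantities can be sketched in a few lines of standard-library Python. Several choices here are assumptions rather than the talk's definitions: the count matrix is invented, the mapping from Frobenius distance to a Phred-like score is taken to be -10 log10(d), each row entry is treated as marginally binomial for the Clopper–Pearson interval, and the minimum read count is read as the smallest n for which a confusion observed zero times has an upper bound below a tolerance.

```python
import math

def row_normalize(counts):
    """Turn a confusion-count matrix into a row-stochastic matrix."""
    return [[c / sum(row) for c in row] for row in counts]

def frobenius_from_identity(P):
    """Frobenius norm of P - I for a square matrix P."""
    n = len(P)
    return math.sqrt(sum((P[i][j] - (1.0 if i == j else 0.0)) ** 2
                         for i in range(n) for j in range(n)))

def fidelity_phred(d):
    """Assumed mapping: Phred-like score -10*log10(d); infinite if d == 0."""
    return math.inf if d == 0 else -10.0 * math.log10(d)

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); direct sum, fine for moderate n."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def _solve_decreasing(f, target):
    """Bisection for f(p) = target on [0, 1], with f decreasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided binomial interval for k successes in n trials."""
    lower = 0.0 if k == 0 else _solve_decreasing(
        lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    upper = 1.0 if k == n else _solve_decreasing(
        lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper

def min_reads_for_zero_count(eps, alpha=0.05):
    """Smallest n whose k = 0 upper bound, 1 - (alpha/2)**(1/n), is <= eps."""
    return math.ceil(math.log(alpha / 2) / math.log(1 - eps))

# Hypothetical counts: rows are true targets, columns are assignments.
counts = [[980, 15, 5],
          [8, 990, 2],
          [0, 3, 997]]
P = row_normalize(counts)
d = frobenius_from_identity(P)
ci = clopper_pearson(counts[0][1], sum(counts[0]))  # interval for P[0][1]
print(f"distance={d:.4f}, Q={fidelity_phred(d):.2f}, CI={ci}")
```

The bisection avoids any dependency on a statistics library; with SciPy available, the same interval is usually obtained from beta-distribution quantiles instead.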