XHap: Haplotype Assembly using Long-distance Read Correlations learned by Transformers
Shorya Consul, Ziqi Ke, Haris Vikalo
Bioinformatics Advances, 2023
Paper
Reconstructing haplotypes of an organism from a set of sequencing reads is a computationally challenging problem. Limitations on read length and sequencing errors render this problem difficult even for diploids; the complexity of the problem grows with the ploidy of the organism. We present XHap, a novel method for haplotype assembly that aims to learn correlations between pairs of sequencing reads, including those that do not overlap but may be separated by large genomic distances, and utilize the learned correlations to assemble the haplotypes. XHap accomplishes this by utilizing transformers, a powerful deep-learning technique that relies on the attention mechanism, to discover dependencies between non-overlapping reads. Experiments on semi-experimental and real data demonstrate that the proposed method significantly outperforms state-of-the-art techniques in diploid and polyploid haplotype assembly tasks on both short and long sequencing reads, resulting in as much as a 300-fold increase in the size of the assembled haplotypes.
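To illustrate why attention can relate reads that never overlap, here is a minimal sketch of generic scaled dot-product self-attention over a set of read embeddings. It is not the XHap architecture: the embedding matrix `X`, the projection matrices `Wq`, `Wk`, `Wv`, and the function name are all hypothetical, and how XHap actually embeds reads is not stated in the abstract.

```python
import numpy as np


def self_attention(X, Wq, Wk, Wv):
    """Generic scaled dot-product self-attention (illustrative only).

    X holds one embedding per read (shape: n_reads x d_model). Every read
    attends to every other read, so the weight between two reads does not
    depend on whether they overlap on the genome.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ V, weights


# Hypothetical usage: 6 reads, each embedded in 8 dimensions.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 4)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
```

Each row of `attn` is a probability distribution over all reads, which is the sense in which pairwise read correlations can be learned regardless of genomic distance.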
XVir: A Transformer-Based Architecture for Identifying Viral Reads from Cancer Samples
Shorya Consul, John Robertson, Haris Vikalo
Preprint
Many cancers can be linked to viral infections. With the advent of modern sequencing technologies, we now have access to massive amounts of tumor data, which in turn have allowed studies of the associations between viruses and cancers. However, the high diversity of oncoviral families makes detecting viral DNA challenging, thereby complicating the study of these associations. To that end, we propose XVir, a transformer-based deep learning architecture to reliably identify viral DNA present in human tumors. Results on semi-experimental data demonstrate that XVir is capable of achieving high detection accuracy, generally outperforming state-of-the-art competing methods while being more compact and less computationally demanding.
Differentially Private Median Forests for Regression and Classification
Shorya Consul, Sinead Williamson
Preprint
Decision forests are popular for both regression and classification but require a large number of queries for training. This makes attaining differential privacy especially challenging. We propose a novel scheme, DiPriMe forests, that guarantees differential privacy while maintaining high utility, i.e., high predictive performance.
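As a rough illustration of what a private median query can look like, the sketch below samples a median-like value with the exponential mechanism, a standard differential privacy tool. The abstract does not say which mechanism DiPriMe forests use, so this is an assumption rather than the authors' method; the function name, the public bounds `lower` and `upper`, and the rank-based utility are illustrative choices.

```python
import math
import random


def dp_median(values, lower, upper, epsilon, rng=None):
    """Sample an approximate median via the exponential mechanism (sketch).

    Assumes [lower, upper] is a publicly known range. The utility of a
    candidate is minus its rank distance from the middle, which changes by
    at most 1 when a single record is replaced (sensitivity 1).
    """
    rng = rng or random.Random()
    # Clip to the public range so the output domain is data-independent.
    xs = sorted(min(max(v, lower), upper) for v in values)
    n = len(xs)
    edges = [lower] + xs + [upper]
    log_weights = []
    for i in range(n + 1):
        # Interval i = [edges[i], edges[i + 1]] has i data points to its left.
        width = edges[i + 1] - edges[i]
        if width <= 0:
            log_weights.append(-math.inf)
            continue
        utility = -abs(i - n / 2)
        log_weights.append(math.log(width) + epsilon * utility / 2)
    top = max(log_weights)
    if top == -math.inf:
        return lower  # degenerate range: lower == upper
    # Shift by the maximum so the exponentials cannot overflow.
    weights = [math.exp(lw - top) for lw in log_weights]
    i = rng.choices(range(n + 1), weights=weights)[0]
    return rng.uniform(edges[i], edges[i + 1])
```

A smaller `epsilon` gives stronger privacy but a noisier split point. A forest issues many such queries during training, which illustrates why the privacy budget is hard to manage in this setting.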
Reconstructing Intra-tumor Heterogeneity via Convex Optimization and Branch-and-Bound Search
Shorya Consul, Haris Vikalo
ACM Conference on Bioinformatics, Computational Biology and Health Informatics (ACM-BCB), 2019
Paper | Code
Reconstructing tumor populations from heterogeneous samples using high-throughput sequencing data is a highly valuable area of study due to its potential to inform targeted studies and treatments. This, however, is a challenging task due to the complex mutations present and read lengths being too short to span regions exhibiting structural variations. We present a novel algorithmic framework, AMTHet, to infer the tumor clonal populations and their frequencies from a heterogeneous sample based on copy number variations.
A MAP Framework for Support Recovery of Sparse Signals Using Orthogonal Least Squares
Shorya Consul, Abolfazl Hashemi, Haris Vikalo
International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018
Paper
We propose MAP-AOLS, an algorithm that leverages the statistical information about the sensing matrix and signal to greedily reconstruct sparse binary signals from their compressed measurements. This stands in contrast to conventional greedy algorithms, such as OLS and OMP, that perform reconstruction without utilizing any knowledge about the statistical distributions.
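For context, the conventional baseline mentioned above can be sketched as follows. This is a minimal version of orthogonal matching pursuit (OMP), not MAP-AOLS: it selects columns purely by correlation with the residual and uses no prior on the signal or the sensing matrix. The function name and the small test matrix are illustrative assumptions.

```python
import numpy as np


def omp(A, y, k):
    """Orthogonal matching pursuit: greedily recover a k-sparse x from y = A x."""
    y = np.asarray(y, dtype=float)
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        # Pick the unused column most correlated with the current residual.
        corr = np.abs(A.T @ residual)
        if support:
            corr[support] = 0.0
        support.append(int(np.argmax(corr)))
        # Re-fit all selected coefficients by least squares, then update the residual.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x = np.zeros(A.shape[1])
    x[support] = coef
    return x


# Hypothetical 4 x 6 sensing matrix with unit-norm columns.
s = 1 / np.sqrt(2)
A = np.array([
    [1, 0, 0, 0, s, 0],
    [0, 1, 0, 0, s, 0],
    [0, 0, 1, 0, 0, s],
    [0, 0, 0, 1, 0, s],
], dtype=float)
x_true = np.array([1, 0, 1, 0, 0, 0], dtype=float)  # sparse binary signal
x_hat = omp(A, A @ x_true, k=2)
```

OLS differs from OMP only in the selection rule: it picks the column that most reduces the least-squares residual after projection rather than the one with the largest raw correlation.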