Arun Das

As a computer scientist and computational biologist, my work is primarily focused on where these two disciplines intersect.

In computer science, my primary areas of interest are algorithms, machine learning approaches for large datasets, sketching and approximation algorithms to speed up existing tasks, and cryptography and computer security (especially around genomic data).

As a computational biologist, I am interested in identifying patterns of variation in certain populations, and understanding how these variants are represented in or absent from widely used genomic datasets. I am particularly interested in understanding the limitations of linear reference genomes, and in the variation - both globally common and individual- or population-specific - that is discarded in traditional approaches.

In my work, I have used my computer science background to develop methods that accelerate existing tasks in genomics, lowering the computational barriers to genomic analysis. I have also developed approaches to analyze large datasets and to combine data from multiple sources, building a clearer picture of the variation present in large genomic datasets.

I did my PhD in Computer Science with Dr. Michael Schatz at Johns Hopkins University, focusing on a range of topics in computational genomics. These included Sapling, a learned index structure approach for speeding up suffix array queries, sketching and sampling approaches to speed up long read classification from metagenomic experiments, and approaches to utilize unmapped reads to better understand variation in South Asian individuals relative to widely used reference genomes. Full details of these projects, as well as others I have worked on, can be found below.

My Google Scholar page can be found here, and my full CV can be found here.

Assembling unmapped reads reveals missing variation in South Asian Genomes

My latest work, and the major focus of my PhD dissertation, investigates and aims to improve the representation of South Asian genomes in existing genomic datasets, and evaluates the improvements offered by new linear and pangenome references. This work has been presented at Biology of Genomes 2023, ASHG 2023, and Biological Data Science 2024, and will be presented at Biology of Genomes 2025. As of May 2025, a pre-print is available on bioRxiv. This work was done with Dr. Mike Schatz, Dr. Rajiv McCoy, and Dr. Arjun Biddanda.

Beyond the Human Genome Project: The Age of Complete Human Genome Sequences and Pangenome References

I was part of an effort to write a review summarizing the evolution of the human reference genome, from the earliest efforts in the late 1990s and early 2000s, to the first complete human reference genome, T2T-CHM13, to the advent of human pangenomes. This work has been published in Annual Reviews, and can be found here.

Sketching and Sampling Approaches for Long Read Classification

My second PhD project, "Sketching and sampling approaches for fast and accurate long read classification", used sketching and sampling to speed up read classification relative to existing high-overhead index- and alignment-based methods. This work has been published in BMC Bioinformatics, and can be found here.

Sapling: accelerating suffix array queries with learned data models

My first project was "Sapling: Accelerating Suffix Array Queries with Learned Data Models", which focused on learned index structures/data models for genomics. It was published in Bioinformatics, and can be found here. This work was led by Dr. Melanie Kirsche.

KinVis: a visualization tool to detect cryptic relatedness in genetic datasets

In 2017, I worked with Dr. Michael Aupetit on KinVis, a visual analytics tool (built entirely in R and Shiny) for exploring kinship information in Genome-Wide Association Studies (GWAS). The paper can be found in Bioinformatics.

Approaches in Genomic Privacy

Under the supervision of my undergraduate advisor, Dr. Sorin Istrail, I completed a senior undergraduate honors thesis titled "Approaches in Genomic Privacy". My thesis surveyed the current state of genomic privacy, reviewed potential solutions (both technical and non-technical), and included my projections for the future of the field. It can be found here.