Research Projects

Ordinal Data Biological Data Data Storage
Ordinal Data Processing Analysis and Compression of Biological Data Data Storage in Non-Volatile Memory

Ordinal data processing

Ordinal data refers to information presented as comparisons rather than values, such as rankings and partial orders. One of the advantages of ordinal data is that it does not rely on a scale, which makes it a very flexible representation. As Kendall (Rank correlation methods, 1955) puts it: What the ranking loses in accuracy, it gains in generality, for if we stretch the scale of measurement (and even if we stretch it differently in different regions) the ranking remains unaltered. As a result, rankings have a wide range of applications, from identifying disease-causing genes to error-correcting codes. Since rankings depend on comparisons of items and not a subjective scale that may vary from user to user, they are also useful for representing user preference in recommender systems.

Distances on rankings and rank aggregation

Kendall tau

Weighted Kendall tau
Kendall tauWeighted Kendall
The permutation polytopes for Kendall tau and weighted Kendall distances. Edge lengths denote distance between permutations differing in an adjacent transposition.

To process ranked data, we often need a distance measure between rankings, for example, when evaluating the performance of search engines or in distance-based rank aggregation, which is the task of combining ordinal data obtained from multiple sources, such as a group of experts, voters, or databases, into one ranking.

My work in this area includes axiomatic derivation of distances capable of incorporating similarity information and positional factors, such as the non-uniform importance of different parts of rankings; algorithms for computing and approximating these distances; and distance-based rank aggregation algorithms. For example, I introduced the weighted Kendall distance, which can be tuned to be more sensitive to changes in top positions in rankings and is computable in time \(O(n\log n)\) (GitHub). In rank aggregation and voting, this distance enforces a higher degree of agreement between the sources and the aggregate in the top positions compared to conventional metrics and can eliminate negative outlier votes. We use these distances and methods to develop HyDRA, a tool for gene prioritization.

Sorting streaming data with limited storage

Sorting big data with small memory

The problem of sorting a data stream with limited storage arises both in big data contexts where the ordering of the elements of a large data stream is to be approximated, and in learning user preference rankings for recommender systems where users observe a set of items one at a time, such as movies on Netflix or songs on Spotify, and can recall information about only a small number of previous items. A useful tool for studying this problem is the rate-distortion theory of permutations. To solve the problem, we first find average- and worst-case rate-distortion bounds for the permutation space. These results are then used to provide bounds on the performance of algorithms for sorting streaming data with limited storage. Furthermore, we provide a simple algorithm that achieves these performance bonds in the asymptotic sense up to a constant factor.

Information-theoretic & machine learning techniques for analysis and compression of biological data

The amount of biological data is increasing at an unprecedented rate. For example, the EMBL-EBI database doubles in size every 9 months, while storage capacity doubles every 18 months. As a result, biological data is poised to become one of the most challenging types of data in terms of storage and analysis. My research aims to address these challenges by leveraging techniques from information theory, machine learning, and probabilistic modeling.

Gene prioritization

Hydra, a gene prioritization tool
HyDRA combines data from various databases (not all databases shown).

For the task of gene prioritization, we developed HyDRA, which identifies genes that contribute to the onset of a given disease among a set of test genes. For each database in a set of diverse biological databases, HyDRA creates a ranking of the test genes, for example, based on their similarity to a set of training genes as measured by that database. It then aggregates these rankings to produce a ranking of the test genes. HyDRA uses the aforementioned distances and algorithms for rank aggregation in order to achieve a higher accuracy in the top positions, which are typically selected to be experimentally tested. HyDRA also has the ability to augment the training set through gene discovery. We tested HyDRA for a number of diseases including autism and several types of cancer and showed it outperforms the state-of-the-art tools, such as Endeavour and ToppGene.

Modeling and analysis of sequence data

Estimation error for mutation rates
Estimation error versus the number of mutations in a simulated experiment.

To better exploit the statistical properties of biological sequence data for analysis and compression, I study mathematical models of DNA mutation processes including duplication and point mutation. As part of this work, we have studied the expressive power of tandem and interspersed duplication mutations by computing the capacities of biologically-inspired sequence growth models. In addition to quantifying the expressive power, these capacities determine the rate of compressibility of sequences arising in these models. We have also introduced a stochastic model for tandem duplication and point mutation and developed an efficient method for estimating from a single sequence the parameters of the model, i.e., the probabilities of various mutation events, as well as the total number of mutations (In the figure, \(q\) and \(\hat q\) respectively denote the distribution of tandem duplication lengths and its estimate, and \(n\) and \(\hat n\) denote the number of mutations and its estimate.).

The mutation model and the estimation method can, for example, be used in phylogenomics, to find evolutionary distance between species, and in the study of genetic diseases and cancer, which are caused by harmful DNA mutations. We also study the related combinatorial problem of reducing sequences by omitting (approximate) tandem repeats and show that the maximum number of omissions required to transform a binary sequence of length \(n\) to a repeat-free sequence is \(\Theta(n)\).

Compression of metagenomic sequencing data

Metagenomics refers to the genomic study of environmental samples containing DNA from several organisms. Due to the public health applications of metagenomic studies, including studies of bacterial communities in water, soil, atmosphere, and the human body, the field is receiving increased attention. For example, the NIH Human Microbiome Project is focused on the development of metagenomic tools and databases for characterizing human microbiota and their role in human health. As the size of metagenomic datasets grow, so does the need for efficient compression tools. By integrating taxonomy identification, alignment, and source coding, we provide a parallelized compression platform that when tested on a variety of metagenomic datasets reduced file sizes by more than 87%, outperforming standard tools.

Coding for storage systems

Estimation error for mutation rates
Constructions and capacity bound for codes correcting translocation errors.

Because of its advantages over other types of storage, including speed, robustness (due to lack of mechanical parts), and efficiency, flash memory has become a $30B industry and a widely used technology, especially in mobile and cloud applications. However, the flash technology suffers from loss of reliability as data density increases and from a relatively short life span. My research focuses on addressing these problems by designing i) error-correcting codes for flash memories with a large number of charge levels that allow high-density but reliable data storage and ii) rewriting codes that decrease the need for erasure cycles and thus improve the speed and endurance of the memory. To achieve these goals, we introduced the translocation error model, which can describe a wide class of flash memory errors, including multi-level charge drops, read- and write-disturbs, and over-injections. We also presented several code constructions and decoders for correcting errors in this model, as well as information-theoretic bounds on the maximum possible coding rate. The rewriting codes are based on multipermutations and allow writing new data without erasures for a number of times, thus increasing the speed and the endurance of the memory.