Simon Schlumbohm (HSU)
Efficient Algorithms for Improved Information Retention in Integration of Incomplete Omics Datasets
The acquisition of high-quality data in the biomedical field, particularly in omics studies
such as proteomics or transcriptomics, poses a significant challenge due to incomplete
measurements during data acquisition or simply small sample sizes. This results in
datasets with low statistical power that are often further compromised by missing
values, which impede downstream analysis and the accurate interpretation of biological
phenomena.
A common approach to mitigate such limitations is data integration, which combines
multiple datasets to increase cohort sizes by incorporating data from different studies or
laboratories. However, this approach brings new challenges, most notably the so-called
batch effect, a technical bias that obscures biological meaning. Moreover,
infrequently measured features (e.g., proteins or genes) create additional gaps in the data
during integration.
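To make this structure concrete, consider the following minimal Python sketch (all feature names and values are hypothetical), which shows how merging two studies with only partially overlapping feature panels produces batch-shaped blocks of missing values:

```python
import pandas as pd

# Two hypothetical studies measuring partially overlapping protein panels;
# all names and values below are invented for illustration only.
study_a = pd.DataFrame(
    {"sample_1": [7.2, 5.1, 9.3], "sample_2": [7.0, 5.4, 9.1]},
    index=["protein_A", "protein_B", "protein_C"],
)
study_b = pd.DataFrame(
    {"sample_3": [6.8, 4.9], "sample_4": [7.1, 5.0]},
    index=["protein_B", "protein_D"],
)

# Joining on the union of features leaves study-shaped NaN blocks:
# protein_A and protein_C are absent in study B, protein_D in study A.
merged = study_a.join(study_b, how="outer")
print(merged)
```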
As the volume of available biological data continues to expand, there is an increasing
need for computational methods capable of efficiently processing and analyzing these
growing datasets. Anticipated advances in acquisition throughput further necessitate the
development of computationally efficient and robust algorithms.
In addition, to ensure accessibility and broad adoption, it is crucial that bioinformatics
tools be user-friendly, allowing researchers with varying levels of technical expertise
to use them effectively.
To this end, an integration and batch effect reduction tool, the HarmonizR algorithm,
has been developed. This work presents various functionality built to tackle the
aforementioned issues. Dataset integration aims to increase cohort and sample sizes,
which is facilitated by the inclusion of a new unique removal approach. This approach
overcomes prior limitations regarding data retention, greatly increasing HarmonizR’s
value as a pipeline tool applied prior to data analysis by significantly expanding the
number of usable features and data points of any given study. It may be paired with
the added ability to account for user-defined experimental information such as treatment
groups (i.e., covariate information) during adjustment, leading to more robust,
higher-quality results. Regarding computational efficiency, a novel blocking approach
exploits the given data structure to prepare the algorithm for current and future big
data challenges without negatively impacting adjustment quality.
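One way to picture how the missing-value structure can be exploited is a dissection-style decomposition: features are grouped by the pattern of batches in which they were measured, so that each resulting sub-matrix is free of batch-level gaps and can be adjusted on its own (e.g., with a ComBat-style method). The Python sketch below is a simplified, hypothetical illustration of this idea (the helper name `dissect_by_batch_pattern` is invented here), not HarmonizR's actual R implementation:

```python
import pandas as pd

def dissect_by_batch_pattern(data: pd.DataFrame, batch: pd.Series) -> dict:
    """Group features (rows of `data`) by the set of batches in which they
    were measured. Each returned sub-matrix is then free of batch-level
    gaps and could be adjusted independently, e.g. with a ComBat-style
    method. Illustrative sketch only; not HarmonizR's actual logic."""
    # Which batches hold at least one value for each feature?
    present = data.notna().T.groupby(batch).any().T   # features x batches
    pattern = present.apply(lambda row: tuple(row.index[row]), axis=1)

    sub_matrices = {}
    for pat, features in pattern.groupby(pattern).groups.items():
        samples = batch[batch.isin(pat)].index        # samples in those batches
        sub_matrices[pat] = data.loc[features, samples]
    return sub_matrices

# Usage with the toy `merged` matrix from the sketch above, where each
# sample's batch is the study it came from:
# batch = pd.Series({"sample_1": "A", "sample_2": "A",
#                    "sample_3": "B", "sample_4": "B"})
# blocks = dissect_by_batch_pattern(merged, batch)
```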
Furthermore, the algorithm’s batch effect adjustment capabilities are shown to be effective
on various omics types, with a notable extension towards single-cell count datasets
through additional adjustment methodology, as well as on non-biological data in the form
of an attention-deficit/hyperactivity disorder study.
To address remaining challenges, the newly developed BERT algorithm introduces a novel
architectural approach, offering improvements in information retention and computational
efficiency. A comparative analysis of BERT and HarmonizR explores the advantages
of BERT in terms of feature and overall data retention as well as reduced runtimes,
positioning it as a valuable complement to the existing framework.
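As a purely schematic picture of how a hierarchical, tree-style architecture might integrate many batches through pairwise merge-and-adjust steps, consider the following Python sketch. It is a hypothetical illustration (the names `tree_merge_adjust` and the placeholder `adjust` callback are invented here) and does not claim to reproduce BERT's actual design:

```python
import pandas as pd
from typing import Callable, List

def tree_merge_adjust(batches: List[pd.DataFrame],
                      adjust: Callable[[pd.DataFrame], pd.DataFrame]) -> pd.DataFrame:
    """Schematic tree-style integration: batches (features x samples, with
    unique sample names) are merged pairwise and each merged pair is
    adjusted before moving up one tree level, so every adjustment step
    sees only two partially overlapping inputs. Illustrative only;
    `adjust` stands in for a real two-batch adjustment such as ComBat."""
    level = list(batches)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            merged = level[i].join(level[i + 1], how="outer")
            next_level.append(adjust(merged))
        if len(level) % 2:               # odd batch is carried upward as-is
            next_level.append(level[-1])
        level = next_level
    return level[0]

# e.g. tree_merge_adjust([study_a, study_b], adjust=lambda df: df)
```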
Lastly, to enhance accessibility and ease of use, plugins for the popular Perseus software
have been created and are described, enabling seamless integration of both algorithms
into established bioinformatics workflows and specifically aiding researchers less familiar
with the technical aspects of the presented algorithms and bioinformatics in general.