Enhancing non-Perl bioinformatic applications with Perl
Christos Argyropoulos
Jun 26, 2024
About This Presentation
Building novel, component-based applications using Object Orientation, PDL, Alien, FFI, Inline and OpenMP
Slide Content
Enhancing non-Perl bioinformatic applications with Perl: Building novel, component-based applications using Object Orientation, PDL, Alien, FFI, Inline and OpenMP. Christos Argyropoulos MD PhD FASN, Associate Professor of Internal Medicine, University of New Mexico School of Medicine, Albuquerque, New Mexico, USA
Disclosures Patent on an RNA sequencing workflow (managed through Rainforest Innovations)
Learning Objectives: the role of Perl in 21st-century High Performance Computing & Bioinformatics (HPC, HPB); component-based applications for HPB with Perl; Object Orientation / Meta-Object Programming abstractions & "Jupyterization"; data flows in HPB and parallelization opportunities. Use cases: random number generation with objects and roles; command-line enhancements; coarse (process) & fine (thread) level parallelization through Perl (MCE) and C (OpenMP)
A brief history of Perl's role in Bioinformatics. 30 years ago Perl was credited with saving the Human Genome Project. Up to the early 2000s, the field's major tasks were the management of workflows, managing databases of biological sequences (text), & "Web programming". The late 90s and early 00s were a transition point in the field: the HGP completed, quantitative molecular biology emerged, and experimental analysis tasks rose to prominence. R took over: compact notation for cross-classification & regression / containers for homogeneous objects / more aggressive advertising of capabilities to connect with low-level languages / shiny new toy
HPB in the not-so-roaring 20s: sitrep. Explosion of techniques to find AND count sequences: a single person can generate 100× more data than the entire HGP consortium did, in little more than a day. Data is noisy text + quantitative features. COVID-19 put sequencers everywhere. Compute has gone multicore & multi-device. AI has made C (and the Python API) great again. The intersection of HPC and Data Science (DS) for biological data is the market for compute solutions targeting HPB. https://humanprogress.org/the-fastest-learning-curve-in-history/
High Performance Scientific Computing: a more exact form of data science. HPC = Data flows + "Leafy" Functions (most often in some low-level language: C/Fortran/ASM). Data: OSEMN, CRISP-DM, SEMMA, MS-TDSP (= SCRUM + CRISP-DM), etc. Applications: High-Level Abstractions over Leafy Low-Level Components (LLLC). https://www.datascience-pm.com/osemn/ https://youtu.be/eaGeoOleq1c https://www.youtube.com/watch?v=s4uF8UOJz9k
Revisiting Perl for HPC/B, or why the O'Reilly book sales pitch was prophetic. Problem → Perl solution: abstractions for data flows inside data-intense HPC/HPB applications → to be discussed (numerous, too numerous?, OO and very nice MOP facilities to aid the design); how do we talk to LLLC? → FFI + Inline + command line; how do we make sense of LLLC output? → PDL & R for high-level processing; how do we embed LLLC in our apps? → Alien (bonus: target compilation); how do we make our applications multi-core friendly? → MCE to parallelize Perl steps, OpenMP & OpenMP::Environment to control threads; how do we make HPC/HPB Perl application development friendlier? → "Jupyterization" through 1+2
An example from work
How can one use Perl to write scientific components for 21st-century bioinformatics? Use cases: 1. Polyester 2. Edlib
A Tiny Role for Randomness. Things that one can do with random numbers: generation of test scenarios from the space of cases (in our case, sequences); scientific calculations (Monte Carlo) for statistical inference & stochastic finance; systems-level modeling & reliability testing. Random Number Generation (RNG) is a great case for Roles & OOP. Statistical distributions come with attributes: the domain of the random variable with said distribution; the parameters of the distribution; mathematical functions (probability density & cumulative distribution functions and their inverses). Things one can do with statistical distributions (scientific LARP): algorithms (such as the inverse CDF method) that can use properties of distributions to generate random numbers; deterministic calculations based on these distributions (e.g. p-values for decision testing). External dependencies can also be viewed as roles, or as objects for multiple inheritance
The Inverse CDF Method in Perl. The ultimate generic RNG algorithm: simulate a uniform number between 0 and 1; this is the CDF value of an unknown number that is guaranteed to be distributed according to the target distribution; use the inverse CDF function to find that unknown value. A role-based implementation: import all the relevant definitions from a module that provides the CDF and inverse CDF methods for a very large collection of random distributions into a new module; compose a role for the inverse CDF method out of the module of the previous step; provide the module for the simulation with a runtime plugin for the uniform random number simulator. Designer degrees of freedom: which uniform RNG to use? which collection of statistical distributions to use? how to implement the inverse CDF method: a dynamic language, a low-level language with an API, or a dedicated matrix sub-language (e.g. PDL or NumPy)?
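The design above can be made concrete in a few lines. A minimal sketch, assuming Moose for the role machinery and Math::GSL::CDF for the inverse CDF; the package names are illustrative, not the actual distribution code:

```perl
use strict;
use warnings;

package Role::InverseCDF;
use Moose::Role;

requires 'uniform_rng';    # pluggable source of U ~ Uniform(0,1)
requires 'inv_cdf';        # distribution-specific inverse CDF

sub simulate {             # the generic inverse-CDF algorithm itself
    my ( $self, $n ) = @_;
    return [ map { $self->inv_cdf( $self->uniform_rng->() ) } 1 .. $n ];
}

package Gaussian::Sampler;
use Moose;
use Math::GSL::CDF qw(gsl_cdf_gaussian_Pinv);

has sigma       => ( is => 'ro', default => 1 );
# a real plugin would avoid the endpoints of (0,1); rand() suffices here
has uniform_rng => ( is => 'ro', default => sub { sub { rand } } );

sub inv_cdf {
    my ( $self, $u ) = @_;
    return gsl_cdf_gaussian_Pinv( $u, $self->sigma );
}

with 'Role::InverseCDF';   # applied after the methods it requires exist

package main;
my $draws = Gaussian::Sampler->new( sigma => 2 )->simulate(5);
print "@$draws\n";
```

Swapping the distribution means writing a different inv_cdf; swapping the generator means passing a different uniform_rng coderef at construction time, which is exactly the runtime-plugin degree of freedom listed above.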
Exploring the universe of choices. Fix the underlying space of distributions to those provided by GSL: a vectorized implementation (PDL::GSL) or a non-vectorized one (Math::GSL). Two random number generators: Perl and PDL. Four possible combinations, going from fully vectorized to partially vectorized & non-vectorized. Questions: What is the overhead of OOP? How well does the implementation perform against a non-OO, "hook"-based R version? https://github.com/chrisarg/bio-seqalignment-examples-tailingpolyester
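For flavor, a minimal sketch of the vectorized path, assuming PDL::GSL::RNG (the generator type and draw count here are arbitrary choices):

```perl
use strict;
use warnings;
use PDL;
use PDL::GSL::RNG;

my $rng = PDL::GSL::RNG->new('taus');   # GSL Tausworthe generator
$rng->set_seed(42);

# one call into C fills an ndarray with a million uniform draws,
# versus a million Perl-level calls in the non-vectorized path
my $u = $rng->get_uniform(1_000_000);
print $u->slice('0:4'), "\n";
```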
An application of the RNG modules: Polyester enhancement. Polyester is one of the first realistic RNA sequencing simulators: coded in R, lightweight, does not require a supercomputer to run, but missing certain steps for our protocol (e.g. adding the A tails). One can use the RNG to add these tails to sequences before they are used in Polyester simulations. Developing the application took fewer than 4 hours (after the RNGs had been created): took the R functional interface and asked GitHub Copilot to generate a getopt command-line interface from the specification; took the R getopt processing code and asked GitHub Copilot to generate a Perl-level Getopt::Long interface from the R version of getopt. Most of the time was spent nicely formatting the code & writing the documentation. https://github.com/chrisarg/bio-seqalignment-examples-tailingpolyester https://github.com/chrisarg/bio-seqalignment-applications-sequencingsimulators-rnaseq-polyester
Copilot-assisted command-line interface authoring
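A hedged sketch of what such a generated Getopt::Long interface looks like; the option names and defaults below are illustrative stand-ins, not the actual tool's interface (see the linked repositories for that):

```perl
use strict;
use warnings;
use Getopt::Long;
use Pod::Usage;

my %opt = ( tail_length => 100, seed => 42 );   # illustrative defaults
GetOptions(
    'input=s'       => \$opt{input},            # FASTA of transcript sequences
    'output=s'      => \$opt{output},           # tailed sequences for the simulator
    'tail_length=i' => \$opt{tail_length},      # target polyA tail length
    'seed=i'        => \$opt{seed},             # RNG seed for reproducibility
    'help|h'        => sub { pod2usage(1) },
) or pod2usage(2);

die "an --input FASTA file is required\n" unless defined $opt{input};
```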
Options for trimming polyA tails. Removing the polyA tail (a stretch of A's at the end of the sequence string) is an overlooked activity in sequencing workflows. The length of the tail can be biologically important OR a nuisance (when introduced by the protocol) when identifying the sequence. The state of the art is a Python program called cutadapt, written ~15 years ago. There has never been an empirical demonstration that the cutadapt algorithm is the best in terms of accuracy or performance, because there have been no competing proposals! Cutadapt changed its algorithm a few years ago, noting that the new one performs better (it is unclear how "better" is defined!)
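To see why the task is less trivial than it looks, consider the naive regex trimmer below; a single sequencing error inside the tail defeats it, which is exactly the situation cutadapt-style algorithms must tolerate:

```perl
use strict;
use warnings;

# naive trimmer: strip a trailing run of A's
for my $read (qw(ACGTGGTCCAAAAAAAAAA ACGTGGTCCAAAAGAAAAA)) {
    ( my $trimmed = $read ) =~ s/A+\z//;
    print "$read -> $trimmed\n";
}
# the second read keeps 'ACGTGGTCCAAAAG': one erroneous G inside
# the tail defeats the exact-match regex
```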
An empirical evaluation of the performance of cutadapt & other polyA trimmers: https://arxiv.org/abs/2406.10271 This task may be repeated tens to hundreds of millions of times for each dataset. cutC: a cutadapt implementation in C, interfaced to Perl via Inline::C. changepoint: a statistical technique implemented in C, interfaced to Perl via Inline::C
Approximate string matching and alignments: the workhorse of sequencing workflows. The ultimate output of sequencing instruments is massive amounts of error-prone text. Sequencing the genome of a single person yields ~200-300 GB of data, and all this text must be searched against 3 GB of text (the human genome). In RNA sequencing applications, the text must not only be mapped (= find the part of the reference each read came from), but these hits must be counted as well. In meta-genomic applications, the reference database may contain the genomes of the target organism AND many microbes/viruses (the memory profile can get ugly really quickly). Both databases and samples may contain errors and/or duplications. The gold standard for such searches are methods that can perform approximate string matching accounting for errors (insertions/deletions of text, and substitution of one letter for another; think agrep)
A brief overview of alignment. Dynamic programming techniques are used to compare text sequences. They require knowledge of the alphabet of the sequences (e.g. the bases of DNA), and specification of awards (the same letter found in the two sequences at the same position) and penalties (mismatches). Alignment puts the letters of the two sequences in direct correspondence by incorporating "edits" (insertions, deletions, substitutions). Alignment can be local (finding common "words") or "global" (matching the entire length of two sequences) using gaps (equivalent to spaces in a text)
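A minimal sketch of the dynamic programming idea with unit awards/penalties (global alignment under plain edit distance); production aligners add substitution matrices, affine gap penalties, and heavy SIMD optimization:

```perl
use strict;
use warnings;
use List::Util qw(min);

sub edit_distance {
    my ( $s_str, $t_str ) = @_;
    my @s = split //, $s_str;
    my @t = split //, $t_str;
    my @d;
    $d[$_][0] = $_ for 0 .. @s;    # column 0: delete the whole prefix
    $d[0][$_] = $_ for 0 .. @t;    # row 0: insert the whole prefix
    for my $i ( 1 .. @s ) {
        for my $j ( 1 .. @t ) {
            $d[$i][$j] = min(
                $d[ $i - 1 ][ $j - 1 ]
                    + ( $s[ $i - 1 ] eq $t[ $j - 1 ] ? 0 : 1 ),  # (mis)match
                $d[ $i - 1 ][$j] + 1,                            # deletion
                $d[$i][ $j - 1 ] + 1,                            # insertion
            );
        }
    }
    return $d[-1][-1];
}

print edit_distance( 'ACGTT', 'AGTT' ), "\n";   # 1: one deletion
```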
The dropping shoe of choice. Sequence comparison in a nutshell: for every sequence comparison method, for every query sequence (Q), for every reference sequence (R), compute Similarity(R,Q); find the best match between Q and all R; compare sequence comparison methods using best-match pairs. The effects of publish-or-perish: dynamic programming algorithms are a cottage industry; multiple heuristics to address quadratic run time/memory; SIMD and vectorized instruction sets (SSE2/SSE4/AVX) made the problem worse; it is unclear which algorithm performs best under which circumstances, especially for 3rd-generation sequencing platforms; nearly all algorithms were developed for a different problem (evolutionary relationships) rather than database search; similarity and best match also admit multiple options for the same algorithm; clustering & filtering approaches can be used as preprocessing steps, adding another layer of complexity
Take a deep breath and… clarify the data flows (HPB = data flows + leafy functions); control the number of libraries/modules one has to write before hitting the jackpot; control the complexity of each module; Jupyterize (single file) your code; think about parallel performance BEFORE you even start coding: the "leafy" function may already be somewhat parallel (threaded/SIMD-aware, etc.); parallelize your top-level search loop in your design through multiprocessing; combine threads/processes (and nodes in clusters)
Two dataflows account for the vast majority of sequence mapping applications. Aligner: conducts similarity searching. Mapper: reduces the similarity scores to a single best value for each query. Aligner/Mapper: conducts similarity searching and reduction together. From the perspective of the high-level controller (i.e. the Perl application) of the data flow, all 3 are "leafy" components.
Design of Bio::SeqAlignment::Components::SeqMapping. A wrapper around an external sequence mapping library must provide the following methods when consuming the LinearLinear dataflow: seq_align, the innermost loop of the similarity search; extract_sim_metric, a method that extracts the similarity metric for each work element; reduce_sim_metric, a method that reduces the similarity metric to a single value for each work element; cleanup, a method that handles cleanup operations, either in memory or on disk. In the Generic mapper these are coderefs. Work is provided as a reference to an array of distinct work elements: these could be individual query sequences, or groups of sequences in files (see the sketch below).
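A hypothetical sketch of that four-coderef contract; the stage names mirror the slide, and a toy length-difference "similarity" stands in for a real aligner:

```perl
use strict;
use warnings;
use List::Util qw(reduce);

# the four stages as coderefs, as the Generic mapper consumes them
my $seq_align = sub {    # innermost loop: score one query vs all references
    my ( $query, $refs ) = @_;
    return { map { $_ => abs( length($query) - length($_) ) } @$refs };
};
my $extract_sim_metric = sub { $_[0] };    # pass-through in this toy
my $reduce_sim_metric  = sub {             # best match = smallest distance
    my ($scores) = @_;
    return reduce { $scores->{$a} <= $scores->{$b} ? $a : $b } keys %$scores;
};
my $cleanup = sub { };                     # nothing to tidy up in memory

my @refs = qw(ACGT ACGTTGG ACG);
for my $query (qw(ACGTA AC)) {             # the array of work elements
    my $scores = $extract_sim_metric->( $seq_align->( $query, \@refs ) );
    printf "%s -> %s\n", $query, $reduce_sim_metric->($scores);
    $cleanup->();
}
```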
User-controlled, coarse (process-level) parallelism through the Many-Core Engine (MCE) module for the LinearLinearGeneric dataflow. How does one make mappers that inherit from Generic $self-aware?
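A minimal sketch of this coarse, process-level parallelism, assuming the MCE::Loop interface: each worker takes query sequences off the shared input list and the manager gathers per-query results:

```perl
use strict;
use warnings;
use MCE::Loop;

MCE::Loop->init( max_workers => 4, chunk_size => 1 );

my @queries = map {"query_$_"} 1 .. 100;     # stand-ins for query sequences

my %best = mce_loop {
    my ( $mce, $chunk_ref, $chunk_id ) = @_;
    for my $q (@$chunk_ref) {
        my $score = length $q;               # placeholder for seq_align()
        MCE->gather( $q => $score );         # manager collects key/value pairs
    }
} @queries;

printf "%s => %d\n", $_, $best{$_} for ( sort keys %best )[ 0 .. 4 ];
```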
Jupyterized, single *.pl file module development: inherit from the generic mapper; apply roles for the corresponding workflow; write code for the required and any accessory methods, not a fully fleshed-out distribution; use this single file for development, testing, or work; once everything is complete, copy the required methods into a Moose module that does not inherit from Generic; can easily move around via email (yes, this is still a thing). Benefits: the generic and dataflow modules are truly generic and can be used to control many/most alignment applications; only the required methods must be implemented to wrap around a new similarity searching library; the non-generic versions of the dataflows only differ in how the required methods are invoked from inside the dataflow module, hiding an implementation detail from users; the *.pl files are functional and can be shipped as supplementary methods for publications, so there is no need to create distributions just to report results; a single file can be frozen in workflow/protocol sites for reproducible research purposes
Enhancing Edlib. Edlib is one of the fastest libraries for comparing strings using the edit distance: an algorithmic improvement over Myers' algorithm, which improved over agrep; a component in numerous other alignment algorithms & often used as a comparator for new methods. Current implementation: a single-threaded library that compares one query against one sequence. A few older CPAN wrappers exist (but appear to be broken). Alienized in late 2023/early 2024 by the presenter
Two Edlib enhancements. Process-level parallelism: uses the dynamic library from the Alien distribution; interfaces through FFI::Platypus; one process per query sequence (the outer loop is parallelized); reduction of the similarity metric using PDL. Process + thread-level parallelism: uses the static library from the Alien distribution; Inline::C is used to communicate between C and Perl; one process per query sequence; one or more threads per process to parallelize the search over the reference sequences (the inner loop is written in C and parallelized through OpenMP); near-full control of the OpenMP environment through Brett Estrade's OpenMP::Environment module. https://github.com/chrisarg/bio-seqalignment-components-libraries-edlib https://github.com/chrisarg/bio-seqalignment-examples-enhancing-edlib
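A hedged sketch of the Inline::C + OpenMP pattern: the C kernel below is a toy reduction rather than the actual Edlib binding, with Alien::OpenMP assumed to supply the compiler flags and OpenMP::Environment the runtime thread count:

```perl
use strict;
use warnings;
use OpenMP::Environment;
use Inline ( C => 'DATA', with => 'Alien::OpenMP' );

OpenMP::Environment->new->omp_num_threads(4);   # sets OMP_NUM_THREADS
print sum_scores( [ map { $_ * 1.0 } 1 .. 1000 ] ), "\n";   # 500500

__DATA__
__C__
#include <omp.h>
#include <stdlib.h>

/* toy threaded kernel; the real inner loop would call edlibAlign()
   against each reference sequence */
double sum_scores(SV *scores_ref) {
    AV *scores = (AV *)SvRV(scores_ref);
    int n = av_len(scores) + 1;
    double *buf = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++)             /* copy out of the Perl AV */
        buf[i] = SvNV(*av_fetch(scores, i, 0));

    char *nt = getenv("OMP_NUM_THREADS");   /* honor the Perl-side setting */
    if (nt) omp_set_num_threads(atoi(nt));

    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++)
        total += buf[i];
    free(buf);
    return total;
}
```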
MCE vs MCE-OpenMP: execution time; resource use (copies of the reference database × execution time)
Why not both processes and threads?
OpenMP and MCE optimizations (GitHub conversations with Mario Roy), Part I
OpenMP and MCE optimizations (GitHub conversations with Mario Roy), Part II
Summary. The changing landscape of High Performance Scientific and Bioinformatics computing poses an opportunity for Perl development. The environment is in need of approaches that can develop high-level controllers for applications that consist of complex dataflows to and from leaf-like low-level language components. R played this role in Bioinformatics ~20 years ago, and Python is increasingly playing this role everywhere. However: Perl's multi-paradigm compute model makes it ideal for the rapid authoring of such applications, which will be inherently multi-language; Alien, MCE, Inline, FFI, OpenMP::Environment and PDL are going to be instrumental in the creation of performant applications, as we illustrated herein
Future Work. Applied work will shift to wrapping other aligners using the set of modules in Bio::SeqAlignment for rapid, yet fully functional and performant, prototype development. Assume control over process placement and thread affinity using OS facilities and OpenMP runtime directives. Explore ways to control the data environment of threaded applications using Perl(guts). Extend the use of PDL for novel similarity metric reduction schemes. Explore the use of Perl to provide memory-safe dynamic allocation for very low-level code (SIMD assembly) for prefiltering or novel similarity search methods. Explore internode communication schemes using the Message Passing Interface (MPI) and other distributed actor frameworks for cluster environments. Seek out rational solutions for incorporating cloud resources into these calculations