What is a RAP?
Principles of Reproducible Analytical Pipelines
Christophe Bontemps
SIAP-2024
Fundamental Principles of Official Statistics
· Clear mention of the process used to produce statistics
· To retain trust in official statistics, the statistical agencies need to decide according to strictly professional considerations, including scientific principles and professional ethics, on the methods and procedures for the collection, processing, storage and presentation of statistical data.
Usual practice: Theory vs reality
Usual practice: In the end
What are the issues?
· Lots of files
· Cut and paste is not a reliable, reproducible approach!
· Mistakes are hard to track
· Each operator has their own approach
· Several versions of the code may coexist
· The steps aren’t recorded
· Testing is hard
· Reproducibility is not guaranteed
· Quality is controlled only at the end
What is a Reproducible Analytical Pipeline (RAP)?
· It is a process
· It is easily repeatable
· It is easily extendable
· It is automated
· It minimises mistakes
· It is fast
· It builds trust
What does a RAP look like?
It is a simple process:
· linking inputs (data)
· to outputs (publication)
What does a RAP look like?
This process can be decomposed:
· Succession of tasks
· Direct linkage of actions
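As a minimal sketch of this decomposition, assuming Python and a made-up survey extract: the task names (`load`, `clean`, `summarise`, `publish`), the columns and the figures below are all illustrative and do not come from any NSO's pipeline. Each task is a function, and the output of one task is passed directly as the input of the next.

```python
import csv
import io
from statistics import mean

# Hypothetical raw input: in practice this would be a survey file on disk.
RAW_CSV = """region,income
North,1200
North,1500
South,900
South,
"""

def load(text):
    """Task 1: read the raw data into a list of records."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Task 2: drop records with a missing income and convert types."""
    return [
        {"region": r["region"], "income": float(r["income"])}
        for r in rows
        if r["income"].strip()
    ]

def summarise(rows):
    """Task 3: compute the mean income per region."""
    by_region = {}
    for r in rows:
        by_region.setdefault(r["region"], []).append(r["income"])
    return {region: mean(values) for region, values in sorted(by_region.items())}

def publish(table):
    """Task 4: render the publication table as text."""
    lines = ["region,mean_income"]
    lines += [f"{region},{value:.1f}" for region, value in table.items()]
    return "\n".join(lines)

def run_pipeline(text):
    """Link the tasks: each output feeds directly into the next task."""
    return publish(summarise(clean(load(text))))

if __name__ == "__main__":
    print(run_pipeline(RAW_CSV))
```

Because no step is done by hand, re-running `run_pipeline` on updated data reproduces the whole publication table in one call.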
What does a RAP look like?
This process can be decomposed:
· Each task is coded
  ↪ No manual actions
· Each task uses inputs
· Each task produces outputs
  ↪ Easy to test tasks individually
  ↪ Each output is identified
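One possible way to make these points concrete, sketched in Python under stated assumptions: the `clean` task, its validity rule and the `fingerprint` helper are hypothetical, and a SHA-256 checksum is only one of several ways to identify an output. The task is tested on its own with a small hand-made input, and its output gets an identifier that stays the same whenever the same input is processed again.

```python
import hashlib
import json

def clean(rows):
    """One coded task: keep only records with a non-negative income."""
    return [r for r in rows if r.get("income") is not None and r["income"] >= 0]

def fingerprint(output):
    """Identify an output by a SHA-256 checksum of its JSON representation."""
    payload = json.dumps(output, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def test_clean_drops_invalid_records():
    """The task is tested on its own, with a small hand-made input."""
    rows = [{"income": 100}, {"income": None}, {"income": -5}]
    assert clean(rows) == [{"income": 100}]

def test_same_input_gives_same_output_id():
    """Re-running the task on the same input yields the same identifier."""
    rows = [{"income": 100}, {"income": 250}]
    assert fingerprint(clean(rows)) == fingerprint(clean(rows))

if __name__ == "__main__":
    test_clean_drops_invalid_records()
    test_same_input_gives_same_output_id()
    print("all task tests passed")
```

A mismatch between the stored and recomputed identifier of an output signals that either the input or the code of that task has changed.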
What does a RAP look like?
This process is documented:
· Each piece of code has versions
· Versions are annotated
  ↪ Easy to follow task development
  ↪ Easy to track mistakes
What does a RAP look like?
This process is easy to save:
· Each piece of code is securely saved
· Each version can be reverted
  ↪ Easy to undo/revert to a past version
  ↪ Easy to test
What are the benefits?
Analyses within a RAP are:
· Easy to use
· Easy to find information
· Easy for others to use
· Easy to revise and adapt
· Easy to reuse
· Automated and fast
· Open and promoting trust
What do we need?
· A good knowledge of the process
· A good organisation:
  - of files
  - of code
  - of documentation
· Open-source software
· A versioning system
· Time to learn
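To illustrate what a good organisation of files, code and documentation might look like, here is a small Python sketch. The folder names and their descriptions are assumptions chosen for the example, not a prescribed standard; any consistent layout serves the same purpose.

```python
import tempfile
from pathlib import Path

# Hypothetical layout: one place each for files, code and documentation.
LAYOUT = {
    "data/raw": "Input files, never edited by hand",
    "data/processed": "Intermediate files written by the code",
    "code": "One script or module per task",
    "docs": "Methodology notes and change log",
    "output": "Tables and reports produced by the pipeline",
}

def create_project(root):
    """Create the folder skeleton and a README describing each folder."""
    root = Path(root)
    for folder in LAYOUT:
        (root / folder).mkdir(parents=True, exist_ok=True)
    readme = "\n".join(f"{folder}: {purpose}" for folder, purpose in LAYOUT.items())
    (root / "README.txt").write_text(readme + "\n", encoding="utf-8")
    # Return the created folders, relative to the project root.
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_dir())

if __name__ == "__main__":
    # Demonstrate in a temporary directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        for folder in create_project(tmp):
            print(folder)
```

Keeping raw data separate from processed data and from outputs makes it clear which files are inputs and which can always be regenerated by the code.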
RAP in practice
· Implemented in some NSOs (Vanuatu)
· Can be done easily with R/RStudio
· Can also be done with Python/Jupyter notebooks, Quarto (R, Python, Julia, others…)
· Large community to help
Let’s Start!
Useful resources
· The UK government RAP website.
· UK best practice documentation.
· A free RAP course to teach you all you need to know.
· How the Data Science Campus sets its coding standards.
· A new open-source book from the Alan Turing Institute setting out how to do reproducible data science.
Citing The Turing Way
Many of the beautiful images used in this presentation were taken from The
Turing Way book.
Full citation:
The Turing Way Community, Becky Arnold, Louise Bowler, Sarah Gibson, Patricia
Herterich, Rosie Higman, … Kirstie Whitaker. (2019, March 25). The Turing Way: A
Handbook for Reproducible Data Science (Version v0.0.4). Zenodo.
http://doi.org/10.5281/zenodo.3233986