Enhancing Research Orchestration Capabilities at ORNL.pdf
globusonline
14 slides
May 31, 2024
About This Presentation
Cross-facility research orchestration comes with ever-changing constraints regarding the availability and suitability of various compute and data resources. In short, a flexible data and processing fabric is needed to enable the dynamic redirection of data and compute tasks throughout the lifecycle of an experiment. In this talk, we illustrate how we easily leveraged Globus services to instrument the ACE research testbed at the Oak Ridge Leadership Computing Facility with flexible data and task orchestration capabilities.
Size: 3.23 MB
Language: en
Added: May 31, 2024
Slides: 14 pages
Slide Content
ORNL is managed by UT-Battelle LLC for the US Department of Energy
Enhancing research orchestration capabilities at ORNL
Tyler J. Skluzacek
Research Scientist
Oak Ridge National Laboratory
2
Tyler J. Skluzacek
Oak Ridge Leadership Computing Facility
Task Who does it?
Initiate experiment
Operate instrument
Initiate data transfers
Initiate compute
Computes!
Validate outputs
Initiate reactionary analysis
Publish, clean up
As research becomes more autonomous, the landscape of ‘human-machine interaction’ evolves…
Human driven
3
Task Who does it?
Initiate experiment
Operate instrument
Initiate data transfers
Initiate compute
Computes!
Validate outputs
Initiate reactionary analysis
Publish, clean up
As research becomes more autonomous, the landscape of ‘human-machine interaction’ evolves…
Machine driven
4
Task Who does it?
Initiate experiment
Operate instrument
Initiate data transfers
Initiate compute
Computes!
Validate outputs
Initiate reactionary analysis
Publish, clean up
Our community has converged on ‘automation when possible; humans when required’.
Requirements driven
5
Requirements-driven experimentation demands software that enables humans or machines to ‘drive’.
… and more soon!
Users
Human
Machine
Zambeze: distributed workflow orchestration
The OLCF Facility API for easy, remote, reliable interactions with resources.
/status, /compute, /data …
OLCF
Globus Flows
6
The OLCF Facility API enables users to remotely interact with our resources
Not a new idea, but we can build on it!
•FirecREST (CSCS): https://firecrest.readthedocs.io/en/latest/overview.html#gateway
•SuperFacility (NERSC): https://www.nersc.gov/research-and-development/superfacility/
•Tapis (TACC): https://tapis.readthedocs.io/en/latest/technical/index.html
7
Coming soon: direct support for computational workflows
“The workflow representing how I perform computation”
1. Project-level auth
2. User-level auth
3. Check to see if resource online (/status)
4. Submit job (/compute)
5. Monitor job status
6. Send data (/data… coming soon!)
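The steps above could be driven by a small client along the lines of the sketch below. Only the endpoint names /status and /compute come from the slide. The header names, the JSON fields (`online`, `job_id`, `state`), the state names, the per-job status path, and the `transport` callable are all hypothetical stand-ins, since the actual request and response schema of the OLCF Facility API is not given here. Step 6 (/data) is left out because it is listed as coming soon.

```python
import time
from typing import Any, Callable, Dict, Optional

# Hypothetical transport: (method, path, headers, body) -> parsed JSON response.
Transport = Callable[[str, str, Dict[str, str], Optional[dict]], Dict[str, Any]]

TERMINAL_STATES = {"COMPLETED", "FAILED"}  # assumed state names


def run_job(transport: Transport, project_token: str, user_token: str,
            job_spec: dict, poll_seconds: float = 5.0,
            max_polls: int = 100) -> Dict[str, Any]:
    # Steps 1-2: project-level and user-level auth (header names are assumptions).
    headers = {
        "X-Project-Token": project_token,
        "Authorization": f"Bearer {user_token}",
    }

    # Step 3: check that the resource is online.
    status = transport("GET", "/status", headers, None)
    if not status.get("online", False):
        raise RuntimeError("resource is offline; not submitting")

    # Step 4: submit the job.
    job_id = transport("POST", "/compute", headers, job_spec)["job_id"]

    # Step 5: poll until the job reaches a terminal state.
    for _ in range(max_polls):
        info = transport("GET", f"/compute/{job_id}", headers, None)
        if info["state"] in TERMINAL_STATES:
            return info
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")


def make_fake_transport() -> Transport:
    """In-memory stand-in for the API so the sketch runs without a network."""
    states = iter(["PENDING", "RUNNING", "COMPLETED"])

    def transport(method, path, headers, body):
        if method == "GET" and path == "/status":
            return {"online": True}
        if method == "POST" and path == "/compute":
            return {"job_id": "job-1"}
        return {"job_id": "job-1", "state": next(states)}

    return transport


result = run_job(make_fake_transport(), "project-token", "user-token",
                 {"script": "hostname"}, poll_seconds=0)
print(result["state"])
```

Passing the transport in as a callable keeps the six-step logic separate from any particular HTTP library, so the same function can be driven by a human at a prompt or by an automated agent.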
8
What if we could largely ignore cross-facility configuration/scheduling and just focus on science?
science campaign activities: microscope capture, train model, validate model, create visualization, store visualization
Distributed workflow orchestration: the act of organizing and executing application and data flow between separate workflow management systems, across potentially separate compute and storage resources
workflow orchestration != workflow management
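To make the distinction concrete, here is a toy sketch in which an orchestrator only decides where each campaign activity goes and leaves execution to whatever runs at that facility. It does not use Zambeze's API; the `Activity`, `Agent`, and `orchestrate` names, the capability labels, and the facility assignments are all invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Activity:
    name: str
    needs: str  # capability the activity requires (hypothetical labels)


@dataclass
class Agent:
    facility: str
    capabilities: Set[str]

    def run(self, activity: Activity) -> str:
        # A real agent would hand the activity to the facility's own
        # workflow management system; here we only record the placement.
        return f"{activity.name} @ {self.facility}"


def orchestrate(campaign: List[Activity], agents: List[Agent]) -> List[str]:
    """Route each activity, in order, to the first agent able to run it."""
    placements = []
    for activity in campaign:
        agent = next((a for a in agents if activity.needs in a.capabilities), None)
        if agent is None:
            raise LookupError(f"no agent offers '{activity.needs}' for {activity.name}")
        placements.append(agent.run(activity))
    return placements


campaign = [
    Activity("microscope capture", "instrument"),
    Activity("train model", "gpu"),
    Activity("validate model", "gpu"),
    Activity("create visualization", "cpu"),
    Activity("store visualization", "storage"),
]
agents = [
    Agent("Facility A", {"instrument", "storage"}),
    Agent("Facility B", {"gpu", "cpu"}),
]

placements = orchestrate(campaign, agents)
for line in placements:
    print(line)
```

The scientist describes the campaign once; which facility runs each step is decided by the orchestrator from the capabilities the agents advertise.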
9
Zambeze for automated and distributed workflow orchestration
Figure: two facilities, A and B. Facility A: Compute A1, Compute A2, storage A1, storage A2, instrument A2. Facility B: Compute B1 (labeled twice), storage B, instrument B2. Agents appear alongside the resources at both facilities. Legend: activity messages, control messages, data.
10
Zambeze enables cross-facility analysis
AtomAI use case
•deep learning models for semantic segmentation
•assign each pixel to a category of ‘what it represents’
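As a small illustration of what "assign each pixel to a category" means, the sketch below takes per-pixel class scores, such as a segmentation network might output, and keeps the highest-scoring class for each pixel. This is not AtomAI code; the scores and the class names are made up.

```python
from typing import List

CLASSES = ["background", "atom"]  # hypothetical categories


def segment(scores: List[List[List[float]]]) -> List[List[int]]:
    """scores[row][col][k] is the score of class k; return a per-pixel label map."""
    return [
        [max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
        for row in scores
    ]


# A 2x2 "image" with two class scores per pixel.
scores = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.6, 0.4], [0.3, 0.7]],
]

labels = segment(scores)
named = [[CLASSES[k] for k in row] for row in labels]
print(named)
```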
11
Globus Flows @ OLCF
•Action providers allow flexible access to breadth of APIs
–Future: OLCF Facility API!
•Enables human-in-the-loop input
–Link to Globus web console
•Globus Auth enables secure access to most* facilities
•Extremely fast time-to-implementation
–Organization already has DTNs
–Globus Compute fast to install, uses existing virtual environments
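A flow is written as a JSON state machine in which each `Action` state calls an action provider. The sketch below builds a two-step definition (move data in, then run a function through Globus Compute) as a Python dict and walks it to confirm the states chain together. The top-level shape (`StartAt`, `States`, `Type`, `ActionUrl`, `Parameters`, `ResultPath`, `Next`, `End`) follows the Globus Flows definition format as I understand it. The action provider URLs, the parameter names inside each `Parameters` block, and the `$.input...` paths are assumptions to verify against the current Globus documentation, and this is not the flow shown in the talk.

```python
import json
from typing import List

flow_definition = {
    "Comment": "Illustrative only: transfer input data, then run an analysis function",
    "StartAt": "TransferInput",
    "States": {
        "TransferInput": {
            "Type": "Action",
            # Assumed URL of the Globus Transfer action provider.
            "ActionUrl": "https://actions.globus.org/transfer/transfer",
            "Parameters": {
                # Parameter names below are assumptions.
                "source_endpoint.$": "$.input.source_collection",
                "destination_endpoint.$": "$.input.destination_collection",
                "DATA": [
                    {
                        "source_path.$": "$.input.source_path",
                        "destination_path.$": "$.input.destination_path",
                    }
                ],
            },
            "ResultPath": "$.TransferResult",
            "Next": "RunAnalysis",
        },
        "RunAnalysis": {
            "Type": "Action",
            # Assumed URL of the Globus Compute action provider.
            "ActionUrl": "https://compute.actions.globus.org",
            "Parameters": {
                "endpoint.$": "$.input.compute_endpoint",
                "function.$": "$.input.function_id",
                "kwargs.$": "$.input.function_kwargs",
            },
            "ResultPath": "$.AnalysisResult",
            "End": True,
        },
    },
}


def state_order(definition: dict) -> List[str]:
    """Follow StartAt/Next links until a state marked End; reject cycles."""
    order: List[str] = []
    name = definition["StartAt"]
    while True:
        if name in order:
            raise ValueError(f"cycle detected at state '{name}'")
        order.append(name)
        state = definition["States"][name]
        if state.get("End"):
            return order
        name = state["Next"]


print(state_order(flow_definition))
print(len(json.dumps(flow_definition)), "bytes of JSON")
```

Checking the chain locally before registering a flow catches a misspelled `Next` target early; a human-in-the-loop step would simply be another state inserted between the two shown here.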
12
Globus Flows for tomographic reconstruction
In collaboration with Ryan Chard;
Credit to Will Engler for ALCF AP
…
13
In conclusion,
Globus enables OLCF to provide key research orchestration capabilities.
14
If you would like to learn more, please reach out!
Tyler J. Skluzacek
Research Scientist
[email protected]
Others to thank for these efforts:
Paul Bryant
Ryan Chard
Rafael Ferreira da Silva
Ryan Prout
A.J. Ruckman
Renan Santos Souza
Mark Coletti
Fred Suter
Gavin Wiggins