SHARP: In-Network Scalable Hierarchical Aggregation and Reduction Protocol

insideHPC · 29 slides · Feb 16, 2019

About This Presentation

In this deck from the 2019 Stanford HPC Conference, Devendar Bureddy from Mellanox presents: SHARP: In-Network Scalable Hierarchical Aggregation and Reduction Protocol.

"Increased system size and a greater reliance on ut...


Slide Content

Slide 1
Devendar Bureddy
SHARP: In-Network Scalable Hierarchical Aggregation and Reduction Protocol
Feb 15, 2019

Slide 2
Accelerating HPC and AI Applications

Accelerating HPC applications:

Significantly reduce MPI collective runtime

Increase CPU availability and efficiency

Enable communication and computation overlap

Enabling artificial intelligence solutions to perform critical and timely decision making:

Accelerating distributed machine learning

Slide 3
Aggregation Operations

All-reduce – vector operation; the result is spread to a specific group (usually the aggregating group)

Reduce – like all-reduce, but the result is sent to a single entity

Gather / All-gather – vector concatenation operation

…

Slide 4
Collective (Example) – Trees

Many-to-one and one-to-many traffic patterns – possible network congestion

Probably not a good solution for large data

Large scale requires a taller tree / larger radix

Result distribution – over the tree / via multicast (MC)

(Figure: tree of switches and end nodes, aggregation stages 1 and 2)

Slide 5
Collective (Example) – Recursive Doubling

The data is recursively divided, processed by the CPUs, and distributed

The ranks' CPUs are occupied performing the reduction (with 2^k ranks, recursive doubling needs k exchange steps)

The data is sent at least twice, consuming at least twice the bandwidth

(Figure: calculation phase and result-sending phase)

Slide 6
Which Offload Should We Suggest?

Let's aggregate the data while it is going through the network…

It will reduce the amount of data running through the network

It will reduce the latency because data will go through a shorter path

The operation will be fully offloaded

Slide 7
HCOLL: SHARP vs. No-SHARP

(Figure: Step 1 – Recursive Doubling; Step 2 – SHARP)

Slide 8
SHARP AllReduce Performance Advantages (128 Nodes)

SHARP enables a 75% reduction in latency, providing scalable, flat latency

Slide 9
SHARP AllReduce Performance Advantages

1500 nodes, 60K MPI ranks, Dragonfly+ topology

SHARP enables the highest performance

Slide 10
Scalable Hierarchical Aggregation Protocol
Reliable Scalable General Purpose Primitive, Applicable to Multiple Use-cases

In-network Tree based aggregation mechanism

Large number of groups

Multiple simultaneous outstanding operations

Streaming aggregation
Accelerating HPC applications

Scalable High Performance Collective Offload

Barrier, Reduce, All-Reduce, Broadcast

Sum, Min, Max, Min-loc, max-loc, OR, XOR, AND

Integer and Floating-Point, 16 / 32 / 64 bit

Up to 1KB payload size (in Quantum)

Significantly reduce MPI collective runtime

Increase CPU availability and efficiency

Enable communication and computation overlap
Accelerating Machine Learning applications

Prevents the many-to-one traffic pattern

CUDA, GPUDirect RDMA

(Figure: SHArP tree with aggregation nodes and end nodes (processes running on HCAs) and the SHArP tree root)

Slide 11
Scalable Hierarchical Aggregation Protocol

SHARP Tree is a Logical Construct

Nodes in the SHArP Tree are IB Endnodes

Logical tree defined on top of the underlying physical fabric

SHArP Tree links are implemented on top of the IB transport (Reliable Connection)

Expected to follow the physical topology for performance, but not required

SHARP Operations are Executed by a SHARP Tree

Multiple SHArP Trees are Supported

Each SHArP Tree can handle Multiple Outstanding SHArP Operations

Within a SHArP Tree, each Operation is Uniquely Identified by a SHArP-Tuple:

GroupID

SequenceNumber

(Figure: physical topology vs. one SHArP tree – switches/routers, HCAs, SHArP tree aggregation nodes and end nodes (processes running on HCAs), and the SHArP tree root)

Slide 12
SHARP Principles of Operation – Request

(Figure: aggregation requests propagating up toward the SHArP tree root)

Slide 13
SHARP Principles of Operation – Response

(Figure: aggregation responses propagating down from the SHArP tree root)

Slide 14
GPUDirect™ RDMA

Network adapter can directly read data from GPU device memory

Avoids copies through the host

Eliminates CPU bandwidth and latency bottlenecks

Uses remote direct memory access (RDMA) transfers between GPUs

Resulting in significantly improved MPI Send/Recv efficiency between GPUs in remote nodes

Fastest possible communication between GPU and other PCI-E devices

Allows for better asynchronous communication
(Figure: data path with GPUDirect™ RDMA, using PeerDirect™)

Slide 15
GPUDirect & SHARP Performance Advantage for AI

TensorFlow/Horovod running the ResNet50 benchmark

E5-2650 v4, 12 cores @ 2.2 GHz, 30M cache, 9.6 GT/s QPI, 256 GB RAM: 16 x 16 GB DDR4

P100 NVIDIA GPUs, ConnectX-6 HCA, IB Quantum switch (EDR speed)

RH 7.5, Mellanox OFED 4.4, HPC-X v2.3, TensorFlow v1.11, Horovod 0.15.0

(Chart callouts: 16%, 12%)

Slide 16
SHARP SW Overview

Slide 17
Mellanox HPC-X™ Scalable HPC Software Toolkit

Complete MPI, PGAS/OpenSHMEM and UPC package

Maximize application performance

For commercial and open source applications

Best out of the box experience

Slide 18
Mellanox HPC-X™ Scalable HPC Software Toolkit

Allows fast and simple deployment of HPC libraries

Both stable & latest beta are bundled

All libraries are pre-compiled

Includes scripts/module files to ease deployment (see the sketch after this list)

Package includes:

OpenMPI / OpenSHMEM

BUPC (Berkeley UPC)

UCX

FCA (HCOLL)

SHARP

KNEM – allows fast intra-node MPI communication for large messages

Profiling tools: libibprof, IPM

Standard benchmarks: OSU, IMB
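Because the libraries ship pre-compiled along with init scripts and module files, loading the toolkit environment is usually a one-liner. A minimal sketch, assuming the hpcx-init.sh script and HPCX_* variables of a standard HPC-X tarball; the archive name and unpack location are placeholders:

% tar -xjf hpcx-v2.3.tbz && cd hpcx-v2.3    # unpack the HPC-X release (filename is illustrative)
% source hpcx-init.sh                       # defines the hpcx_load / hpcx_unload helpers
% hpcx_load                                 # puts OpenMPI, UCX, HCOLL and SHARP on PATH / LD_LIBRARY_PATH
% echo $HPCX_SHARP_DIR                      # SHARP install directory, used later for sharp_coll_dump_config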

Slide 19
HPC-X/SHARP SW Architecture

HCOLL

Optimized collective library

Easy to integrate with multiple MPIs (OpenMPI, MPICH, MVAPICH*)

libsharp.so

Implementation of the low-level SHARP API

libsharp_coll.so

Implementation of the high-level SHARP API that enables SHARP collectives for MPI

Uses the low-level libsharp.so API

Easy to integrate with multiple MPIs (OpenMPI, MPICH, MVAPICH*)

(Stack diagram: MPI (OpenMPI) → HCOLL (libhcoll) → SHARP (libsharp / libsharp_coll) → InfiniBand network)

Slide 20
SHARP Software Architecture

(Figure: aggregation nodes in the fabric; compute nodes each running an MPI process and a SHARPD daemon; Subnet Manager; Aggregation Manager (AM))

Slide 21
SHARP: Configuring the Subnet Manager

Edit the opensm.conf file.

Set the parameter "sharp_enabled" to "2".

Run OpenSM with the configuration file:

% opensm -F <opensm configuration file> -B

Verify that the Aggregation Nodes were activated by OpenSM by running "ibnetdiscover" (a consolidated command sketch follows the example output). For example:

vendid=0x0
devid=0xcf09
sysimgguid=0x7cfe900300a5a2a0
caguid=0x7cfe900300a5a2a8
Ca 1 "H-7cfe900300a5a2a8" # "Mellanox Technologies Aggregation Node"
[1](7cfe900300a5a2a8) "S-7cfe900300a5a2a0"[37] # lid 256 lmc 0 "MF0;sharp2:MSB7800/U1" lid 512 4xFDR
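A minimal command sketch of the steps above, assuming opensm.conf lives at /etc/opensm/opensm.conf; the path and the grep pattern are illustrative, not mandated by SHARP:

% echo "sharp_enabled 2" >> /etc/opensm/opensm.conf    # or edit an existing sharp_enabled line in place
% opensm -F /etc/opensm/opensm.conf -B                 # start OpenSM in the background with this configuration
% ibnetdiscover | grep "Aggregation Node"              # aggregation nodes should now appear in the fabric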

Slide 22
MPI Collective Offloads Using SHARP

Enabled through FCA (HCOLL)

Flags (a launch sketch follows this list):

HCOLL_ENABLE_SHARP

Enable SHARP

HCOLL_SHARP_NP (default: 2)

Threshold on the number of nodes (node leaders) in a communicator required to create a SHArP group and use SHArP collectives

SHARP_COLL_LOG_LEVEL

0 – fatal, 1 – error, 2 – warn, 3 – info, 4 – debug, 5 – trace

SHARP_COLL_ENABLE_SAT=1

Enables SHARP streaming aggregation

SHARP_COLL_SAT_THRESHOLD=1024

Message-size threshold to switch from LLT (Low Latency Tree) to SAT (Streaming Aggregation Tree)
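A hedged launch sketch combining the flags above. The rank count, the mapping, and the osu_allreduce binary (here located via the HPCX_OSU_DIR variable exported by the HPC-X init script) are placeholders, and the exact value semantics of HCOLL_ENABLE_SHARP vary between HPC-X releases, so check the release notes for your version:

# Enable SHARP offload through HCOLL and turn on streaming aggregation above 1 KB.
% mpirun -np 128 --map-by node \
      -x HCOLL_ENABLE_SHARP=1 -x HCOLL_SHARP_NP=2 \
      -x SHARP_COLL_ENABLE_SAT=1 -x SHARP_COLL_SAT_THRESHOLD=1024 \
      -x SHARP_COLL_LOG_LEVEL=3 \
      $HPCX_OSU_DIR/osu_allreduce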

Slide 23
MPI Collective Offloads Using SHARP

Resources (quota) – see the sketch after this list:

SHARP_COLL_JOB_QUOTA_MAX_GROUPS

Number of communicators

SHARP_COLL_JOB_QUOTA_OSTS

Parallelism per communicator

SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST

Payload per OST

For the complete list of SHARP COLL tuning options:

$HPCX_SHARP_DIR/bin/sharp_coll_dump_config -f
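A hedged sketch of inspecting the quota defaults and overriding them at launch time. The numeric values and the ./my_mpi_app binary are arbitrary placeholders for illustration, not recommended settings:

% $HPCX_SHARP_DIR/bin/sharp_coll_dump_config -f | grep -i quota   # show the current quota-related defaults
% mpirun -np 128 \
      -x HCOLL_ENABLE_SHARP=1 \
      -x SHARP_COLL_JOB_QUOTA_MAX_GROUPS=64 \
      -x SHARP_COLL_JOB_QUOTA_OSTS=128 \
      -x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=256 \
      ./my_mpi_app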

Slide 24
Demo

Slide 25
Setup

4 nodes, 16 GPUs

TensorFlow/Horovod running the ResNet50 benchmark

Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz

Volta NVIDIA GPUs, ConnectX-6 HCA, IB Quantum switch (EDR speed)

Ubuntu 16.04, Mellanox OFED 4.5, HPC-X v2.3, TensorFlow v1.12, Horovod 0.15.2

NCCL: 1 ring, NVLink within the node

SHARP: using 4 channels (4 ports) directly participating in the SAT operation

Topology: (figure)

Slide 26
Allreduce – SHARP

(Charts: "SHARP Latency" – latency (us) vs. message size for sizes 4–256, HOST vs. SHARP/LLT; "SHARP Streaming Aggregation" – latency (us) vs. message size for sizes 8M–512M, HOST vs. SHARP/SAT, with a 3x callout)

Slide 27
Allreduce – GPU Direct & SHARP

(Charts: "GPU Direct" – latency (us) vs. message size for sizes 4–16384, NCCL vs. SHARP; "GPU Direct & SHARP" – latency (us) vs. message size for sizes 8M–512M, NCCL vs. SHARP, with a 10x callout)

Slide 28
Horovod – ResNet50

(Chart: SHARP)

Slide 29
Thank You