SHARP: In-Network Scalable Hierarchical Aggregation and Reduction Protocol

insideHPC · 29 slides · Feb 16, 2019

About This Presentation

In this deck from the 2019 Stanford HPC Conference, Devendar Bureddy from Mellanox presents: SHARP: In-Network Scalable Hierarchical Aggregation and Reduction Protocol.

"Increased system size and a greater reliance on ut...


Slide Content

Slide 1
Devendar Bureddy
SHARP: In-Network Scalable Hierarchical Aggregation and Reduction Protocol
Feb 15, 2019

Slide 2
Accelerating HPC and AI Applications

Accelerating HPC applications:

Significantly reduce MPI collective runtime

Increase CPU availability and efficiency

Enable communication and computation overlap

Enabling artificial intelligence solutions to perform critical and timely decision making:

Accelerating distributed machine learning

Slide 3
Aggregation Operations

All-reduce – vector operation; the result is spread to a specific group (usually the aggregating group)

Reduce – like all-reduce, but the result is sent to a single entity

Gather / All-gather – vector concatenation operation

…

Slide 4
Collective (Example) – Trees

Many-to-one and one-to-many traffic patterns – possible network congestion

Probably not a good solution for large data

Large scale requires a taller tree / larger radix

Result distribution – over the tree / via multicast (MC)

(Figure: tree of switches and end nodes, aggregation stages 1 and 2)

Slide 5
Collective (Example) – Recursive Doubling

The data is recursively divided, processed by the CPUs, and distributed

The ranks' CPUs are occupied performing the reduction (with 2^k ranks, recursive doubling needs k exchange steps)

The data is sent at least twice, consuming at least twice the bandwidth

(Figure: calculation phase and result-sending phase)

Slide 6
Which Offload Should We Suggest?

Let's aggregate the data while it is going through the network…

It will reduce the amount of data running through the network

It will reduce the latency because data will go through a shorter path

The operation will be fully offloaded

Slide 7
HCOLL: SHARP vs. No-SHARP

(Figure: Step 1 – Recursive Doubling; Step 2 – SHARP)

Slide 8
SHARP AllReduce Performance Advantages (128 Nodes)

SHARP enables a 75% reduction in latency, providing scalable, flat latency

Slide 9
SHARP AllReduce Performance Advantages

1500 nodes, 60K MPI ranks, Dragonfly+ topology

SHARP enables the highest performance

Slide 10
Scalable Hierarchical Aggregation Protocol
Reliable Scalable General Purpose Primitive, Applicable to Multiple Use-cases

In-network Tree based aggregation mechanism

Large number of groups

Multiple simultaneous outstanding operations

Streaming aggregation
Accelerating HPC applications

Scalable High Performance Collective Offload

Barrier, Reduce, All-Reduce, Broadcast

Sum, Min, Max, Min-loc, max-loc, OR, XOR, AND

Integer and Floating-Point, 16 / 32 / 64 bit

Up to 1KB payload size (in Quantum)

Significantly reduce MPI collective runtime

Increase CPU availability and efficiency

Enable communication and computation overlap
Accelerating Machine Learning applications

Prevents the many-to-one traffic pattern

CUDA, GPUDirect RDMA

(Figure: SHArP tree with aggregation nodes and end nodes (processes running on HCAs) and the SHArP tree root)

Slide 11
Scalable Hierarchical Aggregation Protocol

SHARP Tree is a Logical Construct

Nodes in the SHArP Tree are IB Endnodes

Logical tree defined on top of the underlying physical fabric

SHArP Tree links are implemented on top of the IB transport (Reliable Connection)

Expected to follow the physical topology for performance, but not required

SHARP Operations are Executed by a SHARP Tree

Multiple SHArP Trees are Supported

Each SHArP Tree can handle Multiple Outstanding SHArP Operations

Within a SHArP Tree, each Operation is Uniquely Identified by a SHArP-Tuple:

GroupID

SequenceNumber

(Figure: physical topology vs. one SHArP tree – switches/routers, HCAs, SHArP tree aggregation nodes and end nodes (processes running on HCAs), and the SHArP tree root)

Slide 12
SHARP Principles of Operation – Request

(Figure: aggregation requests propagating up toward the SHArP tree root)

Slide 13
SHARP Principles of Operation – Response

(Figure: aggregation responses propagating down from the SHArP tree root)

Slide 14
GPUDirect™ RDMA

Network adapter can directly read data from GPU device memory

Avoids copies through the host

Eliminates CPU bandwidth and latency bottlenecks

Uses remote direct memory access (RDMA) transfers between GPUs

Resulting in significantly improved MPI Send/Recv efficiency between GPUs in remote nodes

Fastest possible communication between GPU and other PCI-E devices

Allows for better asynchronous communication
(Figure: data path with GPUDirect™ RDMA, using PeerDirect™)

Slide 15
GPUDirect & SHARP Performance Advantage for AI

TensorFlow/Horovod running the ResNet50 benchmark

E5-2650 v4, 12 cores @ 2.2 GHz, 30M cache, 9.6 GT/s QPI, 256 GB RAM: 16 x 16 GB DDR4

P100 NVIDIA GPUs, ConnectX-6 HCA, IB Quantum switch (EDR speed)

RH 7.5, Mellanox OFED 4.4, HPC-X v2.3, TensorFlow v1.11, Horovod 0.15.0

(Chart callouts: 16%, 12%)

Slide 16
SHARP SW Overview

Slide 17
Mellanox HPC-X™ Scalable HPC Software Toolkit

Complete MPI, PGAS/OpenSHMEM and UPC package

Maximize application performance

For commercial and open source applications

Best out of the box experience

Slide 18
Mellanox HPC-X™ Scalable HPC Software Toolkit

Allows fast and simple deployment of HPC libraries

Both stable & latest beta are bundled

All libraries are pre-compiled

Includes scripts/module files to ease deployment (see the sketch after this list)

Package includes:

OpenMPI / OpenSHMEM

BUPC (Berkeley UPC)

UCX

FCA (HCOLL)

SHARP

KNEM – allows fast intra-node MPI communication for large messages

Profiling tools: libibprof, IPM

Standard benchmarks: OSU, IMB
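Because the libraries ship pre-compiled along with init scripts and module files, loading the toolkit environment is usually a one-liner. A minimal sketch, assuming the hpcx-init.sh script and HPCX_* variables of a standard HPC-X tarball; the archive name and unpack location are placeholders:

% tar -xjf hpcx-v2.3.tbz && cd hpcx-v2.3    # unpack the HPC-X release (filename is illustrative)
% source hpcx-init.sh                       # defines the hpcx_load / hpcx_unload helpers
% hpcx_load                                 # puts OpenMPI, UCX, HCOLL and SHARP on PATH / LD_LIBRARY_PATH
% echo $HPCX_SHARP_DIR                      # SHARP install directory, used later for sharp_coll_dump_config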

Slide 19
HPC-X/SHARP SW Architecture

HCOLL

Optimized collective library

Easy to integrate with multiple MPIs (OpenMPI, MPICH, MVAPICH*)

libsharp.so

Implementation of the low-level SHARP API

libsharp_coll.so

Implementation of the high-level SHARP API that enables SHARP collectives for MPI

Uses the low-level libsharp.so API

Easy to integrate with multiple MPIs (OpenMPI, MPICH, MVAPICH*)

(Stack diagram: MPI (OpenMPI) → HCOLL (libhcoll) → SHARP (libsharp / libsharp_coll) → InfiniBand network)

Slide 20
SHARP Software Architecture

(Figure: aggregation nodes in the fabric; compute nodes each running an MPI process and a SHARPD daemon; Subnet Manager; Aggregation Manager (AM))

Slide 21
SHARP: Configuring the Subnet Manager

Edit the opensm.conf file.

Set the parameter "sharp_enabled" to "2".

Run OpenSM with the configuration file:

% opensm -F <opensm configuration file> -B

Verify that the Aggregation Nodes were activated by OpenSM by running "ibnetdiscover" (a consolidated command sketch follows the example output). For example:

vendid=0x0
devid=0xcf09
sysimgguid=0x7cfe900300a5a2a0
caguid=0x7cfe900300a5a2a8
Ca 1 "H-7cfe900300a5a2a8" # "Mellanox Technologies Aggregation Node"
[1](7cfe900300a5a2a8) "S-7cfe900300a5a2a0"[37] # lid 256 lmc 0 "MF0;sharp2:MSB7800/U1" lid 512 4xFDR
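A minimal command sketch of the steps above, assuming opensm.conf lives at /etc/opensm/opensm.conf; the path and the grep pattern are illustrative, not mandated by SHARP:

% echo "sharp_enabled 2" >> /etc/opensm/opensm.conf    # or edit an existing sharp_enabled line in place
% opensm -F /etc/opensm/opensm.conf -B                 # start OpenSM in the background with this configuration
% ibnetdiscover | grep "Aggregation Node"              # aggregation nodes should now appear in the fabric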

Slide 22
MPI Collective Offloads Using SHARP

Enabled through FCA (HCOLL)

Flags (a launch sketch follows this list):

HCOLL_ENABLE_SHARP

Enable SHARP

HCOLL_SHARP_NP (default: 2)

Threshold on the number of nodes (node leaders) in a communicator required to create a SHArP group and use SHArP collectives

SHARP_COLL_LOG_LEVEL

0 – fatal, 1 – error, 2 – warn, 3 – info, 4 – debug, 5 – trace

SHARP_COLL_ENABLE_SAT=1

Enables SHARP streaming aggregation

SHARP_COLL_SAT_THRESHOLD=1024

Message-size threshold to switch from LLT (Low Latency Tree) to SAT (Streaming Aggregation Tree)
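A hedged launch sketch combining the flags above. The rank count, the mapping, and the osu_allreduce binary (here located via the HPCX_OSU_DIR variable exported by the HPC-X init script) are placeholders, and the exact value semantics of HCOLL_ENABLE_SHARP vary between HPC-X releases, so check the release notes for your version:

# Enable SHARP offload through HCOLL and turn on streaming aggregation above 1 KB.
% mpirun -np 128 --map-by node \
      -x HCOLL_ENABLE_SHARP=1 -x HCOLL_SHARP_NP=2 \
      -x SHARP_COLL_ENABLE_SAT=1 -x SHARP_COLL_SAT_THRESHOLD=1024 \
      -x SHARP_COLL_LOG_LEVEL=3 \
      $HPCX_OSU_DIR/osu_allreduce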

Slide 23
MPI Collective Offloads Using SHARP

Resources (quota) – see the sketch after this list:

SHARP_COLL_JOB_QUOTA_MAX_GROUPS

Number of communicators

SHARP_COLL_JOB_QUOTA_OSTS

Parallelism per communicator

SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST

Payload per OST

For the complete list of SHARP COLL tuning options:

$HPCX_SHARP_DIR/bin/sharp_coll_dump_config -f
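A hedged sketch of inspecting the quota defaults and overriding them at launch time. The numeric values and the ./my_mpi_app binary are arbitrary placeholders for illustration, not recommended settings:

% $HPCX_SHARP_DIR/bin/sharp_coll_dump_config -f | grep -i quota   # show the current quota-related defaults
% mpirun -np 128 \
      -x HCOLL_ENABLE_SHARP=1 \
      -x SHARP_COLL_JOB_QUOTA_MAX_GROUPS=64 \
      -x SHARP_COLL_JOB_QUOTA_OSTS=128 \
      -x SHARP_COLL_JOB_QUOTA_PAYLOAD_PER_OST=256 \
      ./my_mpi_app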

Slide 24
Demo

Slide 25
Setup

4 nodes, 16 GPUs

TensorFlow/Horovod running the ResNet50 benchmark

Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz

Volta NVIDIA GPUs, ConnectX-6 HCA, IB Quantum switch (EDR speed)

Ubuntu 16.04, Mellanox OFED 4.5, HPC-X v2.3, TensorFlow v1.12, Horovod 0.15.2

NCCL: 1 ring, NVLink within the node

SHARP: using 4 channels (4 ports) directly participating in the SAT operation

Topology: (figure)

Slide 26
Allreduce – SHARP

(Charts: "SHARP Latency" – latency (us) vs. message size for sizes 4–256, HOST vs. SHARP/LLT; "SHARP Streaming Aggregation" – latency (us) vs. message size for sizes 8M–512M, HOST vs. SHARP/SAT, with a 3x callout)

Slide 27
Allreduce – GPU Direct & SHARP

(Charts: "GPU Direct" – latency (us) vs. message size for sizes 4–16384, NCCL vs. SHARP; "GPU Direct & SHARP" – latency (us) vs. message size for sizes 8M–512M, NCCL vs. SHARP, with a 10x callout)

Slide 28
Horovod – ResNet50

(Chart: SHARP)

Slide 29
Thank You