Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines

IntelSoftware | 46 slides | Jun 28, 2017

About This Presentation

Orbital representations based on B-splines are widely used in quantum Monte Carlo (QMC) simulations of solids and historically take as much as 50 percent of the total runtime. Random access to a large four-dimensional array makes it challenging to use caches and wide vector units efficiently.


Slide Content

INTEL® HPC DEVELOPER CONFERENCE
FUEL YOUR INSIGHT

Optimizations of B-spline based SPO evaluations in QMC for
Multi/Many-core Shared Memory Processors

Amrita Mathuriya
HPC Application Engineer

DCG/Intel Corporation
November 2016


In collaboration with

Jeongnim Kim (Intel), Victor Lee (Intel)
Ye Luo, Anouar Benali (Argonne National Laboratory)
Luke Shulenburger (Sandia National Laboratories)

Presenter: Amrita Mathuriya

= HPC Application Engineer on the HPC Ecosystem Application Engineering Team, working on code modernization and optimization for Intel® Xeon® and Xeon Phi™.

= Working at Intel for the past 8 years.
— Expert in algorithms and optimizations for IA architectures.
— Worked on HPC applications in Computational Geometry, Optical Proximity Correction (OPC), Electromagnetics, Computational Biology, and Quantum Monte Carlo.
— Working on code modernization for Intel® Xeon® and Xeon Phi™ architectures.

= MS in Computer Science with a specialization in Computational Science and Engineering from Georgia Tech, USA, under the guidance of Professor David Bader.

= B.Tech in Computer Science from Indian Institute of Technology (IIT) Roorkee, India.

Systems

KNC: Intel® Xeon Phi™ coprocessor 7120P
61 cores @ 1.238 GHz, 4-way Intel® Hyper-Threading Technology, Memory: 15872 MB
Intel® Many-core Platform Software Stack Version 3.6.1
OS Version: 3.10.0-229.el7.x86_64

Intel® Xeon Phi™ 7250P (code-named Knights Landing, KNL), 68 cores @ 1.4 GHz, with
16 GB MCDRAM (used in flat mode), cluster boot mode = Quad, Turbo enabled. KNL used
in Quad/Flat mode.

Intel® Xeon® E5-2697v4 (BDW) node, single socket, 18 cores, HT enabled @ 2.3 GHz, 145 W,
with 128 GB RAM DDR4-2400 (8x16 GB DIMMs).

Blue Gene/Q (BG/Q) processor from the Mira supercomputer, at the Argonne National Laboratory facility.

Compilers, MPI, and math library:
icc version 16.0.2 (gcc version 4.8.3 compatibility)
Intel® MPI Library for Linux* OS, Version 5.1.3 Build 20160120 (build id: 14053)

Agenda

= KNL overview and motivation

= Intro to quantum Monte Carlo and QMCPACK

= Current status of QMCPACK

= Analysis of CORAL graphite benchmark

= Optimizations to B-spline based SPO evaluations for QMC

= Summary

[KNL architecture diagram. Omni-Path not shown.]

Important Characteristics of KNL

= Increasing core count per node on both Intel® Xeon® and Xeon Phi™
processors.

= Large SIMD units — AVX-512 operates on 16 single-precision floating-point values
simultaneously.

= Two level Cache system L1/L2 and high memory bandwidth.


How to gain performance?

= Scalability
— Enable data sharing with hybrid parallelism using MPI + threading.
— Design and implement scalable algorithms

= SIMD Parallelism — adapt data layouts to enable efficient vectorization.

= Efficiently utilize caches and memory bandwidth with Tiling (cache-blocking).

Roofline Performance Analysis on KNL

[Roofline plot: GFLOPS vs. arithmetic intensity (FLOPS/Byte); the current point sits at a cache-aware AI of 0.22 FLOPS/Byte, far below the KNL SP Vector FMA peak.]

VGH roofline performance model for N=2048. Circles denote GFLOPS at the cache-aware AI.

7% of peak GFLOPS achieved with the current AoS version.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations.

INTEL® HPC DEVELOPER CONFERENCE 11/12/2016

Performance Portable on Intel® Xeon®, Intel Xeon Phi™
and BG/Q Processors

Optimizations for efficiently utilizing SIMD units and caches.

— SoA data layout transformation.

— Tiling or AoSoA data layout transformation.

— Nested thread parallelization to reduce time-to-solution and memory usage.

Optimization work done on KNC.

Later ported to KNL — works out of the box.

Optimizations result in significant performance improvement on BG/Q.

QMCPACK

An open-source US-DOE flagship many-body ab initio quantum Monte Carlo (QMC) code for
computing the electronic structure of atoms, molecules, and solids. http://qmcpack.org/

[Figure: (a) DMC charge density of AB-stacked graphite and (b) the ball-and-stick rendering and the 4-carbon unit cell in blue. Parallel efficiency of QMCPACK on US-DOE facilities; the legend shows the MPI tasks and OpenMP threads of the reference computing unit (CU) and the maximum number of nodes on each platform: Jaguar (400x6, 18000), Hopper (256x6, 2048), Edison (64x12, 512), Mira (1024x32, 16384).]

J. Kim, K. P. Esler, J. McMinis, M. A. Morales, B. K. Clark, L. Shulenburger, and D. M. Ceperley, "Hybrid algorithms in quantum Monte Carlo," Journal of Physics: Conference Series, vol. 402, no. 1, p. 012008, 2012. [Online]. Available: http://stacks.iop.org/1742-6596/402/i=1/a=012008

Diffusion Monte Carlo Schematics

[Schematic: old configurations take a random walk to possible new configurations; accepted moves become the new configurations. The ensemble evolves according to:

+ Diffusion
+ Drift
+ Branching]

How is QMCPACK parallelized

QMCPACK utilizes OpenMP to optimize memory usage and to take
advantage of the growing number of cores per SMP node.

= Walkers within an MPI task are distributed among the cores in the CPU.

= Big common data, like the wave-function coefficients, is shared by all the walkers.

= Frequency stops increasing.

= Node count stops growing.

= Nodes are getting more powerful but require applications to expose more concurrency.

Free lunch is over. On-node performance is challenging.

QMCPACK status

Pretty good, and we can do even better!

Excellent MPI & OpenMP parallel efficiency at the walker level.
All in double precision except the 3D cubic B-spline.

— Work done recently to implement mixed precision. Speeds up by 1.2-1.5x.

SIMD efficiency is low: basically scalar performance, with few exceptions.

— B-spline - SSE/SSE2/QPX

— Distance tables with QPX

Array-of-Structures (AoS) for D-dim N-particle attributes, e.g., R (N,3),
Gradients (N,3), Hessian matrices (N,9)

CORAL Benchmark (256 electrons) - KNL Profiling

CORAL benchmark profile on KNL:

= Einspline
= Distance Table
= Jastrow
= Others

[VTune profile of hot functions and call stacks by CPU time: multi_UBspline evaluation (eval_multi_UBspline_3d_vgh), DTD_BConds::apply_bc, SymmetricDTD/AsymmetricDTD::moveOnSphere, TwoBodyJastrowOrbital<BsplineFunctor>::ratio, OneBodyJastrowOrbital<BsplineFunctor>::ratio, BsplineFunctor::evaluate, MKL dgemv, memmove.]

The three compute kernels account for 80% of the run time in QMCPACK on KNL.

QMC: Single-particle orbital (SPO) representation with B-spline basis set

One-dimensional cubic B-spline function:

    f(x) = \sum_{i'=i-1}^{i+2} b^{i'}(x) \, p_{i'}

Tensor product in each Cartesian direction gives the representation for a 3D orbital:

    \phi_n(x,y,z) = \sum_{i'=i-1}^{i+2} b_x^{i'}(x) \sum_{j'=j-1}^{j+2} b_y^{j'}(y) \sum_{k'=k-1}^{k+2} b_z^{k'}(z) \, p_{i',j',k',n}
d=i-1 j'=j-1 k'=k-1

Precomputed coefficients:
a 4D read-only array,
stored in SoA format, P[nx][ny][nz][N],
provided by DFT or HF computations using Quantum Espresso.
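As a concrete illustration, here is a minimal sketch of the 1D cubic B-spline evaluation above. The `Spline1D` struct and its function names are ours (not QMCPACK's einspline code), and we assume a uniform grid: `t` is the fractional offset inside the grid cell, and the value is a weighted sum of the 4 neighboring coefficients.

```cpp
#include <cassert>
#include <cmath>

// Sketch (names ours): cubic B-spline prefactors b^{i'}(x) on a uniform grid,
// and the 4-coefficient sum f(x) = sum_{i'} b^{i'}(x) p_{i'}.
struct Spline1D {
    static void prefactors(double t, double b[4]) {
        // t in [0,1) is the fractional offset inside the grid cell.
        const double t2 = t * t, t3 = t2 * t;
        b[0] = (1.0 - 3.0 * t + 3.0 * t2 - t3) / 6.0;   // (1-t)^3 / 6
        b[1] = (4.0 - 6.0 * t2 + 3.0 * t3) / 6.0;
        b[2] = (1.0 + 3.0 * t + 3.0 * t2 - 3.0 * t3) / 6.0;
        b[3] = t3 / 6.0;
    }
    // Weighted sum over the 4 coefficients starting at index i0.
    static double eval(const double* p, int i0, double t) {
        double b[4];
        prefactors(t, b);
        return b[0] * p[i0] + b[1] * p[i0 + 1] + b[2] * p[i0 + 2] + b[3] * p[i0 + 3];
    }
};
```

The prefactors form a partition of unity, so a constant coefficient array reproduces a constant function; the 3D tensor-product form applies the same prefactors along x, y, and z.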


Simplified miniQMC

= Only contains B-spline evaluation routines.

= Mimics the computational and data access patterns of B-spline SPO evaluations in QMC.

B-spline SPO evaluation kernels:

//Number of splines = N, iterations = niters
//random samples = ns = 512
//dimensions of 3D position grid (nx, ny, nz)
using Pos3 = float[3];
Pos3 vPos[ns], vglPos[ns], vghPos[ns];
class WalkerAoS { T v[N], g[3*N], l[N], h[9*N]; };

//Create and init read-only 4D table P once.
BsplineAoS bSpline(nx, ny, nz, N);

//Create and run walkers in parallel.
#pragma omp parallel
{
  //Contains private copy of outputs.
  WalkerAoS w;
  //Random position generation for testing.
  generateRandomPos(vPos, vglPos, vghPos, ns);
  for (int i = 0; i < niters; i++) {
    for (int j = 0; j < ns; j++)
      bSpline.V(vPos[j], w.v);
    for (int j = 0; j < ns; j++)
      bSpline.VGL(vglPos[j], w.v, w.g, w.l);
    for (int j = 0; j < ns; j++)
      bSpline.VGH(vghPos[j], w.v, w.g, w.h);
  }
}

Data Layout — Performance Considerations

Array-of-Structs (AoS)
= Pros: Logical for expressing physical abstractions in 3D or higher dimensions.

Struct-of-Arrays (SoA)
= Pros: Contiguous loads/stores for efficient vectorization.

Hybrid (AoSoA)
= Pros: Potentially useful for increasing cache locality. Also supports efficient vectorization.
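To make the pros concrete, a small sketch contrasting the two layouts for a gradient array (all type and function names here are ours, not from QMCPACK). In AoS, each component has a stride of three floats; in SoA, each component is a plain unit-stride array, which is what the compiler needs for contiguous vector loads and stores.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// AoS: one struct per particle; g[n].x has a stride of 3 floats in memory.
struct GradAoS { float x, y, z; };

// SoA: one array per component; gx[n] has a stride of 1 float in memory.
struct GradSoA {
    std::vector<float> gx, gy, gz;
    explicit GradSoA(std::size_t n) : gx(n), gy(n), gz(n) {}
};

// Accumulate a constant into the x component in each layout.
void add_x_aos(GradAoS* g, std::size_t n, float c) {
    for (std::size_t i = 0; i < n; ++i) g[i].x += c;        // strided access
}
void add_x_soa(GradSoA& g, float c) {
    for (std::size_t i = 0; i < g.gx.size(); ++i) g.gx[i] += c; // contiguous access
}
```

Both loops compute the same result; only the memory access pattern differs, which is exactly the property the SoA transformation of the output arrays exploits.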

Pseudocode - VGH

Computes value, gradient, and Hessian at a random (x,y,z).

(a) Data access pattern of read-only B-spline coefficients P at a random position (x, y, z), with i0 = floor(x/dx), j0 = floor(y/dy), etc. The outermost x dimension is not shown.

class BsplineAoS {
  // read only and shared among threads.
  T P[Nx][Ny][Nz][N];

  void VGH(T x, T y, T z, T* v, T* g, T* h) {
    //compute the lower-bound index i0, j0, k0
    //compute prefactors using (x-x0, y-y0, z-z0)
    for (int i = 0; i < 4; ++i)
      for (int j = 0; j < 4; ++j)
        for (int k = 0; k < 4; ++k) {
          const T* p = P[i+i0][j+j0][k+k0];
          #pragma omp simd
          for (int n = 0; n < N; ++n) {
            v[n] += F(p[n]);
            g[3*n+0] += Gx(p[n]); g[3*n+1] += Gy(p[n]); //...
            h[9*n+0] += Hxx(p[n]); h[9*n+1] += Hxy(p[n]); //...
          }
        }
  }
};

Strided access for output arrays.

SoA transformation for output arrays

(b) Output arrays in SoA (structure-of-arrays) format:

class BsplineSoA {
  // Data members ...

  void VGH(T x, T y, T z, T* v, T* g, T* h) {
    T *gx = g, *gy = g + N, *gz = g + 2*N;
    T *hxx = h, *hxy = h + N, *hxz = h + 2*N,
      *hyy = h + 3*N, *hyz = h + 4*N, *hzz = h + 5*N;

    for (int i = 0; i < 4; ++i)
      for (int j = 0; j < 4; ++j)
        for (int k = 0; k < 4; ++k) {
          const T* p = P[i+i0][j+j0][k+k0];
          #pragma omp simd
          for (int n = 0; n < N; ++n) {
            v[n] += F(p[n]);
            gx[n] += Gx(p[n]); gy[n] += Gy(p[n]); //...
            hxx[n] += Hxx(p[n]); hxy[n] += Hxy(p[n]); //...
          }
        }
  }
};

How to evaluate performance of QMC

= Rate of Monte Carlo sample generation (throughput) per resource.

= For the miniapp:
Throughput = (number of evaluations) / T
Evaluations = (number of walkers) x (number of iterations) x (number of splines)
T = time per call of a function (such as VGH)

= Throughput represents work done on a node.

= Ideally, it should stay constant across problem sizes.
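A trivial sketch of this metric (the `throughput` helper name is ours):

```cpp
#include <cassert>

// Throughput = (walkers * iterations * splines) / wall time spent in the kernel:
// evaluations performed per second on the node.
double throughput(long walkers, long iters, long splines, double seconds) {
    return double(walkers) * double(iters) * double(splines) / seconds;
}
```

For example, 64 walkers doing 10 iterations over 2048 splines in 2 seconds of VGH time gives 655,360 evaluations per second; an ideal implementation keeps this number flat as N grows.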

VGH throughput by AoS-to-SoA transformation
Higher the better

[Bar chart: VGH throughput for BDW, KNC, and KNL with AoS vs. SoA layouts at N = 128, 256, 512, 1024, 2048, 4096 splines.]

2x-4x performance improvement for small to medium problem sizes.


Why low performance for large N?

AoS-to-SoA improves SIMD efficiency, but caches can be utilized better:

+ Reduction on the arrays G & H of size N.

+ Streaming access at 4x4x4 blocks.

+ Pressure on resources with large N, e.g., TLB.

How to keep the write data in L1/L2?
How to maximize LLC sharing?

class BsplineAoS {
  // read only and shared among threads.
  T P[Nx][Ny][Nz][N];

  void VGH(T x, T y, T z, T* v, T* g, T* h) {
    //compute the lower-bound index i0, j0, k0
    //compute prefactors using (x-x0, y-y0, z-z0)

    // Reduction of output arrays over 64N values:
    for (int i = 0; i < 4; ++i)
      for (int j = 0; j < 4; ++j)
        for (int k = 0; k < 4; ++k) {
          const T* p = P[i+i0][j+j0][k+k0];
          #pragma omp simd
          for (int n = 0; n < N; ++n) {
            v[n] += F(p[n]);
            g[3*n+0] += Gx(p[n]); g[3*n+1] += Gy(p[n]); //...
            h[9*n+0] += Hxx(p[n]); h[9*n+1] += Hxy(p[n]); //...
          }
        }
  }
};

AoSoA Data Layout Transformation

[Data access pattern of the read-only B-spline table: a) current, b) tiled.]

int Nb = N/M; // tile size.
class WalkerSoA { T v[Nb], g[3*Nb], l[Nb], h[6*Nb]; };
// SoA object array, each containing P[nx][ny][nz][Nb]
BsplineSoA bs[M](Nb);            // Tiled input array

// Parallel region for creating walkers.
#pragma omp parallel
{
  WalkerSoA w[M](Nb);            // Tiled output arrays

  for (int t = 0; t < M; ++t) {
    for (int j = 0; j < ns; j++)
      bs[t].V(vPos[j], w[t].v);
    for (int j = 0; j < ns; j++)
      bs[t].VGL(vglPos[j], w[t].v, w[t].g, w[t].l);
    for (int j = 0; j < ns; j++)
      bs[t].VGH(vghPos[j], w[t].v, w[t].g, w[t].h);
  }
}

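A back-of-the-envelope sketch of why small tiles help (the helper name and the single-precision assumption are ours): with the WalkerSoA layout above, the per-tile output footprint is (1 + 3 + 1 + 6) * Nb floats for v, g, l, and h, so a modest Nb keeps the write set resident in L1/L2 while the reduction streams over the 4x4x4 coefficient blocks.

```cpp
#include <cassert>
#include <cstddef>

// Per-tile output footprint of one walker tile, single precision:
// v[Nb] + g[3*Nb] + l[Nb] + h[6*Nb] = 11 * Nb floats.
std::size_t walker_tile_bytes(std::size_t Nb) {
    return (1 + 3 + 1 + 6) * Nb * sizeof(float);
}
```

For Nb = 512 this is 22,528 bytes, comfortably inside a 32 KB L1 data cache, whereas the untiled N = 2048 outputs would be 4x larger.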

Performance gain with tiling/AoSoA
Higher the better

[Bar chart: VGH throughput for BDW, KNC, and KNL with SoA vs. AoSoA layouts at N = 128, 256, 512, 1024, 2048, 4096 splines; AoSoA gains up to ~2.1x-2.5x at large N.]

VGH performance with SoA-to-AoSoA transformation (tiling).

AoSoA helps achieve sustained throughput across problem sizes for all processors.

VGH throughput with tiling, higher the better

= BDW: peak at tile size 64; the tiled input array fits in the L3 cache.

= KNC, KNL: peak at tile size 512.

= For tile size > 512, output arrays fall out of caches.

[Chart: performance of VGH at N = 2048 with respect to tile size Nb = 16 ... 2048 on BDW, KNC, and KNL.]

Tiling improves performance on all three processors.

Hybrid OpenMP/MPI Parallelism in QMCPACK

= Current parallelism is over walkers (Nw).

= Working-set size in QMCPACK grows with the number of walkers.

= Parallelizing each walker update:

= Specifically for Intel Xeon Phi, with its large number of cores/threads, the next level of parallelism becomes essential for strong scaling.

[Parallel efficiency of QMCPACK on US-DOE facilities. The legend shows the MPI tasks and OpenMP threads of the reference computing unit (CU) and the maximum number of nodes on each platform: Jaguar (400x6, 18000), Hopper (256x6, 2048), Edison (64x12, 512), Mira (1024x32, 16384).]

J. Kim, K. P. Esler, J. McMinis, M. A. Morales, B. K. Clark, L. Shulenburger, and D. M. Ceperley, "Hybrid algorithms in quantum Monte Carlo," Journal of Physics: Conference Series, vol. 402, no. 1, p. 012008, 2012. [Online]. Available: http://stacks.iop.org/1742-6596/402/i=1/a=012008

Parallelism within a walker — nested threading

Reduces the memory requirement and the time to solution on a node by reducing the number of walkers on a node.

miniQMC replaces OpenMP nested threading with manual assignment of work.

int Nb = N/M; // tile size.
class WalkerSoA { T v[Nb], g[3*Nb], l[Nb], h[6*Nb]; };
// SoA object array, each containing P[nx][ny][nz][Nb]
BsplineSoA bs[M](Nb);

// Parallel region for creating walkers.
#pragma omp parallel
{
  WalkerSoA w[M](Nb);

  for (int t = 0; t < M; ++t) {
    for (int j = 0; j < ns; j++)
      bs[t].V(vPos[j], w[t].v);
    for (int j = 0; j < ns; j++)
      bs[t].VGL(vglPos[j], w[t].v, w[t].g, w[t].l);
    for (int j = 0; j < ns; j++)
      bs[t].VGH(vghPos[j], w[t].v, w[t].g, w[t].h);
  }
}

Execution of tiles in different threads.
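One way such a manual assignment could look: a hypothetical round-robin split of the M tiles among the nth inner threads of a walker (this helper is ours, not the miniQMC source). Each thread owns a disjoint subset of tiles, so the tile loop needs no synchronization beyond the enclosing parallel region.

```cpp
#include <cassert>
#include <vector>

// Sketch (ours): thread tid of nth cooperating threads handles tiles
// tid, tid + nth, tid + 2*nth, ... so all M tiles are covered exactly once.
std::vector<int> tiles_for_thread(int tid, int nth, int M) {
    std::vector<int> mine;
    for (int t = tid; t < M; t += nth) mine.push_back(t);
    return mine;
}
```

With nth = 4 and M = 10, thread 0 processes tiles {0, 4, 8} and thread 3 processes {3, 7}; the union over threads is all 10 tiles with no overlap.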

Strong Scaling Results on KNL

[Chart: speedup of V-Opt, VGL-Opt, and VGH-Opt at N = 2048 on KNL vs. nth (Nb): 1 (2048), 2 (1024), 4 (512), 8 (256), 16 (128), 32 (64).]

> 90% parallel efficiency.

Speedup on KNL w.r.t. the number of walkers per thread.

Reduces time to solution by ~14x with 16 threads per walker.


Roofline Performance Analysis on KNL

= SoA data layout conversion increases the cache-aware AI from 0.22 to 0.32 FLOPS/Byte.

= ~7% of the achievable peak GFLOPS.

= 1.5x speedup w.r.t. the AoS version.

VGH roofline performance model for N=2048. Circles denote GFLOPS at the cache-aware AI and X the best performance (AoSoA) on DDR.

Roofline Performance Analysis on KNL

= The AoSoA version increases cache reuse with the same AI.

= Better cache utilization: ~11% of peak GFLOPS.

= ~2.25x gain in performance.

VGH roofline performance model for N=2048. Circles denote GFLOPS at the cache-aware AI and X the best performance (AoSoA) on DDR.

Roofline Performance Analysis on KNL

= The AoSoA version with MCDRAM is ~3.3x faster than DDR.

VGH roofline performance model for N=2048. Circles denote GFLOPS at the cache-aware AI and X the best performance (AoSoA) on DDR.

Roofline performance analysis on BDW

= AoSoA version: ~50% of the achievable GFLOPS (SP vector FMA peak).

VGH roofline performance model for N=2048. Circles denote GFLOPS at the cache-aware AI.

Performance improved to ~50% of peak GFLOPS with the AoSoA version.

Performance Summary

= The improvements are portable to 4 types of CPUs, even from different vendors.

= Significant speedups even on BG/Q.

On the VGH routine:
SoA and basic optimizations: >11.7x / 2.6x / 1.7x
AoSoA/Tiling + strong scaling: 5.2x / 16.4x / 35.2x / 33.1x
Number of threads per walker (the optimal tile size): 2 (32) / 2 (32) / 8 (256) / 16 (128)

Symmetric Distance table computation
AoS-to-SoA transformation of particle positions.

Speedup vs. Problem Size
Higher the better

[Bar chart: speedup relative to the KNL baseline (256 threads) for 256, 512, and 800 electrons, comparing KNL Baseline (256 TH), BDW Opt (2 MPI / 36 TH), and KNL Opt (256 TH); KNL is substantially faster with the SoA data layout.]

KNL used in Quad/Cache mode for these experiments.

Here, TH = threads.
BDW has 2 sockets for these experiments.

Results

= Array-of-structures (AoS) to structure-of-arrays (SoA) transformation helps
achieve efficient vectorization.

= Tiling for better memory access helps achieve approximately constant
throughput across problem sizes.

= Nested parallelism over the AoSoA objects on KNL helps reduce the time-to-
solution by ~14x with 16 threads.

= Optimizations result in significant performance gain on all three distinct
cache-coherent architectures.


Ways we increased the performance!

= SIMD Parallelism
— SoA data layout adaptation.
= Efficient cache utilization
— Tiling/Cache-Blocking.
= Scalability

— Next level of threading to reduce time to solution.

- Takes advantage of reduced working set size.

Reference

Amrita Mathuriya, Ye Luo, Anouar Benali, Luke Shulenburger, Jeongnim Kim

“Optimization and parallelization of B-spline based orbital evaluations in QMC on multi/
many-core shared memory processors”

arXiv: 1611.02665

Legal Disclaimers

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY
INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL
ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING
LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER
INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application” is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR
USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND
AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS’
FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL
APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS
PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or
instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from
future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current
characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to:

http://www.intel.com/design/literature.htm
Knights Landing and other code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release.
Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of
Intel's internal code names is at the sole risk of the user.

Intel, Look Inside, Xeon, Intel Xeon Phi, Pentium, Cilk, VTune and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

"Other names and brands may be claimed as the property of others.
Copyright © 2016 Intel Corporation

TEL HPC DEVELOPER CONFERENC

Legal Disclaimers

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel
microprocessors for optimizations that are not unique to Intel microprocessors. These
optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel
does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel.

Microprocessor-dependent optimizations in this product are intended for use with Intel
microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved
for Intel microprocessors. Please refer to the applicable product User and Reference Guides
for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Legal Disclaimers

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.

Estimated Results Benchmark Disclaimer:

Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or
configuration may affect actual performance.

Software Source Code Disclaimer:
Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the
Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to the following conditions:

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Thank you for your time

Amrita Mathuriya
[email protected]

www.intel.com/hpcdevcon

INTEL® HPC DEVELOPER CONFERENCE
FUEL YOUR INSIGHT

Backup

SymmetricDTD::moveOnSphere — Code Snippet

SoA:

const T rnewX = rnew[0];
const T rnewY = rnew[1];
const T rnewZ = rnew[2];

__assume_aligned(X, 64);
__assume_aligned(Y, 64);
__assume_aligned(Z, 64);
__assume_aligned(DResults, 64);
for (int iat = 0; iat < NSourceIndex; ++iat)
{
  T displX = rnewX - X[iat];
  T displY = rnewY - Y[iat];
  T displZ = rnewZ - Z[iat];

  //cart2unit
  T ar_X = displX*g00 + displY*g10 + displZ*g20;
  T ar_Y = displX*g01 + displY*g11 + displZ*g21;
  T ar_Z = displX*g02 + displY*g12 + displZ*g22;
  //put them in the box

  //unit2cart
  displX = ar_X*r00 + ar_Y*r10 + ar_Z*r20;
  displY = ar_X*r01 + ar_Y*r11 + ar_Z*r21;
  displZ = ar_X*r02 + ar_Y*r12 + ar_Z*r22;

  T dResult = displX*displX + displY*displY + displZ*displZ;
  DResults[iat] = dResult;
}

For efficient auto-vectorization with the compiler:

Three separate arrays for X, Y, and Z instead of a single array with (x, y, z) as a data member.

Similar SoA (structure-of-arrays) data layout for the output array.

AoS:

using pos_type = T[3];
pos_type R[N], dR[N];
T dist[N];

for (int j = 0; j < N; ++j)
{
  pos_type delta = PBC(R[j] - R[i]);
  dist[j] = sqrt(dot(delta, delta));
  dR[j] = delta;
}
//use dist and dR
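For reference, a self-contained version of the SoA loop with the periodic-boundary (cart2unit/unit2cart) steps omitted; the function name is ours. Separate X/Y/Z arrays give the compiler three unit-stride input streams and one unit-stride output stream to vectorize over.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Sketch (ours): distances from one point (rx, ry, rz) to n source particles
// stored in SoA layout, without periodic boundary conditions.
void distances_soa(const float* X, const float* Y, const float* Z,
                   std::size_t n, float rx, float ry, float rz, float* out) {
    for (std::size_t i = 0; i < n; ++i) {   // unit-stride over all four arrays
        const float dx = rx - X[i];
        const float dy = ry - Y[i];
        const float dz = rz - Z[i];
        out[i] = std::sqrt(dx * dx + dy * dy + dz * dz);
    }
}
```

The AoS variant above instead loads (x, y, z) triples per particle, which forces gather-like strided accesses inside the vectorized loop.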