Intel Software Conference 2014 Brazil
May 2014
Leonardo Borges
Notes on NUMA architecture
Non-Uniform Memory Access (NUMA)
- FSB architecture
  - All memory in one location
- Starting with Nehalem
  - Memory located in multiple places
- Latency to memory dependent on location
- Local memory
  - Highest BW
  - Lowest latency
- Remote memory
  - Higher latency
[Diagram: Socket 0 and Socket 1, each with its own local memory, connected by QPI]
Ensure software is NUMA-optimized for best performance
Non-Uniform Memory Access (NUMA)
- Locality matters
  - Remote memory access latency is ~1.7x that of local memory
  - Local memory bandwidth can be up to 2x greater than remote
Intel® QPI = Intel® QuickPath Interconnect
[Diagram: Node 0 (CPU0 with local DRAM) and Node 1 (CPU1 with its DRAM) connected by Intel® QPI, showing local vs. remote memory access paths]
- BIOS:
  - NUMA mode (NUMA Enabled)
    - First half of memory space on Node 0, second half on Node 1
    - Should be the default on Nehalem (!)
  - Non-NUMA (NUMA Disabled)
    - Even/odd cache lines assigned to Nodes 0/1: line interleaving
Local Memory Access Example
- CPU0 requests cache line X, not present in any CPU0 cache
  - CPU0 requests data from its DRAM
  - CPU0 snoops CPU1 to check whether the data is present
- Step 2:
  - DRAM returns data
  - CPU1 returns snoop response
- Local memory latency is the maximum of the two response latencies
- Nehalem is optimized to keep these key latencies close to each other
Remote Memory Access Example
- CPU0 requests cache line X, not present in any CPU0 cache
  - CPU0 requests data from CPU1
  - Request sent over QPI to CPU1
  - CPU1's IMC makes a request to its DRAM
  - CPU1 snoops its internal caches
  - Data returned to CPU0 over QPI
- Remote memory latency is a function of having a low-latency interconnect
Non-Uniform Memory Access and Parallel Execution
- Process-parallel execution:
  - NUMA friendly: data belongs only to the process
  - E.g. MPI
  - Affinity pinning maximizes local memory access
  - Standard for HPC
- Shared-memory threading:
  - More problematic: the same thread may require data from multiple NUMA nodes
  - E.g. OpenMP, TBB, explicit threading
  - OS-scheduled thread migration can aggravate the situation (threads can be pinned explicitly; see the sketch below)
  - NUMA and non-NUMA configurations should be compared
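A minimal sketch of explicit thread pinning on Linux (assuming glibc; the CPU number 0 is an arbitrary placeholder, not from the slides). Once a thread is bound to one logical CPU, the scheduler cannot migrate it to another NUMA node, so its first-touch allocations stay local:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);                       /* allow only logical CPU 0 */
    /* pid 0 means "the calling thread" */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    /* Memory first touched from here on is placed on the NUMA node
       that owns CPU 0, and the thread stays on that node. */
    return 0;
}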
Operating System Differences
- Operating systems allocate data differently
- Linux*
  - malloc reserves the memory
  - Assigns the physical page when the data is touched (first touch)
    - Many HPC codes initialize memory from a single 'master' thread!
  - A couple of extensions are available via numactl and libnuma (a libnuma sketch follows this list), e.g.:
    - numactl --interleave=all /bin/program
    - numactl --cpunodebind=1 --membind=1 /bin/program
    - numactl --hardware
    - numa_run_on_node(3) // run thread on node 3
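A minimal libnuma sketch (an illustration only: it assumes the libnuma development headers are installed, the program is linked with -lnuma, and node 1 exists). It places a buffer directly on a chosen node instead of relying on first touch:

#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {                    /* -1: kernel has no NUMA support */
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }
    size_t size = 64UL * 1024 * 1024;
    double *buf = numa_alloc_onnode(size, 1);      /* place the pages on node 1 */
    if (buf == NULL) return 1;
    for (size_t i = 0; i < size / sizeof(double); i++)
        buf[i] = 0.0;                              /* touch the pages; they stay on node 1 */
    numa_free(buf, size);
    return 0;
}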
- Microsoft Windows*
  - malloc assigns the physical page on allocation
  - This default allocation policy is not NUMA friendly
  - Microsoft Windows has NUMA-friendly APIs
    - VirtualAlloc reserves memory (like malloc on Linux*); see the sketch below
      - Physical pages assigned at first use
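A minimal VirtualAlloc sketch for Windows (an illustration only; the buffer size is a placeholder). The call reserves and commits address space, but physical pages are assigned only when the memory is first touched, so the touching thread determines the NUMA node:

#include <windows.h>

int main(void) {
    SIZE_T size = 64 * 1024 * 1024;
    /* Reserve and commit address space; no physical pages are assigned yet. */
    double *buf = (double *)VirtualAlloc(NULL, size,
                                         MEM_RESERVE | MEM_COMMIT,
                                         PAGE_READWRITE);
    if (buf == NULL) return 1;
    /* First touch from the thread (and node) that will use the data. */
    for (SIZE_T i = 0; i < size / sizeof(double); i++)
        buf[i] = 0.0;
    VirtualFree(buf, 0, MEM_RELEASE);
    return 0;
}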
- For more details:
  http://kernel.org/pub/linux/kernel/people/christoph/pmig/numamemory.pdf
  http://msdn.microsoft.com/en-us/library/aa363804.aspx
Other Ways to Set Process Affinity
- taskset: sets or retrieves the CPU affinity of a process
- Intel MPI: using the I_MPI_PIN and I_MPI_PIN_PROCESSOR_LIST environment variables
- KMP_AFFINITY with the Intel Compilers' OpenMP runtime (usage examples follow this list)
  - Compact: binds OpenMP thread n+1 as close as possible to OpenMP thread n
  - Scatter: distributes threads evenly across the entire system; scatter is the opposite of compact
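A few illustrative invocations (program names, CPU lists and rank counts are placeholders, not from the original slides):

taskset -c 0-3 ./program                                   # restrict the process to logical CPUs 0-3
export KMP_AFFINITY=scatter                                # spread OpenMP threads across the machine
mpirun -n 8 -genv I_MPI_PIN_PROCESSOR_LIST 0-7 ./program   # pin Intel MPI ranks to CPUs 0-7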
NUMA Application Level Tuning: Shared Memory Threading Example: TRIAD
- Parallelized time-consuming hotspot "TRIAD" (e.g. from the STREAM benchmark) using OpenMP

main() {
  …
  #pragma omp parallel
  {
    // Parallelized TRIAD loop…
    #pragma omp for private(j)
    for (j = 0; j < N; j++)
      a[j] = b[j] + scalar*c[j];
  } // end omp parallel
  …
} // end main

Parallelizing hotspots may not be sufficient for NUMA: if a single master thread initialized a, b and c, first touch placed all of their pages on that thread's node, and most threads in the loop access remote memory.
NUMA Shared Memory Threading Example (Linux*)

KMP_AFFINITY=verbose,compact,0   // environment variable to pin thread affinity

main() {
  …
  #pragma omp parallel
  {
    // Each thread initializes its own data,
    // pinning the pages to local memory (first touch)
    #pragma omp for private(i)
    for (i = 0; i < N; i++)
      { a[i] = 10.0; b[i] = 10.0; c[i] = 10.0; }
    …
    // Parallelized TRIAD loop: the same thread that
    // initialized the data now uses it
    #pragma omp for private(j)
    for (j = 0; j < N; j++)
      a[j] = b[j] + scalar*c[j];
  } // end omp parallel
  …
} // end main
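A usage sketch for the example above (the source file name and thread count are placeholders; older Intel compilers use -openmp instead of -qopenmp):

icc -qopenmp triad.c -o triad
OMP_NUM_THREADS=8 KMP_AFFINITY=verbose,compact,0 ./triad

The verbose modifier makes the Intel OpenMP runtime print the thread-to-CPU bindings at startup, which is an easy way to confirm the pinning took effect.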
NUMA Optimization Summary
- NUMA adds complexity to software parallelization and optimization
- Optimize for latency and for bandwidth
  - In most cases the goal is to minimize latency
  - Use local memory
  - Keep memory near the thread that accesses it
  - Keep the thread near the memory it uses
- Rely on quality middleware for CPU affinitization
  - Example: Intel Compiler OpenMP or MPI environment variables
- Application-level tuning may be required to minimize NUMA first-touch policy effects