Lustre HPCS Design Overview
Scalability Workshop – ORNL – May 2009

Andreas Dilger
Senior Staff Engineer, Lustre Group
Sun Microsystems
Topics
• HPCS Goals
• HPCS Architectural Improvements
• Performance Enhancements
• Conclusion
HPC Center of the Future
[Diagram: a shared storage network connecting User Data (1000's of OSTs), Metadata (100's of MDTs), a Capability cluster (500,000 nodes), Capacity clusters 1-3 (250,000 / 150,000 / 50,000 nodes), a Test cluster (25,000 nodes), Viz 1 and Viz 2, WAN access, and an HPSS archive, at 10 TB/sec.]
HPCS Goals – Capacity
(current Lustre status shown in parentheses)
Filesystem Limits
• 100 PB+ maximum file system size (10 PB)
• 1 trillion (10^12) files per file system (4 billion files)
• > 30k client nodes (> 30k clients)
Single File/Directory Limits
• 10 billion files per directory (15 M)
• 0 to 1 PB file size range (360 TB)
• 1 B to 1 GB I/O request size range (1 B - 1 GB I/O req)
• 100,000 open shared files per process (10,000 files)
• Long file names and 0-length files as data records (long file names and 0-length files)
HPCS Goals – Performance
Aggregate
• 10,000 metadata operations per second (10,000/s)
• 1500 GB/s file-per-process / shared-file I/O (180 GB/s)
• No impact of RAID rebuilds on performance (variable)
Single Client
• 40,000 creates/s, up to 64 kB data (5 k/s, 0 kB)
• 30 GB/s full-duplex streaming I/O (2 GB/s)
Miscellaneous
• POSIX I/O API extensions proposed at OpenGroup* (partial)

* High End Computing Extensions Working Group
  http://www.opengroup.org/platform/hecewg/
HPCS Goals – Reliability
• End-to-end resiliency equivalent to T10 DIF (net only)
• No impact of RAID rebuild on performance (variable)
• Uptime of 99.99% (99% ?)
  ● Downtime < 1 h/year
• 100 h filesystem integrity check (8 h, partial)
  ● 1 h/year downtime means the check must run online
HPCS Architectural Improvements
• Use Available ZFS Functionality
• End-to-End Data Integrity
• RAID Rebuild Performance Impact
• Filesystem Integrity Checking
• Clustered Metadata
• Recovery Improvements
• Performance Enhancements
Use Available ZFS Functionality
Capacity
• Single filesystem 100 TB+ (2^64 LUNs * 2^64 bytes)
• Trillions of files in a single file system (2^48 files)
• Dynamic addition of capacity
Reliability and resilience
• Transaction based, copy-on-write
• Internal data redundancy (double parity, 3 copies)
• End-to-end checksum of all data/metadata
• Online integrity verification and reconstruction
Functionality
• Snapshots, filesets, compression, encryption
• Online incremental backup/restore
• Hybrid storage pools (HDD + SSD)
End-to-End Data Integrity
Current Lustre checksumming
• Detects data corruption over the network
• Ext3/4 does not checksum data on disk
ZFS stores data/metadata checksums
• Fast (Fletcher-4 default, or none)
• Strong (SHA-256)
HPCS Integration
• Integrate Lustre and ZFS checksums
• Avoid recomputing the full checksum on data
• Always overlap checksum coverage
• Use a scalable tree hash method
H
ash Tree and Multiple Block Sizes
1
0
S
calability Workshop – ORNL – May 2009
1
6kB
1
28kB
2
56kB
5
12kB
3
2kB
1
024kB
6
4kB
D
ATA
S
EGMENT
L
EAF HASH
I
NTERIOR HASH
R
OOT HASH
4
kB
8
kB
M
AX PAGE
Z
FS BLOCK
L
USTRE RPC SIZE
M
I
N
P
A
G
E
Hash Tree For Non-contiguous Data
[Diagram: example tree over data segments 0, 1, 4, 5, and 7]
LH(x) = hash of data x (leaf); IH(x) = hash of data x (interior); x + y = concatenation of x and y
• C0 = LH(data segment 0)
• C1 = LH(data segment 1)
• C4 = LH(data segment 4)
• C5 = LH(data segment 5)
• C7 = LH(data segment 7)
• C01 = IH(C0 + C1)
• C45 = IH(C4 + C5)
• C457 = IH(C45 + C7)
• K = IH(C01 + C457) = ROOT hash
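To make the LH/IH construction above concrete, here is a minimal Python sketch of building the root hash over non-contiguous segments. SHA-256, the "leaf"/"node" prefixes, and the positional pairing rule are illustrative assumptions, not the actual Lustre/ZFS code.

# Minimal sketch of a hash tree over non-contiguous data segments,
# following the LH/IH notation above.  SHA-256, the "leaf"/"node"
# prefixes, and the positional pairing rule are illustrative only.
import hashlib

def LH(data):
    """Leaf hash of one data segment."""
    return hashlib.sha256(b"leaf" + data).digest()

def IH(left, right):
    """Interior hash over the concatenation of two child hashes."""
    return hashlib.sha256(b"node" + left + right).digest()

def root_hash(segments):
    """Build the tree bottom-up from whatever segments are present,
    e.g. {0, 1, 4, 5, 7}; a node without a sibling is promoted to the
    next level unchanged, which yields C457 = IH(C45 + C7) above."""
    level = {i: LH(data) for i, data in segments.items()}
    while len(level) > 1 or 0 not in level:
        nxt = {}
        for i in sorted(level):
            if i // 2 in nxt:
                continue                        # already combined with sibling
            sib = i ^ 1                         # positional sibling index
            if sib in level:
                nxt[i // 2] = IH(level[min(i, sib)], level[max(i, sib)])
            else:
                nxt[i // 2] = level[i]          # promote unpaired subtree
        level = nxt
    return level[0]                             # K, the ROOT hash

segments = {i: b"data segment %d" % i for i in (0, 1, 4, 5, 7)}
print(root_hash(segments).hex())

On segments {0, 1, 4, 5, 7} this reproduces the pairing in the example: C01, C45, C457 = IH(C45 + C7), and K = IH(C01 + C457).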
End-to-End Integrity Client Write
[Diagram]

End-to-End Integrity Server Write
[Diagram]
HPC Center of the Future
[Diagram: the same shared storage picture, with the user-data tier sized at 1224 × 96 TB OSTs (3 × 32 TB LUNs each), 36720 × 4 TB disks in RAID-6 8+2, 115 PB of data.]
RAID Failure Rates
115 PB filesystem, 1.5 TB/s
• 36720 4 TB disks in RAID-6 8+2 LUNs
• 4-year Mean Time To Failure
• 1 disk fails every hour on average
• 4 TB disk @ 30 MB/s: 38 hr rebuild
• 30 MB/s is 50%+ of disk bandwidth (seeks)
• May reduce aggregate throughput by 50%+
• 1 disk failure may cost 750 GB/s aggregate
• 38 hr × 1 disk/hr = 38 disks/OSTs degraded
Failure Rates vs. MTTF
[Chart: 36720 4 TB disks, RAID-6 8+2, 1224 OSTs, 30 MB/s rebuild. Failure rate vs. MTTF (years): disks failed during rebuild, percent OST degraded, days to OST double failure, years to OST triple failure.]

Failure Rates vs. RAID disks
[Chart: 36720 4 TB disks, RAID-6 N+2, 1224 OSTs, MTTF 4 years. Failure rate vs. disks per RAID-6 stripe: disks failed during rebuild, percent OST degraded, days to OST double failure, years to OST triple failure.]

Failure Rates vs. Rebuild Speed
[Chart: 36720 4 TB disks, RAID-6 8+2, 1224 OSTs, MTTF 4 years. Failure rate vs. rebuild speed (MB/s): disks failed during rebuild, percent OST degraded, days to OST double failure, years to OST triple failure.]
Lustre-level Rebuild Mitigation
Several related problems
• Avoid global impact from degraded RAID
• Avoid load on the rebuilding RAID set
Avoid degraded OSTs for new files (see the sketch below)
• Little or no load on the degraded RAID set
• Maximizes rebuild performance
• Minimal global performance impact
• 30 disks (3 LUNs) per OST, 1224 OSTs
• 38 of 1224 OSTs = 3% aggregate cost
• OSTs remain available for existing files
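A minimal sketch of the "avoid degraded OSTs for new files" policy referenced above; the class and helper names are hypothetical, and the real Lustre allocator (QOS/round-robin) tracks far more state than shown:

# Sketch of "avoid degraded OSTs for new files": when striping a new file,
# skip OSTs whose backing RAID set is rebuilding, while existing objects on
# them stay available.  Class and helper names here are hypothetical.
import itertools

class Ost:
    def __init__(self, index):
        self.index = index
        self.rebuilding = False            # set while a backing LUN is degraded

_round_robin = itertools.count()

def choose_stripes(osts, stripe_count):
    """Pick OST indices for a new file, preferring healthy OSTs."""
    healthy = [o for o in osts if not o.rebuilding]
    pool = healthy if len(healthy) >= stripe_count else osts   # fall back if too few
    start = next(_round_robin) % len(pool)
    return [pool[(start + i) % len(pool)].index for i in range(stripe_count)]

osts = [Ost(i) for i in range(8)]
osts[2].rebuilding = True                  # OST index 2 sits on a rebuilding RAID set
print(choose_stripes(osts, 4))             # the new file's stripes avoid index 2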
ZFS-level RAID-Z Rebuild
RAID-Z/Z2 is not the same as RAID-5/6
• NEVER does read-modify-write
• Supports arbitrary block size/alignment
• RAID layout is stored in the block pointer
ZFS metadata traversal for RAID rebuild
• Good: only rebuilds used storage (< 80%)
• Good: verifies the checksum of rebuilt data
• Bad: may cause random disk access
RAID-Z Rebuild Improvements
RAID-Z optimized rebuild
• ~3% of storage is metadata
• Scan metadata first, build an ordered list
• Data reads become mostly linear
• Bookmark to restart rebuild
• ZFS itself is not tied to RAID-Z
Distributed hot space
• Spread hot-spare rebuild space over all disks
• All disks' bandwidth/IOPS available for normal I/O
• All disks' bandwidth/IOPS available for rebuilding
HPC Center of the Future
[Diagram: the same shared storage picture, with the metadata tier sized at 128 MDTs on 16 TB LUNs: 5120 × 800 GB SAS/SSD drives in RAID-1, 2 PB total.]
Clustered Metadata
100s of metadata servers
Distributed inodes
• Files are normally local to their parent directory
• Subdirectories are often non-local
Split directories (see the sketch below)
• A split directory maps name hash to MDT, as a striped file maps offset to OST
Distributed Operation Recovery
• Cross-directory mkdir, rename, link, unlink
• Strictly ordered distributed updates
• Ensures namespace coherency and recovery
• At worst an inode refcount is too high (inode leaked)
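As a rough sketch of the split-directory analogy above, hashing an entry name to pick an MDT shard, next to the familiar offset-to-OST mapping for striped files. The CRC-32 hash and the shard/stripe counts are illustrative, not the hash Lustre actually uses:

# Sketch of a split directory: the entry name is hashed to choose which
# shard (MDT) holds it, just as a striped file maps offset -> OST.
# zlib.crc32 and the shard/stripe counts are illustrative only.
import zlib

def name_to_mdt(name, mdt_count):
    """Map a directory entry name to one of mdt_count directory shards."""
    return zlib.crc32(name.encode("utf-8")) % mdt_count

def offset_to_ost(offset, stripe_size, stripe_count):
    """For comparison: map a file offset to one of stripe_count OSTs."""
    return (offset // stripe_size) % stripe_count

for entry in ("input.dat", "output.dat", "checkpoint.0001"):
    print("%-16s -> MDT%04d" % (entry, name_to_mdt(entry, 4)))
print("offset 5 MiB     -> OST%04d" % offset_to_ost(5 << 20, 1 << 20, 4))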
Filesystem Integrity Checking
Problem is among the hardest to solve
• 1 trillion files in 100 h
• 2-4 PB of MDT filesystem metadata
• ~3 million files/sec, 3 GB/s+ for one pass
• 3M × stripes checks/sec from MDSes to OSSes
• 860 × stripes random metadata IOPS on OSTs
Need to handle CMD coherency as well
• Link count on files, directories
• Directory parent/child relationship
• Filename to FID to inode mapping
Filesystem Integrity Checking
Integrate Lustre with ZFS scrub/rebuild
• ZFS callback to check Lustre references
• Event-driven checks mean fewer re-reads
• ldiskfs can use an inode table iteration
Back-pointers allow direct verification (see the sketch below)
• Pointer from OST object to MDT inode
• Pointer list from inode to {parent dir, name}
• No saved state needed for coherency check
• About 1 bit/block to detect leaks/orphans
• Or, a second pass in the reverse direction
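A sketch of how the back-pointers could drive the coherency check described above; the data structures are hypothetical stand-ins for on-disk state, and the real check runs inside the ZFS scrub / ldiskfs inode iteration:

# Sketch of back-pointer checking: every OST object carries a back-pointer
# to its MDT inode, so one pass over the objects verifies the forward
# references, and a second pass in the reverse direction finds dangling
# ones.  The structures below are hypothetical stand-ins for on-disk state.
from dataclasses import dataclass, field

@dataclass
class MdtInode:
    fid: int
    ost_objects: set = field(default_factory=set)   # forward references

@dataclass
class OstObject:
    obj_id: int
    parent_fid: int                                  # back-pointer to the MDT inode

def check(mdt, ost):
    # Forward pass: each object's back-pointer must name an inode that
    # still references the object (otherwise it is an orphan).
    for obj in ost.values():
        inode = mdt.get(obj.parent_fid)
        if inode is None or obj.obj_id not in inode.ost_objects:
            print("orphan / bad back-pointer: object", obj.obj_id)
    # Reverse pass (the slide's alternative keeps ~1 bit per block instead):
    # find inode references to objects that do not exist.
    for inode in mdt.values():
        for obj_id in inode.ost_objects - set(ost):
            print("dangling reference: inode", inode.fid, "-> object", obj_id)

mdt = {1: MdtInode(1, {10, 11}), 2: MdtInode(2, {12})}
ost = {10: OstObject(10, 1), 11: OstObject(11, 1), 13: OstObject(13, 9)}
check(mdt, ost)   # reports object 13 as an orphan and inode 2 -> 12 as dangling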
Recovery Improvements
Version Based Recovery
• Independent recovery stream per file
• Isolate the recovery domain to dependent operations
Commit on Share
• Avoid a client getting any dependent state
• Avoid sync for single-client operations
• Avoid sync for independent operations
Imperative Recovery
Server-driven notification of failover
• Server notifies clients when failover has completed
• Client replies immediately to the server
• Avoid clients waiting on RPC timeouts
• Avoid the server waiting for dead clients
Can distinguish a slow server from a dead one
• No waiting for an RPC timeout to start recovery
• Can use external or internal notification
Performance Enhancements
• SMP Scalability
• Network Request Scheduler
• Channel Bonding
SMP Scalability
Future nodes will have 100s of cores
• Need excellent SMP scaling on client/server
• Need to handle NUMA imbalances
Remove contention on servers
• Per-CPU resources (queues, locks)
• Fine-grained locking
• Avoid cross-node memory access
• Bind requests to a specific CPU deterministically (see the sketch below)
  ● Client NID, object ID, parent directory
Remove contention on clients
• Parallel copy_{to,from}_user, checksums
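A sketch of the deterministic request-to-CPU binding mentioned above; the hash, key format, and partition count are illustrative assumptions, not the actual ptlrpc service code:

# Sketch of binding requests to a CPU partition deterministically: all
# requests for the same client NID and object land on the same per-CPU
# queue, so partitions do not share locks.  The hash, key format, and
# partition count are illustrative.
import zlib

NCPUS = 16                                     # CPU partitions on the server (assumed)
queues = [[] for _ in range(NCPUS)]            # one request queue per CPU partition

def cpu_for_request(client_nid, object_id):
    key = ("%s:%d" % (client_nid, object_id)).encode()
    return zlib.crc32(key) % NCPUS

def enqueue(client_nid, object_id, request):
    queues[cpu_for_request(client_nid, object_id)].append(request)

enqueue("12345@o2ib", 0x2001, "write")
enqueue("12345@o2ib", 0x2001, "read")          # same key -> same CPU partition
print([len(q) for q in queues])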
Network Request Scheduler
Much larger working set than a disk elevator
• Higher-level information
  ● Client NID, file/offset
Read & write queue for each object on the server (see the sketch below)
  ● Requests sorted within object queues by offset
  ● Queues serviced round-robin, a variable number of operations at a time
  ● Deadline for request service time
• Scheduling input: op count, offset, fairness, delay
Future enhancements
• Job ID, process rank
• Gang scheduling across servers
• Quality of service
  ● Per UID/GID, per cluster: min/max bandwidth
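A simplified sketch of the per-object queue scheduling described above; the batch size is an illustrative tunable, and the deadline and fairness inputs are not modelled:

# Simplified sketch of the Network Request Scheduler idea: one queue per
# object, requests sorted by offset within each queue, queues serviced
# round-robin a fixed batch at a time.  The real NRS also weighs deadlines
# and fairness (not modelled); the batch size is an illustrative tunable.
import heapq, itertools, collections

BATCH = 4                                    # operations taken per queue per round

class ObjectQueue:
    def __init__(self):
        self._heap = []                      # (offset, seq, request), ordered by offset
        self._seq = itertools.count()
    def add(self, offset, request):
        heapq.heappush(self._heap, (offset, next(self._seq), request))
    def pop(self):
        return heapq.heappop(self._heap)[2]
    def __len__(self):
        return len(self._heap)

queues = collections.OrderedDict()           # object_id -> ObjectQueue, round-robin order

def submit(object_id, offset, request):
    queues.setdefault(object_id, ObjectQueue()).add(offset, request)

def next_batch():
    """Take up to BATCH requests from the head queue, then move that
    queue to the back of the round-robin if it is not yet empty."""
    while queues:
        object_id, q = queues.popitem(last=False)
        batch = [q.pop() for _ in range(min(BATCH, len(q)))]
        if len(q):
            queues[object_id] = q            # re-append at the back
        if batch:
            return batch
    return []

for off_mb in (4, 0, 8, 12, 16):
    submit("obj-A", off_mb << 20, "A@%dMiB" % off_mb)
submit("obj-B", 0, "B@0MiB")
print(next_batch())    # the four lowest-offset requests for obj-A
print(next_batch())    # obj-B's request
print(next_batch())    # obj-A's remaining request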
Network Request Scheduler
[Diagram: a global FIFO queue Q, per-object data queues (q1 … qn) of requests, and the work queue WQ.]
HPC Center of the Future
[Diagram: the same shared storage picture as before.]
Channel Bonding
Combine multiple network interfaces
• Increased performance
  ● Shared: balance load across all interfaces
• Improved reliability
  ● Failover: use backup links if the primary is down
• Flexible configuration
  ● Network interfaces of different types/speeds
  ● Peers do not need to share all networks
  ● Configuration server per network
Conclusion
Lustre can meet the HPCS filesystem goals
• The scalability roadmap is reasonable
• Incremental, orthogonal improvements
HPCS provides impetus to grow Lustre
• Accelerates the development roadmap
• Ensures that Lustre will meet future needs
Questions?
HPCS Filesystem Overview online at:
http://wiki.lustre.org/index.php/Learn:Lustre_Publications
THANK YOU
Andreas Dilger
Senior Staff Engineer, Lustre Group
Sun Microsystems
T10-DIF vs. Hash Tree
CRC-16 Guard Word
• All 1-bit errors
• All adjacent 2-bit errors
• Single 16-bit burst error
• 10^-5 bit error rate
32-bit Reference Tag
• Misplaced write detected (unless offset by a multiple of 2 TB)
• Misplaced read detected (unless offset by a multiple of 2 TB)
Fletcher-4 Checksum
• All 1-, 2-, 3-, 4-bit errors
• All errors affecting 4 or fewer 32-bit words
• Single 128-bit burst error
• 10^-13 bit error rate
Hash Tree
• Misplaced read
• Misplaced write
• Phantom write
• Bad RAID reconstruction
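For reference, a straightforward Python rendering of the Fletcher-4 checksum ZFS uses: four running sums over 32-bit words, each kept modulo 2^64. Little-endian word order is assumed here; ZFS has native and byte-swapped variants.

# Fletcher-4 as used by ZFS: four running sums over the buffer taken as
# 32-bit words, each sum kept modulo 2^64.  Little-endian words are
# assumed here; ZFS has native and byte-swapped variants.
import struct

MASK64 = (1 << 64) - 1

def fletcher4(data):
    assert len(data) % 4 == 0, "buffer must be a multiple of 4 bytes"
    a = b = c = d = 0
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & MASK64
        b = (b + a) & MASK64
        c = (c + b) & MASK64
        d = (d + c) & MASK64
    return a, b, c, d

buf = bytes(range(256)) * 16                 # 4 KiB test buffer
print(["%016x" % x for x in fletcher4(buf)])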
Metadata Improvements
Metadata Writeback Cache
• Avoids unnecessary server communication
  ● Operations logged/cached locally
  ● Performance of a local file system when uncontended
• Aggregated distributed operations (see the sketch below)
  ● Server updates batched and transferred using bulk protocols (RDMA)
  ● Reduced network and service overhead
Sub-Tree Locking
  ● Lock aggregation – a single lock protects a whole subtree
  ● Reduces lock traffic and server load
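A sketch of the batching idea behind the writeback cache referenced above; the log format and flush threshold are invented for illustration, and the real cache also handles cancellation, replay, and locking:

# Sketch of a client-side metadata writeback log: operations are recorded
# locally and shipped to the MDS in batches, so uncontended workloads run
# at local-filesystem speed and the server sees one bulk request instead
# of one RPC per operation.  The log format and threshold are illustrative.
BATCH_LIMIT = 64                       # flush after this many pending operations

class WritebackLog:
    def __init__(self, send_bulk):
        self.pending = []              # logged locally, not yet sent to the MDS
        self.send_bulk = send_bulk     # ships one batch (e.g. over RDMA)

    def log(self, op, **args):
        self.pending.append((op, args))
        if len(self.pending) >= BATCH_LIMIT:
            self.flush()

    def flush(self):
        if self.pending:
            self.send_bulk(self.pending)    # one bulk transfer for many operations
            self.pending = []

log = WritebackLog(lambda batch: print("bulk RPC with", len(batch), "ops"))
for i in range(100):
    log.log("create", name="file.%04d" % i, mode=0o644)
log.flush()                            # ship the remaining 36 operations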
Metadata Improvements
Metadata Protocol
• Size on MDT (SOM)
  > Avoid multiple RPCs for attributes derived from OSTs
  > OSTs remain definitive while the file is open
  > Computed on close and cached on the MDT
• Readdir+
  > Aggregation of:
    - Directory I/O
    - Getattrs
    - Locking
LNET SMP Server Scaling
[Charts: RPC throughput vs. total client processes for varying numbers of client nodes, comparing a single lock against finer-grained locking.]
Communication Improvements
Flat communications model
• Stateful client/server connection required for coherence and performance
• Every client connects to every server
• O(n) lock conflict resolution
Hierarchical communications model
• Aggregate connections, locking, I/O, metadata ops
• Caching clients
  > Lustre ↔ system calls
  > Aggregate local processes (cores)
  > I/O forwarders scale another 32x or more
• Caching proxies
  > Lustre ↔ Lustre
  > Aggregate whole clusters
  > Implicit broadcast – scalable conflict resolution
Fault Detection Today
RPC timeout
> A timeout cannot distinguish death from congestion
Pinger
> No aggregation across clients or servers
> O(n) ping overhead
Routed networks
> Router failure is confused with peer failure
Fully automatic failover scales with the slowest time constant
> 10s of minutes on large clusters
> Finer failover control could be much faster
Architectural Improvements
Scalable Health Network
• Burden of monitoring clients is distributed – not replicated
• Fault-tolerant status reduction/broadcast network
  > Servers and LNET routers
• LNET high-priority small message support
  > Health network stays responsive
• Prompt, reliable detection
  > Time constants in seconds
  > Failed servers, clients and routers
  > Recovering servers and routers
Interface with existing RAS infrastructure
• Receive and deliver status notifications
Health Monitoring Network
[Diagram: clients monitored by a primary health monitor, with a failover health monitor.]
Operations Support
Lustre HSM
• Interface from Lustre to hierarchical storage
  > Initially HPSS
  > SAM/QFS soon afterward
Tiered storage
• Combine HSM support with ZFS's SSD support and a policy manager to provide tiered storage management