Traditional dual-controller array (Controller 1 / Controller 2, access network, interconnect) vs. Scale-Out storage:
■ Dual redundancy vs. N-way resilience, e.g.:
■ 1 node/controller can fail vs. 3 nodes can fail
■ 2 drives can fail (RAID 6) vs. 3 drives can fail
■ HA only inside a pair vs. nodes coordinate replication
■ While traditional storage systems distribute volumes across sub-sets of spindles, Scale-Out systems use algorithms to distribute volumes across all/many spindles and provide maximum utilization of all system resources (see the sketch below)
> Offer the same high performance for all volumes and the shortest rebuild times
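To make the contrast concrete, here is a minimal, hypothetical sketch (not any vendor's actual layout algorithm) of wide striping: a volume's stripe units are spread round-robin over all spindles, so every volume uses every drive instead of a fixed RAID subset.

```python
# Wide striping: a volume's stripe units are spread round-robin over ALL spindles,
# so every volume uses every drive and a rebuild is shared by all remaining drives.
SPINDLES = [f"spindle-{i:02d}" for i in range(24)]   # all drives in the system (illustrative)
STRIPE_UNIT = 256 * 1024                             # 256 KiB stripe unit (illustrative)

def stripe_location(volume_offset: int, volume_base: int = 0) -> str:
    """Map a byte offset inside a volume to the spindle holding that stripe unit."""
    stripe_no = volume_offset // STRIPE_UNIT
    return SPINDLES[(volume_base + stripe_no) % len(SPINDLES)]

# A 1 GiB volume touches all 24 spindles, not a fixed 8-drive RAID set:
used = {stripe_location(off) for off in range(0, 1 << 30, STRIPE_UNIT)}
print(len(used))                                     # prints 24
```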
A model for dynamic "clouds" in nature
■ Swarm intelligence [Wikipedia]
  Swarm intelligence (SI) is the collective behavior of decentralized, self-organized systems, natural or artificial.
■ Swarm behavior [Wikipedia]
  Swarm behavior, or swarming, is a collective behavior exhibited by animals of similar size (...) moving en masse or migrating in some direction.
  From a more abstract point of view, swarm behavior is the collective motion of a large number of self-propelled entities.
  From the perspective of the mathematical modeler, it is an (...) behavior arising from simple rules that are followed by individuals and does not involve any central coordination.
■ 4 MB objects by default
■ Objects mapped to placement groups (PGs)
  pgid = hash(object) & mask
■ PGs mapped to sets of OSDs (grouped by failure domain)
  crush(cluster, rule, pgid) = [osd2, osd3]
  Pseudo-random, statistically uniform distribution
  ~100 PGs per node
■ Fast: O(log n) calculation, no lookups
■ Reliable: replicas span failure domains
■ Stable: adding/removing OSDs moves few PGs
A deterministic pseudo-random hash-like function that distributes data uniformly among OSDs; it relies only on a compact cluster description to pick storage targets, without consulting a central allocator (see the sketch below).
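As an illustration of the two mapping steps (object → PG → OSD set), here is a minimal Python sketch. It is not Ceph's real CRUSH implementation, just the same idea: a stable hash picks the PG, and a deterministic pseudo-random function derived only from the compact cluster description picks distinct OSDs, so any client can compute placements without a central allocator.

```python
import hashlib

PG_NUM = 64                              # must be a power of two for the & mask trick
MASK = PG_NUM - 1
OSDS = [f"osd{i}" for i in range(12)]    # compact cluster description (illustrative)

def object_to_pg(name: str) -> int:
    """Step 1: hash the object name into a placement group id."""
    h = int.from_bytes(hashlib.md5(name.encode()).digest()[:4], "little")
    return h & MASK                      # pgid = hash(object) & mask

def pg_to_osds(pgid: int, replicas: int = 2) -> list[str]:
    """Step 2: CRUSH-like deterministic pseudo-random choice of distinct OSDs."""
    chosen, attempt = [], 0
    while len(chosen) < replicas:
        key = f"{pgid}:{attempt}".encode()
        idx = int.from_bytes(hashlib.md5(key).digest()[:4], "little") % len(OSDS)
        if OSDS[idx] not in chosen:      # real CRUSH also respects failure domains
            chosen.append(OSDS[idx])
        attempt += 1
    return chosen

pgid = object_to_pg("rbd_data.1234.0000000000000000")
print(pgid, pg_to_osds(pgid))            # deterministic: same result on every client
```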
Ceph principles
[Diagram: VMs, clients and apps accessing Ceph through librbd (Block) and the Object interface]
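To make the block path concrete, here is a minimal sketch using Ceph's Python bindings (rados/rbd, which wrap librados/librbd). The config path, pool name and image name are assumptions for illustration only.

```python
import rados
import rbd

# Connect with the default cluster config (path and pool/image names are assumptions).
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("rbd")                       # assumed pool name
    try:
        rbd.RBD().create(ioctx, "demo-image", 1 * 1024**3)  # 1 GiB block image
        image = rbd.Image(ioctx, "demo-image")
        try:
            image.write(b"hello ceph", 0)                   # block-style write at offset 0
            print(image.read(0, 10))                        # read the bytes back
        finally:
            image.close()
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```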
Distributed Redundant Storage
■ Intelligent data distribution across all nodes and spindles = wide striping (64KB – 16MB)
■ Redundancy with replica=2, 3 ... 8
■ Thin provisioning
■ Fast distributed rebuild (sketched below)
■ Availability, fault tolerance
  Disk, node, interconnect
  Automatic rebuild
  Distributed HotSpare space
■ Transparent block and file access
■ Reliability and consistency
■ Scalable performance
■ Pure PCIe-SSD for extreme transaction processing
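A minimal sketch of why a distributed rebuild is fast (assumptions: replica=2 and a uniform pseudo-random placement, not the product's actual algorithm): when one drive fails, the partner copies of its data sit on nearly all other drives, so the re-replication work is shared in parallel instead of one hot-spare absorbing the whole rebuild.

```python
import hashlib
from collections import Counter

DRIVES = [f"drive{i}" for i in range(24)]

def replicas_of(chunk: int) -> list[str]:
    """Place two replicas of a chunk on two distinct drives (pseudo-random)."""
    picks, salt = [], 0
    while len(picks) < 2:
        h = int.from_bytes(hashlib.sha1(f"{chunk}:{salt}".encode()).digest()[:4], "big")
        d = DRIVES[h % len(DRIVES)]
        if d not in picks:
            picks.append(d)
        salt += 1
    return picks

# Simulate the failure of one drive: count which surviving drives hold the
# partner copies that must be re-replicated -- the work spreads over all of them.
failed = "drive0"
rebuild_sources = Counter()
for chunk in range(10_000):
    placement = replicas_of(chunk)
    if failed in placement:
        partner = placement[0] if placement[1] == failed else placement[1]
        rebuild_sources[partner] += 1

print(len(rebuild_sources), "drives participate in the rebuild")
```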
Ceph processes
[Diagram: client processes alongside monitor and storage (OSD) daemons spread across all nodes; same Distributed Redundant Storage feature list as on the previous slide]
Agenda
Introduction
Hardware / Software layout
Tools for monitoring
Transformation test cases
■ With vanilla Kernel 3.8.13
■ fio-2.0.13 (example invocation sketched below)
■ ceph version 0.61.7 cuttlefish

Network interfaces (all <BROADCAST,MULTICAST,UP,LOWER_UP>):
  2x 1GbE        (mtu 1500)
  2x 10GbE       (mtu 1500)
  40Gb   ib0     (mtu 65520)
  56Gb   ib1     (mtu 65520)
  56Gb   ib2     (mtu 65520)
  40GbE  eth4    (mtu 9216)
  40GbE  eth5    (mtu 9216)

Disks:
  16x disk  LSI RAID 5/6      SAS 6G  2.12  /dev/sda ... /dev/sdp
   3x disk  INTEL(R) SSD 910  200GB   a411  /dev/sdq, /dev/sdr, /dev/sds
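The IOPS and bandwidth numbers on the following slides were produced with fio. As an illustration (the job parameters here are generic examples, not the exact job files used in the tests), a small Python wrapper launching a 4k random-read run with direct I/O:

```python
import subprocess

def run_fio(filename: str, rw: str = "randread", bs: str = "4k",
            iodepth: int = 32, numjobs: int = 4, runtime: int = 60) -> None:
    """Run a time-based fio job with direct I/O and print its output."""
    cmd = [
        "fio",
        "--name=benchmark",
        f"--filename={filename}",
        f"--rw={rw}",            # randread / randwrite / read / write
        f"--bs={bs}",            # block size under test (4k ... 8m on the slides)
        "--direct=1",            # bypass the page cache, as in the measurements
        "--ioengine=libaio",
        f"--iodepth={iodepth}",
        f"--numjobs={numjobs}",
        f"--runtime={runtime}",
        "--time_based",
        "--group_reporting",
    ]
    print(subprocess.run(cmd, capture_output=True, text=True).stdout)

# Example: benchmark a mapped RBD device (device name is an assumption).
run_fio("/dev/rbd0")
```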
qperf - explanation
■ tcp_lat is almost the same for blocks <= 128 bytes (a minimal tcp_lat-style ping-pong sketch follows after this list)
■ tcp_lat is very similar between 40GbE, 40Gb IPoIB and 56Gb IPoIB
■ A significant difference between 1 / 10GbE shows up only for blocks >= 4k
■ Better latency on IB can only be achieved with rdma_write / rdma_write_poll
■ Bandwidth on 1 / 10GbE is very stable at the possible maximum
■ The 40GbE implementation (MLX) shows unexpected fluctuation
■ Under IB the maximum transfer rate can only be achieved with RDMA
> Use sockets over RDMA; options without big code changes are: SDP, rsocket, SMC-R
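qperf's tcp_lat reports latency derived from a TCP ping-pong. Here is a minimal Python sketch of the same idea (loopback address and port are arbitrary assumptions), useful for sanity-checking an interface before running qperf itself:

```python
import socket
import threading
import time

HOST, PORT, MSG_SIZE, ROUNDS = "127.0.0.1", 19765, 128, 10_000

def echo_server() -> None:
    """Echo every received message back to the sender (the qperf server role)."""
    with socket.create_server((HOST, PORT)) as srv:
        conn, _ = srv.accept()
        with conn:
            while data := conn.recv(MSG_SIZE):
                conn.sendall(data)

threading.Thread(target=echo_server, daemon=True).start()
time.sleep(0.2)                                  # give the server time to listen

with socket.create_connection((HOST, PORT)) as c:
    c.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    payload = b"x" * MSG_SIZE
    start = time.perf_counter()
    for _ in range(ROUNDS):
        c.sendall(payload)
        c.recv(MSG_SIZE)                         # wait for the echo
    rtt = (time.perf_counter() - start) / ROUNDS
    print(f"avg one-way latency ~ {rtt / 2 * 1e6:.1f} us for {MSG_SIZE}-byte messages")
```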
OSD-FS: xfs | OSD object size: 4m | Journal: SSD | 56 Gb IPoIB | OSD-DISK: SSD
[Chart: IOPS across block sizes for ceph-ko, ceph-fuse and rbd-ko]
OSD-FS: xfs | OSD object size: 4m | Journal: SSD | 56 Gb IPoIB_CM | OSD-DISK: SSD
[Chart: bandwidth (MB/s) across block sizes for ceph-ko, ceph-fuse, rbd-ko and rbd-wrap]
OSD-FS: xfs | OSD object size: 4m | 56 Gb IPoIB_CM | OSD-DISK: SSD
■ ceph-fuse seems not to support multiple I/Os in parallel with O_DIRECT, which drops the performance significantly
■ RBD-wrap (= rbd in user land) shows some advantages on IOPS, but not enough to replace rbd.ko
■ ceph.ko is excellent on sequential read IOPS, presumably because of the read(-ahead) of complete 4m blocks
> Stay with the official interfaces ceph.ko / rbd.ko to give fio the needed access to file and block (see the O_DIRECT sketch below)
> rbd.ko has some room for performance improvement
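To illustrate the O_DIRECT point: direct I/O needs block-aligned buffers and offsets, and the benchmark keeps many such I/Os in flight at once. A minimal Linux-only Python sketch (file path and sizes are assumptions) of parallel 4k direct reads:

```python
import mmap
import os
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4096                      # O_DIRECT needs block-aligned offsets and buffers
PATH = "/mnt/cephfs/testfile"     # assumed test file on a ceph.ko mount

def read_direct(offset: int) -> bytes:
    """Read one aligned block with O_DIRECT (bypassing the page cache)."""
    fd = os.open(PATH, os.O_RDONLY | os.O_DIRECT)
    try:
        buf = mmap.mmap(-1, BLOCK)        # anonymous mmap -> page-aligned buffer
        os.preadv(fd, [buf], offset)
        return bytes(buf)
    finally:
        os.close(fd)

# Keep many direct I/Os in flight at once -- this is what ceph-fuse appears
# to serialize, while ceph.ko / rbd.ko handle it well.
with ThreadPoolExecutor(max_workers=16) as pool:
    blocks = list(pool.map(read_direct, range(0, 64 * BLOCK, BLOCK)))
print(f"read {len(blocks)} blocks of {BLOCK} bytes")
```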
ceph-ko / rbd-ko | OSD object size: 4m | Journal: SSD | 56 Gb IPoIB_CM | OSD-DISK: SSD
[Chart: IOPS across block sizes for xfs ceph-ko, btrfs ceph-ko, xfs rbd-ko and btrfs rbd-ko]
ceph-ko / rbd-ko | OSD object size: 4m | Journal: SSD | 56 Gb IPoIB_CM | OSD-DISK: SSD
■ 6 months ago, with kernel 3.0.x, btrfs revealed some weaknesses in writes
■ With kernel 3.6 some essential enhancements were made to the btrfs code, so almost no differences could be identified with our kernel 3.8.13
> Use btrfs in the next test cases, because btrfs has the more promising storage features: compression, data deduplication
OSD-FS: btrfs | Journal: RAM | 56 Gb IPoIB_CM | OSD-DISK: SSD
[Chart: IOPS and bandwidth (MB/s) across block sizes for OSD object sizes 64k vs. 4m]
OSD-FS: btrfs | Journal: RAM | 56 Gb IPoIB_CM | OSD-DISK: SSD
■ Small chunks of 64k can especially increase 4k/8k sequential IOPS, for reads as well as for writes
■ For random IO, 4m chunks are as good as 64k chunks
■ But the usage of 64k chunks will result in very low bandwidth for 4m/8m blocks
> Create each volume with the appropriate OSD object size: ~64k if small sequential IOPS dominate, otherwise stay with ~4m (see the sketch below)
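As a sketch of how that recommendation could be applied per volume: RBD images take their object size at creation time via the `--order` flag (object size = 2^order bytes, so order 16 = 64k and order 22 = 4m). The wrapper below shells out to the `rbd` CLI; pool and image names are assumptions, and the exact flags should be checked against your rbd release.

```python
import subprocess

def create_rbd_image(pool: str, name: str, size_mb: int, small_sequential_iops: bool) -> None:
    """Create an RBD image whose object size matches the expected workload."""
    order = 16 if small_sequential_iops else 22      # 2**16 = 64k, 2**22 = 4m objects
    subprocess.run(
        ["rbd", "create", f"{pool}/{name}",
         "--size", str(size_mb),                     # image size in MB
         "--order", str(order)],
        check=True,
    )

# Hypothetical examples: a log volume with small sequential I/O vs. a streaming volume.
create_rbd_image("rbd", "log-vol", 10_240, small_sequential_iops=True)
create_rbd_image("rbd", "stream-vol", 102_400, small_sequential_iops=False)
```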
[Chart: IOPS across block sizes for the interconnect comparison summarized below]
■ In case of write IOPS, 1GbE is doing extremely well
■ On sequential read IOPS there is nearly no difference between 10GbE and 56Gb IPoIB
■ On the bandwidth side, read performance is closely in sync with the possible speed of the network; only the TrueScale IB has some weaknesses, because it was designed for HPC and not for storage/streaming
> If you only look for IOPS, 10GbE is a good choice
> If throughput is relevant for your use case, you should go for 56Gb IB
ceph-ko / rbd-ko | OSD-FS: btrfs | Journal: SSD | 56 Gb IPoIB_CM
[Chart: IOPS across block sizes for SAS ceph-ko, SSD ceph-ko, SAS rbd-ko and SSD rbd-ko]
■ Ceph is the most comprehensive implementation of Unified Storage. Ceph simulates "distributed swarm intelligence", which arises from simple rules that are followed by individual processes and does not involve any central coordination.
■ CRUSH, a deterministic pseudo-random hash-like function, distributes data uniformly among block devices.
■ The usage of TCP/IP will slow down the latency capabilities of InfiniBand, but the better bandwidth mostly remains. Datagram mode (DG) has some advantage for small blocks, but overall connected mode (CM) is the better compromise.
■ Only an optimal setting of the block device parameters in Linux will ensure that the maximum performance is obtained from the SSDs.
■ 2.5" 10k 6G SAS drives are an attractive alternative for high performance in combination with SSDs for the journal.
Conclusion and Outlook
> Only with sockets over RDMA can the better bandwidth and lower latency of InfiniBand be utilized; options are: SDP, rsocket, SMC-R (see the sketch below)
> The Ceph code has a lot of room for improvement to achieve lower latency
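Of the three options, rsocket is typically the least invasive, because librdmacm ships a preload library that intercepts the socket calls of an unmodified binary. A hedged sketch of that approach (the library path varies by distribution and is an assumption here):

```python
import os
import subprocess

def run_with_rsockets(cmd: list[str]) -> int:
    """Run an unmodified TCP application with its socket calls redirected to rsockets."""
    env = dict(os.environ)
    # Path is distribution-dependent; adjust to wherever librdmacm installs it.
    env["LD_PRELOAD"] = "/usr/lib64/rsocket/librspreload.so"
    return subprocess.run(cmd, env=env).returncode

# Example: start a qperf server over rsockets instead of plain TCP sockets.
run_with_rsockets(["qperf"])
```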
■ iRMC for OOB-Mgmt
■ 2x 1GbE onboard for admin
■ Infiniband: 2x 40/56Gb IB interconnect between nodes
■ RAM (~768GB / 1536GB)
■ 10GbE as front-end interface (optionally also Infiniband)
■ 16x 2.5" SAS/SATA
Configuration options per node (see the capacity check below):
■ 16x 900GB = 14TB SAS, or 16x 1TB SATA = 16TB SATA
■ 4x 800GB = 3.2TB PCIe-SSD
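A quick worked check of those raw numbers and of what they yield as usable capacity once a replica factor from the Distributed Redundant Storage slide is applied (node count and replica value here are illustrative assumptions):

```python
TB = 1000                                  # GB per (decimal) TB

raw_sas_gb  = 16 * 900                     # 14400 GB ~= 14 TB per node
raw_sata_gb = 16 * 1000                    # 16000 GB  = 16 TB per node
raw_pcie_gb = 4 * 800                      #  3200 GB  = 3.2 TB per node

nodes, replica = 4, 3                      # illustrative cluster
usable_sas_tb = raw_sas_gb * nodes / replica / TB
print(f"raw SAS per node: {raw_sas_gb / TB:.1f} TB")
print(f"usable SAS capacity ({nodes} nodes, replica={replica}): {usable_sas_tb:.1f} TB")
```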
Architecture Vision Unified Storage
[Diagram: VMs / apps consuming the same storage pool through file (NFS, CIFS), object (S3, Swift, KVS) and block (FC, FCoE, iSCSI) front-ends]