DieterKasper_Transforming_PCIe-SSDs_with_IB_into_Scalable_Enterprise_Storage_v6.pdf


About This Presentation

Software-defined storage: Transforming PCIe SSDs with IB into Scalable Enterprise Storage


Slide Content

Transforming PCIe-SSDs and HDDs with
InfiniBand into Scalable Enterprise Storage

Dieter Kasper
Fujitsu

Agenda

- Introduction
- Hardware / Software layout
- Tools for monitoring
- Transformation test cases
- Conclusion

Challenges of a Storage Subsystem

- Transparency
  - User has the impression of a single global File System / Storage Space
- Scalable Performance, Elasticity
  - No degradation of performance as the number of users or volume of data increases
  - Intelligent rebalancing on capacity enhancements
  - Offer the same high performance for all volumes
- Availability, Reliability and Consistency
  - User can access the same file system / block storage from different locations at the same time
  - User can access the file system at any time
  - Highest MTTDL (mean time to data loss)
- Fault tolerance
  - System can identify and recover from failures
  - Lowest degradation during rebuild time
  - Shortest rebuild times
- Manageability & ease of use

Conventional vs. distributed model

[Diagram: access network feeding a dual-controller array vs. an interconnected cluster of nodes]

Conventional (dual controller):
- Dual redundancy
- 1 node/controller can fail
- 2 drives can fail (RAID 6)
- HA only inside a pair

Distributed:
- N-way resilience, e.g. 3 nodes can fail
- 3 drives can fail
- Nodes coordinate replication and recovery

Conventional vs. distributed model

[Chart: performance per volume, conventional vs. scale-out]

- While traditional storage systems distribute volumes across sub-sets of spindles, scale-out systems use algorithms to distribute volumes across all/many spindles and provide maximum utilization of all system resources
=> Offer the same high performance for all volumes and shortest rebuild times

A model for dynamic "clouds" in nature

[Image slide]

Distributed intelligence

- Swarm intelligence [Wikipedia]
  - (SI) is the collective behavior of decentralized, self-organized systems, natural or artificial.
- Swarm behavior [Wikipedia]
  - Swarm behavior, or swarming, is a collective behavior exhibited by animals of similar size (...) moving en masse or migrating in some direction.
  - From a more abstract point of view, swarm behavior is the collective motion of a large number of self-propelled entities.
  - From the perspective of the mathematical modeler, it is an (...) behavior arising from simple rules that are followed by individuals and does not involve any central coordination.

Ceph Key Design Goals

- The system is inherently dynamic:
  - Decouples data and metadata
  - Eliminates object lists for naming and lookup by a hash-like distribution function - CRUSH (Controlled Replication Under Scalable Hashing)
  - Delegates responsibility for data migration, replication, failure detection and recovery to the OSD (Object Storage Daemon) cluster
- Node failures are the norm, rather than an exception
  - Changes in the storage cluster size (up to 10k nodes) cause automatic and fast failure recovery and rebalancing of data with no interruption
- The characteristics of workloads are constantly shifting over time
  - The hierarchy is dynamically redistributed over hundreds of MDSs (Meta Data Services) by Dynamic Subtree Partitioning with near-linear scalability
- The system is inevitably built incrementally
  - FS can be seamlessly expanded by simply adding storage nodes (OSDs)
  - Proactively migrates data to new devices -> balanced distribution of data
  - Utilizes all available disk bandwidth and avoids data hot spots

Unified Storage for Cloud based on Ceph - Architecture and Principles

[Diagram: Host/VM and Files & Dirs on top of the Ceph stack - a REST gateway compatible with S3 and Swift, a block device (RBD) with a Linux kernel client and QEMU/KVM driver, and a file system with a Linux kernel client and support for FUSE]

The Ceph difference: Ceph's CRUSH algorithm liberates storage clusters from the scalability and performance limitations imposed by centralized data table mapping. It replicates and re-balances data within the cluster dynamically - eliminating this tedious task for administrators, while delivering high performance and infinite scalability.

http://www.inktank.com

Architecture: Ceph + RADOS (1)

- Clients
  - Standard interface to use the RADOS data (POSIX, Device, S3)
  - Transparent for applications
- Metadata Server Cluster (MDSs)
  - Namespace management
  - Metadata operations (open, stat, rename, ...)
  - Ensure security
- Object Storage Cluster (OSDs)
  - Stores all data and metadata
  - Organizes data into flexible-sized containers, called objects

Architecture: Ceph + RADOS (2)

- MONs (monitors)
  - 1s-10s, Paxos
  - Lightweight process
  - Authentication, cluster membership, critical cluster state
- OSDs
  - 10s-10,000s
- Ceph Clients
  - Zillions
  - Authenticate with monitors, talk directly to ceph-osds
- MDSs
  - 1s-10s
  - Build POSIX file system on top of objects
  - Smart, coordinate with peers

Data placement with CRUSH

- Files/bdevs striped over objects (File / Block)
  - 4 MB objects by default
- Objects mapped to placement groups (PGs)
  - pgid = hash(object) & mask
- PGs mapped to sets of OSDs (grouped by failure domain)
  - crush(cluster, rule, pgid) = [osd2, osd3]
  - Pseudo-random, statistically uniform distribution
  - ~100 PGs per node
- Fast: O(log n) calculation, no lookups
- Reliable: replicas span failure domains
- Stable: adding/removing OSDs moves few PGs

CRUSH is a deterministic, pseudo-random, hash-like function that distributes data uniformly among OSDs. It relies on a compact cluster description to map data to new storage targets w/o consulting a central allocator.
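
This object -> PG -> OSD mapping can also be inspected on a running cluster. A minimal sketch, assuming a pool named rbd, a hypothetical object name, and the ceph osd map / ceph osd pool get subcommands being available in this Ceph release:

# Ask the cluster where CRUSH places a given object:
# object name -> placement group (PG) -> set of OSDs
ceph osd map rbd my-test-object

# Number of PGs in the pool (rule of thumb above: ~100 PGs per node)
ceph osd pool get rbd pg_num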

Ceph principles

[Diagram: VMs, apps and clients accessing the cluster through librbd / block and object interfaces]

Distributed Redundant Storage
- Intelligent data distribution across all nodes and spindles = wide striping (64KB - 16MB)
- Redundancy with replica=2, 3 ... 8
- Thin provisioning
- Fast distributed rebuild
- Availability, fault tolerance
  - Disk, node, interconnect
  - Automatic rebuild
  - Distributed HotSpare space
- Transparent block, file access
- Reliability and consistency
- Scalable performance
- Pure PCIe-SSD for extreme transaction processing

Ceph processes

[Diagram: client processes and storage processes distributed across the nodes; the "Distributed Redundant Storage" summary from the previous slide repeats here]

Agenda

- Introduction
- Hardware / Software layout
- Tools for monitoring
- Transformation test cases
- Conclusion

Hardware test configuration

[Diagram: fio on rx37-1 driving RBD | CephFS]

- rx37-[3-8]: Fujitsu server
  - 2x Intel(R) Xeon(R) E5-2630 @ 2.30GHz
  - 128GB RAM
  - 2x 1GbE onboard
  - 2x 10GbE Intel 82599EB
  - 2x 40GbIB Intel TrueScale IBA7322 QDR InfiniBand HCA
  - 2x 56GbIB Mellanox MT27500 Family (configurable as 40GbE, too)
  - 3x Intel PCIe-SSD 910 Series 800GB
  - 16x SAS 6G 300GB HDD through LSI MegaRAID SAS 2108 [Liberator]
- rx37-[1-2]: same as above, but
  - 1x Intel(R) Xeon(R) E5-2630 @ 2.30GHz
  - 64GB RAM
  - No SSDs, 2x SAS drives

Which parameters to change and tune

[Diagram: fio on rx37-1 and rx37-2 driving RBD]

- Frontend interface: ceph.ko, rbd.ko, ceph-fuse, rbd-wrapper in user land
- OSD object size of data: 64k, 4m
- Block device options for /dev/rbdX, /dev/sdY: scheduler, rq_affinity, rotational, read_ahead_kb
- Interconnect: 1 / 10 / 40 GbE, 40 / 56 GbIB CM/DG
- Network parameters
- Journal: RAM-disk, SSD
- OSD file system: xfs, btrfs
- OSD disk type: SAS, SSD

Software test configuration

- CentOS 6.4 with vanilla kernel 3.8.13
- fio-2.0.13
- ceph version 0.61.7 cuttlefish

Network interfaces:
- 2x 1GbE:            <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
- 2x 10GbE:           <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
- 40Gb IB (ib0):      <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520
- 56Gb IB (ib1, ib2): <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520
- 40GbE (eth4, eth5): <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216

Block devices:
- 16x disk LSI RAID 5/6 SAS 6G (fw 2.12): /dev/sda ... /dev/sdp
- 12x disk INTEL(R) SSD 910 200GB: /dev/sdq ... /dev/sdab (each 800GB card exposes 4x 200GB LUNs)

# fdisk -l /dev/sdq
Device      Boot  Start   End     Blocks      Id  System
/dev/sdq1         1       22754   182767616   83  Linux
/dev/sdq2         22754   24322   12591320    f   Ext'd
/dev/sdq5         22755   23277   4194304     83  Linux
/dev/sdq6         23277   23799   4194304     83  Linux
/dev/sdq7         23799   24322   4194304     83  Linux
#--- 3x journals on each SSD

Agenda

- Introduction
- Hardware / Software layout
- Tools for monitoring
- Transformation test cases
- Conclusion

Intel PCIe-SSD 910 data sheet

- IOPS random rd/wr 4k: 180k/75k (queue depth 32 per NAND module)
- Bandwidth rd/wr 128k: 2/1 GB/s (queue depth 32 per NAND module)
- Latency rd 512 / wr 4k seq: <65us (queue depth 1 per NAND module)

Recommended settings in Linux:
- rq_affinity 1
- scheduler noop
- rotational 0
- read_ahead_kb 0
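
These are per-device attributes that can be applied through sysfs. A minimal sketch; the device name is a placeholder for one of the SSD 910 LUNs listed above (e.g. /dev/sdq):

DEV=sdq                                          # placeholder: one SSD 910 LUN
echo 1    > /sys/block/$DEV/queue/rq_affinity    # complete I/O on the submitting CPU group
echo noop > /sys/block/$DEV/queue/scheduler      # no elevator reordering needed for flash
echo 0    > /sys/block/$DEV/queue/rotational     # mark the device as non-rotational
echo 0    > /sys/block/$DEV/queue/read_ahead_kb  # disable read-ahead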

Intel(R) SSD Data Center Tool (isdct)

Supported log pages:
  Supported Page List                      |  0 (0x00)
  Write Error Counter                      |  2 (0x02)
  Read Error Counter                       |  3 (0x03)
  Verify Error Counter                     |  5 (0x05)
  Non-medium Error Counter                 |  6 (0x06)
  Temperature                              | 13 (0x0D)
  Manufacturing Date Information           | 14 (0x0E)
  Application Client Log                   | 15 (0x0F)
  Self Test Results                        | 16 (0x10)
  Solid State Media                        | 17 (0x11)
  Background Scan Medium Operation         | 21 (0x15)
  Protocol Specific Log Parameter          | 24 (0x18)
  Link Status                              | 26 (0x1A)
  SMART Status and Temperature Reading     | 47 (0x2F)
  Vendor Specific                          | 48 (0x30)
  Misc Data Counters                       | 55 (0x37)

# isdct -device 1 -drive 3 -log 0x11 | grep end
  Percentage used endurance indicator      |  1 (0x01)
# isdct -device 1 -drive 3 -log 0xD | grep Cel
  Temperature (Degrees Celsius)            | 28 (0x1C)
  Reference Temperature (Degrees Celsius)  | 85 (0x55)

Infiniband

# ibstatus | egrep 'Infi|stat|rate' | grep -v link_
Infiniband device 'mlx4_0' port 1 status
    state:      4: ACTIVE
    phys state: 5: LinkUp
    rate:       56 Gb/sec (4X FDR)
Infiniband device 'mlx4_0' port 2 status
    state:      4: ACTIVE
    phys state: 5: LinkUp
    rate:       56 Gb/sec (4X FDR)
Infiniband device 'qib0' port 1 status
    state:      4: ACTIVE
    phys state: 5: LinkUp
    rate:       40 Gb/sec (4X QDR)

# ibhosts -C qib0 -P 1
Ca 0x00117500005a6aea ports 1 "rx37-8 qib0"
Ca 0x0011750000783984 ports 1 "rx37-7 qib0"
Ca 0x001175000078405e ports 1 "rx37-1 qib0"
Ca 0x00117500005a6ad2 ports 1 "rx37-3 qib0"
Ca 0x001175000077f6ec ports 1 "rx37-4 qib0"
Ca 0x001175000077740e ports 1 "rx37-5 qib0"
Ca 0x0011750000789c9e ports 1 "rx37-6 qib0"
Ca 0x00117500005a6a32 ports 1 "rx37-2 qib0"

Further tools: # ibv_devinfo, # iblinkinfo -R, # perfquery -C qib0 -P 1, # ibdiagnet -p 1

# iblinkinfo (excerpt): each node's mlx4_0 ports link at 4X 14.0625 Gbps (FDR, Active/LinkUp) to the Mellanox SX6036 switch "switch-b79e58"; unused switch ports show Down/Polling.
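
Because the later test cases distinguish IPoIB connected mode (CM) from datagram mode (DG), it is also worth checking the mode and MTU per IPoIB interface. A small sketch for the ib interfaces above (interface names assumed; switching the mode may require the interface to be idle):

# 'connected' (CM, MTU up to 65520) or 'datagram' (DG, small default MTU)
cat /sys/class/net/ib0/mode
cat /sys/class/net/ib1/mode

# switch an interface to connected mode and raise the MTU accordingly
echo connected > /sys/class/net/ib1/mode
ip link set ib1 mtu 65520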

Agenda

- Introduction
- Hardware / Software layout
- Tools for monitoring
- Transformation test cases
- Conclusion

Performance test tools

- qperf $IP_ADDR -t 10 -oo msg_size:1:8M:*2 -v tcp_lat tcp_bw rc_rdma_write_lat rc_rdma_write_bw rc_rdma_write_poll_lat
- fio --filename=$RBD | --directory=$MDIR --direct=1 --rw=$io --bs=$bs --size=10G --numjobs=$threads --runtime=60 --group_reporting --name=file1 --output=fio_${io}_${bs}_${threads}
  - RBD=/dev/rbdX, MDIR=/cephfs/fio-test-dir
  - io=write,randwrite,read,randread
  - bs=4k,8k,4m,8m
  - threads=1,64,128
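
For reference, the fio parameter sweep above can be expressed as a small script. This loop is only a sketch (it is not from the original deck), and the qperf client shown above needs a plain "qperf" started on the remote host as the server side:

RBD=/dev/rbd0   # placeholder: mapped RBD image (or use --directory=/cephfs/fio-test-dir)
for io in write randwrite read randread; do
  for bs in 4k 8k 4m 8m; do
    for threads in 1 64 128; do
      fio --filename=$RBD --direct=1 --rw=$io --bs=$bs --size=10G \
          --numjobs=$threads --runtime=60 --group_reporting --name=file1 \
          --output=fio_${io}_${bs}_${threads}
    done
  done
done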

qperf - latency

[Chart: tcp_lat in us vs. message size, for 1GbE, 10GbE, 40GbE (MTU 1500), 40GbE (MTU 9216), 40Gb IPoIB_CM, 56Gb IPoIB_CM, 56Gb IB rdma_write_lat and 56Gb IB rdma_write_poll_lat]

qperf - bandwidth

[Chart: tcp_bw in MB/s vs. message size, for 1GbE, 10GbE, 40GbE (MTU 1500), 40GbE (MTU 9216), 40Gb IPoIB_CM, 56Gb IPoIB_CM and 56Gb IB rdma_write]

qperf - explanation

- tcp_lat almost the same for blocks <= 128 bytes
- tcp_lat very similar between 40GbE, 40Gb IPoIB and 56Gb IPoIB
- Significant difference between 1 / 10GbE only for blocks >= 4k
- Better latency on IB can only be achieved with rdma_write / rdma_write_poll

- Bandwidth on 1 / 10GbE very stable at the possible maximum
- 40GbE implementation (MLX) shows unexpected fluctuation
- Under IB the maximum transfer rate can only be achieved with RDMA

=> Use sockets over RDMA; options without big code changes are: SDP, rsocket, SMC-R
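
Of these options, rsocket can be tried without touching the application by preloading the rsocket library that ships with librdmacm. A hedged sketch only: the library path is distribution-dependent, and the qperf invocation is just an example client:

# run an unmodified sockets benchmark over RDMA via the rsocket preload library
LD_PRELOAD=/usr/lib64/rsocket/librspreload.so qperf $IP_ADDR -t 10 tcp_lat tcp_bw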

Test case - frontend interfaces | OSD-FS: xfs | OSD object size: 4m | Journal: SSD | Interconnect: 56 Gb IPoIB | OSD-DISK: SSD

[Chart: IOPS for ceph-ko, ceph-fuse, rbd-ko and rbd-wrap across the fio workloads]

Test case - frontend interfaces | OSD-FS: xfs | OSD object size: 4m | Journal: SSD | Interconnect: 56 Gb IPoIB_CM | OSD-DISK: SSD

[Chart: bandwidth in MB/s for ceph-ko, ceph-fuse, rbd-ko and rbd-wrap across the fio workloads]

Test case - frontend interfaces | OSD-FS: xfs | OSD object size: 4m | Journal: SSD | Interconnect: 56 Gb IPoIB_CM | OSD-DISK: SSD

- ceph-fuse seems not to support multiple I/Os in parallel with O_DIRECT, which drops the performance significantly
- rbd-wrap (= rbd in user land) shows some advantages on IOPS, but not enough to replace rbd.ko
- ceph.ko is excellent on sequential read IOPS, presumably because of the read(-ahead) of complete 4m objects

=> Stay with the official interfaces ceph.ko / rbd.ko to give fio the needed access to File and Block
=> rbd.ko has some room for performance improvement
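
For completeness, this is how the two kernel interfaces are attached on the client. A minimal sketch with placeholder names (image fio-vol in pool rbd, monitor host mon-host):

# block access via rbd.ko: map an image to a /dev/rbdX device for fio --filename=
rbd map fio-vol --pool rbd            # appears e.g. as /dev/rbd0

# file access via ceph.ko: mount CephFS for fio --directory=
mount -t ceph mon-host:6789:/ /cephfs -o name=admin,secretfile=/etc/ceph/admin.secret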

Test case - ceph-ko / rbd-ko | OSD-FS: xfs vs. btrfs | OSD object size: 4m | Journal: SSD | Interconnect: 56 Gb IPoIB_CM | OSD-DISK: SSD

[Chart: IOPS for xfs ceph-ko, btrfs ceph-ko, xfs rbd-ko and btrfs rbd-ko across the fio workloads]

Test case - ceph-ko / rbd-ko | OSD-FS: xfs vs. btrfs | OSD object size: 4m | Journal: SSD | Interconnect: 56 Gb IPoIB_CM | OSD-DISK: SSD

[Chart: bandwidth in MB/s for xfs ceph-ko, btrfs ceph-ko, xfs rbd-ko and btrfs rbd-ko across the fio workloads]

Test case - ceph-ko / rbd-ko | OSD-FS: xfs vs. btrfs | OSD object size: 4m | Journal: SSD | Interconnect: 56 Gb IPoIB_CM | OSD-DISK: SSD

- 6 months ago, with kernel 3.0.x, btrfs revealed some weaknesses in writes
- With kernel 3.6 some essential enhancements were made to the btrfs code, so almost no differences could be identified with our kernel 3.8.13

=> Use btrfs in the next test cases, because btrfs has the more promising storage features: compression, data deduplication
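
As an illustration of those features (not necessarily the configuration benchmarked here), an OSD data disk could be formatted with btrfs and mounted with transparent compression enabled; device and mount point are placeholders:

mkfs.btrfs -L ceph-osd-0 /dev/sdb                    # placeholder OSD data disk
mkdir -p /var/lib/ceph/osd/ceph-0
mount -o noatime,compress=lzo /dev/sdb /var/lib/ceph/osd/ceph-0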

Test case - OSD object size 64k vs. 4m | OSD-FS: btrfs | Journal: RAM | Interconnect: 56 Gb IPoIB_CM | OSD-DISK: SSD

[Charts: IOPS and bandwidth in MB/s for 64k vs. 4m OSD object size across the fio workloads]

Test case - OSD object size 64k vs. 4m | OSD-FS: btrfs | Journal: RAM | Interconnect: 56 Gb IPoIB_CM | OSD-DISK: SSD

- Small chunks of 64k can especially increase 4k/8k sequential IOPS, for reads as well as for writes
- For random IO, 4m chunks are as good as 64k chunks
- But the usage of 64k chunks will result in very low bandwidth for 4m/8m blocks

=> Create each volume with the appropriate OSD object size: ~64k if small sequential IOPS are needed, otherwise stay with ~4m
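
With RBD the object size is fixed per image at creation time through the --order parameter (log2 of the object size in bytes). A short sketch with placeholder image names:

rbd create small-io-vol --size 10240 --order 16   # 64 KiB objects for small sequential IOPS
rbd create streaming-vol --size 10240 --order 22  # 4 MiB objects (the default) for bandwidth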


Test case - Journal RAM vs. SSD | ceph-ko / rbd-ko | OSD-FS: btrfs | OSD object size: 4m | Interconnect: 56 Gb IPoIB_CM | OSD-DISK: SSD

[Chart: IOPS for j-RAM ceph-ko, j-SSD ceph-ko, j-RAM rbd-ko and j-SSD rbd-ko across the fio workloads]

Test case - Journal RAM vs. SSD | ceph-ko / rbd-ko | OSD-FS: btrfs | OSD object size: 4m | Interconnect: 56 Gb IPoIB_CM | OSD-DISK: SSD

[Chart: bandwidth in MB/s for j-RAM ceph-ko, j-SSD ceph-ko, j-RAM rbd-ko and j-SSD rbd-ko across the fio workloads]

Test case - Journal RAM vs. SSD | ceph-ko / rbd-ko | OSD-FS: btrfs | OSD object size: 4m | Interconnect: 56 Gb IPoIB_CM | OSD-DISK: SSD

- Only on heavy write-bandwidth requests can RAM as a journal medium show its predominance
- Another reason might be the double burden the SSD has to carry when both the journal and the data are written to it
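
The journal location is a per-OSD setting in ceph.conf. A minimal sketch of the two variants compared here (paths are placeholders; a RAM-disk journal is volatile and only suitable for benchmarking):

cat >> /etc/ceph/ceph.conf <<'EOF'
[osd.0]
    # journal on one of the 4 GiB SSD partitions (j-SSD runs)
    osd journal = /dev/sdq5
    # alternative for the j-RAM runs: journal file on a RAM disk (volatile!)
    # osd journal = /mnt/ram/osd.0.journal
    # osd journal size = 4096
EOF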

Test case - interconnect comparison | ceph-ko / rbd-ko | OSD-FS: btrfs | OSD object size: 4m | Journal: SSD | OSD-DISK: SSD

[Chart: IOPS for 1GbE, 10GbE, 40GbE, 40GbIB_CM and 56GbIB_CM across the fio workloads]

Test case - interconnect comparison | ceph-ko / rbd-ko | OSD-FS: btrfs | OSD object size: 4m | Journal: SSD | OSD-DISK: SSD

[Chart: bandwidth in MB/s for 1GbE, 10GbE, 40GbIB_CM and 56GbIB_CM across the fio workloads]

Test case - interconnect comparison | OSD-FS: btrfs | OSD object size: 4m | Journal: SSD | OSD-DISK: SSD

- In the case of write IOPS, 1GbE is doing extremely well
- On sequential read IOPS there is nearly no difference between 10GbE and 56Gb IPoIB
- On the bandwidth side with reads, the measured performance is closely in line with the possible speed of the network. Only the TrueScale IB has some weaknesses, because it was designed for HPC and not for storage/streaming.

=> If you only look at IOPS, 10GbE is a good choice
=> If throughput is relevant for your use case, you should go for 56Gb IB

Test case - OSD-DISK SAS vs. SSD | ceph-ko / rbd-ko | OSD-FS: btrfs | OSD object size: 4m | Journal: SSD | Interconnect: 56 Gb IPoIB_CM

[Chart: IOPS for SAS ceph-ko, SSD ceph-ko, SAS rbd-ko and SSD rbd-ko across the fio workloads]

Test case - OSD-DISK SAS vs. SSD | ceph-ko / rbd-ko | OSD-FS: btrfs | OSD object size: 4m | Journal: SSD | Interconnect: 56 Gb IPoIB_CM

[Chart: bandwidth in MB/s for SAS ceph-ko, SSD ceph-ko, SAS rbd-ko and SSD rbd-ko across the fio workloads]

Test case - SAS vs. SSD with 4m objects over IPoIB CM vs. 64k objects over IPoIB DG | OSD-FS: btrfs | Journal: SSD | Interconnect: 56 Gb IPoIB

[Chart: IOPS for SAS 4m CM, SSD 4m CM, SAS 64k DG and SSD 64k DG across the fio workloads]

Test case - SAS vs. SSD with 4m objects over IPoIB CM vs. 64k objects over IPoIB DG | OSD-FS: btrfs | Journal: SSD | Interconnect: 56 Gb IPoIB

[Chart: bandwidth in MB/s for SAS 4m CM, SSD 4m CM, SAS 64k DG and SSD 64k DG across the fio workloads]

Test case - ceph-ko / rbd-ko | OSD-FS: btrfs | OSD object size: 64k/4m | Journal: SSD | Interconnect: 56 Gb IPoIB CM/DG

- SAS drives are doing much better than expected. Only on small random writes are they significantly slower than SSD
- In this comparison it has to be taken into account that the SSD is getting twice as many writes (journal + data)

=> 2.5" 10k 6G SAS drives seem to be an attractive alternative in combination with SSDs for the journal

Calculation of a single IO TAT

fio time = avg latency of one IO (queue-depth=1), with 5x ACK

rbd, usec    | fio write 4k | fio write 8k | Network 4k | Network 8k | fixed 4k | fixed 8k | ACK | Ceph code 4k | Ceph code 8k
1 GbE        | 2565         | 2709         | 182        | 227        | 54       | 64       | 26  | 2017         | 2061
10 GbE       | 2555         | 2584         | 109        | 122        | 54       | 64       | 21  | 2178         | 2171
40 GbE       | 2191         | 2142         | 19         | 22         | 54       | 64       | 15  | 2024         | 1959
40 Gb IPoIB  | 2392         | 2357         | 29         | 24         | 54       | 64       | 18  | 2190         | 2155
56 Gb IPoIB  | 1848         | 1821         | 19         | 37         | 54       | 64       | 14  | 1686         | 1613

The columns decompose the measured fio latency as fio ~= 2 x Network + 5 x ACK + fixed component + Ceph code; e.g. for 56 Gb IPoIB at 4k: 2x19 + 5x14 + 54 + 1686 = 1848 usec.

- Approximately 1600 usec of a single 4k/8k IO is spent in the Ceph code
=> The Ceph code has a lot of room for improvement

Agenda

- Introduction
- Hardware / Software layout
- Tools for monitoring
- Transformation test cases
- Conclusion

Summary and conclusion

- Ceph is the most comprehensive implementation of Unified Storage. Ceph simulates a "distributed swarm intelligence" which arises from simple rules that are followed by individual processes and does not involve any central coordination.
- CRUSH, a deterministic pseudo-random hash-like function, distributes data uniformly among block devices.
- The usage of TCP/IP slows down the latency capabilities of InfiniBand, but the bandwidth advantage mostly remains. DG has some advantage for small blocks, but overall CM is the better compromise.
- Only an optimal setting of the block device parameters in Linux ensures that the maximum performance is extracted from the SSD.
- 2.5" 10k 6G SAS drives are an attractive alternative for high performance in combination with SSDs for the journal.

Conclusion and Outlook

=> Only with sockets over RDMA can the better bandwidth and lower latency of InfiniBand be utilized; options are: SDP, rsocket, SMC-R
=> The Ceph code has a lot of room for improvement to achieve lower latency

FUJITSU
Fujitsu Technology Solutions

Dieter.Kasper@ts.fujitsu.com

Storage Node RX300-S8

[Diagram: RX300-S8 server layout]

- iRMC for OOB management
- 2x 1GbE onboard for admin
- InfiniBand: 2x 40/56Gb IB interconnect between nodes
- 10GbE as front-end interface (optionally also InfiniBand)
- RAM (up to 768GB / 1536GB)
- 16x 2.5" SAS/SATA

Configuration options per node:
- 16x 900GB = 14TB SAS, or 16x 1TB SATA = 16TB SATA
- 4x 800GB = 3.6TB PCIe-SSD

Architecture Vision Unified Storage

[Diagram: VMs / Apps accessing the unified storage cluster over NFS, CIFS | S3, Swift, KVS | FCoE, iSCSI front-ends]