Webinar: Choosing the Right Shard Key for High Performance and Scale

MongoDB | 32 slides | Apr 21, 2016

About This Presentation

Read these webinar slides to learn how selecting the right shard key can future-proof your application.

The shard key that you select can impact the performance, capability, and functionality of your database.


Slide Content

Tales from the Field, Part Three: Choosing the Right Shard Key for High Performance and Scale
Ger Hartnett, Director of Technical Services (EMEA), MongoDB
@ghartnett #MongoDB

Or: Cautionary Tales
- Don't solve the wrong problems
- Bad schemas & shard keys hurt ops too

Before we start
- The main talk should take 30-35 minutes
- You can submit questions via the chat box; we'll answer as many as possible at the end
- We are recording and will send the slides on Friday
- This is the final webinar in a series of three

A quick poll: add a word to the chat to let me know your perspective
- You work in operations
- You work in development
- You have a MongoDB system in production
- You have contacted MongoDB Technical Services (support)
- You attended an earlier webinar in the series (part 1, part 2)

Stories
- We collect observations about common mistakes to share the experience of many
- Names have been changed to protect the (mostly) innocent
- No animals were harmed during the making of this presentation (but maybe some DBAs and engineers had light emotional scarring)
- While you might be new to MongoDB, we have deep experience that you can leverage

The Stories (part three today)
- Discovering a DR flaw during a data centre outage
- Complex documents, memory and an upgrade "surprise"
- Wild success "uncovers" the wrong shard key

Story #1: Quick Review

Story #1: Recovering from a disaster
- Prospect in the process of signing up for a subscription
- Called us late on a Friday: data centre power outage, with 30+ servers (11 shards) down
- When they started bringing up the first shard, the nodes crashed with data corruption
- 17TB of data, very little free disk space, JOURNALLING DISABLED!

Recovering each shard
- Start the secondary read-only
- Mount NFS storage for repair
- Repair the former primary node
- Iterative rsync to seed a secondary

Key takeaways for you
- If you are departing significantly from the standard configuration, check with us (e.g. if you think journalling is a bad idea)
- Two DCs in different buildings, on different flood plains, not in the path of the same storm (e.g. secondaries in AWS)
- DR/backups are useless if you haven't tested them
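A quick way to confirm how a running mongod was actually configured is sketched below; getCmdLineOpts is a standard server command, though the exact layout of the parsed output depends on whether a YAML config file or command-line flags were used.

    // Minimal sketch (mongo shell): check whether journalling is enabled.
    // With a YAML config the setting appears under parsed.storage.journal;
    // with command-line flags, inspect opts.argv instead.
    var opts = db.adminCommand({ getCmdLineOpts: 1 })
    printjson(opts.parsed.storage)    // expect journal: { enabled: true }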

Story #2: Complex documents, memory and an upgrade "surprise"
- Well-established ecommerce site selling diverse goods in 20+ countries
- After switching to WiredTiger in production, performance dropped; this is the opposite of what they were expecting

Product Catalog: Original Schema

    {
      _id: 375,
      en_US: { name: ..., description: ..., <etc...> },
      en_GB: { name: ..., description: ..., <etc...> },
      fr_FR: { name: ..., description: ..., <etc...> },
      de_DE: ...,
      de_CH: ...,
      <... and so on for other locales ...>,
      inventory: 423
    }

Key takeaways
- When doing a major version/storage-engine upgrade, test in staging with some proportion of production data/workload
- Sometimes putting everything into one document is counterproductive (one possible restructuring is sketched below)
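The slides do not show the replacement schema, but one common restructuring, splitting the per-locale content into its own documents so that reading one translation no longer drags every locale into memory, might look roughly like this (not the customer's actual fix):

    // One document per (product, locale) for the static catalog content:
    { _id: "375-en_US", productId: 375, locale: "en_US", name: ..., description: ... }
    { _id: "375-fr_FR", productId: 375, locale: "fr_FR", name: ..., description: ... }

    // A small, frequently-updated document for the volatile fields:
    { _id: 375, inventory: 423 }

Queries for one locale then touch one small document instead of the whole multi-locale product.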

Story #3: Wild success uncovers the wrong shard key
- Started out as the error "[Balancer] caught exception … tag ranges not valid for: db.coll"
- 11 shards; they had added 2 new shards to keep up with traffic; 400+ databases
- Lots of code changes ahead of the Superbowl
- Spotted slow 300+ second queries and decided to build some indexes without telling us
- Production went down

Adding Shards: 2 more shards…
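For reference, adding shards to a cluster uses the sh.addShard() helper; the replica-set and host names below are purely illustrative:

    // Illustrative only - real replica-set/host names will differ.
    sh.addShard("shard10rs/shard10-a.example.net:27017")
    sh.addShard("shard11rs/shard11-a.example.net:27017")
    sh.status()   // confirm the new shards are listed and chunks start to balance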

The “Golden Hammer” Tendency

Diagnosing the issues #1: the red-herring hunt begins
- Transparent Huge Pages enabled in production
- Chaotic call: 20 people talking at once, then in the middle of the call everything started working again
- Barrage of tickets and calls
- Connection storms

Using mtools to analyse logs: connection churn

Diagnosing the issues #2
- Got inconsistent and missing log files
- Discovered repeated scatter-gather queries returning the same results
- Secondary reads
- Heavy load on some shards and low disk space
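A query that omits the leading shard-key field cannot be routed to a single shard, so mongos broadcasts it to all of them. Using the field names from the schema shown on the next slide (collection name illustrative), the pattern looks roughly like this:

    // Assumed shard key: { "modified.date": 1, "modified.customerId": 1 }
    // Querying by customerId alone cannot be targeted, so it is scatter-gather:
    db.coll.find({ "modified.customerId": 314159 }).explain()
    // An explain plan that lists every shard (e.g. a SHARD_MERGE stage covering
    // all 11 shards) is the tell-tale sign of a scatter-gather query.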

Insert load on two shards (from Cloud Manager)

Diagnosing the issues #3
Shard key: a string with year/month & customer id

    {
      _id: ObjectId("4c4ba5e5e8aabf3"),
      count: 1025,
      changes: { … },
      modified: { date: "2015_02", customerId: 314159 }
    }
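The slide shows the document rather than the shardCollection call; based on the fields above, the key was presumably declared along these lines:

    // Presumed declaration (namespace and field names taken from the slide):
    sh.enableSharding("db")
    sh.shardCollection("db.coll", { "modified.date": 1, "modified.customerId": 1 })
    // Because the leading field is a coarse year/month string, all of a month's
    // inserts target a narrow range of chunks, concentrating load on a few shards.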

Diagnosing the issues #4
- First heard about a DDoS attack
- Missing tag ranges on some collections
- Stopped the balancer, which reduced system load from chunk moves
- Two clusters had a mongos each on the same server

Fixing the issues
- Script to fix the tag ranges
- Proposed a finer-granularity shard key, but this was not possible because of 30TB of data
- Moved mongos to dedicated servers
- Re-enabled the balancer for short windows with waitForDelete and secondaryThrottle (sketched below)
- Put together scripts to pre-split and move empty chunks to quiet shards, based on traffic from the month before
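The balancer-related steps translate roughly into the shell commands below; the window times, namespace and tag bounds are assumptions, while _secondaryThrottle and _waitForDelete are the documented settings in the config database:

    // Stop chunk migrations while diagnosing:
    sh.stopBalancer()

    // Re-enable with conservative migration behaviour and a short active window
    // (times are illustrative):
    db.getSiblingDB("config").settings.update(
      { _id: "balancer" },
      { $set: { _secondaryThrottle: true,
                _waitForDelete: true,
                activeWindow: { start: "02:00", stop: "04:00" } } },
      { upsert: true }
    )
    sh.startBalancer()

    // Repairing a missing tag range (bounds and tag are illustrative):
    sh.addTagRange("db.coll",
      { "modified.date": "2015_03", "modified.customerId": MinKey },
      { "modified.date": "2015_03", "modified.customerId": MaxKey },
      "2015_03")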

Monthly pre-split and move chunks (date: "2015_03")
- customerId: min-500
- customerId: 501-10000
- customerId: 10001-300k
- customerId: 300k-314158
- customerId: 314159
- customerId: 314160-max
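A sketch of the monthly job follows; the split points and destination shard name are illustrative, and the real script derived them from the previous month's traffic:

    // Pre-split next month's range into chunks at known customerId boundaries:
    var month = "2015_03"
    var splitPoints = [500, 10000, 300000, 314159, 314160]
    splitPoints.forEach(function (id) {
      sh.splitAt("db.coll", { "modified.date": month, "modified.customerId": id })
    })

    // Move the still-empty chunks onto quieter shards before the month begins,
    // so incoming inserts are spread out (shard name is illustrative):
    sh.moveChunk("db.coll",
      { "modified.date": month, "modified.customerId": 314159 },
      "shard0007")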

The diagnosis in retrospect
- The outage did not appear to have been related to either the invalid tag ranges or the earlier failed moves
- The step-downs did not help resolve the outage, but they did highlight some queries that need to be fixed
- The DDoS was the ultimate cause of the outage, and it led to the diagnosis of deeper issues
- The deepest issue was the shard key

Aftermath and lessons learned
- Signed up for a Named TSE
- Now doing pre-split and move before the end of every month
- Check before making other changes (e.g. building new indexes)

Key takeaways for you
- Choosing a shard key is a pivotal decision: make it carefully
- Understand your current bottleneck
- Monitor insert distribution and chunk ranges (see the checks below)
- Look for slow queries (logs & mtools)
- Run mongos, mongod and config servers on dedicated servers, or use containers/cgroups
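Two quick shell checks cover the monitoring points above; the database and collection names are illustrative:

    // Chunk counts, tag ranges and balancer state per shard:
    sh.status()

    // Document and data-size distribution for one sharded collection:
    db.getSiblingDB("db").coll.getShardDistribution()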

Further Reading
- Production notes: docs.mongodb.org/manual/administration/production-notes
- mtools: github.com/rueckstiess/mtools
- Previous webinars: mongodb.com/presentations

Questions?
Ger Hartnett, Director of Technical Services (EMEA), MongoDB
@ghartnett #MongoDB

Questions
- You can submit questions via the chat box
- We are recording and will send the slides on Friday

Code GerHartnett gets a 25% discount.