Webinar: Choosing the Right Shard Key for High Performance and Scale
About This Presentation
Read these webinar slides to learn how selecting the right shard key can future-proof your application.
The shard key that you select can impact the performance, capability, and functionality of your database.
Slide Content
Ger Hartnett, Director of Technical Services (EMEA), MongoDB (@ghartnett, #MongoDB). Tales from the Field, Part Three: Choosing the Right Shard Key for High Performance and Scale
Or: Cautionary Tales. Don't solve the wrong problems; bad schemas and shard keys hurt ops too.
Before we start: The main talk should take 30-35 minutes. You can submit questions via the chat box, and we'll answer as many as possible at the end. We are recording and will send the slides on Friday. This is the final webinar in a series of three.
A quick poll: add a word to the chat to let me know your perspective. You work in operations; you work in development; you have a MongoDB system in production; you have contacted MongoDB Technical Services (support); you attended an earlier webinar in the series (part 1, part 2).
Stories: We collect observations about common mistakes to share the experience of many. Names have been changed to protect the (mostly) innocent. No animals were harmed during the making of this presentation (but maybe some DBAs and engineers had light emotional scarring). While you might be new to MongoDB, we have deep experience that you can leverage.
The Stories (part three today): Discovering a DR flaw during a data centre outage. Complex documents, memory, and an upgrade "surprise". Wild success "uncovers" the wrong shard key.
Story #1: Quick Review
Story #1: Recovering from a disaster. A prospect in the process of signing up for a subscription called us late on a Friday: a data centre power outage had taken down 30+ servers (11 shards). When they started bringing up the first shard, the nodes crashed with data corruption. 17 TB of data, very little free disk space, and JOURNALLING DISABLED!
Recovering each shard: Start a secondary read-only. Mount NFS storage for the repair. Repair the former primary node. Iterative rsync to seed a secondary. (Diagram: secondary / primary / secondary replica set.)
Key takeaways for you: If you are departing significantly from the standard config, check with us (e.g. if you think journalling is a bad idea). Use two data centres in different buildings, on different flood plains, and not in the path of the same storm (e.g. secondaries in AWS). DR/backups are useless if you haven't tested them.
Story #2: Complex documents, memory and an upgrade "surprise". A well-established e-commerce site selling diverse goods in 20+ countries. After switching to WiredTiger in production, performance dropped, which is the opposite of what they were expecting.
Product Catalog: Original Schema
{
  _id: 375,
  en_US: { name: ..., description: ..., <etc...> },
  en_GB: { name: ..., description: ..., <etc...> },
  fr_FR: { name: ..., description: ..., <etc...> },
  de_DE: ...,
  de_CH: ...,
  <... and so on for other locales ...>
  inventory: 423
}
Key Takeaways: When doing a major version or storage-engine upgrade, test in staging with some proportion of production data and workload. Sometimes putting everything into one document is counter-productive.
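One possible restructuring, sketched here as an assumption (the slides do not show the schema the customer ended up with), is to split the catalog entry into one document per locale and keep the frequently updated inventory counter in its own small document:

  { _id: "375-en_US", productId: 375, locale: "en_US", name: ..., description: ... }
  { _id: "375-fr_FR", productId: 375, locale: "fr_FR", name: ..., description: ... }
  { _id: "375-inv",   productId: 375, inventory: 423 }

A query for one locale then reads a small document, and WiredTiger, which (unlike MMAPv1) does not update documents in place, rewrites only the small inventory document when stock changes instead of the whole multi-locale entry.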
Story #3: Wild success uncovers the wrong shard key. It started out as the error "[Balancer] caught exception … tag ranges not valid for: db.coll". 11 shards; they had added 2 new shards to keep up with traffic; 400+ databases. Lots of code changes ahead of the Super Bowl. They spotted slow 300+ second queries and decided to build some indexes without telling us. Then production went down.
Adding Shards: 2 more shards… (diagram)
The “Golden Hammer” Tendency
Diagnosing the issues #1: The red-herring hunt begins. Transparent Huge Pages enabled in production. A chaotic call with 20 people talking at once; then, in the middle of the call, everything started working again. A barrage of tickets and calls. Connection storms.
Using mtools to analyse logs: connection churn (chart)
Diagnosing the issues #2: We got inconsistent and missing log files. We discovered repeated scatter-gather queries returning the same results. Secondary reads. Heavy load on some shards and low disk space.
Insert load on two shards (from Cloud Manager)
Diagnosing the issues #3: The shard key was a string built from year/month and customer id.
{
  _id: ObjectId("4c4ba5e5e8aabf3"),
  count: 1025,
  changes: { … },
  modified: { date: "2015_02", customerId: 314159 }
}
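A hedged reconstruction of why this key hurts; the slide shows the document shape but not the exact shardCollection call, so the key below is an assumption:

  // Hypothetical reconstruction of the original, too-coarse shard key.
  sh.shardCollection("db.coll", { "modified.date": 1, "modified.customerId": 1 })
  // Because the coarse month string leads the key, every document written in a
  // new month sorts above all existing data, so inserts pile into the one chunk
  // covering that month until it is split and its pieces are migrated; this is
  // consistent with the Cloud Manager graph showing insert load on only two shards.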
Diagnosing the issues #4: This is when we first heard about the DDoS attack. Missing tag ranges on some collections. Stopping the balancer reduced system load from chunk moves. Two clusters had a mongos each on the same server.
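A minimal sketch of these two steps in the mongo shell, assuming a hypothetical db.coll namespace (the real collection names are not in the slides):

  // Tag (zone) ranges live in the config database; comparing them against the
  // tagged shards shows which collections are missing ranges.
  var cfg = db.getSiblingDB("config");
  cfg.tags.find({ ns: "db.coll" }).pretty();   // tag ranges for one collection
  cfg.shards.find({}, { _id: 1, tags: 1 });    // which shards carry which tags

  // Stopping the balancer removes chunk-migration load while diagnosing.
  sh.stopBalancer();
  sh.getBalancerState();                       // should now return false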
Fixing the issues: A script to fix the tag ranges. We proposed a finer-granularity shard key, but this was not possible because of the 30 TB of data. Moved the mongos processes to dedicated servers. Re-enabled the balancer for short windows with waitForDelete and secondaryThrottle. Put together scripts to pre-split and move empty chunks to quiet shards, based on traffic from the month before.
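A sketch of the monthly pre-split-and-move idea described above, run against a mongos; the namespace, month value, customer-id boundaries and shard names are all hypothetical, as the slides do not show the real script:

  var ns = "db.coll";
  var month = "2015_03";                           // the month about to start
  var boundaries = [250000, 500000, 750000];       // customer-id split points
  var quietShards = ["shard0003", "shard0007", "shard0009", "shard0010"];

  // Create empty chunks for next month before any data arrives.
  sh.splitAt(ns, { "modified.date": month, "modified.customerId": MinKey });
  boundaries.forEach(function (customerId) {
    sh.splitAt(ns, { "modified.date": month, "modified.customerId": customerId });
  });

  // Spread the new, still-empty chunks across the shards that saw the least
  // traffic last month, throttling so the moves do not disturb production.
  quietShards.forEach(function (shard, i) {
    db.adminCommand({
      moveChunk: ns,
      find: { "modified.date": month, "modified.customerId": i * 250000 },
      to: shard,
      _secondaryThrottle: true,   // wait for secondaries during the copy
      _waitForDelete: true        // wait for the source range delete to finish
    });
  });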
The diagnosis in retrospect: The outage did not appear to have been related to either the invalid tag ranges or the earlier failed moves. The step-downs did not help resolve the outage, but they did highlight some queries that needed to be fixed. The DDoS was the ultimate cause of the outage and led to the diagnosis of deeper issues. The deepest issue was the shard key.
Aftermath and lessons learned: They signed up for a Named TSE. They now do a pre-split and move before the end of every month. They check with us before making other changes (e.g. building new indexes).
Key takeaways for you: Choosing a shard key is a pivotal decision; make it carefully. Understand your current bottleneck. Monitor insert distribution and chunk ranges. Look for slow queries (logs & mtools). Run mongos, mongod and config servers on dedicated servers, or use containers/cgroups.
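For "monitor insert distribution and chunk ranges", one quick check is to count chunks per shard straight from the config database; a sketch assuming a hypothetical db.coll namespace and a 3.0/3.2-era cluster, where config.chunks is keyed by ns:

  db.getSiblingDB("config").chunks.aggregate([
    { $match: { ns: "db.coll" } },                       // hypothetical namespace
    { $group: { _id: "$shard", chunks: { $sum: 1 } } },  // chunks held per shard
    { $sort: { chunks: -1 } }
  ])

Pair this with the insert graphs in Cloud Manager: an even chunk count combined with lopsided insert traffic is the signature of a shard key that funnels current writes into one narrow range.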
Further Reading: Production notes: docs.mongodb.org/manual/administration/production-notes. mtools: github.com/rueckstiess/mtools. Previous webinars: mongodb.com/presentations.
Questions? Ger Hartnett, Director of Technical Services (EMEA), MongoDB (@ghartnett, #MongoDB).
Questions: You can submit questions via the chat box. We are recording and will send the slides on Friday.