TroubleshootingJVMOutages-3CaseStudies.pptx

MaliniV3, 22 slides, Aug 21, 2024

About This Presentation

In this session, we will explore three major outages at leading enterprises, analyzing thread dumps, heap dumps, and GC logs. Gain practical insights and techniques to tackle CPU spikes, OutOfMemoryErrors, and application unresponsiveness. Enhance your problem-solving skills with expert guidance.


Slide Content

Slide 1: Troubleshooting JVM Outages: 3 Fortune 500 Case Studies. Ram Lakshmanan, Architect, GCeasy, FastThread, HeapHero

Slide 2 (case study 1): Slowdown at a major financial institution in N. America. Analysis report: https://tinyurl.com/5da3ft8z

Slide 3: 360° troubleshooting artifacts
1. GC Log
2. Thread Dump
3. Heap Dump
4. Heap Substitute
5. top
6. ps
7. top -H
8. Disk Usage
9. dmesg
10. netstat
11. ping
12. vmstat
13. iostat
14. Kernel Params
15. App Logs
16. metadata
Open-source script: https://github.com/ycrash/yc-data-script
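For artifact 3 alone, a heap dump can also be requested from inside the JVM. The sketch below assumes a HotSpot-based JDK (OpenJDK or Oracle) where com.sun.management.HotSpotDiagnosticMXBean is available. It is not taken from yc-data-script, and the class name HeapDumpCapture and the file names are illustrative.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

public class HeapDumpCapture {
    /**
     * Writes a heap dump of the current JVM to target.
     * liveOnly=true asks for reachable objects only (per the method's Javadoc).
     */
    public static Path dumpHeap(Path target, boolean liveOnly) throws IOException {
        HotSpotDiagnosticMXBean diagnostics =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // The file name must end in ".hprof" and the file must not already exist.
        diagnostics.dumpHeap(target.toAbsolutePath().toString(), liveOnly);
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("jvm-artifacts");
        Path dump = dumpHeap(dir.resolve("heap.hprof"), true);
        System.out.println("Heap dump written to " + dump + " (" + Files.size(dump) + " bytes)");
    }
}
```

The resulting .hprof file is the same kind of artifact that heap analyzers such as HeapHero consume. Dumping a large heap pauses the application, so this is best done deliberately, not on a timer.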

Slide 4: Annotated thread dump
1. Timestamp at which the thread dump was triggered
2. JVM version info
3. Thread details (covered in the following slides)
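The three parts listed above can be mirrored in-process with the standard java.lang.management API. The rough sketch below uses the class name MiniThreadDump and an output layout that loosely imitates jstack; both are illustrative. ThreadMXBean does not expose the tid, nid, or stack-address fields, so a real jstack or jcmd Thread.print dump remains the artifact to collect.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.time.Instant;

public class MiniThreadDump {
    /** Builds a simplified thread dump: timestamp, JVM version, then one block per thread. */
    public static String capture() {
        StringBuilder out = new StringBuilder();
        // 1. Timestamp at which the dump was triggered
        out.append(Instant.now()).append('\n');
        // 2. JVM version info
        out.append("Full thread dump ")
           .append(System.getProperty("java.vm.name")).append(" (")
           .append(System.getProperty("java.vm.version")).append("):\n\n");
        // 3. Thread details: name, Java-level id, priority (Java 9+), state, stack frames
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append("\" #")
               .append(info.getThreadId())
               .append(" prio=").append(info.getPriority()).append('\n');
            out.append("   java.lang.Thread.State: ").append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("\tat ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(capture());
    }
}
```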

Slide 5: Thread details
1. Thread name: InvoiceThread-A996
2. Priority: can have values from 1 to 10
3. Thread ID (tid): 0x00002b7cfc6fb000. In HotSpot this is the address of the JVM's internal thread structure, not the value returned by Thread.getId(); newer JDKs print that Java-level ID separately, as #N after the thread name.
4. Native ID (nid): 0x4479. This ID is platform dependent. On Linux, it is the OS-level thread ID (LWP), the same ID that top -H lists, printed here in hex. On Windows, it is the OS-level thread ID within the process. On Mac OS X, it is said to be the native pthread_t value.
5. Address space: 0x00002b7d17ab8000, an estimate of the valid stack region for the thread
6. Thread state: RUNNABLE
7. Stack trace
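To correlate a thread dump with top -H (artifact 7), the nid has to be converted from hex to decimal. Below is a minimal parser for one header line, assuming the classic HotSpot jstack layout. The sample line is reconstructed from the slide's values with a hypothetical prio=10. Newer JDKs add fields such as os_prio, cpu, and elapsed, which this sketch simply skips.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ThreadDumpHeader {
    /** Fields of one header line; priority is -1 and stackAddress null when absent. */
    public record Header(String name, int priority, String tid, long nativeId, String stackAddress) {}

    private static final Pattern NAME = Pattern.compile("^\"([^\"]*)\"");
    // Requiring whitespace before "prio=" keeps "os_prio=" (OS-level priority) from matching.
    private static final Pattern PRIO = Pattern.compile("\\sprio=(\\d+)");
    private static final Pattern TID = Pattern.compile("\\stid=(0x[0-9a-fA-F]+)");
    // nid is usually hex (0x4479); a plain decimal form is accepted as well.
    private static final Pattern NID = Pattern.compile("\\snid=(0x[0-9a-fA-F]+|\\d+)");
    private static final Pattern ADDR = Pattern.compile("\\[(0x[0-9a-fA-F]+)\\]\\s*$");

    private static String find(Pattern p, String line) {
        Matcher m = p.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static Header parse(String line) {
        String name = find(NAME, line);
        String tid = find(TID, line);
        String nid = find(NID, line);
        if (name == null || tid == null || nid == null) {
            throw new IllegalArgumentException("not a thread header line: " + line);
        }
        String prio = find(PRIO, line);
        long nativeId = nid.startsWith("0x")
                ? Long.parseLong(nid.substring(2), 16)
                : Long.parseLong(nid);
        return new Header(name, prio == null ? -1 : Integer.parseInt(prio),
                tid, nativeId, find(ADDR, line));
    }

    public static void main(String[] args) {
        // Hypothetical header line built from the slide's values; prio=10 is assumed.
        String line = "\"InvoiceThread-A996\" prio=10 tid=0x00002b7cfc6fb000 "
                + "nid=0x4479 runnable [0x00002b7d17ab8000]";
        Header h = parse(line);
        System.out.println(h);
        // top -H lists thread IDs in decimal, so search for the converted nid there.
        System.out.println("Look for PID " + h.nativeId() + " in top -H output");
    }
}
```

For the slide's nid of 0x4479 this yields 17529, which is the value to look for in the PID column of top -H when matching a CPU-hungry thread to its stack trace.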

Slide 6: How to analyze a thread dump?
yCrash: https://ycrash.io/
FastThread: https://fastthread.io/
IBM TMDA: https://www.ibm.com/support/pages/ibm-thread-and-monitor-dump-analyzer-java-tmda
Sample thread report: https://tinyurl.com/wq95weo
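A typical summary these analyzers report is the number of threads in each state; for example, many BLOCKED threads point toward lock contention. Below is a toy version of that single summary. It is not how yCrash, FastThread, or IBM TMDA are implemented. It assumes jstack-style `java.lang.Thread.State:` lines, and the dump excerpt is hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ThreadStateSummary {
    private static final Pattern STATE =
            Pattern.compile("java\\.lang\\.Thread\\.State: ([A-Z_]+)");

    /** Counts threads per state, one per "java.lang.Thread.State:" line in the dump. */
    public static Map<String, Integer> countByState(String dump) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = STATE.matcher(dump);
        while (m.find()) {
            counts.merge(m.group(1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical dump excerpt, trimmed to the lines this sketch reads.
        String dump = String.join("\n",
                "\"InvoiceThread-A996\" prio=10 tid=0x00002b7cfc6fb000 nid=0x4479 runnable [0x00002b7d17ab8000]",
                "   java.lang.Thread.State: RUNNABLE",
                "\"worker-1\" prio=5 tid=0x00002b7cfc6fc000 nid=0x447a waiting for monitor entry [0x00002b7d17bb9000]",
                "   java.lang.Thread.State: BLOCKED (on object monitor)",
                "\"worker-2\" prio=5 tid=0x00002b7cfc6fd000 nid=0x447b runnable [0x00002b7d17cba000]",
                "   java.lang.Thread.State: RUNNABLE");
        System.out.println(countByState(dump)); // prints {BLOCKED=1, RUNNABLE=2}
    }
}
```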

Slide 7 (case study 2): Poor response time at a major cloud service provider. Blog: https://blog.gceasy.io/garbage-collection-tuning-success-story-reducing-young-gen-size/

Slide 8: What is garbage? (Diagram labels: HTTP request, objects, memory, garbage)

Slide 9: How are objects garbage collected? Evolution: manual -> automatic
3-4 decades ago: the developer writes code to manually evict garbage.
Now: the JVM automatically evicts garbage.
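In the automatic model, the JVM treats an object as garbage once it can no longer be reached from GC roots, such as local variables on thread stacks and static fields. The toy sketch below shows that reachability (mark) step over a hypothetical object graph. Real HotSpot collectors are generational and far more involved, so treat this only as an illustration of the idea. The class name ToyMarkPhase is illustrative.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ToyMarkPhase {
    /**
     * heap maps each object id to the ids it references.
     * Returns the ids that cannot be reached from any root, i.e. the garbage.
     */
    public static Set<String> findGarbage(Map<String, List<String>> heap, Collection<String> roots) {
        Set<String> marked = new HashSet<>();
        Deque<String> pending = new ArrayDeque<>(roots);
        while (!pending.isEmpty()) {
            String obj = pending.pop();
            if (!marked.add(obj)) {
                continue; // already marked via another path
            }
            for (String ref : heap.getOrDefault(obj, List.of())) {
                pending.push(ref);
            }
        }
        Set<String> garbage = new TreeSet<>(heap.keySet());
        garbage.removeAll(marked); // everything never marked is unreachable
        return garbage;
    }

    public static void main(String[] args) {
        // Hypothetical heap after an HTTP request finished: no root points at "request" any more.
        Map<String, List<String>> heap = Map.of(
                "session", List.of("user"),
                "user", List.of(),
                "request", List.of("headers", "body"),
                "headers", List.of(),
                "body", List.of());
        System.out.println(findGarbage(heap, List.of("session"))); // prints [body, headers, request]
    }
}
```

Because reachability is traced from the roots, two unreachable objects that reference each other are still found to be garbage, something simple reference counting would miss.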

Slide 10: Automatic GC sounds good, right? Yes, except for GC pauses and CPU consumption.
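The cost of collection can be observed from inside a running JVM through the standard GarbageCollectorMXBean. In the minimal sketch below, note that getCollectionTime() reports approximate accumulated elapsed time in milliseconds, not CPU time. The allocation loop is an arbitrary stand-in for per-request garbage, and the class name GcCostProbe is illustrative. The GC log remains the better source for per-pause detail.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcCostProbe {
    /** Returns {collection count, accumulated collection time in ms} summed over all collectors. */
    public static long[] totals() {
        long count = 0;
        long timeMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the value is undefined for a collector.
            count += Math.max(0, gc.getCollectionCount());
            timeMs += Math.max(0, gc.getCollectionTime());
        }
        return new long[] {count, timeMs};
    }

    public static void main(String[] args) {
        long[] before = totals();
        long sink = 0;
        // Allocate short-lived objects so the collector has work to do.
        for (int i = 0; i < 2_000_000; i++) {
            byte[] payload = new byte[1024];
            sink += payload.length;
        }
        long[] after = totals();
        System.out.println("allocated bytes: " + sink);
        System.out.println("GC runs: " + (after[0] - before[0])
                + ", GC time: " + (after[1] - before[1]) + " ms");
    }
}
```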

Slide 11: 360° troubleshooting artifacts (the same 16 artifacts and open-source script, https://github.com/ycrash/yc-data-script, as slide 3)

Slide 12: Sample GC log
2019-08-31T01:09:19.397+0000: 1.606: [GC (Metadata GC Threshold) [PSYoungGen: 545393K->18495K(2446848K)] 545393K->18519K(8039424K), 0.0189376 secs] [Times: user=0.15 sys=0.01, real=0.02 secs]
2019-08-31T01:09:19.416+0000: 1.625: [Full GC (Metadata GC Threshold) [PSYoungGen: 18495K->0K(2446848K)] [ParOldGen: 24K->17366K(5592576K)] 18519K->17366K(8039424K), [Metaspace: 20781K->20781K(1067008K)], 0.0416162 secs] [Times: user=0.38 sys=0.03, real=0.04 secs]
2019-08-31T01:18:19.288+0000: 541.497: [GC (Metadata GC Threshold) [PSYoungGen: 1391495K->18847K(2446848K)] 1408861K->36230K(8039424K), 0.0568365 secs] [Times: user=0.31 sys=0.75, real=0.06 secs]
2019-08-31T01:18:19.345+0000: 541.554: [Full GC (Metadata GC Threshold) [PSYoungGen: 18847K->0K(2446848K)] [ParOldGen: 17382K->25397K(5592576K)] 36230K->25397K(8039424K), [Metaspace: 34865K->34865K(1079296K)], 0.0467640 secs] [Times: user=0.31 sys=0.08, real=0.04 secs]
2019-08-31T02:33:20.326+0000: 5042.536: [GC (Allocation Failure) [PSYoungGen: 2097664K->11337K(2446848K)] 2123061K->36742K(8039424K), 0.3298985 secs] [Times: user=0.00 sys=9.20, real=0.33 secs]
2019-08-31T03:40:11.749+0000: 9053.959: [GC (Allocation Failure) [PSYoungGen: 2109001K->15776K(2446848K)] 2134406K->41189K(8039424K), 0.0517517 secs] [Times: user=0.00 sys=1.22, real=0.05 secs]
2019-08-31T05:11:46.869+0000: 14549.079: [GC (Allocation Failure) [PSYoungGen: 2113440K->24832K(2446848K)] 2138853K->50253K(8039424K), 0.0392831 secs] [Times: user=0.02 sys=0.79, real=0.04 secs]
2019-08-31T06:26:10.376+0000: 19012.586: [GC (Allocation Failure) [PSYoungGen: 2122496K->25600K(2756096K)] 2147917K->58149K(8348672K), 0.0371416 secs] [Times: user=0.01 sys=0.75, real=0.04 secs]
2019-08-31T07:50:03.442+0000: 24045.652: [GC (Allocation Failure) [PSYoungGen: 2756096K->32768K(2763264K)] 2788645K->72397K(8355840K), 0.0709641 secs] [Times: user=0.16 sys=1.39, real=0.07 secs]
2019-08-31T09:04:21.406+0000: 28503.616: [GC (Allocation Failure) [PSYoungGen: 2763264K->32768K(2733568K)] 2802893K->83469K(8326144K), 0.0789178 secs] [Times: user=0.12 sys=1.59, real=0.08 secs]
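The sample log appears to be in the JDK 8 -XX:+PrintGCDetails format for the Parallel collector (PSYoungGen/ParOldGen). Each entry ends with a Times block listing user, sys, and real seconds. In several Allocation Failure entries, sys (kernel) time is far larger than user time, for example sys=9.20 vs user=0.00 at 02:33:20. The sketch below parses entries in exactly this format and flags events where sys exceeds user. It assumes one event per line and does not handle the JDK 9+ unified logging format. The class and record names are illustrative.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcLogScan {
    /** One GC event; all durations are in seconds, as printed in the log. */
    public record GcEvent(String timestamp, String kind, String cause,
                          double pauseSecs, double userSecs, double sysSecs, double realSecs) {
        /** True when kernel (sys) time exceeds user time for this event. */
        public boolean sysHeavy() {
            return sysSecs > userSecs;
        }
    }

    private static final Pattern TIMESTAMP = Pattern.compile("^(\\S+): ");
    private static final Pattern KIND = Pattern.compile("\\[(Full GC|GC) \\(([^)]*)\\)");
    // The first ", <number> secs]" is the pause; the Times block uses "real=" instead.
    private static final Pattern PAUSE = Pattern.compile(", ([\\d.]+) secs\\]");
    private static final Pattern TIMES =
            Pattern.compile("user=([\\d.]+) sys=([\\d.]+), real=([\\d.]+) secs");

    private static Matcher require(Pattern p, String line) {
        Matcher m = p.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("unexpected GC log line: " + line);
        }
        return m;
    }

    public static GcEvent parse(String line) {
        Matcher ts = require(TIMESTAMP, line);
        Matcher kind = require(KIND, line);
        Matcher pause = require(PAUSE, line);
        Matcher times = require(TIMES, line);
        return new GcEvent(ts.group(1), kind.group(1), kind.group(2),
                Double.parseDouble(pause.group(1)),
                Double.parseDouble(times.group(1)),
                Double.parseDouble(times.group(2)),
                Double.parseDouble(times.group(3)));
    }

    public static void main(String[] args) {
        // Three entries copied from the sample GC log.
        List<String> log = List.of(
            "2019-08-31T01:09:19.416+0000: 1.625: [Full GC (Metadata GC Threshold) [PSYoungGen: 18495K->0K(2446848K)] [ParOldGen: 24K->17366K(5592576K)] 18519K->17366K(8039424K), [Metaspace: 20781K->20781K(1067008K)], 0.0416162 secs] [Times: user=0.38 sys=0.03, real=0.04 secs]",
            "2019-08-31T02:33:20.326+0000: 5042.536: [GC (Allocation Failure) [PSYoungGen: 2097664K->11337K(2446848K)] 2123061K->36742K(8039424K), 0.3298985 secs] [Times: user=0.00 sys=9.20, real=0.33 secs]",
            "2019-08-31T03:40:11.749+0000: 9053.959: [GC (Allocation Failure) [PSYoungGen: 2109001K->15776K(2446848K)] 2134406K->41189K(8039424K), 0.0517517 secs] [Times: user=0.00 sys=1.22, real=0.05 secs]");
        double totalPause = 0;
        double maxPause = 0;
        for (String line : log) {
            GcEvent e = parse(line);
            totalPause += e.pauseSecs();
            maxPause = Math.max(maxPause, e.pauseSecs());
            System.out.printf("%s %-7s %-22s pause=%.3fs user=%.2fs sys=%.2fs%s%n",
                    e.timestamp(), e.kind(), e.cause(), e.pauseSecs(),
                    e.userSecs(), e.sysSecs(), e.sysHeavy() ? "  <-- sys > user" : "");
        }
        System.out.printf("events=%d total pause=%.3fs max pause=%.3fs%n",
                log.size(), totalPause, maxPause);
    }
}
```

Tools such as GCeasy compute these kinds of statistics (pause totals, maxima, user/sys/real breakdowns) across the whole log; the sketch only shows where the numbers come from in each line.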

Slide 13: How to analyze a GC log?
yCrash: https://ycrash.io/
GCeasy: https://gceasy.io/
IBM GC & Memory Visualizer: https://developer.ibm.com/javasdk/tools/
HP JMeter: https://h20392.www2.hpe.com/portal/swdepot/displayProductInfo.do?productNumber=HPJMETER
Google Garbage Cat (CMS): https://code.google.com/archive/a/eclipselabs.org/p/garbagecat

Slide 14: More GC tuning case studies
Uber saves millions of dollars: https://blog.gceasy.io/2022/03/04/garbage-collection-tuning-success-story-reducing-young-gen-size/
Large automobile manufacturer improves response time: https://blog.gceasy.io/2022/03/04/garbage-collection-tuning-success-story-reducing-young-gen-size/
CloudBees (Jenkins parent company) optimizes GC performance: https://blog.gceasy.io/2019/08/01/cloudbees-gc-performance-optimized-with-gceasy/
Oracle optimizes app performance by tuning GC: https://blog.gceasy.io/2022/12/06/oracle-architect-optimizes-performance-using-gceasy/

Slide 15: Large SaaS company CEO’s tweet

Slide 16 (case study 3): Intermittent HTTP 502 errors at a major travel service provider

Slide 17: EBS architecture

Slide 18: Clue: Nginx error

Slide 19: 360° data (the same 16 artifacts and open-source script, https://github.com/ycrash/yc-data-script, as slide 3)


Slide 21: JVM Performance Master Class: https://ycrash.io/java-performance-training

Slide 22: Thank you, friends! If you want to learn more: Ram Lakshmanan, @tier1app, https://www.linkedin.com/company/ycrash. This deck will be published at https://blog.ycrash.io.