Measuring and Diagnosing Performance Shouldn't Require Magic

ScyllaDB 298 views 34 slides Oct 14, 2024
Slide 1
Slide 1 of 34
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29
Slide 30
30
Slide 31
31
Slide 32
32
Slide 33
33
Slide 34
34

About This Presentation

Struggling with performance issues despite all green dashboards? Experts say you need special skills, but we'll show you how to tackle these problems yourself. Join us to understand why this happens and how you can take control. #DevTips #PerformanceTuning


Slide Content

A ScyllaDB Community
Measuring and Diagnosing
Performance Shouldn’t Require Magic
Cary Millsap
Distinguished Product Manager, Oracle Corporation

Cary Millsap

Distinguished Product Manager, Oracle
■How to Make Things Faster (O’Reilly 2023)
■Performance specialist since 1990
■Faithful husband, father of three
■I make “award-winning museum-quality aviation replicas”
2

Optimizing performance is
a skill that, with the right
tools, a regular person
can be taught to master.
3

1 A Horror Story
4

© 20
th
Century Fox 5
Employees were “burning the place down” over the weekend
Because they weren’t getting paid on Friday
Because Payroll was too slow
Three months of this…

Until finally…
■Last resort: $$$ CPU upgrade
■Because dashboard said “CPU”
■Expected 1.5× faster Payroll
■Because 1.5× faster CPUs

■This would save them.

■But Payroll was actually slower after $$$ upgrading.

6

7
Picture of nuclear blast here.

The confusing part
Dashboard said…
■Heavy CPU consumption
■Lots of “latch free” wait events
cache buffers chains

This means high-LIO SQL.
And they knew what to do with high-LIO SQL.

But there was no high-LIO SQL in Payroll.
moviestore.c
om
High I/O
SQL, away!
8

9
This just made everyone that much angrier…
© 20
th
Century Fox

2 Measure, Diagnose
10

A little secret…

Their dashboard was never
going to help them solve their
Payroll problem.
11

But we solved it
in less than an hour.
12

What could we do that’s any better than their dashboard?
LOOK at it.
13
…at Payroll, that is.

LOOK at it… what is that supposed to mean?!
See how Payroll spends its time, call by call.

Trace it.

Oracle Database is the best in the world at this…
■DBMS_MONITOR.SESSION_TRACE_ENABLE and others
■Subroutine entries and exits timing, consumption, argument values, object manipulations, …
■Unfiltered so you can filter however you like
■Unaggregated so you can group however you like
14

Making the trace happen…
When is the next time you’ll run Payroll?
We’re always running Payroll. We run it
24×7 and get further behind every day.
You’re running Payroll right now?
Yes. Right now.
Do you mind if we trace it for about half an hour?
Sure, here’s our DBA. Just
tell him what you need.
15

16
=====================
PARSING IN CURSOR #140402507350816 len=800 dep=0 uid=173 oct=3 lid=173 tim=16616124723359 hv=4216956807 ad='204dcdc88' sqlid='ah8xzfvxpm9w7'
select Phase_Code ,Status_Code ,Completion_Text ,Phase.Lookup_Code ,Status.Lookup_Code ,Phase.Meaning ,Status.Meaning into :b0,:b1,:b2,:b3,:b4,:b5,:b6 from Fnd_Lookups Phase ,Fnd_Lookups Status ,Fnd_Concurrent_programs P
,Fnd_Concurrent_Requests R where (((((R.Request_Id=:b7 and (P.Concurrent_Program_Id=R.Concurrent_program_ID and P.Application_ID=R.Program_Application_ID)) and Phase.Lookup_Type=:b8) and
Phase.Lookup_Code=Decode(Status.Lookup_Code,'H','I','S','I','U','I','M','I',R.Phase_Code)) and Status.Lookup_Type=:b9) and
Status.Lookup_Code=Decode(R.Phase_Code,'P',Decode(R.Hold_Flag,'Y','H',Decode(P.Enabled_Flag,'N','U',Decode(Sign((R.Requested_Start_Date-SYSDATE)),1,'P',R.Status_Code))),'R',Decode(R.Hold_Flag,'Y','S',Decode(R.Status_Code,'Q','B','I','B',R.Status_Code
)),R.Status_Code))
END OF STMT
PARSE #140402507350816:c=109,e=109,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1279863267,tim=16616124723358
EXEC #140402507350816:c=76,e=77,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1279863267,tim=16616124723486
WAIT #140402507350816: nam='SQL*Net message to client' ela= 2 driver id=1952673792 #bytes=1 p3=0 obj#=-1 tim=16616124723508
FETCH #140402507350816:c=322,e=322,p=0,cr=15,cu=0,mis=0,r=1,dep=0,og=1,plh=1279863267,tim=16616124723858
STAT #140402507350816 id=1 cnt=1 pid=0 pos=1 obj=0 op='NESTED LOOPS (cr=15 pr=0 pw=0 str=1 time=324 us cost=10 size=183 card=1)'
STAT #140402507350816 id=2 cnt=1 pid=1 pos=1 obj=0 op='NESTED LOOPS (cr=14 pr=0 pw=0 str=1 time=309 us cost=10 size=183 card=1)'
STAT #140402507350816 id=3 cnt=1 pid=2 pos=1 obj=0 op='NESTED LOOPS (cr=11 pr=0 pw=0 str=1 time=271 us cost=7 size=124 card=1)'
STAT #140402507350816 id=4 cnt=1 pid=3 pos=1 obj=0 op='NESTED LOOPS (cr=7 pr=0 pw=0 str=1 time=46 us cost=4 size=65 card=1)'
STAT #140402507350816 id=5 cnt=1 pid=4 pos=1 obj=34562 op='TABLE ACCESS BY INDEX ROWID FND_CONCURRENT_REQUESTS (cr=4 pr=0 pw=0 str=1 time=30 us cost=3 size=49 card=1)'
STAT #140402507350816 id=6 cnt=1 pid=5 pos=1 obj=34571 op='INDEX UNIQUE SCAN FND_CONCURRENT_REQUESTS_U1 (cr=3 pr=0 pw=0 str=1 time=19 us cost=2 size=0 card=1)'
STAT #140402507350816 id=7 cnt=1 pid=4 pos=2 obj=34368 op='TABLE ACCESS BY INDEX ROWID FND_CONCURRENT_PROGRAMS (cr=3 pr=0 pw=0 str=1 time=12 us cost=1 size=16 card=1)'
STAT #140402507350816 id=8 cnt=1 pid=7 pos=1 obj=51820392 op='INDEX UNIQUE SCAN FND_CONCURRENT_PROGRAMS_U1 (cr=2 pr=0 pw=0 str=1 time=9 us cost=0 size=0 card=1)'
STAT #140402507350816 id=9 cnt=1 pid=3 pos=2 obj=34006 op='TABLE ACCESS BY INDEX ROWID BATCHED FND_LOOKUP_VALUES (cr=4 pr=0 pw=0 str=1 time=223 us cost=3 size=59 card=1)'
STAT #140402507350816 id=10 cnt=1 pid=9 pos=1 obj=51821477 op='INDEX RANGE SCAN FND_LOOKUP_VALUES_U1 (cr=3 pr=0 pw=0 str=1 time=215 us cost=2 size=0 card=1)'
STAT #140402507350816 id=11 cnt=1 pid=2 pos=2 obj=51821477 op='INDEX RANGE SCAN FND_LOOKUP_VALUES_U1 (cr=3 pr=0 pw=0 str=1 time=37 us cost=2 size=0 card=1)'
STAT #140402507350816 id=12 cnt=1 pid=1 pos=2 obj=34006 op='TABLE ACCESS BY INDEX ROWID FND_LOOKUP_VALUES (cr=1 pr=0 pw=0 str=1 time=10 us cost=3 size=59 card=1)'
WAIT #140402507350816: nam='SQL*Net message from client' ela= 6526 driver id=1952673792 #bytes=1 p3=0 obj#=-1 tim=16616124730715
=====================
PARSING IN CURSOR #140402496894584 len=130 dep=0 uid=173 oct=3 lid=173 tim=16616124730953 hv=3516731138 ad='208b34798' sqlid='2n11d838tu3s2'
select transaction_context_id into :b0 from fnd_log_transaction_context where (transaction_type='REQUEST' and transaction_id=:b1)
END OF STMT
PARSE #140402496894584:c=122,e=122,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=2041065141,tim=16616124730953
EXEC #140402496894584:c=55,e=55,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=2041065141,tim=16616124731074
WAIT #140402496894584: nam='SQL*Net message to client' ela= 4 driver id=1952673792 #bytes=1 p3=0 obj#=-1 tim=16616124731104
FETCH #140402496894584:c=45,e=45,p=0,cr=2,cu=0,mis=0,r=0,dep=0,og=1,plh=2041065141,tim=16616124731173
STAT #140402496894584 id=1 cnt=0 pid=0 pos=1 obj=241009 op='TABLE ACCESS BY INDEX ROWID BATCHED FND_LOG_TRANSACTION_CONTEXT (cr=2 pr=0 pw=0 str=1 time=48 us cost=2 size=22 card=1)'
STAT #140402496894584 id=2 cnt=0 pid=1 pos=1 obj=241012 op='INDEX RANGE SCAN FND_LOG_TRANSACTION_CONTEXT_N2 (cr=2 pr=0 pw=0 str=1 time=46 us cost=1 size=0 card=1)'
WAIT #140402496894584: nam='SQL*Net message from client' ela= 5796 driver id=1952673792 #bytes=1 p3=0 obj#=-1 tim=16616124737077
EXEC #140402521051584:c=43,e=42,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=3528876040,tim=16616124737197
WAIT #140402521051584: nam='SQL*Net message to client' ela= 2 driver id=1952673792 #bytes=1 p3=0 obj#=-1 tim=16616124737253
FETCH #140402521051584:c=53,e=53,p=0,cr=4,cu=0,mis=0,r=1,dep=0,og=1,plh=3528876040,tim=16616124737266
WAIT #140402521051584: nam='SQL*Net message from client' ela= 5594 driver id=1952673792 #bytes=1 p3=0 obj#=-1 tim=16616124742889
=====================
PARSING IN CURSOR #140402507940536 len=288 dep=0 uid=173 oct=3 lid=173 tim=16616124743098 hv=18721447 ad='204d82c50' sqlid='2tyrw6s0jvap7'
select nvl(CP.DELETE_LOG_FILE,'N') ,CR.LOGFILE_NAME ,STATUS_CODE ,PHASE_CODE into :b0,:b1,:b2,:b3 from FND_CONCURRENT_REQUESTS CR ,FND_CONCURRENT_PROGRAMS CP where ((CR.REQUEST_ID=:b4 and
CR.CONCURRENT_PROGRAM_ID=CP.CONCURRENT_PROGRAM_ID) and CR.PROGRAM_APPLICATION_ID=CP.APPLICATION_ID)
END OF STMT
PARSE #140402507940536:c=146,e=146,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=255874495,tim=16616124743098
EXEC #140402507940536:c=38,e=38,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=255874495,tim=16616124743187
WAIT #140402507940536: nam='SQL*Net message to client' ela= 2 driver id=1952673792 #bytes=1 p3=0 obj#=-1 tim=16616124743211
FETCH #140402507940536:c=46,e=46,p=0,cr=7,cu=0,mis=0,r=1,dep=0,og=1,plh=255874495,tim=16616124743278
STAT #140402507940536 id=1 cnt=1 pid=0 pos=1 obj=0 op='NESTED LOOPS (cr=7 pr=0 pw=0 str=1 time=46 us cost=4 size=88 card=1)'
STAT #140402507940536 id=2 cnt=1 pid=1 pos=1 obj=34562 op='TABLE ACCESS BY INDEX ROWID FND_CONCURRENT_REQUESTS (cr=4 pr=0 pw=0 str=1 time=30 us cost=3 size=72 card=1)'
STAT #140402507940536 id=3 cnt=1 pid=2 pos=1 obj=34571 op='INDEX UNIQUE SCAN FND_CONCURRENT_REQUESTS_U1 (cr=3 pr=0 pw=0 str=1 time=20 us cost=2 size=0 card=1)'
STAT #140402507940536 id=4 cnt=1 pid=1 pos=2 obj=34368 op='TABLE ACCESS BY INDEX ROWID FND_CONCURRENT_PROGRAMS (cr=3 pr=0 pw=0 str=1 time=12 us cost=1 size=16 card=1)'
STAT #140402507940536 id=5 cnt=1 pid=4 pos=1 obj=51820392 op='INDEX UNIQUE SCAN FND_CONCURRENT_PROGRAMS_U1 (cr=2 pr=0 pw=0 str=1 time=8 us cost=0 size=0 card=1)'
WAIT #140402507940536: nam='SQL*Net message from client' ela= 59564 driver id=1952673792 #bytes=1 p3=0 obj#=-1 tim=16616124802985
How is that going to help anybody?

Receipt for Response Time


What’s the problem?
■CPU? (12.7%)
■latch free? (1.2%)

No…
■SQL*Net… (50.3%)
17“Receipt for Response Time” is a trademark of Method R Corporation. Visit method-r.com for access to Method R Workbench.

duration = count × mean

count and mean…
There is no other way to
influence duration.

The only way to influence total duration is by
manipulating one or more contributor durations.
The only way to influence a contributor duration is
by manipulating either its count or its mean.
There are only two knobs you can turn (ever!)

18

With such a receipt, you can predict impact.

“If you change this count
or that mean, then
duration will change by
ΔD.”

No more months of
guessing.
19

Measuring and diagnosing performance
shouldn’t require magic Ring a bell?
Measuring:
■Trace.
■Always.
■No magic.





Diagnosing:
■Receipt.
■Top row.
■Count.
■Mean.
■No magic.



20
“If you change this
count or that mean,
then duration will
change by ΔD.”

3 The End of the Horror Story
21

Our diagnosis…
Remember, only two knobs. Always.
■Count
■Mean

Count
■Payroll is prepackaged.
■Can’t change its network I/O call counts.

Mean
■Should be 55× faster.
■PROTOCOL value in net config file.
22

Now we wait.
(Change control.)
23

Friday, we got a call…
Payroll throughput is more than double,s.
Mission accomplished.
??????
24

25 © 20
th
Century Fox

4 Why the Dashboard Was Futile
26

Their dashboard was never
going to help them fix their
Payroll problem.
27

Main reason
The dashboard uses a harmful definition of “bottleneck.”

Reminder: What’s the business’s bottleneck?
Payroll.
28

Bottleneck vs. “Bottleneck”
Payroll bottleneck
■The Payroll execution’s largest
response time contributor.
■Always the top line of a “receipt for
response time.”
■Easy. Reliable. Always
business-relevant.
29
“System” bottleneck
■The system’s largest response
time contributor.
■Not necessarily what’s causing
your program to struggle.
■Confusing. Unreliable. Can lead
to months of anguish.

5 Finalacrum
30

Optimizing performance is
a skill that, with the right
tools, a regular person
can be taught to master.
31

Repeatable, teachable optimization process


32
MeasureTrace. Always.
DiagnoseCreate a “receipt for response time.” method-r.com
Solve Reduce either call count or call mean latency
for the calls at the top of your receipt.

33
For more information…
What’s the whole Payroll story?Chapter 16: “Payroll”
Why are there only ever two knobs?Chapter 28: “A Riddle”
How can you reduce event counts?Chapter 30: “Event Count”
How can you reduce event durations?Chapter 31: “Event Duration”
Why is count reduction better?Chapter 32: “Filter Early”
Why was Payroll slower after upgrading?Chapter 54: “Yeti”

Thank you! Let’s connect.
Cary Millsap
[email protected]
carymillsap.com
@carymillsap
Tags