Silk_SQLSaturdayBatonRouge_kgorman_2024.pptx

kellynpotvin 146 views 50 slides Jul 30, 2024

About This Presentation

SQL Saturday Baton Rouge Session "Migrating High IO SQL Server to Azure"


Slide Content

Migrating High IO SQL Server Workloads to Azure: How to migrate “whales in the pond” from an isolated data center to freedom in the cloud


Who am I?

The Whales

Everyone Has a Whale in the Pond
Every company has that one relational workload that outgrows all the others. If the workload requires high CPU and memory, it often requires high IO, too. Some companies have a whale that births other, smaller whales, creating a pod of workloads that hinders them from moving to the cloud. Because the workload is critical to the entire business, if any of these workloads move to the cloud, ALL of them need to move. These high IO workloads are the most common challenge for companies on their cloud journey.

The Path to the Cloud
Most data estates are tightly coupled on-premises and must move together. If Oracle doesn’t migrate to Azure, then little else can; but once Oracle migrates, refactoring and new services can be introduced to the customer. On-Premises or Simple Data Estate: Lift, Shift, EVOLVE.

High IO Might Not Be High IO…
Outside of IO, it could be CPU, memory, concurrency, or even a bug causing the issue. Take the time to understand the real cause. High input/output is an issue, but is it the real issue?
- Is high IO occurring due to backups or maintenance processes?
- Is a lack of memory “swapping” to IO?
- Waiting for CPU vs. on CPU?
- Is concurrency due to poor database design, poorly written SQL, or missing indexes the cause?

Knowing if it’s IO…

Waits at the Database Level
- PAGEIOLATCH_EX/SH/UP
- WRITELOG
- ASYNC_DISKPOOL_LOCK
- ASYNC_NETWORK_IO
- ASYNC_IO_COMPLETION

SELECT * FROM sys.dm_os_wait_stats dows ORDER BY dows.wait_time_ms DESC;

PAGEIOLATCH_*
Occurs when data transfers from disk to the buffer pool. Some of these waits will always exist, but extensive waits signal an issue. Possible culprits include: insufficient memory, outdated or invalid statistics, and missing or poor indexing. Watch the Page Life Expectancy metric.
Types:
- SH: Share mode, to read the datafile page from disk.
- UP: Update mode, to modify the page structure once it is in memory.
- EX: Exclusive mode, to modify the page in memory exclusively.
- DT/KP/NL: Destroy/Keep/Null modes.
Long waits can indicate an underlying storage issue.
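The Page Life Expectancy metric mentioned above is usually judged against a scaled threshold rather than the old flat 300 seconds. A minimal sketch, assuming the widely cited community rule of thumb of 300 seconds per 4 GB of buffer pool (a heuristic, not an official Microsoft target):

```python
def ple_threshold_seconds(buffer_pool_gb: float) -> float:
    """Scaled Page Life Expectancy target: 300 seconds per 4 GB of buffer pool.

    This is a community heuristic; treat sustained PLE well below this
    value as a hint to investigate memory pressure and PAGEIOLATCH waits.
    """
    return buffer_pool_gb / 4 * 300


# A server with a 128 GB buffer pool would be expected to keep
# PLE above roughly 9,600 seconds under this rule of thumb.
print(ple_threshold_seconds(128))
```
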

Check for Latches in Real Time

SELECT wt.session_id, wt.wait_type, er.last_wait_type AS last_wait_type,
       wt.wait_duration_ms, wt.blocking_session_id, wt.blocking_exec_context_id,
       resource_description
FROM sys.dm_os_waiting_tasks wt
JOIN sys.dm_exec_sessions es ON wt.session_id = es.session_id
JOIN sys.dm_exec_requests er ON wt.session_id = er.session_id
WHERE es.is_user_process = 1
  AND wt.wait_type <> 'SLEEP_TASK'
ORDER BY wt.wait_duration_ms DESC;

Output
In this example:
- Session ID 53 is waiting for a shared lock (LCK_M_S) and its last wait type was an exclusive lock (LCK_M_X). It has been waiting for 12,345 milliseconds and is being blocked by session ID 51.
- Session ID 54 is waiting on a shared page IO latch (PAGEIOLATCH_SH) with a previous wait type of exclusive page IO latch (PAGEIOLATCH_EX). It has been waiting for 6,789 milliseconds. There is no blocking session in this case.
- Session ID 55 is waiting due to parallelism (CXPACKET). Its last wait type was a shared page IO latch (PAGEIOLATCH_SH). It has been waiting for 1,234 milliseconds and is not being blocked by any session.
Note that this is a fictitious example; actual output will vary based on your SQL Server's activity and state.

WRITELOG
Every change requires a write to the log buffer, followed by a write to the underlying disk, so physical IO can be demanding.
- Caused by a transaction commit, unless delayed durability is set in newer versions of SQL Server (2014+).
- All writes and reads to log files (LDF) are done sequentially.
- Write-ahead logging forces the current log block to disk.
- Watch for waits on Always On AGs with remote log copy and HADR_SYNC_COMMIT waits.
- The log block is also written when sp_flush_log is executed in newer versions (2014+).
Consider the location and storage type for LDF files carefully, ensuring it matches the demand.

ASYNC_DISKPOOL_LOCK
Occurs while SQL Server waits for threads to synchronize because of parallel disk operations, encrypting files with TDE, and/or creating/deleting files. Mostly maintenance operations: file create/grow, backups, and restores. The larger the value (over 50 ms), the larger the concern.

ASYNC_IO_COMPLETION
Non-datafile disk I/O completion, like growing a transaction log or IO that doesn’t fit inside the buffer pool. Can also be part of backup and restore. Slow storage can impact this wait event drastically. Measure the BACKUPIO timing to identify issues and consider the following:
- Backup: faster backup storage.
- Preallocate datafiles vs. relying on smaller auto-growth.

ASYNC_NETWORK_IO
The network can be our last bottleneck: the application side can’t process data as fast as SQL Server provides it, or there are inconsistencies in packet sizes between database(s) and applications.
- Consider reducing data sent over the network by ensuring the right data is sent (right vs. all).
- Verify Network Packet Size in connection strings is sized correctly.
- Consider a faster network interface (or larger pipe).
- Break up applications using network interfaces (use a different ExpressRoute service, etc.).

I/O Stalls
This is a metric to identify I/O problems. Use the DMF sys.dm_io_virtual_file_stats to view detailed info about stall times on data and/or log files. It accepts a database ID and a database file ID; pass NULL for either to return all.

SELECT Db.name, vfs.*
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.databases AS Db ON vfs.database_id = Db.database_id;

Tools Available to the DBA*
- Performance Monitor
- Azure Monitor
- Windows Event Viewer
- Third-party tools (Redgate, SolarWinds, etc.)
- Internal data provided by: DMVs, Extended Events, execution plans, the Data Collector
*Seriously consider monitoring due to SQL Server’s inability to retain information long-term.

Basic IO Usage in SQL Server
Two options: sys.dm_io_virtual_file_stats and the legacy fn_virtualfilestats.

Usage: fn_virtualfilestats({ db_id | NULL }, { file_id | NULL })
Example:
SELECT * FROM fn_virtualfilestats(2, NULL)
GO

Locate the database info:
SELECT DB_ID('proddb') AS [DB ID]
GO
sp_helpdb 'proddb'
GO

Usage: sys.dm_io_virtual_file_stats({ db_id | NULL }, { file_id | NULL })
Example:
SELECT * FROM sys.dm_io_virtual_file_stats(2, NULL)
GO

https://www.sqlservercentral.com/articles/how-to-get-io-statistics-of-sql-server-database-files

Dbid | FileId | NumberReads | BytesRead | IoStallReadMS | NumberWrites | BytesWritten | IoStallWriteMS | IoStallMS | BytesOnDisk
1 | 2 | 742 | 6881280 | 11760 | 98448 | 1161560064 | 584136 | 595896 | 99090432
2 | 2 | 462 | 43008000 | 252 | 113442 | 6502465536 | 267792 | 268044 | 3170893824
3 | 2 | 798 | 46448640 | 966 | 4410 | 23568384 | 16086 | 17052 | 3170893824
4 | 2 | 1638 | 48857088 | 1470 | 1144206 | 1144206 | 5799948 | 5801418 | 1048707072
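The stall columns above are cumulative, so the useful number is the average stall per operation. A minimal sketch in plain Python over the sample row values (the DMV itself must be queried from SQL Server):

```python
def avg_stall_ms(io_stall_ms: int, io_count: int) -> float:
    """Average stall per I/O operation in milliseconds.

    io_stall_ms and io_count are cumulative DMV counters
    (e.g. IoStallReadMS / NumberReads); guard against files
    that have seen no I/O yet.
    """
    return io_stall_ms / io_count if io_count else 0.0


# First sample row above: reads stalled ~15.8 ms on average,
# writes ~5.9 ms -- the reads, not the writes, deserve attention.
print(round(avg_stall_ms(11760, 742), 2))
print(round(avg_stall_ms(584136, 98448), 2))
```
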

IO Usage at the Database Level

WITH IO_Per_DB AS (
  SELECT DB_NAME(database_id) AS Db,
         CONVERT(DECIMAL(12,2), SUM(num_of_bytes_read + num_of_bytes_written) / 1024 / 1024) AS TotalMb
  FROM sys.dm_io_virtual_file_stats(NULL, NULL) dmivfs
  GROUP BY database_id
)
SELECT Db, TotalMb,
       CAST(TotalMb / SUM(TotalMb) OVER() * 100 AS DECIMAL(5,2)) AS [I/O]
FROM IO_Per_DB
ORDER BY [I/O] DESC;

Displays total MB and the percentage of IO across the entire SQL Server, per database.
https://dbadiaries.com/how-to-find-io-usage-per-database-in-sql-server

Db | TotalMB | I/O%
ProdDB | 431.00 | 5.42%
ProdDW | 7248.00 | 91.19%
tempdb | 269.00 | 3.38%
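The window-function math in the query above is just each database's share of the grand total. A quick sketch reproducing the sample output's percentages from its TotalMB column:

```python
def io_share_percent(totals_mb):
    """Each database's share of total server I/O, as percentages."""
    grand_total = sum(totals_mb)
    return [round(mb / grand_total * 100, 2) for mb in totals_mb]


# Sample table above: ProdDB, ProdDW, tempdb
print(io_share_percent([431.00, 7248.00, 269.00]))  # → [5.42, 91.19, 3.38]
```
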

Query IO by Database and Datafile

WITH IO_Per_DB_Per_File AS (
  SELECT DB_NAME(dmivfs.database_id) AS Db,
         CONVERT(DECIMAL(12,2), SUM(num_of_bytes_read + num_of_bytes_written) / 1024 / 1024) AS TotalMb,
         CONVERT(DECIMAL(12,2), SUM(num_of_bytes_read) / 1024 / 1024) AS TotalMbRead,
         CONVERT(DECIMAL(12,2), SUM(num_of_bytes_written) / 1024 / 1024) AS TotalMbWritten,
         CASE WHEN dmmf.type_desc = 'ROWS' THEN 'Data File'
              WHEN dmmf.type_desc = 'LOG' THEN 'Log File'
         END AS DataFileOrLogFile
  FROM sys.dm_io_virtual_file_stats(NULL, NULL) dmivfs
  JOIN sys.master_files dmmf
    ON dmivfs.file_id = dmmf.file_id
   AND dmivfs.database_id = dmmf.database_id
  GROUP BY dmivfs.database_id, dmmf.type_desc
)
SELECT Db, TotalMb, TotalMbRead, TotalMbWritten, DataFileOrLogFile,
       CAST(TotalMb / SUM(TotalMb) OVER() * 100 AS DECIMAL(5,2)) AS [I/O]
FROM IO_Per_DB_Per_File
ORDER BY [I/O] DESC;

https://learn.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-io-virtual-file-stats-transact-sql?view=azuresqldb-current

Db | TotalMb | TotalMbRead | TotalMbWritten | DataFileOrLogFile | I/O%
ProdDB | 431 | 398 | 33 | Data File | 3.41%
ProdDB | 42 | 6 | 36 | Log File | 1.12%
ProdDW | 7248 | 4902 | 2346 | Data File | 79.84%
ProdDW | 231 | 55 | 176 | Log File | 3.33%
tempdb | 269 | 147 | 123 | Data File | 3.38%

The How of Cloud IO

Cloud Scalability Comes with Limits
Platform as a Service (PaaS) and Infrastructure as a Service (IaaS) have limits placed on them to ensure no one resource is over-allocated. These work great for smaller workloads, but scaling for the whales rarely happens there. Infrastructure as a Service CAN meet the largest of workloads, those that fall outside the PaaS and SaaS solutions, but only if you know how.

Understand IaaS VM Series
https://azure.microsoft.com/en-us/pricing/details/virtual-machines/series/
- A- and B-series won’t work for high IO workloads and should be avoided.
- D-series, for general use, can work for some; consider matching the SKU series to production, but with fewer resources.
- L- and H-series are outliers for database workloads but may work if redundancy is built into the solution. Take care!
- Identify workload needs.
- E-series and M-series are the most common VMs in the database industry.
- E-series currently has the highest IO limits in Azure.
- M-series is for VLDBs (very large databases) or heavy memory usage, especially the new Mv3 series.
Both will still require recommended practices to meet HIGH IO workloads.

Virtual machines for database workloads

Type | vCPUs | vRAM | Max throughput (MBps) for SCSI storage | Max IP egress rate (Mbps) for NFS, iSCSI storage
Dds_v5 | 2-64 | 8-256 | 125-2000 | 1000-30000
Eds_v5 | 2-80 | 16-504 | 125-2000 | 1000-30000
Ms_v2 | 32-192 | 875-4096 | 500-2000 | 8000-30000
Mv2 | 208-416 | 2840-11400 | 1000-2000 | 16000-32000
Eads_v5 | 2-96 | 8-384 | 125-4000 | 12500-35000
Ebds_v5* | 2-104 | 16-672 | 125-8000 | 1000-100000
Msv3/Mdsv3** | 12-176 | 240-2794 | 390-4000 | 4000-40000

*May not be available in all regions currently; fall back to Eds_v5 if unavailable.
**In preview, not GA.
https://azure.microsoft.com/en-us/pricing/details/virtual-machines/series

When One VM Equals Too Much Licensing: Constrained VMs
https://docs.microsoft.com/en-us/azure/virtual-machines/windows/constrained-vcpu
- Allows isolation of vCPU-based application licensing for database and app workloads.
- Matched to existing series VMs in the Azure Pricing Calculator.
- Share storage between databases or apps using network storage options like Silk or ANF.
- Poorly named: the constrained vCPU count is the number of active vCPUs on the VM.
- Carefully match workloads on IO and memory, not just vCPU usage, when sizing.

Specialized Constrained vCPU VMs

Name | vCPUs | Specs
Standard_E4-2ds_v5 | 2 | Same as E4ds_v5
Standard_E8-4ds_v5 | 4 | Same as E8ds_v5
Standard_E8-2ds_v5 | 2 | Same as E8ds_v5
Standard_E16-8ds_v5 | 8 | Same as E16ds_v5
Standard_E16-4ds_v5 | 4 | Same as E16ds_v5
Standard_E32-16ds_v5 | 16 | Same as E32ds_v5
Standard_E32-8ds_v5 | 8 | Same as E32ds_v5
Standard_E64-32ds_v5 | 32 | Same as E64ds_v5

https://docs.microsoft.com/en-us/azure/virtual-machines/windows/constrained-vcpu

Storage is SEPARATE and Important
Separate storage offers us the opportunity to create the right combination for success in IaaS.

Storage Considerations

Physical Optimizations and Limits in the Cloud

What is Great/High IO?
IO is measured with the following three values: IO requests (IOPS), IO throughput (MBps or GBps), and latency.

Type | High | Average | Low
IOPS | 1 million/sec | 300K/sec | 60K/sec
Throughput | 10 GBps | 500 MBps | 25 MBps
Latency | <7 ms | <4 ms | <1.5 ms
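IOPS and throughput are linked by the I/O size: throughput (MBps) = IOPS × I/O size. A small sketch of that conversion (the 8 KB value below is SQL Server's page size; the IOPS figure is illustrative):

```python
def throughput_mbps(iops: float, io_size_kb: float) -> float:
    """Convert an IOPS figure at a given I/O size to MBps."""
    return iops * io_size_kb / 1024


# 300K IOPS of 8 KB page reads is roughly 2.3 GBps of throughput,
# so a device quoted only on IOPS can still hit a throughput cap.
print(throughput_mbps(300_000, 8))  # → 2343.75
```
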

IO Throttling
- Why does it happen? No, you can’t have all the resources for yourself.
- What all can be involved? It’s not just the database.
- How to identify it?
- What to do when it is identified?
Just because you have a big VM doesn’t mean you have high throughput!
https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-memory?toc=/azure/virtual-machines/linux/toc.json&bc=/azure/virtual-machines/linux/breadcrumb/toc.json

Database Savvy Storage

Offering | Protocol | Max throughput (MBps) | Min latency (ms) | Pricing | Notes
Premium SSD | SCSI | 900/device, 2000/VM cumulatively | 0.7 w/ host caching, 2 w/o | $ | Snapshot capable, bursting capable, LRS/ZRS redundancy, limited IO ceiling
UltraDisk | SCSI | 2000/device, 4000/VM cumulatively | 1-4 | $$-$$$ | Recent addition of snapshots, LRS redundancy, limited IO ceiling
Azure Files Premium | NFS v4.1 | 100 + (0.1 * GiB provisioned, so varies) | 1-4 | $$ | No snapshots, LRS/ZRS redundancy, not for most relational workloads
Azure NetApp Files | NFS v3.0/v4.1 | 10400 | 0.25 | $$$ | Snapshot capable for some platforms, LRS/GRS redundancy
Silk | iSCSI | 11000 (15000+ with data reduction features) | 0.5 | $$ | Multi-cloud, snapshots and thin cloning, high compression/dedupe
Pure Cloud Block Store | iSCSI/NFS | 4000+ (need numbers on data reduction capabilities) | 0.5 | $$ | Hybrid and multi-cloud, snapshots and thin provisioning, AVS solution

Premium SSD: Host Caching vs. Bursting

Name | Capacity (GiB) | IOPS per disk | Max burstable IOPS | Throughput per disk (MB/s) | Max burstable throughput (MB/s) | Cache limit per disk (GiB)
P1 | 4 | 120 | 3,500 | 25 | 170 | 4
P2 | 8 | 120 | 3,500 | 25 | 170 | 8
P3 | 16 | 120 | 3,500 | 25 | 170 | 16
P4 | 32 | 120 | 3,500 | 25 | 170 | 32
P6 | 64 | 240 | 3,500 | 50 | 170 | 64
P10 | 128 | 500 | 3,500 | 100 | 170 | 128
P15 | 256 | 1,100 | 3,500 | 125 | 170 | 256
P20 | 512 | 2,300 | 3,500 | 150 | 170 | 512
P30 | 1,024 | 5,000 | 30,000 | 200 | 1,000 | 1,024
P40 | 2,048 | 7,500 | 30,000 | 250 | 1,000 | 2,048
P50 | 4,096 | 7,500 | 30,000 | 250 | 1,000 | 4,095
P60 | 8,192 | 16,000 | 30,000 | 500 | 1,000 | 4,095
P70 | 16,384 | 18,000 | 30,000 | 750 | 1,000 | 4,095
P80 | 32,767 | 20,000 | 30,000 | 900 | 1,000 | 4,095

Source: Managed disks pricing

Types of Cache Settings
- Available to Premium Storage: a multi-tier caching technology, aka BlobCache.
- OS disk: ReadWrite is fine (and the default), but not for datafiles.
- ReadOnly cache is, as it caches reads while letting writes pass through to disk.
- Limit of 4,095 GiB of cache per individual premium disk. As a result, any disk above a P40, allocated in its entirety, will silently disable read caching.
- Larger disks are preferably used without caching; otherwise additional space is wasted. For a P50, just allocate 4,095 of the 4,096 GiB.
- Or use smaller disks and choose to stripe and mirror.
- Verify SKU availability.

Ultra Disk Offerings

Disk size (GiB) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1,024-65,536 (in increments of 1 TiB)
IOPS range | 1,200 | 2,400 | 4,800 | 9,600 | 19,200 | 38,400 | 76,800 | 160,000 | 160,000
Throughput range (MB/s) | 300 | 600 | 1,200 | 2,000 | 2,000 | 2,000 | 2,000 | 2,000 | 2,000

Ultra Disk Pricing
GiB * 0.000164, MBps * 0.34967, IOPS * 0.04964, vCPU * 4.38
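Plugging the slide's four multipliers into a quick cost estimator (the rates and billing period are as quoted on the slide; check the current Azure price sheet before relying on them, and note the vCPU term is the per-vCPU reservation charge for ultra-capable VMs):

```python
def ultra_disk_cost(gib: float, mbps: float, iops: float, vcpus: int) -> float:
    """Estimate Ultra Disk cost from the slide's per-unit rates.

    Rates are taken verbatim from the slide and will drift over
    time; they are illustrative, not a quote.
    """
    return gib * 0.000164 + mbps * 0.34967 + iops * 0.04964 + vcpus * 4.38


# 1 TiB disk provisioned at 500 MBps / 10,000 IOPS on an 8-vCPU VM:
# the IOPS and throughput provisioning, not capacity, dominate the bill.
print(round(ultra_disk_cost(1024, 500, 10_000, 8), 2))  # → 706.44
```
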

Ultra Disks
Often the first recommendation by infrastructure teams. Be aware of the limitations before recommending them for database workloads:
- Only supports un-cached reads and un-cached writes.
- Doesn’t support disk snapshots; you must use traditional backup methods.
- No VM images, OS disks, availability sets, Azure Dedicated Hosts, or Azure disk encryption.
- No integration with Azure Backup or Azure Site Recovery.*
- Offers up to 16 TiB per region per subscription unless raised via support.
- Isn’t available in all regions; must use a minimum of a single AZ.
Transaction logs, yeah!
https://docs.microsoft.com/en-us/azure/virtual-machines/disks-enable-ultra-ssd#ga-scope-and-limitations

Premium SSD v2 (Pv2)
The bridge between Premium SSD and Ultra Disk.
- Ability to allocate storage dynamically; dynamically set IO and size.
- Great for smaller workloads and tempdb.
- Somewhat limited on regions, but almost globally available at this time.
- Held to the VM throttling limit.
- Can’t be used in availability sets.
- Can’t be used with ASR, so secondary backup methods or snapshots must be used for the relational system residing on it.
- No host caching, and limitations on scalability options.

Premium SSD v2
- Must be used with zonal VMs.
- Has the same VM-level IO limits as Premium SSD and Ultra Disk.
- Can be allocated like Ultra Disk by IO and by size needs (solves specific, smaller-demand IO challenges).
- No encryption, so secondary storage encryption is required.
- Can’t be used in availability sets.
- Azure Backup: yes. Azure Site Recovery: no (limitation on IO churn in ASR; it shouldn’t be used for relational workloads anyway).
- Outage required to expand.

Network Attached Storage Solutions for High IO

Winning with Network Storage Solutions in Azure
- High IOPS: consider higher IO solutions like Azure NetApp Files, Silk, Pure Storage, or even a PaaS service like Tessell for lesser-demand workloads.
- Consider disk striping of smaller disks and parallel processing at the database level.
- Offload backups with secondary backup solutions.
- Some have additional compression and deduplication for additional IO capabilities.
- Thin provisioning saves time and reduces overall cloud spending.
- Cloning options add simplicity to the development lifecycle and automation.

NetApp Files on Azure (ANF)

The Numbers for ANF
Fully managed, PaaS, Microsoft Azure storage service; all-flash bare-metal storage. Only dependent on the NIC, not the VM. Available in Standard, Premium (common), and Ultra (optimal) tiers. ANF is native to Azure.

 | Azure Files | Premium Files | Azure NetApp Files | Premium Disk
Performance | 1K IOPS | 100K IOPS | 320K IOPS | 20K IOPS
Capacity pool | 5 TB | 100 TB | 500 TB | 32 TB
AD integration | Azure AD | N/A | Bring your own AD / Azure AD | N/A
Protocol | SMB | SMB | NFS & SMB | Disk
Data protection | LRS only | Snapshots | Backup tools | Snapshots

FAQs About Azure NetApp Files | Microsoft Docs

Pure Cloud Block Store for Azure
Dependent on the VM limit in the cloud; high IO on-premises. Very focused on hybrid and DR solutions, along with snapshots, thin provisioning, and multiple VMs attaching to one storage unit. Make sure to look at AVS with CBS for high IO on SQL Server: it doesn’t have the ANF limit for AVS!
https://www.purestorage.com/solutions/applications/microsoft.html

Silk on Azure
[Architecture diagram] A Silk Data Pod has a scalable performance layer of up to 8 c.node VMs (Azure D64s v5 / L64s v3: 64 vCPU, 256/512 GiB RAM, optional 14 TiB local NVMe) and a scalable capacity layer of 1 to 4 protected m.nodes (sizing options of 21/43/86 TB each), each backed by up to 16 d.node VMs (Azure Lsv2/3 and Lasv3: L8/L16/L32 sizes, 8-32 vCPU, 64-256 GB RAM, 2/4/8 TiB NVMe). Customer database VMs (transactional, analytical, mixed) connect to the pod over iSCSI or NVMe/TCP; the pod is deployed from the Azure Marketplace via Silk Flex.
https://silk.us/

Multiple Silk Solutions in Azure
[Diagram] Four combinations, by platform and read path:
- Silk on Pv2 with Metadata Pipeline for Reads / Silk on Pv2 with Read Acceleration Engine (new!)
- Silk on VMs with Metadata Pipeline for Reads / Silk on VMs with Read Acceleration Engine (new!)
Metadata Pipeline for Reads targets heavy-IOPS, latency-sensitive transactional workloads; Read Acceleration Engine targets heavy-throughput analytical environments.
Silk on Pv2: unmatched resiliency, over 11 9’s of expected durability.¹ Suits regulated industries like finance and health care.
Silk on VMs: ultimate performance, up to 26.4 GB/s or 2.1M IOPS with sub-millisecond latency. Suits operational industries like retail and SaaS.
¹ This does not constitute a contractual SLA or guarantee.
https://silk.us/performance/

Additional Tips/Tricks

A Few More Things To Do
I’m going to avoid the RAID 5 topic, as I think we already know it’s a no-go.
- Use the latest patch sets and newest version of software if IaaS.
- Find ways to reduce the amount of transaction log data generated.
- Reduce/remove log flushing from synchronous HA technologies.
- Break up applications and databases across multiple ExpressRoute and VPN networks/NICs to improve network IO.
- Consider using attached NVMe for specialized wait issues, such as tempdb usage.
- Consider rewriting apps to do more application-level caching to hit the database less often.
- Use iSCSI, NVMe-oF, or PCIe for storage connectivity for better performance.
- Size virtual disks as fixed rather than dynamically expanding.
- Consider jumbo frames if the network infrastructure supports it (we call these husky frames in Azure…), which allow an increased data payload for each packet.

In Summary