How GPU Computing saved me at work PyData talk
About This Presentation
PyData Talk 2019 – GPU computing for data tasks
Size: 1.31 MB
Language: en
Added: Sep 05, 2024
Slides: 24 pages
Slide Content
Slide 1
© 2018 Walmart International | Confidential | For Internal Use Only
How GPU Computing literally saved me at work!
Abhishek Mungoli, Data Scientist, MTech (IIIT Hyderabad)
Date: Aug 3rd, 2019
Prepared exclusively for PyData '19
Slide 2
Content
• Origin of GPU
• CPU vs GPU
• GPU vs Hadoop
• Intro to CUDA, Numba and important terminologies
• CUDA Architecture
• Use Case 1 – Items at Walmart (task details, complexity and takeaways)
• Use Case 2 – Parallel Optimization (task details, complexity and takeaways)
• Advancement in GPU Field
• Acknowledgement
Slide 3
Origin of GPU
• A graphics card, or GPU, is a specialized, highly parallel processor designed to perform graphics-related calculations.
• The term was popularized in 1999 with the launch of the GeForce 256.
• Graphics hardware had existed well before that.
• With growing customer application needs and demand for better UI/UX experiences, rapid calculation became a necessity.
• Why does a GPU need memory?
• Whether you are watching a movie, playing a game, or even moving a mouse pointer, everything is binary calculation inside the CPU and GPU (the GPU is responsible for rendering the display output).
• To store the results of this huge number of calculations, and to perform various operations, the CPU/GPU needs memory.
Slide 4
CPU v/s GPU
CPU: few cores; optimized for single-thread performance; low latency tolerance.
GPU: thousands of cores; many concurrent threads; high latency tolerance.
• When should you use a GPU?
Slide 5
Who's better?
• Throughput (GPU-like): deliver many packages within a reasonable timescale.
• Latency (CPU-like): deliver a single package as soon as possible.
Slide 6
GPU v/s Hadoop
GPU
• Threads work on the same instruction in lock-step fashion (SIMD architecture).
• CUDA can be utilised for workloads with extensive thread-level parallelism.
• A GPU has a limited amount of shared memory and global memory.
• It is not advisable to solve a huge problem with just one GPU.
Hadoop
• Used for solving large problems on commodity hardware.
• Utilises the MapReduce paradigm.
• One doesn't have to worry about distributing data or managing corner cases.
• Includes a file system (HDFS) for storing data on the compute nodes.
Slide 7
Intro & Terminologies
• CUDA: a parallel computing platform and application programming interface (API) model created by Nvidia.
• Numba: supports CUDA GPU programming by directly compiling a restricted subset of Python code into CUDA kernels and device functions following the CUDA execution model.
• Host: the CPU.
• Device: the GPU.
• Host memory: the system's main memory.
• Device memory: onboard memory on a GPU card.
• Kernel: a GPU function launched by the host and executed on the device.
• Device function: a GPU function executed on the device which can only be called from the device (i.e. from a kernel or another device function).
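In Numba, a kernel is launched from the host as `kernel[blocks_per_grid, threads_per_block](args)`. A minimal sketch of the launch-size arithmetic in plain Python (no GPU required; the helper name is illustrative, not part of any API):

```python
import math

def launch_config(n, threads_per_block=128):
    """1-D CUDA launch configuration: enough blocks so that
    blocks_per_grid * threads_per_block covers all n elements."""
    blocks_per_grid = math.ceil(n / threads_per_block)
    return blocks_per_grid, threads_per_block

# 1000 elements with 128 threads per block need 8 blocks (8 * 128 = 1024
# threads); the 24 surplus threads are masked off inside the kernel by a
# bounds check such as `if i < n`.
```

The same ceiling division appears in virtually every CUDA host program, whatever the language.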
Slide 8
CUDA Architecture
• Thread
• A scheduled chain of instructions running on a CUDA core.
• Threads execute in groups of 32 (a warp) that run the same instruction in lock-step.
• Each thread uses its index to access elements in the array.
• The collection of all threads cooperatively processes the entire data.
• Block
• A group of threads.
• Blocks execute either concurrently or serially, in no particular order.
• The __syncthreads() function provides a barrier so that threads within a block can coordinate and communicate safely.
• Grid
• A group of blocks.
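The thread/block/grid hierarchy above reduces to index arithmetic: each thread derives a global index from its block and thread coordinates and processes the element at that index. A plain-Python simulation of the 1-D case (this is what `cuda.grid(1)` computes in Numba; on real hardware the loop bodies run concurrently as separate threads):

```python
def simulate_grid(blocks_per_grid, threads_per_block, n, out):
    """Simulate every thread in a 1-D CUDA grid incrementing its own
    element of `out`; on a GPU these iterations run as parallel threads."""
    for block_idx in range(blocks_per_grid):
        for thread_idx in range(threads_per_block):
            i = block_idx * threads_per_block + thread_idx  # cuda.grid(1)
            if i < n:  # bounds check masks the surplus threads
                out[i] += 1

data = [0] * 10
simulate_grid(blocks_per_grid=3, threads_per_block=4, n=10, out=data)
# Each element is incremented exactly once: the 12 simulated threads
# cooperatively cover the 10-element array without overlap.
```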
Slide 9
A walkthrough of a CUDA kernel
https://colab.research.google.com/drive/1NQcmlghMJr5SLJvz2rmX6hcAhPweAMjK
Slide 10
Infrastructure Used
• Nvidia Volta V100 16 GB GPU
• Python
• Numba
Slide 11
Use Case 1 – Item Similarity
• Recommendation systems
• Item alternatives
• Assortment
• Customer basket customization
Challenge
• Number of items: 10⁵
• On a subset of 10³ items: 17 seconds
• On the 10⁵ item set: 1700 × 100 seconds = 47.2 hours ≈ 2 days
Problem Statement:
• Finding the top-3 similar items to each item present in the data.
Slide 12
Task Complexity
Size
• Number of items: 10⁵
• Dimension of each item: 64-D vector
• Task: identify the top-3 similar items to each item in the set
Computation (Cosine Similarity)
• Finding the items with highest similarity, for one item: O(n·k)
• Finding the items with highest similarity, for all items: O(n·n·k)
Time
• On a subset of 10³ items: 17 seconds (≈3.7 × 10⁶ operations per second)
• On a subset of 10⁴ items: 1700 seconds ≈ 28 minutes
• On the 10⁵ item set: 1700 × 100 seconds = 2834 minutes = 47.2 hours ≈ 2 days
• GPU time ≈ 20 seconds
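The O(n·n·k) search parallelizes cleanly because each item's top-3 list is independent of every other item's. A small CPU reference of the per-item work in plain Python (the function names are illustrative; on the GPU, each call of `top3_for` would be one thread's job):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors; O(k) per pair."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top3_for(query_idx, items):
    """Indices of the 3 items most similar to items[query_idx]; O(n*k).
    Running this for all n items gives the O(n*n*k) cost from the slide."""
    sims = [(cosine(items[query_idx], v), i)
            for i, v in enumerate(items) if i != query_idx]
    return [i for _, i in sorted(sims, reverse=True)[:3]]

embeddings = [[1, 0], [2, 0], [0, 1], [1, 1]]  # toy 2-D stand-ins for 64-D vectors
# Item 0 is most similar to item 1 (same direction), then item 3, then item 2.
```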
Slide 13
A walkthrough of the solution
https://colab.research.google.com/drive/16yUzs2NBhB9aL63n5xFfd1kDz6MVgnE_
Slide 14
Solution Framework
Assumptions
• Embeddings can be obtained at different levels of the hierarchy.
• Similar items have similar embeddings.
Work Flow
Slide 15
Take Away
Key Findings
• The CPU estimated time of 2 days was brought down to 20.5 seconds with the use of a GPU.
• This was possible only because of the nature of the task: finding the top-3 similar items to item 'A' is independent of finding the top-3 similar items to item 'B'.
• To use GPUs effectively, identify the parallelism in the task and exploit the GPU for the speed-up.
• A system/module can run some components on the CPU and some on the GPU, as need dictates.
https://medium.com/walmartlabs/how-gpu-computing-literally-saved-me-at-work-fc1dc70f48b6
Slide 16
Use Case 2 – Parallel Optimization
• Finance/Insurance domain
• Retail/Ecommerce domain
• Search optimization
• Real Estate/Health Care domain
Challenge
• Number of parallel optimizations, one per subproblem (shop): 10⁵
• On a subset of 10² shops: 11 seconds
• On the 10⁵ shop set: 11 × 1000 seconds ≈ 180 minutes ≈ 3 hours
Problem Statement:
• Run the optimization for each subproblem (shop) in parallel. A thief has a bag of a different size for each shop he intends to steal from; for every item in a shop, its volume and value are provided.
Slide 17
Task Complexity
Size
• Number of sub-problems (shops): 10⁵
• Each shop has a maximum of 100 items
• Task: identify which items to pick for each shop; a separate knapsack for each shop
Computation (Dynamic Programming)
• Finding which items to pick from one shop: O(items × bag size)
• Finding which items to pick from n shops: O(n × items × bag size)
Time
• On a subset of 10² shops: 11 seconds (≈10⁶ operations per second)
• On the 10⁵ shop set: 11 × 1000 seconds ≈ 180 minutes ≈ 3 hours
• GPU time ≈ 11 milliseconds
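Each shop is an independent 0/1 knapsack, which is why 10⁵ of them map so well onto GPU kernels: one solve per thread. A plain-Python sketch of the per-shop O(items × bag size) dynamic program (the function name is illustrative):

```python
def knapsack(volumes, values, bag_size):
    """Maximum total value carriable from one shop, each item used at
    most once; the classic O(len(volumes) * bag_size) 0/1-knapsack DP."""
    best = [0] * (bag_size + 1)  # best[c] = max value within capacity c
    for vol, val in zip(volumes, values):
        # Iterate capacities downwards so each item is counted at most once.
        for c in range(bag_size, vol - 1, -1):
            best[c] = max(best[c], best[c - vol] + val)
    return best[bag_size]

# Bag of size 8, items with volumes [3, 4, 5] and values [30, 50, 60]:
# the best choice is the volume-3 and volume-5 items, total value 90.
```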
Slide 18
A walkthrough of the solution
https://colab.research.google.com/drive/1Iowghy4Hwt11avl0xyYjXsGvpbCGM4Yl#scrollTo=a4U2xIpL5G30
Slide 19
Take Away
Key Findings
• The CPU estimated time of 3 hours was brought down to 11 milliseconds with the use of a GPU.
• This was possible only because of the nature of the task: multiple independent optimizations ran in parallel in GPU kernels.
• To use GPUs effectively, identify the parallelism in the task and exploit the GPU for the speed-up.
• A system/module can run some components on the CPU and some on the GPU, as need dictates.
Slide 20
Advancement in GPU Field
• RAPIDS
• The RAPIDS data science framework includes a collection of libraries for executing end-to-end data science pipelines entirely on the GPU.
• It is designed to have a familiar look and feel to data scientists working in Python.
• Some RAPIDS projects include:
• cuDF, a pandas-like data-frame manipulation library;
• cuML, a collection of machine learning libraries providing GPU versions of algorithms available in scikit-learn;
• cuGraph, a NetworkX-like API that integrates seamlessly into the RAPIDS data science platform.
https://rapids.ai/about.html
Slide 21
Acknowledgement
• Izzatbir Singh, Data Scientist, Walmart Labs, for helping me prepare the content and get conference-ready.
• Ayush Kumar, Software Engineer, Walmart Labs, for helping with the right GPU coding practices.
• MLP (Machine Learning Platform) from Walmart Labs, in terms of the infrastructure provided.
Slide 22
Q&A
Slide 23
References
•https://en.wikipedia.org/wiki/Time_complexity
•https://en.wikipedia.org/wiki/Graphics_processing_unit
•https://blogs.nvidia.com/blog/2009/12/16/whats-the-difference-between-a-cpu-and-a-gpu/
•https://www.datascience.com/blog/cpu-gpu-machine-learning
•https://qr.ae/TWIuic
•https://numba.pydata.org/numba-doc/latest/index.html
•https://en.wikipedia.org/wiki/CUDA
•https://www.nvidia.in/object/cuda-parallel-computing-in.html
•https://llpanorama.wordpress.com/2008/06/11/threads-and-blocks-and-grids-oh-my/
•https://qr.ae/TWIwEW
Slide 24
Thank You
Abhishek Mungoli, Data Scientist, Walmart.
LinkedIn: https://www.linkedin.com/in/abhishek-mungoli-39048355/
Medium: https://medium.com/@mungoliabhishek81
Instagram: https://www.instagram.com/simplyspartanx/