How GPU Computing saved me at work PyData talk

© 2018 Walmart International | Confidential | For Internal Use Only
How GPU Computing literally saved me at work!
Abhishek Mungoli, Data Scientist, MTech (IIIT, Hyd.)
Date: Aug 3rd, 2019
Prepared exclusively for PyData ’19

Content
•Origin of GPU
•CPU v/s GPU
•GPU v/s Hadoop
•Intro to CUDA, Numba and important terminologies
•CUDA Architecture
•Use Case 1 – Items at Walmart (Task Details, Complexity and Takeaways)
•Use Case 2 – Parallel Optimization (Task Details, Complexity and Takeaways)
•Advancement in GPU Field
•Acknowledgement

Origin of GPU
•A GPU (graphics processing unit), usually shipped on a graphics card, is a specialized processor designed to perform graphics-related calculations.
•Marketed as a "GPU" in 1999 with the launch of the GeForce 256.
•Graphics hardware had existed before that.
•With growing application needs and expectations of a better UI/UX, fast calculation has become a necessity.
•Why does a GPU need memory?
•Whether you are watching a movie, playing a game or even moving a mouse pointer, everything comes down to binary calculations inside the CPU and GPU (the GPU is responsible for rendering the display output).
•To store the results of such a huge number of calculations, and to perform various operations, the CPU/GPU needs memory.

CPU v/s GPU
CPU | GPU
Few cores | Thousands of cores
Single-thread optimization | Multiple concurrent threads
Low latency tolerance | High latency tolerance
•When to use a GPU?

Who’s better?
•Deliver many packages within a reasonable timescale
•Deliver a package as soon as possible

GPU v/s Hadoop
GPU
•A GPU runs the same instruction across many threads in lock-step (SIMD architecture).
•CUDA can be utilised for workloads with extensive traffic across threads.
•A GPU has a limited amount of shared memory and global memory.
•It is not advisable to use just one GPU to solve a huge problem.
Hadoop
•Hadoop is used for solving large problems on commodity hardware.
•Utilises the MapReduce paradigm.
•One doesn't have to worry about distributing data or managing corner cases.
•Includes a file system, HDFS, for storing data on compute nodes.
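For readers new to the MapReduce paradigm mentioned on this slide, here is a minimal single-process sketch in plain Python. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase) and the word-count task are illustrative assumptions, and a real cluster would run the map and reduce steps on many nodes.

```python
from collections import defaultdict

def map_phase(record):
    # Map: emit one (key, value) pair per word in the record.
    return [(word, 1) for word in record.split()]

def shuffle(pairs):
    # Shuffle: group all values that share a key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine the values for one key into a single result.
    return key, sum(values)

# Hypothetical input records, made up for illustration.
records = ["gpu cpu gpu", "hadoop gpu"]
mapped = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Each map call and each reduce call is independent of the others, which is what lets a framework spread them across commodity machines.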

Intro & Terminologies
•CUDA: a parallel computing platform and application programming interface (API) model created by Nvidia.
•Numba: supports CUDA GPU programming by directly compiling a restricted subset of Python code into CUDA kernels and device functions following the CUDA execution model.
•Host: the CPU
•Device: the GPU
•Host memory: the system main memory
•Device memory: onboard memory on a GPU card
•Kernel: a GPU function launched by the host and executed on the device
•Device function: a GPU function executed on the device which can only be called from the device (i.e. from a kernel or another device function)

CUDA Architecture
•Thread
•A scheduled chain of instructions running on a CUDA core.
•Threads are scheduled in groups of 32 (a warp) that execute the same instruction together.
•Each thread uses its index to access elements in an array.
•The collection of all threads cooperatively processes the entire data.
•Block
•A group of threads.
•Blocks execute either concurrently or serially, in no particular order.
•The __syncthreads() function (cuda.syncthreads() in Numba) is a barrier: each thread in a block waits until all threads in that block have reached it, which lets them coordinate safely.
•Grid
•A group of blocks.
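To make the thread / block / grid hierarchy concrete, the sketch below emulates a 1-D kernel launch serially in plain Python, so it runs without a GPU. The launch helper and add_one_kernel are hypothetical names used for illustration. The index formula is the standard CUDA one (blockIdx.x * blockDim.x + threadIdx.x), and on a real GPU the iterations would run in parallel and in no guaranteed order.

```python
import math

def launch(kernel, n_blocks, threads_per_block, *args):
    # Emulate a 1-D grid: call the kernel once per (block, thread) pair.
    for block_idx in range(n_blocks):
        for thread_idx in range(threads_per_block):
            kernel(block_idx, threads_per_block, thread_idx, *args)

def add_one_kernel(block_idx, block_dim, thread_idx, data):
    # Global index of this thread within the whole grid.
    i = block_idx * block_dim + thread_idx
    # Guard: the last block may contain threads past the end of the data.
    if i < len(data):
        data[i] += 1

data = list(range(10))
threads_per_block = 4
# Round up so that every element is covered by some thread.
n_blocks = math.ceil(len(data) / threads_per_block)
launch(add_one_kernel, n_blocks, threads_per_block, data)
```

With 10 elements and 4 threads per block, the grid needs 3 blocks, and the two surplus threads in the last block are skipped by the bounds check.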

A walkthrough to CUDA Kernel
https://colab.research.google.com/drive/1NQcmlghMJr5SLJvz2rmX6hcAhPweAMjK

Infrastructure Used
•Nvidia Volta V100 16GB GPU
•Python
•Numba

Use Case 1 – Item Similarity
•Recommendation systems
•Item alternatives
•Assortment
•Customer basket customization
Challenge
•Number of items: 10⁵
•On a subset of 10³ items: 17 seconds
•On the 10⁵ item set: 1700 * 100 seconds = 47.2 hours ~ 2 days
Problem Statement:
•Finding the top 3 similar items to each item present in the data.
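The problem statement can be sketched as a brute-force CPU baseline in plain Python. This is not the author's notebook code: it assumes items are represented as non-zero embedding vectors compared with cosine similarity (as the talk's complexity slide states), and the function names and the tiny 2-D data are made up for illustration.

```python
import math

def cosine(u, v):
    # Cosine similarity; assumes neither vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_similar(embeddings, k=3):
    # For each item, score it against every other item: O(n*n*k_dim) overall.
    result = []
    for i, u in enumerate(embeddings):
        scores = [(cosine(u, v), j) for j, v in enumerate(embeddings) if j != i]
        # Highest similarity first; ties broken by the lower item index.
        scores.sort(key=lambda s: (-s[0], s[1]))
        result.append([j for _, j in scores[:k]])
    return result

# Hypothetical 2-D embeddings (the talk uses 64-D vectors).
items = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
    [0.7, 0.7],
]
neighbours = top_k_similar(items, k=3)
```

The outer loop body for item i never reads the result for any other item, which is the independence a GPU version can exploit by giving each item its own thread.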

Task Complexity
Size
•Number of items: 10⁵
•Dimension of each item: 64-D vector
•Task: identify the top 3 similar items to each item in the set
Computation
•Cosine similarity
•Finding the items with highest similarity for one item: O(n*k)
•Finding the items with highest similarity for all items: O(n*n*k)
Time
•On a subset of 10³ items: 17 seconds (~3.7 * 10⁶ operations per second)
•On a subset of 10⁴ items: 1700 seconds = ~28 minutes
•On the 10⁵ item set: 1700 * 100 seconds = ~2833 minutes = 47.2 hours ~ 2 days
•GPU time: ~20 seconds
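The CPU estimates on this slide follow from scaling the measured 17 seconds by the O(n*n*k) cost. A small sketch of that arithmetic, assuming the runtime grows exactly quadratically in the number of items with the vector dimension fixed at 64 (the helper name is hypothetical):

```python
def extrapolate_quadratic(measured_n, measured_seconds, target_n):
    # Scale a measured runtime assuming cost grows as n*n (dimension fixed).
    return measured_seconds * (target_n / measured_n) ** 2

t_10k = extrapolate_quadratic(1_000, 17.0, 10_000)     # seconds for 10^4 items
t_100k = extrapolate_quadratic(1_000, 17.0, 100_000)   # seconds for 10^5 items
minutes_10k = t_10k / 60
hours_100k = t_100k / 3600

# Operation count for the measured run: n * n * k with n = 10^3, k = 64.
ops_per_second = (1_000 * 1_000 * 64) / 17.0
```

Under that assumption the 10⁵ item set takes 170,000 seconds, which is where the figure of roughly 47 hours comes from.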

A walkthrough to Solution
https://colab.research.google.com/drive/16yUzs2NBhB9aL63n5xFfd1kDz6MVgnE_

Solution Framework
Assumptions
•Embeddings can be obtained at different levels of the hierarchy.
•Similar items have similar embeddings.
Work Flow (diagram not included)

Take Away
Key Findings
•The estimated CPU time of 2 days was brought down to 20.5 seconds with the use of a GPU.
•This was possible only because of the nature of the task: finding the top-3 similar items to Item ‘A’ is independent of finding the top-3 similar items to Item ‘B’.
•GPUs pay off when we identify the parallelism in a task and exploit it for speed-up.
•We can have a system/module with some components running on the CPU and some on the GPU, as needed.
https://medium.com/walmartlabs/how-gpu-computing-literally-saved-me-at-work-fc1dc70f48b6

Use Case 2 – Parallel Optimization
•Finance/Insurance Domain
•Retail/Ecommerce Domain
•Search Optimization
•Real Estate/Health Care Domain
Challenge
•Number of parallel optimizations, one per subproblem (shop): 10⁵
•On a subset of 10² shops: 11 seconds
•On the 10⁵ shop set: 11 * 1000 seconds ~ 180 minutes ~ 3 hours
Problem Statement:
•Run an optimization for each subproblem (shop) in parallel. A thief carries a bag of a different size for each shop he intends to steal from. Each item in a shop comes with its volume and value.
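The per-shop subproblem described here is a 0/1 knapsack. Below is a minimal CPU sketch of the dynamic-programming solution in plain Python, assuming integer volumes and bag sizes. It is not the author's notebook code, and the function name and the two example shops are made up for illustration.

```python
def knapsack(volumes, values, bag_size):
    # 0/1 knapsack by dynamic programming: O(items * bag_size) per shop.
    n = len(volumes)
    # best[i][c] = best total value using the first i items with capacity c.
    best = [[0] * (bag_size + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        vol, val = volumes[i - 1], values[i - 1]
        for c in range(bag_size + 1):
            best[i][c] = best[i - 1][c]              # skip item i-1
            if vol <= c:                             # or take it, if it fits
                best[i][c] = max(best[i][c], best[i - 1][c - vol] + val)
    # Walk the table backwards to recover which items were taken.
    chosen, c = [], bag_size
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= volumes[i - 1]
    return best[n][bag_size], sorted(chosen)

# Hypothetical shops, each with its own bag size.
shops = [
    {"volumes": [1, 3, 4, 5], "values": [1, 4, 5, 7], "bag_size": 7},
    {"volumes": [2, 2, 3], "values": [3, 4, 5], "bag_size": 4},
]
results = [knapsack(s["volumes"], s["values"], s["bag_size"]) for s in shops]
```

Each call to knapsack touches only its own shop's data, so the list comprehension over shops is the loop that a GPU version could spread across threads, one shop per thread.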

Task Complexity
Size
•Number of sub-problems (shops): 10⁵
•Each shop has a maximum of 100 items
•Task: identify which items to pick for each shop; a separate knapsack for each shop
Computation
•Dynamic programming
•Finding which items to pick from one shop: O(items*Bag Size)
•Finding which items to pick from ‘n’ shops: O(n*items*Bag Size)
Time
•On a subset of 10² shops: 11 seconds (~10⁶ operations per second)
•On 10⁵ shops: 11 * 1000 seconds ~ 180 minutes ~ 3 hours
•GPU time: ~11 milliseconds
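The CPU estimate here scales linearly with the number of shops, since the cost is O(n*items*Bag Size). A short sketch of the arithmetic, assuming the per-shop cost stays constant and taking the slide's timings at face value (the helper name is hypothetical):

```python
def extrapolate_linear(measured_n, measured_seconds, target_n):
    # Scale a measured runtime assuming cost grows linearly in n.
    return measured_seconds * (target_n / measured_n)

cpu_seconds = extrapolate_linear(100, 11.0, 100_000)  # 10^2 shops -> 10^5 shops
cpu_minutes = cpu_seconds / 60

gpu_seconds = 11e-3                                   # ~11 milliseconds, as reported
speedup = cpu_seconds / gpu_seconds
```

Under these assumptions the estimated CPU time is 11,000 seconds (about 183 minutes), and the ratio to the reported GPU time is on the order of a million.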

A walkthrough to Solution
https://colab.research.google.com/drive/1Iowghy4Hwt11avl0xyYjXsGvpbCGM4Yl#scrollTo=a4U2xIpL5G30

Take Away
Key Findings
•The estimated CPU time of 3 hours was brought down to 11 milliseconds with the use of a GPU.
•This was possible only because of the nature of the task: multiple optimizations ran in parallel, at the same time, in GPU kernels.
•GPUs pay off when we identify the parallelism in a task and exploit it for speed-up.
•We can have a system/module with some components running on the CPU and some on the GPU, as needed.

Advancement in GPU Field
•RAPIDS
•The RAPIDS data science framework includes a collection of libraries for executing end-to-end data science pipelines completely on the GPU.
•It is designed to have a familiar look and feel to data scientists working in Python.
•Some RAPIDS projects include:
•cuDF, a pandas-like data frame manipulation library;
•cuML, a collection of machine learning libraries that provide GPU versions of algorithms available in scikit-learn;
•cuGraph, a NetworkX-like API that seamlessly integrates into the RAPIDS data science platform.
https://rapids.ai/about.html

Acknowledgement
•Izzatbir Singh, Data Scientist, Walmart Labs, for helping me prepare the content and get conference ready.
•Ayush Kumar, Software Engineer, Walmart Labs, for helping with the right GPU coding practices.
•MLP (Machine Learning Platform) from Walmart Labs, for the infrastructure provided.

References
•https://en.wikipedia.org/wiki/Time_complexity
•https://en.wikipedia.org/wiki/Graphics_processing_unit
•https://blogs.nvidia.com/blog/2009/12/16/whats-the-difference-between-a-cpu-and-a-gpu/
•https://www.datascience.com/blog/cpu-gpu-machine-learning
•https://qr.ae/TWIuic
•https://numba.pydata.org/numba-doc/latest/index.html
•https://en.wikipedia.org/wiki/CUDA
•https://www.nvidia.in/object/cuda-parallel-computing-in.html
•https://llpanorama.wordpress.com/2008/06/11/threads-and-blocks-and-grids-oh-my/
•https://qr.ae/TWIwEW

Thank You
Abhishek Mungoli, Data Scientist, Walmart.
LinkedIn - https://www.linkedin.com/in/abhishek-mungoli-39048355/
Medium - https://medium.com/@mungoliabhishek81
Instagram - https://www.instagram.com/simplyspartanx/
