Databricks for Dummies

RodneyJoyce1 2,288 views 19 slides Jun 17, 2019

About This Presentation

A tech talk on what Azure Databricks is, why you should learn it, and how to get started. We'll use PySpark and talk about some real-life examples from the trenches, including the pitfalls of accidentally leaving your clusters running and receiving a huge bill ;)

After this you will hopefully swit...


Slide Content

Databricks for dummies
Rodney Joyce – Data & AI Consultant
LinkedIn: bit.ly/rodneyjoyce
© 2019

Agenda
- Objective / Data Science Series
- Boring theory
- Use Cases
- Demos
  - Getting started
  - Interactive Notebooks
  - ETL Batch job with Azure Data Factory
  - The humble dataframe
- Pricing
- Takeaways
- Questions

Objective & Data Science Series
- Databricks for dummies
- Titanic survival prediction with Databricks + Python + Spark ML
- Titanic with Azure Machine Learning Studio
- Titanic with Databricks + Azure Machine Learning Service
- Titanic with Databricks + MLS + AutoML
- Titanic with Databricks + MLflow
- Titanic with DataRobot
- Deployment, DevOps/MLOps and Operationalization

What is Azure Databricks, why you should learn it and how to get started…

How to get value out of your data?

Data Science Workflow: from Data to Value

Why is data science so hard?
- Data Science requires a lot of data engineering before it can succeed
- Siloed roles = unique terminology
- Fragmented technologies and solutions
- Model training requires huge scale some of the time
- The more data we use to train, the better
- Big Data infrastructure is expensive and costly to maintain
- Operational challenges: how to get a model to production?
- Problem = $$$ and slow to deliver value

Where Data Scientists spend most of their time

Solution: Unified Analytics Platform
- Unifies Data Science, Engineering and Business
- Removes silos, improves collaboration
- Supports multiple languages
- Business value:
  - Cost saving (resources, operational, training etc.)
  - Speed to market
  - Easy to extend for future ML
- Focus on extracting insights from your data and not on the infrastructure and processes around it!

What is Apache Spark?
- Open-source big data processing engine
- Massively scalable/distributed
- Highly extensible with many libraries
- Started in 2009 and written in Scala
- Supports 4 languages
- Designed for speed and ease of use
- In memory = faster than Hadoop
- You can run Spark on Azure directly

Databricks PaaS – Managed Spark Service on Azure

Use Case - Modern Analytics Platform

Use Case – Real Example

Demo 1 – Getting started with Databricks
- Create a Databricks service (Resource Group)
- Launching a workspace – AD Integration
- Menu Overview
- Workspaces
- Notebooks
- RBAC (Premium)
- Add a new cluster with auto-scale
- Installing libraries

Demo 2 – Interactive Notebook
- Notebook overview
- Attach to Cluster
- Cells
- Markdown
- Running a command & shortcuts
- Comments
- Revisions (Git)
- Data Tables
- Language choice (4!)
- Magic commands (e.g. Unix, md)
- Charting and Dashboards

Demo 3 – ETL Batch Job with Databricks
- Key Vault integration
- Storage integration
- Widgets/Parameters
- Nesting pipelines
- Scheduling a Job (Time based)
- ADF integration (Event driven)
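A minimal sketch of the widget and Key Vault steps from Demo 3, not the code shown in the talk. It assumes it runs in a Databricks notebook, where `dbutils` is predefined; the widget name `run_date`, the secret scope `demo-keyvault` and the key `storage-account-key` are all hypothetical.

```python
def load_batch_parameters(dbutils):
    """Read a job parameter and a secret inside a Databricks notebook.

    `dbutils` is the utility object that Databricks notebooks predefine.
    All names below (widget, scope, key) are made up for illustration.
    """
    # A text widget doubles as a notebook parameter: a scheduled job or an
    # Azure Data Factory activity can override the default value.
    dbutils.widgets.text("run_date", "2019-06-17")
    run_date = dbutils.widgets.get("run_date")

    # Assumes a secret scope backed by Azure Key Vault has already been
    # created; the secret never appears in the notebook source.
    storage_key = dbutils.secrets.get(scope="demo-keyvault", key="storage-account-key")

    return run_date, storage_key
```

In a notebook cell this would be called as `load_batch_parameters(dbutils)`; the returned key could then be used to configure access to a storage account.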

Demo 4 – The humble dataframe
- Import a Notebook
- Dataframes versus Datasets versus RDDs
- Download and read a CSV file and infer schema
- Intellisense
- Lazy Evaluation
- Actions versus Transformations
- Importing a library (e.g. Pandas)
- Immutability
- Static versus Dynamically typed
- Koalas / .NET for Apache Spark
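The dataframe ideas in Demo 4 (reading a CSV with an inferred schema, lazy transformations, actions and immutability) can be sketched roughly as follows. This is an illustration under assumptions, not the notebook from the talk: `spark` is the SparkSession that a Databricks notebook predefines, while the function name, `csv_path` and the `Age` column (Titanic-style data) are hypothetical.

```python
def explore_csv(spark, csv_path):
    """Read a CSV into a DataFrame and contrast transformations with actions.

    `spark` is the notebook's SparkSession; `csv_path` and the `Age`
    column are hypothetical and stand in for Titanic-style data.
    """
    # inferSchema=True asks Spark to guess column types from the data,
    # which costs an extra pass over the file.
    passengers = spark.read.csv(csv_path, header=True, inferSchema=True)

    # Transformations are lazy: these two lines only build a query plan.
    # DataFrames are immutable, so each call returns a new DataFrame and
    # `passengers` itself is left unchanged.
    adults = passengers.filter("Age >= 18")
    adults_flagged = adults.selectExpr("*", "Age >= 60 AS IsSenior")

    # Actions trigger execution of the plan built so far.
    adult_count = adults_flagged.count()

    # toPandas() is also an action. It collects rows to the driver, so the
    # result is limited to a few rows first; it needs pandas installed.
    sample_pdf = adults_flagged.limit(5).toPandas()

    return adult_count, sample_pdf
```

In a notebook this would be called as `explore_csv(spark, "<path to your CSV>")`. Nothing is filtered or computed until `count()` runs, which is the point of the "Actions versus Transformations" item above.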

Other Use-cases
- Connecting to Power BI – JDBC Connection/Premium
- Streaming Use-cases
- Open-source friendly! TensorFlow / scikit-learn etc.
- Machine Learning
- Auto-ML
- ML Ops

Databricks Pricing

Notes
- Pay for DBUs (per minute) only when your cluster is running a job on Spark
- Cluster VM size determines DBU usage per hour; the cheapest is 0.5 DBU
- Azure costs (e.g. VMs, storage) depend on the size of your VMs and on cluster state
- Shut down your clusters when not in use
- … and don't run an infinite loop in a notebook before a long weekend

Premium Features
- RBAC for notebooks, clusters, jobs and tables
- JDBC/ODBC Authentication (Power BI!)
- RStudio Integration

Tier      | Data Engineering (Jobs)     | Data Analytics (Interactive)
STANDARD  | $0.20 per DBU + Azure Cost  | $0.40 per DBU + Azure Cost
PREMIUM   | $0.35 per DBU + Azure Cost  | $0.55 per DBU + Azure Cost
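To make the pricing notes concrete, here is a rough back-of-the-envelope estimate in Python. The per-DBU rates and the 0.5 DBU figure are taken from the 2019 slide above and may no longer be current; the node count, the hours and the VM price of $0.30 per hour are invented purely for illustration, and treating every node identically is a simplification.

```python
# Rates in USD per DBU, copied from the 2019 pricing slide.
DBU_RATES = {
    ("standard", "jobs"): 0.20,
    ("standard", "interactive"): 0.40,
    ("premium", "jobs"): 0.35,
    ("premium", "interactive"): 0.55,
}


def estimate_cost(tier, workload, dbu_per_node_hour, nodes, hours, vm_price_per_hour):
    """Estimate the bill as DBU charges plus Azure VM charges.

    Simplification: every node (driver and workers) is assumed to use the
    same VM size, and storage costs are ignored.
    """
    dbu_cost = DBU_RATES[(tier, workload)] * dbu_per_node_hour * nodes * hours
    vm_cost = vm_price_per_hour * nodes * hours
    return round(dbu_cost + vm_cost, 2)


# Hypothetical scenario: a 4-node premium interactive cluster on the cheapest
# 0.5 DBU VMs, left running over a 60-hour weekend at an assumed $0.30 per
# VM-hour.
weekend_bill = estimate_cost("premium", "interactive", 0.5, 4, 60, 0.30)
print(f"Forgotten cluster over the weekend: ${weekend_bill:.2f}")
```

Even with the smallest VMs the idle cluster costs real money, which is why shutting clusters down (or enabling auto-termination) matters.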

Takeaways
- Data Science requires a lot of data engineering before it can succeed
- CI/CD, API, DBFS, CLI, Databricks Delta, MLflow: some other time
- Databricks is awesome in the analytics stack
- Learn Python!
- To get started, spin up Databricks on your MSDN subscription and start playing around:
  https://databricks.com/spark/getting-started-with-apache-spark
  https://docs.databricks.com/getting-started/index.html