Tech talk on what Azure Databricks is, why you should learn it and how to get started. We'll use PySpark and talk about some real-life examples from the trenches, including the pitfalls of accidentally leaving your clusters running and receiving a huge bill ;)
After this you will hopefully switch to Spark-as-a-service and get rid of your HDInsight/Hadoop clusters.
This is part 1 of an 8-part Data Science for Dummies series:
Databricks for dummies
Titanic survival prediction with Databricks + Python + Spark ML
Titanic with Azure Machine Learning Studio
Titanic with Databricks + Azure Machine Learning Service
Titanic with Databricks + MLS + AutoML
Titanic with Databricks + MLFlow
Titanic with DataRobot
Deployment, DevOps/MLops and Operationalization

Agenda:
Objective / Data Science Series
Boring theory
Use Cases
Demos:
  Getting started
  Interactive Notebooks
  ETL Batch job with Azure Data Factory
  The humble dataframe
Pricing
Takeaways
Questions

Objective & Data Science Series:
This talk is part 1 (Databricks for dummies) of the 8-part series listed above.
Objective: what is Azure Databricks, why you should learn it and how to get started.

How to get value out of your data?
Data Science Workflow Data Value

Why is data science so hard?
Data Science requires a lot of data engineering before it can succeed
Siloed roles = unique terminology
Fragmented technologies and solutions
Model training requires huge scale some of the time
The more data we use to train, the better
Big Data infrastructure is expensive and costly to maintain
Operational challenges: how do you get a model into production?
Problem = $$$ and slow to deliver value

Where Data Scientists spend most of their time

Solution: Unified Analytics Platform
Unifies Data Science, Engineering and Business
Removes silos, improves collaboration
Supports multiple languages
Business value:
  Cost saving (resources, operational, training etc.)
  Speed to market
  Easy to extend for future ML
Focus on extracting insights from your data and not on the infrastructure and processes around it!

What is Apache Spark?
Open-source big data processing engine
Massively scalable/distributed
Highly extensible with many libraries
Started in 2009 and written in Scala
Supports 4 languages
Designed for speed and ease of use
In memory = faster than Hadoop
You can run Spark on Azure directly

Databricks PaaS – Managed Spark Service on Azure
Use Case - Modern Analytics Platform
Use Case – Real Example

Demo 1 - Getting started with Databricks:
Create a Databricks service (Resource Group)
Launching a workspace: AD integration
Menu overview
Workspaces
Notebooks
RBAC (Premium)
Add a new cluster with auto-scale
Installing libraries
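
The cluster settings from this demo can also be written down as a cluster definition rather than clicked together in the UI. The following is a minimal sketch, assuming the field names of the Databricks Clusters API create request (cluster_name, spark_version, node_type_id, autoscale, autotermination_minutes). The cluster name, VM size, worker counts and idle timeout are made-up values, and the runtime version is deliberately left as a placeholder.

```python
import json

# Sketch of a request body for creating a cluster via the Clusters API.
# All values are illustrative assumptions, not settings from the talk.
cluster_spec = {
    "cluster_name": "dummies-demo",        # hypothetical name
    "spark_version": "<runtime-version>",  # placeholder: pick one offered in your workspace
    "node_type_id": "Standard_DS3_v2",     # example Azure VM size
    # Auto-scale: Databricks adds or removes workers between these bounds.
    "autoscale": {"min_workers": 2, "max_workers": 8},
    # Terminate the cluster after 30 idle minutes to avoid a surprise bill.
    "autotermination_minutes": 30,
}

# Serialize to JSON, the format the REST API and CLI expect.
payload = json.dumps(cluster_spec, indent=2)
print(payload)
```

Setting an idle timeout is the cheap insurance against the "left the cluster running" pitfall mentioned in the abstract.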

Demo 2 - Interactive Notebook:
Notebook overview
Attach to cluster
Cells
Markdown
Running a command & shortcuts
Comments
Revisions (Git)
Data tables
Language choice (4!)
Magic commands (e.g. %sh for Unix shell, %md for Markdown)
Charting and dashboards

Demo 4 - The humble dataframe:
Import a notebook
Dataframes versus Datasets versus RDDs
Download and read a CSV file and infer schema
Intellisense
Lazy evaluation
Actions versus transformations
Importing a library (e.g. Pandas)
Immutability
Statically versus dynamically typed
Koalas / .NET for Apache Spark

Other use cases:
Connecting to Power BI: JDBC connection (Premium)
Streaming use cases
Open-source friendly! TensorFlow, scikit-learn etc.
Machine Learning
AutoML
MLOps

Databricks Pricing:
Notes:
  Pay for DBUs (billed per minute) only while your cluster is running a job on Spark
  Cluster VM size determines DBU usage per hour; the cheapest is 0.5 DBU
  Azure costs (e.g. VMs, storage) depend on the size of your VMs and on cluster state
  Shut down your clusters when not in use... and don't run an infinite loop in a notebook before a long weekend
Premium features:
  RBAC for notebooks, clusters, jobs and tables
  JDBC/ODBC authentication (Power BI!)
  RStudio integration
Price per DBU:
  Tier | Data Engineering (Jobs) | Data Analytics (Interactive)
  STANDARD | $0.20 per DBU + Azure cost | $0.40 per DBU + Azure cost
  PREMIUM | $0.35 per DBU + Azure cost | $0.55 per DBU + Azure cost
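
To make the pricing rules concrete, here is a rough estimate of the DBU part of a bill. It uses the per-DBU rates and the 0.5 DBU figure from this slide; the node count and running hours are hypothetical, and the Azure VM and storage charges that come on top are not modelled.

```python
# Per-DBU rates in USD as quoted on the pricing slide (Azure cost is extra).
DBU_RATES = {
    ("standard", "jobs"): 0.20,
    ("standard", "interactive"): 0.40,
    ("premium", "jobs"): 0.35,
    ("premium", "interactive"): 0.55,
}


def dbu_cost(tier, workload, dbu_per_node_hour, nodes, hours):
    """Estimate only the DBU charge; VM and storage costs are not included."""
    rate = DBU_RATES[(tier, workload)]
    return round(rate * dbu_per_node_hour * nodes * hours, 2)


# Hypothetical 4-node cluster of the cheapest VM size (0.5 DBU per node-hour).
# A nightly 8-hour batch job on the Standard tier:
nightly_job = dbu_cost("standard", "jobs", 0.5, nodes=4, hours=8)

# The same cluster left running interactively on Premium over a 60-hour weekend:
forgotten_weekend = dbu_cost("premium", "interactive", 0.5, nodes=4, hours=60)

print(nightly_job, forgotten_weekend)
```

Even with the cheapest VM size, the forgotten interactive cluster costs many times the scheduled job, and that is before the VM charges are added.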

Takeaways:
Data Science requires a lot of data engineering before it can succeed
CI/CD, API, DBFS, CLI, Databricks Delta, MLFlow: some other time
Databricks is awesome in the analytics stack
Learn Python!
To get started, spin up Databricks on your MSDN subscription and start playing around:
https://databricks.com/spark/getting-started-with-apache-spark
https://docs.databricks.com/getting-started/index.html