Hadoop in the Cloud – The What, Why and How from the Experts

HadoopSummit, Oct 31, 2016 (33 slides)

Slide Content

Hadoop in the Cloud: The What, Why and How from the Experts
SATO Naoki (@satonaoki), Azure Technologist, Microsoft Japan

Agenda
  Why
    Benefits of running Hadoop in the cloud
  What
    Options to run Hadoop in the cloud
    Hadoop clusters in the cloud
    Cluster customizations
  How
    Architecture of a cloud deployment

Traditional Hadoop Clusters

Challenges with implementing Hadoop
  Up-front hardware costs
  Capacity planning
  Hadoop expertise

Hadoop Clusters in the Cloud

Why Hadoop in the cloud? Benefits of cloud
  No hardware costs ($0)
  Unlimited elastic scale
  Pay only for what you need
  Deployed in minutes
  Automatic geo-redundancy

Hadoop Clusters
  Distributed Storage: files split across storage; files replicated; nearest node responds
  Distributed Compute: distributed processing; resource utilization; cost-efficient method calls
  Hyper-Scale: add resources as desired; built to include commodity configs; direct correlation of performance and resources
  Automated Failover: unmonitored failover to replicated data; built for resiliency; metadata stored for later retrieval
  Extensible: APIs to extend functionality; add new capabilities; allow for inclusion in custom environments
  Abstracted Administration

Cloud
  The same characteristics, with the same details as listed for Hadoop clusters above: Distributed Storage, Distributed Compute, Hyper-Scale, Automated Failover, Extensible, Abstracted Administration

Hadoop in the Cloud
  The same characteristics again: Distributed Storage, Distributed Compute, Hyper-Scale, Automated Failover, Extensible, Abstracted Administration
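
The "files split across storage, files replicated" property listed above can be illustrated with a small sketch. This is not how HDFS or any Azure store actually places data (real systems use rack-aware and load-aware policies); the round-robin placement, the function names, and the node names are all assumptions made for illustration.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Split a file's bytes into fixed-size blocks (the last block may be shorter)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes.

    Assumption: simple round-robin placement, chosen only to keep the sketch short.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement: dict[int, list[str]], failed: set[str]) -> set[int]:
    """Return the blocks that still have at least one replica on a live node."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    }


# Hypothetical four-node cluster with a replication factor of 3.
blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], replication=3)
still_readable = readable_blocks(placement, failed={"n1", "n2"})
```

With three replicas per block, losing two of the four hypothetical nodes still leaves every block readable, which is the idea behind "unmonitored failover to replicated data".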

Hadoop in the Cloud: Options

Hadoop in IaaS
  Pros: complete control; on-demand cluster sizing; storage can be local or cloud
  Cons: only VMs managed for HA; administration required; clusters need to stay active

Hadoop in PaaS
  Pros: fully managed, SLA bound; flexible resizing; customization options; deployed in minutes
  Cons: forgo some control

Big Data as a Service
  Pros: abstracted from clusters; automated resource alignment; easy-to-use interface and APIs; familiar languages
  Cons: forgo complete control; limited choice of tools

Scenarios for deploying Hadoop as hybrid
  On-premises: Hadoop software
  Cloud: specialized workloads (HDInsight)
  Cloud: bursting (HDInsight)
  Cloud: backup/archive (HDInsight)

Traditional Hadoop Clusters – On Prem
[Diagram: a client with a job (jar) file, a master node, and a Hadoop cluster of worker nodes, each with a task tracker, tasks, and HDFS storage]

Hadoop Clusters in the Cloud

Azure HDInsight: Hadoop and Spark as a Service on Azure
  Fully managed Hadoop and Spark for the cloud
  100% open source Hortonworks Data Platform
  Clusters up and running in minutes
  Managed, monitored and supported by Microsoft with the industry's best enterprise SLA
  Use familiar BI tools for analysis, or open source notebooks for interactive data science
  63% lower total cost of ownership than deploying your own Hadoop on-premises*
  *IDC study "The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight"

HDInsight Cluster Architecture
[Diagram: HTTPS traffic (ODBC/JDBC, WebHCatalog, Oozie, Ambari); secure gateway (AuthN, HTTP proxy); Azure Virtual Network with highly available head nodes and worker nodes; Azure Data Lake Store]

Decoupling Compute from Storage
  Network: Latency? Consistency? Bandwidth?

Decoupling Compute from Storage: Azure Flat Network Architecture
  HDD-like latency
  50 Tb+ aggregate bandwidth [1]
  Strong consistency [1]

Decoupling: Benefits

NoSQL workload
  Smaller clusters can achieve the same level of performance as large clusters
  No need to add nodes just for storage capacity
  Depending on the workload, cost benefits range anywhere from 6x to 20x

Query + ML workload
  Clusters are required only while processing data
  Data persists for tools to connect to and use
  Data can be replicated to other geographies
  Delete clusters when not processing data

Streaming workload
  No need for large clusters to hold historical stream data
  Directly align throughput to cluster size as per SLA
  Cluster is up only when streams are active
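
The "clusters required only while processing data" point above comes down to simple arithmetic, sketched below. The node count, hourly rate, and processing hours are made-up illustrative values, not Azure prices and not figures from the talk, and the sketch ignores storage charges, which continue while the cluster is deleted.

```python
def cluster_cost(nodes: int, hourly_rate_per_node: float, hours: float) -> float:
    """Compute-only cost of running `nodes` nodes for `hours` hours."""
    return nodes * hourly_rate_per_node * hours


def monthly_compute_costs(nodes: int, hourly_rate_per_node: float,
                          processing_hours_per_day: float, days: int = 30) -> tuple[float, float]:
    """Return (always_on, on_demand) compute cost for one month.

    always_on: the cluster runs 24 hours a day because it also holds the data.
    on_demand: the cluster exists only while jobs run; data lives in separate storage.
    """
    always_on = cluster_cost(nodes, hourly_rate_per_node, 24 * days)
    on_demand = cluster_cost(nodes, hourly_rate_per_node, processing_hours_per_day * days)
    return always_on, on_demand


# Hypothetical inputs: 10 nodes, 1.0 currency unit per node-hour, 6 hours of jobs per day.
always_on, on_demand = monthly_compute_costs(10, 1.0, processing_hours_per_day=6)
```

Under these assumed inputs the on-demand cluster costs a quarter of the always-on one; the actual ratio depends entirely on how many hours per day the workload needs compute.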

Azure Data Lake Store
  A hyper-scale repository for big data analytics workloads
  Hadoop File System (HDFS) for the cloud
  No limits to scale
  Store any data in its native format
  Enterprise-grade access control and encryption
  Optimized for analytic workload performance

Customize cluster?

HDInsight cluster provisioning states
  Ready for deployment
  Accepted
  Cluster storage provisioned
  AzureVM configuration
  Configuring HDInsight
  Customize cluster? Yes: Cluster customization (custom script running); No: skip
  Running (cluster operational)
  Failure states: Timed Out, Error

Cluster customization options
  Ad hoc: RDP to cluster, update config files (non-durable)
  Via Azure portal: Hive/Oozie metastore; storage accounts & VNETs
  Via scripting / SDK: config values; JAR file placement in cluster
  ScriptAction
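
The provisioning states named on this slide can be modeled as a small enumeration that a deployment script might watch until a terminal state is reached. This is a sketch only: the state names come from the slide, but which states count as terminal, the helper name `wait_for_terminal`, and the idea of feeding it an iterable of observed states are assumptions, not the HDInsight SDK.

```python
from enum import Enum
from typing import Iterable


class ProvisioningState(Enum):
    READY_FOR_DEPLOYMENT = "Ready for deployment"
    ACCEPTED = "Accepted"
    STORAGE_PROVISIONED = "Cluster storage provisioned"
    VM_CONFIGURATION = "AzureVM configuration"
    CONFIGURING_HDINSIGHT = "Configuring HDInsight"
    CUSTOMIZATION = "Cluster customization"
    RUNNING = "Running"
    TIMED_OUT = "Timed Out"
    ERROR = "Error"


# Assumption: provisioning is finished once one of these states is observed.
TERMINAL_STATES = {
    ProvisioningState.RUNNING,
    ProvisioningState.TIMED_OUT,
    ProvisioningState.ERROR,
}


def wait_for_terminal(observed: Iterable[ProvisioningState]) -> ProvisioningState:
    """Consume observed states in order and return the first terminal one."""
    for state in observed:
        if state in TERMINAL_STATES:
            return state
    raise RuntimeError("state stream ended before provisioning finished")


# Hypothetical sequence for a cluster that runs a custom script during setup.
result = wait_for_terminal([
    ProvisioningState.ACCEPTED,
    ProvisioningState.STORAGE_PROVISIONED,
    ProvisioningState.VM_CONFIGURATION,
    ProvisioningState.CONFIGURING_HDINSIGHT,
    ProvisioningState.CUSTOMIZATION,
    ProvisioningState.RUNNING,
])
```

In a real script the iterable would be replaced by whatever polling mechanism the chosen SDK or REST API provides.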

Cluster integration options
Each cluster surfaces a REST endpoint for integration, secured via basic authN over SSL
  /thrift – ODBC & JDBC
  /Templeton – Job submission, metadata management
  /ambari – Cluster health, monitoring
  /oozie – Job orchestration, scheduling
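
As a rough illustration of "basic authN over SSL" against the endpoints listed above, the sketch below builds (but does not send) an HTTPS request with a Basic Authorization header using only the Python standard library. The base URL, credentials, and helper name are placeholders; the lowercase path spellings and the absence of any sub-path are assumptions, so check the service documentation for the exact routes.

```python
import base64
import urllib.request

# Top-level paths named on the slide; casing and sub-paths are assumptions.
ENDPOINTS = {"thrift", "templeton", "ambari", "oozie"}


def build_request(cluster_base_url: str, endpoint: str,
                  user: str, password: str) -> urllib.request.Request:
    """Build an HTTPS request carrying HTTP Basic credentials.

    Nothing is sent here; pass the result to urllib.request.urlopen to call it.
    """
    if endpoint not in ENDPOINTS:
        raise ValueError(f"unknown endpoint: {endpoint}")
    if not cluster_base_url.startswith("https://"):
        raise ValueError("basic auth should only be sent over HTTPS")
    url = f"{cluster_base_url.rstrip('/')}/{endpoint}"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


# Placeholder host and credentials, for illustration only.
req = build_request("https://example-cluster.invalid", "ambari", "admin", "secret")
```

Refusing plain HTTP in the helper reflects the slide's pairing of basic authentication with SSL, since Basic credentials are only base64-encoded, not encrypted.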

Cloud Deployments for Big Data
[Diagram of Azure services: Data Factory, Data Lake Store, HDInsight, Machine Learning, SQL Database, Power BI, DocumentDB, Event Hubs, Stream Analytics]

Introducing Cortana Intelligence Suite
  Data Sources: apps; sensors and devices
  Information Management: Data Factory; Data Catalog; Event Hubs
  Big Data Stores: Data Lake Store; SQL Data Warehouse
  Machine Learning and Analytics: Machine Learning; Data Lake Analytics; HDInsight (Hadoop and Spark); Stream Analytics
  Intelligence: Cognitive Services; Bot Framework; Cortana
  Dashboards & Visualizations: Power BI
  Action: people; automated systems; apps (web, mobile, bots)
  Data, Intelligence, Action

Where Big Data is a cornerstone
[The same Cortana Intelligence Suite diagram, repeated]

[Architecture diagram. Stage labels: Data Generation, Streaming, Storage, Processing, Consumption. Workload labels: Operational, Analytical / Exploratory, Data Warehouse.]
  Sources: big data sources (raw, unstructured); log files; transactional systems; Azure Website
  Streaming: Azure Event Hubs; Kafka for Azure HDInsight (future); Storm for Azure HDInsight; Spark Streaming for Azure HDInsight; Azure Stream Analytics
  Stores: Azure Data Lake Store; HBase on Azure HDInsight; Azure SQL DW; SQL Server; APS
  Processing: Hive (HiveQL); Pig; Sqoop; Mahout; Spark SQL; Spark MLlib; Azure Data Lake Analytics (U-SQL); Azure Machine Learning; R Server; SQL Server R Services; ETL
  Data orchestration / workflow: Azure Data Factory; Oozie for Azure HDInsight; SQL Server Integration Services
  Consumption: Excel BI; Power BI; SSRS; SSAS; SharePoint BI

Summary
  Why
    Benefits of running Hadoop in the cloud: far outweigh the tradeoffs
  What
    Options to run Hadoop in the cloud: IaaS, PaaS, hybrid
    Hadoop clusters in the cloud: fully managed
    Cluster customizations: immensely well leveraged
  How
    Architecture of a cloud deployment: simplify deployment

Get started today!
  For more information on HDInsight, visit: http://azure.com/hdinsight
  For more information on Data Lake, visit: http://azure.com/datalake

http://microsoft-events.jp/mstechsummit/

Q&A
[email protected] @satonaoki