Securing your Big Data Environments in the Cloud

Hadoop_Summit, 41 slides, Jun 26, 2017

About This Presentation

Big Data tools are becoming a critical part of enterprise architectures, so securing the data, both at rest and in motion, is a necessity. This matters even more when you’re implementing these solutions in the cloud and the data doesn't reside within the confines of your trusted data center. Also, there ...


Slide Content

Securing your Big Data environments in the Cloud
Nishant Thacker, Technical Product Manager, Microsoft

What is so different about the cloud?

Traditional Hadoop Clusters – On Prem
[Diagram labels: Hadoop Cluster; Worker Node; HDFS; Tasks; Task Tracker; Master Node; Client; Job (jar) file]

Hadoop Clusters in the Cloud

Decoupling Compute from Storage
Network: Latency? Consistency? Bandwidth?

Decoupling Compute from Storage
Network:
- HDD-like latency
- 50 Tb+ aggregate bandwidth [1]
- Strong consistency [1]
[1] Azure Flat Network Architecture

Decoupling - Benefits
Cloud NoSQL Workload Pros:
- Smaller clusters can achieve the same level of performance as large clusters
- No need to add nodes just for storage capacity
- Depending on the workload, you see anywhere from 6x to 20x cost benefits
Query + ML Workload Pros:
- Clusters are required only while processing data
- Data persists for tools to connect and use
- Data can be replicated to other geos
- Delete clusters when not processing data
Streaming Workload Pros:
- No need for large clusters to hold historical stream data
- Directly align throughput to cluster size as per SLA
- Cluster is up only when streams are active

Customize cluster?
HDInsight cluster provisioning states and cluster customization options
[Flowchart labels: Ready for deployment; Accepted; Cluster storage provisioned; AzureVM configuration; Running; Timed Out; Error; Cluster operational; Configuring HDInsight; Cluster customization (custom script running); No; Yes]

OK, I understand the cloud has this optimized model for deployment. Now help me SECURE IT.
Step 1 – Secure your data

Azure Data Lake (the cloud HDFS) Security Model
1. Check source IP address against Firewall rules
2. Authenticate user and get group information
3. Authorize against POSIX ACLs on the file/folder
4. Encrypt/Decrypt data using the Data & Block Encryption Keys (Master Key is always in AKV)
5. Perform and Audit operation
[Diagram labels: IP Firewall; Authentication; Authorization; Data Protection; Auditing; Store; OAuth Token; Graph API]
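The five numbered checks in the Data Lake Store security model can be read as a pipeline in which each stage may reject the request. The sketch below is only an illustration of that ordering, not the service's implementation: the `Request` and `Store` types, the token table, and the flat permission strings are all hypothetical stand-ins, the ACL check is far simpler than real POSIX ACL evaluation (no owner, mask, or "other" entries), and encryption is left as a comment.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass
class Request:
    source_ip: str
    token: str       # hypothetical stand-in for an OAuth token
    path: str
    operation: str   # "r", "w", or "x"


@dataclass
class Store:
    # Every field is an illustrative assumption, not real service configuration.
    firewall_ranges: list    # allowed CIDR ranges
    token_directory: dict    # token -> (user, set of group names)
    acls: dict               # path -> {principal: permission string such as "r-x"}
    audit_log: list = field(default_factory=list)


def handle(store, req):
    # 1. Check the source IP address against the firewall rules.
    ip = ipaddress.ip_address(req.source_ip)
    if not any(ip in ipaddress.ip_network(cidr) for cidr in store.firewall_ranges):
        return "denied: firewall"

    # 2. Authenticate the user and get group information.
    identity = store.token_directory.get(req.token)
    if identity is None:
        return "denied: authentication"
    user, groups = identity

    # 3. Authorize against the ACL on the file/folder (simplified: any
    #    matching user or group entry that grants the operation is enough).
    entries = store.acls.get(req.path, {})
    if not any(req.operation in entries.get(p, "") for p in {user} | set(groups)):
        return "denied: authorization"

    # 4. Encrypt/decrypt the data here (omitted in this sketch).

    # 5. Perform the operation and record it in the audit log.
    store.audit_log.append((user, req.operation, req.path))
    return "ok"


store = Store(
    firewall_ranges=["10.0.0.0/16"],
    token_directory={"token-alice": ("alice", {"analysts"})},
    acls={"/sales/2017": {"analysts": "r-x"}},
)
read_result = handle(store, Request("10.0.1.5", "token-alice", "/sales/2017", "r"))
write_result = handle(store, Request("10.0.1.5", "token-alice", "/sales/2017", "w"))
outside_result = handle(store, Request("203.0.113.7", "token-alice", "/sales/2017", "r"))
```

Ordering the cheap network check first means an out-of-range caller is rejected before any identity lookup happens, which matches the numbering on the slide.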

Secure Analytics on Azure Data Lake
Analytic engines have two options:
- Represent the caller (carry the user’s OAuth token)
- Represent itself (“engine”) using Service Principals
Azure Data Lake Analytics:
- Represents the caller (carries the user’s OAuth token)
HDInsight:
- Clusters can be configured with Service Principal Identities
- Can also pass through the end user’s OAuth token
[Diagram labels: Data Protection; Store; Authorization; Authentication; Auditing; OAuth Token; Graph API; SPI; Kerb; User OAuth Token]

Contoso Big Data Pipeline
[Diagram labels: People; Automated Systems; Apps (Web, Mobile, Bots); Action; Intelligence; Data; Telemetry; Feedback; Sales; Identity Management: Azure Active Directory (contosodatalake.onmicrosoft.com); Power BI (Dashboards & Visualizations); Big Data Stores: Data Lake Store, Azure DW, Azure Storage; Landing Zone; Data Prep, Analytics & ML; HDInsight (shown twice)]

Getting Started

Connect to Azure Active Directory
[Diagram: the Contoso Big Data Pipeline, repeated]

I want to use AAD as the identity provider for my big data architecture, including Hadoop and Data Lake

Use Active Directory Domain Services and VNet Peering

Configure Active Directory Domain Services

Landing Zone - Data ingestion
[Diagram: the Contoso Big Data Pipeline, repeated]

I don’t want my cluster to be exposed to the public internet

Securing Landing Zone
- Secure via Service Principal and file and folder ACLs
- Secure via access keys

Configure Network Security Group
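To make the goal of keeping the cluster off the public internet concrete, here is a minimal sketch of how a network security group decides whether to admit traffic: rules are processed in priority order (lower number first) and the first match wins. The `Rule` type, the address ranges, and the port are hypothetical examples, and the model ignores protocol, direction, destination, and service tags, all of which real rules also carry.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    priority: int          # lower number is evaluated first
    source: str            # source address range in CIDR notation
    port: Optional[int]    # destination port, or None for any port
    action: str            # "Allow" or "Deny"


def evaluate(rules, source_ip, port):
    """Return the action of the first matching rule in priority order."""
    ip = ipaddress.ip_address(source_ip)
    for rule in sorted(rules, key=lambda r: r.priority):
        in_range = ip in ipaddress.ip_network(rule.source)
        port_ok = rule.port is None or rule.port == port
        if in_range and port_ok:
            return rule.action
    # Nothing matched: deny by default in this sketch.
    return "Deny"


# Hypothetical rule set: HTTPS from inside the virtual network only.
rules = [
    Rule(priority=4096, source="0.0.0.0/0", port=None, action="Deny"),
    Rule(priority=100, source="10.0.0.0/16", port=443, action="Allow"),
]

inside_https = evaluate(rules, "10.0.1.5", 443)
inside_ssh = evaluate(rules, "10.0.1.5", 22)
outside_https = evaluate(rules, "203.0.113.7", 443)
```

The catch-all deny is listed first on purpose: because evaluation sorts by priority, list order does not matter, only the priority numbers do.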

Create a Kafka Cluster inside the VNet

Create a Spark Cluster on secure Data Lake Store

Configure access to Spark Cluster on Data Lake

Analytics using multi-user cluster
[Diagram: the Contoso Big Data Pipeline, repeated, with one added label: Information Management: Azure Data Factory (data movement & job orchestration)]

I want sensitive information to be visible only to privileged users.
Example: I don’t want suppliers to see customer information.

Behind the scenes

Apache Ranger
- Centralized policy management portal.
- Open source industry standard for managing authorization policies.
- Extremely powerful auditing tool.
- Plugins available for various Hadoop components, such as Hive, HBase, Kafka, and Storm.
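As a rough illustration of what a centrally managed authorization policy buys you, the sketch below models the earlier example (suppliers must not see customer information) as column-level allow policies with deny by default. It is a simplified model written for this summary, not Ranger's API or policy format: the `Policy` type, the `retail.orders` table, its column names, and the group names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    database: str
    table: str
    columns: frozenset    # column names, or {"*"} for every column
    groups: frozenset     # groups the policy applies to
    accesses: frozenset   # permitted access types, e.g. {"select"}


def is_allowed(policies, user_groups, database, table, column, access):
    """Allow only if some policy covers the resource, group, and access type."""
    for p in policies:
        covers_resource = (
            p.database == database
            and p.table == table
            and ("*" in p.columns or column in p.columns)
        )
        if covers_resource and access in p.accesses and p.groups & set(user_groups):
            return True
    return False  # deny by default


# Hypothetical policies for a hypothetical retail.orders table.
policies = [
    Policy("retail", "orders", frozenset({"product", "quantity"}),
           frozenset({"suppliers"}), frozenset({"select"})),
    Policy("retail", "orders", frozenset({"*"}),
           frozenset({"sales"}), frozenset({"select"})),
]

supplier_sees_product = is_allowed(
    policies, {"suppliers"}, "retail", "orders", "product", "select")
supplier_sees_customer = is_allowed(
    policies, {"suppliers"}, "retail", "orders", "customer_name", "select")
sales_sees_customer = is_allowed(
    policies, {"sales"}, "retail", "orders", "customer_name", "select")
```

Keeping the policies as data, separate from the engines that enforce them, is what lets one portal govern several components at once.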

[Architecture diagram labels: ARM VNET; Gateway; Head Node; HDInsight Cluster; WASB; ADLS; Worker node(s); ThriftServer; Zeppelin; Kerberos AuthN; Kerberos Ticket; HTTPS Basic Auth; OAuth Ticket; Classic VNET; Active Directory Domain Services; AAD tenant; Ranger DB; LDAP; Alice; Bob; Azure VNET to VNET peering; Ranger]

Apache Ranger in HDInsight

Why No Knox?
[Diagram labels: Azure VNet; HTTPS traffic; ODBC/JDBC; WebHCatalog; Oozie; Ambari; Secure gateway (AuthN, HTTP Proxy, Highly available); Head nodes; Worker nodes; ADLS/WASB]

Sneak preview of Future releases

Ranger plugin for WASB
- WASB has no RBAC for file- and folder-level permissions.
- Microsoft invested in a Ranger plugin for WASB.
- Similar to the Ranger plugin for HDFS.

Configuring policies in Ranger

WASB: File and Folder ACLs (Ranger)

In Conclusion …

© 2016 Microsoft Corporation. All rights reserved.