Big Data tools are becoming a critical part of enterprise architectures, so securing the data, both at rest and in motion, is a necessity. This is even more true when you implement these solutions in the cloud and the data doesn't reside within the confines of your trusted data center. There is also a fine balance between implementing enterprise-grade security and getting the best possible performance, given the overheads of encryption and identity management.
This session is designed to tackle these challenges head on and explain the various options available in the cloud. The focal points are the implementation of tools like Ranger and Knox for cloud deployments, but we also pay attention to the security features offered in the cloud that complement this process and secure the data in unprecedented ways.
Cloud Security + OSS Security tools are a deadly combination when it comes to securing your Data Lake.
Size: 21.73 MB
Language: en
Added: Jun 26, 2017
Slides: 41 pages
Slide Content
Securing your Big Data environments in the Cloud. Nishant Thacker, Technical Product Manager, Microsoft
Decoupling - Benefits
Cloud NoSQL workload pros: smaller clusters can achieve the same level of performance as large clusters; no need to add nodes just for storage capacity; depending on the workload, you see anywhere from 6x – 20x cost benefits.
Query + ML workload pros: clusters are required only while processing data; data persists for tools to connect and use; data can be replicated to other geographies; delete clusters when not processing data.
Streaming workload pros: no need for large clusters to hold historical stream data; directly align throughput to cluster size as per SLA; keep the cluster up only when streams are active.
HDInsight cluster provisioning states (flowchart). States shown: Ready for deployment; Accepted; Cluster storage provisioned; AzureVM configuration; Configuring HDInsight; Cluster customization (custom script running); Cluster operational; Running; Timed Out; Error. Decision point: Customize cluster? (Yes / No). Also labelled: Cluster customization options.
OK, I understand the cloud has this optimized model for deployment; now help me SECURE IT. Step 1 – Secure your data
Azure Data Lake (the cloud HDFS) Security Model
Diagram labels: IP Firewall; Authentication; Authorization; Data Protection; Auditing; Store; OAuth Token; Graph API.
1. Check source IP address against Firewall rules
2. Authenticate user and get group information
3. Authorize against POSIX ACLs on the file/folder
4. Encrypt/Decrypt data using the Data & Block Encryption Keys (Master Key is always in AKV)
5. Perform and audit the operation
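The five numbered steps can be read as a request pipeline. The Python sketch below is a toy model of that ordering only, not the actual Azure Data Lake Store implementation: every name in it (Acl, handle_request, the token directory, the audit log list) is hypothetical, the ACL check is a simplified POSIX-style lookup without mask entries, and step 4 (encryption) is left as a comment.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass
class Acl:
    """Hypothetical POSIX-style ACL: permission strings such as 'r-x'."""
    users: dict = field(default_factory=dict)   # user name -> permissions
    groups: dict = field(default_factory=dict)  # group name -> permissions
    other: str = "---"


def allowed_by_firewall(source_ip, allowed_ranges):
    # Step 1: is the caller's IP inside any allowed CIDR range?
    ip = ipaddress.ip_address(source_ip)
    return any(ip in ipaddress.ip_network(r) for r in allowed_ranges)


def authenticate(token, directory):
    # Step 2: resolve a token to (user, groups); None if the token is unknown.
    return directory.get(token)


def authorize(user, groups, acl, needed):
    # Step 3: a named-user entry wins, then any matching group, then 'other'.
    if user in acl.users:
        return needed in acl.users[user]
    matching = [acl.groups[g] for g in groups if g in acl.groups]
    if matching:
        return any(needed in perms for perms in matching)
    return needed in acl.other


def handle_request(source_ip, token, path, needed, *,
                   allowed_ranges, directory, acls, audit_log):
    if not allowed_by_firewall(source_ip, allowed_ranges):
        return "denied: firewall"
    identity = authenticate(token, directory)
    if identity is None:
        return "denied: authentication"
    user, groups = identity
    if not authorize(user, groups, acls.get(path, Acl()), needed):
        audit_log.append((user, path, needed, "denied"))
        return "denied: authorization"
    # Step 4 (encrypt/decrypt with the data and block keys) is omitted here.
    # Step 5: perform the operation and record it.
    audit_log.append((user, path, needed, "allowed"))
    return "allowed"
```

The point of the sketch is the ordering: cheaper network checks run before identity checks, and the authorization outcome is written to the audit trail whether it succeeds or fails.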
Secure Analytics on Azure Data Lake
Diagram labels: Authentication; Authorization; Data Protection; Auditing; Store; OAuth Token; Graph API; SPI; Kerb; User OAuth Token.
Analytic engines have two options: represent the caller (carry the user's OAuth token), or represent themselves ("engine") using Service Principals.
Azure Data Lake Analytics: represents the caller (carries the user's OAuth token).
HDInsight: clusters can be configured with Service Principal Identities (SPI); they can also pass through the end user's OAuth token.
Contoso Big Data Pipeline (architecture diagram). Labels: People; Automated Systems; Apps (Web, Mobile, Bots); Action; Intelligence; Data; Telemetry; Feedback; Sales; Identity Management: Azure Active Directory (contosodatalake.onmicrosoft.com); Power BI: Dashboards & Visualizations; Big Data Stores: Data Lake Store, Azure DW, Azure Storage; Landing Zone; Data Prep, Analytics & ML; HDInsight (shown twice).
Getting Started
Connect to Azure Active Directory (Contoso Big Data Pipeline diagram repeated, with Identity Management: Azure Active Directory, contosodatalake.onmicrosoft.com)
I want to use AAD as the identity provider for my big data architecture, including Hadoop and Data Lake
Use Active Directory Domain Services and VNet Peering
Configure Active Directory Domain Services
Landing Zone - Data ingestion (Contoso Big Data Pipeline diagram repeated)
I don’t want my cluster to be exposed to the public internet
Securing Landing Zone: secure via Service Principal & file and folder ACLs; secure via access keys
Configure Network Security Group
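As a rough mental model for this step, a Network Security Group can be thought of as a prioritized rule list in which the lowest priority number is evaluated first and the first matching rule decides the outcome. The Python sketch below illustrates only that evaluation order; the Rule type, the evaluate function, the address ranges and the port numbers are all hypothetical and are not taken from the slides or from any Azure SDK.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    priority: int          # lower number is evaluated first
    source: str            # source address range in CIDR notation
    port: Optional[int]    # destination port; None matches any port
    allow: bool            # True = allow, False = deny


def evaluate(rules, source_ip, port):
    """Return True if inbound traffic is allowed by the first matching rule."""
    ip = ipaddress.ip_address(source_ip)
    for rule in sorted(rules, key=lambda r: r.priority):
        in_range = ip in ipaddress.ip_network(rule.source)
        port_ok = rule.port is None or rule.port == port
        if in_range and port_ok:
            return rule.allow
    return False  # nothing matched: this sketch treats the traffic as denied


# Hypothetical rule set: SSH only from one trusted range, everything else denied.
rules = [
    Rule(priority=4000, source="0.0.0.0/0", port=None, allow=False),
    Rule(priority=100, source="203.0.113.0/24", port=22, allow=True),
]
```

With these rules, a caller in 203.0.113.0/24 reaches port 22, while the same caller on port 443 or any other address falls through to the catch-all deny.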
Create a Kafka Cluster inside the VNet
Create a Spark Cluster on secure Data Lake Store
Configure access to Spark Cluster on Data Lake
Analytics using multi-user cluster (Contoso Big Data Pipeline diagram repeated, here also showing Information Management: Azure Data Factory, data movement & job orchestration)
I want sensitive information to be visible only to privileged users. Example – I don’t want suppliers to see customer information.
Behind the scenes
Apache Ranger: centralized policy management portal. Open source industry standard for managing authorization policies. Extremely powerful auditing tool. Plugins are available for various Hadoop components such as Hive, HBase, Kafka, and Storm.
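To make the idea of a centrally managed authorization policy concrete, here is a small Python sketch of an allow-only, default-deny check at column level, using the supplier and customer example from the earlier slide. It is a simplified model and not Ranger's policy engine or API: the Policy type, the is_allowed function, and the database, table, column and group names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    database: str
    table: str
    columns: frozenset    # column names, or {"*"} for every column
    groups: frozenset     # groups the policy grants access to
    accesses: frozenset   # access types, e.g. {"select"}


def is_allowed(policies, user_groups, database, table, column, access):
    """Default deny: access is granted only if some policy matches fully."""
    for p in policies:
        resource_match = (
            p.database == database
            and p.table == table
            and ("*" in p.columns or column in p.columns)
        )
        if resource_match and access in p.accesses and p.groups & set(user_groups):
            return True
    return False


# Hypothetical policies: privileged users see every column of the orders
# table, suppliers see only the columns that carry no customer information.
policies = [
    Policy("retail", "orders", frozenset({"*"}),
           frozenset({"sales_admins"}), frozenset({"select"})),
    Policy("retail", "orders", frozenset({"product_id", "quantity"}),
           frozenset({"suppliers"}), frozenset({"select"})),
]
```

Under these policies a member of suppliers can select product_id but not customer_name, while a member of sales_admins can select both.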
(Architecture diagram) Labels: ARM VNET; Gateway; Head Node; HDInsight Cluster; Worker node(s); WASB; ADLS; ThriftServer; Zeppelin; Ranger; Ranger DB; Kerberos AuthN; Kerberos Ticket; HTTPS Basic Auth; OAuth Ticket; Classic VNET; Active Directory Domain Services; AAD tenant; LDAP; users Alice and Bob; Azure VNET to VNET peering.
Apache Ranger in HDInsight
Why No Knox? (diagram) Labels: Azure VNet; HTTPS traffic; ODBC/JDBC; WebHCatalog; Oozie; Ambari; Secure gateway; AuthN; HTTP Proxy; Highly available; Head nodes; Worker nodes; ADLS/WASB.
Sneak preview of Future releases
Ranger plugin for WASB: no RBAC for file- and folder-level permissions. Microsoft invested in a Ranger plugin for WASB, similar to the Ranger plugin for HDFS.