The Beginner's Guide to Data Lakes in AWS

GuillermoAFisher, 23 slides, Dec 09, 2019

About This Presentation

AWS offers everything you need to deploy a secure and flexible data lake in the cloud. Discover how services like Amazon Simple Storage Service (Amazon S3) and Amazon Redshift can be used together to build and manage your own data lake, and how AWS Lake Formation makes it possible to set up a data l...


Slide Content

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
The Beginner’s Guide to Data Lakes in AWS
DVC12
Guillermo A. Fisher
Senior Engineering Manager, Handshake

Agenda
Why a Data Lake?
Key Concepts
Data Lakes on AWS
An Example
Best Practices

Related DevChats
DVC10 - Lessons from the backyard: A connected BBQ grill and smoker
DVC06 - Use Neptune to discover where & when events can impact local businesses

“The never-ending stream of information is incredibly useful for businesses, but it can also be a challenge to draw relevant insights from such a large data pool.”
Michael Brenner
CEO, Marketing Inside Group

The Data Science Hierarchy of Needs
AI
Learn/Optimize
Aggregate/Label
Explore/Transform
Move/Store
Collect
“You need a solid foundation for your data before being effective with AI and machine learning.”
Monica Rogati
Data Science and AI Advisor

The Data Warehouse Solution
[Diagram: a Data Warehouse and three Data Marts]
Advantages:
Provides precise reporting and BI
Standardized, consistent data
Drawbacks:
Limited to pre-determined questions
No low-level data visibility

Considerations for a Modern Solution
Centralized Data Storage: Store all data reliably in one location
Multiple User Communities: Business analysts, data professionals
Schema on Read: Schema written at time of analysis
Storage vs. Compute: Scale storage and compute independently
Data Types & Formats: Structured, semi-structured, unstructured, raw data
Security: Control access to the data
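
Schema on read can be sketched in a few lines. This is an illustration only, not part of the original slides: the raw JSON lines are stored untouched, and a schema (the hypothetical `CLICK_SCHEMA` mapping, with invented field names) is applied only when the data is read for analysis.

```python
import json

# Raw events stored as-is; records need not share the same fields.
RAW_LINES = [
    '{"user": "a1", "ts": "1575849600", "amount": "19.99"}',
    '{"user": "b2", "ts": "1575853200"}',
]

# Hypothetical schema chosen at analysis time: field name -> type caster.
CLICK_SCHEMA = {"user": str, "ts": int, "amount": float}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and cast each field according to the schema.

    Fields missing from a record become None; fields that are not in the
    schema are ignored.
    """
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


rows = read_with_schema(RAW_LINES, CLICK_SCHEMA)
```

A different analysis could read the same raw lines with a different schema, which is the flexibility a data warehouse's fixed schema does not offer.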

Photo by Yifan Liu on Unsplash
A data lake is a centralized repository that allows you
to store all your structured and unstructured data at
any scale. You can store your data as-is, without having
to first structure the data, and run different types of
analytics—from dashboards and visualizations to big
data processing, real-time analytics, and machine
learning to guide better decisions.

Photo by arsalan arianmehr on Unsplash
Onboard relevant data
Metadata should exist in a data catalog
Data governance policies and procedures govern storage and access
Automated processes manage data flow and data cleaning, and enforce practices

Centralized Storage
Amazon S3
Scalable object storage
Decouples storage and compute
99.999999999% durability
Cost-effective lifecycle policies
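
As a rough sketch of what a lifecycle policy might look like, the function below builds a rule that moves objects to cheaper storage classes and eventually expires them. The prefix, day counts, and rule ID are assumptions made up for illustration. The dict follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts as `LifecycleConfiguration`, but no AWS call is made here.

```python
def build_lifecycle_configuration(prefix, ia_after_days, glacier_after_days,
                                  expire_after_days):
    """Return a lifecycle configuration dict for objects under `prefix`.

    Objects move to STANDARD_IA, then GLACIER, and are deleted at the end.
    """
    if not ia_after_days < glacier_after_days < expire_after_days:
        raise ValueError("expected ia_after < glacier_after < expire_after")
    rule_id = "tier-and-expire-" + prefix.strip("/").replace("/", "-")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


# Hypothetical values: raw logs tier down after 30 and 90 days, expire after a year.
config = build_lifecycle_configuration("raw/logs/", 30, 90, 365)
```

With boto3 installed and credentials configured, this dict could be passed as `s3.put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=config)`.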

Data Ingestion
Amazon Kinesis Data Firehose: Easily and reliably stream data into data lakes
AWS Snowball: Migrate large datasets using secure devices
AWS Storage Gateway: Gain on-premises access to AWS cloud storage
AWS Database Migration Service: Migrate databases to AWS quickly and securely
AWS Direct Connect: Establish a dedicated network connection to AWS
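
For streaming ingestion, events are usually grouped into batches before they are sent. The sketch below prepares newline-delimited JSON records in the `{"Data": bytes}` shape that Firehose's `PutRecordBatch` expects, at most 500 records per batch. The event fields are invented for illustration, the per-call size limit is not handled, and nothing is actually sent.

```python
import json

MAX_RECORDS_PER_BATCH = 500  # PutRecordBatch accepts up to 500 records per call


def to_firehose_batches(events, batch_size=MAX_RECORDS_PER_BATCH):
    """Yield lists of {"Data": bytes} records, one JSON document per line."""
    batch = []
    for event in events:
        # A trailing newline keeps records separable once they land in S3.
        data = (json.dumps(event, sort_keys=True) + "\n").encode("utf-8")
        batch.append({"Data": data})
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


# Hypothetical events; 1200 of them split into batches of 500, 500, and 200.
batches = list(to_firehose_batches({"id": i} for i in range(1200)))
```

Each batch could then be passed to boto3 as `firehose.put_record_batch(DeliveryStreamName=..., Records=batch)`, where the delivery stream name is whatever was configured for the lake.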

Catalog & Search
Amazon DynamoDB: Fully managed NoSQL database service
Amazon Elasticsearch Service: Fully managed Elasticsearch service
AWS Glue: Store metadata in a data catalog
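
To make "metadata in a data catalog" concrete, here is a sketch of a table definition in the shape of the `TableInput` argument to Glue's `create_table`. The table name, S3 location, and columns are hypothetical, and the JSON SerDe is an assumption about how the raw data is stored. The code only builds the dict.

```python
def build_table_input(name, location, columns, partition_keys):
    """Describe a JSON-backed external table for a data catalog.

    `columns` and `partition_keys` are lists of (name, type) pairs.
    """
    def as_columns(pairs):
        return [{"Name": col_name, "Type": col_type} for col_name, col_type in pairs]

    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "json"},
        "PartitionKeys": as_columns(partition_keys),
        "StorageDescriptor": {
            "Columns": as_columns(columns),
            "Location": location,
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
            },
        },
    }


# Hypothetical table over click events partitioned by date.
table_input = build_table_input(
    "clicks",
    "s3://example-data-lake/clicks/",
    columns=[("user", "string"), ("ts", "bigint"), ("amount", "double")],
    partition_keys=[("year", "string"), ("month", "string"), ("day", "string")],
)
```

In practice a Glue crawler can infer a definition like this automatically; writing it by hand mainly shows what the catalog records: where the data lives, how to parse it, and which columns and partitions exist.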

Move & Transform
Amazon Kinesis Data Firehose: Easily and reliably stream data into data lakes
AWS Glue: Fully managed ETL service
AWS Lambda: Event-driven, serverless computing
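
An event-driven transform often starts with a Lambda function triggered when an object lands in S3. The handler below is a minimal sketch that only extracts the bucket and key from each record of an S3 notification event. What happens next (cleaning, converting, cataloging) is left out, and the sample event is hand-written for illustration.

```python
from urllib.parse import unquote_plus


def handler(event, context):
    """Return the bucket and key of every S3 object named in the event."""
    objects = []
    for record in event.get("Records", []):
        s3_info = record.get("s3")
        if s3_info is None:
            continue  # not an S3 notification record
        bucket = s3_info["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(s3_info["object"]["key"])
        objects.append({"bucket": bucket, "key": key})
    return {"objects": objects}


# Hand-written sample event with a hypothetical bucket and key.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-data-lake"},
                "object": {"key": "raw/2019/12/09/my+file%281%29.json"},
            }
        }
    ]
}
result = handler(sample_event, None)
```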

Access & User Interfaces
AWS AppSync: Manage and synchronize mobile app data in real time across devices and users
Amazon Cognito: Add user sign-up, sign-in, and access control to your web and mobile apps quickly and easily
Amazon API Gateway: Fully managed service for creating, publishing, maintaining, and monitoring secure APIs at scale

Analytics & Serving
Amazon Redshift: Fast, simple, cost-effective data warehousing service
Amazon Athena: Serverless, interactive query service
Amazon QuickSight: Fast, cloud-powered business intelligence service
AWS Glue: Store metadata in a data catalog
Amazon DynamoDB: Fully managed NoSQL database service
Amazon EMR: Run & Scale Spark, Hadoop, and other Big Data Frameworks
AWS Direct Connect: Establish a dedicated network connection to AWS
Amazon Elasticsearch Service: Fully managed Elasticsearch service
Amazon Neptune: Fully managed Graph database service
Amazon RDS: Distributed relational database service

Manage & Secure
AWS KMS: Manage cryptographic keys and control their use across services
AWS IAM: Securely manage access to AWS services and resources
AWS CloudTrail: Enable governance, compliance, operational auditing, and risk auditing
Amazon CloudWatch: Monitor your AWS resources and the applications you run on AWS in real time
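
Access control in a data lake often comes down to IAM policies scoped to a bucket prefix. The sketch below builds a read-only policy document for one prefix. The bucket name, prefix, and statement IDs are hypothetical, and a real policy would likely need more (for example KMS permissions when objects are encrypted).

```python
import json


def read_only_prefix_policy(bucket, prefix):
    """Build an IAM policy allowing list and read access under one prefix."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                # Restrict listing to keys under the prefix.
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


# Hypothetical analyst policy for the curated zone of the lake.
policy = read_only_prefix_policy("example-data-lake", "curated/")
policy_json = json.dumps(policy, indent=2)
```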

A Data Lake in Days
AWS Lake Formation: Source crawlers, ETL and data prep, data catalog, security settings, access control
Identify data sources
Data lake storage
Provide self-service access

An Example
Amazon S3
AWS Lambda
AWS CloudTrail
AWS IAM
AWS Glue
Amazon Athena
Amazon QuickSight

Photo by Moritz Mentges on Unsplash
DEMO

Some Best Practices
Encrypt data at rest and in transit
Partition data
Compress data
Use columnar file formats
Use lifecycle policies
Automate, automate, automate
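
Two of these practices, partitioning and compression, can be sketched with the standard library alone. The example assumes a Hive-style `year=/month=/day=` key layout, which query engines such as Athena can use to skip irrelevant data, and gzip-compressed JSON lines. The dataset name and records are made up. A columnar format such as Parquet would need a third-party library and is not shown.

```python
import gzip
import json
from datetime import datetime, timezone


def partitioned_key(dataset, event_time, filename):
    """Build a Hive-style partitioned object key (year=/month=/day=)."""
    return (
        f"{dataset}/year={event_time:%Y}/month={event_time:%m}/"
        f"day={event_time:%d}/{filename}"
    )


def compress_json_lines(records):
    """Serialize records as newline-delimited JSON and gzip the result."""
    lines = "\n".join(json.dumps(r, sort_keys=True) for r in records) + "\n"
    return gzip.compress(lines.encode("utf-8"))


# Hypothetical dataset and records.
when = datetime(2019, 12, 9, tzinfo=timezone.utc)
key = partitioned_key("clicks", when, "part-0000.json.gz")
body = compress_json_lines([{"user": "a1"}, {"user": "b2"}])
```

The resulting `key` and `body` are what an upload step would write to the lake's bucket. A query that filters on `year`, `month`, and `day` then only has to read the matching prefixes.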

Thank you!
Guillermo A. Fisher
@guillermoandrae
https://bklyn.dev

Please complete the session
survey in the mobile app.