The Beginner's Guide to Data Lakes in AWS

GuillermoAFisher, Dec 09, 2019

About This Presentation

AWS offers everything you need to deploy a secure and flexible data lake in the cloud. Discover how services like Amazon Simple Storage Service (Amazon S3) and Amazon Redshift can be used together to build and manage your own data lake, and how AWS Lake Formation makes it possible to set up a data l...

Slide Content

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
The Beginner’s Guide to Data Lakes in AWS
Guillermo A. Fisher
Senior Engineering Manager, Handshake
DVC12

Agenda
Why a Data Lake?
Key Concepts
Data Lakes on AWS
An Example
Best Practices

Related DevChats
DVC10 - Lessons from the backyard: A connected BBQ grill and smoker
DVC06 - Use Neptune to discover where & when events can impact local businesses

“The never-ending stream of information is incredibly useful for businesses, but it can also be a challenge to draw relevant insights from such a large data pool.”
Michael Brenner
CEO, Marketing Insider Group

The Data Science Hierarchy of Needs
AI
Learn/Optimize
Aggregate/Label
Explore/Transform
Move/Store
Collect
“You need a solid foundation for your data before being effective with AI and machine learning.”
Monica Rogati
Data Science and AI Advisor

The Data Warehouse Solution
Diagram: a Data Warehouse with three Data Marts
Advantages
Provides precise reporting and BI
Standardized, consistent data
Drawbacks
Limited to pre-determined questions
No low-level data visibility

Considerations for a Modern Solution
Centralized Data Storage: Store all data reliably in one location
Multiple User Communities: Business analysts, data professionals
Schema on Read: Schema written at time of analysis
Storage vs. Compute: Scale storage and compute independently
Data Types & Formats: Structured, semi-structured, unstructured, raw data
Security: Control access to the data
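
Schema on read can be illustrated with a minimal sketch: raw records are stored as-is, and a schema is applied only when a question is asked. The record fields, `SALES_SCHEMA`, and the `read_with_schema` helper are hypothetical names chosen for illustration, not something from the presentation.

```python
import json

# Raw events stored as-is (JSON lines); no schema was enforced at write time.
RAW_LINES = [
    '{"user_id": "42", "amount": "19.99", "country": "US"}',
    '{"user_id": "7", "amount": "5.00"}',
    '{"user_id": "42", "amount": "3.50", "country": "CA", "coupon": "X1"}',
]

# Schema chosen at analysis time: field name -> (type caster, default value).
SALES_SCHEMA = {
    "user_id": (int, None),
    "amount": (float, 0.0),
    "country": (str, "unknown"),
}


def read_with_schema(lines, schema):
    """Parse raw lines and project each record onto the given schema."""
    for line in lines:
        raw = json.loads(line)
        yield {
            name: cast(raw[name]) if name in raw else default
            for name, (cast, default) in schema.items()
        }


rows = list(read_with_schema(RAW_LINES, SALES_SCHEMA))

# Answer one question (total spend per user) with the schema applied above.
total_by_user = {}
for row in rows:
    total_by_user[row["user_id"]] = total_by_user.get(row["user_id"], 0.0) + row["amount"]
```

A different analysis could read the same raw lines with a different schema, for example one that keeps the `coupon` field, without rewriting the stored data.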

Photo by Yifan Liu on Unsplash
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics, from dashboards and visualizations to big data processing, real-time analytics, and machine learning, to guide better decisions.

Photo by arsalan arianmehr on Unsplash
Onboard relevant data
Metadata should exist in a data catalog
Data governance policies and procedures govern storage and access
Automated processes manage data flow and data cleaning, and enforce practices

Centralized Storage
Amazon S3
Scalable object storage
Decouples storage and compute
99.999999999% durability
Cost-effective lifecycle policies
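
As a rough sketch of what a lifecycle policy might look like, the snippet below builds a rule that moves objects to cheaper storage classes as they age and eventually expires them. The prefix, day counts, and both function names are assumptions for illustration. The `apply_lifecycle` function uses the boto3 `put_bucket_lifecycle_configuration` call and needs boto3 plus AWS credentials, so it is defined but not invoked here.

```python
def build_lifecycle_configuration(prefix, ia_after_days, glacier_after_days, expire_after_days):
    """Return a lifecycle configuration dict in the shape the S3 API expects."""
    if not ia_after_days < glacier_after_days < expire_after_days:
        raise ValueError("days must increase: IA < Glacier < expiration")
    return {
        "Rules": [
            {
                "ID": f"tier-and-expire-{prefix.strip('/')}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


def apply_lifecycle(bucket, config):
    """Apply the configuration to a bucket (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above runs without it

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(Bucket=bucket, LifecycleConfiguration=config)


# Hypothetical values: raw data tiers down after 30 and 90 days, expires after a year.
config = build_lifecycle_configuration("raw/", 30, 90, 365)
```

Calling `apply_lifecycle("example-data-lake", config)` would attach the rule to a bucket with that (hypothetical) name.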

Data Ingestion
Amazon Kinesis Data Firehose: Easily and reliably stream data into data lakes
AWS Snowball: Migrate large datasets using secure devices
AWS Storage Gateway: Gain on-premises access to AWS cloud storage
AWS Database Migration Service: Migrate databases to AWS quickly and securely
AWS Direct Connect: Establish a dedicated network connection to AWS
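
To make the streaming path more concrete, here is a minimal sketch of preparing events for Kinesis Data Firehose. It encodes events as newline-delimited JSON and splits them into batches of at most 500 records, the per-call record limit of `PutRecordBatch`. The helper names and the event shape are assumptions, the per-call size limit is not handled, and `send` needs boto3 and AWS credentials, so it is defined but not called.

```python
import json

MAX_RECORDS_PER_BATCH = 500  # PutRecordBatch accepts at most 500 records per call


def to_firehose_records(events):
    """Encode each event as a newline-terminated JSON record."""
    return [{"Data": (json.dumps(event) + "\n").encode("utf-8")} for event in events]


def chunk(records, size=MAX_RECORDS_PER_BATCH):
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def send(stream_name, events):
    """Send events to a delivery stream (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above run without it

    firehose = boto3.client("firehose")
    for batch in chunk(to_firehose_records(events)):
        response = firehose.put_record_batch(DeliveryStreamName=stream_name, Records=batch)
        # A non-zero FailedPutCount means some records were rejected and need a retry.
        if response["FailedPutCount"]:
            raise RuntimeError(f"{response['FailedPutCount']} records failed")


# Hypothetical events, only to show how the batching splits them.
batches = list(chunk(to_firehose_records([{"i": i} for i in range(1201)])))
```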

Catalog & Search
Amazon DynamoDB: Fully managed NoSQL database service
Amazon Elasticsearch Service: Fully managed Elasticsearch service
AWS Glue: Store metadata in a data catalog

Move & Transform
Amazon Kinesis Data Firehose: Easily and reliably stream data into data lakes
AWS Glue: Fully managed ETL service
AWS Lambda: Event-driven, serverless computing
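
The event-driven model can be sketched with a small AWS Lambda handler that reacts to S3 "object created" notifications. This is an illustrative assumption about how such a function might start, not the code from the talk's demo. The bucket name and key in `sample_event` are made up. Object keys arrive URL-encoded in S3 event notifications, hence the `unquote_plus` call.

```python
from urllib.parse import unquote_plus


def handler(event, context):
    """Collect the bucket and key of every object named in an S3 event notification."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys are URL-encoded in the notification (spaces become '+', '=' becomes %3D).
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append({"bucket": bucket, "key": key})
    # A real function would transform or catalog each object at this point.
    return {"processed": len(objects), "objects": objects}


# Hypothetical event, trimmed to the fields the handler reads.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-data-lake"},
                "object": {"key": "raw/year%3D2019/my+file.json"},
            }
        }
    ]
}

result = handler(sample_event, None)
```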

Access & User Interfaces
AWS AppSync: Manage and synchronize mobile app data in real time across devices and users
Amazon Cognito: Add user sign-up, sign-in, and access control to your web and mobile apps quickly and easily
Amazon API Gateway: Fully managed service for creating, publishing, maintaining, and monitoring secure APIs at scale

Analytics & Serving
Amazon Redshift: Fast, simple, cost-effective data warehousing service
Amazon Athena: Serverless, interactive query service
Amazon QuickSight: Fast, cloud-powered business intelligence service
AWS Glue: Store metadata in a data catalog
Amazon DynamoDB: Fully managed NoSQL database service
Amazon EMR: Run and scale Spark, Hadoop, and other big data frameworks
AWS Direct Connect: Establish a dedicated network connection to AWS
Amazon Elasticsearch Service: Fully managed Elasticsearch service
Amazon Neptune: Fully managed graph database service
Amazon RDS: Distributed relational database service

Manage & Secure
AWS KMS: Manage cryptographic keys and control their use across services
AWS IAM: Securely manage access to AWS services and resources
AWS CloudTrail: Enable governance, compliance, operational auditing, and risk auditing
Amazon CloudWatch: Monitor your AWS resources and the applications you run on AWS in real time
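
Access control with AWS IAM might look like the following sketch, which builds a read-only policy document scoped to one prefix of a data lake bucket. The bucket name, prefix, statement IDs, and the `read_only_prefix_policy` helper are hypothetical. Real policies should be reviewed against the access patterns actually needed.

```python
import json


def read_only_prefix_policy(bucket, prefix):
    """Build an IAM policy document allowing list and read access under one prefix."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                # Restrict listing to keys under the prefix.
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


policy = read_only_prefix_policy("example-data-lake", "curated/")
policy_json = json.dumps(policy, indent=2)  # the JSON text that would be attached in IAM
```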

A Data Lake in Days
AWS Lake Formation
Source crawlers, ETL and data prep, data catalog, security settings, access control
Identify data sources
Data lake storage
Provide self-service access

An Example
Services used: Amazon S3, AWS Lambda, AWS CloudTrail, AWS IAM, AWS Glue, Amazon Athena, Amazon QuickSight

Photo by Moritz Mentges on Unsplash
DEMO

Some Best Practices
Encrypt data at rest and in transit
Partition data
Compress data
Use columnar file formats
Use lifecycle policies
Automate, automate, automate
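
Two of these practices, partitioning and compression, can be sketched with the standard library alone. The snippet writes gzip-compressed JSON lines into Hive-style `key=value` partition directories, a layout that engines such as Athena can use to skip irrelevant data. The table name, record fields, and file name are assumptions, and a local temporary directory stands in for an S3 bucket. Columnar formats such as Parquet need a third-party library (for example pyarrow) and are not shown.

```python
import gzip
import json
import os
import tempfile
from collections import defaultdict


def partition_path(root, table, event_date):
    """Build a Hive-style partition directory: table/year=YYYY/month=MM/day=DD."""
    year, month, day = event_date.split("-")
    return os.path.join(root, table, f"year={year}", f"month={month}", f"day={day}")


def write_partitioned(records, root, table):
    """Group records by date and write one gzip-compressed JSON-lines file per day."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["date"]].append(record)

    written = []
    for event_date, rows in sorted(grouped.items()):
        directory = partition_path(root, table, event_date)
        os.makedirs(directory, exist_ok=True)
        path = os.path.join(directory, "part-0000.json.gz")
        with gzip.open(path, "wt", encoding="utf-8") as handle:
            for row in rows:
                handle.write(json.dumps(row) + "\n")
        written.append(path)
    return written


# Hypothetical click events spanning two days.
records = [
    {"date": "2019-12-08", "page": "/home"},
    {"date": "2019-12-09", "page": "/pricing"},
    {"date": "2019-12-09", "page": "/docs"},
]
root = tempfile.mkdtemp()
paths = write_partitioned(records, root, "clicks")
```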
