Strata sf - Amundsen presentation

taofung 4,017 views 69 slides Mar 23, 2019

About This Presentation

Amundsen is a metadata-driven application for improving the productivity of data analysts, data scientists, and engineers when interacting with data.


Slide Content

Disrupting Data Discovery
March 2019
Mark Grover | @mark_grover | Product Management, Lyft
Tao Feng | @feng-tao | Software Engineer, Lyft

Agenda
- Data at Lyft
- Challenges with Data Discovery
- Data Discovery at Lyft
- Architecture
- Summary

Data platform users: Data Modelers, Analysts, Data Scientists, General Managers, Data Platform Engineers, Experimenters, Product Managers

Lyft Data Team: Core Data Infra, Streaming Infra, Visualization, Experimentation, BI and Logging, ML Infra

Core Infra high-level architecture [diagram; visible label: Custom apps]

Data Discovery

Hi! I am a n00b Data Scientist! My first project is to analyze and predict Strata attendance. Where is the data? What does it mean?

Status quo
- Option 1: Phone a friend!
- Option 2: GitHub search

Understand the context
- What does this field mean?
- Does attendance data include employees?
- Does it include revenue?
- Let me dig in and understand.

Explore
    SELECT * FROM default.my_table WHERE ds='2018-01-01' LIMIT 100;

Exploring with SELECT * is EVIL
- Lack of productivity for data scientists
- Increased load on the databases

Data scientists spend up to 1/3 of their time on data discovery.
Data discovery: lack of understanding of what data exists, where, who owns it, who uses it, and how to request access.

Audience for data discovery

Data Discovery - User personas: Data Modelers, Analysts, Data Scientists, General Managers, Data Platform Engineers, Experimenters, Product Managers

3 Data Scientist personas
- Power user: all info in their head; gets interrupted a lot due to questions
- Noob user: lost; asks “power users” a lot of questions
- Manager: dependencies landing on time; communicating with stakeholders

Data Discovery answers 3 kinds of questions
- Search based
  - Where is the table/dashboard for X? What does it contain?
  - Does this analysis already exist?
- Lineage based
  - I am changing a data model; who are the owners and most common users?
  - This table’s delivery was delayed today; I want to notify everyone downstream.
- Network based
  - I want to follow a power user in my team.
  - I want to bookmark tables of interest and get a feed of data delays, schema changes, and incidents.

Buy vs. Build vs. Adopt

Compared various existing solutions/open source projects
- Products: Alation, WhereHows, Airbnb Data Portal, Cloudera Navigator, Apache Atlas
- Criteria: Search based, Lineage based, Network based, Hive/Presto support, Redshift support, Open source (pref.)

Meet Amundsen
First person to discover the South Pole - Norwegian explorer, Roald Amundsen

Landing page optimized for search

Search results ranked on relevance and query activity

How does search work?

Relevance - search for “apple” on Google [figure: low relevance vs. high relevance results]

Popularity - search for “apple” on Google [figure: low popularity vs. high popularity results]

Striking the balance
- Relevance: names, descriptions, tags, [owners, frequent users]
- Popularity: querying activity, dashboarding; different weights for automated vs. ad hoc querying
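
The balance between relevance and popularity can be sketched as a single ranking score. This is a minimal illustration, not Amundsen's actual ranking function: the `TableStats` fields, the weight constants, the log damping, and the additive combination are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class TableStats:
    name: str
    relevance: float        # text-match score from the search backend (assumed 0..1)
    adhoc_queries: int      # queries issued interactively by people
    automated_queries: int  # queries issued by scheduled jobs


# Hypothetical weights: ad hoc usage counts for more than automated usage.
ADHOC_WEIGHT = 1.0
AUTOMATED_WEIGHT = 0.1


def popularity(stats: TableStats) -> float:
    """Weighted query activity, log-damped so heavy tables do not dominate."""
    weighted = (ADHOC_WEIGHT * stats.adhoc_queries
                + AUTOMATED_WEIGHT * stats.automated_queries)
    return math.log1p(weighted)


def score(stats: TableStats, popularity_weight: float = 0.5) -> float:
    """Combine relevance and popularity into one ranking score."""
    return stats.relevance + popularity_weight * popularity(stats)


def rank(candidates):
    """Return candidates ordered from highest to lowest score."""
    return sorted(candidates, key=score, reverse=True)
```

With these assumed weights, a table that people query by hand outranks a slightly better text match that is only read by scheduled jobs.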

Back to mocks...

Detailed description and metadata about data resources

Data Preview within the tool

Computed stats about column metadata
Disclaimer: these stats are arbitrary.

Built-in user feedback

Amundsen’s architecture

[Architecture diagram]
- Metadata Sources: Postgres, Hive, Redshift, ..., Presto, Github Source File
- Databuilder Crawler
- Backends: Neo4j, Elasticsearch
- Services: Metadata Service, Search Service, Frontend Service
- Other Microservices: ML Feature Service, Security Service

1. Frontend Service

Detailed description and metadata about data resources

2. Metadata Service
- A thin proxy layer to interact with the graph database
  - Currently Neo4j is the default option for the graph backend engine.
  - Work with the community to support Apache Atlas
- Support REST API for other services pushing / pulling metadata directly

Trade-off #1: Why choose a graph database?

Trade-off #2: Why not propagate the metadata back to the source?

3. Search Service
- Support REST API for building indexes
- A thin proxy layer to interact with the search backend
  - Currently it supports Elasticsearch as backend.
- Support different search patterns
  - Normal Search: match records based on relevancy
  - Category Search: match records first based on data type, then relevancy
  - Wildcard Search
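
The three search patterns can be illustrated with a toy in-memory index. This sketch does not use Elasticsearch and is not the Search Service API: `RECORDS`, the function names, and the `category: term` query syntax are hypothetical, and results are only filtered, not ranked by relevancy.

```python
from fnmatch import fnmatchcase

# Hypothetical records standing in for an indexed catalog.
RECORDS = [
    {"table": "rides", "columns": ["ride_id", "is_line_ride"],
     "description": "one row per ride"},
    {"table": "ride_events", "columns": ["event_id", "ride_id"],
     "description": "ride state changes"},
    {"table": "drivers", "columns": ["driver_id"],
     "description": "driver profiles"},
]


def normal_search(term):
    """Normal search: match the term against names and descriptions."""
    term = term.lower()
    return [r["table"] for r in RECORDS
            if term in r["table"].lower() or term in r["description"].lower()]


def category_search(category, term):
    """Category search: restrict the match to one kind of field first."""
    if category == "column":
        return [r["table"] for r in RECORDS if term in r["columns"]]
    if category == "table":
        return [r["table"] for r in RECORDS if r["table"] == term]
    raise ValueError(f"unknown category: {category}")


def wildcard_search(pattern):
    """Wildcard search: shell-style pattern match on table names."""
    return [r["table"] for r in RECORDS if fnmatchcase(r["table"], pattern)]


def search(query):
    """Dispatch a raw query string to one of the three patterns."""
    if ":" in query:
        category, term = (part.strip() for part in query.split(":", 1))
        return category_search(category, term)
    if "*" in query:
        return wildcard_search(query)
    return normal_search(query)
```

For example, `search("column: is_line_ride")` takes the category path, `search("ride*")` the wildcard path, and `search("driver")` the normal path.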

Challenge #1: How to make the search results more relevant?
- Define a search quality metric
  - Click-Through-Rate (CTR) over top 5 results
- Search behaviour instrumentation is key
- Couple of improvements:
  - Boost the exact table ranking
  - Support wildcard search
  - Support category search (e.g. “column: is_line_ride”)
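
One plausible reading of "CTR over top 5 results" is the share of searches that end in a click on one of the first five results. The sketch below computes that from logged search events; the `SearchEvent` shape and the choice to count searches without any click in the denominator are assumptions, not details from the talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchEvent:
    query: str
    clicked_rank: Optional[int]  # 1-based rank of the clicked result, None if no click


def ctr_at_k(events, k=5):
    """Fraction of searches whose click landed within the top k results."""
    if not events:
        return 0.0
    hits = sum(1 for e in events
               if e.clicked_rank is not None and e.clicked_rank <= k)
    return hits / len(events)
```

Tracking this number before and after a ranking change gives a simple way to tell whether the change helped, which is why instrumenting search behaviour comes first.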

4. Databuilder

Challenge #1: Various forms of metadata

Metadata Sources @ Lyft

Metadata - Challenges
- Standardization: no single data model fits all data resources
  - A data resource could be a table, an Airflow DAG or a dashboard
- Extraction: each data set's metadata is stored and fetched differently
  - Hive table: stored in Hive metastore
  - RDBMS (Postgres etc.): fetched through DBAPI interface
  - Github source code: fetched through git hook
  - Mode dashboard: fetched through Mode API
  - …
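
One common way to handle both challenges is to hide each source behind a shared extractor interface that emits one normalized record type. The sketch below illustrates that idea only; the class names, the `ResourceRecord` fields, and the key formats are hypothetical and are not Databuilder's API. The Hive metastore and the Mode API are replaced by in-memory lists.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class ResourceRecord:
    resource_type: str  # e.g. "table", "dashboard", "workflow"
    key: str            # unique identifier (format assumed for this sketch)
    description: str


class Extractor(ABC):
    """Each source implements extract() and yields normalized records."""

    @abstractmethod
    def extract(self) -> Iterator[ResourceRecord]:
        ...


class HiveTableExtractor(Extractor):
    def __init__(self, metastore_rows):
        # Stand-in for rows read from the Hive metastore.
        self._rows = metastore_rows

    def extract(self):
        for row in self._rows:
            key = f"hive://{row['schema']}.{row['name']}"
            yield ResourceRecord("table", key, row.get("comment", ""))


class DashboardExtractor(Extractor):
    def __init__(self, api_payload):
        # Stand-in for a response from a dashboard tool's API.
        self._payload = api_payload

    def extract(self):
        for item in self._payload:
            yield ResourceRecord("dashboard", f"mode://{item['id']}",
                                 item.get("title", ""))


def collect(extractors):
    """Run every extractor and gather the records into one list."""
    return [record for ex in extractors for record in ex.extract()]
```

Downstream code then only deals with `ResourceRecord`, regardless of where the metadata came from.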

Challenge #2: Pull model vs. push model

Pull model vs. push model
- Pull model: periodically update the index by pulling from the system (e.g. database) via crawlers.
  Diagram components: Scheduler, Crawler, Database, Data graph
- Push model: the system (e.g. database) pushes metadata to a message bus, which downstream consumers subscribe to.
  Diagram components: Database, Message queue, Data graph

Pull model vs. push model
- Pull model
  - Onus of integration lies on the data graph
  - No interface to prescribe; crawlers are hard to maintain
  - Preferred if:
    - Waiting for indexing is OK
    - Working with “strapped” teams
    - There’s already an interface
- Push model
  - Onus of integration lies on the database
  - Message format serves as the interface
  - Allows for near-real-time indexing
  - Preferred if:
    - Near-real-time indexing is important
    - Clean interface doesn’t exist
    - Other tools like WhereHows are moving towards the push model
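
The difference between the two models can be shown with a small in-memory simulation. Everything here is a stand-in: `Database`, `DataGraph`, `crawl`, and `consume` are hypothetical names, and Python's `queue.Queue` plays the role of the message queue.

```python
import queue


class DataGraph:
    """Stand-in for the metadata store that search and discovery read from."""

    def __init__(self):
        self.tables = {}

    def upsert(self, name, metadata):
        self.tables[name] = metadata


class Database:
    """Hypothetical source system holding table metadata."""

    def __init__(self, message_queue=None):
        self.catalog = {}
        self._queue = message_queue

    def create_table(self, name, metadata):
        self.catalog[name] = metadata
        # Push model: the source publishes the change itself.
        if self._queue is not None:
            self._queue.put({"table": name, "metadata": metadata})


def crawl(database, graph):
    """Pull model: a scheduler runs this periodically against the source."""
    for name, metadata in database.catalog.items():
        graph.upsert(name, metadata)


def consume(message_queue, graph):
    """Push model: apply every pending message to the data graph."""
    while True:
        try:
            message = message_queue.get_nowait()
        except queue.Empty:
            break
        graph.upsert(message["table"], message["metadata"])
```

In the pull case the data graph stays stale until the next `crawl` run, and the crawler has to know how to read each source. In the push case the message dictionary is the only contract, but the source system has to be changed to publish it.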

4. Databuilder

Databuilder in action

How are we building data? Databuilder

How is Databuilder orchestrated?
Amundsen uses Apache Airflow to orchestrate Databuilder jobs.

What’s next?

Amundsen seems to be more useful than what we thought
- Tremendous success at Lyft
  - Used by Data Scientists, Engineers, PMs, Ops, even Cust. Service!
- Many organizations have similar problems
  - Collaborating with ING, WeWork and more
  - We plan to announce open source soon

Impact - Amundsen at Lyft [chart milestones: Alpha release, Beta release (internal), Generally Available (GA) release]

Adding more kinds of data resources
- Resources: People, Dashboards, Data sets, Streams, Schemas, Workflows
- Phases: Phase 1 (Complete), Phase 2 (In development), Phase 3 (In scoping)

Serving more metadata about existing resources
- Application Context: existence, description, semantics, etc.
- Behavior: how data is created and used over time
- Change: how data is changing over time

Summary