Data Discoverability with DataHub

GlebMezhanskiy 173 views 11 slides Nov 26, 2020

About This Presentation

Presented at the Data Quality Meetup by Maggie Hays, Senior Product Manager, Data Services @ SpotHero

Learn more about Data Quality Meetup
https://www.datafold.com/blog/data-quality-meetup-2


Slide Content

Data Discoverability with DataHub
Maggie Hays
Senior Product Manager -- Data Services
Data Quality Meetup -- November 19, 2020

2
Agenda
●Overview of Teams
●Current State of Data Discoverability
●Data Catalog Evaluation
●DataHub POC - Progress & Level of Effort
●Highlight: DataHub Functionality

3
SpotHero’s Data-Focused Teams
Data Engineering
●3 Engineers
SpotHero IQ
●2 Engineers
●3 Data Scientists
Analytics
●3 Business Analysts
(We’re hiring!!)

4
Current State of Data Discoverability
1. Data Lineage is difficult to discover and navigate, regardless of role or tenure
●Impact analysis is arduous; Engineers avoid breaking changes at all costs
●Prolonged debugging/troubleshooting of data issues
2. Difficult to discover what data exists and/or what it represents
●Reliance on tribal knowledge
●Large burden on the Analytics team to answer any/all questions
3. Confidence in Data Accuracy is neutral, with room for improvement
●Once folks track down the data, they are relatively confident in its accuracy

May 2020 Internal Survey - Engineering, Product, Analytics, Data Science teams; 47% response rate

5
Data Catalog Evaluation
Tools compared: DataHub, Amundsen / Marquez, Apache Atlas, Alation
Criteria: Ease of Integration, Lineage Support, Configurable Metadata, Affordability

6
SpotHero’s Data Stack & DataHub POC
Sources: SH Application Data, Workflow Tools, Marketing Tools, Microservices, Clickstream
Ingestion: Fivetran, Segment, Kafka
Storage: Redshift, S3/Parquet
ETL: SQL, Python, Spark
Also shown: Looker, Airflow, Analytics
Legend (DataHub POC status): Complete, Q4 2020

7
DataHub POC - Level of Effort
1. Research & Tool Evaluation: 180 hrs
●Creation of Pugh Matrix to force-rank evaluation
●Rapid side-by-side POC of DataHub and Amundsen/Marquez
2. Initial Rollout of DataHub POC: 300 hrs
●Terraform Elasticsearch, MySQL, Neo4j, Aiven; Helm chart for API/frontend/Kafka components
●Datalake & ETL scrapers, including lineage
●Enrich with ETL ownership, links to GHE
3. Looker & Kafka Metadata Ingestion & Lineage: Est. 160 hrs
●Building Looker/LookML scraper - planning to contribute back to DataHub codebase
●Teaming up with DataHub to inform design of Dashboard entities
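
For readers unfamiliar with the Pugh Matrix mentioned above, here is a minimal sketch of how such a force-ranking could be computed. The criteria names come from the Data Catalog Evaluation slide; the weights, the option names, and every rating are hypothetical placeholders, not SpotHero's actual scores. Each option is rated against a baseline as -1 (worse), 0 (same), or +1 (better), and the weighted sum gives the ranking.

```python
from typing import Dict, List

# Criteria taken from the evaluation slide; the weights are hypothetical.
WEIGHTS: Dict[str, int] = {
    "Ease of Integration": 3,
    "Lineage Support": 3,
    "Configurable Metadata": 2,
    "Affordability": 1,
}

# Hypothetical ratings relative to a baseline option:
# -1 = worse than baseline, 0 = same, +1 = better.
RATINGS: Dict[str, Dict[str, int]] = {
    "Baseline": {c: 0 for c in WEIGHTS},
    "Option A": {"Ease of Integration": 1, "Lineage Support": 1,
                 "Configurable Metadata": 0, "Affordability": 1},
    "Option B": {"Ease of Integration": 0, "Lineage Support": -1,
                 "Configurable Metadata": 1, "Affordability": 0},
    "Option C": {"Ease of Integration": -1, "Lineage Support": 1,
                 "Configurable Metadata": 1, "Affordability": -1},
}


def pugh_scores(weights: Dict[str, int],
                ratings: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Weighted sum of each option's ratings across all criteria."""
    return {
        option: sum(weights[c] * scores[c] for c in weights)
        for option, scores in ratings.items()
    }


def force_rank(scores: Dict[str, int]) -> List[str]:
    """Options ordered from highest to lowest total score."""
    return sorted(scores, key=lambda option: scores[option], reverse=True)


if __name__ == "__main__":
    totals = pugh_scores(WEIGHTS, RATINGS)
    for option in force_rank(totals):
        print(f"{option}: {totals[option]}")
```

Keeping the baseline in the table at a score of 0 makes it easy to see which options beat the status quo and which fall below it.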

8
DataHub Functionality: Cross-Platform Search

9
DataHub Functionality: Dataset Metadata
●DDL & Ownership
●External Docs

10
DataHub Functionality: Lineage
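
Once lineage edges are captured, the impact analysis described earlier as arduous becomes a downstream graph traversal. The sketch below illustrates the idea only: it uses a hypothetical in-memory edge map with invented dataset names, and it is not DataHub's API or data model.

```python
from collections import deque
from typing import Dict, List

# Hypothetical lineage edges: upstream dataset -> datasets derived from it.
LINEAGE: Dict[str, List[str]] = {
    "kafka.events": ["s3.events_parquet"],
    "s3.events_parquet": ["redshift.fact_events"],
    "redshift.fact_events": ["looker.usage_dashboard", "redshift.daily_rollup"],
    "redshift.daily_rollup": ["looker.usage_dashboard"],
}


def downstream(dataset: str, lineage: Dict[str, List[str]]) -> List[str]:
    """Breadth-first list of everything affected by a change to `dataset`."""
    seen = {dataset}          # never report the starting dataset itself
    queue = deque([dataset])
    affected: List[str] = []
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected


if __name__ == "__main__":
    for name in downstream("s3.events_parquet", LINEAGE):
        print(name)
```

The `seen` set keeps a dataset that is reachable by several paths (here the hypothetical dashboard) from being reported twice, and also guards against cycles in the edge map.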

11
Yay Data Discoverability!