Presented at the Data Quality Meetup by Maggie Hays, Senior Product Manager, Data Services @ SpotHero
Learn more about the Data Quality Meetup
https://www.datafold.com/blog/data-quality-meetup-2
Added: Nov 26, 2020
Slides: 11 pages
Slide Content
Data Discoverability with DataHub
Maggie Hays
Senior Product Manager -- Data Services
Data Quality Meetup -- November 19, 2020
Agenda
●Overview of Teams
●Current State of Data Discoverability
●Data Catalog Evaluation
●DataHub POC - Progress & Level of Effort
●Highlight: DataHub Functionality
SpotHero’s Data-Focused Teams
Data Engineering
3 Engineers
SpotHero IQ
2 Engineers
3 Data Scientists
Analytics
3 Business Analysts
(We’re hiring!!)
Current State of Data Discoverability
Data lineage is difficult to discover and navigate, regardless of role or tenure
●Impact analysis is arduous; Engineers avoid breaking changes at all costs
●Debugging and troubleshooting of data issues is prolonged
Difficult to discover what data exists and/or what it represents
●Reliance on tribal knowledge
●Large burden on the Analytics team to answer any/all questions
Confidence in data accuracy is neutral, with room for improvement
●Once folks track down the data, they are relatively confident in its accuracy
May 2020 Internal Survey - Engineering, Product, Analytics, Data Science teams; 47% response rate
Data Catalog Evaluation
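Tools compared (columns): DataHub; Amundsen / Marquez; Apache Atlas; Alation
Criteria (rows): Ease of Integration; Lineage Support; Configurable Metadata; Affordability
[Comparison matrix graphic: per-tool ratings not reproduced in the text]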
DataHub POC - Level of Effort
Research & Tool Evaluation: 180 hrs
●Creation of Pugh Matrix to force-rank evaluation
●Rapid side-by-side POC of DataHub and Amundsen/Marquez
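A Pugh matrix of the kind mentioned above can be sketched in a few lines. This is a minimal illustration only: the tool names, criteria weights, and ratings are hypothetical placeholders, not the values SpotHero used, and the scoring rule (weighted sum of +1 / 0 / -1 ratings against a baseline) is one common variant of the method.

```python
from typing import Dict, List

# Hypothetical criteria weights (higher = more important); not from the talk.
WEIGHTS: Dict[str, int] = {
    "ease_of_integration": 3,
    "lineage_support": 3,
    "configurable_metadata": 2,
    "affordability": 1,
}

# Hypothetical ratings against a baseline option:
# +1 = better than baseline, 0 = same, -1 = worse.
RATINGS: Dict[str, Dict[str, int]] = {
    "tool_a": {"ease_of_integration": 1, "lineage_support": 1,
               "configurable_metadata": 0, "affordability": 1},
    "tool_b": {"ease_of_integration": 0, "lineage_support": -1,
               "configurable_metadata": 1, "affordability": 1},
    "tool_c": {"ease_of_integration": -1, "lineage_support": 1,
               "configurable_metadata": 1, "affordability": -1},
}


def pugh_scores(weights: Dict[str, int],
                ratings: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Return the weighted sum of ratings for each tool."""
    return {
        tool: sum(weights[criterion] * tool_ratings[criterion]
                  for criterion in weights)
        for tool, tool_ratings in ratings.items()
    }


def force_rank(scores: Dict[str, int]) -> List[str]:
    """Order tools from highest to lowest score; ties break by name."""
    return sorted(scores, key=lambda tool: (-scores[tool], tool))


SCORES = pugh_scores(WEIGHTS, RATINGS)
RANKING = force_rank(SCORES)
print(SCORES)
print(RANKING)
```

Keeping weights separate from ratings makes it easy to see how sensitive the ranking is to a change in priorities.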
Initial Rollout of DataHub POC: 300 hrs
●Terraform for Elasticsearch, MySQL, Neo4j, Aiven; Helm chart for API/frontend/Kafka components
●Datalake & ETL scrapers, including lineage
●Enrich with ETL ownership, links to GHE (GitHub Enterprise)
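To illustrate why scraped lineage helps with impact analysis, the sketch below stores lineage as (upstream, downstream) edges and walks the graph to list everything affected by a change to one dataset. It is not DataHub's data model or API; the dataset names and helper functions are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set, Tuple


def build_downstream_index(
        edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each dataset to the set of datasets built directly from it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for upstream, downstream in edges:
        index[upstream].add(downstream)
    return index


def downstream_of(dataset: str, index: Dict[str, Set[str]]) -> List[str]:
    """Breadth-first walk returning every dataset affected by `dataset`."""
    seen: Set[str] = set()
    order: List[str] = []
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        # Sort children so the output order is deterministic.
        for child in sorted(index.get(current, ())):
            if child not in seen and child != dataset:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Hypothetical lineage edges, as a scraper might emit them.
EDGES = [
    ("raw.events", "etl.sessions"),
    ("etl.sessions", "mart.daily_sessions"),
    ("raw.events", "mart.event_counts"),
    ("raw.users", "etl.sessions"),
]

INDEX = build_downstream_index(EDGES)
IMPACTED = downstream_of("raw.events", INDEX)
print(IMPACTED)
```

With the edges in one place, the question "what breaks if this table changes?" becomes a graph traversal rather than a search through tribal knowledge.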
Looker & Kafka Metadata Ingestion & Lineage: Est. 160 hrs
●Building Looker/LookML scraper - planning to contribute back to the DataHub codebase
●Teaming up with DataHub to inform design of Dashboard entities
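As a rough idea of what a LookML scraper has to extract, the sketch below pulls the `sql_table_name` of each view out of LookML text, which gives a view-to-table lineage edge. This is not the scraper described in the talk: the function name and sample LookML are hypothetical, and the regex approach is a simplification that a dedicated LookML parser would handle more robustly (for example, it skips views based on derived tables).

```python
import re
from typing import Dict

# Matches the start of a view block, e.g. "view: orders {".
VIEW_RE = re.compile(r"\bview:\s*(\w+)\s*\{")
# Matches "sql_table_name: schema.table ;;" and captures the table name.
TABLE_RE = re.compile(r"sql_table_name:\s*(.+?)\s*;;")


def views_to_tables(lookml: str) -> Dict[str, str]:
    """Return {view name: table name} for views that set sql_table_name."""
    mapping: Dict[str, str] = {}
    matches = list(VIEW_RE.finditer(lookml))
    for i, match in enumerate(matches):
        # A view's body runs until the next view starts (or end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(lookml)
        body = lookml[match.end():end]
        table = TABLE_RE.search(body)
        if table:
            mapping[match.group(1)] = table.group(1)
    return mapping


# Hypothetical LookML for illustration.
SAMPLE = """
view: orders {
  sql_table_name: analytics.orders ;;
  dimension: id { sql: ${TABLE}.id ;; }
}
view: derived_example {
  derived_table: { sql: SELECT 1 ;; }
}
"""

MAPPING = views_to_tables(SAMPLE)
print(MAPPING)
```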