How to use Dremio as a standard interface to implement Output Ports in a Data Mesh architecture
Slide Content
The role of Dremio in a Data Mesh architecture. Presented by: Paolo Platter, CTO & Co-founder @ Agile Lab
Who we are: we value transparency, collaboration, and results; totally decentralized and self-managed; international culture and mindset; customer laser-focused. What we do: Data Engineering has been our mission since 2013; crafting end-to-end data platforms; Data Strategy; Managed Data Service. www.agilelab.it
Data Mesh Principles: domain-driven data ownership and architecture; data as a product; self-serve infrastructure as a platform; federated computational governance.
Data Product: code, data, and infrastructure. Data + metadata (syntax + semantics, expected behaviour, access control). Internally: a data pipeline, stream processing, and internal processes (GDPR, DQ, etc.). Input Ports are fed by operational systems, other Data Products, and external services. Output Ports expose data as Events, SQL Views, Raw/Files, or Graph/RDF. Control Ports expose the Information API, the Observability API, and the Data Access API.
Data Mesh is a practice: each Data Product team can select the technology that best fits the use case, including multi-cloud needs, but that technology must be compliant with the Data Product features and requirements: technology independence, addressability, interoperability, self-serve provisioning, and independent deployability.
Output Ports. The Output Port API gives the Data Consumer a descriptive schema, audit, access control, decoupling (URI and protocol), and SLOs. Option 1: data flows through the API itself (GraphQL or HTTP); zero coupling but low performance, and not suitable for all data-consumption use cases. Option 2: after a pre-flight call to the Output Port API, data flows through the native protocol (Events, SQL, Files); low coupling, good performance, and fully polyglot.
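A minimal sketch of the pre-flight idea described above: the consumer first asks the Output Port API for a descriptor, then connects through whatever native protocol it advertises. All of these types, fields, and endpoints are hypothetical, not an actual Dremio or Agile Lab API.

```java
import java.net.URI;

public class PreFlightExample {

    // What the pre-flight call could return: the protocol and URI are resolved
    // at request time, so the consumer is not coupled to the technology
    // behind the port. (Illustrative shape, not a real API.)
    record OutputPortDescriptor(String protocol,   // e.g. "jdbc", "kafka", "s3"
                                URI uri,           // resolved native endpoint
                                String schemaRef)  // pointer to syntax + semantics
    {}

    static OutputPortDescriptor fetchDescriptor(String portId) {
        // In a real mesh this would be an HTTP call to the port's API;
        // hard-coded here to keep the sketch self-contained.
        return new OutputPortDescriptor(
                "jdbc",
                URI.create("jdbc:dremio:direct=dremio.internal:31010"),
                "https://catalog.example.com/schemas/" + portId);
    }

    public static void main(String[] args) {
        OutputPortDescriptor d = fetchDescriptor("sales.orders.v1");
        // The consumer stays polyglot: it picks a client based on the
        // protocol advertised by the descriptor.
        switch (d.protocol()) {
            case "jdbc"  -> System.out.println("Connect via JDBC to " + d.uri());
            case "kafka" -> System.out.println("Subscribe to events at " + d.uri());
            default      -> System.out.println("Unsupported protocol: " + d.protocol());
        }
    }
}
```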
Problem
Connecting a BI tool to an Output Port. GraphQL and other HTTP-based protocols are not widely supported by BI tools, and building a custom pre-flight that dynamically discovers the protocol of the source is not easy. To query a file/object storage directly you need a SQL engine, which is typically not available inside BI tools. A JDBC/ODBC connection is a good, standard option for BI tools, but it hides some problems.
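For concreteness, this is roughly what a BI tool does under the hood over JDBC: a minimal sketch assuming Dremio's JDBC driver is on the classpath, its default coordinator port 31010, and a hypothetical view name and credentials.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class BiToolConnection {
    public static void main(String[] args) throws Exception {
        // Dremio's JDBC URL format; host and credentials are placeholders.
        String url = "jdbc:dremio:direct=dremio.internal:31010";
        try (Connection conn = DriverManager.getConnection(url, "analyst", "secret");
             Statement stmt = conn.createStatement();
             // "sales"."orders_output_port" is a hypothetical SQL view
             // exposed by a data product as its output port.
             ResultSet rs = stmt.executeQuery(
                     "SELECT order_id, amount FROM \"sales\".\"orders_output_port\" LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("order_id") + " " + rs.getDouble("amount"));
            }
        }
    }
}
```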
A SQL engine is also needed. Data Products should also embed some query capability and offer it to data consumers. This can happen by leveraging a DBMS technology (storage and query capabilities all-in-one); otherwise you can rely on object storage or distributed file systems, in which case you need a SQL engine (Athena, BigQuery, etc.) to query them. Applications, data analysts, and data scientists then execute queries against the Data Products' SQL and Raw output ports.
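Sketched as SQL, the two options look like this, with hypothetical source names ("postgres_dp", "s3_lake") and layout; an engine such as Dremio, like Athena or BigQuery, can expose a folder of Parquet files on object storage as a queryable table.

```java
public class SqlEngineExamples {
    // Option 1: a DBMS-backed output port, where storage and query
    // capabilities come all-in-one from the database itself.
    static final String DBMS_PORT =
            "SELECT customer_id, total FROM \"postgres_dp\".\"public\".\"orders\"";

    // Option 2: a Raw/Files output port, where a SQL engine is needed to
    // turn files on object storage into a queryable table.
    static final String RAW_PORT =
            "SELECT customer_id, SUM(total) AS total "
          + "FROM \"s3_lake\".\"sales-dp\".\"orders_parquet\" "
          + "GROUP BY customer_id";

    public static void main(String[] args) {
        System.out.println(DBMS_PORT);
        System.out.println(RAW_PORT);
    }
}
```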
Client-side coupling is not good. The consumer ends up embedding one JDBC driver for Athena, another for Redshift, another for Aurora: one driver doesn't fit all. Coupling becomes a problem for change management, and the Data Product is no longer independently deployable. Reasons why you need multiple technologies in a data mesh: not all use cases fit a single technology; Data Mesh is an evolutionary architecture, so technologies will evolve and change over time and Data Products will adopt them independently; and your data mesh is expanding across a multi-cloud landscape.
How to integrate legacy systems. Migrating from the Data Lake to the Data Mesh will take time. What if we need to join data coming from different JDBC channels? If the join must be resolved at the consumer level, the impact on performance is huge.
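With a federation engine such as Dremio in between, the join is resolved inside the engine instead of at the consumer. A sketch, assuming two hypothetical sources ("legacy_oracle", "mesh_lake") registered in the same Dremio catalog, so a single JDBC statement joins them without the consumer opening two JDBC channels:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class FederatedJoin {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:dremio:direct=dremio.internal:31010";
        // Both sources appear in one catalog; the engine plans and executes
        // the cross-source join, so nothing is joined client-side.
        String sql =
                "SELECT c.customer_name, SUM(o.amount) AS total "
              + "FROM \"legacy_oracle\".\"crm\".\"customers\" c "
              + "JOIN \"mesh_lake\".\"sales-dp\".\"orders\" o "
              + "  ON o.customer_id = c.customer_id "
              + "GROUP BY c.customer_name";
        try (Connection conn = DriverManager.getConnection(url, "analyst", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
            }
        }
    }
}
```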
Fitting into the big picture. Dremio provides a single interface to access all the silos, with no coupling between data consumers and multiple specific technologies: a single catalog of data; native integration with data lakes, facilitating the transition to the Data Mesh; bridging of other enterprise assets, including other DBMSs; use as the SQL query engine inside Data Products; cloud agnosticism; efficient joins between Data Products built on different underlying technologies; and query federation between the data mesh and other data assets across the organization.
Self-serve provisioning. A Data Product specification goes through CI/CD, and the deploy step drives the provisioning API: on the query-engine side it provisions execution resources, SQL views, and ACLs (catalog, execution engine, ACL); on the infrastructure side it provisions storage, databases, policies, IAM, and so on. A sketch of the query-engine side follows.
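A minimal sketch of that deploy step, assuming a hypothetical spec shape and names: the provisioner creates the SQL view behind the output port and grants access to a consumer role. CREATE VDS is Dremio's DDL for a virtual dataset; GRANT is available in Dremio's enterprise/cloud editions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SelfServeProvisioner {

    // Hypothetical subset of a Data Product specification: just enough
    // to provision one SQL View output port and its ACL.
    record DataProductSpec(String space, String viewName,
                           String selectSql, String consumerRole) {}

    static void deploy(Connection conn, DataProductSpec spec) throws Exception {
        try (Statement stmt = conn.createStatement()) {
            // Provision the SQL view that backs the output port.
            stmt.execute("CREATE VDS \"" + spec.space() + "\".\"" + spec.viewName()
                    + "\" AS " + spec.selectSql());
            // Provision the ACL for the consumer role.
            stmt.execute("GRANT SELECT ON VDS \"" + spec.space() + "\".\"" + spec.viewName()
                    + "\" TO ROLE \"" + spec.consumerRole() + "\"");
        }
    }

    public static void main(String[] args) throws Exception {
        DataProductSpec spec = new DataProductSpec(
                "sales-dp", "orders_output_port",
                "SELECT order_id, customer_id, amount FROM \"mesh_lake\".\"sales-dp\".\"orders\"",
                "sales_consumers");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:dremio:direct=dremio.internal:31010", "provisioner", "secret")) {
            deploy(conn, spec);
        }
    }
}
```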
Data Product caching. The data consumer interacts with a single logical entity, the main entity, while queries are served faster thanks to query acceleration and caching over pre-aggregated and denormalized views. Leveraging (external) reflections speeds up queries automatically without adding complexity for the data consumer, and Dremio can create such pre-aggregations itself, with no need to implement custom jobs for that purpose.
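As an illustration, enabling an aggregate reflection on the output-port view might look like this. The exact reflection DDL varies between Dremio versions, and every name here (dataset path, reflection name, columns) is a placeholder; the key point is that consumers keep querying the same logical view and Dremio substitutes the reflection transparently.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ReflectionSetup {
    public static void main(String[] args) throws Exception {
        // Ask Dremio to maintain a pre-aggregated view of the main entity.
        String ddl =
                "ALTER DATASET \"sales-dp\".\"orders_output_port\" "
              + "CREATE AGGREGATE REFLECTION orders_by_customer "
              + "USING DIMENSIONS (customer_id, order_date) "
              + "MEASURES (amount (SUM, COUNT))";
        try (Connection conn = DriverManager.getConnection(
                "jdbc:dremio:direct=dremio.internal:31010", "dp_owner", "secret");
             Statement stmt = conn.createStatement()) {
            // Consumers are unaffected: their queries against the logical
            // view are accelerated by the reflection when it applies.
            stmt.execute(ddl);
        }
    }
}
```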