Two Leading Approaches to Data Virtualization, and Which Scales Better? by Daniel Abadi

ScyllaDB 131 views 35 slides Mar 11, 2025
Slide 1
Slide 1 of 35
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29
Slide 30
30
Slide 31
31
Slide 32
32
Slide 33
33
Slide 34
34
Slide 35
35

About This Presentation

Should you move code to data or data to code? Conventional wisdom favors the former, but cloud trends push the latter. This session by the creator of PACELC explores the shift, its risks, and the ongoing debate in data virtualization between push- and pull-based processing.


Slide Content

A ScyllaDB Community
Two Leading Approaches to
Data Federation & Virtualization:
Which Scales Better?
Daniel Abadi
Professor

Daniel Abadi
■Performs research on database system architecture and implementation,
especially at the intersection with scalable and distributed systems.
■Best-known for the development of the storage and query execution engines
of the C-Store (column-oriented database) prototype, which was
commercialized by Vertica and eventually acquired by Hewlett-Packard in
2011, for his HadoopDB research on fault tolerant scalable analytical
database systems which was commercialized by Hadapt and acquired by
Teradata in 2014
■Is ACM Fellow (2020)
■Recipient of a VLDB Best Paper Award, two VLDB Test of Time Awards (for
the work on C-Store and HadoopDB), the 2008 SIGMOD Jim Gray Doctoral
Dissertation Award, the 2013-2014 Yale Provost's Teaching Prize, and the
2013 VLDB Early Career Researcher Award, and the CIDR 2025 Test of Time
Award.

WHAT IS DATA FEDERATION?

Example
Transaction Data
(customerID, Data/Time, ProductID, Amount, …)
Teradata
Data Warehouse

Example
Transaction Data
(customerID, Data/Time, ProductID, Amount, …)
Customer Loyalty data
(customerID, customerName, twitterID, …)
Teradata
Data Warehouse
ScyllaDB

Example
Transaction Data
(customerID, Data/Time, ProductID, Amount, …)
Customer Loyalty data
(customerID, customerName, twitterID, …)
Teradata
Data Warehouse
ScyllaDB

Example
7
Transaction Data
(customerID, Data/Time, ProductID, Amount, …)
Customer Loyalty data
(customerID, customerName, twitterID, …)
Social Network Comments
(twitterID, influenceRank, productID, …)
AWS
Data Lake
ScyllaDB
Teradata
Data Warehouse

Query federation engines
8
Transaction Data
(customerID, Data/Time, ProductID, Amount, …)
Customer Loyalty data
(customerID, customerName, twitterID, …)
Social Network Comments
(twitterID, influenceRank, productID, …)
AWS
Data Lake
Teradata
Data Warehouse
ScyllaDB

Query federation engines
9
Transaction Data
(customerID, Data/Time, ProductID, Amount, …)
Customer Loyalty data
(customerID, customerName, twitterID, …)
Social Network Comments
(twitterID, influenceRank, productID, …)
AWS
Data Lake
Teradata
Data Warehouse
QUERY
ScyllaDB

Query federation engines
10
Transaction Data
(customerID, Data/Time, ProductID, Amount, …)
Customer Loyalty data
(customerID, customerName, twitterID, …)
Social Network Comments
(twitterID, influenceRank, productID, …)
AWS
Data Lake
Teradata
Data Warehouse
QUERY
SUBQUERY
SUBQUERY
SUBQUERY
ScyllaDB

Query federation engines
11
Transaction Data
(customerID, Data/Time, ProductID, Amount, …)
Customer Loyalty data
(customerID, customerName, twitterID, …)
Social Network Comments
(twitterID, influenceRank, productID, …)
AWS
Data Lake
Teradata
Data Warehouse
QUERY
SUBQUERY
SUBQUERY
SUBQUERY
ScyllaDB

Query federation engines
12
Transaction Data
(customerID, Data/Time, ProductID, Amount, …)
Customer Loyalty data
(customerID, customerName, twitterID, …)
Social Network Comments
(twitterID, influenceRank, productID, …)
AWS
Data Lake
Teradata
Data Warehouse
QUERY
SUBQUERY
SUBQUERY
SUBQUERY
ScyllaDB

What are the Technical
Challenges Behind Data
Federation?

Challenges
14
Transaction Data
(customerID, Data/Time, ProductID, Amount, …)
Customer Loyalty data
(customerID, customerName, twitterID, …)
Social Network Comments
(twitterID, influenceRank, productID, …)
AWS
Data Lake
Teradata
Data Warehouse
QUERY
SUBQUERY
SUBQUERY
SUBQUERY
ScyllaDB
CQL
SQL
Spark
SQL

Challenges
15
Transaction Data
(customerID, Data/Time, ProductID, Amount, …)
Customer Loyalty data
(customerID, customerName, twitterID, …)
Social Network Comments
(twitterID, influenceRank, productID, …)
AWS
Data Lake
Teradata
Data Warehouse
QUERY
SUBQUERY
SUBQUERY
SUBQUERY
ScyllaDB
What is the total
value of customers
in North Dakota?

Challenges
16
Transaction Data
(customerID, Data/Time, ProductID, Amount, …)
Customer Loyalty data
(customerID, customerName, twitterID, …)
Social Network Comments
(twitterID, influenceRank, productID, …)
AWS
Data Lake
Teradata
Data Warehouse
QUERY
SUBQUERY
SUBQUERY
SUBQUERY
ScyllaDB
What happens if
under-resourced?

Whither Data Federation?

RIP Data Federation
18
Data
Federation

RIP Data Federation
19
Data Virtualization

RIP Data Federation
20
Data
Federation
and
Virtualization

What happened?
28
Transaction Data
(customerID, Data/Time, ProductID, Amount, …)
AWS
Data Lake
Teradata
Data Warehouse
Customer Loyalty data
(customerID, customerName, twitterID, …)
Social Network Comments
(twitterID, influenceRank, productID, …)
Pull-based federation engine
ScyllaDB

Which scales better?
29
•Push-based architectures push code to data
•Pull-based architectures push data to code
•Push-based architectures scale better in theory
•But scale worse in practice
•Scalability problems with pull-based architectures rarely
seen in practice
•Only when network is a bottleneck
•JDBC interfaces can limit performance

What else happened?
30
•Orders of magnitude cheaper to store data in an open data
format on a data lake than in a commercial DB
•Vendors are being forced to support querying these data sources
directly
•First wave of solutions were half-hearted “external tables”-type of
approaches
•Now vendors are competing based on query federation /
virtualization performance
•This is a super easy federation use case for pull-based federation systems

Data federation and virtualization moving forward
31
•Simple federation use cases are not going away
•Next level of complexity is also fairly straight-forward for pull-based systems
•We’ve gone to far in the other extreme
•We can push more processing down to storage than we are doing now
•Supporting writes is still hard
•Data integration is still hard

Sort Offload Algorithm
•Observation: sort costs dominate merge-sort join overhead
•Idea: when Presto is overloaded, pushdown sort
•Given a join between a table S and table T (S ⨝ T)
•Partition table S into S1 and S2 ( t = S1 ⋃ S2)
•Partition table T into T1 and T2 ( t = T1 ⋃ T2)
•Sort S1 and T1 in DV Engine and Scan the sorted result of
S2 and T2 from the underlying DB
•Merge sorted S1 and sorted S2 to get sorted S and
sorted T1 and sorted T2 to get sorted T
•Sorted S = Merge(Sort S1, TS(sort S2))
•Sorted T = Merge(Sort T1, TS(sort T2))
•Merge join sorted s and sorted t
•S ⨝ T = Merge(Merge(sort S1, TS(sort S2), Merge(sort T1,
TS(sort T2))


TS(S1)TS(sort S2)
DV Engine
Merge (Sorted
S1, Sorted S2)
Sort S1
Merge join (Sorted S, Sorted T)
TS(T1)TS(sort T2)
Merge (Sorted
T1, Sorted T2)
Sort T1
Underlying DB

Outer Pushdown Algorithm
•Given a join between a table S and table T (S ⨝ T)
•Partition table T into T1 and T2 ( T = T1 ⋃ T2)
•Execute a left outer join between S and T2 (S ⟕ T2)
•This can be done on the underlying system
•Extract the join result from (S ⟕ T2)
•Extract S, join it with T1 (S ⨝ T1)
•This can be done in DV Engine
•Extract the inner part result, which is actually (S ⨝ T2)
•Put the results together, which is (S ⨝ T1)∪(S ⨝ T2)


TS(S) TS(T2)
extract
output
S
TS(T1)

inner part
Underlying DB
DV Engine

For more info …
•Email me ([email protected]) for (free) link to
pdf version of the book

Stay in Touch
Daniel Abadi
[email protected]
@daniel_abadi
https://www.linkedin.com/in/databaseprof
Tags