Apache Cassandra and ScyllaDB are distributed databases capable of processing massive globally-distributed workloads. Both use the same CQL data query language. In this webinar you will learn:
- How are they architecturally similar and how are they different?
- What's the difference between them in performance and features?
- How do their software lifecycles and release cadences contrast?
Added: Jan 13, 2022
Slides: 49 pages
Slide Content
Cassandra vs. ScyllaDB: Evolutionary Differences
Peter Corless, Director of Technical Advocacy, ScyllaDB
- Listen to customer stories
- Write blogs & case studies
- Play (and design) strategy & roleplaying games
About ScyllaDB
- For data-intensive apps that require high performance and low latency
- Fully compatible with Apache Cassandra and Amazon DynamoDB
- 10X the performance & low tail latency
- Open Source, Enterprise and Cloud options
- Founded by the creators of KVM hypervisor
- HQs: Palo Alto, CA, USA; Herzelia, Israel; Warsaw, Poland; teams around the world
Comparison: Cassandra vs. ScyllaDB
- From “Chasing Cassandra” to “Beyond Cassandra”
- Comparing Open Source Releases
- Software Release Cycles
- Public Perceptions
- Features: Same-Same, Similarities, Differences
- Benchmarks
What is ScyllaDB?
- Wide-column NoSQL database (“key-key-value”)
- Originally a re-architected reimplementation of Cassandra
- Compatible with C*’s CQL, SSTables, drivers, etc.
- Written in C++ (not Java) on the Seastar framework
- Shard-per-core design
- “Async everywhere”
- Shared-nothing
- Futures/promises
- Now also offers a DynamoDB-compatible API, Alternator
- Learn more: https://www.scylladb.com/product/technology/
“Chasing Cassandra”
- Scylla traditionally trailed Cassandra's implementation
- Playing “catch-up” c. 2016 – 2020
- Scylla 4.0 went beyond “feature completeness” for Cassandra 3.11
- Now Scylla has features not found in Cassandra
- Though Cassandra 4.0 has some features not (yet) present in Scylla
- Some we’ll add for parity/compatibility
- For some we’ll go our own way (solve differently, improve or obviate)
Scylla Beyond Cassandra
[Diagram: overlapping Cassandra and Scylla feature sets]
- Cassandra “core” / Scylla same-same “core” iteration: same/similar feature implemented identically/intercompatible
- Cassandra specific implementation / Scylla specific implementation: same/similar feature implemented differently; may or may not be intercompatible
- Unique to Cassandra: not in Scylla
- Unique to Scylla: not in Cassandra
Comparing Open Source Releases: Apache Cassandra vs. ScyllaDB
Cassandra 4.0 is Finally Here!
- February 2017: Cassandra 3.11 released
- May 2019: First Roadmap for Cassandra 4.0 laid out
- September 2019: First 4.0 Alpha
- July 2020: First 4.0 Beta
- April 26, 2021: Cassandra 4.0rc1
- April 28, 2021: Cassandra 4.0 “World Party”
- July 27, 2021: Actual Cassandra 4.0 Release Date
- September 7, 2021: Cassandra 4.0.1
- More than 4 years from the last minor release (3.11) to 4.0
Scylla’s Predictable Releases
- Aug 2021: Apache Cassandra 4.0 vs. Scylla 4.4: Comparing Performance
Scylla Open Source + Enterprise
Head-to-Head
- Scylla engineers make ~5x more commits/month
- Bigger engineering team: ~50% more active committers
- More active release cycle: 13x more major/minor releases over past 3 years
- More popular with developers: Scylla exceeds Cassandra in GitHub stars
Comparing Public Perception: Apache Cassandra vs. ScyllaDB
Scylla Moving Up in the [DB-Engines.com] Rankings
- ScyllaDB was 4th fastest rising database in the DB-Engines.com Top 100 from Jan 2021 to Jan 2022 [Rank #85, Score 3.91]
- Cassandra remains #11 overall in the DB-Engines.com ranks for Jan 2022 [Score 123.55]
- Source: https://db-engines.com/en/ranking
What’s the Same Between Scylla & Cassandra? Commonalities
Common Ancestry
- Cassandra and Scylla both descend from the same historical antecedents / whitepapers
- Google’s Bigtable
- Amazon’s Dynamo
- Facebook’s Cassandra
- [Not to be confused with commercial offerings Google Cloud Bigtable and Amazon DynamoDB, or open source Apache Cassandra]
- Peer-to-peer leaderless topology
- Replication Factor (RF)
- Tunable consistency per request
- Multi datacenter replication
- CAP Theorem: Availability/Partition Tolerant “AP”
- High Availability
- No primary/replica complications
- Homogeneity of nodes
- Full datacenter loss can be survivable
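To make "Replication Factor" and "tunable consistency per request" concrete, here is a minimal sketch of the quorum arithmetic both databases rely on. The function names are illustrative only and do not belong to any driver API.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that form a majority (QUORUM) for a given RF."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def reads_overlap_writes(rf: int, write_acks: int, read_acks: int) -> bool:
    """True when any read replica set must intersect any write replica set."""
    return write_acks + read_acks > rf


if __name__ == "__main__":
    rf = 3
    q = quorum(rf)
    print(f"RF={rf}: QUORUM={q}")
    print("QUORUM write + QUORUM read overlap:", reads_overlap_writes(rf, q, q))
    print("ONE write + ONE read overlap:", reads_overlap_writes(rf, 1, 1))
```

With RF=3, QUORUM is 2, so a QUORUM write followed by a QUORUM read always shares at least one replica. A write at ONE followed by a read at ONE carries no such guarantee, which is the trade-off a per-request consistency level lets you choose.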
Ring Architecture
- Token ring topology
- Wide column
- “Key-key value”
- Partition key
- Clustering key
- Nodes/vNodes
- Automatic sharding
- Same murmur3 partitioner & hash algorithms
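The ring idea can be sketched in a few lines: hash the partition key to a token, find the first vnode token at or after it, then walk clockwise until enough distinct nodes are collected. This is a simplified illustration, not either database's code. MD5 stands in for Murmur3 (which is not in the Python standard library), and the class and node names are hypothetical.

```python
import bisect
import hashlib


def token_for(partition_key: str) -> int:
    """Map a partition key to a signed 64-bit token.

    Assumption: MD5 is used as a stand-in for the Murmur3 partitioner.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


class TokenRing:
    """A ring of (token, node) pairs; each node may own several vnode tokens."""

    def __init__(self, node_tokens: dict[str, list[int]]) -> None:
        self._ring = sorted(
            (token, node) for node, tokens in node_tokens.items() for token in tokens
        )
        self._tokens = [token for token, _ in self._ring]

    def replicas(self, partition_key: str, rf: int) -> list[str]:
        """Walk clockwise from the key's token, collecting rf distinct nodes."""
        start = bisect.bisect_left(self._tokens, token_for(partition_key))
        owners: list[str] = []
        for step in range(len(self._ring)):
            _, node = self._ring[(start + step) % len(self._ring)]
            if node not in owners:
                owners.append(node)
            if len(owners) == rf:
                break
        return owners


if __name__ == "__main__":
    ring = TokenRing({
        "node-a": [-6 * 10**18, 3 * 10**18],
        "node-b": [-3 * 10**18, 6 * 10**18],
        "node-c": [0, 9 * 10**18],
    })
    print(ring.replicas("user:42", rf=2))
```

Because placement depends only on the key's token and the ring, any node can act as coordinator and compute the replica set locally, which is what makes the leaderless topology workable.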
Keyspaces, Tables
- CREATE KEYSPACE, CREATE TABLE
- ALTER KEYSPACE, ALTER TABLE
- DROP KEYSPACE, DROP TABLE
- Pretty much standard Cassandra Query Language (CQL)
Basic CQL CRUD Operations
- Create [INSERT]
- Read [SELECT]
- Update [UPDATE]
- Delete [DELETE]
- WHERE clause
- ALLOW FILTERING
- TTL functions
- Pretty much standard Cassandra Query Language (CQL)
- Like SQL, at least at cursory glance, but do not be lulled into a false sense of familiarity
CQL Drivers
- ScyllaDB can use all the same Apache Cassandra and DataStax drivers
- ScyllaDB can replace Cassandra on the back end without touching any existing client apps or drivers
- Those drivers will not take advantage of ScyllaDB’s shard-aware architecture, but they’ll work
Poll Question: How much data do you have under management in your own transactional database systems?
- <1 terabyte
- 1 to 50 terabytes
- 50-100 terabytes
- >100 terabytes
What’s similar but not the same? Cassandra and Scylla differences
CQL
- For the most part, all basic CQL queries for Cassandra will work with Scylla
- Scylla uses the same CQL wire protocol as Cassandra
- Scylla does implement some features differently (we’ll get into those)
- Naturally, those differences will have related CQL commands
- Implementation lag: Scylla is compatible with CQL 3.4.0; current Cassandra CQL is 3.4.5
SSTables
- Scylla supports the same immutable on-disk SSTable LSM tree file formats
- Standard compaction algorithms are the same (LCS, STCS, TWCS)
- Cassandra 4.0 implemented a new “nb” SSTable file format
- na (4.0-rc1): uncompressed chunks, pending repair session, isTransient, checksummed sstable metadata file, new Bloomfilter format
- nb (4.0.0): originating host id
- Scylla will add support for “nb” file format (#8593)
- Scylla will also add support for “me” file format (#9869)
Lightweight Transactions (LWT)
- Both use Paxos consensus algorithm
- Compare-and-set operations
- Also called “conditional updates”
- Scylla can accomplish LWTs in only 3 round trips (Cassandra takes 4)
- Scylla is more performant / efficient
- Blog: https://www.scylladb.com/2020/07/15/getting-the-most-out-of-lightweight-transactions-in-scylla/
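As a rough illustration of what "compare-and-set" means, and why one fewer round trip matters, the sketch below models conditional writes in a single process and estimates network time per LWT. It does not implement Paxos. `ConditionalStore` and the 0.5 ms round-trip time are hypothetical, chosen only for the example.

```python
class ConditionalStore:
    """Single-process stand-in for a table that supports conditional writes."""

    def __init__(self) -> None:
        self._rows: dict[str, str] = {}

    def insert_if_not_exists(self, key: str, value: str) -> bool:
        """Mimics INSERT ... IF NOT EXISTS: applies only when the key is absent."""
        if key in self._rows:
            return False
        self._rows[key] = value
        return True

    def update_if(self, key: str, expected: str, new_value: str) -> bool:
        """Mimics UPDATE ... IF value = expected: applies only on a match."""
        if self._rows.get(key) != expected:
            return False
        self._rows[key] = new_value
        return True


def lwt_network_time_ms(round_trips: int, rtt_ms: float) -> float:
    """Lower bound on network time for one LWT: round trips times RTT."""
    return round_trips * rtt_ms


if __name__ == "__main__":
    store = ConditionalStore()
    print(store.insert_if_not_exists("lock", "owner-a"))  # first writer wins
    print(store.insert_if_not_exists("lock", "owner-b"))  # rejected: key exists
    rtt_ms = 0.5  # hypothetical inter-node round-trip time
    print("3 round trips:", lwt_network_time_ms(3, rtt_ms), "ms")
    print("4 round trips:", lwt_network_time_ms(4, rtt_ms), "ms")
```

In a real cluster the condition check and the write are agreed on by the replicas through Paxos, which is where the round trips come from. Going from 4 to 3 removes a quarter of that network cost on every conditional write.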
Materialized Views
- Cassandra: introduced in 3.0 [2017], but still experimental
- Problems when base table gets out of sync
- To this day, major issues like CASSANDRA-10346 are still open
- Scylla: production ready since 3.0 [Jan 2019]
- Serve as the infrastructural basis for Secondary Indexes
- Can still get out of sync, but not easily
- Continually improving implementation
- “If you have them, take them out.” Nate McCall, PMC Chair, on Materialized Views in Cassandra [2018]*
- * Read more: https://www.scylladb.com/2018/09/19/overheard-at-distributed-data-summit/
Secondary Indexes
- Cassandra: only local Secondary Indexes (SIs)
- Scylla: both local and global SIs
- The choice is now yours! https://www.scylladb.com/2019/07/23/global-or-localsecondary-indexes-in-scylla-the-choice-is-now-yours/
- [Figure: a global indexing query workflow in Scylla]
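The practical difference between local and global indexes is how many nodes a query has to touch. The toy model below is only a sketch under simplifying assumptions. Four hypothetical nodes, a byte-sum hash for placement, and in-memory dictionaries stand in for real storage, and none of the names come from either database.

```python
from collections import defaultdict

NODES = 4  # hypothetical cluster size


def owner(key: str) -> int:
    """Hypothetical placement: a stable toy hash picks the owning node."""
    return sum(key.encode("utf-8")) % NODES


def build_local_indexes(rows):
    """Local index: each node indexes only the rows it stores."""
    indexes = [defaultdict(list) for _ in range(NODES)]
    for user_id, city in rows:
        indexes[owner(user_id)][city].append(user_id)
    return indexes


def query_local(indexes, city):
    """The coordinator cannot know which nodes hold matches, so it asks all."""
    contacted = set(range(NODES))
    matches = [uid for index in indexes for uid in index.get(city, [])]
    return sorted(matches), contacted


def build_global_index(rows):
    """Global index: entries are partitioned by the indexed value itself."""
    index = [defaultdict(list) for _ in range(NODES)]
    for user_id, city in rows:
        index[owner(city)][city].append(user_id)
    return index


def query_global(index, city):
    """One lookup on the index partition, then only the matching base rows."""
    index_node = owner(city)
    matches = index[index_node].get(city, [])
    contacted = {index_node} | {owner(uid) for uid in matches}
    return sorted(matches), contacted


if __name__ == "__main__":
    rows = [("u1", "Oslo"), ("u2", "Lima"), ("u3", "Oslo"), ("u4", "Pune")]
    print(query_local(build_local_indexes(rows), "Oslo"))
    print(query_global(build_global_index(rows), "Oslo"))
```

Both queries return the same rows. The local variant fans out to every node, while the global variant touches the index partition plus the nodes that actually hold matches, at the price of maintaining a second, differently partitioned structure on every write.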
Change Data Capture (CDC)
CDC in Cassandra
- Introduced in C* 3.8, uses commitlog-like structure
- Creates indexes as commit logs are written, for improved performance and reliability
- Feature enabled through cassandra.yaml
- CDC can be enabled per table through ALTER TABLE command
- Currently, no standard way to read CDC files
- DS planning to open source Kafka Source connector
- Advance replication from DS Labs
- Example CDC project built by someone
CDC in Scylla
- Implemented as standard CQL Tables
- Just like adding another table
- Enabled by default
- Easy to integrate & consume
- Deltas (changes) plus pre/post image
- Replicated in same way as normal data
- Reasonable overhead
- TTL prevents unbounded data
- Easily consumable by Apache Kafka
Kafka CDC Source Connector
- Debezium-based
- Simply consumes CDC data via CQL
- Doesn’t need to de-dupe data
- Pumps data into Kafka topics
- Confluent-certified
- Less muss & fuss
Zero Copy Streaming vs. Row-level Repair
- Cassandra now can stream SSTables as a whole
- Bypasses turning SSTables into objects (aka “object reification”), providing 5x better performance
- Scylla implemented a completely different approach in 2019
- Scylla’s row-level repair feature is used instead of streaming
- Row-level repair is more robust: better able to endure interruptions and outages
- Row-level repair is more granular: only specific rows are transferred
- Row-level repair is more efficient: there’s no extra data streaming!
Netty Async Messaging
- C* 4.0 integrates async-driven code from the Netty library for communication between nodes, to leverage Java’s Non-Blocking IO (NIO) capability
- A single thread pool for all connections to corresponding nodes instead of maintaining N threads per peer
- Potentially improves internode performance, providing better tail latencies and facilitating zero-copy streaming
Scylla also believes in non-blocking IO
- Scylla uses asynchronous / non-blocking I/O in C++ (aio) with its own schedulers
- Scylla per-core shards maintain as great a shared-nothing approach as possible; use async messaging when needed
- Read: https://www.scylladb.com/2021/09/15/what-weve-learned-after-6-years-of-io-scheduling/
Kubernetes Support & Sidecars
- Cassandra: plethora of K8s operators
- DataStax K8ssandra 1.3+
- Orange CassKop 2.0+
- Bitnami Charts
- [cass-operator deprecated]
- Sidecars collocated/run on the same instance as the DB server daemon
- What Works and What Doesn’t: https://k8ssandra.io/blog/articles/kubernetes-and-apache-cassandra-what-works-and-what-doesnt/
- Scylla Operator offers great K8s support: it just works
- Scylla Manager Agent is a sidecar and already included by default with Scylla Operator
- https://www.scylladb.com/product/scylla-operator-kubernetes/
What’s Just Totally Different? Cassandra and Scylla differences
Shard-per-Core Architecture
- Based on the Seastar framework (also used in Redpanda, Red Hat Crimson)
- Designed/optimized for multicore systems (scales to 100+ CPUs per node)
- Cassandra is shard-per-node
- Scylla balances data with more granularity
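One way to picture shard-per-core is as a second level of partitioning. After a token selects a node, it also selects one core on that node. The sketch below is a simplified illustration of that idea and is not presented as ScyllaDB's actual mapping function, which may include refinements omitted here. `shard_of` is a hypothetical name.

```python
from collections import Counter


def shard_of(token: int, shard_count: int) -> int:
    """Map a signed 64-bit token to one of shard_count per-core shards.

    The token is shifted into the unsigned range, then scaled so that the
    token space is split into shard_count equal, contiguous slices.
    """
    if not -(2**63) <= token < 2**63:
        raise ValueError("token must fit in a signed 64-bit integer")
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    unsigned = token + 2**63
    return (unsigned * shard_count) >> 64


if __name__ == "__main__":
    shard_count = 8  # hypothetical: one shard per core on an 8-core node
    step = 2**64 // 1000
    tokens = [-(2**63) + i * step for i in range(1000)]
    spread = Counter(shard_of(t, shard_count) for t in tokens)
    print(sorted(spread.items()))
```

Because each shard owns its slice of the data outright, a core can serve requests for its tokens without taking locks shared with other cores, which is the shared-nothing property the Seastar design aims for.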
Shard-Aware Drivers
- Our shard-aware Rust driver serves as the paradigm for our new shard-aware drivers
- Still backwards-compatible with Cassandra
- Get it on GitHub! https://github.com/scylladb/scylla-rust-driver
- Better performance than a “vanilla” CQL driver
- “Smart” token-aware clients direct queries to specific shards (cores) where data resides
- Better for consumption of CDC data tables
- Up to 25% greater performance
Seedless Gossip
- Gossip in Cassandra requires seed nodes, which violates the idea of homogeneity of nodes
- Requires manual assignment and configuration
- Seed nodes do not bootstrap
- Complicated to add new seed node or replace a dead seed node
- Scylla implemented gossip without requiring seed nodes
- More symmetric; less problematic
- Read more: https://www.scylladb.com/2020/09/22/seedless-nosql-getting-rid-of-seed-nodes-in-scylla/
DynamoDB-compatible API (Alternator)
- Run your DynamoDB-compatible workloads anywhere: on AWS or in an AWS Outpost, on Google Cloud, Azure, or on-premises
- Supports DynamoDB Streams
- Supports Load Balancing
- Scylla Spark Migrator to move data to any Scylla cluster anywhere
- Cassandra has no comparable feature
Raft in ScyllaDB
- Schema Changes
- Topology Changes
- Add or remove any number of nodes simultaneously
- Durable and linearizable
- Background Data Rebalancing
- Tablets!
- Immediate, Strong Consistency of MVs, SIs, CDC tables
- 1 Round Trip!
- Not in Cassandra
Benchmarking: Cassandra 4.0 vs Scylla 4.4 and how Scylla dominates
Cassandra 4.0 vs. Scylla 4.4
- Scylla up to 100x lower P99 latencies
- Scylla can maintain 2x - 5x throughput
- Scylla adds nodes 3x faster
Scylla 4.4 vs. Cassandra 4.0
- Cassandra 4.0 cannot maintain useable low latencies except at very low throughput (≤30-40k ops)
- Scylla can maintain low latencies for far greater throughputs (≤170-180k ops)
Replacing a Node
- Scylla can heal clusters far faster than Cassandra 4.0 by spinning nodes up and rebalancing data ~3x - 4x faster
Doubling Cluster Capacity
- Scylla doubled a cluster’s capacity in just over an hour and a half (94 minutes)
- It took Cassandra 4.0 just shy of 4 hours (238 minutes) to perform the same task
- Scylla performed 2.5X faster
Major Compaction Speed
- Scylla 4.4: 36 min on a 3-node cluster
- Cassandra 4.0 took 36x - 63x as long (nearly a day; or a day and a half!)
- Cassandra 4.0 performed worse than Cassandra 3.11 with num_tokens: 16
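The ratios quoted for capacity doubling and major compaction can be checked with simple arithmetic. The sketch below uses only the durations stated on these slides. The helper names are illustrative.

```python
def speedup(slower_minutes: float, faster_minutes: float) -> float:
    """How many times faster the quicker run finished."""
    return slower_minutes / faster_minutes


def scaled_hours(base_minutes: float, factor: float) -> float:
    """Duration in hours of a run that took factor times base_minutes."""
    return base_minutes * factor / 60


if __name__ == "__main__":
    # Doubling cluster capacity: 238 minutes (Cassandra 4.0) vs. 94 (Scylla 4.4).
    print(f"capacity doubling speedup: {speedup(238, 94):.2f}x")
    # Major compaction: 36 minutes for Scylla; Cassandra took 36x to 63x as long.
    print(f"compaction, 36x: {scaled_hours(36, 36):.1f} hours")
    print(f"compaction, 63x: {scaled_hours(36, 63):.1f} hours")
```

238 / 94 is about 2.53, matching the "2.5X" figure. The compaction multipliers work out to 21.6 and 37.8 hours, which is where "nearly a day; or a day and a half" comes from.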
TCO Comparison: 4 vs. 40
- 4x i3.metal instances with Scylla provided the same or better performance as 40 nodes of Cassandra on i3.4xlarge
- Cassandra had 640 vCPUs
- Scylla had 288 vCPUs
- Scylla got better utility out of hardware
- Cost savings of 60%
- Administrative burden/attack surface reduced by 90%
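The vCPU totals and the 90% figure follow directly from the node counts. The sketch below reproduces them, taking the published AWS sizes of 72 vCPUs for i3.metal and 16 vCPUs for i3.4xlarge. The 60% cost saving depends on instance pricing, which the slide does not give, so it is not recomputed here.

```python
# vCPU counts per instance type, taken from published AWS specifications.
VCPUS_PER_INSTANCE = {"i3.metal": 72, "i3.4xlarge": 16}


def total_vcpus(instance_type: str, node_count: int) -> int:
    """Total vCPUs across a cluster of identical instances."""
    return VCPUS_PER_INSTANCE[instance_type] * node_count


def reduction(before: float, after: float) -> float:
    """Fractional reduction going from before to after."""
    return (before - after) / before


if __name__ == "__main__":
    scylla = total_vcpus("i3.metal", 4)
    cassandra = total_vcpus("i3.4xlarge", 40)
    print(f"Scylla: {scylla} vCPUs, Cassandra: {cassandra} vCPUs")
    print(f"fewer nodes to administer: {reduction(40, 4):.0%}")
    print(f"fewer vCPUs: {reduction(cassandra, scylla):.0%}")
```

Going from 40 nodes to 4 is a 90% reduction in machines to patch, monitor, and secure, which is the basis for the "administrative burden/attack surface" claim. The vCPU count drops by 55%.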
Published Benchmarks
BLOGS
- Benchmark, Part 1: Cassandra 4.0 vs. Cassandra 3.11: Comparing Performance
- Benchmark, Part 2: Apache Cassandra 4.0 vs. Scylla 4.4: Comparing Performance
- Webinar: Your Questions about Cassandra 4.0 vs. Scylla 4.4 Answered
WEBINAR
- Comparing Apache Cassandra 4.0, 3.0 and ScyllaDB
Questions?
- Learn NoSQL for free! university.scylladb.com
- @petercorless