NoSQL Architecture Overview

ChristopherFoot 3,299 views 37 slides Mar 24, 2017

About This Presentation

The presentation begins with an overview of the growth of non-structured data and the benefits NoSQL products provide. It then evaluates the more popular NoSQL products on the market, including MongoDB, Cassandra, Neo4j, and Redis. With NoSQL architectures becoming an increasingly ap...


Slide Content

NoSQL Architecture Overview
RDX Insights Series Presentation – Introduction to NoSQL Architectures
Chris Foot, VP DB Technologies, RDX
March 23, 2017
OVER 400 CUSTOMERS TRUST THEIR DATABASES TO RDX

NoSQL Product Offering Analysis

NoSQL Competitors

Persistent DB
  Document
    Pairs a key with a complex data structure called a document
    Records are not required to have a uniform structure
    MongoDB, CouchDB, DynamoDB, Couchbase, MarkLogic
  Wide-Column
    A record can have billions of columns
    Tables are collections of columns, rather than rows
    Column names and record keys are not fixed
    Cassandra, Bigtable, HBase, Accumulo
  Key-Value
    All items are stored as indexed key-value pairs
    Redis, Riak, Memcached, Oracle NoSQL, DynamoDB
  Graph
    Stores nodes (data elements) with relationships
    Interconnected, strong relationships
    Neo4j, DataStax Cassandra, Titan, ArangoDB

In-Memory DB
  Operations performed in memory
  Lightning-fast read/write
  Often use Key-Value or Wide-Column as the data store
  Redis, Memcached, Oracle TimesTen, SAP HANA In-Memory

RDBMS and NoSQL will Merge

NoSQL vendors’ desire to increase market share will drive them to compete directly with relational product manufacturers.
  Vendors will add RDBMS-like functionality that allows their products to be more widely adopted. Those that don’t will quickly lose market share to those that do.
The larger relational vendors will attempt to co-opt any NoSQL technology that challenges their dominant role in the industry.
  As they identify offerings as tangible threats, their strategy will be to ensure that the technologies used by those vendors become a component of, not a replacement for, their traditional database products.

Diagram: Relational DBMS and NoSQL DBMS converge into a General Purpose DBMS

Unstructured Data Examples

NoSQL Adoption Drivers - Modern Applications

Examples: Single View, Sensor Data, Biometrics, Radiology, Videos, Images, Weather Data, Catalogs, Content Management, Geospatial, Social Data

IDC: Unstructured data is growing at the rate of 62% per year
IDC: By 2022, 93% of all data in the digital universe will be unstructured
Gartner: Data volume is set to grow 800% over the next 5 years and 80% of it will reside as unstructured data

NoSQL Adoption Drivers – Horizontal Scaling
Cost-effectively handle large volumes of data and/or users
Diagram: Horizontal vs. Vertical scaling

Relational and NoSQL Parallel Adoption Drivers

Hierarchical and Network Databases: IMS and CODASYL/Network
  Logical and physical layers were entirely dependent upon each other
  Both data storage and data navigation were rigidly defined
  Programs were required to follow the prebuilt paths to navigate through the stored data

Early Releases of DB2
  Flexibility
    Separate logical and physical layers (schema)
    Set vs. row processing
  Ease of use
    SQL language was intuitive
  Poor performance
  Crude locking, transaction management and limited features

Early Releases of Oracle
  Flexibility
  Easy to use
  Lower Total Cost of Ownership (support, product costs)
  Low-cost commodity hardware (as in, it didn’t need a mainframe)
  Crude locking, transaction management and limited features

Early Releases of NoSQL
  Flexibility
  Easy to use
  Lower Total Cost of Ownership (support, product costs)
  Faster application development
  Architected to scale horizontally for availability and performance
  Crude locking, transaction management and limited features

“Niche implementations, crude technically, will never become popular, no features - no future.” Pretty much… “Your career is going to be toast.”

ACID vs BASE

ACID (Relational)
  Atomicity: All operations in a single transaction succeed or fail as a group. No partial operations
  Consistency: The database is never in an inconsistent state
  Isolated: Transactions do not interfere with one another. Contentious data access is handled by the database to make the transactions appear to run sequentially
  Durable: Transactions are permanent in the presence of failures

BASE (NoSQL)
  Basic Availability: The system is able to tolerate a partial failure (loss of a single node, for example)
  Soft State: The state of the system is in flux and may change over time because of eventual consistency (next item)
  Eventual Consistency: As data is being added to the system, consistency is gradually replicated across all nodes. Data may be inconsistent in the short term but will eventually become consistent

Distributed Tradeoff
  The application is given a greater responsibility for data management in systems that don’t follow ACID
  Leads to complex application code when strong consistency is needed across replicated nodes
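To make the atomicity bullet concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a relational engine. The accounts table, the names and the transfer rule are all hypothetical and are not taken from the presentation.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` between two accounts as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                # Hypothetical business rule: abort, undoing both updates.
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False

def balances(conn):
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # succeeds as a unit
failed = transfer(conn, "alice", "bob", 500)  # rolled back as a unit
```

The second transfer leaves no partial debit behind, which is the "no partial operations" guarantee. In a store that does not offer this, the application would have to detect and undo the half-finished update itself.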

CAP Theorem
Distributed Systems – Pick C or A

Consistency: All clients see the same data
Availability: All clients can read and write
Partition Tolerance: System continues to work during network partitions

CP: MongoDB, Redis, BigTable, HBase, MemcacheDB
CA: Oracle, SQL Server, MySQL…
AP: Cassandra, Riak, CouchDB, DynamoDB

CAP Theorem

During a partition, while data is synchronizing:
  AVAILABILITY: both sides of the partition allow updates, so the data is INCONSISTENT
  CONSISTENCY: one side allows updates and the other prevents them, so that side is UNAVAILABLE
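The tradeoff on this slide can be sketched as a toy two-replica store. Everything here is an illustrative assumption: the class names, the "CP"/"AP" mode flag, the choice of node 0 as the side that stays writable, and the decision to refuse reads as well as writes on the isolated side. No real product is being modeled.

```python
class Replica:
    """One copy of the data."""
    def __init__(self):
        self.data = {}

class TwoNodeStore:
    """Two replicas; `mode` ("CP" or "AP") decides behavior during a partition."""
    def __init__(self, mode):
        self.mode = mode
        self.partitioned = False
        self.nodes = [Replica(), Replica()]

    def _check_reachable(self, node):
        # CP choice: the isolated side (node 1 in this sketch) refuses service.
        if self.partitioned and self.mode == "CP" and node != 0:
            raise RuntimeError("unavailable during partition")

    def write(self, node, key, value):
        self._check_reachable(node)
        if self.partitioned:
            # Replication is blocked, so only the contacted replica changes.
            self.nodes[node].data[key] = value
        else:
            for replica in self.nodes:
                replica.data[key] = value

    def read(self, node, key):
        self._check_reachable(node)
        return self.nodes[node].data.get(key)

# AP: both sides keep accepting updates and drift apart.
ap = TwoNodeStore("AP")
ap.partitioned = True
ap.write(0, "x", 1)
ap.write(1, "x", 2)
ap_views = (ap.read(0, "x"), ap.read(1, "x"))

# CP: the isolated side rejects the update instead of diverging.
cp = TwoNodeStore("CP")
cp.partitioned = True
cp.write(0, "x", 1)
try:
    cp.write(1, "x", 2)
    cp_rejected = False
except RuntimeError:
    cp_rejected = True
```

With the AP store the two clients see different values until the replicas synchronize. With the CP store every answer that is given is consistent, but one side gives no answer at all.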

Why Did RDX Choose MongoDB?

Business Drivers
  Industry analyst evaluations
  Customer use cases and recommendations
  Largest commercial investment in any database vendor
  Popularity
    10 million+ software downloads
    1,000 partners
    2,000 customers
    1/3 of the Fortune 100
  Robust training available
  Strong open source community
  Excellent partnership support

Technical Drivers
  Wide scope of potential application
  Low TCO
  Combines capabilities of relational databases with next generation NoSQL technologies
    Schemaless, flexible data model
    Nonstructured data support
    Easily accommodates large data volumes
    Rich query capability
    Strong, tunable consistency model
    Elastic, horizontal scalability
    Easily configurable system resiliency
  Vendor provided database support

Customers: Craigslist, New York Times, Verizon, Viacom, AstraZeneca, MTV, Google, Genentech, Adobe, GAP, Cisco, MetLife, Facebook, Expedia, eBay, Edmunds, Washington Post, AOL, ADP, Forbes, Intuit, The Weather Channel, Carfax…

MongoDB Features

Multiple storage engines
  WiredTiger
  InMemory
  Encrypted
  Third-Party
  MMAPv1
Indexing
  Enforce uniqueness on user-defined and Object ID fields
  Partial: documents are indexed only if they meet a filter expression
  Sparse: documents are indexed only if the field is populated
  Compound: multiple-column index
  Multikey: indexes on arrays
  TTL: allows documents to be purged based on time
  Text Search
  Hashed: indexes a hash of the field value, which spreads values randomly
Easily ingests large, nonstructured data elements
  Decomposes large video files and images into smaller components and rebuilds them using pointers during retrieval
Document validation rules enforce data validity
  Enforce checks on document structure, data types, data ranges and the presence of mandatory fields
  DBAs can apply data governance standards, while developers maintain the benefits of a flexible schema
Automatic failover with no application redirects to the new primary required
Driver support for all common programming languages
Data compression
Tunable consistency model
BI Connector allows MongoDB to act as a data source for SQL-based BI analytics platforms
LDAP, Kerberos, Windows AD, x.509 authentication
DML, DCL, DDL audit logging
FIPS compliant and data encryption

Rigid vs Dynamic Schemas

Relational: Tables and Rows (Predescribed)
  Schema design performed before application is developed
  Schema must be built before inserting data
  Enforces data structure: rows cannot deviate from the predefined schema
  Schema design based on storage
  Schema alterations require database and application changes to be coordinated
  Normalization process is critical

MongoDB: Collections and Documents (Self-Describing)
  No schema required before inserting data
  Schema is created as each document is inserted
  Documents in a collection can have different schemas (sets of fields)
  Schema design based on application usage
  Schemas can evolve iteratively during the application life-cycle
  Higher dependency on application layer for data integrity
  Normalization not as important

Flexible Schemas

Insurance Policy Document Collection: AUTO, LIFE, HOME, EQUIPMENT, CYBER

Collections do *not* enforce document structure. You do not predefine document schemas. The schema is defined during initial document insertion. Data types are selected by MongoDB based on the data being inserted.
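As a rough illustration of the insurance-policy collection, the sketch below uses a plain Python list of dicts in place of a MongoDB collection. The field names and values are made up, and `schema_of` is a hypothetical helper that reports the types Python inferred; it is not a MongoDB API.

```python
policies = []  # stands in for a collection; no schema is declared up front

def insert(collection, document):
    """Store a copy of the document exactly as supplied."""
    collection.append(dict(document))

def schema_of(document):
    """Describe a document by the type each field's value happens to have."""
    return {field: type(value).__name__ for field, value in document.items()}

# Two policy types with different sets of fields live side by side.
insert(policies, {"type": "AUTO", "holder": "A. Driver", "vin": "HYPOTHETICAL123", "premium": 820.5})
insert(policies, {"type": "LIFE", "holder": "B. Holder", "beneficiaries": ["C. Holder"], "term_years": 20})

schemas = [schema_of(doc) for doc in policies]
common_fields = set(schemas[0]) & set(schemas[1])
```

Each document carries its own structure, so adding a HOME or CYBER policy with new fields needs no table alteration. The cost, as the slides note, is that the application layer becomes responsible for integrity checks.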

Agile Development Features

FASTER, BETTER, LEANER
  Schemaless architecture
  Flexible data model = easy schema changes
  Drivers for all major programming languages
  Ability to store all types of data
  Flexible JSON document format
  Rich content using GridFS
  Simple system provisioning
  Scale vertically and horizontally
  Pluggable storage engines
  Easy replication setup

Automatic Sharding

Automatic Data Distribution - Sharded Cluster
  Horizontally scalable
  Shard 1, Shard 2 … Shard N: each logical shard is backed by replicas (one Primary physical server and two Secondary physical servers)
  Cluster metadata includes data location, shards, # of chunks….
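One way to picture the cluster metadata mentioned on this slide is a sorted list of chunk boundaries that a router consults for each shard-key value. The boundaries, shard names and the `shard_for` helper below are hypothetical, and the sketch covers only range-based routing.

```python
import bisect
from collections import Counter

# Hypothetical metadata: chunk i holds shard-key values below
# CHUNK_UPPER_BOUNDS[i] (and at or above the previous bound).
CHUNK_UPPER_BOUNDS = [1000, 2000, 3000, float("inf")]
CHUNK_TO_SHARD = ["shard1", "shard2", "shard1", "shard3"]

def shard_for(shard_key):
    """Find the chunk whose range contains the key, then the shard that owns it."""
    chunk = bisect.bisect_right(CHUNK_UPPER_BOUNDS, shard_key)
    return CHUNK_TO_SHARD[chunk]

chunks_per_shard = Counter(CHUNK_TO_SHARD)
routed = {key: shard_for(key) for key in (500, 1000, 2500, 99999)}
```

Rebalancing in this picture is just an edit to `CHUNK_TO_SHARD`: moving a chunk to another shard changes where its keys route without the application needing to know.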

Replica Sets

MULTI DATACENTER CLUSTER (diagram)
  Primaries: Primary 1 Display, Primary 2 Display, Primary 3 Display (Priority 1, Votes 1)
  Site 2 – Display and Batch: Sec 1.1 Display, Sec 2.1 Batch, Sec 3.1 Batch (Priority 1, Votes 1)
  Site 3 – Batch and DR: Sec 1.2 Batch, Sec 2.2 Batch, Sec 3.2 Delayed (Priority 0, Votes 1)
  Three Config Servers
  Other labels: BI Connector, Collection

Global Data Distribution Read Global/Write Local Primary Secondary Secondary

Videos and Images – Unstructured Data
  Store files larger than 16MB, e.g. video, images
  Load chunks without reading the entire file into memory
  Atomically sync files with their metadata
  Shard and distribute around the cluster
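The chunking idea behind these bullets can be sketched in plain Python. The 255 KiB chunk size and the `files_id` / `n` / `data` field names follow the layout GridFS is documented to use, but treat them as assumptions of this sketch; the helper functions are hypothetical and no MongoDB driver is involved.

```python
CHUNK_SIZE = 255 * 1024  # assumed chunk size for this sketch

def split_into_chunks(file_id, data, chunk_size=CHUNK_SIZE):
    """Break a byte string into numbered chunk documents."""
    return [
        {"files_id": file_id, "n": n, "data": data[offset:offset + chunk_size]}
        for n, offset in enumerate(range(0, len(data), chunk_size))
    ]

def reassemble(chunks):
    """Rebuild the file by concatenating chunks in sequence order."""
    return b"".join(c["data"] for c in sorted(chunks, key=lambda c: c["n"]))

def read_range(chunks, start, length, chunk_size=CHUNK_SIZE):
    """Read `length` bytes at `start`, touching only the chunks that overlap."""
    first = start // chunk_size
    last = (start + length - 1) // chunk_size
    needed = sorted(
        (c for c in chunks if first <= c["n"] <= last), key=lambda c: c["n"]
    )
    joined = b"".join(c["data"] for c in needed)
    offset = start - first * chunk_size
    return joined[offset:offset + length]

payload = bytes(range(256)) * 4000  # 1,024,000 bytes of sample data
chunks = split_into_chunks("video-1", payload)
```

Because each chunk is an ordinary small document, chunks can be distributed across shards like any other data, and `read_range` shows why a player can seek into a video without loading the whole file.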

Cassandra

Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. Cassandra brings together the distributed systems technologies from Dynamo and the log-structured storage engine from Google's BigTable.

Fault Tolerant, Data Durability, Data Center Aware, High Performance, Decentralized, Horizontal Scalability, Elastic Architecture

BIG Data, High # Concurrent Users
  Apple - 75,000 nodes storing over 10 PB of data
  Netflix - 2,500 nodes, 420 TB, over 1 trillion requests per day
  Chinese search engine Easou - 270 nodes, 300 TB, over 800 million requests per day
  eBay - 100 nodes, 250 TB

Users: Apple, Sony, Walmart, Comcast, eBay, GitHub, GoDaddy, Hulu, Instagram, Intuit, Netflix, Reddit, Weather Channel, CERN, Constant Contact, Macy’s, Expedia

DataStax/Cassandra Features

Multi-model storage: Key Value NoSQL, Tabular NoSQL, JSON/Document NoSQL, Graph
Very high “linear” scalability
Automatic data distribution amongst nodes
Multi-data center replication
CQL Access Language
  SQL-“like” language
Tunable consistency model
Strong node fault detection and recovery
Writes to Memtables in RAM
Materialized views
Advanced replication allows multiple clusters to be synchronized
OpsCenter: browser-based administration and monitoring toolset
Driver support for all common programming languages
In-Memory option allows parts (or all) of the database to reside in RAM
Tiered storage
Interface to Spark (in-memory)
  Data stream processing
  Access to Spark SQL (more robust than CQL)
Security
  End-to-end encryption
  AD, LDAP, Kerberos support

Cassandra Cluster (Cassandra/DataStax)

Diagram: a West Coast Datacenter and an East Coast Datacenter with REPLICATION between them. In each datacenter, Node 1 is the Primary, Node 2 and Node 3 each hold a copy of Node 1's data, and Node 4 holds no copy.

Cassandra/DataStax

Keyspace - A logical container for data tables and indexes. It can be compared to an Oracle schema or a SQL Server database. Keyspaces define how the data is replicated amongst the nodes
Table - A collection of columns fetched by a row. Columns are ordered by name
Column - Supports different data types and consists of a name, value and timestamp
Primary Key - Uniquely identifies a row occurrence in a Cassandra table
Partition Key - Identifies which node in the cluster will store the row. It is responsible for data distribution across the nodes
Clustering Key - Orders rows based on the column’s value
Data Center - A collection of related nodes in a Cassandra cluster
Snitch - Determines which datacenters and racks nodes belong to. It informs Cassandra about the network topology so that requests are routed efficiently, and allows Cassandra to distribute replicas by grouping machines into datacenters and racks
Partitioner - A hashing algorithm that generates a hash value token from the partition key. The token is the value used to distribute the data across the various nodes in the cluster. The partitioner’s goal is to assign equal portions of data to each node. Each node in a Cassandra cluster becomes responsible for storing a range of hash values
Gossip - A peer-to-peer communications mechanism that identifies and shares node information (state and location) with all nodes in the Cassandra cluster

Cassandra/DataStax Decentralized Storage

Partitioners are hashing algorithms that generate tokens from partition keys
Each node in a Cassandra cluster is responsible for a range of tokens (hash keys)
First column of primary key becomes partition key
Can use multiple columns as primary key, partition key
Also able to cluster columns to order data
Examples:
  PRIMARY KEY (emp_id)
  PRIMARY KEY (emp_id, dept_id)
  PRIMARY KEY (emp_id, dept_id) WITH CLUSTERING ORDER BY (dept_loc))
All nodes can accept reads and writes
Partitioner distributes data amongst nodes
  Token ranges (diagram): 0-25, 26-50, 51-75, 76-100
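The token-range diagram can be turned into a small sketch. The four node names are hypothetical, the 0-100 token space is the simplified one from the slide, and MD5 is swapped in purely for illustration (Cassandra's default partitioner uses Murmur3 over a far larger token space).

```python
import hashlib

# Token ranges from the slide, assigned to hypothetical node names.
TOKEN_RANGES = {
    "node1": range(0, 26),    # tokens 0-25
    "node2": range(26, 51),   # tokens 26-50
    "node3": range(51, 76),   # tokens 51-75
    "node4": range(76, 101),  # tokens 76-100
}

def token_for(partition_key):
    """Hash the partition key into the slide's 0-100 token space (MD5 stand-in)."""
    digest = hashlib.md5(str(partition_key).encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % 101

def node_for(partition_key):
    """Return the node whose token range contains the key's token."""
    token = token_for(partition_key)
    for node, tokens in TOKEN_RANGES.items():
        if token in tokens:
            return node
    raise AssertionError("token ranges must cover 0-100")

# A composite partition key such as (emp_id, dept_id) hashes as one unit.
placement = {emp_id: node_for((emp_id, 10)) for emp_id in range(1000)}
```

Because placement depends only on the hash, any node can compute where a row lives, which is what lets every node accept reads and writes without a central directory.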

Cassandra/DataStax Tunable Consistency

Read and write consistency levels are different from row replication settings. The replication factor determines how many copies are eventually written, whereas tunable consistency determines how many replicas must respond before the client gets its answer, trading consistency against latency.

Read Consistency
  ALL: Returns the record after all replicas have responded. The read operation will fail if a replica does not respond.
  QUORUM: Returns the record after a quorum of replicas from all datacenters has responded.
  LOCAL_QUORUM: Returns the record after a quorum of replicas in the same datacenter as the coordinator has responded. Avoids latency of inter-datacenter communication.
  ONE: Returns a response from the closest replica, as determined by the snitch. By default, a read repair runs in the background to make the other replicas consistent.
  TWO: Returns the most recent data from two of the closest replicas.
  THREE: Returns the most recent data from three of the closest replicas.
  LOCAL_ONE: Returns a response from the closest replica in the local datacenter.
  SERIAL: Allows reading the current (and possibly uncommitted) state of data without proposing a new addition or update. If a SERIAL read finds an uncommitted transaction in progress, it will commit the transaction as part of the read.
  LOCAL_SERIAL: Same as SERIAL, but confined to the datacenter. Similar to LOCAL_QUORUM.

Write Consistency
  ALL: A write must be written to the commit log and memtable on all replica nodes in the cluster for that partition.
  EACH_QUORUM: Strong consistency. A write must be written to the commit log and memtable on a quorum of replica nodes in each datacenter.
  QUORUM: A write must be written to the commit log and memtable on a quorum of replica nodes across all datacenters.
  LOCAL_QUORUM: Strong consistency. A write must be written to the commit log and memtable on a quorum of replica nodes in the same datacenter as the coordinator. Avoids latency of inter-datacenter communication.
  ONE: A write must be written to the commit log and memtable of at least one replica node.
  TWO: A write must be written to the commit log and memtable of at least two replica nodes.
  THREE: A write must be written to the commit log and memtable of at least three replica nodes.
  LOCAL_ONE: A write must be sent to, and successfully acknowledged by, at least one replica node in the local datacenter.
  ANY: A write must be written to at least one node. If all replica nodes for the given partition key are down, the write can still succeed after a hinted handoff has been written. If all replica nodes are down at write time, an ANY write is not readable until the replica nodes for that partition have recovered.
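The arithmetic behind these levels is small enough to sketch. A quorum is a majority of the replicas, and a read is guaranteed to see the latest acknowledged write when the replicas read plus the replicas written exceed the replication factor. The helper names below are hypothetical, and the sketch assumes a single datacenter, so the LOCAL_*, EACH_QUORUM, SERIAL and ANY levels are left out.

```python
def quorum(replication_factor):
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1

def replicas_required(level, replication_factor):
    """Replicas that must respond for a level (single-datacenter sketch)."""
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    if level == "ALL":
        return replication_factor
    if level == "QUORUM":
        return quorum(replication_factor)
    return fixed[level]

def read_sees_latest_write(read_level, write_level, replication_factor):
    """True when read and write replica sets must overlap: R + W > RF."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor

rf = 3
summary = {
    ("QUORUM", "QUORUM"): read_sees_latest_write("QUORUM", "QUORUM", rf),
    ("ONE", "ONE"): read_sees_latest_write("ONE", "ONE", rf),
    ("ONE", "ALL"): read_sees_latest_write("ONE", "ALL", rf),
}
```

With a replication factor of 3, QUORUM reads and writes each touch 2 replicas, so at least one replica is in both sets. ONE/ONE touches 1 + 1 = 2 replicas in total, which is why it is the fast but eventually consistent choice.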

Relational vs Cassandra NoSQL – Data Modeling
  In relational systems, administrators model the data
  In Cassandra, administrators design schemas that are based on query patterns

Cassandra/DataStax Modeling

Cassandra – YOU DESIGN SCHEMAS BASED ON QUERY PATTERNS, THEN DATA RELATIONSHIPS
  Maximization of denormalization
  Cassandra/DataStax recommendation = 1 table per query
  You are prebuilding answers to unique requests for data!
  Overcome data duplication by leveraging extremely fast write performance
  Determine queries accessing data FIRST, then design the data models
  No concept of foreign keys
  No concept of join operations
  Prepare data for fast reads by writing pre-built result sets
  Attempt to minimize reads from multiple partitions
  Cassandra prefers INSERTs over UPDATEs and DELETEs
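The "one table per query" recommendation can be illustrated with two in-memory dictionaries standing in for two denormalized tables. The table names, columns and helper functions are hypothetical; the point is only that each query is answered from a single pre-built partition, at the cost of writing every row twice.

```python
from collections import defaultdict

# One "table" per query, each keyed the way its query looks data up.
employees_by_id = {}                   # query: find an employee by emp_id
employees_by_dept = defaultdict(dict)  # query: list a department's employees

def add_employee(emp_id, name, dept_id):
    """Write the same row into every table that serves a query (duplication)."""
    row = {"emp_id": emp_id, "name": name, "dept_id": dept_id}
    employees_by_id[emp_id] = row
    employees_by_dept[dept_id][emp_id] = dict(row)

def employee(emp_id):
    """Single-partition read: no join needed."""
    return employees_by_id.get(emp_id)

def employees_in_dept(dept_id):
    """Single-partition read, returned in clustering (emp_id) order."""
    partition = employees_by_dept.get(dept_id, {})
    return [partition[emp_id] for emp_id in sorted(partition)]

add_employee(7, "Grace", "ENG")
add_employee(3, "Alan", "ENG")
add_employee(5, "Edgar", "DBA")
```

A relational design would keep one employee table and join or filter at read time. Here the work moves to write time, which matches the slide's advice to lean on fast writes and avoid reads that span partitions.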

Redis

In-Memory, Key-Value Database
  Dumps to disk are configurable
  Database handles swapping
  All data can live in memory but key caching is required
    1 Million Keys = 160 MB
    10 Million Keys = 1.6 GB
  ATOMIC operations
Master-slave replication
  Scalability
  Redundancy
  Slaves
    Can’t respond to queries during initial sync
    Automatically reconnect and resync after an outage
Journal file
  Every write is logged
  Commands replayed when server is started
  Configurable: can choose between 2 settings
    Eventually consistent: “Speed”
    Immediately consistent: “Safety”

Users: Tumblr, Uber, Coinbase, Flickr, Hulu, Craigslist, Alibaba, Digg

Redis Features

Not a replacement for relational databases but can be used as their “front end”
Lightning-fast read and write access
Single-threaded architecture: does not exploit multiple CPUs/cores
Does not support unit-of-work rollback
Optimistic locking: data contention (race) will cause transaction failure
Redis Clusters
  Not able to guarantee strong consistency amongst nodes
  Able to add/remove nodes in a Redis cluster
  Partitioning allows data to be split and stored in multiple Redis instances. Each instance contains a subset of keys
    Range partitioning
    Hash partitioning
Can be used as a data store or a pure cache
  When used as a cache, can be configured as an LRU (gets rid of old data to make way for new)
  Sensor data
Redis RDB persistence and backups
  Redis snapshots at specified time intervals = a full database backup
  Move RDB files to other storage
  Write operations in memory can be logged to Append Only Files (AOF)
  The appendfsync parameter allows the administrator to configure log writes
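The LRU behavior described for cache mode ("gets rid of old data to make way for new") can be sketched with an ordered dictionary. This `LRUCache` class is a hypothetical, exact LRU written for illustration; it is not Redis code, and Redis itself is documented as approximating LRU rather than tracking exact order.

```python
from collections import OrderedDict

class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key."""
    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # oldest access first, newest last

    def get(self, key):
        if key not in self._items:
            return None  # cache miss
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def set(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used

    def keys(self):
        return list(self._items)

cache = LRUCache(capacity=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")     # "a" becomes the most recently used key
cache.set("c", 3)  # capacity exceeded, so "b" is evicted
```

This is the pattern behind using Redis as a "front end" to a relational database: hot keys stay in memory, cold keys fall out, and a miss is answered by the database behind it.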

Neo4j

Highly scalable, native graph database
Enterprise and community editions
Store, manage, analyze, and use data within the context of connections, like the circles and lines drawn on whiteboards
More than 1 million downloads
Understanding data relationships is also key to understanding dependencies, uncovering cascading impacts, and predicting behavior
Access language allows you to traverse relationships in a much simpler and easier-to-understand way than relational SQL
  SQL: dozens of lines
  Cypher: a couple of lines

Users: Walmart, eBay, Cisco, Adobe, CrunchBase, Pitney Bowes, CareerBuilder, TomTom, ConocoPhillips, National Geographic, CenturyLink, Glassdoor, Zephyr Health, Gamesys, Telenor

Neo4j Features

Provides graphical browser utility to better visualize relationships
Import data from different sources using rules
Cypher is another SQL-“like” language
  Example query: Find sushi restaurants in New York that my friends like
Data model
  Nodes with properties (a node is data, not a server)
  Named relationships with properties
  Properties are key-value pairs
    Key: string
    Value: individual data types or array
  Path: connecting relationships, which you traverse using an API
Schemaless
  Easily able to store unstructured data
  Easily able to store large volumes of data
Full support for ACID transactions
Full indexing capabilities
Constraint capabilities
  Unique
  Exists (like a foreign key with no parent delete rules)
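The slide's example question, sushi restaurants in New York that my friends like, can be sketched as a two-hop traversal over a tiny property graph. The node names, properties and relationship types below are invented, and the traversal is plain Python rather than Cypher or any Neo4j API.

```python
# Nodes carry properties (key-value pairs); all data here is made up.
nodes = {
    "me": {"label": "Person"},
    "ann": {"label": "Person"},
    "bob": {"label": "Person"},
    "sushi_zen": {"label": "Restaurant", "cuisine": "Sushi", "city": "New York"},
    "taco_casa": {"label": "Restaurant", "cuisine": "Mexican", "city": "New York"},
    "sushi_bay": {"label": "Restaurant", "cuisine": "Sushi", "city": "San Francisco"},
}

# Named relationships as (start node, relationship type, end node).
relationships = [
    ("me", "FRIEND", "ann"),
    ("me", "FRIEND", "bob"),
    ("ann", "LIKES", "sushi_zen"),
    ("bob", "LIKES", "sushi_zen"),
    ("bob", "LIKES", "taco_casa"),
    ("bob", "LIKES", "sushi_bay"),
]

def neighbors(start, rel_type):
    """Follow outgoing relationships of one type from a node."""
    return [end for s, t, end in relationships if s == start and t == rel_type]

def sushi_in_new_york_liked_by_friends(person):
    """Path: person -FRIEND-> friend -LIKES-> restaurant, filtered by properties."""
    results = set()
    for friend in neighbors(person, "FRIEND"):
        for place in neighbors(friend, "LIKES"):
            props = nodes[place]
            if props.get("cuisine") == "Sushi" and props.get("city") == "New York":
                results.add(place)
    return sorted(results)

answer = sushi_in_new_york_liked_by_friends("me")
```

The query reads as a walk along named relationships, which is the shape a graph query language expresses directly. The relational equivalent would join a person table to a friendship table, a likes table and a restaurant table.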

Neo4j Graph Examples Master Data Management Graph Based Search Recommendations

NoSQL vs Relational: Relational DBMS

Strengths
  ACID transaction management
  Sophisticated locking and latching
  Power of the SQL language: two-phase commits, foreign key constraints, joins, subqueries, integrated aggregations, complex business rule enforcement
  Product maturity
  Robust utilities
  Vendor support
  Most vendors have robust cloud strategies
  Strong third-party software provider adoption (applications, tools and utilities)

Weaknesses
  Product purchase/support costs
  Scalability can be complex and expensive
  Data normalization can impact performance
  Schemas are not flexible
  Not all data fits neatly into rows and columns
  Geographic distribution can be complex

NoSQL vs Relational: NoSQL DBMS

Strengths
  Dynamic schema flexibility
  Faster development times
  Total cost of ownership
  Easily stores semi, non and fully structured data
  Horizontal and vertical scalability
  Geographic replication and data distribution
  Easier to achieve high performance accessing large volumes of data
  Custom tailor environment to data storage and processing needs
  Cost effective clustering

Weaknesses
  Crude transaction management and locking mechanisms (BASE vs ACID)
  Limited cloud offerings
  Vendor support (or lack thereof)
  Data is often denormalized, leading to duplicate updates
  Weak access languages
  No inherent data integrity enforcement mechanisms

NoSQL vs Relational

                    Relational DBMS          NoSQL DBMS
Transactions        COMPLEX                  SIMPLE
Data                STRUCTURED AND STATIC    FULL/SEMI/NON STRUCTURED, DYNAMIC
Data Velocity       MODERATE TO HIGH         HIGH TO ASTRONOMICAL
Data Locations      FEWER THE BETTER         MANY LOCATIONS
Data Volumes        MAINTAIN BY PURGING      RETAIN FOREVER
Data Availability   CLUSTER, LOG SHIPPING    INHERENT ARCHITECTURE
Data Performance    FOCUS ON READS           FOCUS ON READS/WRITES

Questions and Additional Information
[email protected]
Next Month’s Presentation – Evaluating and Selecting Cloud Database Management Systems
The RDX Report: Is NoSQL the Natural Progression of DB Technology, Cloud’s Hidden Impact on IT Support, SQL Server 2016 Licensing Best Practices, The Rise of Corporate Ransomware
LinkedIn: Selecting Cloud DBMS, NoSQL Architectures, Database Security Series, Improving Customer Service
20 YEARS OF SERVICE DELIVERY EXPERIENCE