What is covered in this presentation? A brief history of databases; NoSQL: why, what and when?; characteristics of NoSQL databases; aggregate data models; CAP theorem.
Introduction: A database is an organized collection of data. A DBMS is a software package of computer programs that controls the creation, maintenance and use of a database. Databases are created to handle large quantities of information by inputting, storing, retrieving, and managing that information.
A brief history
Relational databases. Benefits: designed for all purposes; ACID; strong consistency, concurrency, recovery; mathematical background; standard query language (SQL); lots of tools to use with them, e.g. reporting services, entity frameworks, ...
SQL databases
NoSQL why, what and when? But... relational databases were not built for distributed applications. Because... joins are expensive; they are hard to scale horizontally; impedance mismatch occurs; they are expensive (product cost, hardware, maintenance).
NoSQL why, what and when? And... they are weak in: speed (performance); high availability; partition tolerance.
Why NoSQL now? Answer: driving trends.
RDBMS performance
Data: "Data is a new class of economic asset, like currency and gold." Source: World Economic Forum 2012. Data is the new raw material.
Data size growth: 150 exabytes in 2005 (an exabyte is a billion gigabytes); 1,200 exabytes in 2010; 35,000 exabytes in 2020 (expected by IBM).
Volume of data/information created, captured, copied, and consumed worldwide from 2010 to 2025
Data size growth. Examples: ISRO launches the advanced earth observation and mapping satellite CARTOSAT-3 along with 13 other commercial nano-satellites; information and images coming from the satellite; Maharashtra election: 20,000 tweets/second; around 30 billion RFID tags produced per year; automatic toll collection using RFID; oil drilling platforms have 20k to 40k sensors; 95% of data produced is unstructured.
Challenge: Big Data's characteristics are challenging conventional information management architectures. Massive and growing amounts of information residing internal and external to the organization; unconventional semi-structured or unstructured (diverse) data, including web pages, log files, social media, click-streams, instant messages, text messages, emails, sensor data from active and passive systems, etc.; changing information. Multi-channel analytics; sentiment analytics; transaction analytics; call detail records analytics; warranty claim analytics; surveillance analytics; claim fraud analytics.
What is big data? "A massive volume of both structured and unstructured data that is so large that it's difficult to store, analyse, process, share, visualise and manage with traditional database and software techniques." (Roger Magoulas of O'Reilly in 2005) Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis. IBM / MS. Volume (terabytes -> zettabytes); Variety (structured -> semi-structured -> unstructured); Velocity (batch -> streaming data).
What makes it Big Data? (V^3) Figure labels: Volume, Velocity, Variety, Value; data sources shown: social, blog, smart meter. Volume: gigabyte (10^9), terabyte (10^12), petabyte (10^15), exabyte (10^18), zettabyte (10^21). Variety: structured, semi-structured, unstructured; text, image, audio, video, record. Velocity: dynamic, sometimes time-varying.
Variability: variability vs variety. Six different coffee blends is variety; a blend that tastes different every day is variability. The same is true of data: if the meaning is constantly changing, it can have a huge impact on your data homogenization. Visualization: using charts and graphs to visualize large amounts of complex data.
But what is NoSQL? A NoSQL database provides a mechanism for storage and retrieval of data that employs less constrained consistency models than traditional relational databases. NoSQL systems are also referred to as "Not only SQL" to emphasize that they do in fact allow SQL-like query languages to be used.
Characteristics of NoSQL databases. NoSQL avoids: overhead of ACID transactions; complexity of SQL queries; burden of up-front schema design; DBA presence; transactions (they should be handled at the application layer). NoSQL provides: easy and frequent changes to the DB; fast development; large data volumes (e.g. Google); schema-less design.
NoSQL is getting more & more popular
What is a schema-less data model? In relational databases: you can't add a record which does not fit the schema; you need to add NULLs to unused items in a row; you must consider the datatypes, e.g. you can't add a string to an integer field; you can't add multiple items in a field (you should create another table: primary key, foreign key, joins, normalization, ...).
What is a schema-less data model? In NoSQL databases: there is no schema to consider; there is no unused cell; there is no datatype (it is implicit); most considerations are handled in the application layer; we gather all items in an aggregate (document).
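The contrast between the two schema slides can be sketched in a few lines of Python. This is only an illustration: a plain list of dicts stands in for a schema-less store, and the collection name, field names, and the `phones_of` helper are all hypothetical, not taken from any real database API.

```python
# Hypothetical "people" collection: a plain list of dicts stands in for a
# schema-less store. No real database API is being modelled here.
people = []

# Each record carries only the fields it actually has: no NULL padding.
people.append({"name": "Asha", "email": "asha@example.com"})

# A multi-valued field lives inside the record itself, with no extra table
# and no join.
people.append({"name": "Ravi", "phones": ["555-0100", "555-0101"]})


def phones_of(record):
    """The application layer decides what a missing field means."""
    return record.get("phones", [])


all_phones = [phone for person in people for phone in phones_of(person)]
```

The cost of this flexibility is visible in `phones_of`: the check that a relational schema would enforce has moved into application code.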
Categories of NoSQL databases. NoSQL databases are classified into four major data models: key-value; document; column family; graph. Each DB has its own query language.
Key-value data model: the simplest NoSQL databases; the main idea is the use of a hash table; access data (values) by strings called keys; data has no required format (it may have any format); data model: (key, value) pairs; basic operations: Insert(key, value), Fetch(key), Update(key), Delete(key).
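A minimal in-memory sketch of the four basic operations, using a Python dict as the hash table. The class name and the error behaviour are assumptions made for illustration, as is giving the update operation a new value to store; real key-value stores differ in all of these details.

```python
class KeyValueStore:
    """Toy in-memory key-value store backed by a dict (a hash table)."""

    def __init__(self):
        self._data = {}

    def insert(self, key, value):
        # Assumption: inserting an existing key is an error.
        if key in self._data:
            raise KeyError(f"key already exists: {key!r}")
        self._data[key] = value

    def fetch(self, key):
        # Returns None when the key is absent.
        return self._data.get(key)

    def update(self, key, value):
        # Assumption: updating a missing key is an error.
        if key not in self._data:
            raise KeyError(f"no such key: {key!r}")
        self._data[key] = value

    def delete(self, key):
        # Deleting a missing key is silently ignored.
        self._data.pop(key, None)


store = KeyValueStore()
store.insert("user:1", {"name": "Asha"})   # the value may have any format
store.update("user:1", "now just a string")
```

The store never inspects the value, which is why the same key can hold a dict at one moment and a string at the next.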
Column family data model. Row-oriented DB: stores row by row, suitable for OLTP. Column-oriented DB: stores column by column, suitable for OLAP. Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally (large data and random read/write). The column is the lowest/smallest instance of data. It is a tuple that contains a name, a value and a timestamp.
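The column tuple described above (name, value, timestamp) can be sketched as follows. The nested-dict layout, the `put`/`get` helper names, and the rule that the write with the newest timestamp wins are assumptions for illustration, not the storage format of HBase or Cassandra.

```python
import time
from collections import namedtuple

# A column is the smallest unit of data: a (name, value, timestamp) tuple.
Column = namedtuple("Column", ["name", "value", "timestamp"])


def put(table, row_key, name, value, timestamp=None):
    """Store a column under a row key, keeping the newest write (assumed rule)."""
    ts = time.time() if timestamp is None else timestamp
    row = table.setdefault(row_key, {})
    current = row.get(name)
    if current is None or ts >= current.timestamp:
        row[name] = Column(name, value, ts)


def get(table, row_key, name):
    """Return the stored value, or None if the row or column is absent."""
    column = table.get(row_key, {}).get(name)
    return None if column is None else column.value


# Layout: row key -> column name -> Column. Rows need not share columns.
users = {}
put(users, "u1", "city", "Pune", timestamp=1)
put(users, "u1", "city", "Mumbai", timestamp=2)
put(users, "u1", "city", "Delhi", timestamp=1)   # older write, ignored
put(users, "u2", "email", "ravi@example.com", timestamp=1)
```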
Example
Column family data model. Some statistics about Facebook Search (using Cassandra). MySQL, > 50 GB data: writes average ~300 ms; reads average ~350 ms. Rewritten with Cassandra, > 50 GB data: writes average 0.12 ms; reads average 15 ms.
Graph data model: based on graph theory; scales vertically, no clustering; you can use graph algorithms easily; transactions are ACID.
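To illustrate the point that graph algorithms apply directly to graph-shaped data, here is a small sketch: a hypothetical "follows" graph stored as an adjacency list, with breadth-first search used to count the hops between two people. The data and the `hops` function name are invented for the example; a graph database would express this in its own query language.

```python
from collections import deque

# Hypothetical "follows" relationships as an adjacency list.
follows = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": [],
}


def hops(graph, start, goal):
    """Breadth-first search: edges on a shortest path, or None if unreachable."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        if node == goal:
            return distance
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None
```

Answering the same question in a relational model would typically need one self-join per hop, which is the kind of query this data model is meant to avoid.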
Document based data model: pairs each key with a complex data structure known as a document. Indexes are done via B-Trees. Documents can contain many different key-value pairs, or key-array pairs, or even nested documents.
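A document of this kind can be pictured as a nested Python dict. The order document, its field names, and the `get_path` helper below are hypothetical and only show the shape: key-value pairs, a key-array pair, and nested documents inside one aggregate.

```python
# One aggregate: everything about the order lives in a single document.
order = {
    "_id": "order-1",                               # key-value pair
    "customer": {"name": "Asha", "city": "Pune"},   # nested document
    "items": [                                      # key-array pair
        {"sku": "A1", "qty": 2},
        {"sku": "B7", "qty": 1},
    ],
}


def get_path(doc, path, default=None):
    """Follow a dotted path such as 'customer.city' through nested dicts."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


total_qty = sum(item["qty"] for item in order["items"])
```

Reading the whole order takes a single lookup by key, with no joins across customer and item tables.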
SQL vs NOSQL
NoSQL may complement RDBMS: an RDBMS may hold smaller amounts of high-value structured data; NoSQL may hold vast amounts of less valued and less structured data. Relational implementations provide ACID guarantees. Atomicity: a transaction is treated as an all-or-nothing operation. Consistency: database values are correct before and after. Isolation: each transaction runs as if it were the only transaction. Durability: upon completion of a transaction, the operation is not reversed. NoSQL often provides BASE. Basically available: allowance for parts of a system to fail (sharding / partitioning). Soft state: an object may have multiple simultaneous values (at different times). Eventually consistent: consistency is achieved over time (not on every commit). CAP theorem: it is impossible to have consistency, availability, and partition tolerance all at once in a distributed system.
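The "eventually consistent" idea can be sketched with two toy replicas. Everything here is an assumption made for illustration: the `Replica` class, the per-key version counter, and the `sync` pass that keeps the higher-versioned entry are a simplification, not how any particular NoSQL product replicates data.

```python
class Replica:
    """Toy replica holding key -> (version, value)."""

    def __init__(self):
        self.data = {}

    def write(self, key, value):
        # Bump a per-key version counter on every local write.
        version = self.data.get(key, (0, None))[0] + 1
        self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def sync(a, b):
    """Anti-entropy pass: both replicas keep the higher-versioned entry."""
    for key in set(a.data) | set(b.data):
        newest = max(
            a.data.get(key, (0, None)),
            b.data.get(key, (0, None)),
            key=lambda entry: entry[0],
        )
        a.data[key] = newest
        b.data[key] = newest


r1, r2 = Replica(), Replica()
r1.write("cart", ["book"])   # the write is accepted by one replica only
stale = r2.read("cart")      # the other replica has not seen it yet
sync(r1, r2)                 # consistency is reached later, not at commit
fresh = r2.read("cart")
```

Between the write and the `sync` call the two replicas disagree, which is the window that ACID systems close and BASE systems tolerate.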
What do we need? We need a distributed database system with these features: fault tolerance; high availability; consistency; scalability. Which is impossible, according to the CAP theorem!
The CAP theorem: we cannot achieve all three items (the center of the diagram) in distributed database systems.