The Big Data Stack

Slide Content

The Big Data Stack. Zubair Nabi ([email protected]). 7 January 2014.

Data Timeline (timeline figure; milestones: fork(), 2003, 2012, 2015).

Example*: Facebook. 2.5B content items shared; 2.7B ‘Likes’; 300M photos uploaded; 105TB of data scanned every 30 minutes; 500+TB of new data ingested; 100+PB data warehouse. (* VP of Engineering Jay Parikh, 2012.)

Example: Facebook’s Haystack*. 65B photos, with four images of different sizes stored for each photo, for a total of 260B images and 20PB of storage. 1B new photos are uploaded each week, an increment of 60TB. At peak traffic, 1M images are served per second. An image request is like finding a needle in a haystack. * Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, and Peter Vajgel. 2010. Finding a needle in Haystack: Facebook's photo storage. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (OSDI'10). USENIX Association, Berkeley, CA, USA, 1-8.

More Examples. The LHC at CERN generates 22PB of data annually (after throwing away around 99% of readings). The Square Kilometre Array (under construction) is expected to generate hundreds of PB each day. Farecast, a part of Bing, searches through 225B flight and price records to advise customers on their ticket purchases.

More Examples (2). The amount of annual traffic flowing over the Internet is around 700EB. Walmart handles in excess of 1M transactions every hour (25PB in total). 400M tweets are posted every day.

Big Data. Large datasets whose processing and storage requirements exceed traditional paradigms and infrastructure; on the order of terabytes and beyond. Generated by web 2.0 applications, sensor networks, scientific applications, financial applications, etc. Radically different tools are needed to record, store, process, and visualize the data. Moving away from the desktop and offloaded to the “cloud”. Poses challenges for computation, storage, and infrastructure.

The Stack. Presentation layer; application layer (processing + storage); operating system layer; virtualization layer (optional); network layer (intra- and inter-datacenter); physical infrastructure layer. Together these can roughly be called the “cloud”.

Presentation Layer. Acts as the user-facing end of the entire ecosystem. Forwards user queries to the backend (potentially the rest of the stack). Can be both local and remote. For most web 2.0 applications, the presentation layer is a web portal.

Presentation Layer (2). For instance, the Google search website is a presentation layer: it takes user queries, forwards them to a scatter-gather application, and presents the results to the user (within a time bound). Made up of many technologies, such as HTTP, HTML, AJAX, etc. Can also be a visualization library.

Application Layer. Serves as the back-end: it either computes a result for the user, or fetches a previously computed result or content from storage. The execution is predominantly distributed. The computation itself might entail cross-disciplinary (across the sciences) technology.

Processing. Can be a custom solution, such as a scatter-gather application. Might also be an existing data-intensive computation framework, such as MapReduce, Spark, MPI, etc., or a stream processing system, such as IBM InfoSphere Streams, Storm, S4, etc. Analytics engines: R, MATLAB, etc.

Numbers Everyone Should Know*

Operation | Time (ns) | Scaled to human terms
L1 cache reference | 0.5 | 0.5 s
Branch mispredict | 5 | 5 s
L2 cache reference | 7 | 7 s
Mutex lock/unlock | 25 | 25 s
Main memory reference | 100 | 1 m 40 s
Send 2KB over 1 Gbps network | 20,000 | 5 h 30 m
Read 1 MB sequentially from memory | 250,000 | ~3 days
Disk seek | 10,000,000 | ~6 days
Read 1 MB sequentially from disk | 20,000,000 | 8 months
Send packet CA -> NL -> CA | 150,000,000 | 4.75 years

* Jeff Dean. Designs, lessons and advice from building large distributed systems. Keynote at LADIS, 2009.
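
These latencies drive most Big Data design decisions: stream sequentially, keep hot data in memory, avoid random disk access. A back-of-the-envelope sketch using the ballpark figures above (the constants are the slide's 2009 values, not measurements):

```python
# Back-of-the-envelope arithmetic with the ballpark latencies above.
MEM_READ_1MB_NS = 250_000
DISK_SEEK_NS = 10_000_000
DISK_READ_1MB_NS = 20_000_000

def seconds_to_read_1gb(ns_per_mb):
    """Time to stream 1 GB (1024 MB) sequentially, in seconds."""
    return 1024 * ns_per_mb / 1e9

print(f"1 GB from memory: {seconds_to_read_1gb(MEM_READ_1MB_NS):.2f} s")   # ~0.26 s
print(f"1 GB from disk:   {seconds_to_read_1gb(DISK_READ_1MB_NS):.1f} s")  # ~20 s

# Random access is far worse: 1024 scattered 1 MB reads pay a seek each time.
random_1gb = 1024 * (DISK_SEEK_NS + DISK_READ_1MB_NS) / 1e9
print(f"1 GB from disk in random 1 MB chunks: {random_1gb:.1f} s")         # ~31 s
```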

Ubiquitous Computation: Machine Learning. Making predictions based on existing data, e.g. classifying emails into spam and non-spam. American Express analyzes the monthly expenditures of its cardholders to suggest products to them. Facebook uses it to figure out the order of Newsfeed stories, friend and page recommendations, etc. Amazon uses it to make product recommendations, while Netflix employs it for movie recommendations.

Case Study: MapReduce. Designed by Google to process large amounts of data; Google's “hammer for 80% of their data crunching”. The original paper has 9,000+ citations. The user only needs to write two functions; the framework abstracts away work distribution, network connectivity, data movement, and synchronization, and can seamlessly scale to hundreds of thousands of machines. The open-source version, Hadoop, is used by everyone from Yahoo and Facebook to LinkedIn and The New York Times.

Case Study: MapReduce (2). Used for embarrassingly parallel applications and most divide-and-conquer algorithms. For instance, the count of each word in a billion-document library can be calculated in less than 10 lines of custom code. Data is stored on a distributed filesystem. The pattern is map() -> groupBy -> reduce().
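
A minimal, single-process sketch of that word-count pattern (illustrative Python; the names map_fn/reduce_fn and the in-memory groupBy stand in for what Hadoop/MapReduce does across a cluster over HDFS):

```python
# Word count expressed as map() -> groupBy -> reduce(), run locally.
from collections import defaultdict

def map_fn(document):
    for word in document.split():
        yield word.lower(), 1              # emit (key, value) pairs

def reduce_fn(word, counts):
    return word, sum(counts)               # aggregate all values for one key

def word_count(documents):
    groups = defaultdict(list)             # stands in for the framework's shuffle/groupBy
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(word_count(["the quick brown fox", "the lazy dog"]))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```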

Case Study: Storm. Used to analyze “data in motion”. Originally designed at BackType, which was later acquired by Twitter; Storm is now an Apache project. Each datapoint, called a tuple, passes through a processing pipeline. The user only needs to provide the code for each operator and a graph specification (the topology).
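
A toy illustration of that tuple-and-topology model (conceptual Python only; Storm's actual API is Java-based with spouts, bolts, and a TopologyBuilder, and it runs the operators in parallel across a cluster):

```python
# Tuples flow from a "spout" through "bolts"; each function is a stand-in
# for a user-supplied operator in the topology.
from collections import Counter

def sentence_spout():
    """The 'spout': emits tuples into the topology."""
    for line in ["storm analyzes data in motion", "data in motion"]:
        yield (line,)

def split_bolt(tup):
    """First 'bolt': split a sentence tuple into word tuples."""
    for word in tup[0].split():
        yield (word,)

def count_bolt(word_tuples):
    """Second 'bolt': running count of words."""
    counts = Counter()
    for (word,) in word_tuples:
        counts[word] += 1
    return counts

# Wire the "topology": spout -> split -> count
words = (w for tup in sentence_spout() for w in split_bolt(tup))
print(count_bolt(words))   # Counter({'data': 2, 'in': 2, 'motion': 2, ...})
```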

Storage. Most Big Data solutions revolve around data without any structure (possibly from heterogeneous sources), and the scale of the data makes a cleaning phase next to impossible. Therefore, storage solutions need to explicitly support unstructured and semi-structured data. Traditional RDBMSs are being replaced by NoSQL and NewSQL solutions, varying from document stores to key-value stores.

Storage (2). Relational database management systems (RDBMS): IBM DB2, MySQL, Oracle DB, etc. (structured data). NoSQL: key-value stores, document stores, graphs, tables, etc. (semi-structured and unstructured data). Document stores: MongoDB, CouchDB, etc. Graphs: FlockDB, etc. Key-value stores: Dynamo, Cassandra, Voldemort, etc. Tables: BigTable, HBase, etc. NewSQL, the best of both worlds: Spanner, VoltDB, etc.
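
To make the taxonomy concrete, a small sketch of the access patterns the two most common NoSQL models expose (the keys and documents are made-up examples; real systems such as Cassandra or MongoDB add replication, partitioning, indexing, and query languages on top):

```python
# Key-value store: values are opaque blobs, lookup is strictly by key.
kv_store = {}
kv_store["photo:42:thumbnail"] = b"\x89PNG..."        # put
thumbnail = kv_store.get("photo:42:thumbnail")        # get

# Document store: values are semi-structured documents queryable by field.
documents = [
    {"_id": 1, "name": "Alice", "city": "Lahore", "likes": 12},
    {"_id": 2, "name": "Bob", "city": "Karachi"},     # schema can vary per document
]
print([d["name"] for d in documents if d.get("city") == "Lahore"])   # ['Alice']
```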

NoSQL. Different semantics: RDBMSs provide ACID guarantees: Atomicity (the entire transaction either succeeds or fails), Consistency (data within the database remains consistent after each transaction), Isolation (transactions are sandboxed from each other), and Durability (transactions are persistent across failures and restarts). This is overkill for most user-facing applications: most are more interested in availability and are willing to sacrifice consistency, leading to eventual consistency. High throughput: most NoSQL databases sacrifice consistency for availability, leading to higher throughput (in some cases by an order of magnitude).
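
A minimal sketch of the consistency-for-availability trade-off described above (illustrative only; the last-writer-wins timestamp rule below is an assumption made for brevity, whereas real systems such as Dynamo and Cassandra use vector clocks, quorums, and anti-entropy protocols):

```python
# Two replicas accept writes independently (availability) and reconcile later.
import time

class Replica:
    def __init__(self):
        self.store = {}                              # key -> (timestamp, value)

    def write(self, key, value):                     # always accepted locally
        self.store[key] = (time.time(), value)

    def read(self, key):
        return self.store.get(key, (None, None))[1]

    def merge(self, other):                          # reconciliation: keep newer versions
        for key, (ts, val) in other.store.items():
            if key not in self.store or self.store[key][0] < ts:
                self.store[key] = (ts, val)

a, b = Replica(), Replica()
a.write("profile:42", "old avatar")
b.write("profile:42", "new avatar")                  # concurrent write on another replica
print(a.read("profile:42"), "|", b.read("profile:42"))   # replicas diverge
a.merge(b); b.merge(a)
print(a.read("profile:42"), "|", b.read("profile:42"))   # converge to 'new avatar'
```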

Case Study: BigTable*. A distributed, multi-dimensional table, indexed by both a row key and a column key. Rows are maintained in lexicographic order and are dynamically partitioned into tablets. Implemented atop GFS, with multiple tablet servers and a single master. * Fay Chang, et al. 2006. Bigtable: a distributed storage system for structured data. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI'06). USENIX Association, Berkeley, CA, USA, 205-218.
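
A minimal sketch of that lexicographically ordered, multi-dimensional map (illustrative Python, not the Bigtable API; TinyTable and its row keys are made-up examples, and real Bigtable splits the sorted row space into tablets served by many tablet servers):

```python
# A sorted map indexed by (row key, column key, timestamp).
from bisect import insort, bisect_left, bisect_right

class TinyTable:
    def __init__(self):
        self.row_keys = []            # row keys kept in lexicographic order
        self.cells = {}               # (row, column) -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        if row not in self.row_keys:
            insort(self.row_keys, row)
        self.cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column):
        versions = self.cells.get((row, column), {})
        return versions[max(versions)] if versions else None   # newest version

    def scan(self, start_row, end_row):
        """Range scan; lexicographic order makes this (and tablet splits) cheap."""
        lo = bisect_left(self.row_keys, start_row)
        hi = bisect_right(self.row_keys, end_row)
        return self.row_keys[lo:hi]

t = TinyTable()
t.put("com.cnn.www", "contents:", 3, "<html>v3</html>")
t.put("com.cnn.www", "contents:", 1, "<html>v1</html>")
t.put("com.bbc.www", "anchor:cnnsi.com", 1, "CNN")
print(t.get("com.cnn.www", "contents:"))   # <html>v3</html>
print(t.scan("com.b", "com.c"))            # ['com.bbc.www']
```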

Case Study: Spanner*. A database that stretches across the globe, seamlessly operating across hundreds of datacenters, millions of machines, and trillions of rows of information. It took Google four and a half years to design and develop. Time is of the essence in distributed systems: (possibly geo-distributed) machines, applications, processes, and threads need to be synchronized. * James C. Corbett, et al. 2012. Spanner: Google's globally-distributed database. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (OSDI'12). USENIX Association, Berkeley, CA, USA, 251-264.

Case Study: Spanner (2). Spanner relies on a “TrueTime API”, which makes use of atomic clocks and GPS to ensure consistency for the entire system. Even if two commits (with an agreed-upon ordering) take place at opposite ends of the globe (say, the US and China), their ordering will be preserved. For instance, the Google ad system (an online auction where ordering matters) can span the entire globe.
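
A conceptual sketch of the commit-wait idea behind TrueTime (illustrative only; tt_now, commit, and the EPSILON bound are assumptions for the sketch, not Spanner's real interface):

```python
# The clock returns an interval guaranteed to contain real time; a transaction
# only completes after its chosen timestamp is definitely in the past, so a
# later transaction anywhere on the globe gets a strictly larger timestamp.
import time

EPSILON = 0.007   # assumed worst-case clock uncertainty (~7 ms)

def tt_now():
    """Return an interval (earliest, latest) believed to contain true time."""
    t = time.time()
    return t - EPSILON, t + EPSILON

def commit(description):
    _, latest = tt_now()
    commit_ts = latest                       # timestamp no earlier than real time
    while tt_now()[0] < commit_ts:           # commit wait: sit out the uncertainty
        time.sleep(0.001)
    print(f"{description} committed at {commit_ts:.6f}")
    return commit_ts

ts_us = commit("ad bid placed in the US")
ts_cn = commit("ad bid placed in China")
assert ts_us < ts_cn                         # external consistency: order preserved
```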

Framework Plurality

Cluster Managers. Mix different programming paradigms, for instance batch processing with stream processing. Cluster consolidation: no need to manually partition the cluster across multiple frameworks. Data sharing: pass data from, say, MapReduce to Storm and vice versa. Higher-level job orchestration: the ability to have a graph of heterogeneous job types. Examples include YARN, Mesos, and Google's Omega.

Operating System Layer. Consists of the traditional operating system stack with the usual suspects: Windows, variants of *nix, etc. Alternatives specialized for the cloud or multicore systems exist, though: exokernels, multikernels, and unikernels.

Virtualization Layer. Allows multiple operating systems to run on top of the same physical hardware, enabling infrastructure sharing, isolation, and optimized utilization. Different allocation strategies are possible: it is easier to dedicate CPU and memory than the network. Allocation is either in the form of VMs or containers: VMware, Xen, LXC, etc.

Network Layer. Connects the entire ecosystem together and consists of the entire protocol stack. Tenants are assigned to virtual LANs. Multiple protocols are available across the stack: most datacenters employ traditional Ethernet as the L2 fabric, although optical, wireless, and InfiniBand fabrics are not far-fetched. Software-defined networks have also enabled more informed traffic engineering. Run-of-the-mill tree topologies are being replaced by radical recursive and random topologies.

Physical Infrastructure Layer. The physical hardware itself: servers and network elements, plus the mechanisms for power distribution, wiring, and cooling. Servers are connected in various topologies using different interconnects. Dubbed datacenters; modular, self-contained, container-sized datacenters can be moved at will. “We must treat the datacenter itself as one massive warehouse-scale computer” – Luiz André Barroso and Urs Hölzle, Google*. * Luiz André Barroso and Urs Hölzle. 2009. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines (1st ed.). Morgan and Claypool Publishers.

Power Generation. According to The New York Times in 2012, datacenters are collectively responsible for the energy equivalent of 7-10 nuclear power plants running at full capacity. Datacenters have started using renewable energy sources, such as solar and wind power, engendering the paradigm of “move computation to wherever renewable sources exist”.

Heat Dissipation. The scale of the setup necessitates radical cooling mechanisms. Facebook in Prineville, US ("the Tibet of North America"): rather than use inefficient water chillers, the datacenter pulls outside air into the facility and uses it to cool the servers. Google in Hamina, Finland, on the banks of the Baltic Sea: the cooling mechanism pulls sea water through an underground tunnel and uses it to cool the servers.

Case Study: Google. All that infrastructure enables Google to index 20B web pages a day, handle in excess of 3B search queries daily, provide email storage to 425M Gmail users, and serve 3B YouTube videos a day.

Q