INTRODUCTION TO BIG DATA AND HADOOP


About This Presentation

INTRODUCTION TO BIG DATA AND HADOOP
Introduction to Big Data, Types of Digital Data, Challenges of conventional systems - Web data, Evolution of analytic processes and tools, Analysis Vs reporting - Big Data Analytics, Introduction to Hadoop - Distributed Computing Challenges - History of Hadoop, ...


Slide Content

IT19741 Cloud and Big Data Analytics. Dr G Geetha, Dean (Innovation) and Professor, CSE, Women Scientist, Qualified Patent Agent, Rajalakshmi Engineering College.

UNIT-III INTRODUCTION TO BIG DATA AND HADOOP Introduction to Big Data, Types of Digital Data, Challenges of conventional systems - Web data, Evolution of analytic processes and tools, Analysis Vs reporting - Big Data Analytics, Introduction to Hadoop - Distributed Computing Challenges - History of Hadoop, Hadoop Eco System - Use case of Hadoop – Hadoop Distributors – HDFS – Processing Data with Hadoop – Map Reduce.

Customer Challenges: The Data Deluge (The Economist, Feb 25, 2010).

Introduction to Big Data Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.

Questions from Businesses will Vary. Past: reporting and dashboards answer "What happened?", while forensics and data mining answer "Why did it happen?". Present: real-time analytics answers "What is happening?", while real-time data mining answers "Why is it happening?". Future: predictive analytics answers "What is likely to happen?", while prescriptive analytics answers "What should I do about it?".

Types of Digital Data

Traditional Data vs Big Data
Traditional data is generated at the enterprise level; big data is generated outside the enterprise level.
Traditional data volume ranges from gigabytes to terabytes; big data volume ranges from petabytes to zettabytes or exabytes.
Traditional database systems deal with structured data; big data systems deal with structured, semi-structured, and unstructured data.
Traditional data is generated per hour or per day or more; big data is generated far more frequently, often every second.
Traditional data sources are centralized and managed in a centralized form; big data sources are distributed and managed in a distributed form.
Data integration is very easy for traditional data; data integration is very difficult for big data.
A normal system configuration is capable of processing traditional data; a high-end system configuration is required to process big data.
The size of traditional data is very small; big data is far larger than traditional data.
Traditional database tools are enough to perform any database operation; special kinds of database tools are required to perform big data schema-based operations.
Normal functions can manipulate traditional data; special kinds of functions are needed to manipulate big data.
The traditional data model is strictly schema-based and static; the big data model is flat-schema-based and dynamic.
Traditional data is stable with known inter-relationships; big data is not stable and has unknown relationships.
Traditional data is of manageable volume; big data comes in huge volumes that become unmanageable.
Traditional data is easy to manage and manipulate; big data is difficult to manage and manipulate.
Traditional data sources include ERP transaction data, CRM transaction data, financial data, organizational data, web transaction data, etc.; big data sources include social media, device data, sensor data, video, images, audio, etc.

Types of Digital Data. Structured data: When data follows a pre-defined schema/structure, we say it is structured data. This is data in an organized form (e.g., in rows and columns) that can easily be used by a computer program. Relationships exist between entities of data, such as classes and their objects. About 10% of an organization's data is in this format. Data stored in databases is an example of structured data.

Types of Digital Data. Unstructured data: This is data which does not conform to a data model or is not in a form which can be used easily by a computer program. About 80% of an organization's data is in this format; for example, memos, chat rooms, PowerPoint presentations, images, videos, letters, research reports, white papers, the body of an email, etc.

Types of Digital Data. Semi-structured data: Semi-structured data is also referred to as having a self-describing structure. This is data which does not conform to a data model but has some structure; however, it is not in a form which can be used easily by a computer program. About 10% of an organization's data is in this format; for example, HTML, XML, JSON, email data, etc.

Big Data: Different than Business Intelligence. "Traditional BI": repetitive, structured, operational, GBs to 10s of TBs. "Big Data Analytics": experimental, ad hoc, mostly semi-structured, external + operational, 10s of TBs to 100s of PBs.

Web data: Data that is sourced and structured from websites is referred to as "web data". In the world of big data, data comes from multiple sources and in huge amounts, and one of those sources is the web itself. Web data extraction is one of the means of collecting data from this source, i.e. the web.

Web 2.0 is "Data-Driven". "The future is here, it's just not evenly distributed yet." (William Gibson)

The World of Data-Driven Applications

Attributes of Big Data: Volume (terabytes, transactions, tables, records, files), Velocity (batch, near time, real time, streams), Variety (structured, semi-structured, unstructured).

What is Big Data Analytics?

The Evolution of Business Intelligence (along the axes of speed and scale). 1990s: BI reporting, OLAP and data warehouses (Business Objects, SAS, Informatica, Cognos, and other SQL reporting tools). 2000s: interactive business intelligence and in-memory RDBMS (QlikView, Tableau, HANA). 2010s: big data batch processing and distributed data stores (Hadoop/Spark; HBase/Cassandra), plus big data real-time and single-view systems (graph databases).

Big Data Analytics: Big data is more real-time in nature than traditional DW applications. Traditional DW architectures (e.g. Exadata, Teradata) are not well suited for big data apps. Shared-nothing, massively parallel processing, scale-out architectures are well suited for big data apps.

Big Data Technology

Analysis Vs reporting

Ten Common Big Data Problems: modeling true risk, customer churn analysis, recommendation engines, ad targeting, PoS transaction analysis, analyzing network data to predict failure, threat analysis, trade surveillance, search quality, data "sandbox".

The Big Data Opportunity: Financial Services, Healthcare, Retail, Web/Social/Mobile, Manufacturing, Government.

Industries Are Embracing Big Data. Retail: CRM and customer scoring, store siting and layout, fraud detection/prevention, supply chain optimization. Advertising and Public Relations: demand signaling, ad targeting, sentiment analysis, customer acquisition. Financial Services: algorithmic trading, risk analysis, fraud detection, portfolio analysis. Media and Telecommunications: network optimization, customer scoring, churn prevention, fraud prevention. Manufacturing: product research, engineering analytics, process and quality analysis, distribution optimization. Energy: smart grid, exploration. Government: market governance, counter-terrorism, econometrics, health informatics. Healthcare and Life Sciences: pharmaco-genomics, bio-informatics, pharmaceutical research, clinical outcomes research.

The "5 Vs" of Big Data.
Volume: With increasing dependence on technology, data is being produced in large volumes. Common examples are data produced by social networking sites, sensors, scanners, airlines, and other organizations.
Velocity: A huge amount of data is generated every second. It was estimated that by the end of 2020 every individual would produce about 3 MB of data per second. This large volume of data is generated with great velocity.
Variety: The data being produced comes in three types. Structured data: relational data stored in the form of rows and columns. Unstructured data: text, pictures, videos, etc., which cannot be stored in rows and columns. Semi-structured data: log files are an example of this type.
Veracity: The term veracity refers to inconsistent or incomplete data, which results in doubtful or uncertain information. Data inconsistency often arises because of the volume of data: data in bulk can create confusion, whereas too little data can convey only partial or incomplete information.
Value: Beyond the four Vs above there is one more V, which stands for value. Bulk data with no value is of no use to a company unless it is turned into something useful. Data in itself has no importance; it must be converted into something valuable to extract information. Hence value can be considered the most important of the 5 Vs.

Distributed Computing Challenges Lack of performance and scalability. Lack of flexible resource management. Lack of application deployment support. Lack of quality of service. Lack of multiple data source support.

Storage and Memory Bandwidth Lagging CPU: CPU bandwidth requirements are out-pacing memory and storage; disk and memory are getting "further" away from the CPU (the "memory wall" and the "storage chasm"); large sequential transfers are better for both memory and disk. Annual bandwidth improvement (all milestones): CPU 1.5, DRAM 1.27, LAN 1.39, Disk 1.28. Annual latency improvement (all milestones): CPU 1.17, DRAM 1.07, LAN 1.12, Disk 1.11.

Commodity Hardware Economics: For $1000, one computer can process ~32 GB and store ~15 TB. 99.9% of data is underutilized.

Enterprise + Big Data = Big Opportunity

Why Hadoop? Big Data analytics and the Apache Hadoop open source project are rapidly emerging as the preferred solution to address business and technology trends that are disrupting traditional data management and processing. Enterprises can gain a competitive advantage by being early adopters of big data analytics.

Evolution of Hadoop

Hadoop Adoption: HDFS, MapReduce, Ecosystem Projects.

Hadoop Adoption in the Industry, 2007-2010 (The Datagraph Blog; source: Hadoop Summit presentations).

Use case of Hadoop

Hadoop Distributors

What is Hadoop? A scalable, fault-tolerant distributed system for data storage and processing. Core Hadoop has two main components. Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage; a reliable, redundant, distributed file system optimized for large files. MapReduce: fault-tolerant distributed processing; a programming model for processing sets of data by mapping inputs to outputs and reducing the output of multiple Mappers to one (or a few) answer(s); it operates on unstructured and structured data. Hadoop has a large and active ecosystem and is open source under the friendly Apache License. http://wiki.apache.org/hadoop/

Hadoop ecosystem: HDFS: Hadoop Distributed File System. YARN: Yet Another Resource Negotiator. MapReduce: programming-based data processing. Spark: in-memory data processing. Pig, Hive: query-based processing of data services. HBase: NoSQL database. Mahout, Spark MLlib: machine learning algorithm libraries. Solr, Lucene: searching and indexing. Zookeeper: managing the cluster. Oozie: job scheduling.

HDFS: HDFS is the primary or major component of the Hadoop ecosystem and is responsible for storing large data sets of structured or unstructured data across various nodes, maintaining the metadata in the form of log files. HDFS consists of two core components, the Name Node and the Data Nodes. The Name Node is the prime node, which contains metadata (data about data) and requires comparatively fewer resources than the Data Nodes that store the actual data. These Data Nodes are commodity hardware in the distributed environment, which undoubtedly makes Hadoop cost effective. HDFS maintains all the coordination between the clusters and hardware, thus working at the heart of the system.
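To make the NameNode/DataNode split concrete, here is a minimal sketch using Hadoop's Java FileSystem API to copy a local file into HDFS and list a directory. The NameNode address hdfs://namenode:9000 and the /data paths are placeholder assumptions for illustration, not values from these slides.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode URI; normally taken from core-site.xml (fs.defaultFS).
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // The client asks the NameNode for metadata; the file's blocks are written to DataNodes.
        fs.copyFromLocalFile(new Path("/tmp/sales.csv"), new Path("/data/sales.csv"));

        // Listing a directory is a pure metadata operation served by the NameNode.
        for (FileStatus status : fs.listStatus(new Path("/data"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}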

Apache HBase: It is a NoSQL database which supports all kinds of data and is thus capable of handling anything within a Hadoop database. It provides the capabilities of Google's BigTable and is therefore able to work on big data sets effectively. When we need to search for or retrieve a few small occurrences in a huge database, the request must be processed within a short span of time; at such times, HBase comes in handy, as it gives us a tolerant way of storing limited data.

YARN: Yet Another Resource Negotiator; as the name implies, YARN helps to manage the resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop system. It consists of three major components: Resource Manager, Node Manager, and Application Manager. The Resource Manager has the privilege of allocating resources for the applications in the system, whereas Node Managers handle the allocation of resources such as CPU, memory, and bandwidth per machine and later report back to the Resource Manager. The Application Manager works as an interface between the Resource Manager and the Node Managers and performs negotiations as per the requirements of the two.

MapReduce: By making use of distributed and parallel algorithms, MapReduce carries the processing logic across the cluster and helps developers write applications which transform big data sets into manageable ones. MapReduce uses two functions, Map() and Reduce(), whose tasks are as follows. Map() performs sorting and filtering of data, organizing it into groups; Map generates a key-value-pair-based result which is later processed by the Reduce() method. Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In short, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
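As a concrete illustration of Map() and Reduce(), here is a minimal word-count pair written against Hadoop's Java MapReduce API (the classic example, sketched here rather than taken from these slides): the Mapper emits a (word, 1) pair for every word it sees, and the Reducer sums the values grouped under each word after the shuffle and sort.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(): filter and organize the input into (key, value) pairs, here (word, 1).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // handed to the shuffle and sort phase
            }
        }
    }
}

// Reduce(): aggregate the mapped data, combining each key's values into a smaller set (a sum).
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));
    }
}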

Pig: Pig was developed by Yahoo. It works on the Pig Latin language, a query-based language similar to SQL. It is a platform for structuring the data flow and for processing and analyzing huge data sets. Pig does the work of executing commands, and in the background all the activities of MapReduce are taken care of. After the processing, Pig stores the result in HDFS. The Pig Latin language is specially designed for this framework, which runs on Pig Runtime, just the way Java runs on the JVM. Pig helps to achieve ease of programming and optimization and hence is a major segment of the Hadoop ecosystem.

Hive: With the help of an SQL-like methodology and interface, Hive performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language). It is highly scalable, as it allows both real-time and batch processing. All the SQL data types are supported by Hive, which makes query processing easier. Like other query-processing frameworks, Hive comes with two components: JDBC drivers and the Hive command line. JDBC, along with ODBC drivers, establishes the data storage permissions and connection, whereas the Hive command line helps in the processing of queries.
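To show how the JDBC driver side is typically used from Java, here is a minimal, hedged sketch that opens a HiveServer2 connection and runs a single HQL query. The host name hive-host, the port 10000, and the web_logs table are assumptions made for the example only.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; the connection details below are placeholders.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive-host:10000/default", "user", "");
             Statement stmt = conn.createStatement();
             // HQL looks like SQL; Hive turns it into batch jobs over data stored in HDFS.
             ResultSet rs = stmt.executeQuery(
                     "SELECT country, COUNT(*) FROM web_logs GROUP BY country")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}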

Mahout: Mahout brings machine-learning capability to a system or application. Machine learning, as the name suggests, helps a system develop itself based on patterns, user/environment interaction, or algorithms. Mahout provides libraries for collaborative filtering, clustering, and classification, which are core machine-learning techniques, and it allows these algorithms to be invoked as needed through its own libraries.

Apache Spark: It is a platform that handles process-intensive tasks like batch processing, interactive or iterative real-time processing, graph conversions, and visualization. It works on in-memory resources and is hence faster than MapReduce in terms of optimization. Spark is best suited for real-time data, whereas Hadoop MapReduce is best suited for structured data or batch processing, so both are used interchangeably in most companies.
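For contrast with the MapReduce word count shown earlier, the same computation in Spark's Java API runs as a chain of in-memory transformations. This is a minimal sketch under stated assumptions: the input sits at a placeholder HDFS path, and the job runs with a local master here (or YARN when submitted to a cluster).

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("SparkWordCount")
                .master("local[*]") // use YARN as the master when submitted to a cluster
                .getOrCreate();

        JavaRDD<String> lines = spark.read()
                .textFile("hdfs:///data/input.txt") // placeholder path
                .javaRDD();

        // Intermediate results stay in memory across stages instead of hitting disk between jobs.
        lines.flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
             .mapToPair(word -> new Tuple2<>(word, 1))
             .reduceByKey(Integer::sum)
             .collect()
             .forEach(t -> System.out.println(t._1() + "\t" + t._2()));

        spark.stop();
    }
}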

Data Management. Solr, Lucene: these are two services that perform the tasks of searching and indexing with the help of Java libraries; Lucene in particular is based on Java and also provides a spell-check mechanism, and Solr is built on top of Lucene. Zookeeper: there was a huge issue with managing coordination and synchronization among the resources or components of Hadoop, which often resulted in inconsistency; Zookeeper overcame these problems by performing synchronization, inter-component communication, grouping, and maintenance. Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding them together as a single unit. There are two kinds of jobs, Oozie workflow jobs and Oozie coordinator jobs: workflow jobs need to be executed in a sequentially ordered manner, whereas coordinator jobs are triggered when some data or an external stimulus is given to them.

High Level Hadoop Architecture

Main components of Hadoop. NameNode: controls the operation of data jobs. DataNode: writes data in blocks to local storage and replicates data blocks to other DataNodes; DataNodes are also rack-aware, since you would not want to replicate all your data to the same rack of servers, where an outage would cause you to lose all of it. SecondaryNameNode: despite its name, it does not take over when the primary NameNode goes offline; it maintains periodic checkpoints of the NameNode's metadata. JobTracker: sends MapReduce jobs to nodes in the cluster. TaskTracker: accepts tasks from the JobTracker.

Main components of Hadoop (continued). YARN: runs the YARN components ResourceManager and NodeManager. It is a resource manager that can also run as a stand-alone component to give other applications the ability to run in a distributed architecture; for example, you can use Apache Spark with YARN, or you could write your own program to use YARN, although that is complicated. Client Application: whatever program you have written, or some other client like Apache Pig; Apache Pig is an easy-to-use shell that takes SQL-like commands, translates them into Java MapReduce programs, and runs them on Hadoop. Application Master: runs shell commands in a container as directed by YARN.

HDFS Concepts: Sits on top of a native (ext3, xfs, etc.) file system. Performs best with a 'modest' number of large files. Files in HDFS are 'write once'. HDFS is optimized for large, streaming reads of files.

HDFS (Hadoop Distributed File System): Data is organized into files and directories. Files are divided into blocks, distributed across cluster nodes. Block placement is known at runtime by MapReduce, so computation is co-located with data. Blocks are replicated to handle failure, and checksums are used to ensure data integrity. Replication is the one and only strategy for error handling, recovery, and fault tolerance: the system is self-healing because it makes multiple copies.
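Replication and block size are ordinary HDFS settings; the sketch below shows one way they can be set from Java client code when writing a file. It is illustrative only: it assumes fs.defaultFS already points at a cluster, and the path and values are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // assumes fs.defaultFS points at the cluster
        conf.set("dfs.replication", "3");                              // keep 3 copies of every block
        conf.set("dfs.blocksize", String.valueOf(128L * 1024 * 1024)); // 128 MB blocks

        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/data/events.log"))) {
            out.writeBytes("sample record\n");
        }

        // Replication can also be changed per file after it has been written.
        fs.setReplication(new Path("/data/events.log"), (short) 2);
        fs.close();
    }
}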

Hadoop Server Roles: clients; masters (NameNode, Secondary NameNode, JobTracker); and slaves, each running a TaskTracker and a DataNode, scaling up to 4K nodes.

Cluster versus single node. When you first install Hadoop, for example to learn it, it runs on a single node. In production you would set it up to run in cluster mode, meaning data nodes are assigned to run on different machines. The whole set of machines is called the cluster. A Hadoop cluster can scale immensely to store petabytes of data.

Hadoop Cluster (diagram): racks of slave nodes, each running a DataNode and TaskTracker (DN, TT), connected over 1GbE/10GbE links to core switches, alongside the NameNode (NN), JobTracker (JT), Secondary NameNode (SNN), and client nodes; the cluster scales up to 4K nodes.

HDFS File Write Operation

HDFS File Read Operation

Functional Programming meets Distributed Processing

What is MapReduce? A method for distributing a task across multiple nodes, where each node processes data stored on that node. It consists of two developer-created phases: 1. Map and 2. Reduce. In between Map and Reduce is the Shuffle and Sort.

MapReduce provides: automatic parallelization and distribution, fault tolerance, status and monitoring tools, and a clean abstraction for programmers. (See the Google Technology RoundTable: MapReduce.)

MapReduce: two programs. Map means to take items, such as strings from a CSV file, and run an operation over every line in the file, for example splitting each line into a list of fields; those become (key -> value) pairs. Reduce groups these (key -> value) pairs and runs an operation to, for example, concatenate them into one string or sum them, producing (key -> sum).

Key MapReduce Terminology Concepts: A user runs a client program on a client computer. The client program submits a job to Hadoop. The job is sent to the JobTracker process on the Master Node. Each Slave Node runs a process called the TaskTracker. The JobTracker instructs TaskTrackers to run and monitor tasks. A task attempt is an instance of a task running on a slave node; there will be at least as many task attempts as there are tasks which need to be performed.

MapReduce: Basic Concepts. Each Mapper processes a single input split from HDFS. Hadoop passes the developer's Map code one record at a time; each record has a key and a value. Intermediate data is written by the Mapper to local disk. During the shuffle and sort phase, all values associated with the same intermediate key are transferred to the same Reducer. The Reducer is passed each key and a list of all its values. Output from the Reducers is written to HDFS.
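Tying these concepts together, a minimal driver (sketched here, assuming the WordCountMapper and WordCountReducer classes shown earlier) configures the job, points it at input splits in HDFS, and declares where the Reducer output should be written; the input and output paths are supplied on the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);   // one Mapper per input split
        job.setReducerClass(WordCountReducer.class); // fed by the shuffle and sort phase

        // Types of the (key, value) pairs the Reducer writes back to HDFS.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}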

MapReduce Operation: What was the max/min temperature for the last century?

Sample Dataset. The requirement: you need to find out, grouped by type of customer, how many of each type are in each country, with the name of the country listed in countries.dat appearing in the final result (and not the 2-digit country code). Each record has a key and a value. To do this you need to: join the data sets, key on country, count the types of customer per country, and output the results.

MapReduce Paradigm: Input -> Map -> Shuffle and Sort -> Reduce -> Output. The classic Unix analogy is the pipeline cat | grep | sort | uniq > output, with grep playing the role of Map and uniq the role of Reduce.

MapReduce Example. Problem: count the number of times each word appears in the following paragraph: "John has a red car, which has no radio. Mary has a red bicycle. Bill has no car or bicycle."
Map. Server 1 ("John has a red car, which has no radio"): John: 1, has: 2, a: 1, red: 1, car: 1, which: 1, no: 1, radio: 1. Server 2 ("Mary has a red bicycle"): Mary: 1, has: 1, a: 1, red: 1, bicycle: 1. Server 3 ("Bill has no car or bicycle"): Bill: 1, has: 1, no: 1, car: 1, or: 1, bicycle: 1.
Reduce. After the intermediate counts are grouped by key across the servers, the final result is: John: 1, has: 4, a: 2, red: 2, car: 2, which: 1, no: 2, radio: 1, Mary: 1, bicycle: 2, Bill: 1, or: 1.

Putting it all Together: MapReduce and HDFS (diagram): a client/developer submits a job to the JobTracker, which distributes Map and Reduce jobs to TaskTrackers that read a large data set (log files, sensor data) from the Hadoop Distributed File System (HDFS).

Hadoop Ecosystem Projects: Hadoop is a 'top-level' Apache project, created and managed under the auspices of the Apache Software Foundation. Several other projects exist that rely on some or all of Hadoop, typically either both HDFS and MapReduce, or just HDFS. Ecosystem projects include Hive, Pig, HBase, and many more. http://hadoop.apache.org/

Hadoop, SQL and MPP Systems: Hadoop is scale-out, uses key/value pairs and functional programming, and performs offline batch processing. Traditional SQL systems are scale-up, use relational tables and declarative queries, and handle online transactions. MPP systems are scale-out, use relational tables and declarative queries, and handle online transactions.

Comparing RDBMS and MapReduce. Data size: a traditional RDBMS handles gigabytes (terabytes); MapReduce handles petabytes (exabytes). Access: RDBMS is interactive and batch; MapReduce is batch. Updates: RDBMS reads/writes many times; MapReduce is write once, read many times. Structure: RDBMS uses a static schema; MapReduce a dynamic schema. Integrity: RDBMS is high (ACID); MapReduce is low. Scaling: RDBMS is nonlinear; MapReduce is linear. DBA ratio: 1:40 for RDBMS versus 1:3000 for MapReduce. (Reference: Tom White's Hadoop: The Definitive Guide.)

Diagnostics and Customer Churn. Issues: What make and model systems are deployed? Are certain set-top boxes in need of replacement based on system diagnostic data? Is there a correlation between the make, model, or vintage of a set-top box and customer churn? What are the most expensive boxes to maintain? Which systems should we pro-actively replace to keep customers happy? Big Data Solution: collect unstructured data from set-top boxes (multiple terabytes); analyze system data in Hadoop in near real time; pull data into Hive for interactive query and modeling. Analytics with Hadoop increases customer satisfaction.

Pay Per View Advertising. Issues: a fixed inventory of ad space is provided by national content providers; for example, 100 ads offered to a provider for one month of programming. The provider can use this space to advertise its products and services, such as pay per view. Do we advertise "The Longest Yard" in the middle of a football game or in the middle of a romantic comedy? A 10% increase in pay-per-view movie rentals = $10M in incremental revenue. Big Data Solution: collect programming data and viewer rental data in a large data repository; develop models to correlate proclivity to rent with programming format; find the most productive time slots and programs to advertise pay-per-view inventory. Improve ad placement and pay-per-view conversion with Hadoop.

Risk Modeling. A bank had customer data across multiple lines of business and needed to develop a better risk picture of its customers; i.e., if direct deposits stop coming into a checking account, it is likely that the customer lost his/her job, which impacts creditworthiness for other products (credit card, mortgage, etc.). Data existed in silos across multiple LOBs and acquired bank systems, and the data size approached 1 petabyte. Why do this in Hadoop? The ability to cost-effectively integrate more than 1 PB of data from multiple data sources: data warehouse, call center, chat, and email. A platform for more analysis with poly-structured data sources, i.e., combining bank data with credit bureau data, Twitter, etc. Offloading intensive computation from the DW.

Sentiment Analysis. Hadoop is used frequently to monitor what customers think of a company's products or services. Data is loaded from social media sources (Twitter, blogs, Facebook, emails, chats, etc.) into the Hadoop cluster. Map/Reduce jobs run continuously to identify sentiment (i.e., Acme Company's rates are "outrageous" or a "rip off"). Negative and positive comments can be acted upon (special offer, coupon, etc.). Why Hadoop? Social media and web data is unstructured, the amount of data is immense, and new data sources arise weekly.

This tutorial material is by the SNIA (Storage Networking Industry Association).

Resources: The Big Data Conversation
World Economic Forum: "Personal Data: The Emergence of a New Asset Class" (2011)
McKinsey Global Institute: "Big Data: The Next Frontier for Innovation, Competition, and Productivity"
"Big Data: Harnessing a Game-Changing Asset"
IDC: "2011 Digital Universe Study: Extracting Value from Chaos"
The Economist: "Data, Data Everywhere"
"Data Science Revealed: A Data-Driven Glimpse into the Burgeoning New Field"
O'Reilly: "What is Data Science?"
O'Reilly: "Building Data Science Teams"
O'Reilly: "Data for the Public Good"
Obama Administration: "Big Data Research and Development Initiative"