Big Data Analytics (22UAD501) - Unit 1: Database Revolutions & NoSQL (unit1.pptx)

Dr. N.G.P. Institute of Technology, Coimbatore-48 (An Autonomous Institution). 22UAD501 Big Data Analytics. Dr. S. Rajalakshmi, Professor / AI & DS.

COURSE OBJECTIVES: Outline the concepts of NoSQL. Elaborate the concepts of HDFS. Elucidate the concepts of MapReduce. Describe the advanced features of MapReduce. Study big data analytics tools such as Hive and HBase.

COURSE OUTCOMES. CO1: Describe the concepts of NoSQL with database revolutions. CO2: Explain the concepts of HDFS in Hadoop for storage. CO3: Examine the concepts of MapReduce for simple applications. CO4: Elucidate the advanced concepts in MapReduce. CO5: Use big data analytics tools such as Hive and HBase for data analysis.

Unit 1: DATABASE REVOLUTIONS & NOSQL. Three Database Revolutions: the first, second, and third database revolutions; the Big Data revolution; Google, pioneer of Big Data; the Hadoop open-source stack; the Hadoop ecosystem; scaling Web 2.0: sharding, CAP theorem, eventual consistency, Cassandra, Gossip, consistent hashing; MongoDB case study.

Interpret this!!!

What is Big Data? There are two common sources of data grouped under the banner of Big Data. First, we have a fair amount of data within the corporation. This includes emails, mainframe logs, blogs, and other data available inside the organization. Second, we are seeing a lot more data outside the corporation. This includes information available on social media sites, product literature freely distributed by competitors, and customer complaints posted on regulatory sites.

Examples of Big Data: social media text, cell phone locations, channel click information from set-top boxes, web browsing and search, product manuals, communications network events, call detail records (CDRs), Radio Frequency Identification (RFID) tags, maps, traffic patterns, weather data, mainframe logs, and many more.

Types of Big Data. Data: relational data (tables/transactions/legacy data); text data (web); semi-structured data (XML); graph data (social networks, Semantic Web/RDF, ...); streaming data (you can only scan the data once). Big Data sources: web and social media, machine-to-machine (M2M), big transaction data, biometrics, human-generated data.

THE FOUR Vs

Conversion Factor: 1 bit = binary digit; 8 bits = 1 byte; 1024 bytes = 1 kilobyte; 1024 kilobytes = 1 megabyte; 1024 megabytes = 1 gigabyte; 1024 gigabytes = 1 terabyte; 1024 terabytes = 1 petabyte; 1024 petabytes = 1 exabyte; 1024 exabytes = 1 zettabyte; 1024 zettabytes = 1 yottabyte; 1024 yottabytes = 1 brontobyte; 1024 brontobytes = 1 geopbyte.
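
Since each step of this ladder is a factor of 1024 (2^10), the whole table can be reproduced with one line of arithmetic per unit. A minimal Python sketch (brontobyte and geopbyte are informal names carried over from the slide, not standardized prefixes):

units = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte", "brontobyte", "geopbyte"]

for power, unit in enumerate(units):
    # 1 <unit> = 1024^power bytes, because each unit is 1024 times the previous one
    print(f"1 {unit} = 1024^{power} bytes = {1024 ** power:,} bytes")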

Volume. Wal-Mart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes (2560 terabytes) of data, the equivalent of 167 times the information contained in all the books in the US Library of Congress. Facebook handles 50 billion photos from its user base. The FICO Falcon Credit Card Fraud Detection System protects 2.1 billion active accounts worldwide. The volume of business data worldwide, across all companies, is estimated to double every 1.2 years.

Velocity has two major aspects: throughput (corporations seek bigger pipes and parallel processing) and latency (data-in-motion handled with reduced latency).

Variety. Data is compiled from a variety of sources; the source data includes unstructured text, sound, and video in addition to structured data.

Veracity represents both the credibility of the data source and the suitability of the data for the target audience.

Data Model

Data Sources: Internet data (i.e., clickstream, social media, social networking links); primary research (i.e., surveys, experiments, observations); secondary research (i.e., competitive and marketplace data, industry reports, consumer data, business data); location data (i.e., mobile device data, geospatial data); image data (i.e., video, satellite images, surveillance); supply chain data (i.e., EDI, vendor catalogs and pricing, quality information); device data (i.e., sensors, PLCs, RF devices, LIMs, telemetry).

Big Data Analytics. Big Data analytics uses a wide variety of advanced analytics to deliver deeper insights, broader insights, and frictionless actions.

Analytics Spectrum

A sample IBM platform

IBM

Parallel nodes

Facets of data. The main categories of data are these: ■ Structured ■ Unstructured ■ Natural language ■ Machine-generated ■ Graph-based ■ Audio, video, and images ■ Streaming

Three Database Revolutions: the first, second, and third database revolutions.

The First Database Revolution

The Second Database Revolution: relational theory, transaction models, the first relational databases, client-server computing, and object-oriented programming and the OODBMS.


The Third Database Revolution: Google and Hadoop, cloud computing, document databases, and the "NewSQL".

The Big Data Revolution

Google: Pioneer of Big Data. Google was first created in 1996. Google hardware; the Google software stack: • Google File System (GFS): a distributed cluster file system that allows all of the disks within the Google data center to be accessed as one massive, distributed, redundant file system. • MapReduce: a distributed processing framework for parallelizing algorithms across large numbers of potentially unreliable servers, capable of dealing with massive datasets. • BigTable: a nonrelational database system that uses the Google File System for storage.

The MapReduce framework operates on <key, value> pairs. A MapReduce job: (Input) <k1, v1> → map → <k2, v2> → reduce → <k3, v3> (Output). Map tasks perform splitting and mapping; reduce tasks perform shuffling and reducing.
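
To make the <k1, v1> → map → <k2, v2> → reduce → <k3, v3> flow concrete, here is a minimal single-process Python sketch of a word-count job. It only simulates what the framework does (split, map, shuffle by key, reduce); a real Hadoop job would express the same logic as Mapper and Reducer classes or through Hadoop Streaming.

from collections import defaultdict

def map_phase(doc_id, text):
    # (k1, v1) = (doc_id, text)  ->  list of (k2, v2) = (word, 1)
    return [(word.lower(), 1) for word in text.split()]

def shuffle(mapped_pairs):
    # Group all values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, counts):
    # (k2, [v2, ...]) -> (k3, v3) = (word, total count)
    return word, sum(counts)

docs = {1: "big data needs big storage", 2: "map reduce splits big jobs"}
mapped = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
results = [reduce_phase(word, counts) for word, counts in shuffle(mapped).items()]
print(sorted(results))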

Hadoop: Open-Source Google Stack. Hadoop's architecture: YARN (Yet Another Resource Negotiator or, recursively, YARN Application Resource Negotiator).

Hadoop Ecosystem. It is a platform or suite which provides various services to solve big data problems. It includes Apache projects and various commercial tools and solutions. There are four major elements of Hadoop: HDFS, MapReduce, YARN, and Hadoop Common utilities. Most of the other tools or solutions are used to supplement or support these major elements, and all of them work collectively to provide services such as ingestion, analysis, storage, and maintenance of data. The following components collectively form the Hadoop ecosystem: HDFS: Hadoop Distributed File System; YARN: Yet Another Resource Negotiator; MapReduce: programming-based data processing; Spark: in-memory data processing; Pig, Hive: query-based processing of data services; HBase: NoSQL database; Mahout, Spark MLlib: machine learning algorithm libraries; Solr, Lucene: searching and indexing; ZooKeeper: managing the cluster; Oozie: job scheduling.


HBase and Hive

HDFS: HDFS is the primary component of the Hadoop ecosystem and is responsible for storing large data sets of structured or unstructured data across various nodes, maintaining the metadata in the form of log files. HDFS consists of two core components: the Name Node and the Data Nodes. The Name Node is the prime node that contains metadata (data about data) and requires comparatively fewer resources than the Data Nodes that store the actual data. The Data Nodes are commodity hardware in the distributed environment, which undoubtedly makes Hadoop cost effective. HDFS maintains all the coordination between the clusters and hardware, working at the heart of the system. YARN: Yet Another Resource Negotiator, as the name implies, helps to manage the resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop system. It consists of three major components: the Resource Manager, the Node Managers, and the Application Manager.

MapReduce: By making use of distributed and parallel algorithms, MapReduce makes it possible to carry over the processing logic and helps to write applications which transform big data sets into manageable ones. MapReduce uses two functions, Map() and Reduce(), whose tasks are as follows. Map() performs sorting and filtering of data, organizing it into groups; Map generates key-value pair results which are later processed by the Reduce() method. Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In short, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples. PIG: Pig was developed by Yahoo and works on Pig Latin, a query-based language similar to SQL. It is a platform for structuring the data flow and for processing and analyzing huge data sets. Pig does the work of executing commands, and in the background all the activities of MapReduce are taken care of. After the processing, Pig stores the result in HDFS. The Pig Latin language is specially designed for this framework and runs on Pig Runtime, just the way Java runs on the JVM. Pig helps to achieve ease of programming and optimization and hence is a major segment of the Hadoop ecosystem. HIVE: With the help of an SQL-like methodology and interface, Hive performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language). It is highly scalable, as it allows both real-time and batch processing. Also, all the SQL data types are supported by Hive, making query processing easier. Like other query-processing frameworks, Hive comes with two components: JDBC Drivers and the Hive Command Line. JDBC, along with ODBC drivers, works on establishing the data storage permissions and connection, whereas the Hive command line helps in the processing of queries.

Mahout: Mahout adds machine learnability to a system or application. Machine learning, as the name suggests, helps the system to develop itself based on patterns, user/environmental interaction, or algorithms. Mahout provides various libraries and functionalities such as collaborative filtering, clustering, and classification, which are concepts of machine learning, and it allows invoking algorithms as per our need with the help of its own libraries. Apache Spark: a platform that handles all the processing-intensive tasks like batch processing, interactive or iterative real-time processing, graph conversions, and visualization. It works with in-memory resources, making it faster than earlier systems in terms of optimization. Spark is best suited for real-time data, whereas Hadoop is best suited for structured data or batch processing; hence both are used in most companies. Apache HBase: a NoSQL database which supports all kinds of data and is thus capable of handling anything stored in a Hadoop database. It provides the capabilities of Google's BigTable and is thus able to work on big data sets effectively. When we need to search for or retrieve the occurrences of something small in a huge database, the request must be processed within a short span of time. At such times HBase comes in handy, as it gives us a tolerant way of storing and looking up limited data.

Hadoop Architecture. Hadoop is a framework written in Java that utilizes a large cluster of commodity hardware to maintain and store big data. Hadoop works on the MapReduce programming algorithm that was introduced by Google. Today many big-brand companies use Hadoop in their organizations to deal with big data, e.g., Facebook, Yahoo, Netflix, eBay, etc. The Hadoop architecture mainly consists of 4 components: MapReduce, HDFS (Hadoop Distributed File System), YARN (Yet Another Resource Negotiator), and Common Utilities or Hadoop Common.

Scaling Web 2.0

Sharding. Sharding allows a logical database to be partitioned across multiple physical servers; a shard key determines which server holds a given row, as sketched below.
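
A minimal sketch of hash-based shard routing, with made-up server names and a hypothetical customer-id shard key: the router hashes the shard key and always sends reads and writes for that key to the same physical server.

import hashlib

# Hypothetical physical servers holding the shards of one logical database.
SHARDS = ["db-server-1", "db-server-2", "db-server-3", "db-server-4"]

def shard_for(shard_key: str) -> str:
    # Hash the shard key (e.g., a customer id) and route it to one server.
    digest = hashlib.md5(shard_key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("customer:1042"))  # every lookup for this key goes to the same server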

CAP Theorem: a distributed system can provide at most two of Consistency, Availability, and Partition tolerance at the same time. Eventual Consistency: Amazon's Dynamo, an alternative nonrelational system that had been developed internally to address the requirements of Amazon's massive online website, popularized the eventual consistency model, in which replicas may diverge temporarily but converge to the same value over time.


Consistent Hashing. When we hash a key value, we perform a mathematical computation on the key value and use that computed value to determine where to store the data. One reason to use hashing is to distribute the data evenly across a certain number of slots. If we want to hash any number into 10 buckets, we can use modulo 10; key 27 would then map to bucket 7, key 32 to bucket 2, key 25 to bucket 5, and so on. Using this method, we could map keys evenly across 10 servers.
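
The modulo-10 scheme above has a drawback: if the number of servers changes, almost every key maps to a new bucket. Consistent hashing addresses this by placing both nodes and keys on a circular hash space and storing each key on the first node clockwise from its position, so adding or removing a node only moves the keys in one segment of the ring. A minimal sketch (node names and the choice of MD5 are illustrative only):

import bisect
import hashlib

def ring_position(value: str) -> int:
    # Map any string onto a fixed circular hash space.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes):
        self.ring = sorted((ring_position(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        # First node clockwise from the key's position (wrapping around the ring).
        positions = [pos for pos, _ in self.ring]
        idx = bisect.bisect(positions, ring_position(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("key27"), ring.node_for("key32"), ring.node_for("key25"))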

Tunable Consistency. Dynamo allows the application to choose the level of consistency applied to specific operations. NWR notation describes how Dynamo trades off consistency, read performance, and write performance: • N is the number of copies of each data item that the database will maintain. • W is the number of copies of the data item that must be written before the write can complete. • R is the number of copies that the application will access when reading the data item.
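
A worked example of the NWR trade-off (the numbers are illustrative): whenever R + W > N, every read quorum overlaps every write quorum, so a read is guaranteed to see the most recently acknowledged write; a smaller W favors write latency and a smaller R favors read latency, at the cost of possibly stale reads.

def overlapping_quorums(n: int, w: int, r: int) -> bool:
    # Read and write quorums are guaranteed to overlap whenever R + W > N.
    return r + w > n

# N = 3 copies of each data item:
print(overlapping_quorums(n=3, w=2, r=2))  # True  -> reads see the latest acknowledged write
print(overlapping_quorums(n=3, w=1, r=1))  # False -> reads may be stale (eventual consistency)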

Cassandra: a number of open-source systems have implemented the Dynamo model. Gossip. In HBase and MongoDB we encountered the concept of master nodes: nodes which have a specialized supervisory function, coordinate the activities of other nodes, and record the current state of the database cluster. In Cassandra and other Dynamo-style databases there are no specialized master nodes: every node is equal, and every node is capable of performing any of the activities required for cluster operation. Nodes in Cassandra do, however, have short-term specialized responsibilities. For instance, when a client performs an operation, a node will be allocated as the coordinator for that operation. When a new member is added to the cluster, a node will be nominated as the seed node from which the new node will seek information.

Cassandra. Column-family databases store data in column families as rows that have many columns associated with a row key. Column families are groups of related data that are often accessed together. Cassandra is one of the popular column-family databases; there are others, such as HBase, Hypertable, and Amazon DynamoDB. Cassandra can be described as fast and easily scalable, with write operations spread across the cluster. The cluster does not have a master node, so any read or write can be handled by any node in the cluster.

Cassandra's Data Model

Column Family. A column family is a container for an ordered collection of rows; each row, in turn, is an ordered collection of columns. Example column: { name: "fullName", value: "Martin Fowler", timestamp: 12345667890 }. This column has a key of fullName and a value of Martin Fowler, with a timestamp attached to it. A row is a collection of columns attached or linked to a key; a collection of similar rows makes a column family. When the columns in a column family are simple columns, the column family is known as a standard column family.
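
One way to picture this data model is as a two-level map, sketched here with plain Python dictionaries purely for illustration (this is not Cassandra's storage format): the column family maps a row key to that row's columns, each column carries its own value and timestamp, and different rows may hold different columns.

import time

# column family = { row key -> { column name -> (value, timestamp) } }
people = {
    "row-1": {
        "fullName": ("Martin Fowler", time.time()),
        "location": ("Boston", time.time()),
    },
    "row-2": {  # a different row can carry a different set of columns
        "fullName": ("Ada Lovelace", time.time()),
        "profession": ("Mathematician", time.time()),
    },
}

value, timestamp = people["row-1"]["fullName"]
print(value, timestamp)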

Difference between a column family and a table in a relational database. Relational table: the schema in a relational model is fixed; once we define certain columns for a table, every inserted row must supply a value (at least null) for every column, and relational tables define only columns while the user fills in the table with values. Cassandra column family: although the column families are defined, the columns are not; you can freely add any column to any column family at any time, and a table contains columns or can be defined as a super column family.

Column Family attributes. A Cassandra column family has the following attributes: keys_cached: the number of locations to keep cached per SSTable. rows_cached: the number of rows whose entire contents will be cached in memory. preload_row_cache: specifies whether you want to pre-populate the row cache.

Cassandra column family

Super Column. A super column is a special column; it is also a key-value pair, but a super column stores a map of sub-columns. Example insert: Insert into University.Student (RollNo, Name, dept, Semester) values (2, 'Michael', 'CS', 2);

Update Data. Syntax: Update KeyspaceName.TableName Set ColumnName1 = new Column1Value, ColumnName2 = new Column2Value, ColumnName3 = new Column3Value, . . . Where ColumnName = ColumnValue;

Delete Command. Delete from University.Student where rollno = 1;

Limitations. The Cassandra Query Language (CQL) has the following limitations: CQL does not support aggregation queries like max, min, avg. CQL does not support group by or having queries. CQL does not support joins. CQL does not support OR queries. CQL does not support wildcard queries. CQL does not support union or intersection queries. Table columns cannot be filtered without creating an index. Greater-than (>) and less-than (<) queries are only supported on clustering columns. Because of these limitations, Cassandra Query Language is not well suited for analytics purposes.

Keyspace in Cassandra. Create keyspace KeyspaceName with replication = {'class': 'StrategyName', 'replication_factor': NoOfReplicasOnDifferentNodes}; Various components of a Cassandra keyspace: Strategy: there are two kinds of replica placement strategies in Cassandra syntax. Simple Strategy: used when you have just one data center. In this strategy, the first replica is placed on the node selected by the partitioner; the remaining replicas are placed on the next nodes in the clockwise direction in the ring without considering rack or node location. Network Topology Strategy: used when you have more than one data center. In this strategy, you have to provide a replication factor for each data center separately. Network topology strategy places replicas on nodes in the clockwise direction in the same data center, and it attempts to place replicas in different racks. Replication Factor: the number of replicas of data placed on different nodes. To tolerate node failures, 3 is a good replication factor.

Create Keyspace. Create keyspace University with replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

Alter Keyspace. Syntax: Alter keyspace KeyspaceName with replication = {'class': 'StrategyName', 'replication_factor': NoOfReplicasOnDifferentNodes} and DURABLE_WRITES = true/false; Example: Alter keyspace University with replication = {'class': 'NetworkTopologyStrategy', 'DataCenter1': 1};

Key aspects while altering a keyspace in Cassandra. Keyspace name: the keyspace name cannot be altered in Cassandra. Strategy name: the strategy can be altered by specifying a new strategy name. Replication factor: the replication factor can be altered by specifying a new replication factor. DURABLE_WRITES: its value can be altered by specifying true/false; by default it is true. If set to false, no updates will be written to the commit log, and vice versa.

Drop Keyspace. Syntax: Drop keyspace KeyspaceName; Example: Drop keyspace University;

Create table (column family) in Cassandra. Syntax: Create table KeyspaceName.TableName (ColumnName DataType, ColumnName DataType, ColumnName DataType, . . ., Primary key (ColumnName)) with PropertyName = PropertyValue;

Alter Table in Cassandra. Syntax: Alter table KeyspaceName.TableName followed by one of: Alter ColumnName TYPE ColumnDataType | Add ColumnName ColumnDataType | Drop ColumnName | Rename ColumnName To NewColumnName | With PropertyName = PropertyValue;

Drop and truncate table. Drop Table KeyspaceName.TableName; Truncate KeyspaceName.TableName;
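
As a usage sketch only, the CQL statements above can be issued from Python with the DataStax driver (pip install cassandra-driver), assuming a Cassandra node is listening on localhost; the keyspace, table, and sample row mirror the University.Student example used in these slides.

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # connect to a local Cassandra node
session = cluster.connect()

# Keyspace with the SimpleStrategy / replication_factor = 3 settings shown above
# (a single localhost node will still accept this, although only one replica exists).
session.execute(
    "CREATE KEYSPACE IF NOT EXISTS university "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}"
)
session.set_keyspace("university")

session.execute(
    "CREATE TABLE IF NOT EXISTS student ("
    "rollno int PRIMARY KEY, name text, dept text, semester int)"
)
session.execute(
    "INSERT INTO student (rollno, name, dept, semester) VALUES (2, 'Michael', 'CS', 2)"
)

for row in session.execute("SELECT rollno, name, dept FROM student"):
    print(row.rollno, row.name, row.dept)

cluster.shutdown()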

How does Cassandra achieve coordination between nodes? Consider an example with six nodes in a cluster, numbered one through six, where node number three is down. How will Cassandra behave in such a situation? Gossip is a peer-to-peer communication protocol in which nodes periodically exchange state information about themselves and about other nodes they know about. What is the Gossip protocol? Gossip is the message system that Cassandra nodes and virtual nodes use to make their data consistent with each other, and it is used to enforce the replication factor in a cluster. Imagine a Cassandra cluster as a ring system where each node contains a certain partition of each table in the database and can only communicate with adjacent nodes. A toy simulation of the idea is sketched below.
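
A toy gossip round in Python (a simplification for intuition, not Cassandra's actual implementation): every node keeps a map of the latest heartbeat version it has heard for each node, and in each round it bumps its own heartbeat and exchanges its map with one random peer, keeping the higher version for every entry. After a few rounds all live nodes converge on the same cluster view, even though no node talked to every other node directly.

import random

# node -> its current view of the cluster: {node_name: heartbeat_version}
views = {name: {name: 0} for name in ["n1", "n2", "n3", "n4", "n5", "n6"]}

def merge(target, source):
    # Keep the highest heartbeat version seen for every node.
    for node, beat in source.items():
        target[node] = max(target.get(node, -1), beat)

def gossip_round():
    for name, view in views.items():
        view[name] += 1                                   # bump own heartbeat
        peer = random.choice([n for n in views if n != name])
        merge(views[peer], view)                          # push state to one random peer
        merge(view, views[peer])                          # and pull the peer's state back

for _ in range(5):
    gossip_round()

print(views["n1"])  # n1 now knows a recent heartbeat for every node it has heard about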

Consistent hashing allows distribution of data across a cluster to minimize reorganization when nodes are added or removed. Consistent hashing partitions data based on the partition key.
