BDA_UNIT_II

SrikanthYadav578790 7 views 176 slides Nov 01, 2025


Overview of the Subject and Syllabus
Unit-I: Introduction to big data, Big data analytics
Unit-II: Introduction to Hadoop
Unit-III: Introduction to MapReduce Programming
Unit-IV: Introduction to Hive, Introduction to Pig
Unit-V: Introduction to Spark

Syllabus
Introduction to big data: Data, Characteristics of data and Types of digital data, Sources of data, Working with unstructured data, Evolution and Definition of big data, Characteristics and Need of big data, Challenges of big data
Big data analytics: Overview of business intelligence, Data science and Analytics, Meaning and Characteristics of big data analytics, Need of big data analytics, Classification of analytics, Challenges to big data analytics, Importance of big data analytics, Basic terminologies in big data environment

Learning Objectives Structured data: Sources of structured data, ease with structured data, etc. Semi-Structured data: Sources of semi-structured data, characteristics of semi-structured data. Unstructured data: Sources of unstructured data, issues with terminology, dealing with unstructured data.

Learning Outcomes To differentiate between structured, semi-structured and unstructured data. To understand the need to integrate structured, semi-structured and unstructured data.

Agenda Types of Digital Data Structured Sources of structured data Ease with structured data Semi-Structured Sources of semi-structured data Unstructured Sources of unstructured data Issues with terminology Dealing with unstructured data

Classification of Digital Data Digital data is classified into the following categories: Structured data Semi-structured data Unstructured data

Approximate percentage distribution of digital data

Structured Data This is the data which is in an organized form (e.g., in rows and columns) and can be easily used by a computer program. Relationships exist between entities of data, such as classes and their objects. Data stored in databases is an example of structured data.

Sources of Structured Data
Databases: Oracle, DB2, Teradata, MySQL, PostgreSQL
Spreadsheets
OLTP systems

Examples of Structured Data
Column Name    Data Type      Constraints
EmpNo          Varchar(10)    PRIMARY KEY
EmpName        Varchar(50)
Designation    Varchar(20)    NOT NULL
DeptNo         Varchar(10)
ContactNo      Varchar(10)    NOT NULL
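The column definitions above map directly onto a relational table. The following is a minimal sketch using Python's built-in sqlite3 module; the table name Employee and the sample rows are illustrative assumptions that do not appear in the slides.

```python
import sqlite3

# In-memory database; the table name and sample rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Employee (
        EmpNo       VARCHAR(10) PRIMARY KEY,
        EmpName     VARCHAR(50),
        Designation VARCHAR(20) NOT NULL,
        DeptNo      VARCHAR(10),
        ContactNo   VARCHAR(10) NOT NULL
    )
    """
)

# A row that satisfies every constraint is accepted.
conn.execute(
    "INSERT INTO Employee VALUES (?, ?, ?, ?, ?)",
    ("E101", "Asha", "Analyst", "D10", "9000000001"),
)


def try_insert(row):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO Employee VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


# A missing Designation violates NOT NULL; a repeated EmpNo violates PRIMARY KEY.
missing_designation = try_insert(("E102", "Ravi", None, "D10", "9000000002"))
duplicate_key = try_insert(("E101", "Kiran", "Manager", "D20", "9000000003"))
row_count = conn.execute("SELECT COUNT(*) FROM Employee").fetchone()[0]
print(missing_designation, duplicate_key, row_count)
```

Both invalid rows are rejected by the database itself, which is one reason structured data is easy for programs to work with: the schema enforces the shape of every record.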

Ease with Structured Data
Insert/Update/Delete: DML operations
Security: encryption, sensitivity
Indexing/Searching
Scalability
Transaction processing: Atomicity, Consistency, Isolation, Durability (ACID)
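Atomicity, the first of the transaction properties listed above, can be illustrated with a small sketch. It uses Python's built-in sqlite3 module; the account table, its CHECK constraint and the amounts are hypothetical, chosen only to force a failure halfway through a transfer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()


def transfer(amount):
    """Move `amount` from A to B as one atomic unit; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = 'B'",
                (amount,),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = 'A'",
                (amount,),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances():
    """Return the current balances as a dict keyed by account name."""
    return dict(conn.execute("SELECT name, balance FROM account ORDER BY name"))


ok_small = transfer(30)    # both updates succeed and are committed together
ok_large = transfer(500)   # the debit would make A negative, so the credit is undone
print(ok_small, ok_large, balances())
```

The failed transfer leaves no trace: the credit to B that ran before the failing debit is rolled back along with it.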

Semi-structured Data This is data that does not conform to a data model but has some structure. However, it is not in a form that can be used easily by a computer program. Examples: emails, XML, and markup languages such as HTML. Metadata for this data is available but is not sufficient.

Sources of Semi-structured Data
XML (eXtensible Markup Language)
JSON (JavaScript Object Notation)
Other markup languages
Example HTML program:
<HTML>
<HEAD> <TITLE>Welcome to BDA</TITLE> </HEAD>
<BODY BGCOLOR="FFFFFF"> </BODY>
</HTML>

Sample JSON document
{ "_id": 9, "BookTitle": "Fundamentals of Business Analytics", "AuthorName": "Seema Acharya", "Publisher": "Wiley India", "YearofPublication": "2011" }
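To show why such a document counts as semi-structured, here is a short sketch that parses the sample with Python's standard json module. The document is written as strict JSON (quoted keys, straight quotes), and the Edition field is a hypothetical key used only to show how a missing field can be handled.

```python
import json

# The sample document, written as strict JSON (quoted keys, straight quotes).
raw = """
{
  "_id": 9,
  "BookTitle": "Fundamentals of Business Analytics",
  "AuthorName": "Seema Acharya",
  "Publisher": "Wiley India",
  "YearofPublication": "2011"
}
"""
book = json.loads(raw)

# Fields are self-describing, so they can be read by name without a fixed schema.
title = book["BookTitle"]
year = int(book["YearofPublication"])      # stored as a string in the sample
edition = book.get("Edition", "unknown")   # hypothetical field, absent here
print(title, year, edition)
```

The tags carry enough structure to look fields up by name, but nothing guarantees that the next document has the same fields or types, which is the gap a schema would fill in structured data.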

Characteristics of Semi-structured Data

Unstructured Data This is data that does not conform to a data model or is not in a form that can be used easily by a computer program. About 80-90% of an organization's data is in this format. Examples: memos, chat rooms, PowerPoint presentations, images, videos, letters, research, white papers, the body of an email, etc.

Unstructured Data

Sources of Unstructured Data Webpages Images Audios Videos Body of email Text Messages Chats Social media data Word document

How to Deal with Unstructured Data

How to Deal with Unstructured Data
Data mining: Knowledge discovery in databases. Popular mining algorithms include association rule mining, regression analysis, and collaborative filtering.
NLP (Natural Language Processing): Related to human-computer interaction (HCI). It is about enabling computers to understand human or natural language input.

How to Deal with Unstructured Data
Text mining: The process of gleaning high-quality and meaningful information from text. It includes tasks such as text categorization, text clustering, sentiment analysis and concept/entity extraction.
Noisy text analysis: The process of extracting structured or semi-structured information from noisy unstructured data such as chats, blogs, wikis and emails, which contain spelling mistakes, abbreviations, fillers (uh, hm) and non-standard words.
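The two ideas above can be sketched together in a few lines: first clean up a noisy chat message, then apply a very rough sentiment score. The word lists and the sample message are illustrative assumptions; real text-mining systems rely on far larger dictionaries or trained models.

```python
import re
from collections import Counter

# Illustrative lookup tables; a real system would use much larger resources.
FILLERS = {"uh", "hm", "hmm", "um"}
ABBREVIATIONS = {"u": "you", "gr8": "great", "pls": "please", "thx": "thanks"}
POSITIVE = {"great", "good", "thanks"}
NEGATIVE = {"bad", "slow", "broken"}


def normalize(text):
    """Lower-case, tokenize, drop fillers and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [ABBREVIATIONS.get(t, t) for t in tokens if t not in FILLERS]


def sentiment(tokens):
    """Very rough polarity: count of positive words minus negative words."""
    counts = Counter(tokens)
    return sum(counts[w] for w in POSITIVE) - sum(counts[w] for w in NEGATIVE)


chat = "uh the app is gr8 but hm login is slow, pls fix... thx"
tokens = normalize(chat)
score = sentiment(tokens)
print(tokens, score)
```

The normalization step is the noisy-text part; the scoring step is a toy stand-in for the sentiment analysis task named under text mining.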

Summary and Topics Covered Types of Digital data, Structured data, Semi-structured data and Unstructured data Sources of digital data Dealing with various types of digital data Examples of Structured data Examples of Semi-structured data Examples of Unstructured data

Answer Me
Which category (structured, semi-structured, or unstructured) will you place a web page in?
Which category (structured, semi-structured, or unstructured) will you place a Word document in?
State a few examples of human-generated and machine-generated data.

Answer Me
Place the following in a suitable basket: Email, MS Access, Images, Database, Chat conversations, Relations/Tables, Facebook, Video, MS Excel, XML

Answer Me
Structured: MS Access, Database, Relations/Tables, MS Excel
Unstructured: Email, Images, Chat conversations, Facebook, Videos
Semi-structured: XML

Thank You

Department of Information Technology BIG DATA ANALYTICS (16IT445) IV B.Tech (IT) I Semester 2020-21 I SEMESTER UNIT-I Name of the Faculty : Mr. Srikanth Yadav. M Assistant Professor, Department of IT VFSTR (Deemed to be University)

Learning Objectives
Definition of big data
Challenges of big data
Learning Outcomes
To understand the significance of big data
To understand the challenges of big data

Agenda Introduction to Big Data Characteristics of DATA Evolution of Big Data Definition of Big Data Challenges with Big Data What is Big Data Volume Velocity Variety Sources of Big Data Other characteristics of Big Data Why Big Data

Characteristics of DATA Composition - deals with the structure of the data Sources of data Types of data

Characteristics of DATA Condition – deals with the state of the data Can one use this data as is for analysis? Does it require cleansing for enhancement?

Characteristics of DATA Context - deals with Where has this data been generated? Why was this data generated? How sensitive is this data?

Characteristics of DATA

Evolution of Big Data

Definition of Big Data- Gartner High-volume High-velocity High-variety Cost-effective, innovative forms of information processing Enhanced insight Decision making

Definition of Big Data Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.

Challenges with Big Data

Challenges with Big Data
Capture: large size, high tide of data
Storage: cloud computing and virtualization
Curation
Preserving: collecting and taking care of research data
Sharing: revealing data’s potential across domains
Discovering: promoting the re-use and new combinations of data

Challenges with Big Data
Search
Dearth (lack) of skilled professionals
Analysis: tools required (Hadoop, Pig, Hive)
Transfer
Visualization: business visualization experts are needed
Privacy violations

What is Big Data Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

What is Not Big Data Big data is a collection of large datasets that cannot be adequately processed using traditional processing techniques. Big data is not only data; it has become a complete subject that involves various tools, techniques and frameworks.

Data: Big in Volume, Variety and Velocity

Sources of Big Data
Internal Data Sources
Data Storage: File systems, SQL, NoSQL, etc.
Archives: Scanned documents, Paper archives, etc.
External Data Sources
Public Web: Wikipedia, weather, Census, etc.

Sources of Big Data
Both (Internal + External)
Sensor data: Car sensor, Traffic sensor, etc.
Machine Log Data: Application logs, Event logs
Social media: Twitter, FB, YouTube, etc.
Business Apps: ERP, CRM, HR, Google Docs
Media: Audio, Video, Images, etc.
Docs: CSV, PDF, XLS, PPT, etc.

Sources of Big Data

Other Characteristics of Big Data

Why Big Data

Topics Discussed Introduction to Big Data Characteristics of DATA Evolution of Big Data Definition of Big Data Challenges with Big Data What is Big Data Volume Velocity Variety Sources of Big Data Other characteristics of Big Data Why Big Data

Fill in the blanks Big data is high-volume, high-velocity, and high-variety information assets that demand ------------, ------------------- forms of information processing for enhanced ----------------------- and --------- -------------------.

Fill in the blanks Cost effective Innovative Insight Decision making

QUIZ
What is big data?
What are the characteristics of data?
What are the 3 Vs of big data?
What are the applications of big data?
What are the challenges of big data?

Thank you

BIG DATA ANALYTICS (16IT445) IV B.Tech (IT) I Semester 2020-21 I SEMESTER Department of Information Technology Name of the Faculty : Mr. Srikanth Yadav. M Assistant Professor, Department of IT VFSTR (Deemed to be University)

Learning Objectives
What is data science?
What is big data analytics?
Learning Outcomes
To understand the significance of big data analytics.
To understand the role of a data scientist.

Agenda Traditional Business Intelligence (BI) versus BIG DATA A Typical Data Warehouse Environment A Typical Hadoop Environment What is new- Coexistence of Big Data and Data Warehouse QUIZ

Traditional Business Intelligence Vs Big Data
1. In a traditional BI environment, all the enterprise’s data is housed in a central server, whereas in a big data environment data resides in a distributed file system. The distributed file system scales horizontally, by scaling in or out, whereas a typical database server scales vertically.
2. In traditional BI, data is generally analyzed in offline mode, whereas in big data it is analyzed in both real-time and offline modes.

Traditional Business Intelligence Vs Big Data
3. Traditional BI is about structured data, and the data is taken to the processing functions (move data to code).
4. Big data is about variety: structured, semi-structured and unstructured data, and here the processing functions are taken to the data (move code to data).
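The difference between "move data to code" and "move code to data" can be sketched with a toy simulation. The node names and values below are made up, and plain Python lists stand in for partitions of a distributed file system; the point is only to compare how much has to travel over the network in each style.

```python
# Each "node" holds one partition of the data (values are illustrative).
partitions = {
    "node-1": [12, 7, 3],
    "node-2": [5, 9],
    "node-3": [20, 1, 4, 6],
}


def move_data_to_code(parts):
    """Traditional style: ship every record to one central place, then compute."""
    central = [value for rows in parts.values() for value in rows]
    records_moved = len(central)
    return sum(central), records_moved


def move_code_to_data(parts, local_fn, combine_fn):
    """Big data style: run local_fn where each partition lives, ship only results."""
    partials = [local_fn(rows) for rows in parts.values()]
    results_moved = len(partials)
    return combine_fn(partials), results_moved


total_central, moved_central = move_data_to_code(partitions)
total_local, moved_local = move_code_to_data(partitions, sum, sum)
print(total_central, moved_central, total_local, moved_local)
```

Both styles produce the same total, but the second moves one partial result per node instead of every record, which is the saving that matters once partitions are large.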

Typical data warehouse environment

Typical HADOOP environment

Co-existence of Big Data and Data Warehouse

Fill me Big data is high-volume, high-velocity, and high-variety information assets that demand --------------, ----------------forms of information processing for enhanced ----------------------- and -----------------. Answers Cost effective Innovative Insight Decision making

Quiz Share your understanding of Big Data. How is traditional BI environment different from the Big Data environment? Share your experience as a customer on an e-commerce site. Comment on the big data that gets created on a typical e-commerce site.

Thank You

Agenda
What is big data analytics?
What big data analytics is not
Data science and analytics
Meaning and characteristics of big data analytics
Need of big data analytics
Classification of analytics
Challenges to big data analytics
Importance of big data analytics
Basic terminologies in big data environment

What is Big Data Analytics Big data Analytics is the process of examining big data to uncover patterns, unearth trends, and find unknown correlations and other useful information to make faster and better decisions.

What is Big Data Analytics?

What Big Data Analytics isn’t?

What is Big Data Analytics
A few top analytics tools are: MS Excel, SAS, IBM SPSS Modeler, R analytics, Statistica, World Programming System (WPS), and WEKA.

Classification of Analytics
There are basically two schools of thought.
First school of thought: those that classify analytics into
Basic: slicing and dicing of data, historical, basic visualization
Operational: enterprise business processes
Advanced: predictive and prescriptive modeling
Monetized: direct business revenue

Second school of thought: those that classify analytics into Analytics 1.0, Analytics 2.0 and Analytics 3.0.

Analytics 1.0 (Era: 1950s to 2009)
Approach: Descriptive statistics (report events, occurrences, etc. of the past).
Key questions: What happened? Why did it happen?
Data: From legacy systems, ERP, CRM and third-party applications. Small and structured data sources. Data stored in enterprise data warehouses or data marts.
Sourcing: Data was internally sourced.
Technology: Relational databases.

Analytics 2.0 (Era: 2005 to 2012)
Approach: Descriptive statistics + predictive statistics (use data from the past to make predictions for the future).
Key questions: What will happen? Why will it happen?
Data: Big data. Big data is being taken up seriously. Data is mainly unstructured, arriving at a higher pace. This fast flow of high-volume data had to be stored and processed rapidly, often on massively parallel servers running Hadoop.
Sourcing: Data was often externally sourced.
Technology: Database applications, Hadoop clusters, SQL-to-Hadoop environments, etc.

Analytics 3.0 (Era: 2012 to present)
Approach: Descriptive statistics + predictive statistics + prescriptive statistics (use data from the past to make predictions for the future and, at the same time, make recommendations to leverage the situation to one’s advantage).
Key questions: What will happen? When will it happen? Why will it happen? What action should be taken to take advantage of what will happen?
Data: A blend of big data and data from legacy systems, ERP, CRM and third-party applications. A blend of big data and traditional analytics to yield insights and offerings with speed and impact.
Sourcing: Data is being both internally and externally sourced.
Technology: In-memory analytics, in-database processing, agile analytical methods, machine learning techniques, etc.

Analytics 1.0, 2.0 and 3.0
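The three styles of analytics can be contrasted on one toy data set. The sketch below uses hypothetical monthly sales figures and an assumed stock level; the least-squares line is just one simple way to make a prediction and is not a method prescribed by the slides.

```python
# Hypothetical monthly unit sales; months are numbered 1..6.
months = [1, 2, 3, 4, 5, 6]
sales = [100, 110, 120, 130, 140, 150]

# Descriptive (Analytics 1.0): what happened?
average_sales = sum(sales) / len(sales)

# Predictive (Analytics 2.0): what will happen? Fit y = a + b*x by least squares.
n = len(months)
mean_x = sum(months) / n
mean_y = average_sales
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, sales)) / sum(
    (x - mean_x) ** 2 for x in months
)
intercept = mean_y - slope * mean_x
forecast_month_7 = intercept + slope * 7

# Prescriptive (Analytics 3.0): what should be done about it?
stock_on_hand = 140  # assumed inventory level
action = "reorder" if forecast_month_7 > stock_on_hand else "hold"
print(average_sales, forecast_month_7, action)
```

Each stage builds on the previous one: the summary describes the past, the fitted line extends it into the future, and the final rule turns the forecast into a recommendation.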

Top Challenges facing Big Data
Scale: Storage (RDBMS, NoSQL) is the major concern that needs to be addressed
Security (poor security mechanisms)
Schema (no rigid schema; a dynamic schema is required)
Continuous availability (how to provide 24x7 support)
Consistency
Partition tolerance
Data quality

Big Data Analytics

Data Scientist

Terminologies used in Big data environments: In memory analytics In-Database processing Symmetric Multiprocessor system Massively parallel processing Distributed systems Shared nothing architecture CAP theorem

Terminologies used in Big data environments: In-memory Analytics: Data access from non-volatile storage such as a hard disk is a slow process. This problem has been addressed using in-memory analytics. Here all the relevant data is stored in Random Access Memory (RAM), or primary storage, thus eliminating the need to access the data from the hard disk. The advantages are faster access, rapid deployment, better insights, and minimal IT involvement.

Terminologies used in Big data environments: In-Database Processing: In-database processing is also called in-database analytics. It works by fusing data warehouses with analytical systems. Typically, the data from various enterprise OLTP systems is cleaned up through the ETL process and then stored in the enterprise data warehouse or data marts. The huge data sets are then exported to analytical programs for complex and extensive computations. With in-database processing, the database program itself can run the computations, eliminating the need for export and thereby saving time. Leading database vendors are offering this feature to large businesses.
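A small sketch may help contrast the two approaches described above. It uses Python's built-in sqlite3 module as a stand-in for an enterprise warehouse; the sales table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("north", 120.0),
        ("north", 80.0),
        ("south", 50.0),
        ("south", 150.0),
        ("south", 100.0),
    ],
)

# Export-then-compute: every row leaves the database before any analysis runs.
rows = conn.execute("SELECT region, amount FROM sales").fetchall()
exported = {}
for region, amount in rows:
    exported.setdefault(region, []).append(amount)
avg_exported = {region: sum(v) / len(v) for region, v in exported.items()}

# In-database: the engine computes the aggregate; only the summary is returned.
avg_in_db = dict(
    conn.execute("SELECT region, AVG(amount) FROM sales GROUP BY region")
)
print(len(rows), avg_exported, avg_in_db)
```

The answers match, but the in-database query returns one row per region rather than every sale, which is where the time saving comes from on large tables.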

Terminologies used in Big data environments: Symmetric Multi-Processor System (SMP): In SMP, a single common main memory is shared by two or more identical processors. The processors have full access to all I/O devices and are controlled by a single operating system instance. SMP systems are tightly coupled multiprocessor systems. Each processor has its own high-speed memory, called cache memory, and the processors are connected using a system bus.

Terminologies used in Big data environments:

Terminologies used in Big data environments: Massively Parallel Processing: Massively parallel processing (MPP) refers to the coordinated processing of programs by a number of processors working in parallel. Each processor has its own OS and dedicated memory. The processors work on different parts of the same program and communicate using some sort of messaging interface. MPP differs from symmetric multiprocessing (SMP) in that SMP processors share the same OS and the same memory. SMP is also referred to as tightly coupled multiprocessing.

Terminologies used in Big data environments: Shared nothing Architecture:
The three most common types of architecture for multiprocessor systems are shared memory, shared disk and shared nothing.
In shared memory architecture, a common central memory is shared by multiple processors.
In shared disk architecture, multiple processors share a common collection of disks while having their own private memory.
In shared nothing architecture, neither memory nor disk is shared among multiple processors.

Terminologies used in Big data environments: Advantages of shared nothing architecture: Fault isolation: Scalability:
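Fault isolation in a shared nothing design can be illustrated with a toy simulation. The Node class, the CRC32-modulo placement rule and the key names are all assumptions made for this sketch; production systems use more elaborate partitioning and replication schemes.

```python
import zlib


class Node:
    """One shared-nothing node: private storage, no memory or disk shared."""

    def __init__(self, name):
        self.name = name
        self.store = {}    # stands in for the node's own memory and disk
        self.alive = True


def owner(key, nodes):
    """Pick the node for a key with a stable hash (CRC32 modulo node count)."""
    return nodes[zlib.crc32(key.encode("utf-8")) % len(nodes)]


def put(key, value, nodes):
    """Store a value on the single node that owns the key."""
    node = owner(key, nodes)
    if not node.alive:
        raise RuntimeError(f"{node.name} is down")
    node.store[key] = value


def get(key, nodes):
    """Read a value from the single node that owns the key."""
    node = owner(key, nodes)
    if not node.alive:
        raise RuntimeError(f"{node.name} is down")
    return node.store[key]


nodes = [Node("n0"), Node("n1"), Node("n2")]
keys = [f"user-{i}" for i in range(30)]
for k in keys:
    put(k, k.upper(), nodes)

# Fault isolation: taking one node down only affects the keys it owns.
nodes[0].alive = False
reachable = [k for k in keys if owner(k, nodes).alive]
lost = [k for k in keys if not owner(k, nodes).alive]
print(len(reachable), len(lost))
```

Because no node depends on another node's memory or disk, the failure of n0 leaves every key owned by n1 and n2 fully readable. Scaling out follows the same logic: capacity grows by adding nodes rather than by enlarging a shared resource.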

CAP Theorem: The CAP theorem is also called Brewer’s theorem. It states that in a distributed computing environment, it is not possible to provide all three of the following guarantees at the same time; at most two of them can be provided:
Consistency implies that every read fetches the last write.
Availability implies that reads and writes always succeed. In other words, each non-failing node will return a response in a reasonable amount of time.
Partition tolerance implies that the system will continue to function when a network partition occurs.

BASE Basically Available, Soft State, Eventual Consistency (BASE) is a data system design philosophy that prizes availability over consistency of operations. BASE was developed as an alternative for producing more scalable and affordable data architectures, providing more options to expanding enterprises and IT clients, and simply acquiring more hardware to expand data operations.

BASE (as explained by Techopedia) BASE may be explained in contrast to another design philosophy: Atomicity, Consistency, Isolation, Durability (ACID). The ACID model promotes consistency over availability, whereas BASE promotes availability over consistency.
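The three words in BASE can be made concrete with a toy pair of replicas. This is a minimal sketch, assuming a simple version number per key and a manual sync step; the Replica class and the sync function are hypothetical and do not describe any particular database.

```python
class Replica:
    """A replica keeps a (version, value) pair per key."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, version):
        # Keep the write only if it is newer than what this replica has.
        current = self.data.get(key, (0, None))
        if version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        return self.data.get(key, (0, None))[1]


def sync(a, b):
    """Anti-entropy pass: exchange entries so both replicas hold the newest."""
    for src, dst in ((a, b), (b, a)):
        for key, (version, value) in src.data.items():
            dst.write(key, value, version)


r1, r2 = Replica(), Replica()
r1.write("cart", ["book"], version=1)   # accepted right away (basically available)
stale = r2.read("cart")                 # r2 has not heard yet (soft state)
sync(r1, r2)                            # replicas converge (eventual consistency)
fresh = r2.read("cart")
print(stale, fresh)
```

Between the write and the sync, the two replicas disagree, which an ACID system would not allow; BASE accepts that window in exchange for never refusing the write.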

Answer Me
What are the key questions to be answered by all organizations stepping into analytics?
What are predictive and prescriptive analytics?
What are the major challenges of big data?
Differentiate between Analytics 1.0, 2.0 and 3.0.

Challenges with Big Data
The challenges with big data:
Data today is growing at an exponential rate. The key questions are: Will all this data be useful for analysis? How will we separate knowledge from noise?
How to host big data solutions outside the enterprise.
The period of retention of big data.
Dearth of skilled professionals.
Shortage of data visualization experts.

Agenda
Challenges to big data analytics
Importance of big data analytics
Basic terminologies in big data environment

Top Challenges facing Big Data
Scale: Storage (RDBMS, NoSQL) is the major concern that needs to be addressed
Security (poor security mechanisms)
Schema (no rigid schema; a dynamic schema is required)
Continuous availability (how to provide 24x7 support)

Top Challenges facing Big Data
Consistency
Partition tolerance: the system should take care of both hardware and software failures
Data quality: accuracy, completeness and timeliness

Big Data Analytics

Data Science

Business Acumen Skills Understanding of domain Business strategy Problem solving Communication Presentation Inquisitiveness

Technology Expertise Good Database knowledge such as RDBMS Programming Languages such as Java, Python Open-Source tools such as Hadoop Data Warehousing and Data Mining Visualizations such as Tableau, Flare, etc

Mathematics Expertise Mathematics Statistics Artificial Intelligence Machine Learning Pattern recognition Natural Language Processing

Big Data Applications: Healthcare, Manufacturing, Media & Entertainment, IoT data, Government

Healthcare

Manufacturing

Marketing

IoT

Govt

Quiz
Define Big Data and explain the Vs of Big Data.
Big Data can be defined as a collection of complex unstructured or semi-structured data sets which have the potential to deliver actionable insights. The five Vs of Big Data are:
Volume: the amount of data
Variety: the various formats of data
Velocity: the ever-increasing speed at which the data is growing
Veracity: the degree of accuracy of the data available
Value: the use of big data in improving the business

CAP Theorem The CAP theorem states that a distributed database system can have only two of the three: Consistency, Availability and Partition Tolerance.

CAP Theorem

CAP Theorem- Partition Tolerance This condition states that the system continues to run despite any number of messages being delayed by the network between nodes.

CAP Theorem- High Consistency This condition states that all nodes see the same data at the same time.

CAP Theorem- High Availability This condition states that every request gets a response indicating success or failure.
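The trade-off between the three conditions shows up as soon as a partition occurs. The toy model below, with two replicas of a single value, is only a sketch: the Cluster class and the "CP"/"AP" mode labels are illustrative assumptions and not the API of any real system.

```python
class Cluster:
    """Two replicas of one value; `mode` is 'CP' or 'AP' (illustrative labels)."""

    def __init__(self, mode):
        self.mode = mode
        self.replicas = [0, 0]
        self.partitioned = False

    def write(self, replica, value):
        if self.partitioned:
            if self.mode == "CP":
                return False                 # refuse: stay consistent, lose availability
            self.replicas[replica] = value   # accept locally: stay available
            return True
        self.replicas = [value, value]       # healthy network: replicate to both
        return True

    def consistent(self):
        return self.replicas[0] == self.replicas[1]


def run(mode):
    """Write once normally, cut the network, then try a second write."""
    cluster = Cluster(mode)
    cluster.write(0, 1)
    cluster.partitioned = True
    accepted = cluster.write(0, 2)
    return accepted, cluster.consistent()


cp_result = run("CP")
ap_result = run("AP")
print(cp_result, ap_result)
```

During the partition, the CP configuration rejects the write and keeps both replicas equal, while the AP configuration accepts it and lets the replicas diverge. Neither can offer both properties until the partition heals.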

BaSE- Basically available Soft State, Eventual Consistency

Quiz Solution: 1. Consistency 2. Availability 3. Brewer 4. Partition Tolerant

Quiz - Match the following
Column A                   Column B
NLP                        Content analytics
Text analytics             Text messages
UIMA                       Chats
Noisy unstructured data    Text mining
Data mining                Comprehend human or natural language input
Noisy unstructured data    Uses methods at the intersection of statistics, Artificial Intelligence, machine learning & DBs
IBM                        UIMA

Quiz
1. List the various types of digital data.
A. Structured, semi-structured and unstructured.
2. Why is an email placed in the unstructured category?
A. Because it contains hyperlinks, attachments, videos, images, free-flowing text...
3. What category will you place CCTV footage in?
A. Unstructured.
4. You have just got a book issued from the library. What are the details about the book that can be placed in an RDBMS table?
A. Title, author, publisher, year, no. of pages, type of book, price, ISBN, with CD or not.

Quiz
5. Which category would you place consumer complaints and feedback in?
A. Unstructured.
6. Which category (structured, semi-structured or unstructured) will you place a web page in?
A. Unstructured.
7. Which category (structured, semi-structured or unstructured) will you place a PowerPoint presentation in?
A. Unstructured.
8. Which category (structured, semi-structured or unstructured) will you place a Word document in?
A. Unstructured.

Quiz 1. Big data is high-volume, high-velocity, and high-variety information assets that demand--------------------, ---------------------forms of information processing for enhanced ----------------------and ------------- Answer: Cost-effective, Innovative, Insight, Decision making

Quiz
1. The _____ characteristic of data explains the spikes in data.
2. _____, a Gartner analyst, coined the term “Big Data”.
3. _____ is the characteristic of data dealing with its retention.
4. Near real-time processing or real-time processing deals with the _____ characteristic of the data.
5. A _____ is a large data repository that stores data in its native format until it is needed.

Quiz State a few examples of human generated and machine-generated data.

Data Generation
Origin | Definition | Type of data | Examples
Humans | Data representing the digitization of human interactions | Structured | Business process data, e.g., payment transactions, sales order, call record, ERP, CRM
Humans | | Semi-structured | Weblogs
Humans | | Unstructured | Content such as web pages, e-mail, blog, wiki, review, comment
Humans | | Binary | Content such as video, audio, photo
Machines | Data representing machine-to-machine interactions, or simply not human-generated (Internet of Things) | Structured | Some devices
Machines | | Semi-structured | Computer logs, device logs, network logs, sensor/meter logs
Machines | | Binary | Video, audio, photo

Challenges with Big Data
1. Data today is growing at an exponential rate. The key questions are: will all this data be useful for analysis, and how will we separate knowledge from noise?
2. How to host big data solutions outside the world.
3. The period of retention of big data.
4. Dearth of skilled professionals.
5. Shortage of data visualization experts.

What is Big Data? Big Data is a term used for a collection of data sets that are so large and complex that they are difficult to store and process using available database management tools or traditional data processing applications. The challenges include capturing, curating, storing, searching, sharing, transferring, analyzing and visualizing this data.

Big Data Characteristics The five characteristics that define Big Data are: Volume, Velocity, Variety, Veracity and Value.

Volume Volume refers to the ‘amount of data’, which is growing day by day at a very fast pace. The size of data generated by humans, machines and their interactions on social media itself is massive. Researchers have predicted that 40 Zettabytes (40,000 Exabytes) will be generated by 2020, which is an increase of 300 times from 2005.
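The figures on this slide can be checked with a little arithmetic. The sketch below assumes decimal (SI) units, where 1 Zettabyte equals 1,000 Exabytes, and simply works backwards from the quoted 300-fold increase to the 2005 volume that it implies. The variable names are illustrative only.

```python
# Decimal (SI) storage units: each step up is a factor of 1,000.
EXABYTES_PER_ZETTABYTE = 1_000

projected_2020_zb = 40   # "40 Zettabytes ... by 2020" (from the slide)
growth_factor = 300      # "an increase of 300 times from 2005" (from the slide)

# Convert the projection to Exabytes to confirm the slide's parenthetical.
projected_2020_eb = projected_2020_zb * EXABYTES_PER_ZETTABYTE

# Dividing by the growth factor gives the 2005 volume the claim implies.
implied_2005_eb = projected_2020_eb / growth_factor

print(f"2020 projection: {projected_2020_eb:,} EB")
print(f"Implied 2005 volume: about {implied_2005_eb:.0f} EB")
```

So the two numbers on the slide are consistent with a 2005 volume of roughly 133 Exabytes.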


Velocity Velocity is defined as the pace at which different sources generate data every day. At the time of writing, Facebook had 1.03 billion daily active users (DAU) on mobile, an increase of 22% year over year. This shows how fast the number of users is growing on social media and how fast data is being generated daily.

Variety The type of data can be structured, semi-structured or unstructured. Earlier, data came mostly from Excel sheets and databases; now it also arrives in the form of images, audio, video, sensor data, etc.


Veracity Veracity refers to data in doubt, that is, uncertainty about the available data due to inconsistency and incompleteness. In the example table shown on the slide, a few values are missing. A few other values are hard to accept, for example a minimum value of 15000 in the 3rd row, which is not possible. This inconsistency and incompleteness is what veracity is about.
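The two veracity problems described here, missing values and implausible values, can be flagged with a simple check. The slide's table is not reproduced in the text, so the rows below are hypothetical stand-ins: each has a minimum, mean and maximum, row 2 has a missing value, and row 3 has a minimum of 15000 that exceeds its own maximum. The rule `min <= mean <= max` is an assumed sanity check, not something stated in the slides.

```python
# Hypothetical readings; None marks a missing value.
rows = [
    {"id": 1, "min": 10.0, "mean": 25.0, "max": 40.0},
    {"id": 2, "min": None, "mean": 30.0, "max": 55.0},
    {"id": 3, "min": 15000.0, "mean": 28.0, "max": 60.0},
]

FIELDS = ("min", "mean", "max")


def veracity_issues(row):
    """Return a list of human-readable problems found in one row."""
    issues = []
    missing = [name for name in FIELDS if row[name] is None]
    if missing:
        # Incompleteness: at least one field has no value.
        issues.append("missing: " + ", ".join(missing))
    elif not (row["min"] <= row["mean"] <= row["max"]):
        # Inconsistency: the values contradict each other.
        issues.append("inconsistent: expected min <= mean <= max")
    return issues


report = {row["id"]: veracity_issues(row) for row in rows}
for row_id, issues in report.items():
    print(row_id, issues or "ok")
```

Rows that fail such checks are candidates for cleaning or exclusion before analysis.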


Value It is all well and good to have access to big data, but unless we can turn it into value it is useless. Turning it into value means asking: Is it adding to the benefits of the organizations that are analyzing big data? Is the organization working on Big Data achieving a high ROI (Return On Investment)? Unless working on Big Data adds to their profits, it is useless.

5 V’s

Challenges with Big Data: Data Quality, Discovery, Storage, Analytics, Security, Lack of Talent

Applications of Big Data: Smarter Healthcare, Telecom, Retail, Traffic Control, Manufacturing, Search Quality

Big Data Applications: Healthcare, Manufacturing, Media & Entertainment, IoT Data, Government

Big Data Applications: Healthcare

Big Data Applications: Manufacturing

Big Data Applications: Marketing

Big Data Applications: IoT

Big Data Applications: Govt

Quiz Define Big Data and explain the Vs of Big Data.
Big Data can be defined as a collection of complex unstructured or semi-structured data sets which have the potential to deliver actionable insights. The five Vs of Big Data are:
Volume: the amount of data
Variety: the various formats of data
Velocity: the ever-increasing speed at which the data is growing
Veracity: the degree of accuracy of the data available
Value: the use of big data in improving the business


Terminologies used in Big data environments:


CAP Theorem


CAP Theorem The CAP theorem states that a distributed database system can guarantee only 2 of the following 3 properties at the same time: Consistency, Availability and Partition Tolerance.

CAP Theorem - Partition Tolerance This condition states that the system continues to run even when any number of messages between nodes are dropped or delayed by the network.

CAP Theorem - High Consistency This condition states that all nodes see the same data at the same time.

CAP Theorem - High Availability This condition states that every request gets a response indicating whether it succeeded or failed.
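The trade-off between the three conditions can be illustrated with a toy model. The sketch below is not a real database: `TwoNodeStore`, its `mode` labels "CP" and "AP", and the `partitioned` flag are all hypothetical names invented for illustration. During a simulated network partition, the "CP" store refuses writes to stay consistent, while the "AP" store keeps answering and lets the replicas diverge.

```python
class Replica:
    """One node holding its own copy of the data."""

    def __init__(self):
        self.data = {}


class TwoNodeStore:
    """Toy two-replica store; mode is 'CP' or 'AP' (illustrative labels)."""

    def __init__(self, mode):
        self.mode = mode
        self.nodes = [Replica(), Replica()]
        self.partitioned = False  # True means the nodes cannot talk to each other

    def write(self, node_index, key, value):
        if not self.partitioned:
            # Healthy network: replicate to every node before acknowledging.
            for node in self.nodes:
                node.data[key] = value
            return True
        if self.mode == "CP":
            # Keep consistency: reject the write, giving up availability.
            return False
        # Keep availability: accept locally, giving up consistency.
        self.nodes[node_index].data[key] = value
        return True

    def read(self, node_index, key):
        return self.nodes[node_index].data.get(key)


cp = TwoNodeStore("CP")
cp.write(0, "x", 1)
cp.partitioned = True
cp_accepted = cp.write(0, "x", 2)            # rejected during the partition

ap = TwoNodeStore("AP")
ap.write(0, "x", 1)
ap.partitioned = True
ap_accepted = ap.write(0, "x", 2)            # accepted on node 0 only

print(cp_accepted, cp.read(0, "x"), cp.read(1, "x"))
print(ap_accepted, ap.read(0, "x"), ap.read(1, "x"))
```

In the "CP" run both nodes still agree on the old value; in the "AP" run the write succeeds but the two nodes now return different values for the same key.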

BASE - Basically Available, Soft State, Eventual Consistency
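Eventual consistency can be pictured as replicas that accept writes independently and reconcile later. The following is a minimal sketch under stated assumptions: the `Replica` class, the integer `version` attached to each write, and the "highest version wins" rule in `anti_entropy` are illustrative choices, and real systems use a range of other reconciliation strategies.

```python
class Replica:
    """A node that stores, per key, the latest (version, value) it has seen."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def write(self, key, value, version):
        current = self.store.get(key)
        # Keep the incoming write only if it is newer than what we hold.
        if current is None or version > current[0]:
            self.store[key] = (version, value)

    def read(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None


def anti_entropy(a, b):
    """Exchange entries in both directions; the higher version wins per key."""
    for key, (version, value) in list(a.store.items()):
        b.write(key, value, version)
    for key, (version, value) in list(b.store.items()):
        a.write(key, value, version)


r1, r2 = Replica("r1"), Replica("r2")
r1.write("cart", ["book"], version=1)          # older write lands on r1
r2.write("cart", ["book", "pen"], version=2)   # newer write lands on r2

stale = r1.read("cart")   # soft state: r1 still serves the older value
anti_entropy(r1, r2)      # background synchronisation
print(stale, r1.read("cart"), r2.read("cart"))
```

Both replicas stay available for reads and writes throughout, the value on `r1` is temporarily stale, and after synchronisation the two replicas agree.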


Summary Topics Discussed Quiz


Quiz Answers (fill in the blanks): 1. Variability 2. Doug Laney 3. Volatility 4. Velocity 5. Data Lake


Co-existence of Big Data and Data Warehouse

Data Science

Data Science - Business Acumen Skills: Understanding of domain, Business strategy, Problem solving, Communication, Presentation, Inquisitiveness
