Big Data Infrastructure and Hadoop components.pptx

GEZWARDGERALD | 44 slides | Jul 15, 2024

About This Presentation

This slide deck covers big data infrastructure, storage, and processing.


Slide Content

Lecture 5: Big Data Storage and Infrastructure

Big Data

Characteristics of Big Data: the four Vs. A crucial part of the rise of data science is the steep increase in the amount and availability of data. According to IBM scientists, big data can be analyzed along four dimensions.

Data Types

Analysis of structured data. Tools: OLAP, SQLite, MySQL, PostgreSQL. Use cases: Customer Relationship Management (CRM), online bookings, accounting.
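To make the structured-data case concrete, here is a minimal sketch using SQLite (one of the tools listed above) via Python's built-in sqlite3 module. The customers table, its columns, and the sample rows are invented purely for illustration.

```python
# A minimal sketch of querying structured (tabular) data with SQLite.
# The "customers" table and its rows are made up for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("""
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,
        name TEXT,
        country TEXT,
        last_purchase_amount REAL
    )
""")
conn.executemany(
    "INSERT INTO customers (name, country, last_purchase_amount) VALUES (?, ?, ?)",
    [("Alice", "KE", 120.0), ("Bob", "UG", 75.5), ("Carol", "KE", 210.0)],
)

# Structured data supports precise, schema-aware queries such as this one.
for row in conn.execute(
    "SELECT country, COUNT(*), AVG(last_purchase_amount) "
    "FROM customers GROUP BY country"
):
    print(row)
```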

Analysis of unstructured data: ML algorithms + NLP. Tools: MongoDB, DynamoDB, Hadoop. Use cases: sentiment analysis, topic analysis, language detection, intent detection.
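As a small illustration of unstructured-data analysis, the sketch below trains a toy sentiment classifier with scikit-learn (bag-of-words features plus logistic regression). The example sentences and labels are invented; a real system would train on a large labelled corpus.

```python
# A toy sketch of sentiment analysis on unstructured text, assuming
# scikit-learn is installed. The tiny training set is invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "I love this product, it works great",
    "Absolutely fantastic service",
    "Terrible experience, would not recommend",
    "The device broke after one day, awful",
]
labels = ["positive", "positive", "negative", "negative"]

# Bag-of-words features + a linear classifier: a classic NLP baseline.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["this was a great purchase"]))  # likely ['positive']
```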

Analysis of unstructured data

Hardware and Storage

Big Data hardware. We need to think about: data collection hardware, data storage hardware, and data processing hardware.

Data collection hardware: smartphones, cameras, cars, watches, security systems, motion sensors, credit card terminals, etc. Capture requirements: data accuracy, real-time transmission, compatibility with analytical systems, and support for standard protocols, e.g. IEEE 802.11, Z-Wave, ZigBee, Bluetooth.

Data storage hardware: big data requires big hardware, i.e. powerful hardware optimized for processing large volumes of information. Even small applications generate huge amounts of information, so a traditional single server is insufficient; massive data must be stored across multiple optimized nodes.

Hardware trends supporting data science: cloud technology, solid-state drives, and AI-focused chips.

Cloud technology: instead of buying physical servers, you can rent hardware in the cloud. Benefits include access to specialized resources, quick deployment, easily expanded capacity, the ability to discontinue a cloud service when it is no longer needed, cost savings, and good backup and recovery.

Cloud technology service models:
- Software as a Service (SaaS): the vendor provides the hardware, application software, operating system, and storage.
- Platform as a Service (PaaS): differs from SaaS in that the vendor does not provide the software for building or running specific applications; this is up to the company. Only the basic platform is provided.
- Infrastructure as a Service (IaaS): the vendor provides raw computing power and storage; neither the operating system nor application software is included. Customers upload an image that includes the application and operating system.

Solid State Drives (SSDs): faster, no moving parts, smaller; best for storing frequently accessed data.

Processors? The usual processor is the Central Processing Unit (CPU). It can scale by adding more cores (multi-core); however, this scaling is limited.

Processors for big data analytics: chips that have been specially designed for use in the field of Artificial Intelligence. Examples: Graphics Processing Units (GPUs) and Application-Specific Integrated Circuits (ASICs) such as TPUs (Tensor Processing Units).

Databases

Relational databases: queries are issued using Structured Query Language (SQL); used for storing structured data. Examples: MySQL, MariaDB, Oracle, PostgreSQL.

Databases. Traditional databases are relational databases, consisting of tables (rows and columns). There are two types: row-oriented databases and column-oriented databases.

Row-oriented databases
Scenario: updating data. Use case: update the Last Purchase Amount for a specific customer. Efficiency: highly efficient; the database can quickly locate the row and update the single entry.
Scenario: aggregating a single column. Use case: calculate the average Last Purchase Amount. Efficiency: less efficient; the database has to read through all rows, picking out the Last Purchase Amount from each, which can be slow if the dataset is large.

Column-oriented databases
Scenario: updating data. Use case: update the Last Purchase Amount for a specific customer. Efficiency: less efficient compared to row-oriented; the database needs to locate the right column and then find the specific customer within that column.
Scenario: aggregating a single column. Use case: calculate the average Last Purchase Amount. Efficiency: highly efficient; the database can quickly aggregate this single column, as it does not need to read through the entire dataset, only the relevant column.

Row Oriented Vs. Column Oriented Databases Row-Oriented Database: Best for transactional operations or scenarios where entire records are frequently accessed or updated together. Column-Oriented Database: Ideal for analytical queries and operations that require fast read access to specific columns for aggregation, like in data warehousing.
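The trade-off above can be sketched in plain Python by laying the same records out both ways. This is a conceptual illustration, not how a real database stores data, and the field names and values are invented.

```python
# A conceptual sketch (plain Python, no real database) contrasting how
# row-oriented and column-oriented stores lay out the same records.

# Row store: each record is kept together, good for updating one customer.
row_store = [
    {"customer": "Alice", "country": "KE", "last_purchase_amount": 120.0},
    {"customer": "Bob",   "country": "UG", "last_purchase_amount": 75.5},
    {"customer": "Carol", "country": "KE", "last_purchase_amount": 210.0},
]

# Column store: each column is kept together, good for aggregating one column.
column_store = {
    "customer": ["Alice", "Bob", "Carol"],
    "country": ["KE", "UG", "KE"],
    "last_purchase_amount": [120.0, 75.5, 210.0],
}

# Update one customer's purchase amount: natural in the row store.
row_store[1]["last_purchase_amount"] = 99.0

# Average a single column: the column store touches only one list.
amounts = column_store["last_purchase_amount"]
print(sum(amounts) / len(amounts))
```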

Big data databases. Remember the 4 Vs (Volume, Velocity, Variety, Veracity)? Databases need to handle all of these characteristics. Such databases are commonly known as NoSQL (Not Only SQL).

NoSQL databases can accommodate unstructured data. There is no need to store data in rows and columns; several data models are acceptable (files, graphs, etc.). They do not rely on SQL to retrieve data (though some do support SQL). Data is stored and retrieved "as is" through key-value pairs, where keys provide links to where files are stored on disk. Examples: Apache Hadoop, Apache Cassandra, MongoDB, and Couchbase.
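The "keys link to where values live" idea can be sketched in a few lines of plain Python. This is not how MongoDB, Cassandra, or Hadoop work internally; it only illustrates the key-value access pattern, and the directory and key names are invented.

```python
# A minimal sketch of key-value storage: keys map to the location of a value
# (here, a JSON file on disk), and values are stored and retrieved "as is".
import json
from pathlib import Path

store_dir = Path("kv_store")
store_dir.mkdir(exist_ok=True)
index = {}  # key -> path of the file holding the value

def put(key, value):
    path = store_dir / f"{key}.json"
    path.write_text(json.dumps(value))   # store the value as-is
    index[key] = path                    # the key links to where it lives

def get(key):
    return json.loads(index[key].read_text())

put("user:42", {"name": "Alice", "interests": ["jazz", "hiking"]})
print(get("user:42"))
```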

OLTP vs. OLAP: OLTP – Online Transaction Processing systems; OLAP – Online Analytical Processing systems.

OLTP and OLAP: https://www.geeksforgeeks.org/difference-between-olap-and-oltp-in-dbms/

Data warehousing

Typical setup

Big data analytics pipeline: data sources → data storage → data applications.

Typical Data Storage Implementation

Solution: move the algorithms to the data instead of the data to the algorithms.

Advantages: no data movement, faster performance, high security, scalability, real-time deployment and environments, production deployment.
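A small sketch of the idea, assuming a SQL database such as SQLite: instead of pulling every row into the application and computing there, the aggregation is pushed into the database so only the result travels. The table and column names are invented.

```python
# "Move the algorithm to the data": contrast pulling all rows into the app
# with letting the database compute the aggregate itself.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("Alice", 120.0), ("Bob", 75.5), ("Carol", 210.0)],
)

# Data-to-algorithm: fetch everything, then compute in Python (lots of movement).
rows = conn.execute("SELECT amount FROM purchases").fetchall()
avg_in_app = sum(a for (a,) in rows) / len(rows)

# Algorithm-to-data: the database computes the average; only one number moves.
(avg_in_db,) = conn.execute("SELECT AVG(amount) FROM purchases").fetchone()

print(avg_in_app, avg_in_db)
```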

Data storage: some efficiency issues with real databases.
- Indexing: how do we efficiently find all songs written by Paul Simon in a database with 10,000,000 entries? This requires data structures for representing sorted order on fields.
- Disk management: databases are often too big to fit in RAM, so most of the data is left on disk and blocks of records are swapped in as needed, which can be slow.
- Concurrency: transaction semantics require that either all updates happen as a batch or none do (commit or rollback), for example deleting one record and simultaneously adding another while guaranteeing the database is never left in an inconsistent state; other users might be blocked until the transaction is done.
- Query optimization: the order in which you JOIN tables can drastically affect the size of the intermediate tables.
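Two of these issues, indexing and transaction semantics, can be illustrated with SQLite from Python. The songs table mirrors the Paul Simon example above; the data and index name are invented, and real systems involve far more machinery than this sketch shows.

```python
# A brief sketch of indexing and commit/rollback semantics using SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE songs (title TEXT, writer TEXT)")
conn.executemany("INSERT INTO songs VALUES (?, ?)",
                 [("Graceland", "Paul Simon"), ("Imagine", "John Lennon")])

# Indexing: a sorted structure on 'writer' avoids scanning every row.
conn.execute("CREATE INDEX idx_songs_writer ON songs (writer)")
conn.commit()
print(conn.execute("SELECT title FROM songs WHERE writer = ?",
                   ("Paul Simon",)).fetchall())

# Transaction semantics: both statements commit together, or neither does.
try:
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM songs WHERE title = ?", ("Imagine",))
        conn.execute("INSERT INTO songs VALUES (?, ?)", ("Cecilia",))  # wrong arity: raises
except sqlite3.ProgrammingError:
    pass  # the DELETE above was rolled back; the table is unchanged

print(conn.execute("SELECT COUNT(*) FROM songs").fetchone())  # still 2 rows
```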

Hadoop is the most widely used big data technology. It is an Apache top-level project: an open-source implementation of frameworks for reliable, scalable, distributed computing and data storage. It provides a flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware.

Goals / requirements: abstract and facilitate the storage and processing of large and/or rapidly growing data sets (structured, semi-structured, and unstructured); high scalability and availability; use of commodity (cheap!) hardware with little redundancy; fault tolerance; move computation rather than data.

https://youtu.be/aReuLtY0YMI?si=mjbjZ6Hpyd3S4n5c

MapReduce example: https://medium.com/edureka/mapreduce-tutorial-3d9535ddbe7c
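For readers who want to see the data flow without a Hadoop cluster, here is a plain-Python sketch of the classic MapReduce word count: map emits (word, 1) pairs, shuffle groups them by key, and reduce sums each group. The sample sentences are a common textbook example, not necessarily the exact data used in the linked tutorial, and Hadoop would distribute these phases across many nodes rather than run them in one process.

```python
# A plain-Python simulation of the MapReduce word-count data flow.
from collections import defaultdict

documents = ["deer bear river", "car car river", "deer car bear"]

# Map phase: each input record becomes a list of (key, value) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group all values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: combine the values for each key.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}
```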

Questions?

Something to research: what do you think the ChatGPT hardware infrastructure looks like? Amazon data centers? Google data centers?

How was the test?

Paper presentations – next week (25th Jan 2023). Group 6: Biswas, S., Wardat, M., & Rajan, H. (2022, May). The art and practice of data science pipelines: A comprehensive study of data science pipelines in theory, in-the-small, and in-the-large. In Proceedings of the 44th International Conference on Software Engineering (pp. 2091-2103). Group 7: Talib, M. A., Majzoub, S., Nasir, Q., & Jamal, D. (2021). A systematic literature review on hardware implementation of artificial intelligence algorithms. The Journal of Supercomputing, 77(2), 1897-1938. Group 8: Ngo, V. M., Le-Khac, N. A., & Kechadi, M. (2019, June). Designing and implementing data warehouse for agricultural big data. In International Conference on Big Data (pp. 1-17). Springer, Cham.