Big Data Infrastructure and Hadoop Components
Lecture 5: Big Data Storage and Infrastructure
Big Data
Characteristics of Big Data: The Four Vs
A crucial part of the rise of data science is the steep increase in the amount and availability of data.
According to IBM scientists, big data can be analyzed along four dimensions: Volume, Velocity, Variety, and Veracity.
Data Types
Analysis of structured data
Tools: OLAP, SQLite, MySQL, PostgreSQL
Use cases: Customer Relationship Management (CRM), online bookings, accounting
Analysis of unstructured data
Approach: ML algorithms + NLP
Tools: MongoDB, DynamoDB, Hadoop
Use cases: sentiment analysis, topic analysis, language detection, intent detection
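To make one of these use cases concrete, here is a minimal Python sketch of sentiment scoring. The word lists and scoring rule are invented for illustration only; real systems use the ML/NLP models the slide refers to rather than fixed keyword lists.

```python
# Minimal keyword-based sentiment scoring (illustrative only; production
# systems use trained ML/NLP models rather than fixed word lists).

POSITIVE = {"great", "good", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "awful"}

def sentiment(text: str) -> str:
    """Classify text as positive, negative, or neutral by counting keywords."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this product and the service was excellent"))  # positive
print(sentiment("Terrible experience, the support was awful"))         # negative
```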
Hardware and Storage
Big Data Hardware
Need to think of:
Data collection hardware
Data storage hardware
Data processing hardware
Data Collection Hardware
Smartphones, cameras, cars, watches, security systems, motion sensors, credit card terminals, etc.
Capture requirements:
Data accuracy
Real-time transmission
Compatibility with analytical systems
Support for standard protocols, e.g. IEEE 802.11, Z-Wave, ZigBee, Bluetooth
Data Storage Hardware
Big Data requires big hardware: powerful hardware optimized for processing lots of information.
Even small applications generate huge amounts of information.
A traditional single server is insufficient; massive data must be stored on multiple optimized nodes.
Hardware Trends Supporting Data Science
Cloud technology
Solid state drives
AI-focused chips
Cloud Technology
No need to buy physical servers; hardware can be rented in the cloud.
Benefits include: access to specialized resources, quick deployment, easily expanded capacity, the ability to discontinue a cloud service when it is no longer needed, cost savings, and good backup and recovery.
Cloud Technology
Software as a Service (SaaS): the vendor provides the hardware, application software, operating system, and storage.
Platform as a Service (PaaS): differs from SaaS in that the vendor does not provide the software for building or running specific applications; this is up to the company. Only the basic platform is provided.
Infrastructure as a Service (IaaS): the vendor provides raw computing power and storage; neither operating system nor application software is included. Customers upload an image that includes the application and operating system.
Solid State Drives (SSDs)
Faster
No moving parts
Smaller
Best for storing frequently accessed data
Processors?
The usual processor: the Central Processing Unit (CPU).
Can scale by adding more cores (multi-core).
However, scaling is limited.
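As a small illustration of multi-core scaling, the following Python sketch splits a CPU-bound task across several worker processes using the standard-library multiprocessing module. The task (naive prime counting), the range, and the pool size of four are arbitrary choices for illustration, not anything prescribed by the slides.

```python
# Illustrative multi-core scaling: the same CPU-bound job is split into
# disjoint chunks and handed to a pool of worker processes.
from multiprocessing import Pool

def count_primes_in_range(bounds):
    """Count primes in [lo, hi) with a naive test (deliberately CPU-bound)."""
    lo, hi = bounds
    return sum(
        all(n % d for d in range(2, int(n ** 0.5) + 1))
        for n in range(max(lo, 2), hi)
    )

if __name__ == "__main__":
    # Split [2, 80_000) into four disjoint chunks, roughly one per core.
    edges = [2, 20_000, 40_000, 60_000, 80_000]
    chunks = list(zip(edges[:-1], edges[1:]))
    with Pool(processes=4) as pool:
        counts = pool.map(count_primes_in_range, chunks)
    print("primes below 80000:", sum(counts))
```

Adding more worker processes helps only up to the number of physical cores, which is the scaling limit the slide points to.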
Processors for Big Data Analytics
Chips specially designed for use in the field of Artificial Intelligence.
Examples:
Graphics Processing Units (GPUs)
Application-Specific Integrated Circuits (ASICs), such as TPUs (Tensor Processing Units)
Databases
Relational Databases
Queries are issued using Structured Query Language (SQL).
Used for storing structured data.
Examples: MySQL, MariaDB, Oracle, PostgreSQL
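A minimal sketch of querying structured data with SQL, using Python's built-in sqlite3 module (SQLite is one of the tools named earlier). The table name, columns, and sample rows are invented for illustration.

```python
# Structured data in a relational table, queried with SQL (in-memory SQLite).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, last_purchase REAL)"
)
conn.executemany(
    "INSERT INTO customers (name, last_purchase) VALUES (?, ?)",
    [("Asha", 120.50), ("Ben", 75.00), ("Chen", 210.25)],
)

# A typical structured-data query: an aggregate over one column.
(avg,) = conn.execute("SELECT AVG(last_purchase) FROM customers").fetchone()
print(f"average last purchase: {avg:.2f}")
conn.close()
```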
Databases
Traditional databases: relational databases
Consist of tables (rows and columns)
Two types:
Row-oriented databases
Column-oriented databases
Row-Oriented Databases
Scenario: Updating data
Use case: update the Last Purchase Amount for a specific customer.
Efficiency: highly efficient. The database can quickly locate the row and update the single entry.
Scenario: Aggregating a single column
Use case: calculate the average Last Purchase Amount.
Efficiency: less efficient. The database has to read through all rows, picking out the Last Purchase Amount from each, which can be slow if the dataset is large.
Column-Oriented Databases
Scenario: Updating data
Use case: update the Last Purchase Amount for a specific customer.
Efficiency: less efficient compared to row-oriented. The database needs to locate the right column and then find the specific customer within that column.
Scenario: Aggregating a single column
Use case: calculate the average Last Purchase Amount.
Efficiency: highly efficient. The database can quickly aggregate this single column, as it doesn't need to read through the entire dataset, only the relevant column.
Row-Oriented vs. Column-Oriented Databases
Row-oriented database: best for transactional operations or scenarios where entire records are frequently accessed or updated together.
Column-oriented database: ideal for analytical queries and operations that require fast read access to specific columns for aggregation, as in data warehousing.
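The trade-off above can be sketched in a few lines of Python: the same records are held once in row order and once in column order, and the column layout makes a single-column aggregate a scan of one list. The field names and values are invented for illustration.

```python
# Row-oriented vs. column-oriented layouts of the same customer records.

# Row-oriented: each record is stored together (good for updating one customer).
rows = [
    {"id": 1, "name": "Asha", "last_purchase": 120.50},
    {"id": 2, "name": "Ben",  "last_purchase": 75.00},
    {"id": 3, "name": "Chen", "last_purchase": 210.25},
]

# Column-oriented: each field is stored together (good for aggregating a column).
columns = {
    "id": [1, 2, 3],
    "name": ["Asha", "Ben", "Chen"],
    "last_purchase": [120.50, 75.00, 210.25],
}

# Updating one customer's record touches a single row in the row layout ...
rows[1]["last_purchase"] = 80.00

# ... while averaging one column reads just one contiguous list in the column layout.
avg = sum(columns["last_purchase"]) / len(columns["last_purchase"])
print(f"average last purchase: {avg:.2f}")
```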
Big Data Databases
Remember the four Vs (Volume, Velocity, Variety, Veracity)?
Databases need to handle all of these characteristics.
Commonly known as NoSQL (Not Only SQL) databases.
NoSQL Databases
Can accommodate unstructured data.
No need to store data in rows and columns; several data models are acceptable (files, graphs, etc.).
Do not rely on SQL to retrieve data (though some do support SQL).
Data is stored and retrieved "as is" through key-value pairs that use keys to provide links to where files are stored on disk.
Examples: Apache Hadoop, Apache Cassandra, MongoDB, Couchbase
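A minimal sketch of the key-value idea, using a plain Python dictionary as a stand-in for a NoSQL store. Real systems such as MongoDB or Cassandra distribute and persist these records across nodes, but the access pattern is the same: look up a value by its key. The keys and documents shown are invented.

```python
# Key-value access to schemaless "documents": values need not share a structure.
import json

store = {}  # stand-in for a distributed key-value / document store

# Documents with different shapes can live side by side (no fixed schema).
store["user:42"] = {"name": "Asha", "tags": ["prime", "mobile"]}
store["event:9001"] = {"type": "click", "ts": 1721000000, "page": "/checkout"}

# Retrieval is by key, and the value comes back "as is".
print(json.dumps(store["user:42"], indent=2))
```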
OLTP vs. OLAP
OLTP: Online Transaction Processing systems
OLAP: Online Analytical Processing systems
Big Data Analytics Pipeline
Data sources
Data storage
Data applications
Typical Data Storage Implementation
Solution
Move the algorithms to the data instead of the data to the algorithms.
Advantages
No data movement
Faster performance
High security
Scalability
Real-time deployment and environments
Production deployment
Data Storage: Efficiency Issues with Real Databases
Indexing: how to efficiently find all songs written by Paul Simon in a database with 10,000,000 entries? Requires data structures for representing sorted order on fields.
Disk management: databases are often too big to fit in RAM, so most of the data is left on disk and blocks of records are swapped in as needed, which can be slow.
Concurrency: transaction semantics mean either all updates happen as a batch or none (commit or rollback), e.g. delete one record and simultaneously add another, with a guarantee not to leave the database in an inconsistent state; other users might be blocked until done.
Query optimization: the order in which you JOIN tables can drastically affect the size of the intermediate tables.
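The indexing point can be illustrated with a small Python sketch: building a hash index over the writer field turns a full scan into a single dictionary lookup. The song records are invented and tiny, but the same idea is what lets a real database answer the 10,000,000-entry query quickly.

```python
# Without an index, finding all songs by one writer means scanning every record;
# with a hash index on the writer field, it is a single dictionary lookup.
from collections import defaultdict

songs = [
    {"title": "Graceland", "writer": "Paul Simon"},
    {"title": "The Boxer", "writer": "Paul Simon"},
    {"title": "Imagine",   "writer": "John Lennon"},
]

# Full scan: touches every record, O(n) per query.
scan_hits = [s["title"] for s in songs if s["writer"] == "Paul Simon"]

# Build an index once, then answer the same query with one lookup.
by_writer = defaultdict(list)
for s in songs:
    by_writer[s["writer"]].append(s["title"])

print(scan_hits)
print(by_writer["Paul Simon"])
```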
Hadoop
The most widely used technology for Big Data.
An Apache top-level project: an open-source implementation of frameworks for reliable, scalable, distributed computing and data storage.
A flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware.
Goals / Requirements:
Abstract and facilitate the storage and processing of large and/or rapidly growing data sets (structured, semi-structured, and unstructured data)
High scalability and availability
Use commodity (cheap!) hardware with little redundancy
Fault tolerance
Move computation rather than data
https://youtu.be/aReuLtY0YMI?si=mjbjZ6Hpyd3S4n5c
MapReduce example https://medium.com/edureka/mapreduce-tutorial-3d9535ddbe7c
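Alongside the tutorial linked above, here is a minimal single-machine Python sketch of the classic MapReduce word count: the map step emits (word, 1) pairs, a shuffle groups them by key, and the reduce step sums each group. It only imitates the programming model; a real Hadoop job distributes these steps across the cluster, and the sample documents are invented.

```python
# Single-process imitation of the MapReduce word-count flow:
# map -> shuffle (group by key) -> reduce.
from collections import defaultdict

documents = [
    "big data needs big hardware",
    "hadoop moves computation to the data",
]

# Map: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the emitted pairs by key (the word).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)
```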
Questions?
Something to research
What do you think the ChatGPT hardware infrastructure looks like? Amazon data centers? Google data centers?
How was the test?
Paper presentations – Next week (25th Jan 2023)
Group 6: Biswas, S., Wardat, M., & Rajan, H. (2022, May). The art and practice of data science pipelines: A comprehensive study of data science pipelines in theory, in-the-small, and in-the-large. In Proceedings of the 44th International Conference on Software Engineering (pp. 2091-2103).
Group 7: Talib, M. A., Majzoub, S., Nasir, Q., & Jamal, D. (2021). A systematic literature review on hardware implementation of artificial intelligence algorithms. The Journal of Supercomputing, 77(2), 1897-1938.
Group 8: Ngo, V. M., Le-Khac, N. A., & Kechadi, M. (2019, June). Designing and implementing data warehouse for agricultural big data. In International Conference on Big Data (pp. 1-17). Springer, Cham.