Big Data Analytics With Hadoop
Big Data Analytics With Hadoop. Big Data & IoT. Umair Shafique (03246441789), Scholar, MS Information Technology, University of Gujrat.
What is Big Data? 'Big Data' is similar to 'small data' but bigger in size, and because the data is bigger it requires different approaches: new techniques, tools, and architectures, with the aim of solving new problems, or old problems in a better way. Big Data generates value from the storage and processing of very large quantities of digital information that cannot be analyzed with traditional computing techniques.
Big Data Analytics. Big data analytics is the often complex process of examining big data to uncover information such as hidden patterns, correlations, market trends, and customer preferences that can help organizations make informed business decisions. On a broad scale, data analytics technologies and techniques give organizations a way to analyze data sets and gather new information.
Why is big data analytics important? Big data analytics helps organizations harness their data and use it to identify new opportunities. That, in turn, leads to smarter business moves, more efficient operations, higher profits, and happier customers. Businesses that use big data with advanced analytics gain value in many ways, such as: Reducing cost: big data technologies like cloud-based analytics can significantly reduce costs when it comes to storing large amounts of data (for example, in a data lake). Plus, big data analytics helps organizations find more efficient ways of doing business.
Cont… Making faster, better decisions: the speed of in-memory analytics, combined with the ability to analyze new sources of data such as streaming data from IoT, helps businesses analyze information immediately and make fast, informed decisions. Developing and marketing new products and services: being able to gauge customer needs and customer satisfaction through analytics empowers businesses to give customers what they want, when they want it. With big data analytics, more companies have an opportunity to develop innovative new products to meet customers' changing needs.
Hadoop. Hadoop is an open-source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes. Instead of using one large computer to store and process the data, Hadoop allows multiple computers to be clustered so that massive datasets can be analyzed in parallel, more quickly. It is a flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware.
Features of Hadoop: Hadoop is open source; a Hadoop cluster is highly scalable; Hadoop provides fault tolerance; Hadoop provides high availability; Hadoop is very cost-effective; Hadoop is fast at data processing; Hadoop provides feasibility.
Hadoop Architecture. The Hadoop architecture mainly consists of 4 components: MapReduce, HDFS, YARN, and Hadoop Common (the common utilities).
1. MapReduce: the processing/computation layer. MapReduce is a method for distributing a task across multiple nodes; each node processes data stored on that node. It consists of two phases written by the developer, Map and Reduce; in between Map and Reduce comes the shuffle and sort step.
MapReduce. Map: the Map function always runs first and is typically used to filter, transform, or parse the data; the output from Map becomes the input to Reduce. Reduce: the Reduce function is optional and is normally used to summarize the data produced by the Map function.
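To make the two phases concrete, below is the classic word-count job written against Hadoop's Java MapReduce API (a minimal sketch; the input and output directory names are passed on the command line and are only illustrative). The Mapper parses each line and emits (word, 1) pairs, the shuffle-and-sort step groups them by word, and the Reducer sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs first, parses each input line and emits (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: after shuffle and sort, receives all counts for one word
  // and sums them to produce the final total.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, such a job is typically launched with the standard hadoop jar command, e.g. hadoop jar wordcount.jar WordCount /input /output (directory names illustrative).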
2. HDFS (Hadoop Distributed File System): the storage layer. The Hadoop Distributed File System (HDFS) is Hadoop's storage layer. Data is housed on multiple servers: it is divided into blocks based on file size, and these blocks are then distributed and stored across the slave machines. Each block contains 128 MB of data by default and is replicated three times by default. Replication operates under two rules: two copies of the same block cannot be placed on the same DataNode, and when a cluster is rack-aware, all the replicas of a block cannot be placed on the same rack.
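As an illustration of these two rules: with the default rack-aware placement policy, Hadoop typically writes the first replica of a block on the client's own DataNode (or a random one), the second replica on a DataNode in a different rack, and the third on a different DataNode in that second rack, so no DataNode holds two copies of the same block and the copies never all sit in one rack.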
Example (HDFS)
Components of HDFS. There are two components of HDFS: 1) NameNode (master node) and 2) DataNode (slave node).
Components of HDFS. NameNode: the NameNode works as the master in a Hadoop cluster and guides the DataNodes (slaves). The NameNode is mainly used for storing the metadata, i.e. the data about the data. Metadata includes the transaction logs that keep track of activity in the Hadoop cluster, as well as the name of each file, its size, and location information (block numbers, block IDs) about the DataNodes, which the NameNode uses to find the closest DataNode for faster communication. The NameNode instructs the DataNodes to carry out operations such as delete, create, and replicate.
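As a hypothetical example, the NameNode's metadata for a 300 MB file /logs/access.log would record the file name, its size, its replication factor, and the IDs of its three blocks, together with the DataNodes currently holding each replica of those blocks.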
Components of HDFS. DataNode: DataNodes work as slaves. DataNodes are mainly used for storing the data in a Hadoop cluster; the number of DataNodes can range from 1 to 500 or even more. The more DataNodes there are, the more data the Hadoop cluster can store, so it is advisable that DataNodes have high storage capacity to hold a large number of file blocks.
File Blocks in HDFS. Data in HDFS is always stored in terms of blocks: a single file is divided into multiple blocks of 128 MB each by default, a size that can also be changed manually.
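As a worked example, a 500 MB file stored with the default 128 MB block size is split into four blocks: three full 128 MB blocks and a final block of 116 MB (HDFS does not pad the last block out to the full block size).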
Replication in HDFS. Replication ensures the availability of the data. Replication means making copies of something, and the number of copies made is expressed as the replication factor. As we saw with file blocks, HDFS stores data as a set of blocks, and Hadoop is also configured to make copies of those file blocks. By default, the replication factor in Hadoop is set to 3, and this can be configured.
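Continuing the worked example above: with the default replication factor of 3, the four blocks of that 500 MB file are stored as twelve block replicas, so the file consumes roughly 1.5 GB of raw cluster capacity. The factor can be changed cluster-wide (the dfs.replication property) or per file, for example with the hdfs dfs -setrep command or FileSystem.setReplication() in the Java API.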
Read Operation in HDFS. A data read request is served by HDFS, the NameNode, and the DataNodes.
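A minimal sketch of a read using Hadoop's Java client API is shown below: opening the file causes the client to ask the NameNode for the block locations, and the bytes are then streamed directly from the DataNodes. The path /user/demo/input.txt is only illustrative.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);       // client handle to HDFS
    // Opening the file asks the NameNode for block locations;
    // the actual bytes are then streamed directly from the DataNodes.
    try (FSDataInputStream in = fs.open(new Path("/user/demo/input.txt"));
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}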
Write Operation In HDFS
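The corresponding write path, again as a minimal sketch with an illustrative file path: creating the file registers it with the NameNode, and the client then streams each block to a pipeline of DataNodes, which replicate it according to the configured replication factor.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Creating the file registers it with the NameNode; as data is written,
    // the client streams each block to a pipeline of DataNodes, which
    // replicate it according to the configured replication factor.
    try (FSDataOutputStream out = fs.create(new Path("/user/demo/output.txt"))) {
      out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
    }
  }
}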
3. Hadoop YARN. Hadoop YARN (Yet Another Resource Negotiator) is the cluster resource management layer of Hadoop and is responsible for resource allocation and job scheduling. The purpose of the job scheduler is to divide a big task into small jobs so that each job can be assigned to various slaves in the Hadoop cluster and processing can be maximized. The job scheduler also keeps track of which job is important, which job has higher priority, dependencies between jobs, and other information such as job timing. The resource manager's role is to manage all the resources that are made available for running the Hadoop cluster.
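For example, a MapReduce job with a few hundred map tasks would be broken into that many containers (each a slice of memory and CPU, say 1 GB and 1 vcore, requested by the application), and YARN schedules those containers onto whichever slave nodes have free capacity.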
Elements of YARN. The elements of YARN include: the ResourceManager (one per cluster), the ApplicationMaster (one per application), and the NodeManagers (one per node).
Elements of YARN. 1. ResourceManager: the ResourceManager manages resource allocation in the cluster and is responsible for tracking how many resources are available in the cluster and each NodeManager's contribution. It has two main components. Scheduler: allocates resources to the various running applications and schedules them based on each application's requirements; it does not monitor or track the status of the applications. Application Manager: accepts job submissions from clients and monitors and restarts ApplicationMasters in case of failure.
Elements of YARN. 2. ApplicationMaster: the ApplicationMaster manages the resource needs of an individual application and interacts with the scheduler to acquire the required resources. It connects with the NodeManagers to execute and monitor tasks. 3. NodeManager: the NodeManager tracks running jobs and sends signals (or heartbeats) to the ResourceManager to relay the status of its node. It also monitors each container's resource utilization.
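To see these elements from the client side, the sketch below uses the YarnClient Java API to connect to the ResourceManager and list the applications it is currently tracking, one report per application (each application having its own ApplicationMaster).

import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
  public static void main(String[] args) throws Exception {
    // YarnClient talks to the ResourceManager, the cluster-wide resource authority.
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // Each report describes one application known to the ResourceManager.
    List<ApplicationReport> apps = yarnClient.getApplications();
    for (ApplicationReport app : apps) {
      System.out.println(app.getApplicationId() + "  " + app.getName()
          + "  state=" + app.getYarnApplicationState());
    }
    yarnClient.stop();
  }
}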
4. Hadoop Common (Common Utilities). Hadoop Common, or the common utilities, are the Java libraries and Java files needed by all the other components present in a Hadoop cluster. These utilities are used by HDFS, YARN, and MapReduce for running the cluster. Hadoop Common assumes that hardware failure in a Hadoop cluster is common, so failures need to be handled automatically in software by the Hadoop framework.
Advantages of Hadoop
Advantages of Hadoop. 1. Varied Data Sources: Hadoop accepts a variety of data. Data can come from a range of sources, such as email conversations and social media, and can be structured or unstructured. Hadoop can derive value from diverse data and can accept data as text files, XML files, images, CSV files, etc. 2. Cost-Effective: Hadoop is an economical solution as it uses a cluster of commodity hardware to store data. Commodity hardware consists of cheap machines, so the cost of adding nodes to the framework is not very high. In Hadoop 3.0 there is only 50% storage overhead, as opposed to 200% in Hadoop 2.x; fewer machines are needed to store the data because the amount of redundant data has decreased significantly. 3. Performance: Hadoop, with its distributed processing and distributed storage architecture, processes huge amounts of data at high speed. In 2008, Hadoop even beat supercomputers in a well-known sorting benchmark. It divides the input data file into a number of blocks and stores the data in these blocks over several nodes. It also divides the task the user submits into various sub-tasks, which are assigned to the worker nodes containing the required data; these sub-tasks run in parallel, thereby improving performance.
Advantages of Hadoop. 4. Fault Tolerant: In Hadoop 3.0, fault tolerance is provided by erasure coding. For example, 6 data blocks produce 3 parity blocks with the erasure coding technique, so HDFS stores a total of these 9 blocks. If any node fails, the affected data blocks can be recovered from these parity blocks and the remaining data blocks. 5. Highly Available: In Hadoop 2.x, the HDFS architecture has a single active NameNode and a single standby NameNode, so if a NameNode goes down there is a standby NameNode to fall back on. Hadoop 3.0 supports multiple standby NameNodes, making the system even more highly available, as it can continue functioning even if two or more NameNodes crash. 6. Low Network Traffic: In Hadoop, each job submitted by the user is split into a number of independent sub-tasks and these sub-tasks are assigned to the DataNodes, thereby moving a small amount of code to the data rather than moving huge amounts of data to the code, which leads to low network traffic. 7. High Throughput: Throughput means work done per unit time. Hadoop stores data in a distributed fashion, which allows distributed processing to be used with ease. A given job gets divided into small jobs which work on chunks of data in parallel, thereby giving high throughput.
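To make the overhead figures concrete: with 3-way replication, 6 blocks of data occupy 18 blocks of raw storage, an extra 200%; with the 6-data-plus-3-parity erasure coding scheme above, the same 6 data blocks need only 3 extra parity blocks, 9 in total, an extra 50%, while still tolerating the loss of any 3 of the 9 blocks.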
Advantages of Hadoop. 8. Open Source: Hadoop is an open-source technology, i.e. its source code is freely available. The source code can be modified to suit a specific requirement. 9. Scalable: Hadoop works on the principle of horizontal scalability, i.e. whole machines are added to the cluster of nodes rather than changing the configuration of a machine, such as adding RAM or disks, which is known as vertical scalability. Nodes can be added to a Hadoop cluster on the fly, making it a scalable framework. 10. Ease of Use: The Hadoop framework takes care of parallel processing; MapReduce programmers do not need to worry about achieving distributed processing, as it is done at the backend automatically. 11. Compatibility: Most emerging Big Data technologies, such as Spark and Flink, are compatible with Hadoop; their processing engines can work over Hadoop as a backend, i.e. Hadoop serves as the data storage platform for them. 12. Multiple Languages Supported: Developers can code on Hadoop using many languages, such as C, C++, Perl, Python, Ruby, and Groovy.
Disadvantages of Hadoop. 1. Issues With Small Files: Hadoop is suitable for a small number of large files, but when it comes to applications that deal with a large number of small files, Hadoop falls short. A small file is nothing but a file that is significantly smaller than Hadoop's block size (128 MB by default, often configured as 256 MB). A large number of small files overloads the NameNode, since it stores the namespace for the whole system, and makes it difficult for Hadoop to function. 2. Vulnerable by Nature: Hadoop is written in Java, a widely used programming language, so it is more easily exploited by cyber criminals, which makes Hadoop vulnerable to security breaches. 3. Processing Overhead: In Hadoop, data is read from disk and written back to disk, which makes read/write operations very expensive when dealing with terabytes and petabytes of data. Hadoop cannot do in-memory computation, so it incurs processing overhead.
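As a rough illustration of the small-files issue: the NameNode keeps an in-memory object for every file, directory, and block, commonly estimated at roughly 150 bytes each, so 10 million small files (each occupying at least one block) mean on the order of 20 million objects and around 3 GB of NameNode heap, regardless of how little data those files actually hold.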
Disadvantages of Hadoop. 4. Supports Only Batch Processing: At its core, Hadoop has a batch processing engine, which is not efficient at stream processing. It cannot produce output in real time with low latency; it only works on data that is collected and stored in files in advance, before processing. 5. Iterative Processing: Hadoop cannot do iterative processing by itself. Machine learning and other iterative workloads have a cyclic data flow, whereas Hadoop has data flowing in a chain of stages where the output of one stage becomes the input of the next. 6. Security: For security, Hadoop uses Kerberos authentication, which is hard to manage. It lacks encryption at the storage and network levels, which is a major point of concern.