BIG DATA AND ANALYTICS (KDS-601)
Hadoop Ecosystem
Presented By: Ms. Neeharika Tripathi, Assistant Professor, Department of Computer Science and Engineering, Ajay Kumar Garg Engineering College, Ghaziabad
Introduction to HDFS
When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines. File systems that manage the storage across a network of machines are called distributed filesystems. Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem.
The Design of HDFS
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
Very large files: "Very large" in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.
Streaming data access: HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from a source, and then various analyses are performed on that dataset over time.
Commodity hardware: Hadoop doesn't require expensive, highly reliable hardware to run on. It is designed to run on clusters of commodity hardware (commonly available hardware from multiple vendors) for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure.
These are areas where HDFS is not a good fit today:
Low-latency data access: Applications that require low-latency access to data, in the tens of milliseconds range, will not work well with HDFS.
Lots of small files: Since the NameNode holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the NameNode.
Multiple writers, arbitrary file modifications: Files in HDFS may be written to by a single writer. Writes are always made at the end of the file. There is no support for multiple writers, or for modifications at arbitrary offsets in the file.
HDFS Architecture
Hadoop Distributed File System follows the master-slave architecture. Each cluster comprises a single master node and multiple slave nodes. Internally, the files get divided into one or more blocks, and each block is stored on different slave machines depending on the replication factor. The master node stores and manages the filesystem namespace, that is, information about the blocks of files such as block locations, permissions, etc. The slave nodes store the data blocks of files. The master node is the NameNode, and the DataNodes are the slave nodes.
HDFS NameNode
The NameNode maintains and manages the filesystem namespace and provides the right access permissions to clients. It stores information about block locations, permissions, etc. on the local disk in the form of two files:
Fsimage: Fsimage stands for File System image. It contains the complete namespace of the Hadoop filesystem since the NameNode's creation.
Edit log: It contains all the recent changes made to the filesystem namespace since the most recent Fsimage.
Functions of HDFS NameNode
It executes filesystem namespace operations such as opening, renaming, and closing files and directories.
The NameNode manages and maintains the DataNodes.
It determines the mapping of blocks of a file to DataNodes.
The NameNode records each change made to the filesystem namespace.
It keeps the locations of each block of a file.
The NameNode takes care of the replication factor of all the blocks.
The NameNode receives heartbeats and block reports from all DataNodes, which confirm that the DataNodes are alive.
If a DataNode fails, the NameNode chooses new DataNodes for new replicas.
HDFS DataNode
DataNodes are the slave nodes in Hadoop HDFS. DataNodes run on inexpensive commodity hardware. They store the blocks of a file.
Functions of DataNode
The DataNode is responsible for serving client read/write requests.
Based on instructions from the NameNode, DataNodes perform block creation, replication, and deletion.
DataNodes send heartbeats to the NameNode to report the health of HDFS.
DataNodes also send block reports to the NameNode to report the list of blocks they contain.
Secondary NameNode
Apart from the DataNode and NameNode, there is another daemon called the Secondary NameNode. The Secondary NameNode works as a helper node to the primary NameNode but doesn't replace the primary NameNode. When the NameNode starts, it merges the Fsimage and edit log files to restore the current filesystem namespace. Since the NameNode runs continuously for a long time without any restart, the size of the edit logs becomes too large, which would result in a long restart time for the NameNode. The Secondary NameNode solves this issue: it downloads the Fsimage and edit log files from the NameNode, periodically applies the edit logs to the Fsimage, and refreshes the edit logs. The updated Fsimage is then sent back to the NameNode so that the NameNode doesn't have to re-apply the edit log records during its restart. This keeps the edit log size small and reduces the NameNode restart time. If the NameNode fails, the last saved Fsimage on the Secondary NameNode can be used to recover the filesystem metadata. The Secondary NameNode performs regular checkpoints in HDFS.
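To make the checkpoint idea concrete, here is a minimal toy sketch of merging an Fsimage (a namespace snapshot) with an edit log (a list of recent namespace changes). The dictionary layout and operation names are invented for illustration; they are not HDFS's actual on-disk formats.

```python
# Toy illustration of the checkpoint idea: apply an edit log to an
# Fsimage snapshot to produce a new, up-to-date Fsimage.
# All structures and operation names here are hypothetical.

fsimage = {"/data/file1": {"replication": 3}, "/data/file2": {"replication": 3}}

edit_log = [
    ("create", "/data/file3", {"replication": 2}),
    ("delete", "/data/file1", None),
    ("rename", "/data/file2", "/data/archive/file2"),
]

def checkpoint(image, edits):
    """Merge the edit log into the Fsimage and return the new image."""
    merged = dict(image)
    for op, path, arg in edits:
        if op == "create":
            merged[path] = arg
        elif op == "delete":
            merged.pop(path, None)
        elif op == "rename":
            merged[arg] = merged.pop(path)
    return merged

new_fsimage = checkpoint(fsimage, edit_log)
print(new_fsimage)
# After the checkpoint, the edit log can be truncated and the new Fsimage
# shipped back to the NameNode, which keeps NameNode restarts fast.
```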
Checkpoint Node
The Checkpoint node is a node that periodically creates checkpoints of the namespace. The Checkpoint node in Hadoop first downloads the Fsimage and edits from the active NameNode, then merges them (Fsimage and edits) locally, and finally uploads the new image back to the active NameNode. It stores the latest checkpoint in a directory that has the same structure as the NameNode's directory. This permits the checkpointed image to be always available for reading by the NameNode if necessary.
Backup Node
A Backup node provides the same checkpointing functionality as the Checkpoint node. In Hadoop, the Backup node keeps an in-memory, up-to-date copy of the filesystem namespace that is always synchronized with the active NameNode state. The Backup node in the HDFS architecture does not need to download the Fsimage and edits files from the active NameNode to create a checkpoint, because it already has an up-to-date state of the namespace in memory. The Backup node's checkpoint process is more efficient, as it only needs to save the namespace into the local Fsimage file and reset the edits. The NameNode supports one Backup node at a time.
Blocks in HDFS Architecture
Internally, HDFS splits a file into block-sized chunks called blocks. The size of a block is 128 MB by default. We can configure the block size as per our requirements by changing the dfs.blocksize property (older name dfs.block.size) in hdfs-site.xml. For example, if there is a file of size 612 MB, then HDFS will create four blocks of 128 MB and one block of 100 MB (4 × 128 + 100 = 612).
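A minimal sketch of the block-split arithmetic above; the 612 MB example and the 128 MB default come from the slide, everything else is just illustrative Python.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the list of block sizes HDFS would use for a file of this size."""
    full_blocks = file_size_mb // block_size_mb
    remainder = file_size_mb % block_size_mb
    blocks = [block_size_mb] * full_blocks
    if remainder:
        blocks.append(remainder)  # the last block only occupies the bytes it needs
    return blocks

print(split_into_blocks(612))  # [128, 128, 128, 128, 100]  ->  4*128 + 100 = 612
```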
From the above example, we can conclude that:
A file in HDFS that is smaller than a single block does not occupy a full block's worth of the underlying storage.
Each file stored in HDFS doesn't need to be an exact multiple of the configured block size.
Why are blocks in HDFS so large? The default size of an HDFS data block is 128 MB. The reasons for the large block size are:
To minimize the cost of seeks: with large blocks, the time taken to transfer the data from the disk is significantly longer than the time taken to seek to the start of the block, so transferring a large file made of multiple blocks operates at the disk transfer rate.
If blocks were small, there would be too many blocks in Hadoop HDFS and thus too much metadata to store. Managing such a huge number of blocks and their metadata would create overhead and lead to extra network traffic.
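A back-of-the-envelope calculation of the seek-cost argument above, using assumed figures of a 10 ms seek time and a 100 MB/s transfer rate (illustrative numbers, not measurements of any particular disk).

```python
# Assumed, illustrative disk characteristics.
seek_time_s = 0.010          # 10 ms to position the disk head
transfer_rate_mb_s = 100.0   # 100 MB/s sequential transfer rate

for block_mb in (1, 64, 128):
    transfer_time_s = block_mb / transfer_rate_mb_s
    overhead = seek_time_s / (seek_time_s + transfer_time_s)
    print(f"{block_mb:>3} MB block: seek is {overhead:.1%} of total I/O time")

# With 128 MB blocks the seek is under 1% of the I/O time, so reads proceed
# at close to the raw disk transfer rate; with 1 MB blocks it is about 50%.
```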
Advantages of Hadoop Data Blocks
1. No limitation on the file size: a file can be larger than any single disk in the network.
2. Simplicity of the storage subsystem: since blocks are of a fixed size, we can easily calculate the number of blocks that can be stored on a given disk. This keeps the storage subsystem simple.
3. Fit well with replication for providing fault tolerance and high availability: blocks are easy to replicate between DataNodes, thus providing fault tolerance and high availability.
4. Eliminating metadata concerns: since blocks are just chunks of data to be stored, we don't need to store file metadata (such as permission information) with the blocks; another system can handle metadata separately.
HDFS Interfaces
Features of the HDFS interface are:
1. Create new file: click the New button in the HDFS tab, enter the name of the file, and click the Create button in the prompt box to create a new HDFS file.
2. Upload files/folder: you can upload files or folders from your local filesystem to HDFS by clicking the Upload button in the HDFS tab. In the prompt, select file or folder, browse for it, and click Upload.
3. Set permission: click the Permission option in the HDFS tab and a popup window will be displayed. In the popup window, change the permission mode and click the Change button to set the permission for files in HDFS.
4. Copy: select the file or folder from the HDFS directory and click the Copy button in the HDFS tab; in the prompt displayed, enter or browse for the target directory and click Copy to copy the selected HDFS item.
5. Move: select the file or folder from the HDFS directory and click the Move button in the HDFS tab; in the prompt displayed, enter or browse for the target directory and click Move to move the selected HDFS item.
6. Rename: select the file or folder from the HDFS directory, click the Rename button in the HDFS tab, enter the new name, and click Apply to rename it in HDFS.
7. Delete: select the file or folder from the HDFS directory and click Delete to perform the delete operation.
8. Drag and drop: select the file or folder from the HDFS directory and drag it to another folder in the HDFS directory or to Windows Explorer, and vice versa.
9. HDFS file viewer: you can view the content of HDFS files directly by double-clicking the file.
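The same file operations are also available from the hdfs dfs command line; the short sketch below wraps a few of those standard commands from Python. The paths are hypothetical, the target directories are assumed to exist, and a configured Hadoop client on the local machine is assumed.

```python
import subprocess

def hdfs(*args):
    """Run an 'hdfs dfs' command and fail loudly if it returns non-zero."""
    subprocess.run(["hdfs", "dfs", *args], check=True)

# Hypothetical paths, assuming a configured Hadoop client on this machine.
hdfs("-mkdir", "-p", "/user/demo/input")                   # create a directory
hdfs("-put", "localfile.txt", "/user/demo/input")          # upload a local file
hdfs("-chmod", "644", "/user/demo/input/localfile.txt")    # set permission
hdfs("-cp", "/user/demo/input/localfile.txt", "/user/demo/backup/")   # copy
hdfs("-mv", "/user/demo/backup/localfile.txt", "/user/demo/archive/") # move/rename
hdfs("-cat", "/user/demo/input/localfile.txt")             # view file content
hdfs("-rm", "/user/demo/input/localfile.txt")              # delete
```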
Data Flow
In HDFS, data is stored in the form of blocks and distributed across multiple nodes in a cluster. The data flow in HDFS can be described as follows:
Data ingestion: data is first ingested into the HDFS cluster through the Hadoop filesystem API or tools like Flume, Sqoop, or Kafka.
Data storage: once the data is ingested, it is stored in the form of blocks across multiple nodes in the cluster. Each block is replicated across multiple nodes to ensure data availability and fault tolerance.
Data retrieval: when a user or application needs to access the data stored in HDFS, a request is sent to the NameNode, which is the master node in the cluster. The NameNode maintains the metadata about the location of the blocks and sends the request to the DataNodes that hold the requested blocks.
Data processing: the data can be processed in Hadoop MapReduce or in other distributed processing frameworks like Apache Spark, Apache Hive, or Apache Pig. The processing is distributed across the nodes in the cluster and can be parallelized to speed it up.
Data output: the output of the processing is written back to HDFS or to other data stores like HBase or Cassandra. The output can also be sent to external systems through tools like Flume or Kafka.
In summary, data flows into HDFS through ingestion, is stored in the form of blocks across multiple nodes, is retrieved through the NameNode, is processed in distributed processing frameworks, and is output to HDFS or other data stores.
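A compact sketch of this flow using PySpark: read a file from HDFS, process it in a distributed way, and write the result back. The HDFS URI, host name, and paths are placeholders, and a running cluster is assumed.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Placeholder HDFS URI and paths; a running HDFS/Spark cluster is assumed.
spark = SparkSession.builder.appName("hdfs-data-flow").getOrCreate()

# Retrieval: the NameNode resolves block locations, the DataNodes serve the bytes.
lines = spark.read.text("hdfs://namenode:8020/data/events/input.txt")

# Processing: a simple distributed transformation (count lines per first word).
counts = (lines
          .withColumn("word", F.split(F.col("value"), " ").getItem(0))
          .groupBy("word")
          .count())

# Output: write the result back to HDFS.
counts.write.mode("overwrite").parquet("hdfs://namenode:8020/data/events/output")

spark.stop()
```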
Hadoop Ecosystem
The Hadoop Ecosystem is a platform or a suite which provides various services to solve big data problems. It includes Apache projects and various commercial tools and solutions. There are four major elements of Hadoop: HDFS, MapReduce, YARN, and Hadoop Common Utilities.
Components of the ecosystem
The Hadoop Ecosystem is neither a programming language nor a service; it is a platform or framework which solves big data problems. Below are the Hadoop components that together form the Hadoop ecosystem:
HDFS -> Hadoop Distributed File System
YARN -> Yet Another Resource Negotiator
MapReduce -> Data processing using programming
Spark -> In-memory data processing
Pig, Hive -> Data processing services using query (SQL-like)
HBase -> NoSQL database
Mahout, Spark MLlib -> Machine learning
Apache Drill -> SQL on Hadoop
ZooKeeper -> Managing the cluster
Oozie -> Job scheduling
Flume, Sqoop -> Data ingestion services
Solr & Lucene -> Searching & indexing
Ambari -> Provision, monitor and maintain the cluster
YARN – Yet Another Resource Negotiator
Hadoop 2.0 provides YARN APIs so that other frameworks can be written to run on top of HDFS. This enables running non-MapReduce big data applications on Hadoop. Spark, MPI, Giraph, and HAMA are a few of the applications written or ported to run within YARN.
YARN – Resource Manager
It takes care of the entire life cycle of a job, from scheduling to successful completion (scheduling and monitoring). It also has to maintain resource information on each of the nodes, such as the number of map and reduce slots available on the DataNodes. The Next Generation MapReduce framework (MRv2) is an application framework that runs within YARN. The new MRv2 framework divides the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate components. The new ResourceManager manages the global assignment of compute resources to applications, while the per-application ApplicationMaster manages the application's scheduling and coordination.
YARN provides better resource management in Hadoop, resulting in improved cluster efficiency and application performance. This not only improves MapReduce data processing but also enables Hadoop usage in other data processing applications. YARN's execution model is more generic than the earlier MapReduce implementation in Hadoop 1.0. YARN can run applications that do not follow the MapReduce model, unlike the original Apache Hadoop MapReduce (also called MRv1). YARN is the resource management framework that provides the infrastructure and APIs to facilitate the request for, allocation of, and scheduling of cluster resources. As explained earlier, MRv2 is an application framework that runs within YARN.
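As one illustration of running an application under YARN's resource management, the sketch below submits a Spark job with spark-submit. The script name, input/output paths, and resource sizes are hypothetical; spark-submit on the PATH and a configured HADOOP_CONF_DIR are assumed.

```python
import subprocess

# Hypothetical application and resource settings; spark-submit and a
# configured Hadoop/YARN client are assumed to be available.
subprocess.run([
    "spark-submit",
    "--master", "yarn",            # let YARN's ResourceManager allocate resources
    "--deploy-mode", "cluster",    # the ApplicationMaster runs inside the cluster
    "--num-executors", "4",
    "--executor-memory", "2g",
    "--executor-cores", "2",
    "wordcount.py",                # hypothetical PySpark application
    "hdfs:///data/input", "hdfs:///data/output",
], check=True)
```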
Introduction to NoSQL
NoSQL is a type of database management system (DBMS) that is designed to handle and store large volumes of unstructured and semi-structured data. A NoSQL database is a non-relational data management system that does not require a fixed schema. NoSQL is used for big data and real-time web apps; for example, companies like Twitter, Facebook, and Google collect terabytes of user data every single day. NoSQL stands for "Not Only SQL" or "Not SQL." NoSQL databases are often used in applications where there is a high volume of data that needs to be processed and analyzed in real time, such as social media analytics, e-commerce, and gaming. They can also be used for other applications, such as content management systems, document management, and customer relationship management.
NoSQL databases are generally classified into four main categories:
Document databases: these databases store data as semi-structured documents, such as JSON or XML, and can be queried using document-oriented query languages.
Key-value stores: these databases store data as key-value pairs, and are optimized for simple and fast read/write operations.
Column-family stores: these databases store data as column families, which are sets of columns that are treated as a single entity. They are optimized for fast and efficient querying of large amounts of data.
Graph databases: these databases store data as nodes and edges, and are designed to handle complex relationships between data.
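The plain-Python structures below sketch, side by side, how the same "user" record might be shaped under each of the four models. They are conceptual illustrations only, not the storage formats of any particular database, and all names are made up.

```python
# Document model: a self-contained, JSON-like document.
document = {"_id": "u42", "name": "Asha", "email": "asha@example.com",
            "orders": [{"item": "book", "qty": 2}]}

# Key-value model: an opaque value looked up by a single key.
key_value = {"user:u42": '{"name": "Asha", "email": "asha@example.com"}'}

# Column-family model: a row key mapping to named column families.
column_family = {"u42": {"profile": {"name": "Asha", "email": "asha@example.com"},
                         "activity": {"last_login": "2024-06-01"}}}

# Graph model: nodes plus edges describing relationships.
nodes = [{"id": "u42", "label": "User"}, {"id": "p7", "label": "Product"}]
edges = [{"from": "u42", "to": "p7", "type": "PURCHASED"}]
```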
Features of NoSQL
Non-relational:
NoSQL databases never follow the relational model
Never provide tables with flat fixed-column records
Work with self-contained aggregates or BLOBs
Don't require object-relational mapping and data normalization
No complex features like query languages, query planners, referential-integrity joins, ACID
Schema-free:
NoSQL databases are either schema-free or have relaxed schemas
Do not require any sort of definition of the schema of the data
Offer heterogeneous structures of data in the same domain
Introduction – Spark
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It is the engine used to run Spark applications. Spark is written in Scala but provides rich high-level APIs in Scala, Java, Python, and R. Spark can run workloads up to 100 times faster than Hadoop MapReduce when processing data in memory, and about 10 times faster when accessing data from disk. It can be integrated with Hadoop and can process existing Hadoop HDFS data.
History of Apache Spark
Apache Spark was introduced in 2009 in a UC Berkeley R&D lab that later became the AMPLab. It was open-sourced in 2010 under a BSD license. In 2013, Spark was donated to the Apache Software Foundation, where it became a top-level Apache project in 2014.
MongoDB: An Introduction
MongoDB, the most popular NoSQL database, is an open-source, document-oriented database. The term 'NoSQL' means 'non-relational': MongoDB isn't based on the table-like relational database structure but provides an altogether different mechanism for the storage and retrieval of data. MongoDB is available under a General Public license for free, and it is also available under a commercial license from the manufacturer. The manufacturing company, 10gen, has defined MongoDB as: "MongoDB is a scalable, open source, high performance, document-oriented database." – 10gen. MongoDB was designed to work with commodity servers. Now it is used by companies of all sizes, across all industries.
MongoDB is a NoSQL database that scales by adding more and more servers, and it increases productivity with its flexible document model.
Features of MongoDB
1. Support for ad hoc queries: in MongoDB, you can search by field or by range query, and it also supports regular-expression searches.
2. Indexing: without indexing, a database would have to scan every document of a collection to select those that match the query, which would be inefficient. So, for efficient searching, indexing is a must, and MongoDB uses it to process huge volumes of data in very little time.
3. Replication and high availability: MongoDB increases data availability with multiple copies of data on different servers. By providing redundancy, it protects the database from hardware failures. If one server goes down, the data can be retrieved easily from other active servers which also have the data stored on them.
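A brief PyMongo sketch of the first two features, ad hoc queries (field, range, and regular-expression matches) and index creation. The connection string, database, collection, and field names are placeholders; replication is configured on the server side rather than in application code.

```python
from pymongo import MongoClient

# Placeholder connection string and names; a running mongod is assumed.
client = MongoClient("mongodb://localhost:27017")
products = client["shop"]["products"]

# Ad hoc queries: by field, by range, and by regular expression.
by_field = products.find({"category": "laptop"})
by_range = products.find({"price": {"$gte": 500, "$lte": 1500}})
by_regex = products.find({"name": {"$regex": "^Pro", "$options": "i"}})

# Indexing: avoid a full collection scan on frequently queried fields.
products.create_index("category")
products.create_index([("price", 1)])

for doc in by_range.limit(5):
    print(doc)
```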
Where do we use MongoDB?
MongoDB is preferred over an RDBMS in the following scenarios:
Big data: if you have a huge amount of data to be stored in tables, think of MongoDB before RDBMS databases. MongoDB has a built-in solution for partitioning and sharding your database.
Unstable schema: adding a new column in an RDBMS is hard, whereas MongoDB is schema-less. Adding a new field does not affect old documents and is very easy.
Distributed data: since multiple copies of data are stored across different servers, recovery of data is instant and safe even if there is a hardware failure.
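A small sketch of the "unstable schema" point above: documents in the same collection can carry different fields, so adding a field does not require migrating old documents. The connection string and names are placeholders.

```python
from pymongo import MongoClient

# Placeholder names; a running mongod is assumed.
users = MongoClient("mongodb://localhost:27017")["app"]["users"]

# Older documents were written without a 'loyalty_tier' field.
users.insert_one({"name": "Ravi", "email": "ravi@example.com"})

# Newer documents simply include the extra field; no ALTER TABLE, no migration.
users.insert_one({"name": "Meera", "email": "meera@example.com",
                  "loyalty_tier": "gold"})

# Old documents are untouched and still query normally.
print(users.count_documents({"loyalty_tier": {"$exists": False}}))
```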
Language support by MongoDB
MongoDB currently provides official driver support for all popular programming languages like C, C++, Rust, C#, Java, Node.js, Perl, PHP, Python, Ruby, Scala, Go, and Erlang.
Installing MongoDB
Just go to http://www.mongodb.org/downloads and select your operating system out of Windows, Linux, Mac OS X, and Solaris. A detailed explanation of the installation of MongoDB is given on their site.
Spark Architecture
Spark follows the master-slave architecture. Its cluster consists of a single master and multiple slaves. The Spark architecture depends upon two abstractions:
Resilient Distributed Dataset (RDD)
Directed Acyclic Graph (DAG)
Resilient Distributed Datasets (RDD)
The RDD is the key tool for data computation. It acts as an interface for immutable data and helps in recomputing data in the event of a failure; it is Spark's core data structure. There are two kinds of operations on RDDs: transformations and actions. Resilient Distributed Datasets are groups of data items that can be stored in memory on worker nodes. Here, Resilient means the data is restored on failure, Distributed means the data is distributed among different nodes, and Dataset means a group of data.
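A minimal PySpark sketch of the two kinds of RDD operations named above: transformations (which build new RDDs lazily) and actions (which trigger computation). It runs in local mode and the data is made up.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")  # local mode, just for illustration

# An RDD: an immutable, partitioned collection distributed over workers.
numbers = sc.parallelize(range(1, 11), numSlices=4)

# Transformations are lazy: they only describe new RDDs.
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Actions trigger the actual distributed computation.
print(evens.collect())                      # [4, 16, 36, 64, 100]
print(evens.count())                        # 5
print(squares.reduce(lambda a, b: a + b))   # 385

sc.stop()
```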
Directed Acyclic Graph (DAG)
The driver converts the program into a DAG for each job. A DAG is a finite directed graph that performs a sequence of computations on data: each node is an RDD partition, and each edge is a transformation on top of the data. Here, 'graph' refers to the navigation of the data, whereas 'directed' and 'acyclic' refer to how it is done (the edges have a direction and never form a cycle). The DAG records the sequence of connections (dependencies) between nodes, and the SparkContext can be used to run or cancel jobs, where a task is a unit of work and a job is a computation made up of tasks. The Apache Spark ecosystem includes various components such as the Spark Core API, Spark SQL, Spark Streaming for real-time processing, MLlib, and GraphX, and volumes of data can be explored interactively using the Spark shell.
Driver Program
The driver program is a process that runs the main() function of the application and creates the SparkContext object. The purpose of the SparkContext is to coordinate the Spark applications, which run as independent sets of processes on a cluster. To run on a cluster, the SparkContext connects to a cluster manager and then performs the following tasks: it acquires executors on nodes in the cluster; it sends your application code to the executors (here, the application code can be defined by JAR or Python files passed to the SparkContext); and, at last, the SparkContext sends tasks to the executors to run.
Cluster Manager
The role of the cluster manager is to allocate resources across applications. Spark is capable of running on a large number of clusters. There are various types of cluster managers, such as Hadoop YARN, Apache Mesos, and the Standalone Scheduler. Here, the Standalone Scheduler is a standalone Spark cluster manager that makes it possible to install Spark on an empty set of machines.
Worker Node
The worker node is a slave node. Its role is to run the application code in the cluster.
Executor
An executor is a process launched for an application on a worker node. It runs tasks and keeps data in memory or on disk storage across them. It reads and writes data to external sources. Every application has its own executors.
Task
A task is a unit of work that is sent to one executor.
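To tie these roles together, the sketch below shows how a driver program picks a cluster manager via the master URL and requests executor resources through standard Spark configuration keys. The master URL, host names, and resource sizes are placeholders.

```python
from pyspark.sql import SparkSession

# Placeholder master URL: "yarn", "mesos://host:5050", "spark://host:7077"
# (standalone), or "local[*]" are the usual choices.
spark = (SparkSession.builder
         .appName("architecture-demo")
         .master("spark://master-node:7077")       # standalone cluster manager
         .config("spark.executor.instances", "3")  # executors launched on worker nodes
         .config("spark.executor.memory", "2g")    # memory kept per executor
         .config("spark.executor.cores", "2")      # tasks each executor can run in parallel
         .getOrCreate())

# The driver (this process) splits each job into tasks and ships them to executors.
rdd = spark.sparkContext.parallelize(range(100), 6)  # 6 partitions -> 6 tasks per stage
print(rdd.sum())  # 4950

spark.stop()
```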