GFS & HDFS Introduction


About This Presentation

This presentation gives a basic introduction to the Google File System (GFS) and the Hadoop Distributed File System (HDFS).


Slide Content

Introduction to Google File System (GFS) 17MX105 G.HARIHARAN

Introduction: Google is a multi-billion-dollar company and one of the big power players on the World Wide Web and beyond. The company relies on a distributed computing system to provide users with the infrastructure they need to access, create and alter data. Distributed file system: a distributed file system (DFS) is a file system whose data is stored on servers. The servers allow client users to share files and store data just as if they were storing the information locally; however, the servers retain full control over the data and grant access to the clients.

Intro (continued): The machines that power Google's operations aren't cutting-edge, powerful computers; they're relatively inexpensive machines running the Linux operating system. Google uses GFS to organize and manipulate huge files. GFS is unique to Google and is not for sale, but it could serve as a model for other file systems with similar needs.

How GFS works? GFS gives users access to the basic file commands. These include open, create, read, write and close, along with the special commands append and snapshot. Append allows clients to add information to an existing file without overwriting previously written data. Snapshot creates a quick copy of a file or directory tree. Files in GFS tend to be very large, usually in the multi-gigabyte (GB) range, so naively accessing and manipulating files that large would take up a lot of the network's bandwidth.
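As a rough illustration, the client-facing operations described above could be summarized in an interface like the following. This is a hypothetical sketch for orientation only; GfsClient, FileHandle and the method names are invented here and are not Google's actual API.

    // Hypothetical sketch of the client-facing GFS operations described above.
    // GfsClient, FileHandle and the method signatures are invented for illustration.
    public interface GfsClient {
        // Marker for an open file; invented for this sketch.
        interface FileHandle {}

        FileHandle create(String path);                      // create a new file
        FileHandle open(String path);                        // open an existing file
        int read(FileHandle f, long offset, byte[] buf);     // read at an offset
        void write(FileHandle f, long offset, byte[] data);  // write at an offset
        long append(FileHandle f, byte[] data);              // add data without overwriting
                                                             // previously written data
        void snapshot(String srcPath, String dstPath);       // quick copy of a file or
                                                             // directory tree
        void close(FileHandle f);
    }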

Solution: GFS addresses this problem by breaking files up into chunks of 64 megabytes (MB) each. Every chunk receives a unique 64-bit identification number called a chunk handle. Making all chunks the same size simplifies the bookkeeping: using chunk handles, it is easy to track the storage capacity of each computer in the network, so GFS can easily identify which machines are full and which have unused capacity.
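For example, with fixed 64 MB chunks, translating a byte offset in a file into a chunk index is simple integer arithmetic. The sketch below is illustrative only; in the real system, the lookup from (file, chunk index) to chunk handle is done by the master.

    // Illustrative only: fixed-size chunking arithmetic as described above.
    public final class Chunking {
        static final long CHUNK_SIZE = 64L * 1024 * 1024; // 64 MB

        // Which chunk of the file does this byte offset fall into?
        static long chunkIndex(long byteOffset) {
            return byteOffset / CHUNK_SIZE;
        }

        // Offset of the byte within its chunk.
        static long offsetInChunk(long byteOffset) {
            return byteOffset % CHUNK_SIZE;
        }

        public static void main(String[] args) {
            long offset = 200L * 1024 * 1024;              // byte 200 MB into the file
            System.out.println(chunkIndex(offset));        // 3 (the fourth chunk)
            System.out.println(offsetInChunk(offset));     // 8388608 (8 MB into chunk 3)
        }
    }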

Google File System Architecture

Google organized GFS into clusters of computers. Within a GFS cluster there are three kinds of entities: clients, master servers and chunkservers. "Client" refers to any entity that makes a file request. The master server acts as the coordinator and maintains an operation log. The master server also keeps track of metadata, which is the information that describes chunks. There is only one active master server per cluster at any one time.
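A minimal sketch of the metadata the master might keep in memory, assuming two maps: file path to chunk handles, and chunk handle to replica locations. The field names are invented for illustration; the real master also persists namespace changes to the operation log.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch of the master's in-memory metadata described above.
    public class MasterMetadata {
        // File namespace: path -> ordered list of 64-bit chunk handles.
        Map<String, List<Long>> fileToChunks = new HashMap<>();

        // Chunk locations: chunk handle -> chunkservers holding a replica.
        // Learned from chunkserver heartbeats rather than persisted.
        Map<Long, List<String>> chunkToReplicaServers = new HashMap<>();
    }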

How chunkservers work: The master server doesn't actually handle file data; it leaves that up to the chunkservers. The chunkservers don't route chunks through the master server; instead, they send requested chunks directly to the client. GFS copies every chunk multiple times and stores the copies on different chunkservers. Each copy is called a replica. GFS makes three replicas: one primary replica and two secondary replicas.
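As an illustration of the three-way replication, the sketch below picks three distinct chunkservers for a new chunk and designates the first as primary. The random selection policy here is a placeholder; the real master also weighs disk utilization, recent creations and rack placement.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Illustrative replica placement: three distinct chunkservers per chunk,
    // the first designated as primary. Assumes at least three live servers.
    public class ReplicaPlacement {
        static List<String> placeReplicas(List<String> liveChunkservers) {
            List<String> shuffled = new ArrayList<>(liveChunkservers);
            Collections.shuffle(shuffled);
            return shuffled.subList(0, 3); // [primary, secondary, secondary]
        }
    }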

Working: When a client makes a file request, the master server responds with the location of the primary replica of the relevant chunk. By comparing the IP address of the client with those of the chunkservers, the master chooses the chunkserver closest to the client. For a write, the client then sends the data to all the replicas, starting with the closest replica and ending with the furthest one. Once the replicas receive the data, the primary replica assigns consecutive serial numbers to each change to the file; changes are called mutations. The serial numbers ensure every replica applies the mutations in the same order. If a replica fails to apply a mutation, the master server identifies the affected replica as garbage.
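A toy sketch of the serial-numbering step, assuming the primary simply stamps each incoming mutation with the next counter value so every replica can apply mutations in an identical order. The class and method names are invented for illustration.

    import java.util.concurrent.atomic.AtomicLong;

    // Toy sketch: the primary replica stamps each mutation with a serial
    // number so all replicas apply changes in the same order.
    public class PrimaryReplica {
        // One ordered change to a chunk.
        static final class Mutation {
            final long serial;
            final long offset;
            final byte[] data;
            Mutation(long serial, long offset, byte[] data) {
                this.serial = serial;
                this.offset = offset;
                this.data = data;
            }
        }

        private final AtomicLong nextSerial = new AtomicLong(1);

        // Secondaries apply mutations strictly in serial-number order.
        Mutation order(long offset, byte[] data) {
            return new Mutation(nextSerial.getAndIncrement(), offset, data);
        }
    }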

Other functions: To prevent data corruption, GFS uses a system called checksumming. The master server monitors chunks by looking at their checksums. If the checksum of a replica doesn't match the checksum in the master server's memory, the master server deletes the replica and creates a new one to replace it.
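A minimal sketch of checksum verification using a 32-bit CRC (the GFS paper describes 32-bit checksums kept per 64 KB block of a chunk). java.util.zip.CRC32 stands in here for whatever checksum function the real system uses.

    import java.util.zip.CRC32;

    // Minimal sketch: verify a block of chunk data against a stored checksum.
    public class ChecksumCheck {
        static long checksum(byte[] block) {
            CRC32 crc = new CRC32();
            crc.update(block);
            return crc.getValue();
        }

        // If this returns false, the replica is corrupt and should be
        // replaced with a fresh copy from a healthy replica.
        static boolean verify(byte[] block, long storedChecksum) {
            return checksum(block) == storedChecksum;
        }
    }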

Advantages: scalability; cheap hardware. Reference: https://computer.howstuffworks.com/internet/basics/google-file-system5.htm

HDFS - Introduction G.HARIHARAN 17MX105

Introduction: Apache Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware. Hadoop was created by Doug Cutting and Mike Cafarella in 2005. It was originally developed to support distribution for the Nutch search engine project. Doug named the project after his son's toy elephant.

Why HDFS? HDFS has many similarities with other distributed file systems, but differs in several respects:
- HDFS follows a write-once-read-many model (see the sketch after this list), which simplifies data coherency; it targets batch processing rather than interactive access by users.
- Another distinctive attribute of HDFS is that the processing logic is moved close to the data, rather than moving the data to the application space.
- Fault tolerance.
- Data access via MapReduce.
- Portability across heterogeneous commodity hardware and operating systems.
- Scalability to reliably store and process large amounts of data.
- Reduced cost, by distributing data and processing across clusters of commodity personal computers.
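To make the write-once-read-many model concrete, here is a small example using the real Hadoop Java API (org.apache.hadoop.fs.FileSystem). The NameNode address and file path are placeholders; the example assumes a reachable HDFS cluster and the Hadoop client libraries on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.nio.charset.StandardCharsets;

    // Write a file to HDFS once, then read it back. Existing bytes cannot
    // be overwritten in place; only append is supported after creation.
    public class HdfsWriteOnceReadMany {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder address: point this at your own NameNode.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/tmp/example.txt"); // placeholder path
            try (FSDataOutputStream out = fs.create(file)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            } // the file is now immutable except via append()

            byte[] buf = new byte[32];
            try (FSDataInputStream in = fs.open(file)) {
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
            }
        }
    }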

Hadoop Distributed File System (HDFS) vs. Google File System (GFS):
- Platform: HDFS is cross-platform; GFS runs on Linux.
- Language: HDFS is developed in Java; GFS is developed in C/C++.
- Ownership: HDFS was initially developed by Yahoo and is now an open-source framework; GFS was developed by and is still owned by Google.
- Node roles: HDFS has a NameNode and DataNodes; GFS has a master node and chunkservers.
- Default block size: 128 MB in HDFS; 64 MB in GFS.
- Heartbeats: the NameNode receives heartbeats from DataNodes; the master node receives heartbeats from chunkservers.
- Hardware: both run on commodity hardware.
- Access model: HDFS follows the write-once-read-many model; GFS supports a multiple-writer, multiple-reader model.
- Deletion: in HDFS, deleted files are renamed into a particular (trash) folder and then removed via garbage collection; in GFS, deleted files are not reclaimed immediately but are renamed into a hidden namespace and deleted after three days if not in use.
- Logging: HDFS maintains an edit log; GFS maintains an operation log.
- Writes: HDFS allows only appends; GFS allows random file writes.
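As a small illustration of the block-size row above, HDFS's block size is controlled by the dfs.blocksize property (134217728 bytes, i.e. 128 MB, by default in recent Hadoop versions) and can be overridden per client configuration; the 256 MB value below is just an example.

    import org.apache.hadoop.conf.Configuration;

    // Example: reading and overriding the HDFS block size via configuration.
    // dfs.blocksize is a real Hadoop property; 128 MB is the modern default.
    public class BlockSizeExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            long blockSize = conf.getLong("dfs.blocksize", 128L * 1024 * 1024);
            System.out.println("block size: " + blockSize + " bytes");

            // Override to 256 MB for files created with this configuration.
            conf.setLong("dfs.blocksize", 256L * 1024 * 1024);
        }
    }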

References:
https://www.ibm.com/developerworks/library/wa-introhdfs/index.html
https://stackoverflow.com/questions/15675312/why-hdfs-is-write-once-and-read-multiple-times
https://sensaran.wordpress.com/2015/11/24/gfs-vs-hdfs/