This presentation gives a basic introduction to the Google File System (GFS) and the Hadoop Distributed File System (HDFS).
Introduction to Google File System (GFS) 17MX105 G.HARIHARAN
Introduction Google is a multi-billion-dollar company and one of the big power players on the World Wide Web and beyond. The company relies on a distributed computing system to provide users with the infrastructure they need to access, create, and alter data. DISTRIBUTED FILE SYSTEM: A distributed file system (DFS) is a file system whose data is stored on servers. The servers allow client users to share files and store data just as if they were storing the information locally. However, the servers have full control over the data and grant access to the clients.
Intro (continued) The machines that power Google's operations aren't cutting-edge, powerful computers; they're relatively inexpensive machines running the Linux operating system. Google uses GFS to organize and manipulate huge files. GFS is unique to Google and isn't for sale, but it could serve as a model for other file systems with similar needs.
How GFS works? GFS provides users with the basic file commands: open, create, read, write, and close, along with special commands such as append and snapshot. Append allows clients to add information to an existing file without overwriting previously written data. Snapshot is a command that quickly creates a copy of a file or directory tree. GFS files tend to be very large, usually in the multi-gigabyte (GB) range, and accessing and manipulating files that large would take up a lot of the network's bandwidth.
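GFS is not for sale and has no published interface, but a minimal sketch of what such a client API could look like in Java is shown below; every name here is hypothetical and only mirrors the commands listed above.

```java
// Hypothetical sketch of a GFS-style client API. GFS has no public
// interface, so these names are illustrative, not Google's.
public interface GfsClient {
    GfsFile create(String path);                    // create a new file
    GfsFile open(String path);                      // open an existing file
    void snapshot(String srcPath, String dstPath);  // quick copy of a file or directory tree
}

interface GfsFile extends AutoCloseable {
    int read(long offset, byte[] buffer);   // read starting at a byte offset
    void write(long offset, byte[] data);   // overwrite at a byte offset
    long append(byte[] data);               // add data without overwriting; returns the offset used
    @Override void close();
}
```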
Solution.. GFS addresses this problem by breaking files up into fixed-size chunks of 64 megabytes (MB) each. Every chunk receives a unique 64-bit identification number called a chunk handle. Making all file chunks the same size simplifies the process: using chunk handles, it is easy to check how much storage each computer in the network is using, so GFS easily identifies which computers are full and which ones are underused.
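Because every chunk is a fixed 64 MB, mapping a byte offset in a file to a chunk is simple integer arithmetic. The sketch below illustrates this; the class and method names are ours, not GFS's.

```java
// Sketch: mapping a byte offset in a file to a chunk, given fixed 64 MB chunks.
public final class ChunkMath {
    static final long CHUNK_SIZE = 64L * 1024 * 1024; // 64 MB

    // Index of the chunk containing the given byte offset.
    static long chunkIndex(long byteOffset) {
        return byteOffset / CHUNK_SIZE;
    }

    // Position of that byte within its chunk.
    static long offsetInChunk(long byteOffset) {
        return byteOffset % CHUNK_SIZE;
    }

    public static void main(String[] args) {
        long offset = 200L * 1024 * 1024;          // byte 200 MB into a file
        System.out.println(chunkIndex(offset));    // 3 (chunks 0-2 cover the first 192 MB)
        System.out.println(offsetInChunk(offset)); // 8388608 (8 MB into chunk 3)
    }
}
```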
Google File System Architecture
Google organized the GFS into clusters of computers. Within a GFS cluster there are three kinds of entities: clients, master servers, and chunkservers. "Client" refers to any entity that makes a file request. The master server acts as the coordinator and maintains an operation log. The master server also keeps track of metadata, which is the information that describes the chunks. There's only one active master server per cluster at any one time.
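As a rough illustration of the metadata the master coordinates, the sketch below keeps two mappings: file name to chunk handles, and chunk handle to replica locations. The class and field names are assumptions for illustration, not Google's actual data structures.

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch of GFS master metadata: which chunks make up each file,
// and which chunkservers currently hold each chunk.
public class MasterMetadata {
    // File namespace: path -> ordered list of 64-bit chunk handles.
    Map<String, List<Long>> fileToChunks;

    // Chunk handle -> chunkservers currently holding a replica of that chunk.
    Map<Long, List<String>> chunkToServers;

    // Answer a client's question: where does the Nth chunk of this file live?
    List<String> locate(String path, int chunkIndex) {
        long handle = fileToChunks.get(path).get(chunkIndex);
        return chunkToServers.get(handle);
    }
}
```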
Chunkservers working The master server doesn't actually handle file data; it leaves that up to the chunkservers. The chunkservers don't send chunks through the master server; instead, they send requested chunks directly to the client. GFS copies every chunk multiple times and stores the copies on different chunkservers. Each copy is called a replica. GFS makes three replicas: one primary replica and two secondary replicas.
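A minimal sketch of three-way placement, assuming only what the slide states (one primary plus two secondaries on distinct chunkservers); the random selection is our simplification, since the slide does not describe the real placement policy.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch: pick three distinct chunkservers for a new chunk's replicas.
public class ReplicaPlacement {
    // Returns [primary, secondary1, secondary2]; assumes at least 3 servers.
    static List<String> pickReplicas(List<String> chunkservers) {
        List<String> shuffled = new ArrayList<>(chunkservers);
        Collections.shuffle(shuffled); // simplification: real placement also weighs disk usage
        return shuffled.subList(0, 3);
    }
}
```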
Working When the client makes a file request, the master server responds with the location of the primary replica of the relevant chunk; by comparing IP addresses, the master server chooses the chunkserver closest to the client. For a write, the client then sends the data to all the replicas, starting with the closest replica and ending with the furthest one. Once the replicas have received the data, the primary replica assigns consecutive serial numbers to each change to the file, so every replica applies the changes in the same order. Changes are called mutations. If a replica fails to apply a mutation, the master server identifies the affected replica as garbage.
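Below is a sketch of how the primary could serialize mutations, assuming the flow above (data already pushed to every replica; the primary stamps each mutation and forwards the order to the secondaries). All type and method names are illustrative.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: a primary replica assigns consecutive serial numbers to mutations
// so that every replica applies them in the same order.
public class PrimaryReplica {
    private final AtomicLong nextSerial = new AtomicLong();
    private final List<SecondaryReplica> secondaries;

    PrimaryReplica(List<SecondaryReplica> secondaries) {
        this.secondaries = secondaries;
    }

    void applyMutation(Mutation m) {
        long serial = nextSerial.getAndIncrement(); // fix a single global order
        m.apply(serial);                            // apply on the primary first
        for (SecondaryReplica s : secondaries) {
            s.applyInOrder(serial, m);              // secondaries replay in serial order
        }
    }
}

interface Mutation { void apply(long serial); }
interface SecondaryReplica { void applyInOrder(long serial, Mutation m); }
```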
Other functions To prevent data corruption, GFS uses a system called checksumming. The master server monitors chunks by looking at their checksums; if the checksum of a replica doesn't match the checksum in the master server's memory, the master server deletes the replica and creates a new one to replace it.
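The verification step amounts to recomputing a checksum over the stored bytes and comparing it against the recorded value. The sketch below uses Java's built-in CRC32 as a stand-in; the slide does not say which checksum function GFS actually uses.

```java
import java.util.zip.CRC32;

// Sketch: checksum-based corruption detection, with CRC32 as a stand-in
// for whatever checksum function GFS really uses.
public class ChecksumCheck {
    static long checksum(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block);
        return crc.getValue();
    }

    // A mismatch means the replica is corrupt: delete it and re-replicate.
    static boolean verify(byte[] storedBlock, long recordedChecksum) {
        return checksum(storedBlock) == recordedChecksum;
    }
}
```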
Introduction Apache Hadoop is an open-source software framework for the storage and large-scale processing of data sets on clusters of commodity hardware. Hadoop was created by Doug Cutting and Mike Cafarella in 2005. It was originally developed to support distribution for the Nutch search engine project. Doug named the project after his son's toy elephant.
Why HDFS? HDFS has many similarities with other distributed file systems but differs in several respects:
- HDFS follows a write-once-read-many model (sketched in code after this list) that simplifies data coherency, since it relies mostly on batch processing rather than interactive access by users.
- Another unique attribute of HDFS is that the processing logic is moved close to the data, rather than moving the data to the application space.
- Fault tolerance.
- Data access via MapReduce.
- Portability across heterogeneous commodity hardware and operating systems.
- Scalability to reliably store and process large amounts of data.
- Reduced cost, by distributing data and processing across clusters of commodity personal computers.
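The write-once-read-many model shows up directly in the Hadoop FileSystem API: a file is created and closed once, then opened for reading any number of times. Below is a minimal sketch against that API; the HDFS path is made up, and a reachable, configured cluster is assumed.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal write-once-read-many example using the Hadoop FileSystem API.
public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();     // picks up the cluster config
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/hello.txt"); // hypothetical path

        // Write once: create the file, write, close.
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeUTF("hello, HDFS");
        }

        // Read many: the file can be opened repeatedly, but not rewritten in place.
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
    }
}
```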
Hadoop Distributed File System (HDFS) vs. Google File System (GFS)
- Platform: HDFS is cross-platform; GFS runs on Linux.
- Implementation: HDFS is developed in a Java environment; GFS is developed in a C/C++ environment.
- Origin: HDFS was initially developed by Yahoo and is now an open-source framework; GFS was developed by, and is still owned by, Google.
- Node roles: HDFS has a NameNode and DataNodes; GFS has a master node and chunkservers.
- Default block size: 128 MB in HDFS; 64 MB in GFS.
- Heartbeats: the NameNode receives heartbeats from the DataNodes; the master node receives heartbeats from the chunkservers.
- Hardware: both use commodity hardware.
- Access model: HDFS follows a write-once-read-many model; GFS follows a multiple-writer, multiple-reader model.
- Deletion: in HDFS, deleted files are renamed into a particular (trash) folder and later removed by garbage collection; in GFS, deleted files are not reclaimed immediately but are renamed into a hidden namespace and deleted after three days if not in use.
- Logging: HDFS maintains an edit log; GFS maintains an operation log.
- Writes: in HDFS only append is possible; GFS allows random file writes.