Managing Big Data, Module 3 (1st Part). Guided By: Mangala C.N., Associate Professor, CSE Dept, EWIT, Bangalore. Presented By: Soumee Maschatak, 1EW18SCS07
Contents: Data Format; Analysing Data with Hadoop; Scaling Out; Data Flow; Hadoop Streaming; Hadoop Pipes
Hadoop Concepts Distribute the data as it is initially stored in the system. Individual nodes can work on data local to those nodes, so no data transfer over the network is required for initial processing. Developers do not need to worry about network programming, temporal dependencies or shared architecture. Data is replicated multiple times across the system for increased availability and reliability. The data on the system is split into blocks, typically of 64 MB or 128 MB. Map tasks work on relatively small portions of the data. A master program allocates work to the nodes and manages high availability.
Data Format Data is available everywhere, in different sizes and formats. Hadoop can handle many different data formats, from flat text files to databases. Data is captured by various applications such as sensors, mobiles, satellites, social networks and laptop/desktop users. For example, in a meteorology department, data from tens of thousands of meteorology stations is stored in zip files for each month.
Analysing data with Hadoop
Map and Reduce MapReduce processes the data in two phases: the map phase and the reduce phase. Both phases have key-value pairs as input and output, the types of which may be chosen by the developer. The developer also specifies two functions: the map function and the reduce function. The input to our map phase is the raw NCDC meteorology data. We choose a text input format that gives us each line in the dataset as a text value. The key is the offset of the beginning of the line from the beginning of the file.
The map function is simple. It is just a data preparation phase, putting the data into a form on which the reduce function can work easily. In the case of the meteorology stations, finding the maximum wind speed for each city can be done with such a MapReduce job. The map phase is also a good place to drop unwanted records. The lines of data are fed to the map function as key-value pairs, where the keys are the line offsets within the input file. The output from the map function is processed by the MapReduce framework before being forwarded to the reduce function. This processing sorts and groups the key-value pairs by key.
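Below is a minimal sketch of such a map function in Java, using the classic NCDC maximum-temperature example referred to in the questions at the end of this module. The fixed-width field offsets (year at columns 15-19, temperature at 87-92, quality code at 92-93) are assumptions based on the commonly used NCDC record layout and should be checked against the actual dataset.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();           // one record per input line
    String year = line.substring(15, 19);     // assumed year field offsets
    int airTemperature;
    if (line.charAt(87) == '+') {             // assumed temperature field offsets
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);  // assumed quality-code field
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      // emit (year, temperature) so the reducer can pick the maximum per year
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}
```

Note how the mapper drops records with a missing reading or a bad quality code, which is exactly the "drop unwanted records" role described above.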
Hadoop MapReduce
Java MapReduce This step is all about writing the code for a MapReduce job. There are three things to keep in mind for Java MapReduce: a map function, a reduce function, and some code to run the job. The map function is represented by the Mapper class, which declares a map() method that the developer overrides. The MapReduce model processes large unstructured data sets with a distributed algorithm on a Hadoop cluster.
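To make the three pieces concrete, here is a hedged sketch of the matching reduce function and the driver ("some code to run the job"), reusing the MaxTemperatureMapper sketched earlier; the class names and the use of command-line arguments for the input and output paths are illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

  // The reduce function: receives (year, [temperatures]) and keeps the maximum.
  public static class MaxTemperatureReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int maxValue = Integer.MIN_VALUE;
      for (IntWritable value : values) {
        maxValue = Math.max(maxValue, value.get());
      }
      context.write(key, new IntWritable(maxValue));
    }
  }

  // "Some code to run the job": the driver that configures and submits it.
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input path
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output path

    job.setMapperClass(MaxTemperatureMapper.class);          // mapper from the earlier sketch
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The job would typically be packaged into a JAR and launched with the hadoop command, with the input and output HDFS paths passed as arguments.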
Scaling Out Scalability has two parts: up and out. Scaling up means that the system performs better as one adds more hardware to a single node in the system. Scaling out means adding more nodes to the distributed system. When one builds a complex distributed system or application, one faces certain obstacles: the end result has to scale out, so that one can easily add more hardware resources in the face of higher load. The benefit really starts to show on bigger clusters, and a system needs to use each node well (scale up) in order to scale out well. To scale out, one needs to store the data in a distributed filesystem, typically HDFS (Hadoop Distributed File System), to allow Hadoop to run the MapReduce computation on each machine hosting a part of the data.
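As a small illustration of that last point, the sketch below uses the HDFS Java API to copy local input data into the distributed filesystem before a job is run; the file and directory paths are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadIntoHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // reads core-site.xml/hdfs-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);      // handle to the default filesystem (HDFS on a cluster)

    // Copy a local file into HDFS so that map tasks can later run on the
    // nodes that hold its blocks. The paths are hypothetical.
    fs.copyFromLocalFile(new Path("/tmp/ncdc-sample.txt"),
                         new Path("/user/hadoop/input/ncdc-sample.txt"));
    fs.close();
  }
}
```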
Data Flow A MapReduce job is the combination of the input data, the MapReduce code and the configuration information. Hadoop runs the job by dividing it into two kinds of tasks: map tasks and reduce tasks. Hadoop has two types of nodes that control the job execution process: a jobtracker and a number of tasktrackers. The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job.
If a task fails, the jobtracker can reschedule it on a different tasktracker. Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split. With more splits, the time taken to process each split is short compared to the time to process the complete input, so when the splits are processed in parallel, the processing is better load-balanced if the splits are small. Hadoop does its best to run each map task on a node where the input data resides in HDFS. This is called the data locality optimization, because it avoids using valuable cluster bandwidth.
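The split size and the number of reduce tasks are aspects of this data flow the developer can influence from the driver. The sketch below shows two such knobs using the standard Job API; the 128 MB minimum split size and the two reducers are illustrative values only (by default the split size follows the HDFS block size).

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class DataFlowKnobs {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setJarByClass(DataFlowKnobs.class);

    // Ask for input splits of at least 128 MB (illustrative value; one map
    // task is created per split).
    FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);

    // The number of reduce tasks is chosen by the developer, not derived
    // from the input size.
    job.setNumReduceTasks(2);

    // ... mapper, reducer and input/output paths would be configured here,
    // followed by job.waitForCompletion(true).
  }
}
```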
Map tasks write their output to the local disk, not to HDFS. Map output is intermediate output: it is processed by reduce tasks to produce the final result, and once the job is complete the map output can be deleted. So storing it in HDFS, with replication, would be overkill. If a map task fails on a node before its output has been consumed by the reduce task, Hadoop automatically reruns the map task on another node to re-create the map output. The output of the reduce task is normally stored in HDFS for reliability. For each HDFS block of the reduce output, the first replica is stored on the local node, with the other replicas being stored on off-rack nodes.
Data Flow
Hadoop Streaming Hadoop offers an interface to MapReduce that allows users to write the map and reduce functions in languages other than Java. Streaming uses standard input and output as the interface between Hadoop and the program, so programmers can use any language that can read from standard input and write to standard output, such as Python or Ruby. Streaming is naturally suited to text processing, since it has a line-oriented view of data. Map input data is passed over standard input to the map function, which processes it line by line and writes lines to standard output. A map output key-value pair is written as a single tab-separated line. Input to the reduce function is in the same format, a tab-separated key-value pair, passed over standard input.
The reduce function reads lines from standard input, which the framework guarantees are sorted by key, and writes its results to standard output. Hadoop Streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer. Streaming supports streaming command options as well as generic command options.
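As a hedged illustration of this line-oriented protocol, the sketch below reads records from standard input and writes one tab-separated key-value pair per line to standard output. Java is used here only to keep a single language across these notes; in practice the streaming mapper and reducer are usually scripts (for example Python or Ruby) passed to the streaming jar as executables, and the NCDC field offsets are assumptions as before.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class StreamingMaxTempMapper {
  public static void main(String[] args) throws IOException {
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    String line;
    while ((line = in.readLine()) != null) {       // one input record per line
      if (line.length() < 93) {
        continue;                                  // skip short or malformed records
      }
      String year = line.substring(15, 19);        // assumed NCDC field offsets
      String temperature = line.substring(87, 92);
      String quality = line.substring(92, 93);
      if (!temperature.equals("+9999") && quality.matches("[01459]")) {
        // a map output key-value pair is written as a single tab-separated line
        System.out.println(year + "\t" + temperature);
      }
    }
  }
}
```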
Hadoop streaming command options:
-input directoryname or filename (Required): Input location for the mapper.
-output directoryname (Required): Output location for the reducer.
-mapper executable or JavaClassName (Required): Mapper executable.
-reducer executable or JavaClassName (Required): Reducer executable.
-file filename (Optional): Make the mapper, reducer, or combiner executable available locally on the compute nodes.
-inputformat JavaClassName (Optional): The class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default.
-outputformat JavaClassName (Optional): The class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default.
-partitioner JavaClassName (Optional): Class that determines which reduce a key is sent to.
-combiner streamingCommand or JavaClassName (Optional): Combiner executable for map output.
-cmdenv name=value (Optional): Pass an environment variable to streaming commands.
-inputreader (Optional): For backwards compatibility: specifies a record reader class instead of an input format class.
-verbose (Optional): Verbose output.
-lazyOutput (Optional): Create output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect (or Context.write).
-numReduceTasks (Optional): Specify the number of reducers.
-mapdebug (Optional): Script to call when a map task fails.
-reducedebug (Optional): Script to call when a reduce task fails.
Hadoop Pipes Hadoop Pipes is the C++ interface to Hadoop MapReduce. Unlike Streaming, which uses standard input and output to communicate with the map and reduce code, Pipes uses sockets as the channel over which the tasktracker communicates with the process running the C++ map or reduce function. The map and reduce functions are defined by extending the Mapper and Reducer classes defined in the HadoopPipes namespace and providing implementations of the map() and reduce() methods in each case. Unlike the Java interface, keys and values in the C++ interface are byte buffers, represented as Standard Template Library (STL) strings. This makes the interface simpler, although it does put a slightly greater burden on the application developer, who has to convert to and from richer domain-level types.
Hadoop streaming and Hadoop pipes
Important Questions
Write Java code for a Mapper and a Reducer, considering a weather dataset as an example; the output must retrieve the maximum temperature for every year.
Describe, with a neat diagram, the MapReduce data flow with a single reduce task.
Explain the map and reduce phases with an example.
Briefly explain the significance of data flow in a distributed file system.
What are Hadoop Pipes? Explain.
Explain the different types of data input formats and output formats supported by Hadoop, with an example.
What are Hadoop Pipes? Give a brief explanation with an example.
What is the function of a combiner in MapReduce? How does it differ from the reduce function?