Hadoop Streaming and Pipes - FDTP on BDA - 26.07.2024 - 9.30 AM.pptx
Slide Content
Hadoop Streaming and Pipes
Hadoop Streaming
Hadoop Streaming is a utility that comes with the Hadoop distribution. It allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer: C, Python, Java, Ruby, C#, Perl, shell commands, and so on. The mapper and the reducer can even be written in different languages.
Why Hadoop Streaming?
- Accessibility and flexibility: language agnostic, ease of use
- Rapid prototyping and development: quick prototyping, minimal setup
- Integration and scalability: seamless integration with HDFS, scalable
- Versatility across use cases
- Cost and resource efficiency: resource optimization
- Community and support: strong community support
Using Streaming Utility
> hadoop jar <dir>/hadoop-*streaming*.jar \
    -file /path/to/mapper.py \
    -mapper /path/to/mapper.py \
    -file /path/to/reducer.py \
    -reducer /path/to/reducer.py \
    -input /user/hduser/books/* \
    -output /user/hduser/books-output
- hadoop jar <dir>/hadoop-*streaming*.jar: path to the streaming jar library
- -file / -mapper: location of the mapper file, and define it as the mapper
- -file / -reducer: location of the reducer file, and define it as the reducer
- -input / -output: input and output locations
Features
- Part of the Hadoop distribution
- Ease of writing MapReduce programs
- Supports almost all types of programming languages such as Python, C++, Ruby, Perl, etc.
- The entire Hadoop Streaming framework runs on Java.
- The process uses Unix streams that act as an interface between Hadoop and the MapReduce programs.
- Uses various streaming command options, e.g. -input <directory or file name>, -output <directory name>
Hadoop Streaming Architecture
Both the mapper and the reducer are executables that read input from stdin (line by line) and emit output to stdout. The utility creates a MapReduce job, submits the job to an appropriate cluster, and monitors the progress of the job until it completes.
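As a minimal sketch of what such executables can look like, here is a word-count mapper and reducer written in C++ (file names are illustrative; any language that can read stdin and write stdout would work the same way). The mapper emits one "word<TAB>1" line per word; the reducer assumes the framework has already sorted its input by key, so all counts for a given word arrive contiguously.

// wordcount_mapper.cpp - reads lines from stdin, emits "word\t1" on stdout
#include <iostream>
#include <sstream>
#include <string>

int main() {
    std::string line, word;
    while (std::getline(std::cin, line)) {
        std::istringstream words(line);
        while (words >> word) {
            std::cout << word << "\t" << 1 << "\n";
        }
    }
    return 0;
}

// wordcount_reducer.cpp - input arrives sorted by key, so the counts for one
// word are contiguous; sum them and emit "word\ttotal" on stdout
#include <iostream>
#include <string>

int main() {
    std::string line, current, word;
    long sum = 0;
    while (std::getline(std::cin, line)) {
        std::size_t tab = line.find('\t');   // assumes well-formed "key\tvalue" lines
        word = line.substr(0, tab);
        long count = std::stol(line.substr(tab + 1));
        if (word != current) {
            if (!current.empty()) std::cout << current << "\t" << sum << "\n";
            current = word;
            sum = 0;
        }
        sum += count;
    }
    if (!current.empty()) std::cout << current << "\t" << sum << "\n";
    return 0;
}

Once compiled, the two binaries could be shipped with -file and wired in with -mapper and -reducer exactly as in the commands shown on the other slides.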
Contd…
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /bin/wc
This stdin/stdout line protocol is the basis for the communication between the MapReduce framework and the streaming mapper/reducer.
Contd…
You can supply a Java class as the mapper and/or the reducer:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer /bin/wc
The user can set stream.non.zero.exit.is.failure to true or false to make a streaming task that exits with a non-zero status count as a failure or a success, respectively.
Package Files with Job Submissions
You can specify any executable as the mapper and/or the reducer; the executables do not need to pre-exist on the machines in the cluster. Use the "-file" option to tell the framework to pack your executable files as part of the job submission:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper myPythonScript.py \
    -reducer /bin/wc \
    -file myPythonScript.py
Package Files with Job Submissions
In addition to executable files, you can package other auxiliary files (such as dictionaries, configuration files, etc.) that may be used by the mapper and/or the reducer. For example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper myPythonScript.py \
    -reducer /bin/wc \
    -file myPythonScript.py \
    -file myDictionary.txt
Streaming Options and Usage
Mapper-Only Jobs
Sometimes you want to process input data using a map function only. To do this, simply set mapred.reduce.tasks to zero. The MapReduce framework will not create any reducer tasks; rather, the outputs of the mapper tasks become the final output of the job. For backward compatibility, Hadoop Streaming also supports the "-reduce NONE" option, which is equivalent to "-jobconf mapred.reduce.tasks=0".
Streaming Options and Usage
Specifying Other Plugins for Jobs
Just as for a normal MapReduce job, you can specify other plugins for a streaming job:
    -inputformat JavaClassName
    -outputformat JavaClassName
    -partitioner JavaClassName
    -combiner JavaClassName
The class you supply for the input format should return key/value pairs of Text class. If no input format is specified, TextInputFormat is used as the default. Since it returns keys of LongWritable class, which are not part of the input data, the keys are discarded; only the values are piped to the streaming mapper.
Streaming Options and Usage
Large Files and Archives in Hadoop Streaming
The -cacheFile and -cacheArchive options allow you to make files and archives available to the tasks. The argument is a URI to the file or archive that you have already uploaded to HDFS. These files and archives are cached across jobs. You can retrieve the host and fs_port values from the fs.default.name config variable.
Example of the -cacheFile option:
    -cacheFile hdfs://host:fs_port/user/testfile.txt#testlink
Streaming Options and Usage
Large Files and Archives in Hadoop Streaming
The part of the URI after # is used as the name of the symlink that is created in the current working directory of the tasks. So the tasks will have a symlink called testlink in the cwd that points to a local copy of testfile.txt. Multiple entries can be specified:
    -cacheFile hdfs://host:fs_port/user/testfile1.txt#testlink1
    -cacheFile hdfs://host:fs_port/user/testfile2.txt#testlink2
Streaming Options and Usage
Large Files and Archives in Hadoop Streaming
The -cacheArchive option allows you to copy jars locally to the cwd of the tasks and automatically unjar the files. For example:
    -cacheArchive hdfs://host:fs_port/user/testfile.jar#testlink3
Here a symlink testlink3 is created in the current working directory of the tasks. This symlink points to the directory that stores the unjarred contents of the uploaded jar file.
Streaming Options and Usage
Specifying Additional Configuration Variables for Jobs
You can specify additional configuration variables by using "-jobconf <n>=<v>". For example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer /bin/wc \
    -jobconf mapred.reduce.tasks=2
The option -jobconf mapred.reduce.tasks=2 specifies that the job uses two reducers.
Hadoop Pipes
Hadoop Pipes is a C++ API for Hadoop that enables programmers to write MapReduce applications in C++ rather than Java. This can be beneficial for leveraging existing C++ code or for performance reasons.
Key features:
- Language interoperability: allows you to write MapReduce programs in C++.
- High performance: C++ programs can offer better performance due to lower-level memory and processor management.
- Existing codebase: useful for integrating existing C++ libraries and code into Hadoop.
Hadoop Pipes
Components:
- HadoopPipes: the core C++ library for writing Pipes MapReduce applications.
- PipesRunner: the Java class responsible for managing and running Pipes programs.
Hadoop Streaming vs. Hadoop Pipes
Hadoop Streaming: an API that allows application users to write their map and reduce functions in languages other than Java. Using Unix standard streams as the interface between Hadoop and the user's program, application users can use any language with standard I/O operations to implement their MapReduce programs.
Hadoop Pipes: a C++ interface to Hadoop MapReduce. Unlike Streaming, Pipes uses sockets as the channel over which the TaskTracker communicates with the process running the C++-based map and reduce functions, without using JNI.
Java Native Interface (JNI)
JNI is a native programming interface that allows Java code running in a Java Virtual Machine (JVM) to invoke, or be invoked by, applications and libraries written in other programming languages such as C, C++, and assembly. Using JNI as a translator between Java and other languages, application programmers gain various benefits, such as invoking platform-specific features and libraries which the standard Java class library does not support. Several existing libraries wrap CUDA code invocation by using JNI.
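As a small, hedged sketch of how that bridge looks from the native side (the class, method, and library names here are made up for illustration), a C++ implementation of a Java native method follows. The corresponding Java class declares the method with the native keyword and loads the compiled shared library with System.loadLibrary.

// native_sum.cpp - hypothetical JNI example. The matching Java side would be:
//   public class NativeSum {
//       static { System.loadLibrary("nativesum"); }
//       public native int sum(int a, int b);
//   }
#include <jni.h>

// JNI maps the Java class/method names onto this C symbol: Java_<class>_<method>.
extern "C" JNIEXPORT jint JNICALL
Java_NativeSum_sum(JNIEnv* env, jobject self, jint a, jint b) {
    // Ordinary native code runs here; this is the point where a wrapper
    // library could hand work off to CUDA or any other native library.
    return a + b;
}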
Compute Unified Device Architecture (CUDA)
In computing, CUDA is a proprietary parallel computing platform and API that allows software to use certain types of GPUs for accelerated general-purpose processing, an approach called general-purpose computing on GPUs (GPGPU).
CUDA API and its runtime: the CUDA API is an extension of the C programming language that adds the ability to specify thread-level parallelism and GPU device-specific operations (like moving data between the CPU and the GPU). CUDA is a software layer that gives direct access to the GPU's virtual instruction set and parallel computational elements for the execution of compute kernels.
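To make the two ideas in that description concrete (thread-level parallelism and host/device data movement), here is a minimal, illustrative CUDA C++ vector-addition sketch; the array size and launch parameters are arbitrary.

// vecadd.cu - illustrative CUDA sketch: the __global__ kernel expresses
// thread-level parallelism; cudaMemcpy moves data between CPU and GPU.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h_a = new float[n], *h_b = new float[n], *h_c = new float[n];
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);   // CPU -> GPU
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threads = 256, blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);          // run on the GPU
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);    // GPU -> CPU

    printf("c[0] = %f\n", h_c[0]);                          // expect 3.0
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    delete[] h_a; delete[] h_b; delete[] h_c;
    return 0;
}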
CUDA processing flow
Hadoop Pipes Workflow
1. Create the C++ program: define the Mapper and Reducer classes; implement the map() and reduce() methods.
2. Compile the C++ program: use a C++ compiler to compile the source code into an executable.
3. Deploy and run: copy the compiled executable to the Hadoop cluster and use the HadoopPipes command to submit the job to the Hadoop framework.
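A minimal sketch of step 1, closely following the classic Pipes word-count example (the file and class names are illustrative):

// wordcount_pipes.cpp - Mapper and Reducer written against the Hadoop Pipes
// C++ API; the compiled binary is shipped to the cluster and submitted as a job.
#include <string>
#include <vector>

#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"

class WordCountMapper : public HadoopPipes::Mapper {
public:
    WordCountMapper(HadoopPipes::TaskContext& context) {}
    void map(HadoopPipes::MapContext& context) {
        // One input record per call; split the value into words and emit <word, "1">.
        std::vector<std::string> words =
            HadoopUtils::splitString(context.getInputValue(), " ");
        for (size_t i = 0; i < words.size(); ++i) {
            context.emit(words[i], "1");
        }
    }
};

class WordCountReducer : public HadoopPipes::Reducer {
public:
    WordCountReducer(HadoopPipes::TaskContext& context) {}
    void reduce(HadoopPipes::ReduceContext& context) {
        // All values for one key are iterated here; sum them and emit the total.
        int sum = 0;
        while (context.nextValue()) {
            sum += HadoopUtils::toInt(context.getInputValue());
        }
        context.emit(context.getInputKey(), HadoopUtils::toString(sum));
    }
};

int main(int argc, char* argv[]) {
    // Hands control to the Pipes runtime, which talks to the TaskTracker over a socket.
    return HadoopPipes::runTask(
        HadoopPipes::TemplateFactory<WordCountMapper, WordCountReducer>());
}

After compiling and copying the executable into HDFS, the job is typically launched with hadoop pipes -input <dirs> -output <dir> -program <HDFS path to the executable>.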