Anatomy of a MapReduce Job Run
Some slides are taken from a CMU PPT presentation.

Hadoop 2.X MapReduce Components (Entities)

How Hadoop Runs a MapReduce Job

Job Submission
•Job.runJob()
•Creates a new JobClient instance
•Calls submit() on the Job object
•Job.submit()
•Creates an internal JobSubmitter instance and calls submitJobInternal() on it (step 1 in the figure).
•Having submitted the job, waitForCompletion() polls the job’s progress once per second and
reports the progress to the console if it has changed since the last report.
•The job submission process implemented by JobSubmitter does five things.

Five Things Done in the Job Submission Process
1.Asks the resource manager for a new application ID, used for the
MapReduce job ID (step 2).
2.Checks the output specification of the job.
3.Computes the input splits for the job.
4.Copies the resources needed to run the job, including the job JAR
file, the configuration file, and the computed input splits, to the
shared filesystem in a directory named after the job ID (step 3).
5.Submits the job by calling submitApplication() on the resource
manager (step 4).
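
To make the submission path concrete, here is a minimal sketch of a client-side driver: waitForCompletion() triggers submit(), which hands the job to JobSubmitter, and then polls the job’s progress once per second as described above. The identity Mapper and Reducer base classes are used only to keep the sketch self-contained; input and output paths come from the command line.

// Minimal driver sketch (identity job); paths are supplied by the user.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MinimalDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "minimal job");
    job.setJarByClass(MinimalDriver.class);
    job.setMapperClass(Mapper.class);       // identity map
    job.setReducerClass(Reducer.class);     // identity reduce
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // waitForCompletion() calls submit() (steps 1-4 above) and then polls
    // the job's progress once per second until it finishes.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}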

Input Splits

Relation between input splits and HDFS Blocks

Job Initialization
1.When the resource manager receives a call to its submitApplication()
method, it hands off the request to the YARN scheduler.
2.The scheduler allocates a container, and the resource manager then
launches the application master’s process there, under the node manager’s
management (steps 5a and 5b).
3.The application master for MapReduce jobs is a Java application whose main
class is MRAppMaster. It initializes the job by creating a number of
bookkeeping objects to keep track of the job’s progress, as it will receive
progress and completion reports from the tasks (step 6).
4.Next, it retrieves the input splits computed in the client from the shared
filesystem (step 7). It then creates a map task object for each split, as well as
a number of reduce task objects. Tasks are given IDs at this point.

Task Assignment
1.The application master requests containers for all the map and
reduce tasks in the job from the resource manager (step 8).
2.Reduce tasks can run anywhere in the cluster, but requests for map
tasks have data locality constraints that the scheduler tries to
honor.
3.Requests for reduce tasks are not made until 5% of map tasks have
completed.
4.Requests also specify memory requirements and CPUs for tasks. By
default, each map and reduce task is allocated 1,024 MB of
memory and one virtual core.
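
A sketch of overriding the per-task memory and vcore defaults mentioned above, assuming the standard Hadoop 2 configuration property names; the values shown are illustrative.

// Sketch: tuning container resources requested for map and reduce tasks.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TaskResources {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.map.memory.mb", 2048);      // memory per map container
    conf.setInt("mapreduce.reduce.memory.mb", 3072);   // memory per reduce container
    conf.setInt("mapreduce.map.cpu.vcores", 1);        // virtual cores per map task
    conf.setInt("mapreduce.reduce.cpu.vcores", 2);     // virtual cores per reduce task
    Job job = Job.getInstance(conf, "resource-tuned job");
    System.out.println("map memory: " + conf.get("mapreduce.map.memory.mb") + " MB");
  }
}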

Task Execution
1.Once a task has been assigned resources for a container on a particular
node by the resource manager’s scheduler, the application master starts
the container by contacting the node manager (steps 9a and 9b).
2.The task is executed by a Java application whose main class is YarnChild.
3.Before it can run the task, it localizes the resources that the task needs,
including the job configuration and JAR file, and any files from the
distributed cache (step 10).
4.Finally, it runs the map or reduce task (step 11).
5.The YarnChild runs in a dedicated JVM, so that any bugs in the user-defined
map and reduce functions (or even in YarnChild) don’t affect the node
manager—by causing it to crash or hang, for example.

Progress Measure
•The following operations constitute task progress in Hadoop:
1.Reading an input record (in a mapper or reducer)
2.Writing an output record (in a mapper or reducer)
3.Setting the status description (via Reporter’s or
TaskAttemptContext’s setStatus() method)
4.Incrementing a counter (using Reporter’s incrCounter() method
or Counter’s increment() method)
5.Calling Reporter’s or TaskAttemptContext’s progress() method
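
As an illustration of the last few items, here is a sketch of a mapper whose Context provides the counter and setStatus() methods listed above; the counter group and name are made up for the example.

// Sketch: a mapper whose counter updates and status messages count as progress.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ProgressAwareMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (value.toString().trim().isEmpty()) {
      // Incrementing a counter is one of the operations that counts as progress
      context.getCounter("ProgressDemo", "EMPTY_LINES").increment(1);
      return;
    }
    // Writing an output record also counts as progress
    context.write(value, new LongWritable(1));
    // Setting the status description
    context.setStatus("processed record at offset " + key.get());
  }
}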

Progress and Status Updates
1.When a task is running, it keeps track of its progress (i.e., the proportion of the task
completed). Progress is not always measurable.
2.For map tasks, this is the proportion of the input that has been processed. For reduce
tasks, it’s a little more complex, but the system can still estimate the proportion of
the reduce input processed.
3.Tasks also have a set of counters that count various events as the task runs. The
counters are either built into the framework, such as the number of map output
records written, or defined by users.
4.The task reports its progress and status (including counters) back to its application
master, which has an aggregate view of the job.
5.During the course of the job, the client receives the latest status by polling the
application master every second.

Propagation of status updates through the MapReduce system

Job Completion
1.When the application master receives a notification that the last
task for a job is complete, it changes the status for the job to
“successful.”
2.When the Job polls for status, it learns that the job has completed
successfully, so it prints a message to tell the user and then returns
from the waitForCompletion() method.
3.Job statistics and counters are printed to the console at this point.
4.Finally, on job completion, the application master and the task
containers clean up their working state.
5.Job information is archived by the job history server to enable later
interrogation by users if desired.

Failures in Hadoop
•In the real world,
1.User code is buggy,
2.Processes crash, and
3.Machines fail.
•One of the major benefits of using Hadoop is its ability to handle failures.
•Various entities that may fail in Hadoop
1.Task Failure
2.Application Master Failure
3.Node Manager Failure
4.Resource Manager Failure

Task Failure
•User code in the map or reduce task throws a runtime exception
The task JVM reports the error back to its parent application master before it exits. The
error ultimately makes it into the user logs. The application master marks the task
attempt as failed, and frees up the container so its resources are available for another
task.
•Sudden exit of the task JVM—perhaps there is a JVM bug
The node manager notices that the process has exited and informs the application
master so it can mark the attempt as failed.
•Hanging tasks are dealt with differently
The application master notices that it hasn’t received a progress update for a while and
proceeds to mark the task as failed.
•When the application master is notified of a task attempt that has failed, it will reschedule
execution of the task. The application master will try to avoid rescheduling the task on a
node manager where it has previously failed.

Application Master Failure
•An application master sends periodic heartbeats to the resource manager, and
in the event of application master failure, the resource manager will detect the
failure and start a new instance of the master running in a new container.
•In the case of the MapReduce application master, it will use the job history to
recover the state of the tasks that were already run by the (failed) application
so they don’t have to be rerun. Recovery is enabled by default.
•The MapReduce client polls the application master for progress reports, but if
its application master fails, the client needs to locate the new instance.
If the application master fails, however, the client will experience a timeout
when it issues a status update, at which point the client will go back to the
resource manager to ask for the new application master’s address.

Node Manager Failure
•If a node manager fails by crashing or running very slowly, it will stop sending
heartbeats to the resource manager.
The resource manager will notice a node manager that has stopped sending heartbeats
if it hasn’t received one for 10 minutes and remove it from its pool of nodes to schedule
containers on.
Any task or application master running on the failed node manager will be recovered
using the mechanisms described under “Task Failure” and “Application Master Failure”
sections respectively.
The application master arranges for map tasks (which were scheduled on failed nodes)
to be rerun if they belong to incomplete jobs.
Node managers may be blacklisted if the number of failures for the application is high,
even if the node manager itself has not failed. Blacklisting is done by the application
master.

Resource Manager Failure
•The resource manager is a single point of failure.
•To achieve high availability (HA), it is necessary to run a pair of resource managers in an active-
standby configuration. If the active resource manager fails, then the standby can take over
without a significant interruption to the client.
•Information about all the running applications is stored in a highly available state store (backed
by ZooKeeper or HDFS), so that the standby can recover the core state of the failed active
resource manager.
•Node manager information can be reconstructed by the new resource manager as the node
managers send their first heartbeats.
•When the new resource manager starts, it reads the application information from the state
store, then restarts the application masters for all the applications running on the cluster.
•The transition of a resource manager from standby to active is handled by a failover controller.
•Clients and node managers must be configured to handle resource manager failover.

Hadoop MapReduce: A Closer Look
[Diagram: Node 1 and Node 2 each run the pipeline: files loaded from the local HDFS store, InputFormat, Splits, RecordReaders (RR), Map with input (K, V) pairs, Partitioner, Sort, Reduce, OutputFormat, with final (K, V) pairs written back to the local HDFS store. During the shuffling process, intermediate (K, V) pairs are exchanged by all nodes.]

Input Files
Input files are where the data for a MapReduce task is initially stored.
The input files typically reside in a distributed filesystem (e.g., HDFS).
The format of input files is arbitrary:
Line-based log files
Binary files
Multi-line input records
Or something else entirely

InputFormat
How the input files are split up and read is defined by the InputFormat.
InputFormat is a class that does the following:
Selects the files that should be used for input
Defines the InputSplits that break up a file
Provides a factory for RecordReader objects that read the file

InputFormat Types
Several InputFormats are provided with Hadoop:

InputFormat             | Description                                      | Key                                      | Value
TextInputFormat         | Default format; reads lines of text files       | The byte offset of the line              | The line contents
KeyValueInputFormat     | Parses lines into (K, V) pairs                   | Everything up to the first tab character | The remainder of the line
SequenceFileInputFormat | A Hadoop-specific high-performance binary format | user-defined                             | user-defined
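
As a small illustration (not from the slides), the InputFormat is selected per job in the driver; in the newer Hadoop API the tab-separated key/value behaviour described in the table is provided by the KeyValueTextInputFormat class.

// Sketch: choosing an InputFormat. TextInputFormat is the default;
// KeyValueTextInputFormat splits each line at the first tab into key and value.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatChoice {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "key-value input");
    job.setInputFormatClass(KeyValueTextInputFormat.class);
  }
}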

Input Splits
An input split describes a unit of work that comprises a single map task in a MapReduce program.
By default, the InputFormat breaks a file up into 64 MB splits.
By dividing the file into splits, we allow several map tasks to operate on a single file in parallel.
If the file is very large, this can improve performance significantly through parallelism.
Each map task corresponds to a single input split.

RecordReader
The input split defines a slice of work but does not describe how to access it.
The RecordReader class actually loads data from its source and converts it into (K, V) pairs suitable for reading by Mappers.
The RecordReader is invoked repeatedly on the input until the entire split is consumed.
Each invocation of the RecordReader leads to another call of the map function defined by the programmer.

Mapper and Reducer
The Mapper performs the user-defined work of the first phase of the MapReduce program.
A new instance of Mapper is created for each split.
The Reducer performs the user-defined work of the second phase of the MapReduce program.
A new instance of Reducer is created for each partition.
For each key in the partition assigned to a Reducer, the Reducer is called once.
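
For concreteness, a minimal sketch of a Reducer in the new Hadoop API: the framework calls reduce() once per key in the Reducer's partition, passing all values grouped under that key. The sum-per-key logic is just an illustrative choice.

// Sketch: a reducer invoked once per key with all values for that key.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context context)
      throws IOException, InterruptedException {
    long sum = 0;
    for (LongWritable value : values) {   // all intermediate values grouped under this key
      sum += value.get();
    }
    context.write(key, new LongWritable(sum));
  }
}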

Combiners and Partitioners

Combiner Example

Partitioner
Each mapper may emit (K, V) pairs to any partition.
Therefore, the map nodes must all agree on where to send different pieces of intermediate data.
The Partitioner class determines which partition a given (K, V) pair will go to.
The default Partitioner computes a hash value for a given key and assigns it to a partition based on this result.
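
A sketch of a custom Partitioner in the new Hadoop API, mirroring what the default hash-based partitioner does; the key and value types are example choices.

// Sketch: map a key's hash to one of the reduce partitions.
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HashKeyPartitioner extends Partitioner<Text, LongWritable> {
  @Override
  public int getPartition(Text key, LongWritable value, int numPartitions) {
    // Mask off the sign bit so the result is non-negative, then take the modulus
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}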

Sort
Each Reducer is responsible for reducing the values associated with (several) intermediate keys.
The set of intermediate keys on a single node is automatically sorted by MapReduce before they are presented to the Reducer.

OutputFormat
The OutputFormat class defines the way (K, V) pairs produced by Reducers are written to output files.
The instances of OutputFormat provided by Hadoop write to files on the local disk or in HDFS.
Several OutputFormats are provided by Hadoop:

OutputFormat             | Description
TextOutputFormat         | Default; writes lines in "key \t value" format
SequenceFileOutputFormat | Writes binary files suitable for reading into subsequent MapReduce jobs
NullOutputFormat         | Generates no output files

Shuffle and Sort: The Map Side
•MapReduce makes the guarantee that the input to every reducer is sorted by key.
•The process by which the system performs the sort, and transfers the map outputs to the reducers as inputs, is known as the shuffle.
•When the map function starts producing output, it is not simply written to disk. It takes advantage of buffering by writing in main memory and doing some presorting for efficiency reasons.
•Each map task has a circular memory buffer that it writes the output to. The buffer is 100 MB by default.
•When the contents of the buffer reach a certain threshold size (default value 0.80, or 80%), a background thread will start to spill the contents to disk.
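
The buffer size and spill threshold above are configurable; a sketch setting them to their documented defaults, assuming the standard Hadoop 2 property names.

// Sketch: configuring the map-side sort buffer and spill threshold.
import org.apache.hadoop.conf.Configuration;

public class SpillConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.task.io.sort.mb", 100);             // circular buffer size in MB
    conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);  // spill when 80% full
    System.out.println("sort buffer (MB): " + conf.get("mapreduce.task.io.sort.mb"));
  }
}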

Shuffle and Sort in MapReduce

Shuffle and Sort: The Map Side
•Spills are written in round-robin fashion to the specified directories.
•Before it writes to disk, the thread first divides the data into partitions corresponding
to the reducers that they will ultimately be sent to.
•Within each partition, the background thread performs an in-memory sort by key,
and if there is a combiner function, it is run on the output of the sort.
•Running the combiner function makes for a more compact map output, so there is
less data to write to local disk and to transfer to the reducer.
•Each time the memory buffer reaches the spill threshold, a new spill file is created, so
after the map task has written its last output record, there could be several spill files.
•Before the task is finished, the spill files are merged into a single partitioned and
sorted output file. If there are at least three spill files,the combiner is run again
before the output file is written.

Shuffle and Sort: The Reduce Side
•The map output file is sitting on the local disk of the machine that ran the map
task, but now it is needed by the machine that is about to run the reduce task
for the partition.
•Moreover, the reduce task needs the map output for its particular partition
from several map tasks across the cluster.
•The map tasks may finish at different times, so the reduce task starts copying
their outputs as soon as each completes. This is known as the copy phase of the
reduce task.
•The reduce task has a small number of copier threads, by default 5, so that it
can fetch map outputs in parallel.
•A thread in the reducer periodically asks the master for map output hosts until
it has retrieved them all.

Shuffle and Sort: The Reduce Side
•Map outputs are copied to the reduce task JVM’s memory if they are small enough;
otherwise, they are copied to disk.
•When the in-memory buffer reaches a threshold size or reaches a threshold number
of map outputs, it is merged and spilled to disk. If a combiner is specified, it will be run
during the merge to reduce the amount of data written to disk.
•As the copies accumulate on disk, a background thread merges them into larger,
sorted files.
•When all the map outputs have been copied, the reduce task moves into the sort
phase, which merges the map outputs, maintaining their sort ordering. This is done in
rounds.
•During the reduce phase, the reduce function is invoked for each key in the sorted
output. The output of this phase is written directly to the output filesystem, typically
HDFS.

Speculative Execution
•The MapReduce model is to break jobs into tasks and run the tasks in parallel to make
the overall job execution time smaller than it would be if the tasks ran sequentially.
•A MapReduce job is dominated by the slowest task.
•Hadoop doesn’t try to diagnose and fix slow-running tasks (stragglers); instead, it tries
to detect when a task is running slower than expected and launches another
equivalent task as a backup. This is termed speculative execution of tasks.
•Only one copy of a straggler is allowed to be speculated.
•Whichever copy (among the two copies) of a task commits first becomes the
definitive copy, and the other copy is killed.
•Speculative execution is an optimization, not a feature to make jobs run more
reliably.
•Speculative execution is turned on by default.
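
Speculative execution can be switched off per job; a sketch assuming the standard Hadoop 2 property names for map and reduce speculation.

// Sketch: disabling speculative execution (it is enabled by default).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class NoSpeculation {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("mapreduce.map.speculative", false);     // no backup map tasks
    conf.setBoolean("mapreduce.reduce.speculative", false);  // no backup reduce tasks
    Job job = Job.getInstance(conf, "no-speculation job");
    System.out.println("speculation disabled for " + job.getJobName());
  }
}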

Output Committers
•Hadoop MapReduce uses a commit protocol to ensure that
jobs and tasks either succeed or fail cleanly.
•The behavior is implemented by the OutputCommitter in use
for the job.
•In the old MapReduce API, the OutputCommitter is set by
calling setOutputCommitter() on JobConf or by setting
mapred.output.committer.class in the configuration.
•In the new MapReduce API, the OutputCommitter is
determined by the OutputFormat, via its
getOutputCommitter() method.

Apache Pig
Unit III

Outline of Today’s Talk
•What is Pig?
•Why do we need Apache Pig?
•Features of Pig
•Pig Architecture
•Pig Terms
•Example Scripts

What is Pig ?
•Developed by Yahoo! and a top-level Apache project.
•Pig is a high-level programming language useful for analyzing large data sets.
•It is an alternative to MapReduce programming.
•Pig is generally used with Hadoop. We can perform all the data manipulation operations in Hadoop using Apache Pig.
•Apache Pig is an abstraction over MapReduce. It is built on top of Hadoop.
•The Pig programming language is designed to work upon any kind of data.
•Pig provides a high-level language known as Pig Latin for data analysis.

Cont...
•Programmers need to write scripts using the Pig Latin language, which are internally converted to MapReduce tasks. Apache Pig has a component known as Pig Engine that accepts the Pig Latin scripts as input and converts those scripts into MapReduce jobs.

Why do we need Apache Pig?
•Pig Latin helps programmers perform MapReduce tasks without having to type complex code in Java.
•Apache Pig uses a multi-query approach, thereby reducing the length of code.
•Pig (like MapReduce) is oriented around the batch processing of data. If you need to process gigabytes or terabytes of data, Pig is a good choice. But it expects to read all the records of a file and write all of its output sequentially.

Contd...
•Apache Pig provides many built-in operators to support data operations like joins, filters, ordering, etc. In addition, it also provides nested data types like tuples, bags, and maps that are missing in MapReduce.

Features of Pig
•Extensibility: Using the existing operators, users can develop their own functions to read, process, and write data.
•UDFs: Pig provides the facility to create user-defined functions in other programming languages such as Java and invoke or embed them in Pig scripts.
•Handles all kinds of data: Apache Pig analyzes all kinds of data, both structured as well as unstructured. It stores the results in HDFS.

Contd...,
•Rich set of operators: It provides many operations like joins, sort, filter, etc.
•Ease of programming: Pig Latin is similar to SQL, and it is easy to write a Pig script if you are good at SQL.
•Optimization opportunities: The tasks in Apache Pig optimize their execution automatically, so the programmers need to focus only on the semantics of the language.

Pig Architecture

Contd...
•Grunt shell: It is an interactive shell to write/execute Pig Latin scripts.
•Parser: It takes input from the Grunt shell and performs type checking and syntax checking, and constructs a DAG for the Pig Latin script. The parser generates a logical plan in the form of a DAG.
•Optimizer: The logical plan is given as input to the Optimizer. The Optimizer applies some projections and pushdowns and produces an optimized logical plan.
•Compiler: The compiler takes the logical plan, compiles it, and generates a series of MapReduce jobs.
•Execution Engine: It executes the MapReduce tasks by fetching data from HDFS.

Pig Terms
•All data in Pig is one of four types:
–An Atom is a simple data value, stored as a string but
can be used as either a string or a number
–A Tuple is a data record consisting of a sequence of
"fields"
•Each field is a piece of data of any type (atom, tuple, or bag)
–A Bag is a set of tuples (also referred to as a ‘Relation’)
•The concept of a “kind of a” table
–A Map is a map from keys that are string literals to
values that can be any data type
•The concept of a hash map

Execution Modes
•Local
--Executes in a single JVM
--Works exclusively with the local file system
--Great for development, experimentation, and prototyping
$ pig -x local
•Hadoop Mode
--Also known as MapReduce mode
--Pig renders Pig Latin into MapReduce jobs and executes them on the cluster.
--Can execute against a semi-distributed or fully distributed Hadoop installation
$ pig -x mapreduce

Comparison of Pig with Databases

Running Pig
•Script
–Execute commands in a file
$ pig scriptFile.pig
•Grunt
–Interactive shell for executing Pig commands
–Started when a script file is NOT provided
–Can execute scripts from Grunt via the run or exec commands
•Embedded
–Execute Pig commands using the PigServer class
–Just like using JDBC to execute SQL
–Can have programmatic access to Grunt via the PigRunner class
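
A sketch of the embedded mode using the PigServer class; the input path, schema, and filter condition are made up for the example.

// Sketch: running Pig Latin from Java via PigServer (embedded mode).
import java.util.Iterator;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class EmbeddedPig {
  public static void main(String[] args) throws Exception {
    PigServer pig = new PigServer("local");  // or "mapreduce" for Hadoop mode
    pig.registerQuery("A = LOAD 'input/data.txt' USING PigStorage(',') AS (id:int, name:chararray);");
    pig.registerQuery("B = FILTER A BY id > 10;");
    Iterator<Tuple> it = pig.openIterator("B");  // triggers execution and iterates the results
    while (it.hasNext()) {
      System.out.println(it.next());
    }
  }
}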

Grunt
•Grunt has line-editing facilities like those found in GNU Readline (used in the bash shell and many other command-line applications).
•For instance, the Ctrl-E key combination will move the cursor to the end of the line.
•Grunt remembers command history too, and you can recall lines in the history buffer using Ctrl-P or Ctrl-N (for previous and next), or equivalently, the up or down cursor keys.
•Another handy feature is Grunt’s completion mechanism, which will try to complete Pig Latin keywords and functions when you press the Tab key.
•For example, consider the following incomplete line:
grunt> a = foreach b ge
If you press the Tab key at this point, ge will expand to generate, a Pig Latin keyword:
grunt> a = foreach b generate

Contd...,
•You can customize the completion tokens by creating a file named autocomplete and placing it on Pig’s classpath (such as in the conf directory in Pig’s install directory) or in the directory you invoked Grunt from.
•When you’ve finished your Grunt session, you can exit with the quit command, or the equivalent shortcut \q.

Pig Data Model
•The data model of Pig Latin is fully nested and allows complex non-atomic data types such as map and tuple.
•Types
Pig’s data types can be divided into two categories: scalar types, which contain a single value, and complex types, which contain other types.
Scalar Types
•Pig’s scalar types are simple types that appear in most programming languages. With the exception of bytearray, they are all represented in Pig interfaces by java.lang classes, making them easy to work with in UDFs:
int, long, float, double, chararray, and bytearray.
Complex Types
Pig has three complex data types: maps, tuples, and bags. All of these types can contain data of any type, including other complex types.

•Map
•A map in Pig is a chararray-to-data-element mapping, where that element can be any Pig type, including a complex type. The chararray is called a key and is used as an index to find the element, referred to as the value.
•For example, ['name'#'bob', 'age'#55] will create a map with two keys, “name” and “age”. The first value is a chararray, and the second is an integer.
•Tuple
•A tuple is a fixed-length, ordered collection of Pig data elements. Tuples are divided into fields, with each field containing one data element. These elements can be of any type; they do not all need to be the same type. A tuple is analogous to a row in SQL, with the fields being SQL columns.
For example, ('bob', 55) describes a tuple constant with two fields.

Contd...,
•Bag
A bag is an unordered collection of tuples. Because it has no order, it is not possible to reference tuples in a bag by position.
Bag constants are constructed using braces, with tuples in the bag separated by commas. For example, {('bob', 55), ('sally', 52), ('john', 25)} constructs a bag with three tuples, each with two fields.

Pig Latin.
•The language used to analyze data in Hadoop using Pig is known as Pig Latin.
•It is a high-level data processing language which provides a rich set of data types and operators to perform various operations on the data.
•To perform a particular task, programmers using Pig need to write a Pig script using the Pig Latin language and execute it using any of the execution mechanisms (Grunt shell, UDFs, Embedded).
•After execution, these scripts will go through a series of transformations applied by the Pig framework to produce the desired output.
•Structure
•Statements
•Expressions
•Functions

Structure
•A Pig Latin program consists of a collection of statements. A statement can be thought of as an operation or a command.
grouped_records = GROUP records BY year;
•Statements are usually terminated with a semicolon and can be split across multiple lines for readability.
records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
•Pig Latin has two forms of comments.
•Double hyphens are used for single-line comments.
•Everything from the first hyphen to the end of the line is ignored by the Pig Latin interpreter:
-- My program
DUMP A; -- What's in A?
•Multi-line comments:
/*
* Description of my program spanning
* multiple lines.
*/
A = LOAD 'input/pig/join/A';
B = LOAD 'input/pig/join/B';

Statements
•As a Pig Latin program is executed, each statement is parsed in turn.
•If there are syntax errors or other (semantic) problems, such as undefined
aliases, the interpreter will halt and display an error message.
•The interpreter builds a logical plan for every relational operation, which
forms the core of a Pig Latin program.
•The logical plan for the statement is added to the logical plan for the
program so far, and then the interpreter moves on to the next statement.

Expressions
•An expression is something that is evaluated to yield a value.
•Expressions can be used in Pig as a part of a statement containing a relational operator. Pig has a rich variety of expressions, many of which will be familiar from other programming languages.
•Examples:
Category: Constant
Expression: Literal
Description: Constant value
Example: 1.0, 'a'
Category: Field (by position), Field (by name)
Expression: $n, f
Description: Field in position n (zero-based), Field named f
Example: $0, year

Types

Schemas
•A relation in Pig may have an associated schema, which gives the fields in
the relation names and types.
•Example :
•grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year:int, temperature:int, quality:int);
•grunt> DESCRIBE records;
Output: records: {year: int,temperature: int,quality: int}

Functions
•Functions in Pig come in four types:
Eval function: A function that takes one or more expressions and returns
another expression.
Example : An example of a built-in eval function is MAX, which returns
the maximum value of the entries in a bag.
Filter function: A special type of eval function that returns a logical
Boolean result. Filter functions are used in the FILTER operator to remove
unwanted rows.
Load function: A function that specifies how to load data into a relation
from external storage.
Store function: A function that specifies how to save the contents of a
relation to external storage.
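
A sketch of a filter function written as a Java UDF (filter functions extend Pig's FilterFunc, an eval function that returns a Boolean); the class name and the emptiness check are made up for the example.

// Sketch: a filter function usable in a FILTER operator to drop rows with an empty first field.
package myudfs;

import java.io.IOException;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class IsNotEmpty extends FilterFunc {
    @Override
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return false;
        Object field = input.get(0);
        // Keep the row only if the first field is present and non-empty
        return field != null && !field.toString().isEmpty();
    }
}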

User Defined Functions (UDFs)
•Apache Pig provides extensive support for user defined functions (UDFs)
as a way to specify custom processing.
•Pig UDFs can currently be executed in several languages: Java, Python,
JavaScript, and Ruby.
•The most extensive support is provided for Java functions.
•Java UDFs can be invoked through multiple ways.
•The simplest UDF can just extend EvalFunc, which requires only the exec
function to be implemented.
•What is a Piggybank?
•Piggybank is a collection of user-contributed UDFs that is released along
with Pig. Piggybank UDFs are not included in the Pig JAR, so you have to
register them manually in your script. You can also write your own UDFs or
use those written by other users.

Contd..,
•Eval Functions
•The UDF class extends the EvalFunc class, which is the base for all eval functions.
•All evaluation functions extend the Java class org.apache.pig.EvalFunc.
package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UPPER extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row", e);
        }
    }
}

Contd..,
•Using the UDF:
a) Registering the JAR file:
REGISTER path;
b) Defining an alias:
DEFINE up myudfs.UPPER();
c) Using the UDF:
grunt> Upper_case = FOREACH emp_data GENERATE up(name);

Data Processing Operators
•Loading and Storing Data:
Load: The Apache Pig LOAD operator is used to load data from the file system.
Example:
grunt> A = LOAD '/pigexample/pload.txt' USING PigStorage(',') AS (a1:int, a2:int, a3:int, a4:int);
Store: We can store the loaded data in the file system using the STORE operator.
Example:
grunt> STORE A INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');

Contd...,
•Filtering Data:
Once you have some data loaded into a relation, often the next step is to filter it to remove the data that you are not interested in. By filtering early in the processing pipeline, you minimize the amount of data flowing through the system, which can improve efficiency.
FOREACH...GENERATE: The FOREACH...GENERATE operator has a nested form to support more complex processing.
Example:
grunt> B = FOREACH A GENERATE $0, $2 + 1, 'Constant';
grunt> DUMP B;
•STREAM: The STREAM operator allows you to transform data in a relation using an external program or script.

Contd..,
•Grouping and Joining Data:
JOIN: The JOIN operator is used to combine records from two or more relations.
Example:
grunt> customers3 = JOIN customers1 BY id, customers2 BY id;
COGROUP: JOIN always gives a flat structure: a set of tuples. The COGROUP statement is similar to JOIN, but instead creates a nested set of output tuples.
GROUP: Where COGROUP groups the data in two or more relations, the GROUP statement groups the data in a single relation.

Contd..,
•Sorting Data: Relations are unordered in Pig.
Consider a relation A:
grunt> DUMP A;
(2,3)
(1,2)
(2,4)
•The following example sorts A by the first field in ascending order and by
the second field in descending order:
grunt> B = ORDER A BY $0, $1 DESC;
grunt> DUMP B;
Output: (1,2)
(2,4)
(2,3)

Contd..,
•Combining and Splitting Data:Sometimes you have several relations that
you would like to combine into one. For this, the UNION statement is used.
•grunt> DUMP A;
(2,3)
(1,2)
(2,4)
•grunt> DUMP B;
(z,x,8)
(w,y,1)
•grunt> C = UNION A, B;
•grunt> DUMP C;
(2,3)
(z,x,8)
(1,2)
(w,y,1)