Introduction to Pig

TrilokiGupta, 19 slides, Mar 30, 2018

About This Presentation

An engine for executing data flows in parallel on Hadoop. It provides a simple language called Pig Latin for queries and data manipulation.


Slide Content

Introduction to Pig
Triloki Gupta (205217006)

What is Pig?

Pig is an open-source, high-level data flow system: an engine for executing data flows in parallel on Hadoop. It provides a simple language called Pig Latin for queries and data manipulation. Pig Latin already has most of the traditional data operations built in:
- Filtering data
- Sorting data
- Joining data
Pig users can also create their own functions for reading, processing, and writing data as user-defined functions (UDFs).
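As a minimal sketch of the built-in operations listed above (the file names and field names here are hypothetical, not from the slides), a Pig Latin script combining filtering, joining, and sorting might look like this:

```pig
-- Load two hypothetical tab-delimited files with declared schemas
students = LOAD 'students.txt' AS (id:int, name:chararray, dept_id:int);
depts    = LOAD 'depts.txt'    AS (dept_id:int, dept_name:chararray);

-- Filtering: keep only students in departments 1-10
filtered = FILTER students BY dept_id <= 10;

-- Joining: attach the department name to each student
joined   = JOIN filtered BY dept_id, depts BY dept_id;

-- Sorting: order the result by student name
sorted   = ORDER joined BY filtered::name;

DUMP sorted;
```

Each statement defines a relation from the previous one, which is how Pig Latin expresses a data flow rather than a single query.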

What is a Pig Latin Program?

A Pig Latin program is made up of a series of operations, or transformations, that are applied to the input data to produce output. Pig's job is to convert these transformations into a series of MapReduce jobs.
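To see this conversion, Pig's EXPLAIN statement prints the plans it would generate instead of running the job. A small sketch (the input file and its fields are assumptions for illustration):

```pig
-- Hypothetical input; PigStorage (tab-delimited) is Pig's default loader
logs    = LOAD 'access_log.txt' AS (user:chararray, url:chararray, bytes:long);
by_user = GROUP logs BY user;
totals  = FOREACH by_user GENERATE group AS user, SUM(logs.bytes) AS total_bytes;

-- Prints the logical, physical, and MapReduce execution plans for 'totals'
EXPLAIN totals;
```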

Why Do You Need Pig?

- It's easy to learn, especially if you're familiar with SQL.
- Pig's multi-query approach reduces the number of times data is scanned. This can mean 1/20th the lines of code and 1/16th the development time compared with writing raw MapReduce.
- Performance of Pig is on par with raw MapReduce.
- Pig provides data operations like filters, joins, and ordering, and nested data types like tuples, bags, and maps, that are missing from MapReduce.
- Pig Latin is easy to write and read.

Why Was Pig Created?

Pig was originally developed at Yahoo! in 2006 to give researchers an ad-hoc way of creating and executing MapReduce jobs on very large data sets. It was created to reduce development time through its multi-query approach. Pig was also created for professionals from a non-Java background, to make their job easier.

Where Should Pig Be Used?

Pig can be used in the following scenarios:
- When data loads are time sensitive.
- When processing various data sources.
- When analytical insights are required through sampling.

Where Not to Use Pig?

- Where the data is completely unstructured, such as video, audio, and readable text.
- Where tight time constraints exist, as Pig is slower than hand-tuned MapReduce jobs.
- Where more control is required to optimize the code.

Pig Philosophy

- Pigs Eat Anything
- Pigs Live Anywhere
- Pigs Are Domestic Animals
- Pigs Fly

Pigs Eat Anything

- Pig can operate on data whether it has metadata or not.
- Pig can operate on data that is relational, nested, or unstructured.
- Pig can easily be extended to operate on data beyond files, including key/value stores, databases, etc.
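For instance, Pig can load the same file with or without a declared schema (the file name here is hypothetical):

```pig
-- With metadata: fields are named and typed
typed   = LOAD 'events.txt' AS (ts:long, event:chararray);

-- Without metadata: fields are referenced positionally as $0, $1, ...
untyped = LOAD 'events.txt';
firsts  = FOREACH untyped GENERATE $0;
```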

Pigs Live Anywhere

- Pig is intended to be a language for parallel data processing.
- Pig is not tied to one particular parallel framework.
- Pig was implemented first on Hadoop, but it is not intended to run only on Hadoop; examples include Pig on MongoDB and Pig with Cassandra.

Pigs Are Domestic Animals

- Pig is designed to be easily controlled and modified by its users.
- It supports integration of user-designed functions (UDFs), which can be written in Java or Jython.
- Pig supports custom load and store functions (Loaders and Storers).
- Pig supports streaming: execution of external executables using Hadoop Streaming.
- Pig uses an optimizer that rearranges some operations for better performance.
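As a hedged sketch of UDF integration, registering and calling a Java UDF looks like this (the jar name, class name, and input file are hypothetical):

```pig
-- Register a jar containing a hypothetical user-defined function,
-- then give it a short alias with DEFINE
REGISTER 'myudfs.jar';
DEFINE UPPER_NAME com.example.pig.UpperName();

people  = LOAD 'people.txt' AS (name:chararray, age:int);
shouted = FOREACH people GENERATE UPPER_NAME(name), age;
```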

Pigs Fly

Pig processes data quickly. The goal is to consistently improve Pig's performance, and to implement features in a way that lets performance keep rising.

Applications of Apache Pig:

- Processing of web logs.
- Data processing for search platforms.
- Support for ad-hoc queries across large data sets.
- Quick prototyping of algorithms for processing large data sets.

How Yahoo! Uses Pig:

Yahoo! uses Pig for the following purposes:
- In Pipelines: to bring in logs from its web servers, where these logs undergo a cleaning step to remove bots, company internal views, and clicks.
- In Research: to quickly write a script to test a theory. Pig integration makes it easy for researchers to take a Perl or Python script and run it against a huge data set.

Basic Program Structure of Pig:

Here's the hierarchy of Pig's program structure:
- Script: Pig can run a file that contains Pig commands. For example, "pig script.pig" runs the commands in the local file script.pig.
- Grunt: an interactive shell for running Pig commands. It is also possible to run Pig scripts from within Grunt using run and exec.
- Embedded: you can run Pig programs from Java, much like you can use JDBC to run SQL programs from Java.
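The run and exec commands mentioned above can be sketched from inside a Grunt session (the script name is hypothetical):

```pig
-- 'run' executes the script in the current Grunt session, so the
-- aliases it defines remain visible afterwards
run myscript.pig

-- 'exec' executes the script in a separate context, leaving the
-- current session's aliases untouched
exec myscript.pig
```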

Components of Pig:

Basic Types of Data Models in Pig:

Pig has four basic types of data models:
- Atom: a simple atomic data value. It is stored as a string but can be used as either a string or a number.
- Tuple: an ordered set of fields.
- Bag: a collection of tuples.
- Map: a set of key/value pairs.
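A schema using all four data model types can be sketched in one LOAD statement (the file and field names are assumptions for illustration):

```pig
-- Hypothetical file whose schema combines all four data model types
data = LOAD 'records.txt' AS (
    id:int,                                          -- atom
    location:tuple(city:chararray, zip:chararray),   -- tuple
    scores:bag{t:tuple(score:int)},                  -- bag of tuples
    attrs:map[chararray]                             -- map of key/value pairs
);

-- Access a tuple field, flatten the bag, and look up a map key
out = FOREACH data GENERATE id, location.city, FLATTEN(scores), attrs#'color';
```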

Resources

- Yahoo Pig Tutorial: http://developer.yahoo.com/hadoop/tutorial/pigtutorial.html
- edureka.co: https://www.edureka.co/blog/introduction-to-pig/
- slideshare.net: https://www.slideshare.net/Avkashslide/introduction-to-apache-pig-18002897