Pig Latin

BitaKazemi1 607 views 21 slides Sep 25, 2014

About This Presentation

Pig Latin is a language for data processing


Slide Content

Pig Latin: A Not-So-Foreign Language for Data Processing. Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins (Yahoo! Research). Presented by: Bita Kazemi Zahrani. Adv. Data Intensive Computing, Fall 2014, University of Georgia.

Outline of the Paper 1. Introduction 2. Features and Motivations 3. PIG LATIN 4. Implementation 5. Debugging Environment 6. Scenarios 7. Related and Future work 8. Summary 9. References

Introduction: the need for ad-hoc data analysis over large data sets; SQL and declarative languages; the Map-Reduce paradigm; why Pig Latin.

Why not SQL or Map-Reduce? Programmers prefer imperative scripts to declarative queries. SQL solutions for large data sets, such as Netezza or Oracle RAC, are expensive. Map-Reduce is very low level, and its data flow is rigid and limited to one level. Custom code has to be written even for common operations. That code is hard to reuse and maintain, and optimizing it is nearly impossible.

Pig vs Map-Reduce and SQL. The SQL query
    SELECT category, AVG(pagerank) FROM urls WHERE pagerank > 0.2 GROUP BY category HAVING COUNT(*) > 10^6
is written in Pig Latin as:
    good_urls = FILTER urls BY pagerank > 0.2;
    groups = GROUP good_urls BY category;
    big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
    output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
The Pig Latin version is imperative: a sequence of steps, each of which is a high-level transformation.
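
The four steps can be mimicked in plain Python to make the data flow concrete. This is only a sketch of the semantics, not how Pig executes the program; the sample rows, the function name and the small group-size threshold (standing in for 10^6) are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows: (url, category, pagerank).
urls = [
    ("a.com", "news", 0.9),
    ("b.com", "news", 0.5),
    ("c.com", "news", 0.1),
    ("d.com", "sports", 0.7),
]


def big_category_averages(rows, min_pagerank=0.2, min_group_size=1):
    # good_urls = FILTER urls BY pagerank > 0.2
    good_urls = [r for r in rows if r[2] > min_pagerank]

    # groups = GROUP good_urls BY category
    groups = defaultdict(list)
    for r in good_urls:
        groups[r[1]].append(r)

    # big_groups = FILTER groups BY COUNT(good_urls) > threshold
    big_groups = {c: bag for c, bag in groups.items() if len(bag) > min_group_size}

    # output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank)
    return {c: sum(r[2] for r in bag) / len(bag) for c, bag in big_groups.items()}


result = big_category_averages(urls)
print(result)
```

Each intermediate name corresponds to one Pig Latin statement, which is the point of the comparison with the single nested SQL query.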

Good Features: optimizable; support for a flexible, fully nested data model; support for user-defined functions; works with plain input files without schema information; a novel debugging environment.

PIG Features and Motivations: Data Flow Language. A program is a sequence of steps, which is programmer friendly. The steps need not be executed in the order in which they are written:
    spam_urls = FILTER urls BY isSpam(url);
    culprit_urls = FILTER spam_urls BY pagerank > 0.8;
The query optimizer may change the order of the sequence, for example by applying the pagerank filter before the isSpam filter.
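
Why reordering is safe and useful can be seen in a small Python sketch. It assumes that is_spam is an expensive user-defined check (here a hypothetical substring test with a call counter) and that the sample rows are invented; both filter orders give the same answer, but the reordered one calls the expensive function less often.

```python
calls = {"is_spam": 0}


def is_spam(row):
    """Hypothetical stand-in for an expensive UDF; counts how often it runs."""
    calls["is_spam"] += 1
    return "spam" in row[0]


# Hypothetical rows: (url, pagerank).
urls = [("spam-a.com", 0.9), ("spam-b.com", 0.3), ("ok.com", 0.95), ("fine.org", 0.1)]


def as_written(rows):
    spam_urls = [r for r in rows if is_spam(r)]    # FILTER urls BY isSpam(url)
    return [r for r in spam_urls if r[1] > 0.8]    # FILTER spam_urls BY pagerank > 0.8


def reordered(rows):
    high_rank = [r for r in rows if r[1] > 0.8]    # cheap filter first
    return [r for r in high_rank if is_spam(r)]    # expensive filter on fewer rows


calls["is_spam"] = 0
a = as_written(urls)
calls_written = calls["is_spam"]

calls["is_spam"] = 0
b = reordered(urls)
calls_reordered = calls["is_spam"]

print(a == b, calls_written, calls_reordered)
```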

PIG Features and Motivations: Quick Start and Interoperability and Nested Data Model. Input is read directly and quickly into tuples.
Load:
    queries = LOAD 'query_log.txt' USING myLoad() AS (userID, queryString, timestamp);
The input file is query_log.txt; myLoad() deserializes the file into tuples with the fields userID, queryString and timestamp. Output can be written in any user-defined format.
Store:
    STORE query_revenues INTO 'myoutput' USING myStore();
Here myStore() serializes the tuples. Pig therefore conforms to the existing data and application ecosystem. LIMITATION: it works with read-only data sets and scan-centric workloads.
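
The roles of the load and store functions can be sketched in Python. The functions my_load and my_store below are hypothetical analogues of myLoad() and myStore(), and the tab-separated record format is an assumption; the slides do not say how query_log.txt is encoded.

```python
import io


def my_load(stream):
    """Hypothetical deserializer: one tab-separated record per line."""
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue
        user_id, query_string, timestamp = line.split("\t")
        yield (user_id, query_string, int(timestamp))


def my_store(rows, stream):
    """Hypothetical serializer: writes each tuple back as a tab-separated line."""
    for row in rows:
        stream.write("\t".join(str(field) for field in row) + "\n")


# In-memory stand-in for query_log.txt so the sketch needs no files.
raw = "alice\tlakers\t1\nbob\tipod\t2\n"
queries = list(my_load(io.StringIO(raw)))

out = io.StringIO()
my_store(queries, out)
print(queries)
```

Because the parsing lives entirely in the user-supplied functions, the data never has to be imported into a database first.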

PIG Features and Motivations: Quick Start and Interoperability and Nested Data Model. There is no need to implement transactions. Schemas are optional:
    good_urls = FILTER urls BY $2 > 0.2;
If no schema is provided, tuple fields are referred to by position, so $2 is the third field. Data is nested, as opposed to 1NF; this matches how data is stored on disk and is easier for programmers.
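
A rough Python analogy for optional schemas: plain tuples are indexed by position, and a namedtuple plays the role of a schema that gives the same fields names. The field names and sample rows are assumptions for illustration.

```python
from collections import namedtuple

# Hypothetical rows without any schema attached.
rows = [("a.com", "news", 0.9), ("b.com", "news", 0.1)]

# Without a schema: the third field is addressed by position ($2 in Pig Latin).
good_by_position = [r for r in rows if r[2] > 0.2]

# With an optional schema: the same field can be addressed by name.
Url = namedtuple("Url", ["url", "category", "pagerank"])
good_by_name = [u for u in map(Url._make, rows) if u.pagerank > 0.2]

print(good_by_position, good_by_name)
```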

PIG Features and Motivations: Quick Start and Interoperability and Nested Data Model. There are four data types. Atom: a simple atomic value, e.g. 'alice'. Tuple: a sequence of fields, each of which can be of any data type, e.g. ('alice', 'lakers'). Bag: a collection of tuples. Map: a collection of data items, each associated with a key, e.g. ['fan of' → {('lakers'), ('ipod')}, 'age' → 20]. The values in a map need not all be of the same type.
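
The four types map loosely onto Python values, which may help when reading the later examples. This is an analogy only: a Python list stands in for a bag, and the bag contents are invented for illustration.

```python
# Atom: a simple atomic value.
atom = "alice"

# Tuple: a sequence of fields of any type.
tup = ("alice", "lakers")

# Bag: a collection of tuples (a list is used here; tuples may differ in shape).
bag = [("alice", "lakers"), ("alice", ("ipod", "apple"))]

# Map: keys mapped to values of possibly different types (a bag and an atom here).
data_map = {"fan of": [("lakers",), ("ipod",)], "age": 20}

# Nested data can be reached directly, without joins across flat tables.
first_interest = data_map["fan of"][0][0]
print(first_interest, data_map["age"])
```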

Expressions in Pig Latin. Courtesy: Pig Latin: a not-so-foreign language for data processing

UDF support. Pig supports user-defined functions (UDFs):
    groups = GROUP urls BY category;
    output = FOREACH groups GENERATE category, top(urls);
A UDF can take nested data as input and produce nested data as output; both may be non-atomic values. In SQL, scalar functions can be used only in the SELECT clause, set-valued functions only in the FROM clause, and aggregation functions only in conjunction with GROUP BY or PARTITION BY. Pig Latin UDFs can be used without any such restriction on context. Pig Latin tries to support as many languages as possible for UDFs; it now works with C/C++, Java, Python and Perl.
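
A UDF that takes a bag and returns a bag can be sketched in Python as follows. The function top here is a hypothetical reading of the slide's top(urls): it is assumed to return the n rows with the highest pagerank, and the sample rows are invented.

```python
from collections import defaultdict


def top(bag, n=2):
    """Hypothetical UDF: takes a bag of url tuples, returns a bag of the n best."""
    return sorted(bag, key=lambda row: row[2], reverse=True)[:n]


# Hypothetical rows: (url, category, pagerank).
urls = [
    ("a.com", "news", 0.9),
    ("b.com", "news", 0.5),
    ("c.com", "news", 0.7),
    ("d.com", "sports", 0.4),
]

# groups = GROUP urls BY category
groups = defaultdict(list)
for row in urls:
    groups[row[1]].append(row)

# output = FOREACH groups GENERATE category, top(urls)
output = {category: top(bag) for category, bag in groups.items()}
print(output)
```

Both the argument and the result of top are non-atomic, which is the case the slide contrasts with SQL's context-restricted functions.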

Data Processing functions.
Per-tuple processing with FOREACH:
    expanded_queries = FOREACH queries GENERATE userId, expandQuery(queryString);
Discarding unwanted data with FILTER (neq means "not equal"):
    real_queries = FILTER queries BY userId neq 'bot';
or, using a UDF:
    real_queries = FILTER queries BY NOT isBot(userId);
Map-reduce in Pig Latin:
    map_result = FOREACH input GENERATE FLATTEN(map(*));
    key_groups = GROUP map_result BY $0;
    output = FOREACH key_groups GENERATE reduce(*);
FLATTEN unnests the bag of tuples returned by the map UDF, and (*) passes all fields. The GROUP statement groups by key, and the last statement passes the bag of values for every key to the reduce UDF. Courtesy: Pig Latin: a not-so-foreign language for data processing
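
The three-statement map-reduce pattern can be traced in Python. This is a sketch of the semantics only; the word-count map and reduce functions and the input records are assumptions chosen to keep the example small.

```python
from collections import defaultdict


def map_fn(record):
    """Hypothetical map UDF: returns a bag of (key, value) pairs."""
    return [(word, 1) for word in record.split()]


def reduce_fn(key, values):
    """Hypothetical reduce UDF: receives a key and the bag of its values."""
    return (key, sum(values))


def run_map_reduce(records, map_fn, reduce_fn):
    # map_result = FOREACH input GENERATE FLATTEN(map(*))
    map_result = [pair for record in records for pair in map_fn(record)]

    # key_groups = GROUP map_result BY $0
    key_groups = defaultdict(list)
    for key, value in map_result:
        key_groups[key].append(value)

    # output = FOREACH key_groups GENERATE reduce(*)
    return [reduce_fn(key, values) for key, values in key_groups.items()]


counts = dict(run_map_reduce(["pig latin", "pig pen"], map_fn, reduce_fn))
print(counts)
```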

Parallelization and debugging environment. Pig Latin works with parallelizable primitives: parallelism is required, so primitives that do not parallelize are excluded. Non-parallel operations can still be implemented as UDFs, as long as the user is aware of the loss of efficiency. The run-debug-run paradigm does not fit, because each run is time consuming. Instead, the debugging environment provides an interactive table of example data that is filled in step by step and adjusted over time, and that pinpoints the erroneous step.

Implementation. Building a logical plan: a logical plan is built for each bag. The logical plan is then compiled into a physical plan. The lazy execution style allows in-memory pipelining, filter reordering and other optimizations. Compilation: each (CO)GROUP statement in the logical plan is converted into a map-reduce job. Initially, the map function assigns keys to tuples based on the BY clause and the reduce function is a no-op. Commands such as LOAD and FOREACH that come before the first (CO)GROUP go into the first map function; the commands between one (CO)GROUP c[i] and the next can go either into the reduce function of c[i] or into the map function of the following job. Courtesy: Pig Latin: a not-so-foreign language for data processing
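
The compilation rule can be illustrated with a deliberately simplified Python sketch. It assumes a linear list of command names, creates one job per (CO)GROUP, and always places intermediate commands in the reduce function of the preceding job (one of the two placements the slide allows). The function name and the dictionary layout are hypothetical, and the real Pig compiler handles far more than this.

```python
def compile_to_jobs(commands):
    """Split a linear command list into map-reduce jobs, one per (CO)GROUP."""
    jobs = []
    pending = []  # commands seen since the last (CO)GROUP
    for cmd in commands:
        if cmd in ("GROUP", "COGROUP"):
            if not jobs:
                # Everything before the first (CO)GROUP goes into the first map.
                job = {"map": pending + [cmd], "reduce": []}
            else:
                # Assumed choice: push intermediate commands into the previous reduce.
                jobs[-1]["reduce"].extend(pending)
                job = {"map": [cmd], "reduce": []}
            jobs.append(job)
            pending = []
        else:
            pending.append(cmd)
    if jobs:
        jobs[-1]["reduce"].extend(pending)
    elif pending:
        # No (CO)GROUP at all: a single map-only job.
        jobs.append({"map": pending, "reduce": []})
    return jobs


plan = compile_to_jobs(
    ["LOAD", "FILTER", "GROUP", "FOREACH", "GROUP", "FOREACH", "STORE"]
)
print(plan)
```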

Debugging. Sample or side data: Pig Pen creates a side data set automatically. This dynamically created data set is called the sandbox data set. If something is wrong, it is shown on the right-hand side. Courtesy: Pig Latin: a not-so-foreign language for data processing

Usage. Rollup aggregates: computed over various activity logs and web crawls, e.g. calculating the frequency of search terms aggregated over days, or over geographical locations derived from IP addresses. Temporal analysis: studying how search logs change over time. Session analysis: analyzing user sessions, e.g. how long an average user session lasts, or the sequence of pages viewed or clicked.
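
A rollup aggregate of the first kind can be sketched in a few lines of Python. The log entries are invented, and counting in memory with Counter is only meant to show what is being computed, not how Pig would compute it at scale.

```python
from collections import Counter

# Hypothetical search log entries: (day, search term).
log = [
    ("2014-09-24", "pig"),
    ("2014-09-24", "pig"),
    ("2014-09-25", "hadoop"),
    ("2014-09-25", "pig"),
]

# Frequency of each search term per day.
per_day = Counter(log)

# Rolled up over all days.
overall = Counter(term for _, term in log)

print(per_day[("2014-09-24", "pig")], overall["pig"])
```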

Future work: a safe optimizer; external functions; a unified environment (embedding in different language platforms); a user interface that provides sharing and collaboration.

Summary 1. Introduction 2. Features and Motivations 3. PIG LATIN 4. Implementation 5. Debugging Environment 6. Scenarios 7. Related and Future work 8. Summary

Questions?

References: Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. 2008. Pig Latin: a not-so-foreign language for data processing.