Quick Introduction to Apache Tez

getindata 4,122 views 20 slides Dec 11, 2014

About This Presentation

Did you like it? Check out our blog to stay up to date: https://getindata.com/blog

We share our slides about Apache Tez, delivered as a lightning talk at the Warsaw Hadoop User Group: http://www.meetup.com/warsaw-hug/events/218579675


Slide Content

© Copyright 2014 GetInData. All rights reserved. Not to be reproduced without prior written consent.
Apache Tez
Piotr Krewski, Adam Kawa

Apache Tez
■Efficient execution engine
●Faster than MapReduce
■Can be leveraged by existing frameworks, e.g. Hive, Pig, Scalding
●SET hive.execution.engine=[tez,mr,spark]
■Built atop Hadoop YARN
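
The engine switch mentioned above is a per-session Hive setting. A minimal sketch, assuming a Hive session (CLI or Beeline) on a cluster where Tez is already installed; which engines are actually available depends on the Hive version and the deployment:

```sql
-- Assumption: Tez is installed and configured on this cluster.
-- Run the following queries in this session on Tez.
SET hive.execution.engine=tez;

-- Print the current value to confirm the switch took effect.
SET hive.execution.engine;

-- Fall back to classic MapReduce for comparison runs.
SET hive.execution.engine=mr;
```

Because the setting is scoped to the session, the same query can be timed on both engines without touching the cluster-wide configuration.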

Some Advantages Of Tez
■Natural DAG
●No intermediate data written to HDFS (replication 3x)
●No need for “empty” map tasks to reshuffle data
●No time spent in a queue to start a next MapReduce job

Simple Comparison
■Three real-world queries
■Real production datasets
●Stored in Avro and ORC formats
■900+ node cluster (thanks, Spotify!)
●Queries run in a queue with limited capacity
■Hive 0.14 and Tez 0.5 (version from April 2014)

Top Three Users
■Find the top 3 users with the largest number of streams

SELECT user_id, count(*) AS cnt
FROM stream
GROUP BY user_id
ORDER BY cnt DESC
LIMIT 3

■The pattern is GROUP BY and ORDER BY and LIMIT

Top Three Users

                      Hive on MapReduce on Avro   Hive on Tez on Avro
Plan                  2 MapReduce jobs            Map => Reduce => Reduce
Wallclock time (sec)  353                         197
Improvement           -                           1.8x

Top Three Users - On A Busier Cluster

                      Hive on MapReduce on Avro   Hive on Tez on Avro
Wallclock time (sec)  576                         183
Improvement           -                           3.14x

Console Output
……
Query ID = kawaa_20141130185757_3e4bd581-23bb-4d7c-b755-044c4a5783b5
Total jobs = 1
Launching Job 1 out of 1

Status: Running (application id: application_1414118456795_314710)

Map 1: -/-   Reducer 2: 0/5  Reducer 3: 0/1
Map 1: 0/36  Reducer 2: 0/5  Reducer 3: 0/1
Map 1: 0/36  Reducer 2: 0/5  Reducer 3: 0/1
Map 1: 0/36  Reducer 2: 0/5  Reducer 3: 0/1
……
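
The Map 1, Reducer 2 and Reducer 3 names in the progress lines are vertices of a single Tez DAG, which is why only one job is launched. One way to see the vertices before running a query is to ask Hive for its plan. A minimal sketch, assuming the output above belongs to the Top Three Users query (the slide does not say so explicitly, although the Map => Reduce => Reduce shape matches):

```sql
-- Assumption: same session and tables as the Top Three Users query.
SET hive.execution.engine=tez;

-- EXPLAIN prints the plan without executing the query; on Tez the
-- plan lists the DAG's vertices and the edges between them.
EXPLAIN
SELECT user_id, count(*) AS cnt
FROM stream
GROUP BY user_id
ORDER BY cnt DESC
LIMIT 3;
```

Running the same EXPLAIN with the engine set to mr shows the plan split into separate MapReduce stages instead.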

Some Advantages Of Tez
■Container reuse
●Less time spent negotiating with the Resource Manager
●Smaller tasks can be started, so fewer stragglers
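
Container reuse is controlled by Tez properties. A minimal sketch, assuming these are set in the Hive session before the Tez application master starts (they are often placed in tez-site.xml instead); the timeout values are illustrative and not the ones used in this benchmark:

```sql
-- Let the Tez application master hand a finished task's container
-- to the next task instead of releasing it (usually on by default).
SET tez.am.container.reuse.enabled=true;

-- Assumption: illustrative values. How long an idle container is
-- held before being returned to YARN, in milliseconds.
SET tez.am.container.idle.release-timeout-min.millis=10000;
SET tez.am.container.idle.release-timeout-max.millis=20000;
```

Longer idle timeouts favor latency for the next query in the session at the cost of holding cluster capacity, which matters on a shared queue.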

Top Ten Countries
■Find the top 10 countries with the largest number of streams

SELECT country, count(*) AS cnt
FROM stream
JOIN user ON stream.user_id = user.id
GROUP BY country
ORDER BY cnt DESC
LIMIT 10

■The pattern is JOIN ON and GROUP BY and ORDER BY and LIMIT

Top Ten Countries

                      Hive on MapReduce on Avro   Hive on Tez on Avro                        Hive on Tez on ORC Snappy
Plan                  3 MapReduce jobs            Map => Map => Reduce => Reduce => Reduce   Map => Map => Reduce => Reduce => Reduce
Wallclock time (sec)  636                         268                                        203
Improvement           -                           2.4x                                       3.1x
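
The ORC Snappy column implies a copy of the data stored as ORC with Snappy compression. A minimal sketch of how such a copy could be created in Hive; the table name stream_orc is hypothetical, and the slides do not say how the benchmark tables were actually built:

```sql
-- Assumption: stream_orc is a made-up name for the converted table.
-- CREATE TABLE ... AS SELECT rewrites the Avro-backed table as ORC,
-- compressing the ORC stripes with Snappy.
CREATE TABLE stream_orc
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY")
AS
SELECT * FROM stream;
```

Setting "orc.compress" to "ZLIB" instead would give the ORC ZLIB variant that appears in the later comparison.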

The Biggest Polish Fan of Timbuktu
■Find the biggest Polish fan of Timbuktu (a popular Swedish rap/reggae artist)

SELECT user_id, count(*) AS cnt
FROM stream
JOIN user ON stream.user_id = user.id
JOIN track ON stream.track_id = track.id
WHERE ...
GROUP BY user_id
ORDER BY cnt DESC
LIMIT 1

The Biggest Polish Fan of Timbuktu

                      Hive on MapReduce on ORC ZLIB   Hive on Tez on ORC ZLIB                 Hive on Tez on ORC Snappy
Plan                  6 MapReduce jobs                Map => Map => Map => Reduce => Reduce   Map => Map => Map => Reduce => Reduce
Wallclock time (sec)  519                             259                                     209
Improvement           -                               2x                                      2.5x

The Biggest Polish Fan of Timbuktu
■We also ran this query on a 1.5-year-long production dataset
●25+ TB of data
●690 nodes
■Benefits (after optimizations)
●6+ hours with Hive on MapReduce and Avro Deflate
●10min 11sec with Hive on Tez and ORC Zlib
■Features used
●Container reuse
●Broadcast JOIN
●Warm containers
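
Broadcast joins and warm containers are both switched on through Hive settings. A minimal sketch, assuming a Hive on Tez session; the size threshold and container count are illustrative, and the slides do not list the values used for this run:

```sql
-- Let Hive turn a join against a small table into a map-side
-- (broadcast) join instead of a shuffle join.
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask=true;

-- Assumption: illustrative threshold, in bytes, for the combined
-- size of the small tables that may be broadcast.
SET hive.auto.convert.join.noconditionaltask.size=100000000;

-- Ask Hive to start Tez containers ahead of the first query so it
-- does not pay the container start-up cost.
SET hive.prewarm.enabled=true;
SET hive.prewarm.numcontainers=10;  -- assumption: illustrative count
```

In the Timbuktu query, the user and track tables are the natural broadcast candidates if their filtered sizes fit under the threshold.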

Summary
■Very fast and smart
●Out of the box performance for small and large queries
■Very good at scale
●Tested by Yahoo!
■Not memory-hungry
●Great for large datasets and multi-tenancy
■Well integrated with YARN
■Painless deployment and maintenance
●No daemons - build Tez jars and upload them to HDFS
■Gives you a powerful and effortless option
●Switch execution mode between MR, Tez or Spark using simple configuration settings

Q&A

Thanks!

About GetInData
■Data-processing challenges addressed with passion and experience
■4+ years with Apache Hadoop and Big Data technologies