Quick Introduction to Apache Tez

© Copyright 2014 GetInData. All rights reserved. Not to be reproduced without prior written consent.
Apache Tez
Piotr Krewski, Adam Kawa

Apache Tez
■Efficient execution engine
●Faster than MapReduce
■Can be leveraged by existing frameworks, e.g. Hive, Pig, Scalding
●SET hive.execution.engine=[tez,mr,spark]
■Built atop Hadoop YARN
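
A minimal sketch of switching engines for one Hive session (Hive CLI or Beeline), using the hive.execution.engine setting from the bullet above. Which values are accepted depends on the Hive version and on what is installed on the cluster, so treat this as an illustration rather than a recipe.

```sql
-- Run subsequent queries in this session on Tez
SET hive.execution.engine=tez;

-- Print the current value to confirm the switch
SET hive.execution.engine;

-- Fall back to classic MapReduce, e.g. to compare run times
SET hive.execution.engine=mr;
```

Because the setting is per session, the same query can be timed on both engines without touching the cluster-wide configuration.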

Some Advantages Of Tez
■Natural DAG
●No intermediate data written to HDFS (replication 3x)
●No need for “empty” map tasks to reshuffle data
●No time spent in a queue to start the next MapReduce job

Simple Comparison
■Three real-world queries
■Real production datasets
●Stored in Avro and ORC formats
■900+ node cluster (thanks, Spotify!)
●Queries run in a queue with limited capacity
■Hive 0.14 and Tez 0.5 (version from April 2014)

Top Three Users
■Find the top 3 users with the largest number of streams

SELECT user_id, count(*) AS cnt
FROM stream
GROUP BY user_id
ORDER BY cnt DESC
LIMIT 3

■The pattern is GROUP BY and ORDER BY and LIMIT

Top Three Users

                     | Hive on MapReduce on Avro | Hive on Tez on Avro
Plan                 | 2 MapReduce jobs          | Map => Reduce => Reduce
Wallclock time (sec) | 353                       | 197
Improvement          |                           | 1.8x

Top Three Users - On A Busier Cluster

                     | Hive on MapReduce on Avro | Hive on Tez on Avro
Wallclock time (sec) | 576                       | 183
Improvement          |                           | 3.14x

Console Output
……
Query ID = kawaa_20141130185757_3e4bd581-23bb-4d7c-b755-044c4a5783b5
Total jobs = 1
Launching Job 1 out of 1

Status: Running (application id: application_1414118456795_314710)

Map 1: -/-    Reducer 2: 0/5    Reducer 3: 0/1
Map 1: 0/36   Reducer 2: 0/5    Reducer 3: 0/1
Map 1: 0/36   Reducer 2: 0/5    Reducer 3: 0/1
Map 1: 0/36   Reducer 2: 0/5    Reducer 3: 0/1
……

Some Advantages Of Tez
■Container reuse
●Less time spent negotiating with the Resource Manager
●Smaller tasks can be started, so fewer stragglers
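
As a hedged illustration of the container reuse point above: Tez exposes a tez.am.container.reuse.enabled property, which to our knowledge is already on by default. The sketch assumes the property is passed through a Hive session; on some setups it may have to be set before the Tez session starts, or in tez-site.xml instead.

```sql
-- Let a container that finished one task pick up the next task
-- instead of being released and renegotiated with the Resource Manager
SET tez.am.container.reuse.enabled=true;
```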

Top Ten Countries
■Find the top 10 countries with the largest number of streams

SELECT country, count(*) AS cnt
FROM stream
JOIN user ON stream.user_id = user.id
GROUP BY country
ORDER BY cnt DESC
LIMIT 10

■The pattern is JOIN ON and GROUP BY and ORDER BY and LIMIT

Top Ten Countries

                     | Hive on MapReduce on Avro | Hive on Tez on Avro                     | Hive on Tez on ORC Snappy
Plan                 | 3 MapReduce jobs          | Map => Map => Reduce => Reduce => Reduce | Map => Map => Reduce => Reduce => Reduce
Wallclock time (sec) | 636                       | 268                                     | 203
Improvement          |                           | 2.4x                                    | 3.1x

The Biggest Polish Fan of Timbuktu
■Find the biggest Polish fan of Timbuktu (a popular Swedish rap/reggae artist)

SELECT user_id, count(*) AS cnt
FROM stream
JOIN user ON stream.user_id = user.id
JOIN track ON stream.track_id = track.id
WHERE ...
GROUP BY user_id
ORDER BY cnt DESC
LIMIT 1

The Biggest Polish Fan of Timbuktu

                     | Hive on MapReduce on ORC ZLIB | Hive on Tez on ORC ZLIB              | Hive on Tez on ORC Snappy
Plan                 | 6 MapReduce jobs              | Map => Map => Map => Reduce => Reduce | Map => Map => Map => Reduce => Reduce
Wallclock time (sec) | 519                           | 259                                  | 209
Improvement          |                               | 2x                                   | 2.5x

The Biggest Polish Fan of Timbuktu
■We also ran this query on a 1.5-year production dataset
●25+ TB of data
●690 nodes
■Benefits (after optimizations)
●6+ hours with Hive on MapReduce and Avro Deflate
●10min 11sec with Hive on Tez and ORC Zlib
■Features used
●Container reuse
●Broadcast JOIN
●Warm containers
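
A sketch of how the broadcast JOIN and warm containers features listed above might be switched on from a Hive session. The property names are standard Hive settings, but the values are purely illustrative assumptions and are not the ones used in this benchmark, which the slides do not state.

```sql
SET hive.execution.engine=tez;

-- Broadcast (map-side) join: let Hive convert joins whose small side fits in memory
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask=true;
-- Size threshold in bytes for the small side; illustrative value only
SET hive.auto.convert.join.noconditionaltask.size=100000000;

-- Warm containers: pre-launch containers when the Tez session starts
SET hive.prewarm.enabled=true;
-- Number of containers to pre-launch; illustrative value only
SET hive.prewarm.numcontainers=10;
```

A sensible threshold depends on container memory and on the size of the user and track tables, so it is worth tuning per cluster.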

Summary
■Very fast and smart
●Out of the box performance for small and large queries
■Very good at scale
●Tested by Yahoo!
■Not memory-hungry
●Great for large datasets and multi-tenancy
■Well integrated with YARN
■Painless deployment and maintenance
●No daemons - build Tez jars and upload them to HDFS
■Gives you a powerful and effortless option
●Switch execution mode between MR, Tez, or Spark using simple configuration settings

About GetInData
■Data-processing challenges addressed with passion and experience
■4+ years with Apache Hadoop and Big Data technologies
