HBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliott Clark, StumbleUpon
cloudera
14 slides
May 30, 2012
About This Presentation
Determining the number of unique users that have interacted with a web page, game, or application is a very common use case. HBase is becoming an increasingly accepted tool for calculating sets or counts of unique individuals who meet some criteria. Computing these statistics can range in difficulty from very simple to very difficult. This session will explore how different approaches have worked or not worked at scale for counting uniques on HBase with Hadoop.
Slide Content
Unique Sets with HBase (Elliott Clark)
Problem Summary
The business needs to know how many unique people have done some action.
Problem Specifics
- Need to create a lot of different counts of unique users
- 100 different counters per day per game (could be per website, or any other group)
- 1000 different games
- Some counters require knowledge of old data:
  - Count of unique users who joined today
  - Count of unique users who have ever paid
1st Try – Bit Set per Hour
- Row key is the game and the hour
- Column qualifiers are the counter names
- Column values are 1.5 MB bit sets
- Each hour a new Bloom filter is created for every counter
- Compute a day's counter by OR'ing the bit sets and correcting the count of set bits for the probability of a collision
1st Try – Example Row
Game1 2012-01-01 0100
  D:DAU          NUM_IN_SET: 1.5M  010010001101100100…
  D:new_uniques  NUM_IN_SET: 0.9M  1100110100111010…
1st Try – Pluses & Minuses
- Allows accuracy requirements to drive size
- Requires a full table scan of all bit sets
- A lot of data generated
- Huge number of regions
- Not 100% accurate
- Very hard to debug
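The OR-and-correct step above can be sketched with the standard Bloom-filter cardinality estimator. This is a minimal illustration, not the presenter's code: the filter size, the number of hash functions `k`, and the helper names are assumptions the slides do not specify.

```python
import math

def estimate_cardinality(bitset, m, k=1):
    """Estimate how many distinct items were inserted into an m-bit
    Bloom filter with k hash functions, given its set-bit count.

    Uses the standard estimate n ~= -(m/k) * ln(1 - t/m),
    where t is the number of set bits."""
    t = bin(bitset).count("1")
    if t >= m:
        raise ValueError("filter is saturated; estimate is unbounded")
    return -(m / k) * math.log(1 - t / m)

def union(a, b):
    """OR two hourly filters together to get a day-level filter
    (valid only if both use the same m and the same hash functions)."""
    return a | b
```

With 100 of 1000 bits set, the estimate is slightly above 100, since some insertions collide on already-set bits.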
2nd Try – Bit Sets per User
- Row key is the user's ID, reversed
  - Reversing the ID stops hot-spotting of regions
- Column qualifiers are a compound key of game and counter name
- Column values are a start date-hour and a bit set
  - Each position in the bit set refers to a subsequent hour after the start time
  - 1 means the user performed that action during that hour; 0 means they did not
2nd Try – Pluses & Minuses
- Easier to debug
- Size grows with the number of users, not with the accuracy required
- Requires a full table scan of all users
- Scales with the number of users ever seen, not the number of users active on a given day
- Very active users can make rows grow without bound
- Very hard to undo any mistakes; dirty data is very hard to correct
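The per-user scheme can be sketched as below. The helper names and the string-reversal of the ID are illustrative assumptions; the slides only say the ID is reversed and that each bit position is one hour after the column's start time.

```python
from datetime import datetime

def row_key(user_id):
    """Reverse the user ID so sequentially assigned IDs spread
    across regions instead of hot-spotting one region."""
    return str(user_id)[::-1]

def hour_index(start, event_time):
    """Bit position = whole hours elapsed since the start date-hour
    stored alongside the bit set."""
    return int((event_time - start).total_seconds() // 3600)

def set_hour(bits, idx):
    """Mark the user as having performed the action in hour idx."""
    return bits | (1 << idx)

def active_hours(bits):
    """Number of distinct hours in which the user acted."""
    return bin(bits).count("1")
```

Note how the "rows grow without bound" minus follows directly: a user active every hour keeps pushing the bit set's length up with no cap.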
3rd Try – Multi-Pass Group
- Group all log data for a day by user ID
- Join the log data with historic data in HBase by doing a GET on the user's row
- Compute new information about the user
- Emit the new user data, and a +1 for every action the user did in the log data
3rd Try – Data Flow
Log Data (multiple sources) → group by user → HBase GET (User Data) → emit Count: +1 and Recomputed User Data
3rd Try – Pluses & Minuses
- Easy to debug
- Scales with the number of users that are active
- Allows for a more holistic view of the users
- Requires a large amount of data to be shuffled and sorted
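The group-then-join flow above can be condensed into a single-process sketch. This is an assumption-laden stand-in, not the StumbleUpon job: a plain dict plays the role of the HBase user-data table, the per-user GET becomes a dict lookup, and the counter names are invented for illustration.

```python
from collections import defaultdict

def multi_pass_count(log_events, user_table):
    """log_events: iterable of (user_id, action) pairs for one day.
    user_table: dict standing in for the HBase user table,
    mapping user_id -> set of actions ever seen (hypothetical schema)."""
    # Pass 1: group the day's log data by user ID
    # (in the real job this is MapReduce's shuffle/sort).
    by_user = defaultdict(set)
    for user_id, action in log_events:
        by_user[user_id].add(action)

    counters = defaultdict(int)
    for user_id, actions in by_user.items():
        history = user_table.get(user_id, set())  # the per-user GET
        for action in actions:
            counters[(action, "daily_unique")] += 1
            if action not in history:
                counters[(action, "ever_unique")] += 1  # first time ever
        # Emit recomputed user data back to the table.
        user_table[user_id] = history | actions
    return counters
```

Because each user appears once per day in the grouped input, the job's cost tracks active users, which is the "scales with the number of users that are active" plus; the shuffle of a full day's logs is the corresponding minus.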
Conclusions
- Try to get the best upper bound on runtime
- More and more flexibility will be required as time goes on
- Store more data now; when new features are requested, development will be easier
- Choose a good serialization framework and stick with it
- Always clean your data before inserting