HBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliott Clark, StumbleUpon
cloudera
14 slides
May 30, 2012
About This Presentation
Determining the number of unique users that have interacted with a web page, game, or application is a very common use case. HBase is becoming an increasingly accepted tool for calculating sets or counts of unique individuals who meet some criteria. Computing these statistics can range in difficulty from very simple to very difficult. This session will explore how different approaches have worked or not worked at scale for counting uniques on HBase with Hadoop.
Slide Content
Unique Sets with HBase (Elliott Clark)
Problem Summary
The business needs to know how many unique people have done some action.
Problem Specifics
- Need to create a lot of different counts of unique users
- 100 different counters per day per game (could be per website, or any other group)
- 1000 different games
- Some counters require knowledge of old data:
  - Count of unique users who joined today
  - Count of unique users who have ever paid
1st Try – Bit Set per Hour
- Row key is the game and the hour
- Column qualifiers are the counter names
- Column values are 1.5 MB bit sets
- Each hour a new Bloom filter is created for every counter
- Compute a day's counter by OR'ing the bit sets and correcting the count of set bits for the probability of a collision
1st Try – Example Row
Game1 2012-01-01 0100
  D:DAU          NUM_IN_SET: 1.5M  010010001101100100…
  D:new_uniques  NUM_IN_SET: 0.9M  1100110100111010…
1st Try – Pluses & Minuses
- Allows accuracy requirements to drive size
- Requires a full table scan of all bit sets
- A lot of data generated
- Huge number of regions
- Not 100% accurate
- Very hard to debug
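The OR-and-correct step above can be sketched with the standard Bloom-filter cardinality estimator. This is a minimal illustration, not the presenter's code: the filter size, the number of hash functions `k`, and the helper names are assumptions the slides do not specify.

```python
import math

def estimate_cardinality(bitset, m, k=1):
    """Estimate how many distinct items were inserted into an m-bit
    Bloom filter with k hash functions, given its set-bit count.

    Uses the standard estimate n ~= -(m/k) * ln(1 - t/m),
    where t is the number of set bits."""
    t = bin(bitset).count("1")
    if t >= m:
        raise ValueError("filter is saturated; estimate is unbounded")
    return -(m / k) * math.log(1 - t / m)

def union(a, b):
    """OR two hourly filters together to get a day-level filter
    (valid only if both use the same m and the same hash functions)."""
    return a | b
```

With 100 of 1000 bits set, the estimate is slightly above 100, since some insertions collide on already-set bits.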
2nd Try – Bit Sets per User
- Row key is the user's ID, reversed
  - Reversing the ID stops hot-spotting of regions
- Column qualifiers are a compound key of game and counter name
- Column values are a start date-hour and a bit set
  - Each position in the bit set refers to a subsequent hour after the start time
  - 1 means the user performed that action during that hour; 0 means they did not
2nd Try – Pluses & Minuses
- Easier to debug
- Size grows with the number of users, not with the accuracy required
- Requires a full table scan of all users
- Scales with the number of users ever seen, not the number of users active on a given day
- Very active users can make rows grow without bound
- Very hard to undo any mistakes; dirty data is very hard to correct
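The per-user scheme can be sketched as below. The helper names and the string-reversal of the ID are illustrative assumptions; the slides only say the ID is reversed and that each bit position is one hour after the column's start time.

```python
from datetime import datetime

def row_key(user_id):
    """Reverse the user ID so sequentially assigned IDs spread
    across regions instead of hot-spotting one region."""
    return str(user_id)[::-1]

def hour_index(start, event_time):
    """Bit position = whole hours elapsed since the start date-hour
    stored alongside the bit set."""
    return int((event_time - start).total_seconds() // 3600)

def set_hour(bits, idx):
    """Mark the user as having performed the action in hour idx."""
    return bits | (1 << idx)

def active_hours(bits):
    """Number of distinct hours in which the user acted."""
    return bin(bits).count("1")
```

Note how the "rows grow without bound" minus follows directly: a user active every hour keeps pushing the bit set's length up with no cap.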
3rd Try – Multi-Pass Group
- Group all log data for a day by user ID
- Join the log data with historic data in HBase by doing a GET on the user's row
- Compute new information about the user
- Emit the new user data, and a +1 for every action the user did in the log data
3rd Try – Data Flow
Log Data (multiple sources) → group by user → HBase GET (User Data) → emit Count: +1 and Recomputed User Data
3rd Try – Pluses & Minuses
- Easy to debug
- Scales with the number of users that are active
- Allows for a more holistic view of the users
- Requires a large amount of data to be shuffled and sorted
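The group-then-join flow above can be condensed into a single-process sketch. This is an assumption-laden stand-in, not the StumbleUpon job: a plain dict plays the role of the HBase user-data table, the per-user GET becomes a dict lookup, and the counter names are invented for illustration.

```python
from collections import defaultdict

def multi_pass_count(log_events, user_table):
    """log_events: iterable of (user_id, action) pairs for one day.
    user_table: dict standing in for the HBase user table,
    mapping user_id -> set of actions ever seen (hypothetical schema)."""
    # Pass 1: group the day's log data by user ID
    # (in the real job this is MapReduce's shuffle/sort).
    by_user = defaultdict(set)
    for user_id, action in log_events:
        by_user[user_id].add(action)

    counters = defaultdict(int)
    for user_id, actions in by_user.items():
        history = user_table.get(user_id, set())  # the per-user GET
        for action in actions:
            counters[(action, "daily_unique")] += 1
            if action not in history:
                counters[(action, "ever_unique")] += 1  # first time ever
        # Emit recomputed user data back to the table.
        user_table[user_id] = history | actions
    return counters
```

Because each user appears once per day in the grouped input, the job's cost tracks active users, which is the "scales with the number of users that are active" plus; the shuffle of a full day's logs is the corresponding minus.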
Conclusions
- Try to get the best upper bound on runtime
- More and more flexibility will be required as time goes on
- Store more data now; when new features are requested, development will be easier
- Choose a good serialization framework and stick with it
- Always clean your data before inserting