FARROT: Filter Amazon Review Ratings Over Time
Andy Lai
Data set
–Stanford SNAP Amazon reviews: 35 GB, 35M reviews
–University of Illinois Amazon member info: 142 MB, member location information
–Example member record: joeme 925/26 | Cleveland, OH United States | Joseph M. Kotow | B00006HAXW (state: OH)
Problem
Amazon doesn't allow filtering review ratings and totals by state or time
Pipeline
–Input: SNAP reviews, in 10 rows per review
–Input: UIC member location TSV
–ImportTsv
–HappyBase
–Example review record: B00006HAXW | Rock Rhythm & Doo Wop Greatest Early Rock | unknown | A1RSDE9-N6RSZF | Joseph M Kotow | 9/9 | 5.0 | 1042502400 | Pittsburgh – Home of the OLDIES | I have all of the doo wop DVD’s and this one is as good or better than the 1st ones. Rem…
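A record like the one above has ten fields, matching the "10 rows per review" layout of the SNAP input. The deck's conversion step was written as Java MapReduce; the sketch below is only a single-process Python stand-in for the same idea. The field names and the "key: value" line layout with blank-line separators are assumptions based on SNAP's published format, and `records` and `to_tsv_row` are hypothetical helper names.

```python
# Assumed SNAP field order: one "key: value" line per field, ten per review.
FIELDS = [
    "product/productId", "product/title", "product/price",
    "review/userId", "review/profileName", "review/helpfulness",
    "review/score", "review/time", "review/summary", "review/text",
]


def records(lines):
    """Yield one dict per review from 'key: value' lines separated by blank lines."""
    current = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current review, if any.
            if current:
                yield current
                current = {}
            continue
        key, sep, value = line.partition(": ")
        if sep and key in FIELDS:
            current[key] = value
    if current:
        yield current


def to_tsv_row(record):
    """Flatten one review dict into a single tab-separated line."""
    # Tabs inside a value would break the TSV layout, so replace them with spaces.
    return "\t".join(record.get(f, "").replace("\t", " ") for f in FIELDS)
```

Each output line can then be loaded as one row; missing fields become empty columns so the column count stays fixed at ten.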
–Pig to clean, join, and aggregate review ratings and totals
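The clean, join, and aggregate step was done in Pig; the deck does not show the script. As a rough illustration of the logic only, here is a Python sketch that joins reviews to member locations and computes the total review count and average rating per product and state. The join key is an assumption: the slides do not say which field links the two data sets, so this sketch joins on a normalized profile name. `normalize_name` and `aggregate` are hypothetical names.

```python
from collections import defaultdict


def normalize_name(name):
    """Hypothetical cleaning rule: lowercase and keep only letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def aggregate(reviews, member_state):
    """Return {(product_id, state): (total_reviews, avg_rating)}.

    reviews: iterable of (product_id, profile_name, score) tuples.
    member_state: dict mapping a normalized member name to a state code.
    """
    totals = defaultdict(lambda: [0, 0.0])  # [review count, sum of scores]
    for product_id, profile_name, score in reviews:
        state = member_state.get(normalize_name(profile_name))
        if state is None:
            continue  # inner join: drop reviews with no known location
        entry = totals[(product_id, state)]
        entry[0] += 1
        entry[1] += float(score)
    return {key: (count, total / count) for key, (count, total) in totals.items()}
```

Normalizing the name lets "Joseph M Kotow" in a review match "Joseph M. Kotow" in the member data; a real pipeline would need a more careful key to avoid merging different people with the same name.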
HBase Schema
Table Schemas (row key: columns):
–PRODUCTID_STATE: TOTAL REVIEWS, AVG RATING
–PRODUCTID_STATE_BYYEAR_EPOCH: TOTAL REVIEWS, AVG RATING
–PRODUCTID_STATE_BYMONTH_EPOCH: TOTAL REVIEWS, AVG RATING
–PRODUCTID_STATE_BYDAY_EPOCH: TOTAL REVIEWS, AVG RATING
Example:
–B00003CWT6_CA_BYMONTH_1008115200000
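To make the key layout concrete, the sketch below builds keys of this shape and reads one row back. The slides do not say how the EPOCH part is derived; this sketch assumes it is the start of the year, month, or day in UTC, expressed in milliseconds (13 digits, as in the example). The column names `stats:total_reviews` and `stats:avg_rating` are hypothetical, as are the function names.

```python
from datetime import datetime, timezone


def bucket_epoch_ms(review_time_s, granularity):
    """Assumed bucketing: start of the period (UTC) in milliseconds."""
    dt = datetime.fromtimestamp(review_time_s, tz=timezone.utc)
    if granularity == "BYYEAR":
        start = datetime(dt.year, 1, 1, tzinfo=timezone.utc)
    elif granularity == "BYMONTH":
        start = datetime(dt.year, dt.month, 1, tzinfo=timezone.utc)
    elif granularity == "BYDAY":
        start = datetime(dt.year, dt.month, dt.day, tzinfo=timezone.utc)
    else:
        raise ValueError(f"unknown granularity: {granularity}")
    return int(start.timestamp()) * 1000


def row_key(product_id, state, granularity=None, review_time_s=None):
    """Build PRODUCTID_STATE or PRODUCTID_STATE_<GRANULARITY>_<EPOCH>."""
    if granularity is None:
        return f"{product_id}_{state}"
    epoch = bucket_epoch_ms(review_time_s, granularity)
    return f"{product_id}_{state}_{granularity}_{epoch}"


def fetch_stats(table, key):
    """Read (total_reviews, avg_rating) for one key, or None if absent.

    `table` is any object with a happybase.Table-style row() method.
    The column names below are hypothetical.
    """
    row = table.row(key.encode("utf-8"))
    if not row:
        return None
    return int(row[b"stats:total_reviews"]), float(row[b"stats:avg_rating"])
```

With HappyBase, `table` could come from `happybase.Connection("localhost").table("farrot")`, where the host and table name are placeholders. Because each bucket is precomputed under its own key, a filter by state and period is a single row read, which is the storage-for-speed trade the schema makes.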
Retrospective
•Design Considerations
–HBase was used for its tight integration with Hadoop and its distributed nature
–Schema was bucketed by state and time, trading storage for query performance
–Java MapReduce was used to convert multi-row reviews to tabular format
•Future
–Scrape Amazon for new reviews
–Search by product name instead of product ID
–Filter and display reviews
About me – Andy Lai
UC Berkeley (B.S. Electrical Engineering & Computer Science)
SJSU (M.S. Engineering)
Software Engineer (DB2, relational database)
Interests: