FARROT - Filter Amazon Review Ratings Over Time

altens123 542 views 11 slides Sep 26, 2014

About This Presentation

FARROT - Filter Amazon Review Ratings Over Time


Slide Content

FARROT:
Filter Amazon Review Ratings
Over Time
Andy Lai

Data set
Stanford SNAP Amazon reviews
–35GB
–35M reviews
University of Illinois Amazon member info
–142MB
–Member location information
–Example: joeme 925/26 | Cleveland, OH United States | Joseph M. Kotow | B00006HAXW (state: OH)
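
As a rough illustration of how a state code could be pulled from a member location string such as the one above, here is a minimal Python sketch. The function name `extract_state` and the assumed layout "city, two-letter state, United States" are illustrative assumptions, not taken from the slides.

```python
import re

# Assumed layout (not confirmed by the slides):
# "<city>, <two-letter state> United States"
_STATE_PATTERN = re.compile(r",\s*([A-Z]{2})\s+United States\s*$")


def extract_state(location):
    """Return the two-letter state code, or None if the layout does not match."""
    match = _STATE_PATTERN.search(location)
    return match.group(1) if match else None


print(extract_state("Cleveland, OH United States"))  # OH
```

Locations that do not follow the assumed layout return None, so they can be dropped or inspected separately.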

Problem
Amazon doesn't allow filtering review ratings
and totals by state or time

Pipeline
–ImportTsv
–SNAP REVIEWS in 10 rows per review
–UIC MEMBER LOCATION TSV
–HappyBase

Pipeline
Sample SNAP review (10 rows per review):
product/productId: B00006HAXW
product/title: Rock Rhythm & Doo Wop Greatest Early Rock
product/price: unknown
review/userId: A1RSDE9-N6RSZF
review/profileName: Joseph M Kotow
review/helpfulness: 9/9
review/score: 5.0
review/time: 1042502400
review/summary: Pittsburgh – Home of the OLDIES
review/text: I have all of the doo wop DVD’s and this one is as good or better than the 1st ones. Rem…
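
The deck states that a Java MapReduce job converted multi-row reviews to tabular format. The sketch below shows the same idea in plain Python rather than Java MR, so it is only an illustration and not the author's code. It assumes each input line looks like `key: value` and that reviews are separated by blank lines; the helper names `reviews_to_tsv` and `_to_row` are hypothetical.

```python
# Field order for the output columns (the ten per-review rows of the SNAP file).
FIELDS = [
    "product/productId", "product/title", "product/price",
    "review/userId", "review/profileName", "review/helpfulness",
    "review/score", "review/time", "review/summary", "review/text",
]


def _to_row(record):
    """Join one review's fields into a single tab-separated line."""
    return "\t".join(record.get(field, "") for field in FIELDS)


def reviews_to_tsv(lines):
    """Yield one TSV line per review from 'key: value' input lines."""
    record = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            # Assumption: a blank line ends the current review.
            if record:
                yield _to_row(record)
                record = {}
            continue
        key, sep, value = line.partition(": ")
        if sep and key in FIELDS:
            # Tabs inside values would break the TSV layout.
            record[key] = value.replace("\t", " ")
    if record:
        yield _to_row(record)


sample = [
    "product/productId: B00006HAXW",
    "review/userId: A1RSDE9-N6RSZF",
    "review/score: 5.0",
    "",
]
for row in reviews_to_tsv(sample):
    print(row.split("\t"))
```

Missing fields become empty columns, which keeps every output line at ten columns for a tool such as ImportTsv.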

Pipeline
–Pig to clean, join, and aggregate review ratings and totals
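
The join and aggregate step was done in Pig. The Python sketch below only illustrates the logic under simple assumptions: reviews arrive as (product ID, user ID, score) tuples, member locations as a user-to-state mapping, and reviews without a known state are dropped. The function name `aggregate_by_product_state` is hypothetical.

```python
from collections import defaultdict


def aggregate_by_product_state(reviews, member_states):
    """Return {(product_id, state): (total_reviews, avg_rating)}.

    reviews: iterable of (product_id, user_id, score) tuples.
    member_states: dict mapping user_id to a state code.
    """
    totals = defaultdict(lambda: [0, 0.0])  # key -> [review count, score sum]
    for product_id, user_id, score in reviews:
        state = member_states.get(user_id)
        if state is None:
            # Inner join: skip reviews whose author has no known location.
            continue
        bucket = totals[(product_id, state)]
        bucket[0] += 1
        bucket[1] += float(score)
    return {
        key: (count, score_sum / count)
        for key, (count, score_sum) in totals.items()
    }


reviews = [("P1", "u1", 5.0), ("P1", "u2", 3.0), ("P1", "u3", 4.0)]
print(aggregate_by_product_state(reviews, {"u1": "OH", "u2": "OH"}))
# {('P1', 'OH'): (2, 4.0)}
```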

HBase Schema
Table Schemas (row key: columns):
–PRODUCTID_STATE: TOTAL REVIEWS, AVG RATING
–PRODUCTID_STATE_BYYEAR_EPOCH: TOTAL REVIEWS, AVG RATING
–PRODUCTID_STATE_BYMONTH_EPOCH: TOTAL REVIEWS, AVG RATING
–PRODUCTID_STATE_BYDAY_EPOCH: TOTAL REVIEWS, AVG RATING
Example:
–B00003CWT6_CA_BYMONTH_1008115200000
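
A small Python sketch of how keys following the patterns above could be assembled. The helper `make_row_key` is hypothetical, and the epoch value is passed through unchanged because the slides do not state its unit or how it is truncated per bucket.

```python
VALID_BUCKETS = ("BYYEAR", "BYMONTH", "BYDAY")


def make_row_key(product_id, state, bucket=None, epoch=None):
    """Build a key like PRODUCTID_STATE or PRODUCTID_STATE_BYMONTH_EPOCH."""
    if bucket is None:
        # All-time totals for a product in one state.
        return f"{product_id}_{state}"
    if bucket not in VALID_BUCKETS:
        raise ValueError(f"unknown bucket: {bucket!r}")
    if epoch is None:
        raise ValueError("a time bucket needs an epoch value")
    return f"{product_id}_{state}_{bucket}_{epoch}"


print(make_row_key("B00003CWT6", "CA", "BYMONTH", 1008115200000))
# B00003CWT6_CA_BYMONTH_1008115200000
```

With HappyBase, a key built this way could be encoded to bytes and passed to `Table.row` to fetch the totals and average rating for that bucket.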

Retrospective

•Design Considerations
–HBase was used for its tight integration with Hadoop and its distributed nature
–Schema was bucketed by state and time for performance, at the cost of storage
–Java MR was used to convert multi-row reviews to tabular format
•Future
–Scrape Amazon for new reviews
–Search by product name instead of product ID
–Filter and display reviews

About me – Andy Lai

UC Berkeley (B.S. Electrical Engineering &
Computer Science)

SJSU (M.S. Engineering)

Software Engineer (DB2, Relational
database)

Interests: