A TensorFlow Recommending System for News, by Fabrício Vargas Matos (Hearst TV) @ PAPIs Connect, São Paulo 2017
papisdotio
947 views
60 slides
Jun 26, 2017
About This Presentation
News recommendation is particularly challenging because of the large volume of new content produced every day and how quickly that content loses value for users. This demands models and infrastructure that can handle those nuances and serve a newly trained model about 100 times per day. This presentation gives a detailed overview of how the R&D team of Hearst's TV division is putting together Google BigQuery, a Kubernetes cluster, and TensorFlow to build a hybrid recommendation system that combines model-based matrix factorization, content recency, and content semantics through NLP.
Slide Content
A TensorFlow Recommending System for News
Fabricio Vargas Matos
Manhattan, NY
TV Stations
Local and National News
Article's page: recommendations for continuous scroll section
Recommended articles
Agenda
1. Recency and cold-start problem
2. Data acquisition
3. Matrix factorization
4. TensorFlow implementation
5. Hybrid Model: NLP and feature engineering
6. Hybrid Model: Hybrid matrix factorization
7. Conclusions
Cold-start problem
(Grid: existing items / new items by existing users / new users)
Cold-start solution
(Same grid: existing items / new items by existing users / new users)
Not personalized! Curated by editors + highly viewed
Hybrid matrix factorization
Data Acquisition
Google Analytics and Google BigQuery: page views with user's time on page
CMS: content corpus (title, body, timestamp, metadata such as sections, tags, etc.)
Contents
TFRecord/CSV files
"Users x Items" Sparsity
Dataset Sparsity
MovieLens (movies) 98.61%
Netflix (movies) 98.82%
TV Stations (news) 99.94%
Yahoo! KDD (music) 99.96%
Matrix Factorization
Latent Factors Model
R (Users x Items) ≈ U (Users x Latent factors) x V (Latent factors x Items)
Figure labels: user bias, item bias; i indexes users, j indexes items
R[i,j] ≈ U[i] x V[j]
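To make the latent factors model concrete, here is a minimal sketch in plain Python rather than TensorFlow, so it is not the implementation from the talk. It fits R[i,j] ≈ U[i] x V[j] plus a user bias and an item bias by stochastic gradient descent over observed entries only. The global mean term, the regularization, the hyperparameters, and the toy time-on-page values (in minutes) are all assumptions made for illustration.

```python
import math
import random


def train_mf(ratings, num_users, num_items, k=2, lr=0.02, reg=0.01,
             epochs=500, seed=0):
    """Fit R[i][j] ~ mu + bu[i] + bi[j] + dot(U[i], V[j]) by SGD.

    Only observed (user, item, value) triples are used, which is what makes
    the approach workable on a very sparse matrix.
    """
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(num_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(num_items)]
    bu = [0.0] * num_users
    bi = [0.0] * num_items
    mu = sum(r for _, _, r in ratings) / len(ratings)  # assumed global mean term
    for _ in range(epochs):
        for i, j, r in ratings:
            dot = sum(a * b for a, b in zip(U[i], V[j]))
            err = r - (mu + bu[i] + bi[j] + dot)
            bu[i] += lr * (err - reg * bu[i])
            bi[j] += lr * (err - reg * bi[j])
            for f in range(k):
                u, v = U[i][f], V[j][f]
                U[i][f] += lr * (err * v - reg * u)
                V[j][f] += lr * (err * u - reg * v)
    return {"mu": mu, "bu": bu, "bi": bi, "U": U, "V": V}


def predict(model, i, j):
    dot = sum(a * b for a, b in zip(model["U"][i], model["V"][j]))
    return model["mu"] + model["bu"][i] + model["bi"][j] + dot


def training_rmse(model, ratings):
    total = sum((r - predict(model, i, j)) ** 2 for i, j, r in ratings)
    return math.sqrt(total / len(ratings))


# Hypothetical (user, item, minutes on page) observations.
ratings = [(0, 0, 3.0), (0, 1, 1.0), (1, 0, 2.5), (1, 2, 0.5),
           (2, 1, 1.5), (2, 2, 4.0), (3, 0, 3.5), (3, 2, 1.0)]

untrained = train_mf(ratings, num_users=4, num_items=3, epochs=0)
trained = train_mf(ratings, num_users=4, num_items=3)
print(training_rmse(untrained, ratings), training_rmse(trained, ratings))
```

Here V is stored with one row per item, so V[j] is the latent vector of item j, matching the U[i] x V[j] notation on the slide.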
Natural Language Processing
1. Concatenate content data (title, body, sections, tags, …)
2. Remove stop words, symbols and HTML tags
3. Train word2vec neural network
4. Combine all word-vectors of each article into one (doc2vec)
Pipeline: CMS articles, then doc2vec contents
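The cleaning and combining steps above can be sketched as follows. The slides do not say how the word vectors are combined, so averaging is an assumption here, and the tiny stop-word list and two-dimensional word vectors are hypothetical stand-ins for a trained word2vec model.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def tokenize(html_text):
    """Strip HTML tags and symbols, lowercase, and drop stop words."""
    text = re.sub(r"<[^>]+>", " ", html_text)        # remove HTML tags
    text = re.sub(r"[^a-z\s]", " ", text.lower())    # keep ASCII letters only
    return [word for word in text.split() if word not in STOP_WORDS]


def document_vector(tokens, word_vectors):
    """Average the vectors of known words (assumed combination rule)."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]


# Hypothetical 2-d word vectors standing in for a trained word2vec model.
word_vectors = {"storm": [1.0, 0.0], "hits": [0.0, 1.0], "coast": [1.0, 1.0]}

tokens = tokenize("<p>The storm hits the coast!</p>")
print(tokens)  # ['storm', 'hits', 'coast']
print(document_vector(tokens, word_vectors))
```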
Contents Data Visualization
(Plot of content embeddings; labeled groups: Entertainment, National News, Health, Sports, Local News)
Features Engineering
NLP (doc2vec)
Items clustering (k-means)
Embed items: similarity to each cluster centroid
Embed users: viewed contents combined
Pipeline: CMS articles, then k-dimension items/users embeddings, then Google Cloud Storage
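A minimal sketch of the two embedding steps above: each item becomes a vector of similarities to the cluster centroids, and each user becomes a combination of the items they viewed. Cosine similarity and a plain mean are assumptions, since the slides name neither, and the centroids below are hypothetical values standing in for k-means output.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def embed_item(doc_vec, centroids):
    """One feature per cluster: similarity of the article to that centroid."""
    return [cosine(doc_vec, c) for c in centroids]


def embed_user(viewed_item_embeddings):
    """Combine viewed items by averaging their embeddings (assumed rule)."""
    n = len(viewed_item_embeddings)
    return [sum(col) / n for col in zip(*viewed_item_embeddings)]


# Hypothetical centroids, as if produced by k-means with k = 2.
centroids = [[1.0, 0.0], [0.0, 1.0]]
doc_vectors = {"article_a": [2.0, 0.0], "article_b": [0.0, 3.0]}

item_emb = {name: embed_item(vec, centroids) for name, vec in doc_vectors.items()}
user_emb = embed_user([item_emb["article_a"], item_emb["article_b"]])
print(item_emb)  # {'article_a': [1.0, 0.0], 'article_b': [0.0, 1.0]}
print(user_emb)  # [0.5, 0.5]
```

With 40 clusters, as in the parallel-coordinates slide, each item and user would carry 40 such features.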
Items Parallel coordinates: 40 features/clusters
Feature #1: Similarity to cluster #1
Feature #39
Who are they? Magenta contents (health) with high values for feature #1 (economy)?
Content/User Embeddings + Matrix Factorization
Hybrid Matrix Factorization
• R ≈ U* x V*
where:
• U* = U (Users x KClusters) x A (KClusters x Latent_factors)
• V* = B (Latent_factors x KClusters) x V (KClusters x Items)
Only A and B are variables to be trained. U and V are constants.
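The hybrid factorization above can be sketched in plain Python as follows. This is an illustrative stand-in, not the TensorFlow code from the talk: the squared-error loss, the SGD updates, the learning rate, and the toy matrices are all assumptions. The point it shows is that gradients flow only into A and B, while the content-based embeddings U and V stay fixed.

```python
import random


def project_user(U, A, i):
    """u*_i = U[i] A: map a K-cluster user embedding to latent factors."""
    return [sum(U[i][k] * A[k][l] for k in range(len(A)))
            for l in range(len(A[0]))]


def project_item(B, V, j):
    """v*_j = B V[:, j]: map a K-cluster item embedding to latent factors."""
    return [sum(B[l][k] * V[k][j] for k in range(len(V)))
            for l in range(len(B))]


def predict(U, V, A, B, i, j):
    return sum(a * b for a, b in zip(project_user(U, A, i), project_item(B, V, j)))


def squared_error(ratings, U, V, A, B):
    return sum((r - predict(U, V, A, B, i, j)) ** 2 for i, j, r in ratings)


def init_params(num_clusters, latent, seed=0):
    rng = random.Random(seed)
    A = [[rng.uniform(-0.5, 0.5) for _ in range(latent)] for _ in range(num_clusters)]
    B = [[rng.uniform(-0.5, 0.5) for _ in range(num_clusters)] for _ in range(latent)]
    return A, B


def train_hybrid(ratings, U, V, A, B, lr=0.02, epochs=1000):
    """SGD on A and B only; U and V are read but never modified."""
    for _ in range(epochs):
        for i, j, r in ratings:
            u_star = project_user(U, A, i)
            v_star = project_item(B, V, j)
            err = r - sum(a * b for a, b in zip(u_star, v_star))
            for k in range(len(A)):
                for l in range(len(B)):
                    A[k][l] += lr * err * U[i][k] * v_star[l]
                    B[l][k] += lr * err * u_star[l] * V[k][j]
    return A, B


# Hypothetical fixed embeddings with K = 2 clusters.
U = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # users x KClusters
V = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]     # KClusters x items
ratings = [(0, 0, 3.0), (0, 1, 1.0), (1, 0, 1.0), (1, 1, 2.0), (2, 2, 1.75)]

A, B = init_params(num_clusters=2, latent=2)
before = squared_error(ratings, U, V, A, B)
train_hybrid(ratings, U, V, A, B)
after = squared_error(ratings, U, V, A, B)
print(before, after)
```

Because a brand-new article already has a row in V from its text alone, predict can score it before any page views exist, which is how this formulation addresses the new-item side of the cold-start problem.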
TF code: factorization
Now:
Results
• Training time ≈ 20 min (Kubernetes cluster)
• TimeOnPage Prediction Error (RMSE) ≈ 100 sec (20% better)
• Qualitative recommendation tests with chosen 'personas' revealed very good personalization
• R&D Project - Not yet publicly available
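For reference, the RMSE metric quoted in the results is the square root of the mean squared difference between predicted and observed time on page. A minimal sketch, with hypothetical values in seconds that are not data from the project:

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical predicted and observed time on page, in seconds.
print(rmse([120.0, 60.0, 300.0], [100.0, 90.0, 310.0]))
```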