A TensorFlow recommending system for news - Fabrício Vargas Matos (Hearst TV) @PAPIs Connect - São Paulo 2017

papisdotio 947 views 60 slides Jun 26, 2017

About This Presentation

News recommendations are particularly challenging given the high volume of new content produced every day and the fast deterioration of its value for users. This demands models and infrastructure able to deal with those nuances and to serve a newly trained model about 100 times per day. Attending thi...


Slide Content

A TensorFlow Recommending System for News
Fabricio Vargas Matos

Manhattan, NY
TV Stations
Local and National News

Article’s page: recommendations for continuous scroll section
Recommended articles

Agenda
1. Recency and cold-start problem
2. Data acquisition
3. Matrix factorization
4. TensorFlow implementation
5. Hybrid Model: NLP and feature engineering
6. Hybrid Model: Hybrid matrix factorization
7. Conclusions

Cold-start problem
[2x2 grid: Existing Items and New Items against Existing Users and New Users]

Cold-start solution
[2x2 grid: Existing Items and New Items against Existing Users and New Users]
Not personalized! Curated by editors + highly viewed

Cold-start solution
[2x2 grid: Existing Items and New Items against Existing Users and New Users]
Not personalized! Curated by editors + highly viewed
Hybrid Matrix Factorization

Data Acquisition
• Google Analytics, Google BigQuery: page views with user’s time on page
• CMS: content corpus (title, body, timestamp, metadata such as sections, tags, etc.)
• Output: TFRecord/CSV files

"Users x Items" Sparsity
Dataset               Sparsity
MovieLens (movies)    98.61%
Netflix (movies)      98.82%
TV Stations (news)    99.94%
Yahoo! KDD (music)    99.96%
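
The sparsity figures above are the share of user/item pairs with no recorded interaction. A minimal sketch of that arithmetic follows; the function name and the counts are hypothetical and chosen only for illustration, they are not the real dataset sizes.

```python
def sparsity(num_users, num_items, num_interactions):
    """Fraction of the users x items matrix with no recorded interaction."""
    total_cells = num_users * num_items
    return 1.0 - num_interactions / total_cells


# Hypothetical counts, used only to illustrate the calculation.
example = sparsity(num_users=10_000, num_items=5_000, num_interactions=30_000)
print(f"{example:.2%}")
```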

Matrix Factorization

Latent Factors Model
[Diagram: matrix R (Users x Items) approximated by the product of U (Users x Latent factors) and V (Latent factors x Items), with user bias and item bias]
R[i,j] ≈ U[i] x V[j]
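
As an illustration of the latent factors model, here is a minimal pure-Python sketch that learns U, V, and the two bias vectors by stochastic gradient descent on the observed entries only. It is not the TensorFlow code from the talk; the function name, hyperparameters, and the tiny dataset (time on page in minutes) are all assumptions made for the example.

```python
import random


def train_mf(observations, num_users, num_items, k=2,
             lr=0.05, reg=0.01, epochs=500, seed=0):
    """Fit R[i,j] ~ U[i] . V[j] + user_bias[i] + item_bias[j] with SGD."""
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(num_users)]
    V = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(num_items)]
    user_bias = [0.0] * num_users
    item_bias = [0.0] * num_items

    def predict(i, j):
        dot = sum(u * v for u, v in zip(U[i], V[j]))
        return dot + user_bias[i] + item_bias[j]

    for _ in range(epochs):
        # Only observed (user, item, value) triples contribute to the loss.
        for i, j, r in observations:
            err = r - predict(i, j)
            user_bias[i] += lr * (err - reg * user_bias[i])
            item_bias[j] += lr * (err - reg * item_bias[j])
            for f in range(k):
                u, v = U[i][f], V[j][f]
                U[i][f] += lr * (err * v - reg * u)
                V[j][f] += lr * (err * u - reg * v)
    return predict


# Hypothetical (user, item, minutes on page) observations.
observations = [(0, 0, 2.0), (0, 1, 0.5), (1, 0, 1.8),
                (1, 2, 3.0), (2, 1, 0.4), (2, 2, 2.5)]
predict = train_mf(observations, num_users=3, num_items=3)
print(predict(0, 2))  # estimate for a pair that was never observed
```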

TF code: factorization op
(…)

TF code: train op

Initial Results
• Training time ≈ 15 min (Kubernetes cluster)
• TimeOnPage prediction error (RMSE) ≈ 125 sec
• Qualitative recommendation tests with chosen ‘personas’ revealed poor personalization
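
The error metric quoted here is the root mean squared error between actual and predicted time on page, in seconds. A minimal sketch of that metric, with made-up values that are not taken from the project:

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical time-on-page values in seconds.
actual_seconds = [60.0, 200.0, 30.0, 400.0]
predicted_seconds = [90.0, 150.0, 80.0, 300.0]
print(rmse(actual_seconds, predicted_seconds))
```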

Hybrid Matrix
Factorization Model

Natural Language Processing
• Concatenate content data (title, body, sections, tags, …)
• Remove stop words, symbols and HTML tags
• Train word2vec neural network
• Combine all word vectors of each article into one (doc2vec)
Input: CMS articles
Output: doc2vec contents
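
The cleaning and combining steps can be sketched as follows. This assumes the word vectors have already been trained and are available as a plain dictionary, and it assumes that "combine" means a simple average, which the slide does not specify. The stop-word list, function names, and sample article are hypothetical.

```python
import html
import re

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def tokenize(raw_text):
    """Strip HTML tags and symbols, lowercase, and drop stop words."""
    text = re.sub(r"<[^>]+>", " ", raw_text)  # remove HTML tags
    text = html.unescape(text)                # decode entities such as &amp;
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def document_vector(tokens, word_vectors):
    """Average the vectors of all known tokens (assumed combination rule)."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]


# Hypothetical article fields and pre-trained 2-dimensional word vectors.
article = {"title": "Mayor opens <b>new</b> park",
           "body": "<p>The park is in the city center.</p>",
           "tags": "local, parks"}
word_vectors = {"park": [1.0, 0.0], "city": [0.0, 1.0]}
tokens = tokenize(" ".join(article.values()))
print(document_vector(tokens, word_vectors))
```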

Contents Data Visualization
[Plot of article vectors, labeled by section: Entertainment, National News, Health, Sports, Local News]

Features Engineering
• NLP (doc2vec)
• Items clustering (k-means)
• Embed items: similarity to each cluster centroid
• Embed users: viewed contents combined
Input: CMS articles
Output: k-dimension items/users embeddings (Google Cloud Storage)
Items parallel coordinates: 40 features/clusters
[Parallel coordinates plot; labeled axes include Feature #1 (similarity to cluster #1) and Feature #39]

Who are they?
Magenta contents (health) with high values for feature #1 (economy)?

Content/User Embeddings
+
Matrix Factorization

Matrix Factorization
[Diagram: matrix R (Users x Items) approximated by the product of U (Users x Latent factors) and V (Latent factors x Items), with user bias and item bias]
R[i,j] ≈ U[i] x V[j]

Hybrid Matrix Factorization
• R ≈ U* x V*
where:
• U* = U (Users x KClusters) x A (KClusters x Latent_factors)
• V* = B (Latent_factors x KClusters) x V (KClusters x Items)
Note: only A and B are variables to be trained. U and V are constants.
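
To make the shapes concrete, the sketch below multiplies small hand-written matrices in pure Python. U and V stand for the fixed cluster-based embeddings, and A and B stand for the trainable projections; all numbers are hypothetical and no training is performed here.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


# Fixed embeddings (constants): 2 users, 3 clusters, 4 items.
U = [[0.9, 0.1, 0.0],
     [0.2, 0.7, 0.1]]                    # Users x KClusters
V = [[1.0, 0.0, 0.3, 0.2],
     [0.0, 1.0, 0.4, 0.1],
     [0.1, 0.2, 0.9, 0.8]]               # KClusters x Items

# Trainable projections (would be learned): 2 latent factors.
A = [[0.5, -0.2],
     [0.1, 0.3],
     [-0.4, 0.6]]                        # KClusters x Latent_factors
B = [[0.3, 0.2, -0.1],
     [0.0, 0.5, 0.4]]                    # Latent_factors x KClusters

U_star = matmul(U, A)                    # Users x Latent_factors
V_star = matmul(B, V)                    # Latent_factors x Items
R_hat = matmul(U_star, V_star)           # Users x Items
print(len(R_hat), len(R_hat[0]))
```

Under this reading, scoring an additional item only requires appending its cluster-similarity column to V, since A and B do not depend on the number of items.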

TF code: factorization
Now:

Results
• Training time ≈ 20 min (Kubernetes cluster)
• TimeOnPage prediction error (RMSE) ≈ 100 sec (20% better)
• Qualitative recommendation tests with chosen ‘personas’ revealed very good personalization
• R&D project, not yet publicly available

Let’s talk online
fabriciovargasmatos@
Fabricio Vargas Matos