Feature Hashing for Scalable Machine Learning with Nick Pentreath

33 slides · Oct 31, 2017

About This Presentation

Feature hashing is a powerful technique for handling high-dimensional features in machine learning. It is fast, simple, memory-efficient, and well suited to online learning scenarios. While it is an approximation, the accuracy tradeoff is surprisingly small in many machine learning problems. Feature hashing…


Slide Content

Feature Hashing for Scalable Machine Learning
Nick Pentreath, IBM

About
• About me
  – @MLnick
  – Principal Engineer at IBM working on machine learning & Apache Spark
  – Apache Spark committer & PMC
  – Author of Machine Learning with Spark

Agenda
• Intro to feature hashing
• HashingTF in Spark ML
• FeatureHasher in Spark ML
• Experiments
• Future Work

Feature Hashing

Encoding Features
• Most ML algorithms operate on numeric feature vectors, e.g. [2.1, 3.2, −0.2, 0.7]
• Features are often categorical – even numerical features are frequently made categorical (e.g. by binning continuous features)

Encoding Features
• "One-hot" encoding is popular for categorical features, e.g. [0, 0, 1, 0, 0] (see the sketch below)
• "Bag of words" is popular for text (or token counts more generally), e.g. [0, 2, 0, 1, 1]
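To make the contrast with hashing concrete, here is a minimal plain-Scala sketch of explicit one-hot encoding (not Spark code; the vocabulary and values are made up). Note that it must build and store a value-to-index dictionary, which is exactly what feature hashing avoids.

// Explicit one-hot encoding needs a stored value -> index dictionary
val vocabulary = Map("amsterdam" -> 0, "berlin" -> 1, "dublin" -> 2, "lisbon" -> 3, "madrid" -> 4)

def oneHot(city: String): Array[Double] = {
  val vec = Array.fill(vocabulary.size)(0.0)
  vocabulary.get(city).foreach(i => vec(i) = 1.0)  // unseen values stay all-zero
  vec
}

oneHot("dublin")  // Array(0.0, 0.0, 1.0, 0.0, 0.0)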

High Dimensional Features
• Many domains have very high dense feature dimension (e.g. images, video)
• Here we're concerned with sparse feature domains, e.g. online ads, ecommerce, social networks, video sharing, text & NLP
• Model sizes can be very large even for simple models

The "Hashing Trick"
• Use a hash function to map feature values to indices in the feature vector (sketch below)
• Categorical feature (e.g. Dublin): hash "city=dublin", take the hash value modulo the vector size to get the feature index, and set an indicator there, e.g. [0, 0, 1, 0, 0]
• Numeric feature (e.g. stock price): hash "stock_price", take the hash value modulo the vector size, and store the value at that index, e.g. [0, 2.6, 0, …, 0]
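A minimal plain-Scala sketch of the trick described above. Spark's actual implementation hashes the UTF-8 bytes with MurmurHash3; the feature names, vector size, and exact hash call here are illustrative.

import scala.util.hashing.MurmurHash3

val numFeatures = 1 << 4  // tiny vector size, just for illustration

// Map a feature string to an index in [0, numFeatures)
def featureIndex(feature: String): Int = {
  val raw = MurmurHash3.stringHash(feature)
  ((raw % numFeatures) + numFeatures) % numFeatures  // non-negative modulo
}

// Categorical feature: hash "name=value", store an indicator value of 1.0
val dublinIndex = featureIndex("city=dublin")
// Numeric feature: hash the name only, store the actual value (e.g. 2.6)
val priceIndex = featureIndex("stock_price")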

Feature Hashing: Pros
• Fast & simple
• Preserves sparsity
• Memory efficient
  – Limits feature vector size
  – No need to store a feature name -> index mapping
• Online learning
• Easy handling of missing data
• Feature engineering

Feature Hashing: Cons
• No inverse mapping => cannot go from feature indices back to feature names
  – Hurts interpretability & feature importances
  – But other dimensionality-reduction techniques (e.g. random projections, PCA, SVD) have similar issues
• Hash collisions …
  – Colliding features can impact accuracy
  – Signed hash functions can partly alleviate this

HashingTF in Spark ML

HashingTF Transformer
• Transforms text (sentences) -> term frequency vectors (aka "bag of words")
• Uses the "hashing trick" to compute the feature indices
• Feature value is the term frequency (token count)
• Optional parameter to return a binary token-occurrence vector instead

HashingTF Transformer
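A rough sketch of typical HashingTF usage, assuming a SparkSession named spark (as in spark-shell); the data, column names, and numFeatures are illustrative.

import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val sentences = spark.createDataFrame(Seq(
  (0.0, "feature hashing is fast and simple"),
  (1.0, "hash collisions are additive")
)).toDF("label", "sentence")

val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")

val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
  .setNumFeatures(1 << 18)
  // .setBinary(true) would return binary token occurrences instead of counts

val features = hashingTF.transform(tokenizer.transform(sentences))
features.select("features").show(truncate = false)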

Hacking HashingTF
• HashingTF can be used for categorical features (see the sketch below)…
• …but it doesn't fit neatly into Pipelines
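One way the hack can look (a sketch with made-up column names, assuming a SparkSession named spark): turn each categorical column into a "name=value" token, collect the tokens into an array column, and feed that array to HashingTF.

import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.sql.functions._

val df = spark.createDataFrame(Seq(
  ("dublin", "chrome"),
  ("berlin", "safari")
)).toDF("city", "browser")

// Build "name=value" tokens so values from different columns don't collide by name
val tokenized = df.withColumn("tokens", array(
  concat(lit("city="), col("city")),
  concat(lit("browser="), col("browser"))
))

val tf = new HashingTF()
  .setInputCol("tokens").setOutputCol("features").setNumFeatures(1 << 18)

tf.transform(tokenized).show(truncate = false)

The extra string-munging step is what makes this awkward to wire into a Pipeline, which motivates FeatureHasher below.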

FeatureHasher in Spark ML

FeatureHasher
• Flexible, scalable feature encoding using the hashing trick
• Supports multiple input columns (numeric, or string / boolean, i.e. categorical)
• One-shot feature encoder
• Core logic similar to HashingTF

FeatureHasher
• Operates on an entire Row
• …implemented as a UDF with struct input

FeatureHasher
• Determining the feature index:
  – Numeric: based on the feature name
  – String / boolean: based on "feature=value"
• String encoding => effectively "one-hot" with an indicator value of 1.0

FeatureHasher
• The raw index is taken modulo the feature vector size
• Hash collisions are additive (currently) – see the sketch below
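Pulling the last few slides together, a simplified plain-Scala sketch of the indexing logic as described (not the actual FeatureHasher source; the hash call and helper names are assumptions for illustration).

import scala.util.hashing.MurmurHash3
import scala.collection.mutable

val numFeatures = 1 << 18

def hashIndex(s: String): Int =
  ((MurmurHash3.stringHash(s) % numFeatures) + numFeatures) % numFeatures

// Numeric columns: hash the column name, keep the real value.
// String / boolean columns: hash "name=value", use an indicator value of 1.0.
// Colliding indices simply accumulate (collisions are additive).
def hashRow(numeric: Map[String, Double], categorical: Map[String, String]): Map[Int, Double] = {
  val acc = mutable.Map.empty[Int, Double].withDefaultValue(0.0)
  numeric.foreach { case (name, value) => acc(hashIndex(name)) += value }
  categorical.foreach { case (name, value) => acc(hashIndex(s"$name=$value")) += 1.0 }
  acc.toMap
}

hashRow(Map("stock_price" -> 2.6), Map("city" -> "dublin"))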

FeatureHasher
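A minimal usage sketch (Spark 2.3+, assuming a SparkSession named spark; the data and column names are illustrative).

import org.apache.spark.ml.feature.FeatureHasher

val dataset = spark.createDataFrame(Seq(
  (2.2, true, "1", "foo"),
  (3.3, false, "2", "bar"),
  (4.4, false, "3", "baz")
)).toDF("real", "bool", "stringNum", "string")

val hasher = new FeatureHasher()
  .setInputCols("real", "bool", "stringNum", "string")
  .setOutputCol("features")
  .setNumFeatures(1 << 18)

hasher.transform(dataset).show(truncate = false)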

Experiments

Text Classification
• Kaggle Email Spam Dataset
• [Chart: AUC vs. number of hash bits (10–18), HashingTF compared with CountVectorizer]

Text Classification
• Adding regularization (regParam=0.01)
• [Chart: AUC vs. number of hash bits (10–18), HashingTF compared with CountVectorizer]

Ad Click Prediction
• Criteo Display Advertising Challenge
  – 45m examples, 34m features, 0.000003% sparsity
• Outbrain Click Prediction
  – 80m examples, 15m features, 0.000007% sparsity
• Criteo Terabyte Log Data
  – 7 day subset
  – 1.5b examples, 300m features, 0.0000003% sparsity

Data
• Illustrative characteristics – Criteo DAC
• [Charts: unique values per feature (millions); feature occurrence (%)]

Challenges
• Typical one-hot encoding pipeline fails consistently with large feature dimension (see the sketch below)
• Raw Data -> StringIndexer -> OneHotEncoder -> VectorAssembler => OOM!
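For comparison, a sketch of that pipeline in Spark ML (2.x API, column names hypothetical). Each StringIndexer must build, store, and broadcast a label-to-index map, which is what blows up when columns have tens of millions of distinct values.

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

val catCols = Array("site_id", "device_id", "ad_id")

val indexers: Array[PipelineStage] = catCols.map { c =>
  new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx")
}
val encoders: Array[PipelineStage] = catCols.map { c =>
  new OneHotEncoder().setInputCol(s"${c}_idx").setOutputCol(s"${c}_vec")
}
val assembler = new VectorAssembler()
  .setInputCols(catCols.map(c => s"${c}_vec"))
  .setOutputCol("features")

// Fitting this with very high-cardinality columns is what runs out of memory
val pipeline = new Pipeline().setStages((indexers ++ encoders) :+ assembler)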

Results
• Compare AUC for different # hash bits
• [Charts: AUC vs. number of hash bits (18–24) for the Outbrain and Criteo DAC datasets]

Results
• Criteo 1T logs – 7 day subset
• Can train a model on 1.5b examples
• 300m original features for this subset
• 2^24 hashed features (16m)
• Impossible with current Spark ML (OOM, 2GB broadcast limit)

Summary & Future Work

Summary
• Feature hashing is a fast, efficient, flexible tool for feature encoding
• Can scale to high-dimensional sparse data without giving up much accuracy
• Supports multi-column "one-shot" encoding
• Avoids common issues with Spark ML Pipelines using StringIndexer & OneHotEncoder at scale

Future Directions
• Will be part of Spark ML in the 2.3.0 release (Q4 2017)
  – Refer to SPARK-13969 for details
• Allow users to specify a set of numeric columns to be treated as categorical
• Signed hash functions
• Internal feature crossing & namespaces (à la Vowpal Wabbit)
• A DictVectorizer-like transformer => one-pass feature encoder for multiple numeric & categorical columns (with inverse mapping) – see SPARK-19962

References
• Hash Kernels
• Feature Hashing for Large Scale Multitask Learning
• Vowpal Wabbit
• Scikit-learn

Thank You.
@MLnick
spark.tc