An Oracle approach to the Taxi Fare problem

JoseRodrguez46 124 views 24 slides Apr 10, 2019
Slide 1
Slide 1 of 24
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24

About This Presentation

An approach to the ML regression Taxi Fare problem using Oracle provided features and options.


Slide Content

February 2019
Just a simple experiment
Taxi Fare problem
An Oracle approach

© The Pythian Group Inc., 20192
Jose Rodriguez
Project engineer at Pythian
Spain
●+10 years of experience, mainly Oracle but also SQL Server and
others like DB2 LUW or PostgreSQL
●Solaris, Linux and Windows RAC and HA with DG and GG
Other areas of expertise, i.e., things I like doing
●Scripting and automation (lazy DBA)
●Golden Gate replication
●Cloud related stuff (who doesn't nowadays, eh? )
●Machine Learning
@jmrprieto
https://www.linkedin.com/in/jose-rodríguez-05087a13

About Pythian
Pythian’s 400+ IT professionals
help companies adopt
and manage disruptive data
technologies to better compete
© 2018 Pythian. Confidential3

© 2018 Pythian. Confidential4
Systems currently
managed by Pythian
EXPERIENCED
Pythian experts
in 35 countries
GLOBAL
Millennia of experience
gathered and shared
over 19 years
EXPERTS
11,800 2 400+

Agenda
5© The Pythian Group Inc., 2019
What is the Taxi Fare problem?
What data do we have?
Model creation
Model results
Summary
Q&A

© The Pythian Group Inc., 20196
●A typical ML regression problem based on
NYC public data
●Data set with NYC taxi rides information
●Predict the fare amount for a proposed ride
What is the Taxi Fare problem?

7© The Pythian Group Inc., 2019
The initial data set
What do we start with

© The Pythian Group Inc., 20198
●Started with a publicly available data set about 54 million rows including
●Pickup date and time
●Pickup and drop off coordinates
●Number of passengers
●Fare amount
●Data cleansing
●Remove invalid data (negative fare, coordinates away from NYC,
etc.)
●Data preparation
●Remove minute details from data
●Categorise data, e.g. transform coordinates into neighbourhoods
●Additional feature
●Adding DISTANCE as an additional feature for the model
Initial data set transformation

9© The Pythian Group Inc., 2018
Congratulations!
You've got a model

© The Pythian Group Inc., 201910
●Oracle offers two algorithms for regression problems
●Generalized Linear Model (GLM)
■Linear regression, simple calculation, low computation resources
●Support Vector Machines (SVM)
■Complex calculation, more computation resources, better results expected
●GLM model creation: 5 minutes
●SVM model creation: 9 hours
●Simple PL/SQL code to create it
Model creation
BEGIN
DBMS_DATA_MINING.CREATE_MODEL(
model_name => 'Lin_reg_taxi_fare',
mining_function => dbms_data_mining.regression,
data_table_name => 'TRAIN_DISTANCE_HOUR' ,
case_id_column_name => 'caseID',
target_column_name => 'FARE_AMOUNT',
settings_table_name =>
'linear_reg_taxi_fare_settings' );
END;

© The Pythian Group Inc., 201911
●SQL*Developer extension
●Provides GUI to create data flows to mine data. Allows
●data transformation
●data analysis
●model creation
●model performance data review
Data Miner

© The Pythian Group Inc., 201912
●Data Miner GUI
●Add table
●Add model
●Schedule model
DemoT
a
x
i
D
E
M
O

13© The Pythian Group Inc., 2019
Model results
Or what can I do now

© The Pythian Group Inc., 201914
●Using the data mining PREDICTION function (GLM)
Testing our model
WITH PREDICTION AS
(SELECT FARE_AMOUNT, PREDICTION (LIN_REG_TAXI_FARE_3 USING *) PREDICTED_FARE
FROM TRAIN_DISTANCE_BUCKET_3
WHERE CASEID < 11
)
SELECT FARE_AMOUNT,
ROUND(PREDICTED_FARE,2) PREDICTED_FARE,
ROUND(ABS(FARE_AMOUNT-PREDICTED_FARE),2) ABS_DIFF,
ROUND(ABS(FARE_AMOUNT-PREDICTED_FARE)/FARE_AMOUNT*100,2) "%DIFF"
FROM PREDICTION;
FARE_AMOUNT PREDICTED_FARE ABS_DIFF %DIFF
----------- -------------- ---------- ----------
6.5 8.81 2.31 35.48
5.7 8.53 2.83 49.71
10.9 12.25 1.35 12.4
18 18.64 .64 3.54
6.1 7.92 1.82 29.89
13 12.64 .36 2.81
8.5 10.56 2.06 24.2
9 10.99 1.99 22.07
6.5 8.67 2.17 33.36
9.3 10.35 1.05 11.26

10 ROWS selected.

Elapsed: 00:00:04.275

© The Pythian Group Inc., 201915
●Using the data mining PREDICTION function (SVM)
Testing our model (2)
WITH PREDICTION AS
(SELECT FARE_AMOUNT, PREDICTION (LIN_REG_TAXI_FARE_3_SVM USING *) PREDICTED_FARE
FROM TRAIN_DISTANCE_BUCKET_3
WHERE CASEID < 11
)
SELECT FARE_AMOUNT,
ROUND(PREDICTED_FARE,2) PREDICTED_FARE,
ROUND(ABS(FARE_AMOUNT-PREDICTED_FARE),2) ABS_DIFF,
ROUND(ABS(FARE_AMOUNT-PREDICTED_FARE)/FARE_AMOUNT*100,2) "%DIFF"
FROM PREDICTION;
FARE_AMOUNT PREDICTED_FARE ABS_DIFF %DIFF
----------- -------------- ---------- ----------
6.5 7.79 1.29 19.91
5.7 7.24 1.54 26.96
10.9 11.99 1.09 10.04
18 20.4 2.4 13.32
6.1 6.38 .28 4.58
13 12.51 .49 3.75
8.5 9.9 1.4 16.43
9 9.61 .61 6.73
6.5 7.31 .81 12.53
9.3 8.62 .68 7.32

10 ROWS selected.

Elapsed: 00:00:05.736

© The Pythian Group Inc., 201916
●Using the data mining PREDICTION_DETAILS function (GLM)
Testing our model (3)
SELECT FARE_AMOUNT,
ROUND(PREDICTION (LIN_REG_TAXI_FARE_3 USING *),2) PREDICTED_FARE,
REDICTION_DETAILS (LIN_REG_TAXI_FARE_3 USING *) DETAILS
FROM TRAIN_DISTANCE_BUCKET
WHERE CASEID <11;
FARE_AMOUNT PREDICTED_FARE DETAILS
----------- -------------- ------------------------------------------------------------------------------------------
6.5 8.81 <Details algorithm="Generalized Linear Model" >
<Attribute name="PASSENGER_COUNT" actualValue="1" weight="-.015" rank="1"/>
<Attribute name="PICKUP_WEEKDAY" actualValue="SAT" weight="-.162" rank="2"/>
<Attribute name="DISTANCE_KM" actualValue="2.07992012952067" weight="-.823" rank="3"/>
</Details>
(...)
9 10.99 <Details algorithm="Generalized Linear Model" >
<Attribute name="DROPOFF_BUCKET" actualValue="Z2" weight=".333" rank="1"/>
<Attribute name="PICKUP_WEEKDAY" actualValue="THU" weight=".039" rank="2"/>
<Attribute name="PASSENGER_COUNT" actualValue="1" weight="-.022" rank="3"/>
<Attribute name="PICKUP_BUCKET" actualValue="Z2" weight="-.07" rank="4"/>
<Attribute name="DISTANCE_KM" actualValue="2.79316183876644" weight="-.536" rank="5"/>
</Details>
(...)

© The Pythian Group Inc., 201917
●Model performance or fitness. How good is our model?
●Root Mean Squared Error (RMSE). GLM looks better but not that much.
●Mean Absolute Error (MEA). Both models perform similarly.
Testing our model (4) - Fitness
-- First for the GLM model
SQL> WITH prediction AS
2 (SELECT FARE_AMOUNT, PREDICTION (Lin_reg_taxi_fare_3 USING *) PREDICTED_FARE
3 FROM train_distance_bucket_3 )
4 SELECT SQRT(AVG((predicted_fare - fare_amount) * (predicted_fare - fare_amount))) rmse,
5 AVG(ABS(predicted_fare - fare_amount)) mae
6* FROM prediction p;
RMSE MAE
---------- ----------
6.66524393 2.96620921

-- Then for the SVM model
SQL> WITH prediction AS
2 (SELECT FARE_AMOUNT, PREDICTION (LIN_REG_TAXI_FARE_3_SVM USING *) PREDICTED_FARE
3 FROM train_distance_bucket_3 )
4 SELECT SQRT(AVG((predicted_fare - fare_amount) * (predicted_fare - fare_amount))) rmse,
5 AVG(ABS(predicted_fare - fare_amount)) mae
6 FROM prediction p ;

RMSE MAE
---------- ----------
7.76230946 2.39739187

© The Pythian Group Inc., 201918
The Data Mining feature stores model information in several internal
tables (DM$) which can be queried to obtain some insights.
Testing our model (5) - Internal data
TABLE_NAME NUM_ROWS
-------------------------------- ----------
DM$PCREGR_GLM_1_1 20503
DM$P5REGR_GLM_1_1 287
DM$PBREGR_GLM_1_1 202
DM$PDREGR_GLM_1_1 201
DM$PPREGR_GLM_1_1 29
DM$PFREGR_GLM_1_1 9
DM$PMREGR_GLM_1_1 3
DM$PJREGR_GLM_1_1 0
DM$PKREGR_GLM_1_1 0
DM$PEREGR_GLM_1_1 0
DM$PZREGR_GLM_1_1 0

© The Pythian Group Inc., 201919
NAME NVAL SVAL
------------------------------ ------------------- ------------------------------------
ADJUSTED_R_SQUARE .53
AIC 125,859,959.10
COEFF_VAR 59.46
CONVERGED YES
CORRECTED_TOTAL_DF 32,986,604.00
CORRECTED_TOT_SS 3,207,450,478.07
DEPENDENT_MEAN 11.33
ERROR_DF 32,986,403.00
ERROR_MEAN_SQUARE 45.40
ERROR_SUM_SQUARES 1,497,535,772.47
F_VALUE 187,385.57
GLMS_CONV_TOLERANCE .00
GLMS_NUM_ITERATIONS 30.00
GLMS_RIDGE_REGRESSION GLMS_RIDGE_REG_DISABLE
GLMS_SOLVER GLMS_SOLVER_CHOL
GMSEP 45.40
HOCKING_SP .00
J_P 45.40
MODEL_DF 201.00
MODEL_F_P_VALUE .00
MODEL_MEAN_SQUARE 8,507,038.34
MODEL_SUM_SQUARES 1,709,914,705.60
NUM_PARAMS 202.00
NUM_ROWS 32,986,605.00
RANK_DEFICIENCY .00
ROOT_MEAN_SQ 6.74
R_SQ .53
SBIC 125,863,052.04
VALID_COVARIANCE_MATRIX YES
29 rows selected.
Testing our model (5.2) - Internal data
SELECT NAME, NVAL, SVAL
FROM DM$PPREGR_GLM_1_1
ORDER BY NAME;

© The Pythian Group Inc., 201920
●Data Miner automatically provides model fitness information
●Root Mean Squared Error (RMSE)
●Mean Absolute Error (MEA)
●Predictive Confidence %
Testing our model (6) - Fitness by Data miner

© The Pythian Group Inc., 201921
References

An Oracle-Based Approach To The “Taxi Fare” Prediction Problem by Jose Rodriguez

Oracle Data Miner

Oracle Data Mining documentation

22© The Pythian Group Inc., 2018
Summary
●What is the Taxi Fare prediction
problem
●How did we prepare the data
●How to create a very simple model
●Testing our model
PROS CONS
Data is already there High compute resources
Simple SQL and PL/SQL License costs

Q&A
Ask now or reach out later, but don't keep the question for yourself

24© The Pythian Group Inc., 2018
THANK YOU
I hope you enjoyed it