An Oracle approach to the Taxi Fare problem

February 2019
Just a simple experiment
Taxi Fare problem
An Oracle approach

© The Pythian Group Inc., 20192
Jose Rodriguez
Project engineer at Pythian
Spain
●+10 years of experience, mainly Oracle but also SQL Server and
others like DB2 LUW or PostgreSQL
●Solaris, Linux and Windows RAC and HA with DG and GG
Other areas of expertise, i.e., things I like doing
●Scripting and automation (lazy DBA)
●Golden Gate replication
●Cloud related stuff (who doesn't nowadays, eh? )
●Machine Learning
@jmrprieto
https://www.linkedin.com/in/jose-rodríguez-05087a13

About Pythian
Pythian’s 400+ IT professionals
help companies adopt
and manage disruptive data
technologies to better compete
© 2018 Pythian. Confidential3

© 2018 Pythian. Confidential4
Systems currently
managed by Pythian
EXPERIENCED
Pythian experts
in 35 countries
GLOBAL
Millennia of experience
gathered and shared
over 19 years
EXPERTS
11,800 2 400+

Agenda
5© The Pythian Group Inc., 2019
What is the Taxi Fare problem?
What data do we have?
Model creation
Model results
Summary
Q&A

© The Pythian Group Inc., 20196
●A typical ML regression problem based on
NYC public data
●Data set with NYC taxi rides information
●Predict the fare amount for a proposed ride
What is the Taxi Fare problem?

7© The Pythian Group Inc., 2019
The initial data set
What do we start with

© The Pythian Group Inc., 20198
●Started with a publicly available data set about 54 million rows including
●Pickup date and time
●Pickup and drop off coordinates
●Number of passengers
●Fare amount
●Data cleansing
●Remove invalid data (negative fare, coordinates away from NYC,
etc.)
●Data preparation
●Remove minute details from data
●Categorise data, e.g. transform coordinates into neighbourhoods
●Additional feature
●Adding DISTANCE as an additional feature for the model
Initial data set transformation

9© The Pythian Group Inc., 2018
Congratulations!
You've got a model

© The Pythian Group Inc., 201910
●Oracle offers two algorithms for regression problems
●Generalized Linear Model (GLM)
■Linear regression, simple calculation, low computation resources
●Support Vector Machines (SVM)
■Complex calculation, more computation resources, better results expected
●GLM model creation: 5 minutes
●SVM model creation: 9 hours
●Simple PL/SQL code to create it
Model creation
BEGIN
DBMS_DATA_MINING.CREATE_MODEL(
model_name => 'Lin_reg_taxi_fare',
mining_function => dbms_data_mining.regression,
data_table_name => 'TRAIN_DISTANCE_HOUR' ,
case_id_column_name => 'caseID',
target_column_name => 'FARE_AMOUNT',
settings_table_name =>
'linear_reg_taxi_fare_settings' );
END;

© The Pythian Group Inc., 201911
●SQL*Developer extension
●Provides GUI to create data flows to mine data. Allows
●data transformation
●data analysis
●model creation
●model performance data review
Data Miner

© The Pythian Group Inc., 201912
●Data Miner GUI
●Add table
●Add model
●Schedule model
DemoT
a
x
i
D
E
M
O

13© The Pythian Group Inc., 2019
Model results
Or what can I do now

© The Pythian Group Inc., 201914
●Using the data mining PREDICTION function (GLM)
Testing our model
WITH PREDICTION AS
(SELECT FARE_AMOUNT, PREDICTION (LIN_REG_TAXI_FARE_3 USING *) PREDICTED_FARE
FROM TRAIN_DISTANCE_BUCKET_3
WHERE CASEID < 11
)
SELECT FARE_AMOUNT,
ROUND(PREDICTED_FARE,2) PREDICTED_FARE,
ROUND(ABS(FARE_AMOUNT-PREDICTED_FARE),2) ABS_DIFF,
ROUND(ABS(FARE_AMOUNT-PREDICTED_FARE)/FARE_AMOUNT*100,2) "%DIFF"
FROM PREDICTION;
FARE_AMOUNT PREDICTED_FARE ABS_DIFF %DIFF
----------- -------------- ---------- ----------
6.5 8.81 2.31 35.48
5.7 8.53 2.83 49.71
10.9 12.25 1.35 12.4
18 18.64 .64 3.54
6.1 7.92 1.82 29.89
13 12.64 .36 2.81
8.5 10.56 2.06 24.2
9 10.99 1.99 22.07
6.5 8.67 2.17 33.36
9.3 10.35 1.05 11.26

10 ROWS selected.

Elapsed: 00:00:04.275

© The Pythian Group Inc., 201915
●Using the data mining PREDICTION function (SVM)
Testing our model (2)
WITH PREDICTION AS
(SELECT FARE_AMOUNT, PREDICTION (LIN_REG_TAXI_FARE_3_SVM USING *) PREDICTED_FARE
FROM TRAIN_DISTANCE_BUCKET_3
WHERE CASEID < 11
)
SELECT FARE_AMOUNT,
ROUND(PREDICTED_FARE,2) PREDICTED_FARE,
ROUND(ABS(FARE_AMOUNT-PREDICTED_FARE),2) ABS_DIFF,
ROUND(ABS(FARE_AMOUNT-PREDICTED_FARE)/FARE_AMOUNT*100,2) "%DIFF"
FROM PREDICTION;
FARE_AMOUNT PREDICTED_FARE ABS_DIFF %DIFF
----------- -------------- ---------- ----------
6.5 7.79 1.29 19.91
5.7 7.24 1.54 26.96
10.9 11.99 1.09 10.04
18 20.4 2.4 13.32
6.1 6.38 .28 4.58
13 12.51 .49 3.75
8.5 9.9 1.4 16.43
9 9.61 .61 6.73
6.5 7.31 .81 12.53
9.3 8.62 .68 7.32

10 ROWS selected.

Elapsed: 00:00:05.736

© The Pythian Group Inc., 201916
●Using the data mining PREDICTION_DETAILS function (GLM)
Testing our model (3)
SELECT FARE_AMOUNT,
ROUND(PREDICTION (LIN_REG_TAXI_FARE_3 USING *),2) PREDICTED_FARE,
REDICTION_DETAILS (LIN_REG_TAXI_FARE_3 USING *) DETAILS
FROM TRAIN_DISTANCE_BUCKET
WHERE CASEID <11;
FARE_AMOUNT PREDICTED_FARE DETAILS
----------- -------------- ------------------------------------------------------------------------------------------
6.5 8.81 <Details algorithm="Generalized Linear Model" >
<Attribute name="PASSENGER_COUNT" actualValue="1" weight="-.015" rank="1"/>
<Attribute name="PICKUP_WEEKDAY" actualValue="SAT" weight="-.162" rank="2"/>
<Attribute name="DISTANCE_KM" actualValue="2.07992012952067" weight="-.823" rank="3"/>
</Details>
(...)
9 10.99 <Details algorithm="Generalized Linear Model" >
<Attribute name="DROPOFF_BUCKET" actualValue="Z2" weight=".333" rank="1"/>
<Attribute name="PICKUP_WEEKDAY" actualValue="THU" weight=".039" rank="2"/>
<Attribute name="PASSENGER_COUNT" actualValue="1" weight="-.022" rank="3"/>
<Attribute name="PICKUP_BUCKET" actualValue="Z2" weight="-.07" rank="4"/>
<Attribute name="DISTANCE_KM" actualValue="2.79316183876644" weight="-.536" rank="5"/>
</Details>
(...)

© The Pythian Group Inc., 201917
●Model performance or fitness. How good is our model?
●Root Mean Squared Error (RMSE). GLM looks better but not that much.
●Mean Absolute Error (MEA). Both models perform similarly.
Testing our model (4) - Fitness
-- First for the GLM model
SQL> WITH prediction AS
2 (SELECT FARE_AMOUNT, PREDICTION (Lin_reg_taxi_fare_3 USING *) PREDICTED_FARE
3 FROM train_distance_bucket_3 )
4 SELECT SQRT(AVG((predicted_fare - fare_amount) * (predicted_fare - fare_amount))) rmse,
5 AVG(ABS(predicted_fare - fare_amount)) mae
6* FROM prediction p;
RMSE MAE
---------- ----------
6.66524393 2.96620921

-- Then for the SVM model
SQL> WITH prediction AS
2 (SELECT FARE_AMOUNT, PREDICTION (LIN_REG_TAXI_FARE_3_SVM USING *) PREDICTED_FARE
3 FROM train_distance_bucket_3 )
4 SELECT SQRT(AVG((predicted_fare - fare_amount) * (predicted_fare - fare_amount))) rmse,
5 AVG(ABS(predicted_fare - fare_amount)) mae
6 FROM prediction p ;

RMSE MAE
---------- ----------
7.76230946 2.39739187

© The Pythian Group Inc., 201918
The Data Mining feature stores model information in several internal
tables (DM$) which can be queried to obtain some insights.
Testing our model (5) - Internal data
TABLE_NAME NUM_ROWS
-------------------------------- ----------
DM$PCREGR_GLM_1_1 20503
DM$P5REGR_GLM_1_1 287
DM$PBREGR_GLM_1_1 202
DM$PDREGR_GLM_1_1 201
DM$PPREGR_GLM_1_1 29
DM$PFREGR_GLM_1_1 9
DM$PMREGR_GLM_1_1 3
DM$PJREGR_GLM_1_1 0
DM$PKREGR_GLM_1_1 0
DM$PEREGR_GLM_1_1 0
DM$PZREGR_GLM_1_1 0

© The Pythian Group Inc., 201919
NAME NVAL SVAL
------------------------------ ------------------- ------------------------------------
ADJUSTED_R_SQUARE .53
AIC 125,859,959.10
COEFF_VAR 59.46
CONVERGED YES
CORRECTED_TOTAL_DF 32,986,604.00
CORRECTED_TOT_SS 3,207,450,478.07
DEPENDENT_MEAN 11.33
ERROR_DF 32,986,403.00
ERROR_MEAN_SQUARE 45.40
ERROR_SUM_SQUARES 1,497,535,772.47
F_VALUE 187,385.57
GLMS_CONV_TOLERANCE .00
GLMS_NUM_ITERATIONS 30.00
GLMS_RIDGE_REGRESSION GLMS_RIDGE_REG_DISABLE
GLMS_SOLVER GLMS_SOLVER_CHOL
GMSEP 45.40
HOCKING_SP .00
J_P 45.40
MODEL_DF 201.00
MODEL_F_P_VALUE .00
MODEL_MEAN_SQUARE 8,507,038.34
MODEL_SUM_SQUARES 1,709,914,705.60
NUM_PARAMS 202.00
NUM_ROWS 32,986,605.00
RANK_DEFICIENCY .00
ROOT_MEAN_SQ 6.74
R_SQ .53
SBIC 125,863,052.04
VALID_COVARIANCE_MATRIX YES
29 rows selected.
Testing our model (5.2) - Internal data
SELECT NAME, NVAL, SVAL
FROM DM$PPREGR_GLM_1_1
ORDER BY NAME;

© The Pythian Group Inc., 201920
●Data Miner automatically provides model fitness information
●Root Mean Squared Error (RMSE)
●Mean Absolute Error (MEA)
●Predictive Confidence %
Testing our model (6) - Fitness by Data miner

© The Pythian Group Inc., 201921
References
●
An Oracle-Based Approach To The “Taxi Fare” Prediction Problem by Jose Rodriguez
●
Oracle Data Miner
●
Oracle Data Mining documentation

22© The Pythian Group Inc., 2018
Summary
●What is the Taxi Fare prediction
problem
●How did we prepare the data
●How to create a very simple model
●Testing our model
PROS CONS
Data is already there High compute resources
Simple SQL and PL/SQL License costs

Q&A
Ask now or reach out later, but don't keep the question for yourself

An Oracle approach to the Taxi Fare problem

About This Presentation

Slide Content

Tags

Categories

Download

Quick Actions

Statistics

Related Slideshows

An Oracle approach to the Taxi Fare problem

About This Presentation

Slide Content

Slide 1

Slide 2

Slide 3

Slide 4

Slide 5

Slide 6

Slide 7

Slide 8

Slide 9

Slide 10

Slide 11

Slide 12

Slide 13

Slide 14

Slide 15

Slide 16

Slide 17

Slide 18

Slide 19

Slide 20

Slide 21

Slide 22

Slide 23

Slide 24

Tags

Categories

Download

Quick Actions

Statistics

Related Slideshows

8-top-ai-courses-for-customer-support-representatives-in-2025.pptx

7-essential-ai-courses-for-call-center-supervisors-in-2025.pptx

25-essential-ai-courses-for-user-support-specialists-in-2025.pptx

8-essential-ai-courses-for-insurance-customer-service-representatives-in-2025.pptx

Know for Certain

PPT OPD LES 3ertt4t4tqqqe23e3e3rq2qq232.pptx