Knowledge Data Discovery Process in Data Analytics

seagatejogja1 18 views 81 slides Oct 11, 2024

ECML/PKDD-2003
Knowledge Discovery
Standards
Tutorial presented by:
Sarab Anand (University of Ulster),
Marko Grobelnik (Institute Jozef Stefan) and
Dietrich Wettschereck (The Robert Gordon
University)
Tuesday, 23. September 2003

ECML/PKDD 2003 : KD-Standards Tutorial S. Anand, M. Grobelnik, D. Wettschereck
Tutorial Objectives

Overview of existing KD-standards

Motivation for using KD-standards

How do these standards relate to
each other?

The Knowledge Discovery Process
Global view: CRISP-DM
Data access: SQL interfaces
Model generation: JDM, SQL/MM, OLE DB for DM
Model representation: PMML

Tutorial Outline

Introduction

CRISP-DM

SQL interfaces for Data Mining

Break

Java Data Mining API

Predictive Model Mark-up Language

Examples

CRISP-DM: A Standard
Process Model for Data
Mining
http://www.crisp-dm.org/

What is CRISP-DM?
Cross-Industry Standard Process for Data Mining
Aim:

To develop an industry-, tool- and application-neutral
process for conducting Knowledge Discovery

Define tasks, outputs from these tasks, terminology
and mining problem type characterization
Founding Consortium Members: DaimlerChrysler,
SPSS and NCR
CRISP-DM Special Interest Group ~ 200 members

Management Consultants

Data Warehousing and Data Mining Practitioners

Four Levels of Abstraction
Phases

Example: Data Preparation
Generic Tasks

A stable, general and complete set of tasks

Example: Data Cleaning
Specialized Task

How is the generic task carried out

Example: Missing Value Handling
Process Instance

Example: The mean value for numeric attributes and
the most frequent value for categorical attributes were used
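
The process instance above (mean for numeric attributes, most frequent value for categorical ones) can be sketched in Python. This is only an illustrative sketch: the function name `impute`, the column names and the sample rows are assumptions made for the example, and missing values are assumed to be represented as `None`.

```python
from collections import Counter
from statistics import mean


def impute(rows, numeric, categorical):
    """Fill None entries: column mean for numeric columns, mode for categorical ones."""
    fill = {}
    for col in numeric:
        fill[col] = mean(r[col] for r in rows if r[col] is not None)
    for col in categorical:
        counts = Counter(r[col] for r in rows if r[col] is not None)
        fill[col] = counts.most_common(1)[0][0]
    # Replace only missing entries in the columns we computed a fill value for.
    return [
        {c: (fill[c] if v is None and c in fill else v) for c, v in r.items()}
        for r in rows
    ]


# Hypothetical sample data, not taken from the tutorial.
rows = [
    {"age": 30, "gender": "f"},
    {"age": None, "gender": "f"},
    {"age": 50, "gender": None},
    {"age": 40, "gender": "m"},
]
cleaned = impute(rows, numeric=["age"], categorical=["gender"])
```

The fill values are computed from the observed entries only, so the sketch assumes every listed column has at least one non-missing value.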

Phases of CRISP-DM

Not linear, repeatedly backtracking

Business Understanding
Phase
Understand the business objectives

What is the status quo?

Understand business processes

Associated costs/pain

Define the success criteria

Develop a glossary of terms: speak the language

Cost/Benefit Analysis
Current Systems Assessment

Identify the key actors

Minimum: The Sponsor and the Key User

What forms should the output take?

Integration of output with existing technology landscape

Understand market norms and standards

Business Understanding
Phase

Task Decomposition

Break down the objective into sub-tasks

Map sub-tasks to data mining problem definitions

Identify Constraints

Resources

Law e.g. Data Protection

Build a project plan

List assumptions and risk factors
(technical/financial/business/organisational)

Data Understanding Phase

Collect Data

What are the data sources?

Internal and External Sources (e.g. Acxiom, Experian)

Document reasons for inclusion/exclusions

Depend on a domain expert

Accessibility issues

Legal and technical

Are there issues regarding data distribution
across different databases/legacy systems

Where are the disconnects?

Data Understanding Phase II
Data Description

Document data quality issues

requirements for data preparation

Compute basic statistics
Data Exploration

Simple univariate data plots/distributions

Investigate attribute interactions

Data Quality Issues

Missing Values

Understand their source: Missing vs Null values

Strange Distributions

Data Preparation Phase
Integrate Data

Joining multiple data tables
Summarisation/aggregation of data
Select Data

Attribute subset selection

Rationale for Inclusion/Exclusion
Data sampling

Training/Validation and Test sets
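
As a rough illustration of the data sampling step, the sketch below shuffles a data set and cuts it into training, validation and test sets. The 60/20/20 proportions, the fixed seed and the function name `split` are assumptions chosen for the example; the tutorial does not prescribe them.

```python
import random


def split(records, train_pct=60, validation_pct=20, seed=0):
    """Shuffle the records and cut them into training, validation and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * validation_pct // 100
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # whatever remains becomes the test set
    )


train_set, validation_set, test_set = split(range(100))
```

Integer percentages are used so the set sizes do not depend on floating-point rounding.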

Data Preparation Phase II
Data Transformation
Using functions such as log

Factor/Principal Components analysis

Normalization/Discretisation/Binarisation
Clean Data

Handling missing values/Outliers
Data Construction

Derived Attributes
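
The transformations named on this slide (a log function, normalisation, discretisation and binarisation) can be sketched as below. The helper names, the equal-width binning scheme and the sample incomes are assumptions for illustration; the sketch also assumes the values in a column are not all identical.

```python
import math


def min_max(values):
    """Rescale values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def discretise(values, n_bins):
    """Equal-width binning; the maximum value falls into the last bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def binarise(values, threshold):
    """Map each value to 1 if it reaches the threshold, else 0."""
    return [1 if v >= threshold else 0 for v in values]


# Hypothetical, heavily skewed attribute.
incomes = [10.0, 100.0, 1000.0, 10000.0]
log_incomes = [math.log10(v) for v in incomes]  # log transform tames the skew
normalised = min_max(log_incomes)
bins = discretise(log_incomes, 3)
flags = binarise(incomes, 500.0)
```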

The Modelling Phase
Select the appropriate modelling technique

Data pre-processing implications

Attribute independence

Data types/Normalisation/Distributions

Dependent on

Data mining problem type

Output requirements
Develop a testing regime

Sampling

Verify samples have similar characteristics and are
representative of the population

The Modelling Phase
Build Model

Choose initial parameter settings

Study model behaviour

Sensitivity analysis
Assess the model

Beware of over-fitting

Investigate the error distribution

Identify segments of the state space where the model is less
effective

Iteratively adjust parameter settings

Document the reasons for these changes

The Evaluation Phase
Validate Model

Human evaluation of results by domain experts

Evaluate usefulness of results from business
perspective

Define control groups

Calculate lift curves

Expected Return on Investment
Review Process
Determine next steps

Potential for deployment

Deployment architecture

Metrics for success of deployment
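
The lift calculation mentioned under "Validate Model" can be sketched as follows. This assumes a binary response (1 = responded, 0 = did not) and a model score per case; the function name `lift_at` and the sample scores are hypothetical.

```python
def lift_at(scores, responses, top_n):
    """Response rate among the top_n highest-scored cases divided by the overall rate."""
    ranked = [r for _, r in sorted(zip(scores, responses),
                                   key=lambda pair: pair[0], reverse=True)]
    overall_rate = sum(ranked) / len(ranked)
    top_rate = sum(ranked[:top_n]) / top_n
    return top_rate / overall_rate


# Hypothetical scored campaign: 3 responders among 10 customers.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
responses = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

# One point of the cumulative lift curve per cut-off.
lift_curve = [lift_at(scores, responses, n) for n in range(1, len(scores) + 1)]
```

A lift of 1.0 at the full population is expected by construction; values above 1.0 at small cut-offs indicate that the model ranks responders ahead of a random selection.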

The Deployment Phase
Knowledge Deployment is specific to objectives

Knowledge Presentation

Deployment within Scoring Engines and Integration
with the current IT infrastructure

Automated pre-processing of live data feeds

XML interfaces to 3rd-party tools

Generation of a report

Online/Offline

Monitoring and evaluation of effectiveness
Process deployment/production
Produce final project report

Document everything along the way

Microsoft OLE DB for DM
Extension of Microsoft Analysis
Services for Data Mining

What is OLE DB for Data Mining?

“OLE DB for DM” is Microsoft’s extension of its
Analysis Server product to cover DM
functionality

It is closely connected to MS OLAP Server

Works within SQL Server database suite

It defines DM at several levels:

Extensions of SQL language for describing DM tasks

API in the form of COM interface for:

(1) Programming DM clients within applications

(2) Programming DM providers (server side components)
for including new DM algorithms

Uses PMML for model description

Architecture of a solution using
OLE DB for DM technology
[Diagram labels: End-User Application (MS Excel / MS Site Server / MS Commerce Server); OLE DB for DM; MS Analysis Server with Decision Trees Component and Clustering Component; OLE DB; Database Systems (MS SQL Server, MS OLAP Server, Oracle, DB2, …)]

What are key DM tasks?

Key DM tasks covered by OLE DB for
DM are:

Predictive Modeling (Classification)

Segmentation (Clustering)

Association (Data Summarization)

Sequence and Deviation Analysis

Dependency Modeling

Defining a domain –
Creating Mining Model Object
Using an OLE DB command object, the client
executes a CREATE statement that is similar to a
CREATE TABLE statement:
CREATE MINING MODEL [Age Prediction] (
    [Customer ID] LONG KEY,
    [Gender] TEXT DISCRETE,
    [Age] DOUBLE DISCRETIZED() PREDICT,
    [Product Purchases] TABLE (
        [Product Name] TEXT KEY,
        [Quantity] DOUBLE NORMAL CONTINUOUS,
        [Product Type] TEXT DISCRETE RELATED TO [Product Name]
    )
)
USING [Decision Trees]

Inserting Training Data into
Model
In a manner similar to populating an ordinary table,
the client uses a form of the INSERT INTO statement.
Note the use of the SHAPE statement to create the
nested table.
INSERT INTO [Age Prediction] (
    [Customer ID], [Gender], [Age],
    [Product Purchases](SKIP, [Product Name], [Quantity], [Product Type])
)
SHAPE {
    SELECT [Customer ID], [Gender], [Age] FROM Customers ORDER BY [Customer ID]
}
APPEND (
    {SELECT [CustID], [Product Name], [Quantity], [Product Type] FROM Sales ORDER BY [CustID]}
    RELATE [Customer ID] TO [CustID])
AS [Product Purchases]

Using Models to make
Predictions
Predictions are made with a SELECT statement that joins the
model's set of all possible cases with another set of actual cases.
SELECT t.[Customer ID], [Age Prediction].[Age]
FROM [Age Prediction]
PREDICTION JOIN (
    SHAPE {
        SELECT [Customer ID], [Gender] FROM Customers ORDER BY [Customer ID]}
    APPEND (
        {SELECT [CustID], [Product Name], [Quantity] FROM Sales ORDER BY [CustID]}
        RELATE [Customer ID] TO [CustID]
    )
    AS [Product Purchases]
) AS t
ON [Age Prediction].[Gender] = t.[Gender] AND
   [Age Prediction].[Product Purchases].[Product Name] = t.[Product Purchases].[Product Name] AND
   [Age Prediction].[Product Purchases].[Quantity] = t.[Product Purchases].[Quantity]

Association Rules

The following statement creates a data mining model
to find out those products which sell together based
on an association algorithm. The model is interested
only in rules with at least five items:
Create Mining Model MyAssociationModel (
Transaction_id long key,
[Product purchases] table predict (
[Product Name] text key ) )
Using [My Association Algorithm] (Minimum_size = 5)

Training an association model is exactly the same as
training a tree model or a clustering model.

To get all the association rules discovered by the
algorithm, run the following statement:
Select * from MyAssociationModel.content

Regression Analysis
By using a regression algorithm, the following
mining model predicts loan risk level based on age,
income, homeowner, and marital status:
Create Mining Model MyRegressionModel (
Customer_id long key,
Age long continuous,
Homeowner boolean discrete,
Marital_status Boolean discrete,
Loan_risk_LEVEL continuous predict
)
Using [My Regression Algorithm]
The following statement returns all the coefficients
of the regression:
Select * from MyRegressionModel.content

Visual Basic example using the OLE
DB for DM Clustering component
' Open an ADO connection to the data mining provider
Dim ClusterConnection As New ADODB.Connection
ClusterConnection.Provider = "MSDMine"
DMMName = "[CollPlanDMM]"
DataFileName = ".\CollegePlan.mdb"
ClusterConnection.ConnectionString = "location=localhost;" & _
    "initial catalog=[FoodMart 2000];"
ClusterConnection.Open
' Define the clustering model; every non-key column is marked PREDICT
ClusterConnection.Execute "CREATE MINING MODEL [ClusterModel]" & _
    "([Student Id] LONG KEY, [College Plans] TEXT DISCRETE PREDICT," & _
    "[Gender] TEXT DISCRETE PREDICT, [Iq] LONG CONTINUOUS PREDICT," & _
    "[Parent Encouragement] TEXT DISCRETE PREDICT, [Parent Income] LONG CONTINUOUS PREDICT)" & _
    " USING Microsoft_Clustering"
…

XMLA - XML for Analysis
http://xmla.org/

What is XML for Analysis?

XML for Analysis is a set of XML Message
Interfaces that use the industry standard SOAP to
define the data access interaction between a client
application and an analytical data provider (OLAP
and Data Mining) working over the Internet.

What are the benefits of XMLA?
Customers will gain the ability to protect server and
tools investments and ensure that new analytical
deployments will interoperate and work cooperatively.
Developers will gain the ability to leverage existing
developer skills and to use open access XML-based
Web services, eliminating the need to program to
multiple APIs and query languages.
Independent software vendors will be able to
reduce complexity and costs for development and
maintenance by writing to a single access interface.

History of XMLA
[Timeline figure covering 2000 to 2003. Milestones shown:]
Hyperion & Microsoft announce co-sponsorship of the XMLA specification
SAS joins the Council
First XMLA Council meeting (creation of SIG teams)
Microsoft releases SDK
Version 1.0 released
Version 1.1 released
Version 1.2 (TBD)
InterOperate Workshop I
InterOperate Workshop II
Second XMLA Council meeting
1st public XMLA interoperability demonstration (TDWI)

Example of XMLA SOAP
Request

The following is an example of an Execute method call with
<Statement> set to an OLAP MDX SELECT statement:
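
A request of this kind can be assembled with Python's standard library, as sketched below. This is an illustrative sketch rather than the request shown in the tutorial: the MDX statement, the cube name `Sales`, the `DataSourceInfo` value and the helper name `build_execute_request` are assumptions, and the element layout (Execute, Command, Statement, Properties, PropertyList) follows the XMLA specification as commonly documented.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
XMLA_NS = "urn:schemas-microsoft-com:xml-analysis"


def build_execute_request(statement, properties):
    """Build a SOAP envelope that wraps an XMLA Execute call."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    execute = ET.SubElement(body, f"{{{XMLA_NS}}}Execute")
    command = ET.SubElement(execute, f"{{{XMLA_NS}}}Command")
    ET.SubElement(command, f"{{{XMLA_NS}}}Statement").text = statement
    props = ET.SubElement(execute, f"{{{XMLA_NS}}}Properties")
    prop_list = ET.SubElement(props, f"{{{XMLA_NS}}}PropertyList")
    for name, value in properties.items():
        ET.SubElement(prop_list, f"{{{XMLA_NS}}}{name}").text = value
    return ET.tostring(envelope, encoding="unicode")


# Hypothetical statement and property values, for illustration only.
request = build_execute_request(
    "SELECT Measures.MEMBERS ON COLUMNS FROM Sales",
    {
        "DataSourceInfo": "Provider=MSOLAP;Data Source=local",
        "Catalog": "FoodMart 2000",
        "Format": "Multidimensional",
        "AxisFormat": "TupleFormat",
    },
)
```

The resulting string would be sent as the body of an HTTP POST to the provider's XMLA endpoint; the transport step is left out because it depends on the server in use.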

Example of XMLA SOAP
Response

This is the abbreviated response for the
preceding method call:

What Provider Vendors Support XMLA?

What Consumer & Consulting Vendors
Are/Will Support XMLA?

BREAK

JDM: The Java API for Data
Mining

Objective
To develop a Java API that supports

Building of models

Scoring of data using models

Creation, storage, access and maintenance of data
and metadata supporting data mining results

To provide for data mining systems what JDBC™ did for
relational databases

Implementers of data mining applications can expose a
single, standard API understood by a wide variety of
client applications and components

Data Mining clients can be coded against a single API
that is independent of the underlying data mining system
/ vendor

Approach and Development
Leverages other related standards

PMML (DMG)

CWM (OMG)

SQL/MM (ISO)

JCX (JSR-16)
Public Draft Released in July, 2002
Currently work is continuing on the final
draft

JMI (JSR-40)

JOLAP (JSR-69)
 CRISP-DM

OLEDB DM

Related Standards
DMG PMML: Representation of data mining models for inter-vendor exchange (DTD/XML)
OMG CWM DM: Object model for representing data mining metadata: models, model results (UML/DTD/XML)
SQL/MM Pt. 6 DM: SQL objects for defining, creating, and applying data mining models, and obtaining their results (SQL)
OLE DB for DM: SQL-like interface for data mining operations (OLE DB/SQL)
JSR-073 JDM: Java API for defining, creating, and applying data mining models, and obtaining their results (Java)

The Expert Group

Mark Hornick, Oracle
(Lead)

BEA Systems

Computer Associates

CorporateIntellect

CalTech

Fair Isaac

Hyperion

IBM

KXEN

Quadstone

SAP

SAS

SPSS

Strategic Analytics

Sun Microsystems

University of Ulster

Use Case
A programmer is tasked with developing a target
marketing tool that allows the user to

Choose a target campaign

E-mail a random sample of the customers
Build a model based on the responses

Apply the model to improve the targeting of the campaign
Using JDM (for the 3rd and 4th tasks) the programmer
Defines the target data for the modelling using the Physical and Logical
Data Classes

Uses the Classification Function Settings class to set default parameters for
the learning task

Creates a build task that generates and persists the model
Creates an apply task that applies the model to select the campaign targets

Minimises risk associated with a change in the data mining vendor by using
the standard JDM interface

How will it work?
JDM defines a set of
interfaces for
Defining the data to be
used in the mining

Physical/Logical Data
Defining the data
mining parameters

Function settings

Support for Novice
Users

Algorithm settings

Expert User

Algorithm specific
settings

Performing Tasks

Executing a data mining
algorithm

Importing/Exporting to
PMML

Testing the knowledge

Applying the knowledge on
new data

Batch and Real-time Scoring

Compute Statistics

Interrogating the resulting
knowledge

Persistence of all Meta
Data/Data

Typical Architecture
[Diagram labels: JDM; Corporate Warehouse; MetaData Repository with Proprietary Data Mining Engine 1; MetaData Repository with Proprietary Data Mining Engine 2; …]
Uses Factory Classes. Hence, Service Provider Classes need not
be made public

Conformance Rules for Service
Providers
a la carte approach to functions and algorithms
supported

vendors implement functions and algorithms that their
products support

At least one function must be supported
All core packages must be supported
All methods within an implemented class must be
implemented
semantics specified for each method must be implemented
to ensure common interpretation of a given result
Must support J2EE and/or J2SE
Extension may be done through subclassing

Data Mining Functions
Supported

Classification

Regression

Attribute Importance

Clustering

Association Rules

Algorithms Supported

Naïve Bayes

Decision Trees

Feed Forward Neural Networks

Support Vector Machines

K-Means

Code Example (1)
// Get a connection
// (jdmCFactory is the JDM connection factory, assumed to have been obtained beforehand)
ConnectionSpec connSpec = (javax.datamining.resource.ConnectionSpec) jdmCFactory.getConnectionSpec();
connSpec.setName( "user1" );
connSpec.setPassword( "pswd" );
connSpec.setURI( "myDME" );
javax.datamining.resource.Connection dmeConn = jdmCFactory.getConnection( connSpec );
// Create and populate the Physical Data object: define the data to be used
PhysicalDataSetFactory pdsFactory
    = (PhysicalDataSetFactory) dmeConn.getFactory( "javax.datamining.data.PhysicalDataSet" );
PhysicalDataSet pd = pdsFactory.create( "minivan.data" );
pd.importMetaData();
dmeConn.saveObject( "myPD", pd );
// Create LogicalData object
LogicalDataFactory ldFactory
    = (LogicalDataFactory) dmeConn.getFactory( "javax.datamining.data.LogicalData" );
LogicalData ld = ldFactory.create( pd );
// Specify how attributes should be used
LogicalAttribute income = ld.getAttribute( "income" );
income.setAttributeType( AttributeType.numerical );

Code Example (2)
// Create the FunctionSettings for Classification
ClassificationSettingsFactory cfsFactory = (ClassificationSettingsFactory) dmeConn.getFactory(
    "javax.datamining.supervised.classification.ClassificationSettings" );
ClassificationSettings settings = cfsFactory.create();
settings.setTargetAttributeName( "buyMinivan" );
settings.setCostMatrix( costs ); // predefined cost matrix
// Create the AlgorithmSettings and add it to the FunctionSettings
NaiveBayesSettingsFactory nbFactory = (NaiveBayesSettingsFactory) dmeConn.getFactory(
    "javax.datamining.algorithm.naivebayes.NaiveBayesSettings" );
NaiveBayesSettings nbSettings = nbFactory.create();
nbSettings.setSingletonThreshold( 0.01 );
nbSettings.setPairwiseThreshold( 0.01 );
// Associate LD and AS with the FunctionSettings
settings.setAlgorithmSettings( nbSettings );
settings.setLogicalData( ld );
dmeConn.saveObject( "myFS", settings );

Code Example (3)
// Create the build task
BuildTaskFactory btFactory
    = (BuildTaskFactory) dmeConn.getFactory( "javax.datamining.task.BuildTask" );
BuildTask buildTask = btFactory.create( "myPD", "myFS", "myModel" );
VerificationReport report = buildTask.verify();
if ( report != null ) { // either error or warning
    ReportType reportType = report.getReportType(); // check whether it is just a warning or an error
} else {
    dmeConn.saveObject( "myBuildTask", buildTask );
    // Execute the task and block until finished
    ExecutionHandle handle = dmeConn.execute( "myBuildTask" );
    handle.waitForCompletion( null ); // wait without timeout until done
    // Access the model
    ClassificationModel model
        = (ClassificationModel) dmeConn.getObject( "myModel", NamedObject.model );
}
// Close the connection
dmeConn.close();

PMML: The Predictive Model
Markup Language
http://www.dmg.org

Predictive Model Mark-up Language (PMML)
Industry led standard for representing the
output of data mining
Supported by

Full Members: IBM, Oracle, Magnify, SPSS, SAS,
StatSoft, Microsoft, CorporateIntellect, KXEN,
Salford Systems

Numerous Associated Members
Objective

define and share predictive models using an
open standard

Rationale
Complex mosaic of software applications

Knowledge generators

Data Mining Vendors

Different data mining algorithms have different
languages for expressing the knowledge discovered

Vendor dependent representations for knowledge e.g.
C/C++ routines

Knowledge consumers

Real-time Scoring / Personalisation engines

Marketing Tools

Visualisation Tools
Need for a vendor independent
representation of data mining output

PMML
Benefits

proprietary issues and incompatibilities no
longer a barrier to the exchange of models
between applications

based on XML
develop models using any generator vendor,
deploy the models using any consumer
vendor application
Development

Current Release 2.1

Supported by most current releases of
member vendors' applications

PMML Document
Basic XML structure
DOCTYPE declaration not
required
A PMML document must
be a valid XML document

obey PMML conformance
rules
Root element <PMML>
6 child elements
2 required

Header

Data Dictionary

4 optional
<?xml version="1.0" ?>
<!DOCTYPE PMML PUBLIC "PMML 2.0" "http://www.dmg.org/v2-0/pmml_v2_0.dtd">
<PMML version="2.0">
    <Header … />
    <MiningBuildTask … />
    <DataDictionary … />
    <TransformationDictionary … />
    <SequenceMiningModel … />
    <Extension … />
</PMML>

Header
Attributes

copyright

description
Elements

Application (that generated the PMML)

Name: Capri

Version: 2.0

Annotation

Free text

TimeStamp

Date/Time of model creation

Header (2)
<?xml version="1.0" ?>
<PMML version="1.0">
    <Header copyright="CorporateIntellect" description="Results of CAPRI">
        <Application name="CORAL" version="3.0" />
        <Annotation>This is a PMML document with results from the
        CAPRI run on commodity market data.</Annotation>
        <Timestamp>2003-03-02 18:30:00 GMT +00:00</Timestamp>
    </Header>
    . . .
    . . .
</PMML>

Mining Build Task

May contain any XML value describing the
configuration of the training run that
produced the model

Information provided in this element is
essentially meta-data

not used specifically in the deployment of the
model by the PMML consumer

Specific content structure not defined in PMML

Data Dictionary
Attributes

Number of Fields

aids consistency checks
Elements

DataField

Attributes

Name

displayName

Optype

categorical/ordinal/continuous

Defines legal operations on the field values

Taxonomy

Name of taxonomy that defines a hierarchy on the values

isCyclic

Data Dictionary (2)

Elements

Value

Defines domain for ordinal and categorical attributes

value

displayValue

property: valid/ invalid/ missing
Interval

Defines the range of valid values for continuous fields

closure: openClosed, closedOpen, openOpen,
closedClosed

leftMargin

rightMargin

Taxonomy

Define hierarchies on specific fields within the data
dictionary

Data Dictionary (3)

Attributes

name: associates the taxonomy with the appropriate field
within the data dictionary (see DataField attribute taxonomy)

Elements

ChildParent

Attributes

childField: name of field within the table (see Elements
below) that represents the child value

parentField: name of field within the table (see Elements
below) that represents the parent value

parentLevelField: name of field within the table (see
Elements below) that represents the level in the hierarchy

isRecursive: Yes/No: if the whole hierarchy is defined in
the same table or an individual table per level

Elements

Inline Table/Table Locator

DataDictionary complete
<?xml version="1.0" ?>
<PMML version="1.0">
    <Header … />
    <DataDictionary numOfFields="3">
        <DataField name="Type" optype="categorical">
            <Value value="BU"/>
            <Value value="HO"/>
            <Value value="CO"/>
        </DataField>
        <DataField name="Age" optype="continuous">
            <Interval closure="closedClosed" leftMargin="0" rightMargin="150"/>
        </DataField>
        <DataField name="PostCode" optype="categorical" taxonomy="Location" />
        <Taxonomy name="Location">
            ………….
        </Taxonomy>
    </DataDictionary>
    . . .
</PMML>
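
To show how a consumer might use a data dictionary like this one, the sketch below parses a cut-down, well-formed version with Python's standard XML parser and checks values against the declared domain. It is a minimal sketch under stated assumptions: the helper names `load_dictionary` and `is_valid` are hypothetical, only `closedClosed` intervals are handled, and the elided parts of the slide's document are simply left out.

```python
import xml.etree.ElementTree as ET

# Cut-down document for illustration; elided parts of the slide are omitted.
PMML_TEXT = """<PMML version="1.0">
  <DataDictionary numOfFields="2">
    <DataField name="Type" optype="categorical">
      <Value value="BU"/><Value value="HO"/><Value value="CO"/>
    </DataField>
    <DataField name="Age" optype="continuous">
      <Interval closure="closedClosed" leftMargin="0" rightMargin="150"/>
    </DataField>
  </DataDictionary>
</PMML>"""


def load_dictionary(pmml_text):
    """Collect optype, listed values and interval bounds for every DataField."""
    root = ET.fromstring(pmml_text)
    fields = {}
    for field in root.find("DataDictionary").findall("DataField"):
        interval = field.find("Interval")
        bounds = None
        if interval is not None:
            bounds = (float(interval.get("leftMargin")),
                      float(interval.get("rightMargin")))
        fields[field.get("name")] = {
            "optype": field.get("optype"),
            "values": {v.get("value") for v in field.findall("Value")},
            "range": bounds,
        }
    return fields


def is_valid(fields, name, value):
    """Check one value against the field's declared domain (closedClosed only)."""
    spec = fields[name]
    if spec["optype"] == "continuous":
        if spec["range"] is None:
            return True
        low, high = spec["range"]
        return low <= float(value) <= high
    return not spec["values"] or value in spec["values"]


fields = load_dictionary(PMML_TEXT)
```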

Taxonomy Example
<Taxonomy name="Location">
    <ChildParent childColumn="Post Code" parentColumn="District">
        <TableLocator x-dbname="myDB" x-tableName="PostCode_District" />
    </ChildParent>
    <ChildParent childColumn="member" parentColumn="group" isRecursive="yes">
        <InlineTable>
            <Extension extender="MySystem">
                <row member="W9" group="CentralLondon"/>
                <row member="NW9" group="NorthLondon"/>
                <row member="NW2" group="NorthLondon"/>
                <row member="W1" group="CentralLondon"/>
                <row member="CentralLondon" group="London"/>
                <row member="NorthLondon" group="London"/>
                <row member="EastLondon" group="London"/>
                <row member="London" group="England"/>
                ………….
            </Extension>
        </InlineTable>
    </ChildParent>
</Taxonomy>
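
A recursive child/parent table such as the inline one above can be walked to find every ancestor of a value. The sketch below copies the member/group pairs from the example; the function name `ancestors` is hypothetical, and the table is assumed to contain no cycles.

```python
# Member/group pairs copied from the inline table in the taxonomy example.
ROWS = [
    ("W9", "CentralLondon"),
    ("NW9", "NorthLondon"),
    ("NW2", "NorthLondon"),
    ("W1", "CentralLondon"),
    ("CentralLondon", "London"),
    ("NorthLondon", "London"),
    ("EastLondon", "London"),
    ("London", "England"),
]
PARENT = dict(ROWS)


def ancestors(value):
    """Follow child -> parent links until a value with no parent is reached."""
    chain = []
    while value in PARENT:
        value = PARENT[value]
        chain.append(value)
    return chain


nw9_chain = ancestors("NW9")
```

Because the whole hierarchy lives in one table (`isRecursive="yes"`), a single lookup dictionary is enough; a per-level layout would need one table per level instead.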

Transformation Dictionary
Defines mapping of source data values to
values more suited for use by the mining
algorithm
PMML supports

Normalization: map values to numbers, the input
can be continuous or discrete.

Discretization: map continuous values to discrete
values.

Value mapping: map discrete values to discrete
values.

Aggregation: summarize or collect groups of
values, e.g. compute average

Transformation Dictionary (2)

TransformationDictionary

DerivedField Elements

Attributes

name

displayName

Elements

Expression (one of the following)

NormContinuous

NormDiscrete

Discretize

MapValues

Aggregate

Transformation Dictionary (3)
<DerivedField name="normalAge">
    <NormContinuous field="age">
        <LinearNorm orig="45" norm="0"/>
        <LinearNorm orig="82" norm="0.5"/>
        <LinearNorm orig="105" norm="1"/>
    </NormContinuous>
</DerivedField>
<DerivedField name="male">
    <NormDiscrete field="marital status" value="m"/>
</DerivedField>
<DerivedField name="female">
    <NormDiscrete field="marital status" value="f"/>
</DerivedField>
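
The `NormContinuous` element above describes a piecewise linear mapping through the listed `LinearNorm` points. A minimal Python sketch of that mapping follows; the function name `linear_norm` is hypothetical, and extending the nearest segment for inputs outside the listed range is an assumption made here, not something the slide specifies.

```python
# (orig, norm) pairs copied from the normalAge example.
AGE_POINTS = [(45.0, 0.0), (82.0, 0.5), (105.0, 1.0)]


def linear_norm(x, points):
    """Interpolate linearly between the two points that bracket x."""
    pts = sorted(points)
    # Pick the first segment whose right end is >= x; if x lies beyond the
    # last point the loop finishes on the last segment, which is then extended.
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x <= x1:
            break
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)


normal_age = linear_norm(63.5, AGE_POINTS)
```

An age of 63.5 lies halfway between 45 and 82, so it maps halfway between the norms 0 and 0.5.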

Transformation Dictionary (4)
<DerivedField name="binnedProfit">
    <Discretize field="Profit">
        <DiscretizeBin binValue="negative">
            <Interval closure="openOpen" rightMargin="0" />
        </DiscretizeBin>
        <DiscretizeBin binValue="positive">
            <Interval closure="closedOpen" leftMargin="0" />
        </DiscretizeBin>
    </Discretize>
</DerivedField>
<DerivedField name="houseType">
    <MapValues outputColumn="longForm">
        <FieldColumnPair field="Type" column="shortForm"/>
        <InlineTable><Extension>
            <row><shortForm>BU</shortForm><longForm>bungalow</longForm></row>
            <row><shortForm>HO</shortForm><longForm>house</longForm></row>
            <row><shortForm>CO</shortForm><longForm>cottage</longForm></row>
        </Extension></InlineTable>
    </MapValues>
</DerivedField>
<DerivedField name="itemsBought">
    <Aggregate field="item" function="multiset" groupField="transaction"/>
</DerivedField>
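
The three derived fields above correspond to simple operations that a PMML consumer has to carry out. The sketch below mirrors them in Python; the helper names are hypothetical, a missing margin is treated as unbounded, and the grouping helper is only one plausible reading of the `multiset` aggregate.

```python
import math


def in_interval(x, closure, left=-math.inf, right=math.inf):
    """Test x against an interval whose closure is e.g. 'closedOpen'."""
    left_ok = x >= left if closure.startswith("closed") else x > left
    right_ok = x <= right if closure.endswith("Closed") else x < right
    return left_ok and right_ok


# Bins of the binnedProfit field: (binValue, closure, leftMargin, rightMargin).
PROFIT_BINS = [
    ("negative", "openOpen", -math.inf, 0.0),
    ("positive", "closedOpen", 0.0, math.inf),
]


def discretize(x, bins):
    """Return the label of the first bin containing x, or None if none matches."""
    for label, closure, left, right in bins:
        if in_interval(x, closure, left, right):
            return label
    return None


# Value mapping of the houseType field (shortForm -> longForm).
HOUSE_TYPES = {"BU": "bungalow", "HO": "house", "CO": "cottage"}


def items_bought(records):
    """Group items by transaction, keeping duplicates (a multiset per group)."""
    groups = {}
    for transaction, item in records:
        groups.setdefault(transaction, []).append(item)
    return groups


baskets = items_bought([(1, "bread"), (1, "milk"), (2, "bread"), (1, "bread")])
```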

The PMML Document
[Diagram: the PMML document holds a Data Dictionary and a Transformation Dictionary, followed by Model 1, Model 2, …, Model k; each model block is labelled Mining Schema and Model Statistics.]

Mining Schema

Elements

MiningField

Attributes

Name

usageType: active/ predicted/ supplementary

Outliers: asIs/ asMissingValue/ asExtremeValues

lowValue

highValue

missingValueReplacement

missingValueTreatment: asIs/ asMean/ asMode/
asMedian/ asValue
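
How a consumer might apply the outlier and missing-value attributes of a `MiningField` to a single input value is sketched below. This is an interpretation for illustration, not a statement of the PMML specification: the function name `prepare` and its parameters are hypothetical, missing values are represented as `None`, and an outlier treated as missing is assumed to receive the same replacement as a genuinely missing value.

```python
def prepare(value, outliers="asIs", low=None, high=None, missing_replacement=None):
    """Apply outlier handling, then missing-value replacement, to one input value."""
    if value is None:
        return missing_replacement
    below = low is not None and value < low
    above = high is not None and value > high
    if outliers == "asExtremeValues":
        # Clip to the nearest bound given by lowValue / highValue.
        if below:
            return low
        if above:
            return high
    elif outliers == "asMissingValue" and (below or above):
        # Treat the outlier as missing, so the replacement applies.
        return missing_replacement
    return value


clipped_age = prepare(200, outliers="asExtremeValues", low=0, high=150)
```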

<?xml version="1.0" ?>
<PMML version="1.0" >
<Header … />
<DataDictionary … />
<SequenceModel functionName="sequences" algorithmName="Capri2"
minimumSupport="24.17" minimumConfidence="0.00"
numberOfItems="5" numberOfSets="5" numberOfSequences="11"
numberOfRules="3">
<Extension name="orderby" value="none"/>
… … …
</SequenceModel >
</PMML>

MiningSchema
<MiningSchema >
<MiningField name= "Price" usageType="predicted" />
<MiningField name= "location" usageType="active" />
<MiningField name= "bedrooms" usageType="active" />
<MiningField name= "houseType" usageType="active" />
<MiningField name="Area" usageType= "supplementary" />
</MiningSchema >
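
A PMML consumer typically starts by reading the MiningSchema to learn which field is predicted and which fields are inputs. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper name `fields_by_usage` is hypothetical, and treating a missing usageType as "active" is our reading of the default.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The MiningSchema fragment from the slide, wrapped as one well-formed element.
MINING_SCHEMA = """
<MiningSchema>
  <MiningField name="Price" usageType="predicted" />
  <MiningField name="location" usageType="active" />
  <MiningField name="bedrooms" usageType="active" />
  <MiningField name="houseType" usageType="active" />
  <MiningField name="Area" usageType="supplementary" />
</MiningSchema>
"""


def fields_by_usage(xml_text):
    """Group MiningField names by their usageType attribute."""
    groups = defaultdict(list)
    for field in ET.fromstring(xml_text).iter("MiningField"):
        # A missing usageType is treated as "active" here (assumption).
        groups[field.get("usageType", "active")].append(field.get("name"))
    return dict(groups)


roles = fields_by_usage(MINING_SCHEMA)
print(roles["predicted"])  # ['Price']
print(roles["active"])     # ['location', 'bedrooms', 'houseType']
```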

Model Statistics

Elements

UnivariateStatistics

Attributes

field

Elements

Discrete Statistics

Continuous Statistics

Counts: Valid, Invalid and Missing counts

NumericInfo: min/ max/ mean/ standard deviation/ median/ interQuartileRange
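
The counts and numeric summaries listed above are straightforward to compute for a single field. The sketch below uses Python's `statistics` module; the function name, the result keys, and the rule for what counts as missing (`None` or an empty string) or invalid (not parseable as a number) are all assumptions made for illustration.

```python
import statistics


def univariate_stats(raw_values):
    """Summarise one field: counts plus numeric info over the valid values."""
    valid, invalid, missing = [], 0, 0
    for raw in raw_values:
        if raw is None or raw == "":
            missing += 1
            continue
        try:
            valid.append(float(raw))
        except (TypeError, ValueError):
            invalid += 1

    counts = {"valid": len(valid), "invalid": invalid, "missing": missing}
    if len(valid) < 2:
        return counts, None  # too few values for a spread estimate

    # Quartiles via the module's default (exclusive) method.
    q1, _, q3 = statistics.quantiles(valid, n=4)
    info = {
        "min": min(valid),
        "max": max(valid),
        "mean": statistics.mean(valid),
        "std": statistics.stdev(valid),
        "median": statistics.median(valid),
        "iqr": q3 - q1,
    }
    return counts, info
```

Different tools use different quartile conventions, so the interquartile value in particular may differ slightly from what a given PMML producer writes.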

Supported Data Mining Models
Tree Model
Neural Networks
Clustering Model
Regression Model
General Regression Model
Naïve Bayes Model
Association Rules
Sequence Rule Model

Sequence Model

Represents the output of Sequence Rule Mining

Attributes

modelName

functionName

algorithmName

numberOfTransactions

minimumSupport

minimumConfidence

lengthLimit

…..
Elements

Sequence Rule

Elements

Antecedent Sequence

sequenceReference

Consequent Sequence

Delimiter

Sequence

Elements
SetReference

Delimiter

Set Predicate

Array

<SequenceModel functionName="sequences" numberOfTransactions="100"
minimumSupport="0.20" minimumConfidence="0.25" numberOfItems="6" numberOfSets="5"
numberOfSequences="3" numberOfRules="1">
<MiningSchema> ……… </MiningSchema>
<SetPredicate id="sp001" field="transaction" operator="supersetOf">
<Array n="1" type="string"> index.html </Array> </SetPredicate>
<SetPredicate id="sp002" field="transaction" operator="supersetOf">
<Array n="2" type="string"> offer.html kdnuggets.com </Array> </SetPredicate>
<SetPredicate id="sp003" field="transaction" operator="supersetOf">
<Array n="1" type="string"> products.html </Array> </SetPredicate>
<SetPredicate id="sp004" field="transaction" operator="supersetOf">
<Array n="1" type="string"> basket.html </Array> </SetPredicate>
<SetPredicate id="sp005" field="transaction" operator="supersetOf">
<Array n="1" type="string"> checkout.html </Array> </SetPredicate>
<Sequence id="seq001" numberOfSets="1" occurrence="80" support="0.80">
<SetReference setId="sp001"/> </Sequence>
<Sequence id="seq002" numberOfSets="4" occurrence="40" support="0.40">
<SetReference setId="sp002"/><Delimiter delimiter="acrossTimeWindows" gap="false"/>
<SetReference setId="sp003"/><Delimiter delimiter="sameTimeWindow" gap="true"/>
<SetReference setId="sp004"/><Delimiter delimiter="sameTimeWindow" gap="false"/>
<SetReference setId="sp005"/> </Sequence>
<SequenceRule id="rule001" numberOfSets="5" occurrence="20" support="0.20"
confidence="0.25">
<AntecedentSequence><SequenceReference seqId="seq001"/></AntecedentSequence>
<Delimiter delimiter="sameTimeWindow" gap="unknown"/>
<ConsequentSequence><SequenceReference seqId="seq002"/></ConsequentSequence>
</SequenceRule>
</SequenceModel>
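
The numbers in this example are internally consistent, which can be checked with a short script. The sketch below works on a trimmed copy of the model (only the attributes needed for the check) and assumes the usual definitions: support is the number of occurrences divided by numberOfTransactions, and rule confidence is the rule's occurrences divided by those of its antecedent sequence.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the slide's model: only the attributes needed for the check.
MODEL = """
<SequenceModel functionName="sequences" numberOfTransactions="100">
  <Sequence id="seq001" occurrence="80" support="0.80"/>
  <Sequence id="seq002" occurrence="40" support="0.40"/>
  <SequenceRule id="rule001" occurrence="20" support="0.20" confidence="0.25">
    <AntecedentSequence><SequenceReference seqId="seq001"/></AntecedentSequence>
    <ConsequentSequence><SequenceReference seqId="seq002"/></ConsequentSequence>
  </SequenceRule>
</SequenceModel>
"""

model = ET.fromstring(MODEL)
total = int(model.get("numberOfTransactions"))
occurrence = {s.get("id"): int(s.get("occurrence")) for s in model.iter("Sequence")}

rule = model.find("SequenceRule")
antecedent = rule.find("AntecedentSequence/SequenceReference").get("seqId")

# support = rule occurrences / number of transactions
rule_support = int(rule.get("occurrence")) / total
# confidence = rule occurrences / antecedent occurrences
rule_confidence = int(rule.get("occurrence")) / occurrence[antecedent]

print(rule_support, rule_confidence)  # 0.2 0.25
```

Both values match the `support="0.20"` and `confidence="0.25"` attributes of rule001, and both just meet the model's minimumSupport and minimumConfidence thresholds.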

PMML Consumers

Post-Processing

Visualization

Verification and Evaluation

Deployment

Hybrids and Meta-Learning

PEAR: Post-Processing Association Rules

Sets of Association rules are browsed like web pages

PMML-formatted association rules can be uploaded

Jorge et al., 2002

VizWiz - PMML Visualization

Java Applet

Some non-standard extensions required for best visualization

Wettschereck, 2003

Reads, visualizes and writes PMML files

Coupling with WEKA in progress

ROCOn - Visualizing ROC graphs

Use Receiver Operating Characteristics (ROC) to compare and evaluate models

Java Applet

Understands PMML as an extension to VizWiz

Farrand and Flach
(http://www.cs.bris.ac.uk/%7Efarrand/rocon/index.html)

Summary

Standards help to streamline efforts

Sign of maturity in field of KD

From “Art” to “Engineering”

Standards are still incomplete, but:
Use what is available!

More tools utilizing standards are needed

References

Grossman, R.L., Hornick, M.F., Meyer, G. (2002). Data Mining Standards Initiatives, Communications of the ACM, Vol. 45:8. See also http://www.dmg.org

Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C. and Wirth, R. (2000). CRISP-DM 1.0: Step-by-step data mining guide, CRISP-DM consortium, http://www.crisp-dm.org

Clifton, C., Thuraisingham, B. (2001). Emerging standards for data mining. Computer Standards & Interfaces, Vol. 23, pp. 187-193.

Compare and Contrast JOLAP and XML for Analysis
http://www.essbase.com/resource_library/articles/jolap_xmla.cfm

JCX http://www.jcp.org/en/jsr/detail?id=016

JOLAP http://www.jcp.org/en/jsr/detail?id=69

Jorge, A., Poças, J. and Azevedo, P. (2002). Post-processing operators for browsing large sets of association rules. Proc. Discovery Science 02 (eds. Lange, S., Satoh, K. and Smith, C. H.), Lübeck, Germany, LNCS 2534, Springer-Verlag.

Farrand, J. and Flach P. (2003). ROCOn: a tool for visualising ROC graphs. See:
http://www.cs.bris.ac.uk/%7Efarrand/rocon/index.html

Melton, J. and Eisenberg, A. SQL Multimedia and Application Packages (SQL/MM),
http://www.acm.org/sigmod/record/issues/0112/standards.pdf

OMG Common Warehouse MetaModel http://www.omg.org/cwm/

SOAP http://www.w3.org/TR/SOAP/

Tang, Z., Kim, P. Building Data Mining Solutions with SQL Server 2000,
http://www.dmreview.com/whitepaper/wid292.pdf

Wettschereck, D., Jorge, A., Moyle, S. (to appear). Data Mining and Decision Support Integration through the Predictive Model Markup Language Standard and Visualization. In Mladenic, D., Lavrac, N., Bohanec, M., Moyle, S. (editors): Data Mining and Decision Support: Integration and Collaboration, Kluwer Publishers.

XMLA http://www.xmla.org/