Knowledge Data Discovery Process in Data Analytics

ECML/PKDD-2003
Knowledge Discovery
Standards
Tutorial presented by:
Sarab Anand (University of Ulster),
Marko Grobelnik (Institute Jozef Stefan) and
Dietrich Wettschereck (The Robert Gordon
University)
Tuesday, 23. September 2003

ECML/PKDD 2003 : KD-Standards Tutorial S. Anand, M. Grobelnik, D. Wettschereck
Tutorial Objectives

Overview of existing KD-standards

Motivation for using KD-standards

How do these standards relate to
each other?

The Knowledge Discovery Process
Global view: CRISP-DM
Data access: SQL interfaces
Model generation: JDM, SQL/MM, OLE DB DM
Model representation: PMML

Tutorial Outline

Introduction

CRISP-DM

SQL interfaces for Data Mining

Break

Java Data Mining API

Predictive Model Mark-up Language

Examples

CRISP-DM: A Standard
Process Model for Data
Mining
http://www.crisp-dm.org/

What is CRISP-DM?
Cross-Industry Standard Process for Data Mining
Aim:

To develop an industry, tool and application neutral
process for conducting Knowledge Discovery

Define tasks, outputs from these tasks, terminology
and mining problem type characterization
Founding Consortium Members: DaimlerChrysler,
SPSS and NCR
CRISP-DM Special Interest Group ~ 200 members

Management Consultants

Data Warehousing and Data Mining Practitioners

Four Levels of Abstraction
Phases

Example: Data Preparation
Generic Tasks

A stable, general and complete set of tasks

Example: Data Cleaning
Specialized Task

How is the generic task carried out

Example: Missing Value Handling
Process Instance

Example: The mean value for numeric attributes and
the most frequent for categorical attributes was used
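
The process instance above (mean for numeric attributes, most frequent value for categorical attributes) can be sketched as follows. This is a minimal illustration in plain Python; the record layout, the `impute` helper and the use of `None` to mark a missing value are assumptions made for the example and are not prescribed by CRISP-DM.

```python
from collections import Counter
from statistics import mean


def impute(rows, numeric, categorical):
    """Return copies of rows with None replaced by the column mean
    (numeric columns) or the most frequent value (categorical columns)."""
    fill = {}
    for col in numeric:
        fill[col] = mean(r[col] for r in rows if r[col] is not None)
    for col in categorical:
        counts = Counter(r[col] for r in rows if r[col] is not None)
        fill[col] = counts.most_common(1)[0][0]
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
        for r in rows
    ]


# Hypothetical records: one missing age, one missing gender
rows = [
    {"age": 30, "gender": "f"},
    {"age": None, "gender": "f"},
    {"age": 50, "gender": None},
    {"age": 40, "gender": "m"},
]
cleaned = impute(rows, numeric=["age"], categorical=["gender"])
```

The original rows are left untouched, so the choice of replacement values can be documented and revisited later in the process.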

Phases of CRISP-DM

Not linear, repeatedly backtracking

Business Understanding
Phase
Understand the business objectives

What is the status quo?

Understand business processes

Associated costs/pain

Define the success criteria

Develop a glossary of terms: speak the language

Cost/Benefit Analysis
Current Systems Assessment

Identify the key actors

Minimum: The Sponsor and the Key User

What forms should the output take?

Integration of output with existing technology landscape

Understand market norms and standards

Business Understanding
Phase

Task Decomposition

Break down the objective into sub-tasks

Map sub-tasks to data mining problem definitions

Identify Constraints

Resources

Law e.g. Data Protection

Build a project plan

List assumptions and risk factors
(technical/financial/business/organisational)

Data Understanding Phase

Collect Data

What are the data sources?

Internal and External Sources (e.g. Acxiom, Experian)

Document reasons for inclusion/exclusions

Depend on a domain expert

Accessibility issues

Legal and technical

Are there issues regarding data distribution
across different databases/legacy systems

Where are the disconnects?

Data Understanding Phase II
Data Description

Document data quality issues

requirements for data preparation

Compute basic statistics
Data Exploration

Simple univariate data plots/distributions

Investigate attribute interactions

Data Quality Issues

Missing Values

Understand its source: Missing vs Null values

Strange Distributions

Data Preparation Phase
Integrate Data

Joining multiple data tables
Summarisation/aggregation of data
Select Data

Attribute subset selection

Rationale for Inclusion/Exclusion
Data sampling

Training/Validation and Test sets
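
A random split into training, validation and test sets might look like the sketch below. The 60/20/20 proportions, the fixed seed and the `split` helper are illustrative assumptions; the slide does not prescribe any particular ratio.

```python
import random


def split(records, train=0.6, validation=0.2, seed=0):
    """Shuffle the records and cut them into train / validation / test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatability
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # whatever remains is the test set
    )


train_set, validation_set, test_set = split(range(100))
```

The three sets are disjoint and together cover every record, which keeps the later test results independent of the data used for model building.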

Data Preparation Phase II
Data Transformation
Using functions such as log

Factor/Principal Components analysis

Normalization/Discretisation/Binarisation
Clean Data

Handling missing values/Outliers
Data Construction

Derived Attributes

The Modelling Phase
Selection of the appropriate modelling technique

Data pre-processing implications

Attribute independence

Data types/Normalisation/Distributions

Dependent on

Data mining problem type

Output requirements
Develop a testing regime

Sampling

Verify samples have similar characteristics and are
representative of the population

The Modelling Phase
Build Model

Choose initial parameter settings

Study model behaviour

Sensitivity analysis
Assess the model

Beware of over-fitting

Investigate the error distribution

Identify segments of the state space where the model is less
effective

Iteratively adjust parameter settings

Document reasons for these changes

The Evaluation Phase
Validate Model

Human evaluation of results by domain experts

Evaluate usefulness of results from business
perspective

Define control groups

Calculate lift curves

Expected Return on Investment
Review Process
Determine next steps

Potential for deployment

Deployment architecture

Metrics for success of deployment
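
For the lift curves mentioned above, one common definition is the response rate among the top-scored fraction of cases divided by the overall response rate. The sketch below assumes that definition; the scores, the 0/1 responses and the `lift_at` helper are made-up illustrations.

```python
def lift_at(scores, responses, fraction):
    """Lift of the top `fraction` of cases (ranked by model score)
    relative to the overall response rate."""
    ranked = sorted(zip(scores, responses), key=lambda p: p[0], reverse=True)
    k = max(1, round(len(ranked) * fraction))
    top_rate = sum(r for _, r in ranked[:k]) / k
    base_rate = sum(responses) / len(responses)
    return top_rate / base_rate


# Hypothetical model scores and observed responses (1 = responded)
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
responses = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]

# Points of the lift curve at 20%, 50% and 100% of the ranked cases
curve = [(f, lift_at(scores, responses, f)) for f in (0.2, 0.5, 1.0)]
```

A lift above 1 for small fractions indicates that the model concentrates responders at the top of the ranking; at 100% the lift is 1 by construction.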

The Deployment Phase
Knowledge Deployment is specific to objectives

Knowledge Presentation

Deployment within Scoring Engines and Integration
with the current IT infrastructure

Automated pre-processing of live data feeds

XML interfaces to 3rd party tools

Generation of a report

Online/Offline

Monitoring and evaluation of effectiveness
Process deployment/production
Produce final project report

Document everything along the way

Microsoft OLE DB for DM
Extension of Microsoft Analysis
Services for Data Mining

What is OLE DB for Data Mining?

“OLE DB for DM” is Microsoft’s extension of its
Analysis Server product to cover DM
functionality

It is closely connected to MS OLAP Server

Works within SQL Server database suite

It defines DM at several levels:

Extensions of SQL language for describing DM tasks

API in the form of COM interface for:

(1) Programming DM clients within applications

(2) Programming DM providers (server side components)
for including new DM algorithms

Uses PMML for model description

Architecture of a solution using
OLE DB for DM technology
[Architecture diagram. Components shown: End-User Application (MS Excel / MS Site Server / MS Commerce Server); MS Analysis Server with a Decision Trees Component and a Clustering Component; Database Systems (MS SQL Server, MS OLAP Server, Oracle, DB2, …). Interface labels on the connections: OLE DB for DM (four occurrences) and OLE DB (one occurrence).]

What are key DM tasks?

Key DM tasks covered by OLE DB for
DM are:

Predictive Modeling (Classification)

Segmentation (Clustering)

Association (Data Summarization)

Sequence and Deviation Analysis

Dependency Modeling

Defining a domain –
Creating Mining Model Object
Using an OLE DB command object, the client
executes a CREATE statement that is similar to a
CREATE TABLE statement:
CREATE MINING MODEL [Age Prediction](
[Customer ID] LONG KEY,
[Gender] TEXT DISCRETE,
[Age] DOUBLE DISCRETIZED() PREDICT,
[Product Purchases] TABLE (
[Product Name] TEXT KEY,
[Quantity] DOUBLE NORMAL CONTINUOUS,
[Product Type] TEXT DISCRETE RELATED TO [Product Name]
)
)
USING [Decision Trees]

Inserting Training Data into
Model
In a manner similar to populating an ordinary table,
the client uses a form of the INSERT INTO statement.
Note the use of the SHAPE statement to create the
nested table.
INSERT INTO [Age Prediction](
[Customer ID], [Gender], [Age],
[Product Purchases](SKIP, [Product Name], [Quantity], [Product Type])
)
SHAPE {
SELECT [Customer ID], [Gender], [Age] FROM Customers ORDER BY [Customer ID]
}
APPEND (
{SELECT [CustID], [Product Name], [Quantity], [Product Type] FROM Sales ORDER BY [CustID]}
RELATE [Customer ID] To [CustID])
AS [Product Purchases]

Using Models to make
Predictions
Predictions are made with a SELECT statement that joins the
model's set of all possible cases with another set of actual cases.
SELECT t.[Customer ID], [Age Prediction].[Age]
FROM [Age Prediction]
PREDICTION JOIN (
SHAPE {
SELECT [Customer ID], [Gender] FROM Customers ORDER BY [Customer ID]}
APPEND (
{SELECT [CustID], [Product Name], [Quantity] FROM Sales ORDER BY [CustID]}
RELATE [Customer ID] To [CustID]
)
AS [Product Purchases]
) AS t
ON [Age Prediction].[Gender] = t.[Gender] AND
[Age Prediction].[Product Purchases].[Product Name] = t.[Product Purchases].[Product Name] AND
[Age Prediction].[Product Purchases].[Quantity] = t.[Product Purchases].[Quantity]

Association Rules

The following statement creates a data mining model
to find out those products which sell together based
on an association algorithm. The model is interested
only in rules with at least five items:
Create Mining Model MyAssociationModel (
Transaction_id long key,
[Product purchases] table predict (
[Product Name] text key ) )
Using [My Association Algorithm] (Minimum_size = 5)

Training an association model is exactly the same as
training a tree model or a clustering model.

To get all the association rules discovered by the
algorithm, run the following statement:
Select * from MyAssociationModel.content

Regression Analysis
By using a regression algorithm, the following
mining model predicts loan risk level based on age,
income, homeowner, and marital status:
Create Mining Model MyRegressionModel (
Customer_id long key,
Age long continuous,
Homeowner boolean discrete,
Marital_status Boolean discrete,
Loan_risk_LEVEL continuous predict
)
Using [My Regression Algorithm]
The following statement returns all the coefficients
of the regression:
Select * from MyRegressionModel.content

Visual Basic example using the OLE
DB for DM Clustering component
' Connect to the OLE DB for DM provider
Dim ClusterConnection As New ADODB.Connection
ClusterConnection.Provider = "MSDMine"
DMMName = "[CollPlanDMM]"
DataFileName = ".\CollegePlan.mdb"
ClusterConnection.ConnectionString = "location=localhost;" & _
"initial catalog=[FoodMart 2000];"
ClusterConnection.Open
' Create the clustering model; each fragment starts with a space or
' follows a comma so that the concatenated statement stays well separated
ClusterConnection.Execute "CREATE MINING MODEL [ClusterModel]" & _
" ([Student Id] LONG KEY, [College Plans] TEXT DISCRETE PREDICT," & _
" [Gender] TEXT DISCRETE PREDICT, [Iq] LONG CONTINUOUS PREDICT," & _
" [Parent Encouragement] TEXT DISCRETE PREDICT, [Parent Income] LONG CONTINUOUS PREDICT)" & _
" USING Microsoft_Clustering"
…

XMLA - XML for Analysis
http://xmla.org/

What is XML for Analysis?

XML for Analysis is a set of XML Message
Interfaces that use the industry standard SOAP to
define the data access interaction between a client
application and an analytical data provider (OLAP
and Data Mining) working over the Internet.

What are the benefits of XMLA?
Customers will gain the ability to protect server and
tools investments and ensure that new analytical
deployments will interoperate and work cooperatively.
Developers will gain the ability to leverage existing
developer skills and to use open access XML-based
Web services, eliminating the need to program to
multiple APIs and query languages.
Independent software vendors will be able to
reduce complexity and costs for development and
maintenance by writing to a single access interface.

History of XMLA
[Timeline figure covering 2000 to 2003. Month labels on the axis: Apr, Nov, May, Apr, Apr, Sep, Mar. Milestones shown:]
Hyperion & Microsoft announce co-sponsorship of XMLA Specification
SAS joins Council
First XMLA Council Meeting (creation of SIG teams)
Microsoft releases SDK
Version 1.0 released
Version 1.1 released
Version 1.2 (TBD)
InterOperate Workshop I
InterOperate Workshop II
Second XMLA Council Meeting
1st public XMLA InterOperability Demonstration (TDWI)

Example of XMLA SOAP
Request

The following is an example of an Execute method call with
<Statement> set to an OLAP MDX SELECT statement:
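
A minimal sketch of how a client might assemble such an Execute request is shown below, using only the Python standard library. The element names (Execute, Command, Statement, Properties, PropertyList) and the two namespaces follow the XMLA specification as far as we are aware; the MDX statement, the data source string and the `build_execute_request` helper are hypothetical placeholders, and no request is actually sent.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
XMLA_NS = "urn:schemas-microsoft-com:xml-analysis"


def build_execute_request(statement, data_source_info, catalog):
    """Build a SOAP envelope carrying an XMLA Execute call as a string."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    execute = ET.SubElement(body, f"{{{XMLA_NS}}}Execute")
    # The command holds the statement to run (here an MDX SELECT)
    command = ET.SubElement(execute, f"{{{XMLA_NS}}}Command")
    ET.SubElement(command, f"{{{XMLA_NS}}}Statement").text = statement
    # Properties tell the provider where and how to run the statement
    properties = ET.SubElement(execute, f"{{{XMLA_NS}}}Properties")
    plist = ET.SubElement(properties, f"{{{XMLA_NS}}}PropertyList")
    ET.SubElement(plist, f"{{{XMLA_NS}}}DataSourceInfo").text = data_source_info
    ET.SubElement(plist, f"{{{XMLA_NS}}}Catalog").text = catalog
    ET.SubElement(plist, f"{{{XMLA_NS}}}Format").text = "Multidimensional"
    return ET.tostring(envelope, encoding="unicode")


# Hypothetical statement and connection details
request = build_execute_request(
    "SELECT [Measures].MEMBERS ON COLUMNS FROM [Sales]",
    "Provider=ExampleProvider;Data Source=local",
    "FoodMart 2000",
)
```

The resulting string would be posted over HTTP to the provider's XMLA endpoint; the transport step is left out because the endpoint and its authentication are deployment specific.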

Example of XMLA SOAP
Response

This is the abbreviated response for the
preceding method call:

What Provider Vendors Support XMLA?

What Consumer & Consulting Vendors
Are/Will Support XMLA?

BREAK

JDM: The Java API for Data
Mining

Objective
To develop a Java API that supports

Building of models

Scoring of data using models

Creation, storage, access and maintenance of data
and metadata supporting data mining results

To provide for data mining systems what JDBC™ did for
relational databases

Implementers of data mining applications can expose a
single, standard API understood by a wide variety of
client applications and components

Data Mining clients can be coded against a single API
that is independent of the underlying data mining system
/ vendor

Approach and Development
Leverages other related standards

PMML (DMG)

CWM (OMG)

SQL/MM (ISO)

JCX (JSR-16)

JMI (JSR-40)

JOLAP (JSR-69)

CRISP-DM

OLE DB DM
Public Draft released in July 2002
Work is currently continuing on the final draft

Related Standards
PMML (DMG): Representation of data mining models for inter-vendor exchange (DTD/XML)
CWM DM (OMG): Object model for representing data mining metadata: models, model results (UML/DTD/XML)
SQL/MM Pt. 6 DM: SQL objects for defining, creating, and applying data mining models, and obtaining their results (SQL)
OLE DB for DM: SQL-like interface for data mining operations (OLE DB/SQL)
JDM (JSR-073): Java API for defining, creating, and applying data mining models, and obtaining their results (Java)

The Expert Group

Mark Hornick, Oracle (Lead)

BEA Systems

Computer Associates

CorporateIntellect

CalTech

Fair Isaac

Hyperion

IBM

KXEN

Quadstone

SAP

SAS

SPSS

Strategic Analytics

Sun Microsystems

University of Ulster

Use Case
A programmer is tasked with developing a target
marketing tool that allows the user to

Choose a target campaign

E-mail a random sample of the customers
Build a model based on the responses

Apply the model to improve the targeting of the campaign
Using JDM (for the 3rd and 4th tasks) the programmer
Defines the target data for the modelling using the Physical and Logical
Data Classes

Uses the Classification Function Settings class to set default parameters for
the learning task

Creates a build task that generates and persists the model
Creates an apply task that applies the model to select the campaign targets

Minimises risk associated with a change in the data mining vendor by using
the standard JDM interface

How will it work?
JDM defines a set of
interfaces for
Defining the data to be
used in the mining

Physical/Logical Data
Defining the data
mining parameters

Function settings

Support for Novice
Users

Algorithm settings

Expert User

Algorithm specific
settings

Performing Tasks

Executing a data mining
algorithm

Importing/Exporting to
PMML

Testing the knowledge

Applying the knowledge on
new data

Batch and Real-time Scoring

Compute Statistics

Interrogating the resulting
knowledge

Persistence of all Meta
Data/Data

Typical Architecture
[Architecture diagram: a JDM layer in front of the Corporate Warehouse and Proprietary Data Mining Engines 1, 2, …, each engine paired with a MetaData Repository]
Uses Factory Classes; hence, Service Provider Classes need not be made public

Conformance Rules for Service
Providers
a la carte approach to functions and algorithms
supported

vendors implement functions and algorithms that their
products support

At least one function must be supported
All core packages must be supported
All methods within an implemented class must be
implemented
Semantics specified for each method must be implemented
to ensure common interpretation of a given result
Must support J2EE and/or J2SE
Extension may be done through subclassing

Data Mining Functions
Supported

Classification

Regression

Attribute Importance

Clustering

Association Rules

Algorithms Supported

Naïve Bayes

Decision Trees

Feed Forward Neural Networks

Support Vector Machines

K-Means

Code Example (1)
// jdmCFactory is assumed to be a connection factory obtained beforehand
// (the lookup is not shown on the slide)
// Get a connection
ConnectionSpec connSpec = (javax.datamining.resource.ConnectionSpec) jdmCFactory.getConnectionSpec();
connSpec.setName( "user1" );
connSpec.setPassword( "pswd" );
connSpec.setURI( "myDME" );
javax.datamining.resource.Connection dmeConn = jdmCFactory.getConnection( connSpec );
// Create and populate the Physical Data object: define the data to be used
PhysicalDataSetFactory pdsFactory
= (PhysicalDataSetFactory) dmeConn.getFactory( "javax.datamining.data.PhysicalDataSet" );
PhysicalDataSet pd = pdsFactory.create( "minivan.data" );
pd.importMetaData();
dmeConn.saveObject( "myPD", pd );
// Create LogicalData object
LogicalDataFactory ldFactory
= (LogicalDataFactory) dmeConn.getFactory( "javax.datamining.data.LogicalData" );
LogicalData ld = ldFactory.create( pd );
// Specify how attributes should be used
LogicalAttribute income = ld.getAttribute( "income" );
income.setAttributeType( AttributeType.numerical );

Code Example (2)
// Create the FunctionSettings for Classification
ClassificationSettingsFactory cfsFactory = (ClassificationSettingsFactory) dmeConn.getFactory(
"javax.datamining.supervised.classification.ClassificationSettings" );
ClassificationSettings settings = cfsFactory.create();
settings.setTargetAttributeName( "buyMinivan" );
settings.setCostMatrix( costs ); // predefined cost matrix
// Create the AlgorithmSettings and add it to the FunctionSettings
NaiveBayesSettingsFactory nbFactory = (NaiveBayesSettingsFactory) dmeConn.getFactory(
"javax.datamining.algorithm.naivebayes.NaiveBayesSettings" );
NaiveBayesSettings nbSettings = nbFactory.create();
nbSettings.setSingletonThreshold( 0.01 );
nbSettings.setPairwiseThreshold( 0.01 );
// Associate LD and AS with the FunctionSettings
settings.setAlgorithmSettings( nbSettings );
settings.setLogicalData( ld );
dmeConn.saveObject( "myFS", settings );

Code Example (3)
// Create the build task
BuildTaskFactory btFactory
= (BuildTaskFactory) dmeConn.getFactory( "javax.datamining.task.BuildTask" );
BuildTask buildTask = btFactory.create( "myPD", "myFS", "myModel" );
VerificationReport report = buildTask.verify();
if ( report != null ) { // either error or warning
ReportType reportType = report.getReportType(); // check if it's just a warning or an error
} else {
dmeConn.saveObject( "myBuildTask", buildTask );
// Execute the task and block until finished
ExecutionHandle handle = dmeConn.execute( "myBuildTask" );
handle.waitForCompletion( null ); // wait without timeout until done
// Access the model
ClassificationModel model
= (ClassificationModel) dmeConn.getObject( "myModel", NamedObject.model );
}
// Close the connection
dmeConn.close();

PMML: The Predictive Model
Markup Language
http://www.dmg.org

Predictive Model Mark-up Language (PMML)
Industry led standard for representing the
output of data mining
Supported by

Full Members: IBM, Oracle, Magnify, SPSS, SAS,
StatSoft, Microsoft, CorporateIntellect, KXEN,
Salford Systems

Numerous Associated Members
Objective

define and share predictive models using an
open standard

Rationale
Complex mosaic of software applications

Knowledge generators

Data Mining Vendors

Different data mining algorithms have different
languages for expressing the knowledge discovered

Vendor dependent representations for knowledge e.g.
C/C++ routines

Knowledge consumers

Real-time Scoring / Personalisation engines

Marketing Tools

Visualisation Tools
Need for a vendor independent
representation of data mining output

PMML
Benefits

proprietary issues and incompatibilities no
longer a barrier to the exchange of models
between applications

based on XML
develop models using any generator vendor,
deploy the models using any consumer
vendor application
Development

Current Release 2.1

Supported by most current releases of
member vendors applications

PMML Document
Basic XML structure
DOCTYPE declaration not
required
A PMML document must
be a valid XML document

obey PMML conformance
rules
Root element <PMML>
6 child elements
2 required

Header

Data Dictionary

4 optional
<?xml version="1.0" ?>
<!DOCTYPE PMML PUBLIC "PMML 2.0"
"http://www.dmg.org/v2-0/pmml_v2_0.dtd">
<PMML version="2.0" >
<Header … />
<MiningBuildTask …/>
<DataDictionary …/>
<TransformationDictionary …/>
<SequenceMiningModel …/>
<Extension …/>
</PMML>

Header
Attributes

copyright

description
Elements

Application (that generated the PMML)

Name: Capri

Version: 2.0

Annotation

Free text

TimeStamp

Date/Time of model creation

Header (2)
<?xml version="1.0" ?>
<PMML version="1.0" >
<Header copyright="CorporateIntellect" description="Results of CAPRI" >
<Application name="CORAL" version="3.0" />
<Annotation>This is a PMML document with results from the
CAPRI run on commodity market data.</Annotation>
<Timestamp>2003-03-02 18:30:00 GMT +00:00</Timestamp>
</Header>
. . .
. . .
</PMML>

Mining Build Task

May contain any XML value describing the
configuration of the training run that
produced the model

Information provided in this element is
essentially meta-data

not used specifically in the deployment of the
model by the PMML consumer

Specific content structure not defined in PMML

Data Dictionary
Attributes

Number of Fields

aids consistency checks
Elements

DataField

Attributes

Name

displayName

Optype

categorical/ordinal/continuous

Defines legal operations on the field values

Taxonomy

Name of taxonomy that defines a hierarchy on the values

isCyclic

Data Dictionary (2)

Elements

Value

Defines domain for ordinal and categorical attributes

value

displayValue

property: valid/ invalid/ missing
Interval

Defines the range of valid values for continuous fields

closure: openClosed, closedOpen, openOpen,
closedClosed

leftMargin

rightMargin

Taxonomy

Define hierarchies on specific fields within the data
dictionary

Data Dictionary (3)

Attributes

name: associates the taxonomy with the appropriate field
within the data dictionary (see DataField attribute taxonomy)

Elements

ChildParent

Attributes

childField: name of field within the table (see Elements
below) that represents the child value

parentField: name of field within the table (see Elements
below) that represents the parent value

parentLevelField: name of field within the table (see
Elements below) that represents the level in the hierarchy

isRecursive: Yes/No: if the whole hierarchy is defined in
the same table or an individual table per level

Elements

Inline Table/Table Locator

DataDictionary complete
<?xml version="1.0" ?>
<PMML version="1.0" >
<Header … />
<DataDictionary numberOfFields="3" >
<DataField name="Type" optype="categorical">
<Value value="BU"/>
<Value value="HO"/>
<Value value="CO"/>
</DataField>
<DataField name="Age" optype="continuous">
<Interval closure="closedClosed" leftMargin="0" rightMargin="150"/>
</DataField>
<DataField name="PostCode" optype="categorical" taxonomy="Location" />
<Taxonomy name="Location">
………….
</Taxonomy>
</DataDictionary >
. . .
</PMML>
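
A PMML consumer could read such a data dictionary along the following lines. This is a sketch with Python's standard `xml.etree.ElementTree`; it assumes a document without XML namespaces and with a `numberOfFields` attribute on the dictionary, and the `describe_fields` helper and the trimmed sample document are illustrative only.

```python
import xml.etree.ElementTree as ET

# Trimmed sample modelled on the slide (Header and Taxonomy left out)
PMML_TEXT = """<?xml version="1.0" ?>
<PMML version="1.0">
  <DataDictionary numberOfFields="3">
    <DataField name="Type" optype="categorical">
      <Value value="BU"/>
      <Value value="HO"/>
      <Value value="CO"/>
    </DataField>
    <DataField name="Age" optype="continuous">
      <Interval closure="closedClosed" leftMargin="0" rightMargin="150"/>
    </DataField>
    <DataField name="PostCode" optype="categorical" taxonomy="Location"/>
  </DataDictionary>
</PMML>"""


def describe_fields(pmml_text):
    """Map each DataField name to its optype, legal values and bounds."""
    dictionary = ET.fromstring(pmml_text).find("DataDictionary")
    fields = {}
    for field in dictionary.findall("DataField"):
        interval = field.find("Interval")
        bounds = None
        if interval is not None:
            bounds = (float(interval.get("leftMargin")),
                      float(interval.get("rightMargin")))
        fields[field.get("name")] = {
            "optype": field.get("optype"),
            "values": [v.get("value") for v in field.findall("Value")],
            "bounds": bounds,
        }
    # The declared field count serves as a consistency check
    if int(dictionary.get("numberOfFields")) != len(fields):
        raise ValueError("numberOfFields does not match the DataField count")
    return fields


fields = describe_fields(PMML_TEXT)
```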

Taxonomy Example
<Taxonomy name="Location">
<ChildParent childColumn="Post Code" parentColumn="District">
<TableLocator x-dbname="myDB" x-tableName="PostCode_District" />
</ChildParent>
<ChildParent childColumn="member" parentColumn="group" isRecursive="yes">
<InlineTable>
<Extension extender="MySystem">
<row member="W9" group="CentralLondon"/>
<row member="NW9" group="NorthLondon"/>
<row member="NW2" group="NorthLondon"/>
<row member="W1" group="CentralLondon"/>
<row member="CentralLondon" group="London"/>
<row member="NorthLondon" group="London"/>
<row member="EastLondon" group="London"/>
<row member="London" group="England"/>
………….
</Extension>
</InlineTable>
</ChildParent> </Taxonomy>

Transformation Dictionary
Defines mapping of source data values to
values more suited for use by the mining
algorithm
PMML supports

Normalization: map values to numbers, the input
can be continuous or discrete.

Discretization: map continuous values to discrete
values.

Value mapping: map discrete values to discrete
values.

Aggregation: summarize or collect groups of
values, e.g. compute average

Transformation Dictionary (2)

TransformationDictionary

DerivedField Elements

Attributes

name

displayName

Elements

Expression (one of the following)

NormContinuous

NormDiscrete

Discretize

MapValues

Aggregates

Transformation Dictionary (3)
<DerivedField name="normalAge">
<NormContinuous field="age">
<LinearNorm orig="45" norm="0"/>
<LinearNorm orig="82" norm="0.5"/>
<LinearNorm orig="105" norm="1"/>
</NormContinuous>
</DerivedField>
<DerivedField name="male">
<NormDiscrete field="marital status" value="m"/>
</DerivedField>
<DerivedField name="female">
<NormDiscrete field="marital status" value="f"/>
</DerivedField>
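
How a consumer might evaluate these two transformations is sketched below. `NormContinuous` is treated as piecewise linear interpolation through the `LinearNorm` points, and `NormDiscrete` as a 0/1 indicator. Both helper functions are hypothetical, and extending the first or last segment for inputs outside the listed points is an assumption of this sketch.

```python
def norm_continuous(x, points):
    """Piecewise linear interpolation through (orig, norm) pairs,
    which must be sorted by orig. Inputs outside the listed range
    are extrapolated along the nearest segment (an assumption)."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x <= x1:
            break  # x falls in (or before) this segment
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)


def norm_discrete(value, target):
    """Indicator field: 1.0 when the input equals the target value."""
    return 1.0 if value == target else 0.0


# The LinearNorm points of the normalAge field above
AGE_POINTS = [(45, 0.0), (82, 0.5), (105, 1.0)]

normal_age = norm_continuous(63.5, AGE_POINTS)  # halfway along segment 1
is_m = norm_discrete("m", "m")
is_f = norm_discrete("m", "f")
```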

Transformation Dictionary (4)
<DerivedField name="binnedProfit">
<Discretize field="Profit">
<DiscretizeBin binValue="negative">
<Interval closure="openOpen" rightMargin="0" />
</DiscretizeBin>
<DiscretizeBin binValue="positive">
<Interval closure="closedOpen" leftMargin="0" />
</DiscretizeBin>
</Discretize>
</DerivedField>
<DerivedField name="houseType">
<MapValues outputColumn="longForm">
<FieldColumnPair field="Type" column="shortForm"/>
<InlineTable><Extension>
<row><shortForm>BU</shortForm><longForm>bungalow</longForm> </row>
<row><shortForm>HO</shortForm><longForm>house</longForm> </row>
<row><shortForm>CO</shortForm><longForm>cottage</longForm> </row>
</Extension></InlineTable>
</MapValues>
</DerivedField>
<DerivedField name="itemsBought">
<Aggregate field="item" function="multiset" groupField="transaction"/>
</DerivedField>
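
The `Discretize` example above depends on the interval closure semantics, so a small sketch of how the bins could be evaluated may help. The `in_interval` and `discretize` helpers are hypothetical; a missing margin is taken to mean an unbounded side, which is an assumption of this sketch.

```python
import math


def in_interval(x, closure, left=-math.inf, right=math.inf):
    """Test x against an interval whose closure is one of
    openOpen, openClosed, closedOpen, closedClosed."""
    left_ok = x >= left if closure.startswith("closed") else x > left
    right_ok = x <= right if closure.endswith("Closed") else x < right
    return left_ok and right_ok


# The two bins of binnedProfit: (label, closure, leftMargin, rightMargin)
PROFIT_BINS = [
    ("negative", "openOpen", -math.inf, 0),
    ("positive", "closedOpen", 0, math.inf),
]


def discretize(x, bins):
    """Return the label of the first bin containing x, or None."""
    for label, closure, left, right in bins:
        if in_interval(x, closure, left, right):
            return label
    return None


labels = [discretize(v, PROFIT_BINS) for v in (-3.5, 0, 12)]
```

With these bins a profit of exactly 0 lands in "positive", because the second interval is closed on its left side.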

The PMML Document
[Diagram: a PMML document holds a Data Dictionary, a Transformation Dictionary, and Model1, Model2, … Modelk; each model carries its own Mining Schema and Model Statistics]

Mining Schema

Elements

MiningField

Attributes

Name

usageType: active/ predicted/ supplementary

Outliers: asIs/ asMissingValue/ asExtremeValues

lowValue

highValue

missingValueReplacement

missingValueTreatment: asIs/ asMean/ asMode/
asMedian/ asValue

MiningSchema
<?xml version="1.0" ?>
<PMML version="1.0" >
<Header … />
<DataDictionary … />
<SequenceModel functionName="sequences" algorithmName="Capri2"
minimumSupport="24.17" minimumConfidence="0.00"
numberOfItems="5" numberOfSets="5" numberOfSequences="11"
numberOfRules="3">
<Extension name="orderby" value="none"/>
<MiningSchema >
<MiningField name="Price" usageType="predicted" />
<MiningField name="location" usageType="active" />
<MiningField name="bedrooms" usageType="active" />
<MiningField name="houseType" usageType="active" />
<MiningField name="Area" usageType="supplementary" />
</MiningSchema >
… … …
</SequenceModel >
</PMML>

Model Statistics

Elements

UnivariateStatistics

Attributes

Field

Elements

Discrete Statistics

Continuous Statistics

Counts: Valid, Invalid and Missing counts

NumericInfo: min/ max/ mean/ standard
deviation/ median/ interQuartileDistance

Supported Data Mining
Models
Tree Model
Neural Networks
Clustering Model
Regression Model
General Regression Model
Naïve Bayes Model
Association Rules
Sequence Rule Model

Sequence Model

Represents the output of Sequence Rule Mining

Attributes

modelName

functionName

algorithmName

numberOfTransactions

minimumSupport

minimumConfidence

lengthLimit

…..
Elements

SequenceRule
Elements: AntecedentSequence (SequenceReference), Delimiter, ConsequentSequence (SequenceReference)

Sequence
Elements: SetReference, Delimiter

SetPredicate
Elements: Array

<SequenceModel functionName="sequences" numberOfTransactions="100"
minimumSupport="0.20" minimumConfidence="0.25" numberOfItems="6" numberOfSets="5"
numberOfSequences="3" numberOfRules="1"> <MiningSchema> ……… </MiningSchema>
<SetPredicate id="sp001" field="transaction" operator="supersetOf">
<Array n="1" type="string"> index.html </Array> </SetPredicate>
<SetPredicate id="sp002" field="transaction" operator="supersetOf">
<Array n="2" type="string"> offer.html kdnuggets.com </Array> </SetPredicate>
<SetPredicate id="sp003" field="transaction" operator="supersetOf">
<Array n="1" type="string"> products.html </Array> </SetPredicate>
<SetPredicate id="sp004" field="transaction" operator="supersetOf">
<Array n="1" type="string"> basket.html </Array> </SetPredicate>
<SetPredicate id="sp005" field="transaction" operator="supersetOf">
<Array n="1" type="string"> checkout.html </Array> </SetPredicate>
<Sequence id="seq001" numberOfSets="1" occurrence="80" support="0.80">
<SetReference setId="sp001"/> </Sequence>
<Sequence id="seq002" numberOfSets="4" occurrence="40" support="0.40">
<SetReference setId="sp002"/><Delimiter delimiter="acrossTimeWindows" gap="false"/>
<SetReference setId="sp003"/><Delimiter delimiter="sameTimeWindow" gap="true"/>
<SetReference setId="sp004"/><Delimiter delimiter="sameTimeWindow" gap="false"/>
<SetReference setId="sp005"/> </Sequence>
<SequenceRule id="rule001" numberOfSets="5" occurrence="20" support="0.20"
confidence="0.25">
<AntecedentSequence><SequenceReference seqId="seq001"/></AntecedentSequence>
<Delimiter delimiter="sameTimeWindow" gap="unknown"/>
<ConsequentSequence><SequenceReference seqId="seq002"/></ConsequentSequence>
</SequenceRule>
</SequenceModel>
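The support and confidence values in this example can be cross-checked from the occurrence counts. The sketch below assumes the usual definitions: support is the rule's occurrence divided by numberOfTransactions, and confidence is the rule's occurrence divided by the occurrence of its antecedent sequence. The fragment is a trimmed copy of the example that keeps only the attributes needed for the check, and the function name `check_rules` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example: set predicates, set references and the
# MiningSchema are left out because the check does not need them.
SEQUENCE_MODEL = """
<SequenceModel functionName="sequences" numberOfTransactions="100"
    minimumSupport="0.20" minimumConfidence="0.25">
  <Sequence id="seq001" numberOfSets="1" occurrence="80" support="0.80"/>
  <Sequence id="seq002" numberOfSets="4" occurrence="40" support="0.40"/>
  <SequenceRule id="rule001" numberOfSets="5" occurrence="20" support="0.20"
      confidence="0.25">
    <AntecedentSequence><SequenceReference seqId="seq001"/></AntecedentSequence>
    <Delimiter delimiter="sameTimeWindow" gap="unknown"/>
    <ConsequentSequence><SequenceReference seqId="seq002"/></ConsequentSequence>
  </SequenceRule>
</SequenceModel>
"""

def check_rules(model_xml):
    """Recompute (rule id, support, confidence) from the occurrence counts."""
    model = ET.fromstring(model_xml)
    transactions = int(model.get("numberOfTransactions"))
    occurrences = {seq.get("id"): int(seq.get("occurrence"))
                   for seq in model.findall("Sequence")}
    results = []
    for rule in model.findall("SequenceRule"):
        antecedent = rule.find("AntecedentSequence/SequenceReference").get("seqId")
        rule_occurrence = int(rule.get("occurrence"))
        support = rule_occurrence / transactions
        confidence = rule_occurrence / occurrences[antecedent]
        results.append((rule.get("id"), support, confidence))
    return results

print(check_rules(SEQUENCE_MODEL))
```

For rule001 this gives 20/100 = 0.20 and 20/80 = 0.25, matching the support and confidence attributes stored in the model.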

PMML Consumers

Post-Processing

Visualization

Verification and Evaluation

Deployment

Hybrids and Meta-Learning

PEAR: Post-Processing Association Rules

Sets of association rules are browsed like web pages

PMML-formatted association rules can be uploaded

Jorge et al., 2002

VizWiz - PMML Visualization

Java Applet

Some non-standard extensions required for best visualization

Wettschereck, 2003

Reads, visualizes and writes PMML files

Coupling with WEKA in progress

ROCOn – Visualizing ROC graphs

Use Receiver Operating Characteristic (ROC) graphs to compare and evaluate models

Java Applet

Understands PMML as an extension to VizWiz

Farrand and Flach
(http://www.cs.bris.ac.uk/%7Efarrand/rocon/index.html)
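The kind of comparison a ROC graph supports can be sketched in a few lines. This is not ROCOn's implementation, only a generic illustration: each model's scores are turned into ROC points (false positive rate, true positive rate) by sweeping the decision threshold, and the area under the curve (AUC) gives one number for comparing models. The labels, scores and function names are invented for the example.

```python
def roc_points(labels, scores):
    """ROC points for binary labels (1 = positive) from highest to lowest threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Examples with equal scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0]
model_a = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]  # ranks one negative above a positive
model_b = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]  # ranks every positive above every negative

print("AUC model A:", auc(roc_points(labels, model_a)))
print("AUC model B:", auc(roc_points(labels, model_b)))
```

A model whose curve lies closer to the top-left corner, and therefore has the larger AUC, ranks positives above negatives more consistently.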

Summary

Standards help to streamline efforts

Sign of maturity in field of KD

From “Art” to “Engineering”

Standards are still incomplete, but:
Use what is available!

More tools utilizing standards are needed

References

Grossman, R.L., Hornick, M.F., Meyer, G. (2002). Data Mining Standards Initiatives. Communications of the ACM, Vol. 45:8. See also http://www.dmg.org

Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C. and Wirth, R. (2000). CRISP-DM 1.0: Step-by-step data mining guide. CRISP-DM consortium, http://www.crisp-dm.org

Clifton, C., Thuraisingham, B. (2001). Emerging standards for data mining. Computer Standards & Interfaces, Vol. 23, pp. 187–193.

Compare and Contrast JOLAP and XML for Analysis. http://www.essbase.com/resource_library/articles/jolap_xmla.cfm

JCX http://www.jcp.org/en/jsr/detail?id=016

JOLAP http://www.jcp.org/en/jsr/detail?id=69

Jorge, A., Poças, J. and Azevedo, P. (2002). Post-processing operators for browsing large sets of association rules. Proc. Discovery Science 02 (eds. Lange, S., Satoh, K. and Smith, C. H.), Lübeck, Germany, LNCS 2534, Springer-Verlag.

Farrand, J. and Flach, P. (2003). ROCOn: a tool for visualising ROC graphs. See: http://www.cs.bris.ac.uk/%7Efarrand/rocon/index.html

Melton, J. and Eisenberg, A. SQL Multimedia and Application Packages (SQL/MM). http://www.acm.org/sigmod/record/issues/0112/standards.pdf

OMG Common Warehouse MetaModel http://www.omg.org/cwm/

SOAP http://www.w3.org/TR/SOAP/

Tang, Z., Kim, P. Building Data Mining Solutions with SQL Server 2000. http://www.dmreview.com/whitepaper/wid292.pdf

Wettschereck, D., Jorge, A., Moyle, S. (to appear). Data Mining and Decision Support Integration through the Predictive Model Markup Language Standard and Visualization. In Mladenic, D., Lavrac, N., Bohanec, M., Moyle, S. (editors): Data Mining and Decision Support: Integration and Collaboration, Kluwer Publishers.

XMLA http://www.xmla.org/
