Introduction to Data Mining
Why use Data Mining?
Lecturer: Abdullahi Ahamad Shehu (M.Sc. Data Science, M.Sc. Computer Science)
Office: Faculty of Computing Extension
Content
- Input
- Output
17 December 2024
Examples of Concepts
- How to decide whether there is an attempt to intrude in the network
- How to determine whether somebody has a specific illness
- How to decide whether there is credit card misuse
- How to conclude whether a contract is good or not
- How to predict computer performance
- How to determine which products people buy together
- Which groupings can be established from a set of examples
Types of Concepts
- Classification: learn to classify unclassified examples from classified ones
  e.g. how to decide whether to give a loan
- Association learning: learn associations between attributes
  e.g. what supermarket products people buy together
- Clustering: group examples together
  e.g. given a set of documents, divide them into groups
- Numeric prediction: the output to be learned is numeric
  e.g. calculate the price of a car
Concept Description

Outlook  Temp  Humidity  Windy  Play?
Sunny    Hot   High      No     No
Sunny    Hot   High      Yes    No
Cloudy   Hot   High      No     Yes
Rainy    Mild  Normal    No     Yes

If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = yes then play = no
If outlook = cloudy then play = yes
If humidity = normal then play = yes
If none of the above rules applies then play = yes

Concept description: the output of our data mining tool
Instances (or Examples)

Outlook  Temp  Humidity  Windy  Play?
Sunny    Hot   High      No     No
Sunny    Hot   High      Yes    No
Cloudy   Hot   High      No     Yes
Rainy    Mild  Normal    No     Yes

Instance: a single example of a concept, described by a set of attributes (features, columns)
Input to a learning algorithm: a set of instances, usually described as a single relation/flat file
Attributes (or Features)

Attribute: describes a specific characteristic of an instance, e.g. age, salary, …
Attributes are often predefined for a set of instances
An instance is described by its attribute values, e.g. 25, 20567, …

Outlook  Temp  Humidity  Windy  Play?
Sunny    Hot   High      No     No
Sunny    Hot   High      Yes    No
Cloudy   Hot   High      No     Yes
Rainy    Mild  Normal    No     Yes
Problems with Attributes
- Not all instances have values for all attributes
  e.g. a patient's family history is unknown
- The existence of an attribute may depend on the value of another attribute
  e.g. the attribute pregnant is conditional on gender = female
- Not all attributes are important
  e.g. a person's nose shape vs. whether to give them a loan
  Feature selection is needed to identify the important ones
Types of Attribute
- Nominal: values are symbolic, e.g. desk, table, bed, wardrobe
  no relation between nominal values
  Boolean attributes are a special case: 0 and 1, or True and False
  also called categorical, enumerated or discrete
- Ordinal: values are ordered, e.g. small, medium, large, x-large
  small < medium < large < x-large
  but the difference between two values is not meaningful
Types of Attribute
- Interval: quantities are ordered and measured in fixed, equal units, e.g. years 2001, 2002, 2003, 2004
  the difference between values is meaningful: 2005 - 2004
  but the sum or product is not: 2005 + 2004
- Ratio: quantities include a natural zero, e.g. money: 0, 10, 100, 1000
  treated as real numbers because all mathematical operations are meaningful
Preparing the Input
Need to obtain a dataset in the 'correct format'
Possible when there is a limited set of finite relations

outlook  temp  humidity  windy  play?
sunny    85    85        no     no
sunny    80    90        yes    no
cloudy   83    86        no     yes
rainy    70    96        no     yes
Preparing the Input
For example, create the data in Excel and save it as a .csv file
Values of attributes:
- Temperature, Humidity: numeric
- Outlook: Sunny, Cloudy, Rainy
- Windy, Play?: Yes, No

outlook  temp  humidity  windy  play?
sunny    85    85        no     no
sunny    80    90        yes    no
cloudy   83    86        no     yes
rainy    70    96        no     yes
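A .csv file like the one above can be read into a set of instances with Python's standard csv module. A minimal sketch, using an inline copy of the weather data so it is self-contained (the file name and layout are as described on this slide):

```python
import csv
import io

# Inline stand-in for the weather .csv file saved from Excel.
raw = """outlook,temp,humidity,windy,play
sunny,85,85,no,no
sunny,80,90,yes,no
cloudy,83,86,no,yes
rainy,70,96,no,yes
"""

# Read the flat file into a list of instances (attribute -> value dicts),
# converting the numeric attributes from strings to integers.
instances = []
for row in csv.DictReader(io.StringIO(raw)):
    row["temp"] = int(row["temp"])
    row["humidity"] = int(row["humidity"])
    instances.append(row)

print(len(instances))           # 4 instances
print(instances[0]["outlook"])  # sunny
```

For a real file, replace `io.StringIO(raw)` with `open("weather.csv", newline="")`.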
Problems Preparing the Input
Data may come from different sources, e.g. different departments within a company
Variation in record keeping:
- Style
- Data aggregation (hourly, weekly, monthly, etc.)
- Synonyms
- Errors
Data must be assembled, integrated, aggregated and cleaned
Preparing Data
- Wrangling: transforming data into another format to make it more suitable and valuable for a task
- Cleansing (cleaning): detecting and correcting errors in the data
- Scraping: automatic extraction of data from a data source
- Integration: combining data from several disparate sources into a (useful) dataset
Missing Data
Missing data may be unknown, unrecorded or irrelevant
Causes:
- Equipment faults
- Difficult to acquire (e.g. age, income)
- Measurement is not possible
The fact that a value is missing may itself be informative, e.g. a missing test in a medical examination
BUT this is NOT usually the case
Represented in R as NA
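In a raw file, missing values often appear as empty fields or as the marker "NA" (as in R). A small sketch, on hypothetical data, of mapping those markers to Python's None and counting them per attribute before deciding how to handle them:

```python
import csv
import io

# Hypothetical file: one age is recorded as NA, one income field is empty.
raw = """age,income
25,20567
NA,31000
41,
"""

MISSING = {"", "NA"}  # markers treated as "value not available"

instances = []
for row in csv.DictReader(io.StringIO(raw)):
    instances.append({k: (None if v in MISSING else v) for k, v in row.items()})

# How many values are missing for each attribute?
missing_counts = {a: sum(r[a] is None for r in instances)
                  for a in ("age", "income")}
print(missing_counts)  # {'age': 1, 'income': 1}
```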
Inaccurate Values
- Errors and omissions that do not affect the original purpose of the data collection
  e.g. the age of bank customers is not important
  e.g. customer IDs are not important
- Typographical errors in nominal attributes, e.g. Pepsi vs Pepsi-cola
- Deliberate errors, e.g. people may lie about their mental health history
- Duplicates: ML algorithms are very sensitive to these
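Exact duplicate instances, the last problem above, are easy to detect once each instance is represented as a tuple of attribute values. A minimal sketch on toy data:

```python
# Instances as (outlook, temp, humidity, windy) tuples; the third is an
# exact duplicate of the first.  Many learning algorithms are sensitive
# to duplicated training data, so it is worth removing them up front.
instances = [
    ("sunny", 85, 85, "no"),
    ("sunny", 80, 90, "yes"),
    ("sunny", 85, 85, "no"),
]

seen, deduplicated = set(), []
for inst in instances:
    if inst not in seen:          # keep only the first occurrence
        seen.add(inst)
        deduplicated.append(inst)

print(len(deduplicated))  # 2
```

Near-duplicates (e.g. "Pepsi" vs "Pepsi-cola") need fuzzier matching and are not caught by this exact comparison.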
Summary
Preparing data for input is difficult and demanding:
- data may need assembling, integrating, aggregating and cleaning
- if the data set is huge, a sample may be used
- we need a relation, attributes and data (or instances)
Various types of data may be used; nominal and numeric are the most common
Data Mining: Output
- Output: requirements
- Types of output: tables, rules, trees, instances, clusters
- Summary
Understanding the Output
The output must be easy to understand; the representation of the output is key
The representation is NOT independent of the learning process:
- the learning algorithm used determines the representation of the output
- it depends on the type of algorithm
Representations
- Decision tables
- Classification rules
- Association rules
- Decision trees
- Regression
- Trees for numeric prediction
- Instance-based representation
- Clusters
Decision Tables
Use the same format as the input, but with only selected attributes
Challenge: choosing the selected attributes
Decisions: if outlook = sunny and humidity = high then play = no, etc.

outlook  humidity  play
sunny    high      no
sunny    normal    yes
cloudy   high      yes
…
Classification Rules
IF conditions THEN conclusion
if outlook = sunny and humidity > 83 then play = no
Conditions: tests that have to be true for the rule to apply; usually several conditions connected by ANDs
Conclusion: the solution to the problem; a class, a set of classes or a probability distribution
Classification Rules: Problems
Rules may contradict each other: two applicable rules may give different classifications
A Decision List is an ordered set of rules:
- the first satisfied rule is applied
- a rule is only applied if the preceding ones are not applicable
- no contradictions in classification
Rules may fail to classify an instance:
- a Decision List may have a final classification rule with no conditions
- no classification failures

If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
...
If none of the above then play = yes
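The decision list above maps directly onto ordered conditionals: the first rule whose conditions hold determines the class, and the final rule has no conditions, so there are no classification failures. A minimal sketch:

```python
def classify(instance):
    """Apply the decision list from the slide, in order."""
    if instance["outlook"] == "sunny" and instance["humidity"] == "high":
        return "no"
    if instance["outlook"] == "rainy" and instance["windy"] == "yes":
        return "no"
    return "yes"  # default rule with no conditions: always applicable

print(classify({"outlook": "sunny", "humidity": "high", "windy": "no"}))   # no
print(classify({"outlook": "cloudy", "humidity": "high", "windy": "no"}))  # yes
print(classify({"outlook": "rainy", "humidity": "normal", "windy": "yes"}))  # no
```

Because each `if` returns immediately, a later rule can never contradict an earlier one.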
Association Rules
Like classification rules BUT used to infer the value of any attribute (not just the class), or a combination of attributes
NOT intended to be used together as a set: different association rules determine different things
Problem: many different association rules can be derived from a small dataset
Restrict to associations with:
- high support: the number of instances the rule predicts correctly
- high confidence (accuracy): the proportion of instances the rule predicts correctly, out of all instances it applies to
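Support and confidence as defined above can be computed by two counts over the dataset. A sketch for the hypothetical rule "if beer = yes then nappy = yes" on made-up basket data:

```python
# Toy baskets (not real data): each instance records two nominal attributes.
baskets = [
    {"beer": "yes", "nappy": "yes"},
    {"beer": "yes", "nappy": "yes"},
    {"beer": "yes", "nappy": "no"},
    {"beer": "no",  "nappy": "yes"},
]

# Rule: if beer = yes then nappy = yes
applies = sum(b["beer"] == "yes" for b in baskets)                        # rule applicable
support = sum(b["beer"] == "yes" and b["nappy"] == "yes" for b in baskets)  # predicted correctly
confidence = support / applies

print(support)     # 2
print(confidence)  # 0.666...
```

Mining would enumerate many candidate rules and keep only those whose support and confidence exceed chosen thresholds.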
Association Rules
Example association rules:
If beer = yes and crisps = no then nappy = yes
If beer = yes then nappy = yes and bread = no
Association Rules: Examples
The rule
  If windy = false and play = no then outlook = sunny and humidity = high
is different from the pair of rules
  If windy = false and play = no then outlook = sunny
  If windy = false and play = no then humidity = high
due to coverage and accuracy: the combined rule is correct only when both conclusions hold, so its support and confidence can differ from those of the individual rules
Decision Trees
Nodes represent attributes
Each branch from a node usually represents a single value of that attribute, but it can:
- compare two values of the attribute
- use a function of one or more attributes
Each leaf node contains the answer to the problem: a class, a set of classes or a probability distribution
To solve a problem, a new instance is routed down the tree to find the solution
Decision Nodes
Nominal attribute:
- the number of branches out of a node equals the number of attribute values
- the attribute is tested at most once on a path
Numeric attribute:
- the attribute value is compared (> or <) to a constant
- or a three-way split may be used (i.e. 3 branches): <, =, > for integers; below, within, above for reals (test against an interval)
- the attribute may be tested several times on a path
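Routing an instance down a tree of nominal tests can be sketched with nested dicts: an internal node names the attribute it tests and maps each value to a subtree, while a plain string is a leaf holding the class. The tree below mirrors the weather rules from earlier slides:

```python
# {attribute: {value: subtree-or-leaf}}; a bare string is a leaf (the class).
tree = {
    "outlook": {
        "sunny":  {"humidity": {"high": "no", "normal": "yes"}},
        "cloudy": "yes",
        "rainy":  {"windy": {"yes": "no", "no": "yes"}},
    }
}

def route(node, instance):
    """Walk from the root, following the branch matching the instance's value."""
    while isinstance(node, dict):
        attribute = next(iter(node))                 # attribute tested here
        node = node[attribute][instance[attribute]]  # follow matching branch
    return node                                      # leaf: the class

print(route(tree, {"outlook": "sunny", "humidity": "high", "windy": "no"}))   # no
print(route(tree, {"outlook": "cloudy", "humidity": "high", "windy": "no"}))  # yes
```

Numeric tests would replace the value lookup with a comparison against a constant, as described above.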
Decision Tree Example
[Figure: a decision tree over the labour-contract attributes. Internal nodes test 1st year inc (<= 2.5 / > 2.5 and <= 4 / > 4), Statutory holidays (<= 10 / > 10), Hours/week (<= 36 / > 36) and Health plan (none / half / full); leaves are labelled good or bad.]
Converting Trees to Rules
A decision tree can be converted into a set of n rules, where n is the number of leaf nodes:
- one rule for each path from the root to a leaf
- conditions: one per node from root to leaf
- conclusion(s): the class(es) assigned by the leaf
Rules obtained from a decision tree are unambiguous and complete:
- no classification contradictions; the rules are order-independent
- no classification failures
BUT the rules may be unnecessarily complex: rule pruning is required to remove redundant conditions
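The path-enumeration idea above can be sketched on the same nested-dict tree representation used earlier: recursively collect the (attribute, value) conditions along each root-to-leaf path, yielding exactly one rule per leaf.

```python
# Same nested-dict encoding as before: {attribute: {value: subtree-or-leaf}}.
tree = {
    "outlook": {
        "sunny":  {"humidity": {"high": "no", "normal": "yes"}},
        "cloudy": "yes",
        "rainy":  {"windy": {"yes": "no", "no": "yes"}},
    }
}

def tree_to_rules(node, conditions=()):
    """Return (conditions, class) pairs: one rule per root-to-leaf path."""
    if not isinstance(node, dict):         # leaf reached: emit one rule
        return [(conditions, node)]
    attribute = next(iter(node))
    rules = []
    for value, subtree in node[attribute].items():
        rules.extend(tree_to_rules(subtree, conditions + ((attribute, value),)))
    return rules

rules = tree_to_rules(tree)
print(len(rules))  # 5 rules for the tree's 5 leaves
```

This sketch does not prune, so redundant conditions (if any) remain in the extracted rules.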
Trees to Rules: Example
if 1st year inc <= 2.5 then bad
if 1st year inc > 2.5 and statutory holidays > 10 then good
if 1st year inc > 2.5 and statutory holidays <= 10 and 1st year inc <= 4 then bad
if 1st year inc > 2.5 and statutory holidays <= 10 and 1st year inc > 4 then good
[Figure: the corresponding decision tree, testing 1st year inc (<= 2.5 / > 2.5), then Statutory holidays (> 10 / <= 10), then 1st year inc again (<= 4 / > 4), with leaves labelled bad or good.]
Rules to Trees: Example
If a and b then x
If c and d then x
[Figure: the equivalent decision tree, with y/n branches on a, b, c and d. Testing a first forces the subtree for "c and d" to be replicated under both the a = y, b = n branch and the a = n branch; leaves are x or no classification.]
Trees for Numeric Prediction
Predicting a numeric value, not a class
Regression: computes an expression which calculates a numeric value
  PRP = -55.9 + 0.0489 cycle time + 0.0153 min memory + 0.0056 max memory + 0.641 cache - 0.27 min channels + 1.48 max channels
Regression tree: a decision tree where each leaf predicts a numeric value, the average of the training instances that reach the leaf
Model tree: a regression tree with a linear regression model at each leaf
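The "leaf predicts the average of the training instances that reach it" idea can be shown with a one-split toy regression tree. The split point and the (mmax, PRP) pairs below are made up for illustration:

```python
# Hypothetical training data: (mmax, PRP) pairs.
training = [(800, 20.0), (900, 25.0), (2000, 90.0), (3000, 110.0)]

SPLIT = 1000  # hypothetical split threshold on mmax at the single decision node

# Each leaf holds the target values of the training instances that reach it.
left  = [prp for mmax, prp in training if mmax <= SPLIT]
right = [prp for mmax, prp in training if mmax > SPLIT]

def predict(mmax):
    """Route to a leaf, then return the average target value at that leaf."""
    leaf = left if mmax <= SPLIT else right
    return sum(leaf) / len(leaf)

print(predict(850))   # 22.5  (average of the left leaf)
print(predict(2500))  # 100.0 (average of the right leaf)
```

A model tree would replace the per-leaf average with a fitted linear model such as the LM1/LM2 expressions on the next slide.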
Model Tree
Combines linear regression and a regression tree
[Figure: a model tree. Internal nodes test chmin, mmax and cach against numeric thresholds (e.g. chmin <= 8.5 / > 8.5, cach <= 0.5 / (0.5, 8.5], mmax <= 28000 / > 28000); the leaves LM1 to LM6 are linear models.]
LM1: PRP = 8.29 + 0.004 mmax + 2.77 chmin
LM2: PRP = 20.3 + 0.004 mmin - 3.99 chmin
etc.
Instance-Based Representation
The simplest form of learning: look for the instance most similar to the new instance
Lazy learning: the work is done at problem-solving time, not at training time
Distance function: a numeric calculation that indicates the similarity between two attribute values
- Numeric attributes: the difference in values
- Nominal attributes: 0 if equal, 1 if not, or a more sophisticated measure (e.g. hue for colours)
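The mixed distance function described above (absolute difference for numeric attributes, 0/1 for nominal ones) plus a nearest-neighbour lookup fits in a few lines. A sketch on made-up instances:

```python
def distance(a, b):
    """Sum per-attribute distances: |difference| for numbers, 0/1 for nominals."""
    total = 0.0
    for attribute in a:
        if isinstance(a[attribute], (int, float)):
            total += abs(a[attribute] - b[attribute])
        else:
            total += 0 if a[attribute] == b[attribute] else 1
    return total

def nearest_neighbour(instances, new):
    # Lazy learning: all the work happens here, at problem-solving time.
    return min(instances, key=lambda inst: distance(inst["x"], new))["class"]

training = [
    {"x": {"temp": 85, "outlook": "sunny"}, "class": "no"},
    {"x": {"temp": 70, "outlook": "rainy"}, "class": "yes"},
]
print(nearest_neighbour(training, {"temp": 72, "outlook": "rainy"}))  # yes
```

In practice, numeric attributes are usually normalised first so that one large-scale attribute does not dominate the distance.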
Instance-Based
[Figure: a new problem point plotted among labelled instances. The closest instance to the new problem is "blue"; the 3-nearest-neighbour solution is "yellow".]
Clusters
Represent groups of instances which are similar
Some clustering methods allow overlapping clusters
Summary
There are many different ways of representing the output of a data mining tool:
- Trees: decision trees, regression trees
- Rules: classification, association
- Instances: decision tables, nearest neighbour
- Clusters
The output depends on the learning algorithm and the input