Data Mining Functionalities
Data characterization - summarization of the general characteristics or features of a target class of data.
Data discrimination - comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes.
Classification - the class label is known (supervised learning).
Numeric prediction - predicts the future state of the data.
Clustering - the class label is unknown (unsupervised learning).
Association rules - identify relationships among data.
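Classification
Training set
[Figure: decision tree. The root node tests age? with branches <=30, 31..40, and >40; the internal nodes test student? (branches no, yes) and credit rating? (branches fair, excellent); the leaves are labeled yes or no.]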
Numeric Prediction
It models continuous-valued functions.
It predicts unknown or missing values.
Example - linear regression: Y = a + bX, where X is the explanatory variable, Y is the dependent variable, a is the intercept, and b is the slope.
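
To make Y = a + bX concrete, the sketch below estimates a and b from a handful of points and then predicts a new value. The slide does not say how the coefficients are obtained; ordinary least squares is assumed here, and the function names `fit_line` and `predict` as well as the sample points are hypothetical.

```python
def fit_line(xs, ys):
    """Estimate (a, b) in Y = a + bX by ordinary least squares (assumed method)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of X around its mean, and co-variation of X with Y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept
    return a, b


def predict(a, b, x):
    """Predict the dependent variable Y for an explanatory value x."""
    return a + b * x


# Made-up points that lie exactly on Y = 1 + 2X.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a, b = fit_line(xs, ys)
y_new = predict(a, b, 5.0)
```

With these illustrative points the fit recovers a = 1 and b = 2, so the prediction for X = 5 is 11. Real data will not sit exactly on a line, and the fitted line then minimises the squared prediction error instead.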
Clustering
Association Mining
Example rule: {Milk, Diaper} -> {Beer}
[Table: transactions, with columns TID and Item]
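
One common way to judge a rule such as {Milk, Diaper} -> {Beer} is by its support and confidence. The slide names neither measure, so treat them as an assumption here. The transactions below are illustrative only and are not the rows of the slide's table; the helper names `support` and `confidence` are hypothetical.

```python
# Illustrative market-basket data keyed by transaction ID (TID).
transactions = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diaper", "Beer", "Eggs"},
    3: {"Milk", "Diaper", "Beer", "Coke"},
    4: {"Bread", "Milk", "Diaper", "Beer"},
    5: {"Bread", "Milk", "Diaper", "Coke"},
}


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    base = support(antecedent, transactions)
    if base == 0:
        raise ValueError("antecedent never occurs in the transactions")
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base


rule_support = support({"Milk", "Diaper", "Beer"}, transactions)
rule_confidence = confidence({"Milk", "Diaper"}, {"Beer"}, transactions)
```

On this made-up data the rule holds in 2 of 5 transactions (support 0.4), and 2 of the 3 transactions containing Milk and Diaper also contain Beer (confidence about 0.67).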
Issues and Challenges
Human interaction
Overfitting
Outliers
Interpretation of results
High dimensionality
Missing data
Irrelevant data
Noisy data
Efficiency and scalability of the algorithms
Application Areas
Retail Industry
Retailers collect huge amounts of data on sales, customer shopping history, transportation, customer service, etc.
Many stores have websites where customers can buy online; some exist only online (e.g., Amazon).
Data mining helps to:
  Identify customer buying behaviours
  Discover customer shopping patterns and trends
  Improve the quality of customer service
  Reduce the cost of business
Application Areas
Banking Area
Data mining is widely used for risk management in the banking industry.
Bank executives need to know whether the customers they are dealing with are reliable.
Offering new credit cards and approving loans can be risky decisions for banks if they know nothing about their customers.
Another use of data mining in the banking industry is fraud detection.
Types of Data
Interval-scaled variable - measured along a linear scale. Examples - measurements of weight, height, temperature.
Binary variable - two states: 0 or 1, where 0 means that the variable is absent, and 1 means that it is present. Example - disease_test.
Categorical variable - generalization of the binary variable in that it can take on more than two states. Example - class_section.
Ordinal variable - same as categorical variable, except that the M states of the ordinal variable are ordered in a meaningful sequence. Example - designation.
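
The four variable types can be told apart by how they are typically encoded before mining. The sketch below is one possible encoding, not a prescribed one: one-hot vectors for categorical values and ranks 1..M for ordinal values are assumptions. So are the concrete states (sections A to C, three designations) and the helper names.

```python
# Hypothetical state lists; the ordinal list runs from lowest to highest.
SECTIONS = ["A", "B", "C"]
DESIGNATIONS = ["assistant", "associate", "professor"]


def encode_binary(present):
    """Binary variable: 1 if present, 0 if absent."""
    return 1 if present else 0


def encode_categorical(value, states):
    """Categorical variable: one-hot vector, since the states have no order."""
    if value not in states:
        raise ValueError(f"unknown state: {value!r}")
    return [1 if state == value else 0 for state in states]


def encode_ordinal(value, ordered_states):
    """Ordinal variable: rank in 1..M, preserving the meaningful order."""
    return ordered_states.index(value) + 1


record = {
    "height_cm": 172.5,          # interval-scaled: used as measured
    "disease_test": True,        # binary
    "class_section": "B",        # categorical
    "designation": "associate",  # ordinal
}

encoded = {
    "height_cm": record["height_cm"],
    "disease_test": encode_binary(record["disease_test"]),
    "class_section": encode_categorical(record["class_section"], SECTIONS),
    "designation": encode_ordinal(record["designation"], DESIGNATIONS),
}
```

The ordinal rank keeps the ordering of designations, whereas the one-hot vector deliberately carries no ordering between sections.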