Machine Learning
Machine Learning definition
• Arthur Samuel (1959). Machine Learning: field of study that gives computers the ability to learn without being explicitly programmed.
• Tom Mitchell (1998). Well-posed Learning Problem: a computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
Naveen Aggarwal
Machine Learning Example
“A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”
Suppose your email program watches which emails you do or do not mark as spam, and based on that learns how to better filter spam. What is the task T in this setting?
• Classifying emails as spam or not spam.
• Watching you label emails as spam or not spam.
• The number (or fraction) of emails correctly classified as spam/not spam.
• None of the above: this is not a machine learning problem.
Machine Learning
• Grew out of work in AI
• New capability for computers
Examples:
• Database mining: large datasets from the growth of automation and the web. E.g., web click data, medical records, biology, engineering.
• Applications we can’t program by hand. E.g., autonomous helicopter, handwriting recognition, most of Natural Language Processing (NLP), computer vision.
Supervised Learning Problem
A pair (x^(i), y^(i)) is called a training example, and the dataset that we’ll be using to learn, a list of m training examples {(x^(i), y^(i)); i = 1, ..., m}, is called a training set. The superscript (i) is an index into the training set, not an exponent.
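As a minimal sketch of this notation, a training set can be held as a list of (x, y) pairs. The numbers below are made up for illustration (living area in square feet, price in thousands of dollars), and the variable names are assumptions, not part of the slides.

```python
# Hypothetical toy data: (living area in square feet, price in thousands of dollars).
training_set = [
    (2104, 400),
    (1600, 330),
    (2400, 369),
    (1416, 232),
]

m = len(training_set)  # number of training examples

# The slides index examples from 1 to m; Python lists index from 0,
# so enumerate(..., start=1) is used to match the slide notation.
for i, (x_i, y_i) in enumerate(training_set, start=1):
    print(f"(x^({i}), y^({i})) = ({x_i}, {y_i})")
```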
Classification: A Two-Step Process
• Model construction: describing a set of predetermined classes
  • Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  • The set of tuples used for model construction is the training set
  • The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: classifying future or unknown objects
  • Estimate the accuracy of the model
  • The known label of each test sample is compared with the classified result from the model
  • Accuracy rate is the percentage of test set samples that are correctly classified by the model
  • The test set must be independent of the training set; otherwise the accuracy estimate is overly optimistic and over-fitting goes undetected
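The two steps can be sketched in plain Python. The slides do not prescribe a particular model, so this example assumes a nearest-centroid classifier purely for illustration; the function names (`build_model`, `classify`, `accuracy`) and the toy data are hypothetical.

```python
from collections import defaultdict

def build_model(training_set):
    """Step 1 (model construction): compute one centroid per class label."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # label -> [sum_x1, sum_x2, count]
    for (x1, x2), label in training_set:
        entry = sums[label]
        entry[0] += x1
        entry[1] += x2
        entry[2] += 1
    return {label: (s[0] / s[2], s[1] / s[2]) for label, s in sums.items()}

def classify(model, point):
    """Step 2 (model usage): assign the label of the nearest centroid."""
    x1, x2 = point
    return min(
        model,
        key=lambda label: (x1 - model[label][0]) ** 2 + (x2 - model[label][1]) ** 2,
    )

def accuracy(model, test_set):
    """Fraction of test samples whose known label matches the model's output."""
    correct = sum(1 for point, label in test_set if classify(model, point) == label)
    return correct / len(test_set)

# Made-up 2-D samples; the test set shares no samples with the training set.
training_set = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
                ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B")]
test_set = [((1.2, 1.4), "A"), ((6.8, 6.1), "B"),
            ((2.2, 2.0), "A"), ((5.5, 6.5), "B")]

model = build_model(training_set)
test_accuracy = accuracy(model, test_set)
print(f"accuracy on the held-out test set: {test_accuracy:.2f}")
```

Measuring accuracy on the training set instead would reuse the samples the centroids were computed from, which is why the test set is kept separate.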
Classification Process (1): Model
Classification Process (2): Use the
Regression vs Classification
When the target variable that we’re trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When y can take on only a small number of discrete values (such as if, given the living area, we wanted to predict if a dwelling is a house or an apartment, say), we call it a classification problem.
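The difference shows up in the type of the prediction. The sketch below assumes made-up data (living area in square meters, price in thousands) and two deliberately simple models chosen for illustration: a least-squares line for the regression case and a threshold halfway between the class means for the classification case. The function names are hypothetical.

```python
def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fit_threshold(xs, labels, positive):
    """Classification: threshold at the midpoint between the two class means."""
    pos = [x for x, label in zip(xs, labels) if label == positive]
    neg = [x for x, label in zip(xs, labels) if label != positive]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

# Hypothetical data, invented for this example.
areas = [50, 70, 90, 120, 150, 200]
prices = [110, 150, 190, 250, 310, 410]  # continuous target
kinds = ["apartment", "apartment", "apartment", "house", "house", "house"]  # discrete target

slope, intercept = fit_line(areas, prices)
predicted_price = slope * 100 + intercept  # a real number

threshold = fit_threshold(areas, kinds, positive="house")
predicted_kind = "house" if 100 > threshold else "apartment"  # one of two labels

print(predicted_price, predicted_kind)
```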
Regression vs Classification
You’re running a company, and you want to develop learning algorithms to address each of two problems.
Problem 1: You have a large inventory of identical items. You want to predict how many of these items will sell over the next 3 months.
Problem 2: You’d like software to examine individual customer accounts, and for each account decide if it has been hacked/compromised.
Should you treat these as classification or as regression problems?
• Treat both as classification problems.
• Treat problem 1 as a classification problem, problem 2 as a regression problem.
• Treat problem 1 as a regression problem, problem 2 as a classification problem.
• Treat both as regression problems.
Classification vs. Prediction
• Classification: predicts categorical class labels. It constructs a model from the training set and the values (class labels) of a classifying attribute, and uses that model to classify new data.
• Prediction: models continuous-valued functions, i.e., predicts unknown or missing values.
Typical applications:
• credit approval
• target marketing
• medical diagnosis
• treatment effectiveness analysis
Unsupervised Learning
Unsupervised Learning Example
Unsupervised Learning Example
Of the following examples, which would you address using an unsupervised learning algorithm?
• Given email labeled as spam/not spam, learn a spam filter.
• Given a set of news articles found on the web, group them into sets of articles about the same story.
• Given a database of customer data, automatically discover market segments and group customers into different market segments.
• Given a dataset of patients diagnosed as either having diabetes or not, learn to classify new patients as having diabetes or not.
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
  • Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  • New data is classified based on the training set
• Unsupervised learning (clustering)
  • The class labels of the training data are unknown
  • Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
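To make the unsupervised case concrete, here is a minimal sketch of clustering unlabeled 1-D measurements. The slides do not name an algorithm; k-means (Lloyd's algorithm) is assumed here only as a familiar example, and the data, starting centroids, and function name are hypothetical.

```python
def k_means(points, centroids, iterations=10):
    """Lloyd's algorithm on 1-D points, starting from the given centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[j]
            for j, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Unlabeled, made-up measurements: note there is no y value anywhere.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]

# Assumed starting centroids: the smallest and largest observations.
centroids, clusters = k_means(points, centroids=[min(points), max(points)])
print(centroids)
print(clusters)
```

A supervised learner would receive a label with every point; here the two groups are discovered from the measurements alone.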
Issues regarding classification and prediction (1): Data Preparation
• Data cleaning: preprocess data in order to reduce noise and handle missing values
• Relevance analysis (feature selection): remove irrelevant or redundant attributes
• Data transformation: generalize and/or normalize data
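Two of these preparation steps can be sketched briefly. The choices below are assumptions made for illustration, not methods named in the slides: missing values (represented as `None`) are filled with the mean of the known values, and normalization is min-max scaling to the range [0, 1]. The function names and sample values are hypothetical.

```python
def fill_missing(values):
    """Data cleaning: replace None with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Data transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Made-up attribute column with one missing entry.
raw = [10.0, None, 30.0, 20.0]

cleaned = fill_missing(raw)
normalized = min_max_normalize(cleaned)
print(cleaned)
print(normalized)
```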
Issues regarding classification and prediction (2): Evaluating Classification Methods
• Predictive accuracy
• Speed and scalability: time to construct the model, time to use the model
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability: understanding and insight provided by the model
• Goodness of rules: decision tree size, compactness of classification rules