Introduction to Data Mining and Knowledge Discovery
About This Presentation
Knowledge discovery process
Slide Content
Data Preprocessing
Data Cleaning: Handling missing values, noise reduction, outlier detection.
Data Integration: Merging data from multiple sources.
Data Reduction: Dimensionality reduction, aggregation.
Data Transformation: Normalization, scaling, encoding categorical variables.
Discretization: Converting continuous data into discrete intervals.
DATA PREPROCESSING
Raw data must go through a series of steps before it is suitable for mining. This transformation phase, known as data preprocessing, is an essential and often time-consuming stage in the data mining pipeline. Data preprocessing is the method of cleaning and transforming raw data into a structured, usable format ready for subsequent analysis. The real-world data we gather is riddled with imperfections: there may be missing values, redundant information, or inconsistencies that can adversely impact the outcome of data analysis. The following sections cover the methodologies employed to turn raw data into a rich, structured, and actionable asset.
Data Cleaning: An Overview
Data cleaning, sometimes referred to as data cleansing, involves detecting and correcting (or removing) errors and inconsistencies in data to improve its quality. The objective is to ensure data integrity and enhance the accuracy of subsequent data analysis.
Common Issues Addressed in Data Cleaning:
Missing Values: Data can often have gaps. For instance, a dataset of patient records might lack age details for some individuals. Such missing data can skew analysis and lead to incomplete results.
Noisy Data: This refers to random error or variance in a dataset. An example would be a faulty sensor logging erratic temperature readings amidst accurate ones.
Contd.
Outliers: Data points that deviate significantly from other observations can distort results. For example, in a dataset of house prices, an unusually high price due to an erroneous entry can skew the average.
Duplicate Entries: Redundancies can creep in, especially when data is collated from various sources. Duplicate rows or records need to be identified and removed.
Inconsistent Data: This could be due to various reasons, such as different data entry personnel or multiple sources. A date might be entered as "January 15, 2020" in one record and "15/01/2020" in another.
Methods and Techniques for Data Cleaning:
Imputation: Filling missing data based on statistical methods. For example, missing numerical values could be replaced by the mean or median of the entire column.
Noise Filtering: Applying techniques to smooth out noisy data. Time-series data, for example, can be smoothed using moving averages.
Outlier Detection: Using statistical methods or visualization tools to identify and manage outliers. The IQR (interquartile range) method is a popular technique.
De-duplication: Algorithms are used to detect and remove duplicate records. This often involves matching and purging data.
Data Validation: Setting up rules to ensure consistency. For instance, a rule could be that age cannot be more than 150 or less than 0.
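A minimal pandas sketch of these cleaning steps on a small, made-up patient table (the column names and values are illustrative assumptions, not data from the slides):

```python
import pandas as pd

# Hypothetical patient records with a missing age, a duplicate row,
# and an implausible (outlier) age value.
df = pd.DataFrame({
    "patient_id": [1, 2, 2, 3, 4],
    "age": [34, None, None, 290, 41],
    "temp_c": [36.8, 37.1, 37.1, 36.9, 40.2],
})

# De-duplication: drop exact duplicate records.
df = df.drop_duplicates()

# Imputation: fill missing ages with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# Data validation: keep only plausible ages (0 to 150).
df = df[(df["age"] >= 0) & (df["age"] <= 150)].copy()

# Outlier detection with the IQR method on the temperature readings.
q1, q3 = df["temp_c"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["temp_outlier"] = ~df["temp_c"].between(lower, upper)

# Noise filtering: smooth the series with a 3-point moving average.
df["temp_smooth"] = df["temp_c"].rolling(window=3, min_periods=1).mean()

print(df)
```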
Data Integration: Merging Data from Multiple Sources
Data integration is the process of combining data from various sources into a unified format that can be used for analytical, operational, and decision-making purposes.
There are several ways to integrate data:
Data virtualization: Presents data from multiple sources in a single data set in real time without replicating, transforming, or loading the data. Instead, it creates a virtual view that integrates all the data sources and populates a dashboard with data from multiple sources after receiving a query.
Extract, load, transform (ELT): A modern twist on ETL that loads data into a flexible repository, such as a data lake, before transformation. This allows for greater flexibility and easier handling of unstructured data.
Application integration: Allows separate applications to work together by moving and syncing data between them. This can support operational needs, such as ensuring that an HR system has the same data as a finance system.
Here are some examples of data integration:
Facebook Ads and Google Ads to acquire new users
Google Analytics to track events on a website and in a mobile app
A MySQL database to store user information and image metadata
Marketo to send marketing email and nurture leads
DATA INTEGRATION
Contd.
Data integration is the process of combining data from multiple sources into a cohesive and consistent view. This process involves identifying and accessing the different data sources, mapping the data to a common format, and reconciling any inconsistencies or discrepancies between the sources. The goal of data integration is to make it easier to access and analyze data that is spread across multiple systems or platforms, in order to gain a more complete and accurate understanding of the data.
Contd.
Data integration can be challenging due to the variety of data formats, structures, and semantics used by different data sources. Different data sources may use different data types, naming conventions, and schemas, making it difficult to combine the data into a single view. Data integration typically involves a combination of manual and automated processes, including data profiling, data mapping, data transformation, and data reconciliation.
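A small pandas sketch of the mapping and reconciliation steps when merging two hypothetical sources with different naming conventions and date formats (all names and values are illustrative assumptions):

```python
import pandas as pd

# Source A: a CRM export with one naming convention and date format.
crm = pd.DataFrame({
    "CustomerID": [101, 102],
    "SignupDate": ["January 15, 2020", "March 3, 2021"],
})

# Source B: a billing system with a different schema for the same entities.
billing = pd.DataFrame({
    "cust_id": [101, 102],
    "signup_dt": ["15/01/2020", "03/03/2021"],
    "balance": [250.0, 0.0],
})

# Data mapping: rename columns to a common schema.
crm = crm.rename(columns={"CustomerID": "customer_id", "SignupDate": "signup_date"})
billing = billing.rename(columns={"cust_id": "customer_id", "signup_dt": "signup_date"})

# Data reconciliation: parse the inconsistent date formats into one common type.
crm["signup_date"] = pd.to_datetime(crm["signup_date"], format="%B %d, %Y")
billing["signup_date"] = pd.to_datetime(billing["signup_date"], format="%d/%m/%Y")

# Integration: merge the two sources into a single consistent view.
integrated = crm.merge(billing, on=["customer_id", "signup_date"], how="outer")
print(integrated)
```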
Contd.
There are two major approaches to data integration: the "tight coupling" approach and the "loose coupling" approach.
Tight Coupling: This approach involves creating a centralized repository or data warehouse to store the integrated data. The data is extracted from various sources, transformed, and loaded into a data warehouse. Data is integrated in a tightly coupled manner, meaning that it is integrated at a high level, such as at the level of the entire dataset or schema.
Contd.
This approach is also known as data warehousing. It enables data consistency and integrity, but it can be inflexible and difficult to change or update. Here, the data warehouse is treated as an information retrieval component. In this coupling, data is combined from different sources into a single physical location through the process of ETL: Extraction, Transformation, and Loading.
Contd.
Loose Coupling: This approach involves integrating data at a low level, such as at the level of individual data elements or records. Data is integrated in a loosely coupled manner, which allows data to be combined without having to create a central repository or data warehouse. This approach is also known as data federation. It enables flexibility and easy updates, but it can be difficult to maintain consistency and integrity across multiple data sources.
Data Reduction
Data reduction refers to the process of reducing the volume of data while maintaining its informational quality. In practice, an organization sets out to limit the amount of data it stores. Data reduction techniques seek to lessen the redundancy found in the original data set so that large amounts of originally sourced data can be stored more efficiently in reduced form.
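As one illustration of dimensionality reduction, a short scikit-learn sketch using PCA on synthetic data (the dataset and the 95% variance threshold are assumptions made for this example):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 200 samples with 10 correlated numeric features
# generated from 3 underlying factors plus a little noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X = base @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(200, 10))

# Dimensionality reduction: keep enough components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)          # typically (200, 10) -> (200, 3)
print(pca.explained_variance_ratio_.round(3))  # variance captured per component
```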
Data Transformation: While data cleaning focuses on rectifying errors, data transformation is about converting data into a suitable format or structure for analysis. It’s about making the data compatible and ready for the next steps in the data mining process.
Common Data Transformation Techniques:
Normalization: Scaling numeric data to fall within a small, specified range. For example, adjusting variables so they range between 0 and 1.
Standardization: Shifting data to have a mean of zero and a standard deviation of one. This is often done so different variables can be compared on common ground.
Binning: Transforming continuous variables into discrete 'bins'. For instance, age can be categorized into bins like 0-18, 19-35, and so on.
One-hot encoding: Converting categorical data into a binary (0 or 1) format. For example, a color variable with values 'Red', 'Green', 'Blue' can be transformed into three binary columns, one for each color.
Log Transformation: Applied to handle skewed data or data with exponential patterns.
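A brief pandas/NumPy sketch of one-hot encoding and a log transformation on a made-up table (column names and values are illustrative assumptions; normalization, standardization, and binning are sketched in the later sections):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with a categorical color column and a skewed income column.
df = pd.DataFrame({
    "color": ["Red", "Green", "Blue", "Red"],
    "income": [30_000, 45_000, 52_000, 1_200_000],
})

# One-hot encoding: one binary column per color value.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)

# Log transformation: compress the long right tail of the skewed income values.
encoded["income_log"] = np.log1p(encoded["income"])

print(encoded)
```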
Benefits of Data Cleaning and Transformation:
Enhanced Analysis Accuracy: With cleaner data, algorithms work more effectively, leading to more accurate insights.
Reduced Complexity: Removing redundant and irrelevant data reduces dataset size and complexity, making subsequent analysis faster.
Improved Decision Making: Accurate data leads to better insights, which in turn facilitates informed decision-making.
Enhanced Data Integrity: Consistency in data ensures integrity, which is crucial for analytics and reporting.
Data Normalization and Standardization
Data Normalization: Normalization scales all numeric variables into the range between 0 and 1, typically via min-max scaling: x' = (x - min) / (max - min). The goal is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the range of values.
Benefits of Normalization:
Predictability: Ensures that gradient descent (used in many modeling techniques) converges more quickly.
Uniformity: Brings data to a uniform scale, making it easier to compare different features.
Normalization has its drawbacks: it can be influenced heavily by outliers.
Data Normalization and Standardization
Data Standardization: While normalization adjusts features to a specific range, standardization adjusts them to have a mean of 0 and a standard deviation of 1, via the z-score: z = (x - μ) / σ, where μ is the mean and σ the standard deviation. It is also commonly known as z-score normalization.
Benefits of Standardization:
Centering the Data: It centers the data around 0, which can be useful in algorithms that assume zero-centered data, such as Principal Component Analysis (PCA).
Handling Outliers: Standardization is less sensitive to outliers than normalization.
Common Scale: Like normalization, it brings features to a common scale.
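A short scikit-learn sketch contrasting the two scalers on a made-up price column that contains one outlier (the values are illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature column with one large outlier (an erroneous house price).
prices = np.array([[210_000.0], [250_000.0], [265_000.0], [240_000.0], [2_500_000.0]])

# Normalization (min-max): squeezes values into [0, 1]; the outlier
# compresses all ordinary prices toward 0.
normalized = MinMaxScaler().fit_transform(prices)

# Standardization (z-score): rescales to zero mean and unit standard deviation.
standardized = StandardScaler().fit_transform(prices)

print(np.round(normalized.ravel(), 3))
print(np.round(standardized.ravel(), 3))
```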
Discretization
In statistics and machine learning, discretization refers to the process of converting continuous features or variables into discrete or nominal features.
Discretization in data mining refers to converting a range of continuous values into discrete categories. For example, suppose we have an Age attribute with continuous values; discretization maps each value into an interval (bin), and labeling those bins yields the discretized data.
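A minimal pandas sketch of discretizing an Age attribute into labeled intervals; the ages, bin edges, and labels below are illustrative assumptions, since the original slide's value table is not reproduced here:

```python
import pandas as pd

# Hypothetical continuous ages.
ages = pd.Series([3, 15, 22, 31, 44, 58, 67, 80], name="age")

# Discretization: map continuous ages into labeled, discrete intervals.
bins = [0, 12, 19, 35, 60, 120]
labels = ["Child", "Teen", "Young Adult", "Adult", "Senior"]
discretized = pd.cut(ages, bins=bins, labels=labels)

print(pd.DataFrame({"age": ages, "age_group": discretized}))
```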
Data Visualization
Data visualization is the graphical representation of information and data through common graphics such as charts, plots, maps, infographics, and even animations. These visual displays communicate complex data relationships and data-driven insights in a way that is easy to understand. By using visual elements like charts, graphs, maps, and dashboards, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.
Key Concepts in Data Visualization: Types of Visualizations
Bar Charts: Used to compare categories or show changes over time.
Line Charts: Ideal for showing trends over time.
Pie Charts: Good for showing proportions.
Scatter Plots: Useful for observing relationships between variables.
Heatmaps: Display data in matrix form, with values represented by colors.
Histograms: Show the distribution of a dataset.
Box Plots: Provide a summary of data through quartiles and outliers.
Best Practices:
Clarity: Ensure that your visualizations are easy to understand. Avoid clutter.
Accuracy: Represent data truthfully. Avoid misleading scales or distorted visuals.
Context: Include labels, legends, and titles to make your visuals self-explanatory.
Consistency: Use consistent colors, fonts, and styles across your visuals.
Focus: Highlight the key message you want to convey through your visualization.
Tools for Data Visualization:
Tableau: A powerful tool for creating interactive visualizations and dashboards.
Microsoft Power BI: Offers data visualization and business intelligence capabilities.
Google Data Studio: Useful for creating reports and dashboards from various data sources.
Matplotlib and Seaborn (Python libraries): Widely used for creating static, animated, and interactive plots in Python.
D3.js: A JavaScript library for producing dynamic, interactive data visualizations in web browsers.
Excel: A basic tool for creating charts and graphs, suitable for simple data visualization tasks.
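A minimal Matplotlib sketch of two of the chart types listed above (a line chart and a bar chart) on made-up sample data:

```python
import matplotlib.pyplot as plt

# Made-up sample data for illustration.
months = ["Jan", "Feb", "Mar", "Apr", "May"]
sales = [120, 135, 128, 160, 175]
categories = ["A", "B", "C"]
counts = [42, 58, 31]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Line chart: a trend over time.
ax1.plot(months, sales, marker="o")
ax1.set_title("Monthly Sales Trend")
ax1.set_xlabel("Month")
ax1.set_ylabel("Sales")

# Bar chart: comparing categories.
ax2.bar(categories, counts)
ax2.set_title("Counts by Category")
ax2.set_xlabel("Category")
ax2.set_ylabel("Count")

plt.tight_layout()
plt.show()
```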
Data Similarity and Dissimilarity Measures
Euclidean Distance: A measure of the distance (dissimilarity) between two data points in space.
Cosine Similarity: Measures the cosine of the angle between two vectors.
Jaccard Similarity: A measure of similarity between two sets.
Pearson Correlation: A measure of the linear correlation between two variables.
Cosine Similarity
Definition: Measures the cosine of the angle between two vectors in a multi-dimensional space. It is often used in text mining and information retrieval.
Formula: cosine_similarity(A, B) = (A · B) / (||A|| ||B||)
Jaccard Similarity
Definition: Measures the similarity between two sets by comparing the size of their intersection to the size of their union.
Formula: J(A, B) = |A ∩ B| / |A ∪ B|
Euclidean Distance
Definition: Measures the straight-line distance between two points in a multi-dimensional space: d(x, y) = sqrt(Σ (x_i - y_i)²).
Manhattan Distance (L1 Norm)
Definition: Measures the sum of the absolute differences between the coordinates of two points: d(x, y) = Σ |x_i - y_i|.
Minkowski Distance
Definition: A generalization of the Euclidean and Manhattan distances, parameterized by an exponent p: d(x, y) = (Σ |x_i - y_i|^p)^(1/p). With p = 2 it reduces to the Euclidean distance, and with p = 1 to the Manhattan distance.
Hamming Distance
Definition: Measures the number of positions at which two strings of equal length differ. It is often used for binary or categorical data.
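A short NumPy/SciPy sketch computing these measures on small example vectors, sets, and strings (the sample values are illustrative assumptions):

```python
import numpy as np
from scipy.spatial.distance import cityblock, cosine, euclidean, minkowski

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

print("Euclidean   :", euclidean(x, y))          # straight-line distance
print("Manhattan   :", cityblock(x, y))          # sum of absolute differences
print("Minkowski p3:", minkowski(x, y, p=3))     # generalization of the two above
print("Cosine sim  :", 1 - cosine(x, y))         # SciPy returns cosine *distance*
print("Pearson r   :", np.corrcoef(x, y)[0, 1])  # linear correlation

# Jaccard similarity between two sets: |intersection| / |union|.
a, b = {"apple", "banana", "cherry"}, {"banana", "cherry", "date"}
print("Jaccard     :", len(a & b) / len(a | b))

# Hamming distance: number of positions at which two equal-length strings differ.
s, t = "karolin", "kathrin"
print("Hamming     :", sum(c1 != c2 for c1, c2 in zip(s, t)))
```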