SRU_RK_Lecturer1 about data mining concepts

coolscools1231 28 views 22 slides Sep 03, 2024

About This Presentation

Introduction to data mining


Slide Content

Data Mining UNIT-I

Introduction to Data Mining Systems
Data mining, a term often used interchangeably with "knowledge discovery" or "predictive analytics", is the process of analyzing vast sets of data to discover previously unknown, valid patterns, correlations, and relationships. It is not only about finding patterns; it is about translating those patterns into actionable insights.

Data Mining
Definition: Data mining refers to the process of discovering patterns, correlations, trends, or anomalies in large datasets using various techniques and algorithms. Data mining transforms raw data into useful information for decision-making.

Overview of terms
Data: a set of facts (items) D, usually stored in a database.
Pattern: an expression E in a language L that describes a subset of the facts in D.
Attribute: a field in an item i in D.
Interestingness: a function I_{D,L} that maps an expression E in L into a measure space M.
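
The terms above can be made concrete with a small sketch. The toy records, the predicate `pattern`, and the choice of coverage fraction as the interestingness measure are all assumptions made for illustration; the slides do not prescribe a particular language L or measure space M.

```python
# Hypothetical toy data: each fact (item) in D is a dict of attributes.
D = [
    {"age": 25, "buys_laptop": True},
    {"age": 31, "buys_laptop": True},
    {"age": 47, "buys_laptop": False},
    {"age": 52, "buys_laptop": False},
]

def pattern(item):
    """Pattern E: 'age < 40', written as a predicate over one item."""
    return item["age"] < 40

def covered(E, data):
    """The subset of facts in `data` that the pattern E describes."""
    return [item for item in data if E(item)]

def interestingness(E, data):
    """One possible I: the fraction of facts covered by E.

    Here the measure space M is the interval [0, 1]; other choices
    of I and M are equally valid.
    """
    return len(covered(E, data)) / len(data)

subset = covered(pattern, D)
score = interestingness(pattern, D)
```

In this sketch `age` and `buys_laptop` play the role of attributes, and `score` is the value that I assigns to the expression E.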

Knowledge Discovery Process
Data Cleaning: Removing noise and inconsistent data.
Data Integration: Combining data from different sources.
Data Selection: Choosing the data relevant for analysis.
Data Transformation: Converting data into a suitable format or structure.
Data Mining: Applying algorithms to extract patterns.
Pattern Evaluation: Identifying the truly interesting patterns.
Knowledge Presentation: Visualizing and presenting the findings.
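
The seven steps can be sketched as a chain of small functions. This is a minimal illustration under stated assumptions: the two sources (`store_a`, `store_b`), the record layout, and the use of simple frequency counting as the "mining" step are hypothetical stand-ins, not the method of any particular system.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [{"customer": "ann", "item": "milk"}, {"customer": "bob", "item": None}]
store_b = [{"customer": "ann", "item": "bread"}, {"customer": "cy", "item": "Milk "}]

def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine records from several sources."""
    return [r for source in sources for r in source]

def select(records, field):
    """Data selection: keep only the field relevant to the analysis."""
    return [r[field] for r in records]

def transform(values):
    """Data transformation: normalize values to a common format."""
    return [v.strip().lower() for v in values]

def mine(values):
    """Data mining: count how often each value occurs."""
    return Counter(values)

def evaluate(counts, min_count):
    """Pattern evaluation: keep values seen at least min_count times."""
    return {v: n for v, n in counts.items() if n >= min_count}

def present(patterns):
    """Knowledge presentation: render the findings as readable lines."""
    return [f"{v}: {n} purchases" for v, n in sorted(patterns.items())]

records = integrate(clean(store_a), clean(store_b))
patterns = evaluate(mine(transform(select(records, "item"))), min_count=2)
report = present(patterns)
```

Each function corresponds to one step, so a step can be swapped out (for example, a different mining algorithm) without touching the others.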

Data Mining Techniques
Classification: Predicting categorical labels based on input data.
Clustering: Grouping similar data points together.
Regression: Predicting numerical values.
Association Rule Mining: Finding relationships between variables in large datasets.
Anomaly Detection: Identifying unusual data points.
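
As an illustration of one of these techniques, anomaly detection can be sketched with a z-score rule: flag any value that lies more than a chosen number of standard deviations from the mean. The sample values and the threshold of 2 are assumptions for this example; practical systems often use more robust methods.

```python
from statistics import mean, pstdev

def find_anomalies(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds `threshold`."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical daily transaction counts with one unusual day.
daily_counts = [10, 11, 9, 10, 12, 10, 50]
anomalies = find_anomalies(daily_counts)
```

With this data the mean is 16 and the value 50 is the only point more than two standard deviations away, so it is the one that gets flagged.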

Issues in Data Mining
Scalability: Handling large datasets efficiently.
Data Quality: Managing noisy, incomplete, or inconsistent data.
Privacy Concerns: Protecting sensitive information.
Interpretability: Making results understandable for non-experts.

Scalability
Scalability is a major issue in data mining, especially when extracting information from large amounts of data. To process such volumes effectively, data mining algorithms need to be scalable and efficient. For example, data distributed across multiple locations can become a scalability concern as it grows: the system has to match the data across locations multiple times, which can limit how well it scales.

Data quality: Poor-quality data can lead to inaccurate results, while high-quality data can improve the performance of data mining models.
Data security: This includes whether the data comes from an ethical source and whether it is protected on servers.
Data integrity: Data that is out of date or contains errors can produce inaccurate output.
Information privacy: This is especially an issue with online data mining, where most information is anonymized.

Data Quality
Data quality is a critical aspect of data mining because the results rely heavily on the quality of the input data. Poor data quality can lead to misleading models, which in turn can affect decision-making.

Duplicate data: Data can come from multiple channels, which can lead to duplicates when the sources are merged.
Inconsistent data: Mismatches in the same information across multiple data sources can cause inconsistencies.
Inaccurate and missing data: Inaccurate data can significantly affect decision-making. For example, typos can create chaos in recorded data.
Orphaned data: Data that carries no value on its own, such as a customer record that exists in one database but not in another.
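
The four problems above can each be checked with a few lines of code. The sketch below assumes two hypothetical sources, `crm` and `billing`, that share a customer `id` and an `email` field; the field names and records are invented for illustration.

```python
# Hypothetical records from two systems, keyed by customer id.
crm = [
    {"id": 1, "email": "ann@example.com"},
    {"id": 1, "email": "ann@example.com"},   # exact duplicate row
    {"id": 2, "email": "bob@example.com"},
    {"id": 3, "email": None},                # missing value
    {"id": 4, "email": "dee@example.com"},   # not present in billing
]
billing = [
    {"id": 1, "email": "ann@example.com"},
    {"id": 2, "email": "bob@example.org"},   # disagrees with crm
    {"id": 3, "email": "cy@example.com"},
]

def find_duplicates(records):
    """Ids of rows that repeat an earlier row exactly."""
    seen, dupes = set(), []
    for r in records:
        key = (r["id"], r["email"])
        if key in seen:
            dupes.append(r["id"])
        seen.add(key)
    return dupes

def find_missing(records):
    """Ids of rows whose email is missing."""
    return [r["id"] for r in records if r["email"] is None]

def find_inconsistent(left, right):
    """Ids present in both sources whose non-missing emails differ."""
    right_by_id = {r["id"]: r["email"] for r in right}
    return sorted({
        r["id"] for r in left
        if r["email"] is not None
        and right_by_id.get(r["id"]) is not None
        and r["email"] != right_by_id[r["id"]]
    })

def find_orphans(left, right):
    """Ids that exist in `left` but have no counterpart in `right`."""
    right_ids = {r["id"] for r in right}
    return sorted({r["id"] for r in left if r["id"] not in right_ids})

duplicates = find_duplicates(crm)
missing = find_missing(crm)
inconsistent = find_inconsistent(crm, billing)
orphans = find_orphans(crm, billing)
```

Running checks like these before mining makes it easier to decide whether to drop, repair, or flag the affected records.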

Privacy Concerns
Data mining can raise privacy concerns because people worry that their personal information could be leaked or sold to third parties without their consent. Data mining professionals have ethical and legal obligations to protect individuals' rights and maintain their privacy.
Data Privacy and Security
a. Privacy Concerns: Mining personal data can raise significant privacy issues. Protecting individuals' privacy while still extracting useful insights is a major challenge.
b. Data Security: Protecting sensitive data from unauthorized access and breaches is critical. Data mining processes must include robust security measures to safeguard data integrity and confidentiality.

Inference attacks: These attacks can be a serious threat to data privacy because the attacker is a legitimate user of the machine learning system and does not need to break into the system to access sensitive information.
Unauthorized access: Protecting privacy from unauthorized access is a major concern in data use, from business transactions to national security.
Some other issues in data mining include:
Performance: As data volumes increase, it is important to develop algorithms and infrastructure that can process and analyze data quickly.
Modeling from heterogeneous databases: Mining data from databases that come from different sources can be challenging because their contents may be structured or only semi-structured.

Interpretability
Interpretability is the degree to which a human can understand the cause of a decision. It can also be defined as the degree to which a human can consistently predict the model's result.

Applications of Data Mining

Market Basket Analysis: Understanding customer purchasing behavior.
Fraud Detection: Identifying fraudulent activities.
Healthcare: Predicting disease outbreaks or patient outcomes.
Customer Relationship Management (CRM): Personalizing marketing strategies.
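
Market basket analysis is usually expressed through two standard measures for a rule "A → B": support (the fraction of transactions containing both A and B) and confidence (the fraction of transactions containing A that also contain B). The sketch below computes both over a hypothetical set of four baskets; the items and the rule are made up for illustration.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing `antecedent`, the fraction that
    also contain `consequent`."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

# Rule under test: customers who buy bread also buy milk.
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
```

Here bread and milk appear together in two of four baskets (support 0.5), and two of the three bread baskets also contain milk (confidence 2/3).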

Examples of Large Datasets
Government: IRS, NGA, …
Large corporations:
  WALMART: 20M transactions per day
  MOBIL: 100 TB geological databases
  AT&T: 300M calls per day
  Credit card companies
Scientific:
  NASA, EOS project: 50 GB per hour
  Environmental datasets