Data Science topic and introduction to basic concepts involving data management and cleaning


About This Presentation

An introduction to data science and a brief overview of its topics.


Slide Content

Data Science
Dr. Rakesh Roshan, Assistant Professor, Anurag University

What is data science? Applying Science to data to make the data talk to us.

Introduction Data Science is a multidisciplinary field that combines various techniques, processes, and tools to extract valuable insights and knowledge from data. It encompasses a wide range of activities, from data collection and cleaning to analysis and visualization, with the ultimate goal of making data-driven decisions and solving complex problems.

Key Aspects of Data Science
Data Collection: Data scientists gather data from various sources, such as databases, APIs, sensors, websites, and more. This data can be structured (e.g., databases) or unstructured (e.g., text or images).
Data Cleaning and Preprocessing: Raw data often contains errors, missing values, or inconsistencies. Data scientists clean and preprocess the data to ensure it is accurate and ready for analysis. This may involve techniques like data imputation, outlier detection, and data transformation (a short sketch follows below).
Exploratory Data Analysis (EDA): EDA is the process of visualizing and summarizing data to understand its characteristics, uncover patterns, and identify potential relationships or outliers. Data visualization tools like charts and graphs are commonly used for EDA.
Data Analysis: Data scientists use statistical and machine learning techniques to extract meaningful insights from the data. This can include regression analysis, clustering, classification, and more, depending on the specific problem.
Machine Learning: Machine learning is a subset of data science that focuses on building predictive models and algorithms that can learn from data and make predictions or decisions. Common machine learning tasks include classification, regression, and clustering.
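To make the cleaning and preprocessing step concrete, here is a minimal sketch using pandas and NumPy on a small made-up DataFrame (the column names and values are hypothetical, not from the slides). It illustrates median imputation for missing values, a simple IQR-based outlier filter, and a min-max transformation.

```python
import numpy as np
import pandas as pd

# Hypothetical raw dataset with missing values and an implausible age
df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29, 230],   # 230 is an obvious data-entry outlier
    "income": [40000, 52000, 61000, np.nan, 45000, 48000],
})

# Imputation: fill missing values with each column's median
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())

# Outlier detection: drop rows whose age falls outside the 1.5 * IQR fences
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[(df["age"] >= q1 - 1.5 * iqr) & (df["age"] <= q3 + 1.5 * iqr)].copy()

# Transformation: rescale income to the 0-1 range
df["income_scaled"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())
print(df)
```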

Key Aspects of Data Science (continued)
Big Data: In some cases, data science deals with massive datasets known as "big data." Specialized tools and technologies like Hadoop and Spark are used to process and analyze these large volumes of data efficiently.
Data Visualization: Communicating findings is a crucial part of data science. Data scientists use visualization tools to create charts, graphs, and dashboards that make complex data more understandable and accessible to stakeholders.
Domain Knowledge: Understanding the domain or industry you're working in is essential. Data scientists need to collaborate with subject matter experts to ensure their analyses are meaningful and relevant.
Tools and Software: Data scientists use a variety of tools and software, including libraries like NumPy, pandas, and scikit-learn (for Python) for data manipulation and analysis. They also use specialized software for tasks like data visualization and machine learning.

What is Big Data? Big data refers to extremely large and complex datasets that are beyond the capabilities of traditional data management and processing tools. These datasets are typically characterized by the three "Vs": Volume, Velocity, and Variety:

Three Vs
Volume: Big data involves vast amounts of data that exceed the capacity of conventional databases and storage systems. This data can range from terabytes to petabytes or even exabytes, and it continues to grow rapidly.
Velocity: Big data is generated and collected at high speed. For example, real-time data streams from social media, sensors, and IoT devices can produce data at an astonishing pace, requiring immediate processing and analysis.
Variety: Big data comes in various formats and types, including structured data (e.g., databases), semi-structured data (e.g., XML, JSON), and unstructured data (e.g., text, images, videos). Dealing with this diverse data requires specialized tools and techniques.

Two Additional Vs
In addition to the three Vs, some definitions of big data include two additional Vs:
Veracity: This refers to the uncertainty or quality of the data. Big data often includes data from various sources, which may be incomplete, inconsistent, or of unknown accuracy. Managing and analyzing such data can be challenging.
Value: Ultimately, the goal of working with big data is to extract valuable insights and actionable information. Extracting value from big data requires advanced analytics, machine learning, and data science techniques.

Why the Big Data and Data Science Hype in the Past
Explosion of data
Technological advancements
High-profile success stories
Increased data accessibility
Data-driven decision-making
High demand for data professionals
Media and public attention
Promise of innovation

Datafication
Datafication is a concept that describes the process of turning various aspects of our lives, activities, and the world around us into data. It involves the collection, storage, and analysis of data from both digital and physical sources, leading to a quantification of experiences and phenomena. Here are some key points to understand datafication:
1. Data Collection: Datafication involves collecting data from a wide range of sources, including sensors, devices, social media, online transactions, and more. This data can be structured (e.g., databases) or unstructured (e.g., text, images), and it may encompass personal, environmental, and organizational data.
2. Quantification: Datafication seeks to convert real-world events and behaviors into quantifiable data points. For example, tracking steps with a fitness wearable, monitoring online shopping behavior, or measuring air quality in a city.
3. Data Analysis: The collected data is analyzed to extract patterns, insights, and trends. This analysis can lead to a better understanding of phenomena, such as consumer preferences, traffic patterns, and environmental changes.

Datafication (continued)
4. Decision-Making: Datafication has a significant impact on decision-making in various domains, from business and healthcare to urban planning. Data-driven decisions are based on empirical evidence rather than intuition.
5. Privacy and Ethical Concerns: The extensive collection and analysis of data raise privacy and ethical concerns. Datafication can infringe on individuals' privacy, and there is a need for responsible data handling and protection.
6. Benefits: Datafication has the potential to bring numerous benefits, such as improved healthcare through personalized medicine, optimized transportation systems, and more efficient supply chains. It enables data-driven innovations and solutions to real-world problems.
7. Challenges: Challenges associated with datafication include data security, data quality, and the potential for bias in data analysis. Ensuring that data is accurate, unbiased, and protected is crucial.

Skill Sets Needed: Data Visualization, Data Manipulation, Statistical Analysis, Machine Learning

Data Visualization: Science and design combined in a meaningful way to interpret data through graphs and plots.
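As a rough illustration of this idea, the sketch below uses matplotlib with made-up monthly sales numbers (purely illustrative, not from the slides) to draw a line plot and a histogram of the same data.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical monthly sales figures, used only to demonstrate two basic plots
months = np.arange(1, 13)
sales = np.array([12, 14, 13, 17, 19, 22, 21, 24, 23, 26, 28, 31])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Line plot: trend over time
ax1.plot(months, sales, marker="o")
ax1.set_xlabel("Month")
ax1.set_ylabel("Sales (thousands)")
ax1.set_title("Sales trend")

# Histogram: distribution of the same values
ax2.hist(sales, bins=5, edgecolor="black")
ax2.set_xlabel("Sales (thousands)")
ax2.set_ylabel("Frequency")
ax2.set_title("Sales distribution")

plt.tight_layout()
plt.show()
```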

Data Manipulation: "This data does not make sense at all! What should I do with it?" (slide illustration of raw data)

Data manipulation Data manipulation refers to the process of altering, transforming, or organizing data in order to derive insights, perform analysis, or meet specific requirements.
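The short pandas sketch below, built on a hypothetical orders table (column names and values are invented for illustration), shows three typical manipulation steps: filtering rows, grouping with aggregation, and adding a derived column.

```python
import pandas as pd

# Hypothetical order data used only to illustrate common manipulation steps
orders = pd.DataFrame({
    "region":  ["North", "South", "North", "East", "South", "East"],
    "product": ["A", "A", "B", "B", "A", "B"],
    "amount":  [120, 85, 200, 150, 95, 175],
})

# Filtering: keep only orders above a threshold
large_orders = orders[orders["amount"] > 100]

# Grouping and aggregation: total and average amount per region
summary = orders.groupby("region")["amount"].agg(total="sum", average="mean")

# Transformation: add a derived column
orders["amount_thousands"] = orders["amount"] / 1000

print(large_orders)
print(summary)
```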

Statistical Analysis Applying Math to understand the structure of data.

Machine Learning: Machine learning is a field of study and application that enables computers to learn and improve from data without being explicitly programmed, allowing them to make predictions or take actions based on patterns and experiences.
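As a minimal illustration of "learning from data rather than explicit rules," the sketch below fits a scikit-learn linear regression to a tiny made-up hours-studied vs. exam-score dataset and predicts an unseen value; the numbers are assumptions chosen only for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic example: hours studied vs. exam score (made-up numbers)
hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
scores = np.array([52, 55, 61, 64, 70, 74, 79, 83])

# The model learns the relationship from the data instead of being
# programmed with an explicit scoring rule
model = LinearRegression().fit(hours, scores)

# Predict the score for a previously unseen input (9 hours of study)
print(model.predict([[9]]))            # roughly 87-88 for this data
print(model.coef_, model.intercept_)   # the learned slope and intercept
```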


Statistical Inference
Statistical inference is a crucial aspect of statistics that involves drawing conclusions or making predictions about a population based on a sample of data. Here are some key notes on statistical inference:
1. Population and Sample: In statistical inference, you typically have a population, which is the entire group of interest, and a sample, which is a subset of the population. Statistical inference aims to make inferences about the population based on information from the sample.
2. Two Main Types: There are two primary types of statistical inference:
Estimation: Estimation involves making educated guesses about population parameters based on sample statistics. For example, estimating the population mean or variance from sample data.
Hypothesis Testing: Hypothesis testing is about making decisions or drawing conclusions about the population based on sample data. It often involves testing a hypothesis or statement about the population.

Statistical Inference (continued)
3. Parameters and Statistics: In estimation, you are interested in population parameters (e.g., population mean, variance) and use sample statistics (e.g., sample mean, sample standard deviation) to estimate them.
4. Sampling Distribution: The sampling distribution is the distribution of a statistic (e.g., sample mean) over all possible samples of the same size from the population. It helps quantify the variability of the statistic and forms the basis for inference.
5. Confidence Intervals: In estimation, confidence intervals are constructed to provide a range of values within which the population parameter is likely to fall with a certain level of confidence. For example, a 95% confidence interval for the population mean.
6. Hypothesis Testing Steps: In hypothesis testing, you follow a structured process:
Formulate a null hypothesis (H0) and an alternative hypothesis (Ha).
Collect sample data and calculate a test statistic.
Compare the test statistic to a critical value or calculate a p-value.
Make a decision based on the comparison: either reject the null hypothesis or fail to reject it.

Statistical Inference (continued)
7. Significance Level: The significance level (often denoted as α) is the probability of making a Type I error, which is rejecting a true null hypothesis. Common significance levels include 0.05 and 0.01.
8. P-Value: The p-value is the probability of observing a test statistic as extreme as, or more extreme than, the one calculated from the sample, assuming the null hypothesis is true. A smaller p-value suggests stronger evidence against the null hypothesis.
9. Type I and Type II Errors: In hypothesis testing, a Type I error occurs when the null hypothesis is incorrectly rejected when it is true. A Type II error occurs when the null hypothesis is incorrectly not rejected when it is false.
10. Sample Size: The sample size plays a critical role in the precision of estimation and the power of hypothesis tests. Larger samples generally provide more accurate estimates and better detection of differences.
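The sketch below ties several of these ideas together (null and alternative hypotheses, significance level, p-value, and a confidence interval) using SciPy on a hypothetical sample of fill weights; the data and the 500 g target are assumptions made only for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical sample: fill weights (grams) of 12 bags from a production line.
# H0: the population mean fill weight is 500 g.
# Ha: the population mean fill weight differs from 500 g.
sample = np.array([498.2, 501.1, 497.5, 499.0, 502.3, 496.8,
                   498.9, 500.4, 497.1, 499.6, 498.0, 497.4])

alpha = 0.05  # significance level: probability of a Type I error

# One-sample t-test gives the test statistic and two-sided p-value
t_stat, p_value = stats.ttest_1samp(sample, popmean=500)

# 95% confidence interval for the population mean, based on the t distribution
mean = sample.mean()
sem = stats.sem(sample)
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)

print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
print(f"95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")

if p_value < alpha:
    print("Reject H0: the mean fill weight appears to differ from 500 g.")
else:
    print("Fail to reject H0: no significant evidence of a difference.")
```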

Populations and Samples
Populations and samples are fundamental concepts in statistics, and they play a crucial role in drawing conclusions and making inferences about various phenomena. The population refers to the entire group or set of individuals, objects, or observations about which you want to make inferences or draw conclusions. A sample is a subset of the population that is selected for the purpose of collecting data and making statistical inferences about the population.

Population The population refers to the entire group or set of individuals, objects, or observations about which you want to make inferences or draw conclusions. The population can be of any size, ranging from a small group of people in a specific city to all the people in a country, or even all possible measurements of a particular quantity. The population of a city, the population of students in a university, the entire set of products manufactured by a company, etc. Population parameters are specific characteristics or measures of the population, such as the population mean, variance, or proportion. These are typically unknown and are the targets of statistical inference.

Sample A sample is a subset of the population that is selected for the purpose of collecting data and making statistical inferences about the population. Samples are used because it is often impractical or impossible to collect data from an entire population, so a representative portion is chosen. Random sampling methods are commonly used to ensure that the sample is representative of the population, reducing bias. Sample statistics are specific characteristics or measures calculated from the sample data, such as the sample mean, standard deviation, or proportion. These are used to estimate population parameters.
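A small NumPy sketch can make the parameter-versus-statistic distinction concrete: the array below stands in for an entire population (simulated, purely for illustration), and a random subset plays the role of the observed sample whose statistics estimate the population parameters.

```python
import numpy as np

rng = np.random.default_rng(42)

# Pretend this array is the entire population, e.g. heights (cm) of 100,000 people
population = rng.normal(loc=170, scale=8, size=100_000)

# In practice we only observe a random sample drawn from that population
sample = rng.choice(population, size=200, replace=False)

# Sample statistics estimate the (normally unknown) population parameters
print("Population mean (parameter):", population.mean())
print("Sample mean (statistic):    ", sample.mean())
print("Sample std (statistic):     ", sample.std(ddof=1))
```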

Statistical Modeling Statistical modeling refers to the data science process of applying statistical analysis to datasets. A statistical model is a mathematical relationship between one or more random variables and other non-random variables. The application of statistical modeling to raw data helps data scientists approach data analysis in a strategic manner, providing intuitive visualizations that aid in identifying relationships between variables and making predictions. Common data sets for statistical analysis include Internet of Things (IoT) sensors, census data, public health data, social media data, imagery data, and other public sector data that benefit from real-world predictions.

Supervised Learning
Supervised learning uses a labeled dataset, typically labeled by an external supervisor, subject matter expert (SME), or an algorithm/program. The dataset is split into training and test sets for training and then validating the model. The supervised model is then used to generate predictions on previously unseen, unlabeled data that belongs to the same category of data the model was trained on. Examples of supervised learning are classification and regression. Classification is used in applications like image classification and K-Nearest Neighbors for identifying customer churn. Regression algorithms are used to predict sales, home prices, etc.
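As a hedged illustration of this workflow, the sketch below uses scikit-learn's bundled iris dataset: it splits the labeled data into training and test sets, fits a K-Nearest Neighbors classifier, and checks accuracy on the held-out portion. The dataset choice and hyperparameters are illustrative assumptions, not taken from the slides.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Labeled dataset: iris flower measurements with known species labels
X, y = load_iris(return_X_y=True)

# Split into training and test sets, as described above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Train a K-Nearest Neighbors classifier on the labeled training data
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

# Validate on the held-out test set; the model can then score new, unlabeled data
predictions = clf.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, predictions))
```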

Unsupervised Learning Unsupervised learning is a machine learning approach where the algorithm is given data without explicit instructions on what to do with it. The algorithm tries to find patterns, structures, or relationships in the data without labeled target outcomes. Clustering and dimensionality reduction are common tasks in unsupervised learning. Example: Clustering Customers for Market Segmentation Imagine you work for a retail company, and you want to better understand your customers' behaviors and preferences to improve marketing strategies. You have a dataset of customer purchase histories but no predefined categories or labels for customer segments.
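Continuing the market-segmentation example, the sketch below runs KMeans from scikit-learn on a tiny made-up table of customer features; no labels are provided, and the algorithm groups the customers purely from the data. The feature values and the choice of three clusters are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer features: [annual spend, number of purchases]
customers = np.array([
    [200,  5], [250,  6], [220,  4],      # low spenders
    [900, 30], [950, 28], [870, 32],      # frequent big spenders
    [500, 15], [480, 14], [520, 16],      # mid-range customers
])

# Scale features so both contribute comparably to the distance metric
X = StandardScaler().fit_transform(customers)

# No target labels are given; KMeans discovers groups on its own
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
segments = kmeans.fit_predict(X)

print("Cluster assignment per customer:", segments)
```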