What Is Data Standardization? Data standardization converts data into a standard format that computers can read and understand. This is important because it allows different systems to share and use data efficiently. Without data standardization, it would be difficult for different systems to communicate and exchange information. Data standardization is also essential for preserving data quality: when data is standardized, it is much easier to detect errors and ensure that it is accurate, which in turn ensures that decision-makers have access to accurate and reliable information. Overall, data standardization is critical to keeping data usable and accessible; without it, we would be unable to use and manage data effectively.
Why Is Data Standardization Important? Data standardization is essential because it allows different systems to exchange data consistently. Standardization also makes it easier to process and analyze data and to store it in a database. With this approach, businesses can make better decisions based on their data: when data is standardized, companies can compare and analyze it more easily and extract insights they can use to improve their operations. Data standardization has many benefits, but one of the most important is that it helps businesses avoid making decisions based on inaccurate or incomplete data. It ensures that companies have a complete and accurate picture of their data, allowing them to make better decisions and improve their bottom line.
Normalization Vs. Standardization Data normalization and data standardization are two commonly used methods for dealing with data that cannot easily be analyzed in its raw form. Both methods transform data into a more uniform and consistent format, but they differ in how they achieve this. Data normalization typically involves scaling data down to a smaller range of values, such as between 0 and 1. Data standardization, on the other hand, transforms data so that it has a mean of 0 and a standard deviation of 1. Which method you should use depends on the outcome you need; data standardization is likely the best option if you need to compare data from different sources.
Z-Score Normalization
Z-score normalization refers to the process of normalizing every value in a dataset such that the mean of all of the values is 0 and the standard deviation is 1. We use the following formula to perform a z-score normalization on every value in a dataset: New value = (x – μ) / σ where: x : Original value μ : Mean of data σ : Standard deviation of data
Z-Score Normalization Example Suppose we have a dataset of values (shown as a table in the original slides). After applying z-score normalization to every value, the mean of the normalized values is 0 and the standard deviation of the normalized values is 1.
Advantages of Z-Score Normalization The z-score is a very useful statistic for understanding data, due to the following facts: it allows a data administrator to understand the probability of a score occurring within the normal distribution of the data, and it enables a data administrator to compare two scores that come from different normal distributions of the data.
Is a higher or lower Z score better? Suppose we have data from two persons: Person A has a high z-score and Person B has a low z-score. In this case, the higher z-score indicates that Person A's value lies farther above the mean than Person B's.
What does a negative and a positive z score mean? A negative z-score indicates that the data point is below the mean. A positive z-score indicates that the data point is above the mean.
Why is the mean of z-scores 0? The mean of the z-scores is always 0, and the standard deviation of the z-scores is always 1. Z-score values above 0 indicate that the sample values are above the mean, and z-score values below 0 indicate that the sample values are below the mean. In addition, the sum of the squared z-scores is always equal to the number of z-score values.
What is the meaning of a high z-score and a low z-score? A high z-score means there is a very low probability of data occurring above that z-score, and a low z-score means there is a very low probability of data occurring below that z-score.
Z-Score Normalization Implementation: NumPy One-Dimensional Arrays
Step 1: Import the required modules.
import numpy as np
import scipy.stats as stats
Step 2: Create an array of values.
data = np.array([6, 7, 7, 12, 13, 13, 15, 16, 19, 22])
Step 3: Calculate the z-score for each value in the array.
stats.zscore(data)
# [-1.394, -1.195, -1.195, -0.199, 0, 0, 0.398, 0.598, 1.195, 1.793]
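The same idea can be applied to tabular data, column by column. A minimal sketch with pandas, using made-up column names and values (not from the original slides):
import pandas as pd
# Hypothetical DataFrame with two numeric columns (illustrative values)
df = pd.DataFrame({'points': [25, 12, 15, 14, 19, 23, 25, 29],
                   'assists': [5, 7, 7, 9, 12, 9, 9, 4]})
# Subtract each column's mean and divide by its standard deviation;
# ddof=0 uses the population standard deviation, matching scipy's zscore
df_z = (df - df.mean()) / df.std(ddof=0)
print(df_z)   # each normalized column now has mean 0 and standard deviation 1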
Min-Max Normalization
Min-Max Normalization, or Min-Max Scaling, is the process of converting the given data into corresponding values within a fixed boundary, usually [0, 1]. This scaling technique is widely used in machine learning modelling; the dataset is scaled with it before being passed to the model. In the illustration on the original slide, the left half shows the data before Min-Max Normalization and the right half shows it after. On the left, the three boys are given boxes of equal height, which isn't fair because each of them is a different height (or scale). To make it fair, on the right each boy is given a box sized according to his height, so that all three end up at the same effective height. This is what Min-Max Normalization does.
For example: Consider a dataset with 2 columns representing the marks of 2 groups of students, Group A and Group B. Group A is assigned marks out of 10, whereas Group B is assigned marks out of 100. If we pass this data to a model without transforming it, the model will assume that the students in Group B performed better than those in Group A simply because Group B has higher numeric values. This is because the model doesn't know the grading system or, simply put, it doesn't know the "minimum" and "maximum" possible values in Groups A and B. To avoid this bias, the data is generally scaled to equivalent values within a fixed range. Since there is no universal scaling system, the values are by default converted into their corresponding equivalent values in the range [0, 1].
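A minimal sketch of min-max scaling with NumPy, using the standard formula x' = (x - min) / (max - min) and made-up marks for Groups A and B (the values are illustrative, not from the original slides):
import numpy as np
# Hypothetical marks: Group A out of 10, Group B out of 100
group_a = np.array([4, 6, 7, 9, 10], dtype=float)
group_b = np.array([40, 55, 70, 90, 100], dtype=float)
def min_max_scale(x):
    # Rescale values to the [0, 1] range: (x - min) / (max - min)
    return (x - x.min()) / (x.max() - x.min())
print(min_max_scale(group_a))   # both groups now share the same [0, 1] scale
print(min_max_scale(group_b))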
Data Aggregation
Data aggregation is the process of collecting data and presenting it in summary form. This information is then used for statistical analysis and can also help company executives make more informed decisions about marketing strategies, price setting, and structuring operations, among other things. Data aggregation is typically performed on a large scale via software programs known as data aggregators. These tools can also distribute a company's data to several different publishing sites such as social media platforms, search engines, and review sites. Data scientists and analysts are the most common users of data aggregation tools and software. Data aggregation is often used to provide statistical analysis for groups of people and to create useful summary data for business analysis.
Aggregation is often done on a large scale, through software tools known as data aggregators. Data aggregators typically include features for collecting, processing and presenting aggregate data. Data aggregation can enable analysts to access and examine large amounts of data in a reasonable time frame. A row of aggregate data can represent hundreds, thousands or even more atomic data records. When the data is aggregated, it can be queried quickly instead of requiring all of the processing cycles to access each underlying atomic data row and aggregate it in real time when it is queried or accessed. As the amount of data stored by organizations continues to expand, the most important and frequently accessed data can benefit from aggregation, making it feasible to access efficiently.
What does data aggregation do? Data aggregators summarize data from multiple sources. They provide capabilities for multiple aggregate measurements, such as sum, average, and count. Examples of aggregate data include the following:
Voter turnout by state or county. Individual voter records are not presented, just the vote totals by candidate for the specific region.
Average age of customer by product. Each individual customer is not identified, but for each product, the average age of the customer is saved.
Number of customers by country. Instead of examining each customer, a count of the customers in each country is presented.
Data aggregation can also have an effect similar to data anonymization: individual data elements with personally identifiable details are combined and replaced with a summary representing a group as a whole.
An example of this is creating a summary that shows the aggregate average salary for employees by department, rather than browsing through individual employee records with salary data. Aggregate data does not need to be numeric. You can, for example, count the number of any non-numeric data element. Before aggregating, it is crucial that the atomic data is analyzed for accuracy and that there is enough data for the aggregation to be useful. For example, counting votes when only 5% of results are available is not likely to produce a relevant aggregate for prediction.
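As a sketch of that salary-by-department example, here is a minimal pandas aggregation using hypothetical column names and figures (not from the original slides):
import pandas as pd
# Hypothetical employee records (departments and salaries are illustrative)
employees = pd.DataFrame({
    'department': ['Sales', 'Sales', 'IT', 'IT', 'HR'],
    'salary': [52000, 58000, 75000, 81000, 60000],
})
# Aggregate: average salary and head count per department,
# instead of browsing individual employee salary records
summary = employees.groupby('department')['salary'].agg(['mean', 'count'])
print(summary)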
How do data aggregators work? Data aggregators work by combining atomic data from multiple sources, processing the data for new insights, and presenting the aggregate data in a summary view. They also usually provide the ability to track data lineage and can trace back to the underlying atomic data that was aggregated. Collection. First, data aggregation tools extract data from multiple sources and store it in large databases as atomic data. The data may be extracted from sources such as the following: social media communications; news headlines; personal data and browsing history from internet of things (IoT) devices; and call centers, podcasts, etc. (through speech recognition).
Processing. Once the data is extracted, it is processed. The data aggregator identifies the atomic data that is to be aggregated and may apply predictive analytics, artificial intelligence (AI), or machine learning algorithms to the collected data for new insights. The aggregator then applies the specified statistical functions to aggregate the data. Presentation. Users can present the aggregated data in a summarized format that itself provides new information; the statistical results are comprehensive and of high quality. Data aggregation may be performed manually or through data aggregators. However, it is often performed on a large scale, which makes manual aggregation less feasible; manual aggregation also risks accidental omission of crucial data sources and patterns.
Uses for data aggregation Data aggregation can be helpful for many disciplines, such as finance and business strategy decisions, product planning, product and service pricing, operations optimization and marketing strategy creation. Users may be data analysts, data scientists, data warehouse administrators and subject matter experts. Aggregated data is commonly used for statistical analysis to obtain information about particular groups based on specific demographic or behavioral variables, such as age, profession, education level or income. For business analysis purposes, data can be aggregated into summaries that help leaders make well-informed decisions. User data can be aggregated from multiple sources, such as social media communications, browsing history from IoT devices and other personal data, to give companies critical insights into consumers.
Attribute Construction In this method, new attributes or features are created from the existing features. It simplifies the data and makes data mining more efficient. For example, if we have height and weight features in the data, we can create a new attribute, BMI, from these two features.
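A minimal pandas sketch of constructing that BMI attribute from hypothetical height and weight columns (BMI = weight in kg divided by height in metres squared; the values and column names are illustrative):
import pandas as pd
# Hypothetical records with height (m) and weight (kg)
df = pd.DataFrame({'height_m': [1.60, 1.75, 1.82],
                   'weight_kg': [55, 72, 90]})
# Construct a new attribute from existing ones: BMI = weight / height^2
df['bmi'] = df['weight_kg'] / df['height_m'] ** 2
print(df)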
Graphical Representation of Data
Graphical Representation Graphical Representation is a way of analysing numerical data. It exhibits the relation between data, ideas, information and concepts in a diagram. It is easy to understand and it is one of the most important learning strategies. It always depends on the type of information in a particular domain.
There are different types of graphical representation. Some of them are as follows:
Line Graphs – A line graph (or linear graph) is used to display continuous data and is useful for predicting future events over time.
Bar Graphs – A bar graph is used to display categories of data and compares them using solid bars to represent the quantities.
Histograms – A graph that uses bars to represent the frequency of numerical data organized into intervals. Since all the intervals are equal and continuous, all the bars have the same width.
Line Plot – Shows the frequency of data on a given number line. An 'x' is placed above the number line each time that data value occurs.
Frequency Table – A table showing the number of pieces of data that fall within each given interval.
Circle Graph – Also known as a pie chart, it shows the relationships of the parts to the whole. The circle represents 100%, and each category occupies its specific percentage, such as 15%, 56%, etc.
Stem and Leaf Plot – The data are organized from the least value to the greatest value. The digits of the least place value form the leaves, and the next place value digits form the stems.
Box and Whisker Plot – This plot summarizes the data by dividing it into four parts. The box and whiskers show the range (spread) and the middle (median) of the data.
General Rules for Graphical Representation of Data There are certain rules for presenting information effectively in graphical form. They are:
Suitable Title: Make sure that an appropriate title is given to the graph, indicating the subject of the presentation.
Measurement Unit: Mention the measurement unit in the graph.
Proper Scale: To represent the data accurately, choose a proper scale.
Index: Index the appropriate colours, shades, lines, and designs in the graph for better understanding.
Data Sources: Include the source of the information, wherever necessary, at the bottom of the graph.
Keep it Simple: Construct the graph in a way that everyone can understand.
Neat: Choose the correct size, fonts, colours, etc. so that the graph serves as a visual aid for the presentation of information.
Types of Charts Charts are an essential part of working with data, as they are a way to condense large amounts of data into an easy to understand format. Visualizations of data can bring out insights to someone looking at the data for the first time, as well as convey findings to others who won’t see the raw data. There are countless chart types out there, each with different use cases. Often, the most difficult part of creating a data visualization is figuring out which chart type is best for the task at hand.
Generally, the most popular types of charts are column charts, bar charts, pie charts, doughnut charts, line charts, area charts, scatter charts, spider (radar) charts, gauges, and comparison charts. The biggest challenge is selecting the most effective type of chart for your task. To choose the most suitable chart type, you should generally consider the total number of variables, the number of data points, and the time period of your data. Each type of chart has specific advantages; for example, scatter diagrams are useful for showing relationships between different factors, while line charts are suitable for showing trends.
Line Charts A line chart graphically displays data that changes continuously over time. Each line graph consists of points connected to show a trend (continuous change). Line graphs have an x-axis and a y-axis; in most cases, time is plotted on the horizontal axis. Uses of line graphs:
When you want to show trends. For example, how house prices have increased over time.
When you want to make predictions based on a history of data over time.
When comparing two or more different variables, situations, or pieces of information over a given period of time.
Example: The following line graph shows the annual sales of a company over six consecutive years:
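A minimal matplotlib sketch of such a line chart, with made-up yearly sales figures (the numbers are illustrative, not from the original slide):
import matplotlib.pyplot as plt
# Hypothetical annual sales (in thousands) for six consecutive years
years = [2018, 2019, 2020, 2021, 2022, 2023]
sales = [120, 135, 128, 150, 170, 185]
plt.plot(years, sales, marker='o')   # points connected by a line show the trend
plt.title('Annual Sales')
plt.xlabel('Year')
plt.ylabel('Sales (thousands)')
plt.show()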
Column Charts Column charts are effective for comparing one or more sets of data points. The vertical axis (Y-axis) usually shows numeric values, while the horizontal X-axis indicates a time period. Typical column chart variants include clustered and stacked columns; for example, a chart might compare categories such as flowers, shrubs, and trees across periods, using different colors for each category to reveal trends over time.
Pie Charts When it comes to statistical types of graphs and charts, the pie chart (or circle chart) has a crucial place and meaning. It displays data and statistics in an easy-to-understand 'pie-slice' format and illustrates numerical proportion. Each pie slice is sized relative to the share of a particular category within the group as a whole. To put it another way, the pie chart breaks a group down into smaller pieces; it shows part-whole relationships. To make a pie chart, you need a categorical variable and a numerical variable. Pie Chart Uses:
When you want to represent the composition of something.
It is very useful for displaying nominal or ordinal categories of data.
To show percentage or proportional data.
When comparing areas of growth within a business, such as profit.
Pie charts work best for displaying data for 3 to 7 categories.
Example: The pie chart below represents the proportion of types of transportation used by 1000 students to go to their school. Pie charts are widely used by data-driven marketers for displaying marketing data.
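A minimal matplotlib sketch of such a pie chart, with hypothetical counts for how 1000 students travel to school (the categories and numbers are illustrative):
import matplotlib.pyplot as plt
# Hypothetical split of how 1000 students travel to school
modes = ['Bus', 'Walk', 'Bicycle', 'Car']
counts = [450, 250, 200, 100]
# autopct prints each slice's share of the whole as a percentage
plt.pie(counts, labels=modes, autopct='%1.0f%%')
plt.title('Transportation Used by Students')
plt.show()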
Bar Charts Bar charts represent categorical data with rectangular bars (to understand what categorical data is, see categorical data examples). Bar graphs are among the most popular types of graphs and charts in economics, statistics, marketing, and digital customer experience visualization. They are commonly used to compare several categories of data. Each rectangular bar has a length and height proportional to the value it represents. One axis of the bar chart presents the categories being compared; the other axis shows the measured value.
Bar Chart Uses:
When you want to display data that are grouped into nominal or ordinal categories.
To compare data among different categories.
Bar charts can also show large changes in data over time.
Bar charts are ideal for visualizing the distribution of data when we have more than three categories.
Example: The bar chart below represents the total sum of sales for Product A and Product B over three years. Bars can be of two types, vertical or horizontal; it doesn't matter which kind you use, and the example shown here uses vertical bars.
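A minimal matplotlib sketch of a grouped vertical bar chart like the example described, with made-up sales figures for Product A and Product B (the values are illustrative):
import numpy as np
import matplotlib.pyplot as plt
# Hypothetical total sales for two products over three years
years = ['2021', '2022', '2023']
product_a = [30, 45, 50]
product_b = [25, 40, 60]
x = np.arange(len(years))            # one group of bars per year
plt.bar(x - 0.2, product_a, width=0.4, label='Product A')
plt.bar(x + 0.2, product_b, width=0.4, label='Product B')
plt.xticks(x, years)
plt.ylabel('Sales')
plt.legend()
plt.show()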
Histogram A histogram shows continuous data in ordered rectangular columns (to understand what continuous data is, see our post on discrete vs. continuous data). Usually, there are no gaps between the columns. A histogram displays the frequency distribution (shape) of a data set. At first glance, histograms look similar to bar graphs, but there is a key difference between them: a bar chart represents categorical data, whereas a histogram represents continuous data.
Histogram Uses:
When the data is continuous.
When you want to represent the shape of the data's distribution.
When you want to see whether the outputs of two or more processes are different.
To summarize large data sets graphically.
To communicate the data distribution quickly to others.
Example: The histogram below represents per capita income for five age groups. Histograms are very widely used in statistics, business, and economics.
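A minimal matplotlib sketch of a histogram, using randomly generated income values as stand-in data (the figures are illustrative, not the per capita income data from the slide):
import numpy as np
import matplotlib.pyplot as plt
# Hypothetical per capita income values (continuous data)
np.random.seed(0)
income = np.random.normal(loc=40000, scale=8000, size=500)
# Bars sit side by side with no gaps; each bin counts how many values fall in its interval
plt.hist(income, bins=20, edgecolor='black')
plt.xlabel('Per capita income')
plt.ylabel('Frequency')
plt.show()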
Box and Whisker Chart A box and whisker chart is a statistical graph for displaying sets of numerical data through their quartiles. It displays a frequency distribution of the data. The box and whisker chart helps you display the spread and skewness of a given set of data using the five-number summary principle: minimum, maximum, median, and lower and upper quartiles. The five-number summary provides a statistical summary for a particular set of numbers: it shows the range (minimum and maximum), the spread (upper and lower quartiles), and the center (median) of the data.
A very simple box and whisker plot is shown below. Box and Whisker Chart Uses:
When you want to observe the upper and lower quartiles, mean, median, deviations, etc. for a large set of data.
When you want a quick view of the dataset's distribution.
When you have multiple data sets that come from independent sources and relate to each other in some way.
When you need to compare data from different categories.
Example: The table and box-and-whisker plots below show test scores for Maths and Literature for the same class. Box and whisker charts have applications in many scientific areas and types of analysis, such as statistical analysis, test results analysis, marketing analysis, and data analysis.
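A minimal matplotlib sketch of side-by-side box and whisker plots, with made-up Maths and Literature scores (the values are illustrative, not the class data from the slide):
import matplotlib.pyplot as plt
# Hypothetical test scores for the same class in two subjects
maths = [55, 62, 68, 70, 74, 78, 81, 85, 90]
literature = [48, 60, 65, 66, 70, 72, 75, 80, 95]
# Each box shows the median and quartiles; whiskers extend toward the minimum and maximum
plt.boxplot([maths, literature], labels=['Maths', 'Literature'])
plt.ylabel('Score')
plt.show()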