QQ Plot.pptx

rahulborate14 93 views 18 slides Aug 17, 2023

About This Presentation

QQ Plot introduction


Slide Content

QQ Plot

Introduction  Engineers and scientists work with data; without it, they cannot draw any conclusions. We live in an era in which data is created every day from every aspect of our lives. Some data are random and some are biased, often because of the way they were collected.  One very important aspect of data is its distribution profile. The collected data may follow a normal distribution or may be far from normal. It may be skewed to either side or follow a multimodal pattern. It may be discrete or continuous. For continuous data, a normal distribution brings many advantages over its counterparts.  Many inferential statistical procedures assume that the data are normally distributed, and a bell-shaped curve is easy to describe with just a mean and a standard deviation.


Why Q-Q plot?  Because the normal distribution is so important, we need to check whether the collected data are normal. Here, we demonstrate the Q-Q plot as a way to check the normality or skewness of the data. Q stands for quantile, so a Q-Q plot is a quantile-quantile plot. Several statistical tests, such as the Kolmogorov–Smirnov test and the Shapiro–Wilk test, can also be used to assess normality.  However, it is difficult to judge normality just by looking at a table of numbers.
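The two tests named above can be tried out with SciPy. The following is a minimal sketch, assuming NumPy and SciPy are installed; the sample data, the seed, and the helper name `normality_pvalues` are illustrative assumptions, not part of the original slides.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=50, scale=5, size=300)   # roughly bell-shaped
skewed_sample = rng.exponential(scale=5, size=300)      # strongly right-skewed


def normality_pvalues(x):
    """Return (Shapiro-Wilk p-value, Kolmogorov-Smirnov p-value) for a 1-D sample."""
    # Standardize first so the sample can be compared with the standard normal.
    # Note: estimating mean/std from the same data makes the K-S p-value
    # only approximate; it is used here purely for illustration.
    z = (x - x.mean()) / x.std(ddof=1)
    shapiro_p = stats.shapiro(x)[1]
    ks_p = stats.kstest(z, "norm")[1]
    return shapiro_p, ks_p


for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    sw, ks = normality_pvalues(sample)
    print(f"{name}: Shapiro-Wilk p={sw:.4f}, K-S p={ks:.4f}")
```

A small p-value is evidence against normality. A large p-value only means that the test found no evidence against it.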

Brief explanation  We now know that a Q-Q plot is a quantile-quantile plot, but what is a quantile in the first place?  When the whole dataset is sorted, the 50th quantile is the point below which 50% of the data falls and above which the other 50% falls. That is the median.  Similarly, at the 1st quantile only 1% of the data falls below that point and 99% lies above it. The 25th, 50th, and 75th quantiles are also known as quartiles. There are three quartiles in a dataset:
Q1 = first quartile = 25th quantile
Q2 = second quartile = 50th quantile = median
Q3 = third quartile = 75th quantile
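As a small worked example of these definitions, the sketch below computes the three quartiles of the integers 1 to 100 with NumPy and checks what fraction of the data falls below each one. The dataset is an assumption chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Illustrative data: the integers 1..100, already sorted.
data = np.arange(1, 101)

# np.percentile takes positions on a 0-100 scale.
q1, q2, q3 = np.percentile(data, [25, 50, 75])

# Fraction of data points strictly below each quartile.
frac_below = {name: float((data < q).mean())
              for name, q in [("Q1", q1), ("Q2", q2), ("Q3", q3)]}

print(q1, q2, q3)
print(frac_below)
```

Here Q2 coincides with `np.median(data)`, and a quarter, a half, and three quarters of the values lie below Q1, Q2, and Q3 respectively.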

Brief explanation  Quantiles are sometimes called percentiles.  A typical Q-Q plot is shown below. Let’s explain this plot, which looks very much like a straight line.  Axes: The x-axis of a Q-Q plot represents the quantiles of the standard normal distribution. Suppose we have normally distributed data and want to standardize it. Standardizing means subtracting the mean from each data point and dividing the result by the standard deviation. The result is known as a z-score. Let’s sort those z-scores and plot them again. The plot below shows that the x-axis is now centered at 0 and extends 3 standard deviations on each side. For normally distributed data, about 99.7% of the values fall within this range.
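The construction described above can be sketched in a few lines: standardize the data, sort the z-scores, and pair them with the matching quantiles of the standard normal distribution. This is a minimal sketch assuming NumPy and SciPy; the simulated data and the plotting positions `(i - 0.5) / n` are illustrative choices, and other conventions for the positions exist.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=170, scale=10, size=500)   # simulated, roughly normal data

# Standardize (z-scores) and sort: these go on the y-axis.
z = np.sort((data - data.mean()) / data.std(ddof=1))

# Matching standard-normal quantiles: these go on the x-axis.
n = len(z)
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical = stats.norm.ppf(probs)

# For normal data the points lie close to the line y = x,
# so the correlation between the two axes is close to 1.
r = np.corrcoef(theoretical, z)[0, 1]

# Share of a normal distribution within 3 standard deviations of the mean.
within_3sd = stats.norm.cdf(3) - stats.norm.cdf(-3)

print(f"correlation = {r:.4f}, within 3 SD = {within_3sd:.4f}")
# To draw the plot (requires matplotlib):
#   import matplotlib.pyplot as plt
#   plt.scatter(theoretical, z); plt.plot([-3, 3], [-3, 3]); plt.show()
```

Skewed data would bend away from the straight line at one or both ends, which is what makes the plot useful as a visual check.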

Network Graph: Motivation  Wouldn’t it be nice if you could visualize their connections using an interactive network graph like below?

Network Graph: Pyvis  What is Pyvis? Pyvis is a Python library that allows you to create interactive network graphs in a few lines of code. To install Pyvis, type: pip install pyvis

What is a Tree map?  A tree map is a type of chart that visualizes categorical, preferably hierarchical, data as a set of nested rectangles. In hierarchical data, the categories or items share parent-child relationships in an overall tree structure. A simple example of this type of data structure is a company, where all individuals and their designations within teams can be grouped under one entity: the company itself.
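The company example can be written down as a nested data structure. The sketch below uses a hypothetical company with made-up teams, designations, and headcounts. It shows the parent-child idea: the value of each parent is the sum of its children, which is what a tree map later draws as nested rectangles.

```python
# Hypothetical hierarchy: company -> teams -> designations -> headcount.
company = {
    "Engineering": {"Developer": 12, "Tester": 5, "Manager": 2},
    "Sales": {"Executive": 8, "Manager": 1},
    "Support": {"Agent": 6},
}


def total(node):
    """Recursively sum leaf values; a parent's value is the sum of its children."""
    if isinstance(node, dict):
        return sum(total(child) for child in node.values())
    return node


team_totals = {team: total(children) for team, children in company.items()}
company_total = total(company)

print(team_totals)
print(company_total)
```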

What is a Tree map?  When to use a Tree map? These are some key points to consider before using tree maps for visualization.  Tree maps work well when there is a clear ‘Part-to-whole’ relationship amongst multiple categories present in the data.  Hierarchical Data is needed. This indicates that the data could be arranged in branches and sub-branches.  The focus is not on precise comparisons between categories but rather on spotting the key factors/trends or patterns.

Benefits of using a Tree map
Space constraint:  A large amount of hierarchical data can be visualized in a small space.
Easier to read:  Compared with a circular multi-level pie chart, the tree map is easier to read because of its linear visual appearance.
Quickly spot patterns:  Since each group is represented by a rectangle whose area is always proportional to its value, trends and patterns (similarities and anomalies) are quickly visible in tree maps.
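To make the "area proportional to value" point concrete, here is a minimal sketch of a one-level layout that slices a canvas into side-by-side rectangles. It is a simplified "slice" layout written for illustration, not the algorithm used by any particular charting library, and the category values are hypothetical.

```python
def slice_layout(values, x=0.0, y=0.0, width=100.0, height=50.0):
    """Split a width x height canvas into vertical strips, one per category.

    Returns {name: (x, y, w, h)}; each strip's area is proportional to its value.
    """
    total = sum(values.values())
    rects = {}
    cursor = x
    # Largest categories first, as tree maps usually order rectangles by area.
    for name, value in sorted(values.items(), key=lambda kv: kv[1], reverse=True):
        w = width * value / total
        rects[name] = (cursor, y, w, height)
        cursor += w
    return rects


# Hypothetical category values.
values = {"Engineering": 50, "Sales": 30, "Support": 20}
rects = slice_layout(values)

for name, (rx, ry, w, h) in rects.items():
    print(f"{name}: x={rx:.1f}, width={w:.1f}, area={w * h:.1f}")
```

Real tree map libraries typically use layouts that keep rectangles closer to square, but the proportionality between area and value is the same.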

Real-world use cases for Tree map Charts  1. Displaying region-wise customer complaints about a product  Suppose there are 10 different types of complaints (denoted C1 to C10) about a product, and the company wants to visualize which complaints are relevant to each region. A tree map could be used in such a case. Here, it can be clearly seen how different regions have specific types of user complaints.

Real-world use cases for Tree map Charts  2. Showcasing category-wise product availability, such as mobile phones  Let us assume that there are four categories of mobile phones with the following market share percentages: Low-end (up to 10,000 INR: 15%), Mid-Range (10,000 to 25,000 INR: 55%), Premium (above 25,000 to 50,000 INR: 25%), and Top-end (above 50,000 INR: 10%).  From this tree map, we can gauge that there is a bigger demand and market for Mid-Range phones, while there are limited phones available in the Top-end category.

Real-world use cases for Tree map Charts  3. Exploring customer segmentation for a product  Companies that sell apparel or personal products usually segment their customers by age. This way they can categorize their products and product variants separately for each age group.  In the case of this tree map, the company could decide whether to launch more products for particular customer segments based on the distribution.

Challenges associated with a Tree map  Tree maps also come with a set of limitations, as outlined below:
– Tree maps built with a large number of data points on a single level can be hard to read, and hard to print for reporting purposes.
– Sometimes, additional sorting might be required to understand the data better. However, all the rectangles are automatically ordered within the parent node by area.
– With too many categories, and too many colors to represent them, the tree map becomes overwhelming for the reader.
– Tree maps become ineffective for datasets with balanced trees, i.e., when items are of similar value. In these cases, the main purpose of a tree map, which is to highlight the largest item in a given category, becomes impossible to achieve.

Dendrograms in Python  A dendrogram is a diagram that depicts a tree.  Plotly's create_dendrogram figure factory performs hierarchical clustering on data and depicts the resulting tree.  Distances between clusters are represented by the values on the tree depth axis.  Dendrogram plots are often used in computational biology to depict gene or sample grouping, occasionally in the margins of heatmaps.  Hierarchical clustering produces dendrograms as an output. Many people claim that dendrograms of this type may be used to determine the number of clusters.
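The clustering step behind a dendrogram can be sketched with SciPy's hierarchical clustering routines. Note that this uses SciPy directly rather than the create_dendrogram figure factory mentioned above. The two simulated point groups, the Ward linkage method, and the choice of two clusters are all assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(1)
# Two well-separated, simulated groups of 2-D points (5 points each).
group_a = rng.normal(loc=0.0, scale=0.3, size=(5, 2))
group_b = rng.normal(loc=5.0, scale=0.3, size=(5, 2))
X = np.vstack([group_a, group_b])

# Agglomerative clustering: each row of Z records one merge and its distance.
Z = linkage(X, method="ward")

# Compute the tree layout without drawing it (no plotting backend needed).
tree = dendrogram(Z, no_plot=True)

# Cut the tree so that at most two clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")

print("leaf order:", tree["leaves"])
print("cluster labels:", labels)
print("merge distances:", np.round(Z[:, 2], 2))
```

A large jump between the last merge distances is the visual cue people often use when reading the number of clusters off a dendrogram. Here the final merge, which joins the two groups, is much taller than the earlier ones.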

Dendrograms in Python  Wholesale Customer Segmentation Problem using Hierarchical Clustering  We will be working on a wholesale customer segmentation problem. You can download the dataset using this link.  The data is hosted on the UCI Machine Learning Repository. The aim of this problem is to segment the clients of a wholesale distributor based on their annual spending on diverse product categories, such as milk and grocery.

Scree Plot in Python  How to Create a Scree Plot in Python  Principal components analysis (PCA) is an unsupervised machine learning technique that finds principal components (linear combinations of the predictor variables) that explain a large portion of the variation in a dataset.  When we perform PCA, we’re interested in understanding what percentage of the total variation in the dataset can be explained by each principal component.  One of the easiest ways to visualize the percentage of variation explained by each principal component is to create a scree plot.
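The quantity shown on a scree plot, the share of total variation explained by each principal component, can be computed with NumPy alone. This is a minimal sketch under stated assumptions: the data are simulated, PCA is carried out through a singular value decomposition of the centered data, and the helper `plot_scree` is a hypothetical function that needs matplotlib only when it is called.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated data: two columns driven by a shared factor, two columns of pure noise.
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# PCA via SVD of the centered data matrix.
Xc = X - X.mean(axis=0)
singular_values = np.linalg.svd(Xc, compute_uv=False)
explained_variance = singular_values ** 2 / (len(X) - 1)
explained_ratio = explained_variance / explained_variance.sum()

print(np.round(explained_ratio, 3))


def plot_scree(ratios):
    """Draw a scree plot: component number vs. proportion of variance explained."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    components = np.arange(1, len(ratios) + 1)
    plt.plot(components, ratios, "o-")
    plt.xlabel("Principal component")
    plt.ylabel("Proportion of variance explained")
    plt.title("Scree plot")
    plt.show()
```

Calling `plot_scree(explained_ratio)` draws the plot. The point where the curve flattens out (the "elbow") is commonly used to decide how many components to keep.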