BDA class q2

ANUNAY14 12 views 110 slides Aug 15, 2024

About This Presentation

Big data


Slide Content

UNIT-5: Data Analytics with Big Data Analytics Tools. PRITI PRIYADARSANI PRADHAN

Course Details: Theory and methods for big data analytics. Selected machine learning and data mining methods: support vector machine, logistic regression. Statistical analysis techniques: conjoint analysis, correlation analysis. Time series analysis. Big data graph analytics. Exploratory data analysis. Visualization before analysis. Analytics for unstructured data.

What is Data Analytics? Data analytics is the process of analyzing raw data in order to draw out meaningful, actionable insights.

Why Big data analytics is required

Big Data???

Big Data Analytics: Big data analytics is the process of collecting, examining, and analyzing large amounts of data to discover market trends, insights, and patterns that can help companies make better business decisions and prevent fraudulent activities.

Why Big Data Analytics??? Making Organizations Smarter and More Efficient

Contd … Optimize Business Operations by Analyzing Customer Behaviour

Contd … Cost Reduction: Patients now use sensor devices at home or outside that send constant streams of data. This data can be monitored and analyzed in real time to help patients avoid hospitalization by self-managing their conditions. For hospitalized patients, physicians can use predictive analytics to optimize outcomes and reduce readmissions.

Contd … Risk Management

Contd … Product Development and Innovations

Contd … Better Decision Making

Contd …

Life Cycle of Big Data Analytics

Contd … Presents a clear understanding of the justification, motivation, and goals of carrying out the analysis. Example: Detection of fraudulent claims for an insurance company. Identification of fraud. Decrease in monetary loss. Budgeting for the acquisition of additional data quality and cleansing tools and newer data visualization technologies.

Contd … To provide insight, identify as many types of related data sources as possible, which helps to find hidden patterns and correlations. A number of internal and external datasets are identified. Example: Internal data includes policy data, insurance application documents, claim data, incident photographs, call center agent notes, and emails. External data includes social media data (Twitter feeds), weather reports, and geographical (GIS) data. The claim data consists of historical claim data with information on whether each claim was fraudulent or legitimate.

Contd … Data is gathered from all of the data sources that were identified during the previous stage. Data classified as "corrupt" can include records with missing or nonsensical values or invalid data types. Example: The policy data is obtained from the policy administration system; the claim data, incident photographs, and claim adjuster notes are acquired from the claims management system; and the insurance application documents are obtained from the document management system.

Contd … Example: Extraction of the latitude and longitude coordinates of a user from a single JSON field. Extracting the required fields from delimited textual data (e.g., web server log files) is possible if the underlying Big Data solution can directly process those files.

Contd … Data may be spread across multiple datasets, requiring that datasets be joined together via common fields, for example date or ID. Example: Policy data and claim data can be joined using the Policy ID field.

Contd … Data analysis can be classified as confirmatory analysis or exploratory analysis. Exploratory Data Analysis involves things like: establishing the data's underlying structure, identifying mistakes and missing data, establishing the key variables, spotting anomalies, checking assumptions, and testing hypotheses in relation to a specific model. Confirmatory Data Analysis involves things like: testing hypotheses, producing estimates with a specified level of precision, regression analysis, and variance analysis.
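A minimal sketch of the extraction step described above. The record layout (a "location" field with "lat" and "lon" keys), the pipe-delimited log line, and both function names are assumptions made for illustration; they do not come from the slides.

```python
import csv
import io
import json

# Hypothetical record: one JSON field holding a user's location.
raw_record = '{"user_id": 17, "location": {"lat": 20.2961, "lon": 85.8245}}'

# Hypothetical pipe-delimited web server log line.
log_line = "192.0.2.10|2024-08-15T10:00:00|/index.html|200"


def extract_coordinates(record_json):
    """Return (latitude, longitude) from the nested "location" field, or None if absent."""
    record = json.loads(record_json)
    location = record.get("location")
    if not isinstance(location, dict):
        return None
    lat, lon = location.get("lat"), location.get("lon")
    if lat is None or lon is None:
        return None
    return float(lat), float(lon)


def extract_fields(line, wanted=(0, 3), delimiter="|"):
    """Return only the wanted column positions from one delimited text line."""
    row = next(csv.reader(io.StringIO(line), delimiter=delimiter))
    return [row[i] for i in wanted]


coordinates = extract_coordinates(raw_record)
client_and_status = extract_fields(log_line)
```

Returning None for a missing field, rather than raising, lets the later cleansing stage decide how to treat incomplete records.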
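The join on Policy ID can be sketched as follows. The field names and sample rows are hypothetical, and an inner join is assumed (claims without a matching policy are dropped); the slides do not say which join type is intended.

```python
# Hypothetical policy and claim records sharing a "policy_id" field.
policies = [
    {"policy_id": "P-100", "holder": "A. Rao", "premium": 1200},
    {"policy_id": "P-101", "holder": "B. Das", "premium": 950},
]
claims = [
    {"claim_id": "C-1", "policy_id": "P-100", "amount": 400},
    {"claim_id": "C-2", "policy_id": "P-100", "amount": 250},
    {"claim_id": "C-3", "policy_id": "P-999", "amount": 80},  # no matching policy
]


def join_on_policy_id(policy_rows, claim_rows):
    """Inner join: one output row per claim whose policy_id exists in policy_rows."""
    policy_by_id = {p["policy_id"]: p for p in policy_rows}
    joined = []
    for claim in claim_rows:
        policy = policy_by_id.get(claim["policy_id"])
        if policy is not None:
            # Merge the policy columns and the claim columns into one row.
            joined.append({**policy, **claim})
    return joined


joined_rows = join_on_policy_id(policies, claims)
```

Indexing the policies by ID first keeps the join to a single pass over the claims.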

Contd …

Contd …

Types of Big Data Analytics

Descriptive Analytics

Contd … Example

Diagnostic Analytics

Contd … Example

Predictive Analytics

Contd … Example

Prescriptive Analytics

Contd … Example

Big Data Analytics Tools

Big Data Analytics Applications

Contd …

Contd …

Contd …

Contd …

Contd …

Contd …

Machine Learning Machine learning is a technology which enables computers to learn automatically from past data.

Machine Learning

Importance of machine learning in big data analytics: Machine learning provides efficient and automated tools for data gathering, analysis, and integration. Machine learning processes and integrates large amounts of data regardless of its source.

Machine Learning Types

Classification ???

Support Vector Machine (SVM)

Example: Batsman & Bowler classification

Example: contd …

Example: contd …

Example: contd …

Example: contd …

Understanding SVM

Contd …

Types of Kernel Functions

Applications of SVM

Regression

Logistic Regression

Logistic Regression: Example1

Logistic Regression: Contd …

Logistic Regression: Example2

Logistic Regression: Contd …

Linear Vs Logistic Regression: Contd … Linear Regression Logistic Regression

Logistic Regression Applications

Classification Vs Regression Example

Conjoint Analysis: Conjoint analysis is a form of statistical analysis that firms use in market research to understand how customers value different components or features of their products or services. It is based on the principle that any product can be broken down into a set of attributes that ultimately impact users' perceived value of an item or service. Conjoint analysis is a technique used to understand the preference or relative importance that customers give to various attributes of a product while making purchase decisions.

Contd … Conjoint analysis is typically conducted via a specialized survey that asks consumers to rank the importance of the specific features in question. Analyzing the results allows the firm to then assign a value to each one.

Conjoint Analysis: Key Terms
Attributes (Features): The product features that are evaluated by the analysis. Examples (Laptops): Brand, Size, Color, and Battery Life.
Levels: The specifications of each attribute. Examples (Laptop Brands): Samsung, Dell, Apple, and Asus.
Relative importance: Also called "attribute importance"; it shows which of the various attributes of a product or service is more or less important when making a purchasing decision. Example (Laptop): Brand 35%, Price 30%, Size 15%, Battery Life 15%, and Color 5%.
Part-Worths / Utility values: Part-worths, or utility values, are how much weight an attribute level carries with a respondent. The individual factors that lead to a product's overall value to consumers are part-worths. Example (Laptop Brands): Samsung -0.11, Dell 0.10, Apple 0.17, and Asus -0.16.
Profiles: Combinations of attribute levels; used to discover the product with the highest utility value.

Conjoint Analysis: Example Conjoint study on smartphones. Survey: Each respondent is asked to choose between potential product concepts (or alternatives) formed through the combination of attributes and levels. [Table: attributes and the levels of each attribute]

Contd … Calculate a numerical value that measures how much each attribute and level influenced the respondent's choices. Each of these values is called a "preference score" ("part-worth utility" or "utility score"). To avoid multicollinearity (within a group of levels, X3 = 1 means X4 = 1 is not possible), remove one variable from each group.

Contd … Apply regression to the given data. Regression equation: Y = 5 + 0.75 × (VIVO) + 3.25 × (6000mAh) + 1.25 × (20MP). The regression coefficients are called "part-worths". The total worth of the product (option) is calculated from multiple attributes and multiple levels of attributes together. The utility values for the separate parts of the product (assigned to the attributes) are the part-worths.

Contd … Importance given to each attribute by the consumer. Note: Range = maximum value - minimum value; base value = 0.
Attribute: Part-worth range
Brand: 0.75 - 0 = 0.75
Battery: 3.25 - 0 = 3.25
Front Camera: 1.25 - 0 = 1.25
Total range = 0.75 + 3.25 + 1.25 = 5.25
Attribute: Importance
Brand: 0.75 / 5.25 = 14.3%
Battery: 3.25 / 5.25 = 61.9%
Front Camera: 1.25 / 5.25 = 23.8%

Applications of Conjoint Analysis: Predicting what the market share of a proposed new product or service might be, considering the current alternatives in the market. Understanding consumers' willingness to pay for a proposed new product or service. Quantifying the tradeoffs customers are willing to make among the various attributes or features of the proposed product or service.
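The regression equation can be evaluated for any product profile by treating each level as a 0/1 dummy variable. The sketch below assumes that reading; the intercept and coefficients are taken from the slide, while the function name and the idea of passing a profile as a set of level names are illustrative choices.

```python
# Intercept and part-worths from the slide's regression equation.
INTERCEPT = 5.0
PART_WORTHS = {"VIVO": 0.75, "6000mAh": 3.25, "20MP": 1.25}


def total_worth(levels_present):
    """Total worth of a profile: intercept plus the part-worth of each level it contains.

    A level that is absent contributes 0 (its dummy variable is 0), which matches
    treating the removed level of each attribute as the base level.
    """
    levels = set(levels_present)
    unknown = levels - set(PART_WORTHS)
    if unknown:
        raise ValueError(f"no part-worth available for: {sorted(unknown)}")
    return INTERCEPT + sum(PART_WORTHS[level] for level in levels)


best_profile = total_worth({"VIVO", "6000mAh", "20MP"})
base_profile = total_worth(set())
battery_only = total_worth({"6000mAh"})
```

Comparing `battery_only` with `base_profile` isolates the contribution of a single level, which is what a part-worth measures.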
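The importance calculation can be checked with a short sketch. It assumes each attribute has a base level with part-worth 0 plus the level from the regression above; the function and variable names are hypothetical.

```python
# Part-worths per attribute: base level (0) plus the level estimated by the regression.
part_worth_levels = {
    "Brand": [0.0, 0.75],
    "Battery": [0.0, 3.25],
    "Front Camera": [0.0, 1.25],
}


def attribute_importance(levels_by_attribute):
    """Importance (%) of each attribute = its part-worth range / sum of all ranges."""
    ranges = {
        attribute: max(values) - min(values)
        for attribute, values in levels_by_attribute.items()
    }
    total_range = sum(ranges.values())
    if total_range == 0:
        raise ValueError("all part-worth ranges are zero")
    return {attribute: 100.0 * r / total_range for attribute, r in ranges.items()}


importance = attribute_importance(part_worth_levels)
```

Because every range is divided by the same total (5.25 here), the importances always sum to 100%.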

Correlation Analysis: Correlation refers to the statistical technique used to measure the extent to which two variables are related. For example, the height and weight of a person are related: taller people tend to be heavier than shorter people. Three types of correlation: Positive Correlation: The two variables move in the same direction; both increase or both decrease. Negative Correlation: The variables change in opposite directions, i.e., one variable decreases while the other increases. No Correlation: The variables behave independently of each other and have no linear relationship.

Correlation Analysis

Contd …

Correlation Coefficient: The correlation coefficient (r) is the measure of the strength of the linear relationship between two variables. If r < 0, it implies negative correlation. If r > 0, it implies positive correlation. If r = 0, it implies no linear correlation. Two types: The Pearson correlation coefficient measures the strength of the linear relationship between two variables: $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$, where r = coefficient of correlation, $\bar{x}$ = mean of the x-variable, $\bar{y}$ = mean of the y-variable, and $x_i$, $y_i$ = samples of the variables x and y.

Contd … Spearman's rank correlation measures the strength and direction of association between two ranked variables. It measures the monotonicity of the relation between two variables, i.e., how well the relationship between them can be represented by a monotonic function: $\rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}$, where $\rho$ = Spearman rank correlation, $d_i$ = difference between the ranks of corresponding variables, and n = number of observations.
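A minimal sketch of both coefficients, written from the standard textbook definitions: Pearson's r as the sum of products of deviations from the means divided by the square root of the product of the summed squared deviations, and Spearman's rho as 1 - 6 Σd² / (n(n² - 1)). The rank helper assumes there are no tied values, and the height and weight figures are invented for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = math.sqrt(
        sum((x - x_bar) ** 2 for x in xs) * sum((y - y_bar) ** 2 for y in ys)
    )
    if denominator == 0:
        raise ValueError("a constant sample has no defined correlation")
    return numerator / denominator


def ranks(values):
    """Rank 1 = smallest value. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman rank correlation via the rank-difference formula (no ties)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical heights (cm) and weights (kg).
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 69, 75, 90]
r_height_weight = pearson_r(heights, weights)
rho_height_weight = spearman_rho(heights, weights)
```

For a relationship that is monotonic but not linear (for example y = x³), Spearman's rho is exactly 1 while Pearson's r is below 1, which is the practical difference between the two measures.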

Real-life Applications

Limitations and Benefits of Correlation Analysis. Benefits: Reduce Time to Detection: In anomaly detection, working with a vast number of metrics and surfacing correlated anomalous metrics helps draw relationships that reduce time to detection (TTD) and support shortened time to remediation (TTR). Reduce Alert Fatigue: In anomaly detection, it reduces alert fatigue by filtering irrelevant anomalies (based on the correlation) and grouping correlated anomalies into a single alert. Reduce Costs: Correlation analysis helps significantly reduce the costs associated with the time spent investigating meaningless or duplicative alerts. Limitations: Correlation does not indicate causality. We cannot infer that one variable is the cause of another even when the correlation between them is very high.

Time Series Analysis: Time series analysis is a way of studying the characteristics of the response variable with respect to time as the independent variable. It is used to predict future values based on previously observed values. A time series is a sequence of observations ordered in time. The time unit can be years, months, weeks, days, hours, minutes, or seconds. The time variable/feature is the independent variable and supports the target variable in predicting the results. Time Series Analysis (TSA) is used in different fields for time-based predictions, such as weather forecasting models, stock market predictions, signal processing, control systems, and communications systems.

Importance of Time Series Analysis: E.g., stock price prediction. E.g., sweet sales increase during the festive season. E.g., checking whether a goal is met or not (100 chocolates sold in a day).

Components of Time series Analysis

Components of Time series Analysis: Contd …

Components of Time series Analysis: Contd …

When not to use Time Series Analysis: E.g., the values are constant: sales are 500 units this month and 500 units next month as well. E.g., the values are generated by a known function.

Data types of Time Series Analysis. Stationary: A stationary dataset has none of the Trend, Seasonality, Cyclical, and Irregularity components of a time series; as a rule of thumb, its mean, variance, and covariance do not change with time. Non-stationary: If the mean, the variance, or the covariance changes with respect to time, the dataset is called non-stationary.

Time Series Analysis: Example Car Sales: 4 years of data are available; forecast the 5th year. Seasonality: the curve repeats in a quarterly pattern every year. Trend: the overall direction of the plot is increasing. Cyclicity: may not be visible (it may be visible for sales data covering 20-30 years). Irregularity: random sale increases.

Contd … MA: Moving Average. CMA: Centered Moving Average.
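A small sketch of the two smoothing steps named above, assuming quarterly data and therefore a window of 4. With an even window a plain moving average falls between two periods, so the centered moving average is taken here as the mean of each pair of adjacent moving averages. The sales figures are invented for illustration and are not the slide's data.

```python
def moving_average(values, window):
    """Simple moving average: mean of each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def centered_moving_average(values, window):
    """Centered moving average.

    For an odd window the simple moving average is already centered on a period.
    For an even window, average each pair of adjacent moving averages so that
    the result lines up with an actual period.
    """
    ma = moving_average(values, window)
    if window % 2 == 1:
        return ma
    return [(a + b) / 2 for a, b in zip(ma, ma[1:])]


# Hypothetical quarterly car sales for two years (Q1..Q4, Q1..Q4).
quarterly_sales = [20, 30, 50, 40, 24, 34, 54, 44]
ma4 = moving_average(quarterly_sales, 4)
cma4 = centered_moving_average(quarterly_sales, 4)
```

With a window of 4, the first centered value corresponds to the third quarter of the first year; dividing the actual sales by the centered values is a common next step for estimating the seasonal component.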

Contd …

Big Data Graph Analytics: Graph analytics refers to the analysis performed on data stored as a graph, such as a knowledge graph. It combines graph theory, statistics, and database technology to model, store, retrieve, and analyze graph-structured data. For instance, it can identify YouTube influencers and which vlog is going viral; recommendation engines are another classic example of graph analytics. In graph analytics, queries are executed via the edges connecting the entities. For relationship-heavy queries, execution on a graph database is often faster than on a relational database.

Graph Database Example

Types of Graph Analytics. Path Analysis: Path analysis involves finding the shortest and widest paths between two nodes. This kind of analysis is used in social network analysis and supply chain optimization. Connectivity Analysis: This helps to determine how many edges flow into a node and how many flow out of that node. Centrality Analysis: This analysis estimates the importance of a node in the network's connectivity. It is used, for example, to identify social media influencers or to rank the most highly accessed web pages. Community Analysis / Network Analysis: This is a distance- and density-based analysis of relationships, used to find the groups of people who frequently interact with each other in a social network. It also helps identify whether individuals are transient and predicts whether the network will grow.

Graph Analytics: Example Identify Social Media Influencers: Influencers in a social network can be found by ranking individuals on their influencing capability. Each node represents an individual, and the edges connecting the nodes denote some relationship. Node 4 might influence two big clusters in this graph because of its crucial position. Node 6 influences 5 individuals.

Contd … Centrality: It is essentially a measure of the importance of a node in a graph. Degree Centrality: The degree of a node in a network is the number of edges incident with it. Closeness Centrality: This centrality measure takes into account the distance of a node to all the other nodes in a network. It is commonly formulated as $C_i = \frac{N - 1}{\sum_{j \ne i} d_{ij}}$, where $C_i$ = closeness centrality of node i, N = total number of nodes in the network, and $d_{ij}$ = length of the shortest path between node i and node j.

Graph Analytics Use Cases. National Security: National intelligence agencies detect unlawful activity using graph analytics. The online activity of both suspected and unsuspected individuals is collected and analyzed to identify non-obvious relationships and potential crimes. Supply Chain Optimization: Transportation networks, supply chain networks, and airline companies use graph analytics algorithms such as shortest path and partitioning to optimize routes. Fraud Detection: Graph analytics is used to detect fraud in businesses that work with networks, such as e-commerce marketplaces, financial institutions, and telecom companies. Healthcare: 2020 was the year of the coronavirus pandemic. Because the virus is highly infectious, graph databases helped governments track its spread. Social Network Analysis: Social media networks such as Instagram, LinkedIn, and Spotify are relationship- and connection-driven applications. Graph analytics is used to identify influencers and communities on social media.
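Connectivity analysis, counting the edges that flow into and out of each node, can be sketched on a directed "follows" graph. The edge list and names are hypothetical.

```python
from collections import Counter

# Hypothetical directed edges: (follower, followed).
follows = [
    ("asha", "ravi"),
    ("meena", "ravi"),
    ("ravi", "asha"),
    ("kiran", "ravi"),
]


def in_out_degrees(edges):
    """Map each node to (in_degree, out_degree) for a list of directed edges."""
    out_degree = Counter(source for source, _ in edges)
    in_degree = Counter(target for _, target in edges)
    nodes = set(out_degree) | set(in_degree)
    # Counter returns 0 for nodes that never appear on one side of an edge.
    return {node: (in_degree[node], out_degree[node]) for node in nodes}


degrees = in_out_degrees(follows)
```

A node with a high in-degree and a low out-degree, like "ravi" here, is followed by many and follows few, which is one simple signal of influence.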
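Degree and closeness centrality can be computed with a breadth-first search on a small undirected, unweighted graph. This is a sketch under stated assumptions: closeness uses the common normalization (N - 1) divided by the sum of shortest-path distances, the graph is assumed to be connected, and the five-node path graph is invented for illustration rather than taken from the slide's figure.

```python
from collections import deque

# Hypothetical undirected graph: a simple path 1 - 2 - 3 - 4 - 5.
edges = [(1, 2), (2, 3), (3, 4), (4, 5)]


def build_adjacency(edge_list):
    """Adjacency sets for an undirected graph."""
    adjacency = {}
    for u, v in edge_list:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def degree_centrality(adjacency):
    """Degree of each node: the number of edges incident with it."""
    return {node: len(neighbours) for node, neighbours in adjacency.items()}


def shortest_path_lengths(adjacency, source):
    """Breadth-first search: hop counts from source to every reachable node."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


def closeness_centrality(adjacency, node):
    """(N - 1) / sum of shortest-path distances from node; requires a connected graph."""
    distance = shortest_path_lengths(adjacency, node)
    total = sum(distance.values())
    if len(distance) != len(adjacency) or total == 0:
        raise ValueError("graph must be connected and have at least two nodes")
    return (len(adjacency) - 1) / total


adjacency = build_adjacency(edges)
degree = degree_centrality(adjacency)
closeness_middle = closeness_centrality(adjacency, 3)
closeness_end = closeness_centrality(adjacency, 1)
```

The middle node scores higher on closeness than an end node even though nodes 2, 3, and 4 share the same degree, which shows why the two measures can rank nodes differently.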

Exploratory Data Analysis: Exploratory Data Analysis (EDA) is a process of describing the data by means of statistical and visualization techniques in order to bring important aspects of that data into focus for further analysis. It helps to gather insights and make better sense of the data, and it removes irregularities and unnecessary values from the data. It allows a machine learning model to predict on the dataset better. It gives more accurate results and also helps in choosing a better machine learning model.

Exploratory Data Analysis: Steps. Data Collection: The process of finding and loading data into the system. Some reliable sites for data collection are Kaggle, GitHub, Machine Learning Repository, etc. Data Cleaning: The process of removing unwanted variables and values from the dataset and getting rid of any irregularities in it. Univariate Analysis: In univariate analysis, data of just one variable is analyzed. A variable in a dataset refers to a single feature/column. Some visual methods include: Histograms: Bar plots in which the frequency of data is represented with rectangular bars. Box plots: The distribution of the data is summarized in the form of boxes.

Contd … Bivariate Analysis: Two variables are compared to find how one feature affects the other. It is done with scatter plots, which plot individual data points, or with correlation matrices, which show the correlation as color hues. Box plots can also be used. Multivariate Analysis: When the data involves three or more variables, the analysis is categorized as multivariate.
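The numbers behind the two univariate plots can be sketched without any plotting library: bin counts for a histogram and the five-number summary for a box plot. The salary figures are hypothetical, and the 1.5 × IQR rule used to flag outliers is a common convention assumed here, not something stated in the slides.

```python
import statistics
from collections import Counter

# Hypothetical salaries (in thousands) from an employee dataset.
salaries = [30, 32, 35, 35, 38, 40, 42, 45, 48, 120]


def histogram_counts(values, bin_width):
    """Count values per bin; the bin labelled b covers [b, b + bin_width)."""
    counts = Counter((value // bin_width) * bin_width for value in values)
    return dict(sorted(counts.items()))


def five_number_summary(values):
    """Minimum, first quartile, median, third quartile, maximum (box plot inputs)."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)


def iqr_outliers(values):
    """Values beyond 1.5 * IQR from the quartiles, drawn as points on a box plot."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [value for value in values if value < low or value > high]


bins = histogram_counts(salaries, 10)
summary = five_number_summary(salaries)
outliers = iqr_outliers(salaries)
```

The single very large salary shows up both as an isolated histogram bin and as an outlier, which is the kind of irregularity the data cleaning step is meant to catch.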

Example Employee Dataset Histogram Boxplot

Contd … Scatter plot Multivariate Analysis

Data Visualization: Data visualization techniques involve generating a graphical or pictorial representation of data, a form that helps you understand the insights in a given dataset. Visualization aims to identify the patterns, trends, correlations, and outliers in datasets.

Data Visualization: Benefits. Patterns in business operations: Data visualization techniques help us determine the patterns of business operations. Identify business trends and relate them to data: These techniques help us identify market trends by collecting data on day-to-day business activities and preparing trend reports, which help track how the business influences the market. Understand current business insights and set goals: Businesses can understand their KPIs, find tangible goals, and plan business strategy, so they can optimize the data behind strategy plans for ongoing activities. Operational and performance analysis: Increase the productivity of the manufacturing unit. Visualization techniques clarify the KPIs (Key Performance Indicators) that depict productivity trends in the manufacturing unit and show where to improve the productivity of the plant.

Data Visualization: Tools

Analytics for Unstructured Data

Analytics for Unstructured Data: Tips

Contd … Keep the business objective(s) in mind: The chosen analytical techniques should match the business objectives. Let's say the objective is to identify a face in an image. An image has features for mapping: face shape, eye color, width of the mouth, and so on. These features can be stored in a flexible semi-structured format, such as JSON documents in a document database like MongoDB. Define metadata for faster data access: Metadata stores information about data. Using metadata, an analyst can quickly find data related to their organization or business objectives. For instance, metadata could include information like the table of contents, title, author, creation date, tags, or number of words for each document.

Contd … Choose the right analytics techniques: To detect a theft in an area using CCTV footage, advanced deep learning techniques like object recognition, face analysis, and crowd analysis are needed. To find the average number of milk cartons bought by families in a particular gated community, simple quantitative analysis and grouping the residents by a criterion such as 0, 1-3, 4-6, or > 6 cartons is sufficient. Identify the right data sources: Analysts need to identify whether they need data from all sources or only a few to get the right data for analysis.

Contd … Evaluate the technologies you'd want to use: Choose the tools that provide scalability, availability, and query capabilities for your particular use case. Get real-time data access: For real-time analytics, it is necessary to have access to new data in real time. For example, fraud prevention is more valuable while the fraudulent activity is happening, and personalized offers are more valuable while the customer is still shopping. Store and integrate data using data lakes: Data lakes unify and store unstructured data from many sources in its native format. Wrangle the unstructured data: Before applying unstructured data analysis techniques, the data must be cleaned so that all the valuable information is present. If there is a lot of noise in the data, the insights will not be accurate.
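A minimal sketch of the metadata idea: each unstructured document gets a small metadata record (title, author, creation date, tags, word count), and lookups run against the metadata instead of the full text. The field names, sample documents, and helper functions are hypothetical.

```python
import json


def build_metadata(title, author, creation_date, tags, text):
    """Describe one document with a small, searchable metadata record."""
    return {
        "title": title,
        "author": author,
        "creation_date": creation_date,
        "tags": sorted(tags),
        "word_count": len(text.split()),
    }


def find_by_tag(catalog, tag):
    """Titles of all documents carrying the given tag, without reading any document text."""
    return [entry["title"] for entry in catalog if tag in entry["tags"]]


# Hypothetical documents.
catalog = [
    build_metadata(
        "Claim adjuster notes", "J. Sen", "2024-08-01",
        {"claims", "fraud"}, "Vehicle damage inconsistent with the reported incident",
    ),
    build_metadata(
        "Call center summary", "R. Nair", "2024-08-03",
        {"support"}, "Customer asked about policy renewal dates",
    ),
]

fraud_documents = find_by_tag(catalog, "fraud")
# Metadata is plain JSON, so it can be stored next to the documents it describes.
catalog_json = json.dumps(catalog, indent=2)
```

Keeping the metadata as JSON fits the semi-structured storage mentioned above and lets new fields be added later without changing existing records.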
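The milk carton example needs nothing more than an average and a grouping. The sketch below uses the bins named above (0, 1-3, 4-6, > 6); the purchase counts and the function name are invented for illustration.

```python
from collections import Counter


def carton_group(cartons):
    """Assign a family's carton count to one of the bins 0, 1-3, 4-6, > 6."""
    if cartons < 0:
        raise ValueError("carton count cannot be negative")
    if cartons == 0:
        return "0"
    if cartons <= 3:
        return "1-3"
    if cartons <= 6:
        return "4-6"
    return "> 6"


# Hypothetical weekly carton counts, one per family.
weekly_cartons = [0, 2, 3, 5, 7, 1, 4, 0, 6, 9]

average_cartons = sum(weekly_cartons) / len(weekly_cartons)
group_counts = Counter(carton_group(count) for count in weekly_cartons)
```

No model training is involved, which is the point of the example: the technique should be only as complex as the question requires.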

Analytics for Unstructured Data: Techniques
