UNIT-5: Data Analytics with Big Data Analytics Tools
PRITI PRIYADARSANI PRADHAN
Course Details
Theory and methods for big data analytics
Selected machine learning and data mining methods: support vector machine, logistic regression
Statistical analysis techniques: conjoint analysis, correlation analysis
Time series analysis
Big data graph analytics
Exploratory data analysis
Visualization before analysis
Analytics for unstructured data
What is Data Analytics?
Data analytics is the process of analyzing raw data in order to draw out meaningful, actionable insights.
Why Big data analytics is required
Big Data???
Big Data Analytics
Big data analytics is the process of collecting, examining, and analyzing large amounts of data to discover market trends, insights, and patterns that can help companies make better business decisions and prevent fraudulent activities.
Why Big Data Analytics??? Making Smarter and More Efficient Organization
Contd … Optimize Business Operations by Analyzing Customer Behaviour
Contd … Cost Reduction: Patients increasingly use sensor devices at home or outside, which send constant streams of data that can be monitored and analyzed in real time to help them avoid hospitalization by self-managing their conditions. For hospitalized patients, physicians can use predictive analytics to optimize outcomes and reduce readmissions.
Contd … Risk Management
Contd … Product Development and Innovations
Contd … Better Decision Making
Life Cycle of Big Data Analytics
Contd … Presents a clear understanding of the justification, motivation, and goals of carrying out the analysis.
Example: Detection of fraudulent claims for an insurance company
Identification of fraud
Decrease in monetary loss
Budgeting for the acquisition of additional data quality and cleansing tools and newer data visualization technologies
Contd … To provide insight, identify as many types of related data sources as possible; a wider range of sources helps to find hidden patterns and correlations. A number of internal and external datasets are identified.
Example: Internal data includes policy data, insurance application documents, claim data, incident photographs, call center agent notes, and emails. External data includes social media data (Twitter feeds), weather reports, and geographical (GIS) data. The claim data consists of historical claims, each recorded as fraudulent or legitimate.
Contd … Data is gathered from all of the data sources that were identified during the previous stage. Data classified as “corrupt” can include records with missing or nonsensical values or invalid data types.
Example: The policy data is obtained from the policy administration system; the claim data, incident photographs, and claim adjuster notes are acquired from the claims management system; and the insurance application documents are obtained from the document management system.
Contd … Example: extraction of the latitude and longitude coordinates of a user from a single JSON field. Extracting the required fields from delimited textual data (e.g., web server log files) is possible if the underlying Big Data solution can directly process those files.
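The JSON extraction mentioned on this slide can be sketched in a few lines of Python. This is only an illustration: the record layout and the field names (`user`, `location`, `lat`, `lon`) are assumptions, not taken from the slides.

```python
import json

# Hypothetical record: one JSON field holding a user's location.
raw_field = '{"user": "u123", "location": {"lat": 20.2961, "lon": 85.8245}}'

def extract_coordinates(field_text):
    """Parse one JSON field and return (latitude, longitude) as floats."""
    record = json.loads(field_text)
    location = record["location"]
    return float(location["lat"]), float(location["lon"])

latitude, longitude = extract_coordinates(raw_field)
print(latitude, longitude)
```

A real pipeline would also need to handle records where the location field is missing or malformed.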
Contd … Data may be spread across multiple datasets, requiring that datasets be joined together via common fields, for example date or ID.
Example: Policy data and claim data can be joined using the Policy ID field.
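A join on the Policy ID field might look like the following minimal sketch. The rows, the field names (`policy_id`, `holder`, `premium`, `claim_id`, `amount`), and the helper `join_on_policy_id` are all hypothetical.

```python
# Hypothetical rows; field names are assumptions for illustration only.
policies = [
    {"policy_id": "P1", "holder": "A. Rao", "premium": 12000},
    {"policy_id": "P2", "holder": "S. Das", "premium": 8000},
]
claims = [
    {"claim_id": "C10", "policy_id": "P1", "amount": 50000},
    {"claim_id": "C11", "policy_id": "P1", "amount": 7000},
    {"claim_id": "C12", "policy_id": "P2", "amount": 15000},
]

def join_on_policy_id(policy_rows, claim_rows):
    """Inner join: attach policy fields to every claim with a matching policy_id."""
    policy_by_id = {p["policy_id"]: p for p in policy_rows}
    joined = []
    for claim in claim_rows:
        policy = policy_by_id.get(claim["policy_id"])
        if policy is not None:
            joined.append({**policy, **claim})
    return joined

joined_rows = join_on_policy_id(policies, claims)
for row in joined_rows:
    print(row)
```

Claims without a matching policy are dropped here (an inner join); whether that is appropriate depends on the analysis.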
Contd … Data analysis can be classified as confirmatory analysis or exploratory analysis.
Exploratory Data Analysis involves things like: establishing the data’s underlying structure, identifying mistakes and missing data, establishing the key variables, spotting anomalies, checking assumptions, and testing hypotheses in relation to a specific model.
Confirmatory Data Analysis involves things like: testing hypotheses, producing estimates with a specified level of precision, regression analysis, and variance analysis.
Types of Big Data Analytics
Descriptive Analytics
Contd … Example
Diagnostic Analytics
Contd … Example
Predictive Analytics
Contd … Example
Prescriptive Analytics
Contd … Example
Big Data Analytics Tools
Big Data Analytics Applications
Machine Learning
Machine learning is a technology which enables computers to learn automatically from past data.
Importance of machine learning in big data analytics
Machine learning provides efficient and automated tools for data gathering, analysis, and integration.
Machine learning processes and integrates large amounts of data regardless of its source.
Machine Learning Types
Classification ???
Support Vector Machine (SVM)
Example: Batsman & Bowler classification
Understanding SVM
Types of Kernel Functions
Applications of SVM
Regression
Logistic Regression
Logistic Regression: Example1
Logistic Regression: Contd …
Logistic Regression: Example2
Logistic Regression: Contd …
Linear Vs Logistic Regression: Contd … Linear Regression Logistic Regression
Logistic Regression Applications
Classification Vs Regression Example
Conjoint Analysis
Conjoint analysis is a form of statistical analysis that firms use in market research to understand how customers value different components or features of their products or services. It is based on the principle that any product can be broken down into a set of attributes that ultimately affect users’ perceived value of an item or service.
It is used to understand the preference or relative importance that customers give to the various attributes of a product while making purchase decisions.
Contd … Conjoint analysis is typically conducted via a specialized survey that asks consumers to rank the importance of the specific features in question. Analyzing the results allows the firm to then assign a value to each one.
Conjoint Analysis: Key Terms
Attributes (Features): The product features that are evaluated by the analysis. Example (Laptops): Brand, Size, Color, and Battery Life.
Levels: The specifications of each attribute. Example (Laptop Brands): Samsung, Dell, Apple, and Asus.
Relative importance: Also called “attribute importance”; it shows which of the various attributes of a product/service is more or less important when making a purchasing decision. Example (Laptop): Brand 35%, Price 30%, Size 15%, Battery Life 15%, and Color 5%.
Part-Worths/Utility values: Part-worths, or utility values, are how much weight an attribute level carries with a respondent. They are the individual factors that make up a product’s overall value to consumers. Example (Laptop Brands): Samsung – 0.11, Dell 0.10, Apple 0.17, and Asus -0.16.
Profiles: Discover the ultimate product with the highest utility value.
Conjoint Analysis: Example
Conjoint study on smartphones
Survey: Each respondent is asked to choose between potential product concepts (or alternatives) formed by combining attributes and levels.
Attributes | Levels of each attribute
Contd … Calculate a numerical value that measures how much each attribute and level influenced the respondent’s choices. Each of these values is called a “preference score” (also “part-worth utility” or “utility score”).
To avoid multicollinearity (X3 = 1 means X4 is not possible), remove one variable from each group.
Contd … Apply regression to the given data:
Regression equation: Y = 5 + 0.75 × (VIVO) + 3.25 × (6000mAh) + 1.25 × (20MP)
The regression coefficients are called “part-worths”.
The total worth of the product (option) is calculated from multiple attributes and multiple levels of attributes together.
The utility values for the separate parts of the product (assigned to the attributes) are the part-worths.
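The regression equation on this slide can be evaluated directly. The sketch below assumes each level is a 0/1 dummy variable (present or absent in a profile), which matches the dummy coding described on the previous slide; the function name `total_worth` is hypothetical.

```python
# Intercept and part-worths copied from the regression equation on the slide.
INTERCEPT = 5.0
PART_WORTHS = {"VIVO": 0.75, "6000mAh": 3.25, "20MP": 1.25}

def total_worth(levels_present):
    """Total worth of a profile: intercept plus part-worths of the levels it has.

    Levels that are absent (the dropped base levels) contribute 0.
    """
    return INTERCEPT + sum(PART_WORTHS[level] for level in levels_present)

print(total_worth({"VIVO", "6000mAh", "20MP"}))  # 10.25
print(total_worth({"6000mAh"}))                  # 8.25
print(total_worth(set()))                        # 5.0 (all base levels)
```

Comparing these totals across candidate profiles is how the profile with the highest utility is identified.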
Contd … Importance given to each attribute by the consumer
Note: Range = maximum value - minimum value; base value = 0
Attribute part-worth range:
Brand: 0.75 - 0 = 0.75
Battery: 3.25 - 0 = 3.25
Front Camera: 1.25 - 0 = 1.25
Total range = 0.75 + 3.25 + 1.25 = 5.25
Attribute importance:
Brand: 0.75 / 5.25 = 14.3%
Battery: 3.25 / 5.25 = 61.9%
Front Camera: 1.25 / 5.25 = 23.8%
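The importance calculation can be reproduced with a short script. It is a minimal sketch that assumes each attribute has just two levels, the base level (part-worth 0) and the level from the regression equation; the function name `attribute_importance` is hypothetical.

```python
# Part-worths per attribute: base level (0) plus the level from the regression.
part_worth_levels = {
    "Brand": [0.0, 0.75],
    "Battery": [0.0, 3.25],
    "Front Camera": [0.0, 1.25],
}

def attribute_importance(levels):
    """Importance (%) of each attribute = its part-worth range / total range."""
    ranges = {attr: max(vals) - min(vals) for attr, vals in levels.items()}
    total_range = sum(ranges.values())
    return {attr: 100 * r / total_range for attr, r in ranges.items()}

importance = attribute_importance(part_worth_levels)
for attribute, percent in importance.items():
    print(f"{attribute}: {percent:.1f}%")
```

The three percentages sum to 100, with Battery dominating because its part-worth range is the largest.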
Applications of Conjoint Analysis
Predicting what the market share of a proposed new product or service might be, considering the current alternatives in the market.
Understanding consumers’ willingness to pay for a proposed new product or service.
Quantifying the tradeoffs customers are willing to make among the various attributes or features of the proposed product/service.
Correlation Analysis
Correlation is a statistical technique used to measure the extent to which two variables are related. For example, the height and weight of a person are related: taller people tend to be heavier than shorter people.
Three types of correlation:
Positive Correlation: The two variables move in the same direction, i.e., they increase together or decrease together.
Negative Correlation: The variables change in opposite directions, i.e., one variable decreases while the other increases.
No Correlation: The variables have no linear relationship; a change in one tells us nothing about the other.
Correlation Coefficient
The correlation coefficient (r) is a measure of the strength of the linear relationship between two variables.
If r < 0, it implies negative correlation.
If r > 0, it implies positive correlation.
If r = 0, it implies no correlation.
Two types:
Pearson correlation coefficient: measures the strength of the linear relationship between two variables and their association.
$$ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}} $$
where r = coefficient of correlation, \bar{x} = mean of the x-variable, \bar{y} = mean of the y-variable, and x_i, y_i = samples of the variables x and y.
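As a rough illustration, the Pearson coefficient can be computed from its standard definition (sum of products of deviations from the means, divided by the square root of the product of the summed squared deviations). The height and weight figures below are made up for the example, and `pearson_r` is a hypothetical helper.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples.

    Raises ZeroDivisionError if either sample is constant (zero variance).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length (at least 2 points)")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    co_deviation = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    x_spread = sum((x - x_bar) ** 2 for x in xs)
    y_spread = sum((y - y_bar) ** 2 for y in ys)
    return co_deviation / sqrt(x_spread * y_spread)

# Hypothetical data: heights (cm) and weights (kg) of five people.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 75, 88]
print(round(pearson_r(heights_cm, weights_kg), 3))
```

A value close to +1 here reflects the height and weight example from the correlation slide: taller people in the sample tend to be heavier.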
Contd … Spearman’s rank correlation measures the strength and direction of association between two ranked variables. It measures the monotonicity of the relation between two variables, i.e., how well the relationship between them can be represented using a monotonic function.
$$ \rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)} $$
where ρ = Spearman rank correlation, d_i = difference between the ranks of corresponding variables, and n = number of observations.
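A minimal sketch of Spearman's rank correlation follows, using the usual rank-difference formula ρ = 1 - 6 Σ d² / (n(n² - 1)). It assumes there are no tied values (ties need averaged ranks, which this sketch does not handle); the data and helper names are hypothetical.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman rank correlation via the rank-difference formula (no ties)."""
    n = len(xs)
    rank_x, rank_y = ranks(xs), ranks(ys)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))

# Hypothetical data: hours studied vs. exam score.
hours = [2, 4, 6, 8, 10]
scores = [50, 65, 60, 80, 90]
print(spearman_rho(hours, scores))

# A monotonic but non-linear relation still gives rho = 1.
print(spearman_rho([1, 2, 3, 4, 5], [1, 4, 9, 16, 25]))
```

The second call shows the point about monotonicity: the relation y = x² is not linear, yet the ranks agree perfectly.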
Real-life Applications
Limitations and Benefits of Correlation Analysis
Benefits:
Reduce Time to Detection: In anomaly detection, working with a vast number of metrics and surfacing correlated anomalous metrics helps draw relationships that reduce time to detection (TTD) and support a shorter time to remediation (TTR).
Reduce Alert Fatigue: In anomaly detection, it reduces alert fatigue by filtering irrelevant anomalies (based on the correlation) and grouping correlated anomalies into a single alert.
Reduce Costs: Correlation analysis helps significantly reduce the costs associated with the time spent investigating meaningless or duplicative alerts.
Limitations:
Correlation does not indicate causality. We cannot infer that one variable is the cause of another even when the relationship between them is very strong.
Time series Analysis
Time series analysis is a way of studying the characteristics of a response variable with time as the independent variable. It is used to predict future values based on previously observed values.
A time series is a sequence of observations ordered in time. The interval may be years, months, weeks, days, hours, minutes, or seconds.
The time variable/feature is the independent variable and supports the target variable in predicting the results.
Time Series Analysis (TSA) is used in different fields for time-based predictions, such as weather forecasting models, stock market predictions, signal processing, control systems, and communications systems.
Importance of Time series Analysis
E.g., stock price prediction
E.g., sweet sales increase during the festive season
E.g., checking whether a goal is met or not (100 chocolates sold in a day)
Components of Time series Analysis
Components of Time series Analysis: Contd …
Components of Time series Analysis: Contd …
When not to use Time series Analysis E.g., Sales of current month 500 units, next month is also 500 units E.g., function is used.
Data types of Time series Analysis
Stationary: A stationary dataset follows the rules of thumb below and has no trend, seasonality, cyclical, or irregularity components.
Non-stationary: If the mean, variance, or covariance changes with respect to time, the dataset is called non-stationary.
Time series Analysis: Example
Car Sales: 4 years of data are available; forecast for the 5th year
Seasonality: the curve repeats in a pattern every quarter (yearly)
Trend: the overall direction of the plot is increasing
Cyclicity: may not be visible (it may be for sales data covering 20-30 years)
Irregularity: random increases in sales
Contd … MA: Moving Average; CMA: Centered Moving Average
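The two smoothing steps named here can be sketched as follows. The sketch assumes quarterly data with a window of 4, where the centered moving average is taken as the mean of each pair of adjacent 4-quarter moving averages (the usual convention for an even window). The sales figures are hypothetical, not the car sales data from the slide.

```python
def moving_average(values, window):
    """Simple moving average over every full window of the series."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

def centered_moving_average(values, window):
    """Centered MA: for an even window, average adjacent moving averages."""
    ma = moving_average(values, window)
    if window % 2 == 1:
        return ma  # an odd window is already centered on an observation
    return [(a + b) / 2 for a, b in zip(ma, ma[1:])]

# Hypothetical quarterly sales for two years (Q1..Q4, Q1..Q4).
quarterly_sales = [10, 20, 30, 40, 20, 30, 40, 50]

ma4 = moving_average(quarterly_sales, 4)
cma4 = centered_moving_average(quarterly_sales, 4)
print(ma4)   # [25.0, 27.5, 30.0, 32.5, 35.0]
print(cma4)  # [26.25, 28.75, 31.25, 33.75]
```

With a window of 4, the first centered value lines up with the third observation, so the smoothed series is shorter than the original at both ends.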
Big Data graph Analytics
Graph analytics refers to analysis performed on data stored as a knowledge graph. It combines graph theory, statistics, and database technology to model, store, retrieve, and analyze graph-structured data.
Examples include identifying YouTube influencers and which vlog is going viral; recommendation engines are another classic example of graph analytics.
In graph analytics, queries are executed via the edges connecting the entities. For relationship-heavy queries, execution on a graph database is typically faster than on a relational database.
Graph Database Example
Types of Graph Analytics
Path Analysis: Finds the shortest and widest paths between two nodes. This kind of analysis is used in social network analysis and supply chain optimization.
Connectivity Analysis: Determines how many edges flow into a node and how many flow out of it.
Centrality Analysis: Estimates the importance of a node in the network’s connectivity, for example to determine social media influencers or to rank the most highly accessed web pages.
Community Analysis / Network Analysis: A distance- and density-based analysis of relationships, used to find groups of people who frequently interact with each other in a social network. It also helps identify whether individuals are transient and predicts whether the network will grow.
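Connectivity analysis, as described above, amounts to counting incoming and outgoing edges per node. The sketch below uses a small hypothetical directed edge list (for instance "follows" relationships); the node names and the helper `degree_counts` are assumptions.

```python
from collections import Counter

# Hypothetical directed edges (source, target).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")]

def degree_counts(edge_list):
    """Return (in_degree, out_degree) counters for a directed edge list."""
    out_degree = Counter(source for source, _ in edge_list)
    in_degree = Counter(target for _, target in edge_list)
    return in_degree, out_degree

in_degree, out_degree = degree_counts(edges)
for node in sorted({n for edge in edges for n in edge}):
    print(node, "in:", in_degree[node], "out:", out_degree[node])
```

A `Counter` returns 0 for nodes with no incoming (or outgoing) edges, so node D correctly shows an in-degree of 0.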
Graph Analytics: Example
Identify Social Media Influencers: Influencers in a social network can be found by ranking individuals by their influencing capability.
Each node represents an individual, and the edges connecting these nodes denote some relationship.
Node 4 might influence two big clusters in this graph because of its crucial position.
Node 6 influences 5 individuals.
Contd … Centrality: A measure of the importance of a node in a graph.
Degree Centrality: The degree of a node in a network is the number of edges incident with it.
Closeness Centrality: This measure takes into account the distance of a node to all the other nodes in a network. It is formulated as
$$ C_i = \frac{N - 1}{\sum_{j \neq i} d_{ij}} $$
where C_i = closeness centrality of node i, N = total number of nodes in the network, and d_ij = length of the shortest path between node i and node j.
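Both measures can be computed on a small graph as in the sketch below. It assumes the common normalised form of closeness, (N - 1) divided by the sum of shortest-path distances from the node, and a connected, unweighted, undirected graph. The graph itself is hypothetical and is not the one pictured on the example slide.

```python
from collections import deque

# Hypothetical undirected graph as an adjacency list.
graph = {
    1: [2, 3],
    2: [1, 3],
    3: [1, 2, 4],
    4: [3, 5],
    5: [4],
}

def degree_centrality(g, node):
    """Number of edges incident with the node."""
    return len(g[node])

def shortest_path_lengths(g, source):
    """Breadth-first search: hop distance from source to every reachable node."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbour in g[current]:
            if neighbour not in distance:
                distance[neighbour] = distance[current] + 1
                queue.append(neighbour)
    return distance

def closeness_centrality(g, node):
    """(N - 1) / sum of shortest-path distances; assumes a connected graph."""
    distance = shortest_path_lengths(g, node)
    total = sum(d for other, d in distance.items() if other != node)
    return (len(g) - 1) / total

for node in graph:
    print(node, degree_centrality(graph, node), round(closeness_centrality(graph, node), 3))
```

In this toy graph node 3 has both the highest degree and the highest closeness, which is the kind of node an influencer ranking would surface.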
Graph Analytics Use Cases
National Security: National intelligence agencies detect unlawful activity using graph analytics. The online activity of both suspected and non-suspected individuals is collected and analyzed to identify non-obvious relationships and potential crimes.
Supply Chain Optimization: Transportation networks, supply chain networks, and airline companies use graph analytics algorithms such as shortest path and partitioning as tools to optimize routes.
Fraud Detection: Graph analytics is used to detect fraud in businesses that work with networks, such as e-commerce marketplaces, financial institutions, and telecom companies.
Healthcare: During the 2020 coronavirus pandemic, graph databases helped governments track the spread of the highly infectious virus.
Social Network Analysis: Social media networks such as Instagram, LinkedIn, and Spotify are relationship- and connection-driven applications. Graph analytics is applied to identify influencers and communities on social media.
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is a process of describing the data by means of statistical and visualization techniques in order to bring important aspects of that data into focus for further analysis.
It helps to gather insights and make better sense of the data, and removes irregularities and unnecessary values from it.
It allows a machine learning model to make better predictions on the dataset.
It gives more accurate results and helps to choose a better machine learning model.
Exploratory Data Analysis: Steps
Data Collection: The process of finding and loading data into the system. Some reliable sites for data collection are Kaggle, GitHub, Machine Learning Repository, etc.
Data Cleaning: The process of removing unwanted variables and values from the dataset and getting rid of any irregularities in it.
Univariate Analysis: Data of just one variable is analyzed. A variable in a dataset refers to a single feature/column. Some visual methods include:
Histograms: Bar plots in which the frequency of data is represented with rectangular bars.
Box-plots: The information is represented in the form of boxes.
Contd … Bivariate Analysis: Two variables are compared to find how one feature affects the other. It is done with scatter plots, which plot individual data points, or correlation matrices, which show the correlation as color hues. Boxplots can also be used.
Multivariate Analysis: When the data involves three or more variables, the analysis is categorized as multivariate.
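The numbers behind a histogram and a box plot in univariate analysis can be computed without any plotting library, as in this sketch. The `ages` column is hypothetical (it is not the employee dataset from the example slides), and the quartiles use the default method of Python's `statistics.quantiles`, which is one of several quartile conventions.

```python
import statistics
from collections import Counter

# Hypothetical single feature/column: employee ages.
ages = [22, 25, 25, 28, 31, 34, 35, 38, 41, 45, 52, 58]

def histogram_counts(values, bin_width):
    """Frequency per bin; each bin is labelled by its lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum: the values a box plot draws."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)

print(histogram_counts(ages, 10))  # {20: 4, 30: 4, 40: 2, 50: 2}
print(five_number_summary(ages))
```

Plotting libraries derive their bars and boxes from the same quantities, so checking them numerically first is a cheap sanity test.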
Example Employee Dataset Histogram Boxplot
Contd … Scatter plot Multivariate Analysis
Data Visualization
Data visualization techniques generate graphical or pictorial representations of data that help you understand the insights in a given data set.
Visualization aims to identify the patterns, trends, correlations, and outliers in data sets.
Data Visualization: Benefits
Patterns in business operations: Data visualization techniques help us determine the patterns of business operations.
Identify business trends and relate them to data: These techniques help us identify market trends by collecting data on day-to-day business activities and preparing trend reports, which help track how the business influences the market.
Understand current business insights and set goals: Businesses can understand the insights behind their KPIs, find tangible goals, and plan business strategy, so they can optimize the data for strategy plans covering ongoing activities.
Operational and performance analysis: Increase the productivity of the manufacturing unit. Visualization techniques make KPIs (Key Performance Indicators) clearer, depicting productivity trends of the manufacturing unit and showing where to improve the productivity of the plant.
Data Visualization: Tools
Analytics for Unstructured Data
Analytics for Unstructured Data: Tips
Contd … Keep the business objective(s) in mind: The chosen analytical techniques should match the business objectives. Say the objective is to identify a face in an image. An image has features for mapping: face shape, eye color, width of the mouth, and so on. These features can be stored in a flexible semi-structured format, such as JSON documents in a document database like MongoDB.
Define metadata for faster data access: Metadata stores information about data. Using metadata, an analyst can quickly find data related to their organization or business objectives. For instance, metadata could include a table of contents, title, author, creation date, tags, or number of words for each document.
Contd … Choose the right analytics techniques: To detect a theft in an area using CCTV footage, advanced deep learning techniques like object recognition, face analysis, and crowd analysis are needed. To find the average number of milk cartons bought by families in a particular gated community, simple quantitative analysis and grouping the residents by a criterion such as 0, 1-3, 4-6, or > 6 cartons is sufficient.
Identify the right data sources: Analysts need to decide whether they need data from all sources or only a few to get the right data for their analysis.
Contd … Evaluate the technologies you’d want to use: Choose the tools that provide scalability, availability, and query capabilities for your particular use case.
Get real-time data access: For real-time analytics, it is necessary to have access to new data in real time. For example, fraud prevention is more valuable while the fraudulent activity is happening, and personalized offers are more valuable while the customer is still shopping.
Store and integrate data using data lakes: Data lakes unify and store unstructured data from many sources in its native format.
Wrangle the unstructured data: Before applying unstructured data analysis techniques, make sure the data is cleaned and all the valuable information is present. If there is a lot of noise in the data, the insights will not be accurate.
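The "simple quantitative analysis" from the milk-carton tip can be sketched in a few lines: compute the average and group families into the buckets 0, 1-3, 4-6, and > 6. The purchase counts below are made up for illustration, and `bucket` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical number of milk cartons bought by each family in the community.
cartons_per_family = [0, 2, 3, 5, 7, 1, 0, 4, 9, 3]

def bucket(cartons):
    """Map a carton count to one of the groups: 0, 1-3, 4-6, > 6."""
    if cartons == 0:
        return "0"
    if cartons <= 3:
        return "1-3"
    if cartons <= 6:
        return "4-6"
    return "> 6"

average_cartons = sum(cartons_per_family) / len(cartons_per_family)
group_sizes = Counter(bucket(c) for c in cartons_per_family)

print(average_cartons)    # 3.4
print(dict(group_sizes))
```

No deep learning is needed for a question like this; a count and a mean answer it, which is the point of matching the technique to the objective.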