INTRODUCTION TO BIG DATA ANALYTICS.pptx


About This Presentation

Elaborates on the characteristics and various applications of big data.


Slide Content

DEFINITION: Big data is defined as collections of datasets whose volume, velocity, or variety is so large that it is difficult to store, manage, process, and analyze the data using traditional databases and data-processing tools.

Estimates of the amount of data generated every minute on popular online platforms:
- Facebook users share nearly 4.16 million pieces of content
- Twitter users send nearly 300,000 tweets
- Instagram users like nearly 1.73 million photos
- YouTube users upload 300 hours of new video content
- Apple users download nearly 51,000 apps
- Skype users make nearly 110,000 new calls
- Amazon receives 4,300 new visitors
- Uber passengers take 694 rides
- Netflix subscribers stream nearly 77,000 hours of video

SOME EXAMPLES OF BIG DATA
- Stock market data
- Transactional data generated by banking and financial applications
- Data generated by social networks, including text, image, audio and video data
- Machine sensor data collected from sensors embedded in industrial and energy systems for monitoring their health and detecting failures
- Click-stream data generated by web applications, such as e-commerce sites, used to analyze user behavior

CHARACTERISTICS OF BIG DATA

Volume: Big data is a form of data whose volume is so large that it would not fit on a single machine. Therefore, specialized tools and frameworks are required to store, process and analyze such data.

Velocity: Velocity refers to how fast the data is generated. Data from certain sources, such as social media or sensors, can arrive at very high velocities. High-velocity data causes the accumulated volume to become very large in a short span of time.

Variety: Variety refers to the forms of the data. Big data comes in different forms such as structured, unstructured or semi-structured data, including text, image, audio, video and sensor data. Big data systems need to be flexible enough to handle such a variety of data.

Veracity: Veracity refers to how accurate the data is. To extract value from the data, the data needs to be cleaned to remove noise. Data-driven applications can reap the benefits of big data only when the data is meaningful and accurate. Therefore, cleansing of data is important so that incorrect and faulty data can be filtered out.

Value: Value refers to the usefulness of data for the intended purpose. The end goal of any big data analytics system is to extract value from the data.

BIG DATA APPLICATIONS

Domain-Specific Examples of Big Data
- Web: Web Analytics, Performance Monitoring, Ad Targeting & Analytics, Content Recommendation
- Financial: Credit Risk Modeling, Fraud Detection
- Healthcare: Epidemiological Surveillance, Patient Similarity-based Decision Intelligence Applications, Adverse Drug Events Prediction, Detecting Claim Anomalies, Evidence-based Medicine, Real-time Health Monitoring
- IoT: Intrusion Detection, Smart Parking, Smart Roads, Structural Health Monitoring, Smart Irrigation
- Environment: Weather Monitoring, Air Pollution Monitoring, Noise Pollution Monitoring, Forest Fire Detection, River Flood Detection, Water Quality Monitoring
- Logistics & Transportation: Real-time Fleet Tracking, Shipment Monitoring, Remote Vehicle Diagnostics, Route Generation & Scheduling, Hyper-local Delivery, Cab/Taxi Aggregators
- Industry: Machine Diagnosis & Prognosis, Risk Analysis of Industrial Operations, Production Planning and Control
- Retail: Inventory Management, Customer Recommendations, Store Layout Optimization, Demand Forecasting

Web
Web Analytics: The collection and analysis of data on user visits to websites and cloud applications. User visits are logged on the web server: date and time of visit, resource requested, user's IP address, and HTTP status code. A cookie is assigned to the user, which identifies the user during the visit and on subsequent visits.
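
As an illustration, here is a minimal sketch of parsing one web server log entry into the fields listed above; the sample log line and field layout (Common Log Format) are assumptions for illustration.

```python
import re

# Common Log Format fields: IP, identity, user, timestamp, request, status, size.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<resource>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)

# Hypothetical log line of the kind a web server records for each visit.
line = '203.0.113.7 - - [26/Jul/2024:10:15:32 +0000] "GET /products/42 HTTP/1.1" 200 5123'

match = LOG_PATTERN.match(line)
if match:
    visit = match.groupdict()
    # The fields mirror the slide: date/time, resource, IP address, status code.
    print(visit["timestamp"], visit["resource"], visit["ip"], visit["status"])
```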

Performance Monitoring: For performance monitoring, various types of tests can be performed, such as load tests (which evaluate the performance of the system with multiple users and workload levels), stress tests (which load the application to a point where it breaks down), and soak tests (which subject the application to a fixed workload level for long periods of time).

Ad Targeting & Analytics: The two most widely used approaches for Internet advertising are search advertisements and display advertisements.

Content Recommendation: Applications that serve content (such as music and video streaming applications) collect various types of data, such as user search patterns, browsing history, history of content consumed, and user ratings. Recommendation systems use two broad categories of approaches: user-based recommendation (new items are recommended to a user based on how similar users rated those items) and item-based recommendation (new items are recommended to a user based on how the user rated similar items), the latter sketched below.
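
A minimal sketch of item-based recommendation, predicting a rating from the similarity between item rating columns; the ratings matrix is made-up illustrative data, not a prescribed dataset.

```python
import numpy as np

# Rows = users, columns = items; 0 means "not yet rated". Hypothetical data.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine_sim(a, b):
    """Cosine similarity between two item rating vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return (a @ b) / denom if denom else 0.0

def predict(user, item):
    """Estimate a user's rating for an item from items they already rated."""
    rated = [j for j in range(ratings.shape[1]) if ratings[user, j] > 0]
    sims = np.array([cosine_sim(ratings[:, item], ratings[:, j]) for j in rated])
    if sims.sum() == 0:
        return 0.0
    # Weight the user's own ratings by item-item similarity.
    return float(sims @ ratings[user, rated] / sims.sum())

print(predict(user=0, item=2))  # estimated rating of user 0 for item 2
```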

Financial
Credit Risk Modeling: Banking and financial institutions use credit risk modeling to score credit applications and predict whether a borrower will default in the future. Models draw on customer data that includes credit scores obtained from credit bureaus, credit history, account balance data, account transaction data, and the customer's spending patterns (see the sketch below).
Fraud Detection: Detecting credit card fraud, money laundering, and insurance claim fraud.
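
A minimal sketch of the credit-scoring idea as a logistic regression in scikit-learn; the synthetic features and labels are assumptions for illustration, not a real scoring model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical standardized features per applicant:
# credit score, account balance, number of late payments.
X = rng.normal(size=(500, 3))
# Synthetic default labels loosely driven by the features.
y = ((X @ np.array([-1.5, -0.8, 2.0]) + rng.normal(size=500)) > 0).astype(int)

model = LogisticRegression().fit(X, y)

applicant = np.array([[0.4, 1.1, -0.3]])  # one new credit application
print("probability of default:", model.predict_proba(applicant)[0, 1])
```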

Healthcare
Electronic Health Records (EHRs) capture and store information on patient health and provider actions, including individual-level laboratory results and diagnostic, treatment, and demographic data.
Patient Similarity-based Decision Intelligence Applications: Big data frameworks can be used to analyze EHR data and extract the cluster of patient records most similar to a particular target patient.

Adverse Drug Events Prediction: Analyzing EHR data to predict which patients are most at risk of an adverse response to a certain drug, based on the adverse drug reactions of other patients.
Detecting Claim Anomalies: Health insurance companies can leverage big data systems to analyze health insurance claims and detect fraud, abuse, waste, and errors.

Evidence-based Medicine: Big data systems can combine and analyze data from a variety of sources, including individual-level laboratory results and diagnostic, treatment, and demographic data, to match treatments with outcomes and predict patients at risk for a disease.
Real-time Health Monitoring: Wearable electronic devices allow non-invasive, continuous monitoring of physiological parameters. These devices come in various forms, such as belts and wrist-bands. Healthcare providers can analyze the collected data to detect health conditions or anomalies.

Internet of Things
Intrusion Detection: Intrusion detection systems use security cameras and sensors (such as PIR sensors and door sensors) to detect intrusions and raise alerts. Advanced systems can even send detailed alerts, such as an image grab or a short video clip as an email attachment.
Smart Parking: Smart parking is powered by IoT systems that detect the number of empty parking slots and send the information over the Internet to smart parking application back-ends.

Smart Roads: Smart roads equipped with sensors can provide information on driving conditions and travel time estimates, and raise alerts in case of poor driving conditions, traffic congestion, and accidents.
Structural Health Monitoring: Structural health monitoring systems use a network of sensors to monitor the vibration levels in structures such as bridges and buildings. The data collected from these sensors is analyzed to assess the health of the structures.
Smart Irrigation: Smart irrigation systems collect soil moisture measurements in the cloud, where big data systems can analyze the data to plan watering schedules.

Environment
Environment monitoring systems generate high-velocity, high-volume data. Accurate and timely analysis of such data can help in understanding the current status of the environment.
Weather Monitoring: Weather monitoring systems can collect data from a number of attached sensors (such as temperature, humidity, or pressure sensors) and send the data to cloud-based applications and big data analytics back-ends.
Air Pollution Monitoring: Air pollution monitoring systems can monitor the emission of harmful gases (CO2, CO, NO, NO2) by factories and automobiles using gaseous and meteorological sensors.

Noise Pollution Monitoring: Due to growing urban development, noise levels in cities have increased, even becoming alarmingly high in some cities. Noise pollution can cause health hazards for humans due to sleep disruption and stress. Noise pollution monitoring can help in generating noise maps for cities.
Forest Fire Detection: Early detection of forest fires can help in minimizing the damage. Forest fire detection systems use a number of monitoring nodes deployed at different locations in a forest.
River Flood Detection: Early flood warnings can be given by monitoring the water level and flow rate. River flood monitoring systems use a number of sensor nodes that monitor the water level (using ultrasonic sensors) and the flow rate (using flow velocity sensors).

Water Quality Monitoring: Water quality monitoring can be helpful for identifying and controlling water pollution and contamination due to urbanization and industrialization. Maintaining good water quality is important for the health of plant and animal life.

Logistics & Transportation
Shipment Monitoring: Shipment management solutions for transportation systems allow monitoring of the conditions inside containers. For example, containers carrying fresh food produce can be monitored to detect spoilage. Shipment monitoring systems use sensors (such as temperature, pressure, and humidity sensors) to monitor the conditions inside the containers and send the data to the cloud.

Remote Vehicle Diagnostics: Remote vehicle diagnostic systems can detect faults in vehicles, or warn of impending faults, by collecting data on vehicle operation (such as speed, engine RPM, coolant temperature, or fault code numbers).
Route Generation & Scheduling: Route generation and scheduling systems can generate end-to-end routes using combinations of route patterns and transportation modes, and feasible schedules based on the availability of vehicles.

Hyper-local Delivery: Hyper-local delivery platforms are increasingly being used by businesses such as restaurants and grocery stores to expand their reach. These platforms allow customers to order products (such as grocery and food items) using web and mobile applications, and the products are sourced from local stores (or restaurants).
Cab/Taxi Aggregators: On-demand transport technology aggregators (cab/taxi aggregators) allow customers to book cabs using web or mobile applications, and the requests are routed to the nearest available cabs. Cab aggregation platforms use big data systems for real-time processing of requests and dynamic pricing.

Retail: Retailers can use big data systems to boost sales, increase profitability, and improve customer satisfaction.
Customer Recommendations: Big data systems can analyze customer data (such as demographic data, shopping history, or customer feedback) to predict customer preferences. New products can then be recommended to customers based on those preferences, and personalized offers and discounts can be given.

Store Layout Optimization: Big data systems can help in analyzing data on customer shopping patterns and customer feedback to optimize store layouts.
Forecasting Demand: Due to the large number of products, seasonal variations in demand, and changing trends and customer preferences, retailers find it difficult to forecast demand and sales volumes; big data systems can help with these forecasts.

BIG DATA VS TRADITIONAL DATA

1. Structure: Traditional database systems deal with structured data; big data systems deal with structured, semi-structured, and unstructured data.
2. Volume: Traditional data is relatively small (GB); big data is much larger (TB or PB).
3. Data Source: Traditional data sources are centralized and managed in a centralized form; big data sources are distributed and managed in a distributed form.
4. System Configuration: A normal system configuration can process traditional data; a high-end configuration is required to process big data.
5. Data Integration: Easy for traditional data; difficult for big data.
6. Data Store: RDBMS for traditional data; HDFS and NoSQL for big data.
7. Generation Rate: Traditional data is generated per hour or per day; big data is generated far more rapidly (almost every second).
8. Database Tools: Traditional database tools suffice for traditional data; specialized tools are required to operate on big data.
9. Data Structure: Static schema for traditional data; dynamic schema for big data.

RISKS OF BIG DATA
- Security (e.g., the eBay and JPMorgan Chase breaches)
- Privacy
- Costs
- Bad Analytics
- Bad Data

STRUCTURE OF BIG DATA

- Structured Data
- Semi-Structured Data
- Unstructured Data

Structured Data
Data is stored in the form of rows and columns (example: a database). Data resides in fixed fields within a record or file, and similar entities are grouped together to form relations. It is easy to access and query, so the data can easily be used by other programs. SQL (Structured Query Language) is often used to manage structured data stored in databases. Data elements are addressable, so they are efficient to analyze and process.

Structured data accounts for only about 20% of all data. It scales easily as data grows, securing the data is easy, operations such as updating and deleting are easy due to the well-structured form of the data, and data mining is easy, i.e., knowledge can readily be extracted from the data.

Sources of Structured Data: SQL databases, spreadsheets such as Excel, and online forms. (A small SQL sketch follows.)
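
A minimal sketch of querying structured data with SQL, using Python's built-in sqlite3 module; the table and column names are hypothetical, chosen only to illustrate fixed fields and relations.

```python
import sqlite3

# In-memory database holding one small, hypothetical relation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Asha", "Chennai"), (2, "Ravi", "Mumbai"), (3, "Meena", "Chennai")],
)

# Fixed fields and an explicit schema make querying straightforward.
for (name,) in conn.execute("SELECT name FROM customers WHERE city = ?", ("Chennai",)):
    print(name)
```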

Semi-structured Data
The data does not conform to a data model but has some structure. Semi-structured data contains tags and elements (metadata) that are used to group data and describe how it is stored. Due to the lack of a well-defined structure, it cannot be used by computer programs as easily.

The data is not constrained by a fixed schema, and it is possible to view structured data as semi-structured data. It supports users who cannot express their needs in SQL. Queries are less efficient than on structured data, and the storage cost is higher than for structured data.

Sources of Semi-structured Data: e-mails, XML and other markup languages, TCP/IP packets, web pages, and zipped files. (A small parsing sketch follows.)
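
A minimal sketch of reading semi-structured data: the tags themselves describe the structure, so a parser can navigate records even without a fixed schema. The XML snippet is a hypothetical example.

```python
import xml.etree.ElementTree as ET

# Hypothetical semi-structured records: tags (metadata) group and describe the data.
doc = """
<customers>
  <customer id="1"><name>Asha</name><city>Chennai</city></customer>
  <customer id="2"><name>Ravi</name></customer>
</customers>
"""

root = ET.fromstring(doc)
for cust in root.findall("customer"):
    # The second record has no <city>: no fixed schema is enforced.
    name = cust.findtext("name")
    city = cust.findtext("city", default="unknown")
    print(cust.get("id"), name, city)
```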

Unstructured Data
Unstructured data neither conforms to a data model nor has any structure. It is not organized in a pre-defined manner and does not have a pre-defined data model.

Unstructured data requires a lot of storage space, and storing videos, images, audio, etc. is difficult. Due to the unclear structure, operations like update, delete, and search are very difficult, and the storage cost is high compared to structured data.

Sources of Unstructured Data: images (JPEG, GIF, PNG, etc.), videos, memos, reports, Word documents and PowerPoint presentations, and surveys.

Techniques to interpret unstructured data
Data mining:
- Associative rule mining: what goes with what
- Regression analysis: predicting the relationship between two variables
- Collaborative filtering: predicting a user's preference based on the preferences of a group of users

Text analytics (text mining):
- Sentiment analysis (sketched below)
- Text summarization: summarize a topic
- Natural language processing: related to human-computer interaction, enabling computers to understand language
- Noisy text analysis: handling abbreviations, spelling mistakes, and missing punctuation
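
A minimal sketch of lexicon-based sentiment analysis, one of the text mining techniques above; the word lists are tiny, hypothetical stand-ins for a real sentiment lexicon.

```python
# Tiny hypothetical sentiment lexicon; real systems use far larger ones.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def sentiment(text: str) -> str:
    """Label text by counting positive vs. negative lexicon words."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("great product and excellent service"))  # positive
print(sentiment("terrible delivery and poor quality"))   # negative
```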

WEB DATA

Organizations need to collect newly evolving big data sources related to their customers from a variety of extended and newly emerging touch points, such as web browsers, mobile applications, kiosks, social media sites, and more.

Without web data, information is missing on more than 98 percent of web sessions. That information needs to be collected and analyzed alongside the final sales data; the alternative is expensive surveys or research studies that provide data on only a small subset of customers.

An identification number can be matched to each unique customer based on a logon, cookie, or similar piece of information. This creates what might be called a "faceless" customer record.

WHAT WEB DATA REVEALS
- Shopping Behaviors
- Research Behaviors
- Feedback Behaviors

Shopping Behaviors: Identifying how customers come to a site to begin shopping. What search engine do they use? What specific search terms are entered? Do they use a bookmark they created previously?

Once customers are on a site, start to examine all the products they explore. Who read product reviews? Who looked at detailed product specifications? Who looked at shipping information? Who took advantage of any other information available on the site?

Customer Purchase Paths and Preferences: Using web data, it is possible to identify the ways customers arrive at their buying decisions by watching how they navigate a site.

Research Behaviors: Understanding how customers utilize the research content on a site. By tracking across sessions and combining web data with other customer data, it is possible to know whether people researched one day and then bought later on another day, e.g., reviewing a product without buying it, or checking a specification.

Feedback Behaviors: Some of the best information customers can provide is detailed feedback on products and services. The simple fact that customers are willing to take the time to do so indicates that they are engaged with a brand. If their reviews are often positive and are read by other customers, then perhaps it is smart to give such customers special incentives to keep the good words coming.

EVOLUTION OF ANALYTIC SCALABILITY
The convergence of the analytic and data environments:
- massively parallel processing (MPP) architectures
- cloud computing
- grid computing
- MapReduce

A history of scalability: predictive models once required computing all of the statistics manually, then with calculators, and then with computers. Today, many sources of big data can generate terabytes to petabytes of data in days or weeks, if not hours.

It used to be that analytic professionals had to pull all their data together into a separate analytics environment to do analysis. Analysts do what is called "data preparation": they pull data from various sources and merge it all together to create the variables required for an analysis. In the data warehousing world this process is called "extract, transform, and load" (ETL) and is sketched below. Data marts are single-purpose databases; an Enterprise Data Warehouse (EDW) combines the various database systems into one big system.
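
A minimal sketch of the extract-transform-load pattern using pandas; the tables, columns, and output file name are hypothetical.

```python
import pandas as pd

# Extract: pull data from two hypothetical source systems.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "amount": [120.0, 80.0, 45.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "segment": ["gold", "silver"],
})

# Transform: aggregate orders into per-customer variables, then merge.
spend = orders.groupby("customer_id", as_index=False)["amount"].sum()
prepared = customers.merge(spend, on="customer_id", how="left")

# Load: write the prepared analytic table to its destination.
prepared.to_csv("customer_spend.csv", index=False)
print(prepared)
```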

MASSIVELY PARALLEL PROCESSING SYSTEMS (MPP)
An MPP database spreads data out into independent pieces managed by independent storage and central processing unit (CPU) resources. It removes the constraint of having one central server with only a single set of CPUs and disks to manage everything.

There are at least four primary ways for data preparation and scoring to be pushed into a database today:
- SQL push down: many core data preparation tasks can either be translated into SQL by the user, or an analytic tool can generate SQL on the user's behalf and "push it down" to the database (sketched below).
- User-defined functions (UDFs): UDFs extend SQL functionality by allowing a user to define logic that can be executed in the same manner as a native SQL function such as "SELECT Customer, SUM(Sales) ...".
- Embedded processes.
- Predictive Model Markup Language (PMML) scoring.
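
A minimal sketch of the SQL push-down idea: the aggregation runs inside the database engine and only the small summarized result comes back. sqlite3 stands in for the warehouse here, and the table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("A", 10.0), ("A", 25.0), ("B", 7.5)])

# Pushed-down query: the database does the heavy lifting, not the analytic tool.
query = "SELECT customer, SUM(amount) FROM sales GROUP BY customer"
for customer, total in conn.execute(query):
    print(customer, total)
```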

CLOUD COMPUTING
Three criteria for a cloud environment:
- Enterprises incur no infrastructure or capital costs, only operational costs.
- Capacity can be scaled up or down dynamically.
- The underlying hardware can be anywhere geographically.

Five essential characteristics of a cloud environment, per the National Institute of Standards and Technology (NIST):
- On-demand self-service
- Broad network access
- Resource pooling
- Rapid elasticity
- Measured service

The two primary types of cloud environments are public clouds and private clouds.
Public Cloud: Public cloud users are basically loading their data onto a host system, and they are then allocated resources as they need them to use that data.

Advantages:
- Users only pay for what they use; when more capacity is needed, simply pay for the extra resources.
- Resources are available with no hassle.
- It is easy to share data with others regardless of their location, since a public cloud is by definition outside the corporate firewall; anyone can be given permission to log on to the environment created.

Drawbacks:
- There are few performance guarantees in a public cloud; how fast a job will run is not known until it is submitted.
- The perception of security concerns is a big problem.
- It can be expensive if the cloud isn't used wisely, since users are charged for everything they do.

Private Clouds: A private cloud is owned exclusively by one organization and typically housed behind a corporate firewall. It serves the exact same function as a public cloud, but only for the people or teams within a given organization.

Advantage: the organization has complete control over the data and system security.

Grid Computing: A grid configuration can help both cost and performance as more analysts do more analytics and the servers continue to expand in size and number. It falls into the classification of "high-performance computing." Instead of having a single high-end server (or maybe a few of them), a large number of lower-cost machines are put in place.

MapReduce: MapReduce is a parallel programming framework. It is neither a database nor a direct competitor to databases. MapReduce consists of two primary processes that a programmer builds: the "map" step and the "reduce" step. Each MapReduce worker runs the same code against its portion of the data.

The workers do not interact or even have knowledge of each other. Hadoop is a popular open-source implementation of MapReduce from the Apache Software Foundation.

For example, with 20 terabytes of data and 20 MapReduce workers, each worker independently processes roughly one terabyte. (A word-count sketch follows.)
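
A minimal single-machine sketch of the map and reduce steps using the classic word-count example; frameworks such as Hadoop distribute these same two functions across many workers.

```python
from collections import defaultdict
from itertools import chain

def map_step(document):
    """Map: emit a (word, 1) pair for every word in this slice of the data."""
    for word in document.lower().split():
        yield (word, 1)

def reduce_step(pairs):
    """Reduce: sum the counts for each word after grouping by key."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Each document stands in for one worker's independent portion of the data.
documents = ["big data big analytics", "data never sleeps"]
mapped = chain.from_iterable(map_step(d) for d in documents)
print(reduce_step(mapped))  # {'big': 2, 'data': 2, 'analytics': 1, ...}
```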

EVOLUTION OF ANALYTIC PROCESSES
In-database processing is becoming the new standard. In order to take advantage of the scalable in-database approach, analysts need a workspace, or "sandbox," residing directly within the database system.

THE ANALYTIC SANDBOX
For analytic professionals to utilize an enterprise data warehouse or data mart effectively, they need the correct permissions and access to do so. A sandbox, in the analytics context, is a set of resources that enables analytic professionals to experiment and reshape data in whatever fashion they need. An analytic sandbox provides a set of resources with which in-depth analysis can be done.

Sandbox users will also be allowed to load data of their own for brief periods as part of a project, even if that data is not part of the official enterprise data model; when the project is done, the data is deleted.
Analytic Sandbox Benefits: Independence, Flexibility, Efficiency, Freedom, Speed.

An Internal Sandbox: For an internal sandbox, a portion of an enterprise data warehouse or data mart is carved out to serve as the analytic sandbox. In this case, the sandbox is physically located on the production system.

An External Sandbox: For an external sandbox, a physically separate analytic sandbox is created for the testing and development of analytic processes.

A Hybrid Sandbox: A hybrid sandbox environment is the combination of an internal and an external sandbox.

ANALYTIC DATA SET
An analytic data set (ADS) is the data that is pulled together in order to create an analysis or model. An ADS is generated by transforming, aggregating, and combining data. The analytic data set helps bridge the gap between efficient storage and ease of use.

Traditional Analytic Data Sets: In a traditional environment, all analytic data sets are created outside of the database. Each analytic professional creates his or her own analytic data sets independently, and an ADS is usually generated from scratch for each individual project. The risks are inconsistencies and repetitious work.

ENTERPRISE ANALYTIC DATA SETS
What an EADS does is condense hundreds or thousands of variables into a handful of tables and views. These tables and views are available to all analytic professionals, applications, and users. The structure of an EADS can be literally one wide table, or a number of tables that can be joined together.

Embedded Scoring Integration: The methods that apply for developing embedded scoring processes are SQL push down, user-defined functions, Predictive Model Markup Language (PMML), and embedded processes.

EVOLUTION OF TOOLS AND METHODS
These include:
- ensemble methods
- commodity models
- analysis of text data

Ensemble Methods: Instead of building a single model with a single technique, multiple models are built using multiple techniques. The process of combining the various results can be anything from a simple average of each model's predictions to a much more complex formula. Certain types of customers, for example, may be scored poorly by one technique but very well by another. For instance, a linear regression, a logistic regression, a decision tree, and a neural network can all be created to predict the likelihood of a customer purchasing a given product; combining them taps "the wisdom of crowds." (A simple averaging sketch follows.)
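
A minimal sketch of the simple-averaging form of an ensemble, combining the probability estimates of several scikit-learn classifiers; the synthetic data and model settings are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))            # synthetic customer features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic purchase labels

# Multiple models built with multiple techniques.
models = [
    LogisticRegression(),
    DecisionTreeClassifier(max_depth=4),
    MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000),
]
for m in models:
    m.fit(X, y)

new_customer = rng.normal(size=(1, 4))
# Ensemble score: a simple average of each model's predicted probability.
scores = [m.predict_proba(new_customer)[0, 1] for m in models]
print("individual:", scores, "ensemble:", float(np.mean(scores)))
```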

Commodity Models: The goal of a commodity model is not to get the best possible model, but to quickly get a model that will lead to a better result than having no model at all. If you had a 30- to 40-million-piece mailing upcoming, it was absolutely worth the investment to build a model; if you had an upcoming mailing of 30,000 pieces for a fairly inexpensive product, there was no way the investment was worth it.

Text Analysis: One of the most rapidly growing methods utilized by organizations today is the analysis of text and other unstructured data sources. Everything from e-mails, to social media commentary from sites like Facebook and Twitter, to online inquiries, to text messages, to call center conversations is captured in bulk. Popular commercial text analysis tools include those offered by Attensity and Clarabridge.

As each different word within a sentence is stressed, the entire meaning changes; this kind of ambiguity is one of the challenges text analysis must handle.

THE EVOLUTION OF ANALYTIC TOOLS
- The Rise of Graphical User Interfaces
- The Explosion of Point Solutions: analytic point solutions are software packages that address a very specific, narrow set of problems, e.g., price optimization applications, fraud applications, and demand forecasting applications.
- The History of Open Source, e.g., Firefox, Linux, the Apache web server, and the R Project for Statistical Computing.
- The History of Data Visualization, from the bar chart to the pie chart.