Big data
Big data is a collection of data sets so large or complex that they become difficult to process using traditional data management techniques such as the RDBMS. Data science involves using methods to analyze massive amounts of data and extract the knowledge it contains. The relationship between big data and data science is like the relationship between crude oil and an oil refinery. Both evolved from statistics and traditional data management.
The characteristics of big data are often referred to as the three Vs:
■ Volume—How much data is there?
■ Variety—How diverse are the types of data?
■ Velocity—At what speed is new data generated?
Facets of data
The main categories of data are these:
■ Structured
■ Unstructured
■ Natural language
■ Machine-generated
■ Graph-based
■ Audio, video, and images
■ Streaming
Structured data
Structured data depends on a data model and resides in a fixed field within a record. SQL (Structured Query Language) is the preferred way to manage and query data that resides in databases.
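A minimal sketch of querying structured data with SQL, here via Python's built-in sqlite3 module; the table and column names are made up for illustration.

```python
import sqlite3

# An in-memory database with one structured table: fixed fields per record.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("Alice", 34), ("Bob", 29), ("Carol", 41)])

# SQL is the preferred way to query data that resides in such fixed fields.
rows = conn.execute(
    "SELECT name FROM people WHERE age > 30 ORDER BY name"
).fetchall()
print(rows)  # [('Alice',), ('Carol',)]
```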
Unstructured data
Unstructured data is data that isn't easy to fit into a data model because the content is context-specific or varying. One example is a regular email. Natural language is a special type of unstructured data; processing it requires knowledge of specific data science techniques and linguistics. Techniques include entity recognition, topic recognition, summarization, text completion, and sentiment analysis, but models trained in one domain don't generalize well to other domains.
Machine-generated data
Machine-generated data is data that's automatically created by a computer, process, application, or other machine without human intervention.
Graph-based or network data
A graph is a mathematical structure used to model pair-wise relationships between objects. Graph or network data focuses on the relationship or adjacency of objects. Graph structures use nodes, edges, and properties to represent and store graph data. Graph-based data is a natural way to represent social networks, and its structure allows you to calculate specific metrics such as the influence of a person and the shortest path between two people.
E.g.: your LinkedIn connections, your follower list on Twitter, and your "friends" on Facebook.
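The shortest-path metric mentioned above can be sketched with a breadth-first search over a tiny, made-up social network (the names and adjacency list are purely illustrative):

```python
from collections import deque

# Hypothetical mini social network as an adjacency list; names are made up.
friends = {
    "alice": ["bob", "eve"],
    "bob": ["alice", "carol"],
    "carol": ["bob", "dave"],
    "dave": ["carol", "eve"],
    "eve": ["alice", "dave"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search: returns the fewest-hops path between two people."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

print(shortest_path(friends, "alice", "dave"))  # ['alice', 'eve', 'dave']
```

Dedicated graph libraries (e.g. networkx) offer the same operation plus influence metrics such as degree centrality.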
Audio, image, and video are data types that pose specific challenges to a data scientist. Streaming data flows into the system when an event happens, instead of being loaded into a data store in a batch.
The data science process
Step 1: Setting the research goal
Defining research goals and creating a project charter means understanding the what, the why, and the how of your project. What does the company expect you to do? Why does management place such a value on your research? Is it part of a bigger strategic picture or a "lone wolf" project originating from an opportunity someone detected? This information is then best placed in a project charter.
Spend time understanding the goals and context of your research. Create a project charter. A project charter requires teamwork, and your input covers at least the following:
■ A clear research goal
■ The project mission and context
■ How you're going to perform your analysis
■ What resources you expect to use
■ Proof that it's an achievable project, or proof of concepts
■ Deliverables and a measure of success
■ A timeline
Step 2: Retrieving data
Start with data stored within the company: databases, data marts, data warehouses, and data lakes. The primary goal of a database is data storage, while a data warehouse is designed for reading and analyzing that data. A data mart is a subset of the data warehouse geared toward serving a specific business unit. While data warehouses and data marts are home to preprocessed data, data lakes contain data in its natural or raw format.
Don't be afraid to shop around for data outside the company. Do data quality checks now to prevent problems later.
Step 3: Cleansing, integrating, and transforming data
Cleansing data
Data cleansing is a subprocess of the data science process that focuses on removing errors in your data, so the data becomes a true and consistent representation of the process it originates from. There are two types of errors: interpretation errors, such as taking a person's age at face value when it's greater than 300 years, and inconsistencies between data sources or against your company's standardized values.
DATA ENTRY ERRORS
Humans are only human: they make typos or lose their concentration for a second and introduce an error into the chain. Errors originating from machines include transmission errors or bugs in the extract, transform, and load (ETL) phase.
Most errors of this type are easy to fix with simple assignment statements and if-then-else rules:

    if x == "Godo":
        x = "Good"
    if x == "Bade":
        x = "Bad"

REDUNDANT WHITESPACE
Redundant whitespace causes a mismatch of keys such as "FR " – "FR". In Python you can use the strip() method to remove leading and trailing spaces.
FIXING CAPITAL LETTER MISMATCHES
Capital letter mismatches are fixed by applying a function that returns both strings in lowercase, such as .lower() in Python: "Brazil".lower() == "brazil".lower() should result in True.
IMPOSSIBLE VALUES AND SANITY CHECKS
Sanity checks catch values that can't physically exist, for example: check = 0 <= age <= 120
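The cleansing rules above can be combined into one small sketch; the typo table and the age limit are illustrative examples, not fixed standards.

```python
def clean_value(x):
    """Apply the simple cleansing rules: trim whitespace, normalize case,
    and fix known typos (the typo table here is hypothetical)."""
    x = x.strip()   # redundant whitespace: "FR " -> "FR"
    x = x.lower()   # capital letter mismatches: "Brazil" -> "brazil"
    typos = {"godo": "good", "bade": "bad"}  # known data entry errors
    return typos.get(x, x)

def sane_age(age):
    """Sanity check: ages below 0 or above 120 are impossible values."""
    return 0 <= age <= 120

print(clean_value(" Godo "))  # good
print(sane_age(300))          # False
```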
OUTLIERS
An outlier is an observation that seems to be distant from other observations; more precisely, it's an observation that follows a different logic or generative process than the other observations.
DEALING WITH MISSING VALUES
Common ways to deal with missing values are omitting the observations, setting the value to null, or imputing an estimate such as the mean.
DEVIATIONS FROM A CODE BOOK
A code book is a description of your data, a form of metadata. It contains things such as the number of variables per observation, the number of observations, and what each encoding within a variable means (for instance, "0" equals "negative" and "5" stands for "very positive").
DIFFERENT UNITS OF MEASUREMENT
DIFFERENT LEVELS OF AGGREGATION
Correct errors as early as possible.
Combining data from different data sources
THE DIFFERENT WAYS OF COMBINING DATA
The two main operations are joining tables (enriching the observations of one table with information from another table) and appending or stacking tables (adding the observations of one table to those of another).
JOINING TABLES
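Joining tables can be sketched with pandas, assuming it's available; the tables, keys, and column names below are made up.

```python
import pandas as pd

# Hypothetical tables: client info, and the region each postal code belongs to.
clients = pd.DataFrame({
    "client": ["A", "B", "C"],
    "postal_code": ["1000", "2000", "3000"],
})
regions = pd.DataFrame({
    "postal_code": ["1000", "2000", "3000"],
    "region": ["north", "south", "east"],
})

# Enrich each client row with its region, matching on the postal_code key.
enriched = clients.merge(regions, on="postal_code", how="left")
print(enriched)
```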
APPENDING TABLES
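Appending (stacking) tables with identical columns can be sketched the same way; the monthly tables below are invented for illustration.

```python
import pandas as pd

# Hypothetical monthly sales tables with the same columns.
january = pd.DataFrame({"client": ["A", "B"], "sales": [100, 200]})
february = pd.DataFrame({"client": ["A", "C"], "sales": [150, 300]})

# Appending adds the observations of one table to those of the other.
all_sales = pd.concat([january, february], ignore_index=True)
print(len(all_sales))  # 4
```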
USING VIEWS TO SIMULATE DATA JOINS AND APPENDS
To avoid duplicating data, you can virtually combine tables with views.
Transforming data
Sometimes you'll have to transform the data so it takes a suitable form for data modeling. For example, a relationship of the form y = ae^(bx) becomes linear after taking the logarithm: ln y = ln a + bx.
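A quick numeric check of that transformation, with made-up values for a and b: after taking logs, successive differences of ln y are constant and equal to b.

```python
import math

# Hypothetical exponential relationship y = a * e**(b * x).
a, b = 2.0, 0.5
xs = [0, 1, 2, 3]
ys = [a * math.exp(b * x) for x in xs]

# ln(y) = ln(a) + b*x, so consecutive differences of ln(y) all equal b.
log_ys = [math.log(y) for y in ys]
diffs = [round(log_ys[i + 1] - log_ys[i], 10) for i in range(len(xs) - 1)]
print(diffs)  # [0.5, 0.5, 0.5]
```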
REDUCING THE NUMBER OF VARIABLES
Sometimes you have too many variables and need to reduce their number. Many reduction techniques rely on a distance measure; the Euclidean distance between two points in a two-dimensional plane is sqrt((x1 - x2)^2 + (y1 - y2)^2).
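The two-dimensional Euclidean distance is a one-liner; the points below are arbitrary examples.

```python
import math

def euclidean(p, q):
    """Distance between two 2-D points: sqrt((x1-x2)^2 + (y1-y2)^2)."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

print(euclidean((0, 0), (3, 4)))  # 5.0  (the classic 3-4-5 triangle)
```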
TURNING VARIABLES INTO DUMMIES
Dummy variables can take only two values: true (1) or false (0). They indicate the presence or absence of a categorical effect.
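Creating dummies can be sketched with pandas, assuming it's available; the weekday column is a made-up example.

```python
import pandas as pd

# Hypothetical categorical column turned into 0/1 dummy columns.
df = pd.DataFrame({"weekday": ["mon", "tue", "mon"]})
dummies = pd.get_dummies(df["weekday"], dtype=int)
print(dummies)
```

Each category becomes its own column, with a 1 marking the rows where it occurs.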
Step 4: Exploratory data analysis
During exploratory data analysis you use graphical techniques to gain an understanding of your data and the interactions between variables.
Simple graph
Combined graphs: you can combine simple graphs into, for example, a Pareto diagram, or 80-20 diagram.
Brushing and linking: you combine and link different graphs and tables (or views) so changes in one graph are automatically transferred to the other graphs.
In a histogram a variable is cut into discrete categories and the number of occurrences in each category is summed up.
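A histogram's bucketing-and-counting step can be sketched with the standard library alone; the ages and the 10-year bin width are invented for illustration.

```python
from collections import Counter

# Hypothetical ages, cut into 10-year bins; occurrences per bin are counted.
ages = [23, 27, 31, 35, 36, 42, 44, 45, 48, 51]
bins = Counter((age // 10) * 10 for age in ages)

# A crude text histogram: one '#' per occurrence in each bin.
for start in sorted(bins):
    print(f"{start}-{start + 9}: {'#' * bins[start]}")
```

Plotting libraries such as matplotlib do this binning for you, but the counting logic is the same.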
Step 5: Build the models
You build models with the goal of making better predictions, classifying objects, or gaining an understanding of the system you're modeling.
Building a model is an iterative process. Most models consist of the following main steps:
1 Selection of a modeling technique and variables to enter in the model
2 Execution of the model
3 Diagnosis and model comparison
Model and variable selection
You need to select the variables you want to include in your model and a modeling technique. Consider the following:
■ Must the model be moved to a production environment and, if so, would it be easy to implement?
■ How difficult is the maintenance on the model: how long will it remain relevant if left untouched?
■ Does the model need to be easy to explain?
Model execution
Once you've chosen a model, you need to implement it in code. Most programming languages, such as Python, already have libraries such as StatsModels or Scikit-learn.
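A minimal sketch of model execution with scikit-learn, assuming it's installed; the data is a made-up exact line (y = 2x + 1), so the fitted coefficients are known in advance.

```python
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4]]   # predictor variable
y = [3, 5, 7, 9]           # target: exactly y = 2x + 1

# Fit an ordinary linear regression to the toy data.
model = LinearRegression()
model.fit(X, y)
print(round(model.coef_[0], 2), round(model.intercept_, 2))  # 2.0 1.0
```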
Model diagnostics and model comparison
You'll be building multiple models, from which you then choose the best one based on multiple criteria. A holdout sample is a part of the data you leave out of the model building so it can be used to evaluate the model afterward: you use a fraction of the data to estimate the model, and the other part, the holdout sample, is kept out of the equation.
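The holdout split can be sketched with the standard library; the data, seed, and 80/20 ratio are illustrative choices.

```python
import random

# Keep part of the (made-up) data out of model building for later evaluation.
random.seed(42)              # fixed seed so the split is reproducible
data = list(range(100))
random.shuffle(data)

split = int(len(data) * 0.8)            # 80% to estimate the model
train, holdout = data[:split], data[split:]  # 20% held out for evaluation

print(len(train), len(holdout))  # 80 20
```

Libraries such as scikit-learn offer the same operation as train_test_split.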
Step 6: Presenting findings and building applications on top of them
After you've built a well-performing model, you're ready to present your findings to the world. Sometimes you'll also want to automate your models so they can be reused.