Module 2 Data Collection and Management.pdf

VinayVitekari 146 views 43 slides Aug 31, 2024


Data Science
OE-(6CO371)

Module 2 Data Collection and Management
•Introduction
•Sources of data
•Data collection from different sources
•Exploring and fixing data
•Data storage and management
•Using multiple data sources

Introduction
•What is data collection?
•Data collection is the process of gathering, measuring, and analyzing accurate data from a variety of relevant sources to find answers to research problems, answer questions, evaluate outcomes, and forecast trends and probabilities.
•What’s the goal or purpose of this research?
•What kinds of data are they planning on gathering?
•What methods and procedures will be used to collect, store, and
process the information?

What Are the Different Methods of Data Collection?
•Primary
•This is original, first-hand data collected by the data researchers.

What Are the Different Methods of Data Collection?
•Delphi techniques
•To collect opinions on a particular research question or specific topic, to gain consensus.
•Focus groups
•A group interview of approximately six to twelve people who share similar characteristics or common interests.
•A facilitator guides the group based on a predetermined set of topics. The facilitator creates an environment that encourages participants to share their perceptions and points of view.

What Are the Different Methods of Data Collection?
•Secondary Data Collection
•The information has already been collected by someone else; the researcher consults various data sources, such as:
•Financial Statements
•Sales Reports
•Retailer/Distributor/Dealer Feedback
•Customer Personal Information (e.g., name, address, age, contact info)
•Business Journals
•Government Records (e.g., census, tax records, Social Security info)
•Trade/Business Magazines
•The internet

What are Common Challenges in Data Collection?
•Data quality measures how well a dataset meets criteria for accuracy, completeness, validity, consistency, uniqueness, timeliness and fitness for purpose.
•Data Quality Issues
•Inconsistent Data
•Data Downtime
•Ambiguous Data
•Duplicate Data
•Too Much Data

What are Common Challenges in Data Collection?
•Data Quality Issues
•Inaccurate Data
•Hidden Data
•Finding Relevant Data
•Deciding the Data to Collect
•Dealing With Big Data
•Low Response and Other Research Issues
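
Two of the quality criteria listed above, completeness and uniqueness, are easy to check mechanically. The following is a minimal sketch using only the Python standard library; the function name `quality_report`, the field names, and the sample records are hypothetical and chosen purely for illustration.

```python
from collections import Counter


def quality_report(rows, key, required):
    """Summarise duplicate keys and missing required fields in a list of dicts."""
    # Uniqueness: count how often each key value occurs
    key_counts = Counter(row.get(key) for row in rows)
    duplicates = sorted(k for k, n in key_counts.items() if n > 1)
    # Completeness: record every (key, field) pair whose value is empty
    missing = [
        (row.get(key), field)
        for row in rows
        for field in required
        if row.get(field) in (None, "")
    ]
    return {"rows": len(rows), "duplicate_keys": duplicates, "missing_fields": missing}


# Hypothetical sample records
records = [
    {"id": 1, "name": "Asha", "city": "Pune"},
    {"id": 2, "name": "", "city": "Sangli"},
    {"id": 2, "name": "Ravi", "city": None},
]
report = quality_report(records, key="id", required=["name", "city"])
print(report)
```

A report like this only flags problems; deciding whether to drop, merge, or repair the flagged rows is still a judgment call for the analyst.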

Sources of data
Structured data
•Data has a machine-readable format.
•Data adheres to a predefined data model.
•Data is in a tabular / rectangular format (columns display different attributes or variables, rows display a particular record).
•Data can be entered, stored, queried, or analysed by machines.
•Analysts can leverage the model to know how data is recorded: it defines the different attributes present and provides information about the data types and restrictions on their values.
•Examples: Names, dates, phone numbers, currency or prices, heights or weights, word count or file size of a document, credit card numbers, and so on.

Unstructured data
•Data requires a human to interpret.
•Data need not adhere to any predefined model.
•Data is in the form of social media feeds, results of research and development, surveys, call records, and so on.
•Data requires human help to manually catalogue the data.
•Analysts can use machines to read each word or sentence, but not to interpret the meaning. (This is where machine learning and other elements of artificial intelligence come into play.)
•Examples: Images (both human- and machine-generated), video files, audio files, social media posts, product reviews, mobile SMS, and so on.

•Semi-structured data
•Some data is neither structured nor unstructured, which is called semi-structured data. Email is an example of semi-structured data. Email headers contain metadata like the date, language, and recipient's email address, which are structured data. However, the email body, which contains your message, is unstructured.
•Big data
•The term 'big data' is used to describe large, complex datasets of any type: structured, unstructured, or even semi-structured. An example is the amount of data being created, or made available, especially by large online services (YouTube, Netflix, Salesforce, etc.).
•Big data has three key properties: volume, variety, and velocity. Each of these properties presents unique challenges.
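
The email example of semi-structured data can be made concrete with Python's standard `email` package. This is only an illustrative sketch: the message text and addresses are made up. The headers parse into clean name/value pairs (structured), while the body comes back as one free-text string (unstructured).

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# A made-up message: headers first, then a blank line, then the body
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Date: Mon, 02 Sep 2024 10:15:00 +0000\n"
    "Subject: Meeting notes\n"
    "\n"
    "Hi Bob, here are the notes from today. Call me if anything is unclear.\n"
)

msg = message_from_string(raw)

# Structured part: each header has a known name and a parseable value
headers = {name: msg[name] for name in ("From", "To", "Date", "Subject")}
sent_at = parsedate_to_datetime(headers["Date"])

# Unstructured part: the body is free text with no predefined model
body = msg.get_payload()

print(headers)
print(sent_at.year, repr(body))
```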

Sources of data
•Internal data
•Internal data is data captured by your organizational processes. Your organization may have machine-generated data available from sensors or devices used to manufacture a product, or recorded by the product itself (e.g. smartphones or IoT devices).
•For example:
•transactional data (customer purchases and staff pay)
•email marketing metrics (email opens, click rates)
•information in customer profiles (names, addresses)
•records of customer interactions (email queries, support calls)
•online activity (placing items in an online shopping cart)

•Third-party analytics
•In some cases, you may not have the capacity to capture data, in which case third-party analytics can be used. Third-party web analytics services can provide cost-effective collection and analysis and evaluate how your website performs over time, or against averages across the provider's customer base.
•For example:
•Google Analytics is a popular tool and provides businesses with the ability to analyse and better understand how users find and use their websites and pages. For more privacy-friendly analysis, such as what the government or health sectors choose to use, try Piwik PRO Analytics.

•External data
•External data can include almost anything from historical demographic data to market prices, or weather conditions to social media trends. Organizations use external data to analyse and model economic, political, social, or environmental factors that influence their business.
•For example:
•Open sources (data.gov.uk) [3]
•Social media data (Twitter, Facebook, or LinkedIn)
•Paid sources (Thomson Reuters or Westlaw)

•Open data
•Open data is accessible to everyone and free to use. However, if it's high-level data, or it's heavily summarized and aggregated, it might not be very relevant to you. It might also not be in the format you need, or it might be very difficult for you to make sense of it. Overcoming any of these challenges can take a lot of time before the data is usable.
•For example:
•Government data: data.gov (US), data.gov.uk (UK), data.gov.au (AUS).
•Health and scientific data: World Health Organization (WHO), Nature.com scientific data, Open Science Data Cloud (OSDC), Center for Open Science.
•Social media: Google Trends (i.e. look at national trends on search terms), Yahoo Finance (great for stock market information), Twitter (allows you to search by tags and users, which can be downloaded by using Twitter APIs).

Data extraction from multiple data sources
•What is ETL?
•Extract: Data is collected from multiple source systems.
•The sources could be databases, CRM systems, files, APIs, or even web scrapes.
•Data can be extracted in different ways including full extraction, incremental extraction, or logical extraction.
•In full extraction, all the data is extracted from the source system without any data being left out.
•In incremental extraction, only the data that has changed since the last extraction is extracted.
•Logical/partial extraction involves the extraction of data based on a certain condition or a set of conditions.
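
The three extraction modes described above differ only in which rows they select. A minimal sketch, assuming the source is an in-memory list of records with an `updated` timestamp; the function names, field names, and data are hypothetical and not part of any ETL tool.

```python
from datetime import datetime

# Hypothetical source system: each record carries a last-modified timestamp
source = [
    {"id": 1, "region": "west", "updated": datetime(2024, 1, 5)},
    {"id": 2, "region": "east", "updated": datetime(2024, 2, 10)},
    {"id": 3, "region": "west", "updated": datetime(2024, 3, 1)},
]


def full_extract(rows):
    """Full extraction: take every row, leaving nothing out."""
    return list(rows)


def incremental_extract(rows, last_run):
    """Incremental extraction: only rows changed since the last extraction."""
    return [r for r in rows if r["updated"] > last_run]


def logical_extract(rows, predicate):
    """Logical/partial extraction: only rows satisfying a condition."""
    return [r for r in rows if predicate(r)]


full = full_extract(source)
changed = incremental_extract(source, last_run=datetime(2024, 2, 1))
west = logical_extract(source, lambda r: r["region"] == "west")

print(len(full), [r["id"] for r in changed], [r["id"] for r in west])
```

In a real pipeline the `last_run` value would have to be persisted between runs, and the source would need a reliable change marker such as a timestamp or a change log.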

•Transform:
•The extracted data is then transformed in order to make it suitable for your purposes.
•Depending on the ETL workflow, the transformation stage may include:
•Adding or removing data types, rows, columns, and/or fields.
•Deleting duplicate, out-of-date, and/or extraneous data.
•Data integration - Joining multiple data sources together.
•Converting data in one format to another.
•Data cleaning - removing errors
•Data standardization - conversion to common format
•Data validation - checking for accuracy
•Data restructuring or reshaping - adding/removing structure of data

•Load:
•This phase involves loading the transformed data into the final target, which can be a data warehouse.
•Depending on the requirements, the loading can be done in two ways: either all at once (full load) or on a schedule such as daily, weekly, etc. (incremental load).
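
The difference between a full load and an incremental load can be sketched with Python's built-in `sqlite3` module standing in for the target warehouse. The table name `sales`, its columns, and the rows are assumptions made for this illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the load target
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")


def full_load(conn, rows):
    """Full load: wipe the target table and reload everything."""
    conn.execute("DELETE FROM sales")
    conn.executemany("INSERT INTO sales (id, amount) VALUES (?, ?)", rows)
    conn.commit()


def incremental_load(conn, rows):
    """Incremental load: add new ids and overwrite rows whose id already exists."""
    conn.executemany("INSERT OR REPLACE INTO sales (id, amount) VALUES (?, ?)", rows)
    conn.commit()


full_load(conn, [(1, 10.0), (2, 20.0)])          # initial load
incremental_load(conn, [(2, 25.0), (3, 30.0)])   # later scheduled run

loaded = conn.execute("SELECT id, amount FROM sales ORDER BY id").fetchall()
print(loaded)
```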

Example of ETL process
•Suppose we have a small retail business that keeps records in different formats. We want to integrate these records into a unified database for analysis. The data sources are:
•CSV file: Contains customer information.
•JSON file: Contains order (sales transaction) data.
•SQL database: Contains product details.

Example of ETL process
•Customers Data (CSV):
•CustomerID, Name, Email, SignupDate
•1, John Doe, [email protected], 2021-05-21
•2, Jane Smith, [email protected], 21/06/2021
•3, Bob Brown, bob@example, 2021-07-10

Example of ETL process
•Orders Data (JSON):
[{"OrderID": 101, "CustomerID": 1, "Amount": "100.50", "OrderDate": "2021-06-01T10:15:00", "ProductID": 1, "Qty": 1},
 {"OrderID": 102, "CustomerID": 2, "Amount": "200.75", "OrderDate": "2021-07-05T14:30:00", "ProductID": 2, "Qty": 2},
 {"OrderID": 103, "CustomerID": 3, "Amount": "150.20", "OrderDate": "2021-08-12T16:45:00", "ProductID": 3, "Qty": 3}]

Example of ETL process
•SQL database (Product details):
ProductID  Name            Type
1          T-Shirt         Clothes
2          Mobile charger  Electronics
3          Milk powder     Grocery

Example of ETL process
•Transformation:
•Customers: Standardize email addresses (e.g., convert to lowercase).
•Format Dates Consistently: Dates may come in different formats, so they need to be standardized. Let's unify the date format to YYYY-MM-DD.
•Convert Data Types
•Data types should be consistent with the data they represent. For instance, amounts should be stored as numerical types, not strings.
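
The three transformations above (lowercasing emails, unifying dates to YYYY-MM-DD, converting amounts to numbers) can be sketched as small helper functions. The helper names are hypothetical, and the list of accepted date formats is an assumption based on the formats that appear in the sample customer and order data.

```python
from datetime import datetime

# Date formats seen in the sample data (assumed to be the only ones)
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y-%m-%dT%H:%M:%S")


def normalise_email(email):
    """Standardize an email address: trim whitespace and lowercase it."""
    return email.strip().lower()


def normalise_date(text):
    """Convert any known date format to YYYY-MM-DD."""
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next format
    raise ValueError(f"unrecognised date format: {text!r}")


def to_amount(text):
    """Convert an amount stored as a string into a numeric type."""
    return float(text)


print(normalise_email(" John.Doe@Example.COM "))
print(normalise_date("21/06/2021"), normalise_date("2021-06-01T10:15:00"))
print(to_amount("100.50"))
```

Raising an error on an unknown date format, rather than guessing, keeps bad rows visible so they can be reviewed instead of silently corrupted.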

Example of ETL process
•Transformation:
•Derive New Fields: Sometimes, new fields are derived from existing ones. For example, calculating the year from a date or computing the total price by adding a tax. Task: Add a TotalAmount field by applying a 10% tax to the Amount.
•Handling Missing Values: Missing values can be problematic, so they should be handled appropriately, such as by filling with default values, removing rows, or flagging them for review.
•Example: In the customers dataset, if Email is missing after validation, the row should be dropped.
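
Both tasks above, deriving TotalAmount with a 10% tax and dropping customers whose Email fails validation, can be sketched as follows. The email addresses are placeholders, and the regular expression is a deliberately simple assumption (something@domain.tld), not a full email validator.

```python
import re

TAX_RATE = 0.10
# Simplistic validity check: text, "@", text, ".", text (an assumption, not a standard)
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def add_total(order):
    """Return a copy of the order with TotalAmount = Amount plus 10% tax."""
    total = round(float(order["Amount"]) * (1 + TAX_RATE), 2)
    return {**order, "TotalAmount": total}


def drop_invalid_emails(customers):
    """Keep only customers whose Email is present and looks valid."""
    return [c for c in customers if EMAIL_PATTERN.match(c.get("Email") or "")]


order = add_total({"OrderID": 101, "Amount": "100.50"})

# Placeholder customers: one valid, one malformed, one missing email
customers = [
    {"CustomerID": 1, "Email": "john.doe@example.com"},
    {"CustomerID": 3, "Email": "bob@example"},
    {"CustomerID": 4, "Email": None},
]
kept = drop_invalid_emails(customers)

print(order["TotalAmount"], [c["CustomerID"] for c in kept])
```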

Example of ETL process
•Load Target: Unified data warehouse, perhaps a SQL database or cloud data warehouse.
•Loading Process:
•Create tables in the target system for customers, products, and orders.
•Insert the cleaned and transformed data into these tables.
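
The loading process above can be sketched with an in-memory SQLite database standing in for the warehouse. The column types are assumptions, and the rows are one already-transformed record per table (the email address is a placeholder).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Step 1: create the target tables
conn.executescript("""
CREATE TABLE customers (CustomerID INTEGER PRIMARY KEY, Name TEXT, Email TEXT, SignupDate TEXT);
CREATE TABLE products (ProductID INTEGER PRIMARY KEY, Name TEXT, Type TEXT);
CREATE TABLE orders (
    OrderID INTEGER PRIMARY KEY,
    CustomerID INTEGER REFERENCES customers(CustomerID),
    ProductID INTEGER REFERENCES products(ProductID),
    Qty INTEGER, Amount REAL, TotalAmount REAL, OrderDate TEXT
);
""")

# Step 2: insert the cleaned and transformed rows
customers = [(1, "John Doe", "john.doe@example.com", "2021-05-21")]
products = [(1, "T-Shirt", "Clothes")]
orders = [(101, 1, 1, 1, 100.50, 110.55, "2021-06-01")]
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", customers)
conn.executemany("INSERT INTO products VALUES (?, ?, ?)", products)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?, ?, ?)", orders)
conn.commit()

# The unified database can now answer questions that span all three sources
row = conn.execute("""
SELECT c.Name, p.Name, o.TotalAmount
FROM orders o
JOIN customers c ON c.CustomerID = o.CustomerID
JOIN products p ON p.ProductID = o.ProductID
""").fetchone()
print(row)
```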

Pandas Dataframe
•A Pandas DataFrame is a 2-dimensional data structure, like a 2-dimensional array, or a table with rows and columns.

•Locate row/Rows

•Named Indexes
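
The slides for these three topics appear to have been images, so the following is a stand-in sketch rather than the original example. It assumes pandas is installed; the column names and values are made up. It builds a DataFrame, locates rows with `loc`, and then gives the rows named indexes.

```python
import pandas as pd

# Hypothetical data: each key becomes a column
data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)

# Locate a single row by its label (returns a Series)
first_row = df.loc[0]

# Locate several rows by passing a list of labels (returns a DataFrame)
first_two = df.loc[[0, 1]]

# Named indexes: supply your own row labels instead of 0, 1, 2
named = pd.DataFrame(data, index=["day1", "day2", "day3"])
day2 = named.loc["day2"]

print(df)
print(day2)
```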

1. Read CSV Files
•A simple way to store big data sets is to use CSV files (comma separated files).
•CSV files contain plain text and are a well-known format that can be read by everyone, including Pandas.
•In our examples we will be using a CSV file called 'data.csv'.
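
Reading a CSV file with pandas takes a single call to `read_csv`. Because the contents of 'data.csv' are not shown in the slides, this sketch first writes a small file with made-up columns to a temporary directory and then reads it back.

```python
import os
import tempfile

import pandas as pd

# Made-up contents standing in for the real data.csv
csv_text = "Duration,Pulse,Calories\n60,110,409.1\n45,117,282.4\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.csv")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(csv_text)
    # The first line of the file is used as the column header by default
    df = pd.read_csv(path)

print(df.head())
```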

2. Read text file data
We can read text files in Pandas in the following ways:
•Using the read_fwf() function
•Using the read_table() function
•Using the read_csv() function
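
The three functions suit different text layouts: `read_fwf()` for fixed-width columns, `read_table()` for tab-delimited text (its default separator), and `read_csv()` with a custom `sep` for other delimiters. A minimal sketch with made-up data, using in-memory text via `io.StringIO` instead of real files:

```python
import io

import pandas as pd

# Fixed-width text: every column occupies exactly 5 characters
fixed_width = "id   name \n1    Asha \n2    Ravi \n"
fwf_df = pd.read_fwf(io.StringIO(fixed_width), widths=[5, 5])

# Tab-separated text: read_table splits on tabs by default
tabbed = "id\tname\n1\tAsha\n2\tRavi\n"
table_df = pd.read_table(io.StringIO(tabbed))

# Space-separated text: read_csv with an explicit separator
spaced = "id name\n1 Asha\n2 Ravi\n"
csv_df = pd.read_csv(io.StringIO(spaced), sep=" ")

print(fwf_df)
print(table_df)
print(csv_df)
```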

3. Read data from JSON file
•Read JSON
•Big data sets are often stored, or extracted as JSON.
•JSON is plain text, but has the format of an object.
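
A JSON array of objects maps naturally onto a DataFrame: each object becomes a row and each key becomes a column. This sketch assumes pandas is installed and reuses two shortened order records as in-memory text; with a real file you would pass its path to `read_json` instead.

```python
import io

import pandas as pd

json_text = (
    '[{"OrderID": 101, "CustomerID": 1, "Amount": "100.50"},'
    ' {"OrderID": 102, "CustomerID": 2, "Amount": "200.75"}]'
)

df = pd.read_json(io.StringIO(json_text))

# Amount is quoted in the JSON, so make sure it ends up as a numeric column
df["Amount"] = df["Amount"].astype(float)

print(df)
```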

ETL Challenges and solutions
•Data Quality Issues:
•Solution: Implement data profiling and data quality tools at the beginning of ETL processes to validate, clean, and enhance data. Establish clear data governance procedures to ensure consistent and high-quality data over time.
•Data Integration from Multiple Sources:
•Solution: Use integration platforms or middleware that can handle various data sources. Ensure data lineage is well-mapped to understand how data is flowing and transforming across systems.

ETL Challenges and solutions
•Data Volume and Performance:
•Solution: Optimize your ETL process with parallel processing, partitioning, and incremental loading. Consider using cloud-based ETL solutions that can scale on demand.
•Complex Transformations:
•Solution: Use ETL tools that allow for modular and reusable transformations. Invest time in designing a clear ETL architecture, so transformations are systematic and maintainable.
•Schema Changes in Source Systems:
•Solution: Implement Schema Evolution patterns and tools. Periodically check for schema changes and adjust ETL processes accordingly.
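
A periodic schema check can be as simple as comparing the columns the ETL job expects with the columns the source currently delivers. This is a minimal sketch, not a full schema-evolution tool; the function name `schema_diff` and the column names are hypothetical.

```python
def schema_diff(expected, observed):
    """Report columns that disappeared from, or newly appeared in, a source."""
    expected_set, observed_set = set(expected), set(observed)
    return {
        "missing": sorted(expected_set - observed_set),      # ETL relies on these
        "unexpected": sorted(observed_set - expected_set),   # new, unmapped columns
    }


# The job was built for id/name/email, but the source now sends phone instead of email
diff = schema_diff(["id", "name", "email"], ["id", "name", "phone"])
print(diff)
```

A non-empty "missing" list is usually a reason to stop the run, while "unexpected" columns can often be logged and reviewed later.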

ETL Challenges and solutions
•Data Security and Compliance:
•Solution: Encrypt sensitive data during the ETL process. Implement auditing and logging mechanisms to ensure compliance with regulations like GDPR or CCPA.
•Historical Data Handling:
•Solution: Design ETL processes to handle historical data separately from incremental loads. This can involve strategies like slowly changing dimensions (SCD) in a data warehousing context.
•Data Duplication:
•Solution: Use de-duplication tools or processes. This may involve matching and merging records based on certain criteria or using unique identifiers.
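
De-duplication by unique identifier can be sketched as a merge: records that share the same key are combined, and non-empty values fill in gaps left by earlier copies. The function name, the merge rule (later non-empty values win), and the sample records are assumptions made for illustration.

```python
def deduplicate(rows, key):
    """Merge rows that share the same key; later non-empty values win."""
    merged = {}
    for row in rows:
        current = merged.setdefault(row[key], {})
        for field, value in row.items():
            if value not in (None, ""):
                current[field] = value
    return list(merged.values())


# Placeholder records: customer 1 appears twice, once without an email
rows = [
    {"CustomerID": 1, "Name": "John Doe", "Email": ""},
    {"CustomerID": 1, "Name": "John Doe", "Email": "john.doe@example.com"},
    {"CustomerID": 2, "Name": "Jane Smith", "Email": "jane.smith@example.com"},
]
unique = deduplicate(rows, key="CustomerID")
print(unique)
```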

ETL Challenges and solutions
•Lack of Metadata:
•Solution: Incorporate metadata management tools. Document and maintain metadata to ensure clarity around data sources, transformations, and the data's meaning.
•Testing Challenges:
•Solution: Implement automated ETL testing solutions. Test each ETL stage individually (extract, transform, load) and as a whole to ensure data integrity and correctness.
•Lack of Monitoring and Alerts:
•Solution: Use ETL monitoring tools or platforms that offer real-time monitoring and alerting capabilities. This helps in identifying and fixing issues as soon as they arise.

Data Storage
•Data storage refers to magnetic, optical or mechanical media that record and preserve digital information for ongoing or future operations.
•Data storage devices
•To store data, regardless of form, users need storage devices. Data storage devices come in two main categories: direct area storage and network-based storage.

Data storage devices
•Direct area storage, also known as direct-attached storage (DAS), is as the
name implies. This storage is often in the immediate area and directly
connected to the computing machine accessing it. Often, it's the only
machine connected to it. DAS can provide decent local backup services,
too, but sharing is limited. DAS devices include floppy disks, optical discs—
compact discs (CDs) and digital video discs (DVDs)—hard disk drives (HDD),
flash drives and solid-state drives (SSD).
•Network-based storage allows more than one computer to access it through a network, making it better for data sharing and collaboration. Its off-site storage capability also makes it better suited for backups and data protection. Two common network-based storage setups are network-attached storage (NAS) and storage area network (SAN).

Types of storage devices
•SSD and flash storage
•Flash storage is a solid-state technology that uses flash memory chips for writing and storing data. A solid-state disk (SSD) flash drive stores data using flash memory. Compared to HDDs, a solid-state system has no moving parts and, therefore, less latency, so fewer SSDs are needed. Since most modern SSDs are flash-based, flash storage is synonymous with a solid-state system.
•Hybrid storage
•SSDs and flash offer higher throughput than HDDs, but all-flash arrays can be more expensive. Many organizations adopt a hybrid approach, mixing the speed of flash with the storage capacity of hard drives. A balanced storage infrastructure enables companies to apply the right technology for different storage needs. It offers an economical way to transition from traditional HDDs without going entirely to flash.
•Cloud storage
•Cloud storage delivers a cost-effective, scalable alternative to storing files on on-premises hard drives or storage networks. Cloud service providers allow you to save data and files in an off-site location that you access through the public internet or a dedicated private network connection. The provider hosts, secures, manages, and maintains the servers and associated infrastructure and ensures you have access to the data whenever you need it.

Types of storage devices
•Hybrid cloud storage
•Hybrid cloud storage combines private and public cloud elements. With hybrid cloud storage, organizations can choose which cloud to store data in. For instance, highly regulated data subject to strict archiving and replication requirements is usually more suited to a private cloud environment, whereas less sensitive data can be stored in the public cloud. Some organizations use hybrid clouds to supplement their internal storage networks with public cloud storage.
•Backup software and appliances
•Backup storage and appliances protect against data loss from disaster, failure or fraud. They make periodic data and application copies to a separate, secondary device and then use those copies for disaster recovery. Backup appliances range from HDDs and SSDs to tape drives to servers, but backup storage can also be offered as a service, also known as backup-as-a-service (BaaS). Like most as-a-service solutions, BaaS provides a low-cost option to protect data, saving it in a remote location with scalability.