Orange Software

RamprakashSingaravel1 499 views 60 slides Sep 01, 2023

About This Presentation

Data Visualization Tool


Slide Content

Orange Tool
Let’s Learn
Orange Data Mining and
Data Visualization Tool

What is Data Mining?
•The process of analyzing data from different perspectives
•Summarizing it into useful information
•Information that can be used to increase revenue, cut costs, or both
•Data mining helps analysts recognize significant information, facts, relationships, trends, patterns, exceptions, and anomalies that might otherwise go unnoticed.

Major Data Mining Tasks
1) Classification: predicting an item class
2) Clustering: descriptive, finding groups of items
3) Deviation Detection: predictive, finding changes
4) Forecasting: predicting a parameter value
5) Description: describing a group
6) Link analysis: finding relationships and associations

Major Industries Using Data Mining
•retail
•finance
•education
•healthcare
•agriculture
•manufacturing
•transportation
•aerospace

Why Orange?
Open source
Component based
No programming
Data visualization
Platform-independent software
Allows clustering and classification
Data mining through visual programming and Python scripting

Introduction
Orange is component-based visual programming software for data mining, machine learning and data analysis.
Supports communication between data scientists and domain experts.
You can get the Orange software from this link:
https://orange.biolab.si/getting-started/

Getting Started With ORANGE!!

Dataset: Heart Disease
ATTRIBUTES
●Narrowing diameter
●Cholesterol
●Chest pain
●Rest ECG
●Fasting blood sugar
●Max HR
●Age, gender and more

●Has 303 instances
●13 attributes
●Categorical class with 2 values (0, 1)
●In .csv format
●Source: preloaded datasets of Orange

Dataset: How do the following factors cause Heart Disease?
●Age: the risk of heart disease increases after age 65.
●Fatty deposits called plaques also collect along your artery walls,
●slow the blood flow from the heart,
●causing coronary heart disease.
●Gender: heart disease is a leading cause of death for both men and women.

●Angina: chest pain or discomfort caused when your heart muscle doesn't get enough oxygen-rich blood.
●Cholesterol: when there is too much cholesterol in your blood,
●it builds up in the walls of your arteries,
●causing a process called atherosclerosis (heart disease).
●Diameter narrowing:
●Heart disease is caused by the narrowing or blockage of the coronary arteries.
●Target attribute (0, 1)

Loading data file into data table:

EDA: Exploratory data analysis
●Distributions

Algorithms:
●KNN
●Naïve Bayes
●Decision Tree
Selected Algorithms:
●Neural Network
●Random Forest
●Logistic Regression

Experimental Setup
This is how we drag and drop the widgets and implement our algorithms.

KNN (k nearest neighbor)
KNN is a non-parametric method used for classification and regression.
Requires three things:
The set of stored records.
A distance metric to compute the distance between records.
The value of k, the number of nearest neighbors to retrieve for an unknown record.
Math equation: $d(p, q) = \sqrt{\sum_{i} (p_i - q_i)^2}$
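The three ingredients above (stored records, a distance metric, and k) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not how Orange implements its kNN widget; the toy records (age, max heart rate) and the function names are assumptions made up for the example.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Euclidean distance d(p, q) = sqrt(sum_i (p_i - q_i)^2)."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(records, labels, query, k=3):
    """Classify `query` by majority vote among its k nearest stored records."""
    # Pair each stored record's distance with its label, nearest first.
    nearest = sorted(zip((euclidean(r, query) for r in records), labels))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (age, max heart rate) with class 0 or 1.
records = [(40, 180), (45, 170), (50, 165), (62, 120), (65, 115), (70, 110)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(records, labels, (63, 118), k=3))  # prints 1
```

In practice the attributes would usually be scaled first, since an attribute with a large numeric range otherwise dominates the distance.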


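Decision tree
Used to visually and explicitly represent decisions and decision making.
One of the predictive modelling approaches used in statistics, data mining and machine learning.
$\mathrm{Entropy}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
where $p_i$ is the proportion of records in $D$ that belong to class $i$.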
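As a rough illustration of the entropy formula, the sketch below computes Entropy(D) for a list of class labels. The function name and the sample label lists are assumptions chosen for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(D) = -sum_i p_i * log2(p_i) over the class proportions p_i."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

print(entropy([0, 1, 0, 1]))  # evenly split classes: 1.0
print(entropy([1, 1, 1, 1]))  # a pure node: 0.0
```

A decision tree prefers splits whose child nodes have lower entropy than the parent, that is, splits that make the groups purer.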


Naïve Bayes
Also known as Naive Bayes classifiers.
Assumes attributes are statistically independent of one another.
For a given class there will usually be some correlation between features.
Unlike other classifiers, it explicitly models the features as conditionally independent given the class.
$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}$
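The rule above, combined with the conditional independence assumption, can be sketched for categorical attributes as follows. This is a minimal sketch without smoothing; the two attributes (chest pain, cholesterol level) and the eight toy rows are hypothetical and are not taken from the Heart Disease dataset.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    # feature_counts[(feature index, class)][value] -> count
    feature_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(i, label)][value] += 1
    return class_counts, feature_counts

def posterior(row, class_counts, feature_counts):
    """Return P(H | X) for each class H, normalising by P(X)."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(H)
        for i, value in enumerate(row):
            # P(x_i | H); features are treated as independent given H.
            score *= feature_counts[(i, label)][value] / count
        scores[label] = score
    evidence = sum(scores.values())  # P(X)
    return {label: s / evidence for label, s in scores.items()}

# Hypothetical toy data: (chest pain, cholesterol level) -> class.
rows = [("yes", "high"), ("yes", "normal"), ("no", "high"), ("yes", "high"),
        ("no", "normal"), ("no", "normal"), ("yes", "normal"), ("no", "high")]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

model = train_naive_bayes(rows, labels)
print(posterior(("yes", "high"), *model))
```

For the query ("yes", "high") the unnormalised scores are 0.5 × 0.75 × 0.75 for class 1 and 0.5 × 0.25 × 0.25 for class 0, so class 1 receives about 90% of the probability.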


Random Forest
It is a flexible and simple algorithm.
By combining many decision trees, the Random Forest algorithm reduces the overfitting problem.
Used for identifying the most important features from the training dataset.
It can be used for both classification and regression tasks.
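The idea of combining many trees can be sketched with a heavily simplified stand-in: each "tree" below is a one-level decision stump trained on a bootstrap sample and a randomly chosen feature, and the forest predicts by majority vote. A real Random Forest grows full trees and samples features at every split, so this is an illustration of the voting idea only; the data, function names and parameters are assumptions, and classes are assumed to be 0 and 1.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature):
    """Pick the threshold on one feature that misclassifies the fewest rows."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        for above in (0, 1):  # class predicted when value > threshold
            errors = sum(
                (above if row[feature] > threshold else 1 - above) != label
                for row, label in zip(rows, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, above)
    return best[1:]  # (feature, threshold, above)

def predict_stump(stump, row):
    feature, threshold, above = stump
    return above if row[feature] > threshold else 1 - above

def fit_forest(rows, labels, n_trees=15, seed=0):
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: draw row indices with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        # Each stump looks at one randomly chosen feature.
        feature = rng.randrange(len(rows[0]))
        forest.append(
            fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature)
        )
    return forest

def predict_forest(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (age, cholesterol) -> class.
rows = [(35, 180), (40, 190), (45, 200), (50, 210),
        (60, 250), (64, 260), (68, 270), (72, 280)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

forest = fit_forest(rows, labels)
print(predict_forest(forest, (75, 290)))
```

Because every stump sees a different resample of the data, a single unlucky sample has little influence on the final vote, which is the intuition behind the reduced overfitting.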


Logistic Regression
Used to assign observations to a discrete set of classes.
Logistic regression can be binomial, ordinal or multinomial.
Binary (Pass/Fail)
Multi (Cats, Dogs, Sheep)
Ordinal (Low, Medium, High)
Can view the probability scores underlying the model’s classifications.
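For the binary case, the probability score behind a classification can be sketched as a sigmoid applied to a weighted sum of the attributes. The weights and bias below are made-up values for illustration, not coefficients fitted on the Heart Disease data.

```python
import math

def predict_proba(weights, bias, features):
    """P(class = 1 | features) under a binomial logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid maps z to (0, 1)

def predict(weights, bias, features, threshold=0.5):
    """Turn the probability score into a class label (1 or 0)."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0

# Hypothetical coefficients for two attributes.
weights = [0.8, -0.4]
bias = -0.2

print(predict_proba(weights, bias, [1.0, 0.5]))
print(predict(weights, bias, [1.0, 0.5]))
```

Moving the threshold away from 0.5 trades precision against recall without retraining the model.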


Neural Network
Neural networks are learning algorithms.
They interpret sensory data through a kind of machine perception, labeling or clustering raw input.
They consist of different layers for analyzing and learning from data.
Math equation:
$f(x) = b + \sum_{i} w_i x_i$
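The weighted sum above is what a single unit computes, and a layer applies it once per unit. The sketch below shows both; the ReLU activation and all numeric values are assumptions added for illustration, since the slide gives only the linear part.

```python
def weighted_sum(weights, bias, inputs):
    """f(x) = b + sum_i w_i * x_i for a single unit."""
    return bias + sum(w * x for w, x in zip(weights, inputs))

def relu(z):
    """Assumed activation: pass positive values, clip negatives to zero."""
    return max(0.0, z)

def dense_layer(weight_rows, biases, inputs):
    """One layer: every unit applies its own weights and bias to the inputs."""
    return [relu(weighted_sum(w, b, inputs)) for w, b in zip(weight_rows, biases)]

print(weighted_sum([1.0, 2.0], 0.5, [3.0, 4.0]))                            # 11.5
print(dense_layer([[1.0, -1.0], [0.5, 0.5]], [0.0, 1.0], [2.0, 3.0]))  # [0.0, 3.5]
```

Stacking several such layers, with the output of one used as the input of the next, gives the layered structure described above.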


Concluding Results

Table to compare data
Model                Recall  Precision  F-Measure
Neural Network       0.813   0.814      0.814
Logistic Regression  0.848   0.848      0.848
Random Forest        0.807   0.807      0.807
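For reference, the three measures in the table can be computed from prediction counts as sketched below. The counts are hypothetical and do not reproduce the figures in the table; scores reported by a tool may also be averaged over classes, in which case the F-measure need not equal the harmonic mean of the reported precision and recall.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F-measure from prediction counts.

    tp: true positives, fp: false positives, fn: false negatives.
    """
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts for one class.
precision, recall, f1 = precision_recall_f1(tp=45, fp=5, fn=15)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.9 0.75 0.818
```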


Projects:
1. Traffic Communication Data Analysis
2. Job Scam Data Analysis
3. Email Communication Data Analysis
4. Social Media Data Analysis
5. Healthcare Data Analysis

EMAIL
COMMUNICATION
DATA
ANALYSIS

References:
https://www.youtube.com/watch?v=pYXOF0jziGM&index=6&list=PLmNPvQr9Tf-ZSDLwOzxpvY-HrE0yv-8Fy
https://www.youtube.com/watch?v=bp0VtVS3LN4&index=9&list=PLmNPvQr9Tf-ZSDLwOzxpvY-HrE0yv-8Fy
https://orange.biolab.si/getting-started/
https://en.wikipedia.org/wiki/Random_forest
https://en.wikipedia.org/wiki/Decision_tree_learning
http://orange.biolab.si/docs/latest/
http://en.wikipedia.org/wiki/Data_mining
http://www.oracle.com/technetwork/database/options/advanced-analytics/odm/index.html
http://eprints.fri.uni-lj.si/1150/1/DataMining-Kyoto.pdf

Thanks!
Any questions?