Lecture 5 Variable Selection, Categorical Input Consolidation, Adjusting for Separate Sampling for handling Imbalanced Data.pdf

NURASYIKIN Shamsuddin · 10 slides · Oct 25, 2025


Variable Selection

❑ Redundant or irrelevant inputs are removed from the training data set to reduce overfitting and improve prediction performance.
❑ Some prediction models do not include methods for selecting inputs. Therefore, a separate SAS Enterprise Miner tool is used for input selection.
❑ Alternatives for input (variable) selection:
1) Regression node – sequential selection
2) Variable Selection node – univariate and forward selection (R-square); tree-like selection (Chi-square)
3) Partial Least Squares node – Variable Importance in the Projection
4) Decision Tree node – split-search selection

❑ Using Regression Input Selection:
❖ The Regression node in SAS E-Miner provides three sequential selection methods.
❖ Forward Selection – creates a sequence of models of increasing complexity. It begins with an empty model and adds variables one by one. In each forward step, the variable (with the smallest p-value) that gives the single best improvement is added to the model.
❖ Backward Elimination – creates a sequence of models of decreasing complexity. Backward elimination begins with all variables selected and eliminates variables one at a time until a stopping criterion is reached (stop when all remaining p-values are below the significance cutoff).
❖ Stepwise Selection – combines elements from both the forward and backward selection procedures. The method begins in the same way as the forward procedure, sequentially adding inputs. All candidate variables in the model are then checked to see whether their significance has been reduced below the specified tolerance level. If a nonsignificant variable is found, it is removed from the model.
❖ Stepwise regression requires two significance levels: one for adding variables and one for removing variables. The cutoff probability for adding variables should be less than the cutoff probability for removing variables so that the procedure does not get into an infinite loop.
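The forward step described above can be sketched outside SAS. The following is an illustrative Python implementation (not the SAS Regression node's actual algorithm): at each step it fits candidate models, computes a partial F-test p-value for each remaining input, and adds the best one until no candidate enters below the entry significance level. The data set, variable names, and `alpha_enter` default are made up for the example.

```python
import numpy as np
from scipy import stats

def forward_select(X, y, names, alpha_enter=0.05):
    """Greedy forward selection: add the candidate input whose partial
    F-test p-value is smallest, stopping when no candidate enters below
    alpha_enter. Sketch of the forward-selection idea only."""
    n = len(y)
    selected, remaining = [], list(range(X.shape[1]))

    def sse(cols):
        # Least-squares fit with intercept; return residual sum of squares
        Z = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        r = y - Z @ beta
        return r @ r

    while remaining:
        sse_cur = sse(selected)
        best = None
        for c in remaining:
            sse_new = sse(selected + [c])
            df2 = n - len(selected) - 2      # residual df with candidate added
            F = (sse_cur - sse_new) / (sse_new / df2)
            p = stats.f.sf(F, 1, df2)        # partial F-test p-value
            if best is None or p < best[1]:
                best = (c, p)
        if best[1] >= alpha_enter:           # no candidate is significant
            break
        selected.append(best[0])
        remaining.remove(best[0])
    return [names[c] for c in selected]

# Synthetic data: only x0 and x2 actually drive the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(size=200)
print(forward_select(X, y, ["x0", "x1", "x2", "x3"]))
```

Backward elimination and stepwise selection differ only in the loop: backward starts from the full model and drops the least significant input, while stepwise re-checks already-selected inputs after each addition.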

❑ Using Variable Selection for Input Selection:
❖ The Variable Selection tool provides a selection based on one of two criteria.
❖ R-Squared Criterion – a two-step process.
Step 1: The squared correlation with the target is computed for each variable. Variables with values less than the squared-correlation criterion are rejected (default R-square = 0.005).
Step 2: SAS E-Miner evaluates the remaining variables using forward stepwise R-squared regression. Variables whose stepwise R-squared improvement is less than the threshold criterion (default = 0.0005) are rejected.
❖ Chi-Squared Selection Criterion – variable selection is performed using binary splits that maximize the chi-squared value. The approach is quite similar to a decision tree algorithm, with its ability to detect nonlinear relationships between inputs and the target. However, the method for handling categorical inputs makes it sensitive to spurious input/target correlation.
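The two-step R-squared criterion can be sketched as follows. This is an illustrative Python version using the node's default thresholds from the slide (0.005 and 0.0005); the screening logic mirrors the description above, not the SAS implementation itself, and the data are synthetic.

```python
import numpy as np

def r2_screen(X, y, names, min_r2=0.005, min_step_r2=0.0005):
    """Two-step R-squared screening (sketch).
    Step 1: drop inputs whose squared correlation with the target is
    below min_r2.  Step 2: forward stepwise regression on the survivors,
    keeping inputs whose incremental R-squared exceeds min_step_r2."""
    n = len(y)
    tss = np.sum((y - y.mean()) ** 2)

    def r2(cols):
        Z = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        r = y - Z @ beta
        return 1 - (r @ r) / tss

    # Step 1: univariate squared-correlation filter
    survivors = [c for c in range(X.shape[1])
                 if np.corrcoef(X[:, c], y)[0, 1] ** 2 >= min_r2]

    # Step 2: greedy forward steps, stopping when the gain is too small
    selected, cur = [], 0.0
    while survivors:
        gain, best = max((r2(selected + [c]) - cur, c) for c in survivors)
        if gain < min_step_r2:
            break
        selected.append(best)
        survivors.remove(best)
        cur += gain
    return [names[c] for c in selected]

# Synthetic data: only x1 is informative
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = 2 * X[:, 1] + rng.normal(size=500)
print(r2_screen(X, y, ["x0", "x1", "x2"]))
```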

❑ Using Partial Least Squares (PLS) Input Selection:
❖ The goal is to find linear combinations of the inputs that account for variation in both the inputs and the target. A useful feature of the PLS procedure is the inclusion of Variable Importance in the Projection (VIP). VIP quantifies the relative importance of the original input variables to the latent vectors (linear combinations of inputs).

❑ Using Decision Tree Input Selection:
❖ Decision trees can be used to select inputs for flexible predictive models.
❖ The Importance column quantifies the overall variability in the target that each input explains.
❖ The variable importance measure considers not only the inputs that appear in the tree but also surrogate inputs.
Notes:
Surrogate inputs are correlated with the selected split input. The advantage of adding surrogates to the variable selection is that it enables inclusion of inputs that do not appear in the tree but are still important predictors of the target.
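Split-based importance can be illustrated with scikit-learn as a stand-in. One caveat in the comparison: scikit-learn's impurity-based `feature_importances_` credits only inputs actually used in splits, whereas the SAS E-Miner importance described above also credits surrogate inputs. The data below are synthetic.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary target driven by x0 (strongly) and x2 (weakly)
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Impurity-based importance: share of total impurity reduction per input
# (note: no surrogate credit, unlike the SAS E-Miner Importance column)
for name, imp in zip(["x0", "x1", "x2", "x3"], tree.feature_importances_):
    print(f"{name}: {imp:.3f}")
```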

Categorical Input Consolidation

❑ Categorical input consolidation is a technique for combining categorical input levels that have similar primary-outcome proportions.
❑ The decision tree method can easily group the distinct levels of a categorical variable together and produce good predictions.
❑ In SAS E-Miner, the Decision Tree node is connected to the Impute node. The Decision Tree node is then known as the Consolidation Tree node.
❑ The grouping can be done by simply running the Consolidation Tree node, or you may change the features of the node.
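The consolidation idea can be sketched outside SAS: encode each level by its observed primary-outcome proportion, then let a depth-limited tree split the encoded values so that levels with similar proportions end up in the same leaf (group). This is an illustration of the Consolidation Tree idea, not the SAS node itself; the levels, event rates, and `max_leaf_nodes` setting are made up for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Six levels forming three natural groups by event rate:
# {A, B} ~ 5%, {C, D} ~ 30%, {E, F} ~ 70%
levels = np.array(["A", "B", "C", "D", "E", "F"])
rates = {"A": .05, "B": .06, "C": .30, "D": .32, "E": .70, "F": .68}

rng = np.random.default_rng(4)
cats = rng.choice(levels, size=3000)
y = (rng.random(3000) < np.vectorize(rates.get)(cats)).astype(int)

# Encode each level by its observed primary-outcome proportion
enc = {lv: y[cats == lv].mean() for lv in levels}
x = np.array([enc[c] for c in cats]).reshape(-1, 1)

# A tree limited to 3 leaves consolidates the 6 levels into 3 groups
tree = DecisionTreeClassifier(max_leaf_nodes=3).fit(x, y)
groups = {lv: int(tree.apply([[enc[lv]]])[0]) for lv in levels}
print(groups)   # each level mapped to its consolidated group id
```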

Adjusting for Separate Sampling for handling Imbalanced Data

❑ Imbalanced classes are a common problem in machine learning classification, where there is a disproportionate ratio of observations in each class.
❑ Class imbalance can be found in many different areas, including medical diagnosis, spam filtering, and fraud detection. Example: in the Credit Card Fraud Detection Bank A data set, only 0.17% of the customers are categorized as fraud.
❑ Proportionate sampling cannot be used because the number of observations across outcomes is heavily skewed; separate sampling is applied instead.
❑ In separate sampling, samples are drawn separately based on the target outcome.
❑ Separate sampling enables us to obtain a model of similar predictive power with a smaller overall case count.
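Because separate sampling changes the event rate the model sees, its posterior probabilities must be adjusted back to the population priors before scoring. The sketch below shows the standard prior-correction formula (the same idea SAS E-Miner applies when prior probabilities are specified); the function name and example numbers are illustrative, with 0.17% taken from the fraud example above.

```python
def adjust_posterior(p, pi1, rho1):
    """Correct a posterior probability p estimated on a separately
    sampled training set back to the population scale (sketch).
    pi1  = population event rate (prior)
    rho1 = event rate in the training sample"""
    pi0, rho0 = 1 - pi1, 1 - rho1
    num = p * pi1 / rho1                  # reweight the event class
    den = num + (1 - p) * pi0 / rho0      # reweight the non-event class
    return num / den

# A model trained on a balanced (50/50) sample outputs 0.50, but the
# population fraud rate is only 0.17%:
print(round(adjust_posterior(0.50, 0.0017, 0.50), 4))   # → 0.0017
```

Note that when the sample rate matches the population rate (rho1 = pi1), the formula leaves the posterior unchanged, as it should.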