DATA MINING - MODULE II NOTES (S4 BCA)


DATA MINING & DATA WAREHOUSES
S4 BCA – KU – MODULE II NOTES (PREPARED BY VINEETH P)
CHRIST NAGAR COLLEGE, MARANALLOOR

SYLLABUS - MODULE II

What is Market Basket Analysis
Market basket analysis is a data mining technique used by retailers to increase sales by better
understanding customer purchasing patterns. It involves analysing large data sets, such as
purchase history, to reveal product groupings, as well as products that are likely to be purchased
together.
The adoption of market basket analysis was aided by the advent of electronic point-of-sale (POS)
systems. Compared to handwritten records kept by store owners, the digital records generated by
POS systems made it easier for applications to process and analyse large volumes of purchase
data.

TYPES OF MARKET BASKET ANALYSIS
Retailers should understand the following types of market basket analysis:
•Predictive market basket analysis. This type considers items purchased in sequence to determine cross-sell opportunities.
•Differential market basket analysis. This type considers data across different stores, as well as
purchases from different customer groups during different times of the day, month or year. If a
rule holds in one dimension, such as store, time period or customer group, but does not hold in
the others, analysts can determine the factors responsible for the exception. These insights can
lead to new product offers that drive higher sales.

ALGORITHM FOR MARKET BASKET ANALYSIS
In market basket analysis, association rules are used to predict the likelihood of products being
purchased together. Association rules count the frequency of items that occur together,
seeking to find associations that occur far more often than expected.
Algorithms that use association rules include AIS, SETM and Apriori. The Apriori algorithm is
commonly cited by data scientists in research articles about market basket analysis. It is used
to identify frequent individual items in the database and then extend them to larger and larger
itemsets, as long as those itemsets appear sufficiently often.

Well Known Example
Amazon's website provides a well-known example of market basket analysis. On a
product page, Amazon presents users with related products under the headings
"Frequently bought together" and "Customers who bought this item also bought."

Benefits of Market-Basket Analysis
Market basket analysis can increase sales and customer satisfaction. By using data
to determine which products are often purchased together, retailers can optimize
product placement, offer special deals and create new product bundles to
encourage further sales of these combinations.
These improvements can generate additional sales for the retailer, while making
the shopping experience more productive and valuable for customers. Market basket
analysis can also strengthen customers' sentiment and brand loyalty toward the company.

APRIORI ALGORITHM - INTRODUCTION
R. Agrawal and R. Srikant are the creators of the Apriori algorithm. They created it in 1994 for
identifying the most frequent itemsets through Boolean association rules. The algorithm has
found great use in performing Market Basket Analysis, allowing businesses to sell their products
more effectively.
The use of this algorithm is not just for market basket analysis. Various fields, like healthcare,
education, etc., also use it. Its widespread use is primarily due to its simple yet effective
implementation, as it utilizes the knowledge of previous common itemset features. The Apriori
algorithm greatly helps to increase the effectiveness of level-wise generation of frequent itemsets.

Terms in Apriori Algorithm
•Itemset
An itemset refers to a set of items combined. We can refer to an itemset as a k-itemset because
it has k unique items. Typically, an itemset contains at least two items.
•Frequent Itemset
The next important concept is the frequent itemset. A frequent itemset refers to an itemset
that occurs frequently in the transaction data. For example, a frequent itemset can be {bread, butter},
{chips, cold drink}, {laptop, antivirus software}, etc.

Support is a metric that indicates the transactions in which products or items were purchased together
(in a single transaction). Confidence indicates how often one item is purchased in the
transactions that already contain another item.
Support(X) means how many times item X was purchased out of the total number of
transactions.
Support(X ^ Y) means how many times items X and Y were purchased together out of the total
number of transactions.

Process of extracting frequent item-sets
Mining frequent itemsets is the process of identifying them, and this involves
using specific thresholds for Support and Confidence to define the frequent
itemsets. The issue, however, is finding the correct threshold values for these
metrics.
Normally, the threshold value for minimum support (called min_sup) will be given in the
question itself.

To further explain the Apriori algorithm, we need to understand Association
Rule Mining. The Apriori algorithm works by finding relationships among
numerous items in a dataset. The method known as association rule mining
makes this discovery.
For example, in a supermarket, a pattern emerges where people buy certain
items together. Let's assume that individuals might buy cold drinks and chips
together, to make the example more concrete. Similarly, it is also found that
customers put notebooks and pens together in a purchase.

Through association rule mining, you, as a supermarket owner, can leverage
identified relationships to boost sales. Strategies like packaging associated
products together, placing them in close proximity, offering group discounts, and
optimizing inventory management can lead to increased profits.

Support of an Item
Support indicates an item's popularity, calculated by counting the transactions
where that particular item was present. For item 'Z', its Support would be the
number of times the item was purchased, as the transaction data indicates.
Sometimes, this count is divided by the total number of transactions to make
the number easily representable. Let's understand Support with an example.
Suppose there is transaction data for a day having 1,000 transactions.

The items you are interested in are apples, oranges, and apples + oranges (a
combination item). Now, you count the transactions where these items were
bought and find that the counts for apples, oranges, and apples + oranges are 200,
150, and 100 respectively.
The formula for Support is:
Support(Z) = Transactions containing item Z / Total transactions
So Support(apples) = 200/1000 = 0.2, Support(oranges) = 150/1000 = 0.15, and
Support(apples ^ oranges) = 100/1000 = 0.1.
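A minimal Python sketch of the Support formula follows. The transaction list is a small hypothetical stand-in (not the 1,000-transaction example from the slide); the function itself simply applies the formula above.

```python
# Minimal sketch of Support(Z) = transactions containing Z / total transactions.
# The transactions below are hypothetical illustration data.

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

transactions = [
    {"apples", "oranges", "bread"},
    {"apples", "milk"},
    {"oranges", "chips"},
    {"apples", "oranges"},
]

print(support({"apples"}, transactions))             # 0.75
print(support({"apples", "oranges"}, transactions))  # 0.5
```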


In the Apriori algorithm, this metric is used to calculate the support for
different items and itemsets, to establish that an itemset is frequent
enough to be considered when generating candidate itemsets for the next iteration.
Here, the support threshold plays a crucial role, as it is used to identify items/itemsets
that are not frequent enough.

Confidence of a rule
This key metric is used in the Apriori algorithm to indicate the probability of an
item 'Z' being purchased given that a customer has bought an item 'Y'. Notice that
a conditional probability is being calculated here: it is the conditional probability
that item Z appears in a transaction, given that another item Y appears in the same
transaction. Therefore, the formula for calculating Confidence is:

P(Z|Y) = P(Y and Z) / P(Y)
It can also be written as
Support(Y ∪ Z) / Support(Y)
Confidence is typically denoted by (Y → Z).
Ex:
Confidence(Apples → Oranges) = 100 / 200 = 0.5
[Meaning: when apples are purchased, there is a 50% chance that the customer also buys oranges.]
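A short sketch of the confidence calculation, reusing the counts from the apples-and-oranges example above (200 transactions with apples, 100 with both, out of 1,000):

```python
# Confidence(Y -> Z) = Support(Y and Z) / Support(Y), with the counts from
# the apples/oranges example above.

total_transactions = 1000
count_apples = 200               # transactions containing apples (Y)
count_apples_and_oranges = 100   # transactions containing both (Y and Z)

support_apples = count_apples / total_transactions            # 0.2
support_both = count_apples_and_oranges / total_transactions  # 0.1

confidence = support_both / support_apples
print(confidence)  # 0.5 -> 50% of apple buyers also bought oranges
```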

Lift to determine strength of a rule
Lift denotes the strength of an association rule. Suppose you need to calculate
Lift(Y → Z); you can do so by dividing Confidence(Y → Z) by Support(Z),
i.e.,
Lift(Y → Z) = Confidence(Y → Z) / Support(Z)
Another way of calculating Lift is by considering Support(Y, Z) and dividing by
Support(Y) × Support(Z); i.e., it is the ratio of the Support of the two items occurring together
to the Support of the individual items multiplied together.

In the above example, the Lift for Apples → Oranges would be the following:
Lift(Apples → Oranges) = Confidence(Apples → Oranges) / Support(Oranges)
Lift(Apples → Oranges) = 0.5 / 0.15
Lift(Apples → Oranges) ≈ 3.33
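Continuing the same example, the lift value can be checked with a couple of lines (note that 0.5 / 0.15 is approximately 3.33, not 33.33):

```python
# Lift(Y -> Z) = Confidence(Y -> Z) / Support(Z), with the example's numbers.

confidence_apples_to_oranges = 0.5   # from the confidence example above
support_oranges = 150 / 1000         # 0.15

lift = confidence_apples_to_oranges / support_oranges
print(round(lift, 2))  # 3.33 -> well above 1, a positive association
```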

Interpreting Lift Value
❖ A Lift value of 1 generally indicates randomness, suggesting independent
items, and the association rule can be disregarded.
❖ A value above 1 signifies a positive association, indicating that the two items will
likely be purchased together.
❖ Conversely, a value below 1 indicates a negative association, suggesting that
the two items are more likely to be purchased separately.


Steps in Apriori Algorithm
1. Start
2. Define the minimum threshold
3. Create a list of frequent items
4. Create candidate itemsets
5. Calculate the support of each candidate
6. Prune the candidate itemsets
7. Repeat the above steps until no new frequent itemsets are found (iteration)
8. Generate association rules
9. Evaluate association rules
10. Stop
A minimal code sketch of these steps is shown below.
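The steps above can be condensed into a minimal Python sketch, for illustration only. The helper names, the use of a minimum support count (rather than a fraction) and the toy transactions are assumptions; the transactions reuse the item labels I1–I5 from the worked example that follows but are not necessarily the slides' actual dataset.

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    """Minimal level-wise Apriori: returns {frozenset(itemset): support count}."""
    # Step 3: frequent 1-itemsets (L1)
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support_count}
    all_frequent = dict(frequent)

    k = 2
    while frequent:
        # Step 4 (join): combine frequent (k-1)-itemsets into candidate k-itemsets
        prev = list(frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k:
                    candidates.add(union)
        # Step 6 (prune): every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        # Step 5: count the support of the surviving candidates
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# Hypothetical toy transactions using the item labels I1..I5.
transactions = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
                {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
                {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]
print(apriori(transactions, min_support_count=2))
```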


Example Problem
Consider the following transaction dataset; we will find the frequent itemsets and generate
association rules for it, given:
minimum support count = 2
minimum confidence = 60%

Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset.
This is called C1 (the candidate set).
(II) Compare each candidate item's support count with the minimum support count
(here min_support = 2); if the support count of a candidate item is less than
min_support, remove that item. This gives us the itemset L1.

Step-2: K=2
•Generate candidate set C2 using L1 (this is called the join step). The condition for
joining L(k-1) with L(k-1) is that the itemsets must have (K-2) elements in common.
•Check whether all subsets of each itemset are frequent; if not, remove that
itemset. (For example, the subsets of {I1, I2} are {I1} and {I2}, and they are
frequent. Check this for each itemset.)
•Now find the support count of these itemsets by searching the dataset.

(II) Compare the candidate set (C2) support counts with the minimum support count (here
min_support = 2); if the support count of a candidate itemset is less than min_support,
remove that itemset. This gives us the itemset L2.
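The join and prune conditions used in Steps 2 to 4 can be sketched as follows. The function assumes the frequent (k-1)-itemsets are supplied as sorted tuples, which makes the "first (K-2) items must match" condition easy to check; the L2 list below is reconstructed from the itemsets mentioned in Step-3.

```python
from itertools import combinations

def join_and_prune(prev_frequent, k):
    """Generate candidate k-itemsets from frequent (k-1)-itemsets given as
    sorted tuples, then prune using the Apriori property."""
    prev = sorted(prev_frequent)
    candidates = []
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            if a[:k - 2] == b[:k - 2]:               # first (K-2) items match
                candidates.append(tuple(sorted(set(a) | set(b))))
    # Prune step: every (k-1)-subset of a candidate must itself be frequent
    prev_set = set(prev_frequent)
    return [c for c in candidates
            if all(s in prev_set for s in combinations(c, k - 1))]

# Frequent 2-itemsets (L2) from the worked example, written as sorted tuples
L2 = [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
      ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]
print(join_and_prune(L2, 3))   # [('I1', 'I2', 'I3'), ('I1', 'I2', 'I5')]
```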

Step-3:
•Generate candidate set C3 using L2 (join step). The condition for joining L(k-1)
with L(k-1) is that they must have (K-2) elements in common. So here, for L2,
the first element should match.
The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5},
{I2, I3, I4}, {I2, I4, I5} and {I2, I3, I5}.
•Check whether all subsets of these itemsets are frequent; if not, remove that
itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3} and {I1, I3},
which are frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so
remove it. Similarly check every itemset.)
•Find the support count of the remaining itemsets by searching the dataset.

(II) Compare the candidate set (C3) support counts with the minimum support
count (here min_support = 2); if the support count of a candidate itemset is
less than min_support, remove that itemset. This gives us the itemset L3.

Step-4:
•Generate candidate set C4 using L3 (join step). The condition for joining L(k-1)
with L(k-1) (K=4) is that they must have (K-2) elements in common. So here, for
L3, the first 2 elements (items) should match.
•Check whether all subsets of these itemsets are frequent. (Here the itemset
formed by joining L3 is {I1, I2, I3, I5}, and its subset {I1, I3, I5} is not
frequent.) So there is no itemset in C4.
•We stop here because no further frequent itemsets are found.
Thus, we have discovered all the frequent itemsets. Now the generation of
strong association rules comes into the picture. For that we need to calculate
the confidence of each rule.
Confidence –
A confidence of 60% means that 60% of the customers who purchased milk and
bread also bought butter.


Limitations of Apriori Algorithm
•Computational complexity.
•Time & space overhead.
•Difficulty handling sparse data.
•Limited discovery of complex patterns.
•Higher memory usage.
•Bias of minimum support threshold.
•Inability to handle numeric data.
•Lack of incorporation of context.

Ways to Improve the Efficiency of the Apriori Algorithm
Several variations of the Apriori algorithm have been proposed to improve the efficiency
of the original algorithm, as follows −
The hash-based technique (hashing itemsets into corresponding buckets) − A hash-based
technique can be used to reduce the size of the candidate k-itemsets, Ck, for k > 1. For
instance, when scanning each transaction in the database to generate the frequent
1-itemsets, L1, from the candidate 1-itemsets in C1, we can generate all of the
2-itemsets for each transaction, hash (i.e., map) them into the buckets of a hash
table structure, and increase the corresponding bucket counts.
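A hedged sketch of the hashing idea: during the first pass, every 2-itemset in each transaction is hashed into a small table of bucket counts, and any 2-itemset whose bucket count is below min_sup cannot be frequent, so it can be excluded from C2. The hash function, the number of buckets and the toy data are illustrative assumptions.

```python
# Sketch of the hash-based technique: collect bucket counts for 2-itemsets
# during the first scan, then use them to filter candidate 2-itemsets.
from itertools import combinations

NUM_BUCKETS = 7        # illustrative; a real implementation tunes this

def bucket(pair):
    """Hash a 2-itemset (pair of items) into one of NUM_BUCKETS buckets."""
    return hash(frozenset(pair)) % NUM_BUCKETS

def bucket_counts(transactions):
    counts = [0] * NUM_BUCKETS
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            counts[bucket(pair)] += 1
    return counts

def may_be_frequent(pair, counts, min_sup):
    # A 2-itemset can only be frequent if its bucket count reaches min_sup.
    return counts[bucket(pair)] >= min_sup

transactions = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
                {"I1", "I2", "I4"}, {"I1", "I3"}]
counts = bucket_counts(transactions)
print(may_be_frequent(("I1", "I2"), counts, min_sup=2))
```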

Transaction reduction − A transaction that does not contain any frequent k-itemsets cannot
contain any frequent (k + 1)-itemsets. Therefore, such a transaction can be marked or removed
from further consideration, because subsequent scans of the database for j-itemsets, where j >
k, will not need it.
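A brief sketch of this idea (the helper name and toy data are illustrative): after level k, any transaction containing no frequent k-itemset is dropped before the next scan.

```python
# Sketch of transaction reduction: drop transactions with no frequent
# k-itemset, since they cannot contain any frequent (k+1)-itemset.

def reduce_transactions(transactions, frequent_k_itemsets):
    return [t for t in transactions
            if any(itemset <= t for itemset in frequent_k_itemsets)]

transactions = [{"I1", "I2"}, {"I4"}, {"I2", "I3"}]
L1 = [frozenset({"I1"}), frozenset({"I2"}), frozenset({"I3"})]
print(reduce_transactions(transactions, L1))  # the {"I4"} transaction is dropped
```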
Partitioning − A partitioning technique can be used that requires just two database scans to mine
the frequent itemsets. It consists of two phases. In Phase I, the algorithm subdivides the
transactions of D into n non-overlapping partitions. If the minimum support threshold for
transactions in D is min_sup, then the minimum support count for a partition is min_sup ×
the number of transactions in that partition.

For each partition, all frequent itemsets within that partition are discovered. These
are referred to as local frequent itemsets. The procedure employs a special data
structure that, for each itemset, records the TIDs of the transactions containing the
items in that itemset. This enables it to find all of the local frequent k-itemsets, for k
= 1, 2, ..., in only one scan of the database.

A local frequent itemset may or may not be frequent with respect to the entire database, D.
However, any itemset that is potentially frequent with respect to D must occur as a frequent
itemset in at least one of the partitions. Therefore, all local frequent itemsets are candidate
itemsets with respect to D. The collection of frequent itemsets from all partitions forms the
global candidate itemsets for D. In Phase II, a second scan of D is conducted, in which the
actual support of each candidate is assessed to determine the global frequent itemsets.
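A minimal sketch of the two-phase idea, under the simplifying assumption that the local miner only finds frequent 1-itemsets (a real implementation would run a full Apriori pass per partition); all names and data here are illustrative.

```python
# Sketch of the two-phase partitioning technique: mine local frequent itemsets
# per partition (Phase I), then count the union of all local candidates in a
# second full scan of D (Phase II).
import math

def partition(transactions, n):
    """Split the transaction list D into n non-overlapping partitions."""
    size = max(1, math.ceil(len(transactions) / n))
    return [transactions[i:i + size] for i in range(0, len(transactions), size)]

def local_frequent_itemsets(part, min_sup_fraction):
    """Stand-in local miner: frequent 1-itemsets only, for illustration."""
    threshold = min_sup_fraction * len(part)
    counts = {}
    for t in part:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    return {s for s, c in counts.items() if c >= threshold}

def partition_apriori(transactions, min_sup_fraction, n_partitions):
    # Phase I: the union of local frequent itemsets forms the global candidates
    candidates = set()
    for part in partition(transactions, n_partitions):
        candidates |= local_frequent_itemsets(part, min_sup_fraction)
    # Phase II: one more full scan to check each candidate's global support
    threshold = min_sup_fraction * len(transactions)
    return {c for c in candidates
            if sum(1 for t in transactions if c <= t) >= threshold}

data = [{"I1", "I2"}, {"I2", "I3"}, {"I1", "I3"}, {"I2"}, {"I1", "I2", "I3"}]
print(partition_apriori(data, min_sup_fraction=0.4, n_partitions=2))
```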

Sampling − The fundamental idea of the sampling approach is to select a random
sample S of the given data D, and then search for frequent itemsets in S instead of D.
In this way, we trade off some degree of accuracy against efficiency. The sample
size of S is chosen so that the search for frequent itemsets in S can be done in main
memory, and therefore only one scan of the transactions in S is needed overall.
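A short sketch of the sampling idea; mine_frequent_itemsets is a placeholder argument for any frequent-itemset miner (for example, the Apriori sketch shown earlier), and the lowered-threshold factor is an illustrative choice to reduce the risk of missing itemsets that are frequent in the full database.

```python
# Sketch of the sampling approach: mine a random sample S of D instead of D,
# trading some accuracy for speed. `mine_frequent_itemsets` is a placeholder
# for any frequent-itemset miner.
import random

def sample_and_mine(transactions, sample_fraction, min_sup_fraction,
                    mine_frequent_itemsets):
    sample_size = max(1, int(len(transactions) * sample_fraction))
    sample = random.sample(transactions, sample_size)
    # Slightly lower the threshold on S (illustrative factor) to reduce the
    # chance of missing itemsets that are frequent in D.
    lowered = 0.9 * min_sup_fraction
    return mine_frequent_itemsets(sample, lowered)

# Example with a trivial miner that returns frequent 1-itemsets only:
def one_itemset_miner(sample, min_sup_fraction):
    threshold = min_sup_fraction * len(sample)
    counts = {}
    for t in sample:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    return {i for i, c in counts.items() if c >= threshold}

data = [{"I1", "I2"}, {"I2", "I3"}, {"I1", "I3"}, {"I2"}]
print(sample_and_mine(data, 0.75, 0.5, one_itemset_miner))
```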
