Learning Sets of Rules, Sequential Learning Algorithm, FOIL


About This Presentation

Presentation covers learning sets of rules, general-to-specific search, beam search, the sequential covering algorithm, and FOIL.


Slide Content

Learning Sets of Rules
SEQUENTIAL COVERING ALGORITHMS
LEARNING RULE SETS
LEARNING FIRST ORDER RULES
LEARNING SETS OF FIRST ORDER RULES

Learning Rules
One of the most expressive and human-readable representations for learned hypotheses (target functions) is a set of production rules (if-then rules).
Rules can be derived from other representations
Decision trees
Genetic Algorithms
or they can be learned directly (direct methods).
An important aspect of direct rule-learning algorithms is that they can learn sets of first-order rules, which have much more representational power than the propositional rules that can be derived from decision trees.

Propositional versus First-Order Logic
Propositional logic does not include variables and thus cannot express general relations among the values of the attributes.
Example 1: in propositional logic, you can write:
IF (Father1 = Bob) ^ (Name2 = Bob) ^ (Female1 = True) THEN Daughter1,2 = True
This rule applies only to a specific family!
Example 2: in first-order logic, you can write:
IF Father(y, x) ^ Female(y), THEN Daughter(x, y)
This rule (which you cannot write in propositional logic) applies to any family!
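The contrast can be made concrete with a small sketch (the representation below is my own illustration, not part of the slides): the propositional rule is tied to the constants of one family, while the first-order rule is a template over variables that applies to any set of facts.

```python
# Propositional rule: every attribute test is bound to a concrete value,
# so it only fires for the single family it was written about.
def propositional_daughter_rule(example):
    return (example["Father1"] == "Bob"
            and example["Name2"] == "Bob"
            and example["Female1"] is True)

# First-order rule: IF Father(y, x) ^ Female(y) THEN Daughter(x, y).
# The variables x, y range over any individuals in the facts.
def first_order_daughter(father_facts, female_facts):
    """father_facts: set of (y, x) pairs read as 'x is the father of y';
    female_facts: set of individuals y for which Female(y) holds."""
    for (y, x) in father_facts:
        if y in female_facts:
            yield (x, y)          # conclude Daughter(x, y)

# The same first-order rule covers any family:
fathers = {("Sharon", "Bob"), ("Anna", "Victor")}
females = {"Sharon", "Anna"}
print(sorted(first_order_daughter(fathers, females)))
# [('Bob', 'Sharon'), ('Victor', 'Anna')]
```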

Sequential Covering Algorithms:
A sequential covering algorithm (SCA) is one of a “family of algorithms that learn rule sets based on the strategy of learning one rule, removing the data it covers, and iterating the process.”
Example (explains why it is called sequential covering):
Assume a subroutine Learn-one-rule
Input: a set of positive and negative examples
Expected output: a rule that covers many positive and few negative examples
Requirement: the output rule should have high accuracy, but not necessarily high coverage
Approach:
Learn a rule and remove the positive examples it covers
Invoke the subroutine again to learn a second rule for the remaining positive examples
Iterate as many times as required; the learned rules form a “disjunctive set of rules”

Learning Propositional Rules:
Sequential Covering Algorithms
Sequential-Covering(Target_attribute, Attributes, Examples, Threshold)
  Learned_rules <-- { }
  Rule <-- Learn-one-rule(Target_attribute, Attributes, Examples)
  While Performance(Rule, Examples) > Threshold, do
    Learned_rules <-- Learned_rules + Rule
    Examples <-- Examples - {examples correctly classified by Rule}
    Rule <-- Learn-one-rule(Target_attribute, Attributes, Examples)
  Learned_rules <-- sort Learned_rules according to Performance over Examples
  Return Learned_rules
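A minimal Python sketch of this loop is given below. The rule representation and the learn_one_rule and performance callables are assumptions of the sketch (supplied by the caller), not something defined on the slides; an example is assumed to carry its class in a label attribute.

```python
def sequential_covering(examples, learn_one_rule, performance, threshold):
    """Learn a disjunctive set of rules by repeatedly learning one rule and
    removing the examples it classifies correctly."""
    all_examples = list(examples)        # kept for the final sort
    learned_rules = []
    rule = learn_one_rule(examples)      # assumed to return None if no rule can be formed
    while rule is not None and performance(rule, examples) > threshold:
        learned_rules.append(rule)
        # Drop the examples the new rule classifies correctly.
        examples = [ex for ex in examples
                    if not (rule.covers(ex) and rule.predicts(ex) == ex.label)]
        rule = learn_one_rule(examples)
    # Most accurate rules are consulted first at classification time.
    learned_rules.sort(key=lambda r: performance(r, all_examples), reverse=True)
    return learned_rules
```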

Learning Propositional Rules:
Sequential Covering Algorithms
The algorithm is called a sequential covering algorithm because it sequentially learns a set of rules that together cover the whole set of positive examples.
It has the advantage of reducing the problem of learning a disjunctive set of rules to a sequence of simpler problems, each requiring that a single conjunctive rule be learned.
The final set of rules is sorted so that the most accurate rules are considered first at classification time.
However, because it does not backtrack, this algorithm is not guaranteed to find the smallest or best set of rules ---> Learn-one-rule must be very effective!

Learning Propositional Rules: Learn-one-rule
General-to-Specific Search:
Start with the most general rule (hypothesis), which matches every instance in the training set.
Repeat
  Add the attribute test that most improves rule performance measured over the training set.
Until the hypothesis reaches an acceptable level of performance.
The difference between this algorithm and ID3 is that it follows only a single descendant at each step.
Because it does not backtrack, there is a risk of making a suboptimal choice at some step.
The solution is to use a beam search rather than following only the single best candidate.

General-to-Specific Beam Search (CN2):
Rather than considering a single candidate at each search step, keep track of the k best candidates.
The algorithm thus keeps track of the most promising alternatives to the current top-rated hypothesis.
Learn-one-rule(Target_attribute, Attributes, Examples, k)
  returns a single rule that covers some of the examples, and
  conducts a general-to-specific greedy beam search for the best rule, guided by the performance metric.

Implementation of Learn-one-rule using general-to-specific beam search
Learn-one-rule(Target_attribute, Attributes, Examples, k)
1. Best-Hypothesis <-- the most general hypothesis ø
2. Candidate-Hypotheses <-- {Best-Hypothesis}
3. While Candidate-Hypotheses is not empty, Do
   a. Generate the next more specific candidate hypotheses
      All-Constraints <-- the set of all constraints of the form (a = v)
      New-Candidate-Hypotheses <-- generated by specializing each h in Candidate-Hypotheses
        with each constraint c in All-Constraints
      Remove from New-Candidate-Hypotheses any hypotheses that are duplicates, inconsistent,
        or not maximally specific
   b. Update Best-Hypothesis
      For all h in New-Candidate-Hypotheses, do
        IF Performance(h, Examples, Target_attribute) > Performance(Best-Hypothesis, Examples, Target_attribute)
        THEN Best-Hypothesis <-- h
   c. Update Candidate-Hypotheses
      Candidate-Hypotheses <-- the k best members of New-Candidate-Hypotheses,
        according to the Performance measure
Return a rule of the form
  “IF Best-Hypothesis THEN prediction”
  where prediction is the most frequent value of Target_attribute among those
  Examples that match Best-Hypothesis

Performance(h, Examples, Target_attribute)
  h_examples <-- the subset of Examples that match h
  return -Entropy(h_examples), where entropy is computed with respect to Target_attribute
Variations:
1. Learn only rules that cover positive examples, treating everything they do not cover as negative (negation as failure).
2. Use a single positive example as a seed and consider only the attribute constraints satisfied by that example.
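A runnable sketch of the beam-search Learn-one-rule above is shown below, under simplifying assumptions of my own: an example is a dict of attribute values plus a 'label' key, a hypothesis is a dict of (attribute -> value) constraints, and performance is the negative entropy of the covered examples.

```python
from collections import Counter
import math

def entropy(labels):
    """Entropy of a list of target-attribute values."""
    counts, total = Counter(labels), len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def learn_one_rule(examples, attributes, k):
    """Greedy general-to-specific beam search for a single rule."""
    def matches(h, ex):
        return all(ex.get(a) == v for a, v in h.items())

    def performance(h):
        covered = [ex["label"] for ex in examples if matches(h, ex)]
        return -entropy(covered) if covered else float("-inf")

    best = {}                      # most general hypothesis: no constraints
    candidates = [best]
    # All constraints of the form (a = v) occurring in the data.
    constraints = {(a, ex[a]) for ex in examples for a in attributes}
    while candidates:
        new_candidates = []
        for h in candidates:
            for (a, v) in constraints:
                if a not in h:                 # avoid duplicate/inconsistent constraints
                    new_h = dict(h)
                    new_h[a] = v               # specialize h by one constraint
                    if new_h not in new_candidates:
                        new_candidates.append(new_h)
        for h in new_candidates:               # update Best-Hypothesis
            if performance(h) > performance(best):
                best = h
        # Keep only the k best members for the next round of specialization.
        candidates = sorted(new_candidates, key=performance, reverse=True)[:k]
    covered = [ex["label"] for ex in examples if matches(best, ex)]
    prediction = Counter(covered).most_common(1)[0][0] if covered else None
    return best, prediction        # IF best THEN prediction
```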

Comments and Variations regarding the
Basic Rule Learning Algorithms
Sequential versus simultaneous covering: sequential covering algorithms (e.g., CN2) make a larger number of independent choices than simultaneous covering ones (e.g., ID3); which is preferable depends on how much training data is available.
Direction of the search: CN2 uses a general-to-specific search strategy. Other systems (e.g., GOLEM) use a specific-to-general strategy. General-to-specific search has the advantage of having a single maximally general hypothesis from which to start.
Generate-then-test versus example-driven: CN2 (Clark and Niblett's Learn-one-rule) is a generate-then-test method, which minimizes the impact of noise. Other methods (Find-S, Candidate-Elimination) are example-driven and can be misled by a single noisy training example; generate-then-test systems are therefore more robust to noise.

Comments and Variations regarding the
Basic Rule Learning Algorithms, Cont’d
Post-pruning: preconditions can be removed from a rule whenever this leads to improved performance over a set of pruning examples distinct from the training set.
Performance measure: different evaluation functions can be used, for example:
Relative frequency: n_c / n
m-estimate of accuracy: (n_c + m*p) / (n + m) (used by certain versions of CN2)
Entropy (used by the original CN2)
Here n is the number of examples the rule matches, n_c the number it classifies correctly, p the prior probability of the predicted class, and m a weight given to that prior.
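A small sketch of these three evaluation functions (the function and parameter names are my own, not the slides'):

```python
from collections import Counter
import math

def relative_frequency(n_c, n):
    """n examples matched by the rule, n_c of them classified correctly."""
    return n_c / n

def m_estimate(n_c, n, p, m):
    """m-estimate of accuracy: p is the prior probability of the predicted
    class and m weights how strongly that prior is trusted."""
    return (n_c + m * p) / (n + m)

def negative_entropy(covered_labels):
    """-Entropy of the target attribute over the examples the rule covers
    (higher is better), as used by the original CN2."""
    counts, total = Counter(covered_labels), len(covered_labels)
    return sum((c / total) * math.log2(c / total) for c in counts.values())
```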

Learning Sets of First-Order Rules: FOIL (Quinlan, 1990)
FOIL is similar to the propositional rule-learning approach except for the following:
FOIL accommodates first-order rules and thus needs to accommodate variables in the rule preconditions.
FOIL uses a special performance measure (FOIL_Gain) which takes into account the different variable bindings.
FOIL seeks only rules that predict when the target literal is True (instead of predicting when it is True or when it is False).
FOIL performs a simple hill-climbing search rather than a beam search.
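The slides use the FOIL_Gain measure without stating it; for reference, a sketch of its usual definition (following Quinlan's FOIL as presented by Mitchell) is below: Foil_Gain(L, R) = t * (log2(p1/(p1+n1)) - log2(p0/(p0+n0))), where p0, n0 are the positive and negative bindings of rule R, p1, n1 those of the specialized rule R + L, and t is the number of positive bindings of R still covered after adding L. The parameter names are my own.

```python
import math

def foil_gain(p0, n0, p1, n1, t):
    """FOIL's gain for adding literal L to rule R.
    p0, n0: positive/negative bindings of R before adding L
    p1, n1: positive/negative bindings of the specialized rule R + L
    t:      positive bindings of R still covered after adding L
    Assumes p0 > 0 and p1 > 0 (FOIL only keeps literals that retain
    at least one positive binding)."""
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))
```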

Induction as Inverted Deduction
Let D be a set of training examples, each of the form <x_i, f(x_i)>.
Then, learning is the problem of discovering a hypothesis h such that the classification f(x_i) of each training instance x_i follows deductively from the hypothesis h, the description of x_i, and any other background knowledge B known to the system.
Example:
x_i: Male(Bob), Female(Sharon), Father(Sharon, Bob)
f(x_i): Child(Bob, Sharon)
B: Parent(u, v) <-- Father(u, v)
We want to find h such that (B ^ h ^ x_i) |- f(x_i).
h1: Child(u, v) <-- Father(v, u)
h2: Child(u, v) <-- Parent(v, u)
(h1 entails f(x_i) from x_i alone; h2 entails it only with the help of B: Father(Sharon, Bob) and B give Parent(Sharon, Bob), and h2 then yields Child(Bob, Sharon).)

The family example again, first as an attribute-value (propositional) instance and then as first-order rules:
Name1 = Sharon, Mother1 = Louise, Father1 = Bob, Male1 = False, Female1 = True,
Name2 = Bob, Father2 = Victor, Mother2 = Nora, Male2 = True, Female2 = False,
Daughter1,2 = True
IF (Father1 = Bob) ^ (Name2 = Bob) ^ (Female1 = True) THEN Daughter1,2 = True
IF Father(y, x) ^ Female(y), THEN Daughter(x, y)
IF Father(y, z) ^ Mother(z, x) ^ Female(y) THEN __?__(x, y)
Answer: GrandDaughter(x, y)

Basic Definitions from First Order Logic
Constants (capitalized symbols)
  Example: Bob, Sharon
Variables (lower-case symbols)
  Example: x, y, z
Predicate symbols (capitalized symbols)
  Example: Female, Male, Married
Function symbols (lower-case symbols)
  Example: age
Term: any constant, variable, or function applied to a term
  Example: Mary, x, age(Mary), age(x)
Literal: any predicate or its negation applied to terms
  Example: Married(Bob) is a positive literal; ¬Greater_Than(age(Sharon), 20) is a negative literal
Ground literal: a literal that contains no variables
Clause: a disjunction of literals

Basic Definitions from First Order Logic
Horn clause: a clause containing at most one positive literal
  H ∨ ¬L1 ∨ ¬L2 ∨ ... ∨ ¬Ln
  which is equivalent to H <-- (L1 ∧ L2 ∧ ... ∧ Ln), i.e., IF (L1 ∧ L2 ∧ ... ∧ Ln) THEN H
In a Horn clause:
  H is the head or consequent of the Horn clause
  (L1 ∧ L2 ∧ ... ∧ Ln) is the body or antecedents of the Horn clause
A well-formed expression is composed of:
  1. Constants  2. Functions  3. Variables  4. Predicates
  Example: Female(Joe)
Substitution: a function that replaces variables by terms
  Example: {x/3, y/z} replaces x by 3 and y by z
  In general, if θ is a substitution and L a literal, Lθ denotes the result of applying θ to L
Unifying substitution: for two literals L1 and L2, a substitution θ such that L1θ = L2θ
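A minimal sketch of substitutions and unifying substitutions, using a small Python term representation of my own (constants and variables as strings, with variables lower-case, and function/predicate applications as (name, args) tuples):

```python
def is_variable(t):
    """Variables are lower-case strings (e.g. 'x'); constants are capitalized."""
    return isinstance(t, str) and t[0].islower()

def apply_substitution(term, theta):
    """Return the term/literal with every variable replaced according to theta."""
    if is_variable(term):
        return theta.get(term, term)
    if isinstance(term, tuple):              # function or predicate application
        name, args = term
        return (name, tuple(apply_substitution(a, theta) for a in args))
    return term                              # constant

# Applying {x/Sharon} to the literal Father(x, Bob):
L = ("Father", ("x", "Bob"))
print(apply_substitution(L, {"x": "Sharon"}))   # ('Father', ('Sharon', 'Bob'))

# theta is a unifying substitution for L1 and L2 when L1θ = L2θ:
L1, L2 = ("Female", ("x",)), ("Female", ("Sharon",))
theta = {"x": "Sharon"}
print(apply_substitution(L1, theta) == apply_substitution(L2, theta))   # True
```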

Learning Sets of First Order Rules (FOIL)
FOIL(Target_predicate, Predicates, Examples)
  Pos <-- those Examples for which Target_predicate is True
  Neg <-- those Examples for which Target_predicate is False
  Learned_rules <-- { }
  While Pos is not empty, do                ; outer loop: learn a new rule
    NewRule <-- the rule that predicts Target_predicate with no preconditions
    NewRuleNeg <-- Neg
    While NewRuleNeg is not empty, do       ; inner loop: add a new literal to specialize NewRule
      Candidate_literals <-- generate candidate new literals for NewRule, based on Predicates
      Best_literal <-- argmax over L in Candidate_literals of Foil_Gain(L, NewRule)
      Add Best_literal to the preconditions of NewRule
      NewRuleNeg <-- the subset of NewRuleNeg that satisfies the preconditions of NewRule
    Learned_rules <-- Learned_rules + NewRule
    Pos <-- Pos - {members of Pos covered by NewRule}
  Return Learned_rules
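Below is a structural sketch of FOIL's two nested loops. Everything first-order (candidate literal generation, variable bindings, coverage testing, the gain measure, and the rule object) is passed in by the caller; these helpers, and the example label attribute, are assumptions of the sketch rather than part of the slides.

```python
def foil(examples, make_empty_rule, candidate_literals, foil_gain, covers):
    """Outer loop: add one rule at a time until all positives are covered.
    Inner loop: specialize the current rule until it covers no negatives."""
    pos = [ex for ex in examples if ex.label]
    neg = [ex for ex in examples if not ex.label]
    learned_rules = []
    while pos:                                     # outer loop
        rule = make_empty_rule()                   # target predicate, no preconditions
        rule_neg = list(neg)
        while rule_neg:                            # inner loop
            best = max(candidate_literals(rule),
                       key=lambda lit: foil_gain(lit, rule, pos, rule_neg))
            rule.body.append(best)                 # add the best literal as a precondition
            rule_neg = [ex for ex in rule_neg if covers(rule, ex)]
        learned_rules.append(rule)
        pos = [ex for ex in pos if not covers(rule, ex)]
    return learned_rules
```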
Comments:
Each iteration of the outer loop adds a new rule to Learned_rules; at the level of the disjunctive rule set this is a specific-to-general search, since every added rule makes the overall hypothesis more general.
The inner loop performs a finer-grained, general-to-specific hill-climbing search to determine the exact definition of each rule.