How to get more people onboard with Pharo?

esug · 33 slides · Sep 04, 2024

About This Presentation

Talk from ESUG 2024.

How to get more people onboard with Pharo? Applying Large Language Models (LLM) as support for the onboarding of new developers. (Marius Pingaud and Pascal Zaragoza)

PDF: http://archive.esug.org/ESUG2024/day1/05-Pharo%20RAG%20LLM%20-%20PZAR.pdf


Slide Content

Slide 1
How to get more people onboard with Pharo?
Applying Large Language Models (LLM) as support for the onboarding of new developers.
ChatGPT 3.5
Marius Pingaud
Pascal Zaragoza

Slide 2
The Typical New User Experience: How do I do this?
User: "How do I do … in Pharo / Smalltalk?"

Slide 3
The Typical New User Experience: Using documentation
User: "How do I do … in Pharo / Smalltalk?"
[Diagram: the User sends a Question to the documentation and receives an Answer.]
+ Highly complete information
- Highly dependent on documentation availability

Slide 4
The Typical New User Experience: Asking experts
User: "How do I do … in Pharo / Smalltalk?"
[Diagram: the User sends a Question to an Expert or a Forum and receives an Answer.]
+ Rich and interactive response
- Can be a slow process

Slide 5
The New User Experience: Large Language Models (LLM)
Simple user interaction with an LLM.
[Diagram: the User sends a Question through a User Interface (e.g., ChatGPT); the LLM returns an Answer.]
Question: "What are the main methods used to sort a collection in Pharo, and how do they differ?"
Answer: "In Pharo Smalltalk, the main methods used to sort a collection are `sort`, `sort:` and `sort:ascending:`."

Slide 6
New User Experience using Large Language Models (LLM)
Answer: "In Pharo Smalltalk, the main methods used to sort a collection are `sort`, `sort:` and `sort:ascending:`."
User: "What?!"

Slide 7
New User Experience using Large Language Models (LLM)
Sorting methods for Collection & OrderedCollection.

Slide 8
New User Experience using Large Language Models (LLM)
Sorting methods for Collection & OrderedCollection.
Problem #1: LLMs have limited knowledge, with a training cutoff date.

Slide 9
New User Experience using Large Language Models (LLM)
[Diagram: the User sends a Question through a User Interface (e.g., ChatGPT); the LLM returns an Answer.]
Question: "How do I use … library?"
Answer: "I'm sorry, I don't know anything about…"

Slide 10
New User Experience using Large Language Models (LLM)
[Diagram: the User sends a Question through a User Interface (e.g., ChatGPT); the LLM returns an Answer.]
Question: "How do I use … library?"
Answer: "I'm sorry, I don't know anything about…"
Problem #2: Internal business rules are unknown to foundational LLMs.

Slide 11
The problem with foundational LLMs
Problem #1: LLMs have limited knowledge, with a training cutoff date.
Problem #2: Internal business rules are unknown to foundational LLMs.
Solution #1: Fine-tune the model with business rules and knowledge.
- Advantage: lower inference costs (after a high initial training cost).
- Disadvantage: an expensive and difficult process, which can still cause hallucinations when answering questions.
Solution #2: Retrieval-Augmented Generation (RAG).
- Provide both the relevant business rules and the initial question to the LLM.
- Advantage: documentation is all you need.
- Disadvantage: higher inference costs due to a bigger context size.

Slide 12
The problem with foundational LLMs

Slides 13–14
The problem with foundational LLMs (same content as slide 11).

Slide 15
Retrieval Augmented Generation (RAG)
[Diagram build-up: the User.]

Slide 16
Retrieval Augmented Generation (RAG)
[Diagram: the User sends a Question to the User Interface and receives an Answer.]

Slide 17
Retrieval Augmented Generation (RAG)
[Diagram: the User Interface passes the Question, through an API, to a Document Retrieval Algorithm, which queries a vectorial DB with documentation and returns the top N docs.]

Slide 18
Retrieval Augmented Generation (RAG)
[Diagram: context-building logic combines the Question with the top N docs retrieved from the vectorial DB; the Question + docs are sent to the LLM, which returns the Answer.]

Slide 19
Retrieval Augmented Generation (RAG)
[Diagram: a formatting step next to the user interface formats the LLM response and returns the answer to the user together with its sources.]

Slide 20
Retrieval Augmented Generation (RAG)
[Diagram: the question → document retrieval → top N docs part of the pipeline is labelled "Information Retrieval".]
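
A minimal sketch of this "Information Retrieval" half, assuming the documentation segments have already been embedded (see slide 23). The `sentence-transformers` model name, the `retrieve_top_n` helper, and the in-memory arrays are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the retrieval step: embed the question and return the top-N
# documentation segments by cosine similarity. Model name, helper name and
# the in-memory arrays are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def retrieve_top_n(question, doc_vectors, doc_segments, n=5):
    """Return the n segments whose embeddings are closest to the question."""
    q = embedder.encode([question])[0]
    # Cosine similarity between the question vector and every segment vector.
    sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    top = np.argsort(sims)[::-1][:n]
    return [doc_segments[i] for i in top]
```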

Slide 21
Retrieval Augmented Generation (RAG)
[Diagram: the context-building + LLM part of the pipeline is labelled "Generation".]
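
And a minimal sketch of the "Generation" half: the context-building logic concatenates the question with the retrieved segments and sends the result to the LLM. The prompt wording, the `answer_with_sources` helper, and the model choice are assumptions (slide 1 mentions ChatGPT 3.5), not the authors' actual prompt.

```python
# Sketch of the generation step: build a prompt from the question plus the
# retrieved segments, call the LLM, and keep the sources so the interface
# can return the answer together with them.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_with_sources(question, top_docs):
    # top_docs is assumed to be a list of {"source": ..., "text": ...} dicts.
    context = "\n\n".join(f"[{d['source']}]\n{d['text']}" for d in top_docs)
    prompt = (
        "Answer the question about Pharo using only the documentation below.\n\n"
        f"Documentation:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    answer = response.choices[0].message.content
    return answer, [d["source"] for d in top_docs]
```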

Slide 22
Retrieval Augmented Generation (RAG)
[Diagram: same pipeline, now asking "How do we build this database?" about the vectorial DB with documentation.]

Slide 23
Preparing the data source
Collect Pharo documentation → Clean and parse each document → Segment the documents and embed them.
Sources: Pharo-wiki, Pharo by Example, Roassal Documentation.
Clean and parse: extract the data as readable text.
Segment and embed: associate an N-dimensional vector with the semantic value of each document.
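
A minimal sketch of this preparation pipeline, assuming the documentation has been collected as Markdown files on disk; the paths, the naive paragraph-based segmentation, and the embedding model are illustrative assumptions.

```python
# Sketch of the data preparation: read the collected Markdown documentation,
# split it into segments, and embed each segment. Paths, segment size and
# the embedding model are illustrative assumptions.
from pathlib import Path
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def load_segments(doc_root, max_chars=1000):
    """Parse each Markdown file into plain-text segments of roughly max_chars."""
    segments = []
    for path in Path(doc_root).rglob("*.md"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        chunk = ""
        for para in text.split("\n\n"):  # naive segmentation on blank lines
            if chunk and len(chunk) + len(para) > max_chars:
                segments.append({"source": str(path), "text": chunk.strip()})
                chunk = ""
            chunk += para + "\n\n"
        if chunk.strip():
            segments.append({"source": str(path), "text": chunk.strip()})
    return segments

segments = load_segments("docs/pharo-wiki")  # e.g. a cloned documentation repo
# One N-dimensional vector per segment, capturing its semantic value.
vectors = embedder.encode([s["text"] for s in segments])
```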

Slide 24
Retrieval Augmented Generation (RAG)
[Diagram: same pipeline; the vectorial DB stores a Tuple(vector, segment, metadata) for each documentation segment.]
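
Continuing the preparation sketch above, each entry of the vectorial DB is essentially the (vector, segment, metadata) tuple shown on the slide; a minimal in-memory shape could be:

```python
# One record per documentation segment: its embedding, its text, and some
# metadata (here just the source file); field names are illustrative.
from dataclasses import dataclass
import numpy as np

@dataclass
class EmbeddedSegment:
    vector: np.ndarray  # N-dimensional embedding of the segment
    segment: str        # raw text of the segment
    metadata: dict      # e.g. source document, section title

records = [
    EmbeddedSegment(vector=v, segment=s["text"], metadata={"source": s["source"]})
    for v, s in zip(vectors, segments)
]
```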

Slide 25
Tooling & Evaluation

Slide 26
Tooling
▪ Set out to create the initial tooling in Python.
▪ Identify and process a set of Pharo documentation:
  ▪ https://github.com/pharo-open-documentation/pharo-wiki.git
  ▪ https://github.com/SquareBracketAssociates/PharoByExample9
  ▪ https://github.com/SquareBracketAssociates/LearningOOPWithPharo
  ▪ https://github.com/SquareBracketAssociates/BuildingApplicationWithSpec2
  ▪ https://github.com/pharo-graphics/RoassalDocumentation.git
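
A minimal sketch of fetching the documentation sources listed above so the Python tooling can parse them; the destination directory and shallow-clone options are assumptions, and `git` must be available on the machine.

```python
# Clone the documentation repositories listed on this slide into a local
# "docs" directory so the preparation pipeline can read them.
import subprocess
from pathlib import Path

REPOS = [
    "https://github.com/pharo-open-documentation/pharo-wiki.git",
    "https://github.com/SquareBracketAssociates/PharoByExample9",
    "https://github.com/SquareBracketAssociates/LearningOOPWithPharo",
    "https://github.com/SquareBracketAssociates/BuildingApplicationWithSpec2",
    "https://github.com/pharo-graphics/RoassalDocumentation.git",
]

def fetch_docs(dest="docs"):
    Path(dest).mkdir(exist_ok=True)
    for url in REPOS:
        name = url.rstrip("/").removesuffix(".git").rsplit("/", 1)[-1]
        target = Path(dest) / name
        if not target.exists():
            subprocess.run(["git", "clone", "--depth", "1", url, str(target)], check=True)
```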

Slide 27
Onboarding tool demo

Slide 28
Evaluation protocol
▪ Objective: compare the LLM's ability to answer Pharo-related questions with a RAG-enabled LLM.
▪ Generate a set of textbook questions (21 questions).
▪ Extract a set of questions from Discord and Stack Overflow (19 questions).
▪ Manual evaluation of each answer on a scale of 0 to 3:
  ▪ 0: terrible
  ▪ 1: not ok
  ▪ 2: ok
  ▪ 3: perfect
▪ Compare naïve and RAG results.
[Diagram: each question is sent to the LLMs and the resulting answers are compared.]
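
A minimal sketch of how the manual 0–3 grades could be tallied to compare the naïve LLM and the RAG-enabled LLM, as in the charts on the next slide; the grade lists here are placeholders, not the study's data.

```python
# Tally how many answers received each grade (0 = terrible ... 3 = perfect)
# for the naive and the RAG-enabled runs. Grade lists are placeholders.
from collections import Counter

GRADES = {0: "terrible", 1: "not ok", 2: "ok", 3: "perfect"}

def tally(grades):
    counts = Counter(grades)
    return {GRADES[g]: counts.get(g, 0) for g in GRADES}

naive_grades = [2, 1, 3, 0, 2]  # placeholder manual evaluations
rag_grades = [3, 3, 2, 3, 2]    # placeholder manual evaluations

print("Naive LLM:", tally(naive_grades))
print("RAG LLM:  ", tally(rag_grades))
```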

Slide 29
Initial Results
[Chart: Results from Textbook Questions — score distribution (0–3) for Naive LLM vs. RAG.]
▪ Textbook questions (21 questions):
  ▪ When asking basic questions that exist in the documentation, the RAG technique gives near-perfect answers.
▪ Stack Overflow questions (19 questions):
  ▪ Initial results are closer.
  ▪ There is a shift towards higher answer quality.
[Chart: Results from Stack Overflow Questions — score distribution (0–3) for Naive LLM vs. RAG.]

Slide 30
Conclusion & Perspectives

Slide 31
Conclusion
▪ Textbook-based results are extremely positive.
▪ Stack Overflow-based results show a lack of significant impact for a RAG-based onboarding LLM.
▪ Why? What happened?
  ▪ Where the RAG-based technique was better, the documents provided with the question did contain an element of the needed answer.
  ▪ Where the RAG-based technique was not better, the documents provided with the question did not contain any element of the required answer.
▪ What could this mean?
  ▪ Not enough documents to answer some of these questions.
  ▪ The question is too complex to answer with a simple request.
[Example: the RAG-based LLM's answer contains a complete answer with an example; the naïve LLM's answer contains a simple explanation with some ambiguity.]

Slide 32
Perspectives
▪ Increase the amount of documentation we parse.
▪ Include other data sources (e.g., Moose models, Pharo code).
  ▪ Allows us to ask questions about a specific code base (e.g., what is the superclass of …).
▪ Use agent-based programming to iterate over more complex questions.
▪ User experimentation: launch an experiment within the company so that our users can try the tool, and collect Q&A with evaluations.

Slide 33
Thank you for your attention.
Any questions?
onboarding.pharo.research-bl.com
[email protected]
[email protected]