Introduction to LLMs and their relevance for Official Statistics

DarioBuonoPhDinEcono 130 views 55 slides Sep 08, 2024

About This Presentation

An introduction to large language models and their relevance for statistical offices, 2024 edition.
Can be downloaded at https://op.europa.eu/en/publication-detail/-/publication/f4a703b3-ea60-11ee-bf53-01aa75ed71a1/language-en
This manual is a straightforward resource for data professionals of Sta...


Slide Content

An introduction to Large Language Models and their relevance for Statistical Offices
Dario Buono, Ph.D.; Marius Felecan, MSE; Cristiano Tessitore, Ph.D.
A Eurostat AI paper: https://ec.europa.eu/eurostat/product?code=KS-TC-24-001

Hallucinate

Part One Framing the Context

GenAI Technology Large Language Models whats and hows.

AI definition 2018
“Artificial intelligence (AI) systems are software (and possibly also hardware) systems designed by humans that, given a complex goal, act in the physical or digital dimension by perceiving their environment through data acquisition, interpreting the collected structured or unstructured data, reasoning on the knowledge, or processing the information, derived from this data and deciding the best action(s) to take to achieve the given goal. AI systems can either use symbolic rules or learn a numeric model, and they can also adapt their behaviour by analysing how the environment is affected by their previous actions”
Independent High-Level Group on AI (hired by the European Commission), 2018

AI definition 2024
“‘AI system’ is a machine-based system designed to operate with varying levels of autonomy and that may exhibit adaptiveness after deployment and that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments”
Art. 3, EU AI Act

Large Language Models
[Diagram labels: Architecture, Domains, Fields; Transformers; NLP; Self Driving (Tesla); LLMs: Gemini, ChatGPT; nested areas: Deep Learning, Machine Learning, Artificial Intelligence]

Large Language Models – Training
[Diagram: Corpus → LLM (parameters)]

Large Language Models – Fine tuning
[Diagram: LLM (parameters)]

Large Language Models – Inference
[Diagram: Prompt (Context) “Mary had a little” → LLM → “lamb”]
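
The inference slide can be illustrated with a toy next-word predictor. This is only a sketch under strong assumptions: a real LLM is a neural network over tokens, whereas the code below swaps in simple bigram counts over a tiny made-up corpus. The names `corpus`, `train_bigrams` and `predict_next` are illustrative and do not come from the paper.

```python
from collections import Counter, defaultdict

# Tiny illustrative corpus (an assumption for this sketch, not real training data).
corpus = "mary had a little lamb . mary had a little lamb whose fleece was white as snow ."


def train_bigrams(text):
    """Count, for every word, which words follow it (a stand-in for 'training')."""
    counts = defaultdict(Counter)
    words = text.split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def predict_next(counts, prompt):
    """Return the most frequent continuation of the prompt's last word, or None."""
    last_word = prompt.lower().split()[-1]
    followers = counts.get(last_word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


model = train_bigrams(corpus)
print(predict_next(model, "Mary had a little"))  # prints "lamb"
```

The point is only the shape of the process: the prompt is the context, and the model returns the continuation it has seen most often after that context.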

Charting LLMs A Call for Standardization

Emerging & Disrupting
- Emerging: new and innovative development that significantly alters the current landscape of business and society.
- Disrupting: groundbreaking product or service that fundamentally changes the market or society.

Cars Prediction (1903) “The automobile is a fad, a novelty. Horses are here to stay.” President of Michigan Savings Bank

… forecasts about disruptors
PC Prediction (mid-1970s) “There is no reason an individual would ever want a computer in their home.” Ken Olsen, founder of DEC
Digital photography, mobile computing, smart phone …
Internet is now 35 years old! Still reinventing itself

Hype vs. Reality
- Not mature: still progress to be made.
- Not well understood: expectations are still sometimes unrealistic.
- Emergent, disruptive
- Huge potential
- Chain reactions
- How fast?

FOMO (fear of missing out) FOBO (fear of a better option) Fear of Being too Early high risk, high reward

Now & Near Future: Some Strategies to Consider
- Education and Awareness
- Technology Adoption and Integration
- Workforce Reskilling and Upskilling
- Partnerships and Collaborations
- Ethical AI Use and Governance

Jeff Bezos on AI: Large language models are ‘not inventions, they’re discoveries’ Source: Jeff Bezos - Amazon and Blue Origin, Lex Fridman Podcast, YouTube

Questions ?

Part Two Use Cases

LLM4Statistics Interesting concepts and applications

Degrees of AI integration
- Assisted AI (Supportive): AI systems that enhance human tasks without replacing human decision-making. Examples include analytical tools that provide insights for humans to interpret.
- Augmented AI (Collaborative): These AI systems work alongside humans, enhancing their capabilities through suggestions or automating routine parts of tasks. This is seen in applications like medical diagnostics, where AI assists in analyzing data.
- Automated AI (Independent): AI that operates fully autonomously, performing tasks without human intervention. Common uses include robotic process automation for repetitive tasks such as data entry.
- Autonomous AI (Self-sufficient): The most advanced form, these systems operate independently in changing environments, making decisions without human input. Autonomous vehicles and intelligent drones are key examples.

Multitool vs Screwdriver

LLM vs SLM
                            LLM                                 SLM
Parameters                  Several billions (even trillion)    Few billions (1 to 3)
Knowledge                   Wide                                To be specialized
Languages                   Several                             English
Fine tuning                 Expensive                           Cheap
Hardware specs              High                                Low
Performances on benchmark   High                                Variable
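
The "Hardware specs" row follows largely from the parameter counts. A back-of-the-envelope sketch, assuming 16-bit weights (2 bytes per parameter) and counting only the weights, not activations or any training state; the two parameter counts are illustrative picks within the ranges in the table, not figures from the paper:

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Parameter counts are illustrative assumptions within the ranges in the table.
slm_params = 3e9    # upper end of "1 to 3 billion"
llm_params = 70e9   # one example of "several billions"

print(f"SLM: {weight_memory_gb(slm_params):.0f} GB")  # SLM: 6 GB
print(f"LLM: {weight_memory_gb(llm_params):.0f} GB")  # LLM: 140 GB
```

Under these assumptions the smaller model's weights fit on a single consumer-grade GPU, while the larger one needs several data-centre GPUs just to hold the weights.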

Mixture of experts
[Diagram: INPUT → Router → Expert 1, Expert 2, Expert 3, Expert 4, Expert 5 → Additive combination → OUTPUT]
LMSYS Chatbot Arena Leaderboard as of 07/03/24
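
The router / experts / additive combination flow in the diagram can be sketched in a few lines. Everything here is a toy under stated assumptions: the five experts are plain functions of a number, the router scores are hand-written rather than learned, and the top-2 selection is a common choice that the slide does not specify.

```python
import math


def softmax(scores):
    """Turn raw router scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Five toy "experts": each is just a simple function of a number (hypothetical).
experts = [
    lambda x: x + 1.0,
    lambda x: 2.0 * x,
    lambda x: x * x,
    lambda x: -x,
    lambda x: 0.5 * x,
]


def router_scores(x):
    """Hypothetical router: one score per expert. A real router is a learned layer."""
    return [x * k for k in (0.1, 0.2, 0.3, 0.4, 0.5)]


def mixture_of_experts(x, top_k=2):
    """Route x to the top_k experts and add up their outputs, weighted by the router."""
    scores = router_scores(x)
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    weights = softmax([scores[i] for i in chosen])
    output = sum(w * experts[i](x) for w, i in zip(weights, chosen))
    return output, chosen, weights


output, chosen, weights = mixture_of_experts(2.0)
print(chosen, [round(w, 3) for w in weights], round(output, 3))
```

Only the selected experts are evaluated for a given input, which is why such models can have many parameters in total while using a fraction of them per request.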

Training vs Fine tuning vs RAG systems
- Training: extremely expensive
- Fine tuning: may not be that cheap
- RAG (Retrieval-Augmented Generation): solution for incorporating knowledge from external databases. Dynamic | Cost-effective | Modular

RAG systems Source: NVIDIA

RAG systems Source: Gao et al. (2024)
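
A minimal sketch of the retrieve-then-augment idea behind RAG. It makes several simplifying assumptions: the "external database" is a three-sentence list, retrieval uses plain word overlap instead of the embedding search a production system would typically use, and the final call to an LLM is left out. The names `documents`, `retrieve` and `build_prompt` are hypothetical.

```python
# Hypothetical mini knowledge base; in practice this would be an external database.
documents = [
    "Eurostat is the statistical office of the European Union.",
    "Retrieval-Augmented Generation adds retrieved passages to the prompt.",
    "Fine tuning updates the parameters of a pretrained model.",
]


def tokenize(text):
    """Lowercase and split on whitespace, stripping basic punctuation."""
    return {word.strip(".,?!").lower() for word in text.split()}


def retrieve(question, docs, top_k=1):
    """Rank documents by number of words shared with the question (toy retriever)."""
    q_words = tokenize(question)
    ranked = sorted(docs, key=lambda d: len(q_words & tokenize(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, docs):
    """Augment the question with the retrieved context before calling an LLM."""
    context = "\n".join(f"- {d}" for d in retrieve(question, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


prompt = build_prompt("What is Eurostat?", documents)
print(prompt)
```

Because the knowledge lives in `documents` rather than in the model parameters, it can be updated without retraining or fine tuning, which is the "dynamic, cost-effective, modular" argument made above.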

Selected Use cases

GPT@JRC: European Commission Joint Research Centre GPTs Platform
- Cloud + Hosted Models
- Unified Graphical User Interface
- API available
- Internal Security and Privacy Rules Compliant

Coding

Documenting and Commenting
- Improved efficiency
- Our use case: old, inherited codebase

Code Generation
- Boilerplate code
- Our recommendation: use IDE-integrated plugins, mature and efficient
- Comments: careful with prompts

Testing and Debugging
- Test case generation
- Our experience: debugging works better for less experienced developers

Future developments
- (Semi-)Automation
- Teams of GPTs

Research

Personas
- Partner
- Assistant
- Critic (reviewer)

Literature review
- ChatGPT with plugins
- Commercially available solutions:
  - Writefull: for academic and technical writing
  - HeyGPT: Chat with PDFs
  - Litmaps: Best Literature Search
  - Jenni: Helps you write, edit, and cite with confidence.
- In-house custom developed scripts

Future developments
- (Semi-)Automation
- Exploring big context models

Model tuning Going beyond prompting

EU-BERT Model
EUBERT is a pretrained BERT uncased model that has been trained on a vast corpus of documents registered by the European Publications Office. These documents span the last 30 years, providing a comprehensive dataset that encompasses a wide range of topics and domains.
Tasks: Text Classification, Question Answering, Named entity recognition, part-of-speech tagging, text generation ...
https://huggingface.co/EuropeanParliament/EUBERT
Trained using the EuroHPC Meluxina cluster, a core component of Europe's high-performance computing landscape:
- Hardware Type: 4 x GPUs 24GB
- GPU Days: 16
- Model size: 94M params

Questions ?

Part Three IP and Ethics

What’s different with GenAI
- Generative AI tools are already pre-trained when available for use.
- Most of the main models have been trained on vast amounts of data available on the Internet, not always respecting copyright or intellectual property.
- The output of the model can infringe copyright.
- The ownership of the output of GenAI is an issue, as copyright protection is only available for works created by human beings.

Copyright issue – text

Copyright issue - image

Practical scenarios
- Intellectual Property: no intention to use the output in any document; intention to use the output in an internal document; intention to use the output in an external document.
- Nature of the information: public information; non-public information.
- Nature of the system: public system; trusted-cloud provider; internal LLM.

European Commission Guidelines
[Rule n°1] Staff must never share any information that is not already in the public domain, nor personal data, with an online available generative AI model.
[Rule n°2] Staff should always critically assess any response produced by an online available generative AI model for potential biases and factually inaccurate information.
[Rule n°3] Staff should always critically assess whether the outputs of an online available generative AI model are not violating intellectual property rights, in particular copyright of third parties.
[Rule n°4] Staff shall never directly replicate the output of a generative AI model in public documents, such as the creation of Commission texts, notably legally binding ones.
[Rule n°5] Staff should never rely on online available generative AI models for critical and time-sensitive processes.
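
The three dimensions of the "Practical scenarios" slide combine into a small decision grid, which can be enumerated as a sketch. One loud assumption: the code treats a "public system" as an "online available generative AI model" in the sense of Rule n°1, a mapping the slides do not state explicitly. Personal data, also covered by Rule n°1, is not modelled, and this is in no way an official compliance check.

```python
from itertools import product

# The three dimensions listed on the "Practical scenarios" slide.
intentions = ["no use in any document", "internal document", "external document"]
information = ["public", "non-public"]
systems = ["public system", "trusted-cloud provider", "internal LLM"]


def breaks_rule_1(info, system):
    """Assumption: a 'public system' is an 'online available' model in the sense of Rule 1."""
    return info == "non-public" and system == "public system"


scenarios = list(product(intentions, information, systems))
flagged = [s for s in scenarios if breaks_rule_1(s[1], s[2])]

print(len(scenarios), "scenarios,", len(flagged), "flagged under Rule 1")
for intention, info, system in flagged:
    print(f"- {info} information, {system}, {intention}")
```

The grid has 3 x 2 x 3 = 18 combinations; under the assumption above, the three that pair non-public information with a public system are ruled out regardless of how the output would be used.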

Ethics

It’s a wrap
Not a magic wand but a useful tool

To be continued … https://cros.ec.europa.eu/dashboard/ntts-2025

Thank you © European Union 2024 CREDITS:

Dictionary.com

Gartner, https://www.gartner.com/en/articles/what-s-new-in-artificial-intelligence-from-the-2023-gartner-hype-cycle