Introduction to LLMs and their relevance for Official Statistics

DarioBuonoPhDinEcono 130 views 55 slides Sep 08, 2024


About This Presentation

An introduction to large language models and their relevance for statistical offices, 2024 edition.
Can be downloaded at https://op.europa.eu/en/publication-detail/-/publication/f4a703b3-ea60-11ee-bf53-01aa75ed71a1/language-en
This manual is a straightforward resource for data professionals of Sta...

Size: 22.05 MB

Language: en

Added: Sep 08, 2024

Slides: 55 pages

Slide Content

An introduction to Large Language Models and their relevance for Statistical Offices
Dario Buono, Ph.D.; Marius Felecan, MSE; Cristiano Tessitore, Ph.D.
A Eurostat AI paper: https://ec.europa.eu/eurostat/product?code=KS-TC-24-001

Hallucinate

Part One Framing the Context

GenAI Technology: Large Language Models, whats and hows.

AI definition 2018: “Artificial intelligence (AI) systems are software (and possibly also hardware) systems designed by humans that, given a complex goal, act in the physical or digital dimension by perceiving their environment through data acquisition, interpreting the collected structured or unstructured data, reasoning on the knowledge, or processing the information, derived from this data and deciding the best action(s) to take to achieve the given goal. AI systems can either use symbolic rules or learn a numeric model, and they can also adapt their behaviour by analysing how the environment is affected by their previous actions.” Independent High-Level Group on AI (hired by the European Commission), 2018

AI definition 2024: “‘AI system’ is a machine-based system designed to operate with varying levels of autonomy and that may exhibit adaptiveness after deployment and that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments.” Art. 3, EU AI Act

Large Language Models (diagram). Fields: Artificial Intelligence, Machine Learning, Deep Learning. Domains: NLP, Self Driving. Architecture: Transformers. Examples: LLMs (Gemini, ChatGPT), Tesla.

Large Language Models – Training (diagram): Corpus -> LLM (parameters)

Large Language Models – Fine tuning (diagram): LLM (parameters)

Large Language Models – Inference (diagram): Prompt (Context) “Mary had a little” -> LLM -> “lamb”
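The training and inference slides above can be illustrated with a deliberately tiny stand-in for an LLM. The sketch below is not how a transformer works internally: it assumes a simple bigram counter, where the "parameters" are word-pair counts learned from a corpus and inference returns the most frequent next word for the last word of the prompt. The function names and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict

def train(corpus: str) -> dict:
    """Learn the 'parameters': counts of which word follows which."""
    counts = defaultdict(Counter)
    tokens = corpus.lower().split()
    for previous, following in zip(tokens, tokens[1:]):
        counts[previous][following] += 1
    return counts

def predict_next(parameters: dict, prompt: str):
    """Inference: return the most frequent continuation of the prompt's last word."""
    last_word = prompt.lower().split()[-1]
    candidates = parameters.get(last_word)
    if not candidates:
        return None  # the model never saw this word during training
    return candidates.most_common(1)[0][0]

# Toy corpus (an assumption for illustration only).
corpus = "mary had a little lamb its fleece was white as snow"
parameters = train(corpus)
print(predict_next(parameters, "Mary had a little"))  # prints: lamb
```

A real LLM replaces the count table with billions of learned weights and conditions on the whole prompt rather than only its last word, but the split between a training step that fits parameters and an inference step that reads them is the same.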

Charting LLMs A Call for Standardization

Emerging & Disrupting. Emerging: new and innovative development that significantly alters the current landscape of business and society. Disrupting: groundbreaking product or service that fundamentally changes the market or society.

Cars Prediction (1903) “The automobile is a fad, a novelty. Horses are here to stay.” President of Michigan Savings Bank

… forecasts about disruptors. PC Prediction (mid-1970s): “There is no reason an individual would ever want a computer in their home.” Ken Olsen, founder of DEC. Digital photography, mobile computing, smart phone … Internet is now 35 years old! Still reinventing itself.

Hype vs. Reality. Not mature: progress is still to be made. Not well understood: expectations are still sometimes unrealistic. Emergent, disruptive, huge potential, chain reactions. How fast?

FOMO (fear of missing out), FOBO (fear of a better option), Fear of Being too Early: high risk, high reward

Now & Near Future: Some Strategies to Consider
Education and Awareness
Technology Adoption and Integration
Workforce Reskilling and Upskilling
Partnerships and Collaborations
Ethical AI Use and Governance

Jeff Bezos on AI: Large language models are ‘not inventions, they’re discoveries’ Source: Jeff Bezos - Amazon and Blue Origin, Lex Fridman Podcast, YouTube

Questions?

Part Two Use Cases

LLM4Statistics: Interesting concepts and applications

Degrees of AI integration
Assisted AI (Supportive): AI systems that enhance human tasks without replacing human decision-making. Examples include analytical tools that provide insights for humans to interpret.
Augmented AI (Collaborative): AI systems that work alongside humans, enhancing their capabilities through suggestions or automating routine parts of tasks. This is seen in applications like medical diagnostics, where AI assists in analyzing data.
Automated AI (Independent): AI that operates fully autonomously, performing tasks without human intervention. Common uses include robotic process automation for repetitive tasks such as data entry.
Autonomous AI (Self-sufficient): The most advanced form; these systems operate independently in changing environments, making decisions without human input. Autonomous vehicles and intelligent drones are key examples.

Multitool vs Screwdriver

LLM vs SLM
Aspect | LLM | SLM
Parameters | Several billions (even trillions) | Few billions (1 to 3)
Knowledge | Wide | To be specialized
Languages | Several | English
Fine tuning | Expensive | Cheap
Hardware specs | High | Low
Performance on benchmarks | High | Variable
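The link between the parameter count and the hardware specs in the comparison above can be made concrete with a rough back-of-the-envelope calculation. The sketch below assumes weights stored in 16-bit precision (2 bytes per parameter) and counts only the weights, ignoring activations and other runtime overhead; the 70 billion figure is a hypothetical example of "several billions", not a number from the slides.

```python
def weights_memory_gb(n_parameters: float, bytes_per_parameter: int = 2) -> float:
    """Approximate memory needed just to hold the model weights, in GB (10^9 bytes)."""
    return n_parameters * bytes_per_parameter / 1e9

# SLM range from the comparison (1 to 3 billion parameters), 16-bit weights assumed.
print(weights_memory_gb(1e9))   # prints: 2.0
print(weights_memory_gb(3e9))   # prints: 6.0

# Hypothetical LLM with 70 billion parameters, same precision.
print(weights_memory_gb(70e9))  # prints: 140.0
```

Under these assumptions a small model fits on a single consumer-grade GPU, while the larger one needs several high-memory accelerators, which is one way to read the "High" versus "Low" hardware row.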

Mixture of experts (diagram): INPUT -> Router -> Expert 1, Expert 2, Expert 3, Expert 4, Expert 5 -> Additive combination -> OUTPUT. LMSYS Chatbot Arena Leaderboard as of 07/03/24
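The router, experts, and additive combination in the diagram above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the architecture of any particular model: the input is a single number, the router is a softmax over hypothetical per-expert weights, only the top-k experts are evaluated (top-k routing is an assumption not shown on the slide), and the experts are toy functions.

```python
import math

def softmax(scores):
    """Turn raw router scores into gate weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_weights, experts, top_k=2):
    """Route x to the top_k experts and combine their outputs additively."""
    gates = softmax([w * x for w in router_weights])
    chosen = sorted(range(len(experts)), key=lambda i: gates[i], reverse=True)[:top_k]
    norm = sum(gates[i] for i in chosen)  # renormalise over the chosen experts
    return sum(gates[i] / norm * experts[i](x) for i in chosen)

# Five toy experts: expert k multiplies its input by k (hypothetical).
experts = [lambda x, k=k: k * x for k in range(1, 6)]
# Hypothetical router weights that strongly favour the third expert.
router_weights = [0.0, 0.0, 5.0, 0.0, 0.0]

print(moe_forward(1.0, router_weights, experts, top_k=1))  # prints: 3.0
```

Evaluating only a few experts per input is what lets such models hold many parameters while keeping the cost of each forward pass closer to that of a smaller model.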

Training vs Fine tuning vs RAG systems
Training: extremely expensive.
Fine tuning: may not be that cheap.
RAG (Retrieval-Augmented Generation): a solution for incorporating knowledge from external databases. Dynamic | Cost-effective | Modular

RAG systems Source: NVIDIA

RAG systems Source: Gao et al (2024)
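The retrieval step of a RAG system, as described on the slides above, can be sketched without any external services. The example below is a toy under stated assumptions: retrieval uses bag-of-words cosine similarity instead of learned embeddings and a vector database, the three documents are made-up snippets, and the generation step is left out, so the function only builds the augmented prompt that would be sent to a model.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words vector: word -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = vectorize(query)
    ranked = sorted(documents, key=lambda d: cosine(q, vectorize(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, documents, k=1):
    """Augment the question with retrieved context before calling a model."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Hypothetical external knowledge base.
documents = [
    "The unemployment rate measures the share of the labour force without work.",
    "GDP measures the value of goods and services produced in a period.",
    "Inflation is the rate of increase in consumer prices.",
]
print(build_prompt("what does the unemployment rate measure", documents))
```

Because the knowledge lives in the document store rather than in the model weights, updating it means editing the documents, not retraining, which is the "Dynamic | Cost-effective | Modular" point made on the comparison slide.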

Selected Use cases

GPT@JRC: European Commission Joint Research Centre GPTs Platform
Cloud + Hosted Models
Unified Graphical User Interface
API available
Internal Security and Privacy Rules Compliant

Coding

Documenting and Commenting. Improved efficiency. Our use case: old, inherited codebase.

Code Generation. Boilerplate code. Our recommendation: use IDE-integrated plugins, which are mature and efficient. Comments: be careful with prompts.

Testing and Debugging. Test case generation. Our experience: debugging works better for less experienced developers.

Future developments. (Semi-)Automation. Teams of GPTs.

Research

Personas: Partner, Assistant, Critic (reviewer)

Literature review
ChatGPT with plugins
Commercially available solutions:
Writefull: for academic and technical writing
HeyGPT: chat with PDFs
Litmaps: best literature search
Jenni: helps you write, edit, and cite with confidence
In-house custom developed scripts

Future developments. (Semi-)Automation. Exploring big context models.

Model tuning: Going beyond prompting

EU-BERT Model
EUBERT is a pretrained BERT uncased model that has been trained on a vast corpus of documents registered by the European Publications Office. These documents span the last 30 years, providing a comprehensive dataset that encompasses a wide range of topics and domains.
Tasks: Text Classification, Question Answering, Named entity recognition, part-of-speech tagging, text generation ...
https://huggingface.co/EuropeanParliament/EUBERT
Trained using the EuroHPC Meluxina cluster, a core component of Europe's high-performance computing landscape. Hardware Type: 4 x GPUs 24GB. GPU Days: 16. Model size: 94M params.

Questions?

Part Three IP and Ethics

What’s different about GenAI
Generative AI tools are already pre-trained when available for use.
Most of the main models have been trained on vast amounts of data available on the Internet, not always respecting copyright or intellectual property.
The output of the model can infringe copyright.
The ownership of the output of GenAI is an issue, as copyright protection is only available for works created by human beings.

Practical scenarios
Intellectual Property: no intention to use the output in any document; intention to use the output in an internal document; intention to use the output in an external document.
Nature of the information: public information; non-public information.
Nature of the system: public system; trusted-cloud provider; internal LLM.

European Commission Guidelines
[Rule n°1] Staff must never share any information that is not already in the public domain, nor personal data, with an online available generative AI model.
[Rule n°2] Staff should always critically assess any response produced by an online available generative AI model for potential biases and factually inaccurate information.

European Commission Guidelines (continued)
[Rule n°3] Staff should always critically assess whether the outputs of an online available generative AI model are not violating intellectual property rights, in particular copyright of third parties.
[Rule n°4] Staff shall never directly replicate the output of a generative AI model in public documents, such as the creation of Commission texts, notably legally binding ones.
[Rule n°5] Staff should never rely on online available generative AI models for critical and time-sensitive processes.

Ethics

It’s a wrap: Not a magic wand but a useful tool

To be continued … https://cros.ec.europa.eu/dashboard/ntts-2025

Dictionary.com

Gartner, https://www.gartner.com/en/articles/what-s-new-in-artificial-intelligence-from-the-2023-gartner-hype-cycle
