An introduction to large language models and their relevance for statistical offices , 2024 edition.
Can be downloaded at https://op.europa.eu/en/publication-detail/-/publication/f4a703b3-ea60-11ee-bf53-01aa75ed71a1/language-en
This manual is a straightforward resource for data professionals of Sta...
An introduction to large language models and their relevance for statistical offices , 2024 edition.
Can be downloaded at https://op.europa.eu/en/publication-detail/-/publication/f4a703b3-ea60-11ee-bf53-01aa75ed71a1/language-en
This manual is a straightforward resource for data professionals of Statistical Offices, introducing the use of Large Language Models (LLMs) in the field of Official Statistics. It outlines how LLMs can tackle complex data problems with their advanced language processing capabilities and integrates these models into current processes. This guide introduces LLMs, delineating their evolution, architecture, applications, and implications for future employment within the AI realm. Additionally, it emphasizes the need for ethical and responsible applications, blending research insights with practical industry examples to ensure professionals can maximize LLM benefits while maintaining trust and reliability in their work.
Size: 22.05 MB
Language: en
Added: Sep 08, 2024
Slides: 55 pages
Slide Content
An introduction to Large Language Models and their relevance for Statistical Offices Dario Buono, Ph.D. Marius Felecan, MSE Cristiano Tessitore, Ph.D. An Eurostat AI paper by https://ec.europa.eu/eurostat/product?code=KS-TC-24-001
Hallucinate
Part One Framing the Context
GenAI Technology Large Language Models whats and hows.
AI definition 2018 “Artificial intelligence (AI) systems are software (and possibly also hardware) systems designed by humans that, given a complex goal, act in the physical or digital dimension by perceiving their environment through data acquisition, interpreting the collected structured or unstructured data, reasoning on the knowledge, or processing the information, derived from this data and deciding the best action(s) to take to achieve the given goal . AI systems can either use symbolic rules or learn a numeric model, and they can also adapt their behaviour by analysing how the environment is affected by their previous actions” Independent High-Level Group on AI (hired by the European Commission), 2018
AI definition 2024 ‘AI system‘ is a machine-based system designed to operate with varying levels of autonomy and that may exhibit adaptiveness after deployment and that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments” Art. 3, EU AI Act
Large Language Models Architecture Domains Fields Transformers NLP Self Driving Tesla LLMs : Gemini ChatGPT Deep Learning Machine Learning Artificial Intelligence
Large Language Models - Training LLM ( parameters ) Corpus
Large Language Models – Fine tuning LLM ( parameters )
Large Language Models – Inference Mary had a little LLM lamb Prompt (Context)
Charting LLMs A Call for Standardization
Emerging & Disrupting Emerging - new and innovative development that significantly alters the current landscape of business and society. Disrupting - groundbreaking product or service that fundamentally changes the market or society.
Cars Prediction (1903) “The automobile is a fad, a novelty. Horses are here to stay.” President of Michigan Savings Bank
… forecasts about disruptors PC Prediction (mid-1970s) “There is no reason an individual would ever want a computer in their home.” Ken Olsen, founder of DEC digital photography, mobile computing, smart phone … Internet is now 35 years old! still reinventing itself
Not mature: Still progress to be made. Not well understood : Still the expectations are sometimes unrealistic. Emergent Disruptive Huge potential Chain reactions How fast? Hype vs. Reality
FOMO (fear of missing out) FOBO (fear of a better option) Fear of Being too Early high risk, high reward
Now & Near Future Some Strategies to Consider Education and Awareness Technology Adoption and Integration Workforce Reskilling and Upskilling Partnerships and Collaborations Ethical AI Use and Governance
Jeff Bezos on AI: Large language models are ‘not inventions, they’re discoveries’ Source: Jeff Bezos - Amazon and Blue Origin, Lex Fridman Podcast, YouTube
Questions ?
Part Two Use Cases
LLM4Statistics Interesting concepts and applications
Degrees of AI integration Assisted AI (Supportive) : AI systems that enhance human tasks without replacing human decision-making. Examples include analytical tools that provide insights for humans to interpret. Augmented AI (Collaborative) : These AI systems work alongside humans, enhancing their capabilities through suggestions or automating routine parts of tasks. This is seen in applications like medical diagnostics, where AI assists in analyzing data. Automated AI (Independent) : AI that operates fully autonomously, performing tasks without human intervention. Common uses include robotic process automation for repetitive tasks such as data entry. Autonomous AI (Self-sufficient) : The most advanced form, these systems operate independently in changing environments, making decisions without human input. Autonomous vehicles and intelligent drones are key examples.
Multitool vs Screwdriver
LLM vs SLM LLM SLM Parameters Several billions (even trillion) Few billions (1 to 3) Knowledge Wide To be specialized Languages Several English Fine tuning Expensive Cheap Hardware specs High Low Performances on benchmark High variable
Mixture of experts INPUT Router Expert 1 Expert 2 Expert 3 Expert 4 Expert 5 Additive combination OUTPUT LMSYS Chatbot Arena Leaderboard as of 07/03/24
Training Extremely expensive Fine tuning May be not that cheap RAG = Retrieval-Augmented Generation Solution for incorporating knowledge from external databases Dynamic | Cost-effective | Modular Training vs Fine tuning vs RAG systems
RAG systems Source: NVIDIA
RAG systems Source: Gao et al (2024)
Selected Use cases
GPT@JRC Cloud + Hosted Models Unified Graphical User Interface API available Internal Security and Privacy Rules Compliant European Commission Joint Research Centre GPTs Platform
Coding
Improved efficiency Our Use Case: Old, inherited codebase Documenting and Commenting
Boilerplate code Our recommendation: Use IDE integrated plugins, mature and efficient Comments: careful with prompts Code Generation
Test Case Generation Our experience: debugging works better for not so experienced developers Testing and Debugging
(Semi-)Automation Teams of GPTs Future developments
Research
Partner Assistant Critic (reviewer) Personas
ChatGPT with plugins Commercially available solutions Writefull - for academic and technical writing HeyGPT - Chat with PDFs Litmaps - Best Literature Search Jenni - Helps you write, edit, and cite with confidence. In-house custom developed scripts Literature review
(Semi-)Automation Exploring big context models Future developments
Model tuning Going beyond prompting
EUBERT is a pretrained BERT uncased model that has been trained on a vast corpus of documents registered by the European Publications Office . These documents span the last 30 years, providing a comprehensive dataset that encompasses a wide range of topics and domains. EU-BERT Model Text Classification, Question Answering, Named entity recognition, part-of-speech tagging, text generation ... https://huggingface.co/EuropeanParliament/EUBERT Using the EuroHPC Meluxina cluster, a core component of Europe's high-performance computing landscape: Hardware Type: 4 x GPUs 24GB GPU Days: 16 Model size: 94M params
Questions ?
Part Three IP and E thics
Generative AI tools are already pre-trained when available for use. Most of the main models have been trained vast amounts of data available on the Internet , not always respecting copyright or intellectual property. The output of the model can infringe copyright The proprietorship of the output of GenAI is an issue, as copyright protection is only available for works created by human beings. What’s different on GenAI
Copyright issue – text
Copyright issue - image
Practical scenarios Intellectual Property No intention to use the output in any document; Intention to use the output in an internal document; Intention to use the output in an external document. Nature of the information Public information; Non-Public information; Nature of the system Public system; Trusted-cloud provider; Internal LLM.
[Rule n°1] Staff must never share any information that is not already in the public domain, nor personal data, with an online available generative AI model. [Rule n°2] Staff should always critically assess any response produced by an online available generative AI model for potential biases and factually inaccurate information. European Commission Guidelines
[Rule n°3] Staff should always critically assess whether the outputs of an online available generative AI model are not violating intellectual property rights, in particular copyright of third parties. [Rule n°4] Staff shall never directly replicate the output of a generative AI model in public documents, such as the creation of Commission texts, notably legally binding ones. [Rule n°5] Staff should never rely on online available generative AI models for critical and time-sensitive processes. European Commission Guidelines
Ethics
I t’s a wrap Not a magic wand but a useful tool
To be continued … https://cros.ec.europa.eu/dashboard/ntts-2025