Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf

getindata 124 views 30 slides Jun 04, 2024

About This Presentation

Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet r...


Slide Content

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

1. Why do we need yet another (open-source) Copilot?
2. How can we build one?
3. Architecture and evaluation
4. DEMO
…how to turn best practices into an AI coding assistant

(Data) Context is king!
●Explicit and precise data context of your whole data platform
●Data transformation codebase
●Data models with comments and table relationships
●Other user queries
●Lineage and human-curated dataset descriptions from data catalogs

Data Assistants landscape
●open-source tools, such as WrenAI, Vanna.AI and Dataherald, focus on Text-to-SQL to be embedded in web interfaces (i.e. chatbots or their own SQL editors) meant for non-technical users.
●closed-source AI-powered assistants for the BigQuery (SQL+Dataform), Snowflake (SQL) and Databricks (SQL+Python) web interfaces are more like black boxes, not meant for customization.
●missing: an Analytics Engineer Copilot with dbt/SQL support

Customized and specialized models are the future.

sqlcoder-7b and others
May 9th updates
●Many other small (7-34B) models licensed for commercial use, e.g.:
✓starcoder2
✓dolphincoder
✓deepseek-coder
✓OpenCodeInterpreter
✓Llama3
When quantized, they can even be run locally!
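
Why quantized 7B-class models can fit on a local machine comes down to simple arithmetic. The sketch below is a rough, weights-only estimate: it ignores the KV cache, activations and the per-block overhead that real quantization formats add, and the function name is made up for illustration.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    total_bytes = n_params * bits_per_weight / 8  # 8 bits per byte
    return total_bytes / 2**30


# Compare a 7B-parameter model at common precisions (weights only).
for bits in (16, 8, 4):
    print(f"7B parameters at {bits}-bit: ~{weight_memory_gib(7e9, bits):.1f} GiB")
```

Halving the bits per weight halves the estimate, which is why 4-bit variants of small models become feasible on laptops and consumer GPUs.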

How to turn your best practices into Copilots?
●Vector database as a knowledge base - what?
●Prompts as instructions following best practices - how?
●LLM to combine both…
Retrieval-Augmented Generation (RAG)
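
The three ingredients above (knowledge base, instructions, LLM) can be sketched as a prompt-assembly step. This is a minimal illustration, not the GDC implementation: the retriever is a toy word-overlap ranker standing in for a vector database, and all names, snippets and prompt wording are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    source: str  # hypothetical origin, e.g. a dbt model file
    text: str


def retrieve(question: str, knowledge_base: list[Snippet], k: int = 2) -> list[Snippet]:
    """Toy retriever: rank snippets by the number of words shared with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        knowledge_base,
        key=lambda s: len(q_words & set(s.text.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(instructions: str, snippets: list[Snippet], question: str) -> str:
    """Combine the 'how' (instructions) with the 'what' (retrieved context)."""
    context = "\n".join(f"-- {s.source}\n{s.text}" for s in snippets)
    return f"{instructions}\n\n### Context\n{context}\n\n### Task\n{question}\n"


kb = [
    Snippet("customers.sql", "customers table with customer_id and name"),
    Snippet("orders.sql", "orders table with order_id customer_id and amount"),
    Snippet("payments.sql", "payments table with payment_id"),
]
question = "sum of amount in orders"
top = retrieve(question, kb, k=1)
prompt = build_prompt("Write dbt SQL following our style guide.", top, question)
print(prompt)
```

The resulting string is what would be sent to the LLM; swapping the toy retriever for a real vector or hybrid search changes only `retrieve`.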

RAG for Text-to-SQL

Hybrid search
●combination of keyword and vector search
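
The slide does not say how the two result lists are combined. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the method used in the talk; the document names and the constant k=60 are illustrative.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each document scores sum(1 / (k + rank))."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results of a keyword search and a vector search for one query.
keyword_hits = ["orders", "payments", "customers"]
vector_hits = ["customers", "orders", "stg_orders"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, while documents found by only one of the searches are still kept.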

Vector search
●a technique used to search for similar items based on their vector representations, called embeddings
●Approximate Nearest Neighbours algorithms
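
To make the idea concrete, here is an exact (brute-force) nearest-neighbour search using cosine similarity. Approximate Nearest Neighbours algorithms trade a little accuracy for speed by avoiding this full scan. The three-dimensional "embeddings" are made-up values for illustration only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query: list[float], index: dict[str, list[float]], k: int = 1) -> list[str]:
    """Exact search: score every stored vector and return the k best names."""
    ranked = sorted(
        index,
        key=lambda name: cosine_similarity(query, index[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical embeddings of three table descriptions.
embeddings = {
    "orders": [0.9, 0.1, 0.0],
    "customers": [0.1, 0.9, 0.0],
    "payments": [0.7, 0.0, 0.7],
}
top_matches = nearest([1.0, 0.0, 0.0], embeddings, k=2)
print(top_matches)
```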

Data Copilot RAG architecture
Data programming is more about repeatable tasks!

GID Data Copilot (GDC)
●An extensible AI programming assistant for SQL and dbt code
●Powered by:
○Large Language Models (SOTA LLMs)
○Robust Retrieval-Augmented Generation (RAG) architecture
○Hybrid search techniques
○Fast Vector Database
○Curated Prompts
○Built-in Data commands

Continue - an open-source copilot
●support for both tasks and tab autocompletion
●highly extensible
○use any LLM you wish - also multiple, specialized models for different languages or tasks
○support for many model providers, such as Ollama, vLLM, LM Studio
○custom context providers for more control over LLM augmentation
○custom slash commands that can combine your own prompts, contexts and models for specialized, reusable tasks
●support for VS Code and JetBrains
●secure (i.e. can be run locally, on-premise or in a cloud VPC)
●translate your best practices into "slash data commands"

Continue - a custom context provider

dbt/SQL task = custom(context + prompt + model)
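
The formula above can be read as a small composition of three parts. The sketch below models it in plain Python to show the idea only; it is not Continue's configuration API, and the class, field names and example values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DataTask:
    """A reusable task = custom context + custom prompt + chosen model."""
    name: str
    context_provider: Callable[[str], str]  # fetches context for the user input
    prompt_template: str                    # uses {context} and {input} placeholders
    model: str                              # identifier of the model to call

    def render(self, user_input: str) -> str:
        """Build the final prompt that would be sent to `self.model`."""
        return self.prompt_template.format(
            context=self.context_provider(user_input),
            input=user_input,
        )


def jaffle_context(_user_input: str) -> str:
    # Stand-in for a real context provider (e.g. a vector database lookup).
    return "orders(order_id, customer_id, amount)"


task = DataTask(
    name="dbt-model",
    context_provider=jaffle_context,
    prompt_template="Schema:\n{context}\n\nWrite a dbt model that: {input}",
    model="sqlcoder-7b-2",
)
rendered = task.render("sums amount per customer")
print(rendered)
```

Keeping the three parts separate lets the same context provider serve several tasks, or the same task run against different models.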

Ollama
●fast and easy self-hosting of LLMs almost everywhere
●hybrid CPU+GPU inference
●powered by llama.cpp
●rich library of existing LLMs in different flavours
●GGUF - fast and memory-efficient quantization for GPU
●Serve a model with a one-liner:
ollama run starcoder2:7b
●vLLM for production deployments (our video tutorial)

Ollama - custom model in 4 steps
1. Download a model in the GGUF format
2. Create a Modelfile, e.g.:
FROM ./sqlcoder-7b-q5_k_m.gguf
TEMPLATE """{{ .Prompt }}"""
PARAMETER stop "<|endoftext|>"
3. Create a model with Ollama
ollama create sqlcoder-7b-2 -f Modelfile
4. Serve it
ollama run sqlcoder-7b-2
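
Once the model is served, it can also be called programmatically through Ollama's local HTTP API. The sketch below assumes the default endpoint on localhost:11434 and a running Ollama instance; the helper names and the example prompt are illustrative.

```python
import json
import urllib.request

# Ollama's default local endpoint for single-prompt generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Prepare a non-streaming generation request for the given model."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request; requires Ollama to be running locally."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Building the request does not touch the network; calling generate() does.
request = build_request("sqlcoder-7b-2", "-- total amount per customer\nSELECT")
```

With the model from step 3 in place, `generate("sqlcoder-7b-2", "...")` would return the completion as a string.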

LanceDB
●fast (Rust), serverless and embeddable - DuckDB for ML
●powered by the Lance file format for ML (versioning, zero-copy)
●multi-modal
●support for hybrid (semantic + keyword) search
●LlamaIndex integration
●Python API and fast data exchange with Polars and Arrow

Technical architecture

Question representation [1]
[1] Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation
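
"Question representation" refers to how the schema and the user question are laid out in the prompt. The sketch below shows one code-style layout (schema as DDL, question as a SQL comment, then an open SELECT). The exact wording is an assumption rather than a quote from the cited paper or the talk, and the Jaffle Shop-like schema is hypothetical.

```python
def code_representation(ddl_statements: list[str], question: str) -> str:
    """Lay out schema and question as SQL so the model continues the query."""
    schema = "\n\n".join(stmt.strip() for stmt in ddl_statements)
    return (
        "/* Given the following database schema: */\n"
        f"{schema}\n\n"
        f"/* Answer the following: {question} */\n"
        "SELECT"
    )


# Hypothetical schema for illustration.
ddl = [
    "CREATE TABLE customers (customer_id INT PRIMARY KEY, first_name TEXT);",
    "CREATE TABLE orders (order_id INT PRIMARY KEY, "
    "customer_id INT REFERENCES customers(customer_id), amount NUMERIC);",
]
sql_prompt = code_representation(ddl, "How many orders did each customer place?")
print(sql_prompt)
```

Ending the prompt with an open SELECT nudges a completion-style model to answer with SQL rather than prose.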

LLMs evaluation 1/2
●Not meant to be yet another benchmark, such as Spider, sql-eval or BIRD-SQL, for just SQL generation
●Jaffle Shop example - simple but not trivial
●Zero-shot (Agentic Workflow with Reflection TBD)
●4 typical data tasks
○Data model exploration/discovery
○SQL: an easy one (single table) and a more complex one (joins with sorting and aggregations)
○dbt model generation
○dbt test generation based on rules

LLMs evaluation 2/2
+: perfect or almost perfect
+/-: not optimal or some minor tweaks needed
-/+: not very helpful, serious hallucinations
-: totally useless
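
For bookkeeping, the four-level scale above can be encoded so that grades are validated and counted consistently. This is only a sketch: the enum names are invented, and the grade list is placeholder input, not a result from the talk.

```python
from collections import Counter
from enum import Enum


class Grade(Enum):
    PERFECT = "+"          # perfect or almost perfect
    MINOR_TWEAKS = "+/-"   # not optimal or some minor tweaks needed
    NOT_HELPFUL = "-/+"    # not very helpful, serious hallucinations
    USELESS = "-"          # totally useless


def tally(grades: list[str]) -> Counter:
    """Count grades per level; an unknown symbol raises ValueError."""
    return Counter(Grade(symbol) for symbol in grades)


# Placeholder grades for the four tasks of one hypothetical model.
placeholder_grades = ["+", "+/-", "+", "-/+"]
summary = tally(placeholder_grades)
print({grade.name: count for grade, count in summary.items()})
```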

gpt4 vs dbrx vs sqlcoder-7b-2 vs llama-3-sqlcoder-8b

fine-tuning impact: llama3-8b vs llama-3-sqlcoder-8b

Quantization effects: dbrx 8/4/2 bit

A handful of conclusions… with a grain of salt
●NL-to-SQL and dbt code generation are challenging
●commercial models are in most cases still better… but
●there are very promising open-source 7-30B alternatives
●smaller models perform better than larger ones after quantization
●SQL-dedicated and fine-tuned models can turn out a bit disappointing (beam search?), e.g.:
○unnecessary joins elimination
○wrong data type inference
○occasional hallucinations

Future directions
●Implementation of in-context learning, such as Query Similarity Selection (few-shot strategy) and Agentic RAG with Reflection Strategy
●Model fine-tuning (dbt)
●Data Modeling (DV 2.0)
●Various SQL dialects and platform migrations
●Prompt optimizations, e.g. with DSPy

DEMO
