SlideShare a Scribd company logo
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
1. Why do we need yet
another (open-source ) Copilot?
2. How can we build one?
3. Architecture and evaluation
4. DEMO
…howto turn bestpractices into AI coding assistant
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
(Data)Context is king!
● Explicit and precise data context of your whole data
platform
● Data transformation codebase
● Data models with comments
and table relationships
● Other user queries
● Lineage and human curated
dataset descriptions from
data catalogs
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
● open-source tools, such as WrenAI, Venna.AI, Dataherald
focus on Text-to-SQL to be embedded in web interfaces – i.e.
chatbots or own SQL editors – meant for non-technical users.
● closed source AI-Powered Assistants to BigQuery
(SQL+Dataform),Snowflake (SQL), Databricks (SQL+Python)
web interfaces,more like a black-box not-meant for
customizations.
● missingAnalytics EngineerCopilot with a dbt/SQL support
Data Assistants landscape
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Customized and specialized models are the future.
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
● Many other small (7-34b) models
licensed for commercial use, e.g. :
✓ starcoder2
✓ dolphincoder
✓ deepseeek-coder
✓ Opencodeinterpreter
✓ Llama3
sqlcoder-7b and others
May 9th updates
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
How turnyour best practices into Copilots ?
● Vector database as a knowledge base - what ?
● Prompts as instructions following best practices - how ?
● LLM to combine both…
Retrieval-Augmented
Generation(RAG)
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
RAG for Text-to-SQL
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Hybrid search
• combination of keyword and vector search
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
● a technique used to search for similar items
based on their vector representations, called
embeddings
● Approximate Nearest Neighbours algorithms
Vector search
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Data CopilotRAG architecture
Data programming
is more about
repeatable tasks!
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
GID Data Copilot(GDC)
● An extensible AI
programming assistant for
SQL and dbt code
● Powered by:
● Large Language Models
(SOTA LLMs)
● Robust Retrieval
Augmented Generation
(RAG) architecture
● Hybrid search techniques
● Fast Vector Database
● Curated Prompts
● Builtin Data commands
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Continue - an open-source copilot
● support for both tasks and tab autocompletion
● highly extensible
○ use any LLM model you wish - also multiple, specialized models for different
languages or tasks
○ support for many model providers, such as Ollama, vLLM, LM Studio
○ custom context providers for more control over LLMs augmentation
○ custom slash commands that can combine own prompts, contexts and
models for specialized, reusable tasks
● support for VSCode and Jetbrains
● secure (i.e. can be run locally, on-premise or cloud VPC)
● translate your best practices into ”slash data commands”
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Continue - a custom contextprovider
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
dbtSQL task = custom(context + prompt + model)
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Ollama
● fast and easy self-hosting of LLMs almost everywhere
● hybrid CPU+GPU inference
● powered by llama.cpp
● rich library of existing LLMs in different flavours
● GGUF - fast and memory efficient
quantization for GPU
● Serve model with one-liner:
ollama run starcoder2:7b
● vLLM for production deployments
(Our video tutorial)
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Ollama - custom model in 4 steps
1. Download a model in the GGUF format
2. Create a Modelfile, e.g.:
FROM ./sqlcoder-7b-q5_k_m.gguf
TEMPLATE """{{ .Prompt }}"""
PARAMETER stop "<|endoftext|>"
3. Create a model with Ollama
ollama create sqlcoder-7b-2 -f Modefile
4. Serve it
ollama run sqlcoder-7b-2
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
LanceDB
● fast (Rust ), serverless and embeddable - DuckDB for ML
● powered by Lance file format for ML (versioning,zero-copy)
● multi-modal
● support for hybrid (semantic + keyword) search
● Llamaindex integration
● Python API and fast data exchange
with polars and Arrow
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Technical architecture
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Questionrepresentation1
1
Text-to-SQLEmpowered by Large Language Models: A BenchmarkEvaluation
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
LLMs evaluation 1/2
● Not meant to be yet another benchmark, such as: Spider
sql-eval or Bird-SQL for jus SQL generation
● Jaffle Shop example - simple but not trivial
● Zero-shot – Agentic Workflow with Reflection TBD
● 4 typical data tasks
○ Data model exploration/discovery
○ SQL: an easy one (single table) and more complex (joins with sorting and
aggregations)
○ dbt model generation
○ dbt tests generation based on rules
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
LLMs evaluation 2/2
+- perfect or almost perfect
+/- - not optimal or some minor tweaks needed
-/+ - not very helpful, serious hallucinations
- - totally useless
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
gpt4 vs dbrx vs sqlcoder-7b-2 vs llama-3-sqlcoder-8b
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
fine-tuning impact: llama3-8b vs llama-3-sqlcoder-8b
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Quantization effects: dbrx 8/4/2 bit
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
A handful of conclusions…with a grain of salt
● NL-to-SQL and dbt code generationare challenging
● commercialmodels are in most cases still better… but
● there are very promising open-source 7-30b alternatives
● smaller models perform better than larger after quantization
● SQL-dedicated and fine-tuned models can turn out a bit a
disappointing (beam search?), e.g. :
○ unnecessary joins elimination
○ wrong data types inference
○ occasional hallucinations
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
Future directions
● Implementation of in-context learning, such as Query
Similarity Selection (few-shot strategy) and Agentic RAG
with Reflection Strategy
● Model fine-tuning (dbt)
● Data Modeling (DV 2.0)
● Various SQL dialects and platform migrations
● Prompt optimizations, e.g. with DSPy
© Copyright. All rights reserved. Not to be reproduced without prior written consent.
DEMO
© Copyright. All rights reserved. Not to be reproduced without prior written consent.

More Related Content

Similar to Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf

Advanced technologies and techniques for debugging HPC applications
Advanced technologies and techniques for debugging HPC applicationsAdvanced technologies and techniques for debugging HPC applications
Advanced technologies and techniques for debugging HPC applications
Rogue Wave Software
 
Custom Script Execution Environment on TD Workflow @ TD Tech Talk 2018-10-17
Custom Script Execution Environment on TD Workflow @ TD Tech Talk 2018-10-17Custom Script Execution Environment on TD Workflow @ TD Tech Talk 2018-10-17
Custom Script Execution Environment on TD Workflow @ TD Tech Talk 2018-10-17
Muga Nishizawa
 
[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...
[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...
[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...
DataScienceConferenc1
 
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
HostedbyConfluent
 
Leveraging open source for large scale analytics
Leveraging open source for large scale analyticsLeveraging open source for large scale analytics
Leveraging open source for large scale analytics
South West Data Meetup
 
Artifactory Essentials Workshop on August 27, 2020 by JFrog
Artifactory Essentials Workshop on August 27, 2020 by JFrogArtifactory Essentials Workshop on August 27, 2020 by JFrog
Artifactory Essentials Workshop on August 27, 2020 by JFrog
Cloud Study Network
 
Amazon EC2 deepdive and a sprinkel of AWS Compute | AWS Floor28
Amazon EC2 deepdive and a sprinkel of AWS Compute | AWS Floor28Amazon EC2 deepdive and a sprinkel of AWS Compute | AWS Floor28
Amazon EC2 deepdive and a sprinkel of AWS Compute | AWS Floor28
Amazon Web Services
 
Light-weighted HDFS disaster recovery
Light-weighted HDFS disaster recoveryLight-weighted HDFS disaster recovery
Light-weighted HDFS disaster recovery
DataWorks Summit
 
Instant developer onboarding with self contained repositories
Instant developer onboarding with self contained repositoriesInstant developer onboarding with self contained repositories
Instant developer onboarding with self contained repositories
Yshay Yaacobi
 
Running Stateful Apps on Kubernetes
Running Stateful Apps on KubernetesRunning Stateful Apps on Kubernetes
Running Stateful Apps on Kubernetes
Yugabyte
 
Reading The Source Code of Presto
Reading The Source Code of PrestoReading The Source Code of Presto
Reading The Source Code of Presto
Taro L. Saito
 
Building Scalable Applications using Pivotal Gemfire/Apache Geode
Building Scalable Applications using Pivotal Gemfire/Apache GeodeBuilding Scalable Applications using Pivotal Gemfire/Apache Geode
Building Scalable Applications using Pivotal Gemfire/Apache Geode
imcpune
 
IBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache SparkIBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache Spark
AdamRobertsIBM
 
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
ScyllaDB
 
Kubernetes and real-time analytics - how to connect these two worlds with Apa...
Kubernetes and real-time analytics - how to connect these two worlds with Apa...Kubernetes and real-time analytics - how to connect these two worlds with Apa...
Kubernetes and real-time analytics - how to connect these two worlds with Apa...
GetInData
 
Welcome to MLOps candy shop and choose your flavour! - Mateusz Pytel & Marius...
Welcome to MLOps candy shop and choose your flavour! - Mateusz Pytel & Marius...Welcome to MLOps candy shop and choose your flavour! - Mateusz Pytel & Marius...
Welcome to MLOps candy shop and choose your flavour! - Mateusz Pytel & Marius...
GetInData
 
Using LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache ArrowUsing LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache Arrow
DataWorks Summit
 
LiveDB Introduction at devsum 2011
LiveDB Introduction at devsum 2011LiveDB Introduction at devsum 2011
LiveDB Introduction at devsum 2011
Robert Friberg
 
Let’s template
Let’s templateLet’s template
Let’s template
AllenKao7
 
MySQL 5.7 -- SCaLE Feb 2014
MySQL 5.7 -- SCaLE Feb 2014MySQL 5.7 -- SCaLE Feb 2014
MySQL 5.7 -- SCaLE Feb 2014
Dave Stokes
 

Similar to Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf (20)

Advanced technologies and techniques for debugging HPC applications
Advanced technologies and techniques for debugging HPC applicationsAdvanced technologies and techniques for debugging HPC applications
Advanced technologies and techniques for debugging HPC applications
 
Custom Script Execution Environment on TD Workflow @ TD Tech Talk 2018-10-17
Custom Script Execution Environment on TD Workflow @ TD Tech Talk 2018-10-17Custom Script Execution Environment on TD Workflow @ TD Tech Talk 2018-10-17
Custom Script Execution Environment on TD Workflow @ TD Tech Talk 2018-10-17
 
[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...
[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...
[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...
 
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
 
Leveraging open source for large scale analytics
Leveraging open source for large scale analyticsLeveraging open source for large scale analytics
Leveraging open source for large scale analytics
 
Artifactory Essentials Workshop on August 27, 2020 by JFrog
Artifactory Essentials Workshop on August 27, 2020 by JFrogArtifactory Essentials Workshop on August 27, 2020 by JFrog
Artifactory Essentials Workshop on August 27, 2020 by JFrog
 
Amazon EC2 deepdive and a sprinkel of AWS Compute | AWS Floor28
Amazon EC2 deepdive and a sprinkel of AWS Compute | AWS Floor28Amazon EC2 deepdive and a sprinkel of AWS Compute | AWS Floor28
Amazon EC2 deepdive and a sprinkel of AWS Compute | AWS Floor28
 
Light-weighted HDFS disaster recovery
Light-weighted HDFS disaster recoveryLight-weighted HDFS disaster recovery
Light-weighted HDFS disaster recovery
 
Instant developer onboarding with self contained repositories
Instant developer onboarding with self contained repositoriesInstant developer onboarding with self contained repositories
Instant developer onboarding with self contained repositories
 
Running Stateful Apps on Kubernetes
Running Stateful Apps on KubernetesRunning Stateful Apps on Kubernetes
Running Stateful Apps on Kubernetes
 
Reading The Source Code of Presto
Reading The Source Code of PrestoReading The Source Code of Presto
Reading The Source Code of Presto
 
Building Scalable Applications using Pivotal Gemfire/Apache Geode
Building Scalable Applications using Pivotal Gemfire/Apache GeodeBuilding Scalable Applications using Pivotal Gemfire/Apache Geode
Building Scalable Applications using Pivotal Gemfire/Apache Geode
 
IBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache SparkIBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache Spark
 
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
 
Kubernetes and real-time analytics - how to connect these two worlds with Apa...
Kubernetes and real-time analytics - how to connect these two worlds with Apa...Kubernetes and real-time analytics - how to connect these two worlds with Apa...
Kubernetes and real-time analytics - how to connect these two worlds with Apa...
 
Welcome to MLOps candy shop and choose your flavour! - Mateusz Pytel & Marius...
Welcome to MLOps candy shop and choose your flavour! - Mateusz Pytel & Marius...Welcome to MLOps candy shop and choose your flavour! - Mateusz Pytel & Marius...
Welcome to MLOps candy shop and choose your flavour! - Mateusz Pytel & Marius...
 
Using LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache ArrowUsing LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache Arrow
 
LiveDB Introduction at devsum 2011
LiveDB Introduction at devsum 2011LiveDB Introduction at devsum 2011
LiveDB Introduction at devsum 2011
 
Let’s template
Let’s templateLet’s template
Let’s template
 
MySQL 5.7 -- SCaLE Feb 2014
MySQL 5.7 -- SCaLE Feb 2014MySQL 5.7 -- SCaLE Feb 2014
MySQL 5.7 -- SCaLE Feb 2014
 

More from GetInData

How do we work with customers on Big Data / ML / Analytics Projects using Scr...
How do we work with customers on Big Data / ML / Analytics Projects using Scr...How do we work with customers on Big Data / ML / Analytics Projects using Scr...
How do we work with customers on Big Data / ML / Analytics Projects using Scr...
GetInData
 
Data-Driven Fast Track: Introduction to data-drivenness with Piotr Menclewicz
Data-Driven Fast Track: Introduction to data-drivenness with Piotr MenclewiczData-Driven Fast Track: Introduction to data-drivenness with Piotr Menclewicz
Data-Driven Fast Track: Introduction to data-drivenness with Piotr Menclewicz
GetInData
 
How NOT to win a Kaggle competition
How NOT to win a Kaggle competitionHow NOT to win a Kaggle competition
How NOT to win a Kaggle competition
GetInData
 
How to become good Developer in Scrum Team?
How to become good Developer in Scrum Team? How to become good Developer in Scrum Team?
How to become good Developer in Scrum Team?
GetInData
 
OpenLineage & Airflow - data lineage has never been easier
OpenLineage & Airflow - data lineage has never been easierOpenLineage & Airflow - data lineage has never been easier
OpenLineage & Airflow - data lineage has never been easier
GetInData
 
Benefits of a Homemade ML Platform
Benefits of a Homemade ML PlatformBenefits of a Homemade ML Platform
Benefits of a Homemade ML Platform
GetInData
 
Creating Real-Time Data Streaming powered by SQL on Kubernetes - Albert Lewan...
Creating Real-Time Data Streaming powered by SQL on Kubernetes - Albert Lewan...Creating Real-Time Data Streaming powered by SQL on Kubernetes - Albert Lewan...
Creating Real-Time Data Streaming powered by SQL on Kubernetes - Albert Lewan...
GetInData
 
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
GetInData
 
Feast + Amundsen Integration - Mariusz Strzelecki, GetInData
Feast + Amundsen Integration - Mariusz Strzelecki, GetInDataFeast + Amundsen Integration - Mariusz Strzelecki, GetInData
Feast + Amundsen Integration - Mariusz Strzelecki, GetInData
GetInData
 
Big data trends - Krzysztof Zarzycki, GetInData
Big data trends - Krzysztof Zarzycki, GetInDataBig data trends - Krzysztof Zarzycki, GetInData
Big data trends - Krzysztof Zarzycki, GetInData
GetInData
 
Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...
Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...
Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...
GetInData
 
Monitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInDataMonitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInData
GetInData
 
Complex event processing platform handling millions of users - Krzysztof Zarz...
Complex event processing platform handling millions of users - Krzysztof Zarz...Complex event processing platform handling millions of users - Krzysztof Zarz...
Complex event processing platform handling millions of users - Krzysztof Zarz...
GetInData
 
Predicting Startup Market Trends based on the news and social media - Albert ...
Predicting Startup Market Trends based on the news and social media - Albert ...Predicting Startup Market Trends based on the news and social media - Albert ...
Predicting Startup Market Trends based on the news and social media - Albert ...
GetInData
 
Managing Big Data projects in a constantly changing environment - Rafał Zalew...
Managing Big Data projects in a constantly changing environment - Rafał Zalew...Managing Big Data projects in a constantly changing environment - Rafał Zalew...
Managing Big Data projects in a constantly changing environment - Rafał Zalew...
GetInData
 
NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...
NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...
NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...
GetInData
 
Strategies for on premise to Google Cloud migration - Mateusz Pytel, GetInData
Strategies for on premise to Google Cloud migration - Mateusz Pytel, GetInDataStrategies for on premise to Google Cloud migration - Mateusz Pytel, GetInData
Strategies for on premise to Google Cloud migration - Mateusz Pytel, GetInData
GetInData
 
Monitoring environment based on satellite data with Python and PySpark - Albe...
Monitoring environment based on satellite data with Python and PySpark - Albe...Monitoring environment based on satellite data with Python and PySpark - Albe...
Monitoring environment based on satellite data with Python and PySpark - Albe...
GetInData
 
Real time analytics that controls 50% of mobile network in Poland - Maciej Br...
Real time analytics that controls 50% of mobile network in Poland - Maciej Br...Real time analytics that controls 50% of mobile network in Poland - Maciej Br...
Real time analytics that controls 50% of mobile network in Poland - Maciej Br...
GetInData
 
How to maximize profit from IoT by using data platform - Albert Lewandowski, ...
How to maximize profit from IoT by using data platform - Albert Lewandowski, ...How to maximize profit from IoT by using data platform - Albert Lewandowski, ...
How to maximize profit from IoT by using data platform - Albert Lewandowski, ...
GetInData
 

More from GetInData (20)

How do we work with customers on Big Data / ML / Analytics Projects using Scr...
How do we work with customers on Big Data / ML / Analytics Projects using Scr...How do we work with customers on Big Data / ML / Analytics Projects using Scr...
How do we work with customers on Big Data / ML / Analytics Projects using Scr...
 
Data-Driven Fast Track: Introduction to data-drivenness with Piotr Menclewicz
Data-Driven Fast Track: Introduction to data-drivenness with Piotr MenclewiczData-Driven Fast Track: Introduction to data-drivenness with Piotr Menclewicz
Data-Driven Fast Track: Introduction to data-drivenness with Piotr Menclewicz
 
How NOT to win a Kaggle competition
How NOT to win a Kaggle competitionHow NOT to win a Kaggle competition
How NOT to win a Kaggle competition
 
How to become good Developer in Scrum Team?
How to become good Developer in Scrum Team? How to become good Developer in Scrum Team?
How to become good Developer in Scrum Team?
 
OpenLineage & Airflow - data lineage has never been easier
OpenLineage & Airflow - data lineage has never been easierOpenLineage & Airflow - data lineage has never been easier
OpenLineage & Airflow - data lineage has never been easier
 
Benefits of a Homemade ML Platform
Benefits of a Homemade ML PlatformBenefits of a Homemade ML Platform
Benefits of a Homemade ML Platform
 
Creating Real-Time Data Streaming powered by SQL on Kubernetes - Albert Lewan...
Creating Real-Time Data Streaming powered by SQL on Kubernetes - Albert Lewan...Creating Real-Time Data Streaming powered by SQL on Kubernetes - Albert Lewan...
Creating Real-Time Data Streaming powered by SQL on Kubernetes - Albert Lewan...
 
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
 
Feast + Amundsen Integration - Mariusz Strzelecki, GetInData
Feast + Amundsen Integration - Mariusz Strzelecki, GetInDataFeast + Amundsen Integration - Mariusz Strzelecki, GetInData
Feast + Amundsen Integration - Mariusz Strzelecki, GetInData
 
Big data trends - Krzysztof Zarzycki, GetInData
Big data trends - Krzysztof Zarzycki, GetInDataBig data trends - Krzysztof Zarzycki, GetInData
Big data trends - Krzysztof Zarzycki, GetInData
 
Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...
Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...
Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...
 
Monitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInDataMonitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInData
 
Complex event processing platform handling millions of users - Krzysztof Zarz...
Complex event processing platform handling millions of users - Krzysztof Zarz...Complex event processing platform handling millions of users - Krzysztof Zarz...
Complex event processing platform handling millions of users - Krzysztof Zarz...
 
Predicting Startup Market Trends based on the news and social media - Albert ...
Predicting Startup Market Trends based on the news and social media - Albert ...Predicting Startup Market Trends based on the news and social media - Albert ...
Predicting Startup Market Trends based on the news and social media - Albert ...
 
Managing Big Data projects in a constantly changing environment - Rafał Zalew...
Managing Big Data projects in a constantly changing environment - Rafał Zalew...Managing Big Data projects in a constantly changing environment - Rafał Zalew...
Managing Big Data projects in a constantly changing environment - Rafał Zalew...
 
NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...
NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...
NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...
 
Strategies for on premise to Google Cloud migration - Mateusz Pytel, GetInData
Strategies for on premise to Google Cloud migration - Mateusz Pytel, GetInDataStrategies for on premise to Google Cloud migration - Mateusz Pytel, GetInData
Strategies for on premise to Google Cloud migration - Mateusz Pytel, GetInData
 
Monitoring environment based on satellite data with Python and PySpark - Albe...
Monitoring environment based on satellite data with Python and PySpark - Albe...Monitoring environment based on satellite data with Python and PySpark - Albe...
Monitoring environment based on satellite data with Python and PySpark - Albe...
 
Real time analytics that controls 50% of mobile network in Poland - Maciej Br...
Real time analytics that controls 50% of mobile network in Poland - Maciej Br...Real time analytics that controls 50% of mobile network in Poland - Maciej Br...
Real time analytics that controls 50% of mobile network in Poland - Maciej Br...
 
How to maximize profit from IoT by using data platform - Albert Lewandowski, ...
How to maximize profit from IoT by using data platform - Albert Lewandowski, ...How to maximize profit from IoT by using data platform - Albert Lewandowski, ...
How to maximize profit from IoT by using data platform - Albert Lewandowski, ...
 

Recently uploaded

potential usefulness of multi-agent maze-solving in general
potential usefulness of multi-agent maze-solving in generalpotential usefulness of multi-agent maze-solving in general
potential usefulness of multi-agent maze-solving in general
huseindihon
 
BDSM Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And ...
BDSM Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And ...BDSM Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And ...
BDSM Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And ...
fatima shekh$A17
 
Verified Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servic...
Verified Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servic...Verified Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servic...
Verified Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servic...
revolutionary575
 
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
satpalsheravatmumbai
 
Data Preprocessing Cheatsheet for learners
Data Preprocessing Cheatsheet for learnersData Preprocessing Cheatsheet for learners
Data Preprocessing Cheatsheet for learners
mohamed Ibrahim
 
Why_are_we_hypnotizing_ourselves-_ATeggin-1.pdf
Why_are_we_hypnotizing_ourselves-_ATeggin-1.pdfWhy_are_we_hypnotizing_ourselves-_ATeggin-1.pdf
Why_are_we_hypnotizing_ourselves-_ATeggin-1.pdf
Alexander Teggin
 
DataScienceConcept_Kanchana_Weerasinghe.pptx
DataScienceConcept_Kanchana_Weerasinghe.pptxDataScienceConcept_Kanchana_Weerasinghe.pptx
DataScienceConcept_Kanchana_Weerasinghe.pptx
Kanchana Weerasinghe
 
potential development of the A* search algorithm specifically
potential development of the A* search algorithm specificallypotential development of the A* search algorithm specifically
potential development of the A* search algorithm specifically
huseindihon
 
Celonis Busniess Analyst Virtual Internship.pptx
Celonis Busniess Analyst Virtual Internship.pptxCelonis Busniess Analyst Virtual Internship.pptx
Celonis Busniess Analyst Virtual Internship.pptx
AnujaGaikwad28
 
Experience, Excellence & Commitment are the characteristics that describe Fla...
Experience, Excellence & Commitment are the characteristics that describe Fla...Experience, Excellence & Commitment are the characteristics that describe Fla...
Experience, Excellence & Commitment are the characteristics that describe Fla...
kittycrispy617
 
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
NABLAS株式会社
 
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in CityGirls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
gargnatasha985
 
OpenMetadata Spotlight - OpenMetadata @ Aspire by Vinol Joy Dsouza
OpenMetadata Spotlight - OpenMetadata @ Aspire by Vinol Joy DsouzaOpenMetadata Spotlight - OpenMetadata @ Aspire by Vinol Joy Dsouza
OpenMetadata Spotlight - OpenMetadata @ Aspire by Vinol Joy Dsouza
OpenMetadata
 
🚂🚘 Premium Girls Call Bangalore 🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
🚂🚘 Premium Girls Call Bangalore  🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...🚂🚘 Premium Girls Call Bangalore  🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
🚂🚘 Premium Girls Call Bangalore 🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
bhupeshkumar0889
 
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
vrvipin164
 
BDSM Girls Call Mumbai 👀 9820252231 👀 Cash Payment With Room DeliveryDelivery
BDSM Girls Call Mumbai 👀 9820252231 👀 Cash Payment With Room DeliveryDeliveryBDSM Girls Call Mumbai 👀 9820252231 👀 Cash Payment With Room DeliveryDelivery
BDSM Girls Call Mumbai 👀 9820252231 👀 Cash Payment With Room DeliveryDelivery
erynsouthern
 
Potential Uses of the Floyd-Warshall Algorithm as appropriate
Potential Uses of the Floyd-Warshall Algorithm as appropriatePotential Uses of the Floyd-Warshall Algorithm as appropriate
Potential Uses of the Floyd-Warshall Algorithm as appropriate
huseindihon
 
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
tanupasswan6
 
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
tanupasswan6
 
Fine-Tuning of Small/Medium LLMs for Business QA on Structured Data
Fine-Tuning of Small/Medium LLMs for Business QA on Structured DataFine-Tuning of Small/Medium LLMs for Business QA on Structured Data
Fine-Tuning of Small/Medium LLMs for Business QA on Structured Data
kevig
 

Recently uploaded (20)

potential usefulness of multi-agent maze-solving in general
potential usefulness of multi-agent maze-solving in generalpotential usefulness of multi-agent maze-solving in general
potential usefulness of multi-agent maze-solving in general
 
BDSM Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And ...
BDSM Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And ...BDSM Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And ...
BDSM Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And ...
 
Verified Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servic...
Verified Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servic...Verified Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servic...
Verified Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servic...
 
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
 
Data Preprocessing Cheatsheet for learners
Data Preprocessing Cheatsheet for learnersData Preprocessing Cheatsheet for learners
Data Preprocessing Cheatsheet for learners
 
Why_are_we_hypnotizing_ourselves-_ATeggin-1.pdf
Why_are_we_hypnotizing_ourselves-_ATeggin-1.pdfWhy_are_we_hypnotizing_ourselves-_ATeggin-1.pdf
Why_are_we_hypnotizing_ourselves-_ATeggin-1.pdf
 
DataScienceConcept_Kanchana_Weerasinghe.pptx
DataScienceConcept_Kanchana_Weerasinghe.pptxDataScienceConcept_Kanchana_Weerasinghe.pptx
DataScienceConcept_Kanchana_Weerasinghe.pptx
 
potential development of the A* search algorithm specifically
potential development of the A* search algorithm specificallypotential development of the A* search algorithm specifically
potential development of the A* search algorithm specifically
 
Celonis Busniess Analyst Virtual Internship.pptx
Celonis Busniess Analyst Virtual Internship.pptxCelonis Busniess Analyst Virtual Internship.pptx
Celonis Busniess Analyst Virtual Internship.pptx
 
Experience, Excellence & Commitment are the characteristics that describe Fla...
Experience, Excellence & Commitment are the characteristics that describe Fla...Experience, Excellence & Commitment are the characteristics that describe Fla...
Experience, Excellence & Commitment are the characteristics that describe Fla...
 
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
 
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in CityGirls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
 
OpenMetadata Spotlight - OpenMetadata @ Aspire by Vinol Joy Dsouza
OpenMetadata Spotlight - OpenMetadata @ Aspire by Vinol Joy DsouzaOpenMetadata Spotlight - OpenMetadata @ Aspire by Vinol Joy Dsouza
OpenMetadata Spotlight - OpenMetadata @ Aspire by Vinol Joy Dsouza
 
🚂🚘 Premium Girls Call Bangalore 🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
🚂🚘 Premium Girls Call Bangalore  🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...🚂🚘 Premium Girls Call Bangalore  🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
🚂🚘 Premium Girls Call Bangalore 🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
 
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
 
BDSM Girls Call Mumbai 👀 9820252231 👀 Cash Payment With Room DeliveryDelivery
BDSM Girls Call Mumbai 👀 9820252231 👀 Cash Payment With Room DeliveryDeliveryBDSM Girls Call Mumbai 👀 9820252231 👀 Cash Payment With Room DeliveryDelivery
BDSM Girls Call Mumbai 👀 9820252231 👀 Cash Payment With Room DeliveryDelivery
 
Potential Uses of the Floyd-Warshall Algorithm as appropriate
Potential Uses of the Floyd-Warshall Algorithm as appropriatePotential Uses of the Floyd-Warshall Algorithm as appropriate
Potential Uses of the Floyd-Warshall Algorithm as appropriate
 
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
 
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
 
Fine-Tuning of Small/Medium LLMs for Business QA on Structured Data
Fine-Tuning of Small/Medium LLMs for Business QA on Structured DataFine-Tuning of Small/Medium LLMs for Business QA on Structured Data
Fine-Tuning of Small/Medium LLMs for Business QA on Structured Data
 

Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf

  • 2. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
  • 3. © Copyright. All rights reserved. Not to be reproduced without prior written consent. 1. Why do we need yet another (open-source ) Copilot? 2. How can we build one? 3. Architecture and evaluation 4. DEMO …howto turn bestpractices into AI coding assistant
  • 4. © Copyright. All rights reserved. Not to be reproduced without prior written consent. (Data)Context is king! ● Explicit and precise data context of your whole data platform ● Data transformation codebase ● Data models with comments and table relationships ● Other user queries ● Lineage and human curated dataset descriptions from data catalogs
  • 5. © Copyright. All rights reserved. Not to be reproduced without prior written consent. ● open-source tools, such as WrenAI, Venna.AI, Dataherald focus on Text-to-SQL to be embedded in web interfaces – i.e. chatbots or own SQL editors – meant for non-technical users. ● closed source AI-Powered Assistants to BigQuery (SQL+Dataform),Snowflake (SQL), Databricks (SQL+Python) web interfaces,more like a black-box not-meant for customizations. ● missingAnalytics EngineerCopilot with a dbt/SQL support Data Assistants landscape
  • 6. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Customized and specialized models are the future.
  • 7. © Copyright. All rights reserved. Not to be reproduced without prior written consent. ● Many other small (7-34b) models licensed for commercial use, e.g. : ✓ starcoder2 ✓ dolphincoder ✓ deepseeek-coder ✓ Opencodeinterpreter ✓ Llama3 sqlcoder-7b and others May 9th updates
  • 8. © Copyright. All rights reserved. Not to be reproduced without prior written consent. How turnyour best practices into Copilots ? ● Vector database as a knowledge base - what ? ● Prompts as instructions following best practices - how ? ● LLM to combine both… Retrieval-Augmented Generation(RAG)
  • 9. © Copyright. All rights reserved. Not to be reproduced without prior written consent. RAG for Text-to-SQL
  • 10. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Hybrid search • combination of keyword and vector search
  • 11. © Copyright. All rights reserved. Not to be reproduced without prior written consent. ● a technique used to search for similar items based on their vector representations, called embeddings ● Approximate Nearest Neighbours algorithms Vector search
  • 12. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Data CopilotRAG architecture Data programming is more about repeatable tasks!
  • 13. © Copyright. All rights reserved. Not to be reproduced without prior written consent. GID Data Copilot(GDC) ● An extensible AI programming assistant for SQL and dbt code ● Powered by: ● Large Language Models (SOTA LLMs) ● Robust Retrieval Augmented Generation (RAG) architecture ● Hybrid search techniques ● Fast Vector Database ● Curated Prompts ● Builtin Data commands
  • 14. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Continue - an open-source copilot ● support for both tasks and tab autocompletion ● highly extensible ○ use any LLM model you wish - also multiple, specialized models for different languages or tasks ○ support for many model providers, such as Ollama, vLLM, LM Studio ○ custom context providers for more control over LLMs augmentation ○ custom slash commands that can combine own prompts, contexts and models for specialized, reusable tasks ● support for VSCode and Jetbrains ● secure (i.e. can be run locally, on-premise or cloud VPC) ● translate your best practices into ”slash data commands”
  • 15. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Continue - a custom contextprovider
  • 16. © Copyright. All rights reserved. Not to be reproduced without prior written consent. dbtSQL task = custom(context + prompt + model)
  • 17. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Ollama ● fast and easy self-hosting of LLMs almost everywhere ● hybrid CPU+GPU inference ● powered by llama.cpp ● rich library of existing LLMs in different flavours ● GGUF - fast and memory efficient quantization for GPU ● Serve model with one-liner: ollama run starcoder2:7b ● vLLM for production deployments (Our video tutorial)
  • 18. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Ollama - custom model in 4 steps 1. Download a model in the GGUF format 2. Create a Modelfile, e.g.: FROM ./sqlcoder-7b-q5_k_m.gguf TEMPLATE """{{ .Prompt }}""" PARAMETER stop "<|endoftext|>" 3. Create a model with Ollama ollama create sqlcoder-7b-2 -f Modefile 4. Serve it ollama run sqlcoder-7b-2
  • 19. © Copyright. All rights reserved. Not to be reproduced without prior written consent. LanceDB ● fast (Rust ), serverless and embeddable - DuckDB for ML ● powered by Lance file format for ML (versioning,zero-copy) ● multi-modal ● support for hybrid (semantic + keyword) search ● Llamaindex integration ● Python API and fast data exchange with polars and Arrow
  • 20. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Technical architecture
  • 21. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Questionrepresentation1 1 Text-to-SQLEmpowered by Large Language Models: A BenchmarkEvaluation
  • 22. © Copyright. All rights reserved. Not to be reproduced without prior written consent. LLMs evaluation 1/2 ● Not meant to be yet another benchmark, such as: Spider sql-eval or Bird-SQL for jus SQL generation ● Jaffle Shop example - simple but not trivial ● Zero-shot – Agentic Workflow with Reflection TBD ● 4 typical data tasks ○ Data model exploration/discovery ○ SQL: an easy one (single table) and more complex (joins with sorting and aggregations) ○ dbt model generation ○ dbt tests generation based on rules
  • 23. © Copyright. All rights reserved. Not to be reproduced without prior written consent. LLMs evaluation 2/2 +- perfect or almost perfect +/- - not optimal or some minor tweaks needed -/+ - not very helpful, serious hallucinations - - totally useless
  • 24. © Copyright. All rights reserved. Not to be reproduced without prior written consent. gpt4 vs dbrx vs sqlcoder-7b-2 vs llama-3-sqlcoder-8b
  • 25. © Copyright. All rights reserved. Not to be reproduced without prior written consent. fine-tuning impact: llama3-8b vs llama-3-sqlcoder-8b
  • 26. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Quantization effects: dbrx 8/4/2 bit
  • 27. © Copyright. All rights reserved. Not to be reproduced without prior written consent. A handful of conclusions…with a grain of salt ● NL-to-SQL and dbt code generationare challenging ● commercialmodels are in most cases still better… but ● there are very promising open-source 7-30b alternatives ● smaller models perform better than larger after quantization ● SQL-dedicated and fine-tuned models can turn out a bit a disappointing (beam search?), e.g. : ○ unnecessary joins elimination ○ wrong data types inference ○ occasional hallucinations
  • 28. © Copyright. All rights reserved. Not to be reproduced without prior written consent. Future directions ● Implementation of in-context learning, such as Query Similarity Selection (few-shot strategy) and Agentic RAG with Reflection Strategy ● Model fine-tuning (dbt) ● Data Modeling (DV 2.0) ● Various SQL dialects and platform migrations ● Prompt optimizations, e.g. with DSPy
  • 29. © Copyright. All rights reserved. Not to be reproduced without prior written consent. DEMO
  • 30. © Copyright. All rights reserved. Not to be reproduced without prior written consent.