Reframing LLM ‘Chat with Data’: Introducing LLM-Assisted Data Recipes
• In this article, we cover some of the limitations in using Large Language
Models (LLMs) to ‘Chat with Data’, proposing a ‘Data Recipes’
methodology which may be an alternative in some situations. Data
Recipes extends the idea of reusable code snippets but includes data and
has the advantage of being programmed conversationally using an LLM.
This enables the creation of a reusable Data Recipes Library — for
accessing data and generating insights — which offers more transparency
for LLM-generated code with a human-in-the-loop to moderate recipes as
required. Cached results from recipes — sourced from SQL queries or calls
to external APIs — can be refreshed asynchronously for improved
response times. The proposed solution is a variation of the LLMs As Tool
Makers (LATM) architecture which splits the workflow into two streams:
(i) A low transaction volume / high-cost stream for creating recipes; and
(ii) A high transaction volume / low-cost stream for end-users to use
recipes. Finally, by having a library of recipes and associated data
integration, it is possible to create a ‘Data Recipes Hub’ with the
possibility of community contribution.
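The two-stream idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Recipe` and `RecipeLibrary` names, the `approved` flag standing in for human moderation, and the in-memory cache are hypothetical and are not taken from the article or from the LATM paper.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional, Tuple


@dataclass
class Recipe:
    """A reusable piece of code plus its cached result (hypothetical structure)."""
    name: str
    func: Callable[[], Any]                      # written once, e.g. with LLM help
    approved: bool = False                       # set by a human reviewer
    cached: Optional[Tuple[float, Any]] = None   # (timestamp, result)


class RecipeLibrary:
    def __init__(self) -> None:
        self._recipes: Dict[str, Recipe] = {}

    # Stream (i): low volume, high cost. Recipes are created and moderated.
    def add(self, name: str, func: Callable[[], Any]) -> None:
        self._recipes[name] = Recipe(name, func)

    def approve(self, name: str) -> None:
        self._recipes[name].approved = True

    def refresh(self, name: str) -> None:
        """Re-run the recipe and store the result; a background job could call this."""
        recipe = self._recipes[name]
        recipe.cached = (time.time(), recipe.func())

    # Stream (ii): high volume, low cost. End users read cached results.
    def run(self, name: str) -> Any:
        recipe = self._recipes[name]
        if not recipe.approved:
            raise PermissionError(f"recipe {name!r} has not been approved")
        if recipe.cached is None:
            self.refresh(name)
        return recipe.cached[1]


calls = []  # records how often the underlying code actually runs


def total_population() -> int:
    calls.append(1)
    return sum([10, 20, 30])  # stand-in for a SQL query or an API call


library = RecipeLibrary()
library.add("total_population", total_population)
library.approve("total_population")
first = library.run("total_population")
second = library.run("total_population")  # served from the cache
```

In this sketch the second `run` call does not execute the recipe again, which is the property that would keep the end-user stream cheap.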
Using LLMs for conversational data analysis
• Some very clever patterns now allow people to ask questions about
data in natural language: a Large Language Model (LLM) generates the
calls to get the data and summarizes the output for the user. This is
often referred to as ‘Chat with Data’. I’ve previously posted some
articles illustrating the technique, for example using OpenAI
Assistants to help people prepare for climate change. There are many
more advanced examples out there, and it can be an amazing way to
lower the technical barrier for people to gain insights from
complicated data.
Examples of using LLMs to generate SQL queries from user inputs, and summarize output to provide an
answer. Sources: LangChain SQL Agents
Examples of using LLMs to generate API calls from user inputs, and summarize output to provide an answer.
Sources: LangChain Interacting with APIs
• The method for accessing data typically falls into the following
categories:
1. Generating database queries: The LLM converts natural language to
a query language such as SQL or Cypher.
2. Generating API queries: The LLM converts natural language to text
used to call APIs.
• The application executes the LLM-provided suggestion to get the
data, then usually passes the results back to the LLM to summarize.
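A minimal sketch of the database-query pattern, using Python's built-in `sqlite3` module. The two LLM steps are replaced by hypothetical stand-in functions (`fake_llm_generate_sql` and `fake_llm_summarize`) so that the example runs without a model; the `sales` table and its rows are also invented for illustration.

```python
import sqlite3
from typing import List, Tuple


def fake_llm_generate_sql(question: str) -> str:
    """Stand-in for the LLM call that turns a question into SQL (hypothetical)."""
    return (
        "SELECT region, SUM(amount) AS total "
        "FROM sales GROUP BY region ORDER BY region"
    )


def fake_llm_summarize(rows: List[Tuple[str, float]]) -> str:
    """Stand-in for the second LLM call that summarizes the query output."""
    return "; ".join(f"{region}: {total}" for region, total in rows)


def answer_question(conn: sqlite3.Connection, question: str) -> Tuple[list, str]:
    sql = fake_llm_generate_sql(question)   # 1. the LLM suggests a query
    rows = conn.execute(sql).fetchall()     # 2. the application executes it
    return rows, fake_llm_summarize(rows)   # 3. the results go back for a summary


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("south", 5.0), ("north", 2.5)],
)
rows, summary = answer_question(conn, "What are total sales by region?")
```

A production version would at least need to validate the generated SQL before executing it; that step is left out of this sketch.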
Getting the Data Can be a Problem
• It’s amazing that these techniques now exist, but in turning them into
production solutions each has its advantages and disadvantages …
LLMs can generate text for executing database queries and calling external APIs, but each has its advantages
and disadvantages
For example, generating SQL supports all the amazing things a modern
database query language can do, such as aggregation across large
volumes of data. However, the data might not already be in a database
where SQL can be used. It could be ingested and then queried with SQL,
but building pipelines like this can be complex and costly to manage.
Accessing data directly through APIs means the data doesn’t have to be in a
database and opens up a huge world of publicly available datasets, but
there is a catch. Many APIs do not support aggregate queries like those
supported by SQL, so the only option is to extract the low-level data, and
then aggregate it. This puts more burden on the LLM application and can
require extraction of large amounts of data.
So both techniques have limitations.
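To make the API limitation concrete, here is a small sketch of client-side aggregation over a record-level API. The paged data and the `fetch_page` function are hypothetical stand-ins for real HTTP calls; the point is only that every record has to be pulled before a total can be computed, whereas SQL could do the same with a single `GROUP BY`.

```python
from collections import defaultdict
from typing import Dict, Iterator, List

# Hypothetical records, as a record-level API might return them page by page.
_PAGES = [
    [{"country": "KE", "cases": 3}, {"country": "UG", "cases": 1}],
    [{"country": "KE", "cases": 4}],
]


def fetch_page(page: int) -> List[dict]:
    """Stand-in for an HTTP request; returns an empty list when pages run out."""
    return _PAGES[page] if page < len(_PAGES) else []


def iter_records() -> Iterator[dict]:
    """Walk every page, because the API offers no aggregate endpoint."""
    page = 0
    while True:
        batch = fetch_page(page)
        if not batch:
            return
        yield from batch
        page += 1


def total_by_country() -> Dict[str, int]:
    """Aggregate in the application, the work SQL would do with GROUP BY."""
    totals: Dict[str, int] = defaultdict(int)
    for record in iter_records():
        totals[record["country"]] += record["cases"]
    return dict(totals)


totals = total_by_country()
```

With real datasets the number of pages, and therefore the number of requests and the volume of data moved, grows with the size of the underlying data rather than with the size of the answer.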
Passing Data Directly through LLMs Doesn’t Scale
• On top of this, another major challenge quickly emerges when
operationalizing LLMs for data analysis. Most solutions, such as OpenAI
Assistants, can generate function calls for the caller to execute to extract
data, but the output is then passed back to the LLM. It’s unclear exactly
what happens internally at OpenAI, but it’s not very difficult to pass
enough data to cause a token limit breach, suggesting the LLM is being
used to process the raw data in a prompt. Many patterns do something
along these lines, passing the output of function calling back to the LLM.
This, of course, does not scale in the real world where data volumes
required to answer a question can be large. It soon becomes expensive
and often fails.
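The scaling problem can be sketched with a rough size check before results are handed back to a model. The four-characters-per-token ratio is only a crude rule of thumb, and the `TOKEN_BUDGET` value, the `prepare_for_llm` function, and the sample rows are all assumptions made for illustration; a real application would use the tokenizer and limits of the model it actually calls.

```python
import json
from typing import Any, Dict, List

CHARS_PER_TOKEN = 4   # crude approximation, not an exact tokenizer
TOKEN_BUDGET = 1000   # hypothetical budget for function output in a prompt


def estimate_tokens(payload: Any) -> int:
    """Approximate the token count of a JSON-serialized payload."""
    return len(json.dumps(payload)) // CHARS_PER_TOKEN


def prepare_for_llm(rows: List[Dict[str, int]]) -> Dict[str, Any]:
    """Pass raw rows only when they fit; otherwise aggregate locally first."""
    if estimate_tokens(rows) <= TOKEN_BUDGET:
        return {"rows": rows}
    values = [row["value"] for row in rows]
    return {
        "row_count": len(rows),
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }


rows = [{"id": i, "value": i % 7} for i in range(5000)]
payload = prepare_for_llm(rows)          # too large, so a summary is returned
small_payload = prepare_for_llm(rows[:10])  # small enough to pass through as-is
```

Here the 5,000 raw rows are far over the assumed budget, so only a handful of summary statistics would reach the model, while the ten-row slice is passed through unchanged.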

"Innovative Engineer: Crafting Tomorrow"
