"Versatile engineer adept at solving complex problems, designing innovative solutions, and advancing technology for a brighter, more efficient future."
In this article, we cover some of the limitations of using Large Language Models (LLMs) to 'Chat with Data' and propose a 'Data Recipes' methodology that may be an alternative in some situations. Data Recipes extends the idea of reusable code snippets to include data, with the advantage that recipes can be programmed conversationally using an LLM. This enables the creation of a reusable Data Recipes Library, for accessing data and generating insights, which offers more transparency for LLM-generated code, with a human in the loop to moderate recipes as required. Cached results from recipes, sourced from SQL queries or calls to external APIs, can be refreshed asynchronously for improved response times. The proposed solution is a variation of the LLMs As Tool Makers (LATM) architecture, which splits the workflow into two streams: (i) a low-transaction-volume, high-cost stream for creating recipes; and (ii) a high-transaction-volume, low-cost stream for end users to use recipes. Finally, with a library of recipes and associated data integration, it becomes possible to create a 'Data Recipes Hub' open to community contribution.
Using LLMs for conversational data analysis
There are some very clever patterns now that allow people to ask questions about data in natural language, where a Large Language Model (LLM) generates the calls needed to get the data and summarizes the output for the user. Often referred to as 'Chat with Data', I've previously posted articles illustrating this technique, for example using OpenAI assistants to help people prepare for climate change. There are many more advanced examples out there, and it can be an amazing way to lower the technical barrier for people to gain insights from complicated data.
Example of using an LLM to generate SQL queries from user input, then summarize the output to provide an answer. Source: LangChain SQL Agents.
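To make the SQL pattern concrete, here is a minimal sketch using LangChain's SQL agent. The database file `surveys.db`, the model name, and the question are illustrative assumptions rather than details from this article, and exact import paths vary between LangChain versions.

```python
# Minimal 'Chat with Data' sketch using LangChain's SQL agent.
# Assumes OPENAI_API_KEY is set; the database and question are illustrative.
from langchain_community.agent_toolkits import create_sql_agent
from langchain_community.utilities import SQLDatabase
from langchain_openai import ChatOpenAI

db = SQLDatabase.from_uri("sqlite:///surveys.db")
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# The agent writes SQL from the natural-language question, executes it
# against the database, and summarizes the rows for the user.
agent = create_sql_agent(llm, db=db, agent_type="openai-tools", verbose=True)
agent.invoke({"input": "What was the average household size by region in 2023?"})
```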
Example of using an LLM to generate API calls from user input, then summarize the output to provide an answer. Source: LangChain Interacting with APIs.
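And a minimal sketch of the API pattern, based on LangChain's APIChain as shown in its 'Interacting with APIs' documentation. The LLM is given API documentation, composes the request URL itself, and summarizes the response; the Open-Meteo example and parameters are assumptions for illustration, and this API surface has shifted across LangChain versions.

```python
# Minimal API-calling sketch using LangChain's APIChain.
from langchain.chains.api.base import APIChain
from langchain.chains.api import open_meteo_docs
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0)
chain = APIChain.from_llm_and_api_docs(
    llm,
    open_meteo_docs.OPEN_METEO_DOCS,
    # Restrict which domains the LLM-generated URL may target.
    limit_to_domains=["https://api.open-meteo.com/"],
    verbose=True,
)
# The LLM builds the API URL, the chain executes it, and the LLM
# summarizes the JSON response as a natural-language answer.
chain.run("What is the current temperature in Geneva in celsius?")
```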
The method for accessing data typically falls into the following categories:

1. Generating database queries: the LLM converts natural language to a query language such as SQL or Cypher.
2. Generating API queries: the LLM converts natural language to the text used to call APIs.

In both cases, the application executes the LLM-provided suggestion to get the data, then usually passes the results back to the LLM to summarize, as in the sketch below.
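Here is a minimal sketch of that execute-and-summarize loop using tool calling in the OpenAI Python SDK. The `query_database` helper is hypothetical, standing in for real SQL execution against a database of my invention.

```python
import json
from openai import OpenAI

client = OpenAI()

def query_database(sql: str) -> str:
    # Hypothetical helper: in production this would execute the SQL.
    # Returns a stand-in result so the sketch is self-contained.
    return json.dumps([{"region": "North", "households": 1200},
                       {"region": "South", "households": 950}])

tools = [{
    "type": "function",
    "function": {
        "name": "query_database",
        "description": "Run a SQL query against the survey database",
        "parameters": {
            "type": "object",
            "properties": {"sql": {"type": "string"}},
            "required": ["sql"],
        },
    },
}]

messages = [{"role": "user",
             "content": "How many households were surveyed per region?"}]
response = client.chat.completions.create(
    model="gpt-4o", messages=messages, tools=tools)

# (1) The LLM suggests a call; (2) the application executes it;
# (3) the raw result is passed back to the LLM to summarize.
call = response.choices[0].message.tool_calls[0]
result = query_database(**json.loads(call.function.arguments))
messages += [response.choices[0].message,
             {"role": "tool", "tool_call_id": call.id, "content": result}]
summary = client.chat.completions.create(model="gpt-4o", messages=messages)
print(summary.choices[0].message.content)
```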
Getting the Data Can Be a Problem
It's amazing that these techniques now exist, but turning them into production solutions exposes trade-offs: LLMs can generate text for executing database queries and for calling external APIs, and each approach has its advantages and disadvantages.
For example, generating SQL supports all the amazing things a modern database query language can do, such as aggregation across large volumes of data. However, the data might not already be in a database where SQL can be used. It could be ingested and then queried with SQL, but building pipelines like this can be complex and costly to manage.

Accessing data directly through APIs means the data doesn't have to be in a database, which opens up a huge world of publicly available datasets, but there is a catch. Many APIs do not support aggregate queries like those supported by SQL, so the only option is to extract the low-level data and aggregate it in the application. This puts more burden on the LLM application and can require extracting large amounts of data, as the sketch below illustrates.
So both techniques have limitations.
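To see the API limitation in practice, consider this sketch, in which the endpoint URL and field names are assumptions for illustration. With no server-side aggregation available, the application must page through every low-level record and aggregate locally.

```python
import pandas as pd
import requests

# Page through the raw records, since the API offers no aggregate queries.
records, page = [], 1
while True:
    resp = requests.get("https://example.org/api/households",
                        params={"page": page, "page_size": 1000})
    batch = resp.json()["results"]
    if not batch:
        break
    records.extend(batch)
    page += 1

# The aggregation that SQL would have done server-side now happens
# client-side, only after every row has been transferred.
df = pd.DataFrame(records)
print(df.groupby("region")["household_size"].mean())
```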
Passing Data Directly through LLMs Doesn't Scale
On top of this, another major challenge quickly emerges when operationalizing LLMs for data analysis. Most solutions, such as OpenAI Assistants, can generate function calls for the caller to execute to extract data, but the output is then passed back to the LLM. It's unclear exactly what happens internally at OpenAI, but it's not very difficult to pass enough data to cause a token limit breach, suggesting the LLM is being used to process the raw data in a prompt. Many patterns do something along these lines, passing the output of function calling back to the LLM. This, of course, does not scale in the real world, where the volume of data required to answer a question can be large. It soon becomes expensive and often fails, as a quick token count shows.
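A quick sketch of the scaling problem using tiktoken: serializing even a moderately sized query result into a prompt rapidly exhausts a model's context window. The row count and field names here are illustrative.

```python
import json
import tiktoken

# 100,000 low-level records, of the kind an API without aggregation
# would force the application to retrieve.
rows = [{"region": f"R{i % 50}", "household_size": i % 9}
        for i in range(100_000)]
payload = json.dumps(rows)

# Tokenizer used by GPT-4-class models.
enc = tiktoken.get_encoding("cl100k_base")
print(f"{len(enc.encode(payload)):,} tokens")
# Prints a figure in the millions, far beyond a 128k-token context window.
```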