This intervention draws on experimentations ongoing in the context of the OECD-led Statistical Information System Collaboration Community (SIS-CC) to enable AI applications with SDMX. One important use case is to use AI for better accessibility and discoverability of the data: whilst UX techniques, lexical search improvements, and data harmonisation can take statistical organisations to a good level of accessibility, however, a structural (or “cognitive” gap) remains between the data user needs and the data producer constraints. That is where AI – and most importantly, NLP and LLM techniques – could potentially make a difference. The “StatsBot” could be this natural language, conversational engine that could facilitate access and usage of the data. The “StatsBot” could leverage the semantics of any SDMX source.
The objective of the presentation is to propose a technical approach and a way forward to achieve this goal and create the StatsBot as a universal, open asset usable by all statistical organisations. In a first step, the concept tested is to use Large Language Models with the Apache Solr index of SDMX objects so as to transform natural language queries into SDMX queries. In a second step, results could be framed as a natural language statement complementing the top-k search results. For the purpose of initial PoCs – aimed to demonstrate functional features and feasibility – a commercial LLM (such as OpenAI GPT-4) will be used; in a later stage substitution with an open source LLM will be analysed. The presentation will include the results of the first experimental work, lessons learnt, and scope future work that should lead to defining the path for production-grade, fully open source, and universal StatsBot.
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics More Accessible And Discoverable.pptx
1. 29/10/2023
Alessandro Benedetti
When SDMX Meets AI:
Leveraging Open Source LLMs to Make Official
Statistics More Accessible and Discoverable
9th SDMX Global Conference
2. ‣ Born in Tarquinia (ancient Etruscan city in Italy)
‣ R&D Software Engineer
‣ Director
‣ Master degree in Computer Science
‣ PC member for ECIR, SIGIR and Desires
‣ Apache Lucene/Solr PMC member/committer
‣ Elasticsearch/OpenSearch expert
‣ Semantic search, NLP, Machine Learning
technologies passionate
‣ Beach Volleyball player and Snowboarder
ALESSANDRO BENEDETTI
WHO AM I ?
2
3. ‣ Headquarter in London/distributed
‣ Open-source Enthusiasts
‣ Apache Lucene/Solr experts
‣ Elasticsearch/OpenSearch experts
‣ Community Contributors
‣ Active Researchers
‣ Hot Trends : Neural Search,
Natural Language Processing
Learning To Rank,
Document Similarity,
Search Quality Evaluation,
Relevance Tuning
SEArch SErvices
www.sease.io
3
4. AGENDA
4
LLMs and Open Source
SIS-CC to enable AI applications with SDMX
From Natural Language to structured queries
Findings and future Works
5. AI, Machine learning and Deep Learning
https://sease.io/2021/07/artificial-intelligence-applied-to-search-introduction.html
6. WHAT IS A LARGE LANGUAGE MODEL?
● Transformers
● Next-token-prediction and masked-
language-modeling
● estimate the likelihood of each possible
word (in its vocabulary) given the
previous sequence
● learn the statistical structure of
language
● pre-trained on huge quantities of text
https://towardsdatascience.com/how-chatgpt-works-the-models-behind-the-bot-1ce5fca96286
7. OPEN SOURCE LARGE LANGUAGE MODELS
Search + LLMs KPI’s:
Operational
Search Session
Improve Search-driven
Business Metrics.
What KPI’s specific to
LLMs?
What Data metrics?
Combine Metric for
Business?
Focus on limited KPI’s
that impact business.
Track customers onsite
Behaviors for positive or
negative trends.
https://sease.io/2023/06/how-to-choose-the-right-large-language-model-for-your-domain-open-
source-edition.html
● Generalists
○ Falcon
○ LLaMA
■ alpaca
■ vicuna
… many others!
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
8. Search + LLMs KPI’s:
Operational
Search Session
Improve Search-driven
Business Metrics.
What KPI’s specific to
LLMs?
What Data metrics?
Combine Metric for
Business?
Focus on limited KPI’s
that impact business.
Track customers onsite
Behaviors for positive or
negative trends.
● OECD lead initiative (The Organisation for
Economic Co-operation and Development)
● The Statistical Information System Collaboration
Community
● .Stat Suite and Apache Solr
https://siscc.org/developers/technology/
SIS-CC TO ENABLE AI APPLICATIONS WITH SDMX
10. Use case example
Natural language query:
What were the sulfur oxide emissions in
Australia in 2013?
2 Requests to LLM
Search API
GENERATIVE EXTRACTIVE
{ "name":"Emissions of air pollutants",
"id":"ds-siscc-qa:DF_AIR_EMISSIONS"}] },
"Dimensions":[ "Country", "Pollutant", "Unit", "Year"]}, …
Statistic
Filters
{
"Country": ["0|Australia#AUS#"],
"Pollutant": ["0|Sulphur Oxides#SOX#"],
"Unit Of measure": ["0|Total man-made emissions#TOT#"],
"Year": "2013"
}
FROM NATURAL LANGUAGE TO STRUCTURED QUERIES
11. ● How do you update this approach in time?
● New large language models?
● Better prompts?
● How to fit user interactions?
IMPROVING OVER TIME
12. RESULTS: What were the sulfur oxide emissions in Australia in 2013
GPT Generative answer is:
['Sulfur dioxide emissions', 'Air
pollution', 'Environmental impact', 'Fossil
fuel combustion', 'Acid rain']
GPT Extractive answer is:
{'srQMgwl_en_ss': ['1|Environment#ENV#|Air
and climate#ENV_AC#'], 'dimensions_en_ss':
['Time period', 'Reference area',
'Pollutant', 'Country']}
Dataflow retrieved is:
[{'id': 'ds-siscc-qa:DF_AIR_EMISSIONS', 'name':
'Emissions of air pollutants', 'description':
''}]
4 dimensions are available for the above
dataflow:
['Country', 'Year', 'Pollutant', 'Variable']
{
"dataflow": [
{
"description": "",
"id": "ds-siscc-qa:DF_AIR_EMISSIONS",
"name": "Emissions of air pollutants"
}
],
"filters": {
"Country": "0|Australia#AUS#",
"Pollutant": "0|Sulphur Oxides#SOX#",
"Variable": "0|Total man-made emissions#TOT#",
"Year": "2013"
},
"natural_language_query": "What were the sulfur oxide
emissions in Australia in 2013"
}
13. ● Promising! - LLM are good in query expansion (generative or extractive)
● gpt-3.5-turbo-instruct -> new models can do much better!
● 4k tokens - (for both prompt and response) is not enough
● Mistakes - Some times dimension values are associated to the wrong dimension,
more prompt engineering!
FINDINGS
14. ● [Solr] Fine-tune the dimension retrieval Solr query
● [LLM] Select the best model to date - Explore the State of the Art (both commercially and
Open source)
● [LLM] Refine the prompts according to the model
● [LLM] Implement integration tests with the most common failures -> LLM/prompt
engineering to solve them
● [Solr] Finalising the dataflow retrieval Solr query
● [Performance] Stress test the solution
● [Quality] Set up queries/expected documents
THE ROAD TO PRODUCTION
15. ● StatsBot - More conversation!
● Retrieval Augmented Generation
● Results Summarization
FUTURE WORKS