1 | © Copyright 2024 Zilliz
1 Table = 1000 Words?
Foundation Models for Tabular Data
Stefan Webb, Developer Advocate @ Zilliz
Speaker
Stefan Webb
Developer Advocate
stefan.webb@zilliz.com
stefan-webb
Milvus is an Open-Source Vector Database to
store, index, manage, and use the massive
number of embedding vectors generated by
deep neural networks and LLMs.
● Contributors: 400
● Stars: 30K
● Docker pulls: 66M
● Forks: 2.7K+
Milvus: High-performance, scalable vector database
Milvus Users
Deployment Options
Milvus Lite
● Locally hosted
● Suitable for prototyping and demos
Milvus Standalone
● Single remote/local server
● “Medium” scale
● Simplified setup, maintenance, etc. compared to cluster
Milvus Cluster
● Distributed system
● Many different types of nodes
● Scales to 100s of billions of vectors
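One way to picture the three options above is that client code mostly differs in the connection URI it is given. The sketch below is illustrative only: the dictionary, the `uri_for` helper, the file name, and the cluster hostname are hypothetical placeholders, and 19530 is assumed to be the server port.

```python
# Hypothetical mapping from deployment option to a connection URI.
DEPLOYMENT_URIS = {
    "lite": "./milvus_demo.db",                         # Milvus Lite: a local file
    "standalone": "http://localhost:19530",             # single server (assumed port)
    "cluster": "http://milvus.example.internal:19530",  # placeholder hostname
}


def uri_for(mode: str) -> str:
    """Return the connection URI for a deployment option."""
    try:
        return DEPLOYMENT_URIS[mode]
    except KeyError:
        raise ValueError(f"unknown deployment option: {mode!r}") from None


# With pymilvus installed, the URI would be passed to the client, e.g.:
#   from pymilvus import MilvusClient
#   client = MilvusClient(uri=uri_for("lite"))
lite_uri = uri_for("lite")
```

Keeping the URI in configuration means a prototype built on Milvus Lite can later point at a standalone server or cluster without changing application code.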
Why Not Traditional Databases?
● Suboptimal indexing / search
● Scaling
● Inadequate query & analytics support
BM25
● Simple implementation
● Good Retrieval performance
○ Queries Per Second
○ Accuracy
● Used by others in the industry
○ Elasticsearch
○ Apache Lucene
○ MongoDB Atlas Search
○ …
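To make the "simple implementation" point concrete, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is not how Milvus or the systems listed above implement it; the function name, the toy corpus, the whitespace tokenization, and the parameter defaults (`k1=1.5`, `b=0.75`) are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF variant that is always positive.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    "milvus is a vector database".split(),
    "bm25 ranks documents for full text search".split(),
    "tables are structured data".split(),
]
scores = bm25_scores("full text search".split(), corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
```

Documents that share no term with the query score zero, so only the second document is ranked here.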
Full Text Search
Benchmarks
3-20x faster than open-source Milvus
At least 6x faster than other vector databases
https://github.com/zilliztech/VectorDBBench
Structured vs Unstructured Data?
“Structured” “Unstructured”
Text, Image, Audio, Video, Geo, 3D, Graphs, Networks, Molecules
Example Application
Table
Query
Response
Gen AI
“Black Magic”
Other Applications
● Summarization
● Table understanding
● Fact verification
● Table-to-text generation
● Text-to-table generation
● Etc.
CONTENTS
01 Foundation Models for Tabular Data
02 Training the Model
03 Data, Data, Data
04 Agents
05 Code Demo
1. Foundation Models for Tabular Data
Modeling Paradigm
● text → Tokenizer
● table → Table Encoder (Specialized Transformer) → Translator / Adaptor (“QFormer”)
● Tokenizer + Translator / Adaptor → Combined Prompt → LLM (fine-tuned Qwen2.5) → response
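The adaptor step can be sketched as cross-attention in which a fixed set of query vectors attends over a variable number of column embeddings, so the LLM always receives the same number of table vectors. This is a toy illustration in plain Python, not the TableGPT2 implementation: the helper names, the hand-written vectors (learned in practice), the use of column embeddings as both keys and values, and placing table vectors before text vectors in the prompt are all assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, columns):
    """Each query attends over all column embeddings (used as keys and values)."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, col)) / math.sqrt(dim) for col in columns]
        weights = softmax(logits)
        outputs.append([sum(w * col[j] for w, col in zip(weights, columns)) for j in range(dim)])
    return outputs


def combined_prompt(table_vectors, text_vectors):
    """Concatenate adapted table vectors with text token embeddings (order assumed)."""
    return table_vectors + text_vectors


# Hypothetical values: two query vectors, four column embeddings, three text tokens.
queries = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
columns = [[0.2, 0.1, 0.0], [0.9, 0.3, 0.5], [0.1, 0.8, 0.4], [0.0, 0.0, 1.0]]
text_vectors = [[0.5, 0.5, 0.5], [0.1, 0.2, 0.3], [0.3, 0.2, 0.1]]

table_vectors = cross_attend(queries, columns)
prompt = combined_prompt(table_vectors, text_vectors)
```

The output length depends only on the number of queries, which is what lets tables of different widths share one prompt format.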
TableGPT2
[Su et al., 2024 (Zhejiang University)]
Table Encoder
[Su et al., 2024 (Zhejiang University)]
QFormer
[Su et al., 2024 (Zhejiang University)]
2. Training the Model
Training
[Su et al., 2024 (Zhejiang University)]
LLM Table Encoder LLM
DPO
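For readers unfamiliar with DPO (Direct Preference Optimization), the per-example loss can be sketched as below. This is the standard textbook form of the objective, not code from the TableGPT2 authors; the function name, the argument names, and the `beta=0.1` default are assumptions. Inputs are summed log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses under the policy being trained and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * (chosen - rejected log-ratio)))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(m)) equals softplus(-m); computed in a numerically stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# When the policy matches the reference, the loss is log(2).
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Raising the chosen response's likelihood relative to the rejected one lowers the loss.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```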
3. Data Collection / Processing
Data Collection
● Focus on coding applications (important to BI)
● Data:
○ Coding data
■ StackOverflow
■ GitHub
○ General
■ Textbook
● Finance
● Maths
● Biology
● Etc.
■ Kaggle
● Data analysis QAs
○ Table-related QA
● Synthetically generated table QAs
○ Vetted by humans
○ Synthesize then refine
Tabular Data
● Database tables
● Web pages
● Excel
● Academic task
● Research datasets
● Misc.
Data Curation
● Filtering
● Cleaning
● Augmentation
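The three curation steps above can be sketched on a toy table in plain Python. The specific rules are assumptions for illustration (drop fully blank rows, strip whitespace, shuffle column order as an augmentation); the slides do not say which rules were actually used, and all function names are hypothetical.

```python
import random


def is_blank(cell):
    """A cell is blank if it is None or a whitespace-only string."""
    return cell is None or (isinstance(cell, str) and cell.strip() == "")


def filter_rows(rows):
    """Filtering: drop rows in which every cell is blank."""
    return [row for row in rows if not all(is_blank(c) for c in row)]


def clean_rows(rows):
    """Cleaning: strip whitespace from strings and map blank cells to None."""
    return [
        [None if is_blank(c) else (c.strip() if isinstance(c, str) else c) for c in row]
        for row in rows
    ]


def shuffle_columns(header, rows, seed=0):
    """Augmentation: permute column order; the table's content is unchanged."""
    order = list(range(len(header)))
    random.Random(seed).shuffle(order)
    return [header[i] for i in order], [[row[i] for i in order] for row in rows]


header = ["name", "dept", "salary"]
raw_rows = [[" Ada ", "Eng", 120], ["  ", "", None], ["Bo", " Ops ", 95]]

cleaned = clean_rows(filter_rows(raw_rows))
new_header, new_rows = shuffle_columns(header, cleaned, seed=1)
```

Column shuffling is a plausible augmentation because a table's meaning should not depend on column order.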
4. Agents for Tabular Data Analysis
Agentic Workflow
[Su et al., 2024 (Zhejiang University)]
Normalize tables, perform vector database lookup with Milvus
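The two agent steps named above can be sketched as follows. This is only an illustration: the in-memory cosine search stands in for a Milvus lookup, the bag-of-words `embed` function stands in for a real embedding model, and every name here is hypothetical rather than taken from the tablegpt-agent code.

```python
import math
import re


def normalize_column(name):
    """Normalize a raw column header to snake_case."""
    return re.sub(r"[^a-z0-9]+", "_", name.strip().lower()).strip("_")


def embed(text, vocab):
    """Toy bag-of-words embedding over a fixed vocabulary."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [float(tokens.count(word)) for word in vocab]


def cosine(a, b):
    """Cosine similarity; zero if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lookup(query, columns, vocab, top_k=1):
    """Return the top_k columns most similar to the query."""
    query_vec = embed(query, vocab)
    ranked = sorted(columns, key=lambda c: cosine(query_vec, embed(c, vocab)), reverse=True)
    return ranked[:top_k]


raw_columns = ["Customer Name", " Order-Total ($)", "Ship Date"]
columns = [normalize_column(c) for c in raw_columns]
vocab = sorted({token for c in columns for token in c.split("_")})
match = lookup("what is the total of each order", columns, vocab)
```

In a real agent the column or cell embeddings would be stored in a Milvus collection and the ranking would be done by a vector search there instead of the Python sort.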
System Prompt
[https://github.com/tablegpt/tablegpt-agent/blob/main/src/tablegpt/agent/data_analyzer.py]
5. Code and Demo
• TableGPT2 (fine-tuned Qwen2.5)
• Documentation explaining agentic workflow
Summary
● We can learn self-supervised / unsupervised representations of tables
● We can then learn to combine these representations with foundation models, e.g. LLMs, to work across modalities
● Many applications in Business Intelligence
○ Construct an agent with Milvus, LangChain, etc.
● TableGPT2 is only one such approach!
Criticisms of TableGPT2
● Code and weights for the table encoder are not currently available (as of Jan 13, 2025)
● Trouble reconciling the TableLLM and TableGPT2 benchmark tables; GPT-3.5 and GPT-4 often score best
● Potential cherry-picking of benchmarks and baselines
Milvus Office Hours!
Get 1:1 hands-on support on your vector database projects
Women in AI
RAG Hackathon
Connect, learn, and push the
boundaries of AI technology
in a supportive and inspiring
environment!
📍Stanford, Palo Alto
🗓 Jan 25, 2025
🔗 lu.ma/women-in-tech
x.com/milvusio
linkedin.com/company/the-milvus-project
milvus.io/discord
github.com/milvus-io/milvus
zilliz.com/learn/milvus-notebooks
milvus.io/docs/tutorials-overview.md
