1 | © Copyright 2024 Zilliz
Introduction to Data Prep Kit and
Open Source RAG
Tim Spann @ Zilliz
Tim Spann
Principal Developer Advocate, Milvus
https://medium.com/@tspann
https://www.linkedin.com/in/timothyspann/
https://x.com/PaaSDev
The challenge of Unstructured Data
● Problem: Unstructured data comes in many forms, with no
easy way to interact with all of it
● Solution: Vector embeddings
● How: Neural networks, e.g. embedding models
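Once data is embedded as vectors, "interacting with it" largely reduces to comparing vectors. The sketch below is a minimal illustration using cosine similarity, one common comparison metric. The three-dimensional vectors and their names are made up for the example; a real embedding model would produce them, typically with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical hand-written "embeddings", not the output of any real model.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Related items should score higher than unrelated ones.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))
```

Storing these vectors and answering "which stored vectors are closest to this one?" at scale is the role a vector database plays.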
Vector
Databases
Unstructured Data is Everywhere
Unstructured data is any data that does not
conform to a predefined data model.
Currently, 90% of unstructured data is never
analyzed.
Text, images, videos, and more!
Common Tasks when Preparing Data for
LLMs
• Documents
– De-duplication of docs
– Extracting text from PDFs /
Documents
– Removing excess markup
– Tokenizing / Chunking documents
– Assessing document quality
– PII detection
• Code
– Language detection
– Malware detection
– Code quality
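Two of the document tasks above, de-duplication and chunking, can be sketched in plain Python. This is an illustrative sketch only and does not use Data Prep Kit's modules; the function names `dedupe_exact` and `chunk_words` are hypothetical, and the approach (exact matching on normalized text, fixed-size word windows) is one simple option among many.

```python
import hashlib
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()

def dedupe_exact(docs):
    """Keep the first copy of each document, comparing hashes of normalized text."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

def chunk_words(text, chunk_size=5, overlap=2):
    """Split text into windows of chunk_size words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

docs = [
    "Milvus is a vector database.",
    "milvus  is a vector database.",  # same text, different case and spacing
    "RAG combines retrieval and generation.",
]
unique_docs = dedupe_exact(docs)
chunks = chunk_words("one two three four five six seven eight", chunk_size=5, overlap=2)
print(len(unique_docs), chunks)
```

Overlapping chunks are a common choice because a sentence that straddles a chunk boundary still appears whole in at least one chunk. Near-duplicate detection would need a fuzzier technique than the exact hash used here.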
Say Hello to "Data Prep Kit"
Open source toolkit
Helps with data prep
Handles documents + code
Many ready-to-use modules out of the box
Python
Develop on a laptop, scale on clusters
https://github.com/sujee/data-prep-kit-examples/blob/main/dpk-intro/README.md
Data Prep Kit
Retrieval-Augmented Generation (RAG)
A technique that combines the strengths
of retrieval-based and generative
models:
● Improve accuracy and relevance
● Reduce hallucination
● Provide domain-specific
knowledge
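The retrieve-then-generate idea can be sketched as follows. This is a toy illustration under stated assumptions: `retrieve` ranks documents by shared words as a stand-in for vector search, `build_prompt` is a hypothetical helper, and the generation step is left out, since the resulting prompt could be sent to any LLM.

```python
def tokenize(text):
    """Very rough tokenizer: lowercase, strip basic punctuation, split on spaces."""
    for mark in "?.,":
        text = text.replace(mark, " ")
    return set(text.lower().split())

def retrieve(question, documents, top_k=2):
    """Rank documents by word overlap with the question (stand-in for vector search)."""
    q_words = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q_words & tokenize(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, context_docs):
    """Put the retrieved context in front of the question for the generative model."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
    )

documents = [
    "Milvus is an open-source vector database.",
    "Data Prep Kit is an open source toolkit for data prep.",
    "Unstructured data does not conform to a predefined data model.",
]
question = "What is Milvus?"
top_docs = retrieve(question, documents, top_k=1)
prompt = build_prompt(question, top_docs)
print(prompt)
```

Grounding the prompt in retrieved text is what lets the model answer from domain-specific content rather than from its training data alone.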
Open Source RAG
Data Prep Kit
RAG Flow
Milvus Vector
Database
Milvus is an open-source vector database
for GenAI projects. pip install on your
laptop, plug into popular AI dev tools, and
push to production with a single line of
code.
30K+
GitHub Stars
66M+
Docker Pulls
400+
Contributors
2.7K+
Forks
Easy Setup
Pip-install to start coding in a notebook within seconds
Integration
Plug into OpenAI, LangChain, LlamaIndex, and many more
Reusable Code
Write once, and deploy with one line of code into the production
environment
Feature-rich
Dense & sparse embeddings, filtering, reranking and beyond
New Challenge: Search in Vector Spaces
How to Index and
Search?
● High-dimensional
● > 1000 dims
How to Scale?
● 10-100 million vectors?
● Billions?
● Trillions?
● Billions of users?
Multiple Data Types?
● Text
● Images
● Audio
● Graphs
● …
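To see why indexing and scale become the hard questions, consider the simplest possible search: compare the query against every stored vector. The sketch below is an illustrative brute-force nearest-neighbor scan, not how any particular vector database implements search. The collection size and the 8-dimensional random vectors are arbitrary choices that keep the example fast.

```python
import heapq
import random

def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_search(query, vectors, top_k=3):
    """Scan every vector and return the top_k closest as (distance, index) pairs."""
    scored = ((l2_squared(query, v), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(top_k, scored)

random.seed(0)
dim = 8  # kept small for the example; the slide talks about > 1000 dims
vectors = [[random.random() for _ in range(dim)] for _ in range(1000)]

query = vectors[42]  # search for a vector known to be in the collection
results = brute_force_search(query, vectors, top_k=3)
print(results[0])
```

Each query costs work proportional to the number of vectors times the number of dimensions, which is fine for thousands of vectors but not for billions. That cost is the motivation for approximate nearest-neighbor indexes and distributed vector databases.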
Easy Open RAG Stack Highlighted
● Framework
● Hardware Infrastructure
● Embedding Models
● LLMs
● Software Infrastructure
● Vector Database
This week in Vector Databases, Gen AI, LLM,
Apache NiFi, Apache Flink, Apache Kafka, ML,
AI, Apache Spark, Apache Iceberg, Python,
Java, Vector DB and Open Source friends.
https://bit.ly/32dAJft
https://github.com/milvus-io/milvus
AIM Weekly by Tim Spann
Thank You

tspann06-NOV-2024_AI-Alliance_NYC_ intro to Data Prep Kit and Open Source RAG
