tspann06-NOV-2024_AI-Alliance_NYC_ intro to Data Prep Kit and Open Source RAG

1 | © Copyright 2024 Zilliz
Introduction to Data Prep Kit and Open Source RAG
Tim Spann @ Zilliz

Tim Spann
Principal Developer Advocate, Milvus
https://medium.com/@tspann
https://www.linkedin.com/in/timothyspann/
https://x.com/PaaSDev

The challenge of Unstructured Data
● Problem: Unstructured data comes in many forms, with no easy way to interact with it all
● Solution: Vector embeddings
● How: Neural networks, e.g. embedding models
Vector
Databases
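To make the "vector embeddings" idea concrete, here is a minimal sketch of comparing items by cosine similarity. The three-dimensional vectors below are hand-made, hypothetical values for illustration only; a real embedding model would produce vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (NOT produced by a real model).
embeddings = {
    "cat photo": [0.9, 0.1, 0.0],
    "kitten video": [0.8, 0.2, 0.1],
    "tax form": [0.0, 0.1, 0.9],
}

# Rank every item by similarity to the "cat photo" vector.
query = embeddings["cat photo"]
ranked = sorted(
    embeddings,
    key=lambda name: cosine_similarity(query, embeddings[name]),
    reverse=True,
)
print(ranked)
```

Items whose vectors point in a similar direction rank close together, regardless of whether the original data was text, an image, or a video.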

Unstructured Data is Everywhere
Unstructured data is any data that does not
conform to a predefined data model.
Currently, 90% of unstructured data is never
analyzed.
Text, images, videos, and more!

Common Tasks when Preparing Data for LLMs
• Documents
– De-duplication of docs
– Extracting text from PDFs / Documents
– Removing excess markup
– Tokenizing / Chunking documents
– Assessing document quality
– PII detection
• Code
– Language detection
– Malware detection
– Code quality
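Two of the document tasks above, de-duplication and chunking, can be sketched with the Python standard library alone. This is a simplified stand-in and not the Data Prep Kit API; the function names, the exact-match hashing strategy, and the word-based chunk sizes are all assumptions chosen for illustration.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())

def deduplicate(docs):
    """Keep the first occurrence of each document (exact match after normalizing)."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

def chunk_words(text, chunk_size=5, overlap=1):
    """Split text into word chunks; neighbors share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reached the end of the text
    return chunks

docs = ["Hello  World", "hello world", "Other doc"]
unique_docs = deduplicate(docs)
chunks = chunk_words("a b c d e f g h i", chunk_size=5, overlap=1)
print(unique_docs, chunks)
```

Production pipelines typically also handle near-duplicates (fuzzy matching) and chunk by tokens rather than words; this sketch covers only the simplest case.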

Say Hello to "Data Prep Kit"
Open source toolkit
Helps with data prep
Handles documents + code
Many ready-to-use modules out of the box
Python
Develop on laptop, scale on clusters
https://github.com/sujee/data-prep-kit-examples/blob/main/dpk-intro/README.md

Retrieval-Augmented Generation (RAG)
A technique that combines the strengths of retrieval-based and generative models:
● Improve accuracy and relevance
● Reduce hallucinations
● Provide domain-specific
knowledge
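The retrieve-then-generate flow can be sketched in a few lines. This is a toy illustration under stated assumptions: word overlap stands in for vector search, the prompt template is hypothetical, and no LLM is actually called. A real stack would embed the query, search a vector database, and send the prompt to a model.

```python
def score(query, doc):
    """Count shared lowercase words (a crude stand-in for vector similarity)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query, docs, k=2):
    """Return the k documents with the highest overlap score."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query, context_docs):
    """Assemble a grounded prompt; the wording is a hypothetical template."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

docs = [
    "Milvus is an open-source vector database",
    "Data Prep Kit is an open source toolkit for data prep",
    "Apache Kafka is a streaming platform",
]
query = "what is milvus"
top = retrieve(query, docs, k=1)
prompt = build_prompt(query, top)
print(prompt)  # this string would be sent to the LLM of your choice
```

Because the model is asked to answer from retrieved context, domain-specific knowledge can be supplied at query time without retraining the model.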

Milvus Vector Database
Milvus is an open-source vector database
for GenAI projects. pip install on your
laptop, plug into popular AI dev tools, and
push to production with a single line of
code.
30K+ GitHub Stars
66M+ Docker Pulls
400+ Contributors
2.7K+ Forks
Easy Setup: pip install to start coding in a notebook within seconds
Integration: Plug into OpenAI, LangChain, LlamaIndex, and many more
Reusable Code: Write once, and deploy to the production environment with one line of code
Feature-rich: Dense and sparse embeddings, filtering, reranking, and beyond

New Challenge: Search in Vector Spaces
How to Index and Search?
● High-dimensional
● > 1000 dims
How to Scale?
● 10-100 million vectors?
● Billions?
● Trillions?
● Billions of users?
Multiple Data Types?
● Text
● Images
● Audio
● Graphs
● …
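The indexing and scaling questions above become clearer with a baseline. Below is a minimal sketch of exact (brute-force) nearest-neighbor search over random vectors; the dimension, collection size, and function names are assumptions for illustration. Every query compares against every stored vector, so cost grows linearly with the collection, which is why vector databases rely on approximate indexes (for example HNSW or IVF) at scale.

```python
import heapq
import random

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_search(query, vectors, k=3):
    """Return indices of the k closest vectors by scanning all of them."""
    return heapq.nsmallest(
        k, range(len(vectors)), key=lambda i: squared_l2(query, vectors[i])
    )

random.seed(0)  # fixed seed so the toy data is reproducible
dim = 8         # tiny on purpose; real embeddings are often > 1000 dims
vectors = [[random.random() for _ in range(dim)] for _ in range(1000)]

query = vectors[42]  # search for a vector we know is in the collection
nearest = brute_force_search(query, vectors, k=3)
print(nearest)
```

With 1,000 vectors this scan is instant; at billions of vectors and high dimensionality, a full scan per query is impractical without an index.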

Easy Open RAG Stack Highlighted
● Framework
● Hardware Infrastructure
● Embedding Models
● LLMs
● Software Infrastructure
● Vector Database

This week in Vector Databases, Gen AI, LLM,
Apache NiFi, Apache Flink, Apache Kafka, ML,
AI, Apache Spark, Apache Iceberg, Python,
Java, Vector DB and Open Source friends.
https://bit.ly/32dAJft
https://github.com/milvus-io/milvus
AIM Weekly by Tim Spann
