How Kafka Powers the World's Most Popular Vector Database System with Charles Xie and Frank Liu | Current 2022

How Kafka Powers the World’s Most Popular Vector Database
Frank Liu

Speaker
Frank Liu
Director of Operations & ML Architect
frank@zilliz.com
https://linkedin.com/in/fzliu
https://twitter.com/frankzliu

CONTENTS
01 Unstructured Data and Embeddings
02 Vector Database Overview
03 Milvus Architecture
04 Kafka as a Messaging Backbone

01
Unstructured Data and Embeddings

What is Unstructured Data?
Any data that does not conform to a pre-defined data model.

Using Vectors to Represent Data
Embeddings!
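
As a minimal sketch of why embeddings are useful, the snippet below compares toy vectors with cosine similarity. The three-dimensional vectors and their labels are hypothetical; real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, hand-picked for illustration only.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "truck": [0.0, 0.2, 0.9],
}

if __name__ == "__main__":
    print(cosine_similarity(embeddings["cat"], embeddings["kitten"]))
    print(cosine_similarity(embeddings["cat"], embeddings["truck"]))
```

Semantically similar items should end up close together, so "cat" scores higher against "kitten" than against "truck".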

Vector Database Overview
A database purpose-built to store, index, and query large quantities of
embeddings.
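
The store/query part of that definition can be sketched with a brute-force ("flat") search in plain Python. `FlatVectorStore` is a hypothetical name for illustration, not a Milvus API; a production system would use an approximate index rather than scanning every vector.

```python
import heapq
import math

class FlatVectorStore:
    """Toy in-memory store that answers nearest-neighbor queries by full scan."""

    def __init__(self):
        self._vectors = {}

    def insert(self, key, vector):
        self._vectors[key] = list(vector)

    def search(self, query, top_k=2):
        # Returns (distance, key) pairs for the top_k closest vectors,
        # ordered from nearest to farthest by Euclidean distance.
        scored = ((math.dist(query, v), k) for k, v in self._vectors.items())
        return heapq.nsmallest(top_k, scored)

if __name__ == "__main__":
    store = FlatVectorStore()
    store.insert("a", [0.0, 0.0])
    store.insert("b", [1.0, 1.0])
    store.insert("c", [5.0, 5.0])
    print(store.search([0.9, 0.9], top_k=2))
```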

Vector Databases in Production

Milvus Features
• Supports hardware accelerators
• SIMD support on CPUs
• GPU support for faster querying & indexing
• Supports key database functions
• Data partitioning and data sharing
• Filtered queries and searches
• Multiple options for indices and similarity metrics
• FAISS (HNSW, Flat, PQ), ANNOY, DiskANN, ScANN, etc…
• Euclidean, dot product (cosine), boolean
• A number of SDKs
• Python
• Go
• Node
• Java
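
The metrics bullet pairs dot product with cosine. One way to read that (an assumption here, not something the slides spell out) is that on L2-normalized vectors the inner product equals cosine similarity, and it is tied to Euclidean distance by ||u - v||^2 = 2 - 2(u · v). A small sketch with hypothetical helper names:

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return math.dist(a, b)

if __name__ == "__main__":
    u = l2_normalize([3.0, 4.0])
    v = l2_normalize([1.0, 0.0])
    # For unit vectors both expressions print the same value.
    print(euclidean(u, v) ** 2, 2 - 2 * inner_product(u, v))
```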

Cloud Nativity
• Kubernetes native
• Deployment through Helm
• Native S3 support
• MinIO-based design
• Azure Blob and GCS support
• Easy on-prem to cloud conversion
• Fully distributed
• Highly elastic and horizontally scalable
• Disaggregated storage and compute (shared storage)
• Separate read, write, and background (indexing) services

Access Layer
• Data-related languages (SQL context)
• Data definition language (DDL): modify/define database schema
• Data manipulation language (DML): store, modify, and retrieve data
• Data control language (DCL): define user rights and permissions
• Access layer = multiple proxy nodes
• Proxy node functions
• Manage message ingestion and routing
• Point DDL and DCL instructions to coordinators
• Point DML to the log for worker consumption
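
The routing rule above can be sketched as follows. `Proxy`, `Kind`, and the two in-memory lists are hypothetical stand-ins for a proxy node, the request types, the coordinators, and the log broker; none of this is Milvus code.

```python
from enum import Enum

class Kind(Enum):
    DDL = "ddl"  # schema changes
    DML = "dml"  # inserts, deletes, queries on data
    DCL = "dcl"  # rights and permissions

class Proxy:
    """Toy proxy: DDL/DCL go to coordinators, DML goes to the log."""

    def __init__(self):
        self.coordinator_inbox = []  # stand-in for coordinator requests
        self.log = []                # stand-in for the log broker

    def handle(self, kind, payload):
        if kind in (Kind.DDL, Kind.DCL):
            self.coordinator_inbox.append((kind, payload))
            return "coordinator"
        self.log.append((kind, payload))
        return "log"

if __name__ == "__main__":
    proxy = Proxy()
    print(proxy.handle(Kind.DDL, "create collection docs"))
    print(proxy.handle(Kind.DML, "insert into docs"))
```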

Coordinator Layer
• Root coordinator node
• Handles DDL and DCL requests
• Data coordinator node
• Triggers background data operations (flush, compact, etc.)
• Manages data node cluster
• Maintains metadata of inserted data
• Query coordinator node
• Manages query node cluster
• Index coordinator node
• Manages index node cluster
• Determines when indexes are built
• Maintains index metadata

Worker Layer
• Worker overview
• All workers are stateless
• All DML requests are handled by workers
• Data node
• Retrieves incremental log data from the log broker
• Packs and stores log data into log snapshots
• Processes mutation requests
• Query node
• Loads indexes and data from object storage
• Runs searches and queries
• Index node
• Builds indexes on inserted data
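
A rough sketch of the data node behavior described above: consume log entries, buffer them, and pack each full buffer into a snapshot in object storage. The `DataNode` class, the `flush_size` threshold, and the snapshot key format are all assumptions made for illustration; a plain dict stands in for S3/MinIO.

```python
class DataNode:
    """Toy data node that packs consumed log entries into snapshots."""

    def __init__(self, object_store, flush_size=3):
        self.object_store = object_store  # dict standing in for object storage
        self.flush_size = flush_size      # hypothetical flush threshold
        self._buffer = []
        self._next_snapshot = 0

    def consume(self, entries):
        for entry in entries:
            self._buffer.append(entry)
            if len(self._buffer) >= self.flush_size:
                self.flush()

    def flush(self):
        # Write buffered entries as one snapshot, then clear the buffer.
        if not self._buffer:
            return
        key = f"snapshot-{self._next_snapshot:04d}"
        self.object_store[key] = list(self._buffer)
        self._buffer.clear()
        self._next_snapshot += 1

if __name__ == "__main__":
    store = {}
    node = DataNode(store, flush_size=2)
    node.consume(["e1", "e2", "e3"])
    node.flush()
    print(store)
```

Because durable data lives in the log and in object storage, a replacement node could rebuild its buffer by re-reading the log, which is one way to interpret "stateless" here.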

Storage Layer
• Log broker - Kafka
• Streaming data persistence
• Execution of reliable asynchronous queries
• Event notification
• Metadata storage - etcd
• Service registration and health checks
• Message consumption checkpoints
• Object storage - S3/MinIO
• Stores snapshot files of logs
• Stores index files for scalar and vector data
• Stores intermediate query results
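
To illustrate how message consumption checkpoints in a metadata store let a consumer resume where it stopped, here is a minimal sketch. A list stands in for the log broker and a dict for etcd; `consume_from_checkpoint` and its parameters are hypothetical names.

```python
def consume_from_checkpoint(log, metadata, consumer_id, handler, batch=2):
    """Process up to `batch` entries after the saved offset, then save the new offset."""
    offset = metadata.get(consumer_id, 0)
    for entry in log[offset:offset + batch]:
        handler(entry)
        offset += 1
    metadata[consumer_id] = offset  # persist the checkpoint
    return offset

if __name__ == "__main__":
    log = ["a", "b", "c"]
    metadata = {}
    seen = []
    consume_from_checkpoint(log, metadata, "data-node-1", seen.append)
    # A restarted consumer reads the checkpoint and continues from it.
    consume_from_checkpoint(log, metadata, "data-node-1", seen.append)
    print(seen, metadata)
```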

Key Takeaways
• Single coordinator instance per service type
• Coordinators manage corresponding worker node cluster
• Data is stored in Collections
• Akin to collections in MongoDB or tables in relational databases
• Disaggregation of query, indexing, and data
• Significant horizontal scalability
• Support for a wide range of application requirements
• Message streams are core to Milvus
• All data passes through message queue
• Kafka's innate cloud nativity allows Milvus to scale easily

04
Kafka as a Messaging Backbone

Milvus’ Messaging Backbone
• Log as data
• Operations are centralized around the log broker
• CRUD operations by subscribing to and consuming logs
• Pub/sub scheme allows for stream & batch processing
• Decoupling of read and write components
• Coordinators manage corresponding worker node cluster
• Support for both streaming and batched execution
• Data nodes read from streams and write to binlog
• Streaming uses WAL, batching uses binlog
• All requests that change system state go through WAL
• Create collection, delete collection
• Insert, update, delete vector
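
The "log as data" idea can be sketched by deriving system state purely from replaying WAL entries. The tuple layout and operation names below are assumptions chosen for illustration, not the Milvus log format.

```python
def replay(wal):
    """Rebuild collection state by applying WAL entries in order."""
    collections = {}
    for op, collection, *args in wal:
        if op == "create_collection":
            collections[collection] = {}
        elif op == "delete_collection":
            collections.pop(collection, None)
        elif op == "insert":
            pk, vector = args
            collections[collection][pk] = vector
        elif op == "delete":
            (pk,) = args
            collections[collection].pop(pk, None)
    return collections

if __name__ == "__main__":
    wal = [
        ("create_collection", "docs"),
        ("insert", "docs", 1, [0.1, 0.2]),
        ("insert", "docs", 2, [0.3, 0.4]),
        ("delete", "docs", 1),
    ]
    print(replay(wal))      # state after the full log
    print(replay(wal[:3]))  # state as of an earlier point in the log
```

Since every subscriber can replay the same log from any offset, readers and writers do not need to talk to each other directly.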

Example: Vector Insert
• Loggers are organized into a hash ring
• Time Stamp Oracle (TSO) ensures logger consistency
• Different channels for different requests
• Prevents request type interference
• Data nodes subscribe to specific channels
• Inserts hashed across multiple channels (+efficiency)
• Data nodes can be freely expanded to increase throughput
• Convert row-based WAL to column-based binlogs
• Kafka’s cloud nativity powers Milvus’ scalability
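
The insert path above can be sketched in three steps: stamp each row with a timestamp, hash it onto a channel by primary key, and convert a batch of rows into columns. Everything here is a simplified assumption: `TimestampOracle` is a plain counter, CRC32 is an arbitrary choice of hash, and the channel count is made up.

```python
import zlib

class TimestampOracle:
    """Toy stand-in for a TSO: hands out strictly increasing timestamps."""

    def __init__(self):
        self._last = 0

    def next(self):
        self._last += 1
        return self._last

def channel_for(pk, num_channels):
    # A stable hash, so the same primary key always maps to the same channel.
    return zlib.crc32(str(pk).encode()) % num_channels

def publish(rows, channels, tso):
    """Stamp each row and append it to the channel chosen by its primary key."""
    for row in rows:
        stamped = dict(row, ts=tso.next())
        channels[channel_for(row["id"], len(channels))].append(stamped)

def rows_to_columns(rows):
    """Convert row-based entries into a column-based layout."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns

if __name__ == "__main__":
    channels = [[], []]  # two hypothetical channels
    rows = [{"id": i, "vector": [i / 10]} for i in range(1, 4)]
    publish(rows, channels, TimestampOracle())
    for channel in channels:
        print(rows_to_columns(channel))
```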

THANK YOU FOR LISTENING
https://github.com/milvus-io/milvus
https://zilliz.com
