SlideShare a Scribd company logo
1 of 72
Download to read offline
Powered by
Vector Search for
Data Scientists
A Case Study with Twitter Analytics
#1 - How is my data distributed?
#2 - Are there outliers in my data?
#3 - Are my variables correlated with each other?
Common questions in Data Science
#1 - Can we capture the semantics in vector representations?
#2 - What can we learn about our data from semantic clusters?
Vector Search
Social Media Clicks and Twitter Analytics
Twitter Analytics
Twitter Analytics CSV Data
Tweet Text Time Impressions Engagements Engagement
Rate
Retweets Replies Likes User
Profile
Clicks
Url
Clicks
I just published “ANN
Benchmarks with Etienne
Dilcoker -- Weaviate
Podcast #16 on Medium..
May 27th,
1:34pm
1905 50 2.6% 3 1 15 2 18
Approximate Nearest
Neighbor algorithms
allow us to Vector Search
in massive datasets! …
May 24th,
1:13pm
7182 252 3.5% 14 1 50 27 36
Feature Engineering:
Contains Emoji?
Character Count?
Word Count?
Contains “Weaviate”?
Key Takeaways:
“Vector Search
for Data
Scientists”
1. Segmentation in Data
Science
2. Vector Representations of
Data
3. Vector Segmentation
4. Weaviate for Twitter
Analytics
5. Research Questions and
Discussion
Slides, Colab Notebook, Video Presentation available on:
github.com/CShorten/Vector-Search-for-Data-Scientists
Key Takeaway #1 -
Segmentation in
Data Science
Visualizing Distributions of Values
Segmentation in Data Science
● What Time was the Tweet sent?
● Is there a URL Link in the Tweet?
● Symbolic vs. Vector Segmentation
What Time was the Tweet sent?
Is there a URL Link in the Tweet?
Can we split Impressions based on
the Semantics of the content?
Weaviate Podcast Weaviate Tutorial AI Weekly Update
How can we segment analytics based on the
semantics of…
● Text
● Images
● Code
● Audio
● Video
● Graph-Structure
● Biological Sequences
● … !
Summary of
Takeaway #1
Segmentation in
Data Science
We visualize the Distribution
of our data to get a sense of it.
For example we see that
Impressions are somewhat
Normally Distributed.
Is that also true for Tweets sent
at 3 AM?
What about Tweets related to
Deep Learning for Robotics?
Key Takeaway #2 -
Vector
Representations
of Data
Symbols compared to Vectors
Symbols
Category - [0, 1, 0, 0, 0, 0]
Numeric - 52
Boolean - True
[0.1, 0.8, 0.34, 0.8, … 0.2]
Vectors
Vector Representations of Data
Photo by Shayna Douglas on Unsplash
0.83
0.35
..
0.02
Photo by Bill Stephan on Unsplash
0.74
0.01
..
0.95
Are these puppies similar?
Let’s ask Vector Distance!
L2 Distance = ∑ || ai
- bi
||2
L2 Distance (Puppy1, Puppy2) = (4-2)2
+ (8-9)2
+ (10-11)2
= 6
L2 Distance (Puppy1, Airplane) = (4-1)2
+ (8-20)2
+ (10-20)2
= 253
6 << 253, Puppy1 is thus much more semantically similar to Puppy2 than Airplane
Vector Name Value 1 Value 2 Value 3
Puppy1 4 8 10
Puppy2 2 9 11
Airplane 1 20 20
Capturing Semantics in Vector Representations
How do Vectors represent real-world objects?
0.08 0.53 0.16 … 0.83 0.18
384 dimensional vector
Does this represent how much of a “brand” this is?
We aren’t sure! But there are research fields such as “Multimodal Neurons”
from OpenAI, and the general field of Disentangled RepresentationLearning
that are making great strides in understanding this.
Can we compress these vectors?
…
384 dimensional vector
Sometimes!
Ideas like Binary Passage Retrieval (shown above) - fp32 to Binary values
Ideas like Product Quantization - 384-d vector mapped to 32-d
Semantic Similarity with Vector Representations
Sentence-BERT:
Sentence Embeddings
using Siamese
BERT-Networks
Authored by
Nils Reimers and Iryna
Gurevych
Published 2019
Query Point
Positive and Negative Pair Sampling
Positive (Semantically Similar)
Negative (Semantically Different)
Another strategy - Data2Vec, Baevski et al. 2022
Do we need to train our own
models?
No! There are many pre-trained
models that work very well for
a broad range of data!
Great place to get started: Sentence Transformers
Summary of
Takeaway #2
Vector
Representations
of Data
Data such as Images, Text, Code,
… can be represented as Vectors
with Deep Learning models.
These models are trained to
maximize semantic similarity
with massive collections of data.
We often do not need to train the
models ourselves for particular
data domains to reach
reasonable performance.
Key Takeaway #3 -
Vector
Segmentation
● Text
● Images
● Code
● Audio
● Video
● Graph-Structure
● Biological Sequences
● … !
We can segment analytics based on the
semantics of…
Can we split Impressions based on
the Semantics of the content?
Weaviate Podcast Weaviate Tutorial AI Weekly Update
More Examples
House Hunting
Symbols: # of bedrooms, # of bathrooms, square feet, city
→ With Vectors we can encode:
● Visual style
● Neighborhood structure
● Moreflexibleinterfacetodefinefeatureswithtext
e-Commerce Products
Symbols: “Shoes”, “T-Shirt”, “Pants” or colors
→ With Vectors we can encode visual styles
Movies
Symbols can differentiate between genres like “Children”, “Action”, or “Sci-Fi”
→ With Vectors we can encode:
● Themes
● Characters
● Storylines
Scientific Papers
Symbols: “Biology”, “Machine Learning”
→ With Vectors we can encode
● Nuance of the ideas
● Writing style
Music
Symbols can differentiate between genres like “Hip Hop”, “Dance”
→ With Vectors we can encode:
● Tone
● Lyrics
● Instruments
“That’s the magic of deep learning:
turning meaning into vectors, then into geometric
spaces, and then incrementally learning complex
geometric transformations that map one space to
another. All you need are spaces of sufficiently high
dimensionality in order to capture the full scope of
the relationships found in the original data.”
- Francois Chollet, Deep Learning with Python, 2nd edition
Summary of
Takeaway #3
Vector
Segmentation
Vector representations, also
known as embeddings,
enable an Interfaceto split
analytics based on the
Semanticsof the content.
This content could be Text,
Images, Code, Audio,
Videos, …
Key Takeaway #4 -
Weaviate for
Twitter Analytics
Twitter Analytics
Tweet Text Time Impressions Engagements Engagement
Rate
Retweets Replies Likes User
Profile
Clicks
Url
Clicks
I just published “ANN
Benchmarks with Etienne
Dilcoker -- Weaviate
Podcast #16 on Medium..
May 27th,
1:34pm
1905 50 2.6% 3 1 15 2 18
Approximate Nearest
Neighbor algorithms allow
us to Vector Search in
massive datasets! …
May 24th,
1:13pm
7182 252 3.5% 14 1 50 27 36
Cloud Data Upload
There are many other ways to do this as well
Google Colab Weaviate Cloud
Services
GraphQL Live Demo
5 Nearest Neighbors to → “Weaviate Coding Tutorial”
Content Impressions
“We have 4 Weaviate Podcast Episodes so far [ … ] how to utilize the
Weaviate Database as a Document Store in Haystack pipelines … ”
311
“We have 2 new coding tutorials on Weaviate YouTube…” 1144
“@weaviate_io Love the integration of this with the GraphQL API!” 378
“Here are some thoughts on combining Weaviate and Haystack! TLDR:
Weavaite is a great Vector Search database…”
15563
“Weaviate (@weaviate_io) is also announcing a collaboration with Jina
AI (@JinaAI_)! …”
586
What was the Tweet about?
Have I tweeted something like this before?
Have any Weaviate Podcast guests
tweeted something like this recently?
Tweet, Author, Likes
GraphQL Live Demo
GraphQL Wikipedia Demo
Wikipedia Live Demo - Graph Data Model
GraphQL Wikipedia Demo
● Weaviate is a Vector
Search Database, rather
than a Library such as
Facebook’s FAISS or
ANNOY from Spotify
● Weaviate has a
Graph-like Data Model
Expanding Twitter project with Graph Model
Summary of
Takeaway #4
Weaviate for
Twitter Analytics
We can segment Impressions
on Twitterbased on the
content of the tweet without
manual labeling!
Weaviateis a Vector Search
Databasethat can be used to
store and search through
semantic embeddings of data.
Key Takeaway #5 -
Research Questions
and Discussion
Research Questions and Discussion
● Should I fine-tune my embedding model?
● Large-Scale Vector Search with Approximate
Nearest Neighbor (ANN) Algorithms
● How does Vector Search differ from Classification or
Regression models?
Vector Search versus Regression on Impressions
8,530 Impressions
Model Prediction
Interpretability of Vector Search
Nearest Neighbors
Interpretability of Vector Search and Prediction
8,530 Impressions
Model Prediction
What do we want to know about our Tweets?
Should I post this?
When might be a better time to post it?
What might be a better phrasing of this tweet?
Expanding from individuals to teams
● Has anyone on my team tweeted something
like this recently?
● Who on our team would be best fit to tell this
story?
● What topics should we be tweeting about?
Summary of
Takeaway #5
Research
Questions and
Discussion
How can we improve these
systems? What looks
promising?
Key Takeaways:
“Vector Search
for Data
Scientists”
1. Segmentation in Data
Science
2. Vector Representations of
Unstructured Data
3. Vector Segmentation
4. Weaviate Example for
Twitter Analytics
5. Research Questions and
Discussion
Slides, Colab Notebook, Video Presentation available on:
github.com/CShorten/Vector-Search-for-Data-Scientists
Connect with us!
Weaviate Slack Channel
YouTube: Weaviate • Vector Search Engine
Weaviate Podcast
Twitter @weaviate_io
Thank you for Watching!
Special thanks to Sebastian Witalec in
advising the development of this presentation
and Svitlana Smolianova for visual styling.

More Related Content

What's hot

5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Dynamodb Presentation
Dynamodb PresentationDynamodb Presentation
Dynamodb Presentationadvaitdeo
 
Databricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI InitiativesDatabricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI InitiativesDatabricks
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouseJames Serra
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureKai Wähner
 
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...Simplilearn
 
How Kafka Powers the World's Most Popular Vector Database System with Charles...
How Kafka Powers the World's Most Popular Vector Database System with Charles...How Kafka Powers the World's Most Popular Vector Database System with Charles...
How Kafka Powers the World's Most Popular Vector Database System with Charles...HostedbyConfluent
 
Design Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data AnalyticsDesign Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data AnalyticsDataWorks Summit
 
Data Modeling in Looker
Data Modeling in LookerData Modeling in Looker
Data Modeling in LookerLooker
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Improving Search Relevance in Elasticsearch Using Machine Learning - Milorad ...
Improving Search Relevance in Elasticsearch Using Machine Learning - Milorad ...Improving Search Relevance in Elasticsearch Using Machine Learning - Milorad ...
Improving Search Relevance in Elasticsearch Using Machine Learning - Milorad ...Institute of Contemporary Sciences
 
Deep Dive Into Elasticsearch
Deep Dive Into ElasticsearchDeep Dive Into Elasticsearch
Deep Dive Into ElasticsearchKnoldus Inc.
 
How to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environmentHow to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environmentBlueData, Inc.
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)Prashant Gupta
 
Introduction of Knowledge Graphs
Introduction of Knowledge GraphsIntroduction of Knowledge Graphs
Introduction of Knowledge GraphsJeff Z. Pan
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisArnab Mitra
 
What’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningDatabricks
 

What's hot (20)

5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Dynamodb Presentation
Dynamodb PresentationDynamodb Presentation
Dynamodb Presentation
 
Databricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI InitiativesDatabricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI Initiatives
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
 
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
HBase Tutorial For Beginners | HBase Architecture | HBase Tutorial | Hadoop T...
 
How Kafka Powers the World's Most Popular Vector Database System with Charles...
How Kafka Powers the World's Most Popular Vector Database System with Charles...How Kafka Powers the World's Most Popular Vector Database System with Charles...
How Kafka Powers the World's Most Popular Vector Database System with Charles...
 
Design Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data AnalyticsDesign Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data Analytics
 
Data Modeling in Looker
Data Modeling in LookerData Modeling in Looker
Data Modeling in Looker
 
Word2Vec
Word2VecWord2Vec
Word2Vec
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Improving Search Relevance in Elasticsearch Using Machine Learning - Milorad ...
Improving Search Relevance in Elasticsearch Using Machine Learning - Milorad ...Improving Search Relevance in Elasticsearch Using Machine Learning - Milorad ...
Improving Search Relevance in Elasticsearch Using Machine Learning - Milorad ...
 
Deep Dive Into Elasticsearch
Deep Dive Into ElasticsearchDeep Dive Into Elasticsearch
Deep Dive Into Elasticsearch
 
How to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environmentHow to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environment
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Introduction of Knowledge Graphs
Introduction of Knowledge GraphsIntroduction of Knowledge Graphs
Introduction of Knowledge Graphs
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
What’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine Learning
 

Similar to Vector Search for Data Scientists.pdf

IRJET - Deep Learning based Chatbot
IRJET - Deep Learning based ChatbotIRJET - Deep Learning based Chatbot
IRJET - Deep Learning based ChatbotIRJET Journal
 
apidays LIVE Australia 2021 - Tracing across your distributed process boundar...
apidays LIVE Australia 2021 - Tracing across your distributed process boundar...apidays LIVE Australia 2021 - Tracing across your distributed process boundar...
apidays LIVE Australia 2021 - Tracing across your distributed process boundar...apidays
 
data-science-pdf-16588.pdf
data-science-pdf-16588.pdfdata-science-pdf-16588.pdf
data-science-pdf-16588.pdfvkharish18
 
Weaviate and Pinecone Comparison.pdf
Weaviate and Pinecone Comparison.pdfWeaviate and Pinecone Comparison.pdf
Weaviate and Pinecone Comparison.pdfEvgenios Skitsanos
 
Generating domain specific sentiment lexicons using the Web Directory
Generating domain specific sentiment lexicons using the Web Directory Generating domain specific sentiment lexicons using the Web Directory
Generating domain specific sentiment lexicons using the Web Directory acijjournal
 
DataChat_FinalPaper
DataChat_FinalPaperDataChat_FinalPaper
DataChat_FinalPaperUrjit Patel
 
Spivack Blogtalk 2008
Spivack Blogtalk 2008Spivack Blogtalk 2008
Spivack Blogtalk 2008Blogtalk 2008
 
IRJET - Cyberbulling Detection Model
IRJET -  	  Cyberbulling Detection ModelIRJET -  	  Cyberbulling Detection Model
IRJET - Cyberbulling Detection ModelIRJET Journal
 
Entity Typing Using Distributional Semantics and DBpedia
Entity Typing Using Distributional Semantics and DBpedia Entity Typing Using Distributional Semantics and DBpedia
Entity Typing Using Distributional Semantics and DBpedia Marieke van Erp
 
IRJET - Twitter Sentimental Analysis
IRJET -  	  Twitter Sentimental AnalysisIRJET -  	  Twitter Sentimental Analysis
IRJET - Twitter Sentimental AnalysisIRJET Journal
 
SentimentAnalysisofTwitterProductReviewsDocument.pdf
SentimentAnalysisofTwitterProductReviewsDocument.pdfSentimentAnalysisofTwitterProductReviewsDocument.pdf
SentimentAnalysisofTwitterProductReviewsDocument.pdfDevinSohi
 
Research on collaborative information sharing systems
Research on collaborative information sharing systemsResearch on collaborative information sharing systems
Research on collaborative information sharing systemsDavide Eynard
 
Data science presentation
Data science presentationData science presentation
Data science presentationMSDEVMTL
 
CIDR 2009: Jeff Heer Keynote
CIDR 2009: Jeff Heer KeynoteCIDR 2009: Jeff Heer Keynote
CIDR 2009: Jeff Heer Keynoteinfoblog
 
Twitter text mining using sas
Twitter text mining using sasTwitter text mining using sas
Twitter text mining using sasAnalyst
 
IRJET - Implementation of Twitter Sentimental Analysis According to Hash Tag
 IRJET - Implementation of Twitter Sentimental Analysis According to Hash Tag IRJET - Implementation of Twitter Sentimental Analysis According to Hash Tag
IRJET - Implementation of Twitter Sentimental Analysis According to Hash TagIRJET Journal
 
Unleashing twitter data for fun and insight
Unleashing twitter data for fun and insightUnleashing twitter data for fun and insight
Unleashing twitter data for fun and insightDigital Reasoning
 
Unleashing Twitter Data for Fun and Insight
Unleashing Twitter Data for Fun and InsightUnleashing Twitter Data for Fun and Insight
Unleashing Twitter Data for Fun and InsightMatthew Russell
 

Similar to Vector Search for Data Scientists.pdf (20)

Aman chaudhary
 Aman chaudhary Aman chaudhary
Aman chaudhary
 
IRJET - Deep Learning based Chatbot
IRJET - Deep Learning based ChatbotIRJET - Deep Learning based Chatbot
IRJET - Deep Learning based Chatbot
 
apidays LIVE Australia 2021 - Tracing across your distributed process boundar...
apidays LIVE Australia 2021 - Tracing across your distributed process boundar...apidays LIVE Australia 2021 - Tracing across your distributed process boundar...
apidays LIVE Australia 2021 - Tracing across your distributed process boundar...
 
data-science-pdf-16588.pdf
data-science-pdf-16588.pdfdata-science-pdf-16588.pdf
data-science-pdf-16588.pdf
 
Weaviate and Pinecone Comparison.pdf
Weaviate and Pinecone Comparison.pdfWeaviate and Pinecone Comparison.pdf
Weaviate and Pinecone Comparison.pdf
 
Generating domain specific sentiment lexicons using the Web Directory
Generating domain specific sentiment lexicons using the Web Directory Generating domain specific sentiment lexicons using the Web Directory
Generating domain specific sentiment lexicons using the Web Directory
 
DataChat_FinalPaper
DataChat_FinalPaperDataChat_FinalPaper
DataChat_FinalPaper
 
Spivack Blogtalk 2008
Spivack Blogtalk 2008Spivack Blogtalk 2008
Spivack Blogtalk 2008
 
IRJET - Cyberbulling Detection Model
IRJET -  	  Cyberbulling Detection ModelIRJET -  	  Cyberbulling Detection Model
IRJET - Cyberbulling Detection Model
 
Entity Typing Using Distributional Semantics and DBpedia
Entity Typing Using Distributional Semantics and DBpedia Entity Typing Using Distributional Semantics and DBpedia
Entity Typing Using Distributional Semantics and DBpedia
 
IRJET - Twitter Sentimental Analysis
IRJET -  	  Twitter Sentimental AnalysisIRJET -  	  Twitter Sentimental Analysis
IRJET - Twitter Sentimental Analysis
 
SentimentAnalysisofTwitterProductReviewsDocument.pdf
SentimentAnalysisofTwitterProductReviewsDocument.pdfSentimentAnalysisofTwitterProductReviewsDocument.pdf
SentimentAnalysisofTwitterProductReviewsDocument.pdf
 
Social media with big data analytics
Social media with big data analyticsSocial media with big data analytics
Social media with big data analytics
 
Research on collaborative information sharing systems
Research on collaborative information sharing systemsResearch on collaborative information sharing systems
Research on collaborative information sharing systems
 
Data science presentation
Data science presentationData science presentation
Data science presentation
 
CIDR 2009: Jeff Heer Keynote
CIDR 2009: Jeff Heer KeynoteCIDR 2009: Jeff Heer Keynote
CIDR 2009: Jeff Heer Keynote
 
Twitter text mining using sas
Twitter text mining using sasTwitter text mining using sas
Twitter text mining using sas
 
IRJET - Implementation of Twitter Sentimental Analysis According to Hash Tag
 IRJET - Implementation of Twitter Sentimental Analysis According to Hash Tag IRJET - Implementation of Twitter Sentimental Analysis According to Hash Tag
IRJET - Implementation of Twitter Sentimental Analysis According to Hash Tag
 
Unleashing twitter data for fun and insight
Unleashing twitter data for fun and insightUnleashing twitter data for fun and insight
Unleashing twitter data for fun and insight
 
Unleashing Twitter Data for Fun and Insight
Unleashing Twitter Data for Fun and InsightUnleashing Twitter Data for Fun and Insight
Unleashing Twitter Data for Fun and Insight
 

Recently uploaded

Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceBrainSell Technologies
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Hr365.us smith
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Mater
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
cpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.pptcpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.pptrcbcrtm
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 

Recently uploaded (20)

Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
cpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.pptcpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.ppt
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your BusinessAdvantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
 

Vector Search for Data Scientists.pdf

  • 1. Powered by Vector Search for Data Scientists A Case Study with Twitter Analytics
  • 2. #1 - How is my data distributed? #2 - Are there outliers in my data? #3 - Are my variables correlated with each other? Common questions in Data Science
  • 3. #1 - Can we capture the semantics in vector representations? #2 - What can we learn about our data from semantic clusters? Vector Search
  • 4. Social Media Clicks and Twitter Analytics
  • 6. Twitter Analytics CSV Data Tweet Text Time Impressions Engagements Engagement Rate Retweets Replies Likes User Profile Clicks Url Clicks I just published “ANN Benchmarks with Etienne Dilcoker -- Weaviate Podcast #16 on Medium.. May 27th, 1:34pm 1905 50 2.6% 3 1 15 2 18 Approximate Nearest Neighbor algorithms allow us to Vector Search in massive datasets! … May 24th, 1:13pm 7182 252 3.5% 14 1 50 27 36 Feature Engineering: Contains Emoji? Character Count? Word Count? Contains “Weaviate”?
  • 7. Key Takeaways: “Vector Search for Data Scientists” 1. Segmentation in Data Science 2. Vector Representations of Data 3. Vector Segmentation 4. Weaviate for Twitter Analytics 5. Research Questions and Discussion Slides, Colab Notebook, Video Presentation available on: github.com/CShorten/Vector-Search-for-Data-Scientists
  • 8. Key Takeaway #1 - Segmentation in Data Science
  • 10. Segmentation in Data Science ● What Time was the Tweet sent? ● Is there a URL Link in the Tweet? ● Symbolic vs. Vector Segmentation
  • 11. What Time was the Tweet sent?
  • 12. Is there a URL Link in the Tweet?
  • 13. Can we split Impressions based on the Semantics of the content? Weaviate Podcast Weaviate Tutorial AI Weekly Update
  • 14. How can we segment analytics based on the semantics of… ● Text ● Images ● Code ● Audio ● Video ● Graph-Structure ● Biological Sequences ● … !
  • 15. Summary of Takeaway #1 Segmentation in Data Science We visualize the Distribution of our data to get a sense of it. For example we see that Impressions are somewhat Normally Distributed. Is that also true for Tweets sent at 3 AM? What about Tweets related to Deep Learning for Robotics?
  • 16. Key Takeaway #2 - Vector Representations of Data
  • 17. Symbols compared to Vectors Symbols Category - [0, 1, 0, 0, 0, 0] Numeric - 52 Boolean - True [0.1, 0.8, 0.34, 0.8, … 0.2] Vectors
  • 18. Vector Representations of Data Photo by Shayna Douglas on Unsplash 0.83 0.35 .. 0.02 Photo by Bill Stephan on Unsplash 0.74 0.01 .. 0.95
  • 19. Are these puppies similar? Let’s ask Vector Distance! L2 Distance = ∑ || ai - bi ||2 L2 Distance (Puppy1, Puppy2) = (4-2)2 + (8-9)2 + (10-11)2 = 6 L2 Distance (Puppy1, Airplane) = (4-1)2 + (8-20)2 + (10-20)2 = 253 6 << 253, Puppy1 is thus much more semantically similar to Puppy2 than Airplane Vector Name Value 1 Value 2 Value 3 Puppy1 4 8 10 Puppy2 2 9 11 Airplane 1 20 20
  • 20. Capturing Semantics in Vector Representations
  • 21. How do Vectors represent real-world objects? 0.08 0.53 0.16 … 0.83 0.18 384 dimensional vector Does this represent how much of a “brand” this is? We aren’t sure! But there are research fields such as “Multimodal Neurons” from OpenAI, and the general field of Disentangled RepresentationLearning that are making great strides in understanding this.
  • 22. Can we compress these vectors? … 384 dimensional vector Sometimes! Ideas like Binary Passage Retrieval (shown above) - fp32 to Binary values Ideas like Product Quantization - 384-d vector mapped to 32-d
  • 23. Semantic Similarity with Vector Representations Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks Authored by Nils Reimers and Iryna Gurevych Published 2019
  • 24. Query Point Positive and Negative Pair Sampling
  • 27. Another strategy - Data2Vec, Baevski et al. 2022
  • 28. Do we need to train our own models? No! There are many pre-trained models that work very well for a broad range of data!
  • 29. Great place to get started: Sentence Transformers
  • 30. Summary of Takeaway #2 Vector Representations of Data Data such as Images, Text, Code, … can be represented as Vectors with Deep Learning models. These models are trained to maximize semantic similarity with massive collections of data. We often do not need to train the models ourselves for particular data domains to reach reasonable performance.
  • 31. Key Takeaway #3 - Vector Segmentation
  • 32. ● Text ● Images ● Code ● Audio ● Video ● Graph-Structure ● Biological Sequences ● … ! We can segment analytics based on the semantics of…
  • 33. Can we split Impressions based on the Semantics of the content? Weaviate Podcast Weaviate Tutorial AI Weekly Update
  • 35. House Hunting Symbols: # of bedrooms, # of bathrooms, square feet, city → With Vectors we can encode: ● Visual style ● Neighborhood structure ● Moreflexibleinterfacetodefinefeatureswithtext
  • 36. e-Commerce Products Symbols: “Shoes”, “T-Shirt”, “Pants” or colors → With Vectors we can encode visual styles
  • 37. Movies Symbols can differentiate between genres like “Children”, “Action”, or “Sci-Fi” → With Vectors we can encode: ● Themes ● Characters ● Storylines
  • 38. Scientific Papers Symbols: “Biology”, “Machine Learning” → With Vectors we can encode ● Nuance of the ideas ● Writing style
  • 39. Music Symbols can differentiate between genres like “Hip Hop”, “Dance” → With Vectors we can encode: ● Tone ● Lyrics ● Instruments
  • 40. “That’s the magic of deep learning: turning meaning into vectors, then into geometric spaces, and then incrementally learning complex geometric transformations that map one space to another. All you need are spaces of sufficiently high dimensionality in order to capture the full scope of the relationships found in the original data.” - Francois Chollet, Deep Learning with Python, 2nd edition
  • 41. Summary of Takeaway #3 Vector Segmentation Vector representations, also known as embeddings, enable an Interfaceto split analytics based on the Semanticsof the content. This content could be Text, Images, Code, Audio, Videos, …
  • 42. Key Takeaway #4 - Weaviate for Twitter Analytics
  • 43. Twitter Analytics Tweet Text Time Impressions Engagements Engagement Rate Retweets Replies Likes User Profile Clicks Url Clicks I just published “ANN Benchmarks with Etienne Dilcoker -- Weaviate Podcast #16 on Medium.. May 27th, 1:34pm 1905 50 2.6% 3 1 15 2 18 Approximate Nearest Neighbor algorithms allow us to Vector Search in massive datasets! … May 24th, 1:13pm 7182 252 3.5% 14 1 50 27 36
  • 44.
  • 45. Cloud Data Upload There are many other ways to do this as well Google Colab Weaviate Cloud Services
  • 47. 5 Nearest Neighbors to → “Weaviate Coding Tutorial” Content Impressions “We have 4 Weaviate Podcast Episodes so far [ … ] how to utilize the Weaviate Database as a Document Store in Haystack pipelines … ” 311 “We have 2 new coding tutorials on Weaviate YouTube…” 1144 “@weaviate_io Love the integration of this with the GraphQL API!” 378 “Here are some thoughts on combining Weaviate and Haystack! TLDR: Weavaite is a great Vector Search database…” 15563 “Weaviate (@weaviate_io) is also announcing a collaboration with Jina AI (@JinaAI_)! …” 586
  • 48.
  • 49. What was the Tweet about?
  • 50. Have I tweeted something like this before?
  • 51. Have any Weaviate Podcast guests tweeted something like this recently?
  • 52.
  • 56. Wikipedia Live Demo - Graph Data Model
  • 58. ● Weaviate is a Vector Search Database, rather than a Library such as Facebook’s FAISS or ANNOY from Spotify ● Weaviate has a Graph-like Data Model
  • 59. Expanding Twitter project with Graph Model
  • 60.
  • 61. Summary of Takeaway #4 Weaviate for Twitter Analytics We can segment Impressions on Twitterbased on the content of the tweet without manual labeling! Weaviateis a Vector Search Databasethat can be used to store and search through semantic embeddings of data.
  • 62. Key Takeaway #5 - Research Questions and Discussion
  • 63. Research Questions and Discussion ● Should I fine-tune my embedding model? ● Large-Scale Vector Search with Approximate Nearest Neighbor (ANN) Algorithms ● How does Vector Search differ from Classification or Regression models?
  • 64. Vector Search versus Regression on Impressions 8,530 Impressions Model Prediction
  • 65. Interpretability of Vector Search Nearest Neighbors
  • 66. Interpretability of Vector Search and Prediction 8,530 Impressions Model Prediction
  • 67. What do we want to know about our Tweets? Should I post this? When might be a better time to post it? What might be a better phrasing of this tweet?
  • 68. Expanding from individuals to teams ● Has anyone on my team tweeted something like this recently? ● Who on our team would be best fit to tell this story? ● What topics should we be tweeting about?
  • 69. Summary of Takeaway #5 Research Questions and Discussion How can we improve these systems? What looks promising?
  • 70. Key Takeaways: “Vector Search for Data Scientists” 1. Segmentation in Data Science 2. Vector Representations of Unstructured Data 3. Vector Segmentation 4. Weaviate Example for Twitter Analytics 5. Research Questions and Discussion Slides, Colab Notebook, Video Presentation available on: github.com/CShorten/Vector-Search-for-Data-Scientists
  • 71. Connect with us! Weaviate Slack Channel YouTube: Weaviate • Vector Search Engine Weaviate Podcast Twitter @weaviate_io
  • 72. Thank you for Watching! Special thanks to Sebastian Witalec in advising the development of this presentation and Svitlana Smolianova for visual styling.