October 13-16, 2015 • Austin, TX
TweetMogaz: The Arabic Tweets Platform
Ahmed Adel
Team Lead, BADR
Who Am I?
• B.Sc. in Engineering from Alexandria University
• BADR Co-Founder
• Now: Part-time Team Lead @ BADR
• 8+ years of experience in software development
• Mainly Java, JavaScript
• Solr, Hadoop, Hive, ...
BADR
• Established software house in Egypt
• Founded in 2006
• Provides Big Data consulting services and solutions
• Machine Learning, NLP, Data Science, ...
• Hadoop, Solr, Spark, Hive, Flume, Incorta, ...
Agenda
• What is TweetMogaz
• System Modules
• Tweets processing
• Indexing
• Event detection
• Archivers
• …
• System Architecture
• Tricks and Challenges
• What’s Next
What Is TweetMogaz?
• Innovation and applied research project @ BADR
• Portal for browsing, filtering and searching Arabic Tweets
• ... and events detection
• Based on several research papers
• Magdy W., A. Ali, and K. Darwish. A Summarization Tool for Time-Sensitive Social Media. CIKM 2012
• Magdy W. TweetMogaz: A News Portal of Tweets. SIGIR 2013
• Elsawy E., M. Mokhtar, and W. Magdy. TweetMogaz v2: Identifying News Stories in Social Media. CIKM 2014
• Magdy W. and T. Elsayed. Adaptive Method for Following Dynamic Topics on Twitter. ICWSM 2014
Why Arabic
• 230 million speakers
• 6th largest language in the world (native + 2nd-language speakers)
• One of the 6 official UN languages
[Chart: speakers in millions (0 to 1,200), native and 2nd-language, for Mandarin Chinese, English, Hindi, Spanish, Russian, Arabic, German, Bengali, Portuguese, Japanese]
Main Features
• Classifying
• Browsing
• Searching
• Event Detection
• Time machine
System Modules
• Tweets processing module
• Indexing module
• Event detection module
• Events
• Active Hashtags
• WordCloud generator
• Archivers
• Short-term
• Long-term
• Analytics
Tweets Processing Module
• Retrieves tweets (streams and search queries)
• Filters out inappropriate tweets
• Text pre-processing
• Normalization
• ى ، ي
• آ ، إ ، ا ، أ
• ة ، ه
• Kashida: ـ ، ْ
• Removing stop-words
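The pre-processing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides list the letter groups but not the folding direction, so folding to ا, ي and ه is assumed here, the diacritic range is assumed to cover more than the sukun shown on the slide, and the stop-word list is a tiny hypothetical sample rather than the one used by TweetMogaz.

```python
import re
from typing import List

# Assumption: alef variants (آ أ إ) fold to bare alef (ا).
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")
# Kashida (tatweel, U+0640) plus the short-vowel marks U+064B..U+0652 (sukun is U+0652).
KASHIDA_AND_DIACRITICS = re.compile("[\u0640\u064B-\u0652]")


def normalize(text: str) -> str:
    """Fold the letter groups listed on the slide into one form each."""
    text = KASHIDA_AND_DIACRITICS.sub("", text)
    text = ALEF_VARIANTS.sub("\u0627", text)
    text = text.replace("\u0649", "\u064A")  # ى -> ي (assumed direction)
    text = text.replace("\u0629", "\u0647")  # ة -> ه (assumed direction)
    return text


# Hypothetical sample list; stored in normalized form so lookups match.
STOP_WORDS = {normalize(w) for w in ["في", "من", "على", "إلى"]}


def preprocess(text: str) -> List[str]:
    """Normalize, split on whitespace, and drop stop-words."""
    return [tok for tok in normalize(text).split() if tok not in STOP_WORDS]
```

Normalizing the stop-word list with the same function keeps the two steps consistent, whichever folding direction is chosen.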
• Classification at indexing time
• Multiple classes map to a multi-valued field (politics, sport, religion, etc.)
• Boolean classifier
• Adaptive classifier (Naïve Bayes/SVM, experimental)
• Scoring at indexing time
• Recent (date): latest tweets in a specific category
• Top (score field): trending tweets (high retweet rate in the past 48 hours)
Score
[Chart: score (0 to 0.018) plotted against tweet age in seconds (0 to 39k)]
Indexing Module
• Responsible for indexing tweets to the corresponding Solr cores
• Realtime core (< 10 mins)
• Up-to-48-hours cores
• Media: photos, videos
• Text only and text that contains links
• All tweets
• Short-term archive cores (> 48 hours and < 30 days)
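One way to read the core layout above is as a mapping from tweet age to a storage tier. The sketch below is an assumption-laden illustration: the tier names are hypothetical, the 30-day boundary for the long-term archive is taken from the archiving slide, and any overlap between the realtime core and the 48-hour cores is not modelled.

```python
from datetime import timedelta

# Boundaries taken from the slides; tier names are hypothetical labels.
REALTIME_MAX = timedelta(minutes=10)
RECENT_MAX = timedelta(hours=48)
SHORT_TERM_MAX = timedelta(days=30)


def tier_for_age(age: timedelta) -> str:
    """Return the tier a tweet of the given age would live in."""
    if age < REALTIME_MAX:
        return "realtime"
    if age <= RECENT_MAX:
        return "recent_48h"
    if age < SHORT_TERM_MAX:
        return "short_term_archive"
    return "long_term_archive"
```

For example, `tier_for_age(timedelta(days=5))` returns `"short_term_archive"`.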
Event Detection Module
• Responsible for detecting events
• Elsawy E., M. Mokhtar, and W. Magdy. TweetMogaz v2: Identifying News Stories in Social Media. CIKM 2014
• Feature-pivot (term) approach
• Clusters are created based on a distance threshold (fuzzy clusters)
• Distance threshold 0.4 (experimental)
• Within an 8-hour window
• Faceting on the processed text, using min_count
• Builds facets for stems
• For each facet, calculate the distance to all other facets: O(n²)
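The pairwise step above can be sketched as follows. The slides do not say which distance measure is used, so this sketch assumes Jaccard distance between the sets of tweet IDs in which each stem occurs; the function names and the input shape are hypothetical. A stem may end up in more than one cluster, which is one reading of "fuzzy clusters".

```python
from itertools import combinations
from typing import Dict, List, Set


def jaccard_distance(a: Set[int], b: Set[int]) -> float:
    """1 - |a ∩ b| / |a ∪ b|; assumed measure, not confirmed by the slides."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


def fuzzy_clusters(postings: Dict[str, Set[int]],
                   threshold: float = 0.4) -> List[Set[str]]:
    """Group each stem with every stem closer than the threshold."""
    neighbours = {stem: {stem} for stem in postings}
    # Every pair is compared once, hence the O(n^2) cost noted above.
    for s1, s2 in combinations(postings, 2):
        if jaccard_distance(postings[s1], postings[s2]) < threshold:
            neighbours[s1].add(s2)
            neighbours[s2].add(s1)
    clusters: List[Set[str]] = []
    seen = set()
    for stem in postings:
        members = frozenset(neighbours[stem])
        # Skip singletons and clusters already emitted from another stem.
        if len(members) > 1 and members not in seen:
            seen.add(members)
            clusters.append(set(members))
    return clusters
```

Restricting the input to stems that pass min_count keeps n, and so the quadratic comparison, small.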
• Cluster enrichment
• Enhancing clusters with fewer than 6 terms
• Running a Solr AND query with all keywords and selecting the terms with the highest TF-IDF to enrich the cluster
• Cluster de-duplication over time
• Search using the keywords of each detected cluster
• For each response result, build a stem frequency vector
• Compare the two vectors for similarity (cosine = 0.5, experimental)
• Old clusters are updated to maintain the chronological order of events
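The comparison step above can be illustrated with a small cosine-similarity sketch over stem frequency vectors. The function names are hypothetical, and treating 0.5 as an inclusive lower bound is an assumption; the slides only give the value.

```python
import math
from collections import Counter
from typing import Iterable


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(a[stem] * b[stem] for stem in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_duplicate(new_stems: Iterable[str], old_stems: Iterable[str],
                 threshold: float = 0.5) -> bool:
    """Treat two clusters as the same event when similarity reaches the threshold."""
    return cosine_similarity(Counter(new_stems), Counter(old_stems)) >= threshold
```

Two stem lists with no stems in common score 0.0 and are kept as separate events.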
• Relevant tweets retrieval
• Query against 48 hours cores
• Active hashtag detection
• Separate field added at index time
• Stored in events core with type hashtag
• Build normalized top hashtag facets every 24 hours for the past week
• Query Solr for hashtags older than 1 week and eliminate them
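The clean-up step in the last bullet could be expressed as a Solr range query using date math. This is a sketch only: the field names `type` and `last_seen` are hypothetical (the slides only say that hashtags are stored in the events core with type hashtag), and the resulting string would be passed to whatever query or delete-by-query call the deployment uses.

```python
def expired_hashtags_query(days: int = 7,
                           type_field: str = "type",
                           date_field: str = "last_seen") -> str:
    """Build a Solr query matching hashtag documents older than `days` days."""
    # NOW-7DAYS is Solr date math; [* TO x] is an open-ended range.
    return f"{type_field}:hashtag AND {date_field}:[* TO NOW-{days}DAYS]"
```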
WordCloud
Word Cloud: Bi-gram detection
• Facet for specific class
• Facets adjacent to each other, within a specific threshold, tend to form a bi-gram
• For example: ريال مدريد (Real Madrid), كأس العالم (World Cup)
• min_count applies
Archiving Module
• Why?
• Space is finite!
• Faster searching of recent cores

• Short-term archiving
• Archive tweets that are older than 48 hours
• Same Solr instance

• Long-term archiving
• Archive tweets that are older than 30 days
• Separate Solr instance
System Architecture
• SolrCloud
• 2 Shards
• Replication factor of 2
• ZooKeeper ensemble for distribution management
• SolrJ API

• Front-end
• Node.js
• AngularJS (Web and mobile web)

• Long-term archive
• Separate Solr Instance
Analytics and Visualization
• Banana dashboards
• Deployed on both realtime and archive
• Insights into the distribution of tweets per class and trends over time for specific search queries
• Realtime on production with the ‘Auto-refresh’ feature
• Users with the highest retweets

Challenges and Tricks
Tricks
• Archiving
• Initially developed on Solr 4.4
• Upgraded to 4.7+ for deep paging
• Archiver syncing
• The short-term archiver is writing while the long-term archiver is reading
• They have to be synchronized when deep paging is in use
[Diagram: short-term cores and long-term cores, with the short-term archiver writing (W) and the long-term archiver reading (R)]
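Deep paging in Solr 4.7+ is done with the `cursorMark` parameter. The loop below is a minimal sketch of how an archiver might walk a large result set with it; `fetch_page` is a hypothetical callable standing in for the actual HTTP or SolrJ request, and `id` is assumed to be the unique key field (a cursor requires the sort to include the unique key).

```python
from typing import Callable, Dict, Iterator


def iter_with_cursor(fetch_page: Callable[[Dict[str, str]], dict],
                     query: str,
                     sort: str = "id asc",
                     rows: int = 500) -> Iterator[dict]:
    """Yield every matching document by following Solr cursor marks."""
    cursor = "*"  # "*" asks Solr to start a new cursor
    while True:
        params = {"q": query, "sort": sort, "rows": str(rows),
                  "cursorMark": cursor}
        response = fetch_page(params)
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        # Solr returns the same mark again once the result set is exhausted.
        if next_cursor == cursor:
            break
        cursor = next_cursor
```

Because the writer can change the short-term cores while this loop is running, the reader and writer need some coordination, which is the syncing concern raised above.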
Challenges
• Twitter (micro-blog) text is very short
• Arabic has many dialects: colloquial, formal, regional variations
Next Steps
• Integrating an adaptive classifier that can handle the characteristics of micro-blogs
• Search query trend over time
• Engage system users
• Integrate R for statistical processing (classification, detection, …)
Thank you!
Ahmed Adel
email: me@aadel.io
twitter: @ahmadadel
website: badrit.com
