T W I N D E R
INSIGHT DATA ENGINEERING
LORYFEL NUNEZ
THE MOTIVATION. THE DATA, THE PROBLEM. THE APPROACH
ETL, TEXT PROCESSING ON A DISTRIBUTED SYSTEM
User is a Twitter user.
User_Representation is the combination of a user's tweets and her profile description, represented as a Bag-of-NGrams.
Match is the maximum intersection between UserA's and UserB's User_Representations.
User_Query is a query to the user_vocab table, where each row holds a user and a vocab list (the User_Representation).
Word_Query is a query to the vocab_user table, where each row holds a word and the list of users who use it (an inverted index).
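The Match definition above can be sketched in a few lines of plain Python. This is only an illustration: the `user_vocab` contents and the `best_match` helper are hypothetical, and the slides do not show how ties or empty overlaps were handled.

```python
# Hypothetical user_vocab rows: user -> set of n-grams (the User_Representation).
user_vocab = {
    "@abc": {"dogs", "yoga", "nyc"},
    "@doglover": {"dogs", "puppies"},
    "@yogalover": {"yoga", "nyc", "meditation"},
}

def best_match(user, table):
    """Return (other_user, shared_terms) with the largest overlap, or None."""
    mine = table[user]
    best = None
    for other, theirs in table.items():
        if other == user:
            continue
        shared = mine & theirs  # set intersection of the two representations
        if best is None or len(shared) > len(best[1]):
            best = (other, shared)
    return best

match = best_match("@abc", user_vocab)
print(match)
```

Scanning every other user is linear in the number of users per query; the inverted index (Word_Query) is presumably what avoids that scan at scale.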
PIPELINE
TWITTER API
CHALLENGES
FULL TWITTER DATA JSON
-> USER_HANDLE, USER_DESC, TWEET_TEXT, TWEET_ID, TIME
-> convert user_desc and text to TOPICS: USER_ID, (TOPIC1,…,TOPIC10)
-> groupByKey: TOPIC1 (USER_ID1, USER_ID2…USER_IDN)
Inverted Index for search
Example:
@abc | loves dogs, yoga | Did you see the Bieber movie | 1234 | 834234123
@abc (dogs, yoga, NYC)
dogs (@abc, @doglover, @def)
yoga (@abc, @yogalover, @xyz)
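The last step of this flow, turning per-user topic lists into an inverted index, can be sketched without Spark. The rows and the `build_inverted_index` name below are made up for illustration; the deck itself does this with `groupByKey` on an RDD.

```python
from collections import defaultdict

# Hypothetical (user, topics) rows, as the topic-extraction step might emit them.
user_topics = [
    ("@abc", ["dogs", "yoga", "nyc"]),
    ("@doglover", ["dogs"]),
    ("@yogalover", ["yoga"]),
]

def build_inverted_index(rows):
    """Flip (user, [topics]) rows into topic -> sorted list of users."""
    index = defaultdict(set)
    for user, topics in rows:
        for topic in topics:
            index[topic].add(user)
    return {topic: sorted(users) for topic, users in index.items()}

vocab_user = build_inverted_index(user_topics)
print(vocab_user["dogs"])
```

A lookup such as `vocab_user["dogs"]` then plays the role of a Word_Query against the vocab_user table.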
FULL TWITTER DATA JSON
USER_HANDLE, USER_DESC, TWEET_TEXT, TWEET_ID, TIME
USER_ID, (TOPIC1,…,TOPIC10)
TOPIC1 (USER_ID1, USER_ID2…USER_IDN)
Shared data structures that are modifiable
Functions that I can extend -> lambda is my friend
Passing data to workers (serialization and closure)
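The slide names these challenges without elaborating. One common reading is that Spark serializes whatever a task captures and ships a copy to each worker, so changes made there never reach the driver. The sketch below only simulates that in plain Python: `copy.deepcopy` stands in for serialization, and `run_on_worker` and `count_and_tag` are hypothetical names.

```python
import copy

def count_and_tag(seen, word):
    """Task body: mutates the dict it was handed and returns a (word, 1) pair."""
    seen[word] = seen.get(word, 0) + 1
    return (word, 1)

def run_on_worker(task, captured, partition):
    # deepcopy stands in for serializing the captured data and shipping it out.
    local = copy.deepcopy(captured)
    return [task(local, item) for item in partition]

driver_counts = {}
partitions = [["dogs", "yoga"], ["dogs", "nyc"]]

pairs = []
for part in partitions:
    pairs.extend(run_on_worker(count_and_tag, driver_counts, part))

# Each worker changed only its own copy, so the driver's dict is still empty.
# Counts come from merging the values the tasks returned.
merged = {}
for word, n in pairs:
    merged[word] = merged.get(word, 0) + n

print(driver_counts, merged)
```

In Spark itself, the usual answers to this are returning values through transformations such as `reduceByKey`, or using broadcast variables and accumulators. Whether the author used either is not stated in the slides.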
UPDATES
PERFORMANCE (EXPERIMENTS AND TUNING)
PROCESS                               File Size  Tweets  Users    Vocab   Time   Cores
analyze                               15MB       5,600   4,700    10,171  14s    3
analyze with POS                      15MB       5,600   4,700    10,171  9min   32
analyze with POS                      15MB       5,600   4,700    10,171  8min   3
analyze                               15GB       4.9M    950,000  1.3M    21min  3
analyze with update (1d)              15GB       4.9M    1.3M     1.5M    30min  36
analyze with update (>1d) with Hive   15GB       4.9M    1.5M     1.8M    26min  36/10/3
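To compare the runs above, it can help to convert each row to tweets per second. The sketch below does that arithmetic for three rows of the table. The 4.9M figure is already rounded on the slide, so the result for that row is approximate, and `tweets_per_second` is just an illustrative helper.

```python
def tweets_per_second(tweets, seconds):
    """Rough throughput for one run: tweets processed per wall-clock second."""
    return tweets / seconds

# (label, tweets, seconds) taken from the performance table.
runs = [
    ("analyze, 15MB, 3 cores", 5_600, 14),
    ("analyze with POS, 15MB, 3 cores", 5_600, 8 * 60),
    ("analyze, 15GB, 3 cores", 4_900_000, 21 * 60),
]

rates = {label: tweets_per_second(tweets, secs) for label, tweets, secs in runs}
for label, rate in rates.items():
    print(f"{label}: {rate:,.1f} tweets/s")
```

On these numbers, POS tagging dominates the cost of the small run. The larger plain run achieves a much higher rate than the small one, which would be consistent with fixed job start-up overhead, although the slides do not say so.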
ABOUT ME
NEXT STEPS
OPTIMIZATIONS
QUALITY OF MATCHES
Combine the models and write the output to HDFS
What happens when the vocabulary increases to x?
What happens when you do a batch run for 1 week (105 GB at a time)?
REAL-TIME QUERIES
Search by topics: Solr on Cassandra
TESTING
Support for NLP techniques: faster processing for algorithms and data lookups
DATA
▸ VOLUME: Historical Twitter data for testing, daily Twitter dumps
▸ VARIETY and VERACITY: Text preprocessing, metadata extraction
▸ VELOCITY
▸ FOCUS: Fast computation; data structures for fast reads/updates; long-term storage; data collection
SAMPLES
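# Assumes sc, df_pared, file_name, revised_file_name, tokenize, get_topics and
# convert_line are defined elsewhere (they are not shown on the slide).
# Tuple-unpacking lambdas are Python 2 only, so rows are indexed instead.
# On Spark 2+, call df_pared.rdd.map(...) if df_pared is a DataFrame.

# Row layout: (userid, id_str, text, uname, udesc, time).
# Tokenize each tweet together with the user description, keep (userid, tokens),
# gather every token list per user, then flatten into the pairs get_topics emits.
rdd_tuples = (df_pared
              .map(lambda r: (r[0], r[1], tokenize(r[2], r[4]), r[3], r[4], r[5]))
              .map(lambda r: (r[0], r[2]))
              .groupByKey().mapValues(list)
              .flatMap(lambda kv: get_topics(kv[0], kv[1])))

# Concatenate the values for each key into one '|'-delimited string.
rdd_topics = rdd_tuples.reduceByKey(lambda a, b: a + '|' + b)

# Swap each pair so the former value becomes the key.
rdd_users = rdd_tuples.map(lambda x: x[::-1])

# Convert the raw file line by line and write it back as a single part file.
# saveAsTextFile returns None, so its result is not assigned.
data_raw = sc.textFile(file_name)
(data_raw.map(convert_line)
         .coalesce(1, shuffle=True)
         .saveAsTextFile(revised_file_name))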
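To see what the `reduceByKey` and `x[::-1]` steps do to the data, here is a small stand-in that runs without Spark. It assumes `get_topics` emits (topic, user) pairs, which the slide does not confirm. `reduce_by_key` is a hypothetical helper, not a Spark API.

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func):
    """Pure-Python stand-in for RDD.reduceByKey on a small list of pairs."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}

# Hypothetical output of the topic step: (topic, user) pairs.
pairs = [("dogs", "@abc"), ("yoga", "@abc"), ("dogs", "@doglover")]

# Same combiner as on the slide: join values with '|'.
topics = reduce_by_key(pairs, lambda a, b: a + "|" + b)

# Reversing each pair turns (topic, user) into (user, topic).
swapped = [p[::-1] for p in pairs]
users = reduce_by_key(swapped, lambda a, b: a + "|" + b)

print(topics)
print(users)
```

On a real cluster the order in which values are concatenated is not guaranteed, so the delimited strings should be treated as unordered lists.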
NLP
▸ Bag-of-words model
▸ Experimented with ways to clean data (stemming, POS tagging)
▸ scikit-learn: CountVectorizer, 2-grams
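As a rough picture of what a bag of 1-grams and 2-grams looks like, the sketch below counts them using only the standard library. It is not the author's code: the regex tokenizer and the `bag_of_ngrams` name are assumptions. scikit-learn's `CountVectorizer(ngram_range=(1, 2))` would produce similar counts, though its default tokenization differs.

```python
import re
from collections import Counter

def bag_of_ngrams(text, max_n=2):
    """Count every 1-gram up to max_n-gram in a lowercased, regex-tokenized text."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    bag = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            bag[" ".join(tokens[i:i + n])] += 1
    return bag

bag = bag_of_ngrams("loves dogs, yoga. Loves dogs!")
print(bag.most_common(3))
```

Punctuation is ignored when forming 2-grams here, so "yoga loves" is counted even though it spans a sentence boundary. A stricter tokenizer would split on sentences first.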
NLP
