T W I N D E R
INSIGHT DATA ENGINEERING
LORYFEL NUNEZ
THE MOTIVATION, THE DATA, THE PROBLEM, THE APPROACH
ETL, TEXT PROCESSING ON A DISTRIBUTED SYSTEM
User: a Twitter user.
User_Representation: a combination of a user's tweets and her description, represented as a bag of n-grams.
Match: the maximum intersection of UserA's and UserB's User_Representations.
User_Query: a query to the user_vocab table, where each row holds a user and a vocab list (that user's User_Representation).
Word_Query: a query to the vocab_user table, where each row holds a word and a list of users (an inverted index).
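The definitions above can be sketched in plain Python. This is a minimal illustration, not the deck's implementation: the helper names (`ngrams`, `user_representation`, `best_match`), the tokenizing regex, and the toy users other than @abc are assumptions, and the representation is simplified to a set of 1-grams and 2-grams (counts are dropped) so that Match becomes a set intersection.

```python
import re


def ngrams(text, n_max=2):
    """Return the set of word n-grams (n = 1..n_max) found in text, lowercased."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    grams = set()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[i:i + n]))
    return grams


def user_representation(description, tweets, n_max=2):
    """Combine a user's description and tweets into one set of n-grams."""
    grams = ngrams(description, n_max)
    for tweet in tweets:
        grams |= ngrams(tweet, n_max)
    return grams


def best_match(user, reps):
    """Return (other_user, shared_ngrams) with the largest overlap, or None."""
    candidates = [(other, reps[user] & rep)
                  for other, rep in reps.items() if other != user]
    if not candidates:
        return None
    return max(candidates, key=lambda c: len(c[1]))


# Hypothetical toy data, loosely modeled on the @abc example in the slides.
reps = {
    "@abc": user_representation("loves dogs, yoga", ["Did you see the movie"]),
    "@doglover": user_representation("dogs all day", ["walking my dogs"]),
    "@yogalover": user_representation("yoga and dogs", ["loves dogs"]),
}

match_user, shared = best_match("@abc", reps)
print(match_user, sorted(shared))
# @yogalover ['dogs', 'loves', 'loves dogs', 'yoga']
```

Comparing one user against every other user is quadratic in the number of users, which is one reason the inverted index (Word_Query) matters at scale.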
PIPELINE
TWITTER API
CHALLENGES
FULL TWITTER DATA JSON
USER_HANDLE, USER_DESC, TWEET_TEXT, TWEET_ID, TIME
    @abc, "loves dogs, yoga", "Did you see the Bieber movie", 1234, 834234123
convert user_desc and text to TOPICS
USER_ID, (TOPIC1,…,TOPIC10)
    @abc (dogs, yoga, NYC)
groupByKey
Inverted Index for search
TOPIC1 (USER_ID1, USER_ID2…USER_IDN)
    dogs (@abc, @doglover, @def)
    yoga (@abc, @yogalover, @xyz)
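The last step of this flow, turning per-user topic lists into an inverted index, might look like the following in plain Python. It is a sketch under assumptions: `build_inverted_index` is a hypothetical helper, and the topic lists for every user except @abc are invented so that the output reproduces the example rows above.

```python
from collections import defaultdict


def build_inverted_index(user_topics):
    """Map each topic to the sorted list of users whose topic list contains it."""
    index = defaultdict(set)
    for user, topics in user_topics.items():
        for topic in topics:
            index[topic].add(user)
    return {topic: sorted(users) for topic, users in index.items()}


# user_vocab: one row per user (only the @abc row comes from the slides).
user_vocab = {
    "@abc": ["dogs", "yoga", "NYC"],
    "@doglover": ["dogs"],
    "@def": ["dogs"],
    "@yogalover": ["yoga"],
    "@xyz": ["yoga"],
}

# vocab_user: one row per word, as queried by Word_Query.
vocab_user = build_inverted_index(user_vocab)
print(vocab_user["dogs"])  # ['@abc', '@def', '@doglover']
print(vocab_user["yoga"])  # ['@abc', '@xyz', '@yogalover']
```

A Word_Query is then a single dictionary lookup per word, instead of a scan over every user's vocab list.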
Shared data structures that are modifiable
Functions that I can extend -> lambda is my friend
Passing data to workers (serialization and closure)
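Why modifiable shared structures are a challenge can be illustrated without a cluster. The sketch below only simulates the idea: `run_on_worker` and `count_topic` are hypothetical names, and `pickle` stands in for whatever serializer the cluster actually uses. The point it shows is that data sent to a worker arrives as a copy, so changes made there never reach the driver's original.

```python
import pickle


def run_on_worker(func, payload, record):
    """Simulate shipping data to a worker.

    The payload is serialized and deserialized first, so func operates on a
    copy, much as a function sent to a remote worker would.
    """
    worker_copy = pickle.loads(pickle.dumps(payload))
    return func(worker_copy, record), worker_copy


def count_topic(seen, topic):
    """Increment the counter for topic in the (worker-side) dictionary."""
    seen[topic] = seen.get(topic, 0) + 1
    return seen[topic]


driver_counts = {"dogs": 1}
result, worker_counts = run_on_worker(count_topic, driver_counts, "dogs")

print(result)          # 2
print(worker_counts)   # {'dogs': 2}
print(driver_counts)   # {'dogs': 1}  (the driver's dictionary is unchanged)
```

This is why the pipeline is easier to reason about when each step returns new values, as map and reduceByKey do, than when workers try to update a shared structure in place.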
UPDATES
PERFORMANCE (EXPERIMENTS AND TUNING)
PROCESS                               File Size  Tweets  Users    Vocab   Time   Cores
analyze                               15MB       5,600   4,700    10,171  14s    3
analyze with POS                      15MB       5,600   4,700    10,171  9min   32
analyze with POS                      15MB       5,600   4,700    10,171  8min   3
analyze                               15GB       4.9M    950,000  1.3M    21min  3
analyze with update (1d)              15GB       4.9M    1.3M     1.5M    30min  36
analyze with update (>1d) with Hive   15GB       4.9M    1.5M     1.8M    26min  36/10/3
ABOUT ME
NEXT STEPS
OPTIMIZATIONS
QUALITY OF MATCHES
Combine the models and write the output to HDFS
What happens when the vocabulary increases to x?
What happens when you do a batch run for 1 week (105 GB at a time)?
REAL-TIME QUERIES
Search by topics: Solr on Cassandra
TESTING
Support for NLP techniques: faster processing for algorithms and data lookups
DATA
▸ VOLUME: Historical Twitter data for testing, daily Twitter dumps
▸ VARIETY and VERACITY: Text preprocessing, metadata extraction
▸ VELOCITY
▸ FOCUS: Fast computation, data structures for fast reads/updates, long-term storage, data collection
SAMPLES
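# Tokenize each tweet together with the user description, group the tokens
# per user, then expand each user into pairs via get_topics.
# Rows are (userid, id_str, text, uname, udesc, time). They are indexed here
# because Python 3 no longer allows tuple unpacking in a lambda signature.
# If df_pared is a DataFrame on Spark 2.0 or later, use df_pared.rdd.map.
rdd_tuples = (df_pared
              .map(lambda row: (row[0], tokenize(row[2], row[4])))
              .groupByKey().mapValues(list)
              .flatMap(lambda kv: get_topics(kv[0], kv[1])))

# Join all values that share a key into one '|'-separated string.
rdd_topics = rdd_tuples.reduceByKey(lambda a, b: a + '|' + b)

# Reverse each tuple so the last element becomes the key.
rdd_users = rdd_tuples.map(lambda x: x[::-1])

# Convert the raw file line by line and save it as a single part file.
data_raw = sc.textFile(file_name)
data_final = data_raw.map(convert_line)
data_final.coalesce(1, shuffle=True).saveAsTextFile(revised_file_name)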
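To see what the reduceByKey and reversal steps produce, here is a plain-Python stand-in that runs without Spark. It assumes, as the '|' concatenation suggests, that get_topics emits (userid, topic) string pairs; `reduce_by_key` and the toy pairs are hypothetical and only mimic the semantics on a single machine.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, func):
    """Plain-Python stand-in for reduceByKey: fold the values of each key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


# Assumed shape of rdd_tuples: (userid, topic) pairs.
pairs = [("u1", "dogs"), ("u1", "yoga"), ("u2", "dogs")]

# Equivalent of rdd_topics: one '|'-separated topic string per user.
topics_by_user = reduce_by_key(pairs, lambda a, b: a + "|" + b)
print(topics_by_user)  # {'u1': 'dogs|yoga', 'u2': 'dogs'}

# Equivalent of rdd_users: the same pairs with key and value swapped.
reversed_pairs = [pair[::-1] for pair in pairs]
print(reversed_pairs)  # [('dogs', 'u1'), ('yoga', 'u1'), ('dogs', 'u2')]
```

On a cluster the order in which values are combined is not guaranteed, so the order of topics inside each joined string can differ from this single-machine result.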
NLP
▸ Bag of words model
▸ Experimented with ways to clean data (stemming, POS tagging)
▸ scikit-learn: CountVectorizer, 2-grams
NLP
