LoryfelNunezInsight

TWINDER
INSIGHT DATA ENGINEERING
LORYFEL NUNEZ

THE MOTIVATION, THE DATA, THE PROBLEM, THE APPROACH
ETL, TEXT PROCESSING ON A DISTRIBUTED SYSTEM
User is a Twitter user.
User_Representation is the combination of a user's tweets and her profile description, represented as a Bag-of-NGrams.
Match is the maximum intersection between UserA's and UserB's User_Representations.
User_Query is a query to the user_vocab table, where each row holds a user and a vocab list (the User_Representation).
Word_Query is a query to the vocab_user table, where each row holds a word and a list of users (an inverted index).
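
The definitions above can be sketched in plain Python. This is a minimal, single-machine illustration, not the deck's Spark implementation: the function names `build_inverted_index` and `best_match` and the sample vocab sets are assumptions made up for the example, and "maximum intersection" is read here as "the other user sharing the most n-grams".

```python
from collections import defaultdict

def build_inverted_index(user_vocab):
    """Turn {user: set of n-grams} into {n-gram: set of users}."""
    vocab_user = defaultdict(set)
    for user, vocab in user_vocab.items():
        for term in vocab:
            vocab_user[term].add(user)
    return vocab_user

def best_match(user, user_vocab, vocab_user):
    """Return (other_user, shared_terms) with the largest overlap, or None."""
    overlap = defaultdict(set)
    for term in user_vocab[user]:          # User_Query: the user's vocab list
        for other in vocab_user[term]:     # Word_Query: users sharing the term
            if other != user:
                overlap[other].add(term)
    if not overlap:
        return None
    # On equal overlap sizes the lexicographically largest handle wins,
    # which only serves to make the result deterministic.
    other = max(overlap, key=lambda u: (len(overlap[u]), u))
    return other, overlap[other]

# Hypothetical sample data (handles borrowed from the slides, vocab invented).
user_vocab = {
    "@abc": {"dogs", "yoga", "nyc"},
    "@doglover": {"dogs", "puppies"},
    "@yogalover": {"yoga", "nyc", "tea"},
}
vocab_user = build_inverted_index(user_vocab)
match = best_match("@abc", user_vocab, vocab_user)
```

Going through the inverted index means only users who share at least one term with the query user are ever compared, rather than every pair of users.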

CHALLENGES
Pipeline:
FULL TWITTER DATA JSON
  -> USER_HANDLE, USER_DESC, TWEET_TEXT, TWEET_ID, TIME
  -> (convert user_desc and text to TOPICS) USER_ID, (TOPIC1,…,TOPIC10)
  -> (groupByKey) TOPIC1 (USER_ID1, USER_ID2…USER_IDN): Inverted Index for search
Example:
@abc | loves dogs, yoga | Did you see the Beiber movie | 1234 | 834234123
  -> @abc (dogs, yoga, NYC)
  -> dogs (@abc, @doglover, @def)
     yoga (@abc, @yogalover, @xyz)

▸ Shared data structures that are modifiable
▸ Functions that I can extend -> lambda is my friend
▸ Passing data to workers (serialization and closure)
▸ Updates
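
Why modifiable shared data structures and passing data to workers are listed as challenges can be illustrated without a cluster. Spark serializes a task's closure, including the data it references, and each worker operates on its own copy, so changes made on a worker never reach the driver's object. The sketch below only simulates that with the standard-library `pickle` module; `run_on_worker` and `add_user` are hypothetical helpers, not Spark APIs.

```python
import pickle

def run_on_worker(task, payload):
    """Simulate shipping data to a worker: the payload is pickled,
    unpickled, and the task runs on that private copy."""
    worker_copy = pickle.loads(pickle.dumps(payload))
    return task(worker_copy)

def add_user(index):
    """Mutate the index in place, as a naive 'shared update' would."""
    index.setdefault("dogs", []).append("@def")
    return index

driver_index = {"dogs": ["@abc", "@doglover"]}
worker_result = run_on_worker(add_user, driver_index)

# The driver's dictionary is untouched; only the returned copy has the update.
driver_unchanged = driver_index == {"dogs": ["@abc", "@doglover"]}
```

The practical consequence is that updates have to come back as return values of transformations (for example through `reduceByKey`, as in the samples) rather than through mutation of a shared object.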

PERFORMANCE (EXPERIMENTS AND TUNING)
PROCESS                              File Size  Tweets  Users    Vocab   Time   Cores
analyze                              15MB       5,600   4,700    10,171  14s    3
analyze with POS                     15MB       5,600   4,700    10,171  9min   32
analyze with POS                     15MB       5,600   4,700    10,171  8min   3
analyze                              15GB       4.9M    950,000  1.3M    21min  3
analyze with update (1d)             15GB       4.9M    1.3M     1.5M    30min  36
analyze with update (>1d) with Hive  15GB       4.9M    1.5M     1.8M    26min  36/10/3

NEXT STEPS
OPTIMIZATIONS
QUALITY OF MATCHES
Combine the models and write the output to HDFS
What happens when the vocabulary increases to x?
What happens when you do a batch run for 1 week (105 GB at a time)?
REAL-TIME QUERIES
Search by topics: SOLR on Cassandra
TESTING
Support for NLP techniques: faster processing for algorithms and data lookups

DATA
▸ VOLUME
Historical Twitter Data for testing, Daily Twitter Dumps
▸ VARIETY and VERACITY: Text Preprocessing, Metadata Extraction
▸ VELOCITY
▸ FOCUS: Fast computation, Data structures for fast reads/updates, Long Term Storage, Data Collection

SAMPLES
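# Tokenize each tweet together with the user description, then group by user.
# DataFrame.map was removed in Spark 2.0, hence .rdd; index access replaces
# the Python 2-only tuple-unpacking lambdas.
# Row layout: (userid, id_str, text, uname, udesc, time)
rdd_tuples = (df_pared.rdd
    .map(lambda r: (r[0], r[1], tokenize(r[2], r[4]), r[3], r[4], r[5]))
    .map(lambda r: (r[0], r[2]))                    # (userid, tokenized text)
    .groupByKey().mapValues(list)                   # (userid, [tokenized tweets])
    .flatMap(lambda kv: get_topics(kv[0], kv[1])))  # pairs consumed below

# Concatenate the values that share a key, separated by '|'.
rdd_topics = rdd_tuples.reduceByKey(lambda a, b: a + '|' + b)

# Swap key and value of each pair.
rdd_users = rdd_tuples.map(lambda x: x[::-1])

# Convert every line of the raw file and write a single output part.
# saveAsTextFile returns None, so its result is not assigned.
data_raw = sc.textFile(file_name)
(data_raw.map(convert_line)
    .coalesce(1, shuffle=True)
    .saveAsTextFile(revised_file_name))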

NLP
▸ Bag of words model
▸ Experimented with ways to clean data (Stemming, POS Tagging)
▸ scikit-learn: CountVectorizer, 2-grams
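
A bag of words extended to 2-grams can be sketched in a few lines of standard-library Python. This is an illustration of the model only, not the deck's code: `bag_of_ngrams`, its tokenization regex, and the sample sentence are assumptions. scikit-learn's `CountVectorizer` with `ngram_range=(1, 2)` builds a comparable vocabulary, although its default tokenization differs.

```python
import re
from collections import Counter

def bag_of_ngrams(text, max_n=2):
    """Count all 1-grams up to max_n-grams in lowercased text."""
    # Keep letters, digits, apostrophes and the Twitter markers @ and #.
    tokens = re.findall(r"[a-z0-9@#']+", text.lower())
    bag = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            # N-grams are built over the token stream, so they may span
            # punctuation that the tokenizer dropped.
            bag[" ".join(tokens[i:i + n])] += 1
    return bag

bag = bag_of_ngrams("loves dogs, yoga. Dogs love yoga")
```

Stemming or POS filtering, as mentioned in the bullets above, would be applied to `tokens` before the counting loop.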
