LoryfelNunez

T W I N D E R
INSIGHT DATA ENGINEERING
LORYFEL NUNEZ

THE MOTIVATION, THE DATA, THE PROBLEM, THE APPROACH
ETL, TEXT PROCESSING ON A DISTRIBUTED SYSTEM
User is a Twitter user
User_Representation is a combination of a user's tweets and her description, represented as a Bag-of-NGrams
Match is the maximum intersection of UserA's and UserB's User_Representation
User_Query is a query to the user_vocab table, where each row holds a user and her vocab list (the User_Representation)
Word_Query is a query to the vocab_user table, where each row holds a word and a list of users (Inverted Index)
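The definitions above can be sketched in plain Python. This is a minimal illustration, not the author's implementation: it assumes whitespace tokenization, 1- and 2-grams, and a set-based representation (counts are dropped so that "intersection" is well defined). The helper names `ngrams`, `user_representation` and `best_match`, and the sample users' text, are hypothetical.

```python
def ngrams(tokens, n_max=2):
    """Return the set of all 1..n_max-grams in a token list."""
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    }

def user_representation(description, tweets, n_max=2):
    """N-grams over a user's description and tweets, kept as a set."""
    grams = set()
    for text in [description] + list(tweets):
        grams |= ngrams(text.lower().split(), n_max)
    return grams

def best_match(user, user_vocab):
    """Return the other user whose representation overlaps most with `user`."""
    mine = user_vocab[user]
    others = (u for u in user_vocab if u != user)
    return max(others, key=lambda u: len(mine & user_vocab[u]))

# Hypothetical user_vocab table: one row per user, holding the representation.
user_vocab = {
    "@abc": user_representation("loves dogs yoga", ["walking dogs in nyc"]),
    "@doglover": user_representation("dogs dogs dogs", ["walking dogs daily"]),
    "@xyz": user_representation("yoga teacher", ["morning yoga class"]),
}
print(best_match("@abc", user_vocab))
```

Here @abc shares three n-grams with @doglover (dogs, walking, walking dogs) and only one with @xyz (yoga), so @doglover is the match.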

CHALLENGES
FULL TWITTER DATA JSON
-> USER_HANDLE, USER_DESC, TWEET_TEXT, TWEET_ID, TIME
   e.g. @abc | loves dogs, yoga | Did you see the Bieber movie | 1234 | 834234123
-> convert user_desc and text to TOPICS, groupByKey
-> USER_ID, (TOPIC1,…,TOPIC10)
   e.g. @abc (dogs, yoga, NYC)
-> Inverted Index for search
-> TOPIC1 (USER_ID1, USER_ID2…USER_IDN)
   e.g. dogs (@abc, @doglover, @def)
        yoga (@abc, @yogalover, @xyz)
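The last step of this flow, turning per-user topics into an inverted index, might look like the following in plain Python. It is only a sketch: the function name `invert` is hypothetical, and apart from the @abc row, the per-user topic lists are made up so that the result reproduces the dogs and yoga rows shown on the slide.

```python
from collections import defaultdict

def invert(user_topics):
    """Turn {user: [topics]} into {topic: sorted list of users}."""
    index = defaultdict(set)
    for user, topics in user_topics.items():
        for topic in topics:
            index[topic].add(user)
    return {topic: sorted(users) for topic, users in index.items()}

# The @abc row comes from the slide; the other rows are illustrative assumptions.
user_topics = {
    "@abc": ["dogs", "yoga", "NYC"],
    "@doglover": ["dogs"],
    "@def": ["dogs"],
    "@yogalover": ["yoga"],
    "@xyz": ["yoga"],
}
vocab_user = invert(user_topics)
print(vocab_user["dogs"])
print(vocab_user["yoga"])
```

A Word_Query is then a single dictionary lookup, which is what makes the inverted index suitable for search.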

Shared Data Structures that are modifiable
Functions that I can extend -> lambda is my friend
Passing data to workers (serialization and closure)
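The serialization and closure challenge can be illustrated without a cluster. The sketch below uses the standard `pickle` module as a stand-in for shipping data to a worker. It is an analogy for how a distributed engine hands each worker a serialized copy, not Spark's actual mechanism, and `run_on_worker` is a hypothetical helper.

```python
import pickle

def run_on_worker(func, data):
    """Simulate shipping `data` to a worker: the worker sees a deserialized copy."""
    worker_copy = pickle.loads(pickle.dumps(data))
    func(worker_copy)
    return worker_copy

def add_word(vocab):
    # Mutates whatever dictionary it is handed.
    vocab["yoga"] = vocab.get("yoga", 0) + 1

driver_vocab = {"dogs": 1}
worker_vocab = run_on_worker(add_word, driver_vocab)

print(driver_vocab)  # the driver's dictionary is unchanged
print(worker_vocab)  # only the worker's copy was updated
```

This is why a modifiable shared structure is hard to get on a distributed system: updates made inside a worker function have to be returned through the job's results rather than by mutating captured data.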
UPDATES

PERFORMANCE (EXPERIMENTS AND TUNING)
PROCESS                              File Size  Tweets  Users    Vocab   Time   Cores
analyze                              15MB       5,600   4,700    10,171  14s    3
analyze with POS                     15MB       5,600   4,700    10,171  9min   32
analyze with POS                     15MB       5,600   4,700    10,171  8min   3
analyze                              15GB       4.9M    950,000  1.3M    21min  3
analyze with update (1d)             15GB       4.9M    1.3M     1.5M    30min  36
analyze with update (>1d) with Hive  15GB       4.9M    1.5M     1.8M    26min  36/10/3
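As a rough reading aid, the two plain "analyze" rows (both on 3 cores) can be converted to tweets per second. The figures below are derived only from the numbers in the rows above; 4.9M is itself a rounded count, so the second rate is approximate.

```python
# (process, tweets, seconds) copied from the two "analyze" rows
runs = [
    ("analyze, 15MB", 5_600, 14),
    ("analyze, 15GB", 4_900_000, 21 * 60),
]
throughput = {name: tweets / seconds for name, tweets, seconds in runs}
for name, rate in throughput.items():
    print(f"{name}: {rate:.0f} tweets/s")
```

That works out to 400 tweets/s for the 15MB run and roughly 3,889 tweets/s for the 15GB run.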

NEXT STEPS
OPTIMIZATIONS
QUALITY OF MATCHES
Combine the models and write my output to HDFS
What happens when the vocabulary increases to x?
What happens when you do a batch run for 1 week (105 GB at a time)?
REAL-TIME QUERIES
search by topics — SOLR on Cassandra
TESTING
Support for NLP Techniques: faster processing for algorithm, data lookups

DATA
▸ VOLUME: Historical Twitter Data for testing, Daily Twitter Dumps
▸ VARIETY and VERACITY: Text Preprocessing, Metadata Extraction
▸ VELOCITY
▸ FOCUS: Fast computation, Data structures for fast reads/updates, Long Term Storage, Data Collection

SAMPLES
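# Row layout: (userid, id_str, text, uname, udesc, time).
# Tokenize each tweet with its user description, keep (userid, tokens),
# group the tokens per user, then emit the pairs returned by get_topics.
# Indexing replaces tuple-unpacking lambdas, which are Python 2 only.
# On Spark 2+, start from df_pared.rdd since DataFrame.map was removed.
rdd_tuples = (df_pared
    .map(lambda r: (r[0], r[1], tokenize(r[2], r[4]), r[3], r[4], r[5]))
    .map(lambda r: (r[0], r[2]))
    .groupByKey().mapValues(list)
    .flatMap(lambda kv: get_topics(kv[0], kv[1])))

# Concatenate the values that share a key into one '|'-separated string.
rdd_topics = rdd_tuples.reduceByKey(lambda a, b: a + '|' + b)

# Reverse each pair so that the value becomes the key.
rdd_users = rdd_tuples.map(lambda x: x[::-1])

# Convert the raw file line by line and write it back as a single part file.
data_raw = sc.textFile(file_name)
data_final = data_raw.map(convert_line).coalesce(1, shuffle=True)
data_final.saveAsTextFile(revised_file_name)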
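The samples call `tokenize` and `get_topics` without showing them. The sketch below is one hypothetical way they could look, assuming topics are simply a user's most frequent tokens (up to 10, following the TOPIC1…TOPIC10 layout) and that `get_topics` emits (topic, uid) pairs. The slides do not confirm either assumption, and no stop-word removal, stemming or POS tagging is attempted here.

```python
from collections import Counter
import re

def tokenize(text, udesc):
    """Lowercase the tweet text plus user description and split into word tokens."""
    return re.findall(r"[a-z0-9']+", (text + " " + udesc).lower())

def get_topics(uid, tweets, k=10):
    """Emit (topic, uid) pairs for the user's k most frequent tokens (assumed rule)."""
    counts = Counter(token for tokens in tweets for token in tokens)
    return [(topic, uid) for topic, _ in counts.most_common(k)]

# Hypothetical input: two tokenized tweets from the same user.
tweets = [
    tokenize("Did you see the movie", "loves dogs, yoga"),
    tokenize("Dogs at the park", "loves dogs, yoga"),
]
pairs = get_topics("@abc", tweets, k=2)
print(pairs)
```

Because the description is tokenized with every tweet, its words are counted once per tweet, which weights a user's self-description more heavily as she tweets more. Whether that is desirable is a design choice the slides leave open.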

NLP
▸ Bag of words model
▸ Experimented with ways to clean data (Stemming, POS Tagging)
▸ scikit-learn: CountVectorizer, 2-gram
