Context from Big Data

URX Data Scientist Delroy Cameron explains URX's approach to extracting context from Big Data at the IEEE Big Data Conference on November 1st, 2015 in Santa Clara, California.

For more information, please see the full post on the URX Blog here: http://blog.urx.com/urx-blog/2015/11/6/how-urx-derives-context-from-big-data

Published in: Data & Analytics
  • Be the first to comment

  • Be the first to like this

Context from Big Data

  1. Context from Big Data
     Startup Showcase, IEEE Big Data Conference
     November 1, 2015, Santa Clara, CA
     Delroy Cameron, Data Scientist
     @urxtech | urx.com | research@urx.com
  2. Who is URX? URX is a mobile technology platform that focuses on publisher monetization, content distribution, and user engagement.
     People: URX has 40 people (75% product/eng, 25% business).
     Customers: URX partners with the world's top publishers & advertisers.
     Funding: URX raised $15M from Accel, Google Ventures, and others.
  3. What problem does URX solve?
  4. URX serves contextually relevant native ads. URX interprets page context to dynamically determine the best message & action.
  5. How does URX affect the mobile ecosystem?
  6. Why is this a Big Data problem?
     Volume (apps): 1.6M apps (Android), 1.5M apps (Apple App Store)
     Volume (web pages) and variety (entities), e.g. Rhapsody (Music), Fansided (Sports), Apple (Music, TV, Books)
     Source: The Statistics Portal - http://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/
  7. How do we collect, store, and process the data needed to build our machine learning models?
  8. Important tasks
     1. Data Collection and Parsing
     2. Data Storage
        • Persistent Storage
        • Search Index
     3. Data Processing
        • Dictionary Building
        • Vectorization (Feature Vector Creation)
  9. 1. Data collection & parsing
     Wikipedia Corpus (English): 11GB XML dump (gzip file); 15M pages (but only 4M articles); Wikitext grammar
     https://dumps.wikimedia.org/enwiki/latest/
     <page>
       <title>AccessibleComputing</title>
       <ns>0</ns>
       <id>10</id>
       <redirect title="Computer accessibility"/>
       <revision>
         <id>631144794</id>
         <parentid>381202555</parentid>
         <timestamp>2014-10-26T04:50:23Z</timestamp>
         <contributor>
           <username>Paine Ellsworth</username>
           <id>9092818</id>
         </contributor>
         <comment>add [[WP:RCAT|rcat]]s</comment>
         <model>wikitext</model>
         <format>text/x-wiki</format>
         <text xml:space="preserve">
           #REDIRECT [[Computer accessibility]]
           {{Redr|move|from CamelCase|up}}
         </text>
         <sha1>4ro7vvppa5kmm0o1egfjztzcwd0vabw</sha1>
       </revision>
     </page>
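The parsing problem on this slide is streaming: the 11GB gzipped dump cannot be loaded whole. A minimal sketch of a streaming page reader, assuming the standard dump layout shown above; it uses ElementTree.iterparse where the deck's parsers use a SAX-based generator, and the namespace URI and file name are assumptions that vary by dump version.

    # Stream (title, wikitext) pairs out of the gzipped dump with flat memory use.
    import gzip
    import xml.etree.ElementTree as ET

    NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # assumed; varies by dump version

    def iter_pages(dump_path):
        with gzip.open(dump_path, "rb") as f:
            context = ET.iterparse(f, events=("start", "end"))
            _, root = next(context)  # the enclosing <mediawiki> element
            for event, elem in context:
                if event == "end" and elem.tag == NS + "page":
                    title = elem.findtext(NS + "title")
                    text = elem.findtext(NS + "revision/" + NS + "text") or ""
                    yield title, text
                    root.clear()  # drop finished pages so memory stays flat

    for title, text in iter_pages("enwiki-latest-pages-articles.xml.gz"):
        if not text.startswith("#REDIRECT"):  # redirects make up most of the 15M pages
            print(title)  # first real article
            break

Skipping redirect pages like the AccessibleComputing example above is what shrinks the 15M pages to roughly 4M articles.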
  10. 1. Data collection & parsing
      https://dumps.wikimedia.org/enwiki/latest/
  11. 1. Data collection & parsing
      FullWikiParser (mediawikiparser): sax library, generator; 20 secs/doc, 10 years
      FastWikiParser (mwparserfromhell): sax library, generator; 200 docs/sec, ~21 hours
      HTMLWikiParser (URX Index): hbase, lxml parser; 6 docs/sec, ~one month
      GensimWikiCorpusParser: multithreading, generator; ~3 hours
      wikipedia-parser: 1. pyspark (64 cores, 8GB RAM); 2. wikihadoop (StreamWikiDumpInputFormat) to split the input file; 3. mwparserfromhell to parse to raw text; 4. ~20 minutes
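A sketch of the winning wikipedia-parser recipe in pyspark: wikihadoop's StreamWikiDumpInputFormat splits the dump into per-page records, and mwparserfromhell strips the markup. The class path, HDFS paths, splittable-bz2 input, and the assumption that each record's value holds the page XML are mine, not from the deck.

    import re
    import mwparserfromhell
    from pyspark import SparkContext

    sc = SparkContext(appName="wikipedia-parser")  # 64 cores, 8GB driver RAM in the talk

    # wikihadoop's jar must be on the classpath; the class path is an assumption.
    # A bz2 dump is used here because it splits cleanly across workers.
    pages = sc.hadoopFile(
        "hdfs:///data/enwiki-latest-pages-articles.xml.bz2",
        inputFormatClass="org.wikimedia.wikihadoop.StreamWikiDumpInputFormat",
        keyClass="org.apache.hadoop.io.Text",
        valueClass="org.apache.hadoop.io.Text",
    )

    TEXT_RE = re.compile(r"<text[^>]*>(.*?)</text>", re.DOTALL)

    def to_plain_text(page_xml):
        # Pull the wikitext body out of one <page> blob, then strip markup.
        match = TEXT_RE.search(page_xml)
        if not match:
            return ""
        return mwparserfromhell.parse(match.group(1)).strip_code()

    # mwparserfromhell must be installed on every executor.
    texts = pages.map(lambda kv: to_plain_text(kv[1])).filter(bool)
    texts.saveAsTextFile("hdfs:///data/enwiki-plaintext")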
  12. 2. Data storage
      wikipedia-parser → HDFS (Namenode; datanode 1, datanode 2, . . . datanode n)
      wikipedia-indexer → Elasticsearch Index (ClusterNode 1, ClusterNode 2, . . . ClusterNode m)
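A hypothetical sketch of the wikipedia-indexer hop, loading parsed articles into Elasticsearch with the elasticsearch-py bulk helper. The index name, field names, and title<TAB>text record layout are assumptions, not the production schema.

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    es = Elasticsearch(["http://clusternode1:9200", "http://clusternode2:9200"])

    def actions(lines):
        # One bulk action per parsed article; assumes title<TAB>plaintext records.
        for doc_id, line in enumerate(lines):
            title, text = line.rstrip("\n").split("\t", 1)
            yield {"_index": "wikipedia", "_id": doc_id,
                   "_source": {"title": title, "body": text}}

    # One HDFS part file, fetched locally for the sketch (hdfs dfs -get ...).
    with open("part-00000", encoding="utf-8") as f:
        bulk(es, actions(f))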
  13. 3. Data Processor (Dictionary building)
      (0 taylor) . . . (1999995 zion)
      (1 alison) . . . (1999996 dozer)
      (2 swift) . . . (1999997 tank)
      (3 born) . . . (1999998 trinity)
      (4 december) . . . (1999999 neo)
      Pyspark (Gensim): wikihadoop, StreamWikiDumpInputFormat; dictionary, tfidfmodel; ~1 hour
      GensimWikiCorpusParser: multithreading, generator; corpus, dictionary, tfidfmodel; ~6 hours
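The (token id, token) pairs above are exactly what a Gensim Dictionary holds. A minimal single-machine sketch of the dictionary and TF-IDF build, assuming one pre-cleaned document per line; the deck's pyspark version distributes the same two steps.

    from gensim.corpora import Dictionary
    from gensim.models import TfidfModel

    def tokenized_docs(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield line.lower().split()  # the real pipeline cleans text more carefully

    docs = list(tokenized_docs("enwiki-plaintext.txt"))
    dictionary = Dictionary(docs)                # token -> integer id, as on the slide
    bow = [dictionary.doc2bow(d) for d in docs]  # sparse bag-of-words vectors
    tfidf = TfidfModel(bow)                      # learns IDF weights from the corpus

    dictionary.save("wiki.dict")
    tfidf.save("wiki.tfidf")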
  14. 4. Data Processor (Vectorization)
      Alias          Candidate Entity                          f1    f2   ...  fn
      Taylor Swift   wikipedia:Taylor_Swift                    0.91  0.81 ...  0.34
                     wikipedia:Taylor_Swift_(album)            0.42  0.10 ...  0.42
                     wikipedia:1989_(Taylor_Swift_album)       0.71  0.23 ...  0.31
                     wikipedia:Fearless_(Taylor_Swift_song)    0.13  0.22 ...  0.23
                     wikipedia:John_Swift                      0.00  0.19 ...  0.56
      Gensim: ~350ms to predict the entity for an alias
      Cython: ~100ms to predict the entity for an alias
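The deck doesn't say what features f1..fn are, so here is a hypothetical one-feature stand-in for the vectorization step: score each candidate entity by TF-IDF cosine similarity between the alias's surrounding context and the candidate's article text, and pick the argmax.

    from gensim.corpora import Dictionary
    from gensim.models import TfidfModel
    from gensim.matutils import cossim

    dictionary = Dictionary.load("wiki.dict")
    tfidf = TfidfModel.load("wiki.tfidf")

    def score(context_tokens, article_tokens):
        # Cosine similarity between two TF-IDF-weighted sparse vectors.
        a = tfidf[dictionary.doc2bow(context_tokens)]
        b = tfidf[dictionary.doc2bow(article_tokens)]
        return cossim(a, b)

    context = "taylor swift born december 1989".split()
    candidates = {  # toy article texts, not real Wikipedia content
        "wikipedia:Taylor_Swift": "taylor alison swift american singer songwriter".split(),
        "wikipedia:John_Swift": "john swift english footballer midfielder".split(),
    }
    best = max(candidates, key=lambda e: score(context, candidates[e]))
    print(best)  # the highest-scoring candidate is the predicted entity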
  15. Full pipeline (diagram, steps 1-7): the Wikipedia, Wikilinks, and X corpora flow through corpus-parser into HDFS (Wikipedia), HDFS (Wikilinks), and HDFS (X Corpus); corpus-indexer loads them into Elasticsearch nodes (Elasticsearch1, Elasticsearch2, . . . Elasticsearchn); the Data Processor builds the Dictionary and TF-IDF Model, which feed the Machine Learning Module.
  16. Demo
  17. Linked Entities for http://zap2it.com/2015/10/5-reasons-cbs-macgyver-reboot-isnt-the-worst-idea-ever/
      1. http://en.wikipedia.org/wiki/Macgyver
      2. http://en.wikipedia.org/wiki/Neil_deGrasse_Tyson
      3. http://en.wikipedia.org/wiki/Richard_Dean_Anderson
      4. http://en.wikipedia.org/wiki/Josh_Holloway
      5. http://en.wikipedia.org/wiki/NBC
      6. http://en.wikipedia.org/wiki/CBS
      7. http://en.wikipedia.org/wiki/James_Wan
      8. http://en.wikipedia.org/wiki/Netflix
      9. http://en.wikipedia.org/wiki/America_America
  18. Things to watch out for
      ● Tuning pyspark jobs (64 cores, 8GB driver RAM)
      ● Bringing down the Elasticsearch cluster
      ● Rejoining the union after secession (Elasticsearch nodes)
      ● Text cleaning (lowercasing, character encoding)
      ● Merging in Hadoop for dictionary creation
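On the text-cleaning bullet: normalizing character encoding before lowercasing keeps dictionary lookups consistent, as in this small guard.

    import unicodedata

    def clean(text):
        text = unicodedata.normalize("NFKC", text)  # fold lookalike/fullwidth characters
        return text.lower()

    assert clean("Ｔａｙｌｏｒ") == "taylor"  # fullwidth forms collapse to ASCII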
  19. Getting started is easy. Sign Up → Download SDK → Start Building. Visit http://urx.com/sign-up for more information.
  20. Thank you. delroy@urx.com
