Context from Big Data
Startup Showcase
IEEE Big Data Conference
November 1, 2015
Santa Clara, CA
Delroy Cameron, Data Scientist
@urxtech | urx.com | research@urx.com
People
URX has 40 people: 75%
product/eng, 25% business
Customers
URX partners with the world’s top
publishers & advertisers.
Funding
URX raised $15M from Accel,
Google Ventures, and others
Who is URX?
URX is a mobile technology platform that focuses on publisher monetization,
content distribution, and user engagement.
What problem does URX solve?
URX serves contextually relevant native ads.
URX interprets page
context to dynamically
determine the best
message & action.
How does URX affect the mobile ecosystem?
Volume (Apps) Volume (web pages) Variety (entities)
Why is this a Big Data problem?
Rhapsody
(Music)
Fansided
(Sports)
Apple
(Music, TV, Books)
Source: The Statistics Portal - http://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/
1.6M Apps (Android)
1.5M Apps (Apple Store)
How do we collect, store, and process the data needed
to build our machine learning models?
1. Data Collection and Parsing
2. Data Storage
• Persistent Storage
• Search Index
3. Data Processing
• Dictionary Building
• Vectorization (Feature Vector Creation)
Important tasks
11GB XML dump (gzip file)
15M pages (but only 4M articles)
Wikitext Grammar
Wikipedia Corpus (English)
1. Data collection & parsing
https://dumps.wikimedia.org/enwiki/latest/
<page>
<title>AccessibleComputing</title>
<ns>0</ns>
<id>10</id>
<redirect title="Computer accessibility"/>
<revision>
<id>631144794</id>
<parentid>381202555</parentid>
<timestamp>2014-10-26T04:50:23Z</timestamp>
<contributor>
<username>Paine Ellsworth</username>
<id>9092818</id>
</contributor>
<comment>add [[WP:RCAT|rcat]]s</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text xml:space="preserve">
#REDIRECT [[Computer accessibility]] {{Redr|move|from CamelCase|up}}
</text>
<sha1>4ro7vvppa5kmm0o1egfjztzcwd0vabw</sha1>
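A minimal sketch of how a dump in this shape could be streamed page by page with a generator, so that the whole file never sits in memory. It uses the standard library's `xml.etree.ElementTree.iterparse` rather than the sax library named on the slides, and the rule for what counts as an article (namespace 0, not a redirect) is an assumption. The sample input is a shortened stand-in, and `ExampleArticle` is a made-up page.

```python
import io
import xml.etree.ElementTree as ET

# Shortened stand-in for the dump; "ExampleArticle" is a hypothetical page.
SAMPLE_DUMP = b"""<mediawiki>
  <page>
    <title>AccessibleComputing</title>
    <ns>0</ns>
    <id>10</id>
    <redirect title="Computer accessibility"/>
    <revision>
      <id>631144794</id>
      <text xml:space="preserve">#REDIRECT [[Computer accessibility]]</text>
    </revision>
  </page>
  <page>
    <title>ExampleArticle</title>
    <ns>0</ns>
    <id>11</id>
    <revision>
      <id>1</id>
      <text xml:space="preserve">Placeholder article text.</text>
    </revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Drop an XML namespace prefix such as '{uri}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield one dict per <page> element, then free that element."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        page = {"title": None, "ns": None, "redirect": None, "text": ""}
        for node in elem.iter():
            name = local_name(node.tag)
            if name == "title":
                page["title"] = node.text
            elif name == "ns":
                page["ns"] = int(node.text)
            elif name == "redirect":
                page["redirect"] = node.get("title")
            elif name == "text":
                page["text"] = node.text or ""
        yield page
        elem.clear()  # release the parsed subtree


def iter_articles(stream):
    """Assumed filter: main namespace (ns 0) and not a redirect."""
    for page in iter_pages(stream):
        if page["ns"] == 0 and page["redirect"] is None:
            yield page


articles = list(iter_articles(io.BytesIO(SAMPLE_DUMP)))
print([a["title"] for a in articles])
```

Any binary file-like object can be passed as `stream`, so a compressed dump could be opened with the matching standard-library module and handed to the same generator.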
1. Data collection & parsing
FullWikiParser (mediawikiparser): sax library, generator; 20 secs/doc, 10 years
FastWikiParser (mwparserfromhell): sax library, generator; 200 docs/sec, ~21 hours
HTMLWikiParser (URX Index): hbase, lxml parser; 6 docs/sec, ~one month
GensimWikiCorpusParser: multithreading, generator; ~3 hours
wikipedia-parser:
1. pyspark (64 cores, 8GB RAM)
2. wikihadoop (StreamWikiDumpInputFormat)
• split input file
3. mwparserfromhell
• parse to raw text
4. ~20 minutes
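The throughput figures quoted for the first three parsers can be checked against the 15M-page corpus with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes a constant rate over the whole dump, and the function name is illustrative.

```python
TOTAL_PAGES = 15_000_000  # page count stated for the English dump


def total_seconds(pages, docs_per_second):
    """Wall-clock seconds needed at a constant processing rate."""
    return pages / docs_per_second


full_s = total_seconds(TOTAL_PAGES, 1 / 20)  # FullWikiParser: 20 secs/doc
fast_s = total_seconds(TOTAL_PAGES, 200)     # FastWikiParser: 200 docs/sec
html_s = total_seconds(TOTAL_PAGES, 6)       # HTMLWikiParser: 6 docs/sec

print(f"FullWikiParser: {full_s / (365 * 86400):.1f} years")
print(f"FastWikiParser: {fast_s / 3600:.1f} hours")
print(f"HTMLWikiParser: {html_s / 86400:.1f} days")
```

The results (about 9.5 years, 20.8 hours, and 28.9 days) line up with the rounded "10 years", "~21 hours", and "~one month" on the slide.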
2. Data storage
[Diagram components: wikipedia-parser; wikipedia-indexer; HDFS (Namenode; datanode 1, datanode 2, ..., datanode n); Elasticsearch Index (ClusterNode 1, ClusterNode 2, ..., ClusterNode m)]
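The speaker notes mention batch updates to reduce I/O when indexing. One way this might look is to group parsed pages into fixed-size chunks and serialize each chunk in the newline-delimited format accepted by the Elasticsearch bulk endpoint. The sketch below only builds the payloads and does not contact a cluster. The index name `wikipedia`, the document fields, and the batch size are assumptions.

```python
import json
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def bulk_payload(docs, index_name):
    """Serialize docs as action/source line pairs for a bulk request."""
    lines = []
    for doc in docs:
        action = {"index": {"_index": index_name, "_id": doc["id"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # bulk bodies end with a newline


# Hypothetical parsed pages; a real run would stream them from the parser.
pages = ({"id": i, "title": f"page-{i}"} for i in range(5))
payloads = [bulk_payload(chunk, "wikipedia") for chunk in batched(pages, 2)]
print(len(payloads), "bulk requests")
```

Sending one request per chunk rather than one per document is the point of the batching. The right chunk size depends on document size and cluster capacity.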
(0 taylor) . . . (1999995 zion)
(1 alison) . . . (1999996 dozer)
(2 swift) . . . (1999997 tank)
(3 born) . . . (1999998 trinity)
(4 december) . . . (1999999 neo)
3. Data Processor (Dictionary building)
Pyspark (Gensim): wikihadoop, StreamWikiDumpInputFormat; dictionary, tfidfmodel; ~1 hour
GensimWikiCorpusParser: multithreading, generator; corpus, dictionary, tfidfmodel; ~6 hours
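To make the dictionary and TF-IDF step concrete, here is a small standard-library sketch: a dictionary maps each token to an integer id, and a TF-IDF weight scales a token's count in a document by how rare the token is across the corpus. The slides use Gensim for this. The code below is a simplified stand-in with an unnormalized `tf * log2(N / df)` weighting, and the three toy documents are assumptions.

```python
import math
from collections import Counter

# Toy tokenized corpus (hypothetical documents).
docs = [
    ["taylor", "alison", "swift", "born", "december"],
    ["taylor", "swift", "album"],
    ["zion", "dozer", "tank", "trinity", "neo"],
]


def build_dictionary(tokenized_docs):
    """Assign each distinct token an integer id in first-seen order."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def document_frequencies(tokenized_docs, token2id):
    """Count how many documents contain each token id."""
    df = Counter()
    for doc in tokenized_docs:
        df.update({token2id[t] for t in doc})
    return df


def tfidf(doc, token2id, df, num_docs):
    """Sparse vector {token_id: tf * log2(N / df)} for one document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return {tid: tf * math.log2(num_docs / df[tid]) for tid, tf in counts.items()}


token2id = build_dictionary(docs)
df = document_frequencies(docs, token2id)
weights = tfidf(docs[0], token2id, df, len(docs))
print(token2id["taylor"], token2id["december"])
print(weights)
```

In the first toy document, "alison" appears in one document and "taylor" in two, so "alison" receives the larger weight.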
Alias         Candidate Entity                         f1    f2    …  fn
Taylor Swift  wikipedia:Taylor_Swift                   0.91  0.81  …  0.34
              wikipedia:Taylor_Swift_(album)           0.42  0.10  …  0.42
              wikipedia:1989_(Taylor_Swift_album)      0.71  0.23  …  0.31
              wikipedia:Fearless_(Taylor_Swift_song)   0.13  0.22  …  0.23
              wikipedia:John_Swift                     0.00  0.19  …  0.56
4. Data Processor (Vectorization)
Gensim: ~350 ms to predict the entity per alias
Cython: ~100 ms to predict the entity per alias
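As a rough illustration of what "predict entity per alias" could mean, the sketch below scores each candidate entity from its feature vector and returns the highest-scoring one. The slides do not say how the features are combined, so the equal weights and the linear score are purely hypothetical. Only the three feature columns shown in the table are used, and the elided columns are left out.

```python
# Feature values copied from the table; elided columns are omitted.
candidates = {
    "wikipedia:Taylor_Swift": [0.91, 0.81, 0.34],
    "wikipedia:Taylor_Swift_(album)": [0.42, 0.10, 0.42],
    "wikipedia:1989_(Taylor_Swift_album)": [0.71, 0.23, 0.31],
    "wikipedia:Fearless_(Taylor_Swift_song)": [0.13, 0.22, 0.23],
    "wikipedia:John_Swift": [0.00, 0.19, 0.56],
}

# Hypothetical weights; a trained model would supply its own.
weights = [1 / 3, 1 / 3, 1 / 3]


def score(features, feature_weights):
    """Weighted sum of one candidate's features."""
    return sum(f * w for f, w in zip(features, feature_weights))


def predict_entity(cands, feature_weights):
    """Return the candidate entity with the highest score."""
    return max(cands, key=lambda entity: score(cands[entity], feature_weights))


best = predict_entity(candidates, weights)
print(best)
```

With these assumed weights the alias "Taylor Swift" resolves to `wikipedia:Taylor_Swift`. Different weights could rank the candidates differently.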
[Pipeline diagram, steps 1-7. Components: Wikipedia Corpus, Wikilinks Corpus, X Corpus; corpus-parser; HDFS (Wikipedia), HDFS (Wikilinks), HDFS (X Corpus); corpus-indexer; Elasticsearch 1, Elasticsearch 2, ..., Elasticsearch n; Data Processor; Dictionary; TF-IDF Model; Machine Learning Module]
Demo
Linked Entities
1. http://en.wikipedia.org/wiki/Macgyver
2. http://en.wikipedia.org/wiki/Neil_deGrasse_Tyson
3. http://en.wikipedia.org/wiki/Richard_Dean_Anderson
4. http://en.wikipedia.org/wiki/Josh_Holloway
5. http://en.wikipedia.org/wiki/NBC
6. http://en.wikipedia.org/wiki/CBS
7. http://en.wikipedia.org/wiki/James_Wan
8. http://en.wikipedia.org/wiki/Netflix
9. http://en.wikipedia.org/wiki/America_America
http://zap2it.com/2015/10/5-reasons-cbs-macgyver-reboot-isnt-the-worst-idea-ever/
● Tuning pyspark jobs (64 cores, 8GB Driver RAM)
● Bringing down the elasticsearch cluster
● Rejoining the union after secession (elasticsearch nodes)
● Text Cleaning (lowercasing, character encoding)
● Merging in Hadoop for dictionary creation
Things to watch out for
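For the text-cleaning item, a minimal sketch of the two steps named there (lowercasing and character encoding) might look like the following. Decoding as UTF-8, NFKC normalization, and whitespace collapsing are assumptions about what the cleaning involved, not the pipeline's actual rules.

```python
import unicodedata


def clean_text(raw, encoding="utf-8"):
    """Decode bytes, normalize Unicode, lowercase, collapse whitespace."""
    if isinstance(raw, bytes):
        # Replace undecodable bytes instead of failing the whole job.
        raw = raw.decode(encoding, errors="replace")
    text = unicodedata.normalize("NFKC", raw)  # one canonical form per character
    text = text.lower()
    return " ".join(text.split())


# "e" followed by a combining acute accent becomes the single character "é".
sample = "Beyonce\u0301  KNOWLES".encode("utf-8")
cleaned = clean_text(sample)
print(cleaned)
```

Normalizing before building the dictionary keeps visually identical tokens from receiving different ids.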
Getting started is easy.
Sign Up Download SDK Start Building
Visit http://urx.com/sign-up for more information.
Thank you.
delroy@urx.com


Editor's Notes

  1. I am Delroy, Data Scientist at URX. Challenges and experiences.
  2. URX is a mobile advertising company in SF. Spotify, Bandsintown, Lyft, SeatGeek, StubHub, Airbnb.
  3. Mobile app or mobile web: an ad from left field.
  4. Create a more cohesive and relevant mobile experience. Deeplinking refers to the use of a hyperlink that links to a specific, generally searchable or indexed, piece of web content on a website. Deeplinks are important because they connect the content within one app directly to the content within another app. Example: fansided.com to seatgeek.com.
  5. Enable developers to better monetize by linking the content within apps. Create more engagement by allowing users to convert intent into actions.
  6. The Statistics Portal (July 2015): 1.6M apps (Android), 1.5M apps (Apple Store). 1M pages: Rhapsody, Pandora (URX Index).
  7. Entity linking problem. What happens if we just search Wikipedia?
  8. Wikipedia: comprehensive, accurate (due to crowdsourcing). Wikification, D2W.
  9. Full: 3/min, 180/hr, 4,300/day, 1.5M/year.
  10. Wikipedia pages + mentions index. Batch updates to reduce I/O. ~15 or more hours. Persistent cluster. Node rejoins the union if it dies.
  11. What happens at test time? www.dancingastronaut.com
  12. http compute:65432/?scores=max url=http://zap2it.com/2015/10/5-reasons-cbs-macgyver-reboot-isnt-the-worst-idea-ever/ http compute:65432/mentions url=http://zap2it.com/2015/10/finding-carter-season-2b-ben-escape-cash-explained/ http compute:65432/?scores=max url=http://zap2it.com/2015/10/finding-carter-season-2b-ben-escape-cash-explained/
  13. http compute:65432/?scores=max url=http://zap2it.com/2015/10/5-reasons-cbs-macgyver-reboot-isnt-the-worst-idea-ever/