0
Quality to Quantity to Quality         on the Web     Andraž Tori, CTO at Zemanta              @andraz
Topics- a bit about Zemanta- how advanced “data tools” and spammers  interact
We are all trying to organize the web
Making it right,making it useful  and linked
Not so long time ago, in a city not far away...
some other people
are trying to do the opposite
trying to disorganize it,  make it confusing,and to profit from that
using the tools we have built!
Their motives are not sinster          (mostly)
it is about profit
Profit- publish as much content as possible- quality is not (that) important- get traffic or high page ranking for certain...
So, why do I care?
Job openingYou will get a spreadsheet with 180 blog url’s andlogins. You will log into each blog and schedule 2posts per w...
And why might you care?- the organized information is great tool for those   that try to disorganize it- they are poisonin...
What do we do at
- is a “personal writing assistant”- suggesting content while you write (your blog)- analyzing your text- connecting it wi...
Opening up the hood
the reality
How it works                                       Content                                       suggestionsPlain text    ...
Main design goals- Input is meaningful chunk of text (not a keyword   or a phrase)- Input is (semi) English language- Has ...
Analysis pipeline                                           Known phrasesNamed Entity                                     ...
Analysis pipeline                                                                    Known phrases                        ...
Background knowledge- Data from Wikipedia, MusicBrainz, Freebase…  and world wild web- Includes linguistical and semantica...
Background knowledge- 7M mined and linked up entities and  concepts                                           Triple store...
After analysisText           SOLR         Related articles          articles           SOLR             Images          im...
Example SOLR query
boost(((                                  wiki_entities:Health insurancewiki_entities:Medical underwriting        wiki_ent...
Solr- We adapted Solr for “query by document”- 52% precision (at 10) on internal evaluations  - plain Lucene MLT comes to ...
Lucene plain “More Like This”
Metrics & tests- Every part of the system is being constantly  evaluted- Precision/recall at 5 different points in the  sy...
Overview- We do pretty deep processing to deliver simple  user experience of “personal authoring  assistant”- And everythi...
What API offers?• Tags                               Most used• Categories• Concepts and entities   Most interesting• Rela...
So mash-ups happen...
Some API users
We are just one of the many people offering services based on large amounts of web dataeach spending man-years trying to o...
now back to the bad guys
Job openingYou will get a spreadsheet with 180 blog url’s andlogins. You will log into each blog and schedule 2posts per w...
Theres more than meets the eye
Gather search terms               Analyze →                Find / create(extensions, logs, guess)     what people search f...
Warnings- Ive seen no single system using the whole   pipeline as described, however all parts were   found in the wild- E...
Gather search terms             Analyze →                 Find / create(extensions, logs, guess)    what people search for...
Finding their keywords, niches- Domain expertise- Users like to install extensions and say “yes”- You observe referrers on...
The sophisticated part of the market“Demand Media relies on a proprietary algorithm  to help editors best determine what s...
Gather search terms               Analyze →               Find / create(extensions, logs, guess)     what people search fo...
Find / create content- Steal- Take from “open article directories”- Have your own “content assembly line” like  Demand Media
Open article directories
Gather search terms               Analyze →                Find / create(extensions, logs, guess)     what people search f...
Tһiѕ iѕ nοt the text you аre lookinɡ for.
Tһiѕ iѕ nοt the text you аre lookinɡ for.
Translate it to random language and back to EnglishÜbersetzen sie zufällig Sprache und wieder auf Englisch  Language and t...
Covering their tracks- Trying to fool search engines or people?- Search engines are catching up- Google Translate API is b...
Gather search terms              Analyze →                 Find / create(extensions, logs, guess)    what people search fo...
Spammers say darndest things
Gather search terms               Analyze →                Find / create(extensions, logs, guess)     what people search f...
Remixing linked data and spam- Currently mostly the good guys are using Linked  Data- However, its just too tempting to be...
Gather search terms               Analyze →                Find / create(extensions, logs, guess)     what people search f...
Publish- On hosted third party platforms  - eating their resources- Platforms have hard time killing spammers- Smaller one...
Gather search terms               Analyze →               Find / create(extensions, logs, guess)     what people search fo...
Valuable commentsAs I write this post, Zemanta is showing me 5 different articles that are related to my post. I could vis...
- Guy in previous slide is honest and well-  meaning- But what if you automate that via Amazon  Mechanical Turk or oDesk?
Gather search terms               Analyze →                Find / create(extensions, logs, guess)     what people search f...
Profit?- sell ads- sell links- sell “fully developed site”- to the highest bidder
Search engines to the rescue?- Mahalo cut 10% of the staff the day after  Google announced ranking changes- Demand Medias ...
Ecosystem- Very sophisticated, large players  - moving to more high quality content, video?- Small time operations  - usin...
Food for thought
Can we make spammers (and others) work for us,         making linked data better?             (think reCAPTCHA)
Could article directories be fruitfully used?eZineArticles.com, GoArticles.com, etc...
Find rewritten articles and use them as parallel                     corpus?
Could we use global workforce market more    efficiently to get more linked data?
Thesis, antithesis, synthesis?               http://xkcd.com/810/
Thank you!Questions?
Image sources    http://www.flickr.com/photos/dzingeek/4587871752/    http://www.flickr.com/photos/25101572@N02/43934740...
Quality, Quantity, Web and Semantics
Quality, Quantity, Web and Semantics
Quality, Quantity, Web and Semantics
Quality, Quantity, Web and Semantics
Quality, Quantity, Web and Semantics
Quality, Quantity, Web and Semantics
Quality, Quantity, Web and Semantics
Quality, Quantity, Web and Semantics
Quality, Quantity, Web and Semantics
Quality, Quantity, Web and Semantics
Quality, Quantity, Web and Semantics
Quality, Quantity, Web and Semantics
Quality, Quantity, Web and Semantics
Quality, Quantity, Web and Semantics
Quality, Quantity, Web and Semantics
Quality, Quantity, Web and Semantics
Quality, Quantity, Web and Semantics
Quality, Quantity, Web and Semantics
Upcoming SlideShare
Loading in...5
×

Quality, Quantity, Web and Semantics

1,173

Published on

How is organized data used by some web players having not the best intentions? How can tools that try to help individual authors be subverted by spammers? Also, how does Zemanta work and why are we interested in this topic.

Published in: Technology, Economy & Finance
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
1,173
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
8
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Transcript of "Quality, Quantity, Web and Semantics"

  1. 1. Quality to Quantity to Quality on the Web Andraž Tori, CTO at Zemanta @andraz
  2. 2. Topics- a bit about Zemanta- how advanced “data tools” and spammers interact
  3. 3. We are all trying to organize the web
  4. 4. Making it right,making it useful and linked
  5. 5. Not so long time ago, in a city not far away...
  6. 6. some other people
  7. 7. are trying to do the opposite
  8. 8. trying to disorganize it, make it confusing,and to profit from that
  9. 9. using the tools we have built!
  10. 10. Their motives are not sinster (mostly)
  11. 11. it is about profit
  12. 12. Profit- publish as much content as possible- quality is not (that) important- get traffic or high page ranking for certain terms- sell clicks, links or whole “fully built” sites to the highest bidder- users and search engines are necessary evil to be tricked as cheaply as possible
  13. 13. So, why do I care?
  14. 14. Job openingYou will get a spreadsheet with 180 blog url’s andlogins. You will log into each blog and schedule 2posts per week ...You will spice up every post with images and/orrelated links within the content, using a Wordpressplugin called Zemantahttps://www.odesk.com/jobs/Wordpress-Blog-Poster_~~c8c04549b8e6b600
  15. 15. And why might you care?- the organized information is great tool for those that try to disorganize it- they are poisoning “our web”, including twitter, facebook- and its hard to see in the fog they are causing- it is just matter of time when they start poisioning linked data too
  16. 16. What do we do at
  17. 17. - is a “personal writing assistant”- suggesting content while you write (your blog)- analyzing your text- connecting it with background knowledge, other stories on the web, images- you choose what suggestions to include- to make your writing more informative, vivid and useful
  18. 18. Opening up the hood
  19. 19. the reality
  20. 20. How it works Content suggestionsPlain text Semantic Analysis (article) search Linked data RSS feeds
  21. 21. Main design goals- Input is meaningful chunk of text (not a keyword or a phrase)- Input is (semi) English language- Has to work across all domains in the open world - music, celebrities, finance, entertainment, politics, gardening, parenting, …
  22. 22. Analysis pipeline Known phrasesNamed Entity extraction Extraction (aho-corasick) Triple store Surface form features evaluation Statistical comparison to background knowledge Semantic coherence and hand-tuned heuristics etc. Disambiguated entities
  23. 23. Analysis pipeline Known phrases Named Entity extraction Extraction (aho-corasick)Categorization to Dmoz Triple store Surface form features evaluation Statistical comparison to background knowledge Semantic coherence and hand-tuned heuristics etc.Categories Ambigious named entities Disambiguated entities
  24. 24. Background knowledge- Data from Wikipedia, MusicBrainz, Freebase… and world wild web- Includes linguistical and semantical properties + unstructured data- Present in two forms: - in “original” custom built triple store on top of MySQL (150 GB) - processed into 7 GB optimized “memory mapped dump”
  25. 25. Background knowledge- 7M mined and linked up entities and concepts Triple store- 30M aliases- Refreshed about once a month - want to make it real-time- Input data quality is really important etc.
  26. 26. After analysisText SOLR Related articles articles SOLR Images images
  27. 27. Example SOLR query
  28. 28. boost((( wiki_entities:Health insurancewiki_entities:Medical underwriting wiki_entities:United Stateswiki_entities:Affordable Care Act wiki_entities:Barack Obamawiki_entities:Lifetime (TV network) wiki_entities:Insurancewiki_entities:Preventive medicine wiki_entities:Childwiki_entities:Patient Protection and Affordable Care Act ) ^3.0)(text:zemhealthinsurq^0.68 text:health^0.62 text:premium^0.36text:zeminsurcompaniq^0.56 text:increas^0.29 text:rate^0.27text:zemhealthinsurcompaniq^0.35 text:zempreventcareq^0.26text:medic^0.26 text:compani^0.23 text:obamacar^0.21text:todai^0.21 text:polici^0.21 text:care^0.19 ) ^105.0((dmoz_categories:Top/Business/Financial_Services/Insurance/Agents_and_Marketers/Healthdmoz_categories:Top/Business/Financial_Services/Insurance/Agents_and_Marketers/Health/United_Statesdmoz_categories:Top/Business/Financial_Services/Insurance/Agents_and_Marketers/Health/United_States/California) ^0.1),(1 - 0.2) * sqrt(1.0/(1.15E-8*float(1285185600000 - date(published_datetime)ms)+1.0)) + 0.2)
  29. 29. Solr- We adapted Solr for “query by document”- 52% precision (at 10) on internal evaluations - plain Lucene MLT comes to 44% - difference is from “bag of terms” approach over “bag of words” (terms coming from analysis step)- Our live index is 5M articles- Solr is really not optimized to handle 50 terms in a single query
  30. 30. Lucene plain “More Like This”
  31. 31. Metrics & tests- Every part of the system is being constantly evaluted- Precision/recall at 5 different points in the system- Mostly bi-weekly releases of new datasets and the engine
  32. 32. Overview- We do pretty deep processing to deliver simple user experience of “personal authoring assistant”- And everything is available over the web API - tagging - named entity recognition and disambiguation to Linked Open Data URIs
  33. 33. What API offers?• Tags Most used• Categories• Concepts and entities Most interesting• Related articles• Related images
  34. 34. So mash-ups happen...
  35. 35. Some API users
  36. 36. We are just one of the many people offering services based on large amounts of web dataeach spending man-years trying to organize their data, trying to offer best possible service
  37. 37. now back to the bad guys
  38. 38. Job openingYou will get a spreadsheet with 180 blog url’s andlogins. You will log into each blog and schedule 2posts per week ...You will spice up every post with images and/orrelated links within the content, using a Wordpressplugin called Zemantahttps://www.odesk.com/jobs/Wordpress-Blog-Poster_~~c8c04549b8e6b600
  39. 39. Theres more than meets the eye
  40. 40. Gather search terms Analyze → Find / create(extensions, logs, guess) what people search for? such content Pull additional content Use Zemanta or OpenCalais Cover your tracks from Freebase to add tags, images, links Publish Amazon Mechanical Turk Use Zemanta to find to post comments Profit? similar blogs and links back to your site
  41. 41. Warnings- Ive seen no single system using the whole pipeline as described, however all parts were found in the wild- Examples used are from all kinds of sites – good, bad and ugly- I am not trying to imply that all of the steps in the diagram are bad, but they can be used by bad guys efficiently
  42. 42. Gather search terms Analyze → Find / create(extensions, logs, guess) what people search for? such content Pull additional content Use Zemanta or OpenCalais Cover your tracks from Freebase to add tags, images, links Publish Amazon Mechanical Turk Use Zemanta to find to post comments Profit? similar blogs and links back to your site
  43. 43. Finding their keywords, niches- Domain expertise- Users like to install extensions and say “yes”- You observe referrers on sites you control- You buy the data on the black market
  44. 44. The sophisticated part of the market“Demand Media relies on a proprietary algorithm to help editors best determine what subjects their writers should tackle.”Factors: - Keyword competition - Revenue - Driving traffic to/from existing conent http://emediavitals.com/article/16/demand-media-s-content-assembly-line
  45. 45. Gather search terms Analyze → Find / create(extensions, logs, guess) what people search for? such content Pull additional content Use Zemanta or OpenCalais Cover your tracks from Freebase to add tags, images, links Publish Amazon Mechanical Turk Use Zemanta to find to post comments Profit? similar blogs and links back to your site
  46. 46. Find / create content- Steal- Take from “open article directories”- Have your own “content assembly line” like Demand Media
  47. 47. Open article directories
  48. 48. Gather search terms Analyze → Find / create(extensions, logs, guess) what people search for? such content Pull additional content Use Zemanta or OpenCalais Cover your tracks from Freebase to add tags, images, links Publish Amazon Mechanical Turk Use Zemanta to find to post comments Profit? similar blogs and links back to your site
  49. 49. Tһiѕ iѕ nοt the text you аre lookinɡ for.
  50. 50. Tһiѕ iѕ nοt the text you аre lookinɡ for.
  51. 51. Translate it to random language and back to EnglishÜbersetzen sie zufällig Sprache und wieder auf Englisch Language and translate it happen again in English Μεταφράστε αυτό σε δειγματοληπτικούς γλώσσα και πίσω στην αγγλική γλώσσα Translate this random language back to English Traduisez au langage aléatoire et revenir à langlais Translate to random language to English and back 它翻译成随机的语言和回英文Translate it back into the English language and random
  52. 52. Covering their tracks- Trying to fool search engines or people?- Search engines are catching up- Google Translate API is being closed due to “abuse”?- The trend is “rewriting” by human editors, procured on the global market
  53. 53. Gather search terms Analyze → Find / create(extensions, logs, guess) what people search for? such content Pull additional content Use Zemanta, OpenCalais Cover your tracks from Freebase to add tags, images, links Publish Amazon Mechanical Turk Use Zemanta to find to post comments Profit? similar blogs and links back to your site
  54. 54. Spammers say darndest things
  55. 55. Gather search terms Analyze → Find / create(extensions, logs, guess) what people search for? such contentPull additional content Use Zemanta or OpenCalais Cover your tracks from Freebase to add tags, images, links Publish Amazon Mechanical Turk Use Zemanta to find to post comments Profit? similar blogs and links back to your site
  56. 56. Remixing linked data and spam- Currently mostly the good guys are using Linked Data- However, its just too tempting to be left alone- Fully synthetic articles using factual information from linked data? – Using advanced tools to form proper natural language sentences and maybe even storyline?
  57. 57. Gather search terms Analyze → Find / create(extensions, logs, guess) what people search for? such content Pull additional content Use Zemanta or OpenCalais Cover your tracks from Freebase to add tags, images, links Publish Amazon Mechanical Turk Use Zemanta to find to post comments Profit? similar blogs and links back to your site
  58. 58. Publish- On hosted third party platforms - eating their resources- Platforms have hard time killing spammers- Smaller ones dont necessarily have the incentive- If they remove spammer too fast, it is easier for spammer to probe the limits - Platforms use “kill with delay”- Spam detection is resource intensive
  59. 59. Gather search terms Analyze → Find / create(extensions, logs, guess) what people search for? such content Pull additional content Use Zemanta or OpenCalais Cover your tracks from Freebase to add tags, images, links Publish Amazon Mechanical Turk Use Zemanta to find to post comments Profit? similar blogs and links back to your site
  60. 60. Valuable commentsAs I write this post, Zemanta is showing me 5 different articles that are related to my post. I could visit each one of these sites and reach out to the owner to see if they would be interested in linking to my post, or I could leave a valuable comment on the page and include a link back to my post. http://www.mainelyseo.com/zemanta-review-seo-link-building-with-the-zemanta-plugin/
  61. 61. - Guy in previous slide is honest and well- meaning- But what if you automate that via Amazon Mechanical Turk or oDesk?
  62. 62. Gather search terms Analyze → Find / create(extensions, logs, guess) what people search for? such content Pull additional content Use Zemanta or OpenCalais Cover your tracks from Freebase to add tags, images, links Publish Amazon Mechanical Turk Use Zemanta to find to post comments Profit? similar blogs and links back to your site
  63. 63. Profit?- sell ads- sell links- sell “fully developed site”- to the highest bidder
  64. 64. Search engines to the rescue?- Mahalo cut 10% of the staff the day after Google announced ranking changes- Demand Medias stock isnt doing that well anymore- However this is a never-ending story, well have co-evolution for foreseeable future
  65. 65. Ecosystem- Very sophisticated, large players - moving to more high quality content, video?- Small time operations - using more and more sophisticated tools available on the market cheaply (modern asymmetric warfare?)- Dark industry specifically building tools to poison the web and sell them to small time operators
  66. 66. Food for thought
  67. 67. Can we make spammers (and others) work for us, making linked data better? (think reCAPTCHA)
  68. 68. Could article directories be fruitfully used?eZineArticles.com, GoArticles.com, etc...
  69. 69. Find rewritten articles and use them as parallel corpus?
  70. 70. Could we use global workforce market more efficiently to get more linked data?
  71. 71. Thesis, antithesis, synthesis? http://xkcd.com/810/
  72. 72. Thank you!Questions?
  73. 73. Image sources http://www.flickr.com/photos/dzingeek/4587871752/ http://www.flickr.com/photos/25101572@N02/4393474025/ http://www.flickr.com/photos/billward/4740384434/ http://www.flickr.com/photos/jurvetson/542500748 http://www.flickr.com/photos/legofenris/4288913574 http://www.flickr.com/photos/ekilby/3733627940 http://www.flickr.com/photos/ekilby/3732799269/ http://www.flickr.com/photos/cipherswarm/38354452 http://xkcd.com/810/
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×