SemWeb install-fest presentation


Presentation for developers about what Zemanta API can do for you.

Published in: Technology, Sports
  • Cross-domain: we didn't start with the financial or health domain and then expand our algorithms; we started with cross-domain capabilities from day one.
  • Tags have no background meaning: they are not tied to any database and they are not normalized in any way. They are what you would expect a human who doesn't care about standardization or normalization to choose. For example, a text mentioning Apple, Android and Google might get "iPhone" as a tag, and "mobile web" as a tag, even when neither was mentioned anywhere.
  • Disambiguation is done using background knowledge. For example, we distinguish between London the city in the UK, London in Ohio or Texas, and Jack London the writer.
  • We are big fans of Freebase and the Linking Open Data project.
  • Zigtag, Faviki, AdaptiveBlue, Zemanta, Yahoo, Freebase
  • Transcript of "SemWeb install-fest presentation"

    Building upon the Zemanta API
      Andraz Tori, CTO, [email_address], Twitter: andraz

    Overview
      • General purpose
      • Functionality
      • Examples, demos & use-cases

    What does it do?

    A Stargate
      • Computer Processable Data
      • Human Understandable Text

    Initial design
      • Input: a chunk of text
      • Domain agnostic!
      • Avoid proprietary entity identifiers or taxonomies
      • Standard response formats: JSON, XML, RDF/XML

    What does it give you?
      • Tags
      • Categories
      • Concepts and entities
      • Related articles
      • Related images
      Slide labels: Most used, Most interesting, Most obvious

    Tags
      • Words, phrases
      • "Interesting" tags
          • Explicitly mentioned
          • What the text is about as a whole
          • What concepts were not mentioned, but could be relevant (for SEO)

    Categories
      • Deep hierarchy (100k categories)
      • Customized smaller taxonomies
      • Good for content organization, ad targeting, etc.

    Categories example
      Branded "unfilmable", Watchmen - the cult graphic novel about a group of retired, flawed superheroes - has finally made it to the big screen. From the second the opening credits roll, it is clear Watchmen is not your typical superhero movie.
      An ageing vigilante, The Comedian, is attacked in his high-rise apartment before being hurled 10 storeys to his death... in graphic slow motion. What follows is a two-and-three-quarter hour epic that centres on an outlawed group of deeply flawed former heroes as a Cold War Doomsday clock inches ever closer to midnight and nuclear apocalypse.
      First published in 12 parts by DC Comics in 1986, Watchmen was written by the British team of Alan Moore and illustrator Dave Gibbons.

    Categories
      • Top/Society/History/By_Time_Period/Twentieth_Century/Cold_War (0.11)
      • Top/Arts/Comics/Reviews (0.10)
      • Top/Society/History/By_Time_Period (0.08)
      • Top/Arts/Comics (0.08)
      • Top/Society/History/By_Time_Period/Twentieth_Century (0.08)
      • Top/Society/History (0.08)
      • Top/Shopping/Publications/Books (0.08)
      • Top/Shopping/Publications/Books/Fiction (0.08)
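
The category list above mixes broad categories with their more specific descendants. As a minimal sketch of one way to post-process such a list, the Python below parses lines in the "path (score)" form shown on the slide and keeps only the most specific categories. The line format is taken from the slide's display, not from the API's actual response structure, and the helper names `parse_category` and `most_specific` are hypothetical.

```python
import re
from typing import List, Tuple

# Category lines as displayed on the slide: "<path> (<score>)".
SLIDE_LINES = [
    "Top/Society/History/By_Time_Period/Twentieth_Century/Cold_War (0.11)",
    "Top/Arts/Comics/Reviews (0.10)",
    "Top/Society/History/By_Time_Period (0.08)",
    "Top/Arts/Comics (0.08)",
    "Top/Society/History/By_Time_Period/Twentieth_Century (0.08)",
    "Top/Society/History (0.08)",
    "Top/Shopping/Publications/Books (0.08)",
    "Top/Shopping/Publications/Books/Fiction (0.08)",
]

LINE_RE = re.compile(r"^(\S+)\s+\(([0-9.]+)\)$")


def parse_category(line: str) -> Tuple[str, float]:
    """Split one displayed line into (path, score)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError("unexpected category line: %r" % line)
    return match.group(1), float(match.group(2))


def most_specific(categories: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Drop every category that is an ancestor of another one in the list."""
    paths = [path for path, _ in categories]
    result = []
    for path, score in categories:
        # An ancestor is a path that another path extends by "/<segment>".
        is_ancestor = any(other.startswith(path + "/") for other in paths)
        if not is_ancestor:
            result.append((path, score))
    return result


leaves = most_specific([parse_category(line) for line in SLIDE_LINES])
for path, score in leaves:
    print(path, score)
```

For the Watchmen text this leaves three categories: Cold War history, comics reviews, and fiction books.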

    Concepts and entities
      • Identify relevant concepts and entities
      • All disambiguated!
      • At least one URL for each concept, possibly more

    How we disambiguate
      • Use knowledge from Wikipedia, Freebase, Dmoz, third party databases...
      • Mine the web
      • Use knowledge from choices of our users
      • Use both semantic data and statistics based methods
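
To make the idea of disambiguating with background knowledge concrete, here is a toy Python sketch that picks a candidate entity by counting context words shared with the text. It is not Zemanta's method, which the slides do not describe in detail; the candidate table, its context words, and the function name `disambiguate` are all made up for illustration.

```python
import re
from typing import Dict, Set

# Hypothetical background knowledge: candidate entity -> typical context words.
CANDIDATES: Dict[str, Set[str]] = {
    "London, UK": {"england", "thames", "uk", "britain", "city"},
    "London, Ohio": {"ohio", "madison", "county"},
    "Jack London": {"writer", "novel", "author", "klondike"},
}


def disambiguate(text: str, candidates: Dict[str, Set[str]]) -> str:
    """Return the candidate whose context words overlap most with the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    # On a tie, max() keeps the first candidate in insertion order.
    return max(candidates, key=lambda name: len(tokens & candidates[name]))


print(disambiguate("London wrote the novel after his time in the Klondike", CANDIDATES))
print(disambiguate("Flights to London land near the Thames in England", CANDIDATES))
```

A real system would combine many more signals (link structure, user choices, statistics mined from the web), as the bullets above suggest.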

    Linking to...
      • Traditional
      • Semantic

    How to build upon this
      • Step 1: We give you exact identifiers
      • Step 2: Then you look up the information about them (connections, images, …) in your or third party databases
      • Step 3: ?
      • Step 4: Profit!

    Discovery example
      A US Airways Airbus A320 passenger plane carrying 135 people has crashed into the Hudson River in New York, the Federal Aviation Administration says.
      Rescue boats and ferries are alongside the plane attempting to pick up people standing on both of the plane's wings.
      The plane, which the FAA said was flight 1549 from LaGuardia Airport to Charlotte, is partially submerged.
      It is not known how the plane came to land in the river, but the FAA said it might have been due to a bird strike.

    You get
      A US Airways Airbus A320 passenger plane carrying 135 people has crashed into the Hudson River in New York, the Federal Aviation Administration says.
      Rescue boats and ferries are alongside the plane attempting to pick up people standing on both of the plane's wings.
      The plane, which the FAA said was flight 1549 from LaGuardia Airport to Charlotte, is partially submerged.
      It is not known how the plane came to land in the river, but the FAA said it might have been due to a bird strike.
      Legend: entities, concepts

    Or more precisely...
      • LaGuardia Airport
          http://rdf.freebase.com/ns/guid/9202a8c04000641f800000000018f654
          http://dbpedia.org/resource/LaGuardia_Airport
      • Federal Aviation Administration
          http://rdf.freebase.com/ns/guid/9202a8c04000641f8000000000017df0
          http://dbpedia.org/resource/Federal_Aviation_Administration
      • Hudson River
          http://rdf.freebase.com/ns/guid/9202a8c04000641f800000000005ebb5
          http://dbpedia.org/resource/Hudson_River
      • Airbus A320 family
          http://rdf.freebase.com/ns/guid/9202a8c04000641f800000000012f918
          http://dbpedia.org/resource/Airbus_A320_family
      • Bird strike
          http://rdf.freebase.com/ns/guid/9202a8c04000641f80000000004744df
          http://dbpedia.org/resource/Bird_strike
      • US Airways
          http://rdf.freebase.com/ns/guid/9202a8c04000641f80000000001b4dc5
          http://dbpedia.org/resource/US_Airways
      • New York
          http://rdf.freebase.com/ns/guid/9202a8c04000641f800000000054dd5d
          http://dbpedia.org/resource/New_York
      • Charlotte, North Carolina
          http://rdf.freebase.com/ns/guid/9202a8c04000641f800000000006e148
          http://dbpedia.org/resource/Charlotte%2C_North_Carolina
      • Ferry
          http://rdf.freebase.com/ns/guid/9202a8c04000641f8000000000063292
          http://dbpedia.org/resource/Ferry
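
Once each entity comes with exact identifiers, the links can be indexed by entity and by source before looking anything up elsewhere. The Python sketch below does this for a few of the (label, URI) pairs from the slide. The flat pair list is an assumed input shape, not the API's actual response format, and `group_links` and `dbpedia_name` are hypothetical helpers.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple
from urllib.parse import unquote, urlparse

# A few (label, URI) pairs copied from the slide.
PAIRS = [
    ("Hudson River", "http://rdf.freebase.com/ns/guid/9202a8c04000641f800000000005ebb5"),
    ("Hudson River", "http://dbpedia.org/resource/Hudson_River"),
    ("Charlotte, North Carolina", "http://rdf.freebase.com/ns/guid/9202a8c04000641f800000000006e148"),
    ("Charlotte, North Carolina", "http://dbpedia.org/resource/Charlotte%2C_North_Carolina"),
]


def group_links(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, str]]:
    """Index URIs as {entity label: {source host: URI}}."""
    grouped: Dict[str, Dict[str, str]] = defaultdict(dict)
    for label, uri in pairs:
        grouped[label][urlparse(uri).netloc] = uri
    return dict(grouped)


def dbpedia_name(uri: str) -> str:
    """Turn a DBpedia resource URI into a readable name."""
    last_segment = urlparse(uri).path.rsplit("/", 1)[-1]
    return unquote(last_segment).replace("_", " ")


links = group_links(PAIRS)
print(links["Hudson River"]["dbpedia.org"])
print(dbpedia_name(links["Charlotte, North Carolina"]["dbpedia.org"]))
```

With the links grouped this way, an application can pick whichever identifier matches the database it wants to query next.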

    You can query relationships
      http://test.infoblow.zemanta.com/infoblow/galaxy/

    Or more complex ones...

    Concepts and entities use cases
      • Quick 'overviews' of topics
      • Discovery-supporting user interfaces
      • Automatic deep information delivery (hovers, widgets)

    Balloons example
      • Deliver deep information on exact concepts and entities

    Fantastic public graph
      • Information about concepts/entities
      • Types: human, building, location...
      • Relationships with other entities
      • Hard data: dates, places, amounts

    Connected Dream? September 2008

    Connected Dream? July 2009

    Opportunities in leveraging linked data
      • There are internal and external benefits of linking into a larger pool of exact data
      • Pulling together custom data becomes orders of magnitude easier
      • However, strong success stories are still missing

    Related articles
      • 20k blogs and media sites
      • You can provide your own list of feeds to recommend from
      • Or use our 'global whitelisted pool'

    Related articles use cases
      • Better experience for the readers
      • Information discovery (for authors)
      • Creating interlinked mini-communities (example: bloggers using our tool to discover others in the niche)

    Related images
      • From Wikipedia, Flickr, Daylife, Amazon, Last.fm, Snooth, social networks
      • We filter out totally unacceptable licenses and keep the rest
      • Each image has its license spelled out; the developer/author chooses

    Zemanta API
      • http://developer.zemanta.com
      • Examples in Java, JavaScript, Python, Ruby, PHP, Perl, C#...
      • JavaScript SDK for quick custom CMS integration
      • Up to 10,000 requests/day free!

    Ease of API use
      # Python 2 example: send a text to zemanta.suggest and print the response
      import urllib, simplejson, pprint

      # Request parameters: JSON output, DMOZ categories, RDF links
      args = {'format': 'json',
              'method': 'zemanta.suggest',
              'api_key': 'np9cbnby9x8tsc47recwuhqm',
              'return_categories': 'dmoz',
              'return_rdf_links': 1,
              'text': ''' Branded "unfilmable", Watchmen - the cult graphic novel about a group of retired, flawed superheroes - has finally made it to the big screen. From the second the opening credit An ageing vigilante, The Comedian, is attacked ...
      '''}
      args_enc = urllib.urlencode(args)
      # Passing the encoded arguments as data makes urlopen send a POST request
      response_raw = urllib.urlopen("http://api.zemanta.com/services/rest/0.0/", args_enc).read()
      response = simplejson.loads(response_raw)
      pprint.pprint(response)
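
The snippet on this slide targets Python 2 (`urllib.urlencode`, `simplejson`). As a hedged sketch, the same request could be written for Python 3 with only the standard library, as below. The endpoint and parameter names are taken from the slide; the split into `build_request` and `suggest` is an assumption made here so the request can be inspected without contacting the service, and `YOUR_API_KEY` is a placeholder.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint as given on the slide.
ENDPOINT = "http://api.zemanta.com/services/rest/0.0/"


def build_request(text: str, api_key: str) -> Request:
    """Build the POST request for zemanta.suggest without sending it."""
    args = {
        "format": "json",
        "method": "zemanta.suggest",
        "api_key": api_key,
        "return_categories": "dmoz",
        "return_rdf_links": 1,
        "text": text,
    }
    # Supplying a data payload makes urllib issue a POST.
    return Request(ENDPOINT, data=urlencode(args).encode("utf-8"))


def suggest(text: str, api_key: str) -> dict:
    """Send the request and decode the JSON response (needs network access)."""
    with urlopen(build_request(text, api_key)) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_request("Watchmen is not your typical superhero movie.", "YOUR_API_KEY")
print(request.get_method(), request.full_url)
```

Calling `suggest(...)` performs the actual network request; `build_request(...)` alone is enough to check which parameters would be sent.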

    Works for
      • All kinds of texts (not just financial or journalistic articles)
      • Tweets!
      • Wherever you need to go from text documents to something structured to put into your algorithm/data store

    Some API users

    How is the API used?
      • Place extraction and disambiguation – used by Outside.in
      • Analysis of tweets – used by Klout.net
      • Custom categorization – used by Slideshare
      • Semantic tagging – used by Faviki

    CommonTag
      Initiative by AdaptiveBlue, DERI (NUI Galway), Faviki, Freebase, Yahoo!, Zemanta, and Zigtag
      • Exact tagging
      • RDFa as a transport layer
      • Freebase & LOD as vocabularies
      • Full-circle ecosystem from day one (publishers, services, better search, better browsing)

    The next web
      "... the next web will be like a great party host, introducing us to each other and bringing us together into meaningful conversation." Marta Strickland, Organic

    The future?
      Zemify me up, Scotty!
      Andraz Tori, [email_address], Twitter: andraz

    Image attributions
      • http://www.flickr.com/photos/constanzavolare/2475833775/in/photostream/ CC by Constanza Volare