AWS Customer Presentation - SemantiNet


Published in: Technology, Travel

  1. SemantiNet & Amazon Web Services. Tal Muskal, CTO & Founder. [email_address]
  2. About Headup
     Platform for organizing web content:
       • Semantic content analysis: semantic indexing & querying
       • Topic page generation
       • Organize content by meaning rather than keywords
  3. Organize content by meaning rather than keywords: unique challenges & opportunities
       • The same term has many different meanings.
       • Example: Twilight – 2008 film. Over 50 meanings in Wikipedia, 8 films.
  4. Organize content by meaning rather than keywords: unique challenges & opportunities
       • The same meaning has different names.
       • Example: Bridge to Nowhere (disambiguation). Matched: Knik Arm Bridge, Alaska.
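The two problems on slides 3 and 4 (one name with many meanings, one meaning with many names) can be illustrated with a small sketch. This is not SemantiNet's algorithm, which the slides do not describe; it is a minimal context-overlap heuristic, and the alias table, entity identifiers, and context words below are made-up illustrative data.

```python
from typing import Optional, Set

# Hypothetical alias table: lower-cased surface form -> candidate entity IDs.
ALIASES = {
    "twilight": ["Twilight_(2008_film)", "Twilight_(novel)", "Twilight_(time_of_day)"],
    "bridge to nowhere": ["Knik_Arm_Bridge", "Gravina_Island_Bridge"],
    "knik arm bridge": ["Knik_Arm_Bridge"],
}

# Hypothetical context words associated with each entity.
CONTEXT = {
    "Twilight_(2008_film)": {"film", "movie", "pattinson", "trailer"},
    "Twilight_(novel)": {"book", "novel", "meyer", "vampire"},
    "Twilight_(time_of_day)": {"sunset", "dusk", "sky"},
    "Knik_Arm_Bridge": {"knik", "anchorage", "arm"},
    "Gravina_Island_Bridge": {"gravina", "ketchikan"},
}


def resolve(term: str, page_words: Set[str]) -> Optional[str]:
    """Pick the candidate entity whose context overlaps most with the page.

    Returns None for unknown terms. Ties go to the first listed candidate.
    """
    candidates = ALIASES.get(term.lower(), [])
    if not candidates:
        return None
    return max(candidates, key=lambda e: len(CONTEXT.get(e, set()) & page_words))
```

With this toy data, `resolve("Twilight", {"film", "pattinson", "vampire"})` picks the film, and both "Bridge to Nowhere" (on a page mentioning Anchorage) and "Knik Arm Bridge" resolve to the same entity.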
  5. Organize content by meaning rather than keywords: unique challenges & opportunities
       • Indexing based on implicit object properties:
           • Categories/Sections: "basketball players", "movies in theatre", "peace talks 2010"
           • Locations: "Negev", "Upper East Side, Manhattan", "Yarkon Park"
       • Generating rich topic pages:
           • For Movie: include Actors, Trailers, Reviews, etc.
           • For Location: include Map, Nearby points of interest, etc.
           • For Band: include Albums, Music Video Clips, Lyrics, etc.
           • Includes articles from our index
  6. Headup Topic Pages – powering a movies blog
  7. Headup Topic Pages – powering JPost
  8. Organize content by meaning rather than keywords: unique challenges & opportunities
     Web content semantic analysis:
       • Long tail of entities
       • Ambiguity problems
       • Organizing should be based on implicit data too
       • All require prior knowledge about entities
  9. World Knowledge Graph
       • Contains over 150M entities in different domains: Bands, Places, Politicians, Movies, TV Shows, etc.
       • Contains over 1B labeled connections between entities
       • Stored in our proprietary graph DB
       • Dynamic, evolving data:
           • New terms (new movies…)
           • New meanings for existing terms
           • New connections between entities
       • Generating it heavily utilizes Elastic Map Reduce (Pig / Hadoop)
  10. SemantiNet Technology
      Components: Semantic Analysis, Knowledge Graph, Semantic Index, Query Interface
      Diagram labels: entities in page (+ connections); relationships between entities; context, semantics
        • Over 20 sources: Freebase, Wikipedia, LinkedMDB…
        • 150M entities, over 1Bn connections, highly compressed
        • Very high speed read access
        • Updated regularly using Elastic Map Reduce
        • Indexing millions of web pages
        • Based on URIs rather than keywords
        • Index includes metadata: properties, categories, geo-coordinates
        • Based on the knowledge graph
        • Understand meaning using context
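An index "based on URIs rather than keywords", carrying category metadata, might look roughly like the sketch below. The class name, method names, and the `ex:` URIs are hypothetical; the slides say nothing about the real index layout. The point is only that a page can be found through a category of an entity it mentions, even when the category words never appear in the page text.

```python
from collections import defaultdict


class SemanticIndex:
    """Toy inverted index keyed by entity URI instead of by keyword."""

    def __init__(self):
        self._pages_by_entity = defaultdict(set)       # entity URI -> page URLs
        self._entities_by_category = defaultdict(set)  # category -> entity URIs

    def add_entity(self, uri, categories):
        """Register entity metadata (here only categories)."""
        for category in categories:
            self._entities_by_category[category].add(uri)

    def add_page(self, url, entity_uris):
        """Index a page under every entity found in it."""
        for uri in entity_uris:
            self._pages_by_entity[uri].add(url)

    def pages_for_entity(self, uri):
        return set(self._pages_by_entity.get(uri, set()))

    def pages_for_category(self, category):
        """Find pages via an implicit property of the entities they mention."""
        pages = set()
        for uri in self._entities_by_category.get(category, set()):
            pages |= self._pages_by_entity.get(uri, set())
        return pages


index = SemanticIndex()
index.add_entity("ex:Michael_Jordan", ["basketball players"])
index.add_entity("ex:Yarkon_Park", ["parks"])
index.add_page("http://example.com/a", ["ex:Michael_Jordan"])
index.add_page("http://example.com/b", ["ex:Yarkon_Park", "ex:Michael_Jordan"])
```

Here `index.pages_for_category("basketball players")` returns both example pages, although neither was indexed under that phrase directly.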
  11. SemantiNet Architecture (diagram)
        • Client-facing layer: Web Browser, CMS, Amazon Load Balancer, Frontend instances, Cache
        • Workers perform atomic tasks: Crawl, Model, Analyze Text, Index & Render
        • Instance sizes shown: x.large, large, small
        • Queue: SQS
        • Data store: S3, RDS, SimpleDB, Graph
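The "workers perform atomic tasks" pattern can be sketched with a local queue standing in for SQS. Everything here is an assumption made for illustration: the message shape, the `handle` placeholder, and the idea that the five tasks form a linear pipeline in the order listed on the slide. A real deployment would use the SQS API rather than Python's in-process `queue` module.

```python
import queue

# Task names from the slide; treating them as a linear pipeline is an assumption.
STAGES = ["crawl", "model", "analyze_text", "index", "render"]


def handle(stage, payload):
    """Placeholder for the real work: just record which stage ran."""
    return payload + [stage]


def run_worker(q, done):
    """Take one message at a time, do one atomic task, enqueue the next task."""
    while True:
        try:
            stage, payload = q.get_nowait()
        except queue.Empty:
            return  # nothing left to do
        result = handle(stage, payload)
        next_position = STAGES.index(stage) + 1
        if next_position < len(STAGES):
            q.put((STAGES[next_position], result))
        else:
            done.append(result)


tasks = queue.Queue()
tasks.put(("crawl", ["http://example.com"]))
finished = []
run_worker(tasks, finished)
```

Because each message names a single task, any worker of any size can pick up any step, which is what lets the worker pool grow or shrink independently of the frontends.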
  12. Monitoring and Elasticity
  13. Elasticity
        • The number of instances changes at least once a day
        • Cost optimization using a combination of Spot and On-Demand instances
        • RDS is upgraded when under load at peak times
        • Elasticity allows fast iterations and experiments to find the optimal setup
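One way to combine On-Demand and Spot capacity is to keep a small On-Demand baseline and cover the rest of the demand with Spot instances. The slides do not state the actual policy, so the function below is only a sketch; its name, parameters, and the numbers in the example are hypothetical.

```python
import math


def plan_fleet(backlog, msgs_per_instance_hour, target_hours,
               on_demand_baseline, max_instances):
    """Size a worker fleet from the queue backlog.

    backlog: messages waiting in the queue
    msgs_per_instance_hour: assumed throughput of one worker
    target_hours: how quickly the backlog should be drained
    on_demand_baseline: instances kept On-Demand regardless of load
    max_instances: hard cap on the fleet size
    """
    needed = math.ceil(backlog / (msgs_per_instance_hour * target_hours))
    total = min(max(needed, on_demand_baseline), max_instances)
    on_demand = min(on_demand_baseline, total)
    return {"on_demand": on_demand, "spot": total - on_demand}
```

For example, `plan_fleet(10000, 500, 2, 2, 20)` needs 10 workers and returns 2 On-Demand plus 8 Spot; with an empty queue the fleet falls back to the 2-instance baseline.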
  14. Lessons learned
        • Learning curve for tweaking costs
        • IT is easier, but very different
        • Modularity is critical in cloud environments:
            • Flexibility in solving different bottlenecks in the data processing pipeline
            • Allows sliding along the cost/performance curve based on elasticity
        • Store processed results:
            • Cache
            • Intermediate steps of processing
        • One of the best APIs
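The "store processed results" lesson can be sketched as a cache keyed by the step name and a hash of the step's input, so a re-run only recomputes steps whose input changed. The `StepCache` class is hypothetical, and the in-memory dict stands in for whatever durable store (for example S3) a real pipeline would use.

```python
import hashlib
import json


class StepCache:
    """Memoize pipeline steps by (step name, hash of JSON-serializable input)."""

    def __init__(self, store=None):
        self.store = {} if store is None else store  # stand-in for a durable store
        self.misses = 0  # number of times a step actually had to run

    def run(self, step_name, func, data):
        digest = hashlib.sha256(
            json.dumps(data, sort_keys=True).encode("utf-8")
        ).hexdigest()
        key = step_name + ":" + digest
        if key not in self.store:
            self.misses += 1
            self.store[key] = func(data)
        return self.store[key]


cache = StepCache()
tokens = cache.run("tokenize", str.split, "organize content by meaning")
tokens_again = cache.run("tokenize", str.split, "organize content by meaning")
count = cache.run("count", len, tokens)
```

The second `tokenize` call is served from the store, so only two steps actually execute. Keeping each intermediate result also means a failed or changed later step can restart from the last stored output instead of from the crawl.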
  15. Thank you