Hello, my name is Tal Muskal and I am the CTO & founder of SemantiNet. We develop solutions for organizing web content using semantic technologies. We have been around since 2006, and we have been using Amazon Web Services for about two years now. I would like to present some of the challenges that we tackle and how Amazon helps us overcome them.
So, as I said, we develop a platform for organizing web content, and we do this by utilizing semantic content analysis. OK, so what do I mean by semantic analysis or semantic querying? And how does it help with organizing content? I will give you a few examples of what that means.
When analyzing this blog post, for example, the term “Twilight” may refer to many different things. It may refer to one of the 10 music albums or dozens of songs with that name. It may also refer to the place in Pennsylvania, Twilight the game developer, or the Star Trek episode with that name. Or one of the 8 films with that name… or maybe the original comics which this movie was based on. It is important to recognize the right entity, because later we want to use its properties for indexing. If we are wrong about the specific entity, we will also use the wrong properties, which amplifies the problem.
Another quick example. Here the “Knik Arm Bridge” is referred to as the “Bridge to Nowhere”, and that specific bridge can also be referred to as “Don Young's Way”, for example. So the same object may have many different names.
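The two problems above (one name with many meanings, one meaning with many names) can be illustrated with a minimal sketch. It assumes a toy in-memory graph and a simple context-overlap score; the entity ids, the `related` terms and the scoring rule are all hypothetical and are not SemantiNet's actual method.

```python
# Toy knowledge graph: each entity has a unique id, a set of names (aliases)
# and a set of related terms. All entries are illustrative, not real data.
ENTITIES = {
    "twilight_2008_film": {
        "names": {"twilight"},
        "related": {"vampire", "film", "movie"},
    },
    "twilight_star_trek_episode": {
        "names": {"twilight"},
        "related": {"star trek", "episode"},
    },
    "knik_arm_bridge": {
        "names": {"knik arm bridge", "bridge to nowhere", "don young's way"},
        "related": {"alaska", "bridge"},
    },
}


def candidates(mention):
    """Return the ids of every entity that can be called by this name."""
    key = mention.lower()
    return [eid for eid, entity in ENTITIES.items() if key in entity["names"]]


def resolve(mention, context_words):
    """Pick the candidate whose related terms overlap most with the context.

    Ties fall back to the first candidate found; returns None if no entity
    carries this name.
    """
    context = {word.lower() for word in context_words}
    found = candidates(mention)
    if not found:
        return None
    return max(found, key=lambda eid: len(ENTITIES[eid]["related"] & context))


print(resolve("Twilight", ["vampire", "movie", "review"]))  # twilight_2008_film
print(resolve("Bridge to Nowhere", ["Alaska"]))             # knik_arm_bridge
```

Once a mention is resolved to an entity id, every alias of that entity leads to the same properties, which is why a wrong resolution carries the wrong properties along with it.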
Once we understand which entities a web page contains, we can organize this content based on implicit facts: facts that a reader would usually know before reading the text, but that are not explicitly stated, like the fact that LeBron James is a basketball player, or that Bibi Netanyahu is related to the peace talks process. This allows us to create special pages that aggregate content based on object properties. For example, a page that aggregates articles about “Nobel prize winners”, a page about a neighborhood (which will contain articles that mention landmarks in this neighborhood), or even a page that aggregates articles about “movies in theatres near me”. We also generate pages about specific entities that contain contextually relevant metadata.
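As a rough illustration of aggregating by implicit properties, here is a minimal sketch that assumes the analysis step has already mapped each article to entity ids. The article ids, entity ids and property labels are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical entity properties, standing in for the knowledge graph.
ENTITY_PROPERTIES = {
    "lebron_james": {"basketball player"},
    "benjamin_netanyahu": {"politician", "peace talks"},
}

# Hypothetical analysis output: which entities each article mentions.
ARTICLE_ENTITIES = {
    "article-1": ["lebron_james"],
    "article-2": ["benjamin_netanyahu"],
    "article-3": ["lebron_james", "benjamin_netanyahu"],
}


def build_property_index(article_entities, entity_properties):
    """Map each property to the articles mentioning an entity that has it."""
    index = defaultdict(set)
    for article, entities in article_entities.items():
        for entity in entities:
            for prop in entity_properties.get(entity, ()):
                index[prop].add(article)
    return index


def topic_page(index, prop):
    """Return the sorted article ids aggregated under one property."""
    return sorted(index.get(prop, ()))


index = build_property_index(ARTICLE_ENTITIES, ENTITY_PROPERTIES)
print(topic_page(index, "peace talks"))  # ['article-2', 'article-3']
```

Note that the phrase “peace talks” never has to appear in an article's text: the article is listed because an entity it mentions carries that property.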
Here we can see a topic page for Sir Michael Caine that we generated for a UK movie blog called HeyUGuys. As you can see, we include a short abstract; films; directors and actors who worked with him in many films (and on which films, in this little popup if you hover over one of them); related topics; interview videos; images; and friends who are fans of Michael Caine. And of course, articles from this site that are related to this topic.
Or we can generate a broader topic page, such as this “Peace talks” page we generated for The Jerusalem Post, containing all the articles around this topic (even though the words “peace talks” do not necessarily appear in the articles).
So just to summarize, these are the main challenges that we have in semantic analysis. They all require prior knowledge about entities, and I want to talk about this for a minute.
This prior knowledge is what we call the world knowledge graph. It is a huge network of named objects from the real world (such as people, places, bands, movies, companies, even things like dog breeds and chemical elements), the connections between them, and their properties (such as birthdate, height, or, if it’s a movie, the release date). Since resolving ambiguity requires performing very high speed graph operations over this huge data source, we had to develop our own graph DB, and we update the data on a weekly basis. For this we are using Elastic Map Reduce: instead of generating a graph over a week (which would result in week-old data), we generate it in a day every week. It costs the same and the data is fresher, and theoretically we can reduce it to one hour at the same cost.
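The "same cost, fresher data" argument is simple instance-hour arithmetic. The sketch below assumes the job parallelizes perfectly and that billing is purely per instance-hour; the baseline of one instance and the price of 1.0 are hypothetical, and real jobs will lose some efficiency to startup and coordination overhead.

```python
HOURS_PER_WEEK = 7 * 24


def instances_needed(baseline_instances, baseline_hours, target_hours):
    """Instances required to finish the same work in target_hours,
    assuming the work scales linearly with the number of instances."""
    total_instance_hours = baseline_instances * baseline_hours
    return total_instance_hours / target_hours


def cost(instances, hours, price_per_instance_hour):
    """Total cost under simple per-instance-hour billing."""
    return instances * hours * price_per_instance_hour


# Hypothetical baseline: one instance working for a whole week.
for target_hours in (HOURS_PER_WEEK, 24, 1):
    n = instances_needed(1, HOURS_PER_WEEK, target_hours)
    print(target_hours, n, cost(n, target_hours, price_per_instance_hour=1.0))
```

Under these assumptions every row has the same total cost; only the wall-clock time, and therefore the age of the data, changes.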
So, these are the basic pillars of our technology: 1) Semantic Analysis: taking web pages and extracting entities and facts, and resolving context using the knowledge in the graph. 2) Knowledge Graph: includes many sources and is highly compressed, compressed enough to fit in memory. 3) Semantic Index: gives us the ability to query based on object properties or different hierarchies (we could do a page for “tall married old British actors”). Facts from the index are fed back into the knowledge graph.
Let’s see a common flow in our system. There are different kinds of tasks that our system performs: crawling, modeling, analyzing, indexing and rendering a topic page (some depend on others). The front-facing layer gets the request; it is very responsive and never blocks. Either there is a result ready, or we need to perform the task. If the results are not in S3, the job is queued in SQS and recorded in RDS (RDS contains the details and state of the task). The workers are spot instances. They receive different tasks from SQS, and they come in different sizes, since different tasks require different capabilities. Our workers probe their own specification when booting and decide what kinds of jobs they can take. When a worker finishes, it writes the task status to RDS, and renders results to S3 or writes index analysis results to SimpleDB. On the next request for that task, the data is ready in S3.
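The flow just described can be sketched with in-memory stand-ins. This is a minimal illustration only: a dict plays the role of S3, another dict the role of RDS, and a deque the role of SQS. No AWS API is called, and the function names, the task kinds per worker size and the result format are all hypothetical.

```python
from collections import deque

results = {}      # stands in for S3: finished results by task id
task_state = {}   # stands in for RDS: details and state per task
queue = deque()   # stands in for SQS: pending task ids

# Hypothetical mapping from worker size to the task kinds it can take.
CAPABILITIES = {
    "small": {"crawl"},
    "large": {"crawl", "analyze", "index"},
    "xlarge": {"crawl", "analyze", "index", "render"},
}


def handle_request(task_id, kind):
    """Front-facing layer: never blocks. Returns a ready result,
    or queues the job once and reports it as pending."""
    if task_id in results:
        return {"status": "ready", "body": results[task_id]}
    if task_id not in task_state:
        task_state[task_id] = {"kind": kind, "state": "queued"}
        queue.append(task_id)
    return {"status": "pending"}


def worker_step(size):
    """Worker: take the first queued job this instance size can handle,
    store its result and record the final state. Returns the task id,
    or None if nothing suitable is queued."""
    allowed = CAPABILITIES[size]
    for task_id in list(queue):
        kind = task_state[task_id]["kind"]
        if kind in allowed:
            queue.remove(task_id)
            task_state[task_id]["state"] = "running"
            results[task_id] = f"result of {kind} for {task_id}"
            task_state[task_id]["state"] = "done"
            return task_id
    return None
```

For example, `handle_request("page-1", "render")` returns a pending status until `worker_step("xlarge")` has run, after which the same request returns the stored result. Scanning one shared queue for a suitable job is a simplification of this sketch; a real message queue would more likely use one queue per task kind.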
Our setup changes based on different bottlenecks. When we work with a new site, we need to index its entire archive quickly (500K+). So during the initial setup with the site we may need to raise 10 instances, and when it’s done, we can keep supporting all our clients with just one instance for new articles. Without elasticity, that could mean downtime (or a very expensive setup just to support this period). Experimenting with different setups can lead to design decisions, such as putting a specific portion of your data in memory. It may cost 4 times more but work 40 times faster, so it can be ten times cheaper overall.
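The "4 times the cost, 40 times the speed" claim comes down to cost per unit of work. A minimal sketch, assuming cost scales with hourly price and throughput scales with speed (the hourly prices and job counts in the example are hypothetical):

```python
def cost_per_job(hourly_cost, jobs_per_hour):
    """Cost of processing a single job on a given setup."""
    return hourly_cost / jobs_per_hour


def savings_factor(cost_multiplier, speed_multiplier):
    """How many times cheaper each job becomes when a setup costs
    cost_multiplier times more but runs speed_multiplier times faster."""
    return speed_multiplier / cost_multiplier


print(savings_factor(cost_multiplier=4, speed_multiplier=40))  # 10.0

# The same comparison with hypothetical absolute numbers.
print(cost_per_job(hourly_cost=1.0, jobs_per_hour=100))
print(cost_per_job(hourly_cost=4.0, jobs_per_hour=4000))
```

The more expensive setup wins whenever the speed multiplier exceeds the cost multiplier, which is the kind of trade-off that short-lived experiments on elastic infrastructure make cheap to measure.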
What did we learn at SemantiNet about working with the cloud? 1) It takes time to learn to tweak costs. 2) IT is very different and requires very different skill sets. 3) Modularity gives you control over bottlenecks. 4) Store processed results. Cache: serving from S3 is very cheap. Intermediate steps: where it makes sense, storing them encourages you to break your system down into small modules, which helps make a more elastic solution. 5) The API: we have worked with over 100 APIs, so our standards are high (design, docs, language support, etc.).
We are open to sharing our insights and helping others who are transitioning to Amazon Web Services.
AWS Customer Presentation - SemantiNet
SemantiNet & Amazon Web Services Tal Muskal CTO & Founder [email_address]
About Headup <ul><li>Platform for organizing web content: <ul><li>Semantic Content Analysis: Semantic Indexing & Querying</li><li>Topic Page generation</li><li>Organize content by meaning rather than keywords</li></ul></li></ul>
Unique challenges & opportunities: Organize content by meaning rather than keywords. Same term has many different meanings. Example: Twilight – 2008 film. Over 50 meanings in Wikipedia, 8 films.
Same meaning has different names. Example: Bridge to Nowhere – Disambiguation. Matched: Knik Arm Bridge, Alaska.
<ul><li>Indexing based on implicit object properties: <ul><li>Categories/Sections – “basketball players”, “movies in theatre”, “peace talks 2010”</li><li>Locations – “Negev”, “Upper East Side, Manhattan”, “Yarkon Park”</li></ul></li><li>Generating rich topic pages: <ul><li>For Movie: include Actors, Trailers, Reviews, etc.</li><li>For Location: include Map, Nearby points of interest, etc.</li><li>For Band: include Albums, Music Video Clips, Lyrics, etc.</li><li>Includes articles from our index</li></ul></li></ul>
<ul><li>Web content semantic analysis: <ul><li>Long tail of entities</li><li>Ambiguity problems</li><li>Organizing should be based on implicit data too</li><li>All require prior knowledge about entities</li></ul></li></ul>
World Knowledge Graph <ul><li>Contains over 150M entities in different domains: <ul><li>Bands, Places, Politicians, Movies, TV Shows, etc.</li></ul></li><li>Contains over 1B labeled connections between entities</li><li>Stored in our proprietary graph DB</li><li>Dynamic, evolving data: <ul><li>New terms (new movies…)</li><li>New meanings for existing terms</li><li>New connections between entities</li></ul></li><li>Generating it heavily utilizes Elastic Map Reduce (Pig / Hadoop)</li></ul>
SemantiNet Technology
Semantic Analysis <ul><li>Based on the knowledge graph</li><li>Understand meaning using context</li></ul>
Knowledge Graph <ul><li>Over 20 sources: Freebase, Wikipedia, LinkedMDB…</li><li>150M entities, over 1Bn connections, highly compressed</li><li>Very high speed read access</li><li>Updated regularly using Elastic Map Reduce</li></ul>
Semantic Index <ul><li>Indexing millions of web pages</li><li>Based on URIs rather than keywords</li><li>Index includes metadata: properties, categories, geo-coordinates</li></ul>
Query Interface
Diagram labels: Entities in page (+Connections); Relationships between entities; Context, semantics
SemantiNet Architecture (diagram labels): Web Browser; CMS; Amazon Load Balancer; Client Facing Layer (Frontend ×3, Cache); SQS; Workers in x.large, large and small sizes (perform atomic tasks: Crawl, Model, Analyze Text, Index & Render); Data Store (S3, RDS, SimpleDB, Graph)
Elasticity <ul><li>Number of instances changes at least once a day</li><li>Cost optimizations using a combination of Spot instances & on-demand instances</li><li>RDS is upgraded when under load at peak times</li><li>Elasticity allows fast iterations and experiments for finding the optimal setup</li></ul>
Lessons learned <ul><li>Learning curve for tweaking costs</li><li>IT is easier, but very different</li><li>Modularity is critical in cloud environments: <ul><li>Flexibility in solving different bottlenecks in data processing pipeline</li><li>Allows to slide on the cost/performance curve based on elasticity</li></ul></li><li>Store processed results <ul><li>Cache</li><li>Intermediate steps of processing</li></ul></li><li>One of the best APIs</li></ul>