14.05.2012 Social Media Monitoring with Hadoop (Nils Kübler, MeMo News)
  • Welcome, my name is Nils Kübler and I work for MeMo News in Kreuzlingen. This is my first presentation in English, so please excuse me if it is not entirely fluent. Today I am going to introduce our first Hadoop project, the media monitoring platform at MeMo News. In this presentation I will first show you what media monitoring is about and how online and social media monitoring work. Second, I will cover some of the core features of the MeMo News platform, because they are important for understanding our design decisions. In the third part I will turn to the basics of our architecture, introduce our technology stack, and show how our software uses it. Finally, I will give you an overview of our small cluster and wrap up by showing how our software currently scales.
  • Media monitoring means monitoring the output of print, online, and broadcast media. It helps customers validate the results of marketing campaigns and react early to changes in public opinion. I will explain the classic and online types of media monitoring. The classical approach is to systematically record radio and television broadcasts, collect clippings from print media, and scan them for keywords.
  • In the online world there are two common forms. The first is online media monitoring, where online sources such as news portals and blogs are monitored.
  • The other form is social media monitoring, which primarily monitors social networks such as Twitter, Facebook, and even forums. MeMo News is both an online and a social media monitoring tool. Q: Does anybody have an idea what kind of software could be used for such a thing? A: By downloading the internet :) Q: And how is that done? A: With a web crawler.
  • So what is a web crawler? A web crawler downloads content from the web, extracts data from it, and persists the data somewhere, most likely in a search index to make it searchable. A web crawler has to balance the quality of the downloaded content against its freshness, and we distinguish between different aspects of quality. For example, an archive crawler tries to copy objects from the web as faithfully as possible, which means it has good representational quality. A research crawler, on the other hand, gives weight to data that is relevant to the user, which means it has good intrinsic quality. The crawler type that fits the online monitoring task best is the newsagent, which focuses primarily on having very fresh information; this is sometimes called "near-realtime" search. Before we go deeper into the details of our newsagent, I will explain the most important use cases of the MeMo News platform, because they are important for understanding our architecture.
  • Monitoring with MeMo News works by creating so-called search agents. The user creates the agents in the platform; each agent consists of a name and a query. The queries are used to generate mail reports or realtime alerts when matching articles are found. Additionally, the user can log in to the web platform, navigate through the agents, look at the most recent results, and put results into archives, which are never deleted. Because of this, we can never delete anything from our search index, which means it grows and grows.
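To make the agent concept concrete, here is a minimal sketch of an agent as a name plus a stored query, run against a Solr index to collect results for a mail report. The class, the "published" field, and the one-day window are illustrative assumptions, not the actual MeMo News schema:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.response.QueryResponse;

// Hypothetical sketch: an agent is just a name plus a stored query; a mail
// report runs that query against the article index. Field names such as
// "published" are assumptions, not the actual MeMo News schema.
public class SearchAgent {
    final String name;   // e.g. "Swiss Hadoop User Group"
    final String query;  // e.g. "\"Swiss Hadoop User Group\" OR hadoop"

    SearchAgent(String name, String query) {
        this.name = name;
        this.query = query;
    }

    // Collect the last day's matches for a daily mail report.
    QueryResponse recentMatches(SolrServer solr) throws SolrServerException {
        SolrQuery q = new SolrQuery(query);
        q.addFilterQuery("published:[NOW-1DAY TO NOW]"); // last 24 hours
        q.setRows(50);
        return solr.query(q);
    }
}
```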
  • Here you can see the form for editing an example agent that monitors articles mentioning the Swiss Hadoop User Group. Now let's take a deeper look into the architecture of a search engine.
  • A search engine consists of two parts, an online part and an offline part. The online part is responsible for the user's interaction with our system. The offline part covers everything that happens under the hood: retrieving content from the web, extracting the important parts from it, and storing them in our index. In the following we focus primarily on the offline part, which runs on the Hadoop platform at MeMo News.
  • Our technology stack consists of five main parts. Data storage is provided by HDFS, and coordination is handled by ZooKeeper. HBase extends HDFS with random data access, and MapReduce allows us to process both HDFS and HBase data asynchronously. The most important component, though, is Solr, which is not part of the core Hadoop platform. We won't go deeper into these technologies here, but I would like to ask whether there is interest in some of these topics for future presentations.
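As a small illustration of the HBase-plus-MapReduce combination, here is a sketch of a map-only job that scans an HBase table, in the spirit of the full source scan described on a later slide. The table name, job name, and row layout are assumptions:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;

// Sketch: a map-only MapReduce job scanning an HBase table. The "sources"
// table and its contents are hypothetical.
public class SourceScanJob {
    static class SourceMapper extends TableMapper<NullWritable, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result columns, Context ctx)
                throws IOException, InterruptedException {
            // Inspect the source row here (e.g. last fetch time, priority)
            // and decide whether it needs to be downloaded.
        }
    }

    public static Job createJob(Configuration conf) throws IOException {
        Job job = new Job(HBaseConfiguration.create(conf), "source-scan");
        job.setJarByClass(SourceScanJob.class);
        TableMapReduceUtil.initTableMapperJob(
                "sources",          // hypothetical table of known sources
                new Scan(),         // full scan over all rows
                SourceMapper.class,
                NullWritable.class, NullWritable.class,
                job);
        job.setNumReduceTasks(0);   // map-only: no reduce phase needed
        return job;
    }
}
```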
  • Our newsagent directly uses all of these parts: we make heavy use of ZooKeeper to coordinate our downloaders; we persist everything we download in HBase; this data also gets indexed into Solr; and we make heavy use of MapReduce for job scheduling and for asynchronous analysis tasks, such as calculating download priorities. OK, then let's take a look at our newsagent.
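To give a feel for the ZooKeeper coordination, here is a minimal sketch of enqueueing a download job as a sequential znode that a downloader can pick up. The path layout and payload are assumptions, not MeMo News' actual scheme:

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch: a download job becomes a sequential znode under /jobs; the znode
// path and payload format are assumptions for illustration only.
public class JobQueue {
    private final ZooKeeper zk;

    public JobQueue(ZooKeeper zk) {
        this.zk = zk;
    }

    // Enqueue one source URL as a persistent, sequentially numbered job node.
    public String schedule(String sourceUrl)
            throws KeeperException, InterruptedException {
        return zk.create("/jobs/download-",
                sourceUrl.getBytes(StandardCharsets.UTF_8),
                ZooDefs.Ids.OPEN_ACL_UNSAFE,
                CreateMode.PERSISTENT_SEQUENTIAL);
    }
}
```

Sequential znodes give every job a unique, ordered name, so downloaders can claim work without stepping on each other.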
  • Roughly every three minutes we do a full scan of all our known sources via MapReduce and check whether they need to be downloaded. If so, a job is created on ZooKeeper; this is what we call scheduling. We have a distributed application called the http-loader that is responsible for downloading sources. Every http-loader knows exactly which jobs it is responsible for and executes them as soon as possible. When a source has been downloaded, the new articles are stored in HBase. Each update to HBase triggers the Lily RowLog, which starts an update of the search index and also performs a so-called prospective search to send realtime alerts to users. During the prospective search we check every new article for matches against all existing agents. We used the Lily RowLog as the trigger mechanism because the previous version of HBase did not provide coprocessors. Coprocessors are like triggers in an RDBMS: they let you execute custom code as soon as data in HBase changes. In a future version we will replace the RowLog; we already have an experimental implementation for that. Is there any interest in prospective search, coprocessors, or the Lily RowLog for a future presentation? OK, then let's take a short look at our current cluster setup.
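The prospective search mentioned here inverts the usual search flow: instead of running a query over a large index, each new article is matched against all stored agent queries. Below is a minimal sketch of that idea using Lucene's MemoryIndex, a common building block for prospective search; the talk does not say this is the exact mechanism MeMo News uses:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

// Sketch: index one new article in a throwaway in-memory index, then run
// every stored agent query against it. Field name "content" is assumed.
public class ProspectiveSearch {
    public static List<String> matchingAgents(String articleText,
                                              Map<String, Query> agentQueries) {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
        MemoryIndex index = new MemoryIndex();
        index.addField("content", articleText, analyzer);

        List<String> matches = new ArrayList<String>();
        for (Map.Entry<String, Query> agent : agentQueries.entrySet()) {
            // search() returns a relevance score; > 0 means the query matched.
            if (index.search(agent.getValue()) > 0.0f) {
                matches.add(agent.getKey());
            }
        }
        return matches;
    }
}
```

The cost is proportional to the number of stored queries per article, not to the size of the archive, which is what makes realtime alerting on every new article feasible.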
  • I have split the setup across two slides; these are our masters. We have two virtual master hosts, with each master service running in its own virtual machine. Only one VM is active per service at a time; the other is activated only when the active VM goes down. So even if one virtual master host fails, the other can take over all master services, which gives us pretty good reliability. The Solr server needs to run on a standalone machine because it requires so many resources. Here we follow the same pattern: when the active server fails, the other one takes over the work. ... pause ... Currently one machine is enough to hold our Solr shards, but we will soon need to distribute them across multiple machines, because our index grows and grows.
  • Our "unit of scale" is the Worker. Each worker has the typical hadoop-services installed: Datanode, Tasktracker and Regionserver. And currently 3 of the Workers are also providing our Zookeeper-Quorum, which we may migrate some time to another place. We also have two own processes runng on each  worker: The http-loader and the index-updater, which i introduced already on a previous slide. ... pause ... So, we already reached the last slide, where we can see how our  newsagent currently scales.
  • As you can see, with one worker we download around 15 articles per second, and with two workers this number roughly doubles. The scaling seems to flatten with three or four workers, but this is only a single measurement. We are confident we can handle that, even though this chart may suggest we will hit a scalability problem soon. And we will definitely need to scale the system further, because we want to crawl more sources. Another reason we will need many more workers is that our analysis workload will keep growing as we introduce more analytic tasks, such as sentiment analysis. ... pause ... So that's it, thank you for your attention ...
  • ... any questions?

    1. MeMo News: Online Media Monitoring with Hadoop (Nils Kübler, MeMo News, @nkuebler), 14.05.12
    2. Classic Media Monitoring
    3. Online Media Monitoring
    4. Social Media Monitoring
    5. Web Crawler Types (diagram: crawler types plotted against intrinsic quality, representational quality, and freshness; research and focused crawlers, general crawlers, archive crawlers and mirroring systems, newsagents)
    6. Monitoring with MeMo News
    7. Search Engine Architecture (offline part: retrieving, gathering, indexing, index; online part: search)
    8. Offline Part: Technology Stack
    9. Offline Part: Technology Stack
    10. The MeMo Newsagent
    11. Cluster: Masters
    12. Cluster: Workers
    13. Scaling the Newsagent
    14. Questions?
    15. Attributions. Photos:
        http://www.flickr.com/photos/foreignoffice/4036442903/
        http://www.flickr.com/photos/hytok/2640161873/
        http://www.flickr.com/photos/videocrab/116136642/
        http://commons.wikimedia.org/wiki/File:Crystal_Clear_app_personal.png
