Hadoop World 2011: Storing and Indexing Social Media Content in the Hadoop Ecosystem - Lance Riedel & Brent Halsey - Jive Software

Jive is using Flume to deliver the content of a social web (250M messages/day) to HDFS and HBase. Flume's flexible architecture allows us to stream data to our production data center as well as Amazon's Web Services datacenter. We periodically build and merge Lucene indices with Hadoop jobs and deploy these to Katta to provide near real time search results. This talk will explore our infrastructure and decisions we've made to handle a fast growing set of real time data feeds. We will further explore other uses for Flume throughout Jive including log collection and our distributed event bus.

Speaker Notes
  • We collect content from Twitter, Facebook, blogs, and news outlets, and allow our users to search, monitor, and analyze it.
  • A screenshot of the app shows a user's list of monitors and the content matching those monitors. Users can filter by sentiment and by content source. They can engage in social conversations through Twitter and Facebook, and they can create discussions within Jive SBS.
  • Users can analyze social media trends over time with graph views for sentiment and content sources.
  • The old system takes data from content sources and throws it on a queue. The queue acts as a buffer for processors that process the content and insert it into a MySQL DB. There is some fault tolerance, with multiple servers connecting to multiple queues, but it required a fair bit of monitoring and manual intervention when problems arose.
  • It was limited because we threw away most of our content, and pushing the limits of MySQL can be painful.
  • We wanted to store all content (within a limited window), search it, and analyze it.
  • We chose HBase for random lookup, HDFS for chronological streaming, Katta for distributing Lucene shards, and Hadoop for running MapReduce.
  • We built a prototype of the new system on Amazon EC2 and needed a way to stream data into those servers. EC2's internal/external IP addressing made it difficult to connect directly to HDFS and HBase; Flume provided this connectivity along with desirable delivery guarantees.
  • Additionally, Flume can fan out the data to bring it into EC2 alongside our production system.
  • KATTA: For those not familiar with Katta, it is a distributed search engine with two major responsibilities. The first is distributing indexes from HDFS to any number of Katta nodes. Nodes can run across as many machines as you want, it is easy to add more, and Katta will redistribute indexes if nodes fail. Katta has a highly customizable distribution policy: you can round-robin, or build hot/cold topologies where newer indexes are placed on faster machines. Distribution also includes replication of indexes for increased load capacity and failover. All of this is managed through ZooKeeper, so it is quite resilient and does a very good job of keeping indexes where ZooKeeper says they should be. The second responsibility of Katta is to take a single search request, send it to every Katta node, and gather the results.
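The scatter/gather half of Katta can be sketched in a few lines. This is a toy model, not Katta's API: the node names, documents, and scores below are invented, and a real broker merges Lucene hits rather than (doc, score) tuples.

```python
import heapq

# Hypothetical in-memory stand-in for Katta's scatter/gather search:
# each "node" holds some index shards and returns scored hits; the
# broker queries every node and merges the results by score.
SHARDS = {
    "node-a": [("doc1", 0.9), ("doc4", 0.3)],
    "node-b": [("doc2", 0.7), ("doc5", 0.2)],
    "node-c": [("doc3", 0.8)],
}

def search_node(node, query, k):
    # A real node would run the Lucene query against its local shards.
    return sorted(SHARDS[node], key=lambda h: h[1], reverse=True)[:k]

def broker_search(query, k=3):
    # Scatter the query to every node, then gather and merge the top-k hits.
    all_hits = []
    for node in SHARDS:
        all_hits.extend(search_node(node, query, k))
    return heapq.nlargest(k, all_hits, key=lambda h: h[1])

top = broker_search("jive")   # [('doc1', 0.9), ('doc3', 0.8), ('doc2', 0.7)]
```

In the real system the broker learns node membership from ZooKeeper; here a static dict stands in for it.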
  • OVERVIEW OF SEARCH: The next few slides show how we tackled searching a moving 30-day window of Twitter (the full firehose), the public Facebook feed, and Spinn3r (which includes all major news and blog sites). Search is used to investigate which monitor to create, so searching historical data is of course key. It also allows ad hoc analytics over recent history: show me sentiment, or raw counts, for an ad hoc query over the last 30 days.
  • TRANSITION: Other requirements of course pop up, so it was good that we chose Flume, which let us easily add new functionality. One of the key customization areas of Flume is the custom sources, sinks, and decorators you can supply. Sources let you create custom hooks into data providers; a huge list is provided out of the box, from tailing files to Avro HTTP endpoints where you can send raw events to Flume over HTTP with the Flume event Avro schema. Sinks let you create custom places to put the events; again there is a slew of out-of-the-box sinks such as HBase and HDFS. Decorators can be added pretty much anywhere in the topology; they let you inspect each event and add metadata, change the contents, or throw it on the floor. We want to highlight a few customizations we did (rest on slide).
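The decorator idea from these notes can be illustrated with a small pipeline model. The event shape and function names below are ours, not Flume's API; the sketch only shows the three things a decorator may do: annotate an event, rewrite it, or drop it.

```python
# Illustrative sketch of Flume's decorator concept (names are ours, not
# Flume's API): each decorator sees an event and may annotate it, change
# its body, or drop it by returning None ("throw it on the floor").

def add_source_metadata(event):
    event["attrs"]["source"] = "twitter"      # annotate with metadata
    return event

def drop_empty(event):
    return event if event["body"] else None   # drop useless events

def run_pipeline(events, decorators):
    out = []
    for ev in events:
        for deco in decorators:
            ev = deco(ev)
            if ev is None:
                break                         # event was dropped
        if ev is not None:
            out.append(ev)
    return out

events = [{"body": "hello", "attrs": {}}, {"body": "", "attrs": {}}]
result = run_pipeline(events, [drop_empty, add_source_metadata])
# one event survives, annotated with source=twitter
```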

Transcript

  • 1. Storing and Indexing Social Media Content in the Hadoop Ecosystem Lance Riedel Brent Halsey Jive Software
  • 2. Jive: Social Networking for the Enterprise Engage Employees Engage Customers Engage the Social Web What Matters Apps
  • 3. Jive Social Media Engagement stores social media for monitoring (e.g. brand sentiment), searching, and analysis Jive Social Media Monitoring Overview
  • 4. Searching and Following Social Media content
  • 5. Analyzing Social Media content
  • 6.
    • Massive amount of data
      • 230M+ entries / day
      • Facebook/Twitter growing fast
      • Blogs, news, YouTube, Flickr, etc.
      • 125+ GB / day
      • 3.5k / sec (5k / sec peak)
      • Volume is growing exponentially
  • 7.
    • Previous solution
      • Maxed out large MySQL boxes
      • Throw away data (95%) that doesn't match our users' filters
      • Billions of rows in MySQL
  • 8. And its limitations
      • Can't search historical data
      • Can't do ad hoc analytics
      • MySQL severely stressed
      • Schema migrations were painful
  • 9. Next Generation
      • Storage
      • Search
      • Analytics
  • 10. Next Generation
      • Storage
      • Search
      • Analytics
    • Distribute data with Flume
      • HBase (and HDFS)
      • Katta / Lucene
      • Hadoop
  • 11.  
  • 12.  
  • 13. JSME Flume Topology
  • 14. JSME Flume Topology
  • 15. Why Flume?
    • Need to distribute our data reliably to multiple locations and systems (e.g. servers in our datacenter, in ec2, to HBase, to Hadoop)
    • Flume Design Goals
      • Reliability – failover collectors, master failover
      • Scalability – linear scale by adding collector nodes
      • Manageability – central zookeeper managed configs
      • Extensibility – custom sources and sinks
    • Good match!
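As a rough illustration only (this is from memory of the pre-NG Flume 0.9.x dataflow language, and the node names, port, and paths are invented), a fan-out topology in the spirit of the talk might have been configured like this:

```
agent1     : tailSrc("/var/log/app.log") | agentE2ESink("collector1", 35853) ;
collector1 : collectorSource(35853) |
             [ collectorSink("hdfs://namenode/flume/raw/", "raw"),
               agentBESink("ec2-collector.example.com", 35853) ] ;
```

The bracketed sink list is the fan-out: each event is both written to HDFS in the production datacenter and forwarded to a collector in EC2.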
  • 16. Flume Overview: The Canonical Use Case (diagram: many agents, one per server, in an agent tier feeding a collector tier of a few collectors, which write to HDFS)
  • 17. Flume Overview: Data Ingestion Pipeline Pattern (diagram: agents on each server feed a fan-out collector that writes to HBase for key lookups and range queries, to an incremental search index for search and faceted queries, and to HDFS for Hive and Pig queries)
  • 18. Katta – Distributed Lucene (diagram: a Katta master assigns index shards, built in Hadoop HDFS from Raw.seq, to Katta nodes; each index is replicated across nodes, e.g. Index 1 and Index 2 each live on two nodes)
  • 19. Jive Social Media Search Architecture
  • 20. Systems Overview (diagram, built up over the next four slides: events enter a collector fan-out that writes to HBase and to Raw.seq files in HDFS; a job controller runs a distributed indexer job over the raw files; the resulting indexes are deployed to Katta; a search broker queries Katta and returns search results)
  • 21. Systems Overview: events → collector fan-out → HBase and Raw.seq in HDFS
  • 22. Systems Overview: the job controller runs the distributed indexer job over Raw.seq to build indexes
  • 23. Systems Overview: the indexes are deployed to Katta
  • 24. Systems Overview: the search broker queries Katta and returns search results
  • 25. Distributed Lucene Indexer Job (diagram: input HDFS blocks in, Shard 1 and Shard 2 out)
  • 26. Distributed Lucene Indexer Job: map tasks read raw events from the input HDFS blocks, and each map writes a small index (Index 1–4)
  • 27. Distributed Lucene Indexer Job: shuffle/sort routes each small index (key → shard number, value → path to index) to a reduce task, and the reduces merge them into Shard 1 and Shard 2
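Slides 25–27 can be simulated with a toy shuffle. Everything here is a stand-in: the "small index" is just a sorted list, and the shard assignment (block id modulo shard count) is one simple illustrative policy, not necessarily the one Jive used.

```python
from collections import defaultdict

# Toy simulation of the distributed indexer's shuffle: each map task
# "indexes" one HDFS block of raw events; the reducer for a shard merges
# the small per-map indexes whose key hashes to it.
NUM_SHARDS = 2

def map_phase(block_id, events):
    # The real job writes a small Lucene index and emits
    # (shard_number, path_to_index); here the "index" is a sorted list.
    small_index = sorted(events)
    shard = block_id % NUM_SHARDS          # stand-in shard assignment
    return shard, small_index

def reduce_phase(mapped):
    shards = defaultdict(list)
    for shard, small_index in mapped:
        shards[shard].extend(small_index)  # "merge" the per-map indexes
    return dict(shards)

blocks = {0: ["b", "a"], 1: ["d"], 2: ["c"]}
mapped = [map_phase(bid, events) for bid, events in blocks.items()]
shards = reduce_phase(mapped)
# blocks 0 and 2 land in shard 0, block 1 in shard 1
```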
  • 28. Index Deployment: an incremental indexer job turns Raw.seq into five-minute indexes
  • 29. Index Deployment: an hourly merge indexer job rolls the five-minute indexes into hourly indexes
  • 30. Index Deployment: a daily merge indexer job rolls the hourly indexes into daily indexes
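The three-tier merge keeps the number of live shards (and thus search fan-out) bounded. A toy sketch, with lists standing in for Lucene indexes:

```python
# Toy sketch of the merge cascade: five-minute incremental indexes are
# merged into hourly indexes, and hourly into daily. Lists stand in for
# Lucene indexes; a real merge would use IndexWriter.addIndexes.

def merge(indexes):
    merged = []
    for idx in indexes:
        merged.extend(idx)
    return sorted(merged)

five_minute = [["t1"], ["t2"], ["t3"]]   # 12 of these per hour in reality
hourly = merge(five_minute)              # hourly merge indexer job
daily = merge([hourly, ["t0"]])          # daily merge indexer job
```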
  • 31. Incremental Indexing (diagram: the job controller scans HDFS, where /raw holds raw.time-1.4.seq, raw.time-1.5.seq, raw.time-1.6.seq, and a still-open raw.time-1.7.seq.tmp, and /indexes holds Index.HOUR.time-1 and Index.INCREMENTAL.time-1.1 through .3)
  • 32. Incremental Indexing: 1. scan HDFS; 2. determine the raw input files (the three closed .seq files, skipping the still-open .tmp file)
  • 33. Incremental Indexing: 3. run the INCREMENTAL distributed indexer job over those input files
  • 34. Incremental Indexing: 4. deploy the resulting index (Index.INCREMENTAL.time-1.6) to Katta
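The controller loop on slides 31–34 might look roughly like the sketch below. The "already indexed" bookkeeping and the index-naming scheme are our own simplifications; the real controller reads its state from the HDFS paths shown on the slides.

```python
# Hedged sketch of the incremental indexing controller's loop: scan
# /raw, pick the closed .seq files not yet indexed, run the distributed
# indexer over them, deploy the result to Katta.

def pick_input_files(raw_files, already_indexed):
    # Steps 1-2: take closed files (no .tmp suffix) not yet indexed.
    return [f for f in raw_files
            if f.endswith(".seq") and f not in already_indexed]

def run_incremental(raw_files, already_indexed, deploy):
    inputs = pick_input_files(raw_files, already_indexed)
    if inputs:
        index_name = "Index.INCREMENTAL.%d" % (len(already_indexed) + 1)
        # Step 3: run the distributed indexer job over `inputs` (elided).
        deploy(index_name)                  # Step 4: deploy to Katta
        already_indexed.update(inputs)
    return inputs

deployed, seen = [], set()
raw = ["raw.1.seq", "raw.2.seq", "raw.3.seq.tmp"]
run_incremental(raw, seen, deployed.append)
# indexes raw.1.seq and raw.2.seq; skips the still-open .tmp file
```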
  • 35. Custom sources / sinks / decorators
      • HBase Sink - there is now a supported HBase sink, but we do some of our own transformations before insertion (e.g. it understands our JSON data)
      • Zoie Realtime Search Sink - real-time searching of events on Flume (more details on the next slide)
      • Regex Filter Decorator - passes through only the events whose value for a given key matches a regex
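As an illustration of the last bullet, a regex filter decorator can be modeled as a closure over a compiled pattern. The event shape and attribute names are simplified stand-ins, not Flume's actual event API:

```python
import re

# Minimal stand-in for the "Regex Filter Decorator": pass an event
# through only when a given attribute matches a pattern; return None
# to drop it. Event shape and attribute names are simplified.

def regex_filter(key, pattern):
    compiled = re.compile(pattern)
    def decorate(event):
        value = event["attrs"].get(key, "")
        return event if compiled.search(value) else None  # None = dropped
    return decorate

keep_tweets = regex_filter("source", r"^twitter$")
events = [{"attrs": {"source": "twitter"}}, {"attrs": {"source": "blog"}}]
kept = [e for e in events if keep_tweets(e)]
# only the twitter event survives
```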
  • 36. Real-time Search and Indexing (diagram: the full Systems Overview pipeline, where the path from collector fan-out through Raw.seq, the distributed indexer job, and Katta to the search broker has roughly five-minute latency)
  • 37. Real-time Search and Indexing (diagram: a Zoie Flume sink is added between the collector fan-out and the search broker, giving roughly ten-second latency for the newest events)
  • 38. Real-time Search and Indexing: Raw.seq feeding the incremental indexer job gives five-minute latency
  • 39. Real-time Search and Indexing: the target for the newest events is ten seconds
  • 40. Real-time Search and Indexing: events are taken straight off the collector fan-out
  • 41. Real-time Search and Indexing: a Zoie Flume sink indexes them as they arrive
  • 42. Zoie Flume Sink (diagram: the Zoie sink runs in a Jetty server; bucket 1 holds the 0–5 min window and is searched by the search broker alongside Katta)
  • 43. Zoie Flume Sink: bucket 2 opens for 0–5 min as bucket 1 ages to 5–10 min
  • 44. Zoie Flume Sink: bucket 3 opens; the buckets now cover 0–5, 5–10, and 10–15 min
  • 45. Zoie Flume Sink: bucket 4 opens; bucket 1, now older than 15 min, is dropped from the sink
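The rolling five-minute buckets on slides 42–45 behave like a bounded deque of small in-memory indexes. The sketch below is our reading of the slides, with lists standing in for Zoie's real-time Lucene indexes:

```python
from collections import deque

# Toy model of the Zoie sink's rolling windows: each five-minute bucket
# is an in-memory index; once a bucket is older than ~15 minutes it is
# retired, since the batch-built indexes cover that data by then.
MAX_BUCKETS = 3   # 3 x 5 min = 15 minutes of real-time coverage

class RollingIndex:
    def __init__(self):
        self.buckets = deque()          # oldest bucket on the left

    def roll(self):
        self.buckets.append([])         # open a new 5-minute bucket
        if len(self.buckets) > MAX_BUCKETS:
            self.buckets.popleft()      # retire the oldest bucket

    def add(self, doc):
        self.buckets[-1].append(doc)

    def search(self, term):
        return [d for b in self.buckets for d in b if term in d]

ri = RollingIndex()
for doc in ["a1", "a2", "a3", "a4"]:    # one doc per window, for brevity
    ri.roll()
    ri.add(doc)
# "a1" has aged out; the three newest buckets remain searchable
```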
  • 46. Real-time Search and Indexing (diagram: the full pipeline again, with both the five-minute Katta path and the ten-second Zoie Flume sink path feeding the search broker)
  • 47.
    • Track user activity to:
      • Power recommendations
        • What Matters Activity Stream
        • people you should meet
        • topics you are interested in
      • Social Search: search ranking based on
        • social graph
        • topical graph
        • keywords
      • Analytics: a community manager understands:
        • what users are collaborating on
        • how engagement is increasing
    Hadoop Ecosystem @Jive
  • 48.
    • Track System/App/Web Logs to:
      • A/B Testing
      • Usage analysis, finding bugs, capacity planning
      • Log searching (distributed grep)
    Hadoop Ecosystem @Jive
  • 49.
    • Flume to collect
      • Activities from hundreds of Jive instances
      • System and App Logs
    • Custom Hadoop Jobs/Pig
      • graph analysis
      • semantic/topical analysis
    • Reporting
      • Custom reporting infrastructure
      • Datameer
    Hadoop Ecosystem @Jive
  • 50. Questions
    • Lance Riedel @lanceriedel
    • Brent Halsey @bhalsey
    • jivesoftware.com/bigdata