Hadoop World 2011: Storing and Indexing Social Media Content in the Hadoop Ecosystem - Lance Riedel & Brent Halsey - Jive Software


Jive is using Flume to deliver the content of a social web (250M messages/day) to HDFS and HBase. Flume's flexible architecture allows us to stream data to our production data center as well as Amazon's Web Services datacenter. We periodically build and merge Lucene indices with Hadoop jobs and deploy these to Katta to provide near real time search results. This talk will explore our infrastructure and decisions we've made to handle a fast growing set of real time data feeds. We will further explore other uses for Flume throughout Jive including log collection and our distributed event bus.

Published in: Technology, Business

  • Collecting content from Twitter, Facebook, blogs, and news outlets. Allow our users to search this content, monitor it, and analyze it.
  • Screenshot of the app shows a user's list of monitors and the content matching those monitors. Users can filter by sentiment and by content source. They can engage in social conversations through Twitter and Facebook, and they can create discussions within Jive SBS.
  • Users can analyze social media trends over time with graph views for sentiment and content sources.
  • The old system takes data from content sources and puts it on a queue. The queue acts as a buffer for processors that process the content and insert it into a MySQL DB. There is some fault tolerance, with multiple servers connecting to multiple queues, but it required a fair bit of monitoring and manual intervention when problems arose.
  • Limited because we throw away most of our content. Pushing the limits of MySQL can be painful.
  • Wanted to store all content (limited window), search it, and analyze it.
  • Chose HBase for random lookup, HDFS for chronological streaming, Katta for distributing Lucene shards, and Hadoop for running MapReduce.
  • Built out a prototype of the new system on Amazon's EC2 and needed a way to stream data into these servers. The internal / external IP addresses of EC2 made it difficult to connect directly to HDFS and HBase. Flume provided this connectivity along with desirable delivery guarantees.
  • Additionally, can fan out the data to bring data into EC2 along with our production system.
  • KATTA: For those not familiar with Katta, it is a distributed search engine with two major responsibilities. The first is distributing indexes from HDFS to any number of Katta nodes. Katta nodes can run across as many machines as you want, it is easy to add more, and Katta will redistribute indexes if nodes fail. Katta has a highly customizable distribution policy: you can round-robin, or use hot/cold topologies where newer indexes are placed on faster machines. As part of the distribution, indexes are also replicated for load handling and failover. All of this is managed through ZooKeeper, so it is quite resilient and does a very good job of keeping indexes where ZooKeeper says they should be. The second responsibility of Katta is to take a single search request, send it to every Katta node, and gather the results.
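The second responsibility described in the Katta note (send one request to every node, then gather the results) can be illustrated with a minimal scatter-gather sketch. This is not Katta's API: the in-memory "nodes", the term-count scoring, and the names `search_node` and `scatter_gather` are all assumptions made for illustration.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_node(node_docs, term, k):
    """Return one node's top-k (score, doc_id) hits for a term.

    Hypothetical scoring: the number of times the term occurs in the document.
    """
    hits = []
    for doc_id, text in node_docs.items():
        score = text.lower().split().count(term)
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)


def scatter_gather(nodes, term, k=3):
    """Send the same query to every node in parallel and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(lambda docs: search_node(docs, term, k), nodes))
    merged = [hit for part in partials for hit in part]
    return heapq.nlargest(k, merged)


# Two stand-in "nodes", each holding a different slice of the documents.
nodes = [
    {"a1": "flume flume hdfs", "a2": "katta lucene"},
    {"b1": "flume sink", "b2": "hbase"},
]
top_hits = scatter_gather(nodes, "flume")
```

Each node only needs to return its own top k hits, because the global top k is always contained in the union of the per-node top k lists.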
  • OVERVIEW OF SEARCH: 30 days of Twitter, Facebook, major news, and blogs. The next few slides show how we tackled searching a moving 30-day window of Twitter (full firehose), the public Facebook feed, and Spinn3r (which includes all major news and blog sites). HOW SEARCH IS USED: investigating monitor creation, and ad hoc analytics. Search is used to investigate which monitor to create, so searching historical data is key. It also allows ad hoc analytics over recent history, e.g. show me sentiment or raw counts for an ad hoc query over the last 30 days.
  • TRANSITION: OTHER REQUIREMENTS NEED FLEXIBILITY. Other requirements of course pop up, so it was good that we chose Flume, because we could easily add new functionality. One of the key customization areas of Flume is the set of custom sources, sinks, and decorators you can supply. SOURCES OVERVIEW: Sources let you create custom hooks into data providers. There is a huge list of sources provided out of the box, from tailing files to Avro HTTP endpoints where you can send raw events to Flume over HTTP using a Flume event Avro schema. SINK OVERVIEW: Sinks let you create custom places to put the events. Again, there is a slew of out-of-the-box sinks, such as HBase and HDFS. DECORATOR OVERVIEW: Decorators can be placed pretty much anywhere in the topology; they let you inspect each event and add metadata, change the contents, or throw the event on the floor. SOME OF OUR OWN: We want to highlight a few customizations we did (rest on slide).

    1. 1. Storing and Indexing Social Media Content in the Hadoop Ecosystem. Lance Riedel, Brent Halsey, Jive Software
    2. 2. Jive: Social Networking for the Enterprise. Engage Employees; Engage Customers; Engage the Social Web; What Matters; Apps
    3. 3. Jive Social Media Monitoring Overview: Jive Social Media Engagement stores social media for monitoring (e.g. brand sentiment), searching, and analysis
    4. 4. Searching and Following Social Media content
    5. 5. Analyzing Social Media content
    6. 6. Massive amount of data
          • 230M+ entries / day
          • Facebook/Twitter growing fast
          • Blogs, news, YouTube, Flickr, etc
          • 125+ GB / day
          • 3.5k / sec (5k / sec peak)
          • Volume is growing exponentially
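As a rough sanity check on the volume figures on this slide, the arithmetic can be worked through directly. The sketch below assumes decimal gigabytes and treats the quoted numbers as exact, although the slide marks them as lower bounds ("+"); the variable names are illustrative only.

```python
# Figures quoted on the slide (lower bounds, treated as exact here).
ENTRIES_PER_DAY = 230_000_000
BYTES_PER_DAY = 125 * 10**9      # assumption: 1 GB = 10^9 bytes
QUOTED_RATE_PER_SEC = 3_500
SECONDS_PER_DAY = 24 * 60 * 60

# Average rate implied by 230M entries spread evenly over a day.
avg_rate_per_sec = ENTRIES_PER_DAY / SECONDS_PER_DAY

# Average entry size implied by 125 GB over 230M entries.
avg_entry_bytes = BYTES_PER_DAY / ENTRIES_PER_DAY

# Daily volume implied by sustaining the quoted 3.5k entries per second.
entries_per_day_at_quoted_rate = QUOTED_RATE_PER_SEC * SECONDS_PER_DAY

print(f"{avg_rate_per_sec:.0f} entries/sec on average")
print(f"{avg_entry_bytes:.0f} bytes per entry on average")
print(f"{entries_per_day_at_quoted_rate:,} entries/day at 3.5k/sec")
```

Under these assumptions, 230M entries per day averages to roughly 2.7k entries per second at a little over 500 bytes each, while a sustained 3.5k per second would be about 302M entries per day, which is consistent with the figures being stated as "230M+".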
    7. 7. Previous solution
          • Maxed out large MySQL boxes
          • Throw away data (95%) that doesn't match our users' filters
          • Billions of rows in MySQL
    8. 8. And its limitations
          • Can't search historical data
          • Can't do ad hoc analytics
          • MySQL severely stressed
          • Schema migrations were painful
    9. 9. Next Generation
          • Storage
          • Search
          • Analytics
    10. 10. Next Generation
          • Storage
          • Search
          • Analytics
          Distribute data with Flume
          • HBase (and HDFS)
          • Katta / Lucene
          • Hadoop
    11. 13. JSME Flume Topology
    12. 14. JSME Flume Topology
    13. 15. Why Flume?
          • Need to distribute our data reliably to multiple locations and systems (e.g. servers in our datacenter, in ec2, to HBase, to Hadoop)
          • Flume Design Goals
              • Reliability – failover collectors, master failover
              • Scalability – linear scale by adding collector nodes
              • Manageability – central zookeeper managed configs
              • Extensibility – custom sources and sinks
          • Good match!
    14. 16. Flume Overview: The Canonical Use Case (diagram). Agent tier: 12 servers, each with a Flume Agent. Collector tier: 3 Collectors. Destination: HDFS.
    15. 17. Flume Overview: Data ingestion pipeline pattern (diagram). Flume Agents (on "svr") feed a Collector Fanout with three outputs (index, hbase, hdfs):
          • HBase: key lookup, range query
          • Incremental Search Idx: search query, faceted query
          • HDFS: Hive query, Pig query
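The collector fanout on this slide delivers each event to several destinations. A minimal sketch of the idea follows. It is not Flume's API or configuration syntax: `FanoutSink` is a hypothetical class, and plain Python lists stand in for the HBase, search index, and HDFS sinks.

```python
class FanoutSink:
    """Hypothetical sink that forwards every event to all of its child sinks."""

    def __init__(self, sinks):
        self.sinks = list(sinks)

    def append(self, event):
        # Deliver the same event to each destination in turn.
        for sink in self.sinks:
            sink.append(event)


# Plain lists stand in for the three destinations named on the slide.
hbase, search_index, hdfs = [], [], []
collector = FanoutSink([hbase, search_index, hdfs])

for event in ({"id": 1, "body": "hello"}, {"id": 2, "body": "world"}):
    collector.append(event)
```

The sketch leaves out what makes a real fanout hard: one slow or failed destination should not block or lose events for the others, which is where the delivery guarantees mentioned in the notes come in.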
    16. 18. Katta – distributed Lucene (diagram). Labels: Katta Master; three Katta Nodes holding copies of Index 1 and Index 2; Hadoop HDFS holding Raw.seq, Index 1, and Index 2.
    17. 19. Jive Social Media Search Architecture
    18. 20. Systems Overview (diagram). Labels: Events, Collector Fanout, HDFS, HBase, Raw.seq, Hadoop Job Controller, Distributed Indexer Job, Index 1, Index 2, Katta, Search Broker, Search Results.
    19-22. 21-24. Systems Overview, built up one step per slide: (21) Events, Collector Fanout, HDFS, HBase, Raw.seq; (22) adds Hadoop Job Controller, Distributed Indexer Job, Index 1; (23) adds Index 2 and Katta; (24) adds Search Broker and Search Results.
    23. 25. Distributed Lucene Indexer Job (diagram). Labels: Input HDFS Blocks, Shard 1, Shard 2.
    24. 26. Distributed Lucene Indexer Job (diagram). Raw events from the input HDFS blocks go to four Map tasks; labels Index 1 to Index 4.
    25. 27. Distributed Lucene Indexer Job (diagram). Four Map tasks (Index 1 to Index 4); Shuffle/Sort with key -> shard number, value -> path to index; two Reduce tasks (Shard 1, Shard 2).
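The map, shuffle, and reduce stages named on the indexer slides can be sketched in plain Python. This is a toy simulation under stated assumptions, not the Hadoop or Lucene code behind the talk: the "index" is a dictionary-based inverted index, the `store` dictionary stands in for HDFS, and assigning shards by `task_id % NUM_SHARDS` is a guess, since the slide only says that the key is the shard number and the value is the path to an index.

```python
from collections import defaultdict

NUM_SHARDS = 2  # the slide shows two reducers / two shards


def build_index(docs):
    """Build a tiny inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs:
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def map_task(task_id, docs, store):
    """Index one input split, save it, and emit (shard number, path to index)."""
    path = f"/tmp/index-{task_id}"      # hypothetical path
    store[path] = build_index(docs)
    shard = task_id % NUM_SHARDS        # assumption: simple modulo assignment
    return shard, path


def reduce_task(paths, store):
    """Merge every small index assigned to one shard into a single index."""
    merged = defaultdict(set)
    for path in paths:
        for term, doc_ids in store[path].items():
            merged[term] |= doc_ids
    return dict(merged)


def run_job(splits):
    store = {}                          # stands in for HDFS
    shuffled = defaultdict(list)        # shard number -> list of index paths
    for task_id, docs in enumerate(splits):
        shard, path = map_task(task_id, docs, store)
        shuffled[shard].append(path)
    return {shard: reduce_task(paths, store)
            for shard, paths in sorted(shuffled.items())}


splits = [
    [("d1", "flume hdfs")],
    [("d2", "katta lucene")],
    [("d3", "flume katta")],
    [("d4", "hbase")],
]
shards = run_job(splits)
```

Passing index paths rather than documents through the shuffle keeps the shuffled data small: the heavy indexing work happens in the mappers, and each reducer only merges already-built indexes.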
    26. 28. Index Deployment (diagram). 5 Minute tier: Incremental Indexer Job over Raw.seq.
    27. 29. Index Deployment (diagram). Adds the Hour tier: Hourly Merge Indexer Job.
    28. 30. Index Deployment (diagram). Adds the Day tier: Daily Merge Indexer Job.
    29. 31. Incremental Indexing (diagram). Job Controller step 1: Scan HDFS.
          • /raw: raw.time-1.4.seq, raw.time-1.5.seq, raw.time-1.6.seq, raw.time-1.7.seq.tmp
          • /indexes: Index.HOUR.time-1, Index.INCREMENTAL.time-1.1, Index.INCREMENTAL.time-1.2, Index.INCREMENTAL.time-1.3
    30. 32. Incremental Indexing (diagram). Step 2: Determine raw input files (raw.time-1.4.seq, raw.time-1.5.seq, raw.time-1.6.seq).
    31. 33. Incremental Indexing (diagram). Step 3: Run INCREMENTAL index job (Distributed Indexer Job).
    32. 34. Incremental Indexing (diagram). Step 4: Deploy index (Katta: Index.INCREMENTAL.time-1.6).
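Steps 1 and 2 of the job controller (scan HDFS, determine raw input files) can be sketched using only the file names shown on the slides. The selection rule here is an assumption inferred from the example: take closed `.seq` files whose sequence number is higher than the newest incremental index, skip `.tmp` files that are still being written, and name the new index after the last input. The function name and regular expressions are hypothetical, and the sketch assumes all files belong to the same time bucket.

```python
import re

# Patterns inferred from the file names on the slide.
RAW_RE = re.compile(r"^raw\.(?P<bucket>.+)\.(?P<seq>\d+)\.seq$")
INC_RE = re.compile(r"^Index\.INCREMENTAL\.(?P<bucket>.+)\.(?P<seq>\d+)$")


def plan_incremental_job(raw_files, index_names):
    """Return (input files, name of the index to build), or ([], None) if idle."""
    # Highest sequence number already covered by an incremental index.
    last_indexed = 0
    for name in index_names:
        match = INC_RE.match(name)
        if match:
            last_indexed = max(last_indexed, int(match["seq"]))

    # Closed raw files newer than that; ".seq.tmp" files do not match RAW_RE.
    inputs = []
    for name in raw_files:
        match = RAW_RE.match(name)
        if match and int(match["seq"]) > last_indexed:
            inputs.append((int(match["seq"]), match["bucket"], name))
    if not inputs:
        return [], None

    inputs.sort()
    last_seq, bucket, _ = inputs[-1]
    return [name for _, _, name in inputs], f"Index.INCREMENTAL.{bucket}.{last_seq}"


raw_files = ["raw.time-1.4.seq", "raw.time-1.5.seq",
             "raw.time-1.6.seq", "raw.time-1.7.seq.tmp"]
index_names = ["Index.HOUR.time-1", "Index.INCREMENTAL.time-1.1",
               "Index.INCREMENTAL.time-1.2", "Index.INCREMENTAL.time-1.3"]
inputs, new_index = plan_incremental_job(raw_files, index_names)
```

With the slide's listing, this selects the three closed raw files and names the result `Index.INCREMENTAL.time-1.6`, matching the index shown being deployed in step 4.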
    33. 35. Custom sources / sinks / decorators
          • HBase Sink - There is now a supported HBase sink, but we do some of our own transformations before insertion (e.g. understands our json data)
          • Zoie Realtime Search Sink - real-time searching of events on flume (more details next slide)
          • Regex Filter Decorator - allows only events through that match a key value
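The regex filter decorator described on this slide can be illustrated with a small sketch. Flume's actual decorator interface is Java and is not reproduced here: `RegexFilterDecorator`, the event-as-dictionary shape, and the `source` key are assumptions, and a plain list stands in for the downstream sink.

```python
import re


class RegexFilterDecorator:
    """Hypothetical decorator: pass an event on only if event[key] matches a regex."""

    def __init__(self, sink, key, pattern):
        self.sink = sink
        self.key = key
        self.pattern = re.compile(pattern)

    def append(self, event):
        value = event.get(self.key)
        if value is not None and self.pattern.search(str(value)):
            self.sink.append(event)
        # Non-matching events are dropped ("thrown on the floor").


passed = []  # a plain list stands in for the wrapped sink
twitter_only = RegexFilterDecorator(passed, key="source", pattern=r"^twitter$")

for event in (
    {"source": "twitter", "body": "first"},
    {"source": "facebook", "body": "second"},
    {"body": "no source attribute"},
):
    twitter_only.append(event)
```

Because the decorator exposes the same `append` method as the sink it wraps, it can be placed in front of any sink without that sink knowing about the filter.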
    34. 36. Real-time Search and Indexing (diagram). Labels: Events, Collector Fanout, HDFS, HBase, Raw.seq, Hadoop Job Controller, Distributed Indexer Job, Index 1, Index 2, Katta, Search Broker, Search Results; the indexing path is marked 5 Minute.
    35. 37. Real-time Search and Indexing (diagram). Same as slide 36, plus a Zoie Flume Sink marked 10 Second.
    36-39. 38-41. Real-time Search and Indexing, built up one step per slide: (38) Raw.seq, Incremental Indexer Job, 5 Minute; (39) adds 10 Seconds; (40) adds Collector Fanout; (41) adds Zoie Flume Sink.
    40-43. 42-45. Zoie Flume Sink (diagram, built up one step per slide). Labels: Search Broker, Katta, Zoie Sink, Jetty Server. Numbered segments in the sink: (42) segment 1, 0-5 min; (43) adds segment 2, 5-10 min; (44) adds segment 3, 10-15 min; (45) adds segment 4, with the oldest marked > 15 min.
    44. 46. Real-time Search and Indexing (diagram, repeat of slide 37).
    45. 47. Hadoop Ecosystem @Jive
          • Track user activity to:
              • Power recommendations
                  • What Matters Activity Stream
                  • people you should meet
                  • topics you are interested in
              • Social Search: search ranking based on
                  • social graph
                  • topical graph
                  • keywords
              • Analytics: community manager understands
                  • which users are collaborating
                  • how engagement is increasing
    46. 48. Hadoop Ecosystem @Jive
          • Track System/App/Web Logs to:
              • A/B Testing
              • Usage analysis, finding bugs, capacity planning
              • Log searching (distributed grep)
    47. 49. Hadoop Ecosystem @Jive
          • Flume to collect
              • Activities from 100s of Jive instances
              • System and App Logs
          • Custom Hadoop Jobs/Pig
              • graph analysis
              • semantic/topical analysis
          • Reporting
              • Custom reporting infrastructure
              • Datameer
    48. 50. Questions
          • Lance Riedel @lanceriedel
          • Brent Halsey @bhalsey
          • jivesoftware.com/bigdata
