Near Real Time Processing of Social Media Data with HBase

Monitoring social media and news media is much like building a search engine: a specialized one called a news agent. Where classical search engines aim to cover all kinds of content, a news agent focuses on news and social media only. This is where content freshness matters: results have to be available in near real time.

To process several million news items, amounting to hundreds of megabytes, per day, one has to choose a system architecture that is reliable on the one hand and massively scalable on the other. These requirements lead to a distributed architecture and a NoSQL approach.

This talk gives a brief overview of the media monitoring use case, the system architecture based on Hadoop and HBase, and the challenges and lessons learned.

Published in: Technology, Business

  1. CC 2.0 by William Brawley | http://flic.kr/p/7PdUP3
  2. Agenda (August 31, 2012)
       • Why Hadoop and HBase?
       • Social Media Monitoring
       • Prospective Search and Coprocessors
       • Challenges & Lessons Learned
       • Resources to get started
  3. About Sentric
       • Spin-off of MeMo News AG, the 3 leading provider for Social Media Monitoring & Analytics in Switzerland
       • Big Data expert, focused on Hadoop, HBase and Solr
       • Objective: Transforming data into insights
  4. CC 2.0 by Editor B | http://flic.kr/p/bcU5aD
  5. Why Hadoop and HBase? Social Media Monitoring Process
       • Information Gathering
       • Information Processing
       • Analysis & Interpretation
       • Insight Presentation
  6. Why Hadoop and HBase? Requirements (SMM)
       • Cost effective
       • Highly scalable
       • Freshness
       • Reliable
       • RT Alerting
       • Analytical capabilities
  7. Why Hadoop and HBase? Hadoop
       • HDFS + MapReduce
       • Based on Google Papers
       • Distributed Storage and Computation Framework
       • Affordable Hardware, Free Software
       • Significant Adoption
  8. Why Hadoop and HBase? HBase
       • Non-Relational, Distributed Database
       • Column-Oriented
       • Multi-Dimensional
       • High Availability
       • High Performance
       • Built on top of HDFS as storage layer
  9. Why Hadoop and HBase? Technology Stack
       • Storage: HBase / HDFS
       • Search: Solr
       • Analytics: Hadoop, Mahout
       • Event mechanism (MQ): HBase RowLog
       • Real-time alerting: Prospective search
  10. CC 2.0 by nolifebeforecoffee | http://flic.kr/p/c1UTf
  11. Social Media Monitoring: Overview
       • Downloaded Articles
       • match?
       • Search Agents
       • Output: Web-UI, Reports, RT Alerts
       • Icons by http://dryicons.com
  12. Social Media Monitoring: Solution Architecture
       • Diagram labels: n News Agents, REST, HBase, Coprocessor, Web-UI, MySQL, Solr, RT Alerts
       • Icons by http://dryicons.com
  13. Social Media Monitoring: Prospective Search with Coprocessors
       • Diagram labels: Processing, Put operations, Prospective Search, HRegion, RT Alerts, HRegionServer
       • Icons by http://dryicons.com
  14. Social Media Monitoring: Key Figures
       • Monthly growth
           • Index: 200 GB
           • 50 million docs/month
           • HBase: 600 GB
           • Raw data, meta data and extracted data
       • A few thousand MapReduce jobs/month
  15. CC 2.0 by saebaryo | http://flic.kr/p/5T4t5L
  16. Challenges & Lessons Learned
       1. Benchmarks - workloads
       2. Supervision
       3. Keys and shards - Schema design /LG
       4. Timestamps, the 4th dimension
       5. Short ColumnFamily names ->
       6. File handles, OS
       7. JVM tuning, GC!
       8. Scaling region servers, data locality!
       9. Automatic vs. manual splits, compaction
       10. Do not rely on HBase being rock solid in production
       11. Forget "emergency actions", it takes some time
       12. Use HBase for an appropriate use case
       13. Tune and tweak - it's not a project, it's a process
       14. You need DevOps in production
       15. Huge know-how curve, you need to know the whole ecosystem
       16. Use a distribution: it's packaged, tested and supports migration, enterprise grade
       17. Virtualization, hardware
       18. Don't struggle too much, there is a good community
       19. Share your knowledge
       20. It's early stage, many tools around, a few still missing
  17. Challenges & Lessons Learned: Challenges
       • Everyone is still learning
       • Some issues only appear at scale
           • At scale, nothing works as advertised
       • Production cluster configuration
           • Hardware issues
           • Tuning cluster configuration to our workloads
       • HBase stability
       • Monitoring health of HBase
  18. Challenges & Lessons Learned: Lessons - General
       • Do not rely on HBase as frontend storage layer. It's not going to be rock solid
       • Don't struggle too much, there is a good community
       • Share your knowledge
       • It's early stage, many tools around, a few still missing
  19. Challenges & Lessons Learned: Lessons - Planning
       • Use HBase for an appropriate use case
       • Use a distribution: it's packaged, tested and supports migration, enterprise grade
       • Benchmarks - know your workloads & query patterns
           • YCSB
       • Schema & Key Design
           • What's queried together should be stored together
       • Scaling region servers, data locality!
       • Virtualization vs. Real Hardware
  20. Challenges & Lessons Learned: Lessons - Performance Tuning
       • Number of CF < 10
           • Compaction + Flushing I/O intensive
       • Short ColumnFamily names
           • HFile index size occupying aloc RAM (storefileindexSize)
       • OS file handles
           • ulimit -n 32768
       • JVM Tuning, GC!
           • HMaster 1024 MB
           • RegionServer 8192 MB
           • -XX:+UseConcMarkSweepGC
           • -XX:+CMSIncrementalMode
       • Automatic vs. manual splits
       • Be careful with expensive operations in coprocessors
       • Play with all the configurations and benchmark for tuning
  21. Challenges & Lessons Learned: Lessons - Operation
       • Monitoring/Operational tooling is most important
       • Forget "emergency actions", it takes some time
       • Tune and tweak - it's not a project, it's a process
       • You need DevOps in production
       • Huge know-how curve, you need to know the whole ecosystem
           • Hadoop, HDFS, MapRed
  22. Resources to get started
       • http://hbase.apache.org/book.html
       • http://www.sentric.ch/blog/best-practice-why-monitoring-hbase-is-important
       • http://www.sentric.ch/blog/hadoop-overview-of-top-3-distributions
       • http://www.sentric.ch/blog/hadoop-best-practice-cluster-checklist
       • http://outerthought.org/blog/465-ot.html
  23. Thank you!
       • Questions?
       • Christian Gügi, christian.guegi@sentric.ch
       • Jean-Pierre König, jean-pierre.koenig@sentric.ch
       • NoSQL Roadshow Basel
  24. Cluster: Masters
  25. Cluster: Worker
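
Slides 11 to 13 describe prospective search: stored search agents are matched against each newly written article, instead of querying an index after the fact. A minimal sketch of that matching step, in plain Python rather than as an HBase coprocessor, might look as follows. The `SearchAgent` class, the all-terms-must-match rule, and the `on_put` function name are assumptions made for illustration; the slides do not say how agents are represented or how matching is implemented.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchAgent:
    """A stored query (hypothetical shape): every term must occur in the article."""
    agent_id: str
    terms: frozenset


def tokenize(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def on_put(article_text, agents):
    """Return the ids of all agents that match a newly stored article.

    Stands in for logic that would run on each Put operation; here it is an
    ordinary function so the sketch runs without an HBase cluster.
    """
    tokens = tokenize(article_text)
    return [a.agent_id for a in agents if a.terms <= tokens]


agents = [
    SearchAgent("hbase-news", frozenset({"hbase", "release"})),
    SearchAgent("solr-news", frozenset({"solr"})),
]
alerts = on_put("New HBase release announced today", agents)
print(alerts)
```

Because this loop touches every agent on every write, it also illustrates the warning on slide 20 about expensive operations in coprocessors: the per-article cost grows with the number of agents.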

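The planning advice on slide 19, "what's queried together should be stored together", comes down largely to row key design, because HBase keeps rows sorted by the bytes of their row key. The sketch below builds a composite key so that all articles from one source sit next to each other, newest first. The key layout (source id, reversed timestamp, article id) and every name in the code are assumptions for illustration, not the schema used in the talk.

```python
import struct

MAX_TS = 2**63 - 1  # largest signed 64-bit value, used to reverse timestamps


def row_key(source_id, timestamp_ms, article_id):
    """Build a hypothetical composite row key: source | reversed ts | article.

    The big-endian reversed timestamp makes newer articles sort first
    under byte-wise lexicographic ordering.
    """
    reversed_ts = struct.pack(">q", MAX_TS - timestamp_ms)
    return (source_id.encode("utf-8") + b"|" + reversed_ts
            + b"|" + article_id.encode("utf-8"))


# Python sorts bytes lexicographically, like HBase sorts row keys.
keys = sorted([
    row_key("blog-a", 1000, "x1"),
    row_key("blog-b", 3000, "x2"),
    row_key("blog-a", 2000, "x3"),
])
# Recover the article id (the part after the last separator) to show the order.
order = [k.rsplit(b"|", 1)[1].decode("utf-8") for k in keys]
print(order)
```

With this layout a scan over the prefix `blog-a|` would return that source's articles in reverse chronological order. Whether that is the right grouping depends on the query patterns, which is why the same slide recommends benchmarking against real workloads.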