Your SlideShare is downloading. ×
Near Real Time Processing of Social Media Data with HBase
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Near Real Time Processing of Social Media Data with HBase

2,060

Published on

Monitoring social media and news media in general is kind of building a search engine - a specific search engine called news agent. Where classical search engines are about to have all kind of …

Monitoring social media and news media in general is kind of building a search engine - a specific search engine called news agent. Where classical search engines are about to have all kind of content, newsagent focus on news and social media only. This is where content freshness matters - it has to be near real time.

In order to process several million news with hundreds of Megabytes per day one has to choose a system architecture that is reliable on one hand and massive scalable at the other hand. Those requirements leads to a distributed architecture, a NoSQL approach.

This talk will give you a brief overview about the Media Monitoring use case, the system architecture based on Hadoop and HBase, challenges and lessons learned.

Published in: Technology, Business
0 Comments
4 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
2,060
On Slideshare
0
From Embeds
0
Number of Embeds
6
Actions
Shares
0
Downloads
30
Comments
0
Likes
4
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. CC 2.0 by William Brawley | http://flic.kr/p/7PdUP3
  • 2. August 31, 2012•  Why Hadoop and HBase? 2•  Social Media Monitoring •  Prospective Search and Coprocessors•  Challenges & Lessons Learned•  Resources to get startedAgenda
  • 3. August 31, 2012•  Spin-off of MeMo News AG, the 3 leading provider for Social Media Monitoring & Analytics in Switzerland•  Big Data expert, focused on Hadoop, HBase and Solr•  Objective: Transforming data into insightsAbout Sentric
  • 4. CC 2.0 by Editor B| h"p://flic.kr/p/bcU5aD  
  • 5. August 31, 2012 5 Information Information Analysis & Insight Gathering Processing Interpretation PresentationWhy Hadoop and HBase?Social Media Monitoring Process
  • 6. August 31, 2012 6 Cost effective High Freshness scalable SMM Reliable RT Alerting Analytical capabilitiesWhy Hadoop and HBase?Requirements
  • 7. August 31, 2012•  HDFS + MapReduce 7•  Based on Google Papers•  Distributed Storage and Computation Framework•  Affordable Hardware, Free Software•  Significant AdoptionWhy Hadoop and HBase?Hadoop
  • 8. August 31, 2012•  Non-Relational, Distributed Database 8•  Column-Oriented•  Multi-Dimensional•  High Availability•  High Performance•  Build on top of HDFS as storage layerWhy Hadoop and HBase?HBase
  • 9. August 31, 2012 9Storage HBase /HDFSSearch SolrAnalytics Hadoop MahoutEvent mechanism (MQ) HBase RowLogReal-time alerting Prospective searchWhy Hadoop and HBase?Technology Stack
  • 10. CC 2.0 by nolifebeforecoffee | http://flic.kr/p/c1UTf
  • 11. August 31, 2012 11Downloaded Articles match?Search AgentsOutput Web-UI Reports RT Alerts Icons by http://dryicons.comSocial Media MonitoringOverview
  • 12. August 31, 2012 12 n News Agents REST HBase Coprocessor Web-UI MySQL Solr RT Alerts Icons by http://dryicons.comSocial Media MonitoringSolution Architecture
  • 13. August 31, 2012 13 Processing Put operations Prospective Search HRegion RT Alerts HRegionServer Icons by http://dryicons.comSocial Media MonitoringProspective Search with Coprocessors
  • 14. August 31, 2012•  Monthly growth 14 •  Index: 200GB •  50 Mio. docs/month •  HBase: 600 GB •  Raw data, meta data and extracted data•  A few 1000 map-reduce jobs/ monthSocial Media MonitoringKey Figures
  • 15. CC 2.0 by saebaryo | h"p://flic.kr/p/5T4t5L
  • 16. Augus t 31, 20121  Benchmarks - workloads2  Supervision 163  Keys and shards – Schema design /LG4  Timestamps, the 4th dimension5  Short ColumnFamily names->6  File handles. OS7  JVM Tuning, GC !!!8  Scaling region servers, data locality!9  Automatic vs manual splits, compaction10  Do not use HBase as rock solid in prod11  Forget feuerwehr aktionen, it takes some time12  Use Hbase for a apropriate use case13  Tune and tweak – it‘s not a project – it‘s a process14  You need devops in production15  Huge know-how curve, you need to know the hole ecosystem16  Use a distribution, ist packed, tested and supports migration, enterprise grade17  Virtualisierung, Hardware18  Dont struggle to much, there is a good community19  Share your knowledge20  It‘s early state, many tools around, a few still missingChallenges & Lessons Learned
  • 17. August 31, 2012•  Everyone is still learning 17•  Some issues only appear at scale •  At scale, nothing works as advertised•  Production cluster configuration •  Hardware issues •  Tuning cluster configuration to our work loads•  HBase stability•  Monitoring health of HBaseChallenges & Lessons LearnedChallenges
  • 18. August 31, 2012•  Do not rely on HBase as frontend 18 storage layer. It’s not going to be rock solid•  Don’t struggle to much, there is a good community•  Share your knowledge•  It‘s early stage, many tools around, a few still missingChallenges & Lessons LearnedLessons - General
  • 19. August 31, 2012•  Use HBase for an appropriate use case 19•  Use a distribution, its packed, tested and supports migration, enterprise grade•  Benchmarks – know your workloads & query patterns •  YCSB•  Schema & Key Design •  What’s queried together should be stored together•  Scaling region servers, data locality!•  Virtualization vs. Real HardwareChallenges & Lessons LearnedLessons - Planning
  • 20. August 31, 2012•  Number of CF < 10 20 •  Compaction + Flushing I/O intensive•  Short ColumnFamily names •  HFile index size occupying aloc RAM (storefileindexSize)•  OS file handles •  ulimit –n 32768•  JVM Tuning, GC !!! •  HMaster 1024 MB •  RegionServer 8192 MB •  -XX:+UseConcMarkSweepGC •  -XX:+CMSIncrementalMode•  Automatic vs. manual splits•  Be careful with expensive operations in coprocessors•  Play with all the configurations and benchmark for tuningChallenges & Lessons LearnedLessons - Performance Tuning
  • 21. August 31, 2012•  Monitoring/Operational tooling is most 21 important•  Forget “emergency actions”, it takes some time•  Tune and tweak – it‘s not a project – it‘s a process•  You need DevOps in production•  Huge know-how curve, you need to know the whole ecosystem •  Hadoop, HDFS, MapRedChallenges & Lessons LearnedLessons - Operation
  • 22. August 31, 2012•  http://hbase.apache.org/book.html 22•  http://www.sentric.ch/blog/best- practice-why-monitoring-hbase-is- important•  http://www.sentric.ch/blog/hadoop- overview-of-top-3-distributions•  http://www.sentric.ch/blog/hadoop- best-practice-cluster-checklist•  http://outerthought.org/blog/465- ot.htmlResources to get started
  • 23. August 31, 2012 23 Questions? Christian Gügi, christian.guegi@sentric.ch Jean-Pierre König, jean-pierre.koenig@sentric.chNoSQL Roadshow BaselThank you!
  • 24. Augus t 31, 2012 24MastersCluster
  • 25. Augus t 31, 2012 25WorkerCluster

×