Realtime Sentiment Analysis Application Using Hadoop and HBase



Published in: Technology, Business


  1. A Real Time Sentiment Analysis Application using Hadoop and HBase in the Cloud
     Jagane Sundar, Founder, AltoScale Inc.
     June 14, 2012, Hadoop Summit 2012
  2. About me
     - Extensive knowledge of Hadoop, cloud compute and virtualization
     - Co-founder of AltoScale. We developed the Workbench
     - Worked on Hadoop management and performance at Yahoo
     - Primarily a systems and storage guy – have written TCP stacks and NFS clients, and Livebackup for KVM
  3. My Motivation
     - Build a cool real time big data app in order to acquire a deep understanding of real time big data systems in the cloud
  4. What will you get out of this?
     - See how easy it is to build a highly scalable real-time Big Data application using a variety of open source tools and technologies
  5. Real Time Sentiment Analysis
     - Easily accessible real time signals
       - Twitter public status updates
       - Blog entries
  6. Real Time Sentiment Analysis
     - Two types of solutions to Real Time Sentiment Analysis
       - Keywords known a priori
         - Filter tweets by keyword
       - Open ended sentiment analysis (no a priori knowledge of keywords)
         - Random sample of all public tweets
           - 1% of public tweets easily available
           - 10% (Twitter firehose) may be available for purchase
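The first approach on this slide, filtering by keywords known a priori, can be sketched with plain Java. This is a minimal illustration and not the talk's code: `KeywordMatcher` is a hypothetical helper, it does no Twitter I/O, and it uses naive substring matching, so it is only meant to show how one tweet can be attributed to several tracked keywords.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper (not from the slides): finds which tracked keywords
// a tweet mentions, ignoring case.
final class KeywordMatcher {
    // Sorted copy of the tracked keywords, stored in lower case.
    private final Set<String> keywords = new TreeSet<>();

    KeywordMatcher(Set<String> tracked) {
        for (String keyword : tracked) {
            keywords.add(keyword.toLowerCase(Locale.ROOT));
        }
    }

    // Returns the tracked keywords that occur in the tweet text, in sorted order.
    // Limitation: plain substring matching, so "obama" also matches "obamacare".
    List<String> matches(String tweetText) {
        String lower = tweetText.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (String keyword : keywords) {
            if (lower.contains(keyword)) {
                hits.add(keyword);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        KeywordMatcher matcher = new KeywordMatcher(Set.of("obama", "romney", "davebarry"));
        System.out.println(matcher.matches("Obama and Romney debate tonight"));
    }
}
```

In the open ended case there is no keyword list to match against, so a helper like this would not apply; every sampled tweet would be classified as it arrives.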
  7. Real Time Sentiment Analysis: Application Architecture
     [Diagram]
     - Service Node
       - TwitterSampler: analyze sentiment, write a few new rows to HBase every minute
       - HBase REST Gateway: scan HTable
     - Hadoop/HBase
       - Master: NN, HBase Master
       - Hadoop Slave: DataNode, Region Server
       - Hadoop Slave: DataNode, Region Server
       - Hadoop Slave: DataNode, Region Server
  8. Real Time Sentiment Analysis: Twitter Streaming API Overview
     [Diagram]
     - Twitter APIs
       - REST APIs (Request/Response)
       - Streaming APIs (Persistent HTTP Conn)
         - Public Streams (Sample of all public updates): filter, sample, firehose
         - User Streams (One User’s updates)
         - Site Streams (Multiple Users’ updates)
     - Callout on the diagram: "We use this API to collect tweets"
  9. Real Time Sentiment Analysis: Time Series Database
     - Inspired by TSDB, but does not use TSDB
     - Read Benoît “tsuna” Sigoure’s slides from HBaseCon 2012
  10. Real Time Sentiment Analysis: in HBase

      Row                         NEUTRAL  POSITIVE  NEGATIVE  Sample Tweets
      obama:2012:06:04:13:34      1        4         0         sdac soasp few
      romney:2012:06:04:13:34     2        3         1         Smsm djcn dje jdj
      davebarry:2012:06:04:13:34  0        9         0         cs dsjw ausj
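Because HBase keeps rows sorted by key, a row key of the form keyword:yyyy:MM:dd:HH:mm places all minute buckets for one keyword and one day next to each other, so they can be read with a single range scan. The sketch below illustrates that idea only: a `TreeMap` stands in for the HBase table, and `RowKeyRange` with its methods is a hypothetical name, not something shown in the talk.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Illustration only: a TreeMap stands in for an HBase table, since both keep
// rows sorted by key (for ASCII keys, String order equals byte order).
final class RowKeyRange {
    // Inclusive start row for every minute bucket of one keyword on one day.
    // The day is given in the row key's own format, e.g. "2012:06:04".
    static String startRow(String keyword, String day) {
        return keyword + ":" + day + ":";
    }

    // Exclusive stop row: the start prefix with its last character incremented
    // (':' becomes ';'), so every key beginning with the prefix sorts below it.
    static String stopRow(String keyword, String day) {
        String prefix = startRow(keyword, day);
        char last = prefix.charAt(prefix.length() - 1);
        return prefix.substring(0, prefix.length() - 1) + (char) (last + 1);
    }

    // Returns the rows for one keyword and day; values are {neutral, positive, negative}.
    static SortedMap<String, int[]> scan(TreeMap<String, int[]> table, String keyword, String day) {
        return table.subMap(startRow(keyword, day), stopRow(keyword, day));
    }

    public static void main(String[] args) {
        TreeMap<String, int[]> table = new TreeMap<>();
        table.put("obama:2012:06:04:13:34", new int[] {1, 4, 0});
        table.put("romney:2012:06:04:13:34", new int[] {2, 3, 1});
        System.out.println(scan(table, "obama", "2012:06:04").keySet());
    }
}
```

Putting the keyword first keeps one keyword's history contiguous; the trailing ":" in the start row also stops a keyword such as "obama" from picking up rows of a longer keyword that merely begins with it.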
  11. Real Time Sentiment Analysis: Front Page
  12. Real Time Sentiment Analysis: Results Page
  13. Real Time Sentiment Analysis: Standing on the Shoulders of Giants
      - Hadoop and HBase, of course
      - Twitter4j library for getting the Twitter stream
      - Sentiment Analysis
        - Weka Library
      - Tomcat
      - jQuery, Dojo for JavaScript client
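  14. Real Time Sentiment Analysis: Twitter Stream API - TsStatusListener

      public static class TsStatusListener implements StatusListener {
          // Called by Twitter4j for each incoming status update.
          public void onStatus(Status status) {
              // Classify the tweet text with the sentiment model.
              Item item = wm.weightedClassify(status.getText());
              int polarity = 0;
              try {
                  polarity = Integer.parseInt(item.getPolarity().trim());
              } catch (NumberFormatException nfe) {
                  // Unparseable polarity: keep the default of 0.
              }
              updateKeywordTrackers(status, polarity);
          }
          // The other StatusListener callbacks are not shown on the slide.
      }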
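The slide does not show what `updateKeywordTrackers` does with the polarity. A minimal sketch of one possibility is a per-keyword tally that the listener increments and a once-a-minute writer drains. Everything here is an assumption for illustration: the class name `KeywordTally`, and in particular the mapping of polarity sign to sentiment (positive above zero, negative below zero, neutral at zero), which the slides do not specify.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical tally, not the talk's tracker class. Methods are synchronized
// because the stream listener and the periodic writer may run on different threads.
final class KeywordTally {
    // Per keyword: {neutral, positive, negative}.
    private final Map<String, int[]> counts = new HashMap<>();

    // ASSUMPTION: polarity > 0 is positive, < 0 is negative, 0 is neutral.
    synchronized void record(String keyword, int polarity) {
        int[] c = counts.computeIfAbsent(keyword, k -> new int[3]);
        if (polarity > 0) {
            c[1]++;
        } else if (polarity < 0) {
            c[2]++;
        } else {
            c[0]++;
        }
    }

    // Returns the counts gathered since the last drain and resets them,
    // as a per-minute flush to storage would.
    synchronized int[] drain(String keyword) {
        int[] c = counts.remove(keyword);
        return c == null ? new int[3] : c;
    }

    public static void main(String[] args) {
        KeywordTally tally = new KeywordTally();
        tally.record("obama", 1);
        tally.record("obama", 0);
        int[] c = tally.drain("obama");
        System.out.println("neutral=" + c[0] + " positive=" + c[1] + " negative=" + c[2]);
    }
}
```

Draining rather than reading keeps each minute's row independent of the previous one, which matches the one-row-per-keyword-per-minute layout of the table.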
  15. Real Time Sentiment Analysis: Writing to HBase

      private void writeToHBase() {
          // Build the minute bucket "yyyy:MM:dd:HH:mm" from the current time.
          Calendar cal = Calendar.getInstance();
          String calStr = String.format("%04d", (cal.get(Calendar.YEAR))) + ":"
                  + String.format("%02d", cal.get(Calendar.MONTH) + 1) + ":"  // Calendar.MONTH is zero-based
                  + String.format("%02d", cal.get(Calendar.DAY_OF_MONTH)) + ":"
                  + String.format("%02d", cal.get(Calendar.HOUR_OF_DAY)) + ":"
                  + String.format("%02d", cal.get(Calendar.MINUTE));
          // Row key: keyword followed by the minute bucket.
          String rowKey = keyword + ":" + calStr;
          Put put = new Put(rowKey.getBytes());
          // One column per sentiment class, all in the same column family.
          put.add(COLFAM1.getBytes(), "NEUTRAL".getBytes(), tracker.getNeutralCount().getBytes());
          put.add(COLFAM1.getBytes(), "POSITIVE".getBytes(), tracker.getPositiveCount().getBytes());
          put.add(COLFAM1.getBytes(), "NEGATIVE".getBytes(), tracker.getNegativeCount().getBytes());
          try {
              table.put(put);
          } catch (Exception ex) {
              System.err.println(ex);
          }
      }
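The minute-bucket row key built above with five `String.format` calls can also be produced by a single date pattern. The sketch below is an alternative, not the talk's code: it uses `java.time`, which arrived in Java 8 and so postdates the 2012 slides, and `MinuteRowKey` is a hypothetical name. It also encodes the key with an explicit charset instead of the platform default used by `getBytes()`.

```java
import java.nio.charset.StandardCharsets;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: formats "keyword:yyyy:MM:dd:HH:mm" row keys.
final class MinuteRowKey {
    // Zero-padded year, month, day, hour (24h) and minute, separated by colons.
    private static final DateTimeFormatter MINUTE =
            DateTimeFormatter.ofPattern("yyyy:MM:dd:HH:mm");

    // Row key as text, e.g. for logging or for building scan ranges.
    static String of(String keyword, LocalDateTime when) {
        return keyword + ":" + when.format(MINUTE);
    }

    // Row key as bytes, encoded as UTF-8 regardless of the platform default.
    static byte[] bytes(String keyword, LocalDateTime when) {
        return of(keyword, when).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(of("obama", LocalDateTime.now()));
    }
}
```

Passing the timestamp in as a parameter, rather than reading the clock inside the method, makes the key format easy to check against a fixed time.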
  16. Reading from HBase: Various Options
      Technologies for Writing HBase Clients
      [Diagram]
      - Option 1: HBase Client. The Service Node runs a Java client linked to the HBase client classes
      - Option 2: Thrift RPC. A Thrift client on the Service Node speaks the Thrift protocol to the HBase Thrift Gateway
      - Option 3: REST API. The Service Node talks REST (HTTP or HTTPS) to the HBase REST Gateway
  17. Reading from HBase and presenting to the user’s browser
      [Diagram]
      - Service Node
        - Tomcat: serves static html and proxies REST scan requests
        - HBase REST Gateway: scan HTable
      - Hadoop/HBase in the cloud
        - Master: NN, HBase Master
        - Hadoop Slave: DataNode, Region Server
        - Hadoop Slave: DataNode, Region Server
        - Hadoop Slave: DataNode, Region Server
  18. Tomcat as HTTP Proxy
      - The HBase Stargate REST server runs on port 8081 and is connected to HBase
      - The REST server has the capability to scan tables
      - A JavaScript web page is the client
      - Problem:
        - JavaScript security restrictions do not allow the JavaScript to execute REST calls to any server other than the one it was loaded from
        - Tomcat is used as a proxy. It serves up:
          - Static html pages with the JavaScript client, images etc.
          - REST requests from the JavaScript client, which are proxied to the HBase Stargate server running on port 8081
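The core of the proxy arrangement on this slide is a URL rewrite: the browser only ever talks to Tomcat, and Tomcat forwards some paths to the HBase REST server on port 8081. A minimal sketch of that rewrite follows. The class name `RestProxyTarget`, the `/hbase` mount path and the `tweets` table name are all hypothetical; the slides do not say how the proxy was configured, and a real proxy would also have to forward the request body and headers.

```java
import java.net.URI;
import java.util.Optional;

// Hypothetical sketch of the path rewrite a same-origin proxy performs.
final class RestProxyTarget {
    private final String backendBase; // e.g. "http://localhost:8081"
    private final String mountPath;   // path prefix reserved for proxied calls

    RestProxyTarget(String backendBase, String mountPath) {
        this.backendBase = backendBase;
        this.mountPath = mountPath;
    }

    // Maps a request path under the mount path to the backend URI.
    // Returns empty for any other path, which is then served as static content.
    Optional<URI> resolve(String requestPath, String query) {
        if (!requestPath.startsWith(mountPath + "/")) {
            return Optional.empty();
        }
        String rest = requestPath.substring(mountPath.length());
        String suffix = (query == null || query.isEmpty()) ? "" : "?" + query;
        return Optional.of(URI.create(backendBase + rest + suffix));
    }

    public static void main(String[] args) {
        RestProxyTarget proxy = new RestProxyTarget("http://localhost:8081", "/hbase");
        System.out.println(proxy.resolve("/hbase/tweets/scanner", null));
        System.out.println(proxy.resolve("/index.html", null));
    }
}
```

Since the page and the proxied REST calls share one origin, the browser's same-origin restriction is satisfied without any change to the HBase REST server.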
  19. Future Improvements
      - Elastic HBase in the cloud
      - At night time, use one VM to receive tweets and write them out into SequenceFiles in S3
      - Before business hours, start up HBase, run a MR job to process all these SequenceFiles and write into HBase
      - Cost effective real time HBase application in the cloud
  20. Big Data Apps in the Cloud
      - The Cloud is suitable for Big Data apps which use Big Data from the Internet. For example:
        - Twitter Public Status Updates
        - Blog entries
        - Web Crawl data
      - Big Data apps in the cloud are not useful if all your data is generated inside your network
        - Router, Storage device, Authentication device logs
        - Logs from Web Servers located inside your network
  21. Questions, Comments, Flames?
      - Thanks!
      - Jagane Sundar
      - jagane@altoscale.com