A Real Time Sentiment Analysis Application using
Hadoop and HBase in the Cloud




Jagane Sundar
Founder, AltoScale Inc.



June 14, 2012                      Hadoop Summit 2012


     AltoScale
About me

• Extensive knowledge of Hadoop, Cloud Compute and Virtualization
• Co-founder of AltoScale. We developed the Workbench
• Worked on Hadoop Management and Performance at Yahoo
• Primarily a systems and storage guy: have written TCP stacks,
  NFS clients, and Livebackup for KVM




My Motivation

• Build a cool real time big data app in order
  to acquire a deep understanding of Real
  Time Big Data Systems in the cloud




What will you get out of this?

• See how easy it is to build a highly
  scalable real-time Big Data application
  using a variety of open source tools and
  technologies




Real Time Sentiment Analysis

• Easily accessible real time signals
    - Twitter public status updates
    - Blog entries




Real Time Sentiment Analysis

• Two types of solutions to Real Time Sentiment Analysis
    - Keywords known a priori
        - Filter tweets by keyword
    - Open-ended sentiment analysis (no a priori
      knowledge of keywords)
        - Random sample of all public tweets
            - 1% of public tweets easily available
            - 10% sample (the Twitter "gardenhose"; the firehose is the
              full stream) may be available for purchase




Real Time Sentiment Analysis: Application Architecture

[Architecture diagram: Hadoop/HBase cluster]
  Service Node: TwitterSampler, HBase REST Gateway
    TwitterSampler -> Analyze Sentiment -> write a few new rows to HBase every minute
    HBase REST Gateway -> Scan HTable
  Master: NN, HBase Master
  Hadoop Slave (x3): DataNode, Region Server
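
The "write a few new rows to HBase every minute" step can be sketched as a timer that drains in-memory counters once a minute. This is only an illustration under stated assumptions: the class name, the "keyword:SENTIMENT" counter naming, and the sink callback are hypothetical and not taken from the application's source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical sketch: accumulate counts in memory, flush them once a minute.
class MinuteFlusherSketch {
    // Counter name (for example "obama:POSITIVE") -> tweets seen since the last flush.
    private final Map<String, AtomicInteger> counts = new ConcurrentHashMap<>();

    // Called from the stream listener thread for every classified tweet.
    void record(String keyword, String sentiment) {
        counts.computeIfAbsent(keyword + ":" + sentiment, k -> new AtomicInteger())
              .incrementAndGet();
    }

    // Resets every counter to zero and returns "name=count" for the nonzero ones.
    List<String> drain() {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, AtomicInteger> e : counts.entrySet()) {
            int n = e.getValue().getAndSet(0);
            if (n > 0) {
                rows.add(e.getKey() + "=" + n);
            }
        }
        return rows;
    }

    // Hands the drained counts to the sink (for example an HBase writer) every minute.
    // Note: if the sink throws, scheduleAtFixedRate cancels all later runs.
    ScheduledExecutorService start(Consumer<List<String>> sink) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(() -> sink.accept(drain()), 1, 1, TimeUnit.MINUTES);
        return timer;
    }

    public static void main(String[] args) {
        MinuteFlusherSketch flusher = new MinuteFlusherSketch();
        flusher.record("obama", "POSITIVE");
        flusher.record("obama", "POSITIVE");
        flusher.record("romney", "NEGATIVE");
        System.out.println(flusher.drain()); // entry order is not guaranteed
    }
}
```

Draining with getAndSet(0) means a tweet counted while the flush is running is never lost; it simply lands in the next minute's row.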

Real Time Sentiment Analysis: Twitter Streaming API Overview

[Diagram: Twitter APIs]
  REST APIs (Request/Response)
  Streaming APIs (Persistent HTTP Conn)
    Public Streams (sample of all public updates)
      filter
      sample    <- We use this API to collect tweets
      firehose
    User Streams (one user's updates)
    Site Streams (multiple users' updates)
Real Time Sentiment Analysis: Time Series Database

• Inspired by TSDB, but does not use TSDB
• Read Benoît “tsuna” Sigoure’s slides from HBaseCon 2012




Real Time Sentiment Analysis: in HBase

Row                          NEUTRAL  POSITIVE  NEGATIVE  Sample Tweets
obama:2012:06:04:13:34       1        4         0         sdac soasp few
romney:2012:06:04:13:34      2        3         1         Smsm djcn dje jdj
davebarry:2012:06:04:13:34   0        9         0         cs dsjw ausj
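
Because HBase keeps rows sorted by key, the zero-padded keyword:yyyy:MM:dd:HH:mm keys above put all minutes for one keyword next to each other, so a time range becomes a contiguous key range. The sketch below imitates this with a sorted in-memory map instead of a real HTable; the class and method names are hypothetical, and the counts other than the obama and romney 13:34 rows are made up for illustration.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch: a TreeMap stands in for HBase's sorted row keys.
class RowKeyScanSketch {
    // Builds a key in the slide's layout, e.g. obama:2012:06:04:13:34.
    static String rowKey(String keyword, int year, int month, int day, int hour, int minute) {
        return String.format("%s:%04d:%02d:%02d:%02d:%02d",
                keyword, year, month, day, hour, minute);
    }

    // Smallest key that sorts after every key starting with the prefix:
    // the prefix with its last character incremented (':' becomes ';').
    static String stopKeyFor(String prefix) {
        int last = prefix.length() - 1;
        return prefix.substring(0, last) + (char) (prefix.charAt(last) + 1);
    }

    // Returns the rows whose key starts with the prefix, in key order.
    static NavigableMap<String, int[]> prefixScan(NavigableMap<String, int[]> table,
                                                  String prefix) {
        return table.subMap(prefix, true, stopKeyFor(prefix), false);
    }

    public static void main(String[] args) {
        // Values are {NEUTRAL, POSITIVE, NEGATIVE} counts.
        NavigableMap<String, int[]> table = new TreeMap<>();
        table.put(rowKey("obama", 2012, 6, 4, 13, 34), new int[] {1, 4, 0});
        table.put(rowKey("obama", 2012, 6, 4, 13, 35), new int[] {0, 2, 1});
        table.put(rowKey("obama", 2012, 6, 4, 14, 0), new int[] {3, 1, 1});
        table.put(rowKey("romney", 2012, 6, 4, 13, 34), new int[] {2, 3, 1});

        // All obama rows for the 13:00 hour on 2012-06-04.
        System.out.println(prefixScan(table, "obama:2012:06:04:13:").keySet());
        // prints [obama:2012:06:04:13:34, obama:2012:06:04:13:35]
    }
}
```

Ending the prefix with ":" also keeps a keyword such as "obama" from matching a longer keyword that merely starts with the same letters.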




Real Time Sentiment Analysis: Front Page




Real Time Sentiment Analysis: Results Page




Real Time Sentiment Analysis: Standing on the Shoulders of Giants

• Hadoop and HBase, of course
• Twitter4J library for getting the Twitter stream
• Sentiment Analysis
    - https://code.google.com/p/twitter-sentiment-analysis/
    - Weka library
• Tomcat
• jQuery and Dojo for the JavaScript client




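Real Time Sentiment Analysis: Twitter Stream API - TsStatusListener

import twitter4j.Status;
import twitter4j.StatusListener;

public static class TsStatusListener implements StatusListener {
    // Called by Twitter4J once for every status received on the stream.
    public void onStatus(Status status) {
        // Classify the tweet text with the sentiment library.
        Item item = wm.weightedClassify(status.getText());
        int polarity = 0;
        try {
            polarity = Integer.parseInt(item.getPolarity().trim());
        } catch (NumberFormatException nfe) {
            // Unparseable polarity: keep the default of 0.
        }
        updateKeywordTrackers(status, polarity);
    }
    // The remaining StatusListener callbacks are not shown on the slide.
}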
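
The slide does not show updateKeywordTrackers. A minimal sketch of what such a method might do is given below, assuming a fixed keyword list, case-insensitive substring matching, and a polarity convention of negative / zero / positive integers; all of these are assumptions for illustration, as is the fact that the sketch takes the tweet text rather than a Twitter4J Status.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of per-keyword sentiment counters.
class KeywordTrackersSketch {
    // Counts for one keyword during the current minute.
    static final class Tracker {
        int neutral;
        int positive;
        int negative;
    }

    private final Map<String, Tracker> trackers = new LinkedHashMap<>();

    KeywordTrackersSketch(String... keywords) {
        for (String keyword : keywords) {
            trackers.put(keyword.toLowerCase(Locale.ROOT), new Tracker());
        }
    }

    // ASSUMPTION: polarity < 0 is negative, 0 is neutral, > 0 is positive.
    // The real classifier's encoding may differ.
    synchronized void update(String tweetText, int polarity) {
        String text = tweetText.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, Tracker> entry : trackers.entrySet()) {
            if (!text.contains(entry.getKey())) {
                continue; // tweet does not mention this keyword
            }
            Tracker tracker = entry.getValue();
            if (polarity > 0) {
                tracker.positive++;
            } else if (polarity < 0) {
                tracker.negative++;
            } else {
                tracker.neutral++;
            }
        }
    }

    synchronized Tracker get(String keyword) {
        return trackers.get(keyword.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        KeywordTrackersSketch sketch = new KeywordTrackersSketch("obama", "romney");
        sketch.update("Great speech by Obama tonight", 1);
        sketch.update("Obama and Romney debate", 0);
        Tracker obama = sketch.get("obama");
        System.out.println(obama.neutral + " " + obama.positive + " " + obama.negative);
        // prints 1 1 0
    }
}
```

A tweet that mentions several tracked keywords is counted once for each of them, which matches the one-row-per-keyword layout of the HBase table.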
Real Time Sentiment Analysis: Writing to HBase

import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

private void writeToHBase() {
    // Row key is keyword:yyyy:MM:dd:HH:mm, one row per keyword per minute.
    String calStr = new SimpleDateFormat("yyyy:MM:dd:HH:mm").format(new Date());
    String rowKey = keyword + ":" + calStr;

    // Bytes.toBytes encodes as UTF-8 regardless of the platform default charset.
    byte[] family = Bytes.toBytes(COLFAM1);
    Put put = new Put(Bytes.toBytes(rowKey));
    // Put.add is the HBase client API of the time; newer clients use addColumn.
    put.add(family, Bytes.toBytes("NEUTRAL"), Bytes.toBytes(tracker.getNeutralCount()));
    put.add(family, Bytes.toBytes("POSITIVE"), Bytes.toBytes(tracker.getPositiveCount()));
    put.add(family, Bytes.toBytes("NEGATIVE"), Bytes.toBytes(tracker.getNegativeCount()));

    try {
        table.put(put);
    } catch (IOException ex) {
        // A failed write only loses this minute's row; log it and carry on.
        System.err.println(ex);
    }
}
Reading from HBase: Various Options
Technologies for Writing HBase Clients

Option 1: HBase Client   Service Node: Java client linked to HBase client classes
Option 2: Thrift RPC     Service Node: HBase Thrift Gateway <- Thrift protocol -> Service Node: Thrift client
Option 3: REST API       Service Node: HBase REST Gateway <- REST (HTTP or HTTPS)
Reading from HBase and presenting to the user’s browser

[Diagram: Hadoop/HBase in the cloud]
  Service Node: HBase REST Gateway, Tomcat Proxy
    Tomcat Proxy: serves static html
    Tomcat Proxy -> REST scan -> HBase REST Gateway -> Scan HTable
  Master: NN, HBase Master
  Hadoop Slave (x3): DataNode, Region Server
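
A REST scan starts with a request that describes the row range. The sketch below builds such a request as I understand the HBase REST gateway's scanner resource: a PUT to /<table>/scanner with an XML body whose startRow and endRow attributes are base64-encoded. Treat that format as an assumption to check against the HBase REST documentation for your version; the table name "sentiment" and the class name are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical sketch: builds the URL and body for opening a REST scanner.
class ScannerRequestSketch {
    // Row keys travel base64-encoded inside the XML attributes.
    static String base64(String value) {
        return Base64.getEncoder().encodeToString(value.getBytes(StandardCharsets.UTF_8));
    }

    // ASSUMPTION: attribute names follow the gateway's Scanner XML model.
    static String scannerXml(String startRow, String endRow, int batch) {
        return String.format("<Scanner startRow=\"%s\" endRow=\"%s\" batch=\"%d\"/>",
                base64(startRow), base64(endRow), batch);
    }

    // Scanner resource for a table, e.g. http://localhost:8081/sentiment/scanner.
    static String scannerUrl(String gateway, String table) {
        return gateway + "/" + table + "/scanner";
    }

    public static void main(String[] args) {
        // Scan every row for the keyword "obama": keys from "obama:" up to "obama;".
        System.out.println("PUT " + scannerUrl("http://localhost:8081", "sentiment"));
        System.out.println(scannerXml("obama:", "obama;", 10));
    }
}
```

The gateway is then expected to answer with the location of the new scanner, which the client reads from repeatedly until no rows are left.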

Tomcat as HTTP Proxy

• HBase Stargate REST server runs on port 8081 and is
  connected to HBase
• The REST server has the capability to scan tables
• A JavaScript webpage is the client
• Problem:
    - JavaScript security restrictions (the same-origin policy) do not
      allow the JavaScript to execute REST calls to any server other
      than the one it was loaded from
• Solution: Tomcat is used as a proxy. It serves up:
    - Static html pages with the JavaScript client, images etc.
    - REST requests from the JavaScript client, which are proxied to
      the HBase Stargate server running on port 8081
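
The proxying idea can be sketched without Tomcat. The example below uses the JDK's built-in HttpServer as a stand-in for Tomcat, forwards only GET requests, and assumes a hypothetical /hbase path prefix and a front-end port of 8080; only the upstream port 8081 comes from the slide.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;

// Hypothetical sketch: same-origin front end that relays REST calls to port 8081.
class RestProxySketch {
    static final String PROXY_PREFIX = "/hbase";            // assumed path prefix
    static final String UPSTREAM = "http://localhost:8081"; // REST server from the slide

    // Maps /hbase/<path>?<query> on the proxy to <path>?<query> on the REST server.
    static String upstreamUrl(URI requestUri) {
        String path = requestUri.getRawPath().substring(PROXY_PREFIX.length());
        if (!path.startsWith("/")) {
            path = "/" + path;
        }
        String query = requestUri.getRawQuery();
        return UPSTREAM + path + (query == null ? "" : "?" + query);
    }

    // Relays one GET request and copies status, content type, and body back.
    static void forward(HttpExchange exchange) throws IOException {
        URI target = URI.create(upstreamUrl(exchange.getRequestURI()));
        HttpURLConnection conn = (HttpURLConnection) target.toURL().openConnection();
        conn.setRequestMethod("GET");
        String accept = exchange.getRequestHeaders().getFirst("Accept");
        if (accept != null) {
            conn.setRequestProperty("Accept", accept);
        }
        int status = conn.getResponseCode();
        InputStream in = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
        byte[] body = in == null ? new byte[0] : in.readAllBytes();
        if (conn.getContentType() != null) {
            exchange.getResponseHeaders().set("Content-Type", conn.getContentType());
        }
        // A length of -1 tells HttpServer that there is no response body.
        exchange.sendResponseHeaders(status, body.length == 0 ? -1 : body.length);
        if (body.length > 0) {
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        }
        exchange.close();
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext(PROXY_PREFIX, RestProxySketch::forward);
        server.start(); // static pages would be served from another context
    }
}
```

Because the browser loads the page and issues its REST calls against the same host and port, the same-origin restriction no longer applies; the proxy is the only component that talks to port 8081.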
Future Improvements

• Elastic HBase in the cloud
• At night time, use one VM to receive tweets and write them out
  into SequenceFiles in S3
• Before business hours, start up HBase, run a MR job to
  process all these SequenceFiles and write into HBase
• Cost effective real time HBase application in the cloud




Big Data Apps in the Cloud

• The Cloud is suitable for Big Data apps which use Big
  Data from the Internet. For example:
    - Twitter Public Status Updates
    - Blog entries
    - Web Crawl data
• Big Data apps in the cloud are not useful if all your data
  is generated inside your network
    - Router, Storage device, Authentication device logs
    - Logs from Web Servers located inside your network




• Questions, Comments, Flames?
    - Thanks!
    - Jagane Sundar
    - jagane@altoscale.com
