Analyzing Twitter Data with Hadoop

    Presentation Transcript

    • Analyzing Twitter Data with Hadoop
      Data Science Maryland, May 2013
      Joey Echeverria | Director Federal FTS
      joey@cloudera.com | @fwiffo
      ©2013 Cloudera, Inc.
    • About Joey
      • Director Federal FTS
      • Technical guy playing manager
      • 2 years @ Cloudera
      • 5 years of Hadoop
      • Local
    • BUILDING A BIG DATA SOLUTION
    • Big Data
      • Big
        • Larger volume than you’ve handled before
        • No litmus test
        • High value, underutilized
      • Data
        • Structured
        • Unstructured
        • Semi-structured
      • Hadoop
        • Distributed file system
        • Distributed, batch computation
    • Data Management Systems
      [Diagram: Data Source, Data Storage, Data Ingestion, Data Processing]
    • Relational Data Management Systems
      [Diagram: Data Source, RDBMS, ETL, Reporting]
    • A Canonical Hadoop Architecture
      [Diagram: Data Source, HDFS, Flume, Hive (Impala)]
    • AN EXAMPLE USE CASE
    • Analyzing Twitter
      • Social media popular with marketing teams
      • Twitter is an effective tool for promotion
      • Who is influential?
        • Tweets
        • Followers
        • Retweets
          • Similar to e-mail forwarding
      • Which Twitter user gets the most retweets?
      • Who is influential in our industry?
    • HOW DO WE ANSWER THESE QUESTIONS?
    • Techniques
      • SQL
        • Filtering
        • Aggregation
        • Sorting
      • Complex data
        • Deeply nested
        • Variable schema
    • Architecture
      [Diagram: Twitter; Flume (custom Flume source, sink to HDFS); HDFS; Hive (JSON SerDe parses data; queries and ETL); Oozie (add partitions hourly); Impala (queries)]
    • TWITTER SOURCE
    • Flume
      • Streaming data flow
      • Sources
        • Push or pull
      • Sinks
      • Event based
    • Pulling Data From Twitter
      • Custom source, using twitter4j
      • Sources process data as discrete events
    • Loading Data Into HDFS
      • HDFS Sink comes stock with Flume
      • Easily separate files by creation time
      • hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
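      The time escapes in the sink path above (%Y, %m, %d, %H) bucket files by year, month, day and hour. The sketch below only illustrates how such a template could expand for one event timestamp; it is not Flume's implementation. The class `BucketPathSketch` and its `resolve` method are hypothetical names, and the time zone is fixed to UTC here as an assumption so the result is reproducible.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Locale;

// Hypothetical helper, not part of Flume: expands %Y, %m, %d and %H in a
// path template using an event timestamp given in epoch milliseconds.
class BucketPathSketch {

    static String resolve(String template, long epochMillis) {
        // Assumption: UTC, so the output does not depend on the local zone.
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        return template
                .replace("%Y", String.format(Locale.ROOT, "%04d", t.getYear()))
                .replace("%m", String.format(Locale.ROOT, "%02d", t.getMonthValue()))
                .replace("%d", String.format(Locale.ROOT, "%02d", t.getDayOfMonth()))
                .replace("%H", String.format(Locale.ROOT, "%02d", t.getHour()));
    }

    public static void main(String[] args) {
        String template = "hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/";
        // 1367416800000L is 2013-05-01T14:00:00Z.
        System.out.println(resolve(template, 1367416800000L));
    }
}
```

      With one directory per hour, a later job can register each directory as a table partition without inspecting file contents.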
    • Flume Source
        public class TwitterSource extends AbstractSource
            implements EventDrivenSource, Configurable {
          ...
          // The initialization method for the Source. The context contains all
          // the Flume configuration info.
          @Override
          public void configure(Context context) {
            ...
          }
          ...
          // Start processing events. Uses the Twitter Streaming API to sample
          // Twitter, and process tweets.
          @Override
          public void start() {
            ...
          }
          ...
          // Stops the Source's event processing and shuts down the Twitter stream.
          @Override
          public void stop() {
            ...
          }
        }
    • Twitter API
      • Callback mechanism for catching new tweets
        /** The actual Twitter stream. It's set up to collect raw JSON data. */
        private final TwitterStream twitterStream = new TwitterStreamFactory(
            new ConfigurationBuilder().setJSONStoreEnabled(true).build())
            .getInstance();
        ...
        // The StatusListener is a twitter4j API that can be added to a stream,
        // and will call a method every time a message is sent to the stream.
        StatusListener listener = new StatusListener() {
          // The onStatus method is executed every time a new tweet comes in.
          public void onStatus(Status status) {
            ...
          }
        };
        ...
        // Set up the stream's listener (defined above), and set any necessary
        // security information.
        twitterStream.addListener(listener);
        twitterStream.setOAuthConsumer(consumerKey, consumerSecret);
        AccessToken token = new AccessToken(accessToken, accessTokenSecret);
        twitterStream.setOAuthAccessToken(token);
    • JSON Data
      • JSON data is processed as an event and written to HDFS
        public void onStatus(Status status) {
          // The EventBuilder is used to build an event using the headers and
          // the raw JSON of a tweet
          headers.put("timestamp", String.valueOf(
              status.getCreatedAt().getTime()));
          Event event = EventBuilder.withBody(
              DataObjectFactory.getRawJSON(status).getBytes(), headers);

          channel.processEvent(event);
        }
    • FLUME DEMO
    • HIVE
    • What is Hive?
      • Created at Facebook
      • HiveQL
        • SQL-like interface
      • Hive interpreter converts HiveQL to MapReduce code
      • Returns results to the client
    • Hive Details
      • Schema on read
      • Scalar types (int, float, double, boolean, string)
      • Complex types (struct, map, array)
      • Metastore contains table definitions
        • Stored in a relational database
        • Similar to catalog tables in other DBs
    • Complex Data
        SELECT
          t.retweet_screen_name,
          sum(retweets) AS total_retweets,
          count(*) AS tweet_count
        FROM (SELECT
                retweeted_status.user.screen_name AS retweet_screen_name,
                retweeted_status.text,
                max(retweeted_status.retweet_count) AS retweets
              FROM tweets
              GROUP BY
                retweeted_status.user.screen_name,
                retweeted_status.text) t
        GROUP BY t.retweet_screen_name
        ORDER BY total_retweets DESC
        LIMIT 10;
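      The query above aggregates in two steps: the inner SELECT keeps the highest retweet_count seen for each (screen name, text) pair, and the outer SELECT sums those maxima per user, counts the distinct tweets, and keeps the top rows. As a rough in-memory analogue of that logic (not how Hive executes it), the steps could be sketched as follows; `RetweetRankingSketch`, `Retweet`, `Row` and `topRetweeted` are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory analogue of the two-level GROUP BY query.
class RetweetRankingSketch {

    // One observed retweet: original author, original text, and the
    // retweet count reported when the retweet was captured.
    static final class Retweet {
        final String screenName;
        final String text;
        final long retweetCount;

        Retweet(String screenName, String text, long retweetCount) {
            this.screenName = screenName;
            this.text = text;
            this.retweetCount = retweetCount;
        }
    }

    // One output row: user, summed retweets, number of distinct tweets.
    static final class Row {
        final String screenName;
        final long totalRetweets;
        final int tweetCount;

        Row(String screenName, long totalRetweets, int tweetCount) {
            this.screenName = screenName;
            this.totalRetweets = totalRetweets;
            this.tweetCount = tweetCount;
        }
    }

    static List<Row> topRetweeted(List<Retweet> retweets, int limit) {
        // Inner query: GROUP BY (screen_name, text), keep max(retweet_count).
        Map<String, Map<String, Long>> maxPerTweet = new LinkedHashMap<>();
        for (Retweet r : retweets) {
            maxPerTweet
                    .computeIfAbsent(r.screenName, k -> new LinkedHashMap<>())
                    .merge(r.text, r.retweetCount, Long::max);
        }
        // Outer query: GROUP BY screen_name, sum the maxima, count the tweets.
        List<Row> rows = new ArrayList<>();
        for (Map.Entry<String, Map<String, Long>> e : maxPerTweet.entrySet()) {
            long total = 0;
            for (long count : e.getValue().values()) {
                total += count;
            }
            rows.add(new Row(e.getKey(), total, e.getValue().size()));
        }
        // ORDER BY total_retweets DESC LIMIT n.
        rows.sort(Comparator.comparingLong((Row r) -> r.totalRetweets).reversed());
        return new ArrayList<>(rows.subList(0, Math.min(limit, rows.size())));
    }

    public static void main(String[] args) {
        List<Retweet> sample = Arrays.asList(
                new Retweet("alice", "tweet a", 3),
                new Retweet("alice", "tweet a", 5),
                new Retweet("alice", "tweet b", 2),
                new Retweet("bob", "tweet c", 10));
        for (Row row : topRetweeted(sample, 10)) {
            System.out.println(row.screenName + " " + row.totalRetweets + " " + row.tweetCount);
        }
    }
}
```

      Taking the maximum first matters because the same original tweet can appear many times in the stream, each time with a larger count; summing the raw values would count it repeatedly.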
    • JSON INTERLUDE
    • What is JSON?
      • Complex, semi-structured data
      • Based on JavaScript’s data syntax
      • Rich, nested data types:
        • number
        • string
        • array
        • object
        • true, false
        • null
    • What is JSON?
        {
          "retweeted_status": {
            "contributors": null,
            "text": "#Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is clogged. #bigdata",
            "retweeted": false,
            "entities": {
              "hashtags": [
                {
                  "text": "Crowdsourcing",
                  "indices": [0, 14]
                },
                {
                  "text": "bigdata",
                  "indices": [129, 137]
                }
              ],
              "user_mentions": []
            }
          }
        }
    • Hive Serializers and Deserializers
      • Instructs Hive on how to interpret data
      • JSONSerDe
    • HIVE DEMO
    • IT’S A TRAP!
      Photo from http://www.flickr.com/photos/vanf/6798065626/ (some rights reserved)
    • Not a Database
                            RDBMS                     Hive
      Language              Generally >= SQL-92       Subset of SQL-92 plus Hive specific extensions
      Update capabilities   INSERT, UPDATE, DELETE    INSERT OVERWRITE; no UPDATE, DELETE
      Transactions          Yes                       No
      Latency               Sub-second                Minutes
      Indexes               Yes                       Yes
      Data size             Terabytes                 Petabytes
    • ETL
      • Hive works great for SQL-based ETL
      • JSON -> SequenceFiles
    • IMPALA
    • Cloudera Impala
      Real-Time Query for Data Stored in Hadoop.
      • Supports Hive SQL
      • 4-30X faster than Hive over MapReduce
      • Uses existing drivers, integrates with existing metastore, works with leading BI tools
      • Flexible, cost-effective, no lock-in
      • Deploy & operate with Cloudera Enterprise RTQ
      • Supports multiple storage engines & file formats
    • Benefits of Cloudera Impala
      Real-Time Query for Data Stored in Hadoop
      • Real-time queries run directly on source data
      • No ETL delays
      • No jumping between data silos
      • No double storage with EDW/RDBMS
      • Unlock analysis on more data
      • No need to create and maintain complex ETL between systems
      • No need to preplan schemas
      • All data available for interactive queries
      • No loss of fidelity from fixed data schemas
      • Single metadata store from origination through analysis
      • No need to hunt through multiple data silos
    • Cloudera Impala Details
      [Diagram: ODBC and SQL App clients; three nodes, each with Query Planner, Query Coordinator, Query Exec Engine, HDFS DN and HBase; shared State Store, HDFS NN, Hive Metastore and YARN]
      • Common Hive SQL and interface
      • Unified metadata and scheduler
      • Low-latency scheduler and cache (low-impact failures)
      • Fully MPP Distributed
      • Local Direct Reads
    • IMPALA DEMO
    • OOZIE AUTOMATION
    • Oozie: Everything in its Right Place
    • Oozie for Partition Management
      • Once an hour, add a partition
      • Takes advantage of advanced Hive functionality
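      Adding a partition once an hour amounts to running one Hive DDL statement that points a new partition at that hour's directory. The sketch below builds such a statement for a given timestamp. The table name `tweets` comes from the query in this deck and the base path `/user/flume/tweets` from the HDFS sink slide; the partition column name `datehour`, the class `HourlyPartitionSketch`, and the use of UTC are assumptions made for illustration, and the actual Oozie workflow is not shown in the transcript.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper: builds the Hive statement an hourly job could run.
class HourlyPartitionSketch {

    // Assumption: the partition key is an integer such as 2013050114.
    private static final DateTimeFormatter KEY =
            DateTimeFormatter.ofPattern("yyyyMMddHH", Locale.ROOT).withZone(ZoneOffset.UTC);

    // Directory layout matching the sink path /%Y/%m/%d/%H.
    private static final DateTimeFormatter DIR =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH", Locale.ROOT).withZone(ZoneOffset.UTC);

    static String addPartitionStatement(long epochMillis) {
        Instant hour = Instant.ofEpochMilli(epochMillis);
        return "ALTER TABLE tweets ADD IF NOT EXISTS PARTITION (datehour = "
                + KEY.format(hour)
                + ") LOCATION '/user/flume/tweets/"
                + DIR.format(hour)
                + "'";
    }

    public static void main(String[] args) {
        // 1367416800000L is 2013-05-01T14:00:00Z.
        System.out.println(addPartitionStatement(1367416800000L));
    }
}
```

      IF NOT EXISTS keeps the statement safe to rerun if the scheduler retries an hour that was already registered.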
    • OOZIE DEMO
    • PUTTING IT ALL TOGETHER
    • Complete Architecture
      [Diagram: Twitter; Flume (custom Flume source, sink to HDFS); HDFS; Hive (JSON SerDe parses data; queries and ETL); Oozie (add partitions hourly); Impala (queries)]
    • MORE DEMOS
    • What next?
      • Cloudera University
      • Download Hadoop!
        • CDH available at www.cloudera.com
      • Cloudera provides pre-loaded VMs
        • https://ccp.cloudera.com/display/SUPPORT/Cloudera+Manager+Free+Edition+Demo+VM
      • Clone the source repo
        • https://github.com/cloudera/cdh-twitter-example
    • My personal preference
      • Cloudera Manager
      • https://ccp.cloudera.com/display/SUPPORT/Downloads
      • Free up to 50 unlimited nodes!
    • Shout Out
      • Jon Natkins
      • @nattyice
      • Blog posts
        • http://blog.cloudera.com/blog/2013/09/analyzing-twitter-data-with-hadoop/
        • http://blog.cloudera.com/blog/2013/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume/
        • http://blog.cloudera.com/blog/2013/11/analyzing-twitter-data-with-hadoop-part-3-querying-semi-structured-data-with-hive/
    • Questions?
      • Contact me!
      • Joey Echeverria
      • joey@cloudera.com
      • @fwiffo
      • We’re hiring!