Analyzing Twitter Data with Hadoop

  1. Analyzing Twitter Data with Hadoop. Data Science Maryland, May 2013. Joey Echeverria | Director Federal FTS. joey@cloudera.com | @fwiffo. ©2013 Cloudera, Inc.
  2. About Joey • Director Federal FTS • Technical guy playing manager • 2 years @ Cloudera • 5 years of Hadoop • Local
  3. Analyzing Twitter Data with Hadoop: BUILDING A BIG DATA SOLUTION
  4. Big Data • Big • Larger volume than you’ve handled before • No litmus test • High value, underutilized • Data • Structured • Unstructured • Semi-structured • Hadoop • Distributed file system • Distributed, batch computation
  5. Data Management Systems (diagram: Data Source, Data Storage, Data Ingestion, Data Processing)
  6. Relational Data Management Systems (diagram: Data Source, RDBMS, ETL, Reporting)
  7. A Canonical Hadoop Architecture (diagram: Data Source, HDFS, Flume, Hive (Impala))
  8. Analyzing Twitter Data with Hadoop: AN EXAMPLE USE CASE
  9. Analyzing Twitter • Social media popular with marketing teams • Twitter is an effective tool for promotion • Who is influential? • Tweets • Followers • Retweets • Similar to e-mail forwarding • Which Twitter user gets the most retweets? • Who is influential in our industry?
  10. Analyzing Twitter Data with Hadoop: HOW DO WE ANSWER THESE QUESTIONS?
  11. Techniques • SQL • Filtering • Aggregation • Sorting • Complex data • Deeply nested • Variable schema
  12. Architecture (diagram: Twitter, HDFS, Flume, Hive, Custom Flume Source, Sink to HDFS, JSON SerDe Parses Data, Oozie, Add Partitions Hourly, Impala, Queries, Queries and ETL)
  13. Analyzing Twitter Data with Hadoop: TWITTER SOURCE
  14. Flume • Streaming data flow • Sources • Push or pull • Sinks • Event based
  15. Pulling Data From Twitter • Custom source, using twitter4j • Sources process data as discrete events
  16. Loading Data Into HDFS • HDFS Sink comes stock with Flume • Easily separate files by creation time • hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
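
The `%Y/%m/%d/%H` escapes in the sink path on slide 16 are filled in from each event's `timestamp` header (epoch milliseconds). A minimal Java sketch of that expansion, run outside Flume, is shown here. The class `BucketPath`, its `expand` method, and the choice of UTC are assumptions made for illustration; Flume's HDFS sink performs this substitution internally and supports more escapes and time zone settings than this sketch does.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

// Hypothetical helper, not part of Flume: expands the four time escapes
// used in the slide's HDFS path from an epoch-millisecond timestamp.
final class BucketPath {
    static String expand(String pattern, long epochMillis) {
        // Assumption: interpret the timestamp in UTC.
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        return pattern
            .replace("%Y", String.format("%04d", t.getYear()))
            .replace("%m", String.format("%02d", t.getMonthValue()))
            .replace("%d", String.format("%02d", t.getDayOfMonth()))
            .replace("%H", String.format("%02d", t.getHour()));
    }
}
```

With this layout, every tweet created in the same hour lands in the same directory, which is what later lets one Hive partition map to one hour of data.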
  17. Flume Source

          public class TwitterSource extends AbstractSource
              implements EventDrivenSource, Configurable {
            ...
            // The initialization method for the Source. The context contains all
            // the Flume configuration info.
            @Override
            public void configure(Context context) {
              ...
            }
            ...
            // Start processing events. Uses the Twitter Streaming API to sample
            // Twitter, and process tweets.
            @Override
            public void start() {
              ...
            }
            ...
            // Stops the Source's event processing and shuts down the Twitter stream.
            @Override
            public void stop() {
              ...
            }
          }

  18. Twitter API • Callback mechanism for catching new tweets

          /** The actual Twitter stream. It's set up to collect raw JSON data */
          private final TwitterStream twitterStream = new TwitterStreamFactory(
              new ConfigurationBuilder().setJSONStoreEnabled(true).build())
              .getInstance();
          ...
          // The StatusListener is a twitter4j API that can be added to a stream,
          // and will call a method every time a message is sent to the stream.
          StatusListener listener = new StatusListener() {
            // The onStatus method is executed every time a new tweet comes in.
            public void onStatus(Status status) {
              ...
            }
          };
          ...
          // Set up the stream's listener (defined above), and set any necessary
          // security information.
          twitterStream.addListener(listener);
          twitterStream.setOAuthConsumer(consumerKey, consumerSecret);
          AccessToken token = new AccessToken(accessToken, accessTokenSecret);
          twitterStream.setOAuthAccessToken(token);

  19. JSON Data • JSON data is processed as an event and written to HDFS

          public void onStatus(Status status) {
            // The EventBuilder is used to build an event using the headers and
            // the raw JSON of a tweet
            headers.put("timestamp", String.valueOf(
                status.getCreatedAt().getTime()));
            Event event = EventBuilder.withBody(
                DataObjectFactory.getRawJSON(status).getBytes(), headers);

            channel.processEvent(event);
          }

  20. Analyzing Twitter Data with Hadoop: FLUME DEMO
  21. Analyzing Twitter Data with Hadoop: HIVE
  22. What is Hive? • Created at Facebook • HiveQL • SQL-like interface • Hive interpreter converts HiveQL to MapReduce code • Returns results to the client
  23. Hive Details • Schema on read • Scalar types (int, float, double, boolean, string) • Complex types (struct, map, array) • Metastore contains table definitions • Stored in a relational database • Similar to catalog tables in other DBs
  24. Complex Data

          SELECT
            t.retweet_screen_name,
            sum(retweets) AS total_retweets,
            count(*) AS tweet_count
          FROM (SELECT
                  retweeted_status.user.screen_name AS retweet_screen_name,
                  retweeted_status.text,
                  max(retweeted_status.retweet_count) AS retweets
                FROM tweets
                GROUP BY
                  retweeted_status.user.screen_name,
                  retweeted_status.text) t
          GROUP BY t.retweet_screen_name
          ORDER BY total_retweets DESC
          LIMIT 10;

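
The two-level aggregation in the query on slide 24 can be easier to follow as plain code. The sketch below mirrors its logic in memory in Java: the inner step keeps the highest observed retweet count per (screen name, text) pair, and the outer step sums those per user and returns the top entries. The `RetweetRanking` class and its flattened `Retweet` row type are hypothetical names introduced for illustration, and the `tweet_count` column is omitted for brevity; this is not how Hive executes the query.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class RetweetRanking {
    // Hypothetical flattened row: one observed retweet of an original tweet.
    static final class Retweet {
        final String screenName;
        final String text;
        final int retweetCount;

        Retweet(String screenName, String text, int retweetCount) {
            this.screenName = screenName;
            this.text = text;
            this.retweetCount = retweetCount;
        }
    }

    static List<Map.Entry<String, Long>> topRetweeted(List<Retweet> rows, int limit) {
        // Inner query: max(retweet_count) grouped by (screen_name, text).
        Map<List<String>, Integer> maxPerTweet = new HashMap<>();
        for (Retweet r : rows) {
            maxPerTweet.merge(Arrays.asList(r.screenName, r.text), r.retweetCount, Integer::max);
        }
        // Outer query: sum(retweets) grouped by screen_name.
        Map<String, Long> totals = new HashMap<>();
        for (Map.Entry<List<String>, Integer> e : maxPerTweet.entrySet()) {
            totals.merge(e.getKey().get(0), e.getValue().longValue(), Long::sum);
        }
        // ORDER BY total_retweets DESC LIMIT n.
        return totals.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(limit)
            .collect(Collectors.toList());
    }
}
```

Taking the maximum first matters because the same original tweet appears once per retweet in the stream, each time with a growing `retweet_count`; summing the raw rows would count it many times over.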
  25. Analyzing Twitter Data with Hadoop: JSON INTERLUDE
  26. What is JSON? • Complex, semi-structured data • Based on JavaScript’s data syntax • Rich, nested data types: • number • string • array • object • true, false • null
  27. What is JSON?

          {
            "retweeted_status": {
              "contributors": null,
              "text": "#Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is clogged. #bigdata",
              "retweeted": false,
              "entities": {
                "hashtags": [
                  {
                    "text": "Crowdsourcing",
                    "indices": [0, 14]
                  },
                  {
                    "text": "bigdata",
                    "indices": [129, 137]
                  }
                ],
                "user_mentions": []
              }
            }
          }

  28. Hive Serializers and Deserializers • Instructs Hive on how to interpret data • JSONSerDe
  29. Analyzing Twitter Data with Hadoop: HIVE DEMO
  30. Analyzing Twitter Data with Hadoop: IT’S A TRAP! (Photo from http://www.flickr.com/photos/vanf/6798065626/, some rights reserved)
  31. Not a Database

          |                     | RDBMS                  | Hive                                           |
          |---------------------|------------------------|------------------------------------------------|
          | Language            | Generally >= SQL-92    | Subset of SQL-92 plus Hive-specific extensions |
          | Update capabilities | INSERT, UPDATE, DELETE | INSERT OVERWRITE; no UPDATE, DELETE            |
          | Transactions        | Yes                    | No                                             |
          | Latency             | Sub-second             | Minutes                                        |
          | Indexes             | Yes                    | Yes                                            |
          | Data size           | Terabytes              | Petabytes                                      |

  32. ETL • Hive works great for SQL-based ETL • JSON -> SequenceFiles
  33. Analyzing Twitter Data with Hadoop: IMPALA
  34. Cloudera Impala • Real-Time Query for Data Stored in Hadoop • Supports Hive SQL • 4-30X faster than Hive over MapReduce • Uses existing drivers, integrates with existing metastore, works with leading BI tools • Flexible, cost-effective, no lock-in • Deploy & operate with Cloudera Enterprise RTQ • Supports multiple storage engines & file formats
  35. Benefits of Cloudera Impala: Real-Time Query for Data Stored in Hadoop • Real-time queries run directly on source data • No ETL delays • No jumping between data silos • No double storage with EDW/RDBMS • Unlock analysis on more data • No need to create and maintain complex ETL between systems • No need to preplan schemas • All data available for interactive queries • No loss of fidelity from fixed data schemas • Single metadata store from origination through analysis • No need to hunt through multiple data silos
  36. Cloudera Impala Details (diagram: ODBC, SQL App; three nodes, each with Query Planner, Query Coordinator, Query Exec Engine, HDFS DN, HBase; State Store, HDFS NN, Hive Metastore, YARN) • Fully MPP • Distributed • Local Direct Reads • Common Hive SQL and interface • Unified metadata and scheduler • Low-latency scheduler and cache (low-impact failures)
  37. Analyzing Twitter Data with Hadoop: IMPALA DEMO
  38. Analyzing Twitter Data with Hadoop: OOZIE AUTOMATION
  39. Oozie: Everything in its Right Place
  40. Oozie for Partition Management • Once an hour, add a partition • Takes advantage of advanced Hive functionality
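
The hourly step on slide 40 comes down to issuing one Hive `ALTER TABLE ... ADD PARTITION` statement per hour, pointing at the directory Flume wrote for that hour. A minimal Java sketch that only builds such a statement is shown here. The `AddPartition` class, the partition column name `datehour`, and the `yyyy/MM/dd/HH` directory layout under a base directory are assumptions for illustration; the slides do not show the actual Oozie workflow or Hive script.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: builds the HiveQL an hourly job could run.
final class AddPartition {
    private static final DateTimeFormatter KEY = DateTimeFormatter.ofPattern("yyyyMMddHH");
    private static final DateTimeFormatter DIR = DateTimeFormatter.ofPattern("yyyy/MM/dd/HH");

    static String statement(String table, ZonedDateTime hour, String baseDir) {
        // Assumption: one integer partition key per hour, e.g. 2013050113.
        String datehour = hour.format(KEY);
        // Assumption: data for that hour lives under baseDir/yyyy/MM/dd/HH.
        String location = baseDir + hour.format(DIR);
        return String.format(
            "ALTER TABLE %s ADD IF NOT EXISTS PARTITION (datehour = %s) LOCATION '%s'",
            table, datehour, location);
    }
}
```

`IF NOT EXISTS` keeps the statement safe to re-run if a scheduled hour is retried.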
  41. Analyzing Twitter Data with Hadoop: OOZIE DEMO
  42. Analyzing Twitter Data with Hadoop: PUTTING IT ALL TOGETHER
  43. Complete Architecture (diagram: Twitter, HDFS, Flume, Hive, Custom Flume Source, Sink to HDFS, JSON SerDe Parses Data, Oozie, Add Partitions Hourly, Impala, Queries, Queries and ETL)
  44. Analyzing Twitter Data with Hadoop: MORE DEMOS
  45. What next? • Cloudera University • Download Hadoop! • CDH available at www.cloudera.com • Cloudera provides pre-loaded VMs • https://ccp.cloudera.com/display/SUPPORT/Cloudera+Manager+Free+Edition+Demo+VM • Clone the source repo • https://github.com/cloudera/cdh-twitter-example
  46. My personal preference • Cloudera Manager • https://ccp.cloudera.com/display/SUPPORT/Downloads • Free up to 50 unlimited nodes!
  47. Shout Out • Jon Natkins • @nattyice • Blog posts • http://blog.cloudera.com/blog/2013/09/analyzing-twitter-data-with-hadoop/ • http://blog.cloudera.com/blog/2013/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume/ • http://blog.cloudera.com/blog/2013/11/analyzing-twitter-data-with-hadoop-part-3-querying-semi-structured-data-with-hive/
  48. Questions? • Contact me! • Joey Echeverria • joey@cloudera.com • @fwiffo • We’re hiring!
  49. ©2013 Cloudera, Inc.