Your SlideShare is downloading. ×
Analyzing Twitter Data with Hadoop
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×

Introducing the official SlideShare app

Stunning, full-screen experience for iPhone and Android

Text the download link to your phone

Standard text messaging rates apply

Analyzing Twitter Data with Hadoop

1,152
views

Published on

Published in: Technology

0 Comments
7 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
1,152
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
0
Comments
0
Likes
7
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. 1Analyzing Twitter Data with HadoopOpen Analytics Summit, March 2013Joey Echeverria | Principal Solutions Architectjoey@cloudera.com | @fwiffo©2013 Cloudera, Inc.
  • 2. About Joey• Principal Solutions Architect• 2 years @ Cloudera• 5 years of Hadoop• Local2
  • 3. Analyzing Twitter Data with HadoopBUILDING A BIG DATA SOLUTION3 ©2013 Cloudera, Inc.
  • 4. Big Data• Big• Larger volume than you’ve handled before• No litmus test• High value, under utilized• Data• Structured• Unstructured• Semi-structured• Hadoop• Distributed file system• Distributed, batch computation4 ©2013 Cloudera, Inc.
  • 5. Data Management Systems5 ©2013 Cloudera, Inc.Data Source Data StorageDataIngestionDataProcessing
  • 6. Relational Data Management Systems6 ©2013 Cloudera, Inc.Data Source RDBMSETLReporting
  • 7. A Canonical Hadoop Architecture7 ©2013 Cloudera, Inc.Data Source HDFSFlumeHive(Impala)
  • 8. Analyzing Twitter Data with HadoopAN EXAMPLE USE CASE8 ©2013 Cloudera, Inc.
  • 9. Analyzing Twitter• Social media popular with marketing teams• Twitter is an effective tool for promotion• Who is influential?• Tweets• Followers• Retweets• Similar to e-mail forwarding• Which twitter user gets the most retweets?• Who is influential in our industry?9 ©2013 Cloudera, Inc.
  • 10. Analyzing Twitter Data with HadoopHOW DO WE ANSWER THESEQUESTIONS?10 ©2013 Cloudera, Inc.
  • 11. Techniques• SQL• Filtering• Aggregation• Sorting• Complex data• Deeply nested• Variable schema11
  • 12. Architecture12 ©2013 Cloudera, Inc.TwitterHDFSFlume HiveCustomFlumeSourceSink toHDFSJSON SerDeParses DataOozieAddPartitionsHourly
  • 13. Analyzing Twitter Data with HadoopTWITTER SOURCE13 ©2013 Cloudera, Inc.
  • 14. Flume• Streaming data flow• Sources• Push or pull• Sinks• Event based14 ©2013 Cloudera, Inc.
  • 15. Pulling Data From Twitter• Custom source, using twitter4j• Sources process data as discrete events
  • 16. Loading Data Into HDFS• HDFS Sink comes stock with Flume• Easily separate files by creation time• hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
  • 17. Flume Source17 ©2013 Cloudera, Inc.public class TwitterSource extends AbstractSourceimplements EventDrivenSource, Configurable {...// The initialization method for the Source. The context contains all// the Flume configuration info@Overridepublic void configure(Context context) {...}...// Start processing events. Uses the Twitter Streaming API to sample// Twitter, and process tweets.@Overridepublic void start() {...}...// Stops Sources event processing and shuts down the Twitter stream.@Overridepublic void stop() {...}}
  • 18. Twitter API• Callback mechanism for catching new tweets18 ©2013 Cloudera, Inc./** The actual Twitter stream. Its set up to collect raw JSON data */private final TwitterStream twitterStream = new TwitterStreamFactory(new ConfigurationBuilder().setJSONStoreEnabled(true).build()).getInstance();...// The StatusListener is a twitter4j API that can be added to a stream,// and will call a method every time a message is sent to the stream.StatusListener listener = new StatusListener() {// The onStatus method is executed every time a new tweet comes in.public void onStatus(Status status) {...}}...// Set up the streams listener (defined above), and set any necessary// security information.twitterStream.addListener(listener);twitterStream.setOAuthConsumer(consumerKey, consumerSecret);AccessToken token = new AccessToken(accessToken, accessTokenSecret);twitterStream.setOAuthAccessToken(token);
  • 19. JSON Data• JSON data is processed as an event and written toHDFS19 ©2013 Cloudera, Inc.public void onStatus(Status status) {// The EventBuilder is used to build an event using the headers and// the raw JSON of a tweetheaders.put("timestamp", String.valueOf(status.getCreatedAt().getTime()));Event event = EventBuilder.withBody(DataObjectFactory.getRawJSON(status).getBytes(), headers);channel.processEvent(event);}
  • 20. Analyzing Twitter Data with HadoopFLUME DEMO20 ©2013 Cloudera, Inc.
  • 21. Analyzing Twitter Data with HadoopHIVE21 ©2013 Cloudera, Inc.
  • 22. What is Hive?• Created at Facebook• HiveQL• SQL like interface• Hive interpreterconverts HiveQL toMapReduce code• Returns results to theclient22 ©2013 Cloudera, Inc.
  • 23. Hive Details• Schema on read• Scalar types (int, float, double, boolean, string)• Complex types (struct, map, array)• Metastore contains table definitions• Stored in a relational database• Similar to catalog tables in other DBs23
  • 24. Complex Data24 ©2013 Cloudera, Inc.SELECTt.retweet_screen_name,sum(retweets) AS total_retweets,count(*) AS tweet_countFROM (SELECTretweeted_status.user.screen_name AS retweet_screen_name,retweeted_status.text,max(retweeted_status.retweet_count) AS retweetsFROM tweetsGROUP BYretweeted_status.user.screen_name,retweeted_status.text) tGROUP BY t.retweet_screen_nameORDER BY total_retweets DESCLIMIT 10;
  • 25. Analyzing Twitter Data with HadoopJSON INTERLUDE25 ©2013 Cloudera, Inc.
  • 26. What is JSON?• Complex, semi-structured data• Based on JavaScript’s data syntax• Rich, nested data types:• number• string• Array• object• true, false• null26 ©2013 Cloudera, Inc.
  • 27. What is JSON?27 ©2013 Cloudera, Inc.{"retweeted_status": {"contributors": null,"text": "#Crowdsourcing – drivers already generate traffic data for your smartphone to suggestalternative routes when a road is clogged. #bigdata","retweeted": false,"entities": {"hashtags": [{"text": "Crowdsourcing","indices": [0, 14]},{"text": "bigdata","indices": [129,137]}],"user_mentions": []}}}
  • 28. Hive Serializers and Deserializers• Instructs Hive on how to interpret data• JSONSerDe28 ©2013 Cloudera, Inc.
  • 29. Analyzing Twitter Data with HadoopHIVE DEMO29 ©2013 Cloudera, Inc.
  • 30. Analyzing Twitter Data with HadoopIT’S A TRAP!30 ©2013 Cloudera, Inc.Photo from http://www.flickr.com/photos/vanf/6798065626/ Some rights reserved
  • 31. Not a Database31 ©2013 Cloudera, Inc.RDBMS HiveLanguageGenerally >= SQL-92Subset of SQL-92 plusHive specificextensionsUpdate CapabilitiesINSERT, UPDATE,DELETEINSERT OVERWRITEno UPDATE, DELETETransactions Yes NoLatency Sub-second MinutesIndexes Yes YesData size Terabytes Petabytes
  • 32. Analyzing Twitter Data with HadoopIMPALA ASIDE32 ©2013 Cloudera, Inc.
  • 33. Cloudera Impala33Real-Time Query for Data Stored in Hadoop.Supports Hive SQL4-30X faster than Hive over MapReduceUses existing drivers, integrates with existingmetastore, works with leading BI toolsFlexible, cost-effective, no lock-inDeploy & operate withCloudera Enterprise RTQSupports multiple storage engines &file formats©2013 Cloudera, Inc.
  • 34. Benefits of Cloudera Impala34Real-Time Query for Data Stored in Hadoop• Real-time queries run directly on source data• No ETL delays• No jumping between data silos• No double storage with EDW/RDBMS• Unlock analysis on more data• No need to create and maintain complex ETL between systems• No need to preplan schemas• All data available for interactive queries• No loss of fidelity from fixed data schemas• Single metadata store from origination through analysis• No need to hunt through multiple data silos©2013 Cloudera, Inc.
  • 35. Cloudera Impala Details35 ©2013 Cloudera, Inc.HDFS DNQuery Exec EngineQuery CoordinatorQuery PlannerHBaseODBCSQL AppHDFS DNQuery Exec EngineQuery CoordinatorQuery PlannerHBaseHDFS DNQuery Exec EngineQuery CoordinatorQuery PlannerHBaseFully MPPDistributedLocal Direct ReadsState StoreHDFS NNHiveMetastore YARNCommon Hive SQL and interfaceUnified metadata and schedulerLow-latency scheduler and cache(low-impact failures)
  • 36. Analyzing Twitter Data with HadoopOOZIE AUTOMATION36 ©2013 Cloudera, Inc.
  • 37. Oozie: Everything in its Right Place
  • 38. Oozie for Partition Management• Once an hour, add a partition• Takes advantage of advanced Hive functionality
  • 39. Analyzing Twitter Data with HadoopOOZIE DEMO39 ©2013 Cloudera, Inc.
  • 40. Analyzing Twitter Data with HadoopPUTTING IT ALL TOGETHER40 ©2013 Cloudera, Inc.
  • 41. Complete Architecture41 ©2013 Cloudera, Inc.TwitterHDFSFlume HiveCustomFlumeSourceSink toHDFSJSON SerDeParses DataOozieAddPartitionsHourly
  • 42. Analyzing Twitter Data with HadoopMORE DEMOS42 ©2013 Cloudera, Inc.
  • 43. What next?• Download Hadoop!• CDH available at www.cloudera.com• Cloudera provides pre-loaded VMs• https://ccp.cloudera.com/display/SUPPORT/Cloudera+Manager+Free+Edition+Demo+VM• Clone the source repo• https://github.com/cloudera/cdh-twitter-example
  • 44. My personal preference• Cloudera Manager• https://ccp.cloudera.com/display/SUPPORT/Downloads• Free up to 50 unlimited nodes!
  • 45. Shout Out• Jon Natkins• @nattyice• Blog posts• http://blog.cloudera.com/blog/2013/09/analyzing-twitter-data-with-hadoop/• http://blog.cloudera.com/blog/2013/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume/• http://blog.cloudera.com/blog/2013/11/analyzing-twitter-data-with-hadoop-part-3-querying-semi-structured-data-with-hive/
  • 46. Questions?• Contact me!• Joey Echeverria• joey@cloudera.com• @fwiffo• We’re hiring!
  • 47. 47 ©2013 Cloudera, Inc.