Analyzing Twitter Data with Hadoop
Presentation Transcript

  • Analyzing Twitter Data with Hadoop
    Open Analytics Summit, March 2013
    Joey Echeverria | Principal Solutions Architect
    joey@cloudera.com | @fwiffo
    ©2013 Cloudera, Inc.
  • About Joey
    • Principal Solutions Architect
    • 2 years @ Cloudera
    • 5 years of Hadoop
    • Local
  • Analyzing Twitter Data with Hadoop: BUILDING A BIG DATA SOLUTION
  • Big Data
    • Big
      • Larger volume than you’ve handled before
      • No litmus test
      • High value, underutilized
    • Data
      • Structured
      • Unstructured
      • Semi-structured
    • Hadoop
      • Distributed file system
      • Distributed, batch computation
  • Data Management Systems
    [Diagram labels: Data Source, Data Ingestion, Data Storage, Data Processing]
  • Relational Data Management Systems
    [Diagram labels: Data Source, ETL, RDBMS, Reporting]
  • A Canonical Hadoop Architecture
    [Diagram labels: Data Source, Flume, HDFS, Hive (Impala)]
  • Analyzing Twitter Data with Hadoop: AN EXAMPLE USE CASE
  • Analyzing Twitter
    • Social media popular with marketing teams
    • Twitter is an effective tool for promotion
    • Who is influential?
    • Tweets
    • Followers
    • Retweets
    • Similar to e-mail forwarding
    • Which Twitter user gets the most retweets?
    • Who is influential in our industry?
  • Analyzing Twitter Data with Hadoop: HOW DO WE ANSWER THESE QUESTIONS?
  • Techniques
    • SQL
      • Filtering
      • Aggregation
      • Sorting
    • Complex data
      • Deeply nested
      • Variable schema
  • Architecture
    [Diagram components: Twitter, Flume, HDFS, Hive, Oozie]
    [Diagram annotations: Custom Flume Source; Sink to HDFS; JSON SerDe Parses Data; Oozie: Add Partitions Hourly]
  • Analyzing Twitter Data with Hadoop: TWITTER SOURCE
  • Flume
    • Streaming data flow
    • Sources
    • Push or pull
    • Sinks
    • Event based
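To make the source, event, and sink vocabulary concrete, here is a minimal sketch in plain Java. It does not use the Flume API: the `FlowSketch` class, its `Event` record, and the in-memory queue standing in for the hand-off between source and sink are all hypothetical names chosen for illustration, and it assumes Java 16 or later for records.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;

class FlowSketch {
    // One discrete event: string headers plus an opaque byte payload.
    record Event(Map<String, String> headers, byte[] body) {}

    // Source side: wrap each incoming record as an event and hand it to the buffer.
    static void source(List<String> records, Queue<Event> buffer) {
        for (String record : records) {
            byte[] body = record.getBytes(StandardCharsets.UTF_8);
            buffer.add(new Event(Map.of("bytes", String.valueOf(body.length)), body));
        }
    }

    // Sink side: drain buffered events and deliver their payloads (here, into a list).
    static List<String> sink(Queue<Event> buffer) {
        List<String> delivered = new ArrayList<>();
        Event event;
        while ((event = buffer.poll()) != null) {
            delivered.add(new String(event.body(), StandardCharsets.UTF_8));
        }
        return delivered;
    }

    public static void main(String[] args) {
        Queue<Event> buffer = new ArrayDeque<>();
        source(List.of("tweet one", "tweet two"), buffer);
        System.out.println(sink(buffer)); // prints [tweet one, tweet two]
    }
}
```

The point of the sketch is only that the source and the sink never call each other directly; each side deals with one event at a time.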
  • Pulling Data From Twitter
    • Custom source, using twitter4j
    • Sources process data as discrete events
  • Loading Data Into HDFS
    • HDFS Sink comes stock with Flume
    • Easily separate files by creation time
    • hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
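The `%Y/%m/%d/%H` escapes in the sink path above bucket files by year, month, day, and hour. A rough sketch of that mapping in plain Java follows; it is not Flume's own escape handling. The `hourlyDirectory` helper is a hypothetical name, and the choice of UTC is an assumption made so the result is deterministic.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

class TweetPathSketch {
    // Mirrors the %Y/%m/%d/%H escapes from the slide's sink path (UTC is an assumption).
    private static final DateTimeFormatter HOURLY =
        DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);

    // Hypothetical helper: appends the hourly bucket for a timestamp to the base directory.
    static String hourlyDirectory(String baseDir, long epochMillis) {
        return baseDir + HOURLY.format(Instant.ofEpochMilli(epochMillis)) + "/";
    }

    public static void main(String[] args) {
        String base = "hdfs://hadoop1:8020/user/flume/tweets/";
        // Epoch millisecond 0 falls in the first hour of 1970-01-01 UTC.
        System.out.println(hourlyDirectory(base, 0L));
        // prints hdfs://hadoop1:8020/user/flume/tweets/1970/01/01/00/
    }
}
```

Because the bucket is derived from a timestamp, every event from the same hour lands in the same directory, which is what makes hourly partitioning possible later.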
  • Flume Source
      public class TwitterSource extends AbstractSource
          implements EventDrivenSource, Configurable {
        ...
        // The initialization method for the Source. The context contains all
        // the Flume configuration info.
        @Override
        public void configure(Context context) {...}
        ...
        // Start processing events. Uses the Twitter Streaming API to sample
        // Twitter, and process tweets.
        @Override
        public void start() {...}
        ...
        // Stops the Source's event processing and shuts down the Twitter stream.
        @Override
        public void stop() {...}
      }
  • Twitter API
    • Callback mechanism for catching new tweets
      /** The actual Twitter stream. It's set up to collect raw JSON data. */
      private final TwitterStream twitterStream = new TwitterStreamFactory(
          new ConfigurationBuilder().setJSONStoreEnabled(true).build())
          .getInstance();
      ...
      // The StatusListener is a twitter4j API that can be added to a stream,
      // and will call a method every time a message is sent to the stream.
      StatusListener listener = new StatusListener() {
        // The onStatus method is executed every time a new tweet comes in.
        public void onStatus(Status status) {...}
      };
      ...
      // Set up the stream's listener (defined above), and set any necessary
      // security information.
      twitterStream.addListener(listener);
      twitterStream.setOAuthConsumer(consumerKey, consumerSecret);
      AccessToken token = new AccessToken(accessToken, accessTokenSecret);
      twitterStream.setOAuthAccessToken(token);
  • JSON Data
    • JSON data is processed as an event and written to HDFS
      public void onStatus(Status status) {
        // The EventBuilder is used to build an event using the headers and
        // the raw JSON of a tweet
        headers.put("timestamp", String.valueOf(status.getCreatedAt().getTime()));
        Event event = EventBuilder.withBody(
            DataObjectFactory.getRawJSON(status).getBytes(), headers);
        channel.processEvent(event);
      }
  • Analyzing Twitter Data with Hadoop: FLUME DEMO
  • Analyzing Twitter Data with Hadoop: HIVE
  • What is Hive?
    • Created at Facebook
    • HiveQL
    • SQL-like interface
    • Hive interpreter converts HiveQL to MapReduce code
    • Returns results to the client
  • Hive Details
    • Schema on read
    • Scalar types (int, float, double, boolean, string)
    • Complex types (struct, map, array)
    • Metastore contains table definitions
    • Stored in a relational database
    • Similar to catalog tables in other DBs
  • Complex Data
      SELECT
        t.retweet_screen_name,
        sum(retweets) AS total_retweets,
        count(*) AS tweet_count
      FROM (SELECT
              retweeted_status.user.screen_name AS retweet_screen_name,
              retweeted_status.text,
              max(retweeted_status.retweet_count) AS retweets
            FROM tweets
            GROUP BY
              retweeted_status.user.screen_name,
              retweeted_status.text) t
      GROUP BY t.retweet_screen_name
      ORDER BY total_retweets DESC
      LIMIT 10;
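The two-level grouping in the query above can be easier to follow when traced over a handful of rows. The sketch below reproduces the same logic in memory, in plain Java: take the highest retweet count seen per (author, text) pair, then sum and count per author, sort descending, and cut to a limit. The `Retweet` and `UserTotal` records and the `topRetweeted` method are hypothetical names, the sample rows are invented, and Java 16 or later is assumed for records.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RetweetRankingSketch {
    // One observed retweet: original author, original text, retweet count seen at that moment.
    record Retweet(String screenName, String text, int retweetCount) {}

    // One output row: author, summed retweets, number of distinct retweeted tweets.
    record UserTotal(String screenName, long totalRetweets, long tweetCount) {}

    static List<UserTotal> topRetweeted(List<Retweet> rows, int limit) {
        // Inner query: max(retweet_count) grouped by (screen_name, text).
        Map<List<String>, Integer> maxPerTweet = new HashMap<>();
        for (Retweet r : rows) {
            maxPerTweet.merge(List.of(r.screenName(), r.text()), r.retweetCount(), Integer::max);
        }
        // Outer query: sum(retweets) and count(*) grouped by screen_name.
        Map<String, long[]> perUser = new HashMap<>();
        for (Map.Entry<List<String>, Integer> e : maxPerTweet.entrySet()) {
            long[] acc = perUser.computeIfAbsent(e.getKey().get(0), k -> new long[2]);
            acc[0] += e.getValue();
            acc[1] += 1;
        }
        List<UserTotal> totals = new ArrayList<>();
        perUser.forEach((name, acc) -> totals.add(new UserTotal(name, acc[0], acc[1])));
        // ORDER BY total_retweets DESC LIMIT n (order among ties is unspecified).
        totals.sort(Comparator.comparingLong(UserTotal::totalRetweets).reversed());
        return new ArrayList<>(totals.subList(0, Math.min(limit, totals.size())));
    }

    public static void main(String[] args) {
        List<Retweet> rows = List.of(
            new Retweet("alice", "first tweet", 5),
            new Retweet("alice", "first tweet", 8),   // same tweet seen later with a higher count
            new Retweet("alice", "second tweet", 3),
            new Retweet("bob", "only tweet", 20));
        topRetweeted(rows, 10).forEach(System.out::println);
    }
}
```

Taking the maximum per tweet before summing matters because the same original tweet shows up once per retweet, each time carrying the running count; summing the raw rows would count it many times over.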
  • Analyzing Twitter Data with Hadoop: JSON INTERLUDE
  • What is JSON?
    • Complex, semi-structured data
    • Based on JavaScript’s data syntax
    • Rich, nested data types:
      • number
      • string
      • array
      • object
      • true, false
      • null
  • What is JSON?
      {
        "retweeted_status": {
          "contributors": null,
          "text": "#Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is clogged. #bigdata",
          "retweeted": false,
          "entities": {
            "hashtags": [
              {
                "text": "Crowdsourcing",
                "indices": [0, 14]
              },
              {
                "text": "bigdata",
                "indices": [129, 137]
              }
            ],
            "user_mentions": []
          }
        }
      }
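The Hive query earlier reaches into this kind of document with dotted names such as `retweeted_status.user.screen_name`. As a rough illustration of what that navigation amounts to, the sketch below models part of the tweet as nested Java maps and walks a dotted path through them. It is not how Hive or a JSON SerDe is implemented; `lookup` is a hypothetical helper, and the `contributors` field is left out because `Map.of` does not accept null values.

```java
import java.util.List;
import java.util.Map;

class NestedLookupSketch {
    // Walks a dotted path such as "retweeted_status.entities.hashtags" through nested maps.
    // Returns null when a key is missing or the path runs past a non-map value.
    static Object lookup(Map<String, Object> root, String dottedPath) {
        Object current = root;
        for (String key : dottedPath.split("\\.")) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<?, ?>) current).get(key);
        }
        return current;
    }

    public static void main(String[] args) {
        // A trimmed-down copy of the tweet shown on the slide.
        Map<String, Object> tweet = Map.of(
            "retweeted_status", Map.of(
                "retweeted", false,
                "entities", Map.of(
                    "hashtags", List.of(
                        Map.of("text", "Crowdsourcing", "indices", List.of(0, 14)),
                        Map.of("text", "bigdata", "indices", List.of(129, 137))),
                    "user_mentions", List.of())));

        System.out.println(lookup(tweet, "retweeted_status.retweeted"));
        System.out.println(lookup(tweet, "retweeted_status.entities.user_mentions"));
        System.out.println(lookup(tweet, "retweeted_status.user.screen_name")); // absent in this sample
    }
}
```

Returning null for an absent path, rather than failing, is one way to cope with the variable schema mentioned earlier: not every tweet carries every field.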
  • Hive Serializers and Deserializers
    • Instructs Hive on how to interpret data
    • JSONSerDe
  • Analyzing Twitter Data with Hadoop: HIVE DEMO
  • Analyzing Twitter Data with Hadoop: IT’S A TRAP!
    Photo from http://www.flickr.com/photos/vanf/6798065626/ (some rights reserved)
  • Not a Database
                            RDBMS                     Hive
    Language                Generally >= SQL-92       Subset of SQL-92 plus Hive-specific extensions
    Update capabilities     INSERT, UPDATE, DELETE    INSERT OVERWRITE; no UPDATE, DELETE
    Transactions            Yes                       No
    Latency                 Sub-second                Minutes
    Indexes                 Yes                       Yes
    Data size               Terabytes                 Petabytes
  • Analyzing Twitter Data with Hadoop: IMPALA ASIDE
  • Cloudera Impala
    Real-Time Query for Data Stored in Hadoop.
    • Supports Hive SQL
    • 4-30X faster than Hive over MapReduce
    • Uses existing drivers, integrates with existing metastore, works with leading BI tools
    • Flexible, cost-effective, no lock-in
    • Deploy & operate with Cloudera Enterprise RTQ
    • Supports multiple storage engines & file formats
  • Benefits of Cloudera Impala
    Real-Time Query for Data Stored in Hadoop
    • Real-time queries run directly on source data
    • No ETL delays
    • No jumping between data silos
    • No double storage with EDW/RDBMS
    • Unlock analysis on more data
    • No need to create and maintain complex ETL between systems
    • No need to preplan schemas
    • All data available for interactive queries
    • No loss of fidelity from fixed data schemas
    • Single metadata store from origination through analysis
    • No need to hunt through multiple data silos
  • Cloudera Impala Details
    [Diagram: three identical nodes, each labeled HDFS DN, Query Exec Engine, Query Coordinator, Query Planner, HBase]
    [Diagram: clients labeled ODBC, SQL App; shared services labeled State Store, HDFS NN, Hive Metastore, YARN]
    [Diagram annotations: Fully MPP Distributed; Local Direct Reads; Common Hive SQL and interface; Unified metadata and scheduler; Low-latency scheduler and cache (low-impact failures)]
  • Analyzing Twitter Data with Hadoop: OOZIE AUTOMATION
  • Oozie: Everything in its Right Place
  • Oozie for Partition Management
    • Once an hour, add a partition
    • Takes advantage of advanced Hive functionality
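As a rough picture of what "add a partition once an hour" could produce, the sketch below builds a Hive `ALTER TABLE ... ADD PARTITION` statement for a given hour. It is plain Java string formatting, not an Oozie workflow. The `tweets` table name comes from the query shown earlier; the `datehour` partition column, the base directory, and the `addPartitionStatement` helper are assumptions made for illustration.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class HourlyPartitionSketch {
    // Assumed partition key format, e.g. 2013030109 for 09:00 on 1 March 2013.
    private static final DateTimeFormatter DATEHOUR = DateTimeFormatter.ofPattern("yyyyMMddHH");
    // Directory layout matching the %Y/%m/%d/%H pattern of the HDFS sink path.
    private static final DateTimeFormatter DIRECTORY = DateTimeFormatter.ofPattern("yyyy/MM/dd/HH");

    // Hypothetical helper: builds the statement an hourly job could hand to Hive.
    static String addPartitionStatement(String table, String baseDir, LocalDateTime hour) {
        return String.format(
            "ALTER TABLE %s ADD IF NOT EXISTS PARTITION (datehour = %s) LOCATION '%s%s'",
            table, DATEHOUR.format(hour), baseDir, DIRECTORY.format(hour));
    }

    public static void main(String[] args) {
        LocalDateTime hour = LocalDateTime.of(2013, 3, 1, 9, 0);
        System.out.println(addPartitionStatement("tweets", "/user/flume/tweets/", hour));
        // prints ALTER TABLE tweets ADD IF NOT EXISTS PARTITION (datehour = 2013030109) LOCATION '/user/flume/tweets/2013/03/01/09'
    }
}
```

`IF NOT EXISTS` keeps the statement safe to rerun, which is useful for a scheduled job that may be retried.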
  • Analyzing Twitter Data with Hadoop: OOZIE DEMO
  • Analyzing Twitter Data with Hadoop: PUTTING IT ALL TOGETHER
  • Complete Architecture
    [Diagram components: Twitter, Flume, HDFS, Hive, Oozie]
    [Diagram annotations: Custom Flume Source; Sink to HDFS; JSON SerDe Parses Data; Oozie: Add Partitions Hourly]
  • Analyzing Twitter Data with Hadoop: MORE DEMOS
  • What next?
    • Download Hadoop!
    • CDH available at www.cloudera.com
    • Cloudera provides pre-loaded VMs
    • https://ccp.cloudera.com/display/SUPPORT/Cloudera+Manager+Free+Edition+Demo+VM
    • Clone the source repo
    • https://github.com/cloudera/cdh-twitter-example
  • My personal preference
    • Cloudera Manager
    • https://ccp.cloudera.com/display/SUPPORT/Downloads
    • Free for unlimited nodes (previously up to 50)!
  • Shout Out
    • Jon Natkins
    • @nattyice
    • Blog posts
    • http://blog.cloudera.com/blog/2013/09/analyzing-twitter-data-with-hadoop/
    • http://blog.cloudera.com/blog/2013/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume/
    • http://blog.cloudera.com/blog/2013/11/analyzing-twitter-data-with-hadoop-part-3-querying-semi-structured-data-with-hive/
  • Questions?
    • Contact me!
    • Joey Echeverria
    • joey@cloudera.com
    • @fwiffo
    • We’re hiring!