• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
Analyzing twitter data with hadoop
 

Analyzing twitter data with hadoop

on

  • 5,989 views

Cloudera's OA

Cloudera's OA

Statistics

Views

Total Views
5,989
Views on SlideShare
5,573
Embed Views
416

Actions

Likes
7
Downloads
171
Comments
2

3 Embeds 416

http://www.ikanow.com 342
http://funsung.blogspot.com 72
http://192.254.196.224 2

Accessibility

Categories

Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel

12 of 2 previous next

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Analyzing twitter data with hadoop Analyzing twitter data with hadoop Presentation Transcript

    • Analyzing Twitter Data with Hadoop Open Analytics Summit, March 2013 Joey Echeverria | Principal Solutions Architect joey@cloudera.com | @fwiffo1 ©2012 Cloudera, Inc.
    • About Joey • Principal Solutions Architect • 18 months • 4+ years • Local2
    • Analyzing Twitter Data with Hadoop BUILDING A BIG DATA SOLUTION3 ©2012 Cloudera, Inc.
    • Big Data • Big • Larger volume than you’ve handled before • No litmus test • High value, under utilized • Data • Structured • Unstructured • Semi-structured • Hadoop • Distributed file system • Distributed, batch computation4 ©2012 Cloudera, Inc.
    • Data Management Systems Data Processing Data Source Data Data Storage Ingestion5 ©2012 Cloudera, Inc.
    • Relational Data Management Systems Reporting Data Source ETL RDBMS6 ©2012 Cloudera, Inc.
    • A Canonical Hadoop Architecture Hive (Impala) Data Source Flume HDFS7 ©2012 Cloudera, Inc.
    • Analyzing Twitter Data with Hadoop AN EXAMPLE USE CASE8 ©2012 Cloudera, Inc.
    • Analyzing Twitter • Social media popular with marketing teams • Twitter is an effective tool for promotion • Who is influential? • Tweets • Followers • Retweets • Similar to e-mail forwarding • Which twitter user gets the most retweets? • Who is influential in our industry?9 ©2012 Cloudera, Inc.
    • Analyzing Twitter Data with Hadoop HOW DO WE ANSWER THESE QUESTIONS?10 ©2012 Cloudera, Inc.
    • Techniques • SQL • Filtering • Aggregation • Sorting • Complex data • Deeply nested • Variable schema11
    • Architecture Twitter Oozie Custom Add Flume Partitions Source Hourly Sink to JSON SerDe HDFS Parses Data Flume HDFS Hive12 ©2012 Cloudera, Inc.
    • Analyzing Twitter Data with Hadoop TWITTER SOURCE13 ©2012 Cloudera, Inc.
    • Flume • Streaming data flow • Sources • Push or pull • Sinks • Event based14 ©2012 Cloudera, Inc.
    • Pulling Data From Twitter• Custom source, using twitter4j• Sources process data as discrete events
    • Loading Data Into HDFS• HDFS Sink comes stock with Flume• Easily separate files by creation time • hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
    • Flume Source public class TwitterSource extends AbstractSource implements EventDrivenSource, Configurable { ... // The initialization method for the Source. The context contains all // the Flume configuration info @Override public void configure(Context context) { ... } ... // Start processing events. Uses the Twitter Streaming API to sample // Twitter, and process tweets. @Override public void start() { ... } ... // Stops Sources event processing and shuts down the Twitter stream. @Override public void stop() { ... } }17 ©2012 Cloudera, Inc.
    • Twitter API • Callback mechanism for catching new tweets /** The actual Twitter stream. Its set up to collect raw JSON data */ private final TwitterStream twitterStream = new TwitterStreamFactory( new ConfigurationBuilder().setJSONStoreEnabled(true).build()) .getInstance(); ... // The StatusListener is a twitter4j API that can be added to a stream, // and will call a method every time a message is sent to the stream. StatusListener listener = new StatusListener() { // The onStatus method is executed every time a new tweet comes in. public void onStatus(Status status) { ... } } ... // Set up the streams listener (defined above), and set any necessary // security information. twitterStream.addListener(listener); twitterStream.setOAuthConsumer(consumerKey, consumerSecret); AccessToken token = new AccessToken(accessToken, accessTokenSecret); twitterStream.setOAuthAccessToken(token);18 ©2012 Cloudera, Inc.
    • JSON Data • JSON data is processed as an event and written to HDFS public void onStatus(Status status) { // The EventBuilder is used to build an event using the headers and // the raw JSON of a tweet headers.put("timestamp", String.valueOf( status.getCreatedAt().getTime())); Event event = EventBuilder.withBody( DataObjectFactory.getRawJSON(status).getBytes(), headers); channel.processEvent(event); }19 ©2012 Cloudera, Inc.
    • Analyzing Twitter Data with Hadoop FLUME DEMO20 ©2012 Cloudera, Inc.
    • Analyzing Twitter Data with Hadoop HIVE21 ©2012 Cloudera, Inc.
    • What is Hive? • Created at Facebook • HiveQL • SQL like interface • Hive interpreter converts HiveQL to MapReduce code • Returns results to the client22 ©2012 Cloudera, Inc.
    • Hive Details • Schema on read • Scalar types (int, float, double, boolean, string) • Complex types (struct, map, array) • Metastore contains table definitions • Stored in a relational database • Similar to catalog tables in other DBs23
    • Complex Data SELECT t.retweeted_screen_name, sum(retweets) AS total_retweets, count(*) AS tweet_count FROM (SELECT retweeted_status.user.screen_name AS retweet_screen_name, retweeted_status.text, max(retweet_count) AS retweets FROM tweets GROUP BY retweeted_status.user.screen_name, retweeted_status.text) t GROUP BY t.retweet_screen_name ORDER BY total_retweets DESC LIMIT 10;24 ©2012 Cloudera, Inc.
    • Analyzing Twitter Data with Hadoop JSON INTERLUDE25 ©2012 Cloudera, Inc.
    • What is JSON? • Complex, semi-structured data • Based on JavaScript’s data syntax • Rich, nested data types: • number • string • Array • object • true, false • null26 ©2012 Cloudera, Inc.
    • What is JSON? { "retweeted_status": { "contributors": null, "text": "#Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is clogged. #bigdata", "retweeted": false, "entities": { "hashtags": [ { "text": "Crowdsourcing", "indices": [0, 14] }, { "text": "bigdata", "indices": [129,137] } ], "user_mentions": [] } } }27 ©2012 Cloudera, Inc.
    • Hive Serializers and Deserializers • Instructs Hive on how to interpret data • JSONSerDe28 ©2012 Cloudera, Inc.
    • Analyzing Twitter Data with Hadoop HIVE DEMO29 ©2012 Cloudera, Inc.
    • Analyzing Twitter Data with Hadoop IT’S A TRAP30 ©2012 Cloudera, Inc.
    • Not a Database RDBMS Hive Subset of SQL-92 plus Generally >= SQL-92 Language Hive specific extensions INSERT, UPDATE, INSERT OVERWRITE Update Capabilities DELETE no UPDATE, DELETE Transactions Yes No Latency Sub-second Minutes Indexes Yes Yes Data size Terabytes Petabytes31 ©2012 Cloudera, Inc.
    • Analyzing Twitter Data with Hadoop IMPALA ASIDE32 ©2012 Cloudera, Inc.
    • Cloudera Impala Real-Time Query for Data Stored in Hadoop. Supports Hive SQL 4-30X faster than Hive over MapReduce Supports multiple storage engines & file formats Uses existing drivers, integrates with existing metastore, works with leading BI tools Flexible, cost-effective, no lock-in Deploy & operate with Cloudera Enterprise RTQ33 ©2012 Cloudera, Inc.
    • Benefits of Cloudera Impala Real-Time Query for Data Stored in Hadoop • Real-time queries run directly on source data • No ETL delays • No jumping between data silos • No double storage with EDW/RDBMS • Unlock analysis on more data • No need to create and maintain complex ETL between systems • No need to preplan schemas • All data available for interactive queries • No loss of fidelity from fixed data schemas • Single metadata store from origination through analysis • No need to hunt through multiple data silos34 ©2012 Cloudera, Inc.
    • Cloudera Impala Details Unified metadata and scheduler Hive Metastore YARN HDFS NN SQL App ODBC State Store Low-latency scheduler and cache Common Hive SQL and interface (low-impact failures) Query Planner Query Planner Fully MPP Query Planner Distributed Query Coordinator Query Coordinator Query Coordinator Query Exec Engine Query Exec Engine Query Exec Engine HDFS DN HBase HDFS DN HBase HDFS DN HBase Local Direct Reads35 ©2012 Cloudera, Inc.
    • Analyzing Twitter Data with Hadoop OOZIE AUTOMATION36 ©2012 Cloudera, Inc.
    • Oozie: Everything in its Right Place
    • Oozie for Partition Management• Once an hour, add a partition• Takes advantage of advanced Hive functionality
    • Analyzing Twitter Data with Hadoop OOZIE DEMO39 ©2012 Cloudera, Inc.
    • Analyzing Twitter Data with Hadoop PUTTING IT ALL TOGETHER40 ©2012 Cloudera, Inc.
    • Complete Architecture Twitter Oozie Custom Add Flume Partitions Source Hourly Sink to JSON SerDe HDFS Parses Data Flume HDFS Hive41 ©2012 Cloudera, Inc.
    • Analyzing Twitter Data with Hadoop MORE DEMOS42 ©2012 Cloudera, Inc.
    • What next?• Download Hadoop!• CDH available at www.cloudera.com• Cloudera provides pre-loaded VMs • https://ccp.cloudera.com/display/SUPPORT/Cloudera+Ma nager+Free+Edition+Demo+VM• Clone the source repo • https://github.com/cloudera/cdh-twitter-example
    • My personal preference• Cloudera Manager • https://ccp.cloudera.com/display/SUPPORT/Downloads• Free up to 50 nodes
    • Shout Out• Jon Natkins• @nattybnatkins• Blog posts • http://blog.cloudera.com/blog/2012/09/analyzing-twitter- data-with-hadoop/ • http://blog.cloudera.com/blog/2012/10/analyzing-twitter- data-with-hadoop-part-2-gathering-data-with-flume/ • http://blog.cloudera.com/blog/2012/11/analyzing-twitter- data-with-hadoop-part-3-querying-semi-structured-data- with-hive/
    • Questions?• Contact me!• Joey Echeverria• joey@cloudera.com• @fwiffo• We’re hiring!
    • 47 ©2012 Cloudera, Inc.