SlideShare a Scribd company logo
1 of 47
1
Analyzing Twitter Data with Hadoop
Open Analytics Summit, March 2013
Joey Echeverria | Principal Solutions Architect
joey@cloudera.com | @fwiffo
©2013 Cloudera, Inc.
About Joey
• Principal Solutions Architect
• 2 years @ Cloudera
• 5 years of Hadoop
• Local
2
Analyzing Twitter Data with Hadoop
BUILDING A BIG DATA SOLUTION
3 ©2013 Cloudera, Inc.
Big Data
• Big
• Larger volume than you’ve handled before
• No litmus test
• High value, under utilized
• Data
• Structured
• Unstructured
• Semi-structured
• Hadoop
• Distributed file system
• Distributed, batch computation
4 ©2013 Cloudera, Inc.
Data Management Systems
5 ©2013 Cloudera, Inc.
Data Source Data Storage
Data
Ingestion
Data
Processing
Relational Data Management Systems
6 ©2013 Cloudera, Inc.
Data Source RDBMSETL
Reporting
A Canonical Hadoop Architecture
7 ©2013 Cloudera, Inc.
Data Source HDFSFlume
Hive
(Impala)
Analyzing Twitter Data with Hadoop
AN EXAMPLE USE CASE
8 ©2013 Cloudera, Inc.
Analyzing Twitter
• Social media popular with marketing teams
• Twitter is an effective tool for promotion
• Who is influential?
• Tweets
• Followers
• Retweets
• Similar to e-mail forwarding
• Which twitter user gets the most retweets?
• Who is influential in our industry?
9 ©2013 Cloudera, Inc.
Analyzing Twitter Data with Hadoop
HOW DO WE ANSWER THESE
QUESTIONS?
10 ©2013 Cloudera, Inc.
Techniques
• SQL
• Filtering
• Aggregation
• Sorting
• Complex data
• Deeply nested
• Variable schema
11
Architecture
12 ©2013 Cloudera, Inc.
Twitter
HDFSFlume Hive
Custom
Flume
Source
Sink to
HDFS
JSON SerDe
Parses Data
Oozie
Add
Partitions
Hourly
Analyzing Twitter Data with Hadoop
TWITTER SOURCE
13 ©2013 Cloudera, Inc.
Flume
• Streaming data flow
• Sources
• Push or pull
• Sinks
• Event based
14 ©2013 Cloudera, Inc.
Pulling Data From Twitter
• Custom source, using twitter4j
• Sources process data as discrete events
Loading Data Into HDFS
• HDFS Sink comes stock with Flume
• Easily separate files by creation time
• hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
Flume Source
17 ©2013 Cloudera, Inc.
public class TwitterSource extends AbstractSource
implements EventDrivenSource, Configurable {
...
// The initialization method for the Source. The context contains all
// the Flume configuration info
@Override
public void configure(Context context) {
...
}
...
// Start processing events. Uses the Twitter Streaming API to sample
// Twitter, and process tweets.
@Override
public void start() {
...
}
...
// Stops Source's event processing and shuts down the Twitter stream.
@Override
public void stop() {
...
}
}
Twitter API
• Callback mechanism for catching new tweets
18 ©2013 Cloudera, Inc.
/** The actual Twitter stream. It's set up to collect raw JSON data */
private final TwitterStream twitterStream = new TwitterStreamFactory(
new ConfigurationBuilder().setJSONStoreEnabled(true).build())
.getInstance();
...
// The StatusListener is a twitter4j API that can be added to a stream,
// and will call a method every time a message is sent to the stream.
StatusListener listener = new StatusListener() {
// The onStatus method is executed every time a new tweet comes in.
public void onStatus(Status status) {
...
}
}
...
// Set up the stream's listener (defined above), and set any necessary
// security information.
twitterStream.addListener(listener);
twitterStream.setOAuthConsumer(consumerKey, consumerSecret);
AccessToken token = new AccessToken(accessToken, accessTokenSecret);
twitterStream.setOAuthAccessToken(token);
JSON Data
• JSON data is processed as an event and written to
HDFS
19 ©2013 Cloudera, Inc.
public void onStatus(Status status) {
// The EventBuilder is used to build an event using the headers and
// the raw JSON of a tweet
headers.put("timestamp", String.valueOf(
status.getCreatedAt().getTime()));
Event event = EventBuilder.withBody(
DataObjectFactory.getRawJSON(status).getBytes(), headers);
channel.processEvent(event);
}
Analyzing Twitter Data with Hadoop
FLUME DEMO
20 ©2013 Cloudera, Inc.
Analyzing Twitter Data with Hadoop
HIVE
21 ©2013 Cloudera, Inc.
What is Hive?
• Created at Facebook
• HiveQL
• SQL like interface
• Hive interpreter
converts HiveQL to
MapReduce code
• Returns results to the
client
22 ©2013 Cloudera, Inc.
Hive Details
• Schema on read
• Scalar types (int, float, double, boolean, string)
• Complex types (struct, map, array)
• Metastore contains table definitions
• Stored in a relational database
• Similar to catalog tables in other DBs
23
Complex Data
24 ©2013 Cloudera, Inc.
SELECT
t.retweet_screen_name,
sum(retweets) AS total_retweets,
count(*) AS tweet_count
FROM (SELECT
retweeted_status.user.screen_name AS retweet_screen_name,
retweeted_status.text,
max(retweeted_status.retweet_count) AS retweets
FROM tweets
GROUP BY
retweeted_status.user.screen_name,
retweeted_status.text) t
GROUP BY t.retweet_screen_name
ORDER BY total_retweets DESC
LIMIT 10;
Analyzing Twitter Data with Hadoop
JSON INTERLUDE
25 ©2013 Cloudera, Inc.
What is JSON?
• Complex, semi-structured data
• Based on JavaScript’s data syntax
• Rich, nested data types:
• number
• string
• Array
• object
• true, false
• null
26 ©2013 Cloudera, Inc.
What is JSON?
27 ©2013 Cloudera, Inc.
{
"retweeted_status": {
"contributors": null,
"text": "#Crowdsourcing – drivers already generate traffic data for your smartphone to suggest
alternative routes when a road is clogged. #bigdata",
"retweeted": false,
"entities": {
"hashtags": [
{
"text": "Crowdsourcing",
"indices": [0, 14]
},
{
"text": "bigdata",
"indices": [129,137]
}
],
"user_mentions": []
}
}
}
Hive Serializers and Deserializers
• Instructs Hive on how to interpret data
• JSONSerDe
28 ©2013 Cloudera, Inc.
Analyzing Twitter Data with Hadoop
HIVE DEMO
29 ©2013 Cloudera, Inc.
Analyzing Twitter Data with Hadoop
IT’S A TRAP!
30 ©2013 Cloudera, Inc.
Photo from http://www.flickr.com/photos/vanf/6798065626/ Some rights reserved
Not a Database
31 ©2013 Cloudera, Inc.
RDBMS Hive
Language
Generally >= SQL-92
Subset of SQL-92 plus
Hive specific
extensions
Update Capabilities
INSERT, UPDATE,
DELETE
INSERT OVERWRITE
no UPDATE, DELETE
Transactions Yes No
Latency Sub-second Minutes
Indexes Yes Yes
Data size Terabytes Petabytes
Analyzing Twitter Data with Hadoop
IMPALA ASIDE
32 ©2013 Cloudera, Inc.
Cloudera Impala
33
Real-Time Query for Data Stored in Hadoop.
Supports Hive SQL
4-30X faster than Hive over MapReduce
Uses existing drivers, integrates with existing
metastore, works with leading BI tools
Flexible, cost-effective, no lock-in
Deploy & operate with
Cloudera Enterprise RTQ
Supports multiple storage engines &
file formats
©2013 Cloudera, Inc.
Benefits of Cloudera Impala
34
Real-Time Query for Data Stored in Hadoop
• Real-time queries run directly on source data
• No ETL delays
• No jumping between data silos
• No double storage with EDW/RDBMS
• Unlock analysis on more data
• No need to create and maintain complex ETL between systems
• No need to preplan schemas
• All data available for interactive queries
• No loss of fidelity from fixed data schemas
• Single metadata store from origination through analysis
• No need to hunt through multiple data silos
©2013 Cloudera, Inc.
Cloudera Impala Details
35 ©2013 Cloudera, Inc.
HDFS DN
Query Exec Engine
Query Coordinator
Query Planner
HBase
ODBC
SQL App
HDFS DN
Query Exec Engine
Query Coordinator
Query Planner
HBaseHDFS DN
Query Exec Engine
Query Coordinator
Query Planner
HBase
Fully MPP
Distributed
Local Direct Reads
State Store
HDFS NN
Hive
Metastore YARN
Common Hive SQL and interface
Unified metadata and scheduler
Low-latency scheduler and cache
(low-impact failures)
Analyzing Twitter Data with Hadoop
OOZIE AUTOMATION
36 ©2013 Cloudera, Inc.
Oozie: Everything in its Right Place
Oozie for Partition Management
• Once an hour, add a partition
• Takes advantage of advanced Hive functionality
Analyzing Twitter Data with Hadoop
OOZIE DEMO
39 ©2013 Cloudera, Inc.
Analyzing Twitter Data with Hadoop
PUTTING IT ALL TOGETHER
40 ©2013 Cloudera, Inc.
Complete Architecture
41 ©2013 Cloudera, Inc.
Twitter
HDFSFlume Hive
Custom
Flume
Source
Sink to
HDFS
JSON SerDe
Parses Data
Oozie
Add
Partitions
Hourly
Analyzing Twitter Data with Hadoop
MORE DEMOS
42 ©2013 Cloudera, Inc.
What next?
• Download Hadoop!
• CDH available at www.cloudera.com
• Cloudera provides pre-loaded VMs
• https://ccp.cloudera.com/display/SUPPORT/Cloudera+Ma
nager+Free+Edition+Demo+VM
• Clone the source repo
• https://github.com/cloudera/cdh-twitter-example
My personal preference
• Cloudera Manager
• https://ccp.cloudera.com/display/SUPPORT/Downloads
• Free up to 50 unlimited nodes!
Shout Out
• Jon Natkins
• @nattyice
• Blog posts
• http://blog.cloudera.com/blog/2013/09/analyzing-twitter-
data-with-hadoop/
• http://blog.cloudera.com/blog/2013/10/analyzing-twitter-
data-with-hadoop-part-2-gathering-data-with-flume/
• http://blog.cloudera.com/blog/2013/11/analyzing-twitter-
data-with-hadoop-part-3-querying-semi-structured-data-
with-hive/
Questions?
• Contact me!
• Joey Echeverria
• joey@cloudera.com
• @fwiffo
• We’re hiring!
47 ©2013 Cloudera, Inc.

More Related Content

More from Joey Echeverria

Analyzing twitter data with hadoop
Analyzing twitter data with hadoopAnalyzing twitter data with hadoop
Analyzing twitter data with hadoopJoey Echeverria
 
Hadoop in three use cases
Hadoop in three use casesHadoop in three use cases
Hadoop in three use casesJoey Echeverria
 
Scratching your own itch
Scratching your own itchScratching your own itch
Scratching your own itchJoey Echeverria
 
The power of hadoop in cloud computing
The power of hadoop in cloud computingThe power of hadoop in cloud computing
The power of hadoop in cloud computingJoey Echeverria
 
Hadoop and h base in the real world
Hadoop and h base in the real worldHadoop and h base in the real world
Hadoop and h base in the real worldJoey Echeverria
 

More from Joey Echeverria (6)

Analyzing twitter data with hadoop
Analyzing twitter data with hadoopAnalyzing twitter data with hadoop
Analyzing twitter data with hadoop
 
Big data security
Big data securityBig data security
Big data security
 
Hadoop in three use cases
Hadoop in three use casesHadoop in three use cases
Hadoop in three use cases
 
Scratching your own itch
Scratching your own itchScratching your own itch
Scratching your own itch
 
The power of hadoop in cloud computing
The power of hadoop in cloud computingThe power of hadoop in cloud computing
The power of hadoop in cloud computing
 
Hadoop and h base in the real world
Hadoop and h base in the real worldHadoop and h base in the real world
Hadoop and h base in the real world
 

Recently uploaded

Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 

Recently uploaded (20)

Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 

Analyzing Twitter Data with Hadoop

  • 1. 1 Analyzing Twitter Data with Hadoop Open Analytics Summit, March 2013 Joey Echeverria | Principal Solutions Architect joey@cloudera.com | @fwiffo ©2013 Cloudera, Inc.
  • 2. About Joey • Principal Solutions Architect • 2 years @ Cloudera • 5 years of Hadoop • Local 2
  • 3. Analyzing Twitter Data with Hadoop BUILDING A BIG DATA SOLUTION 3 ©2013 Cloudera, Inc.
  • 4. Big Data • Big • Larger volume than you’ve handled before • No litmus test • High value, under utilized • Data • Structured • Unstructured • Semi-structured • Hadoop • Distributed file system • Distributed, batch computation 4 ©2013 Cloudera, Inc.
  • 5. Data Management Systems 5 ©2013 Cloudera, Inc. Data Source Data Storage Data Ingestion Data Processing
  • 6. Relational Data Management Systems 6 ©2013 Cloudera, Inc. Data Source RDBMSETL Reporting
  • 7. A Canonical Hadoop Architecture 7 ©2013 Cloudera, Inc. Data Source HDFSFlume Hive (Impala)
  • 8. Analyzing Twitter Data with Hadoop AN EXAMPLE USE CASE 8 ©2013 Cloudera, Inc.
  • 9. Analyzing Twitter • Social media popular with marketing teams • Twitter is an effective tool for promotion • Who is influential? • Tweets • Followers • Retweets • Similar to e-mail forwarding • Which twitter user gets the most retweets? • Who is influential in our industry? 9 ©2013 Cloudera, Inc.
  • 10. Analyzing Twitter Data with Hadoop HOW DO WE ANSWER THESE QUESTIONS? 10 ©2013 Cloudera, Inc.
  • 11. Techniques • SQL • Filtering • Aggregation • Sorting • Complex data • Deeply nested • Variable schema 11
  • 12. Architecture 12 ©2013 Cloudera, Inc. Twitter HDFSFlume Hive Custom Flume Source Sink to HDFS JSON SerDe Parses Data Oozie Add Partitions Hourly
  • 13. Analyzing Twitter Data with Hadoop TWITTER SOURCE 13 ©2013 Cloudera, Inc.
  • 14. Flume • Streaming data flow • Sources • Push or pull • Sinks • Event based 14 ©2013 Cloudera, Inc.
  • 15. Pulling Data From Twitter • Custom source, using twitter4j • Sources process data as discrete events
  • 16. Loading Data Into HDFS • HDFS Sink comes stock with Flume • Easily separate files by creation time • hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
  • 17. Flume Source 17 ©2013 Cloudera, Inc. public class TwitterSource extends AbstractSource implements EventDrivenSource, Configurable { ... // The initialization method for the Source. The context contains all // the Flume configuration info @Override public void configure(Context context) { ... } ... // Start processing events. Uses the Twitter Streaming API to sample // Twitter, and process tweets. @Override public void start() { ... } ... // Stops Source's event processing and shuts down the Twitter stream. @Override public void stop() { ... } }
  • 18. Twitter API • Callback mechanism for catching new tweets 18 ©2013 Cloudera, Inc. /** The actual Twitter stream. It's set up to collect raw JSON data */ private final TwitterStream twitterStream = new TwitterStreamFactory( new ConfigurationBuilder().setJSONStoreEnabled(true).build()) .getInstance(); ... // The StatusListener is a twitter4j API that can be added to a stream, // and will call a method every time a message is sent to the stream. StatusListener listener = new StatusListener() { // The onStatus method is executed every time a new tweet comes in. public void onStatus(Status status) { ... } } ... // Set up the stream's listener (defined above), and set any necessary // security information. twitterStream.addListener(listener); twitterStream.setOAuthConsumer(consumerKey, consumerSecret); AccessToken token = new AccessToken(accessToken, accessTokenSecret); twitterStream.setOAuthAccessToken(token);
  • 19. JSON Data • JSON data is processed as an event and written to HDFS 19 ©2013 Cloudera, Inc. public void onStatus(Status status) { // The EventBuilder is used to build an event using the headers and // the raw JSON of a tweet headers.put("timestamp", String.valueOf( status.getCreatedAt().getTime())); Event event = EventBuilder.withBody( DataObjectFactory.getRawJSON(status).getBytes(), headers); channel.processEvent(event); }
  • 20. Analyzing Twitter Data with Hadoop FLUME DEMO 20 ©2013 Cloudera, Inc.
  • 21. Analyzing Twitter Data with Hadoop HIVE 21 ©2013 Cloudera, Inc.
  • 22. What is Hive? • Created at Facebook • HiveQL • SQL like interface • Hive interpreter converts HiveQL to MapReduce code • Returns results to the client 22 ©2013 Cloudera, Inc.
  • 23. Hive Details • Schema on read • Scalar types (int, float, double, boolean, string) • Complex types (struct, map, array) • Metastore contains table definitions • Stored in a relational database • Similar to catalog tables in other DBs 23
  • 24. Complex Data 24 ©2013 Cloudera, Inc. SELECT t.retweet_screen_name, sum(retweets) AS total_retweets, count(*) AS tweet_count FROM (SELECT retweeted_status.user.screen_name AS retweet_screen_name, retweeted_status.text, max(retweeted_status.retweet_count) AS retweets FROM tweets GROUP BY retweeted_status.user.screen_name, retweeted_status.text) t GROUP BY t.retweet_screen_name ORDER BY total_retweets DESC LIMIT 10;
  • 25. Analyzing Twitter Data with Hadoop JSON INTERLUDE 25 ©2013 Cloudera, Inc.
  • 26. What is JSON? • Complex, semi-structured data • Based on JavaScript’s data syntax • Rich, nested data types: • number • string • Array • object • true, false • null 26 ©2013 Cloudera, Inc.
  • 27. What is JSON? 27 ©2013 Cloudera, Inc. { "retweeted_status": { "contributors": null, "text": "#Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is clogged. #bigdata", "retweeted": false, "entities": { "hashtags": [ { "text": "Crowdsourcing", "indices": [0, 14] }, { "text": "bigdata", "indices": [129,137] } ], "user_mentions": [] } } }
  • 28. Hive Serializers and Deserializers • Instructs Hive on how to interpret data • JSONSerDe 28 ©2013 Cloudera, Inc.
  • 29. Analyzing Twitter Data with Hadoop HIVE DEMO 29 ©2013 Cloudera, Inc.
  • 30. Analyzing Twitter Data with Hadoop IT’S A TRAP! 30 ©2013 Cloudera, Inc. Photo from http://www.flickr.com/photos/vanf/6798065626/ Some rights reserved
  • 31. Not a Database 31 ©2013 Cloudera, Inc. RDBMS Hive Language Generally >= SQL-92 Subset of SQL-92 plus Hive specific extensions Update Capabilities INSERT, UPDATE, DELETE INSERT OVERWRITE no UPDATE, DELETE Transactions Yes No Latency Sub-second Minutes Indexes Yes Yes Data size Terabytes Petabytes
  • 32. Analyzing Twitter Data with Hadoop IMPALA ASIDE 32 ©2013 Cloudera, Inc.
  • 33. Cloudera Impala 33 Real-Time Query for Data Stored in Hadoop. Supports Hive SQL 4-30X faster than Hive over MapReduce Uses existing drivers, integrates with existing metastore, works with leading BI tools Flexible, cost-effective, no lock-in Deploy & operate with Cloudera Enterprise RTQ Supports multiple storage engines & file formats ©2013 Cloudera, Inc.
  • 34. Benefits of Cloudera Impala 34 Real-Time Query for Data Stored in Hadoop • Real-time queries run directly on source data • No ETL delays • No jumping between data silos • No double storage with EDW/RDBMS • Unlock analysis on more data • No need to create and maintain complex ETL between systems • No need to preplan schemas • All data available for interactive queries • No loss of fidelity from fixed data schemas • Single metadata store from origination through analysis • No need to hunt through multiple data silos ©2013 Cloudera, Inc.
  • 35. Cloudera Impala Details 35 ©2013 Cloudera, Inc. HDFS DN Query Exec Engine Query Coordinator Query Planner HBase ODBC SQL App HDFS DN Query Exec Engine Query Coordinator Query Planner HBaseHDFS DN Query Exec Engine Query Coordinator Query Planner HBase Fully MPP Distributed Local Direct Reads State Store HDFS NN Hive Metastore YARN Common Hive SQL and interface Unified metadata and scheduler Low-latency scheduler and cache (low-impact failures)
  • 36. Analyzing Twitter Data with Hadoop OOZIE AUTOMATION 36 ©2013 Cloudera, Inc.
  • 37. Oozie: Everything in its Right Place
  • 38. Oozie for Partition Management • Once an hour, add a partition • Takes advantage of advanced Hive functionality
  • 39. Analyzing Twitter Data with Hadoop OOZIE DEMO 39 ©2013 Cloudera, Inc.
  • 40. Analyzing Twitter Data with Hadoop PUTTING IT ALL TOGETHER 40 ©2013 Cloudera, Inc.
  • 41. Complete Architecture 41 ©2013 Cloudera, Inc. Twitter HDFSFlume Hive Custom Flume Source Sink to HDFS JSON SerDe Parses Data Oozie Add Partitions Hourly
  • 42. Analyzing Twitter Data with Hadoop MORE DEMOS 42 ©2013 Cloudera, Inc.
  • 43. What next? • Download Hadoop! • CDH available at www.cloudera.com • Cloudera provides pre-loaded VMs • https://ccp.cloudera.com/display/SUPPORT/Cloudera+Ma nager+Free+Edition+Demo+VM • Clone the source repo • https://github.com/cloudera/cdh-twitter-example
  • 44. My personal preference • Cloudera Manager • https://ccp.cloudera.com/display/SUPPORT/Downloads • Free up to 50 unlimited nodes!
  • 45. Shout Out • Jon Natkins • @nattyice • Blog posts • http://blog.cloudera.com/blog/2013/09/analyzing-twitter- data-with-hadoop/ • http://blog.cloudera.com/blog/2013/10/analyzing-twitter- data-with-hadoop-part-2-gathering-data-with-flume/ • http://blog.cloudera.com/blog/2013/11/analyzing-twitter- data-with-hadoop-part-3-querying-semi-structured-data- with-hive/
  • 46. Questions? • Contact me! • Joey Echeverria • joey@cloudera.com • @fwiffo • We’re hiring!