Basic Sentiment Analysis using Hive

Slide deck from a hands-on workshop. It covers the following:
1. Learn what sentiment analysis is and how it can be used
2. Perform pre-processing and post-processing of textual data using Hive
3. Use the n-gram language model built into Hive to perform sentiment analysis
4. Learn how to use Hive's extensibility to plug in other language models

Published in: Technology, Business
  • Useful for modeling clicks and impressions to understand a buyer's intent: intent to purchase or to churn. Quality: banks, call centers.
  • Information diffusion. The data is already gathered, the documents created, and the memes extracted, so much of the work is already done and the data is ready for you. You can do the same on your own with Twitter feeds.
  • Solutions: many. Framework: pre-processing, applying the model, post-processing. Challenge: scaling.
  • Basic Sentiment Analysis using Hive

    1. Sentiment Analysis using Hive: Secrets From the Pros. We will be starting at 11:03 PDT. Use the Chat Pane in GoToWebinar to ask questions! Assess your level and learn new stuff: this webinar is intended for intermediate audiences (familiar with Apache Hive and Hadoop, but not experts).
    2. News Cycle for “Mortgage”, 2008-09: Mortgage Crisis, Foreclosures, Fraud
       [Chart: three series (Crisis, Foreclosure, Fraud), each with a linear trend line; y-axis ticks from -10 to 90, x-axis dates from 6/12/04 to 5/28/05.]
       Table: MemeTracker, 36GB of JSON data. Number of records: 90M per partition. Partitions: month. Columns: URL, Timestamp, Array of Memes, Links.
    3. Agenda: this webinar provides tips on doing basic sentiment analysis on large data sets using Hive:
       • Overview of Sentiment Analysis (SA)
       • Hive UDFs useful for SA
       • Demo, guided tutorial
       • Developing advanced, custom SA engines
    4. Sentiment Analysis Applications
       • Gather customer feedback: direct (call center logs, emails, chat logs) and indirect (social media, forums, review websites)
       • Sentiment analysis: over time and geography; by customer and market segment
       • Use for decision making: product / service decisions; customer support; marketing (messaging, offers); customer retention, upsell
    5. Sentiment Analysis: How to operationalize a Sentiment Analysis App
       1. Crawl, scrape, make API calls, collect
       2. Create “Documents”
       3. Pre-process data
       4. Apply language model, extract sentiment
       5. Integrate with marketing automation, CRM, C CA, etc. (OLTP)
       6. Improve product, better customer service, targeted offers
    6. Pre- and Post-Processing: Hive Built-In Functions
       • Goal: tokenization. Input: “Hello There! How are you?” Output: ((“Hello”, “There”), (“How”, “are”, “you”)). Hive UDF: sentences
       • Goal: column (array) to rows. Input: [1, 2, 3]. Output: three rows, 1, 2 and 3. Hive UDF: explode
       • Goal: navigating documents, extracting fields. Input: {"store": {"fruit": [{"weight":8,"type":"apple"}, {"weight":9,"type":"pear"}], "bicycle": {"price":19.95,"color":"red"}}, "email":"amy@xyz.net", "owner":"amy"}. Output: {"weight":8,"type":"apple"}. Hive UDF: get_json_object(src_json.json, '$.store.fruit[0]')
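The three rows on slide 6 can be illustrated outside Hive. What follows is a minimal Python sketch, not Hive's implementation: the helper names `split_sentences`, `explode_rows` and `first_fruit` are hypothetical, and the sentence splitter is a simplification that treats only '.', '!' and '?' as terminators. It is meant to show only the shape of the output each built-in is described as producing.

```python
import json
import re


def split_sentences(text):
    """Split text into sentences, then each sentence into words (simplified)."""
    result = []
    for chunk in re.split(r"[.!?]+", text):
        words = re.findall(r"[A-Za-z0-9']+", chunk)
        if words:  # skip the empty chunk left after the final terminator
            result.append(words)
    return result


def explode_rows(values):
    """Yield one row per array element, mirroring 'column (array) to rows'."""
    for value in values:
        yield value


def first_fruit(document):
    """Follow the path $.store.fruit[0] through a JSON document string."""
    return json.loads(document)["store"]["fruit"][0]


# The sample document from the slide.
DOC = (
    '{"store": {"fruit": [{"weight": 8, "type": "apple"},'
    ' {"weight": 9, "type": "pear"}],'
    ' "bicycle": {"price": 19.95, "color": "red"}},'
    ' "email": "amy@xyz.net", "owner": "amy"}'
)

print(split_sentences("Hello There! How are you?"))
print(list(explode_rows([1, 2, 3])))
print(first_fruit(DOC))
```

Note that the fruit array sits under the "store" key, which is why the path has to pass through `store` before reaching `fruit[0]`.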
    7. N-Gram Language Models
       Q: What is a language model?
       A: A mathematical model that assigns a probability to a sequence of m words.
       Q: What is an “n-gram” model?
       A: A probabilistic language model for predicting the next word in a sequence of words.
       Q: What is an n-gram?
       A: A contiguous sequence of n items from a given sequence of text. E.g., “Mary had a little lamb” yields the bi-grams “Mary had”, “had a”, “a little”, “little lamb”.
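The definitions on slide 7 can be made concrete with a short Python sketch. The function names `extract_ngrams` and `next_word_probability` are hypothetical, and the probability is a plain count-based (maximum likelihood) estimate chosen for illustration; the deck does not say which estimator Hive or the presenters use.

```python
from collections import Counter


def extract_ngrams(words, n):
    """Return every contiguous run of n words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def next_word_probability(words, previous, candidate):
    """Estimate P(candidate | previous) from bigram counts in `words`."""
    bigram_counts = Counter(extract_ngrams(words, 2))
    # Number of bigrams that start with `previous`.
    previous_total = sum(
        count for (first, _), count in bigram_counts.items() if first == previous
    )
    if previous_total == 0:
        return 0.0  # `previous` never occurs as the first word of a bigram
    return bigram_counts[(previous, candidate)] / previous_total


words = "Mary had a little lamb".split()
print(extract_ngrams(words, 2))
print(next_word_probability(words, "a", "little"))
```

With a corpus this small every estimate is 0 or 1; the estimate only becomes informative once the same word is followed by different words in different places.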
    8. N-Gram Language Model: Hive Built-In Functions
       • Goal: find important topics (using a stop word list) and trending topics. Input: collection of sentences. Output: the k most frequently occurring n-grams. Hive UDF: ngrams
       • Goal: extract intelligence around certain keywords; pre-compute search look-aheads. Input: collection of sentences. Output: the k most frequently occurring n-grams around a “context” word, e.g. “Government shutdown”. Hive UDF: context_ngrams
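To show what the two rows on slide 8 compute, here is a Python sketch using exact counting. It is an illustration under stated assumptions, not Hive's algorithm: the names `top_ngrams` and `top_context_ngrams` are hypothetical, the sample sentences are made up, and marking the blank positions of the context with `None` is this sketch's convention.

```python
from collections import Counter


def top_ngrams(sentences, n, k):
    """Return the k most frequent n-grams across a list of word lists."""
    counts = Counter()
    for words in sentences:
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts.most_common(k)


def top_context_ngrams(sentences, context, k):
    """Return the k most frequent fillers for the None slots in `context`."""
    counts = Counter()
    size = len(context)
    for words in sentences:
        for i in range(len(words) - size + 1):
            window = words[i:i + size]
            # A window matches when every fixed context word lines up.
            if all(c is None or c == w for c, w in zip(context, window)):
                filler = tuple(w for c, w in zip(context, window) if c is None)
                counts[filler] += 1
    return counts.most_common(k)


# Made-up sample data, already tokenized into sentences of words.
SAMPLE = [
    ["government", "shutdown", "looms"],
    ["government", "shutdown", "ends"],
    ["government", "spending", "rises"],
]

print(top_ngrams(SAMPLE, 2, 1))
print(top_context_ngrams(SAMPLE, ["government", None], 2))
```

Exact counting like this keeps every distinct n-gram in memory, which is the scaling problem that motivates running the equivalent computation inside Hive on large data sets.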
    9. Dataset used: MemeTracker. How MemeTracker.org creates the dataset: 90 million sources; 900K news stories per day; 17M memes tracked. Table: MemeTracker, 6GB of data per month. Number of records: 90M per partition. Partitions: month. Columns: URL, Timestamp, Array of Memes, Links.
    10. Analyze Sentiment on “Mortgage” by tracking how memes spread, using Hive. What is a meme? “Government Shutdown”, “Affordable Care Act”, “Green Eggs and Ham”, etc. Table: MemeTracker, 36GB of JSON data. Number of records: 90M per partition. Partitions: month. Columns: URL, Timestamp, Array of Memes, Links.
    11. Demo
    12. Hive’s Extensibility Framework
       • There are many UDFs built into Hive
       • For more advanced users, Hive allows many ways to extend the language:
         – SerDes
         – UDFs, UDAFs, and UDTFs
         – Hive Streaming
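Of the extension points on slide 12, Hive Streaming is the one that lets an existing script act as the language model: Hive passes rows to the script on standard input as tab-separated text and reads rows back from standard output. Below is a minimal sketch of such a script. The two-column layout (url, meme), the toy `LEXICON` and the function names are assumptions made for illustration, not part of the tutorial.

```python
import io
import sys

# Hypothetical toy lexicon; a real model would be far larger.
LEXICON = {"crisis": -1, "foreclosure": -1, "fraud": -1, "recovery": 1, "relief": 1}


def score(text):
    """Sum the lexicon weights of the words in `text` (unknown words count 0)."""
    return sum(LEXICON.get(word, 0) for word in text.lower().split())


def run(source, sink):
    """Read tab-separated (url, meme) rows and write (url, score) rows."""
    for line in source:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        url, meme = line.split("\t", 1)
        sink.write(f"{url}\t{score(meme)}\n")


# Local demonstration with in-memory streams instead of Hive.
demo_output = io.StringIO()
run(io.StringIO("u1\tmortgage crisis and fraud\nu2\tmortgage relief\n"), demo_output)
sys.stdout.write(demo_output.getvalue())

# When deployed as a streaming script, replace the demonstration with:
#     run(sys.stdin, sys.stdout)
```

On the Hive side, a script like this would be invoked from a query through the TRANSFORM ... USING clause, with the script file added to the session first. Keeping the scoring in `score` means the same function can be unit-tested locally before it is run against the full table.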
    13. How to access this Tutorial
       • Create a free Qubole account (www.qubole.com)
       • Log in, click on “Analyze”, then look for the “Tutorials” tab at the top of the page
    14. Summary
       • Pre- and post-processing
         – Use Hive
       • Language models
         – Use pre-existing language models codified as Hive UDFs, such as ngrams and context_ngrams
         – UDFs: build your own language model in Java using the Hive UDF framework
         – Hive Streaming: plug in your existing language models or 3rd-party libraries
       • Visualization
         – Use a spreadsheet / BI reporting tool
    15. Thank You. What’s included? Managed Cluster, Built-In Connectors, Friendly User Interface, Dedicated Support.
       • 100% managed Hadoop cluster in the cloud
       • Auto-scaling cluster, full life-cycle management
       • 12+ connectors to applications and data sources
       • 14-day free trial (free account available)
       • 24/7 customer support
       www.qubole.com/try