Ruby and R
Sau Sheong Chang
Ruby and R integration with text classification example
Ruby and R
1.
RUBY AND R
Sau Sheong Chang
Director, Applied Research, HP Labs Singapore
© Copyright 2010 Hewlett-Packard Development Company, L.P.
2.
About HP Labs
3.
HP LABS
– Exploratory and advanced research group for Hewlett-Packard
– Global organization that tackles complex challenges facing our customers and society over the next decade
– Pushes the frontiers of fundamental science
– HQ Palo Alto
4.
HP LABS AROUND THE WORLD
Bristol, St. Petersburg, Beijing, Palo Alto, Bangalore, Haifa, Singapore
5.
HP LABS SINGAPORE
– Set up in February 2010
– Focus on Cloud Computing
Research
• Exploratory research
• Researchers
• Change the state of the art
• Working closely with the academic community
Applied Research
• Applied research
• Innovators
• Take the research to the next stage
• Work closely with customers and business units
6.
Ruby and R
7.
R: programming language and platform for statistical computing, licensed under GPL
8.
Strengths in statistical processing and data visualization
9.
Extensive library of statistical computing packages (CRAN) written by statisticians
10.
Statistics is not just for statisticians
11.
• Recommendation engine
• Speech recognition
• Fingerprint identification
• Spam detection
• Card fraud detection
• Financial forecasting
• Face recognition
• Data mining
• OCR
• Credit scoring
12.
CRAN
– Almost 2000 packages, mostly created by statisticians
• BiodiversityR – GUI for biodiversity and community ecology analysis
• Emu – analyze speech patterns
• GenABEL – study human genome
• Quantmod – quantitative financial modeling framework
• Ftrading – technical trading analysis
• Cyclones – cyclone identification
• DOSim – disease analysis toolkit for gene set
• Agricolae – statistical procedures for agricultural research
13.
EXAMPLE R CODE
– EPL data from football-data.co.uk
– Show home/away goals distribution for 2011 season
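The R code on this slide did not survive in the transcript. As a rough stand-in, the sketch below tallies a home/away goals distribution in plain Ruby rather than R. It assumes the football-data.co.uk column names FTHG and FTAG (full-time home and away goals); the sample rows and the helper names parse_rows and goal_distribution are made up for illustration and are not taken from the talk.

```ruby
# Made-up sample rows for illustration only; these are not real match results.
# FTHG / FTAG are assumed to be the full-time home and away goals columns.
SAMPLE = <<~DATA
  HomeTeam,AwayTeam,FTHG,FTAG
  Alpha,Beta,2,1
  Gamma,Delta,0,0
  Beta,Gamma,1,1
  Delta,Alpha,3,0
  Alpha,Gamma,2,2
DATA

# Parses comma-separated text with a header line into an array of hashes.
def parse_rows(text)
  header, *lines = text.lines.map { |line| line.strip.split(",") }
  lines.map { |fields| header.zip(fields).to_h }
end

# Returns { goals => number of matches } for one column, sorted by goals.
def goal_distribution(rows, column)
  counts = Hash.new(0)
  rows.each { |row| counts[row[column].to_i] += 1 }
  counts.sort.to_h
end

rows = parse_rows(SAMPLE)
puts "home: #{goal_distribution(rows, 'FTHG')}"
puts "away: #{goal_distribution(rows, 'FTAG')}"
```

In R the same tally would typically be a one-liner over a data frame, which is part of the case the talk makes for handing this kind of work to R.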
14.
Why Ruby and R?
15.
Stand on shoulders of giants
16.
– Ruby
• Human focused programming!
• Better general purpose programming capabilities
• Great frameworks!
• Great libraries (20,000+ gems in RubyGems)
– R
• Focus on statistical computing/crunching
• Lots of packages written by domain experts/statisticians
• Great graphing libraries
17.
Ruby and R integration
18.
RINRUBY
– 100% Ruby
– Uses pipes to send commands and evals
– Uses TCP/IP sockets to send and retrieve data
– Pros:
• Doesn't require anything but R
• Works flawlessly on Windows
• Works with Ruby 1.8, 1.9 and JRuby 1.5
• All API tested
– Cons:
• VERY SLOW in assigning
• Very limited datatypes: only Vector and Matrix
• Not released since 2009
• Poor documentation
19.
RSRUBY
– C extension for Ruby, linked to R's shared library
– Pros:
• Blazing speed! 5-10 times faster than Rserve and 100-1000 times faster than RinRuby
• Seamless integration with Ruby. Every method and object is treated like a Ruby object
– Cons:
• Transformation between R and Ruby types isn't trivial
• Dependent on operating system, Ruby implementation and R version
• Not available for alternative implementations of Ruby (e.g. JRuby)
• Not released since 2009
• Poor documentation
20.
RSERVE
– 100% Ruby
– Uses TCP/IP sockets to interchange data and commands
– Requires Rserve installed on the server machine
– Access from Ruby uses the Rserve-Ruby-Client library
– Pros:
• Works with Ruby 1.8, 1.9 and JRuby 1.5
• Sessions allow data to be processed asynchronously
• Fast: 5-10 times faster than RinRuby
• Most recently updated (Jan 2011)
– Cons:
• Requires Rserve
• Limited features on Windows
• Poor documentation
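A minimal sketch of what talking to Rserve from Ruby can look like, assuming the rserve-client gem (loaded with require "rserve") and an Rserve daemon listening on its default local port. Rserve::Connection.new, eval and to_ruby come from that gem; the helpers r_vector and r_mean and the RSERVE_DEMO environment variable are hypothetical names introduced here, and the live part only runs when that variable is set.

```ruby
# Builds an R vector literal such as "c(1.0, 2.0, 3.0)" from Ruby numbers.
def r_vector(values)
  "c(#{values.map { |v| Float(v) }.join(', ')})"
end

# Asks R for the mean of the values and converts the answer back to Ruby.
# `conn` is anything whose eval(code) returns an object that responds to to_ruby.
def r_mean(conn, values)
  conn.eval("mean(#{r_vector(values)})").to_ruby
end

# Live demo, skipped unless RSERVE_DEMO is set (hypothetical switch).
# Assumes the rserve-client gem is installed and Rserve is running locally.
if ENV["RSERVE_DEMO"]
  require "rserve"
  conn = Rserve::Connection.new
  puts r_mean(conn, [1, 2, 3, 4])
end
```

Passing the connection in as an argument keeps the Ruby side easy to exercise without a running R process, which is handy given that every call crosses a socket.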
21.
RAPACHE/RRACK
– Web service based
– Run R scripts as web services, consumed by Ruby front-end apps
– Pros:
• Modular and separate (no direct integration)
• Can be scalable, ‘cloud’-ready
– Cons:
• Requires Rapache/rRack
• rRack is very new (not accepted by CRAN yet, as of today!), requires R 2.13 (just released a few weeks ago)
• Rapache specific to Apache web server only
• Communications overhead for smaller integrations
22.
Let’s look at some code! (I’m going to use Rserve)
23.
Text classification
24.
TEXT CLASSIFICATION
– Automatically sorting a set of documents into different categories from a predefined set
– Classic uses:
• Spam filtering
• Email prioritization
[Diagram labels: Training data, Test data, Classifier, category]
26.
TEXT CLASSIFIER CODE
Prepare
27.
Train classifier by counting frequency of each word in the document
28.
Get word count
29.
What you get
{"check"=>1, "result"=>3, "marissa"=>1, "experi"=>1, "click"=>1, "engin"=>1, "simpli"=>1, "mistakenli"=>1, "pick"=>1, "prevent"=>1, "40"=>1, "regularli"=>1, "place"=>1, "user"=>5, "prefer"=>1, "malevol"=>1, "access"=>1, "robust"=>1, "servic"=>1, "fault"=>1, "malici"=>1, "list"=>2, "hand"=>1, "internet"=>1, "attribut"=>1, "instal"=>1, "file"=>1, "unabl"=>1, "vice"=>1, "stopbadwareorg"=>2, "merit"=>1, "decid"=>1, "flag"=>2, "saturdai"=>2, "hit"=>2, "offici"=>1, "error"=>3, "work"=>1, "site"=>5, "happen"=>2, "incid"=>1, "technic"=>1, "advis"=>1, "put"=>1, "human"=>3, "harm"=>2, "softwar"=>1, "ms"=>1, "affect"=>1, "carefulli"=>1, "product"=>1, "presid"=>1, "complaint"=>1, "potenti"=>2, "googl"=>6, "comput"=>2, "peopl"=>1, "investig"=>2, "consum"=>1, "danger"=>2, "period"=>1, "wrote"=>2, "search"=>7, "ascertain"=>1, "blog"=>1, "warn"=>2, "problem"=>1, "updat"=>2, "minut"=>1, "mayer"=>2}
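The code behind the word-count slides was shown as images and is not in the transcript. The sketch below is one plausible way to get a hash of this shape in Ruby: lower-case the text, strip punctuation, drop stop words and count what is left. The keys in the hash above look stemmed ("googl", "mistakenli"), so the talk presumably also applied a stemmer; that step is left out here. The stop-word list, the sample sentence and the method name word_count are assumptions made for illustration.

```ruby
# A tiny illustrative stop-word list; the list used in the talk is not shown.
STOP_WORDS = %w[a an and are as at be by for from in is it of on or that the to was were with].freeze

# Lower-cases the text, strips punctuation, drops stop words and
# returns { word => count }. Stemming is deliberately left out.
def word_count(text)
  words = text.downcase.gsub(/[^a-z0-9\s]/, "").split
  counts = Hash.new(0)
  words.each { |word| counts[word] += 1 unless STOP_WORDS.include?(word) }
  counts
end

# Made-up sample sentence, not taken from the talk's training documents.
puts word_count("Google flagged the site. The site was flagged in error by Google.").inspect
```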
30.
Generate training data for prediction
31.
Training data
32.
The top 25 most frequent words in the training dataset
category,googl,report,search,user,review,court,mckinnon,year,internet,microsoft,site,softwar,warn,browser,oper,expert,rise,lawyer,digit,extradit,sharpli,error,group,result,system,rebel,econom,presid,crisi,find,year,accus,global,obama,china,civilian,shrink,hous,wall,street,quarter,white,heavi,lehman,economi,session,ey,time,davo,human
not_interesting,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
not_interesting,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,5,0,2,0,0,0,3,0,0,0,3,1,0,0,0,0,0,3,0,0,0,0,0,0,2
not_interesting,0,1,0,0,0,0,0,2,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,3,0,3,1,2,0,2,0,0,0,0,0,0,0,0,0,0,3,1,3,1,0,2,0
not_interesting,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
not_interesting,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,2,0,0,1,2,1,4,0,0,2,0,0,0,2,0,0,0,0,2,0,1,0
not_interesting,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,3,3,0,0,0,0,0,0,0,2,0,0
not_interesting,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,2,0,0,2,0,0,2,1,0,0,2,1,0,0,2,0,0,1,0,0
interesting,6,0,7,5,0,0,0,0,1,0,5,1,2,0,0,0,0,0,0,0,0,3,0,3,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3
interesting,0,7,0,0,2,0,0,0,0,0,0,0,1,0,0,1,0,0,3,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
interesting,0,1,0,0,0,0,0,3,3,1,0,1,1,1,0,3,3,0,1,0,3,0,1,0,2,0,1,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,3,0
interesting,0,0,0,0,3,5,5,0,0,0,0,0,0,0,0,0,1,4,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
interesting,6,0,1,1,0,0,0,0,0,0,0,1,0,0,4,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
interesting,0,0,0,2,0,0,0,2,1,4,0,2,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0
33.
[Same training data table as slide 32]
Each line represents 1 document trained
34.
[Same training data table as slide 32]
Categories set when the classifier is created
35.
[Same training data table as slide 32]
Number indicates the number of times the word appears in that document
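The training table can be rebuilt from per-document word counts. The sketch below is an assumption about how that might be done, not the talk's actual code: it takes the most frequent words per category, concatenates them into the header without de-duplicating, and writes one row of counts per document. Not de-duplicating would be consistent with the header on slide 32 listing "year" twice, though that is a guess. The method names top_words and training_csv and the small EXAMPLE_DOCS data set are hypothetical.

```ruby
# Picks the `top_n` most frequent words across the documents of one category.
def top_words(docs, top_n)
  totals = Hash.new(0)
  docs.each { |counts| counts.each { |word, n| totals[word] += n } }
  # Sort by descending count, then alphabetically so the result is stable.
  totals.sort_by { |word, n| [-n, word] }.first(top_n).map(&:first)
end

# Builds CSV lines: a header row, then one row per document.
# `labelled` maps category => array of { word => count } hashes.
def training_csv(labelled, top_n)
  vocabulary = labelled.values.flat_map { |docs| top_words(docs, top_n) }
  lines = [(["category"] + vocabulary).join(",")]
  labelled.each do |category, docs|
    docs.each do |counts|
      lines << ([category] + vocabulary.map { |word| counts.fetch(word, 0) }).join(",")
    end
  end
  lines
end

# Made-up miniature data set; the real one used 25 words per category.
EXAMPLE_DOCS = {
  "interesting"     => [{ "googl" => 6, "search" => 7 }, { "googl" => 6, "oper" => 4 }],
  "not_interesting" => [{ "econom" => 3, "crisi" => 3 }, { "econom" => 1, "obama" => 1 }]
}.freeze

puts training_csv(EXAMPLE_DOCS, 2)
```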
36.
Test data
37.
category,googl,report,search,user,review,court,mckinnon,year,internet,microsoft,site,softwar,warn,browser,oper,expert,rise,lawyer,digit,extradit,sharpli,error,group,result,system,rebel,econom,presid,crisi,find,year,accus,global,obama,china,civilian,shrink,hous,wall,street,quarter,white,heavi,lehman,economi,session,ey,time,davo,human
category,0,0,0,2,0,0,0,2,1,4,0,2,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0
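A test row has to use the same column order as the training header, with zeros for words the document does not contain. A minimal sketch, assuming the document has already been reduced to a word-count hash; feature_row is a hypothetical helper name and the four-word vocabulary is just the first four columns of the header above.

```ruby
# Turns one unlabelled document's word counts into a feature row that has
# the same column order as the training header. Words outside the
# vocabulary are ignored; missing words count as 0.
def feature_row(vocabulary, counts)
  vocabulary.map { |word| counts.fetch(word, 0) }
end

vocabulary = %w[googl report search user]
row = feature_row(vocabulary, { "user" => 2, "unlisted" => 4 })
puts((["category"] + row).join(","))
```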
38.
Using different classification models
39.
NAÏVE BAYES
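The model code on this slide is not in the transcript; the talk ran its models in R. To show what a naïve Bayes classifier does with word-count rows like those in the training table, here is a from-scratch multinomial naïve Bayes in plain Ruby with add-one smoothing. It is a stand-in for illustration, not the R implementation used in the talk, and the class name, method names and four-column sample rows are all assumptions.

```ruby
# Multinomial naive Bayes over word-count rows, with add-one (Laplace) smoothing.
# All rows are assumed to have the same number of columns.
class NaiveBayes
  def initialize
    @doc_counts  = Hash.new(0)                    # category => number of documents
    @word_totals = Hash.new { |h, k| h[k] = [] }  # category => summed counts per column
  end

  # Adds one labelled row of word counts to the model.
  def train(category, row)
    @doc_counts[category] += 1
    totals = @word_totals[category]
    row.each_with_index { |n, i| totals[i] = (totals[i] || 0) + n }
  end

  # Returns the category with the highest log-probability for `row`.
  def classify(row)
    total_docs = @doc_counts.values.sum
    @doc_counts.keys.max_by do |category|
      totals = @word_totals[category]
      denom  = totals.sum + totals.size           # smoothed total word count
      score  = Math.log(@doc_counts[category].to_f / total_docs)
      row.each_with_index do |n, i|
        score += n * Math.log((totals[i] + 1).to_f / denom)
      end
      score
    end
  end
end

# Made-up four-column rows in the order googl, report, search, user.
nb = NaiveBayes.new
nb.train("interesting",     [6, 0, 7, 5])
nb.train("interesting",     [6, 0, 1, 1])
nb.train("not_interesting", [0, 1, 0, 0])
nb.train("not_interesting", [0, 1, 0, 0])
puts nb.classify([0, 0, 0, 2])
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the add-one smoothing keeps a word that never appeared in a category from zeroing out that category's score.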
40.
SVM
41.
RANDOM FOREST
42.
NEURAL NETWORKS
43.
Using the classifier
46.
RESOURCES
– HP Labs Worldwide http://www.hpl.hp.com/
– R Project http://www.r-project.org/
– RsRuby https://github.com/alexgutteridge/rsruby
– RinRuby http://rinruby.ddahl.org/
– Rserve http://www.rforge.net/Rserve/
– Rserve-Ruby-Client https://github.com/clbustos/Rserve-Ruby-client
– rApache http://rapache.net/index.html
– rRack https://github.com/jeffreyhorner/rRack/
47.
Thank you
sausheong@hp.com
http://twitter.com/sausheong
http://blog.saush.com