RubyScraping


  Ruby Kansai
  yhara(原 悠)
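About me
 yhara (原 悠)
 Lives in Shiga Prefecture
 Affiliation: Kyoto University Microcomputer Club (KMC)

 Past talks:
  Plagger meets Ruby
  How to use callcc, explained in 30 minutes
  Hpricot, a web scraping library for Ruby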
“RubyScraping” Wiki
 http://mono.kmc.gr.jp/~yhara/rubyscraping/
What is "scraping"?
 scrape = to scratch, to scrape off
 A technique for extracting information from web pages
 Handy libraries
  WWW::Mechanize
  Hpricot
WWW::Mechanize
   A library for automated web browsing
   Example: an automated Google search
require 'mechanize'

# WWW::Mechanize was the class name at the time of this talk;
# Mechanize 1.0 and later use Mechanize.new instead.
agent = WWW::Mechanize.new
page = agent.get("http://www.google.co.jp/")
search_form = page.forms.first
search_form["q"] = "Ruby"
search_results = search_form.submit

print search_results.body
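Output
 The HTML of Google's search results for "Ruby" is printed
Explanation
Steps when searching by hand:
(1) Open the top page
(2) Find the search form
(3) Enter "Ruby" as the search term
(4) Press the "Search" button

With Mechanize, you can write these steps "as they are" in Ruby!
Explanation
 (1) Open the top page → (2) Find the search form →
 (3) Enter "Ruby" as the search term → (4) Press the "Search" button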
require 'mechanize'

agent = WWW::Mechanize.new
page = agent.get("http://www.google.co.jp/") #1
search_form = page.forms.first              #2
search_form["q"] = "Ruby"                   #3
search_results = search_form.submit         #4

print search_results.body
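
As a rough sketch of what step #4 amounts to for a GET form, the snippet below builds the request URL by hand with Ruby's standard uri library. The helper build_form_uri is hypothetical (it is not part of Mechanize), and the /search action path is an assumption about Google's form; a real submission would also carry the form's hidden fields.

```ruby
require 'uri'

# Hypothetical helper (not part of Mechanize): builds the URL that a
# GET form submission would request, given the form's action URL and
# a hash of field names to values.
def build_form_uri(action, fields)
  uri = URI.parse(action)
  uri.query = URI.encode_www_form(fields)
  uri
end

# Assumption: the search form's action is /search and its text field is "q".
uri = build_form_uri("http://www.google.co.jp/search", "q" => "Ruby")
puts uri
```

Mechanize does this bookkeeping itself (along with cookies and redirects), which is why the script above can stay as close to the manual steps as it does.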
Hpricot
  An HTML parsing library
  Example: from the Google search results above,
  print the title of every page that was hit

  require "hpricot"
  doc = Hpricot(File.read("google.html"))

  (doc/"a.l").each do |a|
   puts a["href"]
   puts a.inner_text
  end
Output
Explanation
 (doc/"a.l") ⇔ <a class="l">…</a>
 (doc/"div#res") ⇔ <div id="res">…</div>
 CSS selectors can be used

  require "hpricot"
  doc = Hpricot(File.read("google.html"))

  (doc/"a.l").each do |a|
   puts a["href"]        #=> URL
   puts a.inner_text    #=> page title
  end
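
To make the selector-to-tag correspondence above concrete, here is a toy sketch in plain Ruby. It is not how Hpricot works internally; it only handles the two forms shown (tag.class and tag#id), and the names describe_selector, matches? and FakeElement are hypothetical, introduced purely for illustration.

```ruby
# Toy illustration only: handles just "tag.class" and "tag#id".
def describe_selector(selector)
  case selector
  when /\A(\w+)\.([\w-]+)\z/
    { tag: $1, attribute: "class", value: $2 }
  when /\A(\w+)\#([\w-]+)\z/
    { tag: $1, attribute: "id", value: $2 }
  else
    raise ArgumentError, "unsupported selector: #{selector}"
  end
end

# Hypothetical stand-in for a parsed element: a tag name plus attributes.
FakeElement = Struct.new(:tag, :attributes)

def matches?(element, selector)
  rule = describe_selector(selector)
  return false unless element.tag == rule[:tag]
  if rule[:attribute] == "class"
    # A class attribute may hold several space-separated names.
    element.attributes.fetch("class", "").split.include?(rule[:value])
  else
    element.attributes["id"] == rule[:value]
  end
end

link = FakeElement.new("a", { "class" => "l", "href" => "http://www.ruby-lang.org/" })
box  = FakeElement.new("div", { "id" => "res" })

puts matches?(link, "a.l")     # a.l     matches <a class="l">
puts matches?(box, "div#res")  # div#res matches <div id="res">
puts matches?(box, "a.l")      # a <div> is not matched by a.l
```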
Demo


 (if time permits)
Removing Google's ads (^^;
require 'rubygems'
require 'hpricot'
require 'kconv'
html = File.read("google.html") # open the HTML file
doc = Hpricot(html)             # create an Hpricot object

# empty the contents of td tags whose text contains "DODA"
(doc/"td[text()*=DODA]").empty

File.open("no_ad.html", "w"){|f| # write the result back to a file
  f.write doc.to_html
}
Output
Other features
 File upload (Mechanize)
 Proxy support (Mechanize)
 CSS selectors (Hpricot)
 XPath (Hpricot)
 Firefox + Firebug is handy for inspecting
 the tag structure
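
The XPath item above can be illustrated without Hpricot. The following is a minimal sketch that swaps in REXML, which is distributed with Ruby, as a stand-in; unlike an HTML parser it only accepts well-formed markup. The helper links_with_class and the sample markup are made up for illustration.

```ruby
require 'rexml/document'

# Hypothetical helper: returns [href, text] pairs for every <a> element
# whose class attribute is exactly class_name.
def links_with_class(markup, class_name)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, "//a[@class='#{class_name}']").map do |a|
    [a.attributes["href"], a.text]
  end
end

# Assumption: a small, well-formed sample standing in for google.html.
SAMPLE = <<~HTML
  <div id="res">
    <a class="l" href="http://www.ruby-lang.org/ja/">Ruby</a>
    <a class="other" href="http://example.com/">Example</a>
  </div>
HTML

links_with_class(SAMPLE, "l").each do |href, title|
  puts href
  puts title
end
```

The XPath expression //a[@class='l'] plays the same role here as the CSS selector a.l, except that it requires the class attribute to equal "l" exactly.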
Continued on the web!

   Thank you for your attention
Bonus
   cross-search.rb

> ./cross_search.rb Ruby
search_by_google: オブジェクト指向スクリプト言語 Ruby
search_by_excite: オブジェクト指向スクリプト言語 Ruby
search_by_yahoo: オブジェクト指向スクリプト言語 Ruby
search_by_msn: Rubyリファレンスマニュアル - Enumerable

  For some reason, only MSN Search puts the reference manual's Enumerable page at the top (^^;
