
Intro to beautiful soup


Introducing Beautiful Soup for web scraping using Python 3


  1. Intro to Beautiful Soup, by Andreas Chandra
  2. What is Beautiful Soup? crummy.com defines it as follows: "Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work."
  3. Install: Open your terminal or command prompt and run ◦ $ pip install beautifulsoup4 Or, on older setups, ◦ $ easy_install beautifulsoup4 (easy_install is deprecated, so prefer pip).
  4. Basics - Making a soup: Beautiful Soup takes HTML as a string. Example: html_doc = """ <html><head><title>Andreas Chandra</title></head> <body> <h1>Hello World!</h1> </body> </html> """
  5. Basics - Making a soup: Then parse the string into a BeautifulSoup object (after from bs4 import BeautifulSoup): soup = BeautifulSoup(html_doc, "html.parser")
  6. Basics - Extract: To get the title of the page, use: soup.title.text Result: ‘Andreas Chandra’
  7. Case Study – Detik.com: You want to get the titles of the most popular articles on the website. What do you do first?
  8. Case Study – Detik.com, step 1: Import the bs4 and urllib3 libraries (Python 3).
  9. Case Study – Detik.com, step 2: Download the HTML of the page.
  10. Case Study – Detik.com, step 3: Select the tag and id of the most popular section. You can find the tag and the id name by inspecting the element in your browser.
  11. Case Study – Detik.com, step 4: Find all ‘li’ elements, which make up the list of most popular articles.
  12. Case Study – Detik.com, step 5: Iterate over the selected ‘li’ elements and get the title of each article.
  13. Done: You can now get the titles of the most popular articles on detik.com without selecting, copying, and pasting them into Excel or Word. As a next step, you can save them to CSV or TXT for text mining.
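
The slides describe the code without showing it, so here is a minimal sketch of both parts: the basics from slides 4 to 6 and the five case-study steps. Several things in it are assumptions rather than the author's code. The function names fetch_html and popular_titles, the id "most-popular", and the sample markup are hypothetical. The real tag and id on detik.com have to be read from the page with inspect element, and they may change over time. The download helper uses urllib3's PoolManager and is defined but not called, so the example runs offline.

```python
from bs4 import BeautifulSoup

# Slides 4-6: a small HTML document held in a string.
html_doc = """
<html><head><title>Andreas Chandra</title></head>
<body>
<h1>Hello World!</h1>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")
page_title = soup.title.text  # text inside the <title> tag


def fetch_html(url):
    """Download a page and return its HTML as text (case study, step 2)."""
    import urllib3  # imported here so the offline demo below needs only bs4

    http = urllib3.PoolManager()
    response = http.request("GET", url)
    return response.data.decode("utf-8", errors="replace")


def popular_titles(html, container_id):
    """Return the text of every <li> inside the element with the given id
    (case study, steps 3-5)."""
    page = BeautifulSoup(html, "html.parser")
    container = page.find(id=container_id)
    if container is None:  # id not found, e.g. the site layout changed
        return []
    return [li.get_text(strip=True) for li in container.find_all("li")]


# Hypothetical stand-in for the downloaded page; NOT detik.com's real markup.
sample_html = """
<div id="most-popular">
  <ul>
    <li><a href="/a">First article</a></li>
    <li><a href="/b">Second article</a></li>
  </ul>
</div>
"""

titles = popular_titles(sample_html, "most-popular")
print(page_title)
print(titles)
```

Against a live site, the same function would be called as popular_titles(fetch_html(url), container_id), with the URL and the id taken from your own inspection of the page. Returning an empty list when the id is missing keeps the script from crashing if the page layout changes.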
