Intro to Beautiful Soup

AndreasChandra3 1,513 views 13 slides Jun 28, 2017

About This Presentation

Introducing Beautiful Soup for web scraping with Python 3


Slide Content

Intro to Beautiful Soup
ANDREAS CHANDRA

What is Beautiful Soup?
According to crummy.com, Beautiful Soup is a Python library for pulling data out of HTML and XML
files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and
modifying the parse tree. It commonly saves programmers hours or days of work.

Install
Open your terminal or command prompt and run:
◦ $ easy_install beautifulsoup4
Or, preferably (easy_install is deprecated):
◦ $ pip install beautifulsoup4
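To check that the install worked, one quick option is to import the package and print its version. This is a minimal sketch, assuming the package was installed into the same Python interpreter that runs the snippet:

```python
# The package is installed as "beautifulsoup4" but imported as "bs4".
import bs4

# __version__ holds the installed release as a string.
print(bs4.__version__)
```

If the import raises ModuleNotFoundError, the package most likely went into a different Python environment.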

Getting Basic - Making a soup
Beautiful Soup takes HTML as a string.
Example:
html_doc = """
<html><head><title>Andreas Chandra</title></head>
<body>
<h1>Hello World!</h1>
</body>
</html>
"""

Getting Basic - Making a soup
Then parse the string into a BeautifulSoup object:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser")
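Put together, the two slides above can be sketched as one runnable script. It reuses the HTML string and the "html.parser" argument from the slides; the two print calls are only an illustrative way to inspect the result:

```python
from bs4 import BeautifulSoup

# The HTML document from the slides, held in an ordinary Python string.
html_doc = """
<html><head><title>Andreas Chandra</title></head>
<body>
<h1>Hello World!</h1>
</body>
</html>
"""

# "html.parser" is the parser that ships with the Python standard library.
soup = BeautifulSoup(html_doc, "html.parser")

# The soup is a tree of tags; prettify() renders it with indentation.
print(type(soup).__name__)
print(soup.prettify())
```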

Getting Basic - Extract
To get the title of the page, simply write:
soup.title.text
Result:
'Andreas Chandra'
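As a small self-contained sketch of the same idea, the snippet below shows the difference between a tag and the text inside it. The h1 lookup is an extra illustration that is not on the slide:

```python
from bs4 import BeautifulSoup

html_doc = (
    "<html><head><title>Andreas Chandra</title></head>"
    "<body><h1>Hello World!</h1></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

title_tag = soup.title        # the whole <title> element
title_text = soup.title.text  # only the text inside it
heading_text = soup.h1.text   # the same attribute-style access works for <h1>

print(title_tag)     # <title>Andreas Chandra</title>
print(title_text)    # Andreas Chandra
print(heading_text)  # Hello World!
```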

Case Study - Detik.com
You want to get the titles of the most popular articles on the website.
What do you do first?

Case Study - Detik.com
1. Import the bs4 and urllib3 libraries (Python 3)

Case Study - Detik.com
2. Download the HTML of the page
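The slides' original code for steps 1 and 2 is not reproduced here, so the following is only a sketch of how the import and download could look with urllib3. The helper names download_html and make_soup, the status check, and the UTF-8 decoding are assumptions made for illustration:

```python
import urllib3
from bs4 import BeautifulSoup


def download_html(url):
    """Fetch a page with urllib3 and return its body as text."""
    http = urllib3.PoolManager()
    response = http.request("GET", url)
    if response.status != 200:
        raise RuntimeError(f"Unexpected HTTP status {response.status} for {url}")
    # response.data is bytes; decode it before handing it to Beautiful Soup.
    # UTF-8 is an assumption about the page encoding.
    return response.data.decode("utf-8", errors="replace")


def make_soup(url):
    """Download a page and parse it (hypothetical helper name)."""
    return BeautifulSoup(download_html(url), "html.parser")
```

Calling make_soup with the site's address (for example "https://www.detik.com/", assuming that is the page you want) needs network access and returns a soup ready for the next steps.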

Case Study - Detik.com
3. Select the tag and id of the most popular section. You can find the tag and id name by
inspecting the page element in your browser.
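A sketch of step 3 on stand-in markup follows. The real tag and id used by detik.com are not given in the slides, so the div tag and the "most-popular" id below are hypothetical placeholders to replace with whatever inspect element shows:

```python
from bs4 import BeautifulSoup

# Stand-in HTML: NOT the real detik.com markup.
sample_html = """
<div id="most-popular">
  <ul>
    <li><a href="/a1">First article</a></li>
    <li><a href="/a2">Second article</a></li>
  </ul>
</div>
<div id="other"><ul><li>Unrelated item</li></ul></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first element matching the tag and id, or None.
popular = soup.find("div", id="most-popular")
missing = soup.find("div", id="does-not-exist")

print(popular["id"])  # most-popular
print(missing)        # None
```

Checking for None before going further is worthwhile, because a site can change its markup at any time.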

Case Study - Detik.com
4. Find all ‘li’ elements in the list of most popular articles
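Step 4 might look like the sketch below, again on stand-in markup with a hypothetical div tag and "most-popular" id. It searches from the selected container rather than from the whole page, so list items elsewhere are left out:

```python
from bs4 import BeautifulSoup

# Stand-in HTML: NOT the real detik.com markup.
sample_html = """
<div id="most-popular">
  <ul>
    <li><a href="/a1">First article</a></li>
    <li><a href="/a2">Second article</a></li>
  </ul>
</div>
<div id="other"><ul><li>Unrelated item</li></ul></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")
popular = soup.find("div", id="most-popular")

# find_all() returns every matching descendant as a list-like result.
items = popular.find_all("li")
all_items = soup.find_all("li")

print(len(items))      # 2
print(len(all_items))  # 3
```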

Case Study - Detik.com
5. Then iterate over the selected ‘li’ elements and get the title of each article
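Steps 3 to 5 can be wrapped into one function, sketched below. The function name popular_titles, its default tag and id, and the sample markup are all hypothetical; only the find, find_all, and loop pattern comes from the slides:

```python
from bs4 import BeautifulSoup


def popular_titles(html, tag="div", element_id="most-popular"):
    """Return the text of every <li> inside the element with the given tag and id."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(tag, id=element_id)
    if container is None:
        # The section was not found, for example because the markup changed.
        return []
    titles = []
    for item in container.find_all("li"):
        # get_text(strip=True) drops surrounding whitespace and newlines.
        titles.append(item.get_text(strip=True))
    return titles


# Stand-in HTML: NOT the real detik.com markup.
sample_html = """
<div id="most-popular">
  <ul>
    <li><a href="/a1"> First article </a></li>
    <li><a href="/a2">Second article</a></li>
  </ul>
</div>
"""
print(popular_titles(sample_html))  # ['First article', 'Second article']
```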

Done
Cool, you can now get the titles of the most popular articles on detik.com without selecting,
copying, and pasting them into Excel or Word by hand. As a further step, you can save them to a
CSV or text file for text mining.
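Saving the collected titles to CSV can be sketched with the standard library csv module. The function name save_titles and the single "title" column are assumptions for illustration:

```python
import csv


def save_titles(titles, path):
    """Write one title per row to a CSV file with a single 'title' column."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title"])
        for title in titles:
            writer.writerow([title])
```

For example, save_titles(["First article", "Second article"], "popular_titles.csv") would create a three-line file: the header followed by one row per title. The csv module quotes titles that contain commas, so they load back correctly.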