Basic usage - BeautifulSoup¶
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
This is the HTML document (html_doc) used as the example throughout this page. It is taken from the official Beautiful Soup documentation.
soup.prettify()¶
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ; and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
Printing soup.prettify() shows the hierarchy of the HTML document, with each tag and each string on its own line.
soup.title.string¶
print(soup.title.string)
The Dormouse's story
Returns the text contained in the title tag.
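One detail worth illustrating: .string only yields a value when a tag has a single child. The sketch below uses a small hypothetical snippet (not html_doc) and the made-up names snippet and soup_demo to show what happens when a tag mixes text and child tags.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration; not part of html_doc.
snippet = "<p>Hello <b>world</b></p>"
soup_demo = BeautifulSoup(snippet, "html.parser")

single_child = soup_demo.b.string    # <b> has exactly one text child
mixed_children = soup_demo.p.string  # <p> has a text node plus a <b> tag
full_text = soup_demo.p.get_text()   # get_text() joins all nested text

print(single_child)
print(mixed_children)
print(full_text)
```

When .string comes back as None, get_text() is usually the safer way to read the text.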
soup.title.parent.name¶
print(soup.title.parent.name)
head
Returns the name of the title tag’s parent tag.
soup.p['class']¶
print(soup.p['class'])
['title']
Returns the value of the class attribute of the first p tag. Because class can hold several values, the result is a list.
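To make the list behaviour concrete, here is a minimal sketch assuming a hypothetical tag with two classes. The snippet and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration; not part of html_doc.
snippet = '<p class="story intro" id="first">text</p>'
tag = BeautifulSoup(snippet, "html.parser").p

classes = tag["class"]  # multi-valued attribute: split into a list
tag_id = tag["id"]      # single-valued attribute: a plain string

print(classes)
print(tag_id)
```

Attributes such as id or href come back as plain strings, while class is split on whitespace.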
soup.a¶
print(soup.a)
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
Returns the first a tag in the document.
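Accessing a tag name as an attribute behaves like a call to find() with that name, and the accesses can be chained. A small sketch, using a hypothetical two-link snippet rather than html_doc:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration; not part of html_doc.
snippet = '<p><a id="link1">Elsie</a> <a id="link2">Lacie</a></p>'
soup_demo = BeautifulSoup(snippet, "html.parser")

by_attribute = soup_demo.a      # first <a> anywhere in the document
by_find = soup_demo.find("a")   # the equivalent explicit call
chained = soup_demo.p.a         # first <a> inside the first <p>

print(by_attribute["id"], by_find["id"], chained["id"])
```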
soup.find_all()¶
print(soup.find_all('a'))
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
Returns all a tags in the form of a list.
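find_all() also accepts filters. The sketch below narrows the search with the class_ and limit arguments; the snippet, including its extra "nav" link, is a hypothetical example and not part of html_doc.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration; the "nav" link is made up.
snippet = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/about" class="nav" id="about">About</a>
"""
soup_demo = BeautifulSoup(snippet, "html.parser")

sisters = soup_demo.find_all("a", class_="sister")  # filter by CSS class
first_only = soup_demo.find_all("a", limit=1)       # stop after one match
sister_ids = [a["id"] for a in sisters]

print(sister_ids)
print(len(first_only))
```

The keyword is spelled class_ because class is a reserved word in Python.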
soup.find()¶
print(soup.find(id="link3"))
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Returns the first tag whose id attribute is link3.
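When nothing matches, find() returns None rather than raising an error, so it helps to check the result before using it. A minimal sketch with a hypothetical one-link snippet and a made-up id, link4:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration; "link4" does not exist in it.
snippet = '<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>'
soup_demo = BeautifulSoup(snippet, "html.parser")

found = soup_demo.find(id="link3")
missing = soup_demo.find(id="link4")  # no such id, so this is None

# Guard against None before calling methods on the result.
label = missing.get_text() if missing is not None else "(not found)"

print(found.get_text())
print(label)
```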
get()¶
for link in soup.find_all('a'):
    print(link.get('href'))
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
Prints the value of the href attribute of each a tag.
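The difference between get() and square-bracket access shows up when an attribute is missing. The sketch below assumes a hypothetical snippet in which the second a tag has no href.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second <a> tag deliberately has no href.
snippet = '<a href="http://example.com/elsie">Elsie</a><a name="top">Top</a>'
soup_demo = BeautifulSoup(snippet, "html.parser")
links = soup_demo.find_all("a")

hrefs = [link.get("href") for link in links]               # missing -> None
hrefs_or_hash = [link.get("href", "#") for link in links]  # missing -> "#"

try:
    links[1]["href"]  # subscripting a missing attribute raises KeyError
    raised = False
except KeyError:
    raised = True

print(hrefs)
print(hrefs_or_hash)
print(raised)
```

Using get() is convenient when scraping pages where some links may lack the attribute.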
get_text()¶
print(soup.get_text())
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
Returns all the text in the HTML document, with the tags removed.
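The raw output of get_text() keeps the whitespace and line breaks of the source. As a rough sketch, the separator and strip arguments (or the stripped_strings generator) can tidy this up; the snippet below is a hypothetical example, not html_doc.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration; note the padding around each name.
snippet = "<p> Elsie </p>\n<p> Lacie </p>\n<p> Tillie </p>"
soup_demo = BeautifulSoup(snippet, "html.parser")

raw = soup_demo.get_text()                    # whitespace kept as-is
joined = soup_demo.get_text("|", strip=True)  # strip each piece, join with "|"
names = list(soup_demo.stripped_strings)      # same pieces as a list

print(repr(raw))
print(joined)
print(names)
```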