Basic usage - BeautifulSoup¶

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

The HTML document above (html_doc), taken from the official Beautiful Soup documentation, is the example used throughout this page.

soup.prettify()¶

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ; and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

Printing soup.prettify() shows the HTML document as an indented tree, with each tag and string on its own line.

soup.title¶

print(soup.title)

<title>The Dormouse's story</title>

Returns the title tag.

soup.title.name¶

print(soup.title.name)

title

Returns the name (‘title’) of the title tag.

soup.title.string¶

print(soup.title.string)

The Dormouse's story

Returns the string (the text content) inside the title tag.
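One detail worth knowing is that .string only has a value when a tag contains a single string (directly, or through a single nested tag). The sketch below uses a shortened copy of the example document; the variable names are illustrative and not part of the original page.

```python
from bs4 import BeautifulSoup

# Shortened copy of the example document so the snippet runs on its own.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time: <a href="http://example.com/elsie" id="link1">Elsie</a></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# <title> has exactly one string child, so .string returns it.
title_text = soup.title.string

# The first <p> has a single child <b>, which has a single string,
# so .string is passed up from <b> to <p>.
nested_text = soup.p.string

# The story <p> contains text plus an <a> tag, so .string is None.
story = soup.find("p", class_="story")
mixed_text = story.string

print(title_text)
print(nested_text)
print(mixed_text)
```

When .string is None, get_text() is usually the better choice for collecting all the text inside a tag.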

soup.title.parent.name¶

print(soup.title.parent.name)

head

Returns the name of the title tag’s parent tag.

soup.p¶

print(soup.p)

<p class="title"><b>The Dormouse's story</b></p>

Returns the first p tag.

soup.p['class']¶

print(soup.p['class'])

['title']

Returns the value of the class attribute of the first p tag. Because class can hold several values, the result is a list.
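To illustrate how attribute values come back, here is a minimal sketch with a made-up one-line document (the "intro" class and "first" id are assumptions for illustration, not part of the example above). Multi-valued attributes such as class are returned as a list, other attributes as a plain string.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document with two classes and an id.
html = '<p class="story intro" id="first">Once upon a time</p>'
soup = BeautifulSoup(html, "html.parser")

classes = soup.p["class"]   # multi-valued attribute: a list of class names
tag_id = soup.p["id"]       # single-valued attribute: a string
all_attrs = soup.p.attrs    # every attribute of the tag, as a dict

print(classes)
print(tag_id)
print(all_attrs)
```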

soup.a¶

print(soup.a)

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

Returns the first a tag.

soup.find_all()¶

print(soup.find_all('a'))

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Returns all a tags in the form of a list.
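find_all() also accepts filters. The following sketch, run against a shortened copy of the example document, shows a few common ones: filtering by class with class_, capping the number of results with limit, and passing a list of tag names. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the example document so the snippet runs on its own.
html = """
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# class is a reserved word in Python, so the keyword is class_.
sisters = soup.find_all("a", class_="sister")

# limit stops the search after the given number of matches.
first_two = soup.find_all("a", limit=2)

# A list of names matches any of them, in document order.
p_and_a = soup.find_all(["p", "a"])

# No match gives an empty result, not an error.
tables = soup.find_all("table")

print(len(sisters))
print([a["id"] for a in first_two])
print([tag.name for tag in p_and_a])
print(len(tables))
```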

soup.find()¶

print(soup.find(id="link3"))

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Returns the first tag whose id is ‘link3’.
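Unlike find_all(), find() returns None when nothing matches, so the result is worth checking before it is used. A minimal sketch, assuming a one-link document and a hypothetical id "link4" that does not exist:

```python
from bs4 import BeautifulSoup

# One-link excerpt of the example document.
html = '<p class="story"><a href="http://example.com/tillie" class="sister" id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")

found = soup.find(id="link3")
missing = soup.find(id="link4")   # "link4" is hypothetical: no such tag

# Guard against None before reading attributes.
found_href = found["href"] if found is not None else None
missing_href = missing["href"] if missing is not None else None

# Attribute-style access (soup.table) behaves the same way as find().
no_table = soup.table

print(found_href)
print(missing_href)
print(no_table)
```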

get()¶

for link in soup.find_all('a'):
    print(link.get('href'))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie

Returns the value of the href attribute of each a tag.
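get() differs from square-bracket access when the attribute is missing: get() returns None (or a default you supply), while indexing raises KeyError. The sketch below uses a made-up second link without an href to show the difference; that link is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# The second <a> is hypothetical: it has no href attribute.
html = '<a href="http://example.com/elsie">Elsie</a><a id="anchor">no link</a>'
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")

# get() returns None for a missing attribute.
hrefs = [a.get("href") for a in links]

# A second argument to get() is used as the default instead of None.
with_default = links[1].get("href", "#")

# Indexing raises KeyError for a missing attribute.
try:
    links[1]["href"]
    raised = False
except KeyError:
    raised = True

# href=True keeps only the tags that actually have the attribute.
only_linked = [a["href"] for a in soup.find_all("a", href=True)]

print(hrefs)
print(with_default)
print(raised)
print(only_linked)
```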

get_text()¶

print(soup.get_text())

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

Returns all the text in the HTML document, with the tags removed.
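The extra blank lines and spaces in this output come from whitespace in the source HTML. As a rough sketch, the separator and strip arguments of get_text() can tidy this up; the small document here is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with stray whitespace around the names.
html = "<div><p> Elsie </p><p> Lacie </p><p>Tillie</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Default: strings are concatenated exactly as they appear.
raw = soup.get_text()

# strip=True trims each string; separator is placed between them.
clean = soup.get_text(separator="|", strip=True)

# stripped_strings yields the same trimmed pieces one at a time.
pieces = list(soup.stripped_strings)

print(repr(raw))
print(clean)
print(pieces)
```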
