Basic usage - BeautifulSoup


html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

This HTML document (html_doc), taken from the official Beautiful Soup documentation, is used as the example throughout.


soup.prettify()

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ; and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

Printing soup.prettify() shows the html document as an indented tree, with each tag and each string on its own line.



soup.title

print(soup.title)
<title>The Dormouse's story</title>

Returns the title tag.



soup.title.name

print(soup.title.name)
title

Returns the name (‘title’) of the title tag.



soup.title.string

print(soup.title.string)
The Dormouse's story

Returns the string contained in the title tag.
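
One detail worth knowing about .string is that it only gives back text when the tag has a single child. A minimal sketch, assuming a small hypothetical snippet (not the html_doc above) in which a p tag mixes plain text with an a tag:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration: the <p> tag mixes text and an <a> tag.
snippet = '<p class="story">Her name was <a id="link1">Elsie</a>.</p>'
snippet_soup = BeautifulSoup(snippet, 'html.parser')

single_child = snippet_soup.a.string    # <a> has exactly one child, a string
mixed_children = snippet_soup.p.string  # <p> has several children, so .string is None

print(single_child)
print(mixed_children)
```

When a tag has mixed content like this, get_text() is usually the better choice.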



soup.title.parent.name

print(soup.title.parent.name)
head

Returns the name of the title tag’s parent tag.



soup.p

print(soup.p)
<p class="title"><b>The Dormouse's story</b></p>

Returns the first p tag.



soup.p['class']

print(soup.p['class'])
['title']

Returns the value of the class attribute of the first p tag. Because class can hold several values, it is returned as a list.
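
To see why the result above is a list rather than a plain string, here is a small sketch. The tag below is a made-up example with two class names and an id; it is not part of html_doc.

```python
from bs4 import BeautifulSoup

# Hypothetical tag with two class names and an id.
tag_soup = BeautifulSoup('<p class="story intro" id="first">...</p>', 'html.parser')

classes = tag_soup.p['class']  # multi-valued attribute: a list of class names
tag_id = tag_soup.p['id']      # single-valued attribute: a plain string

print(classes)
print(tag_id)
```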



soup.a

print(soup.a)
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

Returns the first a tag.



soup.find_all()

print(soup.find_all('a'))
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Returns every a tag in the document as a list (a ResultSet, which behaves like a list).
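
find_all() also accepts filters besides the tag name. The sketch below is only an illustration on a shortened copy of the example links. It narrows the search by class with class_, caps the number of results with limit, and matches an attribute against a regular expression.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the three example links.
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
links_soup = BeautifulSoup(html, 'html.parser')

# Only the first two <a> tags that carry the class "sister".
first_two = links_soup.find_all('a', class_='sister', limit=2)
# <a> tags whose href contains "lacie" or "tillie".
matched = links_soup.find_all('a', href=re.compile(r'lacie|tillie'))

first_two_ids = [a['id'] for a in first_two]
matched_ids = [a['id'] for a in matched]
print(first_two_ids)
print(matched_ids)
```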



soup.find()

print(soup.find(id="link3"))
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Returns the first tag whose id is ‘link3’.
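
When nothing matches, find() returns None rather than raising an error, so it is worth checking the result before using it. A minimal sketch on a one-tag document; the id 'link4' is a hypothetical value chosen because it does not exist:

```python
from bs4 import BeautifulSoup

one_link = BeautifulSoup(
    '<a href="http://example.com/tillie" id="link3">Tillie</a>', 'html.parser'
)

found = one_link.find(id='link3')
missing = one_link.find(id='link4')       # hypothetical id, not in the document

# Guard against None before touching attributes of the result.
missing_href = missing.get('href') if missing is not None else None
empty_list = one_link.find_all(id='link4')  # find_all() gives an empty list instead

print(found)
print(missing)
print(empty_list)
```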



get()

for link in soup.find_all('a'):
    print(link.get('href'))
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie

Returns the value of the href attribute of each a tag.
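
get() differs from square-bracket access in how it handles an attribute that is missing. The following sketch shows this on a single link; the 'title' attribute and the 'n/a' fallback are illustrative choices, not part of the original example.

```python
from bs4 import BeautifulSoup

link_soup = BeautifulSoup(
    '<a href="http://example.com/elsie" id="link1">Elsie</a>', 'html.parser'
)
link = link_soup.a

href = link.get('href')              # attribute exists: its value is returned
title = link.get('title')            # attribute missing: None
fallback = link.get('title', 'n/a')  # attribute missing: the given default

# Square brackets raise KeyError for a missing attribute.
try:
    link['title']
    raised = False
except KeyError:
    raised = True

print(href, title, fallback, raised)
```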



get_text()

print(soup.get_text())
The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

Returns all the text in the html document, with the tags removed.
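
get_text() also takes a separator and a strip flag, which helps when the raw output has awkward whitespace. A small sketch on a two-paragraph fragment of the example; the '|' separator is an arbitrary choice for illustration.

```python
from bs4 import BeautifulSoup

text_soup = BeautifulSoup(
    '<p class="title"><b>The Dormouse\'s story</b></p>\n<p class="story">...</p>',
    'html.parser',
)

raw = text_soup.get_text()                    # strings concatenated as they appear
joined = text_soup.get_text('|', strip=True)  # strings stripped, then joined with '|'

print(repr(raw))
print(joined)
```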
