Learning BeautifulSoup


0. Install BeautifulSoup

$ apt-get install python-bs4      # Debian/Ubuntu system package (python3-bs4 for Python 3)

or

$ pip install beautifulsoup4      # from PyPI
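
A quick way to confirm the install worked is to import the package and print its version. This is only a minimal check; the version string printed depends on what was installed.

```python
import bs4

# If this import succeeds, the package is installed for this interpreter.
print(bs4.__version__)
```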

1. Import BeautifulSoup and Load HTML

>>> from bs4 import BeautifulSoup

>>> soup = BeautifulSoup(open('test.html'), 'html.parser')      # name the parser explicitly

>>> print(soup.prettify())      # show the HTML tree
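
The same step can be tried without a file on disk. The sketch below parses an inline HTML string instead of test.html; the sample markup and the name html_doc are made up for illustration, and 'html.parser' is the parser that ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, standing in for test.html.
html_doc = """
<html>
  <head><title>Test page</title></head>
  <body>
    <p class="intro">First paragraph.</p>
    <a id="top"></a>
    <a href="#mw-navigation">navigation</a>
  </body>
</html>
"""

# Naming the parser explicitly keeps the result the same across machines.
soup = BeautifulSoup(html_doc, "html.parser")

pretty = soup.prettify()   # the HTML tree as an indented string
print(pretty)
```

To load a real file, a with-block such as `with open('test.html') as f: soup = BeautifulSoup(f, 'html.parser')` also closes the file afterwards.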

2. Retrieve tags/elements

We can reach HTML elements with dot notation ("."): each attribute name is a tag name and returns the first matching tag.


Find by tag:

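>>> tag = soup.html.head.title      # walk html -> head -> title

# type(tag): <class 'bs4.element.Tag'>

# The notes below use a tag such as <b class="boldest"> as the example.

# Attribute of "Tag": name (the name of the tag, e.g. tag.name: 'b')

# Attribute of "Tag": attrs (a dict of the tag's attributes; tag['class'] looks one up, e.g. ['boldest'])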
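
The Tag attributes described above can be checked on a one-tag document. This is a small sketch; the markup (including the id value x1) is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document used only for this illustration.
soup = BeautifulSoup('<b class="boldest" id="x1">Extremely bold</b>', "html.parser")

tag = soup.b                 # first <b> tag in the document

print(type(tag))             # <class 'bs4.element.Tag'>
print(tag.name)              # b
print(tag.attrs)             # dict of all attributes on the tag
print(tag["class"])          # ['boldest'] (class can hold several values, so it is a list)
print(tag.get("href"))       # None: .get() avoids a KeyError for a missing attribute
```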

>>> tags = soup.find_all('a')     # a list of all <a> tags

[<a id="top"></a>, <a href="#mw-navigation">navigation</a>]
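
A short sketch of working with the result of find_all, assuming a two-link document like the output shown above (the surrounding <p> tag is made up). It also shows find, which returns only the first match, and filtering on the presence of an attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the two links in the output above.
html_doc = '<p><a id="top"></a><a href="#mw-navigation">navigation</a></p>'
soup = BeautifulSoup(html_doc, "html.parser")

tags = soup.find_all("a")          # list-like collection of Tag objects
print(len(tags))                   # 2

for a in tags:
    # .get() returns None when the attribute is absent
    print(a.get("href"), a.get_text())

first = soup.find("a")                      # only the first match, or None if there is none
with_href = soup.find_all("a", href=True)   # only <a> tags that have an href attribute
```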


Find by content:

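>>> import re

# re.compile is necessary for pattern matching; a plain string only matches text that equals it exactly.
# find_all is the current name of findAll; string= was called text= before Beautiful Soup 4.4.0.

>>> print(soup.find_all(string=re.compile("para")))      # all strings (in a list) that contain the pattern "para"

>>> print(soup.find_all(string=re.compile("para"))[0].parent)      # the tag that contains the first match

>>> print(soup.find_all(string=re.compile("para"))[0].parent.contents)      # all children of that tag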
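
The content search can be sketched end to end as follows. The sample markup is invented, and the string= argument assumes Beautiful Soup 4.4.0 or later (older versions call it text=).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup for the search.
html_doc = """
<div>
  <p>First paragraph of text.</p>
  <p>Nothing to match here.</p>
  <span>Another para<b>graph</b></span>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Regex search: every string that contains "para" somewhere.
matches = soup.find_all(string=re.compile("para"))
print(matches)

# A plain string must equal the whole text, so this finds nothing here.
exact = soup.find_all(string="para")
print(exact)

parent = matches[0].parent          # the tag that holds the first matching string
print(parent.name)

siblings = matches[1].parent.contents   # all children of the tag holding the second match
print(siblings)
```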



REF: http://www.pythonclub.org/modules/beautifulsoup/start