HTML Parsing

Parse HTML within a response

HTML scraping is built on selectolax, which is more than 25x faster than BeautifulSoup4 (bs4) in the benchmark below. The functionality is inspired by requests-html.

Library           Time (1e5 trials)
BeautifulSoup4    52.6
PyQuery           7.5
selectolax        1.9
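As a quick check of the "25x" figure, the ratios can be worked out directly from the timings in the table. This is plain arithmetic on the reported numbers, not a re-run of the benchmark; the variable names are illustrative.

```python
# Timings copied from the table above (units as reported there).
timings = {"BeautifulSoup4": 52.6, "PyQuery": 7.5, "selectolax": 1.9}

baseline = timings["selectolax"]

# How many times slower each library is relative to selectolax.
speedups = {name: t / baseline for name, t in timings.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```

By these numbers selectolax is roughly 27.7x faster than BeautifulSoup4 and about 3.9x faster than PyQuery.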

The HTML parser can be accessed through the html attribute of the response object:

>>> resp = session.get('https://python.org/')
>>> resp.html
<HTML url='https://www.python.org/'>

Parsing a page

Grab the set of all links on the page, as-is (anchors excluded):

>>> resp.html.links
{'//docs.python.org/3/tutorial/', '/about/apps/', 'https://github.com/python/pythondotorg/issues', '/accounts/login/', '/dev/peps/', '/about/legal/',...

Grab the set of all links on the page, in absolute form (anchors excluded):

>>> resp.html.absolute_links
{'https://github.com/python/pythondotorg/issues', 'https://docs.python.org/3/tutorial/', 'https://www.python.org/about/success/', 'http://feedproxy.g...
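To make the difference between the two attributes concrete, here is a minimal standard-library sketch of the idea: collect `href` values as-is, skip in-page anchors, then resolve each one against the page URL. It is not the library's implementation; `LinkCollector`, `collect_links`, `absolutize`, and the sample markup are hypothetical names and data made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, skipping in-page anchors."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip missing hrefs and in-page anchors such as "#content".
        if href and not href.startswith("#"):
            self.links.add(href)


def collect_links(html):
    """Return the set of links exactly as written in the markup."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def absolutize(links, base_url):
    """Resolve relative and protocol-relative links against base_url."""
    return {urljoin(base_url, link) for link in links}


# Hypothetical sample markup, loosely modelled on the output above.
sample = """
<a href="/about/apps/">Apps</a>
<a href="//docs.python.org/3/tutorial/">Tutorial</a>
<a href="#content">Skip to content</a>
<a href="https://github.com/python/pythondotorg/issues">Issues</a>
"""

links = collect_links(sample)
absolute = absolutize(links, "https://www.python.org/")
print(sorted(links))
print(sorted(absolute))
```

Relative paths pick up the scheme and host of the base URL, protocol-relative links pick up only the scheme, and links that are already absolute pass through unchanged.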

Search for text on the page:

>>> resp.html.search('Python is a {} language')[0]
programming
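The `{}` placeholder behaves like a capture slot in a template. A rough standard-library sketch of that idea follows, assuming a simple translation of the template into a regular expression; the actual matching rules used by the library may differ, and the `search` helper and sample text here are hypothetical.

```python
import re


def search(template, text):
    """Return a tuple of the values captured by each {} placeholder, or None."""
    # Escape the literal parts and turn every {} into a non-greedy group.
    parts = [re.escape(part) for part in template.split("{}")]
    pattern = "(.+?)".join(parts)
    match = re.search(pattern, text)
    return match.groups() if match else None


# Hypothetical page text used only for this illustration.
page_text = "Python is a programming language that lets you work quickly."

result = search("Python is a {} language", page_text)
print(result[0])
```

Indexing the result with `[0]` picks the first placeholder's value, mirroring the call shown above.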

Selecting elements

Select an element using a CSS Selector:

Parameters

Introspecting elements

Grab an Element's text contents:

Getting an Element's attributes:

Get an Element's raw HTML:

Select Elements within Elements:

Searching by HTML attributes:

Search for links within an element:
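The operations listed in this section (selecting an element, reading its text, attributes, and raw markup, selecting nested elements, filtering by attribute, and collecting links) can be illustrated with a standard-library analogue. This sketch uses `xml.etree.ElementTree` on a small well-formed snippet and XPath-style queries in place of CSS selectors; it is not this library's API, and the sample markup and variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Well-formed sample markup (hypothetical, not taken from python.org).
snippet = """
<ul>
  <li id="about" class="tier-1" aria-haspopup="true">
    <a href="/about/" title="About Python">About</a>
    <ul>
      <li><a href="/about/apps/" title="Applications">Applications</a></li>
      <li><a href="/about/help/" title="Help">Help</a></li>
    </ul>
  </li>
</ul>
""".strip()

root = ET.fromstring(snippet)

# Select an element (the CSS selector "#about" expressed as XPath).
about = root.find(".//*[@id='about']")

# Text contents: join every text node below the element, then
# collapse runs of whitespace into single spaces.
text = " ".join("".join(about.itertext()).split())

# Attributes as a plain dict.
attrs = dict(about.attrib)

# Raw markup of the element, serialized back to a string.
raw = ET.tostring(about, encoding="unicode")

# Elements within the element.
anchors = about.findall(".//a")

# Search by attribute value.
help_link = about.find(".//a[@title='Help']")

# Links within the element.
links = {a.get("href") for a in anchors}

print(text)
print(attrs)
print(help_link.get("href"))
print(sorted(links))
```

ElementTree only accepts well-formed markup, so this approach would not cope with typical real-world HTML; it is meant only to show what each operation returns.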

