HTML Parsing
Parse HTML within a response
HTML scraping is built on selectolax, which is over 25x faster than BeautifulSoup4 (bs4) in the benchmark below. This functionality is inspired by requests-html.
Library          Time (1e5 trials)
BeautifulSoup4   52.6
PyQuery          7.5
selectolax       1.9
The HTML parser can be accessed through the html attribute of the response object:
>>> resp = session.get('https://python.org/')
>>> resp.html
<HTML url='https://www.python.org/'>
Parsing page
Grab a list of all links on the page, as-is (anchors excluded):
>>> resp.html.links
{'//docs.python.org/3/tutorial/', '/about/apps/', 'https://github.com/python/pythondotorg/issues', '/accounts/login/', '/dev/peps/', '/about/legal/',...
Grab a list of all links on the page, in absolute form (anchors excluded):
>>> resp.html.absolute_links
{'https://github.com/python/pythondotorg/issues', 'https://docs.python.org/3/tutorial/', 'https://www.python.org/about/success/', 'http://feedproxy.g...
Search for text on the page:
>>> resp.html.search('Python is a {} language')[0]
programming
Selecting elements
Select an element using a CSS Selector:
Introspecting elements
Grab an Element's text contents:
Getting an Element's attributes:
Get an Element's raw HTML:
Select Elements within Elements:
Searching by HTML attributes:
Search for links within an element: