HTML Parsing
Parse HTML within a response
HTML scraping is based off selectolax, which is over 25x faster than bs4. This functionality is inspired by requests-html.
Library
Time (1e5 trials)
BeautifulSoup4
52.6
PyQuery
7.5
selectolax
1.9
The HTML parser can be accessed through the html
attribute of the response object:
>>> resp = session.get('https://python.org/')
>>> resp.html
<HTML url='https://www.python.org/'>
Parsing page
Grab a list of all links on the page, as-is (anchors excluded):
>>> resp.html.links
{'//docs.python.org/3/tutorial/', '/about/apps/', 'https://github.com/python/pythondotorg/issues', '/accounts/login/', '/dev/peps/', '/about/legal/',...
Grab a list of all links on the page, in absolute form (anchors excluded):
>>> resp.html.absolute_links
{'https://github.com/python/pythondotorg/issues', 'https://docs.python.org/3/tutorial/', 'https://www.python.org/about/success/', 'http://feedproxy.g...
Search for text on the page:
>>> resp.html.search('Python is a {} language')[0]
programming
Selecting elements
Select an element using a CSS Selector:
>>> about = resp.html.find('#about')
Introspecting elements
Grab an Element's text contents:
>>> print(about.text)
About
Applications
Quotes
Getting Started
Help
Python Brochure
Getting an Element's attributes:
>>> about.attrs
{'id': 'about', 'class': ('tier-1', 'element-1'), 'aria-haspopup': 'true'}
>>> about.id
'about'
Get an Element's raw HTML:
>>> about.html
'<li aria-haspopup="true" class="tier-1 element-1 " id="about">\n<a class="" href="/about/" title="">About</a>\n<ul aria-hidden="true" class="subnav menu" role="menu">\n<li class="tier-2 element-1" role="treeitem"><a href="/about/apps/" title="">Applications</a></li>\n<li class="tier-2 element-2" role="treeitem"><a href="/about/quotes/" title="">Quotes</a></li>\n<li class="tier-2 element-3" role="treeitem"><a href="/about/gettingstarted/" title="">Getting Started</a></li>\n<li class="tier-2 element-4" role="treeitem"><a href="/about/help/" title="">Help</a></li>\n<li class="tier-2 element-5" role="treeitem"><a href="http://brochure.getpython.info/" title="">Python Brochure</a></li>\n</ul>\n</li>'
Select Elements within Elements:
>>> about.find_all('a')
[<Element 'a' href='/about/' title='' class=''>, <Element 'a' href='/about/apps/' title=''>, <Element 'a' href='/about/quotes/' title=''>, <Element 'a' href='/about/gettingstarted/' title=''>, <Element 'a' href='/about/help/' title=''>, <Element 'a' href='http://brochure.getpython.info/' title=''>]
>>> about.find('a')
<Element 'a' href='/about/' title='' class=''>
Searching by HTML attributes:
>>> about.find('il', role='treeitem')
<Element 'li' role='treeitem' class=('tier-2', 'element-1')>
Search for links within an element:
>>> about.absolute_links
{'http://brochure.getpython.info/', 'https://www.python.org/about/gettingstarted/', 'https://www.python.org/about/', 'https://www.python.org/about/quotes/', 'https://www.python.org/about/help/', 'https://www.python.org/about/apps/'}
Last updated