As @nosklo pointed out here, you are looking for anchor tags and the links in their href attributes. A parse tree is organized by the HTML elements themselves, and you find text by searching those elements specifically. For URLs, it looks like this (using the lxml library in Python 3.6):
from lxml import etree
from io import StringIO
import requests
# Set explicit HTMLParser
parser = etree.HTMLParser()
page = requests.get('https://URL.COM')
# Decode the page content from bytes to string
html = page.content.decode("utf-8")
# Create your etree from a StringIO object, which behaves like an
# open file handle
tree = etree.parse(StringIO(html), parser=parser)
# Call this function and pass in your tree
def get_links(tree):
    # This will get the anchor tags <a href...>
    refs = tree.xpath("//a")
    # Get the url from the ref
    links = [link.get('href', '') for link in refs]
    # Return only the links that end with .com.br
    return [l for l in links if l.endswith('.com.br')]
# Example call
links = get_links(tree)
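One caveat with the endswith filter above: it only matches hrefs whose last characters are .com.br, so a link such as https://site.com.br/page is dropped, and relative links never match. A minimal sketch of a hostname-based filter, using only the standard library, follows. The helper name filter_links_by_domain and the example URLs are hypothetical and not part of the original answer.

```python
from urllib.parse import urljoin, urlparse


def filter_links_by_domain(hrefs, base_url, suffix=".com.br"):
    """Resolve each href against base_url and keep those whose hostname ends with suffix."""
    kept = []
    for href in hrefs:
        if not href:
            # Skip anchors that have no href (or an empty one)
            continue
        # Turn relative links such as "/contato" into absolute URLs
        absolute = urljoin(base_url, href)
        # hostname is None for schemes like mailto:, so fall back to ""
        hostname = urlparse(absolute).hostname or ""
        if hostname.endswith(suffix):
            kept.append(absolute)
    return kept


# Hypothetical hrefs, as they might come back from tree.xpath("//a")
hrefs = ["https://loja.example.com.br/produtos", "/contato", "https://example.org/", ""]
result = filter_links_by_domain(hrefs, "https://www.example.com.br/index.html")
```

Whether a relative link should count as a match depends on what you are collecting. Here it is resolved against the page URL first, so it inherits that page's hostname.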
Answer from C.Nivs on Stack Overflow
Change the parser to html5lib
pip install html5lib
And then,
soup = BeautifulSoup(con.content,'html5lib')
The a tags are probably not on the top level.
soup.find_all('a')
is probably what you wanted.
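To illustrate the difference between a top-level lookup and a full-tree search, here is a small sketch on an inline document. The HTML snippet is made up for the example, and it uses Python's built-in "html.parser" so that it runs without html5lib. Passing 'html5lib' instead should work the same way once that package is installed.

```python
from bs4 import BeautifulSoup

# Made-up document in which the anchors sit several levels deep
html = """
<html><body>
  <div class="nav">
    <ul>
      <li><a href="/first">First</a></li>
      <li><a href="/second">Second</a></li>
    </ul>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# recursive=False only looks at direct children of the document,
# so anchors nested inside other elements are not found
top_level = soup.find_all("a", recursive=False)

# The default search walks the whole tree and finds every anchor
all_links = [a.get("href") for a in soup.find_all("a")]
```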
In general, I have found lxml to be more reliable, more consistent in its API, and faster. Yes, even more reliable: I have repeatedly had documents that BeautifulSoup failed to parse, but lxml in its robust mode, lxml.html.soupparser, still worked well. And there is the lxml.etree API, which is really easy to use.
The server in question is giving you a gzipped response. The server is also very broken; it sends the following headers:
$ curl -D - -o /dev/null -s -H 'Accept-Encoding: gzip, deflate' http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F
HTTP/1.1 200 OK
Date: Tue, 06 Jan 2015 17:46:49 GMT
Server: Apache
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd"><html xmlns="http: //www.w3.org/1999/xhtml" lang="en-US">
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 3659
Content-Type: text/html
The <!DOCTYPE..> line there is not a valid HTTP header. As such, the remaining headers past Server are ignored. Why the server interjects that is unclear; in all likelihood WRCCWrappers.py is a CGI script that doesn't output headers but does include a double newline after the doctype line, duping the Apache server into inserting additional headers there.
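The effect can be reproduced offline with the standard-library email parser, which, as far as I know, is what Python 3's http.client relies on to parse response headers. This is a minimal sketch rather than the exact code path requests takes. The raw_headers string is assembled by hand from the curl output above, not captured from the server.

```python
from email.parser import Parser

# Header lines copied from the curl output above (status line omitted)
raw_headers = (
    "Date: Tue, 06 Jan 2015 17:46:49 GMT\n"
    "Server: Apache\n"
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd"><html xmlns="http: //www.w3.org/1999/xhtml" lang="en-US">\n'
    "Vary: Accept-Encoding\n"
    "Content-Encoding: gzip\n"
    "Content-Length: 3659\n"
    "Content-Type: text/html\n"
    "\n"
)

# The parser stops collecting headers at the first line that is not
# a valid "Name: value" pair and treats the rest as the message body
message = Parser().parsestr(raw_headers)
parsed = dict(message.items())
header_names = list(parsed)
```

With this input, only the headers before the doctype line end up in parsed, so Content-Encoding is missing.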
As such, requests also doesn't detect that the data is gzip-encoded. The data is all there; you just have to decode it. Or you could, if it weren't rather incomplete.
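For a body that arrives still compressed, manual decoding could look roughly like the sketch below. It uses only the standard library; the helper name maybe_gunzip and the sample payload are hypothetical. Using zlib.decompressobj rather than gzip.decompress is a deliberate choice: a truncated stream then yields whatever could be decoded instead of raising.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def maybe_gunzip(body: bytes) -> bytes:
    """Return body decompressed if it looks like gzip, otherwise unchanged."""
    if not body.startswith(GZIP_MAGIC):
        return body
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    # A decompress object returns what it managed to decode, so
    # truncated input gives partial output rather than an exception
    return decompressor.decompress(body)


# Stand-in for r.content when Content-Encoding was not detected
original = b"<html><body>hello</body></html>"
compressed = gzip.compress(original)

decoded = maybe_gunzip(compressed)
partial = maybe_gunzip(compressed[:-4])   # simulate a cut-off response
untouched = maybe_gunzip(original)        # plain bodies pass through
```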
The work-around is to tell the server not to bother with compression:
headers = {'Accept-Encoding': 'identity'}
r = requests.get(url, headers=headers)
and an uncompressed response is returned.
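To check what will actually be sent without touching the network, the request can be prepared on a session first. This is a small sketch; the URL is a placeholder, and the exact default Accept-Encoding value depends on the installed requests version and optional compression libraries.

```python
import requests

url = "http://example.com/"  # placeholder URL, nothing is sent

session = requests.Session()
# What requests advertises by default (includes gzip)
default_encoding = session.headers["Accept-Encoding"]

# Per-request headers take precedence over the session defaults
request = requests.Request("GET", url, headers={"Accept-Encoding": "identity"})
prepared = session.prepare_request(request)
sent_encoding = prepared.headers["Accept-Encoding"]
```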
Incidentally, on Python 2 the HTTP header parser is not so strict and manages to declare the doctype a header:
>>> pprint(dict(r.headers))
{'<!doctype html public "-//w3c//dtd xhtml 1.0 transitional//en" "dtd/xhtml1-transitional.dtd"><html xmlns="http': '//www.w3.org/1999/xhtml" lang="en-US">',
'connection': 'Keep-Alive',
'content-encoding': 'gzip',
'content-length': '3659',
'content-type': 'text/html',
'date': 'Tue, 06 Jan 2015 17:42:06 GMT',
'keep-alive': 'timeout=5, max=100',
'server': 'Apache',
'vary': 'Accept-Encoding'}
and the content-encoding information survives, so on Python 2 requests decodes the content for you, as expected.
The HTTP headers for this URL have now been fixed.
>>> import requests
>>> print requests.__version__
2.5.1
>>> r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
>>> r.text[:100]
u'\n<!DOCTYPE html>\n<HTML>\n<HEAD><TITLE>Monthly Average of Precipitation, Station id: 028815</TITLE></H'
>>> r.headers
{'content-length': '3672', 'content-encoding': 'gzip', 'vary': 'Accept-Encoding', 'keep-alive': 'timeout=5, max=100', 'server': 'Apache', 'connection': 'Keep-Alive', 'date': 'Thu, 12 Feb 2015 18:59:37 GMT', 'content-type': 'text/html; charset=utf-8'}