The server in question is giving you a gzipped response. The server is also very broken; it sends the following headers:
$ curl -D - -o /dev/null -s -H 'Accept-Encoding: gzip, deflate' http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F
HTTP/1.1 200 OK
Date: Tue, 06 Jan 2015 17:46:49 GMT
Server: Apache
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd"><html xmlns="http: //www.w3.org/1999/xhtml" lang="en-US">
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 3659
Content-Type: text/html
The <!DOCTYPE..> line there is not a valid HTTP header. As such, the remaining headers past Server are ignored. Why the server interjects that is unclear; in all likely hood WRCCWrappers.py is a CGI script that doesn't output headers but does include a double newline after the doctype line, duping the Apache server into inserting additional headers there.
As such, requests also doesn't detect that the data is gzip-encoded. The data is all there, you just have to decode it. Or you could if it wasn't rather incomplete.
The work-around is to tell the server not to bother with compression:
headers = {'Accept-Encoding': 'identity'}
r = requests.get(url, headers=headers)
and an uncompressed response is returned.
Incidentally, on Python 2 the HTTP header parser is not so strict and manages to declare the doctype a header:
>>> pprint(dict(r.headers))
{'<!doctype html public "-//w3c//dtd xhtml 1.0 transitional//en" "dtd/xhtml1-transitional.dtd"><html xmlns="http': '//www.w3.org/1999/xhtml" lang="en-US">',
'connection': 'Keep-Alive',
'content-encoding': 'gzip',
'content-length': '3659',
'content-type': 'text/html',
'date': 'Tue, 06 Jan 2015 17:42:06 GMT',
'keep-alive': 'timeout=5, max=100',
'server': 'Apache',
'vary': 'Accept-Encoding'}
and the content-encoding information survives, so there requests decodes the content for you, as expected.
» pip install requests-html
Get html using Python requests? - Stack Overflow
Newest 'python-requests-html' Questions - Stack Overflow
Been using requests-html out of fear for using Selenium, advice?
Requests vs Requests-html vs beautifulsoup4
Videos
The server in question is giving you a gzipped response. The server is also very broken; it sends the following headers:
$ curl -D - -o /dev/null -s -H 'Accept-Encoding: gzip, deflate' http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F
HTTP/1.1 200 OK
Date: Tue, 06 Jan 2015 17:46:49 GMT
Server: Apache
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd"><html xmlns="http: //www.w3.org/1999/xhtml" lang="en-US">
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 3659
Content-Type: text/html
The <!DOCTYPE..> line there is not a valid HTTP header. As such, the remaining headers past Server are ignored. Why the server interjects that is unclear; in all likely hood WRCCWrappers.py is a CGI script that doesn't output headers but does include a double newline after the doctype line, duping the Apache server into inserting additional headers there.
As such, requests also doesn't detect that the data is gzip-encoded. The data is all there, you just have to decode it. Or you could if it wasn't rather incomplete.
The work-around is to tell the server not to bother with compression:
headers = {'Accept-Encoding': 'identity'}
r = requests.get(url, headers=headers)
and an uncompressed response is returned.
Incidentally, on Python 2 the HTTP header parser is not so strict and manages to declare the doctype a header:
>>> pprint(dict(r.headers))
{'<!doctype html public "-//w3c//dtd xhtml 1.0 transitional//en" "dtd/xhtml1-transitional.dtd"><html xmlns="http': '//www.w3.org/1999/xhtml" lang="en-US">',
'connection': 'Keep-Alive',
'content-encoding': 'gzip',
'content-length': '3659',
'content-type': 'text/html',
'date': 'Tue, 06 Jan 2015 17:42:06 GMT',
'keep-alive': 'timeout=5, max=100',
'server': 'Apache',
'vary': 'Accept-Encoding'}
and the content-encoding information survives, so there requests decodes the content for you, as expected.
The HTTP headers for this URL have now been fixed.
>>> import requests
>>> print requests.__version__
2.5.1
>>> r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
>>> r.text[:100]
u'\n<!DOCTYPE html>\n<HTML>\n<HEAD><TITLE>Monthly Average of Precipitation, Station id: 028815</TITLE></H'
>>> r.headers
{'content-length': '3672', 'content-encoding': 'gzip', 'vary': 'Accept-Encoding', 'keep-alive': 'timeout=5, max=100', 'server': 'Apache', 'connection': 'Keep-Alive', 'date': 'Thu, 12 Feb 2015 18:59:37 GMT', 'content-type': 'text/html; charset=utf-8'}
So basically all webscraping guides describe selenium as super slow, last option that you should avoid at all costs, so when I had to scrape a number of websites with dynamic content, that's what I did. I found an alternative in requests-html, primarily for its r.html.render() function and for the simplicity (no need to set headers etc.)
The more blogs and guides on webscraping, the more I become aware that nobody mentions requests-html, but always recommends selenium despite being the slowest of them all.
Is this a bad way to deal with dynamic content? Does Selenium have a similar function?