The server in question is giving you a gzipped response. The server is also very broken; it sends the following headers:
$ curl -D - -o /dev/null -s -H 'Accept-Encoding: gzip, deflate' http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F
HTTP/1.1 200 OK
Date: Tue, 06 Jan 2015 17:46:49 GMT
Server: Apache
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd"><html xmlns="http: //www.w3.org/1999/xhtml" lang="en-US">
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 3659
Content-Type: text/html
The <!DOCTYPE..> line there is not a valid HTTP header. As such, the remaining headers past Server are ignored. Why the server injects that line is unclear; in all likelihood WRCCWrappers.py is a CGI script that doesn't output headers but does include a double newline after the doctype line, duping the Apache server into inserting additional headers there.
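The effect can be reproduced offline with a minimal sketch. It assumes the headers are dropped by Python 3's standard-library parser, http.client.parse_headers, and it feeds that parser the header block from the curl output (status line omitted, doctype line abbreviated):

```python
import http.client
import io

# Header block from the curl output above, without the status line.
# The doctype line is abbreviated; it is the line that is not a valid header.
RAW_HEADERS = (
    b"Date: Tue, 06 Jan 2015 17:46:49 GMT\r\n"
    b"Server: Apache\r\n"
    b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">\r\n'
    b"Vary: Accept-Encoding\r\n"
    b"Content-Encoding: gzip\r\n"
    b"Content-Length: 3659\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
)

# parse_headers reads up to the blank line and hands the text to the
# email parser, which stops collecting headers at the first invalid line.
parsed = http.client.parse_headers(io.BytesIO(RAW_HEADERS))

print(parsed.keys())                   # headers seen before the doctype line
print(parsed.get("Content-Encoding"))  # lost, so nothing signals gzip
```

Only the headers before the doctype line survive, which is why nothing downstream knows the body is compressed.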
Because of that, requests doesn't detect that the data is gzip-encoded either. The compressed data is there and you could decode it yourself, if it weren't rather incomplete.
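Decoding by hand could look like the sketch below. The helper name maybe_gunzip is hypothetical, and the sketch assumes you pass it the raw bytes of the body (for example r.content). It uses zlib.decompressobj rather than gzip.decompress so that a truncated stream yields whatever could be decoded instead of raising:

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def maybe_gunzip(body: bytes) -> bytes:
    """Return the decompressed body if it looks like gzip, else the body unchanged."""
    if not body.startswith(GZIP_MAGIC):
        return body
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    # A decompress object returns the output it can produce from the input
    # it was given, so an incomplete stream gives partial output.
    return decompressor.decompress(body)


# Offline demonstration with locally compressed data.
original = b"<html>" + b"<p>row</p>" * 200 + b"</html>"
compressed = gzip.compress(original)
print(maybe_gunzip(compressed) == original)
```

With a truncated body the result is only a prefix of the page, so this does not replace the work-around that follows.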
The work-around is to tell the server not to bother with compression:
headers = {'Accept-Encoding': 'identity'}
r = requests.get(url, headers=headers)
and an uncompressed response is returned.
Incidentally, on Python 2 the HTTP header parser is not so strict and manages to declare the doctype a header:
>>> pprint(dict(r.headers))
{'<!doctype html public "-//w3c//dtd xhtml 1.0 transitional//en" "dtd/xhtml1-transitional.dtd"><html xmlns="http': '//www.w3.org/1999/xhtml" lang="en-US">',
'connection': 'Keep-Alive',
'content-encoding': 'gzip',
'content-length': '3659',
'content-type': 'text/html',
'date': 'Tue, 06 Jan 2015 17:42:06 GMT',
'keep-alive': 'timeout=5, max=100',
'server': 'Apache',
'vary': 'Accept-Encoding'}
and the content-encoding information survives, so there requests decodes the content for you, as expected.
The HTTP headers for this URL have now been fixed.
>>> import requests
>>> print requests.__version__
2.5.1
>>> r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
>>> r.text[:100]
u'\n<!DOCTYPE html>\n<HTML>\n<HEAD><TITLE>Monthly Average of Precipitation, Station id: 028815</TITLE></H'
>>> r.headers
{'content-length': '3672', 'content-encoding': 'gzip', 'vary': 'Accept-Encoding', 'keep-alive': 'timeout=5, max=100', 'server': 'Apache', 'connection': 'Keep-Alive', 'date': 'Thu, 12 Feb 2015 18:59:37 GMT', 'content-type': 'text/html; charset=utf-8'}
Try setting a User-Agent:
import requests
url = "http://localbusiness.com/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36',
    'Content-Type': 'text/html',
}
response = requests.get(url, headers=headers)
html = response.text
By default requests identifies itself with a User-Agent of the form python-requests/<version> (for example python-requests/2.8.1), and some servers respond differently to it. Try to make the request look as if it is coming from a browser and not a script.
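If several requests go to the same site, the header can be set once on a session. This is a minimal sketch: make_browser_session is a hypothetical helper name, and prepare_request is used only to inspect the headers that would be sent, without touching the network:

```python
import requests

# Browser-like User-Agent string copied from the answer above.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36"
)


def make_browser_session(user_agent=BROWSER_UA):
    """Return a Session whose requests all carry the given User-Agent."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


session = make_browser_session()

# Build the request without sending it, to see the merged headers.
prepared = session.prepare_request(
    requests.Request("GET", "http://localbusiness.com/")
)
print(prepared.headers["User-Agent"])

# For comparison, the value requests would send by default.
print(requests.utils.default_user_agent())
```

Calling session.get(url) afterwards sends the same header on every request made through that session.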
@jason answered it correctly, so I am extending his answer with the reasons.
Why it happens
- Some DOM elements are changed by Ajax calls and JavaScript code, so those changes will not appear in the response to your call (although that is not the case here, as you are already comparing against the page source (Ctrl+U) and not the inspected elements)
- Some sites use the User-Agent to determine the kind of client (desktop or mobile, for example) and tailor the response accordingly (the probable case here)
Other alternatives
You can use the mechanize module for Python to mimic a browser and fool a web site (this comes in handy when the site uses some sort of authentication cookies)
Use Selenium to drive an actual browser
Hello everybody.
I have the following code, which fetches the HTML of a Polish site:
import requests
url = "https://www.olx.pl/oferty/uzytkownik/nzuCv/"
response = requests.get(url)
print(response.text)
In the browser this page has normal HTML (https://imgur.com/a/6W44Jgm) and the response encoding is utf-8, but what I get in Python/PyCharm does not start with a doctype at all. What is it, and how do I get normal HTML code?
Example of a few lines from the very beginning of the response:
\"parentId\":453,\"name\":\"Poradniki i albumy\",\"normalizedName\":\"poradniki-i-albumy\",\"position\":7,\"viewType\":\"list\",\"iconName\":\"\",\"level\":3,\"displayOrder\":7,\"children\":[],\"path\":\"muzyka-edukacja\\u002Fksiazki\\u002Fporadniki-i-albumy\",\"type\":\"goods\",\"isAdding\":true,\"isSearch\":false,\"isOfferSeek\":false,\"privateBusiness\":true,\"photosMax\":8},\"1161\":{\"id\":1161,\"label\":\"komiksy\",\"parentId\":453,\"name\":\"Komiksy\",\"normalizedName\":\"komiksy\",\"position\":4,\"viewType\":\"list\",\"iconName\":\"\",\"level\":3,\"displayOrder\":4,\"children\":[],\"path\":\"muzyka-edukacja\\u002Fksiazki\\u002Fkomiksy\",\"type\":\"goods\",\"isAdding\":true,\"isSearch\":false,\"isOfferSeek\":false,\"privateBusiness\":true,\"photosMax\":8},\"1163\":{\"id\":1163,\"label\":\"dla-dzieci\",\"parentId\":453,\"name\":\"Dla dzieci\",\"normalizedName\":\"dla-dzieci\",\"position\":3,\"viewType\":\"list\",\"iconName\":\"\",\"level\":3,\"displayOrder\":3,\"children\":[],\"path\":\"muzyka-edukacja\\u002Fksiazki\\u002Fdla-dzieci\",\"type\":\"goods\",\"isAdding\":true,\"isSearch\":false,\"isOfferSeek\":false,\"privateBusiness\":true,\"photosMax\":8},\"1165\":{\"id\":1165,\"label\":\"czasopisma\",\"parentId\":453,\"name\":\"Czasopisma\",\"normalizedName\":\"czasopisma\",\"position\":2,\"viewType\":\"list\",\"iconName\":\"\",\"level\":3,\"displayOrder\":2,\"children\":[],\"path\":\"muzyka-edukacja\\u002Fksiazki\\u002Fczasopisma\",
UPDATE:
It appears that the HTML is generated dynamically, so it is worth opening the browser's Network tab and finding the endpoints the front end uses. Based on one such endpoint I was able to request JSON directly.
Full code here: https://pastebin.com/QbtRBJgb
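The escaped fragment quoted in the question looks like JSON that was serialized a second time into a string inside the page. Assuming that is the case, a sketch of decoding such a value follows. The sample string is a shortened, hypothetical stand-in built from the "1161" entry of the fragment, and decode_embedded_state is a hypothetical helper name:

```python
import json

# Hypothetical stand-in for the value found in the page: a JSON document
# stored as a JSON string literal, hence the escaped quotes and \\u002F.
embedded = (
    r'"{\"1161\":{\"id\":1161,\"label\":\"komiksy\",\"parentId\":453,'
    r'\"name\":\"Komiksy\",'
    r'\"path\":\"muzyka-edukacja\\u002Fksiazki\\u002Fkomiksy\"}}"'
)


def decode_embedded_state(string_literal: str) -> dict:
    """Decode JSON that was itself serialized into a JSON string."""
    inner_json = json.loads(string_literal)  # first pass: undo the string escaping
    return json.loads(inner_json)            # second pass: parse the actual JSON


state = decode_embedded_state(embedded)
print(state["1161"]["name"])
print(state["1161"]["path"])
```

Locating the string inside the real page is left out, since its exact position and surrounding script are not shown in the question.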
Because the HTML is generated dynamically, it is not possible to get the rendered page with requests.
It is possible with Selenium, but a browser window pops up, which is a bit annoying. Code here: https://pastebin.com/U4t8VcVf
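The window can usually be avoided by running the browser headless. The following is a minimal sketch and not the code from the pastebin: it assumes Selenium 4 with Chrome and a matching driver available, and fetch_rendered_html is a hypothetical helper name:

```python
def fetch_rendered_html(url):
    """Load a page in headless Chrome and return the HTML after scripts have run."""
    # Imported inside the function so this sketch can be loaded
    # on a machine where Selenium is not installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    # Run without opening a window; older Chrome versions may need
    # "--headless" instead of "--headless=new".
    options.add_argument("--headless=new")

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()
```

Usage would be html = fetch_rendered_html("https://www.olx.pl/oferty/uzytkownik/nzuCv/"). Content that loads late may need an explicit wait before reading page_source.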