python requests response html

stackoverflow.com › questions › 27803503 › get-html-using-python-requests

The server in question is giving you a gzipped response. The server is also very broken; it sends the following headers:

$ curl -D - -o /dev/null -s -H 'Accept-Encoding: gzip, deflate' http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F
HTTP/1.1 200 OK
Date: Tue, 06 Jan 2015 17:46:49 GMT
Server: Apache
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd"><html xmlns="http: //www.w3.org/1999/xhtml" lang="en-US">
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 3659
Content-Type: text/html

The <!DOCTYPE..> line there is not a valid HTTP header. As such, the remaining headers past Server are ignored. Why the server interjects that is unclear; in all likelihood WRCCWrappers.py is a CGI script that doesn't output headers but does include a double newline after the doctype line, duping the Apache server into inserting additional headers there.
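
To illustrate why everything after Server is lost on Python 3, here is a minimal sketch that feeds a header block shaped like the one above to the standard library's http.client.parse_headers. The byte string RAW_HEADERS is an abbreviated reconstruction for illustration (the doctype line is shortened), not a capture from the server.

```python
import io
import http.client

# Abbreviated reconstruction of the broken header block
# (assumption: doctype line shortened, not a real capture).
RAW_HEADERS = (
    b"Date: Tue, 06 Jan 2015 17:46:49 GMT\r\n"
    b"Server: Apache\r\n"
    b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">\r\n'
    b"Vary: Accept-Encoding\r\n"
    b"Content-Encoding: gzip\r\n"
    b"Content-Length: 3659\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
)

# parse_headers reads up to the blank line and hands the block to the
# email parser, which stops at the first line that is not a header.
message = http.client.parse_headers(io.BytesIO(RAW_HEADERS))

parsed_names = message.keys()
content_encoding = message.get("Content-Encoding")

print(parsed_names)
print(content_encoding)
```

On a current Python 3 this should show that only the headers before the doctype line are recognized, so Content-Encoding is never seen by the client.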

As such, requests also doesn't detect that the data is gzip-encoded. The compressed data is in the response body, and in principle you only have to decode it yourself. In practice the body is rather incomplete, so that doesn't fully work.
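
Decoding the body by hand could look roughly like the sketch below. It uses only the standard library; the helper name gunzip_if_needed and the sample page are made up for illustration. A streaming decompressor is used so that a truncated stream yields whatever could be decoded instead of raising an error.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def gunzip_if_needed(body: bytes) -> bytes:
    """Return body decompressed if it looks like gzip, else unchanged."""
    if not body.startswith(GZIP_MAGIC):
        return body
    # 16 + MAX_WBITS tells zlib to expect gzip framing.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    # decompress() returns what it can decode, even for a truncated stream.
    return decompressor.decompress(body)


# Made-up sample data standing in for the server's response body.
page = b"<html><body>" + b"<p>precipitation data</p>" * 40 + b"</body></html>"
compressed = gzip.compress(page)

full = gunzip_if_needed(compressed)
partial = gunzip_if_needed(compressed[: len(compressed) // 2])
plain = gunzip_if_needed(page)
```

With requests, the bytes to pass in would be r.content, which holds the still-compressed body when the Content-Encoding header was not recognized.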

The work-around is to tell the server not to bother with compression:

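import requests
url = 'http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F'
headers = {'Accept-Encoding': 'identity'}  # ask for an uncompressed body
r = requests.get(url, headers=headers)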

and an uncompressed response is returned.

Incidentally, on Python 2 the HTTP header parser is not so strict and manages to declare the doctype a header:

>>> pprint(dict(r.headers))
{'<!doctype html public "-//w3c//dtd xhtml 1.0 transitional//en" "dtd/xhtml1-transitional.dtd"><html xmlns="http': '//www.w3.org/1999/xhtml" lang="en-US">',
 'connection': 'Keep-Alive',
 'content-encoding': 'gzip',
 'content-length': '3659',
 'content-type': 'text/html',
 'date': 'Tue, 06 Jan 2015 17:42:06 GMT',
 'keep-alive': 'timeout=5, max=100',
 'server': 'Apache',
 'vary': 'Accept-Encoding'}

and the content-encoding information survives, so there requests decodes the content for you, as expected.
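
The lenient behaviour can be imitated with a deliberately naive parser that splits every line on its first colon. This is only a sketch of the idea, not the code Python 2 actually used; the function name parse_headers_leniently is made up.

```python
def parse_headers_leniently(lines):
    """Split each line on its first colon; lowercase the name, strip the value."""
    headers = {}
    for line in lines:
        name, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon at all
            headers[name.strip().lower()] = value.strip()
    return headers


HEADER_LINES = [
    "Date: Tue, 06 Jan 2015 17:42:06 GMT",
    "Server: Apache",
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"DTD/xhtml1-transitional.dtd"><html xmlns="http: //www.w3.org/1999/xhtml" lang="en-US">',
    "Vary: Accept-Encoding",
    "Content-Encoding: gzip",
    "Content-Length: 3659",
    "Content-Type: text/html",
]

headers = parse_headers_leniently(HEADER_LINES)
doctype_keys = [name for name in headers if name.startswith("<!doctype")]
```

Because the doctype line happens to contain a colon (in xmlns="http: ...), it is swallowed as one odd header and parsing carries on to Content-Encoding.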

Answer from Martijn Pieters on Stack Overflow

Stack Overflow

stackoverflow.com › questions › 27803503 › get-html-using-python-requests

Get html using Python requests? - Stack Overflow

2 of 4

The HTTP headers for this URL have now been fixed.

>>> import requests
>>> print requests.__version__
2.5.1
>>> r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
>>> r.text[:100]
u'\n<!DOCTYPE html>\n<HTML>\n<HEAD><TITLE>Monthly Average of Precipitation, Station id: 028815</TITLE></H'
>>> r.headers
{'content-length': '3672', 'content-encoding': 'gzip', 'vary': 'Accept-Encoding', 'keep-alive': 'timeout=5, max=100', 'server': 'Apache', 'connection': 'Keep-Alive', 'date': 'Thu, 12 Feb 2015 18:59:37 GMT', 'content-type': 'text/html; charset=utf-8'}

Requests

requests.readthedocs.io › projects › requests-html › en › latest

requests-HTML v0.3.4 documentation

Change response encoding and replace it by a HTMLResponse. ... Pass in all the coroutines you want to run, it will wrap each one in a task, run it and wait for the result. Return a list with all results, this is returned in the same order coros are passed in. ... Send a given PreparedRequest. ... Requests-HTML intends to make parsing HTML (e.g.

Discussions

Python getting HTML content via 'requests' returns partial response - Stack Overflow

Some sites use the User-Agent to determine the kind of user (desktop or mobile) and serve the response accordingly (the probable case here) ... You can use the mechanize module of Python to mimic a browser to fool a web site (comes in handy when the site is using some sort of authentication ... More on stackoverflow.com

stackoverflow.com

Strange HTML code after parsing via requests. What is it and how to deal?

Erm... I get this. I think you're looking at the window.__INIT_CONFIG__ variable at the bottom. More on reddit.com

r/learnpython

May 5, 2023

Scraping Using API, website still returns html output instead of JSON data

[SOLVED] Thanks everyone here for helping. 🙏 u/Brian and I have provided the solution as below. Also, big thanks to… More on reddit.com

r/learnpython

October 22, 2022

Couldn't get the whole html using requests.get(url)

You can look at the Network Tab in your developer tools to see the HTTP requests being made when you search. https://i.imgur.com/8ma0Z9Y.jpg

The URL it fetches the data from is massive - I cut out some of the unneeded params

https://redsky.target.com/v2/plp/search/?count=96&default_purchasability_filter=true&keyword=horizon+organic+whole+milk&offset=0&pricing_store_id=1771&scheduled_delivery_store_id=1771&store_ids=1771%2C1768%2C1113%2C3374%2C1792&include_sponsored_search_v2=true&excludes=available_to_promise_qualitative%2Cavailable_to_promise_location_qualitative&key=ff457966e64d5e877fdbad070f276d18ecec4a01

You can open this URL directly and see the JSON response: https://i.imgur.com/zIDpJVz.png

keyword is your search term. count=96 is the amount of results to get (96 is the max per request) - you can use the offset= to get the next "batch" / "page"

The key seems to be hardcoded - and it is contained in the original html.

{"apiKey":{"name":"x-api-key","value":"ff457966e64d5e877fdbad070f276d18ecec4a01"}

Not sure if the other params are important - the stores ones seem to be and seem to be hardcoded too. They may change depending on if you mess around with the search settings. More on reddit.com
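
The count/offset paging described in that answer can be sketched as follows. BASE_URL is a hypothetical placeholder rather than the real endpoint, and only the keyword, count and offset parameters from the post are used; any other parameters or API key the site requires are left out.

```python
from urllib.parse import urlencode

BASE_URL = "https://example.com/v2/plp/search/"  # hypothetical placeholder
MAX_PER_REQUEST = 96  # the post reports 96 as the maximum per request


def page_urls(keyword, total_wanted, count=MAX_PER_REQUEST):
    """Yield one URL per batch, advancing offset by count each time."""
    for offset in range(0, total_wanted, count):
        params = {"keyword": keyword, "count": count, "offset": offset}
        yield f"{BASE_URL}?{urlencode(params)}"


urls = list(page_urls("horizon organic whole milk", total_wanted=200))
for url in urls:
    print(url)
```

Each generated URL could then be fetched with requests.get(url) and parsed with response.json(), assuming the endpoint returns JSON as described.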

r/learnpython

August 20, 2020

W3Schools

w3schools.com › python › ref_requests_response.asp

Python requests.Response Object

The requests.Response() Object contains the server's response to the HTTP request. ...

JC Chouinard

jcchouinard.com › accueil › web scraping with python and requests-html (with example)

Web Scraping With Python and Requests-HTML (with Example) - JC Chouinard

June 21, 2023 - RuntimeError: Cannot use HTMLSession within an existing event loop. Here, I will make an example with Hamlet Batista’s amazing intro to Python post. Just to make sure that there is no error, I will add a try and except statement to report the error in case the code doesn’t work. We will store the response in a variable called response.

import requests
from requests_html import HTMLSession

url = "https://www.searchenginejournal.com/introduction-to-python-seo-spreadsheets/342779/"

try:
    session = HTMLSession()
    response = session.get(url)
except requests.exceptions.RequestException as e:
    print(e)

Fernandomc

fernandomc.com › posts › using-requests-to-get-and-post

Python Requests and Beautiful Soup - Playing with HTTP Requests, HTML Parsing and APIs – Fernando Medina Corey

May 26, 2018 - A guide to getting started with the Python libraries requests and Beautiful Soup.

ZetCode

zetcode.com › python › requests

Python Requests - accessing web resources via HTTP

July 20, 2019 - For more complex HTML documents, consider using a library like Beautiful Soup instead of regular expressions for more robust parsing. The Response object contains a server's response to an HTTP request. Its status_code attribute returns HTTP status code of the response, such as 200 or 404.

Medium

medium.com › @tubelwj › requests-html-an-html-parsing-library-in-python-8d182d13ecd2

Requests-HTML: An HTML parsing library in Python | by Gen. Devin DL. | Medium

September 17, 2024 - Requests-HTML: An HTML parsing library in Python When performing web scraping and web-page parsing, Python’s `requests` and `BeautifulSoup` libraries are commonly used tools. The `requests_html` …

Python-requests

html.python-requests.org › _modules › requests_html.html

requests_html — requests-HTML v0.3.4 documentation

Try increasing timeout")
html = HTML(url=self.url, html=content.encode(DEFAULT_ENCODING), default_encoding=DEFAULT_ENCODING)
self.__dict__.update(html.__dict__)
self.page = page
return result

class HTMLResponse(requests.Response):
    """An HTML-enabled :class:`requests.Response <requests.Response>` object.

GitHub

github.com › psf › requests-html

GitHub - psf/requests-html: Pythonic HTML Parsing for Humans™

>>> from requests_html import AsyncHTMLSession
>>> asession = AsyncHTMLSession()
>>> async def get_pythonorg():
...     r = await asession.get('https://python.org/')
...     return r
...
>>> async def get_reddit():
...     r = await asession.get('https://reddit.com/')
...     return r
...
>>> async def get_google():
...     r = await asession.get('https://google.com/')
...     return r
...
>>> results = asession.run(get_pythonorg, get_reddit, get_google)
>>> results # check the requests all returned a 200 (success) code
[<Response [200]>, <Response [200]>, <Response [200]>]
>>> # Each item in the results list is a response object and can be interacted with as such
>>> for result in results:
...     print(result.html.url)
...
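
The pattern of passing several coroutine functions to one runner and collecting their results can be sketched with plain asyncio. This is a rough standard-library analogue, not requests-HTML's implementation; fake_fetch and run_all are made-up names, and asyncio.sleep stands in for the network call.

```python
import asyncio


async def fake_fetch(name, delay):
    """Stand-in for an HTTP request: wait, then return a label."""
    await asyncio.sleep(delay)
    return f"{name}: done"


async def run_all(*coroutine_functions):
    """Call each coroutine function, run them concurrently, keep input order."""
    tasks = [asyncio.ensure_future(func()) for func in coroutine_functions]
    return await asyncio.gather(*tasks)


async def get_slow():
    return await fake_fetch("slow", 0.02)


async def get_fast():
    return await fake_fetch("fast", 0.01)


results = asyncio.run(run_all(get_slow, get_fast))
print(results)
```

asyncio.gather returns results in the order the awaitables were passed in, regardless of which one finishes first.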

Starred by 13.8K users

Forked by 1K users

Languages Python 99.7% | Makefile 0.3%

Kennethreitz

requests-html.kennethreitz.org

Requests-HTML: HTML Parsing for Humans (writing Python 3)! — requests-HTML v0.3.4 documentation

Returns a generator of Responses or Requests. ... Send a given PreparedRequest. ... Requests-HTML intends to make parsing HTML (e.g.

Python-requests

docs.python-requests.org › projects › requests-html › en › stable

requests-HTML v0.3.4 documentation

Receives a Response. Returns a generator of Responses or Requests. ... Send a given PreparedRequest. ... Requests-HTML intends to make parsing HTML (e.g.

Stack Overflow

stackoverflow.com › questions › 33755849 › python-getting-html-content-via-requests-returns-partial-response

Python getting HTML content via 'requests' returns partial response - Stack Overflow

Top answer

1 of 2

Try setting a User-Agent:

import requests

url = "http://localbusiness.com/"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36',
    'Content-Type': 'text/html',
}

response = requests.get(url, headers=headers)
html = response.text

The default User-Agent set by requests identifies the library, for example 'python-requests/2.8.1' (the version number depends on what is installed). Setting a browser-like User-Agent makes the request look as if it comes from a browser and not a script.
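
To inspect what a client would send without touching the network, the sketch below builds a request object and reads the header back. It swaps in the standard library's urllib.request for requests so that it runs without third-party packages; the variable names are illustrative.

```python
import urllib.request

BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36"
)

# Build the request without sending it, so the header can be inspected offline.
request = urllib.request.Request(
    "http://localbusiness.com/", headers={"User-Agent": BROWSER_UA}
)

# urllib stores header names capitalized, i.e. as "User-agent".
sent_user_agent = request.get_header("User-agent")

# What urllib announces when no User-Agent is set explicitly.
default_headers = dict(urllib.request.build_opener().addheaders)

print(sent_user_agent)
print(default_headers["User-agent"])
```

Like requests, urllib announces itself by library name by default, which is what some servers use to tell scripts from browsers.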

2 of 2

@jason answered it correctly, so I am extending his answer to explain the reason.

Why it happens

Some DOM elements are changed by Ajax calls and JavaScript code, so those changes will not show up in the response to your call (although that is not the case here, since you are already comparing against view source (Ctrl+U) rather than the element inspector).
Some sites use the User-Agent to determine the kind of user (desktop or mobile) and serve the response accordingly (the probable case here).

Other alternatives

You can use Python's mechanize module to mimic a browser and fool a website (handy when the site uses some sort of authentication cookies).
Use Selenium to drive an actual browser.

Delft Stack

delftstack.com › home › howto › python › response 200 python

How to Get HTML With HTTP Requests in Python | Delft Stack

March 11, 2025 - The HTML content can be accessed using response.text, which contains the raw HTML as a string. Finally, we print the HTML content to the console. This method is efficient and easy to use, making it a go-to choice for many developers. Another built-in option for making HTTP requests in Python is the urllib library.

Mimo

mimo.org › glossary › python › requests-library

Python requests Library: How to Make HTTP Requests with Python

Make a request: Use requests.get() for GET requests or requests.post() for POST requests. Check the response: The function returns a Response object. Check response.status_code to see if it was successful (200 means OK). Access the content: Use response.text for HTML/text or response.json() ...

Medium

medium.com › @datajournal › web-scraping-with-python-and-requests-html-015e202970a0

Web Scraping With Python & Requests-HTML in 2025 | Medium

February 23, 2025 - To solve this, requests-HTML offers a method called render(), which allows you to execute JavaScript in the background and fetch the rendered content. If you’re using Jupyter notebooks, you can use arender() for asynchronous rendering. Here’s an example of how to render JavaScript content:

# Render JavaScript content
response.html.render()

# Now you can extract the data
content = response.html.find('h1', first=True)
print(content.text)

GeeksforGeeks

geeksforgeeks.org › python › response-text-python-requests

response.text - Python requests - GeeksforGeeks

April 15, 2025 - In Python’s requests library, the response.text attribute allows developers to access the content of the response returned by an HTTP request. This content is always returned as a Unicode string, making it easy to read and manipulate.

reddit.com › r/learnpython › strange html code after parsing via requests. what is it and how to deal?

r/learnpython on Reddit: Strange HTML code after parsing via requests. What is it and how to deal?

May 5, 2023 -

Hello everybody.

I have the following code, which fetches the HTML of a Polish site:

import requests

url = "https://www.olx.pl/oferty/uzytkownik/nzuCv/"
response = requests.get(url)
print(response.text)

While the page shows normal HTML in the browser (https://imgur.com/a/6W44Jgm), the response, whose encoding is utf-8, does not look like an HTML document with a doctype at all in Python/PyCharm. What is it, and how do I get normal HTML?

Example of a few lines from the very beginning of the response:

\"parentId\":453,\"name\":\"Poradniki i albumy\",\"normalizedName\":\"poradniki-i-albumy\",\"position\":7,\"viewType\":\"list\",\"iconName\":\"\",\"level\":3,\"displayOrder\":7,\"children\":[],\"path\":\"muzyka-edukacja\\u002Fksiazki\\u002Fporadniki-i-albumy\",\"type\":\"goods\",\"isAdding\":true,\"isSearch\":false,\"isOfferSeek\":false,\"privateBusiness\":true,\"photosMax\":8},\"1161\":{\"id\":1161,\"label\":\"komiksy\",\"parentId\":453,\"name\":\"Komiksy\",\"normalizedName\":\"komiksy\",\"position\":4,\"viewType\":\"list\",\"iconName\":\"\",\"level\":3,\"displayOrder\":4,\"children\":[],\"path\":\"muzyka-edukacja\\u002Fksiazki\\u002Fkomiksy\",\"type\":\"goods\",\"isAdding\":true,\"isSearch\":false,\"isOfferSeek\":false,\"privateBusiness\":true,\"photosMax\":8},\"1163\":{\"id\":1163,\"label\":\"dla-dzieci\",\"parentId\":453,\"name\":\"Dla dzieci\",\"normalizedName\":\"dla-dzieci\",\"position\":3,\"viewType\":\"list\",\"iconName\":\"\",\"level\":3,\"displayOrder\":3,\"children\":[],\"path\":\"muzyka-edukacja\\u002Fksiazki\\u002Fdla-dzieci\",\"type\":\"goods\",\"isAdding\":true,\"isSearch\":false,\"isOfferSeek\":false,\"privateBusiness\":true,\"photosMax\":8},\"1165\":{\"id\":1165,\"label\":\"czasopisma\",\"parentId\":453,\"name\":\"Czasopisma\",\"normalizedName\":\"czasopisma\",\"position\":2,\"viewType\":\"list\",\"iconName\":\"\",\"level\":3,\"displayOrder\":2,\"children\":[],\"path\":\"muzyka-edukacja\\u002Fksiazki\\u002Fczasopisma\",

UPDATE:

It appears that the HTML is dynamic, so it is worth using the Network tab to find the endpoints the front end is using. Based on that endpoint I was able to request JSON.

Full code here: https://pastebin.com/QbtRBJgb

Because the HTML is generated dynamically, the rendered page cannot be obtained with requests alone.

It is possible with Selenium, but a browser window pops up, which is a bit annoying. Code here: https://pastebin.com/U4t8VcVf
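
When the data is embedded in the page as an escaped JSON blob, as in the excerpt above, it can sometimes be pulled straight out of the HTML. The sketch below assumes the config is stored as a JSON document inside a JavaScript string literal assigned to window.__INIT_CONFIG__; the sample page, the pattern and the helper name are illustrative guesses, and the real page layout may differ.

```python
import json
import re

# Hypothetical page fragment (assumption): a JSON document stored inside a
# JavaScript string literal, hence the backslash-escaped quotes.
SAMPLE_HTML = r'''<html><body><script>
window.__INIT_CONFIG__ = "{\"categories\":{\"1161\":{\"id\":1161,\"name\":\"Komiksy\",\"parentId\":453}}}";
</script></body></html>'''

# Capture the quoted literal, allowing escaped characters inside it.
CONFIG_PATTERN = re.compile(r'window\.__INIT_CONFIG__\s*=\s*("(?:\\.|[^"\\])*")\s*;')


def extract_init_config(html):
    """Return the embedded config as a dict, or None if the variable is absent."""
    match = CONFIG_PATTERN.search(html)
    if match is None:
        return None
    json_text = json.loads(match.group(1))  # first pass: unescape the string literal
    return json.loads(json_text)            # second pass: parse the JSON document


config = extract_init_config(SAMPLE_HTML)
category_name = config["categories"]["1161"]["name"]
print(category_name)
```

JavaScript string literals are not always valid JSON strings, so this two-pass decoding is a convenience that can fail on other pages; requesting the JSON endpoint directly, as the update describes, is the more robust route.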