The server in question is giving you a gzipped response. The server is also very broken; it sends the following headers:

$ curl -D - -o /dev/null -s -H 'Accept-Encoding: gzip, deflate' http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F
HTTP/1.1 200 OK
Date: Tue, 06 Jan 2015 17:46:49 GMT
Server: Apache
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd"><html xmlns="http: //www.w3.org/1999/xhtml" lang="en-US">
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 3659
Content-Type: text/html

The <!DOCTYPE..> line there is not a valid HTTP header. As such, the remaining headers past Server are ignored. Why the server interjects that line is unclear; in all likelihood WRCCWrappers.py is a CGI script that doesn't output headers but does include a double newline after the doctype line, duping the Apache server into inserting additional headers there.
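To see why everything after Server gets dropped, here is a minimal sketch that feeds the header block from the curl output to Python 3's standard email parser, which is the parser http.client relies on for response headers. The variable names are mine, and this only illustrates the parsing rule; it is not a reproduction of what requests does internally.

```python
from email import message_from_string

# Header block copied from the curl output above (status line removed).
raw_headers = (
    "Date: Tue, 06 Jan 2015 17:46:49 GMT\r\n"
    "Server: Apache\r\n"
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"DTD/xhtml1-transitional.dtd">'
    '<html xmlns="http: //www.w3.org/1999/xhtml" lang="en-US">\r\n'
    "Vary: Accept-Encoding\r\n"
    "Content-Encoding: gzip\r\n"
    "Content-Length: 3659\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
)

msg = message_from_string(raw_headers)

# Only the headers before the malformed line are recognised.
parsed = dict(msg.items())
print(parsed)

# The parser stops at the doctype line and treats the rest as body text,
# so Content-Encoding is never seen as a header.
print(msg.get("Content-Encoding"))
print(msg.defects)
```

The doctype line has spaces before its first colon, so it cannot be a header name, and the parser gives up on headers at that point.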

As such, requests also doesn't detect that the data is gzip-encoded. In principle the data is all there and you would just have to decode it yourself, but the body you receive this way is incomplete, so that doesn't work either.
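For reference, decoding a gzip body by hand looks roughly like the sketch below. It uses zlib's streaming decompressor, which returns whatever it can decode from a truncated stream instead of raising. The helper name gunzip_partial and the sample body are hypothetical; whether anything useful can be recovered from this particular server's response is an assumption I have not verified.

```python
import gzip
import zlib


def gunzip_partial(data: bytes) -> bytes:
    """Decompress gzip data, returning whatever can be decoded so far."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    return decompressor.decompress(data)


# Stand-in for a response body; with requests this would be r.content.
body = b"".join(b"<p>row %d</p>" % i for i in range(200))
compressed = gzip.compress(body)

# A complete stream decodes fully.
full = gunzip_partial(compressed)

# A truncated stream yields only a prefix of the original data.
truncated = compressed[: len(compressed) // 2]
partial = gunzip_partial(truncated)
print(len(body), len(full), len(partial))
```

With an incomplete body you only ever get a prefix of the page back, which is why asking for an uncompressed response is the more practical fix.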

The work-around is to tell the server not to bother with compression:

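import requests
url = 'http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F'
headers = {'Accept-Encoding': 'identity'}
r = requests.get(url, headers=headers)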

and an uncompressed response is returned.

Incidentally, on Python 2 the HTTP header parser is not so strict and manages to declare the doctype a header:

>>> pprint(dict(r.headers))
{'<!doctype html public "-//w3c//dtd xhtml 1.0 transitional//en" "dtd/xhtml1-transitional.dtd"><html xmlns="http': '//www.w3.org/1999/xhtml" lang="en-US">',
 'connection': 'Keep-Alive',
 'content-encoding': 'gzip',
 'content-length': '3659',
 'content-type': 'text/html',
 'date': 'Tue, 06 Jan 2015 17:42:06 GMT',
 'keep-alive': 'timeout=5, max=100',
 'server': 'Apache',
 'vary': 'Accept-Encoding'}

and the content-encoding information survives, so there requests decodes the content for you, as expected.
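The Python 2 result above is what you get from a parser that simply splits every line at its first colon. The following is a small emulation of that lenient rule, not the actual Python 2 library code; lenient_parse is a hypothetical helper, written in Python 3 for convenience.

```python
def lenient_parse(lines):
    """Split each line at the first colon, lower-casing the name."""
    headers = {}
    for line in lines:
        name, sep, value = line.partition(":")
        if sep:  # keep any line that contains a colon at all
            headers[name.lower()] = value.strip()
    return headers


# Header lines copied from the curl output at the top of this answer.
lines = [
    "Date: Tue, 06 Jan 2015 17:46:49 GMT",
    "Server: Apache",
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"DTD/xhtml1-transitional.dtd">'
    '<html xmlns="http: //www.w3.org/1999/xhtml" lang="en-US">',
    "Vary: Accept-Encoding",
    "Content-Encoding: gzip",
    "Content-Length: 3659",
    "Content-Type: text/html",
]

headers = lenient_parse(lines)
for name, value in headers.items():
    print(repr(name), "->", repr(value))
```

The doctype line ends up as a bogus header whose name runs up to the colon in xmlns="http:, and the headers after it, including content-encoding, are still picked up.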

Answer from Martijn Pieters on Stack Overflow
🌐
PyPI
pypi.org › project › requests-html
requests-html · PyPI
>>> from requests_html import AsyncHTMLSession
>>> asession = AsyncHTMLSession()
>>> async def get_pythonorg():
...     r = await asession.get('https://python.org/')
>>> async def get_reddit():
...     r = await asession.get('https://reddit.com/')
>>> ...
      » pip install requests-html
    
Published   Feb 17, 2019
Version   0.10.0
🌐
GitHub
github.com › psf › requests-html
GitHub - psf/requests-html: Pythonic HTML Parsing for Humans™
>>> from requests_html import HTML
>>> doc = """<a href='https://httpbin.org'>"""
>>> html = HTML(html=doc)
>>> html.links
{'https://httpbin.org'}
...
Only Python 3.6 and above is supported.
Starred by 13.8K users
Forked by 1K users
Languages   Python 99.7% | Makefile 0.3%
Discussions

Get html using Python requests? - Stack Overflow
I am trying to teach myself some basic web scraping. Using Python's requests module, I was able to grab html for various websites until I tried this: More on stackoverflow.com
🌐 stackoverflow.com
Requests vs Requests-html vs beautifulsoup4
Well, requests for starters. It's simple and easy to get hold of. The problem is that it doesn't fetch JavaScript-generated content, which is common on e-commerce websites. In theory, that is what requests-html is for. However, I had some trouble with it (it was unable to complete the SSL handshake and I couldn't resolve it). Scrapy is a bit more advanced but very powerful. Selenium isn't that efficient, as it loads the webpage through a browser, but it does the job. TL;DR: if you have a small and simple scraping job, use requests. If the task is simple and the data is huge, I would still suggest requests. If you are going through small data on e-commerce or JS-heavy websites, Selenium does the job. If the data is large, Scrapy is pretty efficient.
🌐 r/learnpython
5
4
May 15, 2020
🌐
Requests
requests.readthedocs.io › projects › requests-html › en › latest
requests-HTML v0.3.4 documentation
This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible. ... Full JavaScript support! CSS Selectors (a.k.a jQuery-style, thanks to PyQuery). XPath Selectors, for the faint of heart. Mocked user-agent (like a real web browser). Automatic following of redirects. ... The Requests experience you know and love, with magical parsing abilities. ... Only Python 3.6 is supported.
🌐
Medium
medium.com › @tubelwj › requests-html-an-html-parsing-library-in-python-8d182d13ecd2
Requests-HTML: An HTML parsing library in Python | by Gen. Devin DL. | Medium
September 17, 2024 - Requests-HTML is a Python library that extends the functionality of the Requests library by adding parsing and manipulation capabilities for HTML content.
🌐
JC Chouinard
jcchouinard.com › accueil › web scraping with python and requests-html (with example)
Web Scraping With Python and Requests-HTML (with Example) - JC Chouinard
June 21, 2023 - The Python Requests-HTML library is a web scraping module that offers HTTP requests as well as JavaScript rendering.

2 of 4
14

The HTTP headers for this URL have now been fixed.

>>> import requests
>>> print requests.__version__
2.5.1
>>> r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
>>> r.text[:100]
u'\n<!DOCTYPE html>\n<HTML>\n<HEAD><TITLE>Monthly Average of Precipitation, Station id: 028815</TITLE></H'
>>> r.headers
{'content-length': '3672', 'content-encoding': 'gzip', 'vary': 'Accept-Encoding', 'keep-alive': 'timeout=5, max=100', 'server': 'Apache', 'connection': 'Keep-Alive', 'date': 'Thu, 12 Feb 2015 18:59:37 GMT', 'content-type': 'text/html; charset=utf-8'}
🌐
Webscraping
webscraping.fyi › lib › python › requests-html
Python requests-html Library in Web Scraping - Web Scraping FYI
February 16, 2023 - requests-html is a Python package that allows you to easily make HTTP requests and parse the HTML content of web pages. It is built on top of the popular requests package and uses the html parser from the lxml library, which makes it fast and ...
Published   Feb 16, 2023
Author   webscraping.fyi
🌐
Plain English
python.plainenglish.io › scrape-the-web-like-a-pro-with-python-and-requests-html-29d02ddddf39
Scrape the Web Like a PRO with Python and requests-HTML | Python in Plain English
December 8, 2022 - In our example I’m looking for the python version which is shown on the python blog page: ... So we just need to search for that string and replace the part we are looking for with a variable name between brackets. By doing this, requests-HTML will populate our variable with the value from the string.
🌐
Kennethreitz
requests-html.kennethreitz.org
Requests-HTML: HTML Parsing for Humans (writing Python 3)! — requests-HTML v0.3.4 documentation
This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible. ... Full JavaScript support! CSS Selectors (a.k.a jQuery-style, thanks to PyQuery). XPath Selectors, for the faint at heart. Mocked user-agent (like a real web browser). Automatic following of redirects. ... The Requests experience you know and love, with magical parsing abilities. ... Only Python 3.6 is supported.
🌐
Medium
medium.com › @datajournal › web-scraping-with-python-and-requests-html-015e202970a0
Web Scraping With Python & Requests-HTML in 2025 | Medium
February 23, 2025 - Master web scraping with Python's requests-HTML: send HTTP requests, render JavaScript, parse HTML, and store data effortlessly.
🌐
Stack Overflow
stackoverflow.com › questions › tagged › python-requests-html
Newest 'python-requests-html' Questions - Stack Overflow
While attempting to perform an asynchronous task using requests-html, I encountered an error message stating ... ... # Import packages import requests from bs4 import BeautifulSoup # Specify url: url url = 'https://www.nts.live/shows' # Package the request, send the request and catch the response: r r = requests.... ... So I have the following script: #!/usr/bin/env python3 import requests from bs4 import BeautifulSoup def parse_marketwatch_calendar(url): #page=requests.get(url).text #soup=BeautifulSoup(page,...
🌐
Fernandomc
fernandomc.com › posts › using-requests-to-get-and-post
Python Requests and Beautiful Soup - Playing with HTTP Requests, HTML Parsing and APIs – Fernando Medina Corey
May 26, 2018 - Naturally, I gravitated towards teaching the basics of one of the most popular Python packages - Requests. I’ve also found it’s useful to throw in using Beautiful Soup to show folks how they can efficiently interact with HTML data after getting an HTML page.
🌐
Medium
medium.com › analytics-vidhya › the-modern-way-of-web-scraping-requests-html-2567ba2554f4
Requests-HTML: The modern way of web scraping. | by David Kowalk | Analytics Vidhya | Medium
December 2, 2020 - As a freelancer, people often come to me for the same reasons: Python’s difficult, the code is about as understandable as a bowl of spaghetti and generally inaccessible for beginners. “Why do I need 3 different libraries to download some data off a website?” This is very unfortunate since web scraping is something especially data scientists can use almost every day. In this tutorial, I will show you the basics of web scraping with requests-html, the modern way of scraping data off of websites.
🌐
Finxter
blog.finxter.com › home › learn python blog › how to get an html page from a url in python?
How to Get an HTML Page from a URL in Python? - Be on the Right Side of Change
October 31, 2022 - python -c "import requests; print(requests.get(url = 'https://google.com').text)" The output, again, is the desired Google HTML page:
🌐
AskPython
askpython.com › home › how to read html from a url in python 3?
How to read HTML from a URL in Python 3? - AskPython
April 28, 2023 - Requests is an HTTP library for the Python programming language. The objective of the package intends to simplify and improve the overall accessibility of HTTP requests. To read HTML for the provided URL, we first prepare a request using the ...
🌐
Medium
boadziedaniel.medium.com › scraping-bestsellers-with-the-requests-html-package-c1a671332e6
Scraping Bestsellers with the `requests-html` package | by Daniel Boadzie | Medium
February 19, 2023 - To get started in using `requests-html` let’s learn a little bit about the package. `requests-html` is a Python package for making the parsing of HTML easy and intuitive. It was created by Kenneth Reitz, the same guy who created the `requests` ...
🌐
Reddit
reddit.com › r/learnpython › been using requests-html out of fear for using selenium, advice?
r/learnpython on Reddit: Been using requests-html out of fear for using Selenium, advice?
February 8, 2021 -

So basically all web scraping guides describe Selenium as super slow, a last option that you should avoid at all costs, so when I had to scrape a number of websites with dynamic content, I avoided it. I found an alternative in requests-html, primarily for its r.html.render() function and for the simplicity (no need to set headers etc.)

The more blogs and guides on web scraping I read, the more I become aware that nobody mentions requests-html; they always recommend Selenium despite it being the slowest of them all.

Is this a bad way to deal with dynamic content? Does Selenium have a similar function?

Top answer
1 of 3
5
requests-html (at least the .render() function) is doing a similar thing to selenium, with a lot of the same performance impact: it's invoking a whole browser to render the page UI. It does have the advantage that you can get away without .render() in cases where you don't need it, so it is preferable in that respect, but the main thing those websites are advocating is to simply use the plain requests.get() approach, without having to get a web browser involved in the process at all. I think selenium often becomes a bit of a newbie trap, because it works in a way that's pretty familiar to everyone (we've all used a browser), and so viewing a site as being about clicking on links and manipulating UI elements, then looking at the result, is an easy way to understand scraping. And as sites grow more dynamic and complex, sometimes it's the simplest (but usually not the best) way to get some information, especially if you're unfamiliar with what goes on behind the scenes. But if we drop a level of abstraction and view websites not as UI elements, but as simply fetching and sending data to and from the server, scraping becomes orders of magnitude more efficient, and sometimes even easier. Even on dynamic webpages that build their content with javascript, that code needs to get its data from somewhere, and often a quick look at the network tab when loading the page will show you where, often in a format much easier and more consistent to parse than messing with complex html. But a lot of the time people reach for selenium as a first option, even if they could accomplish the same goal with a simple get request, simply because it's their default.
In reality, I think selenium should be the option of last resort - used only when regular get requests are too difficult or tedious to figure out (generally when a site is actively using anti-spidering countermeasures), and requests-html's .render() is only really one step behind that (but is much better in that it presents this as an optional tool you can selectively apply, rather than the interface through which you do everything).
2 of 3
2
requests-html already uses Chromium under the hood, just like Selenium does.
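The "look at the network tab" advice from the top answer can be sketched as follows. Everything specific here is hypothetical: the payload shape, the field names, and the helper names are made up for illustration, and the sketch uses the standard library's urllib so it has no third-party dependencies (with requests, requests.get(url).json() would play the same role as fetch_json).

```python
import json
import urllib.request


def fetch_json(url: str, timeout: float = 10.0):
    """Fetch a URL and decode the body as JSON, with no browser involved."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def product_names(payload: dict) -> list:
    """Pull the fields of interest out of the decoded payload."""
    return [item["name"] for item in payload.get("items", [])]


# Hypothetical response body, standing in for what the network tab might show
# for the request the page's javascript makes behind the scenes.
sample = json.loads(
    '{"items": [{"name": "widget", "price": 3}, {"name": "gadget", "price": 5}]}'
)
names = product_names(sample)
print(names)
```

Once you know which URL the page's own code calls, fetch_json(that_url) replaces the browser entirely and the parsing is a plain dictionary lookup.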
🌐
Opensource.com
opensource.com › article › 22 › 6 › analyze-web-pages-python-requests-beautiful-soup
Analyze web pages with Python requests and Beautiful Soup | Opensource.com
Because Beautiful Soup recognizes HTML entities, you can use some of its built-in features to make the output a little easier for the human eye to parse. For instance, instead of printing raw text at the end of your program, you can run the text through the .prettify function of Beautiful Soup:

from bs4 import BeautifulSoup
import requests

PAGE = requests.get("https://opensource.com/article/22/5/document-source-code-doxygen-linux")
SOUP = BeautifulSoup(PAGE.text, 'html.parser')

# Run the script
if __name__ == '__main__':
    # do a thing here
    print(SOUP.prettify())
🌐
W3Schools
w3schools.com › python › module_requests.asp
Python Requests Module
C:\Users\Your Name\AppData\Local\Programs\Python\Python36-32\Scripts>pip install requests ...