Brave Search

stackoverflow.com › questions › 27803503 › get-html-using-python-requests

The server in question is giving you a gzipped response. The server is also very broken; it sends the following headers:

$ curl -D - -o /dev/null -s -H 'Accept-Encoding: gzip, deflate' http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F
HTTP/1.1 200 OK
Date: Tue, 06 Jan 2015 17:46:49 GMT
Server: Apache
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd"><html xmlns="http: //www.w3.org/1999/xhtml" lang="en-US">
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 3659
Content-Type: text/html

The <!DOCTYPE..> line there is not a valid HTTP header. As such, the remaining headers past Server are ignored. Why the server interjects that is unclear; in all likelihood WRCCWrappers.py is a CGI script that doesn't output headers but does include a double newline after the doctype line, duping the Apache server into inserting additional headers there.
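The effect of the stray line can be reproduced offline. The following is only a sketch: it feeds the header block from the curl output to `http.client.parse_headers`, the Python 3 standard-library header parser (an assumption here is that requests ends up relying on the same parser through urllib3). `RAW_HEADERS` is an illustrative name, and the status line is left out because the parser only handles the header lines.

```python
import io
import http.client

# Header block copied from the curl output above, without the status line.
RAW_HEADERS = (
    b"Date: Tue, 06 Jan 2015 17:46:49 GMT\r\n"
    b"Server: Apache\r\n"
    b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    b'"DTD/xhtml1-transitional.dtd"><html xmlns="http: //www.w3.org/1999/xhtml" lang="en-US">\r\n'
    b"Vary: Accept-Encoding\r\n"
    b"Content-Encoding: gzip\r\n"
    b"Content-Length: 3659\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
)

# Parsing stops at the first line that does not look like "Name: value",
# so everything after "Server" is treated as body text, not as headers.
parsed = http.client.parse_headers(io.BytesIO(RAW_HEADERS))

print(parsed.keys())                   # header names the parser accepted
print(parsed.get("Content-Encoding"))  # None: the gzip marker was lost
```

With the Content-Encoding header gone, a client has no signal that the body needs decompressing.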

As such, requests also doesn't detect that the data is gzip-encoded. The data is all there, you just have to decode it. Or you could if it wasn't rather incomplete.
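A minimal sketch of what decoding the body by hand could look like, assuming the raw gzip bytes are available (for example from `r.content` when requests has not decoded them). The helper name `gunzip_lenient` is made up for illustration. It uses `zlib.decompressobj` rather than `gzip.decompress` because the former returns whatever it could decode from a stream that ends early instead of raising.

```python
import gzip
import zlib


def gunzip_lenient(raw: bytes) -> bytes:
    """Decode gzip data, returning as much as possible if the stream is cut short."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    # decompress() yields the bytes decoded so far and does not complain
    # when the input stops before the end of the gzip stream.
    return decoder.decompress(raw)


# Self-contained demonstration with locally compressed data.
payload = b"<html><body>hello</body></html>" * 50
packed = gzip.compress(payload)

full = gunzip_lenient(packed)
partial = gunzip_lenient(packed[: len(packed) // 2])  # simulate an incomplete body
print(len(payload), len(full), len(partial))
```

On a truncated body this recovers only a prefix of the page, which matches the caveat above that the data received was incomplete.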

The work-around is to tell the server not to bother with compression:

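import requests
url = 'http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F'
headers = {'Accept-Encoding': 'identity'}  # ask the server not to compress the response
r = requests.get(url, headers=headers)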

and an uncompressed response is returned.

Incidentally, on Python 2 the HTTP header parser is not so strict and manages to declare the doctype a header:

>>> pprint(dict(r.headers))
{'<!doctype html public "-//w3c//dtd xhtml 1.0 transitional//en" "dtd/xhtml1-transitional.dtd"><html xmlns="http': '//www.w3.org/1999/xhtml" lang="en-US">',
 'connection': 'Keep-Alive',
 'content-encoding': 'gzip',
 'content-length': '3659',
 'content-type': 'text/html',
 'date': 'Tue, 06 Jan 2015 17:42:06 GMT',
 'keep-alive': 'timeout=5, max=100',
 'server': 'Apache',
 'vary': 'Accept-Encoding'}

and the content-encoding information survives, so there requests decodes the content for you, as expected.

Answer from Martijn Pieters on Stack Overflow

PyPI

pypi.org › project › requests-html

requests-html · PyPI

>>> from requests_html import AsyncHTMLSession
>>> asession = AsyncHTMLSession()
>>> async def get_pythonorg():
...     r = await asession.get('https://python.org/')
>>> async def get_reddit():
...     r = await asession.get('https://reddit.com/')
>>> ...

      » pip install requests-html

Published Feb 17, 2019

Version 0.10.0

Homepage https://github.com/kennethreitz/requests-html

GitHub

github.com › psf › requests-html

GitHub - psf/requests-html: Pythonic HTML Parsing for Humans™

>>> from requests_html import HTML
>>> doc = """<a href='https://httpbin.org'>"""
>>> html = HTML(html=doc)
>>> html.links
{'https://httpbin.org'}
...
Only Python 3.6 and above is supported.

Starred by 13.8K users

Forked by 1K users

Languages Python 99.7% | Makefile 0.3%

Discussions

Get html using Python requests? - Stack Overflow

I am trying to teach myself some basic web scraping. Using Python's requests module, I was able to grab html for various websites until I tried this: More on stackoverflow.com

stackoverflow.com

Newest 'python-requests-html' Questions - Stack Overflow

Stack Overflow | The World’s Largest Online Community for Developers More on stackoverflow.com

stackoverflow.com

Been using requests-html out of fear for using Selenium, advice?

requests-html (at least the .render() function) is doing a similar thing to selenium, with a lot of the same performance impact: it's invoking a whole browser to render the page UI. It does have the advantage that you can get away without .render() in cases where you don't need it, so it is preferable in that respect, but the main thing those websites are advocating is to simply use the plain requests.get() approach, without having to get a web browser involved in the process at all.

I think selenium often becomes a bit of a newbie trap, because it works in a way that's pretty familiar to everyone (we've all used a browser), and so viewing a site as being about clicking on links and manipulating UI elements then looking at the result is an easy way to understand scraping. And as sites grow more dynamic and complex, sometimes it's the simplest (but usually not best) way to get some information, especially if you're unfamiliar with what goes on behind the scenes.

But if we drop a level of abstraction and view websites not as UI elements, but as simply fetching and sending data to and from the server, scraping becomes orders of magnitude more efficient, and sometimes even easier. Even on dynamic webpages that build their content with javascript, that code needs to get that data from somewhere, and often a quick look at the network tab when loading the page will show you where, often in a format much easier and more consistent to parse than messing with complex html. But a lot of the time people reach for selenium as a first option, even if they could accomplish the same goal with a simple get request, simply because it's their default.

In reality, I think selenium should be the option of last resort, used only when regular get requests are too difficult or tedious to figure out (generally when a site is actively using anti-spidering countermeasures), and requests-html's .render() is only really one step behind that (but is much better in that it presents this as an optional tool you can selectively apply, rather than the interface through which you do everything). More on reddit.com
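As a rough sketch of the "fetch the data directly" approach described in that comment: once the network tab reveals the endpoint a page loads its data from, it can often be requested without any browser. The helper name `fetch_json` and the example URL are hypothetical; `session` is assumed to be a `requests.Session` (or anything with a compatible `get` method).

```python
def fetch_json(session, url, params=None):
    """Request a JSON endpoint directly instead of rendering the page in a browser."""
    response = session.get(url, params=params, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.json()       # parsed JSON, usually a dict or list


# Hypothetical usage, assuming the page loads its data from this endpoint:
#   import requests
#   data = fetch_json(requests.Session(), "https://example.com/api/items", {"page": 1})
```

Whether this works depends on the site: some endpoints need cookies, tokens, or specific headers that the browser sends along.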

r/learnpython

8

12

February 8, 2021

Requests vs Requests-html vs beautifulsoup4

Well, requests for starters. It's simple and easy to get hold of. The problem is that it doesn't fetch JavaScript-rendered content, which is common on e-commerce websites. In theory, that is what requests-html is for. However, I had some trouble with it (the SSL handshake failed and I couldn't resolve it). Scrapy is a bit more advanced but very powerful. Selenium isn't that efficient, as it loads the webpage through a browser, but it does the job. TL;DR: if you have a small and simple scraping job, use requests. If the task is simple and the data is huge, I would still suggest requests. If going through small data on e-commerce or JS websites, Selenium does the job. If the data is large, Scrapy is pretty efficient. More on reddit.com

r/learnpython

5

4

May 15, 2020

Videos

31:04

YouTube

Web Scraping in Python - Requests HTML - YouTube

Easy Web Scraping With Python Requests-HTML: Extract and Parse ...

November 3, 2023

1.46K

youtube.com

requests HTML - Python requests on sterioids - YouTube

December 25, 2022

56:27

YouTube

Python Tutorial: Web Scraping with Requests-HTML - YouTube

March 11, 2019

12:58

YouTube

Python and Requests-HTML - Web Scraping Dynamic Content ...

March 30, 2023

Requests

requests.readthedocs.io › projects › requests-html › en › latest

requests-HTML v0.3.4 documentation

This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible. ... Full JavaScript support! CSS Selectors (a.k.a jQuery-style, thanks to PyQuery). XPath Selectors, for the faint of heart. Mocked user-agent (like a real web browser). Automatic following of redirects. ... The Requests experience you know and love, with magical parsing abilities. ... Only Python 3.6 is supported.

Medium

medium.com › @tubelwj › requests-html-an-html-parsing-library-in-python-8d182d13ecd2

Requests-HTML: An HTML parsing library in Python | by Gen. Devin DL. | Medium

September 17, 2024 - Requests-HTML is a Python library that extends the functionality of the Requests library by adding parsing and manipulation capabilities for HTML content.

JC Chouinard

jcchouinard.com › accueil › web scraping with python and requests-html (with example)

Web Scraping With Python and Requests-HTML (with Example) - JC Chouinard

June 21, 2023 - The Python Requests-HTML library is a web scraping module that offers HTTP requests as well as JavaScript rendering.

Stack Overflow

stackoverflow.com › questions › 27803503 › get-html-using-python-requests

Get html using Python requests? - Stack Overflow

2 of 4

14

The HTTP headers for this URL have now been fixed.

>>> import requests
>>> print requests.__version__
2.5.1
>>> r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
>>> r.text[:100]
u'\n<!DOCTYPE html>\n<HTML>\n<HEAD><TITLE>Monthly Average of Precipitation, Station id: 028815</TITLE></H'
>>> r.headers
{'content-length': '3672', 'content-encoding': 'gzip', 'vary': 'Accept-Encoding', 'keep-alive': 'timeout=5, max=100', 'server': 'Apache', 'connection': 'Keep-Alive', 'date': 'Thu, 12 Feb 2015 18:59:37 GMT', 'content-type': 'text/html; charset=utf-8'}

Webscraping

webscraping.fyi › lib › python › requests-html

Python requests-html Library in Web Scraping - Web Scraping FYI

February 16, 2023 - requests-html is a Python package that allows you to easily make HTTP requests and parse the HTML content of web pages. It is built on top of the popular requests package and uses the html parser from the lxml library, which makes it fast and ...

Published Feb 16, 2023

Author webscraping.fyi

Plain English

python.plainenglish.io › scrape-the-web-like-a-pro-with-python-and-requests-html-29d02ddddf39

Scrape the Web Like a PRO with Python and requests-HTML | Python in Plain English

December 8, 2022 - In our example I’m looking for the python version which is shown on the python blog page: ... So we just need to search for that string and replace the part we are looking for with a variable name between brackets. By doing this, requests-HTML will populate our variable with the value from the string.
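The template idea described in that snippet can be imitated with the standard library. This is a simplified stand-in built on `re`, not requests-HTML's own implementation, which may behave differently; the function name `search_template` and the sample text are made up for illustration.

```python
import re


def search_template(template: str, text: str):
    """Find `template` in `text`, capturing each {name} placeholder by name."""
    # re.split with a capturing group alternates literal text and placeholder names.
    parts = re.split(r"\{(\w+)\}", template)
    pattern = "".join(
        re.escape(part) if index % 2 == 0 else f"(?P<{part}>.+?)"
        for index, part in enumerate(parts)
    )
    match = re.search(pattern, text)
    return match.groupdict() if match else None


result = search_template("Python {version} is", "Python 3.11.1 is now available")
print(result)
```

Because the placeholders match non-greedily, a placeholder at the very end of the template would capture only a single character, so this sketch works best when literal text follows each placeholder.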

Kennethreitz

requests-html.kennethreitz.org

Requests-HTML: HTML Parsing for Humans (writing Python 3)! — requests-HTML v0.3.4 documentation

This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible. ... Full JavaScript support! CSS Selectors (a.k.a jQuery-style, thanks to PyQuery). XPath Selectors, for the faint at heart. Mocked user-agent (like a real web browser). Automatic following of redirects. ... The Requests experience you know and love, with magical parsing abilities. ... Only Python 3.6 is supported.

Medium

medium.com › @datajournal › web-scraping-with-python-and-requests-html-015e202970a0

Web Scraping With Python & Requests-HTML in 2025 | Medium

February 23, 2025 - Master web scraping with Python's requests-HTML: send HTTP requests, render JavaScript, parse HTML, and store data effortlessly.

Stack Overflow

stackoverflow.com › questions › tagged › python-requests-html

Newest 'python-requests-html' Questions - Stack Overflow

While attempting to perform an asynchronous task using requests-html, I encountered an error message stating ... ... # Import packages import requests from bs4 import BeautifulSoup # Specify url: url url = 'https://www.nts.live/shows' # Package the request, send the request and catch the response: r r = requests.... ... So I have the following script: #!/usr/bin/env python3 import requests from bs4 import BeautifulSoup def parse_marketwatch_calendar(url): #page=requests.get(url).text #soup=BeautifulSoup(page,...

Fernandomc

fernandomc.com › posts › using-requests-to-get-and-post

Python Requests and Beautiful Soup - Playing with HTTP Requests, HTML Parsing and APIs – Fernando Medina Corey

May 26, 2018 - Naturally, I gravitated towards teaching the basics of one of the most popular Python packages - Requests. I’ve also found it’s useful to throw in using Beautiful Soup to show folks how they can efficiently interact with HTML data after getting an HTML page.

Medium

medium.com › analytics-vidhya › the-modern-way-of-web-scraping-requests-html-2567ba2554f4

Requests-HTML: The modern way of web scraping. | by David Kowalk | Analytics Vidhya | Medium

December 2, 2020 - As a freelancer, people often come to me for the same reasons: Python’s difficult, the code is about as understandable as a bowl of spaghetti and generally inaccessible for beginners. “Why do I need 3 different libraries to download some data off a website?” This is very unfortunate since web scraping is something especially data scientists can use almost every day. In this tutorial, I will show you the basics of web scraping with requests-html, the modern way of scraping data off of websites.

Finxter

blog.finxter.com › home › learn python blog › how to get an html page from a url in python?

How to Get an HTML Page from a URL in Python? - Be on the Right Side of Change

October 31, 2022 - python -c "import requests; print(requests.get(url = 'https://google.com').text)" The output, again, is the desired Google HTML page:

AskPython

askpython.com › home › how to read html from a url in python 3?

How to read HTML from a URL in Python 3? - AskPython

April 28, 2023 - Requests is an HTTP library for the Python programming language. The package aims to simplify and improve the overall accessibility of HTTP requests. To read HTML for the provided URL, we first prepare a request using the ...

Medium

boadziedaniel.medium.com › scraping-bestsellers-with-the-requests-html-package-c1a671332e6

Scraping Bestsellers with the `requests-html` package | by Daniel Boadzie | Medium

February 19, 2023 - To get started in using `requests-html` let’s learn a little bit about the package. `requests-html` is a Python package for making the parsing of HTML easy and intuitive. It was created by Kenneth Reitz, the same guy who created the `requests` ...

reddit.com › r/learnpython › been using requests-html out of fear for using selenium, advice?

r/learnpython on Reddit: Been using requests-html out of fear for using Selenium, advice?

February 8, 2021 -

So basically all webscraping guides describe selenium as super slow, last option that you should avoid at all costs, so when I had to scrape a number of websites with dynamic content, that's what I did. I found an alternative in requests-html, primarily for its r.html.render() function and for the simplicity (no need to set headers etc.)

The more blogs and guides on web scraping I read, the more I become aware that nobody mentions requests-html; they always recommend Selenium despite it being the slowest of them all.

Is this a bad way to deal with dynamic content? Does Selenium have a similar function?