python-requests parse html response

stackoverflow.com › questions › 51788359 › parsing-html-with-python-request

As @nosklo pointed out here, you are looking for anchor (<a>) tags and the links stored in their href attributes. A parse tree is organized by the HTML elements themselves, so you find text by searching for those elements specifically. For URLs, this would look like the following (using the lxml library in Python 3.6):

from lxml import etree
from io import StringIO
import requests

# Set explicit HTMLParser
parser = etree.HTMLParser()

page = requests.get('https://URL.COM')

# Decode the response body from bytes to str
# (this assumes the page is UTF-8 encoded)
html = page.content.decode("utf-8")

# Build the etree from a StringIO object, which behaves like
# an open file handle
tree = etree.parse(StringIO(html), parser=parser)

# Define a helper that takes the parsed tree
def get_links(tree):
    # Select every anchor tag <a href...>
    refs = tree.xpath("//a")
    # Read each href attribute (empty string if it is missing)
    links = [ref.get('href', '') for ref in refs]
    # Keep only the links that end with .com.br
    return [link for link in links if link.endswith('.com.br')]


# Example call
links = get_links(tree)

Answer from C.Nivs on Stack Overflow
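
For readers who do not have lxml installed, the same idea (walk the anchor tags, read each href, filter by suffix) can be sketched with the standard library's html.parser module. This is a swapped-in technique, not the answer's own method: the names LinkCollector, get_links_stdlib, and SAMPLE_HTML are hypothetical, and the sample markup stands in for a live requests.get() call so the sketch runs offline.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def get_links_stdlib(html, suffix=".com.br"):
    """Return the hrefs found in html that end with suffix."""
    collector = LinkCollector()
    collector.feed(html)
    return [link for link in collector.links if link.endswith(suffix)]


# Hypothetical sample markup, used here instead of a live request
SAMPLE_HTML = """
<html><body>
  <a href="https://example.com.br">Keep</a>
  <a href="https://example.org/page">Drop</a>
  <a>No href</a>
</body></html>
"""

print(get_links_stdlib(SAMPLE_HTML))
```

With a real page, the string passed to get_links_stdlib would be the decoded response body, for example page.text from requests.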

Kennethreitz

requests-html.kennethreitz.org

Requests-HTML: HTML Parsing for Humans (writing Python 3)! — requests-HTML v0.3.4 documentation

Returns a generator of Responses or Requests. ... Send a given PreparedRequest. ... Requests-HTML intends to make parsing HTML (e.g.

Stack Overflow


Parsing HTML with python request - Stack Overflow

Discussions

python - Is there a way to parse out HTML in a response from requests.get()? - Stack Overflow

I'm using the requests package to get data from an API and see some HTML elements in the response data such as , and \', among a bunch of other elements. The return value for response.encoding is utf-8 if that helps. I'd like to parse out all the HTML values and just have a simple ... More on stackoverflow.com

stackoverflow.com
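
One possible way to approach the question above, stripping markup from a string field of an API response, is sketched below using only the standard library. The exact elements in the asker's data are not shown, so the sample string is hypothetical, as are the names TextExtractor, strip_html, and BLOCK_TAGS. The sketch drops tags, decodes character references such as &amp;amp;, and collapses whitespace; it does not handle backslash-escaped quotes, which are not HTML.

```python
from html.parser import HTMLParser

# Tags that should act as a word break once removed (an assumed, partial list)
BLOCK_TAGS = {"br", "p", "div", "li"}


class TextExtractor(HTMLParser):
    """Keep only the text content of the markup fed to the parser."""

    def __init__(self):
        # convert_charrefs=True decodes references like &amp; inside text
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Insert a space so text on either side of a block tag stays separate
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(value):
    """Return value with tags removed and runs of whitespace collapsed."""
    extractor = TextExtractor()
    extractor.feed(value)
    extractor.close()  # flush any text still buffered by the parser
    return " ".join("".join(extractor.parts).split())


# Hypothetical field value from an API response
print(strip_html("<p>Fish &amp; chips</p><br>It&#39;s <b>open</b>"))
```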

python - Parsing HTML with requests and BeautifulSoup - Stack Overflow

I'm not sure if I'm approaching this correctly. I'm using requests to make a GET: con = s.get(url). When I call con.content, the whole page is there. But when I pass con into BS: soup = BeautifulS... More on stackoverflow.com

stackoverflow.com

Get html using Python requests? - Stack Overflow

I am trying to teach myself some basic web scraping. Using Python's requests module, I was able to grab html for various websites until I tried this: More on stackoverflow.com

stackoverflow.com

Steps for requests-html to parse more than one tag/class in python

You definitely want r/python for this - r/css is for questions about styling 👍 More on reddit.com

r/css

October 7, 2020

Videos

07:47

YouTube

Easy Web Scraping With Python Requests-HTML: Extract and Parse ...

November 3, 2023

1.46K