parse html file in python

stackoverflow.com › questions › 11709079 › parsing-html-using-python

So that I can ask it to get me the content/text in the div tag with class='container' contained within the body tag, or something similar.

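from bs4 import BeautifulSoup  # Beautiful Soup 4 (the older BeautifulSoup 3 package is Python 2 only)

# Stand-in markup so the snippet runs; replace it with the HTML code you've written above.
html = '<html><body><div class="container">text</div></body></html>'
# Name the parser explicitly to avoid bs4's "no parser was explicitly specified" warning.
parsed_html = BeautifulSoup(html, 'html.parser')
# find() returns None if no matching div exists, in which case .text raises AttributeError.
print(parsed_html.body.find('div', attrs={'class': 'container'}).text)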

You don't need performance descriptions, I guess. Just read how BeautifulSoup works in its official documentation.

Answer from Aadaam on Stack Overflow
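
A hedged follow-up to the snippet above: find() returns None when nothing matches, so chaining .text onto it fails on pages that lack the div. The sketch below guards against that. It assumes Beautiful Soup 4 is installed (pip install beautifulsoup4); the sample markup and the helper name container_text are illustrative and not part of the original answer.

```python
from bs4 import BeautifulSoup

# Illustrative sample markup (an assumption; substitute your own HTML).
SAMPLE_HTML = """
<html><body>
  <div class="header">Site title</div>
  <div class="container">Hello, <b>world</b>!</div>
</body></html>
"""


def container_text(html, css_class="container"):
    """Return the text of the first <div> with the given class, or None."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", class_=css_class)
    if div is None:
        # No matching element: report that instead of raising AttributeError.
        return None
    return div.get_text().strip()


if __name__ == "__main__":
    print(container_text(SAMPLE_HTML))
    print(container_text("<p>no div here</p>"))
```

Returning None keeps the "not found" case explicit, so the caller decides whether a missing div is an error.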

Python

docs.python.org › 3 › library › html.parser.html

html.parser — Simple HTML and XHTML parser

Source code: Lib/html/parser.py. This module defines a class HTMLParser, which serves as the basis for parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML. Example HTML Parser...
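
To make the HTMLParser description concrete, here is a minimal sketch that uses only the standard library to pull the text out of a div with a given class. The class name DivTextParser, the helper div_text, and the sample markup are hypothetical names chosen for illustration; only HTMLParser and its handle_* callbacks come from the module itself.

```python
from html.parser import HTMLParser


class DivTextParser(HTMLParser):
    """Collect the text found inside every <div> that has the target class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0    # greater than 0 while inside a matching <div>
        self.chunks = []  # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.css_class in classes:
            # Count nested <div>s so we know when the matching one closes.
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def div_text(html, css_class="container"):
    """Return the stripped text of all matching <div>s ('' if none match)."""
    parser = DivTextParser(css_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


if __name__ == "__main__":
    sample = '<body><div class="container">Hello, <b>world</b>!</div><div class="x">skip</div></body>'
    print(div_text(sample))
```

HTMLParser is event-driven and builds no tree, so the subclass has to track nesting itself; that is the main trade-off against a library such as Beautiful Soup.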

Stack Overflow

Parsing HTML using Python - Stack Overflow

2 of 8

114

I guess what you're looking for is pyquery:

pyquery: a jquery-like library for python.

An example of what you want might look like this:

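from pyquery import PyQuery

# Stand-in markup so the snippet runs; replace it with your own HTML code.
html = '<div id="id" class="class">text</div>'
pq = PyQuery(html)
tag = pq('div#id')  # or: tag = pq('div.class')
print(tag.text())   # print() is a function in Python 3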

It uses the same CSS selectors that Firefox's or Chrome's Inspect Element shows. For example:

The inspected element selector is 'div#mw-head.noprint'. So in pyquery, you just need to pass this selector:

pq('div#mw-head.noprint')

Discussions

Python HTML parsing options?

Beautiful Soup is a popular HTML parsing library.

r/learnpython

April 4, 2021

Trying to Parse HTML with Beautiful Soup Module

Are you sure that CSS selector matches something in the text you are searching? Save the text to a file, open it in your browser, and see if it looks the way you expect.
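
The debugging step suggested above can be sketched with the standard library alone. The helper names save_for_inspection and mentions_class are hypothetical, and the substring check is deliberately crude: it only tells you whether the class name appears anywhere in the raw text, not whether a selector would match.

```python
import tempfile
import webbrowser
from pathlib import Path


def save_for_inspection(html_text, open_in_browser=False):
    """Write html_text to a temporary .html file and return its path."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".html", delete=False, encoding="utf-8"
    ) as fh:
        fh.write(html_text)
        path = Path(fh.name)
    if open_in_browser:
        # Opens the saved copy in the default browser for a visual comparison.
        webbrowser.open(path.as_uri())
    return path


def mentions_class(html_text, css_class):
    """Crude check: does the raw text mention the class name at all?"""
    return css_class in html_text


if __name__ == "__main__":
    fetched = "<html><body><div class='container'>x</div></body></html>"
    print(save_for_inspection(fetched))
    print(mentions_class(fetched, "container"))
```

If the class name is missing from the saved text, the content is probably being added by JavaScript after the page loads, which a plain HTTP fetch never sees.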

r/learnpython

September 15, 2022

How do I parse HTML on a page that is currently open in my browser and not a simulated browser

I am pretty sure that webbrowser and requests work independently of each other. webbrowser just opens the URL in your default browser, so depending on your authentication method, existing user data, cookies, etc., you may already be logged in there. requests, however, just sends a GET request to the URL and retrieves the data; it does not use any data such as cookies from the (default) web browser. You can have a look at Selenium. It lets you open a browser window and interact with it directly, and you can set it up to store user data, which should allow you to log in only once. Whether this works properly depends heavily on the authentication method and the policy of the website. After setting up Selenium, you could have a look at the link below on how to store user data for the next session: https://stackoverflow.com/questions/45651879/using-selenium-how-to-keep-logged-in-after-closing-driver-in-python

r/learnpython

May 24, 2024

TIL You can parse html in Python using jQuery syntax (this was posted 2 years ago, but it has helped me so much I thought it deserved a repost)

This is a great package, thanks for the repost, although there's nothing jQuery about the syntax; it's just CSS selectors...

r/Python

140

December 7, 2010
