You should use some HTML parsing library like lxml:
from lxml import etree
s = """<table>
<tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
<tr><td>a</td><td>b</td><td>c</td></tr>
<tr><td>d</td><td>e</td><td>f</td></tr>
<tr><td>g</td><td>h</td><td>i</td></tr>
</table>
"""
table = etree.HTML(s).find("body/table")
rows = iter(table)
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    print(dict(zip(headers, values)))
This prints (on Python 3.7+, where dicts keep insertion order):
{'Event': 'a', 'Start Date': 'b', 'End Date': 'c'}
{'Event': 'd', 'Start Date': 'e', 'End Date': 'f'}
{'Event': 'g', 'Start Date': 'h', 'End Date': 'i'}
Answer from Sven Marnach on Stack Overflow
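If installing lxml is not an option, the same header/row pairing can be sketched with only the standard library's html.parser. This is not part of the answer above, just a minimal sketch: TableParser is a hypothetical helper name, and it assumes a single, well-formed table with no nested tables and no colspan/rowspan.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every <tr> it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a <td> or <th>.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


s = """<table>
<tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
<tr><td>a</td><td>b</td><td>c</td></tr>
<tr><td>d</td><td>e</td><td>f</td></tr>
<tr><td>g</td><td>h</td><td>i</td></tr>
</table>
"""

parser = TableParser()
parser.feed(s)

# The first row holds the headers; the rest are data rows.
headers, *body = parser.rows
records = [dict(zip(headers, row)) for row in body]
for record in records:
    print(record)
```

For real-world pages with sloppy markup, a tolerant parser such as lxml is usually the safer choice; this version only avoids the extra dependency.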
Hands down the easiest way to parse an HTML table is to use pandas.read_html(), which accepts both URLs and HTML.
import pandas as pd
url = r'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
tables = pd.read_html(url) # Returns list of all tables on page
sp500_table = tables[0] # Select table of interest
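As of pandas version 1.5.0, read_html() can preserve hyperlinks with the extract_links argument. Cells in the selected part of the table ("header", "body", "footer", or "all") are then returned as (text, href) tuples instead of plain text.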
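To make the extract_links behaviour concrete, here is a small sketch on an inline table. The HTML and the example.com URL are made up for illustration, and it assumes pandas 1.5.0 or later with a parser backend such as lxml installed. The markup is wrapped in io.StringIO because recent pandas versions deprecate passing literal HTML strings directly.

```python
from io import StringIO

import pandas as pd

# Made-up table for illustration; the URL is a placeholder.
html = """<table>
<tr><th>Event</th><th>Link</th></tr>
<tr><td>a</td><td><a href="https://example.com/a">details</a></td></tr>
</table>
"""

# extract_links="body" asks pandas to keep the href of <a> tags in body cells.
tables = pd.read_html(StringIO(html), extract_links="body")
df = tables[0]

# Each body cell should now be a (text, href) tuple.
text, href = df.loc[0, "Link"]
print(text, href)
```

Cells without a link should still come back as tuples, with None in place of the href, so it is worth unpacking every cell the same way.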
What are the best tools to scrape HTML tables?
Some of the best tools for scraping HTML tables are Python libraries such as BeautifulSoup and pandas, which let you extract table data and manipulate it with little code. For more complex scenarios, scraping APIs such as ScraperAPI can handle dynamic content and bypass anti-scraping measures.
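As a rough illustration of the BeautifulSoup route, the sketch below turns a small inline table into a list of dicts. It assumes the beautifulsoup4 package is installed and uses the built-in "html.parser" backend so no extra parser is needed. The sample table is the one from the lxml answer, and it assumes the first row contains the headers.

```python
from bs4 import BeautifulSoup

html = """<table>
<tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
<tr><td>a</td><td>b</td><td>c</td></tr>
<tr><td>d</td><td>e</td><td>f</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")
rows = table.find_all("tr")

# First row: header cells; remaining rows: data cells.
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
records = [
    dict(zip(headers, (td.get_text(strip=True) for td in row.find_all("td"))))
    for row in rows[1:]
]
print(records)
```

From here the records can be passed to pandas.DataFrame(records) if further manipulation is needed.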
What kind of data can HTML tables contain?
HTML tables can contain a wide variety of data types. They are commonly used to display structured information such as text, numbers, dates, images, links, and even nested tables. This makes them suitable for representing data like financial reports, product listings, and statistical data, which we can then scrape for analysis.
How can I scrape dynamic tables?
To scrape dynamic tables that are generated or updated using JavaScript, you need tools that can render JavaScript content. Tools like ScraperAPI can fetch the fully loaded page, allowing you to extract the dynamic table data. Alternatively, you can use browser automation tools like Selenium or Playwright, which drive a real browser the way a user would, so the dynamic content is loaded before you scrape it.
pip install html-table-parser-python3
pip install html-table-extractor