You should use an HTML parsing library such as lxml:
from lxml import etree
s = """<table>
<tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
<tr><td>a</td><td>b</td><td>c</td></tr>
<tr><td>d</td><td>e</td><td>f</td></tr>
<tr><td>g</td><td>h</td><td>i</td></tr>
</table>
"""
table = etree.HTML(s).find("body/table")
rows = iter(table)
# the first row holds the column headers
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    print(dict(zip(headers, values)))
On Python 3.7 or later, this prints:
{'Event': 'a', 'Start Date': 'b', 'End Date': 'c'}
{'Event': 'd', 'Start Date': 'e', 'End Date': 'f'}
{'Event': 'g', 'Start Date': 'h', 'End Date': 'i'}
Answer from Sven Marnach on Stack Overflow
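If installing lxml is not an option, the same "zip the header row with each data row" idea can be sketched with the standard library's html.parser instead. This is a swapped-in technique, not part of the answer above. The names TableRowParser and table_to_dicts are made up for illustration, and the sketch assumes a single, non-nested table with at least one row.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_dicts(html):
    """Treat the first row as headers and zip it with each later row."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    headers, *body = parser.rows  # assumes at least one row exists
    return [dict(zip(headers, row)) for row in body]


sample = """<table>
<tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
<tr><td>a</td><td>b</td><td>c</td></tr>
<tr><td>d</td><td>e</td><td>f</td></tr>
</table>"""

records = table_to_dicts(sample)
for record in records:
    print(record)
```

For messy real-world markup (nested tables, missing closing tags), a dedicated parser such as lxml or BeautifulSoup is likely the more robust choice.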
Hands down, the easiest way to parse an HTML table is pandas.read_html(), which accepts both URLs and HTML.
import pandas as pd
url = r'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
tables = pd.read_html(url) # Returns list of all tables on page
sp500_table = tables[0] # Select table of interest
As of pandas 1.5.0, read_html() can preserve hyperlinks through the extract_links argument ("header", "body", "footer", or "all"). Cells in the selected section are then returned as (text, href) tuples instead of plain text.
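A minimal sketch of extract_links on an inline table, assuming pandas 1.5.0 or later with a parser backend such as lxml installed. The sample HTML and its column names are invented for illustration.

```python
from io import StringIO

import pandas as pd

html = """<table>
<thead><tr><th>Name</th><th>Site</th></tr></thead>
<tbody>
<tr><td>Python</td><td><a href="https://www.python.org/">python.org</a></td></tr>
<tr><td>Unlisted</td><td>no website</td></tr>
</tbody>
</table>"""

# extract_links="body" turns every body cell into a (text, href) tuple;
# href is None when the cell contains no <a> tag.
df = pd.read_html(StringIO(html), extract_links="body")[0]

# Unpack the tuples of one column into link text and URL.
for text, href in df["Site"]:
    print(text, href)
```

Wrapping the markup in StringIO keeps the call working on newer pandas releases, which expect a file-like object rather than a literal HTML string.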
What are the best tools to scrape HTML tables?
Some of the best tools for scraping HTML tables are Python libraries such as BeautifulSoup and pandas, which let you extract table data and manipulate it with little code. For more complex scenarios, scraping APIs like ScraperAPI can handle dynamic content and bypass anti-scraping measures.
What kind of data can HTML tables contain?
HTML tables can contain a wide variety of data types. They are commonly used to display structured information such as text, numbers, dates, images, links, and even nested tables. This makes them suitable for representing data like financial reports, product listings, and statistical data – which we can then scrape for analysis.
How can I scrape dynamic tables?
To scrape tables that are generated or updated by JavaScript, you need a tool that can render JavaScript content. Services like ScraperAPI can fetch the fully loaded page so that you can extract the dynamic table data. Alternatively, you can use browser automation tools such as Selenium or Playwright, which drive a real browser and let the dynamic content load before you scrape it.
pip install html-table-parser-python3
pip install html-table-extractor
You can use the CSS selector methods select() and select_one() to get "3text" and "6text", like below:
from bs4 import BeautifulSoup
html_doc = '''
<table cellspacing="0" cellpadding="0">
<tbody><tr>
<td>1text 2text</td>
<td>3text </td>
</tr>
<tr>
<td>4text 5text</td>
<td>6text </td>
</tr>
</tbody></table>
'''
soup = BeautifulSoup(html_doc, 'lxml')
# select every row, then take the second cell of each
for row in soup.select('tr'):
    print(row.select_one('td:nth-child(2)').text)
You can also use the find_all() method:
trs = soup.find('table').find_all('tr')
for row in trs:
    tds = row.find_all('td')
    print(tds[1].text)
Result:
3text
6text
The best way is to use BeautifulSoup:
from bs4 import BeautifulSoup
html_doc = '''
<table cellspacing="0" cellpadding="0">
<tbody><tr>
<td>1text 2text</td>
<td>3text </td>
</tr>
<tr>
<td>4text 5text</td>
<td>6text </td>
</tr>
</tbody></table>
'''
soup = BeautifulSoup(html_doc, "html.parser")
# find all tr tags
for tr in soup.find_all("tr"):
    # find all td tags inside this tr
    for td in tr.find_all("td"):
        # print the text of each td
        print(td.text)
In this case it prints:
1text 2text
3text
4text 5text
6text
You can grab just the text you want by indexing. In this case you could go with:
# find all tr tags
for tr in soup.find_all("tr"):
    # print the text of the second td in each row
    print(tr.find_all("td")[1].text)
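Rather than printing cells one by one, it can be handier to collect the table into a list of rows first and index into that. The following is a minimal sketch of that idea, assuming every row has at least two cells. It uses get_text(strip=True) to drop the trailing spaces in the sample cells, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html_doc = """
<table cellspacing="0" cellpadding="0">
<tbody><tr>
<td>1text 2text</td>
<td>3text </td>
</tr>
<tr>
<td>4text 5text</td>
<td>6text </td>
</tr>
</tbody></table>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# One inner list per <tr>, one stripped string per <td>.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in soup.find_all("tr")
]

# Pick the second column, skipping any row that is too short.
second_column = [row[1] for row in rows if len(row) > 1]

print(rows)
print(second_column)
```

Keeping the rows in a plain list makes it easy to hand them to csv.writer or a DataFrame constructor later.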