You should use some HTML parsing library like lxml:
from lxml import etree
s = """<table>
<tr><th>Event</th><th>Start Date</th><th>End Date</th></tr>
<tr><td>a</td><td>b</td><td>c</td></tr>
<tr><td>d</td><td>e</td><td>f</td></tr>
<tr><td>g</td><td>h</td><td>i</td></tr>
</table>
"""
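# etree.HTML() wraps the fragment in <html><body>, so the table sits at body/table
table = etree.HTML(s).find("body/table")
rows = iter(table)
# The first <tr> holds the <th> header cells
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    print(dict(zip(headers, values)))
This prints (Python 3.7+ keeps the keys in header order):
{'Event': 'a', 'Start Date': 'b', 'End Date': 'c'}
{'Event': 'd', 'Start Date': 'e', 'End Date': 'f'}
{'Event': 'g', 'Start Date': 'h', 'End Date': 'i'}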
Answer from Sven Marnach on Stack Overflow
Hands down the easiest way to parse an HTML table is to use pandas.read_html(), which accepts both URLs and HTML.
import pandas as pd
url = r'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
tables = pd.read_html(url) # Returns list of all tables on page
sp500_table = tables[0] # Select table of interest
As of pandas version 1.5.0, read_html() can preserve hyperlinks with the extract_links argument. Cells in the selected part of the table then become (text, href) tuples.
My grades are presented as a vanilla HTML table and converting the site to text makes it very hard to get values from the table. Is there any shortcut that parses an HTML table into a dictionary?
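If installing a third-party library is not an option, the same headers-to-values idea can be sketched with only the standard library's html.parser. This is a minimal sketch, not a robust parser: the names TableRowParser and table_to_dicts and the sample grades table are made up for illustration, and it assumes a single, simple table whose first row holds the headers, with no nested tables and no rowspan/colspan.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell strings per <tr>
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def table_to_dicts(html_text):
    """Return one dict per body row, keyed by the first row's cell texts."""
    parser = TableRowParser()
    parser.feed(html_text)
    rows = [[cell.strip() for cell in row] for row in parser.rows if row]
    if not rows:
        return []
    headers, *body = rows
    return [dict(zip(headers, row)) for row in body]


# Hypothetical sample data, standing in for the grades page.
grades = """<table>
<tr><th>Course</th><th>Grade</th></tr>
<tr><td>Math</td><td>A</td></tr>
<tr><td>Physics</td><td>B+</td></tr>
</table>"""

for record in table_to_dicts(grades):
    print(record)
```

For real-world pages with malformed markup, a tolerant parser such as lxml or BeautifulSoup is still the safer choice.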
For your general problem: try lxml.html from the lxml package (think of it as the stdlib's xml.etree on steroids: the same XML API, but with HTML support, XPath, XSLT, etc.).
A quick example for your specific case:
from lxml import html
tree = html.parse('http://www.datamystic.com/timezone/time_zones.html')
# './/table' searches the whole document; index 1 picks the second table
table = tree.findall('.//table')[1]
data = [
    [td.text_content().strip() for td in row.findall('td')]
    for row in table.findall('tr')
]
This will give you a nested list: each sub-list corresponds to a row in the table and contains the data from the cells. The sneakily inserted advertisement rows are not filtered out yet, but it should get you on your way. (and by the way: lxml is fast!)
BUT: more specifically for your particular use case, there are better ways to get at timezone database information than scraping that particular webpage (aside: note that the web page actually mentions that you are not allowed to copy its contents). There are even existing libraries that already use this information; see for example python-dateutil.
Avoid regular expressions for parsing HTML; they're simply not appropriate for it. You want a DOM parser like BeautifulSoup for sure.
A few other alternatives
- SimpleHTMLDom (PHP)
- Hpricot & Nokogiri (Ruby)
- Web::Scraper (Perl/CPAN)
All of these are reasonably tolerant of poorly formed HTML.
Not too long ago, I posted this thread, and received feedback indicating that a wiser approach to my problem involved the use of one or more Perl modules. Towards that end, I have the following:
#!/usr/bin/perl -w
# For reference: https://metacpan.org/pod/HTML::TableExtract
use strict;
use warnings;
use diagnostics;
use HTML::TableExtract;
my $headers = ['Guest ID', 'Password'];
my $table_extract = HTML::TableExtract->new(headers => $headers);
$table_extract->parse_file('sample.html');
my ($table) = $table_extract->tables;
for my $row ($table->rows) {
    print join(" ", @$row), "\n";
}
Which works as expected, i.e. reads an HTML file, and parses two (2) regions of interest. What I'd like to be able to do, though, is pull the HTML directly from the web, rather than have to store it as a file, and then parse it with this:
$table_extract->parse($HTML);
instead of:
$table_extract->parse_file('sample.html');
(I use LWP::UserAgent; to pass credentials, and retrieve the page, FYI). Here's the error I get:
Can't call method "rows" on an undefined value at parse_table.pl line 51 (#1)
It isn't clear to me what's breaking. This:
print($table_extract->parse_file('sample.html'));
returns this:
HTML::TableExtract=HASH(0x55912db53110)HTTP::Response=HASH(0x55912e2a28c8)HTML::TableExtract=HASH(0x55912db53110)
But this:
print(my ($table) = $table_extract->tables);
returns this:
Use of uninitialized value in print at parse_table.pl line 49 (#1) (W uninitialized) An undefined value was used as if it were already defined. It was interpreted as a "" or a 0, but maybe it was a mistake. To suppress this warning assign a defined value to your variables.
So I guess that's where the problem starts.
Any suggestions on how to further debug/remedy this?
JavaScript (Node.js), 175 bytes
x=>x.replace(/<t.(?: c.*?(\d+)")?(?: .*?(\d+)")?>(\w*)/g,(t,c=t<'<te'||!--y,r=1,v)=>{for(i=0;c;++i)if(!(X[~y]?.[i]+1))for(j=1,--c;+r+--j;u[i]=v,v='')u=X[~y-j]||=[]},X=y=[])&&X
Charcoal, 144 bytes
SθSθ≔⁰ηW›Lθ⁸«≔⁰ζF∧›Lθ⁹⪪✂θ⁷±χ¹</td><td«≔E⊞O⪪κwspan=ω∨Σλ¹ε≔✂κ⊕⌕κ>Lκ¹κ≔⁺η§ε¹δF⁻δLυ⊞υ⟦⟧F✂υηδ¹«W∧‹ζLλ¬⁼§λζ⁰≦⊕ζF⁻⁺ζ§ε⁰Lλ⊞λ⁰F§ε⁰«§≔λ⁺ζμκ≔ωκ»»»≦⊕ηSθ»⭆¹υ
Explanation:
Sθ
Skip over the initial <table>.
Sθ
Read the first line of the table body.
≔⁰η
Start at (0-indexed) row 0.
W›Lθ⁸«
Repeat until </table> is reached.
≔⁰ζ
Start at column 0.
F∧›Lθ⁹⪪✂θ⁷±χ¹</td><td«
Loop over the cells of the table, excluding the leading <td and trailing </td>.
≔E⊞O⪪κwspan=ω∨Σλ¹ε
Extract the rowspan and colspan. (This depends on the text not containing digits.)
≔✂κ⊕⌕κ>Lκ¹κ
Extract the text.
≔⁺η§ε¹δ
Get the height necessary to include this rowspan.
F⁻δLυ⊞υ⟦⟧
Extend the table to that height if necessary.
F✂υηδ¹«
Loop over each row in the rowspan.
W∧‹ζLλ¬⁼§λζ⁰≦⊕ζ
Increase the column until it's not a used cell. (This only makes a difference on the first row, in which case the column advances past the previous cell and any cells from rowspans in previous rows.)
F⁻⁺ζ§ε⁰Lλ⊞λ⁰
Extend the row to the width necessary to include the colspan.
F§ε⁰«
Loop for every colspan.
§≔λ⁺ζμκ
Set the cell to the current text.
≔ωκ
Clear the current text.
»»»≦⊕η
Advance to the next row.
Sθ
Read the next line of the table.
»⭆¹υ
Pretty-print the final table, as the default output would confuse empty cells with the double-spacing between rows.
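For readers who do not read Charcoal, the grid-filling steps explained above can be sketched in Python. This is an illustrative sketch rather than a translation of the golfed program: it reads cells with the standard library's html.parser instead of splitting strings, marks unused cells with None instead of 0, and the names SpanTableParser and expand_spans as well as the demo table are invented for the example. As in the explanation, the text goes into the top-left cell of each span and the remaining covered cells are left as empty strings.

```python
from html.parser import HTMLParser


class SpanTableParser(HTMLParser):
    """Collect (text, colspan, rowspan) for each cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one list of (text, colspan, rowspan) per <tr>
        self._cell = None   # [text_parts, colspan, rowspan] while in a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:           # tolerate a cell before any <tr>
                self.rows.append([])
            a = dict(attrs)
            self._cell = [[], int(a.get("colspan") or 1),
                          int(a.get("rowspan") or 1)]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            parts, colspan, rowspan = self._cell
            self.rows[-1].append(("".join(parts).strip(), colspan, rowspan))
            self._cell = None


def expand_spans(html_text):
    """Return a list of rows with rowspan/colspan cells expanded."""
    parser = SpanTableParser()
    parser.feed(html_text)
    grid = []
    for r, cells in enumerate(parser.rows):
        while len(grid) <= r:           # make sure the current row exists
            grid.append([])
        col = 0
        for text, colspan, rowspan in cells:
            # Extend the grid to the height this rowspan needs.
            while len(grid) < r + rowspan:
                grid.append([])
            # Skip columns already used by rowspans from earlier rows.
            while col < len(grid[r]) and grid[r][col] is not None:
                col += 1
            for dr in range(rowspan):
                row = grid[r + dr]
                # Extend the row to the width this colspan needs.
                row.extend([None] * (col + colspan - len(row)))
                for dc in range(colspan):
                    # Text only in the top-left cell, "" in the rest.
                    row[col + dc] = text if dr == 0 and dc == 0 else ""
            col += colspan
    return grid


# Hypothetical demo table with one rowspan and one colspan.
demo = ('<table><tr><td rowspan="2">a</td><td>b</td></tr>'
        '<tr><td>c</td></tr>'
        '<tr><td colspan="2">d</td></tr></table>')

for line in expand_spans(demo):
    print(line)
```

Unlike the golfed answers, this version does not depend on the cell text being free of digits, because the span values come from parsed attributes rather than from pattern matching on the raw markup.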