A file object can only be read through once, unless you seek back to the beginning. Either way, re-reading the file is not the approach you want here.
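
To illustrate the read-once behavior, here is a minimal sketch that uses an in-memory io.StringIO as a stand-in for a real file (the sample text and variable names are assumptions for illustration). A second read returns an empty string until the position is reset with seek(0).

```python
import io

# io.StringIO stands in for an open text file; the content is illustrative only.
html_file = io.StringIO("<html>\n<p>Report</p>\n</html>\n")

first_pass = html_file.read()    # consumes the whole stream
second_pass = html_file.read()   # position is already at the end, so this is empty

html_file.seek(0)                # go back to the beginning
third_pass = html_file.read()    # the full content is available again

print(repr(second_pass))
print(first_pass == third_pass)
```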

Iterate over the file line by line, checking for your conditions as you go.

error = False
report = False

# html_path is the path of the HTML file to scan
with open(html_path) as html_file:
  for line in html_file:
    print(line)
    if 'Error' in line:
      error = True
    if 'Report' in line:
      report = True
  else:
    # the else branch of a for loop runs once the loop ends without a break
    if error:
      print('error')
    elif report:
      print('result')
    else:
      print('nothing')
Answer from Ignacio Vazquez-Abrams on Stack Overflow
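
The same single-pass idea can be wrapped in a function so it is easy to try without a file on disk. This is only a sketch: the name classify_lines and the sample text are hypothetical, and io.StringIO is used in place of open(html_path).

```python
import io


def classify_lines(lines):
    """Return 'error', 'result', or 'nothing' after one pass over lines."""
    error = False
    report = False
    for line in lines:
        if 'Error' in line:
            error = True
        if 'Report' in line:
            report = True
    # 'Error' takes priority over 'Report', as in the loop above.
    if error:
        return 'error'
    if report:
        return 'result'
    return 'nothing'


# Illustrative input; any iterable of lines (such as an open file) works.
sample = io.StringIO("<h1>Report</h1>\n<p>Error: disk full</p>\n")
status = classify_lines(sample)
print(status)
```

Because the function only iterates, it can be handed an open file object directly, for example classify_lines(open(html_path)).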
Python
docs.python.org › 3 › library › html.parser.html
html.parser — Simple HTML and XHTML parser
Source code: Lib/html/parser.py This module defines a class HTMLParser which serves as the basis for parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML. Example HTML Parser...
TutorialsPoint
tutorialspoint.com › python_data_science › python_reading_html_pages.htm
Python - Reading HTML Pages
We can extract tag value from all the instances of a tag using the following code. import urllib2 from bs4 import BeautifulSoup response = urllib2.urlopen('http://tutorialspoint.com/python/python_overview.htm') html_doc = response.read() soup = BeautifulSoup(html_doc, 'html.parser') for x in soup.find_all('b'): print(x.string)
Top answer

Use an HTML parser and decode the bytes as suggested. BeautifulSoup makes the job very easy and is a lot more reliable than a regex when parsing HTML:

import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()
my_url = "https://in.finance.yahoo.com/q/h?s=AAPL"
response = http.request("GET", my_url)
html = response.data.decode("utf-8")

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, "html.parser")
print([a["href"] for a in soup.find_all("a", href=True)])

If you only want the links starting with http, you can use a CSS selector:

soup = BeautifulSoup(html, "html.parser")

print([a["href"] for a in soup.select("a[href^=http]")])

Which will give you:

['https://edit.yahoo.com/config/eval_register?.src=quote&.intl=in&.lang=en-IN&.done=https://in.finance.yahoo.com/q/h%3fs=AAPL', 'https://login.yahoo.com/config/login?.src=quote&.intl=in&.lang=en-IN&.done=https://in.finance.yahoo.com/q/h%3fs=AAPL', 'https://help.yahoo.com/l/in/yahoo/finance/', 'http://in.yahoo.com/bin/set?cmp=uheader&src=others', 'https://in.mail.yahoo.com/?.intl=in&.lang=en-IN', 'http://in.my.yahoo.com', 'https://in.yahoo.com/', 'https://in.finance.yahoo.com', 'https://in.finance.yahoo.com/investing/', 'https://yahoo.uservoice.com/forums/170320-india-finance/category/84926-data-accuracy', 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html', 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html', 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html', 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html', 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html', 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html', 'https://in.finance.yahoo.com/news/u-apple-ratchet-rhetoric-fight-030713673.html', 'https://in.finance.yahoo.com/news/common-mobile-software-could-opened-030713243.html', 'https://in.finance.yahoo.com/news/apple-likely-invoke-free-speech-030713050.html', 'https://in.finance.yahoo.com/news/bad-run-continues-freedom-251-092804374.html', 'https://in.finance.yahoo.com/news/u-appeals-court-upholds-apple-164738354.html', 'https://in.finance.yahoo.com/news/alphabet-overtakes-apple-market-value-140918508.html', 'https://in.finance.yahoo.com/news/alphabet-passes-apple-become-most-012844730.html', 'https://in.finance.yahoo.com/news/samsung-electronics-warns-difficult-2016-003517395.html', 'https://in.finance.yahoo.com/news/apple-shares-seen-staying-muted-132628994.html', 'https://in.finance.yahoo.com/news/china-weakening-apple-turns-india-032020575.html', 
'https://in.finance.yahoo.com/news/apple-sees-first-sales-dip-011402926.html', 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-031840725.html', 'https://in.finance.yahoo.com/news/apple-sells-fewer-iphones-expected-213516373.html', 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-221908381.html', 'http://help.yahoo.com/l/in/yahoo/finance/basics/fitadelay2.html', 'http://billing.finance.yahoo.com/realtime_quotes/signup?.src=quote&.refer=quote', 'http://www.capitaliq.com', 'http://www.csidata.com', 'http://www.morningstar.com/']

To get the text and href:

soup = BeautifulSoup(html, "html.parser")

a_tags = soup.select("a[href^=http]")
from pprint import pprint as pp
paired = dict((a.text, a["href"]) for a in a_tags)

pp(paired)

Output:

 {u'Alphabet overtakes Apple in market value - for now': 'https://in.finance.yahoo.com/news/alphabet-overtakes-apple-market-value-140918508.html',
 u'Alphabet passes Apple to become most valuable traded U.S. company': 'https://in.finance.yahoo.com/news/alphabet-passes-apple-become-most-012844730.html',
 u'Apple bets new 4-inch iPhone to draw big-screen converts in China, India': 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html',
 u'Apple iPhone sales weaker than expected': 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-221908381.html',
 u'Apple likely to invoke free-speech rights in encryption fight': 'https://in.finance.yahoo.com/news/apple-likely-invoke-free-speech-030713050.html',
 u'Apple sees first sales dip in more than a decade as super-growth era falters': 'https://in.finance.yahoo.com/news/apple-sells-fewer-iphones-expected-213516373.html',
 u'Apple shares fall most in two years in wake of earnings report': 'https://in.finance.yahoo.com/news/apple-shares-seen-staying-muted-132628994.html',
 u"Apple's new iPhone faces challenge measuring up in China, India": 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html',
 u"Bad run continues for 'Freedom 251', website down again on second day": 'https://in.finance.yahoo.com/news/bad-run-continues-freedom-251-092804374.html',
 u'Capital IQ': 'http://www.capitaliq.com',
 u'Commodity Systems, Inc. (CSI)': 'http://www.csidata.com',
 u'Download the new Yahoo Mail app': 'https://in.mobile.yahoo.com/mail/',
 u"EXCLUSIVE - Common mobile software could have opened San Bernardino shooter's iPhone": 'https://in.finance.yahoo.com/news/common-mobile-software-could-opened-030713243.html',
 u'Foxconn says in talks with Sharp over deal hitch; hopes for agreement': 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html',
 u'Help': 'https://help.yahoo.com/l/in/yahoo/finance/',
 u'Mail': 'https://in.mail.yahoo.com/?.intl=in&.lang=en-IN',
 u'Markets': 'https://in.finance.yahoo.com/investing/',
 u'Morningstar, Inc.': 'http://www.morningstar.com/',
 u'My Yahoo': 'http://in.my.yahoo.com',
 u'New User? Register': 'https://edit.yahoo.com/config/eval_register?.src=quote&.intl=in&.lang=en-IN&.done=https://in.finance.yahoo.com/q/h%3fs=AAPL',
 u'Report an Issue': 'https://yahoo.uservoice.com/forums/170320-india-finance/category/84926-data-accuracy',
 u'Samsung Elec warns of difficult 2016 as smartphone troubles spread': 'https://in.finance.yahoo.com/news/samsung-electronics-warns-difficult-2016-003517395.html',
 u'Samsung wins appeal in patent dispute with Apple': 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html',
 u'Sign In': 'https://login.yahoo.com/config/login?.src=quote&.intl=in&.lang=en-IN&.done=https://in.finance.yahoo.com/q/h%3fs=AAPL',
 u"Signs of life for Apple's stock as Wall St eyes new iPhone": 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html',
 u'Solid support for Apple in iPhone encryption fight - Reuters/Ipsos': 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html',
 u'U.S. appeals court upholds Apple e-book settlement': 'https://in.finance.yahoo.com/news/u-appeals-court-upholds-apple-164738354.html',
 u'U.S., Apple ratchet up rhetoric in fight over encryption': 'https://in.finance.yahoo.com/news/u-apple-ratchet-rhetoric-fight-030713673.html',
 u'With China weakening, Apple turns to India': 'https://in.finance.yahoo.com/news/china-weakening-apple-turns-india-032020575.html',
 u'Yahoo': 'https://in.yahoo.com/',
 u'Yahoo India Finance': 'https://in.finance.yahoo.com',
 u'other exchanges': 'http://help.yahoo.com/l/in/yahoo/finance/basics/fitadelay2.html',
 u'premium service.': 'http://billing.finance.yahoo.com/realtime_quotes/signup?.src=quote&.refer=quote'}

The selector a[href^=http] means: give me all the a tags that have an href attribute whose value starts with http.
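
To make the "attribute exists and starts with http" rule concrete without any third-party package, here is a sketch that applies the same filter with the standard library's html.parser instead of BeautifulSoup. The class name HttpLinkCollector and the sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser


class HttpLinkCollector(HTMLParser):
    """Collect href values of <a> tags whose href starts with 'http'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Mirrors a[href^=http]: the attribute must exist and start with "http".
        if href is not None and href.startswith("http"):
            self.links.append(href)


# Illustrative markup: one relative link and one anchor without href are skipped.
sample_html = """
<a href="https://example.com/news">News</a>
<a href="/q/h?s=AAPL">Older Headlines</a>
<a name="top">No href</a>
<a href="http://example.org/">Home</a>
"""

collector = HttpLinkCollector()
collector.feed(sample_html)
print(collector.links)
```

The relative link and the anchor without an href are dropped, which is the same outcome the CSS selector gives.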

Using lxml and the table id, you can get just the story links, which are probably what you are most interested in:

from lxml.etree  import fromstring, HTMLParser

# html is the decoded page source fetched earlier
xml = fromstring(html, HTMLParser())

a_tags = xml.xpath("//table[@id='yfncsumtab']//a")

paired = dict((a.xpath(".//text()")[0].strip(), a.xpath("./@href")[0]) for a in a_tags)
from pprint import pprint as pp
pp(paired)

Gives you:

{'Alphabet overtakes Apple in market value - for now': 'https://in.finance.yahoo.com/news/alphabet-overtakes-apple-market-value-140918508.html',
 'Alphabet passes Apple to become most valuable traded U.S. company': 'https://in.finance.yahoo.com/news/alphabet-passes-apple-become-most-012844730.html',
 'Apple bets new 4-inch iPhone to draw big-screen converts in China, India': 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html',
 'Apple iPhone sales weaker than expected': 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-221908381.html',
 'Apple likely to invoke free-speech rights in encryption fight': 'https://in.finance.yahoo.com/news/apple-likely-invoke-free-speech-030713050.html',
 'Apple sees first sales dip in more than a decade as super-growth era falters': 'https://in.finance.yahoo.com/news/apple-sells-fewer-iphones-expected-213516373.html',
 'Apple shares fall most in two years in wake of earnings report': 'https://in.finance.yahoo.com/news/apple-shares-seen-staying-muted-132628994.html',
 "Apple's new iPhone faces challenge measuring up in China, India": 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html',
 "Bad run continues for 'Freedom 251', website down again on second day": 'https://in.finance.yahoo.com/news/bad-run-continues-freedom-251-092804374.html',
 "EXCLUSIVE - Common mobile software could have opened San Bernardino shooter's iPhone": 'https://in.finance.yahoo.com/news/common-mobile-software-could-opened-030713243.html',
 'Foxconn says in talks with Sharp over deal hitch; hopes for agreement': 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html',
 'Older Headlines': '/q/h?s=AAPL&t=2016-01-27T03:49:08+05:30',
 'Samsung Elec warns of difficult 2016 as smartphone troubles spread': 'https://in.finance.yahoo.com/news/samsung-electronics-warns-difficult-2016-003517395.html',
 'Samsung wins appeal in patent dispute with Apple': 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html',
 "Signs of life for Apple's stock as Wall St eyes new iPhone": 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html',
 'Solid support for Apple in iPhone encryption fight - Reuters/Ipsos': 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html',
 'U.S. appeals court upholds Apple e-book settlement': 'https://in.finance.yahoo.com/news/u-appeals-court-upholds-apple-164738354.html',
 'U.S., Apple ratchet up rhetoric in fight over encryption': 'https://in.finance.yahoo.com/news/u-apple-ratchet-rhetoric-fight-030713673.html',
 'With China weakening, Apple turns to India': 'https://in.finance.yahoo.com/news/china-weakening-apple-turns-india-032020575.html'}

We can do the same with our select:

soup = BeautifulSoup(html, "html.parser")

a_tags = soup.select("#yfncsumtab a")
from pprint import pprint as pp
paired = dict((a.text, a["href"]) for a in a_tags)
pp(paired)

Which will match our lxml output:

{u'Alphabet overtakes Apple in market value - for now': 'https://in.finance.yahoo.com/news/alphabet-overtakes-apple-market-value-140918508.html',
 u'Alphabet passes Apple to become most valuable traded U.S. company': 'https://in.finance.yahoo.com/news/alphabet-passes-apple-become-most-012844730.html',
 u'Apple bets new 4-inch iPhone to draw big-screen converts in China, India': 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html',
 u'Apple iPhone sales weaker than expected': 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-221908381.html',
 u'Apple likely to invoke free-speech rights in encryption fight': 'https://in.finance.yahoo.com/news/apple-likely-invoke-free-speech-030713050.html',
 u'Apple sees first sales dip in more than a decade as super-growth era falters': 'https://in.finance.yahoo.com/news/apple-sells-fewer-iphones-expected-213516373.html',
 u'Apple shares fall most in two years in wake of earnings report': 'https://in.finance.yahoo.com/news/apple-shares-seen-staying-muted-132628994.html',
 u"Apple's new iPhone faces challenge measuring up in China, India": 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html',
 u"Bad run continues for 'Freedom 251', website down again on second day": 'https://in.finance.yahoo.com/news/bad-run-continues-freedom-251-092804374.html',
 u"EXCLUSIVE - Common mobile software could have opened San Bernardino shooter's iPhone": 'https://in.finance.yahoo.com/news/common-mobile-software-could-opened-030713243.html',
 u'Foxconn says in talks with Sharp over deal hitch; hopes for agreement': 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html',
 u'Older Headlines': '/q/h?s=AAPL&t=2016-01-27T03:49:08+05:30',
 u'Samsung Elec warns of difficult 2016 as smartphone troubles spread': 'https://in.finance.yahoo.com/news/samsung-electronics-warns-difficult-2016-003517395.html',
 u'Samsung wins appeal in patent dispute with Apple': 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html',
 u"Signs of life for Apple's stock as Wall St eyes new iPhone": 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html',
 u'Solid support for Apple in iPhone encryption fight - Reuters/Ipsos': 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html',
 u'U.S. appeals court upholds Apple e-book settlement': 'https://in.finance.yahoo.com/news/u-appeals-court-upholds-apple-164738354.html',
 u'U.S., Apple ratchet up rhetoric in fight over encryption': 'https://in.finance.yahoo.com/news/u-apple-ratchet-rhetoric-fight-030713673.html',
 u'With China weakening, Apple turns to India': 'https://in.finance.yahoo.com/news/china-weakening-apple-turns-india-032020575.html'}

You could just use //*[@id='yfncsumtab']//a, as ids should be unique.
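
The id-based lookup can be tried on a small document with the standard library alone. This sketch swaps lxml for xml.etree.ElementTree, which supports only a subset of XPath and needs well-formed markup, so it is a stand-in rather than a replacement for the lxml code above. The sample markup is made up; only the yfncsumtab id comes from the answer.

```python
import xml.etree.ElementTree as ET

# Illustrative, well-formed markup; real pages usually need lxml's HTMLParser.
sample = """
<html><body>
  <a href="/login">Sign In</a>
  <table id="yfncsumtab"><tr><td>
    <ul><li><a href="/news/one.html">Story one</a></li></ul>
    <ul><li><a href="/news/two.html">Story two</a></li></ul>
  </td></tr></table>
</body></html>
"""

root = ET.fromstring(sample.strip())

# Any element with the given id, then every <a> below it.
a_tags = root.findall(".//*[@id='yfncsumtab']//a")

paired = {a.text.strip(): a.get("href") for a in a_tags}
print(paired)
```

The Sign In link sits outside the element with that id, so it is not part of the result.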

To get the first six links from the table using XPath, we can use the ul elements and take the first six with ul[position() < 7]:

a_tags  = xml.xpath("//*[@id='yfncsumtab']//ul[position() < 7]//a")

paired = dict((a.xpath("./text()")[0].strip(), a.xpath("./@href")[0]) for a in a_tags)
from pprint import pprint as pp
pp(paired)

Which will give you:

{'Apple bets new 4-inch iPhone to draw big-screen converts in China, India': 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html',
 "Apple's new iPhone faces challenge measuring up in China, India": 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html',
 'Foxconn says in talks with Sharp over deal hitch; hopes for agreement': 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html',
 'Samsung wins appeal in patent dispute with Apple': 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html',
 "Signs of life for Apple's stock as Wall St eyes new iPhone": 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html',
 'Solid support for Apple in iPhone encryption fight - Reuters/Ipsos': 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html'}

For small tables, you could also simply slice.
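
A minimal sketch of the slicing alternative, using a hypothetical list of (text, href) pairs in place of real parser output. Note that slicing takes the first six links, which lines up with the XPath version only when each ul holds a single link.

```python
# Hypothetical (text, href) pairs standing in for links pulled from a table.
links = [("Story %d" % i, "/news/%d.html" % i) for i in range(1, 10)]

# Keep only the first six entries; the list is already in document order.
first_six = links[:6]

paired = dict(first_six)
print(paired)
```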

Medium
mateuszwiza.medium.com › text-extraction-from-html-by-keyword-using-python-6b4b5bc1364e
Text Extraction from HTML by Keyword using Python | by Mateusz Wiza | Medium
June 19, 2021 - To make it more clear, let’s assume that we have HTML code <p> Very <b> important </b> piece of code </p>. The resulting list of text strips would then have 3 elements and would look like this: ['Very', 'important', 'piece of code']. Such a format of results is useful in some cases (remember our second task?) but if we’re interested in the entire text from the HTML file, we can just concatenate all these strips of text into a super-long, single string. The final step is to divide this super-long string into pieces with at most 32,767 characters but this is also quite straightforward. We can use the fact that in Python each character in a string has its own index.
Apify
docs.apify.com › web scraping basics with python
Locating HTML elements with Python | Academy | Apify Documentation
We'll use BeautifulSoup to find those HTML elements which contain details about each product, such as title or price. In the previous lesson we've managed to print text of the page's main heading or count how many products are in the listing. Let's combine those two.
Stack Overflow
stackoverflow.com › questions › 55705747 › searching-a-html-file-line-by-line-for-a-string
python 3.x - searching a html file line by line for a string - Stack Overflow
April 16, 2019 - The break statement should be inside if line is '<html>':, so that the for loop is only broken when there's a match. Lines in the file include line breaks (and may include whitespace). Use line.strip() to remove leading and trailing whitespace. The is operator does not test whether two variables have the same value, but whether they point to the same object. Use == to compare values. ... def search(): with open('cate.html') as ht: for cnt, line in enumerate(ht): print(line.strip()) if cnt < 4: if line.strip() == '<html>': print("found") break
The Hitchhiker's Guide to Python
docs.python-guide.org › scenarios › scrape
HTML Scraping — The Hitchhiker's Guide to Python
(We need to use page.content rather than page.text because html.fromstring implicitly expects bytes as input.) tree now contains the whole HTML file in a nice tree structure which we can go over two different ways: XPath and CSSSelect.
Stack Overflow
stackoverflow.com › questions › 44005897 › search-for-something-in-a-html-string-via-python
Search for something in a HTML-String via Python - Stack Overflow
I use split() to search strings like that. Isolate that third line then ... to get "#1244587 - 16 ..." then since you know that the left numbers will always have 7 digits, you can take that string and do string[1:7] to get the left number.
GeeksforGeeks
geeksforgeeks.org › how-to-find-a-html-tag-that-contains-certain-text-using-beautifulsoup
How to find a HTML tag that contains certain text using BeautifulSoup ? - GeeksforGeeks
January 12, 2024 - In this example, BeautifulSoup is used to search gfg.html for specific text patterns in different HTML tags, and the found tags are printed to the console. ... # Importing library from bs4 import BeautifulSoup import re # Opening and reading the html file file = open("gfg.html", "r") contents = file.read() soup = BeautifulSoup(contents, 'html.parser') # Finding a pattern(certain text) pattern = 'Geeks For Geeks' # Anchor tag text1 = soup.find_all('a', text=pattern) print(text1) # Span tag text2 = soup.find_all('span', text=pattern) print(text2) # Finding a pattern(certain text) pattern2 = 'Python Program' # Heading tag text3 = soup.find_all('h1', text=pattern2) print(text3) # List tag text4 = soup.find_all('li', text=pattern2) print(text4) # Finding a pattern(certain text) pattern3 = 'GFG Website' # Table(row) tag text5 = soup.find_all('tr', text=pattern3) print(text5)
Opensource.com
opensource.com › article › 18 › 1 › parsing-html
Parsing HTML with Python | Opensource.com
If I could scan through all the HTML files for image references, then compare that list to the actual image files, chances are I would see a mismatch. ... I'm interested in the part between the first set of quotation marks, after src=. After some searching for a solution, I found a Python module called BeautifulSoup.
GeeksforGeeks
geeksforgeeks.org › how-to-parse-local-html-file-in-python
How to parse local HTML file in Python? - GeeksforGeeks
March 16, 2021 - In Python, we can parse HTML files using the pandas library and the Beautiful Soup library. The Beautiful Soup library is mainly used for web scraping.
GeeksforGeeks
geeksforgeeks.org › how-to-scrape-data-from-local-html-files-using-python
How to Scrape Data From Local HTML Files using Python? | GeeksforGeeks
April 21, 2021 - BeautifulSoup is a Python package used for parsing HTML and XML documents, it creates a parse tree for parsed paged which can be used for web scraping, it pulls data from HTML and XML files and works with your favorite parser to provide the ...
Stack Overflow
stackoverflow.com › questions › 25234217 › in-python-how-do-i-search-an-html-webpage-for-a-set-of-strings-in-text-file
beautifulsoup - In Python, how do I search an html webpage for a set of strings in text file? - Stack Overflow
August 11, 2014 - The problem is that you are trying to search with a string that contains some special character, like ' ', and '\n'. Note that str.strip() removes ' ' and other whitespace characters as well (e.g. tabs and newlines), so, update the following line: ... def textsearch(): #manga_file = open("managa.txt").readlines() ##assuming your text file has 3 titles manga_file = ["Karakuri Circus","Sun-ken Rock","Shaman King Flowers"] manga_html = "https://www.mangaupdates.com/releases.html" manga_page = urllib2.urlopen(manga_html) found = 0 soup = BeautifulSoup(manga_page) ##use BeautifulSoup to parse the s
Real Python
realpython.com › python-web-scraping-practical-introduction
A Practical Introduction to Web Scraping in Python – Real Python
December 21, 2024 - One way to extract information from a web page’s HTML is to use string methods. For instance, you can use .find() to search through the text of the HTML for the <title> tags and extract the title of the web page.
Stack Abuse
stackabuse.com › guide-to-parsing-html-with-beautifulsoup-in-python
Guide to Parsing HTML with BeautifulSoup in Python
September 21, 2023 - This article will give you a crash course on web scraping in Python with Beautiful Soup - a popular Python library for parsing HTML and XML.