search in html file python

stackoverflow.com › questions › 51354098 › how-to-search-a-string-in-a-html-file

You only get to read a file object once, unless you seek back to the beginning (for example with seek(0)). Either way, reading the file repeatedly isn't how you want to do this.

Iterate the file line-by-line, checking for your conditions.

error = False
report = False

# html_path is the path of the HTML file to scan
with open(html_path) as html_file:
  for line in html_file:
    print(line)
    if 'Error' in line:
      error = True
    if 'Report' in line:
      report = True

# After the single pass, report what was seen
if error:
  print('error')
elif report:
  print('result')
else:
  print('nothing')

Answer from Ignacio Vazquez-Abrams on Stack Overflow
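
A minimal sketch of the same single-pass idea, wrapped in a function so it can be reused. The function name scan_file, the keyword tuple, and the temporary sample file are illustrative assumptions and are not part of the answer above.

```python
import os
import tempfile


def scan_file(path, keywords=("Error", "Report")):
    """Return the set of keywords that occur anywhere in the file at `path`."""
    found = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:  # one pass; the file is never re-read
            for word in keywords:
                if word in line:
                    found.add(word)
    return found


# Hypothetical sample file, created only for this demonstration.
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("<html>\n<body>\n<p>Report generated</p>\n</body>\n</html>\n")
    sample_path = tmp.name

found = scan_file(sample_path)
os.remove(sample_path)
print(sorted(found))
```

Returning a set rather than printing inside the loop keeps the decision about what to report separate from the scan itself.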

Stack Overflow

stackoverflow.com › questions › 51354098 › how-to-search-a-string-in-a-html-file

python - How to search a string in a html file? - Stack Overflow

Stack Overflow

stackoverflow.com › questions › 17506355 › search-for-a-string-inside-html-source-with-python-3-3-1

search for a string inside html source with python (3.3.1) - Stack Overflow

Top answer

1 of 3

I'd recommend using a library such as Beautiful Soup if it's HTML you want to parse. No need for regex.

EDIT

Using the URL you just added, this is the sample code to get the HTML object out:

from bs4 import BeautifulSoup
import re
import urllib.request

data = urllib.request.urlopen('http://listadecasamento.fastshop.com.br/ListaCasamento/ListaCasamentoBusca.aspx?Data=2013-06-07').read()
soup = BeautifulSoup(data, 'html.parser')
element = soup.find('span', attrs={'class': re.compile(r".*\btxt_resultad_busca_casamento\b.*")})
print(element.text)

This will find the HTML span element on the page that has the class txt_resultad_busca_casamento, which I believe is the data you're trying to extract. From there you can just parse the .text attribute to get the exact data you're interested in.

EDIT 2

Oops, just realised that uses regular expressions... it seems class matching in BeautifulSoup isn't perfect! This line should work instead, at least until the site changes their HTML:

element = soup.find('div', attrs={'id': 'ctl00_body_uppBusca'}).find('span')

2 of 3

You can't reliably parse HTML with regular expressions, but if you treat the page as a plain bag of text you can use a regex or simple string slicing such as:

a = 'Resultado de Busca: Foram encontrados 264 casais' #your page text
num = int(a[a.index("encontrados")+len("encontrados"):a.index("casais")])
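
For comparison, a small sketch of the regex variant mentioned above, applied to the same sample string. The pattern is an assumption about the page text (a number sitting between the words "encontrados" and "casais") and would need adjusting if the wording changes.

```python
import re

text = "Resultado de Busca: Foram encontrados 264 casais"  # sample page text

# Capture the digits that sit between the two known words.
match = re.search(r"encontrados\s+(\d+)\s+casais", text)
num = int(match.group(1)) if match else None
print(num)
```

Unlike the index-based slice, this returns None instead of raising ValueError when the expected words are missing.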

Python

docs.python.org › 3 › library › html.parser.html

html.parser — Simple HTML and XHTML parser

Source code: Lib/html/parser.py This module defines a class HTMLParser which serves as the basis for parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML. Example HTML Parser...
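
As a rough illustration of how HTMLParser can be used for searching, the sketch below subclasses it and records text nodes that contain a search term together with the enclosing tag. The class name TextSearcher, the search term, and the sample markup are assumptions made for this example; void elements such as br are not treated specially here.

```python
from html.parser import HTMLParser


class TextSearcher(HTMLParser):
    """Record (enclosing tag, text) pairs whose text contains a search term."""

    def __init__(self, term):
        super().__init__()
        self.term = term
        self.stack = []    # currently open tags
        self.matches = []  # (tag, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self.term in data:
            parent = self.stack[-1] if self.stack else None
            self.matches.append((parent, data.strip()))


searcher = TextSearcher("Error")
searcher.feed("<html><body><h1>Report</h1><p>Error: disk full</p></body></html>")
searcher.close()
print(searcher.matches)
```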

TutorialsPoint

tutorialspoint.com › python_data_science › python_reading_html_pages.htm

Python - Reading HTML Pages

We can extract the tag value from all the instances of a tag using the following code.

from urllib.request import urlopen
from bs4 import BeautifulSoup

response = urlopen('http://tutorialspoint.com/python/python_overview.htm')
html_doc = response.read()
soup = BeautifulSoup(html_doc, 'html.parser')
for x in soup.find_all('b'):
    print(x.string)

Stack Overflow

stackoverflow.com › questions › 36294816 › python-searching-through-html-file-grabbing-a-tags-with-the-href-and-text-con

Python: searching through html file grabbing <a> tags with the href and text content - Stack Overflow

Top answer

1 of 1

Use an HTML parser and decode the bytes as suggested. BeautifulSoup makes the job very easy and is a lot more reliable than a regex when parsing HTML:

import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()
my_url = "https://in.finance.yahoo.com/q/h?s=AAPL"
a = http.request("GET", my_url)
html = a.data.decode("utf-8")

print([a["href"] for a in BeautifulSoup(html).find_all("a",href=True)])

If you only want the links starting with http you can use a css select:

soup = BeautifulSoup(html)

print([a["href"] for a in soup.select("a[href^=http]")])

Which will give you:

['https://edit.yahoo.com/config/eval_register?.src=quote&.intl=in&.lang=en-IN&.done=https://in.finance.yahoo.com/q/h%3fs=AAPL', 'https://login.yahoo.com/config/login?.src=quote&.intl=in&.lang=en-IN&.done=https://in.finance.yahoo.com/q/h%3fs=AAPL', 'https://help.yahoo.com/l/in/yahoo/finance/', 'http://in.yahoo.com/bin/set?cmp=uheader&src=others', 'https://in.mail.yahoo.com/?.intl=in&.lang=en-IN', 'http://in.my.yahoo.com', 'https://in.yahoo.com/', 'https://in.finance.yahoo.com', 'https://in.finance.yahoo.com/investing/', 'https://yahoo.uservoice.com/forums/170320-india-finance/category/84926-data-accuracy', 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html', 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html', 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html', 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html', 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html', 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html', 'https://in.finance.yahoo.com/news/u-apple-ratchet-rhetoric-fight-030713673.html', 'https://in.finance.yahoo.com/news/common-mobile-software-could-opened-030713243.html', 'https://in.finance.yahoo.com/news/apple-likely-invoke-free-speech-030713050.html', 'https://in.finance.yahoo.com/news/bad-run-continues-freedom-251-092804374.html', 'https://in.finance.yahoo.com/news/u-appeals-court-upholds-apple-164738354.html', 'https://in.finance.yahoo.com/news/alphabet-overtakes-apple-market-value-140918508.html', 'https://in.finance.yahoo.com/news/alphabet-passes-apple-become-most-012844730.html', 'https://in.finance.yahoo.com/news/samsung-electronics-warns-difficult-2016-003517395.html', 'https://in.finance.yahoo.com/news/apple-shares-seen-staying-muted-132628994.html', 'https://in.finance.yahoo.com/news/china-weakening-apple-turns-india-032020575.html', 
'https://in.finance.yahoo.com/news/apple-sees-first-sales-dip-011402926.html', 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-031840725.html', 'https://in.finance.yahoo.com/news/apple-sells-fewer-iphones-expected-213516373.html', 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-221908381.html', 'http://help.yahoo.com/l/in/yahoo/finance/basics/fitadelay2.html', 'http://billing.finance.yahoo.com/realtime_quotes/signup?.src=quote&.refer=quote', 'http://www.capitaliq.com', 'http://www.csidata.com', 'http://www.morningstar.com/']

To get the text and href:

soup = BeautifulSoup(html)

a_tags = soup.select("a[href^=http]")
from pprint import pprint as pp
paired = dict((a.text, a["href"]) for a in a_tags)

pp(paired)

Output:

 {u'Alphabet overtakes Apple in market value - for now': 'https://in.finance.yahoo.com/news/alphabet-overtakes-apple-market-value-140918508.html',
 u'Alphabet passes Apple to become most valuable traded U.S. company': 'https://in.finance.yahoo.com/news/alphabet-passes-apple-become-most-012844730.html',
 u'Apple bets new 4-inch iPhone to draw big-screen converts in China, India': 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html',
 u'Apple iPhone sales weaker than expected': 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-221908381.html',
 u'Apple likely to invoke free-speech rights in encryption fight': 'https://in.finance.yahoo.com/news/apple-likely-invoke-free-speech-030713050.html',
 u'Apple sees first sales dip in more than a decade as super-growth era falters': 'https://in.finance.yahoo.com/news/apple-sells-fewer-iphones-expected-213516373.html',
 u'Apple shares fall most in two years in wake of earnings report': 'https://in.finance.yahoo.com/news/apple-shares-seen-staying-muted-132628994.html',
 u"Apple's new iPhone faces challenge measuring up in China, India": 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html',
 u"Bad run continues for 'Freedom 251', website down again on second day": 'https://in.finance.yahoo.com/news/bad-run-continues-freedom-251-092804374.html',
 u'Capital IQ': 'http://www.capitaliq.com',
 u'Commodity Systems, Inc. (CSI)': 'http://www.csidata.com',
 u'Download the new Yahoo Mail app': 'https://in.mobile.yahoo.com/mail/',
 u"EXCLUSIVE - Common mobile software could have opened San Bernardino shooter's iPhone": 'https://in.finance.yahoo.com/news/common-mobile-software-could-opened-030713243.html',
 u'Foxconn says in talks with Sharp over deal hitch; hopes for agreement': 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html',
 u'Help': 'https://help.yahoo.com/l/in/yahoo/finance/',
 u'Mail': 'https://in.mail.yahoo.com/?.intl=in&.lang=en-IN',
 u'Markets': 'https://in.finance.yahoo.com/investing/',
 u'Morningstar, Inc.': 'http://www.morningstar.com/',
 u'My Yahoo': 'http://in.my.yahoo.com',
 u'New User? Register': 'https://edit.yahoo.com/config/eval_register?.src=quote&.intl=in&.lang=en-IN&.done=https://in.finance.yahoo.com/q/h%3fs=AAPL',
 u'Report an Issue': 'https://yahoo.uservoice.com/forums/170320-india-finance/category/84926-data-accuracy',
 u'Samsung Elec warns of difficult 2016 as smartphone troubles spread': 'https://in.finance.yahoo.com/news/samsung-electronics-warns-difficult-2016-003517395.html',
 u'Samsung wins appeal in patent dispute with Apple': 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html',
 u'Sign In': 'https://login.yahoo.com/config/login?.src=quote&.intl=in&.lang=en-IN&.done=https://in.finance.yahoo.com/q/h%3fs=AAPL',
 u"Signs of life for Apple's stock as Wall St eyes new iPhone": 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html',
 u'Solid support for Apple in iPhone encryption fight - Reuters/Ipsos': 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html',
 u'U.S. appeals court upholds Apple e-book settlement': 'https://in.finance.yahoo.com/news/u-appeals-court-upholds-apple-164738354.html',
 u'U.S., Apple ratchet up rhetoric in fight over encryption': 'https://in.finance.yahoo.com/news/u-apple-ratchet-rhetoric-fight-030713673.html',
 u'With China weakening, Apple turns to India': 'https://in.finance.yahoo.com/news/china-weakening-apple-turns-india-032020575.html',
 u'Yahoo': 'https://in.yahoo.com/',
 u'Yahoo India Finance': 'https://in.finance.yahoo.com',
 u'other exchanges': 'http://help.yahoo.com/l/in/yahoo/finance/basics/fitadelay2.html',
 u'premium service.': 'http://billing.finance.yahoo.com/realtime_quotes/signup?.src=quote&.refer=quote'}

The selector a[href^=http] matches every a tag that has an href attribute whose value starts with http.
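
To make the filter concrete without any third-party package, here is a sketch of the same "href starts with http" rule written with the standard library's html.parser instead of a CSS selector. This is a swapped-in technique, not the answer's method; the class name HttpLinkCollector and the sample markup are assumptions.

```python
from html.parser import HTMLParser


class HttpLinkCollector(HTMLParser):
    """Collect href values of <a> tags that start with 'http'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip anchors without href and relative links.
        if href and href.startswith("http"):
            self.links.append(href)


sample = (
    '<a href="https://example.com/news">News</a>'
    '<a href="/q/h?s=AAPL">Older Headlines</a>'
    '<a name="top">No href</a>'
    '<a href="http://example.org/">Home</a>'
)
collector = HttpLinkCollector()
collector.feed(sample)
print(collector.links)
```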

Using lxml and using the table id to get just the story links which you are probably most interested in:

from lxml.etree  import fromstring, HTMLParser

xml = fromstring(_html, HTMLParser())

a_tags = xml.xpath("//table[@id='yfncsumtab']//a")

paired = dict((a.xpath(".//text()")[0].strip(), a.xpath("./@href")[0]) for a in a_tags)
from pprint import pprint as pp
pp(paired)

Gives you:

{'Alphabet overtakes Apple in market value - for now': 'https://in.finance.yahoo.com/news/alphabet-overtakes-apple-market-value-140918508.html',
 'Alphabet passes Apple to become most valuable traded U.S. company': 'https://in.finance.yahoo.com/news/alphabet-passes-apple-become-most-012844730.html',
 'Apple bets new 4-inch iPhone to draw big-screen converts in China, India': 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html',
 'Apple iPhone sales weaker than expected': 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-221908381.html',
 'Apple likely to invoke free-speech rights in encryption fight': 'https://in.finance.yahoo.com/news/apple-likely-invoke-free-speech-030713050.html',
 'Apple sees first sales dip in more than a decade as super-growth era falters': 'https://in.finance.yahoo.com/news/apple-sells-fewer-iphones-expected-213516373.html',
 'Apple shares fall most in two years in wake of earnings report': 'https://in.finance.yahoo.com/news/apple-shares-seen-staying-muted-132628994.html',
 "Apple's new iPhone faces challenge measuring up in China, India": 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html',
 "Bad run continues for 'Freedom 251', website down again on second day": 'https://in.finance.yahoo.com/news/bad-run-continues-freedom-251-092804374.html',
 "EXCLUSIVE - Common mobile software could have opened San Bernardino shooter's iPhone": 'https://in.finance.yahoo.com/news/common-mobile-software-could-opened-030713243.html',
 'Foxconn says in talks with Sharp over deal hitch; hopes for agreement': 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html',
 'Older Headlines': '/q/h?s=AAPL&t=2016-01-27T03:49:08+05:30',
 'Samsung Elec warns of difficult 2016 as smartphone troubles spread': 'https://in.finance.yahoo.com/news/samsung-electronics-warns-difficult-2016-003517395.html',
 'Samsung wins appeal in patent dispute with Apple': 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html',
 "Signs of life for Apple's stock as Wall St eyes new iPhone": 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html',
 'Solid support for Apple in iPhone encryption fight - Reuters/Ipsos': 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html',
 'U.S. appeals court upholds Apple e-book settlement': 'https://in.finance.yahoo.com/news/u-appeals-court-upholds-apple-164738354.html',
 'U.S., Apple ratchet up rhetoric in fight over encryption': 'https://in.finance.yahoo.com/news/u-apple-ratchet-rhetoric-fight-030713673.html',
 'With China weakening, Apple turns to India': 'https://in.finance.yahoo.com/news/china-weakening-apple-turns-india-032020575.html'}

We can do the same with select:

soup = BeautifulSoup(_html)

a_tags = soup.select("#yfncsumtab a")
from pprint import pprint as pp
paired = dict((a.text, a["href"]) for a in a_tags)
pp(paired)

Which will match our lxml output:

{u'Alphabet overtakes Apple in market value - for now': 'https://in.finance.yahoo.com/news/alphabet-overtakes-apple-market-value-140918508.html',
 u'Alphabet passes Apple to become most valuable traded U.S. company': 'https://in.finance.yahoo.com/news/alphabet-passes-apple-become-most-012844730.html',
 u'Apple bets new 4-inch iPhone to draw big-screen converts in China, India': 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html',
 u'Apple iPhone sales weaker than expected': 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-221908381.html',
 u'Apple likely to invoke free-speech rights in encryption fight': 'https://in.finance.yahoo.com/news/apple-likely-invoke-free-speech-030713050.html',
 u'Apple sees first sales dip in more than a decade as super-growth era falters': 'https://in.finance.yahoo.com/news/apple-sells-fewer-iphones-expected-213516373.html',
 u'Apple shares fall most in two years in wake of earnings report': 'https://in.finance.yahoo.com/news/apple-shares-seen-staying-muted-132628994.html',
 u"Apple's new iPhone faces challenge measuring up in China, India": 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html',
 u"Bad run continues for 'Freedom 251', website down again on second day": 'https://in.finance.yahoo.com/news/bad-run-continues-freedom-251-092804374.html',
 u"EXCLUSIVE - Common mobile software could have opened San Bernardino shooter's iPhone": 'https://in.finance.yahoo.com/news/common-mobile-software-could-opened-030713243.html',
 u'Foxconn says in talks with Sharp over deal hitch; hopes for agreement': 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html',
 u'Older Headlines': '/q/h?s=AAPL&t=2016-01-27T03:49:08+05:30',
 u'Samsung Elec warns of difficult 2016 as smartphone troubles spread': 'https://in.finance.yahoo.com/news/samsung-electronics-warns-difficult-2016-003517395.html',
 u'Samsung wins appeal in patent dispute with Apple': 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html',
 u"Signs of life for Apple's stock as Wall St eyes new iPhone": 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html',
 u'Solid support for Apple in iPhone encryption fight - Reuters/Ipsos': 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html',
 u'U.S. appeals court upholds Apple e-book settlement': 'https://in.finance.yahoo.com/news/u-appeals-court-upholds-apple-164738354.html',
 u'U.S., Apple ratchet up rhetoric in fight over encryption': 'https://in.finance.yahoo.com/news/u-apple-ratchet-rhetoric-fight-030713673.html',
 u'With China weakening, Apple turns to India': 'https://in.finance.yahoo.com/news/china-weakening-apple-turns-india-032020575.html'}

You could just use //*[@id='yfncsumtab']//a, as ids should be unique.

To get the first six links from the table using an xpath, we can use the ul's and extract the first 6 using ul[position() < 7]:

a_tags  = xml.xpath("//*[@id='yfncsumtab']//ul[position() < 7]//a")

paired = dict((a.xpath("./text()")[0].strip(), a.xpath("./@href")[0]) for a in a_tags)
from pprint import pprint as pp
pp(paired)

Which will give you:

{'Apple bets new 4-inch iPhone to draw big-screen converts in China, India': 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html',
 "Apple's new iPhone faces challenge measuring up in China, India": 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html',
 'Foxconn says in talks with Sharp over deal hitch; hopes for agreement': 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html',
 'Samsung wins appeal in patent dispute with Apple': 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html',
 "Signs of life for Apple's stock as Wall St eyes new iPhone": 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html',
 'Solid support for Apple in iPhone encryption fight - Reuters/Ipsos': 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html'}

For small tables, you could also simply slice.

Stack Overflow

stackoverflow.com › questions › 36099261 › finding-string-in-html-file

python - Finding string in HTML file? - Stack Overflow

Top answer

1 of 2

As snakecharmerb said, by using

for line in html :

you iterate over the characters of html when it's a string, not the lines. But you can use

for line in html.split("\n") :

to iterate over the lines.

2 of 2

In the requests package, response.text is a string (response.content holds the raw bytes in Python 3), so with html = response.text you should search like this:

if find in html:
    # do something

By iterating over the response body with

for line in html:

you are iterating over the individual characters in the string (or over integer byte values, if html is bytes), not lines.
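
A short sketch of the difference described in both answers. The sample string is made up; splitlines() is used here in place of split("\n") as one possible choice, since it also copes with \r\n line endings.

```python
html = "<html>\n<body>\n<p>hello</p>\n</body>\n</html>"

# Iterating a str yields single characters, not lines.
first_items = [item for item in html][:3]

# splitlines() handles \n, \r\n and \r line endings.
lines = html.splitlines()

# A plain substring test needs no loop at all.
has_hello = "hello" in html

print(first_items, len(lines), has_hello)
```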

Medium

mateuszwiza.medium.com › text-extraction-from-html-by-keyword-using-python-6b4b5bc1364e

Text Extraction from HTML by Keyword using Python | by Mateusz Wiza | Medium

June 19, 2021 - To make it more clear, let’s assume that we have HTML code <p> Very <b> important </b> piece of code </p>. The resulting list of text strips would then have 3 elements and would look like this: ['Very', 'important', 'piece of code']. Such a format of results is useful in some cases (remember our second task?) but if we’re interested in the entire text from the HTML file, we can just concatenate all these strips of text into a super-long, single string. The final step is to divide this super-long string into pieces with at most 32,767 characters but this is also quite straightforward. We can use the fact that in Python each character in a string has its own index.
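
The two steps described in this excerpt can be sketched roughly as follows. The article's own code is not shown here, so this version swaps in the standard library's html.parser to collect the text strips; the names StripCollector and chunk, and the choice of a single space as the joining separator, are assumptions.

```python
from html.parser import HTMLParser

MAX_CHARS = 32767  # per-piece limit quoted in the excerpt


class StripCollector(HTMLParser):
    """Collect the non-empty, whitespace-stripped text nodes of a document."""

    def __init__(self):
        super().__init__()
        self.strips = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.strips.append(text)


def chunk(text, size=MAX_CHARS):
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


collector = StripCollector()
collector.feed("<p> Very <b> important </b> piece of code </p>")
collector.close()

strips = collector.strips
full_text = " ".join(strips)  # one long string
pieces = chunk(full_text)
print(strips, pieces)
```

The slicing in chunk relies on string indexing only, so it cuts at fixed character offsets and may split a word at a piece boundary.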

Apify

docs.apify.com › web scraping basics with python

Locating HTML elements with Python | Academy | Apify Documentation

We'll use BeautifulSoup to find those HTML elements which contain details about each product, such as title or price. In the previous lesson we've managed to print text of the page's main heading or count how many products are in the listing. Let's combine those two.

Stack Overflow

stackoverflow.com › questions › 55705747 › searching-a-html-file-line-by-line-for-a-string

python 3.x - searching a html file line by line for a string - Stack Overflow

April 16, 2019 - The break statement should be inside if line is '<html>':, so that the for loop is only broken when there's a match. Lines in the file include line breaks (and may include whitespace). Use line.strip() to remove leading and trailing whitespace. The is operator does not test whether two variables have the same value, but whether they point to the same object. Use == to compare values. ...

def search():
    with open('cate.html') as ht:
        for cnt, line in enumerate(ht):
            print(line.strip())
            if cnt < 4:
                if line.strip() == '<html>':
                    print("found")
                    break
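
The points about strip() and == can be checked with a small self-contained sketch. It uses io.StringIO as a stand-in for an open file; the function name find_line, the limit parameter, and the sample content are assumptions for illustration.

```python
import io

# Stand-in for an open file; the content is a made-up example.
fake_file = io.StringIO("<!DOCTYPE html>\n  <html>  \n<head></head>\n")


def find_line(handle, target, limit=4):
    """Return the index of the first line (among the first `limit`) equal to `target`."""
    for index, line in enumerate(handle):
        if index >= limit:
            break
        # == compares values; strip() drops the newline and surrounding spaces.
        if line.strip() == target:
            return index
    return None


position = find_line(fake_file, "<html>")
print(position)
```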

Stack Overflow

stackoverflow.com › questions › 57253144 › search-for-a-specific-word-in-html-file-in-python

Search for a specific word in html file in Python - Stack Overflow

Top answer

1 of 2

Try the below code:

num_occ = source.count("your_specific_word")

2 of 2

If you are looking for a web scraping tool, go with Beautiful Soup or Scrapy.

Otherwise, you can simply count the number of occurrences in the text using count:

from urllib.request import urlopen

with urlopen(url) as response:
   source = response.read().decode()  # bytes -> str (assumes UTF-8) so a str search word works
noOfOccurances = source.count(searchWord)

Python string count
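
One caveat with count is that it counts substrings, so a short word is also found inside longer words. A hedged sketch of a whole-word, case-insensitive alternative using the re module follows; the helper name count_word and the sample markup are assumptions.

```python
import re


def count_word(source, word):
    """Count case-insensitive whole-word occurrences of `word` in `source`."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return len(pattern.findall(source))


source = "<p>Cat and cat, but not concatenate or category.</p>"
substring_hits = source.count("cat")   # also matches inside longer words
word_hits = count_word(source, "cat")  # only the standalone word
print(substring_hits, word_hits)
```

Both approaches search the raw markup, so a word that appears in a tag name or attribute is counted too.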

The Hitchhiker's Guide to Python

docs.python-guide.org › scenarios › scrape

HTML Scraping — The Hitchhiker's Guide to Python

(We need to use page.content rather than page.text because html.fromstring implicitly expects bytes as input.) tree now contains the whole HTML file in a nice tree structure which we can go over two different ways: XPath and CSSSelect.

Stack Overflow

stackoverflow.com › questions › 44005897 › search-for-something-in-a-html-string-via-python

Search for something in a HTML-String via Python - Stack Overflow

I use split() to search strings like that. Isolate that third line then ... to get "#1244587 - 16 ...". Since you know that the left number will always have 7 digits, you can take that string and do string[1:8] to get the left number.
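
A small sketch of this split-and-slice approach. Only the "#1244587 - 16" shape comes from the answer; the surrounding three-line fragment and the variable names are invented for illustration. The slice ends at index 8 so that all seven digits after the '#' are kept.

```python
# Hypothetical three-line fragment; only the "#1244587 - 16" shape is from the answer.
html = "<div>\n<span>ignored</span>\n<td>#1244587 - 16</td>"

third_line = html.split("\n")[2]               # isolate the third line
cell = third_line.split(">")[1].split("<")[0]  # text between the tags
left, right = [part.strip() for part in cell.split("-")]
left_number = left[1:8]                        # skip '#', keep 7 digits
print(left_number, right)
```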

GeeksforGeeks

geeksforgeeks.org › how-to-find-a-html-tag-that-contains-certain-text-using-beautifulsoup

How to find a HTML tag that contains certain text using BeautifulSoup ? - GeeksforGeeks

January 12, 2024 - In this example, BeautifulSoup is used to search gfg.html for specific text patterns in different HTML tags, and the found tags are printed to the console. ...

# Importing library
from bs4 import BeautifulSoup
import re

# Opening and reading the html file
file = open("gfg.html", "r")
contents = file.read()
soup = BeautifulSoup(contents, 'html.parser')

# Finding a pattern(certain text)
pattern = 'Geeks For Geeks'

# Anchor tag
text1 = soup.find_all('a', text=pattern)
print(text1)

# Span tag
text2 = soup.find_all('span', text=pattern)
print(text2)

# Finding a pattern(certain text)
pattern2 = 'Python Program'

# Heading tag
text3 = soup.find_all('h1', text=pattern2)
print(text3)

# List tag
text4 = soup.find_all('li', text=pattern2)
print(text4)

# Finding a pattern(certain text)
pattern3 = 'GFG Website'

# Table(row) tag
text5 = soup.find_all('tr', text=pattern3)
print(text5)

Opensource.com

opensource.com › article › 18 › 1 › parsing-html

Parsing HTML with Python | Opensource.com

If I could scan through all the HTML files for image references, then compare that list to the actual image files, chances are I would see a mismatch. ... I'm interested in the part between the first set of quotation marks, after src=. After some searching for a solution, I found a Python module called BeautifulSoup.
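
The comparison described here (image references in the HTML versus image files that actually exist) can be sketched as below. The article goes on to use BeautifulSoup; this version swaps in the standard library's html.parser, and the class name ImgSrcCollector, the sample markup, and the files_on_disk set are all hypothetical.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = set()

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.add(src)


page = '<p><img src="img/a.png" alt="A"><img alt="no src"><img src="img/c.png"/></p>'
files_on_disk = {"img/a.png", "img/b.png"}  # hypothetical directory listing

collector = ImgSrcCollector()
collector.feed(page)

missing = collector.sources - files_on_disk  # referenced but not on disk
unused = files_on_disk - collector.sources   # on disk but never referenced
print(missing, unused)
```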

GeeksforGeeks

geeksforgeeks.org › how-to-parse-local-html-file-in-python

How to parse local HTML file in Python? - GeeksforGeeks

March 16, 2021 - In Python, we can parse HTML files using the pandas library and the Beautiful Soup library. The Beautiful Soup library is mainly used for web scraping.

Stack Overflow

stackoverflow.com › questions › 74757344 › python-search-a-file-for-text-based-on-other-text-found-look-ahead

html - python search a file for text based on other text found (look ahead)? - Stack Overflow

Top answer

1 of 1

Maybe you can use bs4 API. Find <a> tag that contains <img> and for a date find next <a> tag that contains <div>:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")  # html_doc contains the snippet from your question

for img in soup.select("a > img"):
    src = img["src"]
    date = img.find_next(lambda tag: tag.name == "a" and tag.div).text.strip()
    print(f"{src=} {date=}")

Prints:

src='folder/image.jpg' date='Jun 25, 2011 12:10:54pm'
src='folder/image2.jpg' date='Feb 10, 2012 1:10:54am'

EDIT: With updated input:

import re

rx_date = re.compile(
    r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{1,2},\s\d{4}\s\d{1,2}:\d{2}:\d{2}(?:AM|PM|am|pm)"
)


for img in soup.select("a > img"):
    src = img["src"]
    date = img.find_next(
        lambda tag: tag.name == "div"
        and rx_date.search(tag.find(text=True, recursive=False) or "")
    ).text.strip()
    print(f"{src=} {date=}")

Prints:

src='folder/image.jpg' date='Jun 25, 2011 12:10:54pm'
src='folder/image2.jpg' date='Feb 10, 2012 1:10:54am'
src='folder/image.jpg' date='Feb 10, 2012 1:10:54am'

GeeksforGeeks

geeksforgeeks.org › how-to-scrape-data-from-local-html-files-using-python

How to Scrape Data From Local HTML Files using Python? | GeeksforGeeks

April 21, 2021 - BeautifulSoup is a Python package used for parsing HTML and XML documents. It creates a parse tree for parsed pages which can be used for web scraping, it pulls data from HTML and XML files and works with your favorite parser to provide the ...

Stack Overflow

stackoverflow.com › questions › 25234217 › in-python-how-do-i-search-an-html-webpage-for-a-set-of-strings-in-text-file

beautifulsoup - In Python, how do I search an html webpage for a set of strings in text file? - Stack Overflow

August 11, 2014 - The problem is that you are trying to search with a string that contains some special characters, like ' ' and '\n'. Note that str.strip() removes ' ' and other whitespace characters as well (e.g. tabs and newlines), so, update the following line: ...

def textsearch():
    #manga_file = open("managa.txt").readlines() ##assuming your text file has 3 titles
    manga_file = ["Karakuri Circus","Sun-ken Rock","Shaman King Flowers"]
    manga_html = "https://www.mangaupdates.com/releases.html"
    manga_page = urllib2.urlopen(manga_html)
    found = 0
    soup = BeautifulSoup(manga_page) ##use BeautifulSoup to parse the s

Real Python

realpython.com › python-web-scraping-practical-introduction

A Practical Introduction to Web Scraping in Python – Real Python

December 21, 2024 - One way to extract information from a web page’s HTML is to use string methods. For instance, you can use .find() to search through the text of the HTML for the <title> tags and extract the title of the web page.
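
A minimal sketch of the string-method approach described in this excerpt, assuming the title tags appear exactly as <title> and </title>. The sample HTML and variable names are made up for illustration.

```python
html = "<html><head><title>Sample page</title></head><body><p>Hi</p></body></html>"

start_tag, end_tag = "<title>", "</title>"
title = None

start = html.find(start_tag)  # -1 when the tag is absent
if start != -1:
    text_start = start + len(start_tag)
    end = html.find(end_tag, text_start)
    if end != -1:
        title = html[text_start:end]

print(title)
```

Because find() matches the literal text only, variations such as <TITLE> or <title > are missed, which is one reason the parser-based approaches elsewhere on this page are more robust.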

Stack Abuse

stackabuse.com › guide-to-parsing-html-with-beautifulsoup-in-python

Guide to Parsing HTML with BeautifulSoup in Python

September 21, 2023 - This article will give you a crash course on web scraping in Python with Beautiful Soup - a popular Python library for parsing HTML and XML.