This may not be the easiest solution:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://edwvb.blogspot.ru/2018/03/3-tipa-povedeniya-kotorye-opredelyayut-uspeshnyh-prodavcov.html")
bsObj = BeautifulSoup(html, "html.parser")

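# find_all() returns a bs4.element.ResultSet of the matching <div> tags
nameList = bsObj.find_all("div", {"dir":"ltr", "style":"text-align: left;", "trbidi":"on"})
# Keep only the text of each tag, which gives a plain list of strings
nameList = [i.text for i in nameList]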

After that, we first need to convert nameList[1] to a pd.Series and then to a DataFrame:

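import pandas as pd  # missing from the original snippet

S = pd.Series(nameList[1])
df = S.to_frame()  # to_frame() returns a new DataFrame, so keep the result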
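
If one row per matched element is wanted rather than only nameList[1], the list of strings can also be passed to the DataFrame constructor directly. A minimal sketch, assuming a hypothetical list of strings in place of the scraped nameList and an arbitrary column name "text":

```python
import pandas as pd

# Hypothetical stand-in for the list of strings extracted from the ResultSet
name_list = ["first block of text", "second block of text", "third block of text"]

# One element -> Series -> single-column DataFrame (the approach above)
single = pd.Series(name_list[1]).to_frame(name="text")

# All elements at once: one row per string
all_rows = pd.DataFrame({"text": name_list})

print(single.shape)    # (1, 1)
print(all_rows.shape)  # (3, 1)
```
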
Answer from Edward on Stack Overflow
Top answer

Try:

import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"
}

user_input = "Solarpanels"
site = f"https://news.google.com/rss/search?q={user_input}+when:14d&hl=en-GB&gl=DE&ceid=GB:en"


soup = BeautifulSoup(requests.get(site, headers=headers).content, "xml")

all_data = []
for item in soup.select("item"):
    all_data.append(
        {
            "title": item.title.text,
            "link": item.link.text,
            "pubDate": item.pubDate.text,
            "description": BeautifulSoup(
                item.description.text, "html.parser"
            ).get_text(strip=True), # or .get_text(strip=True, separator=" ")
            "source": item.source.text,
            "source_url": item.source["url"],
        }
    )

df = pd.DataFrame(all_data)
print(df.head().to_markdown(index=False))

Prints:

| title | link | pubDate | description | source | source_url |
| --- | --- | --- | --- | --- | --- |
| Australian research finds cost-effective way to recycle solar panels - The Guardian | https://www.theguardian.com/environment/2022/oct/16/australian-research-finds-cost-effective-way-to-recycle-solar-panels | Sat, 15 Oct 2022 23:51:00 GMT | Australian research finds cost-effective way to recycle solar panelsThe GuardianAustralian Researchers Find Cost-Effective Way To Recycle Solar PanelsTechJuiceHow could recycling solar panels be scaled up for sustainable effectESI AfricaSolar Panel Recycling Market to Rise at 37% CAGR during Forecast Period: TMR StudyDigital JournalView Full coverage on Google News | The Guardian | https://www.theguardian.com |
| Business Matters: Solar Panels on Commercial Property: Why You Should Make the Switch - Insider Media | https://www.insidermedia.com/blogs/north-west/business-matters-solar-panels-on-commercial-property-why-you-should-make-the-switch | Mon, 17 Oct 2022 09:13:35 GMT | Business Matters: Solar Panels on Commercial Property: Why You Should Make the SwitchInsider Media | Insider Media | https://www.insidermedia.com |
| Cost of living: The people using solar panels and turbines to reduce bills - bbc.co.uk | https://www.bbc.co.uk/news/uk-england-essex-62967716 | Wed, 05 Oct 2022 07:00:00 GMT | Cost of living: The people using solar panels and turbines to reduce billsbbc.co.uk | bbc.co.uk | https://www.bbc.co.uk |
| School applies for 120 solar panels - Stamford Mercury | https://www.stamfordmercury.co.uk/news/school-applies-for-120-solar-panels-9278921/ | Mon, 17 Oct 2022 11:00:00 GMT | School applies for 120 solar panelsStamford Mercury | Stamford Mercury | https://www.stamfordmercury.co.uk |
| Solar panels enable Lanarkshire village hall to cut running costs by 80 per cent - Daily Record | https://www.dailyrecord.co.uk/in-your-area/lanarkshire/solar-panels-enable-lanarkshire-village-28211459 | Sun, 16 Oct 2022 18:50:00 GMT | Solar panels enable Lanarkshire village hall to cut running costs by 80 per centDaily Record | Daily Record | https://www.dailyrecord.co.uk |
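
The description values in this output run the headline and the source name together because get_text(strip=True) joins the stripped strings with nothing in between. A small sketch of the difference the separator argument makes, using a hypothetical HTML fragment shaped like an RSS item description (the URL and text are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a link followed directly by a source label
fragment = (
    '<a href="https://example.com/story">Panel recycling</a>'
    '<font color="#6f6f6f">The Guardian</font>'
)
soup = BeautifulSoup(fragment, "html.parser")

joined = soup.get_text(strip=True)                 # strings concatenated directly
spaced = soup.get_text(strip=True, separator=" ")  # strings joined with a space

print(joined)  # Panel recyclingThe Guardian
print(spaced)  # Panel recycling The Guardian
```
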
Discussions

python - Extract elements from bs4.element.ResultSet - Stack Overflow
I'm looking to extract the two numeric value from this bs4. forecast = [ 1.2 - More on stackoverflow.com

Scraping HTML table using Beautiful Soup - need help - driving me crazy!
They're probably blocking the User-Agent - you could just use requests to grab the HTML and pass it on to pandas e.g. tables = pandas.read_html(requests.get(url, headers={'User-agent': 'Mozilla/5.0'}).text) Not sure if they added the ability to change the headers with read_html() - a quick search suggests it's not possible as of yet. More on reddit.com
r/learnpython, January 25, 2017

python - Can I change datatype list to bs4.element.ResultSet? - Stack Overflow
I wanted to get data from 2022-01 to 2022-04, so I used this code. from urllib.request import urlopen import pandas as pd from bs4 import BeautifulSoup df = [] for i in month: gu_code = 11680 ... More on stackoverflow.com
April 19, 2022

python - Convert BeautifulSoup4 HTML Table to a list of lists, iterating over each Tag elements - Code Review Stack Exchange
I am trying to convert a BeautifulSoup4 HTML Table to a list of lists, iterating over each Tag elements and handling them accordingly. I have an implementation of this that works at a surface level More on codereview.stackexchange.com
February 7, 2017

Hackore
shaynemei.github.io › hackore › bs4_basic.html
Hackore - Beautiful Soup Basics
June 14, 2018 - # To get content from webpages via `get()` import requests from bs4 import BeautifulSoup import pandas as pdp
Reddit
reddit.com › r/learnpython › scraping html table using beautiful soup - need help - driving me crazy!
r/learnpython on Reddit: Scraping HTML table using Beautiful Soup - need help - driving me crazy!
January 25, 2017

Hi all - I was looking for some help if possible. I wrote a small script a while ago to scrape an HTML table, using Pandas to convert it into a DataFrame; however, recently I have been getting the dreaded "Forbidden" message while trying to scrape. As I understand it, that is because the website is able to block the request module used by the Pandas "read_html()" function.

So I am trying to tackle it another way, using Beautiful Soup and the Requests module; however, I am only getting so far.

An example of a page I am trying to scrape is "http://www.etf.com/channels/bond-etfs" and I am trying to get the table named "Classification" into a Pandas DataFrame.

This is what I have so far:

# Import required modules
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'http://www.etf.com/channels/bond-etfs'

# Scrape the HTML at the url
r = requests.get(url)

# Turn the HTML into a Beautiful Soup object
soup = BeautifulSoup(r.text, 'lxml')

table_left = soup.findAll("table")[4]
table_right = soup.findAll("table")[5]

The table is a bit weird as it seems to be contained across two separate tables, one holding the first 3 columns I want, and the other holding the rest of the info - I guess that is a result of the way the table tab selection works.

So now I have two "bs4.element.Tag" objects, and am at a complete loss as to how I would transform these into Pandas DataFrames... "read_html()" made it so easy!!

Would anyone have any idea how I would go about scraping this bloody table, as it's driving me mad!

Thanks to all in advance :D
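
One possible way to turn two such Tag objects into a single DataFrame is to read the cell text row by row and join the two halves column-wise. This is only a sketch under assumptions: the markup below is a hypothetical inline snippet rather than the real page, table_to_frame is a made-up helper name, the first row of each table is assumed to be its header, and the rows of the two tables are assumed to line up by position.

```python
import pandas as pd
from bs4 import BeautifulSoup


def table_to_frame(table):
    """Convert one <table> Tag into a DataFrame, using its first row as the header."""
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    rows = [row for row in rows if row]  # drop rows that contain no cells
    header, *body = rows
    return pd.DataFrame(body, columns=header)


# Hypothetical markup: one logical table split across two <table> elements
html = """
<table><tr><th>Ticker</th><th>Fund</th></tr>
<tr><td>AAA</td><td>Fund A</td></tr>
<tr><td>BBB</td><td>Fund B</td></tr></table>
<table><tr><th>Segment</th></tr>
<tr><td>Bonds</td></tr>
<tr><td>Bonds</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")
table_left, table_right = soup.find_all("table")

# Rows are assumed to match by position, so the halves are joined side by side
df = pd.concat([table_to_frame(table_left), table_to_frame(table_right)], axis=1)
print(df)
```

On a real page the table indexes (4 and 5 in the post) and the header layout would need to be checked against the actual HTML.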

Stack Overflow
stackoverflow.com › questions › 71924081 › can-i-change-datatype-list-to-bs4-element-resultset
python - Can I change datatype list to bs4.element.ResultSet? - Stack Overflow
April 19, 2022 - 1 BeautifulSoup4 - python: how to merge two bs4.element.ResultSet and get one single list? 0 How to convert bs4.element.Tag to pandas · 1 Creating a pandas DataFrame from list of lists containing data from bs4 object · 2 Beautifulsoup convert string to ResultSet object of bs4.element module
Top answer

Here is the list of things I would think about to improve:

  • you are calling handle_bs4_element() twice for every cell here:

    data.append([handle_bs4_element(rc) for rc in row_cells if handle_bs4_element(rc)])
    

    Instead, you can either allow "falsy" values for the row cells and filter them afterwards, or expand the loop:

    result = []
    for rc in row_cells:
        cell_text = handle_bs4_element(rc)
        if cell_text:
            result.append(cell_text)
    data.append(result)
    
  • the DRY principle. There are several repeated blocks of code, like:

    if len(_res) == 1:
        return _res[0]
    else:
        return _res
    
  • using list comprehensions is not only more Pythonic, but actually faster. E.g. you can replace:

    _res = []
    for td_content in element.contents:
        _res.append(handle_bs4_element(td_content))
    

    with:

    _res = [handle_bs4_element(td_content) for td_content in element.contents]
    
  • you can use the short if/else one-liner, replacing:

    if len(_res) == 1:
        return _res[0]
    else:
        return _res
    

    with:

    return _res[0] if len(_res) == 1 else _res
    
  • variable naming. _res should not start with an underscore: a leading underscore conventionally marks private class or instance attributes, not regular local variables. _res should probably be called result, or maybe cell_data?

  • if you end up with more of this kind of tag-specific processing logic, adding each case as another elif would hurt readability and would not scale well. Consider using the "Extract Method" refactoring and defining a separate function for each of the cases.

  • instead of using the .contents list directly, look into using .get_text(), which collects an element's text, including the text of its children, recursively. Not sure if it is applicable to your problem.

  • or, instead of .contents list, you can use the .children generator
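
Several of the points above can be sketched together. The snippet below is illustrative only: handle_cell is a hypothetical stand-in for handle_bs4_element() (whose real logic is not shown here), the row markup is made up, and the assignment expression requires Python 3.8 or later.

```python
from bs4 import BeautifulSoup


def handle_cell(cell):
    # Hypothetical stand-in for handle_bs4_element():
    # get_text() gathers the text of the cell and its children recursively
    return cell.get_text(" ", strip=True)


row = BeautifulSoup(
    "<tr><td>Some <b>bold</b> text</td><td></td><td>5</td></tr>", "html.parser"
).tr
row_cells = list(row.children)  # .children iterates over the same nodes as .contents

# Call the handler once per cell and drop falsy results
result = [text for rc in row_cells if (text := handle_cell(rc))]
print(result)  # ['Some bold text', '5']
```
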


As a side note, there is also a simpler way to parse HTML tables: pandas.read_html(), which loads each HTML table into a DataFrame. You can then easily dump the DataFrame into a list, a CSV file, or an Excel file directly. For example, the following code:

from pprint import pprint

import pandas as pd


df = pd.read_html('table_sample.html')[0]  # get the first parsed dataframe
pprint(df.values.tolist())

Would automagically produce:

[[nan, 'Description', 'Col 1', 'Col 2', 'Col 3'],
 [1.0, 'Some paragraph text', 'x', '5', '2'],
 [2.0, 'HEADER 1', nan, nan, nan],
 [3.0, 'Some text: (1) Check out this Figure 1.0.', 'x', '2', '1'],
 [4.0, '(2) Some more text', 'x', '2', '1'],
 [5.0, '(3) Additional text', 'x', '2', '1'],
 [6.0, '(4) A bit more text', 'x', '2', '1'],
 [7.0, '(5) A span Figure 1.0 for  edited text. At this point the span starts again', 'x', '2', '1'],
 [8.0, 'HEADER 2', nan, nan, nan],
 [9.0, 'Weird formatting, because Confluence', 'x', '4', '2'],
 [10.0, 'HEADER 3', nan, nan, nan],
 [11.0, 'A paragraph about header 3.  This is just silly. Strong indeed.', 'x', '3', '3'],
 [12.0, 'Something about things or what not. Why is this in a span?', 'x', '2', '2'],
 [13.0, 'HEADER 4', nan, nan, nan],
 [14.0, 'Section 4 baby! Or header.  Confluence formatting fun.', 'x', '2', '3'],
 [15.0, 'Pretty boring span of text', 'x', '2', '2'],
 [16.0, 'HEADER 5', nan, nan, nan],
 [17.0, 'A big paragraph describing more stuff. Super exciting.', 'x', '4', '2']]
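
The "dump into a list or into CSV" step mentioned above could look roughly like this. A minimal sketch, assuming a small hand-built DataFrame in place of the one returned by read_html():

```python
import pandas as pd

# Hypothetical small frame standing in for a table parsed by read_html()
df = pd.DataFrame(
    {"Description": ["Some paragraph text", "HEADER 1"], "Col 1": ["x", "y"]}
)

rows = df.values.tolist()          # list of lists, one inner list per row
csv_text = df.to_csv(index=False)  # CSV as a string; pass a path to write a file

print(rows)
print(csv_text.splitlines()[0])    # header line of the CSV
```
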
GeeksforGeeks
geeksforgeeks.org › python › convert-xml-structure-to-dataframe-using-beautifulsoup-python
Convert XML structure to DataFrame using BeautifulSoup - Python - GeeksforGeeks
March 21, 2024 - # Python program to convert xml # structure into dataframes using beautifulsoup # Import libraries from bs4 import BeautifulSoup import pandas as pd # Open XML file file = open("gfg.xml", 'r') # Read the contents of that file contents = file.read() soup = BeautifulSoup(contents, 'xml') # Extracting the data authors = soup.find_all('author') titles = soup.find_all('title') prices = soup.find_all('price') pubdate = soup.find_all('publish_date') genres = soup.find_all('genre') des = soup.find_all('description') data = [] # Loop to store the data in a list named 'data' for i in range(0, len(author
Readthedocs
sdss-marvin.readthedocs.io › en › 2.2.5 › tools › results › results_set.html
The ResultSet Object - Marvin 2.2.5 documentation
Using numpy, you can handle the ResultSet and extract a subset of elements that satisfy some condition. Slicing a ResultSet with Numpy array of indices will return a standard Numpy array. For fancier manipulation, consider converting the results into an Astropy Table or Pandas dataframe:
Python Forum
python-forum.io › thread-35482.html
Parsing bs4 Resultset
November 8, 2021 - I'm having trouble understanding the intricacies of BeautifulSoup. I did a find for a specific 'select' tag using 'find(id=...)'. The returned results was the correct 'select' along with its options. Now I'm stuck on how to extract data from that res...
Python.org
discuss.python.org › python help
Python using BeautifulSoup - Python Help - Discussions on Python.org
March 22, 2021 - Hi, I’m very new to Python and have written a program to scrape for baseball player bio data. I made a soup object and extracted all the data to a list. I expect it to have 7 items in each record. Some of them don’t …