This may not be the easiest solution:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://edwvb.blogspot.ru/2018/03/3-tipa-povedeniya-kotorye-opredelyayut-uspeshnyh-prodavcov.html")
bsObj = BeautifulSoup(html, "html.parser")
nameList = bsObj.findAll("div", {"dir":"ltr", "style":"text-align: left;", "trbidi":"on"})
nameList = [i.text for i in nameList]
After that, we first need to convert nameList[1] to a pd.Series and then to a DataFrame:
import pandas as pd
S = pd.Series(nameList[1])
df = S.to_frame()
Answer from Edward on Stack Overflow
python - Extract elements from bs4.element.ResultSet - Stack Overflow
Scraping HTML table using Beautiful Soup - need help - driving me crazy!
python - Can I change datatype list to bs4.element.ResultSet? - Stack Overflow
python - Convert BeautifulSoup4 HTML Table to a list of lists, iterating over each Tag elements - Code Review Stack Exchange
I believe the code below is what you are looking for. The idea is to treat each entry in xml_list as an XML document, parse it, and get the attributes.
import xml.etree.ElementTree as ET
xml_list = ['<TAG attr_1="A" attr_2="1" attr_3="01"/>',
            '<TAG attr_1="B" attr_3="02"/>',
            '<TAG attr_1="C" attr_2="3" attr_3="03"/>']
result = []
for xml in xml_list:
    root = ET.fromstring(xml)
    entry = {}
    for i in range(1, 4):
        entry[f'attr_{i}'] = root.attrib.get(f'attr_{i}', None)
    result.append(entry)
print(result)
output
[{'attr_1': 'A', 'attr_2': '1', 'attr_3': '01'}, {'attr_1': 'B', 'attr_2': None, 'attr_3': '02'}, {'attr_1': 'C', 'attr_2': '3', 'attr_3': '03'}]
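If the attribute names are not known in advance, the same idea can be sketched without the hard-coded range(1, 4). This is only a variation on the answer above, assuming every entry is a single well-formed element; the names rows and columns are introduced here for illustration.

```python
import xml.etree.ElementTree as ET

xml_list = ['<TAG attr_1="A" attr_2="1" attr_3="01"/>',
            '<TAG attr_1="B" attr_3="02"/>',
            '<TAG attr_1="C" attr_2="3" attr_3="03"/>']

# Parse each string as its own XML document and copy its attributes.
rows = [dict(ET.fromstring(xml).attrib) for xml in xml_list]

# Collect every attribute name that occurs, in first-seen order.
columns = list(dict.fromkeys(name for row in rows for name in row))

# Fill attributes that are missing from an entry with None.
result = [{name: row.get(name) for name in columns} for row in rows]
print(result)
```

A list of dictionaries in this shape can be passed straight to pd.DataFrame(result) if a DataFrame is the end goal.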
Thanks, I managed to solve it using the much better XML function from pandas, which has been available since version 1.3.0.
It is called pd.read_xml! Pretty cool, and it makes this much easier.
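For reference, a minimal sketch of how pd.read_xml could be applied to the same data. The comment does not show the actual code, so the wrapping <root> element and the choice of parser="etree" (which avoids the lxml dependency) are assumptions made here.

```python
from io import StringIO

import pandas as pd

xml_list = ['<TAG attr_1="A" attr_2="1" attr_3="01"/>',
            '<TAG attr_1="B" attr_3="02"/>',
            '<TAG attr_1="C" attr_2="3" attr_3="03"/>']

# read_xml expects one document, so wrap the fragments in a single root
# element (the name "root" is arbitrary).
document = "<root>" + "".join(xml_list) + "</root>"

# Each child of the root becomes a row; its attributes become columns.
df = pd.read_xml(StringIO(document), parser="etree")
print(df)
```

Note that pandas infers column types here, so numeric-looking strings may be converted to numbers, and a missing attribute shows up as a missing value.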
Hi all - I was looking for some help if possible. I wrote a small script a while ago to scrape an HTML table, using pandas to convert it into a DataFrame. However, recently I have been getting the dreaded "Forbidden" message when trying to scrape. As I understand it, this is because the website is able to block the particular request module used by the pandas "read_html()" function.
So I am trying to tackle it another way, using Beautiful Soup and the Requests module, but I am only getting so far.
An example of a page I am trying to scrape is "http://www.etf.com/channels/bond-etfs" and I am trying to get the table named "Classification" into a Pandas DataFrame.
This is what I have so far:
# Import required modules
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'http://www.etf.com/channels/bond-etfs'
# Scrape the HTML at the url
r = requests.get(url)
# Turn the HTML into a Beautiful Soup object
soup = BeautifulSoup(r.text, 'lxml')
table_left = soup.findAll("table")[4]
table_right = soup.findAll("table")[5]
The table is a bit weird, as it seems to be split across two separate tables, one holding the first 3 columns I want and the other holding the rest of the info - I guess that is a result of the way the table tab selection works.
So now I have two "bs4.element.Tag" objects, and I am at a complete loss as to how I would transform these into pandas DataFrames... "read_html()" made it so easy!!
Would anyone have any idea how I would go about scraping this bloody table? It's driving me mad!
Thanks to all in advance :D
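One possible way to go from two bs4.element.Tag tables to a single DataFrame is sketched below. It does not fetch the real page: the inline HTML, the column names and the helper table_to_frame are all hypothetical stand-ins, and the sketch assumes the first row of each table holds the headers and that both tables list their rows in the same order.

```python
import pandas as pd
from bs4 import BeautifulSoup


def table_to_frame(table):
    """Turn one <table> Tag into a DataFrame, using its first row as header."""
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    rows = [row for row in rows if row]  # drop rows without any cells
    return pd.DataFrame(rows[1:], columns=rows[0])


# Hypothetical markup standing in for the two halves of the split table.
html = """
<table><tr><th>Fund</th><th>Ticker</th></tr>
<tr><td>Alpha Bond</td><td>AAA</td></tr>
<tr><td>Beta Bond</td><td>BBB</td></tr></table>
<table><tr><th>Segment</th></tr>
<tr><td>Treasury</td></tr>
<tr><td>Corporate</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
table_left, table_right = soup.find_all("table")[:2]

# Place the two halves side by side; rows are matched by position.
combined = pd.concat(
    [table_to_frame(table_left), table_to_frame(table_right)], axis=1
)
print(combined)
```

On the real page the indices 4 and 5 from the question would replace the [:2] slice, and the cell values arrive as strings, so numeric columns would still need converting.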
The table variable contains a list. You would need to call find_all on its members (even though you know it's a list with only one member), not on the entire thing.
>>> type(table)
<class 'bs4.element.ResultSet'>
>>> type(table[0])
<class 'bs4.element.Tag'>
>>> len(table[0].find_all('tr'))
6
>>>
table = soup.find_all(class_='dataframe')
This gives you a result set, i.e. all the elements that match the class. You can either iterate over them or, if you know you only have one dataFrame, you can use find instead. From your code it seems the latter is what you need, to deal with the immediate problem:
table = soup.find(class_='dataframe')
However, that is not all:
for row in table.find_all('tr'):
    col = table.find_all('td')
You probably want to iterate over the tds in the row here, rather than the whole table. (Otherwise you'll just see the first row over and over.)
for row in table.find_all('tr'):
    for col in row.find_all('td'):
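The answer stops at the inner loop header, so here is a small self-contained sketch of what such a nested loop might collect. The sample HTML is made up for illustration, and gathering the cell text into a list of lists is an assumption about what the asker wants.

```python
from bs4 import BeautifulSoup

# Hypothetical table with the class used in the question.
html = """
<table class="dataframe">
  <tr><th>id</th><th>name</th></tr>
  <tr><td>1</td><td>a</td></tr>
  <tr><td>2</td><td>b</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find(class_="dataframe")  # find returns a single Tag, not a ResultSet

rows = []
for row in table.find_all("tr"):
    # Look for cells inside this row only, not the whole table.
    cells = [col.get_text(strip=True) for col in row.find_all("td")]
    if cells:  # the header row has <th> cells only, so it is skipped
        rows.append(cells)

print(rows)
```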