I'm in an internship where (for whatever reason) the codebase for this internal analytics website does not use a standard framework like Flask or Django. Instead, someone created their own HTMLReporter class with its own methods like render_html. I don't really care about this, but one of the things I've been trying to do (for way too long) is make a bar chart and display it on the webpage. The data for the bar chart is returned by a Python file where some data wrangling happens.
At first, I just made a plot in Python, saved the image, and then embedded it in the HTML, but this didn't work for more complex images, and some of the axis labels were cut off. So I've been trying to pass the data for the barplot to a JavaScript file that renders a barplot, and I'm completely lost at this point.
With a framework like Flask, I could've done something like render_template(..., ..., data=data) and used jinja2 inside a <script> to get it done. But this custom class doesn't have that functionality.
I tried using the json-script template tag, but that just gives me this error:
... {{ barplot_data|json_script:"plot-data" }}
jinja2.exceptions.TemplateSyntaxError: expected token 'end of print statement', got ':'
I also tried using a python package like plotly or dash, but they seem to require me to host the plot on a public website before embedding it into html and I can't do that since the data used here is sensitive.
I'm right now trying to parse the json file with the javascript itself. I'm running into a bunch of incorrect-path issues, and trying to resolve them.
I'm just thinking that there's gotta be an easier way to do this... right?
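One possible direction, sketched under assumptions: json_script is a Django template filter, which is why Jinja2 rejects the colon syntax. Jinja2 does ship a built-in tojson filter (used as {{ barplot_data | tojson }}), and if the custom class cannot render templates at all, the same effect can be had with the standard library alone. The helper name embed_json, the element id, and the sample data below are hypothetical and are not part of the HTMLReporter class described above.

```python
import json


def embed_json(data, element_id):
    """Serialize data into a <script type="application/json"> element."""
    payload = json.dumps(data)
    # Escape characters that could terminate the script element early
    payload = (
        payload.replace("<", "\\u003c")
        .replace(">", "\\u003e")
        .replace("&", "\\u0026")
    )
    return f'<script id="{element_id}" type="application/json">{payload}</script>'


# Hypothetical bar chart data produced by the data-wrangling step
snippet = embed_json({"labels": ["a", "b"], "values": [3, 5]}, "plot-data")
print(snippet)
```

The resulting string can be concatenated into the page's HTML. On the JavaScript side, the data could then be read with JSON.parse(document.getElementById("plot-data").textContent), which avoids loading a separate JSON file and the path issues that come with it.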
Extracting data from javascript var inside <script> with python - Stack Overflow
javascript - How get data stored in vars from a js file in python - Stack Overflow
Can I get data from JavaScript to Python? - Stack Overflow
How can I scrape a page with dynamic content (created by JavaScript) in Python? - Stack Overflow
>>> import json
>>> weird_json = '{"x": 1, "x": 2, "x": 3}'
>>> x = json.loads(weird_json)
>>> x
{'x': 3}
>>> y = json.loads('["foo", {"bar":["baz", null, 1.0, 2]}]')
>>> y
['foo', {'bar': ['baz', None, 1.0, 2]}]
You can take the data sent from the page and convert it into a dictionary, enabling you to do:
print(x['x'])
This is the starting point: create a socket in Python that listens on a port, then have it receive data.
In JavaScript, open a socket that can connect to that port (the one Python listens on). Use, say: http://socket.io/
Is this a pure socket-to-socket issue?
A working relationship between Python and JavaScript (on port 80):
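import json
from socket import socket

s = socket()
s.bind(('', 80))  # binding to port 80 usually requires elevated privileges
s.listen(4)
ns, na = s.accept()
while True:
    try:
        data = ns.recv(8192)
    except OSError:
        data = b''
    if not data:  # error, or the client closed the connection
        ns.close()
        s.close()
        break
    # Assumes the client sent a bare JSON payload, not a full HTTP request
    data = json.loads(data)
    print(data)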
There you have a socket listening on port 80; connect to it and send whatever you want.
function callPython()
{
    var xmlhttp;
    if (window.XMLHttpRequest)
    {// code for IE7+, Firefox, Chrome, Opera, Safari
        xmlhttp=new XMLHttpRequest();
    }
    else
    {// code for IE6, IE5
        xmlhttp=new ActiveXObject("Microsoft.XMLHTTP");
    }
    xmlhttp.onreadystatechange=function()
    {
        if (xmlhttp.readyState==4 && xmlhttp.status==200)
        {
            document.getElementById("myDiv").innerHTML=xmlhttp.responseText;
        }
    }
    xmlhttp.open("GET","Form-data",true);
    xmlhttp.send();
}
For instance, you can send the form data as a string in place of "Form-data", and the response from Python can be put into "myDiv" :)
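Since XMLHttpRequest speaks HTTP rather than raw sockets, the Python side usually needs to answer with a proper HTTP response. Here is a minimal sketch using only the standard library; the handler name DataHandler, the /data path, and the JSON payload are assumptions made for illustration, and the server binds to an arbitrary free port on localhost instead of port 80.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class DataHandler(BaseHTTPRequestHandler):
    """Answers every GET request with a small JSON document."""

    def do_GET(self):
        # Hypothetical payload; a real handler would compute this
        body = json.dumps({"labels": ["a", "b"], "values": [3, 5]}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_once():
    """Start the server on a free port, issue one GET, return the parsed JSON."""
    server = HTTPServer(("127.0.0.1", 0), DataHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", "/data")
        data = json.loads(conn.getresponse().read())
        conn.close()
        return data
    finally:
        server.shutdown()
        server.server_close()


print(fetch_once())
```

In the browser, the xmlhttp.open("GET", ...) call would point at this server's URL, and xmlhttp.responseText would hold the JSON string.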
Something like Ghost.py should be able to do what you want.
This allows evaluation of JavaScript.
result, resources = ghost.evaluate(
    "document.getElementById('my-input').getAttribute('value');")
That should help.
I've used PhantomJS, the headless WebKit browser scripted in JavaScript, and this is a port and/or reworking of that in Python.
In my use case I just called PhantomJS via subprocess.call, as I couldn't be bothered to install the Ghost dependencies.
I just emitted JSON to stdout and ran json.loads on it.
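The "emit JSON to stdout and json.loads it" pattern can be sketched roughly as follows. PhantomJS is not assumed to be installed here, so a child Python process stands in for the headless-browser script; the command and its output are hypothetical placeholders.

```python
import json
import subprocess
import sys

# Stand-in for the headless-browser command: any program that prints JSON works
child_cmd = [
    sys.executable,
    "-c",
    "import json; print(json.dumps({'value': 'from-child'}))",
]


def run_and_parse(cmd):
    """Run cmd, capture its stdout, and parse it as JSON."""
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(completed.stdout)


result = run_and_parse(child_cmd)
print(result["value"])
```

Note that subprocess.run with capture_output=True is used here because the output has to be captured; subprocess.call only returns the exit code.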
The county_info is the first object in there, so just search for the delimiters:
import json
data = open('x.js').read()
i = data.find('{' )
j = data.find('}', i)
data = json.loads(data[i:j+1])
print(data)
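The find-based slicing above stops at the first closing brace, so it only works when the object has no nested braces. A small sketch of an alternative, assuming the object is valid JSON: json.JSONDecoder.raw_decode parses one complete value starting at a given index and ignores whatever follows. The sample source string and its contents are made up for illustration.

```python
import json


def first_json_object(js_source):
    """Parse the first {...} value found in js_source, nested objects included."""
    start = js_source.index("{")
    obj, _end = json.JSONDecoder().raw_decode(js_source, start)
    return obj


# Hypothetical stand-in for the contents of x.js
sample = 'var county_info = {"name": "Kent", "stats": {"pop": 10}};\nvar other = 1;'
county_info = first_json_object(sample)
print(county_info["stats"]["pop"])
```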
I tried creating an example file, but I'm not too sure if this is what you had in mind.
https://replit.com/@hunter-macias/get-js-data#main.py
You can use .readlines() instead of read() to get all of your data as a list of lines; then you just have to clean up the data based on how you formatted it.
EDIT Sept 2021: phantomjs isn't maintained any more, either
EDIT 30/Dec/2017: This answer appears in top results of Google searches, so I decided to update it. The old answer is still at the end.
dryscrape isn't maintained anymore, and the library the dryscrape developers recommend is Python 2 only. I have found using Selenium's Python library with PhantomJS as the web driver fast enough and easy enough to get the work done.
Once you have installed Phantom JS, make sure the phantomjs binary is available in the current path:
phantomjs --version
# result:
2.1.1
Example
To give an example, I created a sample page with the following HTML code:
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>Javascript scraping test</title>
</head>
<body>
    <p id='intro-text'>No javascript support</p>
    <script>
        document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
    </script>
</body>
</html>
Without JavaScript it says "No javascript support", and with JavaScript it says "Yay! Supports javascript".
Scraping without JS support:
import requests
from bs4 import BeautifulSoup
response = requests.get(my_url)
soup = BeautifulSoup(response.text, "html.parser")
soup.find(id="intro-text")
# Result:
<p id="intro-text">No javascript support</p>
Scraping with JS support:
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get(my_url)
p_element = driver.find_element_by_id(id_='intro-text')
print(p_element.text)
# result:
'Yay! Supports javascript'
You can also use the Python library dryscrape to scrape JavaScript-driven websites.
Scraping with JS support:
import dryscrape
from bs4 import BeautifulSoup
session = dryscrape.Session()
session.visit(my_url)
response = session.body()
soup = BeautifulSoup(response, "html.parser")
soup.find(id="intro-text")
# Result:
<p id="intro-text">Yay! Supports javascript</p>
We are not getting the correct results because any JavaScript-generated content needs to be rendered into the DOM first. When we fetch an HTML page, we get the initial DOM, before any JavaScript has modified it.
Therefore we need to render the JavaScript content before we crawl the page.
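To make the point concrete, here is a small standard-library sketch that parses a page like the sample one shown earlier in this thread without executing its script. The class name TextById is a hypothetical helper, not part of any library, and it only handles simple, non-nested elements.

```python
from html.parser import HTMLParser

SAMPLE_HTML = """<!DOCTYPE html>
<html>
<body>
<p id='intro-text'>No javascript support</p>
<script>
document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
</script>
</body>
</html>"""


class TextById(HTMLParser):
    """Collects the text of the element whose id matches target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.capturing = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == self.target_id:
            self.capturing = True

    def handle_endtag(self, tag):
        # Good enough for this flat sample: any closing tag ends the capture
        self.capturing = False

    def handle_data(self, data):
        if self.capturing:
            self.text += data


parser = TextById("intro-text")
parser.feed(SAMPLE_HTML)
static_text = parser.text
print(static_text)
```

The parser sees the script only as inert text, so the paragraph keeps its original content. A rendering step (a browser engine or a service like the ones listed in this answer) is what actually runs the script.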
As selenium is already mentioned many times in this thread (and how slow it gets sometimes was mentioned also), I will list two other possible solutions.
Solution 1: This is a very nice tutorial on how to use Scrapy to crawl javascript generated content and we are going to follow just that.
What we will need:
Docker installed on our machine. This is a plus over the other solutions so far, as it provides an OS-independent platform.
Splash installed, following the instructions listed for our corresponding OS.
Quoting from the Splash documentation: Splash is a javascript rendering service. It’s a lightweight web browser with an HTTP API, implemented in Python 3 using Twisted and QT5.
Essentially we are going to use Splash to render Javascript generated content.
Run the splash server:
sudo docker run -p 8050:8050 scrapinghub/splash
Install the scrapy-splash plugin:
pip install scrapy-splash
Assuming that we already have a Scrapy project created (if not, let's make one), we will follow the guide and update the settings.py:
Then go to your scrapy project’s settings.py and set these middlewares:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
The URL of the Splash server (if you’re using Win or OSX this should be the URL of the docker machine; see "How to get a Docker container's IP address from the host?"):
SPLASH_URL = 'http://localhost:8050'
And finally you need to set these values too:
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
Finally, we can use a SplashRequest:
In a normal spider you have Request objects which you can use to open URLs. If the page you want to open contains JS generated data you have to use SplashRequest (or SplashFormRequest) to render the page. Here’s a simple example:
import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = "jsscraper"
    start_urls = ["http://quotes.toscrape.com/js/"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url=url, callback=self.parse, endpoint='render.html'
            )

    def parse(self, response):
        for q in response.css("div.quote"):
            quote = QuoteItem()  # QuoteItem is assumed to be defined in the project's items.py
            quote["author"] = q.css(".author::text").extract_first()
            quote["quote"] = q.css(".text::text").extract_first()
            yield quote
SplashRequest renders the URL as html and returns the response which you can use in the callback (parse) method.
Solution 2: Let's call this experimental at the moment (May 2018)...
This solution is for Python's version 3.6 only (at the moment).
Do you know the requests module (well who doesn't)?
Now it has a web crawling little sibling: requests-HTML:
This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible.
Install requests-html:
pipenv install requests-html
Make a request to the page's url:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get(a_page_url)
Render the response to get the Javascript generated bits:
r.html.render()
Finally, the module seems to offer scraping capabilities.
Alternatively, we can try the well-documented way of using BeautifulSoup with the r.html object we just rendered.
Hi all, I need to extract a specific value embedded inside a large JS file served from a CDN. The file is not JSON; it contains a JS object literal like this (sanitized):
var Ii = {
    'strict': [
        { 'name': 'randoje', 'domain': 'example.com', 'value': 'abc%3dXYZ...' },
        ...
    ],
    ...
};
Right now I can only think of using a regex to grab the value 'abc%3dXYZ...'.
But I am not that familiar with regex, and I can't help but think that there is an easier way of doing this.
Any advice is appreciated a lot!
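One possible approach, offered as a sketch and not a definitive answer: when the object literal uses only quoted keys and plain string or number values (as the sanitized snippet suggests), it is also a valid Python literal, so it can be cut out by brace matching and handed to ast.literal_eval. This would break on JavaScript-only syntax such as unquoted keys, true/false/null, or comments. The function name extract_object_literal and the shortened sample value are hypothetical.

```python
import ast
import re
from urllib.parse import unquote


def extract_object_literal(js_source, var_name):
    """Return the text of the {...} literal assigned to `var var_name`."""
    match = re.search(r"\bvar\s+" + re.escape(var_name) + r"\s*=\s*\{", js_source)
    if match is None:
        raise ValueError(f"{var_name} not found")
    start = match.end() - 1  # index of the opening brace
    depth = 0
    quote = None  # current string delimiter while inside a string
    i = start
    while i < len(js_source):
        ch = js_source[i]
        if quote:
            if ch == "\\":
                i += 1  # skip the escaped character
            elif ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return js_source[start:i + 1]
        i += 1
    raise ValueError("unbalanced braces")


# Made-up sample mirroring the shape of the sanitized snippet
sample = """
var Ii = {
  'strict': [
    { 'name': 'randoje', 'domain': 'example.com', 'value': 'abc%3dXYZ' }
  ]
};
"""
data = ast.literal_eval(extract_object_literal(sample, "Ii"))
value = data["strict"][0]["value"]
print(value, unquote(value))
```

Once the literal is a regular Python dict, the entry can be picked by name or domain instead of by position, and unquote decodes the percent-encoded value if that is needed.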