Nowadays, there is at least one better tool, called slimit:
SlimIt is a JavaScript minifier written in Python. It compiles JavaScript into more compact code so that it downloads and runs faster.
SlimIt also provides a library that includes a JavaScript parser, lexer, pretty printer and a tree visitor.
Demo:
Imagine we have the following JavaScript code:
$.ajax({
    type: "POST",
    url: 'http://www.example.com',
    data: {
        email: 'abc@g.com',
        phone: '9999999999',
        name: 'XYZ'
    }
});
And now we need to get email, phone and name values from the data object.
The idea here would be to instantiate a slimit parser, visit all nodes, filter all assignments and put them into the dictionary:
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor
data = """
$.ajax({
    type: "POST",
    url: 'http://www.example.com',
    data: {
        email: 'abc@g.com',
        phone: '9999999999',
        name: 'XYZ'
    }
});
"""
parser = Parser()
tree = parser.parse(data)
# Map each assignment's left-hand name to its right-hand value;
# nodes without a 'value' attribute (nested objects) fall back to ''.
fields = {getattr(node.left, 'value', ''): getattr(node.right, 'value', '')
          for node in nodevisitor.visit(tree)
          if isinstance(node, ast.Assign)}
print(fields)
It prints:
{'name': "'XYZ'",
 'url': "'http://www.example.com'",
 'type': '"POST"',
 'phone': "'9999999999'",
 'data': '',
 'email': "'abc@g.com'"}
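Note that the values in the printed dictionary still carry their JavaScript quotes. A minimal sketch for stripping them, assuming plain string literals with no escape sequences; the helper name `unquote_js_string` is hypothetical and not part of slimit:

```python
def unquote_js_string(raw):
    """Strip one pair of matching quotes from a JS string literal, if present."""
    if len(raw) >= 2 and raw[0] == raw[-1] and raw[0] in ('"', "'"):
        return raw[1:-1]
    return raw

# Sample values copied from the printed dictionary above.
fields = {'name': "'XYZ'", 'type': '"POST"', 'data': ''}
clean = {key: unquote_js_string(value) for key, value in fields.items()}
print(clean)
```

Values that are not quoted strings, such as the empty placeholder for `data`, pass through unchanged.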
Answer from alecxe on Stack Overflow
ANTLR, ANother Tool for Language Recognition, is a language tool that provides a framework for constructing recognizers, interpreters, compilers, and translators from grammatical descriptions containing actions in a variety of target languages.
The ANTLR site provides many grammars, including one for JavaScript.
As it happens, there is a Python API available - so you can call the lexer (recognizer) generated from the grammar directly from Python (good luck).
Dear webscrapers,
I'm scraping this website by intercepting the API with Insomnia and requests. The problem is that I don't get JSON back but a JavaScript file containing code and multiple JSON objects with the data I want. Does anybody know how I can parse the script to get what I need?
Any help appreciated, have been stuck on this for some time ^^'
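One possible approach to this situation, sketched with only the standard library: walk through the script and try `json.JSONDecoder.raw_decode` at every `{` or `[`. The function name `find_json_values` and the sample script are made up for illustration. This assumes the embedded data is strict JSON; an empty JavaScript block such as `{}` would also be picked up as an empty object.

```python
import json

def find_json_values(script):
    """Return every JSON object or array that can be decoded from the script."""
    decoder = json.JSONDecoder()
    found = []
    pos = 0
    while pos < len(script):
        if script[pos] in '{[':
            try:
                # raw_decode parses one JSON value starting at pos and
                # reports the index just past it.
                value, end = decoder.raw_decode(script, pos)
            except json.JSONDecodeError:
                pos += 1
                continue
            found.append(value)
            pos = end
        else:
            pos += 1
    return found

sample = 'window.a = {"id": 1, "tags": ["x", "y"]}; render(); window.b = [1, 2, 3];'
print(find_json_values(sample))
```

Because decoding resumes after each decoded value, nested arrays and objects are returned once as part of their parent rather than again on their own.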
» pip install esprima
Hello, I'm looking for a framework that will help me read a JS file locally and edit it. I want to be able to find a specific function inside the file and edit it, adding new lines of code at the beginning or at the end.
If your format really is just one or more var foo = [JSON array or object literal];, you can just write a dotall regex to extract them, then parse each one as JSON. For example:
>>> import json
>>> import re
>>> j = '''var line1=
[["Wed, 12 Jun 2013 01:00:00 +0000",22.4916114807,"2 sold"],
["Fri, 14 Jun 2013 01:00:00 +0000",27.4950008392,"2 sold"],
["Sun, 16 Jun 2013 01:00:00 +0000",19.5499992371,"1 sold"],
["Tue, 18 Jun 2013 01:00:00 +0000",17.25,"1 sold"],
["Sun, 23 Jun 2013 01:00:00 +0000",15.5420341492,"2 sold"],
["Thu, 27 Jun 2013 01:00:00 +0000",8.79045295715,"3 sold"],
["Fri, 28 Jun 2013 01:00:00 +0000",10,"1 sold"]];'''
>>> values = re.findall(r'var.*?=\s*(.*?);', j, re.DOTALL | re.MULTILINE)
>>> for value in values:
...     print(json.loads(value))
[['Wed, 12 Jun 2013 01:00:00 +0000', 22.4916114807, '2 sold'],
 ['Fri, 14 Jun 2013 01:00:00 +0000', 27.4950008392, '2 sold'],
 ['Sun, 16 Jun 2013 01:00:00 +0000', 19.5499992371, '1 sold'],
 ['Tue, 18 Jun 2013 01:00:00 +0000', 17.25, '1 sold'],
 ['Sun, 23 Jun 2013 01:00:00 +0000', 15.5420341492, '2 sold'],
 ['Thu, 27 Jun 2013 01:00:00 +0000', 8.79045295715, '3 sold'],
 ['Fri, 28 Jun 2013 01:00:00 +0000', 10, '1 sold']]
Of course this makes a few assumptions:
- A semicolon at the end of the line must be an actual statement separator, not the middle of a string. This should be safe because JS doesn't have Python-style multiline strings.
- The code actually does have semicolons at the end of each statement, even though they're optional in JS. Most JS code has those semicolons, but it obviously isn't guaranteed.
- The array and object literals really are JSON-compatible. This definitely isn't guaranteed; for example, JS can use single-quoted strings, but JSON can't. But it does work for your example.
- Your format really is this well-defined. For example, if there might be a statement like var line2 = [[1]] + line1; in the middle of your code, it's going to cause problems.
Note that if the data might contain JavaScript literals that aren't all valid JSON, but are all valid Python literals (which isn't likely, but isn't impossible, either), you can use ast.literal_eval on them instead of json.loads. But I wouldn't do that unless you know this is the case.
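The regex-plus-JSON idea, together with the `ast.literal_eval` fallback, could be wrapped up roughly as follows. This is only a sketch under the same assumptions listed above; the names `VAR_PATTERN` and `extract_js_vars` and the shortened sample data are invented for the example.

```python
import ast
import json
import re

# Matches "var NAME = VALUE;" lazily, up to the first semicolon.
VAR_PATTERN = re.compile(r'var\s+(\w+)\s*=\s*(.*?);', re.DOTALL)

def extract_js_vars(source):
    """Map each variable name to its parsed value, skipping unparseable ones."""
    result = {}
    for name, literal in VAR_PATTERN.findall(source):
        try:
            result[name] = json.loads(literal)
        except ValueError:
            try:
                # Fallback for literals that are valid Python but not valid
                # JSON, such as single-quoted strings.
                result[name] = ast.literal_eval(literal)
            except (ValueError, SyntaxError):
                continue
    return result

js = '''var line1=
[["Wed, 12 Jun 2013 01:00:00 +0000",22.5,"2 sold"],
["Fri, 28 Jun 2013 01:00:00 +0000",10,"1 sold"]];
var label = 'price';
var line2 = [[1]] + line1;'''
print(extract_js_vars(js))
```

Here `line1` parses as JSON, `label` only parses through the fallback, and `line2` is skipped because it is an expression rather than a literal.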
Okay, so there are a few ways to do it, but I ended up simply using a regular expression to find everything between line1= and ;
import re
# Read page data as a string
pageData = sock.read()
# Set p as the regular expression
p = re.compile('(?<=line1=)(.*)(?=;)')
# Find all instances of the regular expression in pageData
parsed = p.findall(pageData)
# Evaluate the match as Python code => turn it into a Python list
newParsed = eval(parsed[0])
Regex is nice when the source is cleanly formatted, but is this method better (EDIT: or worse!) than any of the other answers here?
EDIT: I ultimately used the following:
import json
import re
# Read page data as a string
pageData = sock.read()
# Set p as the regular expression
p = re.compile('(?<=line1=)(.*)(?=;)')
# Find all instances of the regular expression in pageData
parsed = p.findall(pageData)
# Load as JSON instead of using eval to avoid executing unknown code
newParsed = json.loads(parsed[0])
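One thing worth checking with this pattern: `.` does not match newlines by default, and the greedy `(.*)` runs to the last semicolon on the line. The sketch below uses made-up page data to show both effects and a variant (lazy quantifier plus `re.DOTALL`) that may behave better; it is an illustration, not the original poster's code.

```python
import json
import re

# Pattern as used above.
strict = re.compile(r'(?<=line1=)(.*)(?=;)')
# Variant: DOTALL lets "." cross newlines, and the lazy ".*?" stops at the first ";".
relaxed = re.compile(r'(?<=line1=)(.*?)(?=;)', re.DOTALL)

# Case 1: the value spans two lines, so the strict pattern finds nothing.
page_data = 'var line1=[["a",1],\n["b",2]]; var line2=[["c",3]];'
print(strict.findall(page_data))

# Case 2: two statements on one line, so the greedy match swallows both.
single_line = 'var line1=[1]; var line2=[2];'
print(strict.findall(single_line))

# The relaxed variant isolates just the line1 value in both cases.
new_parsed = json.loads(relaxed.findall(page_data)[0])
print(new_parsed)
print(json.loads(relaxed.findall(single_line)[0]))
```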
» pip install calmjs.parse
» npm install dt-python-parser
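I suggest you take a look at BeautifulSoup: it can help you extract JavaScript code from an HTML file (but not parse or run it):
from bs4 import BeautifulSoup
source = """<html>...</html>"""
# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(source, "html.parser")
# Text of the first <script> element; raises IndexError if there is none
js_code = soup.find_all("script")[0].text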
Then you can use some JavaScript interpreter to run the code and get the variables - there are some out there like this one or this one. Just Google it.
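If installing BeautifulSoup is not an option, the extraction step can be approximated with the standard library's `html.parser`. This is a swapped-in technique, not the BeautifulSoup API; the class name `ScriptExtractor` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser

class ScriptExtractor(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self._in_script = True
            self.scripts.append('')

    def handle_endtag(self, tag):
        if tag == 'script':
            self._in_script = False

    def handle_data(self, data):
        # Script content may arrive in several chunks, so append to the last entry.
        if self._in_script:
            self.scripts[-1] += data

source = '<html><head><script>var line1 = [1, 2];</script></head><body><p>hi</p></body></html>'
extractor = ScriptExtractor()
extractor.feed(source)
extractor.close()
js_code = extractor.scripts[0]
print(js_code)
```

As with the BeautifulSoup version, this only pulls the script text out of the page; parsing or running it is a separate step.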
I think you need to add a script tag so the computer can tell whether the code is JavaScript or Python. Use this:
<script type="text/javascript"> <!-- or python --> </script>