My guess is that you're expecting to see more structure in the JSON you're getting, such as a pair of curly braces or square brackets. But curly braces represent an object (a dictionary of key/value pairs), and square brackets represent an array or list. What you are encoding as JSON is neither of those things.
page.extractText returns text from the PDF being read as a single Python string value. The JSON encoding of a Python string value is the text of that string within a pair of double quotes. So the JSON you're getting will be of the form:
"<text from pdf document>"
It doesn't matter what's in the PDF. Whatever text you get back from page.extractText will always be a single Python string. What you get when you encode that string as JSON will always be that same text, with double quotes before and after it.
Here's a little code to illustrate this:
import json
s1 = "This is a Python string. A Python string encoded as JSON is the text of that string surrounded by double quotes"
print(s1)
print(json.dumps(s1))
Result:
This is a Python string. A Python string encoded as JSON is the text of that string surrounded by double quotes
"This is a Python string. A Python string encoded as JSON is the text of that string surrounded by double quotes"
Answer from CryptoFool on Stack Overflow
Simply passing a string to json.dumps() will not yield your desired result, since the string first needs to be split into key-value pairs.
If you need to extract a lot of data from an unstructured PDF, you may want to consider Adobe's PDF Extract API, which is available through a Python SDK. The API converts the structural and text information from a PDF directly into JSON, so you don't have to do it manually.
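As a rough illustration of that splitting step, here is a minimal sketch. It assumes the extracted text happens to contain lines of the form "Key: Value", which will not be true for every PDF. The function name text_to_pairs, the "unparsed" key, and the sample text are all hypothetical.

```python
import json

def text_to_pairs(text):
    """Collect 'Key: Value' lines into a dict; keep other non-empty lines under 'unparsed'."""
    pairs = {}
    unparsed = []
    for line in text.splitlines():
        # Split on the first colon only, so values such as "10:30" stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip() and value.strip():
            pairs[key.strip()] = value.strip()
        elif line.strip():
            unparsed.append(line.strip())
    if unparsed:
        pairs["unparsed"] = unparsed
    return pairs

# Hypothetical stand-in for the string returned by page.extractText()
sample = "Name: Jane Doe\nInvoice: 1042\nThank you for your business"
result = text_to_pairs(sample)
print(json.dumps(result, indent=2))
```

Because the input is now a dictionary rather than a bare string, json.dumps() emits a JSON object with curly braces. Real documents usually need splitting rules tailored to their layout.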
The JSON data will contain an array of elements with information such as the following:
{
    "Page": 1,
    "Path": "//Document/P",
    "Text": "The quick brown fox jumps over the lazy dog "
}
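Once you have JSON in that shape, regrouping it is ordinary dictionary work. The sketch below uses only the standard library and a hand-written sample. The top-level "elements" key and the sample entries are assumptions made for illustration, so check them against the file you actually receive.

```python
import json
from collections import defaultdict

# Hypothetical sample shaped like the element shown above.
# The "elements" key name is an assumption about the top-level layout.
raw = """
{"elements": [
  {"Page": 1, "Path": "//Document/H1", "Text": "Title "},
  {"Page": 1, "Path": "//Document/P", "Text": "The quick brown fox jumps over the lazy dog "},
  {"Page": 2, "Path": "//Document/P[2]", "Text": "Second page text "}
]}
"""

data = json.loads(raw)

# Group the text of each element by its page number.
text_by_page = defaultdict(list)
for element in data["elements"]:
    if "Text" in element:  # skip elements that carry no text
        text_by_page[element["Page"]].append(element["Text"].strip())

pages = {page: " ".join(parts) for page, parts in sorted(text_by_page.items())}
print(json.dumps(pages, indent=2))
```

The "Path" values can be filtered the same way, for example to keep only paragraph elements.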
pip install pydf2json
I've been searching for a while now for a library that can parse a PDF to JSON or XML format while keeping the document structure.
Popular libraries like pypdf often do not preserve the document structure. I thought about using Tesseract for OCR and then transforming the result into JSON, but I could not get it working. Is there a library that can parse a PDF to JSON while preserving the document structure, rather than just spitting out a block of text?