My guess is that you're expecting to see more structure in the JSON you are getting, such as a pair of curly braces or square brackets. But curly braces represent an object (key/value pairs, like a Python dictionary), and square brackets represent an array (like a Python list). What you are encoding as JSON is neither of those things.

page.extractText returns the text of the PDF page being read as a single Python string value. The JSON encoding of a Python string value is the text of that string within a pair of double quotes, with special characters such as quotes, backslashes and newlines escaped. So the JSON you're getting will be of the form:

"<text from pdf document>"

It doesn't matter what's in the PDF. Whatever text you get back from page.extractText will always be a single Python string. What you get when you encode that string as JSON will always be that same text, with double quotes before and after it.

Here's a little code to illustrate this:

import json
s1 = "This is a Python string.  A Python string encoded as JSON is the text of that string surrounded by double quotes"
print(s1)
print(json.dumps(s1))

Result:

This is a Python string.  A Python string encoded as JSON is the text of that string surrounded by double quotes
"This is a Python string.  A Python string encoded as JSON is the text of that string surrounded by double quotes"
Answer from CryptoFool on Stack Overflow
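As a hedged extension of the answer's example, the sketch below shows how json.dumps treats a string, a dictionary and a list. The sample values are made up and do not come from any real PDF. Quotes and newlines inside a string are escaped, so the encoded text is not always character-for-character identical to the original, but json.loads gives the original string back.

```python
import json

# Made-up stand-in for the text a PDF page might yield.
page_text = 'Name: "John"\nPhone No: 123456'

as_string = json.dumps(page_text)            # a JSON string: quoted and escaped
as_object = json.dumps({"Name": "John"})     # a JSON object: curly braces
as_array = json.dumps(["John", "123456"])    # a JSON array: square brackets

print(as_string)
print(as_object)
print(as_array)

# Decoding the JSON string recovers the original Python string.
round_tripped = json.loads(as_string)
```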
2 of 2
3

Simply converting a string with json.dumps() will not yield your desired result, since the string first needs to be split into key-value pairs.
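A minimal sketch of that splitting step, assuming the extracted text has exactly one "key: value" pair per line (real PDF text is often messier than this). The function name text_to_pairs and the sample text are illustrative only.

```python
import json


def text_to_pairs(text):
    """Split 'key: value' lines into a dict; lines without a colon are skipped."""
    pairs = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        # Split on the first colon only, so values may themselves contain colons.
        key, value = line.split(":", 1)
        pairs[key.strip()] = value.strip()
    return pairs


# Illustrative sample, one pair per line (an assumption about the layout).
sample = "Name : John\nAddress: 123street , USA\nPhone No: 123456"
result = json.dumps(text_to_pairs(sample), indent=2)
print(result)
```

Encoding the dictionary rather than the raw string is what produces the curly braces and key/value structure in the output.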

If you need to extract a lot of data from an unstructured PDF, you may want to consider using Adobe's extract PDF Python SDK. The API converts all the structural and text information from a PDF directly into JSON, so you don't have to do it manually.

The JSON data will contain an array of elements with information such as the following:

{
  "Page": 1,
  "Path": "//Document/P",
  "Text": "The quick brown fox jumps over the lazy dog "
}
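Once you have a list of elements shaped like the one above, ordinary Python is enough to regroup it. The sketch below assumes the elements have already been loaded into a list of dictionaries with Page, Path and Text keys. The second and third elements and the helper name text_by_page are invented for illustration and are not taken from Adobe's documentation.

```python
import json
from collections import defaultdict

# The first element mirrors the example above; the other two are made up.
elements = [
    {"Page": 1, "Path": "//Document/P", "Text": "The quick brown fox jumps over the lazy dog "},
    {"Page": 1, "Path": "//Document/P[2]", "Text": "Second paragraph "},
    {"Page": 2, "Path": "//Document/P[3]", "Text": "Third paragraph "},
]


def text_by_page(items):
    """Group the stripped Text of each element under its page number."""
    pages = defaultdict(list)
    for item in items:
        # Assumption: some elements may carry no text, so skip those.
        if "Text" in item:
            pages[item["Page"]].append(item["Text"].strip())
    return dict(pages)


grouped = text_by_page(elements)
print(json.dumps(grouped, indent=2))
```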
Discussions

parse pdf to json using python
No, because the pdf format does not save the document structure. The way pdf works is by saving the absolute position of things, not the relative position. More on reddit.com
🌐 r/learnpython
5
2
March 23, 2023
How to convert the extracted text from PDF to JSON or XML format in Python? - Stack Overflow
I am using PyPDF2 to extract the data from PDF file and then converting into Text format? PDF format for the file is like this: Name : John Address: 123street , USA Phone No: 123456 Gender: ... More on stackoverflow.com
🌐 stackoverflow.com
September 16, 2023
How do I convert JSON to PDF? I have no programming skills.
JSON is just text, e. g.: {"menu": { "id": "file", "value": "File", "popup": { "menuitem": [ {"value": "New", "onclick": "CreateNewDoc()"}, {"value": "Open", "onclick": "OpenDoc()"}, {"value": "Close", "onclick": "CloseDoc()"} ] } }} so you can open your file in any text editor like Notepad and use the Windows "print to PDF" function. If you want some JSON-specific formatting & coloring, there are lots of programming editors like Notepad++ to do that. Again, use Windows' internal PDF printer. More on reddit.com
🌐 r/software
15
1
March 4, 2024
extracting data from 100+ pdf files
I actually just built a program to extract lab test data from PDFs for work and dump it into an ordered csv file. I won’t go into the nitty gritty details of everything I did, but it all started with the use of the pdfplumber package. It’s not too difficult to use. The GitHub will explain all of the necessary details. Using .extract_text() will give you all of the text as a single string with \n representing line breaks. From there it’s up to you to write the appropriate text parsing using regex and such. Note that pdfplumber won’t work in scanned PDFs. They have to have been computer generated. One nice thing about pdfplumber is using .extract_words() will generate a list of dictionaries for every word in the pdf. The dictionaries have location info which you can use to help crop the pdf based on the relative location of what you’re looking for to other nearby words. More on reddit.com
🌐 r/learnpython
92
264
September 25, 2020
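The workflow described in that post might look roughly like the sketch below. It assumes pdfplumber is installed and uses its open, pages and extract_text calls. The "name: value unit" line layout, the regular expression and the function names are hypothetical and would need adapting to a real report.

```python
import re


def read_pdf_text(path):
    """Return the text of every page joined by newlines (needs pdfplumber)."""
    import pdfplumber  # third-party; imported lazily so the parser below runs without it

    with pdfplumber.open(path) as pdf:
        # extract_text() may yield nothing for image-only pages, hence the fallback.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


# Hypothetical layout: "<name>: <number> <unit>" on each line.
LINE_RE = re.compile(r"^(?P<name>[^:]+):\s*(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>\S*)$")


def parse_results(text):
    """Turn matching lines into dictionaries; other lines are ignored."""
    rows = []
    for line in text.split("\n"):
        match = LINE_RE.match(line.strip())
        if match:
            rows.append({
                "name": match["name"].strip(),
                "value": float(match["value"]),
                "unit": match["unit"],
            })
    return rows


# Made-up text standing in for what read_pdf_text() would return.
sample_text = "Lab report\nGlucose: 5.4 mmol/L\nSodium: 140 mmol/L"
rows = parse_results(sample_text)
print(rows)
```

Keeping the PDF reading and the text parsing in separate functions makes the parsing testable on plain strings, without a PDF at hand.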
People also ask

What is JSON Format?
JSON (JavaScript Object Notation) is an open standard file format for sharing data that uses human-readable text to store and transmit data. JSON files are stored with the .json extension. JSON requires less formatting and is a good alternative for XML. JSON is derived from JavaScript but is a language-independent data format. The generation and parsing of JSON is supported by many modern programming languages. application/json is the media type used for JSON.
🌐
products.aspose.cloud
products.aspose.cloud › aspose.total › python › conversion › pdf to json conversion
Free online PDF to JSON conversion App via python
Is it safe to convert PDF to JSON in the Cloud?
Of course! Aspose Cloud uses Amazon EC2 cloud servers that guarantee the security and resilience of the service. Please read more about Aspose's Security Practices.
Starting with Aspose.Total REST APIs Using Python SDK: A Beginner's Guide
Quickstart not only guides through the initialization of Aspose.Total Cloud API, it also helps in installing the required libraries.
🌐
GitHub
github.com › antoinecarme › pdf_to_json
GitHub - antoinecarme/pdf_to_json: Python module to Convert a PDF file to a JSON format
May 19, 2024 - import pdf_to_json as p2j import json # web document : UDHR url = "https://www.ohchr.org/EN/UDHR/Documents/UDHR_Translations/eng.pdf" # Convert the document into a python dictionary lConverter = p2j.pdf_to_json.pdf_to_json_converter() lDict ...
Starred by 6 users
Forked by 5 users
Languages   Python 100.0%
🌐
Nanonets
nanonets.com › blog › pdf-to-json
Convert PDF to JSON - PDF to JSON Python & Javascript
September 17, 2025 - Learn how to convert PDF to JSON using Python & Javascript. Automate data extraction from PDF to JSON with our API.
🌐
PyPI
pypi.org › project › pydf2json
pydf2json · PyPI
PDF analysis. Convert contents of PDF to a JSON-style python dictionary.
pip install pydf2json
Published   Sep 15, 2022
Version   2.4.0
🌐
Reddit
reddit.com › r/learnpython › parse pdf to json using python
r/learnpython on Reddit: parse pdf to json using python
March 23, 2023

I've been searching for a while now for a library that can parse a PDF to JSON or XML while keeping the document structure.
Popular libraries like pypdf often do not preserve the document structure. I thought about using Tesseract for OCR and then transforming the output into JSON, but could not get it working. Is there a library that can parse a PDF to JSON while preserving the document structure, rather than just spitting out a block of text?

🌐
GitHub
github.com › hitachi-nlp › appjsonify
GitHub - hitachi-nlp/appjsonify: A handy PDF-to-JSON conversion tool for academic papers implemented in Python. · GitHub
appjsonify is a handy PDF-to-JSON conversion tool for academic papers implemented in Python.
Starred by 73 users
Forked by 4 users
Languages   Python
🌐
Aspose
products.aspose.cloud › aspose.total › python › conversion › pdf to json conversion
Free online PDF to JSON conversion App via python
February 5, 2023 - Use free online app or python SDK to convert between PDF & JSON as well as several popular formats from Microsoft® Word.
🌐
PDF.co
pdf.co › blog › convert-pdf-to-json-from-a-file-in-python
How to Convert PDF to JSON from a File in Python using PDF.co Web API | PDF.co
In this tutorial, you will learn how to merge PDF files in Python using PDF.co Web API. You will also learn how to extract PDF to JSON after.
🌐
Medium
medium.com › @rishab_dugar › convert-pdf-tables-to-json-using-python-a-comprehensive-guide-be119fd16544
Convert PDF Tables to JSON Using Python: A Comprehensive Guide | by Rishab Dugar | Towards Dev
January 2, 2025 - In this guide, we have covered how to extract tables from PDF files and convert them to CSV format, using pdfplumber and pandas. We then converted the CSV files to JSON using Python's built-in csv and json libraries.
🌐
Thomas Suedbroecker
suedbroecker.net › 2024 › 05 › 29 › python-pdf-to-json-conversion-for-efficient-data-pre-processing
Python PDF to JSON Conversion for Efficient Data Pre-processing – Thomas Suedbroecker's Blog
May 29, 2024 - { "file": "/path/to/pdf_file/file.pdf", "pdf_pages" : [{"page":"1","content":"xxx"}, {"page":"2","content":"yyy"}] }

This is the small function which does the extraction of the page text.

def extract_pages_from_pdf(pdf_path):
    text = ""
    list = []
    with open(pdf_path, "rb") as file:
        pdf_reader = PyPDF2.PdfReader(file)
        num_pages = len(pdf_reader.pages)
        for page_num in range(num_pages):
            print(f"***** {page_num} / {num_pages} ****")
            page = pdf_reader.pages[page_num]
            text = page.extract_text()
            print(f"*****\n {text} \n****\n")
            value = { "page": page_num, "content": text}
            list.append(value)
    pdf_information = { "file": pdf_path, "pages": list}
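A hedged sketch of how the per-page structure in the excerpt above could be assembled and serialized once the page texts are in hand. The helper build_pdf_record is hypothetical. It uses the pdf_pages key and string page numbers from the sample JSON, and numbering pages from 1 is a choice made here to match that sample.

```python
import json


def build_pdf_record(pdf_path, page_texts):
    """Wrap already-extracted page texts in a file/pdf_pages dictionary."""
    pages = [
        {"page": str(number), "content": text}
        for number, text in enumerate(page_texts, start=1)
    ]
    return {"file": pdf_path, "pdf_pages": pages}


# "xxx" and "yyy" are the placeholder contents from the sample JSON.
record = build_pdf_record("/path/to/pdf_file/file.pdf", ["xxx", "yyy"])
print(json.dumps(record, indent=2))
```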
🌐
Quora
quora.com › How-do-I-convert-a-PDF-file-to-a-JSON-file-using-Python
How to convert a PDF file to a JSON file using Python - Quora
December 16, 2025 - Answer: What does this even mean? PDF is a structured binary file format with a specific purpose in mind. JSON is a general serialization of any type of text-based data. What would you expect when going from one to the other (esp. in a lossless manner)?
🌐
GitHub
github.com › topics › pdf-to-json
pdf-to-json · GitHub Topics · GitHub
May 13, 2022 - pdf json pdftotext pdftojson pdf-to-json pdf-to-database insert-pdf-database ... A high-performance Python library for extracting structured content from PDF documents with layout-aware text extraction.
🌐
PDF.co
pdf.co › tutorials › convert-pdf-to-json-in-python-xml-xls-csv-txt
How to Convert PDF to JSON, XML, XLS, CSV, and TXT from Uploaded File in Python | PDF.co
By using REST Web API you can convert the Python code into JSON and then simply open your Python project and just copy & paste the code and then run your app! Writing Python applications often involves various steps of software development so ...
🌐
LinkedIn
linkedin.com › pulse › convert-pdf-data-machin-readable-format-jason-deo-swaroop
Convert PDF Data to Machine Readable Format (JASON)
September 20, 2025 - Use the "json" library in Python to convert the data structure into a JSON formatted string. ...

import PyPDF2
import json

def extract_text_from_pdf(pdf_path):
    text = ''
    with open(pdf_path, 'rb') as pdf_file:
        pdf_reader = PyPDF2.PdfFileRead...
🌐
Konfuzio
konfuzio.com › start › blog › python › pdf to json conversion for intelligent text processing
PDF to JSON conversion for intelligent text processing
December 31, 2019 - For processing, there are several libraries, products and vendors that offer very good text recognition and AI support. A very popular programming language for handling the capabilities, training the AI algorithms and converting the input files to JSON is Python.
🌐
ABBYY
support.abbyy.com › hc › en-us › community › posts › 360009372239-PDF-to-JSON
PDF to JSON – Help Center
September 20, 2025 -

import json

def get_data(page_content):
    _dict = {}
    page_content_list = page_content.splitlines()
    for line in page_content_list:
        # Skip lines that do not look like "key: value"
        if ':' not in line:
            continue
        # Split on the first colon only, so values containing ':' do not raise
        key, value = line.split(':', 1)
        _dict[key.strip()] = value.strip()
    return _dict

page_data ...
🌐
Medium
medium.com › nanonets › convert-data-from-pdfs-to-json-outputs-4bf32d50cfd2
Convert data from PDFs to JSON outputs | by Prithiv Sassisegarane | NanoNets | Medium
July 28, 2022 - In the next section, let’s look at how to parse data from PDF to generate JSON files. Parsing through PDFs isn’t a complicated task if you have developer experience. Firstly, we’ll have to check if our PDF files contain text data or consist of scanned images. We’d have to check if we can extract text data and pipe the files through an OCR library if no text was returned. This could be achieved using a Python library or by relying on some Linux command-line utilities.
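The text-or-scan check described above can be sketched as follows, assuming the page texts have already been extracted with some library. The function name needs_ocr and the 20-character threshold are illustrative assumptions, not part of any library.

```python
def needs_ocr(page_texts, min_chars=20):
    """Guess whether a PDF is scanned: True if the pages yielded almost no text."""
    # Treat None (no text returned for a page) the same as an empty string.
    total = sum(len((text or "").strip()) for text in page_texts)
    return total < min_chars


# Made-up inputs: one PDF that returned no text, one that returned real text.
scanned_like = needs_ocr(["", None, "  "])
text_based = needs_ocr(["Invoice 123\nTotal: 42.00 EUR"])
print(scanned_like, text_based)
```

A file flagged this way would then be piped through an OCR tool instead of a plain text extractor.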