Brave Search

Convert pdf data to JSON format using Python? [closed]

stackoverflow.com › questions › 65546921 › convert-pdf-data-to-json-format-using-python

My guess is that you're expecting to see more structure in the JSON you are getting, such as a pair of curly braces or square brackets. But curly braces represent a dictionary (key/value pairs), and square brackets represent an array or list. What you are encoding as JSON is neither of those things.

page.extractText returns text from the PDF being read as a single Python string value. The JSON encoding of a Python string value is the text of that string within a pair of double quotes. So the JSON you're getting will be of the form:

"<text from pdf document>"

It doesn't matter what's in the PDF. Whatever text you get back from page.extractText will always be a single Python string. What you get when you encode that string as JSON will always be that same text, with double quotes before and after it.

Here's a little code to illustrate this:

import json
s1 = "This is a Python string.  A Python string encoded as JSON is the text of that string surrounded by double quotes"
print(s1)
print(json.dumps(s1))

Result:

This is a Python string.  A Python string encoded as JSON is the text of that string surrounded by double quotes
"This is a Python string.  A Python string encoded as JSON is the text of that string surrounded by double quotes"

Answer from CryptoFool on Stack Overflow
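To get curly braces or square brackets in the output, the value passed to json.dumps has to be a dictionary or a list rather than a string. The following is a minimal sketch of that difference; the sample values are made up for illustration and do not come from a real PDF.

```python
import json

# Sample values for illustration only; they do not come from a real PDF.
text = "Name: John"
as_dict = {"Name": "John"}
as_list = ["Name: John", "Phone No: 123456"]

encoded_text = json.dumps(text)
encoded_dict = json.dumps(as_dict)
encoded_list = json.dumps(as_list)

print(encoded_text)  # a JSON string: the text inside double quotes
print(encoded_dict)  # a JSON object: curly braces
print(encoded_list)  # a JSON array: square brackets
```

So the structure has to be built in Python first (by parsing the extracted text into a dict or list); json.dumps only serializes whatever it is given.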

2 of 2

3

Simply converting a string with json.dumps() will not yield your desired result, since the string first needs to be split into key-value pairs.

If you need to extract a lot of data from an unstructured PDF, you may want to consider Adobe's PDF Extract API, which has a Python SDK. The API converts all the structural and text information from a PDF directly into JSON, so you don't have to do it manually.

The JSON data will contain an array of elements with information such as the following:

{
"Page": 1,
"Path": "//Document/P",
"Text": "The quick brown fox jumps over the lazy dog "
}
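Once output of this shape is available, it can be handled with the standard json module. The sketch below groups the "Text" of each element by its "Page" value. It assumes the elements arrive as a plain JSON list; the second element and its "Path" value are invented for the example, and the real output may wrap the list in a larger object.

```python
import json
from collections import defaultdict

# Sample data shaped like the element shown above. The second element and
# the top-level list layout are assumptions made for this illustration.
sample_json = """
[
  {"Page": 1, "Path": "//Document/P", "Text": "The quick brown fox jumps over the lazy dog "},
  {"Page": 2, "Path": "//Document/P[2]", "Text": "Second page text "}
]
"""

def text_by_page(elements):
    """Group the "Text" of each element under its "Page" number."""
    grouped = defaultdict(list)
    for element in elements:
        # Skip elements without text (assumed possible, e.g. for figures).
        if "Text" in element:
            grouped[element["Page"]].append(element["Text"].strip())
    return {page: " ".join(parts) for page, parts in grouped.items()}

pages = text_by_page(json.loads(sample_json))
print(json.dumps(pages, indent=4))
```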

PyPI

pypi.org › project › json2pdf-Converter

json2pdf-Converter

September 1, 2023

Discussions

parse pdf to json using python

No, because the pdf format does not save the document structure. The way pdf works is by saving the absolute position of things, not the relative position. More on reddit.com

r/learnpython

5

2

March 23, 2023
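Because only absolute positions are stored, any structure has to be rebuilt from coordinates. The sketch below shows one simple way to do that: group text fragments whose vertical positions are close into lines, then order each line left to right. The (x, y, text) tuples and the tolerance value are hypothetical; a real extraction library would supply its own coordinate format.

```python
# Hypothetical fragments as (x, y, text), with y measured from the top of the page.
fragments = [
    (150.0, 100.0, "John"),
    (50.0, 100.2, "Name:"),
    (150.0, 119.9, "123456"),
    (50.0, 120.0, "Phone No:"),
]

def rebuild_lines(fragments, tolerance=2.0):
    """Group fragments with similar y into lines, then sort each line by x."""
    lines = []  # each entry is [line_y, [(x, text), ...]]
    for x, y, text in sorted(fragments, key=lambda f: f[1]):
        if lines and abs(y - lines[-1][0]) <= tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    return [" ".join(t for _, t in sorted(words)) for _, words in lines]

rebuilt = rebuild_lines(fragments)
print(rebuilt)
```

This handles single-column text only; multi-column layouts and tables need more careful grouping.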

How to convert the extracted text from PDF to JSON or XML format in Python? - Stack Overflow

I am using PyPDF2 to extract the data from a PDF file and convert it to text. The content of the PDF looks like this: Name : John Address: 123street , USA Phone No: 123456 Gender: ... More on stackoverflow.com

stackoverflow.com

September 16, 2023
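Once the "label: value" fields have been parsed into a dictionary, both output formats are available from the standard library. A minimal sketch follows, assuming the record has already been parsed; the to_xml helper and the "record" root tag are hypothetical names chosen for this example.

```python
import json
import re
import xml.etree.ElementTree as ET

# Sample record using the fields quoted in the question; assumed already parsed.
record = {"Name": "John", "Address": "123street , USA", "Phone No": "123456"}

def to_xml(record, root_tag="record"):
    """Serialize a flat dict as XML, one child element per key."""
    root = ET.Element(root_tag)
    for key, value in record.items():
        # XML tag names cannot contain spaces, so replace them with "_".
        tag = re.sub(r"\W+", "_", key.strip())
        ET.SubElement(root, tag).text = value
    return ET.tostring(root, encoding="unicode")

json_text = json.dumps(record, indent=4)
xml_text = to_xml(record)
print(json_text)
print(xml_text)
```

Keys that start with a digit would still produce invalid tag names, so real data may need a stricter sanitizing step.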

How do I convert JSON to PDF? I have no programming skills.

JSON is just text, e. g.: {"menu": { "id": "file", "value": "File", "popup": { "menuitem": [ {"value": "New", "onclick": "CreateNewDoc()"}, {"value": "Open", "onclick": "OpenDoc()"}, {"value": "Close", "onclick": "CloseDoc()"} ] } }} so you can open your file in any text editor like Notepad and use the Windows "print to PDF" function. If you want some JSON-specific formatting & coloring, there are lots of programming editors like Notepad++ to do that. Again, use Windows' internal PDF printer. More on reddit.com

r/software

15

1

March 4, 2024

extracting data from 100+ pdf files

I actually just built a program to extract lab test data from PDFs for work and dump it into an ordered CSV file. I won’t go into the nitty-gritty details of everything I did, but it all started with the pdfplumber package. It’s not too difficult to use, and the GitHub page explains the necessary details. Using .extract_text() will give you all of the text as a single string with \n representing line breaks. From there it’s up to you to write the appropriate text parsing using regex and such. Note that pdfplumber won’t work on scanned PDFs; they have to have been computer generated. One nice thing about pdfplumber is that .extract_words() generates a list of dictionaries, one for every word in the PDF. The dictionaries carry location info, which you can use to crop the PDF based on where the thing you’re looking for sits relative to nearby words. More on reddit.com

r/learnpython

92

264

September 25, 2020
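The regex step described in that post might look roughly like the sketch below. It works on a made-up string standing in for the text an extractor returns, so it needs no PDF library; the field layout, the pattern, and the column names are all assumptions for illustration.

```python
import csv
import io
import re

# Made-up stand-in for the single string a text extractor might return.
sample_text = "Sample ID: A-17\nGlucose: 5.4 mmol/L\nSodium: 140 mmol/L\n"

# Matches lines of the assumed form "<name>: <number> <unit>".
LINE_RE = re.compile(
    r"^(?P<name>[^:\n]+):\s*(?P<value>[\d.]+)\s*(?P<unit>\S+)$",
    re.MULTILINE,
)

def rows_from_text(text):
    """Return one dict per line that matches the pattern; other lines are ignored."""
    return [match.groupdict() for match in LINE_RE.finditer(text)]

def to_csv(rows):
    """Write the rows to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["name", "value", "unit"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

rows = rows_from_text(sample_text)
csv_text = to_csv(rows)
print(csv_text)
```

The "Sample ID" line is skipped because its value is not numeric, which shows how the pattern doubles as a filter.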

Videos

20:48

YouTube

PDF to JSON: LLM-Powered Data Extraction In Python - YouTube

Python - How to Read OR Convert PDF Files into JSON files - YouTube

April 30, 2024

13:44

YouTube

Easiest Way to Convert a PDF to JSON using LangChain Output Parsers ...

python code to convert pdf to json - YouTube

January 20, 2024

youtube.com

How to Convert PDF to JSON from a File in Python ... - YouTube

youtube.com

How to convert PDFs to JSON

GitHub

github.com › antoinecarme › pdf_to_json

GitHub - antoinecarme/pdf_to_json: Python module to Convert a PDF file to a JSON format

May 19, 2024 -

import pdf_to_json as p2j
import json

# web document : UDHR
url = "https://www.ohchr.org/EN/UDHR/Documents/UDHR_Translations/eng.pdf"

# Convert the document into a python dictionary
lConverter = p2j.pdf_to_json.pdf_to_json_converter()
lDict ...

Starred by 6 users

Forked by 5 users

Languages: Python 100.0%

Nanonets

nanonets.com › blog › pdf-to-json

Convert PDF to JSON - PDF to JSON Python & Javascript

September 17, 2025 - Learn how to convert PDF to JSON using Python & Javascript. Automate data extraction from PDF to JSON with our API.

PyPI

pypi.org › project › pydf2json

pydf2json · PyPI

PDF analysis. Convert contents of PDF to a JSON-style python dictionary.

pip install pydf2json

Published Sep 15, 2022

Version 2.4.0

Homepage https://github.com/kingaling/pydf2json

reddit.com › r/learnpython › parse pdf to json using python

r/learnpython on Reddit: parse pdf to json using python

March 23, 2023 -

I've been searching for a while now for a library that can parse a PDF to JSON or XML while keeping the document structure.
Popular libraries like pypdf often do not preserve the document structure. I thought about using Tesseract for OCR and then transforming the result into JSON, but could not get it working. Is there a library that can parse a PDF to JSON while preserving the document structure, rather than just spitting out a block of text?

Top answer

1 of 1

3

Not so pretty, but I think this would get the job done. You get a dictionary, which json.dumps then serializes in a nicely indented format.

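import json

# Sample stand-in for the text extracted from a PDF page.
page_content = "Name : John\nPhone No: 123456"

def get_data(page_content):
    """Build a dict from lines that look like 'key: value'."""
    _dict = {}
    page_content_list = page_content.splitlines()
    for line in page_content_list:
        if ':' not in line:
            continue
        # Split on the first colon only, so values containing ':' stay whole.
        key, value = line.split(':', 1)
        _dict[key.strip()] = value.strip()
    return _dict

page_data = get_data(page_content)
json_data = json.dumps(page_data, indent=4)
print(json_data)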

or, instead of those last 3 lines, just do this:

print(json.dumps(get_data(page_content), indent=4))

PyPI

pypi.org › project › json2pdf

json2pdf

Quora

quora.com › How-do-I-convert-a-PDF-file-to-a-JSON-file-using-Python

How to convert a PDF file to a JSON file using Python - Quora

December 16, 2025 - Answer: What does this even mean? PDF is a structured binary file format with a specific purpose in mind. JSON is a general serialization of any type of text-based data. What would you expect when going from one to the other (esp. in a lossless manner)?

GitHub

github.com › topics › pdf-to-json

pdf-to-json · GitHub Topics · GitHub

May 13, 2022 - pdf json pdftotext pdftojson pdf-to-json pdf-to-database insert-pdf-database ... A high-performance Python library for extracting structured content from PDF documents with layout-aware text extraction.

PDF.co

pdf.co › tutorials › convert-pdf-to-json-in-python-xml-xls-csv-txt

How to Convert PDF to JSON, XML, XLS, CSV, and TXT from Uploaded File in Python | PDF.co

By using the REST Web API you can convert a PDF into JSON: open your Python project, copy and paste the code, and run your app. Writing Python applications often involves various steps of software development so ...

linkedin.com › pulse › convert-pdf-data-machin-readable-format-jason-deo-swaroop

Convert PDF Data to Machine Readable Format (JASON)

September 20, 2025 - Use the "json" library in Python to convert the data structure into a JSON formatted string. ...

import PyPDF2
import json

def extract_text_from_pdf(pdf_path):
    text = ''
    with open(pdf_path, 'rb') as pdf_file:
        pdf_reader = PyPDF2.PdfFileRead...

Konfuzio

konfuzio.com › start › blog › python › pdf to json conversion for intelligent text processing

PDF to JSON conversion for intelligent text processing

December 31, 2019 - For processing, there are several libraries, products and vendors that offer very good text recognition and AI support. A very popular programming language for handling the capabilities, training the AI algorithms and converting the input files to JSON is Python.

ABBYY

support.abbyy.com › hc › en-us › community › posts › 360009372239-PDF-to-JSON

PDF to JSON – Help Center

September 20, 2025 -

import json

def get_data(page_content):
    _dict = {}
    page_content_list = page_content.splitlines()
    for line in page_content_list:
        if ':' not in line:
            continue
        key, value = line.split(':')
        _dict[key.strip()] = value.strip()
    return _dict

page_data ...

Medium

medium.com › nanonets › convert-data-from-pdfs-to-json-outputs-4bf32d50cfd2

Convert data from PDFs to JSON outputs | by Prithiv Sassisegarane | NanoNets | Medium

July 28, 2022 - In the next section, let’s look at how to parse data from PDF to generate JSON files. Parsing through PDFs isn’t a complicated task if you have developer experience. Firstly, we’ll have to check if our PDF files contain text data or consist of scanned images. We’d have to check if we can extract text data and pipe the files through an OCR library if no text was returned. This could be achieved using a Python library or by relying on some Linux command-line utilities.
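The "extract text first, fall back to OCR" check described above can be sketched as follows. The extract_text and run_ocr arguments are hypothetical callables standing in for whatever text-extraction and OCR libraries are actually used, and the 20-character threshold is an arbitrary assumption, not a recommended value.

```python
import json

def has_usable_text(text, min_chars=20):
    """Heuristic: a page counts as text-based if it yields enough non-space characters."""
    return len("".join(text.split())) >= min_chars

def extract_with_fallback(page, extract_text, run_ocr, min_chars=20):
    """Try direct text extraction; fall back to OCR when little or no text comes back."""
    text = extract_text(page) or ""
    if has_usable_text(text, min_chars):
        return {"source": "text", "text": text}
    return {"source": "ocr", "text": run_ocr(page)}

# Stand-in callables for the demo; real code would call actual libraries here.
result_a = extract_with_fallback(
    "page-1",
    extract_text=lambda page: "This page has a real text layer.",
    run_ocr=lambda page: "unused",
)
result_b = extract_with_fallback(
    "page-2",
    extract_text=lambda page: "",
    run_ocr=lambda page: "text recovered by OCR",
)
print(json.dumps([result_a, result_b], indent=2))
```

Recording which path produced the text is a design choice that makes it easier to review OCR output separately, since it tends to be less reliable.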