The PDFWriter module is part of xtopdf.
PDFWriter, a core class of the xtopdf toolkit, can now be used with a Python context manager, a.k.a. the Python with statement.
( http://code.activestate.com/recipes/578790-use-pdfwriter-with-context-manager-support/ )
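As a rough illustration of what context manager support means, here is a minimal sketch of a class that works with the with statement. TextWriter and its methods are hypothetical stand-ins invented for this example; they are not the real PDFWriter API, which is documented in the recipe linked above.

```python
class TextWriter:
    """Hypothetical writer class; NOT the real xtopdf PDFWriter API."""

    def __init__(self):
        self.lines = []
        self.closed = False

    def write_line(self, text):
        # Collect lines in memory instead of writing a real file.
        self.lines.append(text)

    def close(self):
        self.closed = True

    def __enter__(self):
        # The object returned here is bound to the name after "as".
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Always close, even if the with block raised an exception.
        self.close()
        return False  # do not suppress exceptions


with TextWriter() as writer:
    writer.write_line("first line")
    writer.write_line("second line")

print(writer.closed)
```

The point of the pattern is that close() runs automatically when the block ends, so the caller cannot forget it.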
Installation instructions for xtopdf are at https://bitbucket.org/vasudevram/xtopdf :
Installation and usage:
To install the files, first make sure that you have downloaded and installed all the prerequisites mentioned above, including setup steps such as adding needed directories to your PYTHONPATH. Then, copy all the files in xtopdf.zip into a directory which is on your PYTHONPATH.
To use any of the Python programs, run the .py file as:
python filename.py
This will give a usage message about the correct usage and arguments expected.
To run the shell script(s), do the same as above.
Developers can look at the source code for further information.
An alternative is to use pdfdocument to create the PDF; it can be installed using pip ( https://pypi.python.org/pypi/pdfdocument ).
Parse the data from the JSON string (see How can I parse GeoJSON with Python, Parse JSON in Python) and print it as a PDF using pdfdocument ( https://pypi.python.org/pypi/pdfdocument ):
import json
data = json.loads(datastring)
from io import BytesIO
from pdfdocument.document import PDFDocument
def say_hello():
    f = BytesIO()
    pdf = PDFDocument(f)
    pdf.init_report()
    pdf.h1('Hello World')
    pdf.p('Creating PDFs made easy.')
    pdf.generate()
    return f.getvalue()
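The snippet above parses the JSON and builds the PDF as two separate steps. One possible way to connect them is to flatten the parsed data into plain text lines first. The sketch below uses only the standard library; json_to_lines and the sample datastring are hypothetical names and data made up for this example.

```python
import json


def json_to_lines(value, prefix=""):
    """Flatten parsed JSON into 'path: value' text lines (hypothetical helper)."""
    lines = []
    if isinstance(value, dict):
        for key, item in value.items():
            lines.extend(json_to_lines(item, f"{prefix}{key}."))
    elif isinstance(value, list):
        for index, item in enumerate(value):
            lines.extend(json_to_lines(item, f"{prefix}{index}."))
    else:
        # Scalar value: emit one line, dropping the trailing dot of the path.
        lines.append(f"{prefix.rstrip('.')}: {value}")
    return lines


# Sample input invented for illustration.
datastring = '{"title": "Report", "items": [{"name": "a", "count": 2}]}'
lines = json_to_lines(json.loads(datastring))
for line in lines:
    print(line)
```

Each resulting line could then be passed to pdf.p(line) between pdf.init_report() and pdf.generate() in the function above.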
Answer from ralf htp on Stack Overflow
import json

import pdfkit  # requires the wkhtmltopdf binary (wkhtmltox)
from json2html import json2html


class PdfConverter(object):

    def __init__(self):
        pass

    def to_html(self, json_doc):
        # Convert a JSON string to an HTML table
        return json2html.convert(json=json_doc)

    def to_pdf(self, html_str):
        # Passing None as the output path makes pdfkit return the PDF bytes
        return pdfkit.from_string(html_str, None)


def main():
    stowflw = {
        "data": [
            {
                "state": "Manchester",
                "quantity": 20
            },
            {
                "state": "Surrey",
                "quantity": 46
            },
            {
                "state": "Scotland",
                "quantity": 36
            },
            {
                "state": "Kent",
                "quantity": 23
            },
            {
                "state": "Devon",
                "quantity": 43
            },
            {
                "state": "Glamorgan",
                "quantity": 43
            }
        ]
    }
    pdfc = PdfConverter()
    with open("sample.pdf", "wb") as pdf_fl:
        pdf_fl.write(pdfc.to_pdf(pdfc.to_html(json.dumps(stowflw))))


if __name__ == "__main__":
    main()
- install json2html
- install pdfkit (requires wkhtmltox)
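To make the JSON-to-HTML step less of a black box, here is a standard-library sketch of turning a list of flat records into an HTML table string. It is only an illustration of the idea; records_to_html_table is a hypothetical helper and its output is not claimed to match what json2html produces.

```python
from html import escape


def records_to_html_table(records):
    """Render a list of flat dicts as an HTML table string (hypothetical helper)."""
    if not records:
        return "<table></table>"
    # Use the keys of the first record as the column order.
    columns = list(records[0].keys())
    header = "".join(f"<th>{escape(str(c))}</th>" for c in columns)
    rows = []
    for record in records:
        cells = "".join(
            f"<td>{escape(str(record.get(c, '')))}</td>" for c in columns
        )
        rows.append(f"<tr>{cells}</tr>")
    return f"<table><tr>{header}</tr>{''.join(rows)}</table>"


html_str = records_to_html_table(
    [{"state": "Kent", "quantity": 23}, {"state": "Devon", "quantity": 43}]
)
print(html_str)
```

A string like html_str is the kind of input that pdfkit.from_string accepts in the answer above.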
I've been searching for a while now for a library that can parse a PDF to JSON or XML format while keeping the document structure.
Popular libraries like pypdf often do not preserve the document structure. I thought about using Tesseract for OCR and then transforming the result into JSON, but could not get it working. Is there a library that can parse a PDF to JSON while preserving the document structure, rather than just spitting out a block of text?
There seems to be a lot of steps in your code. You could simply loop over the columns of your transposed df and export each of them to html. Append all html tables to a root html element and export with pdfkit:
import json
import pandas as pd
import lxml.etree as et
import pdfkit
your_json = """{"url": "https://www.abc123.com", "extensionVersion": "4.51.0", "axeVersion": "4.6.3", "standard": "WCAG 2.1 AA", "testingStartDate": "2023-04-03T09:35:06.177Z", "testingEndDate": "2023-04-03T09:35:06.177Z", "bestPracticesEnabled": false, "issueSummary": {"critical": 2, "moderate": 0, "minor": 0, "serious": 0, "bestPractices": 0, "needsReview": 0}, "remainingTestingSummary": {"run": false}, "igtSummary": [], "failedRules": [{"name": "button-name", "count": 1, "mode": "automated"}, {"name": "select-name", "count": 1, "mode": "automated"}], "needsReview": [], "allIssues": [{"ruleId": "button-name", "description": "Ensures buttons have discernible text", "help": "Buttons must have discernible text", "helpUrl": "https://www.abc123.com", "impact": "critical", "needsReview": false, "isManual": false, "selector": [".livechat-button"], "summary": "Fix any of the following:\\n Element does not have inner text that is visible to screen readers\\n aria-label attribute does not exist or is empty\\n aria-labelledby attribute does not exist, references elements that do not exist or references elements that are empty\\n Element has no title attribute\\n Element's default semantics were not overridden with role=\\"none\\" or role=\\"presentation\\"", "source": "<button class=\\"livechat-button items-center bg-black shadow-liveChat rounded-full text-white p-2 h-12 transition-all opacity-0 pointer-events-none w-sp-48 opacity-0 pointer-events-none\\">", "tags": ["cat.name-role-value", "wcag2a", "wcag412", "section508", "section508.22.a", "ACT"], "igt": "", "shareURL": "", "createdAt": "2023-04-03T09:35:06.177Z", "testUrl": "", "testPageTitle": "ABC123", "foundBy": "[email protected]", "axeVersion": "4.6.3"}, {"ruleId": "select-name", "description": "Ensures select element has an accessible name", "help": "Select element must have an accessible name", "helpUrl": "https://www.abc123.com", "impact": "critical", "needsReview": false, "isManual": false, "selector": 
["#plp__sortSelected"], "summary": "Fix any of the following:\\n Form element does not have an implicit (wrapped) <label>\\n Form element does not have an explicit <label>\\n aria-label attribute does not exist or is empty\\n aria-labelledby attribute does not exist, references elements that do not exist or references elements that are empty\\n Element has no title attribute\\n Element's default semantics were not overridden with role=\\"none\\" or role=\\"presentation\\"", "source": "<select class=\\"w-full absolute opacity-0 appearance-none text-value-small font-bold text-black uppercase cursor-pointer bg-transparent outline-0\\" id=\\"plp__sortSelected\\">", "tags": ["cat.forms", "wcag2a", "wcag412", "section508", "section508.22.n", "ACT"], "igt": "", "shareURL": "", "createdAt": "2023-04-03T09:35:06.177Z", "testUrl": "https://www.abc123.com", "testPageTitle": "ABC123", "foundBy": "[email protected]", "axeVersion": "4.6.3"}]}"""
data = json.loads(your_json)
## replace the above lines with the following in your case
# with open('your_file.json', 'r') as f:
#     data = json.load(f)
html = et.Element("html")
# general info
html.append(et.fromstring(f"""<h3>Site link: <a href="{data['url']}">{data['url']}</a></h3>"""))
html.append(et.fromstring(f"""<h4>Date: {data['testingEndDate']}</h4>"""))
html.append(et.fromstring(f"""<h4>Summary:</h4>"""))
# summary table
summary = pd.Series(data['issueSummary'])
summary_table = et.fromstring(summary.to_frame().to_html(header=False))
summary_table.set('class', 'summary')
html.append(summary_table)
# issue tables
cols_of_interest = ['ruleId', 'description', 'help', 'impact', 'selector', 'summary', 'source']
df = pd.DataFrame(data['allIssues'])[cols_of_interest].T
for col in df.columns:
    table = et.fromstring(df[[col]].to_html(header=False))
    table.set('class', 'issue')
    html.append(table)
    html.append(et.fromstring('<br/>'))
pdfkit.from_string(et.tostring(html, encoding="unicode"), "./output.pdf", css='style.css')
With the following css file:
/* style.css */
* {
    font-family: 'Liberation Sans';
}
table {
    margin: 20px;
    margin-left: auto;
    margin-right: auto;
}
table.summary {
    width: 50%;
}
table.issue {
    border: 0;
    width: 100%;
    border-collapse: collapse;
}
table.issue td,
table.issue th {
    border: 0;
    text-align: left;
    padding: 5px;
}
table.issue tr {
    border-bottom: 1px solid #dddddd;
}
You'll get:

Edit: updated json with the data you provided + exporting additional data + improved css
Note: you will need to install wkhtmltopdf and make sure that it is in your path.
Edit2: limiting output to desired fields
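The field-limiting step mentioned in Edit2 can also be sketched without pandas. The helper below, select_fields, is a hypothetical name introduced for this example; it keeps only the requested keys of each issue and fills in a placeholder when a key is missing.

```python
def select_fields(issues, fields, default="N.A."):
    """Keep only the given fields of each issue dict (hypothetical helper)."""
    return [
        {name: issue.get(name, default) for name in fields}
        for issue in issues
    ]


# Shortened sample modelled on the 'allIssues' entries above.
issues = [
    {"ruleId": "button-name", "impact": "critical", "tags": ["wcag2a"]},
    {"ruleId": "select-name", "tags": ["cat.forms"]},
]
limited = select_fields(issues, ["ruleId", "impact"])
print(limited)
```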
Disclaimer: I am the author of borb, the library used in this answer.
Assuming your data looks like this:
data = [
    {
        "ruleId": "name",
        "description": "Description123",
        "help": "Description234",
        "impact": "critical",
        "selector": [
            "abc1234"
        ],
        "summary": "long text",
        "source": "long text2",
    },
]
You can run the following code:
from borb.pdf import Document, Page, PageLayout, SingleColumnLayout, Paragraph, HexColor, Table, TableUtil, PDF
from decimal import Decimal

# create empty document
doc: Document = Document()

# create empty page
page: Page = Page()
doc.add_page(page)

# use a PageLayout to be able to add things easily
layout: PageLayout = SingleColumnLayout(page)

# generate a Table for each issue
for i, issue in enumerate(data):

    # add a header (Paragraph)
    layout.add(Paragraph("Issue %d" % i, font_size=Decimal(20), font_color=HexColor("#B5F8FE")))

    # add a Table (using the convenient TableUtil class)
    table: Table = TableUtil.from_2d_array([["Rule ID", issue.get("ruleId", "N.A.")],
                                            ["Description", issue.get("description", "N.A.")],
                                            ["Help", issue.get("help", "N.A.")],
                                            ["Impact", issue.get("impact", "N.A.")],
                                            ["Selector", str(issue.get("selector", []))],
                                            ["Summary", issue.get("summary", "N.A.")],
                                            ["Source", issue.get("source", "N.A.")],
                                            ], header_row=False, header_col=True, flexible_column_width=False)
    layout.add(table)

# store the PDF
with open("output.pdf", "wb") as fh:
    PDF.dumps(fh, doc)
This generates the following PDF:

pip install pydf2json
My guess is that you're expecting to see more structure in the JSON you are getting, such as a pair of curly braces or square brackets. But curly braces represent a dictionary (key/value pairs), and square brackets represent an array or list. What you are encoding as JSON is neither of those things.
page.extractText returns text from the PDF being read as a single Python string value. The JSON encoding of a Python string value is the text of that string within a pair of double quotes. So the JSON you're getting will be of the form:
"<text from pdf document>"
It doesn't matter what's in the PDF. Whatever text you get back from page.extractText will always be a single Python string. What you get when you encode that string as JSON will always be that same text, with double quotes before and after it.
Here's a little code to illustrate this:
import json
s1 = "This is a Python string. A Python string encoded as JSON is the text of that string surrounded by double quotes"
print(s1)
print(json.dumps(s1))
Result:
This is a Python string. A Python string encoded as JSON is the text of that string surrounded by double quotes
"This is a Python string. A Python string encoded as JSON is the text of that string surrounded by double quotes"
Simply converting a string with json.dumps() will not yield your desired result, since the string first needs to be split into key-value pairs.
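As a minimal sketch of that splitting step, assume the extracted text happens to contain lines of the form "key: value". Real PDF text is rarely this tidy, so the format, the sample text, and the text_to_record helper are all assumptions made for illustration.

```python
import json


def text_to_record(text):
    """Split 'key: value' lines into a dict (hypothetical helper)."""
    record = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # Skip lines that have no colon or an empty key.
        if sep and key.strip():
            record[key.strip()] = value.strip()
    return record


# Sample text invented for illustration.
extracted = "Name: Jane Doe\nInvoice: 1042\nNo separator here"
record = text_to_record(extracted)
print(json.dumps(record))
```

Because the value passed to json.dumps() is now a dictionary rather than a string, the output is a JSON object with curly braces.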
If you need to extract a lot of data from an unstructured PDF, you may want to consider using the Python SDK for Adobe's PDF Extract API. The API converts the structural and text information from a PDF directly into JSON, so you don't have to do it manually.
The JSON data will contain an array of elements with information such as the following:
{
"Page": 1,
"Path": "//Document/P",
"Text": "The quick brown fox jumps over the lazy dog "
}
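Once you have such an array, regrouping it is plain Python. The sketch below collects the text of each element by page, relying only on the "Page" and "Text" keys shown above. The text_by_page helper and the second and third sample elements are made up for this example, and the surrounding structure of the real API output is not assumed.

```python
from collections import defaultdict


def text_by_page(elements):
    """Group element text by page number (hypothetical helper)."""
    pages = defaultdict(list)
    for element in elements:
        text = element.get("Text")
        # Elements without text (for example figures) are skipped.
        if text is not None:
            pages[element.get("Page")].append(text.strip())
    return dict(pages)


elements = [
    {"Page": 1, "Path": "//Document/P",
     "Text": "The quick brown fox jumps over the lazy dog "},
    # The next two elements are invented sample data.
    {"Page": 1, "Path": "//Document/Figure"},
    {"Page": 2, "Path": "//Document/P[2]", "Text": "Second page text"},
]
grouped = text_by_page(elements)
print(grouped)
```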