How to extract text from images using API?
How to Programmatically Extract Text from Images Using GPT-4
OCR using API for text extraction
API playground - image generation
I just need good-quality, blog-style images, but every model I've tested very often renders letters, numbers, and symbols incorrectly.
Is there any image model that handles these without issues? I'm currently using Flux, and even though it's quite good, it can't be automated because of these quality issues.
I'm working on a Python project where I need to perform Optical Character Recognition (OCR) on screenshots and then compare the extracted text with a source code file to identify any differences. I'm leveraging OpenAI's GPT-4 model via the ChatCompletion API to handle both tasks.
Here's what I'm trying to achieve:
1. Capture a Screenshot: Use pyautogui to capture a specific region of the screen and save it as an image file.
2. Encode the Image: Convert the captured image to a base64-encoded string to send it via the API.
3. Prepare and Send the Prompt: Send both the encoded image data and the source code text to GPT-4 in a single prompt, instructing it to perform OCR on the image and then compare the extracted text with the source code to identify any differences.
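The first two steps can be sketched roughly as follows. This is a minimal illustration, not the exact code from the project: `capture_region`, `encode_bytes`, and `encode_image` are hypothetical helper names, and the region tuple is assumed to be `(left, top, width, height)` as accepted by `pyautogui.screenshot`.

```python
import base64
from pathlib import Path


def capture_region(path, region):
    """Save a screenshot of region=(left, top, width, height) to path."""
    # Imported lazily because pyautogui needs a display session;
    # the encoding helpers below work without it.
    import pyautogui

    pyautogui.screenshot(path, region=region)
    return path


def encode_bytes(data):
    """Return raw bytes as a base64-encoded ASCII string."""
    return base64.b64encode(data).decode("ascii")


def encode_image(path):
    """Read an image file and return its contents base64-encoded."""
    return encode_bytes(Path(path).read_bytes())
```

For example, `encode_image(capture_region("screenshot.png", (0, 0, 800, 600)))` would capture the top-left 800x600 area and return the string to send.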
Here's a snippet of my current implementation:
```python
import openai
import base64
from PIL import Image
import io

# Set your OpenAI API key
openai.api_key = 'YOUR_API_KEY'

def capture_and_compare(image_path, source_text):
    # Read and encode the image
    with open(image_path, 'rb') as image_file:
        image_bytes = image_file.read()
    encoded_image = base64.b64encode(image_bytes).decode('utf-8')

    # Prepare the prompt with image data and source code
    messages = [
        {
            "role": "system",
            "content": "You are an AI assistant that performs OCR on images and compares the extracted text with provided source code."
        },
        {
            "role": "user",
            "content": (
                f"Task: Perform OCR on the following image and compare the extracted text with the provided source code to identify differences.\n\n"
                f"Source Code:\n```python\n{source_text}\n```\n\n"
                f"Image Data (base64): {encoded_image}\n\n"
                f"Please extract the text from the image and provide a line-by-line diff with the source code."
            )
        }
    ]

    # Make the API call to GPT-4
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=messages,
        temperature=0
    )

    # Extract and return the response content using attribute access
    return response.choices[0].message.content.strip()

# Example source code for comparison
source_code = """
def add(a, b):
    return a + b

def subtract(a, b):
    return a - b
"""

# Perform OCR and comparison on a sample image
diff_result = capture_and_compare('screenshot.png', source_code)
print(diff_result)
```

Issue I'm Facing:
When I run this script, instead of receiving actual OCR results and a meaningful diff between the image and the source code, GPT-4 responds by simulating the OCR process without performing it. Here's an example of the response I receive:
```
---- Top Half Differences ----
To perform the task, I will first extract the text from the provided image data using Optical Character Recognition (OCR). Then, I will compare the extracted text with the provided source code line by line to identify any differences.

Let's proceed with the OCR and comparison:

### OCR Text Extraction
(Note: The OCR process is simulated here as I cannot directly process images. The extracted text is assumed to be similar to the source code with potential OCR errors.)
```
Question:
What is the correct method to send both image data and text in the same prompt to GPT-4 via the OpenAI API to perform OCR and then a code diff?
I'm aiming to have GPT-4 extract text from an image and compare it with a given source code file to identify any discrepancies. Is there a specific format, additional parameters, or a different approach I should use to achieve this effectively? My current method doesn't seem to enable GPT-4 to perform actual OCR on the image data provided.
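One likely explanation is that a base64 string pasted into the text prompt reaches the model as a long run of characters, not as an image. The Chat Completions API accepts images as a separate `image_url` content part, which can carry a `data:` URL. The sketch below assumes the `openai` Python package version 1.0 or later, an `OPENAI_API_KEY` environment variable, and a vision-capable model; the model name `gpt-4o` and the helper names are assumptions chosen for illustration.

```python
def build_ocr_messages(image_b64, mime_type="image/png"):
    """Build a chat payload that attaches the image as an image part, not as prompt text."""
    return [
        {
            "role": "system",
            "content": "Transcribe the text in the image exactly. Output only the transcription.",
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Extract all text from this screenshot, preserving line breaks and indentation.",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:{mime_type};base64,{image_b64}"},
                },
            ],
        },
    ]


def ocr_image(image_b64, model="gpt-4o"):
    """Send the image to a vision-capable model and return its transcription."""
    # Imported lazily so the payload builder can be used without the package installed.
    # Assumption: openai>=1.0, which reads OPENAI_API_KEY from the environment.
    from openai import OpenAI

    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=build_ocr_messages(image_b64),
        temperature=0,
    )
    return response.choices[0].message.content
```

Keeping the request to transcription only, and comparing against the source code in a separate step, makes it easier to see whether an error came from the OCR or from the comparison.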
Accuracy is very important.
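If accuracy matters, one option is to use the model only for the transcription and compute the diff deterministically on your side. The following is a minimal sketch using Python's standard `difflib` module; `line_diff` is a hypothetical helper name, and `ocr_text` stands for whatever text the OCR step returned.

```python
import difflib


def line_diff(source_text, ocr_text):
    """Return a unified diff (list of lines) between the source code and the OCR text."""
    # Drop leading/trailing blank lines so triple-quoted strings compare cleanly.
    source_lines = source_text.strip("\n").splitlines()
    ocr_lines = ocr_text.strip("\n").splitlines()
    return list(
        difflib.unified_diff(
            source_lines,
            ocr_lines,
            fromfile="source",
            tofile="ocr",
            lineterm="",
        )
    )
```

An empty list means the two texts match line for line; otherwise lines starting with `-` exist only in the source and lines starting with `+` exist only in the OCR output.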