Train a model to extract specific text from a document (pdf or txt) - where to begin
What’s the Best Python Library for Extracting Text from PDFs?
We have a process at work where a PDF memo is downloaded and converted to a text document, and then someone has to go in, pick out the applicable data, and type it into a SQL Server database manually. I believe we should be able to automate this with machine learning by training a model to recognize where in the document the data comes from (the memos are somewhat structured, but the layout differs depending on the type of memo we get). We have years of extracted data in the database, plus the matching PDF/txt files, that could be used to train a model, but I don't know where to begin.
I have a master's in Data Science, but I've never used the ML/AI material I learned (I'm a data engineer), so I have just enough knowledge to know this should be doable and not enough to know how to do it.
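One possible first step, sketched here under some assumptions, is to turn the historical records into labeled examples by finding where each stored value appears in the memo text. The field names, the sample memo, and the helper `find_value_spans` are all hypothetical, and the sketch assumes values were typed in more or less verbatim. Values that were reformatted on entry (dates, currency) would need extra normalization.

```python
import re


def find_value_spans(text, record):
    """Locate each known database value inside the memo text.

    Returns a list of (field, start, end) character offsets, ordered by
    position. Fields whose value does not appear in the text are skipped.
    """
    spans = []
    for field, value in record.items():
        value = "" if value is None else str(value).strip()
        if not value:
            continue
        # Allow any run of whitespace wherever the stored value has a space,
        # because PDF-to-text conversion often changes spacing and line breaks.
        pattern = r"\s+".join(re.escape(part) for part in value.split())
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            spans.append((field, match.start(), match.end()))
    return sorted(spans, key=lambda span: span[1])


# Hypothetical memo text and the row a person typed into the database for it.
memo_text = """MEMO 2021-044
Vendor: Acme Industrial
Supply Co.
Amount due: 1,250.00 USD
"""
db_row = {
    "memo_id": "2021-044",
    "vendor": "Acme Industrial Supply Co.",
    "amount": "1,250.00",
}

spans = find_value_spans(memo_text, db_row)
for field, start, end in spans:
    print(field, repr(memo_text[start:end]))
```

Spans like these could serve as training labels for a sequence-labeling or extraction model. The share of fields that cannot be located this way is also a rough indicator of how much cleanup the historical data needs before any training.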
Hello everyone, I hope you're all doing well! I’m currently on the lookout for a library that can extract text in paragraph chunks from PDFs. For instance, I need it to pull out the Introduction with all its paragraphs separately, the Conclusion with all its paragraphs separately, and so on, essentially chunking the text by paragraphs. Do you have any suggestions? Thanks!
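For the chunking part, a heuristic sketch that works on text already pulled out of the PDF (for example with pypdf's `PdfReader(path).pages[i].extract_text()`) might look like the following. It assumes headings sit on their own line, that paragraphs are separated by blank lines, and that the section names in `HEADING_RE` match the documents in question. None of that is guaranteed for real PDFs, where extraction often loses blank lines, so treat it as a starting point rather than a general solution.

```python
import re

# Hypothetical list of heading names; adjust it to the documents at hand.
# An optional leading section number such as "1." or "2.3" is allowed.
HEADING_RE = re.compile(
    r"^\s*(?:\d+(?:\.\d+)*\.?\s+)?"
    r"(Abstract|Introduction|Methods|Results|Discussion|Conclusion|References)\s*$",
    re.IGNORECASE,
)


def chunk_by_section(text):
    """Group paragraphs under the most recent heading line.

    Paragraphs are assumed to be separated by blank lines.
    Returns a dict mapping section name -> list of paragraph strings.
    """
    sections = {}
    current = "Preamble"  # bucket for text that appears before any heading
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        if not lines:
            continue
        heading = HEADING_RE.match(lines[0])
        if heading:
            current = heading.group(1).title()
            lines = lines[1:]
        # Re-join hard-wrapped lines into a single paragraph string.
        paragraph = " ".join(line.strip() for line in lines).strip()
        if paragraph:
            sections.setdefault(current, []).append(paragraph)
    return sections


sample = """1. Introduction
PDF extraction is messy.
Lines wrap in odd places.

A second introductory paragraph.

2. Conclusion

Heuristics go a long way.
"""

sections = chunk_by_section(sample)
for name, paragraphs in sections.items():
    print(name, len(paragraphs))
```

If the extracted text has no blank lines between paragraphs, a layout-aware extractor that reports block positions or font sizes would likely be a better basis for detecting headings and paragraph boundaries than plain-text heuristics.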