I am interested in extracting certain entities from scientific publications. Extracting some entity types requires contextual understanding of the method, which is something LLMs excel at. However, even larger models like Llama3.1-70B on Groq still lead to slow inference overall. For example, I have used the Llama3.1-70B and Llama3.2-11B models on Groq for NER. To account for errors in logic, I have the models read the papers one page at a time, and I use chain-of-thought and self-consistency prompting to improve performance. They do well, but total inference time can run to several minutes. This can make the use of GPTs prohibitive, since I hope to extract entities from several hundred publications. Does anyone have advice on methods that would be faster and less error-prone, so that techniques like self-consistency are not necessary?
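For illustration, here is a minimal sketch of that page-by-page, majority-vote setup, assuming the groq Python client; the model name, prompt, and vote threshold are placeholders, not my exact code:

from collections import Counter
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

def extract_entities(page_text, n_samples=5):
    # Self-consistency: sample the same prompt several times at nonzero
    # temperature and keep entities that a majority of samples agree on.
    votes = Counter()
    for _ in range(n_samples):
        resp = client.chat.completions.create(
            model="llama-3.1-70b-versatile",  # placeholder model name
            messages=[{"role": "user", "content":
                       "List the named entities on this page, one per line:\n" + page_text}],
            temperature=0.7,  # nonzero so the samples differ
        )
        # De-duplicate within a single sample before voting
        votes.update({l.strip() for l in resp.choices[0].message.content.splitlines() if l.strip()})
    return [e for e, count in votes.items() if count > n_samples // 2]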
Other issues that I have realized with the Groq models:
The Groq models have context windows of only 8K tokens, which can make summarization of publications difficult. For this reason, I am looking at other options. My hardware is not the best, so using the 70B-parameter model is difficult.
Also, while tools like SpaCy are great for some NER entity types, as mentioned in this list here, I'm aware that my entity types are not in that list.
If anyone has recommendations for NER models on Huggingface or elsewhere, or for other tools that can extract specific entity types, I would greatly appreciate it!
UPDATE:
I have reformatted my prompting approach using the GPT+Groq setup, and the execution time is much faster. I am still comparing against other models, but precision, recall, F1, and execution time are all much better with GPT+Groq. The GLiNER models also do well but take about 8x longer to execute. Also, even the domain-specific GLiNER models tend to consistently miss certain entities, which unfortunately suggests those entities may not have been in their training data. So far, models with a larger training corpus, used via the free plan on Groq, seem to be the best method overall.
As I said, I am still testing this across multiple models and publications. But this is my experience so far. Data to follow.
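For reference, entity-level precision, recall, and F1 can be computed with exact-match scoring along these lines (exact match on (text, label) pairs is just one of several reasonable criteria):

def score(predicted, gold):
    # predicted and gold are sets of (text, label) pairs
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f1 = score({("Heathrow", "location")},
                 {("Heathrow", "location"), ("Paris", "location")})
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")  # P=1.00 R=0.50 F1=0.67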
Using generative large language models like Llama 3.1 is very inefficient for a task like NER. As this answer suggests, you can use traditional techniques like SpaCy and achieve very good results. However, the problem with non-LLM methods is that they are not flexible: you're limited to a set of predefined tags. If you want a flexible, powerful, and yet efficient solution, use GLiNER. It uses BERT at its core for named entity recognition. With GLiNER you can easily perform NER on your local hardware with whatever entities you want, using fewer than 500M parameters.
Installation
pip install gliner
Example of GLiNER Usage
from gliner import GLiNER

# Initialize GLiNER with the medium checkpoint
model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")

text = """On 25 July 1948, on the 39th anniversary of Bleriot's crossing of the English Channel,
the Type 618 Nene-Viking flew Heathrow to Paris (Villacoublay) in the morning carrying
letters to Bleriot's widow and son (secretary of the FAI), who met it at the airport."""

# Labels can be any strings; GLiNER matches spans to them zero-shot
labels = ["person", "company", "location", "airplane"]

# Only spans scoring above the threshold are returned
entities = model.predict_entities(text, labels, threshold=0.5)

for entity in entities:
    print(entity["text"], "=>", entity["label"])
Result
Bleriot => person
English Channel => location
Type 618 Nene-Viking => airplane
Heathrow => location
Paris => location
Villacoublay => location
Bleriot => person
There are various models with different parameter counts (166M-459M); choose the one that best suits your needs.
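For example, swapping in a larger checkpoint is a one-line change with the same API; the small/medium/large variants below are assumed from the urchade/gliner_* names on the Hugging Face hub:

# Same API, larger checkpoint
model = GLiNER.from_pretrained("urchade/gliner_large-v2.1")
entities = model.predict_entities(text, labels, threshold=0.5)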
Apart from your limitations, I wouldn't recommend using LLMs like Llama 3.1 for such a task. NER is one of the classic tasks of NLP, and there are smaller language models and tools you can incorporate to achieve your goal. You can use NLTK or SpaCy for this. My personal choice is SpaCy; however, gender as you defined it is not a known named entity. You can see a list of named entities in this doc.
I guess what you mean by gender is the possible gender associated with the name of a PERSON mentioned in your articles. There are a few Python packages you can use to look up genders; however, you should note that this can be very ambiguous, and there should be a substantial tolerance for error. You can use the gender-guesser package.
A possible solution would be like this:
import spacy
import gender_guesser.detector as gender

nlp = spacy.load("en_core_web_sm")

def extract_info(text):
    doc = nlp(text)
    gender_detector = gender.Detector()
    name_genders = []
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            name = ent.text
            # gender_guesser expects a first name, so pass the first token
            name_genders.append((name, gender_detector.get_gender(name.split()[0])))
    return doc.ents, name_genders
Note that en_core_web_sm is the small model available via spaCy; you can use the large model by specifying en_core_web_lg. Just make sure the model is downloaded before running your code. Here's how you can download it:
python -m spacy download en_core_web_sm
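Once the model is downloaded, the function above can be called directly; the sample sentence and printed output below are illustrative:

ents, person_genders = extract_info("Marie Curie met Pierre Curie in Paris.")
print(ents)            # e.g. (Marie Curie, Pierre Curie, Paris)
print(person_genders)  # e.g. [('Marie Curie', 'female'), ('Pierre Curie', 'male')]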
I have a use case where I need to extract all the different entities in a typical email signature.
So given an email signature, extract the following:
Input:
Bob Smith, VP of Consulting, Akamai Technologies, [email protected], 222-XXX-YYYY
Output:
- first name: Bob
- last name: Smith
- job title: VP of Consulting
- company: Akamai Technologies
- email address: [email protected]
- phone: 222-XXX-YYYY
Data format: we have tabular data (CSV) with Signature, first name, last name, company, etc. columns.
Outside of just using few-shot prompting, I'm thinking of fine-tuning Mistral 7B and Llama-2-7b-chat to accomplish this. Any other pointers?
Are there ways outside of using LLMs that are better suited?
Note: the current solution to address this problem uses regex.
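For context, the few-shot prompting baseline I have in mind looks roughly like this sketch, assuming an OpenAI-compatible client; the model name, example signature, and prompt are placeholders, not a working pipeline:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One worked example in the prompt; more shots generally help
FEW_SHOT = (
    "Extract first name, last name, job title, company, email, and phone from the signature.\n\n"
    "Signature: Jane Doe, CTO, Acme Corp, jane.doe@acme.example, 555-XXX-YYYY\n"
    "Output: first name: Jane | last name: Doe | job title: CTO | company: Acme Corp | "
    "email: jane.doe@acme.example | phone: 555-XXX-YYYY\n\n"
    "Signature: {signature}\nOutput:"
)

def parse_signature(signature):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": FEW_SHOT.format(signature=signature)}],
        temperature=0.0,  # deterministic output for extraction
    )
    return resp.choices[0].message.content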