I'm currently working on a research project where we're using GPT-3.5 heavily to process text.
When processing our text with Azure and GPT-3.5 Turbo, we're getting insane processing speeds of around 7 texts per second.
I just played around with Llama 2 70B on 2x A100 80GB, loaded in 8-bit with bf16 compute, and got only 0.6 texts per second. I'm using Hugging Face and wrote a standard script that tokenizes in batches and passes those batches to `model.generate(input_ids)` (sketch below). Different batch sizes make no difference in processing speed 🤷‍♂️
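For context, here's roughly what my script does. This is a minimal sketch, not my exact code; the model id, batch size, and generation parameters are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-70b-hf"  # placeholder HF repo id

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default
tokenizer.padding_side = "left"            # left-pad for decoder-only generation

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",          # shard the model across the 2x A100s
    load_in_8bit=True,          # 8-bit weights via bitsandbytes
    torch_dtype=torch.bfloat16, # bf16 for the non-quantized modules
)

texts = ["some input text"] * 16  # one batch of texts (batch size is illustrative)

# Tokenize a batch and generate for all texts at once.
inputs = tokenizer(texts, return_tensors="pt", padding=True).to(model.device)
with torch.no_grad():
    output_ids = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=256,
    )
outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```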
How do these guys make their models, which are larger than Llama 2 70B, generate so fast? It's more than 10x faster even including the API call overhead, and I'm using some of the best GPUs available, with NVLink.
Any tips and tricks for how I can make my "local" Llama faster without quantising it to uselessness?
I've played around with the text chat in the playground, but is this it? Is there a way to fine-tune the AI for a particular style or type of content without just getting repeats? I've seen people use it for copywriting etc., but I'm curious how you go about improving the model yourself.