Videos
I'm excited to share my recent side-by-side comparison of Anthropic's Claude 3.5 Sonnet and OpenAI's GPT-4o models. Using my AI-powered trading platform NexusTrade as a testing ground, I put these models through their paces on complex financial tasks.
Some key findings:
✅ Claude excels at reasoning and human-like responses, creating a more natural chat experience
✅ GPT-4o is significantly faster, especially when chaining multiple prompts
✅ Claude performed better on complex portfolio configuration tasks
✅ GPT-4o handled certain database queries more effectively
✅ Claude is nearly 2x cheaper for input tokens and has a 50% larger context window
While there's no clear winner across all scenarios, I found Claude 3.5 Sonnet to be slightly better overall for my specific use case. Its ability to handle complex reasoning tasks and generate more natural responses gives it an edge, despite being slower.
Does this align with your experience? Have you tried out the new Claude 3.5 Sonnet model? What did you think?
Also, if you want to read the full comparison, check out the detailed analysis here.
This isn’t specific to just OpenAI and Anthropic, but they are the two clear leaders to build an LLM-heavy application on top of, so I am focusing on them. In terms of model quality, I think OpenAI and Anthropic are pretty head to head right now, with Anthropic seemingly coming out ahead in the tradeoff war at this point. In terms of the API, though, OpenAI completely crushes Anthropic in every aspect, and as a developer I’ve found this very frustrating. Below is a list of things OpenAI supports that Anthropic doesn’t, which have hurt my development experience quite a bit:
1. Injecting system messages anywhere. OpenAI lets you use multiple system messages and place them wherever you want in the conversation. Anthropic forces a single system message that can only sit at the top of the conversation. This seriously restricts the ability to simulate LLM consciousness throughout a conversation, since the model has no concept of background events in relation to each message.
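For illustration, here is a rough sketch of the difference, assuming the official openai and anthropic Python SDKs; the model names, messages, and the "background event" are placeholders, not anything from my actual app:

```python
from openai import OpenAI
from anthropic import Anthropic

openai_client = OpenAI()
anthropic_client = Anthropic()

# OpenAI: system messages can appear multiple times, anywhere in the list,
# which makes it easy to inject "background events" mid-conversation.
openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a trading assistant."},
        {"role": "user", "content": "What is my portfolio worth?"},
        {"role": "assistant", "content": "Your portfolio is worth $12,400."},
        {"role": "system", "content": "Background event: the user's brokerage account just finished syncing."},
        {"role": "user", "content": "Anything change?"},
    ],
)

# Anthropic: a single system prompt is a top-level parameter, and the messages
# list only carries alternating user/assistant turns, so there is nowhere to
# slot a system-level event between messages.
anthropic_client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    system="You are a trading assistant.",
    messages=[
        {"role": "user", "content": "What is my portfolio worth?"},
        {"role": "assistant", "content": "Your portfolio is worth $12,400."},
        {"role": "user", "content": "Anything change?"},
    ],
)
```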
2. Multiple turns in a row by the same sender. Why can’t the user or the AI send two messages in a row? This happens all the time in real conversations. In conversations where the AI is using tools, allowing it gives a more natural feel both on the development side and for the end user.
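One workaround on the Anthropic side is to collapse back-to-back same-role turns into a single message before sending. A minimal sketch, assuming plain string content rather than content-block lists:

```python
def merge_consecutive_turns(messages):
    """Collapse back-to-back messages from the same role into a single turn,
    since Anthropic's Messages API expects user/assistant roles to alternate."""
    merged = []
    for msg in messages:
        if merged and merged[-1]["role"] == msg["role"]:
            merged[-1]["content"] += "\n\n" + msg["content"]
        else:
            merged.append({"role": msg["role"], "content": msg["content"]})
    return merged

# OpenAI accepts this history as-is; for Anthropic it needs to be merged first.
history = [
    {"role": "user", "content": "Here's my watchlist: AAPL, MSFT."},
    {"role": "user", "content": "Also, I just added NVDA."},
    {"role": "assistant", "content": "Got it, tracking all three."},
]
anthropic_history = merge_consecutive_turns(history)
```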
3. Setting tool use to "none", or removing tools entirely from conversations that contain tool messages. I find tools are a great way to simulate LLM-triggered events between responses, and I sometimes like simulating tool use with my own predefined context. In those scenarios there is no point in letting the LLM actually call the tool: doing so adds latency from another API call, means paying the input price twice for that extra call, and introduces uncertainty about the quality of the tool request. With tool_choice "auto" being the closest thing to banning the LLM from using a tool, the model gets even more opportunity to go off on its own tangent and start attempting to use tools on its own. A workaround would be to remove the tool definitions, but Anthropic also bans this for conversations with tool messages. Why?
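To make the contrast concrete, here is a sketch of the OpenAI side, reusing the clients from the earlier sketch; the tool definition and conversation are hypothetical:

```python
# Hypothetical tool definition for illustration.
openai_tools = [{
    "type": "function",
    "function": {
        "name": "get_portfolio_value",
        "description": "Fetch the current value of the user's portfolio.",
        "parameters": {"type": "object", "properties": {}},
    },
}]

# A conversation that already contains tool messages (abbreviated placeholder).
tool_history = [
    {"role": "user", "content": "Refresh my portfolio."},
    # ... the assistant's tool_calls and the corresponding "tool" result
    #     messages from earlier in the conversation would appear here ...
]

# OpenAI: keep the tool definitions so the earlier tool messages stay valid,
# but forbid any new tool calls on this turn.
openai_client.chat.completions.create(
    model="gpt-4o",
    messages=tool_history,
    tools=openai_tools,
    tool_choice="none",
)

# Anthropic's closest option is tool_choice={"type": "auto"}, which still leaves
# the model free to emit a tool_use block, and dropping the tools parameter is
# rejected once the conversation already contains tool messages.
```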
4. Conversations with tool use carrying such an extensive hidden system prompt, and the latency hit that comes with it. Anthropic states that enabling tools adds a system prompt of roughly 250-300 tokens. That already feels excessive, since OpenAI seems to use around 100-150, but my manual testing suggests even that figure is understated: the bare minimum input I have found for a conversation with a tool is just under 600 tokens, and with near nothing else in the conversation or context, that would put the injected system prompt closer to 500 tokens. On top of this, I have noticed a consistent slowdown in time to first token when using tools with Anthropic versus not; with OpenAI there is no slowdown whatsoever.
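The token overhead is easy to sanity-check: send the same tiny message with and without a tool attached and diff the reported input token counts. A rough sketch, reusing the Anthropic client from above (the tool definition is hypothetical, and exact numbers will vary with the model and tool schema):

```python
# Anthropic-format tool definition (hypothetical).
anthropic_tools = [{
    "name": "get_portfolio_value",
    "description": "Fetch the current value of the user's portfolio.",
    "input_schema": {"type": "object", "properties": {}},
}]

base = anthropic_client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=16,
    messages=[{"role": "user", "content": "hi"}],
)
with_tools = anthropic_client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=16,
    messages=[{"role": "user", "content": "hi"}],
    tools=anthropic_tools,
)

# The difference approximates the hidden tool-use system prompt plus the
# tokenized tool definition itself.
print("tool-use input-token overhead:",
      with_tools.usage.input_tokens - base.usage.input_tokens)
```

The same pair of calls, run with stream=True and timed, is how I would check the time-to-first-token difference as well.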
5. Batching tool use responses. This ties back to point 2 about multiple turns in a row: if you want to use more than one tool at a time, you have to structure the conversation as ai -> tool result -> ai -> tool result. With OpenAI, I can simply do ai -> tool results. This is relatively minor, but it adds extra token usage and development overhead, and the experience is less natural, since tool usage is forced into a more sequential process.
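For reference, a sketch of the OpenAI message shape I mean, with one assistant turn requesting two tools and both results coming back before the next assistant turn; the IDs, tool name, and values are made up:

```python
messages = [
    {"role": "user", "content": "Compare AAPL and MSFT."},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {"id": "call_1", "type": "function",
             "function": {"name": "get_quote", "arguments": '{"ticker": "AAPL"}'}},
            {"id": "call_2", "type": "function",
             "function": {"name": "get_quote", "arguments": '{"ticker": "MSFT"}'}},
        ],
    },
    # Both tool results are batched after the single assistant turn.
    {"role": "tool", "tool_call_id": "call_1", "content": '{"price": 230.10}'},
    {"role": "tool", "tool_call_id": "call_2", "content": '{"price": 415.30}'},
]
```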
As a side note, I still think OpenAI’s API is overly restrictive and has many gotchas it does not need to have. I may make a separate post on this.
These LLMs cost millions per day to run, and even with super popular paid API services, the revenue is nowhere near enough to cover the mind-boggling costs. What happens next?
Do the big AI industry leaders raise prices to cover OpEx, and do companies then realize a low-paid human is just less of a headache than an LLM?
Do they just keep absorbing billions in losses like Uber did until cab companies were destroyed, then enjoy no competition?
Are they holding out until a model is capable of displacing enough people that it actually is a good value for business customers?