Has anyone been able to put the Deep Research API to any good use? I am finding it extremely hard to steer this model, plus it keeps defaulting to its knowledge-cutoff timeline when making research plans, even when I have provided it with all the necessary tools and information.
Another issue is that it keeps defaulting to web search when the MCP tools I have provided would give much better data for certain tasks.
No amount of prompting helps. Anyone figured out how to make it follow a plan?
Hi, has anybody implemented the OpenAI Deep Research API that was released earlier?
https://cookbook.openai.com/examples/deep_research_api/introduction_to_deep_research_api
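For anyone who hasn't dug into it yet, the cookbook example boils down to something like the sketch below; the model name, tool types and MCP server details are placeholders, so they may not match the current API exactly.

from openai import OpenAI

client = OpenAI()

# Rough sketch based on the cookbook example above; model name, tool types and
# the MCP server details are placeholders - check the cookbook for the exact spec.
response = client.responses.create(
    model="o3-deep-research",
    input="Research the current landscape of solid-state battery manufacturing.",
    tools=[
        {"type": "web_search_preview"},
        {
            "type": "mcp",
            "server_label": "internal_data",           # hypothetical MCP server
            "server_url": "https://example.com/mcp",   # placeholder URL
            "require_approval": "never",
        },
    ],
)
print(response.output_text)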
So as we all know, OpenAI's Deep Research is being talked about everywhere. The same goes for the company I work for: everyone wanted to try it. They finally tried it the other day and got a report of around 26 pages on a very specific subject, and the quality was OK. But when I saw the structure it hit me: it was clearly a bunch of separate search queries stitched together and rewritten by their model. So why couldn't we just make it ourselves? So I built the agentic workflow that you see in the image using the AI Workflow Automation plugin for WordPress, which has an integration with the Perplexity API.
Basically, this is how it works: a research query comes in and gets sent to several different research nodes, each running Sonar Pro, and each one researches the topic from a different angle. Each research node then passes its results to an AI node, for which I used Grok 2 because of its large output context window and good writing skills, and all of them come together to create a unified research report.
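If you'd rather reproduce the same idea in plain Python instead of the WordPress plugin, a rough fan-out/fan-in sketch could look like the following; the angle prompts, model names, and the assumption that both providers expose OpenAI-compatible chat endpoints are mine, not part of the original workflow.

from openai import OpenAI

# Assumptions: both providers expose OpenAI-compatible chat endpoints;
# base URLs and model names below may need adjusting for your accounts.
perplexity = OpenAI(api_key="PPLX_KEY", base_url="https://api.perplexity.ai")
grok = OpenAI(api_key="XAI_KEY", base_url="https://api.x.ai/v1")

query = "Impact of solid-state batteries on EV supply chains"
angles = ["market landscape", "technical challenges", "key players", "regulatory outlook"]

# Fan out: one Sonar Pro research call per angle.
findings = []
for angle in angles:
    r = perplexity.chat.completions.create(
        model="sonar-pro",
        messages=[{"role": "user", "content": f"Research '{query}' focusing on {angle}. Cite sources."}],
    )
    findings.append(f"{angle}:\n{r.choices[0].message.content}")

# Fan in: consolidate everything into one report with a long-output model.
report = grok.chat.completions.create(
    model="grok-2",
    messages=[{"role": "user", "content": "Merge these findings into a single cited research report:\n\n" + "\n\n".join(findings)}],
)
print(report.choices[0].message.content)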
I could generate a well cited and factual report of around 5000 to 6000 words in around 7 minutes, which in terms of quality was really on par with the OpenAI one. And the best thing about it: it cost less than 30 cents to run the whole thing!! You can see it in the second image.
And no, I didn't run it against any standard benchmark; this was my own qualitative review of the result.
So yes, you can make your own agents, and I love Perplexity's Sonar Pro and Sonar Reasoning for this use case, especially now that you can limit the time window of the search and limit the search context. It's amazing.
You should be able to get similar results with a workflow built in a tool like N8N if you don't use AI Workflow Automation.
If you do use the plugin and you want the workflow, send me a dm and I'd be happy to share it with you!
This system can reason about what it knows and what it does not know when performing big searches using o3 or DeepSeek.
This might seem like a small thing within research, but if you really think about it, this is the start of something much bigger. If agents can understand what they don't know—just like a human—they can reason about what they need to learn. This has the potential to make the process of agents acquiring information much, much faster and, in turn, make them much smarter.
Let me know your thoughts; any feedback is much appreciated, and if enough people like it I can turn it into an API that agents can use.
Thanks, code below:
I've built a deep research implementation using the OpenAI Agents SDK which was released 2 weeks ago - it can be called from the CLI or a Python script to produce long reports on any given topic. It's compatible with any models using the OpenAI API spec (DeepSeek, OpenRouter etc.), and also uses OpenAI's tracing feature (handy for debugging / seeing exactly what's happening under the hood).
Sharing how it works here in case it's helpful for others.
https://github.com/qx-labs/agents-deep-research
Or:
pip install deep-researcher
It does the following:
Carries out initial research/planning on the query to understand the question / topic
Splits the research topic into sub-topics and sub-sections
Iteratively runs research on each sub-topic - this is done in async/parallel to maximise speed
Consolidates all findings into a single report with references
If using OpenAI models, includes a full trace of the workflow and agent calls in OpenAI's trace system
It has 2 modes:
Simple: runs the iterative researcher in a single loop without the initial planning step (for faster output on a narrower topic or question)
Deep: runs the planning step with multiple concurrent iterative researchers deployed on each sub-topic (for deeper / more expansive reports)
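To make the parallel research step concrete, here is a minimal sketch of how the sub-topic loop could look with the Agents SDK; the agent definitions and prompts are a simplification for illustration, not the repo's actual code.

import asyncio
from agents import Agent, Runner, WebSearchTool  # openai-agents SDK

# Simplified sketch: one researcher agent run concurrently per sub-topic,
# then a writer agent consolidates the findings into a single report.
researcher = Agent(
    name="Researcher",
    instructions="Research the given sub-topic and return cited findings.",
    tools=[WebSearchTool()],
)
writer = Agent(
    name="Writer",
    instructions="Combine the findings into one coherent report with references.",
)

async def run_deep_research(query: str, sub_topics: list[str]) -> str:
    # Research all sub-topics in parallel to maximise speed.
    results = await asyncio.gather(
        *(Runner.run(researcher, f"{query} - focus on: {t}") for t in sub_topics)
    )
    findings = "\n\n".join(r.final_output for r in results)
    report = await Runner.run(writer, f"Query: {query}\n\nFindings:\n{findings}")
    return report.final_output

# Example:
# print(asyncio.run(run_deep_research("State of quantum error correction",
#                                     ["hardware", "algorithms", "industry players"])))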
I'll comment separately with a diagram of the architecture for clarity.
Some interesting findings:
gpt-4o-mini tends to be sufficient for the vast majority of the workflow. It actually benchmarks higher than o3-mini for tool selection tasks (see the Berkeley Function Calling Leaderboard) and is faster than both 4o and o3-mini. Since the research relies on retrieved findings rather than general world knowledge, the wider training set of 4o doesn't add much over 4o-mini.
LLMs are terrible at following word count instructions. They are therefore better off being guided on a heuristic that they have seen in their training data (e.g. "length of a tweet", "a few paragraphs", "2 pages").
Despite having massive output token limits, most LLMs max out at ~1,500-2,000 output words as they simply haven't been trained to produce longer outputs. Trying to get it to produce the "length of a book", for example, doesn't work. Instead you either have to run your own training, or follow methods like this one that sequentially stream chunks of output across multiple LLM calls. You could also just concatenate the output from each section of a report, but I've found that this leads to a lot of repetition because each section inevitably has some overlapping scope. I haven't yet implemented a long writer for the last step but am working on this so that it can produce 20-50 page detailed reports (instead of 5-15 pages).
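As an illustration of that chunked-writing idea (my own minimal version, not the specific method linked above): generate an outline first, then write each section in its own call while passing along a running summary of what has already been written.

from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    r = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return r.choices[0].message.content

def write_long_report(topic: str, findings: str) -> str:
    # One call per section, each seeing a running summary of what exists so far,
    # so sections stay consistent and avoid repeating overlapping scope.
    outline = ask(f"Write a numbered outline of 6-8 section titles for a report on: {topic}")
    sections, summary_so_far = [], ""
    for title in [line for line in outline.splitlines() if line.strip()]:
        section = ask(
            f"Report topic: {topic}\nFindings:\n{findings}\n\n"
            f"Already covered (summary): {summary_so_far or 'nothing yet'}\n\n"
            f"Write the section '{title}' in detailed prose. Do not repeat covered material."
        )
        sections.append(section)
        summary_so_far = ask(f"Summarise in 3 bullet points: {section}")
    return "\n\n".join(sections)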
Feel free to try it out, share thoughts and contribute. At the moment it can only use Serper.dev or OpenAI's WebSearch tool for running SERP queries, but happy to expand this if there's interest. Similarly, it can easily be extended with other tools (at the moment it has access to a site crawler and a web search retriever, but it could be given access to local files, specific APIs, etc.).
This is designed not to ask follow-up questions so that it can be fully automated as part of a wider app or pipeline without human input.
I built an open source deep research implementation using the OpenAI Agents SDK that was released 2 weeks ago. It works with any models that are compatible with the OpenAI API spec and can handle structured outputs, which includes Gemini, Ollama, DeepSeek and others.
The intention is for it to be a lightweight and extendable starting point, such that it's easy to add custom tools to the research loop such as local file search/retrieval or specific APIs.
It does the following:
Carries out initial research/planning on the query to understand the question / topic
Splits the research topic into sub-topics and sub-sections
Iteratively runs research on each sub-topic - this is done in async/parallel to maximise speed
Consolidates all findings into a single report with references
If using OpenAI models, includes a full trace of the workflow and agent calls in OpenAI's trace system
It has 2 modes:
Simple: runs the iterative researcher in a single loop without the initial planning step (for faster output on a narrower topic or question)
Deep: runs the planning step with multiple concurrent iterative researchers deployed on each sub-topic (for deeper / more expansive reports)
I'll post a pic of the architecture in the comments for clarity.
Some interesting findings:
gpt-4o-mini and other smaller models with large context windows work surprisingly well for the vast majority of the workflow. 4o-mini actually benchmarks similarly to o3-mini for tool selection tasks (check out the Berkeley Function Calling Leaderboard) and is way faster than both 4o and o3-mini. Since the research relies on retrieved findings rather than general world knowledge, the wider training set of larger models doesn't yield much benefit.
LLMs are terrible at following word count instructions. They are therefore better off being guided on a heuristic that they have seen in their training data (e.g. "length of a tweet", "a few paragraphs", "2 pages").
Despite having massive output token limits, most LLMs max out at ~1,500-2,000 output words as they haven't been trained to produce longer outputs. Trying to get it to produce the "length of a book", for example, doesn't work. Instead you either have to run your own training, or sequentially stream chunks of output across multiple LLM calls. You could also just concatenate the output from each section of a report, but you get a lot of repetition across sections. I'm currently working on a long writer so that it can produce 20-50 page detailed reports (instead of 5-15 pages with loss of detail in the final step).
Feel free to try it out, share thoughts and contribute. At the moment it can only use Serper or OpenAI's WebSearch tool for running SERP queries, but can easily expand this if there's interest.
Now that it's in the API, that means you can benchmark it. I wonder what the difference between regular o3 vs o3 deep research might be on something like LiveBench?
I’m looking for a strong alternative to OpenAI’s Deep Research agent — something that actually delivers and isn’t just fluff. Ideally, I want something that can either be run locally or accessed via a solid API. Performance should be on par with Deep Research, if not better. Any recommendations?
Curious how much I can start automating research tasks so that I can batch out queries, not have to wait around, and then come back hours later to a multi-layered and structured finished product.
I built a deep research implementation that allows you to produce 20+ page detailed research reports, compatible with online and locally deployed models. Built using the OpenAI Agents SDK that was released a couple weeks ago. Have had a lot of learnings from building this so thought I'd share for those interested.
You can run it from the CLI or a Python script and it will output a report.
https://github.com/qx-labs/agents-deep-research
Or pip install deep-researcher
Some examples of the output below:
Text Book on Quantum Computing - 5,253 words (run in 'deep' mode)
Deep-Dive on Tesla - 4,732 words (run in 'deep' mode)
Market Sizing - 1,001 words (run in 'simple' mode)
It does the following (I'll share a diagram in the comments for ref):
Carries out initial research/planning on the query to understand the question / topic
Splits the research topic into sub-topics and sub-sections
Iteratively runs research on each sub-topic - this is done in async/parallel to maximise speed
Consolidates all findings into a single report with references (I use a streaming methodology explained here to achieve outputs that are much longer than these models can typically produce)
It has 2 modes:
Simple: runs the iterative researcher in a single loop without the initial planning step (for faster output on a narrower topic or question)
Deep: runs the planning step with multiple concurrent iterative researchers deployed on each sub-topic (for deeper / more expansive reports)
Some interesting findings - perhaps relevant to others working on this sort of stuff:
I get much better results chaining together cheap models rather than having an expensive model with lots of tools think for itself. As a result I find I can get equally good results in my implementation running the entire workflow with e.g. 4o-mini (or an equivalent open model) which keeps costs/computational overhead low.
I've found that all models are terrible at following word count instructions (likely because they don't have any concept of counting in their training data). Better to give them a heuristic they're familiar with (e.g. length of a tweet, a couple of paragraphs, etc.)
Most models can't produce more than 1,000-2,000 words of output despite having much higher token limits, and if you try to force longer outputs they often degrade in quality (not surprising given that LLMs are probabilistic), so you're better off chaining together long responses through multiple calls
At the moment the implementation only works with models that support both structured outputs and tool calling, but I'm making adjustments to make it more flexible. Also working on integrating RAG for local files.
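For anyone unfamiliar with the structured-outputs requirement, the planning step relies on something like the following (a generic sketch using the OpenAI SDK's Pydantic parsing; the schema is hypothetical, not the repo's actual one):

from pydantic import BaseModel
from openai import OpenAI

class ResearchPlan(BaseModel):
    # Hypothetical schema for illustration; the repo's actual fields may differ.
    query: str
    sub_topics: list[str]

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Plan research for: impact of tariffs on EV prices"}],
    response_format=ResearchPlan,
)
plan = completion.choices[0].message.parsed
print(plan.sub_topics)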
Hope it proves helpful!
Interestingly, o3 deep research has the old o3 pricing.
Hey everyone,
I just started using the new Deep Research feature in ChatGPT, and I’m curious about the best ways to prompt it for optimal results. According to OpenAI’s FAQ, it’s designed for multi-step, in-depth research using web data, but it also seems to generate structured reports rather than just quick summaries.
For those who have used it, what are some best practices for getting the most out of Deep Research?
Are there specific phrasing techniques that lead to better results?
Should I provide as much context as possible up front, or does it refine the scope as it researches?
What information is absolutely essential to include in your prompt? And what can be omitted?
How does it handle broad vs. narrow topics? If I want a comparison of niche products or a technical literature review, what’s the best way to ask?
Can it do multiple things in one prompt? For example, costs of X, benefits of X, effectiveness of X?
Has anyone noticed certain types of questions that it excels at—or struggles with?
I’m also curious how you guys iterate and experiment with using Deep Research. It seems quite hard to get the hang of it with so few monthly reports.
I’ve been on a mission to streamline how I conduct in-depth research with AI—especially when tackling academic papers, business analyses, or larger investigative projects. After experimenting with a variety of approaches, I ended up gravitating toward something called “Deep Research” (a higher-tier ChatGPT Pro feature) and building out a set of multi-step workflows. Below is everything I’ve learned, plus tips and best practices that have helped me unlock deeper, more reliable insights from AI.
1. Why “Deep Research” Is Worth Considering
Game-Changing Depth.
At its core, Deep Research can sift through a broader set of sources (arXiv, academic journals, websites, etc.) and produce lengthy, detailed reports—sometimes upwards of 25 or even 50 pages of analysis. If you regularly deal with complex subjects—like a dissertation, conference paper, or big market research—having a single AI-driven “agent” that compiles all that data can save a ton of time.
Cost vs. Value.
Yes, the monthly subscription can be steep (around $200/month). But if you do significant research for work or academia, it can quickly pay for itself by saving you hours upon hours of manual searching. Some people sign up only when they have a major project due, then cancel afterward. Others (like me) see it as a long-term asset.
2. Key Observations & Takeaways
Prompt Engineering Still Matters
Even though Deep Research is powerful, it’s not a magical “ask-one-question-get-all-the-answers” tool. I’ve found that structured, well-thought-out prompts can be the difference between a shallow summary and a deeply reasoned analysis. When I give it specific instructions—like what type of sources to prioritize, or what sections to include—it consistently delivers better, more trustworthy outputs.
Balancing AI with Human Expertise
While AI can handle a lot of the grunt work—pulling references, summarizing existing literature—it can still hallucinate or miss nuances. I always verify important data, especially if it’s going into an academic paper or business proposal. The sweet spot is letting AI handle the heavy lifting while I keep a watchful eye on citations and overall coherence.
Workflow Pipelines
For larger projects, it’s often not just about one big prompt. I might start with a “lightweight” model or cheaper GPT mode to create a plan or outline. Once that skeleton is done, I feed it into Deep Research with instructions to gather more sources, cross-check references, and generate a comprehensive final report. This staged approach ensures each step builds on the last.
3. Tools & Alternatives I’ve Experimented With
Deep Research (ChatGPT Pro) – The most robust option I’ve tested. Handles extensive queries and large context windows. Often requires 10–30 minutes to compile a truly deep analysis, but the thoroughness is remarkable.
GPT Researcher – An open-source approach where you use your own OpenAI API key. Pay-as-you-go: costs pennies per query, which can be cheaper if you don’t need massive multi-page reports every day.
Perplexity Pro, DeepSeek, Gemini – Each has its own strengths, but in my experience, none quite match the depth of the ChatGPT Pro “Deep Research” tier. Still, if you only need quick overviews, these might be enough.
4. My Advanced Workflow & Strategies
A. Multi-Step Prompting & Orchestration
Plan Prompt (Cheaper/Smaller Model). Start by outlining objectives, methods, or scope in a less expensive model (like “o3-mini”). This is your research blueprint.
Refine the Plan (More Capable Model). Feed that outline to a higher-tier model (like “o1-pro”) to create a clear, detailed research plan—covering objectives, data sources, and evaluation criteria.
Deep Dive (Deep Research). Finally, give the refined plan to Deep Research, instructing it to gather references, analyze them, and synthesize a comprehensive report.
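As a rough sketch of what that three-step chain can look like when scripted rather than run in the ChatGPT UI (model names are placeholders for whatever cheap, strong and deep-research models you have API access to):

from openai import OpenAI

client = OpenAI()

def ask(model: str, prompt: str) -> str:
    r = client.chat.completions.create(model=model, messages=[{"role": "user", "content": prompt}])
    return r.choices[0].message.content

topic = "Adoption barriers for heat pumps in cold climates"

# 1. Plan with a cheap model (placeholder model name).
outline = ask("o3-mini", f"Outline objectives, methods and scope for researching: {topic}")

# 2. Refine the plan with a stronger model (placeholder model name).
plan = ask("o1", f"Turn this outline into a detailed research plan covering objectives, data sources and evaluation criteria:\n{outline}")

# 3. Hand the plan to a deep-research model (assumes API access to one).
report = client.responses.create(
    model="o3-deep-research",
    input=f"Execute this research plan and produce a comprehensive cited report:\n{plan}",
    tools=[{"type": "web_search_preview"}],
)
print(report.output_text)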
B. System Prompt for a Clear Research Plan
Here’s a system prompt template I often rely on before diving into a deeper analysis:
You are given various potential options or approaches for a project. Convert these into a well-structured research plan that:

1. Identifies Key Objectives
   - Clarify what questions each option aims to answer
   - Detail the data/info needed for evaluation

2. Describes Research Methods
   - Outline how you’ll gather and analyze data
   - Mention tools or methodologies for each approach

3. Provides Evaluation Criteria
   - Metrics, benchmarks, or qualitative factors to compare options
   - Criteria for success or viability

4. Specifies Expected Outcomes
   - Possible findings or results
   - Next steps or actions following the research

Produce a methodical plan focusing on clear, practical steps.
This prompt ensures the AI thinks like a project planner instead of just throwing random info at me.
C. “Tournament” or “Playoff” Strategy
When I need to compare multiple software tools or solutions, I use a “bracket” approach. I tell the AI to pit each option against another—like a round-robin tournament—and systematically eliminate the weaker option based on preset criteria (cost, performance, user-friendliness, etc.).
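If you want to script the bracket instead of running it in chat, a minimal single-elimination sketch might look like this (criteria, options and prompts are just examples):

from openai import OpenAI

client = OpenAI()
criteria = "cost, performance, user-friendliness"

def winner(a: str, b: str) -> str:
    # Ask the model to compare a pair and name the stronger option.
    r = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
            f"Compare {a} vs {b} on {criteria}. Reply with only the name of the stronger option."}],
    )
    return r.choices[0].message.content.strip()

options = ["Notion", "Obsidian", "Evernote", "OneNote"]
# Single-elimination bracket: pair options off until one remains.
while len(options) > 1:
    next_round = [winner(options[i], options[i + 1]) for i in range(0, len(options) - 1, 2)]
    if len(options) % 2:  # odd number of options: the last one gets a bye
        next_round.append(options[-1])
    options = next_round
print("Overall pick:", options[0])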
D. Follow-Up Summaries for Different Audiences
After Deep Research pumps out a massive 30-page analysis, I often ask a simpler GPT model to summarize it for different audiences—like a 1-page executive brief for my boss or bullet points for a stakeholder who just wants quick highlights.
E. Custom Instructions for Nuanced Output
You can include special instructions like:
“Ask for my consent after each section before proceeding.”
“Maintain a PhD-level depth, but use concise bullet points.”
“Wrap up every response with a short menu of next possible tasks.”
F. Verification & Caution
AI can still be confidently wrong—especially with older or niche material. I always fact-check any reference that seems too good to be true. Paywalled journals can be out of the AI’s reach, so combining AI findings with manual checks is crucial.
5. Best Practices I Swear By
Don’t Fully Outsource Your Brain. AI is fantastic for heavy lifting, but it can’t replace your own expertise. Use it to speed up the process, not skip the thinking.
Iterate & Refine. The best results often come after multiple rounds of polishing. Start general, zoom in as you go.
Leverage Custom Prompts. Whether it’s a multi-chapter dissertation outline or a single “tournament bracket,” well-structured prompts unlock far richer output.
Guard Against Hallucinations. Check references, especially if it’s important academically or professionally.
Mind Your ROI. If you handle major research tasks regularly, paying $200/month might be justified. If not, look into alternatives like GPT Researcher.
Use Summaries & Excerpts. Sometimes the model will drop a 50-page doc. Immediately get a 2- or 3-page summary—your future self will thank you.
Final Thoughts
For me, “Deep Research” has been a game-changer—especially when combined with careful prompt engineering and a multi-step workflow. The tool’s depth is unparalleled for large-scale academic or professional research, but it does come with a hefty price tag and occasional pitfalls. In the end, the real key is how you orchestrate the entire research process.
If you’ve been curious about taking your AI-driven research to the next level, I’d recommend at least trying out these approaches. A little bit of upfront prompt planning pays massive dividends in clarity, depth, and time saved.
TL;DR:
Deep Research generates massive, source-backed analyses, ideal for big projects.
Structured prompts and iterative workflows improve quality.
Verify references, use custom instructions, and deploy summary prompts for efficiency.
If $200/month is steep, consider open-source or pay-per-call alternatives.
Hope this helps anyone diving into advanced AI research workflows!
Hi everyone,
I'm interested in your opinions on Deep Research models that can either be called through an API or be easily deployed behind a simple endpoint (for instance on Vertex AI or Hugging Face).
Obviously OpenAI's o3 deep research is strong but not available through the API. The same goes for Perplexity and Grok 3, as far as I know.
So it narrows the options down to DeepSeek R1, which is capable of deep search and has easily available endpoints / API documentation. Gemini 2.0 obviously has API access too, but it remains unclear whether the deep research function is actually available through the API.
What are your thoughts?
Thanks for the help
Deep Research has completely changed how I approach research. I canceled my Perplexity Pro plan because this does everything I need. It’s fast, reliable, and actually helps cut through the noise.
For example, if you’re someone like me who constantly has a million thoughts running in the back of your mind—Is this a good research paper? How reliable is this? Is this the best model to use? Is there a better prompting technique? Has anyone else explored this idea?—this tool solves that.
It took a 24-minute reasoning process, gathered 38 sources (mostly from arXiv), and delivered a 25-page research analysis. It’s insane.
Curious to hear from others…What are your thoughts?
Note: all of the examples are way too long to even post lol