Physical AI Agents: What IBM's Distinguished Engineer Learned Building Enterprise Middleware

Manav Gupta
Note: This article was generated from the transcript of the original podcast episode. It has been edited for clarity and structure.

“Large language models are insufficient and give you very limited productivity for enterprise use cases.” That’s not a hot take from a skeptic—it’s Mihai Criveti, IBM Distinguished Engineer and creator of Context Forge, IBM’s open source gateway for agentic AI. After four years of deploying AI with enterprise customers, he’s learned that the real progress isn’t happening in the models. It’s happening in the tooling, the agents, and the middleware that makes them actually work.

The Skills AI Can’t Replace

For students entering tech or business today, Mihai’s advice cuts against the grain of most AI optimism: don’t let your traditional skills become obsolete.

“Today AI works for people that already know how to do things without AI. So that’s a problem. And the second problem you’re going to face is that you need experience. And the only way to get experience is to lack experience—experience failure, experience fixing those failures by hand, and then leveraging that experience with AI to solve problems.”

The models stumble. They build code that’s incorrect, incomplete, or fails in spectacular ways. Without the experience to debug those failures, you’re stuck.

“If you’re heading towards a brick wall, you’re just gonna head faster towards the brick wall and it’s gonna hurt.”

The same skills that mattered before—communication, collaboration, troubleshooting, debugging, reasoning—still matter. AI just accelerates whatever you were doing. Sometimes that’s not a good thing, because it skips the learning.

“Before you had one problem and you had to fix it. Now you have a hundred concurrent problems in open terminals that you need to address concurrently.”

Why MCP Wasn’t Enough for Enterprise

Context Forge started in November 2024 when Mihai was “being forced to take vacation” by his wife. At his in-laws, looking for a way to appear busy, he found Anthropic’s MCP paper and quickly wrote the first version of what would become IBM’s agent gateway.

“I quickly understood that MCP as a protocol was insufficient, incomplete, and the implementation itself lacked a lot of enterprise features—security and authentication, authorization, observability, monitoring, reliability, high availability, routing, self-healing, remediation.”

The specification couldn’t even handle conversations that exposed more than about 10 tools in practice, with 128 as the hard ceiling. And even as MCP evolved to address these gaps, a new problem emerged: 20,000 MCP servers that implement different versions of the spec, implement it incompletely, or support only transports like STDIO.

“The more you try to fix the problem, the worse it can become.”

This is an innovator’s dilemma within an innovator’s dilemma. Even Anthropic, with arguably the strongest model on the planet, faces the same challenge every enterprise faces: once you ship something, you have to update it with the community.

What Context Forge Actually Does

Think of Context Forge as agentic middleware. It sits between your AI applications and whatever they call, whether that’s an app talking to an LLM, an agentic framework talking to tools, or an agent talking to another agent.

“It can change the authentication, it can change the authorization, it can change the protocol specification, it can convert A2A to MCP, MCP to A2A, REST to MCP, it can act as a proxy, it can act as a gateway, it can do routing.”
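
To make the "REST to MCP" conversion concrete, here is a minimal sketch of what such a translation might look like. The `tools/call` method and the `name`/`arguments` parameters follow MCP's JSON-RPC message shape; everything else is an assumption for illustration, including the function name `rest_to_mcp_tool_call` and the rule that the last path segment names the tool. It is not Context Forge's actual code.

```python
import json
from itertools import count

# Monotonic JSON-RPC request ids for this process.
_ids = count(1)


def rest_to_mcp_tool_call(path: str, body: dict) -> dict:
    """Wrap a REST-style request as an MCP `tools/call` JSON-RPC message.

    Hypothetical mapping: the last path segment is treated as the tool
    name and the JSON request body becomes the tool arguments.
    """
    tool_name = path.rstrip("/").split("/")[-1]
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": body},
    }


if __name__ == "__main__":
    # Hypothetical endpoint and payload, for illustration only.
    message = rest_to_mcp_tool_call("/tools/get_weather", {"city": "Bucharest"})
    print(json.dumps(message, indent=2))
```

A real gateway would also have to translate the response back, map HTTP errors to JSON-RPC errors, and carry authentication across, which is where most of the work lies.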

Everything flows through a plugins framework. Before any log line is written and before anything hits the backend API, the gateway calls pre-hooks. After execution, it calls post-hooks. You can chain these together.

“For example, we can use a guardrail like a PII filter so that if you submit your social security number or personally identifiable information, then before it hits the log file, before it hits the LLM, before it hits your agent, before it hits your tool, that gets stripped out or even blocked.”
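
The hook chain and the guardrail described above can be sketched in a few lines. This is an illustrative outline, not Context Forge's plugin API: `call_with_hooks`, `redact_ssn`, `block_ssn`, and `Blocked` are hypothetical names, and the regular expression only catches US social security numbers written as `123-45-6789`.

```python
import re
from typing import Callable, List

# A hook takes a text payload and returns a (possibly modified) payload.
Hook = Callable[[str], str]

# Assumption: only the dashed US SSN format is detected in this sketch.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


class Blocked(Exception):
    """Raised by a hook that refuses to let a payload through."""


def redact_ssn(text: str) -> str:
    """Strip SSN-like values out of the payload."""
    return SSN_PATTERN.sub("[REDACTED-SSN]", text)


def block_ssn(text: str) -> str:
    """Reject the payload outright if it contains an SSN-like value."""
    if SSN_PATTERN.search(text):
        raise Blocked("payload contains an SSN-like value")
    return text


def call_with_hooks(
    payload: str,
    backend: Callable[[str], str],
    pre_hooks: List[Hook],
    post_hooks: List[Hook],
) -> str:
    """Run pre-hooks in order, call the backend, then run post-hooks in order."""
    for hook in pre_hooks:
        payload = hook(payload)
    result = backend(payload)
    for hook in post_hooks:
        result = hook(result)
    return result


if __name__ == "__main__":
    def echo(text: str) -> str:
        # Stand-in for a tool, agent, or LLM call.
        return "backend saw: " + text

    print(call_with_hooks("My SSN is 123-45-6789", echo, [redact_ssn], []))
```

Because the pre-hooks run before the backend is invoked, the sensitive value never reaches the tool. Swapping `redact_ssn` for `block_ssn` turns stripping into blocking.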

The output side matters too. If your MCP tool retrieves 1 million tokens because you requested a file, you can compress or truncate that output before it breaks your agent or empties your wallet.
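
A post-hook that caps tool output might look like the sketch below. It assumes whitespace-separated words as a rough stand-in for tokens, since real token counts depend on the model's tokenizer. The name `truncate_output` and the `[truncated]` marker are hypothetical.

```python
def truncate_output(text: str, max_tokens: int, marker: str = " [truncated]") -> str:
    """Cap a tool's output at roughly `max_tokens` tokens.

    Assumption: whitespace-separated words approximate tokens. Output
    within the limit is returned unchanged.
    """
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return text
    return " ".join(tokens[:max_tokens]) + marker


if __name__ == "__main__":
    huge_file = "word " * 1_000_000  # stand-in for a very large tool result
    capped = truncate_output(huge_file, max_tokens=2_000)
    print(len(capped.split()))
```

Truncation is the bluntest option. Summarizing or compressing the output keeps more signal, at the cost of another model call.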

“Think of Context Forge as a point of enforcement. It’s not about defining the policy—it’s about enforcing the policy.”

The Agent Evaluation Problem

How do you know if an agent did its job correctly? The current approach is basically “LLM as judge”—asking a model to score another model’s output. Usually with a cheaper model, for cost reasons.

“You can build an amazing evaluator by using three models. I’m going to use the latest Opus, the latest GPT, the latest Gemini. I’m going to ask them independently to evaluate, blend the results together, do this 20 times, connect it to a RAG with more context, blend it with human feedback—and that’s going to cost you a hundred times more than the prompt itself.”
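
The arithmetic behind that cost claim is easy to reproduce. The sketch below uses only the two numbers from the quote (three judge models, 20 repetitions). The `judge_cost_ratio` parameter, meaning the cost of one judge call relative to the original prompt, is an assumption introduced for illustration.

```python
def relative_eval_cost(models: int, repeats: int, judge_cost_ratio: float) -> float:
    """Evaluation cost as a multiple of the original prompt's cost.

    `judge_cost_ratio` is an assumed figure: how expensive one judge call
    is compared with the prompt being evaluated.
    """
    return models * repeats * judge_cost_ratio


if __name__ == "__main__":
    # Three judges, 20 runs each, each judge call as expensive as the prompt.
    print(relative_eval_cost(models=3, repeats=20, judge_cost_ratio=1.0))  # 60.0
    # If extra RAG context doubles the size of each judge call (assumed figure).
    print(relative_eval_cost(models=3, repeats=20, judge_cost_ratio=2.0))  # 120.0
```

Three models run 20 times is already 60 judge calls per evaluated prompt, before any extra context or human feedback is added.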

The real issue is that LLM-based evaluation can’t verify what matters. Mihai points to Anthropic’s C compiler example: it passed the test suite but couldn’t compile Hello World.

“If your evaluation tool doesn’t have the ability to run the code, then it’s just making it up. It’s like me giving you a piece of paper with my code and saying, does this code look right? And you say, what am I, a compiler? Go compile it.”

Non-LLM evaluation metrics matter: Does the code compile? Does it pass the linter? Does the container image build and deploy? Does load testing pass for 10 hours?
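
Checks like these reduce to running a command and reading its exit code. The following is a minimal sketch of such a gate, assuming each check can be expressed as a command that exits with 0 on success. `run_checks` and the example commands are hypothetical, and a real pipeline would plug in its own compiler, linter, and container build.

```python
import subprocess
import sys
from typing import Dict, List


def run_checks(checks: Dict[str, List[str]], timeout_s: float = 600) -> Dict[str, bool]:
    """Run each command and report pass/fail based on its exit code.

    A command that cannot be started or that exceeds the timeout counts
    as a failure.
    """
    results: Dict[str, bool] = {}
    for name, cmd in checks.items():
        try:
            proc = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
            results[name] = proc.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            results[name] = False
    return results


if __name__ == "__main__":
    # Placeholder commands; substitute the project's real build and lint steps.
    checks = {
        "runs": [sys.executable, "-c", "print('hello world')"],
        "fails": [sys.executable, "-c", "raise SystemExit(1)"],
    }
    for name, passed in run_checks(checks).items():
        print(f"{name}: {'PASS' if passed else 'FAIL'}")
```

The point of the design is that nothing here asks a model for an opinion: the generated code either survives the toolchain or it does not.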

“The more I used AI, the more I realized that you need to have extremely tough criteria before you trust it—because you’re not verifying it line by line.”

Why Visual Agent Builders Don’t Scale

Mihai compares visual no-code agent platforms to the old UML promise. Attractive for prototyping, but they break down at scale.

“You need to have one speed for rapid prototyping where visual tools help. But once you need to build at scale, it all goes back to code.”

Here’s the twist: visual representations are actually harder for LLMs to work with. Those pretty flowcharts are probably “some horrifying JSON or XML behind the scenes.” Agents deal better with language and code than with visual abstractions.

“If you’re using agents to write your agents, the programmatic approach tends to scale more.”

AI Doesn’t Make Everything Faster

Even AI-assisted software development can be just as slow as traditional development. This sounds crazy until you factor in reality.

“Do you need to compile your code? Add half an hour compilation time. Do you need to run Sonar, linters, static analysis? Add that. Do you need to build a container? Add that. Do you need to deploy the container? Add that.”

Without these steps, you’re just saying “please write me this feature, make no mistakes” to a text file. The bottlenecks that existed before still exist. The Linux kernel still needs Linus Torvalds deciding what goes in and what doesn’t.

“AI doesn’t necessarily make things faster. There’s still the same bottlenecks. It makes it more efficient. If previously you needed 10 people, now you need four. You still need the same people to provide the governance.”

The problem: you need the experts. The seniors with debugging and security experience. The juniors doing basic manual execution of well-specified tasks? That work can be a prompt now.

“Folks need to become more end-to-end in their skillset instead of being overly specialized. And you need to get expertise fast.”


For more on enterprise AI architecture and agent systems, check out our episodes on MCP protocol evolution and production-grade agent deployment patterns.

Related Episodes

Dive deeper into these topics in the podcast.

The Man Behind IBM's AI Agent Gateway
Guest Conversation

Mar 4, 2026 54 min

Mihai Criveti, Distinguished Engineer at IBM and creator of Context Forge, on why AI agents need agentic middleware, MCP's enterprise gaps, and what production-grade agent architecture actually loo...

Ship AI is a video podcast covering the trends, tools, and strategies driving enterprise AI. New episodes every two weeks.