Nikunj Bajaj, Co-Founder and CEO of TrueFoundry – Interview Series

You’ve worked across machine learning research, production AI at Facebook, and large-scale recommendation systems before founding TrueFoundry — what experiences most directly pushed you toward building an enterprise AI infrastructure company, and what pain did you feel wasn’t being addressed at the time?
At Meta, we viewed machine learning as a special case of software, and GenAI as a special case of machine learning, which resulted in a vertical stack with software at the bottom, machine learning in the middle, and GenAI at the top. In this setup, if I am a machine learning developer, the models I build follow the same deployment pattern as the rest of the software, which makes scaling systems very straightforward.
Most enterprises, however, were deploying parallel stacks, meaning they had separate stacks for software, machine learning, and GenAI. The moment you have these parallel stacks, scaling becomes more complex because of the handoffs required between machine learning and the software world.
Our team has always worked at the intersection of building machine learning models and machine learning infrastructure, so we had a unique point of view: we could bring similar vertical stacks to enterprises and adapt them for their specific requirements. We also had a hypothesis toward the end of 2021 that machine learning was approaching an inflection point, and that when it did, more companies would need a vertically integrated stack to deploy and scale these systems effectively. This is what ultimately led us to found TrueFoundry, and our hypothesis was right: AI adoption accelerated following the launch of ChatGPT in late 2022.
As AI systems move from experimentation into everyday operations, what has changed about the way organizations should think about reliability and failure?
The stakes with GenAI are significantly higher than with traditional machine learning systems. As these systems move into production, organizations are dealing with a much higher level of ambiguity and non-determinism because LLMs are stochastic by nature. Agentic systems built on top of them add further ambiguity.
Additionally, failures are no longer binary. Instead of systems simply failing or not failing, many issues appear as partial failures or silent degradations. Systems might respond with higher latency, degraded quality, or incorrect behavior over time. In many cases, these degradations can be harder to detect and sometimes even more damaging than a hard outage.
Organizations need to think about reliability not just in terms of uptime but also performance degradation over time.
TrueFailover was launched amid a wave of high-profile cloud and AI service disruptions. What recent events made it clear that AI reliability had shifted from a “nice to have” to a core architectural requirement?
One of our healthcare customers, which processes real-time, time-sensitive patient requests related to prescriptions, was impacted by an outage caused by a model failure. Their workflows generate thousands of dollars of revenue per second, and the outage disrupted some of these critical workflows. Because they were an early TrueFailover customer, we were able to help them recover quickly, and the impact was contained.
Incidents like this raise an important question: as the stakes of GenAI systems continue to rise, why are recovery processes still largely manual? It reinforced the idea that systems should be built with the assumption that failures will happen, and designed to correct themselves automatically. Reliability also has to be built into the AI stack itself through AI Gateways, which provide centralized routing, observability, guardrails, and intelligent model switching across providers.
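The intelligent model switching described here can be illustrated with a minimal sketch. The provider names and the `call_provider` stub below are hypothetical placeholders, not a real gateway API; the point is the pattern of falling back to the next provider when one fails:

```python
# Hypothetical sketch of the failover pattern an AI Gateway applies:
# try a primary model provider, and on error fall back to the next
# provider in a priority list. Names and stubs are illustrative.

PROVIDERS = ["primary-llm", "secondary-llm", "self-hosted-llm"]

def call_provider(name: str, prompt: str) -> str:
    """Stub for an LLM call; a real system would call the provider's API."""
    if name == "primary-llm":
        raise TimeoutError("provider unavailable")  # simulate an outage
    return f"{name}: answer to {prompt!r}"

def complete_with_failover(prompt: str) -> str:
    last_error = None
    for name in PROVIDERS:
        try:
            return call_provider(name, prompt)
        except Exception as exc:  # outage, timeout, rate limit, ...
            last_error = exc      # record it and try the next provider
    raise RuntimeError("all providers failed") from last_error

print(complete_with_failover("refill status?"))  # served by the fallback
```

A production gateway would also weigh latency, cost, and response quality when choosing a fallback, but the skip-to-the-next-healthy-provider loop is the core idea.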
Many AI outages are still framed as technical hiccups. Where do you see the real economic and human costs starting to emerge when AI systems go down?
Enterprise AI has evolved to the point where these hiccups no longer just affect internal workflows. Today, outages and degradations impact public perception and profits directly and immediately, because the production use-cases are now customer-facing. This shift from internal testing to high-stakes, public-facing applications is why we’re seeing increased demand for executive attention and oversight.
As AI systems become embedded deeper into operational workflows, outages are no longer just technical issues. They increasingly have direct business, customer, and reputational consequences.
In mission-critical environments like pharmacies, healthcare operations, or customer support, how quickly can AI downtime escalate into operational or reputational risk?
In mission-critical environments, escalation happens almost immediately because these systems support real-time, time-sensitive workflows. Even a short disruption can halt critical processes, delay service delivery, or interrupt downstream systems that depend on those outputs, creating cascading operational effects across the organization.
In sectors like healthcare, the impact extends beyond operational disruption to customer experience and service outcomes. If a patient is unable to fulfill their prescription on time, there can be real consequences. Not only is this an issue for the patient, but it can also damage a pharmacy’s or healthcare provider’s reputation. In mission-critical environments where trust is a factor, it’s paramount that systems stay online. This is why organizations are increasingly recognizing that AI systems must be designed with the assumption that failures will occur and that recovery mechanisms need to activate automatically to minimize risk.
You’ve said many teams architect for capability rather than continuity. Why do you think resilience has historically been deprioritized in AI system design?
This largely comes down to incentives within organizations. New capabilities are visible and exciting. They unlock demos, features, and product possibilities that leadership can immediately see.
Continuity, by definition, is invisible when things are working well. Because of this, reward systems tend to be skewed toward shipping new features rather than ensuring nothing breaks. As a result, organizations often invest disproportionately in capability development rather than in resiliency engineering.
As enterprises increasingly rely on external models and APIs, what new fragilities are being introduced into the AI stack that leaders may not fully appreciate yet?
LLMs are fundamentally shared resources, and enterprises do not own them the way they own traditional infrastructure. In addition, business-critical systems within enterprises now run on external services that are not fully time-tested. LLMs themselves are evolving rapidly, which means a model provider can't be held accountable for small regressions in latency or model quality, because they are iterating on their research very quickly.
Because LLMs are shared resources, latency can spike simply because another consumer of the same model changes its usage. Many failure points are introduced by the fundamental nature of LLMs, and enterprises in this new world simply do not have full control. Without full control, the best thing an enterprise can do is build in enough redundancy to design a resilient system.
Without focusing on specific products, how should organizations rethink AI architecture to assume failure rather than treat outages as rare edge cases?
Organizations should return to the first principles of distributed systems design. Software systems were built on the assumption that network components and machines would fail, and that an entire region could go down.
AI systems should be no different. We should assume that model providers will experience latency issues, degradations, or outages, and incorporate redundancy so applications remain resilient across different failure scenarios.
Do you expect AI resilience to become a deciding factor in platform and vendor selection, similar to how uptime and redundancy shaped cloud infrastructure decisions?
As more AI systems move into production, resilience will become table stakes. If a vendor can't demonstrate its uptime and overall resiliency metrics, it won't even be considered. Once resilience becomes a baseline expectation across vendors, the deciding factors will shift toward user experience, performance optimization, observability, and higher-level product capabilities. Over time, components such as an AI Gateway and automated failover will become core foundational elements of enterprise AI infrastructure.
Looking ahead, what does “production-ready” AI really mean in a world where AI is expected to be continuously available, not just occasionally helpful?
Production-ready AI systems should be observable, controllable, and recoverable. All three of these boxes need to be checked.
For production AI to be observable, teams need deep visibility into model behavior, latency, error rates, token usage, drift, and failure patterns. Without strong observability, it becomes very difficult to detect degradations before users begin to notice them.
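As an illustration of catching degradation before users notice it, here is a minimal sketch (an assumed design, not any specific product's) that flags a model endpoint when its rolling average latency crosses a threshold:

```python
from collections import deque

# Illustrative degradation detector: keep the last N request latencies
# in a sliding window and flag degradation when the rolling average
# exceeds a threshold, catching slowdowns well before a hard outage.

class LatencyMonitor:
    def __init__(self, window: int = 100, threshold_ms: float = 800.0):
        self.samples = deque(maxlen=window)  # only the last N latencies
        self.threshold_ms = threshold_ms

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def degraded(self) -> bool:
        if not self.samples:
            return False
        return sum(self.samples) / len(self.samples) > self.threshold_ms

monitor = LatencyMonitor(window=5, threshold_ms=500.0)
for latency in [120, 150, 900, 1100, 1300]:  # latency creeping upward
    monitor.record(latency)
print(monitor.degraded())  # True: rolling average (714 ms) exceeds 500 ms
```

The same sliding-window idea extends to error rates, token usage, and quality scores; the threshold and window size would be tuned per workload.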
For systems to be controllable, teams need traffic shaping, rate limiting, guardrails, policy enforcement, and intelligent routing across models and providers. This is where an AI Gateway becomes foundational, acting as a centralized control plane that enforces guardrails, provides consistent governance, and enables dynamic model switching when performance or reliability drops.
And lastly, when it comes to being recoverable, systems should be built with the assumption that components can be partially or completely broken, whether due to provider outages, degraded model quality, rate limits, or unexpected inputs from malicious actors. Automated failover and self-healing mechanisms should be native to the architecture, not manual playbooks triggered after something goes wrong.
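One common way to make recovery automatic rather than playbook-driven is a circuit breaker. The sketch below is an assumed, simplified design, not TrueFoundry's implementation: after repeated failures a provider is marked "open" and skipped, then probed again after a cooldown so it rejoins the pool on its own once it recovers:

```python
# Minimal circuit-breaker sketch (assumed design, not a specific
# product's): trip after repeated failures, skip the provider during a
# cooldown, then allow a probe request so recovery is automatic.

class CircuitBreaker:
    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # timestamp the breaker tripped, if any

    def allow_request(self, now: float) -> bool:
        if self.opened_at is None:
            return True  # breaker closed: traffic flows normally
        # After the cooldown, let a probe request through to test recovery.
        return now - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # close the breaker again

    def record_failure(self, now: float) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = now  # trip the breaker

breaker = CircuitBreaker(max_failures=2, cooldown_s=30.0)
breaker.record_failure(now=0.0)
breaker.record_failure(now=1.0)         # second failure trips the breaker
print(breaker.allow_request(now=10.0))  # False: still in cooldown
print(breaker.allow_request(now=40.0))  # True: probe after cooldown
```

Paired with the failover routing an AI Gateway provides, this is what lets a system heal itself instead of waiting on a human to run a recovery playbook.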
This is the direction we are working toward at TrueFoundry. Vendors that define production readiness in this way, combining observability, centralized control, and automated recovery, will earn long-term customer trust and will be able to continue to solve new issues as they emerge.
Thank you for the great interview. Readers who wish to learn more should visit TrueFoundry.