

Shahar Azulay, CEO and Co-Founder of groundcover


Shahar Azulay, CEO and co-founder of groundcover, is a serial R&D leader. Shahar brings experience from the worlds of cybersecurity and machine learning, having worked as a leader at companies such as Apple, DayTwo, and Cymotive Technologies. He spent many years in the Cyber division at the Israeli Prime Minister’s Office and holds three degrees, in Physics, Electrical Engineering, and Computer Science, from the Technion Israel Institute of Technology and Tel Aviv University. Shahar strives to bring the technological lessons of this rich background to today’s cloud-native battlefield in the sharpest, most innovative form, to make the world of dev a better place.

groundcover is a cloud-native observability platform designed to give engineering teams full, real-time visibility into their systems without the complexity or cost of traditional monitoring tools. Built on eBPF technology, it collects and correlates logs, metrics, traces, and events across cloud-native and Kubernetes environments with no code changes, enabling faster root-cause analysis and clearer system insight. The platform emphasizes predictable pricing, flexible deployment that keeps data in the customer’s cloud, and end-to-end observability spanning infrastructure, applications, and modern AI-driven workloads.

Looking back at your journey—from leading cyber R&D teams in the Israeli Prime Minister’s Office to managing ML initiatives at Apple—what experiences ultimately pushed you toward founding groundcover, and when did you first recognize the gap in observability for modern AI systems?

The push to found groundcover came from my time at Apple and DayTwo. Even with huge budgets, we were stuck choosing between paying a fortune to log everything or sampling and flying blind. Back then, we were looking for a technology that would solve that. Once we ran into Extended Berkeley Packet Filter (eBPF), it was clear that it would change everything.

eBPF lets us see everything happening in the kernel without relying on application changes. I could not understand why observability tools were not taking advantage of that.

The AI gap became clear later. Once our Kubernetes platform matured, we saw customers rushing into GenAI deployments while treating LLMs like black boxes. They knew the model responded, but not why it behaved unpredictably or why costs were spiking. We realized agentic workflows are simply complex, non-deterministic microservices that need the same zero-touch visibility we had already built.

How did your background in cybersecurity, embedded systems, and machine-learning R&D influence the vision behind groundcover, and what early challenges did you face building a company centered on observability for LLM-driven and agentic applications?

My cyber background shaped the company’s DNA. In the intelligence world, you assume you do not control the application. That approach is why groundcover does not require instrumentation. I know from experience that asking developers to modify code is the fastest way to block adoption.

The hardest early challenge with LLM monitoring was privacy. Observability for AI captures prompts that may contain sensitive PII or IP. My background made it obvious that enterprises would not want that data leaving their environment. That is why we built our in-cloud architecture, allowing us to provide deep visibility into agent behavior while keeping all data inside the customer’s own environment.

How do you define LLM observability, and what makes it different from traditional monitoring or ML monitoring?

LLM observability is the practice of instrumenting and monitoring production systems that use large language models so you can capture the full context of every inference: the prompt, context, completion, token usage, latency, errors, model metadata, and ideally downstream feedback or quality signals.

Instead of only asking “Is the service up and fast?” or “Did this request error out?”, LLM observability helps you answer questions like “Why did this particular request succeed or fail?”, “What actually happened inside this multi-step workflow?”, and “How are changes to prompts, context, or model versions affecting cost, latency, and output quality?”

That is very different from traditional monitoring or even classical ML monitoring. Legacy approaches are tuned for deterministic systems, infrastructure metrics, and static thresholds. LLM applications are non-deterministic, open-ended, and highly context-dependent. Success is often semantic and subjective, not just a 200 vs 500 status code. That means you have to trace inputs and outputs, understand tool calls and retrieval steps, evaluate responses for things like hallucinations or policy violations, and connect token-level costs and delays back to the surrounding application and infrastructure.

What challenges do LLM-powered applications introduce that make traditional observability tools insufficient?

LLM-powered systems introduce several challenges that expose the limits of traditional tools:

  • Complex, multi-step workflows – We moved from simple “call a model, get a response” flows to multi-turn agents, multi-step pipelines, retrieval augmented generation, and tool use. A silent failure in any of those steps, such as retrieval, enrichment, embedding, tool call, or model call, can break the whole experience. Traditional monitoring usually does not give you a complete, trace-level view of those chains with prompts and responses included.
  • Rapidly evolving AI stacks – Teams are adding new models, tools, and vendors at a pace they have never seen before. In many companies, nobody can confidently list which models are in production at any given moment. Classic observability usually assumes you have time to instrument SDKs, redeploy, and carefully curate what you measure. That simply does not keep up with how fast AI is being adopted.
  • Token-based economics and quotas – Pricing and rate limits are tied to tokens and context length, which are often controlled by developers, prompts, or user behavior, not by central ops. Traditional tools are not built to show you “who burned how many tokens on which model, for which workflow, at what latency”.
  • Semantic correctness instead of binary success – An LLM can return a 200 and still hallucinate, drift away from your prompt, or violate policy. Traditional tools see that as a success. You need observability that can surface prompts and responses and give you enough context to inspect behavior and, over time, plug in automated quality checks.
  • Sensitive input data flowing into third parties – LLMs invite users to share very sensitive information through chat-style interfaces. Now you are responsible for that data, where it is stored, and which vendors see it. Conventional SaaS-based observability that ships all telemetry to a third party is often unacceptable for these workloads.

All of this means LLM systems require observability that is AI aware, context-rich, and far less dependent on manual instrumentation than the tools most teams use today.

Which signals or metrics are most important for understanding the performance and quality of LLM systems, including latency, token usage, and prompt/response behavior?

There are a few categories of signals that matter a lot in practice:

Latency and throughput

  • End-to-end latency per request, including model time and surrounding application time.
  • Tail latencies (P90, P95, P99) per model and per workflow.
  • Throughput by model, route, and service, so you know where load is really going.
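
Tail latencies like the P90/P95/P99 above can be computed with a simple nearest-rank percentile over per-request samples; a self-contained sketch (the sample values are illustrative):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile over a list of latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Per-model latency samples in milliseconds (illustrative values)
latencies = {"model-a": list(range(1, 101))}  # 1..100 ms

for model, samples in latencies.items():
    print(model,
          percentile(samples, 90),
          percentile(samples, 95),
          percentile(samples, 99))
```

Grouping the samples per model and per workflow before computing percentiles is what turns “something is slow” into “this model’s P99 regressed on this route”.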

Token usage and cost drivers

  • Input and output tokens per request, broken down by model.
  • Aggregated token usage over time per model, team, user, and workflow.
  • Context sizes for retrieval-heavy pipelines so you can see when prompts are exploding.
  • This is what lets you answer “Who is actually spending our AI budget and on what?”
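
Answering “who is spending the AI budget and on what” boils down to aggregating tokens by model and workflow and applying per-token prices. A sketch, assuming hypothetical per-1K-token prices (real prices vary by provider and model):

```python
from collections import defaultdict

# Hypothetical per-1K-token prices in dollars; not a real price sheet.
PRICE_PER_1K = {"model-a": {"in": 0.01, "out": 0.03}}

# Illustrative per-request token records
requests = [
    {"model": "model-a", "workflow": "chatbot",    "in": 1000, "out": 500},
    {"model": "model-a", "workflow": "summarizer", "in": 2000, "out": 1000},
    {"model": "model-a", "workflow": "chatbot",    "in": 3000, "out": 1500},
]

cost = defaultdict(float)
for r in requests:
    price = PRICE_PER_1K[r["model"]]
    cost[(r["model"], r["workflow"])] += (
        r["in"] / 1000 * price["in"] + r["out"] / 1000 * price["out"]
    )

for key, dollars in sorted(cost.items()):
    print(key, round(dollars, 4))
```

The same aggregation keyed by team or user instead of workflow gives the per-team spend breakdown mentioned above.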

Prompt and response behavior

  • The actual prompt and response payloads on representative traces, including tool calls and reasoning paths.
  • Which tools the LLM chose to call and in what sequence.
  • Variance in responses for similar prompts so you can tell how stable the behavior is.

Reliability and errors

  • Model specific error rates and types (provider errors, timeouts, auth issues, quota errors).
  • Failures in the surrounding workflow, such as tool timeouts or retrieval errors, correlated with the LLM call.

Classic infra context

  • Container CPU, memory, and network metrics for the services orchestrating your LLM calls.
  • Correlated logs that describe what the application was trying to do.

When you can see all of that in one place, LLM observability moves from “I know something is slow or expensive” to “I know exactly which model, prompt pattern, and service are responsible and why”.

How can observability help teams detect silent failures such as prompt drift, hallucinations, or gradual degradation in output quality?

Silent failures in LLM systems usually happen when everything looks “green” at the infrastructure level, but the actual behavior is drifting. Observability helps in a few ways:

  • Tracing the full workflow, not just the model call – By capturing the whole path of a request client to service to retrieval to model to tools, you can see where behavior changed. For example, maybe retrieval started returning fewer documents, or a tool call is intermittently failing, and the model is improvising.
  • Keeping prompts, context, and responses in view – When you can inspect prompts and responses alongside traces, it becomes much easier to spot cases where a new prompt version, a new system instruction, or a new context source changed the behavior, even though latency and error rates stayed the same.
  • Filtering and slicing on semantic conditions – Once you have rich LLM telemetry, you can filter down to things like “bedrock calls over one second”, “requests using this model family”, or “traces involving this particular route”, then read the prompts and responses to see if the model is drifting or hallucinating in a specific scenario.
  • Alerting on business-level SLOs – You can define SLOs like “any LLM call over one second breaches our user-facing SLA” and trigger alerts when those conditions are met. Over time, similar SLOs can be tied to quality scores or policy checks so you get alerted when quality degrades, not just when infrastructure fails.
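
The SLO check in the last bullet, “any LLM call over one second breaches our SLA”, reduces to a filter over trace records plus a threshold on the breach rate. A minimal sketch with made-up trace data and an assumed 5% alert threshold:

```python
SLO_MS = 1000  # "any LLM call over one second" from the SLO above

# Illustrative trace records; in practice these come from telemetry
traces = [
    {"model": "model-a", "latency_ms": 420},
    {"model": "model-a", "latency_ms": 1650},
    {"model": "model-b", "latency_ms": 980},
    {"model": "model-a", "latency_ms": 1200},
]

breaches = [t for t in traces if t["latency_ms"] > SLO_MS]
breach_rate = len(breaches) / len(traces)
print(f"{len(breaches)} breaches, rate={breach_rate:.0%}")

if breach_rate > 0.05:  # assumed alerting threshold
    print("ALERT: LLM latency SLO breached")  # hand off to paging/alerting
```

Because the filter runs on structured trace records, the same pattern extends to semantic conditions, e.g. flagging traces whose quality score or policy check fails, once those signals are attached.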

Because the observability layer has access to both the AI-specific signals and the classic logs, metrics, and traces, it becomes a natural place to catch issues that would otherwise quietly degrade user experience.

How does groundcover’s approach support diagnosing unpredictable latency or unexpected behavior within multi-step agent workflows and tool calls?

groundcover takes an approach designed for modern AI systems. We use an eBPF-based sensor at the kernel level to observe traffic across microservices with no code changes or redeploys. As soon as you introduce an LLM workflow, we can auto-discover those calls. If you start using a new model provider like Anthropic, OpenAI, or Bedrock tomorrow, groundcover captures that traffic automatically.

That gives you:

  • End-to-end traces of multi-hop workflows – You see the full path of a request across services, including where an LLM or tool is used.
  • Deep context on each LLM call – Every call includes the model used, latency, token usage, prompts, responses, and correlated logs and infra metrics.
  • Powerful filtering on latency and conditions – For example, you can filter for all Claude 3.5 calls over one second and immediately inspect the traces that breached your SLA.
  • Alerts and dashboards tied to LLM behavior – Once the data is available, you can create alerts for SLA breaches or build dashboards that track latency, throughput, token usage, and errors.

Because everything is collected at the edge by eBPF and stored in your own cloud, you get this high-granularity view without adding instrumentation inside every agent or tool call.

What data-security and compliance risks do you see emerging in LLM deployments, and how can observability help reduce those risks?

LLM deployments bring some unique data risks:

  • Unbounded user input – Users can type extremely sensitive information into chatbots and AI-powered interfaces. That may include personal data, customer data, or regulated information that you never meant to collect.
  • Third-party model providers – Once you send that data to an external LLM provider, you are responsible for where it went, how it is stored, and which subprocessors are involved. That has major implications for GDPR, data residency, and customer trust.
  • Telemetry as a second copy of sensitive data – If your observability stack ships full payloads to a SaaS vendor, you now have another copy of that sensitive information sitting outside your environment.
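
A common mitigation for that “second copy” risk is redacting sensitive values from prompts and responses before telemetry is stored. A minimal sketch; the patterns are illustrative only, and production redaction needs far broader coverage (names, addresses, credentials):

```python
import re

# Illustrative redaction patterns: email addresses and US SSN-shaped IDs.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<SSN>"),
]

def redact(text: str) -> str:
    """Replace matched sensitive substrings before telemetry storage."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text

prompt = "Contact jane.doe@example.com, SSN 123-45-6789, about the refund."
print(redact(prompt))
```

Redaction reduces the blast radius of a telemetry copy, but it complements rather than replaces the architectural choice of keeping the data in your own environment.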

groundcover’s architecture is designed to address exactly those concerns:

  • We use a bring your own cloud model where the full observability backend runs inside your cloud account, in a sub-account, as a fully managed data plane. The control plane that scales and manages it is run by us, but we do not access, store, or process your telemetry data.
  • Because we can safely capture payloads in your own environment, you can observe prompts, responses, and workflows without that data ever leaving your cloud. There is no third party storage of your LLM traces and no extra data egress to worry about.
  • With that visibility, you can see who is uploading what and where it flows, detect unexpected usage of sensitive data, and enforce policies around which models and regions are allowed.

In other words, observability becomes not only a reliability and cost tool, but also a key control point for privacy, data residency, and compliance.

As organizations scale from one LLM integration to many AI-powered services, what operational challenges tend to appear around visibility, reliability, and cost?

The first integration is usually a single model in a single workflow. At that stage, things feel manageable. As soon as teams see value, usage explodes and several challenges appear:

  • Model and vendor sprawl – Teams test new models constantly. It quickly becomes unclear which ones are in production and how they are used.
  • Cost surprises from token usage – Token consumption grows with context length and workflow complexity. Without visibility into token usage per model and workflow, managing costs is very difficult.
  • Reliability dependencies on external providers – User-facing APIs become sensitive to model latency or errors, which can disrupt SLAs even when core infrastructure is healthy.
  • Growing instrumentation debt – Traditional observability assumes you can add instrumentation when needed. In fast-moving AI stacks, developers rarely have time for that.

groundcover addresses these by auto-discovering AI traffic and then giving you:

  • Central visibility into which models and vendors are used.
  • Dashboards showing latency, throughput, and token usage over time.
  • Correlation between LLM behavior and the services that depend on it.
  • Alerts for AI-driven SLO breaches.

That makes it much easier to scale from “one cool AI feature” to “AI is woven into dozens of critical services” without losing control.

Looking ahead, how do you expect LLM observability to evolve over the next five years as agentic AI, multi-model orchestration, and regulatory pressures accelerate?

We are still in the early days. Over the next five years, I expect a few big shifts:

  • From request level to agent level understanding – Observability will expand to capture tool sequences, reasoning paths, and retry logic, not just model calls.
  • Richer semantic and policy signals – Automated quality checks for hallucinations, safety issues, and brand alignment will become standard metrics.
  • Tighter coupling with governance and privacy – As regulation grows, observability will also serve as an enforcement and audit layer for data residency, retention, and approved model usage.
  • Cross-model, multi-vendor optimization – Teams will route traffic across models dynamically based on performance and cost, guided by real-time observability data.
  • Less manual instrumentation – Techniques like eBPF-based collection and auto-discovery will become the default, so teams can innovate without slowing down.

In short, LLM observability will evolve from “nice to have dashboards for AI” into the central nervous system that connects reliability, cost control, data governance, and product quality across everything an organization does with AI.

Thank you for the great interview. Readers who wish to learn more should visit groundcover.

Antoine is a visionary leader and founding partner of Unite.AI, driven by an unwavering passion for shaping and promoting the future of AI and robotics. A serial entrepreneur, he believes that AI will be as disruptive to society as electricity, and is often caught raving about the potential of disruptive technologies and AGI.

As a futurist, he is dedicated to exploring how these innovations will shape our world. In addition, he is the founder of Securities.io, a platform focused on investing in cutting-edge technologies that are redefining the future and reshaping entire sectors.