Beyond Up/Down: There’s A Better Way To Define ‘Normal’ in Complex Infrastructure

We’ve come a long way from up/down monitoring. From factory floors to modern enterprise infrastructure, IT admins now need far more than a simple check to determine whether a website or application can serve users. A basic “up” or “down” status is helpful, but it doesn’t tell the whole story of whether technology is delivering the business value expected of it. And as IT and OT environments converge and ecosystems become more dynamic and ephemeral, these simple alerts can neither establish nor reflect accurate baselines.
Understanding what’s normal, learning performance patterns, and preventing costly downtime are vital functions in today’s complex infrastructure. This is particularly true as threat actors use increasingly sophisticated tools to do more with less, and as modern interconnected infrastructure creates new vulnerabilities.
It’s in this landscape that AI-driven monitoring transforms infrastructure management by offering insight into what is and what isn’t normal behavior, thereby eliminating poor baselines and alert fatigue. Let’s explore how this shift from reactive firefighting to proactive prevention marks a much-needed monitoring evolution.
Discovering the new normal
What’s normal, anyway? It’s a question that infrastructure teams overseeing servers, network devices, applications, and databases have been asking for decades. Why? Because defining ‘normal’ is complex and error-prone across dynamic, increasingly distributed environments with diverse systems to monitor. The answer depends on your specific business patterns and technologies, and just as much on how your monitoring is built and configured. Static thresholds, in particular, catch only the problems you already expect; they do little to surface the ones you don’t, which leads to false positives, alert fatigue, and gaps in visibility.
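To make that limitation concrete, here is a minimal sketch of what a single static threshold actually evaluates. The metric, threshold, and traffic figures are purely illustrative, not drawn from any real deployment: a ceiling tuned to stay quiet during an expected daytime peak says nothing about a spike on a link that should be idle overnight.

```python
# Hypothetical illustration: one fixed threshold on requests per minute.
STATIC_THRESHOLD = 10_000  # set high enough not to fire on the daily noon peak

def static_check(requests_per_minute: int) -> bool:
    """Fires only when traffic exceeds the fixed ceiling."""
    return requests_per_minute > STATIC_THRESHOLD

# Expected noon peak of 9,500 rpm -> no alert (correct, but only by careful tuning).
# A 3 AM surge to 6,000 rpm on a link that normally idles at 300 rpm -> also no
# alert, because the check has no notion of what "normal" looks like at 3 AM.
print(static_check(9_500), static_check(6_000))  # False False
```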
Consider a manufacturing facility where traffic suddenly spikes at 2 PM on a Tuesday. Traditional monitoring might trigger an alert because the spike exceeds a preset threshold, but is this actually a problem? There’s no way to know without deeper data and diagnostics. The spike could indicate legitimate business activity like a new shift schedule or increased production to meet a deadline. Alternatively, it could signal a serious security threat, such as data exfiltration or a compromised system beaconing to command-and-control servers.
This is where AI-driven anomaly detection makes infrastructure monitoring more intelligent. The approach continuously analyzes historical data to build baselines that automatically adjust to changing conditions, enabling more proactive alerting that gives IT admins and DevOps teams extra time to step in and mitigate a problem before its impact becomes severe.
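As a rough sketch of the idea, and not any particular vendor’s implementation, the following snippet learns a per-hour baseline from recent history and flags values that deviate by more than a few standard deviations. The class name, window size, and z-score cutoff are all assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev

class HourlyBaseline:
    """Toy adaptive baseline: learn what is normal for each hour of the day."""

    def __init__(self, window: int = 30):
        self.history = defaultdict(list)  # hour -> recent observed values
        self.window = window              # keep roughly a month of samples per hour

    def observe(self, hour: int, value: float) -> None:
        bucket = self.history[hour]
        bucket.append(value)
        if len(bucket) > self.window:
            bucket.pop(0)                 # slide the window so the baseline adapts

    def is_anomalous(self, hour: int, value: float, z: float = 3.0) -> bool:
        bucket = self.history[hour]
        if len(bucket) < 5:               # not enough history yet: stay quiet
            return False
        mu, sigma = mean(bucket), stdev(bucket)
        if sigma == 0:
            return value != mu
        return abs(value - mu) / sigma > z

baseline = HourlyBaseline()
for day in range(14):                     # two weeks of a typical 2 PM hour
    baseline.observe(14, 300 + day)       # ~300 requests/minute is the norm here
print(baseline.is_anomalous(14, 310))     # False: within the learned pattern
print(baseline.is_anomalous(14, 6000))    # True: the Tuesday 2 PM spike stands out
```

Because the window slides, a legitimate change such as a new shift schedule eventually becomes part of the baseline rather than a perpetual alarm.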
Network traffic monitoring is a good example of this in action. Infrastructure monitoring systems collect various signals, including logs and metrics: a log is an event generated by a system, while a metric is a measurement. Collected over time, these measurements form a time series, much like temperature readings taken throughout the day. The data gathered to monitor network conditions includes metrics such as incoming and outgoing broadcast packet rates, the number of discards and errors, and total traffic throughput. When a metric deviates from its regular pattern, intelligent monitoring can ensure the right alarms are raised and false positives are avoided.
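For readers who want to picture what those signals look like in practice, here is an illustrative sketch that turns two snapshots of cumulative interface counters into per-second rates. The field names loosely mirror SNMP-style interface counters but are assumptions, not a specific product’s schema.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class CounterSample:
    timestamp: int          # seconds since epoch
    in_broadcast_pkts: int  # cumulative counters, as a device would report them
    out_broadcast_pkts: int
    discards: int
    errors: int
    octets: int             # total bytes seen on the interface

def to_rates(prev: CounterSample, cur: CounterSample) -> Dict[str, float]:
    """Convert two cumulative counter snapshots into per-second rates."""
    dt = max(cur.timestamp - prev.timestamp, 1)
    return {
        "broadcast_in_pps": (cur.in_broadcast_pkts - prev.in_broadcast_pkts) / dt,
        "broadcast_out_pps": (cur.out_broadcast_pkts - prev.out_broadcast_pkts) / dt,
        "discards_per_s": (cur.discards - prev.discards) / dt,
        "errors_per_s": (cur.errors - prev.errors) / dt,
        "throughput_bps": 8 * (cur.octets - prev.octets) / dt,
    }

# Each poll appends one point per metric; over a day these series become the
# "temperature chart" that a baseline model learns from.
series: Dict[str, List[float]] = {}
prev = CounterSample(0, 100, 80, 2, 0, 1_000_000)
cur = CounterSample(60, 700, 500, 2, 0, 9_400_000)
for name, rate in to_rates(prev, cur).items():
    series.setdefault(name, []).append(rate)
print(series)
```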
As a result, infrastructure teams can focus on delivering business value instead of constantly fine-tuning alert settings and firefighting problems that may not exist.
Avoiding alert duplication
Doubling up on monitoring may introduce additional challenges by creating more alerts. Monitoring can become cluttered over time as teams add tracking for new projects or create additional monitoring when troubleshooting or testing. Before long, what seemed like a clean and simple monitoring setup can turn into an overloaded maze of spurious or redundant alerts that obscure rather than illuminate issues.
For example, IT teams sometimes receive alerts for high CPU usage, slow application response times, and network congestion from the same overloaded server. Without understanding the correlation, teams might investigate three separate problems instead of the single root cause.
Modern AI technologies, when coupled with monitoring, again transform this issue by automatically detecting similar monitoring configurations. Employing techniques like fuzzy matching and heuristics, this approach analyzes behavioral patterns and uncovers correlations between similar monitors to reveal hidden interconnections.
This matters for two main reasons. First, it reduces alert noise. Instead of receiving three separate alerts from one problem, teams get a single alert with a clear understanding of what needs attention and why. Second, it eliminates redundant monitoring. This helps create a more manageable setup that streamlines dashboards and reduces the cognitive load.
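A simplified sketch of both ideas might look like the following snippet, which groups alerts from the same host that arrive within a short window and uses fuzzy string matching to flag near-duplicate monitor definitions. The schema, thresholds, and sample data are hypothetical, and real products draw on far richer correlation signals.

```python
from difflib import SequenceMatcher
from itertools import groupby

# Hypothetical alert records: one overloaded server producing three symptoms.
alerts = [
    {"host": "srv-01", "metric": "cpu_utilization", "ts": 1000},
    {"host": "srv-01", "metric": "app_response_time", "ts": 1012},
    {"host": "srv-01", "metric": "network_congestion", "ts": 1020},
    {"host": "srv-07", "metric": "disk_usage", "ts": 1500},
]

def correlate(alerts, window=60):
    """Collapse alerts from the same host arriving within `window` seconds."""
    incidents = []
    keyed = sorted(alerts, key=lambda a: (a["host"], a["ts"]))
    for _, group in groupby(keyed, key=lambda a: a["host"]):
        group = list(group)
        current = [group[0]]
        for alert in group[1:]:
            if alert["ts"] - current[-1]["ts"] <= window:
                current.append(alert)   # same burst -> same incident
            else:
                incidents.append(current)
                current = [alert]
        incidents.append(current)
    return incidents

def looks_redundant(monitor_a: str, monitor_b: str, threshold=0.85) -> bool:
    """Fuzzy-match monitor names or configs to surface likely duplicates."""
    return SequenceMatcher(None, monitor_a, monitor_b).ratio() >= threshold

for incident in correlate(alerts):
    print(f"{incident[0]['host']}: {len(incident)} alert(s) -> 1 incident")
print(looks_redundant("Ping check - web-frontend", "Ping check - web_frontend"))
```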
The future of intelligent monitoring
Other networking and cybersecurity developments also support the case for increased monitoring as complexity continues to grow exponentially. What were once separate, air-gapped industrial networks are now interconnected with enterprise systems, creating hybrid environments where one network issue can impact both production lines and business applications. And we’re seeing this convergence across the modern stack.
Industrial IoT sensors, edge gateways, and OT devices now communicate alongside standard IT protocols. When these diverse systems experience issues, admins require monitoring that can understand relationships across the ecosystem rather than treating each as a separate silo. Vigilance is non-negotiable as a successful breach can halt production lines, damage expensive equipment, and pose safety hazards. In fact, unplanned downtime now costs Fortune Global 500 companies 11% of their annual revenue, underscoring that the cost of intelligent monitoring is significantly less than the expense of manual troubleshooting and lost productivity.
Meanwhile, there’s no escaping that hackers on the other side of the cybersecurity ledger are using this technology as a productivity breakthrough to attack at scale. Free or inexpensive generative AI large language models (LLMs) enable hackers to generate and modify attacks at a minimal cost. And, with time, it’s clear that bad actors increasingly see AI as a game-changer. Today, seven out of 10 believe the technology and its various tools enhance hacking, up from just two out of 10 in 2023.
Today’s anomaly detection algorithms are built on mathematics and statistics that have been well established for decades. This technology works, but the advent and application of AI and LLMs to metric monitoring is game-changing. We’re seeing some of the first time-series-based LLMs come to market, and we can expect them to transform anomaly detection over the next two years. Several of these new models are already showing excellent accuracy and rapid advances.
The choice of how best to oversee their ecosystems and counter threats now lies with IT and operations teams. The good news is that automated anomaly detection and baseline monitoring can help better protect assets while continually learning and adapting, which in turn enables more effective capacity planning and resource optimization. Basic up/down checks are still valuable, but when a single issue can cascade across IT, OT, and IoT systems, we need intelligent context on top of that foundation. Infrastructure defenders can meet the moment by scaling up their visibility accordingly.