Your AI infrastructure is only as strong as your DNS strategy

Susana Schwartz, October 22, 2025


Assess four key areas when making DNS a mission-critical component of resiliency

In sum – what to know:

DNS is strategic – DNS is not a minor network utility, but rather a strategic component of AI.

Wake-up call – This week’s AWS outage may accelerate improvements to DNS features for AWS, GCP, Microsoft Azure, and other cloud providers and their customers.

Risk mitigation – Areas of vulnerability like data pipelines, multicloud and high-velocity AI environments can be identified and strengthened with proactive mitigation measures.

“It’s always a DNS error” is a common refrain, but one that will get more attention now that this week’s Amazon Web Services outage knocked out AI-driven applications and other web services, including Alexa, Kindle, Ring, and Prime Video, as well as Google, Hulu, Lyft, Netflix, Reddit, Snapchat, Signal, Starbucks, McDonald’s, Lloyds, Halifax, Roblox, Fortnite, Spotify, T-Mobile, Verizon, Venmo, Zoom, and hundreds more, affecting a collective user base of millions of people.

AI’s reliance on real-time data flows means every API, data pipeline, and edge deployment depends on DNS. When errors disrupt network communications for data, computation, and integration with other services, cascading failures occur. “Many AI companies offer their models in just a single cloud region, or a very select number of cloud regions, which often creates implicit concentration risk in a particular region. AWS US-East-1 is certainly a particular point of concentration,” explained Gartner Distinguished Vice President Lydia Leong.

Even customers not hosted in “Data Center Alley’s” US-East-1 location were indirectly affected, struggling to create support cases or to change IAM configurations.

For these reasons, DNS has to graduate from a “minor network utility” to a mission-critical component of AI. The AWS outage “demonstrates both how far we’ve come and where we still need to focus,” said Dolores Saiz, CEO of cloud technology consultancy The Server Labs. She added, “Compared to major incidents 10-15 years ago, where you could be offline for weeks on end, today’s cloud platforms enable dramatically faster recovery times, but only if businesses have architected for resilience from the start.”

With AI applications becoming more intricate and expansive in their reach, new layers of dependencies make DNS a foundational element for much-needed resilience. Below are some critical areas to proactively assess when it comes to DNS:

Data pipeline failures: AI models’ accuracy can be affected when the data pipelines on which they rely go down. Make sure Extract, Transform, Load (ETL) tools, servers, and cloud services resolve hostnames into correct IP addresses. As much as possible, address misconfigurations, server outages, caching issues, and network latency, all of which can disrupt data flow and cause failures in pipeline jobs.

Multicloud, high-velocity AI environments: DNS complexity and fragmentation of services across cloud providers mean a loss of centralized visibility and control for security and operations teams. Work to standardize configurations and implement advanced monitoring and network observability to identify issues in interconnected cloud networks, Kubernetes clusters, and AI workloads.

Security vulnerabilities: Malicious actors often target AI operations and intellectual property while DNS outages occur. AI has created a new threat vector because it can be used to automate the search for vulnerable DNS records, especially “dangling CNAMEs,” which point to decommissioned resources. That enables bad actors to impersonate legitimate services and carry out sophisticated phishing, malware distribution, and domain hijacking attacks.

DNS tunneling: Cybercriminals can encode and exfiltrate sensitive data from AI systems by hiding it within DNS queries and responses. Encoding the data of other programs or protocols within DNS queries and responses helps bypass firewalls and security measures so that communications are established between the compromised system and the criminal’s server.
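As a rough illustration of how the tunneling pattern described above might be flagged, the Python sketch below scores DNS query names by label length and character entropy, since encoded payloads tend to produce long, random-looking subdomain labels. The function names and the thresholds (40 characters, 3.5 bits per character, a 16-character minimum) are assumptions chosen for illustration, not values from any vendor or standard, and real detection tools weigh many more signals such as query volume and record types.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of the characters in text, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_like_tunneling(
    qname: str,
    max_label_len: int = 40,      # assumed threshold, tune on real traffic
    max_entropy: float = 3.5,     # assumed threshold, tune on real traffic
    min_len_for_entropy: int = 16,
) -> bool:
    """Heuristically flag a query name whose subdomain labels look like encoded data."""
    labels = qname.rstrip(".").split(".")
    # Crude assumption: the registered domain is the last two labels
    # (this is wrong for suffixes such as .co.uk).
    subdomain_labels = labels[:-2]
    for label in subdomain_labels:
        if len(label) > max_label_len:
            return True
        if len(label) >= min_len_for_entropy and shannon_entropy(label) > max_entropy:
            return True
    return False
```

A check like this is best treated as a triage signal that feeds a human review or a stricter filter, because legitimate services such as content delivery networks can also generate long, random-looking hostnames.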

It’s also wise to check what your cloud provider is doing in the realm of DNS. Before the outage, AWS and other giant cloud platforms like Microsoft Azure and Google Cloud Platform were rolling out new features related to DNS. Below are a few examples:

AWS

Route 53 Resolver DNS Firewall: Users can filter and regulate DNS queries, automatically deploying mitigations against new threats and managing custom blocklists. It was last updated in July 2025 with features for real-time pattern and anomaly detection.

Multi-account management: Route 53 Resolver eases cross-account DNS forwarding and simplifies DNS management in multi-account environments, reducing the types of errors that were commonplace with manual configurations.

Google Cloud Platform (GCP)

DNSSEC support: Like AWS, GCP offers full support for DNSSEC, which is a major tool for securing DNS and preventing cache poisoning and other attacks.

Limited native policies: GCP has historically had more limited native routing policies compared to AWS, and the need for more robust DNS security will likely prompt an evolution in this realm.

Microsoft Azure

DNSSEC adoption: Azure’s support for DNSSEC on public zones is a move toward more robust DNS security.

If major cloud players harden their systems against DNS outages, and businesses and institutions do more to improve their own resiliency, then perhaps the chasm between where we are now and where we need to be can be closed. “The key lesson is building resilience and comprehensive business continuity into your cloud architecture. Every organization should be asking ‘which business functions continue operating, which degrade gracefully, and which stop entirely?’ That gap between current state and required continuity is where your resilience strategy needs to focus,” added Saiz.

An analysis yesterday by Adrian Cockcroft of consulting firm Orionx.net laid out some possible proactive actions that organizations could take. These include using distinct domains for email, internal company services, and externally facing products, and supporting more than one DNS provider by migrating configurations between providers while keeping them synchronized (for example, interchanging .com and .net, so that if code fails to reach one, it automatically tries the other). In addition, organizations could start regularly auditing DNS records, removing any stale or misconfigured ones, and perhaps avoid DNS wherever possible. All of these measures will be an important part of battling service outages, performance bottlenecks, and expanding security threats.
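The idea of interchanging .com and .net so that code automatically tries the other could look something like the Python sketch below. It is a minimal illustration using only the standard library, not the method from the analysis itself. The function names and the example hostnames are hypothetical, and a production client would also need timeouts, caching, and health checks on the resolved addresses.

```python
import socket
from typing import Callable, Iterable, List, Tuple


def resolve_ips(hostname: str) -> List[str]:
    """Resolve a hostname to a sorted list of unique IP addresses via the system resolver."""
    infos = socket.getaddrinfo(hostname, None)
    return sorted({info[4][0] for info in infos})


def resolve_with_fallback(
    hostnames: Iterable[str],
    resolver: Callable[[str], List[str]] = resolve_ips,
) -> Tuple[str, List[str]]:
    """Try each hostname in order and return the first one that resolves.

    The hostnames are assumed to be equivalent endpoints published under
    different domains (and ideally served by different DNS providers).
    """
    errors = []
    for name in hostnames:
        try:
            ips = resolver(name)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            errors.append(f"{name}: {exc}")
            continue
        if ips:
            return name, ips
        errors.append(f"{name}: no addresses returned")
    raise LookupError("no hostname resolved: " + "; ".join(errors))


# Hypothetical usage: prefer the .com name, fall back to the .net mirror.
# name, ips = resolve_with_fallback(["api.example.com", "api.example.net"])
```

The resolver is passed in as a parameter so the fallback logic can be exercised in tests without touching the network, which matters when the failure being rehearsed is DNS itself.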
