The rise of AI has created significant power-management challenges for modern data center infrastructure. Traditional enterprise racks that once consumed an average of 7-10 kW now require close to 30-100 kW. This surge in computational requirements has revealed a fundamental bottleneck: Traditional infrastructure isn't enough to sustain AI growth.
However, AI can also prove to be a savior: By embedding it into hardware design and automated construction workflows, data centers can evolve from the passive hubs they are today into intelligent, adaptive systems.
Revolutionizing hardware architecture through AI
Because AI models are developing so rapidly, hardware design itself must be reshaped into a streamlined, AI-driven innovation cycle, spanning everything from microarchitecture design to macro-level system management.
AI-driven chip design processes
Modern AI accelerator chips integrate chiplets, high-bandwidth memory (HBM) stacks and dense interconnect structures; at this level of complexity, manual design isn't scalable. AI-driven electronic design automation (EDA) tools are essentially the future.
AI-driven EDA tools can be pivotal in multiple avenues. In one such instance, Google has shown that a task as complex as chip floorplanning can be completed in hours, with results that rival or surpass human efforts in quality. These optimizations can reduce parasitic energy losses and prevent thermal hotspots in the physical design (PD).
AI models can also evaluate thousands of multi-die configurations to predict hotspots, through-silicon via (TSV) density issues and power-delivery constraints. This enables far more thermally balanced 2.5D/3D layouts than traditional heuristics.
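The evaluation loop described above can be sketched with a toy example. The sketch below exhaustively scores candidate chiplet placements on a small interposer grid using a power-density proxy for thermal imbalance; the chiplet names, power draws and grid are all invented for illustration, and a real flow would use a learned surrogate model rather than brute force.

```python
import itertools

# Hypothetical chiplet power draws in watts (illustrative numbers only).
CHIPLETS = {"compute0": 150, "compute1": 150, "hbm0": 30, "hbm1": 30}

# Candidate sites on a 2x2 interposer grid, as (x, y) coordinates.
SITES = [(0, 0), (0, 1), (1, 0), (1, 1)]

def hotspot_score(placement):
    """Proxy for thermal imbalance: sum of pairwise power products weighted
    by inverse distance -- high-power chiplets placed close together score
    worse."""
    score = 0.0
    for (a, (ax, ay)), (b, (bx, by)) in itertools.combinations(
            placement.items(), 2):
        dist = abs(ax - bx) + abs(ay - by)  # Manhattan distance
        score += CHIPLETS[a] * CHIPLETS[b] / dist
    return score

def best_placement():
    """Evaluate every assignment of chiplets to sites and keep the one
    with the lowest hotspot proxy score."""
    best, best_score = None, float("inf")
    for perm in itertools.permutations(SITES):
        placement = dict(zip(CHIPLETS, perm))
        s = hotspot_score(placement)
        if s < best_score:
            best, best_score = placement, s
    return best, best_score

placement, score = best_placement()
```

As expected, the optimizer separates the two high-power compute chiplets to opposite corners of the grid, which is the kind of thermally balanced result the heuristics aim for.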
Apart from PD floorplanning and hotspot prevention, verification is another avenue where AI-driven EDA can be useful. Verification is especially critical because it can consume up to 70% of chip development time. AI tools such as Synopsys' ML-based verification offerings and Cadence Cerebrus can help reduce this development time. With shorter development cycles, meeting the growing performance needs of AI models becomes feasible.
Another avenue where AI can be useful is reducing power consumption in frontend design. Researchers have demonstrated that ML-driven dynamic voltage and frequency scaling (DVFS) strategies reduce power without significant performance loss. AI can also predict the power consumption of an RTL design or post-layout snapshot in seconds, allowing designers to iterate rapidly.
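A minimal DVFS sketch illustrates the power lever involved. Assuming a hypothetical table of operating points and the classic CMOS dynamic power relation P = C·V²·f, the policy below picks the slowest point that covers a predicted utilization level (standing in for an ML predictor's output); all numbers are invented for illustration.

```python
# Illustrative DVFS operating points: (frequency GHz, voltage V).
# Values are hypothetical, not taken from any real accelerator.
OPERATING_POINTS = [(1.0, 0.70), (1.5, 0.85), (2.0, 1.00)]
SWITCHED_CAPACITANCE = 2.0e-9  # farads, illustrative

def dynamic_power(freq_ghz, voltage):
    """Classic CMOS dynamic power model: P = C * V^2 * f (watts)."""
    return SWITCHED_CAPACITANCE * voltage ** 2 * freq_ghz * 1e9

def choose_operating_point(predicted_utilization):
    """Pick the slowest operating point whose frequency covers predicted
    demand, expressed as a fraction (0.0-1.0) of peak frequency."""
    max_freq = OPERATING_POINTS[-1][0]
    for freq, volt in OPERATING_POINTS:
        if freq >= predicted_utilization * max_freq:
            return freq, volt
    return OPERATING_POINTS[-1]

low = choose_operating_point(0.4)   # light load -> lowest point
high = choose_operating_point(0.9)  # heavy load -> highest point
```

Because power scales with V²·f, running the light-load case at the low operating point consumes a fraction of the peak-point power, which is exactly the saving an ML predictor tries to capture without missing deadlines.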
Thermal and power management
Modern AI chips generate vast amounts of heat, which can lead to hardware failure, so they require algorithms that can analyze data from many thermal sensors. AI algorithms can play a pivotal role here. Modern data centers have already used AI to reduce facility energy consumption, achieving significant savings. These AI-driven systems improve hardware longevity and reliability while significantly reducing operational costs.
AI can also analyze operational data to identify energy-intensive processes and then allocate computational tasks to the most efficient resources. This reduces idle time and avoids power waste.
This creates a “self-sustainable cycle”: Power-optimized hardware enables the training of even more powerful AI models, which in turn are used to design the next generation of hardware.
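The allocation step described above can be sketched as a greedy scheduler. The sketch below places each task on the most energy-efficient server (highest operations per watt) that still has capacity; the server fleet, efficiency figures and task demands are all hypothetical, and a production system would use richer telemetry and constraints.

```python
def assign_tasks(tasks, servers):
    """Greedy scheduler: place each task on the most energy-efficient
    server (highest ops per watt) that still has capacity.
    Returns {task_name: server_name}."""
    ranked = sorted(servers, key=lambda s: s["ops_per_watt"], reverse=True)
    free = {s["name"]: s["capacity"] for s in servers}
    assignment = {}
    # Largest tasks first, so big jobs get first pick of efficient nodes.
    for task, demand in sorted(tasks.items(), key=lambda t: -t[1]):
        for server in ranked:
            if free[server["name"]] >= demand:
                free[server["name"]] -= demand
                assignment[task] = server["name"]
                break
    return assignment

# Hypothetical fleet and workload (illustrative numbers).
servers = [
    {"name": "old-node", "ops_per_watt": 10, "capacity": 100},
    {"name": "new-node", "ops_per_watt": 30, "capacity": 80},
]
tasks = {"training": 60, "inference": 30, "batch-etl": 50}
plan = assign_tasks(tasks, servers)
```

Here the largest job lands on the efficient node and the remainder overflows to the older one, minimizing energy per operation within capacity limits.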
AI in data center design and construction
To meet the demand for speed-to-market, AI can be integrated into the procurement and design phases of data centers, segments historically slowed by manual reviews and complex specifications.
Streamlining procurement and design
AI tools can be particularly useful in automating tasks that otherwise require a substantial amount of manual work. For example, LLM-based assistants trained on design standards and Request for Information (RFI) history can now respond to vendor queries in minutes – a task that would have taken a control engineer 2 to 4 hours. Similarly, machine learning systems can be used to extract control point requirements (temperature setpoints and pressure limits, for example) from 100% design drawings. This can help reduce human errors while transitioning from blueprint to physical installation.
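The extraction step described above can be sketched even without an ML model, using pattern matching over spec text; an ML system generalizes the same idea to messier documents. The spec snippet, parameter names and values below are invented for illustration.

```python
import re

# Snippet standing in for text pulled from a 100% design drawing;
# wording and values are invented for illustration.
SPEC_TEXT = """
CHW supply temperature setpoint: 7.0 C
Hot aisle temperature limit: 35.0 C
Pump discharge pressure limit: 250.0 kPa
"""

# Capture the parameter name, value and unit from lines like the above.
POINT_RE = re.compile(r"^(.+?)\s+(setpoint|limit):\s+([\d.]+)\s+(\w+)", re.M)

def extract_control_points(text):
    """Return structured control-point records from free-form spec text."""
    return [
        {"name": m[0].strip(), "kind": m[1], "value": float(m[2]), "unit": m[3]}
        for m in POINT_RE.findall(text)
    ]

points = extract_control_points(SPEC_TEXT)
```

The structured records can then feed directly into controls databases or commissioning checklists, which is where the error reduction comes from.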
In addition to identifying control requirements, generative AI tools can organize information scattered across multiple documents and convert it into structured outputs. For example, AI can automatically generate equipment schedules that list all major components, their capacities, control parameters and operating limits. Activities that once took design teams several weeks—such as cross-checking documents, extracting control data and preparing schedules—can potentially be completed in hours.
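The core of that schedule-generation step is merging per-document records about the same piece of equipment into a single row. The sketch below does this by joining on an equipment tag and rendering a CSV; all tags, fields and values are hypothetical.

```python
import csv
import io

# Component data as it might be extracted from three separate documents;
# all tags and values are hypothetical.
mechanical_doc = [{"tag": "CRAH-01", "type": "CRAH unit", "capacity_kw": 120}]
controls_doc = [{"tag": "CRAH-01", "setpoint_c": 22.0}]
electrical_doc = [{"tag": "CRAH-01", "max_amps": 45}]

def build_schedule(*docs):
    """Merge per-document records on the equipment tag into one row each,
    the core of an auto-generated equipment schedule."""
    rows = {}
    for doc in docs:
        for record in doc:
            rows.setdefault(record["tag"], {}).update(record)
    return list(rows.values())

def to_csv(rows):
    """Render the merged schedule as CSV text."""
    fields = sorted({key for row in rows for key in row})
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

schedule = build_schedule(mechanical_doc, controls_doc, electrical_doc)
```

A generative model adds value on top of this by recognizing which scattered statements refer to the same component in the first place; the merge itself remains deterministic.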
Automated commissioning and configuration
Data center commissioning spans five levels, beginning with factory tests and ending with integrated system testing. It is the final hurdle before the data center goes live, and a key step that can become tedious: it requires validating complex, interconnected electrical and mechanical systems to ensure zero-downtime reliability, often under tight timelines. AI scripts can reduce the burden here by automatically checking software configurations and interconnected systems, cutting rework during final testing. Generative AI can also simulate system behavior under various operating conditions before physical commissioning begins, allowing the system to achieve optimal performance upon handover.
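A minimal version of that automated pre-check is a comparison of readback values against the design package. The point names, expected values and tolerance below are invented for illustration.

```python
# Expected settings from the design package vs. values read back from
# devices during commissioning; all names and numbers are invented.
EXPECTED = {
    "chiller-1/supply_temp_c": 7.0,
    "ups-a/transfer_time_ms": 4.0,
    "crah-3/fan_speed_pct": 80.0,
}

def check_configuration(actual, tolerance=0.01):
    """Compare readback values against the design package and return a
    list of (point, expected, actual) mismatches -- the kind of
    pre-check that cuts rework during integrated system testing."""
    mismatches = []
    for point, expected in EXPECTED.items():
        value = actual.get(point)
        if value is None or abs(value - expected) > tolerance:
            mismatches.append((point, expected, value))
    return mismatches

readback = {
    "chiller-1/supply_temp_c": 7.0,
    "ups-a/transfer_time_ms": 6.5,   # drifted from design intent
    "crah-3/fan_speed_pct": 80.0,
}
issues = check_configuration(readback)
```

Running such checks continuously during levels 1-4 means integrated system testing starts from a known-good baseline instead of discovering misconfigurations at the last moment.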
Predictive operations and AIOps
AI can also make data center management predictive and proactive rather than reactive. For example, once a model is trained on vibration and voltage sensor data, it can forecast failures and drive maintenance schedules. This increases reliability and reduces unplanned downtime.
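As a minimal stand-in for such a trained model, the sketch below flags a component when its latest vibration reading deviates sharply from its own history; the sensor values and threshold are hypothetical, and a real system would learn failure signatures from fleet-wide telemetry.

```python
import statistics

def failure_risk(history, latest, threshold=3.0):
    """Flag a component when its latest vibration reading deviates more
    than `threshold` standard deviations from its own history -- a
    minimal stand-in for a trained failure-forecasting model."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    z = abs(latest - mean) / stdev
    return z > threshold, z

# Hypothetical vibration readings (mm/s) from a CRAH fan bearing.
history = [2.0, 2.1, 1.9, 2.0, 2.2, 1.9, 2.1, 2.0]
ok_flag, _ = failure_risk(history, 2.1)      # within normal variation
alert_flag, _ = failure_risk(history, 4.5)   # bearing likely degrading
```

The payoff is the lead time: a flag raised while the fan still spins lets maintenance be scheduled during a planned window instead of forcing an outage.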
Similarly, AI can place high-intensity workloads in cooler areas of a data center, preventing thermal hotspots and reducing the energy required for cooling.
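That placement decision reduces to choosing the coolest rack that still has power headroom. The rack names, temperatures and headroom figures below are invented for illustration.

```python
def coolest_rack(rack_temps, required_kw, rack_headroom_kw):
    """Place a workload in the coolest rack that has enough power
    headroom, keeping new heat away from existing hotspots."""
    candidates = [
        rack for rack, temp in rack_temps.items()
        if rack_headroom_kw.get(rack, 0) >= required_kw
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda r: rack_temps[r])

# Illustrative telemetry: inlet temperatures (C) and free power (kW).
temps = {"rack-a": 27.5, "rack-b": 22.1, "rack-c": 24.0}
headroom = {"rack-a": 40, "rack-b": 5, "rack-c": 30}
target = coolest_rack(temps, required_kw=20, rack_headroom_kw=headroom)
```

Note that the absolute coolest rack is skipped when it lacks headroom; a real scheduler would also predict how the new load shifts temperatures over time rather than relying on a snapshot.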
Since security is of paramount importance in data centers, AI can also enhance physical and digital defenses by detecting network anomalies, such as suspicious traffic patterns or unauthorized access attempts. Threats can then be neutralized in real time rather than after the fact.
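A simple baseline for such detection compares each traffic sample against an exponentially weighted moving average of what came before; the traffic numbers and thresholds below are invented, and real detectors model many features beyond raw volume.

```python
def ewma_anomalies(samples, alpha=0.3, ratio=2.0):
    """Flag samples that exceed `ratio` times the exponentially weighted
    moving average (EWMA) of prior samples -- a minimal stand-in for an
    ML-based network anomaly detector."""
    anomalies = []
    baseline = samples[0]
    for i, value in enumerate(samples[1:], start=1):
        if value > ratio * baseline:
            anomalies.append(i)
        baseline = alpha * value + (1 - alpha) * baseline
    return anomalies

# Hypothetical requests-per-second samples; the spike at index 4 could be
# scanning or exfiltration traffic.
traffic = [100, 110, 95, 105, 900, 100]
suspect = ewma_anomalies(traffic)
```

Because the baseline adapts slowly, normal fluctuation passes while a sudden spike trips the detector in the same sampling interval it occurs, which is what enables a real-time response.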
Sustainability and the circular hardware economy: Beyond the linear lifecycle
Traditionally, enterprise servers had a lifecycle of 3-5 years. Now, with AI models being developed rapidly, AI hardware is refreshed every 12-18 months. This produces large amounts of embodied carbon waste, which isn't environmentally sustainable.
Consequently, hardware and infrastructure engineers need to pivot to a circular hardware economy framework, where hardware is an “evolving asset”.
At the hardware level, modularity is paramount, so that operators can upgrade to high-performance accelerators while retaining the chassis, power delivery units and cooling manifolds. This will significantly reduce the embodied carbon waste during raw material extraction and fabrication of the non-compute components.
To further address this issue, AI can guide hardware decommissioning. Intelligent AI systems can analyze telemetry from a server rack's operational history to predict the remaining useful life of a chip or the components around it (such as power delivery units or cooling manifolds). Healthy units can then be redeployed to edge data centers for less intense inference tasks, whereas failing units can be routed to specialized facilities for recovery of important materials.
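The routing decision described above can be sketched with a toy remaining-useful-life (RUL) model. The design life, wear penalty and thresholds below are invented for illustration; a real system would learn the RUL model from fleet telemetry rather than use a fixed formula.

```python
def predict_rul(age_years, thermal_cycles, design_life_years=5.0):
    """Toy RUL model: start from design life, subtract age and a wear
    penalty per thousand thermal cycles. A real system would learn this
    relationship from fleet telemetry."""
    wear = thermal_cycles / 1000 * 0.5
    return max(0.0, design_life_years - age_years - wear)

def dispatch(unit, rul_years, edge_threshold=1.5):
    """Route a decommissioned unit based on predicted RUL: enough life
    left -> redeploy at the edge for lighter inference work,
    otherwise -> material recovery."""
    if rul_years >= edge_threshold:
        return (unit, "redeploy-edge")
    return (unit, "material-recovery")

# Hypothetical units: (age in years, accumulated thermal cycles).
healthy = dispatch("gpu-node-17", predict_rul(2.0, 2000))
worn = dispatch("gpu-node-03", predict_rul(4.0, 3000))
```

Automating this triage is what turns decommissioning from a blanket write-off into a per-unit decision, keeping serviceable hardware in circulation.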
This way, we can address the AI paradox: using AI to mitigate the environmental footprint of the machines it runs on. It ensures that the next generation of infrastructure is not only built for speed and performance, but is also sustainable.
Modern approaches, not conventional engineering
With the exponential growth of performance-critical AI models, AI has become a foundational requirement for data center infrastructure. Even though modern AI models increase total power consumption, they can also act as a critical driving force to mitigate it. To keep pace with the rise of AI models and the accompanying rise in power consumption, we need to adopt modern approaches rather than rely on conventional engineering alone. The next generation of data center infrastructure will be defined by how well we manage the evolution of hardware design and automated construction to build AI-capable data centers.
By integrating AI at the silicon level and the structural level, we are not just building faster computers; we are building a more intelligent foundation for the future of technology.
Disclaimer: The views expressed in this article are solely those of the authors in a personal capacity and do not represent the views of their employers.
This article is published as part of the Foundry Expert Contributor Network.