Converged architecture built for AI

AI and ML are related but distinct, and as such have different requirements. Machine Learning (ML) is the process whereby a machine, given access to a large pool of data and the appropriate programming, improves its interpretive answers to the questions posed. Artificial Intelligence (AI), meanwhile, is the broader concept of a machine performing tasks in a way we would consider smart, based on its historical responses to the datasets provided. In both cases, the more data from which inferences are drawn, the better the quality of the responses.

The emergence of software capable of performing mathematically intensive analysis on potentially huge datasets, of accelerators (e.g., Graphics Processing Units) better tuned to these kinds of deep mathematical functions, and of more robust, performance-optimized storage infrastructure (e.g., Storage Class Memory, NVMe, and interconnects like NVMe-oF) has launched a new era in the ability of systems to train models and derive inferences from these datasets. The implications for facial recognition, finance, autonomous driving, healthcare, and many other fields are astounding. To that end, vendors have created Integrated Infrastructure systems and platforms, which in turn enable organizations to implement and benefit from AI initiatives more easily.

Challenges with DIY Approaches

During previous iterations of hardware platforms geared toward the unique requirements of AI, customers were forced to come up with their own designs. Without known recipes, they attempted to build platforms on their own, combining best-of-breed processors, storage, and, most notably, accelerators like GPUs into physical infrastructure. Such implementations were more in line with a science project than with a corporate IT initiative. This presented any number of problems, from firmware mismatches to software incompatibilities to unreliable support. The learning curve and the ever-changing product landscape across all the disparate vendors caused inconsistencies. Even when such problems were overcome, the challenges presented by these ad hoc builds were quite often significant and unforeseen.

As the AI market matured, it became clear that for AI initiatives to scale consistently, there was a need for integrated infrastructure systems: platforms in which the vendors preintegrate the various hardware and software components. Many vendors have been driving the adoption of AI builds into enterprises, including AMD, NVIDIA, Pure, NetApp, and a number of the large public cloud providers. They have taken steps to make the commodification of such approaches far more appealing to the enterprise, with the benefits of centralized support models, firmware consistency, management software, and a general ease of adoption. This gives organizations the opportunity to build “AI Infrastructure in a Box” and removes many of the old obstacles to general adoption.

The Solution – Integrated Infrastructure

The development of individual technologies (faster disk, faster interconnects, graphics processing engines, and faster processors, as well as exponential increases in processor core counts) has driven AI technology forward. Up until recently, though, building these architectures has relied solely on the engineering talents of internal staff. These technologies have been critical to the growth of AI architectures, but for a while they remained a science project. As they have matured, Integrated Infrastructure Systems have emerged, in which all components have been tested both together and individually. The result is a reliable recipe for a build that depends only on the size of the datasets to be accessed and that performs in a reliable way. Support, reference builds, customer references, and a dependable management layer are all part and parcel.

Integrated Infrastructure Systems are preconfigured, with most choices already defined. Variables such as the amount of storage and the number of servers are determined at the time of purchase, based on the workloads anticipated in the requirements. The systems still arrive preconfigured and ready to land on the data center floor. These solutions may be built from the products of multiple technology suppliers or from those of a single supplier. Either way, they are not built as separate solutions for each core technology, but as one solution.
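To make the idea of sizing at the time of purchase concrete, here is a minimal sketch of how a dataset size might be turned into a number of storage units. It is an illustration under stated assumptions, not any vendor's sizing method: the growth rate, headroom, and per-unit capacity are all hypothetical inputs, and the names SizingInput, required_capacity_tb, and units_needed are made up for this example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SizingInput:
    """Hypothetical sizing inputs gathered at order time."""
    dataset_tb: float      # current dataset size, in terabytes
    annual_growth: float   # assumed yearly growth, e.g. 0.5 for 50%
    years: int             # planning horizon in years
    headroom: float        # assumed free-space margin, e.g. 0.2 for 20%


def required_capacity_tb(s: SizingInput) -> float:
    """Project the dataset forward with compound growth, then add headroom."""
    projected = s.dataset_tb * (1 + s.annual_growth) ** s.years
    return projected * (1 + s.headroom)


def units_needed(capacity_tb: float, tb_per_unit: float) -> int:
    """Round up to whole storage units (blades, shelves, or nodes)."""
    if tb_per_unit <= 0:
        raise ValueError("tb_per_unit must be positive")
    return math.ceil(capacity_tb / tb_per_unit)


if __name__ == "__main__":
    # All numbers here are placeholders for illustration only.
    order = SizingInput(dataset_tb=100, annual_growth=0.5, years=2, headroom=0.2)
    capacity = required_capacity_tb(order)
    print(f"Capacity to order: {capacity:.0f} TB")
    print(f"Storage units at 50 TB each: {units_needed(capacity, 50)}")
```

A real order would also size servers, GPUs, and network ports, but the principle is the same: a small set of workload inputs drives an otherwise fixed configuration.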

Integrated Infrastructure changed the game on how IT consumes infrastructure. Maturity in the marketplace has made such integrated infrastructure platforms viable in the AI landscape. The true benefits of a “Validated Design” approach, as opposed to custom reference builds, should be obvious. The list below is by no means exhaustive, but it does give a representative sample of reference builds in the marketplace.

Storage Elements

Of critical importance in AI architectures is the way in which data is stored, accessed, and available to the infrastructure requiring access. The datasets are growing and promise to continue to do so.

The vast amounts of data needed to gain solid analytical insight for AI have highlighted the problems with storage as it has historically been used. The reliability of scale-out architectures, global namespace requirements, and sheer speed of access for both reads and writes of unstructured data have been, and will continue to be, addressed by storage components for AI.

The advent of accelerators like GPUs (Graphics Processing Units), which provide discrete functionality and work cooperatively with the more traditional CPU (Central Processing Unit), together with the software needed to run these functions, has aided in the access and processing of the data. These accelerators have brought massive processing power to bear on the data.

No longer can these enormous datasets reside on just one monolithic array. The ability to scale out the storage platform while having the file system present as one available namespace gives these systems the kind of functionality that AI workloads require. Sometimes the datasets reside in multiple arrays, across multiple data centers, or even in cloud-based storage. The AI infrastructure must have access to all of these datasets in order to complete its calls to that data. NVMe architectures and NVMe over Fabrics interconnects, along with faster Ethernet and file systems that allow a global namespace to be adopted, have gone a long way toward speeding up and scaling out these architectures. Imagine that every autonomous vehicle on the road creates and transmits gigabytes of data per day. Multiply that by the sheer number of these vehicles on the road, and you begin to get an idea of the huge volumes of data we’re talking about.
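The autonomous vehicle example is simple multiplication, and a quick sketch shows how fast it adds up. The fleet size and per-vehicle volume below are hypothetical placeholders chosen only to illustrate the arithmetic; they are not measurements from any real fleet.

```python
# Back-of-the-envelope estimate of fleet data volume.
# All input figures are hypothetical placeholders, not measurements.

GB_PER_TB = 1_000
TB_PER_PB = 1_000


def fleet_data_per_day_pb(vehicles: int, gb_per_vehicle_per_day: float) -> float:
    """Return the total data generated per day by the fleet, in petabytes."""
    total_gb = vehicles * gb_per_vehicle_per_day
    return total_gb / GB_PER_TB / TB_PER_PB


if __name__ == "__main__":
    # Assumed: 100,000 vehicles, each producing 5 GB per day.
    daily_pb = fleet_data_per_day_pb(vehicles=100_000, gb_per_vehicle_per_day=5.0)
    print(f"Per day:  {daily_pb:.2f} PB")
    print(f"Per year: {daily_pb * 365:.1f} PB")
```

Even with these modest assumed inputs, the fleet produces half a petabyte per day, which is well beyond what a single monolithic array is designed to absorb.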

Problems Addressed by Integrated Infrastructure Solutions for AI/ML

Technical Problems

There are a variety of issues that AI architectures have historically faced. As technical solutions to these issues have become available, they’ve been incorporated into Integrated Infrastructure System (IIS) architectures.

  • Data growth has been massive. Scaling file systems across multiple data centers, multiple arrays, and potentially multi-cloud environments has increased the difficulty of accessing the full scope of the data involved.
  • Storage infrastructures haven’t had the required disk speed or low enough latency to handle quick access to all these distinct sets of data.
  • Network elements capable of rapid ingest of this data haven’t existed across the full spectrum of locations, nor have they allowed for reasonable read IOPS.
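The ingest problem in the last bullet can be estimated with one division: dataset size over usable link bandwidth. The sketch below assumes decimal units and a hypothetical efficiency factor for protocol overhead; the dataset size and link speeds are illustrative inputs only.

```python
# Rough estimate of how long it takes to ingest a dataset over a network link.
# Dataset size, link speed, and efficiency are hypothetical inputs.

def ingest_seconds(dataset_tb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Seconds needed to move dataset_tb terabytes over a link_gbps link.

    efficiency is the assumed fraction of line rate actually achieved,
    in the range (0, 1].
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be positive and efficiency in (0, 1]")
    bits = dataset_tb * 1e12 * 8                    # decimal terabytes to bits
    usable_bits_per_s = link_gbps * 1e9 * efficiency
    return bits / usable_bits_per_s


if __name__ == "__main__":
    # Compare an assumed 1 PB dataset across three link speeds.
    for gbps in (10, 100, 200):
        hours = ingest_seconds(dataset_tb=1_000, link_gbps=gbps) / 3600
        print(f"1 PB over {gbps:>3} Gbps: {hours:,.1f} hours")
```

The estimate ignores storage-side limits, so in practice the slower of the network and the arrays sets the real ingest time.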

Technical Solutions

A number of technical solutions have arisen, and have become standardized in the industry, in an effort to resolve the problems outlined above.

  • Disk technology has advanced, including solid-state technologies, with ever-increasing IOPS and storage capacity
  • Optane memory has been a solid advancement in the ability to place more data into memory, decreasing the need for the program to make calls to disk and placing that data as close to the processors as possible, improving performance
  • Optane SSDs have increased persistent cache capacity directly on the PCIe bus, also bringing that data closer to the CPU
  • Interconnect latencies, along with space and power requirements, have been further reduced by the addition of fast NVMe SSDs, which promise continued growth and further reductions in latency
  • Networking and data fabrics have improved with the standardization of faster Ethernet, faster Fibre Channel, and faster InfiniBand, as well as NVMe over Fabrics, all of which reduce latency
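The benefit of keeping data closer to the processor can be approximated with a weighted average: the share of requests served from the fast tier times its latency, plus the remainder times the slow tier's latency. The sketch below uses hypothetical latency values purely to show the shape of the curve; it does not represent measured figures for any product.

```python
# Average access latency for a two-tier setup (fast tier in front of a slow tier).
# Latency values used in the example are hypothetical.

def effective_latency_us(hit_ratio: float, fast_us: float, slow_us: float) -> float:
    """Weighted-average latency in microseconds.

    hit_ratio is the fraction of requests served by the fast tier (0.0 to 1.0).
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * fast_us + (1.0 - hit_ratio) * slow_us


if __name__ == "__main__":
    # Assumed: 1 microsecond for the fast tier, 100 microseconds for the slow tier.
    for ratio in (0.0, 0.5, 0.9, 0.99):
        avg = effective_latency_us(ratio, fast_us=1.0, slow_us=100.0)
        print(f"hit ratio {ratio:4.2f}: {avg:6.2f} us average")
```

The slow tier dominates the average until the hit ratio is very high, which is why adding capacity close to the CPU matters so much for these workloads.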

Business Problems

As with the technical problems above, a number of these difficulties directly affect the business. Some of those are listed below.

  • Businesses are leveraging cloud-based architectures to scale their storage, which can be exceptionally helpful for scalability and growth. The difficulties have come from file systems that haven’t historically been architected to handle the latencies and disparate locations involved. The need to access this data rapidly, particularly in AI systems, has historically made the analytics difficult as well as costly. Data egress charges from cloud providers can greatly increase the related costs.
  • Achieving the low latency and high speed needed to generate timely reports against the data, particularly for on-demand reporting, has been quite difficult.

Business-Related Solutions

Newer technical approaches have come a long way toward addressing the business-related issues as well. This section discusses some of those solutions.

  • A clear understanding of the scalability requirements and how they are used, combined with a universal-namespace file system (either file-based or object-based), allows these datasets to be accessed in a number of ways, giving the systems a more accurate and efficient way to scale out access to all of this data.
  • Understanding egress charges, and how reading versus copying that data can affect them, can go a long way toward alleviating these costs.
  • Network utilization and upgrade paths have matured so that speed and latency are handled better than in previous iterations. Understanding these networking technologies, how they integrate with existing infrastructure, and the learning curve needed to implement them can help significantly in resolving those issues.
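The read-versus-copy point about egress charges can be illustrated with a small calculation. The sketch below compares reading a cloud-resident dataset remotely on every training pass against copying it out once and reading the local copy. The price per gigabyte, the dataset size, and the number of passes are hypothetical; actual cloud pricing varies by provider, region, and tier, and the model ignores the cost of storing the local copy.

```python
# Compare egress cost of repeated remote reads against a one-time copy.
# Price, dataset size, and pass count are hypothetical inputs.

def egress_cost(dataset_gb: float, price_per_gb: float, transfers: int) -> float:
    """Cost of moving the full dataset out of the cloud `transfers` times."""
    if transfers < 0:
        raise ValueError("transfers must not be negative")
    return dataset_gb * price_per_gb * transfers


def compare_read_vs_copy(dataset_gb: float, price_per_gb: float, passes: int) -> tuple[float, float]:
    """Return (cost of reading remotely on every pass, cost of copying once)."""
    remote_reads = egress_cost(dataset_gb, price_per_gb, passes)
    copy_once = egress_cost(dataset_gb, price_per_gb, 1)
    return remote_reads, copy_once


if __name__ == "__main__":
    # Assumed: 10 TB dataset, 0.05 currency units per GB, 20 training passes.
    remote, copied = compare_read_vs_copy(10_000, 0.05, 20)
    print(f"Remote reads on every pass: {remote:,.2f}")
    print(f"Copy once, read locally:    {copied:,.2f}")
```

Under these assumptions the repeated reads cost twenty times as much as the single copy, which is the kind of difference worth knowing before an architecture is chosen.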

Architectural Approaches

Converged Architectures

A “converged solution” can be defined as a platform in which each of the core technologies must be chosen and put together. These are not sold as a single unit containing all four core technologies (servers, storage systems, networking, and management software). Instead, a choice is made from within each key category. For example, there may be choices available for storage, GPUs, network interconnects, and CPUs. Each of these decisions can have ramifications for the whole and should not be undertaken lightly.

Integrated Infrastructure Systems

An Integrated Infrastructure System can be defined as a prebuilt, fully qualified, and fully fleshed-out solution in which variables such as storage, CPU, networking, and interconnect are determined at the point of order. These are sold, implemented, and delivered as complete solutions, not as piecemeal approaches toward a goal.

Many vendors have stepped forward with complete solutions, designed for building and standing up a defined approach to handling the requirements of complex workloads, the sheer amount of data needing to be accessed, and the complexities of these analytic tasks. Below is a selected set of these solutions, with some of the variables of each approach outlined and explained.

These IIS systems have changed the game on how IT consumes infrastructure. The benefits of a Validated Design approach, as opposed to a Certified Reference Architecture, are obvious. The list below is not exhaustive, but it does give a representative sample of the various integrated builds in the marketplace.

Examples of Solutions

NVIDIA has created a platform that leverages the power of the BlueField DPU and NVIDIA GPU technology. The NVIDIA networking fabric and the NVIDIA Base Command Manager software are often part of the build as well. Together, these pieces make up the DGX POD. Differentiators among the solutions quite often come from the storage elements, which explains why the varying solutions are often built and presented by storage vendors.

Built on the DGX series of solutions from NVIDIA, on either the DGX POD or the DGX SuperPOD, are solutions from the following vendors:

Pure’s AIRI

The goal is a fully integrated stack with fast NFS storage, the current state of the art in NVIDIA graphics processing units, and a suite of software built around deep learning, Kubernetes cloud orchestration, and workloads. The build centers on the DGX GPU and the capacities of Pure’s established platform for file-based storage. The FlashBlade is designed for expandability. When the storage needs to grow, another blade is simply hot-plugged in, and because every blade also adds processor functionality, more power as well as more capacity is incorporated quickly into the storage subsystem. Advanced AI frameworks benefit greatly from this storage architecture. The architecture delivers 1.5 million IOPS of NFS storage and, with four DGX GPUs, roughly 20 petaflops of AI performance.

The key differentiator among all the DGX (POD or SuperPOD) builds is the FlashBlade array: a truly scalable storage infrastructure in which power is added with each blade, so appropriate processing is integrated as storage is added. The file system is highly scalable and can be managed on-premises as well as by running Pure’s Purity OS in the cloud, allowing the file system to be extended wherever required. The interconnect architecture has embraced NVMe over Fabrics as well, for the fastest access and the lowest latency.
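An IOPS figure only becomes a bandwidth figure once an I/O size is chosen, so the quoted 1.5 million IOPS can be put in perspective with a short calculation. The I/O sizes below are assumptions made for illustration, since the text does not state one, and the even split across four clients is likewise a simplification rather than a published figure.

```python
# Convert an IOPS figure into throughput for an assumed I/O size.
# The I/O sizes and the even split across clients are assumptions.

def throughput_gb_per_s(iops: int, io_size_bytes: int) -> float:
    """Throughput in decimal gigabytes per second for a given I/O size."""
    return iops * io_size_bytes / 1e9


def iops_per_client(total_iops: int, clients: int) -> float:
    """Even share of the total IOPS across a number of clients."""
    if clients <= 0:
        raise ValueError("clients must be positive")
    return total_iops / clients


if __name__ == "__main__":
    total_iops = 1_500_000          # figure quoted in the text
    for size in (4_096, 65_536):    # assumed I/O sizes: 4 KiB and 64 KiB
        print(f"{size:>6} B per I/O: {throughput_gb_per_s(total_iops, size):.2f} GB/s")
    print(f"Per client with 4 clients: {iops_per_client(total_iops, 4):,.0f} IOPS")
```

The same IOPS number implies very different bandwidth depending on the I/O size, which is worth keeping in mind when comparing storage figures across vendors.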

IBM

Solution(s): IBM Spectrum Scale (on-premises and hybrid cloud), built on NVIDIA DGX POD

  • Leveraging NVIDIA BlueField DPUs and NVIDIA GPUs (or Tensor Core GPUs) for processing
  • 1-10 petabytes of server-based storage, with Spectrum Scale storage management across POSIX, SFS, SMB, or object store, both local and cloud
  • IBM Elastic Storage System based on Spectrum Scale NVMe all-flash
  • NVIDIA Mellanox QM8700 InfiniBand, NVIDIA NGC
  • NVIDIA’s CUDA-X and DGX Software
  • Management via NVIDIA Base Command Manager

IBM has been in the AI market for three decades and has a strong ecosystem of existing software, training, and research advisory services to help build and enable trusted AI. The solution is designed to run Kubernetes containers and to support object storage, leveraging Active File Management (AFM) if desired. Integration with IBM Cloud Pak for Data ensures easy integration with the cloud.

The ideal customer will have an appreciation for IBM’s history in previous iterations of AI solutions on minicomputer and UNIX systems.

NetApp

NetApp differentiates its approach by pairing the DGX SuperPOD with its EF600 enterprise all-flash array. Key differentiators:

  • Leveraging NVIDIA BlueField DPUs and NVIDIA GPUs for processing
  • NetApp EF600 all-flash array with ONTAP AI
  • 200 Gbps NVIDIA networking fabric
  • NetApp Data Science Toolkit
  • Management from the NetApp AI Control Plane

As discussed previously, having ample processor and GPU power is only part of the goal. A purpose-built storage infrastructure that delivers the scalability and performance necessary for access to the data being leveraged, along with all the required storage features (deduplication, replication, compression, and a speedy interconnect), makes the implementation and day-to-day operation of the architecture friendlier for the organization. NetApp’s suite of management software ensures that the environment, which has already proven itself in the enterprise, can be controlled consistently with other NetApp architectures in the organization. The minimization of the learning curve is not to be trivialized.

NetApp is one of the most established and well-rounded storage platforms. A key customer base for the NetApp AI solution is enterprises that already lean on ONTAP and are familiar with its nuances, which reduces any learning curve associated with storage.

Additionally, there are approaches that lean on GPU and CPU solutions based on AMD’s or Intel’s toolkits. I’ve decided to refer to these as enabling vendors, because what they offer is less a reference build plus storage and more a set of reference architectures.

Enabling Technology Vendors

The vendors below are listed because they create some of the technologies that others adopt into IIS solutions. While those solutions are marketed by others, they would not be possible without the enabling technology built by the following vendors.

Advanced Micro Devices (AMD)

Solution(s): AMD Instinct GPU Powered Machine Learning Solutions

  • Leveraging AMD EPYC processors and AMD’s Instinct GPUs
  • Presized storage based on channel-driven server models from various vendors, including Dell, Gigabyte, HPE, and Supermicro
  • Various networking solutions, depending upon whose channel is relied upon
  • AMD’s ROCm Open Software Platform
  • AMD’s Infinity Architecture

AMD has partnered with many server manufacturers to build single-SKU architectures based on AMD EPYC processors and AMD Instinct MI100 accelerators. These Integrated Infrastructures are sold through partner channel resellers, and they show how AMD, historically an alternative to Intel in the processor market, has also become a GPU alternative to NVIDIA. Competition is healthy for the marketplace, and it drives innovation. AMD solutions expand the options on the GPU side and are certainly worth considering as enterprises decide how to build out their AI/ML/DL architectures. Each partner platform has its own server-based storage elements, which help to differentiate it.

Intel

Solution(s): Intel Third Generation Xeon Scalable Platform

  • Combination of Intel Xeon Scalable processors and Intel Core processors, along with Intel Movidius Vision Processing Units
  • Server-based storage provided by Intel’s 3D NAND SSDs, Optane drives, and Intel’s platform of solid-state architectures
  • Fully designed for TensorFlow, PyTorch, and MXNet, with container support and a full approach to application/hardware integration

Intel has integrated all components of the architecture, from processor to GPU to networking and even its own solid-state disk. The value is significant. Integration in this case is built in, as all components, storage included, come from the same manufacturer. These are the definition of Integrated Infrastructure Systems, as the end-to-end stack is provided by Intel, a feat that no other vendor can achieve today.

Intel has long been a trusted name in compute, networking, storage, and GPUs. The customer looking for a full-spectrum solution, with centralized management and a support system from “Soup to Nuts,” will want to look deeply at these solutions.

For IT Buyers

Ease of Purchase, Deployment, Management and Scaling

The success of an AI project relies on a consistent hardware platform, one that leaves far fewer questions about where the fault lies when a failure occurs.

As mentioned previously, AI projects have, up until today’s developments in Integrated Infrastructure, been unique, complex, and difficult. These builds, as they’ve matured, have eased the transition from experimental design to a reliable, easily deployable, supportable, manageable, and consistent architecture. Such an architecture should have solid performance, meet the requirements of the business, and scale by a proven methodology as the needs of the business grow.

The advent of single-SKU builds, rather than a build-by-item approach, has made these builds far easier to purchase, deploy, and support. In addition, scaling these models in an orderly and consistent manner is handled by recipes for each of the architectures.

For IT Suppliers

Just as the hardware platforms have matured and architectures geared toward the growing AI market have become available, the customer base is maturing too. Those with an interest in a purpose-built integrated infrastructure have likely tried pilot AI projects and have a frame of reference for what has worked and what has not. These customers, for the most part, are educated about the variables, metrics, and performance categories they will require. Their experiences have brought them to educated perspectives, with a clear understanding of what their needs will be. It is incumbent on the vendor or reseller to provide solutions that meet or exceed those requirements. A clear path to maintenance and support, as well as to hardware and software patching, is a clear benefit of a fleshed-out design. The educated and experienced customer today is looking for answers to these questions. The proverbial “Single Throat to Choke” becomes a relevant conversation. The value of being able to resolve problems quickly and reliably, with as little effort as possible on the part of the customer, is not to be minimized.
