Custom Hardware Sharpens Edge for Deep Learning Future
Source: Nicole Hemsoth
In an era of commodity hardware and open source software, with generalization in one arena and extraordinary opportunities for customization in the other, one could argue that the playing field for specialized new platforms has already been set. However, in some emerging areas, including wider deployments of deep learning beyond Google, Baidu, Facebook, and others, that settled field could be upended by an increasing focus on configurability, customization, and fine-tuning of both hardware and software for very specific end goals.
There are instances of computing systems being customized from the top down, and to rather great success, even if that success is rooted and only well known in certain circles. Consider, for example, the Anton supercomputer, a beast designed specifically to meet the complex tuning requirements of molecular dynamics workloads. The customization starts with the processor and interconnect and works down through the layers until general purpose machines, even when tweaked, cannot compete on performance, efficiency, and in some cases, scalability.
This renewed push to tune from the top down for specific workloads is also captured by the recent interest in FPGAs as a problem solver for computing at scale and for the Moore’s Law limitations ahead; FPGAs offer a workaround for complex workloads that benefit from fine-grained tweaks designed for one specific job. The next level down from this concept is the custom ASIC: an expensive, development-heavy undertaking that, when it works, works very well, albeit at great cost for a smaller shop trying to refine performance, efficiency, and scale potential.
All of this rears its head in the wake of bits of news that have seeped from Google, which is rumored to be working in partnership with others on custom ASICs, and details from Baidu, which, as we reported last week, is also exploring what might come after GPUs (which it is using expertly at scale now) for tackling deep learning workloads, including its speech recognition engine. While general purpose hardware can work, especially after skillful tuning, there might be a more custom, more purpose-built approach; one that Baidu thinks might be the only way to continue bolstering the efficiency and performance of increasingly complex training runs and their ability to be executed with high performance at scale.
Baidu pointed us to one company that it is working with now to re-imagine how it trains and executes deep learning workloads, Nervana Systems. The company is almost at production stage with its Nervana chip, a custom-designed piece of hardware that pairs the advantages of a tailored ASIC with a software environment that will not be difficult to deploy and that comes with the requisite deep learning libraries and tools Baidu and other shops use regularly. According to Nervana’s co-founder, Amir Khosrowshahi, what they have designed has tested out at 5X to 10X the performance of the upcoming Pascal generation GPUs expected from Nvidia this spring. And if his experiences in the emerging ecosystem around such workloads are any indication, companies that have yet to explore where deep learning might fit will see the light soon.
As we have described in numerous articles, and as deep learning pioneer Yann LeCun described for The Next Platform, the GPU reigns supreme for deep learning at scale. And while it is one thing to read about the potential performance of a new chip or approach, it is another to see how implementations perform and what impact these new devices will have on a still-emerging market. It was not long ago that we described one new chip aimed at deep learning, but the race is on to see what works in real production.
(Another) New Chip for Deep Learning
At this stage of the deep learning game, it is difficult to find anyone with a seasoned background in the algorithms and hardware required to feed the next generation of capabilities. Accordingly, most deep learning startups are staffed with PhDs with broad backgrounds, including research in high performance computing and, of course, a range of specific disciplines that sit at the border of supercomputing simulation and large-scale data analysis. In many ways, Khosrowshahi represents the diversity of backgrounds that make up the new host of deep learning companies. His graduate work at Berkeley was focused on neuroscience and machine learning, where he developed a fondness for hardware, which facilitated his move to Qualcomm to work on neuromorphic devices. This was all after his time as a VP at Goldman Sachs working with fixed income derivatives trading, capital markets, and structured finance.
It was during Khosrowshahi’s tenure at Berkeley matching brain data to machine learning algorithms that much of the new wave of pioneering work in deep learning began forming. Interestingly, as he tells The Next Platform, this all happened so fast that the work he did on machine learning and the various codified algorithms there were quickly eclipsed by new advances in deep learning. Pioneers, including Geoffrey Hinton, Andrew Ng, and others, frequented the lab he worked in, and it didn’t take long before he and fellow researchers began to see light beyond machine learning—and the need for hardware that could target those specific problems.
The interesting thing was that, at that time, GPU computing was still emerging, even in high performance computing, at least for a larger breadth of applications. By the time Khosrowshahi left Berkeley for Qualcomm, GPUs were outfitted on some of the world’s fastest supercomputers (with more being added across a larger pool of big systems) and had established a firm foothold in some industrial applications. The real development, at least for GPUs and deep learning, however, was happening at Google, Facebook, and other hyperscale centers, where it was quickly realized that GPUs’ strength in heavy matrix multiplication and their lower power consumption relative to standard CPUs made them the only way to do deep learning training efficiently at scale. The other side of that workload, running the trained models, also required greater efficiency. But as Khosrowshahi argues, even though GPUs have been the dramatic success story for deep learning, they were not designed for that purpose, and he contends that even the upcoming Pascal generation of GPUs expected this April is not as well tuned for the task as a custom approach.
Deep learning workloads are also specific in their requirements and do not map directly onto any single off-the-shelf processor or system: they do not need floating point, are 99 percent based on linear algebra, and do not need the sophistication high-end Xeon or other processors deliver. Creating a custom system therefore made sense. Khosrowshahi packed up two of his fellow neuroscience researchers and made the move to build such a processor, all the while backing the idea with a mission to deliver tailored deep learning as a cloud-based service.
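To make the "99 percent linear algebra" point concrete, here is a minimal sketch in NumPy (our own illustration, not Nervana's code) of a two-layer network's forward pass. Nearly everything the hardware has to do is a dense matrix multiply plus a cheap elementwise operation, which is exactly the work a stripped-down chip can target.

```python
# Minimal sketch: the forward pass of a small two-layer network.
# Almost all of the arithmetic is dense matrix multiplication.
import numpy as np

def forward(x, w1, b1, w2, b2):
    h = np.maximum(x @ w1 + b1, 0.0)   # matmul, then elementwise ReLU
    return h @ w2 + b2                 # another matmul

rng = np.random.default_rng(0)
x  = rng.standard_normal((64, 784))    # a batch of 64 inputs
w1 = rng.standard_normal((784, 256)); b1 = np.zeros(256)
w2 = rng.standard_normal((256, 10));  b2 = np.zeros(10)

print(forward(x, w1, b1, w2, b2).shape)  # (64, 10)
```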
“GPUs and CPUs emphasize floating point performance, which is something deep learning doesn’t need. Further, it’s computationally expensive. If you do integer math instead, there are area and power costs that go up too and as a cloud service, that is something we need to avoid since most of our operational costs are in power. What we chose then is a processor that doesn’t use floating point, which gives us the option to jam more compute in and save on power,” he notes. As Baidu noted of its requirements, the consideration of a custom ASIC was facilitated by an interest in “low precision hardware.” According to Khosrowshahi, the architecture exploits limited precision where it can (in communication, for example), but “there are also places where we use much more precision than the 32-bit floating point typical in GPU implementations, for example, where very large reductions are required. We also have freedom to distribute linear algebra in different ways to get different degrees of accuracy as required by the algorithms. We heavily instrument operations at a hardware level to actively manage precision. The downside is quite a bit of system software complexity, which we try to keep hidden from the user.”
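The pattern Khosrowshahi describes, multiplying in limited precision while keeping large reductions in something wider, can be sketched roughly as follows. This is a generic low-precision recipe in NumPy, not Nervana's hardware or software, and the 8-bit operand / 32-bit accumulator split is our assumption for illustration.

```python
# Hedged sketch of low-precision matrix multiplication: quantize the
# operands to signed 8-bit integers, but run the big reduction over the
# inner dimension in 32-bit integers so it cannot overflow.
import numpy as np

def quantize(x, bits=8):
    """Map a float array onto signed integers with a per-array scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale).astype(np.int8), scale

def low_precision_matmul(a, b):
    qa, sa = quantize(a)
    qb, sb = quantize(b)
    # The reduction is where extra precision matters, so accumulate in int32.
    acc = qa.astype(np.int32) @ qb.astype(np.int32)
    return acc * (sa * sb)             # rescale back to an approximate float result

a = np.random.randn(128, 256).astype(np.float32)
b = np.random.randn(256, 64).astype(np.float32)
approx, exact = low_precision_matmul(a, b), a @ b
print(np.abs(approx - exact).max())    # worst-case quantization error
```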
The Nervana chip architecture is designed to be scalable, with high compute and high throughput between processors. “We can arrange the chips in a 3D torus if we want to, but that’s not optimal, so there is the ability to use different interconnect geometries for different workloads, whether it’s a four-dimensional cylinder or another type. The goal is to allow users to distribute single matrix multiply operations across many processors and change the current field, where you’re forced into doing impoverished forms of parallelism to do deep learning. That bottleneck, as we’ve seen it in the GPU architecture and interconnect, is around doing distributed operations across multiple GPUs, which we have solved.”
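To show what distributing a single matrix multiply across many processors can look like in principle, here is a toy NumPy sketch that partitions the shared dimension across simulated "devices" and sums the partial products. On real hardware that final sum would travel over the interconnect Khosrowshahi describes; nothing here reflects Nervana's actual scheme.

```python
# Toy illustration: one matrix multiply split across several simulated
# devices by partitioning the inner dimension; each device produces a
# partial product, and the partial products are summed at the end.
import numpy as np

def distributed_matmul(a, b, num_devices=4):
    a_chunks = np.array_split(a, num_devices, axis=1)   # split columns of a
    b_chunks = np.array_split(b, num_devices, axis=0)   # split rows of b
    partials = [ac @ bc for ac, bc in zip(a_chunks, b_chunks)]  # per-device work
    return sum(partials)               # the reduction a real interconnect would carry

a = np.random.randn(256, 512)
b = np.random.randn(512, 128)
print(np.allclose(distributed_matmul(a, b), a @ b))  # True
```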
Before we get too far ahead of ourselves, however, a couple of notes are required. First, the company Khosrowshahi co-founded two years ago has so far supported itself by running deep learning and related machine learning workloads on its own cloud service, wherein users store their data with Amazon, Azure, or SoftLayer and Nervana handles the processing. They also offer an appliance option (although he says this is not desirable) for customers in banking and other areas that are less comfortable wrangling their data through public cloud systems. The cluster at the heart of this, which Khosrowshahi was reluctant to describe in detail beyond saying they use systems with four to eight GPUs apiece, then handles these workloads as a service, complete with the expertise needed when companies do not have deep learning or machine learning experts on hand.
And while the Nervana chip, the custom ASIC we are describing as central to how Baidu might see its future, could well be revolutionary, it will not be offered for sale as a standalone processor.
For Khosrowshahi and colleagues, the future of deep learning will happen in the cloud—at least for now. But he sees a future where such a chip will find its way into all the devices we use for very rapid machine learning—if not some descendant of the deep learning algorithms that take supercomputer-class clusters to run at scale and with tremendous accuracy now.
“On the business side, we have to support the current state-of-the-art models, so we are bounded in how much we can push things and experiment,” Khosrowshahi explains. “There are other efforts in addition to Baidu, such as from the Bengio lab and IBM, that use very limited bit-widths for computation. We collaborate with some of these efforts by providing high performance emulation libraries to try things out, as so much of deep learning is empirical. Most of our internal research, though, has been focused on making sure current models can run unmodified on our architecture, a bit more boring.”
Although he says he cannot mention names, Nervana is working via its cloud service with several companies, including a hedge fund that is using the Nervana systems to take its time series analysis to a new level. It also has a customer in the banking sector that is less sophisticated in its use of machine learning and deep learning and is exploring what might be done with its volumes of data. But all of this is a tough business, even for Nervana, despite funding to the tune of nearly $25 million over a few rounds.
That funding number should ring some sort of bell for readers who are wondering how a company that never plans to mass-market a chip can make the financial math work on the non-trivial process of design, development, and production at TSMC. Asking that question of a former financial math expert like Khosrowshahi yielded the insight that it is around $10 million for that entire process, which in the big picture, “really isn’t all that much.”
What that means is that Nervana Systems has a big plan for its cloud-based service, which will run on its own hardware later this year and will, if all goes according to the company’s plan, be the single source available for cloud-based deep learning at scale. “There is risk, there is no question about it; things like this are always risky, but we see the opportunity in the marketplace now, and don’t expect that to diminish. Companies are seeing how deep learning technologies are available as free services on their phones (like speech and image recognition) and are thinking about how this can apply, and companies that are already exploring and gaining expertise in deep learning still lack the infrastructure inside to do what they see as possible.”