Six sects besiege the Bright Summit of cloud AI chips

The AI chip battlefield is getting noticeably livelier. Just last Friday, MLPerf, the authoritative international AI performance benchmark, announced its latest inference results for the data center and edge scenarios. Both the roster of participating companies and the measured AI chip performance improved markedly over the previous round.

Leading the pack, naturally, is the international AI computing giant NVIDIA. For the first time, NVIDIA submitted results for its latest flagship AI accelerator, the H100 Tensor Core GPU unveiled this year, whose AI inference performance is up to 4.5 times that of the previous-generation GPU. Qualcomm, through the latest results for its cloud AI chip Cloud AI 100, showed that it remains highly competitive in energy efficiency.


Chinese AI chip companies are not showing weakness either. Biren Technology and Moxin AI both "joined the war" for the first time and achieved strong results, with performance on some models even surpassing NVIDIA's flagship A100 and H100 AI chips.


Biren Technology submitted results for two models in the data center scenario, ResNet and BERT (99.9% accuracy), covering both Offline and Server modes. On the BERT model, its 8-card machine in Offline mode reached 1.58 times the overall performance of NVIDIA's 8-card A100 system.


Moxin's S30 compute card took first place in single-card ResNet-50 throughput with 95,784 FPS, 1.2 times that of the NVIDIA H100 and twice that of the A100. Also present was South Korea's first AI chip, the Sapeon X220, launched by SK Telecom in November 2020, which demonstrated through the test performance exceeding that of NVIDIA's entry-level A2 AI accelerator card.


However, Google's TPU v4 chip, which showed off its high performance and energy efficiency on the training benchmark list in June this year, was absent from this inference list. Intel and Alibaba, meanwhile, each demonstrated the AI inference performance of systems built solely on their server CPUs.


Overall, the NVIDIA A100 remains an all-rounder that swept most of the test results, while the not-yet-shipping H100 has only begun to show its edge; its improvement in training performance is expected to be even more dramatic. Although the Chinese AI chips were evaluated only on a subset of AI models such as ResNet and BERT, their single-point records already rival NVIDIA's flagship compute products, demonstrating the potential to substitute for leading international products when running specific models.


MLPerf data center inference list: mlcommons

MLPerf edge inference list: mlcommons


H100 debuts as the new king, and NVIDIA still reigns


The MLPerf benchmark is divided into four scenarios by deployment: data center, edge, mobile, and IoT. It covers six of the most representative mainstream AI models: image classification (ResNet-50), natural language processing (BERT), speech recognition (RNN-T), object detection (RetinaNet), medical image segmentation (3D-UNet), and recommendation (DLRM).


Among them, the three tasks of natural language processing, medical image segmentation, and recommendation each carry two accuracy targets, 99% and 99.9% of the reference accuracy, to examine how tighter AI inference accuracy requirements affect computing performance.


To date, NVIDIA is the only company that has run every major algorithm in every round of the MLPerf benchmark. The NVIDIA A100 is still a standout on the latest MLPerf AI inference list, with performance among the best across multiple models. Its successor, the H100, made its MLPerf debut, breaking several world records and delivering up to 4.5 times the performance of the A100.


The NVIDIA H100 delivers up to 4.5 times the performance of the A100 (Source: NVIDIA)

NVIDIA submitted two single-chip H100 GPU systems, one with an AMD EPYC CPU as the host processor and the other with an Intel Xeon CPU. Although the H100 GPU, built on NVIDIA's latest Hopper architecture, appeared only as a single chip this time, its performance in many cases exceeded that of systems with two, four, or even eight A100 chips.

The NVIDIA H100 sets new performance records across all workloads in the data center scenario (Source: NVIDIA)

On BERT-Large in particular, the larger and more performance-hungry natural language processing model, the H100 outperformed both the A100 and Biren Technology's GPU by a wide margin, thanks mainly to its Transformer Engine. The H100 GPU is expected to ship at the end of this year and will appear in future MLPerf training benchmarks.


On the edge computing side, NVIDIA Orin, which integrates an Ampere-architecture GPU and Arm CPU cores on a single chip, ran every MLPerf benchmark and won more tests than any other low-power SoC. Notably, the edge AI inference energy efficiency of the Orin chip improved by a further 50% over its MLPerf debut results from April this year.

In terms of energy efficiency, Orin's edge AI inference performance per watt improved by up to 50% (Source: NVIDIA)

NVIDIA's submissions across successive MLPerf rounds show that the performance gains from AI software are becoming ever more significant. Since its MLPerf debut in July 2020, the A100 has seen a 6x increase in performance thanks to continuous improvements in NVIDIA AI software.


Currently, NVIDIA AI is the only platform able to run all MLPerf inference workloads and scenarios in both the data center and edge computing. Through joint software-hardware optimization, NVIDIA GPUs deliver leading AI inference acceleration in both settings.


Biren Technology's general-purpose GPU joins the battle: ResNet and BERT performance surpasses the A100


The BR104, the general-purpose GPU chip Biren Technology released in August this year, made its public debut at MLPerf. The MLPerf inference list is split into two divisions: Closed (fixed tasks) and Open (open optimization). The fixed tasks mainly test a vendor's hardware system and software optimization capability, while the open division focuses on a vendor's AI technology innovation.


This time, Biren Technology took part in the fixed-task evaluation for the data center scenario. The system under test was an Inspur NF5468M6 server fitted with eight Biren 104-300W boards, each carrying a built-in BR104 chip. Biren submitted results for the ResNet and BERT 99.9%-accuracy models, covering both Offline and Server modes.

Offline mode corresponds to cases where the data is already available locally; for models such as ResNet-50 and BERT, Offline mode matters more. In Server mode the data arrives online in real time, delivered in bursty, intermittent fashion; for a model such as DLRM, Server mode is the one that counts.
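The following is a minimal, purely illustrative Python sketch of the two query patterns. It does not use the real MLPerf LoadGen API; `model`, `run_offline`, and `run_server` are hypothetical placeholders. Offline issues the whole dataset at once and measures raw throughput, while Server emits single queries with random (Poisson-style) arrival gaps and checks a latency bound.

```python
import random
import time

def run_offline(model, samples):
    # Offline scenario: the whole dataset is available up front;
    # issue everything as one large batch and measure raw throughput.
    start = time.perf_counter()
    model(samples)                          # one big batch
    elapsed = time.perf_counter() - start
    return len(samples) / elapsed           # samples per second

def run_server(model, samples, target_qps, latency_bound_s):
    # Server scenario: queries arrive one at a time with random
    # (Poisson-style) gaps; each must be answered within the latency bound.
    violations = 0
    for sample in samples:
        time.sleep(random.expovariate(target_qps))   # wait for the next arrival
        start = time.perf_counter()
        model([sample])                      # answer a single query
        if time.perf_counter() - start > latency_bound_s:
            violations += 1
    return violations
```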

Biren Technology reportedly chose only these two model types because they are the most widely used and most important ones for its target customers, the BERT model in particular.

▲ Biren Technology's BR104 led the overall performance in both Offline and Server modes on the BERT model (Source: Biren Technology)

Judging from the results, on the BERT model the system built on eight Biren BR104s delivered 1.58 times the performance of the eight-A100 system submitted by NVIDIA.

The Biren BR104 surpasses the A100 in single-card performance on the ResNet-50 and BERT models

Overall, the performance of Biren Technology's 8-card PCIe solution is estimated to sit between NVIDIA's 8-card A100 and 8-card H100 systems. Besides Biren's own 8-card submission, the well-known server provider Inspur Information also submitted a server fitted with four Biren 104 boards, the first time Inspur has submitted server results based on chips from a Chinese vendor.


Among all 4-card systems, the server Inspur submitted also ranked first worldwide on ResNet-50 (Offline) and BERT (Offline & Server, 99.9% accuracy).

For a fledgling start-up that is launching its first chip, that's a pretty impressive result.


Moxin S30 wins image classification with 95,784 FPS of single-card throughput, surpassing the H100


Moxin AI, another Chinese cloud AI chip company, also took part in MLPerf for the first time and achieved single-card performance on the image classification inference task that surpassed the NVIDIA H100.


In designing its Antoum AI chip, Moxin adopted self-developed dual-sparsity technology to rework the underlying chip architecture, balancing data centers' demands for high performance and a high energy-efficiency ratio. At this year's GTIC 2022 Global AI Chip Summit, Moxin unveiled to the industry its first batch of high-sparsity compute cards for data center AI inference, the S4, S10, and S30, carrying one, two, and three chips respectively.


Moxin competed in the open-optimization division. According to the latest MLPerf inference list, the Moxin S30 compute card ranks first in single-card ResNet-50 throughput at 95,784 FPS, 1.2 times that of the H100 and twice that of the A100.


On the high-accuracy (99.9%) BERT-Large model, the Moxin S30 did not beat the H100 but still achieved twice the performance of the A100, with single-card throughput reaching 3,837 SPS.

Comparison of the Moxin S30 with the A100 and H100 on the ResNet-50 and BERT-Large models (Source: Moxin AI)

It is worth noting that the Moxin S30 is built on a 12nm process, whereas the NVIDIA H100 uses a far more advanced 4nm process. That the S30 can match it on two mainstream data center AI models despite the generational gap in process technology is mainly thanks to Moxin's independently developed sparsification algorithms and architecture.


MLPerf's test requirements are strict: it measures not only each product's raw compute but also sets accuracy targets of 99% and above, examining how stringent inference-accuracy requirements affect computing performance. In other words, participants cannot trade accuracy for extra throughput. This also shows that Moxin's sparse computing keeps the accuracy loss under control.
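As a rough illustration of the accuracy constraint only (this is not Moxin's method, and all numbers are hypothetical), the sketch below checks whether a sparsified model's measured accuracy still clears an MLPerf-style 99% or 99.9% target relative to the reference model.

```python
def meets_accuracy_target(measured_acc: float, reference_acc: float,
                          target_ratio: float = 0.99) -> bool:
    # MLPerf-style check: the submitted model's accuracy must stay within,
    # e.g., 99% (or 99.9%) of the unpruned reference model's accuracy.
    return measured_acc >= target_ratio * reference_acc

# Hypothetical numbers, for illustration only.
reference_acc = 0.7646   # e.g. an FP32 ResNet-50 top-1 reference accuracy
sparse_acc = 0.7601      # accuracy measured after sparsification (made up)

print(meets_accuracy_target(sparse_acc, reference_acc))         # True  -> clears the 99% target
print(meets_accuracy_target(sparse_acc, reference_acc, 0.999))  # False -> misses the 99.9% target
```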


High energy efficiency, the ace of Qualcomm cloud AI chips


Qualcomm's first cloud AI chip, the Cloud AI 100, released back in 2019, continued to compete in MLPerf against a crop of newer AI accelerators. Judging from the results, in energy efficiency for image processing alone, the 7nm Cloud AI 100 can still hold its own against the world.


In the latest results disclosed by MLPerf, Foxconn, Thundercomm, Inventec, Dell, HPE, and Lenovo all submitted tests using Qualcomm's Cloud AI 100 chip, a sign that Qualcomm's AI chips have gained acceptance in the Asian cloud server market.

The Qualcomm Cloud AI 100 comes in two versions, Professional (400 TOPS) and Standard (300 TOPS), both designed around high energy efficiency. In image processing its performance per watt is roughly double that of the NVIDIA Jetson Orin, and in natural language processing (the BERT-99 model) it is slightly more power efficient.

Qualcomm Cloud AI 100 leads in energy-efficiency ratio in the ResNet-50 and BERT-99 model tests (Source: Qualcomm)

While maintaining high energy efficiency, Qualcomm's AI chip does not give up on performance. A five-card server, with each Cloud AI 100 card drawing 75W, achieves nearly 50% higher performance than a two-card A100 server, in which a single A100 card draws as much as 300W.

Performance per watt of Qualcomm Cloud AI 100 (Source: Qualcomm)
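For context, the arithmetic behind such accelerator-only performance-per-watt comparisons is simple. The sketch below uses the per-card power figures cited above, but the throughput numbers are entirely hypothetical, chosen only to mirror the roughly 50% performance gap described in the text.

```python
def perf_per_watt(throughput_fps: float, n_cards: int, watts_per_card: float) -> float:
    # Accelerator-only metric: throughput divided by total card power draw.
    return throughput_fps / (n_cards * watts_per_card)

# Power follows the per-card numbers cited above (75 W and 300 W);
# throughputs are hypothetical, set ~50% apart to match the text's claim.
cloud_ai_5card = perf_per_watt(throughput_fps=150_000, n_cards=5, watts_per_card=75)
a100_2card     = perf_per_watt(throughput_fps=100_000, n_cards=2, watts_per_card=300)

print(f"Cloud AI 100, 5 cards: {cloud_ai_5card:.1f} FPS/W")   # 400.0 FPS/W
print(f"A100, 2 cards:         {a100_2card:.1f} FPS/W")       # 166.7 FPS/W
```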

For edge computing, the Cloud AI 100's energy efficiency in image processing is already very competitive, but large data centers demand more versatility from a chip. If Qualcomm wants to push further into the cloud market, it may need to broaden support for more mainstream AI models, such as recommendation engines, when designing its next generation of cloud and edge AI chips.

Achieving high edge-server energy efficiency without sacrificing performance (Source: Qualcomm)

South Korea's first AI chip debuts against NVIDIA's entry-level AI accelerator card


This MLPerf list also featured South Korean companies, which have so far had a relatively thin presence in AI chips. The Sapeon X220 is an AI chip independently developed by SK Telecom, the well-known South Korean technology company, and the country's first commercial non-memory chip for data centers. It performs the large-scale computing required by AI services at high speed and with low power consumption.


The results are also interesting. Running in Supermicro servers, the Sapeon X220 outperformed the A2 GPU, the entry-level AI accelerator card NVIDIA launched late last year, in the data center inference benchmark: the X220-Compact delivered 2.3 times the performance of the A2, and the X220-Enterprise 4.6 times.


Energy efficiency is equally strong: measured as performance per watt at maximum power consumption, the X220-Compact is 2.2 times as efficient as the A2 and the X220-Enterprise 2.0 times as efficient.

Comparison of performance and energy efficiency between the Sapeon X220 series and NVIDIA A2 (Source: SAPEON)

It is worth mentioning that the NVIDIA A2 uses an advanced 8nm process, while the Sapeon X220 uses a mature 28nm process. Sapeon chips are reported to be in use in smart speakers, intelligent video security, AI-based media quality optimization, and other applications. This year, SK Telecom also spun off its AI chip business into a company named SAPEON.


SAPEON CEO Soojung Ryu said the company plans to expand the X220 into more application areas and is confident that its next-generation chip, the X330, due in the second half of next year, will further improve performance and widen the gap with competing products.


Intel previews its next-generation server CPU; Alibaba's Yitian 710 CPU appears for the first time


For all the contention among cloud AI inference chips, server CPUs remain, for now, the dominant players in the AI inference market. On this MLPerf list, systems built around Intel Xeon CPUs and Alibaba's self-developed Yitian 710 CPU were evaluated without any AI accelerators attached, which reflects these server CPUs' own AI inference capabilities more realistically.

In the fixed-task division, Intel submitted a two-socket system based on a preview version of Sapphire Rapids running PyTorch. Its inference performance was thoroughly beaten by the H100 but was enough to outdo the A2. This is, after all, a server CPU for which AI inference acceleration is a bonus feature, so the Xeon's acceleration capability appears sufficient for routine AI inference tasks.

In the open-optimization division, a start-up called Neural Magic submitted a system equipped only with Intel Xeon CPUs, demonstrating refined pruning-based software that achieves equivalent performance with less computing power.
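As a minimal sketch of the general idea only (not Neural Magic's actual software), the example below prunes a dense weight matrix by magnitude, stores it in a sparse format, and counts how many multiply-accumulates a matrix-vector product then needs; the layer size and 90% sparsity level are assumptions chosen purely for illustration.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# A dense layer's weights; prune 90% of entries by magnitude (assumed sparsity level).
W = rng.standard_normal((4096, 4096)).astype(np.float32)
threshold = np.quantile(np.abs(W), 0.9)
W_pruned = np.where(np.abs(W) < threshold, 0.0, W).astype(np.float32)

# Keep only the surviving weights in compressed sparse row (CSR) form.
W_sparse = sparse.csr_matrix(W_pruned)
x = rng.standard_normal(4096).astype(np.float32)

dense_macs = W.size          # multiply-accumulates needed for the dense product
sparse_macs = W_sparse.nnz   # only the non-zeros are multiplied in the sparse product
print(f"dense MACs:  {dense_macs:,}")
print(f"sparse MACs: {sparse_macs:,} ({sparse_macs / dense_macs:.0%} of dense)")

y = W_sparse @ x             # sparse matrix-vector product on the CPU
```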

Alibaba, for its part, demonstrated for the first time the results of running an entire cluster as a single machine, topping other results in total throughput, and its self-developed Yitian 710 CPU appeared on the MLPerf list for the first time.

In addition, judging from the system configurations of the various MLPerf participants, AMD EPYC server CPUs have a growing presence in data center inference systems and show momentum to keep pace with Intel Xeon.


NVIDIA's standing remains secure, while China's new AI chip forces mount a charge


In general, NVIDIA continues to perform steadily, topping the MLPerf inference benchmarks as the undisputed big winner. Although some of its single-point scores were overtaken by competing products, in terms of versatility the A100 and H100 can still grind other AI chips into the ground. NVIDIA has not yet submitted H100 inference energy-efficiency data or training results; once those appear, the H100 is expected to attract even more attention.


Chinese AI chip companies are also making their mark. After Alibaba T-Head's self-developed cloud AI chip Hanguang 800 topped the MLPerf ResNet-50 inference test in single-card compute in 2019, Biren Technology and Moxin have now likewise demonstrated the measured strength of their AI chips on an authoritative third-party benchmark platform.


The results shown in this open-optimization division make clear that sparse computing has become a hot direction for data center AI inference. As participating institutions, system scales, and system configurations grow in number and diversity, the MLPerf benchmark is becoming more complex, allowing system strength to be compared more precisely and fairly and further verifying its value in real deployments. The results of successive lists also reflect how the technology and industrial landscape of global AI chips is shifting.
