MLPerf Inference v0.7 results
10/21/20: MLPerf Releases Over 1,200 results for leading ML inference systems and new Mobile MLPerf app
Mountain View, CA - October 21, 2020 - Today the MLPerf consortium released results for MLPerf Inference v0.7, the second round of submissions to their machine learning inference performance benchmark suite that measures how quickly a trained neural network can process new data for a wide range of applications on a variety of form factors.
MLPerf Inference v0.7 is an exciting milestone for the ML community. The second benchmark round more than doubles the number of applications in the suite and introduces a new dedicated set of MLPerf Mobile benchmarks along with a publicly available smartphone application. The Inference v0.7 benchmark suite has been incredibly popular with 23 submitting organizations and over 1,200 peer-reviewed results - twice as many as the first round - for systems ranging from smartphones to data center servers. Additionally, this round introduces randomized third-party audits for rules compliance. To see the results, go to mlperf.org/inference-results-0-7.
The MLPerf Inference v0.7 suite includes four new benchmarks for data center and edge systems:
BERT: Bi-directional Encoder Representation from Transformers (BERT) fine-tuned for question answering using the SQuAD 1.1 data set. Given a question input, the BERT language model predicts and generates an answer. This task is representative of a broad class of natural language processing workloads.
DLRM: Deep Learning Recommendation Model (DLRM) is a personalization and recommendation model that is trained to optimize click-through rates (CTR). Common examples include recommendation for online shopping, search results, and social media content ranking.
3D U-Net: The 3D U-Net architecture is trained on the BraTS 2019 dataset for brain tumor segmentation. The network identifies whether each voxel within a 3D MRI scan belongs to a healthy tissue or a particular brain abnormality (i.e. GD-enhancing tumor, peritumoral edema, necrotic and non-enhancing tumor core), and is representative of many medical imaging tasks.
RNN-T: Recurrent Neural Network Transducer is an automatic speech recognition (ASR) model that is trained on a subset of LibriSpeech. Given a sequence of speech input, it predicts the corresponding text. RNN-T is representative of widely used speech-to-text systems.
MLPerf Mobile - A New Open and Community-driven Industry Standard
The second inference round also introduces MLPerf Mobile, the first open and transparent set of benchmarks for mobile machine learning. MLPerf Mobile targets client systems with well-defined and relatively homogeneous form factors and characteristics such as smartphones, tablets, and notebooks. The MLPerf Mobile working group, led by Arm, Google, Intel, MediaTek, Qualcomm Technologies, and Samsung Electronics, selected four new neural networks for benchmarking and developed a smartphone application. The four new benchmarks are available in the TensorFlow, TensorFlow Lite, and ONNX formats, and include:
MobileNetEdgeTPU: This is an image classification benchmark; image classification is considered the most ubiquitous task in computer vision. This model deploys the MobileNetEdgeTPU feature extractor which is optimized with neural architecture search to have low latency and high accuracy when deployed on mobile AI accelerators. This model classifies input images with 224 x 224 resolution into 1000 different categories.
SSD-MobileNetV2: Single Shot multibox Detection (SSD) with a MobileNetV2 feature extractor is an object detection model trained to detect 80 different object categories in input frames with 300 x 300 resolution. This network is commonly used to identify and track people/objects for photography and live videos.
DeepLabv3+ MobileNetV2: This is an image semantic segmentation benchmark. This model is a convolutional neural network that deploys MobileNetV2 as the feature extractor, and uses the DeepLabv3+ decoder for pixel-level labeling of 31 different classes in input frames with 512 x 512 resolution. This task can be deployed for scene understanding and many computational photography applications.
MobileBERT: The MobileBERT model is a mobile-optimized variant of the larger BERT model that is fine-tuned for question answering using the SQuAD 1.1 data set. Given a question input, the MobileBERT language model predicts and generates an answer. This task is representative of a broad class of natural language processing workloads.
“The MLPerf Mobile app is extremely flexible and can work on a wide variety of smartphone platforms, using different computational resources such as CPU, GPUs, DSPs, and dedicated accelerators,” stated Prof. Vijay Janapa Reddi from Harvard University and Chair of the MLPerf Mobile working group. The app comes with built-in support for TensorFlow Lite, providing CPU, GPU, and NNAPI (on Android) inference backends, and also supports alternative inference engines through vendor-specific SDKs. The MLPerf Mobile application will be available for download on multiple operating systems in the near future, so that consumers across the world can measure the performance of their own smartphones.
Additional information about the Inference v0.7 benchmarks will be available at mlperf.org/inference-overview.
MLPerf Training v0.7 results
7/29/20: MLPerf Releases Results for Leading ML Training Systems
Mountain View, CA - July 29, 2020 - Today the MLPerf consortium released results for MLPerf Training v0.7, the third round of results from their machine learning training performance benchmark suite. MLPerf is a consortium of over 70 companies and researchers from leading universities, and the MLPerf benchmark suites are the industry standard for measuring machine learning performance.
The MLPerf benchmark shows substantial industry progress and growing diversity, including multiple new processors, accelerators, and software frameworks. Compared to the prior submission round, the fastest results on the five unchanged benchmarks improved by an average of 2.7x, showing substantial improvement in hardware, software, and system scale. This latest training round encompasses 138 results on a wide variety of systems from nine submitting organizations. The Closed division results all use the same model/optimizer(s), while Open division results may use more varied approaches; the results include commercially Available systems, upcoming Preview systems, and RDI systems under research, development, or being used internally. To see the results, go to mlperf.org/training-results-0-7.
The MLPerf Training benchmark suite measures the time it takes to train one of eight machine learning models to a standard quality target in tasks including image classification, recommendation, translation, and playing Go.
This version of MLPerf includes two new benchmarks and one substantially revised benchmark as follows:
BERT: Bi-directional Encoder Representation from Transformers (BERT) trained with Wikipedia is a leading edge language model that is used extensively in natural language processing tasks. Given a text input, language models predict related words and are employed as a building block for translation, search, text understanding, answering questions, and generating text.
DLRM: Deep Learning Recommendation Model (DLRM) trained with Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) dataset is representative of a wide variety of commercial applications that touch the lives of nearly every individual on the planet. Common examples include recommendation for online shopping, search results, and social media content ranking.
Mini-Go: Reinforcement learning similar to Mini-Go from v0.5 and v0.6, but uses a full-size 19x19 Go board, which is more reflective of research.
MLPerf is committed to providing benchmarks that reflect the needs of machine learning customers, and is pioneering customer advisory boards to steer future benchmark construction. DLRM is the first benchmark produced using this process. The benchmark was developed based on expertise from a board consisting of academics and industry researchers with extensive recommendation expertise. “The DLRM-Terabyte recommendation benchmark is representative of industry use cases and captures important characteristics of model architectures and user-item interactions in recommendation data sets,” stated Carole-Jean Wu, MLPerf Recommendation Benchmark Advisory Board Chair from Facebook AI. Criteo AI Lab’s Terabyte CTR dataset is the largest open recommendation dataset, containing terabyte-sized click logs of four billion user and item interactions over 24 days. “We are very excited about the partnership with MLPerf to form this new Recommendation Benchmark,” stated Flavian Vasile, Principal Researcher from Criteo AI Lab.
Additional information about the Training v0.7 benchmarks will be available at mlperf.org/training-overview.
MLPerf Inference v0.5 results
11/6/19: MLPerf Releases Over 500 Inference Benchmark Results, Showcasing a Wide Range of Machine Learning Solutions
Mountain View, CA – November 6, 2019 – After introducing the first industry-standard inference benchmarks in June of 2019, today the MLPerf consortium released 595 inference benchmark results from 14 organizations. These benchmarks measure how quickly a trained neural network can process new data for a wide range of applications (autonomous driving, natural language processing, and many more) on a variety of form factors (IoT devices, smartphones, PCs, servers and a variety of cloud solutions). The results of the benchmarks are available on the MLPerf website at mlperf.org/inference-results-0-5.
“All released results have been validated by the audits we conducted,” stated Guenther Schmuelling, MLPerf Inference Results Chair from Microsoft. “We were very impressed with the quality of the results. This is an amazing number of submissions in such a short time since we released these benchmarks this summer. It shows that inference is a growing and important application area, and we expect many more submissions in the months ahead.”
“Companies are embracing these benchmark tests to provide their customers with an objective way to measure and compare the performance of their machine learning solutions,” stated Carole-Jean Wu, Inference Co-chair from Facebook. “There are many cost-performance tradeoffs involved in inference applications. These results will be invaluable for companies evaluating different solutions.”
Of the 595 benchmark results released today, 166 are in the Closed Division intended for direct comparison of systems. The results span 30 different systems. The benchmarks show a 4-order-of-magnitude difference in performance and a 3-order-of-magnitude range in estimated power consumption, and the systems range from embedded devices and smartphones to large-scale data center systems. The remaining 429 results are in the Open Division and show a more diverse range of models, including low precision implementations and alternative models.
Companies in China, Israel, Korea, the United Kingdom, and the United States submitted benchmark results. These companies include: Alibaba, Centaur Technology, Dell EMC, dividiti, FuriosaAI, Google, Habana Labs, Hailo, Inspur, Intel, NVIDIA, Polytechnic University of Milan, Qualcomm Technologies, and Tencent.
“As an all-volunteer open-source organization, we want to encourage participation from anyone developing an inference product, even in the research and development stage,” stated Christine Cheng, Inference Co-chair. “You are welcome to join our forum, join working groups, attend meetings, and raise any issues you find.”
According to David Kanter, Inference and Power Measurement Co-chair, “We are very excited about our roadmap. Future versions of MLPerf will include additional benchmarks such as speech-to-text and recommendation, and additional metrics such as power consumption.”
“MLPerf is also developing a smartphone app that runs inference benchmarks for use with future versions. We are actively soliciting help from all our members and the broader community to make MLPerf better,” stated Vijay Janapa Reddi, Associate Professor, Harvard University, and MLPerf Inference Co-chair.
Additional information about these benchmarks is available at https://mlperf.org/inference-overview/. The MLPerf Inference Benchmark whitepaper is available at https://arxiv.org/abs/1911.02549. The MLPerf Training Benchmark whitepaper is available at https://arxiv.org/abs/1910.01500.
MLPerf Training v0.6 results
7/10/19: MLPerf releases Training results showing industry progress
Mountain View, CA - July 10, 2019 - Today the MLPerf effort released results for MLPerf Training v0.6, the second round of results from their machine learning training performance benchmark suite. MLPerf is a consortium of over 40 companies and researchers from leading universities, and the MLPerf benchmark suites are rapidly becoming the industry standard for measuring machine learning performance. The MLPerf Training benchmark suite measures the time it takes to train one of six machine learning models to a standard quality target in tasks including image classification, object detection, translation, and playing Go. To see the results, go to mlperf.org/training-results-0-6.
The first version of MLPerf Training was v0.5; this release, v0.6, improves on the first round in several ways. According to the MLPerf Training Special Topics Chairperson Paulius Micikevicius, “these changes demonstrate MLPerf’s commitment to its benchmarks’ representing the current industry and research state.” The improvements include:
- Raises quality targets for image classification (ResNet) to 75.9%, light-weight object detection (SSD) to 23% MAP, and recurrent translation (GNMT) to 24 Sacre BLEU. These changes better align the quality targets with state of the art for these models and datasets.
- Allows use of the LARS optimizer for ResNet, enabling additional scaling.
- Experimentally allows a slightly larger set of hyperparameters to be tuned, enabling faster performance and some additional scaling.
- Changes timing to start the first time the application accesses the training dataset, thereby excluding startup overhead. This change was made because the large scale systems measured are typically used with much larger datasets than those in MLPerf, and hence normally amortize the startup overhead over much greater training time.
- Improves the MiniGo benchmark in two ways. First, it now uses a standard C++ engine for the non-ML compute, which is substantially faster than the prior Python engine. Second, it now assesses quality by comparing to a known-good checkpoint, which is more reliable than the previous very small set of game data.
- Suspends the Recommendation benchmark while a larger dataset and model are being created.
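The timing change in the list above (the clock starts on the first access to the training dataset, so startup overhead is excluded) can be sketched as follows. This is only an illustration under stated assumptions, not MLPerf reference code: `TimedDataset`, `train_to_target`, and the injectable `clock` parameter are hypothetical names introduced here.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


class TimedDataset:
    """Hypothetical wrapper: starts a clock the first time the data is accessed."""

    def __init__(self, samples: Iterable,
                 clock: Callable[[], float] = time.perf_counter):
        self._samples = samples
        self._clock = clock
        self.start_time: Optional[float] = None  # unset until first access

    def __iter__(self) -> Iterator:
        # Only the first access starts the timer; later epochs reuse it.
        if self.start_time is None:
            self.start_time = self._clock()
        return iter(self._samples)

    def elapsed(self) -> float:
        """Seconds since the first dataset access (0.0 if never accessed)."""
        if self.start_time is None:
            return 0.0
        return self._clock() - self.start_time


def train_to_target(dataset: TimedDataset,
                    train_epoch: Callable[[TimedDataset], None],
                    evaluate: Callable[[], float],
                    target: float,
                    max_epochs: int = 100):
    """Train until evaluate() reaches target; return (epochs, elapsed seconds)."""
    for epoch in range(1, max_epochs + 1):
        train_epoch(dataset)
        if evaluate() >= target:
            return epoch, dataset.elapsed()
    raise RuntimeError("quality target not reached within max_epochs")
```

Any work done before the first iteration over the dataset (model construction, compilation, and similar startup steps) falls outside the measured interval, which is the effect the rule change describes.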
Submissions showed substantial technological progress over v0.5. Many benchmarks featured submissions at higher scales than v0.5. Benchmark results on the same system show substantial performance improvements over v0.5, even after the impact of the rules changes is factored out. (The higher quality targets lead to higher times on ResNet, SSD, and GNMT. The change to overhead timing leads to lower times especially on larger systems. The improved engine and different quality target make MiniGo times substantially different.) “The rapid improvement in MLPerf results shows how effective benchmarking can be in accelerating innovation,” said Victor Bittorf, MLPerf Submitters Working Group Chairperson.
MLPerf Training v0.6 showed increased support for the benchmark and greater interest from submitters. MLPerf Training v0.6 received sixty-three entries, up more than 30%. Submissions came from five submitters, up from three in the previous round. Submissions included the first submission to the “Open Division,” which allows the model to be further optimized or a different model to be used (though the same model was used in the v0.6 submission) as a means of showcasing more potential performance innovations through software changes. The MLPerf effort now has over 40 supporting companies, and recently released a complementary inference benchmark suite.
“We are creating a common yardstick for training and inference performance. We invite everyone to become involved by going to mlperf.org or emailing email@example.com,” said Peter Mattson, MLPerf General Chair.
MLPerf Inference launched
6/24/19: New Machine Learning Inference Benchmarks Assess Performance Across a Wide Range of AI Applications
Mountain View, CA - June 24, 2019 - Today a consortium involving more than 40 leading companies and university researchers introduced MLPerf Inference v0.5, the first industry standard machine learning benchmark suite for measuring system performance and power efficiency. The benchmark suite covers models applicable to a wide range of applications including autonomous driving and natural language processing, on a variety of form factors, including smartphones, PCs, edge servers, and cloud computing platforms in the data center. MLPerf Inference v0.5 uses a combination of carefully selected models and data sets to ensure that the results are relevant to real-world applications. It will stimulate innovation within the academic and research communities and push the state-of-the-art forward.
By measuring inference, this benchmark suite will give valuable information on how quickly a trained neural network can process new data to provide useful insights. Previously, MLPerf released the companion Training v0.5 benchmark suite leading to 29 different results measuring the performance of cutting-edge systems for training deep neural networks.
MLPerf Inference v0.5 consists of five benchmarks, focused on three common ML tasks:
- Image Classification - predicting a “label” for a given image from the ImageNet dataset, such as identifying items in a photo.
- Object Detection - picking out an object using a bounding box within an image from the MS-COCO dataset, commonly used in robotics, automation, and automotive.
- Machine Translation - translating sentences between English and German using the WMT English-German benchmark, similar to auto-translate features in widely used chat and email applications.
MLPerf provides benchmark reference implementations that define the problem, model, and quality target, and provide instructions to run the code. The reference implementations are available in ONNX, PyTorch, and TensorFlow frameworks. The MLPerf inference benchmark working group follows an “agile” benchmarking methodology: launching early, involving a broad and open community, and iterating rapidly. The mlperf.org website provides a complete specification with guidelines on the reference code and will track future results.
The inference benchmarks were created thanks to the contributions and leadership of our members over the last 11 months, including representatives from: Arm, Cadence, Centaur Technology, Dividiti, Facebook, General Motors, Google, Habana Labs, Harvard University, Intel, MediaTek, Microsoft, Myrtle, Nvidia, Real World Insights, University of Illinois at Urbana-Champaign, University of Toronto, and Xilinx.
The General Chair Peter Mattson and Inference Working Group Co-Chairs Christine Cheng, David Kanter, Vijay Janapa Reddi, and Carole-Jean Wu make the following statement:
“The new MLPerf inference benchmarks will accelerate the development of hardware and software to unlock the full potential of ML applications. They will also stimulate innovation within the academic and research communities. By creating common and relevant metrics to assess new machine learning software frameworks, hardware accelerators, and cloud and edge computing platforms in real-life situations, these benchmarks will establish a level playing field that even the smallest companies can use.”
Now that the new benchmark suite has been released, organizations can submit results that demonstrate the benefits of their ML systems on these benchmarks. Interested organizations should contact firstname.lastname@example.org.
MLPerf Training v0.5 results
12/12/18: MLPerf Results Compare Top ML Hardware, Aim to Spur Innovation
Today, the researchers and engineers behind the MLPerf benchmark suite released their first round of results. The results measure the speed of major machine learning (ML) hardware platforms, including Google TPUs, Intel CPUs, and NVIDIA GPUs. The results also offer insight into the speed of ML software frameworks such as TensorFlow, PyTorch, and MXNet. The MLPerf results are intended to help decision makers assess existing offerings and focus future development. To see the results, go to mlperf.org/training-results-0-5.
Historically, technological competition with a clear metric has resulted in rapid progress. Examples include the space race that led to people walking on the moon within two decades, the SPEC benchmark that helped drive CPU performance by 1.6X/year for the next 15 years, and the DARPA Grand Challenge that helped make self-driving cars a reality. MLPerf aims to bring this same rapid progress to ML system performance. Given that large scale ML experiments still take days or weeks, improving ML system performance is critical to unlocking the potential of ML.
MLPerf was launched in May by a small group of researchers and engineers, and it has since grown rapidly. MLPerf is now supported by over thirty major companies and startups including hardware vendors such as Intel and NVIDIA (NASDAQ: NVDA), and internet leaders like Baidu (NASDAQ: BIDU) and Google (NASDAQ: GOOGL). MLPerf is also supported by researchers from seven different universities. Today, Facebook (NASDAQ: FB) and Microsoft (NASDAQ: MSFT) are announcing their support for MLPerf.
Benchmarks like MLPerf are important to the entire industry:
- “We are glad to see MLPerf grow from just a concept to a major consortium supported by a wide variety of companies and academic institutions. The results released today will set a new precedent for the industry to improve upon to drive advances in AI,” reports Haifeng Wang, Senior Vice President of Baidu who oversees the AI Group.
- “Open standards such as MLPerf and Open Neural Network Exchange (ONNX) are key to driving innovation and collaboration in machine learning across the industry,” said Bill Jia, VP, AI Infrastructure at Facebook. “We look forward to participating in MLPerf with its charter to standardize benchmarks.”
- “MLPerf can help people choose the right ML infrastructure for their applications. As machine learning continues to become more and more central to their business, enterprises are turning to the cloud for the high performance and low cost of training of ML models,” – Urs Hölzle, Senior Vice President of Technical Infrastructure, Google.
- “We believe that an open ecosystem enables AI developers to deliver innovation faster. In addition to existing efforts through ONNX, Microsoft is excited to participate in MLPerf to support an open and standard set of performance benchmarks to drive transparency and innovation in the industry.” – Eric Boyd, CVP of AI Platform, Microsoft
- “MLPerf demonstrates the importance of innovating in scale-up computing as well as at all levels of the computing stack — from hardware architecture to software and optimizations across multiple frameworks.” --Ian Buck, vice president and general manager of Accelerated Computing at NVIDIA
Today’s published results are for the MLPerf training benchmark suite. The training benchmark suite consists of seven benchmarks including image classification, object detection, translation, recommendation, and reinforcement learning. The metric is time required to train a model to a target level of quality. MLPerf timing results are then normalized to unoptimized reference implementations running on a single NVIDIA Pascal P100 GPU. Future MLPerf benchmarks will include inference as well.
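As a rough illustration of the normalization described above, the sketch below assumes the normalized score is the reference implementation's training time divided by the submission's training time. That formula, the function name `normalized_speedup`, and the example times are assumptions for illustration, not published MLPerf figures.

```python
def normalized_speedup(reference_minutes: float, submission_minutes: float) -> float:
    """Return how many times faster a submission is than the reference run.

    Assumes both times measure training the same model to the same
    quality target; the ratio form is an assumption, not an MLPerf rule quote.
    """
    if reference_minutes <= 0 or submission_minutes <= 0:
        raise ValueError("training times must be positive")
    return reference_minutes / submission_minutes


# Hypothetical times, for illustration only.
score = normalized_speedup(reference_minutes=1000.0, submission_minutes=50.0)
print(f"{score:.1f}x faster than the reference")
```

Expressing results as a ratio to a common baseline makes scores comparable across benchmarks whose raw training times differ by orders of magnitude.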
MLPerf categorizes results based on both a division and a given product or platform’s availability. There are two divisions: Closed and Open. Submissions to the Closed division, intended for apples-to-apples comparisons of ML hardware and ML frameworks, must use the same model (e.g. ResNet-50 for image classification) and optimizer. In the Open division, participants can submit any model. Within each division, submissions are classified by availability: in the Cloud, On-premise, Preview, or Research. Preview systems will be available by the next submission round. Research systems either include experimental hardware or software, or are at a scale not yet publicly available.
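The two-level categorization above (division, then availability) can be modeled with a small data structure. This is a hypothetical sketch: the class names, field names, and the `comparable` helper are illustrative and do not come from MLPerf tooling.

```python
from dataclasses import dataclass
from enum import Enum


class Division(Enum):
    CLOSED = "closed"  # same model and optimizer, for like-for-like comparison
    OPEN = "open"      # any model may be submitted


class Availability(Enum):
    CLOUD = "cloud"
    ON_PREMISE = "on-premise"
    PREVIEW = "preview"
    RESEARCH = "research"


@dataclass(frozen=True)
class Submission:
    system: str
    division: Division
    availability: Availability
    model: str


def comparable(a: Submission, b: Submission) -> bool:
    """True when both results are Closed-division runs of the same model."""
    return (a.division is Division.CLOSED
            and b.division is Division.CLOSED
            and a.model == b.model)


# Hypothetical entries, for illustration only.
x = Submission("system-a", Division.CLOSED, Availability.CLOUD, "ResNet-50")
y = Submission("system-b", Division.CLOSED, Availability.PREVIEW, "ResNet-50")
z = Submission("system-c", Division.OPEN, Availability.RESEARCH, "custom-net")
print(comparable(x, y), comparable(x, z))
```

Keeping division and availability as separate fields mirrors the text: availability describes how obtainable a system is, while only the division determines whether results are meant for direct comparison.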
MLPerf is an agile and open benchmark. This is an “alpha” release of the benchmark, and the MLPerf community intends to rapidly iterate. MLPerf welcomes feedback and invites everyone to get involved in the community. To learn more about MLPerf go to mlperf.org or email email@example.com.
MLPerf Training launched
5/2/18: Industry and Academic Leaders Launch New Machine Learning Benchmarks to Propel Innovation
Today, a group of researchers and engineers released MLPerf, a benchmark for measuring the speed of machine learning software and hardware. MLPerf measures speed based on the time it takes to train deep neural networks to perform tasks including recognizing objects, translating languages, and playing the ancient game of Go. The effort is supported by a broad coalition of experts from tech companies and startups including AMD (NASDAQ: AMD), Baidu (NASDAQ: BIDU), Google (NASDAQ: GOOGL), Intel (NASDAQ: INTC), SambaNova, and Wave Computing and researchers from educational institutions including Harvard University, Stanford University, University of California Berkeley, University of Minnesota, and University of Toronto.
The promise of AI has sparked an explosion of work in machine learning. As this sector expands, systems need to evolve rapidly to meet its demands. According to ML pioneer Andrew Ng, “AI is transforming multiple industries, but for it to reach its full potential, we still need faster hardware and software.” With researchers pushing the bounds of computers’ capabilities and system designers beginning to hone machines for machine learning, there is a need for a new generation of benchmarks.
MLPerf aims to accelerate improvements in ML system performance just as the SPEC benchmark helped accelerate improvements in general purpose computing. SPEC was introduced in 1988 by a consortium of computing companies. CPU performance improved 1.6X/year for the next 15 years. MLPerf combines best practices from previous benchmarks including: SPEC’s use of a suite of programs, SORT’s use of one division to enable comparisons and another division to foster innovative ideas, DeepBench’s coverage of software deployed in production, and DAWNBench’s time-to-accuracy metric.
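To put the 1.6X/year figure in perspective, a quick compounding calculation helps. The sketch assumes the rate applies uniformly in each of the 15 years, which is a simplification; the function name is illustrative.

```python
def compound_improvement(rate_per_year: float, years: int) -> float:
    """Cumulative improvement factor from a constant yearly multiplier."""
    return rate_per_year ** years


total = compound_improvement(1.6, 15)
print(f"about {total:,.0f}x cumulative improvement")
```

Under that assumption the cumulative gain is roughly three orders of magnitude (about 1,150x).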
Benchmarks like SPEC and MLPerf catalyze technological improvement by aligning research and development efforts and guiding investment decisions.
- “Good benchmarks enable researchers to compare different ideas quickly, which makes it easier to innovate,” summarizes researcher David Patterson, author of Computer Architecture: A Quantitative Approach.
- According to Gregory Stoner, CTO of Machine Learning, Radeon Technologies Group, AMD: “AMD is at the forefront of building high-performance solutions, and benchmarks such as MLPerf are vital for providing a solid foundation for hardware and system software idea exploration, thereby giving our customers a more robust solution to measure Machine Learning system performance and underscoring the power of the AMD portfolio.”
- “MLPerf is a critical benchmark that showcases how our dataflow processor technology is optimized for ML workload performance,” remarks Chris Nicol, CTO of the startup Wave Computing.
- “AI powers an array of products and services at Baidu. A benchmark like MLPerf allows us to compare platforms and make better datacenter investment decisions,” reports Haifeng Wang, Vice President of Baidu who oversees the AI Group.
Because ML is such a fast moving field, the team is developing MLPerf as an “agile” benchmark: launching early, involving a broad community, and iterating rapidly. The mlperf.org website provides a complete specification with reference code, and will track future results. MLPerf invites hardware vendors and software framework providers to submit results before the July 31st deadline.