MLPerf Inference

Overview

The MLPerf Inference benchmark suite measures how fast a system can perform ML inference using a trained model.

Benchmarks

Each MLPerf Inference benchmark is defined by a model, a dataset, a quality target, and a latency constraint. The following table summarizes the five benchmarks in v0.5 of the suite. The quality and latency targets are currently being finalized and will be posted soon.

Area     | Task                 | Model             | Dataset            | Quality Target | Latency Constraint
Vision   | Image classification | ResNet50-v1.5     | ImageNet (224x224) | TBD            | TBD
Vision   | Image classification | MobileNets-v1 224 | ImageNet (224x224) | TBD            | TBD
Vision   | Object detection     | SSD-ResNet34      | COCO (1200x1200)   | TBD            | TBD
Vision   | Object detection     | SSD-MobileNets-v1 | COCO (300x300)     | TBD            | TBD
Language | Machine translation  | GNMT              | WMT16              | TBD            | TBD

Load Generator

MLPerf Inference benchmarks are executed via a load generator that issues queries to the ML model according to one of several patterns that represent real-world use cases. The “LoadGen” is provided in C++ with Python bindings and is required for all submissions; a minimal usage sketch follows the list of responsibilities below.

The LoadGen is responsible for:

  • Generating the queries.
  • Tracking the latency of queries.
  • Validating the accuracy of the results.
  • Computing final metrics.
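
For illustration, here is a minimal sketch of driving the LoadGen through its Python bindings (the mlperf_loadgen module). It is a sketch, not a submission-ready harness: the constructor and callback signatures shown follow the v0.5-era bindings and may differ in other releases, so check them against the bindings shipped with your LoadGen checkout.

    # Minimal single-stream performance run against a dummy model.
    # Assumes v0.5-era mlperf_loadgen bindings; verify signatures locally.
    import array
    import mlperf_loadgen as lg

    def issue_query(query_samples):
        # Run inference for each sample and report completions to the LoadGen.
        responses = []
        for s in query_samples:
            result = array.array("B", [0])  # stand-in for real model output
            addr, _ = result.buffer_info()
            responses.append(lg.QuerySampleResponse(s.id, addr, len(result)))
        lg.QuerySamplesComplete(responses)

    def flush_queries():
        pass

    def process_latencies(latencies_ns):
        pass  # LoadGen reports per-query latencies in nanoseconds

    def load_samples(indices):
        pass  # load the referenced dataset samples into memory

    def unload_samples(indices):
        pass  # release them

    settings = lg.TestSettings()
    settings.scenario = lg.TestScenario.SingleStream
    settings.mode = lg.TestMode.PerformanceOnly

    sut = lg.ConstructSUT(issue_query, flush_queries, process_latencies)
    qsl = lg.ConstructQSL(1024, 128, load_samples, unload_samples)
    lg.StartTest(sut, qsl, settings)
    lg.DestroyQSL(qsl)
    lg.DestroySUT(sut)

In a real harness, issue_query would dispatch to the model under test, and an accuracy-mode run (lg.TestMode.AccuracyOnly) would additionally log results so the LoadGen can validate them.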

Scenarios and Metrics

To enable representative testing of a wide variety of inference platforms and use cases, MLPerf defines four scenarios, described below. In each scenario, the LoadGen generates inference requests in a particular pattern and measures a specific metric.

  • Single-stream: Evaluates real-world scenarios such as a smartphone user taking a picture. For the test run, the LoadGen sends an initial query and then sends the next query as soon as the previous query has been processed. The metric is the 90th-percentile latency, i.e., the latency within which 90% of queries complete; a short sketch of this computation follows the list.
  • Multi-stream: Evaluates real-world scenarios such as a multi-camera automotive system that detects obstacles. The LoadGen uses multiple test runs to determine the maximum number of streams the system can support while meeting the latency constraint. The metric is the number of streams supported.
  • Server: Evaluates real-world scenarios such as a datacenter server servicing online requests. The LoadGen uses multiple test runs to determine the maximum throughput, in queries per second (QPS), that the system can support while meeting the latency constraint 90% of the time. The metric is QPS.
  • Offline: Evaluates real-world scenarios such as a batch-processing system. For the test run, the LoadGen sends all queries at once. The metric is throughput, measured in samples per second.
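
To make the single-stream metric concrete, the following self-contained Python sketch computes a 90th-percentile latency over a run's per-query latencies using the nearest-rank method. This is illustrative only: the LoadGen implements its own percentile computation, and the sample latencies below are hypothetical.

    import math

    def percentile(latencies_ns, p):
        """Smallest latency l such that at least p% of queries
        finished within l (nearest-rank method)."""
        ordered = sorted(latencies_ns)
        rank = max(1, math.ceil(p / 100.0 * len(ordered)))
        return ordered[rank - 1]

    # Ten hypothetical per-query latencies in nanoseconds.
    latencies_ns = [2_900_000, 3_000_000, 3_000_000, 3_100_000, 3_100_000,
                    3_200_000, 3_300_000, 3_500_000, 3_600_000, 4_200_000]
    print(percentile(latencies_ns, 90))  # -> 3600000, the single-stream metric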

Divisions

MLPerf aims to encourage innovation in software as well as hardware by allowing submitters to reimplement the reference implementations. MLPerf has two Divisions that allow different levels of flexibility during reimplementation. The Closed division is intended to compare hardware platforms or software frameworks “apples-to-apples” and requires using the same model as the reference implementation. The Open division is intended to foster faster models and optimizers and allows using a different model or retraining.

Power Measurement

MLPerf Inference encourages but does not require power measurements for wall-powered and battery-powered systems. If you intend to submit with power measurements, you must join the power working group.

Reference implementations

The reference implementations for the benchmarks are available in the MLPerf inference GitHub repository (https://github.com/mlperf/inference).

How to submit

If you intend to submit results, please read the submission rules carefully and join the inference submitters working group before you start work. In particular, you must notify the chair of the inference submitters working group five weeks ahead of the submission deadline as described in the submission rules.

Results

The results are published on the MLPerf website.

Use results

MLPerf is a trademark. If you use the mark and refer to MLPerf results, you must follow the terms of use. MLPerf reserves the right to determine, at its sole discretion, whether uses of its trademark are appropriate.

If you use MLPerf in a publication, please cite this website or the MLPerf papers (forthcoming).