MLPerf Storage Benchmark

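MLPerf Storage Benchmark Overview

MLPerf Storage is a benchmark suite that measures storage system performance specifically for machine learning workloads.

This document describes how to use MLPerf Storage to run end-to-end tests against Alluxio.

Test Results Summary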

Model    Accelerators (GPUs)  Dataset  Accelerator utilization  Throughput (MB/s)  Throughput (samples/s)
bert     1                    1.3 TB   99%                      0.1                49.3
unet3d   1                    719 GB   99%                      409.5              2.9
bert     128                  2.4 TB   98%                      14.8               6217
unet3d   20                   3.8 TB   97%-99%                  7911.4             56.59

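These results were obtained on an Alluxio cluster with the following configuration; all server instance types are available on AWS:

  • Alluxio cluster: one Alluxio Fuse node and two Alluxio Worker nodes.

  • Alluxio Worker instance: i3en.metal: 96 cores + 768 GB memory + 100 Gb network + 8 NVMe SSDs

  • Alluxio Fuse instance: c6in.metal: 128 cores + 256 GB memory + 200 Gb network

Preparing the Test Environment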

Operating system image: Ubuntu 22.04

Preparing the MLPerf Storage Test Tool

sudo apt-get install mpich
git clone -b v0.5 --recurse-submodules https://github.com/mlcommons/storage.git
cd storage
pip3 install -r dlio_benchmark/requirements.txt

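Generating the Dataset

We recommend generating the dataset locally and then uploading it to remote storage. First, determine the size of the data to generate: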

./benchmark.sh datasize --workload unet3d --num-accelerators 4 --host-memory-in-gb 32
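  • workload: the options are unet3d and bert.
  • num-accelerators: the number of simulated GPUs. The larger the number, the more processes run on a single machine. For a dataset of the same size, training takes less time, but the demand on storage I/O increases.
  • host-memory-in-gb: the simulated memory size. It can be set freely and may even exceed the machine's actual memory. The larger the memory, the larger the generated dataset and the longer the training time.

Running this command produces output like the following: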

./benchmark.sh datasize --workload unet3d --num-accelerators 4 --host-memory-in-gb 32
The benchmark will run for approx 11 minutes (best case)
Minimum 1600 files are required, which will consume 218 GB of storage
----------------------------------------------
Set --param dataset.num_files_train=1600 with ./benchmark.sh datagen/run commands
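
When scripting the workflow, the file count reported by datasize can be extracted and passed on to the datagen and run commands. The following is a minimal Python sketch, assuming the output keeps the wording of the sample above; the function name parse_datasize_output is hypothetical and is not part of the MLPerf Storage tooling.

```python
import re


def parse_datasize_output(text: str) -> tuple[int, int]:
    """Return (minimum file count, storage in GB) from `benchmark.sh datasize` output."""
    match = re.search(
        r"Minimum (\d+) files are required, which will consume (\d+) GB of storage",
        text,
    )
    if match is None:
        raise ValueError("datasize summary line not found")
    return int(match.group(1)), int(match.group(2))


# Sample output copied from the run shown above.
sample_output = """\
The benchmark will run for approx 11 minutes (best case)
Minimum 1600 files are required, which will consume 218 GB of storage
"""

num_files, storage_gb = parse_datasize_output(sample_output)
print(f"--param dataset.num_files_train={num_files} (needs about {storage_gb} GB)")
```

If the wording of the output differs in another release of the tool, the regular expression has to be adjusted accordingly.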

Next, generate the corresponding dataset with the following command:

./benchmark.sh datagen --workload unet3d --num-parallel ${num-parallel} --param dataset.num_files_train=1600 --param dataset.data_folder=${dataset.data_folder}

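After generating the dataset locally, upload it to the under file system (UFS).

Configuring Alluxio

We recommend Alluxio 3.1 or later for MLPerf testing. In addition, we recommend the following settings in alluxio-site.properties for the best read performance: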

alluxio.user.position.reader.streaming.async.prefetch.enable=true
alluxio.user.position.reader.streaming.async.prefetch.thread=256
alluxio.user.position.reader.streaming.async.prefetch.part.length=4MB
alluxio.user.position.reader.streaming.async.prefetch.max.part.number=4
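
Before a run, it can help to confirm that the four settings above are actually present in the properties file. The following is a minimal Python sketch of such a check. It is not an Alluxio tool: the helper names are hypothetical, and the parser only handles plain key=value lines with # or ! comments, not the full Java properties syntax.

```python
# Recommended values, copied from the settings listed above.
RECOMMENDED = {
    "alluxio.user.position.reader.streaming.async.prefetch.enable": "true",
    "alluxio.user.position.reader.streaming.async.prefetch.thread": "256",
    "alluxio.user.position.reader.streaming.async.prefetch.part.length": "4MB",
    "alluxio.user.position.reader.streaming.async.prefetch.max.part.number": "4",
}


def parse_properties(text: str) -> dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and '#' or '!' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_or_different(text: str) -> dict[str, str]:
    """Return the recommended settings that are absent or set to another value."""
    props = parse_properties(text)
    return {k: v for k, v in RECOMMENDED.items() if props.get(k) != v}


# Example: a properties file that sets only two of the four values.
example = """\
# client read tuning
alluxio.user.position.reader.streaming.async.prefetch.enable=true
alluxio.user.position.reader.streaming.async.prefetch.thread=64
"""
for key, value in missing_or_different(example).items():
    print(f"expected {key}={value}")
```

In practice the text would be read from the alluxio-site.properties file of the Fuse node; its location depends on the installation and is not assumed here.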

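For other Alluxio-related configuration, see the Fio Tests section.

  • One or more Alluxio Workers can be configured as the cache cluster.
  • In addition, an Alluxio Fuse process must be started on every MLPerf test node to read the data.
  • Make sure the dataset has been fully loaded from the UFS into the Alluxio cache.

Running the Test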

./benchmark.sh run --workload ${workload} --num-accelerators ${num-accelerators} --results-dir ${results-dir} --param dataset.data_folder=${dataset.data_folder} --param dataset.num_files_train=${dataset.num_files_train}

After the test completes, you can find a summary.json file like the following in results-dir:

{
  "model": "unet3d",
  "start": "2024-05-27T14:46:24.458325",
  "num_accelerators": 20,
  "hostname": "ip-172-31-24-47",
  "metric": {
    "train_au_percentage": [
      99.18125818824699,
      99.01649117920554,
      98.95473494676878,
      98.31108303926722,
      98.2658474647346
    ],
    "train_au_mean_percentage": 98.74588296364462,
    "train_au_meet_expectation": "success",
    "train_au_stdev_percentage": 0.38102089124716115,
    "train_throughput_samples_per_second": [
      57.07382805038776,
      57.1334916113455,
      56.93601336110315,
      56.72469392071424,
      56.64526420320678
    ],
    "train_throughput_mean_samples_per_second": 56.90265822935148,
    "train_throughput_stdev_samples_per_second": 0.19058788132211907,
    "train_io_mean_MB_per_second": 7955.518180172248,
    "train_io_stdev_MB_per_second": 26.64594945050442
  },
  "num_files_train": 28125,
  "num_files_eval": 0,
  "num_samples_per_file": 1,
  "epochs": 5,
  "end": "2024-05-27T15:27:39.203932"
}

The train_au_percentage attribute reports accelerator (GPU) utilization, with one value per epoch.
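
The mean and standard deviation fields in summary.json can be recomputed from the per-epoch lists as a sanity check. The sketch below uses a trimmed copy of the sample file above. The reported values in that sample are consistent with a population standard deviation, so statistics.pstdev is used; whether every release of the tool computes them this way is an assumption. The function name recompute is hypothetical.

```python
import json
import statistics

# Trimmed copy of the sample summary.json shown above.
SUMMARY_JSON = """
{
  "metric": {
    "train_au_percentage": [99.18125818824699, 99.01649117920554,
                            98.95473494676878, 98.31108303926722,
                            98.2658474647346],
    "train_au_mean_percentage": 98.74588296364462,
    "train_au_stdev_percentage": 0.38102089124716115,
    "train_throughput_samples_per_second": [57.07382805038776, 57.1334916113455,
                                            56.93601336110315, 56.72469392071424,
                                            56.64526420320678],
    "train_throughput_mean_samples_per_second": 56.90265822935148,
    "train_throughput_stdev_samples_per_second": 0.19058788132211907
  }
}
"""


def recompute(summary: dict) -> dict:
    """Recompute the mean and population stdev from the raw per-epoch lists."""
    metric = summary["metric"]
    au = metric["train_au_percentage"]
    tput = metric["train_throughput_samples_per_second"]
    return {
        "train_au_mean_percentage": statistics.fmean(au),
        "train_au_stdev_percentage": statistics.pstdev(au),
        "train_throughput_mean_samples_per_second": statistics.fmean(tput),
        "train_throughput_stdev_samples_per_second": statistics.pstdev(tput),
    }


summary = json.loads(SUMMARY_JSON)
for key, value in recompute(summary).items():
    reported = summary["metric"][key]
    print(f"{key}: recomputed={value:.6f} reported={reported:.6f}")
```

To check a real run, replace SUMMARY_JSON with the contents of the summary.json file in results-dir.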

You can also run the test multiple times and save the results of each run in the following layout:

sample-results
    |---run-1
           |---host-1
                    |---summary.json
           |---host-2
                    |---summary.json
              ....
           |---host-n
                    |---summary.json
    |---run-2
           |---host-1
                    |---summary.json
           |---host-2
                    |---summary.json
              ....
           |---host-n
                    |---summary.json
        .....
    |---run-5
           |---host-1
                    |---summary.json
           |---host-2
                    |---summary.json
              ....
           |---host-n
                    |---summary.json
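
The layout above can be assembled with a small helper that copies each summary.json into its run and host directory. This is a minimal Python sketch, not part of the benchmark tooling: the function store_summary is hypothetical, and the demonstration uses a temporary directory with a placeholder file instead of real results.

```python
import shutil
import tempfile
from pathlib import Path


def store_summary(results_root: Path, run: int, host: int, summary: Path) -> Path:
    """Copy one summary.json to <results_root>/run-<run>/host-<host>/summary.json."""
    target_dir = Path(results_root) / f"run-{run}" / f"host-{host}"
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(summary, target_dir / "summary.json"))


# Demonstration with a placeholder file: five runs on a single host.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    placeholder = tmp_path / "summary.json"
    placeholder.write_text('{"model": "unet3d"}')
    root = tmp_path / "sample-results"
    for run in range(1, 6):
        store_summary(root, run, 1, placeholder)
    layout = sorted(p.relative_to(root).as_posix() for p in root.rglob("summary.json"))

print("\n".join(layout))
```

In a real setup, store_summary would be called once per host after each run, with summary pointing at the file written to that run's results-dir.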

Then aggregate the results of the runs with the following command:

./benchmark.sh reportgen --results-dir sample-results

The final aggregated result looks like this:

{
    "overall": {
        "model": "unet3d",
        "num_client_hosts": 1,
        "num_benchmark_runs": 5,
        "train_num_accelerators": "20",
        "num_files_train": 28125,
        "num_samples_per_file": 1,
        "train_throughput_mean_samples_per_second": 56.587322998616344,
        "train_throughput_stdev_samples_per_second": 0.3842685544298719,
        "train_throughput_mean_MB_per_second": 7911.431396900177,
        "train_throughput_stdev_MB_per_second": 53.72429981238494
    },
    "runs": {
        "run-5": {
            "train_throughput_samples_per_second": 57.06105089062497,
            "train_throughput_MB_per_second": 7977.662939935283,
            "train_num_accelerators": "20",
            "model": "unet3d",
            "num_files_train": 28125,
            "num_samples_per_file": 1
        },
        "run-2": {
            "train_throughput_samples_per_second": 56.18386238258097,
            "train_throughput_MB_per_second": 7855.023869277903,
            "train_num_accelerators": "20",
            "model": "unet3d",
            "num_files_train": 28125,
            "num_samples_per_file": 1
        },
        "run-1": {
            "train_throughput_samples_per_second": 56.90265822935148,
            "train_throughput_MB_per_second": 7955.518180172248,
            "train_num_accelerators": "20",
            "model": "unet3d",
            "num_files_train": 28125,
            "num_samples_per_file": 1
        },
        "run-3": {
            "train_throughput_samples_per_second": 56.69229017116294,
            "train_throughput_MB_per_second": 7926.10677895614,
            "train_num_accelerators": "20",
            "model": "unet3d",
            "num_files_train": 28125,
            "num_samples_per_file": 1
        },
        "run-4": {
            "train_throughput_samples_per_second": 56.09675331936137,
            "train_throughput_MB_per_second": 7842.845216159307,
            "train_num_accelerators": "20",
            "model": "unet3d",
            "num_files_train": 28125,
            "num_samples_per_file": 1
        }
    }
}
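
For the single-host sample above, the overall figures match the mean and population standard deviation of the five per-run values, and dividing the two means gives the average data volume per sample. The Python sketch below recomputes them from a trimmed copy of the report. How reportgen combines results from several hosts is not shown in the sample and is not assumed here; the function name recompute_overall and the mb_per_sample key are hypothetical.

```python
import statistics

# Per-run throughput values copied from the sample report above.
REPORT = {
    "runs": {
        "run-1": {"train_throughput_samples_per_second": 56.90265822935148,
                  "train_throughput_MB_per_second": 7955.518180172248},
        "run-2": {"train_throughput_samples_per_second": 56.18386238258097,
                  "train_throughput_MB_per_second": 7855.023869277903},
        "run-3": {"train_throughput_samples_per_second": 56.69229017116294,
                  "train_throughput_MB_per_second": 7926.10677895614},
        "run-4": {"train_throughput_samples_per_second": 56.09675331936137,
                  "train_throughput_MB_per_second": 7842.845216159307},
        "run-5": {"train_throughput_samples_per_second": 57.06105089062497,
                  "train_throughput_MB_per_second": 7977.662939935283},
    }
}


def recompute_overall(report: dict) -> dict:
    """Recompute cross-run mean and population stdev for a single-host report."""
    runs = list(report["runs"].values())
    samples = [r["train_throughput_samples_per_second"] for r in runs]
    mb = [r["train_throughput_MB_per_second"] for r in runs]
    return {
        "train_throughput_mean_samples_per_second": statistics.fmean(samples),
        "train_throughput_stdev_samples_per_second": statistics.pstdev(samples),
        "train_throughput_mean_MB_per_second": statistics.fmean(mb),
        "train_throughput_stdev_MB_per_second": statistics.pstdev(mb),
        # Derived value, not a field of the report: average MB read per sample.
        "mb_per_sample": statistics.fmean(mb) / statistics.fmean(samples),
    }


overall = recompute_overall(REPORT)
for key, value in overall.items():
    print(f"{key}: {value:.3f}")
```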