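Cache Preloading
Distributed load lets users efficiently load data from the UFS into an Alluxio cluster. It can be used to warm up an Alluxio cluster so that cached data is available immediately when workloads run on Alluxio. For example, distributed load can prefetch data for a machine learning job to speed up training. Distributed load can use [file segmentation](File-Segmentation.html) and multiple replicas to improve file distribution in highly concurrent data access scenarios.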
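Usage
There are two recommended ways to trigger a distributed load:
Job Load CLI
The `job load` command loads data from the UFS (under file system) into the Alluxio cluster.
The CLI sends a load request to the Alluxio coordinator, which then distributes the load operation to all worker nodes.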
bin/alluxio job load [flags] <path>
# Example output
Progress for loading path '/path':
Settings: bandwidth: unlimited verify: false
Job State: SUCCEEDED
Files Processed: 1000
Bytes Loaded: 125.00MB
Throughput: 2509.80KB/s
Block load failure rate: 0.00%
Files Failed: 0
For detailed CLI usage, see the job load documentation.
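When the CLI is driven from a script, the progress report above can be turned into a dictionary for simple checks. The following is a minimal sketch, assuming the report keeps the `Key: value` layout of the example output; the function name `parse_job_load_progress` and the `SAMPLE_REPORT` constant are hypothetical and not part of Alluxio.

```python
# Sketch only: parses the "Key: value" lines shown in the example output.
# The report layout is assumed to match the sample; it is not a stable API.
SAMPLE_REPORT = """Progress for loading path '/path':
Settings: bandwidth: unlimited verify: false
Job State: SUCCEEDED
Files Processed: 1000
Bytes Loaded: 125.00MB
Throughput: 2509.80KB/s
Block load failure rate: 0.00%
Files Failed: 0
"""


def parse_job_load_progress(report: str) -> dict:
    """Return the report fields as a dict of strings (hypothetical helper)."""
    fields = {}
    for line in report.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and the header line that ends with a colon.
        if not line or line.startswith("#") or line.endswith(":"):
            continue
        # Split on the first colon only, so values may contain colons themselves.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields
```

A caller could then test `parse_job_load_progress(output)["Job State"]` before starting the dependent workload. Values are kept as strings because units such as `MB` and `KB/s` appear inline.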
REST API
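Like the CLI, the REST API can also be used to load data. A request can be sent to any worker node; the worker forwards it to the Alluxio coordinator, which distributes the load to all other worker nodes.
Submit a job by sending a POST request that includes the directory path and sets the `opType` query parameter to `submit`.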
curl --request POST 'http://{worker_host}:{http_port}/v1/load?path={path}&opType=submit'
Example request and response:
curl -v -X POST 'http://172.30.16.110:19999/v1/load?path=s3://test&opType=submit'
# Example output
* About to connect() to 172.30.16.110 port 19999 (#0)
* Trying 172.30.16.110...
* Connected to 172.30.16.110 (172.30.16.110) port 19999 (#0)
> POST /v1/load?path=s3://test&opType=submit HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 172.30.16.110:19999
> Accept: */*
> Content-Length: 183
>
* upload completely sent off: 183 out of 183 bytes
< HTTP/1.1 200 OK
< Date: Wed, 10 Apr 2024 05:47:17 GMT
< Content-Length: 4
< Server: Jetty(9.4.53.v20231009)
<
* Connection #0 to host 172.30.16.110 left intact
true
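The same submit request can be issued from Python using only the standard library. This is a minimal sketch, not an official client: the helper names `build_load_url` and `submit_load` are hypothetical, the path is percent-encoded on the assumption that the worker decodes query parameters in the usual way, and a response body of `true` is assumed to mean the request was accepted, as in the example output above.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_load_url(worker_host: str, http_port: int, path: str, op_type: str) -> str:
    """Build the /v1/load URL for a given path and opType (hypothetical helper)."""
    # urlencode percent-encodes the path, e.g. "s3://test" -> "s3%3A%2F%2Ftest".
    query = urlencode({"path": path, "opType": op_type})
    return f"http://{worker_host}:{http_port}/v1/load?{query}"


def submit_load(worker_host: str, http_port: int, path: str, timeout: float = 10.0) -> bool:
    """POST a submit request and report whether the body was 'true' (assumption)."""
    request = Request(build_load_url(worker_host, http_port, path, "submit"), method="POST")
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8").strip() == "true"
```

For instance, `submit_load("172.30.16.110", 19999, "s3://test")` would send the request from the example. Keeping URL construction separate makes it easy to reuse for the other `opType` values.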
Check progress by sending a GET request with the same path and the `opType` query parameter set to `progress`.
curl --request GET 'http://{worker_host}:{http_port}/v1/load?path={path}&opType=progress'
Example request and response:
curl -v -X GET 'http://host:19999/v1/load?path=s3://test&opType=progress'
Result:
* About to connect() to 172.30.16.110 port 19999 (#0)
* Trying 172.30.16.110...
* Connected to 172.30.16.110 (172.30.16.110) port 19999 (#0)
> GET /v1/load?path=s3://test&opType=progress
> User-Agent: curl/7.29.0
> Host: 172.30.16.110:19999
> Accept: */*
> Content-Length: 81
>
* upload completely sent off: 81 out of 81 bytes
< HTTP/1.1 200 OK
< Date: Wed, 10 Apr 2024 05:48:49 GMT
< Content-Length: 572
< Server: Jetty(9.4.53.v20231009)
<
* Connection #0 to host 172.30.16.110 left intact
"{\"mVerbose\":true,\"mJobState\":\"RUNNING\",\"mVerificationEnabled\":false,\"mSkippedByteCount\":0,\"mLoadedByteCount\":0,\"mScannedInodesCount\":18450,\"mLoadedNonEmptyFileCopiesCount\":0,\"mThroughput\":0,\"mFailureFilesPercentage\":0.0,\"mFailureSubTasksPercentage\":0.0,\"mRetrySubTasksPercentage\":0.0,\"mFailedFileCount\":0,\"mRecentFailedSubtasksWithReasons\":[],\"mRecentRetryingSubtasksWithReasons\":[],\"mSkipIfExists\":true,\"mMetadataOnly\":true,\"mRunningStage\":\"LOADING\",\"mRetryDeadLetterQueueSize\":0,\"mTimeElapsed\":87237,\"mSegmentEnabled\":false}"
Stop a load by sending a POST request with the same path and the `opType` query parameter set to `stop`.
curl --request POST 'http://{worker_host}:{http_port}/v1/load?path={path}&opType=stop'
Example request and response:
curl -v -X POST 'http://host:19999/v1/load?path=s3://test&opType=stop'
Example result:
* About to connect() to 172.30.16.110 port 19999 (#0)
* Trying 172.30.16.110...
* Connected to 172.30.16.110 (172.30.16.110) port 19999 (#0)
> POST /v1/load?path=s3://test&opType=stop
> User-Agent: curl/7.29.0
> Host: 172.30.16.110:19999
> Accept: */*
> Content-Length: 42
>
* upload completely sent off: 42 out of 42 bytes
< HTTP/1.1 200 OK
< Date: Wed, 10 Apr 2024 05:51:56 GMT
< Content-Length: 5
< Server: Jetty(9.4.53.v20231009)
<
* Connection #0 to host 172.30.16.110 left intact
true
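A script that submits a load typically polls the progress endpoint until the job leaves the `RUNNING` state, and may issue a stop request if it gives up. The sketch below shows one way to do the polling, under stated assumptions: the progress body is a JSON string that itself contains JSON (as in the sample response above), any `mJobState` other than `RUNNING` is treated as finished, and the names `decode_progress`, `wait_for_load`, and `fetch_body` are hypothetical. The HTTP call is passed in as a function so the loop works with any client.

```python
import json
import time
from typing import Callable

# Shortened copy of the sample progress body: a JSON string wrapping a JSON object.
SAMPLE_BODY = r'"{\"mJobState\":\"RUNNING\",\"mLoadedByteCount\":0,\"mScannedInodesCount\":18450}"'


def decode_progress(body: str) -> dict:
    """Decode a progress response body into a dict (hypothetical helper)."""
    decoded = json.loads(body)
    # The sample body is double-encoded, so decode a second time if we got a string.
    if isinstance(decoded, str):
        decoded = json.loads(decoded)
    return decoded


def wait_for_load(
    fetch_body: Callable[[], str],
    poll_seconds: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until mJobState is no longer RUNNING or max_polls is reached."""
    progress = decode_progress(fetch_body())
    polls = 1
    # Assumption: only RUNNING means "still in progress"; other states end the loop.
    while progress.get("mJobState") == "RUNNING" and polls < max_polls:
        sleep(poll_seconds)
        progress = decode_progress(fetch_body())
        polls += 1
    return progress
```

Here `fetch_body` would wrap a GET request with `opType=progress` and return the response text. If the returned state is still `RUNNING` after `max_polls` attempts, the caller can decide whether to keep waiting or send a stop request.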