Release Notes
November 16, 2021
This is the first release on the Alluxio 2.7.X line. Alluxio 2.7 further enhances the functionality and performance of machine learning and AI training workloads, serving as a key component in data pre-processing and training pipelines. Alluxio 2.7 also introduces improved scalability of job service operations by batching data management jobs. In addition, enhancements in master RPCs were added to enable launching clusters with a massive number of workers. We also improved fault tolerance capabilities in Alluxio 2.7.
Highlights
Deeper Integration with Machine Learning Workloads
Alluxio 2.7 emphasizes the integration with machine learning and AI training workloads. This release includes a set of optimizations specifically targeting large numbers of small files and highly concurrent data access, as well as improved POSIX compatibility. These optimizations reduce the time of training-related tasks such as data preloading and allow them to scale further.
To ensure the integration is solid for this type of workload, this release adds multiple testing frameworks, including the Alluxio embedded FUSE StressBench, to prevent regressions in functionality and performance in training workloads (documentation).
Alluxio 2.7 also contains improvements in Helm charts, CSI, and Fluid integration, which help deploy Alluxio in Kubernetes clusters co-located with training jobs. More metrics and more detailed documentation have been added to help users monitor, debug, and investigate issues with the Alluxio POSIX API.
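As a concrete illustration of the POSIX integration, training jobs can read Alluxio-managed data through a FUSE mount. The mount point and dataset path below are assumptions, and the script location can vary by deployment; see the POSIX API documentation for exact usage.

```shell
# Mount the Alluxio namespace at a local path (paths are illustrative)
integration/fuse/bin/alluxio-fuse mount /mnt/alluxio-fuse /

# Training code can then use ordinary POSIX reads, e.g.:
ls /mnt/alluxio-fuse/datasets/train
```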
Improved Job Service Scalability
Job Service has become an essential service in many deployments and data processing pipelines involving Alluxio. Alluxio 2.7 introduces Batched Job to reduce the management overhead in the job master, allowing the job service to handle an order of magnitude more total and concurrent jobs (see configuration and documentation). This is necessary as users increasingly use the Job Service to perform tasks like distributedLoad and distributedCopy on very large numbers of files.
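As a sketch of how batching can be tuned, the property below reflects our reading of the 2.7 configuration reference; verify the exact name and default against the documentation before use.

```properties
# alluxio-site.properties: number of job requests grouped into one batch
alluxio.job.request.batch.size=20
```

A distributedLoad over a large directory (e.g. ./bin/alluxio fs distributedLoad /data/training-set) then benefits from batched job submission rather than one request per file.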
In addition, on the client side, more efficient polling is implemented so that pinning and setReplication commands on large directories place less stress on the job service. This allows users to pin a larger set of files.
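Typical usage is sketched below; the paths are illustrative (see the CLI documentation for the exact options of pin and setReplication).

```shell
# Pin a dataset so its blocks are kept in Alluxio storage
./bin/alluxio fs pin /data/training-set

# Require a minimum of two replicas across a large directory
./bin/alluxio fs setReplication --min 2 /data/training-set
```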
Better Master RPC Scalability
Alluxio 2.7 introduces better cluster scalability by improving the efficiency of worker registration with the master upon startup (documentation). When a large number of workers register with the master at the same time, they can overwhelm it by overconsuming memory, leaving the master unresponsive.
Alluxio 2.7 introduces a master-side flow control mechanism, called “worker registration lease”. The master can now control how many workers may register at the same time, forcing additional workers to wait for their turn. This ensures a stable consumption of resources on the master process to avoid catastrophic situations.
Alluxio 2.7 also introduces a new option for the worker to break the registration request into a stream of smaller messages. Streaming further smooths master-side memory consumption while registering workers, allows higher concurrency, and is more GC-friendly.
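Both mechanisms are controlled through site properties along the lines of the sketch below. The property names reflect our reading of the 2.7 scalability documentation and should be verified there before use.

```properties
# alluxio-site.properties (names per 2.7 docs; verify before use)
# Require workers to obtain a lease from the master before registering
alluxio.worker.register.lease.enabled=true
# Send the registration as a stream of smaller messages
alluxio.worker.register.stream.enabled=true
```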
Increased Fault Tolerance
Sudden network partitions and node crashes within the Alluxio HA cluster can be fatal to running applications. With intelligent retry support, Alluxio 2.7 now efficiently tracks each RPC in the journal so that applications have improved fault tolerance.
Improvements and Bugfixes Since 2.6.2
Docker, CSI, and K8s
- Add toggle to Helm chart to enable or disable master & worker resources (17e0542)
- Add ImagePullSecrets to Helm chart (c775c70)
- Fix URL address of alive workers when workers start with Kubernetes (3239740)
- Add probe improvements to Helm chart (d7c252d)
- Fix load command failing to cache blocks in container environments (a25f0a1)
- Add fallback logic for when the CPU count in a container is 1 (fe0420c)
- Support CSI in both docker images (575d205)
Metrics
- Fix master lost file count metric always being 0 (91fb695)
- Fix issue where worker cannot report cluster aggregated metrics (aa823e4)
- Add worker and client metrics size to log info (cc50847)
- Add worker threadPool and task metrics (fed20cd)
- Add active job size metric (23f3a5d)
- Fix issue where Worker.BlocksEvicted is always zero (492647b)
- Add metric for audit log (a06d374)
- Add metric for Raft journal index (3df0ef4)
- Add lock pool metrics (fb1f1b8)
S3 API
- Fix S3 bucket region (e76258f)
- Implement S3 API features required to use basic Spark operations (23be470)
- Support listObjectV2 for S3 REST API (23f5b85)
- Fix S3 REST API listObjects delimiter processing error (0892405)
- Set S3 endpoint region and support OCI object storage (36a4d5c)
- Fix S3 listBuckets REST API non-standard response issue (bf11633)
- Add application/octet-stream to consumed content types in S3 Handler (0901eb6)
StressBench
- Add lease support to StressBench tests (a6e2a14)
- Fix RPC benchmarks' extra preparation and low concurrency (99cb44d)
- Add maxThroughput for job service (0355f12)
- Fix -create-file-size param in Stress Master Bench (bd44e98)
- Add recursive file creation in case of a missing parent directory (0ba1baf)
- Add Alluxio native API for max throughput in StressBench (bfccb96)
- Add Alluxio native API call for ClientIO bench (872e68b)
- Update Job Service bench (b6ce813)
- Add three types of read and cluster mode in Fuse Stress Bench (af1477d)
General Improvements
- Improve Web UI pages and displayed metrics (33b8a24)(b20ef39)(73bf2ce)(4087a38)
- Improve IntelliJ support for Alluxio processes (71165a5)(2458871)(0a5669c)(69273af)
- Implement configurable ExecutorService creation for Alluxio processes (0eb8e89)
- Bump RocksDB version to 6.25.3 (b41b519)
- Convert worker registration to a stream (7c020f6)
- Add master-side flow control for worker registration with RegisterLease (e883136)
- Replace ForkJoinPool with FixedThreadPool (0cd3b7e)
- Add Batched Job for job service (dc95b3b)
- Reset priorities after transfer of leadership (6a18760)
- Implement Resource Leak Detection (ab5e505)
- Add ReconfigurableRegistry and Reconfigurable (5e5b3e6)
- Remove fall-back logic from journal appending (4f870de)
- Add retry support for initializing AlluxioFileOutStream (2134192)
- Allow higher keep-alive frequency for worker data server (bd7a958)
- Enable more settings on job-master’s RPC server (b14f6e6)
- Increase allowed keep-alive rate for master RPC server (c061b93)
- Improve exception message in block list merger function (62a9958)
- Implement FS operation tracking for improving partition tolerance (a9f121c)
- Enable additional keep-alive configuration on gRPC servers (3f94a5b)
- Introduce client-side keep-alive monitoring for channels of RPC network (e6cded4)
- Support logical name as HA address (0526fae)
- Support Huawei Cloud (PFS & OBS) (0cf6b91)
- Introduce deadlines for sync RPC stubs (70112cb)
- Use jobMasterRpcAddresses to judge ServiceType (b0eba95)
- Support shuffling master addresses for RPC client (665960e)
Bugs
- Resolve lock leak during recursive delete (ae281b5)
- Protect startMasters() from unchecked exceptions (8fce673c)
- Close UFS in UfsJournal close (b2c256e3)
- Add filtering to MergeJournalContext (f8b2cd1)
- Fix distributedLoad after a file is already loaded (924199f)
- Protect transfer of leadership from unchecked exceptions (146bbce)
- Fix null pointer exception for AuditLog (6d2ce19)
- Return directly when an exception is thrown during leader transfer (bddb70e)
- Fix SnapshotLastIndex update error (fa99e3873)
- Fix RetryHandlingJournalMasterClient (4de3acd)
- Fix copyFromLocal command (1662dcf)
- Fix the semantic ambiguity of LoadMetadataCommand (b7e67d5)
- Fix bug where a recursive path cannot be created when initializing a multipart upload (28ee226c)
- Catch errors during explicit cache cleanup (5339349)
- Fix leader master priority always 0 (01347b2)
Acknowledgements
We want to thank the community for their valuable contributions to the Alluxio 2.7.0 release. In particular, we would like to thank:
Baolong Mao (maobaolong), Bing Zheng (bzheng888), Binyang Li (Binyang2014), Chenliang Lu (yabola), Jieliang Li (ljl1988com), Kevin Cai (kevincai), Lei Qian (qian0817), Liwei Wang (jffree), Ryan Zang (ryantotti), Steven Choi (Stevenchooo), Tom Lee (tomscut), Tianbao Ding (flaming-archer), XiChen (xichen01), Yang Yu (yuyang733), Yaolong Liu (codings-dan), and Zac Blanco (ZacBlanco)
Enjoy the new release. We look forward to hearing your feedback on our Community Slack Channel.