Use Cases

Many leading companies around the world run Alluxio in production to extract value from their data. Some of them are listed on our Powered-By page. This section introduces some of the most common Alluxio use cases.

Use Case 1: Accelerate Analytics and AI in the Cloud

Many organizations run analytics and machine learning workloads (Spark, Presto, Hive, TensorFlow, etc.) on object storage in the public cloud (AWS S3, Google Cloud Storage, or Microsoft Azure Blob Storage).

Though cloud object stores are often more cost-effective, easier to use, and easier to scale, there are some challenges:

  • Performance is variable, and consistent SLAs are hard to achieve
  • Metadata operations are expensive and slow down workloads
  • Embedded caching is ineffective for ephemeral clusters

Alluxio addresses these challenges by providing intelligent multi-tiered caching and metadata management. Deploying Alluxio on the compute cluster helps:

  • Achieve consistent performance for analytics engines
  • Reduce AI training time and cost
  • Eliminate repeated storage access costs
  • Achieve off-cluster caching for ephemeral workloads
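The idea behind eliminating repeated storage access can be illustrated with a toy read-through cache. This is only a minimal sketch under stated assumptions: the `ReadThroughCache` class, the `fake_store` dictionary, and the `s3://bucket/...` keys are hypothetical names invented for illustration, and none of this is Alluxio's API or its actual tiered caching implementation.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Toy read-through cache in front of a slow or metered object store."""

    def __init__(self, fetch_from_store: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_store      # function that reads from the backing store
        self._cache: Dict[str, bytes] = {}  # in-memory cache (a single tier, for simplicity)
        self.store_reads = 0                # number of requests that reached the store

    def read(self, key: str) -> bytes:
        # Only go to the backing store on a cache miss.
        if key not in self._cache:
            self.store_reads += 1
            self._cache[key] = self._fetch(key)
        return self._cache[key]


# Hypothetical stand-in for a remote object store.
fake_store = {
    "s3://bucket/part-0": b"aaaa",
    "s3://bucket/part-1": b"bbbb",
}
cache = ReadThroughCache(lambda key: fake_store[key])

# Simulate three passes over the same dataset, e.g. three training epochs.
for _ in range(3):
    for key in fake_store:
        cache.read(key)

print(cache.store_reads)  # 2: each object is fetched from the store once, not three times
```

In this sketch, six reads by the application translate into only two requests against the backing store; a real deployment would additionally need eviction, multiple storage tiers, and metadata handling.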

See this example use case from Electronic Arts.

Use Case 2: Speed Up Analytics and AI for On-premise Object Stores

Running data-driven applications on top of an object store deployed on-premise brings the following challenges:

  • Poor performance for analytics and AI workloads
  • Insufficient native support for popular frameworks
  • Expensive and slow metadata operations

Alluxio solves these problems by providing caching and API translation. Deploying Alluxio on the application side brings:

  • Improved performance for analytics and AI workloads
  • Flexibility of segregated storage
  • Support for multiple APIs with no changes to the end-user experience
  • Reduced overall storage cost

See this example use case from DBS.

Use Case 3: “Zero-Copy” Hybrid Cloud Bursting

As more organizations migrate to the cloud, a common intermediate step is to use compute resources in the cloud while retrieving data from on-premise data sources. However, this hybrid architecture brings the following problems:

  • Data access across the network is slow and inconsistent
  • Copying data to cloud storage is time-consuming, error-prone, and complex
  • Compliance and data sovereignty requirements may prohibit copying data into the cloud

Alluxio provides “zero-copy” cloud bursting, which lets compute engines in the cloud access on-premise data without maintaining a persistent copy in the cloud that must be periodically synchronized with the original data. This brings the following benefits:

  • Performance as if data is on the cloud compute cluster
  • No changes to end-user experience and security model
  • Common data access layer with access-based or policy-based data movement
  • Utilization of elastic cloud compute resources and cost savings

See this example use case from Walmart.

Use Case 4: Hybrid Cloud Storage Gateway for Data in the Cloud

Another hybrid cloud architecture is to access cloud storage from a private datacenter. Using this architecture usually causes the following problems:

  • No unified view for cloud and on-premise storage
  • Prohibitively high network egress costs
  • Inability to utilize compute on-premises for data in the cloud
  • Inadequate performance for analytics and AI

Alluxio solves these problems by acting as a hybrid cloud storage gateway that utilizes on-premise compute for data in the cloud. When deployed with the compute on-premise, Alluxio manages the compute cluster’s storage and provides data locality to applications, achieving:

  • High performance for reads and writes using intelligent distributed caching
  • Network cost savings by eliminating replication
  • No changes to the end-user experience with flexible APIs and security model on cloud storage

See this example use case from Comcast.

Use Case 5: Enable Cross Datacenter Access

Many organizations maintain satellite compute clusters that are independent of their main data cluster for the purposes of performance, security, or resource isolation. These satellite clusters need to access data remotely from the main cluster, which is challenging because:

  • Cross-datacenter copies are manual and time-consuming
  • Unnecessary network traffic for replication is expensive
  • Replication jobs on an overloaded storage cluster dramatically impact the performance of existing workloads

Alluxio can be deployed on the compute nodes in the satellite cluster and configured to connect to the main data cluster, serving as one logical copy of the data. This results in:

  • No redundant data copies across datacenters
  • Elimination of complex data synchronization
  • Improved performance compared to remote region data access
  • Self-service data infrastructure across business units