Cache Pattern

Fast intermediate (SQL) storage.

The cache pattern has been one of the most complex problems for a long time. The idea behind creating a data warehouse in the 1980s was to cache the data in an aggregated way so we could make fast business decisions for our company.

To this day, it remains challenging to build and run a cache for your data yourself. Caching means constantly duplicating data, storing it optimally, and updating it when the source changes. However, because the payoff is significant, we still use it in nearly every data engineering solution.

Let's dig into this chapter to see how the caching layer evolved into a pattern in the data engineering field and what that might look like.

Definition

What is (data) caching?

Data caching is a process that stores copies of data or files in a temporary storage location (the cache) so they can be accessed faster. It keeps data ready for data applications, notebooks, data pipelines, or BI dashboards, so users need not download or recompute information every time they access a website or application, which speeds up their experience.

Cached data typically includes pre-processed and aggregated data stored locally on your device or server, so results load quickly instead of first running a heavy SQL query or, worse, a heavy Hadoop MapReduce or data pipeline job.
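To make this concrete, here is a minimal sketch using Python's standard-library sqlite3 module. It runs an aggregation once, stores the result as a small copy, and serves reads from that copy. The table and column names (sales, sales_by_region) and the data are made up for illustration.

```python
import sqlite3

# In-memory database standing in for the source system (illustrative data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("EU", 100.0), ("EU", 50.0), ("US", 200.0)],
)

# Run the heavy aggregation once and keep the result as a cached copy.
conn.execute(
    "CREATE TABLE sales_by_region AS "
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
)

def total_for(region: str) -> float:
    """Read from the pre-aggregated table instead of scanning raw rows."""
    row = conn.execute(
        "SELECT total FROM sales_by_region WHERE region = ?", (region,)
    ).fetchone()
    return row[0] if row else 0.0

print(total_for("EU"))  # 150.0
```

Note that the cached table does not update itself when new rows land in sales; keeping it current is a separate problem.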

Caching is a common mitigation for the von Neumann bottleneck, the throughput limit caused by moving data between processor and memory, because it keeps frequently used data closer to where it is processed.

Characteristics

What makes the cache pattern recognizable?

There are many different ways of caching. The previously discussed convergent evolutions (Materialized Views, OBT, Traditional and Modern OLAP Systems, dbt tables, ODS, Semantic Layer, and the classic Data Warehouse) all make use of the cache pattern.

graph TD

    CE_MV[MV]
    CE_OBT[OBT]
    P_CachingDisk[P: Cache]
    CE_SemanticLayer[Semantic Layer]
    CE_ModernOLAP[Modern OLAP System]
    CE_ODS[ODS]
    CE_TraditionalOLAP[Traditional OLAP System]
    CE_dbt[dbt table]
    CE_DataWarehouse[Data Warehouse]

    %% Linking Nodes
    CE_MV --> P_CachingDisk
    CE_OBT --> P_CachingDisk
    CE_TraditionalOLAP --> P_CachingDisk
    CE_dbt --> P_CachingDisk
    CE_ModernOLAP --> P_CachingDisk
    CE_ODS --> P_CachingDisk
    CE_SemanticLayer --> P_CachingDisk
    CE_DataWarehouse --> P_CachingDisk

Some simply create a copy of the data and store it on disk. Others, most often, load it into memory and are therefore called in-memory (newer ones even use GPUs and faster memory still). And some build their cache solution from scratch or use an intermediate, faster retrieval store similar to Redis.

A caching layer is used nearly everywhere. But why? Why is caching used in so many different technologies or data engineering tools behind the scenes, again and again?

The obvious answer is one of its characteristics: rapid data retrieval. By caching data, systems can access and process it much faster than if they were to retrieve it from the primary storage every time or rerun heavy data jobs.

This is especially beneficial for customer-facing applications that demand real-time or near-real-time analytics, where any delay in data access can be detrimental.

How Martin Kleppmann categorizes caching in his book

Interestingly, in his book Designing Data-Intensive Applications, Martin Kleppmann treats caching as synonymous with denormalized, derived data. He also categorizes the OLAP or data cube as a specialized materialized view under the umbrella of caching aggregations, and uses Data Warehouse and OLAP interchangeably.

Application

Typical use cases for the cache pattern are BI dashboards, real-time analytics, and data pipelines.

For example, in BI dashboards, caching ensures that users can access updated insights without the delay of recalculating metrics from raw data each time a report is viewed. This is key in business environments where timely decision-making depends on the most current data. For instance, a typical company with several ERP systems might use cached data to serve sales metrics and aggregated insights to management instantly, because they are computed ahead of time.

In data pipelines, caching can optimize the performance of data processing jobs by storing intermediate results. This reduces the load on the underlying data sources and speeds up the pipeline execution time. For example, a machine learning pipeline might cache transformed datasets to avoid reprocessing them each time a new model is trained, thus significantly reducing the time to iterate on model development.
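A minimal sketch of caching an intermediate pipeline result could look like the following. It assumes the input fits in memory and can be serialized as JSON; the transform step, the cache directory, and the data are hypothetical stand-ins.

```python
import hashlib
import json
import tempfile
from pathlib import Path

CACHE_DIR = Path(tempfile.mkdtemp())  # hypothetical cache location
calls = {"transform": 0}              # counts how often the heavy step runs

def transform(rows: list[dict]) -> list[dict]:
    """Stand-in for an expensive transformation step."""
    calls["transform"] += 1
    return [{**r, "amount_cents": int(round(r["amount"] * 100))} for r in rows]

def cached_transform(rows: list[dict]) -> list[dict]:
    """Reuse a stored result when the serialized input is identical."""
    key = hashlib.sha256(json.dumps(rows, sort_keys=True).encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    result = transform(rows)
    path.write_text(json.dumps(result))
    return result

rows = [{"id": 1, "amount": 9.99}, {"id": 2, "amount": 5.0}]
first = cached_transform(rows)   # computes and stores the result
second = cached_transform(rows)  # served from disk
print(calls["transform"])  # 1
```

Keying the cache on a hash of the input means a changed input automatically produces a new entry, at the cost of old entries piling up until something cleans them out.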

These applications underscore the importance of caching in enhancing performance, reliability, and user experience in data-intensive environments. You get fast storage, ideally with the cached data fitting in memory, optimized retrieval with SQL, and efficient data access.

Examples

There is no shortage of examples. For illustration, the table below shows alternative data storage options, each of which integrates a cache itself, except for the first column, "Cache", which is the cache.

| Data Storage Options | Cache | (Modern) OLAP | Semantic Layer | Message Queue | Relational Database | Data Warehouse | Data Lake |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Scale | Megabytes | Terabytes | Gigabytes | Gigabytes | Gigabytes | Terabytes | Petabytes |
| Data Life Time | Hours | Years | Hours | Weeks | Years | Years | Decades |
| Applications | Application optimization | Real-time analytics and interactive querying | Centralized business logic for analytics | Data collection, real-time analytics | Running applications | Analytics and ML | Analytics and ML |
| Latency | Low | Low | Low | High | Medium | Medium | Very High |
| Interface | REST, NextJS, Flask, Spring Boot | SQL, APIs, BI tools, custom dashboards | SQL, APIs (REST, GraphQL), BI tools | Flink, Spark, Pinot | NextJS, Flask, Spring Boot, SQL | SQL, Tableau, Power BI | SQL, Spark, Trino |
| Examples | Redis, Memcached | Apache Druid, ClickHouse, Apache Pinot, Apache Kylin | Looker, Cube, MetriQL, dbt Metrics Layer | Kafka, RabbitMQ, Amazon SQS, Azure Queue Storage, Amazon Kinesis | Postgres, Oracle, MySQL | Snowflake, BigQuery, Redshift, DuckDB, ClickHouse | S3 / Blob Storage / GCP Object Storage, Iceberg / Delta Lake / Hudi |
| Cache Example | None, e.g., Redis is in-memory and stored in RAM | Cached query results and materialized views | Centralized metric definitions cached for fast retrieval | In-memory caching for message queues | Indexes, data in tables, query execution plans | Data Marts, Cached Query results in Cloud Warehouses | Transaction logs for Delta Lake, Hudi, Iceberg |

Storage Cost (per-column values fused in the source): $$$$$$$$$$$$$$$$$$$$$$$

Data Storage Options | Table inspired by Eczachly on Threads.

Cube Store

A particularly interesting example of a cache implementation is Cube Store, the caching layer of the popular open-source semantic layer Cube. Cube initially implemented its cache on top of Redis and ended up building its own cache layer from open-source building blocks: Apache Parquet (storage), Apache Arrow (in-memory data structures), and DataFusion (query execution framework).

The reasons are laid out in their article: mainly the scalability and performance limitations of Redis, especially with growing performance requirements, and Redis's resource consumption, which became the bottleneck under high load. Excessive optimistic concurrency clashes and complex instructions for simple operations were particularly significant issues.

With their own implementation, they mimic Redis's functionality but use atomic, domain-specific instructions instead of lengthy command batches, which reduces clashes and simplifies the cache and queue access flow. Cube Store will also implement distributed LRU caching and store data in a binary format with columnar compression where appropriate, enhancing performance and throughput.
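To illustrate what LRU (least recently used) eviction means, here is a generic, single-process sketch in Python. It is not Cube Store's implementation and it is not distributed; the class name and the cached values are made up for illustration.

```python
from collections import OrderedDict

class LRUCache:
    """Evicts the least recently used entry once capacity is exceeded."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._data: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # drop the least recently used entry

cache = LRUCache(capacity=2)
cache.put("q1", "result 1")
cache.put("q2", "result 2")
cache.get("q1")              # q1 becomes most recently used
cache.put("q3", "result 3")  # evicts q2
print(cache.get("q2"))  # None
```

The policy keeps memory bounded by betting that recently requested results are the ones most likely to be requested again.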

Redis

A true cache, and the tool most associated with this category, is Redis (Remote Dictionary Server). Redis is an open-source, in-memory data structure store that has become the de facto standard for caching due to its speed and efficiency.

Its in-memory approach allows instant read and write operations, significantly reducing latency compared to disk-based databases. This speed is crucial for applications that require rapid access to data. Redis supports various data structures such as strings, hashes, lists, and sorted sets, allowing developers to choose the most efficient structure for their needs, further optimizing performance.

Beyond its speed, Redis also offers features like persistence and replication, contributing to its robustness and reliability. Persistence options, such as snapshotting and append-only files (AOF), ensure data durability, while replication enables high availability and load balancing. Redis Cluster allows horizontal scaling, distributing data across multiple nodes to handle larger datasets and higher throughput. The extensive ecosystem and strong community support, with numerous client libraries for various programming languages, make Redis easy to integrate into diverse technology stacks, solidifying its position as the go-to caching solution.

Advantages: When to Use This Pattern

As the examples in this chapter show, the benefits of the cache pattern are often desirable. When should you apply or follow the pattern yourself?

Mainly when you need intermediate storage for fast data retrieval and your current tool set does not already include a cache. For example, for highly aggregated data that is fetched and calculated across many data sources or even systems.

Caching also means duplicating data, often in an optimized form. It is therefore essential to look at caching holistically, at a higher level, too. Because duplicating data several times can be expensive, check your Data Architecture, especially the Data Model, for an optimized process and data flow before adding a cache.

You can use the cache pattern in many different ways; there is no one way of caching something, and that's why this pattern is used nearly everywhere. We'll dig deeper in later design pattern chapters based on the cache pattern, with best practices and how to implement it.

The advantages derived from using this pattern in data engineering contexts are:

  • Speed: Significantly faster data retrieval.
  • Efficiency: Reduces the need to run heavy data jobs repeatedly.
  • User Experience: Faster access to data improves user satisfaction.

Challenges and Limitations

Caching your data also has downsides, challenges, and limitations.

Only identical queries are served from the cache. If you run the same query as another user but with different filters, a minimal cache can't reuse the data stored for the other filters. This forces a trade-off between storage cost and the amount of data you cache on one side and some queries running slowly on the other. We know that problem well from the granularity and cardinality of OLAP cubes and the way they store pre-calculated metrics at a given level of dimensions. That's why a cache can quickly turn into a sophisticated and complex piece of software, one that many have tackled before, as shown in the examples above. This is a significant hurdle to creating or using one.
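The filter problem can be sketched in a few lines of Python. This is a deliberately minimal result cache, keyed on the query text plus its filters; the function names, the SQL string, and the fake backend are assumptions for illustration.

```python
import hashlib
import json

cache: dict[str, list] = {}
stats = {"backend_calls": 0}

def cache_key(query: str, filters: dict) -> str:
    """Build a key from the query text and its filters (order-independent)."""
    payload = json.dumps({"q": query, "f": filters}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def run_query(query: str, filters: dict) -> list:
    """Return a cached result for identical requests, else hit the backend."""
    key = cache_key(query, filters)
    if key not in cache:
        stats["backend_calls"] += 1  # stand-in for the expensive query
        cache[key] = [f"rows for {filters}"]
    return cache[key]

sql = "SELECT SUM(amount) FROM sales WHERE region = :region"
run_query(sql, {"region": "EU"})  # miss
run_query(sql, {"region": "EU"})  # hit: same query, same filters
run_query(sql, {"region": "US"})  # miss: same query, different filter
print(stats["backend_calls"])  # 2
```

Every distinct filter combination creates its own entry, which is exactly where the storage-versus-speed trade-off comes from.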

Another tricky challenge is handling updates to data, especially fast-changing data. Usually, data does not stand still, especially the more it is massaged and the closer it gets to the users: addresses change, products get updated, and sales increase. And the speed at which this happens grows every day. Data freshness is therefore especially challenging for a cache. After all, what good is fast data if the content is out of date and inaccurate?
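One common way to bound staleness is a time-to-live (TTL): entries older than a chosen age are treated as missing and reloaded from the source. The sketch below is a minimal, single-process version; the class name, the key, and the fake clock are assumptions for illustration.

```python
import time

class TTLCache:
    """Treats entries older than ttl_seconds as missing."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the sketch can be demonstrated
        self._data: dict = {}

    def put(self, key, value) -> None:
        self._data[key] = (value, self.clock())

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if self.clock() - stored_at > self.ttl:
            del self._data[key]  # stale: caller must reload from the source
            return None
        return value

now = [0.0]  # fake clock for demonstration
cache = TTLCache(ttl_seconds=60, clock=lambda: now[0])
cache.put("address:42", "Old Street 1")
now[0] = 30.0
print(cache.get("address:42"))  # Old Street 1
now[0] = 61.0
print(cache.get("address:42"))  # None
```

A TTL only limits how old a served value can be. Within the window, the cache may still return an address that has already changed at the source.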

Within data pipelines, we know the term idempotence: an operation can be applied multiple times with the same inputs without changing the outcome. Applied to a cache, we can cache by day, month, or other parameters/partitions, which lets us refresh only the subset of data that changed. But if we need aggregations across partitions, we need an algorithm that brings the data together consistently.
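As a rough sketch of partition-based caching, the following Python snippet caches one total per day, rebuilds only the partition that changed, and combines the partitions into a monthly aggregate. The data, the function names, and the choice of a simple sum are illustrative assumptions.

```python
# Raw events keyed by day (illustrative data).
source = {
    "2024-01-01": [10, 20],
    "2024-01-02": [5],
}

daily_cache: dict[str, int] = {}

def refresh_partition(day: str) -> None:
    """Recompute one day's total; rerunning with the same input changes nothing."""
    daily_cache[day] = sum(source.get(day, []))

def monthly_total(month: str) -> int:
    """Combine cached daily partitions into a monthly aggregate."""
    return sum(v for day, v in daily_cache.items() if day.startswith(month))

for day in source:
    refresh_partition(day)
refresh_partition("2024-01-01")  # idempotent: no double counting
print(monthly_total("2024-01"))  # 35

source["2024-01-02"].append(15)  # late-arriving data for one day
refresh_partition("2024-01-02")  # only the changed partition is rebuilt
print(monthly_total("2024-01"))  # 50
```

A sum is easy to combine across partitions. Aggregations such as distinct counts or medians cannot simply be added up, which is where the consistency problem starts.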

I'm sure you see the dilemma. Not to mention that fact data can only be updated consistently together with its related dimensions. Building your own cache layer requires a lot of expertise in different domains, such as storage, networking, data engineering (to update data most efficiently), and plenty of software engineering. Once implemented, maintaining such a critical piece is even harder.

That's why so many different ways of caching have evolved in the convergent evolutions discussed earlier, from simple SQL and disk persistence to sophisticated caches that optimize for sub-second responses. In summary, the challenges come down to:

  • Data Freshness: Ensuring the cache is current is challenging.
  • Complexity: Managing cache invalidation and consistency requires expertise.
  • Maintenance: Building and maintaining a custom cache layer demands significant resources.

Alternatives

The most straightforward alternative is to make caching obsolete by using a query engine that provides near real-time data access without an intermediate store. Modern query engines such as DuckDB, which can even run in the browser via WebAssembly (WASM), have significantly optimized query execution speed. These engines handle complex queries efficiently with columnar storage and vectorized execution, enabling on-the-fly data retrieval without the overhead of maintaining a cache.

Another alternative is to adopt more data streaming architectures. Platforms like Apache Kafka and Apache Flink allow for real-time data processing and analytics, which can be a powerful substitute for caching in environments where data is continuously changing. Once the streams are ingested into modern OLAP systems, you have live, updated data on the fly. However, as discussed and seen in the examples and convergent evolutions, a modern OLAP system or a message queue such as Kafka is a cache in itself and uses caching technology internally. So, in that sense, it's not a true alternative.

The same can be said for all convergent evolution terms linked to the cache: materialized views, OBT, dbt tables, ODS, Semantic Layer, or a Data Warehouse.

Each of these alternatives comes with its own trade-offs and implementation considerations. The choice between them depends, as always, on the specific requirements of the use case, including data volume, query complexity, update frequency, and resource constraints. By evaluating these factors, you can pick the approach that offers the best performance with the lowest maintenance effort across the overall data engineering lifecycle.

Relationships to Other Patterns

As identified in the Convergent Evolution -> Design Patterns Overview, the data engineering design pattern based on the cache pattern is Dynamic Querying, a design pattern that shows how caching integrates with dynamic query patterns and other related data engineering approaches. In that chapter, we'll discuss implementation and use cases in more detail.

graph LR

    P_CachingDisk[P: Cache]
    DP_DynamicQuerying[DP: Dynamic-Querying]
    P_CachingDisk --> DP_DynamicQuerying

Do you have other use cases of implementing or applying a caching pattern? Do you see other characteristics or advantages of this pattern? Please let me know in the comments below. Otherwise, let's explore the next pattern in the following chapter.
