Hybrid fault tolerance in distributed in-memory storage systems

Abstract

An in-memory storage system provides submillisecond latency and improves the concurrency of user applications by caching data into memory from external storage. Fault tolerance of in-memory storage systems is essential, as the loss of cached data requires access to data from external storage, which evidently increases the response latency. Typically, replication and erasure code (EC) are two fault-tolerant schemes that pose different trade-offs between access performance and storage usage. To help make the best performance and space trade-off, we design ElasticMem, a hybrid fault-tolerant distributed in-memory storage system that supports elastic redundancy transition to dynamically change the fault-tolerant scheme. ElasticMem exploits a novel EC-oriented replication (EOR) that carefully designs the data placement of replication according to the future data layout of EC to enhance the I/O efficiency of redundancy transition. ElasticMem solves the consistency problem caused by concurrent data accesses via a lightweight table-based scheme combined with data bypassing. It detects corelated read and write requests and serves subsequent read requests with local data. We implement a prototype that realizes ElasticMem based on Memcached. Experiments show that ElasticMem remarkably reduces the time of redundancy transition, the overall latency of corelated concurrent data accesses, and the latency of single data access among them.

FullText(HTML)