IBM recently released object storage traces that reflect cloud object storage workloads and contributed these traces to the Storage Networking Industry Association (SNIA).
The traces that IBM contributed include read and write requests made against objects in a cloud-based object store. They can help us understand the behavior of cloud workloads and drive new research into improving the cloud. There is significant academic interest in real data access traces that can be used to investigate various aspects of workloads. Although file system and block traces are readily available, until now there were no publicly available access traces for object storage. That’s why IBM decided to make these traces available to the community, and we’re looking forward to seeing the research insights that follow.
Using traces to explore cloud cache policies
For example, our team at IBM Research used these traces to explore how the classical FIFO and LRU cache policies apply to object storage. We summarized our findings in a paper at HotStorage 2020 titled “It’s Time to Revisit LRU vs. FIFO.” The paper explores modern cache systems that can be deployed at scales undreamed of just a few years ago.
With the advent of big data and cloud computing, a cache can hold terabytes of data or more. Using these traces, we were able to contrast different methods for managing a large-scale cache. The traces enabled us to revisit the question of how effective the popular LRU cache eviction policy is compared with the FIFO heuristic, which attempts to offer LRU-like behavior.
Several past works have considered this question and commonly concluded that while FIFO is much easier to implement, the improved hit ratio of LRU outweighs this simplicity. We found that two main trends call for a reevaluation of this premise.
The first trend is that new caches, such as front ends to cloud storage, are very large, which makes managing cache metadata in RAM no longer feasible. The second trend is new types of workloads. Using the insight gained from the traces, we were able to substantiate the case for this reevaluation and demonstrate cases where FIFO provides better performance characteristics than the commonly used LRU algorithm.
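To make the two policies concrete, here is a minimal Python sketch that replays a sequence of object IDs through a FIFO cache and an LRU cache and reports the hit ratio of each. It is an illustration only and not the simulator used in the paper: the function names are hypothetical, the capacity is counted in objects rather than bytes, and all cache metadata is kept in memory.

```python
from collections import OrderedDict, deque


def fifo_hit_ratio(requests, capacity):
    """Replay `requests` (a list of object IDs) through a FIFO cache.

    Assumption: `capacity` is a number of objects (at least 1), not bytes.
    """
    cached = set()
    order = deque()  # insertion order, oldest entry on the left
    hits = 0
    for obj in requests:
        if obj in cached:
            hits += 1  # a hit leaves the eviction order untouched
            continue
        if len(cached) >= capacity:
            cached.discard(order.popleft())  # evict the oldest insertion
        cached.add(obj)
        order.append(obj)
    return hits / len(requests) if requests else 0.0


def lru_hit_ratio(requests, capacity):
    """Replay `requests` through an LRU cache of `capacity` objects (>= 1)."""
    cache = OrderedDict()  # least recently used entry comes first
    hits = 0
    for obj in requests:
        if obj in cache:
            hits += 1
            cache.move_to_end(obj)  # a hit refreshes the entry's recency
            continue
        if len(cache) >= capacity:
            cache.popitem(last=False)  # evict the least recently used entry
        cache[obj] = True
    return hits / len(requests) if requests else 0.0
```

The only difference between the two functions is what happens on a hit: LRU has to update its ordering on every access, while FIFO updates metadata only on a miss. That extra bookkeeping is part of what makes LRU harder to manage when the metadata no longer fits in RAM.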
Insights that can optimize cache research
The object storage traces are a treasure trove of information for optimizing cloud workloads. They provide insight for cache research, in general, and particularly for large-scale hybrid cloud caches. While IBM has been using these traces to reevaluate eviction policies for large-scale caches, other groups have expressed interest in these traces for other uses. For example, an academic group at a leading university has been using the traces to study the effect of different cache policies on variable-sized data.
A closer look at what’s inside the object storage traces
The IBM object storage traces are a set of anonymized traces that IBM is making available to the broader research community. The data set is composed of 98 traces containing around 1.6 billion requests for 342 million unique objects, and is about 88 GB in size. Each trace contains the REST operations issued against a single bucket in IBM Cloud Object Storage during the same single week in 2019. Each trace was selected based on a single criterion: that it contains some read (i.e., GET OBJECT) requests.
Each trace contains GET OBJECT, PUT OBJECT, HEAD OBJECT, and DELETE OBJECT requests, where each request includes a timestamp, the request type, the object ID, the total object size, and a starting and an ending offset. Only successful requests (i.e., those that returned a return code of 200) are listed. The data set was originally intended to enable the study of cache behavior, so requests that were not served were of no interest.
Bucket names are omitted, and objects are represented as IDs generated through a one-way keyed hash function.
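As an illustration of what a one-way keyed hash looks like, the following Python sketch maps an object name to a 16-character hexadecimal ID using HMAC-SHA256. This is an assumption for illustration only: we do not specify here which function or key was used to produce the trace IDs, and the function name `anonymize_object_name` is hypothetical.

```python
import hashlib
import hmac


def anonymize_object_name(name: str, key: bytes) -> str:
    """Map an object name to a fixed-length hexadecimal ID.

    Assumption: HMAC-SHA256 truncated to 16 hex characters. The function
    actually used to anonymize the traces is not stated in this post.
    """
    digest = hmac.new(key, name.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]
```

With a scheme of this kind, the same object name always maps to the same ID under a given key, so access patterns are preserved, while the original name cannot be recovered from the ID without the key.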
The format of each trace record is <timestamp of request> <request type> <object ID> <optional: size of object> <optional: beginning offset> <optional: ending offset>, with fields separated by spaces as in the sample records. The timestamp is the number of milliseconds since we began collecting the traces.
For example:
- 1219008 REST.PUT.OBJECT 8d4fcda3d675bac9 1056
- 1221974 REST.HEAD.OBJECT 39d177fb735ac5df 528
- 1232437 REST.HEAD.OBJECT 3b8255e0609a700d 1456
- 1232488 REST.GET.OBJECT 95d363d3fbdc0b03 1168 0 1167
- 1234545 REST.GET.OBJECT bfc07f9981aa6a5a 528 0 527
- 1256364 REST.HEAD.OBJECT c27efddbeef2b638 12752
- 1256491 REST.HEAD.OBJECT 13943e909692962f 9760
- 1256556 REST.GET.OBJECT 884ba9b0c6d1fe97 23872 0 23871
- 1256584 REST.HEAD.OBJECT d86b7bfefc63995d 12592
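Records like the ones above are straightforward to parse. Here is a minimal Python sketch, assuming one space-separated record per line with no leading list marker, and with the optional fields appearing in the order size, beginning offset, ending offset. The names `TraceRecord` and `parse_record` are our own for this example and are not part of the data set.

```python
from typing import NamedTuple, Optional


class TraceRecord(NamedTuple):
    timestamp_ms: int            # milliseconds since trace collection began
    op: str                      # e.g. "REST.GET.OBJECT"
    object_id: str               # anonymized object ID
    size: Optional[int] = None   # object size, if present
    start: Optional[int] = None  # beginning offset, if present
    end: Optional[int] = None    # ending offset, if present


def parse_record(line: str) -> TraceRecord:
    """Parse one space-separated trace record into a TraceRecord."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields, got {len(fields)}: {line!r}")
    # Up to three optional integers follow the three mandatory fields.
    numbers = [int(f) for f in fields[3:6]]
    numbers += [None] * (3 - len(numbers))
    return TraceRecord(int(fields[0]), fields[1], fields[2], *numbers)
```

For example, `parse_record("1232488 REST.GET.OBJECT 95d363d3fbdc0b03 1168 0 1167")` yields a record with `size=1168`, `start=0`, and `end=1167`, while a record without offsets, such as the PUT OBJECT record above, leaves `start` and `end` as `None`.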
Learn more
The IBM Cloud Object Storage traces are a set of object storage workload traces that can facilitate cache, object storage, and cloud research. The traces are now available on the SNIA site. We hope you will use them and find them as insightful as we have.