
Any examples of a specification of an S3-like object store API?




I'm looking for some example specifications of an eventually consistent object store, such as Amazon S3.

This isn't because I plan to implement one. It's because I want to build on top of such an object store: first use a consistent database to address the inconsistency problems (similar to Netflix's s3mper - http://techblog.netflix.com/2014/01/s3mper-consistency-in-cloud.html ), then implement an O(1) output committer for Hadoop/Tez/Spark which handles race conditions between multiple (speculative) competing executors and is resilient to failures. Those are precisely the features that you don't get when working with S3 today. And of course, to show that such a mechanism works, a bit of formality can only help.

I know AWS and perhaps the Azure team have been using TLA+ internally: are there any specifications of the exposed behaviours of the S3 and Azure Data Lake stores around? Something from the service owners themselves would be best. Otherwise: has anyone done some examples of object stores with create consistency, queued metadata updates for object listings, and asynchronous delete/update operations?

For extra fun, S3 appears to briefly cache negative lookups, so that while the creation sequence always holds

~ exists(path)
PUT(path)
GET(path)

an initial GET could leave a negative result cached, which the next GET would then retrieve. So the following sequence is not guaranteed to succeed, at least if there is "not enough" delay between the PUT and the subsequent GET.

~ GET(path)
PUT(path)
GET(path)

While I don't want to go near that problem, it exists, so I'd better write it down and code for it.
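To make the negative-lookup problem concrete, here is a minimal Python sketch of a toy store that caches misses. Everything in it is an assumption made for illustration: the class name, the `negative_ttl` parameter, the logical clock, and the choice that a PUT does not invalidate a cached miss. It is not a description of how S3 is actually implemented, only a model of the observed behaviour described above.

```python
class NegativeCachingStore:
    """Toy model of a store whose front end briefly caches failed lookups."""

    def __init__(self, negative_ttl):
        self.negative_ttl = negative_ttl  # hypothetical cache lifetime, in ticks
        self.now = 0                      # logical clock, advanced explicitly
        self.objects = {}                 # path -> data (authoritative state)
        self.negative_cache = {}          # path -> tick at which the cached miss expires

    def tick(self, n=1):
        """Advance the logical clock by n ticks."""
        self.now += n

    def put(self, path, data):
        # Assumption: a PUT does not invalidate a cached miss for the path.
        self.objects[path] = data

    def get(self, path):
        expiry = self.negative_cache.get(path)
        if expiry is not None:
            if self.now < expiry:
                return None               # stale cached miss is served
            del self.negative_cache[path]
        if path in self.objects:
            return self.objects[path]
        # Record the miss so that later lookups are answered from the cache.
        self.negative_cache[path] = self.now + self.negative_ttl
        return None


store = NegativeCachingStore(negative_ttl=3)

# Sequence 1: no lookup before the PUT, so the GET sees the object.
store.put("a", b"1")
assert store.get("a") == b"1"

# Sequence 2: a GET before the PUT poisons the cache for a while.
assert store.get("b") is None
store.put("b", b"2")
assert store.get("b") is None   # "not enough" delay after the PUT
store.tick(3)
assert store.get("b") == b"2"   # visible once the cached miss has expired
```

The point of the model is that the two sequences differ only in whether a lookup happened before the PUT, which is exactly the state a specification would have to track.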

Right now I'm not even sure how best to define that "eventually" concept, except to say that after some time t the observed state of updated/deleted objects will change, with the value of t varying across different infrastructure instances.
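One possible reading of "eventually" is a single bound t on how long an update or delete can stay invisible. The Python sketch below models that under loud assumptions: one fixed `delay` per store instance, a logical clock, immediate visibility for newly created objects (create consistency), and hypothetical names throughout. A real store may have no such fixed bound, so this is a starting point for a specification rather than a statement about any service.

```python
class EventuallyConsistentStore:
    """Toy model: creates are visible at once, updates/deletes after `delay` ticks."""

    def __init__(self, delay):
        self.delay = delay    # the "t" for this infrastructure instance (assumed fixed)
        self.now = 0          # logical clock, advanced explicitly
        self.visible = {}     # path -> data as observed by readers
        self.pending = []     # queued (visible_at, path, data); data None means delete

    def tick(self, n=1):
        """Advance the clock and apply every queued change that is now due."""
        self.now += n
        ready = [p for p in self.pending if p[0] <= self.now]
        self.pending = [p for p in self.pending if p[0] > self.now]
        for _, path, data in ready:       # queue order is preserved
            if data is None:
                self.visible.pop(path, None)
            else:
                self.visible[path] = data

    def put(self, path, data):
        has_pending = any(p[1] == path for p in self.pending)
        if path not in self.visible and not has_pending:
            self.visible[path] = data     # create: immediately visible
        else:
            self.pending.append((self.now + self.delay, path, data))  # update

    def delete(self, path):
        self.pending.append((self.now + self.delay, path, None))

    def get(self, path):
        return self.visible.get(path)


store = EventuallyConsistentStore(delay=2)
store.put("k", b"v1")
assert store.get("k") == b"v1"   # create consistency
store.put("k", b"v2")
assert store.get("k") == b"v1"   # update not yet observable
store.tick(2)
assert store.get("k") == b"v2"   # observable after t
store.delete("k")
assert store.get("k") == b"v2"   # delete not yet observable
store.tick(2)
assert store.get("k") is None
```

Making `delay` a constructor parameter keeps t as an explicit property of each store instance, so any claim about a committer can be stated relative to it.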


-Steve