Best data structure for storing an ordered list, where each item is located on a separate server?

Submitted by: @import:stackexchange-cs

Problem

I need a structure that resembles an RSS feed of items, but each item is stored on a different machine on the network. Users can insert items only at the head of the feed, never in the middle. Deletions are possible.

My first idea was a basic linked list in which each node is an RSS item. Because the nodes live on different machines, retrieving the latest 10 feed items means waiting for each node to return from the network in sequence, since I need each node's pointer to find the next item.
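To make that cost concrete, here is a minimal sketch of the sequential traversal. The `FakeNetwork` class, its `fetch` method, and the node IDs are hypothetical stand-ins for real remote calls, not part of any actual system; the only point is that reading k items costs k dependent round trips.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One feed item; in the real system each node would live on its own machine."""
    item: str
    next_id: Optional[str]  # ID of the next (older) item, or None at the tail


class FakeNetwork:
    """Hypothetical stand-in for remote fetches; it only counts round trips."""

    def __init__(self, nodes: dict[str, Node]) -> None:
        self.nodes = nodes
        self.round_trips = 0

    def fetch(self, node_id: str) -> Node:
        self.round_trips += 1  # each call models one network round trip
        return self.nodes[node_id]


def read_head(net: FakeNetwork, head_id: Optional[str], k: int) -> list[str]:
    """Follow next-pointers from the head; each fetch must finish before the next starts."""
    items: list[str] = []
    node_id = head_id
    while node_id is not None and len(items) < k:
        node = net.fetch(node_id)
        items.append(node.item)
        node_id = node.next_id
    return items


net = FakeNetwork({
    "c": Node("newest", "b"),
    "b": Node("middle", "a"),
    "a": Node("oldest", None),
})
print(read_head(net, "c", 2))  # ['newest', 'middle']
print(net.round_trips)         # 2
```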

Skip lists make this more concurrent: each node also returns pointers to other portions of the list, so I can request several items in parallel.

However, I'm looking for something that matches my users' needs even more closely: there will be more requests for the latest portion of the feed than for the older portion. With this requirement in mind, I'd like a structure that allows higher concurrency near the head of the list (where the latest feed items are) and less concurrency (more sequential access) as users move toward the end of the list.

My thinking is that skip lists achieve the same level of concurrency across the entire list, so perhaps there's something I can use to make it "very concurrent" at the beginning and "less concurrent" at the end rather than "same concurrency" all along.

Solution

A standard approach is to have a distributed data structure, and then use a cache for efficiency. The cache remembers the results of common queries; if you ask the same query again, and the underlying data structure hasn't changed, then you can return the same remembered answer without needing to re-compute it.

For instance, you might store the items across networked machines however you like, for example as a singly linked list. Then you'd have a single machine M that is responsible for keeping a cache of the 10 items at the head of the list. Each time you insert a new item at the head of the list (or delete one of the cached items), you also need to notify M so that it invalidates its cache. (As an optimization, instead of throwing away all cached data, M might be able to update the cached information when it receives a notification, though this does introduce some additional concurrency challenges.)

Now any attempt to read the first items from the list goes first to M, which checks whether it has the answer cached and traverses the linked list over the network only on a cache miss.
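A minimal single-process sketch of this read path, under some loud assumptions: `ListStore` is a hypothetical in-memory stand-in for the linked list spread across machines, `HeadCache` plays the role of M, and the "notification" to M is collapsed into a direct method call. A real deployment would send that notification over the network and would have to handle concurrent inserts.

```python
from typing import Optional


class ListStore:
    """Hypothetical stand-in for the distributed singly linked list."""

    def __init__(self) -> None:
        # node id -> (item, id of the next older node)
        self.nodes: dict[int, tuple[str, Optional[int]]] = {}
        self.head: Optional[int] = None
        self.next_free_id = 0
        self.fetches = 0  # counts simulated network round trips

    def push(self, item: str) -> None:
        """Insert a new item at the head of the list."""
        self.nodes[self.next_free_id] = (item, self.head)
        self.head = self.next_free_id
        self.next_free_id += 1

    def traverse(self, k: int) -> list[str]:
        """Read up to k items from the head, one simulated fetch per node."""
        items: list[str] = []
        node_id = self.head
        while node_id is not None and len(items) < k:
            self.fetches += 1
            item, node_id = self.nodes[node_id]
            items.append(item)
        return items


class HeadCache:
    """Machine M: caches the first `size` items and is invalidated on insert."""

    def __init__(self, store: ListStore, size: int = 10) -> None:
        self.store = store
        self.size = size
        self.cached: Optional[list[str]] = None

    def insert(self, item: str) -> None:
        self.store.push(item)
        self.cached = None  # stands in for "notify M to invalidate its cache"

    def read_head(self, k: int) -> list[str]:
        if k > self.size:
            return self.store.traverse(k)  # request is larger than the cache
        if self.cached is None:
            self.cached = self.store.traverse(self.size)  # miss: refill once
        return self.cached[:k]


store = ListStore()
m = HeadCache(store, size=3)
for item in ["a", "b", "c", "d"]:
    m.insert(item)
print(m.read_head(2))  # ['d', 'c']
print(m.read_head(3))  # ['d', 'c', 'b']
print(store.fetches)   # 3
```

The second read is served entirely from M, so the fetch counter stays at 3. Deletions are not modeled here; they would need the same invalidation step.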

You'll then need to analyze how strict your consistency requirements are. Do you require that M always return the absolutely correct result? Or can the cache be temporarily inconsistent with the backing data structure for a short period of time? That will affect whether you need to use coordination mechanisms like 2-phase commit or whether you can use simpler, higher-performing mechanisms.
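If brief staleness is acceptable, one simple option is time-based expiry: M reloads the head only after a fixed interval and serves its remembered copy in between. This technique is not part of the answer above; it is just one possible example of a simpler mechanism. In the sketch below, `TtlHeadCache`, `load_head`, `ttl_seconds`, and the injected `clock` are all hypothetical names chosen for illustration.

```python
import time
from typing import Callable


class TtlHeadCache:
    """Serves a possibly stale copy of the head for up to `ttl_seconds`."""

    def __init__(
        self,
        load_head: Callable[[], list[str]],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.load_head = load_head      # assumed to traverse the real list
        self.ttl_seconds = ttl_seconds
        self.clock = clock              # injectable so the example is deterministic
        self.cached: list[str] = []
        self.expires_at = float("-inf")  # forces a load on the first read

    def read_head(self) -> list[str]:
        now = self.clock()
        if now >= self.expires_at:
            self.cached = list(self.load_head())
            self.expires_at = now + self.ttl_seconds
        return self.cached


feed = ["b", "a"]  # stand-in for the backing list, newest first
now = [0.0]        # fake clock so the run does not depend on real time
cache = TtlHeadCache(lambda: feed[:10], ttl_seconds=5.0, clock=lambda: now[0])

print(cache.read_head())  # ['b', 'a']
feed.insert(0, "c")       # a new item arrives; M is not notified
print(cache.read_head())  # ['b', 'a'] (stale until the entry expires)
now[0] = 5.0
print(cache.read_head())  # ['c', 'b', 'a']
```

The trade-off is that inserts need no coordination with M at all, at the price of readers seeing a head that is up to `ttl_seconds` old.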

I've described this in a particularly simple form, but you can generalize to (a) other backing data structures (not just a singly linked list), and (b) other caches for other kinds of queries that might be performed (not just the top 10 items from the head of the list).

Context

StackExchange Computer Science Q#51977, answer score: 2
