Difference between revisions of "Distributed Tile Caching Model"
|  (w00t) | |||
| (14 intermediate revisions by the same user not shown) | |||
| Line 1: | Line 1: | ||
| − | This design outlines a model for a [[Distributed Tile Caching|distributed tile cache]]. | + | This design outlines a model for a [[Distributed Tile Caching|distributed tile cache]]. The primary design goals are minimizing response latency for tile requests, and maintaining redundant storage of tiles across the cache. Secondary design goals are to allow non-caching clients to request tiles directly from the cache, and to protect tile integrity by only permitting whitelisted peers to store data in the cache. | 
| = Definitions = | = Definitions = | ||
| Line 7: | Line 7: | ||
| Each ''peer'' has its own persistent ''peer key'', generated randomly. | Each ''peer'' has its own persistent ''peer key'', generated randomly. | ||
| − | The '' | + | The ''layer configuration'' consists of a list of (name, source URL, source layers, SRS, bounding box, width, height, maximum level) tuples. The ''name'' is an arbitrary string consisting of alphanumeric characters. The ''maximum level'' value is an integer. The remaining values are as request parameters given by the OGC WMS 1.1.1 specification. | 
| − | The ''directory server'' will serve the directory via HTTP from  | + | The ''directory'' of peers consists of a list of (key, IP, port, weight) tuples, where ''weight'' is the bandwidth that the peer is willing to serve in KB/s, expressed as an integer. | 
| + | |||
| + | The ''directory server'' will serve the directory and layer configuration via HTTP from well-known URLs in gzip compressed whitespace-separated text. | ||
| = Peers = | = Peers = | ||
| − | == Discovering peers == | + | == Discovering other peers == | 
| # Request the directory listing from the server, passing the peer's directory tuple. | # Request the directory listing from the server, passing the peer's directory tuple. | ||
| − | # Normalize the weights of each peer. | + | # Normalize the weights of each peer to a float in the range [0, 1). | 
| # Create an empty balanced binary tree. | # Create an empty balanced binary tree. | ||
| # For every other peer listed: | # For every other peer listed: | ||
| ## Set the peer's timeout value to ''v''. | ## Set the peer's timeout value to ''v''. | ||
| − | ## Add the peer to the tree using its key. | + | ## Set the peer's message sequence ID to 0. | 
| + | ## Add a reference to the peer to the tree using its key. | ||
| ## Calculate ''r'' = ''normalized weight'' x 64. | ## Calculate ''r'' = ''normalized weight'' x 64. | ||
| ## For ''i'' in range(1, ''r''): | ## For ''i'' in range(1, ''r''): | ||
| ##* Calculate a subsidiary key by concatenating the peer key with the binary value of ''i'', and take the SHA-1 sum of the result. | ##* Calculate a subsidiary key by concatenating the peer key with the binary value of ''i'', and take the SHA-1 sum of the result. | ||
| − | ##* Insert the peer into the tree using the subsidiary key. | + | ##* Insert a reference to the peer into the tree using the subsidiary key. | 
| == Maintaining the local directory == | == Maintaining the local directory == | ||
| − | # After ''d''  | + | # After ''d'' seconds have passed, request a new directory listing with an ''If-Modified-Since'' header. | 
| # If the server responds with a 304, wait another ''d'' minutes and check again. | # If the server responds with a 304, wait another ''d'' minutes and check again. | ||
| # Otherwise: | # Otherwise: | ||
| #* For every peer in the binary tree not in the new listing, remove it. | #* For every peer in the binary tree not in the new listing, remove it. | ||
| #* For every peer in the directory listing not in the binary tree, add it. | #* For every peer in the directory listing not in the binary tree, add it. | ||
| + | #* For every peer in both the tree and the directory listing, reset its timeout counter to ''v''. | ||
| == Selecting peers for a given tile == | == Selecting peers for a given tile == | ||
| − | + | * Concatenate ''layer'' + ''level'' + ''row'' + ''column'' and take the SHA-1 sum. This is the ''tile key''. | |
| − | + | * Until at most ''k'' distinct peers are selected: | |
| − | + | *# Select the first peer from the binary tree with key greater than or equal to the tile key. | |
| − | # | + | *# If there are no matching peers in the tree, select the first peer in tree. | 
| − | # | + | *# If the selected peer's timeout counter is 0, unselect it. | 
| − | #* Set the tile key to the key of the peer just selected. | + | *# Set the tile key to the key of the peer just selected. | 
| + | *# If, after the first selection, the tile key is once again the key of the first peer selected, then there are fewer than ''k'' eligible peers, so return those. | ||
| − | == Seeding a tile == | + | == Issuing requests == | 
| + | |||
| + | === Seeding a tile === | ||
| # Fetch the tile from the data source (or render the tile or whatever). | # Fetch the tile from the data source (or render the tile or whatever). | ||
| # Select ''k'' peers for the tile. | # Select ''k'' peers for the tile. | ||
| − | |||
| # Send a PUT message to each peer asynchronously. | # Send a PUT message to each peer asynchronously. | ||
| − | ==  | + | === Fetching a tile === | 
| − | + | # Select ''k'' peers for the tile. | |
| + | # Send a GET message for the given tile to each of the selected peers asynchronously. | ||
| − | ==  | + | === Expiring a tile === | 
| − | + | * A DELETE request can cover a rectangular range of tiles at a given level for a given layer. | |
| − | + | * Start by selecting ''k'' peers for the lower left tile in the expiration request. | |
| − | #  | + | * Send a DELETE message for the given tiles to each of the selected peers asynchronously. | 
| − | #  | + | |
| − | # If  | + | === Pinging other peers === | 
| + | |||
| + | # Every ''p'' seconds, select the peer with the largest key less than our peer key, whose timeout counter is greater than 0. If there are no such peers, select the peer with the largest key whose timeout counter is greater than 0. | ||
| + | # Send that peer a PING message. | ||
| + | # If the peer does not respond within ''t'' seconds, decrement its timeout counter to a minimum of 0. | ||
| + | |||
| + | === Reseeding === | ||
| + | |||
| + | Every 2''p'' seconds, a peer should select a random tile from its local cache and reseed it in the network (if the local peer's cache supports random access). | ||
| + | |||
| + | == Receiving requests == | ||
| − | == Responding to a GET request == | + | === Responding to a GET request === | 
| − | + | * If the tile is present in the local cache, send a PUT message in response. | |
| − | #  | + | * Otherwise: | 
| − | # If  | + | *# Send a PONG message to keep the other peer from timing us out. | 
| − | #  | + | *# If the tile does not belong to a layer that the peer is configured to cache, ignore the request. | 
| + | *# Otherwise, select ''k'' peers for the tile. | ||
| + | *# If our peer key is greater than or equal to the tile key, and less than or equal to the first key selected: | ||
| + | *#* Fetch the tile from its data source. | ||
| + | *#* Send a PUT message in response. | ||
| + | *#* Seed the tile in the network. | ||
| − | == Receiving  | + | === Receiving a PUT request === | 
| − | + | * If the tile does not belong to a layer that the peer is configured to cache, discard it. | |
| − | + | * Otherwise, store the tile in the local cache. | |
| − | |||
| − | |||
| − | ==  | + | === Receiving a DELETE request === | 
| − | + | A peer should keep track of the last 2''k'' DELETE messages received. | |
| − | #  | + | * If this DELETE message is a duplicate, discard it. | 
| − | # If the peer  | + | * Otherwise: | 
| − | # Otherwise,  | + | *# Remove the tile(s) from the local cache. | 
| + | *# Select ''k'' peers for the lower left tile of the DELETE request. | ||
| + | *# If the tile key is greater than the peer's key but less than the first peer selected, stop. | ||
| + | *# Otherwise, propagate the DELETE request to the ''k'' peers asynchronously. | ||
| − | == Receiving PINGs == | + | === Receiving PINGs === | 
| Every PING message should be responded to with a matching PONG message. | Every PING message should be responded to with a matching PONG message. | ||
| Line 90: | Line 113: | ||
| = Directory service = | = Directory service = | ||
| − | + | * When a peer requests the directory listing, store its tuple in the directory, along with the time it made the request. | |
| − | + | * Every 2''d'' seconds, remove any peers from the directory that have not made a directory request since the last check. | |
| − | + | * Peers can and perhaps should be whitelisted to insure data integrity over the network. Peers not on the whitelist should not be added to the directory. | |
| − | + | * The directory listing should be about 60 bytes per peer. With 10,000 peers, and assuming a 6:1 gzip compression ratio, a fresh listing should be at most 100k compressed. | |
| − | most 100k compressed. | + | * If/when the directory listing needs to contain more than ~10,000 peers, a Kademlia-like mechanism for thinning keys farther away from a given peer can be used to keep the list that any one peer sees down to a reasonable number. | 
| − | + | * The directory service can seek high availability through a shared database backend and round-robin DNS. | |
| − | = Plausible  | + | = Plausible parameter values = | 
| − | |||
| − | |||
| − | |||
| * ''k'' = 3 | * ''k'' = 3 | ||
| + | * ''t'' = 1 second | ||
| + | * ''v'' = 8 | ||
| + | * ''d'' = 600 seconds | ||
| + | * ''p'' = 30 seconds | ||
| = Protocol format = | = Protocol format = | ||
| Line 126: | Line 150: | ||
| == Message sequence == | == Message sequence == | ||
| − | + | Before sending a message, a peer should increment its own internal message sequence ID. | |
| + | |||
| + | == Message integrity == | ||
| + | |||
| + | To ensure message integrity, a peer should make the following checks when a message is received: | ||
| + | |||
| + | # Compare the peer key embedded in the message with the sending IP's recorded peer key. If they do not match, and the message is a PUT or DELETE message, discard it. | ||
| + | # Compare the checksum embedded in the message with the checksum of the payload. If they do not match, discard the message. | ||
| + | # Check the message sequence against the sending peer's most recent sequence ID. | ||
| + | #* If the message sequence is less than or equal to the previously set sequence ID for that peer, discard the message. | ||
| + | #* Otherwise, update the peer's sequence ID and reset its timeout counter to ''v''. | ||
| == PING messages == | == PING messages == | ||
| Line 147: | Line 181: | ||
| A DELETE message payload consists of the tuple (Layer, Level, MinRow, MinCol, MaxRow, MaxCol). | A DELETE message payload consists of the tuple (Layer, Level, MinRow, MinCol, MaxRow, MaxCol). | ||
| + | |||
| + | = References = | ||
| + | |||
| + | * This model is based largely on the Kademlia algorithm discussed at length in [[Distributed Tile Caching]], with the addition of a directory server to keep latency down. | ||
| + | * [http://www8.org/w8-papers/2a-webserver/caching/paper2.html Web Caching with Consistent Hashing] is a seminal work in distributed caching. | ||
| + | * [http://code.sixapart.com/svn/memcached/trunk/api/perl/dev/cons-hash.pl A pure-Perl implementation] of the algorithm described in the above paper. | ||
Latest revision as of 15:32, 13 November 2006
This design outlines a model for a distributed tile cache. The primary design goals are minimizing response latency for tile requests, and maintaining redundant storage of tiles across the cache. Secondary design goals are to allow non-caching clients to request tiles directly from the cache, and to protect tile integrity by only permitting whitelisted peers to store data in the cache.
Definitions
A key is a 20 byte SHA-1 sum.
Each peer has its own persistent peer key, generated randomly.
The layer configuration consists of a list of (name, source URL, source layers, SRS, bounding box, width, height, maximum level) tuples. The name is an arbitrary string consisting of alphanumeric characters. The maximum level value is an integer. The remaining values are as request parameters given by the OGC WMS 1.1.1 specification.
The directory of peers consists of a list of (key, IP, port, weight) tuples, where weight is the bandwidth that the peer is willing to serve in KB/s, expressed as an integer.
The directory server will serve the directory and layer configuration via HTTP from well-known URLs in gzip compressed whitespace-separated text.
Peers
Discovering other peers
- Request the directory listing from the server, passing the peer's directory tuple.
- Normalize the weights of each peer to a float in the range [0, 1).
- Create an empty balanced binary tree.
- For every other peer listed:
- Set the peer's timeout value to v.
- Set the peer's message sequence ID to 0.
- Add a reference to the peer to the tree using its key.
- Calculate r = normalized weight x 64.
- For i in range(1, r):
- Calculate a subsidiary key by concatenating the peer key with the binary value of i, and take the SHA-1 sum of the result.
- Insert a reference to the peer into the tree using the subsidiary key.
 
 
Maintaining the local directory
- After d seconds have passed, request a new directory listing with an If-Modified-Since header.
- If the server responds with a 304, wait another d minutes and check again.
- Otherwise:
- For every peer in the binary tree not in the new listing, remove it.
- For every peer in the directory listing not in the binary tree, add it.
- For every peer in both the tree and the directory listing, reset its timeout counter to v.
 
Selecting peers for a given tile
- Concatenate layer + level + row + column and take the SHA-1 sum. This is the tile key.
- Until at most k distinct peers are selected:
- Select the first peer from the binary tree with key greater than or equal to the tile key.
- If there are no matching peers in the tree, select the first peer in tree.
- If the selected peer's timeout counter is 0, unselect it.
- Set the tile key to the key of the peer just selected.
- If, after the first selection, the tile key is once again the key of the first peer selected, then there are fewer than k eligible peers, so return those.
 
Issuing requests
Seeding a tile
- Fetch the tile from the data source (or render the tile or whatever).
- Select k peers for the tile.
- Send a PUT message to each peer asynchronously.
Fetching a tile
- Select k peers for the tile.
- Send a GET message for the given tile to each of the selected peers asynchronously.
Expiring a tile
- A DELETE request can cover a rectangular range of tiles at a given level for a given layer.
- Start by selecting k peers for the lower left tile in the expiration request.
- Send a DELETE message for the given tiles to each of the selected peers asynchronously.
Pinging other peers
- Every p seconds, select the peer with the largest key less than our peer key, whose timeout counter is greater than 0. If there are no such peers, select the peer with the largest key whose timeout counter is greater than 0.
- Send that peer a PING message.
- If the peer does not respond within t seconds, decrement its timeout counter to a minimum of 0.
Reseeding
Every 2p seconds, a peer should select a random tile from its local cache and reseed it in the network (if the local peer's cache supports random access).
Receiving requests
Responding to a GET request
- If the tile is present in the local cache, send a PUT message in response.
- Otherwise:
- Send a PONG message to keep the other peer from timing us out.
- If the tile does not belong to a layer that the peer is configured to cache, ignore the request.
- Otherwise, select k peers for the tile.
- If our peer key is greater than or equal to the tile key, and less than or equal to the first key selected:
- Fetch the tile from its data source.
- Send a PUT message in response.
- Seed the tile in the network.
 
 
Receiving a PUT request
- If the tile does not belong to a layer that the peer is configured to cache, discard it.
- Otherwise, store the tile in the local cache.
Receiving a DELETE request
A peer should keep track of the last 2k DELETE messages received.
- If this DELETE message is a duplicate, discard it.
- Otherwise:
- Remove the tile(s) from the local cache.
- Select k peers for the lower left tile of the DELETE request.
- If the tile key is greater than the peer's key but less than the first peer selected, stop.
- Otherwise, propagate the DELETE request to the k peers asynchronously.
 
Receiving PINGs
Every PING message should be responded to with a matching PONG message.
Directory service
- When a peer requests the directory listing, store its tuple in the directory, along with the time it made the request.
- Every 2d seconds, remove any peers from the directory that have not made a directory request since the last check.
- Peers can and perhaps should be whitelisted to insure data integrity over the network. Peers not on the whitelist should not be added to the directory.
- The directory listing should be about 60 bytes per peer. With 10,000 peers, and assuming a 6:1 gzip compression ratio, a fresh listing should be at most 100k compressed.
- If/when the directory listing needs to contain more than ~10,000 peers, a Kademlia-like mechanism for thinning keys farther away from a given peer can be used to keep the list that any one peer sees down to a reasonable number.
- The directory service can seek high availability through a shared database backend and round-robin DNS.
Plausible parameter values
- k = 3
- t = 1 second
- v = 8
- d = 600 seconds
- p = 30 seconds
Protocol format
Protocol messages will be served via UDP.
Each message is a tuple consisting of (Peer Key, Type, Sequence, Checksum, Payload), for a total of 29 + n bytes.
Message type may be one of:
- PING
- PONG
- GET
- PUT
- DELETE
Message sequence must be a monotonically increasing 32-bit number.
Message checksum is a CRC-32 checksum of the data payload
The message payload takes up the remainder of the message.
Message sequence
Before sending a message, a peer should increment its own internal message sequence ID.
Message integrity
To ensure message integrity, a peer should make the following checks when a message is received:
- Compare the peer key embedded in the message with the sending IP's recorded peer key. If they do not match, and the message is a PUT or DELETE message, discard it.
- Compare the checksum embedded in the message with the checksum of the payload. If they do not match, discard the message.
- Check the message sequence against the sending peer's most recent sequence ID.
- If the message sequence is less than or equal to the previously set sequence ID for that peer, discard the message.
- Otherwise, update the peer's sequence ID and reset its timeout counter to v.
 
PING messages
PING messages have a checksum of 0 and no payload.
PONG messages
A PONG message payload consists of the sequence number from the corresponding PING packet.
GET messages
A GET message payload consists of the tuple (Layer, Level, Row, Column). The layer value is a zero-terminated string. The row and column values are 32-bit integers.
PUT messages
A PUT message payload consists of the tuple (Layer, Level, Row, Column, Data).
DELETE messages
A DELETE message payload consists of the tuple (Layer, Level, MinRow, MinCol, MaxRow, MaxCol).
References
- This model is based largely on the Kademlia algorithm discussed at length in Distributed Tile Caching, with the addition of a directory server to keep latency down.
- Web Caching with Consistent Hashing is a seminal work in distributed caching.
- A pure-Perl implementation of the algorithm described in the above paper.