Singletons Going Steady

  

October 02, 2013

Minimizing page load times with hash-ring-ctypes, a fast Python hash ring

Using hash-ring-ctypes to parallelize asset loading.

Today, I released a new project on GitHub. It's called (droll, I know) hash-ring-ctypes, and it's a Python ctypes-based wrapper around libhashring.

At Pitchfork, we use a hash ring to parallelize site asset downloads from our CDN across four domains. Here's the gist: browsers will only open a limited number of simultaneous connections to a single domain, and we want to work around that limit to reduce page load time. The easiest solution is to shard our CDN domain into four segments using CNAMEs, creating a cluster of four domains: cdn.pitchfork.com through cdn4.pitchfork.com. We then populate a hash ring with those domains, creating four points (“nodes”) in the ring. The ring can then take a value (in our case, a relative path to an asset like “albumreviews/1/cover.jpg”) and return the corresponding node. Unless we add or remove nodes, the lookup is deterministic: the same path always maps to the same domain.

I realize that I haven't explained how nodes are placed into the ring, or how nodes are looked up. For a thorough explanation of consistent hashing and the concept of a hash ring, check out this and this.
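For a rough feel of the mechanics in the meantime, here is a minimal pure-Python sketch of a hash ring. It is not how libhashring or hash-ring-ctypes is implemented: the class name SketchRing, the MD5-based point function, and the "node:index" replica key format are all assumptions made for illustration.

```python
import bisect
import hashlib


def _point(key):
    """Map a string to an integer position on the ring (MD5 is an arbitrary choice)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class SketchRing(object):
    """Illustrative consistent-hash ring; hypothetical, not the hash_ring API."""

    def __init__(self, nodes=(), replicas=5):
        self.replicas = replicas
        self._points = []   # sorted ring positions
        self._owner = {}    # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # place the node on the ring `replicas` times
        for i in range(self.replicas):
            p = _point("%s:%d" % (node, i))
            self._owner[p] = node
            bisect.insort(self._points, p)

    def lookup(self, key):
        # find the first point at or after the key's position,
        # wrapping around to the start of the ring if needed
        i = bisect.bisect_left(self._points, _point(key))
        return self._owner[self._points[i % len(self._points)]]


ring = SketchRing(["cdn.pitchfork.com", "cdn2.pitchfork.com"])
first = ring.lookup("albumreviews/1/cover.jpg")
second = ring.lookup("albumreviews/1/cover.jpg")
```

Both lookups return the same domain, because the key's position on the ring and the node points around it do not change between calls.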

How do I use this thing?

For now, install from GitHub via pip:

$ pip install -e git+https://github.com/mattdennewitz/hash-ring-ctypes#egg=hash_ring

To use, import hash_ring and create an instance of its HashRing class. Nodes can be provided at creation via the nodes kwarg:

import hash_ring

nodes = ['cdn.pitchfork.com', 'cdn2.pitchfork.com', 'cdn3.pitchfork.com']

# create the hash ring with our cdn domains
ring = hash_ring.HashRing(nodes=nodes)

This class's default settings will create five “replicas” for each node you add. Replicating each node in this fashion helps keep any single node from being used much more than the rest. You can specify the number of replicas with the replicas kwarg:

# create a ring with 10 replicas
ring = hash_ring.HashRing(replicas=10, nodes=nodes)
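To see why the replica count matters, the sketch below counts how many paths land on each domain for a few replica settings. It uses a stand-in pure-Python ring rather than hash_ring itself; the helper names point and spread, the MD5 hash, and the "node:index" key format are assumptions for illustration, so the numbers will not match what libhashring produces.

```python
import bisect
import hashlib
from collections import Counter


def point(key):
    # MD5 is an arbitrary choice for this sketch
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def spread(nodes, replicas, keys):
    """Count how many keys land on each node for a given replica count."""
    pairs = sorted((point("%s:%d" % (n, i)), n)
                   for n in nodes for i in range(replicas))
    positions = [p for p, _ in pairs]
    counts = Counter()
    for key in keys:
        # first point at or after the key, wrapping around the ring
        i = bisect.bisect_left(positions, point(key)) % len(pairs)
        counts[pairs[i][1]] += 1
    return counts


nodes = ["cdn.pitchfork.com", "cdn2.pitchfork.com", "cdn3.pitchfork.com"]
paths = ["albumreviews/%d/cover.jpg" % i for i in range(2000)]

for replicas in (1, 5, 100):
    counts = spread(nodes, replicas, paths)
    print(replicas, [counts[n] for n in nodes])
```

In general, more replicas tend to even out the split, though the exact counts depend on the hash function and on the keys being looked up.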

Nodes can also be added and removed during the life of the ring:

# add 'cdn4'
ring.add_node('cdn4.pitchfork.com')

# remove 'cdn2'
ring.remove_node('cdn2.pitchfork.com')
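Changing the ring's membership changes where some paths land, and a path that moves to a different domain gets a different URL as far as browsers and caches are concerned. The sketch below, again using a stand-in pure-Python ring (the helpers point, build_ring, and lookup are hypothetical, not part of hash_ring), compares a three-node ring against a four-node ring to show which paths move.

```python
import bisect
import hashlib


def point(key):
    # MD5 is an arbitrary choice for this sketch
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def build_ring(nodes, replicas=5):
    """Return parallel lists of sorted ring positions and their owning nodes."""
    pairs = sorted((point("%s:%d" % (n, i)), n)
                   for n in nodes for i in range(replicas))
    return [p for p, _ in pairs], [n for _, n in pairs]


def lookup(ring, key):
    positions, owners = ring
    # first point at or after the key, wrapping around the ring
    return owners[bisect.bisect_left(positions, point(key)) % len(positions)]


base = ["cdn.pitchfork.com", "cdn2.pitchfork.com", "cdn3.pitchfork.com"]
three = build_ring(base)
four = build_ring(base + ["cdn4.pitchfork.com"])

paths = ["albumreviews/%d/cover.jpg" % i for i in range(1000)]
moved = [p for p in paths if lookup(three, p) != lookup(four, p)]
print("%d of %d paths changed domain" % (len(moved), len(paths)))
```

In this sketch, every path that moves ends up on the newly added domain, and paths owned by the other nodes stay put. That property is the main appeal of consistent hashing over something like hash(path) % len(nodes), where adding a node can reshuffle most keys.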

Now that we've populated our hash ring, we can use it to look up a domain for a specific asset:

# find the CDN domain for a given file path
fn = 'albumreviews/1/cover.jpg'
cdn_domain = ring.lookup(fn)

cdn_domain will be one of the CDN domains in the ring. The domain name and file path can be glued together to form the asset's CDN URL. Doing this for all assets on my page ensures that they're distributed across several domains, thus achieving higher parallelization of asset downloads.

Implementation

Here's a simple implementation to illustrate the entire process:

>>> import urlparse
>>> import hash_ring

# create hash ring for cdn domains
>>> nodes = ['cdn.pitchfork.com',  'cdn2.pitchfork.com',
...          'cdn3.pitchfork.com', 'cdn4.pitchfork.com']
>>> ring = hash_ring.HashRing(nodes=nodes)

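# urljoin needs a scheme on the base URL; 'http://' is assumed here
>>> def cdn_url(path):
...     return urlparse.urljoin('http://' + ring.lookup(path), path)

>>> cdn_url('albumreviews/1/header.jpg')
'http://cdn.pitchfork.com/albumreviews/1/header.jpg'

>>> cdn_url('albumreviews/10000/header.jpg')
'http://cdn2.pitchfork.com/albumreviews/10000/header.jpg'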

Register cdn_url with your template system, and you'll have your asset URLs automatically hashed to maximize parallel downloads.

#

I hope this library and explanation help. Please fork the project and file issues on GitHub. Documentation is available on RTFD.
