Thursday, November 22, 2012

Tornado and Blob Chunking

The Tornado web framework is a great choice for implementing an application that anticipates lots of client connections. Its claim to fame is that it can support thousands of concurrent users, and it does all of this in a single-threaded IO loop. That loop can block on certain IO operations, though, such as serving large blob objects.

The trick with Tornado is to understand where this blocking behavior occurs, and how it contrasts with the threaded approach to handling many client connections concurrently. With a threaded approach, the request handler (the code the application developer writes to fulfill user requests) doesn't need to do anything special. The framework takes care of thread management for any given request. With IO loops, we need to be conscious of the fact that only one request handler runs at any given time. So the trick is this: exit from the request handler as quickly as possible.
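To put rough numbers on that, here is a toy sketch of a single-threaded loop that runs each handler to completion before touching the next one. It is not Tornado code. The `run_loop` function, the handler names, and the costs are all made up for illustration, and the clock is simulated rather than real.

```python
from collections import deque


def run_loop(handlers):
    """Run each (name, cost) handler to completion, one at a time.

    Returns a dict mapping handler name to the simulated time at which
    that handler finished.
    """
    clock = 0
    finished = {}
    queue = deque(handlers)
    while queue:
        name, cost = queue.popleft()
        # The loop can do nothing else while this handler is running.
        clock += cost
        finished[name] = clock
    return finished


# Hypothetical workload: one large blob request ahead of two tiny requests.
finish_times = run_loop([("blob", 100), ("ping-1", 1), ("ping-2", 1)])
print(finish_times)
```

The two tiny requests each need one tick of work, yet they finish at ticks 101 and 102 because they sit behind the blob. That waiting is what exiting the handler early is meant to avoid.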

Serving large blobs helps to illustrate this idea because such requests typically take a while to complete. Imagine you have a request handler that takes care of serving these blobs, and you have 10 users all asking for one at the same time. The request handler can only do one thing at a time, so the first download starts right away, while the user at the end of the line has to wait for the other nine and isn't so impressed. Tornado has some asynchronous facilities to help mitigate scenarios such as these. Remember, we want to exit the request handler as quickly as possible. Take a look at this really simple blob server.

import os.path
import httplib
from tornado.options import define, options
from tornado.ioloop import IOLoop
from tornado.web import Application, RequestHandler, asynchronous

define(
    'blob',
    default = '',
    type = str
)

define(
    'chunk_size',
    default = 1024 * 1024,
    type = int
)

class FileHandler(RequestHandler):

    content_type = 'application/octet-stream'

    @asynchronous
    def get(self, *args):
        # Return right away; the callbacks below finish the response later.
        IOLoop.instance().add_callback(self.init_blob)

    def init_blob(self):
        try:
            # Open in binary mode so the blob's bytes are served unmodified.
            self.blob = open(options.blob, 'rb')
        except IOError as exc:
            self.set_status(httplib.NOT_FOUND)
            self.finish(str(exc))
            return
        blob_name = os.path.basename(self.blob.name)
        blob_size = os.path.getsize(self.blob.name)
        self.set_header('Content-Type', self.content_type)
        self.set_header('Content-Length', blob_size)
        self.set_header(
            'Content-Disposition',
            'inline; filename="%s"' % blob_name
        ) 
        IOLoop.instance().add_callback(self.send_chunk)

    def send_chunk(self):
        try:
            chunk = self.blob.read(options.chunk_size)
        except IOError as exc:
            # Release the file handle before giving up on the request.
            self.blob.close()
            self.set_status(httplib.INTERNAL_SERVER_ERROR)
            self.finish(str(exc))
            return
        if chunk:
            self.write(chunk)
            self.flush()
            IOLoop.instance().add_callback(self.send_chunk)
        else:
            self.blob.close()
            self.finish()

def main():
    options.parse_command_line()
    Application([
        (r'/', FileHandler),
    ]).listen(8888)
    IOLoop.instance().start()

if __name__ == "__main__":
    main()

Notice that the main job of the FileHandler.get() method is to add a callback handler, after which we return immediately. The asynchronous decorator keeps the connection open after get() returns, so the response isn't finished until we call finish() ourselves. This allows the IO loop to move on to the next request. The callback we've registered is FileHandler.init_blob(), which prepares the blob file that we're to serve in this request. Once the blob is prepared, we register yet another callback: FileHandler.send_chunk(). This is the meat of the file serving, where each IO loop iteration sends a chunk of the blob back to the client. We repeatedly add new callbacks that send chunks to the client until the entire blob has been transferred.
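The rescheduling pattern is easier to see with Tornado taken out of the picture. The sketch below is an illustration only. `ToyLoop` and `ToyDownload` are hypothetical stand-ins, the "loop" is just a FIFO queue of callbacks, and the "client" is a bytes buffer. The shape of send_chunk() mirrors the handler above: read a chunk, hand it over, and put yourself back in the queue.

```python
import io
from collections import deque


class ToyLoop(object):
    """Hypothetical stand-in for an IO loop: a FIFO queue of callbacks."""

    def __init__(self):
        self.callbacks = deque()

    def add_callback(self, callback):
        self.callbacks.append(callback)

    def start(self):
        # Run callbacks one at a time until none are left.
        while self.callbacks:
            self.callbacks.popleft()()


class ToyDownload(object):
    """Sends an in-memory blob one chunk per loop iteration."""

    def __init__(self, loop, name, data, chunk_size, log):
        self.loop = loop
        self.name = name
        self.blob = io.BytesIO(data)
        self.chunk_size = chunk_size
        self.log = log
        self.received = b""

    def send_chunk(self):
        chunk = self.blob.read(self.chunk_size)
        if chunk:
            self.received += chunk                   # "write" to the client
            self.log.append(self.name)               # record who got a turn
            self.loop.add_callback(self.send_chunk)  # queue the next chunk
        else:
            self.blob.close()


loop = ToyLoop()
log = []
first = ToyDownload(loop, "A", b"a" * 6, 2, log)
second = ToyDownload(loop, "B", b"b" * 4, 2, log)
loop.add_callback(first.send_chunk)
loop.add_callback(second.send_chunk)
loop.start()
print(log)
```

The log shows the two downloads taking turns instead of B waiting for all of A. Each client still receives its complete blob, just spread across more loop iterations.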

A little awkward compared to a more monolithic request handler? Maybe. This approach does, however, allow you to "chunk" rather than "thread". Just another way of looking at the problem.