Hey folks,
the cloud promises an abundance of CPU cycles and huge opportunities for parallel processing. However, before you can tap into this potential, you need to have a good mechanism for distribution. Hashing can help, but there are other options as well. One of them is to use a queue.
So how does it work? Well, essentially a queue is a big list where new items are added to the bottom and existing items are read from the top. So let's say you want to encode 10,000 videos using 100 machines. A queue helps here by having one machine loop over all videos and put a message in the queue for each one of them. This message is a text string (JSON is nice for this) that represents one job of converting one video. So you end up with a queue that has 10,000 messages in it, each representing one job.
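To make the producer side concrete, here is a minimal sketch of the loop that turns videos into job messages. The function name buildJobMessages and the message fields job and video_id are made up for illustration; a real setup would send each string to the queue instead of collecting them in an array.

```php
<?php
// Hypothetical helper: builds one JSON job message per video ID.
// The field names "job" and "video_id" are assumptions for this sketch.
function buildJobMessages(array $videoIds): array
{
    $messages = [];
    foreach ($videoIds as $id) {
        // One message represents one job of converting one video.
        $messages[] = json_encode(['job' => 'encode_video', 'video_id' => $id]);
    }
    return $messages;
}

$messages = buildJobMessages(range(1, 10000));
echo count($messages), " messages\n"; // 10000 messages
echo $messages[0], "\n";              // {"job":"encode_video","video_id":1}
```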
Now you can use EC2 to launch 100 High CPU instances at $0.20 / hour. Each one of these workers now runs a program in an infinite loop that queries the queue for job messages. As soon as a message is fetched, it becomes invisible to all other workers querying the queue. The period during which it stays hidden is called the visibility timeout, and you define how many seconds it lasts. During this time the worker that fetched the message is responsible for completing the job. If it succeeds, it sends a request to the queue to delete the message. If the worker fails (for example because it crashes), the message will appear in the queue again after the timeout, and other workers can process it.
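The fetch / hide / delete cycle is easier to see in a toy model. The sketch below is an in-memory stand-in, not the SQS API: the class ToyQueue and its methods are invented for illustration, and the current time is passed in as a plain integer so the behaviour is easy to follow.

```php
<?php
// Toy in-memory queue that illustrates a visibility timeout.
// This is NOT the SQS API; all names here are hypothetical.
final class ToyQueue
{
    private $messages = []; // id => ['body' => string, 'visibleAt' => int]
    private $nextId = 1;
    private $visibilityTimeout;

    public function __construct(int $visibilityTimeout)
    {
        $this->visibilityTimeout = $visibilityTimeout;
    }

    public function send(string $body): int
    {
        $id = $this->nextId++;
        $this->messages[$id] = ['body' => $body, 'visibleAt' => 0];
        return $id;
    }

    // Returns [id, body] of the first visible message and hides it
    // for the duration of the timeout, or null if nothing is visible.
    public function receive(int $now): ?array
    {
        foreach ($this->messages as $id => $message) {
            if ($message['visibleAt'] <= $now) {
                $this->messages[$id]['visibleAt'] = $now + $this->visibilityTimeout;
                return [$id, $message['body']];
            }
        }
        return null;
    }

    public function delete(int $id): void
    {
        unset($this->messages[$id]);
    }
}

$queue = new ToyQueue(30);
$queue->send('{"video_id":1}');

[$id, $body] = $queue->receive(0); // worker A fetches the job at t=0
var_dump($queue->receive(10));     // NULL: hidden from worker B at t=10
[$retryId] = $queue->receive(30);  // worker A crashed; visible again at t=30
$queue->delete($retryId);          // worker B finished, so it deletes the message
var_dump($queue->receive(100));    // NULL: the job is gone for good
```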
So as you can see there are a lot of advantages to this approach:
- The worker machines don't need to be aware of each other
- Fault tolerant, no job gets lost because a worker machine fails
- Effectiveness, the queue tries to make sure no message gets processed twice
- Workers pull jobs at their own pace, so the system itself can never get overloaded and it always works as fast as it can
- Queue length can serve as a very nice measurement of load on a given system
Of course, all of those advantages depend on the queue service not failing, which is very difficult to achieve. So in a scenario like the one described above, Amazon's SQS service is a very interesting solution. Why?
Well, first of all it is cheap. While testing the service I put 1.7 million messages in my queue and Amazon charged me $1.74, so it's about $1 per million messages plus bandwidth. The next good thing is that it's highly scalable. Whether you put 10 or 10 million messages in your queue, Amazon says they'll sort it out for you. And last but not least, there are already AWS monitoring tools out there to monitor your queues hosted with Amazon.
So far so good. There are also things that suck about SQS. First of all, the latency is pretty high. My tests confirmed what Wikipedia and others say about SQS: It takes ~2-10s for a message added to a queue to be available for reading. If you need a very responsive queue, that rules SQS out for you. Also very stupid is the lack of a "flush" function. So while you are developing you have to write your own tool for flushing a queue. And last but not least is the fact that SQS requires your system to be idempotent. This basically means that SQS does not guarantee that a message cannot be fetched by 2 workers at the same time and thus be processed twice. Idempotent means that your app needs to be prepared for that: processing the same message twice needs to lead to the same result as processing it once. But of course, SQS tries to avoid this scenario as much as possible.
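Here is a rough sketch of what an idempotent job handler can look like. The function processJob and the $done array are hypothetical; in a real system the "already done" check would live in shared storage (for example, testing whether the encoded file already exists) and would need to be safe against two workers checking at the same moment, which this simplified version ignores.

```php
<?php
// Hypothetical idempotent handler: a duplicate delivery of the same
// job is detected and skipped, so the end result stays the same.
function processJob(string $message, array &$done): bool
{
    $job = json_decode($message, true);
    $key = 'video:' . $job['video_id'];
    if (isset($done[$key])) {
        return false; // already encoded, nothing to do
    }
    // ... the actual encoding work would happen here ...
    $done[$key] = true;
    return true;
}

$done = [];
$message = '{"video_id":42}';
var_dump(processJob($message, $done)); // bool(true): first delivery does the work
var_dump(processJob($message, $done)); // bool(false): duplicate is a no-op
```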
Anyway, if you come to the conclusion that SQS can solve more problems for you than it creates, it is an amazing service. So how do you use it in your apps? Well, I first tried using the PHP library Amazon provides, but I have come to hate it. I mean, it is very comprehensive and does the job. But the people who wrote it were clearly Java engineers forced to write PHP. I feel sorry for them.
For a long time my searches for an alternative came up empty, but last night I discovered at least 2 viable options. The first is called php-aws and provides very clean, easy to use classes for S3, SQS, EC2 and AWIS. From their project page I also found another project called Tarzan. PHP-AWS recommends Tarzan as a super-robust and comprehensive alternative. And from my analysis of it, it indeed looks like a very mature project and I encourage everybody to check it out.
Well - too bad for me that I was already way into implementing my own class last night when I discovered those two options :). But nevertheless I am very proud to present the Debuggable.com SQS PHP5 Library. Besides a very easy to use and intuitive interface, it features the following attributes:
- Exponential backoff on failures and retry maximum
- Uses cURL for reliable HTTP communication
- It is completely unit tested, which was an interesting challenge
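For readers who have not seen exponential backoff before, the idea can be sketched like this. This is not the library's actual code: retryWithBackoff, its parameters and the default values are assumptions made up for illustration.

```php
<?php
// Hypothetical retry helper; not taken from the library described above.
// Retries $operation after failures, doubling the wait each time,
// and gives up once $maxRetries retries have been used.
function retryWithBackoff(callable $operation, int $maxRetries = 5, int $baseDelayMs = 100)
{
    for ($attempt = 0; ; $attempt++) {
        try {
            return $operation();
        } catch (Exception $e) {
            if ($attempt >= $maxRetries) {
                throw $e; // retry maximum reached, give up
            }
            // With the default base delay: wait 100 ms, 200 ms, 400 ms, ...
            usleep($baseDelayMs * (2 ** $attempt) * 1000);
        }
    }
}

// Usage: an operation that fails twice before it succeeds.
$calls = 0;
$result = retryWithBackoff(function () use (&$calls) {
    if (++$calls < 3) {
        throw new RuntimeException('temporary failure');
    }
    return 'ok';
}, 5, 1);
echo $result, ' after ', $calls, " calls\n"; // ok after 3 calls
```

Doubling the delay gives a struggling service progressively more room to recover, while the retry maximum keeps a permanently broken request from looping forever.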
The lib itself is as simple as they come:
$queue = new SqsQueue('my_queue', array('key' => '...', 'secretKey' => '...'));
$queue->sendMessage(array('autoJsonSerialize' => 'is fantastic'));
$lookMaIAmUsingSPL = count($queue);
I do not recommend that anybody use the library for production purposes right now, but if you want to get started with SQS or study the implementation, I think you have found an excellent library. Writing the library certainly has been a great experience and an opportunity for me to study SQS in detail.
Anyway, enough said. Back to my cloud bed.
-- Felix Geisendörfer aka the_undefined