Batch processing with Akka part 4

This article is a follow-up to Batch processing with Akka part 3.

In this last part of the series I look at data processing within a network of computers. For that, the program was split into two logical parts: the master, which contains the actors that read the data from files and write it back, and the workers, which contain the actors that do the actual processing. The program configuration defines which parts should be started on a host and on which host the master is running. So it is possible to have one master and multiple workers running within the same network and to spread the work among them.
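A minimal sketch of how such a split might look, using today's Akka classic Java API; the config keys akkabatch.master and akkabatch.worker and the Worker actor are assumptions for illustration, not the actual akkabatch code:

```java
import akka.actor.AbstractActor;
import akka.actor.ActorSystem;
import akka.actor.Props;
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

public class Launcher {

    // stand-in for the real processing actor
    static class Worker extends AbstractActor {
        @Override
        public Receive createReceive() {
            return receiveBuilder()
                .match(String.class, record -> {
                    // process the record and reply with the result
                    getSender().tell(record.toUpperCase(), getSelf());
                })
                .build();
        }
    }

    public static void main(String[] args) {
        Config config = ConfigFactory.load();
        ActorSystem system = ActorSystem.create("akkabatch", config);

        // hypothetical config keys deciding which parts run on this host
        if (config.getBoolean("akkabatch.master")) {
            // create the reader/writer actors here
        }
        if (config.getBoolean("akkabatch.worker")) {
            system.actorOf(Props.create(Worker.class), "worker");
        }
    }
}
```

On a worker-only host, akkabatch.master would simply be set to false, so only processing actors are created there.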

Implementing the changes for this was easy. I had to add the akka-remoting library, modify the configuration and change the lookup for two actors in the program code. Without further changes it is possible to start the program on a couple of hosts and increase the performance. While the program is running, worker machines can be added and removed again without stopping the master program.
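The two changes might look roughly like this; the hostname, port and actor name are made up for the example, while the config keys shown are those of Akka's classic Netty remoting:

```java
import akka.actor.ActorRef;
import akka.actor.ActorSelection;
import akka.actor.ActorSystem;
import com.typesafe.config.ConfigFactory;

public class WorkerNode {
    public static void main(String[] args) {
        // enable classic remoting; in the real program this lives in application.conf
        String remoting =
            "akka.actor.provider = \"akka.remote.RemoteActorRefProvider\"\n" +
            "akka.remote.netty.tcp.hostname = \"127.0.0.1\"\n" +
            "akka.remote.netty.tcp.port = 0\n";   // 0 = pick a free port
        ActorSystem system = ActorSystem.create("akkabatch",
            ConfigFactory.parseString(remoting).withFallback(ConfigFactory.load()));

        // look up an actor on the master host by its remote path
        ActorSelection reader =
            system.actorSelection("akka.tcp://akkabatch@masterhost:2552/user/reader");
        reader.tell("register", ActorRef.noSender());
    }
}
```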

Conclusion of these tests: Akka is an enormous help when developing flexible and scalable programs. As a programmer you don't have to worry about the details of threads, thread pools, synchronisation, locking or network protocols; all of this is handled by Akka.

The program code is available at https://bitbucket.org/sothawo/akkabatch.

Batch processing with Akka part 3

This article is a follow-up to Batch processing with Akka part 2.

In the third part of this series I deal with error recovery for the case that records are lost during processing or that processing a record takes too long. This may happen, for example, when multiple computers are involved in the processing and network problems occur.

To handle such cases, the program checks at regular intervals whether records that have been sent into the system for processing were processed in time. If not, those records are sent into the system again. The length of the check interval and the timeout used to detect a lost record are adapted automatically while the program runs.
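A sketch of such a resend mechanism, reduced to its core; the message types are made up, and the fixed timeout and check interval are simplifications of what is described above (the real program adapts both values at runtime):

```java
import akka.actor.AbstractActorWithTimers;
import akka.actor.ActorRef;
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

public class ResendManager extends AbstractActorWithTimers {

    static final class Tick {}
    static final class Record { final long id; final String data;
        Record(long id, String data) { this.id = id; this.data = data; } }
    static final class Done { final long id; Done(long id) { this.id = id; } }

    private final ActorRef workers;                // router over the processing actors
    private final Map<Long, Record> inFlight = new HashMap<>();
    private final Map<Long, Long> sentAt = new HashMap<>();
    private final long timeoutMillis = 1000;       // adapted at runtime in the real program

    public ResendManager(ActorRef workers) {
        this.workers = workers;
        getTimers().startTimerAtFixedRate("check", new Tick(), Duration.ofMillis(500));
    }

    @Override
    public Receive createReceive() {
        return receiveBuilder()
            .match(Record.class, r -> {            // new record entering the system
                inFlight.put(r.id, r);
                sentAt.put(r.id, System.currentTimeMillis());
                workers.tell(r, getSelf());
            })
            .match(Done.class, d -> {              // record was processed in time
                inFlight.remove(d.id);
                sentAt.remove(d.id);
            })
            .match(Tick.class, t -> {              // periodic check: resend overdue records
                long now = System.currentTimeMillis();
                sentAt.forEach((id, time) -> {
                    if (now - time > timeoutMillis) {
                        sentAt.put(id, now);
                        workers.tell(inFlight.get(id), getSelf());
                    }
                });
            })
            .build();
    }
}
```

The master would create this actor with Props.create(ResendManager.class, workerRouter) and route all records through it.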

The errors are simulated by the processing actors themselves: instead of processing every message, they simply drop a configurable fraction of them.
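The simulation itself is simple; a sketch of a worker that drops a configurable fraction of its messages might look like this (the actor and the processing step are placeholders, not the actual akkabatch code):

```java
import akka.actor.AbstractActor;
import java.util.concurrent.ThreadLocalRandom;

public class DroppingWorker extends AbstractActor {

    private final double dropRate;   // e.g. 0.25 drops a quarter of all records

    public DroppingWorker(double dropRate) { this.dropRate = dropRate; }

    @Override
    public Receive createReceive() {
        return receiveBuilder()
            .match(String.class, record -> {
                if (ThreadLocalRandom.current().nextDouble() < dropRate) {
                    return;   // simulate a lost message: no processing, no reply
                }
                getSender().tell(record.toUpperCase(), getSelf());
            })
            .build();
    }
}
```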

The effort to implement this behavior is not negligible, but it's not too hard either. Care has to be taken that records are not resent into the system too often; on the other hand, the check interval must be short enough not to slow down the processing. The result is remarkable:

Tested on a computer with 8 cores and a file with 520,000 records:

drop rate    processing time
 0%          26 ms
10%          24 ms
25%          19 ms
50%          20 ms
60%          26 ms
75%          45 ms

The measured times show that a drop rate of 0% causes some overhead from checking whether records need to be resent. Interestingly, the system is fastest at a drop rate of 25%, even while correcting the errors. I think this is because actors that drop a record are available again for new work sooner than actors that process it properly. Only when the drop rate rises above 60% do records have to be resent so often that performance suffers.