Batch processing with Akka part 4

This article is a follow-up to Batch processing with Akka part 3.

In this last part of the series of articles I have a look at data processing within a network of computers. To do that, the program was split into two logical parts: the master contains the actors that read the data from the input file and write the results to the output file, and the workers contain the actors that do the actual processing. The program configuration defines which parts of the program should be started and on which host the master is running. So it is possible to have one master and multiple workers running within the same network and to spread the work amongst them.

Implementing the changes for this was easy: I had to add the akka-remoting library, modify the configuration and change the lookup for two actors in the program code. Without further changes it is possible to start the program on a couple of hosts and increase the performance. While the program is running, machines running workers can be added and removed again without the need to stop the master program.
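For illustration, here is a minimal sketch of what such a setup can look like with classic Akka remoting (Akka 2.2+ style). The system name AkkaBatch, the actor name reader, the host name and the port are placeholders I chose for the sketch, not necessarily the values used in the project. First the configuration on the master side:

    akka {
      actor.provider = "akka.remote.RemoteActorRefProvider"
      remote.netty.tcp {
        hostname = "master-host"   # the host the master part is running on
        port     = 2552
      }
    }

And a lookup of a master actor from a worker process:

    import akka.actor.ActorRef;
    import akka.actor.ActorSelection;
    import akka.actor.ActorSystem;

    public class WorkerLookup {
        public static void main(String[] args) {
            ActorSystem system = ActorSystem.create("AkkaBatch");
            // on the worker side, look up an actor running on the master host by its remote path
            ActorSelection reader = system.actorSelection(
                    "akka.tcp://AkkaBatch@master-host:2552/user/reader");
            // announce this worker to the master (the message content is just an example)
            reader.tell("worker-available", ActorRef.noSender());
        }
    }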

Conclusion of these tests: Akka is an enormous help when developing flexible and scalable programs. As a programmer you don't have to care about the details of threads, thread pools, synchronisation, locking or network protocols; all of this is handled by Akka.

The program code is available at https://bitbucket.org/sothawo/akkabatch.

Batch processing with Akka part 3

This article is a follow-up to Batch processing with Akka part 2.

In the third part of this series I deal with error recovery for the case that records are lost during processing or that the processing of a record takes too long. This may happen, for example, when multiple computers are involved in the processing and network problems occur.

To handle such cases, the program checks at regular intervals whether records that have been sent into the system for processing have been processed in time. If not, the records are sent into the system again. The length of the check interval and the timeout used to detect overdue records are adapted automatically during processing.
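A small sketch of how such bookkeeping can be done; this is just my illustration of the idea, assuming each record carries a sequence id, and the class and method names are made up for the example. Since the map is owned by a single actor, no synchronisation is needed:

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Keeps track of records that were sent into the system and are not yet confirmed.
    // Intended to be owned by one actor, so no locking is required.
    public class InFlightRecords {
        private final Map<Long, String> records = new LinkedHashMap<>();
        private final Map<Long, Long> sentAt = new LinkedHashMap<>();

        // remember a record when it is sent into the system
        public void sent(long id, String record) {
            records.put(id, record);
            sentAt.put(id, System.currentTimeMillis());
        }

        // forget a record as soon as its processed result has arrived
        public void confirmed(long id) {
            records.remove(id);
            sentAt.remove(id);
        }

        // collect all records whose processing exceeded the timeout; the caller
        // resends them, so their timestamps are reset at the same time
        public List<String> overdue(long timeoutMillis) {
            long now = System.currentTimeMillis();
            List<String> result = new ArrayList<>();
            for (Map.Entry<Long, Long> entry : sentAt.entrySet()) {
                if (now - entry.getValue() > timeoutMillis) {
                    result.add(records.get(entry.getKey()));
                    entry.setValue(now);
                }
            }
            return result;
        }
    }

The periodic check itself can be triggered with Akka's scheduler, which sends a tick message to the supervising actor at the configured interval.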

The error is simulated by the actors that do the processing: instead of processing all of their messages, they simply drop a configurable fraction of them.
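Roughly, a worker with such a built-in fault can look like the following sketch; the class name, the constructor parameter and the placeholder processing are assumptions for the example, not the project's actual code:

    import akka.actor.UntypedActor;
    import java.util.Random;

    // Worker that silently drops a configurable fraction of its messages,
    // simulating records that get lost on the way.
    public class DroppingWorker extends UntypedActor {
        private final double dropRate;             // e.g. 0.25 to drop 25% of the messages
        private final Random random = new Random();

        public DroppingWorker(double dropRate) {
            this.dropRate = dropRate;
        }

        @Override
        public void onReceive(Object message) {
            if (random.nextDouble() < dropRate) {
                return;                            // drop the record, no result is ever sent back
            }
            // normal case: process the record and send the result back to the sender
            getSender().tell(process(message), getSelf());
        }

        private Object process(Object record) {
            return record;                         // placeholder for the real processing
        }
    }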

The effort to implement this behavior is not negligible, but it is not too hard either. Care has to be taken that records are not resent into the system too often; on the other hand, the check interval must be short enough not to slow down the processing. But the result is remarkable:

Tested on a computer with 8 cores and a file with 520,000 records:

drop rate – processing time
0% – 26 ms
10% – 24 ms
25% – 19 ms
50% – 20 ms
60% – 26 ms
75% – 45 ms

The measured times show that a drop rate of 0% causes some overhead from checking whether records need to be resent. Interestingly, the system is fastest when 25% of the records are dropped and have to be resent while the errors are being corrected. I think this comes from the fact that actors which drop data become available for new work sooner than actors which do their job properly. Only when the drop rate rises above 60% do the records have to be resent so often that performance is lost.

Batch processing with Akka part 2

This article is the follow-up to Batch processing with Akka part 1.

This first implementation does not care about error recovery; there is no special exception or error handling.

The records are sent into the actor system, where they are processed in parallel and finally written to the output file. The actor system runs in a single VM; there is no load balancing across several computers at the moment.

The akka implementation has the following features:

  • The number of messages that are processed simultaneously in the system is limited by a configuration entry. This prevents the system from being flooded with messages.
  • As the creation of objects from csv lines does not produce a significant processor load, a special dummy processing has been implemented: in about 50 percent of the calls it does a Thread.sleep(1), while the other 50 percent are spent calculating the 1000th Fibonacci number (see the sketch after this list). This simulates blocking processing as well as CPU-consuming work.
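Such a dummy processing can be sketched like this (my own illustration of the behaviour described above, not the code from the project):

    import java.math.BigInteger;
    import java.util.Random;

    // Dummy processing: about half of the calls block for one millisecond,
    // the other half burn CPU by computing the 1000th Fibonacci number.
    public class DummyProcessor {
        private final Random random = new Random();

        public void process(String record) throws InterruptedException {
            if (random.nextBoolean()) {
                Thread.sleep(1);                   // simulates blocking work, e.g. waiting for I/O
            } else {
                fibonacci(1000);                   // simulates CPU-consuming work
            }
        }

        private BigInteger fibonacci(int n) {
            BigInteger previous = BigInteger.ZERO;
            BigInteger current = BigInteger.ONE;
            for (int i = 0; i < n; i++) {
                BigInteger next = previous.add(current);
                previous = current;
                current = next;
            }
            return previous;
        }
    }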

The same dummy processing is used in a second program where the records are processed sequentially without the use of threads.

Test runs of the sequential and the Akka-based program with a file of 520,000 records on two different computers yield the following times:

computer 1, 2 processors:
sequential processing: 7 min 18 sec
parallel processing: 1 min 9 sec
speed factor: 6.4

computer 2, 8 processors:
sequential processing: 5 min 27 sec
parallel processing: 15 sec
speed factor: 22.4

Observing the number of threads used during processing shows that Akka in its standard configuration uses three times the number of cores as the number of threads. The number of threads in this simple program is roughly the same as the speed factor gained by the use of parallel programming.
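This factor comes from the configuration of Akka's default dispatcher; the relevant settings look roughly like this (the values shown are the defaults as I understand them for the Akka 2.x versions, so they should be checked against the version at hand):

    akka.actor.default-dispatcher {
      fork-join-executor {
        parallelism-min    = 8     # lower bound for the thread pool
        parallelism-factor = 3.0   # threads = factor * number of available cores
        parallelism-max    = 64    # upper bound for the thread pool
      }
    }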

About program complexity: by adopting the actor model there is a little more work in modelling the actors and messages compared to the sequential version. But this is more than paid off by the speed gain, and that is before even taking load balancing across several computers into account.
Very important is the fact that the program code does not contain any thread or synchronisation elements; all of this is provided by Akka in the background, which prevents problems that stem from wrong synchronisation.
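Part of the modelling work is defining the messages; in Akka these are typically small immutable classes, for example something like the following sketch (the names are invented for the example). Because such messages are immutable and each actor processes one message at a time, no locking is needed anywhere:

    import java.io.Serializable;

    // Immutable message carrying one csv line together with its sequence number.
    public final class RecordMessage implements Serializable {
        private final long sequenceNumber;
        private final String csvLine;

        public RecordMessage(long sequenceNumber, String csvLine) {
            this.sequenceNumber = sequenceNumber;
            this.csvLine = csvLine;
        }

        public long getSequenceNumber() { return sequenceNumber; }

        public String getCsvLine() { return csvLine; }
    }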

Batch processing with Akka part 1

This is the first of a series of articles in which I'd like to report on the implementation of a batch processing program using the Akka framework.

I first heard about Akka in the summer of 2012, but only now have I found the time for some deeper investigation. In the articles to follow I will write about successes and failures and about what I find positive or negative.

The first implementations will be made in Java; later I will compare them to a Scala implementation.

The general conditions for the batch processing are:

  • input and output of csv files
  • the order of the output records must correspond to the order of the input records (see the sketch after this list)
  • processing of several million records must be possible
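The second condition is the interesting one when records are processed in parallel: results can arrive out of order, so they have to be buffered until all of their predecessors have been written. A minimal sketch of that idea (my own illustration, assuming each record is numbered when it is read; not the project's actual writer):

    import java.util.Map;
    import java.util.TreeMap;

    // Buffers processed records and releases them strictly in the order of their
    // sequence numbers, so the output order matches the input order.
    public class OrderedOutput {
        private final Map<Long, String> buffer = new TreeMap<>();
        private long nextToWrite = 0;

        // called whenever a processed record arrives, possibly out of order
        public void resultArrived(long sequenceNumber, String line) {
            buffer.put(sequenceNumber, line);
            // write out every record that is now next in line
            while (buffer.containsKey(nextToWrite)) {
                write(buffer.remove(nextToWrite));
                nextToWrite++;
            }
        }

        private void write(String line) {
            System.out.println(line);              // placeholder for writing to the output csv file
        }
    }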

The code of the implementations is not freely available at the moment; perhaps I will publish the Git repository some time later.

As these are my first tests with Akka, I will only use the basic functions of Akka in the beginning and skip topics like clustering.