The challenge
The challenge was to process a 3GB text input file containing four types of letters (A, C, T and G) and find a specified sequence in this file in the shortest possible time. The boundary conditions were quite flexible:

- Any hardware or software was allowed
- The upload time to a server via FTP or similar was not counted
- Time measurement started with the start of processing

For more details and the full challenge description, check the blog entry: www.excelian.com/blog/eureka-moment-at-the-bottom-of-a-pint-glass



Our first approach: Frontal confrontation
These kinds of challenges are usually simple to attack with brute force, and in fact that is how we started: we wrote a simple iterative program in C++ that read the file byte by byte and looked for the sequence (naive and innocent). We found out very quickly that this was not a good method, as it was very slow (~80s). Fair enough: it was inefficient, but useful at the same time. Thanks to this initial approach and its code, we quickly identified the problems we were going to face:

  • High disk I/O
  • The long type (32 bits on some platforms) unable to index 3bln characters
  • Low CPU utilization
Based on this lesson, we decided that we needed to lower disk I/O, utilize all available CPUs and make sure they were used at the optimum 100%.
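
For reference, the brute-force idea amounts to something like this (a minimal sketch, not our original code; it loads the file up front rather than streaming it, and it uses 64-bit offsets to avoid the integer-width problem listed above):

    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <fstream>
    #include <iterator>
    #include <string>
    #include <vector>

    // Naive search: test the pattern at every offset of the input.
    // O(n*m) work on a single thread is what gave us the ~80s baseline.
    int main(int argc, char** argv) {
        if (argc != 3) { std::fprintf(stderr, "usage: %s file pattern\n", argv[0]); return 1; }
        std::ifstream in(argv[1], std::ios::binary);
        std::vector<char> data((std::istreambuf_iterator<char>(in)),
                               std::istreambuf_iterator<char>());
        const std::string pat = argv[2];
        // int64_t rather than int/long: 3bln characters overflow 32-bit indices.
        const int64_t n = static_cast<int64_t>(data.size());
        const int64_t m = static_cast<int64_t>(pat.size());
        for (int64_t i = 0; m > 0 && i + m <= n; ++i) {
            if (std::memcmp(&data[i], pat.data(), static_cast<size_t>(m)) == 0)
                std::printf("match at %lld\n", static_cast<long long>(i));
        }
        return 0;
    }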

Solution number 1: Single process. Let’s keep it simple.
Our first solution used a single node to perform the computations. 3GB of data is not much, so we decided it would fit into the computer's memory, and we would avoid any network overhead, which could be high if we were to stream data from a network file system.

The first program was written in C++ with OpenMP extensions for thread-based parallelisation. For the pattern search we used the Boyer–Moore–Horspool algorithm, and we used a RAM disk as the storage for the input file.
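
The heart of that solution looks roughly like this (a minimal sketch of Horspool search parallelised with OpenMP, assuming the input file has already been loaded into a buffer from the RAM disk; the chunk-overlap handling is our illustration, not a copy of the original code):

    #include <cstddef>
    #include <cstring>
    #include <omp.h>
    #include <vector>

    // Boyer–Moore–Horspool bad-character table: how far the window may
    // shift when its last byte is a given character.
    static void build_shift(const unsigned char* pat, size_t m, size_t shift[256]) {
        for (size_t c = 0; c < 256; ++c) shift[c] = m;
        for (size_t i = 0; i + 1 < m; ++i) shift[pat[i]] = m - 1 - i;
    }

    // Scan one chunk, returning match offsets relative to `text`.
    static std::vector<size_t> horspool(const unsigned char* text, size_t n,
                                        const unsigned char* pat, size_t m,
                                        const size_t shift[256]) {
        std::vector<size_t> hits;
        for (size_t pos = 0; pos + m <= n; pos += shift[text[pos + m - 1]])
            if (std::memcmp(text + pos, pat, m) == 0) hits.push_back(pos);
        return hits;
    }

    // Split the buffer across OpenMP threads. Each chunk is extended by
    // m-1 bytes so a match straddling a chunk boundary is still found,
    // and found exactly once.
    std::vector<size_t> parallel_search(const unsigned char* buf, size_t n,
                                        const unsigned char* pat, size_t m) {
        std::vector<size_t> all;
        if (m == 0 || m > n) return all;  // degenerate cases
        size_t shift[256];
        build_shift(pat, m, shift);
        #pragma omp parallel
        {
            const size_t t = omp_get_thread_num(), nt = omp_get_num_threads();
            const size_t begin = n * t / nt;
            size_t end = n * (t + 1) / nt + m - 1;
            if (t == nt - 1 || end > n) end = n;
            std::vector<size_t> local = horspool(buf + begin, end - begin, pat, m, shift);
            #pragma omp critical
            for (size_t off : local) all.push_back(begin + off);
        }
        return all;
    }

The shift table is built once and shared read-only by all threads; only the final merge of per-thread results needs synchronisation.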

The program was executed on a Linux laptop with a 2-core i5 CPU (4 threads), and the result was around 3.8s. We got a result of 0.78s when we ran it on an 8-core Amazon instance, but with a different data set and sequence. The timing was good but not good enough, so we decided to try something else…

Solution number 2: GP/GPU. We need more power!
Looking to improve our results, we decided to try a GPU solution, as the problem seemed to fit the architecture pretty well. We ran our test on a GeForce GTX 770 card, initially with 1GB of data.

Again, our first tests used an unsophisticated approach (our old friend brute force), which ended up with terrible results: the running time was well over 1 second. The solution was compute bound, because excessive shared memory usage caused many bank conflicts [1].

It was obviously a no-go, so we changed the underlying algorithm to improve the search and make it quicker [2]. This gave us much better results, and the problem became bandwidth bound: copying the data to device memory took around 400ms, while the application itself ran in around 100ms. To avoid this huge data-movement penalty we used ZeroCopy memory [3], where the data stays in the host's memory and only a mapped pointer is passed to the kernel. Those changes brought the running time down to around 200ms for 1GB of data which, compared to the OpenMP solution, did not really live up to our expectations. We needed to carry on!
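
On the host side, the zero-copy setup is only a few CUDA runtime calls (a sketch of the mechanism only; search_kernel is a hypothetical kernel name, not our actual code):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        const size_t n = 1ull << 30;   // 1GB of input, as in our test
        unsigned char* host_buf = nullptr;
        unsigned char* dev_view = nullptr;

        // Allow host allocations to be mapped into the device address space.
        cudaSetDeviceFlags(cudaDeviceMapHost);

        // Pinned, mapped host memory: the buffer lives in host RAM only.
        cudaHostAlloc(reinterpret_cast<void**>(&host_buf), n, cudaHostAllocMapped);
        // ... load the input file into host_buf ...

        // Device-side alias of the same physical memory: no cudaMemcpy happens;
        // the GPU reads host RAM over the PCIe bus as the kernel touches it.
        cudaHostGetDevicePointer(reinterpret_cast<void**>(&dev_view), host_buf, 0);

        // search_kernel<<<blocks, threads>>>(dev_view, n, ...);  // hypothetical kernel

        cudaFreeHost(host_buf);
        return 0;
    }

Zero-copy pays off here because every input byte is read exactly once; for data the kernel re-reads many times, an explicit copy to device memory would win.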

Solution number 3: Distributed Computing. The more the merrier.
We already had the solution working on a single machine, so we decided to scale it out. To do this, however, we had to ensure quick access to the input file from all processing nodes. So the question was: when is the best moment to distribute the file?

We had already confirmed that the upload phase was not counted in the processing time (part of the boundary conditions), so that was our moment. But how do you scale a RAM disk? Apparently it is not a very difficult thing to do: we knew of at least two solutions, and we were sure plenty of others exist. One we came across was DRBD [4], which replicates block devices. It could have worked, however it is limited to replicating across a maximum of 3 servers – not good enough for the Grid Team!

Fortunately, we didn't have to use it, as we had a better solution: GlusterFS [5], a distributed and highly available network file system which works on top of an existing local file system. You can run GlusterFS on top of Ext4, XFS or TMPFS (RAM disk), which is what we used as the storage backend. This file system suited us very well, as it allowed us to set the replication level equal to the number of nodes we had, thus making sure the data was available locally on every node.

With input data availability sorted out, one challenge remained: how do we distribute the workload among multiple nodes? This, actually, was not difficult. As we were using C/C++, there was one obvious choice: OpenMPI [6].

In the MPI world we created two kinds of nodes: master and slave. The master node instructed the slave nodes as to which part of the data they were supposed to process and was also responsible for collecting the results. We divided the input file into roughly 1000 chunks, and slaves polled the master node for the next range to process. This way, more overloaded nodes were not blocking the processing of data. No showstoppers allowed. A direct benefit: we tended to get similar calculation times for the same number of compute nodes. An elegant and consistent solution, we must say.
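
A minimal sketch of that master/slave scheduling loop follows (our reconstruction for illustration; the tags, counts and search_range() helper are assumptions, not the original code):

    #include <cstdio>
    #include <mpi.h>

    enum { TAG_RESULT = 1, TAG_WORK = 2, TAG_STOP = 3 };

    // Hypothetical local search over a byte range of the replicated input
    // file; in our solution this would run the Horspool scan.
    static long long search_range(long long begin, long long end) {
        (void)begin; (void)end;
        return 0;  // stub for illustration
    }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const long long total = 3LL * 1024 * 1024 * 1024;  // ~3GB input
        const int chunks = 1000;                           // ~1000 ranges, as above

        if (rank == 0) {
            // Master: hand the next [begin, end) byte range to whichever
            // slave reports in, and accumulate its previous result.
            long long found = 0, result;
            int next = 0, active = size - 1;
            MPI_Status st;
            while (active > 0) {
                MPI_Recv(&result, 1, MPI_LONG_LONG, MPI_ANY_SOURCE, TAG_RESULT,
                         MPI_COMM_WORLD, &st);
                found += result;
                if (next < chunks) {
                    long long range[2] = { total * next / chunks,
                                           total * (next + 1) / chunks };
                    ++next;
                    MPI_Send(range, 2, MPI_LONG_LONG, st.MPI_SOURCE, TAG_WORK,
                             MPI_COMM_WORLD);
                } else {
                    MPI_Send(nullptr, 0, MPI_LONG_LONG, st.MPI_SOURCE, TAG_STOP,
                             MPI_COMM_WORLD);
                    --active;
                }
            }
            std::printf("total matches: %lld\n", found);
        } else {
            // Slave: report the last result (0 to begin with), receive the
            // next range, search it, repeat until told to stop.
            long long matches = 0;
            for (;;) {
                MPI_Send(&matches, 1, MPI_LONG_LONG, 0, TAG_RESULT, MPI_COMM_WORLD);
                long long range[2];
                MPI_Status st;
                MPI_Recv(range, 2, MPI_LONG_LONG, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP) break;
                matches = search_range(range[0], range[1]);
            }
        }
        MPI_Finalize();
        return 0;
    }

Because slaves pull work rather than receiving one static share, a temporarily overloaded node simply processes fewer chunks instead of holding everyone else up, which is exactly the behaviour described above.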

The Challenge day. Final countdown.
Saturday 8:30pm, 1st of March. Everything was ready for the test. We prepared the Amazon cluster for the final and most important run. Preparation took some time, as we decided to deploy 15 nodes (c3.2xlarge instances [7]), resulting in 120 cores. Amazon provides APIs and a PKI security infrastructure, so we were able to automate the deployment process.

We must admit the results were quite impressive: we were able to process 3GB of data in 108ms on average, with the first result available after only 12ms. That average corresponds to an aggregate scan rate of roughly 28GB/s across the cluster. Our best time was 105ms, and we were not able to go below this score with 120 cores.

In addition to the raw timings, we tested the scalability of our solution. This is shown in the charts below (for better visibility, the charts are divided in two):

[Scalability charts]

Of course, it is important to keep in mind that these results were obtained on virtual machines hosted on shared infrastructure (Amazon). The application would quite probably behave differently (potentially faster) on dedicated physical servers without the Xen layer.

Could we do better?
Yes, we could. The most obvious improvement would be to combine OpenMPI with GP/GPU where, instead of a graphics card with dedicated memory, we would use an APU (like AMD Kaveri) in which RAM is shared between the CPU and the GPU. That way we wouldn't have to copy data from RAM to graphics memory at all and could use it directly from the GPU processors, saving a lot of precious time. We are happy to receive your suggestions, feedback or constructive criticism.

The winners of the first Excelian Tech Challenge: Stan Kulczycki, Greg Walczak, Manuel Dominguez and Peter Vegh.

REFERENCES

[1] http://cuda-programming.blogspot.co.uk/2013/02/bank-conflicts-in-shared-memory-in-cuda.html

[2] http://alg.csie.ncnu.edu.tw/course/StringMatching/Quick%20Searching.ppt

[3] http://on-demand.gputechconf.com/gtc-express/2011/presentations/NVIDIA_GPU_Computing_Webinars_CUDA_Memory_Optimization.pdf

[4] http://www.drbd.org

[5] http://www.gluster.org

[6] http://www.open-mpi.org

[7] http://aws.amazon.com/ec2/instance-types 
Team GAAS