Over the last few weeks, Excelian has been investigating Xcelerit’s new technology: the Xcelerit SDK for parallel programming and HPC. The framework claims to be an easy-to-use, highly scalable SDK that can utilize most of the resources available on a server, including CPUs and GPUs, and offers a hardware-agnostic way of coding parallel applications. In this article I will share my first impressions of the framework and test its integration with a compute grid middleware commonly used in the financial sector.


About Xcelerit SDK
Xcelerit is an SDK for developing parallel applications. Its key distinguishing feature is that it hides the complexity of parallel application development from the developer while trying to be as efficient as possible. The parallelization approach is based on the dataflow graph paradigm (Fig. 1). The developer defines a flow describing how the data is processed from input to output. Each processing unit is called an Actor; each Actor receives data, processes it and sends it to the next node in the graph (Fig. 2). Actors are not allowed to store any data apart from constants and parameters defined at creation time. Because no shared resources are accessed by multiple threads, there is little mutex-locking overhead, which results in higher performance.
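
The idea of stateless Actors chained into a flow can be illustrated in plain C++. The sketch below is not the Xcelerit API; the names Scale, Offset and run_flow are hypothetical and only show Actors that hold constants fixed at creation time and pass data from one node to the next.

```cpp
#include <cassert>
#include <vector>

// Hypothetical illustration in plain C++; this is NOT the Xcelerit API.
// Each "actor" holds only constants fixed at construction time and
// maps one input item to one output item.
struct Scale {
    const double factor;  // constant defined at creation time
    double operator()(double x) const { return x * factor; }
};

struct Offset {
    const double shift;  // constant defined at creation time
    double operator()(double x) const { return x + shift; }
};

// A two-node flow: input -> Scale -> Offset -> output.
// No actor touches shared mutable state, so every item could be
// processed independently of the others.
std::vector<double> run_flow(const std::vector<double>& input,
                             const Scale& scale, const Offset& offset) {
    std::vector<double> output;
    output.reserve(input.size());
    for (double x : input) {
        output.push_back(offset(scale(x)));
    }
    return output;
}
```

Calling run_flow({1.0, 2.0, 3.0}, Scale{2.0}, Offset{1.0}) pushes each item through both nodes in turn. Because the Actors carry no mutable state, the loop body needs no locking if the items are later spread across threads.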

[img]images/img/2_figure1_dataflow.png[/img]

One of the constraints of the model is that Actors cannot use external resources such as shared data caches during processing. Such common data has to be pre-fetched before the processing unit is executed; it can then be passed to the Actor as an input, parameter or constant. It may be useful to build a common data orchestration layer that feeds Actors with data before or during execution based on the flow, leveraging fast access to data in an in-memory cache if required.
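
A minimal sketch of this pre-fetching pattern in plain C++ follows. RateCache, Discount and discount_all are hypothetical names chosen for illustration, not part of the Xcelerit SDK. The orchestration step reads the shared data once and hands it to the Actor as a constant.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a shared cache or market-data store.
using RateCache = std::map<std::string, double>;

// Actor-like functor: the rate is copied in at creation time as a constant,
// so the processing step never reaches back into the cache.
struct Discount {
    const double rate;
    double operator()(double cashflow) const { return cashflow / (1.0 + rate); }
};

// Orchestration layer: look up the shared data first, then build the
// actor and run it over the inputs. Throws std::out_of_range if the
// requested curve is not in the cache.
std::vector<double> discount_all(const RateCache& cache,
                                 const std::string& curve,
                                 const std::vector<double>& cashflows) {
    const Discount actor{cache.at(curve)};  // pre-fetch happens here, once
    std::vector<double> out;
    out.reserve(cashflows.size());
    for (double cf : cashflows) {
        out.push_back(actor(cf));
    }
    return out;
}
```

The cache lookup stays on the host side of the flow, so the Actor itself remains free of external dependencies.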

[img]images/img/3_figure2_actor.png[/img]



GPU processing support
The Xcelerit SDK is able to use CPUs and GPUs at the same time. Xcelerit's developers achieved this by providing their own compiler, which drives a set of existing compilers and tools in order to generate code for both CPU and GPU. The decision where to run the code is taken at runtime depending on available resources. The question is how to write an application which uses both. In fact, if you stick to Xcelerit's development guidelines you don't have to do anything. Each piece of Actor code which follows the coding rules can be executed on both CPU and GPU. The library detects at runtime what kind of CPU and GPU is available on the node and distributes the workload based on this information. Right now Xcelerit supports CUDA-enabled graphics cards only, but Intel Xeon Phi and potentially FPGAs will be supported in the future, which means that vendor lock-in should be reduced.

Constraints
There are a few constraints imposed on development by the SDK. Firstly, you can't use any external library inside Actor code, such as Boost, Blaze or any other 3rd party C/C++ library. Secondly, any in-house library code has to be re-compiled with the provided compiler to be able to run on a GPU using Xcelerit’s model.

External libraries and shared resources

What if you have to use a shared resource or an external library? You can, but that code will then run on the CPU only, not on the GPU. In fact you can mix CPU and GPU Actors, so depending on your needs you can design a workflow which uses both efficiently. You can also avoid accessing external resources from an Actor by arranging your processing workflow so that data from the external data source is fetched before processing.

What about Boost and the STL? Xcelerit knows that you need some of the functions from these libraries and has done its best to provide alternative implementations of the most important algorithms and containers from the STL and Boost. It also provides a library for statistics and finance computations.

Existing code migration
A common challenge in the industry is legacy code migration when changing the underlying technology: quant teams within banks have often invested many man-years of effort in building and validating libraries. Leveraging Xcelerit’s SDK means re-building the library, although this can be done step by step rather than in one shot, because Xcelerit’s design requirements can be implemented progressively. The migrated libraries will have to be revalidated to be sure that no regressions have been introduced, especially if the application is supposed to use GPUs. When considering the library, do not forget that the Xcelerit SDK is a black box for developers. You can’t really debug the SDK’s internals, but you can debug and profile all user-implemented code as normal.

Is it easy to use the Xcelerit SDK? Yes, it is. To learn the Xcelerit SDK I decided to take a Monte-Carlo LIBOR swaption portfolio pricing calculation written purely in C and rewrite it to run on the SDK. I managed to do it within a few hours of playing with the framework, which I think is a good result. The documentation is descriptive and the example code is well written.

Performance
To give you a sense of how the library can improve the performance of legacy applications, I present some numbers in the table below. I executed my Monte-Carlo application in different hardware and software configurations to check how much the calculation speed improves.

[img]../../../images/img/Monte_Carlo_simulation_table.JPG[/img]

Table 1. Monte Carlo simulation execution time (131K paths) on 2x Intel Xeon E5620 (no HT, 8 cores) and 2x Tesla M2050 GPUs
Does it improve performance? Yes, it does, especially when the GPU kicks in! In CPU-only mode the improvement was 9.3x, which is a very good result for an 8-core machine. The speedup with GPUs was even more impressive (92.7x), and this is the main reason to use the Xcelerit SDK in the first place. I think I could do better if I used the Xcelerit SDK-specific API in some places instead of legacy C code. Of course, this was a rough test and the results would have to be validated for actual production models. If you are interested in more detail on Xcelerit SDK performance, you can check out their two blog posts: Programming with the Xcelerit SDK and Benchmarks: Xcelerit SDK vs. OpenMP.
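
The two speedups can be put in perspective with a little arithmetic. The sketch below assumes both figures are measured against the same baseline run; the helper names are illustrative only. On that assumption, 9.3x on 8 cores works out to roughly 1.16x per core, and the GPU configuration is roughly 10x faster than the CPU-only Xcelerit run.

```cpp
#include <cassert>
#include <cmath>

// Speedup over the baseline divided by the number of cores used.
double per_core_speedup(double speedup, int cores) {
    return speedup / cores;
}

// How much faster configuration A is than configuration B, given the
// speedup of each over the same baseline.
double relative_gain(double speedup_a, double speedup_b) {
    return speedup_a / speedup_b;
}
```

With the numbers quoted above, per_core_speedup(9.3, 8) gives 1.1625 and relative_gain(92.7, 9.3) gives approximately 9.97.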

IBM Platform Symphony Integration
Xcelerit focuses on workload distribution within one computing node (i.e. one physical server) and does not scale out to other computing nodes in a cluster. To manage workloads across a cluster or even grids, Xcelerit recommends using industry-standard job schedulers and grid middleware. As part of my research I decided to integrate the Xcelerit SDK with IBM Platform Symphony. Symphony is a grid middleware for distributed execution of Service Oriented Architecture (SOA) applications. In other words, it distributes load between the nodes in the cluster based on node availability, service SLA and demand.

Integration part
The challenge was to run an Xcelerit SDK-based application as a Symphony application. The code integration was relatively straightforward: all I had to do was move the source files of my test code into a Symphony Service project, invoke the computation method defined in those sources, and compile.

Apart from code integration, you also need to modify your project configuration to add the Xcelerit SDK library as a dependency. In Visual Studio you can do this via the user interface, but on Linux it is a bit more involved because you need to modify your makefile to use the Xcelerit compiler driver. As Xcelerit is used only on the service side of the Symphony application, you don’t have to recompile the client side. This is a good illustration of why it pays to invest in SOA-oriented applications.

Task distribution
When you create a Symphony service you need to design how to distribute the processing most effectively. Xcelerit alone processes a whole dataset by automatically redistributing data between CPUs and GPUs within one machine only. Symphony, on the other hand, adds another level of parallelization, as it distributes processing among multiple computing nodes.

In Symphony the smallest unit of work is called a Task. When you have to process, say, 100,000,000 input items, you have to group this data into packages and create Tasks from them. The packages need to be relatively big because, on top of Symphony's task distribution, Xcelerit distributes work within the machine using its own task scheduling, as can be seen in Figure 3.
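
The grouping step can be sketched in plain C++ as follows. This is only an illustration of splitting an input range into packages; Package and make_packages are hypothetical names, and no Symphony API calls are shown. Each resulting range would become the payload of one Task.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open index range [begin, end) covered by one package,
// i.e. the data that would be sent as a single Task.
struct Package {
    std::size_t begin;
    std::size_t end;
};

// Split total_items inputs into consecutive packages of at most
// package_size items each. Returns an empty list if package_size is 0.
std::vector<Package> make_packages(std::size_t total_items,
                                   std::size_t package_size) {
    std::vector<Package> packages;
    if (package_size == 0) {
        return packages;
    }
    for (std::size_t begin = 0; begin < total_items; begin += package_size) {
        const std::size_t end = std::min(begin + package_size, total_items);
        packages.push_back({begin, end});
    }
    return packages;
}
```

For example, make_packages(100000000, 1000000) yields 100 packages of one million items each. The right package size depends on the workload and has to be found by measurement; a package has to be large enough to keep all the CPUs and GPUs of a node busy.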

[img]images/img/4_figure3_datadistribution.png[/img] 

As said before, an Xcelerit-based application by default uses every possible resource on the server, including all CPUs, GPUs and memory. As a result, you can run only one instance of an Xcelerit-enabled application per node, and it is not recommended to run any other services at the same time, as this will be sub-optimal. If you have more slots per machine, running multiple Xcelerit services will slow down other services as they all fight for resources (CPU, memory, GPU). The same principles apply as for any other multi-threaded or GPU/CUDA application run from within Symphony.

Resource sharing
The simplest solution to the problem of multiple multi-threaded or concurrent Xcelerit processes on one node is to reduce the number of slots per machine to just one. This forces Symphony to spawn only one instance of the application per node. However, for some system architectures this may not be a good solution. By reducing the slots to one, we also reduce the computing capacity available to other applications, because they too will be able to use only one slot. If those other applications are single-threaded, most of the CPU power on the node is wasted, and we need more slots there.

If you would like to keep more slots per machine, you need to implement resource limits for Xcelerit SDK processes. Memory and GPU usage can be limited using the Xcelerit API, which is straightforward. CPU limitation is more complex as it depends on the operating system. On Windows Server you can configure CPU affinity at process startup (start command); on Linux you can do the same (taskset command) or assign processes to different control groups (cgroups).
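
On Linux, the same kind of restriction that taskset applies from the command line can also be set from inside the process with the sched_setaffinity system call. The sketch below is Linux-only and is not part of the Xcelerit SDK; the function name limit_to_first_cpus is hypothetical. It narrows the CPU mask of the calling thread to the first few CPUs it is already allowed to use.

```cpp
#include <cassert>
#include <sched.h>

// Linux-only sketch: keep at most max_cpus of the CPUs that the calling
// thread is currently allowed to run on.
// Returns the number of CPUs in the new mask, or -1 on failure.
int limit_to_first_cpus(int max_cpus) {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    // pid 0 means "the calling thread".
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) {
        return -1;
    }

    cpu_set_t limited;
    CPU_ZERO(&limited);
    int kept = 0;
    for (int cpu = 0; cpu < CPU_SETSIZE && kept < max_cpus; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) {
            CPU_SET(cpu, &limited);
            ++kept;
        }
    }
    if (kept == 0) {
        return -1;  // nothing to pin to (e.g. max_cpus <= 0)
    }
    if (sched_setaffinity(0, sizeof(limited), &limited) != 0) {
        return -1;
    }
    return kept;
}
```

Threads created after the call inherit the narrowed mask, so the call belongs at service start-up, before any worker threads exist. Whether this is preferable to taskset or cgroups depends on how the services are launched.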

If you have a hybrid cluster with CPU-only nodes and CPU+GPU nodes, you also need to remember to configure resource groups within EGO to mark which nodes are CPU-only and which have GPUs. Alternatively, you can split the machines into two resource groups and do the same with services, depending on whether they are GPU-enabled or not.

Conclusions and integration summary
To conclude my research on the Xcelerit SDK, I must say that I'm very positive about the library, and there are plenty of use cases where it can be applied. I think it is especially useful when you want to take advantage of GPU computing and CPU parallel processing quickly, without redesigning the whole architecture from scratch.

The library is very good for easy speed improvements. You take your current C++ code, convert it to Xcelerit's dataflow graph architecture, and that is all: you are faster. You don’t have to write GPU-specific code, data synchronization or thread synchronization; the library does it all for you. At a larger scale, the integration with grid middleware seems promising, but fine-grained performance tuning and a further integration study will be required to get the best out of the technology and to integrate Xcelerit-enabled applications with others.
Vincent Carbonare