Big Data in Capital Markets

Big data has moved on from being a buzzword in the annual waves of technology hype, and now information management is becoming an important discipline across Financial Services. This article looks at why big data has become a hot topic within the industry, why this demands smarter computing and finally what this holds for the future.

The regulatory transformations sweeping through the industry over the last few years have required firms to make significant changes to their business processes and infrastructure. In parallel, technology has been evolving, and many Tier 1 banks have sought to take advantage of this by investing in data lakes as part of their architecture, meeting these requirements in a more strategic way. As a result, some of the first use cases for big data in Capital Markets were in the Regulatory and Compliance areas – for two main reasons.

Figure 1 – Data lake usage in Capital Markets

Firstly, on the back of increasingly onerous regulations (Dodd-Frank, Basel III and MiFID II, to name but a few), firms are required to perform stress tests, reporting and reconciliation on many different types of trade, and then to supply this data immediately to various agencies. Rather than developing multiple solutions from separate applications, data lakes allow a smarter approach. Data from multiple separate platforms is centralised in a single repository in its original schemas. New schemas can be imposed dynamically when querying or, alternatively, ETL (Extract, Transform, Load) processes can feed a traditional, consolidated RDBMS (relational database management system) data warehouse. The regulatory development work can then be done once by sourcing from the data warehouse, not the individual platforms, and can easily be enhanced or re-used numerous times. An added advantage is that data lakes preserve a transparent audit trail of the information (data lineage) from the source systems - a clear requirement of BCBS 239 for risk reporting.
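To make the idea of imposing a schema at query time more concrete, here is a minimal sketch in plain Python. The source names, field names and sample records are all hypothetical and chosen purely for illustration; a real data lake would use a distributed store and query engine rather than in-memory lists.

```python
import json

# Hypothetical raw records, kept exactly as each source platform emitted them.
RAW_LAKE = [
    {"source": "equities_oms",
     "payload": '{"trade_id": "E1", "sym": "ABC", "qty": 100, "px": 10.5}'},
    {"source": "fx_platform",
     "payload": '{"id": "F7", "pair": "EURUSD", "notional": 250000, "rate": 1.1}'},
]

# One mapping per source: consolidated field -> function that derives it
# from the raw record. These mappings are assumptions, not a real standard.
SCHEMAS = {
    "equities_oms": {
        "trade_id": lambda r: r["trade_id"],
        "instrument": lambda r: r["sym"],
        "notional": lambda r: r["qty"] * r["px"],
    },
    "fx_platform": {
        "trade_id": lambda r: r["id"],
        "instrument": lambda r: r["pair"],
        "notional": lambda r: float(r["notional"]),
    },
}


def read_with_schema(lake, schemas):
    """Apply the consolidated schema at read time; raw records stay untouched."""
    rows = []
    for record in lake:
        raw = json.loads(record["payload"])
        mapping = schemas[record["source"]]
        row = {field: extract(raw) for field, extract in mapping.items()}
        row["lineage"] = record["source"]  # pointer back to the source system
        rows.append(row)
    return rows


consolidated = read_with_schema(RAW_LAKE, SCHEMAS)
total_notional = sum(row["notional"] for row in consolidated)
```

Because the raw payloads are never modified, a new report can define its own mapping without touching the source platforms, and the `lineage` field gives a simple trace back to where each row came from.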

Secondly, in the compliance space, firms can make use of data lakes to analyse in real time a broad range of complex structured data (e.g. orders, trades, prices, risk metrics) and unstructured data (IMs, emails, research) to help filter out and classify activity or individuals that warrant further investigation. The key point is that such analysis is only possible when the data is in one place, so that smart algorithms can piece together different data sets to identify suspicious behaviour. At the same time, screening configuration data (e.g. country lists for AML, watchword lists for email/IM) need only be updated in one place.
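As a rough illustration of why a single screening configuration helps, the sketch below runs one shared set of watchwords and country codes over both messages and trades. The watchwords, the placeholder country codes "XX" and "YY", and all record fields are invented for this example and do not reflect any real screening list.

```python
# Hypothetical screening configuration, maintained in a single place.
SCREENING_CONFIG = {
    "watchwords": {"guarantee", "off the record", "delete this"},
    "high_risk_countries": {"XX", "YY"},  # placeholder codes, not real ones
}

# Hypothetical unstructured (messages) and structured (trades) data.
MESSAGES = [
    {"channel": "email", "sender": "trader_a",
     "text": "Keep this off the record until Friday."},
    {"channel": "im", "sender": "trader_b", "text": "Lunch at noon?"},
]
TRADES = [
    {"trade_id": "T1", "trader": "trader_a", "counterparty_country": "XX"},
    {"trade_id": "T2", "trader": "trader_b", "counterparty_country": "GB"},
]


def screen(messages, trades, config):
    """Collect alerts per individual across both kinds of data."""
    alerts = {}
    for message in messages:
        text = message["text"].lower()
        hits = sorted(w for w in config["watchwords"] if w in text)
        if hits:
            alerts.setdefault(message["sender"], []).append(
                ("watchword", message["channel"], hits))
    for trade in trades:
        if trade["counterparty_country"] in config["high_risk_countries"]:
            alerts.setdefault(trade["trader"], []).append(
                ("country", trade["trade_id"], trade["counterparty_country"]))
    return alerts


alerts = screen(MESSAGES, TRADES, SCREENING_CONFIG)
# Individuals flagged by more than one kind of check may deserve a closer look.
multi_flagged = [who for who, items in alerts.items() if len(items) > 1]
```

Adding a watchword or a country to `SCREENING_CONFIG` changes the behaviour of every check at once, which is the practical benefit of holding the configuration and the data together.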

Smarter analysis

In the compliance area, there has been a strong push to improve surveillance, ideally to proactively identify rogue trading and hence limit exposures rather than raise a red flag after the event. Given the complexity and scale of the information available to be scrutinised, it’s only natural that firms are looking to computers to do more of the work. Machine learning is hardly a new concept in computer science, but with the growth of big data technologies, it is now being applied in real-life situations to do just this.

Put simply, machine learning is when algorithms automatically improve with experience. So by feeding in past scenarios of market abuse, for example, it’s possible to have programs learn from these and then estimate the probability of a recurrence by examining current data. In theory, these programs can even flag up problems before they become critical – by matching patterns of behaviour that, left unchecked in the past, led to major issues.
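A toy version of this idea can be sketched with a small logistic regression trained by gradient descent on labelled past cases. The two features (an order cancellation ratio and a share of after-hours activity), the sample values and the labels are all made up for illustration; real surveillance models would use far richer data and an established library.

```python
import math

# Hypothetical labelled history:
# (cancel_ratio, after_hours_share) -> 1 if the case was later confirmed as abuse.
HISTORY = [
    ((0.90, 0.80), 1), ((0.80, 0.70), 1), ((0.85, 0.90), 1),
    ((0.10, 0.20), 0), ((0.20, 0.10), 0), ((0.15, 0.30), 0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, bias, features):
    """Linear score passed through a sigmoid to give a probability."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def train(history, epochs=2000, learning_rate=0.5):
    """Fit a logistic regression with plain stochastic gradient descent."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for features, label in history:
            error = score(weights, bias, features) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


weights, bias = train(HISTORY)

# Score current activity against what was learned from past cases.
p_suspicious = score(weights, bias, (0.88, 0.75))
p_benign = score(weights, bias, (0.12, 0.15))
```

The model is never told a rule such as "a high cancellation ratio is suspicious"; it infers the pattern from the labelled examples, and feeding in more confirmed cases would refine the weights further.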

One of the prime examples is NASDAQ, which recently formed an alliance with Digital Reasoning to enhance its SMARTS market surveillance platform. SMARTS already had strong capabilities for handling structured data, but Digital Reasoning’s technology allows scrutiny of the enormous amount of unstructured data such as emails, IM chats and phone calls. It uses behavioural analytics to interpret this data and provide a holistic view of the trading environment, including ‘hidden’ relationships between people, to help assess risk - something that a rules-based approach cannot easily uncover. An insider trading example provided by Digital Reasoning shows the technology automatically sifting through social media, company and stock trading data, and flagging heavy short-selling of ABC stock by a company that is owned by an ex-colleague of an ABC director.
This kind of relationship wouldn’t have been picked up by previous compliance systems, which had a much narrower focus.
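The insider trading example can be pictured as a search over a graph of relationships drawn from different data sets. The sketch below is not Digital Reasoning's method; it is a plain breadth-first search over a handful of invented entities ("Fund X", "Person P", "Director D") linked to the ABC stock from the example.

```python
from collections import deque

# Hypothetical relationship facts, each imagined as coming from a
# different data set (company registry, social media, corporate filings).
EDGES = [
    ("Fund X", "owned_by", "Person P"),
    ("Person P", "ex_colleague_of", "Director D"),
    ("Director D", "director_of", "ABC"),
    ("Fund Y", "owned_by", "Person Q"),
]


def build_graph(edges):
    graph = {}
    for left, relation, right in edges:
        # Treat every relationship as a link that can be followed either way.
        graph.setdefault(left, []).append((relation, right))
        graph.setdefault(right, []).append((relation, left))
    return graph


def find_link(graph, start, target, max_hops=3):
    """Breadth-first search for the shortest chain linking two entities."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        if len(path) == max_hops:
            continue  # do not extend chains beyond the hop limit
        for relation, neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, path + [(node, relation, neighbour)]))
    return None


graph = build_graph(EDGES)
# A fund short-selling ABC is checked for any short chain to the company.
link = find_link(graph, "Fund X", "ABC")
no_link = find_link(graph, "Fund Y", "ABC")
```

No single edge here is suspicious on its own; the chain only appears once ownership, past employment and directorship data sit in the same place.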

Raw, unstructured data is read in batches or streamed through a cluster of Hadoop MapReduce or Storm nodes, which perform the initial mapping into entities through Natural Language Processing. This entity data is then stored in a knowledge base (a Cassandra or HBase storage layer). Following initial model training, further analysis using Hadoop updates the knowledge base by categorising entities and mapping out correlations and relationships.
The results can be accessed through Impala and integrated into a variety of tools for querying and visualisation.
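The flow described above (map raw text to entities, then aggregate into a knowledge base) can be imitated in a few lines of plain Python. This is only a conceptual sketch: it does not use the Hadoop, Storm, Cassandra, HBase or Impala APIs, a fixed dictionary of invented entity names stands in for real Natural Language Processing, and a `Counter` stands in for the storage layer.

```python
from collections import Counter
from itertools import combinations

# Hypothetical entity dictionary standing in for a real NLP entity extractor.
KNOWN_ENTITIES = {"abc", "fund x", "director d"}

# Hypothetical raw, unstructured documents.
DOCUMENTS = [
    "Director D mentioned ABC results to Fund X",
    "Fund X increased its short position in ABC",
    "Lunch menu for Friday",
]


def map_entities(document):
    """Map step: emit every pair of known entities mentioned together."""
    text = document.lower()
    found = sorted(entity for entity in KNOWN_ENTITIES if entity in text)
    return list(combinations(found, 2))


def reduce_counts(mapped):
    """Reduce step: count how often each pair co-occurs across documents."""
    knowledge_base = Counter()
    for pairs in mapped:
        knowledge_base.update(pairs)
    return knowledge_base


knowledge_base = reduce_counts(map_entities(doc) for doc in DOCUMENTS)
# Query step: the most frequently linked pair of entities.
strongest_pair, strongest_count = knowledge_base.most_common(1)[0]
```

In a real cluster the map step would run in parallel across many nodes and the counts would be written to a distributed store, but the division of labour between the two steps is the same.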

Whilst the overall platform is proprietary, it’s important to note that some of the machine learning algorithms are available as part of open source technologies such as Apache Mahout. Hence, if firms have the appetite to perform a deep analysis of their data, a lot of the building blocks are already out there.

Looking ahead

Machine learning and AI technologies have moved out of the lab and are now increasingly being used in a number of industries, particularly as a means to probe big data. Financial Services firms are looking to catch up and are exploring how to use these technologies not just in the regulatory space, but also to unlock value through improved analysis and modelling for trading purposes. This is a growth area with huge potential, as the amount of digital information (particularly unstructured) is growing at staggering rates – by some estimates, more data was created during the last two years than in the rest of human history.

For more insights on big data related topics see the big data issue of Tech Spark magazine. To receive your copy please email

