Despite being a fan of such solutions (I have been heavily involved in projects around Hadoop and HBase, particularly when operating in the cloud), I have to agree with the assessment of the current market as well as the prediction of future adoption. However, unlike my colleague, I do not think the problem is one of problem definition but one of understanding.
What is unstructured data?
Both opponents and promoters of Big Data/NoSQL solutions agree, for the most part, that Big Data/NoSQL is best suited to unstructured data. Though the point is still under contention, for the purposes of this article we will assume that a relational storage and query system is better suited to structured data, and that the market is saturated with products that contain all of the features you could possibly need.
Many may then ask: if the lion’s share of financial problems is based upon structured data, why are we surprised at the lack of take-up of alternatives? However, I would disagree with the assumption that there is very little unstructured data in the finance industry, and indeed challenge the notion that relational data stores cannot store, and are not already storing, unstructured data.
The problem, I believe, manifests itself in the blob. This very flexible data storage mechanism has been adopted by all the major players in the database world and is used extensively throughout financial datasets and indeed in other sectors too. By its very definition this data type is designed for unstructured data (usually XML or binary) that does not easily fit into the relational paradigm. Data stored in this fashion is reduced to a key-value pair arrangement of the type that is common in NoSQL storage and is the basis for most Hadoop workloads.
To say that data stored in this way was popular in the finance industry would be an understatement. In fact I have worked with one large financial institution that stored market data for an asset class in its entirety in xml blob format within an Oracle relational database.
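To make the blob-as-key-value point concrete, here is a minimal sketch using Python's built-in `sqlite3` module rather than Oracle. The table name `market_data`, the key `EXAMPLE-1` and the XML layout are hypothetical and chosen purely for illustration. It shows that once the payload sits in a blob column, the relational engine in this sketch is only doing a lookup by key, and the structure is recovered in application code.

```python
import sqlite3
import xml.etree.ElementTree as ET

# In-memory database; the table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE market_data (instrument_id TEXT PRIMARY KEY, payload BLOB)"
)

# A made-up XML document stored as an opaque blob.
xml_doc = b"<quote><bid>99.5</bid><ask>100.5</ask></quote>"
conn.execute("INSERT INTO market_data VALUES (?, ?)", ("EXAMPLE-1", xml_doc))

# The query only locates the row by its key; it does not look inside the payload.
(payload,) = conn.execute(
    "SELECT payload FROM market_data WHERE instrument_id = ?", ("EXAMPLE-1",)
).fetchone()

# The structure is recovered by the application, much as with a key-value store.
quote = ET.fromstring(payload)
mid = (float(quote.findtext("bid")) + float(quote.findtext("ask"))) / 2
print(mid)
```

The access pattern here (fetch by key, parse in the client) is the same one a NoSQL key-value store would serve, which is why blob-heavy schemas are natural candidates for comparison.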
From the above it is reasonable to deduce that relational data stores are capable of storing unstructured data and (particularly in the case of XML) can do so efficiently. So when would you use a NoSQL solution in place of the traditional relational approach? Issues with relational database scalability have been well documented, and it is easy to see why a scenario where data is read, modified and immediately written back to the store could be accomplished better using a tool such as Hadoop rather than, for instance, Oracle RAC.
Big Data/NoSQL solutions will not always be the correct solution when you have unstructured data (or even the correct solution in most cases). However, they should always be considered and, as demonstrated above, may not be ruled out as easily as is currently thought.
How Big is Big Data?
Another point of confusion around Big Data/NoSQL solutions is the size of dataset to be used with a given solution. The major players in the Big Data world grew out of systems from the likes of Google that deal with truly enormous datasets that are beyond even the largest financial datasets.
However I do not believe there is a clear cut limit on data sizes and there is a significant overlap in sizes managed by traditional relational solutions and Big Data/NoSQL solutions. It is perfectly feasible to have a dataset of 50TB in size that could be entirely suitable for a NoSQL solution whilst having a dataset of 100TB in size that is better suited to a relational store.
Most accept the above but question whether such large datasets actually exist in the financial world. Traditional analytics may not involve datasets this large, but there has recently been a shift within finance towards methodologies that are both compute and storage intensive (a few years ago the growth was in compute alone), and these can easily generate very large datasets. I would contend that previously the cost/benefit ratio of such datasets would not have been favourable enough, but with the fall in the price of the hardware and software required to support such data these are becoming realistic prospects.
A Monte Carlo Case Study
A typical Monte Carlo VaR process simulates a wide variety of market conditions (called scenarios) and generates data (usually unit prices) for a large number of financial products under each of these conditions. The computation is intense but well within the bounds of traditional computation methods; the amount of data it generates, however, can be enormous. Usually a large number of scenarios is used (100,000+), and an example from a recent Excelian project produced more than 300GB of data per currency per night. If we were to extrapolate this to cover the entire market (of around 30 currencies, for example) then 9TB is generated per night, and even this is relatively small compared with some workloads.
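The extrapolation above is simple enough to check in a few lines. This sketch uses the 300GB and 30-currency figures from the text and decimal units (1TB = 1000GB). The retention helper and its five-night example are hypothetical additions, included only to show how quickly the total grows if more than one night is kept.

```python
def nightly_volume_tb(gb_per_currency: float, currencies: int) -> float:
    """Nightly output across all currencies, in decimal terabytes."""
    return gb_per_currency * currencies / 1000


def retained_volume_tb(nightly_tb: float, nights_retained: int) -> float:
    """Total stored volume if several nights of output are kept (assumption)."""
    return nightly_tb * nights_retained


nightly = nightly_volume_tb(gb_per_currency=300, currencies=30)
print(nightly)                          # 9.0
print(retained_volume_tb(nightly, 5))   # 45.0, for a hypothetical five-night window
```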
Storage for such a large amount of data must be considered, and a relational store is a perfectly viable option. However, the use case for the data must also be considered. In this particular case subsections of the data are used in a scalar product calculation that provides near real-time updates to the VaR in response to intraday requests. NoSQL/Big Data solutions are designed for this kind of usage due to their “distributed by design” approach and are likely to provide clear performance and simplicity advantages over a typical relational solution.
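The scalar product step described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation from the project mentioned: the stored data is modelled as one row of per-unit price changes per scenario (generated randomly here as a stand-in), positions are a plain list of quantities, and VaR is taken as a simple empirical quantile of the losses. All function names are hypothetical.

```python
import random


def simulate_scenario_moves(n_scenarios, n_products, seed=0):
    """Stand-in for the stored nightly output: per-unit price change per scenario."""
    rng = random.Random(seed)
    return [
        [rng.gauss(0.0, 1.0) for _ in range(n_products)]
        for _ in range(n_scenarios)
    ]


def portfolio_pnl(scenario_rows, positions):
    """Scalar product of each stored scenario row with the position vector."""
    return [sum(move * qty for move, qty in zip(row, positions))
            for row in scenario_rows]


def value_at_risk(pnl, confidence=0.99):
    """Loss at the given confidence level, using a simple empirical quantile."""
    losses = sorted(-x for x in pnl)
    index = min(len(losses) - 1, int(confidence * len(losses)))
    return losses[index]


# Nightly batch: scenarios are generated once and kept.
scenarios = simulate_scenario_moves(n_scenarios=1000, n_products=5)
positions = [100, -50, 0, 25, 10]
var_overnight = value_at_risk(portfolio_pnl(scenarios, positions))

# Intraday request: one position changes; the stored scenarios are reused as-is.
positions[2] = 200
var_intraday = value_at_risk(portfolio_pnl(scenarios, positions))
print(var_overnight, var_intraday)
```

Each scenario row is processed independently of the others, which is what makes this kind of calculation straightforward to spread across the nodes of a distributed store.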
It is worth noting that many existing implementations of Monte Carlo VaR calculations do not store this interim data because of the storage and data access constraints. This means that any VaR recalculation must regenerate the scenarios, which usually makes intraday recalculation impractical. This is an example of NoSQL/Big Data solutions providing clear business value where it perhaps was not practical before.
New data – new opportunity
Responsibility for the slow uptake of Big Data/NoSQL solutions outside of start-ups and other generally technology-hungry sectors can, in my opinion, be largely laid at the door of the “shock and awe” tactics employed by early adopters and promoters of the technology. Technologies like Hadoop were billed as the solution to all problems, and this led to more than one adoption where it was not the right tool for the job. What we are experiencing now, however, is a backlash against that early enthusiasm that has gone too far the other way. Big Data/NoSQL solutions are considered the realm of social media applications and are not taken seriously for the kind of bread and butter problems that drive the industry. We are in an industry which needs time to digest new technologies and to see how to leverage all the engineering effort that is put into these products. I believe we are beginning to recognise the potential of this new wave of technology and to derive real business benefit from it. Far from bleak, I think the future of Big Data/NoSQL is one of enormous potential.