On 24 April, Excelian attended a summit organised by Oracle on the hot topic of Big Data and Analytics, in particular the challenge of processing large amounts of data to find meaningful insight in it.
There were many delegates who are interested (or potentially interested) in this subject, as more and more enterprises want better insight into the data they own in order to improve their business and serve their clients in a better way.
Undoubtedly “Big Data” has become a buzzword, and there are plenty of commercial IT solutions that help customers gather, process and aggregate vast amounts of data, potentially in different formats, to get an overall view of it and bring some value to the business.
Information management is about using information to make decisions. Decision-making requires effective information that is delivered to the right place, at the right time and to the right person(s), so precision is needed to get the correct information.
An understanding of the data is fundamental to making such decisions, and Big Data brings many challenges that are not easy to tackle. The three main ones are size, scope and speed.
The first challenge, size, is the most obvious one, as the amount of data can be daunting, especially when it grows continuously. Data can also come in different formats, which makes things even more challenging when you need to aggregate it.
The second challenge, scope, concerns what you want to get from the analysis of the data and what kind of information you want to extract for your business.
The third challenge, speed, depends on one’s expectations for processing large amounts of data and on any SLAs that are in place. It is important to process data quickly enough to get valuable information on time, so that the business can stay flexible and agile.
Nowadays, various companies base their business on Big Data, especially Google, Amazon, Netflix and Facebook, to name just a few. Interestingly, more enterprises in different industries (mobile, gaming, telcos, marketing, retail, insurance, banking) are looking at the potential of Big Data, so there will be more opportunities to build the tools and develop the skills that the challenge requires.
After all, the value of Big Data lies in exploring new ways of processing data and finding the bits that we are interested in. Big Data is definitely going to be disruptive for information management systems, so enterprises that intend to embark on this challenge need to mature in their understanding of the value and role of information.
The Hadoop ecosystem
Hadoop is another buzzword associated with Big Data, and it exists because of the need to process lots of unstructured data. What is it then? It is an open source project that provides a framework for the distributed processing of data across clusters of computers.
Hadoop is based on the map-reduce paradigm, which was designed by Google and is explained in detail here. Briefly, the concept is that you split your data into smaller chunks so that you can process them in parallel on multiple nodes (allowing for horizontal scaling), using business logic that has been specifically developed to analyse that kind of data. The partial results from each chunk are then combined (reduced) into the final answer.
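To make the idea more concrete, here is a minimal sketch of the map-reduce pattern in plain Python, using word counting as the business logic. It does not use Hadoop at all and runs in a single process; the function names (split_into_chunks, map_chunk, shuffle, reduce_counts, word_count) and the sample data are assumptions made up for illustration only.

```python
from collections import defaultdict
from itertools import chain


def split_into_chunks(records, chunk_size):
    """Split the input into smaller chunks, one per (simulated) node."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_pairs):
    """Group all emitted values by key so that each key is reduced once."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: combine the values for each key into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, chunk_size=2):
    chunks = split_into_chunks(lines, chunk_size)
    # On a real cluster each chunk would be mapped on a different node;
    # here the chunks are simply processed one after another.
    mapped = chain.from_iterable(map_chunk(chunk) for chunk in chunks)
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    # Hypothetical sample input, three "records" of text.
    sample = ["big data is big", "data needs processing", "big insight"]
    print(word_count(sample))
```

Because each chunk is mapped independently of the others, the map step is the part that can be spread across many machines; only the grouping and reduce steps need to bring results for the same key together.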
Hadoop is part of the Apache ecosystem, written mainly in Java, and free to use. There are commercial software solutions based on Hadoop with some customisations and enhancements. Major software companies such as Oracle and Microsoft provide Big Data solutions based on Hadoop, for example the Oracle Big Data Appliance, which makes Hadoop all the more worth learning about and getting started with.
Alongside Hadoop there are other related open source projects, the most important of which are Hive and HBase.
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop. Essentially, it projects structure onto this data and lets you query it using a SQL-like language called HiveQL.
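The idea of projecting structure onto raw data can be illustrated with a small sketch in plain Python. This is only an analogy and does not use Hive: the PageView schema, the tab-separated record layout and the function names are all hypothetical. The aggregation corresponds roughly to what a query such as SELECT country, SUM(seconds) FROM page_views GROUP BY country would return, where page_views is an assumed table name.

```python
from collections import defaultdict, namedtuple

# Hypothetical schema applied to raw tab-separated lines at read time.
PageView = namedtuple("PageView", ["user", "country", "seconds"])


def project_schema(raw_lines):
    """Interpret each raw line as a PageView; the raw data itself is unchanged."""
    for line in raw_lines:
        user, country, seconds = line.rstrip("\n").split("\t")
        yield PageView(user, country, int(seconds))


def total_seconds_by_country(rows):
    """Sum the seconds per country, similar in spirit to a GROUP BY query."""
    totals = defaultdict(int)
    for row in rows:
        totals[row.country] += row.seconds
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical raw records as they might sit in a file.
    raw = ["alice\tUK\t30", "bob\tFR\t12", "carol\tUK\t8"]
    print(total_seconds_by_country(project_schema(raw)))
```

The point of the sketch is that the stored data stays as plain text and the structure is only applied when it is read, which is what allows a query layer to sit on top of files that were never loaded into a traditional database.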
Hadoop is not an out-of-the-box solution that you can simply deploy and start to use; it is a framework that you can build on to develop your Big Data processing logic and extract information from your data.
The Oracle approach
Oracle offers a broad portfolio of products to help enterprises acquire, manage and integrate Big Data with existing information, with the goal of achieving a complete view of business in the fastest, most reliable and cost effective way.
Oracle offers an engineered system of hardware and software called Oracle Big Data Appliance, which is designed to help enterprises derive value from their Big Data strategies. It incorporates Cloudera’s Distribution, which is a packaged version of Hadoop with a management console; it also includes an open source distribution of the R language, which is used for mathematical and statistical data analysis.
Another product on offer is Oracle In-Memory Machine, another engineered system of hardware and software, this time for business intelligence (BI). It delivers extreme performance and advanced data visualisation and exploration in order to provide actionable insight quickly from large amounts of data. This solution can access different data sources to get structured and unstructured data.
It is worth mentioning that Oracle also provides a NoSQL database, Oracle NoSQL Database, alongside Oracle Database 11g.
Oracle is undoubtedly investing a lot in the Big Data revolution and is in a good position to be the major player in this arena. Oracle’s experience comes from databases and structured data, so this innovation is seen as a natural progression.
Hadoop has been in the open source community for a few years and is now gaining a lot of momentum as more IT providers use it in either their software or their cloud solutions.
At the moment it is a fundamental piece of the Big Data puzzle for analysing large amounts of raw data at a low level.
Big Data management can be very complicated, so when selecting tools for data analysis there are a few considerations to take into account:
Where will the data be processed? Locally-hosted software, dedicated appliance or in the cloud?
Where does the data originate? How will it be transported? It is often easier to move the application close to the data (data affinity) to avoid network latency.
How clean is the data? Variety means that it needs a lot of cleaning, and that costs both time and money.
What is your organisational culture? Do you have teams with the necessary skills to analyse the data? Analysing data needs the creative ability to look at problems in different ways.
What do you want to do with the data? Having some idea about the outcomes of the analysis may help to identify patterns in the data.