In June, Microsoft opened their Windows Azure Infrastructure as a Service (IaaS) offerings, allowing users to manage their own virtual machines within the Microsoft Azure cloud. This new feature set greatly widens the potential roles for Azure-hosted products, in particular the now industry-standard Apache Hadoop data storage and processing platform.

Previous attempts to implement Hadoop clusters in the Azure cloud have met with communication issues between nodes. Thankfully, a side effect of the expansion into IaaS is the relaxation of these restrictions, and the new options for communication between virtual machines are simple and intuitive.

Using the new services and the Cloudera CDH4 distribution, Excelian was able to build and operate a Hadoop cluster on Azure. The intention of the project was to mimic the Platform as a Service (PaaS) Hadoop functionality that Amazon offers on its EC2 cloud, known as Amazon Elastic Map Reduce.

Because Hadoop is an open source project, it has been customised into many different distributions. Amazon by default offer three distributions in their Elastic Map Reduce service: an Amazon distribution and two MapR distributions. Excelian chose Cloudera CDH4 as it includes the latest updates to the Hadoop code base (including Map Reduce v2) and offers high availability, previously a point of contention when considering Hadoop distributions.

Excelian took a minimalist approach to the cluster design, with the focus on meeting the requirements of a stable and highly available Cloudera distribution whilst using the minimum number of virtual machines possible. The figure below shows the architecture of the cluster, demonstrating a minimum requirement of 6 machines to operate a cluster.


From the diagram above you can see that 3 ZooKeeper server nodes are provided on the nodes with the least expected utilisation in order to provide the “hot standby” automatic failover that is synonymous with Cloudera’s approach to Hadoop.
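
The choice of three ZooKeeper servers follows from majority voting: an ensemble keeps working only while more than half of its servers are available. A minimal sketch of that arithmetic is given below; the function names are illustrative and are not part of any ZooKeeper API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("an ensemble needs at least one server")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Number of servers that can be lost while a majority remains."""
    return ensemble_size - quorum_size(ensemble_size)


# Compare a few ensemble sizes: size, majority needed, failures tolerated.
for size in (1, 2, 3, 5):
    print(size, quorum_size(size), tolerated_failures(size))
```

A two-server ensemble tolerates no failures at all, so three servers is the smallest ensemble that survives the loss of one machine.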

The other ZooKeeper components shown on the diagram are clients and one must be running on each of the Name Nodes (whether primary or standby) in the cluster. In the event of a failover, ZooKeeper will manage the transition of the node running the Hadoop HA Primary Name Node service into a full Primary Name Node.

The core Hadoop HDFS components, the Name Node and the Data Node, are shown here in blue. The Hadoop Primary Name Node service tracks the location of data stored in the cluster. Please note that CDH4 does not require the usual Secondary Name Node when using automatic HA, as the hot standby performs the tasks that the Secondary Name Node would normally perform. The Hadoop Data Node service is responsible for storing data on the Hadoop Distributed File System (HDFS).

Finally, shown in yellow are the YARN (Yet Another Resource Negotiator) components. The YARN components manage resource usage for the cluster and are responsible for executing MapReduce jobs in the most efficient way. The YARN Resource Manager service manages the jobs on the cluster by communicating with the YARN Node Manager service running on each of the Data Nodes. The YARN History Service is not strictly required but provides useful job life cycle data.
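
The placement rules described so far can be written down as a small check. The sketch below is only illustrative: the host names, the role labels and the particular six-machine layout are assumptions, not a transcription of the original diagram.

```python
from typing import Dict, List, Set

# Hypothetical six-machine layout. Host names, role labels and the exact
# placement of roles are assumptions made for this example.
LAYOUT: Dict[str, Set[str]] = {
    "nn1": {"namenode", "zookeeper-client"},
    "nn2": {"namenode", "zookeeper-client", "zookeeper-server"},
    "rm1": {"resourcemanager", "historyserver", "zookeeper-server"},
    "nfs1": {"nfs-server", "zookeeper-server"},
    "dn1": {"datanode", "nodemanager"},
    "dn2": {"datanode", "nodemanager"},
}


def check_layout(layout: Dict[str, Set[str]]) -> List[str]:
    """Return a list of rule violations; an empty list means the layout passes."""
    problems: List[str] = []

    # Rule 1: exactly three ZooKeeper servers.
    zk_servers = [h for h, roles in layout.items() if "zookeeper-server" in roles]
    if len(zk_servers) != 3:
        problems.append(f"expected 3 ZooKeeper servers, found {len(zk_servers)}")

    for host, roles in sorted(layout.items()):
        # Rule 2: every Name Node (primary or standby) runs a ZooKeeper client.
        if "namenode" in roles and "zookeeper-client" not in roles:
            problems.append(f"{host}: Name Node without a ZooKeeper client")
        # Rule 3: every Data Node runs a YARN Node Manager.
        if "datanode" in roles and "nodemanager" not in roles:
            problems.append(f"{host}: Data Node without a YARN Node Manager")

    return problems


print(check_layout(LAYOUT) or "layout satisfies the stated rules")
```

A check of this kind could be run before any machines are started, so that a mistake in the role assignment is caught without paying for a failed cluster launch.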

The NFS server shown here is used to synchronise the current Name Node and the standby. It must be accessible from both the Primary and HA Name Nodes whilst not running on either.
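
To make the role of the NFS share and of ZooKeeper more concrete, the sketch below renders a partial set of Hadoop HA settings as site XML. The property names are the standard HDFS HA ones; the nameservice ID, host names, ports and mount path are placeholders chosen for this example, and the fragment is not a complete configuration (fencing and client failover settings, for instance, are left out).

```python
import xml.etree.ElementTree as ET
from typing import Dict

NAMESERVICE = "azurecluster"  # hypothetical nameservice ID

# Partial hdfs-site.xml settings; every value is a placeholder.
HDFS_SITE: Dict[str, str] = {
    "dfs.nameservices": NAMESERVICE,
    f"dfs.ha.namenodes.{NAMESERVICE}": "nn1,nn2",
    f"dfs.namenode.rpc-address.{NAMESERVICE}.nn1": "nn1.example.internal:8020",
    f"dfs.namenode.rpc-address.{NAMESERVICE}.nn2": "nn2.example.internal:8020",
    # Directory on the NFS share, mounted on both Name Nodes.
    "dfs.namenode.shared.edits.dir": "file:///mnt/nfs/dfs/ha-edits",
    "dfs.ha.automatic-failover.enabled": "true",
}

# Partial core-site.xml settings: the three ZooKeeper servers.
CORE_SITE: Dict[str, str] = {
    "ha.zookeeper.quorum": (
        "zk1.example.internal:2181,"
        "zk2.example.internal:2181,"
        "zk3.example.internal:2181"
    ),
}


def render_site_xml(properties: Dict[str, str]) -> str:
    """Render name/value pairs in the Hadoop <configuration> XML layout."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


print(render_site_xml(HDFS_SITE))
print(render_site_xml(CORE_SITE))
```

Generating these files from a script, rather than editing them by hand on each machine, fits the goal of starting clusters on demand, since host names are only known once the virtual machines exist.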

By using the Windows Azure SDK, a cluster of virtual machines can be automatically started and configured for use with Hadoop. Azure’s ability to “capture” a virtual machine and save it as an image from which further machines can be spawned is a great help to this process. A generic Hadoop image could be created with all the necessary components for a Hadoop cluster already installed. Using this functionality, all nodes could be based on the template, saving much installation work that would otherwise have made the start-up time of each node excessive.

The result is a set of scripts, customisable for specific use cases, that allow on-demand starting, job execution and stopping of Hadoop clusters on the Microsoft Azure cloud.
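
The scripts themselves are not reproduced here. As a rough sketch of the start, run and stop life cycle they implement, the example below wraps a cluster in a Python context manager so that it is always stopped, even when a job fails. The start, stop and job functions are hypothetical stand-ins that only record events; a real script would call the Azure SDK and submit a Hadoop job at those points.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List


@contextmanager
def running_cluster(
    start: Callable[[], List[str]],
    stop: Callable[[List[str]], None],
) -> Iterator[List[str]]:
    """Start a cluster, hand its node names to the caller, and always stop it."""
    nodes = start()
    try:
        yield nodes
    finally:
        # Runs on success and on failure, so no machines are left running.
        stop(nodes)


# Stand-in implementations that only record what happened.
events: List[str] = []


def fake_start() -> List[str]:
    events.append("start")
    return ["nn1", "nn2", "dn1"]


def fake_stop(nodes: List[str]) -> None:
    events.append(f"stop {len(nodes)} nodes")


def fake_job(nodes: List[str]) -> None:
    events.append(f"job on {nodes[0]}")


with running_cluster(fake_start, fake_stop) as nodes:
    fake_job(nodes)

print(events)  # ['start', 'job on nn1', 'stop 3 nodes']
```

Guaranteeing the stop step matters on a pay-per-use cloud, where a cluster left running after a failed job continues to incur cost.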

It is worth noting that Microsoft’s Azure now offers a Platform as a Service (PaaS) version of Hadoop (based on the Hortonworks distribution) but it is currently only available as a technology preview and can only create very small clusters. Once started, the cluster can then be modified by the user and have multiple jobs submitted to it. The user, however, does not have full control over these clusters and the core Hadoop services are managed by the cloud provider rather than the user.

With the current industry leaders in cloud computing focusing heavily on IaaS, it may be that services like Amazon’s Elastic Map Reduce will be left behind. Indeed, the Hadoop versions being run on such services appear to be far from generic and are already reliant on specialised versions of old Hadoop code. As Excelian has proven, however, Microsoft (and others) have provided the tools to create and maintain alternatives, but it remains to be seen whether significant advances in this type of Hadoop application will be driven by the community or by the cloud providers.