Excelian took part in the first European Cassandra conference, organised this week in London by Acunu.

This very technology-focused event was a good opportunity to see how Cassandra has become a mature and fairly widespread technology in recent years. Many Cassandra committers and PMC members, such as Sylvain Lebresne, Gary Dusbabek and Eric Evans, were present, as was David Gardner, the organiser of the London Cassandra user group. This first blog post aims at summarizing where Cassandra stands today and where it fits within the Big Data world; a second blog post will go into more depth on the implementation examples of Cassandra presented during the conference.

Cassandra 101

Cassandra originated at Facebook before becoming an Apache open source project. It reached version 1.0 at the end of 2011, which is always a significant milestone in the life of such a project.

Cassandra is a distributed NoSQL database that is able to handle a huge number of writes with almost perfect linear scalability (it has been successfully tested with 288 nodes). It is built on principles set out in two famous white papers, the Amazon Dynamo white paper and the Google Big Table white paper. The key things to know about the solution can be summarized as follows:

    • it is highly scalable, especially when considering writes (which is not an easy thing to achieve)
    • it is highly available and has no single point of failure (because it is fully distributed), which means easier management
    • it is a column-based datastore: not every data access pattern is suited to Cassandra. It offers a SQL-like language called CQL that allows running more advanced queries
    • it offers configurable consistency levels and replication levels
    • it offers expiration policies to remove data based on a CQL-configurable time to live
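
To make the point about configurable consistency and replication more concrete, here is a minimal sketch of the usual quorum arithmetic: a read is guaranteed to see the latest write when the replicas contacted for the read and for the write must overlap, i.e. when R + W > replication factor. ONE, QUORUM and ALL are consistency level names used by Cassandra; the function names below are made up for illustration and are not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def replicas_required(level: str, replication_factor: int) -> int:
    """Number of replicas that must acknowledge a request at a given level."""
    levels = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return levels[level]


def read_sees_latest_write(read_level: str, write_level: str,
                           replication_factor: int) -> bool:
    """True when read and write replica sets are forced to overlap (R + W > N)."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor
```

With a replication factor of 3, reading and writing at QUORUM (2 + 2 > 3) gives overlapping replica sets, whereas reading and writing at ONE (1 + 1) does not: that is the trade-off between latency and consistency that the configurable levels expose.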

Cassandra within Big Data

Big Data has been the new thing lately. To understand where Cassandra fits within this space, it is useful to compare it with existing solutions.

The first one that springs to mind is Hadoop. Hadoop is aimed at batch processing (vs. real-time processing for Cassandra) and will probably be at its best when the data processing patterns are compatible with a map/reduce algorithm and when they require going through all the data stored in the cluster (or at least a significant portion of it). In many cases, the two solutions are used in conjunction: Cassandra handles the real-time queries and Hadoop runs in the background to perform complex data analysis. Cassandra-FS is also a file system supported by Hadoop, and it is possible to run Map/Reduce algorithms, Pig and Hive against a Cassandra cluster.
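
As an illustration of the difference in access patterns, here is a toy, single-process sketch of a map/reduce job next to a key lookup. It does not use Hadoop or Cassandra at all; the trade records, field names and function names are hypothetical and only serve to show that the batch job has to read every record, while the real-time style query touches a single key.

```python
from collections import defaultdict

# Hypothetical trade records; the field names are made up for illustration.
TRADES = [
    {"id": 1, "symbol": "ABC", "quantity": 100},
    {"id": 2, "symbol": "XYZ", "quantity": 50},
    {"id": 3, "symbol": "ABC", "quantity": 25},
]


def map_phase(record):
    """Emit one (key, value) pair per input record."""
    yield record["symbol"], record["quantity"]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all the values for one key into a single result."""
    return key, sum(values)


def total_quantity_by_symbol(records):
    """Batch-style job: every record is read once."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())


def lookup_by_id(records_by_id, trade_id):
    """Real-time style query: a single record is fetched by its key."""
    return records_by_id[trade_id]
```

Calling total_quantity_by_symbol(TRADES) walks through the whole data set, which is the kind of work the post describes as Hadoop's territory, whereas lookup_by_id on a dictionary keyed by trade id mirrors the point queries it attributes to Cassandra.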

In the finance world, Oracle Coherence is a widespread in-memory data grid solution, used in various contexts. Use cases for Coherence will generally also be well suited to Cassandra, but it is important to point out that:

      • Coherence does not require durable storage (but can support it) whereas Cassandra has a backend storage where data is persisted at given intervals.
      • Coherence has advanced in-grid processing features that allow running complex calculations on the data stored in the grid
      • Coherence offers a set of features like notifications, complex querying and comparable read/write performance that will generally make it a more versatile solution than Cassandra
      • Cassandra is free :)

Acunu offers a packaged version of Cassandra with some extra components, aimed at reducing the operational complexity of managing Cassandra clusters. The offering includes:

      • A new storage engine, running on an optimised Linux kernel, that allows Cassandra to talk directly to the disk controllers of the underlying machine. This brings better performance and predictability, which can otherwise be an issue when Cassandra is used very intensively for an extended period of time (10h+)
      • A management console that runs in a distributed fashion and enables one-click operations like backup or cloning of Cassandra clusters
      • A real-time analytics solution that was presented at the end of the conference and was fairly impressive in terms of its ability to perform real-time queries on large Cassandra clusters

Many real-life examples were presented during the conference. While it is generally accepted that the technology is becoming mature and is very efficient, the main challenge now is the ability to run and manage large clusters in a simple way (i.e. without having a Cassandra committer in the team...).

Stay tuned for the second blog post!