It’s fair to say that I’ve created a few data warehouses in my time and retired a few more. Not all of my efforts were shining successes – but to be fair, Gartner and others studiously report that nearly two-thirds of all data warehouse projects fail. Whether they are marts, warehouses, or enterprise data warehouses (EDW), one thing seems to be constant – the bigger they are, the harder they fall. This is because they take so long to reach their potential (if ever) and take so much effort to create. The goal of a single source of truth is always just over the next hill. The dream of a 360-degree view of an organization is ever-expanding, just like the universe we live in.
So if two-thirds fail, that still means that one-third succeed, right? I’m afraid not – at least, not any more. It used to be acceptable to run long, error-prone data processing cycles to move massive amounts of data (usually of dubious quality) into huge and expensive databases, just for the privilege of looking at month-old data. Even then we couldn’t really look at it – it still took an army of business intelligence specialists to shape the data and create reports for the business to consume. Today, data-driven organizations want their data served up now – and with a heaping side of insight! The old paradigm simply doesn’t cut it anymore.
Being the helpful technologists that we are, we tried to beat the problem to death with hardware – upgrading from SMP (symmetric multiprocessing) to MPP (massively parallel processing). We built expensive optimized appliances and tried huge arrays of commodity grid computers, but the pesky problem still lived. A few clever folks figured out that we should stop moving the data around, whether just to fiddle with it or to get it to where the centralized compute power resides. It’s a bit like the “Chinese Whispers” game – the more times you hand the data off, the longer it takes to process and the more it changes. A simpler approach is to move the data just once (to grid storage) and then move the compute power to where the data lives!
Enter the rise of the Enterprise Data Lake (EDL), which lets us write once and read often. With storage becoming so cheap, we can now afford to dump massive amounts of all kinds of data into the lake. New technology allows us to move the compute processing to where the data is stored (or entirely in memory) at a fraction of the cost and effort of warehousing. The best part of this approach is that the data is available almost immediately. All of that sounds great, but there is of course a downside – if you don’t perform data management, it can quickly become a (pick your water body metaphor) swamp, moat, ditch…you get the idea. Even so, I think it will soon be safe to say, “The EDW is dead, long live the EDL!”