Last Christmas I gave you Hadoop, but the very next day, you gave it away

I have spent some of the quiet Christmas nights in front of a burning fireplace reading articles about the future of the data warehouse. There are many opinions and arguments about what the future of data warehousing will look like.
Two types of professionals are arguing the pros and cons of the technical architecture surrounding your data warehouse solution.
The technocrats make strong arguments for specific technologies that will solve your challenges in the new data era.
On the other hand, you have professionals who are more of the old school and are skeptical of letting technology drive the type of challenge you are going to solve. Let’s call them the conservatives.
Being a conservative myself, it’s easy to point out what the technocrats do wrong, but are they as wrong as someone like me at times argues?

Doing my Christmas reading this year, I have come across several “The death of the data warehouse” articles and blog posts; I have even co-written one myself. The arguments come from both sides. We conservatives tend to argue that companies have so much trouble handling their traditional data that it would be madness to start Big Data or data lake initiatives before getting that data under control. The technocrats, on the other hand, argue that new technology brings new architectural possibilities, and see the old way of Extracting, Transforming and Loading data into a modelled data repository as unnecessary when you can keep all your data in, for instance, HDFS with Hadoop and do your querying and modelling there.

Stephen Smith makes some great arguments in his Eckerson Group article, “The Demise of the Data Warehouse”, where his point is to combine a data lake with Master Data Management.
In his article, he presents Informatica’s AI-based data catalog, EIC, as the solution to this. The idea is to use a good metadata repository and only draw the data from the data lake when you need it.

When I first read the article, I thought this was an excellent idea: no need to model things in the data warehouse that you are not sure anyone will ever use.
Then I started to think about how this will work in real life.

Let’s picture ourselves in a small to mid-size retail company, and let’s say we have invested in a Hadoop cluster and put all our data there: our ERP systems, accounting systems, weblogs and all the other data we could think of.
In this scenario, all we would have to do is search the metadata for, let’s say, total sales and profit margin for all customers in the northern region for May 2017.
This seems like an easy enough report, but how do we know we get total sales?
Are the online sales and the sales in physical stores recorded in the same application?
Who has made the calculation for product margin?
And is it correct?
How and where do we segment customers so that we know which region they belong to?

So, to get the answer to this easy question, we must have structure in our underlying data to be certain that the answer is correct.
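To make the margin question concrete, here is a tiny made-up sketch (all systems, numbers and column layouts are invented for illustration): two source systems record the same May 2017 sales, but one calculates product margin before discounts and the other after. Query the raw lake without agreeing on a definition, and you get two different “correct” numbers.

```python
# Hypothetical data: the same northern-region sales as recorded by
# two different source systems, as (region, gross_sale, discount, cost).
online_sales = [
    ("north", 1000.0, 100.0, 600.0),
    ("north", 500.0, 0.0, 300.0),
]
store_sales = [
    ("north", 800.0, 50.0, 500.0),
]

def margin_before_discount(rows):
    """Margin as one system might define it: gross sale minus cost."""
    return sum(sale - cost for _, sale, _, cost in rows)

def margin_after_discount(rows):
    """Margin as another system might define it: net sale minus cost."""
    return sum((sale - discount) - cost for _, sale, discount, cost in rows)

all_rows = online_sales + store_sales

print(margin_before_discount(all_rows))  # 900.0
print(margin_after_discount(all_rows))   # 750.0
```

Both totals are faithful to their own system’s definition, which is exactly the kind of discrepancy that ends up being argued over in a meeting.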
When I am in meetings with C-level executives and discuss the need for a data warehouse or structured data, I always ask the following question:
“Have you ever spent significant time in a meeting discussing what number is correct?”
The answer to this is of course always yes.

From my point of view, this will be even worse in a metadata-driven data lake initiative. You need to structure your data! A tool like EIC will help you on the way, but you still need hands-on development to structure your metadata dictionary.
So, what you are really doing is moving the logical layer of your data warehouse into a metadata search engine. From my point of view, you haven’t really done anything new; it is the same thing in another tool (if you look at the end result).
As Dave Wells says in his article “Counterpoint: The Data Warehouse is still alive”:
“So please read these proclamations of data warehouse demise with a healthy dose of skepticism. The data warehouse is alive but it faces many challenges. It doesn’t scale well, it has performance bottlenecks, it can be difficult to change, and it doesn’t work well for big data. It certainly hasn’t lived up to the promises of the past. Smith is right that the single version of the truth still eludes us.”
Further on in his article, as a counterpoint to Mr. Smith, he argues that a better solution than just Data Lake + Master Data Management is Data Lake + Master Data Management + Data Warehouse.
I strongly agree with Mr. Wells on this point.

As I have written in a previous blog post, I feel strongly that a data lake in combination with a data warehouse is the best way to go.
But keep in mind that with new possibilities comes greater complexity, so you need the right toolbelt and the tools that fit your needs.

And if you are worried about the cost and complexity of a data warehouse project, you should look into how Xpert BI can help you solve your challenges.

These were some of my thoughts this Christmas. I think I’ll retire until next year now and try to read a suspense novel for the rest of the holidays.

I wish you all a happy new year!

The articles mentioned:
Stephen Smith, “The Demise of the Data Warehouse”, Eckerson Group
Dave Wells, “Counterpoint: The Data Warehouse is still alive”
