My take on the Data Lake

(I’m a poet and I know it)

I have been spending a lot of time warning people about the pitfalls, or rather deep ends, of Data Lake and Big Data initiatives. First, I want to state that I think the Data Lake as a concept is a great idea, but it needs to co-exist with your more traditional data architecture.
So, what I’m discussing in this post is my take on the total data architecture as I see it in today’s data age.

The place for holding the truth
We all know the data warehouse mantra "One single version of the truth", or the variant "One single source of the truth". The single-source statement becomes even less relevant with the coming of Big Data, but the single version of the truth is probably even more important in the big data age. You need one place to implement your business rules, and one place to hold the truth. What we see now, though, is that the place for holding the truth is not necessarily one and the same for all your data.

Let’s start by identifying the data source components of your data architecture.

 

“Traditional” or “small” data

These are your on-site systems: your ERP system, your home-grown operational side systems, or your manual master-data spreadsheets. If you look at your system portfolio, you will probably find a lot of data that fits into an enterprise data architecture. Here you will find most of your reporting basis and most of your analytical basis. I would say this is your most valuable data, and hence this is where you should spend most of your resources on preparation and management. Whether the data resides on-site or on a server in the cloud, the extraction method is the same as long as you have a connection to the actual database.

SaaS solutions

From a data collection point of view, I have mixed experiences with SaaS solutions. I have run into trouble countless times when trying to get data from the various SaaS providers. What you get when you buy or rent a SaaS solution is an application that serves its operational purpose, but you don't get direct access to your own data. More and more providers let you access your data through APIs, which is a good solution, but in my experience there are still a lot of smaller vendors that rely on flat-file integration. This means that you must build a manual flat-file integration on your side and store (possibly sensitive) data in open flat files on FTP servers. In some cases you even have to pay extra to get access to your own data.
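To make the API path concrete, here is a minimal sketch of pulling records from a hypothetical SaaS REST endpoint and landing them raw in a staging area. The URL, token, field names and paths are placeholders, not any particular vendor's API, and the pagination convention (a "next" link in the response) is an assumption.

```python
import json
from datetime import date
from pathlib import Path

import requests  # third-party HTTP client

# Placeholder values -- swap in your vendor's real endpoint and credentials.
BASE_URL = "https://api.example-saas.com/v1/invoices"
API_TOKEN = "secret-token"          # keep real secrets in a vault, not in code
STAGING_DIR = Path("/data/staging/example_saas")


def extract_invoices(changed_since):
    """Pull every invoice changed since the given date, following pagination."""
    records = []
    url = f"{BASE_URL}?changed_since={changed_since}"
    headers = {"Authorization": f"Bearer {API_TOKEN}"}
    while url:
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()
        payload = response.json()
        records.extend(payload["items"])
        url = payload.get("next")   # assumed pagination convention
    return records


def land_raw(records):
    """Write the raw extract to a date-stamped file in the staging area."""
    STAGING_DIR.mkdir(parents=True, exist_ok=True)
    target = STAGING_DIR / f"invoices_{date.today():%Y%m%d}.json"
    target.write_text(json.dumps(records, ensure_ascii=False))
    return target


if __name__ == "__main__":
    land_raw(extract_invoices("2017-01-01"))
```

Compare that with the flat-file alternative: with an API you schedule the job yourself, keep the transport encrypted, and never leave sensitive files sitting on an FTP server.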

Big Data

With the coming of new technology, we have faster, cheaper and smarter ways to save more of the unstructured and semi-structured data. Depending on your line of business, one could argue that this is where your data "gold" is: sensor data from various IoT systems, web logs you can store in full detail, and other fast data that you can analyze as it arrives. This is the area where the traditional data warehouse architecture gets disrupted. It's not a given that you should model this data or implement it in your star schema. This is also where the ETL process must be adapted to whatever technology you choose to store your data in.
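As a rough illustration of keeping such data raw instead of forcing it into a star schema, the sketch below appends incoming sensor events as JSON lines to a date-partitioned raw zone. Whether that zone sits on HDFS, Azure storage or local disk is mostly a matter of the path; all names and values here are made up.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Placeholder root -- could just as well be an HDFS mount or a cloud storage path.
RAW_ZONE = Path("/datalake/raw/sensor_events")


def land_event(event):
    """Append one raw event, untouched, to today's partition as a JSON line.

    No modelling and no star schema: the event is stored as it arrived, so you
    can decide later what (if anything) to promote into the warehouse.
    """
    now = datetime.now(timezone.utc)
    partition = RAW_ZONE / f"date={now:%Y-%m-%d}"
    partition.mkdir(parents=True, exist_ok=True)
    with open(partition / "events.jsonl", "a", encoding="utf-8") as sink:
        sink.write(json.dumps(event, ensure_ascii=False) + "\n")


# Example: a made-up temperature reading from an in-store sensor.
land_event({"sensor_id": "store-42-fridge-3", "ts": "2017-06-01T08:30:00Z", "temp_c": 4.7})
```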

The toolsets are still not good enough

So, let's talk about the Big Data perspective. Do we still need relational databases for modelling our data hub? Do we need to buy expensive in-memory databases when we can store and access data much cheaper, and faster, with for example HDFS? The answer to these questions is neither yes nor no; it depends entirely on how your business is set up today.
I have one reservation about moving all your data storage to HDFS: the technology is still too inaccessible. The toolsets are not yet good enough, and there is a lack of professionals who know them. This will, of course, improve over time, as with every emerging technology.
With the data lake methodology, we have a term that says something about how and where we should store the new "Big" data. Pentaho CTO James Dixon coined the term as a contrast to the data mart, to solve the problem of "information siloing". As time has gone by, more and more vendors have adopted the term, and it has become an architectural necessity when planning your enterprise data architecture. The main thing to remember is that the Data Lake is a methodology, not a technology. Many people believe that Data Lake equals Hadoop, but this is not accurate. A Data Lake is a methodology that can span multiple technologies, and you should use the technology that fits your need, or even better, your data's need.
This means you can have some data in Hadoop, some on Azure and some in MongoDB. Together, this is your Enterprise Data Lake.
I have seen presentations from companies that offload all their data into a data lake and store everything in Hadoop, where they also implement the business rules and build their reporting marts. So it is an option to bypass the relational data warehouse entirely, but as I said earlier, the technology as I see it is still too immature for that to be the smart move right now.

At the Gartner Data and Analytics Summit in London earlier this year, I attended a session where Edmond Mesrobian, CTO at Tesco, and Stephen Brobst, CTO at Teradata, talked about the data architecture at Tesco. I come from a fairly big Norwegian retailer, but what they described was almost science fiction to me. I think they had every technology you could think of in their architecture slide. They have done groundbreaking work at Tesco on their new data architecture. One of the things I noticed with interest was that they still have their EDWH intact in that architecture. They said that the EDWH will never disappear, but they will not build a new one.
So even Tesco, which has the muscle and manpower to embrace any technology they want, still understands the value of structuring data in a data warehouse for reporting.
They also said that it's necessary to enrich the EDWH with relevant data from the data lake, but not everything needs to be modelled into your star schema. Some of the data can simply be offloaded in a raw format so that analysts have easier access to it.

Being a Norwegian data warehouse architect, I have done most of my projects with Norwegian customers. I often see companies that promote Big Data technologies and methods compare Norwegian companies with large US companies. Sadly, this often leads to overselling technologies and methods fitted for much larger enterprises. In Norway there are few, if any, companies as big as Tesco, Walmart or Bank of America.
So, if you are a small or mid-size company, you should choose an architecture that fits your needs. You must consider the pricing models for the new technology; it's not certain that you will get ROI on your cloud database. You must check whether you can get qualified personnel to utilize the technology, and not rely on a single consultant. And, most importantly, you must look at the data you have or the data you are planning to get. Only after taking these steps can you start to choose your architecture and technology.

So how do I see the architecture for a mid-size to big corporation by Norwegian standards? Well, I'm really glad you asked…
The picture below describes, from a helicopter perspective, how I believe the best architecture for a "Norwegian-sized" company should look.
Just a quick note: all the vendor and technology names are meant as examples!

Hadoop BI Builders overview

This picture sums up most of what I have talked about in this post. This is how I see the co-existence of the traditional data warehouse and the new emerging technologies.
In these architecture slides we often forget to include the SaaS solutions and how to get data from them. Make sure you have an easy way of reading from APIs and storing the results in your data hub.
If your SaaS provider doesn't support APIs, my strongest recommendation is to use a different provider. That is the best way to force the various providers to implement them.
When we talk about exposing data, think of your data warehouse as a data provider too, which means you also need to expose your data through APIs. You must do this so your web pages and other systems have an easy way of getting the correct information.
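To show what the warehouse as a data provider could look like, here is a minimal sketch of a read-only API endpoint over a reporting mart, using Flask and, purely as a stand-in, a SQLite connection; the table, column and route names are illustrative, not part of any specific product.

```python
import sqlite3  # stand-in for your real data warehouse connection

from flask import Flask, jsonify

app = Flask(__name__)
WAREHOUSE = "warehouse.db"  # placeholder -- point this at your actual warehouse


@app.route("/api/v1/stores/<store_id>/sales/daily")
def daily_sales(store_id):
    """Expose a reporting-mart table so web pages and other systems can read it."""
    with sqlite3.connect(WAREHOUSE) as conn:
        conn.row_factory = sqlite3.Row
        rows = conn.execute(
            "SELECT sales_date, total_amount FROM mart_daily_sales WHERE store_id = ?",
            (store_id,),
        ).fetchall()
    return jsonify([dict(row) for row in rows])


if __name__ == "__main__":
    app.run(port=5000)
```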

Time is one of our most valuable assets
There are of course also scenarios where you can expose data directly from your data lake. Whether to expose data from the lake or from your data hub depends on the nature of the data. If it is a live stock count from the stores and that data lives in your data lake, you should of course expose it from the lake; but if the data are loaded in batch overnight, or even hourly, you should probably expose them from your data warehouse.
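One way to keep that rule explicit is a small registry that records how each dataset is refreshed and routes consumers accordingly. This is just a sketch of the idea, with invented dataset names and locations.

```python
# Hypothetical registry: how each dataset is refreshed, and where it lives.
DATASETS = {
    "live_stock_count": {"refresh": "streaming", "lake_path": "/datalake/raw/stock_counts"},
    "daily_sales":      {"refresh": "batch",     "warehouse_table": "mart_daily_sales"},
}


def serving_source(name):
    """Serve streaming data from the lake and batch-loaded data from the warehouse."""
    meta = DATASETS[name]
    if meta["refresh"] == "streaming":
        return ("lake", meta["lake_path"])
    return ("warehouse", meta["warehouse_table"])


print(serving_source("live_stock_count"))   # ('lake', '/datalake/raw/stock_counts')
print(serving_source("daily_sales"))        # ('warehouse', 'mart_daily_sales')
```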

Regarding the traditional data from your ERP systems, or other systems where you have direct ownership of the database: that data is, as I have pointed out before, best handled by an automation tool. I am not saying this only because I work for a company that develops, probably, the best automation tool on the market, but also because you need to free up time for your developers so they can get up to speed on the emerging technologies. Time is one of our most valuable assets, so use that asset the best way you can.

As I am writing this, our brilliant developers are busy implementing data collection to and from these new technologies, not just your traditional data sources. But I will get back to the details in a little while.
