My take on Data Lake

(I’m a poet and I know it)

I have been spending a lot of time warning people about the pitfalls, or rather the deep ends, of Data Lake and Big Data initiatives. First, I want to state that I think the Data Lake as a concept is a great idea, but it needs to co-exist with your more traditional data architecture.
So, what I’m discussing in this post is my take on the total data architecture as I see it in today’s data age.

The place for holding the truth
We all know the data warehouse mantra "One single version of the truth", or the variant "One single source of the truth". The single-source statement will become even less relevant with the coming of Big Data. But the one single version of the truth is probably even more important now, in the big data age. You need one place to implement your business rules, and a place to hold the truth. What we see now, however, is that the place for holding the truth is not always one and the same.

Let’s start by identifying the data source components of your data architecture.

 

“Traditional” or “small” data

These are your on-site systems: your ERP system, your home-grown operational side systems, your manual master data spreadsheets. If you start to look at your system portfolio, you will probably find a lot of data that would fit into an enterprise data architecture. Here you will find most of your reporting basis, and most of your analytical basis. I would say this is your most valuable data, and hence this is where you should spend most of your resources on preparation and management. Whether the data resides on-site or on a server in the cloud, the extraction method is the same as long as you have a connection to the actual database.

SaaS solutions

From a data collection point of view, I have mixed experiences with SaaS solutions. I have countless times run into trouble when trying to get data from the various SaaS providers. What you get when you buy or rent a SaaS solution is an application that serves its operational purpose, but you don't get direct access to your own data. More and more providers let you access your data through APIs, which is a good solution, but in my experience there are still a lot of smaller vendors that rely on flat file integration. This means that you must build a manual flat file integration on your side and store (possibly sensitive) data in open flat files on FTP servers. In some cases, you even have to pay extra to get access to your own data.
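To make the flat-file scenario concrete, here is a minimal sketch of what the landing step could look like in T-SQL once the file has been fetched from the provider's FTP server. The schema, table, file path and delimiters are invented for illustration and will differ from vendor to vendor.

    -- Hypothetical staging table for a flat-file export from a SaaS provider
    -- (assumes a "stg" schema already exists in your staging database)
    CREATE TABLE stg.SaaSOrders (
        OrderNo    varchar(20)   NOT NULL,
        CustomerNo varchar(20)   NOT NULL,
        OrderDate  date          NOT NULL,
        Amount     decimal(18,2) NOT NULL
    );

    -- Load the semicolon-separated file downloaded from the provider's FTP server
    BULK INSERT stg.SaaSOrders
    FROM 'D:\landing\saas_orders.csv'
    WITH (
        FIRSTROW        = 2,    -- skip the header row
        FIELDTERMINATOR = ';',
        ROWTERMINATOR   = '\n'
    );

The point is not the syntax, but that every one of these manual integrations is something you have to build, schedule and monitor yourself, which an API-based extraction largely avoids.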

Big Data

With the coming of new technology, we have new, faster, cheaper and smarter ways to save more of the unstructured and semi-structured data. Depending on your line of business, one could argue that this is where your data "gold" is: sensor data from various IoT systems, web logs you can store in full detail, and other fast-moving data that you now have the possibility to analyze as it comes in. This is the area where traditional data warehouse architecture gets disrupted. It is not a given that you should model this data or implement it in your star schema. This is also where the ETL process must be adapted to whatever technology you choose to store your data in.

The toolsets are still not good enough

So, let's talk about the Big Data perspective. Do we still need relational databases for modelling our data hub? Do we need to buy expensive in-memory databases when we can store and access data much cheaper and faster with, for example, HDFS? The answer to these questions is neither yes nor no. It depends entirely on how your business is set up today.
I have one reservation about moving all your data storage to HDFS: the technology is still too inaccessible. The toolsets are not yet good enough, and there is a lack of professionals who know them. This will, of course, as with every emerging technology, improve as time goes by.
With the data lake methodology, we have a term that says something about how and where we should store the new "Big" data. CTO James Dixon at Pentaho coined the term as a contrast to the data mart, to solve the problem of "information siloing". As time has gone by, more and more vendors have adopted the term, and it has become an architectural necessity when planning your enterprise data architecture. The main thing we should remember is that a Data Lake is a methodology and not a technology. Many people believe that Data Lake equals Hadoop, but this is not accurate. A Data Lake is a methodology that can span multiple technologies, and you should use the technology that fits your needs, or even better, your data's needs.
This means you can have some data in Hadoop, some in Azure and some in MongoDB, and together this is your Enterprise Data Lake.
I have seen presentations from companies that offload all their data into a data lake and store everything in Hadoop, where they also implement their business rules and build their reporting marts. So it is an option to bypass the whole relational data warehouse, but as I said earlier, the technology, as I see it, is still too immature for this to be the smart move right now.

At the Gartner Data and Analytics Summit in London earlier this year I attended a session where Edmond Mesrobian, CTO at Tesco, and Stephen Brobst, CTO at Teradata, talked about the data architecture at Tesco. I come from a fairly big Norwegian retailer, but what they talked about was almost science fiction to me. I think they had every technology you could think of on their architecture slide. They have done groundbreaking work at Tesco with regard to the new data architecture. One of the things I noticed with interest was that they still have their EDW intact on that slide. They said that the EDW will never disappear, but that they will not build a new one.
So even Tesco, which has the muscle and manpower to embrace any technology it wants, still understands the value of structuring data in a data warehouse for reporting.
They also said that it is necessary to enrich the EDW with relevant data from the data lake, but that not everything needs to be modelled into your star schema. Some of the data can just be offloaded in a raw format so that analysts have easier access to it.

Being a Norwegian data warehouse architect, I have done most of my projects with Norwegian customers. I often see that many of the companies promoting Big Data technologies and methods compare Norwegian companies with large US companies. Sadly, this often leads to overselling technologies and methods fitted for much larger enterprises. In Norway there are few, if any, companies as big as Tesco, Walmart or Bank of America.
So, if you are a small or mid-size company, you should choose an architecture that fits your needs. You must consider the pricing models for the new technology; it's not certain that you will get ROI on your cloud database. You must check whether you can get qualified personnel to utilize the technology, so you don't rely on one single consultant. And, most importantly, you must look at the data you have, or the data you are planning to get. Only after taking these steps can you start to choose your architecture and technology.

So how do I see the architecture for a mid-size to big corporation by Norwegian standards? Well, I'm really glad you asked…
The picture below describes, from a helicopter perspective, how I believe the best architecture for a "Norwegian-sized" company should look.
Just a quick note: all the vendor and technology names are meant as examples!

(BI Builders architecture overview)

This picture sums up most of what I have talked about in this post. This is how I see the co-existence of the traditional data warehouse and the new emerging technologies.
In these architecture slides we often forget to include the SaaS solutions and how to get data from them. Make sure you have an easy way of reading from APIs and storing the data in your data hub.
If your SaaS provider doesn't support APIs, my strongest recommendation is to use a different provider. This is the best way to force the various providers to implement them.
When we talk about exposing data, think of your data warehouse as a data provider too: you also need to expose your data through APIs, so that your web pages and other systems have an easy way of getting the correct information.
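As a sketch of what that can look like on the database side, here is a simple pattern where the warehouse exposes read-only views and parameterized procedures that the API layer calls, so consuming systems never query the underlying model directly. The schema, table and column names are purely illustrative.

    -- Hypothetical read-only contract for the API layer
    CREATE VIEW api.CurrentStockBalance
    AS
    SELECT p.ProductNo,
           p.ProductName,
           s.StoreNo,
           f.QuantityOnHand,
           f.LoadedAt
    FROM   dw.FactStockBalance AS f
    JOIN   dw.DimProduct       AS p ON p.ProductKey = f.ProductKey
    JOIN   dw.DimStore         AS s ON s.StoreKey   = f.StoreKey
    WHERE  f.IsLatestSnapshot = 1;
    GO

    -- A parameterized procedure keeps the contract stable even if the model behind it changes
    CREATE PROCEDURE api.GetStockBalance @StoreNo varchar(10)
    AS
    SELECT ProductNo, ProductName, QuantityOnHand, LoadedAt
    FROM   api.CurrentStockBalance
    WHERE  StoreNo = @StoreNo;
    GO

This way the API only ever sees a stable, documented contract, and you are free to change the model behind it.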

Time is one of our most valuable assets
There are of course also scenarios where you can expose data directly from your data lake. Whether to expose data from the lake or from your data hub depends on the nature of the data. If it is a live stock count from your stores and that data sits in your data lake, you should of course expose it from the lake; but if the data is loaded in batches overnight, or even hourly, you should probably expose it from your data warehouse.

As for your traditional data from ERP systems and other systems where you have direct ownership of the database, it will, as I have pointed out before, be best handled with an automation tool. I am not only saying this because I work for a company that develops, probably, the best automation tool on the market; I am also saying it because you need to free up time for your developers so they can get up to speed on the emerging technologies. Time is one of our most valuable assets, so use that asset the best way you can.

As I am writing this, our brilliant developers are busy implementing data collection to and from these new technologies, not just your traditional data sources. But I will get back to the details in a little while.

Self-Service Data Preparation, why would you do that?

In my previous posts I have talked a lot about preparing your data, and about not forgetting your structured data in this time of Big Data, IoT and other cool stuff. Today I thought I'd take a step back and try to explain what our company does when it comes to preparing and structuring data, and discuss how this can be a viable option for you.

So, what is Xpert BI?

We call it self-service data preparation, or data warehouse automation. So, what does that mean, and why should you do this instead of traditional ETL or ELT?

A tool is only a tool

Our CTO and the brain behind Xpert BI, Erik Frafjord, had an idea: the process of getting your data prepped and ready for reporting needed to be optimized, but not at the cost of quality, governance and documentation. In traditional data warehouse projects, 80% of your time is spent preparing your data, and only 20% of the effort goes into reports, analytics and distributing the value your data adds. You should try to switch that around, so that 20% goes into preparation and 80% into the decision-making process.

Using Xpert BI you always start with your metadata. The first step is to identify and understand your data sources, whether that is a single flat file or thousands of tables in your SAP installation. For many of the big ERP systems, like SAP or Axapta, Xpert BI already has predefined application adapters where system-specific metadata is combined with the platform-specific metadata. This automatically enriches your data model with friendly names and relationships, so you can fully understand your system implementation.

Depending on the quality and availability of the metadata, it is not unusual to involve application owners or experts to enhance the source system data model. Doing this saves you a lot of time afterwards, when you are going to use the data in further transformation processes or directly in analytics.

This understandable layer is what we call the idealization layer: here you have a "copy" of your source, fully enriched and understandable, at the same level of detail as your source. It is essentially an ODS, but a lot more useful and understandable.

(Example: SAP data source before and after idealization)

Business logic, denormalizations, integrations and transformations of data are done in SQL Server Management Studio using T-SQL. The reason for this is that we didn't want to reinvent something that works, and by using existing technology the need for training with Xpert BI is minimal. So joining tables and applying your business rules is done here, either with the help of a wizard or by writing your own SQL code. The tool manages all data loads by always maintaining table dependencies, and you can easily configure commonly used load options such as incremental loads, filters, surrogate keys, snapshots and SCD handling of any given dimension attributes.
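To show the kind of pattern this covers, here is a minimal, hand-written T-SQL sketch of a surrogate-keyed dimension with Slowly Changing Dimension Type 2 handling. All schema, table and column names are invented for this example; it simply illustrates the sort of repetitive load logic you would rather configure in the tool than write and maintain by hand.

    -- Hypothetical customer dimension with a surrogate key and SCD Type 2 history
    CREATE TABLE dw.DimCustomer (
        CustomerKey  int IDENTITY(1,1) PRIMARY KEY,  -- surrogate key used in fact table joins
        CustomerNo   varchar(20)   NOT NULL,         -- business key from the source system
        CustomerName nvarchar(200) NOT NULL,
        Segment      nvarchar(50)  NULL,
        ValidFrom    datetime2     NOT NULL,
        ValidTo      datetime2     NULL,             -- NULL = current version
        IsCurrent    bit           NOT NULL
    );

    -- stg.Customer is assumed to be the idealized copy of the source system's customer table

    -- Step 1: close the current version of rows whose attributes have changed in staging
    UPDATE d
    SET    d.ValidTo   = SYSDATETIME(),
           d.IsCurrent = 0
    FROM   dw.DimCustomer AS d
    JOIN   stg.Customer   AS s ON s.CustomerNo = d.CustomerNo
    WHERE  d.IsCurrent = 1
      AND (ISNULL(s.CustomerName, N'') <> ISNULL(d.CustomerName, N'')
       OR  ISNULL(s.Segment, N'')      <> ISNULL(d.Segment, N''));

    -- Step 2: insert a new current version for changed rows and for brand-new customers
    INSERT INTO dw.DimCustomer (CustomerNo, CustomerName, Segment, ValidFrom, ValidTo, IsCurrent)
    SELECT s.CustomerNo, s.CustomerName, s.Segment, SYSDATETIME(), NULL, 1
    FROM   stg.Customer AS s
    WHERE  NOT EXISTS (SELECT 1
                       FROM  dw.DimCustomer AS d
                       WHERE d.CustomerNo = s.CustomerNo
                         AND d.IsCurrent  = 1);

Multiply this by every dimension and every source system, and it becomes clear why generating and maintaining this kind of boilerplate automatically frees up so much developer time.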

This means that you can use the methodology of your choosing when designing your data warehouse or data mart. Whether you want a straightforward Kimball approach or want to design a Data Vault, you are free to choose the architectural design that fits your organization best. Xpert BI can handle any number of databases when it comes to complete technical documentation and dependency control.
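As a small illustration of the difference in shape between the two approaches, here is a hand-written Data Vault-style sketch with invented names: a hub holding the business key and a satellite holding its descriptive attributes over time. It is only meant to show the pattern, not generated output from any tool.

    -- Hypothetical Data Vault structures for the same customer data
    CREATE TABLE dv.HubCustomer (
        CustomerHashKey char(32)    NOT NULL PRIMARY KEY,  -- e.g. an MD5 hash of the business key
        CustomerNo      varchar(20) NOT NULL,              -- the business key itself
        LoadDate        datetime2   NOT NULL,
        RecordSource    varchar(50) NOT NULL
    );

    CREATE TABLE dv.SatCustomerDetails (
        CustomerHashKey char(32)      NOT NULL,
        LoadDate        datetime2     NOT NULL,            -- each change adds a new row
        CustomerName    nvarchar(200) NULL,
        Segment         nvarchar(50)  NULL,
        RecordSource    varchar(50)   NOT NULL,
        CONSTRAINT PK_SatCustomerDetails PRIMARY KEY (CustomerHashKey, LoadDate),
        CONSTRAINT FK_SatCustomerDetails_Hub FOREIGN KEY (CustomerHashKey)
            REFERENCES dv.HubCustomer (CustomerHashKey)
    );

Whichever shape you choose, the requirements for dependency control and documentation are the same, which is exactly where the automation pays off.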

We also enrich your solution with surrogate keys, so your joins run incredibly fast. You also get the complete lineage of your solution as technical documentation. This lineage is also used to determine the order in which your tables need to be loaded, so you never have to worry about the sequence your jobs run in. The tool automatically optimizes the parallelization and sequencing of data loads.

Don't be dependent on one developer

From a technical perspective this sounds, and works, amazing, but it's not only on the technical side that you get a win. What you do here is force your developers into a more governed way of doing development. You take away their freedom to get creative. Being a musician in my spare time, it hurts a little to say this, as I have always encouraged the people working for me to be creative. But in this part of the process you need a governed approach, so that anyone can read the work and continue developing where someone else left off. You shouldn't be dependent on the one developer who built the original package to make the changes.

So, who would benefit from this?

The sales pitch would of course say everyone. Luckily I am not a salesperson. Let’s try to refine and identify when you should think about an automation tool like Xpert BI.

At a high level, there are three scenarios that apply.

  1. You are starting a new data warehouse initiative.
  2. You are rebuilding or starting over your data warehouse initiative.
  3. Your data warehouse initiative is not giving you the full picture, or it's just updating too slowly.

If you don't have a data warehouse yet, you should consider the automated way. This is probably the most obvious scenario for exploring the automated opportunity.

The second scenario is much the same as the first, but here you usually have the advantage that some of your business rules are already defined, and you only need to revise them. You also most likely have to change your architecture, either because it has been disrupted or because you inherited it from your predecessor.

The third scenario we could call the pain scenario. Here the organization is most likely experiencing that the reports don't give them the answers they are looking for, give the wrong answers, or give no answers at all. There are many reasons a solution ends up in this state, but the main one is lack of governance in the development process. This scenario often leads to a rebuild or a fresh start, or at least it should.

So, having defined the three main scenarios where you should consider an automated approach, does that mean you should? Well, you are getting closer to an "of course", but before you rush into buying yet another tool that is going to magically fix things, look at your organization: is this going to fit it?

If your organization already has a development department staffed with BI developers, you could argue that, with strong leadership and good governance, they can develop and maintain your data warehouse without a new tool like Xpert BI.

Big organizational changes that lead to changed tasks or downsizing can leave you with a disrupted department, which in turn could, and most likely will, lead to a failed project.

I have a saying that goes: "A tool is only a tool; it's the people that utilize the tool that make the change, never the tool itself". So promote the goal and the way to get there before you squeeze the elephant onto the bus. It might turn out to be a much smaller animal that needs the ride.

I hope I have clarified some of the things Xpert BI can do for you. It's not magic, but in many cases it is a lot smarter and a lot faster.

 

9 days and 8 nights in London

I just got back from a long and exciting trip to London, where BI Builders has been sponsoring two big events: first Big Data World, where I was a speaker, and then the Gartner Data and Analytics Summit, where my colleague Anja was a speaker. I would like to share my thoughts and key takeaways from this inspiring trip.

So first, Big Data World, a huge conference co-located with four others: Smart IoT, Cloud Security Expo, Cloud Expo and Data Centre World.

The data warehouse is still sexy

As you can probably imagine, the buzzwords flew high here. Big Data vendors in all shapes and sizes had booths, and it was amazing both to hear about and to see demonstrations of how cloud-based, unstructured data can be utilized. There is a lot of smart software and there are many new techniques on the way that will help us utilize our data in new and exciting ways.

Our takeaway from the Big Data World conference was, as I have written about earlier, that the need for structure and control in this unstructured "big data world" we are moving into, and indeed are in the middle of, is very important.

In my session I talked about getting ready for the new era, and prepping for it before you run into it. Ready, Prep, Go! The analysis is only as good as the quality and structure of the data, and integration with more structured master data entities is important when turning the analysis into business value.

The data warehouse isn't sexy anymore…

But what I have learned is that it's the word itself, not the concept, that is not sexy. Whether you call it a data hub, a discovery hub, a central information factory or a data warehouse doesn't really matter. The concept is still the same, with some variations of course. And whether it's modelled as a star schema or a snowflake, the end game is still to structure data for performance and flexibility and to get the best possible fact-based decisions out of your historical data.

(Espen giving his presentation in London)

When we talk about big data and real-time streaming data, this kind of data preparation seems old-fashioned and unnecessary. But once you have built real-time reporting against your "IoT thingies", the smart thing is to save some of that data into a prepared data hub, so you can see development over time and connect it to your other dimensions. This was an area about which there was close to no information.

 

(Gartner's 2016 Logical Data Warehouse)

I talked about how our tool can be used to bridge the gap between the two worlds: the data blending between your EDW and your data lake, and how we fit into Gartner's concept of the Logical Data Warehouse architecture. If you use a self-service data preparation tool like Xpert BI to help you move and structure data from your data lake into your structured environment, you will get more value out of both your structured and unstructured data. We have some great ideas on automating and accelerating this process, and we are working hard on getting it from concept to real-world implementation. These are exciting times!

At Gartner, the audience was a bit closer to our main target group. Anja did a great session talking about her experience with some of our customers and the challenges they have met building and maintaining a data warehouse, or data hub, and about how, now that everyone is doing analytics, you have to do it smarter, using automation wherever possible, to gain and maintain a competitive advantage. Our booth was swamped after the session with companies telling us they had encountered one or more of the challenges Anja described. It seems the challenges involved in building and maintaining a well-run EDW are global and timeless.

Many of those challenges come back to ETL tools allowing too much creativity in prepping your data and not governing the process enough. With a tool like Xpert BI you force development into a less creative state, with out-of-the-box technical documentation. Then it is up to you to move the creativity out to the end-user layer, where it belongs.

(Anja giving her presentation in London)

I attended a session where Teradata and Tesco talked about their new architecture. Tesco is huge, and the complexity and variety of their technology portfolio was more than impressive. In my previous job I was a BI manager at a Norwegian retailer, but this was science fiction to me. As Stephen Brobst, CTO at Teradata, said: with all these technologies you will still have to ingest, or blend, some of the data into your traditional data warehouse to be able to do reporting over time. It's always good to hear one of the rock stars in our field say the same thing as yourself 🙂

And one other thing he said, which cannot be said enough times: Data Lake does NOT equal Hadoop!

To sum up, it was an inspiring and extremely educational trip, and I hope BI Builders managed to inspire some of you as well.