Recently I’ve been thinking about my concepts of data warehouse design and if they still make sense for the market. A few months ago, I wrote some rules about what I think constitutes a good data warehouse, check them below.
DATA WAREHOUSE GOAL
In my opinion, all data warehouses are built to address a simple goal. No matter which kind of company we are talking about, they all possess the same objective: provide data to the business. What business does with the data, that’s another discussion which isn’t relevant data warehouse design.
The process to provide data is quite straightforward: there is a source, a flow that moves the data and a repository for the data. The flow usually does some sort of cleansing and transformation, but the concept is simple.
Having in mind the goal, is then the technology relevant? Don’t think so. That can be solely important for consumers considering their personal preference and having in mind the features that each vendor can provide in terms of database management but technology doesn’t define the existence of not of a Data Warehouse. I saw analysts building Data Warehouses using Excel files and Access databases. Sure it wasn’t the best solution for performance, multi-concurrence, and reliability but it served the goal.
Big data technologies extended the capacity of companies. Suddenly loading, transforming and consuming millions of rows is nothing for installed systems. Therefore, the conversation about the relevance and the structure of Data Warehouses started to appear. Faster loads can be done and data normalization isn’t a real need now because there aren’t performance problems while analyzing huge amounts of data.
Should we all move to Big Data technologies then? Should we push all the information we need to file repositories and then analyze it as we need?
Structured Data is still relevant. Cleaning and transformation of data are still relevant. I think we just need to embrace the new technologies and allow for a more hybrid solution instead of going full wild west or deciding for a complete strict approach.
What does this mean? Means that Modern Data Warehouses* should provide hybrid solutions each enables different types of users to access data in the state that fits their best interests:
- You are a Data Scientist and want pure raw data? Sure, just access the “Raw Zone”;
- You are a Data Analyst and want clean and transformed data according to business rules? Access the “Data Warehouse Zone”;
- You are a Business User and want data of a specific entity? Access the specific “Data Mart Zone”;
A great article about this topic can be found below here. In short, I can say that this sentence sums everything I think about this topic:
[M]odern data warehouse gives us – the flexibility to choose, to have different solutions for different use cases…Simon Whiteley
* Modern Data Warehouse is used in this article as the solution that aggregates all of the different analytical layers.