Why build a data warehouse when you can have _____? Fill in the blank with the latest marketing epiphany.
What is a data lake? Here’s how Gartner defines it. “…enterprise-wide data management platforms for analyzing disparate sources of data in its native format…” Basically, a data lake is a collection of data from multiple sources that is loaded into a central repository in its native format.
A data lake is an ideal repository for data scientists who are training models, uncovering trends, and doing similar exploratory work. It is not a good solution for organizations seeking to track promotion performance, identify their most valuable customers, or measure performance against enterprise goals. Don’t take my word for it. Connect to your organization’s ERP system and start digging through the hundreds or thousands of tables. Then try to combine that data with the data from the ERP systems inherited in the last three mergers. The problem gets extremely complex, to say the least.
In other words, while a data lake and a data warehouse both contain data, their use cases differ greatly. One does not replace the other, and both are very useful. The term data lake may be new, but the concept is quite old. Data warehousing often includes an initial staging area, which could be considered a data lake. The data there is not conformed, not integrated, and typically not trimmed. It is raw data that will be useful to skilled data scientists. It would be a very bad idea to unleash a data lake on the typical business analyst. The analyst will spend countless hours attempting to conform data and build data structures that accurately represent operations.
Data lakes are a very valuable resource. They can be fast to create, are very flexible, and can support many kinds of analysis. But data lakes are not data warehouses. That is not to say that a data lake cannot produce an accurate EBITDA by business unit, but to do so you will first need to integrate and consolidate the source data, which is essentially equivalent to building a data warehouse.
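To make that last point concrete, here is a minimal Python sketch of what "integrate and consolidate" looks like for just two sources. Everything in it is hypothetical: the two ERP extracts, their column names, the business-unit mapping, the account-number ranges, and the simplification that EBITDA is revenue minus operating expenses (assuming those expenses already exclude depreciation and amortization). It is an illustration of the conforming work, not a real ERP schema.

```python
from collections import defaultdict

# Hypothetical raw extracts, kept in each source's native shape.
# ERP A: amounts in whole dollars, short business-unit codes.
erp_a_rows = [
    {"bu_code": "NA", "acct_type": "REVENUE", "amount_usd": 1_200_000},
    {"bu_code": "NA", "acct_type": "OPEX", "amount_usd": 700_000},
    {"bu_code": "EU", "acct_type": "REVENUE", "amount_usd": 900_000},
    {"bu_code": "EU", "acct_type": "OPEX", "amount_usd": 650_000},
]

# ERP B (say, from an acquired company): amounts in thousands,
# long division names, accounts identified only by number.
erp_b_rows = [
    {"division": "North America", "gl_account": 4000, "amt_thousands": 500},
    {"division": "North America", "gl_account": 6100, "amt_thousands": 300},
    {"division": "Europe", "gl_account": 4000, "amt_thousands": 400},
    {"division": "Europe", "gl_account": 6100, "amt_thousands": 100},
]

# Assumed mapping from ERP B division names to the shared unit codes.
DIVISION_TO_BU = {"North America": "NA", "Europe": "EU"}


def classify_gl_account(account):
    """Assumed chart of accounts: 4xxx is revenue, 6xxx is operating expense."""
    if 4000 <= account < 5000:
        return "REVENUE"
    if 6000 <= account < 7000:
        return "OPEX"
    raise ValueError(f"unmapped GL account: {account}")


def conform_a(row):
    """Rename ERP A fields into the shared shape."""
    return {
        "business_unit": row["bu_code"],
        "category": row["acct_type"],
        "amount_usd": row["amount_usd"],
    }


def conform_b(row):
    """Map ERP B names, account numbers, and units into the shared shape."""
    return {
        "business_unit": DIVISION_TO_BU[row["division"]],
        "category": classify_gl_account(row["gl_account"]),
        "amount_usd": row["amt_thousands"] * 1000,
    }


def ebitda_by_business_unit(conformed_rows):
    """Simplified EBITDA: revenue minus operating expenses per business unit."""
    totals = defaultdict(int)
    for row in conformed_rows:
        sign = 1 if row["category"] == "REVENUE" else -1
        totals[row["business_unit"]] += sign * row["amount_usd"]
    return dict(totals)


conformed = [conform_a(r) for r in erp_a_rows] + [conform_b(r) for r in erp_b_rows]
ebitda = ebitda_by_business_unit(conformed)
print(ebitda)
```

Notice that the calculation itself is a few lines. Nearly all of the code is mapping names, units, and account classifications into one shared shape, and this is for two tidy sources with four rows each. Scale that up to thousands of tables across several ERP systems and you have, in effect, built the integration layer of a data warehouse.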