The ”data lake”, a catchy new buzzword in analytics circles, has many people wondering if they still need a data warehouse. You may have heard that you can run analysis directly against the data lake, and that’s true. This quickly leads to the question, why build a data warehouse when you can have a data lake with a fraction of the effort?
Let’s start at the beginning. What exactly is a data lake?
Here’s how Gartner defines it: ”…enterprise-wide data management platforms for analyzing disparate sources of data in its native format…” Basically, a data lake is a collection of data from multiple sources that is loaded into a central repository in its native format. Ok, so what is it good for?
Because the data in the lake is “raw”, it is best suited for data scientists to use when attempting to train models, uncover trends, and such. The data lake is not a good solution for organizations seeking to track promotion performance, identify the most valuable customers, track performance against enterprise goals, etc. This type of analysis requires integrated, cleaned and conformed data. Try a little experiment to prove it to yourself: Connect to your organization’s ERP system and start digging through the hundreds or thousands of tables to see how quickly you can produce a quarterly budget versus actual report. If you have additional ERP systems or other data sources you need for your analysis the problem gets even more difficult to say the least.
In other words, while a data lake and a data warehouse may both be fed from the same data sources, their use cases differ greatly. One does not replace the other, and both are very useful. The term data lake may be new, but the concept is quite old. Often data warehousing includes an initial staging point which could be considered a data lake. This data is not conformed, not integrated, and typically not trimmed. It is raw data that could be useful to skilled data scientists, but it is far less useful to the typical business analyst and nearly worthless to the non-technical manager. Even with a strong knowledge of SQL it will take countless hours to join, organize and conform the data into data structures that accurately represent operations and can be used for analysis.
In summary, data lakes can be a valuable resource for organizations looking to conduct advanced statistical analysis on huge volumes of data, but they aren’t a replacement for data warehouses. Sure, you could use a data lake to create an accurate EBITDA by business unit, but to do so you will need to first figure out how to integrate & conform the source data, essentially the same steps you go through to create a data warehouse, only your resulting solution will be single-use whereas a data warehouse is an automated solution that remains up-to-date..