What is a data warehouse?
Hello everyone, my name is Paul Felix I’m the founder of leapfrog bi. Today I’m going to be addressing a question that I get asked pretty regularly. And that is, what exactly is a data warehouse? A data warehouse is a database, as the name implies, but it’s actually a lot more than that. First of all, data does not originate in a data warehouse. Instead we’re going to go out to an organization’s source systems such as a CRM or customer relationship management system, your financial system, your HR system, your operational systems ,flat files, your master data. All this information is going to be collected ,or copied, from those systems of record, and brought into the data warehouse. So the data warehouse becomes a central repository with all of the company’s relevant information.
So that begs the question, why would we do such a thing? Why would we copy data that’s already maintained, that we know is current, in these systems of record, and then place it into this central repository? Well to answer that question we need to really take a step back and talk about how decisions are made. What is the impact of a decision or cumulative decisions on an organization? I think we could agree that an organization’s success is defined, or depends, on the cumulative ability of everyone in that organization’s ability to make good decisions, or to make decisions that have successful outcomes. So if that’s the case then how do we enable decision-makers to make good decisions? And by decision makers I want to make sure I’m clear about that we’re not only talking about those very strategic decisions that are made by only the executives in the organization. Those are very important decisions of course, but we’re also talking about routine, operational decisions that are made on a day to day basis by everyone in an organization. So how do we enable this full spectrum of decision makers to make the best decision possible? Well one way we do that is by providing decision-makers with relevant, accurate information. A decision is best made when a decision-maker understands the environment that influences that decision’s outcome. And that environment is kind of two-fold. First of all we have an organization’s internal information – we’ve talked about all these business systems already: CRM and financial and so on. But you also have external data such as the weather, the physical environment, you have the financial environments or the markets. All of those external influencers also need to be brought into a data warehouse so that they can be integrated with internal data and provided to a decision maker such that that decision-maker has the best possible understanding of the environment that impacts a decision’s outcome. And this is why a data warehouse is potentially one of the most valuable assets that an organization can possess.
All right so why, once again, why would we take data out of the systems of record and put it in the central repository – the data warehouse? Why not just go straight to the systems of record and use that as a source of our information to empower all of these decisions, or to provide information for all these decision-makers? I’m going to provide four reasons why a data warehouse exists:
- Reason number one: single version of the truth. This is probably the most commonly cited reason for data warehousing. It spans a lot of different concepts –I’m going to give a couple of them here. One example is an organization may have a number of business systems that track the same information. Let’s just talk about customer information. You may have customer information in a customer relationship management system, and you may have customer information in your financial system because you’re tracking sales, and you may have customer information even in your operational systems, potentially. It’s very important that when a person asks for a list of customers that they get the same answer from day-to-day or from person to person. That won’t happen, typically, if you’re trying to collect this customer information from each of your systems individually. So providing a single version of the truth is an important characteristic of a data warehouse. When we go to that data warehouse and we ask for a list of customers it is the enterprise list of customers. Any business logic that needs to be applied has already been applied when the business user goes to that data warehouse to retrieve that information.
- Reason number two: performance. Performance can really be broken down into two areas. First of all let’s assume that the data warehouse doesn’t exist. In that case we would have no option other than to go to where the data originates, which is the systems of record, to again empower those business users to get the information you need to make better decisions. If we go to a system of record and we ask that system a record to support the type of reporting analytics that we’re talking about, decision support information, we’re going to often times bring those systems to their knees. And that’s because, very simply, systems of records are designed for transaction processing. They’re designed to read very small amounts of information and write very small amounts of information at a time. Think of a point of sale. Every time someone makes a purchase that point of sale records a record that says here’s the purchase, here’s the line items of that purchase, here’s who purchased it – very small amounts of information. Now contrast that with our decision maker’s requirement. A decision maker may ask to see the aggregated sales volume for a particular product, quarter by quarter, and give me the comparison for the prior year’s sales for those same products. That requires often millions, multiple millions of records, to be traversed. And that type of question or query is going to often bring those systems to their knees which has adverse impact on the system, on the source system, because it’s no longer focused only on serving as a point of sale system, which is going to be, the performance of that is going to be deteriorated, but it’s now also trying to serve this decision support role. So the system of record impact is definitely a negative performance implication if we don’t have a data warehouse. On the other side of the spectrum, the business user is expecting to get an answer to their question. And we are trying to provide that business user with the best performance possible because we want them to use this information. We don’t want them to go to some report and have to wait for five minutes or ten minutes or an hour, or possibly have the report delayed by days potentially, because we have to batch process this thing to limit the impact on the source systems. So the user experience of the business user is another area where performance is critical. In a data warehouse once again, one of the roles of the data warehouse is to deliver a well performing repository. Whenever that user asks the question the data warehouse is going to respond in a timely way because we’re going to organize the data into data structures that are designed to support that type of question.
- Reason number three: simplicity. Applications have backends, or databases, that are highly normalized, and basically that means, once again, that they’re designed to carry out transaction processing. They’re designed for very small reads and writes. That is perfect for applications, but it is not perfect for a business user trying to go out and collect a piece of information. Once again our customer example: if you have a business user that wants to get a list of customers, and they’re just going to one application, such as, let’s use the customer relationship management system as an example. Well that’s CRM system, it may store customer information in two, three, a dozen or more tables, and consolidating that information into a single list of customers with all of the attributes that we want such as the address, a phone number, whatever it might be, the demographics of that customer, is often not at all a simple process. If you compound that problem by adding in two, three, four or a number of an organization’s applications or source systems, you have a situation that is just insurmountable for a business user to go out and try to achieve in any reasonable amount of time. So one of the roles of a data warehouse is to provide these users a simple way to navigate data. Whenever they go to the data warehouse and they ask for a list of customers, there it is, a simple list of customers. They ask for a list of products, there it is. They want to see the sales over the past year, no problem, here’s the date, you select the year you want, and it the information is filtered and returned for you very easily and intuitively.
- All right, reason number four: data persistence. This is very simple. Organizations have certain needs – they want to persist their data in certain ways. That doesn’t always align with the way source systems persist data. Let’s once again talk about a customer. An application may store customer information and it may also allow the business user that’s interfacing with that application to update that customer information such as an address. The application itself may or may not store a full history of that customers record. In other words, the application may store only the current version of a customer’s address, as opposed to keeping a full audit trail that says, well, at this point in time the customer address was A, and then at this point in time the customer address changed to B. I’m just giving one example of data persistence here, but an application has a certain behavior – it may create or store the whole audit trail of a customer record or it may store only the current version. If it’s the current version, well that doesn’t support a number of business requirements that an organization may have. As an example, if we want to know how a product is selling across the last two years, month to month, and we want to break that analysis down by location, well if we only know a customer’s current address then that analysis is not possible. We have to know where the customer was at the time of the sale. So a data warehouse is going to persist data in a way that meets an organization’s needs. If we need a full audit trail we’ll store that full audit trail, regardless of how the source of some decides to behave. Source systems also archive data, they just simply take some data offline to limit the load on their databases. Again, a data warehouse is going to have a different set of requirements. Often a lot of history is stored within a data warehouse and that history is very important to establish key trends that that are used, through different types of regression or analytics to determine what the future opportunity might hold. So again, a data warehouse is going to serve as the organization’s area where data is persisted.
Okay, so we talked about the four reasons why data warehouse exists, and we also defined what a data warehouse is. Now it’s also important to define what a data warehouse is not. And that is, it’s not a product and it’s not a technology. By a product I mean it’s not it’s not Oracle, it’s not db2, it’s not a Microsoft SQL server. It’s not a product. Products are very important, of course. Products and tools are all very important in implementing and maintaining and monitoring a data warehouse, but the data warehouse itself is not defined by any particular product. A data warehouse is not a technology. There are many ways of implementing a data warehouse, and technology, again, is very important to successfully implementing a data warehouse. But regardless, if the data warehouse is implemented in a relational database and a multi-dimensional queue but no sequel technology, or if the data is stored in RAM, or stored on dimensional disk, all of these things are, again, very important, but they don’t define the data warehouse.
Okay so we talked about what a data warehouse is: it’s database. We know why we’re creating a data warehouse: we’re going to use it as a decision support system. We know why we’re not going back to the systems of record: four reasons that I gave were data persistence, simplicity, performance and single version of the truth. And we also know what a data warehouse is not: it’s not a product, it’s not a particular technology. So now you know exactly what a data warehouse is. Go out there, get busy, build your data warehouse. Contact leapfrog bi and we’ll be happy to help.