Database vs. Data Warehouse vs. Data Mart vs. Data Lake

 

What exactly is a database? How does a database differ from a data warehouse? How does a data mart relate to a data warehouse? And what on earth is a data lake? In this article, I’ll answer all of these questions.

Let’s start off with the question, “what exactly is a database?”

What is a Database?

Well, in theory, a database is anything that stores data in an organised fashion. So, in theory, an Excel file or text file could be considered a database.

But, in analytics and Business Intelligence, that is not what we mean when we use the term “database”. What we’re usually actually referring to is a database containing data in a structured format in a relational database management system (or RDBMS for short). So something like MySQL, SQL Server, PostgreSQL etc.

These databases are called relational because they are made up of different tables that are related to each other because they share common data. The relational model is used because it is a very efficient way of storing and retrieving data.

Each table contains data in a rows and columns tabular format (hence the name). Each column contains a field of data (a metric or dimension) and each row contains a record of that data.

When we’re talking about databases, we do also need to mention NoSQL databases that are designed to store and retrieve semi-structured and unstructured data. But these are far less common in a Business Intelligence setting.

So, databases are everywhere in your day-to-day life, behind applications, wherever data is being captured and stored.

What is a Data Warehouse?

Well, to make things confusing, a data warehouse is actually also a database. But it has different characteristics and purposes to a regular database. You see, a data warehouse is designed to be a centralised repository of data from lots of different sources.

Whereas databases are built to collect data, a data warehouse’s main purpose is to serve the analytics needs of an organisation. It’s built for better data retrieval which makes data analysis faster and more efficient.

This isn’t to say that you can’t configure a regular database to have high-performance data retrieval and analysis capabilities as well. In fact, some smaller businesses who might not need a fully functioning data warehouse might just do data analysis using the database.

But, ideally, you want to be letting the database get on with the principal job it’s designed to do of collecting data and not taxing its resources by conducting data analysis.

Another important characteristic of data warehouses is that they contain copies of both current and historical data that have been extracted from various sources usually via data pipelines and ETL processes. In Business Intelligence, having all of this data in one place enables users to blend and analyse it together to get a better, more cohesive, understanding of business performance.

What is a Data Mart?

So let’s move on to data marts. How does a data mart relate to a data warehouse?

Well, because a data warehouse can contain data related to all activities and areas of an organisation (operations, sales, marketing, finance, customer service etc), it can get pretty crowded and difficult to navigate for anyone wanting to locate and work with data that belongs to them - or at least that they have been authorised to work with.

So to make things simpler, data marts are created. What they do is essentially ringfence certain data sources for certain people within the organisation. This not only makes life easier but it also ensures data security and allows for compliance with data governance rules and regulations.

So, for example, the marketing team will get access to a marketing data mart, sales to a sales data mart, and so on and so forth. And what this does is it avoids certain people or departments seeing sensitive data they shouldn’t.

One other benefit of data marts is that they ensure everyone is working with the same data. So one user isn’t potentially analysing data they shouldn’t, making wrong calculations and getting erroneous results that then leads them to go on to make decisions based on those wrong results.

By restricting data to that which has been pre-prepared and vetted, it gives organisations what is often referred to as a “single source of truth”.

What is a Data Lake?

The last of the data storage terms we’re going to talk about is data lake. Like data warehouses, data lakes are centralised data repositories. But that is where the similarities end.

Data lakes are built to contain large volumes of data of all different types in all different formats. Everything from regular structured data like in data warehouses, to semi-structured data like JSON or XML files, to unstructured data.

Unstructured data is basically anything else. Literally anything. Anything from video and audio files, to images and graphics, to text documents, raw log files, sensor data, social media feeds. The list is endless.

I hope this has cleared everything up for you and answered any questions you may have had!

 
Adam Finer