Databases, Warehouses, Lakes, & Marts
Making Sense of Your Viz's Data Source
A dashboard is only as good as the developer’s understanding and application of the data feeding it.
Understanding the granularity, structure, and source of your data is essential to delivering an accurate and meaningful analysis. Although many visualizations on Tableau Public use Excel as their data source, most dashboards in business use cases rely on either live connections or extracts from a database. This is where understanding the terminology around data sources becomes crucial.
When accessing data from a live data connection, it is absolutely crucial that you understand how the data is being stored. A number of terms describe where a data set lives, and they are sometimes wrongly used interchangeably, so we want to set the record straight and make sure you're not mixing up your databases, warehouses, marts, and lakes.
Most likely your data is coming from one of four places: a database, data warehouse, data mart, or data lake. Knowing the difference is important, so let's take a look at some specific aspects of each.
A database is a structure of tables, columns, and rows. Generally, the data in a database comes from a single source, and most analytics databases are relational. For example, an invoice number in one table will match the invoice number field in another table, allowing joins on that common field.
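That relational idea can be sketched in a few lines of Python using the built-in SQLite engine. The table and column names below are hypothetical, not from any particular system; the point is simply that two tables sharing an invoice number can be joined on it.

```python
import sqlite3

# An in-memory relational database with two hypothetical tables
# that share a common field: the invoice number.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE invoices (invoice_no TEXT, customer TEXT)")
cur.execute("CREATE TABLE line_items (invoice_no TEXT, product TEXT, amount REAL)")
cur.execute("INSERT INTO invoices VALUES ('INV-001', 'Acme Co')")
cur.executemany(
    "INSERT INTO line_items VALUES (?, ?, ?)",
    [("INV-001", "Widget", 25.0), ("INV-001", "Gadget", 40.0)],
)

# Join the two tables on the shared invoice number and total the line items.
cur.execute("""
    SELECT i.invoice_no, i.customer, SUM(li.amount)
    FROM invoices i
    JOIN line_items li ON li.invoice_no = i.invoice_no
    GROUP BY i.invoice_no, i.customer
""")
result = cur.fetchall()
print(result)  # [('INV-001', 'Acme Co', 65.0)]
```

This is the same kind of join Tableau performs when you drag two tables onto the data source canvas and relate them on a common field.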
A data warehouse is similar to a database in that it holds relational, structured data, but it is meant to hold data from multiple sources. This allows data from different sources to be queried together.
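To make "queried together" concrete, here is a small sketch, again with SQLite standing in for the warehouse. The two source tables (a hypothetical CRM system and a hypothetical web store) have been loaded side by side, so one query can combine them.

```python
import sqlite3

# Two hypothetical source systems, consolidated into one warehouse.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE crm_orders (customer TEXT, amount REAL)")
cur.execute("CREATE TABLE web_orders (customer TEXT, amount REAL)")
cur.execute("INSERT INTO crm_orders VALUES ('Acme Co', 100.0)")
cur.execute("INSERT INTO web_orders VALUES ('Acme Co', 50.0)")

# Because both sources live in the same warehouse, a single query
# can total a customer's orders across systems.
cur.execute("""
    SELECT customer, SUM(amount)
    FROM (
        SELECT * FROM crm_orders
        UNION ALL
        SELECT * FROM web_orders
    ) AS all_orders
    GROUP BY customer
""")
combined = cur.fetchall()
print(combined)  # [('Acme Co', 150.0)]
```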
A data mart often refers to a self-service, web-based feature that lets users grab data in extract form for further analysis or reporting. The data in a data mart can come from one of several sources, so it's good to understand which data source feeds the mart to get a full picture of what is available to users. These interfaces are usually highly curated, but it is still good to know how the data gets there.
A data lake is unlike the rest of these in that it can hold unstructured, semi-structured, or structured data from multiple sources. Instead of cleaning data before loading it into a data warehouse, data is dumped into the lake and only the data that is needed gets transformed, so a data lake contains both clean and unclean data. As an example, an invoice number from one table could mean something completely different from an invoice number in another table. Because data lakes lack rigid schemas, they are easier to modify and maintain. This way of storing data has been a trend in recent years, though its popularity has been declining as of late.
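The "schema on read" pattern behind data lakes can be sketched in Python. The records below are invented for illustration: a lake holding raw records of mixed shape, where only the records a given analysis needs get parsed and transformed, and where the same field name (`invoice_no`) means different things in different source systems.

```python
import json

# A hypothetical data lake: raw records of mixed shape from multiple systems.
lake = [
    json.dumps({"invoice_no": "INV-001", "amount": 65.0}),  # billing system
    json.dumps({"invoice_no": 7, "status": "picked"}),      # warehouse system:
                                                            # same field name,
                                                            # different meaning
    "2024-01-01,web,clickstream",                           # raw CSV line
]

def read_billing_invoices(raw_records):
    """Apply structure at read time: keep only usable billing records."""
    out = []
    for rec in raw_records:
        try:
            obj = json.loads(rec)
        except json.JSONDecodeError:
            continue  # skip records that are not JSON at all
        # Billing invoices use a string invoice number and carry an amount.
        if isinstance(obj.get("invoice_no"), str) and "amount" in obj:
            out.append(obj)
    return out

billing = read_billing_invoices(lake)
print(len(billing))  # 1 -- only the billing record survives the transform
```

The clean and unclean records coexist in the same lake; structure is only imposed on the slice of data a particular analysis pulls out.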
Once you have determined the source of your data and how it’s being stored, there are still some more questions that you should ask before using the data.
Where is this data coming from? Knowing the actual source system(s) of the data will help you understand the scope of what you are looking at. Make sure you understand, at a functional level, how the data is collected.
What does each row in this data set represent? This is perhaps the most important question for an analyst to answer. Data sets can be aggregated at all different levels, so knowing what each row represents is vital. If you're dealing with sales data, does each row represent a transaction, a store and its sales for each day, or MTD sales by region? The answers will affect every part of dashboard development, from calculations to design and layout.
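One practical way to answer the granularity question is to test whether a candidate set of columns uniquely identifies every row. A small sketch, using invented daily store-level sales rows:

```python
from collections import Counter

# Hypothetical sales data: one row per store per day.
rows = [
    {"store": "S1", "date": "2024-01-01", "sales": 1200.0},
    {"store": "S2", "date": "2024-01-01", "sales": 950.0},
    {"store": "S1", "date": "2024-01-02", "sales": 1100.0},
]

def is_grain(rows, keys):
    """Return True if the key columns uniquely identify every row."""
    counts = Counter(tuple(r[k] for k in keys) for r in rows)
    return all(c == 1 for c in counts.values())

print(is_grain(rows, ["store", "date"]))  # True: one row per store per day
print(is_grain(rows, ["date"]))           # False: dates repeat across stores
```

If store plus date is the grain, then summing sales by date is safe, but treating each row as a transaction would be wrong; running a check like this before building anything can save a lot of rework.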
How is the data transformed? To avoid querying data sources directly, semantic layers transform the data that feeds into data visualization software. It is important to understand how your data may be aggregated at this semantic layer so you know exactly what data you have at your disposal. As a side note, Tessellation suggests using semantic layers to increase the performance of your end product. By performing calculations and aggregations at the semantic layer, you remove much of the burden at the server level when your viz renders. This helps your viz and its features perform much faster than if you used data straight from the source.
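The pre-aggregation idea can be sketched as follows: raw transactions (invented here for illustration) are rolled up once, upstream, so the viz only ever queries a small summary table instead of scanning raw rows at render time.

```python
from collections import defaultdict

# Hypothetical raw transactions: (region, month, amount).
transactions = [
    ("East", "2024-01", 100.0),
    ("East", "2024-01", 150.0),
    ("West", "2024-01", 200.0),
]

def build_summary(txns):
    """Pre-aggregate sales by region and month, as a semantic layer might."""
    totals = defaultdict(float)
    for region, month, amount in txns:
        totals[(region, month)] += amount
    return dict(totals)

# The viz reads this small summary instead of the raw transaction rows.
summary = build_summary(transactions)
print(summary[("East", "2024-01")])  # 250.0
```

The trade-off is the one the question above is probing: once the semantic layer has aggregated to region and month, the individual transactions are no longer available downstream, so you need to know that before you design the dashboard.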