Data Sources
Database Choices
Big Data
Atomic Warehouse
Dimensional Warehouse
Logical Data Warehouse
Data Lake
Operational Datastore (ODS)
Data Vault
Data Science Sandbox
Flat Files Data
Graph Databases
Time Series Data

Data Vault Tutorial

What is the Data Vault?

The Data Vault is a methodology for modeling, developing and populating databases which contain historical data primarily used for business intelligence, analytics and data science. In other words, the Data Vault is a methodology for building Data Warehouses. The methodology builds on agile and other modern methodoogies. The modeling approach builds on prior work: Raph Kimbal's Star Schema and William Inmon's Normalized Data Warehouse.

The Data Vault books and training matrials provide in depth information about each of the above table types. This includes: use cases, special columns, load procedures and query procedures. If you want to gain mastery of this methodology, classes and certifications are recommended.

The author of this tutorial is NOT Data Vault certified -- so the contents are not official positions of the Data Vault Community. Instead they are the perspectives of a Data Analytics professional, outside the community. See links below for official Data Vault information.

Data Vault History and Evangelists

The Data Vault approach was developed by Dan Linstedt in the 1990s and released in 2002 through a series of articles in The Data Administration Newsletter (TDAN).

"The Data Vault Model is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise" The Data Vault 2.0 specification was released in 2013. It included improvements to the modeling method, improved data load, big data support, architecture and best practice methods.

Data Vault References and Links

These websites and articles provide insight into the Data Vault:

Data Vault Data Modeling Specification : Dan Linstedt - The Defacto Standard of Data Vault Modeling
World-Wide Data Vault Consortium : Community sponsored Conferences and Certification
Data Vault 2.0 Certifications and Bootcamp : Includes Bill Inmon and Dan Linstedt
Data Vault UK User Group : Good presentations for download
Data Vault Alliance : Professional Community for Data Vault Enthusiasts
Wikipedia on Data Vault : Introduction, History and Links
Introduction to Data Vault for Data Warehousing : summary of book

Learn from these Data Vault videos:

Learn Data Vault : Dan Lindstedt and company – 40 Videos
Data Vault Academy : YouTube Channel
Data Vault Academy : Data Vault Core Objects
Data Vault Architecture : SOLR Case Study (1 hour)

Follow these Data Vault blogs:

Dan Linstedt Blog : Originator of Data Vault
Mike Evers Blog :
Kent Graziano Blog : Data Vault and Snowflake Evangelist
Hans Hultgren Blog :
Roelant Vos Blog :
Rick F. van der Lans Blog : Data Vault and Data Virtualization Author

Data Vault Benefits

These benefits and capabilities may be realized through the use of the Data Vault:

Accessability: data is accessed using the SQL standard database language which is universally supported.
Consistency: data in loaded in a consistent pattern - retaining source system formats in the hub-links-satellite pattrn.
Extenseability: new data types and formatis are readily accomodated.
Rapid Development: data vaults can be quickly created using acceration tools.
Scalabilty: can grow to support business needs.
Timeliness: data can be loaded in real-time or near real-time - making data immediately ready for use.

Data Vault Data Organization and Flow

The Data Vault builds on the proven approach of: input, store/process and output.

Data Vault Flow - Boxes

The data flows through these steps and structures / zones:

Data Sources: originating sources of data such as: application databases, cloud providers, Internet of Things (IoT) and information brokers.
Staging: landing area where data is first stored when entering the Data Vault. Data is stored in the same format as it was in the Data Source.
Raw Vault: the core of the Data Vault where data is stored in raw form (like the Data Source) and structured so that history is retained. See below for more about the Raw Vault.
Business Vault: data sourced from the Raw Vault with business rules applied to conform the data to business needs. Data may be enriched with reference data.
Information Marts: data sourced from the Raw Vault and Business Vault which supports a specialized use case. A Data Vault may have multiple Information Marts such as: Dimensional Datastore for Business Intelligence, Flattenned Datastore to support Data Science or Current Valued Datastore to support Services.
Business Intelligence: providing business insights through reports, dashboards and data visualizations.
Data Science: utilizes approaches like Artificial Intelligence (AI) and Machine Learning (ML) which may needed data in flattenned or graph format.
Services: provides access to data via APIs. For example, a web service may be used to obtain current information about a customer account.

Raw Vault Zone of the Data Vault

Let's look at the Raw Vault in greater detail. It has much in common with the Core Zone of the Atomic Data Warehouse. The Data Vault tends to be composed of the following zones:

Data Vault Architecture - Level 3

These tables play the lead roles in the Raw Data Vault:

Hubs: a single business entity identified by a stabile, unchanging business key. Data is inserted into the hub once per business key and not changed after that.
Links: a relationship between two or more business keys. Links connect the hubs.
Satellites: timephase data that describe hub or links. Rows are inserted (no updates please) into the satellite as new data is ingested into the data vault. Each satellite is populated from a single data source and provides an audit trail of all changes.

These additional table types play supporting roles:

PITs: Point In-Time tables are a special type of satellite which speeds up performance by improving join efficiency. The PIT table may contain datetimes which can be joined to satellites associated with a single Hub. The PIT table may contain other data such as calculated values.
Bridges: brings together hub and link keys that support a particular business case.
Reference: provides static lookup data such as: calendars, currencies, countries and transaction codes.

Data and Analytics Tutorial

Data and Analytics Success

Requirements

Architecture and Design

Enterprise Information Management

Data Stores and Structures

Data Integration

BI and Data Visualization

Data Science

Test and Deploy