Data Vault Tutorial
What is the Data Vault?
The Data Vault is a methodology for modeling, developing and populating databases which
contain historical data primarily used for business intelligence, analytics and data science.
In other words, the Data Vault is a methodology for building Data Warehouses.
The methodology builds on agile and other modern methodoogies.
The modeling approach builds on prior work: Raph Kimbal's Star Schema
and William Inmon's Normalized Data Warehouse.
The Data Vault books and training matrials provide in depth information
about each of the above table types. This includes: use cases,
special columns, load procedures and query procedures.
If you want to gain mastery of this methodology, classes and certifications
are recommended.
The author of this tutorial is NOT Data Vault certified --
so the contents are not official positions of the Data Vault Community.
Instead they are the perspectives of a Data Analytics professional, outside the community.
See links below for official Data Vault information.
Data Vault History and Evangelists
The Data Vault approach was developed by Dan Linstedt in the 1990s
and released in 2002 through a series of articles in
The Data Administration Newsletter (TDAN).
"The Data Vault Model is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise"
The Data Vault 2.0 specification was released in 2013.
It included improvements to the modeling method, improved data load,
big data support, architecture and best practice methods.
Data Vault References and Links
These websites and articles provide insight into the Data Vault:
Learn from these Data Vault videos:
Follow these Data Vault blogs:
Data Vault Benefits
These benefits and capabilities may be realized through the use of the Data Vault:
- Accessability: data is accessed using the SQL standard database language which is universally supported.
- Consistency: data in loaded in a consistent pattern -
retaining source system formats in the hub-links-satellite pattrn.
- Extenseability: new data types and formatis are readily
accomodated.
- Rapid Development: data vaults can be quickly created
using acceration tools.
- Scalabilty: can grow to support business needs.
- Timeliness: data can be loaded in real-time or near real-time - making data immediately ready for use.
Data Vault Data Organization and Flow
The Data Vault builds on the proven approach of: input, store/process and output.
The data flows through these steps and structures / zones:
- Data Sources: originating sources of data such as:
application databases, cloud providers, Internet of Things (IoT) and information brokers.
- Staging: landing area where data is first stored when entering the Data Vault.
Data is stored in the same format as it was in the Data Source.
- Raw Vault: the core of the Data Vault where data is stored in raw form (like the Data Source)
and structured so that history is retained. See below for more about the Raw Vault.
- Business Vault: data sourced from the Raw Vault with business rules applied to conform the
data to business needs. Data may be enriched with reference data.
- Information Marts: data sourced from the Raw Vault and Business Vault which supports a
specialized use case. A Data Vault may have multiple Information Marts such as: Dimensional Datastore for Business Intelligence,
Flattenned Datastore to support Data Science or Current Valued Datastore to support Services.
- Business Intelligence: providing business insights through reports, dashboards and data visualizations.
- Data Science: utilizes approaches like Artificial Intelligence (AI) and Machine Learning (ML)
which may needed data in flattenned or graph format.
- Services: provides access to data via APIs. For example, a web service may be used to
obtain current information about a customer account.
Raw Vault Zone of the Data Vault
Let's look at the Raw Vault in greater detail.
It has much in common with the Core Zone of the
Atomic Data Warehouse.
The Data Vault tends to be composed of the following zones:
These tables play the lead roles in the Raw Data Vault:
- Hubs: a single business entity identified by a
stabile, unchanging business key.
Data is inserted into the hub once per business key and
not changed after that.
- Links: a relationship between two or more business keys.
Links connect the hubs.
- Satellites: timephase data that describe hub or links.
Rows are inserted (no updates please) into the satellite as new data is ingested into the data vault.
Each satellite is populated from a single data source
and provides an audit trail of all changes.
These additional table types play supporting roles:
- PITs: Point In-Time tables are a special type of
satellite which speeds up performance
by improving join efficiency. The PIT table may contain datetimes which can be
joined to satellites associated with a single Hub. The PIT table may contain other data such as
calculated values.
- Bridges: brings together hub and link keys that support
a particular business case.
- Reference: provides static lookup data such as: calendars, currencies, countries and transaction codes.
Advertisements