Infogoal Logo
GOAL DIRECTED LEARNING
Master DW

Data and Analytics Tutorial

Data and Analytics Overview
Under Construction

Data and Analytics Success

Data and Analytics Strategy
Project Management
Data Analytics Methodology
Quick Wins
Data Science Methodology

Requirements

BI Requirements Workshop

Architecture and Design

Architecture Patterns
Technical Architecture
Data Attributes
Data Modeling Basics
Dimensional Data Models

Enterprise Information Management

Data Governance
Metadata
Data Quality

Data Stores and Structures

Data Sources
Database Choices
Big Data
Atomic Warehouse
Dimensional Warehouse
Logical Data Warehouse
Data Lake
Operational Datastore (ODS)
Data Vault
Data Science Sandbox
Flat Files Data
Graph Databases
Time Series Data

Data Integration

Data Pipeline
Change Data Capture
Extract Transform Load
ETL Tool Selection
Data Warehoouse Automation
Data Wrangling
Data Science Workflow

BI and Data Visualization

BI - Business Intelligence
Data Viaulization

Data Science

Statistics
Descriptive Analytics
Predictive Analytics
Prescriptive Analytics

Test and Deploy

Testing
Security Architecture
Desaster Recovery
Rollout
Sustaining DW/BI

Data Vault Tutorial

What is the Data Vault?

The Data Vault is a methodology for modeling, developing and populating databases which contain historical data primarily used for business intelligence, analytics and data science. In other words, the Data Vault is a methodology for building Data Warehouses. The methodology builds on agile and other modern methodoogies. The modeling approach builds on prior work: Raph Kimbal's Star Schema and William Inmon's Normalized Data Warehouse.

The Data Vault books and training matrials provide in depth information about each of the above table types. This includes: use cases, special columns, load procedures and query procedures. If you want to gain mastery of this methodology, classes and certifications are recommended.

The author of this tutorial is NOT Data Vault certified -- so the contents are not official positions of the Data Vault Community. Instead they are the perspectives of a Data Analytics professional, outside the community. See links below for official Data Vault information.

Data Vault History and Evangelists

The Data Vault approach was developed by Dan Linstedt in the 1990s and released in 2002 through a series of articles in The Data Administration Newsletter (TDAN).

    "The Data Vault Model is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise"
The Data Vault 2.0 specification was released in 2013. It included improvements to the modeling method, improved data load, big data support, architecture and best practice methods.

Data Vault References and Links

These websites and articles provide insight into the Data Vault:

Learn from these Data Vault videos:

Follow these Data Vault blogs:

Data Vault Benefits

These benefits and capabilities may be realized through the use of the Data Vault:

  • Accessability: data is accessed using the SQL standard database language which is universally supported.
  • Consistency: data in loaded in a consistent pattern - retaining source system formats in the hub-links-satellite pattrn.
  • Extenseability: new data types and formatis are readily accomodated.
  • Rapid Development: data vaults can be quickly created using acceration tools.
  • Scalabilty: can grow to support business needs.
  • Timeliness: data can be loaded in real-time or near real-time - making data immediately ready for use.

Data Vault Data Organization and Flow

The Data Vault builds on the proven approach of: input, store/process and output.

Data Vault Flow - Boxes

The data flows through these steps and structures / zones:

  • Data Sources: originating sources of data such as: application databases, cloud providers, Internet of Things (IoT) and information brokers.
  • Staging: landing area where data is first stored when entering the Data Vault. Data is stored in the same format as it was in the Data Source.
  • Raw Vault: the core of the Data Vault where data is stored in raw form (like the Data Source) and structured so that history is retained. See below for more about the Raw Vault.
  • Business Vault: data sourced from the Raw Vault with business rules applied to conform the data to business needs. Data may be enriched with reference data.
  • Information Marts: data sourced from the Raw Vault and Business Vault which supports a specialized use case. A Data Vault may have multiple Information Marts such as: Dimensional Datastore for Business Intelligence, Flattenned Datastore to support Data Science or Current Valued Datastore to support Services.
  • Business Intelligence: providing business insights through reports, dashboards and data visualizations.
  • Data Science: utilizes approaches like Artificial Intelligence (AI) and Machine Learning (ML) which may needed data in flattenned or graph format.
  • Services: provides access to data via APIs. For example, a web service may be used to obtain current information about a customer account.

Raw Vault Zone of the Data Vault

Let's look at the Raw Vault in greater detail. It has much in common with the Core Zone of the Atomic Data Warehouse. The Data Vault tends to be composed of the following zones:

Data Vault Architecture - Level 3

These tables play the lead roles in the Raw Data Vault:

  • Hubs: a single business entity identified by a stabile, unchanging business key. Data is inserted into the hub once per business key and not changed after that.
  • Links: a relationship between two or more business keys. Links connect the hubs.
  • Satellites: timephase data that describe hub or links. Rows are inserted (no updates please) into the satellite as new data is ingested into the data vault. Each satellite is populated from a single data source and provides an audit trail of all changes.

These additional table types play supporting roles:

  • PITs: Point In-Time tables are a special type of satellite which speeds up performance by improving join efficiency. The PIT table may contain datetimes which can be joined to satellites associated with a single Hub. The PIT table may contain other data such as calculated values.
  • Bridges: brings together hub and link keys that support a particular business case.
  • Reference: provides static lookup data such as: calendars, currencies, countries and transaction codes.

Advertisements

Advertisements:
 


Infogoal.com is organized to help you gain mastery.
Examples may be simplified to facilitate learning.
Content is reviewed for errors but is not warranted to be 100% correct.
In order to use this site, you must read and agree to the terms of use, privacy policy and cookie policy.
Copyright 2006-2020 by Infogoal, LLC. All Rights Reserved.

Infogoal Logo