Data Analytics Architecture Patterns
Data Architecture is a complex topic which may seem to be difficult
to organize and learn.
What can be done to enable an order approach?
Part of the answer is to start a high level of detail (Level 0)
and progress to a finer level of detail (Level 0):
- Level 0: Database Symbol - a single disk symbol with a title of the database is used for high level diagrams. (This is an architecture level.)
- Level 1: Ingest and Expose - a rectange showing input and output connectors. (This is an architecture level.)
- Level 2: Inner Zones - major zones of functionality within a database
- Level 3: Nested Inner Zones - zones within inner zones – a lower level of detail. (This is an architecture level.)
- Level 4: Information Domains - information topics in scope
for the database
- Level 5: Logical Data Model and Procedure Signature - data model rendered as Entities, Attributes and Relationships plus description of procedures that includes procedure name and parameters. (This is a design level.)
- Level 6: Physical Data Model and Procedure Specifications - data model rendered and tables, columns and foreign keys plus specification of procedures that includes detailed logic. (This is a design level.)
Another part of the answer is to apply consistent design patterns and
principles to each database and/or database system as appropriate.
This will result in fewer architectural components and enable team members to
work on multiple databases. Databases and their associated applications tend to
include the following architectural elements ("inner zones"):
- Ingest: zone where data is input to the database.
- Core: zone where subject content is stored. This is
central focus of the database.
- Expose: zone where data is made available outside of the database.
- Process: zone where database processes are tracked and controlled.
- Archive: zone where history data is stored.
- Metadata: zone where data describing database content and structure are stored. Glossaries and Data Lineage are examples of data managed here.
- Notify: zone where logs of events are stored.
What follows is a description of each architectural level along with a description
of the inner zones. The Atomic Warehouse database is used an example.
Level 0: Database Symbol
The Level 0 diagram shows each database as a disk symbol which
is the commonly used method of show databases on diagrams.
The following diagram shows a single database.
Level 0 diagrams often depict data flows between databases.
Level 1: Ingest and Expose
The Level 1 diagram shows the paths data follows as input to and
output from the database.
In current data architecture terminology data is ingested - acquired from
outside the database and it is exposed - provided to consumers outside
of the database. Data flows from one database or systems Expose to the next
system Ingest.
These are examples of Ingest methods:
- Services/APIs: a defined interface for interacting with a
system. The external system calls the API to send data to the
target database. APIs tend to be implemented using programming
languages external to the database such as Java and Python.
Message payloads are often rendered in XML or JSON.
- Procedures: a procedural method for interfacting with a
database. An external system calls a database stored procedure to
provide data to the target database.
- Views / Virtual: the target database obtains data
through an exposed view of the source database
or through a exposed view of a data virtualization layer such as:
Dremio, Presto or Denodo.
- Landing Zone: data is copied from a data source to an
set of designated Ingest tables.
The Landing Zone may be populated through tools such as:
Extract Transform Load (ETL), Change Data Capture (CDC),
Data Warehouse Automation (DWA) and webscraping.
- Subscribe: data is acquired through a Publish-Subscribe (Pub-Sub)
messaging system. Data might enter via a Message Business or output
from a Data Pipe like Kafka.
These are examples of Expose methods which tend to match up to Ingest methods:
- Services/APIs: a defined interface for interacting with a
system. The external system calls the API to send data to the
target database. APIs tend to be implemented using programming
languages external to the database such as Java and Python.
Message payloads are often rendered in XML or JSON.
- Procedures: a procedural method for interfacting with a
database. An external system calls a database stored procedure to
provide data to the target database.
- Views / Virtual: the target database obtains data
through an exposed view of the source database
or through a exposed view of a data virtualization layer such as:
Dremio, Presto or Denodo.
This might be thought of as a Virtual Drop Zone.
- Drop Zone: data is copied from the core to a set of
designated Expose tables.
The Drop Zone may be populated through tools such as: Extract Transform Load (ETL),
stored procedures, external programming languages such as Python.
- Publish: data is exposed through a Publish-Subscribe (Pub-Sub)
messaging system. Data might exit via a Message Business or output
to a Data Pipe like Kafka.
Level 1 diagrams often depict data flows between databases where detail
concerning Ingest or Expose of data is desired.
Level 2: Inner Zones
Level 2 reveals the internal structure of the database at zone level.
These zones are often included in the internal structures of databases:
- Ingest: zone where data is input to the database.
It is best practice to limit database inputs to the Ingest Zone.
- Core: zone where subject content is stored. This is
central focus of the database.
It is best practice to isolate the Core Zone from external processes and databases.
- Expose: zone where data is made available outside of the database.
It is best practice to limit database outputs to the Expose zone.
- Process: zone where database processes are tracked and controlled.
It is best practice to use the same Process Zone schema across databases.
- Archive: zone where history data is stored in Raw Immutable form.
This means that data is stored in the Ingest form and not is not altered which
makes for an effective audit trail.
- Metadata: zone where data describing database content and structure are stored. Glossaries and Data Lineage are examples of data managed here.
It is best practice to use the same Metadata Zone schema across databases
or to share a Metadata repository across the enterprise.
- Notify: zone where logs of events are stored.
It is best practice to use the same Process Zone schema across databases.
Notifications may be sent to a centralized Notifications System or Database.
Level 3: Nested Inner Zones
Level 3 drills into the inner structure to show internal design patterns.
For example, the Atomic Warehouse Core tends to have these structures
or similar structures with different names:
- Object: tables that are identified through
business keys. The Object contains enough information to be identified.
The primary key of the object is inherited by the Tie and Properties tables.
Similar tables in other approaches include: header, master and
hub (Data Vault).
- Tie: tables that associate one or more Object tables.
Similar tables in other approaches include: association, relationship and
link (Data Vault).
- Properties: tables that contain data elements that
describe an Object or Tie.
- Guide: tables that enable efficient and effective
data access such as: hierarchy navigation, supertype / subtype
and use conditions.
- Reference: tables that contains static lookup data such as:
calendards, currencies, countries and transaction codes.
Level 4: Information Domain Model
Level 4 describes the Information Domains within the database.
Level 5: Logical Data Model and Procedure Signature
Level 5 describes the solution in terms of Logical Data Models
and Procedure External Description.
Level 5 is more Design Level than Architectural Level.
A Logical Data Model is a detailed data model
that describes a solution using
business terms. The Logical Data Model:
- Is created on a detailed level using entities, relationships, and attributes
- Uses Entity-Relationship Diagramming (ERD)
- Is used during requirements and design phases
- Is indpendent of a specific database or technology.
The Level 5 Procedure Description provides the following information
about each Procedure (such as stored procedure or exposed service method.
- Procedure Name: descriptive name or title for the procedure.
- Description / Purpose: short narrative that describes the procedure.
- Inputs: list of input parameter descriptive names.
- Outputs: list of output descriptive names.
Level 6: Physical Data Model and Procedure Specifications
Level 6 describes the solution in terms of Physical Data Models,
Data Mapping Specifications
and Procedure Specifications.
Level 6 is more Design Level than Architectural Level.
A Physical Data Model is a detailed data model
that describes a solution in technical for a specific
database. The Physical Data Model is derived from the Logical Data Model:
- Tables are derived from Entities.
- Foreign keys are derived from Relations.
- Columns are derived from Logical Data Model Attributes.
- Uses Entity-Relationship Diagramming (ERD)
- Is used during design and build project phases
- Is specific to a particular database, including optimization and constraints.
Data Mapping Specifications identify sources and targets of data at the
attribute level. The specification which may be developed using a
graphic tool includes:
- Data Source: at the database, message and table level.
- Data Target: at the database, message and table level.
- High Level Actions: copy, filter, validate, edit and enrich for example.
- Source Attributes: at the column and field level.
- Target Attributes: at the column and field level.
- Detail Level Actions: manipulate, sum and encode for example.
The Level 6 Procedure Specification provides the following information
about each Procedure (such as stored procedure or exposed service method.
- Procedure Name: technical name or title for the procedure.
- Description / Purpose: short narrative that describes the procedure.
- Inputs: list of input parameter technical names with datatypes
- Outputs: list of output technical names with datatypes
- Logic: description of procedure logic including actions such as:
query, filter, validate, aggregate and calculate.