The choice of where and how to store the data for the analytical use is a critical architectural question. Part of the issue is the Contemporary Data Architecture (CDA) pattern, which is explained in Table 1. CDA patterns include:
- Independent Data Marts
- Coordinated Data Marts
- Centralized Data Warehouse
- Hub and Spoke
- Federated / Logical Data Warehouse
- Big Data / Data Lake.
Table 1 DW Architecture Patterns
Pattern | Description |
Independent Data Marts | Multiple databases containing analytic data are created and maintained by different organizational units. The databases tend to be inconsistent with each other, having different dimensions, measures and semantics. There is “no single version of the truth” and it is a challenge to perform cross data mart analysis. These weaknesses make the independent data mart the least desirable of the architecture patterns. |
Coordinated Data Marts | Data is harmonized across multiple analytic databases or areas of databases. Each database meets a specific need such as customer segmentation or product analysis. There are shared dimensions and semantics between the databases. For example, customer and product dimensions are consistent across the databases. Consistent dimensions and semantics support a logical enterprise view. |
Centralized Data Warehouse | An Enterprise Data Warehouse (EDW) contains atomic data, summarized data and dimensional data. This is provided through a single database. Logical data marts are built from this database to support specific business requirements. Analysis between subject areas is facilitated by data being stored in a single database. |
Hub and Spoke | The hub and spoke architecture is similar to the centralized data warehouse architecture except that there are physical data marts instead of logical data marts. A cost and complexity element is added due the need to copy data from the central data warehouse to the distributed data marts. |
Federated / Logical Data Warehouse | The federated/logical data warehouse makes data that is distributed across multiple databases look to users like a single data source. Data may be stored in traditional relational databases as well as Hadoop or NoSQL formats. Data Virtualization software is used to create a logical view of the data and to optimize query of data across databases.
Common keys such as customer and product identifiers are used to tie the data together. This approach reduces the expense and time needed for data movement. The approach may have a high overhead and slow response time because queries between data sources in multiple locations can be inefficient. |
Big Data / Data Lake | The big data/data lake architecture stores data in the Hadoop environment which is an economical and flexible way to store large volumes of data including unstructured data.
Some organizations are shifting away from traditional relational databases and moving to big data environments. |
Data Model Patterns for Data Warehousing
A data model is a graphical view of data created for analysis and design purposes. While architecture does not include designing data warehouse databases in detail, it does include defining principles and patterns for modeling specialized parts of the data warehouse system.
Areas that require specialized patterns are:
- Staging / landing area – looks like source systems
- Atomic Data Warehouse – uses normalized ERD (Entity Relationship Diagram)
- Data mart – uses dimensional modeling – the star schema, the snowflake schema or the cube.
In addition to these specialized patterns, the architecture should include other pattern descriptions for:
- Naming of tables and columns
- Assignment of keys
- Indexing
- Relational Integrity (RI)
- Audit history.