Data Sources
Database Choices
Big Data
Atomic Warehouse
Dimensional Warehouse
Logical Data Warehouse
Data Lake
Operational Datastore (ODS)
Data Vault
Data Science Sandbox
Flat Files Data
Graph Databases
Time Series Data

Data Sources for Data and Analytics

Learn How to Manage Data Sources for Input to Data Warehousing and Business Intelligence

A data and analytics effort is only as good as the data that is put into it. The saying "Garbage In, Garbage Out" is all too true. A leading cause of data warehousing and business intelligence project failures is to obtain the wrong or poor quality data.

This article will provide an efficient and effective workflow for obtaining the right source data and using it in the data warehousing and business intelligence project.

Managing data warehouse input sources includes a number of steps organized into two phases. In the first phase the following activities are undertaken:

Manage the Data Source Identification Process
Identify Subject Matter Experts (SMEs)
Identify Dimension Data Sources
Identify Fact Data Sources

When the major data sources have been identified it is time to quickly gain detailed understanding of each one:

Obtain Existing Documentation
Model and Define the Input
Profile the Input
Improve Data Quality
Save Results for Further Reuse

Manage the Data and Analytics Data Source Identification Process

The source identification process is critical to the success of data warehousing and business intelligence projects. It is important to move through this effort quickly, obtaining enough information about the data sources without being bogged down in excess detail while still obtaining the needed information.

Start out with a list of the entities planned for the data warehouse / data mart. This can be managed with a spreadsheet containing these columns:

Entity name
Data mart role (Fact, Dimension, Bridge, etc.)
Subject Area
Data Source(s)
Analyst Name(s)
Subject Matter Expert(s)
Status

Data Warehousing Data Source Entity Plan

Complete the entity name, data mart role and subject area entries. Assign an analyst to each entity who will find data sources and subject matter experts for each entity.

Identify Data Warehousing Data Source Subject Matter Experts

Consider the following questions when determining the sources and costs of data for the Data Warehouse:

Where does the data come from?
What processes are used to obtain the data?
What does it cost to obtain the data?
What does it cost to store the data?
What does it cost to maintain the data?

Identify Dimension Data Sources for the Data Mart

Dimensions enable business intelligence users to put information in context They focus on questions of: who, when, where and what. Typical dimensions include:

Time period / calendar
Product
Customer
Household
Market Segment
Geographic Area

Master data is a complementary concept and may provide the best source of dimensional data for the data warehouse. Master data is data shared between systems that describes entities like: product, customer and household. Master data is managed using a Master Data Management (MDM) system and stored in an MDM-Hub. Benefits of this approach include:

It is less expensive to access data from a single source (MDM-Hub) than extracting from multiple sources.
MDM data is rationalized.
MDM data is of high quality

If an MDM-Hub does not exist consider creating one. It will have many uses beyond supporting the data warehouse and business intelligence.

If no MDM-Hub is available, you will need to examine source systems and determine which system contains the data most suitable for dimensions. If the data is not stored in a managed database, you may need to define the data locally, in a spreadsheet or desktop database, and then provide to the data warehousing system.

Identify Fact Data Sources for the Data Mart

The Fact contains quantitative measurements while the Dimension contains classification information. The data sources for Fact tends to be transactional software systems. For example:

System	Example Fact Data
Sales Order Entry	Sales Transaction Return Transaction
Customer Service	Service Episode Service Result
Accounts Payable	Payment Transaction
Sales Campaign	Sales Campaign Event

Larger enterprises may have multiple systems for the same kind data. In that case, you will need to determine the best source of data - the System of Record (SOR) as the source of data warehousing data.

[PAGE_BREAK]

Detailed Data Source Understanding for Data Warehousing

When the major data sources have been identified it is time to quickly gain detailed understanding of each one. Consolidate the spreadsheet developed in the identification phase by data source, then create a new spreadsheet to track and control detailed understanding:

Data Subject Name
Obtain Doc Date
Define Input Date
Profile Input Date
Map Date
Data Quality Date
Save Results
Analyst Name
SME Name(s)
Status

Data Source Plan P2

This approach provides an effective workflow as well as a project planning and control method. Due dates are assigned and actual complete dates and status are tracked.

Obtain Existing Documentation

When seeking to understand a data source, the first thing to do is look at existing documentation. This avoids "re-inventing the wheel". If a data source is fully documented, data profiled and of high quality most of the job of data source discovery is complete.

Existing documentation may include:

Data models
Data dictionary
Internal / technical documentation
Business user guides
Data profiles and data quality assessments

Check through the documentation to assess its completeness and usefulness.

The data source analyst should study the existing documentation before any in depth discussions with the SMEs. This improves the credibility of the data analyst and save time for the SMEs.

Model and Define the Input

The data model is a graphic representation of data structures that improves understanding and provides automation linking database design to physical implementation. This section assumes that the data source is stored in a relational database that modeled using typical relational data modeling tools.

If there is an existing data model, start with that, otherwise use the reverse engineering capability of the data modeling to build a physical data model. Next, group the tables that are of interest into a subject area for analysis. Unless, a large percentage of the data source is needed for the data warehouse avoid studying the entire data source. Stay focused on the current project.

For each selected data source table define:

Physical Name
Logical Name
Definition
Notes

For each selected data source column define:

Physical Name
Logical Name
Order in Table
Datatype
Length
Decimal Positions
Nullable/Required
Default Value
Edit Rules
Definition
Notes

Profile the Data Source

The actual use and behavior of data sources often tends not to match the name or definition of the data. Sometimes this is called "dirty data" or "unrefined data" that may have problems such as:

Invalid code values
Missing data values
Multiple uses of a single data item
Inconsistent code values
Incorrect values such as sales revenue amounts

Data profile is an organized approach to examining data to better understand and later use it. This can be accomplished by querying the data using tools like:

SQL Queries
Reporting tools
Data quality tools
Data exploration tools

For code values such as gender code and account status code do a listing showing value and count such as this gender code listing:

Code	Count	Notes
F	500	Female
M	510	Male
T	12	Transgender?
Z	5	???
NULL	1000	Missing

Other systems may represent female and male as 1 and 2 rather than F and T, and so may require standardization when stored in the data warehouse. When data from multiple sources is integrated in the data warehouse it is expected that it will be standardized and integrated.

Statistical measures are a good way to better understand numeric information such as revenue amounts. Helpful statistics are:

Mean (average)
Median
Mode
Maximum
Minimum
Quartile Averages
Standard Deviation
Variance

Consistency within a database is another important factor to determine through data profiling. For example, there may be an order table which should only have orders for customers established in the customer table. Perform queries to determine whether this is true.

Improve Data Quality

Data profiling may reveal problems in data quality. For example, it might show invalid values are be entered for a particular column, such as entering 'Z' for gender when 'F' and 'M' are the valid values. Some steps that could be taken to improve data quality include:

Work with data owners to define the appropriate level of data quality. Build this into a data governance program.
Determine why there are data quality problems -- do a root cause analysis.
Correct the data in the source system through manual or automated efforts.
Add edits or database rules to prevent the problem.
Change business processes to enter correct data.
Make data quality visible to the business through scorecards, dashboards and reports.

Save Results for Further Reuse

The information gathered during the data source discovery process is valuable metadata that can be useful for future data warehousing or other projects. Be sure to save the results and make available for future efforts. This work can be a great step toward building an improved data resource.

Data and Analytics Tutorial

Data and Analytics Success

Requirements

Architecture and Design

Enterprise Information Management

Data Stores and Structures

Data Integration

BI and Data Visualization

Data Science

Test and Deploy