Infogoal Logo
GOAL DIRECTED LEARNING
Master DW

Data and Analytics Tutorial

Data and Analytics Overview
Under Construction

Data and Analytics Success

Data and Analytics Strategy
Project Management
Data Analytics Methodology
Quick Wins
Data Science Methodology

Requirements

BI Requirements Workshop

Architecture and Design

Architecture Patterns
Technical Architecture
Data Attributes
Data Modeling Basics
Dimensional Data Models

Enterprise Information Management

Data Governance
Metadata
Data Quality

Data Stores and Structures

Data Sources
Database Choices
Big Data
Atomic Warehouse
Dimensional Warehouse
Logical Data Warehouse
Data Lake
Operational Datastore (ODS)
Data Vault
Data Science Sandbox
Flat Files Data
Graph Databases
Time Series Data

Data Integration

Data Pipeline
Change Data Capture
Extract Transform Load
ETL Tool Selection
Data Warehoouse Automation
Data Wrangling
Data Science Workflow

BI and Data Visualization

BI - Business Intelligence
Data Viaulization

Data Science

Statistics
Descriptive Analytics
Predictive Analytics
Prescriptive Analytics

Test and Deploy

Testing
Security Architecture
Desaster Recovery
Rollout
Sustaining DW/BI

Data Sources for Data and Analytics

Learn How to Manage Data Sources for Input to Data Warehousing and Business Intelligence

A data and analytics effort is only as good as the data that is put into it. The saying "Garbage In, Garbage Out" is all too true. A leading cause of data warehousing and business intelligence project failures is to obtain the wrong or poor quality data.

This article will provide an efficient and effective workflow for obtaining the right source data and using it in the data warehousing and business intelligence project.

Managing data warehouse input sources includes a number of steps organized into two phases. In the first phase the following activities are undertaken:

When the major data sources have been identified it is time to quickly gain detailed understanding of each one:

Manage the Data and Analytics Data Source Identification Process

The source identification process is critical to the success of data warehousing and business intelligence projects. It is important to move through this effort quickly, obtaining enough information about the data sources without being bogged down in excess detail while still obtaining the needed information.

Start out with a list of the entities planned for the data warehouse / data mart. This can be managed with a spreadsheet containing these columns:

Data Warehousing Data Source Entity Plan

Complete the entity name, data mart role and subject area entries. Assign an analyst to each entity who will find data sources and subject matter experts for each entity.

Identify Data Warehousing Data Source Subject Matter Experts

Consider the following questions when determining the sources and costs of data for the Data Warehouse:

Identify Dimension Data Sources for the Data Mart

Dimensions enable business intelligence users to put information in context  They focus on questions of: who, when, where and what.  Typical dimensions include:

Master data is a complementary concept and may provide the best source of dimensional data for the data warehouse.  Master data is data shared between systems that describes entities like: product, customer and household.  Master data is managed using a Master Data Management (MDM) system and stored in an MDM-Hub.  Benefits of this approach include:

If an MDM-Hub does not exist consider creating one.  It will have many uses beyond supporting the data warehouse and business intelligence.

If no MDM-Hub is available, you will need to examine source systems and determine which system contains the data most suitable for dimensions.  If the data is not stored in a managed database, you may need to define the data locally, in a spreadsheet or desktop database, and then provide to the data warehousing system.

Identify Fact Data Sources for the Data Mart

The Fact contains quantitative measurements while the Dimension contains classification information.  The data sources for Fact tends to be transactional software systems.  For example:

System

Example Fact Data

Sales Order Entry
  • Sales Transaction
  • Return Transaction
Customer Service
  • Service Episode
  • Service Result
Accounts Payable
  • Payment Transaction
Sales Campaign
  • Sales Campaign Event

Larger enterprises may have multiple systems for the same kind data.   In that case, you will need to determine the best source of data - the System of Record (SOR) as the source of data warehousing data.

[PAGE_BREAK]

Detailed Data Source Understanding for Data Warehousing

When the major data sources have been identified it is time to quickly gain detailed understanding of each one. Consolidate the spreadsheet developed in the identification phase by data source, then create a new spreadsheet to track and control detailed understanding:

Data Source Plan P2

This approach provides an effective workflow as well as a project planning and control method. Due dates are assigned and actual complete dates and status are tracked.

Obtain Existing Documentation

When seeking to understand a data source, the first thing to do is look at existing documentation. This avoids "re-inventing the wheel". If a data source is fully documented, data profiled and of high quality most of the job of data source discovery is complete.

Existing documentation may include:

Check through the documentation to assess its completeness and usefulness.

The data source analyst should study the existing documentation before any in depth discussions with the SMEs. This improves the credibility of the data analyst and save time for the SMEs.

Model and Define the Input

The data model is a graphic representation of data structures that improves understanding and provides automation linking database design to physical implementation. This section assumes that the data source is stored in a relational database that modeled using typical relational data modeling tools.

If there is an existing data model, start with that, otherwise use the reverse engineering capability of the data modeling to build a physical data model. Next, group the tables that are of interest into a subject area for analysis. Unless, a large percentage of the data source is needed for the data warehouse avoid studying the entire data source. Stay focused on the current project.

For each selected data source table define:

For each selected data source column define:

Profile the Data Source

The actual use and behavior of data sources often tends not to match the name or definition of the data. Sometimes this is called "dirty data" or "unrefined data" that may have problems such as:

Data profile is an organized approach to examining data to better understand and later use it. This can be accomplished by querying the data using tools like:

For code values such as gender code and account status code do a listing showing value and count such as this gender code listing:

Code

Count

Notes

F

500

Female
M

510

Male
T

12

Transgender?
Z

5

???
NULL

1000

Missing

Other systems may represent female and male as 1 and 2 rather than F and T, and so may require standardization when stored in the data warehouse. When data from multiple sources is integrated in the data warehouse it is expected that it will be standardized and integrated.

Statistical measures are a good way to better understand numeric information such as revenue amounts. Helpful statistics are:

Consistency within a database is another important factor to determine through data profiling. For example, there may be an order table which should only have orders for customers established in the customer table. Perform queries to determine whether this is true.

Improve Data Quality

Data profiling may reveal problems in data quality.  For example, it might show invalid values are be entered for a particular column, such as entering 'Z' for gender when 'F' and 'M' are the valid values.  Some steps that could be taken to improve data quality include:

Save Results for Further Reuse

The information gathered during the data source discovery process is valuable metadata that can be useful for future data warehousing or other projects.  Be sure to save the results and make available for future efforts.  This work can be a great step toward building an improved data resource.

 

 


Advertisements

Advertisements:
 


Infogoal.com is organized to help you gain mastery.
Examples may be simplified to facilitate learning.
Content is reviewed for errors but is not warranted to be 100% correct.
In order to use this site, you must read and agree to the terms of use, privacy policy and cookie policy.
Copyright 2006-2020 by Infogoal, LLC. All Rights Reserved.

Infogoal Logo