
Data Lake Architecture

What is a Data Lake?

The Data Lake is a centralized, organized and secure repository where "Big Data" is ingested, stored and analyzed, and where the many Vs of Big Data are supported. Data is stored in a large-scale file system such as HDFS or in cloud blob storage. Volume: a huge amount of data, into the petabytes and beyond, can be stored. Variety: structured, semi-structured and unstructured data formats are all supported. Velocity: data arriving at a rapid pace is absorbed by the Data Lake. Veracity: data issues may need to be removed to provide a high degree of data quality. Value: maximizing the value returned by the Data Lake is the bottom line.

Leading data storage platforms for the Data Lake include:

  • Apache HDFS: The open source Hadoop Distributed File System is the original and vendor-independent storage platform for the Data Lake. It is available on premises as well as on the leading cloud platforms: Azure, AWS and GCP. HDFS data is distributed across multiple nodes, which enables high performance and capacity.
  • Microsoft Azure Cloud: Multiple data storage variations are supported: HDFS, Azure Data Lake Storage Generation 1 (ADLS Gen1), Azure Data Lake Storage Generation 2 (ADLS Gen2) and blob storage.
  • Amazon Web Services (AWS): AWS S3 (Simple Storage Service) is a platform for Data Lake storage.
  • Google Cloud Platform (GCP): Google Cloud Storage is a platform for Data Lake storage.

Data Lake Users and Benefits

The two largest users of the Data Lake are data scientists and downstream systems. Data scientists access the Data Lake as part of analytics and may extract data for use in Data Science Sandboxes. Downstream systems such as the Atomic Warehouse receive data from the Data Lake for specific purposes. Organizations that implemented Data Lakes outperformed similar companies by nearly 9%, according to an Aberdeen survey.

These benefits and capabilities may be realized through the use of the Data Lake:

  • Accessibility: data can be accessed using the SQL standard database language, which is universally supported.
  • Availability: data in many formats is available - multiple content types.
  • Flexibility: supports many types of analytics - no need to have all of the answers up front.
  • Economy: large volumes of data can be stored at a low price compared to storage in a database.
  • Scalability: can grow to support business needs.
  • Timeliness: data can be loaded in real time or near real time - making data immediately ready for use.
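To illustrate the accessibility benefit: SQL engines such as Hive or Impala run SQL over files stored in the Data Lake. The sketch below uses Python's built-in SQLite as a stand-in query engine to show the style of analysis an analyst would run; the table and column names are made up for the example.

```python
# Sketch of the SQL accessibility benefit. A real Data Lake would use an
# engine like Hive or Impala over HDFS files; an in-memory SQLite database
# stands in here. Table and column names are hypothetical.
import csv
import io
import sqlite3

# Pretend this CSV file was ingested into the Data Lake.
csv_text = "region,sales\neast,100\nwest,250\neast,175\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, sales REAL)")
reader = csv.DictReader(io.StringIO(csv_text))
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(row["region"], float(row["sales"])) for row in reader],
)

# A standard SQL aggregate - the same query shape works in Hive or Impala.
result = conn.execute(
    "SELECT region, SUM(sales) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(result)  # [('east', 275.0), ('west', 250.0)]
```

Because the interface is standard SQL, the same analyst skills and BI tools apply whether the data sits in a database or in the Data Lake.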

Data Lake Glossary

These are the top Data Lake terms that you need to know:

    Apache Avro: an open source row-oriented, highly compressed data format in HDFS. It includes the schema definition.
    Apache HAWQ: a Hadoop-native SQL query engine which provides high-performance queries on data stored in HDFS. HAWQ is relatively new (incubator status) and has better performance than Hive.
    Apache HDFS: the Hadoop Distributed File System.
    Apache Hive: an early Hadoop-native SQL-like query engine. Hive is batch oriented.
    Apache Impala: a Hadoop-native SQL query engine for analytics which provides improved performance over Hive. It was released in 2013.
    Apache Kafka: the leading stream-processing software platform. It is used to transport data to the Data Lake for ingestion. Donated by LinkedIn to Apache and commercially extended by Confluent.
    Apache ORC: an open source column-oriented HDFS data format, similar to Parquet.
    Apache Parquet: an open source column-oriented HDFS data format, similar to ORC.
    Data Lake: a centralized, organized and secure repository where large volumes of data are ingested, stored and analyzed.
    Data Swamp: a Data Lake gone wrong.
    MPP DBMS: a DataBase Management System with Massively Parallel Processing. MPP enables high-performance, scalable processing by spreading work across multiple processing nodes.
    SQL: Structured Query Language, an ANSI-standard computer language commonly used to access data stored in databases.
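The glossary contrasts row-oriented formats (Avro) with column-oriented formats (Parquet, ORC). The core idea can be sketched in pure Python - this is an illustration of the two layouts, not the actual file formats:

```python
# Sketch: row-oriented vs column-oriented storage layouts - the idea behind
# Avro (row) and Parquet/ORC (column). Pure-Python illustration only; the
# real formats add schemas, compression and encoding on top of this idea.

rows = [
    {"id": 1, "region": "east", "sales": 100.0},
    {"id": 2, "region": "west", "sales": 250.0},
    {"id": 3, "region": "east", "sales": 175.0},
]

# Row-oriented: each record is stored together (efficient for writing and
# reading whole records, as in Avro).
row_store = [tuple(r.values()) for r in rows]

# Column-oriented: each field is stored together (efficient for analytic
# scans of a single column, as in Parquet or ORC).
col_store = {field: [r[field] for r in rows] for field in rows[0]}

# An analytic query such as SUM(sales) touches only one column
# in the columnar layout.
total_sales = sum(col_store["sales"])
print(total_sales)  # 525.0
```

This is why column-oriented formats dominate analytics workloads on the Data Lake: a query over one or two columns avoids reading the rest of each record.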

Data Lake Data Flow

[Figure: Data Lake data flow diagram]

Data Lake Internal Structure

The Data Lake datastore contains massive volumes of data stored in raw format, matching its data sources. The Data Lake typically stores data in blobs or flat files organized into folders. Storage of unstructured data such as images and documents is a key use case for the Data Lake. The Data Lake may use a file system such as the Hadoop Distributed File System (HDFS), which stores multiple copies of data to enable rapid, parallel retrieval.

The Data Lake tends to be composed of the following zones or similar zones with different names:

  • Ingest: zone where data is input to the datastore. It is best practice to limit Data Lake inputs to the Ingest Zone. Inputs will vary: data may arrive in batches or in near real-time streams, and it may be structured, semi-structured or unstructured.
  • Core: zone where subject content is stored. This is the central focus of the datastore. Core data may be exposed directly to Data Scientists for analytics purposes - a more flexible approach than, for instance, the Atomic Warehouse. It is often organized into folders and topics, with timestamps added to file names.
  • Expose: zone where data is made available outside of the datastore.
  • Process: zone where datastore processes are tracked and controlled. It is best practice to use the same Process Zone schema across datastores. In the case of the Data Lake, the Process Zone may be stored in a separate relational database.
  • Archive: zone where historical data is stored in raw, immutable form. Data is kept in its Ingest form and is not altered, which makes for an effective audit trail.
  • Metadata: zone where data describing datastore content and structure is stored. Glossaries and data lineage are examples of data managed here. It is best practice to use the same Metadata Zone schema across datastores, or to share a metadata repository across the enterprise.
  • Notify: zone where logs of events are stored. It is best practice to use the same Notify Zone schema across datastores. Notifications may be sent to a centralized notification system or database.
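The zone layout above can be mocked up locally as a folder structure. The sketch below uses a temporary directory in place of HDFS or cloud blob storage; the zone names follow the article, while the topic folder and timestamped file-naming convention are illustrative assumptions.

```python
# Sketch: a minimal local mock-up of the Data Lake zones described above.
# A temp directory stands in for HDFS/blob storage; file naming is an
# assumed convention (topic folder plus UTC timestamp in the file name).
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path

lake = Path(tempfile.mkdtemp())
for zone in ["ingest", "core", "expose", "process", "archive", "metadata", "notify"]:
    (lake / zone).mkdir()

# 1. A source system drops a raw file into the Ingest zone.
raw = lake / "ingest" / "orders.csv"
raw.write_text("id,amount\n1,100\n2,250\n")

# 2. Keep a raw, immutable copy in Archive (the audit trail), then promote
#    the file to a topic folder in Core with a timestamp in its name.
stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
topic = lake / "core" / "orders"
topic.mkdir()
shutil.copy(raw, lake / "archive" / raw.name)   # raw immutable audit copy
raw.rename(topic / f"orders_{stamp}.csv")       # timestamped core file

print(sorted(p.name for p in topic.iterdir()))
```

In a real Data Lake the same movement happens between folders or buckets on HDFS, S3, ADLS or Google Cloud Storage, driven by ingestion pipelines rather than by hand.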

[Figure: Data Lake internal structure - Level 3 diagram]



Infogoal.com is organized to help you gain mastery.
Examples may be simplified to facilitate learning.
Content is reviewed for errors but is not warranted to be 100% correct.
In order to use this site, you must read and agree to the terms of use, privacy policy and cookie policy.
Copyright 2006-2020 by Infogoal, LLC. All Rights Reserved.
