Data Lake Architecture

What is a Data Lake?

The Data Lake is a centralized, organized and secure repository where "Big Data" is ingested, stored and analyzed. It supports the "Vs" of Big Data. Volume: data reaching into the petabytes and beyond is stored in a large-scale file system such as HDFS or in blob storage. Variety: structured, semi-structured and unstructured data formats are all supported. Velocity: data arriving at a rapid pace is absorbed by the Data Lake. Veracity: data issues may need to be removed to make the data trustworthy. Value: maximizing the value returned by the Data Lake is the bottom line.

Leading data storage platforms for the Data Lake include:

Apache HDFS: The open source Hadoop Distributed File System is the original, vendor-independent storage platform for the Data Lake. It is available on premises as well as on the leading cloud platforms: Azure, AWS and GCP. HDFS data is distributed across multiple nodes, which enables high performance and capacity.
Microsoft Azure Cloud: Multiple data storage variations are supported: HDFS, Azure Data Lake Storage Gen1 (ADLS Gen1), Azure Data Lake Storage Gen2 (ADLS Gen2) and Blob Storage.
Amazon Web Services (AWS): Amazon S3 (Simple Storage Service) is a platform for Data Lake storage.
Google Cloud Platform (GCP): Google Cloud Storage is a platform for Data Lake storage.

Data Lake Users and Benefits

The two largest users of the Data Lake are Data Scientists and downstream systems. Data Scientists access the Data Lake as part of analytics and may extract data for use in Data Science Sandboxes. Downstream systems such as the Atomic Warehouse receive data from the Data Lake for specific purposes. According to an Aberdeen survey, organizations that implemented Data Lakes outperformed similar companies by nearly 9%.

These benefits and capabilities may be realized through the use of the Data Lake:

Accessibility: data is accessed using SQL, the standard and widely supported database language.
Availability: data is available in many formats and content types.
Flexibility: supports many types of analytics, so there is no need to have all of the answers up front.
Economy: large volumes of data can be stored at a low price compared to storage in a database.
Scalability: can grow to support business needs.
Timeliness: data can be loaded in real-time or near real-time - making data immediately ready for use.
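
The Flexibility benefit above can be illustrated with a minimal Python sketch. It assumes raw events are kept as one JSON document per line; the event fields (`user`, `action`, `ms`, `amount`) and the function names are hypothetical and chosen only for illustration. The raw data is stored once, and each analysis decides which fields it needs at read time.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (one JSON document per line).
RAW_EVENTS = """\
{"user": "u1", "action": "view", "ms": 120}
{"user": "u2", "action": "buy", "ms": 340, "amount": 19.5}
{"user": "u1", "action": "buy", "ms": 95, "amount": 5.0}
"""


def read_events(raw_text):
    """Parse the raw lines; no schema is enforced when the data is loaded."""
    for line in raw_text.splitlines():
        if line.strip():
            yield json.loads(line)


def revenue_by_user(events):
    """One analysis: total purchase amount per user."""
    totals = {}
    for event in events:
        if event.get("action") == "buy":
            user = event["user"]
            totals[user] = totals.get(user, 0.0) + event.get("amount", 0.0)
    return totals


def mean_latency_ms(events):
    """A later, unrelated analysis over the same raw data."""
    values = [event["ms"] for event in events if "ms" in event]
    return sum(values) / len(values)


print(revenue_by_user(read_events(RAW_EVENTS)))
print(mean_latency_ms(read_events(RAW_EVENTS)))
```

Neither analysis required the raw data to be reshaped in advance, which is the point of not needing all of the answers up front.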

Data Lake Glossary

These are the top Data Lake terms that you need to know:

Term	Definition
Apache Avro	An open source, row-oriented, compact binary data format commonly used in HDFS. Each file includes its schema definition.
Apache HAWQ	Hadoop native SQL query engine which provides high performance queries on data stored in HDFS. HAWQ is relatively new (incubator status) and has better performance than Hive.
Apache HDFS	Hadoop Distributed File System.
Apache Hive	An early Hadoop native SQL-like query engine. Hive is batch oriented.
Apache Impala	Hadoop native SQL query engine for analytics which provides improved performance over Hive. It was released in 2013.
Apache Kafka	Leading stream-processing software platform. It is used to transport data to the Data Lake for ingestion. Donated by LinkedIn to Apache. Commercially extended by Confluent.
Apache ORC	An open source, column-oriented HDFS data format, similar to Parquet.
Apache Parquet	An open source, column-oriented HDFS data format, similar to ORC.
Data Lake	A centralized, organized and secure repository where large volumes of data are ingested, stored and analyzed.
Data Swamp	A data lake gone wrong.
MPP DBMS	A DataBase Management System with Massively Parallel Processing. MPP enables high performance, scalable processing by spreading work across multiple processing nodes.
SQL	Structured Query Language is an ANSI standard computer language commonly used to access data stored in databases.
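
The difference between the row-oriented and column-oriented formats in the glossary can be sketched in a few lines of Python. This is a conceptual illustration only: it uses plain lists and dictionaries, not the actual Avro, ORC or Parquet file layouts, and the sample records and function names are hypothetical.

```python
# Hypothetical sample records used only for illustration.
records = [
    {"id": 1, "region": "east", "sales": 100},
    {"id": 2, "region": "west", "sales": 250},
    {"id": 3, "region": "east", "sales": 75},
]


def to_row_layout(recs):
    """Row-oriented: all values of one record are stored together."""
    return [tuple(rec.values()) for rec in recs]


def to_column_layout(recs):
    """Column-oriented: all values of one field are stored together."""
    return {key: [rec[key] for rec in recs] for key in recs[0]}


rows = to_row_layout(records)
cols = to_column_layout(records)

# Summing one field from the row layout walks through every full record.
total_from_rows = sum(row[2] for row in rows)

# The column layout only needs the single "sales" column.
total_from_cols = sum(cols["sales"])

print(rows)
print(cols)
print(total_from_rows, total_from_cols)
```

Analytic queries that aggregate a few fields over many records tend to favor the column layout, while writing or reading whole records one at a time tends to favor the row layout.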

Data Lake Data Flow

Data Lake Flow - Boxes

Data Lake Internal Structure

The Data Lake datastore contains massive volumes of data stored in raw format, matching its data sources. The Data Lake typically stores data in blobs or flat files organized into folders. Storing unstructured data such as images and documents is a common use case for the Data Lake. The Data Lake may use a file system such as the Hadoop Distributed File System (HDFS), which stores multiple copies of data to enable rapid, parallel retrieval.

The Data Lake tends to be composed of the following zones or similar zones with different names:

Ingest: zone where data is input to the datastore. It is best practice to limit Data Lake inputs to the Ingest Zone. Data Lake inputs will vary. Data may arrive in batches or in near real-time streams. Also, data may be structured, semi-structured or unstructured.
Core: zone where subject content is stored. This is the central focus of the datastore. Core data may be exposed directly to Data Scientists for analytics, which is a more flexible approach than that of the Atomic Warehouse, for instance. It is often organized into folders and topics, with timestamps added to file names.
Expose: zone where data is made available outside of the datastore.
Process: zone where datastore processes are tracked and controlled. It is best practice to use the same Process Zone schema across datastores. In the case of the Data Lake, the Process Zone may be stored in a separate relational database.
Archive: zone where history data is stored in Raw Immutable form. This means that data is stored in the Ingest form and is not altered, which makes for an effective audit trail.
Metadata: zone where data describing datastore content and structure are stored. Glossaries and Data Lineage are examples of data managed here. It is best practice to use the same Metadata Zone schema across datastores or to share a Metadata repository across the enterprise.
Notify: zone where logs of events are stored. It is best practice to use the same Notify Zone schema across datastores. Notifications may be sent to a centralized Notifications System or Database.
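
The file-based zones described above can be sketched as a folder layout. The following Python example is a minimal sketch under stated assumptions: the folder names, the `orders` topic, the timestamp format and the functions `create_lake`, `ingest` and `promote` are all hypothetical, and the Process, Metadata and Notify zones are left out because they may live in separate databases. It shows data entering only through the Ingest Zone, an unaltered copy kept in the Archive Zone, and a timestamped file placed in the Core Zone.

```python
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical folder names for the file-based zones in this sketch.
ZONES = ("ingest", "core", "expose", "archive")


def create_lake(root):
    """Create one folder per zone under the lake root."""
    for zone in ZONES:
        (root / zone).mkdir(parents=True, exist_ok=True)
    return root


def ingest(root, topic, filename, payload):
    """Land incoming bytes in the Ingest Zone, the only entry point."""
    target = root / "ingest" / topic
    target.mkdir(parents=True, exist_ok=True)
    path = target / filename
    path.write_bytes(payload)
    return path


def promote(root, topic, ingested, when):
    """Copy an ingested file, unaltered, to Archive and Core with a timestamped name."""
    stamp = when.strftime("%Y%m%dT%H%M%SZ")
    stamped_name = f"{ingested.stem}_{stamp}{ingested.suffix}"
    archive_path = root / "archive" / topic / stamped_name
    core_path = root / "core" / topic / stamped_name
    for path in (archive_path, core_path):
        path.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(ingested, archive_path)  # raw, immutable audit copy
    shutil.copyfile(ingested, core_path)     # a real pipeline might transform here
    ingested.unlink()                        # the Ingest Zone is only a landing area
    return archive_path, core_path


lake = create_lake(Path(tempfile.mkdtemp()) / "lake")
raw = ingest(lake, "orders", "orders.csv", b"id,total\n1,9.99\n")
loaded_at = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
archived, core = promote(lake, "orders", raw, loaded_at)
print(archived.relative_to(lake))
print(core.relative_to(lake))
```

Copying the file into the Core Zone unchanged is a simplification; the Archive copy is the one that must stay identical to what was ingested.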

Data Lake - Level 3

