当前位置：网站首页>Data Lake: evolution of data Lake Technology Architecture

Data Lake: evolution of data Lake Technology Architecture

2022-07-22 16:56:00 【YoungerChina】

1. background

Large domestic Internet companies , Dozens of... Are generated every day 、 A few hundred TB, Even a few PB Raw data . These companies usually use open source big data components to build big data platforms . Big data platform experience “ With Hadoop Offline data platform represented by ”、“Lambda Architecture platform ”、“Kappa Architecture platform ” Three stages .

Data lake can be regarded as the latest generation of big data technology platform , In order to better understand the basic architecture of the data Lake , Let's first look at the evolution of big data platforms , So as to understand why we should learn data Lake Technology .

2. Offline big data platform - first generation

The first stage ： With Hadoop Offline data processing components represented by .Hadoop In order to HDFS For core storage , With MapReduce It is the basic component of batch data processing of basic calculation model . around HDFS and MR, In order to continuously improve the data processing capacity of the big data platform , A series of big data components have been born , For example, real-time oriented KV Operation of the HBase、 oriented SQL Of Hive、 Workflow oriented Pig etc. . meanwhile , With the increasing demand for batch processing performance , New computational models have been proposed , Produced Tez、Spark、Presto Wait for the computing engine ,MR The model gradually evolved into DAG Model .

In order to reduce the intermediate result of data processing, write file operation ,Spark、Presto And other computing engines try to use the memory of the computing node to cache the data , So as to improve the efficiency of the whole data process and system throughput .

3. Lambda framework

3.1 Architecture introduction

With the continuous change of data processing capacity and processing requirements , More and more users are finding , Batch mode improves performance anyway , Nor can it meet the processing scenarios with high real-time requirements , Streaming computing engine came into being , for example Storm、Spark Streaming、Flink etc. .

However , With more and more applications online , We found , In fact, batch processing and stream computing are used together , To meet most application needs , Scenes with high real-time requirements , Will use Flink+Kafka Build a real-time stream processing platform , To meet the real-time needs of users . therefore Lambda Architecture proposed , As shown in the figure below .

Lambda The core idea of architecture is “ Flow batch separation ”, As shown in the figure above , The entire data flow flows from left to right into the platform . After entering the platform, one is divided into two , Some of them go to batch processing mode , Part of the flow calculation mode . No matter which calculation mode , The final processing results are provided to the application through the service layer , Ensure consistency of access .

This data architecture contains many big data components , It greatly increases the complexity and maintenance cost of the overall architecture .

3.2 Pain points

After years of development ,Lambda The architecture is relatively stable , It can meet the past application scenarios . But it has many fatal weaknesses ：

1、 High data governance costs

The real-time computing process cannot reuse the data kinship of the offline warehouse 、 Data quality management system . We need to re implement a set of data kinship for real-time computing 、 Data quality management system .

2、 The cost of development and maintenance is high

It is necessary to maintain both offline and real-time data warehouse systems , The same set of calculation logic needs to store two copies of data . for example , Update of one or more original data , You need to run the offline data warehouse again , The cost of data updating is very large .

3、 The data caliber is inconsistent

Because real-time computing and offline computing are completely different , Because the delayed arrival of data is different from the running time of the two types of code , Result in inconsistent calculation results .

So is there an architecture that can solve Lambda The problem of architecture ？

4. Kappa framework

Lambda Architecturally “ Flow batch separation ” Processing links increase the complexity of research and development . therefore , Some people ask whether we can use a system to solve all problems . At present, the more popular method is based on flow computing . So let's talk about that Kappa framework , adopt Flink+Kafka Connect the entire link in series .Kappa Architecture solves Lambda The computing engine in the architecture is inconsistent between the offline processing layer and the real-time processing layer , Development 、 The operation and maintenance cost is high , The calculation results are inconsistent .

4.1 Architecture introduction

Kappa The architecture scheme is also called “ Batch flow integration ” programme . We borrow Flink+Kafka To build a flow batch integration scenario , When need is right ODS For further analysis of layer data , take Flink The result of the calculation DWD Layer write to Kafka, It will also be part of DWS Calculation results of layer Kafka.Kappa The architecture is not perfect , It also has many pain points .

4.2 Pain points

1、 Weak data backtracking ability

Kafka Weak support for complex demand analysis , In the face of more complex data analysis , And we will DWD and DWS The data of the layer is written to ClickHouse、ES、MySQL Or is it Hive For further analysis , This undoubtedly brings the complexity of the link . The bigger problem is when doing data backtracking , Due to the complexity of the link, the data backtracking ability is very weak .

2、OLAP Weak analytical ability

because Kafka It is a sequential storage system , There is no way to implement sequential storage system directly OLAP Analytical , For example, predicate pushdown is an optimization strategy , On the sequential storage platform （Kafka） It is difficult to implement on .

3、 Data timing is challenged

Kappa The architecture is heavily dependent on message queuing , We know that the accuracy of the message queue itself depends strictly on the order of its upstream data , however , The more data layers of message queues , The greater the likelihood of disorder . Usually ,ODS The data of layer is absolutely accurate , hold ODS After calculation, the layer data is written to DWD There will be disorder when the layer is ,DWD To DWS More likely to cause disorder , Such data inconsistency is a big problem .

5. Summary of big data architecture pain points

From traditional hadoop The architecture goes to Lambda framework , from Lambda The architecture goes to Kappa Evolution of Architecture , The evolution of big data platform infrastructure gradually includes all kinds of data processing capabilities required by applications , But these platforms still have many pain points .

Is there a storage technology , It can not only support data efficient backtracking ability , Support data update , It can also read and write data in batches , And it can also realize data access from minute level to second level ？

6. Real time data warehouse construction needs

This is also the urgent need of real-time data warehouse construction . In fact, it can be done through Kappa Architecture upgrade , To solve the problem Kapp

a Some problems encountered in the architecture , Next, we will mainly share the current hot data Lake technology .

So is there such an architecture , It can meet the needs of real-time , It can also meet the requirements of offline computing , And it can also reduce the cost of development and operation , Solve the problem of Kappa Pain points encountered in architecture ？ The answer is yes .

7. Data Lake Architecture

7.1 Data Lake concept

The data lake is a centralized repository , It can store structured and unstructured data . It can be stored as business data （ No need to structure the data first ）, And run different types of analysis – From control panel and visualization to big data processing 、 Real time analysis and machine learning , To guide better decisions .

1. Store original data

The data Lake needs enough storage capacity , It can store all the data of the company .
Data lakes can store all kinds of data , Include structured 、 Semi structured （XML、Json etc. ） And unstructured data （ picture 、 video 、 Audio ）.
The data in the data lake is a complete copy of the original business data , These data maintain their original appearance in the business system .

2. Flexible underlying storage capabilities

In actual use , The data in the data lake is usually not accessed at high frequency , In order to achieve acceptable cost performance , Data Lake construction usually selects cost-effective storage engines （ Such as S3/OSS/HDFS）.

Provide large-scale storage for big data , And scalable large-scale data processing capabilities .
May adopt S3/HDFS/OSS Equally distributed storage platform as storage engine .
Support Parquet、Avro、ORC And other data structure formats .
It can provide data cache acceleration function .

3. Rich computing engine

From batch calculation of data 、 Flow computation , Interactive query analysis to machine learning , All kinds of computing engines belong to the scope of data Lake . With the combination of big data and artificial intelligence technology , All kinds of machine learning / Deep learning algorithms have also been introduced , for example TensorFlow/PyTorch The framework already supports from HDFS/S3/OSS Read the sample data on the for machine learning training . therefore , For a qualified data Lake project , Pluggability of computing storage engine , It is the basic ability that the data Lake must have .

4. Perfect data management

Data Lake needs to have perfect metadata management ability ： Including data sources 、 data format 、 Connection information 、 data schema、 Authority management and other capabilities .
Data Lake needs to have perfect data lifecycle management capability . It can not only store raw data , It also needs to be able to save the intermediate result data of various analysis and processing , And complete records of data analysis and processing process , Help users trace the generation process of any piece of data completely .
Data Lake needs to have perfect data acquisition and data publishing capabilities . Data Lake needs to be able to support a variety of data sources , And can obtain the full amount from the relevant data source / Incremental data ; And then normalize the storage . Data lake can push data to the appropriate storage engine , To meet different application access needs .

7.2 Data Lake Architecture

LakeHouse Architecture has become the hottest trend of architecture evolution , Direct access to stored data management systems , It combines the main advantages of data warehouse .LakeHouse It is built based on the architecture of separation of storage and accounting . The biggest problem with the separation of storage and computing is the network , Especially for data warehouse with high-frequency access , Network performance is critical . Realization Lakehouse There are many options , such as Delta,Hudi,Iceberg. Although the emphasis of the three is different , But they all have the general function of data lake , such as ： Unified metadata management 、 Support a variety of calculation and analysis engines 、 Support the separation of high-order analysis and computing storage .

So what is the general open source data Lake architecture like ？ Here I draw an architecture diagram , There are four main levels ：

1 distributed file system

The first layer is distributed file system , For users who choose cloud technology , Usually choose S3 And Alibaba cloud storage data ; Users who like open source technology generally use their own HDFS Store the data .

2 Data acceleration layer

The second layer is the data acceleration layer . Data Lake architecture is a typical storage computing separation architecture , The performance loss of remote reading and writing is very large . Our common practice is , Put frequently accessed data （ Hot data ） Cache locally on compute node , So as to realize the hot and cold separation of data . The advantage of this is , Improve data reading and writing performance , Save network bandwidth . We can choose open source Alluxio, Or Alibaba cloud Jindofs.

3 Table format layer

The third layer is Table format layer , Encapsulate data files into tables with business implications , The data itself provides ACID、snapshot、schema、partition Semantics of equal table level . This layer can choose open source data Lake three swordsmen Delta、Iceberg、Hudi One of .Delta、Iceberg、Hudi It is a technology of building data Lake , They are not data lakes in themselves .

4 Calculation engine

The fourth layer is various data computing engines . Include Spark、Flink、Hive、Presto etc. , These computing engines can access the same table in the data Lake .