August 24, 2020: What are small files? What problems do many small files cause, and how can they be solved? (Big Data)
2020-11-06 21:50:00 【Fuda Dajia architect's daily question】
Fogo's answer, 2020-08-24:
1. Small files:
Small files are files whose size is significantly smaller than the HDFS block size (64MB by default; 128MB in Hadoop 2.x).
2. The small file problem:
Small file problems on HDFS:
(1) Every file, directory, and data block in HDFS is represented as an object (metadata) in the NameNode's memory, and this is limited by the NameNode's physical memory capacity. Each metadata object occupies roughly 150 bytes, so with 10 million small files, each occupying one block, the NameNode needs about 2GB of memory. Storing 100 million files would require about 20GB, so there is no doubt that keeping 100 million small files is not advisable.
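The memory pressure is easy to estimate with back-of-the-envelope arithmetic. This is a rough sketch, not a measurement of a real cluster: the ~150-byte figure comes from the text above, and the assumption of one file object plus one block object per single-block file is a common simplification.

```python
# Rough NameNode heap estimate for metadata of single-block small files.
# Assumptions (from the article / common guidance, not measured):
#   ~150 bytes per metadata object, 2 objects per file (file + block).
BYTES_PER_OBJECT = 150

def namenode_memory_gb(num_files, objects_per_file=2):
    """Estimate NameNode heap (GB) consumed by metadata."""
    return num_files * objects_per_file * BYTES_PER_OBJECT / 1024**3

print(f"{namenode_memory_gb(10_000_000):.1f} GB")   # 10 million small files
print(f"{namenode_memory_gb(100_000_000):.1f} GB")  # 100 million small files
```

Under these assumptions 10 million files already consume a few GB of NameNode heap, which matches the article's order-of-magnitude figure; the exact constant varies by Hadoop version and metadata layout.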
(2) Processing small files is not what Hadoop was designed for; HDFS is designed for streaming access to large data sets (TB scale). Consequently, storing a large number of small files in HDFS is inefficient: accessing many small files causes a large number of seeks and constant hopping from DataNode to DataNode to retrieve each file. This is not an effective access pattern and seriously hurts performance.
(3) Processing a large number of small files is much slower than processing a large file of the same total size. Each small file occupies a slot, and task startup is expensive; much of the time, even most of it, can be spent starting and releasing tasks.
Small file problems in MapReduce:
A Map task typically processes one block of input at a time (one input split). If the files are very small and there are many of them, each Map task processes only a tiny amount of input, a large number of Map tasks are produced, and every Map task adds bookkeeping overhead.
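The task-count blow-up can be seen with simple arithmetic. The sketch below uses a deliberately simplified model (one map task per block, at least one task per file, and no CombineFileInputFormat-style split combining), with the Hadoop 2.x default 128MB block size:

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # Hadoop 2.x default block size (bytes)

def map_task_count(file_sizes):
    """Simplified model: one map task per block, and even a tiny file
    still costs a whole map task of its own."""
    return sum(max(1, math.ceil(size / BLOCK_SIZE)) for size in file_sizes)

# Same total data (~1000 MB), two layouts:
one_big = map_task_count([10_000 * 100 * 1024])      # one ~1000 MB file
many_small = map_task_count([100 * 1024] * 10_000)   # 10,000 files of 100 KB
print(one_big, many_small)  # 8 vs 10000 map tasks
```

The same volume of data launches three orders of magnitude more map tasks when split into 100KB files, and each task carries its own startup and bookkeeping cost.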
Why there are so many small files
In at least two scenarios, a large number of small files will be generated :
(1) The small files are all pieces of one large logical file. Because HDFS has only supported appending to files since version 2.x, a common way to save unbounded files (such as log files) before then was to write the data to HDFS in chunks.
(2) The files are inherently small. For example, in a large image corpus each image is a separate file, and there is no natural way to combine them into one larger file.
Solution
These two situations call for different solutions:
(1) For the first case, where one logical file is made up of many records, you can call HDFS's sync() method (used in combination with append) to periodically roll the data up into a large file. Alternatively, you can write a MapReduce program to merge the small files.
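The merge idea can be illustrated with a minimal local sketch. To keep it self-contained it concatenates ordinary local files; a real deployment would do the same thing through an HDFS client or a MapReduce job, and all file names here are made up for illustration.

```python
# Local illustration of "merge many small files into one large file".
# In production this logic would read/write HDFS paths instead.
from pathlib import Path
import tempfile

def merge_small_files(src_dir: Path, dest: Path) -> int:
    """Concatenate every file in src_dir into dest; return the count merged."""
    count = 0
    with dest.open("wb") as out:
        for f in sorted(src_dir.iterdir()):
            if f.is_file() and f != dest:  # skip the output file itself
                out.write(f.read_bytes())
                count += 1
    return count

# Demo with throwaway temp files (hypothetical "part-NNNN.log" chunks).
tmp = Path(tempfile.mkdtemp())
for i in range(5):
    (tmp / f"part-{i:04d}.log").write_bytes(b"x" * 100)
merged = tmp / "merged.log"
n = merge_small_files(tmp, merged)
print(n, merged.stat().st_size)  # 5 files merged into 500 bytes
```

Merging preserves the data while replacing many metadata objects with one, which is exactly what relieves the NameNode.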
(2) For the second case, you need some kind of container to group the files. Hadoop offers several options:
① Use HAR files. Hadoop Archives (HAR files) were introduced into HDFS in version 0.18.0 to relieve the pressure that large numbers of small files put on NameNode memory. HAR files work by building a layered file system on top of HDFS. A HAR file is created with the hadoop archive command, which actually runs a MapReduce job to pack the small files into a small number of HDFS files. For clients, nothing changes: all of the original files remain visible and accessible (just through har:// URLs rather than hdfs:// URLs), while the number of files stored in HDFS is reduced.
② Use SequenceFile storage, with the file name as the key and the file contents as the value. This works very well in practice. For example, for 10,000 files of 100KB each, you can write a program that merges them into a single SequenceFile, which can then be processed in a streaming fashion (directly or with MapReduce).
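To make the key/value container idea concrete, here is a toy, self-contained imitation of it. The real format is org.apache.hadoop.io.SequenceFile with its own on-disk layout, sync markers, and compression; this sketch only demonstrates the concept of packing (file name → file contents) records into one file.

```python
# Toy record container: length-prefixed (key, value) pairs in one blob.
# NOT the real SequenceFile format; illustration of the packing concept only.
import struct, io

def pack(records: dict) -> bytes:
    """Serialize {file_name: file_bytes} into one length-prefixed blob."""
    buf = io.BytesIO()
    for name, data in records.items():
        key = name.encode()
        buf.write(struct.pack(">II", len(key), len(data)))  # key/value lengths
        buf.write(key)
        buf.write(data)
    return buf.getvalue()

def unpack(blob: bytes) -> dict:
    """Stream the blob back into {file_name: file_bytes}."""
    out, off = {}, 0
    while off < len(blob):
        klen, vlen = struct.unpack_from(">II", blob, off)
        off += 8
        out[blob[off:off + klen].decode()] = blob[off + klen:off + klen + vlen]
        off += klen + vlen
    return out

# Hypothetical tiny "image" files for the demo.
files = {f"img_{i}.png": b"\x89PNG" + bytes([i]) for i in range(3)}
blob = pack(files)
print(unpack(blob) == files)  # round-trips: True
```

Ten thousand small files become one large container file, so the NameNode tracks one object while readers can still stream through every record.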
③ Use HBase. If you produce a lot of small files then, depending on the access pattern, a different kind of storage may be more appropriate. HBase stores data in MapFiles (indexed SequenceFiles) and is a good choice if you need random access as well as MapReduce-style streaming analysis.
Copyright notice
This article was created by [Fuda Dajia architect's daily question]. Please include a link to the original when reposting. Thank you.