Big Data Development Fundamentals and Project Practice: Hadoop Core and Ecosystem Technology Stack, Part 2: HDFS Distributed File System


This article introduces the HDFS distributed file system: its features, operating HDFS from the command line and the API, the HDFS read and write mechanisms, HDFS metadata management, Hadoop quotas, archiving, cluster safe mode, and a log collection case.

1.HDFS features

HDFS (Hadoop Distributed File System) is a core component of Hadoop and provides its distributed storage service.

A distributed file system spans multiple computers and provides the scalability needed to store and process large-scale data, which makes it widely applicable in the era of big data.

HDFS is one such distributed file system.

HDFS locates files through a unified namespace organized as a directory tree. It is also distributed: many servers work together to provide its functions, and each server in the cluster has its own role (the essence of a distributed system is to split the work so that each node performs its own duties).

Common features of HDFS are as follows:

  • Typical Master/Slave architecture

The architecture of HDFS is a typical Master/Slave structure.

NameNode is the master node of the cluster, and DataNode is the slave node of the cluster. They work together to complete the distributed data storage task.

An HDFS cluster usually consists of one NameNode plus multiple DataNodes. The exceptions are the HA architecture, which has two NameNodes, and the Federation mechanism, which has multiple NameNodes.

  • Block storage (block mechanism)

Files in HDFS are physically stored in blocks, and the block size can be set through a configuration parameter.
In Hadoop 2.x, the default block size is 128 MB. When a file is larger than 128 MB, HDFS splits it into blocks automatically, which is transparent to the user.
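
The splitting described above is simple arithmetic. The following is a minimal sketch, not Hadoop code: the class and method names are hypothetical, and the block size is hard-coded to the 128 MB default mentioned above. It shows how many blocks a file of a given length occupies and how long the last block is.

```java
// Hypothetical helper for illustration only; not part of the Hadoop API.
class BlockMath {
    // 128 MB, the Hadoop 2.x default block size mentioned in the text
    static final long BLOCK_SIZE = 128L * 1024 * 1024;

    // Number of blocks needed for a file of fileLength bytes (ceiling division).
    static long blockCount(long fileLength) {
        return (fileLength + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Length in bytes of the last block; it is not padded up to BLOCK_SIZE.
    static long lastBlockLength(long fileLength) {
        if (fileLength == 0) {
            return 0;
        }
        long remainder = fileLength % BLOCK_SIZE;
        return remainder == 0 ? BLOCK_SIZE : remainder;
    }

    public static void main(String[] args) {
        long size = 300L * 1024 * 1024; // a 300 MB file
        System.out.println(blockCount(size));                        // prints 3
        System.out.println(lastBlockLength(size) / (1024 * 1024));   // prints 44
    }
}
```

A 300 MB file therefore occupies two full 128 MB blocks and one 44 MB block.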

  • NameSpace

HDFS supports the traditional hierarchical file organization structure. Users or applications can create directories and save files in these directories. The file system namespace hierarchy is similar to most existing file systems: users can create, delete, move, or rename files.

The NameNode is responsible for maintaining the file system namespace. Any change to the namespace or to its attributes is recorded by the NameNode.

That is, HDFS presents clients with a single abstract (Linux-like) directory tree, with paths of the form hdfs://<NameNode hostname>:<port>/<directory>, for example hdfs://node01:9000/test/input.

  • NameNode metadata management

The directory structure and file block location information are called metadata.

The metadata of NameNode records the block information corresponding to each file (the id of the block and the information of the DataNode where it is located).

  • DataNode data storage

The actual storage of each block of a file is handled by the DataNodes.

A block is stored on multiple DataNodes, and each DataNode regularly reports the blocks it holds to the NameNode. This gives the NameNode an up-to-date view of the cluster state, which helps keep the data safe and tasks running on time.

  • Replica mechanism

For fault tolerance, every block of a file is replicated. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file; the replication factor can be set when the file is created and changed later.

The default number of replicas is 3 (including the original block).

  • Write once, read many times

HDFS is designed for write-once, read-many workloads and does not support random modification of files (appending is supported; random updates are not).

For this reason, HDFS is well suited as the underlying storage service for big data analysis, but not for applications such as network drives (modification is inconvenient, latency is high, network overhead is large and the cost is high).

The HDFS architecture is as follows:

The responsibilities of each role are as follows:

(1) NameNode (nn) - the Master, manager of the HDFS cluster

  • Maintains and manages the HDFS NameSpace

  • Maintains the replica policy

If a replica becomes unavailable, the data can be read from a replica on a different rack.

  • Records the mapping information of file blocks

  • Handles client read and write requests

(2) DataNode - the Slave node; the NameNode issues commands and the DataNode performs the actual operations

  • Stores the actual data blocks

  • Handles reads and writes of data blocks

(3) Client

  • When uploading a file to HDFS, the Client splits the file into blocks and then uploads them

  • Interacts with the NameNode, mainly to obtain the location information of file blocks

  • Interacts with DataNodes to read or write files

  • Can use commands to manage or access HDFS

2. Operating HDFS from the command line and the API

(1) Shell command line client

An HDFS client can be operated in two ways: from the Shell command line and through the Java API.

First, use the Shell command line to operate HDFS.

The basic syntax format is:

hadoop fs -<command> / hdfs dfs -<command>

View the complete list of commands as follows:

Contains complete administrative commands.

A file-count (name) quota is used as follows:

# Create an HDFS directory
[root@node01 ~]$ hdfs dfs -mkdir /test
# Set the name quota of the directory to 2 (the directory itself counts as 1)
[root@node01 ~]$ hdfs dfsadmin -setQuota 2 /test
# The first upload succeeds
[root@node01 ~]$ hdfs dfs -put test.txt /test
# The second upload fails because the name quota is exceeded
[root@node01 ~]$ hdfs dfs -put test.txt /test/test-2.txt
put: The NameSpace quota (directories and files) of directory /test is exceeded: quota=2 file count=3
# Clear the name quota
[root@node01 ~]$ hdfs dfsadmin -clrQuota /test
[root@node01 ~]$ hdfs dfs -put test.txt /test/test-2.txt

The name quota is set to 2, and the directory itself counts toward its own quota, so at most 1 file can be uploaded. After the quota is cleared, multiple files can be uploaded.

The commands for setting the space size limit are as follows:

# Set the space quota for the directory to size
hdfs dfsadmin -setSpaceQuota size directory
# Clear space quota
hdfs dfsadmin -clrSpaceQuota directory
# View the quotas and usage of a directory
hdfs dfs -count -q -h directory

Usage is as follows:

# Limit the space quota to 30 KB
[root@node01 ~]$ hdfs dfsadmin -setSpaceQuota 30k /test
# An upload that would exceed the 30 KB space quota fails
[root@node01 ~]$ hdfs dfs -put img0427.xml /test
put: The DiskSpace quota of /test is exceeded: quota = 30720 B = 30 KB but diskspace consumed = 402653256 B = 384.00 MB
# View the quotas and usage of the directory
[root@node01 ~]$ hdfs dfs -count -q -h /test
        none             inf            30 K          29.9 K            1            2                 24 /test
# Clear space quota
[root@node01 ~]$ hdfs dfsadmin -clrSpaceQuota /test
[root@node01 ~]$ hdfs dfs -put test.txt /test/test-3.txt
[root@node01 ~]$ hdfs dfs -ls /
Found 9 items
drwxrwxrwx   - root supergroup          0 2021-09-01 17:59 /api_test
drwxrwxrwx   - root supergroup          0 2021-08-26 19:22 /cl
-rw-r--r--   1 root supergroup     281214 2021-09-02 12:43 /packet.txt
drwxr-xr-x   - root supergroup          0 2021-09-02 17:34 /test
drwxrwxrwx   - root supergroup          0 2021-08-26 00:36 /tmp
-rw-r--r--   1 root supergroup         18 2021-09-02 11:12 /tmp.txt
drwxrwxrwx   - root supergroup          0 2021-08-26 20:19 /user
drwxrwxrwx   - root supergroup          0 2021-08-25 22:33 /wcinput
drwxrwxrwx   - root supergroup          0 2021-08-26 00:37 /wcoutput
[root@node01 ~]$ hdfs dfs -ls /test
Found 3 items
-rw-r--r--   3 root supergroup         12 2021-09-02 17:15 /test/test-2.txt
-rw-r--r--   3 root supergroup         12 2021-09-02 17:34 /test/test-3.txt
-rw-r--r--   3 root supergroup         12 2021-09-02 17:13 /test/test.txt
[root@node01 ~]$ 

As shown, the upload fails while the space quota is set, and files can be uploaded again once the space quota is cleared.
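
The "diskspace consumed" figure in the error message above can be reproduced by hand. The sketch below is not Hadoop code, and its class and method names are hypothetical. It assumes that the space quota is checked against a full block for every replica when a new block is requested, and that the two 12-byte files already in /test are each stored with 3 replicas. Under those assumptions the numbers match the transcript exactly.

```java
// Hypothetical illustration of the space-quota arithmetic; not part of the Hadoop API.
class SpaceQuotaMath {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // default block size
    static final int REPLICATION = 3;                  // default replication factor

    // Space already charged for existing files: each byte is counted once per replica.
    static long consumedByFiles(long totalFileBytes) {
        return totalFileBytes * REPLICATION;
    }

    // Space that would be charged after one more full block is allocated (assumption).
    static long consumedAfterNewBlock(long alreadyConsumed) {
        return alreadyConsumed + BLOCK_SIZE * REPLICATION;
    }

    public static void main(String[] args) {
        long quota = 30L * 1024;                       // 30720 B, as set with -setSpaceQuota 30k
        long existing = consumedByFiles(12 + 12);      // test.txt and test-2.txt: 72 B
        long wouldConsume = consumedAfterNewBlock(existing);
        System.out.println(wouldConsume);              // prints 402653256
        System.out.println(wouldConsume > quota);      // prints true: the quota is exceeded
    }
}
```

This also explains why even a small file cannot be written under a 30 KB space quota in this setup: the quota has to leave room for at least one block multiplied by the replication factor.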

When an HDFS cluster starts, the NameNode loads the fsimage and edits files, but neither file records which DataNodes hold each block. If a Client tried to upload files at this point, the cluster could not serve the request. Safe mode covers this period.

Safe mode is a special, self-protective state of HDFS. In this state, the file system accepts only read requests and rejects change requests such as deletions and modifications. When the NameNode starts, HDFS first enters safe mode. As each DataNode starts, it reports its available blocks and other status to the NameNode. Once the whole system meets the safety criteria, HDFS leaves safe mode automatically. While HDFS is in safe mode, blocks are not replicated; whether each block meets the minimum number of replicas is therefore judged from the state the DataNodes report at startup, and no replication is performed during startup to reach that minimum. A freshly started HDFS cluster stays in safe mode for 30 seconds by default, and the cluster can be operated on only after these 30 seconds have passed.
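
The exit condition described above can be sketched as a small predicate. This is a simplified model, not the NameNode implementation: the class and method names are hypothetical, and the constants are assumptions based on the usual defaults of dfs.namenode.safemode.threshold-pct (0.999) and dfs.namenode.safemode.extension (30000 ms).

```java
// Simplified, hypothetical model of when safe mode can be left; not NameNode code.
class SafeModeSketch {
    static final double THRESHOLD_PCT = 0.999; // assumed default threshold
    static final long EXTENSION_MS = 30_000L;  // assumed default extension (30 s)

    // True once enough blocks have been reported with the minimum number of replicas.
    static boolean thresholdReached(long totalBlocks, long safeBlocks) {
        return safeBlocks >= (long) (totalBlocks * THRESHOLD_PCT);
    }

    // True when the threshold has been reached and the extension period has elapsed.
    static boolean canLeave(long totalBlocks, long safeBlocks, long msSinceThresholdReached) {
        return thresholdReached(totalBlocks, safeBlocks)
                && msSinceThresholdReached >= EXTENSION_MS;
    }

    public static void main(String[] args) {
        System.out.println(canLeave(1000, 500, 60_000));  // false: too few blocks reported
        System.out.println(canLeave(1000, 1000, 10_000)); // false: extension not yet elapsed
        System.out.println(canLeave(1000, 1000, 30_000)); // true
    }
}
```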

The commands related to safe mode are as follows:

hdfs dfsadmin -safemode <enter|leave|get|wait|forceExit>

Usage is as follows:

[root@node01 ~]$ hdfs dfsadmin -safemode enter
Safe mode is ON
[root@node01 ~]$ hdfs dfs -get /tmp.txt
[root@node01 ~]$ hdfs dfs -put hadoop.txt /test
put: Cannot create file/test/hadoop.txt._COPYING_. Name node is in safe mode.
[root@node01 ~]$ hdfs dfsadmin -safemode leave
Safe mode is OFF
[root@node01 ~]$ hdfs dfs -put hadoop.txt /test
[root@node01 ~]$ hdfs dfs -ls /test
Found 4 items
-rw-r--r--   3 root supergroup         52 2021-09-02 18:14 /test/hadoop.txt
-rw-r--r--   3 root supergroup         12 2021-09-02 17:15 /test/test-2.txt
-rw-r--r--   3 root supergroup         12 2021-09-02 17:34 /test/test-3.txt
-rw-r--r--   3 root supergroup         12 2021-09-02 17:13 /test/test.txt

As shown, while safe mode is on, files can be read but not written. Files can be written again only after safe mode is turned off.

Hadoop archiving mainly addresses the problem of large numbers of small files in an HDFS cluster.

Every file occupies NameNode memory, so storing a large number of small files in HDFS wastes NameNode memory.

A Hadoop archive (HAR file) is an efficient way to archive files. A HAR file is created from a group of files by the archive tool, so that multiple files need only one set of metadata. This reduces NameNode memory use while still allowing transparent access to the files. In short, to the NameNode a HAR file is a single item, which reduces memory waste, while for actual operations its contents are still multiple independent files.

Now test archiving:

(1) Start the YARN cluster

Archiving runs as a MapReduce job, so the YARN cluster must be running. If it is not, start YARN first.

(2) Archive files

Archive all files in the /test directory into an archive named test.har and store it under the /output path, as follows:

[root@node01 ~]$ hadoop archive -help
usage: archive <-archiveName <NAME>.har> <-p <parent path>> [-r <replication factor>] <src>* <dest>
 -archiveName <arg>   Name of the Archive. This is mandatory option
 -help                Show the usage
 -p <arg>             Parent path of sources. This is mandatory option
 -r <arg>             Replication factor archive files
[root@node01 ~]$ hadoop archive -archiveName test.har -p /test /output
21/09/02 18:29:13 INFO client.RMProxy: Connecting to ResourceManager at node03/
21/09/02 18:29:14 INFO mapreduce.JobSubmissionFiles: Permissions on staging directory /tmp/hadoop-yarn/staging/root/.staging are incorrect: rwxrwxrwx. Fixing permissions to correct value rwx------
21/09/02 18:29:14 INFO client.RMProxy: Connecting to ResourceManager at node03/
21/09/02 18:29:14 INFO client.RMProxy: Connecting to ResourceManager at node03/
21/09/02 18:29:15 INFO mapreduce.JobSubmitter: number of splits:1
21/09/02 18:29:15 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
21/09/02 18:29:16 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1629908848730_0002
21/09/02 18:29:16 INFO impl.YarnClientImpl: Submitted application application_1629908848730_0002
21/09/02 18:29:16 INFO mapreduce.Job: The url to track the job: http://node03:8088/proxy/application_1629908848730_0002/
21/09/02 18:29:16 INFO mapreduce.Job: Running job: job_1629908848730_0002
21/09/02 18:29:27 INFO mapreduce.Job: Job job_1629908848730_0002 running in uber mode : false
21/09/02 18:29:27 INFO mapreduce.Job:  map 0% reduce 0%
21/09/02 18:29:40 INFO mapreduce.Job:  map 100% reduce 0%
21/09/02 18:29:54 INFO mapreduce.Job:  map 100% reduce 100%
21/09/02 18:29:54 INFO mapreduce.Job: Job job_1629908848730_0002 completed successfully
21/09/02 18:29:54 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=390
                FILE: Number of bytes written=401295
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=611
                HDFS: Number of bytes written=465
                HDFS: Number of read operations=18
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=11
        Job Counters 
                Launched map tasks=1
                Launched reduce tasks=1
                Other local map tasks=1
                Total time spent by all maps in occupied slots (ms)=9066
                Total time spent by all reduces in occupied slots (ms)=10903
                Total time spent by all map tasks (ms)=9066
                Total time spent by all reduce tasks (ms)=10903
                Total vcore-milliseconds taken by all map tasks=9066
                Total vcore-milliseconds taken by all reduce tasks=10903
                Total megabyte-milliseconds taken by all map tasks=9283584
                Total megabyte-milliseconds taken by all reduce tasks=11164672
        Map-Reduce Framework
                Map input records=5
                Map output records=5
                Map output bytes=374
                Map output materialized bytes=390
                Input split bytes=116
                Combine input records=0
                Combine output records=0
                Reduce input groups=5
                Reduce shuffle bytes=390
                Reduce input records=5
                Reduce output records=0
                Spilled Records=10
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=269
                CPU time spent (ms)=2460
                Physical memory (bytes) snapshot=340410368
                Virtual memory (bytes) snapshot=4169043968
                Total committed heap usage (bytes)=140947456
        Shuffle Errors
        File Input Format Counters 
                Bytes Read=407
        File Output Format Counters 
                Bytes Written=0
[root@node01 ~]$ hdfs dfs -ls /output
Found 1 items
drwxr-xr-x   - root supergroup          0 2021-09-02 18:29 /output/test.har
[root@node01 ~]$ hdfs dfs -ls /output/test.har
Found 4 items
-rw-r--r--   3 root supergroup          0 2021-09-02 18:29 /output/test.har/_SUCCESS
-rw-r--r--   3 root supergroup        354 2021-09-02 18:29 /output/test.har/_index
-rw-r--r--   3 root supergroup         23 2021-09-02 18:29 /output/test.har/_masterindex
-rw-r--r--   3 root supergroup         88 2021-09-02 18:29 /output/test.har/part-0

You can see that the archive directory test.har has been generated and contains four files:

_SUCCESS indicates that the job completed successfully;

part-0 stores the merged contents of the original files;

_index and _masterindex are index files that record where each original file is located within part-0.

(3) View the archive

As follows:

[root@node01 ~]$ hdfs dfs -ls -R /output/test.har
-rw-r--r--   3 root supergroup          0 2021-09-02 18:29 /output/test.har/_SUCCESS
-rw-r--r--   3 root supergroup        354 2021-09-02 18:29 /output/test.har/_index
-rw-r--r--   3 root supergroup         23 2021-09-02 18:29 /output/test.har/_masterindex
-rw-r--r--   3 root supergroup         88 2021-09-02 18:29 /output/test.har/part-0
[root@node01 ~]$ hdfs dfs -ls -R har:///output/test.har
-rw-r--r--   3 root supergroup         52 2021-09-02 18:14 har:///output/test.har/hadoop.txt
-rw-r--r--   3 root supergroup         12 2021-09-02 17:15 har:///output/test.har/test-2.txt
-rw-r--r--   3 root supergroup         12 2021-09-02 17:34 har:///output/test.har/test-3.txt
-rw-r--r--   3 root supergroup         12 2021-09-02 17:13 har:///output/test.har/test.txt

You can see that listing the original files inside the archive requires the har:// scheme.

(4) Extract the archive

Extracting an archive amounts to copying the files out of the HAR file, as follows:

[root@node01 ~]$ hdfs dfs -cp har:///output/test.har/* /demo
[root@node01 ~]$ hdfs dfs -ls /demo
Found 4 items
-rw-r--r--   3 root supergroup         52 2021-09-02 18:37 /demo/hadoop.txt
-rw-r--r--   3 root supergroup         12 2021-09-02 18:37 /demo/test-2.txt
-rw-r--r--   3 root supergroup         12 2021-09-02 18:37 /demo/test-3.txt
-rw-r--r--   3 root supergroup         12 2021-09-02 18:37 /demo/test.txt

6. Log collection case

(1) Requirements analysis

The business structure of the company is as follows:

The business system consists of multiple servers, each of which writes a log. The log grows over time, so it has to be rolled (rotated). Rolling on every server produces a large number of log files that are of limited value but occupy a lot of disk space. HDFS can be used to handle them:

Write the rolled logs to HDFS. To distinguish logs that have been uploaded from those that have not, and to avoid affecting the business system, a temporary directory is needed. A backup directory is also needed to keep the logs of the most recent period, so that recent log files can be viewed on the server without reading them from HDFS.

Therefore, the functions to be implemented are as follows:

  • Periodically collect log files that have finished rolling

  • Move the files to be collected to the temporary directory and upload them

  • Back up the log files

(2) Implementing scheduling

Create a new project named LogCollector. Its pom.xml is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns=""





        <!-- -->

        <!-- -->

        <!-- -->



Create a new package com.bigdata.collectlog and add the following class to it:

package com.bigdata.collectlog;

import java.util.Timer;

/**
 * @author Corley
 * @date 2021/9/2 19:01
 * @description LogCollector-com.bigdata.collectlog
 * Periodically collect log files that have finished rolling
 * Move the files to be collected to the temporary directory
 * Back up the log files
 */
public class LogCollector {

    public static void main(String[] args) {

        // The Timer runs its tasks on a single background thread
        Timer timer = new Timer();
        // Run the collection task immediately, then once every hour
        timer.schedule(new LogCollectorTask(), 0, 3600 * 1000);
    }
}

Here, the public void schedule(TimerTask task, long delay, long period) method of the Timer class schedules the periodic collection task. Its three parameters are the task to run, the delay before the first run, and the period between runs (both in milliseconds). TimerTask is an abstract class that implements the Runnable interface; a custom task extends TimerTask and implements its business logic in the void run() method.
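
The scheduling behaviour can be tried in isolation with the JDK alone. The following is a minimal sketch, independent of the log collector: the class and method names are hypothetical, and a short period is used so that the example finishes quickly. It schedules a TimerTask with zero delay and waits until it has fired a given number of times.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; shows Timer.schedule(task, delay, period) on its own.
class TimerSketch {

    // Returns true if a periodic task fired `runs` times within `timeoutMs` milliseconds.
    static boolean firesRepeatedly(int runs, long periodMs, long timeoutMs) {
        CountDownLatch latch = new CountDownLatch(runs);
        Timer timer = new Timer(true); // daemon thread, so it does not keep the JVM alive
        timer.schedule(new TimerTask() {
            @Override
            public void run() {
                latch.countDown(); // stands in for the real business logic
            }
        }, 0, periodMs);           // first run immediately, then every periodMs
        try {
            return latch.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            timer.cancel();        // stop further executions
        }
    }

    public static void main(String[] args) {
        System.out.println(firesRepeatedly(3, 50L, 5000L));
    }
}
```

Note that new Timer() without arguments creates a non-daemon thread, which is why a program such as LogCollector keeps running after main returns.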

The skeleton of the custom LogCollectorTask class is as follows:

package com.bigdata.collectlog;

import java.util.TimerTask;

/**
 * @author Corley
 * @date 2021/9/2 19:07
 * @description LogCollector-com.bigdata.collectlog
 */
public class LogCollectorTask extends TimerTask {
    @Override
    public void run() {
        // Collection task logic
        // 1. Scan the specified directory to find the files to be uploaded

        // 2. Move the files to be uploaded to the temporary directory

        // 3. Use the HDFS API to upload the files to the target directory

        // 4. Move the uploaded files to the backup directory
    }
}

(3) Implementing collection and upload

The completed LogCollectorTask class is as follows:

package com.bigdata.collectlog;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.File;
import java.io.FilenameFilter;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.TimerTask;

/**
 * @author Corley
 * @date 2021/9/2 19:07
 * @description LogCollector-com.bigdata.collectlog
 */
public class LogCollectorTask extends TimerTask {
    @Override
    public void run() {
        // Collection task logic
        // 1. Scan the specified directory to find the files to be uploaded
        File logDir = new File("E:/Test/logs");
        File[] uploadFiles = logDir.listFiles(new FilenameFilter() {
            @Override
            public boolean accept(File dir, String name) {
                // Only rolled logs (access.log.N) match; the active access.log is skipped
                return name.startsWith("access.log.");
            }
        });

        // 2. Move the files to be uploaded to the temporary directory
        // Create the temporary directory if it does not exist
        File tmpFile = new File("E:/Test/tmp");
        if (!tmpFile.exists()) {
            tmpFile.mkdirs();
        }
        assert uploadFiles != null;
        for (File file : uploadFiles) {
            file.renameTo(new File(tmpFile.getPath() + "/" + file.getName()));
        }

        // 3. Use the HDFS API to upload the files to the target directory
        Configuration configuration = new Configuration();
        FileSystem fileSystem = null;
        // Format the date
        DateTimeFormatter timeFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd");
        LocalDateTime dateTime = LocalDateTime.now();
        String timeString = dateTime.format(timeFormatter);
        // Create the backup directory if it does not exist
        File logBakDir = new File("E:/Test/log_bak/" + timeString);
        if (!logBakDir.exists()) {
            logBakDir.mkdirs();
        }
        // Create the HDFS target path if it does not exist
        Path logPath = new Path("/collect_log/" + timeString);
        try {
            fileSystem = FileSystem.get(new URI("hdfs://node01:9000"), configuration, "root");
            if (!fileSystem.exists(logPath)) {
                fileSystem.mkdirs(logPath);
            }
            File[] files = tmpFile.listFiles();
            for (File file : files) {
                // Store by date
                fileSystem.copyFromLocalFile(new Path(file.getPath()),
                        new Path(logPath.getParent() + "/" + logPath.getName() + "/" + file.getName()));

                // 4. Move the uploaded file to the backup directory
                file.renameTo(new File(logBakDir.getPath() + "/" + file.getName()));
            }
        } catch (IOException | URISyntaxException | InterruptedException e) {
            e.printStackTrace();
        } finally {
            // Always release the HDFS connection
            try {
                if (null != fileSystem) {
                    fileSystem.close();
                }
            } catch (IOException ioException) {
                ioException.printStackTrace();
            }
        }
    }
}

To test, prepare log files as follows:

λ ls logs\
access.log  access.log.1  access.log.2  access.log.3  access.log.4  access.log.5  access.log.6

Now run the main method of the LogCollector class to test it, as follows:

log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See for more info.

Because the program runs a scheduled task, it keeps running and does not exit;

Check the logs directory first:

λ ls logs\

As you can see, only access.log is left;

View the tmp directory:

λ ls tmp/

The tmp directory is empty;

Then check the log_bak directory, as follows:

λ tree /f log_bak\
Folder PATH listing for volume file
Volume serial number is C0000100 10EF:7240

You can see that the rolled logs have been saved to the log_bak directory.

Finally, check the HDFS file system as follows:

[root@node01 ~]$ hdfs dfs -ls /
Found 12 items
drwxrwxrwx   - root supergroup          0 2021-09-01 17:59 /api_test
drwxrwxrwx   - root supergroup          0 2021-08-26 19:22 /cl
drwxr-xr-x   - root supergroup          0 2021-09-02 22:01 /collect_log
drwxr-xr-x   - root supergroup          0 2021-09-02 18:37 /demo
drwxr-xr-x   - root supergroup          0 2021-09-02 18:29 /output
-rw-r--r--   1 root supergroup     281214 2021-09-02 12:43 /packet.txt
drwxr-xr-x   - root supergroup          0 2021-09-02 18:14 /test
drwxrwxrwx   - root supergroup          0 2021-08-26 00:36 /tmp
-rw-r--r--   1 root supergroup         18 2021-09-02 11:12 /tmp.txt
drwxrwxrwx   - root supergroup          0 2021-09-02 21:06 /user
drwxrwxrwx   - root supergroup          0 2021-08-25 22:33 /wcinput
drwxrwxrwx   - root supergroup          0 2021-08-26 00:37 /wcoutput
[root@node01 ~]$ hdfs dfs -ls /collect_log
Found 1 items
drwxr-xr-x   - root supergroup          0 2021-09-02 22:03 /collect_log/2021-09-02
[root@node01 ~]$ hdfs dfs -ls /collect_log/2021-09-02
Found 6 items
-rw-r--r--   3 root supergroup     179822 2021-09-02 22:03 /collect_log/2021-09-02/access.log.1
-rw-r--r--   3 root supergroup     251115 2021-09-02 22:03 /collect_log/2021-09-02/access.log.2
-rw-r--r--   3 root supergroup       6037 2021-09-02 22:03 /collect_log/2021-09-02/access.log.3
-rw-r--r--   3 root supergroup     149080 2021-09-02 22:03 /collect_log/2021-09-02/access.log.4
-rw-r--r--   3 root supergroup     314855 2021-09-02 22:03 /collect_log/2021-09-02/access.log.5
-rw-r--r--   3 root supergroup       1571 2021-09-02 22:03 /collect_log/2021-09-02/access.log.6

As you can see, the log files have been uploaded to HDFS.

(4) Program tuning

The program can be improved in several places:

(1) Many parameters in the LogCollectorTask class, such as the paths, are hard-coded. This is inflexible; they can be extracted into a configuration file;

(2) Once started, the task runs repeatedly, but the configuration file only needs to be loaded once, so the singleton design pattern can be used to obtain the Properties object.

Create a new configuration file under src/main/resources as follows:


Create a new package com.bigdata.common and add a constants class, Constant, to it as follows:

package com.bigdata.common;

/**
 * @author Corley
 * @date 2021/9/2 22:16
 * @description LogCollector-com.bigdata.common
 */
public class Constant {

    // Keys used to look up values in the configuration file
    public static final String LOG_DIR = "LOG.DIR";
    public static final String LOG_PREFIX = "LOG.PREFIX";
    public static final String LOG_TMP_FILE = "LOG.TMP.FILE";
    public static final String LOG_TARGET_DIR = "HDFS.TARGET.DIR";
    public static final String LOG_BAK_DIR = "LOG.BAK.DIR";
}

Create a new package com.bigdata.singleton and add the singleton classes to it. The singleton pattern has two common forms: eager ("hungry") and lazy.

The eager form is as follows:

package com.bigdata.singleton;

import com.bigdata.collectlog.LogCollectorTask;

import java.io.IOException;
import java.util.Properties;

/**
 * @author Corley
 * @date 2021/9/3 10:34
 * @description LogCollector-com.bigdata.singleton
 * Eager ("hungry") singleton
 */
public class HungryPropertiesTool {

    private static Properties properties = null;

    // The static block runs exactly once, when the class is initialized
    static {
        properties = new Properties();
        try {
            // Pass the name of the properties file created under src/main/resources
            properties.load(LogCollectorTask.class.getClassLoader().getResourceAsStream(""));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Public so that other packages can obtain the shared instance
    public static Properties getProperties() {
        return properties;
    }
}
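The lazy form is as follows:

package com.bigdata.singleton;

import com.bigdata.collectlog.LogCollectorTask;

import java.io.IOException;
import java.util.Properties;

/**
 * @author Corley
 * @date 2021/9/3 10:46
 * @description LogCollector-com.bigdata.singleton
 * Lazy singleton
 */
public class LazyPropertiesTool {

    private static volatile Properties properties = null;

    public static Properties getProperties() throws IOException {
        if (null == properties) {
            // Lock on the class object; a string literal is interned and could be shared with unrelated code
            synchronized (LazyPropertiesTool.class) {
                if (null == properties) {
                    // Load into a local variable first so other threads never see a half-loaded object
                    Properties loaded = new Properties();
                    // Pass the name of the properties file created under src/main/resources
                    loaded.load(LogCollectorTask.class.getClassLoader().getResourceAsStream(""));
                    properties = loaded;
                }
            }
        }
        return properties;
    }
}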

Here, the volatile keyword prevents instruction reordering around accesses to the variable and guarantees that a write by one thread is visible to the others. Without it, reordering in a multithreaded program could let a thread see a non-null reference to an object that has not been fully initialized.

The log collection task class now looks as follows:
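
The double-checked locking idea can be exercised without Hadoop or a configuration file. The following is a minimal sketch using only the JDK: the class name, the LOADS counter and the hard-coded property are hypothetical stand-ins, added so that the number of loads can be observed.

```java
import java.util.Properties;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo of lazy initialization with double-checked locking.
class LazyConfig {
    // Counts how often the expensive load ran; for demonstration only
    static final AtomicInteger LOADS = new AtomicInteger();

    // volatile: the fully populated object is visible to all threads once published
    private static volatile Properties properties;

    static Properties get() {
        Properties local = properties;          // fast path: one volatile read
        if (local == null) {
            synchronized (LazyConfig.class) {
                local = properties;             // second check, now under the lock
                if (local == null) {
                    local = load();
                    properties = local;         // publish only after loading has finished
                }
            }
        }
        return local;
    }

    // Stand-in for reading a properties file from the classpath
    private static Properties load() {
        LOADS.incrementAndGet();
        Properties p = new Properties();
        p.setProperty("LOG.PREFIX", "access.log.");
        return p;
    }

    public static void main(String[] args) {
        System.out.println(get() == get());     // prints true: same instance every time
        System.out.println(LOADS.get());        // prints 1: loaded only once
    }
}
```

An alternative with the same effect is to keep the instance in a static field of a nested holder class, which relies on class initialization for thread safety instead of explicit locking.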

package com.bigdata.collectlog;

import com.bigdata.common.Constant;
import com.bigdata.singleton.LazyPropertiesTool;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.File;
import java.io.FilenameFilter;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Properties;
import java.util.TimerTask;

/**
 * @author Corley
 * @date 2021/9/2 19:07
 * @description LogCollector-com.bigdata.collectlog
 */
public class LogCollectorTask extends TimerTask {
    @Override
    public void run() {

        // Properties properties = new Properties();
        // try {
        //     properties.load(LogCollectorTask.class.getClassLoader().getResourceAsStream(""));
        // } catch (IOException e) {
        //     e.printStackTrace();
        // }

        // Obtain the shared configuration; it is loaded only on the first call
        Properties properties = null;
        try {
            properties = LazyPropertiesTool.getProperties();
        } catch (IOException e) {
            e.printStackTrace();
            // Without configuration there is nothing to do in this run
            return;
        }

        // Collection task logic
        // 1. Scan the specified directory to find the files to be uploaded
        File logDir = new File(properties.getProperty(Constant.LOG_DIR));
        Properties finalProperties = properties;
        File[] uploadFiles = logDir.listFiles(new FilenameFilter() {
            @Override
            public boolean accept(File dir, String name) {
                return name.startsWith(finalProperties.getProperty(Constant.LOG_PREFIX));
            }
        });

        // 2. Move the files to be uploaded to the temporary directory
        // Create the temporary directory if it does not exist
        File tmpFile = new File(properties.getProperty(Constant.LOG_TMP_FILE));
        if (!tmpFile.exists()) {
            tmpFile.mkdirs();
        }
        assert uploadFiles != null;
        for (File file : uploadFiles) {
            file.renameTo(new File(tmpFile.getPath() + "/" + file.getName()));
        }

        // 3. Use the HDFS API to upload the files to the target directory
        Configuration configuration = new Configuration();
        FileSystem fileSystem = null;
        // Format the date
        DateTimeFormatter timeFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd");
        LocalDateTime dateTime = LocalDateTime.now();
        String timeString = dateTime.format(timeFormatter);
        // Create the backup directory if it does not exist
        File logBakDir = new File(properties.getProperty(Constant.LOG_BAK_DIR) + timeString);
        if (!logBakDir.exists()) {
            logBakDir.mkdirs();
        }
        // Create the HDFS target path if it does not exist
        Path logPath = new Path(properties.getProperty(Constant.LOG_TARGET_DIR) + timeString);
        try {
            fileSystem = FileSystem.get(new URI("hdfs://node01:9000"), configuration, "root");
            if (!fileSystem.exists(logPath)) {
                fileSystem.mkdirs(logPath);
            }
            File[] files = tmpFile.listFiles();
            for (File file : files) {
                // Store by date
                fileSystem.copyFromLocalFile(new Path(file.getPath()),
                        new Path(logPath.getParent() + "/" + logPath.getName() + "/" + file.getName()));

                // 4. Move the uploaded file to the backup directory
                file.renameTo(new File(logBakDir.getPath() + "/" + file.getName()));
            }
        } catch (IOException | URISyntaxException | InterruptedException e) {
            e.printStackTrace();
        } finally {
            // Always release the HDFS connection
            try {
                if (null != fileSystem) {
                    fileSystem.close();
                }
            } catch (IOException ioException) {
                ioException.printStackTrace();
            }
        }
    }
}

After running, the result is the same as before, but the paths are now configurable and the configuration file is loaded only once.

Added by Leadwing on Sat, 04 Sep 2021 21:30:00 +0300