How can we double the backup speed of a massive number of small files?

As information technology management matures in industries such as healthcare, education, e-commerce, and finance, the variety of unstructured data keeps growing: records (schools, libraries), images (e-commerce), videos (hospitals), bills (finance), drawings (design firms), and more. Enterprise data volumes have grown exponentially from gigabytes (GB) to terabytes (TB) and even petabytes (PB). Within this data there are often enormous numbers of small files: each file may be only a few kilobytes (KB), yet the file count can reach millions, tens of millions, or even billions.

When faced with big data composed of massive numbers of small files, improving backup speed becomes a real headache for enterprises. Aurreum combines several technologies and optimizes each stage of the backup process to comprehensively accelerate the backup of massive small files: “precise” deduplication, “fast” transmission, and “minimal” resource usage!

Variable-length block segmentation technology - Accurate deduplication of data

Aurreum applies variable-length block segmentation and deduplication to the backup of unstructured data, especially massive small files. Because the data in massive small files changes in complex ways, fixed-length block segmentation often has to re-segment the entire backup set: even a small change shifts every subsequent block boundary, so unchanged data no longer matches and the deduplication rate drops. Variable-length block segmentation, by contrast, re-segments only the data around the change. This not only greatly reduces the client computing resources spent on segmenting and processing data, but also keeps deduplication of massive small files at its best.
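As a rough illustration (not Aurreum’s proprietary algorithm), the Python sketch below shows how content-defined, variable-length chunking can be built from a simple rolling hash: block boundaries depend on the bytes themselves, so a change only re-segments the chunks around it, and identical chunks elsewhere keep the same fingerprint. All names and parameters here are illustrative assumptions.

```python
import hashlib

AVG_BITS = 12                             # boundary probability ~1/4096 -> ~4 KiB average chunk
MIN_SIZE, MAX_SIZE = 1 << 10, 1 << 14     # keep chunks between 1 KiB and 16 KiB
MASK = (1 << AVG_BITS) - 1

def chunks(data: bytes):
    """Yield variable-length chunks whose boundaries are chosen by content,
    so an insertion only re-segments the data near the change."""
    start, h = 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF   # rolling hash; old bytes shift out of the 32-bit window
        if i - start + 1 < MIN_SIZE:
            continue
        if (h & MASK) == 0 or i - start + 1 >= MAX_SIZE:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]

def dedup_index(data: bytes) -> dict:
    """Fingerprint each chunk; duplicate chunks collapse to a single entry."""
    return {hashlib.sha256(c).hexdigest(): c for c in chunks(data)}
```

With fixed 4 KiB blocks, inserting a single byte at the front of a file would shift every later block and defeat deduplication; with content-defined boundaries like the above, only the chunk containing the insertion changes.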

Multi-channel parallel backup - Multiplicative efficiency improvement

When building file backup channels, Aurreum runs file indexing and data backup in parallel and collects backup data over multiple channels at once, multiplying backup efficiency. It is like draining a reservoir through several pipes in parallel instead of one, which effectively increases the drainage rate.

Parallel processing of file index and data backup

For file backups at the gigabyte (GB) scale, the conventional approach is serial processing in a single process: first traverse the files to be backed up and build a file index, then back up the file data. With massive file sets, however, there are often huge numbers of small files and deep directory hierarchies, so traversal and index building alone can take a long time, and the serial approach leaves backup efficiency low.

Aurreum instead separates file indexing and data backup into two processes and runs them concurrently: while the system traverses the file directory and builds the file index, it is already backing files up. This significantly shortens the backup window and improves efficiency.
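The pattern is essentially a producer-consumer pipeline. The sketch below is illustrative only (the `backup_file` placeholder is a hypothetical stand-in, not an Aurreum API): one thread walks the directory tree and builds the index while worker threads back up files as soon as they are discovered.

```python
import os
import queue
import threading

def backup_file(path: str) -> None:
    ...  # placeholder for the real per-file backup (read, chunk, send to storage)

def backup_tree(root: str, workers: int = 4) -> list:
    """Traverse and back up at the same time: the walker (producer) fills the
    queue while worker threads (consumers) drain it."""
    pending = queue.Queue(maxsize=1000)   # holds file paths; None is a stop sentinel
    index = []                            # the file index, built during traversal

    def producer() -> None:
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                index.append(path)        # indexing happens while backups are running
                pending.put(path)
        for _ in range(workers):
            pending.put(None)             # one sentinel per worker to stop it

    def consumer() -> None:
        while (path := pending.get()) is not None:
            backup_file(path)

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return index
```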

Multi-channel parallel data backup

For data collection during file backup, Aurreum uses multi-channel parallel technology. In a pipeline fashion, the system traverses the files to be backed up, builds the file indexes, and distributes the file information across several data backup channels; the backup data is then transmitted to the storage server through these channels in parallel.

The difficulty in multi-channel parallel processing lies in how file information is distributed across channels after segmentation, and in how the data is reassembled during recovery after a multi-channel backup. Aurreum uses proprietary algorithms that monitor channel occupancy and distribute backup data evenly to idle channels. During recovery, the data is restored to its original file directory based on the segmentation information recorded at backup time, keeping the operation efficient and safe.
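Aurreum’s balancing algorithm is proprietary; as a generic stand-in, the sketch below assigns each file to whichever channel currently carries the fewest bytes, which is one simple way to keep parallel channels evenly loaded while recording which channel received each file for later recovery.

```python
import heapq

def assign_channels(files: list, channels: int) -> list:
    """Greedy balancing: always hand the next file to the least-loaded channel.

    `files` is a list of (path, size_in_bytes) tuples; the result is one list
    of paths per channel, kept so recovery knows where each file was sent.
    """
    load = [(0, ch) for ch in range(channels)]   # (bytes assigned so far, channel id)
    heapq.heapify(load)
    plan = [[] for _ in range(channels)]

    # Placing large files first keeps the greedy split tighter.
    for path, size in sorted(files, key=lambda f: f[1], reverse=True):
        assigned, ch = heapq.heappop(load)
        plan[ch].append(path)
        heapq.heappush(load, (assigned + size, ch))
    return plan

# Four channels share the transfer; each sublist is sent over its own channel.
plan = assign_channels([("a.jpg", 120_000), ("b.txt", 2_000), ("c.mp4", 900_000)], channels=4)
```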

Server-side automatic synthesis - Minimized client resource usage

For backing up massive files, and massive small files in particular, synthetic backup is the best strategy. It avoids a series of problems with the traditional periodic “full backup + incremental backup” approach: long backup windows, heavy consumption of client computing, I/O, and network resources, and disruption to the normal operation of core business.

Aurreum supports synthetic backup of files: a full backup is performed once, and only incremental backups follow. Put simply, Aurreum merges the initial full backup data with the subsequent incremental backup sets to generate a new full backup. At regular intervals, the latest synthetic full backup set is merged with newer incremental sets to create the next full backup, and the cycle repeats.
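Conceptually the merge works like the sketch below, where each backup set is reduced to a manifest mapping file paths to stored-data identifiers (the manifest format and names are illustrative, not Aurreum’s on-disk layout): applying the incrementals to the previous full, oldest first, yields a manifest describing a new full backup without touching the client again.

```python
def synthesize_full(full: dict, incrementals: list) -> dict:
    """Merge a full-backup manifest with incremental manifests on the server.

    Each manifest maps a file path to the identifier of its stored data;
    in an incremental manifest, None marks a file deleted since the last backup.
    """
    merged = dict(full)
    for inc in incrementals:               # apply the oldest incremental first
        for path, ref in inc.items():
            if ref is None:
                merged.pop(path, None)     # file was deleted
            else:
                merged[path] = ref         # file was added or changed
    return merged

# Day-1 full plus two nightly incrementals synthesize a new full backup.
full_day1 = {"/data/a.csv": "blk-001", "/data/b.csv": "blk-002"}
inc_day2  = {"/data/b.csv": "blk-007"}                        # b.csv changed
inc_day3  = {"/data/a.csv": None, "/data/c.csv": "blk-011"}   # a.csv deleted, c.csv added
new_full = synthesize_full(full_day1, [inc_day2, inc_day3])
# new_full == {"/data/b.csv": "blk-007", "/data/c.csv": "blk-011"}
```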

Aurreum’s file synthetic backup can be applied to any platform and environment, including file shares mounted over NFS and CIFS. This is an area that volume-level CDP technology does not currently support.

By combining these technologies, Aurreum accelerates the backup of massive small files under the principles of “precise,” “fast,” and “minimal.” Stay tuned for more technology updates!