The development of 5G, IoT, and AI technologies has created many new applications. At the same time, more and more cloud-native applications are moving from the public cloud into private, offline data centers. These new applications are widely used in customers' production and decision-making systems, generating large amounts of unstructured data. In preparation for the yottabyte era, enterprises must find ways to process and analyze massive data quickly and efficiently to enable timely decision-making. Meanwhile, advances in manufacturing and media are driving SSD costs down year by year, suggesting that SSDs will gradually replace HDDs.
In an HDD-based storage system, the CPU is the control center for all data access paths, and the I/O process is designed based on the idea of using the CPU to control disks, as shown in Figure 1.
In this control mode, the performance of a single storage node is capped by whichever component reaches its bottleneck first: the CPU, memory, or the HDDs (a single storage node is typically equipped with multiple HDDs). In an HDD system, per-disk performance is bounded by factors such as the drive mechanics and its interface. A typical 7200 RPM SATA HDD sustains read and write bandwidth of 100 MB/s to 150 MB/s. A node equipped with one or two CPUs and 10 to 30 HDDs reaches its maximum throughput, with single-node bandwidth ranging from 3 GB/s to 4 GB/s. Beyond roughly 30 disks, single-node performance no longer scales linearly.
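The arithmetic behind this ceiling is simple enough to sketch. In the toy model below, the per-disk figure comes from the paragraph above, while the node-level ceiling of about 3.5 GB/s is an assumed stand-in for the CPU/memory bound:

```python
# Back-of-the-envelope node bandwidth model (illustrative figures only).
HDD_BW_MBPS = 125          # mid-range of the 100-150 MB/s quoted above
NODE_CEILING_MBPS = 3500   # assumed CPU/memory-bound node ceiling (~3-4 GB/s)

for disks in (10, 20, 30, 40, 60):
    raw = disks * HDD_BW_MBPS              # sum of raw disk bandwidth
    usable = min(raw, NODE_CEILING_MBPS)   # capped by the node's CPU/memory path
    print(f"{disks:>2} HDDs: raw {raw/1000:.1f} GB/s -> usable {usable/1000:.1f} GB/s")
```

Up to about 28 disks the media is the bottleneck; past that point, adding disks only adds stranded bandwidth.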
To accelerate data processing, RDMA technology (over InfiniBand or RoCE) is widely used. Figure 2 shows the RDMA-based I/O control logic.
RDMA indeed simplifies the I/O stack for cross-network data communication. Once data is written into memory, the CPU can trigger the corresponding DMA instruction to send the memory data to the server over the RDMA-enabled network. On the server side, the RDMA-capable network interface card (NIC) writes the data directly into a specified address in memory. However, as shown in Figure 2, the data can be persisted to the storage medium only after the CPU finishes processing it (including erasure coding calculation) from memory.
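RDMA requires capable hardware, but the receive-side idea described above (data landing directly in a caller-designated buffer rather than in a freshly allocated copy) can be loosely illustrated with ordinary sockets. This is only an analogy: a kernel socket still copies data once, whereas an RDMA NIC DMAs the payload straight into registered memory.

```python
import socket

# Pre-designated receive buffer: the application names the target memory
# up front, much as it registers a memory region before an RDMA transfer.
buf = bytearray(4096)

a, b = socket.socketpair()
a.sendall(b"payload destined for a fixed buffer")

# recv_into() places the bytes into `buf` directly instead of allocating
# a new bytes object -- the closest socket-level analogy to an RDMA NIC
# writing into a specified address in memory.
n = b.recv_into(buf)
print(buf[:n].decode())
a.close(); b.close()
```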
The HDD-based storage architecture shows its major weakness with flash media. A typical enterprise-class NVMe SSD delivers bandwidth of over 7 GB/s and more than 500,000 IOPS. Because the CPU and memory bandwidth of a single node cannot be expanded indefinitely, a storage architecture with the CPU as the control center is bounded by CPU resources and memory bandwidth, and it cannot fully exploit multiple NVMe SSDs in a single storage node. Storage architectures must therefore be redesigned to unlock media performance in the all-flash era.
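Extending the earlier arithmetic to NVMe figures shows how the bottleneck moves from the media to the node. The per-node budget below is an assumed, illustrative number, not a measured platform limit:

```python
NVME_BW_GBPS = 7.0    # per-SSD sequential bandwidth quoted above
NODE_BW_GBPS = 40.0   # assumed effective per-node CPU/memory I/O budget
                      # (illustrative; real budgets vary by platform)

for ssds in (2, 4, 8, 12):
    raw = ssds * NVME_BW_GBPS
    usable = min(raw, NODE_BW_GBPS)
    print(f"{ssds:>2} NVMe SSDs: media offers {raw:.0f} GB/s, "
          f"node delivers {usable:.0f} GB/s, stranded {raw - usable:.0f} GB/s")
```

With HDDs the disks saturated first; with NVMe SSDs the CPU-centric node saturates first, which is exactly why the control architecture has to change.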
What characteristics should a flash-native storage architecture have in the all-flash era? All-flash media will come in diversified forms, such as NVMe SSDs and NAND boards. A single node can fully release the performance of all-flash media only when the following four aspects are addressed in flash-native architecture innovation:
1. RDMA network communication
2. I/O scheduling
3. Simplified software stack
4. TOE/DTOE
Huawei OceanStor Pacific provides the following optimizations in these four aspects:
The flash-native architecture must still support RDMA technology (such as RoCE). Otherwise, the primary principle of the flash-native architecture cannot be met: without RDMA, a storage system's per-request latency is dominated by network latency, and the IOPS or OPS of a single NVMe SSD cannot be fully utilized.
x86 and Arm processors are mostly multi-core. A storage node is typically configured with one or more CPUs to process I/O requests, and each CPU has many cores (for example, 10 or 20). The storage software stack must proactively plan how CPU cores are used and how I/Os are scheduled across them to release the performance of flash media; otherwise, disordered scheduling leads to low bandwidth or IOPS on a single storage node, and flash media performance cannot be fully utilized. To solve this problem, OceanStor Pacific introduces CPU core grouping and I/O task classification. Different I/O types are assigned different priorities, and I/Os with higher priorities (for example, read and write requests from applications) are allocated to groups with more cores. The number of cores in each CPU core group is adjusted dynamically based on system load. See Figure 3.
Because CPU core groups and I/O task priorities are scheduled dynamically, the system adapts automatically to foreground application I/O pressure and background tasks. When foreground pressure increases, the resources allocated to background I/Os are reduced, protecting application performance; when foreground pressure drops, more resources are given to background I/Os to accelerate background tasks.
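The scheme described in the two paragraphs above can be sketched in a few lines of Python. Everything here is illustrative: the priority classes, the rebalancing threshold, and the 75% initial split are invented for the example, and `os.sched_setaffinity` (the standard Linux call for pinning a process to a core set) stands in for whatever internal mechanism OceanStor Pacific actually uses.

```python
import os, heapq
from itertools import count

# Invented priority classes: lower value = higher priority.
PRI_FOREGROUND = 0   # application reads/writes
PRI_BACKGROUND = 1   # e.g. rebuild, scrubbing, garbage collection

class CoreGroups:
    """Partition CPU cores between foreground and background I/O workers."""
    def __init__(self, cores):
        self.cores = list(cores)
        self.split = len(self.cores) * 3 // 4   # start: 75% of cores to foreground

    def rebalance(self, fg_queue_depth):
        # Invented policy: under heavy foreground pressure, grow the
        # foreground group; otherwise return cores to background work.
        target = len(self.cores) - 1 if fg_queue_depth > 64 else len(self.cores) * 3 // 4
        self.split = max(1, target)

    def pin_worker(self, pid, priority):
        group = self.cores[:self.split] if priority == PRI_FOREGROUND \
                else self.cores[self.split:] or self.cores[-1:]
        os.sched_setaffinity(pid, group)        # Linux-only core pinning

# Priority queue of pending I/Os: foreground requests always drain first.
seq = count()
pending = []
heapq.heappush(pending, (PRI_BACKGROUND, next(seq), "scrub extent 17"))
heapq.heappush(pending, (PRI_FOREGROUND, next(seq), "app write 4K"))
heapq.heappush(pending, (PRI_FOREGROUND, next(seq), "app read 1M"))

groups = CoreGroups(range(os.cpu_count()))
groups.rebalance(fg_queue_depth=sum(p == PRI_FOREGROUND for p, _, _ in pending))
while pending:
    pri, _, io = heapq.heappop(pending)
    print("foreground" if pri == PRI_FOREGROUND else "background", "->", io)
```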
The OceanStor Pacific storage system uses a fully symmetric architecture in which any node can execute all of the functional logic of the distributed cluster. A single node therefore inevitably hosts a large number of threads and cross-module context switches, so simplifying the I/O software stack is essential for storage nodes to maximize flash performance. When a large number of pthreads are created, the default Linux pthread model suffers from low scheduling efficiency and poorly controlled thread-stack memory allocation. OceanStor Pacific addresses these problems with a new Light Weight Thread (LWT) model that can schedule large numbers of I/O threads while precisely controlling their memory usage. In addition, OceanStor Pacific uses a Scatter Gather List (SGL) memory management mechanism to avoid memory copies across context switches, improving I/O efficiency.
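As an illustration of the SGL principle (not of OceanStor Pacific's internal implementation), ordinary POSIX vectored I/O shows the same idea: `os.writev` hands the kernel a list of separate buffers that are written out in order, so the caller never has to concatenate them into one contiguous copy first.

```python
import os, tempfile

# Three separate buffers, e.g. header, payload, and checksum produced by
# different modules. A scatter-gather list lets them go out as one logical
# write without first memcpy'ing them into a single contiguous buffer.
header   = b"HDR:"
payload  = b"...data..."
checksum = b":CRC"

fd, path = tempfile.mkstemp()
try:
    written = os.writev(fd, [header, payload, checksum])  # gather write (Unix)
    os.lseek(fd, 0, os.SEEK_SET)
    print(written, "bytes:", os.read(fd, written))
finally:
    os.close(fd)
    os.unlink(path)
```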
OceanStor Pacific also places frequently changed metadata and infrequently changed data in separate media areas, reducing the garbage collection (GC) impact caused by metadata churn. See Figure 4.
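A toy placement policy makes the idea concrete. The region boundaries and block sizes below are invented for illustration; the point is only that metadata churn stays confined to its own area, so the GC it triggers does not disturb the data region.

```python
# Toy placement policy: route hot metadata and cold data to separate
# regions of the medium. Offsets and sizes are invented for illustration.
META_REGION = (0, 1 << 20)          # first 1 MiB reserved for metadata
DATA_REGION = (1 << 20, 1 << 30)    # bulk data region starts at 1 MiB

_cursors = {"meta": META_REGION[0], "data": DATA_REGION[0]}

def allocate(kind, size):
    """Return the offset where a block of `kind` should be placed."""
    region = META_REGION if kind == "meta" else DATA_REGION
    off = _cursors[kind]
    if off + size > region[1]:
        raise RuntimeError(f"{kind} region full")
    _cursors[kind] = off + size
    return off

print("inode update ->", allocate("meta", 256))    # lands in metadata region
print("file extent  ->", allocate("data", 1 << 16))  # lands in data region
print("inode update ->", allocate("meta", 256))    # churn stays in its region
```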
To further streamline the I/O path, OceanStor Pacific uses smart NICs that support a Direct TCP Offload Engine (DTOE). When a standard NIC receives a client request, it cannot interpret the request itself; it raises an interrupt to the kernel, which handles the interrupt and then notifies the application to process the network data. The whole process involves switches between user mode and kernel mode and multiple context switches, making I/O processing inefficient. With DTOE, client requests are processed on the smart NIC, which removes redundant operations such as interrupts and user/kernel mode switches from the traditional I/O path, greatly simplifying I/O processing. See Figure 5.
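DTOE itself lives in NIC hardware and cannot be reproduced in a short program, but the overhead it removes is easy to observe. The rough micro-benchmark below (an analogy, not a measurement of any OceanStor Pacific path) times many small `recv` calls over a local socket pair; each call is one of the user/kernel round trips that an offloaded path avoids.

```python
import socket, time

N_BYTES = 64_000                  # small enough to fit in the socket buffer
a, b = socket.socketpair()
a.sendall(b"x" * N_BYTES)

buf = bytearray(64)
calls, received = 0, 0
t0 = time.perf_counter()
while received < N_BYTES:
    received += b.recv_into(buf)  # each call is one user/kernel round trip
    calls += 1
dt = time.perf_counter() - t0
print(f"{calls} recv() calls in {dt*1e3:.1f} ms "
      f"(~{dt/calls*1e6:.2f} us per kernel crossing)")
a.close(); b.close()
```

Multiplied across millions of small requests per second, this per-call cost is the inefficiency that processing requests on the smart NIC eliminates.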
To embrace the all-flash era, the data storage system, which is the foundation of IT infrastructure, must be fully innovated to meet the requirements of flash media. OceanStor Pacific scale-out storage offers such innovation, delivering high bandwidth and IOPS with low latency. That's why more and more customers, such as a leading private bank in Türkiye and a well-known large bank in Singapore, are choosing the OceanStor Pacific flash model to power their core business applications. As intelligent SSDs and high-throughput network communication buses mature over the coming years, OceanStor Pacific will continue to innovate to meet the performance needs of new applications in the yottabyte data era.
Disclaimer: The views and opinions expressed in this article are those of the author and do not necessarily reflect the official policy, position, products, and technologies of Huawei Technologies Co., Ltd. If you need to learn more about the products and technologies of Huawei Technologies Co., Ltd., please visit our website at e.huawei.com or contact us.