User-Level Parallel File I/O

Parallel disk I/O subsystems are becoming more important in today's large-scale parallel machines. Parallel disk systems provide a significant boost in I/O performance, reducing the gap between processor and disk speeds. We describe a Unix-like file I/O user interface, implemented in a parallel file I/O subsystem on an MIMD machine, the nCUBE 2. Based on message passing, we develop parallel disk read/write algorithms that achieve higher parallelism as more disk drives are used. We use a closed queuing network model to analyze the effect of several tunable system parameters of our parallel file system. We then perform simulation experiments to obtain more realistic performance data, comparing the original vendor-supplied file system with ours. The results indicate that the speedup in I/O performance is almost equal to the number of disk drives used. Thus, our user-level parallel file I/O approach provides scalable I/O performance.


Figure 1: An nChannel I/O Board
To understand the multiple-channel configuration, consider the following example. We use 8 I/O nodes on an I/O board with logical number 3. These 8 I/O nodes are shared by the 1024 compute nodes in the hypercube through 64 distinct SI channels. For example, I/O node 2 uses its channel 0 to connect to compute node 11, channel 1 to compute node 75, channel 2 to compute node 139, and so on. An SI node is a compute node directly connected to its corresponding I/O node via an SI channel. All other nodes (non-SI nodes) route their I/O messages to their nearest target SI node, which then forwards them to the corresponding I/O node. The path from a non-SI node to an SI node follows the e-routing scheme [NCUBE, 1992]. The nCUBE 2 provides 8 different channels to increase the bandwidth; nevertheless, the disadvantage is that several compute nodes share the same SI node.
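The compute nodes in this example are 64 apart (11, 75, 139, ...), one per channel. A tiny sketch of that mapping, assuming the stride-of-64 pattern suggested by the example holds in general and leaving the board- and node-dependent base as an input:

```c
/* Hypothetical sketch of the SI-channel mapping suggested by the example
 * above (I/O node 2: channel 0 -> compute node 11, channel 1 -> 75,
 * channel 2 -> 139).  "base" is the compute node reached over channel 0;
 * how it depends on the board and I/O node number is not spelled out here. */
int si_compute_node(int base, int channel)
{
    return base + 64 * channel;   /* 64 distinct SI channels per I/O board */
}
```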

Parallel File I/O Mechanism
The disk driver and file system interface "nsdisk" runs on each I/O node and manages files. It can be viewed as a separate file system process that controls the attached disks. These nsdisks receive messages from compute nodes and perform the I/O jobs. nsdisk is a UNIX-like file system that uses file descriptor tables to keep the information about open files for each process. On the compute node side, the process only keeps a file table structure containing the address of the I/O process (e.g., nsdisk) servicing the file and the file descriptor used by that I/O process. Each separate file in an I/O process has its own file descriptor. A compute process can thus be viewed as a client process and an I/O process as a server process. We may create separate file descriptors in each I/O node to access several files simultaneously, and a compute process can then refer to these different files as a logically single, large file to achieve parallel I/O. Figure 2 illustrates the file structure, with process A opening two files, one in I/O node 0 and the other in I/O node 1, while process B opens a file in I/O node 0. The UFID (User File ID) is used as the file descriptor in the compute node, and the SFID (System File ID) is used in the I/O node. The Vertex operating system, the vendor-supplied OS for the nCUBE 2, allocates two UFIDs (0 and 2) for process A, which receives SFIDs 0 and 1 from I/O node 0 and I/O node 1, respectively. Process B has UFID 1 and SFID 2 in I/O node 0.

The problem of multiple disks on separate computing entities has been studied for distributed file systems [Reddy&Banerjee, 1989]; NFS (Network File System) from Sun is an example of such a file system [Leffler et al., 1989]. Note that any one file is constrained to reside on a single disk, which clearly causes a bottleneck at the I/O node if data are required concurrently by several compute nodes. The vendor-supplied nCUBE 2 file system is based on NFS. Each I/O node uses the nsdisk file server process to manage its independent file system. The nCUBE 2 uses the disk volume name as a prefix of the path name to open a file, e.g., "//df00/...". A disk's volume name is unique across the whole nCUBE 2 file system, so the user process can decide to which I/O node an I/O message should go. Although a file is constrained to reside on a single disk, we can still use the data declustering technique [Reddy&Banerjee, 1989] to open several files, one on each disk, and distribute the data over these different files.
I/O in the nCUBE 2 file system is based on message passing and includes both asynchronous and synchronous services. The nCUBE 2 library provides read and write function calls (the same as the UNIX read and write calls) for synchronous block I/O operations. The asynchronous function calls are readstart/readend and writestart/writeend: readstart(file_descriptor, length_of_buffer), readend(file_descriptor, buffer), writestart(file_descriptor, buffer, length_of_buffer), and writeend(file_descriptor). readstart requests a read operation of size length_of_buffer from file descriptor file_descriptor; readend then fills the data buffer with the data requested by readstart, using the same file descriptor argument as the previous readstart call. If there are several readstarts with the same file descriptor, readend chooses the most recent request. writestart and writeend follow the same calling convention. In our implementation of parallel I/O, we use the nread or nreadp function call [NCUBE, 1992] to select messages from the I/O nodes. Each message contains the I/O node number, the data and the returned status, so we can easily determine which disk I/O has finished.
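As a sketch of how these calls combine (following the calling conventions quoted above; the file descriptors, buffer size and extern declarations are illustrative only), a compute process can start reads against files served by different I/O nodes and then collect the data, letting both disks work in parallel:

```c
#define NBYTES 16384   /* illustrative request size */

/* Prototypes follow the calling conventions quoted above; the real
 * declarations come from the nCUBE 2 library headers. */
extern int readstart(int file_descriptor, int length_of_buffer);
extern int readend(int file_descriptor, char *buffer);

static char buf0[NBYTES], buf1[NBYTES];

void overlapped_read(int fd0, int fd1)
{
    readstart(fd0, NBYTES);   /* request NBYTES from the file behind fd0   */
    readstart(fd1, NBYTES);   /* request NBYTES from the file behind fd1   */
    readend(fd0, buf0);       /* complete the first request, filling buf0  */
    readend(fd1, buf1);       /* complete the second request, filling buf1 */
}
```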

Design and Analysis of the nCUBE 2 Parallel File I/O
We describe the design of our Unix-based parallel file I/O system and then provide an abstract model (a closed queuing network) in order to analyze its expected behavior. We present the predicted performance based on this model in this section; the experimental results obtained via simulation are presented and discussed in the subsequent section. These two sets of data should be viewed in parallel.
A parallel I/O file system must distribute data among multiple disk devices. The distribution should be as even as possible; otherwise disk skew [Kim, 1986] occurs, which degrades parallel access and hence decreases overall I/O performance. When a block b is written, the placement rule determines on which disk, and at which local position, block b resides; the rule should avoid disk skew. We choose disk interleaving: for a file striped across disk drives numbered 0 to D-1 (D being the total number of disk drives), block b (blocks numbered from 0) resides on the disk drive numbered b mod D. A one-line sketch of this rule follows.
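In code, the placement rule is a one-liner. The local block index within the chosen drive's file, b / D, is the usual companion of the modulo rule and is our assumption here:

```c
/* Block b (numbered from 0) of a file striped over D disk drives. */
int disk_of_block(int b, int D) { return b % D; }   /* drive that holds block b                */
int local_block(int b, int D)   { return b / D; }   /* assumed index within that drive's file  */
```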
The nCUBE 2 uses independent disk controllers attached to separate processors (I/O nodes) in the interconnection network. In interleaving data over n disk devices, the goal is to reduce the read/write access time to 1/n of its original value. However, an I/O speedup of n is not easy to achieve unless we disregard contention in the interconnection network. The connectivity therefore plays an important role in a multiprocessor machine.
We use the term PIO File for a logically single, large file whose blocks are declustered among small files on the disk drives of the I/O nodes. We index the first block on the first disk drive by 0. For each PIO File we keep the following fields (a structure sketch follows this list):

• offset: the PIO File pointer position in bytes; the file pointer always points to the byte that is ready to be read or written.

• For each open file on disk drive D_k, k ∈ {0, ..., D-1}:
  F_k: the file descriptor of the file on D_k, which is required by Vertex to provide the information needed to send the I/O operation message to the file server.
  N_k: the address of the I/O node where D_k resides.

Each I/O node can have several disk drives attached; the number depends on the disk controller connected to the I/O node. Some disk controllers, such as the SCSI disk interface, can control up to seven disk drives, but others may not. However, an I/O node handles only one incoming I/O message at a time and then issues the physical disk operation through the disk controller. For example, suppose disk drives D_0 and D_1 are both attached to I/O node N, and Vertex issues an I/O operation to I/O node N for D_1 and then another one for D_0. No matter how long the I/O on D_1 takes, the file system of I/O node N guarantees that the acknowledge (ACK) message is sent first for D_1 and then for D_0.
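A minimal sketch of this PIO File bookkeeping in C (the field and constant names are ours; Vertex file descriptors and node addresses are represented as plain ints); later sketches refer back to it:

```c
#define MAX_DISKS     8     /* illustrative limits, not prescribed by the paper */
#define MAX_PIO_FILES 16

/* One PIO File table entry, indexed by the user PIO File descriptor. */
struct pio_file {
    long offset;            /* PIO File pointer position, in bytes             */
    int  ndisks;            /* D: number of disk drives holding this PIO File  */
    int  F[MAX_DISKS];      /* F_k: file descriptor of the file on drive D_k   */
    int  N[MAX_DISKS];      /* N_k: address of the I/O node where D_k resides  */
};

static struct pio_file pio_table[MAX_PIO_FILES];
```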
The interconnection network is based on e-routing and wormhole routing [Ni&McKinley, 1993]. The path from one node to another is fixed by e-routing, and wormhole routing guarantees that a second message from node N is blocked until the first one has finished. In the PIO File table we also need to keep the addresses of the I/O nodes where the disk drives reside, because an I/O ACK message only contains the I/O node address and the returned status of the I/O operation. We need this I/O node address to search our bookkeeping data structure (the message-queue linked list) for the matching queued request. When an I/O request is issued through Vertex, we append the information related to this request to the end of the message-queue linked list; when an I/O ACK message arrives, we retrieve the information of the corresponding queued request. Two primitives are used in our algorithms, namely ASYNCHRONOUS DISK READ/WRITE and READ ACK FROM I/O NODES (see Figure 3).
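A minimal sketch of this message-queue linked list and the match-by-I/O-node lookup, assuming the fields recorded by ASYNCHRONOUS DISK READ/WRITE below (names and types are ours):

```c
#include <stdlib.h>

/* One outstanding asynchronous request, appended by ASYNCHRONOUS DISK
 * READ/WRITE and consumed by READ ACK FROM I/O NODES. */
struct pending_req {
    int    disk;      /* D_k */
    int    fd;        /* F_k */
    int    io_node;   /* N_k: used to match the ACK message */
    char  *buffer;
    long   offset;
    int    nbytes;
    int    is_read;   /* 1 for a read request, 0 for a write */
    struct pending_req *next;
};

static struct pending_req *queue_head;

/* Append a request at the end of the message-queue linked list. */
void enqueue_request(struct pending_req *r)
{
    struct pending_req **pp = &queue_head;
    while (*pp)
        pp = &(*pp)->next;
    r->next = NULL;
    *pp = r;
}

/* Unlink and return the oldest queued request for the given I/O node,
 * or NULL if none is outstanding. */
struct pending_req *match_ack(int io_node)
{
    struct pending_req **pp = &queue_head;
    while (*pp) {
        if ((*pp)->io_node == io_node) {
            struct pending_req *r = *pp;
            *pp = r->next;
            return r;
        }
        pp = &(*pp)->next;
    }
    return NULL;
}
```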

PIO OPEN, PIO CLOSE:
Because each I/O node contains an independent file system, we use one file on each disk drive to store the declustered block data. PIO OPEN returns a user PIO File descriptor (pfd) to be used in successive I/O operations, and PIO CLOSE closes all the involved files on the disk drives. However, when processes are terminated, Vertex closes all of the user processes' files in any case. The algorithms of PIO OPEN and PIO CLOSE are given in Figure 4.
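A hedged sketch of PIO CLOSE on top of the PIO File table sketched earlier (the function name, the pio_table array, and the use of the ordinary UNIX close call on F_k are our assumptions; the paper's actual algorithm is in Figure 4):

```c
#include <unistd.h>   /* for close() */

/* Sketch only; struct pio_file and pio_table[] are from the earlier sketch. */
int pio_close(int pfd)
{
    struct pio_file *p = &pio_table[pfd];
    int k, rc = 0;

    for (k = 0; k < p->ndisks; k++)    /* close the file on every drive D_k  */
        if (close(p->F[k]) < 0)
            rc = -1;
    p->ndisks = 0;                     /* mark the PIO File table entry free */
    return rc;
}
```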

• ASYNCHRONOUS DISK READ/WRITE
    { issue an asynchronous disk read or write on F_k for drive D_k at I/O node N_k; the current file position is offset }
    if the disk operation is a read then
        { read nbytes of data from D_k into the user space buffer }
        call readstart(F_k, nbytes);
    else
        { write nbytes of data from the user space buffer to D_k }
        call writestart(F_k, buffer, nbytes);
    insert the information of D_k, F_k, N_k, buffer, offset, nbytes at the end of the message-queue linked list;

Whenever an ACK message arrives from some disk drive, the request for the next data block belonging to that disk drive is issued.
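A corresponding sketch of READ ACK FROM I/O NODES, again only an outline: read_next_ack() stands in for the nread/nreadp message selection described earlier (its exact vendor signature is not reproduced here), and the is_read flag and completion handling are our assumptions.

```c
#include <stdlib.h>   /* for free() */

/* struct pending_req and match_ack() are from the message-queue sketch above. */
extern int read_next_ack(int *io_node, int *status);  /* placeholder for nread/nreadp */
extern int readend(int fd, char *buffer);
extern int writeend(int fd);

void read_ack_from_io_nodes(void)
{
    int io_node, status;
    struct pending_req *r;

    read_next_ack(&io_node, &status);   /* wait for one I/O ACK message          */
    r = match_ack(io_node);             /* find the queued request it belongs to */
    if (r == NULL)
        return;                         /* no outstanding request for this node  */

    if (status >= 0) {
        if (r->is_read)
            readend(r->fd, r->buffer);  /* complete the read into user space     */
        else
            writeend(r->fd);            /* complete the write                    */
    }
    free(r);
}
```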

Figure 4: Algorithms PIO OPEN and PIO CLOSE
In Figure 5, the predefined parameter C is the number of blocks in the communication buffer for each disk drive. If we set C to 2 instead of 1, PIO READ and PIO WRITE initially issue two disk operations to each disk drive, which keeps the disk drives busier. However, the product of the block size, the number of disk drives, and C must not exceed the size of the communication buffer, which is allocated in (and consumes) user memory space.
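Figure 5 itself is not reproduced above, so the following is only a sketch of the issue-and-refill structure it describes: C requests per drive are issued up front (subject to the communication-buffer limit), and each acknowledgement triggers the request for the next block that belongs to the same drive. issue_async() and wait_ack() stand in for the two primitives of Figure 3, and block numbering follows the b mod D placement rule.

```c
/* Sketch of the PIO READ/WRITE driving loop (not the paper's Figure 5).
 * issue_async(k, b) stands for ASYNCHRONOUS DISK READ/WRITE of block b on
 * drive k; wait_ack() stands for READ ACK FROM I/O NODES and returns the
 * drive whose request just completed.  MAX_DISKS is from the earlier sketch. */
extern void issue_async(int drive, int block);
extern int  wait_ack(void);

void pio_rw(int D, int C, int nblocks, int block_size, int comm_buffer_size)
{
    int next_block[MAX_DISKS];   /* next PIO File block to request per drive */
    int k, c, done = 0;

    /* The pipelining depth C must respect the communication buffer size. */
    if (block_size * D * C > comm_buffer_size)
        return;

    for (k = 0; k < D; k++)
        next_block[k] = k;       /* block b resides on drive b mod D */

    /* Prime every drive with up to C outstanding requests. */
    for (c = 0; c < C; c++)
        for (k = 0; k < D; k++)
            if (next_block[k] < nblocks) {
                issue_async(k, next_block[k]);
                next_block[k] += D;
            }

    /* Steady state: each completion triggers the next block on that drive. */
    while (done < nblocks) {
        k = wait_ack();
        done++;
        if (next_block[k] < nblocks) {
            issue_async(k, next_block[k]);
            next_block[k] += D;
        }
    }
}
```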

PIO SEEK:
To allow random access of files, we also allow resetting the file pointer for the PIO File. The calling convention of PIO SEEK (code omitted here) is the same as the UNIX lseek system call.
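Given the lseek calling convention, PIO SEEK reduces to resetting the PIO File offset kept in the table sketched earlier. A hedged outline (SEEK_END, which needs the PIO File length, is omitted):

```c
#include <unistd.h>   /* SEEK_SET, SEEK_CUR */

/* Sketch only; pio_table[] is from the earlier PIO File table sketch. */
long pio_lseek(int pfd, long offset, int whence)
{
    struct pio_file *p = &pio_table[pfd];

    if (whence == SEEK_SET)
        p->offset = offset;
    else if (whence == SEEK_CUR)
        p->offset += offset;
    else
        return -1;                /* SEEK_END handling omitted in this sketch */

    return p->offset;             /* new PIO File pointer position, in bytes  */
}
```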

Figure 5: Algorithm Parallel File I/O Read/Write
In PIO WRITE/READ, if D is the number of disk drives and C is the number of blocks in the communication buffer per disk drive, then there are at most D*C active messages in the whole system, assuming no other jobs are present. Each message starts from the compute node and goes through the interconnection network to the file server; after the I/O is serviced, the file server sends another message (the ACK) back to the compute node, which can be viewed as carrying the return value back. In the middle of PIO WRITE/READ, the number of active messages is always D*C: whenever a message comes back, it triggers another message to the same destination. Of course, this no longer holds when PIO WRITE/READ is almost finished, because some disk drives may have completed all of their corresponding data blocks and become idle. However, if we only consider very large files, the system almost always has D*C active messages running through it.
This kind of workload suggests the closed queuing network model [Jain, 1991], which has no external arrivals to or departures from the system. The jobs (messages) in the system keep circulating from one queue to the next, and the total number of jobs in the system is constant. Each job goes through several service centers to receive system resources and then re-circulates. Service centers may be of two types, queuing centers and delay centers. Jobs at a queuing center compete for the use of the server, so the time spent by a job at a queuing center has two components: waiting time and service time. Queuing centers represent any system resource for which jobs compete, e.g., the interconnection network and the file system servers. At a delay center, each job is allocated its own server, so there is no competition for service and the residence time of a job at a delay center is exactly its service time. We consider messages as jobs: each message must receive service from the system resources and then return to the same compute node. We use PIO WRITE or PIO READ to predict the performance and to show how influential contention in the interconnection network is; this also yields useful results about how to choose the three parameters D, B (the block size) and C.

The next stage is the interconnection network, where the e-routing scheme defines the paths that I/O request messages take to the disk drives. If two I/O request messages use the same DMA channel to route their respective messages, one waits until the other finishes its routing; because of the shared channel, these two paths overlap and are said to be channel-joint. Paths that share no common channel are said to be channel-disjoint. Channel-disjoint paths are viewed as separate queuing service centers, and each I/O request message goes through its queuing service center to its corresponding I/O node. For example, if compute node 3 wants to write three data blocks to I/O nodes 1, 2 and 3 through compute nodes 7, 11 and 15, respectively, the three paths are listed in Table 1.

Theorem: If paths leaving a common source node use different channels for their first hop, then they are channel-disjoint.

Proof: Consider a source x = (x_{n-1} x_{n-2} ... x_j ... x_i ... x_1 x_0)_2 and two destinations y = (y_{n-1} y_{n-2} ... y_i ... y_1 y_0)_2 and z = (z_{n-1} z_{n-2} ... z_j ... z_1 z_0)_2. Let i be the right-most bit position in which x and y differ, and j the right-most bit position in which x and z differ. If i ≠ j, assume without loss of generality that i < j. The source node then uses channel i to route first to node (x_{n-1} x_{n-2} ... x_j ... x̄_i ... x_1 x_0)_2 on the way to y, and channel j to route first to node (x_{n-1} x_{n-2} ... x̄_j ... x_i ... x_1 x_0)_2 on the way to z (x̄_i denotes the complement of x_i). In e-routing, a message travels in ascending order of the dimensions. Therefore, the nodes through which the path from x to y passes all have least significant bits x̄_i x_{i-1} ... x_1 x_0, while the nodes through which the path from x to z travels have least significant bits x_i x_{i-1} ... x_1 x_0. These are all different nodes, so the two paths cannot use the same channel in their respective routings. Therefore, if paths use different channels to take off from a common source, they are channel-disjoint; if there are m different channels, we obtain m channel-disjoint routings.

We use this closed queuing network model [Jain, 1991] to derive our predicted performance measurements.
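The condition in this theorem is straightforward to test. The following helper (ours, not from the paper) computes the first e-routing channel used from a source, i.e., the right-most bit position in which source and destination differ, and checks two destinations for channel-disjointness:

```c
/* First channel (dimension) used by e-routing from src to dst: the
 * right-most bit position in which they differ; -1 if src == dst. */
int first_channel(unsigned src, unsigned dst)
{
    unsigned diff = src ^ dst;
    int i;

    if (diff == 0)
        return -1;
    for (i = 0; (diff & 1u) == 0; i++)
        diff >>= 1;
    return i;
}

/* By the theorem above, two e-routing paths leaving the same source are
 * channel-disjoint whenever their first channels differ. */
int channel_disjoint(unsigned src, unsigned dst1, unsigned dst2)
{
    return first_channel(src, dst1) != first_channel(src, dst2);
}
```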
We obtained the following system parameters using monitor programs to estimate the service times (service demands) of the service centers; these times do not include any time spent waiting for service.
• CPU Service Center (Stage 1): includes the memory copy of data and the computation of the placement rule.

We used different data block sizes (4 Kbytes, 8 Kbytes and 16 Kbytes) with the same parameters (eight I/O nodes and only one channel-disjoint routing, i.e., one service center in stage 2) to predict the performance of PIO WRITE, with only one active I/O message for each I/O node. We neglect the influence of the fourth stage because very little time is spent there (15 microseconds). Table 2 lists some performance measurements.

We then use the same parameters as in the above example but focus only on 16 Kbyte data blocks. In the example above, the I/O request messages use only one channel-disjoint routing, so a message needs to wait for another one to complete its transmission. In the following example, we assume that multiple channel-disjoint routings are available. With 4 channel-disjoint routings, the class i I/O request message goes through channel-disjoint routing ⌊i/2⌋; with 8 channel-disjoint routings, the class i I/O request message goes through channel-disjoint routing i. As shown in Table 3, when we use four channel-disjoint routings, the I/O operational transfer rate is almost twice that of one channel-disjoint routing. However, if we add four more channel-disjoint routings (for a total of eight), the I/O operational transfer rate is bounded: there is little difference between four and eight channel-disjoint routings; the residence time for four or eight channel-disjoint routings is 7.38 or 9.17 msec, respectively, compared with one channel-disjoint routing (Table 4). The utilization of each I/O node, and hence the I/O operational transfer rate, is bounded when we add extra channel-disjoint routings. With 8 channel-disjoint routings, although we spend less time in the Interconnection Network stage (9.11 msec compared with 13.09 msec for 4 channel-disjoint routings), we spend more time in the CPU stage. That is why there is no significant difference between 4 and 8 channel-disjoint routings. We conclude that the best parameters are 16 Kbyte data blocks and the double buffer approach.
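For readers who want to reproduce such predictions, the following sketch runs exact Mean Value Analysis, a standard solution technique for closed queuing networks [Jain, 1991], over a small set of queuing centers. The service demands below are placeholders, not the measured values of Table 2, so the printed numbers are illustrative only.

```c
#include <stdio.h>

/* Minimal exact MVA sketch for a closed queuing network: N circulating
 * I/O messages (N = D*C), K queuing service centers visited once per
 * cycle.  Service demands are placeholders, NOT the values of Table 2. */
#define K 3   /* e.g. CPU, one channel-disjoint routing, I/O node file server */

int main(void)
{
    double demand[K] = { 1.0, 2.0, 4.0 };   /* msec per visit, placeholders   */
    double q[K]      = { 0.0, 0.0, 0.0 };   /* mean queue lengths             */
    double r[K];                            /* residence times per center     */
    double X = 0.0, R;
    int    N = 16, n, k;                    /* e.g. D = 8 drives, C = 2       */

    for (n = 1; n <= N; n++) {
        R = 0.0;
        for (k = 0; k < K; k++) {
            r[k] = demand[k] * (1.0 + q[k]);   /* wait behind queued jobs     */
            R += r[k];
        }
        X = n / R;                             /* throughput (messages/msec)  */
        for (k = 0; k < K; k++)
            q[k] = X * r[k];                   /* Little's law per center     */
    }
    printf("throughput %.3f msgs/msec, cycle time %.2f msec\n", X, N / X);
    return 0;
}
```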
If we can find at least four channel-disjoint routings, the parallel file I/O system attains almost optimal behavior, namely an I/O operational transfer rate of more than 5 Mbytes/second over 8 I/O nodes.

Experimental Results
We implemented the user-level parallel file I/O system described in the previous section, using the message-passing mechanism and multiple file servers in the I/O nodes to let the user take advantage of the multiple disk drives provided by a large-scale parallel machine, here the nCUBE 2. We used one real compute node in the hypercube and eight I/O nodes connected to the hypercube through multiple channels. The nCUBE 2 provides better multiple-channel connectivity than the single channel to an I/O node provided by the iPSC/2 CIO. If we centralize I/O activity in compute nodes that have more channel-disjoint routings to the I/O nodes, these compute nodes can then send file data to those with few channel-disjoint routings. Since our approach applies equally to other large-scale machines, it can be used in general message-passing MIMD systems to obtain scalable I/O performance.

Conclusion
Our user-level parallel file I/O approach satisfied our goal of achieving significant I/O performance and closing the gap between processor and disk speeds. We provide parallel Unix-like file interfaces with which the user can explore further parallel file I/O functionality, such as the mapping mode described in [DeBenedictis&Rosario, 1992]. Our user-level parallel file I/O scales with the hardware architecture: the more disk drives are available, the greater the I/O performance will be (up to a point). Moreover, our approach, developed for the nCUBE 2, can be applied to other large-scale MIMD machines. Although the underlying configuration is important, we can use the Theorem to decide which compute node has the maximum number of channel-disjoint routings to the specific disk drives and then centralize our file I/O there. Our parallel file system is not only suited to sequential files but can also be used for random access, which gives the user a flexible and thorough way to control data, and it is transparent to the user. The user can also apply the I/O minimization approach of [Leiss, 1995] to reduce the number of disk I/O activities; the capability of random access provides a good way to achieve this.