November 2004
This document describes a solution to the problem of handling user and group IDs in GPFS in a multi-cluster environment. We use the term “multi-cluster environment” to describe a setup where several independent GPFS clusters exist, possibly managed by separate organizations, and an ability to mount a GPFS file system across clusters is desired. The specific set of challenges covered here arise from the fact that the user ID (UID) and group ID (GID) space is normally unique to a particular cluster. While a user may have an account on several clusters, this account may have a different set of UIDs/GIDs in every cluster. Rather than implementing and deploying a separate, GPFS-specific, global infrastructure for managing user IDs, names, and accounts, this document describes an interface that allows GPFS to integrate into an existing user registry infrastructure, such as the Globus Security Infrastructure (GSI) used by TeraGrid. The interface is based on a set of user-supplied ID remapping helper functions (IRHF) for performing dynamic UID/GID remapping in GPFS code as transparently as possible to the end user. The goal is to allow users to access a remotely mounted (i.e. mounted from a cluster other than the one where the file system was created) GPFS file system much like a local GPFS file system, without taking a large performance hit due to the extra work involved in UID/GID remapping.
We base our remapping algorithm on the concept of a “file system home cluster”, which is the cluster where the file system was created; the nodes belonging to this cluster are “home nodes”. We will refer to a cluster that is mounting a file system from a home cluster as a “remote cluster”; such a cluster is composed of “remote nodes”. Every GPFS file system belongs to a particular cluster with a given ID space. We choose to treat the ID space of the home cluster of any given file system as the base ID space. All IDs written to disk as GPFS metadata will be the IDs from the home cluster ID space, regardless of which node (home or remote) initiates the I/O, thus allowing for a uniform view of a file system.
In order to map a numeric UID or GID from one cluster to a numeric ID from a different cluster, we rely on an infrastructure of “globally unique names” (GUNs) that exists outside of GPFS. GPFS proper does not have its own self-contained means of remapping IDs. The goal here is to fit into an existing infrastructure to the extent possible and not put an extra burden on cluster administrators by having multiple ID infrastructures. We treat a GUN as an opaque data object that can be translated into a local ID on any node that is a part of a GPFS cluster by a locally supplied IRHF; any one GUN may translate into one and only one local ID, or the translation may fail. The reverse translation, from a local ID to a GUN, may not necessarily be always performed on a one-to-one basis, however, as discussed below. At a minimum, our UID remapping scheme relies on the following two base assumptions:
While in general we deal with the problem of mapping either a UID or a GID, the two obviously cannot always be treated in the same manner. For example, while a set of globally unique user names may exist, there may not necessarily be such a concept for group names. We structured our ID remapping interface so that it can deal with GIDs in a way that could accommodate several possible GID translation scenarios.
In order to implement ID remapping in a performance-efficient manner, GPFS has to internally cache some mapping results. Therefore, we rely on certain properties of ID mappings to implement cache invalidation and expiration, as discussed in Section 5.
Although different clusters can have different user ID spaces, some clusters may share a common UID space, for example separate clusters that reside at the same site. For this reason we use the concept of a “UID domain”. When configuring a cluster, the administrator can specify the UID domain that the cluster belongs to (a globally unique string). It is assumed that no UID mapping will be necessary when mounting file systems between two clusters that belong to the same UID domain. This also allows for more efficient caching of UID mappings when a node in a different domain has two or more file systems mounted out of remote clusters in the same UID domain.
The UID remapping interface is based on two user-supplied programs or scripts for mapping between one or more numerical user or group IDs, as defined on the node where the mapping is invoked, and external names (GUNs). The IRHF programs will be invoked with the following command line syntax:
mmuid2name domain intent nUids nGids
mmname2uid domain intent nUids nGids
As the names suggest, mmuid2name is for mapping numerical IDs to GUNs and mmname2uid for mapping GUNs to numerical IDs. The domain parameter specifies the UID domain that the file system home cluster belongs to. The intent parameter indicates the reason for GPFS invoking the mapping (see below); nUids and nGids are two numbers indicating how many user and/or group names or IDs need to be mapped. The actual list of names or IDs to be mapped will be provided as input on stdin, one per line, where the first nUids lines contain user names or IDs and the next nGids lines contain group names or IDs. GPFS expects the IRHF program to write the result of the mapping to stdout, one ID or name per line. The reason for passing the input ID list via stdin rather than the command line is that the amount of memory available for storing the content of command line arguments is platform-specific and may be too limited.
The intent parameter will allow the user-supplied IRHF program to tailor its behavior to the context in which the mapping is performed, for example, in order to decide how to map groups or how exactly to handle IDs for which no mapping exists on the local node. It will have one of the following three values:
GPFS will attach a UNIX® pipe to stdin/stdout file descriptors of the IRHF program and use these pipes for communication. To avoid possible deadlocks due to pipe buffer limits and potentially obtain better performance from batch processing, we stipulate that both mmuid2name and mmname2uid must always finish reading all input lines before starting to generate output.
The IRHF should exit with return code 0 when no errors were encountered, or with a non-zero return code otherwise. If GPFS receives a non-zero return code from the IRHF, the I/O operation that resulted in the IRHF invocation will be aborted with the EINVAL error code.
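The calling convention above can be illustrated with a minimal Python sketch of an mmuid2name helper. The lookup tables UID_TO_GUN and GID_TO_GUN are hypothetical stand-ins for a real user registry, and writing one output line per input line (empty when no GUN exists) is just one possible convention, assumed here for illustration. The domain and intent arguments are parsed but not used by this simple policy.

```python
import io

# Hypothetical lookup tables; a real helper would query the site's user registry.
UID_TO_GUN = {15001: "/C=US/O=NPACI/OU=SDSC/CN=Jane Doe"}
GID_TO_GUN = {}


def mmuid2name(argv, stdin, stdout):
    """Map numeric IDs to GUNs and return the exit code (0 = success)."""
    try:
        domain, intent = argv[1], argv[2]  # unused by this simple policy
        n_uids, n_gids = int(argv[3]), int(argv[4])
        # Read *all* input before producing any output, as the interface requires.
        lines = stdin.read().splitlines()
        if len(lines) != n_uids + n_gids:
            return 1
        uids = [int(x) for x in lines[:n_uids]]
        gids = [int(x) for x in lines[n_uids:]]
    except (IndexError, ValueError):
        return 1  # a non-zero exit code signals an error to the caller
    # Assumed convention: an empty line stands for an ID without a GUN.
    names = [UID_TO_GUN.get(u, "") for u in uids]
    names += [GID_TO_GUN.get(g, "") for g in gids]
    stdout.write("".join(name + "\n" for name in names))
    return 0


# In-memory demonstration; a deployed helper would pass sys.argv, sys.stdin
# and sys.stdout, and hand the return value to sys.exit().
out = io.StringIO()
rc = mmuid2name(["mmuid2name", "sdsc.edu", "stat", "1", "1"],
                io.StringIO("15001\n2000\n"), out)
print(rc, repr(out.getvalue()))
```

Passing the streams as parameters keeps the mapping logic testable without a running GPFS daemon.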
Below, we describe in more detail the scenarios in which various mappings will be invoked, the meaning of the IDs or names passed to the IRHF, the expected behavior of the IRHF, issues that arise in different scenarios, and options for the IRHF to handle these issues.
There are several distinct scenarios where the issue of ID remapping comes up.
When an application on a remote node accesses a GPFS file system in another cluster (e.g., opens or creates a file), GPFS will map the UID/GIDs of the thread issuing the I/O operation into a set of IDs in the file system home cluster ID space, and execute the I/O request using the remapped IDs. This ensures that permission checking and file ownership will be handled just as if the application were running on a node in the file system home cluster. GPFS will perform this mapping by executing the following operations:
In this process, an IRHF command will be invoked twice: once on the node where the I/O thread is running to map local IDs to global names, and once on a node in the file system home cluster to map global names to local IDs:
If the user attempting to execute the I/O request is properly authorized to do so, as is normally the case, then there will be a GUN associated with the local ID, and the user will have an account in the file system home cluster, and thus both remapping requests will succeed, at least for mapping the UID. Otherwise, mmname2uid should return a set of credentials that correspond to an “invalid” or “nobody” user, as defined on the file system home cluster.
In the environments where GIDs cannot be mapped individually (e.g. because global names are defined for users, but not for groups), the proposed treatment is to replace the entire GID list with that associated with the given user on the home cluster. In other words, in this case, mmuid2name would ignore the GID list and produce a single GUN as output; mmname2uid would be called with a single GUN and return the UID and the GID list associated with the given user’s account in the file system home cluster. This will result in a consistent approach to group permission checking no matter where the application is running (although it will render changing the effective GID, e.g. with newgrp, inconsequential for cross-cluster GPFS I/O).
Note that the exact behavior for mapping groups and for handling unknown users is defined by the IRHF implementation, since GPFS will pass the output from mmuid2name to mmname2uid without interpreting its content; GPFS does not require or expect the number of the returned GUNs to match that given on input. For example, it is up to the user-supplied IRHF how user “nobody” is represented as a GUN or what UID/GIDs are assigned to such a user.
If the entire set of GIDs is to be replaced, as discussed above, the number of supplemental GIDs naturally should not exceed the maximum allowed by the host OS. It should be noted that since different OSes may impose different limits on the maximum number of supplemental GIDs, one should expect the possibility of the GID list being truncated on the remote cluster.
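As a rough illustration of the truncation issue, the following Python sketch caps a GID list at the local limit. The function name is hypothetical, and keeping the first entries of the list is an assumed policy; which groups get dropped in practice depends on the IRHF implementation and the host OS. The SC_NGROUPS_MAX sysconf name is only queried when no explicit limit is given.

```python
import os


def truncate_supplemental_gids(gids, limit=None):
    """Return at most `limit` GIDs, preserving their original order."""
    if limit is None:
        # POSIX sysconf name for the maximum number of supplementary groups.
        limit = os.sysconf("SC_NGROUPS_MAX")
    if limit < 0:
        # The limit could not be determined; leave the list untouched.
        return list(gids)
    return list(gids)[:limit]


# Explicit limit used here so the demonstration does not depend on the host.
print(truncate_supplemental_gids([2001, 2002, 3001, 3002], limit=3))
```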
Since the case of credential remapping for permission checking and file ownership is particularly performance-sensitive, we propose to implement a caching mechanism to avoid costly lookup operations where possible, as described in Section 5.
ID remapping in GPFS is disabled by default, and can be enabled by setting the enableUIDRemap GPFS config option to yes.
Example: On a node that remotely mounts a GPFS file system, mmuid2name is called to map credentials of an I/O thread into GUNs. It is given 1 UID and 3 GIDs as input, and the output is a single GUN corresponding to the given UID:
mmuid2name sdsc.edu credentials 1 3
stdin:
15001
2000
2001
3000
<EOF>
stdout:
/C=US/O=NPACI/OU=SDSC/CN=Jane Doe
<EOF>
On a node in the file system home cluster, mmname2uid is then called to convert the output above into a set of numeric IDs that would compose the I/O thread credentials. The output is the UID of the local account belonging to Jane Doe (30001), and the corresponding set of 4 GIDs (IDs of the groups that Jane Doe belongs to on the home cluster):
mmname2uid sdsc.edu credentials 1 0
stdin:
/C=US/O=NPACI/OU=SDSC/CN=Jane Doe
<EOF>
stdout:
30001
2001
2002
3001
3002
<EOF>
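The two-step exchange in this example can be sketched in Python as follows. This is a minimal illustration of the "replace the entire GID list" policy, not GPFS code: the registry tables, the NOBODY_GUN marker, and the fallback credentials (UID 99, GID 65534) are assumptions chosen for the sketch, and the function names are hypothetical.

```python
# Hypothetical registries, for illustration only.
REMOTE_UID_TO_GUN = {15001: "/C=US/O=NPACI/OU=SDSC/CN=Jane Doe"}
HOME_GUN_TO_CREDS = {
    "/C=US/O=NPACI/OU=SDSC/CN=Jane Doe": (30001, [2001, 2002, 3001, 3002]),
}
NOBODY_GUN = "nobody"           # assumed representation of an unknown user
NOBODY_CREDS = (99, [65534])    # assumed "nobody" credentials on the home cluster


def uid2name_credentials(uid, gids):
    """Remote-node step: the GID list is ignored and a single GUN is produced."""
    return [REMOTE_UID_TO_GUN.get(uid, NOBODY_GUN)]


def name2uid_credentials(guns):
    """Home-node step: one GUN in, the UID followed by the account's GID list out."""
    uid, gids = HOME_GUN_TO_CREDS.get(guns[0], NOBODY_CREDS)
    return [uid] + list(gids)


def remap_credentials(uid, gids):
    """Chain both steps; the GUN list is passed through without interpretation."""
    return name2uid_credentials(uid2name_credentials(uid, gids))


print(remap_credentials(15001, [2000, 2001, 3000]))
```

With the tables above, the call reproduces the example: UID 15001 with three remote GIDs becomes UID 30001 with the four GIDs of the home cluster account.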
When a stat system call is issued, e.g. as a result of the “ls -l” command, GPFS will need to return a set of UID/GID pairs for a set of files. If GPFS simply returns the IDs as stored on disk, i.e. IDs in the home cluster ID space, these IDs may get locally interpreted in a manner that some users may find confusing: “ls -l”, for example, would display those IDs that happen to be present in the local /etc/passwd and /etc/group as symbolic names of local users/groups, and display the rest as numeric IDs. If this is considered to be undesirable, one way to address the problem is to perform a “reverse ID remapping”, i.e. translate UIDs/GIDs from the ID space of the file system home cluster to the local ID space, using a scheme similar to that outlined in Section 4.1. However, we cannot reasonably expect all such remapping requests to succeed, since only a subset of users in any given cluster will normally have both (1) a GUN defined and (2) an account in every cluster where remote GPFS mounts are desired. There is fundamentally no good solution for this problem. If one chooses to remap IDs, then only a small subset of IDs may get translated into meaningful values, with the rest showing up as ‘nobody’, which is probably not very useful in most cases. If one chooses to not remap any IDs, the output of a command such as “ls -l” may be confusing to some users. It should be noted that the latter is the standard semantics for other network file systems, such as NFS or AFS, and users familiar with those environments will not be confused by GPFS following the same semantics; it is also clearly much more efficient than performing any UID remapping. Reverse ID mapping for stat calls is disabled by default, and can be enabled (separately from credentials ID mapping) by setting the enableStatUIDRemap GPFS config option to yes.
If enabled, ID mapping for stat will follow the same steps as for credentials mapping, except in the reverse direction: GPFS will first invoke mmuid2name to map UIDs/GIDs to GUNs on a node in the file system home cluster, and then call mmname2uid to map the GUNs to local UIDs/GIDs on the node where stat was called:
Example: On a node in the file system home cluster, mmuid2name is called to map a set of IDs to GUNs during processing of a set of stat calls on a remote node. It is given a list of 4 UIDs (of which 3 are unique) and 4 GIDs (of which 2 are unique) as input, and only two of the UIDs and none of the GIDs have GUNs associated with them:
mmuid2name sdsc.edu stat 4 4
stdin:
30001
30001
21843
18304
2001
2001
4002
4002
<EOF>
stdout:
/C=US/O=NPACI/OU=SDSC/CN=Jane Doe
/C=US/O=NPACI/OU=SDSC/CN=Jane Doe
/C=US/O=National Center for Supercomputing Applications/CN=John Doe
<empty line>
<empty line>
<empty line>
<empty line>
<empty line>
<EOF>
In this example, mmuid2name produces an empty line for IDs that cannot be mapped to a GUN; however, any other convention can be used to represent unmappable IDs here, since GPFS passes the output of mmuid2name to mmname2uid without interpreting the content of each output line. We only require that mmuid2name produces exactly one output line for each input line. On the remote node (the node where the stat calls originated from), mmname2uid is then called to convert mmuid2name output (accompanied by a set of original numeric IDs from the home cluster ID space) into a new set of numeric IDs corresponding to the local ID space. In this example, the local policy is to remap IDs during stat when a valid mapping exists, and otherwise map IDs to local “nobody” (UID 99) and “nogroup” (GID 65534) IDs. In this case, Jane Doe has a local account, but John Doe does not.
mmname2uid sdsc.edu stat 4 4
stdin:
/C=US/O=NPACI/OU=SDSC/CN=Jane Doe
30001
/C=US/O=NPACI/OU=SDSC/CN=Jane Doe
30001
/C=US/O=National Center for Supercomputing Applications/CN=John Doe
21843
<empty line>
18304
<empty line>
2001
<empty line>
2001
<empty line>
4002
<empty line>
4002
<EOF>
stdout:
15001
15001
99
99
65534
65534
65534
65534
<EOF>
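The local policy used in this example can be sketched in Python as follows. The sketch assumes that each input entry consists of a GUN line (empty when the home cluster had no GUN for the ID) followed by a line with the original numeric ID; that layout is inferred from the example, and the lookup tables and function name are hypothetical.

```python
# Hypothetical local registry: GUN -> local numeric ID.
LOCAL_GUN_TO_UID = {"/C=US/O=NPACI/OU=SDSC/CN=Jane Doe": 15001}
LOCAL_GUN_TO_GID = {}
NOBODY_UID = 99       # local "nobody", as in the example above
NOGROUP_GID = 65534   # local "nogroup", as in the example above


def name2uid_stat(n_uids, n_gids, lines):
    """Map (GUN, original ID) pairs to local IDs, one result per pair."""
    if len(lines) != 2 * (n_uids + n_gids):
        raise ValueError("expected one GUN line and one ID line per entry")
    guns = lines[0::2]  # lines[1::2] hold the original IDs, unused by this policy
    result = []
    for index, gun in enumerate(guns):
        if index < n_uids:
            result.append(LOCAL_GUN_TO_UID.get(gun, NOBODY_UID))
        else:
            result.append(LOCAL_GUN_TO_GID.get(gun, NOGROUP_GID))
    return result


JANE = "/C=US/O=NPACI/OU=SDSC/CN=Jane Doe"
JOHN = "/C=US/O=National Center for Supercomputing Applications/CN=John Doe"
stdin_lines = [JANE, "30001", JANE, "30001", JOHN, "21843", "", "18304",
               "", "2001", "", "2001", "", "4002", "", "4002"]
print(name2uid_stat(4, 4, stdin_lines))
```

A different policy could return the original numeric ID instead of “nobody” for unmapped entries, which is why the original IDs accompany the GUNs on input.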
Since operations for displaying or changing access control lists (ACLs) in GPFS are performed by invoking specific GPFS commands (mmgetacl, mmputacl) rather than system calls that pass numerical UID values, there is no intrinsic need to convert between numerical ID values in two different clusters. Instead, we can directly convert between numerical ID values in the file system home cluster (UID/GID values stored in ACLs on disk) and user or group names used in the output or input of GPFS ACL commands. Unlike for the stat operation, it makes little sense to map IDs not known on a node where an ACL command is issued to “nobody” or “unknown”. For example, running mmputacl with the output from mmgetacl as input should leave the ACL unchanged rather than setting some of the IDs in the ACL (those not known on the node where the commands were issued) to “nobody”. Hence, GPFS handles each ACL command as if the command had been issued on a node in the file system home cluster. When an ACL command is issued for a file that resides on a remote file system, the command will issue a warning stating that it is executing on a remote node, and therefore all symbolic user and group names are the names as known on this node.
Processing a chown or chgrp system call issued on a remote node requires mapping a single UID or GID from the remote ID space to the ID space in the file system home cluster. In environments where GIDs cannot be mapped individually (e.g., where groups do not have global names), it will not be possible to support chgrp() on a remote node. Furthermore, only root is allowed to perform chown, and we expect that root access to remote file systems would normally be disallowed. Therefore, we chose to disallow chown and chgrp operations on remotely mounted GPFS file systems. If needed, a chown/chgrp command can simply be executed in the file system home cluster.
It is of paramount importance to implement ID remapping in a way that does not impose a significant performance penalty. Clearly, performing a full remapping operation on every I/O request is prohibitively expensive. A remapping request necessarily involves invoking the IRHF twice, meaning running an external application, with the associated overhead and the actual mapping cost, and also sending a message to another node, which means additional latency. Altogether, we cannot reasonably expect that a remapping request should be very fast. This dictates the need for some form of caching, so that the majority of the remapping requests on performance-sensitive paths are satisfied out of the cache. We implement two distinct types of caching: client-side caching (numeric ID to numeric ID mapping on remote I/O nodes), and server-side caching (GUN to numeric ID mapping on the node in the file system home cluster where the corresponding part of the remapping was performed).
The basic concept of ID remapping for credentials checking is complicated by the fact that a given local user may be associated with more than one GUN, and this dictates that some assumptions are made. The key assumption that we make is that a given ID to GUN mapping is effective on a node-wide basis (and not, for example, on a session-wide or a process-group-wide basis, so that the mapping can be done with the sole input being a list of UID/GIDs), and the mapping remains valid for at least a moderately long period of time, which would allow us to effectively cache ID remapping results. Note that the full set of environment variables, as set in the user process, will be supplied to mmuid2name to facilitate the task of disambiguating the UID to GUN mapping. However, on different I/O nodes the same user may have different GUNs associated with her at the same time, so ID to ID caching cannot be done outside of the I/O node.
Note that while it makes sense to cache individual ID to ID mappings when each ID is somewhat independent, e.g. when processing a stat request, it is not efficient to take the same approach to caching the results of mappings obtained for the purposes of credentials checking. We chose a more practical approach of caching UID to a set of credentials (UID/GIDs list) mappings in the latter case.
There are two mechanisms available for removing a cached ID remapping entry. First, both client-side and server-side cache entries are automatically removed from cache when they expire. The GPFS configuration parameter uidExpiration specifies the maximum lifetime of a remapping cache entry in seconds. The default value is 36000 seconds (10 hours). Second, any given client-side remapping cache entry can be invalidated using the tsctl command with the uidinvalidate parameter (see Appendix A). Note that explicit server-side cache invalidation should never be necessary. While it is anticipated that a client-side mapping may change somewhat frequently, e.g. when a user decides to switch the identity under which the remote cluster is being accessed (meaning a different UID-to-GUN mapping should be used), the server-side mappings cached on the home cluster should normally be significantly more static. If a user has a given GUN in the home cluster associated with a particular UID (which is the mapping cached on the server side), the GUN-to-UID mapping normally never changes unless the user account is removed, and automatic cache expiration is sufficient to address such a case.
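The two removal mechanisms can be illustrated with a small Python sketch of an expiring cache. This is not the GPFS implementation: the class name, the injectable clock, and the choice of (UID domain, UID) as the cache key are assumptions made for the example. Only the 36000-second default is taken from the text.

```python
import time


class RemapCache:
    """Expiring cache for remapping results (illustrative sketch only)."""

    def __init__(self, expiration_seconds=36000, clock=time.monotonic):
        self._ttl = expiration_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, time the entry was stored)

    def put(self, key, value):
        self._entries[key] = (value, self._clock())

    def get(self, key):
        """Return the cached value, or None if it is missing or has expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[key]  # automatic expiration
            return None
        return value

    def invalidate(self, key):
        """Explicit removal; returns True if an entry was found and removed."""
        return self._entries.pop(key, None) is not None


# Demonstration with a fake clock so that expiration can be shown instantly.
now = [0.0]
cache = RemapCache(clock=lambda: now[0])
key = ("sdsc.edu", 15001)  # assumed key: (home UID domain, local UID)
cache.put(key, [30001, 2001, 2002, 3001, 3002])
print(cache.get(key))   # entry is still fresh
now[0] = 36000.0
print(cache.get(key))   # entry has expired
```

Storing the whole credential list under a single UID key mirrors the approach described above for credentials checking, as opposed to caching each ID individually.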
By default, UID remapping in GPFS is disabled. In order to enable UID remapping, the system administrator should take the following steps:
1) Ensure that the infrastructure to be used for GUN-to-ID mapping is operational on all GPFS nodes.
2) Place mmuid2name and mmname2uid executables in /var/mmfs/etc directory on all GPFS nodes, and ensure that these executables are readable and executable by all users (not just by root).
3) Ensure that GPFS is stopped on all nodes.
4) Enable UID remapping for credentials checking:
mmchconfig enableUIDRemap=yes
Note that UID remapping should be enabled in both home and remote clusters to work correctly.
5) Optionally, enable UID remapping for stat. This step should be carefully considered as it carries an additional performance penalty. See discussion in Sec. 4.2 for an explanation of the benefits and drawbacks involved. The following command enables stat UID remapping:
mmchconfig enableStatUIDRemap=yes
Note that stat UID remapping should be enabled in both home and remote clusters to work correctly.
6) Optionally, change the UID remapping cache expiration timeout (the default value is 36000 seconds):
mmchconfig uidExpiration=NNNN
where NNNN is the expiration timeout in seconds.
7) Optionally, set the UID domain for each cluster. By default, the UID domain name is identical to the cluster name. If some separate clusters share the same UID space, those clusters should have their UID domain name set to the same value. The UID domain can be set either at cluster creation time, using the -U option of the mmcrcluster command, or at a later time using mmchconfig:
mmchconfig uidDomain=DomainName
where DomainName is the name of the UID domain, conforming to the same rules as the cluster name.
8) Start GPFS, perform a mount of a file system owned by a remote cluster, and test UID remapping by accessing the file system as a user for which a valid UID mapping exists.
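Before testing through a mounted file system, it can be useful to exercise a helper by hand using the documented invocation syntax. The Python sketch below is one way to do that. The function name run_irhf is hypothetical, and the demonstration substitutes a trivial echo program for a real helper so that it runs anywhere; on a GPFS node the command would be something like ["/var/mmfs/etc/mmuid2name"].

```python
import subprocess
import sys


def run_irhf(command, domain, intent, users, groups, timeout=30):
    """Invoke a helper as `command domain intent nUids nGids` and return its output lines."""
    argv = list(command) + [domain, intent, str(len(users)), str(len(groups))]
    # All input is written up front and stdin is then closed, so a helper
    # that reads everything before writing cannot deadlock on the pipes.
    payload = "".join(f"{item}\n" for item in list(users) + list(groups))
    proc = subprocess.run(argv, input=payload, capture_output=True,
                          text=True, timeout=timeout)
    if proc.returncode != 0:
        raise RuntimeError(f"helper exited with code {proc.returncode}")
    return proc.stdout.splitlines()


# Stand-in helper for the demonstration: copies stdin to stdout unchanged.
ECHO_HELPER = [sys.executable, "-c",
               "import sys; sys.stdout.write(sys.stdin.read())"]
print(run_irhf(ECHO_HELPER, "sdsc.edu", "stat", [15001], [2000]))
```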
Automatic expiration of UID remapping cache entries may prove to be insufficient in certain environments. The GPFS tsctl command can be used by the system administrator to explicitly invalidate a client-side cache entry for a given user or group ID. Note that this command can be invoked only by the superuser. It is anticipated that this command would be run from a script that establishes or removes an authorization for a given user to access a remote cluster.
tsctl uidinvalidate device {-c | -u | -g } id
device full name of the remote file system: remote cluster name followed by the GPFS device name, separated by a colon. Example:
cluster1.example.org:gpfs0
-c Invalidate the cache entry used for credentials checking. id is the numeric user ID of the local user.
-u Invalidate the cache entry used for stat remapping for the user ID specified by the numeric id.
-g Invalidate the cache entry used for stat remapping for the group ID specified by the numeric id.
Example 1: Invalidate remapping cache entry for the local user with UID 1001, used for credentials remapping requests for the file system cluster2.example.org:gpfs4:
tsctl uidinvalidate cluster2.example.org:gpfs4 -c 1001
Example 2: Invalidate remapping cache entry for the local group with GID 2002, used for stat remapping requests for the file system cluster2.example.org:gpfs4:
tsctl uidinvalidate cluster2.example.org:gpfs4 -g 2002
tsctl will return 0 if the cache entry has been successfully found and removed, and non-zero if any errors were encountered.
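A script that establishes or removes a user's authorization might wrap the command along the following lines. This Python sketch only assembles the argument list from the syntax shown above. The function names and the scope labels ("credentials", "stat-user", "stat-group") are hypothetical, and the wrapper that actually runs tsctl is defined but not called here, since it requires superuser rights on a GPFS node.

```python
import subprocess

# Hypothetical scope labels mapped to the documented option letters.
SCOPE_FLAGS = {"credentials": "-c", "stat-user": "-u", "stat-group": "-g"}


def uidinvalidate_argv(device, scope, numeric_id):
    """Build the argument list for `tsctl uidinvalidate device {-c|-u|-g} id`."""
    if ":" not in device:
        raise ValueError("device must look like clustername:devicename")
    return ["tsctl", "uidinvalidate", device, SCOPE_FLAGS[scope], str(int(numeric_id))]


def uidinvalidate(device, scope, numeric_id):
    """Run the command; True means tsctl returned 0 (entry found and removed)."""
    result = subprocess.run(uidinvalidate_argv(device, scope, numeric_id), check=False)
    return result.returncode == 0


print(uidinvalidate_argv("cluster2.example.org:gpfs4", "credentials", 1001))
```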
© IBM Corporation 2004 IBM Corporation Marketing Communications Systems and Technology Group Route 100 Somers, New York 10589
Produced in the United States of America November 2004
All Rights Reserved
IBM and the IBM logo are trademarks or registered trademarks of International Business Machines Corporation in the United States or other countries or both. A full list of U.S. trademarks owned by IBM may be found at:
http://www.ibm.com/legal/copytrade.shtml.
UNIX is a registered trademark of The Open Group in the United States, other countries or both.
Other company, product, and service names may be trademarks or service marks of others.