Configuring Microsoft Azure for LSF resource connector

Follow these steps to configure Microsoft Azure so that LSF resource connector can make allocation requests on behalf of LSF and create instances. The instances launched from Microsoft Azure join the LSF cluster. If instances become idle, LSF resource connector terminates them.

Before you begin

The Microsoft Azure provider requires IBM® Spectrum LSF Version 10.1, Fix Pack 2, or later.

Before using the resource connector with the Microsoft Azure provider, you must apply the latest LSF Fix Pack, manually move the required configuration files to the appropriate directory under LSF_TOP/conf/resource_connector/<provider_name>/conf, and change the ownership of those new files and directories to the cluster administrator.

For more information about applying Fix Packs to LSF resource connector, see Use the LSF patch installer to update resource connector.

  • You must have root access to the LSF management host.
  • The LSF management host and the compute nodes (Microsoft Azure instances) must be able to communicate with each other.
  • You must be able to restart the LSF cluster.
  • You must be familiar with and have the ability to perform Microsoft Azure administrative operations.
  • The virtual network to be used by Microsoft Azure virtual instances must be configured so that they can communicate with LSF hosts.
    Note: This is not required if you are running a simple test application.
  • You must have the ability to configure the DNS or NIS server and use short host names on Microsoft Azure.

    The default DNS server on Microsoft Azure does not work for LSF because the Microsoft Azure instance's full host name, with domain name, usually exceeds 60 characters, which is too long for LSF.

About this task

LSF resource connector has been tested on the following systems:
  • Linux x86 Kernel 2.6, glibc 2.11 RHEL 6.x
  • Linux x86 Kernel 3.10, glibc 2.17 RHEL 7.x
  • Linux x86 Kernel 3.10, glibc 2.17 CentOS 7.x
  • LSF 10.1.0 Fix Pack 2, or later

In the following steps, you must perform all operations as the Microsoft Azure administrator unless otherwise stated.

For full details on installing LSF, see Installing IBM Spectrum LSF on UNIX and Linux.

To allow user-submitted jobs to run on the instance, the instance must have this user prepared or LSF user mapping configured. For more information about user groups and user account mapping, see Managing Users and User Groups and Between-Host User Account Mapping.

Procedure

  1. Create the Azure Java SDK authentication file.

    Any application that runs on Microsoft Azure must be registered as an Azure application, and set roles to access resources such as images and networks. Create an Azure authentication file to grant access to the LSF resource connector Azure plug-in, and register the application as an Azure application. You can set up access control for the LSF resource connector under multiple subscriptions:

    • To generate a role for the authentication key file, manually:
      1. Edit the Microsoft Azure custom_role.json file to include a custom role that accurately controls access rights for LSF resource connector. List multiple subscriptions under the AssignableScopes section so that the LSF resource connector can work under those subscriptions. Here is an example custom_role.json file, with a custom role named LSF Resource Connector and two subscriptions under AssignableScopes:
        {
        "Name": "LSF Resource Connector",
        "IsCustom": true,
        "Description": "LSF resource connector for Azure, access/create/delete VM RG storage network",
        "Actions": [
        "Microsoft.Storage/*",
        "Microsoft.Network/*",
        "Microsoft.Compute/*",
        "Microsoft.Authorization/*/read",
        "Microsoft.Resources/subscriptions/resourceGroups/*",
        "Microsoft.Resources/deployments/*",
        "Microsoft.Insights/alertRules/*",
        "Microsoft.Insights/diagnosticSettings/*",
        "Microsoft.Support/*",
        "Microsoft.ResourceHealth/availabilityStatuses/read"
        ],
        "NotActions": [
        ],
        "AssignableScopes": [
        "/subscriptions/1db8ceea-a921-4395-9586-6fc87945f8d7",
        "/subscriptions/5db8ceea-a921-4395-9586-6fc87945f8d9"
        ]
        }
      2. Create the custom role in Microsoft Azure, register the LSF resource connector with the Azure application, and assign the custom role to your multiple subscriptions.
        Use the following example as a guide: it creates the LSF Resource Connector custom role, registers the LSF resource connector with the Azure application called MyAzureApp, and assigns the LSF Resource Connector custom role to two subscriptions:
        $ az role definition create --role-definition custom_role.json
        $ az ad sp create-for-rbac -o json -n "MyAzureApp" --role "LSF Resource Connector" \
            --scopes "/subscriptions/1db8ceea-a921-4395-9586-6fc87945f8d7" \
            "/subscriptions/5db8ceea-a921-4395-9586-6fc87945f8d9"
    • To generate a role for the authentication key file, automatically:
      1. Verify that you have Azure Active Directory administrator permissions and that Azure CLI 2.0 is installed.
      2. Log in to the Azure CLI and run the authgen.py script to create the authentication file:
        curl -L https://raw.githubusercontent.com/Azure/azure-libraries-for-java/master/tools/authgen.py | python > my.azureauth
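
For the manual path, a malformed custom_role.json file (for example, a missing comma between properties) can be caught before it is passed to az role definition create. The following is a minimal sketch of such a check, not part of LSF or the Azure CLI. The function name check_custom_role and the set of keys it treats as required are assumptions made for illustration.

```python
import json

# Keys this sketch expects in a custom role file (an assumption based on
# the example custom_role.json file, not an official schema).
REQUIRED_KEYS = ("Name", "IsCustom", "Actions", "AssignableScopes")


def check_custom_role(text):
    """Return a list of problems found in a custom role definition.

    An empty list means that the checks in this sketch passed.
    """
    try:
        role = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err.msg} at line {err.lineno}"]
    if not isinstance(role, dict):
        return ["top-level value is not a JSON object"]

    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in role]
    # Every assignable scope in the example starts with /subscriptions/.
    for scope in role.get("AssignableScopes", []):
        if not str(scope).startswith("/subscriptions/"):
            problems.append(f"unexpected scope: {scope}")
    return problems
```

Calling check_custom_role(open("custom_role.json").read()) returns an empty list when the file parses and every scope has the expected prefix.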

    For more information on how to create the Azure Java SDK authentication file, see the Microsoft Azure website on GitHub: https://github.com/Azure/azure-libraries-for-java/blob/master/AUTH.md

    The new authentication file that the LSF resource connector supports is a Java properties file and contains the following information:

    subscription=########-####-####-####-############
    client=########-####-####-####-############
    tenant=########-####-####-####-############
    key=XXXXXXXXXXXXXXXX
    managementURI=https\://management.core.windows.net/
    baseURL=https\://management.azure.com/
    authURL=https\://login.windows.net/
    graphURL=https\://graph.windows.net/
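
Because the authentication file is a plain Java properties file, its contents can be inspected before the file is given to the LSF resource connector. The following is a minimal sketch that reads the file and reports empty or missing entries. It handles only simple name=value lines and the \: escape shown above, not the full Java properties syntax. The helper names and the choice of which entries to treat as required are assumptions for illustration.

```python
# Entries that this sketch treats as required (an assumption based on the
# sample file above, not an official list).
REQUIRED = ("subscription", "client", "tenant", "key")

SAMPLE = r"""
subscription=00000000-0000-0000-0000-000000000000
client=00000000-0000-0000-0000-000000000000
tenant=00000000-0000-0000-0000-000000000000
key=XXXXXXXXXXXXXXXX
baseURL=https\://management.azure.com/
"""


def parse_auth_properties(text):
    """Parse simple name=value lines into a dictionary."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and properties-style comments.
        if not line or line.startswith(("#", "!")):
            continue
        name, sep, value = line.partition("=")
        if not sep:
            continue
        # Undo the escaped colon that Java writes in URL values.
        props[name.strip()] = value.strip().replace("\\:", ":")
    return props


def missing_keys(props):
    """Return the required entries that are absent or empty."""
    return [name for name in REQUIRED if not props.get(name)]


if __name__ == "__main__":
    print(missing_keys(parse_auth_properties(SAMPLE)))
```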
  2. Create a key pair.

    Azure supports public-key cryptography to secure the login information for an instance. A Linux instance has no password; you use a key pair to securely log in to your instance. You specify the SSH public key when you launch your instance, then use the private key when you log in using SSH.

    LSF needs the content of the public key file to launch Azure instances. The file path of the public key is configured in the azureprov_templates.json file.

    1. Use the ssh-keygen command to create an SSH key pair.
    2. Provide the content of the public key file to LSF to launch Azure instances. Specify the file path of the public key in the azureprov_templates.json configuration file.
  3. Create a resource group for LSF on the Azure Portal or by using the Azure CLI.

    A resource group is a way of grouping your Azure resources together. This resource group specifies the Azure resources that LSF uses.

    For more information on how to create resource groups in Azure, refer to the Microsoft Azure documentation: https://docs.microsoft.com/en-us/azure/azure-resource-manager/create-resource-group-in-template.

  4. Create an Azure Virtual Network (VNet) and subnets in the Azure resource group for LSF.

    An Azure Virtual Network (VNet) is a virtual network that is dedicated to an Azure account.

    For more details on Azure Virtual Networks, refer to the Microsoft Azure documentation: https://docs.microsoft.com/en-us/azure/virtual-network/virtual-networks-overview

    A subnet is a range of IP addresses in your VNet. You can launch Azure resources into a subnet that you select. Specify the name of the subnet (for example, subnet) in the azureprov_templates.json file that is used to launch the instance.

    Create the VNet and subnets in the Azure resource group that you created for LSF.

  5. Create an LSF network security group in the Azure resource group for LSF.

    A network security group (NSG) contains a list of security rules that allow or deny network traffic to resources that are connected to VNets. Add customized rules to open all LSF listening ports to the security group that launches instances. The ports must match those from the existing LSF cluster.

    The following are the default port number values:
    • LSF_LIM_PORT=7869 (TCP and UDP)
    • LSF_RES_PORT=6878 (TCP)
    • LSB_SBD_PORT=6882 (TCP)

    You can also accept all traffic from the LSF management host by adding the IP address of the LSF management host, or you can accept all traffic across the VNet.

    Note: If you allow the traffic only from the default ports, some NIOS commands might not work since the ports are configured for them dynamically and are different from the LSF ports.

    Add the network security group to the Azure resource group that you created for LSF.
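
Because the rules must match the ports of the existing cluster, it can help to derive the list of ports from the cluster's lsf.conf file rather than assuming the defaults. The following is a minimal sketch that starts from the default values listed above and overrides them with any values found in the lsf.conf text. The function name required_rules and the (parameter, port, protocol) tuple layout are assumptions for illustration and do not correspond to any Azure or LSF API.

```python
# Default ports and protocols as listed above.
DEFAULT_PORTS = {"LSF_LIM_PORT": 7869, "LSF_RES_PORT": 6878, "LSB_SBD_PORT": 6882}
PROTOCOLS = {
    "LSF_LIM_PORT": ("TCP", "UDP"),
    "LSF_RES_PORT": ("TCP",),
    "LSB_SBD_PORT": ("TCP",),
}


def required_rules(lsf_conf_text=""):
    """Return (parameter, port, protocol) tuples for the inbound rules.

    Values found in the lsf.conf text override the defaults.
    """
    ports = dict(DEFAULT_PORTS)
    for raw in lsf_conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        name, sep, value = line.partition("=")
        name = name.strip()
        if sep and name in ports:
            ports[name] = int(value.strip().strip('"'))
    return [
        (name, ports[name], protocol)
        for name in DEFAULT_PORTS
        for protocol in PROTOCOLS[name]
    ]


if __name__ == "__main__":
    for rule in required_rules('LSF_LIM_PORT="7869"\nLSF_RES_PORT=6878\n'):
        print(rule)
```

Each tuple corresponds to one inbound rule to add to the network security group.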

  6. Create an instance and install the LSF management host.

    The LSF cluster administrator must manually launch an Azure instance and install LSF on the management host.

    Enabling dynamic hosts is sufficient for this management host configuration. Set ENABLE_DYNAMIC_HOSTS="Y" in the install.config file if you are installing a new LSF management host. If you have an existing LSF management host, manually configure the LSF_DYNAMIC_HOST_WAIT_TIME parameter in the lsf.conf file and the LSF_HOST_ADDR_RANGE parameter in the lsf.cluster.clustername file.

    Tip: You do not have to configure the resource connector parameters at this time. You can configure these parameters after you successfully build the LSF cloud image.

    For example, specify the following parameters in the install.config file:

    LSF_TOP="/opt/lsf"
    LSF_ADMINS="lsfadmin"
    LSF_CLUSTER_NAME="cluster1"
    LSF_MASTER_LIST="lsfmanagement1 lsfmanagement2"
    LSF_ENTITLEMENT_FILE="/root/platform_lsf_std_entitlement.dat"
    LSF_TARDIR="/root/"
    ENABLE_DYNAMIC_HOSTS="Y"
    LSF_DYNAMIC_HOST_WAIT_TIME="2"
  7. Build the LSF cloud image.

    To create an Azure instance image for an LSF cloud compute host, manually launch an Azure instance and install LSF on that instance.

    1. Create the Azure instance and choose the managed disk option when you launch the instance.

      For more information on how to create a Linux virtual machine in the Azure portal, refer to the Microsoft Azure documentation: https://docs.microsoft.com/en-us/azure/virtual-machines/linux/quick-create-portal.

    2. Use the ssh command to log in to the Azure instance and use the Azure Java SDK authentication file that you created.
    3. Copy the LSF packages to the Azure instance:
      • lsf10.1_linux2.6-glibc2.3-x86_64.tar.Z
      • lsf10.1_linux2.6-glibc2.3-x86_64-442293.tar.Z (LSF 10.1, Fix Pack 2)
      • lsf10.1_lsfinstall.tar.Z
      • lsf_std_entitlement.dat
    4. Check that the ed.x86_64 software is installed.

      If this package is not installed, use the yum install command to install it.

      yum install ed
    5. Install LSF as a server host on the Azure instance.

      For example, edit the server.config file with the installation options you need:

      LSF_TOP="/opt/lsf"
      LSF_ADMINS="lsfadmin"
      LSF_TARDIR="/opt/install/"
      LSF_LICENSE="/opt/install/lsf_std_entitlement.dat"
      LSF_SERVER_HOSTS="management1 management2"
      LSF_LOCAL_RESOURCES="[resource azurehost]"
      LSF_LIM_PORT="7869"
      LSF_GET_CONF=lim

      Run the ./lsfinstall -s -f server.config command to install the LSF server host.

      After installation, make sure that the LSF_TOP/conf/lsf.conf file contains the azurehost resource:
      LSF_GET_CONF=lim
      LSF_CONFDIR=/opt/lsf/conf
      LSF_LIM_PORT=7869
      LSF_SERVER_HOSTS="management.myserver.com"
      LSF_VERSION=10.1
      LSF_LOCAL_RESOURCES="[resource azurehost]"
      LSF_TOP=/opt/lsf/
      LSF_LOGDIR=/opt/lsf/log
      LSF_LOG_MASK=LOG_WARNING
      LSF_ENABLE_EGO=N
      LSB_ENABLE_HPC_ALLOCATION=Y
      LSF_EGO_DAEMON_CONTROL=N
      
      LSF_LIM_PORT
      The port number must be the same as the one that is defined on the LSF management host.
      LSF_GET_CONF
      Update the LSF configuration to synchronize the cluster configuration with the management host.
      LSF_LOCAL_RESOURCES
      The new resource name azurehost is used by LSF to identify Azure instances. Use the bhosts -a command to see instances that are used by LSF.
    6. Optional: If required, add the management host name to the /etc/hosts file on the Azure instance, or configure the DNS/NIS client on the instance if DNS or NIS is used.

      You must use the short host name when specifying the management host. The host names of the new instances that the LSF resource connector creates on Azure follow the "host-a-b-c-d" format, where "a-b-c-d" corresponds to the instance's IPv4 address (a.b.c.d). Add these entries into your custom DNS/NIS server or /etc/hosts file.
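
The "host-a-b-c-d" naming rule can be applied mechanically when preparing /etc/hosts entries for a range of addresses. The following is a minimal sketch of that mapping. The function names are hypothetical, and the tab-separated "address name" layout is the usual /etc/hosts convention rather than anything specific to LSF.

```python
import ipaddress


def azure_host_name(ip):
    """Return the host-a-b-c-d name for an IPv4 address given as a string.

    Raises ValueError if the string is not a valid IPv4 address.
    """
    address = ipaddress.IPv4Address(ip)
    return "host-" + str(address).replace(".", "-")


def hosts_entries(ips):
    """Build /etc/hosts lines (address, tab, short host name)."""
    return "\n".join(f"{ip}\t{azure_host_name(ip)}" for ip in ips)


if __name__ == "__main__":
    # Example private addresses, chosen only for illustration.
    print(hosts_entries(["10.1.0.4", "10.1.0.5"]))
```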

    7. Manually start the LSF daemons on the instance and make sure that the instance can join the LSF cluster as a dynamic host.
      Note: Disable LSF_LOCAL_RESOURCES="[resource azurehost]" in the lsf.conf file before your test, because LSF omits hosts with LSF resource connector flagged resources ("azurehost") if they are not dynamically created by the LSF resource connector. You must enable this parameter again before creating the image.

      If the instance cannot join the LSF cluster as a dynamic host, check the VPN, firewall, or security group settings. Check whether the management host can ping the instance (and whether the instance can ping the management host) using its private IP address. If the management host can ping the instance using its IP address but not using the host name, configure the host name resolution properly.

    8. Shut down the LSF daemons and log out from the instance.
    9. Log in to the Azure instance.
    10. Deprovision and shut down the instance.
      sudo waagent -deprovision+user -force && halt
    11. Capture the image.

      Run the following Azure commands:

      az vm deallocate -g "resource_group_name" -n "instance_name"
      az vm generalize -g "resource_group_name" -n "instance_name"
      az image create -n "image_name" -g "resource_group_name" --os-type Linux --source "instance_name"

      The name of the image (imageId) is required in the azureprov_templates.json file so that LSF can decide when borrowing happens and which image to launch instances from.

      For more details on capturing an image, refer to the Microsoft Azure documentation: https://docs.microsoft.com/en-us/azure/virtual-machines/linux/capture-image

    12. Add the image to the Azure resource group that you created for LSF.

      After capturing the image, the Azure instance cannot start and use this image again. If you need to re-create the image, create a new instance using the image that you created, or create a new LSF cloud image.

  8. Install the LSF resource connector for Azure.
    1. Log in to the LSF management host as root.
    2. Install Java, Version 1.8, or later.
    3. Source the LSF environment.
      • For csh or tcsh: source LSF_TOP/conf/cshrc.lsf
      • For sh, ksh, or bash: . LSF_TOP/conf/profile.lsf
    4. Copy the LSF resource connector package to the LSF management host and extract the package.
    5. From the extracted package directory, copy the contents of the LSF_VERSION/resource_connector/azure directory to the LSF_TOP/LSF_VERSION/resource_connector directory.
    6. Create the azure subdirectory in the LSF_TOP/conf/resource_connector directory.
    7. From the extracted package directory, copy the contents of the LSF_VERSION/resource_connector/azure/conf directory to the LSF_TOP/conf/resource_connector/azure directory.
    8. Edit the LSF_TOP/conf/resource_connector/hostProviders.json file and add an entry for the Azure provider.
    9. Replace the ebrokerd executable file in the LSF_SERVERDIR directory with the ebrokerd executable in the extracted package.
    10. Run the badmin mbdrestart command to apply the changes.