Monitoring Host

The host sensor is automatically deployed and installed after you install the Instana agent.

Supported OS

  • Linux
  • Windows
  • macOS
  • Solaris on SPARC
  • AIX

Configuration

For detailed information, see our agent configuration documentation.

Metrics collection

To view the metrics, select Infrastructure in the sidebar of the Instana user interface and click a monitored host. The host dashboard displays all the collected metrics and monitored processes.

Configuration data

  • Operating System name and version
  • CPU model and count
  • GPU model and count
  • Memory
  • Max Open Files
  • Hostname
  • Fully Qualified Domain Name
  • Machine ID
  • Boot ID
  • Startup time
  • Installed packages
  • System ID

Note:

The System ID can be used for correlation with asset management systems. The Instana agent collects the System ID by default on Linux. For the other supported operating systems (Windows, macOS, Solaris, AIX), you need to explicitly enable the feature in the agent configuration.yaml file:

com.instana.plugin.host:
  collectSystemId: true

Performance metrics

CPU usage

Overall CPU usage as a percentage.

  • In an AIX LPAR environment, set useMpstat to true to collect more accurate CPU usage:

com.instana.plugin.host:
  useMpstat: true

Collected from: Filesystem

Granularity: 1 second

Memory usage

  • On Linux, the used value is computed as a percentage by using the formula (total - actualFree) / total. The sensor uses actualFree, the truly available memory, which includes free plus cached and buffered memory, instead of just free, which is usually low because the kernel uses idle memory for caching and buffering.

Collected from: Filesystem

Granularity: 1 second
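
As an illustration of this formula, the following snippet approximates actualFree as MemFree + Buffers + Cached from /proc/meminfo; the exact fields that the sensor reads may differ:

# Compute used memory as (total - actualFree) / total, in percent (values in kB):
awk '/^MemTotal/ {t=$2} /^MemFree/ {f=$2} /^Buffers/ {b=$2} /^Cached:/ {c=$2}
     END {printf "memory used: %.1f%%\n", (t - (f + b + c)) / t * 100}' /proc/meminfo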

  • On AIX, the used value is computed as a percentage by using the formula (computational + non-computational) / real total. Non-computational memory is counted as part of used memory, which can result in a relatively high used value. However, a high used value doesn't necessarily indicate a need for more memory. The determination of memory over-commitment is based on the comparison between computational memory and the real memory in the system, so the computational percentage is more informative for estimating memory usage on AIX.

Collected from: AIX perfstat_memory_total interface

Granularity: 1 second
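
For example (hypothetical values), an LPAR with 32 GB of real memory, 12 GB computational and 16 GB non-computational memory shows used = (12 + 16) / 32 = 87.5%, while the computational share alone is 12 / 32 = 37.5%, so the host is not actually short of memory.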

CPU load

The average number of processes that are running or waiting to run over the selected time period.

Collected from: Filesystem

Granularity: 5 seconds

CPU usage

CPU usage values as percentages: user, system, wait, nice, and steal. The values are displayed on a graph over a selected time period.

Collected from: Filesystem

Granularity: 1 second

Context switches

The total number of context switches. This is supported only on Linux hosts. The value is displayed on a graph over a selected time period.

Collected from: Filesystem

Granularity: 1 second
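
On Linux, the kernel exposes the cumulative counter in /proc/stat; for example (the counter value is illustrative):

# The ctxt line holds the total number of context switches since boot:
grep '^ctxt' /proc/stat
# ctxt 1490329749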

CPU load

CPU load. The value is displayed on a graph over a selected time period.

Collected from: Filesystem

Granularity: 1 second

Individual CPU usage

Individual CPU usage values as percentages: user, system, wait, nice, and steal. The values are displayed on a graph over a selected time period.

Collected from: Filesystem

Granularity: 1 second

Individual GPU usage

Individual GPU usage values.

Datapoint | Collected from | Granularity | Unit
GPU Usage | nvidia-smi | 1 second | %
Temperature | nvidia-smi | 1 second | °C
Encoder | nvidia-smi | 1 second | %
Decoder | nvidia-smi | 1 second | %
Memory Used | nvidia-smi | 1 second | %
Memory Total | nvidia-smi | 1 second | bytes
Transmitted throughput | nvidia-smi | 1 second | bytes/s
Received throughput | nvidia-smi | 1 second | bytes/s

Supported Nvidia graphics cards:

Brand | Model
Tesla | S1070, S2050, C1060, C2050/70, M2050/70/90, X2070/90, K10, K20, K20X, K40, K80, M40, P40, P100, V100
Quadro | 4000, 5000, 6000, 7000, M2070-Q, K-series, M-series, P-series, RTX-series
GeForce | varying levels of support, with fewer metrics available than on the Tesla and Quadro products

Supported OS: Linux

Prerequisites: The latest official Nvidia drivers are installed.

For information about starting the Instana Agent Docker container with GPU support, see Enable GPU monitoring through Instana Agent container.

Note:

Data collection of GPU metrics is designed for minimal impact by splitting polling and querying into two processes that use the nvidia-smi command-line utility. The background process is started in loop mode and kept in memory, which significantly improves metrics-collection performance and avoids overhead. The sensor queries GPU metrics at the configured poll rate (every second by default). This approach enables the sensor to gather accurate, up-to-date metrics every second for multiple GPUs without overhead.
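
A minimal sketch of such a loop-mode query is shown below; the exact invocation that the sensor uses is internal to the agent, so treat the fields and flags as illustrative:

# Poll GPU utilization, temperature, and memory once per second from a single
# long-lived process (-l 1 keeps nvidia-smi alive and re-polling every second):
nvidia-smi --query-gpu=utilization.gpu,temperature.gpu,memory.used,memory.total --format=csv -l 1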

GPU Memory/Process

The following datapoints are collected for the processes that use the GPU.

Datapoint | Collected from | Granularity
Process Name | nvidia-smi | 1 second
PID | nvidia-smi | 1 second
GPU | nvidia-smi | 1 second
Memory | nvidia-smi | 1 second

Supported Nvidia graphics cards:

Brand | Model
Tesla | S1070, S2050, C1060, C2050/70, M2050/70/90, X2070/90, K10, K20, K20X, K40, K80, M40, P40, P100, V100
Quadro | 4000, 5000, 6000, 7000, M2070-Q, K-series, M-series, P-series, RTX-series
GeForce | varying levels of support, with fewer metrics available than on the Tesla and Quadro products

Supported OS: Linux

Prerequisites: The latest official Nvidia drivers are installed.

For information about starting the Instana Agent Docker container with GPU support, see Enable GPU monitoring through Instana Agent container.

Note:

Data collection of GPU metrics is designed for minimal impact by splitting polling and querying into two processes that use the nvidia-smi command-line utility. The background process is started in loop mode and kept in memory, which significantly improves metrics-collection performance and avoids overhead. The sensor queries GPU metrics at the configured poll rate (every second by default). This approach enables the sensor to gather accurate, up-to-date metrics every second for multiple GPUs without overhead.

Memory

  • Linux:
    • memory used and swap used values are displayed as percentages.
    • swap total, swap free, buffers, cached, and available values are displayed in bytes. The values are displayed on a graph over a selected time period.

Collected from: Filesystem

Granularity: 1 second

  • AIX:
    • memory used, swap used, and virtual used values are displayed as percentages.
    • swap total, swap free, virtual total, and virtual free values are displayed in bytes.
    • computational and non-computational values are displayed both as percentages and in bytes in two separate charts.
    • page-in per second and page-out per second are displayed as the number of page-in and page-out events per second. You can visualize all these values on a graph over a selected time period.

Collected from: AIX perfstat_memory_total interface

Granularity: 1 second

Open files

Open files usage (current vs. max) when available on the operating system. The values are displayed on a graph over a selected time period.
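
On Linux, for example, the current and maximum values can be inspected in /proc/sys/fs/file-nr (the numbers shown are illustrative):

# Columns: allocated file handles, allocated-but-unused handles, system-wide maximum:
cat /proc/sys/fs/file-nr
# 9472    0       1602964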

Note:

Solaris OS has limited support:

  • Global zone - only the current metric is supported
  • Non-global zone - none of the metrics are supported

Collected from: Filesystem

Granularity: 1 second

Filesystems

Filesystems per device.

Datapoint | Collected from | Granularity
Device | Filesystem | 60 seconds
Mount | Filesystem | 60 seconds
Options | Filesystem | 60 seconds
Type | Filesystem | 60 seconds
Capacity | Filesystem | 60 seconds
Total Utilization* | Filesystem | 60 seconds
Read Utilization* | Filesystem | 60 seconds
Write Utilization* | Filesystem | 60 seconds
Used | Filesystem | 1 second
Leaked* | Filesystem | 1 second
Inode usage | Filesystem | 1 second
Reads/s, Bytes Read/s** | Filesystem | 1 second
Writes/s, Bytes Written/s** | Filesystem | 1 second

* The Total Utilization, Read Utilization, and Write Utilization datapoints display the disk I/O utilization as a percentage. This functionality is available only on Linux.

* Leaked refers to deleted files that are still in use; it equates to capacity - used - free. On Linux, you can find these files with lsof | grep deleted.

** The Total Utilization, Read Utilization, and Write Utilization datapoints are not supported for Network File Systems.

By default, Instana monitors only local filesystems. You can explicitly list the filesystems to be monitored or excluded in the configuration.yaml file. The name for the config setting is the device name, which you can obtain from the first column of the mtab file or of the df command output.
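
For example, on a Linux host the device names appear in the first column of the df output (the sizes shown are illustrative):

df
# Filesystem      1K-blocks     Used Available Use% Mounted on
# /dev/sda1        41152736 12251672  26788600  32% /
# tmpfs             8131824        0   8131824   0% /sys/fs/cgroup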

Temporary filesystems need to be specified in the format tmpfs:/mount/point. For example, to list the filesystems to be monitored:

com.instana.plugin.host:
  filesystems:
    - '/dev/sda1'
    - 'tmpfs:/sys/fs/cgroup'
    - 'server:/usr/local/pub'

or to include and exclude filesystems:

com.instana.plugin.host:
  filesystems:
    include:
      - '/dev/xvdd'
      - 'tmpfs:/tmp'
      - 'server:/usr/local/pub'
    exclude:
      - '/dev/xvda2'

Network File Systems (NFS)

To monitor all Network File Systems (NFS), use the nfs_all: true configuration parameter:

com.instana.plugin.host:
  nfs_all: true

Network interfaces

Network traffic and errors per interface.

Datapoint | Collected from | Granularity
Interface | Filesystem | 60 seconds
Mac | Filesystem | 60 seconds
IPs | Filesystem | 60 seconds
RX Bytes | Filesystem | 1 second
RX Errors | Filesystem | 1 second
TX Bytes | Filesystem | 1 second
TX Errors | Filesystem | 1 second

TCP activity

TCP activity values are displayed on a graph over a selected time period.

Datapoint | Collected from | Granularity
Established | Filesystem | 1 second
Open/s | Filesystem | 1 second
In Segments/s | Filesystem | 1 second
Out Segments/s | Filesystem | 1 second
Established Resets | Filesystem | 1 second
Out Resets | Filesystem | 1 second
Fail | Filesystem | 1 second
Error | Filesystem | 1 second
Retransmission | Filesystem | 1 second

Instana doesn't support the TCP activity metric for Sun Solaris hosts.
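
For reference, similar TCP counters can be inspected manually on Linux in /proc/net/snmp; whether the sensor reads exactly these fields is not specified here:

# Fields such as CurrEstab, InSegs, OutSegs, EstabResets, OutRsts, AttemptFails,
# InErrs, and RetransSegs roughly correspond to the datapoints above:
grep '^Tcp:' /proc/net/snmp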

Process top list

The top process list is updated every 30 seconds, and it contains only the processes with significant system usage. For example, the processes with more than 10% CPU usage over the last 30 seconds or processes with more than 512 MB memory usage (RSS) are displayed in the process top list.

To create a combined list of processes from the top 10 CPU and memory usage lists, set combineTopProcesses to true. The processes are included in the combined list even if their CPU usage is less than 10% or memory usage is less than 512 MB. If the same process is listed in the top 10 CPU and top 10 memory usage lists, it is listed only once in the combined list, which can include up to 20 entries.

com.instana.plugin.host:
  combineTopProcesses: true

Linux top semantics are used: 100% CPU refers to full use of a single CPU core. You can search a history of snapshots from the previous month. The normalized CPU value is calculated by dividing the CPU value by the number of logical processors.
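
For example (hypothetical values), a process that uses 250% CPU on a host with 8 logical processors is displayed with a normalized CPU value of 250% / 8 = 31.25%.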

Datapoint | Collected from | Granularity
PID | Filesystem | 30 seconds
Process Name | Filesystem | 30 seconds
CPU | Filesystem | 30 seconds
CPU (normalized) | Calculated | 30 seconds
Memory | Filesystem | 30 seconds

Installed Packages List

When collectInstalledSoftware is set to true in the configuration.yaml file, the packages that are installed on the operating system are extracted once a day.

The following Linux distributions are currently supported:

  • Debian-based (dpkg)
  • Red Hat-based (rpm and yum)

com.instana.plugin.host:
  collectInstalledSoftware: true # [true, false]
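
For reference, the same information can be listed manually with the distribution's package manager; whether the sensor invokes exactly these commands is not specified here:

dpkg -l    # Debian-based distributions
rpm -qa    # Red Hat-based distributions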

Health signatures

For each sensor, there is a curated knowledge base of health signatures that is evaluated continuously against the incoming metrics and used to raise issues or incidents depending on user impact.

Built-in events trigger issues or incidents based on failing health signatures on entities, and custom events trigger issues or incidents based on the thresholds of an individual metric of any given entity.

For information about the built-in events for the Host sensor, see the Built-in events reference.

Error report events (AIX operating system only)

On AIX systems, the errpt command generates an error report from entries in an error log. The errors in the report are then captured as events and sent to Instana. The sensor captures permanent and temporary error types, and hardware and software error classes. You need to explicitly enable the feature in the agent configuration.yaml file:

com.instana.plugin.host:
  aixEventsPollRate: 900 # In seconds
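
For reference, a typical errpt summary looks as follows (the entries are illustrative); the T column is the error type (P = permanent, T = temporary) and the C column is the error class (H = hardware, S = software):

errpt
# IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
# A924A5FC   0906120123 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
# 476B351D   0906115823 T H hdisk0         DISK OPERATION ERROR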

Troubleshooting

eBPF Not Supported

Monitoring issue type: ebpf_not_supported

The Process Abnormal Termination functionality detects when processes that run on a Linux-based operating system terminate unexpectedly because of crashes or because they are killed by external signals.

This functionality is built on top of the extended Berkeley Packet Filter (eBPF), which seems to be unavailable on this host.

To take advantage of Instana's eBPF-based features, you need a 4.7+ Linux kernel with debugfs mounted. For more information about the supported operating systems, see the Process Abnormal Termination documentation.
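
A quick way to check both prerequisites on a host:

# The kernel must be version 4.7 or later:
uname -r
# debugfs must be mounted (usually at /sys/kernel/debug):
mount | grep debugfs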

SELinux policy blocking eBPF

If SELinux is installed on your host, you usually need to create a policy to allow the agent to leverage eBPF. SELinux may prevent unconfined services, such as the host agent, from issuing the bpf syscalls that the eBPF sensor uses to instrument the Linux kernel. To verify that this is happening, check the log entries of the Audit system, which are stored by default in /var/log/audit/audit.log.

The following example is from a Red Hat Linux machine:

$ cat /var/log/audit/audit.log | grep ebpf
type=AVC msg=audit(1598891569.452:193): avc:  denied  { map_create } for  pid=1612 comm="ebpf-preflight-" 
scontext=system_u:system_r:unconfined_service_t:s0 tcontext=system_u:system_r:unconfined_service_t:s0 
tclass=bpf permissive=0
type=SYSCALL msg=audit(1598891569.452:193): arch=c000003e syscall=321 success=no exit=-13 
a0=0 a1=7ffc0e1f5020 a2=78 a3=fefefefefefefeff items=0 ppid=1502 pid=1612 auid=4294967295 
uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ebpf-preflight-" 
exe="/opt/instana/agent/data/repo/com/instana/ebpf-preflight/0.1.6/ebpf-preflight-0.1.6.bin" 
subj=system_u:system_r:unconfined_service_t:s0 key=(null)
type=PROCTITLE msg=audit(1598891569.452:193):
proctitle="/opt/instana/agent/data/repo/com/instana/ebpf-preflight/0.1.6/ebpf-preflight-0.1.6.bin"

Note that audit log files are usually rotated, so run this command shortly after starting the host agent.

In the log file, we see that the map_create operation was denied. To allow the eBPF sensor to perform this operation, we need to create an SELinux policy by using the audit2allow program. On Red Hat systems, it can be installed as follows:

yum install policycoreutils-python

With audit2allow, we can then create raw policy files based on the log entries:

grep ebpf /var/log/audit/audit.log | audit2allow -M instana_ebpf

The command above will create the following files:

ls -Al | grep instana_ebpf
-rw-r--r--. 1 root                    root                      886 31. Aug 18:31 instana_ebpf.pp
-rw-r--r--. 1 root                    root                      239 31. Aug 18:31 instana_ebpf.te

The raw policy file, called instana_ebpf.te, now contains an instruction to allow the denied operation:

$ cat instana_ebpf.te
module instana_ebpf 1.0;

require {
	type unconfined_service_t;
	class bpf map_create;
}

#============= unconfined_service_t ==============
#!!!! This avc is allowed in the current policy
allow unconfined_service_t self:bpf map_create;

This policy allows any process of the (very generic) unconfined_service_t type to perform the map_create operation.

Additionally, the eBPF sensor needs a few more permissions. Edit the instana_ebpf.te file so that it looks like this:

$ cat instana_ebpf.te
module instana_ebpf 1.0;

require {
	type unconfined_service_t;
	class bpf { map_create map_read map_write prog_load prog_run };
}

#============= unconfined_service_t ==============
#!!!! This avc is allowed in the current policy
allow unconfined_service_t self:bpf { map_create map_read map_write prog_load prog_run };

This file must then be compiled into a binary representation, the instana_ebpf.mod file:

$ checkmodule -M -m -o instana_ebpf.mod instana_ebpf.te
checkmodule:  loading policy configuration from instana_ebpf.te
checkmodule:  policy configuration loaded
checkmodule:  writing binary representation (version 19) to instana_ebpf.mod

The instana_ebpf.mod file must be repackaged as a loadable module:

semodule_package -o instana_ebpf.pp -m instana_ebpf.mod

And finally we can apply the policy package:

semodule -i instana_ebpf.pp

Any unconfined process, such as the host agent, can now make those syscalls.