Purging of poll data

Poll data is purged using database partitioning. The raw and historical poll data tables in the NCPOLLDATA database are partitioned based on time, and partitions are added and detached at regular intervals.

Each of the following NCPOLLDATA database tables is broken up into partitions:
  • Raw poll data table, pollData
  • Historical poll data tables:
    • pdEwmaForDay
    • pdEwmaForWeek
    • pdEwmaForMonth
    • pdEwmaForYear
Each partition has a defined time range. Once this time range has passed, the Polling engine starts writing to the next partition.
The partition configuration for each table is as follows:
Table 1. Partition configuration for poll data tables

Table             Number of partitions    Length of each partition
pollData          8                       10 minutes
pdEwmaForDay      26                      1 hour
pdEwmaForWeek     16                      12 hours
pdEwmaForMonth    33                      1 day
pdEwmaForYear     14                      30 days

The goal is to always be writing new data to the penultimate partition of each of the tables. Once the time range for a partition is complete, a new partition is added. If the table has more than the defined number of partitions for that table, then the oldest partition in the table is detached, and that data is removed from storage.
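As an illustration, the retention window for each table is the product of its partition count and partition length from Table 1. The following Python sketch uses illustrative names (PARTITION_CONFIG and retention_minutes are not part of Network Manager):

```python
# Hypothetical sketch: data retention per table, derived from Table 1.
# Retention window = number of partitions x partition length.
PARTITION_CONFIG = {
    # table name: (number of partitions, partition length in minutes)
    "pollData":       (8,  10),
    "pdEwmaForDay":   (26, 60),
    "pdEwmaForWeek":  (16, 12 * 60),
    "pdEwmaForMonth": (33, 24 * 60),
    "pdEwmaForYear":  (14, 30 * 24 * 60),
}

def retention_minutes(table: str) -> int:
    """Approximate lifetime of data in a table before its oldest partition is detached."""
    partitions, length = PARTITION_CONFIG[table]
    return partitions * length
```

For example, retention_minutes("pollData") returns 80, which matches the 80-minute lifetime of raw poll data in the pollData table.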

In the case of the pollData table, where the Polling engine is writing raw poll data to the table, the process works as shown in the following diagram. The 10-minute-long partitions in the diagram are shaded to indicate the presence of data; the unshaded partitions do not contain data:
Figure 1. Pruning of raw poll data
Sequence of events for writing raw poll data to the pollData table
 1  Polling engine, ncp_poller, writes data to the pollData table
The Polling engine, ncp_poller, writes data to the penultimate partition in the table.
 2  New partition is added
Once the time range for the partition that is being written to is complete, a new empty partition is added to the "front" of the table. This ensures that there is always a spare partition at the front of the table.
 3  The oldest partition is detached from the table
With the addition of the new partition, the system detects that the table has more than the defined number of partitions for that table, which in the case of the pollData table is 8. At that point, the oldest partition in the table is detached, and that raw data is removed from storage.
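The add-and-detach rotation in the steps above can be sketched as follows. This is a hypothetical illustration in Python, not the actual Network Manager implementation; the partition limit is taken from Table 1.

```python
from collections import deque

# Hypothetical sketch of the partition rotation described above; names are
# illustrative, not the actual Network Manager implementation.
MAX_PARTITIONS = {"pollData": 8}  # per-table limit from Table 1

def rotate(partitions: deque, table: str, new_partition):
    """Add a new empty partition at the front of the table; if the table now
    exceeds its partition limit, detach and return the oldest partition
    (whose data is then removed from storage). Return None otherwise."""
    partitions.appendleft(new_partition)       # step 2: spare partition at the front
    if len(partitions) > MAX_PARTITIONS[table]:
        return partitions.pop()                # step 3: detach the oldest partition
    return None
```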
In the case of the historical data tables, the Apache Storm process aggregates raw poll data at regular intervals into the relevant historical data table. For example, for the pdEwmaForDay table, an EWMA average of raw poll data from the pollData table is calculated by Storm every 15 minutes and is written to the current partition in the pdEwmaForDay table. This example is shown in the following diagram:
Figure 2. Pruning of historical poll data
Sequence of events for writing aggregated poll data to the pdEwmaForDay table
 1  Apache Storm process aggregates raw poll data
Every 15 minutes Apache Storm calculates the average of the last 15 minutes of data in the raw poll data table, pollData.
 2  Storm writes data to the pdEwmaForDay table
Storm writes data to the penultimate partition in the pdEwmaForDay table.
 3  New partition is added
Once the time range for the partition that is being written to is complete, a new empty partition is added to the "front" of the table. This ensures that there is always a spare partition at the front of the table.
 4  The oldest partition is detached from the table
With the addition of the new partition, the system detects that the table has more than the defined number of partitions for that table, which in the case of the pdEwmaForDay table is 26. At that point, the oldest partition in the table is detached, and that aggregated data is removed from storage.
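The aggregation step can be sketched as an exponentially weighted moving average (EWMA) over the raw samples in each 15-minute window. The smoothing factor alpha below is illustrative; the value that Storm actually uses is not documented here.

```python
# Hypothetical sketch of the EWMA aggregation step; alpha is an illustrative
# smoothing factor, not the value Apache Storm actually uses.
def ewma(samples: list, alpha: float = 0.5) -> float:
    """Fold a 15-minute batch of raw samples into a single EWMA value."""
    value = float(samples[0])
    for sample in samples[1:]:
        # Each new sample pulls the running average toward itself by alpha.
        value = alpha * sample + (1 - alpha) * value
    return value
```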
Note: If the Apache Storm process stops running, no more historical data can be processed until Storm starts again. In this case, no more partitions are detached from the historical poll data tables until Storm is running again; this ensures that no historical data is lost. Poll data is collected in batches in the raw poll data table, pollData, and each batch of raw poll data is marked with a flag that indicates whether that batch has been processed by Storm. When Storm starts again, it resumes processing raw data in the pollData table, starting with the batches that have not yet been processed. Provided that Storm comes back up within the lifetime of data in the pollData table (up to 80 minutes), all raw poll data is successfully processed into historical data.
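The batch-flag recovery behaviour described in the note above can be sketched as follows. The batch structure and field names ("timestamp", "processed") are illustrative assumptions, not the actual pollData schema.

```python
# Hypothetical sketch of the batch-flag recovery behaviour. The batch fields
# are illustrative assumptions, not the actual pollData schema.
def unprocessed_batches(batches: list) -> list:
    """Return the raw poll data batches that Storm has not yet processed,
    oldest first, so that processing resumes where it left off."""
    return sorted(
        (b for b in batches if not b["processed"]),
        key=lambda b: b["timestamp"],
    )
```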

Network Manager monitors the historical poll data database tables and sends alerts if it detects that data in a table is outside of its age limits or that the amount of data in a table violates its size limits. Data age violations are an indication that the Apache Storm process might not be running. Table size violations are an indication that the poll data storage rate is too high. In either case, you will need to do some troubleshooting to get to the root of the problem.
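A minimal sketch of the age and size checks, assuming illustrative thresholds and function names (Network Manager's actual limits are configured elsewhere):

```python
# Hypothetical sketch of the monitoring checks described above. The threshold
# parameters are illustrative; Network Manager's actual limits are configured
# elsewhere.
def check_table(oldest_row_age_min: int, row_count: int,
                max_age_min: int, max_rows: int) -> list:
    """Classify a poll data table against its age and size limits."""
    alerts = []
    if oldest_row_age_min > max_age_min:
        alerts.append("age violation: Apache Storm might not be running")
    if row_count > max_rows:
        alerts.append("size violation: poll data storage rate is too high")
    return alerts
```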