The Self-Healing Layer

[Figure: SelfHealing Layer2.jpg]

The Self-Healing Layer defines a set of tools designed to push problem analysis and resolution as far into the network as possible, enabling devices and local networks of devices to tend to themselves and each other. This objective is in keeping with the XPDR architecture's philosophy of using the computing power available in intelligent devices to work more autonomously and rely less on expensive central infrastructure. As already noted, this reduces network chatter and the cost of server and storage hardware while improving reliability and user experience.

The Self-Healing Layer consists of a server component and a device component.

The server component consists of:

  • Rules Composer - A compiler or interactive editor used to author rules that define self-healing behavior in the device, as well as certain server behaviors that affect the network at large or devices that have gone completely offline.
  • Rules Engine - A high-performance expert system that executes network-scale rules as defined by the rules composer.
  • Simulation/Test Environment - A debugging and testing environment that allows network managers to simulate a device and real-world conditions (e.g., network congestion, lost connectivity, excessive interrupts or CPU usage) so that rules can be tested before they are deployed to production devices.
  • Optional Physical Device Test Environment - Extends the simulation/test environment by allowing real devices to be attached and an agent added to simulate conditions on an actual device, thus providing an additional element of realism to the testing environment.

The device component is an agent that consists of:

  • Rules Engine - A tailored version of the same essential rules engine as on the server component, but pared down for the scale and requirements of the smaller scope in which it operates.
  • Bidirectional Local Monitoring Probe - Provides internal status information upon request by approved local devices and polls the devices on its own list of "buddy" devices for status information. The rules engine consumes this information both to aid other devices in their own self-management and to gain insight into the local network overall, which further guides its actions.
  • Health Status Indicator - Maintained by the rules engine, this records the current overall health state of the device on a scale from 1 to 100. The Control Layer will have rules defining when this status is reported to the back office and what additional information is communicated with it.
  • Unit Under Test Agent - An optional component that works with the physical device test environment of the server component to provide a maximally realistic test environment for testing authored rules.
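
As a rough illustration of how a device-side rules engine might maintain the health status indicator, here is a minimal Python sketch. The `Rule` and `DeviceAgent` names, the metric keys, and the penalty-based scoring are all assumptions made for illustration; the XPDR architecture does not prescribe them. The sketch assumes only that the indicator runs from 1 to 100 and that higher values are healthier.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical metrics snapshot gathered by the local monitoring probe.
Metrics = Dict[str, float]


@dataclass
class Rule:
    name: str
    condition: Callable[[Metrics], bool]  # returns True when the rule fires
    penalty: int                          # assumed scoring: points deducted from health


@dataclass
class DeviceAgent:
    rules: List[Rule]
    health: int = 100                     # 1 (worst) to 100 (best)
    fired: List[str] = field(default_factory=list)

    def evaluate(self, metrics: Metrics) -> int:
        """Run every rule against fresh metrics and recompute the health indicator."""
        fired_rules = [r for r in self.rules if r.condition(metrics)]
        self.fired = [r.name for r in fired_rules]
        total_penalty = sum(r.penalty for r in fired_rules)
        # Clamp to the 1..100 scale used by the health status indicator.
        self.health = max(1, min(100, 100 - total_penalty))
        return self.health


# Example rules; the thresholds and penalties are illustrative only.
agent = DeviceAgent(rules=[
    Rule("high_cpu", lambda m: m.get("cpu_pct", 0.0) > 90.0, penalty=30),
    Rule("packet_loss", lambda m: m.get("dropped_pct", 0.0) > 5.0, penalty=50),
])
agent.evaluate({"cpu_pct": 95.0, "dropped_pct": 1.0})
```

A real agent would also feed "buddy" device status into the same metrics snapshot, so that rules can reason about the local network as well as the device itself.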

There are many ways these components can work together to ensure a healthy network and happy users. Here are a few examples:

  • The health status indicator will typically be mapped to a symbolic status ("Red Light", "Yellow Light", "Green Light"). The Control Layer may be instructed to report:
    • A Green Light status once every eight hours along with certain routine operating information (traffic and error statistics, for instance);
    • A Yellow Light status as soon as it occurs and every hour until it returns to Green;
    • A Red Light status as soon as it occurs and every ten minutes until it returns to a healthier level.
  • Devices in a local network monitor each other's traffic statistics and, when necessary to keep from overwhelming an internet connection, may intentionally throttle their own traffic generation in deference to devices that have been deemed higher priority;
  • A server rule may correlate an increase in Red Light status indicators with devices that recently received a firmware upgrade or configuration file change, decide to roll back those changes so that the affected devices reload the previous versions, and then notify a help desk of the decision for further research;
  • A gateway device can monitor a connected device and, if that device's health status indicator is abnormally low and its traffic generation is unusually high, throttle the data it forwards from the device to the internet or disconnect the device completely, other than for XPDR-related communications to and from the device;
  • A server rule may notice an increase in Yellow or Red Light status indicators in a certain part of the network and notify a network help desk so that a possible network problem can be resolved before a major disruption occurs;
  • A device's memory utilization may be trending upward for several days in a row while it drops increasingly large numbers of outbound packets. Its own rules engine may decide to schedule a full device reset at midnight and, at that time, generate an alert that includes memory dumps, log files, or other diagnostic information for engineering review.
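
The status-reporting example in the first bullet can be sketched in Python as follows. The repeat intervals (eight hours, one hour, ten minutes) come from that example. The numeric thresholds separating Green, Yellow, and Red, and the function names `light_for` and `should_report`, are assumptions made for illustration, since the text does not say where the boundaries lie.

```python
from datetime import timedelta
from typing import Optional

# Repeat intervals taken from the reporting example in the text.
REPORT_INTERVAL = {
    "Green": timedelta(hours=8),
    "Yellow": timedelta(hours=1),
    "Red": timedelta(minutes=10),
}


def light_for(health: int, yellow_below: int = 70, red_below: int = 40) -> str:
    """Map the 1..100 health indicator to a symbolic status.

    The default thresholds are hypothetical; a real deployment would set
    them through authored rules.
    """
    if not 1 <= health <= 100:
        raise ValueError("health must be between 1 and 100")
    if health < red_below:
        return "Red"
    if health < yellow_below:
        return "Yellow"
    return "Green"


def should_report(light: str,
                  previous_light: Optional[str],
                  since_last_report: timedelta) -> bool:
    """Decide whether the Control Layer should report to the back office now."""
    # Yellow and Red are reported as soon as they occur.
    if light != previous_light and light in ("Yellow", "Red"):
        return True
    # Otherwise repeat at the interval configured for the current status.
    return since_last_report >= REPORT_INTERVAL[light]


# A device whose health just dropped from Green into Red reports immediately.
status = light_for(25)
report_now = should_report(status, "Green", timedelta(0))
```

This sketch treats a return to Green as routine, so it waits for the regular eight-hour interval instead of reporting at once. That is one reading of the example, not a stated requirement.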

The potential set of examples is endless and will only grow as the capabilities of devices and the management levers within them become more sophisticated and require more intelligence to keep device and network operations balanced.

With rules defined in the back office, reliably delivered to devices via the Control and Fabric Layer protocols, and executed reliably by the rules engine on the device, a new level of device reliability and delightful user experience can be achieved and maintained.

The Analytics Layer....