Incidents

Overview

Rootly streamlines your incident response process with easy-to-use, powerful automations at each stage of the incident lifecycle:

  1. Incident Detection: Rootly integrates with observability applications such as Datadog, Grafana, and Sentry to alert teams when abnormalities or potential issues arise.
  2. Paging and Notification: Upon potential issue detection, Rootly notifies the relevant stakeholders through various communication channels such as Slack, email, or SMS.
  3. Incident Triage: Upon being alerted, the incident is triaged to assess its severity and impact on the organization's operations. Rootly provides a centralized interface to empower team members to efficiently collaborate and gather information about the potential incident.
  4. Incident Response: Rootly facilitates incident response efforts by automating manual tasks, which reduces cognitive load during system outages.
  5. Collaboration and Communication: Throughout the incident resolution process, Rootly serves as a hub for collaboration and communication among team members. It enables real-time communication, file sharing, and status updates, ensuring everyone stays informed and aligned on the incident response efforts.
  6. Resolution and Post-Incident Analysis: Once the incident is resolved, Rootly facilitates post-incident analysis to document root causes, lessons learned, and areas for improvement.
  7. Incident Analytics: Rootly captures all relevant incident information and provides insightful metrics to help teams interpret their incident data.

Incident Lifecycle

Rootly manages incidents through the following stages. Each stage is represented by an incident status.

Triage

The triage status is used for potential issues that have not been confirmed as an incident. Placing an incident in the triage status allows teams to limit the blast radius of any notifications and keep the initial investigation to a small group of responders.

Incidents can be declared into the triage status by selecting the Mark as In Triage checkbox.

The data value for the triage status is in_triage.

When an incident is in the triage status, the {{ incident.in_triage_at }} timestamp will be automatically recorded.

This timestamp will NOT be logged if the incident was never triaged.

Started

The started status signifies that the incident has been confirmed as a real incident.

To declare an incident directly into the started status, simply leave the Mark as In Triage checkbox unchecked.

The data value for the started status is started.

When an incident is in the started status, the {{ incident.started_at }} timestamp will be automatically recorded.

Mitigated

An incident moves to the mitigated status once its impact has been contained. However, this does NOT mean the incident has been officially fixed.

An incident can be progressed to the mitigated status by using the /rootly mitigate command or via the Mitigate button.

The data value for the mitigated status is mitigated.

When an incident is in the mitigated status, the {{ incident.mitigated_at }} timestamp will be automatically recorded.

This timestamp will automatically be set to the same value as the {{ incident.resolved_at }} timestamp if the mitigated status is skipped.
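
The fallback rule above can be sketched in a few lines of Python. This is only an illustration of the behavior described on this page, not Rootly's implementation: the IncidentTimestamps class and the effective_mitigated_at helper are hypothetical names, while the field names mirror the template variables used here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class IncidentTimestamps:
    """Hypothetical container mirroring the template variables on this page."""
    started_at: Optional[datetime] = None
    mitigated_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None


def effective_mitigated_at(ts: IncidentTimestamps) -> Optional[datetime]:
    """Return mitigated_at, falling back to resolved_at when mitigation was skipped."""
    if ts.mitigated_at is not None:
        return ts.mitigated_at
    return ts.resolved_at


# Example: an incident resolved directly from the started status.
skipped = IncidentTimestamps(
    started_at=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    resolved_at=datetime(2024, 1, 1, 12, 45, tzinfo=timezone.utc),
)
print(effective_mitigated_at(skipped) == skipped.resolved_at)  # True
```

An incident that has not yet been mitigated or resolved simply yields None in this sketch.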

Resolved

An incident is considered resolved once the issue causing the incident has been fixed.

An incident can be progressed to the resolved status by using the /rootly resolve command or via the Resolve button.

The data value for the resolved status is resolved.

When an incident is in the resolved status, the {{ incident.resolved_at }} timestamp will be automatically recorded.

Cancelled

An incident is considered cancelled once it has been determined to be a false positive or identified as a duplicate of an existing incident.

An incident can be progressed to the cancelled status by using the /rootly cancel command or via the Cancel Incident button.

The Cancel Incident button is only available when the incident is in the triage status.

The data value for the cancelled status is cancelled.

When an incident is in the cancelled status, the {{ incident.cancelled_at }} timestamp will be automatically recorded.
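
The statuses described in this section, together with their data values and timestamp fields, can be summarized as a lookup table. The following is a minimal Python sketch: the STATUS_TIMESTAMP_FIELD name and the timestamp_field_for helper are hypothetical, and only the data values and field names are taken from this page.

```python
# Maps each status data value to the timestamp field recorded for it.
STATUS_TIMESTAMP_FIELD = {
    "in_triage": "in_triage_at",
    "started": "started_at",
    "mitigated": "mitigated_at",
    "resolved": "resolved_at",
    "cancelled": "cancelled_at",
}


def timestamp_field_for(status: str) -> str:
    """Return the timestamp field for a status data value (hypothetical helper)."""
    try:
        return STATUS_TIMESTAMP_FIELD[status]
    except KeyError:
        raise ValueError(f"unknown incident status: {status!r}") from None


print(timestamp_field_for("mitigated"))  # mitigated_at
```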

Incident Properties

Every incident created on Rootly can be characterized with a set of data properties. These properties can either be built-in or custom.

Incident properties play a key role during incident management because they can:

  • help categorize each incident (e.g. type = security, customer-facing, backend)
  • be used as run conditions for automations (e.g. create incident retrospective when status = resolved, notify leadership if severity = SEV0)
  • be used to gain insightful incident analytics (e.g. plot graph breaking down incidents by their impacted service)
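
The run-condition and analytics examples in the list above can be sketched as plain Python over incident properties. This is illustrative only: the dictionary keys (status, severity, service), the action names, and both functions are assumptions made for the example and do not describe how Rootly evaluates automations or computes metrics.

```python
from collections import Counter
from typing import Dict, Iterable, List

Incident = Dict[str, str]  # hypothetical flat view of an incident's properties


def actions_for(incident: Incident) -> List[str]:
    """Return the automations whose run conditions match this incident."""
    actions = []
    if incident.get("status") == "resolved":
        actions.append("create_retrospective")
    if incident.get("severity") == "SEV0":
        actions.append("notify_leadership")
    return actions


def count_by_service(incidents: Iterable[Incident]) -> Counter:
    """Break down incidents by impacted service, skipping those without one."""
    return Counter(i["service"] for i in incidents if "service" in i)


incidents = [
    {"status": "resolved", "severity": "SEV0", "service": "checkout"},
    {"status": "started", "severity": "SEV2", "service": "checkout"},
    {"status": "resolved", "severity": "SEV1", "service": "search"},
]
print(actions_for(incidents[0]))    # ['create_retrospective', 'notify_leadership']
print(count_by_service(incidents))  # Counter({'checkout': 2, 'search': 1})
```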

You can read more about these properties on the dedicated page here.

Support

If you need help or more information about this page, please contact [email protected] or start a chat by navigating to Help > Chat with Us.