Information on what to do during a major incident. See our [severity level descriptions](/before/severity_levels.md) for what constitutes a major incident.
Always document your activities. Keep a detailed worklog of your actions in DoIT and communicate verbosely in our internal Chat or other channels (email, etc.).
If this is a security incident, you should follow the [Security Incident Response](/during/security_incident_response.md) process.
## Don't Panic!
1. Join the incident call and chat (see links above).
* Anyone is free to join the call or chat to observe and follow along with the incident.
* If you wish to participate, however, you should join both. If you can't join the call for some reason, designate a dedicated proxy to speak for you on the call; disjointed discussions held only in the chat room are ultimately distracting.
1. Follow along with the call/chat, add any comments you feel are appropriate, but keep the discussion relevant to the problem at hand.
* If you are not an SME, try to filter any discussion through the primary SME for your service. Too many people discussing at once becomes overwhelming, so we try to maintain a hierarchical structure to the call if possible.
Not all incidents begin with a formal call. Some are self-evident: automatically generated by our monitoring platforms, reported by a customer via our portal, etc. In these scenarios [DoIT](http://doit.sphs.ro) is the definitive source of information. If that is not sufficient, ask your TL and Sysadmin.
1. Announce on the call, in DoIT and in our internal Chat that you are the team leader, who you have designated as sysadmin (usually the backup TL), and scribe/juniors if any.
* Use the service experts on the call to assist in the analysis. They can often quickly confirm the cause, but not always. When the cause is not positively known, it is the TL's call how to proceed; confer with service owners and use their knowledge to help you.
1. Identify investigation & repair actions (roll back, rate-limit services, etc.) and delegate them to the relevant service experts. Typical actions look something like this (obviously not an exhaustive list):
* **Bad Deployment:** Roll it back.
* **Web Application Stuck/Crashed:** Do a rolling restart.
* **Event Flood:** Validate automatic throttling is sufficient, adjust manually if not.
* **Data Center Outage:** Validate automation has removed bad data center. Force it to do so if not.
* **Degraded Service Behavior without load:** Gather forensic data (heap dumps, etc), and consider doing a rolling restart.
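As a hedged illustration only: if your services happen to run on Kubernetes, the "roll it back" and "rolling restart" actions above might look like the commands below. The deployment name `web` is a placeholder for this sketch, not something defined in this runbook.

```shell
# Hypothetical sketch, assuming a Kubernetes deployment named "web".

# Bad deployment: roll back to the previous revision.
kubectl rollout undo deployment/web

# Stuck/crashed web application: rolling restart
# (pods are replaced gradually, so the service stays up).
kubectl rollout restart deployment/web

# Watch progress until the rollout completes.
kubectl rollout status deployment/web
```

Whatever your platform, prefer repair mechanisms that replace instances gradually so serving capacity is preserved while the repair is in flight.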
1. Listen for prompts from your Sysadmin regarding severity escalations, decide whether we need to announce publicly, and instruct customer liaison accordingly.
* Announcing publicly is at your discretion as TL. If you are unsure, then announce publicly ("If in doubt, tweet it out").
1. Once the incident has recovered or is actively recovering, you can announce that the incident is over and that the call is ending. This usually indicates there's no more productive work to be done for the incident right now.
1. Provide regular status updates in our internal Chat (roughly every 30 minutes) to the executive team, giving an executive summary of the current status. Keep it short and to the point, and use @<channel-name>.
1. You should add notes to the proper channels when significant actions are taken, or findings are determined. You don't need to wait for the TL to direct this - use your own judgment.
* You should also add `TODO` notes to the proper channel that indicate follow-ups slated for later.
You are there to support the team leader in identifying the cause of the incident, suggesting and evaluating repair actions, and following through on those repair actions.
1. You will typically be required to update the status page and to send Tweets or other communications from our various accounts at certain times during the call.