Performance Degradation
Date of Incident: 2025-09-03
Time of Incident (UTC): 11:06 - 12:07
Service(s) Affected: All APIs
Impact Duration: 61 minutes
Summary
For 61 minutes on the morning of September 3rd, 2025, all 1Password APIs in the US/Global environment had degraded performance or returned an error for approximately 20% of requests. 92% of the impact was mitigated within 13 minutes, at 11:19 UTC, when automation scaled up infrastructure. A manual restart of the remaining affected servers completed mitigation at 12:07 UTC. A permanent fix has been implemented and deployed to prevent the issue from recurring.
Impact on Customers
- APIs: High latency, or a 500 Internal Server Error.
- Number of Affected Customers: Approximately 20% of all requests returned errors during the first 13 minutes, and 1% for the remainder of the incident.
- Geographic Regions Affected (if applicable): 1Password US/Global
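The two error-rate phases above can be combined into a single time-weighted figure. The following is a minimal sketch, not an official metric: it assumes request volume was constant across the 61 minutes, which this report does not state, and the function and variable names are illustrative only.

```python
# Illustrative sketch: time-weighted average error rate over the incident.
# ASSUMPTION: request volume was uniform for the whole window. The report
# does not say this, so treat the result as a rough estimate only.

INCIDENT_MINUTES = 61        # from "Impact Duration"
FIRST_PHASE_MINUTES = 13     # until automation scaled up at 11:19

# (duration in minutes, fraction of requests returning errors)
PHASES = [
    (FIRST_PHASE_MINUTES, 0.20),
    (INCIDENT_MINUTES - FIRST_PHASE_MINUTES, 0.01),
]


def time_weighted_error_rate(phases):
    """Average the error fractions, weighting each by its duration."""
    total_minutes = sum(duration for duration, _ in phases)
    weighted = sum(duration * rate for duration, rate in phases)
    return weighted / total_minutes


average = time_weighted_error_rate(PHASES)
print(f"Estimated average error rate: {average:.1%}")
```

Under that assumption, the estimate works out to roughly 5% of requests over the full incident window.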
What Happened?
Timeline of Events (UTC):
- 11:05: A customer began sending an unusually high volume of requests to an API with sub-optimal performance.
- 11:06: Some servers started consuming abnormally high memory, causing slow response times and high error rates.
- 11:19: Automation scaled up infrastructure to service the additional load.
- 11:30: Increased errors triggered an escalation, and the on-call engineer began investigating.
- 11:51: Engineers declared an incident and alerted response teams.
- 12:02: The response team began restarting affected servers.
- 12:07: All servers completed their restarts, and error rates returned to normal levels.
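The intervals between these milestones can be derived directly from the timestamps. The following is a small sketch using only the times listed above; the milestone labels and the helper name are shorthand introduced here for illustration.

```python
from datetime import datetime

# Timestamps (UTC) copied from the timeline above.
# The labels are shorthand chosen for this sketch, not official terms.
TIMELINE = [
    ("impact began", "11:06"),
    ("automation scaled up", "11:19"),
    ("on-call escalation", "11:30"),
    ("incident declared", "11:51"),
    ("restarts began", "12:02"),
    ("errors back to normal", "12:07"),
]


def minutes_since_start(timeline):
    """Return {label: whole minutes elapsed since the first entry}."""
    parsed = [(label, datetime.strptime(t, "%H:%M")) for label, t in timeline]
    start = parsed[0][1]
    return {
        label: int((ts - start).total_seconds() // 60)
        for label, ts in parsed
    }


elapsed = minutes_since_start(TIMELINE)
for label, minutes in elapsed.items():
    print(f"{label}: +{minutes} min")
```

Measured from the start of impact, automation responded at 13 minutes, the incident was declared at 45 minutes, and full recovery came at 61 minutes, matching the stated impact duration.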
Root Cause Analysis: A poorly performing cache operation was triggered repeatedly in a short period of time across multiple servers, leading directly to greatly delayed responses.
How Was It Resolved?
- Mitigation Steps: Automatic instance scaling restored over 98% of operational capacity within 13 minutes. Full capacity was restored by manually restarting the remaining affected servers.
- Resolution Steps: We refactored the poorly performing query.
- Verification of Resolution: We tested the affected API to confirm that the refactored query produced the desired performance improvement. We then deployed the fix and monitored it for 24 hours to confirm the issue was resolved.
What We Are Doing to Prevent Future Incidents
- We are auditing services for sub-optimal query performance.
Next Steps and Communication
- No action is required from our customers at this time.
We are committed to providing a reliable and stable service, and we are taking the necessary steps to learn from this event and prevent it from happening again. Thank you for your understanding.
Sincerely,
The 1Password Team