Incident Postmortem - Degraded performance when accessing 1Password
Date of Incident: 2025-09-26
Time of Incident: 4:20pm UTC - 5:39pm UTC
Service(s) Affected: SSO, Web Sign In, Sign Up, Web Interface, CLI
Impact Duration: ~60 minutes
Summary
On September 26, 2025, at 4:20pm UTC, 1Password’s web interface and APIs experienced degraded performance for all customers in the US region. This was not the result of a security incident, and customer data was not affected.
Impact on Customers
During the incident:
- Web interface, Administration: Customers experienced delays when accessing the 1Password web interface.
- Single Sign-on (SSO), Multi-factor Authentication (MFA): Users with SSO or MFA enabled experienced delays and, in some cases, failures to log in.
- Command Line Interface (CLI): CLI users faced increased latency and timeouts when attempting to access our web APIs.
- Browser Extension: Users who needed to authenticate through the web interface experienced delays or failures.
- Proportion of Affected Customers (approximate): ~30%
- Geographic Regions Affected: 1password.com (US/Global)
What Happened?
At 4:20pm UTC and 5:00pm UTC, traffic bursts placed extra load on one of our caches. The cache was under-provisioned for that spike in activity and exhausted its available CPU, which caused cascading errors and latency that manifested as slow and failed requests.
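To illustrate the failure mode in general terms, here is a minimal sketch in Go (hypothetical code, not 1Password's implementation) of how a request that depends on a cache lookup inherits the cache's latency: once the cache host is CPU-bound, every lookup slows down until request deadlines are exceeded and calls start to fail.

```go
// Hypothetical sketch: a request handler whose latency is bounded by a cache
// lookup. When the cache is healthy the lookup is fast; when the cache host
// is CPU-exhausted the lookup slows down and the request times out instead.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// fetchSession simulates a cache lookup whose response time depends on how
// loaded the cache host currently is.
func fetchSession(ctx context.Context, cacheLatency time.Duration) (string, error) {
	select {
	case <-time.After(cacheLatency): // cache responds after its current latency
		return "session-data", nil
	case <-ctx.Done(): // the caller gave up waiting
		return "", ctx.Err()
	}
}

// handleRequest gives the cache 250ms before treating the lookup as failed.
func handleRequest(cacheLatency time.Duration) {
	ctx, cancel := context.WithTimeout(context.Background(), 250*time.Millisecond)
	defer cancel()

	start := time.Now()
	_, err := fetchSession(ctx, cacheLatency)
	elapsed := time.Since(start).Round(time.Millisecond)
	if errors.Is(err, context.DeadlineExceeded) {
		fmt.Printf("request failed after %v: cache timeout\n", elapsed)
		return
	}
	fmt.Printf("request served in %v\n", elapsed)
}

func main() {
	handleRequest(5 * time.Millisecond)   // healthy cache: fast response
	handleRequest(400 * time.Millisecond) // CPU-exhausted cache: request times out
}
```

When many such requests are in flight at once, the resulting timeouts and retries add further load, which is how the slowdown cascaded across the web interface and APIs.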
Timeline of Events (UTC):
- 2025-09-26 4:20pm: Spike in customer traffic begins
- 2025-09-26 4:29pm: Automated monitoring detects increased errors and latency
- 2025-09-26 4:35pm: The team activates our incident protocol and begins investigation
- 2025-09-26 4:58pm: The team decides to restart application servers
- 2025-09-26 5:00pm: Servers have been restarted; service remains degraded as a second traffic burst begins
- 2025-09-26 5:18pm: Service starts to improve
- 2025-09-26 5:25pm: The team detects increased load for the second time
- 2025-09-26 5:33pm: The team restarts application servers again
- 2025-09-26 5:39pm: Service is back to normal; the team continues to investigate
- 2025-09-26 7:26pm: The team identifies the root cause and proceeds to upgrade the cache instance size
- 2025-09-26 7:49pm: Cache upgrade completed successfully
- 2025-09-26 7:50pm: The team continues to monitor; performance has returned to nominal levels
- 2025-09-26 8:24pm: Incident is marked as resolved
Root Cause Analysis:
A code library installed in July introduced additional latency on cache connections. Authentication operations were not properly rate-limited, which allowed large influxes of traffic to reach the cache unchecked. During peak traffic periods the cache infrastructure was already operating near its maximum CPU capacity, so when a burst of authentication traffic pushed cache CPU utilization to 100%, the combination of increased latency and CPU exhaustion produced the slow and failed requests described above.
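As a general illustration of the missing safeguard (not a description of 1Password's actual implementation; the limiter keys and limits here are assumptions), the sketch below places a per-client token-bucket limiter, built on the golang.org/x/time/rate package, in front of authentication operations so a burst is rejected at the edge rather than passed through to the cache.

```go
// Hypothetical sketch of per-client rate limiting for authentication
// operations using token buckets (golang.org/x/time/rate).
package main

import (
	"fmt"
	"sync"

	"golang.org/x/time/rate"
)

// authLimiter keeps one token bucket per client key (for example, per account
// or per source address).
type authLimiter struct {
	mu      sync.Mutex
	buckets map[string]*rate.Limiter
	limit   rate.Limit // sustained requests per second allowed
	burst   int        // short bursts tolerated above the sustained rate
}

func newAuthLimiter(limit rate.Limit, burst int) *authLimiter {
	return &authLimiter{buckets: make(map[string]*rate.Limiter), limit: limit, burst: burst}
}

// allow reports whether the client may perform another authentication
// operation right now; excess requests are rejected instead of reaching
// shared infrastructure such as the cache.
func (a *authLimiter) allow(clientKey string) bool {
	a.mu.Lock()
	b, ok := a.buckets[clientKey]
	if !ok {
		b = rate.NewLimiter(a.limit, a.burst)
		a.buckets[clientKey] = b
	}
	a.mu.Unlock()
	return b.Allow()
}

func main() {
	// Illustrative numbers only: 5 operations/second sustained, bursts of 10.
	limiter := newAuthLimiter(5, 10)
	accepted, rejected := 0, 0
	for i := 0; i < 100; i++ { // a sudden burst of 100 requests from one client
		if limiter.allow("client-123") {
			accepted++
		} else {
			rejected++
		}
	}
	fmt.Printf("accepted %d, rejected %d\n", accepted, rejected)
}
```

With a limiter of this kind in place, a burst from a single source is shed before it can drive shared cache CPU toward saturation.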
Contributing Factors:
- Latency increase due to cache library version upgrade
- Inadequate rate limiting allowed traffic bursts to go unchecked
- Cache instance size was under-provisioned
How Was It Resolved?
- Mitigation Steps: Restarting application servers temporarily mitigated the latency and errors, but the problems returned when traffic spiked again.
- Resolution Steps: Increasing the instance size for the cache resolved the issue.
- Verification of Resolution: The incident team tested the upgrade in a staging deployment before executing it in production. They then monitored metrics to confirm the system returned to normal levels.
What We Are Doing to Prevent Future Incidents
- Improve capacity planning for the cache: We will ensure our internal infrastructure is sized to handle current traffic volumes and to accommodate future growth. We will carry out regular resource evaluations to maintain adequate capacity as traffic increases, and we will add proactive alerting that notifies our teams when resource utilization approaches critical thresholds (see the sketch after this list).
- Update library to a more performant version: We will upgrade our caching library to the latest stable version to eliminate the current latency issues.
- Improve rate limiting for the operations that drove the traffic burst: Enhancing our rate limiting for authentication operations will significantly improve our ability to absorb future traffic bursts before they reach shared infrastructure such as the cache.
- Timeline for Implementation: Observability improvements have already been implemented, and we will complete the remaining work by the end of Q1 2026.
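As a rough illustration of the proactive alerting mentioned in the first item above, the following sketch compares sampled cache CPU utilization against warning and critical thresholds; the threshold values and names are assumptions for the example, not our production configuration.

```go
// Hypothetical sketch: classify sampled cache CPU utilization against
// warning and critical thresholds so the team is notified before the
// cache reaches saturation.
package main

import "fmt"

const (
	warnThreshold     = 0.70 // assumed: alert when sustained CPU passes 70%
	criticalThreshold = 0.90 // assumed: page on-call when CPU passes 90%
)

// evaluateCacheCPU returns the alert level for a utilization sample in the
// range 0.0-1.0. In a real system this would feed a monitoring pipeline
// rather than printing to stdout.
func evaluateCacheCPU(utilization float64) string {
	switch {
	case utilization >= criticalThreshold:
		return "critical: cache near CPU exhaustion, scale up now"
	case utilization >= warnThreshold:
		return "warning: cache approaching capacity, plan an upgrade"
	default:
		return "ok"
	}
}

func main() {
	for _, sample := range []float64{0.45, 0.74, 0.96} {
		fmt.Printf("cpu=%.0f%% -> %s\n", sample*100, evaluateCacheCPU(sample))
	}
}
```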
Next Steps and Communication
No action is required from our customers at this time.
We are committed to providing a reliable and stable service, and we are taking the necessary steps to learn from this event and prevent it from happening again. Thank you for your understanding.
Sincerely,
The 1Password Team