Incident Postmortem - Cloud Services Degraded
Date of Incident: 2025-11-18
Time of Incident (UTC): 5:03 PM - 6:05 PM
Service(s) Affected: SSO, sign-in, sign-up, CLI, web interface, access to vaults and items, admin console, MFA
Impact Duration: ~60 mins
Summary
On November 18, 2025, at 5:03 PM UTC, 1Password cloud services for customers in the US region became degraded and, in some cases, temporarily unavailable. The issue was caused by database resource exhaustion, which led operations to fail and connections to be rejected. This was not a security incident, and no customer data was impacted. The issue was resolved by resizing the database to restore normal performance and provide additional capacity for future growth.
Impact on Customers
- Single Sign-On (SSO), Multi-Factor Authentication (MFA): Users with SSO or MFA enabled experienced delays and, in some cases, failures when logging in.
- Browser Extension: Users who needed to authenticate via the web interface were unable to unlock their vaults.
- Web Interface, Administration: Customers were unable to log in, sign-ups failed, syncing between devices was not functioning, access to vaults and items was unavailable, and the admin console was not reachable.
- API Access: CLI users and API clients received timeout errors and slow responses.
- Number of Affected Customers (approximate): All customers using cloud interfaces and APIs in the affected region for the duration of the incident.
- Geographic Regions Affected (if applicable): US/Global.
What Happened?
Timeline of Events (UTC):
- 2025-11-18 4:59pm: Automated monitoring detected increased errors
- 2025-11-18 5:03pm: Team began investigating
- 2025-11-18 5:09pm: Servers restarted; service remained degraded
- 2025-11-18 5:23pm: Public status page updated to Investigating and services marked Degraded
- 2025-11-18 5:25pm: Servers scaled down to reduce database load
- 2025-11-18 5:41pm: Database instance size upgrade started
- 2025-11-18 5:44pm: Potentially problematic cron job disabled
- 2025-11-18 5:56pm: Services slowly began to scale back up
- 2025-11-18 5:57pm: Services started to recover as the database instance resize completed
- 2025-11-18 6:05pm: Incident marked as Identified
- 2025-11-18 6:05pm: Team continued to monitor; performance returned to normal levels
- 2025-11-18 6:23pm: Incident marked as Monitoring and services marked Operational
- 2025-11-18 7:16pm: Incident marked as Resolved
Root Cause Analysis: A refactor of a feature amplified a poorly performing query that had previously gone undetected, resulting in an exponential increase in resource consumption on the main database. Once resources were fully exhausted, the service rejected connections and all requests failed. A simplified illustration of this class of query issue follows the contributing factors below.
Contributing Factors (if any):
- Non-performant queries
- Database under-provisioned
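This postmortem does not name the specific query involved. Purely as a hypothetical illustration (Python with an in-memory SQLite database and an invented accounts/vault_items schema), the sketch below shows how a per-account lookup pattern multiplies database work compared with a single aggregated query, which is the general class of non-performant query that a feature refactor can silently amplify.

```python
# Hypothetical illustration only: the actual query, schema, and database involved
# in the incident are not public. This shows how query volume can balloon when a
# hot code path issues one query per row instead of a single aggregated query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY);
    CREATE TABLE vault_items (id INTEGER PRIMARY KEY, account_id INTEGER);
""")
conn.executemany("INSERT INTO accounts (id) VALUES (?)", [(i,) for i in range(1000)])
conn.executemany(
    "INSERT INTO vault_items (id, account_id) VALUES (?, ?)",
    [(i, i % 1000) for i in range(10000)],
)

# Pattern A: one query per account. Total database work grows with the number of
# accounts, so a refactor that routes a hot path through this pattern multiplies load.
per_account_counts = {}
for (account_id,) in conn.execute("SELECT id FROM accounts"):
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM vault_items WHERE account_id = ?", (account_id,)
    ).fetchone()
    per_account_counts[account_id] = count

# Pattern B: a single aggregated query returns the same data in one round trip.
aggregated_counts = dict(
    conn.execute("SELECT account_id, COUNT(*) FROM vault_items GROUP BY account_id")
)

assert per_account_counts == aggregated_counts
```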
How Was It Resolved?
Mitigation Steps:
- Background services were halted to reduce load on the database.
- Application servers were scaled down to further reduce load.
Resolution Steps: Increasing the database instance size resolved the issue.
Verification of Resolution: Monitoring metrics were closely observed to confirm that error rates had returned to normal and database performance had stabilized.
What We Are Doing to Prevent Future Incidents
- Improving monitoring: We are updating our monitoring systems to better detect database issues like this before they impact customers (a simplified sketch of this kind of alerting follows this list).
- Improving database performance: We are refactoring the responsible query to improve its performance and reduce load, and tuning the background service to prevent resource contention.
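As a rough, hypothetical illustration of the monitoring improvement above (the metric, threshold, and class name below are invented and do not reflect 1Password's actual alerting configuration), the sketch flags sustained database saturation across consecutive samples rather than alerting only after requests begin to fail.

```python
# Hypothetical sketch of a saturation alert: fire when a utilization metric stays
# at or above a threshold for several consecutive samples, giving warning before
# the database starts rejecting connections.
from collections import deque

class SustainedSaturationAlert:
    """Fires when a utilization metric stays at or above a threshold for N consecutive samples."""

    def __init__(self, threshold: float = 0.85, consecutive_samples: int = 3):
        self.threshold = threshold
        self.samples = deque(maxlen=consecutive_samples)

    def observe(self, utilization: float) -> bool:
        # Record the latest sample and report whether the alert condition is met.
        self.samples.append(utilization)
        return len(self.samples) == self.samples.maxlen and all(
            sample >= self.threshold for sample in self.samples
        )

# Hypothetical usage: database connection-pool utilization sampled once a minute.
alert = SustainedSaturationAlert(threshold=0.85, consecutive_samples=3)
for minute, utilization in enumerate([0.41, 0.63, 0.88, 0.92, 0.97]):
    if alert.observe(utilization):
        print(f"minute {minute}: page on-call, sustained database saturation ({utilization:.0%})")
```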
Next Steps and Communication
No action is required from our customers at this time.
We are committed to providing a reliable and stable service, and we are taking the necessary steps to learn from this event and prevent it from happening again. Thank you for your understanding.
Sincerely,
The 1Password Team