1Password services experiencing latency

Incident Report for 1Password

Postmortem

Date of Incident: 2025-05-21

Time of Incident (UTC): 16:06:40 - 16:48:10

Service(s) Affected: USA/Global (1Password.com website, Sign in, Sign up, Admin console, SSO (Single Sign On), Command Line Interface (CLI)).

Impact Duration: 41 minutes

Summary

On May 21st, 1Password's web interface, APIs, browser extension, and CLI tools experienced significant latency and errors. These problems stemmed from a code change that triggered a spike in server requests, leading to increased memory usage and system load. As a result, customers were unable to access their vaults or sign in via SSO.

This was not a result of a security incident and customer data was not affected.

Impact on Customers

During the incident:

  • Web interface, Administration: Customers experienced significant delays when accessing the 1Password web interface. Administrators could not access or use any administration tools.
  • Single Sign-on (SSO), Multi-factor Authentication (MFA): Users with SSO or MFA enabled could not sign in and received an "An unexpected error occurred" message. Customers may also have been required to re-authenticate to access 1Password once the issue was mitigated.
  • Command Line Interface (CLI): CLI users faced increased latency and timeouts when attempting to access our web APIs.
  • Browser Extension: Users requiring web interface authentication were unable to unlock their vaults.
  • Number of Affected Users (approximate): All users accessing the service in the US/Global (1password.com) region were affected.
  • Geographic Regions Affected (if applicable): 1password.com (US/Global)

What Happened?

We deployed code changes that increased the number of queries to our Redis clusters. The increase in queries caused a spike in memory usage, which in turn caused latency and errors across all endpoints.

  • Timeline of Events (UTC):

    • 2025-05-21 15:52 UTC: Deployment started
    • 2025-05-21 15:57 UTC: Deployment complete
    • 2025-05-21 16:00 UTC: Automated monitoring detects increased errors and latency
    • 2025-05-21 16:01 UTC: Automation pages the incident response team
    • 2025-05-21 16:06 UTC: The team activates our incident protocol and begins investigation
    • 2025-05-21 16:21 UTC: The team initiates a rollback to a previous version
    • 2025-05-21 16:23 UTC: Code change causing the issue identified
    • 2025-05-21 16:48 UTC: Incident mitigated. The rollback is complete and we see a significant improvement in error rates and latency. The team continues to monitor the system.
    • 2025-05-21 17:23:11 UTC: Incident resolved
  • Root Cause Analysis:

    We released a code change that caused a significant increase in data writes to our session store cluster.

    All operations, even those with a pre-established session, depend on the session store for authenticating requests.

    The resulting resource contention led to increased latency and timeouts.

    The unplanned high volume of writes to this specific datastore also caused a portion of sessions to be prematurely evicted, requiring customers to re-authenticate earlier than anticipated.
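
The eviction effect described in the root cause analysis can be illustrated with a small simulation. This is a minimal sketch, not a description of 1Password's session store: it assumes a fixed-capacity store that evicts the least recently used entry when full, and the class name `BoundedSessionStore`, the `simulate` helper, the key names, and the capacity figures are all hypothetical.

```python
from collections import OrderedDict


class BoundedSessionStore:
    """Toy in-memory store with fixed capacity.

    Assumption: least-recently-used eviction. The report does not state
    which eviction policy the real session store uses.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()
        self.evicted = []

    def write(self, key, value):
        # A write makes the key the most recently used entry.
        self._items[key] = value
        self._items.move_to_end(key)
        # Once over capacity, drop the least recently used entries.
        while len(self._items) > self.capacity:
            old_key, _ = self._items.popitem(last=False)
            self.evicted.append(old_key)

    def has(self, key):
        return key in self._items


def simulate(extra_writes, capacity=100, sessions=80):
    """Return how many original sessions survive `extra_writes` new keys."""
    store = BoundedSessionStore(capacity)
    for i in range(sessions):
        store.write(f"session:{i}", "active")
    # Hypothetical stand-in for the unplanned additional writes.
    for i in range(extra_writes):
        store.write(f"extra:{i}", "data")
    return sum(store.has(f"session:{i}") for i in range(sessions))


if __name__ == "__main__":
    for extra in (10, 50, 200):
        print(extra, "extra writes ->", simulate(extra), "sessions remain")
```

In this toy model, sessions survive as long as the extra writes fit in the spare capacity. Beyond that point, each additional write pushes out an older session, which would force the affected user to sign in again.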

How Was It Resolved?

Our monitoring systems detected the issue and alerted the response team immediately after the release. The team quickly identified the problem and initiated a rollback.

  • Resolution Steps: The team identified the problematic code change and reverted to a previous version. As the rollback deployed, server functionality returned to normal.
  • Verification of Resolution: Our monitoring systems were closely observed for 2 hours after the rollback to ensure latency and errors were fully resolved.

What We Are Doing to Prevent Future Incidents

  • Our team will implement longer testing periods in lower-traffic environments to improve monitoring and issue detection for similarly high-risk changes.
  • Our team is working to improve our deployment process to enhance our incremental deployments, which will allow us to detect system issues earlier and contain fallout.
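
As an illustration of how incremental deployments can contain fallout, the sketch below walks a release through increasing traffic percentages and halts at the first stage whose error rate exceeds a limit. It is a hedged example only: the function `staged_rollout`, the `RolloutResult` type, the stage percentages, and the thresholds are hypothetical and do not describe 1Password's deployment tooling.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RolloutResult:
    completed: bool
    last_stage_percent: int  # highest traffic share that was exposed


def staged_rollout(
    stages: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
) -> RolloutResult:
    """Expose the release stage by stage, halting if errors exceed the limit.

    `error_rate_at` is a hypothetical probe that returns the observed
    error rate once the given percentage of traffic runs the new version.
    """
    for percent in stages:
        rate = error_rate_at(percent)
        if rate > max_error_rate:
            # Halt here; a real system would trigger a rollback at this point.
            return RolloutResult(completed=False, last_stage_percent=percent)
    return RolloutResult(
        completed=True,
        last_stage_percent=stages[-1] if stages else 0,
    )


if __name__ == "__main__":
    stages = [1, 10, 50, 100]
    # Healthy release: the error rate stays flat at every stage.
    print(staged_rollout(stages, lambda p: 0.001, max_error_rate=0.01))
    # Load-sensitive regression: errors grow with the traffic share.
    print(staged_rollout(stages, lambda p: 0.005 * p, max_error_rate=0.01))
```

With the load-sensitive regression in this example, the rollout stops at the 10% stage, so most traffic never reaches the faulty version.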

Next Steps and Communication

  • Some customers may need to re-authenticate in order to access 1Password.

We are committed to providing a reliable and stable service, and we are taking the necessary steps to learn from this event and prevent it from happening again. Thank you for your understanding.

Sincerely,

The 1Password Team

Posted May 22, 2025 - 16:51 EDT

Resolved

This incident has been resolved. We will publish a postmortem here as soon as one is available.
Posted May 21, 2025 - 13:27 EDT

Monitoring

The changes have been rolled out, and systems are recovering. Latency and error rates have returned to normal levels. Engineers are continuing to monitor. Users may be required to reauthenticate.
Posted May 21, 2025 - 12:48 EDT

Update

Our engineering team continues to work on mitigating this incident. Please note that users may experience latency, timeouts or error messages when logging in to 1Password, loading vaults, using the CLI and APIs.
Posted May 21, 2025 - 12:41 EDT

Update

Our engineers have identified the issue and are deploying a change to mitigate it.
Posted May 21, 2025 - 12:30 EDT

Identified

Our engineers have identified the issue and are deploying a change to mitigate it.
Posted May 21, 2025 - 12:27 EDT

Investigating

1Password services for the US/Global region are failing to respond to requests. Our teams are investigating the issue.
Posted May 21, 2025 - 12:23 EDT
This incident affected: USA/Global (1Password.com website, Sign in, Sign up, Admin console, SSO (Single Sign On), Command Line Interface (CLI)).