Degraded performance when accessing 1Password

Postmortem

Date of Incident: 2025-09-26
Time of Incident: 4:20pm UTC - 5:39pm UTC
Service(s) Affected: SSO, Web Sign In, Sign Up, Web Interface, CLI
Impact Duration: ~80 minutes

Summary

On September 26, 2025, at 4:20pm UTC, 1Password's web interface and APIs experienced degraded performance for customers in the US region. This was not the result of a security incident, and customer data was not affected.

Impact on Customers

During the incident:

  • Web interface, Administration: Customers experienced delays when accessing the 1Password web interface.
  • Single Sign-on (SSO), Multi-factor Authentication (MFA): Users with SSO or MFA enabled experienced delays and, in some cases, failures to log in.
  • Command Line Interface (CLI): CLI users faced increased latency and timeouts when attempting to access our web APIs.
  • Browser Extension: Users requiring web interface authentication experienced delays or failures.
  • Customers Affected (approximate): ~30%
  • Geographic Regions Affected: 1password.com (US/Global)

What Happened?

At 4:20pm UTC and again at 5:00pm UTC, bursts of traffic placed extra load on one of our caches. The cache was under-provisioned for that spike in activity and exhausted its available CPU, which caused cascading errors and latency that manifested as slow and failed requests.

  • Timeline of Events (UTC):

    • 2025-09-26 4:20pm: A spike in customer traffic begins
    • 2025-09-26 4:29pm: Automated monitoring detects increased errors and latency
    • 2025-09-26 4:35pm: The team activates our incident protocol and begins investigating
    • 2025-09-26 4:58pm: The team decides to restart application servers
    • 2025-09-26 5:00pm: The servers are restarted, but service remains degraded as a second traffic burst begins
    • 2025-09-26 5:18pm: Service starts to improve
    • 2025-09-26 5:25pm: The team detects increased load a second time
    • 2025-09-26 5:33pm: The team restarts application servers again
    • 2025-09-26 5:39pm: Service is back to normal; the team continues to investigate
    • 2025-09-26 7:26pm: The team identifies the root cause and proceeds to upgrade the cache instance size
    • 2025-09-26 7:49pm: The cache upgrade completes successfully
    • 2025-09-26 7:50pm: The team continues to monitor; performance has returned to nominal levels
    • 2025-09-26 8:24pm: The incident is marked as resolved
  • Root Cause Analysis:

    A cache library installed in July introduced latency on cache connections. Authentication operations were not properly rate-limited, which allowed large influxes of traffic. During peak periods the cache infrastructure was already operating near its maximum CPU capacity, so when a burst of authentication traffic pushed cache CPU utilization to 100%, the added connection latency and CPU exhaustion combined to produce the slow and failed requests described above.

  • Contributing Factors:

    • A latency increase introduced by the cache library version upgrade
    • Inadequate rate limiting, which allowed traffic bursts to go unchecked (see the sketch following this list)
    • An under-provisioned cache instance
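
For readers who want a concrete picture of the rate-limiting gap described above, the following is a minimal sketch of per-operation rate limiting in Go, using the golang.org/x/time/rate token-bucket limiter. The specific limits, the handleAuth function, and its wiring are illustrative assumptions and are not taken from 1Password's codebase; the sketch only shows the general technique of shedding excess authentication traffic before it reaches a shared cache.

    package main

    import (
        "errors"
        "fmt"

        "golang.org/x/time/rate"
    )

    // ErrRateLimited is returned when a request exceeds the configured burst budget.
    var ErrRateLimited = errors.New("too many authentication requests; try again shortly")

    // authLimiter is a hypothetical token-bucket limiter for authentication
    // operations. The values are deliberately tiny so the demo below trips it;
    // real limits would be tuned against production traffic.
    var authLimiter = rate.NewLimiter(rate.Limit(2), 3)

    // handleAuth sheds excess load instead of forwarding every burst of
    // authentication traffic to a shared cache.
    func handleAuth(userID string) error {
        if !authLimiter.Allow() {
            // Reject immediately rather than queueing work that would push
            // the cache toward CPU exhaustion.
            return ErrRateLimited
        }
        // ... cache lookup and authentication work would happen here ...
        fmt.Println("authenticated", userID)
        return nil
    }

    func main() {
        for i := 0; i < 6; i++ {
            if err := handleAuth(fmt.Sprintf("user-%d", i)); err != nil {
                fmt.Println("rejected:", err)
            }
        }
    }

Shedding requests at the edge like this keeps a traffic burst from translating directly into load on a shared cache; the limits themselves have to come from capacity data, which ties into the planning work described below.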

How Was It Resolved?

  • Mitigation Steps: Restarting application servers temporarily mitigated the latency and errors, but the problems returned when traffic spiked again.
  • Resolution Steps: Increasing the instance size for the cache resolved the issue.
  • Verification of Resolution: The incident team tested the upgrade in a staging deployment before executing it in production. They then monitored metrics to confirm the system returned to normal levels.

What We Are Doing to Prevent Future Incidents

  • Improve capacity planning for the cache: We will ensure our internal infrastructure is properly sized to handle current traffic volumes and accommodate future growth, carry out regular resource evaluations to maintain adequate capacity as traffic increases, and add proactive alerting that notifies our teams when resource utilization approaches critical thresholds (a minimal sketch of such a threshold check follows this list).
  • Update library to a more performant version: We will upgrade our caching library to the latest stable version to eliminate the current latency issues.
  • Improve rate limiting for the operations that triggered the traffic burst: We will enhance our rate limiting so that future bursts of authentication traffic are contained before they can overload shared infrastructure.
  • Timeline for Implementation: Observability improvements have already been implemented, and we will complete the remaining work by the end of Q1 2026.
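
As a purely illustrative example of the threshold alerting mentioned above, the sketch below maps a cache CPU utilization sample to an alert level. The warning and critical values, the metric source, and the function names are hypothetical assumptions, not a description of 1Password's monitoring stack.

    package main

    import "fmt"

    // Hypothetical alert thresholds, chosen for illustration only.
    const (
        warnCPUUtilization     = 0.70 // review capacity and provisioning
        criticalCPUUtilization = 0.85 // page the on-call team immediately
    )

    // evaluateCacheCPU maps a utilization sample (0.0 to 1.0) to an alert level.
    func evaluateCacheCPU(utilization float64) string {
        switch {
        case utilization >= criticalCPUUtilization:
            return "critical: cache CPU near exhaustion; scale up or shed load"
        case utilization >= warnCPUUtilization:
            return "warning: cache CPU approaching capacity; review provisioning"
        default:
            return "ok"
        }
    }

    func main() {
        // A few samples standing in for a live metrics feed.
        for _, sample := range []float64{0.45, 0.72, 0.97} {
            fmt.Printf("cache CPU %.0f%% -> %s\n", sample*100, evaluateCacheCPU(sample))
        }
    }

In practice a check like this would live in a monitoring system rather than in application code, with alerts firing well before utilization reaches the levels seen during this incident.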

Next Steps and Communication

No action is required from our customers at this time.

We are committed to providing a reliable and stable service, and we are taking the necessary steps to learn from this event and prevent it from happening again. Thank you for your understanding.

Sincerely,

The 1Password Team

Posted Oct 03, 2025 - 15:28 EDT

Resolved

This incident has been resolved. We will publish a postmortem as soon as we complete it.
Posted Sep 26, 2025 - 16:24 EDT

Monitoring

The engineering team deployed a mitigation, which has addressed the issue. We will continue to monitor performance.
Posted Sep 26, 2025 - 15:53 EDT

Update

The engineering team is rolling out a mitigation in our production environment.
Posted Sep 26, 2025 - 15:28 EDT

Update

The engineering team is continuing to test a mitigation in our test environment before deploying it to production. Customer impact is being actively managed and instances of slowdowns should continue to decrease.
Posted Sep 26, 2025 - 15:13 EDT

Identified

The engineering team has identified the issue and we are continuing to test a mitigation in our test environment before rolling it out to production. Customer impact is being actively managed and slowdowns should be increasingly rare.
Posted Sep 26, 2025 - 14:57 EDT

Update

Engineering teams are testing a mitigation in our non-production environments to verify it before rolling it out to production. Customer impact is being actively managed, and slowdowns should be rare if they occur at all.
Posted Sep 26, 2025 - 14:44 EDT

Update

Engineering teams have identified a potential cause and are actively managing customer impact. Customers may still see some slowdowns, but active management should make them unlikely.
Posted Sep 26, 2025 - 14:32 EDT

Update

The engineering team is actively managing degraded performance and continuing attempts to identify root cause. Customers may still experience some slowdowns when accessing 1Password online.
Posted Sep 26, 2025 - 14:11 EDT

Update

The engineering team is actively managing degraded performance and simultaneously attempting to identify root cause. Customers may still experience some slowdowns when accessing 1Password online.
Posted Sep 26, 2025 - 13:57 EDT

Update

The engineering team is continuing to investigate and actively managing service degradation. Customers may still experience periods of slowdowns when accessing the 1Password service.
Posted Sep 26, 2025 - 13:41 EDT

Update

Overall service is still degraded, but experiencing periods of improvement.
Posted Sep 26, 2025 - 13:25 EDT

Update

The engineering team is restarting key systems in an attempt to alleviate slowdowns.
Posted Sep 26, 2025 - 13:06 EDT

Investigating

We are actively investigating an issue where customers may be experiencing degraded performance and slowdowns when accessing 1Password.
Posted Sep 26, 2025 - 12:51 EDT
This incident affected: USA/Global (Sign in, Sign up, Syncing items between your devices, Billing, Admin console, SSO (Single Sign On), Multi-factor Authentication (MFA), Command Line Interface (CLI), 1Password Connect).