Intermittent 1Password request failures

Incident Report for 1Password

Postmortem

Incident Postmortem - Intermittent 500 errors for US customers.

Date of Incident: 2025-06-16
Time of Incident (UTC): 14:39 UTC - 17:47 UTC
Service(s) Affected: Web Interface, Sign in, Sign up, Admin console, Item Sync, SSO (Single Sign On), Command Line Interface (CLI) for US-based users.

Impact Duration: 68 minutes

Summary

At 14:39 UTC, users in the US region experienced intermittent errors while accessing 1Password. The issue stemmed from resource constraints within our infrastructure, specifically affecting the networking services. This was resolved by scaling up the affected services.

Impact on Customers

During the incident:

  • Web interface, Admin console: Customers were able to log in but saw intermittent 500 errors, including “Failed to get Integrations” in the web interface.
  • SSO (Single Sign On), Command Line Interface (CLI), Item Sync: There was degraded performance for authentication and API requests.
  • Sign in, Sign up: Some customers experienced intermittent failures when signing in or signing up.
  • Number of Affected Customers (approximate): All users accessing the service in the US region were affected.
  • Geographic Regions Affected (if applicable): US

What Happened?

The incident began when our internal services started returning errors after we deployed the latest version of the 1Password service. During the initial investigation, we restarted a supporting network service within our infrastructure, which led to an initial recovery of the affected service.

  • Timeline of Events (UTC):

    • 2025-06-16 14:39 UTC: Incident Start - Automation detects servers are returning errors
    • 2025-06-16 14:40 UTC: Initial investigation begins
    • 2025-06-16 15:21 UTC: Networking updates are rolled out
    • 2025-06-16 15:23 UTC: Initial service recovery observed
    • 2025-06-16 15:38 UTC: Root cause identified: Networking applications ran out of allocated resources.
    • 2025-06-16 15:42 UTC: Additional capacity added to networking applications
    • 2025-06-16 17:47 UTC: The spike in server errors stopped, and internal monitoring showed that system health had returned to normal.
    • 2025-06-16 17:53 UTC: Incident resolved
  • Root Cause Analysis: An internal service that directs network traffic became resource-constrained, which degraded service performance. We first stabilized the system by adding more capacity and have since deployed a permanent fix by increasing system resources to prevent a recurrence.

How Was It Resolved?

  • Mitigation Steps: As an immediate mitigation, the number of replicas for the deployment was scaled up.
  • Resolution Steps: A more permanent fix was later applied by increasing the allocated resources for the networking applications.
  • Verification of Resolution: Around 15:25 UTC, we observed that the spike in 500 errors from the server had completely stopped. The team continued monitoring the errors and confirmed at 17:53 UTC that allocated resource consumption had stabilized.
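
The mitigation step above is, at heart, a capacity calculation. The report does not say how the new replica count was chosen, so the following is only a rough sketch of one common approach: the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler, desired = ceil(current × observed / target). The function name `desired_replicas` and the sample numbers are hypothetical and are not taken from the incident.

```python
def desired_replicas(current_replicas: int, observed_pct: int, target_pct: int) -> int:
    """Return ceil(current_replicas * observed_pct / target_pct), never below 1.

    Utilization values are whole-number percentages of the allocated resource.
    Integer arithmetic is used so that rounding is exact.
    """
    if current_replicas < 1:
        raise ValueError("current_replicas must be at least 1")
    if target_pct <= 0:
        raise ValueError("target_pct must be positive")
    if observed_pct < 0:
        raise ValueError("observed_pct must not be negative")
    # Ceiling division without floating point.
    scaled = (current_replicas * observed_pct + target_pct - 1) // target_pct
    return max(1, scaled)


if __name__ == "__main__":
    # Hypothetical example: 4 replicas running at 95% of their allocation,
    # with a 60% utilization target.
    print(desired_replicas(4, 95, 60))
```

A rule like this only sizes the replica count. It does not replace the permanent fix described above, which raised the resources allocated to each networking application.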

What We Are Doing to Prevent Future Incidents

  • Scale existing resources: We have scaled resources and raised resource limits to handle additional load, and we will implement monitoring to ensure we do not hit critical limits.
  • Review and expand existing monitors: We will review our critical service monitors to improve alerting and catch future incidents earlier, before they have customer impact.
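
As an illustration of the kind of monitor described above, here is a minimal sketch of a headroom check that flags a service before it reaches its allocated limit. The report does not describe the actual monitoring stack. The `ResourceSample` type, the `classify` function, and the 75% and 90% thresholds are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceSample:
    """One utilization reading for a service (hypothetical shape)."""
    name: str
    used: float   # amount currently consumed, in any consistent unit
    limit: float  # allocated limit, in the same unit


def classify(sample: ResourceSample, warn_at: float = 0.75, critical_at: float = 0.90) -> str:
    """Return "ok", "warning", or "critical" based on used / limit."""
    if sample.limit <= 0:
        raise ValueError("limit must be positive")
    if not 0 < warn_at < critical_at:
        raise ValueError("thresholds must satisfy 0 < warn_at < critical_at")
    utilization = sample.used / sample.limit
    if utilization >= critical_at:
        return "critical"
    if utilization >= warn_at:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Hypothetical readings, not data from the incident.
    for reading in (
        ResourceSample("network-proxy", used=50, limit=100),
        ResourceSample("network-proxy", used=80, limit=100),
        ResourceSample("network-proxy", used=95, limit=100),
    ):
        print(reading.name, classify(reading))
```

Alerting at a warning threshold well below the limit leaves time to add capacity before customers see errors.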

Next Steps and Communication

  • No action is needed from customers.

We are committed to providing a reliable and stable service, and we are taking the necessary steps to learn from this event and prevent it from happening again. Thank you for your understanding.

Sincerely,

The 1Password Team

Posted Jun 24, 2025 - 09:01 EDT

Resolved

This incident has been resolved.
Posted Jun 16, 2025 - 13:59 EDT

Monitoring

This issue has been mitigated and we are seeing error levels return to normal. We will continue to monitor to confirm that the issue has been resolved.
Posted Jun 16, 2025 - 12:06 EDT

Identified

We have identified an issue causing intermittent request failures for US-based customers, and our engineering team is actively investigating.
Posted Jun 16, 2025 - 11:46 EDT