Post-mortem: Search Performance Incident — January 16, 2025
A transparent look at the search performance degradation that affected Pickle users, what caused it, and what we're doing to prevent similar incidents.

Diego Carvallo & Nikola Loncar · January 20, 2025
On Thursday, January 16, 2025, Pickle experienced a significant performance degradation affecting our core candidate search functionality. Users encountered slow or failed searches for approximately two hours during the incident window. The timing (Thursday morning in the US, with a US holiday the following Monday) limited customer impact, but we take any degradation to our core product seriously.
We're sharing this post-mortem to be transparent about what happened, how we responded, and what we're doing to prevent similar incidents.
Incident Timeline
All times in UTC
January 15 (Wednesday)
- Evening: Slow search performance detected internally. Investigation begins.
January 16 (Thursday)
- ~11:00: Frontend changes deployed introducing TanStack Query for improved query handling and caching. Shortly after, an unexplained spike in database CPU usage observed (primarily query/user process driven). Cause attributed to inefficient search queries.
- 12:24: Query optimisation and database index creation deployed to the profiles and profile_fields tables.
- 12:24–14:19: Immediate production testing following deployment. Database CPU usage spikes significantly, with the majority attributed to IOWAIT. Disk IOPS surge, predominantly READ operations. The prolonged spike from index creation on large tables was unexpected.
- ~13:00: Supabase project manually reset to recover database performance. CPU behaviour shifts from IOWAIT to user-query driven usage, still elevated but more manageable.
- 13:57: Users notified that we had detected and were investigating the issue.
- 14:15: Frontend TanStack Query changes rolled back as a precaution to rule out correlation. Performance re-verified and an improvement in query speed observed.
- 14:31: Users notified that fixes were deployed and we were monitoring.
- 14:30 onwards: Internal testing confirms the query optimisation improved performance. However, some user-facing searches remain slow.
January 17–19 (Friday–Sunday)
- Investigation continues. Root cause identified: permission logic in the search query was checking access at the profile_field level, resulting in expensive operations. Combined with Vercel's 15-second function timeout, queries exceeding this limit were being dropped by the frontend, appearing as failures to users even when the API was completing them.
January 20 (Monday — US Holiday)
- Morning (GMT): Permission logic refactored. Two new columns added to the profiles table (derived from profile_identifiers) to handle permission checks at the profile level. This allows early filtering of inaccessible profiles before expensive joins occur. The fix was released.
- Secondary fix deployed for an AI message generation bug caused by authentication changes in the TanStack Query migration.
- Search performance significantly improved across all user accounts.
What Happened
We were addressing reports of slow candidate search performance. Our fix involved two changes deployed together:
- Query optimisation: Restructuring the search query for better performance
- Index creation: Adding indexes to the profiles and profile_fields tables
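For readers curious what this looked like mechanically, here is a rough sketch of an index migration of this kind, written against the node-postgres client. The indexed columns and index names are illustrative, not our actual schema.

```typescript
// Hypothetical sketch of an index migration using node-postgres ("pg").
// Table names match the post; the indexed columns are illustrative only.
import { Client } from "pg";

async function addSearchIndexes(): Promise<void> {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  try {
    // A plain CREATE INDEX scans the whole table and blocks writes while it
    // runs, which is where sustained READ load on large tables comes from.
    await client.query(
      "CREATE INDEX IF NOT EXISTS idx_profile_fields_profile_id ON profile_fields (profile_id)"
    );
    await client.query(
      "CREATE INDEX IF NOT EXISTS idx_profiles_full_name ON profiles (full_name)"
    );
  } finally {
    await client.end();
  }
}

addSearchIndexes().catch((err) => {
  console.error("Index migration failed:", err);
  process.exit(1);
});
```

Scheduling this kind of migration in a low-traffic window (see Process Changes below) keeps that table scan from competing with live search traffic.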
The index creation on these large tables caused significant database load. This was compounded by our immediate production testing after deployment, which added query load on top of the indexing operations.
After resetting the Supabase project, the database recovered, but searches were still slow for customers. The underlying issue: our search query was checking permissions at the profile_field level—meaning for each of the potentially millions of profile fields, we were evaluating access. This was computationally expensive and caused queries to exceed Vercel's 15-second serverless function timeout.
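To make that concrete, the slow query had roughly the shape below. This is an illustration, not our actual SQL: the column names and the user_can_access_field helper are hypothetical stand-ins for our permission logic. The important part is that the access check sits inside the join, so it runs once per matching profile_fields row.

```typescript
// Rough shape of the slow search query (illustrative; column names and the
// user_can_access_field() helper are hypothetical stand-ins).
// $1 = search pattern, $2 = requesting user id.
const slowSearchSql = `
  SELECT p.id
  FROM profiles p
  JOIN profile_fields pf ON pf.profile_id = p.id
  WHERE pf.value ILIKE $1
    -- The access check is evaluated per profile_fields row, so a broad
    -- search can trigger millions of permission evaluations in one query.
    AND user_can_access_field($2, pf.id)
  GROUP BY p.id
  LIMIT 50;
`;

export { slowSearchSql };
```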
Internal (Pickle) users are subject to fewer permission checks, which masked this issue during our validation. The query appeared fast for us while remaining slow for actual users.
Secondary Issue: TanStack Query Migration
As part of the same performance improvement effort, we migrated our frontend to TanStack Query. This moved query handling to the frontend and changed our authentication patterns. While this improves caching and query management, it introduced a bug affecting AI message generation that has since been fixed.
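For context, the migration pattern looks roughly like the sketch below. The endpoint path and the getAccessToken helper are hypothetical, but it shows the two things that changed: query handling (and caching) now live in a useQuery hook, and authentication is attached in the frontend query function.

```typescript
// Minimal sketch of a candidate-search hook using TanStack Query v5.
// The /api/search endpoint and getAccessToken() helper are hypothetical.
import { useQuery } from "@tanstack/react-query";

declare function getAccessToken(): Promise<string>;

interface SearchResult {
  profileId: string;
  displayName: string;
}

export function useCandidateSearch(term: string) {
  return useQuery<SearchResult[]>({
    queryKey: ["candidate-search", term],
    queryFn: async () => {
      // The authentication pattern changed with this migration; this is the
      // area where the AI message generation bug was introduced.
      const token = await getAccessToken();
      const res = await fetch(`/api/search?q=${encodeURIComponent(term)}`, {
        headers: { Authorization: `Bearer ${token}` },
      });
      if (!res.ok) throw new Error(`Search failed: ${res.status}`);
      return (await res.json()) as SearchResult[];
    },
    enabled: term.trim().length > 0, // don't fire empty searches
    staleTime: 30_000, // cache recent results; this is the caching win
  });
}
```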
Impact
- Duration: Approximately 2 hours of significant degradation (12:24–14:30 UTC), with residual slowness until the Monday fix.
- Affected feature: Core candidate search — searches were slow or appeared to fail.
- User impact: Limited due to timing. The most severe degradation occurred Thursday morning US time and was resolved within hours, and the following Monday was a US holiday (Martin Luther King Jr. Day), which gave us the weekend and the holiday to fully resolve the residual slowness before the US work week resumed.
Root Causes
- Index creation on large tables: Creating indexes on the profiles and profile_fields tables caused sustained database load we didn't anticipate.
- Immediate production load during indexing: Testing in production immediately after deployment added query pressure while the database was already under stress from index creation.
- Permission checks at the wrong granularity: Checking access at the profile_field level instead of the profile level resulted in expensive queries that exceeded serverless timeouts.
- Internal testing blind spot: Bypassing permission checks internally meant we couldn't reproduce the user experience, delaying diagnosis.
- Serverless timeout: Vercel's 15-second function timeout caused queries to fail silently on the frontend, even when the API was still processing them.
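One way to picture this last failure mode, with an illustrative endpoint path and timeout: if the client gives up at roughly the same 15-second mark as the serverless platform, a slow but ultimately successful query is indistinguishable from a hard failure.

```typescript
// Sketch of the silent-failure mode around a ~15 s limit. The endpoint path
// and error handling are illustrative, not our actual client code.
async function searchCandidates(term: string): Promise<unknown> {
  try {
    const res = await fetch(`/api/search?q=${encodeURIComponent(term)}`, {
      // Abort close to the serverless function's 15-second ceiling.
      signal: AbortSignal.timeout(15_000),
    });
    if (!res.ok) throw new Error(`Search failed: ${res.status}`);
    return await res.json();
  } catch (err) {
    // A query that needed 16 seconds lands here even though the database
    // would eventually have returned a correct result.
    console.error("Search timed out or failed:", err);
    return [];
  }
}
```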
Resolution
- Immediate: Supabase project reset to recover the database from index creation load.
- Permanent: Refactored permission logic to check access at the profile level. Created two new columns on the profiles table, derived from profile_identifiers, allowing inaccessible profiles to be filtered out before expensive joins (a rough sketch of the query shape follows this list).
- Secondary: Fixed the TanStack Query authentication bug affecting AI message generation.
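To make the permanent fix concrete, the refactored query has roughly the shape below. The two column names are hypothetical stand-ins for the columns derived from profile_identifiers; what matters is that the permission predicate now references only the profiles table, so inaccessible profiles can be discarded before the expensive join to profile_fields.

```typescript
// Rough shape of the search query after the permission refactor
// (illustrative; visibility_scope and owner_org_id are hypothetical names
// for the two columns derived from profile_identifiers).
// $1 = search pattern, $2 = requesting organisation/user id.
const fastSearchSql = `
  SELECT p.id
  FROM profiles p
  JOIN profile_fields pf ON pf.profile_id = p.id
  WHERE (p.visibility_scope = 'public' OR p.owner_org_id = $2)
    AND pf.value ILIKE $1
  GROUP BY p.id
  LIMIT 50;
`;

export { fastSearchSql };
```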
What We're Changing
Immediate Changes (Completed)
- Query-specific alerting: Added monitoring for the search endpoint to detect performance degradation faster (a rough sketch follows this list).
- Development database sizing: Increased development database backups from 20% to 50% of production data volume for more realistic testing.
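As an illustration of what query-specific alerting can look like in practice (the threshold and alert channel here are placeholders, not our actual monitoring stack):

```typescript
// Minimal sketch of endpoint-level latency alerting for the search handler.
// The threshold and the alerting channel are illustrative placeholders.
const SEARCH_LATENCY_ALERT_MS = 5_000;

export async function withSearchTiming<T>(run: () => Promise<T>): Promise<T> {
  const started = Date.now();
  try {
    return await run();
  } finally {
    const elapsedMs = Date.now() - started;
    if (elapsedMs > SEARCH_LATENCY_ALERT_MS) {
      // Swap this for a real alert (pager, Slack webhook, metrics counter).
      console.warn(`Search latency alert: ${elapsedMs} ms`);
    }
  }
}
```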
Process Changes (In Progress)
- Migration timing: Index creation and schema migrations will be scheduled during low-traffic windows, not deployed ad-hoc.
- Internal permission testing: Evaluating changes to how internal team permissions work—moving from bypassing checks entirely to having configurable internal permissions that mirror production behaviour.
Conclusion
This incident highlighted gaps in how we test permission-heavy queries and deploy database migrations. The combination of index creation load, immediate production testing, and a permission architecture that didn't scale caused a degraded experience for our users.
We're grateful the timing limited customer impact, but we're treating this as a serious learning opportunity. The changes we're implementing—better monitoring, more representative test environments, and thoughtful migration scheduling—will help us catch similar issues before they affect users.
We appreciate your patience and trust in Pickle. If you have any questions about this incident, please reach out to us directly.