Context
Helix (anonymized) is a Series B consumer fintech handling cross-border transfers in the sub-five-thousand-euro bracket. The founding engineering team had shipped the bulk of their v2 stack over the preceding nine months, leaning heavily on LLM-assisted scaffolding for the authentication surface, the payment reconciliation pipeline, and the internal ledger. By the time we were brought in, they were six weeks from a public-launch campaign with PR already scheduled and a compliance officer asking pointed questions about the authorization model.
Their internal test suite was green. It had always been green. The CTO's concern - which is the exact concern every founder calling us describes in a slightly different dialect - was that the suite was green for the wrong reasons. LLM-generated tests tend to mirror the shape of the code under test rather than the shape of the requirements, and that particular failure mode doesn't show up in coverage metrics.
The specific trigger for the engagement was a staging-only incident where a support engineer noticed that a test account could read transaction history for a neighboring user. That's a Critical-severity IDOR in production terms. It had been present for roughly four months. No automated test had flagged it, because no test had been written with an adversarial model of the permission check in mind.
They called us on a Thursday. We scoped on Monday, started on Wednesday, and the first zero-day finding landed on the Friday of week one.
Scope
We agreed on three surfaces, in priority order:
- Authentication and session layer. Login, MFA enrollment and verification, password reset, magic-link flows, session revocation, device binding.
- Payment reconciliation pipeline. Webhook ingestion from the two primary PSPs, idempotency handling, retry semantics, the nightly reconciliation job, and the drift-detection logic.
- Ledger state machine. The core double-entry ledger, transfer lifecycle transitions (PENDING → CLEARED → SETTLED/REVERSED), and the locking strategy under concurrent writes.
Out of scope: mobile client code, marketing site, and the recommendation service. Fifty-two endpoints were in scope in total, across REST and the internal gRPC service mesh.
Method
The full nine-dimension protocol ran end to end: functional, auth and authorization, data integrity, concurrency and race, performance, observability, supply-chain, error-path, and adversarial. For a fintech with this surface area, the bulk of our week-by-week budget went into auth, concurrency, and adversarial dimensions.
Tooling
- Playwright for full browser-level auth-flow reproduction, including MFA replay attacks, session-fixation fuzzing, and reset-token race conditions.
- semgrep with a custom rule pack: check-ordering rules for authorization middleware, rules flagging findById calls executed before role assertions, and secrets-in-logs patterns specific to their logging wrapper.
- A custom chaos harness we wrote against their staging environment. It fired concurrent POST /transfer requests under induced latency on the PSP webhook path, then diffed the resulting ledger state against the expected end state.
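The harness's core trick is releasing all requests at the same instant so they overlap inside the server's read-write window. A minimal sketch of that mechanism, as a reconstruction rather than their code: fire_concurrent and send_transfer are hypothetical names, and the real harness spoke HTTP to staging rather than calling a function.

```python
import threading
from collections import Counter

def fire_concurrent(send_transfer, n=50):
    """Fire n transfer attempts as close to simultaneously as possible
    and tally the outcomes for diffing against the expected end state."""
    results = []
    results_lock = threading.Lock()
    barrier = threading.Barrier(n)   # releases all threads at once

    def worker():
        barrier.wait()               # maximize overlap between requests
        outcome = send_transfer()    # in the real harness: an HTTP POST
        with results_lock:
            results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return Counter(results)
```

The barrier is what distinguishes this from a naive loop of requests: without it, thread startup latency serializes the calls and the race window rarely opens.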
Concrete scenarios
Scenario A - IDOR on guessed UUIDs. We enumerated /users/:id and /users/:id/transactions while authenticated as a low-privilege account. UUIDs are not secret, but the authorization check was supposed to gate access anyway. It didn't, and the reason was a check-ordering bug discussed below.
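The enumeration pass itself is simple; what matters is that it runs from a low-privilege session. A sketch, with fetch standing in for an authenticated GET /users/:id/transactions (a hypothetical helper, not their client):

```python
def probe_idor(fetch, candidate_ids):
    """Return the ids whose transaction history leaked to a
    low-privilege caller. `fetch` is assumed to return
    (status_code, body) with the low-privilege session attached."""
    leaked = []
    for uid in candidate_ids:
        status, body = fetch(uid)
        if status == 200 and body:   # foreign accounts should 403/404
            leaked.append(uid)
    return leaked
```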
Scenario B - Concurrent transfer race. The harness fired fifty concurrent POST /transfer requests from the same source account, each drawing an amount equal to 90% of the balance, while simultaneously replaying a PSP transfer.succeeded webhook from the previous transaction. The ledger was supposed to reject all but one. It cleared several.
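The check-then-write bug behind this scenario can be reproduced deterministically in miniature. The sketch below is ours, not their ledger: NaiveLedger and run_race are hypothetical names, and the barrier forces the worst-case interleaving (every thread reads the balance before any thread writes) so the demo is repeatable rather than flaky.

```python
import threading

class NaiveLedger:
    """Reproduces the audited bug: the balance check happens outside
    any lock or transaction, so concurrent transfers can all observe
    the same pre-debit balance and each decide they are safe to clear."""
    def __init__(self, balance):
        self.balance = balance
        self.cleared = 0
        self._count_lock = threading.Lock()  # protects only the counter

    def transfer(self, amount, rendezvous):
        snapshot = self.balance        # read outside any transaction
        rendezvous.wait()              # all threads pause here, so all
        if snapshot >= amount:         # see the original balance
            self.balance = snapshot - amount
            with self._count_lock:
                self.cleared += 1      # count how many "cleared"

def run_race(ledger, amount, n=50):
    rendezvous = threading.Barrier(n)
    threads = [
        threading.Thread(target=ledger.transfer, args=(amount, rendezvous))
        for _ in range(n)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ledger.cleared
```

A sequential test suite can never open this window, which is exactly why theirs stayed green.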
Scenario C - Reset-token lifetime verification. We requested a password reset, waited 60 minutes, then attempted to consume the token. The documented TTL was 15 minutes. The token was accepted at 60 minutes. And at 24 hours. And at 71 hours.
Findings
The authorization middleware for /users/:id and several sibling endpoints was structured so that the user object was fetched by ID first, then the caller's role was checked against the fetched object. Under several code paths - specifically where the fetched object was passed to a log statement before the permission check threw - the leak was enough for an attacker to confirm the existence of accounts and, in two endpoints, to read fields that were serialized before the 403 was emitted. This is a textbook LLM generation pattern: the scaffolder wrote a middleware that looked like a permission check but executed in the wrong order relative to the object fetch.
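The fix is ordering, not cleverness: authorize before you fetch, and never let the target object reach a log line or serializer before the decision. A hedged sketch of the corrected shape; handle_get_user and the request/store shapes are illustrative, not their middleware:

```python
def handle_get_user(request, user_store):
    """Corrected ordering: assert the caller's permission BEFORE
    fetching (and certainly before logging or serializing) the target
    object. The audited code did the fetch first."""
    target_id = request["path_id"]
    caller = request["caller"]
    # Decision made with nothing fetched: a denied caller learns nothing,
    # not even whether the account exists.
    if caller["role"] != "admin" and caller["id"] != target_id:
        return 403, None
    user = user_store.get(target_id)
    if user is None:
        return 404, None
    return 200, user
```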
The ledger write path used optimistic concurrency with a version column, but the version read and the balance check happened outside the write transaction. The test suite had never exercised this because it ran sequentially and the tests asserted on final state, not on the window between read and write. Under 50-way concurrency with the PSP webhook replayed, we reproduced a double-clearing of the same transfer in 7 of 20 runs. In production terms that's duplicate payouts.
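The corrected shape moves the balance check and the version bump into a single atomic statement inside the write transaction. A sketch using SQLite as a stand-in for their database; debit_with_cas is a hypothetical name and their actual schema differs:

```python
import sqlite3

def debit_with_cas(conn, account_id, amount):
    """Compare-and-swap debit: the balance guard and the version bump
    execute in one UPDATE, so a stale read can never clear a second
    transfer against the same balance."""
    cur = conn.execute(
        """UPDATE accounts
           SET balance = balance - ?, version = version + 1
           WHERE id = ? AND balance >= ?""",
        (amount, account_id, amount),
    )
    conn.commit()
    return cur.rowcount == 1   # False: insufficient funds or lost the race
```

The caller treats a False return as a rejected transfer, which is the behavior the ledger was supposed to have all along.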
The TTL constant had been introduced by an LLM-generated helper and silently drifted across two refactors. Neither the tests nor the docs caught it because the tests only asserted that expired tokens were rejected, not that valid tokens expired at the documented boundary. This is the most common LLM-drift pattern we see: boilerplate that ships the right shape with the wrong constant.
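The missing test is a boundary assertion on both sides of the documented TTL, not merely "expired tokens are rejected". A minimal sketch; the constant name and the strict-inequality validity rule are assumptions:

```python
RESET_TOKEN_TTL_SECONDS = 15 * 60   # the documented TTL (hypothetical name)

def token_is_valid(issued_at, now, ttl=RESET_TOKEN_TTL_SECONDS):
    """A token is valid strictly before the TTL boundary."""
    return (now - issued_at) < ttl

# The assertions the original suite was missing: acceptance just inside
# the boundary AND rejection at and past it.
issued = 0
assert token_is_valid(issued, RESET_TOKEN_TTL_SECONDS - 1)
assert not token_is_valid(issued, RESET_TOKEN_TTL_SECONDS)
assert not token_is_valid(issued, 60 * 60)   # the 60-minute probe from Scenario C
```

Pinning the test to the documented constant means a refactor that drifts the TTL breaks CI instead of the docs.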
The PSP webhook idempotency cache was keyed on the PSP's event ID only, not on the target user. A crafted replay with a stolen event ID could credit the wrong account. Low exploit probability in practice, but the fix is a one-line tuple change and we insisted.
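The one-line change widens the cache key from the event ID alone to an (event ID, target account) tuple, so a forged early delivery can no longer shadow the legitimate webhook for the real account. A set-backed sketch; the class and method names are ours, and real deduplication also needs PSP signature verification and an eviction policy, both omitted here:

```python
class IdempotencyCache:
    """Dedupe webhook deliveries on (event_id, target_account) rather
    than event_id alone. A crafted replay that pairs a stolen event id
    with a different account no longer suppresses the legitimate event."""
    def __init__(self):
        self._seen = set()

    def first_time(self, event_id, account_id):
        key = (event_id, account_id)   # the one-line tuple change
        if key in self._seen:
            return False               # true replay: drop it
        self._seen.add(key)
        return True                    # first delivery for this pair
```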
The error branch in the reconciliation job stringified the full PSP payload, which in error cases from one of the two PSPs included the last six digits of the card PAN. Not enough to be a PCI incident on its own, but enough to fail a quarterly review.
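The remediation we recommended was scrubbing digit runs from PSP payloads before they reach the log sink. A deliberately conservative sketch; the pattern and helper name are assumptions, not their logging wrapper:

```python
import re

# Mask any run of four or more digits before the payload is stringified
# into a log line. Conservative on purpose: the audited leak was the
# last six PAN digits in one PSP's error payloads.
_DIGIT_RUN = re.compile(r"\d{4,}")

def scrub(payload: str) -> str:
    """Replace each long digit run with same-length asterisks."""
    return _DIGIT_RUN.sub(lambda m: "*" * len(m.group()), payload)
```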
Outcome
Headline number: P1 defect count dropped 91% across the audited surfaces between the pre-engagement snapshot and the sign-off snapshot eight weeks later. Three pre-launch zero-days were closed before public traffic ever hit them.
- 14 findings in total: 3 Critical, 4 High, 7 Medium.
- All Criticals closed within 9 days of disclosure.
- Hand-off regression suite: 220 Playwright scenarios + 35 semgrep rules, checked into their repo and running in CI.
- Mean time to reproduce a reported bug in the six weeks after hand-off: 4 minutes (self-reported by their team).
Takeaway
The most dangerous class of defect in AI-assisted fintech is not the exotic one. It's the plausible one. Middleware that looks like a permission check but executes out of order; constants that look correct but have drifted across refactors; tests that assert on the shape of the response rather than the security property being claimed. A senior adversarial review catches these because it starts from "what would an attacker try" rather than "does this code compile and return 200". If you are shipping financial primitives, assume your LLM-generated test suite is green for reasons that have nothing to do with correctness.
Want the same?
Book a 20-min call. You explain the surface. I explain which tier fits.