Context
Toniq (anonymized) is a clinician-facing healthtech platform: a web application that lets primary-care providers manage a panel of patients, record encounters, and share summaries with other clinicians on the care team. Patient data - chart notes, lab results, prescriptions - lives inside the platform's PHI boundary. Toniq's enterprise customers will not sign a contract without a SOC2 Type I report, and the external auditor was booked for a kick-off call eight weeks out from the start of our engagement.
The product had been built by a small team over fourteen months with significant LLM assistance - not end-to-end code generation, but scaffolding for controllers, middleware, validation schemas, and much of the audit-logging infrastructure. The code was clean-looking and the test suite was comprehensive. But the CTO, who had been through SOC2 once before at a previous company, had a specific concern: "I know what the auditor is going to ask. I don't know whether our code will pass those questions. I need a second senior pair of eyes that thinks like an auditor, before the auditor does."
That framing defined the engagement. We were not hunting every defect. We were looking specifically for the classes of defect a SOC2 Type I assessor flags as Critical: PHI leakage, audit-log gaps, RBAC escalation, and any gap in the control narrative the client had written up.
The eight weeks were structured as two sprints: four weeks of deep audit, then four weeks of remediation-plus-verification, ending with a dry-run walkthrough the week before the assessor's kick-off.
Scope
Three compliance-adjacent surfaces:
- PHI boundaries. Every ingress and egress point for patient data: API endpoints, webhook responses, error responses, log statements, background jobs, and export flows.
- Audit logging. Every authentication event, authorization decision, data-access event, and administrative action. Completeness, immutability, time-accuracy, and correlation.
- RBAC. Role definitions, middleware composition, scope boundaries between clinician / admin / billing / patient-facing roles, and the interaction with the multi-tenant partition model.
Method
Healthtech pulls especially hard on the authentication-and-authorization, data-integrity, observability, and adversarial dimensions of our nine-dimension protocol. We added a threat-modeling phase at the front of this engagement because SOC2 assessors expect to see a documented threat model, and Toniq didn't have one.
Tooling
- semgrep with a custom PHI rule pack. Rules flagged any string interpolation of patient-identifying fields into log statements, error responses, or HTTP headers, plus rules for unsanitized error returns from the controller layer.
- Burp Suite Pro for manual authenticated exploration across role boundaries. We crafted test personas per role and replayed them against every endpoint in the scope list.
- Threat modeling sessions using a light STRIDE variant, one session per surface (PHI, audit, RBAC). Outputs were a diagrammed threat catalog and a mapping from each threat to the specific control (or control gap) in the codebase.
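To make the rule pack concrete, here is a minimal semgrep rule in the style we used. The `logger` call shape and the field names (`mrn`, `patientId`) are hypothetical stand-ins, not Toniq's real identifiers:

```yaml
rules:
  - id: phi-in-log-interpolation
    languages: [typescript]
    severity: ERROR
    message: Patient-identifying field interpolated into a log statement
    pattern-either:
      - pattern: logger.$METHOD(`...${$P.mrn}...`)
      - pattern: logger.$METHOD(`...${$P.patientId}...`)
```

The real pack had companion rules for error responses and HTTP headers, each anchored on the client's actual field names rather than these placeholders.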
Concrete scenarios
Scenario A - Forced-error PHI leak. We intentionally triggered the error path on twenty endpoints that handled patient data: malformed IDs, invalid payloads, authorization-denied requests. For each, we inspected the response body, the response headers, the stack trace surfaced in development mode, and the corresponding log line. We were looking for patient identifiers, not just full records.
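The response-inspection step can be sketched as a small scanner run over each captured body. The identifier formats below (the MRN shape, the SSN shape, the serialized field name) are hypothetical examples, not Toniq's real schema:

```typescript
// Patterns that indicate a patient identifier escaped into a response body.
// All three formats are illustrative; a real scan uses the client's schema.
const PHI_PATTERNS: RegExp[] = [
  /\b[A-Z]{3}\d{6}\b/,       // hypothetical MRN format: three letters, six digits
  /\b\d{3}-\d{2}-\d{4}\b/,   // SSN-shaped value
  /"patientId"\s*:\s*"/,     // a serialized patient-ID field
];

// Returns the source of every pattern that matched, so a finding can name
// which identifier class leaked.
function findPhiLeaks(body: string): string[] {
  return PHI_PATTERNS.filter((p) => p.test(body)).map((p) => p.source);
}
```

In practice each forced-error response body and its log line went through a scan like this before manual review.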
Scenario B - Failed auth audit coverage. We walked through every failure mode of the auth stack - wrong password, expired token, revoked session, missing MFA step, rate-limited requests - and verified that each produced an audit entry with the right event type, the right principal, and the right correlation ID. A gap here is a SOC2-critical "monitoring failure" finding.
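The walkthrough reduces to a coverage diff between the event types the control requires and the types the audit log actually contains. A minimal sketch, with hypothetical event names:

```typescript
// Required audit event types for the auth surface versus what the log
// showed during the walkthrough. Event names here are illustrative.
const required = [
  "auth.login.success", "auth.login.failure", "auth.token.expired",
  "auth.session.revoked", "auth.mfa.missing", "auth.rate_limited",
];
const observed = ["auth.login.success"]; // what a pre-remediation log might show

// Every required event type with no observed counterpart is a coverage gap.
function auditGaps(required: string[], observed: string[]): string[] {
  const seen = new Set(observed);
  return required.filter((e) => !seen.has(e));
}
```

The same diff, run per event type against the full required list, is what produced the coverage numbers reported under Outcome.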
Scenario C - Role-boundary probing. With a clinician-role token, we attempted to call admin-scoped endpoints, billing-scoped endpoints, and endpoints for other tenants. We followed every 403 backwards through the middleware stack to confirm the decision was made on the server, not just in the UI.
Findings
The default error middleware had been generated as a handy debugging aid: in non-production modes it returned the stack trace to the client. The production build disabled that feature - but only at the outer wrapper. The inner data layer, when it threw on a missing record, included the lookup key (the patient ID) in the exception message, which was then serialized into the client-facing error payload even in production. We found PHI-class identifiers in the response body on nine endpoints.
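A minimal reconstruction of the defect and the fix - names are illustrative, not Toniq's actual code. The inner layer's exception message carried the lookup key; the remediation keeps the key out of anything the client-facing serializer can reach:

```typescript
// The inner data layer threw an error whose message embedded the lookup key.
class RecordNotFound extends Error {
  constructor(public readonly lookupKey: string) {
    // Pre-remediation, this message was serialized straight into the
    // client-facing error payload, even in production.
    super(`record not found: ${lookupKey}`);
  }
}

// Post-remediation serializer: a fixed, key-free message goes to the client.
// The lookup key stays on the error object for server-side logging only.
function toClientPayload(err: unknown): { error: string } {
  if (err instanceof RecordNotFound) return { error: "record not found" };
  return { error: "internal error" };
}
```

The point of the fix is structural: the sanitization happens at the serialization boundary, so no inner layer can reintroduce the leak by changing a message string.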
Successful logins were logged correctly. Failed logins, expired tokens, and rate-limited requests were not - the audit-event emitter was only wired into the success branch of the auth middleware. For SOC2 monitoring controls this is a Critical gap: the auditor cannot verify that brute-force attempts are detectable if the failures never reach the audit trail. We also found that the audit log timestamped entries from the application server's unsynchronized local clock rather than a trusted, NTP-disciplined time source, which undermines event ordering and cross-service correlation - a secondary finding in the same category.
Two admin-scoped endpoints composed their authorization middleware in the wrong order. The clinician-role middleware ran first, which short-circuited into "allowed" because the clinician role passed its own check; the admin-role check that was supposed to run next was never reached. Any clinician could invoke two administrative endpoints - one of which was the tenant-configuration modifier. This is the exact class of finding that turns a SOC2 Type I into a Type I-with-qualifications, and it's the exact class the LLM scaffolder produces when it generates middleware chains from a textual description.
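A minimal model of the ordering bug - the chain mechanics below are an illustrative reconstruction, not the actual middleware code. Treating the first passing check as a terminal allow lets the clinician check short-circuit past the admin check; requiring every check in the chain to pass does not:

```typescript
type Decision = "allow" | "deny";
type Check = (roles: string[]) => Decision;

const requireRole = (role: string): Check =>
  (roles) => roles.includes(role) ? "allow" : "deny";

// BUG shape: the first check that allows terminates the chain, so
// requireRole("clinician") short-circuits before requireRole("admin") runs.
const anyOf = (...checks: Check[]): Check =>
  (roles) => checks.some((c) => c(roles) === "allow") ? "allow" : "deny";

// FIX shape: every check in the chain must allow for the request to proceed.
const allOf = (...checks: Check[]): Check =>
  (roles) => checks.every((c) => c(roles) === "allow") ? "allow" : "deny";
```

The textual descriptions the scaffolder was fed ("clinicians and admins can...", "admins can...") map equally well onto both compositions, which is exactly why this class of bug survives review of the prose requirements.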
Most endpoints inferred the tenant partition from the JWT claim. Two endpoints, both background-job triggers, accepted a tenant_id field in the request body and did not cross-check it against the caller's token. Authenticated users from tenant A could enqueue jobs scoped to tenant B. Low exploitation ceiling because the jobs were internal, but a clear SOC2 finding.
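The fix derives the partition from the verified token claim and rejects any body-supplied tenant_id that disagrees. A sketch with hypothetical request shapes:

```typescript
interface JobRequest { tenantId?: string; jobName: string }

// Post-remediation: the JWT claim is authoritative. A body-supplied tenant
// is tolerated only if it matches; pre-remediation it was trusted as-is.
function resolveTenant(claimTenant: string, body: JobRequest): string {
  if (body.tenantId !== undefined && body.tenantId !== claimTenant) {
    throw new Error("tenant mismatch: body tenant does not match token claim");
  }
  return claimTenant;
}
```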
The log-shipping config retained access logs for 30 days. The client's own SOC2 control narrative claimed 12 months. This is a documentation-versus-implementation drift of the kind assessors spot in the first hour of fieldwork.
Outcome
Headline: the external SOC2 Type I assessor produced a report with zero Critical findings and a clean monitoring-control narrative. Secondary metrics:
- All three Criticals closed within 11 days of disclosure. Verified by re-running the semgrep rule pack and the Burp test personas.
- Threat model delivered as a written document with 34 threats catalogued, mapped to controls, and reviewed with the external assessor before fieldwork.
- Audit-log coverage: 100% of the 27 required event types post-remediation, up from 18/27 at the start.
- Hand-off regression: a semgrep rule pack and 60-scenario Burp project saved into the client's CI pipeline.
Takeaway
Compliance-sensitive software written with LLM assistance has a specific failure mode that assessors know how to find: the shape of the controls is present (there is a middleware, there is a logger, there is a role model) but the composition is subtly wrong - a check runs in the wrong order, a logger is wired to only one branch, an error path leaks data the normal path doesn't. None of this shows up in the test suite because the test suite was generated from the same mental model as the code. A senior adversarial review one month before the external assessor arrives is the cheapest insurance a healthtech founder can buy. It is enormously cheaper than a Type I-with-qualifications and a second round of fieldwork.
Want the same?
Book a 20-min call. You explain the surface. I explain which tier fits.