AI Book Companion Platform with Enterprise Security and Observability

Mar 20, 2026

For one of the largest privately held companies in the US, I engineered the platform infrastructure for a conversational AI book companion — a tool that helps readers learn and apply leadership principles through Socratic dialogue. My work spanned authentication, infrastructure-as-code, observability, reliability, and agent execution, delivering a secure, observable, production-grade platform across three AWS environments.

The Challenge

The organization published a leadership book and wanted to build a public-facing AI companion that helps readers move from understanding the content to applying it in their own work. The system needed to support three conversation modes — learning a principle, preparing for an upcoming situation, and reflecting on a past experience — each producing persistent “cards” capturing the user’s insights over time.

I was responsible for making this AI application production-ready:

  • Multi-provider authentication: Public users needed frictionless sign-up via social login (Google, LinkedIn), corporate SSO, magic link emails, and guest access — all unified through a single identity layer
  • Multi-environment deployment: A full dev → nonprod → prod pipeline on AWS EKS, managed through infrastructure-as-code with GitOps
  • LLM observability: Every conversation needed to be traced end-to-end — prompt inputs, model outputs, latency, and token costs — without sending data to external services
  • Reliability at scale: CloudNativePG for PostgreSQL with connection management, Temporal workflow orchestration, and load testing to validate behavior under concurrent users
  • Content safety: AWS Bedrock Guardrails for input/output filtering on all AI-generated responses

System Architecture

Authentication & Identity

I designed and implemented the full authentication stack using AWS Cognito as the identity provider. The system supports multiple authentication paths through a unified flow:

OAuth/SSO Providers:

  • Google and LinkedIn social login for public users
  • Corporate SSO integration for internal users
  • Each provider configured with Cognito user pool identity providers, handling attribute mapping and token exchange

Magic Link Authentication: I built a complete magic link system for passwordless onboarding — a full-stack feature spanning backend service, database model, API endpoint, email delivery via AWS SES, and a frontend verification page. Users enter their email, receive a one-time link, and are authenticated without creating a password. This reduced sign-up friction significantly for first-time users coming from the book.

Guest Access: For users who want to explore without committing to an account, I implemented a guest user flow with rate limiting to prevent abuse. Guests can later upgrade to full accounts, preserving their conversation history and cards.
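The guest rate limiting can be illustrated with a sliding-window limiter — a sketch with hypothetical limits, not the production implementation:

```python
import time
from collections import defaultdict, deque

class GuestRateLimiter:
    """Sliding-window limiter keyed by guest ID (limits are illustrative)."""

    def __init__(self, max_requests: int = 20, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self._hits: dict[str, deque[float]] = defaultdict(deque)

    def allow(self, guest_id: str) -> bool:
        now = time.monotonic()
        hits = self._hits[guest_id]
        # Drop timestamps that have fallen out of the window
        while hits and now - hits[0] > self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: reject this request
        hits.append(now)
        return True
```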

Security Hardening:

  • Zero Trust Network Access rules (Zscaler Private Access) controlling access to internal services
  • Network filtering rules for traffic control between environments

Infrastructure as Code

I managed all infrastructure as code, using Pulumi to provision AWS resources:

Pulumi Resource Management:

  • AWS Cognito user pools, identity providers, and app clients
  • AWS EKS cluster configuration across three environments (dev, nonprod, prod)
  • AWS Secrets Manager entries for application secrets
  • S3 buckets with versioning enabled and public access removed
  • EBS snapshot policies for volume backup management
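A Pulumi YAML stack for the Cognito pieces might look like the following — a sketch with illustrative resource and config names, not the production program:

```yaml
name: book-companion
runtime: yaml

config:
  googleClientId:
    type: string
  googleClientSecret:
    type: string

resources:
  userPool:
    type: aws:cognito:UserPool
    properties:
      name: companion-users-${pulumi.stack}   # one pool per environment stack

  googleIdp:
    type: aws:cognito:IdentityProvider
    properties:
      userPoolId: ${userPool.id}
      providerName: Google
      providerType: Google
      providerDetails:
        client_id: ${googleClientId}
        client_secret: ${googleClientSecret}
        authorize_scopes: "openid email profile"
      attributeMapping:
        email: email
```

Each environment (dev, nonprod, prod) gets its own stack configuration file, so the same program provisions all three.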

Kubernetes & GitOps:

  • Kustomize overlays for environment-specific manifests
  • Flux for GitOps-based continuous deployment from the main branch
  • External Secrets Operator bridging Kubernetes secrets with AWS Secrets Manager
  • Multi-environment Kustomization files for nonprod and prod clusters
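A Flux Kustomization resource ties an environment's overlay to the Git source; a sketch with illustrative names:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: companion-prod
  namespace: flux-system
spec:
  interval: 5m                    # how often Flux reconciles the cluster against Git
  path: ./k8s/overlays/prod       # the Kustomize overlay for this environment
  prune: true                     # delete cluster resources removed from Git
  sourceRef:
    kind: GitRepository
    name: companion-repo
  targetNamespace: companion
```

With `prune: true`, the cluster converges on whatever the main branch declares — deployments and rollbacks are both just Git commits.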

CI/CD Pipeline:

  • GitHub Actions workflows with OIDC role assumption — no static AWS credentials stored in CI
  • Automated build, test, and deploy pipelines for both backend and frontend
  • Container image builds pushed to AWS ECR
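The OIDC role assumption looks roughly like this in a workflow — account ID, role name, and registry are illustrative:

```yaml
permissions:
  id-token: write   # required for GitHub to issue an OIDC token
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      ECR_REGISTRY: 123456789012.dkr.ecr.us-east-1.amazonaws.com  # illustrative
    steps:
      - uses: actions/checkout@v4

      - name: Assume AWS role via OIDC (no stored credentials)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-deploy  # illustrative
          aws-region: us-east-1

      - name: Build and push image to ECR
        run: |
          aws ecr get-login-password | docker login --username AWS --password-stdin "$ECR_REGISTRY"
          docker build -t "$ECR_REGISTRY/app:$GITHUB_SHA" .
          docker push "$ECR_REGISTRY/app:$GITHUB_SHA"
```

The IAM role trusts GitHub's OIDC provider and is scoped to this repository, so short-lived credentials are minted per run and nothing sits in repository secrets.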

Observability

I deployed and configured the observability stack to give full visibility into the AI application’s behavior:

Langfuse (LLM Tracing):

  • Deployed Langfuse in-cluster on EKS as a StatefulSet, backed by PostgreSQL and ClickHouse — no external data egress for conversation traces
  • Every LLM call traced with prompt inputs, model outputs, latency, token counts, and cost estimates
  • Session-level views linking multi-turn conversations without exposing intermediate API requests
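What gets captured per call can be illustrated with a plain-Python wrapper — this is a stand-in showing the shape of a trace, not the Langfuse SDK:

```python
import time
from dataclasses import dataclass

@dataclass
class LLMTrace:
    """One traced LLM call: the kind of fields recorded per generation."""
    session_id: str
    prompt: str
    output: str
    latency_ms: float
    input_tokens: int
    output_tokens: int

def traced_call(session_id: str, prompt: str, llm) -> LLMTrace:
    """Wrap an LLM call, capturing latency and token counts (illustrative only)."""
    start = time.perf_counter()
    output = llm(prompt)
    return LLMTrace(
        session_id=session_id,
        prompt=prompt,
        output=output,
        latency_ms=(time.perf_counter() - start) * 1000,
        input_tokens=len(prompt.split()),    # placeholder for a real tokenizer
        output_tokens=len(output.split()),
    )
```

Grouping traces by `session_id` is what enables the session-level views: a multi-turn conversation appears as one timeline of generations with their costs.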

Reliability & Performance

Database:

  • CloudNativePG (CNPG) with tuned health probe timeouts to prevent false-positive pod restarts
  • Connection management optimized to avoid idle-in-transaction leaks
  • Password-based database authentication configured consistently across environments

Load Testing: I built a load testing framework with simulated users that exercise realistic conversation scenarios — not synthetic HTTP requests. The framework includes configurable scenarios, an API client matching the production authentication flow, and conversation prompts that trigger multi-turn agent interactions. This validated the system’s behavior under concurrent load and identified bottlenecks in database connection pooling.
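The simulated-user approach can be sketched with asyncio — a toy version where the API call is stubbed out (the real client authenticated against the production flow and drove actual conversations):

```python
import asyncio
import random
import time

async def fake_agent_turn(prompt: str) -> str:
    """Stand-in for the real authenticated API call."""
    await asyncio.sleep(random.uniform(0.01, 0.05))  # simulated server latency
    return f"response to: {prompt}"

async def simulated_user(user_id: int, turns: int, latencies: list[float]) -> None:
    """One complete session: a multi-turn conversation, recording per-turn latency."""
    for turn in range(turns):
        start = time.perf_counter()
        await fake_agent_turn(f"user {user_id} turn {turn}")
        latencies.append(time.perf_counter() - start)

async def run_load_test(concurrent_users: int = 20, turns: int = 3) -> list[float]:
    latencies: list[float] = []
    # All users run concurrently; each user's turns run sequentially, as in real usage
    await asyncio.gather(*(simulated_user(u, turns, latencies) for u in range(concurrent_users)))
    return latencies

latencies = asyncio.run(run_load_test())
```

Because each simulated user holds a session across turns, this style of test surfaces stateful problems — connection pool exhaustion, idle-in-transaction leaks — that stateless HTTP hammering never would.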

Agent Concurrency

The AI companion uses an agent framework with pluggable tools for retrieving book content, managing user cards, and accessing conversation history. I implemented concurrent tool execution, enabling the agent to run multiple tools in parallel rather than sequentially. This reduced response latency for complex queries requiring multiple data sources, directly improving the user experience during conversations.
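The pattern is straightforward with asyncio — a minimal sketch with invented tool names standing in for the real ones:

```python
import asyncio

async def retrieve_book_content(query: str) -> str:
    await asyncio.sleep(0.05)  # simulated I/O (vector search, DB read, ...)
    return f"passages for {query!r}"

async def load_user_cards(user_id: str) -> str:
    await asyncio.sleep(0.05)
    return f"cards for {user_id}"

async def load_history(session_id: str) -> str:
    await asyncio.sleep(0.05)
    return f"history for {session_id}"

async def answer(query: str, user_id: str, session_id: str) -> list[str]:
    # Run all three I/O-bound tools in parallel: total wait is roughly the
    # slowest tool, not the sum of all three.
    return await asyncio.gather(
        retrieve_book_content(query),
        load_user_cards(user_id),
        load_history(session_id),
    )

results = asyncio.run(answer("humility", "u1", "s1"))
```

With three 50 ms tools, the sequential version waits ~150 ms while the concurrent one waits ~50 ms; the gap widens as queries touch more data sources.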

Key Design Decisions

  1. Cognito over alternatives — AWS-native identity management scales with the existing infrastructure. Cognito’s built-in support for multiple OAuth providers, user pools, and hosted UI reduced the custom auth code needed. Integration with other AWS services (SES for email, Secrets Manager for credentials) avoided glue code between providers.

  2. Magic links for frictionless onboarding — Book readers arriving at the companion for the first time needed the lowest possible barrier to entry. Magic links eliminate password creation, password reset flows, and the cognitive load of yet another credential. AWS SES handles delivery reliability.

  3. Pulumi over Terraform — Pulumi’s YAML configuration approach provided cleaner multi-environment management. Each environment (dev, nonprod, prod) has its own stack configuration, and Pulumi’s native AWS provider integration simplified Cognito resource provisioning where Terraform’s AWS provider had gaps.

  4. In-cluster Langfuse — Hosting Langfuse within EKS rather than using the managed SaaS eliminated external data egress for conversation traces. Given the sensitive nature of user leadership reflections, keeping all trace data within the AWS VPC was a security requirement, not a preference.

  5. Simulated-user load testing — Traditional HTTP load testing wouldn’t capture the stateful, multi-turn nature of AI conversations. The load testing framework simulates complete user sessions — authentication, conversation initiation, multi-turn dialogue, and card generation — providing realistic performance profiles.

  6. Concurrent tool execution — The agent’s tools (book content retrieval, card management, conversation history) are I/O-bound operations that benefit from parallelism. Running them concurrently rather than sequentially reduced agent response times for complex queries requiring multiple data sources.

Results & Impact

  • Deployed production platform across 3 AWS environments (dev, nonprod, prod) with GitOps-based continuous deployment
  • Implemented multi-provider authentication (Google, LinkedIn, corporate SSO, magic links, guest access) serving public users through unified Cognito identity layer
  • LLM observability pipeline with in-cluster Langfuse — every conversation traced end-to-end with prompt, response, latency, and cost data
  • Infrastructure-as-code with Pulumi managing all AWS resources across environments
  • Achieved zero-downtime deployments through Kubernetes rolling updates and Flux GitOps reconciliation
  • Hardened security posture: S3 bucket versioning, removed public access, ZPA rules, network filtering, critical SSO library upgrades
  • Validated platform reliability through simulated-user load testing exercising complete conversation flows
  • Improved agent response latency through concurrent tool execution for multi-source queries
  • Resolved database connection leaks and workflow scheduling conflicts, ensuring stable operation under sustained load

Technologies

AWS Cognito, AWS EKS, AWS Bedrock, AWS SES, AWS Secrets Manager, Pulumi, Kubernetes, Kustomize, Flux, FastAPI, React, TypeScript, Python, Langfuse, ClickHouse, OpenTelemetry, Temporal, PostgreSQL (CNPG), Docker, GitHub Actions, Sentry, PostHog