# Developer Portal — High-Level Design

> Companion to [plan.md](./plan.md). This document describes the system architecture: components, data, contracts, deployment topology, and how the portal connects to deployed RAG app instances.

---

## 1. System Context

```
                  +---------------------------------------------+
                  |             Internal Developers              |
                  +---------------------------------------------+
                                         |
                                         | (browser, HTTPS)
                                         v
                  +---------------------------------------------+
                  |       Developer Portal (this system)         |
                  |   portal.yourdomain.com                      |
                  +---------------------------------------------+
                       |             |                |
            (configures)|     (deploys)|       (observes)|
                       v             v                v
        +----------------+   +----------------+   +-------------------+
        |  Firestore     |   |  Cloud Build   |   |  Cloud Logging /  |
        |  (portal DB)   |   |  + Artifact    |   |  Error Reporting /|
        |                |   |  Registry +    |   |  Monitoring +     |
        |                |   |  Cloud Run     |   |  Langfuse         |
        +----------------+   +----------------+   +-------------------+
                                     |                     ^
                                     | (deploys)           | (traces, logs)
                                     v                     |
                  +---------------------------------------------+
                  |       N x RAG App Instances (Cloud Run)      |
                  |  technician.apps.yourdomain.com              |
                  |  sales.apps.yourdomain.com                   |
                  |  warehouse.apps.yourdomain.com               |
                  |  ...                                          |
                  +---------------------------------------------+
                                         |
                                         v
                       +-------------------------------+
                       |  Shared GCP backplane:        |
                       |    Vertex AI / Firestore /    |
                       |    Cloud Storage / Speech     |
                       +-------------------------------+
```

---

## 2. Components

### 2.1 Portal Frontend (`portal-frontend`)

- **Stack:** React 19 + TypeScript + Vite + TailwindCSS 4 + shadcn/ui (mirrors RAG app frontend).
- **Hosted on:** Cloud Run (static files served via FastAPI), or alternatively Firebase Hosting.
- **Auth:** Firebase Auth client SDK, scoped to `portal-developers` tenant.
- **Major pages:**
  - `/apps` — list view: all deployed apps, health pill, last deploy, sight toggle.
  - `/apps/new` — create form (name, slug, persona, LLM, RAG, branding, env).
  - `/apps/:id` — detail: overview tab.
  - `/apps/:id/config` — edit config (persona, LLM, RAG knobs, sight toggle).
  - `/apps/:id/deploy` — revision history, rollback, traffic split.
  - `/apps/:id/observe` — sub-tabs: Health, Logs, Errors, AI Traces, Cost.
  - `/personas` — persona library CRUD.
  - `/settings` — portal-wide (developer accounts, GCP project links, Langfuse keys).

### 2.2 Portal Backend (`portal-backend`)

- **Stack:** FastAPI + Pydantic (mirrors RAG app).
- **Responsibilities:**
  - CRUD on `apps`, `personas`, `config_snapshots`, `deployment_events`.
  - Trigger Cloud Build via REST.
  - Proxy / aggregate observability queries (Logging, Error Reporting, Monitoring, Langfuse).
  - Manage Secret Manager entries per app.
  - Health-check deployed apps (periodic + on-demand).
- **Auth middleware:** Firebase JWT verification + RBAC (developer / admin).
- **GCP service account:** `portal-sa@<project>.iam.gserviceaccount.com` with these roles:
  - `roles/cloudbuild.builds.editor`
  - `roles/run.admin`
  - `roles/iam.serviceAccountUser` (to deploy as the runtime SA)
  - `roles/secretmanager.admin`
  - `roles/logging.viewer`
  - `roles/errorreporting.viewer`
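  - `roles/monitoring.viewer` (read-only; creating the per-app uptime checks described in Section 2.6 needs a write role such as `roles/monitoring.editor`)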
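  - `roles/datastore.user` (portal Firestore)
  - `roles/bigquery.jobUser` plus read access to the billing export dataset (needed for the cost queries in Section 2.6)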

### 2.3 Portal Database (Firestore)

Separate Firestore database (`portal-state`) inside the portal GCP project. Schema in Section 4.

### 2.4 RAG App Template (existing repo `vertexai-rag/`)

Modifications needed:

- **`backend/core/config.py`** — add `APP_ID`, `PORTAL_API_URL`, `PORTAL_API_TOKEN`, `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, `SIGHT_ENABLED_DEFAULT`.
- **`backend/core/app_config.py` (new)** — fetches per-app config from portal on boot, caches, exposes `get_app_config()`.
- **`backend/middleware/sight_middleware.py` (new)** — checks `app_config.sight_enabled`; returns 503 if off.
- **`backend/service/rag_service.py`** — reads `system_prompt`, `first_message`, `document_locked_message`, `llm_config` from app config instead of hard-coded constants.
- **`backend/service/openai_text_client.py`** — wraps calls with Langfuse `@observe` carrying `app_id`, `user_id`, `conversation_id`.
- **`backend/routes/admin.py`** — adds `POST /admin/reload-config` for portal-triggered hot reload (auth: portal SA bearer token).
- **`cloudbuild.yaml` (new at repo root)** — parameterized build/deploy pipeline.
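
The sight check described above could look roughly like the following pure ASGI middleware, which FastAPI can mount via `app.add_middleware`. This is a minimal sketch, not the actual `sight_middleware.py`: the `get_app_config` callable, the `default_enabled` fallback (standing in for `SIGHT_ENABLED_DEFAULT`), and the `/admin/` exemption are assumptions. The exemption is there so the portal can still reach `/admin/reload-config` to turn the app back on.

```python
import json


class SightMiddleware:
    """ASGI middleware that answers 503 while the app's sight toggle is off."""

    def __init__(self, app, get_app_config, default_enabled=True,
                 exempt_prefixes=("/admin/",)):
        self.app = app
        self.get_app_config = get_app_config        # hypothetical accessor returning a dict
        self.default_enabled = default_enabled      # assumed stand-in for SIGHT_ENABLED_DEFAULT
        self.exempt_prefixes = tuple(exempt_prefixes)  # assumption: keep reload reachable

    async def __call__(self, scope, receive, send):
        # Only HTTP requests are gated; lifespan and other scopes pass through.
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        enabled = self.get_app_config().get("sight_enabled", self.default_enabled)
        if enabled or scope.get("path", "").startswith(self.exempt_prefixes):
            await self.app(scope, receive, send)
            return

        # Sight is off: short-circuit before any business logic runs.
        body = json.dumps({"detail": "app is disabled"}).encode()
        await send({
            "type": "http.response.start",
            "status": 503,
            "headers": [
                (b"content-type", b"application/json"),
                (b"content-length", str(len(body)).encode()),
            ],
        })
        await send({"type": "http.response.body", "body": body})
```

Reading the toggle on every request keeps the hot path simple: once the in-memory config changes, the next request sees the new value without a restart.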

### 2.5 Build & Deploy Pipeline (Cloud Build)

```
portal-backend
   |
   | POST /v1/projects/.../triggers/<trigger-id>:run
   |   substitutions: { _APP_ID, _TEMPLATE_VERSION, _CONFIG_SNAPSHOT_ID }
   v
Cloud Build
   |
   | step 1: git checkout vertexai-rag @ ${_TEMPLATE_VERSION}
   | step 2: curl ${PORTAL_API}/internal/snapshots/${_CONFIG_SNAPSHOT_ID} -> /workspace/app-config.json
   | step 3: docker build -t ${REGION}-docker.pkg.dev/.../rag:${_APP_ID}-${SHORT_SHA} .
   | step 4: docker push
   | step 5: gcloud run deploy rag-${_APP_ID} \
   |             --image ... \
   |             --set-env-vars APP_ID=${_APP_ID},PORTAL_API_URL=...,LANGFUSE_PUBLIC_KEY=... \
   |             --set-secrets OPENAI_API_KEY=projects/.../secrets/openai-${_APP_ID}:latest,... \
   |             --service-account rag-runtime-sa@... \
   |             --region ${REGION} \
   |             --no-allow-unauthenticated (or --allow-unauthenticated, depending on app type)
   | step 6: gcloud beta run domain-mappings create --service rag-${_APP_ID} \
   |             --domain ${_APP_ID}.apps.yourdomain.com
   | step 7: callback POST ${PORTAL_API}/internal/builds/${BUILD_ID}/complete
   v
Cloud Run service: rag-${_APP_ID}
```
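
As an illustration of the first arrow in the pipeline, the portal backend might assemble the trigger-run request as below. This is a sketch under assumptions: it targets the global `projects.triggers.run` endpoint, whose request body is a `RepoSource`, and it assumes `_TEMPLATE_VERSION` is a git tag (hence `tagName`). The function name and arguments are hypothetical, and sending the request with an access token for `portal-sa` is left out.

```python
CLOUD_BUILD_API = "https://cloudbuild.googleapis.com/v1"


def build_trigger_run_request(project_id, trigger_id, app_id,
                              template_version, snapshot_id):
    """Return (url, json_body) for running the parameterized RAG trigger."""
    url = f"{CLOUD_BUILD_API}/projects/{project_id}/triggers/{trigger_id}:run"
    body = {
        # Assumption: template versions such as "v1.4.2" are git tags.
        "tagName": template_version,
        # User-defined substitutions must start with an underscore.
        "substitutions": {
            "_APP_ID": app_id,
            "_TEMPLATE_VERSION": template_version,
            "_CONFIG_SNAPSHOT_ID": snapshot_id,
        },
    }
    return url, body
```

Keeping request construction separate from the HTTP call makes the substitutions easy to unit-test without touching GCP.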

### 2.6 Observability Stack

| Signal | Source | Portal access |
|--------|--------|---------------|
| **Logs** | Cloud Logging (auto-collected from Cloud Run stdout) | Logging API filter `resource.labels.service_name="rag-<app_id>"` |
| **Errors / crashes** | Cloud Error Reporting (auto from stderr stacks) | Error Reporting API filter by service name |
| **Metrics** (latency, RPS, instance count) | Cloud Monitoring | Monitoring API; portal renders charts |
| **Uptime** | Cloud Monitoring uptime checks (portal creates one per app on deploy) | Monitoring API |
| **LLM traces** | Langfuse (self-hosted or cloud) | Langfuse REST API, filter by `tags=["app_id:<id>"]` |
| **Cost** | Cloud Billing export → BigQuery | BigQuery query per `service` label |

All signals are tagged with `app_id`, either directly or through the `rag-<app_id>` service name, so the portal can filter each source by app.
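
For the Logs row, the portal could build the Cloud Logging filter along these lines. The helper name, the optional severity and timestamp clauses, and the `app_id` character check are assumptions; the check is there so a crafted id cannot inject extra filter terms.

```python
import re

_APP_ID_RE = re.compile(r"^[a-z0-9-]+$")  # assumed shape of app ids


def cloud_run_log_filter(app_id, min_severity=None, since_rfc3339=None):
    """Build a Cloud Logging filter for one deployed RAG app."""
    if not _APP_ID_RE.fullmatch(app_id):
        raise ValueError(f"invalid app_id: {app_id!r}")
    parts = [
        'resource.type="cloud_run_revision"',
        f'resource.labels.service_name="rag-{app_id}"',
    ]
    if min_severity:
        parts.append(f"severity>={min_severity}")      # e.g. WARNING, ERROR
    if since_rfc3339:
        parts.append(f'timestamp>="{since_rfc3339}"')  # e.g. 2025-01-01T00:00:00Z
    return " AND ".join(parts)
```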

---

## 3. Connectivity to Deployed Apps

### 3.1 At deploy time (push)

1. Developer clicks "Deploy" in portal.
2. Portal writes `config_snapshots/<snapshot_id>` (immutable copy of current app config).
3. Portal triggers Cloud Build with `_CONFIG_SNAPSHOT_ID`.
4. Cloud Build → Artifact Registry → Cloud Run → domain mapping.
5. The final Cloud Build step calls portal `/internal/builds/<build_id>/complete` with the revision name.
6. Portal updates `apps/<app_id>.current_revision` and `deployment_events`.
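
The bookkeeping in steps 2, 3, 5 and 6 can be sketched with plain dictionaries standing in for the Firestore collections. Everything here is illustrative: `db`, `trigger_build`, and both function names are hypothetical, and a real implementation would use Firestore writes (ideally transactional) instead of in-memory mutation.

```python
import copy
import uuid
from datetime import datetime, timezone


def _now():
    return datetime.now(timezone.utc).isoformat()


def start_deploy(db, app_id, actor_email, trigger_build):
    """Steps 2-3: freeze the current config, then hand its id to Cloud Build."""
    app = db["apps"][app_id]
    snapshot_id = uuid.uuid4().hex
    db["config_snapshots"][snapshot_id] = {
        "app_id": app_id,
        "snapshot_at": _now(),
        "config": copy.deepcopy(app),  # later edits to the app must not leak in
    }
    build_id = trigger_build(app_id, snapshot_id)  # hypothetical Cloud Build call
    app["status"] = "deploying"
    db["deployment_events"].append({
        "app_id": app_id, "event_type": "deploy_start",
        "actor_email": actor_email, "build_id": build_id,
        "snapshot_id": snapshot_id, "timestamp": _now(),
    })
    return build_id


def complete_build(db, build_id, revision=None, error_message=None):
    """Steps 5-6: record the outcome reported by the build callback."""
    start = next(e for e in db["deployment_events"]
                 if e["build_id"] == build_id and e["event_type"] == "deploy_start")
    app = db["apps"][start["app_id"]]
    succeeded = error_message is None
    if succeeded:
        app["current_revision"] = revision
    app["status"] = "live" if succeeded else "failed"
    db["deployment_events"].append({
        "app_id": start["app_id"],
        "event_type": "deploy_success" if succeeded else "deploy_fail",
        "build_id": build_id, "revision": revision,
        "snapshot_id": start["snapshot_id"],
        "error_message": error_message, "timestamp": _now(),
    })
```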

### 3.2 At runtime (pull)

1. Cloud Run instance boots with `APP_ID`, `PORTAL_API_URL`, `PORTAL_API_TOKEN` env vars.
2. App calls `GET ${PORTAL_API_URL}/internal/apps/${APP_ID}/config` (bearer auth).
3. Portal returns the **snapshot** that the current revision was built with (not the live config — config is versioned).
4. App caches in-memory; serves traffic.
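5. If the portal is unreachable, a running instance keeps serving from its in-memory cache (resilience). A freshly booted instance has no cache yet, so it needs a fallback such as the snapshot baked into the image at build time (Section 2.5, step 2).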
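
A minimal sketch of the caching behaviour in steps 2 to 5, assuming the HTTP call is wrapped in an injected `fetch` callable. The class name and the `baked_path` fallback are assumptions; the fallback reflects the `app-config.json` file that the build pipeline writes into the workspace, on the assumption that it ends up inside the image.

```python
import json
from pathlib import Path


class AppConfigCache:
    """In-memory config cache that survives portal outages."""

    def __init__(self, fetch, baked_path=None):
        self._fetch = fetch            # callable returning a dict, raises on failure
        self._baked_path = baked_path  # assumed location of the build-time snapshot
        self._config = None

    def load(self):
        """Fetch from the portal; on failure keep the last good config."""
        try:
            self._config = self._fetch()
        except Exception:  # any fetch failure is treated as "portal down"
            if self._config is not None:
                pass  # keep serving the last cached config
            elif self._baked_path and Path(self._baked_path).is_file():
                self._config = json.loads(Path(self._baked_path).read_text())
            else:
                raise  # nothing cached and nothing baked: fail the boot
        return self._config

    def get(self):
        """Return the cached config, loading it on first use."""
        return self._config if self._config is not None else self.load()
```

Calling `load()` again is also what a reload endpoint would do, so a failed reload leaves the previous config in place.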

### 3.3 For sight toggle / hot config changes

Two strategies, depending on urgency:

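- **Hot path (sight toggle off, prompt tweak):** Portal calls `POST https://rag-<app_id>.../admin/reload-config`, authenticating with an OIDC token for the portal SA sent as a bearer token. The app re-fetches its config from the portal and swaps its in-memory cache; no restart or new revision is needed. For the reload to change anything, the portal must serve the updated hot-reloadable fields rather than only the build-time snapshot described in Section 3.2.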
- **Cold path (LLM model change, new persona):** Full redeploy via Cloud Build → new revision → traffic shift.
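
One way to pick between the two strategies is to diff the old and new config and check whether every changed field is hot-reloadable. The sketch below assumes a config dict with top-level prompt fields and nested `llm_config`/`rag_config`; the `HOT_RELOADABLE` set is an assumption derived from the examples above, not a list defined by the portal.

```python
_MISSING = object()

# Assumption: only these fields may change without a redeploy.
HOT_RELOADABLE = frozenset({
    "sight_enabled", "system_prompt", "first_message", "document_locked_message",
})


def flatten(config, prefix=""):
    """Flatten nested dicts into dotted paths, e.g. 'llm_config.model'."""
    flat = {}
    for key, value in config.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


def changed_keys(old, new):
    a, b = flatten(old), flatten(new)
    return {k for k in a.keys() | b.keys()
            if a.get(k, _MISSING) != b.get(k, _MISSING)}


def choose_path(old, new):
    """Return 'none', 'hot' (reload-config) or 'cold' (full redeploy)."""
    changed = changed_keys(old, new)
    if not changed:
        return "none"
    return "hot" if changed <= HOT_RELOADABLE else "cold"
```

A mixed change (for example a prompt tweak plus a model change) falls through to the cold path, which is the safe default.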

### 3.4 For observability (pull)

Portal backend reads from GCP APIs and Langfuse on behalf of the developer. Deployed apps don't talk to the portal for observability — they just emit logs/traces/metrics with the right labels.

---

## 4. Data Model (Firestore)

```
portal-state/                       (Firestore database)
│
├── personas/{persona_id}
│     name: "Field Technician"
│     system_prompt: "You are a field technician assistant..."
│     first_message: "Hi! What equipment do you need help with today?"
│     document_locked_message: "This document isn't in your assigned knowledge base."
│     voice_settings: { tts_voice: "...", stt_lang: "en-US" }
│     default_llm: { model: "gpt-5.4", temperature: 0.2, reasoning_effort: "low" }
│     created_at, updated_at, created_by
│
├── apps/{app_id}
│     name: "Technician Assistant"
│     slug: "technician"                 (-> subdomain technician.apps.yourdomain.com)
│     persona_id: "field-technician"
│     persona_overrides: { ... }         (per-app tweaks to the persona)
│     llm_config: {
│         model: "gpt-5.4",
│         temperature: 0.2,
│         reasoning_effort: "low",
│         max_tokens: 4096
│     }
│     rag_config: {
│         top_k: 8,
│         min_score: 0.6,
│         chunk_size: 800,
│         chunk_overlap: 150,
│         embedding_model: "text-embedding-005",
│         knowledge_base_ids: [...],
│         use_reranker: false
│     }
│     branding: { logo_url, primary_color, app_title, favicon_url }
│     sight_enabled: true
│     identity_tenant_id: "technician-abc123"   (Firebase tenant for end users)
│     template_version: "v1.4.2"
│     current_revision: "rag-technician-00042-abc"
│     status: "live" | "deploying" | "failed" | "archived"
│     custom_domain: null | "tech.customer.com"
│     created_at, updated_at, created_by
│
├── config_snapshots/{snapshot_id}
│     app_id: "technician"
│     snapshot_at: timestamp
│     config: { ...complete copy of app config at this point in time... }
│     # Immutable. Cloud Run revision points to this snapshot.
│
├── deployment_events/{event_id}
│     app_id: "technician"
│     event_type: "deploy_start" | "deploy_success" | "deploy_fail" | "rollback" | "config_update"
│     actor_email: "varun@xsparks.ai"
│     build_id: "..."
│     revision: "rag-technician-00042-abc"
│     snapshot_id: "..."
│     traffic_split: { "00042-abc": 100 }
│     error_message: null
│     timestamp
│
├── developers/{uid}
│     email, name, role: "developer" | "admin", invited_by, created_at
│
└── secrets_meta/{app_id}/keys/{key_name}
      # Just metadata pointing to Secret Manager; actual values live in GSM
      name: "OPENAI_API_KEY"
      gsm_resource: "projects/.../secrets/openai-technician"
      last_rotated_at
      rotation_policy: "manual" | "30d" | "90d"
```
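
The schema stores a shared persona plus per-app `persona_overrides`, so something has to combine the two before a snapshot is written. A minimal sketch, assuming overrides are merged recursively (nested dicts such as `voice_settings` are merged key by key, scalars are replaced); the function name is hypothetical and the portal may define different merge semantics.

```python
import copy


def resolve_persona(persona, overrides):
    """Overlay per-app overrides on a persona without mutating either input."""
    merged = copy.deepcopy(persona)
    for key, value in (overrides or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested settings field by field instead of replacing the dict.
            merged[key] = resolve_persona(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged
```

For example, an app that only overrides `voice_settings.stt_lang` keeps the persona's `tts_voice` and prompts unchanged.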

---

## 5. API Contracts (Portal Backend)

### Developer-facing (browser, Firebase JWT)

```
GET    /api/apps                       List all apps (with health summary)
POST   /api/apps                       Create app (writes config + triggers first deploy)
GET    /api/apps/{app_id}              Get app detail
PATCH  /api/apps/{app_id}              Update config (creates new snapshot, does NOT deploy)
DELETE /api/apps/{app_id}              Archive app
POST   /api/apps/{app_id}/deploy       Trigger new deployment (uses current config)
POST   /api/apps/{app_id}/rollback     Set traffic 100% to previous revision
POST   /api/apps/{app_id}/sight        Toggle sight (hot path, no rebuild)
GET    /api/apps/{app_id}/revisions    List Cloud Run revisions
GET    /api/apps/{app_id}/health       Live health check
GET    /api/apps/{app_id}/logs?...     Cloud Logging query
GET    /api/apps/{app_id}/errors       Error Reporting query
GET    /api/apps/{app_id}/metrics?...  Cloud Monitoring query
GET    /api/apps/{app_id}/traces?...   Langfuse query (proxied)
GET    /api/apps/{app_id}/cost?...     BigQuery billing query

GET    /api/personas                   List personas
POST   /api/personas                   Create
PATCH  /api/personas/{id}              Update
DELETE /api/personas/{id}              Soft-delete (only if no app uses it)
```
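
For `POST /api/apps/{app_id}/rollback`, the portal has to work out which revision "previous" refers to and express it as a traffic split. The helper below is a sketch under the assumption that the portal keeps revision names in the order they went live (oldest first, names unique); the function name is hypothetical and the actual traffic update through the Cloud Run API is not shown.

```python
def rollback_target(deployed_revisions, current_revision):
    """Return (previous_revision, traffic_split) for a full rollback."""
    if current_revision not in deployed_revisions:
        raise ValueError(f"unknown revision: {current_revision}")
    index = deployed_revisions.index(current_revision)
    if index == 0:
        raise ValueError("no previous revision to roll back to")
    previous = deployed_revisions[index - 1]
    # 100% of traffic moves to the previous revision in one step.
    return previous, {previous: 100}
```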

### Internal (deployed apps + Cloud Build, OIDC / SA token)

```
GET    /internal/apps/{app_id}/config            App fetches its config on boot
GET    /internal/snapshots/{snapshot_id}         Cloud Build fetches snapshot to bake
POST   /internal/builds/{build_id}/complete      Cloud Build webhook on finish
POST   /internal/health/{app_id}                 Optional: app pushes heartbeat
```

---

## 6. Security

| Surface | Threat | Control |
|---------|--------|---------|
| Portal frontend | Unauthorized developer access | Firebase Auth + dedicated tenant + invite-only; IP allowlist optional |
| Portal API (dev) | CSRF, token theft | Firebase JWT in `Authorization` header, short TTL, no cookies for API calls |
| Portal API (internal) | App impersonation | OIDC tokens issued by Google to the app's runtime SA; portal verifies audience + SA email |
| Cloud Build | Secret leak in logs | Use Secret Manager `--set-secrets`, never `--set-env-vars` for secrets; redact build logs |
| Deployed RAG apps | Cross-app data access | Per-app Firebase Identity Platform tenant + `app_id` partitioning in shared Firestore |
| Sight toggle | Bypass | Middleware checks on every request before any business logic |
| Config snapshots | Tampering | Snapshots immutable (write-once); deployment audit log signed |
| Secret rotation | Stale secrets in old revisions | Rotation triggers a redeploy of the current revision, whose secret references point to `:latest`, so new instances pick up the rotated value |
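
The "audience + SA email" control for internal endpoints can be sketched as a check over already-verified token claims. Signature and expiry verification is assumed to happen first, for instance with `google.oauth2.id_token.verify_oauth2_token` from the google-auth library, which returns the decoded claims. The function name, the allowlist argument, and the assumption that the token carries `email` and `email_verified` claims are illustrative.

```python
GOOGLE_ISSUERS = {"https://accounts.google.com", "accounts.google.com"}


def check_internal_caller(claims, expected_audience, allowed_sa_emails):
    """Authorize an internal caller from verified OIDC claims; return its email."""
    if claims.get("iss") not in GOOGLE_ISSUERS:
        raise PermissionError("token was not issued by Google")
    if claims.get("aud") != expected_audience:
        raise PermissionError("token audience does not match the portal")
    if not claims.get("email_verified"):
        raise PermissionError("service account email is not verified")
    email = claims.get("email")
    if email not in allowed_sa_emails:
        raise PermissionError(f"caller {email!r} is not an allowed service account")
    return email
```

A FastAPI dependency could call this after decoding the bearer token and reject the request with 401 or 403 on `PermissionError`.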

---

## 7. Deployment Topology

```
GCP Project: vertexai-rag-portal-prod
├── Cloud Run: portal-backend          (1 service, regional)
├── Cloud Run: portal-frontend         (1 service, regional)
├── Firestore: portal-state            (Native mode)
├── Secret Manager                     (per-app secrets, namespaced)
├── Artifact Registry: rag-images      (multi-region)
├── Cloud Build triggers               (one parameterized trigger for RAG template)
├── Cloud DNS: yourdomain.com zone
│     └── *.apps.yourdomain.com A -> Cloud Run load balancer
├── IAM:
│     ├── portal-sa@                   (portal backend runtime)
│     ├── rag-build-sa@                (Cloud Build worker)
│     └── rag-runtime-sa@              (every deployed RAG app runs as this)
└── Cloud Logging / Error Reporting / Monitoring (auto)

GCP Project: vertexai-rag-prod  (or same project, optional split)
├── Cloud Run: rag-technician          (one service per app)
├── Cloud Run: rag-sales
├── Cloud Run: rag-warehouse
├── ... (N services)
├── Firestore: (default)               (shared, partitioned by app_id)
├── Cloud Storage: shared buckets per app_id prefix
├── Vertex AI                          (shared model endpoints)
└── Identity Platform                  (one tenant per app)
```
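
Several resources in this topology are named after the app id (`rag-<app_id>`, `<app_id>.apps.yourdomain.com`, `openai-<app_id>`). A small helper could derive them in one place and reject ids that are not valid DNS labels. The helper name, the returned keys, and the exact validation rule are assumptions for illustration.

```python
import re

# A DNS label: lowercase letters, digits, hyphens; starts with a letter,
# ends with a letter or digit; at most 63 characters.
_DNS_LABEL_RE = re.compile(r"^[a-z]([a-z0-9-]{0,61}[a-z0-9])?$")


def resource_names(app_id, base_domain="apps.yourdomain.com"):
    """Derive the per-app resource names used across the topology."""
    if not _DNS_LABEL_RE.fullmatch(app_id):
        raise ValueError(f"app_id must be a valid DNS label: {app_id!r}")
    return {
        "cloud_run_service": f"rag-{app_id}",
        "domain": f"{app_id}.{base_domain}",
        "openai_secret": f"openai-{app_id}",
    }
```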

---

## 8. Non-functional Requirements

| NFR | Target |
|-----|--------|
| Portal page load | p95 < 2s for app list |
| Deploy (form submit → live URL) | p95 < 10 min, p99 < 15 min |
| Config hot reload (sight toggle) | < 5s end-to-end |
| Rollback | < 30s |
| Observability data freshness | logs/errors < 60s, metrics < 90s, traces < 60s |
| Portal availability | 99.5% (portal outage must NOT impact deployed apps) |
| Apps' availability | 99.9% (independent of portal) |
| Apps per region | up to 200 in v1; design for 1000 |
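
To make the availability targets concrete, they can be converted into a downtime budget. Assuming a 30-day window, 99.5% allows about 216 minutes (3.6 hours) of portal downtime and 99.9% allows about 43.2 minutes per app. The helper below is only a worked calculation; the window length is an assumption, since the table does not state one.

```python
def allowed_downtime_minutes(availability_pct, window_days=30):
    """Minutes of downtime permitted by an availability target over a window."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    window_minutes = window_days * 24 * 60
    return (100 - availability_pct) / 100 * window_minutes


portal_budget = allowed_downtime_minutes(99.5)  # portal target
app_budget = allowed_downtime_minutes(99.9)     # per-app target
```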

---

## 9. What Stays in the RAG App Repo vs Portal Repo

| Concern | RAG repo | Portal repo |
|---------|----------|-------------|
| Business logic (chat, RAG, voice) | ✅ | ❌ |
| `cloudbuild.yaml` (parameterized) | ✅ | ❌ |
| Dockerfile | ✅ | ❌ |
| Persona prompt templates | ❌ (read from portal at runtime) | ✅ |
| App config schema | shared package or duplicated Pydantic models | ✅ source of truth |
| Deploy orchestration | ❌ | ✅ |
| Observability dashboards | ❌ | ✅ |
| Developer auth | ❌ | ✅ |

---

## 10. Migration Path from Today

1. **Today:** two Cloud Run services (`fsg-backend` and `fsg-frontend`) with a hard-coded technician persona.
2. **Step 1:** Extract persona constants (system prompt, first message, locked-doc message) into a single config struct read from `core/config.py`. No portal yet — config still in env.
3. **Step 2:** Build portal v1 (Phase 1 of plan.md). Existing technician deployment becomes app `technician` in the portal, deployed via the same pipeline.
4. **Step 3:** Add Langfuse + Sight middleware to RAG app. Backfill technician config snapshot. Roll forward.
5. **Step 4:** Onboard a second persona (e.g. sales) entirely through the portal. Validate end-to-end.
6. **Step 5:** Decommission the manual deploy pipeline; portal becomes the only way to deploy.
