Ticket Source: Lambda generates TKT-XXXXXXXX immediately; ClickHouse is the system of record.
Appian Case Number: Separate CASE-XXXXX created by the Appian process model, linked via ticket_id.
Temporal vs Orkes: Temporal = Signal push (instant). Orkes = DO_WHILE poll (20s lag). Each chosen for its strength.
Worker Requirement: Temporal needs no worker (wait-only workflows). Orkes v2 needs no worker either; HTTP system tasks call Lambda directly.
Pipeline Burst Control: SQS intake queue + EventBridge 30-min drain. A month-end batch of 400 docs clears overnight at 20/batch (20 runs, about 10 hours), with no throttling and no sidecar.
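The overnight claim follows from simple arithmetic, sketched below. The function name and defaults are illustrative, and it assumes every scheduled run drains one full batch.

```python
import math


def drain_eta_hours(queue_depth: int, batch_size: int = 20,
                    interval_minutes: int = 30) -> float:
    """Hours until the queue is empty, assuming one full batch per scheduled run."""
    if queue_depth <= 0:
        return 0.0
    runs = math.ceil(queue_depth / batch_size)  # number of drain runs needed
    return runs * interval_minutes / 60


# Month-end example from the text: 400 docs, 20 per batch, one run every 30 min.
print(drain_eta_hours(400))
```

A manual POST /pipeline/drain would shorten this, since it skips the schedule.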
Five HITL Patterns
Each action type has a different orchestration pattern matched to its operational requirements — including auto-HITL for pipeline step failures
📄
Send Duplicate eStatement
action: send_duplicate
Temporal SQS Appian ✓ Built
UI
Customer requests duplicate statement copy. Banker selects delivery address and document IDs.
λ
Lambda writes the ticket to ClickHouse with status=pending. It also publishes an SQS message (14-day guaranteed delivery buffer for Appian).
Ticket ID generated here — not in Appian
T
Pure wait-state. Zero compute cost. Durable up to 7 days. No worker process needed.
workflow_id = ticket_id for direct signalling
A
Creates CASE-XXXXX in Appian case management. Assigns User Task to BackOfficeTeam group. Sends email notification.
Needs Appian Designer config (~30 min)
A
Sees ticket details, customer query, delivery address. Enters decision note. Clicks Approve or Reject.
ClickHouse updated instantly. Signal sent to Temporal Cloud — workflow wakes in milliseconds. No polling.
vs Orkes DO_WHILE: 20s polling gap → Signal: <100ms
λ
Generates S3 presigned URLs for doc_ids and a Bedrock Nova Lite cover letter. ClickHouse → status: completed.
chat.html polls /hitl/status/{id} every 5s and picks up the decision. End-to-end from Approve click: <3s.
SLA: 7 days
Engine: Temporal Cloud
Worker: Not needed
Fulfilment: S3 + Bedrock in Lambda
Signal latency: <100ms
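The decision step in this flow can be sketched as a small function that takes its three side effects as injected callables. This is a hedged illustration, not the deployed handler. The signal name "decision", the decision strings, and the choice to run fulfilment only on approval are assumptions. The one detail taken from the flow is that the Temporal workflow_id equals the ticket_id.

```python
from typing import Callable


def decide(ticket_id: str, decision: str, note: str,
           update_status: Callable[[str, str, str], None],
           send_signal: Callable[[str, str, dict], None],
           run_fulfilment: Callable[[str], None]) -> str:
    """Record the decision, wake the waiting workflow, then fulfil if approved."""
    if decision not in ("approved", "rejected"):  # assumed decision values
        raise ValueError(f"unknown decision: {decision!r}")
    # 1. ClickHouse is the system of record, so it is updated first.
    update_status(ticket_id, decision, note)
    # 2. workflow_id == ticket_id, so no lookup is needed to address the signal.
    send_signal(ticket_id, "decision", {"decision": decision, "note": note})
    # 3. Fulfilment (S3 presign + cover letter) runs only on approval (assumption).
    if decision == "approved":
        run_fulfilment(ticket_id)
        return "completed"
    return "rejected"
```

Injecting the callables keeps the ordering testable without ClickHouse, Temporal, or S3 credentials.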
🔄
Reopen Case
action: reopen_case
Temporal SQS Appian ✓ Built
UI
Customer disputes that a closed case was resolved incorrectly, or new evidence has emerged.
λ
Lambda starts the askmybank_reopen_case workflow. SQS message published for Appian pickup. ClickHouse: status=pending.
T
Dispute cases may take longer — senior manager review required. 7-day SLA window.
A
Same SAIL approval form as eStatement. Manager sees original query, doc references, customer history note.
Temporal wakes instantly. Lambda run_fulfilment() updates ClickHouse. No S3/Bedrock — status update only.
Approved = "reopened" | Rejected = "reopen_denied"
ClickHouse is the source of truth. Downstream CMS can poll or subscribe to status changes.
SLA: 7 days
Engine: Temporal Cloud
Worker: Not needed
Fulfilment: ClickHouse status only
Outcomes: reopened / reopen_denied / timeout
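The three outcomes of this flow map onto a small lookup, sketched below. The input strings "approved" and "rejected" and the use of None for an expired SLA window are assumptions made for illustration. The output statuses are the ones listed above.

```python
from typing import Optional

# Decision from the approval form -> status written to ClickHouse.
REOPEN_OUTCOMES = {"approved": "reopened", "rejected": "reopen_denied"}


def reopen_outcome(decision: Optional[str]) -> str:
    """Return the final ticket status; None means the 7-day window expired."""
    if decision is None:
        return "timeout"
    try:
        return REOPEN_OUTCOMES[decision]
    except KeyError:
        raise ValueError(f"unknown decision: {decision!r}") from None
```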
⚖️
Document Ingestion Pipeline
v2 — SQS queue + EventBridge drain + HTTP system tasks
Orkes SQS EventBridge ✓ No Worker ✓ Built
Q
S3 events, POST /pipeline/ingest, Appian triggers — all funnel to one durable queue. Month-end bursts sit here harmlessly (7-day retention, no data loss).
Burst control without a sidecar — throttle upfront at the queue
EB
Pops up to 20 messages per run (configurable PIPELINE_BATCH_SIZE). Manual trigger: POST /pipeline/drain to skip the schedule during testing.
Same EventBridge pattern as Lambda warming — zero extra infra
O
INLINE Graal JS detects doc type. HTTP tasks call Lambda /pipeline/step/* directly. Orkes awaits HTTP 200 before advancing — sequencing guaranteed.
v2: detect (INLINE JS) → textract (600s) → chunk → embed → s3vec + clickhouse (parallel)
λ
step_s3vec checks metadata.json for existing doc_id. step_clickhouse checks SELECT COUNT. Safe to re-trigger after any failure — no duplicate vectors.
doc_id = SHA-256(s3_key): deterministic and collision-resistant
Vectors live in askmybank-vectors S3. Metadata in ClickHouse. Document immediately searchable in chat. Orkes UI shows full visual DAG execution history for compliance.
Failure path: step exception → auto-HITL ticket with re-trigger curl command (see Flow 5)
Engine: Orkes HTTP tasks
Worker: Not needed (v2)
Burst control: SQS + 30-min drain
Idempotency: metadata.json + ClickHouse COUNT
Textract timeout: 600s per step
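The deterministic doc_id and the re-run guard can be illustrated with a minimal sketch. The in-memory dict stands in for metadata.json, the hex encoding of the digest is an assumption, and this step_s3vec is a simplified stand-in for the real step rather than its actual signature.

```python
import hashlib
from typing import Dict, List


def doc_id_for(s3_key: str) -> str:
    """Same key always yields the same ID, so a re-trigger maps to the same document."""
    return hashlib.sha256(s3_key.encode("utf-8")).hexdigest()


def step_s3vec(s3_key: str, vectors: List[List[float]],
               index: Dict[str, List[List[float]]]) -> str:
    """Write vectors once; a second call for the same key is a no-op."""
    doc_id = doc_id_for(s3_key)
    if doc_id in index:  # idempotency guard: an earlier run already succeeded
        return "skipped"
    index[doc_id] = vectors
    return "written"


index: Dict[str, List[List[float]]] = {}
print(step_s3vec("raw-docs/example.pdf", [[0.1, 0.2]], index))
print(step_s3vec("raw-docs/example.pdf", [[0.1, 0.2]], index))
```

Because the ID depends only on the S3 key, re-uploading changed content under the same key would be treated as already ingested. That is a consequence of the stated design, not something this sketch adds.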
🛠️
Escalate to IT / Content Gap
action: escalate_to_it
Temporal Lambda ✓ Built
UI
AI couldn't answer the query — missing content in the knowledge base. Banker flags it for the content/IT team.
λ
Lambda logs the gap to the banking_docs.content_gaps table. Returns GAP-XXXXXXXX. Always succeeds; no workflow is needed for simple logging.
Simple escalations end here — no approval loop
T
For compliance-sensitive gaps, askmybank_escalate_to_it workflow starts. 48-hour IT SLA (shorter than banker flows). Auto-escalates to Head of Content on breach.
IT team uses review.html or Appian to mark resolved. Signal wakes Temporal. ClickHouse updated.
Content gap feeds the re-ingestion pipeline (Orkes document_pipeline). Next time this query runs, AI answers correctly.
SLA: 48 hours
Engine: Temporal (optional)
Worker: Not needed
Outcomes: gap_resolved / gap_deferred / sla_breached
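A minimal sketch of how the 48-hour SLA and the three outcomes could be evaluated. The function name, the "pending" interim state, and the rule that a recorded resolution takes precedence over the clock are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

IT_SLA = timedelta(hours=48)


def escalation_outcome(created_at: datetime, now: datetime,
                       resolution: Optional[str] = None) -> str:
    """Return the ticket outcome given its age and any recorded resolution."""
    if resolution in ("gap_resolved", "gap_deferred"):
        return resolution
    if now - created_at > IT_SLA:
        return "sla_breached"  # this is the point where auto-escalation would fire
    return "pending"


created = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
print(escalation_outcome(created, created + timedelta(hours=49)))
```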
🔥
Pipeline Step Failure → Auto-HITL
_pipeline_failure_hitl() — ops never miss a failed doc
Lambda Orkes ✓ Built
O
Any of 6 steps: detect → textract → chunk → embed → s3vec → clickhouse. Orkes awaits HTTP response before advancing.
!
Textract timeout, Bedrock throttle, ClickHouse down, malformed PDF — any exception is caught by the try/except wrapper on every step endpoint.
Lambda returns HTTP 500 → Orkes marks task FAILED
λ
Calls hitl.log_content_gap() with step name, doc_id, s3_key, and the error string. ClickHouse content_gaps table gets a row immediately.
Same HITL logging path as escalate_to_it — reuses existing infrastructure
📋
The user_feedback field contains a ready-to-paste curl command: POST /pipeline/ingest with the s3_key. Ops fixes the root cause and runs the curl; re-queuing takes about 30 seconds, and the document is then queued for the next drain.
curl -X POST .../pipeline/ingest -d '{"s3_key":"raw-docs/..."}'
Idempotency guards in step_s3vec and step_clickhouse detect if earlier steps already succeeded and skip cleanly. Only failed steps re-execute — no duplicate vectors.
No manual cleanup needed — re-trigger and walk away
Detection: Automatic (every step wrapped)
Ops action: Fix root cause + paste curl
Re-run safety: Idempotency guard per step
HITL noise on success: Zero (straight-through only)
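The wrapper pattern described in this flow can be sketched as a decorator. This is an illustration under assumptions: the decorator name, the keyword arguments passed to the logger, and the response shape are hypothetical, and the logger is injected so the sketch runs without ClickHouse. The curl host is left elided exactly as in the text above.

```python
import functools
import json
from typing import Callable


def pipeline_failure_hitl(step_name: str, log_content_gap: Callable[..., None]):
    """Wrap a step endpoint so any exception becomes a HITL row plus an HTTP 500."""
    def decorator(step_fn: Callable[[dict], dict]):
        @functools.wraps(step_fn)
        def wrapper(payload: dict) -> dict:
            try:
                return {"statusCode": 200, "body": json.dumps(step_fn(payload))}
            except Exception as exc:  # every failure type is routed to ops
                s3_key = payload.get("s3_key")
                # Ready-to-paste re-trigger command; host elided as in the source.
                retry_cmd = ("curl -X POST .../pipeline/ingest -d '"
                             + json.dumps({"s3_key": s3_key}) + "'")
                log_content_gap(step=step_name, doc_id=payload.get("doc_id"),
                                s3_key=s3_key, error=str(exc),
                                user_feedback=retry_cmd)
                # A non-2xx response lets Orkes mark the task FAILED.
                return {"statusCode": 500,
                        "body": json.dumps({"step": step_name, "error": str(exc)})}
        return wrapper
    return decorator
```

On the success path the logger is never called, which matches the "zero HITL noise" property listed above.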
Temporal vs Orkes — Why Each Was Chosen
Not a preference — each platform was selected for what it genuinely does best
| Dimension | Temporal (Signal-driven HITL) | Orkes (DAG Pipeline) |
|---|---|---|
| Decision model | Push: banker clicks Approve → Signal → instant wake | Pull: DO_WHILE task polls ClickHouse every 20s |
| Response latency | <100ms from click to workflow continuation | Up to 20s polling interval before detection |
| Worker requirement | None for wait-only workflows; Lambda triggers and signals | None in v2; HTTP system tasks call Lambda directly (v1 required a worker process polling the task queue) |
| Best use case | Human approval loops, SLA-bound decisions, HITL gates | Multi-step processing pipelines, parallel fan-out, enrichment |
| Visual UI | Temporal Cloud timeline (event-by-event audit trail) | Orkes visual DAG: every task node visible and retryable |
| Compliance audit | Full event history: who signalled, when, payload | Full task execution log: input/output per node, retries |
| Scale model | Suitable for regional / country-level deployments | Built for global enterprise scale, multi-region distribution |
| Used for | send_duplicate, reopen_case, escalate_to_it | legal_export (document ingestion pipeline) |
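The latency gap between the two decision models comes from where in the polling cycle the decision lands. The sketch below computes that wait for a fixed-interval poller, assuming polls fire at 0s, 20s, 40s and so on. It is an arithmetic illustration, not a model of Orkes internals.

```python
import math


def poll_detection_delay(decision_at: float, poll_interval: float = 20.0) -> float:
    """Seconds between a decision and the next poll tick that can observe it."""
    next_poll = math.ceil(decision_at / poll_interval) * poll_interval
    return next_poll - decision_at


# A decision 5s into the cycle waits 15s; one just after a tick waits almost 20s.
print(poll_detection_delay(5.0))
print(poll_detection_delay(20.5))
```

A Signal has no such cycle, which is why the table lists it at <100ms.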
Data Store Roles
Each store has a single, clear responsibility — no overlap

🟠 ClickHouse OLAP + HITL Source of Truth

banking_docs.backoffice_requests — every HITL ticket. Status, resolver, timestamps, workflow_run_id.
banking_docs.content_gaps — AI quality feedback. Each unanswerable query logged for IT/content team.
banking_docs.documents — ingested document metadata, s3_path for fulfilment S3 presign.
Why ClickHouse, not DynamoDB? ClickHouse is OLAP — columnar, fast aggregates. HITL dashboard queries ("all pending tickets") are analytical reads, not key-value lookups. DynamoDB would require GSIs and cost more for this pattern.

🟡 SQS Durable Delivery Buffer

14-day message retention — if Appian is down, messages queue up and are delivered when it recovers.
Dead-letter queue — after 3 failed deliveries, messages go to DLQ for investigation.
MessageAttributes — action type and ticket_id on every message, so Appian can route to the correct process model without parsing the body.
Why SQS, not DynamoDB? SQS is a queue — purpose-built for guaranteed delivery, visibility timeout, and consumer polling. DynamoDB is a database you'd have to poll manually and build queue mechanics on top of.
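A hedged sketch of how the routing attributes might be attached when publishing. The helper name and the exact attribute keys ("action", "ticket_id") are assumptions. The nested DataType/StringValue shape is the one boto3's SQS send_message expects for MessageAttributes.

```python
import json


def build_hitl_message(queue_url: str, ticket_id: str, action: str, body: dict) -> dict:
    """Build keyword arguments for an SQS send_message call."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps(body),
        # Attributes let the consumer route without parsing the body.
        "MessageAttributes": {
            "action": {"DataType": "String", "StringValue": action},
            "ticket_id": {"DataType": "String", "StringValue": ticket_id},
        },
    }
```

With boto3 installed and credentials configured, the result could be passed as `boto3.client("sqs").send_message(**kwargs)`.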
Build Status
What runs today vs what needs configuration in Appian Designer

✅ Running in AWS Lambda

  • POST /hitl (create ticket): Routes to Temporal or Orkes based on action type. Publishes SQS message.
  • GET /hitl/status/{id}: Live ClickHouse read. chat.html polls every 5s.
  • GET /hitl/pending: All pending tickets for back-office UI. Protected by X-Backoffice-Key.
  • POST /hitl/{id}/decide: Updates ClickHouse + sends Temporal Signal + runs fulfilment (S3 + Bedrock).
  • POST /hitl/gap: Content gap logging. Instant; no workflow.
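Routing by action type could look like the lookup below. The three Temporal workflow names come from this document. The pairing of legal_export with askmybank_document_pipeline on Orkes is an assumption inferred from the comparison table, and the function name is hypothetical.

```python
from typing import Tuple

# action type -> (engine, workflow name)
ACTION_ROUTES = {
    "send_duplicate": ("temporal", "askmybank_estatement_duplicate"),
    "reopen_case": ("temporal", "askmybank_reopen_case"),
    "escalate_to_it": ("temporal", "askmybank_escalate_to_it"),
    # Assumption: the Orkes-backed action starts the document pipeline workflow.
    "legal_export": ("orkes", "askmybank_document_pipeline"),
}


def route_action(action: str) -> Tuple[str, str]:
    """Return the engine and workflow for an action, rejecting unknown types."""
    try:
        return ACTION_ROUTES[action]
    except KeyError:
        raise ValueError(f"unsupported action: {action!r}") from None
```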

✅ Temporal Workflows (Temporal Cloud)

  • askmybank_estatement_duplicate: Wait-state + Signal. S3 + Bedrock fulfilment in Lambda. 7-day SLA.
  • askmybank_reopen_case: Wait-state + Signal. ClickHouse status update only. 7-day SLA.
  • askmybank_escalate_to_it: Wait-state + Signal. 48-hour IT SLA. Auto-escalates on breach.
  • No worker process needed: All three workflows are pure wait-state; Lambda starts and signals them directly.

✅ Orkes Pipeline v2 (No Worker)

  • askmybank_document_pipeline v2: INLINE Graal JS detect → HTTP tasks → Textract → chunk → embed → S3 vectors + ClickHouse. Visual DAG in Orkes UI.
  • No worker daemon needed: Orkes HTTP system tasks call Lambda /pipeline/step/* directly. Worker replaced by Lambda endpoints.
  • POST /pipeline/ingest → SQS queue: Queues document for drain. PIPELINE_QUEUE_URL env var must be set after SAM deploy.
  • POST /pipeline/drain: Manual drain trigger. Pops PIPELINE_BATCH_SIZE docs from the queue and starts Orkes workflows immediately.
  • GET /pipeline/queue/status: Queue depth, DLQ depth, ETA hours. Live visibility into backlog.
  • Auto-HITL on step failure: _pipeline_failure_hitl() fires on any exception. Ops gets a ClickHouse ticket with a re-trigger curl command.
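One detail worth sketching for the drain: a single SQS ReceiveMessage call returns at most 10 messages, so a batch of 20 needs more than one call. The loop below is an illustration with an injected receive function. The function names are hypothetical, and the real drain handler may be structured differently.

```python
from typing import Callable, List

SQS_MAX_PER_RECEIVE = 10  # SQS ReceiveMessage hard limit per call


def drain_batch(receive: Callable[[int], List[dict]], batch_size: int = 20) -> List[dict]:
    """Collect up to batch_size messages, stopping early if the queue runs dry."""
    collected: List[dict] = []
    while len(collected) < batch_size:
        want = min(SQS_MAX_PER_RECEIVE, batch_size - len(collected))
        messages = receive(want)
        if not messages:  # queue empty (or nothing visible right now)
            break
        collected.extend(messages)
    return collected
```

In a Lambda, `receive` would wrap `sqs.receive_message(QueueUrl=..., MaxNumberOfMessages=n)`, and each message would be deleted only after its Orkes workflow has been started.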

⚙️ Appian — Needs Designer Config

  • Connected System: HTTP + X-Backoffice-Key header. Spec in appian/connected_system_spec.md.
  • 3 Integration objects: GetPendingTickets, GetTicketStatus, SubmitDecision. Config in appian/expressions/.
  • HITLApprovalForm SAIL: Full banker approval form. Paste from appian/interfaces/approval_form.sail.
  • HITLPendingDashboard SAIL: Auto-refreshing grid. Paste from appian/interfaces/pending_tickets_dashboard.sail.
  • Process Model: Timer → poll → User Task → decide. Node-by-node in appian/process_model_design.md.
  • BackOfficeTeam group: Add all managers who approve HITL tickets. ~30 min total config.

✅ Back-office (without Appian)

  • review.html: Manual approve/reject UI at /backoffice/review.html. Requires X-Backoffice-Key. Works today; no Appian needed.
  • chat.html HITL tray: Live ticket status panel showing all session tickets and their current state.

✅ Infrastructure (AWS SAM) — Pending Deploy

  • SQS HITLQueue + DLQ: 14-day retention, 5-min visibility, 20s long-poll, maxReceiveCount=3. Live in AWS.
  • SQS PipelineQueue + PipelineDLQ: 7-day retention, 10-min visibility, maxReceiveCount=3. Defined in template.yaml; needs sam deploy.
  • EventBridge PipelineDrainRule: Fires every 30 min → Lambda {"source":"pipeline_drain"}. Defined in template.yaml; needs sam deploy.
  • S3 event notification: banking-docs-poc-qahftr → s3:ObjectCreated:* → prefix raw-docs/ → PipelineQueue. Manual AWS console step after deploy.
  • Lambda IAM: sqs:ReceiveMessage / DeleteMessage / GetQueueAttributes on PipelineQueue + DLQ added to template.yaml.
  • Lambda timeout: 870s (14.5 min), which accommodates the Textract async poll loop. Updated in Globals.
  • API Gateway CORS: X-Backoffice-Key and dpanwar-vigyan.github.io added to CorsConfiguration.
  • ClickHouse tables: backoffice_requests and content_gaps schemas deployed.