Ingestion Pipeline · SecureBank AI

Pipeline DAG — Live in Orkes UI

v2: Workerless architecture — Orkes calls Lambda /pipeline/step/* directly via HTTP system tasks. No polling daemon required. Node 1 runs as an Orkes INLINE Graal JS task (zero infra). The FORK_JOIN runs embedding and metadata extraction in parallel.

🚀

POST /pipeline/ingest

Lambda receives s3_key, doc_type, customer_id → starts Orkes workflow. Returns workflow_id + Orkes UI URL immediately.

Lambda

▼

🔍

1 · detect_doc_type

Classifies document from s3_key path pattern (estatement / dispute / complaint / maintenance). Assigns doc_id. Runs as Graal JS inside Orkes engine — no Lambda hop, no cost.

Orkes INLINE (JS)

▼

📄

2 · textract_extract

Orkes POSTs to Lambda /pipeline/step/textract → calls AWS Textract start_document_text_detection. Polls until job completes. Returns raw text + page count. Handles scanned / image PDFs via real OCR.

HTTP → /step/textract

▼

✂️

3 · chunk_text

Splits raw text into 400-token semantic passages with 60-token overlap. Preserves paragraph boundaries. Outputs chunk array with positional metadata.

HTTP → /step/chunk

▼

FORK_JOIN — parallel branches

🧠

4a · generate_embeddings

Bedrock Titan Embed v2 (256-dim). Batches all chunks. Returns embedding array.

Bedrock

▼

📦

5a · update_s3_vectors

Appends to vectors.npy + metadata.json in S3. Atomic update — idempotent on retry.

📋

4b · parse_metadata

Nova Lite LLM extracts structured fields: customer, account, case ref, dates, amounts, branch.

Bedrock

▼

🗄️

5b · upsert_clickhouse

INSERT/UPDATE banking_docs.documents (ReplacingMergeTree). Doc queryable for RAG immediately.

ClickHouse

JOIN — both branches must complete

▼

👁️

6 · content_review_hitl (WAIT task)

Orkes WAIT — pauses for optional spot-check of extracted content. Auto-continues after timeout. Human can approve / flag for correction via Orkes UI or back-office.

Orkes WAIT

▼

✅

7 · pipeline_complete

Logs doc_id, vector_count, pipeline_status to ClickHouse audit log. Document is now live in RAG — chat queries return it instantly.

Done

1,000 Sample Documents Ready to Ingest

All generated with realistic UK banking content — real names, account numbers, branch codes, case references, transaction amounts. Stored in S3 by type and date.

📊

eStatements

400 PDFs · 2024–2026

ContentMonthly account statements with transactions, balance, interest

Account typesCurrent, Savings, Business, Premier, ISA

S3 path—

s3://banking-docs-poc-qahftr/estatements/2026/01/STMT00008.pdf

curl -X POST ".../pipeline/ingest" \ -d '{"s3_key": "estatements/2026/01/STMT00008.pdf", "doc_type": "eStatement", "customer_id": "CUST00042"}'

⚔️

Dispute Cases

250 PDFs · 2024–2026

ContentUnauthorised transactions, merchant disputes, ATM errors, card fraud

Status mixOpen, Resolved, Closed-Won, Referred to Ombudsman

S3 path—

s3://banking-docs-poc-qahftr/disputes/2026/01/DSP00015.pdf

curl -X POST ".../pipeline/ingest" \ -d '{"s3_key": "disputes/2026/01/DSP00015.pdf", "doc_type": "Dispute", "customer_id": "CUST00019"}'

📢

Formal Complaints

200 PDFs · 2024–2026

ContentPoor service, online banking issues, fee disputes, product mis-selling

PriorityLow / Medium / High / Critical with compensation amounts

S3 path—

s3://banking-docs-poc-qahftr/complaints/2026/01/CMP00020.pdf

curl -X POST ".../pipeline/ingest" \ -d '{"s3_key": "complaints/2026/01/CMP00020.pdf", "doc_type": "Complaint", "customer_id": "CUST00099"}'

🔧

Account Maintenance

150 PDFs · 2024–2026

ContentAddress changes, overdraft amendments, beneficiary additions, SO changes

Types8 maintenance request types across 10 UK branches

S3 path—

s3://banking-docs-poc-qahftr/maintenance/2026/01/MNT00010.pdf

curl -X POST ".../pipeline/ingest" \ -d '{"s3_key": "maintenance/2026/01/MNT00010.pdf", "doc_type": "AccountMaintenance", "customer_id": "CUST00010"}'

Demo Script — Step by Step

Run this exactly to show the pipeline live. Takes ~60 seconds end-to-end.

No worker daemon needed — v2 is workerless ✨

Pipeline v2 uses Orkes HTTP system tasks — Orkes calls Lambda /pipeline/step/* endpoints directly. No polling daemon, no EC2, no open terminal. Just trigger the workflow and Orkes does the rest.

# v1 (SIMPLE tasks) — required daemon:
python -m workflows.orkes.worker  ← NOT NEEDED in v2

# v2 (HTTP system tasks) — just trigger and watch:
curl -s -X POST .../pipeline/ingest  # that's it

✓ Orkes calls Lambda directly — detect (INLINE JS) → textract → chunk → [embed ‖ metadata] → JOIN → complete

Trigger ingestion of a new complaint PDF

POST to the Lambda endpoint. Lambda starts the Orkes workflow and returns a UI URL immediately — the pipeline runs asynchronously.

curl -s -X POST \
  https://r6v15i892m.execute-api.us-east-1.amazonaws.com/pipeline/ingest \
  -H 'Content-Type: application/json' \
  -d '{
    "s3_key":        "complaints/2026/01/CMP00020.pdf",
    "doc_type":      "Complaint",
    "customer_id":   "CUST00099",
    "source_system": "manual"
  }' | python3 -m json.tool

✓ Response includes orkes_ui_url — open it in browser immediately

Open the Orkes UI — watch the DAG execute live

The response JSON contains orkes_ui_url pointing directly to the running execution. Open it to see:

Expected response:
{
  "workflow_id":   "<uuid>",
  "workflow_name": "askmybank_document_pipeline",
  "s3_key":        "complaints/2026/01/CMP00020.pdf",
  "doc_type":      "Complaint",
  "orkes_ui_url":  "https://developer.orkescloud.com/execution/<id>",
  "status":        "started"
}

✓ In the Orkes UI you see each task node turn green one by one — detect → textract → chunk → [embed ‖ metadata] → JOIN → complete

Show the parallel FORK_JOIN in Orkes (the visual showpiece)

Once textract and chunk complete, the DAG splits into two branches running simultaneously. This is what Orkes does better than most orchestrators — visual parallel execution with guaranteed join.

✓ Branch A: generate_embeddings → update_s3_vectors (Bedrock Titan → S3)

✓ Branch B: parse_metadata → upsert_clickhouse (Nova Lite LLM → ClickHouse)

Query the ingested document in chat.html

Once pipeline_complete turns green (~60s), the document is live in RAG. Open chat.html and ask about the complaint you just ingested.

# Try these queries after ingesting CMP00020.pdf:
"Show me formal complaints raised in January 2026"
"What complaint was logged for customer CUST00099?"
"Are there any critical priority complaints this year?"

✓ AI retrieves the exact document — demonstrating the full pipeline→RAG→chat loop

Bonus: ingest a dispute to show OCR (Textract takes longer)

Dispute PDFs are slightly more complex — Textract's async job takes a few extra seconds. This is a natural talking point: "for scanned documents Textract does real OCR — that's why this task has a 1-hour timeout."

curl -s -X POST \
  https://r6v15i892m.execute-api.us-east-1.amazonaws.com/pipeline/ingest \
  -H 'Content-Type: application/json' \
  -d '{
    "s3_key":      "disputes/2026/01/DSP00015.pdf",
    "doc_type":    "Dispute",
    "customer_id": "CUST00019"
  }' | python3 -m json.tool

PDF → RAG in One API Call