1,000 Sample PDFs in S3
4 Document Types
8 Pipeline Stages
2 Parallel Branches
256-d Titan Embeddings
<60s Avg Ingest Time
Pipeline DAG — Live in Orkes UI
v2: Workerless architecture — Orkes calls Lambda /pipeline/step/* directly via HTTP system tasks. No polling daemon required. Node 1 runs as an Orkes INLINE Graal JS task (zero infra). The FORK_JOIN runs embedding and metadata extraction in parallel.
🚀
POST /pipeline/ingest
Lambda receives s3_key, doc_type, customer_id → starts Orkes workflow. Returns workflow_id + Orkes UI URL immediately.
Lambda
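A minimal sketch of what the ingest handler could look like, assuming an API Gateway style event with a JSON body. The response field names follow the expected response shown in the demo script; `start_workflow` is a hypothetical callable standing in for whichever Orkes client call the real Lambda uses, and the validation rules are assumptions.

```python
import json

WORKFLOW_NAME = "askmybank_document_pipeline"                  # name from the demo's expected response
ORKES_UI_BASE = "https://developer.orkescloud.com/execution"   # base URL from the demo's expected response
REQUIRED_FIELDS = ("s3_key", "doc_type", "customer_id")


def handle_ingest(event, start_workflow):
    """Validate the request, start the workflow, and return without waiting for it.

    start_workflow(name, payload) -> workflow_id is a hypothetical hook,
    not a real Orkes SDK signature.
    """
    try:
        payload = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError:
        return {"statusCode": 400, "body": json.dumps({"error": "invalid JSON"})}
    if not isinstance(payload, dict):
        return {"statusCode": 400, "body": json.dumps({"error": "body must be a JSON object"})}

    missing = [field for field in REQUIRED_FIELDS if not payload.get(field)]
    if missing:
        return {"statusCode": 400, "body": json.dumps({"error": f"missing fields: {missing}"})}

    workflow_id = start_workflow(WORKFLOW_NAME, payload)
    body = {
        "workflow_id": workflow_id,
        "workflow_name": WORKFLOW_NAME,
        "s3_key": payload["s3_key"],
        "doc_type": payload["doc_type"],
        "orkes_ui_url": f"{ORKES_UI_BASE}/{workflow_id}",
        "status": "started",
    }
    return {"statusCode": 200, "body": json.dumps(body)}
```

Passing the workflow starter in as an argument keeps the handler testable without network access.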
🔍
1 · detect_doc_type
Classifies document from s3_key path pattern (estatement / dispute / complaint / maintenance). Assigns doc_id. Runs as Graal JS inside Orkes engine — no Lambda hop, no cost.
Orkes INLINE (JS)
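The classification rule can be sketched as a prefix lookup. The real task runs as Graal JS inside Orkes; this Python version only illustrates the logic. The prefix-to-type mapping is inferred from the sample S3 paths and ingest requests, and using the file stem as doc_id is an assumption.

```python
from pathlib import PurePosixPath

# Top-level S3 prefix -> doc_type, inferred from the sample ingest requests.
PREFIX_TO_TYPE = {
    "estatements": "eStatement",
    "disputes": "Dispute",
    "complaints": "Complaint",
    "maintenance": "AccountMaintenance",
}


def detect_doc_type(s3_key):
    """Classify a document from its S3 key and derive a doc_id."""
    path = PurePosixPath(s3_key)
    prefix = path.parts[0] if path.parts else ""
    doc_type = PREFIX_TO_TYPE.get(prefix, "Unknown")
    # Assumption: the file stem (for example CMP00020) doubles as the doc_id.
    return {"doc_type": doc_type, "doc_id": path.stem}
```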
📄
2 · textract_extract
Orkes POSTs to Lambda /pipeline/step/textract → calls AWS Textract start_document_text_detection. Polls until job completes. Returns raw text + page count. Handles scanned / image PDFs via real OCR.
HTTP → /step/textract
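The start-then-poll pattern might look roughly like this. The two client methods are the boto3 Textract calls for asynchronous text detection; the function name, polling interval, and return shape are assumptions, and the one-hour default simply mirrors the task timeout mentioned in the demo script.

```python
import time


def textract_extract(client, bucket, key, poll_seconds=2.0,
                     max_wait_seconds=3600.0, sleep=time.sleep):
    """Start an async Textract job, poll until it finishes, return text and page count.

    `client` is expected to behave like boto3.client("textract").
    """
    job = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]

    waited = 0.0
    while True:
        result = client.get_document_text_detection(JobId=job_id)
        status = result["JobStatus"]
        if status == "SUCCEEDED":
            break
        if status != "IN_PROGRESS":
            # FAILED and PARTIAL_SUCCESS are both treated as errors in this sketch.
            raise RuntimeError(f"Textract job {job_id} ended with status {status}")
        if waited >= max_wait_seconds:
            raise TimeoutError(f"Textract job {job_id} still running after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds

    pages = result["DocumentMetadata"]["Pages"]
    lines = []
    while True:
        # Keep only LINE blocks; PAGE and WORD blocks repeat the same text.
        lines.extend(b["Text"] for b in result.get("Blocks", []) if b["BlockType"] == "LINE")
        token = result.get("NextToken")
        if not token:
            break
        result = client.get_document_text_detection(JobId=job_id, NextToken=token)

    return {"text": "\n".join(lines), "page_count": pages}
```

Long documents return results in pages, which is why the second loop follows NextToken until it is absent.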
✂️
3 · chunk_text
Splits raw text into 400-token semantic passages with 60-token overlap. Preserves paragraph boundaries. Outputs chunk array with positional metadata.
HTTP → /step/chunk
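To make the 400/60 numbers concrete, here is a sketch of a fixed-size sliding window with overlap. It treats whitespace-separated words as tokens and does not attempt the paragraph-boundary handling described above; both simplifications are assumptions, as are the output field names.

```python
def chunk_text(text, size=400, overlap=60):
    """Split text into windows of `size` tokens, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in the range [0, size)")

    tokens = text.split()          # assumption: whitespace words stand in for model tokens
    step = size - overlap          # each window starts 340 tokens after the previous one
    chunks = []
    start = 0
    while start < len(tokens):
        window = tokens[start:start + size]
        chunks.append({
            "index": len(chunks),
            "start_token": start,
            "end_token": start + len(window),
            "text": " ".join(window),
        })
        if start + size >= len(tokens):
            break                  # this window already reached the end of the text
        start += step
    return chunks
```

With these defaults a 1,000-token document yields three chunks starting at tokens 0, 340, and 680.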
FORK_JOIN — parallel branches
🧠
4a · generate_embeddings
Bedrock Titan Embed v2 (256-dim). Batches all chunks. Returns embedding array.
Bedrock
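A sketch of the embedding call, assuming the Bedrock runtime `invoke_model` API and the Titan Text Embeddings V2 request format. The model ID is an assumption, and this version sends one request per chunk for simplicity; how the real step batches chunks is not shown here.

```python
import json

MODEL_ID = "amazon.titan-embed-text-v2:0"   # assumed Bedrock model ID for Titan Embed v2


def embed_chunks(client, chunks, dimensions=256):
    """Return one embedding per chunk; `client` should behave like boto3.client("bedrock-runtime")."""
    vectors = []
    for chunk in chunks:
        response = client.invoke_model(
            modelId=MODEL_ID,
            contentType="application/json",
            accept="application/json",
            body=json.dumps({"inputText": chunk, "dimensions": dimensions, "normalize": True}),
        )
        payload = json.loads(response["body"].read())
        vector = payload["embedding"]
        if len(vector) != dimensions:
            raise ValueError(f"expected {dimensions} dimensions, got {len(vector)}")
        vectors.append(vector)
    return vectors
```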
📦
5a · update_s3_vectors
Appends to vectors.npy + metadata.json in S3. Atomic update — idempotent on retry.
S3
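One way an append can stay idempotent on retry is to drop any rows already stored for the document before appending the new ones. The sketch below shows that merge on plain Python lists; the store layout and field names are assumptions, and reading or writing vectors.npy and metadata.json in S3 is left out.

```python
def merge_vectors(store, doc_id, vectors, chunk_meta):
    """Return a new store in which this document's rows are replaced, not duplicated.

    store is {"vectors": [...], "metadata": [...]} with parallel lists (assumed layout).
    """
    # Keep every row that belongs to a different document.
    kept = [
        (vec, meta)
        for vec, meta in zip(store["vectors"], store["metadata"])
        if meta["doc_id"] != doc_id
    ]
    # Tag the incoming rows with the doc_id so a later retry can find them.
    new_rows = [(vec, {**meta, "doc_id": doc_id}) for vec, meta in zip(vectors, chunk_meta)]
    rows = kept + new_rows
    return {"vectors": [v for v, _ in rows], "metadata": [m for _, m in rows]}
```

Running the merge twice with the same input leaves the store unchanged after the first run, which is the property a retried task needs.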
📋
4b · parse_metadata
Nova Lite LLM extracts structured fields: customer, account, case ref, dates, amounts, branch.
Bedrock
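LLM replies often wrap the JSON in extra text, so the step needs some defensive parsing. A sketch under stated assumptions: the key names below are hypothetical stand-ins for the listed fields, and the model is assumed to have been prompted to answer with a single JSON object.

```python
import json

# Hypothetical key names for the fields listed above.
EXPECTED_FIELDS = ("customer", "account", "case_ref", "dates", "amounts", "branch")


def parse_metadata_reply(reply_text):
    """Extract the first JSON object from a model reply and keep only the expected fields."""
    start = reply_text.find("{")
    if start == -1:
        raise ValueError("no JSON object in model reply")
    # raw_decode parses one JSON value and ignores any trailing text.
    obj, _ = json.JSONDecoder().raw_decode(reply_text[start:])
    # Missing fields become None so downstream columns are always present.
    return {field: obj.get(field) for field in EXPECTED_FIELDS}
```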
🗄️
5b · upsert_clickhouse
INSERT/UPDATE banking_docs.documents (ReplacingMergeTree). Doc queryable for RAG immediately.
ClickHouse
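With ReplacingMergeTree an "update" is a second INSERT for the same sorting key, and ClickHouse keeps the newest row when parts are merged. The small simulation below illustrates that rule; the `doc_id` key and `version` column are assumed names, not the actual table schema.

```python
def replacing_merge(rows, key="doc_id", version="version"):
    """Keep one row per key: the highest version, or the last inserted on a tie."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[version] >= current[version]:
            latest[row[key]] = row
    return list(latest.values())
```

Deduplication happens during background merges, so a query that must never see the older version typically uses FINAL or aggregates by key.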
JOIN — both branches must complete
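The fork and join can be sketched as a Conductor workflow fragment built in Python. The task types (HTTP, FORK_JOIN, JOIN) and the `forkTasks` / `joinOn` keys follow the Conductor workflow schema; the step path segments other than those shown above, the reference names of the fork and join tasks, and the `lambda_base_url` workflow input are hypothetical.

```python
def http_task(ref, step):
    """One HTTP system task that POSTs to a Lambda step endpoint."""
    return {
        "name": ref,
        "taskReferenceName": ref,
        "type": "HTTP",
        "inputParameters": {
            "http_request": {
                # lambda_base_url is a hypothetical workflow input.
                "uri": f"${{workflow.input.lambda_base_url}}/pipeline/step/{step}",
                "method": "POST",
                "body": {"s3_key": "${workflow.input.s3_key}"},
            }
        },
    }


def build_fork_join():
    """Return the FORK_JOIN task and the JOIN task that waits on both branches."""
    branch_a = [http_task("generate_embeddings", "embed"),
                http_task("update_s3_vectors", "vectors")]
    branch_b = [http_task("parse_metadata", "metadata"),
                http_task("upsert_clickhouse", "clickhouse")]
    fork = {
        "name": "fork_enrich",
        "taskReferenceName": "fork_enrich",
        "type": "FORK_JOIN",
        "forkTasks": [branch_a, branch_b],
    }
    join = {
        "name": "join_enrich",
        "taskReferenceName": "join_enrich",
        "type": "JOIN",
        # Join on the last task of each branch so both must finish.
        "joinOn": [branch_a[-1]["taskReferenceName"], branch_b[-1]["taskReferenceName"]],
    }
    return [fork, join]
```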
👁️
6 · content_review_hitl (WAIT task)
Orkes WAIT — pauses for optional spot-check of extracted content. Auto-continues after timeout. Human can approve / flag for correction via Orkes UI or back-office.
Orkes WAIT
7 · pipeline_complete
Logs doc_id, vector_count, pipeline_status to ClickHouse audit log. Document is now live in RAG — chat queries return it instantly.
Done
1,000 Sample Documents Ready to Ingest
All generated with realistic UK banking content — real names, account numbers, branch codes, case references, transaction amounts. Stored in S3 by type and date.
📊
eStatements
400 PDFs · 2024–2026
Content: Monthly account statements with transactions, balance, interest
Account types: Current, Savings, Business, Premier, ISA
S3 path: s3://banking-docs-poc-qahftr/estatements/2026/01/STMT00008.pdf
curl -X POST ".../pipeline/ingest" \
  -d '{"s3_key": "estatements/2026/01/STMT00008.pdf", "doc_type": "eStatement", "customer_id": "CUST00042"}'
⚔️
Dispute Cases
250 PDFs · 2024–2026
Content: Unauthorised transactions, merchant disputes, ATM errors, card fraud
Status mix: Open, Resolved, Closed-Won, Referred to Ombudsman
S3 path: s3://banking-docs-poc-qahftr/disputes/2026/01/DSP00015.pdf
curl -X POST ".../pipeline/ingest" \
  -d '{"s3_key": "disputes/2026/01/DSP00015.pdf", "doc_type": "Dispute", "customer_id": "CUST00019"}'
📢
Formal Complaints
200 PDFs · 2024–2026
Content: Poor service, online banking issues, fee disputes, product mis-selling
Priority: Low / Medium / High / Critical, with compensation amounts
S3 path: s3://banking-docs-poc-qahftr/complaints/2026/01/CMP00020.pdf
curl -X POST ".../pipeline/ingest" \
  -d '{"s3_key": "complaints/2026/01/CMP00020.pdf", "doc_type": "Complaint", "customer_id": "CUST00099"}'
🔧
Account Maintenance
150 PDFs · 2024–2026
Content: Address changes, overdraft amendments, beneficiary additions, SO changes
Types: 8 maintenance request types across 10 UK branches
S3 path: s3://banking-docs-poc-qahftr/maintenance/2026/01/MNT00010.pdf
curl -X POST ".../pipeline/ingest" \
  -d '{"s3_key": "maintenance/2026/01/MNT00010.pdf", "doc_type": "AccountMaintenance", "customer_id": "CUST00010"}'
Demo Script — Step by Step
Run these steps in order to show the pipeline live. The whole run takes about 60 seconds end to end.
1
No worker daemon needed — v2 is workerless ✨
Pipeline v2 uses Orkes HTTP system tasks — Orkes calls Lambda /pipeline/step/* endpoints directly. No polling daemon, no EC2, no open terminal. Just trigger the workflow and Orkes does the rest.
# v1 (SIMPLE tasks) required a polling daemon (NOT NEEDED in v2):
#   python -m workflows.orkes.worker
# v2 (HTTP system tasks): just trigger and watch
curl -s -X POST .../pipeline/ingest   # that's it
✓ Orkes calls Lambda directly — detect (INLINE JS) → textract → chunk → [embed ‖ metadata] → JOIN → complete
2
Trigger ingestion of a new complaint PDF
POST to the Lambda endpoint. Lambda starts the Orkes workflow and returns a UI URL immediately — the pipeline runs asynchronously.
curl -s -X POST \
  https://r6v15i892m.execute-api.us-east-1.amazonaws.com/pipeline/ingest \
  -H 'Content-Type: application/json' \
  -d '{
    "s3_key": "complaints/2026/01/CMP00020.pdf",
    "doc_type": "Complaint",
    "customer_id": "CUST00099",
    "source_system": "manual"
  }' | python3 -m json.tool
✓ Response includes orkes_ui_url — open it in browser immediately
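The same trigger can be scripted in Python with only the standard library. The URL and payload are the ones from the curl command above; the function names and the 30-second timeout are assumptions.

```python
import json
import urllib.request

INGEST_URL = "https://r6v15i892m.execute-api.us-east-1.amazonaws.com/pipeline/ingest"


def build_ingest_request(s3_key, doc_type, customer_id, source_system="manual"):
    """Build the POST request without sending it."""
    payload = {
        "s3_key": s3_key,
        "doc_type": doc_type,
        "customer_id": customer_id,
        "source_system": source_system,
    }
    return urllib.request.Request(
        INGEST_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def trigger_ingest(request, timeout=30):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())
```

Calling trigger_ingest(build_ingest_request("complaints/2026/01/CMP00020.pdf", "Complaint", "CUST00099")) should return the response dictionary, including orkes_ui_url.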
3
Open the Orkes UI — watch the DAG execute live
The response JSON contains orkes_ui_url, which points directly to the running execution. Open it to watch the workflow run.
Expected response:
{
  "workflow_id": "<uuid>",
  "workflow_name": "askmybank_document_pipeline",
  "s3_key": "complaints/2026/01/CMP00020.pdf",
  "doc_type": "Complaint",
  "orkes_ui_url": "https://developer.orkescloud.com/execution/<id>",
  "status": "started"
}
✓ In the Orkes UI you see each task node turn green one by one — detect → textract → chunk → [embed ‖ metadata] → JOIN → complete
4
Show the parallel FORK_JOIN in Orkes (the visual showpiece)
Once textract and chunk complete, the DAG splits into two branches running simultaneously. This is what Orkes does better than most orchestrators — visual parallel execution with guaranteed join.
✓ Branch A: generate_embeddings → update_s3_vectors (Bedrock Titan → S3)
✓ Branch B: parse_metadata → upsert_clickhouse (Nova Lite LLM → ClickHouse)
5
Query the ingested document in chat.html
Once pipeline_complete turns green (~60s), the document is live in RAG. Open chat.html and ask about the complaint you just ingested.
# Try these queries after ingesting CMP00020.pdf:
"Show me formal complaints raised in January 2026"
"What complaint was logged for customer CUST00099?"
"Are there any critical priority complaints this year?"
✓ AI retrieves the exact document — demonstrating the full pipeline→RAG→chat loop
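How chat.html retrieves documents is not described here. As one plausible sketch, retrieval over the stored vectors could be a brute-force cosine-similarity search; the function name, the metadata layout, and the choice of brute force are all assumptions.

```python
import math


def top_k(query_vec, vectors, metadata, k=3):
    """Return the k (score, metadata) pairs most similar to the query vector."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    scored = sorted(
        ((cosine(query_vec, vec), meta) for vec, meta in zip(vectors, metadata)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return scored[:k]
```

The query text would first be embedded with the same 256-dimension model so that the query and document vectors are comparable.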
6
Bonus: ingest a dispute to show OCR (Textract takes longer)
Dispute PDFs are slightly more complex — Textract's async job takes a few extra seconds. This is a natural talking point: "for scanned documents Textract does real OCR — that's why this task has a 1-hour timeout."
curl -s -X POST \
  https://r6v15i892m.execute-api.us-east-1.amazonaws.com/pipeline/ingest \
  -H 'Content-Type: application/json' \
  -d '{
    "s3_key": "disputes/2026/01/DSP00015.pdf",
    "doc_type": "Dispute",
    "customer_id": "CUST00019"
  }' | python3 -m json.tool