Solution Architecture

🏦 Banking AI Document Assistant

From IBM FileNet documents to natural language answers — powered by RAG + AWS Bedrock · USD currency · Live on Streamlit Cloud

Document Sources  ·  IBM FileNet P8 (Production)  ·  AWS S3 (POC)
📄 eStatements · 400 docs
⚖️ Dispute Cases · 250 docs
📢 Complaints · 200 docs
🔧 Account Maintenance · 150 docs
🔍 Digital PDF → pdfplumber
🖨️ Scanned PDF → AWS Textract OCR
1,000
Total Documents
~4,000
Text Chunks
256-d
Embedding Dims
<30s
New Doc → Queryable
~$0.10
Ingestion Cost
~$1/mo
Running Cost
Real-Time Ingestion Pipeline — Single Run, Both Stores Kept in Sync
☁️ AWS S3
PDF uploaded with full metadata tags (32 fields)
🐍 ingest.py
Downloads PDF, extracts text, chunks 500 tokens, embeds via Titan
dual write
📦 ChromaDB
Vector chunks upserted · semantic content search · ~3,948 chunks · 256-dim embeddings
🗄️ ClickHouse Cloud · ReplacingMergeTree
Full metadata row upserted · aggregation engine · deduplicates automatically on re-run
Before: generate CSV → manually import to ClickHouse (one-off, goes stale)  |  Now: python rag/ingest.py → ChromaDB + ClickHouse updated together, every time, for every new document
Two query tracks
✅ Live Now
Track 1 · Pre-POC — Aggregation via ClickHouse NL→SQL
1
PDF metadata extracted during upload
Customer ID, RM, Branch, Case Summary, Dates → 32 columns
2
Stored in ClickHouse Cloud
Columnar DB — optimised for billion-row analytics
3
Nova Lite converts NL → SQL
Schema-aware prompt → valid ClickHouse query
4
Nova Lite formats results
Raw rows → professional markdown answer with insights
🟢 ClickHouse Cloud
🤖 Nova Lite NL→SQL
Full dataset scan
No TOP_K limit
✓ Aggregation questions → ClickHouse
• "How many complaints raised each year?"
• "Which RM handled the most disputes?"
• "Total compensation paid per branch"
• "Breakdown of dispute types referred to Ombudsman"
query
router
🚀 Full POC
Track 2 · Full RAG — Content Search via ChromaDB
1
PDF text extracted & chunked
500-token chunks, 80-token overlap
2
Embedded via Bedrock Titan
256-dim vectors stored in ChromaDB (~3,948 chunks)
3
Semantic vector search
Top-K chunks retrieved + metadata filters applied
4
Nova Lite generates answer
Grounded in retrieved chunks, cites doc IDs
🔵 AWS Bedrock
🤖 Nova Lite LLM
📦 ChromaDB
🔗 LangChain
✓ Content questions → ChromaDB RAG
• "Summarise Mathew Little's complaint"
• "Why was dispute DSP00047 lost?"
• "What did the customer say in this case?"
• "Show high priority complaints from Leeds"
Smart Query Router — Automatic Intent Detection
Every banker question is automatically classified by keyword intent before hitting any backend. No manual switching — the system picks the right engine every time.
how many · count · total · breakdown · per year · by branch · which RM · average · trend
↳ Any of these keywords → routed to ClickHouse NL→SQL (full dataset scan, no limit)
↳ All other questions → routed to ChromaDB semantic vector search (summarise, explain, what did, show case)
⚡ Why Two Engines?
ClickHouse — scans ALL 1,000+ docs, perfect for counts & trends.
ChromaDB — semantic search for what was said inside a specific case.
Neither alone is sufficient — together they cover every question type.
Banker-Facing Chat Interface — Streamlit Community Cloud (Hosted · Password Protected)
💬 Natural Language Chat
Bankers type questions in plain English. No SQL, no file browsing, no training required. Responses include source document citations with direct links back to PDFs in S3 / FileNet. All monetary amounts displayed in USD ($).
Intent detection
Auto metadata filters
Source citations
PDF deep links
🔐 Password gate
☁️ Streamlit Community Cloud
Free cloud hosting. Deploys directly from GitHub. Secrets managed via Streamlit Cloud Secrets Manager — no credentials in code.
Free tier hosting
GitHub auto-deploy
st.secrets → os.environ
💰 Running Cost
~$0.001 per question
~$1/month @ 100 q/day
S3 storage: $0 free tier
Streamlit hosting: $0
Production Migration — One Swap Per Component
📁 S3 → IBM FileNet P8
One loader swap. FileNet REST API / CMIS replaces boto3 S3 calls. All chunking and embedding code unchanged.
🤖 Nova Lite → Claude Haiku
One line config change. Better reasoning for complex banking queries and regulated industry compliance.
📦 ChromaDB → pgvector
Scale to multi-user production on PostgreSQL RDS. Same LangChain interface — no chain code changes.
🏦 IBM watsonx (Optional)
Full on-premise IBM stack if data residency or enterprise licensing requires it.
Full Tech Stack
☁️ AWS S3
🤖 AWS Bedrock
✨ Amazon Nova Lite
🔢 Titan Embeddings v2
📦 ChromaDB
🗄️ ClickHouse Cloud
🔄 ReplacingMergeTree
🔗 LangChain
🐍 Python 3.9
🌐 Streamlit Community Cloud
🔐 Password Gate
📄 pdfplumber
🖨️ AWS Textract (OCR)
💱 USD Currency
🏦 IBM FileNet P8 (Prod)
🧠 IBM watsonx (Optional)
🔐 IAM Role-Based Access
📋 Audit Logging
⚡ FastAPI (Planned)