Building a Financial Knowledge Graph with Vector Search
How we built a 3D knowledge graph that connects 20,000+ analyst reports, SEC filings, and institutional positions using Pinecone vector search, Voyage Finance-2 embeddings, and Three.js visualization. A technical deep-dive into the architecture.
Why Build a Knowledge Graph for Finance?
Financial data is inherently relational. A single SEC 10-K filing connects to dozens of entities: the company that filed it, the auditor that reviewed it, the executives who signed it, the risk factors it discloses, the competitors it mentions, the analyst who covers the stock, and the institutional investors who hold positions.
Traditional financial data platforms treat these as separate tables in a relational database. You can query a company's filings, or look up an analyst's ratings, or pull a fund's 13F positions -- but the connections between them are invisible.
A knowledge graph makes those connections explicit. And when you combine it with vector search, something powerful happens: you can discover relationships that no query was designed to find.
At HedgeFundTrade.ai, we built a financial knowledge graph that fuses three data layers -- document intelligence, market data, and institutional positioning -- into a searchable, explorable, 3D-rendered graph. This post is a technical deep-dive into how we did it.
Architecture Overview
The system has four core components:
- Data Pipeline: Ingestion, parsing, and embedding of financial documents
- Vector Store: Pinecone index with Voyage Finance-2 embeddings for semantic search
- Graph Database: Relationship extraction and entity linking
- 3D Visualization: Three.js + react-force-graph-3d for interactive exploration
Here is the high-level data flow:
```
SEC EDGAR API → Filing Parser → Chunker → Voyage Finance-2 Embedder
      ↓                                              ↓
Entity Extractor → Graph Store             Pinecone Vector Index
                       ↓                             ↓
                       └───── Knowledge Graph API ───┘
                                      ↓
                         Three.js 3D Visualization
```
Let's walk through each layer.
Layer 1: Document Intelligence Pipeline
Ingestion
The pipeline starts with the SEC EDGAR API, which provides access to every filing submitted to the SEC. We use sec-api.io for reliable, structured access to filing metadata and full text.
When a new filing appears, the system:
- Fetches the filing metadata (CIK, company name, form type, filing date)
- Downloads the full text (HTML, XBRL, or plain text depending on the form)
- Parses it into clean, structured sections
- Stores the raw document in S3 for archival
```python
from app.ingestion.sec_client import SECClient
# Module paths for the section parser and S3 helper are assumed here.
from app.ingestion.parser import parse_filing_sections
from app.storage import s3_client

async def ingest_filing(cik: str, form_type: str):
    client = SECClient()
    filing = await client.get_latest_filing(cik, form_type)

    # Parse into sections
    sections = parse_filing_sections(filing.content)

    # Store raw document
    await s3_client.upload(
        bucket="hft-filings",
        key=f"{cik}/{form_type}/{filing.accession_number}.html",
        body=filing.raw_content,
    )
    return sections
```
Chunking
Chunking strategy is critical for search quality. We use a hierarchical approach:
Level 1 -- Section Chunking: The filing is split at natural section boundaries (Item 1, Item 1A, Item 7, etc.). Each section becomes a parent chunk.
Level 2 -- Paragraph Chunking: Within each section, we split at paragraph boundaries, targeting 300-500 tokens per chunk with 50-token overlap. Overlap ensures that context is preserved across chunk boundaries.
Level 3 -- Metadata Enrichment: Each chunk is tagged with:
- Source company (CIK, ticker, name)
- Form type (10-K, 10-Q, 8-K, 13F)
- Filing date
- Section identifier (Item 1A: Risk Factors, Item 7: MD&A, etc.)
- Chunk position within the section
This metadata is stored alongside the vector embedding in Pinecone, enabling filtered searches ("show me only Risk Factor sections from fintech companies filed after January 2026").
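The Level 2 overlap scheme above can be sketched as a simple sliding window. This is a minimal illustration, not the production chunker: whitespace-split tokens stand in for real tokenizer tokens, and the internal `create_chunks` additionally respects paragraph boundaries.

```python
def window_chunks(
    tokens: list[str], max_tokens: int = 400, overlap: int = 50
) -> list[list[str]]:
    # Step by (max_tokens - overlap) so each chunk repeats the last
    # `overlap` tokens of the previous chunk, preserving context
    # across chunk boundaries.
    step = max_tokens - overlap
    return [
        tokens[i:i + max_tokens]
        for i in range(0, max(len(tokens) - overlap, 1), step)
    ]

tokens = "the company faces rising delinquency rates".split() * 200  # 1,200 tokens
chunks = window_chunks(tokens)
```

With 1,200 tokens this yields four chunks, each sharing its first 50 tokens with the tail of the previous chunk.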
```python
from app.processing.chunker import create_chunks

def chunk_filing_section(section_text: str, metadata: dict) -> list[dict]:
    chunks = create_chunks(
        text=section_text,
        max_tokens=400,
        overlap_tokens=50,
        respect_paragraphs=True,
    )
    return [
        {
            "id": f"{metadata['accession_number']}-{metadata['section']}-{i}",
            "text": chunk.text,
            "metadata": {
                **metadata,
                "chunk_index": i,
                "token_count": chunk.token_count,
            },
        }
        for i, chunk in enumerate(chunks)
    ]
```
Embedding with Voyage Finance-2
This is where domain-specific models make a measurable difference.
Voyage Finance-2 is a 1024-dimensional embedding model trained specifically on financial text. Compared to general-purpose models, it better captures the semantic relationships between financial concepts. "Rising delinquency rates" and "deteriorating credit quality" are close in Voyage Finance-2's embedding space, even though the two phrases share no words.
We evaluated several embedding models before selecting Voyage Finance-2:
| Model | Dimensions | Financial Recall@10 | Latency (p95) |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 0.78 | 42ms |
| OpenAI text-embedding-3-small | 1536 | 0.72 | 28ms |
| Cohere embed-english-v3 | 1024 | 0.75 | 35ms |
| **Voyage Finance-2** | **1024** | **0.86** | **31ms** |
Voyage Finance-2 achieved the highest recall on our financial text benchmark while keeping dimensionality at 1024 (important for Pinecone storage costs) and latency under 35ms.
```python
from app.processing.embedder import VoyageEmbedder

embedder = VoyageEmbedder(model="voyage-finance-2")

async def embed_chunks(chunks: list[dict]) -> list[dict]:
    texts = [chunk["text"] for chunk in chunks]
    embeddings = await embedder.embed_batch(texts)
    return [
        {
            "id": chunk["id"],
            "values": embedding,
            "metadata": chunk["metadata"],
        }
        for chunk, embedding in zip(chunks, embeddings)
    ]
```
Layer 2: Pinecone Vector Index
All embeddings are stored in a Pinecone serverless index. The index configuration:
- Name: `sec-filings`
- Dimensions: 1024
- Metric: Cosine similarity
- Deployment: Serverless
Upsert Pipeline
After embedding, chunks are upserted to Pinecone in batches of 100:
```python
from pinecone import Pinecone

pc = Pinecone()
index = pc.Index("sec-filings")

async def upsert_to_pinecone(embedded_chunks: list[dict]):
    # Batch upsert for efficiency
    for i in range(0, len(embedded_chunks), 100):
        batch = embedded_chunks[i:i + 100]
        index.upsert(vectors=batch)
```
Semantic Search
When a user queries the system, the query text is embedded with the same Voyage Finance-2 model and matched against the index:
```python
async def semantic_search(
    query: str,
    top_k: int = 10,
    filters: dict | None = None,
) -> list[dict]:
    # Embed the query
    query_embedding = await embedder.embed(query)

    # Search Pinecone with optional metadata filters
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        filter=filters,
        include_metadata=True,
    )
    return [
        {
            "text": match.metadata.get("text", ""),
            "score": match.score,
            "company": match.metadata.get("company_name"),
            "form_type": match.metadata.get("form_type"),
            "filing_date": match.metadata.get("filing_date"),
            "section": match.metadata.get("section"),
        }
        for match in results.matches
    ]
```
A query like "What are the biggest cybersecurity risks facing fintech companies?" returns the most semantically relevant paragraphs from across thousands of filings -- not just keyword matches, but conceptually related passages about data breaches, encryption requirements, regulatory compliance, and third-party vendor risk.
Filtered Search
Pinecone's metadata filtering enables scoped searches without sacrificing performance:
```python
# Search only 10-K Risk Factors filed since January 2025
results = await semantic_search(
    query="supply chain disruption risks",
    filters={
        "form_type": {"$eq": "10-K"},
        "section": {"$eq": "Item 1A"},
        "filing_date": {"$gte": "2025-01-01"},
    },
)
```
This is critical for financial analysis, where context matters. An analyst researching current risks does not want results from 2019 filings drowning out 2026 disclosures.
Layer 3: Knowledge Graph Construction
Vector search finds similar text. A knowledge graph finds connected entities. The two complement each other.
Entity Extraction
From each filing, we extract structured entities:
- Companies: ticker, CIK, sector, industry
- People: executives, directors, auditors
- Financial Metrics: revenue, EPS, guidance figures
- Risk Factors: categorized by type (regulatory, competitive, credit, operational)
- Relationships: competitor mentions, supplier/customer relationships, analyst coverage
Entity extraction uses a combination of rule-based parsing (for structured fields like financial tables) and LLM-powered extraction (for unstructured narrative sections).
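As an illustration of the rule-based side, here is a minimal sketch that pulls ticker symbols and CIK numbers out of filing text with regexes. The patterns and sample text are hypothetical stand-ins, not the production parser; narrative sections would go to the LLM extractor instead.

```python
import re

# Illustrative patterns: exchange-prefixed tickers and CIK numbers.
TICKER_RE = re.compile(r"\((?:Nasdaq|NYSE):\s*([A-Z]{1,5})\)")
CIK_RE = re.compile(r"CIK\s*(?:No\.?\s*)?(\d{4,10})")

def extract_entities(text: str) -> dict:
    # De-duplicate and sort so downstream entity linking is deterministic.
    tickers = sorted(set(TICKER_RE.findall(text)))
    ciks = sorted(set(CIK_RE.findall(text)))
    return {"tickers": tickers, "ciks": ciks}

sample = (
    "Interactive Brokers Group, Inc. (Nasdaq: IBKR), CIK No. 1381197, "
    "competes with The Charles Schwab Corporation (NYSE: SCHW)."
)
entities = extract_entities(sample)
```

Structured fields like these feed the graph directly; the extracted CIK also keys the lookup back into the EDGAR metadata from Layer 1.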
Relationship Types
The graph connects entities through typed edges:
| Relationship | Example |
|---|---|
| `FILED_BY` | 10-K filing -> Company |
| `MENTIONS_RISK` | Filing -> Risk Factor |
| `COVERS` | Analyst -> Company |
| `HOLDS_POSITION` | Fund (13F) -> Company |
| `COMPETES_WITH` | Company -> Company |
| `INSIDER_TRANSACTION` | Executive -> Company |
| `HAS_PRICE_TARGET` | Analyst -> Company + Target |
| `SECTOR_MEMBER` | Company -> Sector |
Building the Graph
Each entity becomes a node. Each relationship becomes an edge. The result is a densely connected graph where you can traverse from any starting point to discover related entities:
```
Interactive Brokers (IBKR)
├── FILED_BY: 10-K (2025-02-28)
│   ├── MENTIONS_RISK: "Interest rate environment"
│   ├── MENTIONS_RISK: "Cybersecurity threats"
│   └── MENTIONS_RISK: "Regulatory capital requirements"
├── COVERS: Analyst A (Goldman Sachs) → PT: $85
├── COVERS: Analyst B (JP Morgan) → PT: $78
├── HOLDS_POSITION: Citadel (13F Q4 2025) → 2.1M shares
├── HOLDS_POSITION: Renaissance (13F Q4 2025) → 890K shares
├── COMPETES_WITH: Charles Schwab (SCHW)
├── COMPETES_WITH: TD Ameritrade (merged)
└── SECTOR_MEMBER: Capital Markets Infrastructure
```
Starting from IBKR, a user can explore: What risks does the company disclose? Who covers the stock, and are they accurate? What institutions hold it? Who are the competitors, and do they share similar risk factors?
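That exploration is, under the hood, a bounded graph traversal. A sketch as breadth-first search with a hop limit, over a plain adjacency dict (the dict and entity ids are illustrative, not the production store):

```python
from collections import deque

# Illustrative adjacency: node id -> connected node ids.
adjacency = {
    "company:IBKR": ["filing:ibkr-10k-2025", "analyst:gs-a",
                     "fund:citadel", "company:SCHW"],
    "filing:ibkr-10k-2025": ["risk:cybersecurity", "risk:interest-rates"],
    "analyst:gs-a": ["company:SCHW"],
    "fund:citadel": ["company:IBKR"],
    "company:SCHW": ["filing:schw-10k-2025"],
}

def traverse(start: str, max_hops: int = 2) -> set[str]:
    # BFS with a depth bound: stop expanding nodes at max_hops.
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen - {start}

reachable = traverse("company:IBKR")
```

Two hops out from IBKR already reaches a competitor's filing (`filing:schw-10k-2025`), which is exactly the "do competitors share similar risk factors?" question.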
Layer 4: 3D Visualization with Three.js
The graph comes alive through 3D visualization using react-force-graph-3d, which is built on Three.js.
Why 3D?
Financial knowledge graphs are high-dimensional. A 2D force-directed layout quickly becomes a tangled mess when you have thousands of nodes and tens of thousands of edges. 3D adds a critical degree of freedom -- clusters naturally separate along the z-axis, making it easier to identify sector groupings, analyst networks, and institutional positioning patterns.
Implementation
The visualization runs entirely client-side. The graph data is fetched from the API as a JSON structure of nodes and edges:
```typescript
interface GraphNode {
  id: string;
  label: string;
  type: 'company' | 'analyst' | 'fund' | 'filing' | 'risk_factor' | 'sector';
  size: number;   // Based on market cap, position size, or filing count
  color: string;  // Determined by node type
  metadata: Record<string, unknown>;
}

interface GraphEdge {
  source: string;
  target: string;
  type: string;
  weight: number; // Relationship strength
  metadata: Record<string, unknown>;
}

interface GraphData {
  nodes: GraphNode[];
  edges: GraphEdge[];
}
```
The graph component is dynamically imported with server-side rendering disabled, so it never runs on the server and stays out of the landing page's initial bundle:
```typescript
import dynamic from 'next/dynamic';

const KnowledgeGraph = dynamic(
  () => import('@/components/dashboard/knowledge-graph'),
  { ssr: false }
);
```
Interaction Design
Users interact with the graph through:
- Click on a node to expand its connections
- Hover to see entity details (ticker, analyst name, filing date)
- Search to highlight paths between two entities
- Filter by entity type, sector, or date range
- Zoom to explore clusters at different levels of detail
When a user clicks on a company node, the graph expands to show all connected entities -- the analysts covering it, the funds holding it, the filings associated with it, and the risk factors disclosed. Each of those entities can be further expanded, creating an exploratory workflow that mirrors how financial research actually works: following threads of connection.
Connecting the Layers: Search to Graph to Insight
The real power emerges when vector search and the knowledge graph work together.
Scenario: An analyst asks, "Which fintech companies are disclosing increasing credit risk in their 2026 filings?"
- Vector search returns the top 20 most relevant filing chunks across all fintech 10-K filings
- Entity extraction identifies the companies mentioned: Affirm, Upstart, LendingClub, SoFi, Block
- Graph traversal enriches each result with:
  - Current analyst ratings and price targets
  - Institutional position changes (13F data)
  - Related risk factors from prior filings (temporal comparison)
  - Competitor filings with similar language
The result is not just a list of text matches. It is a structured intelligence report connecting filing language to market data to institutional behavior. This is the difference between search and research.
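In code, the enrichment step amounts to joining vector-search hits against the graph. A schematic sketch with stand-in data; the function, field names, and sample values are hypothetical, not the production pipeline:

```python
def enrich_results(search_hits: list[dict], edge_index: dict) -> list[dict]:
    # Join each search hit with its graph context, keyed by (edge type, ticker).
    return [
        {
            **hit,
            "analyst_coverage": edge_index.get(("COVERS", hit["company"]), []),
            "holders": edge_index.get(("HOLDS_POSITION", hit["company"]), []),
        }
        for hit in search_hits
    ]

hits = [{"company": "AFRM", "score": 0.91, "section": "Item 1A"}]
edge_index = {
    ("COVERS", "AFRM"): ["Analyst A, PT $45"],
    ("HOLDS_POSITION", "AFRM"): ["Fund X: 1.2M shares"],
}
report = enrich_results(hits, edge_index)
```

Each entry in the report keeps the similarity score from the vector search while carrying the graph context an analyst actually acts on.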
Performance Characteristics
In production, the system achieves:
| Operation | Latency (p95) |
|---|---|
| Embedding a query (Voyage Finance-2) | 31ms |
| Pinecone similarity search (top 20) | 45ms |
| Graph traversal (2-hop) | 12ms |
| End-to-end search with graph enrichment | < 200ms |
| 3D graph render (1,000 nodes) | 16ms per frame (60fps) |
| 3D graph render (10,000 nodes) | 33ms per frame (30fps) |
The entire pipeline -- from natural language query to enriched, graph-connected results -- runs in under 200 milliseconds. This is fast enough for interactive exploration.
Scaling Considerations
As the corpus grows (more filings, more analyst reports, more 13F data), several scaling vectors matter:
- Pinecone: Serverless architecture scales automatically with index size. Currently indexing 500K+ chunks with no performance degradation.
- Graph Store: Relationship counts can grow quadratically with entity count in the worst case. Pruning strategies (removing low-weight edges, archiving old filings) keep the active graph manageable.
- Embedding Costs: Voyage Finance-2 pricing is per-token. We batch embed during ingestion and cache aggressively -- queries hit the cache before calling the embedding API.
- 3D Rendering: Client-side rendering means the server does not bear visualization costs. WebGL handles up to ~10,000 nodes smoothly on modern hardware.
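The query-side embedding cache mentioned above can be sketched as a content-addressed lookup. This is a minimal illustration: a dict stands in for the shared cache, and the cache backend and helper names are assumptions, not the production design.

```python
import hashlib

_embedding_cache: dict[str, list[float]] = {}

def cache_key(query: str) -> str:
    # Normalize before hashing so trivially different queries
    # ("Revenue guidance" vs " revenue guidance") share one entry.
    return hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()

async def cached_embed(query: str, embed_fn) -> list[float]:
    key = cache_key(query)
    if key not in _embedding_cache:
        # Only a cache miss pays the per-token embedding cost.
        _embedding_cache[key] = await embed_fn(query)
    return _embedding_cache[key]
```

Because embeddings for a fixed model are deterministic, cached vectors never go stale; the cache only needs to be flushed when the embedding model itself changes.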
What We Learned
Building this system surfaced several non-obvious lessons:
1. Chunking strategy matters more than embedding model selection. A mediocre embedding model with excellent chunking outperforms a state-of-the-art model with naive chunking. Financial documents have natural section boundaries -- respecting them dramatically improves retrieval quality.
2. Metadata is as important as embeddings. Filtering by form type, date range, and section eliminates noise before the similarity search even runs. Without filters, a search for "revenue guidance" returns results from every filing ever made. With filters, it returns exactly what the analyst needs.
3. 3D visualization is not just eye candy. Analysts consistently discover unexpected connections through graph exploration that they would never have found through keyword search. A fund that increased its IBKR position in the same quarter that IBKR disclosed record trading volumes -- that connection is invisible in a flat database but obvious in a graph.
4. Finance-specific embeddings justify their cost. The 8-percentage-point improvement in Recall@10 from Voyage Finance-2 over general-purpose models translates directly to analyst satisfaction. When search results are more relevant, users trust the system and explore more.
Try It Yourself
The knowledge graph is live on HedgeFundTrade.ai. Search across 20,000+ analyst reports, explore connections between companies, analysts, and institutional positions in 3D, and track who actually gets their predictions right.
For more context on how we use this technology for SEC filing analysis and the real-world data it surfaces -- like the fintech sector rotation of 2026 -- explore our other research posts.
Sign up for free to start exploring the graph.