Building a Financial Knowledge Graph with Vector Search
How we built a 3D knowledge graph that connects 20,000+ analyst reports, SEC filings, and institutional positions using Pinecone vector search, Voyage Finance-2 embeddings, and Three.js visualization. A technical deep-dive into the architecture.
Why Build a Knowledge Graph for Finance?
Financial data is inherently relational. A single SEC 10-K filing connects to dozens of entities: the company that filed it, the auditor that reviewed it, the executives who signed it, the risk factors it discloses, the competitors it mentions, the analyst who covers the stock, and the institutional investors who hold positions.
Traditional financial data platforms treat these as separate tables in a relational database. You can query a company's filings, or look up an analyst's ratings, or pull a fund's 13F positions -- but the connections between them are invisible.
A knowledge graph makes those connections explicit. And when you combine it with vector search, something powerful happens: you can discover relationships that no query was designed to find.
At HedgeFundTrade.ai, we built a financial knowledge graph that fuses three data layers -- document intelligence, market data, and institutional positioning -- into a searchable, explorable, 3D-rendered graph. This post is a technical deep-dive into how we did it.
Architecture Overview
The system has four core components:
- Data Pipeline: Ingestion, parsing, and embedding of financial documents
- Vector Store: Pinecone index with Voyage Finance-2 embeddings for semantic search
- Graph Database: Relationship extraction and entity linking
- 3D Visualization: Three.js + react-force-graph-3d for interactive exploration
Here is the high-level data flow:
```
SEC EDGAR API → Filing Parser → Chunker → Voyage Finance-2 Embedder
      ↓                                              ↓
Entity Extractor → Graph Store             Pinecone Vector Index
                       ↓                             ↓
                       └───── Knowledge Graph API ───┘
                                      ↓
                         Three.js 3D Visualization
```
Let's walk through each layer.
Layer 1: Document Intelligence Pipeline
Ingestion
The pipeline starts with the SEC EDGAR API, which provides access to every filing submitted to the SEC. We use sec-api.io for reliable, structured access to filing metadata and full text.
When a new filing appears, the system:
- Fetches the filing metadata (CIK, company name, form type, filing date)
- Downloads the full text (HTML, XBRL, or plain text depending on the form)
- Parses it into clean, structured sections
- Stores the raw document in S3 for archival
```python
from app.ingestion.sec_client import SECClient
# Module paths for the section parser and S3 helper are assumed here.
from app.ingestion.parser import parse_filing_sections
from app.storage import s3_client

async def ingest_filing(cik: str, form_type: str):
    client = SECClient()
    filing = await client.get_latest_filing(cik, form_type)

    # Parse into sections
    sections = parse_filing_sections(filing.content)

    # Store raw document
    await s3_client.upload(
        bucket="hft-filings",
        key=f"{cik}/{form_type}/{filing.accession_number}.html",
        body=filing.raw_content,
    )
    return sections
```
Chunking
Chunking strategy is critical for search quality. We use a hierarchical approach:
Level 1 -- Section Chunking: The filing is split at natural section boundaries (Item 1, Item 1A, Item 7, etc.). Each section becomes a parent chunk.
Level 2 -- Paragraph Chunking: Within each section, we split at paragraph boundaries, targeting 300-500 tokens per chunk with 50-token overlap. Overlap ensures that context is preserved across chunk boundaries.
Level 3 -- Metadata Enrichment: Each chunk is tagged with:
- Source company (CIK, ticker, name)
- Form type (10-K, 10-Q, 8-K, 13F)
- Filing date
- Section identifier (Item 1A: Risk Factors, Item 7: MD&A, etc.)
- Chunk position within the section
This metadata is stored alongside the vector embedding in Pinecone, enabling filtered searches ("show me only Risk Factor sections from fintech companies filed after January 2026").
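The Level 2 overlap scheme above can be sketched as a simple sliding window. This is a minimal illustration, not the production chunker: whitespace-split tokens stand in for real tokenizer tokens, and the internal `create_chunks` additionally respects paragraph boundaries.

```python
def window_chunks(
    tokens: list[str], max_tokens: int = 400, overlap: int = 50
) -> list[list[str]]:
    # Step by (max_tokens - overlap) so each chunk repeats the last
    # `overlap` tokens of the previous chunk, preserving context
    # across chunk boundaries.
    step = max_tokens - overlap
    return [
        tokens[i:i + max_tokens]
        for i in range(0, max(len(tokens) - overlap, 1), step)
    ]

tokens = "the company faces rising delinquency rates".split() * 200  # 1,200 tokens
chunks = window_chunks(tokens)
```

With 1,200 tokens this yields four chunks, each sharing its first 50 tokens with the tail of the previous chunk.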
```python
from app.processing.chunker import create_chunks

def chunk_filing_section(section_text: str, metadata: dict) -> list[dict]:
    chunks = create_chunks(
        text=section_text,
        max_tokens=400,
        overlap_tokens=50,
        respect_paragraphs=True,
    )
    return [
        {
            "id": f"{metadata['accession_number']}-{metadata['section']}-{i}",
            "text": chunk.text,
            "metadata": {
                **metadata,
                "chunk_index": i,
                "token_count": chunk.token_count,
            },
        }
        for i, chunk in enumerate(chunks)
    ]
```
Embedding with Voyage Finance-2
This is where domain-specific models make a measurable difference.
Voyage Finance-2 is a 1024-dimensional embedding model trained specifically on financial text. Compared to general-purpose models, it better captures the semantic relationships between financial concepts. "Rising delinquency rates" and "deteriorating credit quality" are close in Voyage Finance-2's embedding space, even though the two phrases share no words.
We evaluated several embedding models before selecting Voyage Finance-2:
| Model | Dimensions | Financial Recall@10 | Latency (p95) |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 0.78 | 42ms |
| OpenAI text-embedding-3-small | 1536 | 0.72 | 28ms |
| Cohere embed-english-v3 | 1024 | 0.75 | 35ms |
| **Voyage Finance-2** | **1024** | **0.86** | **31ms** |
Voyage Finance-2 achieved the highest recall on our financial text benchmark while keeping dimensionality at 1024 (important for Pinecone storage costs) and latency under 35ms.
```python
from app.processing.embedder import VoyageEmbedder

embedder = VoyageEmbedder(model="voyage-finance-2")

async def embed_chunks(chunks: list[dict]) -> list[dict]:
    texts = [chunk["text"] for chunk in chunks]
    embeddings = await embedder.embed_batch(texts)
    return [
        {
            "id": chunk["id"],
            "values": embedding,
            "metadata": chunk["metadata"],
        }
        for chunk, embedding in zip(chunks, embeddings)
    ]
```
Layer 2: Pinecone Vector Index
All embeddings are stored in a Pinecone serverless index. The index configuration:
- Name: `sec-filings`
- Dimensions: 1024
- Metric: Cosine similarity
- Deployment: Serverless
Upsert Pipeline
After embedding, chunks are upserted to Pinecone in batches of 100:
```python
from pinecone import Pinecone

pc = Pinecone()
index = pc.Index("sec-filings")

async def upsert_to_pinecone(embedded_chunks: list[dict]):
    # Batch upsert for efficiency
    for i in range(0, len(embedded_chunks), 100):
        batch = embedded_chunks[i:i + 100]
        index.upsert(vectors=batch)
```
Semantic Search
When a user queries the system, the query text is embedded with the same Voyage Finance-2 model and matched against the index:
```python
async def semantic_search(
    query: str,
    top_k: int = 10,
    filters: dict | None = None,
) -> list[dict]:
    # Embed the query
    query_embedding = await embedder.embed(query)

    # Search Pinecone with optional metadata filters
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        filter=filters,
        include_metadata=True,
    )
    return [
        {
            "text": match.metadata.get("text", ""),
            "score": match.score,
            "company": match.metadata.get("company_name"),
            "form_type": match.metadata.get("form_type"),
            "filing_date": match.metadata.get("filing_date"),
            "section": match.metadata.get("section"),
        }
        for match in results.matches
    ]
```
A query like "What are the biggest cybersecurity risks facing fintech companies?" returns the most semantically relevant paragraphs from across thousands of filings -- not just keyword matches, but conceptually related passages about data breaches, encryption requirements, regulatory compliance, and third-party vendor risk.
Filtered Search
Pinecone's metadata filtering enables scoped searches without sacrificing performance:
```python
# Search only 10-K Risk Factors filed since January 2025
results = await semantic_search(
    query="supply chain disruption risks",
    filters={
        "form_type": {"$eq": "10-K"},
        "section": {"$eq": "Item 1A"},
        "filing_date": {"$gte": "2025-01-01"},
    },
)
```
This is critical for financial analysis, where context matters. An analyst researching current risks does not want results from 2019 filings drowning out 2026 disclosures.
Layer 3: Knowledge Graph Construction
Vector search finds similar text. A knowledge graph finds connected entities. The two complement each other.
Entity Extraction
From each filing, we extract structured entities:
- Companies: ticker, CIK, sector, industry
- People: executives, directors, auditors
- Financial Metrics: revenue, EPS, guidance figures
- Risk Factors: categorized by type (regulatory, competitive, credit, operational)
- Relationships: competitor mentions, supplier/customer relationships, analyst coverage
Entity extraction uses a combination of rule-based parsing (for structured fields like financial tables) and LLM-powered extraction (for unstructured narrative sections).
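As an illustration of the rule-based side, here is a minimal sketch that pulls ticker symbols and CIK numbers out of filing text with regexes. The patterns and sample text are hypothetical stand-ins, not the production parser; narrative sections would go to the LLM extractor instead.

```python
import re

# Illustrative patterns: exchange-prefixed tickers and CIK numbers.
TICKER_RE = re.compile(r"\((?:Nasdaq|NYSE):\s*([A-Z]{1,5})\)")
CIK_RE = re.compile(r"CIK\s*(?:No\.?\s*)?(\d{4,10})")

def extract_entities(text: str) -> dict:
    # De-duplicate and sort so downstream entity linking is deterministic.
    tickers = sorted(set(TICKER_RE.findall(text)))
    ciks = sorted(set(CIK_RE.findall(text)))
    return {"tickers": tickers, "ciks": ciks}

sample = (
    "Interactive Brokers Group, Inc. (Nasdaq: IBKR), CIK No. 1381197, "
    "competes with The Charles Schwab Corporation (NYSE: SCHW)."
)
entities = extract_entities(sample)
```

Structured fields like these feed the graph directly; the extracted CIK also keys the lookup back into the EDGAR metadata from Layer 1.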
Relationship Types
The graph connects entities through typed edges:
| Relationship | Example |
|---|---|
| `FILED_BY` | 10-K filing -> Company |
| `MENTIONS_RISK` | Filing -> Risk Factor |
| `COVERS` | Analyst -> Company |
| `HOLDS_POSITION` | Fund (13F) -> Company |
| `COMPETES_WITH` | Company -> Company |
| `INSIDER_TRANSACTION` | Executive -> Company |
| `HAS_PRICE_TARGET` | Analyst -> Company + Target |
| `SECTOR_MEMBER` | Company -> Sector |
Building the Graph
Each entity becomes a node. Each relationship becomes an edge. The result is a densely connected graph where you can traverse from any starting point to discover related entities:
```
Interactive Brokers (IBKR)
├── FILED_BY: 10-K (2025-02-28)
│   ├── MENTIONS_RISK: "Interest rate environment"
│   ├── MENTIONS_RISK: "Cybersecurity threats"
│   └── MENTIONS_RISK: "Regulatory capital requirements"
├── COVERS: Analyst A (Goldman Sachs) → PT: $85
├── COVERS: Analyst B (JP Morgan) → PT: $78
├── HOLDS_POSITION: Citadel (13F Q4 2025) → 2.1M shares
├── HOLDS_POSITION: Renaissance (13F Q4 2025) → 890K shares
├── COMPETES_WITH: Charles Schwab (SCHW)
├── COMPETES_WITH: TD Ameritrade (merged)
└── SECTOR_MEMBER: Capital Markets Infrastructure
```
Starting from IBKR, a user can explore: What risks does the company disclose? Who covers the stock, and are they accurate? What institutions hold it? Who are the competitors, and do they share similar risk factors?
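That exploration is, under the hood, a bounded graph traversal. A sketch as breadth-first search with a hop limit, over a plain adjacency dict (the dict and entity ids are illustrative, not the production store):

```python
from collections import deque

# Illustrative adjacency: node id -> connected node ids.
adjacency = {
    "company:IBKR": ["filing:ibkr-10k-2025", "analyst:gs-a",
                     "fund:citadel", "company:SCHW"],
    "filing:ibkr-10k-2025": ["risk:cybersecurity", "risk:interest-rates"],
    "analyst:gs-a": ["company:SCHW"],
    "fund:citadel": ["company:IBKR"],
    "company:SCHW": ["filing:schw-10k-2025"],
}

def traverse(start: str, max_hops: int = 2) -> set[str]:
    # BFS with a depth bound: stop expanding nodes at max_hops.
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen - {start}

reachable = traverse("company:IBKR")
```

Two hops out from IBKR already reaches a competitor's filing (`filing:schw-10k-2025`), which is exactly the "do competitors share similar risk factors?" question.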
Layer 4: 3D Visualization with Three.js
The graph comes alive through 3D visualization using react-force-graph-3d, which is built on Three.js.
Why 3D?
Financial knowledge graphs are high-dimensional. A 2D force-directed layout quickly becomes a tangled mess when you have thousands of nodes and tens of thousands of edges. 3D adds a critical degree of freedom -- clusters naturally separate along the z-axis, making it easier to identify sector groupings, analyst networks, and institutional positioning patterns.
Implementation
The visualization runs entirely client-side. The graph data is fetched from the API as a JSON structure of nodes and edges:
```typescript
interface GraphNode {
  id: string;
  label: string;
  type: 'company' | 'analyst' | 'fund' | 'filing' | 'risk_factor' | 'sector';
  size: number;   // Based on market cap, position size, or filing count
  color: string;  // Determined by node type
  metadata: Record<string, unknown>;
}

interface GraphEdge {
  source: string;
  target: string;
  type: string;
  weight: number; // Relationship strength
  metadata: Record<string, unknown>;
}

interface GraphData {
  nodes: GraphNode[];
  edges: GraphEdge[];
}
```
The graph component is dynamically imported with server-side rendering disabled, so it never runs on the server and stays out of the landing page's initial bundle:
```typescript
import dynamic from 'next/dynamic';

const KnowledgeGraph = dynamic(
  () => import('@/components/dashboard/knowledge-graph'),
  { ssr: false }
);
```
Interaction Design
Users interact with the graph through:
- Click on a node to expand its connections
- Hover to see entity details (ticker, analyst name, filing date)
- Search to highlight paths between two entities
- Filter by entity type, sector, or date range
- Zoom to explore clusters at different levels of detail
When a user clicks on a company node, the graph expands to show all connected entities -- the analysts covering it, the funds holding it, the filings associated with it, and the risk factors disclosed. Each of those entities can be further expanded, creating an exploratory workflow that mirrors how financial research actually works: following threads of connection.
Connecting the Layers: Search to Graph to Insight
The real power emerges when vector search and the knowledge graph work together.
Scenario: An analyst asks, "Which fintech companies are disclosing increasing credit risk in their 2026 filings?"
- Vector search returns the top 20 most relevant filing chunks across all fintech 10-K filings
- Entity extraction identifies the companies mentioned: Affirm, Upstart, LendingClub, SoFi, Block
- Graph traversal enriches each result with:
  - Current analyst ratings and price targets
  - Institutional position changes (13F data)
  - Related risk factors from prior filings (temporal comparison)
  - Competitor filings with similar language
The result is not just a list of text matches. It is a structured intelligence report connecting filing language to market data to institutional behavior. This is the difference between search and research.
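In code, the enrichment step amounts to joining vector-search hits against the graph. A schematic sketch with stand-in data; the function, field names, and sample values are hypothetical, not the production pipeline:

```python
def enrich_results(search_hits: list[dict], edge_index: dict) -> list[dict]:
    # Join each search hit with its graph context, keyed by (edge type, ticker).
    return [
        {
            **hit,
            "analyst_coverage": edge_index.get(("COVERS", hit["company"]), []),
            "holders": edge_index.get(("HOLDS_POSITION", hit["company"]), []),
        }
        for hit in search_hits
    ]

hits = [{"company": "AFRM", "score": 0.91, "section": "Item 1A"}]
edge_index = {
    ("COVERS", "AFRM"): ["Analyst A, PT $45"],
    ("HOLDS_POSITION", "AFRM"): ["Fund X: 1.2M shares"],
}
report = enrich_results(hits, edge_index)
```

Each entry in the report keeps the similarity score from the vector search while carrying the graph context an analyst actually acts on.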
Performance Characteristics
In production, the system achieves:
| Operation | Latency (p95) |
|---|---|
| Embedding a query (Voyage Finance-2) | 31ms |
| Pinecone similarity search (top 20) | 45ms |
| Graph traversal (2-hop) | 12ms |
| End-to-end search with graph enrichment | < 200ms |
| 3D graph render (1,000 nodes) | 16ms per frame (60fps) |
| 3D graph render (10,000 nodes) | 33ms per frame (30fps) |
The entire pipeline -- from natural language query to enriched, graph-connected results -- runs in under 200 milliseconds. This is fast enough for interactive exploration.
Scaling Considerations
As the corpus grows (more filings, more analyst reports, more 13F data), several scaling vectors matter:
- Pinecone: Serverless architecture scales automatically with index size. Currently indexing 500K+ chunks with no performance degradation.
- Graph Store: Relationship counts can grow quadratically with entity count in the worst case. Pruning strategies (removing low-weight edges, archiving old filings) keep the active graph manageable.
- Embedding Costs: Voyage Finance-2 pricing is per-token. We batch embed during ingestion and cache aggressively -- queries hit the cache before calling the embedding API.
- 3D Rendering: Client-side rendering means the server does not bear visualization costs. WebGL handles up to ~10,000 nodes smoothly on modern hardware.
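The query-side embedding cache mentioned above can be sketched as a content-addressed lookup. This is a minimal illustration: a dict stands in for the shared cache, and the cache backend and helper names are assumptions, not the production design.

```python
import hashlib

_embedding_cache: dict[str, list[float]] = {}

def cache_key(query: str) -> str:
    # Normalize before hashing so trivially different queries
    # ("Revenue guidance" vs " revenue guidance") share one entry.
    return hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()

async def cached_embed(query: str, embed_fn) -> list[float]:
    key = cache_key(query)
    if key not in _embedding_cache:
        # Only a cache miss pays the per-token embedding cost.
        _embedding_cache[key] = await embed_fn(query)
    return _embedding_cache[key]
```

Because embeddings for a fixed model are deterministic, cached vectors never go stale; the cache only needs to be flushed when the embedding model itself changes.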
What We Learned
Building this system surfaced several non-obvious lessons:
1. Chunking strategy matters more than embedding model selection. A mediocre embedding model with excellent chunking outperforms a state-of-the-art model with naive chunking. Financial documents have natural section boundaries -- respecting them dramatically improves retrieval quality.
2. Metadata is as important as embeddings. Filtering by form type, date range, and section eliminates noise before the similarity search even runs. Without filters, a search for "revenue guidance" returns results from every filing ever made. With filters, it returns exactly what the analyst needs.
3. 3D visualization is not just eye candy. Analysts consistently discover unexpected connections through graph exploration that they would never have found through keyword search. A fund that increased its IBKR position in the same quarter that IBKR disclosed record trading volumes -- that connection is invisible in a flat database but obvious in a graph.
4. Finance-specific embeddings justify their cost. The 8-percentage-point improvement in Recall@10 from Voyage Finance-2 over general-purpose models translates directly to analyst satisfaction. When search results are more relevant, users trust the system and explore more.
Try It Yourself
The knowledge graph is live on HedgeFundTrade.ai. Search across 20,000+ analyst reports, explore connections between companies, analysts, and institutional positions in 3D, and track who actually gets their predictions right.
For more context on how we use this technology for SEC filing analysis and the real-world data it surfaces -- like the fintech sector rotation of 2026 -- explore our other research posts.
Sign up for free to start exploring the graph.