Unifying 12 Years of Knowledge: From OneNote to Local Markdown to Oracle AI Database

Part 4 of 4

⚡ TL;DR

Unified 12 years of scattered notes - 5,178 OneNote pages plus daily markdown journals - into a single AI-searchable knowledge base powered by Oracle AI Database 26ai. The result: a unified knowledge base where I can ask “What did I write about AI in 2021?” and get instant, accurate answers—regardless of where the note originated.

After building multimodal search with CLIP and vision-aware RAG, I had a powerful system. But most of my knowledge was trapped in two silos:

Microsoft OneNote: 12 years of meeting notes, project docs, and random ideas (5,000+ pages)
Local Markdown Editor: My current daily journal and PKM system (growing daily)

Both tools are excellent for capturing thoughts. Neither is great for finding them six months later.

This is the story of how I unified these knowledge silos into Oracle AI Database 26ai, and why the database becoming the knowledge hub changes everything.

🎯 What You'll Learn

How to migrate massive OneNote repositories via Microsoft Graph API
Building a real-time file watcher for local markdown files with Chokidar
Memory management for vector embedding generation at scale
Why Oracle AI Database 26ai is the perfect knowledge unification layer

The Knowledge Silo Problem

My Reality Before Integration

Knowledge Scattered Across:

OneNote — 5000+ Pages
Local Markdown Editor — Daily journals, linked references
PDFs, Word docs, PowerPoints…

The frustrating part: I knew I’d written about a topic before. Finding it meant searching three different apps, hoping I’d remember which one I used.

Worse, none of these tools understood context. Searching for “AI” in OneNote returns every word with “AI” in it. My objective of creating the knowledge management app in the first place was to be able to intelligently manage my knowledge base, but I had a wealth of information I wanted to include, not just my new notes.

The Vision: One Query, All Knowledge

What if I could ask:

“What were the key projects I was involved in last year?”

And get results from:

A markdown journal entry from September
A OneNote meeting note from February
A PowerPoint presentation I built

All ranked by semantic relevance, not keyword matching. All in one search.

That’s what I built.

Part 1: The OneNote Migration (5,178 Documents)

The Challenge: 12 Years of Notes

Microsoft OneNote stores notes in a proprietary format. The traditional approach—exporting to PDF—loses metadata, breaks formatting, and requires manual intervention for each notebook.

I needed:

Direct API access to OneNote’s cloud storage
Metadata preservation (creation dates, modification times)
Bulk processing capability (thousands of pages)
Rate limiting handling for Microsoft Graph API

Choosing the Integration Path

I evaluated three approaches:

Approach	Pros	Cons
PDF Export	Already working	Manual, loses metadata
MCP Server	Clean abstraction	Additional process to manage
Direct Graph API	Full control	More code, but simpler deployment

Winner: Direct Microsoft Graph API integration. The MCP pattern inspired the architecture, but for this use case, a direct service was simpler to deploy and maintain.

The Authentication Dance

Microsoft’s auth flow is very developer-friendly:

// backend/services/onenoteDirect.js
async function authenticateWithDeviceCode() {
  // Request device code from Microsoft
  const response = await axios.post(
    'https://login.microsoftonline.com/consumers/oauth2/v2.0/devicecode',
    new URLSearchParams({
      client_id: CLIENT_ID,
      scope: 'Notes.Read User.Read offline_access'
    })
  );

  console.log(`
    ╔════════════════════════════════════════╗
    ║     OneNote Authentication Required     ║
    ╠════════════════════════════════════════╣
    ║  1. Go to: microsoft.com/devicelogin   ║
    ║  2. Enter code: ${response.data.user_code}              ║
    ║  3. Sign in with your Microsoft account ║
    ╚════════════════════════════════════════╝
  `);

  // Poll for completion
  return pollForToken(response.data.device_code);
}

No Azure app registration. No complex OAuth redirect flows. Just a code, a browser, and done.

Handling Microsoft’s Rate Limits

The Graph API has strict rate limits. My first attempt hit 429 errors within 30 seconds—requests 1-46 succeeded, then request 47 got throttled with “429 Too Many Requests.”

Solution: Exponential backoff with Retry-After header support.

async function fetchWithRetry(url, options, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const response = await axios.get(url, options);
      return response;
    } catch (error) {
      if (error.response?.status === 429) {
        // Microsoft tells us exactly how long to wait
        const retryAfter = error.response.headers['retry-after'] ||
                          Math.pow(2, attempt) * 1000;

        console.log(`Rate limited. Waiting ${retryAfter}ms...`);
        await sleep(retryAfter);
        continue;
      }
      throw error;
    }
  }
}

The Recursive Section Group Problem

OneNote has a nested structure: Notebooks → Section Groups → Sections → Pages. Section groups can contain other section groups, creating arbitrary depth.

My first implementation only fetched top-level sections. Missing: hundreds of pages buried in nested groups.

// Recursive section group fetching
async function fetchAllSections(notebookId, parentGroupId = null) {
  const sections = [];

  // Get sections at this level
  const directSections = await fetchSections(notebookId, parentGroupId);
  sections.push(...directSections);

  // Get section groups at this level
  const sectionGroups = await fetchSectionGroups(notebookId, parentGroupId);

  // Recursively fetch from each section group
  for (const group of sectionGroups) {
    const nestedSections = await fetchAllSections(notebookId, group.id);
    sections.push(...nestedSections);

    // Rate limiting delay between groups
    await sleep(3000);
  }

  return sections;
}

Critical insight: The 3-second delay between section groups wasn’t optional—it was the difference between completing the migration and getting bounced from the API.

Processing Pipeline: HTML → Text → Vectors

OneNote pages come as HTML. The embeddings I’m using in Oracle AI Database 26ai look for text chunks with CLIP embeddings.

async function processOneNotePage(page, connection) {
  // 1. Fetch HTML content
  const htmlContent = await fetchPageContent(page.id);

  // 2. Convert HTML to clean text (preserve structure)
  const textContent = htmlToText(htmlContent, {
    wordwrap: false,
    preserveNewlines: true,
    tables: true  // Keep table structure
  });

  // 3. Store document with metadata
  const documentId = await insertDocument(connection, {
    documentName: page.title,
    documentType: 'NOTE',
    contentHtml: htmlContent,
    textContent: textContent,
    createdDate: page.createdDateTime,  // Preserve original date!
    modifiedDate: page.lastModifiedDateTime,
    sourceSystem: 'onenote',
    sourceId: page.id
  });

  // 4. Chunk and generate embeddings (in-database!)
  await generateChunksWithEmbeddings(connection, documentId, textContent);
}

The key insight: storing createdDateTime from OneNote means my 2012 notes show up with their original dates, not the import date. When I search “What was I working on in 2015?”, I get accurate results.

The Results: 5,178 Documents

5,178

Documents Migrated

Migration Statistics:

OneNote: 5,178 pages
Processing time: ~4 hours
Vector embeddings: 13,279 chunks
Storage: 2.4 GB in Oracle AI Database 26ai
Searchable: Immediately

First query after migration:

“What were my top 3 projects in 2023?”

Response: Three relevant summaries, properly dated, with source attribution. Magic.

Part 2: Local Markdown Real-Time Sync

Why Local Markdown?

After OneNote, I switched to a local markdown-based PKM (Personal Knowledge Management) approach that stores everything on my laptop. Key differences:

Local-first: Markdown files on disk
Bidirectional links: [[Page]] references create connections
Daily journals: Automatic date-based organization
Open format: Plain text, no lock-in

But a local markdown editor is a capture tool. For search and AI-powered retrieval, I wanted everything in Oracle AI Database 26ai. Could I take my notes in markdown and have them automatically upload into the 26ai database and be available in my knowledge management system? YES!

The Architecture: File Watcher → Parser → Database

Data Flow:

Markdown Directory — /journals/2025_12_18.md, /pages/Project Ideas.md
Chokidar File Watcher — Monitors for changes
Debounce (3 minutes) — Batches rapid edits
Sequential Processing Queue — One file at a time
Parse Markdown + Extract Tags/Links — Structure extraction
Generate CLIP Embeddings (in-database) — Vector creation
Oracle AI Database 26ai — Final storage

Challenge 1: Memory Management

My first implementation processed files in parallel. It crashed almost immediately:

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed -
JavaScript heap out of memory

Root cause: Each file requires embedding generation. Parallel processing exhausted the 4GB default Node.js heap.

Solution: Sequential processing with explicit memory management.

// backend/services/markdownWatcher.js
class MarkdownWatcher {
  constructor() {
    this.processingQueue = [];
    this.isProcessing = false;
  }

  async processQueue() {
    if (this.isProcessing || this.processingQueue.length === 0) return;

    this.isProcessing = true;

    while (this.processingQueue.length > 0) {
      const filePath = this.processingQueue.shift();

      try {
        await this.processFile(filePath);

        // Give garbage collector time to clean up
        await sleep(1000);

      } catch (error) {
        console.error(`Failed to process ${filePath}:`, error);
      }
    }

    this.isProcessing = false;
  }
}

Critical server start command:

node --max-old-space-size=8192 server-prod.js

The 8GB heap isn’t optional—it’s the difference between stable operation and random crashes.

Challenge 2: Timezone-Safe Date Parsing

Journal files are named 2025_12_18.md. Parsing this as a date seems trivial:

// WRONG - Creates UTC midnight, displays as previous day in EST
const date = new Date("2025-12-18");
// Shows: December 17, 2025 (wrong!)

The fix: Use the Date constructor with explicit local components.

// backend/services/markdownParser.js
function parseJournalDate(filename) {
  // filename: "2025_12_18.md"
  const match = filename.match(/(\d{4})_(\d{2})_(\d{2})\.md$/);
  if (!match) return null;

  const [, year, month, day] = match;

  // Create date in LOCAL timezone
  return new Date(
    parseInt(year),
    parseInt(month) - 1,  // Months are 0-indexed
    parseInt(day)
  );
}

This pattern applies everywhere you parse date strings into Date objects.

Challenge 3: The Infinite Loop Bug

Small files (< 1000 characters) caused an infinite loop in the chunking logic:

// BROKEN: Infinite loop when text.length < chunkSize
while (start < text.length) {
  const end = Math.min(start + chunkSize, text.length);
  chunks.push(text.slice(start, end));
  start = end - overlap;  // Never advances when end == text.length
}

The fix: Explicit termination condition.

// FIXED: Handles small files correctly
while (start < text.length) {
  const end = Math.min(start + chunkSize, text.length);
  chunks.push(text.slice(start, end));

  // Break when we've processed the entire text
  if (end >= text.length) break;

  start = end - overlap;
}

Extracting Markdown Graph Structure

Many markdown PKM tools support [[wiki links]] and #tags. I extract these during parsing:

function parseMarkdownContent(content, filename) {
  // Extract wiki links: [[Page Name]]
  const wikiLinks = [...content.matchAll(/\[\[([^\]]+)\]\]/g)]
    .map(match => match[1]);

  // Extract tags: #tag or #[[multi word tag]]
  const tags = [...content.matchAll(/#(?:\[\[([^\]]+)\]\]|(\w+))/g)]
    .map(match => match[1] || match[2]);

  return {
    title: filename.replace('.md', ''),
    content: content,
    wikiLinks: [...new Set(wikiLinks)],
    tags: [...new Set(tags)],
    isJournal: filename.match(/^\d{4}_\d{2}_\d{2}\.md$/) !== null
  };
}

This metadata is stored in the database, enabling future features like “show all notes linking to this concept.”

The 3-Minute Debounce

Many markdown editors auto-save constantly. Without debouncing, every keystroke triggers a sync:

// Too aggressive - causes constant reprocessing
watcher.on('change', async (path) => {
  await processFile(path);  // Runs on EVERY save
});

Solution: 3-minute debounce that resets on each change.

const DEBOUNCE_MS = 180000; // 3 minutes

watcher.on('change', (path) => {
  // Reset the timer on each change
  if (this.debounceTimer) {
    clearTimeout(this.debounceTimer);
  }

  this.pendingChanges.add(path);

  this.debounceTimer = setTimeout(() => {
    this.processPendingChanges();
  }, DEBOUNCE_MS);
});

Why 3 minutes? It’s long enough to batch multiple edits during active writing, but short enough that I don’t forget I made changes.

💡

The debounce timer significantly impacts user experience. Too short (30s) and you’re constantly processing partial thoughts. Too long (10min) and users wonder why their notes aren’t syncing. 3 minutes is the sweet spot for my workflow.

The Unified Result

One Query, Multiple Sources

After both integrations, here’s what happens when I ask:

“What patterns have I noticed in customer architecture discussions?”

-- Behind the scenes: Oracle AI Database 26ai
SELECT
  d.document_name,
  d.source_system,
  d.created_date,
  dc.chunk_text,
  VECTOR_DISTANCE(
    dc.chunk_vector_clip,
    VECTOR_EMBEDDING(CLIP_TEXT USING :query as INPUT),
    COSINE
  ) as similarity
FROM document_chunks dc
JOIN documents d ON dc.document_id = d.document_id
WHERE d.source_system IN ('onenote', 'markdown', 'pdf', 'powerpoint')
ORDER BY similarity
FETCH FIRST 10 ROWS ONLY;

Results come from:

A 2023 OneNote meeting note
Yesterday’s markdown journal entry
A 2024 PowerPoint presentation
A PDF whitepaper I uploaded last month

All ranked by semantic relevance. All with source attribution. All in < 200ms.

The Numbers

5,925

Total Documents

13,279

Vector Chunks

<200ms

Search Latency

Source	Documents	Span
OneNote	5,178	2012-2025
Markdown	47	2025-present
PDFs	412	Various
PowerPoints	156	Various
Images	132	Various

Why Oracle AI Database 26ai Is Optimal for This

I could have built this with a dedicated vector database (Pinecone, Weaviate, etc.). Here’s why Oracle was the better choice:

Single Source of Truth: Documents, embeddings, metadata—all in one place
ACID Guarantees: No sync issues between vector store and metadata
In-Database Embeddings: CLIP runs in SQL, not external APIs
SQL + Vectors: Complex queries combining filters and similarity
Existing Infrastructure: Already using Oracle for everything else

The database isn’t just storing my knowledge—it’s the integration layer that makes diverse sources act as one.

Lessons Learned

Insight	Takeaway
Rate Limits Are Real	Microsoft Graph API will absolutely throttle you. Build retry logic from day one, not after you’ve been bounced.
Memory Management Matters	Vector embedding generation is memory-intensive. Sequential processing with explicit GC time prevents crashes.
Preserve Original Dates	Storing `createdDateTime` from source systems enables time-based queries. Don’t use import timestamps.
Debounce Aggressively	File watchers that fire on every save create chaos. 3-minute debounce is the sweet spot for PKM tools.
Test Timezones	Date string parsing defaults to UTC. Always use `new Date(year, month-1, day)` for local dates.

What’s Next

The unified knowledge base enables features that were impossible before:

GraphRAG: Entity extraction across all sources, building a knowledge graph of people, projects, and concepts (already implemented!)
Daily Summaries: Auto-generated executive briefs from markdown journals and recent documents
Cross-Source References: “Show me everything I’ve written about [topic] since [date]”
Temporal Queries: “What changed in my thinking about [concept] between 2020 and 2024?”

The foundation is complete. Now the fun begins.

Have you unified knowledge silos? What sources did you integrate? What challenges did you face? Drop a comment or reach out

About the Author

Brian Hengen is a Vice President at Oracle, leading technical sales engineering teams. The views and opinions expressed in this blog are his own and do not necessarily reflect those of Oracle.

Brian Hengen

VP at Oracle • AI Researcher • Builder

Building and testing specialized language models on an RTX 5090. Exploring what happens when smaller, focused AI beats larger general-purpose models.

GitHub LinkedIn X/Twitter

The Knowledge Silo Problem#

My Reality Before Integration#

The Vision: One Query, All Knowledge#

Part 1: The OneNote Migration (5,178 Documents)#

The Challenge: 12 Years of Notes#

Choosing the Integration Path#

The Authentication Dance#

Handling Microsoft’s Rate Limits#

The Recursive Section Group Problem#

Processing Pipeline: HTML → Text → Vectors#

The Results: 5,178 Documents#

Part 2: Local Markdown Real-Time Sync#

Why Local Markdown?#

The Architecture: File Watcher → Parser → Database#

Challenge 1: Memory Management#

Challenge 2: Timezone-Safe Date Parsing#

Challenge 3: The Infinite Loop Bug#

Extracting Markdown Graph Structure#

The 3-Minute Debounce#

The Unified Result#

One Query, Multiple Sources#

The Numbers#

Why Oracle AI Database 26ai Is Optimal for This#

Lessons Learned#

What’s Next#

About the Author#

Subscribe to New Posts