
| Difficulty: Intermediate | Category: Workflow |
Build a Personal Knowledge Base with AI in 30 Minutes
Knowledge workers create an average of 1.7GB of new data per year, yet 85% report difficulty finding information they’ve previously saved. Your brilliant insights from last month’s research? Buried in a folder you can’t remember naming. The solution isn’t another note-taking app—it’s an AI-powered knowledge base that understands context and retrieves exactly what you need through natural language.
In this tutorial, you’ll build a semantic search system over your personal notes using Python, local embeddings, and a vector database. No cloud APIs required.
Prerequisites
- Python 3.10+ installed with pip
- 10GB free disk space for models and vector database
- Basic Python knowledge (reading files, running scripts)
- Your notes in markdown format (or any text files—we’ll show you how to convert)
Step-by-Step Guide
Step 1: Set Up Your Environment and Install Dependencies
Create a project directory and install the core libraries:
mkdir ai-knowledge-base
cd ai-knowledge-base
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install sentence-transformers==2.5.1 chromadb==0.4.22 langchain==0.1.9 pypdf2==3.0.1
Gotcha: ChromaDB changed its client API in 0.4.0, which introduced the PersistentClient used below, so this code will not run on older releases. Stick to 0.4.22 for this tutorial.
These libraries do the heavy lifting:
- sentence-transformers creates semantic embeddings locally
- chromadb stores and searches those embeddings
- langchain helps chunk documents intelligently (optional here: the scripts below use a simple hand-written chunker)
- pypdf2 converts PDFs to text
Step 2: Collect and Prepare Your Knowledge Sources
Create a documents folder and add your content:
mkdir documents
Supported formats: .txt, .md, .pdf. If you use Notion, Evernote, or Apple Notes, export to markdown first.
Pro tip: For Notion, use File → Export → Markdown & CSV with “Include subpages” enabled. This maintains your folder structure.
Here’s a Python script to validate your documents (validate_docs.py):
import os
from pathlib import Path

def scan_documents(folder_path):
    valid_extensions = {'.txt', '.md', '.pdf'}
    files = []
    for root, dirs, filenames in os.walk(folder_path):
        for filename in filenames:
            if Path(filename).suffix.lower() in valid_extensions:
                full_path = os.path.join(root, filename)
                size_kb = os.path.getsize(full_path) / 1024
                files.append((full_path, size_kb))
    print(f"Found {len(files)} documents:")
    for path, size in sorted(files):
        print(f"  {path} ({size:.1f} KB)")
    return files

if __name__ == "__main__":
    scan_documents("documents")
Run it: python validate_docs.py
Step 3: Create Document Embeddings
Embeddings transform text into numerical vectors that capture semantic meaning. Similar concepts cluster together in vector space.
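To make "cluster together" concrete, here is a minimal sketch of cosine similarity, the measure this tutorial's vector database is configured to use. The three-dimensional vectors are hand-picked, hypothetical values for illustration only; real all-MiniLM-L6-v2 embeddings have 384 dimensions and come from the model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", chosen by hand so that the two
# finance phrases point in a similar direction.
toy_vectors = {
    "budget planning": [0.9, 0.1, 0.2],
    "fiscal strategy": [0.8, 0.2, 0.3],
    "sourdough recipe": [0.1, 0.9, 0.1],
}

query = toy_vectors["budget planning"]
for label, vec in toy_vectors.items():
    print(f"{label}: {cosine_similarity(query, vec):.3f}")
```

The two finance phrases score close to 1, while the unrelated phrase scores much lower. Semantic search relies on this ranking, not on shared keywords.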
Create build_index.py:
from pathlib import Path

import chromadb
import PyPDF2
from sentence_transformers import SentenceTransformer

def load_document(file_path):
    """Load content from txt, md, or pdf files"""
    path = Path(file_path)
    if path.suffix.lower() == '.pdf':
        with open(file_path, 'rb') as f:
            reader = PyPDF2.PdfReader(f)
            # Guard against pages that yield no text (e.g. scanned images)
            return ' '.join(page.extract_text() or '' for page in reader.pages)
    else:
        with open(file_path, 'r', encoding='utf-8') as f:
            return f.read()

def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks of whitespace-separated words"""
    # Note: all-MiniLM-L6-v2 truncates input longer than 256 word pieces,
    # so a smaller chunk_size (roughly 150 to 200 words) is embedded in full.
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = ' '.join(words[i:i + chunk_size])
        chunks.append(chunk)
    return chunks

# Initialize the embedding model (384-dimensional vectors)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Initialize ChromaDB
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    name="knowledge_base",
    metadata={"hnsw:space": "cosine"}
)

# Process all documents
documents_folder = Path("documents")
total_chunks = 0
for doc_path in sorted(documents_folder.rglob("*")):
    if doc_path.is_file() and doc_path.suffix.lower() in ['.txt', '.md', '.pdf']:
        print(f"Processing: {doc_path}")
        content = load_document(doc_path)
        chunks = chunk_text(content)
        if not chunks:
            continue  # empty file, or a PDF with no extractable text
        # Encode all chunks of one document in a single batch
        embeddings = model.encode(chunks).tolist()
        # IDs built from the path and chunk position stay stable across runs,
        # so upsert overwrites existing entries instead of colliding with them
        collection.upsert(
            embeddings=embeddings,
            documents=chunks,
            ids=[f"{doc_path}::{i}" for i in range(len(chunks))],
            metadatas=[{"source": str(doc_path)} for _ in chunks]
        )
        total_chunks += len(chunks)

print(f"\nIndexed {total_chunks} chunks from your documents")
Gotcha: The model all-MiniLM-L6-v2 downloads ~80MB on first run. It’s cached locally, so subsequent runs skip the download.
Run the indexer: python build_index.py
This creates a chroma_db folder containing your vector database. On my 2GB test corpus (400 markdown files), indexing took 3 minutes on an M2 MacBook.
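To see what the overlap in chunk_text actually does, the sketch below runs the same splitting logic on ten placeholder words with deliberately tiny settings (chunk_size=4, overlap=1). The sample words and the small parameters are assumptions chosen for readability, not values from the tutorial.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping word chunks (same logic as build_index.py)."""
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

# Placeholder words w0..w9 so the overlap is easy to spot
sample = ' '.join(f"w{n}" for n in range(10))
chunks = chunk_text(sample, chunk_size=4, overlap=1)
for n, chunk in enumerate(chunks):
    print(n, chunk)
# 0 w0 w1 w2 w3
# 1 w3 w4 w5 w6
# 2 w6 w7 w8 w9
# 3 w9
```

Each chunk repeats the last word of the previous one, which is the overlap at work. Note the final chunk: the window keeps advancing until it runs out of words, so the tail can be a short fragment that the previous chunk already covers. Filtering out very short trailing chunks is a reasonable refinement.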
Step 4: Build the Query Interface
Now create search_kb.py for semantic search:
from sentence_transformers import SentenceTransformer
import chromadb

model = SentenceTransformer('all-MiniLM-L6-v2')
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection(name="knowledge_base")

def search_knowledge_base(query, n_results=5):
    """Search your knowledge base with natural language"""
    query_embedding = model.encode(query).tolist()
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results
    )
    print(f"\n🔍 Results for: '{query}'\n")
    for i, (doc, metadata) in enumerate(zip(
        results['documents'][0],
        results['metadatas'][0]
    ), 1):
        print(f"#{i} — Source: {metadata['source']}")
        print(f"{doc[:300]}...\n")

if __name__ == "__main__":
    # Interactive search loop
    while True:
        query = input("\nAsk your knowledge base (or 'quit'): ")
        if query.lower() == 'quit':
            break
        search_knowledge_base(query)
Test it: python search_kb.py
Try queries like:
- “What were my main insights about transformer architectures?”
- “Meeting notes from Q4 2025 budget discussion”
- “Python code for API rate limiting”
Pro tip: Semantic search understands synonyms and concepts. Searching “ML deployment strategies” will surface notes about “production machine learning” even if they never use the word “deployment.”
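It can help to see how close each hit is, not just its rank. By default, a Chroma query result also carries a distances list, and in a collection configured for cosine space the distance is 1 minus the cosine similarity. The sketch below converts distances to similarity scores. The format_hits helper and the sample result dict are hypothetical, written only to mirror the shape of what collection.query() returns.

```python
def format_hits(results):
    """Turn a Chroma-style query result into (similarity, source, snippet) rows.

    Assumes the collection uses cosine space, where distance = 1 - similarity.
    """
    rows = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    ):
        rows.append((round(1.0 - dist, 3), meta["source"], doc[:80]))
    return rows

# Hypothetical result, shaped like the dict that collection.query() returns
sample_results = {
    "documents": [["Notes on serving models in production...", "Grocery list"]],
    "metadatas": [[{"source": "documents/ml-ops.md"}, {"source": "documents/misc.txt"}]],
    "distances": [[0.21, 0.88]],
}

for similarity, source, snippet in format_hits(sample_results):
    print(f"{similarity:.2f}  {source}  {snippet}")
```

A low score on the top hit is a useful signal that the knowledge base probably has nothing relevant, even though a query always returns its n_results nearest chunks.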
Step 5: Add an AI-Powered Q&A Layer
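Raw search results are useful, but combining them with a local LLM creates a conversational interface. Install Ollama (https://ollama.com) and pull a model. The install script below is for Linux; on macOS or Windows, download the installer from the Ollama site instead: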
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama2:7b
Create chat_kb.py:
from sentence_transformers import SentenceTransformer
import chromadb
import subprocess

model = SentenceTransformer('all-MiniLM-L6-v2')
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection(name="knowledge_base")

def query_with_context(question):
    # Get relevant context from knowledge base
    query_embedding = model.encode(question).tolist()
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=3
    )
    context = "\n\n".join(results['documents'][0])
    # Build prompt
    prompt = f"""You are a helpful assistant answering questions based on the user's personal knowledge base.
Context from knowledge base:
{context}
Question: {question}
Answer based only on the provided context. If the context doesn't contain relevant information, say so."""
    # Query Ollama
    result = subprocess.run(
        ['ollama', 'run', 'llama2:7b', prompt],
        capture_output=True,
        text=True
    )
    return result.stdout

if __name__ == "__main__":
    while True:
        question = input("\nAsk a question (or 'quit'): ")
        if question.lower() == 'quit':
            break
        print("\n🤖 Answer:")
        print(query_with_context(question))
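Gotcha: llama2:7b is a download of roughly 3.8GB, and Ollama’s documentation recommends at least 8GB of RAM for 7B models. For constrained systems, pull a smaller model such as phi (ollama pull phi) and change the model name in chat_kb.py to match.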
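The prompt in chat_kb.py passes only chunk text to the model, so the model has no way of knowing which file a passage came from. One possible refinement, sketched below, labels each chunk with its source and caps the total context length. The build_prompt helper, the [Source: ...] label format, and the max_chars value are all assumptions for illustration, not part of the tutorial's scripts.

```python
def build_prompt(question, documents, sources, max_chars=4000):
    """Assemble a grounded prompt that labels each chunk with its source file.

    max_chars is an illustrative cap on total context length.
    """
    parts = []
    used = 0
    for doc, source in zip(documents, sources):
        block = f"[Source: {source}]\n{doc}"
        if used + len(block) > max_chars:
            break  # stop adding chunks once the cap would be exceeded
        parts.append(block)
        used += len(block)
    context = "\n\n".join(parts)
    return (
        "You are a helpful assistant answering questions based on the user's "
        "personal knowledge base.\n\n"
        f"Context from knowledge base:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer based only on the provided context and name the source files "
        "you used. If the context doesn't contain relevant information, say so."
    )

prompt = build_prompt(
    "What did I decide about caching?",
    ["Decided to cache API responses for 5 minutes."],
    ["documents/meeting-notes.md"],
)
print(prompt)
```

Inside query_with_context, the arguments would be results['documents'][0] and a list built from the 'source' field of each entry in results['metadatas'][0].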
Step 6: Automate Knowledge Base Updates
Create a cron job or scheduled task to re-index nightly:
#!/bin/bash
# update_kb.sh
cd /path/to/ai-knowledge-base
source venv/bin/activate
python build_index.py
Make executable: chmod +x update_kb.sh
Add to crontab (crontab -e):
0 2 * * * /path/to/ai-knowledge-base/update_kb.sh
This runs at 2 AM daily. Note that build_index.py re-reads and re-embeds every document on each run; it does not detect which files are new or modified, so expect run time to grow with the corpus.
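If nightly runs become slow, one way to skip unchanged files is to keep a small manifest of content hashes and re-index only files whose hash differs. The following is a minimal sketch using only the standard library. The function names and the manifest file are hypothetical, and wiring the result into build_index.py is left to the reader.

```python
import hashlib
import json
from pathlib import Path

def file_digest(path):
    """SHA-256 of a file's bytes, used to detect content changes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def find_changed_files(folder, manifest_path, extensions=(".txt", ".md", ".pdf")):
    """Compare current file hashes to the saved manifest.

    Returns (changed_files, new_manifest). Files missing from the old
    manifest count as changed.
    """
    manifest_file = Path(manifest_path)
    old = json.loads(manifest_file.read_text()) if manifest_file.exists() else {}
    new = {}
    changed = []
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file() and path.suffix.lower() in extensions:
            digest = file_digest(path)
            new[str(path)] = digest
            if old.get(str(path)) != digest:
                changed.append(path)
    return changed, new

def save_manifest(manifest, manifest_path):
    """Persist the hash manifest after a successful indexing run."""
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
```

A typical run would call find_changed_files("documents", "index_manifest.json"), index only the returned paths, then call save_manifest with the new manifest. Note that this sketch does not remove chunks belonging to deleted files; that would need a separate cleanup step against the collection.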
Practical Example: Complete Workflow
Let’s index a real research folder and query it:
# 1. Add documents
cp ~/research-papers/*.pdf documents/
cp ~/meeting-notes/*.md documents/
# 2. Build index
python build_index.py
# Output: Indexed 247 chunks from your documents
# 3. Search
python search_kb.py
# Query: "transformer attention mechanism alternatives"
# Returns: Notes from "efficient-transformers-survey.pdf"
# and "linear-attention-experiments.md"
# 4. Conversational Q&A
python chat_kb.py
# Question: "What are the tradeoffs between linear attention and standard attention?"
# AI Answer: "Based on your notes from efficient-transformers-survey.pdf,
# linear attention reduces complexity from O(n²) to O(n) but sacrifices
# some representation power..."
The entire workflow—from scattered PDFs to an intelligent Q&A system—takes under 30 minutes.
Key Takeaways
- Embeddings enable semantic search without exact keyword matches—the model understands “budget planning” relates to “fiscal strategy”
- Local-first AI (SentenceTransformers + Ollama) means zero cloud costs and complete privacy—your knowledge never leaves your machine
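- Chunking strategy matters: overlapping chunks (this tutorial uses 500 words with a 50-word overlap) trade context preservation against retrieval precision. Because all-MiniLM-L6-v2 truncates input beyond 256 word pieces, smaller chunks are worth testing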
- Scheduled re-indexing keeps your knowledge base current: drop new files in documents/ and re-run the indexer
What’s Next
Once your knowledge base is running, explore adding web scraping to automatically ingest bookmarked articles, or connect it to a Slack bot for team-wide knowledge sharing.
Key Takeaway: You can create a searchable, AI-powered knowledge base using Python, local embeddings, and a local LLM that transforms scattered notes into an intelligent second brain accessible through natural language queries.
New AI tutorials published daily on AtlasSignal. Follow @AtlasSignalDesk for more.