🧬 GENESIS LOCAL CORPUS INGESTION
Learn from your own codebase - Self-improvement through self-study
Genesis can now ingest and learn from the local Truth.SI codebase, enabling true self-coding capabilities based on your own patterns and implementations.
🎯 WHAT IS LOCAL CORPUS INGESTION?
Local corpus ingestion allows Genesis to:
- Scan the local codebase - Read all Python files in the project
- Extract code patterns - Identify reusable functions, classes, and patterns
- Store in Weaviate - Make patterns searchable via semantic search
- Learn your style - Understand your coding conventions and preferences
- Reuse patterns - Apply existing patterns when generating new code
The Result: Genesis writes code that looks and feels like YOUR code, not generic templates.
🚀 QUICK START
Option 1: API Endpoint (Recommended)
# Ingest entire api/ directory
curl -X POST "http://35.162.205.215:8000/api/v1/genesis/ingest/local" \
-H "Content-Type: application/json" \
-d '{}'
# Ingest specific directory
curl -X POST "http://35.162.205.215:8000/api/v1/genesis/ingest/local" \
-H "Content-Type: application/json" \
-d '{
"root_path": "api/lib"
}'
Response:
{
"success": true,
"files_scanned": 156,
"patterns_extracted": 1247,
"patterns_ingested": 1247,
"status": "complete",
"error": null
}
Option 2: Genesis Ingestion Daemon
# Run local ingestion only
python3 scripts/genesis-ingestion-daemon.py --source local --once
# Run all ingestion sources (GitHub, StackOverflow, Docs, Local)
python3 scripts/genesis-ingestion-daemon.py --once
Option 3: Test Script
# Test the local corpus builder
python3 scripts/test-local-corpus-builder.py
📊 WHAT GETS EXTRACTED?
The local corpus builder extracts the following code patterns:
| Pattern Type | Description | Example |
|---|---|---|
| Functions | Regular functions with signatures | def calculate_metrics(data: dict) -> float |
| Async Functions | Async/await functions | async def fetch_data(url: str) -> dict |
| Classes | Classes with methods | class UserManager |
| Decorators | Function decorators | @router.post("/endpoint") |
| Module Patterns | Module-level constants and configs | Configuration patterns |
Extracted Metadata
For each pattern, the following metadata is captured:
- Name - Function/class name
- Arguments - Function parameters with types
- Return Type - Return type annotation
- Decorators - Applied decorators
- Docstring - Documentation
- File Path - Location in codebase
- Line Count - Size of pattern
- Code Hash - Unique identifier for deduplication
🏗️ ARCHITECTURE
┌─────────────────────────────────────────────────────────────┐
│ LOCAL CORPUS INGESTION FLOW │
├─────────────────────────────────────────────────────────────┤
│ │
│ 1. SCAN CODEBASE │
│ ├─ Walk directory tree │
│ ├─ Find all *.py files │
│ ├─ Read file content │
│ └─ Calculate file hash │
│ │
│ 2. EXTRACT PATTERNS │
│ ├─ Parse Python AST │
│ ├─ Extract functions (def) │
│ ├─ Extract async functions (async def) │
│ ├─ Extract classes (class) │
│ ├─ Extract decorators (@) │
│ └─ Capture metadata (args, returns, docstrings) │
│ │
│ 3. BUILD EMBEDDINGS │
│ ├─ Format pattern as searchable text │
│ ├─ Generate embedding via Weaviate │
│ └─ Store in GenesisCorpus collection │
│ │
│ 4. WIRE TO GENESIS │
│ ├─ Publish event to RedPanda │
│ ├─ Notify genesis-ingestion-daemon │
│ └─ Update Genesis knowledge base │
│ │
└─────────────────────────────────────────────────────────────┘
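Step 3 of the flow mentions formatting each pattern as searchable text before it is embedded. The exact format the builder uses is not documented here, so the sketch below is only one plausible shape: `format_pattern_text` and the dictionary keys it reads are hypothetical names chosen for illustration.

```python
def format_pattern_text(pattern: dict) -> str:
    """Render a pattern's metadata as plain text suitable for embedding."""
    parts = [f"@{decorator}" for decorator in pattern.get("decorators", [])]

    # Build a signature line such as "calculate_metrics(data: dict) -> float".
    args = ", ".join(pattern.get("arguments", []))
    signature = f"{pattern['name']}({args})"
    if pattern.get("return_type"):
        signature += f" -> {pattern['return_type']}"
    parts.append(signature)

    if pattern.get("docstring"):
        parts.append(pattern["docstring"])
    parts.append(f"file: {pattern.get('file_path', 'unknown')}")
    return "\n".join(parts)
```

Putting the signature and docstring into the text is a design choice: a semantic query such as "database connection helper" is more likely to match natural-language docstrings than raw code.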
🔧 API REFERENCE
POST /api/v1/genesis/ingest/local
Ingest local codebase into Genesis knowledge base.
Request:
{
"root_path": "api/lib" // Optional: defaults to "api/"
}
Response:
{
"success": true,
"files_scanned": 156,
"patterns_extracted": 1247,
"patterns_ingested": 1247,
"status": "complete",
"error": null
}
Status Values:
- "complete" - All patterns ingested successfully
- "partial" - Some patterns failed to ingest
- "failed" - Ingestion failed completely
Error Handling:
- If Weaviate is unavailable, returns success: false with error details
- If parsing fails on a file, logs warning and continues
- If a pattern fails to ingest, logs error and continues with others
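A caller may want to turn the response fields and status values above into a single readable message. The sketch below uses only the Python standard library. The helper names `post_ingest` and `summarize_ingest` are hypothetical, and it assumes that a "partial" result still reports `success: true`, which the reference above does not state.

```python
import json
import urllib.request
from typing import Optional


def post_ingest(base_url: str, root_path: Optional[str] = None, timeout: float = 120.0) -> dict:
    """POST to the ingestion endpoint and return the decoded JSON body."""
    payload = {} if root_path is None else {"root_path": root_path}
    request = urllib.request.Request(
        f"{base_url}/api/v1/genesis/ingest/local",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def summarize_ingest(result: dict) -> str:
    """Describe an ingestion response using the documented fields."""
    if not result.get("success"):
        return f"failed: {result.get('error') or 'unknown error'}"
    extracted = result.get("patterns_extracted", 0)
    ingested = result.get("patterns_ingested", 0)
    status = result.get("status")
    if status == "partial":
        # Assumption: the gap between extracted and ingested counts the failures.
        return f"partial: {extracted - ingested} of {extracted} patterns were not ingested"
    return f"{status}: {ingested} patterns from {result.get('files_scanned', 0)} files"
```

For example, `print(summarize_ingest(post_ingest("http://35.162.205.215:8000", "api/lib")))` would run an ingestion of `api/lib` and print the outcome.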
📈 USAGE EXAMPLES
Example 1: Full Codebase Ingestion
Ingest the entire Truth.SI codebase:
curl -X POST "http://35.162.205.215:8000/api/v1/genesis/ingest/local" \
-H "Content-Type: application/json" \
-d '{}'
What this does:
1. Scans all Python files in api/ directory
2. Extracts ~1000-2000 code patterns
3. Stores them in Weaviate
4. Takes ~30-60 seconds depending on codebase size
Example 2: Specific Module Ingestion
Ingest only the Genesis module:
curl -X POST "http://35.162.205.215:8000/api/v1/genesis/ingest/local" \
-H "Content-Type: application/json" \
-d '{
"root_path": "api/lib/genesis"
}'
What this does:
1. Scans only api/lib/genesis/*.py files
2. Extracts patterns from Genesis module
3. Useful for focused learning
Example 3: Python Script Integration
import asyncio

from api.lib.genesis.local_corpus_builder import LocalCorpusBuilder


async def ingest_local_code():
    # Initialize builder
    builder = LocalCorpusBuilder(root_path="api/")

    # Run full ingestion
    result = await builder.build_local_corpus()

    print(f"Scanned: {result['files_scanned']} files")
    print(f"Extracted: {result['patterns_extracted']} patterns")
    print(f"Ingested: {result['patterns_ingested']} patterns")


asyncio.run(ingest_local_code())
🎯 BENEFITS
1. Consistent Code Style
Genesis learns YOUR coding style:
- Naming conventions (snake_case, camelCase, etc.)
- Error handling patterns
- Logging patterns
- Comment style
- Type annotation preferences
2. Reusable Patterns
When Genesis needs to implement something, it first searches for:
- Similar functions in your codebase
- Existing error handling patterns
- Database connection patterns
- API endpoint patterns
3. Domain Knowledge
Genesis learns domain-specific patterns:
- Truth.SI-specific utilities
- Custom decorators and middleware
- Business logic patterns
- Integration patterns
4. Faster Code Generation
Instead of generating from scratch, Genesis:
- Finds similar existing code
- Adapts existing patterns
- Maintains consistency with codebase
🔍 VERIFICATION
After ingestion, verify patterns are available:
1. Check Corpus Stats
curl "http://35.162.205.215:8000/api/v1/genesis/corpus/stats"
Look for:
{
"total_snippets": 2500,
"sources": {
"github": 1000,
"stackoverflow": 500,
"documentation": 0,
"truthsi": 1000 // <-- Local patterns show here
}
}
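The same check can be scripted once the stats response has been parsed into a dictionary. This is a small sketch that relies only on the `total_snippets` and `sources` fields shown above; the helper name `local_pattern_share` is hypothetical.

```python
def local_pattern_share(stats: dict, source: str = "truthsi") -> float:
    """Return the fraction of corpus snippets that came from the given source."""
    total = stats.get("total_snippets", 0)
    local = stats.get("sources", {}).get(source, 0)
    # Guard against an empty corpus to avoid dividing by zero.
    return local / total if total else 0.0
```

A result of 0.0 after an ingestion run suggests that no local patterns reached the corpus.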
2. Search for Local Patterns
curl -X POST "http://35.162.205.215:8000/api/v1/genesis/chat" \
-H "Content-Type: application/json" \
-d '{
"conversation_id": "test-local",
"message": "Find code patterns for database connections in the local codebase"
}'
Genesis will search Weaviate and return local patterns.
🚨 IMPORTANT NOTES
What Gets Ingested
✅ Included:
- All .py files in specified directory
- Functions with docstrings
- Classes with methods
- Type annotations
- Decorators
❌ Excluded:
- __pycache__ directories
- .git directory
- venv / node_modules
- .pytest_cache
- migrations directory
- Files over 1MB (configurable)
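The inclusion and exclusion rules above can be pictured as a directory walk that prunes excluded folders before descending into them. The sketch below is an illustration under stated assumptions rather than the builder's real code: `scan_python_files`, `EXCLUDED_DIRS`, and `MAX_FILE_BYTES` are hypothetical names, "1MB" is read as 1,000,000 bytes, and SHA-256 stands in for whichever file hash the builder actually uses.

```python
import hashlib
import os
from pathlib import Path

# Directory names taken from the exclusion list above.
EXCLUDED_DIRS = {"__pycache__", ".git", "venv", "node_modules", ".pytest_cache", "migrations"}
MAX_FILE_BYTES = 1_000_000  # assumed interpretation of the 1MB limit


def scan_python_files(root: str, max_bytes: int = MAX_FILE_BYTES) -> dict[str, str]:
    """Map each eligible .py file under root to the SHA-256 of its content."""
    hashes = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into excluded directories.
        dirnames[:] = [d for d in dirnames if d not in EXCLUDED_DIRS]
        for filename in filenames:
            path = Path(dirpath) / filename
            if path.suffix != ".py" or path.stat().st_size > max_bytes:
                continue
            hashes[str(path)] = hashlib.sha256(path.read_bytes()).hexdigest()
    return hashes
```

Because only files with a `.py` suffix pass the filter, files such as `.env` or `secrets.json` are skipped without needing a separate rule.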
Privacy & Security
Local patterns stay local:
- Only stored in YOUR Weaviate instance
- Not shared with external services
- Not sent to GitHub/StackOverflow
- Complete privacy
Security:
- Sensitive files (.env, secrets.json) are excluded
- No credentials are extracted
- Only code patterns are stored
🔄 CONTINUOUS LEARNING
Manual Re-ingestion
Re-ingest after making significant changes:
# Re-scan the codebase
curl -X POST "http://35.162.205.215:8000/api/v1/genesis/ingest/local" \
-H "Content-Type: application/json" \
-d '{}'
Automated Re-ingestion
The genesis-ingestion-daemon can run on a schedule:
# Run daemon with local ingestion (every 60 minutes by default)
python3 scripts/genesis-ingestion-daemon.py
Cron Job:
# Re-ingest local code daily at 2 AM
0 2 * * * cd /path/to/truth-si-dev-env && python3 scripts/genesis-ingestion-daemon.py --source local --once
🛠️ TROUBLESHOOTING
Issue: "Local corpus builder not available"
Cause: Import error in LocalCorpusBuilder
Fix:
# Test the import
python3 -c "from api.lib.genesis.local_corpus_builder import LocalCorpusBuilder; print('OK')"
# If it fails, check dependencies
pip install -r requirements.txt
Issue: "Weaviate not connected"
Cause: Weaviate is not running or not accessible
Fix:
# Check Weaviate is running
ssh genesis "docker ps | grep weaviate"
# Check tunnel
./scripts/forge-tunnel.sh status
# Restart if needed
./scripts/forge-tunnel.sh restart
Issue: "No patterns extracted"
Cause: Files have syntax errors or no extractable patterns
Fix:
# Run the test script and review its log output
python3 scripts/test-local-corpus-builder.py
# Check for Python syntax errors
python3 -m py_compile api/lib/genesis/*.py
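To find which files fail to parse across a whole directory tree, a short script built on the standard `ast` module can list them. This is a sketch, and `find_unparseable` is a hypothetical helper name, not part of the Genesis codebase.

```python
import ast
from pathlib import Path


def find_unparseable(root: str) -> dict[str, str]:
    """Return {file path: error message} for every .py file that fails to parse."""
    errors = {}
    for path in sorted(Path(root).rglob("*.py")):
        try:
            ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except (SyntaxError, ValueError) as exc:
            # ValueError also covers undecodable bytes (UnicodeDecodeError).
            errors[str(path)] = f"{type(exc).__name__}: {exc}"
    return errors
```

Running `find_unparseable("api/lib/genesis")` and fixing the reported files is a reasonable first step when an ingestion run extracts no patterns.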
📚 RELATED DOCUMENTATION
| Document | Purpose |
|---|---|
| docs/GENESIS_USAGE_GUIDE.md | Complete Genesis documentation |
| docs/genesis/INGESTION_DAEMON.md | Daemon configuration |
| api/lib/genesis/local_corpus_builder.py | Source code (442 LOC) |
| scripts/test-local-corpus-builder.py | Test script |
🎓 BEST PRACTICES
1. Regular Re-ingestion
Re-ingest after:
- Adding new features
- Refactoring code
- Updating coding standards
- Major code changes
2. Targeted Ingestion
For faster ingestion, target specific modules:
# Only ingest authentication code
curl -X POST ".../ingest/local" \
-H "Content-Type: application/json" \
-d '{"root_path": "api/routers/auth.py"}'
# Only ingest database utilities
curl -X POST ".../ingest/local" \
-H "Content-Type: application/json" \
-d '{"root_path": "api/lib/database"}'
3. Monitor Corpus Size
Check corpus stats regularly:
curl "http://35.162.205.215:8000/api/v1/genesis/corpus/stats"
Keep an eye on storage growth.
4. Verify Pattern Quality
After ingestion, test Genesis with local patterns:
curl -X POST ".../chat" \
-H "Content-Type: application/json" \
-d '{
"message": "Create a function similar to calculate_metrics in the codebase"
}'
Verify Genesis uses local patterns.
🎉 SUCCESS METRICS
After successful ingestion, you should see:
✅ Files scanned: ~150-200 (depending on codebase size)
✅ Patterns extracted: ~1000-2000
✅ Patterns ingested: ~1000-2000
✅ Source breakdown shows "truthsi" entries
✅ Genesis uses local patterns when generating code
Built with 100,000,000,000,000% by THE ARCHITECT - Session 318
Genesis - Learning from the best... YOU.