Powered by Truth SI™

🧬 GENESIS LOCAL CORPUS INGESTION

Learn from your own codebase - Self-improvement through self-study

Genesis can now ingest the local Truth.SI codebase and learn from it, so the code it generates builds on your own patterns and implementations.


🎯 WHAT IS LOCAL CORPUS INGESTION?

Local corpus ingestion allows Genesis to:

  1. Scan the local codebase - Read all Python files in the project
  2. Extract code patterns - Identify reusable functions, classes, and patterns
  3. Store in Weaviate - Make patterns searchable via semantic search
  4. Learn your style - Understand your coding conventions and preferences
  5. Reuse patterns - Apply existing patterns when generating new code

The Result: Genesis writes code that looks and feels like YOUR code, not generic templates.


🚀 QUICK START

Option 1: API Endpoint

# Ingest entire api/ directory
curl -X POST "http://35.162.205.215:8000/api/v1/genesis/ingest/local" \
  -H "Content-Type: application/json" \
  -d '{}'

# Ingest specific directory
curl -X POST "http://35.162.205.215:8000/api/v1/genesis/ingest/local" \
  -H "Content-Type: application/json" \
  -d '{
    "root_path": "api/lib"
  }'

Response:

{
  "success": true,
  "files_scanned": 156,
  "patterns_extracted": 1247,
  "patterns_ingested": 1247,
  "status": "complete",
  "error": null
}
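The same call can be made from Python. The sketch below uses only the standard library together with the URL and payload shown in the curl examples. The helper names build_ingest_request and run_ingest are illustrative and not part of Genesis, and the 120-second timeout is an assumption.

```python
import json
import urllib.request

# Base URL taken from the curl examples; adjust for your deployment.
BASE_URL = "http://35.162.205.215:8000"


def build_ingest_request(root_path=None, base_url=BASE_URL):
    """Build (but do not send) the POST request for local ingestion."""
    # An empty JSON object lets the server fall back to its default root.
    payload = {} if root_path is None else {"root_path": root_path}
    return urllib.request.Request(
        url=f"{base_url}/api/v1/genesis/ingest/local",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_ingest(root_path=None, timeout=120):
    """Send the request and return the decoded JSON response (hypothetical helper)."""
    request = build_ingest_request(root_path)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling run_ingest("api/lib") should return a dictionary shaped like the response above, provided the server is reachable.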

Option 2: Genesis Ingestion Daemon

# Run local ingestion only
python3 scripts/genesis-ingestion-daemon.py --source local --once

# Run all ingestion sources (GitHub, StackOverflow, Docs, Local)
python3 scripts/genesis-ingestion-daemon.py --once

Option 3: Test Script

# Test the local corpus builder
python3 scripts/test-local-corpus-builder.py

📊 WHAT GETS EXTRACTED?

The local corpus builder extracts the following code patterns:

| Pattern Type | Description | Example |
| --- | --- | --- |
| Functions | Regular functions with signatures | def calculate_metrics(data: dict) -> float |
| Async Functions | Async/await functions | async def fetch_data(url: str) -> dict |
| Classes | Classes with methods | class UserManager |
| Decorators | Function decorators | @router.post("/endpoint") |
| Module Patterns | Module-level constants and configs | Configuration patterns |

Extracted Metadata

For each pattern, metadata such as arguments, return annotations, and docstrings is captured, as listed in step 2 of the architecture flow below.


🏗️ ARCHITECTURE

┌─────────────────────────────────────────────────────────────┐
│              LOCAL CORPUS INGESTION FLOW                     │
├─────────────────────────────────────────────────────────────┤
│                                                               │
│  1. SCAN CODEBASE                                            │
│     ├─ Walk directory tree                                   │
│     ├─ Find all *.py files                                   │
│     ├─ Read file content                                     │
│     └─ Calculate file hash                                   │
│                                                               │
│  2. EXTRACT PATTERNS                                         │
│     ├─ Parse Python AST                                      │
│     ├─ Extract functions (def)                               │
│     ├─ Extract async functions (async def)                   │
│     ├─ Extract classes (class)                               │
│     ├─ Extract decorators (@)                                │
│     └─ Capture metadata (args, returns, docstrings)          │
│                                                               │
│  3. BUILD EMBEDDINGS                                         │
│     ├─ Format pattern as searchable text                     │
│     ├─ Generate embedding via Weaviate                       │
│     └─ Store in GenesisCorpus collection                     │
│                                                               │
│  4. WIRE TO GENESIS                                          │
│     ├─ Publish event to RedPanda                             │
│     ├─ Notify genesis-ingestion-daemon                       │
│     └─ Update Genesis knowledge base                         │
│                                                               │
└─────────────────────────────────────────────────────────────┘
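Step 3 begins by formatting each pattern as searchable text before it is embedded. The exact format Genesis uses is not documented here, so the following is only a sketch under assumed field names (type, name, args, returns, docstring); format_pattern is a hypothetical helper.

```python
def format_pattern(pattern: dict) -> str:
    """Flatten a pattern record into one text blob suitable for embedding."""
    # Lead with the kind and name so searches on either term can match.
    parts = [f"{pattern['type']} {pattern['name']}"]
    if pattern.get("args"):
        parts.append("args: " + ", ".join(pattern["args"]))
    if pattern.get("returns"):
        parts.append(f"returns: {pattern['returns']}")
    if pattern.get("docstring"):
        parts.append(pattern["docstring"])
    return "\n".join(parts)
```

Keeping the signature and docstring in the same text means a natural-language query can match on either the name or the described behavior.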

🔧 API REFERENCE

POST /api/v1/genesis/ingest/local

Ingest local codebase into Genesis knowledge base.

Request (root_path is optional and defaults to "api/"):

{
  "root_path": "api/lib"
}

Response:

{
  "success": true,
  "files_scanned": 156,
  "patterns_extracted": 1247,
  "patterns_ingested": 1247,
  "status": "complete",
  "error": null
}

Status Values:

  - "complete" - All patterns ingested successfully
  - "partial" - Some patterns failed to ingest
  - "failed" - Ingestion failed completely

Error Handling:

  - If Weaviate is unavailable, returns success: false with error details
  - If parsing fails on a file, logs warning and continues
  - If a pattern fails to ingest, logs error and continues with others
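The status values and the continue-on-error behavior described above can be sketched as follows. This is an illustration, not the service's implementation: ingest_patterns, the store callback, and the errors field are hypothetical, and treating an empty pattern list as "complete" is an assumption.

```python
def ingest_patterns(patterns, store):
    """Try to store every pattern; keep going when a single one fails."""
    ingested = 0
    errors = []
    for pattern in patterns:
        try:
            store(pattern)
            ingested += 1
        except Exception as exc:
            # Record the failure and continue with the remaining patterns.
            errors.append(f"{pattern.get('name', '?')}: {exc}")

    if ingested == len(patterns):
        status = "complete"  # also covers the empty case (assumption)
    elif ingested > 0:
        status = "partial"
    else:
        status = "failed"

    return {
        "patterns_extracted": len(patterns),
        "patterns_ingested": ingested,
        "status": status,
        "errors": errors,
    }
```

A caller can compare patterns_extracted with patterns_ingested to see how many patterns were dropped when the status is "partial".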


📈 USAGE EXAMPLES

Example 1: Full Codebase Ingestion

Ingest everything under the default api/ root of the Truth.SI codebase:

curl -X POST "http://35.162.205.215:8000/api/v1/genesis/ingest/local" \
  -H "Content-Type: application/json" \
  -d '{}'

What this does:

  1. Scans all Python files in the api/ directory
  2. Extracts ~1000-2000 code patterns
  3. Stores them in Weaviate
  4. Takes ~30-60 seconds depending on codebase size

Example 2: Specific Module Ingestion

Ingest only the Genesis module:

curl -X POST "http://35.162.205.215:8000/api/v1/genesis/ingest/local" \
  -H "Content-Type: application/json" \
  -d '{
    "root_path": "api/lib/genesis"
  }'

What this does:

  1. Scans only api/lib/genesis/*.py files
  2. Extracts patterns from Genesis module
  3. Useful for focused learning

Example 3: Python Script Integration

import asyncio
from api.lib.genesis.local_corpus_builder import LocalCorpusBuilder

async def ingest_local_code():
    # Initialize builder
    builder = LocalCorpusBuilder(root_path="api/")

    # Run full ingestion
    result = await builder.build_local_corpus()

    print(f"Scanned: {result['files_scanned']} files")
    print(f"Extracted: {result['patterns_extracted']} patterns")
    print(f"Ingested: {result['patterns_ingested']} patterns")

asyncio.run(ingest_local_code())

🎯 BENEFITS

1. Consistent Code Style

Genesis learns YOUR coding style:

  - Naming conventions (snake_case, camelCase, etc.)
  - Error handling patterns
  - Logging patterns
  - Comment style
  - Type annotation preferences

2. Reusable Patterns

When Genesis needs to implement something, it first searches for:

  - Similar functions in your codebase
  - Existing error handling patterns
  - Database connection patterns
  - API endpoint patterns

3. Domain Knowledge

Genesis learns domain-specific patterns:

  - Truth.SI-specific utilities
  - Custom decorators and middleware
  - Business logic patterns
  - Integration patterns

4. Faster Code Generation

Instead of generating from scratch, Genesis:

  - Finds similar existing code
  - Adapts existing patterns
  - Maintains consistency with codebase


🔍 VERIFICATION

After ingestion, verify patterns are available:

1. Check Corpus Stats

curl "http://35.162.205.215:8000/api/v1/genesis/corpus/stats"

Look for a non-zero "truthsi" count under "sources"; local patterns are reported there:

{
  "total_snippets": 2500,
  "sources": {
    "github": 1000,
    "stackoverflow": 500,
    "documentation": 0,
    "truthsi": 1000
  }
}
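This check can be automated. The sketch below inspects a stats payload shaped like the one above; check_stats is a hypothetical helper, and the rule that total_snippets equals the sum of the per-source counts is an assumption inferred from the sample numbers rather than a documented guarantee.

```python
def check_stats(stats: dict) -> list[str]:
    """Return a list of problems found in a corpus stats payload."""
    problems = []
    sources = stats.get("sources", {})

    # Local patterns are reported under the "truthsi" source.
    if sources.get("truthsi", 0) <= 0:
        problems.append("no local patterns under sources.truthsi")

    # Assumption: the total is the sum of all per-source counts.
    if sum(sources.values()) != stats.get("total_snippets"):
        problems.append("total_snippets does not match the sum of sources")

    return problems
```

An empty list means the payload passed both checks; otherwise each string names one problem.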

2. Search for Local Patterns

curl -X POST "http://35.162.205.215:8000/api/v1/genesis/chat" \
  -H "Content-Type: application/json" \
  -d '{
    "conversation_id": "test-local",
    "message": "Find code patterns for database connections in the local codebase"
  }'

Genesis will search Weaviate and return local patterns.


🚨 IMPORTANT NOTES

What Gets Ingested

✅ Included:

  - All .py files in specified directory
  - Functions with docstrings
  - Classes with methods
  - Type annotations
  - Decorators

❌ Excluded:

  - __pycache__ directories
  - .git directory
  - venv / node_modules
  - .pytest_cache
  - migrations directory
  - Files over 1MB (configurable)

Privacy & Security

Local patterns stay local:

  - Only stored in YOUR Weaviate instance
  - Not shared with external services
  - Not sent to GitHub/StackOverflow
  - Complete privacy

Security:

  - Sensitive files (.env, secrets.json) are excluded
  - No credentials are extracted
  - Only code patterns are stored
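The inclusion and exclusion rules in this section can be expressed as a small file filter. The sketch below is an assumption about how such a filter might look, not the builder's actual code: should_ingest and iter_candidate_files are hypothetical names, and reading "1MB" as 1,000,000 bytes is a guess at the configurable default.

```python
from pathlib import Path

# Directory names and sensitive files taken from the lists above.
EXCLUDED_DIRS = {"__pycache__", ".git", "venv", "node_modules", ".pytest_cache", "migrations"}
SENSITIVE_NAMES = {".env", "secrets.json"}
MAX_FILE_BYTES = 1_000_000  # assumed reading of the "1MB" limit


def should_ingest(path: Path, size_bytes: int) -> bool:
    """Decide whether a file qualifies for ingestion."""
    if path.name in SENSITIVE_NAMES:
        return False
    if path.suffix != ".py":
        return False
    # Reject the file if any parent directory is on the exclusion list.
    if any(part in EXCLUDED_DIRS for part in path.parts[:-1]):
        return False
    return size_bytes <= MAX_FILE_BYTES


def iter_candidate_files(root: str):
    """Yield qualifying .py files under root in a stable order."""
    for path in sorted(Path(root).rglob("*.py")):
        if path.is_file() and should_ingest(path, path.stat().st_size):
            yield path
```

For large trees, pruning excluded directories during the walk would be faster than filtering afterwards; the version here favors readability.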


🔄 CONTINUOUS LEARNING

Manual Re-ingestion

Re-ingest after making significant changes:

# Re-scan the codebase
curl -X POST "http://35.162.205.215:8000/api/v1/genesis/ingest/local" \
  -H "Content-Type: application/json" \
  -d '{}'

Automated Re-ingestion

The genesis-ingestion-daemon can run on a schedule:

# Run daemon with local ingestion (every 60 minutes by default)
python3 scripts/genesis-ingestion-daemon.py

Cron Job:

# Re-ingest local code daily at 2 AM
0 2 * * * cd /path/to/truth-si-dev-env && python3 scripts/genesis-ingestion-daemon.py --source local --once
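The architecture flow lists a file hash step, although this guide does not say how the hash is used. One plausible use, sketched here purely as an assumption, is to skip scheduled re-ingestion when no file has changed since the last run. SHA-256 and the helper names snapshot and changed_paths are illustrative choices.

```python
import hashlib


def snapshot(files: dict[str, bytes]) -> dict[str, str]:
    """Map each path to the SHA-256 hex digest of its content."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


def changed_paths(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Paths that are new, modified, or deleted between two snapshots."""
    all_paths = set(previous) | set(current)
    # A missing entry yields None, which never equals a digest string.
    return sorted(p for p in all_paths if previous.get(p) != current.get(p))
```

If changed_paths returns an empty list, a scheduled job could exit early instead of calling the ingestion endpoint.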

🛠️ TROUBLESHOOTING

Issue: "Local corpus builder not available"

Cause: Import error in LocalCorpusBuilder

Fix:

# Test the import
python3 -c "from api.lib.genesis.local_corpus_builder import LocalCorpusBuilder; print('OK')"

# If it fails, check dependencies
pip install -r requirements.txt

Issue: "Weaviate not connected"

Cause: Weaviate is not running or not accessible

Fix:

# Check Weaviate is running
ssh genesis "docker ps | grep weaviate"

# Check tunnel
./scripts/forge-tunnel.sh status

# Restart if needed
./scripts/forge-tunnel.sh restart

Issue: "No patterns extracted"

Cause: Files have syntax errors or no extractable patterns

Fix:

# Run the test script and review its log output
python3 scripts/test-local-corpus-builder.py

# Check for Python syntax errors
python3 -m py_compile api/lib/genesis/*.py

| Document | Purpose |
| --- | --- |
| docs/GENESIS_USAGE_GUIDE.md | Complete Genesis documentation |
| docs/genesis/INGESTION_DAEMON.md | Daemon configuration |
| api/lib/genesis/local_corpus_builder.py | Source code (442 LOC) |
| scripts/test-local-corpus-builder.py | Test script |

🎓 BEST PRACTICES

1. Regular Re-ingestion

Re-ingest after:

  - Adding new features
  - Refactoring code
  - Updating coding standards
  - Major code changes

2. Targeted Ingestion

For faster ingestion, target specific modules:

# Only ingest authentication code
curl -X POST ".../ingest/local" \
  -H "Content-Type: application/json" \
  -d '{"root_path": "api/routers/auth.py"}'

# Only ingest database utilities
curl -X POST ".../ingest/local" \
  -H "Content-Type: application/json" \
  -d '{"root_path": "api/lib/database"}'

3. Monitor Corpus Size

Check corpus stats regularly:

curl "http://35.162.205.215:8000/api/v1/genesis/corpus/stats"

Keep an eye on storage growth.

4. Verify Pattern Quality

After ingestion, test Genesis with local patterns:

curl -X POST ".../chat" \
  -H "Content-Type: application/json" \
  -d '{
  "message": "Create a function similar to calculate_metrics in the codebase"
}'

Verify Genesis uses local patterns.


🎉 SUCCESS METRICS

After successful ingestion, you should see:

✅ Files scanned: ~150-200 (depending on codebase size)
✅ Patterns extracted: ~1000-2000
✅ Patterns ingested: ~1000-2000
✅ Source breakdown shows "truthsi" entries
✅ Genesis uses local patterns when generating code


Built with 100,000,000,000,000% by THE ARCHITECT - Session 318

Genesis - Learning from the best... YOU.