Powered by Truth SI™

🧬 Genesis Ingestion Daemon

Part of Genesis Protocol Phase 1: Data Collection

Overview

The Genesis Ingestion Daemon continuously collects world-class code examples from across the internet to train the Genesis self-coding system. It ingests code from:

  1. GitHub: Top Python/FastAPI repositories
  2. StackOverflow: Best Python answers with high scores
  3. Documentation: Official Python, FastAPI, PyTorch, TensorFlow docs
  4. Truth.SI: Our own codebase patterns (future enhancement)

Architecture

┌─────────────────────────────────────────────────────────────┐
│                 GENESIS INGESTION DAEMON                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │   GitHub     │  │ StackOverflow│  │    Docs      │       │
│  │  Ingester    │  │   Ingester   │  │  Ingester    │       │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘       │
│         │                 │                 │               │
│         └─────────────────┴─────────────────┘               │
│                           │                                 │
│                  ┌────────▼────────┐                        │
│                  │ GenesisStorage  │                        │
│                  └────────┬────────┘                        │
│                           │                                 │
│         ┌─────────────────┼─────────────────┐               │
│         │                 │                 │               │
│    ┌────▼─────┐    ┌──────▼──────┐   ┌──────▼──────┐        │
│    │ Weaviate │    │    Neo4j    │   │  RedPanda   │        │
│    │(Vectors) │    │ (Knowledge) │   │  (Events)   │        │
│    └──────────┘    └─────────────┘   └─────────────┘        │
└─────────────────────────────────────────────────────────────┘

Features

Data Sources

GitHub

  - Fetches top repos by stars for each topic
  - Topics: python, fastapi, machine-learning, pytorch, tensorflow, etc.
  - Extracts Python files from each repo
  - Deduplicates based on content hash
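The content-hash deduplication mentioned for GitHub files could look roughly like the sketch below. It is an illustration under assumptions, not the daemon's actual code: the use of SHA-256, the whitespace normalization, and the names `content_hash` and `deduplicate` are all hypothetical.

```python
import hashlib


def content_hash(code: str) -> str:
    """Return a SHA-256 hex digest of the code with line endings normalized."""
    # Assumption: normalizing line endings and trailing whitespace, so files
    # that differ only in CRLF/LF or trailing spaces hash identically.
    normalized = "\n".join(line.rstrip() for line in code.splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(snippets):
    """Keep the first occurrence of each distinct snippet, preserving order."""
    seen = set()
    unique = []
    for snippet in snippets:
        digest = content_hash(snippet)
        if digest not in seen:
            seen.add(digest)
            unique.append(snippet)
    return unique
```

Storing the digest alongside each snippet (the schema has a `content_hash` property) would let later cycles skip content that was already ingested.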

StackOverflow

  - Fetches accepted answers with high scores
  - Tags: python, fastapi, asyncio, pandas, pytorch, etc.
  - Extracts code blocks from answer bodies
  - Filters for Python code only
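As a rough sketch of the extraction and filtering steps, the code below pulls `<pre><code>` blocks out of an HTML answer body and keeps those that parse as Python. It assumes answer bodies arrive as HTML; the class and function names are hypothetical, and using `ast.parse` as the Python filter is one possible heuristic, not necessarily the one the daemon uses.

```python
import ast
from html.parser import HTMLParser


class CodeBlockExtractor(HTMLParser):
    """Collect the text of every <pre><code>...</code></pre> block."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._in_pre = False
        self._in_code = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._in_pre = True
        elif tag == "code" and self._in_pre:
            # Only <code> nested inside <pre> counts; inline <code> is skipped.
            self._in_code = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "code" and self._in_code:
            self._in_code = False
            self.blocks.append("".join(self._buffer))
        elif tag == "pre":
            self._in_pre = False

    def handle_data(self, data):
        if self._in_code:
            self._buffer.append(data)


def looks_like_python(code: str) -> bool:
    """Heuristic: accept a non-empty block if it parses as Python source."""
    if not code.strip():
        return False
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        return False
    return True


def extract_python_blocks(html: str):
    """Return the <pre><code> blocks of an HTML body that parse as Python."""
    parser = CodeBlockExtractor()
    parser.feed(html)
    parser.close()
    return [block for block in parser.blocks if looks_like_python(block)]
```

The heuristic is deliberately loose: a block containing a single bare word also parses as Python, and snippets with interactive prompts (`>>>`) are rejected, so a real filter would likely need extra rules.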

Documentation

  - Scrapes code examples from official docs
  - Sources: Python, FastAPI, Pydantic, PyTorch, TensorFlow
  - Extracts code blocks from HTML

Storage Backends

Weaviate (Semantic Search)

  - Stores code snippets as embeddings
  - Enables semantic code search
  - Collection: GenesisCode

Neo4j (Knowledge Graph)

  - Stores code relationships
  - Links code to authors, repos, topics
  - Enables graph-based queries

RedPanda (Event Stream)

  - Publishes ingestion events in real-time
  - Topic: genesis.ingestion
  - Enables downstream processing
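To make the shape of a stored snippet and its ingestion event concrete, here is a minimal sketch. The field names follow the GenesisCode schema given later in this document; the event envelope (`event_type`, `snippet`) and the names `CodeSnippet` and `to_event` are hypothetical, since the actual event format is not documented here.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CodeSnippet:
    # Field names mirror the GenesisCode collection properties.
    source: str
    content: str
    language: str
    title: str
    url: str
    author: str
    content_hash: str
    created_at: str
    tags: list = field(default_factory=list)


INGESTION_TOPIC = "genesis.ingestion"  # topic name from this document


def to_event(snippet: CodeSnippet) -> bytes:
    """Serialize a snippet into a JSON event body (envelope keys are hypothetical)."""
    event = {
        "event_type": "snippet_ingested",  # hypothetical event name
        "snippet": asdict(snippet),
    }
    return json.dumps(event, sort_keys=True).encode("utf-8")
```

The resulting bytes could be handed to any Kafka-compatible producer for the `genesis.ingestion` topic; the producer itself is left out because the client library used by the daemon is not stated.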

Resilience

Monitoring

Usage

Run Continuously (Daemon Mode)

python3 scripts/genesis-ingestion-daemon.py

Runs an ingestion cycle every 60 minutes by default (configurable via GENESIS_CHECK_INTERVAL_MINUTES) until the process is stopped.
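The daemon-mode loop can be pictured with the sketch below. It is an assumption about the control flow, not an excerpt from the script: `run_forever`, its parameters, and the choice to log and continue after a failed cycle are all hypothetical.

```python
import time


def run_forever(run_cycle, interval_minutes=60, sleep=time.sleep, max_cycles=None):
    """Call run_cycle repeatedly, sleeping interval_minutes between cycles.

    max_cycles exists only so the loop can be exercised in tests; a daemon
    would leave it as None and run until stopped.
    """
    completed = 0
    while max_cycles is None or completed < max_cycles:
        try:
            run_cycle()
        except Exception as exc:  # assumption: one failed cycle should not kill the daemon
            print(f"ingestion cycle failed: {exc}")
        completed += 1
        if max_cycles is None or completed < max_cycles:
            sleep(interval_minutes * 60)
    return completed
```

Passing `sleep` as a parameter keeps the loop testable without waiting for real time to pass.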

Run Once (One-Shot Mode)

python3 scripts/genesis-ingestion-daemon.py --once

Runs a single ingestion cycle then exits.

Ingest Specific Source

# GitHub only
python3 scripts/genesis-ingestion-daemon.py --once --source github

# StackOverflow only
python3 scripts/genesis-ingestion-daemon.py --once --source stackoverflow

# Documentation only
python3 scripts/genesis-ingestion-daemon.py --once --source docs
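For readers who want to see how the `--once` and `--source` flags shown above might be parsed, here is a minimal `argparse` sketch. The flag names and source values come from the commands in this section; the function names and the default of ingesting all sources are assumptions.

```python
import argparse

SOURCES = ("github", "stackoverflow", "docs")


def parse_args(argv=None):
    """Parse the daemon's command-line flags."""
    parser = argparse.ArgumentParser(description="Genesis Ingestion Daemon")
    parser.add_argument("--once", action="store_true",
                        help="run a single ingestion cycle, then exit")
    parser.add_argument("--source", choices=SOURCES, default=None,
                        help="ingest only this source (default: all sources)")
    return parser.parse_args(argv)


def selected_sources(args):
    """Return the list of sources to ingest for the parsed arguments."""
    return [args.source] if args.source else list(SOURCES)
```

With `choices=SOURCES`, an unknown value such as `--source gitlab` is rejected by argparse with a usage error rather than silently ignored.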

Deploy as systemd Service (FORGE)

# Copy service file
sudo cp systemd/genesis-ingestion-daemon.service /etc/systemd/system/

# Reload systemd
sudo systemctl daemon-reload

# Enable and start
sudo systemctl enable genesis-ingestion-daemon
sudo systemctl start genesis-ingestion-daemon

# Check status
sudo systemctl status genesis-ingestion-daemon

# View logs
sudo journalctl -u genesis-ingestion-daemon -f

Configuration

All configuration is via environment variables in .env:

# GitHub API token (required for GitHub ingestion)
GITHUB_TOKEN=gho_XXXXXXXXXXXXXXX

# Database connections (via tunnel)
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=truthsiadmin2024

WEAVIATE_URL=http://localhost:8080

REDIS_URL=redis://localhost:6379

# Ingestion parameters (optional overrides)
GENESIS_GITHUB_REPOS_PER_BATCH=50
GENESIS_STACKOVERFLOW_ANSWERS_PER_BATCH=100
GENESIS_CHECK_INTERVAL_MINUTES=60
GENESIS_RATE_LIMIT_DELAY_SECONDS=1.0
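The optional GENESIS_* overrides could be read along the lines of the sketch below. It assumes the environment has already been populated from .env, and it treats the values listed above as defaults, which is an assumption; `IngestionConfig` and `load_config` are hypothetical names.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestionConfig:
    github_repos_per_batch: int
    stackoverflow_answers_per_batch: int
    check_interval_minutes: int
    rate_limit_delay_seconds: float


def load_config(env=None):
    """Read the GENESIS_* overrides, falling back to the documented values."""
    env = os.environ if env is None else env
    return IngestionConfig(
        github_repos_per_batch=int(env.get("GENESIS_GITHUB_REPOS_PER_BATCH", "50")),
        stackoverflow_answers_per_batch=int(
            env.get("GENESIS_STACKOVERFLOW_ANSWERS_PER_BATCH", "100")
        ),
        check_interval_minutes=int(env.get("GENESIS_CHECK_INTERVAL_MINUTES", "60")),
        rate_limit_delay_seconds=float(
            env.get("GENESIS_RATE_LIMIT_DELAY_SECONDS", "1.0")
        ),
    )
```

Accepting a mapping instead of reading `os.environ` directly makes it easy to check the parsing with a plain dictionary.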

Weaviate Collection Schema

The daemon expects a GenesisCode collection in Weaviate. Create it with:

import weaviate
from weaviate.classes.config import Configure, DataType, Property

# Connect to the local Weaviate instance (localhost:8080 by default).
client = weaviate.connect_to_local()

try:
    client.collections.create(
        name="GenesisCode",
        properties=[
            Property(name="source", data_type=DataType.TEXT),
            Property(name="content", data_type=DataType.TEXT),
            Property(name="language", data_type=DataType.TEXT),
            Property(name="title", data_type=DataType.TEXT),
            Property(name="url", data_type=DataType.TEXT),
            Property(name="author", data_type=DataType.TEXT),
            Property(name="tags", data_type=DataType.TEXT_ARRAY),
            Property(name="content_hash", data_type=DataType.TEXT),
            Property(name="created_at", data_type=DataType.TEXT),
        ],
        # Embeddings are generated by the text2vec-transformers module.
        vectorizer_config=Configure.Vectorizer.text2vec_transformers(),
    )
finally:
    # The v4 client keeps connections open, so close it explicitly.
    client.close()

Or run the setup script (listed under Files as a future addition, so it may not exist yet):

python3 scripts/setup-genesis-weaviate-collection.py

Performance

Expected throughput (per cycle):

  - GitHub: ~500 code files (10 repos × 10 files × 5 topics)
  - StackOverflow: ~1,000 code snippets (100 answers × 10 tags)
  - Documentation: ~100 code examples (20 examples × 5 doc sites)
  - Total: ~1,600 snippets per cycle, or ~1,600 per hour at the default 60-minute interval

Projected volume at that rate:

  - ~38,400 code snippets per 24 hours
  - ~1.15M snippets per 30-day month
  - The goal is a corpus large enough to fine-tune a code model in Phase 2
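The estimates above can be reproduced with a few lines of arithmetic. The sketch assumes one cycle per 60 minutes and a 30-day month, and it ignores deduplication and rate limiting, so the figures are best read as upper bounds.

```python
# Per-cycle estimates, using the figures quoted above.
github = 10 * 10 * 5        # repos per topic x files per repo x topics
stackoverflow = 100 * 10    # answers per tag x tags
docs = 20 * 5               # examples per site x doc sites

per_cycle = github + stackoverflow + docs
cycles_per_day = (24 * 60) // 60   # one cycle every 60 minutes
per_day = per_cycle * cycles_per_day
per_month = per_day * 30           # assuming a 30-day month

print(per_cycle, per_day, per_month)  # 1600 38400 1152000
```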

Troubleshooting

No GitHub data collected:

  - Check GITHUB_TOKEN is set in .env
  - Verify token has repo read permissions
  - Check rate limit: curl -H "Authorization: Bearer $GITHUB_TOKEN" https://api.github.com/rate_limit
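The JSON returned by the rate-limit check can be hard to read at a glance. The helper below is a small sketch that summarizes the `core` quota from that response; `summarize_rate_limit` is a hypothetical name, and the code assumes the response contains `resources.core` with `limit`, `remaining`, and `reset` (a Unix timestamp).

```python
import json
from datetime import datetime, timezone


def summarize_rate_limit(payload: str) -> str:
    """Summarize the core quota from a GitHub /rate_limit JSON response."""
    core = json.loads(payload)["resources"]["core"]
    # "reset" is a Unix timestamp; show it as a UTC time.
    reset_at = datetime.fromtimestamp(core["reset"], tz=timezone.utc)
    return (f"{core['remaining']}/{core['limit']} requests left, "
            f"resets at {reset_at.isoformat()}")
```

If `remaining` is 0, the GitHub ingester will likely collect nothing until the reset time, even with a valid token.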

Weaviate connection errors:

  - Ensure tunnel is running: ./scripts/forge-tunnel.sh status
  - Test connection: curl http://localhost:8080/v1/meta
  - Check Weaviate is running: ssh genesis "docker ps | grep weaviate"

Neo4j connection errors:

  - Check tunnel: ./scripts/forge-tunnel.sh status
  - Verify credentials in .env
  - Test: ssh genesis "docker exec neo4j cypher-shell -u neo4j -p truthsiadmin2024 'RETURN 1'"

High memory usage:

  - Reduce batch sizes in configuration
  - Limit concurrent operations
  - Increase check interval

Next Steps (Genesis Protocol)

After ingestion:

  1. Phase 2: Training - Fine-tune Qwen on collected code
  2. Phase 3: Agency - Build agentic framework with tool use
  3. Phase 4: Interface - Create chat UI
  4. Phase 5: Integration - Wire into 9-layer architecture

Files

scripts/
  genesis-ingestion-daemon.py        # Main daemon (481 LOC)
  setup-genesis-weaviate-collection.py  # Weaviate setup (future)

systemd/
  genesis-ingestion-daemon.service   # Systemd service file

docs/genesis/
  INGESTION_DAEMON.md               # This document

logs/
  genesis-ingestion-daemon.log      # Daemon logs

Created by THE ARCHITECT - Session 311
Part of GENESIS PROTOCOL - Phase 1 Complete