🧬 Genesis Ingestion Daemon

Part of Genesis Protocol Phase 1: Data Collection

Overview

The Genesis Ingestion Daemon continuously collects high-quality code examples from public sources to build the training corpus for the Genesis self-coding system. It ingests code from:

GitHub: Top Python/FastAPI repositories
StackOverflow: Best Python answers with high scores
Documentation: Official Python, FastAPI, PyTorch, TensorFlow docs
Truth.SI: Our own codebase patterns (future enhancement)

Architecture

┌─────────────────────────────────────────────────────────────┐
│                 GENESIS INGESTION DAEMON                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │   GitHub     │  │ StackOverflow│  │    Docs      │       │
│  │  Ingester    │  │   Ingester   │  │  Ingester    │       │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘       │
│         │                 │                 │               │
│         └─────────────────┴─────────────────┘               │
│                           │                                 │
│                  ┌────────▼────────┐                        │
│                  │ GenesisStorage  │                        │
│                  └────────┬────────┘                        │
│                           │                                 │
│         ┌─────────────────┼─────────────────┐               │
│         │                 │                 │               │
│    ┌────▼─────┐    ┌──────▼──────┐   ┌──────▼──────┐        │
│    │ Weaviate │    │    Neo4j    │   │  RedPanda   │        │
│    │(Vectors) │    │ (Knowledge) │   │  (Events)   │        │
│    └──────────┘    └─────────────┘   └─────────────┘        │
└─────────────────────────────────────────────────────────────┘

Features

Data Sources

GitHub
- Fetches top repos by stars for each topic
- Topics: python, fastapi, machine-learning, pytorch, tensorflow, etc.
- Extracts Python files from each repo
- Deduplicates based on content hash

StackOverflow
- Fetches accepted answers with high scores
- Tags: python, fastapi, asyncio, pandas, pytorch, etc.
- Extracts code blocks from answer bodies
- Filters for Python code only

Documentation
- Scrapes code examples from official docs
- Sources: Python, FastAPI, Pydantic, PyTorch, TensorFlow
- Extracts code blocks from HTML
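
The code-block extraction described for StackOverflow answers and documentation pages can be sketched with the standard library alone. This is a minimal illustration, not the daemon's actual implementation: the names CodeBlockExtractor, extract_code_blocks, and is_valid_python are hypothetical, and it assumes code examples appear inside <pre><code> elements.

```python
import ast
from html.parser import HTMLParser


class CodeBlockExtractor(HTMLParser):
    """Collects the text inside <pre><code>...</code></pre> elements."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)  # turns &lt; into < and so on
        self.blocks: list[str] = []
        self._in_pre = False
        self._in_code = False
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._in_pre = True
        elif tag == "code" and self._in_pre:
            # Only <code> nested in <pre> counts; inline <code> is ignored.
            self._in_code = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "code" and self._in_code:
            self._in_code = False
            text = "".join(self._buffer).strip("\n")
            if text.strip():
                self.blocks.append(text)
        elif tag == "pre":
            self._in_pre = False

    def handle_data(self, data):
        if self._in_code:
            self._buffer.append(data)


def extract_code_blocks(html: str) -> list[str]:
    """Return every <pre><code> block found in an HTML fragment."""
    parser = CodeBlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


def is_valid_python(code: str) -> bool:
    """Heuristic Python filter (an assumption): keep code that parses."""
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        return False
    return True
```

A parse check like this is only a heuristic: it rejects valid snippets that include REPL prompts or elided sections, and it accepts short fragments of other languages that happen to parse as Python.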

Storage Backends

Weaviate (Semantic Search)
- Stores code snippets as embeddings
- Enables semantic code search
- Collection: GenesisCode

Neo4j (Knowledge Graph)
- Stores code relationships
- Links code to authors, repos, topics
- Enables graph-based queries

RedPanda (Event Stream)
- Publishes ingestion events in real-time
- Topic: genesis.ingestion
- Enables downstream processing
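
To make the data flow concrete, here is a hedged sketch of a single snippet record and the event that might be published for it. The field names mirror the GenesisCode properties listed under Weaviate Collection Schema; the Snippet class, the to_event helper, and the JSON event layout are assumptions for illustration, not the daemon's confirmed wire format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class Snippet:
    """One ingested code snippet; fields follow the GenesisCode schema."""

    source: str
    content: str
    language: str = "python"
    title: str = ""
    url: str = ""
    author: str = ""
    tags: list[str] = field(default_factory=list)
    content_hash: str = ""
    created_at: str = ""

    def __post_init__(self) -> None:
        # Fill in derived fields when the caller did not supply them.
        if not self.content_hash:
            self.content_hash = hashlib.sha256(
                self.content.encode("utf-8")
            ).hexdigest()
        if not self.created_at:
            self.created_at = datetime.now(timezone.utc).isoformat()


def to_event(snippet: Snippet, topic: str = "genesis.ingestion"):
    """Build (topic, key, value) for a Kafka-compatible producer.

    Keying by content hash (an assumption) keeps events for the same
    snippet on the same partition.
    """
    value = json.dumps(asdict(snippet), sort_keys=True).encode("utf-8")
    key = snippet.content_hash.encode("ascii")
    return topic, key, value
```

The resulting triple can be handed to whichever Kafka-compatible client the deployment uses, since RedPanda speaks the Kafka protocol.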

Resilience

Rate Limiting: Respects API limits (1s delay between calls)
Circuit Breaker: Auto-disables failing sources
Graceful Degradation: Continues if one backend fails
Deduplication: SHA-256 hash prevents duplicates
Content Limits: Max 100KB per snippet
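
Three of these mechanisms are easy to illustrate in a few lines. The sketch below is not taken from the daemon: the class names, the failure threshold of 3, and the 300-second cooldown are hypothetical, and it assumes 100KB means 100 × 1024 bytes of UTF-8 text.

```python
import hashlib
import time

MAX_SNIPPET_BYTES = 100 * 1024  # assumed reading of "Max 100KB per snippet"


class Deduplicator:
    """Accepts a snippet only once, keyed by the SHA-256 of its content."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def accept(self, content: str) -> bool:
        data = content.encode("utf-8")
        if len(data) > MAX_SNIPPET_BYTES:
            return False  # over the content limit
        digest = hashlib.sha256(data).hexdigest()
        if digest in self._seen:
            return False  # duplicate content
        self._seen.add(digest)
        return True


class CircuitBreaker:
    """Disables a source after repeated failures, retries after a cooldown."""

    def __init__(self, max_failures=3, cooldown_seconds=300.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means closed (source enabled)

    def allow(self) -> bool:
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.cooldown_seconds:
            # Cooldown elapsed: re-enable the source and start counting afresh.
            self._opened_at = None
            self._failures = 0
            return True
        return False

    def record_success(self) -> None:
        self._failures = 0

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.max_failures:
            self._opened_at = self._clock()
```

An ingester would call allow() before each fetch, record_success() or record_failure() afterwards, and pass each extracted snippet through Deduplicator.accept() before storing it.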

Monitoring

Prometheus Metrics (exposed on port 9126):
github_repos_fetched: Number of repos fetched
github_files_fetched: Number of files extracted
stackoverflow_snippets_fetched: Number of SO answers
docs_snippets_fetched: Number of doc examples
weaviate_stored: Snippets stored in Weaviate
neo4j_stored: Snippets stored in Neo4j
redpanda_published: Events published
ingestion_cycles_total: Total ingestion cycles
snippets_ingested_total: Total snippets collected
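
For readers unfamiliar with what a Prometheus scrape returns, the following sketch renders a few of the counters above in the Prometheus text exposition format and serves them over HTTP. It uses only the standard library; the daemon itself may well use a client library instead, and the COUNTERS dict, render_metrics, and serve names are hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Counter names come from the list above; the values are placeholders.
COUNTERS = {
    "github_repos_fetched": 0,
    "stackoverflow_snippets_fetched": 0,
    "docs_snippets_fetched": 0,
    "ingestion_cycles_total": 0,
    "snippets_ingested_total": 0,
}


def render_metrics(counters: dict) -> str:
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(COUNTERS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type",
                         "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 9126) -> None:
    """Block forever, serving /metrics on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

With something like this running, curl http://localhost:9126/metrics would return one TYPE line and one sample line per counter.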

Usage

Run Continuously (Daemon Mode)

python3 scripts/genesis-ingestion-daemon.py

Runs an ingestion cycle every 60 minutes by default (see GENESIS_CHECK_INTERVAL_MINUTES under Configuration) until stopped.

Run Once (One-Shot Mode)

python3 scripts/genesis-ingestion-daemon.py --once

Runs a single ingestion cycle then exits.

Ingest Specific Source

# GitHub only
python3 scripts/genesis-ingestion-daemon.py --once --source github

# StackOverflow only
python3 scripts/genesis-ingestion-daemon.py --once --source stackoverflow

# Documentation only
python3 scripts/genesis-ingestion-daemon.py --once --source docs
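
The --once and --source flags shown above map naturally onto argparse. The sketch below is one plausible way to parse them, not the daemon's actual code; build_parser and selected_sources are hypothetical names.

```python
import argparse

SOURCES = ("github", "stackoverflow", "docs")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="genesis-ingestion-daemon.py",
        description="Collect code examples for Genesis.",
    )
    parser.add_argument(
        "--once",
        action="store_true",
        help="run a single ingestion cycle, then exit",
    )
    parser.add_argument(
        "--source",
        choices=SOURCES,
        default=None,
        help="ingest only this source (default: all sources)",
    )
    return parser


def selected_sources(args: argparse.Namespace) -> list[str]:
    """Return the sources to ingest for the parsed arguments."""
    return [args.source] if args.source else list(SOURCES)


if __name__ == "__main__":
    cli_args = build_parser().parse_args()
    mode = "one-shot" if cli_args.once else "daemon"
    print(f"mode={mode} sources={selected_sources(cli_args)}")
```

Using choices makes argparse reject an unknown source name with a usage error instead of silently ingesting nothing.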

Deploy as systemd Service (FORGE)

# Copy service file
sudo cp systemd/genesis-ingestion-daemon.service /etc/systemd/system/

# Reload systemd
sudo systemctl daemon-reload

# Enable and start
sudo systemctl enable genesis-ingestion-daemon
sudo systemctl start genesis-ingestion-daemon

# Check status
sudo systemctl status genesis-ingestion-daemon

# View logs
sudo journalctl -u genesis-ingestion-daemon -f

Configuration

All configuration is via environment variables in .env:

# GitHub API token (required for GitHub ingestion)
GITHUB_TOKEN=gho_XXXXXXXXXXXXXXX

# Database connections (via tunnel)
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=truthsiadmin2024

WEAVIATE_URL=http://localhost:8080

REDIS_URL=redis://localhost:6379

# Ingestion parameters (optional overrides)
GENESIS_GITHUB_REPOS_PER_BATCH=50
GENESIS_STACKOVERFLOW_ANSWERS_PER_BATCH=100
GENESIS_CHECK_INTERVAL_MINUTES=60
GENESIS_RATE_LIMIT_DELAY_SECONDS=1.0
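
As an illustration of how the optional overrides could be read, the sketch below loads the four GENESIS_* parameters from the environment and falls back to the values shown above. IngestionConfig and load_config are hypothetical names. Note that Python does not read .env on its own, so this assumes the variables have already been exported (for example by systemd's EnvironmentFile or a dotenv loader).

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestionConfig:
    """Ingestion parameters; defaults mirror the values listed above."""

    github_repos_per_batch: int = 50
    stackoverflow_answers_per_batch: int = 100
    check_interval_minutes: int = 60
    rate_limit_delay_seconds: float = 1.0


def load_config(env=None) -> IngestionConfig:
    """Build the config from a mapping of environment variables."""
    env = os.environ if env is None else env
    return IngestionConfig(
        github_repos_per_batch=int(
            env.get("GENESIS_GITHUB_REPOS_PER_BATCH", 50)),
        stackoverflow_answers_per_batch=int(
            env.get("GENESIS_STACKOVERFLOW_ANSWERS_PER_BATCH", 100)),
        check_interval_minutes=int(
            env.get("GENESIS_CHECK_INTERVAL_MINUTES", 60)),
        rate_limit_delay_seconds=float(
            env.get("GENESIS_RATE_LIMIT_DELAY_SECONDS", 1.0)),
    )
```

Passing an explicit mapping instead of os.environ makes the loader easy to exercise in tests.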

Weaviate Collection Schema

The daemon expects a GenesisCode collection in Weaviate. Create it with:

import weaviate

client = weaviate.connect_to_local()

client.collections.create(
    name="GenesisCode",
    properties=[
        weaviate.classes.config.Property(
            name="source",
            data_type=weaviate.classes.config.DataType.TEXT,
        ),
        weaviate.classes.config.Property(
            name="content",
            data_type=weaviate.classes.config.DataType.TEXT,
        ),
        weaviate.classes.config.Property(
            name="language",
            data_type=weaviate.classes.config.DataType.TEXT,
        ),
        weaviate.classes.config.Property(
            name="title",
            data_type=weaviate.classes.config.DataType.TEXT,
        ),
        weaviate.classes.config.Property(
            name="url",
            data_type=weaviate.classes.config.DataType.TEXT,
        ),
        weaviate.classes.config.Property(
            name="author",
            data_type=weaviate.classes.config.DataType.TEXT,
        ),
        weaviate.classes.config.Property(
            name="tags",
            data_type=weaviate.classes.config.DataType.TEXT_ARRAY,
        ),
        weaviate.classes.config.Property(
            name="content_hash",
            data_type=weaviate.classes.config.DataType.TEXT,
        ),
        weaviate.classes.config.Property(
            name="created_at",
            data_type=weaviate.classes.config.DataType.TEXT,
        ),
    ],
    vectorizer_config=weaviate.classes.config.Configure.Vectorizer.text2vec_transformers(),
)

Or run the setup script, if it is present in your checkout (the Files section lists it as a future addition):

python3 scripts/setup-genesis-weaviate-collection.py

Performance

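Expected throughput (per cycle):
- GitHub: ~500 code files (10 repos × 10 files × 5 topics)
- StackOverflow: ~1,000 code snippets (100 answers × 10 tags)
- Documentation: ~100 code examples (20 examples × 5 doc sites)
- Total: ~1,600 snippets per cycle, i.e. per hour at the default 60-minute interval

Projected volume at that rate (upper bounds, since deduplication discards content already seen in earlier cycles):
- ~38,400 code snippets per 24 hours
- ~1.15M snippets per 30-day month
- Intended to supply the corpus for Phase 2 fine-tuning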
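
The arithmetic behind these figures can be checked with a short calculation. The function name estimate_throughput is hypothetical, and the 30-day month is an assumption used to reproduce the ~1.15M figure.

```python
def estimate_throughput(check_interval_minutes: int = 60) -> dict:
    """Reproduce the throughput estimates from the per-source figures."""
    github = 10 * 10 * 5       # repos x files x topics
    stackoverflow = 100 * 10   # answers x tags
    docs = 20 * 5              # examples x doc sites
    per_cycle = github + stackoverflow + docs

    cycles_per_day = (24 * 60) // check_interval_minutes
    per_day = per_cycle * cycles_per_day
    per_month = per_day * 30   # assumes a 30-day month

    return {
        "per_cycle": per_cycle,
        "per_day": per_day,
        "per_month": per_month,
    }


if __name__ == "__main__":
    for name, count in estimate_throughput().items():
        print(f"{name}: {count:,}")
```

These are gross counts; the number of unique snippets stored will be lower once duplicates are removed.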

Troubleshooting

No GitHub data collected:
- Check GITHUB_TOKEN is set in .env
- Verify token has repo read permissions
- Check rate limit: curl -H "Authorization: Bearer $GITHUB_TOKEN" https://api.github.com/rate_limit

Weaviate connection errors:
- Ensure tunnel is running: ./scripts/forge-tunnel.sh status
- Test connection: curl http://localhost:8080/v1/meta
- Check Weaviate is running: ssh genesis "docker ps | grep weaviate"

Neo4j connection errors:
- Check tunnel: ./scripts/forge-tunnel.sh status
- Verify credentials in .env
- Test: ssh genesis "docker exec neo4j cypher-shell -u neo4j -p truthsiadmin2024 'RETURN 1'"

High memory usage:
- Reduce batch sizes in configuration
- Limit concurrent operations
- Increase check interval
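
The GitHub rate-limit check above can also be done from Python, which is handy inside a diagnostic script. This is a sketch under assumptions: fetch_rate_limit and summarize_rate_limit are hypothetical helpers, and the parsing relies on the resources/core/remaining layout of the JSON returned by the rate_limit endpoint.

```python
import json
import os
import urllib.request
from datetime import datetime, timezone


def fetch_rate_limit(token: str) -> dict:
    """Call the same endpoint as the curl command above."""
    request = urllib.request.Request(
        "https://api.github.com/rate_limit",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def summarize_rate_limit(payload: dict, resource: str = "core") -> str:
    """Turn one resource entry of the response into a readable line."""
    info = payload["resources"][resource]
    reset_at = datetime.fromtimestamp(info["reset"], tz=timezone.utc)
    return (
        f"{resource}: {info['remaining']}/{info['limit']} remaining, "
        f"resets at {reset_at.isoformat()}"
    )


if __name__ == "__main__":
    data = fetch_rate_limit(os.environ["GITHUB_TOKEN"])
    print(summarize_rate_limit(data, "core"))
    # Repository search has its own, much smaller quota.
    print(summarize_rate_limit(data, "search"))
```

If the remaining count is zero, ingestion from GitHub will return nothing until the reset time shown.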

Next Steps (Genesis Protocol)

After ingestion:

Phase 2: Training - Fine-tune Qwen on collected code
Phase 3: Agency - Build agentic framework with tool use
Phase 4: Interface - Create chat UI
Phase 5: Integration - Wire into 9-layer architecture

Files

scripts/
  genesis-ingestion-daemon.py        # Main daemon (481 LOC)
  setup-genesis-weaviate-collection.py  # Weaviate setup (future)

systemd/
  genesis-ingestion-daemon.service   # Systemd service file

docs/genesis/
  INGESTION_DAEMON.md               # This document

logs/
  genesis-ingestion-daemon.log      # Daemon logs

Created by THE ARCHITECT - Session 311
Part of GENESIS PROTOCOL - Phase 1 Complete