🧬 Genesis Ingestion Daemon
Part of Genesis Protocol Phase 1: Data Collection
Overview
The Genesis Ingestion Daemon continuously collects world-class code examples from across the internet to train the Genesis self-coding system. It ingests code from:
- GitHub: Top Python/FastAPI repositories
- StackOverflow: Best Python answers with high scores
- Documentation: Official Python, FastAPI, PyTorch, TensorFlow docs
- Truth.SI: Our own codebase patterns (future enhancement)
Architecture
┌─────────────────────────────────────────────────────────────┐
│                  GENESIS INGESTION DAEMON                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │    GitHub    │  │ StackOverflow│  │     Docs     │       │
│  │   Ingester   │  │   Ingester   │  │   Ingester   │       │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘       │
│         │                 │                 │               │
│         └─────────────────┴─────────────────┘               │
│                           │                                 │
│                  ┌────────▼────────┐                        │
│                  │ GenesisStorage  │                        │
│                  └────────┬────────┘                        │
│                           │                                 │
│         ┌─────────────────┼─────────────────┐               │
│         │                 │                 │               │
│    ┌────▼─────┐    ┌──────▼──────┐   ┌──────▼──────┐        │
│    │ Weaviate │    │    Neo4j    │   │  RedPanda   │        │
│    │(Vectors) │    │ (Knowledge) │   │  (Events)   │        │
│    └──────────┘    └─────────────┘   └─────────────┘        │
└─────────────────────────────────────────────────────────────┘
Features
Data Sources
GitHub
- Fetches top repos by stars for each topic
- Topics: python, fastapi, machine-learning, pytorch, tensorflow, etc.
- Extracts Python files from each repo
- Deduplicates based on content hash
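Content-hash deduplication can be sketched as follows. This is a minimal illustration, not the daemon's actual code: the function names `content_hash` and `dedupe` are hypothetical, and the in-memory `seen` set is an assumption (the real daemon may check hashes against a storage backend instead).

```python
import hashlib


def content_hash(code: str) -> str:
    """Return the SHA-256 hex digest of the snippet's UTF-8 bytes."""
    return hashlib.sha256(code.encode("utf-8")).hexdigest()


def dedupe(snippets):
    """Yield each snippet the first time its content hash is seen."""
    seen = set()  # assumption: hashes are tracked in memory for this sketch
    for code in snippets:
        digest = content_hash(code)
        if digest in seen:
            continue
        seen.add(digest)
        yield code
```

Hashing the raw content means two files that differ only in whitespace count as distinct; normalizing before hashing would be a separate design decision.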
StackOverflow
- Fetches accepted answers with high scores
- Tags: python, fastapi, asyncio, pandas, pytorch, etc.
- Extracts code blocks from answer bodies
- Filters for Python code only
Documentation
- Scrapes code examples from official docs
- Sources: Python, FastAPI, Pydantic, PyTorch, TensorFlow
- Extracts code blocks from HTML
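Both the StackOverflow and documentation ingesters pull code blocks out of HTML. One way to do that with only the standard library is sketched below. It assumes code examples are wrapped in `<pre>` elements, which is common but not guaranteed for every doc site; `CodeBlockExtractor` and `extract_code_blocks` are hypothetical names, not taken from the daemon.

```python
from html.parser import HTMLParser


class CodeBlockExtractor(HTMLParser):
    """Collect the text found inside every <pre> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._depth = 0      # nesting depth of open <pre> tags
        self._buffer = []    # text chunks of the block being read

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.blocks.append("".join(self._buffer))
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def extract_code_blocks(html: str) -> list[str]:
    """Return the text of each <pre> block, with entities unescaped."""
    parser = CodeBlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Filtering the extracted blocks down to Python code is a separate step and is not shown here.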
Storage Backends
Weaviate (Semantic Search)
- Stores code snippets as embeddings
- Enables semantic code search
- Collection: GenesisCode
Neo4j (Knowledge Graph)
- Stores code relationships
- Links code to authors, repos, topics
- Enables graph-based queries
RedPanda (Event Stream)
- Publishes ingestion events in real-time
- Topic: genesis.ingestion
- Enables downstream processing
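The fan-out from GenesisStorage to the three backends, combined with the graceful degradation listed under Resilience, might look roughly like this. The sketch treats each backend as a plain callable; `store_snippet` and the `backends` mapping are hypothetical and stand in for whatever client wrappers the daemon actually uses.

```python
import logging

logger = logging.getLogger("genesis.storage")


def store_snippet(snippet: dict, backends: dict) -> dict:
    """Write the snippet to every backend; record failures instead of aborting."""
    results = {}
    for name, write in backends.items():
        try:
            write(snippet)
            results[name] = True
        except Exception:
            # One failing backend must not block the others.
            logger.exception("Backend %s failed; continuing", name)
            results[name] = False
    return results
```

The returned mapping (for example `{"weaviate": True, "neo4j": False, "redpanda": True}`) can drive per-backend counters.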
Resilience
- Rate Limiting: Respects API limits (1s delay between calls)
- Circuit Breaker: Auto-disables failing sources
- Graceful Degradation: Continues if one backend fails
- Deduplication: SHA-256 hash prevents duplicates
- Content Limits: Max 100KB per snippet
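A circuit breaker of the kind described above can be sketched in a few lines. The failure threshold and cooldown used here are illustrative assumptions, as is the `CircuitBreaker` class itself; the daemon's real thresholds are not documented in this file.

```python
import time


class CircuitBreaker:
    """Disable a source after repeated failures, then retry after a cooldown."""

    def __init__(self, max_failures=3, cooldown_seconds=300.0, clock=time.monotonic):
        self.max_failures = max_failures          # assumed threshold
        self.cooldown_seconds = cooldown_seconds  # assumed cooldown
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the source is enabled

    def allow(self) -> bool:
        """Return True if the source may be called right now."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.cooldown_seconds:
            # Cooldown elapsed: re-enable the source and reset the counter.
            self._opened_at = None
            self._failures = 0
            return True
        return False

    def record_success(self):
        self._failures = 0

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.max_failures:
            self._opened_at = self._clock()
```

An ingester would call `allow()` before each fetch and `record_success()` or `record_failure()` afterwards. Injecting the clock keeps the class easy to test.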
Monitoring
- Prometheus Metrics: Port 9126
  - github_repos_fetched: Number of repos fetched
  - github_files_fetched: Number of files extracted
  - stackoverflow_snippets_fetched: Number of SO answers
  - docs_snippets_fetched: Number of doc examples
  - weaviate_stored: Snippets stored in Weaviate
  - neo4j_stored: Snippets stored in Neo4j
  - redpanda_published: Events published
  - ingestion_cycles_total: Total ingestion cycles
  - snippets_ingested_total: Total snippets collected
Usage
Run Continuously (Daemon Mode)
python3 scripts/genesis-ingestion-daemon.py
Runs one ingestion cycle every 60 minutes (see GENESIS_CHECK_INTERVAL_MINUTES under Configuration) until stopped.
Run Once (One-Shot Mode)
python3 scripts/genesis-ingestion-daemon.py --once
Runs a single ingestion cycle then exits.
Ingest Specific Source
# GitHub only
python3 scripts/genesis-ingestion-daemon.py --once --source github
# StackOverflow only
python3 scripts/genesis-ingestion-daemon.py --once --source stackoverflow
# Documentation only
python3 scripts/genesis-ingestion-daemon.py --once --source docs
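For reference, the command-line surface used above (`--once` and `--source`) could be declared with `argparse` as in this sketch. It reflects only the flags shown in this document; the help strings and the `build_parser` name are assumptions, and the real script may define further options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the two flags documented in this README."""
    parser = argparse.ArgumentParser(description="Genesis Ingestion Daemon")
    parser.add_argument(
        "--once",
        action="store_true",
        help="run a single ingestion cycle, then exit",
    )
    parser.add_argument(
        "--source",
        choices=["github", "stackoverflow", "docs"],
        default=None,
        help="ingest only this source (default: all sources)",
    )
    return parser
```

Using `choices` makes a typo such as `--source githb` fail immediately with a usage message rather than silently ingesting nothing.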
Deploy as systemd Service (FORGE)
# Copy service file
sudo cp systemd/genesis-ingestion-daemon.service /etc/systemd/system/
# Reload systemd
sudo systemctl daemon-reload
# Enable and start
sudo systemctl enable genesis-ingestion-daemon
sudo systemctl start genesis-ingestion-daemon
# Check status
sudo systemctl status genesis-ingestion-daemon
# View logs
sudo journalctl -u genesis-ingestion-daemon -f
Configuration
All configuration is via environment variables in .env:
# GitHub API token (required for GitHub ingestion)
GITHUB_TOKEN=gho_XXXXXXXXXXXXXXX
# Database connections (via tunnel)
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=truthsiadmin2024
WEAVIATE_URL=http://localhost:8080
REDIS_URL=redis://localhost:6379
# Ingestion parameters (optional overrides)
GENESIS_GITHUB_REPOS_PER_BATCH=50
GENESIS_STACKOVERFLOW_ANSWERS_PER_BATCH=100
GENESIS_CHECK_INTERVAL_MINUTES=60
GENESIS_RATE_LIMIT_DELAY_SECONDS=1.0
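The optional ingestion parameters can be read into a typed object along these lines. This sketch assumes the variables are already present in the process environment (for example via a systemd `EnvironmentFile` or a `.env` loader) and that the values listed above are the defaults; `IngestionConfig` and `load_config` are hypothetical names.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestionConfig:
    github_repos_per_batch: int
    stackoverflow_answers_per_batch: int
    check_interval_minutes: int
    rate_limit_delay_seconds: float


def load_config(env=None) -> IngestionConfig:
    """Read GENESIS_* overrides, falling back to the assumed defaults."""
    env = os.environ if env is None else env
    return IngestionConfig(
        github_repos_per_batch=int(env.get("GENESIS_GITHUB_REPOS_PER_BATCH", "50")),
        stackoverflow_answers_per_batch=int(
            env.get("GENESIS_STACKOVERFLOW_ANSWERS_PER_BATCH", "100")
        ),
        check_interval_minutes=int(env.get("GENESIS_CHECK_INTERVAL_MINUTES", "60")),
        rate_limit_delay_seconds=float(
            env.get("GENESIS_RATE_LIMIT_DELAY_SECONDS", "1.0")
        ),
    )
```

Parsing values once at startup means a malformed setting such as `GENESIS_CHECK_INTERVAL_MINUTES=hourly` raises a `ValueError` before the first cycle starts.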
Weaviate Collection Schema
The daemon expects a GenesisCode collection in Weaviate. Create it with:
import weaviate
from weaviate.classes.config import Configure, DataType, Property

# Connect to the local Weaviate instance (reached through the tunnel)
client = weaviate.connect_to_local()
try:
    client.collections.create(
        name="GenesisCode",
        properties=[
            Property(name="source", data_type=DataType.TEXT),
            Property(name="content", data_type=DataType.TEXT),
            Property(name="language", data_type=DataType.TEXT),
            Property(name="title", data_type=DataType.TEXT),
            Property(name="url", data_type=DataType.TEXT),
            Property(name="author", data_type=DataType.TEXT),
            Property(name="tags", data_type=DataType.TEXT_ARRAY),
            Property(name="content_hash", data_type=DataType.TEXT),
            Property(name="created_at", data_type=DataType.TEXT),
        ],
        # Requires the text2vec-transformers module to be enabled in Weaviate
        vectorizer_config=Configure.Vectorizer.text2vec_transformers(),
    )
finally:
    # Always release the connection, even if creation fails
    client.close()
Or run the setup script (listed under Files as a future addition):
python3 scripts/setup-genesis-weaviate-collection.py
Performance
Expected throughput (per cycle):
- GitHub: ~500 code files (10 repos × 10 files × 5 topics)
- StackOverflow: ~1,000 code snippets (100 answers × 10 tags)
- Documentation: ~100 code examples (20 examples × 5 doc sites)
- Total: ~1,600 snippets per hour
Projected volume (before deduplication, at one cycle per hour):
- Over 24 hours: ~38,400 code snippets
- Over a 30-day month: ~1.15M snippets
- Enough to train a world-class code model
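The figures above follow from simple multiplication. The short check below reproduces them, assuming one cycle per hour and a 30-day month; the variable names are illustrative only.

```python
# Per-cycle estimates, taken from the breakdown in this section
per_cycle = {
    "github": 10 * 10 * 5,      # repos x files x topics
    "stackoverflow": 100 * 10,  # answers x tags
    "docs": 20 * 5,             # examples x doc sites
}

total_per_cycle = sum(per_cycle.values())
cycles_per_day = 24                  # assumption: one cycle every 60 minutes
per_day = total_per_cycle * cycles_per_day
per_month = per_day * 30             # assumption: 30-day month

print(total_per_cycle, per_day, per_month)
```

These are raw fetch counts. Because the daemon deduplicates by content hash, the number of unique snippets stored will be lower.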
Troubleshooting
No GitHub data collected:
- Check GITHUB_TOKEN is set in .env
- Verify token has repo read permissions
- Check rate limit: curl -H "Authorization: Bearer $GITHUB_TOKEN" https://api.github.com/rate_limit
Weaviate connection errors:
- Ensure tunnel is running: ./scripts/forge-tunnel.sh status
- Test connection: curl http://localhost:8080/v1/meta
- Check Weaviate is running: ssh genesis "docker ps | grep weaviate"
Neo4j connection errors:
- Check tunnel: ./scripts/forge-tunnel.sh status
- Verify credentials in .env
- Test: ssh genesis "docker exec neo4j cypher-shell -u neo4j -p truthsiadmin2024 'RETURN 1'"
High memory usage:
- Reduce batch sizes in configuration
- Limit concurrent operations
- Increase check interval
Next Steps (Genesis Protocol)
After ingestion:
- Phase 2: Training - Fine-tune Qwen on collected code
- Phase 3: Agency - Build agentic framework with tool use
- Phase 4: Interface - Create chat UI
- Phase 5: Integration - Wire into 9-layer architecture
Files
scripts/
  genesis-ingestion-daemon.py            # Main daemon (481 LOC)
  setup-genesis-weaviate-collection.py   # Weaviate setup (future)
systemd/
  genesis-ingestion-daemon.service       # Systemd service file
docs/genesis/
  INGESTION_DAEMON.md                    # This document
logs/
  genesis-ingestion-daemon.log           # Daemon logs
Created by THE ARCHITECT - Session 311
Part of GENESIS PROTOCOL - Phase 1 Complete