data-access
This document explains how to access transcript data from the k3s-deployed YouTube Transcript project.
Architecture
The YouTube Transcript project runs in the k3s cluster:
- Namespace:
youtube-transcript - Worker Pod:
youtube-transcript-worker- GPU-powered transcription using Whisper - Database: SQLite with Prisma ORM
- Replication: Litestream → MinIO (S3-compatible storage)
Prerequisites
- Access to the
~/Projects/uptownhr/agentsdirectory - Litestream installed (
~/bin/litestream) - MinIO credentials (minio/minio123)
Database Location
The live database runs inside the worker pod:
/app/prisma/youtube-transcripts.dbLitestream continuously replicates to MinIO:
s3://youtube-transcripts/db/Restoring the Database Locally
1. Navigate to the project directory
cd ~/Projects/uptownhr/agents/packages/youtube-transcript2. Use the litestream restore config
The litestream-restore.yml config is already set up:
dbs: - path: /tmp/youtube-transcripts.db replicas: - type: s3 bucket: youtube-transcripts path: db endpoint: http://minio.local force-path-style: true3. Run the restore
AWS_ACCESS_KEY_ID=minio AWS_SECRET_ACCESS_KEY=minio123 \ ~/bin/litestream restore -config litestream-restore.yml /tmp/youtube-transcripts.dbThis downloads the latest database snapshot to /tmp/youtube-transcripts.db.
Database Schema
The SQLite database uses Prisma with two main tables:
videos table
| Column | Type | Description |
|---|---|---|
| id | TEXT | Primary key (cuid) |
| youtubeId | TEXT | YouTube video ID |
| title | TEXT | Video title |
| duration | INTEGER | Duration in seconds |
| channelName | TEXT | YouTube channel name |
| status | TEXT | Processing status |
| createdAt | DATETIME | Record creation time |
| updatedAt | DATETIME | Last update time |
transcripts table
| Column | Type | Description |
|---|---|---|
| id | TEXT | Primary key (cuid) |
| videoId | TEXT | Foreign key to videos |
| rawText | TEXT | Raw Whisper transcription |
| correctedText | TEXT | AI-corrected transcript |
| takeaways | TEXT | AI-generated summary/analysis |
| createdAt | DATETIME | Record creation time |
| updatedAt | DATETIME | Last update time |
Querying the Database
Using Python (recommended):
import sqlite3
conn = sqlite3.connect('/tmp/youtube-transcripts.db')cursor = conn.cursor()
# Count videos with transcriptscursor.execute(''' SELECT COUNT(*) FROM videos v JOIN transcripts t ON v.id = t.videoId WHERE t.correctedText IS NOT NULL''')print(f"Videos with transcripts: {cursor.fetchone()[0]}")
# Sample querycursor.execute(''' SELECT v.title, v.channelName, LENGTH(t.correctedText) as length FROM videos v JOIN transcripts t ON v.id = t.videoId ORDER BY t.createdAt DESC LIMIT 5''')for row in cursor.fetchall(): print(f"{row[0]} ({row[1]}) - {row[2]} chars")
conn.close()Data Statistics
As of December 2025:
- 658 total videos tracked
- 646 transcripts with full corrected text
- 646 takeaways with AI-generated analysis
- Average transcript length: ~15,000 characters
K8s Direct Access (Alternative)
If you need to access the live database directly in the cluster:
cd ~/Projects/uptownhr/agentsKUBECONFIG=.kube/config kubectl -n youtube-transcript exec -it deploy/youtube-transcript-worker -- /bin/sh
# Inside the podsqlite3 /app/prisma/youtube-transcripts.db ".tables"Note: Prefer litestream restore for local work to avoid impacting the running service.