This document explains how to access transcript data from the k3s-deployed YouTube Transcript project.

Architecture

The YouTube Transcript project runs in the k3s cluster:

  • Namespace: youtube-transcript
  • Worker Pod: youtube-transcript-worker - GPU-powered transcription using Whisper
  • Database: SQLite with Prisma ORM
  • Replication: Litestream → MinIO (S3-compatible storage)

Prerequisites

  • Access to the ~/Projects/uptownhr/agents directory
  • Litestream installed (~/bin/litestream)
  • MinIO credentials (minio/minio123)
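
Before restoring, you can sanity-check these prerequisites with a short script. This is a minimal sketch: the litestream path matches the prerequisite above, and the MinIO host (http://minio.local) and its /minio/health/live health endpoint are assumptions taken from the restore config further down.

import os
import urllib.error
import urllib.request

# Check that the litestream binary is where this doc expects it.
litestream = os.path.expanduser('~/bin/litestream')
print('litestream found' if os.access(litestream, os.X_OK) else 'litestream missing')

# Check that the MinIO endpoint answers (host assumed from litestream-restore.yml).
try:
    urllib.request.urlopen('http://minio.local/minio/health/live', timeout=5)
    print('MinIO reachable')
except (urllib.error.URLError, OSError) as exc:
    print(f'MinIO not reachable: {exc}')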

Database Location

The live database runs inside the worker pod:

/app/prisma/youtube-transcripts.db

Litestream continuously replicates it to MinIO:

s3://youtube-transcripts/db/
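
To confirm that Litestream snapshots are actually landing in MinIO, you can list the replica path with any S3 client. A minimal sketch using boto3 (not part of this project's tooling; install it separately), with the credentials from the prerequisites and path-style addressing to match the restore config:

import boto3
from botocore.config import Config

# Point an S3 client at the local MinIO endpoint.
s3 = boto3.client(
    's3',
    endpoint_url='http://minio.local',
    aws_access_key_id='minio',
    aws_secret_access_key='minio123',
    config=Config(s3={'addressing_style': 'path'}),
)

# List Litestream's replica objects under s3://youtube-transcripts/db/
resp = s3.list_objects_v2(Bucket='youtube-transcripts', Prefix='db/', MaxKeys=20)
for obj in resp.get('Contents', []):
    print(obj['Key'], obj['Size'], obj['LastModified'])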

Restoring the Database Locally

1. Navigate to the project directory

cd ~/Projects/uptownhr/agents/packages/youtube-transcript

2. Use the litestream restore config

The litestream-restore.yml config is already set up:

dbs:
  - path: /tmp/youtube-transcripts.db
    replicas:
      - type: s3
        bucket: youtube-transcripts
        path: db
        endpoint: http://minio.local
        force-path-style: true

3. Run the restore

AWS_ACCESS_KEY_ID=minio AWS_SECRET_ACCESS_KEY=minio123 \
~/bin/litestream restore -config litestream-restore.yml /tmp/youtube-transcripts.db

This downloads the latest database snapshot to /tmp/youtube-transcripts.db.
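
A quick way to confirm the snapshot restored cleanly is to open it read-only and run a couple of checks. A minimal sketch:

import sqlite3

# Open the restored copy read-only so nothing is written back accidentally.
conn = sqlite3.connect('file:/tmp/youtube-transcripts.db?mode=ro', uri=True)
cursor = conn.cursor()

# SQLite's own consistency check should report "ok".
print(cursor.execute('PRAGMA integrity_check').fetchone()[0])

# The Prisma-managed tables should be present.
tables = [row[0] for row in cursor.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
print(tables)

conn.close()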

Database Schema

The SQLite database is managed with Prisma and has two main tables:

videos table

  Column        Type       Description
  id            TEXT       Primary key (cuid)
  youtubeId     TEXT       YouTube video ID
  title         TEXT       Video title
  duration      INTEGER    Duration in seconds
  channelName   TEXT       YouTube channel name
  status        TEXT       Processing status
  createdAt     DATETIME   Record creation time
  updatedAt     DATETIME   Last update time

transcripts table

  Column          Type       Description
  id              TEXT       Primary key (cuid)
  videoId         TEXT       Foreign key to videos
  rawText         TEXT       Raw Whisper transcription
  correctedText   TEXT       AI-corrected transcript
  takeaways       TEXT       AI-generated summary/analysis
  createdAt       DATETIME   Record creation time
  updatedAt       DATETIME   Last update time
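
If you want to confirm that your restored copy matches the schema above, SQLite's PRAGMA table_info lists the actual columns. A minimal sketch:

import sqlite3

conn = sqlite3.connect('/tmp/youtube-transcripts.db')
for table in ('videos', 'transcripts'):
    print(f'-- {table}')
    # Each row: (cid, name, type, notnull, dflt_value, pk)
    for cid, name, col_type, notnull, default, pk in conn.execute(f'PRAGMA table_info({table})'):
        print(f'  {name:15} {col_type:10} {"PK" if pk else ""}')
conn.close()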

Querying the Database

Using Python (recommended):

import sqlite3

conn = sqlite3.connect('/tmp/youtube-transcripts.db')
cursor = conn.cursor()

# Count videos with transcripts
cursor.execute('''
    SELECT COUNT(*) FROM videos v
    JOIN transcripts t ON v.id = t.videoId
    WHERE t.correctedText IS NOT NULL
''')
print(f"Videos with transcripts: {cursor.fetchone()[0]}")

# Sample query: five most recent transcripts with their lengths
cursor.execute('''
    SELECT v.title, v.channelName, LENGTH(t.correctedText) AS length
    FROM videos v
    JOIN transcripts t ON v.id = t.videoId
    ORDER BY t.createdAt DESC
    LIMIT 5
''')
for row in cursor.fetchall():
    print(f"{row[0]} ({row[1]}) - {row[2]} chars")

conn.close()
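
If you need the transcripts as plain files (for grepping or feeding into other tools), a small export loop works. The output directory here is arbitrary, not something the project defines:

import sqlite3
from pathlib import Path

out_dir = Path('/tmp/transcripts')  # arbitrary output location
out_dir.mkdir(exist_ok=True)

conn = sqlite3.connect('/tmp/youtube-transcripts.db')
rows = conn.execute('''
    SELECT v.youtubeId, v.title, t.correctedText
    FROM videos v
    JOIN transcripts t ON v.id = t.videoId
    WHERE t.correctedText IS NOT NULL
''')
for youtube_id, title, text in rows:
    # One file per video, named by YouTube ID.
    (out_dir / f'{youtube_id}.txt').write_text(f'{title}\n\n{text}', encoding='utf-8')
conn.close()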

Data Statistics

As of December 2025:

  • 658 total videos tracked
  • 646 transcripts with full corrected text
  • 646 takeaways with AI-generated analysis
  • Average transcript length: ~15,000 characters
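
These counts come straight from the restored database; a sketch of the queries behind them (the numbers will drift as new videos are processed):

import sqlite3

conn = sqlite3.connect('/tmp/youtube-transcripts.db')

total_videos = conn.execute('SELECT COUNT(*) FROM videos').fetchone()[0]
with_text = conn.execute(
    'SELECT COUNT(*) FROM transcripts WHERE correctedText IS NOT NULL').fetchone()[0]
with_takeaways = conn.execute(
    'SELECT COUNT(*) FROM transcripts WHERE takeaways IS NOT NULL').fetchone()[0]
avg_length = conn.execute(
    'SELECT AVG(LENGTH(correctedText)) FROM transcripts WHERE correctedText IS NOT NULL'
).fetchone()[0]

print(f'Total videos:           {total_videos}')
print(f'Corrected transcripts:  {with_text}')
print(f'Takeaways:              {with_takeaways}')
print(f'Avg transcript length:  {avg_length:.0f} chars')

conn.close()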

K8s Direct Access (Alternative)

If you need to access the live database directly in the cluster:

cd ~/Projects/uptownhr/agents
KUBECONFIG=.kube/config kubectl -n youtube-transcript exec -it deploy/youtube-transcript-worker -- /bin/sh
# Inside the pod
sqlite3 /app/prisma/youtube-transcripts.db ".tables"

Note: Prefer litestream restore for local work to avoid impacting the running service.
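
For one-off read-only checks against the live database, you can also run sqlite3 inside the pod non-interactively. A sketch using subprocess, assuming kubectl is on your PATH and the script is run from ~/Projects/uptownhr/agents so the relative kubeconfig path resolves:

import os
import subprocess

env = {**os.environ, 'KUBECONFIG': '.kube/config'}

# Run a read-only count against the live database inside the worker pod.
result = subprocess.run(
    [
        'kubectl', '-n', 'youtube-transcript',
        'exec', 'deploy/youtube-transcript-worker', '--',
        'sqlite3', '/app/prisma/youtube-transcripts.db',
        'SELECT COUNT(*) FROM videos;',
    ],
    env=env, capture_output=True, text=True, check=True,
)
print(result.stdout.strip())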