This document explains how to access transcript data from the k3s-deployed YouTube Transcript project.

Architecture

The YouTube Transcript project runs in the k3s cluster:

  • Namespace: youtube-transcript
  • Worker Pod: youtube-transcript-worker - GPU-powered transcription using Whisper
  • Database: SQLite with Prisma ORM
  • Replication: Litestream → MinIO (S3-compatible storage)

Prerequisites

  • Access to the ~/Projects/uptownhr/agents directory
  • Litestream installed (~/bin/litestream)
  • MinIO credentials (minio/minio123)
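
Before restoring, you can sanity-check these prerequisites with a short script. This is a minimal sketch: the litestream path matches the prerequisite above, and the MinIO host (http://minio.local) and its /minio/health/live health endpoint are assumptions taken from the restore config further down.

import os
import urllib.error
import urllib.request

# Check that the litestream binary is where this doc expects it.
litestream = os.path.expanduser('~/bin/litestream')
print('litestream found' if os.access(litestream, os.X_OK) else 'litestream missing')

# Check that the MinIO endpoint answers (host assumed from litestream-restore.yml).
try:
    urllib.request.urlopen('http://minio.local/minio/health/live', timeout=5)
    print('MinIO reachable')
except (urllib.error.URLError, OSError) as exc:
    print(f'MinIO not reachable: {exc}')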

Database Location

The live database runs inside the worker pod:

/app/prisma/youtube-transcripts.db

Litestream continuously replicates it to MinIO:

s3://youtube-transcripts/db/
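
To confirm that Litestream snapshots are actually landing in MinIO, you can list the replica path with any S3 client. A minimal sketch using boto3 (not part of this project's tooling; install it separately), with the credentials from the prerequisites and path-style addressing to match the restore config:

import boto3
from botocore.config import Config

# Point an S3 client at the local MinIO endpoint.
s3 = boto3.client(
    's3',
    endpoint_url='http://minio.local',
    aws_access_key_id='minio',
    aws_secret_access_key='minio123',
    config=Config(s3={'addressing_style': 'path'}),
)

# List Litestream's replica objects under s3://youtube-transcripts/db/
resp = s3.list_objects_v2(Bucket='youtube-transcripts', Prefix='db/', MaxKeys=20)
for obj in resp.get('Contents', []):
    print(obj['Key'], obj['Size'], obj['LastModified'])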

Restoring the Database Locally

1. Navigate to the project directory

cd ~/Projects/uptownhr/agents/packages/youtube-transcript

2. Use the litestream restore config

The litestream-restore.yml config is already set up:

dbs:
  - path: /tmp/youtube-transcripts.db
    replicas:
      - type: s3
        bucket: youtube-transcripts
        path: db
        endpoint: http://minio.local
        force-path-style: true

3. Run the restore

AWS_ACCESS_KEY_ID=minio AWS_SECRET_ACCESS_KEY=minio123 \
~/bin/litestream restore -config litestream-restore.yml /tmp/youtube-transcripts.db

This downloads the latest database snapshot to /tmp/youtube-transcripts.db.
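
A quick way to confirm the snapshot restored cleanly is to open it read-only and run a couple of checks. A minimal sketch:

import sqlite3

# Open the restored copy read-only so nothing is written back accidentally.
conn = sqlite3.connect('file:/tmp/youtube-transcripts.db?mode=ro', uri=True)
cursor = conn.cursor()

# SQLite's own consistency check should report "ok".
print(cursor.execute('PRAGMA integrity_check').fetchone()[0])

# The Prisma-managed tables should be present.
tables = [row[0] for row in cursor.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
print(tables)

conn.close()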

Database Schema

The SQLite database is managed with Prisma and has two main tables:

videos table

  Column        Type       Description
  id            TEXT       Primary key (cuid)
  youtubeId     TEXT       YouTube video ID
  title         TEXT       Video title
  duration      INTEGER    Duration in seconds
  channelName   TEXT       YouTube channel name
  status        TEXT       Processing status
  createdAt     DATETIME   Record creation time
  updatedAt     DATETIME   Last update time

transcripts table

  Column          Type       Description
  id              TEXT       Primary key (cuid)
  videoId         TEXT       Foreign key to videos
  rawText         TEXT       Raw Whisper transcription
  correctedText   TEXT       AI-corrected transcript
  takeaways       TEXT       AI-generated summary/analysis
  createdAt       DATETIME   Record creation time
  updatedAt       DATETIME   Last update time
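
If you want to confirm that your restored copy matches the schema above, SQLite's PRAGMA table_info lists the actual columns. A minimal sketch:

import sqlite3

conn = sqlite3.connect('/tmp/youtube-transcripts.db')
for table in ('videos', 'transcripts'):
    print(f'-- {table}')
    # Each row: (cid, name, type, notnull, dflt_value, pk)
    for cid, name, col_type, notnull, default, pk in conn.execute(f'PRAGMA table_info({table})'):
        print(f'  {name:15} {col_type:10} {"PK" if pk else ""}')
conn.close()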

Querying the Database

Using Python (recommended):

import sqlite3

conn = sqlite3.connect('/tmp/youtube-transcripts.db')
cursor = conn.cursor()

# Count videos with transcripts
cursor.execute('''
    SELECT COUNT(*) FROM videos v
    JOIN transcripts t ON v.id = t.videoId
    WHERE t.correctedText IS NOT NULL
''')
print(f"Videos with transcripts: {cursor.fetchone()[0]}")

# Sample query: five most recent transcripts with their lengths
cursor.execute('''
    SELECT v.title, v.channelName, LENGTH(t.correctedText) AS length
    FROM videos v
    JOIN transcripts t ON v.id = t.videoId
    ORDER BY t.createdAt DESC
    LIMIT 5
''')
for row in cursor.fetchall():
    print(f"{row[0]} ({row[1]}) - {row[2]} chars")

conn.close()
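
If you need the transcripts as plain files (for grepping or feeding into other tools), a small export loop works. The output directory here is arbitrary, not something the project defines:

import sqlite3
from pathlib import Path

out_dir = Path('/tmp/transcripts')  # arbitrary output location
out_dir.mkdir(exist_ok=True)

conn = sqlite3.connect('/tmp/youtube-transcripts.db')
rows = conn.execute('''
    SELECT v.youtubeId, v.title, t.correctedText
    FROM videos v
    JOIN transcripts t ON v.id = t.videoId
    WHERE t.correctedText IS NOT NULL
''')
for youtube_id, title, text in rows:
    # One file per video, named by YouTube ID.
    (out_dir / f'{youtube_id}.txt').write_text(f'{title}\n\n{text}', encoding='utf-8')
conn.close()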

Data Statistics

As of December 2025:

  • 658 total videos tracked
  • 646 transcripts with full corrected text
  • 646 takeaways with AI-generated analysis
  • Average transcript length: ~15,000 characters
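
These counts come straight from the restored database; a sketch of the queries behind them (the numbers will drift as new videos are processed):

import sqlite3

conn = sqlite3.connect('/tmp/youtube-transcripts.db')

total_videos = conn.execute('SELECT COUNT(*) FROM videos').fetchone()[0]
with_text = conn.execute(
    'SELECT COUNT(*) FROM transcripts WHERE correctedText IS NOT NULL').fetchone()[0]
with_takeaways = conn.execute(
    'SELECT COUNT(*) FROM transcripts WHERE takeaways IS NOT NULL').fetchone()[0]
avg_length = conn.execute(
    'SELECT AVG(LENGTH(correctedText)) FROM transcripts WHERE correctedText IS NOT NULL'
).fetchone()[0]

print(f'Total videos:           {total_videos}')
print(f'Corrected transcripts:  {with_text}')
print(f'Takeaways:              {with_takeaways}')
print(f'Avg transcript length:  {avg_length:.0f} chars')

conn.close()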

K8s Direct Access (Alternative)

If you need to access the live database directly in the cluster:

cd ~/Projects/uptownhr/agents
KUBECONFIG=.kube/config kubectl -n youtube-transcript exec -it deploy/youtube-transcript-worker -- /bin/sh
# Inside the pod
sqlite3 /app/prisma/youtube-transcripts.db ".tables"

Note: Prefer litestream restore for local work to avoid impacting the running service.
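
For one-off read-only checks against the live database, you can also run sqlite3 inside the pod non-interactively. A sketch using subprocess, assuming kubectl is on your PATH and the script is run from ~/Projects/uptownhr/agents so the relative kubeconfig path resolves:

import os
import subprocess

env = {**os.environ, 'KUBECONFIG': '.kube/config'}

# Run a read-only count against the live database inside the worker pod.
result = subprocess.run(
    [
        'kubectl', '-n', 'youtube-transcript',
        'exec', 'deploy/youtube-transcript-worker', '--',
        'sqlite3', '/app/prisma/youtube-transcripts.db',
        'SELECT COUNT(*) FROM videos;',
    ],
    env=env, capture_output=True, text=True, check=True,
)
print(result.stdout.strip())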