10 min read · BrassTranscripts Team

Building BrassTranscripts: What We Learned [Behind-the-Scenes Q&A]

We launched BrassTranscripts in August 2024 with a clear focus: professional AI transcription with transparent pricing. Pay-per-use pricing, automatic speaker identification included, and no subscription required. Here's what we learned building it.


Why We Started

Q: Why build another transcription service? There are already dozens.

We were frustrated users first. Every transcription service we tried had the same problems: hidden fees that doubled the advertised price, subscription requirements that charged monthly even when we didn't use the service, and extra charges for basic features like speaker identification.

We wanted pay-per-use pricing where $0.15/minute actually means $0.15/minute with everything included. No surprise charges for speaker identification, no monthly minimums, no enterprise upsells.

Q: What makes BrassTranscripts different?

Pay-per-use pricing with no subscriptions. You only pay for what you transcribe - $0.15/minute for files over 15 minutes, $2.25 flat rate for shorter files. No monthly fees, no credits that expire, no minimum commitments.
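
To make the arithmetic concrete, here's a minimal sketch of the pricing logic (the function name and rounding are illustrative, not our production billing code). Note that the flat rate equals exactly 15 minutes at the per-minute rate, so cost is continuous at the boundary:

```python
def estimate_cost(duration_minutes: float) -> float:
    """Estimate the cost of transcribing one file.

    Files of 15 minutes or less pay the $2.25 flat rate;
    longer files pay $0.15 per minute.
    """
    if duration_minutes <= 15:
        return 2.25
    return round(duration_minutes * 0.15, 2)

print(estimate_cost(10))   # 2.25 (flat rate)
print(estimate_cost(60))   # 9.0  (60 min at $0.15/min)
```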

Everything included: Speaker identification, all output formats (TXT, SRT, VTT, JSON), timestamps, and multi-language support are built into the base price. No add-on fees.

30-word preview before payment: See transcript quality before paying. If the preview doesn't meet your needs, walk away - no charge.


Technical Decisions

Q: Why did you choose WhisperX large-v3?

WhisperX combines OpenAI's Whisper large-v3 model (1.55 billion parameters) with word-level alignment and speaker diarization. It's one of the most capable open-source transcription pipelines available. (A sketch of the pipeline follows the list below.)

We chose it because:

  • Open-source transparency: The model is publicly documented
  • Speaker diarization: Automatic speaker identification (most services charge extra for this)
  • Multi-language support: 99+ languages with automatic detection
  • Active development: Regular updates and improvements
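
For the curious, here's a minimal sketch of the open-source WhisperX pipeline as its README documents it - transcribe, align, then diarize. Our production service wraps this in queuing, validation, and output formatting; the file name and HF_TOKEN are placeholders, and exact API details vary by WhisperX version:

```python
import whisperx

device = "cuda"
HF_TOKEN = "hf_..."  # your Hugging Face token (the pyannote diarization models are gated)
audio = whisperx.load_audio("meeting.mp3")  # placeholder file name

# 1. Transcribe with Whisper large-v3
model = whisperx.load_model("large-v3", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# 2. Align the output to get accurate word-level timestamps
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. Diarize and assign a speaker label to each segment
diarize_model = whisperx.DiarizationPipeline(use_auth_token=HF_TOKEN, device=device)
result = whisperx.assign_word_speakers(diarize_model(audio), result)

for seg in result["segments"]:
    print(seg.get("speaker", "UNKNOWN"), seg["text"])
```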

Q: What about processing speed? How fast is it really?

1-3 minutes per hour of audio (verified from production system logs). A typical 1-hour meeting transcribes in about 2 minutes; a 30-minute podcast in about 1 minute.

We set a 10-minute processing timeout as a safety limit. In practice, most files finish within 5 minutes regardless of complexity.
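
The estimate itself is simple arithmetic - a sketch (the function is illustrative):

```python
def estimate_processing_minutes(audio_hours: float) -> tuple[float, float]:
    """Estimate processing time from the observed 1-3 minutes per audio hour."""
    return audio_hours * 1, audio_hours * 3

low, high = estimate_processing_minutes(2.0)
print(f"A 2-hour file should take roughly {low:.0f}-{high:.0f} minutes.")
```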

Q: Why limit files to 2 hours and 250MB?

Practical limits based on actual usage patterns:

  • 2-hour maximum duration: Most meetings, podcasts, and interviews fit within 2 hours. Longer recordings can be split into segments.
  • 250MB maximum file size: Accommodates high-quality audio without requiring users to compress excessively.
  • 5-minute minimum duration: Ensures meaningful transcription content (very short clips often lack context).

These limits cover the vast majority of real-world transcription needs while maintaining consistent processing quality.
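
Enforcing these limits is a handful of comparisons. A sketch of the kind of check involved (not our production validation code):

```python
MIN_DURATION_S = 5 * 60             # 5-minute minimum
MAX_DURATION_S = 2 * 60 * 60        # 2-hour maximum
MAX_SIZE_BYTES = 250 * 1024 * 1024  # 250MB maximum

def validate_upload(duration_s: float, size_bytes: int) -> list[str]:
    """Return human-readable problems; an empty list means the file is accepted."""
    problems = []
    if duration_s < MIN_DURATION_S:
        problems.append("File is under the 5-minute minimum duration.")
    if duration_s > MAX_DURATION_S:
        problems.append("File exceeds the 2-hour limit; split it into segments.")
    if size_bytes > MAX_SIZE_BYTES:
        problems.append("File exceeds the 250MB size limit.")
    return problems
```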

Q: Why support 11 file formats instead of just MP3?

Because users' files come in whatever format their recording device creates:

  • iPhone users: Get M4A from Voice Memos app
  • Professional recordings: Often WAV or FLAC
  • Zoom/Teams users: Receive MP4 video files
  • Podcast creators: Use various formats (MP3, M4A, OGG)

Supporting 11 formats (9 audio + 2 video) means users upload their files directly without format conversion headaches.
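
Format support then reduces to an extension check. A sketch using only the formats named in this post (the authoritative list of 11 lives in our specifications file):

```python
from pathlib import Path

# Formats named in this post; the full list (9 audio + 2 video)
# is maintained in our central specifications file.
AUDIO_FORMATS = {".mp3", ".m4a", ".wav", ".flac", ".ogg"}
VIDEO_FORMATS = {".mp4"}

def is_supported(filename: str) -> bool:
    return Path(filename).suffix.lower() in AUDIO_FORMATS | VIDEO_FORMATS

print(is_supported("standup.m4a"))  # True
print(is_supported("notes.txt"))    # False
```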


Multi-language Support

Q: You support 99+ languages - how does that actually work?

WhisperX large-v3 has built-in support for 99+ languages with automatic language detection. When users upload a file, the system automatically detects the spoken language and transcribes accordingly - no manual language selection needed.
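
In code terms, detection is simply omitting the language argument. A sketch using the open-source WhisperX API (file name illustrative):

```python
import whisperx

model = whisperx.load_model("large-v3", "cuda")
audio = whisperx.load_audio("interview.m4a")  # placeholder file name

# With no language argument, Whisper detects the language from the
# opening audio and transcribes accordingly.
result = model.transcribe(audio)
print(result["language"])  # e.g. "es" for Spanish
```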

Q: What are the most commonly transcribed languages?

English is most common, followed by Spanish, French, German, and Mandarin Chinese. We also see regular usage of Portuguese, Japanese, Italian, Dutch, and Arabic.

Q: Does automatic detection ever get it wrong?

Rarely, but it happens. The most common issues:

  • Code-switching: Meetings where speakers switch between languages mid-conversation
  • Heavy accents: Strong regional accents can sometimes trigger incorrect language detection
  • Background noise: Poor audio quality affects language detection accuracy

When detection fails, the transcript quality drops noticeably - usually the first sign something went wrong.

Q: Can you transcribe mixed-language conversations?

Not reliably. WhisperX detects the dominant language in the file and transcribes everything in that language. If a meeting is 80% English and 20% Spanish, it will detect English and attempt to transcribe the Spanish portions phonetically in English.

For truly bilingual conversations, we recommend splitting the audio by language or accepting that one language will dominate the output.
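
If you know roughly where the languages switch, splitting is straightforward with a tool like pydub (the timestamps and file names below are illustrative - find yours from a first transcription pass or the preview):

```python
from pydub import AudioSegment  # requires ffmpeg installed

audio = AudioSegment.from_file("bilingual_meeting.mp3")

# Suppose the Spanish discussion runs from 12:00 to 18:30.
start_ms, end_ms = 12 * 60 * 1000, (18 * 60 + 30) * 1000

spanish = audio[start_ms:end_ms]             # pydub slices in milliseconds
english = audio[:start_ms] + audio[end_ms:]

spanish.export("meeting_spanish.mp3", format="mp3")
english.export("meeting_english.mp3", format="mp3")
```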

Q: What about regional dialects and accents?

WhisperX handles most regional variations well - British English, Australian English, Latin American Spanish, and European Portuguese all process correctly. The model was trained on diverse speech data including various accents and dialects.

Strong regional accents may see slightly lower transcription quality, but the system generally adapts well to standard pronunciation variations within each language.


Building AI Prompts

Q: You have 66 AI prompts now. Why so many?

Because transcription is just the first step. Users need to do something with the transcript:

  • Executives need meeting summaries and action items
  • Marketers need blog posts and social media content
  • Legal professionals need deposition analysis
  • Researchers need thematic coding

We built prompts for every major use case we encountered. Not theoretical prompts - ones based on actual user needs.

Q: What's the most popular prompt?

Executive Summary Generator and Meeting Action Items Extractor consistently top the list. Professionals transcribe meetings, then immediately need actionable summaries for their teams.

Q: Do users actually use the prompts?

Yes. Blog posts with embedded prompts perform significantly better than posts without prompts. Users copy the prompts and use them with ChatGPT, Claude, or Gemini.

The prompt guide page has become one of our highest-traffic pages.

Q: How do you design effective AI prompts?

We focus on clear, structured instructions that work consistently across different AI tools (ChatGPT, Claude, Gemini). Each prompt includes:

  • Specific task description: What the AI should produce
  • Input requirements: What the user needs to provide
  • Output format: How results should be structured
  • Customization guidance: How to adapt for specific needs

Example (assembled into a complete prompt in the sketch after this list):

  • "Extract action items, owners, and deadlines from meeting transcripts" (clear task)
  • "Paste your transcript below" (clear input)
  • "Format as numbered list with owner names bolded" (clear output)

What Surprised Us

Q: What surprised you most about running a transcription service?

How much people value the preview feature. We thought users would prioritize price or speed. They do care about those - but the 30-word preview (try before you pay) became our biggest differentiator.

Users consistently mention being able to verify transcript quality before paying as the reason they chose us over competitors.

Q: What feature took the longest to get right?

Speaker diarization (identifying who said what). Early on, we just labeled speakers as "Speaker 0," "Speaker 1," etc. Users wanted real names.

We realized we couldn't automatically assign names (we don't know who's in your meeting). So we built AI prompts to help users identify speakers from context and replace labels themselves. That solved it.
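
The replacement step itself is trivial once you've identified who's who - a sketch (file name and speaker names are illustrative):

```python
def rename_speakers(transcript: str, names: dict[str, str]) -> str:
    """Swap generic diarization labels for names identified from context."""
    for label, name in names.items():
        transcript = transcript.replace(label, name)
    return transcript

with open("meeting.txt") as f:
    text = rename_speakers(f.read(), {"Speaker 0": "Alice", "Speaker 1": "Bob"})
```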

Q: What did users ask for that you didn't expect?

Processing time estimates. Users kept asking "how long will my 2-hour file take?" We built an AI prompt that calculates exact timelines based on our 1-3 min/hour processing speed.

Turns out, deadline planning matters more than we thought.

Q: Any features you built that nobody uses?

Not yet - but we're cautious about feature creep. Every feature we've built addresses a specific user problem we verified through support questions or blog comments.


Mistakes We Made

Q: What's the biggest mistake you made?

Not documenting system specifications from day one. We had service limits, formats, and pricing scattered across different files and documentation. This created inconsistencies between our website, blog posts, and support docs.

Now we maintain a centralized specifications file with all technical details, and we reference it directly in all content.
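
Ours is a small module along these lines (structure illustrative; the real file carries more fields):

```python
# specs.py - the single source of truth every page, post, and script references.
PRICING = {"per_minute_usd": 0.15, "flat_rate_usd": 2.25, "flat_rate_max_minutes": 15}
LIMITS = {"min_duration_minutes": 5, "max_duration_hours": 2, "max_size_mb": 250}
PROCESSING = {"minutes_per_audio_hour": (1, 3), "timeout_minutes": 10}
FORMATS = {"audio_count": 9, "video_count": 2}
```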

Q: What would you do differently?

Build automated testing for documentation earlier. We manually checked blog posts for technical accuracy, which was time-consuming and error-prone. We should have built validation scripts from the start to verify all technical claims against our centralized specifications.
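
A sketch of such a script, building on the hypothetical specs module above - it flags any blog post whose quoted per-minute price disagrees with the specification:

```python
import re
from pathlib import Path

from specs import PRICING  # the hypothetical specs module sketched above

def check_posts(posts_dir: str) -> list[str]:
    """Flag posts whose quoted per-minute price disagrees with the specs file."""
    expected = f"${PRICING['per_minute_usd']:.2f}/minute"  # "$0.15/minute"
    errors = []
    for post in Path(posts_dir).glob("*.md"):
        for quoted in re.findall(r"\$\d+\.\d{2}/minute", post.read_text()):
            if quoted != expected:
                errors.append(f"{post.name}: found {quoted}, expected {expected}")
    return errors
```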

Q: Any technical decisions you regret?

Not implementing comprehensive error handling earlier. Our initial upload system had basic error messages like "Upload failed." Users couldn't tell if the problem was file size, format, or network issues.

We now provide specific error messages ("File exceeds 250MB limit" vs "Unsupported format" vs "Network timeout"), which reduced support questions significantly.
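
The underlying pattern is mapping each failure cause to its own message rather than one generic string - a sketch (error codes and wording illustrative):

```python
# One message per failure cause, instead of a generic "Upload failed."
UPLOAD_ERRORS = {
    "too_large": "File exceeds 250MB limit. Try compressing the audio.",
    "bad_format": "Unsupported format. See the list of 11 supported formats.",
    "timeout": "Network timeout. Check your connection and retry the upload.",
}

def error_message(code: str) -> str:
    return UPLOAD_ERRORS.get(code, "Upload failed. Contact support with the file details.")
```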


What's Next

Q: What features are you working on next?

We're focused on improvements to existing features rather than adding complexity:

  • Better error messages: Clearer explanations when uploads fail
  • Format conversion guidance: Helping users convert unsupported formats
  • Improved speaker diarization: Better handling of 10+ speaker scenarios

Q: Will you add subscriptions?

No. Pay-per-use pricing is our core value. The moment we add subscriptions, we create incentives to charge for features that should be included.

Q: What about an API?

It's on our roadmap. Current focus is perfecting the web upload experience. API access would help developers integrate transcription into their workflows, but we want to ensure it maintains the same pay-per-use pricing model with no subscription requirements.

Q: Will you expand beyond transcription?

Possibly. We're exploring related services like:

  • Translation: Transcribe in one language, translate to another
  • Summarization: Built-in AI summarization (beyond prompts)
  • Custom workflows: Automated processing pipelines

All expansions would follow the same pay-per-use pricing model with everything included in the base price.


Conclusion: Lessons for Other Builders

Q: What advice would you give to someone building a similar service?

Three core lessons:

1. Simple pricing beats complex pricing

Pay-per-use with everything included ($0.15/minute) converts better than tiered subscriptions with add-on fees. Users want to know the exact cost upfront - no surprises, no hidden charges, no calculating which tier they need.

Price transparency builds confidence faster than any marketing claim.

2. Solve ONE problem exceptionally well

We do transcription with pay-per-use pricing. That's it. No AI video generation, no enterprise collaboration suite, no blockchain integration. Just transcription done right.

Pick your one thing. Perfect it. Resist feature creep.

3. Build single sources of truth

Maintain one authoritative file with all service specifications - file limits, formats, pricing, processing times. Every piece of content references this file. No guessing, no inconsistencies, no outdated documentation.

Centralize your specifications and reference them everywhere.


Q: Final thoughts?

Building BrassTranscripts taught us that users value clarity and reliability. They want to know what they're paying, what they're getting, and what to expect.

The preview feature (see quality before paying) became our biggest differentiator - not because of marketing, but because users can verify results before committing.

Want to try it? Upload your audio at BrassTranscripts.com. See the 30-word preview, verify quality, pay only if satisfied. That's how transcription should work.


About This Post: This Q&A reflects our genuine journey building BrassTranscripts from August 2024 to November 2025. All technical specifications verified from production system data. All features and improvements based on actual development experience.

Ready to try BrassTranscripts?

Experience the accuracy and speed of our AI transcription service.