---
title: Modal Transcriber MCP
emoji: ⚑
colorFrom: blue
colorTo: purple
sdk: docker
sdk_version: 5.33.1
app_file: app.py
pinned: true
license: mit
python_version: 3.10.12
---

# 🎙️ Modal Transcriber MCP

An audio transcription system that combines a Gradio UI, FastMCP tools, and Modal cloud compute, with automatic speaker identification.

## ✨ Key Features

- **🎵 Multi-platform Audio Download**: Downloads audio from Apple Podcasts, XiaoYuZhou, and other podcast platforms
- **🚀 High-performance Transcription**: Built on OpenAI Whisper, with multiple model options (turbo, large-v3, etc.)
- **🎤 Intelligent Speaker Identification**: Uses pyannote.audio for speaker separation and embedding clustering
- **⚡ Distributed Processing**: Splits large files into chunks and transcribes them concurrently to reduce processing time
- **🔧 FastMCP Tools**: Full MCP (Model Context Protocol) tool integration
- **☁️ Modal Deployment**: Supports both local and cloud deployment modes

## 🎯 Core Advantages

### 🧠 Intelligent Audio Segmentation
- **Silence Detection Segmentation**: Detects silent passages and splits the audio at those points
- **Fallback Mechanism**: Long audio automatically falls back to time-based segmentation to keep processing efficient
- **Concurrent Processing**: Multiple chunks are transcribed at the same time, which speeds up transcription of long files
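
The segmentation strategy above can be illustrated with a small sketch. This is not the project's implementation: the function name `split_on_silence`, the thresholds, and the use of a plain list of amplitude samples are all assumptions made for the example. It cuts at the midpoint of each long enough silent run, then falls back to fixed-length pieces for any stretch that is still too long.

```python
from typing import List, Tuple


def split_on_silence(
    samples: List[float],
    sample_rate: int,
    silence_threshold: float = 0.01,  # hypothetical amplitude cutoff
    min_silence_sec: float = 0.5,     # hypothetical minimum pause length
    max_chunk_sec: float = 60.0,      # hypothetical upper bound per chunk
) -> List[Tuple[int, int]]:
    """Return (start, end) sample-index pairs that cover the whole signal."""
    min_silence = int(min_silence_sec * sample_rate)
    max_chunk = int(max_chunk_sec * sample_rate)

    # Pass 1: record the midpoint of every sufficiently long silent run.
    cut_points: List[int] = []
    run_start = None
    for i, sample in enumerate(samples):
        if abs(sample) < silence_threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_silence:
                cut_points.append((run_start + i) // 2)
            run_start = None

    # Pass 2: build chunks between cuts. Any stretch longer than max_chunk
    # is split at fixed intervals (the time-based fallback).
    chunks: List[Tuple[int, int]] = []
    start = 0
    for end in cut_points + [len(samples)]:
        while end - start > max_chunk:
            chunks.append((start, start + max_chunk))
            start += max_chunk
        if end > start:
            chunks.append((start, end))
            start = end
    return chunks
```

Cutting in the middle of a pause rather than at its edge leaves a little silence on both sides of the boundary, which tends to avoid clipping the first or last word of a chunk.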

### 🎤 Advanced Speaker Identification
- **Embedding Clustering**: Clusters deep learning speaker embeddings so the same voice gets a consistent label
- **Cross-chunk Unification**: Resolves inconsistent speaker labels across chunks that were processed separately
- **Quality Filtering**: Automatically filters out low-quality segments to improve output accuracy
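
To make the cross-chunk problem concrete: when chunks are diarized independently, "speaker A" in one chunk may be "speaker B" in the next. The sketch below shows one simple way to reconcile labels, assuming each chunk yields one embedding vector per local speaker. The function names, the greedy matching, and the 0.8 cosine threshold are illustrative assumptions, not the project's code and not the pyannote.audio API.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def unify_speakers(
    chunk_embeddings: List[Dict[str, Sequence[float]]],
    threshold: float = 0.8,  # hypothetical similarity cutoff
) -> List[Dict[str, str]]:
    """Map each chunk-local speaker label to a global speaker ID."""
    references: List[List[float]] = []  # one reference embedding per global speaker
    mappings: List[Dict[str, str]] = []
    for local in chunk_embeddings:
        mapping: Dict[str, str] = {}
        for label, embedding in local.items():
            best, best_sim = None, threshold
            for idx, ref in enumerate(references):
                sim = cosine(embedding, ref)
                if sim >= best_sim:
                    best, best_sim = idx, sim
            if best is None:
                # No known speaker is similar enough: register a new one.
                references.append(list(embedding))
                best = len(references) - 1
            mapping[label] = f"SPEAKER_{best:02d}"
        mappings.append(mapping)
    return mappings
```

This greedy version keeps the first embedding seen for each speaker as the reference and never updates it. A fuller approach would cluster all embeddings jointly and prevent two labels from the same chunk from collapsing into one global speaker.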

### 🔧 Developer Friendly
- **MCP Support**: Complete tool invocation interface
- **REST API**: Standardized HTTP interface
- **Gradio UI**: Intuitive web interface
- **Test Coverage**: 29 unit and integration tests

## 🚀 Quick Start

### Local Setup

1. **Clone Repository**
```bash
git clone https://huggingface.co/spaces/Agents-MCP-Hackathon/ModalTranscriberMCP
cd ModalTranscriberMCP
```

2. **Install Dependencies**
```bash
pip install -r requirements.txt
```

3. **Configure Hugging Face Token** (Optional, for speaker identification)
```bash
# Create .env file
echo "HF_TOKEN=your_huggingface_token_here" > .env
```

4. **Start Application**
```bash
python app.py
```
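
Step 3 stores the token in a `.env` file, and the token is optional. How `app.py` reads it is not shown here, so the following is only a minimal sketch using the standard library. The helper names `load_env_file` and `get_hf_token` are hypothetical.

```python
import os
from pathlib import Path
from typing import Optional


def load_env_file(path: str = ".env") -> None:
    """Copy KEY=VALUE pairs from a .env file into os.environ.

    Values already present in the environment are left untouched.
    """
    env_path = Path(path)
    if not env_path.exists():
        return
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))


def get_hf_token() -> Optional[str]:
    """Return the token, or None so speaker identification can be skipped."""
    return os.environ.get("HF_TOKEN") or None
```

Returning `None` instead of raising lets the caller run transcription without speaker identification when no token is configured.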

### Usage Instructions

1. **Upload an audio file** or **enter a podcast URL**
2. **Select transcription options**:
   - Model size: turbo (recommended) / large-v3
   - Output format: SRT / TXT
   - Enable speaker identification
3. **Start transcription**. The system processes the audio and generates the results automatically.
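
For readers unfamiliar with the SRT output format, the sketch below shows how timed segments map onto it: a running index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time line, the text, and a blank line between entries. The segment dictionary shape (`start`, `end`, `text`, optional `speaker`) is an assumption for illustration, not the project's actual data structure.

```python
from typing import Dict, List


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: List[Dict]) -> str:
    """Render segments as SRT text; prefix the speaker label when present."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        text = seg["text"].strip()
        if seg.get("speaker"):
            text = f"{seg['speaker']}: {text}"
        time_line = f"{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}"
        blocks.append(f"{index}\n{time_line}\n{text}\n")
    # Entries are separated by a blank line.
    return "\n".join(blocks)
```

The TXT format would drop the index and time line and keep only the (optionally speaker-prefixed) text.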

## 🛠️ Technical Architecture

- **Frontend**: Gradio 4.44.0
- **Backend**: FastAPI + FastMCP
- **Transcription Engine**: OpenAI Whisper
- **Speaker Identification**: pyannote.audio
- **Cloud Computing**: Modal.com
- **Audio Processing**: FFmpeg

## 📊 Performance Metrics

- **Processing Speed**: Up to 30x real-time transcription speed
- **Concurrency**: Up to 10 chunks processed simultaneously
- **Accuracy**: >95% on Chinese audio
- **Supported Formats**: MP3, WAV, M4A, FLAC, etc.
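
The concurrency limit and the speed figure can be read as follows. The sketch uses a local thread pool as a stand-in. It does not show how the project dispatches work to Modal, and `transcribe_chunks`, `transcribe_one`, and `estimated_seconds` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")

MAX_CONCURRENT_CHUNKS = 10  # mirrors the "up to 10 chunks" figure above


def transcribe_chunks(chunks: Sequence[T], transcribe_one: Callable[[T], R]) -> List[R]:
    """Run transcribe_one over all chunks, at most 10 at a time."""
    if not chunks:
        return []
    workers = min(MAX_CONCURRENT_CHUNKS, len(chunks))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, even if chunks finish out of order.
        return list(pool.map(transcribe_one, chunks))


def estimated_seconds(audio_seconds: float, speedup: float = 30.0) -> float:
    """Wall-clock estimate at a given real-time factor (30x is the stated best case)."""
    return audio_seconds / speedup
```

At 30x real-time, a one-hour recording (3,600 seconds) would take about 120 seconds. Actual throughput depends on the model size, hardware, and how evenly the audio splits into chunks.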

## 🤝 Contributing

Issues and Pull Requests are welcome!

## 📜 License

MIT License

## 🔗 Related Links

- **Project Documentation**: See the `docs/` directory in the repository
- **Test Coverage**: 29 test cases to ensure functional stability
- **Modal Deployment**: High-performance cloud processing via Modal

---
*Last updated: 2025-06-11*