# audio-summarize An audio summarizer that glues together ffmpeg, whisper.cpp and BART. ## Dependencies - Python 3 (tested: 3.12) - ffmpeg - git - make - c/c++ compiler (on Ubuntu, installing `build-essential` does the trick) ## Setup Create a virtual environment for python and activate it: ```bash python3 -m venv .venv source .venv/bin/activate ``` Run setup.sh ```bash ./setup.sh ``` ## Run 1. You need a whisper.cpp compatible model file (-> https://huggingface.co/ggerganov/whisper.cpp) 2. In your terminal, make shure you have your python venv activated 3. Run audio-summarize.py ### Usage ``` ./audio-summarize.py -m filepath -i filepath -o filepath [--summin n] [--summax n] [--segmax n] options: -h, --help show this help message and exit --summin n The minimum lenght of a segment summary [10, min: 5] --summax n The maximum lenght of a segment summary [90, min: 5] --segmax n The maximum number of tokens per segment [375, 5 - 500] -m filepath The path to a whisper.cpp-compatible model file -i filepath The path to the media file -o filepath Where to save the output text to ``` Example: ```bash ./audio-summarize.py -m ./tmp/whisper_ggml-small.en-q5_1.bin -i ./tmp/test.webm -o ./tmp/output.txt ``` ## How does it work? To summarize a media file, the program executes the following steps: 1. Convert the media file with [ffmpeg](https://www.ffmpeg.org/) to a mono 16kHz 16bit-PCM wav file 2. Transcribe that wav file using [whisper.cpp](https://github.com/ggerganov/whisper.cpp) 3. Clean up the transcript (newlines, whitespaces at the beginning and end) 4. Semantically split up the transcript into segments using [semantic-text-splitter](https://github.com/benbrandt/text-splitter) and the tokenizer for BART 5. Summarize each segment using BART ([`facebook/bart-large-cnn`](https://huggingface.co/facebook/bart-large-cnn)) 6. Write the results to a text file