Speech to Text in Python using Google Cloud API.

Looking for Python code that can take a 30-minute mp4 video file, run it against the Google Speech-to-Text API (https://cloud.google.com/speech-to-text/), and return the transcribed text.

For an extra $15, it should cleanly format the transcribed text by speaker (https://cloud.google.com/speech-to-text/docs/multiple-voices) and return timestamps. This should use the 'Speaker diarization' feature.

For example:

Speaker 1 - 00:00:05 - "Hi, I would like to welcome everyone to this webinar and introduce my guest Tim"

Speaker 2 - 00:00:10 - "Thanks John, glad to be here"

Note - Google Cloud provides a free sign-up and trial service, so you should be able to build / test the solution at no cost.

Looking forward to the solution. Please comment your code, as I hope to learn from it. Ask me any questions; this is my first time trying Bountify.

awarded to ocanal


1 Solution

Winning solution
Tipped

Here is my shot,

You need to install moviepy to convert the mp4 to an audio file.

$ pip install moviepy

Then call the script with the mp4 file as its argument:

python speech.py test.mp4

UPDATE-1: I updated the code slightly for the new Cloud Speech beta version. It now uses long_running_recognize, because you expect to convert a long video.

Here you can download a test.mp4 file.

UPDATE-2: I completely updated my solution to use the Speaker diarization feature. Here is the sample output.txt for test.mp4 (an Obama interview).

UPDATE-3: As you mentioned, if the audio recording is long, Google requires a Cloud Storage URL for the audio file.

So the process has changed:

1- extract audio from video file
2- upload that audio file to google-cloud-storage
3- convert it to text.
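
The reason for step 2 can be sketched as a simple size check. This is only a minimal sketch: the 10485760-byte figure is taken from the error message quoted in the comments below, and the helper name needs_cloud_storage is hypothetical, not part of the Google client library. The real limit applies to the whole request, so treat the comparison as approximate.

```python
import os

# Limit quoted in the API error message
# ("Request payload size exceeds the limit: 10485760 bytes.")
INLINE_LIMIT_BYTES = 10485760

def needs_cloud_storage(audio_path, limit=INLINE_LIMIT_BYTES):
    """Return True if the audio file is too large to send inline.

    Hypothetical helper: it only compares the file size on disk
    with the limit and does not call any Google API.
    """
    return os.path.getsize(audio_path) > limit
```

For the 25.1 MB mp3 mentioned in the comments, this check would return True, which is why the audio is uploaded to a bucket and passed as a gs:// URI instead.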

So here is the updated code, which handles the whole process.

Don't forget to:

install the google-cloud-storage library

create a storage bucket and change the line bucket_name = "bountify-test-bucket" to use your bucket name.

UPDATE-4: There was a bug in parsing the words of the speech, so the last part of the speech was missing from output.txt. I updated the code again.

import sys
import io
import os
import datetime

from moviepy.editor import *
# Note: the enums and types modules below exist only in google-cloud-speech releases before 2.0
from google.cloud import speech_v1p1beta1 as speech
from google.cloud.speech_v1p1beta1 import enums
from google.cloud.speech_v1p1beta1 import types
from google.cloud import storage

# Global variables: please change these to your own values
bucket_name = "bountify-test-bucket"
output_file = "output.txt"

def convert_video_to_mp3(video_file_path):
    audio_file_path = "generated.mp3"
    video = VideoFileClip(video_file_path)
    audio = video.audio
    audio.write_audiofile(audio_file_path)
    return audio_file_path

def upload_file_to_cloud_storage(audio_file):
    bucket_url = "gs://{}/{}".format(bucket_name, audio_file)
    # The audio file was written to the current working directory, so resolve it from there
    file_name = os.path.abspath(audio_file)
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(audio_file)
    print(u"{} is uploading to storage...".format(bucket_url))
    blob.upload_from_filename(file_name)
    print("File is uploaded.")
    os.remove(file_name)
    return bucket_url

def convert_speech_to_text(bucket_file_url, output_file = "output.txt"):
    client = speech.SpeechClient()

    audio = {"uri": bucket_file_url}
    config = types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.MP3,
        sample_rate_hertz=16000,
        enable_speaker_diarization=True,
        diarization_speaker_count=2,
        model="phone_call",
        enable_word_time_offsets=True,
        use_enhanced=True,
        language_code="en-US")

    operation = client.long_running_recognize(config, audio)

    print(u"Waiting for speech-to-text operation to complete...")
    response = operation.result()

    with open(output_file, "w") as text_file:
        for result in response.results:
            # Use the most likely transcription of this result
            alternative = result.alternatives[0]
            # Segment state: current speaker, accumulated text, start time in seconds
            current_speaker_tag=-1
            transcript = ""
            time = 0
            for word in alternative.words:
                # A change of speaker tag ends the current segment, so write it out
                if word.speaker_tag != current_speaker_tag:
                    if (transcript != ""):
                        print(u"Speaker {} - {} - {}".format(current_speaker_tag, str(datetime.timedelta(seconds=time)), transcript), file=text_file)
                    transcript = ""
                    current_speaker_tag = word.speaker_tag
                    time = word.start_time.seconds

                transcript = transcript + " " + word.word
        # Write the final segment, which has no following speaker change to trigger it
        if transcript != "":
            print(u"Speaker {} - {} - {}".format(current_speaker_tag, str(datetime.timedelta(seconds=time)), transcript), file=text_file)
        print(u"Speech to text operation is completed, output file is created: {}".format(output_file))

# Main program starts here
video_file_path = sys.argv[1]
audio_file = convert_video_to_mp3(video_file_path)
audio_file_bucket_url = upload_file_to_cloud_storage(audio_file)
convert_speech_to_text(audio_file_bucket_url, output_file)
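
The grouping loop in convert_speech_to_text can also be isolated into a pure function, which makes the "last segment" bug from UPDATE-4 easy to test without calling the API. This is only a sketch under assumptions: the words are plain (speaker_tag, start_seconds, word) tuples rather than API objects, and the names group_words_by_speaker and format_segment are hypothetical. It joins words with single spaces, so the spacing differs slightly from the script's output.

```python
import datetime

def group_words_by_speaker(words):
    """Group (speaker_tag, start_seconds, word) tuples into segments.

    Returns a list of (speaker_tag, start_seconds, text) tuples, one
    per uninterrupted run of words from the same speaker.
    """
    segments = []
    current_tag = None
    start = 0
    parts = []
    for tag, start_seconds, word in words:
        if tag != current_tag:
            # Speaker changed: close the previous segment, if any
            if parts:
                segments.append((current_tag, start, " ".join(parts)))
            current_tag, start, parts = tag, start_seconds, []
        parts.append(word)
    # Flush the final segment; skipping this drops the end of the transcript
    if parts:
        segments.append((current_tag, start, " ".join(parts)))
    return segments

def format_segment(tag, start_seconds, text):
    """Format one segment in the same style as output.txt."""
    timestamp = datetime.timedelta(seconds=start_seconds)
    return "Speaker {} - {} - {}".format(tag, timestamp, text)
```

For example, the words [(1, 0, "hello"), (1, 1, "there"), (2, 3, "hi")] produce two segments, and the second one is only emitted because of the final flush.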

Result:

Speaker 1 - 0:00:00 -  you talked the president like Trump now several times over the course this transition what have you tried to impress on him about the job for
Speaker 2 - 0:00:06 -  conversations have been cordial he has been open to suggestions off and the main thing that I've tried to
Speaker 1 - 0:00:14 -  transmit is that there's a difference between governing and campaigning so that what he has to appreciate is
Speaker 2 - 0:00:26 -  as soon as you walk into this office after you've been sworn in your now in charge of the largest organization on
Speaker 1 - 0:00:31 -  Earth you can't manage it
Speaker 2 - 0:00:35 -  the way you would manage a family business you have to have a strong team around you you have to have respect for
Speaker 1 - 0:00:42 -  institutions and the process to
Speaker 2 - 0:00:45 -  make good decisions because you are inherently reliant on other folks how is he impressed
Speaker 1 - 0:00:49 -  here you
Speaker 2 - 0:00:52 -  know he is somebody who I think is very
Speaker 1 - 0:00:57 -  engaging and gregarious like him you know I've
Speaker 2 - 0:01:01 -  enjoyed the conversations that we've had he is somebody who I think is not lacking in
Speaker 1 - 0:01:06 -  confidence which is I think
Speaker 2 - 0:01:09 -  some say that about you tell that's what I was saying it's probably a prerequisite for the
Speaker 1 - 0:01:13 -  job or at
Speaker 2 - 0:01:15 -  least you have to have enough craziness to think that you can do the
Speaker 1 - 0:01:19 -  job I think that he has not spent a lot of time
Speaker 2 - 0:01:30 -  sweating the details
Speaker 1 - 0:01:31 -  of you know all the policies that
Speaker 2 - 0:01:36 -  that were you well I think that can be both a strength and a weakness I think it depends on how he
Speaker 1 - 0:01:44 -  approaches it if he if it gives him fresh eyes then that can be valuable
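
The timestamps above come from str(datetime.timedelta(...)), which prints 0:00:06 rather than the zero-padded 00:00:05 style used in the question. If the padded form is wanted, a small helper along these lines could be used; format_hms is a hypothetical name and is not part of the solution above.

```python
def format_hms(total_seconds):
    """Format a number of seconds as zero-padded HH:MM:SS."""
    hours, remainder = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    return "{:02d}:{:02d}:{:02d}".format(hours, minutes, seconds)
```

In the script, this would stand in for str(datetime.timedelta(seconds=time)) in the two print calls.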
That worked! Interested in the extra tasks for the tip?
monkeydust 8 days ago
Hey, I updated my solution.
ocanal 8 days ago
And I have to say that speaker diarization is a beta feature and does not work very well yet.
ocanal 8 days ago
Hi, it works well on your test.mp4. I'm getting this error on my file, which creates a 25.1 MB mp3 file. Is there some setting / code change to allow this to be processed, or something I need to configure on the Google side?
monkeydust 6 days ago
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with: status = StatusCode.INVALID_ARGUMENT details = "Request payload size exceeds the limit: 10485760 bytes." debug_error_string = "{"created":"@1575031213.260000000","description":"Error received from peer ipv4:172.217.169.74:443","file":"src/core/lib/surface/call.cc","file_line":1055,"grpc_message":"Request payload size exceeds the limit: 10485760 bytes.","grpc_status":3}"
monkeydust 6 days ago
Hey monkeydust, it's getting a little bit complicated, but I handled it. You have to upload the converted audio file to storage, so I also added the upload_file_to_cloud_storage function. Hope it works :)
ocanal 6 days ago
Hi Ocanal - almost there! Operations 1 + 2 are fine, but 3 (the returned text) truncates before the full length of the video I am trying to transcribe. I can see the full mp3 file in the Google bucket, so the upload works fine, and the processing takes time, but the output file is small. For example, try this: http://www.obamadownloads.com/videos/un-mdg-speech.mp4. I get output for this up to 11 minutes but not beyond - https://pastebin.com/9dtF42w5 Any idea? As said at the start of the project, the goal is a 30-minute video. I didn't think it would get this complicated, but I appreciate your persistence and will leave a bit more on the tip side if we can get this working.
monkeydust 5 days ago
Hi monkeydust, that is my bad. I made a mistake, so the last part of the speech was not printed to output.txt. I fixed the problem and updated the code. Actually, I only added an extra print in the convert_speech_to_text function, as you can see. I also tested it with the video you shared, and it's working well :) Sorry for the late reply; I think it's because of the time-zone difference.
ocanal 4 days ago