To make coding against the Speechly Streaming API easier, we provide client libraries for various platforms that reduce the amount of code you need to write and make your code more robust.
Make sure you have created and deployed a Speechly application. Take note of the App ID, you’ll need it shortly.
You’ll also need a React app. Use your existing app, or create a new one using:
npx create-react-app my-app
1. Install the @speechly/react-client package:
npm install @speechly/react-client
# or
yarn add @speechly/react-client
2. Import SpeechProvider and wrap the app with it, passing the App ID of your Speechly application:
// index.js
import React from 'react';
import ReactDOM from 'react-dom/client';
import { SpeechProvider } from '@speechly/react-client';
import App from './App';

const root = ReactDOM.createRoot(document.getElementById('root'));
root.render(
  <React.StrictMode>
    <SpeechProvider
      appId="YOUR_APP_ID"
      logSegments
      debug
    >
      <App />
    </SpeechProvider>
  </React.StrictMode>
);
SpeechProvider properties:

- appId: your Speechly application ID (required)
- logSegments: logs the speech segment emitted by the client library (optional)
- debug: prints the decoder status, status changes and possible errors (optional)

3. Start the development server:
npm run start
# or
yarn start
Navigate to http://localhost:3000 to see your app running. Next, let’s capture some audio!
There are two ways of capturing audio from the browser microphone: with a push-to-talk button or with voice activity detection (VAD). To use other audio sources, like a MediaStream or pre-recorded audio files, see the Use other audio sources section.
Import the useSpeechContext hook, create a button to initialize the microphone, another button for toggling the microphone, and then display the transcript:
import { useSpeechContext } from '@speechly/react-client';
function App() {
const { segment, listening, attachMicrophone, start, stop } = useSpeechContext();
return (
<div className="App">
<button onClick={attachMicrophone}>Initialize microphone</button>
<button onPointerDown={start} onPointerUp={stop}>
{listening ? 'Listening…' : 'Push to talk'}
</button>
<p>
{segment && segment.words.map(word => word.value).join(' ')}
</p>
</div>
);
}
Enable vad in SpeechProvider:
<SpeechProvider
appId="YOUR_APP_ID"
logSegments
debug
vad={{ enabled: true }}
>
Import the useSpeechContext hook, create a button to initialize the microphone and then display the transcript:
import { useSpeechContext } from '@speechly/react-client';
function App() {
const { segment, attachMicrophone } = useSpeechContext();
return (
<div className="App">
<button onClick={attachMicrophone}>Initialize microphone</button>
<p>
{segment && segment.words.map(word => word.value).join(' ')}
</p>
</div>
);
}
When VAD is enabled, it takes control of calling start and stop, so no microphone button is needed. To adjust the VAD behavior, see VadOptions for available options.
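Since VAD starts and stops listening automatically, the only UI you might still want is some feedback about what the client is doing. Here is a minimal sketch building on the component above, using only the listening flag already exposed by useSpeechContext (the "Waiting for speech" label is just an illustrative string):

function App() {
  const { segment, listening, attachMicrophone } = useSpeechContext();

  return (
    <div className="App">
      <button onClick={attachMicrophone}>Initialize microphone</button>
      {/* VAD calls start/stop for us; we only reflect the current state */}
      <p>{listening ? 'Listening…' : 'Waiting for speech'}</p>
      <p>
        {segment && segment.words.map(word => word.value).join(' ')}
      </p>
    </div>
  );
}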
Regardless of which way you chose, go ahead and press the Initialize microphone button and you should be prompted for microphone permissions. Press Allow and say something like “hello world” to verify that the integration works. If you have logSegments enabled, open the developer console to see the raw speech segments.
A common practice in React apps is to use the useEffect hook to subscribe to changes in segment and useState to make changes to the app state.
Use segment.isFinal to check the segment state. When false, the segment might be updated several times. When true, the segment won’t be updated anymore, and subsequent callbacks within the same audio context refer to the next segment.
List all transcripts and show the tentative transcript:
const { segment } = useSpeechContext();
const [tentativeTranscript, setTentativeTranscript] = useState("");
const [transcripts, setTranscripts] = useState([]);
useEffect(() => {
if (segment) {
// Handle speech segment and make tentative changes to app state
const plainString = segment.words.map(word => word.value).join(' ');
setTentativeTranscript(plainString);
if (segment.isFinal) {
// Handle speech segment and make permanent changes to app state
setTentativeTranscript("");
setTranscripts(current => [...current, plainString]);
}
}
}, [segment]);
return (
<div className="App">
{transcripts?.map((value) =>
<p>{value}</p>
)}
{tentativeTranscript && <p><em>{tentativeTranscript}</em></p>}
</div>
);
Each word in the segment.words array has startTimestamp and endTimestamp properties, indicating the start/end timestamp of the word from the start of the audio stream, counted in milliseconds.
1. Extract the startTimestamp from the first word in the segment and store it in an object together with the transcript:
const [tentativeTranscript, setTentativeTranscript] = useState("");
const [transcripts, setTranscripts] = useState([]);
useEffect(() => {
if (segment) {
const plainString = segment.words.map(word => word.value).join(' ');
setTentativeTranscript(plainString);
if (segment.isFinal) {
setTentativeTranscript("");
setTranscripts(current => [...current, {
timestamp: segment.words[0].startTimestamp,
value: plainString,
}]);
}
}
}, [segment]);
2. Show and format the timestamp:
return (
<div className="App">
{transcripts?.map(({ timestamp, value }) =>
<p>{formatDuration(timestamp)}: {value}</p>
)}
{tentativeTranscript && <em>{tentativeTranscript}</em>}
</div>
);
In this example we’re using the format-duration package for formatting milliseconds to a standard duration string.
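If you want to do the same, the usage is roughly the following sketch; it assumes format-duration is installed and that its default export takes a millisecond value (check the package’s README for the exact output format):

import formatDuration from 'format-duration';

// Formats a millisecond value into a duration string,
// e.g. formatDuration(65000) should produce something like "1:05"
const label = formatDuration(timestamp);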
Speechly can extract intents and entities from the user’s speech based on your training data. They are available in the segment data structure via segment.intent and segment.entities. Note that each segment can only have one intent, but possibly several entities.
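To get a feel for the data, you can log both from the same useEffect used earlier. A minimal sketch using only fields that appear elsewhere in this guide (the logged values depend entirely on your configuration):

useEffect(() => {
  if (segment) {
    // Each segment carries at most one intent…
    console.log('intent:', segment.intent?.intent);
    // …but possibly several entities
    segment.entities.forEach(entity => {
      console.log('entity:', entity.type, entity.value);
    });
  }
}, [segment]);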
1. Update the configuration, providing example utterances for what’s offensive and what’s not:
*offensive thats [bullshit | bollocks](profanity)
*not_offensive thats [interesting | nice | great]
Remember to declare an entity named profanity of type string and to deploy the changes.
2. Create a component for showing the transcript and bleeping out entities with the type profanity:
const findEntity = (word, entities) =>
entities.find(entity => word.index >= entity.startPosition && word.index < entity.endPosition)
const matchWordsWithEntities = (words, entities) =>
words.flatMap(word => [{ word, entity: findEntity(word, entities) }])
export const MyComponent = ({ entities, words }) => {
const wordsWithEntities = matchWordsWithEntities(words, entities);
return (
<div className="MyComponent">
{wordsWithEntities?.map(({ word, entity }) =>
<>
{!entity && (
<span>{word.value} </span>
)}
{entity && (
<span>{entity.type === "profanity" ? "******" : entity.value} </span>
)}
</>
)}
</div>
)
}
3. Update and render the allSegments array while showing an alert if an intent has the value offensive:
import { MyComponent } from './components/MyComponent';
const { segment } = useSpeechContext();
const [allSegments, setAllSegments] = useState([]);
const updateOrAddSegment = segment => {
const newArray = Array.from(allSegments);
const i = newArray.findIndex(item => item.contextId === segment.contextId);
if (i > -1) newArray[i] = segment;
else newArray.push(segment);
setAllSegments(newArray);
};
useEffect(() => {
if (segment) {
updateOrAddSegment(segment);
if (segment.isFinal) {
if (segment.intent?.intent === "offensive") {
window.alert("Hey, no cursing here ☝️");
}
updateOrAddSegment(segment);
}
}
}, [segment]);
return (
<div className="App">
{allSegments?.map(({ entities, words }) =>
<MyComponent entities={entities} words={words} />
)}
</div>
);
The example above assumes that your entities are single word entities. If you expect entities with multiple words like “two weeks” or “next tuesday”, you need to handle the matching of words with entities.
Each word in segment.words has an index property, indicating its index within a segment. Similarly, each entity in segment.entities has startPosition and endPosition properties: startPosition is the index of the entity’s first word and endPosition is the index just past its last word, which is why the code checks word.index < entity.endPosition.
You can modify the matchWordsWithEntities function to deal with this:
const isWordPartOfEntity = (entity, word) =>
  word && word.index >= entity.startPosition && word.index < entity.endPosition;

const matchWordsWithEntities = (words, entities) => {
  const wordsBuilder = [];
  words.forEach(word => {
    if (!word) return;
    const wordEntity = entities.find(entity => isWordPartOfEntity(entity, word));
    if (!wordEntity) {
      // Word is not part of any entity, keep it as-is
      wordsBuilder[word.index] = { word };
    } else {
      // Combine all words belonging to this entity into a single value
      const combinedWordValue = words
        .filter(word => isWordPartOfEntity(wordEntity, word))
        .map(word => word.value)
        .join(" ");
      wordsBuilder[wordEntity.startPosition] = {
        word: { ...word, value: combinedWordValue, index: wordEntity.startPosition },
        entity: wordEntity
      };
    }
  });
  return wordsBuilder.filter(item => item.word);
};
If you’d like new speech segments to be emitted when the user pauses for a short while, you need to enable Silence triggered segmentation. You can enable it in the Speechly Dashboard by going to Application → Settings, or by adding it to your config.yaml. If you have trouble figuring out a good value, start with 720 and play around with it.
Depending on what your app does and how your segment handling is written, this may or may not require changes to the code. For example, enabling this in the profanity example above will introduce a bug where the updated segment sometimes overrides the previous one!
To fix this, add another condition to the updateOrAddSegment function:
const updateOrAddSegment = segment => {
const newArray = Array.from(allSegments);
//highlight-next-line
const i = newArray.findIndex(item => item.contextId === segment.contextId && item.id === segment.id);
if (i > -1) newArray[i] = segment;
else newArray.push(segment);
setAllSegments(newArray);
};
The reason for this is that an audio context (contextId) can consist of several consecutive segments (id). While Silence triggered segmentation was disabled, only one segment per context was emitted. Now that it’s enabled, we need to make sure that both contextId and id match in order to perform an update.
In addition to using the browser microphone, you can also attach a MediaStream or upload a pre-recorded audio file. For this purpose the React client library exposes a client interface, which provides low-level access to the underlying Speechly BrowserClient.
When using a MediaStream or a file as the audio source, consider using VAD, as it provides a better hands-free experience while eliminating timeout related issues.
To use your webcam (or some other MediaStream), use the client.attach() method to attach it:
const { client, segment } = useSpeechContext();
const handleMediaStream = async (stream) => {
await client.attach(stream);
};
return (
<div className="App">
<Webcam muted audio={true} onUserMedia={handleMediaStream} />
{/* ... */}
</div>
);
In this example we’re using the react-webcam package to create a webcam component. The component has a callback function for when it receives a MediaStream. We need to provide a handler that attaches the MediaStream to the client.
If you have VAD enabled and need to stop listening (for example if the user muted their microphone), you can use the client.adjustAudioProcessor() method and pass { vad: { enabled: false } } to it. You can use the same method to pass other VadOptions as well.
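For example, a mute toggle might look like the following minimal sketch; the muted state and toggleMute handler are illustrative names, not part of the client API:

const { client } = useSpeechContext();
const [muted, setMuted] = useState(false);

const toggleMute = () => {
  // Disable VAD while muted so the client stops listening,
  // re-enable it when the user unmutes
  client.adjustAudioProcessor({ vad: { enabled: muted } });
  setMuted(!muted);
};

return (
  <button onClick={toggleMute}>{muted ? 'Unmute' : 'Mute'}</button>
);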
To upload a pre-recorded audio file, use the client.uploadAudioData() method:
const { client, segment } = useSpeechContext();
const sendAudioToSpeechly = async (files) => {
if (files === null) return
try {
const buffer = await files[0].arrayBuffer();
await client.uploadAudioData(buffer);
} catch (e) {
console.error(e)
}
}
return (
<div className="App">
<input
type="file"
accept="audio/*"
onChange={e => sendAudioToSpeechly(e.target.files)}
onClick={e => e.target.value = ""}
/>
{/* ... */}
</div>
)
Please note that the client.uploadAudioData() method expects the audio data to be in binary format.