Talking Head (3D)

Demo Videos

Video	Description
<span style="display: block; min-width:400px"><img src="images/screenshot4.jpg" width="400"/></span>	I chat with Jenny and Harri. The close-up view allows you to evaluate the accuracy of lip-sync in both English and Finnish. Using GPT-3.5 and Microsoft text-to-speech.
<img src="images/screenshot5.jpg" width="400"/>	A short demo of how AI can control the avatar's movements. Using OpenAI's function calling and Google TTS with the TalkingHead's built-in viseme generation.
<img src="images/screenshot6.jpg" width="400"/>	Michael lip-syncs to two MP3 audio tracks using OpenAI's Whisper and TalkingHead's `speakAudio` method. He kicks things off with some casual talk, but then goes all out by trying to tackle an old Meat Loaf classic. 🤘 Keep rockin', Michael! 🎤😂
<img src="images/screenshot3.jpg" width="400"/>	Julia and I showcase some of the features of the TalkingHead class and the test app including the settings, some poses and animations.

All the demo videos are real-time screen captures from a Chrome browser running the TalkingHead test web app without any post-processing.

Use Case Examples

Video/App	Use Case
<span style="display: block; min-width:400px"><img src="images/olivia.jpg" width="400"/></span>	Video conferencing. A video conferencing solution with real-time transcription, contextual AI responses, and voice lip-sync. The app and demo, featuring Olivia, by namnm 👍
<img src="images/geminicompetition.jpg" width="400"/>	Recycling Advisor 3D. Snap a photo and get local recycling advice from a talking avatar. My entry for the Gemini API Developer Competition.
<img src="images/evertrail.jpg" width="400"/>	Live Twitch adventure. Evertrail is an infinite, real-time generated world where all of your choices shape the outcome. Video clip and the app by JPhilipp 👏👏
<img src="images/cliquevm.jpg" width="400"/>	Quantum physics using a blackboard. David introduces us to the CHSH game and explores the mystery of quantum entanglement. For more information about the research project, see CliqueVM.
<img src="images/interactiveportfolio.jpg" width="400"/>	Interactive Portfolio. Click the image to open the app, where you can interview the virtual persona of its developer, AkshatRastogi-1nC0re 👋

Introduction

Talking Head (3D) is a JavaScript class featuring a 3D avatar that can speak and lip-sync in real-time. The class supports Ready Player Me full-body 3D avatars (GLB), Mixamo animations (FBX), and subtitles. It also knows a set of emojis, which it can convert into facial expressions.

By default, the class uses Google Cloud TTS for text-to-speech and has built-in lip-sync support for English, Finnish, and Lithuanian (beta). New lip-sync languages can be added by creating new lip-sync language modules. It is also possible to integrate the class with an external TTS service, such as Microsoft Azure Speech SDK or ElevenLabs WebSocket API.

The class uses ThreeJS / WebGL for 3D rendering.

Talking Head class

You can download the TalkingHead modules from releases (without dependencies). Alternatively, you can import all the needed modules from a CDN:

<script type="importmap">
{ "imports":
  {
    "three": "https://cdn.jsdelivr.net/npm/three@0.161.0/build/three.module.js/+esm",
    "three/addons/": "https://cdn.jsdelivr.net/npm/three@0.161.0/examples/jsm/",
    "talkinghead": "https://cdn.jsdelivr.net/gh/met4citizen/TalkingHead@1.2/modules/talkinghead.mjs"
  }
}
</script>

If you want to use the built-in Google TTS and lip-sync using Single Sign-On (SSO) functionality, give the class your TTS proxy endpoint and a function from which to obtain the JSON Web Token needed to use that proxy. Refer to Appendix B for one way to implement JWT SSO.

import { TalkingHead } from "talkinghead";

// Create the talking head avatar
const nodeAvatar = document.getElementById('avatar');
const head = new TalkingHead( nodeAvatar, {
  ttsEndpoint: "/gtts/",
  jwtGet: jwtGet,
  lipsyncModules: ["en", "fi"]
});
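
The `jwtGet` option above expects a function that returns the JSON Web Token. As a rough sketch of what such a function might look like, the factory below caches a token and refreshes it shortly before it expires. The endpoint path `/app/jwt/get`, the `{ jwt }` response shape, the 9-minute lifetime, and the helper name `makeJwtGet` are all assumptions made for illustration; your SSO backend will define its own.

```javascript
// Sketch only: the endpoint path, the response shape { jwt } and the
// token lifetime are assumptions, not part of the TalkingHead API.
function makeJwtGet(fetchFn, url = "/app/jwt/get", lifetimeMs = 9 * 60 * 1000) {
  let token = null; // last token received
  let expires = 0;  // local expiry time in epoch milliseconds

  return async function jwtGet() {
    const marginMs = 60 * 1000; // refresh one minute before expiry
    if (!token || Date.now() + marginMs >= expires) {
      const res = await fetchFn(url, { cache: "no-store" });
      if (!res.ok) throw new Error("JWT request failed: " + res.status);
      const body = await res.json();
      token = body.jwt;
      expires = Date.now() + lifetimeMs;
    }
    return token;
  };
}

// In a browser:
// const jwtGet = makeJwtGet((...args) => fetch(...args));
```

Passing the fetch function in as a parameter keeps the helper easy to try out with a stub before a real token endpoint exists.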

FOR HOBBYISTS: If you're just looking to experiment on your personal laptop without dealing with proxies, JSON Web Tokens, or Single Sign-On, take a look at the minimal code example. Simply download the file, add your Google TTS API key, and you'll have a basic web app template with a talking head.

The following table lists all the available options and their default values:

Option	Description
`jwtGet`	Function to get the JSON Web Token (JWT). See Appendix B for more information.
`ttsEndpoint`	Text-to-speech backend/endpoint/proxy implementing the Google Text-to-Speech API.
`ttsApikey`	If you don't want to use a proxy or JWT, you can use Google TTS endpoint directly and provide your API key here. NOTE: I recommend that you don't use this in production and never put your API key in any client-side code.
`ttsLang`	Google text-to-speech language. Default is `"fi-FI"`.
`ttsVoice`	Google text-to-speech voice. Default is `"fi-FI-Standard-A"`.
`ttsRate`	Google text-to-speech rate in the range [0.25, 4.0]. Default is `0.95`.
`ttsPitch`	Google text-to-speech pitch in the range [-20.0, 20.0]. Default is `0`.
`ttsVolume`	Google text-to-speech volume gain (in dB) in the range [-96.0, 16.0]. Default is `0`.
`ttsTrimStart`	Trim the viseme sequence start relative to the beginning of the audio (shift in milliseconds). Default is `0`.
`ttsTrimEnd`	Trim the viseme sequence end relative to the end of the audio (shift in milliseconds). Default is `300`.
`lipsyncModules`	Lip-sync modules to load dynamically at start-up. Limiting the number of language modules improves the loading time and memory usage. Default is `["en", "fi", "lt"]`. [≥`v1.2`]
`lipsyncLang`	Lip-sync language. Default is `"fi"`.
`pcmSampleRate`	PCM (signed 16bit little endian) sample rate used in `speakAudio` in Hz. Default is `22050`.
`modelRoot`	The root name of the armature. Default is `Armature`.
`modelPixelRatio`	Sets the device's pixel ratio. Default is `1`.
`modelFPS`	Frames per second. Note that actual frame rate will be a bit lower than the set value. Default is `30`.
`modelMovementFactor`	A factor in the range [0,1] limiting the avatar's upper body movement when standing. Default is `1`. [≥`v1.2`]
`cameraView`	Initial view. Supported views are `"full"`, `"mid"`, `"upper"` and `"head"`. Default is `"full"`.
`cameraDistance`	Camera distance offset for initial view in meters. Default is `0`.
`cameraX`	Camera position offset in X direction in meters. Default is `0`.
`cameraY`	Camera position offset in Y direction in meters. Default is `0`.
`cameraRotateX`	Camera rotation offset in X direction in radians. Default is `0`.
`cameraRotateY`	Camera rotation offset in Y direction in radians. Default is `0`.
`cameraRotateEnable`	If true, the user is allowed to rotate the 3D model. Default is `true`.
`cameraPanEnable`	If true, the user is allowed to pan the 3D model. Default is `false`.
`cameraZoomEnable`	If true, the user is allowed to zoom the 3D model. Default is `false`.
`lightAmbientColor`	Ambient light color. The value can be a hexadecimal color or CSS-style string. Default is `0xffffff`.
`lightAmbientIntensity`	Ambient light intensity. Default is `2`.
`lightDirectColor`	Directional light color. The value can be a hexadecimal color or CSS-style string. Default is `0x8888aa`.
`lightDirectIntensity`	Directional light intensity. Default is `30`.
`lightDirectPhi`	Directional light phi angle. Default is `0.1`.
`lightDirectTheta`	Directional light theta angle. Default is `2`.
`lightSpotColor`	Spot light color. The value can be a hexadecimal color or CSS-style string. Default is `0x3388ff`.
`lightSpotIntensity`	Spot light intensity. Default is `0`.
`lightSpotPhi`	Spot light phi angle. Default is `0.1`.
`lightSpotTheta`	Spot light theta angle. Default is `4`.
`lightSpotDispersion`	Spot light dispersion. Default is `1`.
`avatarMood`	The mood of the avatar. Supported moods: `"neutral"`, `"happy"`, `"angry"`, `"sad"`, `"fear"`, `"disgust"`, `"love"`, `"sleep"`. Default is `"neutral"`.
`avatarMute`	Mute the avatar. This can be a helpful option if you want to output subtitles without audio and lip-sync. Default is `false`.
`markedOptions`	Options for Marked markdown parser. Default is `{ mangle:false, headerIds:false, breaks: true }`.
`statsNode`	Parent DOM element for the three.js stats display. If `null`, don't use. Default is `null`.
`statsStyle`	CSS style for the stats element. If `null`, use the three.js default style. Default is `null`.
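
Several of the numeric TTS options above have documented ranges. If the values come from user input, it may be worth clamping them before passing them on. The helper below is a hypothetical sketch, not part of the class; whether the class or the TTS backend validates these values itself is not stated here.

```javascript
// Ranges copied from the options table above.
const TTS_RANGES = {
  ttsRate:   [0.25, 4.0],
  ttsPitch:  [-20.0, 20.0],
  ttsVolume: [-96.0, 16.0]
};

// Hypothetical helper: returns a copy of `opt` with the numeric TTS
// options clamped to their documented ranges. Other keys pass through.
function clampTtsOptions(opt) {
  const out = { ...opt };
  for (const [key, [min, max]] of Object.entries(TTS_RANGES)) {
    if (typeof out[key] === "number" && !Number.isNaN(out[key])) {
      out[key] = Math.min(max, Math.max(min, out[key]));
    }
  }
  return out;
}

const safe = clampTtsOptions({ ttsLang: "en-GB", ttsRate: 5, ttsPitch: -30, ttsVolume: 0 });
// ttsRate becomes 4 and ttsPitch becomes -20; ttsLang and ttsVolume are unchanged.
console.log(safe);
```

The resulting object can then be merged into the constructor options or passed as the `opt` argument of `speakText`.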

Once the instance has been created, you can load and display your avatar. Refer to Appendix A for how to make your avatar:

// Load and show the avatar
try {
  await head.showAvatar( {
    url: './avatars/brunette.glb',
    body: 'F',
    avatarMood: 'neutral',
    ttsLang: "en-GB",
    ttsVoice: "en-GB-Standard-A",
    lipsyncLang: 'en'
  });
} catch (error) {
  console.log(error);
}
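
`showAvatar` also accepts an optional `onprogress` callback, which can drive a simple loading indicator. The sketch below assumes the callback receives a `ProgressEvent`-like object with `lengthComputable`, `loaded` and `total`; that shape, the helper name `progressText` and the element id `loading` are assumptions for illustration.

```javascript
// Assumption: `ev` is ProgressEvent-like ({ lengthComputable, loaded, total }).
// Returns a short status string for a loading indicator.
function progressText(ev) {
  if (!ev || !ev.lengthComputable || !(ev.total > 0)) {
    return "Loading...";
  }
  const pct = Math.min(100, Math.round((ev.loaded / ev.total) * 100));
  return "Loading " + pct + "%";
}

// Usage sketch in the browser (element id 'loading' is hypothetical):
// const nodeLoading = document.getElementById('loading');
// await head.showAvatar(
//   { url: './avatars/brunette.glb', body: 'F', lipsyncLang: 'en' },
//   (ev) => { nodeLoading.textContent = progressText(ev); }
// );
```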

The following example makes the avatar speak the contents of the `text` input field when the `speak` button is clicked:

// Speak 'text' when the button 'speak' is clicked
const nodeSpeak = document.getElementById('speak');
nodeSpeak.addEventListener('click', function () {
  try {
    const text = document.getElementById('text').value;
    if ( text ) {
      head.speakText( text );
    }
  } catch (error) {
    console.log(error);
  }
});
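
The speech queue can also hold breaks, and `speakText` takes an `onsubtitles` callback; both are described in the methods table that follows. As a small sketch, the hypothetical helper below queues several sentences with a pause between them and forwards a subtitle callback. It only relies on `speakText` and `speakBreak` with the signatures listed in that table; the helper name and the 500 ms default pause are assumptions.

```javascript
// Hypothetical helper: queue several sentences with a pause between them.
// `head` is any object exposing speakText(text, opt, onsubtitles) and
// speakBreak(ms), as listed in the methods table.
function speakSentences(head, sentences, onsubtitles = null, pauseMs = 500) {
  // Drop empty or whitespace-only entries before queueing.
  const texts = sentences.map((s) => s.trim()).filter((s) => s.length > 0);
  texts.forEach((text, i) => {
    if (i > 0) head.speakBreak(pauseMs); // pause only between sentences
    head.speakText(text, {}, onsubtitles);
  });
  return texts.length; // number of sentences actually queued
}

// Usage sketch in the browser (element id 'subtitles' is hypothetical):
// const nodeSubtitles = document.getElementById('subtitles');
// speakSentences(head, ["Hello!", "How are you today?"],
//   (s) => { nodeSubtitles.textContent += s; });
```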

The following table lists some of the key methods. See the source code for the rest:

Method	Description
`showAvatar(avatar, [onprogress=null])`	Load and show the specified avatar. The `avatar` object must include the `url` of the GLB file. Optional properties are `body` for either male `M` or female `F` body form, `lipsyncLang`, `ttsLang`, `ttsVoice`, `ttsRate`, `ttsPitch`, `ttsVolume`, `avatarMood` and `avatarMute`.
`setView(view, [opt])`	Set view. Supported views are `"full"`, `"mid"`, `"upper"` and `"head"`. The `opt` object can be used to set `cameraDistance`, `cameraX`, `cameraY`, `cameraRotateX`, `cameraRotateY`.
`setLighting(opt)`	Change lighting settings. The `opt` object can be used to set `lightAmbientColor`, `lightAmbientIntensity`, `lightDirectColor`, `lightDirectIntensity`, `lightDirectPhi`, `lightDirectTheta`, `lightSpotColor`, `lightSpotIntensity`, `lightSpotPhi`, `lightSpotTheta`, `lightSpotDispersion`.
`speakText(text, [opt={}], [onsubtitles=null], [excludes=[]])`	Add the `text` string to the speech queue. The text can contain face emojis. Options `opt` can be used to set text-specific `lipsyncLang`, `ttsLang`, `ttsVoice`, `ttsRate`, `ttsPitch`, `ttsVolume`, `avatarMood`, `avatarMute`. The optional callback function `onsubtitles` is called with the added string as its parameter whenever a new subtitle is to be written. The optional `excludes` is an array of [start,end] indices to be excluded from the audio but included in the subtitles.
`speakAudio(audio, [opt={}], [onsubtitles=null])`	Add a new `audio` object to the speech queue. In the audio object, the property `audio` is either an `AudioBuffer` or an array of PCM 16bit LE audio chunks. Property `words` is an array of words, `wtimes` is an array of corresponding starting times in milliseconds, and `wdurations` is an array of durations in milliseconds. If the Oculus viseme IDs are known, they can be given in the optional `visemes`, `vtimes` and `vdurations` arrays. The object also supports optional timed callbacks using `markers` and `mtimes`. The `opt` object can be used to set text-specific `lipsyncLang`.
`speakEmoji(e)`	Add an emoji `e` to the speech queue.
`speakBreak(t)`	Add a break of `t` milliseconds to the speech queue.
`speakMarker(onmarker)`	Add a marker to the speech queue. The callback function `onmarker` is called when the queue processes the event.
`lookAt(x,y,t)`	Make the avatar's head turn to look at the screen position (`x`,`y`) for `t` milliseconds.
`lookAtCamera(t)`	Make the avatar's head turn to look at the camera for `t` milliseconds.
`setMood(mood)`	Set avatar mood.
`playBackgroundAudio(url)`	Play background audio such as ambient sounds/music in a loop.
`stopBackgroundAudio()`	Stop playing the background audio.
`setMixerGain(speech, background)`	The amount of gain for speech and background audio (see Web Audio API / GainNode for more information). Default value is `1`.
`playAnimation(url, [onprogress=null], [dur=10], [ndx=0], [scale=0.01])`	Play a Mixamo animation file for `dur` seconds. The animation always plays in full rounds and at least once. If the FBX file includes several animations, the parameter `ndx` specifies the index. Since Mixamo rigs have a scale 100 and RPM a scale 1, the `scale` factor can be used to scale the positions.
`stopAnimation()`	Stop the current animation started by `playAnimation`.
`playPose(url, [onprogress=null], [dur=5], [ndx=0], [scale=0.01])`	Play the initial pose of a Mixamo animation file for `dur` seconds. If the FBX file includes several animations, the parameter `ndx` specifies the index. Since Mixamo rigs have a scale 100 and RPM a scale 1, the `scale` factor can be used to scale the positions.
`stopPose()`	Stop the current pose started by `playPose`.
`playGesture(name, [dur=3], [mirror=false], [ms=1000])`	Play a named hand gesture and/or animated emoji for `dur` seconds with the `ms` transition time. The available hand gestures are `handup`, `index`, `ok`, `thumbup`, `thumbdown`, `side`, `shrug`. By default, hand gestures are done with the left hand. If you want the right-handed version, set `mirror` to true. You can also use `playGesture` to play emojis. See Appendix D for more details. [≥`v1.2`]
`stopGesture([ms=1000])`	Stop the gesture with `ms` transition time. [≥`v1.2`]
`start()`	Start/re-start the Talking Head animation loop.
`stop()`	Stop the Talking Head animation loop.
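
`speakAudio` expects word timings in milliseconds, while many speech-to-text tools report word-level timestamps in seconds. The sketch below converts such timestamps into the `words`, `wtimes` and `wdurations` arrays described above. The input shape `{ word, start, end }` and the helper name `toSpeakAudioObject` are assumptions for illustration; adapt them to whatever your transcription step returns.

```javascript
// Hypothetical helper. `audio` is an AudioBuffer or an array of PCM chunks;
// `timedWords` is an array of { word, start, end } with times in seconds
// (assumed input shape).
function toSpeakAudioObject(audio, timedWords) {
  const out = { audio, words: [], wtimes: [], wdurations: [] };
  for (const w of timedWords) {
    const word = String(w.word).trim();
    if (!word) continue; // skip empty tokens
    const startMs = Math.round(w.start * 1000);
    const endMs = Math.round(w.end * 1000);
    out.words.push(word);
    out.wtimes.push(startMs);
    out.wdurations.push(Math.max(0, endMs - startMs)); // never negative
  }
  return out;
}

// Usage sketch (audioBuffer and words come from your own decoding and
// transcription steps):
// head.speakAudio( toSpeakAudioObject(audioBuffer, words), { lipsyncLang: 'en' } );
```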