Is this just feature extraction plus conditioning on a pretrained talking-head model? I'm curious what the minimal pipeline is (feature encoding, identity representation, real-time inference) and how people are doing this efficiently.
Any insights or similar open-source patterns?
I'm already good at building voice AI bots, but I'm struggling with the face.
https://navtalk.ai/