Summarized by Dodly:

Scale 2: AI Animation Without Skeletons or Control Nets

Benji’s AI Playground

Summary

Scale 2, a new animation framework from Tsinghua University and Z.AI, transfers character motion from a reference video directly into an output video without skeletons, pose data, or ControlNet. The updated model uses rotary position embedding (RoPE) and in-context mask conditions to transfer motion more accurately, including in scenes with multiple characters or animals.

Scale 2 is available in ComfyUI as FP16, FP8, and FP8 MX precision models, and requires the LightX2V LoRA for efficient low-step sampling, with as few as four steps. It integrates with SAM 3 video tracking for object segmentation, which enables a "replacement mode" in which specified characters are swapped out of an existing video. For instance, a singer on stage can be replaced by a different character supplied as a reference image, with the model transferring the motion and even mimicking the camera movement. Scale 2 also handles complex scenes such as multi-character boxing matches, transferring poses accurately even where body parts overlap. Videos longer than the standard 81 frames are produced with a for-loop mechanism that extends the output window by window.

Compared with its predecessor, Scale 1, the model's fidelity and physics simulation have improved significantly, and it relies purely on motion transfer and segmentation masks instead of external control methods. This reflects a trend toward more integrated AI animation tools that reduce the need for complex ControlNet setups and streamline video motion transfer and editing.
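The for-loop extension beyond 81 frames can be pictured as splitting a long clip into fixed-size windows and generating them one after another. The sketch below is only an illustration of that idea, not the actual ComfyUI workflow: the function name `plan_chunks` and the `overlap` parameter (frames shared between consecutive windows for continuity) are assumptions that the summary does not mention.

```python
def plan_chunks(total_frames, window=81, overlap=5):
    """Split a clip into (start, end) frame ranges, end exclusive.

    window=81 matches the standard clip length named in the summary.
    overlap is a hypothetical setting: how many frames each window
    shares with the previous one so motion can be carried across.
    """
    if window <= overlap:
        raise ValueError("window must be larger than overlap")
    if total_frames <= 0:
        return []

    chunks = []
    start = 0
    stride = window - overlap  # new frames contributed by each later window
    while True:
        end = min(start + window, total_frames)
        chunks.append((start, end))
        if end == total_frames:
            break
        start += stride
    return chunks


if __name__ == "__main__":
    # Each range would be handed to one generation pass inside the loop.
    for start, end in plan_chunks(200):
        print(f"generate frames {start}..{end - 1}")
```

With a 200-frame clip and the assumed 5-frame overlap, this plans three passes; a clip of 81 frames or fewer needs only one.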