Shengshu Technology Completes A New Round of Financing, Focusing on Catching Up with Sora

On March 12th, Beijing Shengshu Technology Co., Ltd. (hereinafter referred to as ‘Shengshu Technology’) announced the completion of a new financing round worth hundreds of millions of yuan. The round was led by Qiming Venture Partners, with participation from Delta Capital, Hongfu Hode, and Zhipu AI, as well as existing shareholders Baidu Ventures and ZY Capital. China Renaissance acted as the exclusive financial advisor for this round.

According to public information, Shengshu Technology was established in March 2023, and its main business is the research and development of native multimodal large models spanning images, 3D, and video. The company reportedly plans to use this round mainly for iterative research and development of its foundational multimodal large models, for innovative application products, and for market expansion.

At the beginning of this year, OpenAI released its text-to-video product Sora, which has attracted widespread attention for the length and quality of the videos it generates. Zhou Zhifeng, a partner at Qiming Venture Partners, once predicted that as scaling laws take firmer hold in video generation, multimodal technology will lead to a series of remarkable innovations. On the domestic multimodal large model track, companies such as Shengshu Technology and AIsphere have emerged as strong contenders to build a Chinese counterpart to Sora.

Among recent multimodal model releases, both Sora and Stable Diffusion 3 adopt the Diffusion Transformer (DiT) architecture, in which a Transformer replaces the U-Net commonly used as the backbone of a diffusion model. This combines the scalability of Transformers with the natural advantages of diffusion models in processing visual data, and has demonstrated outstanding emergent capabilities in visual tasks.
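
To give a rough sense of what swapping U-Net for a Transformer involves, the sketch below shows the patch tokenization step used by ViT-style models: a noisy image is cut into small patches, each patch is flattened into a token, and the resulting sequence is what a Transformer would process. This is an illustration only, not code from Sora, DiT, or U-ViT. The function names `patchify` and `unpatchify` and the 4x4 toy image are assumptions made for the example, and the Transformer itself is omitted.

```python
def patchify(image, patch):
    """Split an H x W grid (a list of rows) into flattened patch tokens.

    Tokens are emitted in row-major patch order, which is the sequence
    a Transformer backbone would consume in place of a U-Net feature map.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = [image[r][c]
                     for r in range(top, top + patch)
                     for c in range(left, left + patch)]
            tokens.append(token)
    return tokens


def unpatchify(tokens, height, width, patch):
    """Inverse of patchify: place each token back at its patch location."""
    image = [[0.0] * width for _ in range(height)]
    patches_per_row = width // patch
    for index, token in enumerate(tokens):
        top = (index // patches_per_row) * patch
        left = (index % patches_per_row) * patch
        for offset, value in enumerate(token):
            image[top + offset // patch][left + offset % patch] = value
    return image


# Hypothetical 4x4 single-channel "noisy image" used only for illustration.
noisy = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
tokens = patchify(noisy, 2)            # 4 tokens, each of length 4
restored = unpatchify(tokens, 4, 4, 2)  # round-trips back to the input
```

In a real model the tokens would also carry position and timestep information and pass through attention layers before being mapped back to a noise estimate; none of that is shown here.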

The Diffusion Transformer (DiT) architecture is now a validated technical route on which the industry has begun to reach consensus. Tang Jiayu, CEO of Shengshu Technology, stated that the industry’s technical routes have tended to converge. Previously, Wang Changhu, CEO of AIsphere, also said that the emergence of Sora has validated that large video generation models based on Diffusion+Transformer can achieve better performance, and that it has reinforced AIsphere’s chosen development direction.

Looking back at its origins, the DiT architecture was published by a Berkeley team in December 2022. However, as early as September 2022, founding members of Shengshu Technology had proposed U-ViT, a Transformer-based network architecture. The two works are consistent in their architectural ideas and experimental paths: both integrate the Transformer with diffusion models.

In March 2023, Shengshu Technology open-sourced UniDiffuser, a multimodal diffusion large model comparable to Stable Diffusion in parameter count and training data scale. In addition to one-way text-to-image generation, UniDiffuser supports more general tasks involving images and text, such as image-to-text generation, joint image-text generation, and image and text rewriting. Even then, the model was built on the company’s Diffusion Transformer architecture, U-ViT.

Despite this research foundation in fusing Transformers with diffusion models, Tang Jiayu frankly admits that a gap with Sora remains at present. Both Shengshu Technology and AIsphere have set goals to catch up with Sora.

Tang Jiayu said that, given the company’s experience in training models efficiently and cost-effectively on large-scale GPU clusters, catching up with Sora will definitely be much easier than catching up with GPT-4. He expects the company to match the performance of the current version of Sora this year.

AIsphere’s latest financing announcement likewise stated that the new funding will be used mainly for research and development of its underlying large video models and for team building. According to Wang Changhu, the company will concentrate manpower and resources on surpassing Sora’s current level within three to six months.

In terms of commercialization, Shengshu Technology follows two paths. On one hand, it uses its MaaS (Model as a Service) capability to provide models directly to business customers through APIs, with its main clients concentrated among gaming companies and internet enterprises. On the other hand, it develops vertical application products and charges subscription fees for them. So far it has launched the visual creative design platform PixWeaver and the 3D asset creation tool VoxCraft.

Compared with the relatively abundant data available for graphics, images, and video, the data available for 3D asset generation is of relatively poor quality. In response, Tang Jiayu stated that Shengshu Technology currently trains jointly on 2D and 3D data to improve modeling results.

Among the investors in this round, both Zhipu AI and Baidu have their own large-model businesses. On this point, Tang Jiayu stated that the model products of Zhipu AI and Baidu lean towards language models, with more emphasis on understanding and logical reasoning, and therefore complement Shengshu Technology’s multimodal capabilities.