Zhiheng Liu

I am a second-year Ph.D. student at the Department of Computer Science, The University of Hong Kong (HKU), advised by Prof. Ping Luo.

Before that, I obtained my master's degree from the University of Science and Technology of China (USTC), advised by Prof. Yang Cao.

My previous research centered on generative models, with a focus on image, video, and 3D generation. My interests have since shifted toward native multimodal models. I am also actively exploring long-context modeling, an area in which I am an enthusiastic beginner.

I am always open to research discussions and collaborations; please feel free to contact me via email (zhihengl0528 AT connect.hku.hk).

Email  /  Google Scholar  /  GitHub

News
  • [9. 2025] 3 papers accepted at NeurIPS 2025.
  • [3. 2025] 3 papers accepted at CVPR 2025, including one Highlight.
  • [12. 2024] We release DepthLab, a robust depth inpainting foundation model that can be applied to various downstream tasks to enhance performance.
  • [12. 2024] We release The Matrix, a foundation world model for generating infinite-length, hyper-realistic videos with real-time, frame-level control.
  • [7. 2024] LivePhoto accepted at ECCV 2024.
  • [5. 2024] CCM accepted at ICML 2024.
  • [4. 2024] We release InFusion for 3D inpainting via diffusion prior.
  • [3. 2024] DreamVideo accepted at CVPR 2024.
  • [1. 2024] DreamClean accepted at ICLR 2024.
  • [12. 2023] This page is online. Discussions and collaborations are welcome.
Selected Publications

(*: Equal contribution)

TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models
Zhiheng Liu, Weiming Ren*, Haozhe Liu, Zijian Zhou, Shoufa Chen, Haonan Qiu, Xiaoke Huang, Zhaochong An, Fanny Yang, Aditya Patel, Viktar Atliha, Tony Ng, Xiao Han, Chuyan Zhu, Chenyang Zhang, Ding Liu, Juan-Manuel Perez-Rua, Sen He, Jürgen Schmidhuber, Wenhu Chen, Ping Luo, Wei Liu, Tao Xiang, Jonas Schult, Yuren Cong 
CVPR, 2026
pdf / page

This work introduces Tuna, a native Unified Multimodal Model (UMM) that builds a continuous visual representation space for end-to-end image and video processing. Key highlights include:

  • It achieves state-of-the-art results across a wide range of image and video tasks.
  • The representation encoder plays a crucial role; using stronger pretrained encoders consistently and significantly improves performance across all multimodal tasks.
  • Compared to previous models with decoupled encoders, Tuna's unified architecture avoids representation format mismatches, leading to superior results in both understanding and generation.

WorldWeaver: Generating Long-Horizon Video Worlds via Rich Perception
Zhiheng Liu, Xueqing Deng, Shoufa Chen, Angtian Wang, Qiushan Guo, Mingfei Han, Zeyue Xue, Mengzhao Chen, Ping Luo, Linjie Yang
NeurIPS, 2025
pdf / page

WorldWeaver is a framework for long-horizon video generation that unifies RGB and perceptual conditions, leveraging depth-guided memory and segmented noise scheduling to enhance structural and temporal consistency.

DepthLab: From Partial to Complete
Zhiheng Liu*, Ka Leong Cheng*, Qiuyu Wang, Shuzhe Wang, Hao Ouyang, Bin Tan, Kai Zhu, Yujun Shen, Qicheng Chen, Ping Luo
arXiv, 2024
pdf / page

We propose a robust depth inpainting foundation model that can be applied to various downstream tasks to enhance performance.

InFusion: Inpainting 3D Gaussians via Learning Depth Completion from Diffusion Prior
Zhiheng Liu*, Hao Ouyang*, Qiuyu Wang, Ka Leong Cheng, Jie Xiao, Kai Zhu, Nan Xue, Yu Liu, Yujun Shen, Yang Cao
arXiv, 2024
pdf / page

We present an image-conditioned depth inpainting model that uses a diffusion prior to inpaint 3D Gaussians while maintaining strong geometric and texture consistency.

MangaNinja: Line Art Colorization with Precise Reference Following
Zhiheng Liu*, Ka Leong Cheng*, Xi Chen, Jie Xiao, Hao Ouyang, Kai Zhu, Yu Liu, Yujun Shen, Qicheng Chen, Ping Luo
CVPR, 2025 Highlight
pdf / page

MangaNinja is a reference-based line art colorization method that enables precise matching and fine-grained interactive control.

LivePhoto: Real Image Animation with Text-guided Motion Control
Xi Chen, Zhiheng Liu, Mengting Chen, Yutong Feng, Yu Liu, Yujun Shen, Hengshuang Zhao
ECCV, 2024
pdf / page

We present LivePhoto, a real image animation method with text control. Unlike previous work, LivePhoto truly follows the text instructions and preserves object identity well.

Cones 2: Customizable Image Synthesis with Multiple Subjects
Zhiheng Liu*, Yifei Zhang*, Yujun Shen, Kecheng Zheng, Kai Zhu, Ruili Feng, Yu Liu, Deli Zhao, Jingren Zhou, Yang Cao
NeurIPS, 2023
pdf / page

Cones 2 uses a simple yet effective representation to register a subject. The storage space required for each subject is approximately 5 KB. Moreover, Cones 2 allows for the flexible composition of various subjects without any model tuning.

Cones: Concept Neurons in Diffusion Models for Customized Generation
Zhiheng Liu*, Ruili Feng*, Kai Zhu, Yifei Zhang, Kecheng Zheng, Yu Liu, Deli Zhao, Jingren Zhou, Yang Cao
ICML, 2023 Oral
pdf / page

We explore the subject-specific concept neurons in a pre-trained text-to-image diffusion model. Concatenating multiple clusters of concept neurons representing different persons, objects, and backgrounds can flexibly generate all related concepts in a single image.


Design and source code from Jon Barron's website