
T2M-GPT Reproduction Posting

GPU => RTX 3090. The GitHub code is available here.
The repo says the demo can be run on Colab, but that takes far too long, so I set up my own server and checked the demo there.

Environment

T2M-GPT is reported to have been run on a V100, and checking environment.yml shows that it uses CUDA 10.1:

  - python=3.8.11=h12debd9_0_cpython
  - pytorch=1.8.1=py3.8_cuda10.1_cudnn7.6.3_0
  - torchaudio=0.8.1=py38
  - torchvision=0.9.1=py38_cu101

So the CUDA environment needs to be changed for the RTX 3090. In my case, I solved this by writing my own requirements.txt (python=3.8); a quick sanity check follows the list.

--extra-index-url https://download.pytorch.org/whl/cu113
torch==1.12.1+cu113
torchvision==0.13.1+cu113
torchaudio==0.12.1
absl-py==0.13.0
backcall==0.2.0
cachetools==4.2.2
charset-normalizer==2.0.4
chumpy==0.70
cycler==0.10.0
decorator==5.0.9
google-auth==1.35.0
google-auth-oauthlib==0.4.5
grpcio==1.39.0
idna==3.2
imageio==2.9.0
ipdb==0.13.9
ipython==7.26.0
ipython-genutils==0.2.0
jedi==0.18.0
joblib==1.0.1
kiwisolver==1.3.1
markdown==3.3.4
matplotlib==3.4.3
matplotlib-inline==0.1.2
oauthlib==3.1.1
pandas==1.3.2
parso==0.8.2
pexpect==4.8.0
pickleshare==0.7.5
prompt-toolkit==3.0.20
protobuf==3.17.3
ptyprocess==0.7.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pygments==2.10.0
pyparsing==2.4.7
python-dateutil==2.8.2
pytz==2021.1
pyyaml==5.4.1
requests==2.26.0
requests-oauthlib==1.3.0
rsa==4.7.2
scikit-learn==0.24.2
scipy==1.7.1
sklearn==0.0
smplx==0.1.28
tensorboard==2.6.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.0
threadpoolctl==2.2.0
toml==0.10.2
tqdm==4.62.2
traitlets==5.0.5
urllib3==1.26.6
wcwidth==0.2.5
werkzeug==2.0.1
git+https://github.com/openai/CLIP.git
git+https://github.com/nghorbani/human_body_prior
gdown
moviepy
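
After installing, it is worth a quick sanity check that the cu113 build of PyTorch was actually installed and that the 3090 is visible. This small snippet is my own addition, not part of the repo:

import torch

# Expect something like '1.12.1+cu113', CUDA version 11.3, and the RTX 3090 listed
print(torch.__version__)
print(torch.version.cuda)
print(torch.cuda.is_available(), torch.cuda.get_device_name(0))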

Setup

Now, as described on the project page, after downloading the pre-trained models, extractors, and so on, the packages needed to render the SMPL mesh have to be installed.

sudo sh dataset/prepare/download_smpl.sh
conda install -c menpo osmesa
conda install h5py
conda install -c conda-forge shapely pyrender trimesh mapbox_earcut

It took longer than I expected, but if you wait patiently it finishes.
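
Since osmesa is used for offscreen rendering, it also helps to confirm that pyrender can create an offscreen context before moving on. This is a minimal check of my own (not from the repo); it assumes the OSMesa backend is selected via the PYOPENGL_PLATFORM environment variable:

import os
os.environ['PYOPENGL_PLATFORM'] = 'osmesa'  # must be set before importing pyrender

import numpy as np
import pyrender

# Render an empty scene offscreen; if the OSMesa install is broken this raises an error
scene = pyrender.Scene()
scene.add(pyrender.PerspectiveCamera(yfov=np.pi / 3.0), pose=np.eye(4))
renderer = pyrender.OffscreenRenderer(viewport_width=400, viewport_height=400)
color, depth = renderer.render(scene)
print(color.shape, depth.shape)
renderer.delete()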

Demo

The demo code follows the Google Colab notebook; the exact code I ran is below.

clip_text = ["The person is running and jumping with his right hand up"]



import options.option_transformer as option_trans
args = option_trans.get_args_parser()

args.dataname = 't2m'
args.resume_pth = 'pretrained/VQVAE/net_last.pth'
args.resume_trans = 'pretrained/VQTransformer_corruption05/net_best_fid.pth'
args.down_t = 2
args.depth = 3
args.block_size = 51
import clip
import torch
import numpy as np

## load clip model and datasets
clip_model, clip_preprocess = clip.load("ViT-B/32", device=torch.device('cuda'), jit=False)  # Must set jit=False for training

import models.vqvae as vqvae
import models.t2m_trans as trans

clip.model.convert_weights(clip_model)  # Actually this line is unnecessary since clip by default already on float16


clip_model.eval()
for p in clip_model.parameters():
    p.requires_grad = False

net = vqvae.HumanVQVAE(args, ## use args to define different parameters in different quantizers
                       args.nb_code,
                       args.code_dim,
                       args.output_emb_width,
                       args.down_t,
                       args.stride_t,
                       args.width,
                       args.depth,
                       args.dilation_growth_rate)



trans_encoder = trans.Text2Motion_Transformer(num_vq=args.nb_code, 
                                embed_dim=1024, 
                                clip_dim=args.clip_dim, 
                                block_size=args.block_size, 
                                num_layers=9, 
                                n_head=16, 
                                drop_out_rate=args.drop_out_rate, 
                                fc_rate=args.ff_rate)


print ('loading checkpoint from {}'.format(args.resume_pth))
ckpt = torch.load(args.resume_pth, map_location='cpu')
net.load_state_dict(ckpt['net'], strict=True)
net.eval()
net.cuda()

print ('loading transformer checkpoint from {}'.format(args.resume_trans))
ckpt = torch.load(args.resume_trans, map_location='cpu')
trans_encoder.load_state_dict(ckpt['trans'], strict=True)
trans_encoder.eval()
trans_encoder.cuda()

mean = torch.from_numpy(np.load('./checkpoints/t2m/VQVAEV3_CB1024_CMT_H1024_NRES3/meta/mean.npy')).cuda()
std = torch.from_numpy(np.load('./checkpoints/t2m/VQVAEV3_CB1024_CMT_H1024_NRES3/meta/std.npy')).cuda()

text = clip.tokenize(clip_text, truncate=True).cuda()
feat_clip_text = clip_model.encode_text(text).float()
index_motion = trans_encoder.sample(feat_clip_text[0:1], False)
pred_pose = net.forward_decoder(index_motion)
from utils.motion_process import recover_from_ric
pred_xyz = recover_from_ric((pred_pose*std+mean).float(), 22)
print(pred_xyz.shape)
xyz = pred_xyz.reshape(1, -1, 22, 3)

np.save('motion.npy', xyz.detach().cpu().numpy())

import visualization.plot_3d_global as plot_3d
pose_vis = plot_3d.draw_to_batch(xyz.detach().cpu().numpy(), clip_text, ['example5.gif'])
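
draw_to_batch writes the generated motion to example5.gif. The saved motion.npy can also be reloaded and re-rendered later without re-running the model; here is a small usage sketch of my own (the caption and output filename are just examples):

import numpy as np
import visualization.plot_3d_global as plot_3d

# Joints saved by the demo above: shape (1, T, 22, 3)
xyz = np.load('motion.npy')
print(xyz.shape)

# Re-draw the same motion under a different caption / output name
plot_3d.draw_to_batch(xyz, ['the person is running and jumping with his right hand up'], ['example5_rerun.gif'])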

Experiment 1: the result when a very detailed pose description is given without any action word.

Experiment 2: the result when applying the wording from the paper.

The results actually come out better than I expected.
