In deep learning, for loops should be avoided as much as possible..
I built and trained a 3D Pose Estimation model, borrowing the overall framework of the paper Deep Kinematics Analysis for Monocular 3D Human Pose Estimation while changing the details however I liked.
The problem was that training took an unreasonably long time.
I profiled it by printing time.time() after each line, and the composition/decomposition step turned out to be the main bottleneck.
sk_sidx = torch.tensor([0, 1, 2, 3, 1, 5, 6, 0, 8, 9, 0, 11, 12, 1]).cuda()    # parent (start) joint of each bone
sk_eidx = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]).cuda() # child (end) joint of each bone

def compose_skeleton(lens, dirs, device='cuda:0'):
    # lens, dirs: (-1, S, 1), (-1, T, S, dim)
    assert len(lens.shape) == 3 and lens.shape[2] == 1, lens.shape
    assert len(dirs.shape) == 4
    assert lens.shape[1] == dirs.shape[2], 'same skeletons number'
    assert lens.shape[0] == dirs.shape[0], 'same batch size'
    length = lens.view(-1, 1, lens.shape[1], 1)
    joints = torch.zeros(
        (dirs.shape[0], dirs.shape[1], dirs.shape[2] + 1, dirs.shape[3]),
        dtype=torch.float32, device=device
    )  # (-1, T, J, dim)
    # walk down the kinematic tree: child joint = parent joint + bone length * bone direction
    for i in range(sk_sidx.shape[0]):
        joints[:, :, sk_eidx[i]] = joints[:, :, sk_sidx[i]] + length[:, :, i] * dirs[:, :, i]
    return joints  # (-1, T, J, dim)
def decompose_skeleton(joints, device='cuda:0'):
    # joints: (-1, T, J, dim)
    # joints to skeletons
    skeletons = joints[:, :, sk_eidx, :] - joints[:, :, sk_sidx, :]  # (-1, T, S, dim)
    length = torch.sqrt(torch.sum(skeletons ** 2, dim=-1, keepdim=True))  # (-1, T, S, 1)
    # normalize direction vectors
    direction = skeletons / length  # (-1, T, S, dim)
    return length, direction
In the code above, compose_skeleton was quite slow because of the for loop; if I remember correctly, it was taking somewhere around 8 seconds per iteration. And I could not fix it with plain indexing.
joints[:, :, sk_eidx] = joints[:, :, sk_sidx] + length * dirs
Code like this does not work correctly.
A child joint's position can only be computed once its parent joint's position has been filled in, but this kind of indexed assignment (including the equivalent NumPy operation on CPU) evaluates the whole right-hand side in parallel before writing anything, so some children end up computed from parents that have not been filled in yet.
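A tiny example of my own, not from the original post, shows the failure on a hypothetical three-joint chain 0 -> 1 -> 2: the vectorized assignment reads the parent positions before any of them have been written.

import torch

sidx = torch.tensor([0, 1])                      # parent of each bone
eidx = torch.tensor([1, 2])                      # child of each bone
bone = torch.tensor([[1.0, 0.0], [1.0, 0.0]])    # two unit-length bones along x

seq = torch.zeros(3, 2)                          # sequential: each child sees its updated parent
for i in range(2):
    seq[eidx[i]] = seq[sidx[i]] + bone[i]
print(seq[2])                                    # tensor([2., 0.]) -- correct

vec = torch.zeros(3, 2)                          # vectorized: the right-hand side is read first
vec[eidx] = vec[sidx] + bone
print(vec[2])                                    # tensor([1., 0.]) -- joint 1 was still zero when joint 2 was computed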
Still, the for loop was far too slow. After a lot of thought about how to handle this, I came up with a solution, and the inspiration came from Graph Neural Networks.
A graph is described by an adjacency matrix, so I figured that if I built a matrix with a 1 for every joint on the path from the root down to each child and multiplied the bone vectors by it, I would get exactly the result I wanted as a single parallel tensor operation.
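Spelled out in my own notation (not the paper's), each joint position is just the sum of the length-scaled bone directions on the path from the root to that joint, and that sum is one matrix product:
$$p_j=\sum_{b\,\in\,\mathrm{path}(0\to j)}\ell_b\,d_b\quad\Longleftrightarrow\quad P=A\,(L\odot D)$$
where $A$ is the per-joint "ancestor path" matrix written out as adj below, and $L\odot D$ holds each bone vector at its child joint's row (the root row stays zero).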
sk_sidx = torch.tensor([0, 1, 2, 3, 1, 5, 6, 0, 8, 9, 0, 11, 12, 1]).cuda()
sk_eidx = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]).cuda()
adj = torch.tensor([
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0],
[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
]).float().cuda()
adj = adj.view((1, 1) + adj.shape)  # reshape so matmul broadcasts over the batch and time dimensions
def compose_skeleton(lens, dirs, device='cuda:0'):
    # lens, dirs: (-1, S, 1), (-1, T, S, dim)
    assert len(lens.shape) == 3 and lens.shape[2] == 1, lens.shape
    assert len(dirs.shape) == 4
    assert lens.shape[1] == dirs.shape[2], 'same skeletons number'
    assert lens.shape[0] == dirs.shape[0], 'same batch size'
    length = lens.view(-1, 1, lens.shape[1], 1)
    joints = torch.zeros((dirs.shape[0], dirs.shape[1], dirs.shape[2] + 1, dirs.shape[3]), dtype=torch.float32, device=device)  # (-1, T, J, dim)
    joints[:, :, sk_eidx] = length * dirs  # place each bone vector at its child joint's row
    return torch.matmul(adj, joints)  # sum bone vectors along each root-to-joint path, (-1, T, J, dim)
def decompose_skeleton(joints, device='cuda:0'):
    # joints: (-1, T, J, dim)
    # joints to skeletons
    skeletons = joints[:, :, sk_eidx, :] - joints[:, :, sk_sidx, :]  # (-1, T, S, dim)
    length = torch.sqrt(torch.sum(skeletons ** 2, dim=-1, keepdim=True))  # (-1, T, S, 1)
    # normalize direction vectors
    direction = skeletons / length  # (-1, T, S, dim)
    return length, direction
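One sanity check worth running here (my addition, not in the original post): with the two versions loaded side by side, named compose_skeleton1 (for loop) and compose_skeleton2 (matmul) as in the benchmark scripts below, their outputs should agree up to floating-point error.

lens = torch.rand(2, 14, 1).cuda() + 0.5               # random bone lengths, (-1, S, 1)
dirs = torch.randn(2, 8, 14, 3).cuda()
dirs = dirs / dirs.norm(dim=-1, keepdim=True)          # unit direction vectors, (-1, T, S, dim)
print(torch.allclose(compose_skeleton1(lens, dirs),    # for-loop version
                     compose_skeleton2(lens, dirs),    # adjacency-matmul version
                     atol=1e-5))                       # expect: True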
This brought the operation down to the 0.01-second range per iteration.
First, I benchmarked the two versions on CPU as follows.
import torch
import time
sk_sidx = torch.tensor([0, 1, 2, 3, 1, 5, 6, 0, 8, 9, 0, 11, 12, 1])
sk_eidx = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
adj = torch.tensor([
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0],
[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
]).float()
adj = adj.view((1, 1) + adj.shape)  # reshape so matmul broadcasts over the batch and time dimensions
def compose_skeleton1(lens, dirs, device='cuda:0'):
    # lens, dirs: (-1, S, 1), (-1, T, S, dim)
    assert len(lens.shape) == 3 and lens.shape[2] == 1, lens.shape
    assert len(dirs.shape) == 4
    assert lens.shape[1] == dirs.shape[2], 'same skeletons number'
    assert lens.shape[0] == dirs.shape[0], 'same batch size'
    length = lens.view(-1, 1, lens.shape[1], 1)
    joints = torch.zeros(
        (dirs.shape[0], dirs.shape[1], dirs.shape[2] + 1, dirs.shape[3]),
        dtype=torch.float32, device=device
    )  # (-1, T, J, dim)
    for i in range(sk_sidx.shape[0]):
        joints[:, :, sk_eidx[i]] = joints[:, :, sk_sidx[i]] + length[:, :, i] * dirs[:, :, i]
    return joints  # (-1, T, J, dim)
def compose_skeleton2(lens, dirs, device='cuda:0'):
    # lens, dirs: (-1, S, 1), (-1, T, S, dim)
    assert len(lens.shape) == 3 and lens.shape[2] == 1, lens.shape
    assert len(dirs.shape) == 4
    assert lens.shape[1] == dirs.shape[2], 'same skeletons number'
    assert lens.shape[0] == dirs.shape[0], 'same batch size'
    length = lens.view(-1, 1, lens.shape[1], 1)
    joints = torch.zeros((dirs.shape[0], dirs.shape[1], dirs.shape[2] + 1, dirs.shape[3]), dtype=torch.float32, device=device)  # (-1, T, J, dim)
    joints[:, :, sk_eidx] = length * dirs
    return torch.matmul(adj, joints)  # (-1, T, J, dim)
def decompose_skeleton(joints, device='cuda:0'):
    # joints: (-1, T, J, dim)
    # joints to skeletons
    skeletons = joints[:, :, sk_eidx, :] - joints[:, :, sk_sidx, :]  # (-1, T, S, dim)
    length = torch.sqrt(torch.sum(skeletons ** 2, dim=-1, keepdim=True))  # (-1, T, S, 1)
    # normalize direction vectors
    direction = skeletons / length  # (-1, T, S, dim)
    return length, direction
num_joints = 15
batch_size = 128
T = 243
dimension = 3
for _ in range(10):
    joints = torch.randn((batch_size, T, num_joints, dimension))
    start_time = time.time()
    for _ in range(100):
        length, direction = decompose_skeleton(joints, device='cpu')  # (128, 243, 14, 1), (128, 243, 14, 3)
        new_joints = compose_skeleton1(torch.mean(length, dim=1), direction, device='cpu')
    print(time.time() - start_time)
    start_time = time.time()
    for _ in range(100):
        length, direction = decompose_skeleton(joints, device='cpu')  # (128, 243, 14, 1), (128, 243, 14, 3)
        new_joints = compose_skeleton2(torch.mean(length, dim=1), direction, device='cpu')
    print(time.time() - start_time)
    print('='*100)
2.438546657562256
2.9766674041748047
====================================================================================================
2.4565513134002686
3.0466833114624023
====================================================================================================
2.452549695968628
3.079690933227539
====================================================================================================
2.516564130783081
3.0976943969726562
====================================================================================================
2.499560594558716
3.1397578716278076
====================================================================================================
2.5777175426483154
3.1707115173339844
====================================================================================================
2.5475711822509766
3.13071870803833
====================================================================================================
2.5495800971984863
3.2202258110046387
====================================================================================================
2.5580780506134033
3.0646872520446777
====================================================================================================
2.5955910682678223
3.122096538543701
====================================================================================================
On CPU, my solution is actually about 20% slower, presumably because the Python loop over 14 bones costs almost nothing there while the 15x15 matmul does some redundant work.
So what happens on GPU?
import torch
import time
sk_sidx = torch.tensor([0, 1, 2, 3, 1, 5, 6, 0, 8, 9, 0, 11, 12, 1]).cuda()
sk_eidx = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]).cuda()
adj = torch.tensor([
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0],
[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
]).float().cuda()
adj = adj.view((1, 1) + adj.shape)  # reshape so matmul broadcasts over the batch and time dimensions
def compose_skeleton1(lens, dirs, device='cuda:0'):
    # lens, dirs: (-1, S, 1), (-1, T, S, dim)
    assert len(lens.shape) == 3 and lens.shape[2] == 1, lens.shape
    assert len(dirs.shape) == 4
    assert lens.shape[1] == dirs.shape[2], 'same skeletons number'
    assert lens.shape[0] == dirs.shape[0], 'same batch size'
    length = lens.view(-1, 1, lens.shape[1], 1)
    joints = torch.zeros(
        (dirs.shape[0], dirs.shape[1], dirs.shape[2] + 1, dirs.shape[3]),
        dtype=torch.float32, device=device
    )  # (-1, T, J, dim)
    for i in range(sk_sidx.shape[0]):
        joints[:, :, sk_eidx[i]] = joints[:, :, sk_sidx[i]] + length[:, :, i] * dirs[:, :, i]
    return joints  # (-1, T, J, dim)
def compose_skeleton2(lens, dirs, device='cuda:0'):
    # lens, dirs: (-1, S, 1), (-1, T, S, dim)
    assert len(lens.shape) == 3 and lens.shape[2] == 1, lens.shape
    assert len(dirs.shape) == 4
    assert lens.shape[1] == dirs.shape[2], 'same skeletons number'
    assert lens.shape[0] == dirs.shape[0], 'same batch size'
    length = lens.view(-1, 1, lens.shape[1], 1)
    joints = torch.zeros((dirs.shape[0], dirs.shape[1], dirs.shape[2] + 1, dirs.shape[3]), dtype=torch.float32, device=device)  # (-1, T, J, dim)
    joints[:, :, sk_eidx] = length * dirs
    return torch.matmul(adj, joints)  # (-1, T, J, dim)
def decompose_skeleton(joints, device='cuda:0'):
    # joints: (-1, T, J, dim)
    # joints to skeletons
    skeletons = joints[:, :, sk_eidx, :] - joints[:, :, sk_sidx, :]  # (-1, T, S, dim)
    length = torch.sqrt(torch.sum(skeletons ** 2, dim=-1, keepdim=True))  # (-1, T, S, 1)
    # normalize direction vectors
    direction = skeletons / length  # (-1, T, S, dim)
    return length, direction
num_joints = 15
batch_size = 128
T = 243
dimension = 3
for _ in range(10):
    joints = torch.randn((batch_size, T, num_joints, dimension)).cuda()
    start_time = time.time()
    for _ in range(100):
        length, direction = decompose_skeleton(joints)  # (128, 243, 14, 1), (128, 243, 14, 3)
        new_joints = compose_skeleton1(torch.mean(length, dim=1), direction)
    print(time.time() - start_time)
    start_time = time.time()
    for _ in range(100):
        length, direction = decompose_skeleton(joints)  # (128, 243, 14, 1), (128, 243, 14, 3)
        new_joints = compose_skeleton2(torch.mean(length, dim=1), direction)
    print(time.time() - start_time)
    print('='*100)
6.3679351806640625
1.2472772598266602
====================================================================================================
6.385876655578613
0.019004344940185547
====================================================================================================
6.404436111450195
0.01900482177734375
====================================================================================================
6.376429796218872
0.01900506019592285
====================================================================================================
6.347423315048218
0.019004344940185547
====================================================================================================
6.374429225921631
0.019004344940185547
====================================================================================================
6.414942741394043
0.02200460433959961
====================================================================================================
6.432886123657227
0.019003629684448242
====================================================================================================
6.40843653678894
0.019004344940185547
====================================================================================================
6.4684507846832275
0.01900458335876465
====================================================================================================
It is not even a contest. On GPU, sticking to the library's batched tensor ops wins every time; the 14-step Python loop has to issue a chain of small, serially dependent operations, while the matmul version runs as a few large parallel kernels.
I was training on a dataset that ran roughly 30,000 iterations per epoch. So what does a 6-second gap per 100 iterations add up to?
$${6\times300\times10\over3600}=5$$
Even over just 10 epochs, that is a 5-hour difference.
Another model was training on the same GPU while I ran these tests, so the numbers may not exactly match a clean setup.
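One more caveat of my own, not from the original post: CUDA kernels are launched asynchronously, so timing them with time.time() alone can misreport the actual GPU work. A more careful version of the timing loop would bracket the timed region with torch.cuda.synchronize():

torch.cuda.synchronize()                 # flush any pending GPU work before starting the clock
start_time = time.time()
for _ in range(100):
    length, direction = decompose_skeleton(joints)
    new_joints = compose_skeleton2(torch.mean(length, dim=1), direction)
torch.cuda.synchronize()                 # wait for all queued kernels to finish before stopping the clock
print(time.time() - start_time)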
Over the course of a day I went through almost every part of the project's code, removing for loops like this one and making similar fixes, and got training from about 8 sec/iter down to 1.1 sec/iter.
The grunt work of printing time.time() after every line, and the dumb hard-coding that follows, is not wasted effort after all.
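For what it's worth, a tiny helper of my own (hypothetical, not part of the project code) makes that kind of line-by-line timing a little less tedious:

import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # print how long the wrapped block took; add torch.cuda.synchronize() around it when timing GPU code
    start = time.time()
    yield
    print(f'{label}: {time.time() - start:.4f} s')

# usage:
# with timed('compose'):
#     new_joints = compose_skeleton2(torch.mean(length, dim=1), direction)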