MobileNet v2

hjr067 2024. 8. 19. 21:06

This post covers the paper in which Google proposed MobileNet V2: MobileNetV2: Inverted Residuals and Linear Bottlenecks.

- An improved version of the MobileNet model

- MobileNet v2 is a lightweight network with a simple structure, designed for embedded devices and mobile devices

- It mainly relies on depthwise-separable convolution and the width/resolution multipliers, which trade off accuracy against model size

- Unlike MobileNet, which simply stacks depthwise-separable convolutions, MobileNet v2 builds the network out of a structure called the inverted residual block
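
To make the accuracy/size trade-off above concrete, here is a minimal sketch that counts convolution weights only. The function names and the channel counts (32 → 64, 16 → 32) are illustrative assumptions, not values from the paper, and bias and BatchNorm parameters are ignored.

```python
def standard_conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights in a standard k x k convolution (bias and BatchNorm ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights in a k x k depthwise convolution followed by a 1x1 pointwise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer: 32 input channels, 64 output channels, 3x3 kernel.
    print("standard            :", standard_conv_params(32, 64))
    print("depthwise-separable :", depthwise_separable_params(32, 64))
    # Halving both channel counts, as a width multiplier of 0.5 would roughly do.
    print("separable, half wide:", depthwise_separable_params(16, 32))
```

For this hypothetical layer the separable version needs 2336 weights instead of 18432, and halving the width brings it down to 656, which is the kind of knob the width multiplier provides.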

MobileNet v2 has two kinds of blocks. The first is a residual block with stride 1, and the second is a block with stride 2 used for downsizing. Each block has three layers.

In both blocks, the first layer is a pointwise (1x1) convolution + ReLU6.
- https://gaussian37.github.io/dl-concept-relu6/ : why ReLU6 is used
The second layer is a depthwise convolution: the first block applies it with stride 1, and the second block applies it with stride 2 to downsize.
The third layer is again a pointwise (1x1) convolution, but this time without an activation function. In other words, no non-linearity is applied.
The stride-2 block has no skip connection. With stride 2 the feature map is halved in size, so the skip connection would also have to be resized to match; presumably that is why no skip connection is used there.
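
The three layers described above can be traced as plain shape arithmetic. This is a rough sketch, not the torchvision implementation: `trace_block` is a hypothetical helper, the 3x3 depthwise kernel with padding 1 and the expansion factor of 6 are assumptions taken from the code listing later in this post, and the skip condition follows that listing (stride 1 and equal input/output channels).

```python
from typing import List, Tuple

Shape = Tuple[int, int, int]  # (channels, height, width)


def trace_block(c_in: int, c_out: int, h: int, w: int, stride: int,
                expand_ratio: int = 6) -> Tuple[List[Tuple[str, Shape]], bool]:
    """Return the output shape of each layer and whether a skip connection applies."""
    hidden = c_in * expand_ratio
    # 3x3 depthwise convolution with padding 1: only the stride changes H and W.
    h_out = (h + 2 - 3) // stride + 1
    w_out = (w + 2 - 3) // stride + 1
    shapes = [
        ("1x1 pointwise + ReLU6", (hidden, h, w)),
        ("3x3 depthwise", (hidden, h_out, w_out)),
        ("1x1 pointwise, linear", (c_out, h_out, w_out)),
    ]
    use_skip = stride == 1 and c_in == c_out
    return shapes, use_skip


if __name__ == "__main__":
    for stride, c_out in [(1, 24), (2, 32)]:
        shapes, use_skip = trace_block(24, c_out, 56, 56, stride)
        print(f"stride={stride}, skip connection={use_skip}")
        for name, shape in shapes:
            print(f"  {name:<22} -> {shape}")
```

With a 24-channel 56x56 input, the stride-1 block ends at (24, 56, 56) and can be added to its input, while the stride-2 block ends at (32, 28, 28) and cannot.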

Design strategy

Linear Bottlenecks

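1. Manifold of interest

When high-dimensional data is compressed into a lower dimension, particular information is mapped onto some region of the low-dimensional space; that region can be thought of as the manifold.

* manifold: loosely, in mathematics, a structure that lives inside a space of a different (higher) dimension, for example a 2-dimensional plane sitting inside 3-dimensional space.

Consider an image fed into a network. At the n-th layer, the features of the input image are activated in a particular low-dimensional region. The region where features are mapped (activated) in this way is said to form the manifold of interest.

e.g. an image may have 1000x1000 pixels, but the meaningful information in it (faces, object boundaries, and so on) can be compressed into a low-dimensional space.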

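It has long been assumed that the manifold of interest can be embedded in a low-dimensional subspace.

In other words, the assumption is that high-dimensional information can be represented in a lower dimension.

For a raccoon image, which is high-dimensional data, the information corresponding to the raccoon's features is mapped onto part of a low-dimensional space (a subspace) and forms a region there. That is, the raccoon's information is still present in the low dimension!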

2. Linear Transformation

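ReLU is piecewise linear (negative region: 0 / positive region: returns the input itself), so it behaves as a linear transformation only where its output is non-zero.

→ If the manifold of interest lies in the positive region, ReLU is equivalent to a linear transformation there, so the information can be considered preserved even after passing through ReLU.

In the paper's experiment, when the data is embedded with only a few channels (2 or 3) and passed through ReLU, a large part of the information is lost on reconstruction, whereas with 15 or more channels most of the original information is preserved.

With ReLU, some information is inevitably lost in each channel.

However, the paper argues that the more channels are used, the more of the information can be preserved.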
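
The effect of the channel count can be illustrated with a rough proxy; this is not the paper's experiment. The sketch below embeds points on a 2D unit circle into n channels with a random matrix, applies ReLU, and counts a point as "lost" when fewer than 2 channels stay positive, since a 2D input can only be solved back from at least 2 surviving linear equations. The helper name `lost_fraction`, the Gaussian weights, and this criterion are all assumptions made for illustration.

```python
import math
import random


def lost_fraction(n_channels: int, n_points: int = 360, seed: int = 0) -> float:
    """Fraction of unit-circle points left with fewer than 2 positive channels after ReLU."""
    rng = random.Random(seed)
    # Each row is one channel of a random linear embedding from 2D into n_channels.
    rows = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n_channels)]
    lost = 0
    for k in range(n_points):
        angle = 2.0 * math.pi * k / n_points
        x, y = math.cos(angle), math.sin(angle)
        # A channel survives ReLU only if its pre-activation is positive.
        active = sum(1 for a, b in rows if a * x + b * y > 0.0)
        if active < 2:
            lost += 1
    return lost / n_points


if __name__ == "__main__":
    for n in (2, 3, 15, 30):
        print(f"channels={n:>2}  lost fraction={lost_fraction(n):.3f}")
```

With only 2 channels at least half of the circle ends up with too few positive channels, while with many channels this becomes very unlikely, which matches the direction of the paper's claim.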

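As data passes through the network it is repeatedly mapped to lower dimensions. Here (assuming the input manifold lies in a low-dimensional subspace of the input space), ReLU simply passes positive values through unchanged, i.e. it acts as a linear transformation on them, so the information on the manifold is kept as is.

That is, when building a bottleneck architecture that maps to a lower dimension, the idea is to add a linear bottleneck layer that acts as a linear transformation, so that the dimension is reduced while the important information on the manifold is kept!

Experiments based on this hypothesis confirmed that using ReLU in the bottleneck layer actually hurts performance ✔️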

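3. Linear Bottlenecks (removing ReLU)

The strategy is to apply a linear transformation without an activation function, to prevent information loss in low dimensions.

* What is a bottleneck structure?

A bottleneck structure reduces the dimension at an intermediate layer of the network and thereby constrains the flow of information. As a result, the network compresses only the essential information and passes it to the next layer.

- MobileNet v2 uses techniques such as depthwise separable convolution to make the network more efficient. However, when the dimension is reduced along the way, important information (especially information in the low-dimensional space) can be lost as it passes through an activation function (ReLU, Softmax, Sigmoid, etc.).

- ReLU sets negative values to 0, so important information can be wiped out. To prevent this, MobileNet v2 introduces linear bottlenecks.

-> The manifold of interest, i.e. the important information in the input, can be carried to a low-dimensional region as it passes through the layers, and we can assume the information is preserved if the layer is a linear transformation.

-> MobileNet v2 uses a bottleneck structure when building the linear transformation that maps to a lower dimension.

In summary, MobileNet v2 uses linear bottleneck layers, which act as linear transformations, to reduce the dimension while keeping the important information (the manifold of interest), so the network gets smaller while accuracy is maintained.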
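
A tiny numeric sketch of why ReLU after a low-dimensional projection can destroy information. At a single pixel, a 1x1 convolution is just a matrix-vector product, so the example projects two different 4-channel features down to 2 channels. The weights `W` and the inputs are made-up values chosen for illustration.

```python
from typing import List


def project(weights: List[List[float]], x: List[float]) -> List[float]:
    """1x1 convolution at one pixel: a plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(v: List[float]) -> List[float]:
    return [max(0.0, a) for a in v]


# Hypothetical 4 -> 2 channel bottleneck weights.
W = [[1.0, -1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 1.0]]

x1 = [1.0, 2.0, 0.0, 1.0]
x2 = [1.0, 4.0, 0.0, 1.0]

linear_1, linear_2 = project(W, x1), project(W, x2)

if __name__ == "__main__":
    print("linear bottleneck:", linear_1, linear_2)              # still distinguishable
    print("with ReLU        :", relu(linear_1), relu(linear_2))  # collapsed to one value
```

Without the activation the two inputs stay distinct ([-1.0, 1.0] versus [-3.0, 1.0]); with ReLU both become [0.0, 1.0], so later layers can no longer tell them apart.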

Inverted Residuals

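A conventional residual block has a wide → narrow → wide shape, where the narrow middle forms the bottleneck.
The incoming input is wide, with many channels, and a 1x1 convolution reduces the channels to create the bottleneck in the next layer.
In the bottleneck a 3x3 convolution is applied, and the result is then restored to the original size so it can be added to the skip connection.

The inverted residual proposed in the paper works in exactly the opposite way to an ordinary residual block.

The feature drawn with a dashed outline as the first input is the linear bottleneck discussed above (it has not passed through ReLU).
That is, the structure is narrow → wide → narrow, and the skip connection joins the narrow layers.
The reason for trying this is the assumption that the narrow, low-dimensional layers hold only the necessary information in compressed form.
Since the necessary information is in the narrow layers, using them for the skip connection can be expected to carry that information well into deeper layers.
The purpose of doing this is to reduce memory usage by using the compressed narrow layers for the skip connection.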
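
The memory argument can be sketched with simple arithmetic. The helper names, the reduction factor of 4 for the classic block, and the concrete sizes (widest layer 144 channels at 56x56) are assumptions chosen so that both blocks have the same widest layer; only the expansion factor of 6 comes from the code listing below.

```python
from typing import List


def classic_bottleneck(c_wide: int, reduction: int = 4) -> List[int]:
    """Channel profile wide -> narrow -> wide of a conventional residual block."""
    return [c_wide, c_wide // reduction, c_wide]


def inverted_residual(c_narrow: int, expand_ratio: int = 6) -> List[int]:
    """Channel profile narrow -> wide -> narrow of an inverted residual block."""
    return [c_narrow, c_narrow * expand_ratio, c_narrow]


def skip_elements(profile: List[int], h: int, w: int) -> int:
    """Number of values the skip connection has to keep around (the block input)."""
    return profile[0] * h * w


if __name__ == "__main__":
    classic = classic_bottleneck(144)
    inverted = inverted_residual(24)
    print("classic :", classic, "skip tensor elements:", skip_elements(classic, 56, 56))
    print("inverted:", inverted, "skip tensor elements:", skip_elements(inverted, 56, 56))
```

In this illustrative setting the inverted block's skip connection holds 75264 values instead of 451584, i.e. 6 times fewer, because it connects the narrow ends of the block rather than the wide ones.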

PyTorch code

from torch import nn
from torch import Tensor
from typing import Callable, Any, Optional, List


__all__ = ['MobileNetV2', 'mobilenet_v2']


model_urls = {
    'mobilenet_v2': 'https://download.pytorch.org/models/mobilenet_v2-b0353104.pth',
}


def _make_divisible(v: float, divisor: int, min_value: Optional[int] = None) -> int:
    """
    This function is taken from the original tf repo.
    It ensures that all layers have a channel number that is divisible by 8
    It can be seen here:
    https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py
    :param v:
    :param divisor:
    :param min_value:
    :return:
    """
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    # Make sure that round down does not go down by more than 10%.
    if new_v < 0.9 * v:
        new_v += divisor
    return new_v


class ConvBNReLU(nn.Sequential):
    def __init__(
        self,
        in_planes: int,
        out_planes: int,
        kernel_size: int = 3,
        stride: int = 1,
        groups: int = 1,
        norm_layer: Optional[Callable[..., nn.Module]] = None
    ) -> None:
        padding = (kernel_size - 1) // 2
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        super(ConvBNReLU, self).__init__(
            nn.Conv2d(in_planes, out_planes, kernel_size, stride, padding, groups=groups, bias=False),
            norm_layer(out_planes),
            nn.ReLU6(inplace=True)
        )


class InvertedResidual(nn.Module):
    def __init__(
        self,
        inp: int,
        oup: int,
        stride: int,
        expand_ratio: int,
        norm_layer: Optional[Callable[..., nn.Module]] = None
    ) -> None:
        super(InvertedResidual, self).__init__()
        self.stride = stride
        # stride must be 1 or 2, so guard it with an assertion.
        assert stride in [1, 2]

        if norm_layer is None:
            norm_layer = nn.BatchNorm2d

        # Expand the channels using the expansion factor.
        hidden_dim = int(round(inp * expand_ratio))
        # The residual (skip) connection is used only when stride is 1.
        # With a skip connection, the input and output must have the same size.
        self.use_res_connect = (self.stride == 1) and (inp == oup)

        # Inverted residual operations
        layers: List[nn.Module] = []
        if expand_ratio != 1:
            # point-wise convolution
            layers.append(ConvBNReLU(inp, hidden_dim, kernel_size=1, norm_layer=norm_layer))
        layers.extend([
            # depth-wise convolution
            ConvBNReLU(hidden_dim, hidden_dim, stride=stride, groups=hidden_dim, norm_layer=norm_layer),
            # point-wise linear convolution
            nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False),
            norm_layer(oup),
        ])            
        self.conv = nn.Sequential(*layers)

    def forward(self, x: Tensor) -> Tensor:
        # Add the skip connection only when use_res_connect is set.
        # use_res_connect: True when stride is 1 and the input and output channel counts match
        if self.use_res_connect:
            return x + self.conv(x)
        else:
            return self.conv(x)


class MobileNetV2(nn.Module):
    def __init__(
        self,
        num_classes: int = 1000,
        width_mult: float = 1.0,
        inverted_residual_setting: Optional[List[List[int]]] = None,
        round_nearest: int = 8,
        block: Optional[Callable[..., nn.Module]] = None,
        norm_layer: Optional[Callable[..., nn.Module]] = None
    ) -> None:
        """
        MobileNet V2 main class
        Args:
            num_classes (int): Number of classes
            width_mult (float): Width multiplier - adjusts number of channels in each layer by this amount
            inverted_residual_setting: Network structure
            round_nearest (int): Round the number of channels in each layer to be a multiple of this number
            Set to 1 to turn off rounding
            block: Module specifying inverted residual building block for mobilenet
            norm_layer: Module specifying the normalization layer to use
        """
        super(MobileNetV2, self).__init__()

        if block is None:
            block = InvertedResidual

        if norm_layer is None:
            norm_layer = nn.BatchNorm2d

        input_channel = 32
        last_channel = 1280
        
        # t : expansion factor
        # c : number of output channels
        # n : number of repetitions
        # s : stride
        if inverted_residual_setting is None:
            inverted_residual_setting = [
                # t, c, n, s
                [1, 16, 1, 1],
                [6, 24, 2, 2],
                [6, 32, 3, 2],
                [6, 64, 4, 2],
                [6, 96, 3, 1],
                [6, 160, 3, 2],
                [6, 320, 1, 1],
            ]

        # only check the first element, assuming user knows t,c,n,s are required
        if len(inverted_residual_setting) == 0 or len(inverted_residual_setting[0]) != 4:
            raise ValueError("inverted_residual_setting should be non-empty "
                             "or a 4-element list, got {}".format(inverted_residual_setting))

        # building first layer
        input_channel = _make_divisible(input_channel * width_mult, round_nearest)
        self.last_channel = _make_divisible(last_channel * max(1.0, width_mult), round_nearest)
        features: List[nn.Module] = [ConvBNReLU(3, input_channel, stride=2, norm_layer=norm_layer)]
        
        
        # Build the inverted residual blocks.
        # The layers are appended to features in order.
        for t, c, n, s in inverted_residual_setting:
            # The width multiplier scales the number of channels in each layer by a fixed ratio.
            output_channel = _make_divisible(c * width_mult, round_nearest)
            for i in range(n):
                stride = s if i == 0 else 1
                features.append(block(input_channel, output_channel, stride, expand_ratio=t, norm_layer=norm_layer))
                input_channel = output_channel
        
        # building last several layers
        features.append(ConvBNReLU(input_channel, self.last_channel, kernel_size=1, norm_layer=norm_layer))
        # make it nn.Sequential
        self.features = nn.Sequential(*features)

        # building classifier
        self.classifier = nn.Sequential(
            nn.Dropout(0.2),
            nn.Linear(self.last_channel, num_classes),
        )

        # weight initialization
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out')
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
                nn.init.ones_(m.weight)
                nn.init.zeros_(m.bias)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.zeros_(m.bias)

    def _forward_impl(self, x: Tensor) -> Tensor:
        # This exists since TorchScript doesn't support inheritance, so the superclass method
        # (this one) needs to have a name other than `forward` that can be accessed in a subclass
        x = self.features(x)
        # Cannot use "squeeze" as batch-size can be 1 => must use reshape with x.shape[0]
        x = nn.functional.adaptive_avg_pool2d(x, (1, 1)).reshape(x.shape[0], -1)
        x = self.classifier(x)
        return x

    def forward(self, x: Tensor) -> Tensor:
        return self._forward_impl(x)


def mobilenet_v2(pretrained: bool = False, progress: bool = True, **kwargs: Any) -> MobileNetV2:
    """
    Constructs a MobileNetV2 architecture from
    `"MobileNetV2: Inverted Residuals and Linear Bottlenecks" <https://arxiv.org/abs/1801.04381>`_.
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        progress (bool): If True, displays a progress bar of the download to stderr
    """
    model = MobileNetV2(**kwargs)
    if pretrained:
        try:
            from torch.hub import load_state_dict_from_url
        except ImportError:
            from torch.utils.model_zoo import load_url as load_state_dict_from_url
        # Use the URL table defined above and honour the caller's progress flag.
        state_dict = load_state_dict_from_url(
            model_urls['mobilenet_v2'], progress=progress)
        model.load_state_dict(state_dict)
    return model

if __name__ == '__main__':
    net = mobilenet_v2(True)

Conclusion

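MobileNet v2 builds on the idea of storing high-dimensional information in a low dimension without loss, and designs a network around the inverted residual bottleneck structure, producing results that are efficient while preserving as much performance as possible.

Because the network is built simply by stacking inverted residual blocks, it is easy to implement, and its size can be adjusted as needed by changing the number of blocks. The paper also emphasizes that this simple structure makes it easy to optimize in frameworks such as TensorFlow and Caffe.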
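
To see what "stacking blocks" amounts to, the sketch below walks the default `inverted_residual_setting` from the code listing and tracks the feature-map size stage by stage. It is plain arithmetic, assuming a 224x224 input, width multiplier 1.0, and 3x3 convolutions with padding 1; `trace_stages` is a hypothetical helper, not part of torchvision.

```python
from typing import List, Tuple

# t (expansion factor), c (output channels), n (repeats), s (stride of first repeat)
SETTING = [
    [1, 16, 1, 1],
    [6, 24, 2, 2],
    [6, 32, 3, 2],
    [6, 64, 4, 2],
    [6, 96, 3, 1],
    [6, 160, 3, 2],
    [6, 320, 1, 1],
]


def conv_out(size: int, stride: int) -> int:
    """Output size of a 3x3 convolution with padding 1."""
    return (size + 2 - 3) // stride + 1


def trace_stages(resolution: int = 224,
                 setting: List[List[int]] = SETTING) -> List[Tuple[int, int, int]]:
    """Return (channels, blocks, feature-map size) after each stage."""
    size = conv_out(resolution, 2)  # first stride-2 convolution
    rows = []
    for _t, c, n, s in setting:
        size = conv_out(size, s)    # only the first block of a stage uses stride s
        rows.append((c, n, size))
    return rows


if __name__ == "__main__":
    stages = trace_stages()
    for c, n, size in stages:
        print(f"{n} block(s) -> {c:>3} channels at {size}x{size}")
    print("total inverted residual blocks:", sum(n for _, n, _ in stages))
```

Changing the `n` column of the setting is the "adjust the number of blocks" knob mentioned above; with the default values the stack has 17 blocks and ends at 320 channels on a 7x7 map.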

In conclusion, like MobileNet v1, it is a lightweight network with reasonably assured performance, and I think it is useful when targeting mobile environments.

References

https://velog.io/@woojinn8/LightWeight-Deep-Learning-7.-MobileNet-v2

https://gaussian37.github.io/dl-concept-mobilenet_v2/