
FashionFit: Analysis of Mapping 3D Pose and Neural Body Fit for Custom Virtual Try-On

MOHAMMAD FARUKH HASHMI 1, (Member, IEEE), B. KIRAN KUMAR ASHISH2, AVINASH G. KESKAR3, NEERAJ DHANRAJ BOKDE 4, AND ZONG WOO GEEM 5, (Member, IEEE)

ABSTRACT

Visual compatibility and virtual feel are critical metrics for fashion analysis, yet they are missing from existing fashion designs and platforms.
An explicit model is needed to instill visual compatibility through fashion image inpainting and virtual try-on. With rapid advances in computer vision, richer customer experiences have become possible, which holds great potential interest for retailers and customers alike.
The available public datasets are well suited to generating outfits with Generative Adversarial Networks (GANs), but custom outfits supplied by the users themselves lead to low accuracy levels.


This work is a first step toward analyzing and experimenting with the fit of custom outfits and visualizing them on the users themselves, creating a strong customer experience.
It analyzes the need for visualizing custom outfits on users within the large corpus of AI in fashion.
The authors propose a novel architecture that combines outfits provided by retailers and visualizes them on the users themselves using Neural Body Fit.


This work sets a benchmark in disentangling the custom generation of cloth outfits using GANs and virtually trying them on users, ensuring a photorealistic virtual appearance and a strong AI-driven customer experience.
Extensive experiments show high accuracy on outfits generated by GANs, but not at fully customized levels.


This experiment establishes new state-of-the-art results by plotting the user's pose to compute the length of each body-part segment (hand, leg, and so forth) and by combining segmentation with NBF for accurate fitting of the cloth outfit. This paper differs from all competing approaches in how it performs virtual try-on to create a new customer experience.


INDEX TERMS Neural body fit, generative adversarial networks (GANs), pose, customer experience, segmentation.

I. INTRODUCTION

Recent developments and breakthroughs in computer vision for the fashion space, such as Variational Autoencoders (VAEs) [1], Generative Adversarial Networks (GANs) [2], and their variants, have opened a path to a myriad of fashion-synthesis applications [3]-[5].
These applications include virtual try-on, AR/VR neural body fit, and language-guided fashion synthesis.


With photos uploaded daily to social media [4], [5] around the world in everyday scenarios, the resulting datasets [4]-[6] and the understanding of surrounding scenes have made it much easier for researchers to develop visual recommendation systems based on user-provided inputs.


Most existing methods rely on GANs [2], [7], which yield high-quality, fine, and realistic images [4], [8], as shown in Figure 1.
Although GANs [2], [9] can generate high-level perceptual features with a realistic feel [5], [10], retailers still face a gap in customer experience because customers cannot test custom cloth outfits on themselves.


GAN-generated fashion images [2], [7] dominate today's virtual try-ons and the fashion industry, mainly for advertising and display purposes.
The realistic images proposed in [11] are limited to selecting available garments for ordinary shopping.
Film stars, sports personalities, and movie characters often appear on television in mesmerizing cloth outfits; most do not reveal the source, or the outfits are custom designed.
Noa Garcia and George Vogiatzis [12] proposed the "Dress like a Star" method, in which outfits can be extracted from video frames.


The latest trend is image generation that looks visually realistic.
Han et al. [13] proposed the FiNet network, which generates realistic images; such models are most often used in the fashion industry to create new designs.
Shion Honda [14] proposed VITON-GAN for virtual try-on with GANs.

Existing methods cannot achieve high accuracy on custom virtual try-on [3], [7], [12], [15], [16], where a user uploads a photo of himself/herself and tries on different cloth-outfit images to get a realistic feel on screen.
Many public datasets [6], [17]-[21] are available, but they do not yet put complete access at users' fingertips.
Therefore, this work proposes a novel architecture in which the user can try out different outfits on himself/herself with the utmost realism on screen.
This breakthrough can give customers the actual try-on feel they usually get in shopping malls, bridging the gap between reality and on-screen feel through computer vision.

In recent times, there has been extensive research on characterizing clothing with computer vision algorithms.
The rate at which people upload photos of different styles to social media has spiked in recent years.
The resulting cloth-outfit data are tremendous and quite useful for generating new garments, and achieving a perfect fit of new outfits with GANs [2] has become simple.
Given the availability of huge datasets spanning wide categories of cloth outfits, this work focuses mostly on building an algorithm for photorealistic garment fit onto bodies, i.e., virtual try-on of customized cloth outfits.
This stylistic signature of virtual try-on mainly involves GAN-based [2] combination of the human body with custom-selected cloth garments [13].
A perfect body fit [22] had been a challenge until recently because of the lack of data needed to create photorealistic images; now that DeepFashion2 [17] and many similar datasets have appeared, virtual try-on has become easier, as models can be trained on very large-scale datasets [23].

This work constructs an advanced framework that aggregates all the learned features from the 3D pose plots [24] and the corresponding cloth attributes, matching their nodes for visual perception [9].
The features learned by the CNN include clothing categories, 3D poses [24], and masks, solving clothing-image retrieval in an end-to-end virtual try-on manner.

A. VISUAL CLOTH OUTFITS AND GEOGRAPHY

Clothing styles differ across geographical locations [21].
The numerous outfit styles pose a challenging task because outfits are complex in some scenarios, such as thick styles, loose shirt styles, and many more.
These latest styles are recent trends and have set a benchmark in the fashion industry.
The lack of such complex outfit datasets has limited progress across these scenarios and forced researchers to fit their architectures to a limited set of outfits.

B. IMAGE AND SEGMENT SYNTHESIS

Fashion is a huge ocean overflowing with styles, and users are quick to like a different style for each garment segment, for example, sleeveless tops, shorts, and many others [25].
Outfits with differently styled segments make it challenging to map each style accurately to the corresponding body segment; the problem arises when, say, shorts are mapped to the hands, which is a false positive.
Recent developments in annotations and generated fashion outfits [7], [9], [14] have largely eliminated this challenge, as shown in Figure 2.

C. VISUAL RECOMMENDATION SYSTEM

Visual compatibility modelling plays a key role in visual search, visual recommendation systems, and visual information retrieval systems [8], [14], [15], [26]-[29].
Recent trends in this space center on image synthesis, where similar cloth outfits appear on screen.
Simple NLP and CNN classifiers group similar objects into one query search, so the latest released outfits often appear on screen when users map them onto a similar class/object.
Furthermore, existing models are all labeled and filtered by users to obtain different styles and outfits separately.
Comparison in the existing filter search is homogeneous, and different outfits and styles within a single catalog require different sorts and queries.
This prevents users from exploring much deeper.
Fashion experts recommend presenting all styles and outfits to users on a single plate, which is quicker, smarter, and more productive, and lets users imagine a realistic comparison across different catalogs.
The most recent deep learning models use bidirectional LSTMs and Siamese CNNs [19], [30] to predict the next item based on the user's past and currently selected catalog.
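
The Siamese-style next-item scoring mentioned above can be sketched minimally as follows. This is an illustrative toy, not the cited architecture: a fixed random linear map stands in for the shared CNN branch, and the catalog and the user's current selection are synthetic vectors.

```python
import numpy as np

rng = np.random.default_rng(42)
W = rng.normal(size=(16, 128)) / np.sqrt(128)  # shared "embedding" weights (stand-in for the CNN)

def embed(item):
    """Shared branch of the Siamese network applied to one item's feature vector."""
    return np.tanh(W @ item)

def compatibility(item_a, item_b):
    """Siamese score: negative distance between the two shared embeddings."""
    return -np.linalg.norm(embed(item_a) - embed(item_b))

catalog = rng.normal(size=(4, 128))                  # candidate next items
current = catalog[1] + 0.05 * rng.normal(size=128)   # user's pick resembles item 1
best = int(np.argmax([compatibility(current, c) for c in catalog]))
print(best)  # expected: 1
```

Because both items pass through the same branch, the score is symmetric and training only ever updates one set of weights, which is the design motivation for Siamese retrieval models.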

D. IMAGE SYNTHESIS

1) GENERATIVE ADVERSARIAL NETWORKS (GANs)
The latest advancement in creating customer experience is generating cloth outfits with GANs [2], [7], [13].
GANs [2] have opened up many domains; in fashion, they have proven to be a great path toward customer experience, and retailers found them a powerful funnel for attracting customers.
Shion Honda [14] proposed a two-stage architecture that generates new clothes on one person and transfers them to a different person.
This new clothing visualization has drawn much user interest in trying out different arbitrary poses [31], [32] with GAN-generated outfits.

2) CONDITIONAL GANs
Conditional GANs are generative models conditioned on discrete labels and images; they create a new way of mapping scenes and edges for bidirectional mapping between unpaired images [14].
The existing model builds a semantic layout of images [33] generated from different user inputs.

E. HUMAN POSE + SEGMENTATION FOR GANs GENERATED OUTFITS
Another approach to achieving accuracy in custom try-on outfits is human parsing and pose estimation [34], [35], which yields pixel-level annotation of the body [4].
The work proposed in [14] covers the region of interest on the body with specified keypoints where the cloth outfits are to be appended [11].
This approach widened the path for GANs [7] to validate their generated outfits on the body, giving the essence of virtual try-on [14], [31], [32], [36].
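
The abstract's idea of measuring body-part segment lengths from plotted pose keypoints can be sketched as follows. The keypoint names, coordinates, and segment pairs are illustrative assumptions; a real pipeline would take the keypoints from a pose estimator such as the MPII-trained models cited above.

```python
import numpy as np

# Illustrative 2D keypoints in pixel coordinates (hypothetical values).
keypoints = {
    "shoulder_r": (310.0, 120.0),
    "elbow_r":    (330.0, 190.0),
    "wrist_r":    (340.0, 255.0),
    "hip_r":      (300.0, 260.0),
    "knee_r":     (305.0, 360.0),
}

# Each body-part segment is a pair of keypoints to measure between.
segments = {
    "upper_arm_r": ("shoulder_r", "elbow_r"),
    "forearm_r":   ("elbow_r", "wrist_r"),
    "thigh_r":     ("hip_r", "knee_r"),
}

def segment_lengths(kps, segs):
    """Euclidean length of each segment, usable to scale garment regions to the body."""
    return {name: float(np.hypot(kps[a][0] - kps[b][0], kps[a][1] - kps[b][1]))
            for name, (a, b) in segs.items()}

lengths = segment_lengths(keypoints, segments)
print(round(lengths["thigh_r"], 2))  # -> 100.12
```

These per-segment lengths are the kind of measurement the proposed pipeline needs before warping a garment onto the parsed body regions.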

F. DATASETS

This work leverages several datasets, such as fashion datasets [25] and DeepFashion2 [17], for custom cloth outfits.
The MVC dataset [19] provides view-invariant clothing retrieval and attribute prediction, and even offers cropped and dressed person images with varied views and multiple poses.
The MPII Pose dataset [35] and DensePose [9] are used for human parsing and pose estimation.
DeepFashion2 [17] contains a training set of 391K images across 13 classes, a validation set of 34K images, and a test set of 67K images.
The LIP dataset [37] and a few random images taken from Google were used for prediction tests.
Annotations were largely done on a segmentation basis using the VIA annotation tool [38].
This paper experiments with combining and concatenating the Fashion and Pose architectures to set a new benchmark in advancing virtual try-on for a new customer experience.

III. TECHNICAL APPROACH

This paper combines two great strands, i.e., human pose and fashion's virtual try-on.
The experiment is an initial step in FashionAI: creating a fully custom virtual try-on at users' fingertips with the highest accuracy.
The problem with existing models is the absence of fully custom features, such as choosing custom outfits separately and virtually trying them on one's own catalog.
This has left an inadequate customer experience in the fashion market.
This paper presents a novel approach that provides fully custom features with custom virtual try-on by combining the two architectures.

A. VISUAL RECOMMENDATION AND QUERY

This step is crucial for letting the user select custom garments.
A similarity search, based on the Euclidean distance from the outfits selected by the user, retrieves recommendations of similar garments available from the e-retailers.
The recommendation system issues queries [36], [39], [40] based on past and current selections [3].
Image retrieval [3], [41]-[43] for the users works as follows.
Let the e-retailer's cloth outfit gallery be a set G of items y. The system computes the similarity s between an image query x and each object y and ranks them, with \(x = x^i\) and \(y = y^i\), where \(x^i \in \mathbb{R}^{C \times l}\) and \(y^i \in \mathbb{R}^{C \times l}\) are locally mapped features of the retrieved cloth images and the e-retailer's items, respectively [21], [44], [45].
Retailer items and retrieved cloth images are mapped as global descriptors: \[s_g = S_g(A(x), A(y)) \qquad (1)\]

where A(·) is an aggregation function consisting of a pooling operator (either average or maximum) over the convolutional layers, and S_g(·,·) is the similarity function.
Most of the time the similarity function is the Euclidean distance; sometimes cosine similarity can be used, depending on the available datasets.
Most visual recommendation systems [36], [39], [46], [47] are built on Euclidean distance operators.
This function maps the input query to the clothing set available in the retailer's database.
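
Equation (1) can be illustrated with a minimal NumPy sketch. The pooling operator A(·) and the Euclidean similarity S_g come from the text above; the feature shapes and the synthetic gallery are illustrative assumptions.

```python
import numpy as np

def aggregate(feat, mode="avg"):
    """A(.): pool a C x l local-feature map into a C-dimensional global descriptor."""
    return feat.mean(axis=1) if mode == "avg" else feat.max(axis=1)

def similarity(x, y, mode="avg"):
    """s_g = S_g(A(x), A(y)), with negative Euclidean distance as the similarity."""
    return -np.linalg.norm(aggregate(x, mode) - aggregate(y, mode))

def rank_gallery(query, gallery, mode="avg"):
    """Rank retailer gallery items by similarity to the query features, best first."""
    scores = [similarity(query, g, mode) for g in gallery]
    return np.argsort(scores)[::-1]

rng = np.random.default_rng(0)
query = rng.normal(size=(64, 7))                      # C x l query features (illustrative)
gallery = [rng.normal(size=(64, 7)) for _ in range(5)]
gallery[2] = query + 0.01 * rng.normal(size=(64, 7))  # plant a near-duplicate item
print(rank_gallery(query, gallery)[0])  # expected: 2
```

Swapping the `mode` argument between `"avg"` and `"max"` corresponds to the choice of pooling operator in A(·), and replacing the Euclidean norm with a cosine similarity is the alternative the text mentions.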

Noise features were observed due to matching of similar pixels in the background, scenes, and other background objects; avoiding them requires cropping the target object, which remains a challenging task.
To suppress this, [30] proposed a similarity correlation between the input query and the local pixel array as follows:


