/ / 8 min read

Large-Scale Coarse-to-Fine Object Retrieval Ontology and Deep Local Multitask Learning (Part 4)

Offline Phase of the retrieval system.

In the previous post, we have discussed about offline phase and online phase in overview. In this post, we will discover the offline phase in details.

1. States in Offline Phase

This phase consisted of three substages:

  • Object Ontology Establishment Stage. This stage defines fashion ontology to control the training flow as well as the online retrieval flow which serves as a bridge between high-level concepts (objects and categories), midlevel concepts (attributes), and raw data.
  • Learning Stage. This stage exploits deep networks with transfer learning in dealing with the specific tasks including object part learning, category learning, and attribute learning.
  • Storing and Indexing Stage. This stage defines a way of storing data as well as making the index list to reduce retrieval or searching time.

From the offline phase, in this section, inherited from previous state-of-the-art methods, we will mention about object part extraction, transfer learning, and its role in the retrieval system as well as data storing. These modules are highly generalized to any object. Other issues including ontology, attribute learning, network architecture, and indexing strategy will be detailed in the following sections.

2. Loss Function

This function inherited the current state-of-the-art ResNet for classification, and cross entropy loss function is applied for multiclass classification in the category classification model and attribute classification model.

For attribute multitask classification models, the loss function is described as follows:

3. Technical Details

Object ontology which is designed manually based on professional experience and public dataset for the community. It is organized into the hierarchical semantic tree with three main levels: region level, category level, and attribute level. Regions, categories, and attributes are learned automatically based on the local MDNN. The DeepFashion dataset, the used functions will be described as follows:

  • (i)extract_predicates(dta): in a rich-annotated dataset, e.g., DeepFashion, a sample image can be annotated by many labels in different fine-grained levels. For each fine-grained level, the function is used to extract the unique possible labels of samples and then store these labels into a corresponding array. For example, in the DeepFashion dataset, Top, Bottom, and Body are unique labels belonging to one fine-grained level, and thus, they are stored into one array. Similarly, fabric, shape, part, style, and texture labels belong to one fine-grained level and are stored into one array.
  • (ii)build_ontology(predicates, prior): this matches the extracted level and its labels from each predicate array into the corresponding stage of the general ontology, i.e., prior. For example, Top, Bottom, and Body belong to one level which is matched with the region stage of the ontology. After the matching is finished, all other unused stages are eliminated from the general ontology to generate the adapted ontology, e.g., fashion ontology.
  • (iii)extract_state(onto): from the built ontology, all stages and their labels are searched and stored into arrays which will be used to reconstruct the data. For example, the region stage array contains three classes, and the category stage array contains 50 classes.
  • (iv)extract_nes_dta(dta, state, onto): based on the stage and the classes extracted from the “extract_state” function, the whole DeepFashion dataset will be split. Only samples having the labels belonging to the stage are stored as the training set of that stage in the ontology. For example, with the region classification model, only samples labelled Top, Body, or Bottom are used for training.
  • (v)classifyModel(architecture, state_dta): in the DeepFashion dataset, based on ontology, there are four classification models: region classification model and category classification model for the Top region, Body region, and Bottom region. These models are retrained from the ImageNet dataset using ResNet-10.
  • (vi)multitaskModel(group_state_dta, architecture, Matthrew_coef = True): for each group state in terms of the fine-grained attribute level, a multitask classification model is built, e.g., fabric attribute group classification model and style attribute group classification model. These models are retrained from the ImageNet dataset using NASNet v3. Besides, the attribute learning and the usage of MCC are mentioned for an imbalanced data solver.
  • (vii)indexing(state_sta): indexing files are created that will be used for run-time retrieval. The method is based on the nonexhaustive compressed-domain search with GPU.
  • (viii)build_storage(onto, states): storage structures are automatically created based on built-in object ontology and extracted states.
  • (ix)infor_extract(states, dta, onto, classifyModels, multitaskModel): for each sample in the database, all attribute learning models trained in “multitaskModel” function are run and then all possible attributes which are higher than thresholds are extracted.
  • (x)feat_extract(dta, onto, classificationModels, multitaskModel): for each sample in the database, the features of the pre-softmax layer in four models trained in “classifyModel” function are obtained.
  • (xi)structure(storage, feat_dta, info_dta, indexFiles): the database is automatically built based on extracted features, extracted information, index, and storage structure.

3.1. Object Part Extraction

For the aforementioned reasons, foreground objects should be extracted from background regions efficiently and accurately before entering the retrieval step. The target of object extraction is to filter the necessary specific subjects. This also improves the efficiency of the following modules as well as increases the overall system performance. There are many successful object detection methods. Among them, YOLO shows the state-of-the-art results. In our system, we inherited the successful software YOLO (version 3.0) to identify fashion items.

3.2. Transfer Learning

Transfer learning is one of the best methods to reduce training time, especially with complicated architectures such as ResNet or NASNet. The key issue is the initial parameters. In the first step of the training process, we have to generate these parameters with some unsupervised learning methods. However, the initial one will be far from the optimal one. In transfer learning, we will reuse the trained parameters on a large and diverse dataset (such as ImageNet dataset). By this way, our training process will be easier to meet convergence. Thus, it reduces the training time.

Transfer learning can be applied in different ways based on the size of the dataset and data similarity. There are four scenarios in total. First, if the data size is small while data similarity is high, we use the pretrained model as a feature extractor. Second, if the data size is small and data similarity is low, we freeze the top layers and train the remaining layers of the pretrained model. Third (ideal situation), if the data size is large and data similarity is high, we can retrain the model by using the weights initialized in the pretrained one. Fourth (worst situation), if the data size is large and data similarity is low, transfer learning cannot be applied, and we have to train our model from scratch. In our fashion example experiments, while DeepFashion is a large dataset and ImageNet (dataset used for transfer learning) is a high diversity one, we can use all of the initialized weights from the pretrained model.

According to our approach, transfer learning will be applied in region, category classification as well as attribute learning along with ResNet and NASNet architectures, respectively. It can also be used in global deep feature extraction to improve the overall retrieval performance.

3.3. Data Storing

Features extracted from the category classification task and attribute learning will be stored in a hierarchical semantic tree based on object ontology. All features belong to a leaf of object ontology and will be stored in one file. In case of the expansion of large-scale data, the mentioned files can be indexed and split with a corresponding mapping key for each image. The folders will be organized based on object ontology in which each name corresponds to each concept. To clarify, data storing for the proposed ontology is defined as follows (see Figure below for an example of data storing):

  • (i) All files are stored in a folder named “database,” which is denoted as the “Object” node.
  • (ii) Based on ontology, “Object” node contains 3 nodes at the “Region” semantic level. Thus, we have 3 smaller folders: “Top,” “Body,” and “Bottom.”
  • (iii) At the next stage of ontology, we have the “Category” semantic level. Thus, we have 50 folders representing all nodes of “Category.”
  • (iv) Finally, we have the “Attribute” semantic level standing for the leaf node state in ontology. At this state, all features belong to the same “Region” and “Category” and are stored in one file.

Figure 1. An example of the storing structure.

Next

In the next post, we will mention the online phase in details.

Like What You See?