Deep features based action description.
Multiple persons based abnormal event detection
A recent survey (Tripathi et al., 2018) focused on four attributes of crowd analysis: crowd counting, crowd motion detection, crowd tracking, and crowd behavior understanding. According to that survey, abnormal event detection approaches in crowded scenes have evolved from shallow to deep. Crowd analysis is one of the challenging topics in the public domain, and deep learning models, such as variations of CNN, LSTM, autoencoders, and RNN, offer better solutions than handcrafted features for real-life difficulties in crowded environments such as occlusion and cluttered backgrounds (Sindagi and Patel, 2018).
In the study by Vignesh et al. (2017), CNN features are combined with the LSTM network's ability to learn long-term dependencies between frames in order to distinguish uncommon events in videos. The LSTM network in this instance consists of two layers: a 1 × 1024 LSTM layer and a 1 × 512 bi-directional layer. This layer architecture overcomes the limitations of traditional recurrent neural networks, which include a fixed input size and the inability to reach future input information from the current state. In contrast, a bi-directional LSTM combines the important traits of the past and the future by running a forward LSTM together with a backward LSTM that processes the sequence in reverse time. Although the method yields remarkable results on the ARENA dataset, the experimental work addresses only one kind of activity. Conv-LSTM was first discussed by Medel and Savakis (2016), who used it to model and predict video sequences. Conv-LSTM layers do not depend on memory, which slows down the training and testing phases. In a real-world context, an abnormal event is less likely to occur than a normal event. Although a 3D ConvNet can distinguish between normal and abnormal events, it fails when abnormal events are absent and when datasets are poorly supervised. Chong and Tay (2017) proposed a spatiotemporal end-to-end approach for anomaly detection in videos, including crowded scenes. It functions effectively even when there are no unusual events in the video samples. The CNN-based spatial feature extractor and the temporal encoder-decoder work together to learn the temporal patterns of the input volume.
Shao et al. (2016) created a spatiotemporal CNN called Slicing CNN (S-CNN) to understand crowd scenes. It decomposes the 3D (x, y, t) volume of the input video into 2D spatial slices (x, y) and 2D temporal slices (x, t) and (y, t), exploiting the selectivity of spatial filters. Applied to the semantic feature maps, this form of slicing makes it easier to separate the actors from the background clutter of the scene. In addition, a recent study by Li and Chuah (2018) defined the ReHAR framework, which combines optical flow and CNN-based image feature extraction with a global average pooling layer and feeds the result to a stack of two LSTM networks to predict group activities on the Basketball dataset and the UCF Sports Action dataset. The global average pooling layer and the stack of two LSTMs speed up processing, which makes real-time action recognition possible.
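The slicing idea attributed to Shao et al. (2016) above can be illustrated with a small sketch. It is not the authors' implementation: the function name `slice_volume`, the `volume[t][y][x]` indexing convention, and the toy data are assumptions made for illustration, and plain Python lists are used only to keep the example free of dependencies. The sketch shows how one (x, y, t) volume yields three families of 2D slices.

```python
def slice_volume(volume):
    """Decompose a volume indexed as volume[t][y][x] into 2D slices.

    Returns three lists of 2D slices (hypothetical layout):
      xy: one spatial slice per time step t (rows = y, columns = x)
      xt: one temporal slice per row y      (rows = t, columns = x)
      yt: one temporal slice per column x   (rows = t, columns = y)
    """
    n_t = len(volume)
    n_y = len(volume[0])
    n_x = len(volume[0][0])

    # Spatial slices: each frame copied as-is.
    xy = [[row[:] for row in volume[t]] for t in range(n_t)]
    # (x, t) slices: fix a row y and stack that row over time.
    xt = [[volume[t][y][:] for t in range(n_t)] for y in range(n_y)]
    # (y, t) slices: fix a column x and stack that column over time.
    yt = [
        [[volume[t][y][x] for y in range(n_y)] for t in range(n_t)]
        for x in range(n_x)
    ]
    return xy, xt, yt


# Toy volume with 2 frames, 2 rows, 3 columns; each value encodes (t, y, x).
volume = [
    [[100 * t + 10 * y + x for x in range(3)] for y in range(2)]
    for t in range(2)
]
xy, xt, yt = slice_volume(volume)
```

In this toy layout, each (x, t) or (y, t) slice shows how a single row or column of the frame changes over time, which is the kind of temporal view that a 2D filter can then process.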
Table 1 provides an overview of the advancement of generalized deep learning architectures for human action recognition. It shows that researchers combine different modalities, including RGB, depth, skeleton, optical flow, MSDI, and dynamic images, to improve system performance. Dynamic images (Jing et al., 2017) and optical flow (Hana et al., 2018) both provide a motion representation, but dynamic images offer a more compact representation of motion in a video sequence than optical flow in terms of computation time and storage capacity. Several researchers have exploited the view-invariant and concise properties of skeletons in deep action recognition models and have reported outstanding results (Liu et al., 2018; Amir et al., 2016; Song et al., 2016; Ding et al., 2016). From the perspective of single person monitoring, the real-time applicability of such algorithms is still a significant concern. As shown in Table 2, very little research has addressed the recognition of single person abnormal human actions using deep architectures, a topic that requires considerable attention given the demand for better healthcare services, smart homes, and AAL. The described work also uses data from wearable sensors. Wearing the sensors can be uncomfortable, and if the person forgets to wear them, the recognition system cannot observe their actions. Nevertheless, deep models have improved the effectiveness of crowd analysis systems by addressing real-world issues, which lowers the complexity barrier to better analysis of crowd activity.
Developing and establishing deeper architectures brings problems such as a limited amount of training data and limited computational resources. With small training datasets, deep neural networks cannot deliver realistic performance. Therefore, to cope with small dataset samples for human action recognition, rigorous data augmentation approaches (Tran et al., 2017; Hanab et al., 2018), i.e. cropping, rotating, and flipping input images, are used. The study by Aquino et al. (2017) introduced two augmentation strategies, "Only enhanced" and "Balanced augmented", examined how they affect CNN designs, and reported improved CNN performance. Transfer learning (Cook et al., 2013; Sargano et al., 2017), which transfers the knowledge acquired in one application to another, is another effective method for handling small data samples for a particular application without requiring a huge dataset or extensive training from scratch.
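The three augmentation operations named above (cropping, rotating, and flipping) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure used in any of the cited studies: images are represented as nested Python lists of pixel values, and the helper names `crop`, `rotate_90_clockwise`, `horizontal_flip`, and `augment` are hypothetical.

```python
def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    # Reversing the row order and transposing gives a clockwise rotation.
    return [list(column) for column in zip(*image[::-1])]


def crop(image, top, left, height, width):
    """Cut out a height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


def augment(image, crop_box):
    """Return the original image plus three augmented variants.

    crop_box is a (top, left, height, width) tuple; this is an
    illustrative choice, not taken from the cited studies.
    """
    top, left, height, width = crop_box
    return [
        image,
        horizontal_flip(image),
        rotate_90_clockwise(image),
        crop(image, top, left, height, width),
    ]


# Toy 2 x 3 "image" with distinct pixel values.
image = [[1, 2, 3],
         [4, 5, 6]]
variants = augment(image, crop_box=(0, 1, 2, 2))
```

Each input image here yields four training samples. In practice the crop position and rotation angle are usually drawn at random for every sample, which this sketch omits for clarity.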
The studies (Hammerla and Plotz, 2015; Zhang et al., 2015; Alsheikh et al., 2016) are mostly sensor based and have access to only a small number of data samples. They are aimed at improving smart homes and support for older persons. The recognition of single person abnormal human actions could be improved if transfer learning techniques are applied to visual datasets for abnormal action recognition and LSTMs and RNNs are used to examine the temporal features involved.