Vision-based human action recognition systems classify human actions as normal or abnormal in applications such as ambient assisted living (AAL), smart homes, and crowd analysis. Such systems need a compact representation of the action videos that balances the discriminative power of the descriptor against time and computational complexity. This blog series discussed three techniques for representing actions in applications that identify abnormal human actions: 2D AbHAR, 3D AbHAR, and deep recognition systems for abnormal activities.
The survey found that each method has its drawbacks, so the choice of action representation must always be application specific. For RGB images and video sequences, abnormal action detection and recognition can be divided into two approaches: single-person based and multiple-person based. Silhouette-based and spatiotemporal techniques have been fairly popular for single-person AbHAR. Khan and Sohn (2011) apply the R transform to silhouettes and use KDA to increase the descriptor's discriminating capacity; however, the system is tested on a hypothetical dataset. Furthermore, the approach of Sacco et al. (2012) has a significant implementation cost and cannot perform long-term real-time analysis. In contrast, Riboni et al. (2016) use a real-time dataset. The LOTAR framework was designed for early identification of abnormal human behavior in real time, but it depends on several sensors in addition to the vision sensor (the camera). Integrating those sensors, and collecting and analyzing their data simultaneously, poses another real-time challenge. For abnormal crowd behavior detection with multiple people, spatiotemporal features (Roshtkhari and Levine, 2013; Roshtkhari and Levine, 2013) are important. Sparseness of the derived features (Chathuramali et al., 2014) is also important for detecting abnormal crowd behavior such as a stampede.
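To give a feel for the R transform mentioned above for single-person silhouettes, here is a minimal sketch in plain Python. It is not the pipeline of Khan and Sohn (2011): it assumes a binary silhouette given as a list of rows, approximates the Radon projection by nearest-bin counting of foreground pixels, and normalises by the maximum value. The function name `r_transform` and the parameter `num_angles` are hypothetical.

```python
import math
from collections import defaultdict


def r_transform(silhouette, num_angles=180):
    """Return a max-normalised R transform of a binary silhouette.

    silhouette: list of rows; truthy entries are foreground pixels.
    The Radon projection is approximated by nearest-bin counting,
    which is an assumption of this sketch.
    """
    points = [(x, y)
              for y, row in enumerate(silhouette)
              for x, value in enumerate(row) if value]
    if not points:
        raise ValueError("silhouette has no foreground pixels")

    profile = []
    for k in range(num_angles):
        theta = math.pi * k / num_angles
        c, s = math.cos(theta), math.sin(theta)
        # Projection at angle theta: count pixels per line offset rho.
        counts = defaultdict(int)
        for x, y in points:
            counts[round(x * c + y * s)] += 1
        # R(theta) is the sum over rho of the squared projection.
        profile.append(sum(n * n for n in counts.values()))

    peak = max(profile)
    return [value / peak for value in profile]
```

Because the squared projection is summed over all offsets, shifting the silhouette leaves the profile unchanged, exactly at 0 and 90 degrees and approximately at other angles because of the binning. Dividing by the peak removes the dependence on overall size. The resulting one-dimensional profile is what a classifier would consume in place of the raw silhouette.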
Skeleton and depth representations of a person can be obtained for indoor activities, where depth cameras can be installed. The survey found that, in closed environments, single-person abnormal activities are studied with smart homes, geriatric health care, and fall detection applications in mind. Approaches based on the full Procrustes distance (Rougier et al., 2011b), OCSVM (Yu et al., 2013), History Triple Factor (HTF) (Goudelis et al., 2015), and BoCSS (Ma et al., 2014) detect falls quickly and handle occlusion and view variations well. However, the methods of Ma et al. (2014) and Goudelis et al. (2015) have a high computing cost and do not provide real-time fall detection, while that of Yu et al. (2013) is tested on only one type of fall dataset.
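As an illustration of the full Procrustes distance mentioned above, the sketch below compares two 2D landmark sets after removing translation, scale, and rotation. It is a textbook formulation using complex numbers, not the implementation of Rougier et al. (2011b); it assumes the two shapes already have the same number of points in corresponding order, and the function name is hypothetical.

```python
import math


def full_procrustes_distance(shape_a, shape_b):
    """Full Procrustes distance between two 2D shapes.

    Each shape is a sequence of (x, y) landmarks in corresponding order.
    The result lies in [0, 1]; 0 means the shapes differ only by
    translation, scale, and rotation (reflection is NOT removed).
    """
    if len(shape_a) != len(shape_b) or len(shape_a) < 2:
        raise ValueError("shapes need the same number (>= 2) of landmarks")

    def preshape(points):
        # Encode landmarks as complex numbers, centre them, scale to unit size.
        z = [complex(x, y) for x, y in points]
        centroid = sum(z) / len(z)
        z = [p - centroid for p in z]
        size = math.sqrt(sum(abs(p) ** 2 for p in z))
        if size == 0:
            raise ValueError("all landmarks coincide")
        return [p / size for p in z]

    za, zb = preshape(shape_a), preshape(shape_b)
    # The modulus of the complex inner product is invariant to rotation.
    inner = abs(sum(p.conjugate() * q for p, q in zip(za, zb)))
    return math.sqrt(max(0.0, 1.0 - inner * inner))


# Usage: a square against an elongated rectangle.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
rectangle = [(0, 0), (3, 0), (3, 1), (0, 1)]
print(full_procrustes_distance(square, rectangle))
```

In a fall-detection setting, a distance like this could be computed between silhouettes from consecutive frames, with a large value signalling a sudden change of body shape. Any threshold would have to be tuned on data and is not specified here.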
To capture long-term details, the single-person deep abnormal behavior recognition methods developed so far (Hammerla et al., 2016; Arifoglu and Bouchachia, 2017; Park et al., 2018) use variants of recurrent neural networks (RNNs): vanilla RNNs (VRNN), Long Short-Term Memory (LSTM) networks, Gated Recurrent Unit (GRU) networks, and Residual-RNN architectures. However, deep networks need millions of data points, and because sufficiently large abnormal datasets are scarce, no vision-based single-person deep abnormal behavior identification model has yet been developed. Plenty of datasets exist for AbHAR with multiple people, which is why many experiments have combined CNNs and LSTMs (Vignesh et al., 2017; Medel and Savakis, 2016; Li and Chuah, 2018; Hinami et al., 2017; Ravanbakhsh et al., 2016) or used Generative Adversarial Nets (Ravanbakhsh et al., 2017) to build the deep model. However, some of these frameworks rely on data that is not properly labeled, and some cannot achieve both fast and accurate performance in real time.
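To show how a recurrent model carries information across time steps, here is a minimal sketch of a single GRU cell in plain Python. It is not the architecture of any of the cited works: the weights are random and untrained, the per-frame feature vectors are made up, and the helper names (`init_params`, `gru_step`, `encode_sequence`) are hypothetical.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, x, U, h, b):
    """Compute W @ x + U @ h + b with plain lists."""
    return [sum(w * xi for w, xi in zip(W[i], x))
            + sum(u * hj for u, hj in zip(U[i], h))
            + b[i]
            for i in range(len(b))]


def init_params(input_size, hidden_size, seed=0):
    """Random, untrained weights for the update (z), reset (r), and candidate (h) parts."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    params = {}
    for gate in "zrh":
        params["W" + gate] = matrix(hidden_size, input_size)
        params["U" + gate] = matrix(hidden_size, hidden_size)
        params["b" + gate] = [0.0] * hidden_size
    return params


def gru_step(x, h, p):
    """One GRU update: gates decide how much of the old state to keep."""
    z = [sigmoid(v) for v in affine(p["Wz"], x, p["Uz"], h, p["bz"])]
    r = [sigmoid(v) for v in affine(p["Wr"], x, p["Ur"], h, p["br"])]
    gated = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(v) for v in affine(p["Wh"], x, p["Uh"], gated, p["bh"])]
    # Convention used here: z close to 1 favours the new candidate state
    # (some references swap the roles of z and 1 - z).
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def encode_sequence(frames, params, hidden_size):
    """Fold a sequence of per-frame feature vectors into one hidden state."""
    h = [0.0] * hidden_size
    for x in frames:
        h = gru_step(x, h, params)
    return h
```

In a full system the final hidden state would feed a classifier that labels the sequence as normal or abnormal, and the weights would be learned from labeled sequences, which is where the dataset scarcity discussed above becomes the limiting factor.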