Ishmam Khan ’25
Deep learning uses layered artificial neural networks to learn patterns from data, and one of its applications is recognizing and simulating human behavior. Two families of networks have proved useful for modeling human movement: convolutional neural networks (CNNs), which operate on visual imagery and spatial information, and recurrent neural networks (RNNs), which adopt long short-term memory (LSTM) to model the long-term contextual information of temporal sequences. Used independently of one another, these methods lack specificity. Researchers at Stony Brook University, led by Dr. Hong Qin, optimized the CNN model by incorporating an attention mechanism and then integrated LSTM features to build a superior hybrid model.
Dr. Qin and his team began with a CNN framework in which each frame of the input represents the coordinates of the skeleton's joints. They then added an attention mechanism, a component that filters for the spatiotemporal information most useful for action recognition. Its output was fed into the CNN so that the network could pick out more discriminative, or distinguishing, features of movement. The researchers then ran an ablation study to determine whether the new parts of the model actually increased the CNN's sensitivity to discriminative features. The study focused on motion-centered actions such as clapping and handshaking, and on how well the model captured the temporal factors behind these motions. Compared with similar LSTM results, the CNN model showed a 20% improvement in accuracy.
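To make the idea of attention over a skeleton sequence concrete, here is a minimal sketch in Python with NumPy. It is not the researchers' model: their attention is learned inside the network, whereas this example substitutes a hand-written "motion energy" score as a stand-in. The function names, the array layout (frames, joints, xyz), and the scoring rule are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def skeleton_to_image(seq):
    """Arrange a (frames, joints, 3) skeleton sequence as an image-like array.

    Rows are frames, columns are joints, and the three channels hold the
    x, y, z coordinates, rescaled to the range [0, 1].
    """
    seq = np.asarray(seq, dtype=float)
    lo, hi = seq.min(), seq.max()
    return (seq - lo) / (hi - lo + 1e-8)

def spatiotemporal_attention(image):
    """Weight frames and joints by how much they move.

    ASSUMPTION: motion energy replaces the learned attention scores
    that a trained network would produce.
    """
    # Displacement of every joint between consecutive frames: (frames, joints)
    step = np.diff(image, axis=0, prepend=image[:1])
    motion = np.linalg.norm(step, axis=-1)
    temporal = softmax(motion.sum(axis=1), axis=0)  # one weight per frame
    spatial = softmax(motion.sum(axis=0), axis=0)   # one weight per joint
    mask = temporal[:, None] * spatial[None, :]     # (frames, joints)
    return image * mask[..., None], mask

# Toy sequence: 5 frames, 4 joints; only joint 2 moves, at frame 3.
seq = np.zeros((5, 4, 3))
seq[3:, 2, 0] = 1.0
weighted, mask = spatiotemporal_attention(skeleton_to_image(seq))
```

In this toy case the mask concentrates on the one frame and joint where movement occurs, which is the kind of filtering the attention mechanism is meant to provide before the CNN extracts features.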
The researchers then combined the temporal features from the LSTM model with the attention-equipped CNN to create a hybrid model. In the same ablation study, the hybrid model scored 1-2% higher than the CNN model and 21-22% higher than the LSTM model. The hybrid achieved the highest score in the study, making it the most effective model tested by this group.
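One common way to combine features from two networks is to concatenate them and pass the result to a single classifier. The sketch below shows that pattern in Python with NumPy. It is an assumption about what such a fusion could look like, not a description of the published architecture; the function names, feature sizes, and the random classifier weights are hypothetical.

```python
import numpy as np

def fuse_features(cnn_feat, lstm_feat):
    """Concatenate two feature vectors after scaling each to unit length.

    ASSUMPTION: simple concatenation stands in for whatever fusion
    the hybrid model actually uses.
    """
    cnn_feat = cnn_feat / (np.linalg.norm(cnn_feat) + 1e-8)
    lstm_feat = lstm_feat / (np.linalg.norm(lstm_feat) + 1e-8)
    return np.concatenate([cnn_feat, lstm_feat])

def classify(fused, weights, bias):
    """Linear layer followed by softmax over action classes."""
    logits = weights @ fused + bias
    logits = logits - logits.max()  # stabilize the exponentials
    exp = np.exp(logits)
    return exp / exp.sum()

# Hypothetical features: 2 spatial values from a CNN, 3 temporal values from an LSTM.
cnn_feat = np.array([3.0, 4.0])
lstm_feat = np.array([0.0, 2.0, 0.0])
fused = fuse_features(cnn_feat, lstm_feat)

# Untrained, randomly initialized classifier over 4 hypothetical action classes.
rng = np.random.default_rng(0)
weights = rng.normal(size=(4, fused.size))
bias = np.zeros(4)
probs = classify(fused, weights, bias)
```

Scaling each vector before concatenation keeps one feature source from dominating the other purely because of its magnitude, a design choice made here for the example.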
This research presents a new approach to action recognition in machine learning and artificial intelligence. The researchers believe the hybrid feature learning framework is an effective strategy for simulating human behavior with AI. A future course of study for this group would be to combine different types of features and consider the information of bones in skeleton sequences simultaneously. In addition, the researchers recommend exploring how to extend human action recognition to other scenes, such as predicting people’s emotions by coupling human action and facial expression.
Z. Chen, et al., Hybrid features for skeleton-based action recognition based on network fusion. Computer Animation and Virtual Worlds 31, 1-11 (2020). doi: 10.1002/cav.1952.