Arabic Sign Language (ArSL) Recognition


Chapter 3: Literature Review

Arabic Sign Language (ArSL) recognition poses substantial challenges, including hand segmentation and the choice of visual descriptors. ArSL presents additional difficulty because several of its letters are visually similar. Several approaches have emerged to address these problems: researchers have investigated both glove-based and glove-free recognition of ArSL. While both families use machine learning algorithms, they should be distinguished by the kind of learning involved: some models use deep learning, while others use non-deep learning techniques. This review appraises studies in both categories.

3.1 Non-Deep Learning Approaches

·         Recognition based on Scale-Invariant Feature Transform (SIFT)

Tharwat et al. [2] proposed a system for ArSL recognition based on the Scale-Invariant Feature Transform (SIFT). The algorithm is based on five critical steps:

  1. Convolving the image with Gaussian filters of different widths to create a difference-of-Gaussian (DoG) pyramid
  2. Identifying extrema in the DoG pyramid by comparing each point with its neighbors
  3. Eliminating candidate keypoints that are vulnerable to noise or located on edges
  4. Assigning an orientation by building a histogram of gradient orientations of the sample points
  5. Creating a descriptor of the local image region
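As a rough illustration of steps 1 and 2, the difference-of-Gaussian pyramid and extrema search can be sketched in Python. This is a simplification of SIFT's full 3D scale-space check (one layer at a time, strict maxima only), and the sigma values are illustrative, not taken from the paper:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_pyramid(image, sigmas=(1.0, 1.6, 2.56, 4.1)):
    """Steps 1-2 of the SIFT pipeline: convolve with Gaussians of
    increasing width and subtract adjacent scales."""
    blurred = [gaussian_filter(image.astype(float), s) for s in sigmas]
    return [blurred[i + 1] - blurred[i] for i in range(len(blurred) - 1)]

def local_extrema(dog):
    """Mark pixels that are strict maxima of their 3x3 neighbourhood
    within one DoG layer (SIFT also compares against adjacent scales)."""
    h, w = dog.shape
    mask = np.zeros_like(dog, dtype=bool)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            patch = dog[y - 1:y + 2, x - 1:x + 2]
            if dog[y, x] == patch.max() and (patch == dog[y, x]).sum() == 1:
                mask[y, x] = True
    return mask
```

Steps 3-5 (keypoint filtering, orientation assignment, and descriptor construction) operate on the extrema this search returns.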

In turn, Linear Discriminant Analysis (LDA) was used to reduce the dimensionality of the resulting high-dimensional vectors. The researchers fed the resulting features to classifiers, including K-Nearest Neighbor (KNN) and Support Vector Machine (SVM). The experimental results showed that SVM outperformed KNN, with an accuracy of 98.9%. Deriche et al. [7] relied on a similar approach, although they used LDA alongside a Gaussian Mixture Model (GMM). With two Leap Motion Controllers (LMCs) as the sensing foundation, the researchers collected data from two native signers for 100 Arabic signs. The proposed framework achieved an accuracy of 92%, relatively higher than comparable sensor-based techniques.

·         Recognition based on Histogram of Oriented Gradients (HOG) and SVM

            In their study, Alzohairi et al. [3] proposed a system for automatic recognition of ArSL using visual descriptors. The system evaluated a range of visual descriptors in order to build an accurate ArSL alphabet recognizer, and the extracted descriptors were classified using a One-versus-All SVM. The analysis of the results revealed that the HOG descriptor outperformed the other descriptors, with the proposed system reaching an accuracy of 90.55% in recognizing ArSL.
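The descriptor-plus-classifier pattern can be sketched as follows. A single global orientation histogram stands in for full HOG (which adds local cells and block normalization), and scikit-learn's LinearSVC is one-versus-rest by construction; the stripe images are synthetic stand-ins for sign images:

```python
import numpy as np
from sklearn.svm import LinearSVC

def grad_orientation_histogram(img, bins=9):
    """Bare-bones HOG-style descriptor: one global histogram of
    gradient orientations, weighted by gradient magnitude."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)   # unsigned orientation
    hist, _ = np.histogram(ang, bins=bins, range=(0, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-9)

# Toy data: images dominated by vertical vs horizontal edges
rng = np.random.default_rng(0)
X, y = [], []
for _ in range(20):
    h = np.tile(np.arange(16) % 2, (16, 1)).astype(float)  # vertical edges
    v = h.T                                                # horizontal edges
    X += [grad_orientation_histogram(h + rng.normal(0, .05, h.shape)),
          grad_orientation_histogram(v + rng.normal(0, .05, v.shape))]
    y += [0, 1]

clf = LinearSVC().fit(X, y)   # one-vs-rest linear SVM
```

The real system extracts one such descriptor per hand image and trains one binary SVM per alphabet letter.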

·         Recognition based on Nearest Neighbor

            Several other studies have aimed at developing systems that rely on nearest-neighbor classification for ArSL recognition. Tubaiz et al. [5] compiled a sensor-based dataset of 40 sentences based on an 80-word lexicon to test a glove-based ArSL recognition system. Following data labeling, the researchers applied low-complexity preprocessing and feature extraction techniques to capture the temporal dependency of the data. A modified K-Nearest Neighbor (MKNN) classifier was used for classification. The system achieved a sentence recognition rate of approximately 98.9%, and the approach was considered superior to vision-based approaches in terms of classification rates.
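The windowing-plus-nearest-neighbor idea can be sketched with scikit-learn's standard KNN. The paper's modified KNN and its actual glove features are not reproduced here; the sliding window over synthetic per-frame sensor readings merely illustrates one simple way to capture temporal dependency:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def window_features(frames, width=3):
    """Stack a sliding window of consecutive frames into one feature
    vector so each sample carries short-range temporal context."""
    return np.array([frames[i:i + width].ravel()
                     for i in range(len(frames) - width + 1)])

rng = np.random.default_rng(1)
sign_a = rng.normal(0.0, 0.1, (50, 4))   # 50 frames, 4 sensor channels
sign_b = rng.normal(1.0, 0.1, (50, 4))   # a second, distinct sign

X = np.vstack([window_features(sign_a), window_features(sign_b)])
y = np.array([0] * 48 + [1] * 48)

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
```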

            Similarly, Naoum et al. [6] used the KNN algorithm in the development of an ArSL recognition system. The proposed system involved several stages, including image capturing, image clipping and narrowing, image masking, generation of histograms, and comparison of histograms. The developed algorithm was tested for hit and glove detection, in which most of the characters attained a hit rate of approximately 90%. However, the algorithm has shortcomings because it does not incorporate a self-learning engine.

            Hemayed and Hassanien [16] proposed a system in which input images were converted to the YCbCr color space to detect the hand and face based on a skin profile. A dilation morphological operation was applied to the converted images. Hand-shape edge detection relied on the Prewitt operator, while PCA was used to reduce dimensionality and obtain the final feature vector. Based on 150 signs and gestures, the experiment revealed an accuracy of 97% with KNN.
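The YCbCr skin-detection step can be sketched directly. The Cb/Cr bounds below are commonly cited values for skin, not the profile used in the reviewed system, and the conversion is the standard BT.601 approximation:

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """ITU-R BT.601 RGB -> YCbCr conversion (full-range approximation)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return np.stack([y, cb, cr], axis=-1)

def skin_mask(rgb, cb_range=(77, 127), cr_range=(133, 173)):
    """Threshold the chrominance channels; illumination mostly affects
    Y, which is why skin detection in YCbCr ignores it."""
    ycbcr = rgb_to_ycbcr(rgb.astype(float))
    cb, cr = ycbcr[..., 1], ycbcr[..., 2]
    return ((cb_range[0] <= cb) & (cb <= cb_range[1]) &
            (cr_range[0] <= cr) & (cr <= cr_range[1]))
```

Morphological dilation and Prewitt edge detection would then operate on this binary mask.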

·         Recognition based on Gray Level Co-occurrence Matrix (GLCM)

El Alfi and Atawy [20] used GLCM in a proposed ArSL translation system. Although the method is a non-deep learning approach, it involved four phases: a processing phase, a feature extraction phase, a matching strategy phase, and a display translation phase. Feature extraction combined GLCM and histogram features. The experimental results revealed that the system could recognize 19 Arabic letters with an accuracy of 73%, which is lower than comparable approaches. However, the researchers do not explain the reason behind the lower accuracy relative to other feature extraction methods.
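A minimal GLCM, and the Haralick-style texture statistics usually derived from it, can be computed directly. The quantization to 8 gray levels and the single horizontal displacement are illustrative choices (real systems combine several displacements and angles), and the code assumes a non-zero image:

```python
import numpy as np

def glcm(img, levels=8, dx=1, dy=0):
    """Gray level co-occurrence matrix for displacement (dx, dy):
    counts how often gray level i occurs next to gray level j."""
    q = (img.astype(float) / img.max() * (levels - 1)).astype(int)
    m = np.zeros((levels, levels))
    h, w = q.shape
    for yy in range(h - dy):
        for xx in range(w - dx):
            m[q[yy, xx], q[yy + dy, xx + dx]] += 1
    return m / m.sum()

def glcm_features(m):
    """Texture statistics commonly extracted from a normalised GLCM."""
    i, j = np.indices(m.shape)
    return {
        "contrast":    ((i - j) ** 2 * m).sum(),
        "energy":      (m ** 2).sum(),
        "homogeneity": (m / (1 + np.abs(i - j))).sum(),
    }
```

A perfectly uniform image yields zero contrast and maximal energy, which is the sanity check usually applied to such implementations.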

·         Others

            Several of the studies identified did not state their approach explicitly enough to be classified as deep learning or non-deep learning. For example, Sadek et al. [9] describe an approach to the development of a smart glove for ArSL recognition; although the simulated system was found to be cost-effective, the researchers do not report its accuracy. El-Gayyar et al. [13] report the development of a mobile-based system. The system appears to be aligned with self-supervised deep learning because it involves multiple layers, but the findings provide only preliminary information about the usability of the system, without an adequate indication of its accuracy. Ibrahim et al. [4] also developed a system that can be considered to have relied on a non-deep learning approach; it tracked blobs using Euclidean distance because the data was not robust. Based on a dataset containing 30 isolated words, the study found the system to have a recognition rate of 97%.

3.2 Deep Learning Approaches

Deep learning approaches rely on neural networks with multiple hidden layers, which allows recognition to function largely independent of the signer.

·         Recognition using Convolutional Neural Network

Kamruzzaman [10] developed a vision-based system applying a CNN to the recognition of hand-sign-based letters. The system takes a deep learning approach: it detects hand-sign letters automatically and speaks out the results in Arabic. The feature extraction component converts images into a 3D matrix of specified height, width, and depth. The pooling layer, placed between the convolution layers, decreases dimensionality and lessens computation by reducing the number of parameters; it also helps regulate training time. Classification is the second essential component: the fully connected (FC) layer accepts one-dimensional data. Overall, the system gave 90% accuracy in the recognition of Arabic hand-sign-based letters, which illustrates its dependability.
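The convolution → activation → pooling → flatten pipeline described above can be illustrated with a single numpy forward pass. The 3x3 edge kernel and 28x28 image size are arbitrary; a trained CNN learns many such kernels per layer rather than using a hand-picked one:

```python
import numpy as np

def conv2d(x, k):
    """Valid 2D convolution of one channel with one kernel (no padding)."""
    kh, kw = k.shape
    h, w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = (x[i:i + kh, j:j + kw] * k).sum()
    return out

def max_pool(x, s=2):
    """2x2 max pooling: halves each spatial dimension, cutting the
    number of activations passed downstream by ~75%."""
    h, w = x.shape[0] // s, x.shape[1] // s
    return x[:h * s, :w * s].reshape(h, s, w, s).max(axis=(1, 3))

# image -> convolution -> ReLU -> pooling -> flatten for the FC layer
img = np.random.default_rng(2).random((28, 28))
kernel = np.array([[1., 0., -1.]] * 3)      # simple vertical-edge detector
feat = np.maximum(conv2d(img, kernel), 0)   # ReLU activation
pooled = max_pool(feat)
fc_input = pooled.ravel()                   # 1D vector for the FC layer
```

The 28x28 input shrinks to 26x26 after the valid convolution and 13x13 after pooling, showing concretely how pooling reduces the dimensionality the FC layer must handle.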

            Likewise, Saleh and Issa [12] used CNNs in a system based on transfer learning and fine-tuning, aimed at improving accuracy in recognizing 32 hand gestures. The methodology involved creating models matching the ResNet152 and VGG16 architectures, loading pre-trained weights into the layers of each network, and adding a soft-max classification layer after the FC layer. Unlike Kamruzzaman [10], the system was fed 2D images from different ArSL datasets. The system had an accuracy of approximately 99%.
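Transfer learning's key mechanic, freezing the pretrained base and training only a new head, can be shown in miniature. Here a fixed random projection stands in for the frozen convolutional base (in the reviewed work, VGG16/ResNet152 weights pretrained on large image datasets), a logistic head stands in for the added soft-max layer, and all data is synthetic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
frozen_w = rng.normal(size=(64, 16))        # "pretrained" weights, never updated

def frozen_features(x):
    """The frozen base: a fixed projection plus ReLU, mimicking reuse
    of a pretrained feature extractor."""
    return np.maximum(x @ frozen_w, 0)

X_raw = rng.normal(size=(200, 64))
y = (X_raw[:, 0] > 0).astype(int)           # toy 2-class labels
X_raw[y == 1] += 1.0                        # make the classes separable

# Only the new classification head is trained on the target task.
head = LogisticRegression().fit(frozen_features(X_raw), y)
```

Fine-tuning, as in the reviewed system, would additionally unfreeze some base layers and continue training them at a low learning rate.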

            Hayani et al. [18] also proposed a system for the recognition of ArSL based on a CNN inspired by LeNet-5. The system comprises seven layers: the initial four extract deep features from the images, and the final three classify them. In evaluating the system, the researchers used a dataset containing 2,030 images of ArSL numbers and 5,839 images of ArSL letters. They varied the training and test splits to improve the accuracy of the proposed system. Based on the results, the system achieved an accuracy of approximately 90.02% when training on 80% of the images from the dataset, a recognition accuracy superior to KNN- and SVM-based systems.

            Latif et al. [24] also used a CNN to develop a system for ArSL recognition. Conventionally for a CNN, the system had three convolutional layers, three pooling layers, and an FC layer. The system used the Rectified Linear Unit (ReLU) as the activation function; the leaky ReLU variant helped in overcoming the problem of dead neurons. A soft-max activation function, involving a dense-layer activation function and a loss function, was connected after the FC layer to convert the outputs to a one-hot encoded vector. Initially, the system had an accuracy of 80.31%; however, a series of 15 experiments was aimed at finalizing a suitable design. With a dataset of 50,000 images, the subsequent experiments led to an accuracy of 97.6%. Notably, this design seems to have given higher accuracy than related studies, probably because of the additional layers aimed at optimizing the system.

·         Recognition using Hidden Markov Models (HMMs)

            Several other studies have used Hidden Markov Models (HMMs) in the development of systems for ArSL recognition. HMMs enable the development of signer-independent systems that do not depend on gloves or attached sensors. In their study, Youssif et al. [8] introduced an automatic ArSL recognition system using HMMs. The system was based on three phases of hand detection and tracking: skin detection, edge detection, and fingertip tracking. After pre-processing at a frame rate of 25 Hz, the captured RGB images were converted to HSV, separating the color space into hue, saturation, and intensity channels. The Canny algorithm was used for edge detection to minimize the error rate. For each frame, the detected skin regions were tested, using connected component analysis, to determine whether the input contour was convex. While the study found a detection rate lower than in many other studies (82.22%), the system used only eight features per frame; using fewer features helps optimize the system's ease of use.
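The RGB-to-HSV separation step can be sketched directly. This is the standard hexcone conversion, not code from the reviewed paper, and the Canny stage that follows it is omitted:

```python
import numpy as np

def rgb_to_hsv(rgb):
    """Vectorised RGB -> HSV with hue and all channels in [0, 1];
    the reviewed system works on the separated channels."""
    rgb = rgb.astype(float) / 255.0
    mx, mn = rgb.max(axis=-1), rgb.min(axis=-1)
    diff = mx - mn
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    h = np.zeros_like(mx)
    nz = diff > 0
    idx = nz & (mx == r); h[idx] = ((g - b)[idx] / diff[idx]) % 6
    idx = nz & (mx == g); h[idx] = (b - r)[idx] / diff[idx] + 2
    idx = nz & (mx == b); h[idx] = (r - g)[idx] / diff[idx] + 4
    h /= 6
    s = np.where(mx > 0, diff / np.where(mx > 0, mx, 1), 0)
    return np.stack([h, s, mx], axis=-1)
```

Separating hue from intensity in this way is what makes the subsequent skin thresholding less sensitive to lighting changes.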

            Shanableh et al. [22] presented several spatio-temporal feature extraction techniques with application to offline and online recognition of isolated ArSL gestures. In the proposed approach, the researchers extracted video-based gestures via forward, backward, and bi-directional prediction. To represent the video sequence with a few coefficients and eliminate temporal dependencies, the researchers thresholded and accumulated the prediction errors into one image, which was followed by spatial-domain feature extraction. The proposed model differed from the others in that it included zonal coding, the Radon transform, and low-pass filtering, besides 2-D transformation. The experimental results showed performance accuracy ranging from 97% to 100% in the recognition of ArSL.

            Ahmed and Aly [23] used a combination of techniques, including HMMs, Local Binary Patterns (LBP), and Principal Component Analysis (PCA). LBP was applied to describe the shape and texture of images, while PCA was used to reduce dimensionality. The proposed system included hand and head detection, labeling of connected components, and feature extraction. The experiment relied on 23 isolated Arabic words performed by 3 different signers. The first experiment evaluated the recognition rate of the system without skin segmentation: PCA-only features led to a recognition rate of 99.8% with 30 eigenvectors, while combining LBP and PCA features with HMMs achieved a 99.9% recognition rate with 20 eigenvectors. The second and third experiments did not show significant signer-dependent recognition rates. As such, the researchers concluded that the LBP-PCA system had superior performance to systems using the Discrete Cosine Transform.
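The LBP-then-PCA feature pipeline can be sketched as follows; the HMM stage that consumes the per-frame features is omitted, and the random frames are synthetic stand-ins for segmented hand regions:

```python
import numpy as np
from sklearn.decomposition import PCA

def lbp_image(img):
    """8-neighbour Local Binary Pattern: each interior pixel becomes a
    byte whose bits record whether each neighbour is >= the centre."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    c = img[1:-1, 1:-1]
    code = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        n = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        code |= ((n >= c).astype(np.uint8) << bit)
    return code

def lbp_histogram(img, bins=256):
    """Normalised histogram of LBP codes: the texture descriptor."""
    hist, _ = np.histogram(lbp_image(img), bins=bins, range=(0, 256))
    return hist / hist.sum()

# Describe each frame with an LBP histogram, then reduce the 256-dim
# descriptors with PCA before feeding a sequence model such as an HMM.
rng = np.random.default_rng(4)
frames = rng.random((30, 16, 16))
X = np.array([lbp_histogram(f) for f in frames])
X_reduced = PCA(n_components=20).fit_transform(X)
```

The 20 components kept here mirror the eigenvector counts reported in the paper's experiments.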

·         Recognition using Multilayer Perceptron

            The Multilayer Perceptron (MLP) is a neural network that contains several layers. In their work, Luqman and Mahmoud [14] compared SVM, KNN, and MLP classifiers in the development of an ArSL system. The work assessed different transformation techniques for extracting and describing features based on the accumulation of sign frames into one image. Unlike other studies, the authors also evaluated different frequency-domain transforms (Log-Gabor, Hartley, and Fourier) for ArSL. Overall, the experiment found that the Hartley transform was superior to the other transforms in the recognition of ArSL, with an accuracy of approximately 98.9% using SVM; the accuracy was further improved to 99.2% using an MLP classifier. This shows that the system could be efficient for ArSL, especially when using MLP rather than the other classifiers.
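The transform-features-plus-MLP pattern can be sketched with a 2D Fourier transform standing in for the frequency-domain step (Hartley or Log-Gabor transforms would slot into the same pipeline). The "accumulated" images below are synthetic, and the descriptor is a deliberately crude low-frequency summary:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def fourier_descriptor(img, keep=8):
    """Low-frequency DFT magnitudes of an accumulated-frames image,
    normalised by the DC term."""
    spectrum = np.abs(np.fft.fft2(img))
    feats = spectrum[:keep, :keep].ravel()
    return feats / feats[0]

# Synthetic stand-ins: smooth accumulated images vs noisy ones.
rng = np.random.default_rng(5)
X, y = [], []
for _ in range(30):
    coarse = np.kron(rng.random((4, 4)), np.ones((8, 8)))  # low-frequency content
    fine = rng.random((32, 32))                            # high-frequency content
    X += [fourier_descriptor(coarse), fourier_descriptor(fine)]
    y += [0, 1]

mlp = MLPClassifier(hidden_layer_sizes=(16,), solver="lbfgs",
                    max_iter=2000, random_state=0).fit(X, y)
```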

            Mohandes et al. [1] also used the LMC as the backbone of an ArSL system that detects finger and hand position and motion without requiring wearable sensors. The system had the classical phases: preprocessing, feature extraction, and classification. The researchers compared the utility of MLP with a Naïve Bayes Classifier (NBC); the comparison revealed that MLP was superior, with an accuracy of 99% compared to 98% for NBC. Coupled with the findings from Luqman and Mahmoud [14], the results show that MLP can be used successfully with the LMC in the development of ArSL systems.

            Likewise, Elons et al. [15] used MLP neural networks with a digital sensor (the Leap Motion controller), which captures hand and finger movement in a 3D format. The researchers fed the spatial and temporal features to the network to test the system's accuracy in recognizing 50 different dynamic signs. The system attained an accuracy of 88% using two different signers. However, a shortcoming of the approach is that the sensor can only track manual features.

El-Bendary et al. [21] developed the ArSLAT (Arabic Sign Language Alphabet Translator) system using the same approach and compared its accuracy using an MLP and a minimum distance classifier (MDC). The system included five phases: preprocessing, frame detection, category detection, feature extraction, and classification. It was made more flexible by using translation-, scale-, and rotation-invariant features. Unlike previous studies, the system performed better using the MDC, with an accuracy of 91% compared to 83.7% using the MLP. However, the study also revealed that the MDC took longer to recognize ArSL than the MLP, which could be a weakness of the system. Almasre and Al-Nuaim [25] used a different approach, although they also used an LMC skeleton: the proposed system was developed for supervised machine learning using Kinect. The results revealed that the system could recognize all 22 tested alphabet signs with 100% accuracy.

·         Recognition using Recurrent Neural Network

Maraqa and Abu-Zaiter [19] describe the use of recurrent neural networks (RNNs) in the development of an ArSL recognition system. The method included four phases: data collection and image acquisition, image processing, feature extraction, and gesture recognition. Image acquisition required processing the images in the HSI color space. Feature extraction was conducted according to color layers, with expansion from the segmented color regions. The researchers compared different architectures for the system and found that the RNN was the most promising, with an accuracy of 95%.
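What distinguishes an RNN for gesture recognition is the recurrence itself: the hidden state carries information across frames. A minimal untrained Elman-style forward pass illustrates this (the weight shapes and 25-frame sequence are arbitrary; a real system learns the weights and feeds the final state to a classifier):

```python
import numpy as np

def rnn_forward(seq, wx, wh, b):
    """Elman recurrence h_t = tanh(Wx·x_t + Wh·h_{t-1} + b); the final
    hidden state summarises the whole frame sequence."""
    h = np.zeros(wh.shape[0])
    for x in seq:
        h = np.tanh(wx @ x + wh @ h + b)
    return h

rng = np.random.default_rng(6)
wx = rng.normal(scale=0.5, size=(8, 4))   # input-to-hidden weights
wh = rng.normal(scale=0.5, size=(8, 8))   # hidden-to-hidden weights
b = np.zeros(8)

features = rnn_forward(rng.random((25, 4)), wx, wh, b)  # 25 frames, 4 features
```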

·         Recognition using Neuro-fuzzy Inference System

            Al-Jarrah and Halawani [17] used the Adaptive Neuro-fuzzy Inference System (ANFIS) in a proposed system for the recognition of 30 ArSL alphabet signs. The approach is considered deep learning because the ANFIS architecture has five layers, including an input layer, hidden layers, and an output layer. In this experiment, 60 different signers performed the 30 alphabet signs. The researchers filtered the input image using a median filter to enhance it for segmentation and reduce noise. The features of interest included the direction of the gesture, the center of the area, and border information. The proposed system was robust to direction, position, and size, with an overall accuracy of 93.5%.

3.3 Review of Papers on Deep Learning

            The literature reveals different approaches to the development of ArSL systems using deep learning. However, one of the missing links in the papers is segmentation, with many of the studies not discussing the issue. Ahmed and Aly [23] used a deep learning approach but did not consider skin pigmentation; while the inclusion of LBP could help segment the fingers, the alternative approach may be inconvenient to the user. Hemayed and Hassanien [16] address the problem by including skin profiles, which eliminates the need for colored gloves. Indeed, most of the studies addressing the segmentation problem used non-deep learning approaches. A constraint in the deep learning systems is the need for a depth-sensing camera, a requirement that could limit usability. Moreover, none of the studies reports the use of lightweight and efficient CNN models. Although the systems achieved high levels of accuracy, the lack of such models could have led to the inclusion of many features in the systems, reducing ease of use.

Discussion of the Chapter

            Segmentation emerges as one of the crucial steps that many ArSL systems use to separate the hand from the background. However, many approaches fail to adapt segmentation to all images because of differing levels of illumination, skin tones, shape variation, and background complexity. While several of the approaches identified tackled the problem of segmentation ([17], [19], [23]), they fall short because they rely on heavyweight models; while such systems could be applied in real-world settings, they may suffer performance degradation. Other studies proposed sensor-based approaches that could capture both the depth and intensity of images. However, these non-deep learning approaches require extensive use of sensors, which limits the usability of the systems; indeed, they may prove impractical because of the costs associated with acquiring the sensors. Therefore, lightweight and efficient CNN models could eliminate these requirements while also reducing over-reliance on the segmentation phase.

Summary of the Chapter

            The existing literature proposes a multiplicity of approaches to ArSL recognition. In many cases, previous studies have tried to address the problem of discriminative features associated with sign language. This chapter discussed the relevant literature on deep learning and non-deep learning approaches to ArSL. The first section outlined the common non-deep learning approaches to the development of ArSL systems; the second identified several deep learning approaches that have proven effective. Overall, the studies show that deep learning approaches could lead to the development of superior systems. The discussion included a review of the papers on deep learning approaches, as well as their shortcomings.


  • Mohandes, M., Aliyu, S., & Deriche, M. (2014, June). Arabic sign language recognition using the Leap Motion Controller. In 2014 IEEE 23rd International Symposium on Industrial Electronics (ISIE) (pp. 960-965). IEEE.
  • Tharwat, A., Gaber, T., Hassanien, A. E., Shahin, M. K., & Refaat, B. (2015). SIFT-based Arabic sign language recognition system. In Afro-European Conference for Industrial Advancement (pp. 359-370). Springer, Cham.
  • Alzohairi, R., Alghonaim, R., Alshehri, W., Aloqeely, S., Alzaidan, M., & Bchir, O. (2018). Image based Arabic sign language recognition system. International Journal of Advanced Computer Science and Applications (IJACSA), 9(3).
  • Ibrahim, N. B., Selim, M. M., & Zayed, H. H. (2018). An automatic Arabic sign language recognition system (ArSLRS). Journal of King Saud University - Computer and Information Sciences, 30(4), 470-477.
  • Tubaiz, N., Shanableh, T., & Assaleh, K. (2015). Glove-based continuous Arabic sign language recognition in user-dependent mode. IEEE Transactions on Human-Machine Systems, 45(4), 526-533.
  • Naoum, R., Owaied, H. H., & Joudeh, S. (2012). Development of a new Arabic sign language recognition using k-nearest neighbor algorithm. Journal of Emerging Trends in Computing and Information Sciences, 3(8), 1173-1178.
  • Deriche, M., Aliyu, S. O., & Mohandes, M. (2019). An intelligent Arabic sign language recognition system using a pair of LMCs with GMM based classification. IEEE Sensors Journal, 19(18), 8067-8078.
  • Youssif, A. A., Aboutabl, A. E., & Ali, H. H. (2011). Arabic sign language (ArSL) recognition system using HMM. International Journal of Advanced Computer Science and Applications (IJACSA), 2(11).
  • Sadek, M. I., Mikhael, M. N., & Mansour, H. A. (2017, March). A new approach for designing a smart glove for Arabic sign language recognition system based on the statistical analysis of the sign language. In 2017 34th National Radio Science Conference (NRSC) (pp. 380-388). IEEE.
  • Kamruzzaman, M. M. (2020). Arabic sign language recognition and generating Arabic speech using convolutional neural network. Wireless Communications and Mobile Computing, 2020.
  • Aly, S., & Aly, W. (2020). DeepArSLR: A novel signer-independent deep learning framework for isolated Arabic sign language gestures recognition. IEEE Access, 8, 83199-83212.
  • Saleh, Y., & Issa, G. (2020). Arabic sign language recognition through deep neural networks fine-tuning. International Journal of Online and Biomedical Engineering, 16(5), 71-83.
  • El-Gayyar, M. M., Ibrahim, A. S., & Wahed, M. E. (2016). Translation from Arabic speech to Arabic Sign Language based on cloud computing. Egyptian Informatics Journal, 17(3), 295-303.
  • Luqman, H., & Mahmoud, S. A. (2017). Transform-based Arabic sign language recognition. Procedia Computer Science, 117, 2-9.
  • Elons, A. S., Ahmed, M., Shedid, H., & Tolba, M. F. (2014, December). Arabic sign language recognition using leap motion sensor. In 2014 9th International Conference on Computer Engineering & Systems (ICCES) (pp. 368-373). IEEE.
  • Hemayed, E. E., & Hassanien, A. S. (2010, December). Edge-based recognizer for Arabic sign language alphabet (ArS2V - Arabic sign to voice). In 2010 International Computer Engineering Conference (ICENCO) (pp. 121-127). IEEE.
  • Al-Jarrah, O., & Halawani, A. (2001). Recognition of gestures in Arabic sign language using neuro-fuzzy systems. Artificial Intelligence, 133(1-2), 117-138.
  • Hayani, S., Benaddy, M., El Meslouhi, O., & Kardouchi, M. (2019, July). Arab sign language recognition with convolutional neural networks. In 2019 International Conference of Computer Science and Renewable Energies (ICCSRE) (pp. 1-4). IEEE.
  • Maraqa, M., & Abu-Zaiter, R. (2008, August). Recognition of Arabic Sign Language (ArSL) using recurrent neural networks. In 2008 First International Conference on the Applications of Digital Information and Web Technologies (ICADIWT) (pp. 478-481). IEEE.
  • El Alfi, A. E. E., & Atawy, S. M. E. L. (2018). Intelligent Arabic sign language to Arabic text translation for easy deaf communication. International Journal of Computer Applications, 180.
  • El-Bendary, N., Zawbaa, H. M., Daoud, M. S., Hassanien, A. E., & Nakamatsu, K. (2010, October). ArSLAT: Arabic Sign Language Alphabets Translator. In 2010 International Conference on Computer Information Systems and Industrial Management Applications (CISIM) (pp. 590-595). IEEE.
  • Shanableh, T., Assaleh, K., & Al-Rousan, M. (2007). Spatio-temporal feature-extraction techniques for isolated gesture recognition in Arabic sign language. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 37(3), 641-650.
  • Ahmed, A. A., & Aly, S. (2014, April). Appearance-based Arabic sign language recognition using hidden Markov models. In 2014 International Conference on Engineering and Technology (ICET) (pp. 1-6). IEEE.
  • Latif, G., Mohammad, N., AlKhalaf, R., AlKhalaf, R., Alghazo, J., & Khan, M. (2020). An automatic Arabic sign language recognition system based on deep CNN: An assistive system for the deaf and hard of hearing. International Journal of Computing and Digital Systems, 9(4), 715-724.
  • Almasre, M. A., & Al-Nuaim, H. (2016). A real-time letter recognition model for Arabic sign language using Kinect and Leap Motion Controller v2. International Journal of Advanced Engineering, Management and Science, 2(5), 239469.