Arabic Sign Language Recognition using Lightweight EfficientNet

Sign language remains the primary communication method among deaf and hearing-impaired people. Advances in information technology have made it necessary to develop systems that can automate translation between spoken and sign language, and systems that translate between spoken Arabic and Arabic Sign Language have recently grown in popularity. The literature contains numerous proposed solutions for sign language recognition. However, Arabic Sign Language, contrary to American Sign Language, has not received enough attention from previous research. This paper proposes a novel Arabic Sign Language recognition system and addresses the challenges of (a) choosing the relevant visual features, (b) relying on the segmentation phase, and (c) using heavyweight CNN models. Specifically, this paper proposes the design and implementation of an Arabic Sign Language recognition system based on lightweight and efficient CNN models. The proposed system is image-based and applies a CNN to recognize Arabic hand-sign letters and translate them into Arabic speech: after the hand-sign letters are recognized, the results are fed into a speech engine that produces Arabic audio as output, so the system automatically detects hand-sign letters and speaks the outcome in Arabic using a deep learning model. The system achieves a recognition accuracy of 90%, implying that it is a highly reliable system. To further improve accuracy, more advanced hand-gesture recognition devices such as the Xbox Kinect or Leap Motion could be used. Overall, the EfficientNet model has been found to improve sign language recognition accuracy compared to existing approaches.

Chapter 1: Introduction
Communication is an essential skill for humans living in a society: it allows them to express themselves in a simple manner and to coexist harmoniously with others. Communication also plays a critical role in fostering debates about social issues in order to identify viable solutions before these problems worsen, and it has been identified as helpful for improving one's mental health and wellbeing [1]. The four types of communication are verbal, non-verbal, written, and visual. It is worth noting that persons with hearing/speech disabilities communicate differently from people without disabilities, because they use visual communication and nonverbal cues, which involve the movement of certain body parts and facial expressions [2].
Sign language is the main communication technique for deaf or mute persons, and is founded on a visual-motion code: a codified system expressed through standard hand positions and movements combined with facial expressions [3]. Contrary to popular belief, hearing disabilities are widespread. A World Health Organization (WHO) report showed that approximately 466 million people around the world suffer from moderate to profound hearing/speech loss, and 38 million of them live in the Eastern Mediterranean Region [4]. In January 2017, the AlSharq Al-Awsat newspaper estimated the number of people with hearing disabilities in Saudi Arabia at about 800,000 persons [5]. In June 2019, the Saudi Arabian Ministry of Health put the number at 229,541, based on a statistic from the General Authority for Statistics in Saudi Arabia [6]. These people usually use gestures and movements, or "signs", to convey meaning to others. Therefore, a sign language recognition system is perceived as an assistance tool that aids persons with hearing impairment in their communication with other members of society. Specifically, it aims to translate sign language gestures to make it easier for those impaired to socialize with others. Using collections of signs, people have developed many sign languages, each commonly used in a certain region. In sign language, the main components of signs are handshape, location, movement, orientation, and the non-manual component.
Nevertheless, sign language is common only among people with hearing and speaking disabilities, and mainly among their families and friends. Therefore, it creates a gap between those with hearing impairment and the rest of society. Hearing impairment also makes it more difficult, or occasionally impossible, to communicate with others, meet other people, or handle day-to-day situations. Arabic Sign Language (ArSL) reveals some discrepancies between different Arab countries, due to the presence of approximately 22 countries that speak Arabic as an official language. Despite the lack of uniformity at the language level, the numbers and letters are expressed using the same ArSL hand gestures [7]. We are going to establish a model that aims to use a lightweight EfficientNet to recognize sign language and thereby facilitate communication with hearing/speech-disabled persons.
1.1 Motivation
One of the major issues faced by those with hearing impairment is the communication barrier that stands between them and those without such impairments. In other words, communication, which is the basis of human socialization, is often out of reach for people who are unable to articulate their thoughts. Thus, people who suffer from hearing/speech disorders tend to be victims of social discrimination and isolation. Moreover, hearing-impaired or speech-disabled persons can become targets of bullying because of their disorders. Such bullying can lower the self-esteem of hearing- or speech-impaired persons, especially young children [8]. This project should help people with hearing and speaking
disabilities to communicate with others. The outcome of this project should be a model that developers can deploy in many applications that assist people with hearing and speaking disabilities, reduce the gap between them and their peers, and improve their communication [9].
1.2 Problem Statement
Recent research advances have promoted the design and implementation of various sign language recognition solutions, especially for American Sign Language [10], Chinese Sign Language [11], and Korean Sign Language [12]. More specifically, image-based solutions have gained researchers' interest as a non-intrusive and convenient alternative. They combine image processing and machine learning techniques to segment the hand captured in the image and map it automatically to the corresponding sign language gesture. The overall performance of such a system therefore depends on segmentation quality and on the choice of features that encode the main visual properties of the gesture. Recent solutions have exploited deep learning to address these challenges. Despite these efforts, existing ArSL recognition solutions leave considerable room for improvement, as they typically rely on heavyweight Convolutional Neural Network (CNN) models [13]. Common hardware is not powerful enough to run the existing models in real time to interpret ArSL into spoken Arabic. We therefore need to develop a lighter model that can be deployed in mobile applications, web applications, and other platforms.
1.3 Goals and Objectives
The main objectives of this project are to: (i) survey and investigate existing image-based sign language recognition approaches; (ii) design and develop a non-invasive ArSL fingerspelling recognition system using lightweight and efficient CNN models; specifically, the proposed system captures static images of Arabic alphabet signs using a camera and maps them to the corresponding letter class in real time; and (iii) compare the proposed system to existing relevant image-based Arabic sign language recognition approaches using standard datasets and performance measures.
1.4 Solution
In this project, we propose a novel ArSL recognition system and address the challenges of: (i) Choosing the relevant visual features, (ii) Relying on the segmentation phase, and (iii) Using heavyweight CNN models. Specifically, we propose to design and implement an Arabic Sign Language recognition system based on Lightweight and Efficient CNN Models.
Chapter 2: Background
2.1 Sign Language
Sign language is among the most significant communication means for those with hearing impairment and for their integration into society, yet many vocal persons lack an understanding of it. Thus, the need for automated systems that can translate sign languages into words and sentences is increasingly pressing. Language is generally viewed as a system of formal sounds, gestures, signs, or symbols used for day-to-day communication. Communication can be divided into four types: visual, verbal, nonverbal, and written. Verbal communication entails the transfer of information through speaking or sign language. Nonverbal communication, by contrast, transfers information through body language, gestures, and facial expressions. Written communication conveys information through printing, typing, or writing symbols such as letters and numbers, whereas visual communication conveys information through means like charts, graphs, art, photographs, or drawings. Sign language is defined as the movement of hands and arms to communicate, particularly among people with hearing disability. The lack of standardization of sign language makes it critical to study it at the level of a region or country. Because sign language is already an established language for the mute and deaf, there is the possibility of developing an automated system to enable them to talk to those who are neither mute nor deaf. Sign language comprises four main manual components: hand movement, hand configuration, hand location, and hand orientation in relation to the body [14].
The two main sign language recognition approaches are sensor-based and image-based; correspondingly, vision- and glove-based systems are the major types of gesture recognition. Glove-based systems use electromechanical devices to collect data on the gestures of the deaf. In these systems, a deaf person wears a wired glove connected to various sensors that relay the gestures made by the person's hand, and a computer interface recognizes those gestures. This approach provides good results but has been criticized as inconvenient, since the user must always carry wired sensors or gloves, which is an unnatural way of communicating between the deaf and those with hearing ability. The core benefit of the image-based approach is that signers are not required to use complex devices [15]. Vision-based systems, on the other hand, use machine learning and image processing techniques to identify, acknowledge, and interpret hand gestures. Such systems overcome the inconvenience of glove-based systems because users need not wear electromechanical devices; their major strength is the flexibility they provide. They do, however, require substantial computation during the pre-processing stage. Rather than cameras, sensor-based systems usually use instrumented gloves with built-in sensors, and they come with their own challenges, such as the gloves signers must wear. Traditionally, isolated word, alphabet, and continuous recognition have been the main typologies of image-based Arabic Sign Language Recognition (ArSLR).
Despite various attempts on the different Arabic sign language categories, this paper focuses on an Arabic sign language recognition system and on the design and implementation of an ArSL system founded on lightweight and efficient CNN models.
2.2 Arabic Sign Language
2.2.1 Image-Based ArSL Recognition
Traditionally, three types of image-based ArSLR systems have existed: continuous, isolated word, and alphabet recognition. Typically, an image-based recognition system is made up of five main stages: image acquisition, preprocessing, segmentation, feature extraction, and classification. A large proportion of earlier work concentrated on limited vocabularies, mainly for basic human-machine interaction. The input to such a system is a set of static images or signs, and signers are requested to pause between signs to make them easy to separate. One of the core benefits of an image-based system is user acceptance, since the signer need not wear a cumbersome data glove. Nevertheless, this technique faces various challenges, including the image background, various types of noise, and hand and face segmentation. Although face and hand segmentation is computationally costly, recent algorithmic and computing advances have made it feasible in real time [16]. Even so, widespread commercial use of image-based systems remains quite limited, particularly for ArSL. The next sections discuss the three major types of ArSL recognition systems.
2.2.2 Alphabet Recognition
In the alphabet recognition scenario, signers perform letters separately: a static posture represents each letter, and the vocabulary size remains limited. This section discusses various approaches for image-based ArSL alphabet recognition. Figure 1 below displays the alphabet used for ArSL. Although the Arabic alphabet is made up of only 28 letters, 39 signs are used in Arabic sign language; the 11 extra signs represent basic signs combining two letters. Some of these two-letter combinations are as common in Arabic as the article "the" is in English. Accordingly, many ArSLR studies make use of these 39 basic signs.
Figure 1: Arabic Sign Language Alphabet
Source: Mohandes [15]
Mohandes developed an automatic recognition system for ArSL letters in which Hu moments are used for feature extraction [15]. The moment invariants are fed to support vector machines for classification, achieving an 87% correct recognition rate. A neuro-fuzzy system was developed by Al-Jarrah and Halawani [17]. This system operates in a number of steps, including image acquisition, filtering, segmentation, hand outline detection, and feature extraction. The experiments considered bare hands, resulting in a 93.6% recognition accuracy. It was a vision-based system and thus dealt with signs as images. It is also worth noting that two distinct classifiers were deployed: Probabilistic Neural Networks (PNN) and Feed-Forward Neural Networks (FFNN).
In [18], an adaptive neuro-fuzzy inference system was built by Al-Rousan and Hussain for alphabet sign recognition. To simplify segmentation, the experiment used a glove, and geometric features were obtained from the hand area; a 95% recognition accuracy was achieved. A polynomial classifier was used by Assaleh and Al-Rousan [19] for recognizing alphabet signs performed by deaf people. It involved a glove-based system with six different colors: one for the wrist region and five for the fingertips. They deployed an Adaptive Neuro-Fuzzy Inference System (ANFIS), with various geometric measures, including angles and lengths, used as features. A 93.4% recognition rate was achieved using a database of not less than 200 samples representing 42 gestures.
Mohandes et al. utilized a cost-effective, off-the-shelf device to implement a robust ArSLR system [20]. Statistical features obtained from the acquired signals were combined with an SVM classifier; using a 120-sign database, the system yielded more than 90% recognition accuracy. Mohandes' work on recognizing two-handed Arabic signs using the CyberGlove was the first attempt at recognizing two-handed Arabic signs [21]. The database was made up of 20 samples per two-handed sign, performed by two signers. PCA was then used to reduce the feature vector's length, and an SVM was deployed for classification, leading to a 99.6% accuracy with 100 signs.
To facilitate alphabet recognition, Maraqa and Abu-Zaiter [22] utilized recurrent neural networks. A database of 900 samples covering thirty gestures performed by two signers was used in their experiments. The experimenters used colored gloves similar to those deployed in [20]. An 89.7% accuracy rate was achieved using the Elman network, whereas a fully recurrent network raised this accuracy to 95.1%. El-Bendary et al. [23] developed an SLR system for the Arabic alphabet which achieved a 91.3% accuracy. Their vision-based system processes bare-hand images: the input consists of features obtained from a video containing signs, and the output is simple text. The hand outline is first extracted for each frame, and three features are used to represent the hand's position. Distances to the hand outline covering 180 degrees are then extracted from a centroid point as a fifty-dimensional feature vector that is translation-, scale-, and rotation-invariant. The experiment assumed a small pause between letters during the segmentation phase; these pauses are used to separate letters and the related video frames. The alphabet's signs are categorized into three distinct categories prior to feature extraction. During the recognition stage, a minimum distance classifier (MDC) and a Multilayer Perceptron Neural Network are deployed.
In [24], Hemay and Hassanien discuss an ArSL alphabet recognition system which converts signs into voice. While this technique resembles a real-life setup, the recognition is not carried out in real time. The system handles both static and simple moving gestures. Color images of the gestures are the inputs; the YCbCr space is utilized for extracting skin blobs, while the hand shape is extracted using the Prewitt edge detector. To convert the image region into feature vectors during the classification phase, the K-Nearest Neighbor algorithm (KNN) is used with Principal Component Analysis (PCA). A Hidden Markov Model (HMM) was used by Mohandes et al. in [25] for identifying isolated Arabic signs from images. The study used a Gaussian skin color model to find the signer's face, which was then used as a reference for hand movement. It also used two colored gloves, one for the left hand and one for the right, to ease segmentation of the hand regions, applying a simple region-growing technique. The dataset contained 500 samples of 300 signs, leading to a 95% recognition accuracy. Similarly, in [26], Aliaa et al. created an Arabic sign language system founded on the HMM; they collected a broad dataset to recognize 20 isolated words from real videos of deaf individuals with various skin colors and clothes.
Figure 2: Sign “I’’s Image Sequence.
Source: Mohandes [15].
Figure 3: Extracted left and right-hand regions
Source: Mohandes [15]
In [27], Naoum et al. came up with an image-based SL alphabet recognition system that achieved accuracies of 50%, 75%, 65%, and 80% for the bare hand, red-gloved hand, black-gloved hand, and white-gloved hand, respectively. The system begins by computing the images' histograms; profiles obtained from these histograms are then used as input to KNN classifiers. In [28], Yang and Peng recommended integrating an improved SLR system into an intelligent environment to create a barrier-free setting for the deaf-mute. They proposed a bidirectional speech/sign-language system that can remove the communication barrier between vocal persons and the deaf-mute. To embed sign language into this integrated system, the 7 Hu moments and the normalized moment of inertia (NMI) are combined. The gesture translation, however, is not done in real time. More recently, as captured in [29], [30], [31], a methodology using the Microsoft Kinect camera for capture was introduced; computer vision techniques were used to develop a motion profile and characteristic depth for each gesture.
In their study [32], Nadia et al. designed an ArSL system for identifying sign characters in high-resolution video. The system is made up of two stages. The first stage detects hand movement in all frames to determine the final processing region (the Region of Interest), applying a motion detection technique to detect the frame, or time, of recognition. Based on a k-NN classifier and a Fourier-descriptor feature extraction approach, sign language recognition is performed in the second stage. The system achieved a 90.55% accuracy rate.
Apart from the various glove- and image-based systems in use today, new systems aiding human-machine interaction have been introduced. The Leap Motion Controller (LMC) and Microsoft Kinect have elicited particular attention. In addition to a high-resolution video camera, the Kinect system deploys depth sensors and an infrared emitter. The LMC uses three LEDs and two infrared cameras to capture information within its interaction range; however, the LMC does not provide images of the detected objects. Recently, the LMC has been deployed for Arabic alphabet sign recognition with great results [33]. Arabic alphabet sign recognition is heralded as the simplest among all Arabic sign language recognition methodologies, for two reasons: the limited vocabulary size, and the representation of signs as static images. Such systems tend to attain high recognition rates, normally above 90%. However, alphabet signs are uncommon in day-to-day practice; their use is constrained to finger-spelling words that lack specific signs, such as proper names. For these reasons, current research efforts have focused on developing systems for continuous or isolated word sign recognition.
2.2.3 Word Sign Recognition
Contrary to alphabet sign recognition, techniques for word sign recognition often analyze a sequence of images representing the whole sign, as shown in Figure 3. Mohandes and Deriche [34] deployed an HMM to identify isolated Arabic signs obtained from images. The dataset contained 500 samples representing 50 signs. In the study, the signer's face was found using a Gaussian skin color model and then used as a reference for hand movements. Yellow and orange colored gloves were used for the left and right hands to ease hand region segmentation, as shown in Figure 3, applying a simple region-growing technique similar to the one used in [35]. The approach achieved a 98% accuracy rate on more than 50 signs. In [36], Shanableh and Assaleh came up with a signer-independent system for isolated Arabic signs. Segmented hand images obtained with colored gloves were used, zonal discrete cosine transform coefficients were used for feature extraction, and a KNN algorithm was used for classification. With a 23-sign vocabulary, the authors attained an 87% classification rate. Shanableh and Assaleh then extended their work with HMM-based classification [37], [38], introducing new video-based features that consider motion and yielding a 95% recognition accuracy. Again, in [39], Shanableh and Assaleh developed a user-independent recognition system whose dataset contained 3450 video segments covering 23 isolated gestures performed by three signers. The signers wore colored gloves so that color information could be used during preprocessing. Features are extracted from the accumulated differences between images, and a simple KNN algorithm is used in the classification stage, leading to an 87% recognition rate.
Youssif et al. [40] came up with an ArSLR system using an HMM for recognizing isolated signs. The palm and finger regions were modeled as circles and ellipses. Using a limited vocabulary of 20 signs and eight features, they achieved an 82.2% accuracy. In their study, Zaki and Shaheen [41] combined appearance-based features, using the kurtosis position to identify the articulation location, PCA to represent the hand region, and a motion chain code for hand movement. Using a 50-sign database, the system attained a 90% recognition accuracy. A semantics-oriented methodology was proposed in [42] by Samir and Aboul-Ela, who used natural language processing rules to detect and correct errors in the classification phase; the proposed approach enhanced ArSLR recognition accuracy by about 20%. Elons et al. [43] deployed a PCNN approach for image feature generation. They evaluated the features through a fitness function to obtain a weighting factor for each camera, and the features obtained from the two images were used to derive three-dimensional optimized features. Recognition accuracy stood at 96% with 50 isolated words. Furthermore, Al-Rousan et al. [44] proposed a system able to automatically translate dynamic signs. This hierarchical system divided the signs into groups: the group is identified first, followed by recognition of the sign within that group. The authors used 23 geometric features tracked with an HMM classifier, attaining 92.5% and 70.5% recognition accuracy in user-dependent and user-independent modes, respectively. Their work extended an already-developed algorithm that focused only on static postures [44]. An automatic isolated-word recognition system was developed by Al Mashagba et al. [45] using two distinct colored gloves and a colored reference mark on the head. Five geometric features were obtained after extracting the three colored regions from each video sequence, including hand horizontal position, hand vertical position, hand horizontal velocity, and hand vertical velocity. The Time Delay Neural Network used led to a 77.4% recognition accuracy.
Overall, isolated word sign language recognition is perceived as more practical, though alphabet recognition is less complex. Word recognition systems must handle a sequence of images, so the time component of the analysis is crucial, and the vocabulary size can be quite large. Dealing with signs separated by pauses also remains a challenge. As the vocabulary size increases, accuracy decreases. For ArSL, the vocabulary size required for practical situations remains an area that needs further research. The challenge for ArSL is thus to develop signer-independent systems with a vocabulary large enough to be suitable for practical use.
2.3 Deep Learning
While there are numerous recognition systems, Arabic sign language lacks a recognition system that deploys newer techniques, including Convolutional Neural Networks (CNN), cyber-physical systems, and cognitive computing, which automated systems use extensively [46]. The cognitive process allows systems to think in a manner similar to the human brain, without external help; in other words, the human brain inspires this cognitive ability [47]. Deep learning, by contrast, is defined as a subset of machine learning in AI with networks capable of learning unsupervised from unlabeled or unstructured data (also called deep neural networks or deep neural learning) [48]. Deep learning algorithms have been used effectively, efficiently, and successfully for image recognition problems. As a process, deep learning entails neural networks with at least one hidden layer, and it has been successfully applied to natural language processing, face recognition, and speech recognition [49]. Recently, deep learning has also been used successfully for human gesture recognition.
Within deep learning, the CNN is a category of deep neural networks most often applied in computer vision, and it is the most popular algorithm for implementing deep learning techniques. A CNN is made up of several layer types, including convolutional layers, pooling layers, and fully-connected layers. Vision-based approaches predominantly concentrate on the captured image of the gesture and extract the main features for identifying it. The method has been applied to numerous tasks, such as image classification, multimedia systems, super-resolution, emotion recognition, and semantic segmentation [50]. In [51], a proposal is made to apply transfer learning to data gathered from various users, while simultaneously using a deep learning algorithm to learn discriminant characteristics from large datasets.
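The convolutional and pooling layers mentioned above can be illustrated with a minimal, dependency-free Python sketch. This is not the proposed system's implementation: a real system would use a framework such as TensorFlow or PyTorch, and the toy image and edge-detecting kernel below are invented for illustration only.

```python
def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation, as in most DL libraries)."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            s = 0.0
            for di in range(kh):
                for dj in range(kw):
                    s += image[i + di][j + dj] * kernel[di][dj]
            row.append(s)
        out.append(row)
    return out

def max_pool2d(image, size=2):
    """Non-overlapping max pooling with a size x size window."""
    out = []
    for i in range(0, len(image) - size + 1, size):
        row = []
        for j in range(0, len(image[0]) - size + 1, size):
            row.append(max(image[i + di][j + dj]
                           for di in range(size) for dj in range(size)))
        out.append(row)
    return out

# Toy 4x4 grayscale "hand image" containing a vertical edge,
# and a 3x3 vertical-edge detection kernel.
img = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 1, 1]]
k = [[-1, 0, 1],
     [-1, 0, 1],
     [-1, 0, 1]]

fmap = conv2d(img, k)      # 2x2 feature map highlighting the edge
pooled = max_pool2d(fmap)  # 1x1 summary after 2x2 max pooling
```

Each convolutional layer produces feature maps like `fmap`; pooling then downsamples them, and the fully-connected layers at the end map the pooled features to letter classes.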
The first step towards developing a working deep learning model is data preprocessing, which transforms the raw data into an efficient and useful format. Figure 4 illustrates a flow diagram for data preprocessing.
Figure 4: Data Preprocessing flow diagram
Source: [52]
The raw images are hand sign images captured with a camera for the implementation of the proposed system. The images are taken under varying conditions: different angles, changing lighting, changing object distance and size, and in focus at good quality. The goal of collecting raw images is to create a dataset for training and testing.
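The preprocessing flow described above can be sketched as follows. This is an illustrative outline only: the helper names (`normalize`, `nearest_neighbor_resize`, `train_test_split`) are hypothetical, and a real pipeline would typically use libraries such as OpenCV, NumPy, or scikit-learn for these steps.

```python
import random

def normalize(pixels):
    """Scale 8-bit pixel values into [0, 1] for network input."""
    return [p / 255.0 for p in pixels]

def nearest_neighbor_resize(image, new_h, new_w):
    """Resize a 2-D grayscale image so every sample has a uniform shape."""
    h, w = len(image), len(image[0])
    return [[image[i * h // new_h][j * w // new_w] for j in range(new_w)]
            for i in range(new_h)]

def train_test_split(samples, test_ratio=0.2, seed=42):
    """Shuffle and split the dataset for training and testing."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

# Toy 4x4 image resized to a uniform 2x2 shape (picks rows/cols 0 and 2).
img = [[10, 20, 30, 40],
       [50, 60, 70, 80],
       [90, 100, 110, 120],
       [130, 140, 150, 160]]
small = nearest_neighbor_resize(img, 2, 2)
```

Resizing to a fixed shape and scaling pixel values are the standard steps that turn raw camera captures into a dataset a CNN can consume; the split then reserves a portion for unbiased evaluation.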
2.4 EfficientNet
Conceptually, EfficientNet is an architecture scaling method that uses a simple, effective compound coefficient to scale up CNNs in a more structured way. Google recently published a study on a newly designed CNN known as EfficientNet [53], which set new records for computational efficiency and accuracy. The various scaling methods of EfficientNet, and comparisons between them, are demonstrated in Figure 5.
Figure 5: EfficientNet’s Baseline Network
Source: [53].
The AutoML MNAS framework was instrumental in optimizing both accuracy and efficiency (FLOPS). It is also worth noting that EfficientNet's architecture utilizes the mobile inverted bottleneck convolution (MBConv), analogous to MobileNetV2 and MnasNet. In [53], the authors demonstrated that the CNN model can precisely perceive several indications of sign- and gesture-based communication. The study compared a number of well-known deep learning models, including GoogleNet, VGGNet, and EfficientNet, on an array of factors such as validation accuracy, training accuracy, total parameters, and trainable parameters. The findings showed that, on the Arabic Sign Language alphabet dataset, the EfficientNet model outperformed the other CNN models in accuracy and total parameters, as demonstrated in Figure 6 below.
Figure 6: ImageNet vs FLOPS Accuracy
Source: [54]
The figure above illustrates ImageNet accuracy versus model size. Besides being small, EfficientNet models are also computationally cheaper. For example, EfficientNet-B3 achieves higher accuracy than ResNeXt-101, as evidenced in Figure 7 below.
Figure 7: Accuracy Comparison for EfficientNet vs ResNeXt-101
Source: [54]
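Much of MBConv's efficiency comes from replacing standard convolutions with depthwise-separable ones. As a back-of-the-envelope sketch (the channel sizes below are illustrative, not taken from [53]), the parameter counts compare as follows:

```python
def standard_conv_params(k, c_in, c_out):
    # A standard k x k convolution mixes every input channel in every filter.
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # Depthwise step: one k x k filter per input channel.
    # Pointwise step: a 1 x 1 convolution that mixes the channels.
    return k * k * c_in + c_in * c_out

# Example: a 3x3 layer taking 144 channels down to 32 channels.
print(standard_conv_params(3, 144, 32))        # 41472 parameters
print(depthwise_separable_params(3, 144, 32))  # 5904 parameters
```

The roughly 7× reduction in this example is what lets MBConv-based networks stay lightweight while keeping a large receptive field.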
Selecting an appropriate baseline network is important for attaining the best results. In this regard, the authors of [53] used Neural Architecture Search to build an efficient network architecture called EfficientNet-B0. It achieves an accuracy of 77.3% on ImageNet using only 0.9B FLOPS and 5.3M parameters. This network's building block is the MBConv, extended with squeeze-and-excitation optimization. The inverted residual blocks used in MBConv are similar to those deployed in MobileNetV2; they form a shortcut connection between the start of a convolutional block and its end. A 1×1 convolution first expands the input activation maps to increase the depth of the feature maps; a 3×3 depthwise convolution follows, and a final 1×1 pointwise convolution reduces the number of channels again. This structure reduces both the number of operations needed and the model size. In EfficientNet, scaling up any dimension of the network (resolution, width, or depth) enhances accuracy, but the gain diminishes for larger models [55]. This implies that scaling for higher accuracy should combine all three dimensions, since scaling a single dimension alone limits the accuracy benefits. More importantly, to pursue better efficiency and accuracy, balancing all network dimensions during ConvNet scaling is critical. In summary, the EfficientNet model can produce higher accuracy for sign language recognition than existing approaches, with a significant reduction in model size and overall FLOPS. To contribute to Arabic Sign Language research, this paper introduces a different approach: a non-invasive ArSL finger-spelling recognition system using lightweight and efficient CNN models.
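The compound-scaling rule summarized above can be sketched numerically. The coefficients below (α = 1.2 for depth, β = 1.1 for width, γ = 1.15 for resolution) are the values reported for EfficientNet, chosen under the constraint α·β²·γ² ≈ 2 so that each increment of the compound coefficient φ roughly doubles the FLOPS:

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution multipliers

def compound_scale(phi):
    """Scale network depth, width and input resolution with one coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

# phi = 0 leaves the EfficientNet-B0 baseline unchanged;
# larger phi scales all three dimensions together (B1, B2, ...).
depth, width, resolution = compound_scale(1)
```

This is why the larger EfficientNet variants grow in all three dimensions at once rather than, say, only stacking more layers.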
2.5 Summary
Image-based and sensor-based are the two dominant sign language recognition approaches, while the main forms of gesture recognition are glove-based and vision-based systems. The three traditional forms of image-based Arabic sign language recognition are alphabet, isolated word, and continuous recognition. The five main stages in image-based recognition are image acquisition, preprocessing, segmentation, feature extraction, and classification. Many systems have been proposed for alphabet recognition, including Hu moments, neuro-fuzzy systems, PNN, FFNN, ANFIS, PCA, MDC, MPNN, HMM, LMC, and Microsoft Kinect. Word-level sign language recognition is more practical, but also more complex. Because of the gaps in ArSL research, this paper seeks to address them by introducing a non-invasive ArSL finger-spelling recognition system using lightweight and efficient CNN models. The EfficientNet model has been found to be more effective in terms of accuracy and total parameters.
[1] C. Martin and N. Chanda, "Mental Health Clinical Simulation: Therapeutic Communication," Int. J. of Clinical Simulation in Nursing, vol. 12, no. 6, pp. 209–214, Jun. 2016.
[2] E. K. Elsayed, and D. R. Fathy, “Semantic Deep Learning to Translate Dynamic Sign Language”, Int. J. of Intelligent Engineering and Systems, Jan. 2021.
[3] A. Horňakova and A. Hudakova, "Effective communication with deaf patients," original scientific article, vol. 4, no. 7, 2013.
[4] "Eastern Mediterranean Region," World Health Organization. [Online]. Available: [Accessed: 10-Feb-2021].
[5] E. AlKhattaf, "Sign language in Saudi Arabia… from communicating with the deaf to an advantage" [in Arabic], AlSharq-AlAwsat. [Online]. Available: [Accessed:
[6] "The (We Are With You) initiative to serve the deaf community" [in Arabic], Ministry of Health. [Online]. Available: [Accessed:
[7] M. Mustafa, “A study on Arabic sign language recognition for differently abled using advanced machine learning classifiers”, J. of Ambient Intelligence and Humanized Computing, Mar. 2020.
[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton. “ImageNet Classification with Deep Convolutional Neural Networks”, 25th Int. Conference on Neural Information Processing Systems, Vol.1 Dec. 2012.
[9] L. Tolentino, R. Juan, A. Thio-ac, M. Pamahoy, J. Forteza, and X. Garcia, “Static Sign Language Recognition Using Deep Learning”, Int. J. of Machine Learning and Computing, Vol. 9, No. 6, Dec.2019.
[10] R. A. Kadhim, and M. Khamees, “A Real-Time American Sign Language Recognition System using Convolutional Neural Network for Real Datasets”, TEM Journal. Vol. 9, Issue 3, Pages 937-943, Aug. 2020.
[11] H. Shin, W. Kim, and K. Jang, “Korean sign language recognition based on image and convolution neural network”, 2nd International Conference on Image and Graphics Processing, Pages 52–55, Feb. 2019.
[12] X. Jiang, M. Lu, and S.Wang, “An eight-layer convolutional neural network with stochastic pooling, batch normalization and dropout for fingerspelling recognition of Chinese sign language”. Int. J. of Multimedia Tools Applications 79,15697–15715, Jun. 2020.
[13] R. Rastgoo, K. Kiani, and S. Escalera, “Hand sign language recognition using multi-view hand skeleton”, Int. J. of Expert Systems with Applications, Vol.150, Jul. 2020.
[14] E. Costello, American Sign Language Dictionary, Random House, New York, NY, USA, 2008.
[15] M. Mohandes, “Arabic sign language recognition.” In International conference of imaging science, systems, and technology, Las Vegas, Nevada, USA, vol. 1, pp. 753-9. 2001.
[16] R. Azad, B. Azad, and I. T. Kazerooni, “Real-time and robust method forhand gesture recognition system based on cross-correlation coefficient,”Adv. Comput. Sci., Int. J., vol. 2, no. 5/6, pp. 121–125, Nov. 2013.
[17] O. Al-Jarrah and A. Halawani, "Recognition of gestures in Arabic sign language using neuro-fuzzy systems," Artificial Intelligence, vol. 133, no. 1, pp. 117–138, 2001.
[18] M. Al-Rousan and M. Hussain, “Automatic recognition of Arabic sign language finger spelling.” International Journal of Computers and Their Applications, vol. 8, pp. 80–88, 2001.
[19] K. Assaleh and M. Al-Rousan, “Recognition of Arabic sign language alphabet using polynomial classifiers.” EURASIP Journal on Advances in Signal Processing, no.13, pp. 2136–2145, 2005.
[20] M. Mohandes, S. A-Buraiky, T. Halawani, and S. Al-Baiyat, "Automation of the Arabic sign language recognition," in Proc. 2004 Int. Conf. on Information and Communication Technologies: From Theory to Applications, IEEE, 2004, pp. 479–480.
[21] M. A. Mohandes, "Recognition of two-handed Arabic signs using the CyberGlove," Arabian Journal for Science and Engineering, vol. 38, no. 3, pp. 669–677, 2013.
[22] M. Maraqa and R. Abu-Zaiter, “Recognition of Arabic Sign Language (ArSL) using recurrent neural networks.” In 2008 First International Conference on the Applications of Digital Information and Web Technologies (ICADIWT), 2008, pp. 478–481.
[23] N. El-Bendary, H. M. Zawbaa, M. S. Daoud, A. E. Hassanien, and K. Nakamatsu, "ArSLAT: Arabic sign language alphabets translator," in 2010 Int. Conf. on Computer Information Systems and Industrial Management Applications (CISIM), 2010, pp. 590–595.
[24] E. E. Hemayed and A. S. Hassanien, "Edge-based recognizer for Arabic sign language alphabet (ArS2V-Arabic sign to voice)," in 2010 Int. Computer Engineering Conference (ICENCO), 2010, pp. 121–127.
[25] M. Mohandes, M. Deriche, U. Johar, and S. Ilyas, “A signer-independent Arabic Sign Language recognition system using face detection, geometric features, and a Hidden Markov Model.” Computers & Electrical Engineering, vol. 38, no. 2, pp. 422–433, 2012.
[26] A. A. Youssif, A. E. Aboutabl, and H. H. Ali, "Arabic sign language (ArSL) recognition system using HMM," International Journal of Advanced Computer Science and Applications (IJACSA), vol. 2, no. 11, 2011.
[27] R. Naoum, H. H. Owaied, and S. Joudeh, "Development of a new Arabic sign language recognition using k-nearest neighbor algorithm," Journal of Emerging Trends in Computing and Information Sciences, vol. 3, pp. 1173–1178, 2012.
[28] Y. Quan, “Chinese sign language recognition based on video sequence appearance modeling.” In 2010 5th IEEE Conference on Industrial Electronics and Applications, 2010, pp. 1537–1542.
[29] Y. Quan et al., "Application of improved sign language recognition and synthesis technology in IB," in 2008 3rd IEEE Conference on Industrial Electronics and Applications, 2008, pp. 1629–1634.
[30] A. Agarwal and M. K. Thakur, "Sign language recognition using Microsoft Kinect," in 2013 Sixth International Conference on Contemporary Computing (IC3), 2013, pp. 181–185.
[31] E.-J. Ong, H. Cooper, N. Pugeault, and R. Bowden, “Sign language recognition using sequential pattern trees.” In 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2200–2207.
[32] N. R. Albelwi and Y. M. Alginahi, "Real-time Arabic sign language (ArSL) recognition," in Int. Conf. on Communications and Information Technology, 2012, pp. 497–501.
[33] M. Mohandes, S. Aliyu, and M. Deriche, "Arabic sign language recognition using the Leap Motion," presented at the IEEE Int. Symp. on Industrial Electronics, Istanbul, Turkey, Jun. 2014.
[34] M. Mohandes and M. Deriche, "Image based Arabic sign language recognition," in Proc. Eighth Int. Symp. on Signal Processing and Its Applications, vol. 1, 2005, pp. 86–89.
[35] T. Shanableh and K. Assaleh, “Arabic sign language recognition in user-independent mode,” in Proc. Int. Conf. Intell. Adv. Syst., 2007, pp. 597–600.
[36] T. Shanableh and K. Assaleh, “Video-based feature extraction techniques for isolated Arabic sign language recognition,” in Proc. 9th Int. Symp. Signal Process. Appl., 2007, pp. 1–4.
[37] T. Shanableh and K. Assaleh, “Two tier feature extractions for recognition of isolated Arabic sign language using fisher’s linear discriminants,” in Proc. IEEE Int. Conf. Acoust, Speech Signal Process., 2007, vol. 2, II–501–II–504.
[38] T. Shanableh and K. Assaleh, “User-independent recognition of Arabic sign language for facilitating communication with the deaf community,” Digit. Signal Process., Rev. J., vol. 21, no. 4, pp. 535–542, 2011.
[39] A. A. A. Youssif, A. E. Aboutabl, and H. H. Ali, “Arabic sign language (ArSL) recognition system using HMM,” Int. J. Adv. Comput. Sci. Appl., vol. 2, no. 11, pp. 45–51, Nov. 2011.
[40] M. M. Zaki and S. I. Shaheen, "Sign language recognition using a combination of new vision-based features," Pattern Recog. Lett., vol. 32, no. 4, pp. 572–577, 2011.
[41] A. Samir and M. Aboul-Ela, “Error detection and correction approach for Arabic sign language recognition,” in Proc. 7th Int. Conf. Comput. Eng. Syst., 2012, pp. 117–123.
[42] A. S. Elons, M. Abull-Ela, and M. F. Tolba, “A proposed PCNN features quality optimization technique for pose-invariant 3DArabic sign language recognition,” Appl. Soft Comput., vol. 13, no. 4, pp. 1646–1660, 2013.
[43] M. Al-Rousan, O. Al-Jarrah, and M. Al-Hammouri, “Recognition of dynamic gestures in Arabic sign language using two stages hierarchical scheme,” Int. J. Knowl.-Based Intell. Eng. Syst., vol. 14, no. 3, pp. 139–152, 2010.
[44] M. Al-Rousan, K. Assaleh, and A. Tala’a, “Video-based signer-independent Arabic sign language recognition using hidden Markov models,” Appl. Soft Comput., vol. 9, no. 3, pp. 990–999, Jun. 2009.
[45] F. F. Al Mashagba, E. F. Al Mashagba, and M. O. Nassar, “Automatic isolated-word Arabic sign language recognition system based on time delay neural networks: New improvements,” J. Theor. Appl. Inf. Technol., vol. 57, no. 1, pp. 42–47, Nov. 2013.
[46] Y. Zhang, X. Ma, S. Wan, H. Abbas, and M. Guizani, “CrossRec: cross-domain recommendations based on social big data and cognitive computing,” Mobile Networks & Applications, vol. 23, no. 6, pp. 1610–1623, 2018.
[47] K. Lin, C. Li, D. Tian, A. Ghoneim, M. S. Hossain, and S. U. Amin, “Artificial-intelligence-based data analytics for cognitive communication in heterogeneous wireless networks,” IEEE Wireless Communications, vol. 26, no. 3, pp. 83–89, 2019.
[48] X. Chen, L. Zhang, T. Liu, and M. M. Kamruzzaman, “Research on deep learning in the field of mechanical equipment fault diagnosis image quality,” Journal of Visual Communication and Image Representation, vol. 62, pp. 402–409, 2019.
[49] W. Wang, J. Yang, J. Li, S. Xiao, and D. Zhou, "Face recognition based on deep learning," in Int. Conf. on Human Centered Computing, Springer, Cham, 2014, pp. 812–820.
[50] M. S. Hossain and G. Muhammad, “Emotion recognition using secure edge and cloud computing,” Information Sciences, vol. 504, no. 2019, pp. 589–601, 2019.
[51] U. Cote-Allard, C. L. Fall, A. Drouin et al., “Deep learning for electromyographic hand gesture signal classification using transfer learning,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 27, no. 4, pp. 760–771, 2019.
[52] M. M. Kamruzzaman, "Arabic sign language recognition and generating Arabic speech using convolutional neural network," Wireless Communications and Mobile Computing, 2020.
