A Survey of Language and Dialect Identification Systems



ISSN NO: 1301-2746

A Survey of Language and Dialect Identification Systems

Tanvira Ismail Senior Assistant Professor, Assam Don Bosco University, India

Abstract—As per dictionary, a language is a system of communication which consists of a set of sounds and written symbols which are used by the people of a particular country or region for talking or writing. Besides languages, people also communicate through dialects. Dialect refers to a regional or social variety of a language distinguished by pronunciation, grammar or vocabulary. The quest to automate the ability to identify languages and dialects has never stopped and hence the rise in research on automatic language and dialect identification systems. In this paper, we survey work done in the field of language and dialect identification.
Keywords—LID, dialect identification, GMM, GMM-UBM, SVM, HMM, neural network
Humans are born with the ability to discriminate between spoken languages as part of human intelligence [1]. The first perceptual experiment measuring how well human listeners can perform language identification was reported by Muthusamy et al. [2], who concluded that human beings, with adequate training, are the most accurate language recognizers. For languages with which they are not familiar, human listeners can often make subjective judgments with reference to the languages they know, e.g., ‘it sounds like English’. Though such judgments are not precise enough for the hard decisions that an identification task requires, they show how human listeners apply linguistic knowledge at different levels to distinguish between certain broad language groups. The quest to automate this ability has never stopped. Like other artificial intelligence technologies, automatic language identification (LID) aims to replicate this human ability through computational means [2].
Besides languages, people also communicate through dialects. A dialect is a variety of speech that differs from the standard literary language or speech pattern of the culture in which it exists. Such a variety may be associated with a particular place or region. For example, American English (spoken by people from America) and British English (spoken by people from Britain) are dialects of English. Dialects of a language differ from each other, but each remains understandable to speakers of the other dialects of the same language [3]. Dialect identification is the task of recognizing a speaker’s regional dialect within a predetermined language. Automatic dialect identification is viewed as more challenging than language recognition because of the greater similarity between dialects of the same language [4].
LID-based research has received much interest and attention due to its importance in the areas of machine translation, speech recognition, data mining, spam filtering, document summarization, etc.
On the other hand, a good method for detecting dialect accurately helps improve applications and services such as the speech recognition systems found in most of today’s electronic devices [5]. It allows researchers to infer the speaker’s regional origin and ethnicity and to adapt the features used in speaker identification to regional origin [6]. An accurate dialect identification technique is also expected to help provide new services in the field of e-health and telemedicine, which is especially important for older and homebound people [5].
The task of identifying the language being spoken from a sample of speech by an unknown speaker is called LID. Sugiyama [6] proposed two language identification algorithms. The first was based on the standard vector quantization (VQ) algorithm and the second on VQ combined with a histogram algorithm. This work used acoustic features of the speech signal such as Linear Predictive Coding (LPC) coefficients, autocorrelation coefficients and delta cepstral coefficients. In the first algorithm, every language k was characterized by its own VQ codebook, Vk. The second algorithm used a single universal VQ codebook, U = {vi}, for all languages together with occurrence probability histograms, so that every language k was characterized by its histogram hk. The multilingual speech database used for the work consisted of 20 languages, namely, American, Arabic, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hindi, Hungarian, Italian, Japanese, Norwegian, Polish, Portuguese, Russian, Spanish and Swedish. For each language, 16 sentences were uttered twice by four male speakers and four female speakers. The duration of each sentence was about 8 seconds. The experimental results showed that the recognition rates for the first and second algorithms were 65% and 80%, respectively. For each algorithm, just 8 sentences of unknown speech, i.e., a total of approximately 64 seconds, were used.
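The two decision rules described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Sugiyama's implementation: the two-dimensional toy "feature vectors", the codebooks, the histograms and all function names are hypothetical, codebook training is omitted, and scoring the histogram by log-likelihood is one plausible choice rather than the rule reported in [6].

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(codebook, vec):
    """Return (index, squared distance) of the codeword closest to vec."""
    best = min(range(len(codebook)), key=lambda i: sq_dist(codebook[i], vec))
    return best, sq_dist(codebook[best], vec)

def classify_by_codebook(codebooks, frames):
    """Scheme 1: choose the language whose own codebook quantizes the
    utterance with the lowest average distortion."""
    def avg_distortion(codebook):
        return sum(nearest(codebook, f)[1] for f in frames) / len(frames)
    return min(codebooks, key=lambda lang: avg_distortion(codebooks[lang]))

def classify_by_histogram(universal, histograms, frames, floor=1e-6):
    """Scheme 2: quantize with one universal codebook, then score each
    language's codeword-occurrence histogram (log-likelihood is assumed)."""
    indices = [nearest(universal, f)[0] for f in frames]
    def log_lik(hist):
        return sum(math.log(max(hist[i], floor)) for i in indices)
    return max(histograms, key=lambda lang: log_lik(histograms[lang]))

# Hypothetical toy models for two made-up languages "A" and "B".
CODEBOOKS = {"A": [(0.0, 0.0), (1.0, 1.0)], "B": [(5.0, 5.0), (6.0, 6.0)]}
UNIVERSAL = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (6.0, 6.0)]
HISTOGRAMS = {"A": [0.45, 0.45, 0.05, 0.05], "B": [0.05, 0.05, 0.45, 0.45]}

utterance = [(0.1, -0.1), (0.9, 1.2), (0.2, 0.1)]
print(classify_by_codebook(CODEBOOKS, utterance))
print(classify_by_histogram(UNIVERSAL, HISTOGRAMS, utterance))
```

The first scheme needs one codebook search per language for every frame, whereas the second quantizes each frame once and then only compares histograms.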
Balleda et al. [7] presented an LID system that worked for four South Indian languages, viz., Kannada, Malayalam, Tamil and Telugu and one North Indian language viz., Hindi. The speech corpora consisted of speech utterances from read text. For each language, they collected speech from five native male speakers and five native female speakers ensuring further that the text read by different speakers was different. The text of the sentences was chosen randomly and no attempt was made to choose a phonetically balanced set of sentences. The training utterances had an average length of ten seconds and the testing utterances had an average length of five seconds.

Volume 9, Issue 1, January 2020





For each speaker, 60 seconds of speech was collected. It was also ensured that the text sentences used in training and testing were different. Each language was modeled using an approach based on VQ. The speech was segmented into different sounds and the performance of the system on each of the segments was studied. It was observed that the presence of some consonant-vowel combinations (CVs) was crucial for each language, and that for the same consonant and vowel combination the quality of the sound differed between languages. This study also showed that once the speech signal was segmented into CVs, it was possible to perform automatic language identification on very short segments of speech (in the range of 100-150 ms).
Nagarajan and Murthy [8] proposed an approach that used parallel syllable-like unit recognizers in a framework similar to Parallel Phone Recognition (PPR) for the language identification task. The difference between their system and the PPR system was that unsupervised syllable models were built from the training data. The data were first segmented into syllable-like units. The syllable segments were then clustered using an incremental approach. This resulted in a set of syllable models for each language. These language-dependent syllable models were then used to identify the language of unknown test utterances. The Oregon Graduate Institute Multi-language Telephone Speech (OGI-MLTS) corpus, which consists of spontaneous speech in eleven languages, viz., English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil and Vietnamese, was used in this approach. The utterances were recorded from 90 male and 40 female speakers in each language. For each language, 30 speakers were used for training and 20 speakers were used for testing. The initial results of their system on the OGI-MLTS corpus showed a performance of 69.5%. They further showed that if only a subset of syllable models that were unique in some sense were considered, the performance improved to 75.9%.
Padro and Padro [9] performed language identification using three different statistical methods based on Markov models, comparison of trigram frequency vectors and n-gram text categorization. The experiments studied how the training set size, the amount of text to classify and the number of languages among which the system can choose influence system performance. The corpora were formed from a set of daily newspaper news and covered six languages, viz., Catalan, Spanish, English, Italian, German and Dutch. For each language corpus, a random partition containing about 30,000 words was selected as the test set. The rest of each corpus was used to randomly extract training samples of different sizes. The experiments involved training each system for all languages using a training set ranging from 2500 to 25000 words and evaluating its performance on the test data. The test was done by giving the system an amount of unclassified text ranging from 5 to 1000 characters. The process was repeated for all possible combinations of languages, from two to six languages. The experiments revealed that the influence of the training set size is not important when the size is bigger than approximately 50k words. These researchers [9] also showed that the amount of text to classify is crucial, but that very long texts are not necessary to achieve good precision. All the systems achieved a precision higher than 95% for texts over 500 characters, and a precision higher than 99% for texts of 5000 characters. The number of languages the system could identify was found to be a very relevant factor: the more languages the system has to recognize, the lower its precision. Furthermore, it was concluded that if the language identification system is applied in a multilingual environment involving similar languages, its precision falls, whereas if the languages to be identified have different origins, the system achieves high precision.
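To make the trigram-frequency idea concrete, the sketch below builds a character-trigram frequency vector per language and assigns a text to the language with the most similar vector. It is a toy illustration only: the one-sentence training texts, the use of cosine similarity as the comparison measure and the function names are assumptions, not details of the systems in [9], which were trained on thousands of words per language.

```python
import math
from collections import Counter

def trigram_profile(text):
    """Relative frequencies of character trigrams in a text."""
    text = " " + " ".join(text.lower().split()) + " "
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}

def cosine(p, q):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(v * q.get(gram, 0.0) for gram, v in p.items())
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

def identify(text, profiles):
    """Return the language whose trigram profile is closest to the text."""
    probe = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine(probe, profiles[lang]))

# Hypothetical, deliberately tiny training texts.
TRAIN = {
    "English": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "Spanish": "el rápido zorro marrón salta sobre el perro perezoso y luego el perro duerme al sol",
    "German": "der schnelle braune fuchs springt über den faulen hund und dann schläft der hund in der sonne",
}
PROFILES = {lang: trigram_profile(text) for lang, text in TRAIN.items()}

print(identify("the dog jumps over the fox", PROFILES))
```

With profiles this small the classifier is fragile on short or out-of-vocabulary inputs, which mirrors the survey's point that both the training size and the amount of text to classify matter.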
Singh et al. [10] explored the use of a prosodic feature based sparse representation classification (SRC) system for the LID task. The prosodic features, i.e., intonation, rhythm and stress, were computed by extracting syllable-like units with the help of a vowel onset point detection algorithm and were mapped to the i-vector domain for SRC using an exemplar dictionary. For comparison, they also developed a contrast system based on cosine distance scoring (CDS). The system was evaluated on five Indian languages, viz., Assamese, Bengali, English, Hindi and Nepali. The test data consisted of 114 speech utterances from speakers aged 18 to 35 years, of which 23 utterances were spoken in Assamese, 22 in Bengali, 22 in English, 23 in Hindi and 24 in Nepali. Each test utterance was of approximately 1 minute duration. The performance of the SRC based LID system, with and without channel compensation, was found to be significantly better than that of the CDS based system.
Martin et al. [11] used a syllable-length segmental framework to analyze how individual information sources contribute to overall language identification performance. The syllabic framework was obtained via a multilingual phone recognition system that used broad phonetic classes. Features derived to represent acoustic, prosodic and phonotactic information were then used to produce three separate models. The first series of experiments was based on modeling acoustic features. When a baseline GMM system was compared with a GMM system that modeled the recognized segments, both systems achieved comparable levels of performance. A second series of experiments examined whether complementary phonotactic information could be extracted by using n-gram statistics over both short and extended segmental lengths. It was found that unigram statistics for the phone triplets provided significant improvements when used to complement existing Parallel Phone Recognition and Language Modeling (PPRLM) systems. Finally, a small set of fusion experiments was conducted to assess the degree of complementary information contained within the acoustic, phonotactic and prosodic systems. Martin et al. [11] combined the best performing acoustic system, based on Hidden Markov Models (HMMs), with the pitch system and observed that this provided a minor improvement. The baseline prosodic system and the one built under the syllabic framework achieved comparable levels of performance, with results indicating that prosodic information can be used to obtain marginal improvements when combined with acoustic and phonotactic systems. However, the combination of the HMM acoustic, prosodic and phone-triplet unigram systems achieved levels of performance similar to the PPRLM system and, most importantly, the fusion of all systems resulted in an absolute improvement. In this study, two speech corpora were used, namely, the OGI-MLTS corpus and the Linguistic Data Consortium CallFriend (LDC CallFriend) database. From the OGI-MLTS corpus six languages were considered, namely, English, Hindi, Spanish, Mandarin, German and Japanese. The LDC CallFriend database, on the other hand, consists of twelve languages, viz., Arabic, English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil and Vietnamese, of which English, Mandarin and Spanish have a second dialect.
Zhang et al. [12] proposed an approach using Support Vector Machines (SVMs). In this approach, a large background Gaussian Mixture Model (GMM) performed the sequence mapping, which maps a variable-length sequence of feature vectors to one fixed-length vector. The performance of the proposed approach was evaluated on speech from five languages, namely, English, German, Japanese, Mandarin and Spanish. Zhang et al. [12] also used the OGI-MLTS corpus for the speech utterances of the five languages. For each language, 50 speakers were selected as the training set, with about 60 seconds of speech per speaker. The testing set was made up of the rest of the speech utterances for the five languages from the same speech corpus. The feature vectors consisted of 12 Mel-frequency Cepstral Coefficients (MFCCs). Experimental results demonstrated that the proposed system not only performed better than a GMM classifier but also outperformed a system using the Generalized Linear Discriminant Sequence (GLDS) kernel.
Noor and Aronowitz [13] showed that a speaker-specific anchor model representation could be used for language identification when combined with SVMs for classification. The proposed system projected utterances in a given language onto a speaker space using anchor modeling and then used an SVM to generalize over them. They [13] also made use of the LDC CallFriend database, which consists of twelve languages, to evaluate their language identification system. Additional utterances in Russian were introduced only in the test data. One advantage of this method was that very little labeled data was required. The only labels used for training the SVM were taken from the National Institute of Standards and Technology Language Recognition Evaluation (NIST LRE-03) development data, which consists of utterances of about 30 seconds per language. This was found to be very helpful for automatic identification of languages for which few human-labeled examples are available. Noor and Aronowitz [13] also proposed a more efficient way to calculate the speaker characterization vectors, using test utterance parameterization instead of the classic Gaussian Mixture Model with Universal Background Model (GMM-UBM).
Manwani et al. [14] described an LID system in which the extracted features were modeled using GMMs trained with the Split and Merge Expectation Maximization (SMEM) algorithm. This approach improved the global convergence of the Expectation Maximization (EM) algorithm. A maximum likelihood classifier was used for identifying a language. Manwani et al. [14] tested their method on four languages, viz., Hindi, Telugu, Gujarati and English. For Hindi, the training speech corpus consisted of recordings from 27 speakers (23 male and 4 female), while the testing speech corpus consisted of recordings from 35 speakers (31 male and 4 female). The length of sentences ranged from 2 to 5 seconds. The total duration of training samples was 440 seconds and the number of training sentences was 135, while the total number of test utterances was 105. For Telugu, the training speech corpus consisted of recordings from 24 speakers (20 male and 4 female), while the testing speech corpus consisted of recordings from 22 speakers (18 male and 4 female). The length of sentences ranged from 3 to 9 seconds. The total duration of training samples was 440 seconds and the number of training sentences was 98, while the total number of test utterances was 62. The Gujarati training speech corpus consisted of recordings from 22 speakers (18 male and 4 female), while the testing speech corpus consisted of 22 speakers (18 male and 4 female). The length of sentences ranged from 2 to 10 seconds. The total duration of training samples was 472 seconds and the number of training sentences was 132, while the total number of test utterances was 88. For English, the training speech corpus consisted of recordings from 25 speakers (14 male and 11 female), while the testing speech corpus consisted of 28 speakers (14 male and 14 female). The length of sentences ranged from 2 to 10 seconds. The total duration of training samples was 420 seconds and the number of training sentences was 138, while the total number of test utterances was 91. The features extracted from the speech utterances were the MFCCs and their delta and delta-delta coefficients. These features were then modeled as GMMs, and the SMEM algorithm was used to obtain the model parameters. It was observed that the use of SMEM overcame the difficulty of local maxima that arises with the EM algorithm. In short, Manwani et al. [14] showed that the accuracy of the system could be improved by using the split and merge EM algorithm.
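The maximum likelihood decision used with per-language GMMs can be sketched as below. This is a sketch under stated assumptions: the model parameters are made-up two-dimensional toys rather than trained MFCC models, parameter estimation (whether by EM or SMEM) is omitted entirely, diagonal covariances are assumed, and all names are hypothetical.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def gmm_log_likelihood(frames, gmm):
    """Total log-likelihood of the frames under a GMM given as a list of
    (weight, mean, variance) components; frames are treated as independent."""
    return sum(
        log_sum_exp([math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm])
        for x in frames
    )

def ml_classify(frames, models):
    """Pick the language whose GMM gives the highest likelihood."""
    return max(models, key=lambda lang: gmm_log_likelihood(frames, models[lang]))

# Hypothetical toy models: two components per language, unit variances.
MODELS = {
    "Hindi": [(0.5, (0.0, 0.0), (1.0, 1.0)), (0.5, (2.0, 2.0), (1.0, 1.0))],
    "Telugu": [(0.5, (6.0, 6.0), (1.0, 1.0)), (0.5, (8.0, 8.0), (1.0, 1.0))],
}

print(ml_classify([(0.2, 0.1), (1.8, 2.1)], MODELS))
```

Only the way the component parameters are estimated would change between EM and SMEM training; the decision rule at test time stays the same.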
Kumar et al. [15] proposed a language identification system that uses a hybrid robust feature extraction technique. The speech recognizers used a parametric form of the signal in order to obtain the most important distinguishing features of the speech signal for the recognition task. They used MFCC and Perceptual Linear Prediction (PLP) coefficients along with two hybrid features. Bark Frequency Cepstral Coefficients (BFCC) and Revised Perceptual Linear Prediction Coefficients (RPLP) were the two hybrid features, obtained from the combination of MFCC and PLP. Two different classifiers, namely, VQ with Dynamic Time Warping (DTW) and GMM, were used for identification. They evaluated the proposed system on three Indian languages, namely, Bengali, Hindi and Telugu. The speech corpora consisted of seven different speakers for each language, and each speaker's utterance was of one minute duration. All speakers of a given language uttered the same paragraph for one minute, recorded in a noise-free environment. Apart from the Indian languages, they also worked on seven foreign languages, namely, Dutch, English, French, German, Italian, Russian and Spanish, which were downloaded from the internet. Thus, they worked on ten languages with seven different speakers for each language, giving a total of 70 utterances. The duration of the speech utterances across all languages ranged from 35 to 70 seconds. They observed that all the feature extraction techniques they had considered worked better with GMM than with VQ and DTW, since the Gaussian mixture language model falls into the implicit segmentation approach to language identification. It also provides a probabilistic model of the underlying sounds of a person’s voice.
Reddy et al. [16] used both spectral and prosodic features to analyze the language-specific information present in speech. Spectral features extracted from frames of 20 milliseconds, individual pitch cycles and glottal closure regions were used for discriminating between the languages. In addition to spectral features, prosodic features extracted at the syllable, tri-syllable and multi-word levels were proposed for capturing language-specific information. The language-specific prosody was represented by intonation, rhythm and stress features at the syllable and tri-syllable levels, whereas temporal variations in fundamental frequency, durations of syllables and temporal variations in intensity were used to represent the prosody at the multi-word level. GMMs were used to capture the language-specific information from the proposed features. The performance of the proposed features was analyzed on the Multilingual Indian Language Speech Corpus (IITKGP-MLILSC), which consists of 27 Indian languages, namely, Arunachali, Assamese, Bengali, Bhojpuri, Chhattisgarhi, Dogri, Gojri, Gujarati, Hindi, Indian English, Kannada, Kashmiri, Konkani, Manipuri, Mizo, Malayalam, Marathi, Nagamese, Nepali, Oriya, Punjabi, Rajasthani, Sanskrit, Sindhi, Tamil, Telugu and Urdu. Every language in the database contains speech from at least ten speakers. From each speaker, about 5-10 minutes of speech was collected. On the whole, each language contains a minimum of 1 hour of speech. Reddy et al. [16] also used the OGI-MLTS database, which consists of 11 languages, namely, English, Farsi, French, German, Japanese, Korean, Mandarin Chinese, Spanish, Tamil, Vietnamese and Hindi, for analyzing the language recognition accuracy.
Roy et al. [17] discussed the comparison of VQ and GMM classification techniques based on four Indian languages, viz., Assamese, Bengali, English and Hindi. The database used consisted of speech recorded from 50 speakers. Each speaker was asked to repeat each sentence 20 times in all four languages resulting in 4000 samples for training. For testing, the 50 speakers spoke each sentence 5 times in each language resulting in a total of 1000 test samples. Each sample was of 2 to 3 seconds duration. MFCC was used as the feature extraction technique and VQ was found to work better than GMM. The VQ model was more efficient for all the four languages at higher codebook sizes.
Sengupta and Saha [18] worked on identification of the major language families of India. They [18] extended the language identification framework to capture features common to language families and developed models that could efficiently represent the language families. Sengupta and Saha [18] used MFCC and Speech Signal based Frequency Cepstral Coefficient (SFCC) as the primary feature extraction tools, which were combined with Shifted Delta Coefficient (SDC) to obtain the final set of features. GMM and SVM were used as the modeling tools. Different combinations of the feature extraction methods and the modeling tools were used to develop four main systems, viz., MFCC+SDC+GMM, MFCC+SDC+SVM, SFCC+SDC+GMM and SFCC+SDC+SVM. These systems were applied to identify the two major language families of India, viz., Indo-Aryan and Dravidian, which in total consisted of 22 languages. The Indo-Aryan family consisted of 18 languages, viz., Assamese, Bengali, Bhojpuri, Chhattisgarhi, Dogri, English, Gujarati, Hindi, Kashmiri, Konkani, Manipuri, Marathi, Nagamese, Odia, Punjabi, Sanskrit, Sindhi and Urdu, while the Dravidian family consisted of 4 languages, viz., Kannada, Malayalam, Tamil and Telugu. The speech corpora were prepared from the All India Radio website since the quality of speech was good and the speech was of sufficiently long duration. A large number of speakers, both male and female, were also available for every language. It was observed that all four systems could identify the language families with high accuracy. The influence of one language family on the other was also evaluated, and in most cases the neighboring languages were found to be influenced more by the other family.
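For reference, the shifted delta computation that such systems stack on top of the cepstral features can be sketched as follows. The sketch follows the commonly used d-P-k formulation, in which k delta blocks spaced P frames apart are concatenated and each delta is taken over a spread of d frames. The parameter values, the clamping of indices at the utterance edges and the function name `shifted_delta` are assumptions for illustration and are not taken from [18].

```python
def shifted_delta(cepstra, d=1, p=3, k=7):
    """Stack k delta blocks per frame.

    Block i of frame t is c(t + i*p + d) - c(t + i*p - d), where c(.) is
    the cepstral vector of a frame. Indices outside the utterance are
    clamped to the first or last frame (an assumption of this sketch).
    """
    n = len(cepstra)

    def frame(i):
        return cepstra[min(max(i, 0), n - 1)]

    stacked_frames = []
    for t in range(n):
        stacked = []
        for i in range(k):
            ahead = frame(t + i * p + d)
            behind = frame(t + i * p - d)
            stacked.extend(a - b for a, b in zip(ahead, behind))
        stacked_frames.append(stacked)
    return stacked_frames

# Toy input: 30 frames of 2 "cepstral" values that grow linearly in time.
toy_cepstra = [[float(t), 2.0 * t] for t in range(30)]
sdc = shifted_delta(toy_cepstra, d=1, p=3, k=2)
print(len(sdc), len(sdc[0]), sdc[5])
```

Each output frame has k times the input dimension, so the stacked vector summarizes how the cepstra evolve over a window of roughly k*P frames.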

Muthusamy et al. [19] adopted a segment-based approach, which is based on the idea that the acoustic structure of languages can be estimated by segmenting speech into broad phonetic categories. They [19] observed that LID could be achieved by computing features that describe the phonetic and prosodic characteristics of a language and by using these feature measurements to train a classifier to distinguish between languages. A multi-language, neural network-based segmentation and broad classification algorithm was first developed using seven broad phonetic categories. This algorithm was trained and tested on separate sets of speakers of American English, Japanese, Mandarin Chinese and Tamil. The speech corpora consisted of natural continuous speech from twelve native speakers for each of the four languages, of whom six were male and six were female. The recording was done in the laboratory; the age of the female speakers ranged from 15 to 70 years and that of the male speakers from 18 to 71 years. Each speaker was told to speak 15 conversational sentences on any topic of personal choice, to ask two questions, and to recite the days of the week, the months of the year and the numbers 0 to 10, for a total of 20 utterances in his/her native language. This system gave an accuracy of 82.3% on the utterances of the test set.

Muthusamy et al. [20] continued this work by developing a four-language LID system based on the same four languages they had used previously [19]. The system used a neural network-based segmentation algorithm to segment speech into seven broad phonetic categories. Phonetic and prosodic features computed on these categories were then input to a second network that performed the language classification. The system was trained and tested on separate sets of speakers of American English, Japanese, Mandarin Chinese and Tamil. The training set contained 12 speakers from each language with 10 or 20 utterances per speaker, for a total of 440 utterances. The development test set contained a different group of 2 speakers per language with 20 utterances from each speaker, for a total of 160 utterances. The final test set had 6 speakers per language with 10 or 20 utterances per speaker, for a total of 440 utterances. The average duration of the utterances in the training set was 5.1 seconds and that of the test sets was 5.5 seconds. Approximately 15% of the utterances in the training and testing sets consisted of a fixed vocabulary of the days of the week, the months of the year and the numbers zero through ten. Their results indicated that the system performed better on longer utterances. Furthermore, their system gave an accuracy of 89.5% without using any spectral information in the classifier feature set.
Montavon [21], on the other hand, used two datasets, viz., VoxForge and RadioStream, each consisting of English, French and German speech, to evaluate his system. The VoxForge dataset contains 5-second speech samples associated with different metadata, including the language of the sample. Since the speech samples were recorded by speakers using their own microphones, quality varies significantly between samples. This dataset contains 25420 English samples, 4021 French samples and 2963 German samples. The RadioStream dataset consists of samples ripped from several web radios. It has the advantage of containing a virtually unlimited number of samples of excellent quality. Montavon [21] suggested a deep architecture that learnt features automatically for the language identification task. The classifier was trained and evaluated on balanced classes, i.e., 33% English samples, 33% French samples and 33% German samples. Each sample corresponded to a speech signal of 5 seconds. The proposed classifier mapped spectrograms to languages and was implemented as a Time-delay Neural Network (TDNN) with two-dimensional convolutional layers as feature extractors. The TDNN implementation performs a simple summation over the outputs of the convolutional layers. The results showed that the deep architecture could identify the three languages with 83.5% accuracy on 5-second speech samples from RadioStream and with 80.1% accuracy on 5-second speech samples from VoxForge. The deep architecture was also compared with a shallow architecture, and it was observed that the deep architecture was 5-10% more accurate.
Jothilakshmi et al. [22] also used the language family to which a language belongs as a distinguishing factor and proposed a hierarchical language identification system for Indian languages. Since nearly 98% of the people in India speak languages from the Indo-Aryan and Dravidian families, the system they proposed was designed to identify the languages of these two families. In the first level of the proposed system, the family of the spoken language was identified, and this information was then given as input to the second level in order to identify the particular language within the corresponding family. A database consisting of nine languages was prepared. The Dravidian family consisted of Tamil, Telugu, Kannada and Malayalam, while the Indo-Aryan family consisted of Hindi, Bengali, Marathi, Gujarati and Punjabi. The database consisted of a total of nine hours of broadcast data from the Doordarshan television network. The performance of the system was analyzed for various acoustic features and different classifiers. The proposed system was modeled using HMM, GMM and Artificial Neural Network (ANN). Jothilakshmi et al. [22] also studied the discriminative power of the system for three feature sets, namely, MFCC, MFCC with delta and acceleration coefficients, and Shifted Delta Cepstral (SDC) coefficients. The GMM based LID system using MFCC with delta and acceleration coefficients was found to perform well, with an accuracy of 80.56%. The performance of the GMM based LID system with SDC was also considerable.
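The two-level decision described above can be sketched as follows. This is only a control-flow illustration: the family and language groupings come from the text, but the scorers, the two-dimensional prototype vectors and every name in the code are hypothetical stand-ins for the HMM, GMM or ANN models used in [22].

```python
FAMILIES = {
    "Dravidian": ["Tamil", "Telugu", "Kannada", "Malayalam"],
    "Indo-Aryan": ["Hindi", "Bengali", "Marathi", "Gujarati", "Punjabi"],
}

def hierarchical_identify(features, family_scorer, language_scorer):
    """Level 1 picks the family; level 2 scores only that family's languages."""
    family = max(FAMILIES, key=lambda fam: family_scorer(fam, features))
    language = max(FAMILIES[family], key=lambda lang: language_scorer(lang, features))
    return family, language

# Hypothetical prototypes standing in for trained models.
FAMILY_PROTO = {"Dravidian": (0.0, 0.0), "Indo-Aryan": (10.0, 10.0)}
LANG_PROTO = {
    "Tamil": (0.0, 1.0), "Telugu": (1.0, 0.0),
    "Kannada": (-1.0, 0.0), "Malayalam": (0.0, -1.0),
    "Hindi": (10.0, 11.0), "Bengali": (11.0, 10.0), "Marathi": (9.0, 10.0),
    "Gujarati": (10.0, 9.0), "Punjabi": (11.0, 11.0),
}

def neg_sq_dist(proto, x):
    """Higher is better: negative squared distance to a prototype."""
    return -sum((a - b) ** 2 for a, b in zip(proto, x))

def family_score(fam, x):
    return neg_sq_dist(FAMILY_PROTO[fam], x)

def language_score(lang, x):
    return neg_sq_dist(LANG_PROTO[lang], x)

print(hierarchical_identify((0.9, 0.1), family_score, language_score))
```

A consequence of this design is that the second level compares at most five languages instead of nine, but an error at the family level cannot be recovered at the language level.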
Lopez-Moreno et al. [23] experimented with the use of Deep Neural Networks (DNNs) for the problem of language identification. They compared the proposed system with an i-vector based acoustic system and extracted 39 Perceptual Linear Prediction (PLP) features for both the DNN based and the i-vector based systems. Two different datasets were used in this experiment, viz., the Google 5M LID corpus (the Google Language Identification corpus with 5 million utterances) and NIST LRE’09. The Google 5M LID corpus consists of twenty-five languages and nine dialects. From the LRE’09 dataset, Lopez-Moreno et al. [23] selected eight representative languages, viz., US English, Spanish, Dari, French, Pashto, Russian, Urdu and Chinese Mandarin, for which at least 200 hours of audio were available. For the Google 5M LID corpus, 87.5 hours of speech per language were used, resulting in a total of 2975 hours of speech. On the Google 5M LID corpus, the i-vector system gave similar performance for the discriminative backend, Logistic Regression (LR), and for the generative ones, Linear Discriminant Analysis (LDA_CD) and the one based on a single Gaussian with a covariance matrix shared across the languages (1G_SC). Lopez-Moreno et al. [23] also observed that increasing to two Gaussians and allowing individual covariance matrices gave a relative improvement of 19%. However, the best performance was achieved by the DNN system, especially when the proposed DNN architecture with eight hidden layers was used. Similar results were observed when the developed systems were used on the LRE’09 dataset.
Lopez-Moreno et al. [24] further worked on DNN based automatic language identification systems and went on to propose two more systems. In the first one, the DNN acted as an end-to-end LID classifier, receiving the speech features as input and providing the estimated probabilities of the target languages as output. In the second approach, the DNN was used to extract bottleneck features that were then used as inputs for a state-of-the-art i-vector system. They evaluated their language identification system on the NIST LRE 2009 dataset, consisting of 23 languages with significantly different amounts of available data for each target language, and on a subset of the Voice of America (VOA) data from LRE’09, consisting of eight languages with an equal quantity of data for each target language. Results for both datasets showed that the DNN based systems significantly outperformed a state-of-the-art i-vector system when dealing with short duration utterances. Furthermore, the combination of the DNN based and the classical i-vector system led to additional performance improvements.
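The end-to-end use of a DNN described above can be sketched as follows. This is only an illustrative sketch: the ReLU hidden layers, the averaging of frame-level log posteriors into an utterance-level score, and all function names are assumptions for the example and are not details given in this survey.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis of a 2-D array."""
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def frame_log_posteriors(frames, weights, biases):
    """Feed-forward pass: ReLU hidden layers (assumed), softmax output over languages."""
    h = np.asarray(frames, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, h @ W + b)
    return log_softmax(h @ weights[-1] + biases[-1])

def utterance_scores(frames, weights, biases):
    """Average the frame-level log posteriors to obtain one score per language (assumed rule)."""
    return frame_log_posteriors(frames, weights, biases).mean(axis=0)

def identify(frames, weights, biases, languages):
    """Return the language with the highest utterance-level score."""
    return languages[int(np.argmax(utterance_scores(frames, weights, biases)))]
```

Here `frames` would be a (frames, features) array such as the 39-dimensional PLP vectors mentioned for [23], and `weights` and `biases` are hypothetical trained parameters.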
Gazeau and Varol [25] worked on language identification of four languages, viz. French, English, Spanish and German. For the speech corpus, they chose speech samples from Shtooka, VoxForge and YouTube. For testing, apart from the speech samples from the mentioned sources, they also used personally recorded voices. They used neural network architectures, SVM and HMM for identification and concluded that HMM yields the best results, with an accuracy of about 70%.
Dialect identification is as important as language identification because people usually communicate in dialects. Zissman et al. [26] showed that the Phone Recognition and Language Modeling (PRLM) approach yields good results in classifying the Cuban and Peruvian dialects of Spanish. They introduced the Miami corpus, a new Spanish speech corpus designed specifically for dialect identification. A variety of Spanish speech was collected: spontaneous speech, in which speakers answered Spanish questions designed to elicit long stretches of uninterrupted Spanish speech; read paragraphs, in which each speaker read a phonetically balanced paragraph, a variable paragraph from a textbook about Spanish culture and a variable paragraph from a newspaper; fill in the blank sentences, which were very simple questions designed to elicit predictable text responses; read sentences that were rich in words useful for analyzing dialects; and digits from 0 to 10 spoken twice in random order. After the Spanish speech was collected, the same type of speech was elicited in English. Altogether, each speaker in the corpus spoke for about 20 to 30 minutes. The dialect identification experiments were conducted using speech from 143 Cuban and Peruvian speakers. During training, three minutes of spontaneous Spanish speech from each speaker in the Cuban training set were processed by the English phone recognizer and the Cuban language model statistics were computed. This step was repeated for the Peruvian speakers, from which a Peruvian model was created. After the two language models were developed, test-speaker spontaneous speech was processed by the same PRLM system and a dialect identification decision was produced. The test utterances were also three minutes long.
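The language modeling half of the PRLM procedure described above can be sketched as follows. The phone recognizer itself is outside the scope of the sketch, so the phone sequences are assumed to be already decoded; the bigram order, the add-one smoothing and all names are assumptions for illustration and are not details reported in [26].

```python
import math
from collections import Counter

def train_bigram_model(phone_sequences, phone_set):
    """Collect bigram statistics over decoded phone sequences of one dialect."""
    bigrams = Counter()
    contexts = Counter()
    for seq in phone_sequences:
        padded = ["<s>"] + list(seq)
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return {"bigrams": bigrams, "contexts": contexts, "vocab": len(phone_set)}

def log_likelihood(model, seq):
    """Add-one smoothed bigram log-likelihood of a phone sequence (smoothing is assumed)."""
    padded = ["<s>"] + list(seq)
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        num = model["bigrams"][(prev, cur)] + 1
        den = model["contexts"][prev] + model["vocab"]
        total += math.log(num / den)
    return total

def identify_dialect(models, seq):
    """Pick the dialect whose phone language model scores the sequence highest."""
    return max(models, key=lambda name: log_likelihood(models[name], seq))
```

In use, one model would be trained per dialect on the recognizer output for that dialect's training speakers, and `identify_dialect` would be called on the decoded phone sequence of a test utterance.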
Torres-Carrasquillo et al. [27] focused on applying some of the techniques developed for language identification to the area of dialect identification. They employed GMM with shifted delta cepstral features (GMM-SDC) for dialect identification. They used the CALLFRIEND corpus and the Miami corpus for their data. The CALLFRIEND corpus consists of twelve languages, including two dialects for three of the languages, recorded over domestic telephone lines. They used the dialects available for English, Mandarin and Spanish. Each of these languages included two dialects, namely, North and South for English, Mandarin and Taiwanese for Chinese, and Caribbean and Non-Caribbean for Spanish. The training set includes 20 conversations for each dialect in the three languages, resulting in 40 conversations per language, and each conversation is about 30 minutes long. The test set consists of 80 utterances per dialect, except for English where an additional group of about 320 utterances is included, and each of these utterances is about 30 seconds long. The Miami corpus consists of two dialects of the Spanish language, namely, Cuban Spanish and Peruvian Spanish. They observed that the performance obtained by the GMM based system on the Miami corpus was lower than that obtained by Zissman et al. [26] on the same corpus. However, the system provided very good performance for two of the dialects in the CALLFRIEND corpus. They also noted that the technique was ported from language identification without any specialization for dialect identification and that the results obtained were promising.
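The shifted delta cepstral features used in the GMM-SDC approach above can be sketched as follows. The N-d-P-k parameterization shown (7-1-3-7) is a commonly used configuration and is an assumption here, since the survey does not state the values used in [27]; the edge padding and the function name are likewise illustrative.

```python
import numpy as np

def sdc(cepstra, n_coeffs=7, d=1, p=3, k=7):
    """Shifted delta cepstra: stack k delta blocks, each shifted by p frames.

    For frame t, block i is c[t + i*p + d] - c[t + i*p - d], computed on the
    first n_coeffs cepstral coefficients. Parameter values are assumptions.
    """
    cepstra = np.asarray(cepstra, dtype=float)[:, :n_coeffs]
    num_frames = cepstra.shape[0]
    # Pad with edge frames so that t - d and t + (k - 1) * p + d are always valid.
    pad_after = (k - 1) * p + d
    padded = np.pad(cepstra, ((d, pad_after), (0, 0)), mode="edge")
    blocks = []
    for i in range(k):
        start = d + i * p  # index of frame 0 + i*p inside the padded array
        ahead = padded[start + d : start + d + num_frames]
        behind = padded[start - d : start - d + num_frames]
        blocks.append(ahead - behind)
    return np.hstack(blocks)
```

With these values each frame is represented by a 49-dimensional vector that spans roughly twenty frames of context, which is what lets a frame-level GMM capture longer temporal patterns.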
Torres-Carrasquillo et al. [28] continued their work on dialect identification and used three GMM based systems for identifying American vs. Indian English, four Chinese dialects and three Arabic dialects. Two of these tasks, i.e. the Chinese dialects and the English dialects, come from the NIST Language Recognition Evaluation (LRE) 2007 campaign, and the third dialect discrimination task comes from the LDC Arabic corpus. The Chinese dialects included Cantonese, Mandarin, MinNan and Wu, while the Arabic dialects included Gulf, Iraqi and Levantine. They developed three systems, viz. a baseline GMM-UBM, a GMM-UBM with feature compensation using Eigen-channel compensation, and a GMM-UBM with maximum mutual information (MMI) training along with feature compensation using Eigen-channel compensation. They observed that all three systems showed similar behavior, with the third system showing the best results.
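A GMM-UBM system is commonly built by adapting the means of a universal background model towards the data of each class. The sketch below shows relevance-MAP adaptation of the means for a diagonal-covariance GMM; it is an illustration of that general technique, not the configuration of [28], and the relevance factor of 16 and all names are assumptions.

```python
import numpy as np

def responsibilities(x, weights, means, variances):
    """Posterior probability of each diagonal-covariance component for each frame."""
    diff = x[:, None, :] - means[None, :, :]
    log_gauss = -0.5 * (np.log(2 * np.pi * variances).sum(axis=1)[None, :]
                        + (diff ** 2 / variances[None, :, :]).sum(axis=2))
    log_post = np.log(weights)[None, :] + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def map_adapt_means(x, weights, means, variances, relevance=16.0):
    """Relevance-MAP adaptation of the UBM means towards the data x."""
    gamma = responsibilities(x, weights, means, variances)
    n = gamma.sum(axis=0)                              # soft count per component
    first_order = gamma.T @ x                          # (components, dims)
    data_means = first_order / np.maximum(n, 1e-10)[:, None]
    alpha = (n / (n + relevance))[:, None]             # data-dependent mixing weight
    return alpha * data_means + (1.0 - alpha) * means
```

Components that see little adaptation data keep their UBM means, while well-observed components move towards the dialect data.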
Ma et al. [29] worked on dialect identification of three Chinese dialects, namely, Mandarin, Cantonese and Shanghainese, using GMMs. Their corpora had about 10 hours of speech data in each of the three dialects as training data. In order to have a clear picture of the relationship between the amount of training data and identification accuracy, the GMM models were trained with different amounts of training data, viz. 1, 2, 4, 6, 8 and 10 hours. For the evaluation, 1000 speech segments for each of the four durations of 3, 6, 9 and 15 seconds were used as test data. They discussed that since Chinese is a tone rich language with multiple intonations, the intonations carry important information for people to understand spoken Chinese. Different Chinese dialects have different numbers and different patterns of intonations, so they deduced that better performance on Chinese dialect identification could be achieved by making good use of this discriminative information. Instead of calculating fundamental frequency (F0) features explicitly, they extracted frame-based multidimensional tone relevant features based on the pitch flux in the continuous speech signal. Covariance coefficients between the autocorrelations of two adjacent frames were estimated to serve as such features. These pitch flux features were applied as a separate feature stream to provide additional discriminative information on top of the MFCC feature stream. Each of the two streams was modeled by a GMM. They observed that by fusing the pitch flux feature stream with the MFCC stream, the error rate was reduced by more than 30% compared to when only the MFCC feature stream was used, even when the test speech segments were as short as 3 seconds.
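One possible reading of the pitch flux features described above is sketched below: the autocorrelation of each frame is computed, and covariance coefficients are then taken between the autocorrelations of adjacent frames at several relative lag shifts. The frame length, hop, lag range, shift range and all names are assumptions for illustration; the exact formulation in [29] may differ.

```python
import numpy as np

def frame_autocorrelation(frame, max_lag):
    """Normalized autocorrelation of one frame for lags 0..max_lag."""
    frame = frame - frame.mean()
    full = np.correlate(frame, frame, mode="full")
    mid = len(frame) - 1
    acf = full[mid : mid + max_lag + 1]
    return acf / acf[0] if acf[0] > 0 else np.zeros(max_lag + 1)

def pitch_flux_features(signal, frame_len=400, hop=160, max_lag=200, max_shift=3):
    """Covariance between autocorrelations of adjacent frames at several lag shifts."""
    signal = np.asarray(signal, dtype=float)
    starts = range(0, len(signal) - frame_len + 1, hop)
    acfs = [frame_autocorrelation(signal[s : s + frame_len], max_lag) for s in starts]
    feats = []
    for prev, cur in zip(acfs, acfs[1:]):
        row = []
        for shift in range(-max_shift, max_shift + 1):
            # Align the two lag axes with a relative shift, then take the covariance.
            if shift >= 0:
                a, b = prev[: len(prev) - shift], cur[shift:]
            else:
                a, b = prev[-shift:], cur[: len(cur) + shift]
            row.append(np.cov(a, b)[0, 1])
        feats.append(row)
    return np.array(feats)
```

Each pair of adjacent frames yields one feature vector with `2 * max_shift + 1` coefficients, which could be modeled by its own GMM alongside the MFCC stream.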
Shen et al. [30] described a dialect recognition system that applied adapted phonetic models per dialect in a PRLM framework to distinguish between American and Indian English and between two Mandarin dialects (Mainland and Taiwanese). They trained systems for each language using data from the CALLFRIEND corpus, the Language Recognition Evaluation 2005 (LRE’05) test set, OGI’s foreign accented English data, and LDC’s MIXER and FISHER corpora. In total, 104 and 20.14 hours of data were used to adapt and train PRLM and adapted phonetic models for English and Mandarin, respectively. In each task, the performance of the adapted phonetic model system was compared with a baseline GMM model. They observed that the adapted phonetic model system was capable of good performance for the dialect recognition problem without phonetically word transcribed data. Furthermore, combining this system with PRLM outperformed combinations of PRLM with GMM based models, and combining all three systems improved performance further.
Alorfi [31] used an ergodic Hidden Markov Model (HMM) to identify two Arabic dialects, namely, Gulf and Egyptian Arabic. Apart from using the CALLHOME Egyptian Arabic Speech from the LDC database, he created an additional database for his work by recording TV soap operas containing both Egyptian and Gulf dialects. However, these recordings often contained background noise such as echoes, coughs, laughter and background music, so the overall condition of this database was poor compared to other standard speech databases. Furthermore, the additional database consisted of recordings from male speakers only. For the Egyptian dialect, he used a combination of twenty male speakers from the CALLHOME database and twenty male speakers from the TV recordings database. The speech of ten speakers from each database was used for training and the speech of the other ten was used for testing. The training speech from each speaker was one minute long. The speech used for the Gulf dialect was solely from the TV recordings database. The speech from 10 male speakers was used for training while a different set of 10 speakers was used for testing. He utilized many different combinations of MFCC related speech features, such as time derivatives, energy and the shifted delta cepstra, in training and testing the system. Because Arabic dialects have both similarities and differences, he developed an ergodic HMM with two states: one represented the sounds common across Arabic dialects while the other represented the sounds unique to the specific dialect. The best result of the Arabic dialect identification system was 96.67% correct identification.
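The idea of a two-state ergodic HMM per dialect, with one state for common sounds and one for dialect-specific sounds, can be sketched with the scaled forward algorithm. For brevity the sketch uses discrete observation symbols, which is a simplifying assumption; a system working on MFCC features would use continuous emission densities instead. All names and parameter values are hypothetical.

```python
import numpy as np

def forward_log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm for a discrete-emission HMM.

    obs: sequence of symbol indices; start: (states,) initial probabilities;
    trans: (states, states) transition matrix; emit: (states, symbols) emissions.
    """
    alpha = start * emit[:, obs[0]]
    log_lik = 0.0
    for symbol in obs[1:]:
        scale = alpha.sum()
        log_lik += np.log(scale)
        alpha = (alpha / scale) @ trans * emit[:, symbol]
    return log_lik + np.log(alpha.sum())

def identify_dialect(obs, models):
    """Choose the dialect whose HMM assigns the highest likelihood to obs."""
    return max(models, key=lambda name: forward_log_likelihood(obs, *models[name]))

# Hypothetical two-state ergodic models: state 0 = common sounds, state 1 = unique sounds.
# Symbol 0 is common, symbol 1 is Gulf-typical, symbol 2 is Egyptian-typical.
START = np.array([0.5, 0.5])
TRANS = np.array([[0.5, 0.5], [0.5, 0.5]])  # ergodic: every state reaches every state
MODELS = {
    "gulf": (START, TRANS, np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]])),
    "egyptian": (START, TRANS, np.array([[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]])),
}
```

Both models share the emission distribution of the common state and differ only in the second state, mirroring the description above.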
Biadsy et al. [32] used the PPRLM framework with nine phone recognizers to distinguish between five Arabic dialects, namely, Gulf Arabic, Iraqi Arabic, Levantine Arabic, Egyptian Arabic and Modern Standard Arabic (MSA). They were able to obtain corpora from the Linguistic Data Consortium (LDC) with similar recording conditions for the first four Arabic dialects. These are corpora of spontaneous telephone conversations produced by native speakers of the dialects, speaking with family members, friends and unrelated individuals, sometimes about predetermined topics. From the Gulf Arabic conversational telephone speech database, they used the speech files of about 965 speakers with a total of about 41.02 hours of speech, out of which 150 speakers (about 6.06 hours of speech) were set apart for testing. For the Iraqi dialect, they used the Iraqi Arabic conversational telephone speech database, selecting 475 speakers with a total duration of about 25.73 hours of speech, out of which 150 speakers (about 7.33 hours of speech) were kept aside for testing. Their Levantine data consisted of 1258 speakers from the Arabic CTS Levantine Fisher Training Data Set with a total duration of 78.79 hours of speech. Here too they kept aside 150 speakers (about 10 hours of speech) for testing. For their Egyptian data, they used CallHome Egyptian and its Supplement. They used 398 speakers with a total duration of 75.7 hours of speech, keeping aside 150 speakers (28.7 hours of speech) for testing. For MSA, they used TDT4 Arabic broadcast news since no database similar to those of the other dialects was available. They used about 47.6 hours of speech. For testing, they again kept aside 150 speakers, resulting in about 12.06 hours of speech. They observed that their system achieved good accuracy and that the most distinguishable dialect among the five variants was MSA.
Salameh et al. [33] also performed Arabic dialect identification, on a large-scale collection of parallel sentences that covered the dialects of 25 Arab cities in addition to English, French and MSA. They worked on two corpora, viz. Corpus-26 and Corpus-6. Corpus-26 consists of 2000 sentences translated into the dialects of 25 cities, while Corpus-6 has 10,000 additional sentences translated into the dialects of five cities, namely, Beirut, Cairo, Doha, Tunis and Rabat. They presented results on a fine-grained classification task, and the system they developed could identify the exact city of the speaker with an accuracy of 67.9% for sentences with an average length of seven words and with an accuracy of 90% for sentences with 16 words.
Bougrine et al. [34] worked on Spoken Algerian Arabic Dialect Identification (SAADID) and proposed a new system based on prosodic speech information, viz. intonation and rhythm. They used SVM as the modeling technique and worked on identification of six dialects spoken in the departments of Algiers, Adrar, Bousaâda, Djelfa, Laghouat and Oran. The speech corpus they developed consisted of speech from their own recording database (OR) and speech extracted from reports selected from regional radio and TV stations (RTV). The OR database consists of 1.5 hours of recordings from 34 speakers, each of whom recorded 57 sentences. The RTV database consists of 10 sentences of MSA and 47 sentences of dialect speech, wherein the dialect speech consists of free responses, free translation of phrases, a short text story and a semi-guided narration obtained from images without text. Their results showed an accuracy of 69% when test utterances of 2 seconds were used.
Rao and Koolagudi [35] used both spectral and prosodic features to identify five Hindi dialects, viz., Chattisgharhi (spoken in Central India), Bengali (Bengali accented Hindi spoken in Eastern India), Marathi (Marathi accented Hindi spoken in Western India), General (Hindi spoken in Northern India) and Telugu (Telugu accented Hindi spoken in Southern India). For each dialect, their database consisted of data from 10 speakers, 5 male and 5 female, each speaking spontaneously for about 5-10 minutes, resulting in a total of 1-1.5 hours of speech. The spectral features were represented by MFCCs while the prosodic features were represented by durations of syllables, pitch and energy contours. They used Auto-associative Neural Network (AANN) models along with SVM as the modeling technique. Their dialect identification system achieved a recognition performance of 81%.
From the survey discussed in this paper, it can be observed that while extensive work has been done on language identification, the same cannot be said of dialect identification. One of the reasons for this can be the lack of databases, which is also evident from the above discussion. While popular speech databases such as OGI-MLTS, LDC CallFriend, NIST LRE, IITKGP-MLILSC, VoxForge, RadioStream, Google 5M LID etc. are available for research in language identification, the same is not the case in dialect identification. Most researchers have had to develop their own speech corpus since very few speech databases are available for dialects.
Language and dialect identification systems also help in preserving a language/dialect. The number of languages spoken in the world is estimated to be between six and seven thousand [36]. However, as can be observed from the survey discussed in this paper, research in the field of language and dialect identification is restricted to a few languages/dialects. We need to diversify research in this field in order to include language and dialect identification systems for varied languages and dialects.
[1] H. Li, B. Ma and K. A. Lee, “Spoken Language Recognition: From Fundamentals to Practice”, Proceedings of the IEEE, vol. 101, pp. 1136-1159, 2013.
[2] Y. K. Muthusamy, N. Jain and R. A. Cole, “Perceptual Benchmarks for Automatic Language Identification”, in Proc. of the IEEE International Conference on Acoustic, Speech, and Signal Processing, Adelaide, 1994, pp. 333-336.
[3] H. Behravan, “Dialect and Accent Recognition”, Master’s Thesis, University of Eastern Finland, 2012.
[4] F. Biadsy, “Automatic Dialect and Accent Recognition and its Application to Speech Recognition”, PhD Thesis, Columbia University, 2011.
[5] A. Etman and A. A. L. Beex, “Language and Dialect Identification: A Survey”, in Proc. of the 2015 SAI Intelligent Systems Conference, London, 2015, pp. 220-231.
[6] M. Sugiyama, “Automatic Language Recognition Using Acoustic Features”, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing, Toronto, 1991, pp. 813-816.
[7] J. Balleda, H. A. Murthy and T. Nagarajan, “Language Identification from Short Segments of Speech”, in Proc. of the 6th International Conference on Spoken Language Processing, Beijing, 2000, pp. 1033-1036.
[8] T. Nagarajan and H. A. Murthy, “Language Identification Using Parallel Syllable Like Unit Recognition”, in Proc. of International Conference on Acoustics, Speech and Signal Processing, Canada, 2004, pp. 401-404.
[9] M. Padro and L. Padro, “Comparing Methods for Language Identification”, Procesamiento del lenguaje natural, vol. 33, pp. 155-161, 2004.
[10] O. P. Singh, B. C. Harris, R. Sinha, B. Chetri and A. Pradhan, “Sparse Representation Based Language Identification Using Prosodic Features for Indian Languages”, in Proc. of the 2013 Annual IEEE India Conference, Mumbai, 2013, pp. 1-5.
[11] T. Martin, B. Baker, E. Wong and S. Sridharan, “A Syllable-Scale Framework for Language Identification”, Computer Speech & Language, vol. 20, pp. 276-302, 2006.
[12] W. Zhang, B. Li, D. Qu and B. Wang, “Automatic Language Identification Using Support Vector Machines”, in Proc. of 8th International Conference on Signal Processing, Beijing, 2006.
[13] E. Noor and H. Aronowitz, “Efficient Language Identification Using Anchor Models and Support Vector Machines”, in Proc. of Speaker and Language Recognition Workshop, Puerto Rico, 2006, pp. 1-6.
[14] N. Manwani, S. K. Mitra and M. V. Joshi, “Spoken Language Identification for Indian Languages Using Split and Merge EM Algorithm”, in Proc. of the 2nd International Conference on Pattern Recognition and Machine Intelligence, Kolkata, 2007, pp. 463-468.
[15] P. Kumar, A. Biswas, A. N. Mishra and M. Chandra, “Spoken language identification using hybrid feature extraction methods”, Journal of Telecommunication, vol. 1, pp. 11-15, 2010.
[16] V. R. Reddy, S. Maity and K. S. Rao, “Identification of Indian languages using Multi-Level Spectral and Prosodic Features”, International Journal of Speech Technology, vol. 16, pp. 489-511, 2013.
[17] P. Roy and P.K. Das, “Comparison of VQ and GMM approach for identifying Indian languages”, International Journal of Applied Pattern Recognition, vol. 1, pp. 99-107, 2013.
[18] D. Sengupta and G. Saha, “Identification of the Major Language Families of India and Evaluation of Their Mutual Influence”, Current Science, vol. 110, pp. 667-681, 2016.
[19] Y. K. Muthusamy, R. A. Cole and M. Gopalakrishnan, “A Segment-Based Approach to Automatic Language Identification”, in Proc. of 1991 IEEE International Conference on Acoustics, Speech and Signal Processing, 1991, pp. 353-356.
[20] Y. K. Muthusamy and R. A. Cole, “A Segment-Based Automatic Language Identification System”, in Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippmann, Ed. San Mateo: Morgan Kaufmann Publisher, 1992, pp. 241-248.
[21] G. Montavon, “Deep Learning for Spoken Language Identification”, in Proc. of NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, Vancouver, 2009, pp. 1-4.
[22] S. Jothilakshmi, V. Ramalingam and S. Palanivel, “A Hierarchical Language Identification System for Indian Languages”, Digital Signal Processing, vol. 22, pp. 544-553, 2012.
[23] I. Lopez-Moreno, J. Gonzalez-Dominguez, O. Plchot, D. Martinez, J. Gonzalez-Rodriguez and P. Moreno, “Automatic Language Identification using Deep Neural Networks”, in Proc. of IEEE International Conference on Acoustic, Speech and Signal Processing, Italy, 2014, pp. 5337-5341.
[24] I. Lopez-Moreno, J. Gonzalez-Dominguez, D. Martinez, O. Plchot, J. Gonzalez-Rodriguez and P. J. Moreno, “On the Use of Deep Feedforward Neural Networks for Automatic Language Identification”, Computer Speech & Language, vol. 40, pp. 46-59, 2016.
[25] V. Gazeau and C. Varol, “Automatic Spoken Language Recognition with Neural Networks”, International Journal of Information Technology and Computer Science, vol. 8, pp. 11-17, 2018.
[26] M. A. Zissman, T. Gleason, D. Rekart and B. Losiewicz, “Automatic Dialect Identification of Extemporaneous Conversational, Latin American Spanish Speech”, in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing, Atlanta, 1996, pp. 777-780.
[27] P. A. Torres-Carrasquillo, T. P. Gleason and D. A. Reynolds, “Dialect Identification using Gaussian Mixture Models”, in Proc. of the Speaker and Language Recognition Workshop, Toledo, 2004, pp. 297-300.
[28] P. A. Torres-Carrasquillo, D. E. Sturim, D. A. Reynolds and A. McCree, “Eigen-Channel Compensation and Discriminatively Trained Gaussian Mixture Models for Dialect and Accent Recognition”, in Proc. of the 9th Annual Conference of the International Speech Communication Association, Brisbane, 2008, pp. 723-726.
[29] B. Ma, D. Zhu and R. Tong, “Chinese Dialect Identification using Tone Features Based on Pitch Flux”, in Proc. of International Conference on Acoustics, Speech and Signal Processing, Toulouse, 2006, pp. 1029-1032.
[30] W. Shen, N. Chen and D. Reynolds, “Dialect Recognition using Adapted Phonetic Model”, in Proc. of the 9th Annual Conference of the International Speech Communication Association, Brisbane, 2008, pp. 763-766.
[31] F. S. Alorfi, “Automatic Identification of Arabic Dialects Using Hidden Markov Models”, PhD Thesis, University of Pittsburgh, 2008.
[32] F. Biadsy, J. Hirschberg and N. Habash, “Spoken Arabic Dialect Identification using Phonotactic Modeling”, in Proc. of the EACL 2009 Workshop on Computational Approaches to Semitic Languages, Athens, 2009, pp. 53-61.
[33] M. Salameh, H. Bouamor and N. Habash, “Fine-Grained Arabic Dialect Identification”, in Proc. of the 27th International Conference on Computational Linguistics, New Mexico, 2018, pp. 1332-1334.
[34] S. Bougrine, H. Cherroun and D. Ziadi, “Prosody-based Spoken Algerian Arabic Dialect Identification”, Procedia Computer Science, vol. 128, pp. 9-17, 2018.
[35] K. S. Rao and S. G. Koolagudi, “Identification of Hindi Dialects and Emotions Using Spectral and Prosodic Features of Speech”, Systemics, Cybernetics and Informatics, vol. 9, pp. 24-33, 2011.
[36] M. P. Lewis, Ed. “Ethnologue: Languages of the World, Sixteenth Edition”, Dallas, SIL International, 2009.
