An Automated Complex Word Identification from Text: A Survey

Jaspreet Singh* , Gurvinder Singh and Rajinder Singh Virk

Department of Computer Science, Guru Nanak Dev University Amritsar, Punjab, India.

Corresponding Author Email: PROFJASPREETBATTH@GMAIL.COM

 

DOI : http://dx.doi.org/10.13005/ojcst/10.03.09

Article Publishing History
Article Received on : 27-May-17
Article Accepted on : 23-Jun-17
Article Published : 03 Jul 2017
ABSTRACT:

Complex Word Identification (CWI) is the process of locating difficult words in a given sentence. The aim of an automated CWI system is to help non-native English users understand the meaning of a target word in a sentence. CWI systems assist second language learners and dyslexic users through simplification of text. This study introduces the CWI process and investigates the performance of twenty systems submitted to SemEval-2016 for CWI. The G-score, which is the harmonic mean of accuracy and recall, is used to evaluate the performance of the systems. This paper explores the twenty CWI systems and identifies why the sv000gg system outperformed the others, with the highest G-scores of 0.773 and 0.774 for its two respective submissions.

KEYWORDS: CWI; Lexical Simplification; Textual Entailment; Text Classification

Copy the following to cite this article:

Singh J, Singh G, Virk R. S. An Automated Complex Word Identification from Text: A Survey. Orient. J. Comp. Sci. and Technol;10(3)


Copy the following to cite this URL:

Singh J, Singh G, Virk R. S. An Automated Complex Word Identification from Text: A Survey. Orient. J. Comp. Sci. and Technol;10(3). Available from: http://www.computerscijournal.org/?p=6199


Introduction 

Recognizing difficult words in text is the main task of Complex Word Identification (CWI). Once identified, the difficult words are substituted with simpler words to enhance the reader’s understanding. CWI is useful in Lexical Simplification, which, when integrated with text simplification, requires accurate identification of complex words in a sentence. CWI can assist non-native speakers by supporting summarization of stories and generation of simplified abstracts of essays. It can also assist second language learners and dyslexic users by making complex web text easier to understand. In this way CWI is beneficial for novice learners too, making their lessons more readable by replacing difficult words with commonly used words.

SemEval-2016 Task 11 provided a dataset of 9200 sentences, each 20 to 40 words long. This dataset is generated from three sources: the CW corpus (Shardlow, 2013b), the LexMTurk corpus (Horn et al., 2014) and the Simple Wikipedia corpus (Kauchak, 2013). The CW corpus contains 731 simple English sentences in which one complex word was substituted by Wikipedia editors. The second dataset, LexMTurk, is commonly used for CWI and is composed of 500 Simple English sentences, each containing one complex word. The third dataset, Simple Wikipedia, is composed of 167,689 sentences taken from Simple English Wikipedia sources. The CWI systems make use of 8700 sentences from the third dataset.

Consider a sentence as an example: “Our University follows the praxes of Guru Nanak Dev Ji”. The automated system first collects complex words (e.g. praxes) using the complexity measures described in Table 1, then looks for suitable matches that can replace them without affecting the meaning of the sentence. A thesaurus lookup produces the following replacements: practices, rehearse, exercise and drill. Here rehearse, exercise and drill need to be dropped because they fall out of context. Finally the automated system selects ‘practices’ as the appropriate substitute and replaces the complex variant with it, generating the simple sentence: “Our University follows the practices of Guru Nanak Dev Ji”. Fig. 1 shows the lexical simplification of text through CWI.

Figure 1: Lexical Simplification process using CWI
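The three steps of the example above (identify the complex word, look up candidates, drop candidates that fall out of context) can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of any surveyed system: the frequency table, the threshold, the thesaurus entry and the list of attested word pairs are all hypothetical toy data, and the context check (both neighbouring word pairs must be attested) is an assumption chosen for brevity.

```python
import re

# Hypothetical toy data: frequencies per million words, invented for illustration.
FREQ_PER_MILLION = {
    "our": 900.0, "university": 120.0, "follows": 60.0, "the": 50000.0,
    "of": 30000.0, "praxes": 0.05, "practices": 45.0, "rehearse": 3.0,
    "exercise": 40.0, "drill": 8.0,
}
COMPLEXITY_THRESHOLD = 1.0  # assumed cut-off: rarer words are treated as complex

# Hypothetical thesaurus entry taken from the example in the text.
THESAURUS = {"praxes": ["practices", "rehearse", "exercise", "drill"]}

# Hypothetical word pairs assumed to be attested in some reference corpus.
ATTESTED_BIGRAMS = {
    ("the", "practices"), ("practices", "of"),
    ("the", "exercise"), ("the", "drill"),
}


def is_complex(word):
    """Flag a word whose known frequency is below the threshold.

    Words missing from the toy table (e.g. names) are left alone.
    """
    freq = FREQ_PER_MILLION.get(word.lower())
    return freq is not None and freq < COMPLEXITY_THRESHOLD


def fits_context(candidate, left, right):
    """Keep a candidate only if it is attested with both neighbours."""
    return (left, candidate) in ATTESTED_BIGRAMS and (candidate, right) in ATTESTED_BIGRAMS


def simplify(sentence):
    tokens = re.findall(r"[A-Za-z']+", sentence)
    simplified = []
    for i, token in enumerate(tokens):
        replacement = token
        if is_complex(token):
            left = tokens[i - 1].lower() if i > 0 else "<s>"
            right = tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"
            candidates = [c for c in THESAURUS.get(token.lower(), [])
                          if fits_context(c, left, right)]
            if candidates:
                # Prefer the most frequent (assumed simplest) surviving candidate.
                replacement = max(candidates, key=lambda c: FREQ_PER_MILLION.get(c, 0.0))
        simplified.append(replacement)
    return " ".join(simplified)


print(simplify("Our University follows the praxes of Guru Nanak Dev Ji"))
```

With the toy data above, only ‘practices’ survives the context check, so the sentence from the example is reproduced with ‘praxes’ replaced.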

 

Two types of techniques are used for CWI: Threshold based CWI and Classification based CWI. Threshold based CWI techniques are compared by Shardlow (2013). A corpus of complex words is collected from Wikipedia, in which pairs of sentences (XwY, X Y) are extracted based on edit history and different annotations of the word ‘w’ as complex. Among the pioneers in CWI are Carroll et al. (1998), who considered word frequency as a parameter for CWI. Words whose frequency lies below some threshold value are considered complex or uncommon. Classification based techniques use a machine learning algorithm, e.g. SVM, to train a classifier that uses word features to decide the complexity of a word. Words whose features resemble those of the complex words seen in training are taken to be complex. Following are the threshold based scales for measuring text complexity:

Table 1: Threshold based scales for measuring complexity of text.
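A frequency threshold of the kind described above can be illustrated with a short sketch. The corpus counts and the threshold value below are hypothetical, and treating unseen words as having a count of zero is an assumption of this sketch rather than something stated in the surveyed work.

```python
def find_complex_words(tokens, corpus_counts, threshold):
    """Return the tokens whose corpus count falls below the threshold.

    Unseen words are given a count of 0 (an assumption of this sketch),
    so they are always flagged as complex.
    """
    return [t for t in tokens if corpus_counts.get(t.lower(), 0) < threshold]


# Hypothetical corpus counts, invented for illustration.
counts = {"the": 1200, "committee": 40, "will": 900, "meet": 150,
          "to": 1500, "discuss": 80, "praxes": 1}

tokens = "The committee will meet to discuss the praxes".split()
print(find_complex_words(tokens, counts, threshold=5))  # ['praxes']
```

The choice of threshold controls the trade-off: a higher value flags more words as complex, a lower value flags fewer.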

 

Related Work

Automated text simplification was started in 1996 by Chandrasekar & Srinivas,1 who performed a superficial analysis of text to identify noun and verb phrases in complex sentences. In 2006, Siddharthan2 concluded that lexical simplification is a subtask of text simplification in which complex phrases are substituted with simpler variants to enhance the readability of text.

SemEval-2012 featured text simplification tasks based on three aspects: complexity analysis, search for substitute words and ranking of substitute variants.

In 2010, Lucia Specia et al.,3 described English text simplification using context aware lexical simplification approaches. They achieved the best G-score among nine participating teams in SemEval-2012.

In 2012, De Belder et al.,4 proposed a method that combines two sources, a lexicon and a language model. For a given text, the idea is to generate two lists: one list contains synonyms from lexical databases, while the second holds alternative words generated through latent word models. A probabilistic model then estimates the probability of each substitute for the original complex word.

In 2013, A. Di Marco & R. Navigli5 proposed a graph based Word Sense Induction (WSI) model for clustering and diversifying web search results. They automated the task of WSI by evaluating semantic similarities from raw text and discovering word senses from them.

In SemEval-2013, D. Jurgens & I.P. Klapaftis6 measured the performance of Word Sense Disambiguation (WSD) systems by discovering a software bug in them. The bug was wrongly labelling word senses, resulting in wrong interpretations of words.

In SemEval-2014, M. Marelli et al.,7 presented a model for finding semantic relatedness and textual entailment of English sentences. They split the dataset into two halves for training and testing the classifiers. Pairs of sentences are taken in the lexical entailment process, and the degree of relatedness is measured in terms of Pearson’s and Spearman’s correlation coefficients.
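For readers unfamiliar with the two coefficients, the sketch below computes both from scratch for a list of gold and predicted relatedness scores. It is a plain illustration of the standard definitions (Spearman's coefficient is Pearson's coefficient applied to ranks, with tied values sharing their average rank); the scores in the usage lines are made up and do not come from the SemEval-2014 data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def average_ranks(values):
    """Rank values from 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores: the predictions preserve the gold ordering but not the spacing.
gold = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.0, 4.0, 9.0, 16.0, 25.0]
print(pearson(gold, predicted), spearman(gold, predicted))
```

In this usage example the ordering is preserved exactly, so Spearman's coefficient is 1 while Pearson's coefficient is slightly lower, which shows why the two are often reported together.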

In the same competition in 2014, S. Oepen et al.,8 defined a Semantic Dependency Parsing (SDP) system for extracting the internal structure of sentences by collecting predicate-argument pairs for context words.

In SemEval-2015, E. Agirre et al.,9 submitted systems for Semantic Textual Similarity (STS) to identify the degree of relatedness between two text snippets. In the same series of tasks in 2015, A. Moro & R. Navigli10 presented a system for WSD and entity linking in multilingual texts.

The idea of including CWI in the competitive series was conceived for SemEval-2016 Task 11; G. H. Paetzold & Lucia Specia11 found CWI systems capable of identifying complex words in text and assisting lexical simplification of text. In SemEval-2016, Sanjay S.P et al.,18 attained high accuracy in CWI using a linear SVM classifier.

CWI Systems at SemEval-2016 Task 11

SemEval is an annual series of semantic evaluation workshops in computational linguistics; the 2016 edition was held in San Diego, California. SemEval-2016 provided a platform for different linguistic tasks to linguistics and computing professionals. One of the tasks was CWI, in which a total of 42 teams across the world participated and 20 teams submitted their systems.12

Following is the summary of the systems with their reported performance. The evaluation is carried out in terms of the G-score metric, which is the harmonic mean of accuracy and recall.

Table 2: CWI systems at SemEval-2016
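The G-score used in this comparison follows directly from its definition as the harmonic mean of accuracy and recall. The sketch below computes it from the four counts of a binary confusion matrix; the function name and the counts in the usage line are hypothetical and are not taken from any submitted system.

```python
def g_score(tp, fp, tn, fn):
    """Harmonic mean of accuracy and recall for a binary complex/simple labelling.

    tp, fp, tn, fn are the counts of true positives, false positives,
    true negatives and false negatives, with 'complex' as the positive class.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if accuracy + recall == 0:
        return 0.0
    return 2 * accuracy * recall / (accuracy + recall)


# Hypothetical counts: accuracy = 0.93, recall = 0.80.
print(round(g_score(tp=40, fp=60, tn=890, fn=10), 4))  # 0.8601
```

Because complex words are rare, a system that labels every word as simple can reach high accuracy but has zero recall, and the harmonic mean then drives its G-score to zero.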

 

Best Performing System: sv000gg

The reason behind the highest G-score of the sv000gg system is its ensemble of machine learning classifiers. The system performs hard voting over the complex-word labels predicted by the different classifiers. It also performs soft voting, in which the label is chosen by taking the maximum argument over the voters’ complexity estimates, extending traditional hard voting. This lets the system classify complex words in the given context with greater confidence.
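The difference between the two voting schemes can be shown with a small sketch. It is not the sv000gg implementation: the per-voter weights are hypothetical, and using one confidence weight per voter for soft voting is an assumption made here to illustrate taking the maximum argument over summed estimates.

```python
def hard_vote(votes):
    """Majority vote over 0/1 labels (1 = complex); every voter counts equally."""
    return int(sum(votes) * 2 > len(votes))


def soft_vote(votes, weights):
    """Weighted vote: sum the weight behind each label and take the arg max.

    Ties fall back to label 0 because it is listed first.
    """
    support = {0: 0.0, 1: 0.0}
    for label, weight in zip(votes, weights):
        support[label] += weight
    return max(support, key=support.get)


# Three hypothetical voters label one target word; the weights are invented.
votes = [1, 0, 0]
weights = [0.9, 0.3, 0.4]

print(hard_vote(votes))           # 0: two of three voters say "simple"
print(soft_vote(votes, weights))  # 1: the single confident voter outweighs the others
```

The usage example shows how the two schemes can disagree: a minority of strongly weighted voters can overturn a simple majority under soft voting.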

The other reason for the performance of the system is the set of features used in the classification process. The system uses a total of 69 features, grouped into four categories (Binary, Lexical, Collocational and Nominal), covering a wider range of features than other systems in the competition. The training of classifiers is done through 21 voters, which are grouped into three categories: Lexicon based, Threshold based and Machine learning. The third group contains an ensemble of seven machine learning algorithms, which made the system more precise in classification. Further, the results are validated through five fold cross validation over a joint dataset.
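Five fold cross validation, as mentioned above, splits the data into five parts and holds each part out once for testing while training on the other four. The sketch below shows only the splitting and averaging logic; the function names are hypothetical, the interleaved assignment of items to folds is an arbitrary choice of this sketch, and the scoring callback stands in for whatever training and evaluation a real system would perform.

```python
def k_fold_splits(n_items, k=5):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    # Interleaved assignment: item i goes to fold i % k.
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f in range(k) if f != held_out for i in folds[f]]
        yield train, test


def cross_validate(n_items, train_and_score, k=5):
    """Average the score returned by train_and_score(train, test) over k folds."""
    scores = [train_and_score(train, test)
              for train, test in k_fold_splits(n_items, k)]
    return sum(scores) / len(scores)


# Each of the 10 items appears in exactly one test fold.
for train, test in k_fold_splits(10):
    print(test, train)
```

Averaging the per-fold scores gives a performance estimate that depends less on one particular train/test split.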

Challenges in Text Simplification

One of the biggest challenges in Natural Language Processing is the ambiguity problem. Over the last decade many researchers have tried to reduce ambiguity through word sense disambiguation techniques. CWI also involves many challenges, including ambiguity. The accuracy of substitutes depends on the dataset used for classification based CWI. Results of text simplification may not be promising if a classifier is trained on an immature dataset. In SemEval-2016, the twenty submitting teams worked in the same direction to enhance the lexical simplification of web text, and they faced many challenges while developing solutions for CWI. Following are the challenges in CWI for lexical simplification of text:

To accurately identify complex words.

To perform an appropriate thesaurus lookup (avoiding jargon and dyslexic phrases).

To perform context aware substitution (substitutes for complex variants should preserve the semantic structure of the sentence).

To measure the degree of ambiguity of complex words (word sense disambiguation).

To help non-native English users understand challenging words in complex sentences.

To identify the simplification needs of individuals by comparing word complexity against that of the overall population of web users of English in the same category.

To build a new corpus to be used in Lexical Simplification and other tasks related to semantic evaluations.

To measure the applicability of different datasets used in the construction of CWI systems.

To enhance the performance metrics of CWI systems for English text.

To investigate various word parameters used in Lexical Simplification process.

Conclusion

Automated CWI systems are quite useful in assisting aphasic users, non-native English users, second language learners and students according to their simplification needs. The investigation of twenty CWI systems is carried out on the basis of the G-score measure of their performance. The sv000gg system holds the highest position in terms of G-score. The main reason for its higher performance is its coverage of a wide range of machine learning classifiers, along with a larger number of word features than other systems. The accuracy of the classification is also validated through five fold cross validation on the joint dataset. This system has encouraged other systems to incorporate ensembles of classifiers into the CWI process. Future CWI systems are expected to leverage deep learning techniques and Convolutional Neural Networks for better performance.

References

  1. Chandrasekar, R., Doran, C., & Srinivas, B. Motivations and methods for text simplification. Proceedings of the 16th conference on Computational linguistics. 1996; doi:10.3115/993268.993361
  2. Siddharthan, A. Syntactic Simplification and Text Cohesion. Research on Language and Computation. 2006;4(1):77-109. doi:10.1007/s11168-006-9011-1
  3. Specia, L. Translating from Complex to Simplified Sentences. Lecture Notes in Computer Science. 2010;30-39. doi:10.1007/978-3-642-12320-7_5
  4. De Belder, J., & Moens, M. A Dataset for the Evaluation of Lexical Simplification. Computational Linguistics and Intelligent Text Processing. 2012;426-437. doi:10.1007/978-3-642-28601-8_36
  5. Di Marco, A., & Navigli, R. Clustering and Diversifying Web Search Results with Graph-Based Word Sense Induction. Computational Linguistics. 2013;39(3):709-754. doi:10.1162/coli_a_00148
  6. Klapaftis, I. P., & Manandhar, S. Evaluating Word Sense Induction and Disambiguation Methods. Language Resources and Evaluation. 2013;47(3):579-605. doi:10.1007/s10579-012-9205-0
  7. Marelli, M., Bentivogli, L., Baroni, M., Bernardi, R., Menini, S., & Zamparelli, R. SemEval-2014 Task 1: Evaluation of Compositional Distributional Semantic Models on Full Sentences through Semantic Relatedness and Textual Entailment. Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014). 2014; doi:10.3115/v1/s14-2001
  8. Oepen, S., Kuhlmann, M., Miyao, Y., Zeman, D., Flickinger, D., Hajic, J., … Zhang, Y. SemEval 2014 Task 8: Broad-Coverage Semantic Dependency Parsing. Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014). 2014; doi:10.3115/v1/s14-2008
  9. Agirre, E., Banea, C., Cardie, C., Cer, D., Diab, M., Gonzalez-Agirre, A., … Wiebe, J. SemEval-2014 Task 10: Multilingual Semantic Textual Similarity. Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014). 2014; doi:10.3115/v1/s14-2010
  10. Moro, A., & Navigli, R. SemEval-2015 Task 13: Multilingual All-Words Sense Disambiguation and Entity Linking. Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015). 2015; doi:10.18653/v1/s15-2049
  11. Paetzold, G., & Specia, L. LEXenstein: A Framework for Lexical Simplification. Proceedings of ACL-IJCNLP 2015 System Demonstrations. 2015; doi:10.3115/v1/p15-4015
  12. Paetzold, G., & Specia, L. SemEval 2016 Task 11: Complex Word Identification. Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). 2016; doi:10.18653/v1/s16-1085
  13. Davoodi, E., & Kosseim, L. CLaC at SemEval-2016 Task 11: Exploring linguistic and psycho-linguistic Features for Complex Word Identification. Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). 2016; doi:10.18653/v1/s16-1151
  14. Konkol, M. UWB at SemEval-2016 Task 11: Exploring Features for Complex Word Identification. Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). 2016; doi:10.18653/v1/s16-1162
  15. Kuru, O. AI-KU at SemEval-2016 Task 11: Word Embeddings and Substring Features for Complex Word Identification. Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). 2016; doi:10.18653/v1/s16-1163
  16. Martínez Martínez, J. M., & Tan, L. USAAR at SemEval-2016 Task 11: Complex Word Identification with Sense Entropy and Sentence Perplexity. Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). 2016; doi:10.18653/v1/s16-1147
  17. Paetzold, G., & Specia, L. SV000gg at SemEval-2016 Task 11: Heavy Gauge Complex Word Identification with System Voting. Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). 2016; doi:10.18653/v1/s16-1149
  18. Sp, S., Kumar, A., & K P, S. AmritaCEN at SemEval-2016 Task 11: Complex Word Identification using Word Embedding. Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). 2016; doi:10.18653/v1/s16-1159
  19. Wróbel, K. PLUJAGH at SemEval-2016 Task 11: Simple System for Complex Word Identification. Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). 2016; doi:10.18653/v1/s16-1146
  20. Choubey, P., & Pateria, S. Garuda and Bhasha at SemEval-2016 Task 11: Complex Word Identification Using Aggregated Learning Models. Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). 2016; doi:10.18653/v1/s16-1156
  21. Malmasi, S., Dras, M., & Zampieri, M. LTG at SemEval-2016 Task 11: Complex Word Identification with Classifier Ensembles. Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). 2016; doi:10.18653/v1/s16-1154

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.