Detection of Compound Word with Combination Noun and Adjective using Rule Based Technique in Malay Standard Document

Zamri Abu Bakar, Normaly Kamal Ismail, Mohd Izani Mohamed Rawi

Abstract


This paper describes a method for detecting compound words formed from a noun and an adjective in standard Malay documents. We address the problem of identifying noun-adjective combinations in Malay sentences that function as a single compound word. We modified several identification rules, drawing on Malay grammar rules and syntactic information, to increase recall, precision and F1-score. For compound word identification, we used dictionary and thesaurus information to assign Part of Speech (POS) tags to every word in the selected Malay document. Testing was carried out on a selected Malay document. The results showed an improvement over previous research, with a precision of 90.9%, a recall of 10.2% and an F1-score of 18.1%.
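
The general idea of dictionary-based POS tagging followed by a noun-adjective rule can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy lexicon, the tag names, the single adjacency rule and the function names `tag`, `noun_adj_candidates` and `precision_recall_f1` are all assumptions made for illustration. The abstract does not list the actual rules, and a real system would load a full dictionary and thesaurus.

```python
# Hypothetical toy lexicon (assumption): word -> POS tag.
LEXICON = {
    "rumah": "NOUN",   # house
    "sakit": "ADJ",    # sick
    "itu": "DET",      # that
    "dekat": "ADJ",    # near
    "dengan": "PREP",  # with
    "pasar": "NOUN",   # market
}


def tag(tokens):
    """Assign a POS tag to each token by dictionary lookup; unknown words get 'UNK'."""
    return [(tok, LEXICON.get(tok.lower(), "UNK")) for tok in tokens]


def noun_adj_candidates(tagged):
    """Return adjacent (noun, adjective) pairs as compound word candidates."""
    return [
        (w1, w2)
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
        if t1 == "NOUN" and t2 == "ADJ"
    ]


def precision_recall_f1(predicted, gold):
    """Compute precision, recall and F1 of predicted pairs against gold pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    sentence = "Rumah sakit itu dekat dengan pasar".split()
    print(noun_adj_candidates(tag(sentence)))
```

In this sketch only "Rumah sakit" is flagged, because it is the only place where a word tagged as a noun is immediately followed by a word tagged as an adjective. A rule this simple would also flag ordinary noun-adjective phrases that are not compounds, which is why additional grammar rules and syntactic information of the kind the abstract mentions would be needed in practice.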

Keywords


Compound Word; Malay Standard Document; Rule-Based; Syntactic Information

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.

ISSN: 2180-1843

eISSN: 2289-8131