How to harvest Word Combinations from corpora: Methods, evaluation and perspectives

Lenci, A., Masini, F., Nissim, M., Castagnoli, S., Lebani, G. E., Passaro, L. C. & Senaldi, M. S. G., 2017, In: Studi e saggi linguistici. 55, 2, p. 45-68 (24 p.)

Research output: Contribution to journal, Article, Academic, peer-review

  • Alessandro Lenci
  • Francesca Masini
  • Malvina Nissim
  • Sara Castagnoli
  • Gianluca E. Lebani
  • Lucia C. Passaro
  • Marco S. G. Senaldi

This paper reports on work carried out in the framework of the CombiNet project on the automatic extraction of word combinations from large corpora, with a view to representing the full distributional profile of selected lemmas. We describe two extraction methods, based on part-of-speech sequences (P-method) and syntactic patterns (S-method) respectively. We evaluate their performance, both contrastively and with reference to external benchmarks, and discuss the relevance of automatic knowledge acquisition for lexicographic purposes. Our results indicate that both approaches provide valuable data and confirm previous claims that P-methods and S-methods are largely complementary, as they tend to retrieve different types of word combinations. In the second part of the paper, we present SYMPAThy, a data representation format devised to merge the two methods by leveraging their respective strengths. To explore SYMPAThy's potential, we also describe a preliminary investigation of a small set of Italian idioms, and specifically of their degree of fixedness/productivity.
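As a rough illustration of what extraction based on part-of-speech sequences can look like, the sketch below scans POS-tagged, lemmatized sentences for contiguous tag sequences and counts the matching lemma sequences. It is a minimal sketch under stated assumptions, not the CombiNet implementation: the pattern inventory `PATTERNS`, the function name `extract_pos_combinations`, the tag labels and the toy corpus are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical POS patterns, assumed for illustration only;
# this is not the pattern inventory used in the paper.
PATTERNS = [("VERB", "NOUN"), ("ADJ", "NOUN"), ("NOUN", "ADP", "NOUN")]


def extract_pos_combinations(tagged_sentences, patterns=PATTERNS):
    """Count lemma sequences whose POS tags match one of the given patterns.

    Each sentence is a list of (lemma, pos_tag) pairs.
    """
    counts = Counter()
    for sentence in tagged_sentences:
        tags = tuple(tag for _, tag in sentence)
        for pattern in patterns:
            n = len(pattern)
            # Slide a window of the pattern's length over the sentence.
            for i in range(len(sentence) - n + 1):
                if tags[i:i + n] == pattern:
                    lemmas = tuple(lemma for lemma, _ in sentence[i:i + n])
                    counts[lemmas] += 1
    return counts


# Invented toy corpus of lemmatized, POS-tagged sentences.
corpus = [
    [("perdere", "VERB"), ("tempo", "NOUN")],
    [("perdere", "VERB"), ("tempo", "NOUN"), ("di", "ADP"), ("lavoro", "NOUN")],
]
counts = extract_pos_combinations(corpus)
```

A sketch of this kind only sees adjacent tokens, so it would miss combinations whose members are separated by intervening material; that limitation is one reason an approach based on syntactic patterns can retrieve different types of combinations.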

Original language: English
Pages (from-to): 45-68
Number of pages: 24
Journal: Studi e saggi linguistici
Issue number: 2
Publication status: Published - 2017


Keywords

  • word combinations, computational methods, idiomatic expressions, idiomaticity

ID: 100069078