Theoretically founded algorithms for the automatic production of analogy tests in NLP
【Research keywords】
natural language processing / embedding representations / analogy / analogy datasets / algorithms / deep learning
【Summary of research results】
During the first year, work on tasks (a) to (c) announced in the plan has been pursued in parallel.
(a) Casting string edit distances into vector spaces. Review of 1-D clustering and multidimensional scaling. Reading on and implementation of metrics (e.g., the mutual information score) to measure the difference between analogical grids.
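As an illustration of the kind of metric mentioned in (a), the following is a minimal sketch of a mutual information score between two labelings of the same items. It assumes that each analogical grid can be summarised as one cluster label per word, which is a simplification made here for illustration; the function name `mutual_information` is hypothetical and is not taken from the project's code.

```python
from collections import Counter
from math import log


def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two labelings of the same items.

    Assumption: labels_a[i] and labels_b[i] are the cluster labels that two
    different groupings assign to item i.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_a)
    if n == 0:
        return 0.0
    count_a = Counter(labels_a)                  # marginal counts, grouping A
    count_b = Counter(labels_b)                  # marginal counts, grouping B
    count_ab = Counter(zip(labels_a, labels_b))  # joint counts
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        # p(a, b) * log( p(a, b) / (p(a) * p(b)) )
        mi += (n_ab / n) * log(n_ab * n / (count_a[a] * count_b[b]))
    return mi
```

Two groupings that agree up to a renaming of the labels reach the maximal score (the entropy of the labeling), while statistically independent groupings score 0.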
(b) Adapting existing algorithms for arithmetic analogy on integer values to real values. Study of the distribution of values in word embedding models. Finding: along a single dimension, the values follow a Gaussian distribution. This poses a problem: such a unimodal distribution offers no natural gaps, so no clustering method can separate the values on one dimension. Study of the correlations between dimensions in word embedding models. Finding: some dimensions are correlated within subspaces. This allows some dimensionality reduction.
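The move from integer to real values in (b) can be sketched as follows. On integers, the arithmetic analogy a : b :: c : d holds when a + d = b + c exactly; on real values, exact equality has to be replaced by a tolerance. The function names and the tolerance `tol` are assumptions made for this sketch, not the project's actual algorithm.

```python
from math import isclose


def solve_analogy(a, b, c):
    """Solve a : b :: c : x on each dimension: x = b - a + c."""
    return [bi - ai + ci for ai, bi, ci in zip(a, b, c)]


def is_analogy(a, b, c, d, tol=1e-6):
    """Check a : b :: c : d, i.e. a + d == b + c on every dimension, up to tol.

    Assumption: tol is an illustrative absolute tolerance; a suitable value
    depends on the scale of the embedding values.
    """
    return all(
        isclose(ai + di, bi + ci, abs_tol=tol)
        for ai, bi, ci, di in zip(a, b, c, d)
    )
```

For example, `solve_analogy([1.0, 2.0], [1.5, 1.0], [3.0, 0.5])` yields a vector `d` for which `is_analogy` returns `True`.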
(c) Parallelising existing algorithms. Use of the mathematical library numpy in existing programs. A master's student was hired in August and September. Results: speed-up in retrieval of analogical clusters.
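To give an idea of how numpy can replace explicit loops in this setting, the sketch below computes all pairwise offset vectors with broadcasting and retrieves the word pairs that share the offset of a given pair. This is an illustrative assumption about what such a vectorised step could look like, not the project's implementation; `pairwise_offsets`, `pairs_with_same_offset` and `tol` are hypothetical names.

```python
import numpy as np


def pairwise_offsets(embeddings):
    """Return an (n, n, d) array whose [i, j] entry is embeddings[j] - embeddings[i]."""
    return embeddings[None, :, :] - embeddings[:, None, :]


def pairs_with_same_offset(embeddings, i, j, tol=1e-6):
    """List index pairs (r, c) whose offset matches that of pair (i, j), up to tol."""
    offsets = pairwise_offsets(embeddings)
    target = offsets[i, j]
    # Euclidean distance between every offset vector and the target offset.
    dist = np.linalg.norm(offsets - target, axis=-1)
    rows, cols = np.nonzero(dist <= tol)
    return [(int(r), int(c)) for r, c in zip(rows, cols)]


# Toy 2-D "embeddings": rows 0 -> 1 and 2 -> 3 share the same offset.
toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [1.0, 2.0]])
same = pairs_with_same_offset(toy, 0, 1)
```

The full offset array takes memory proportional to n² × d, so on a real vocabulary this would have to be computed in blocks.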
Work has started on the extraction of all analogies from a word space. As many semantic phenomena are realised formally in language (e.g., the male/female opposition is expressed by the suffixes -er/-ress), the approach starts with regular patterns such as waiter : waitress :: mister : mistress and then extends to vector representations to capture irregular patterns such as king : queen. First experiments have been run locally.
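The regular patterns mentioned above can be illustrated with a deliberately simplified sketch: each word pair is reduced to a suffix rewrite rule by stripping the longest common prefix, and pairs sharing a rule are grouped together. Formal analogy in general covers more than suffixation, so this is only an assumption-laden toy; `suffix_rule` and `group_by_rule` are hypothetical names.

```python
from collections import defaultdict
from os.path import commonprefix


def suffix_rule(word_a, word_b):
    """Return the pair of suffixes left after removing the longest common prefix."""
    k = len(commonprefix([word_a, word_b]))
    return word_a[k:], word_b[k:]


def group_by_rule(pairs):
    """Group word pairs by their suffix rewrite rule."""
    clusters = defaultdict(list)
    for a, b in pairs:
        clusters[suffix_rule(a, b)].append((a, b))
    return dict(clusters)


clusters = group_by_rule(
    [("waiter", "waitress"), ("mister", "mistress"), ("king", "queen")]
)
```

Here waiter : waitress and mister : mistress fall under the same rule (-er to -ress), whereas king : queen shares no prefix and ends up alone, which is the kind of irregular pair that the vector representations are meant to capture.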
A study on retrieval of all possible formal analogies between sentences at different granularities has been conducted. Only analogies on the formal level were retrieved, but a journal paper has been published.
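【Principal investigator】
【Research category】Grant-in-Aid for Scientific Research (C)
【Research period】2021-04-01 - 2024-03-31
【Budget allocated】4,030 thousand yen (direct costs: 3,100 thousand yen; indirect costs: 930 thousand yen)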