
During the first year, work on tasks (a) to (c) announced in the plan was pursued in parallel.
(a) Casting string edit distances into vector spaces. Review of 1-D clustering and multidimensional scaling. Study and implementation of metrics (e.g., the mutual information score) to measure the difference between analogical grids.
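As an illustration of how edit distances can be cast into a vector space, the following is a minimal sketch that assumes unit-cost Levenshtein distance and classical (Torgerson) multidimensional scaling. The function names and the four-word example are hypothetical, and the report does not state which variant was actually used. Since edit distance is not Euclidean in general, the embedding only approximates the original distances.

```python
import numpy as np

def levenshtein(s, t):
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution or match
        prev = curr
    return prev[-1]

def classical_mds(dist, dim=2):
    """Embed points in `dim` dimensions from a symmetric distance matrix."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centred matrix of squared distances (a Gram matrix if dist is Euclidean).
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)              # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:dim]                # indices of the largest ones
    # Negative eigenvalues signal non-Euclidean distances; they are clipped to zero.
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale

# Hypothetical toy vocabulary, for illustration only.
words = ["waiter", "waitress", "mister", "mistress"]
dist = np.array([[levenshtein(a, b) for b in words] for a in words], dtype=float)
coords = classical_mds(dist, dim=2)                      # one 2-D point per word
```

Each row of `coords` is the vector assigned to the corresponding word, so standard vector-space tools (including clustering) can then be applied to strings.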
(b) Adapting existing algorithms for arithmetic analogy on integer values to real values. Study of the distribution of values in word embedding models. Finding: along a single dimension, the values follow a Gaussian distribution. This poses a problem: such a distribution is unimodal, so no clustering method can separate the values on one dimension into distinct groups. Study of correlations between dimensions in word embedding models. Finding: some dimensions are correlated within subspaces, which allows some dimensionality reduction.
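The move from integer to real values can be sketched as follows, assuming the usual definition of arithmetic analogy (a : b :: c : d holds when b - a = d - c) and replacing exact equality with an absolute tolerance. The function names and the `tol` parameter are illustrative assumptions, not the project's actual interface.

```python
import math

def is_arithmetic_analogy(a, b, c, d, tol=0.0):
    """a : b :: c : d holds when b - a equals d - c, up to the absolute tolerance `tol`."""
    return math.isclose(b - a, d - c, rel_tol=0.0, abs_tol=tol)

def solve_arithmetic_analogy(a, b, c):
    """Return the d such that a : b :: c : d."""
    return c + (b - a)

def is_vector_analogy(a, b, c, d, tol=0.0):
    """Component-wise extension: the analogy must hold on every dimension."""
    return all(is_arithmetic_analogy(x, y, z, w, tol)
               for x, y, z, w in zip(a, b, c, d))

# Integer case: exact equality is sufficient.
integer_ok = is_arithmetic_analogy(2, 5, 10, 13)

# Real-valued case: floating-point rounding makes a tolerance necessary.
real_ok = is_arithmetic_analogy(0.1, 0.3, 0.5, 0.7, tol=1e-9)
```

With real values, the choice of tolerance becomes part of the method: too small and rounding noise breaks valid analogies, too large and unrelated values are accepted.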
(c) Parallelising existing algorithms. Use of the numerical library NumPy in existing programs. A master's student was hired for August and September. Result: speed-up in the retrieval of analogical clusters.
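To give an idea of the kind of gain NumPy can bring, here is a hedged sketch in which all pairwise difference vectors are computed in a single broadcast operation and pairs sharing the same difference are grouped together. It assumes integer feature vectors (so that exact equality is meaningful) and is not the project's actual retrieval algorithm. The function name and the toy data are hypothetical.

```python
from collections import defaultdict

import numpy as np

def analogical_clusters(vectors):
    """Group ordered pairs (i, j), i != j, by their difference vector x[j] - x[i]."""
    x = np.asarray(vectors)
    # Broadcasting computes all n * n difference vectors at once: diffs[i, j] = x[j] - x[i].
    diffs = x[None, :, :] - x[:, None, :]
    clusters = defaultdict(list)
    n = x.shape[0]
    for i in range(n):
        for j in range(n):
            if i != j:
                clusters[tuple(diffs[i, j].tolist())].append((i, j))
    # A cluster needs at least two pairs to express an analogy.
    return {key: pairs for key, pairs in clusters.items() if len(pairs) > 1}

# Hypothetical integer feature vectors, for illustration only.
features = [[0, 0], [0, 1], [3, 0], [3, 1]]
clusters = analogical_clusters(features)
```

In this toy example, the pairs (0, 1) and (2, 3) share the difference (0, 1), so items 0 : 1 :: 2 : 3 form an analogy. Only the difference computation is vectorised here; the grouping loop remains plain Python.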
Work on the extraction of all analogies from a word space. As many semantic phenomena are realised formally in language (e.g., the male/female opposition is expressed by the suffixes -er/-ress), the approach starts with regular patterns such as waiter : waitress :: mister : mistress and then extends to vector representations to capture irregular patterns such as king : queen. First experiments were run locally.
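The regular, suffix-based case can be illustrated with a deliberately simplified sketch. It assumes that the two words of the first pair differ only in their suffix, which is far more restrictive than a general formal analogy solver, and the function names are hypothetical.

```python
def common_prefix_length(s, t):
    """Length of the longest common prefix of s and t."""
    n = 0
    for cs, ct in zip(s, t):
        if cs != ct:
            break
        n += 1
    return n

def solve_suffix_analogy(a, b, c):
    """Solve a : b :: c : x by suffix substitution; return None if c does not fit."""
    k = common_prefix_length(a, b)
    suffix_a, suffix_b = a[k:], b[k:]
    if not c.endswith(suffix_a):
        return None                      # c does not carry the suffix of a
    return c[:len(c) - len(suffix_a)] + suffix_b

regular = solve_suffix_analogy("waiter", "waitress", "mister")   # "mistress"
irregular = solve_suffix_analogy("king", "queen", "man")         # None
```

The second call returns `None` because king and queen share no prefix: this is the kind of irregular pattern that a purely formal method misses and that vector representations are meant to capture.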
A study on the retrieval of all possible formal analogies between sentences at different granularities was conducted. Although only analogies on the formal level were retrieved, a journal paper has been published.