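Detecting Duplicate Documents and Cross-Lingual Search with Semantic Similarity

Semantic Vectors convert words, phrases, and documents into vectors in a semantic space. The closer two texts are in meaning, the numerically closer their vectors are, so comparing vectors makes it possible to detect duplicate documents. Terms with similar meanings also have close vectors across different languages. Similar Terms returns the terms that are most semantically similar to a given term. With Similar Terms, the terms of a query can be expanded into similar terms in other languages, so documents in multiple languages can be searched at once.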


Duplicate Document Detection & Cross-Lingual Search


How to automate mundane tasks and find relevant text using text embedding

Numbers are great, because they are easy to compare, tabulate and examine. Text? Not so much.
But text embeddings let one manipulate and compare the meaning behind words and text like numbers.

Basically, text embeddings convert a word, a phrase, or even a whole document into a mathematical vector that represents its meaning. The closer two vectors are numerically, the closer the texts are in meaning. (For the long explanation of how text embeddings work, read our blog posts “Using Deep Learning to Power Multilingual Text Embeddings for Global Analysis” Part I and Part II.) A given word compared to itself scores 1.0 in similarity, but outside of that case, 0.8 is about as high a match as you will ever see.
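To make "numerically closer" concrete, here is a minimal sketch that assumes cosine similarity as the measure. That choice is an assumption for illustration (this post does not say which measure Rosette uses), and the three-dimensional vectors are made-up toy values, far smaller than real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


# Toy vectors, invented for this example (not real embedding output).
spy = [0.9, 0.1, 0.3]
agent = [0.8, 0.2, 0.4]
banana = [-0.2, 0.9, 0.1]

print(cosine_similarity(spy, spy))     # a vector compared to itself
print(cosine_similarity(spy, agent))   # related meaning, higher score
print(cosine_similarity(spy, banana))  # unrelated meaning, lower score
```

A text compared with itself scores 1.0 under this measure, which matches the behavior described above.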

Both cross-lingual query expansion (i.e., taking your English search and generating the equivalent in a number of languages) and duplicate document detection can be built using text embeddings. The only difference is that cross-lingual search looks for an equivalent phrase in a different language, while duplicate detection usually compares documents within a single language.

Let’s see how this works.

CROSS-LINGUAL QUERY EXPANSION

Before we had access to text embeddings, monolingual English speakers would take a search term, drop it into Google Translate, and then copy the result into the search box. It’s laborious, and you may not even end up with the right term. That has all changed with the availability of semantic similarity of terms in Rosette version 1.12.1, which supports Arabic, English, Chinese, German, Japanese, North & South Korean, Russian, and Spanish for this function.

Similar words or phrases can be discovered within a language or across languages. Given the word “spy”, Rosette returns these similar terms in Spanish, German, and Japanese.

Input: Spy
Spanish
{"term":"espía","similarity":0.61295485},
{"term":"cia","similarity":0.46201307},
{"term":"desertor","similarity":0.42849663},
{"term":"cómplice","similarity":0.36646274},
{"term":"subrepticiamente","similarity":0.36629659}
German
{"term":"Deckname","similarity":0.51391315},
{"term":"GRU","similarity":0.50809389},
{"term":"Spion","similarity":0.50051737},
{"term":"KGB","similarity":0.49981388},
{"term":"Informant","similarity":0.48774603},
Japanese
{"term":"スパイ","similarity":0.5544399},
{"term":"諜報","similarity":0.46903181},
{"term":"MI6","similarity":0.46344957},
{"term":"殺し屋","similarity":0.41098994},
{"term":"正体","similarity":0.40109193},

Rosette’s /semantics/similar endpoint returns similar terms from a term database compiled from Wikipedia and Gigaword.
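As a rough sketch of the expansion step, the snippet below takes the "spy" results listed above and turns them into a single multilingual query string. The `expand_query` helper, the 0.5 cutoff, and the quoted `OR` query syntax are all assumptions for illustration; a real search engine will have its own syntax, and the right cutoff depends on your data. No Rosette call is made here: the scores are copied from the output above.

```python
# Similar terms for "spy", copied from the results shown above.
SIMILAR_TERMS = {
    "Spanish": [("espía", 0.61295485), ("cia", 0.46201307), ("desertor", 0.42849663)],
    "German": [("Deckname", 0.51391315), ("GRU", 0.50809389), ("Spion", 0.50051737),
               ("KGB", 0.49981388), ("Informant", 0.48774603)],
    "Japanese": [("スパイ", 0.5544399), ("諜報", 0.46903181), ("MI6", 0.46344957)],
}


def expand_query(term, similar_terms, min_similarity=0.5):
    """Build an OR query from the original term plus high-scoring similar terms.

    `min_similarity` is a hypothetical cutoff chosen for this example.
    """
    expanded = [term]
    for candidates in similar_terms.values():
        for candidate, score in candidates:
            if score >= min_similarity and candidate not in expanded:
                expanded.append(candidate)
    return " OR ".join(f'"{t}"' for t in expanded)


print(expand_query("spy", SIMILAR_TERMS))
```

Lowering the cutoff pulls in looser associations such as "KGB" or "MI6", which widens recall at the cost of precision.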

DUPLICATE DOCUMENT DETECTION

Text embeddings are also dead useful in areas such as eDiscovery, where being able to detect near-duplicate documents could save man-weeks or more of labor during discovery. Rosette will accept an entire document as input to its /semantics/vector endpoint and calculate the vector (a location in semantic space, represented as a vector of floating point numbers). The resulting vectors for each document can then be compared with one another.

For instance, a press release can be published on 100+ websites. Using semantic vectors, you can programmatically identify all of those copies as versions of the same article.
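The comparison step might look like the sketch below. It assumes you have already obtained one vector per document, normalizes each vector to unit length, and flags pairs whose dot product (equivalent to cosine similarity after normalization) meets a threshold. The document ids, the toy three-dimensional vectors, and the 0.95 threshold are all made up for illustration, not values from Rosette.

```python
import math
from itertools import combinations


def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def find_near_duplicates(doc_vectors, threshold=0.95):
    """Return (id_a, id_b, score) for every pair scoring at or above `threshold`."""
    unit = {doc_id: normalize(vec) for doc_id, vec in doc_vectors.items()}
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(unit.items(), 2):
        score = sum(x * y for x, y in zip(vec_a, vec_b))
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    return pairs


# Toy vectors, invented for this example (real document vectors are much longer).
doc_vectors = {
    "press_release_site_a": [0.12, 0.80, 0.35],
    "press_release_site_b": [0.11, 0.82, 0.33],
    "unrelated_article": [0.90, 0.05, -0.40],
}

for id_a, id_b, score in find_near_duplicates(doc_vectors):
    print(f"{id_a} ~ {id_b} (similarity {score:.3f})")
```

Comparing every pair is quadratic in the number of documents, which is fine for a few thousand items. Larger collections would typically need an approximate nearest-neighbor index instead.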

Curious to try this out? Sign up for a free Rosette Cloud trial account.