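Deep learning also helps evaluate semantic similarity between texts in different languages

A new feature, text embedding, has been added to the Rosette API. One of the most popular uses of text embeddings is similarity calculation.
Our engine uses machine learning to recognize similarities in context, content, and associations.
From the embeddings of two documents, phrases, or words, you can evaluate how similar they are in meaning or content.
Five languages are supported (English, German, Spanish, Japanese, and Chinese), and comparisons across languages are also possible.
You can try it for free.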
——————————————————–

Deep Learning Powers Cross-Lingual Semantic Similarity Calculation

September 16, 2016

Text Embeddings Now Available in the Rosette API

The Rosette API team is excited to announce the addition of a new function to Rosette’s suite of capabilities:
text embedding. This endpoint returns a single vector of floating point numbers for your input, a.k.a. an
embedding of your text in a semantic vector space.

Text embeddings can be used for a variety of text analysis tasks, including judging the semantic similarity of
two or more texts, even across languages. Knowing the embeddings of two documents, phrases, or words lets
you evaluate how similar they are in meaning or content.
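
One common way to turn two embeddings into a similarity score is cosine similarity, which is also what the file name cosine_similarity.py in the sample repo suggests. The snippet below is a minimal sketch of that measure in plain Python, not the code from the repo; the function name and the toy two-dimensional vectors are our own illustrative assumptions, and real embeddings have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors for illustration only (not output from any real endpoint).
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Scores near 1 mean the vectors point in nearly the same direction, scores near 0 mean they are unrelated, and negative scores mean they point in opposing directions.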

What is semantic similarity?

While word and text embedding is still an emerging capability in the realm of natural language processing, one
of the most popular uses for text embeddings so far is similarity calculation. Our engine utilizes machine
learning to recognize similarities in context, content, and associations. For example, “king” correlates
to “man” while “queen” correlates to “woman.”

For the end user, text embeddings could power a number of different applications. Businesses engaging in
eDiscovery might use text embedding for deduplication of documents. In consumer services, review websites
like Yelp or TripAdvisor could use text embeddings to aggregate related phrases such as “the bathrooms
were spotless” or “the restroom was very clean.”
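
As a rough illustration of the deduplication and aggregation ideas above, here is a sketch that groups items whose embeddings are close to each other. Everything in it is an assumption for demonstration purposes: the greedy grouping strategy, the 0.9 threshold, the function names, and the made-up three-dimensional vectors. A production system would tune the threshold on real data and likely use a more robust clustering method.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def group_similar(embeddings, threshold=0.9):
    """Greedily group keys whose vectors are close to a group's first member.

    `embeddings` maps a label to its vector. The threshold is an arbitrary
    illustrative value, not a recommended setting.
    """
    groups = []  # each group is a list of labels; the first is its representative
    for label, vector in embeddings.items():
        for group in groups:
            if cosine_similarity(embeddings[group[0]], vector) >= threshold:
                group.append(label)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([label])
    return groups


# Made-up vectors standing in for embeddings of three review snippets.
toy = {
    "review_a": [0.90, 0.10, 0.00],
    "review_b": [0.88, 0.12, 0.01],
    "review_c": [0.00, 0.20, 0.95],
}
print(group_similar(toy))
```

With these toy vectors, the first two items land in one group and the third forms its own group.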

Find Related Words and Documents in 5 Languages

The Rosette API text embedding endpoint supports five languages: English, German, Spanish, Japanese, and
Chinese. It also supports cross-lingual comparisons, which allow you to calculate the similarity of words
or documents written in different languages. As a test, consider evaluating the similarity of comparable
or related words in different languages, such as “amor” and “love,” or “die Braut” and “le mariage.”

Try it Out

Once you’ve signed up for your Rosette API account (no commitment, no credit card required), you can try
out a basic application for text embeddings using some sample Python code we created. Remember, the
Rosette API is free for up to 10,000 calls per month! If you need more calls, check out our paid plans.

First, head to the Rosette API GitHub community and clone the text-embeddings-sample repo to your machine.

You should see two files (plus a README.md):

  • cosine_similarity.py
  • test_embeddings.py

Make sure you’ve installed the latest version of our Python client binding — 1.3.2 — via

$ pip install rosette-api --upgrade

Then edit cosine_similarity.py in your favorite text editor to replace “[your key here]” with your Rosette API key.

Save, and head back to the text-embeddings-sample directory in your command line to run test_embeddings.py. It should look something like this:

$ python test_embeddings.py


Sample Results

We can see some interesting measures of semantic similarity between “Paris”, “France”, “London”, and “England”. Notice that “Paris” is closer in meaning to “France” than “London” is to “France.” Similarly, “England” is more semantically similar to “London” than to “France”.

Once you’ve gotten the hang of it, replace the sample input words in test_embeddings.py with your own words or longer input text to calculate the similarity between them.
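
If you want to compare more than two inputs at once, a small helper that scores every pair can be handy. The sketch below is not part of the sample repo. It assumes you supply an `embed` function that returns a vector for a piece of text (for example, your own wrapper around the text embedding endpoint); `pairwise_similarities`, `toy_embed`, and the toy vectors are hypothetical names and made-up numbers used only so the example runs offline.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pairwise_similarities(texts, embed):
    """Return {(text_a, text_b): similarity} for every unordered pair of texts."""
    vectors = {text: embed(text) for text in texts}  # embed each text once
    return {
        (a, b): cosine_similarity(vectors[a], vectors[b])
        for a, b in combinations(texts, 2)
    }


# Stand-in for a real embedding call: made-up vectors, not Rosette output.
TOY_VECTORS = {
    "text_1": [0.8, 0.6],
    "text_2": [0.6, 0.8],
    "text_3": [-0.6, 0.8],
}


def toy_embed(text):
    return TOY_VECTORS[text]


scores = pairwise_similarities(["text_1", "text_2", "text_3"], toy_embed)
# Print the most similar pairs first.
for (a, b), score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{a} vs {b}: {score:.3f}")
```

Passing the embedding function in as a parameter keeps the comparison logic independent of any particular client library version, so you can swap in a real API call without touching the rest.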

You can find more details about text embedding and language coverage in the documentation. If you discover any cool results or use cases, let us know! Email support@rosette.com and we’ll feature your results on our blog or in our GitHub community.