Improve the speed and accuracy of your search application with advanced linguistic analysis.

Search many languages with high accuracy

Every language, including English, poses its own challenges for search applications that aim to deliver relevant and precise results. Rosette® Base Linguistics (RBL) enables enterprise applications to search or process text in many languages by providing a complete set of linguistic services. RBL enriches the original text in its native language for best-in-class natural language processing, improving both speed and accuracy.

As linguistics experts with deep understanding at the intersection of language and technology, Basis Technology continually improves the Rosette product family with language additions, feature updates, and the latest innovations from the academic world.



  • Catalan
  • Czech
  • Danish
  • Dutch
  • English
  • Finnish
  • French
  • German
  • Greek
  • Italian
  • Norwegian
  • Portuguese
  • Spanish
  • Swedish
  • Albanian
  • Bulgarian
  • Croatian
  • Estonian
  • Hungarian
  • Latvian
  • Polish
  • Romanian
  • Russian
  • Serbian
  • Slovak
  • Slovenian
  • Turkish
  • Ukrainian
  • Arabic
  • Hebrew
  • Pashto
  • Persian
  • Urdu
Asia
  • Chinese, Simplified
  • Chinese, Traditional
  • Indonesian
  • Japanese
  • Korean
  • Malay
  • Thai
Code Base
Platform Support
Red Hat



  • Simple API
  • High-scale and throughput
  • Industrial-strength support
  • Easy installation
  • Flexible and customizable
  • Java or C++
  • Component of the Rosette SDK
  • Customizable user dictionaries, Japanese orthographic normalization, and Chinese scripts
  • Cloudera certified

Advanced Morphological Features



Tokenization

Many search tools index character bigrams to handle languages written without spaces between words, which inflates the index and reduces relevancy. RBL, in contrast, identifies and separates each word through advanced statistical modeling. The resulting token output (a process also known as segmentation) minimizes index size, enhances search accuracy, and increases relevancy.

Tokenization Example
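The contrast between bigram indexing and word segmentation can be sketched in a few lines of Python. This is only an illustration, not RBL output: RBL uses statistical modeling, whereas the sketch swaps in a much simpler greedy longest-match lookup against a toy vocabulary. The names `char_bigrams`, `max_match`, and `TOY_VOCABULARY` are hypothetical.

```python
def char_bigrams(text):
    """Return overlapping character bigrams (the baseline approach)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def max_match(text, vocabulary):
    """Greedy longest-match segmentation against a toy vocabulary.

    A stand-in for real segmentation; unknown characters fall back to
    single-character tokens.
    """
    longest = max(len(word) for word in vocabulary)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical vocabulary, for illustration only.
TOY_VOCABULARY = {"東京", "大学", "東京大学", "の", "学生"}

text = "東京大学の学生"  # "a student of the University of Tokyo"
bigrams = char_bigrams(text)
words = max_match(text, TOY_VOCABULARY)

print(len(bigrams), bigrams)
print(len(words), words)
```

In this toy case the bigram index holds six terms and the word index three. The bigram list also contains 京大, a common abbreviation for Kyoto University, so a query for that term would match a document about the University of Tokyo. Word segmentation avoids this kind of false hit.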



Lemmatization

Most search engines rely on stemming, a crude method that chops characters off the end of a word in the hope of reaching its root form. Stemming often inflates recall at the expense of precision. Instead, RBL finds the true dictionary form of each word, known as a lemma, by using vocabulary, context, and advanced morphological analysis. Indexing the lemma increases search relevancy and slims the search index, because the inflected forms do not all need to be indexed. Alternative lemmas are also made available to supplement indexing.

Lemmatization Example
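The difference between stemming and lemmatization can be illustrated with a small Python sketch. It is not RBL's analysis: the stemmer is a deliberately naive suffix stripper, and the lemmatizer is a hypothetical lookup table (`TOY_LEMMAS`) standing in for the vocabulary, context, and morphological analysis described above.

```python
def crude_stem(word):
    """Strip a common English suffix, keeping at least three characters."""
    for suffix in ("ies", "es", "s", "ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


# Hypothetical lemma dictionary, for illustration only.
TOY_LEMMAS = {
    "studies": "study",
    "mice": "mouse",
    "news": "news",
}


def toy_lemma(word):
    """Return the dictionary form if known, else the word unchanged."""
    return TOY_LEMMAS.get(word, word)


for word in ("studies", "news", "mice"):
    print(word, "stem:", crude_stem(word), "lemma:", toy_lemma(word))
```

The stemmer reduces "news" to "new", so the two unrelated words collide in the index, and it leaves "mice" untouched, so a query for "mouse" misses it. The lemma lookup keeps "news" intact and maps "mice" to "mouse".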

Noun Phrase Extraction

Certain noun phrases, especially multi-word proper names, are hard to identify as a single unit. RBL groups nouns with their modifiers, which is useful in document clustering and concept extraction.
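One simple way to picture the grouping of nouns and modifiers is a pattern over part-of-speech tags. The sketch below is an assumption made for illustration, not RBL's method: it takes tokens that are already tagged with Universal POS tags and collects runs of adjectives, nouns, and proper nouns that end in a noun. The function name `noun_phrases` and the sample sentence are hypothetical.

```python
NOUN_PHRASE_TAGS = {"ADJ", "NOUN", "PROPN"}


def noun_phrases(tagged_tokens):
    """Group consecutive ADJ/NOUN/PROPN tokens into noun phrases."""
    phrases, current = [], []

    def flush():
        # A phrase must end in a noun, so drop trailing adjectives.
        while current and current[-1][1] == "ADJ":
            current.pop()
        if current:
            phrases.append(" ".join(word for word, _ in current))
        current.clear()

    for word, tag in tagged_tokens:
        if tag in NOUN_PHRASE_TAGS:
            current.append((word, tag))
        else:
            flush()
    flush()
    return phrases


# Hand-tagged sample input, for illustration only.
sample = [
    ("The", "DET"), ("New", "PROPN"), ("York", "PROPN"),
    ("Stock", "PROPN"), ("Exchange", "PROPN"), ("opened", "VERB"),
    ("a", "DET"), ("new", "ADJ"), ("trading", "NOUN"),
    ("floor", "NOUN"), (".", "PUNCT"),
]
print(noun_phrases(sample))
```

Here the four-token proper name is returned as one phrase rather than four unrelated terms, which is the behavior that helps clustering and concept extraction.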

Parts of Speech Tagging

As part of the lemmatization process, statistical modeling determines the correct part of speech, even for ambiguous words. Each token is then tagged for enhanced comprehension and search relevancy. Because grammars differ from language to language, the set of part-of-speech tags differs by language as well.

Rosette supports the Universal POS Tag standard, from which the developer can map to Penn Treebank or other POS tag systems.
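A developer-side mapping from Universal POS tags to Penn Treebank tags might look like the following sketch. The table is an illustrative subset written for this example, not a mapping shipped with Rosette, and the names `UPOS_TO_PTB` and `ptb_candidates` are hypothetical. Because the universal tags are coarser than the Penn Treebank tags, each universal tag maps to a list of candidates rather than a single tag.

```python
# Illustrative subset only; a real mapping would cover every tag in use.
UPOS_TO_PTB = {
    "NOUN": ["NN", "NNS"],
    "PROPN": ["NNP", "NNPS"],
    "VERB": ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"],
    "ADJ": ["JJ", "JJR", "JJS"],
    "ADV": ["RB", "RBR", "RBS"],
    "NUM": ["CD"],
    "CCONJ": ["CC"],
}


def ptb_candidates(upos_tag):
    """Return the Penn Treebank tags a universal tag may correspond to."""
    return UPOS_TO_PTB.get(upos_tag, [])


print(ptb_candidates("NOUN"))
print(ptb_candidates("VERB"))
```

Choosing among the candidates (for example, NN versus NNS) requires extra information such as the token's morphology, so an application would typically combine this table with the lemma or inflection data it already has.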



Decompounding

RBL breaks compound words down into their sub-components and delivers each element for indexing. This is especially useful for increasing search relevancy in languages such as German and Korean.

Example: German

Samstagmorgen is a compound word formed from Samstag (Saturday) and Morgen (morning). Decompounding allows an appropriate match when searching for “Samstag”.
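The German example can be reproduced with a toy dictionary-based splitter. This is a minimal sketch under stated assumptions, not how RBL decompounds: the vocabulary is hypothetical, the function name `decompound` is invented for illustration, and German linking elements (such as the connecting "s" in some compounds) are not handled.

```python
def decompound(word, vocabulary, min_part=3):
    """Split a compound into known parts, or return it whole."""
    lowered = word.lower()
    for cut in range(min_part, len(lowered) - min_part + 1):
        head, tail = lowered[:cut], lowered[cut:]
        if head not in vocabulary:
            continue
        if tail in vocabulary:
            return [head, tail]
        # Try to split the remainder further (three-part compounds).
        rest = decompound(tail, vocabulary, min_part)
        if len(rest) > 1:
            return [head] + rest
    return [lowered]


# Hypothetical vocabulary, for illustration only.
TOY_GERMAN_VOCABULARY = {"samstag", "morgen", "tag"}

parts = decompound("Samstagmorgen", TOY_GERMAN_VOCABULARY)
index_terms = ["samstagmorgen"] + parts
print(index_terms)
```

Indexing the parts alongside the full compound is one possible design: a query for "Samstag" then matches the document, while an exact query for the compound still finds it.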

Sentence Detection

The start and end of each sentence are identified automatically, even when punctuation use is ambiguous.
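The ambiguity is easy to see in a small rule-based sketch. The code below is an illustration only and does not reflect RBL's approach: it splits on sentence-final punctuation followed by a capitalized word, and skips a hypothetical list of abbreviations (`ABBREVIATIONS`). The function name `split_sentences` is likewise invented for this example.

```python
import re

# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "inc."}

# Sentence-final punctuation followed by whitespace and a capital letter.
BOUNDARY = re.compile(r"[.!?]+(?=\s+[A-Z])")


def split_sentences(text):
    """Split text into sentences with a simple punctuation heuristic."""
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        end = match.end()
        last_word = text[start:end].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # "Dr." and similar do not end a sentence
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample_text = "Dr. Smith tested version 2.5 on Monday. It worked! Was the index smaller?"
print(split_sentences(sample_text))
```

The period in "Dr." and the decimal point in "2.5" are not treated as boundaries, so the sample yields three sentences. A heuristic like this breaks down quickly on real text, which is why the ambiguity calls for more robust detection.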

