Search many languages with high accuracy
Every language, including English, presents unique and difficult challenges for search applications that must deliver relevant and precise results. Rosette® Base Linguistics (RBL) enables enterprise applications to effectively search or process text in many languages by providing a complete set of linguistic services. RBL enriches the original text in its native language for best-in-class natural language processing, improving both speed and accuracy.
As linguistics experts with deep understanding at the intersection of language and technology, Basis Technology continually improves the Rosette product family with language additions, feature updates, and the latest innovations from the academic world.
- WESTERN EUROPE
- EASTERN EUROPE
- MIDDLE EAST
- Chinese, Simplified
- Chinese, Traditional
- Simple API
- High-scale and throughput
- Industrial-strength support
- Easy installation
- Flexible and customizable
- Java or C++
- Component of the Rosette SDK
- Customizable user dictionaries, Japanese orthographic normalization, and Chinese scripts
- Cloudera certified
Tokenization
Many search tools fall back on character bigrams to index languages written without spaces between words, such as Chinese and Japanese. This results in a larger index and reduced relevancy. RBL, in contrast, accurately identifies and separates each word through advanced statistical modeling. The resulting token output (the product of a process also known as segmentation) minimizes index size, enhances search accuracy, and increases relevancy.
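To make the difference concrete, here is a minimal sketch comparing character bigrams with word segmentation. It is not RBL's API or its statistical model: `longest_match_segment` and `TOY_VOCABULARY` are hypothetical stand-ins that use a greedy dictionary lookup purely for illustration.

```python
def char_bigrams(text):
    """Overlapping character bigrams, the fallback used for unsegmented text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def longest_match_segment(text, vocabulary):
    """Greedy longest-match segmentation against a toy vocabulary.

    Illustrative stand-in for a real segmenter; characters that are not
    covered by the vocabulary become single-character tokens.
    """
    max_len = max(len(word) for word in vocabulary)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical vocabulary; a production segmenter would not rely on a fixed list.
TOY_VOCABULARY = {"東京", "大学", "東京大学", "図書館"}

text = "東京大学図書館"  # "Tokyo University Library"
bigrams = char_bigrams(text)
words = longest_match_segment(text, TOY_VOCABULARY)

print(bigrams)  # ['東京', '京大', '大学', '学図', '図書', '書館']
print(words)    # ['東京大学', '図書館']
```

In this toy case the bigram index holds six terms where segmentation holds two, and it contains the spurious term 京大, a common abbreviation for Kyoto University, so a query for 京大 would wrongly match text about Tokyo University.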
Lemmatization
Most search engines rely on stemming, a crude method that chops characters off the end of a word in the hope of reaching its root form. Stemming often conflates unrelated words, inflating recall at the cost of precision. Instead, RBL finds the true dictionary form of each word, known as a lemma, by using vocabulary, context, and advanced morphological analysis. Indexing the lemma increases search relevancy and slims the search index, because the inflected forms do not all need to be indexed. Alternative lemmas are also made available to supplement indexing.
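The contrast between stemming and lemmatization can be sketched in a few lines. This is an illustration only, not RBL's implementation: `naive_stem` is a deliberately crude suffix stripper, and `TOY_LEMMAS` is a hypothetical lookup table that ignores the context a real lemmatizer would use.

```python
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Chop the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


# Hypothetical dictionary of inflected form -> lemma.
TOY_LEMMAS = {
    "ran": "run",
    "runs": "run",
    "running": "run",
    "mice": "mouse",
    "news": "news",
}


def lemma(word):
    """Look up the dictionary form; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)


for word in ("running", "ran", "mice", "news"):
    print(word, naive_stem(word), lemma(word))
# running runn run
# ran ran run
# mice mice mouse
# news new news
```

The stemmer misses irregular forms ("ran", "mice"), produces a non-word ("runn"), and conflates "news" with "new", while the lemma lookup maps every form of "run" to a single index term.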
Noun Phrase Extraction
Certain nouns, especially proper names, can be very tricky to identify as a single entity. RBL groups the nouns and their modifiers, which is useful in document clustering and concept extraction.
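As a rough illustration of grouping nouns with their modifiers, the sketch below chunks tokens that have already been tagged with Universal POS tags. The function name `noun_phrases` and the rule it applies (a run of adjectives and nouns that contains at least one noun) are assumptions made for this example; they do not describe how RBL extracts noun phrases.

```python
NOUN_TAGS = {"NOUN", "PROPN"}
PHRASE_TAGS = NOUN_TAGS | {"ADJ"}


def noun_phrases(tagged_tokens):
    """Group consecutive adjective/noun tokens into phrases.

    A run is kept only if it contains at least one noun or proper noun.
    """
    phrases = []
    current = []
    has_noun = False
    for word, tag in tagged_tokens:
        if tag in PHRASE_TAGS:
            current.append(word)
            has_noun = has_noun or tag in NOUN_TAGS
            continue
        if current and has_noun:
            phrases.append(" ".join(current))
        current, has_noun = [], False
    if current and has_noun:
        phrases.append(" ".join(current))
    return phrases


tagged = [
    ("The", "DET"), ("New", "PROPN"), ("York", "PROPN"), ("Stock", "PROPN"),
    ("Exchange", "PROPN"), ("opened", "VERB"), ("a", "DET"), ("new", "ADJ"),
    ("trading", "NOUN"), ("floor", "NOUN"),
]
print(noun_phrases(tagged))  # ['New York Stock Exchange', 'new trading floor']
```

Treating "New York Stock Exchange" as one unit, rather than four unrelated tokens, is what makes the phrase usable for clustering and concept extraction.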
Part of Speech Tagging
As part of the lemmatization process, statistical modeling is used to determine the correct part of speech, even for ambiguous words. Each token is then tagged for enhanced comprehension and search relevancy. Because grammar differs from language to language, the set of part-of-speech tags differs as well.
Rosette supports the Universal POS Tag standard, from which developers can map to Penn Treebank or other POS tag systems.
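A developer-side mapping of this kind might look like the sketch below. The table `UPOS_TO_PENN` and the helper `to_penn` are hypothetical names, not part of any Rosette API. Universal tags are coarser than Penn Treebank tags, so the mapping is lossy: it picks a default Penn tag for each universal tag and uses an optional `plural` flag (an assumption of this example) to recover one distinction.

```python
# Coarse Universal POS -> Penn Treebank defaults (illustrative and incomplete).
UPOS_TO_PENN = {
    "NOUN": "NN",
    "PROPN": "NNP",
    "ADJ": "JJ",
    "ADV": "RB",
    "DET": "DT",
    "ADP": "IN",
    "CCONJ": "CC",
    "NUM": "CD",
    "INTJ": "UH",
    "VERB": "VB",
    "PRON": "PRP",
}


def to_penn(upos, plural=False):
    """Return a default Penn Treebank tag for a Universal POS tag.

    Returns None for tags this toy table does not cover.
    """
    if plural and upos == "NOUN":
        return "NNS"
    if plural and upos == "PROPN":
        return "NNPS"
    return UPOS_TO_PENN.get(upos)


print(to_penn("NOUN"))               # NN
print(to_penn("NOUN", plural=True))  # NNS
print(to_penn("ADJ"))                # JJ
```

A fuller mapping would also need tense and other morphological information, for example to choose among the Penn verb tags.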
Decompounding
RBL breaks down compound words into sub-components and delivers each individual element to be indexed. This is especially useful for increasing search relevancy in languages such as German and Korean.
Samstagmorgen is a German compound word formed from Samstag (Saturday) and Morgen (morning). Decompounding allows a search for “Samstag” to match a document that contains only the compound.
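The Samstagmorgen example can be sketched with a simple dictionary-driven splitter. This is a toy under stated assumptions: `decompound` and `TOY_VOCABULARY` are hypothetical, the splitter ignores German linking elements, and it is not a description of how RBL decompounds words.

```python
def decompound(word, vocabulary, min_part=3):
    """Split a word into known parts, or return it whole if no full split exists."""
    lowered = word.lower()

    def split(rest):
        if not rest:
            return []
        # Try the longest known prefix first, then shorter ones.
        for end in range(len(rest), min_part - 1, -1):
            head = rest[:end]
            if head in vocabulary:
                tail = split(rest[end:])
                if tail is not None:
                    return [head] + tail
        return None

    parts = split(lowered)
    return parts if parts else [lowered]


# Hypothetical vocabulary of known simple words.
TOY_VOCABULARY = {"samstag", "morgen", "tag"}

parts = decompound("Samstagmorgen", TOY_VOCABULARY)
print(parts)  # ['samstag', 'morgen']

# Index the compound together with its parts so a query for either part matches.
index_terms = {"samstagmorgen", *parts}
print("samstag" in index_terms)  # True
```

Indexing the parts alongside the full compound is one possible design; it keeps exact matches on the compound while letting a query for a component find it too.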
Sentence Boundary Detection
The start and end of each sentence are identified automatically, even where punctuation is ambiguous.
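To show why punctuation alone is ambiguous, here is a minimal rule-based sketch. The abbreviation list and the `split_sentences` helper are assumptions for illustration; a production system such as RBL would not depend on a short hand-written list like this one.

```python
import re

# Hypothetical abbreviation list; periods after these do not end a sentence.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "inc."}


def split_sentences(text):
    """Split on ., ! or ? followed by whitespace or end of text, skipping abbreviations."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]+(?=\s|$)", text):
        end = match.end()
        last_word = text[start:end].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # e.g. the period in "Dr." is not a boundary
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = "Dr. Lee reviewed version 3.2 of the index. It was smaller! Was search faster?"
for sentence in split_sentences(sample):
    print(sentence)
# Dr. Lee reviewed version 3.2 of the index.
# It was smaller!
# Was search faster?
```

The period in "Dr." and the one in "3.2" are both skipped: the first by the abbreviation list, the second because it is not followed by whitespace.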