Accurate & adaptable statistical entity extraction
Rosette® Entity Extractor (REX) delivers structure, clarity, and insight by revealing the key information (names, places, organizations, products, and other words and phrases) hidden within large volumes of unstructured Big Text.
REX is the foundation for applications in eDiscovery, social media analysis, financial compliance, and government intelligence. The effectiveness of these mission-critical applications depends on REX’s accuracy, robustness, and ability to find entities across many languages.
By nature, statistically trained models are most accurate on the type of data they were trained on. In addition to being trained on a wide range of text beyond news articles, REX is unique among named entity recognition software in its adaptability. REX’s field training mechanism lets you add your own text data to the entity extraction model, increasing REX’s accuracy on your text.
- Component of the Rosette SDK
- Simple API
- Fast and scalable
- Industrial-strength support
- Easy installation
- Flexible and customizable
- Java or C++
- Unix, Linux, Mac, Windows
Statistical Entity Extraction
Statistical modeling with advanced linguistics solves three major problems:
- It finds entities that cannot be exhaustively listed.
- It finds entities that are not yet known.
- It considers context, so that place names (Newton, MA) are not confused with personal names (Isaac Newton).
Because of these problems, entity extraction for people, organizations, products, and locations can only be accomplished with a statistical model that has been trained on millions of news and blog articles and has learned the contexts in which these entities appear.
Field Training for Increased Accuracy
For users with text that is particularly challenging in format, style, or vocabulary, REX’s unique field training capability has multiple mechanisms to adapt its statistical model to their data. Users simply add a quantity of their own data (unannotated or annotated) and rebuild the model for maximum accuracy.
Rules expressed as regular expressions find entities which follow a pattern, such as dates, times, and email addresses. Many standard string patterns are included with REX; customers can customize by editing or adding their own rules, based on their specific needs.
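To illustrate the general idea of pattern-based extraction, here is a minimal Python sketch. It is not REX's rule format or API: the rule table `RULES`, the function `extract_by_rules`, and the three patterns are hypothetical and deliberately simplistic (ISO-style dates, 24-hour times, and basic email addresses only).

```python
import re

# Hypothetical rule table (entity type -> pattern); NOT the rules shipped with REX.
RULES = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),           # e.g. 2015-03-09
    "TIME": re.compile(r"\b(?:[01]?\d|2[0-3]):[0-5]\d\b"),  # e.g. 17:30
}

def extract_by_rules(text, rules=RULES):
    """Return (type, matched text, start, end) for every rule match, in text order."""
    hits = []
    for entity_type, pattern in rules.items():
        for m in pattern.finditer(text):
            hits.append((entity_type, m.group(), m.start(), m.end()))
    return sorted(hits, key=lambda hit: hit[2])

sample = "Mail jo@example.com before 17:30 on 2015-03-09."
for hit in extract_by_rules(sample):
    print(hit)
```

Customizing in this sketch amounts to editing an entry in the table or adding a new one, which mirrors the edit-or-add workflow described above.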
Custom Entity Lists
Custom lists are helpful when users know that specific words or phrases in their data are almost never misspelled and always refer to the same thing (i.e., are unambiguous). An example is a list of basic colors such as red, yellow, and green for tweets mentioning a clothing manufacturer. REX comes with such lists for entity types like religions and nationalities. To identify specific entities that have variant names or might be ambiguous, like the various Presidents Bush, REX should be combined with the Rosette Entity Resolver (RES).
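The list-matching idea can be sketched in a few lines of Python. This illustrates the technique only and is not how REX loads or applies custom lists: `COLOR_LIST`, `build_matcher`, and `extract_from_list` are hypothetical names, and the sketch assumes list entries are stored in lowercase.

```python
import re

# Hypothetical custom list (lowercase surface form -> entity type).
COLOR_LIST = {"red": "COLOR", "yellow": "COLOR", "green": "COLOR"}

def build_matcher(entries):
    """Compile one case-insensitive, whole-word pattern covering every list entry."""
    # Longest entries first, so a multiword phrase wins over its own prefix.
    alternatives = sorted(entries, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(a) for a in alternatives) + r")\b"
    return re.compile(pattern, re.IGNORECASE)

def extract_from_list(text, entries):
    """Return (type, matched text, start offset) for each list entry found in text."""
    matcher = build_matcher(entries)
    return [
        (entries[m.group().lower()], m.group(), m.start())
        for m in matcher.finditer(text)
    ]

tweet = "Loving the new Red jacket, but the green one is sold out."
print(extract_from_list(tweet, COLOR_LIST))
```

Exact matching like this works only because the entries are assumed to be unambiguous and consistently spelled. A word such as "sacred" is not matched, thanks to the word boundaries, but a misspelling such as "grren" would be missed.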
- Chinese, Simplified
- Chinese, Traditional
- Credit Card Number
- Geographic Coordinate
- Generic Number
- Personal ID Number
- Phone Number
- Email Address/URL
Entity extraction is often used in combination with other text analytics to solve a specific problem.
- For multilingual entity analytics, combine REX with Rosette Language Identifier (RLI) to identify the language of each document and Rosette Entity Resolver (RES) to collect every mention of a person, location, or organization across all of them
- For searching documents by the names mentioned in them, combine REX with Rosette Name Indexer (RNI) for fuzzy name search
- For multilingual document triage, combine REX with Rosette Name Translator (RNT) to translate the entities extracted from non-English documents into English