Customer Case Studies
Airbnb puts the “unique” into travel by connecting private owners renting out a room, apartment, castle, or villa to travelers staying a night, a week, or a month. This community marketplace of accommodations spans 34,000 cities and 190 countries. More than 17 million guests have over 800,000 listings to choose from. People can book accommodations or monetize extra space through the web or their mobile phone.
Given the international nature of its business, Airbnb’s Verified ID process helps users match names that originate in multiple languages, and in more than just the Roman A-to-Z alphabet.
The old adage “You can’t see the forest for the trees” applies to the acres and acres of data that overrun government agencies, legal teams, and e-discovery practitioners. The gardener in this case might be Equivio, whose business is managing data redundancy. Equivio’s software mimics human intuition by organizing sets of documents and emails in meaningful ways: grouping near-duplicate documents, reconstructing email threads, clustering by subject, search, language detection, data mining, and more.
The first step in sorting a document of any type is determining its language. For this critical step, Equivio relies on the Rosette Language Identifier, the leader in its area for wide language coverage (55 and counting!) and high performance to churn through terabytes of data. Unique to Rosette is its ability to identify multiple languages within a single document. For example, an email might be in French, but its disclaimer footer might be in English. A document might be in one language, but then quote from another document in a different language. Whatever the document, Rosette delivers dependable results quickly.
Following the Arab Spring—a series of populist upheavals in the Middle East from early 2011—government analysts in the Office of the Director of National Intelligence (ODNI) asked “Could we have foreseen these events?” That question became an initiative put forth by the Intelligence Advanced Research Projects Activity (IARPA) called the Open Source Indicators (OSI) Program, which challenged applicants “to develop methods for continuous, automated analysis of publicly available data in order to anticipate and/or detect significant societal events, such as political crises, humanitarian crises, mass violence, riots, mass migrations, disease outbreaks, economic instability, resource shortages, and responses to natural disasters.” Essentially to “beat the news.”
For Latin America, at least 60% of EMBERS’ alerts are generated from unstructured data: 35% from social media (including tweets) and 25% from news stories.
The Historise Social Monitoring system analyzes millions of social media posts and news publications around the world to predict the behavior of Internet users, and calculate their future interest in a particular topic. It provides the most complete picture of brand, person, and company mentions in social media with subsequent in-depth analysis of the data in real time. The expert system is able to identify weaknesses in the brand’s online reputation and make recommendations for improvement.
Incoming data is enhanced by attaching valuable metadata such as age, gender, and location of the author. Rosette Language Identifier detects the language of each item for linguistically appropriate enrichment and processing to make Historise’s search with Elasticsearch and Apache Nutch yet more accurate and comprehensive. With Rosette Entity Extractor, Historise also pulls out names of products, companies, and people, and can then deduce high-level topics (e.g. sports, banking). Rosette enables Historise to enrich social media content in its native language, thus recognizing the mention of a brand or person across many languages. And spelling variations, errors, nicknames, and other name variations are smoothed out with Rosette Name Indexer, ensuring that “Chas. Schwab” will be matched to “Charles Schwab.”
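The kind of name-variant matching described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `VARIANTS` table, the `normalize_name` and `same_name` helpers, and the 0.85 similarity threshold are hypothetical, and the sketch does not describe how Rosette Name Indexer works internally.

```python
import re
from difflib import SequenceMatcher

# Hypothetical lookup table mapping abbreviations and nicknames to full forms.
VARIANTS = {"chas": "charles", "wm": "william", "bill": "william", "bob": "robert"}


def normalize_name(name: str) -> str:
    """Lowercase the name, drop punctuation, and expand known variants."""
    tokens = re.findall(r"[^\W\d_]+", name.lower())  # runs of letters only
    return " ".join(VARIANTS.get(token, token) for token in tokens)


def same_name(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two names as a match if their normalized forms are near-identical.

    The threshold is an arbitrary value chosen for this example.
    """
    ratio = SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()
    return ratio >= threshold


print(same_name("Chas. Schwab", "Charles Schwab"))   # abbreviation expanded
print(same_name("Charls Schwab", "Charles Schwab"))  # misspelling tolerated
print(same_name("Carl Icahn", "Charles Schwab"))     # different person
```

A lookup table handles abbreviations and nicknames that no string-similarity measure would catch, while the similarity ratio absorbs small spelling errors. A production system would need far richer, language-aware data than this toy table.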
“We have developed the Historise Social Media Monitoring System for companies to better understand how consumers relate to their product and gain insight into how to improve the quality of their goods or services,” said Dmitry Baykov, CEO of Historise Ltd. “Basis Technology and its proven Rosette platform enable us to provide customers with the most comprehensive analysis of the opinions and mentions of a product, person or company on the Internet across many languages.”
Newsle is all about presenting you with a highly personalized news feed. Processing over a million articles a day from news sites and blogs, Newsle combs through publicly available information for news on your friends and colleagues. Newsle was launched in 2011 by then-Harvard University sophomores Axel Hansen and Jonah Varon, who benefited from Basis Technology’s startup program, which makes our best ideas and innovations accessible to high-impact, early-stage companies.
Newsle took advantage of the entity extraction technology of Rosette Entity Extractor to find names of people, places, and organizations within news feeds. Rosette enables Newsle to differentiate between the occurrences of “newton” in text such as:
The Fig Newton is a cookie named after Newton, Massachusetts, but a newton is a unit of force. Its name honors the English physicist and mathematician Isaac Newton, who laid the foundations for most of classical mechanics.
Pinterest is an online visual discovery tool helping people all over the world discover, collect, and share what they love. Users “pin” images, video, and other media from the Internet or their uploads. A collection of pins on a theme forms a “pinboard,” the basis for organizing a trip, sharing one’s passion, or organizing a wish list or event.
Pinterest’s global users expect to find what they are looking for, no matter the language. For languages such as Chinese, Japanese and Korean, which are written without spaces between words, it is particularly important to have linguistically intelligent text processing. Through Rosette Base Linguistics, Pinterest expands searches in CJK for more accurate, comprehensive results.
Non-linguistic methods, such as n-grams (dicing text into overlapping sequences of n characters), will allow indexing and searching, but will bloat an index, slow performance, and increase false positives. Consider a Japanese search for 東京都美術館 (“Tokyo Metropolitan Art Museum”). Morphological analysis yields 東京都 (“Tokyo”) and 美術館 (“art museum”). Bigramming yields 東京 (“Tokyo”), 京都 (“Kyoto”), 都美 (not a word), 美術 (“art”), and 術館 (not a word). When seeking images of the Tokyo Metropolitan Art Museum, Rosette ensures that art museums in Kyoto won’t be mixed in!
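The bigram behavior described above is easy to reproduce. The following minimal Python sketch uses a hypothetical helper, `char_ngrams`, to dice the example query into overlapping character bigrams. The second string, 京都美術館, is a made-up stand-in for a document about an art museum in Kyoto.

```python
def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Return every overlapping run of n characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


query = "東京都美術館"  # "Tokyo Metropolitan Art Museum"
query_bigrams = char_ngrams(query)
print(query_bigrams)
# ['東京', '京都', '都美', '美術', '術館']

# A hypothetical document about an art museum in Kyoto.
kyoto_doc = "京都美術館"
shared = set(query_bigrams) & set(char_ngrams(kyoto_doc))
print(sorted(shared))  # the bigrams that would make this document match
```

Four of the query's five bigrams also occur in the Kyoto string, which is how a bigram index ends up returning Kyoto results for a Tokyo query. The six-character query also produces five index terms where morphological analysis produces two, which illustrates the index bloat.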
E-commerce and publishing sites turn to SLI Systems for full-service site search, navigation, merchandising, and user-generated search engine optimization. Better accuracy and targeted searches increase customer satisfaction and help site visitors find the products and information they seek. SLI’s patented Learning Search technology learns from visitors’ behavior over time to deliver more relevant results. Simultaneously, this technology reduces costs and yields valuable customer information to support other marketing activities.
With accuracy so critical to its core mission, SLI chose Rosette Base Linguistics for linguistic support in multiple languages. Rosette broadens search coverage for a more comprehensive result set while reducing the number of irrelevant hits. Through a hybrid of methods (morphological analysis, dictionaries, and statistical analysis), Rosette enables search engines to deliver highly accurate and relevant results in 40 languages covering the Americas, Europe, Asia, and the Middle East.
StumbleUpon churns through the Internet helping users discover articles, media and images, based on their areas of interest and their thumbs-up and thumbs-down ratings of already-seen content. More than 25 million people turn to StumbleUpon to be informed, entertained and surprised by content and information recommended just for them. In addition, more than 75,000 brands, publishers and other marketers have used StumbleUpon’s Paid Discovery platform to tell their stories and promote their products and services.
To cater to its international audience, StumbleUpon uses Rosette Language Identifier to make sure English users see English content, Chinese users see Chinese content, and so on. For languages without reliable spaces between words, such as Chinese, Japanese, and Korean, Rosette Base Linguistics finds the words and normalizes them to enable more accurate and faster search results.
Buenos Aires-based Imagen Satelital S.A. is a subsidiary of Turner Broadcasting System with additional offices in São Paulo, Brazil and Miami Beach, Florida. It provides cable programming and distribution services through the television channels it owns and operates: Infinito, Space, Júpiter Comic, I-Sat, and Uniseries. Additionally, it offers signal distribution and programming services.
Given its diverse programming and audience, it maintains a database of subtitles in various languages. Imagen Satelital uses Basis Technology’s Rosette Language Identifier to detect the language of the subtitles so that the database can be searched. Rosette automatically identifies 55 languages by using a series of statistical profiles for each language, delivering both speed and accuracy.
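The idea of a statistical profile per language can be sketched in a few lines of Python. This is a toy illustration, not Rosette's method: the text above does not say what the profiles contain. The sketch assumes character-trigram counts compared by cosine similarity, and the one-sentence `SAMPLES` and the helper names are invented for the example.

```python
import math
from collections import Counter

# Tiny, made-up training samples; a real profile would use far more text.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y el gato duerme en la casa",
    "fr": "le renard brun rapide saute par-dessus le chien paresseux et le chat dort dans la maison",
}


def profile(text: str, n: int = 3) -> Counter:
    """Count the overlapping character n-grams in text (space-padded)."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[gram] * b[gram] for gram in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


PROFILES = {lang: profile(text) for lang, text in SAMPLES.items()}


def identify(text: str) -> str:
    """Return the language whose profile is most similar to the text."""
    candidate = profile(text)
    return max(PROFILES, key=lambda lang: cosine(candidate, PROFILES[lang]))


print(identify("the dog sleeps in the house"))
print(identify("el perro duerme en la casa"))
print(identify("le chat dort dans la maison"))
```

Because each profile is built once and each comparison is a handful of dictionary lookups, this style of identification is cheap per document, which fits a workload such as tagging a large subtitle database.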
CareerBuilder, the global leader in human capital solutions, operates the largest job board in the U.S. and has an extensive and growing global presence. The CareerBuilder.com content is the very definition of Big Text: mountains of structured and unstructured text data (resumes and job listings) in many languages. CareerBuilder’s mission is to empower employment, striving to organize the world’s human capital data and make it meaningful for society. Fundamental to this mission is delivering highly accurate and reliable search results to match the right people with the right jobs.
EMC Documentum products deliver the most comprehensive and highly integrated enterprise content management (ECM) platform for managing the entire information lifecycle with superior security and governance. Its Content Server OEM Edition is an open, extensible information infrastructure platform for managing information within the applications one develops, distributes, or hosts.
EMC adopted the open source search toolkit Apache Lucene for managing search and indexing within its content server. To support multilingual search for its customers worldwide, EMC integrated Basis Technology’s Rosette Linguistics Platform, a module that connects to Lucene out of the box. Through this module, applications have access to Rosette, which provides a full array of linguistic technology in 40 languages, such as language identification, morphological analysis, entity extraction, and name matching and translation. Industry-leading Rosette has been adopted by major search engines such as Google, Exalead/Dassault, Bing, Microsoft’s FAST, and Oracle (Endeca).
IPRO Tech designs scalable, easy-to-manage software for litigation support. It integrated Rosette linguistics into its e-discovery and document review product lines, including e-Capture, e-Review, and IPRO Eclipse.
IPRO utilizes Rosette for language identification, Unicode enablement, and text analysis of Chinese, Japanese, Korean, Arabic, and Persian. This powerful combination enables IPRO Eclipse to search and review thousands of documents across multiple languages.
“Multilingual legal discovery is increasingly becoming a major concern to the industry,” said Jim King, CEO of IPRO. “We partnered with Basis Technology to ensure that all documents demanded in litigation will be identified and analyzed by empowering our applications with the best linguistics-based solutions. This combination lowers the risk of missing critical evidence hiding in documents containing different languages.”
NCB Capital is the investment arm of National Commercial Bank, the first Saudi Arabian bank. As NCB Capital is the largest capital holdings bank in the Arab world, it is not surprising that the names in its customer databases number in the millions.
In looking for a practical and cost-effective way to comply with Know Your Customer regulatory standards, NCB Capital chose to integrate Rosette Name Translator into its compliance process to match Arabic names against English-language watch lists. Rosette enables NCB Capital to automatically translate names from Arabic into English with high accuracy to reduce the number of false positives when matching against watch lists.
Rosette Name Translator is a fully automatic translation engine for names of people, places, and organizations. Currently Rosette translates names between English and Arabic, Dari, Farsi, Pashto, Urdu, Chinese, Japanese, Korean, and Russian.
“We’re really impressed with Rosette Name Translator’s capabilities and how it has improved our OFAC name checking process,” said Peter Wilkinson, VP of Application Development at NCB Capital. “It translated 330,000 Arabic names into Roman script very quickly, consistently, and accurately.”
SAVO Group combines technology, expertise, and strategy to provide a collaborative sales enablement solution that links corporate initiatives to sales execution. Its award-winning on-demand Sales Enablement platform maximizes the sales organization’s ability to communicate value and differentiation in clear, consistent and compelling ways. It links to existing CRMs and fills in the gaps to ensure the sales team has all it needs at its fingertips: sales process, CRM, mobile access to information, sales content, RFP, metrics/reporting, social intelligence, etc.
To extend its software capabilities to Asia, SAVO Group turned to Basis Technology for Chinese, Japanese, and Korean support. These languages are notoriously challenging to handle for searching and indexing. They lack reliable spaces between words, and long acronyms or names, when segmented incorrectly, can result in searches for misleading words. With non-linguistic solutions, the phrase for “Tokyo Prefecture” (東京都) becomes “east” (東) and “Kyoto” (京都) if the string is misparsed. “Business trend” (景気動向) becomes 景 (not a word), 気動 (“air current”), and 向 (“direction”). But with Rosette powering the CJK enablement of SAVO’s sales enablement platform, users get only the comprehensive and accurate results they expect.
Socialgist is the leading global provider of social media data and the first official provider of data from Chinese microblogging platform Sina Weibo. Socialgist aggregates and normalizes the world’s social media data, providing up-to-the-minute access to enterprises seeking to analyze, monitor, and mine social media for business intelligence, evaluating ad campaigns, predicting and tracking stock patterns, and more. Socialgist secures and manages social data for the largest companies and largest brands in the world, so they don’t have to.
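To make the segmentation problem concrete, here is a minimal Python sketch of forward maximum matching, a simple dictionary-based technique chosen purely for illustration. It is not a description of how Rosette segments text, and the `segment` helper and both toy lexicons are assumptions made for this example.

```python
def segment(text: str, lexicon: set[str]) -> list[str]:
    """Greedy left-to-right segmentation: always take the longest known word.

    A single character is emitted as a fallback when no lexicon entry matches.
    """
    max_len = max(len(word) for word in lexicon)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Toy lexicon that knows the intended words.
FULL_LEXICON = {"東京都", "東京", "京都", "東", "景気", "動向", "気動", "向"}
# Toy lexicon missing "東京都" and "東京", to provoke the misparse.
SPARSE_LEXICON = {"京都", "東"}

print(segment("東京都", FULL_LEXICON))    # ['東京都']
print(segment("景気動向", FULL_LEXICON))  # ['景気', '動向']
print(segment("東京都", SPARSE_LEXICON))  # ['東', '京都']
```

With adequate vocabulary the greedy matcher keeps “Tokyo Prefecture” and “business trend” intact. With the sparse lexicon it reproduces the “east” plus “Kyoto” misparse, showing how much segmentation quality depends on the linguistic data behind it.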
To incorporate highly accurate multilingual search capabilities (Chinese, English, Japanese and Russian) into Sphinx, the open-source search engine, Socialgist chose Basis Technology’s Rosette. Rosette Base Linguistics (RBL) delivers faster and more accurate search results whether searching within a language or across multiple languages. It integrates seamlessly with open source platforms, like Sphinx, and a wide range of enterprise applications.
RBL helps Socialgist deliver more value to its enterprise customers by improving the quality and relevance of their search results. Languages such as Chinese and Japanese require specialized linguistic processing just to find words to index, but even English searches are boosted with linguistic processing that delivers a greater volume of highly relevant results.
Osaka-based Synergy Marketing helps companies leverage their internal customer data by supplementing it with the context of the outside world through the iNSIGHTBOX social intelligence database. The resulting analysis reveals greater insights into the needs of the customers, which would be impossible using enterprise data alone.
For the development of iNSIGHTBOX, Synergy Marketing used Rosette Base Linguistics to find the words in Japanese text, which is written without spaces between words. Rosette Entity Extractor then automatically scans web pages, emails, and documents, highlighting the key data points (personal names, place names, email addresses, URLs, etc.) to help iNSIGHTBOX connect internal and external content.
In August 2008, the Coalition government in Iraq was handing the reins to the fledgling Government of Iraq (GOI). This transition included moving the “Sons of Iraq” (SOI)—former Sunni insurgents paid by the U.S. to fight insurgents and al-Qaeda—to the payroll of the Shiite-heavy GOI. Government elements unhappy with this idea required that the names of all 93,000 SOI members be delivered to Iraqi officials in Arabic in four days.
The U.S. Forces agreed, but it was a tall order. Arabic names of SOI members had been poorly transliterated into English by Coalition field units. With only three native speakers assigned to the task, collecting the information from the field, organizing the data, and translating the names back to Arabic seemed impossible to do by the deadline.
Seeking linguistic support, the Coalition came across a copy of Basis Technology’s Transliteration Assistant (now called “Highlight”), which ran on Windows and could quickly and automatically transliterate names from Arabic to English. The team adapted the software’s name standardization function to convert the poor English transliterations into the closest matching Arabic name. Humans checked the output.
With the software, translators spent three seconds checking each name instead of 10 seconds translating it, more than tripling their throughput. As a result, and with a slight extension to the project deadline, the translators completed the job in seven days.
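The per-name timings above translate into a rough workload estimate. The Python sketch below uses only figures quoted in this case study (93,000 names, three native speakers, 10 seconds versus three seconds per name). It assumes the names were split evenly among the translators and ignores the time spent collecting and organizing the data.

```python
NAMES = 93_000            # SOI members to be listed
TRANSLATORS = 3           # native speakers assigned to the task
SECONDS_TRANSLATING = 10  # per name, by hand
SECONDS_CHECKING = 3      # per name, reviewing software output


def hours_per_translator(seconds_per_name: float) -> float:
    """Working hours each translator needs, assuming an even split."""
    return NAMES * seconds_per_name / TRANSLATORS / 3600


manual_hours = hours_per_translator(SECONDS_TRANSLATING)
assisted_hours = hours_per_translator(SECONDS_CHECKING)
speedup = SECONDS_TRANSLATING / SECONDS_CHECKING

print(f"by hand:       {manual_hours:.1f} hours per translator")    # 86.1
print(f"with software: {assisted_hours:.1f} hours per translator")  # 25.8
print(f"speedup:       {speedup:.2f}x")                             # 3.33x
```

Under these assumptions, translating by hand would demand about 86 hours from each translator, or more than 21 hours a day over four days. Checking software output needs about 26 hours each.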