BasisTech Chief Scientist Pioneers Work on Partial Diacritization of Arabic
Automating diacritics in Arabic translations is a tricky process. Including the full range of available diacritics can slow reading: in fact, fully vowelized Arabic texts are considered too complicated for the ordinary reader. Conversely, a lack of diacritics often causes textual ambiguity.
In "How Much Does Lookahead Matter for Disambiguation? Partial Arabic Diacritization Case Study," BasisTech Chief Scientist Kfir Bar, Ph.D., argues for a middle path: partial diacritization of translations achieved through a machine learning model. This model applies diacritics (the marks appearing above or below a letter to indicate a change in pronunciation or meaning) to text only when necessary to resolve ambiguity. The research, conducted by Bar along with Saeed Esmail and Nachum Dershowitz, will soon be published in Computational Linguistics.
To determine how to apply partial diacritization to Arabic text, Bar and his research partners trained two neural networks to predict the need for diacritics. One network examined an entire sentence. The second considered only the text appearing before the word in question. Partial diacritization is obtained by retaining only those diacritics on which the two networks disagree.
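The selection rule described above can be illustrated with a minimal Python sketch. It is not the researchers' implementation: the function name, the per-letter prediction lists, and the choice to keep the full-sentence network's diacritic wherever the two networks disagree are all assumptions made for illustration.

```python
# Illustrative sketch only; names and data layout are hypothetical.
FATHA = "\u064E"  # Arabic short vowel "a"
DAMMA = "\u064F"  # Arabic short vowel "u"


def partial_diacritize(letters, full_sentence_preds, left_context_preds):
    """Keep a diacritic only where the two predictions disagree.

    letters: sequence of undiacritized base characters.
    full_sentence_preds: per-letter diacritics from the network that
        sees the entire sentence ("" means no diacritic).
    left_context_preds: per-letter diacritics from the network that
        sees only the preceding text.
    Assumption: on disagreement, the full-sentence prediction is kept.
    If that prediction is empty, nothing is added for the letter.
    """
    if not (len(letters) == len(full_sentence_preds) == len(left_context_preds)):
        raise ValueError("letters and both prediction lists must have equal length")

    out = []
    for letter, full, left in zip(letters, full_sentence_preds, left_context_preds):
        out.append(letter)
        if full != left and full:
            out.append(full)
    return "".join(out)


# Toy example: the three-letter skeleton k-t-b. The two hypothetical
# predictions agree on the first two letters and differ on the last.
word = "\u0643\u062A\u0628"
full_preds = [FATHA, FATHA, FATHA]
left_preds = [FATHA, FATHA, DAMMA]
result = partial_diacritize(word, full_preds, left_preds)
```

In this toy case only the final letter receives a mark, since that is the only position where the left-context prediction differs from the full-sentence one.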
This dual-network architecture both mimics linear human reading and models the impact of lookahead. By comparing the annotations of the two networks, researchers can identify which diacritic decisions depend on lookahead and which can be made from the preceding text alone. Bar’s research indicates that one-word lookahead has a dramatic impact on clarity; the returns from further lookahead diminish thereafter.
“Our goal is to improve reader comprehension of Arabic by automatically generating diacritics, but only when they provide strong insight to the reader,” Bar said. “We found that partial diacritization accomplishes this, and also improves translation quality compared to either the total absence of diacritics or a random selection of them.”
About BasisTech
Data analytics and machine learning are critical to verifying identity, understanding customers, anticipating world events, and uncovering crime. BasisTech provides businesses and governments with advanced analytics and AI-powered solutions for deriving insights from multilingual text, connecting data silos, and discovering digital evidence. Our Rosette text analytics platform employs classical machine learning and deep neural nets to extract meaningful information from unstructured data. Autopsy, our digital forensics platform, and Cyber Triage, our incident response tool, serve the needs of law enforcement, national security, and legal technologists. KonaSearch delivers deep search across Salesforce and other data sources.
Company headquarters are in Somerville, Mass., with offices in Washington, D.C., London, Tel Aviv, and Tokyo. For more information, visit basistech.com.