Rosette

Why “Vanilla” Search Can’t Handle the Nuances of Name Matching

 

Transcript of the webinar

Carlos Azeglio:
All right. So let’s start. Great to be here. My name is Carlos Azeglio. And I’m the business development lead for fintech at Basis Technology. And today I’m here with our senior solutions engineer, Patrick Deeb.
Pat Deeb:
Hey, everyone. How are you doing? Yeah. Sorry. Just lost power and internet. So dialing in now. Plan B.
Carlos Azeglio:
Always have a plan B and a plan C and a plan D. Great. So today we’re going to talk about a topic that’s top of mind for everybody. Why is it so difficult to get search algorithms to perform properly or the way we want them to? I mean, from my many years of content management experience, being able to connect data and link information about people and organizations is, I think, the underlying foundation for getting the right answers, especially when it comes to assessing risk. I mean, I can tell you first hand that connecting messy data is really, really tough, especially when all you have available is a name to work with.
Carlos Azeglio:
Now, Pat, on to you. So search as we know it, with search engines and the internet, has been around, I think, since the mid ’90s. That’s almost 30 years now.
Carlos Azeglio:
I know I’m dating myself. But, Pat, why is searching for names still such a problem? I’ll even call it a conundrum.
Pat Deeb:
Yeah. I mean, that’s a great question. And it’s why we’re here today. And really, the simple answer is that almost every kind of digital search that we perform in our day-to-day lives is based on what are called full-text search algorithms.
Pat Deeb:
And these are the ones that power internet searches, the searches you encounter when you shop online or in your business applications and even when you’re performing technical searches against your relational databases and other record sets. But full-text search algorithms are not built for name search. There are basic assumptions that full-text search algorithms make which simply don’t cover the distinct nuances and challenges that are posed by proper names.
Carlos Azeglio:
So that’s interesting. You said, “Nuances.” What exactly does that mean, “Nuances?”
Pat Deeb:
Yeah. So when I say, “Nuances,” so maybe it’s best to illustrate with an example. So start by imagining that you have a paragraph that’s a news article on the internet. And let’s equate that to a record in a database about a specific person.
Pat Deeb:
So full-text search algorithms that go against that news article are going to make certain assumptions. So one of the assumptions it’s going to make is that frequently occurring words are less important. So it uses an under-the-hood algorithm called TF-IDF, which is term frequency-inverse document frequency. It’s really concerned with how often, or how rarely, words actually occur in the documents. So in full-text search, words that occur with high frequency are considered less important as part of the search: your common words like “the” and “a” and “very” and “with,” and pronouns like “we.”
Pat Deeb:
Whereas, in a names database, let’s say the name “William” appears 100 times. And let’s say… let me think of a variation to that. “Wilson” appears once. For name search, “William” should be just as important to the search algorithm as “Wilson,” because they’re both accurate representations of a person’s name.
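The frequency assumption described here can be sketched in a few lines of Python. This is a minimal illustration of the textbook inverse-document-frequency formula, not the scoring of any particular search engine (production engines typically use smoothed variants). The toy `records` list and the `idf` helper are assumptions made up for the example.

```python
import math

# Hypothetical toy corpus: each "document" is a single name record.
# "william" appears in 100 records, "wilson" in only one.
records = ["william"] * 100 + ["wilson"]

def idf(term, docs):
    """Textbook inverse document frequency: log(N / docs containing term)."""
    containing = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / containing)

idf_william = idf("william", records)  # close to 0: treated as unimportant
idf_wilson = idf("wilson", records)    # much larger: treated as important

print(f"william: {idf_william:.4f}")
print(f"wilson:  {idf_wilson:.4f}")
```

Under this weighting the common name contributes almost nothing to the relevance score, which is reasonable for a word like “the” but not for a name that happens to be popular.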
Pat Deeb:
Another reason, another nuance is that typos in full-text search really aren’t that big of a deal. For your typical common words that occur in an article, little typos aren’t going to have a huge impact on the search results, right? And spelling errors in a typical document are less of an issue, since a single word in a document is such a small percentage of the overall data. But for a name, that can be 50% or 100% of what you’re looking for. With a name, a single misspelled token can have a substantial downstream effect that could prevent entire entries from being returned in your search.
Carlos Azeglio:
Right. But, Pat, what about spell check that we’re so accustomed to, right? Every time you try to type something that might be a little off, the system tries to correct us in many ways, like, “John” with J-O-H-N versus “Jhon,” J-H-O-N. The search engines usually pick that up. And they tell me I got it wrong.
Pat Deeb:
Right. Yeah. And that’s a good point. And again, spell check is good for typical text queries. And it works well in its intent. But it doesn’t hold well for names, unfortunately.
Pat Deeb:
And really, it’s because there’s [inaudible 00:04:29] handful. If I think of the name “Cindy,” I can think right off the top of my head six or seven different ways that you could spell “Cindy” with C-I and C-Y and different variations of S-Y and S-I. And that’s an example where if you try to normalize with something like a spell check, it could take away the actual name variation that identifies the person that you may be searching for.
Carlos Azeglio:
Right. That’s interesting, because I actually experience this quite often with my wife’s name. Her official spelling is L-O-U-R-I-E for “Laurie.” But it usually gets mixed up with the A version or just the O version. So that happens all the time. It’s interesting.
Pat Deeb:
Yeah. Exactly. And even with normal fuzzy matching, it’s not going to understand that. In a typical full-text search, if you search on L-O-R-I, you’re not going to get a hit on L-O-U-R-I-E. And it also extends to, for example, transliterated names.
Pat Deeb:
Out in the wild, there’s not really a standard way to write an Arabic name in English or even a standard agreement on where to put the spaces. And that applies to Chinese and Japanese, Vietnamese, Korean as well. So that adds a whole new level of complexity to the name variations challenge.
Carlos Azeglio:
Interesting. No. Thank you, Pat. Now back to typos and names that sound alike, can’t you just set up your search algorithms to do that for you?
Pat Deeb:
Well, so what we call fuzzy text matching is definitely not the same thing as fuzzy name matching. So fuzzy text matching, or fuzzy string matching as it’s called in computer science and data science… and even called fuzzy logic as a shorter way to say it, is based almost entirely around the idea of insertions, deletions and substitutions of characters. So almost all of the fuzzy text matching algorithms leveraged by full-text search are based on these three operations.
Pat Deeb:
So what does that mean? So what happens in the logic is that it makes mathematical computations on positions of the word’s letters and how these letters are inserted, deleted and/or substituted within a word versus the word or words it’s being compared to; a.k.a. the word or words that you’re searching on, right? But there are so many phenomena specific to proper names that are simply not covered by these operations.
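The insertion, deletion and substitution idea is commonly formalized as Levenshtein edit distance. The sketch below is a plain textbook implementation, offered only to make the three operations concrete. It is not the code of any specific search engine, and the function name `edit_distance` is chosen for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions that turn `a` into `b`."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca -> cb (or keep it)
            ))
        prev = curr
    return prev[-1]

print(edit_distance("lori", "lourie"))  # 2: insert "u", insert "e"
print(edit_distance("lori", "lord"))    # 1: substitute "i" -> "d"
print(edit_distance("john", "jhon"))    # 2: a swap costs two substitutions
```

By character edits alone, “lori” is closer to the ordinary word “lord” than to the name variant “lourie,” which is one way to see why string distance by itself is a weak signal for names.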
Pat Deeb:
So while insertions, deletions and substitutions of characters are certainly components included as part of a good fuzzy name matching algorithm, there’s so much more to be considered in a complete name matching algorithm. Just to throw a few out there, you have nicknames. You have initials, titles and honorifics, shortened versions of the names. There are the aforementioned spelling variations, missing components, gender differences.
Pat Deeb:
The list goes on. So you can try to match names using a standard search algorithm. But it gets really complex and ugly both for the setup to make it where this text is searchable and also for the actual search execution.
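To make a few of these phenomena concrete, here is a deliberately tiny, rules-style sketch that handles titles, nicknames and initials. The lookup tables and function names are hypothetical and exist only for illustration. Even this toy version hints at how quickly hand-written rules pile up once you try to cover more than a handful of cases.

```python
# Hypothetical, tiny lookup tables for illustration only.
NICKNAMES = {"bill": "william", "will": "william", "bob": "robert"}
TITLES = {"mr", "mrs", "ms", "dr", "prof"}

def normalize(name):
    """Lowercase, drop punctuation and titles, expand known nicknames."""
    tokens = [t.strip(".,").lower() for t in name.split()]
    tokens = [t for t in tokens if t and t not in TITLES]
    return [NICKNAMES.get(t, t) for t in tokens]

def token_match(a, b):
    """Exact match, or a single letter matching as an initial."""
    if a == b:
        return True
    return (len(a) == 1 and b.startswith(a)) or (len(b) == 1 and a.startswith(b))

def names_match(x, y):
    """Compare two names token by token after normalization."""
    tx, ty = normalize(x), normalize(y)
    return len(tx) == len(ty) and all(token_match(a, b) for a, b in zip(tx, ty))

print(names_match("Dr. Bill Smith", "William Smith"))  # True: title + nickname
print(names_match("W. Smith", "William Smith"))        # True: initial
print(names_match("Bill Smith", "Jill Smith"))         # False
```

Note what is still missing: spelling variants, reordered or missing components, transliteration, and any notion of a graded score rather than a yes/no answer.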
Carlos Azeglio:
Okay. Great. Thanks, Pat. Now, okay. Changing gears a bit now. So given all the advancements in technology, what new strategies are seeing the most success? Basically, what does the before and after look like?
Carlos Azeglio:
I mean, I know from my own experience with rules-based systems, they tend to be really top heavy and really hard to maintain. I always equate them to a game of Jenga, right? So you’re sliding in rules. You’re sliding out rules.
Carlos Azeglio:
And at some point, the thing becomes so top heavy it just keels over. I remember even in a past life where there was only one person who knew exactly how to untangle this thing, right? So imagine job security, how good it was for this person.
Pat Deeb:
Right. Sure. Yeah. So, yeah. Great question. So 20 to 30 years ago, the go-to method for name search was creating these huge, static lists of name variations for every name on a watch list. So for just the name “Abdul Rasheed,” you might have hundreds of variations. And then you multiply that by millions of names. And you get an idea of the massive computing power that’s needed, as well as performance that would be far from what’s needed for someone who’s trying to do real-time screening.
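A rough back-of-the-envelope sketch shows how such lists grow. The variant spellings, separators and watch list size below are all hypothetical, picked only to illustrate the multiplication. Real variant lists would be built differently and would be far larger.

```python
from itertools import product

# Hypothetical spelling variants for each component of one name.
given_variants = ["abdul", "abdel", "abd al", "abd el", "abdoul"]
family_variants = ["rasheed", "rashid", "rachid", "rasheid", "rashed", "rashyd"]
separators = [" ", "-", ""]  # where (and whether) to put the space

# Every combination of given name, separator and family name.
variants = {g + s + f for g, s, f in product(given_variants, separators, family_variants)}

watchlist_size = 1_000_000  # assumed size, for illustration
print(len(variants))                   # 90 variants for a single name
print(len(variants) * watchlist_size)  # 90,000,000 stored strings
```

Five, six and three options per slot already give 90 strings for one name; adding more spellings or a third name component multiplies the total again, and every entry has to be stored, searched and kept up to date.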
Pat Deeb:
And even with that, you still might not find a match on your search. So this standard text search was the system that the US Customs and Border Protection were using at the time of the Boston Marathon bombing in 2013. And I don’t know how many are familiar.
Pat Deeb:
But many months before the incident, there was an FBI alert for Tamerlan Tsarnaev. But the alert spelled his name with a Y, which was different than his passport. It was spelled differently. And so he was not detained when he passed through Boston Logan Airport.
Pat Deeb:
You can see the importance of something just as simple as one letter. So AI is now heavily used in this area. And it’s best suited to handle the complexity that name matching demands.
Carlos Azeglio:
So, Pat, you mentioned AI. Now, that seems to be the trendiest word around, right? You hear that everywhere.
Carlos Azeglio:
How does AI really improve name matching? How do you see that? [crosstalk 00:09:57]?
Pat Deeb:
Yeah. I know. And you’re right. “AI” is almost overused. And I’d like to start to trend toward something that we want to start saying here, “HI,” hybrid intelligence.
Pat Deeb:
But in the meantime, AI, artificial intelligence, it can dynamically consider all the key ways that names vary and not just one or two at a time. So you need this artificial intelligence because it aligns the data in a way that’s more effective than rules. So when that AI is a critical component of a name matching solution, you not only replace the need for all these static lists and rules but you also benefit from trained models per language that can leverage that dynamic consideration and identify possible name variations in real time instead of having to iterate over this big static list, which, by the way, needs to be maintained and kept updated, right?
Pat Deeb:
So one example of this kind of software is Rosette Name Indexer, which, by the way, is what the US Customs and Border Protection uses now for border security. And they brought Rosette into use as a result of the Boston Marathon bombing. So you plug Rosette into whatever system or search engine you’re running just to do your name searching.
Pat Deeb:
And Rosette understands names like a person does. It has all the built-in algorithms, which consequently minimizes false positives and false negatives. And in addition, it’s fast and scalable and lightweight and doesn’t add a huge load to your existing systems.
Carlos Azeglio:
Right. Now, Pat, as an engineer, you know that there’s a lot of open source options out there, right? Now, why couldn’t I just get some clever engineers to build it out?
Pat Deeb:
Yeah. Sure. And, I mean, it’s a great question. And I’ve seen it attempted. And it’s interesting, because the attempts are made. And then a lot of folks end up coming our way.
Pat Deeb:
But the AI part of the puzzle is really what makes it complex to just have a group of engineers attempt to come up with logic to handle the fuzziness of proper names. And what we have behind our product is several things that… several items, several bullet points that you can’t have by putting a group of engineers together and telling them to build something over a matter of six months or a year. And the biggest one for us is the time and the R&D that we have invested in Rosette.
Pat Deeb:
It has 25-plus years of investment in R&D and build and client feedback all built into the process. And it’s all factoring in real-world scenarios. So that’s something that absolutely can’t be replaced by several engineers going in a room and trying to throw this together.
Pat Deeb:
In addition, the name variety. It handles all kinds of name complexities that most engines ignore. And it handles multiple languages.
Pat Deeb:
It’s transparent, which helps with compliance and regulation as far as explaining why a match is what it is. And it’s ultra fast with performance and scalability and accuracy. So putting it into your system, which does not take much time at all, brings an immediate return on your investment.
Carlos Azeglio:
Great. Wow. I would have really loved to have these capabilities from years ago when I started out. So it’s good to know that what’s coming on the horizon is a step ahead of everything else that we’ve seen in the past.
Pat Deeb:
Yeah. I hear that a lot. And I’m really proud of the impact we’ve made with this software within the intelligence community, counter terrorism, financial crimes, just to name a few.
Carlos Azeglio:
Great. Okay. So I think that’s… thank you very much, Pat. This has been very useful, very helpful hopefully for all of us, all the participants. Now we’re going to try to look and see what kind of questions are coming in.
Carlos Azeglio:
So let’s see. Let me look at the questions. Let me start with one of these.
Carlos Azeglio:
Okay. So here’s a question, Pat. Everybody talks about AI. Doesn’t AI tend to be black box? How do you know what’s happening under the hood?
Pat Deeb:
Yes. AI does tend to be black box. But what’s great about Rosette is the transparency and the explainability. So we can get extremely granular in explaining why a match is what it is, why we’ve assigned a match score to what it is.
Pat Deeb:
But for a typical end user, it doesn’t have to get that granular. It can actually be exposed as a couple of just general ideas as to why something matched the way that it did. But for data science folks and for folks that really have to adhere to compliance and regulations, there’s no black box here. We fully explain why a match is what it is and why a score is what it is with Rosette.
Carlos Azeglio:
Great. Another question just came in. “Is Rosette available for Elastic in the cloud?” That’s the question.
Pat Deeb:
Very good question. So, yes, with some caveats. It is available for Elastic in the cloud right now. And I think Elastic is working on making this a little bit easier.
Pat Deeb:
But Elastic in the cloud, if you are on their Platinum Support Plan, you can absolutely use our plugin with Elasticsearch. The reason you have to be on the Elastic Support Plan is because you do have to engage support to get the plugin installed in Elastic’s cloud because of the size of it. I do believe that they may be making strides to make that a little bit easier and a little bit more cost effective. But in the meantime, it is definitely supported. You just have to be on their Platinum Support Plan.
Carlos Azeglio:
Okay. Great. Okay. We actually have quite a few questions coming in. So this is great.
Carlos Azeglio:
Thank you so much for asking questions. Okay. So here’s another one. “What type of AI is it?”
Pat Deeb:
What type of AI is it? It is trained models per language underneath the hood. So it’s a trifecta of different things.
Pat Deeb:
There are lists. And there are sets of rules under there. But we have models that have been trained per language to allow the algorithm to make decisions in real time about whether or not something matches. So that’s where the AI comes in.
Carlos Azeglio:
So you mentioned languages. I just wanted to add something. So if I had a name in Chinese characters and then the equivalent of that name was in my database where my search target was in Latin-English characters, how would Rosette handle that?
Pat Deeb:
Oh. It would handle it beautifully. The cross-lingual capabilities of Rosette are one of the many things that allow us to stand out from any of the competition. So, yeah. If you have a name that’s in Chinese script and that name that you’re looking for happens to be in Latin script in your database, it will match against it because we have cross-lingual matching capabilities.
Carlos Azeglio:
Okay. Great. I have another great question here. So, “How exactly can name matching be combined with other kinds of match, address or fuzzy address?” So let’s say I’m thinking if you have multiple attributes, right? So the name is one critical attribute. So what if you had also another one you wanted to add to the mix, like addresses, in combination?
Pat Deeb:
Yeah. That’s a great question. And actually, we’ve been talking about name matching all this time. That’s just one component of what Rosette does.
Pat Deeb:
Rosette also does date matching. It does address matching. It does location matching.
Pat Deeb:
And so the more attributes that you have that you enter in your search, the better match you’re going to get. And the address and date matching also have this fuzzy AI element to them that allows you to get relevant results even if you’re not putting in the exact date or the exact address that you’re looking for. But we have many, many cases where name matching, address matching and date matching are all being used in the same use case to allow for even more relevant matching because the more attributes you pass in, the more likely you are to hit a match.
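One generic way to picture combining attributes is a weighted sum of per-field similarity scores. The sketch below is an assumption-laden illustration, not a description of how Rosette scores records. The weights, the one-year date tolerance and the use of Python’s standard `difflib.SequenceMatcher` as a stand-in similarity measure are all choices made for the example.

```python
from datetime import date
from difflib import SequenceMatcher

# Assumed weights for illustration; a real system would tune these.
WEIGHTS = {"name": 0.6, "address": 0.2, "dob": 0.2}

def text_sim(a, b):
    """Stand-in string similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def date_sim(a, b, tolerance_days=365):
    """1.0 for identical dates, falling linearly to 0.0 at the tolerance."""
    gap = abs((a - b).days)
    return max(0.0, 1.0 - gap / tolerance_days)

def record_score(query, candidate):
    """Return the weighted total and the per-attribute scores behind it."""
    parts = {
        "name": text_sim(query["name"], candidate["name"]),
        "address": text_sim(query["address"], candidate["address"]),
        "dob": date_sim(query["dob"], candidate["dob"]),
    }
    total = sum(WEIGHTS[k] * parts[k] for k in WEIGHTS)
    return total, parts

query = {"name": "Lori Smith", "address": "12 Main St", "dob": date(1980, 5, 1)}
close = {"name": "Lourie Smith", "address": "12 Main Street", "dob": date(1980, 5, 2)}
total, parts = record_score(query, close)
print(round(total, 3), parts)
```

Returning the per-attribute scores alongside the total also gives a simple way to show a user why a record matched.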
Carlos Azeglio:
Okay. Great. Okay. So I think that’s all I see here. There was one more question that came in. I’m not sure how clear I am on it. But let’s attempt this, Pat.
Pat Deeb:
Okay.
Carlos Azeglio:
It says, “Can you explain what technology in details? What if we don’t have test samples?” So let’s interpret this as, what if you’re using our system and you don’t have test samples? You just want to base it on the accuracy that’s already built into the system, right?
Carlos Azeglio:
So the system, I’m assuming, comes with a default setting. And it has a certain amount of accuracy and confidence within that default setting. Maybe, can you extrapolate on that a bit?
Pat Deeb:
Yeah. So if I understand the question correctly, if they don’t have test samples but they want to test on how the matching looks?
Carlos Azeglio:
Yeah. Let’s assume that.
Pat Deeb:
Okay. Well, so we have several pieces of public data, several public datasets that we like to suggest for people to use in test scenarios just like this. There is the OFAC list, which is a very popular one. There’s voter registration data from different states.
Pat Deeb:
What we typically suggest for that is to take a public dataset like that and to ingest it into Elasticsearch, for example, and run our plugin against that. What you want to do, obviously, is you want to understand as much about a sample of that dataset as you can so you know what you’re searching against. And then you can start with your name variations to search against to see what the different types of returned results look like and make tweaks to the parameters and tuning to see how that affects the score.
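The tune-and-observe loop described here can be sketched as a small threshold sweep over hand-labeled name pairs. Everything in the example is hypothetical: the pairs, the thresholds, and the `score` function, which uses Python’s standard `difflib.SequenceMatcher` purely as a placeholder for whatever matcher is actually under test.

```python
from difflib import SequenceMatcher

# Hypothetical hand-labeled pairs: (query, candidate, should_match)
labeled_pairs = [
    ("Lori Smith", "Lourie Smith", True),
    ("Abdul Rasheed", "Abdel Rashid", True),
    ("Bill Jones", "William Jones", True),
    ("Cindy Lee", "Sandy Lee", False),
    ("John Park", "Joan Clark", False),
]

def score(a, b):
    """Placeholder similarity in [0, 1]; swap in the matcher being tested."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def precision_recall(pairs, threshold):
    """Treat score >= threshold as a predicted match and count outcomes."""
    tp = fp = fn = 0
    for a, b, is_match in pairs:
        predicted = score(a, b) >= threshold
        if predicted and is_match:
            tp += 1
        elif predicted and not is_match:
            fp += 1
        elif not predicted and is_match:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

results = {t: precision_recall(labeled_pairs, t) for t in (0.0, 0.5, 0.7, 0.9)}
for threshold, (p, r) in results.items():
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

Raising the threshold generally trades false positives for false negatives, so looking at both numbers across several settings shows what a given tuning change costs.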
Carlos Azeglio:
Okay. Great. Thank you. Okay. So let me just double-check. I think that’s all for the questions.
Carlos Azeglio:
Let me see. Is there another one here? Oh. Okay. So one more question here, I see. Actually, there’s two more. “Is Rosette only a name matching solution or also a variant generation engine?”
Pat Deeb:
For lack of a better way to put it, variant generation’s a little bit of an antiquated approach at name matching. That’s a way that it was done. I had mentioned earlier how things have been done over the last 20 to 30 years. That was one way of doing it when I spoke about these long lists of name variations.
Pat Deeb:
Rosette is not meant to do it that way. Rosette, we like to say, is meant to do it the best way. And so, no. It’s not something that generates several name variants because that’s not what it’s designed to do.
Carlos Azeglio:
Okay. Great. All right. The last question was just, “Can we see a demo?” And of course, the answer would be, yes. I mean, I can’t do it today, obviously.
Carlos Azeglio:
But, yeah. Please reach out to us. Contact us directly. And we can arrange that for you.
Carlos Azeglio:
All right. So that concludes our discussion and our questions. I’m going to close this out. I want to thank everyone so much for your patience and joining us.
Carlos Azeglio:
I know we had some difficulties today. So I really thank you for your patience. Hopefully, you found this useful.
Carlos Azeglio:
I’ve mentioned a few times already there’s a super quick survey at the end. So if you don’t mind just answering that, it should take just about a minute to fill out. But we really appreciate you coming to join us today. So thank you very much. And have a great day. Okay. Goodbye.
Pat Deeb:
Thanks, everyone. Bye-bye.
Carlos Azeglio:
Thank you.