Data Science in Business: Precision and Recall
In my blog post on imbalanced data, I introduced some ways to measure the effectiveness of predictions for rare events, specifically the metrics precision and recall. Now, these terms are common jargon in the data science community, but can feel confusing to someone just stepping into the field.
I have put together an imaginary case study that illustrates how these concepts can be applied in a business context to capture and weigh competing business priorities.
Credit Card Fraud
Let’s take an imaginary bank; we’ll call it ExampleBank.
As part of their growth strategy, ExampleBank have entered into a new partnership with a major credit card company. Their analysis shows that the proportion of fraudulent transactions in recent years has been almost double the national average, at about 0.2%.
Lucy is the business strategist at ExampleBank in charge of asset security. She is dissatisfied with the results of the rules-based model ExampleBank is using to prevent credit card fraud. She has heard a lot about machine learning and wants to find out whether machine learning could help ExampleBank get better results.
To begin, she has a few questions about the performance of the current model.
“So, the fraud defense unit tells me the model has an accuracy of 98%. If that’s true, why are we not catching all of those fraudsters?”
Ehsan and Janet explain.
“We have put together a more detailed report on the current model,” Ehsan says, advancing a slide.
“It breaks down exactly what happens. The accuracy of the model might be 98%, but its precision is about 20% and its recall is about 52%.”
Janet points at the line with precision.
“This means that out of all the cases the current model classifies as fraud, only about 20% are actually fraud. And this,” she shifts down to the line with recall, “means that the model only catches about half of the fraud cases that occur.”
“Wait,” Lucy holds up a hand. “It catches only half? But it’s 98% accurate? How can that be?”
“Accuracy also counts how many valid transactions the model classifies correctly, and that number is really large,” Ehsan replies.
“Yes,” Janet agrees. “Accuracy is like a headline. It sums things up, and that can miss detail.”
“Okay. I see how that works. But how do we get more detail? I don’t want to do a whole lot of math when looking at a result or presenting my findings to the board.”
“We’ll look at what the model does from a different angle,” Janet explains while Ehsan prepares another slide.
“A precision of 20% means that for each fraud, we also falsely suspect 4 customers of fraud.”
“And a recall of about 50% means that we catch half of the fraud cases out there.”
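(For readers following along in code, here is a minimal sketch of the distinction the team is describing. The confusion-matrix counts are hypothetical, chosen to be consistent with the story’s 0.2% fraud rate and the quoted precision and recall; the exact counts behind the slide’s 98% accuracy aren’t given, and with these counts accuracy comes out even higher, which only underlines Ehsan’s point.)

```python
# Hypothetical counts for 100,000 transactions at a 0.2% fraud rate,
# chosen to match precision ~ 20% and recall ~ 52%.
tp = 104      # fraud cases the model catches
fn = 96       # fraud cases it misses (200 frauds in total)
fp = 416      # honest customers falsely flagged
tn = 99_384   # honest transactions correctly left alone

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"{accuracy:.1%} {precision:.0%} {recall:.0%}")  # 99.5% 20% 52%
```

The huge block of true negatives dominates the accuracy figure, which is why a model can be “98% accurate” while missing half the fraud.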
“Okay,” Lucy agrees. “I’ve got to hand it to you, that’s a lot more solid information than 98% accuracy. But I still don’t really get it. Why can’t we just catch all the fraud and not bother any customers?”
“I wish we could. We could just suspect everyone of fraud, and that would catch all the fraudsters, but we probably would not have any customers left after that.”
“What Ehsan means,” Janet says, “is that the two aren’t as tightly linked as it might seem. That’s why we use two separate metrics, and why the two numbers are so different: each metric counts a different type of mistake, whether we falsely suspect someone or let a fraud slip through.”
“If you can put a dollar value on angry customers, though, we can compare those two as well,” Ehsan adds. “Then we can compare them by cost.”
Lucy’s face lights up.
“That’s the number I need. I’ll get in touch with the risk team and get you that dollar value. If you can rate the effectiveness of the model in dollars, I can understand it. And I can present it to the board.”
A few weeks later, Lucy meets up with Ehsan and Janet again.
During that time, Ehsan and Janet have tested a variety of model types on the data gathered by ExampleBank’s fraud defense division. Lucy has requested the risk team’s analysis of how fraud interventions affect the customer experience and forwarded it to Ehsan and Janet.
While Ehsan sets up the presentation, Janet summarises their work of the past weeks.
“I think we’ve got something promising for you,” she says. “We’ve taken the report from the risk analysis team and put together a comparison of some candidate models.”
Ehsan brings up the first slide.
“The report outlined an expected risk between $900 and $1500 for an angry customer affected by fraud interventions, which they average to $1200. We found the average value of a fraud transaction in the last year to be $298, so we can conservatively round this up to $300. In other words, preserving the customer experience is 4 times as important as catching a fraudulent transaction.”
He forwards to the next slide.
“There is a combined metric called the F-beta score that gives us a single value for how efficiently a model balances these priorities.”
“Wait. Those are just numbers. I thought we were going to express everything in dollars?”
“We still are,” Janet explains. “The F-beta score uses a weight based on the dollar value per fraud divided by the dollar value per customer, and that’s just a number. It measures how well the model balances the two values, where higher is better.”
“Okay,” Lucy nods. “that makes sense. Carry on.”
Ehsan points at the slide.
“Our base model has a precision of 20% and a recall of 52%, which gives it an F-beta score of 0.20751. That’s the number we’ve got to beat, or we’re losing money somewhere.”
“Yes, I remember,” Lucy says. “It catches half the fraud, but falsely suspects four customers for every fraud case it finds.”
“Exactly,” Ehsan confirms. “The first new variant we built has a similar precision, at 22%, and its recall is 68%. It has an F-beta score of 0.22912.”
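(As an aside for the technically inclined: the scores on Ehsan’s slides are consistent with the standard F-beta formula, with the weight beta set to the dollar ratio from the risk report, $300 / $1200 = 0.25. A minimal sketch:)

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall.

    beta < 1 weights precision more heavily than recall;
    beta = 1 gives the ordinary F1 score.
    """
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# beta = dollar value per fraud / dollar value per angry customer
beta = 300 / 1200  # 0.25: customer experience counts four times as much

print(round(f_beta(0.20, 0.52, beta), 5))  # base model:    0.20751
print(round(f_beta(0.22, 0.68, beta), 5))  # first variant: 0.22912
```

With beta below one, gains in recall count for little unless precision holds up, which is exactly the behaviour the team wants given the cost ratio.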
Lucy frowns again. “That seems like it’s not doing very much. What else have you got?”
“Our third model has a precision of 13%, and we managed to get the recall up to 82% on this one.”
“That sounds promising,” Lucy says. “That’d sure catch a lot more fraud.”
“That’s true,” Janet agrees, “it’s got an F-beta score of only 0.13677, though. That means overall it would perform worse than even the model we are currently using.”
Lucy stares at the screen, trying to make sense of the numbers.
“I don’t get it. Why is that? It’s got like 30% more recall, that should catch way more fraud cases. The difference in precision is so small.”
“It’s because of the cost,” Janet explains. “Instead of 5 out of 10, it will now catch 8 out of 10 fraud cases, but for each 10 suspected cases, about 9 will be honest customers. In total, that’s more than twice as many falsely suspected customers as before, and those cases are the expensive ones.”
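(Janet’s “more than twice as many” refers to totals, not the per-case ratio. A quick back-of-the-envelope check, assuming a period with 100 fraud cases:)

```python
frauds = 100  # hypothetical number of fraud cases in some period

for name, precision, recall in [("current model", 0.20, 0.52),
                                ("high-recall model", 0.13, 0.82)]:
    caught = recall * frauds        # fraud cases the model flags
    flagged = caught / precision    # all flagged transactions
    false_alarms = flagged - caught  # honest customers bothered
    print(f"{name}: {caught:.0f} caught, {false_alarms:.0f} false alarms")
```

The current model catches 52 frauds at the cost of 208 false alarms; the high-recall model catches 82 but generates roughly 549 false alarms, over two and a half times as many, and each false alarm costs four times as much as a caught fraud is worth.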
Lucy frowns at the screen for a bit longer, then nods.
“Yes. Yes, I see now. Okay, that sounds to me like we’ve got something to work with.”
The three agree that the team will continue the work and report regularly on their findings.
Lucy, Ehsan and Janet meet again after the development period for the proof of concept has concluded. Ehsan and Janet have prepared a few slides to help Lucy report the results of the project to her supervisor.
“So, what’s the latest result of the model development?”
Ehsan and Janet both smile.
“We managed to get both the precision and the recall up a little, to 23% and 76% respectively,” Janet says, and Ehsan adds:
“That means the F-beta score is now up from 0.22912 to 0.23984.”
Lucy lets her eyes wander over the figures on the screen.
“Good work! And this means the customer experience has not gotten worse, right?”
“Yes,” Ehsan confirms.
“And we catch about 50% more fraud cases than before?”
“Just about,” Janet confirms. “Recall went from 52% to 76%.”
“Great. You’ve given me some great tools to show the performance of the model. I’m really chuffed about what we’ve achieved. I’ll add your data to my presentation and let you know how we go.”
The board is impressed to see tangible, real-world effects as part of the description of the model’s performance and applauds Lucy for a well-prepared presentation.
“Accuracy isn’t everything. Always bring it down to the dollar value,” she writes down, with a little set of scales drawn next to it. “If I can understand it, I can explain it.”
She puts the notes aside and turns to her laptop. Time to get on to the next project.