Machine learning for poets

This article is the first in a series that explains, in simple terms, what machine learning is. It is geared toward non-technical folks (like me). Later, we’ll discuss more advanced examples, technical challenges, etc.

What is machine learning? You may not have a crisp definition, but it’s a technology you experience daily. Whether browsing Netflix movie recommendations, having a heart-to-heart with Siri, or navigating the interwebs with Google, you’re witnessing machine learning’s unique power.

Today, we’ll talk about supervised machine learning, the branch of machine learning used for fraud detection, and show how it’s applied to a more familiar problem: spam detection.

What is supervised machine learning? An answer in 60 words

Supervised machine learning is a technology where computers:

Train a model: Analyze and determine how to categorize past instances
Apply the model: Categorize current instances
Improve the model: Get and adapt to feedback on those categorizations
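For readers who want to peek under the hood, here is a minimal sketch of those three steps in Python. It is a toy, not how any production system works: the class name `TinyClassifier`, its methods, and the word-counting approach are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict


class TinyClassifier:
    """Toy supervised model: counts how often each word appears under each label."""

    def __init__(self):
        # label -> word -> number of times seen in training
        self.word_counts = defaultdict(Counter)

    def train(self, examples):
        # Train: study past instances whose labels are already known.
        for text, label in examples:
            self.word_counts[label].update(text.lower().split())

    def apply(self, text):
        # Apply: pick the label that has seen this text's words most often.
        words = text.lower().split()
        scores = {
            label: sum(counts[w] for w in words)
            for label, counts in self.word_counts.items()
        }
        return max(scores, key=scores.get)

    def improve(self, text, correct_label):
        # Improve: feedback is simply one more labeled example.
        self.train([(text, correct_label)])


model = TinyClassifier()
model.train([
    ("win free money now", "spam"),
    ("lunch meeting at noon", "not-spam"),
])

first_guess = model.apply("free tickets for the team")       # "spam" (a mistake)
model.improve("free tickets for the team", "not-spam")       # the user corrects it
second_guess = model.apply("free tickets for the team")      # "not-spam"
print(first_guess, second_guess)
```

The point of the sketch is the shape of the loop: the same labeled-example mechanism serves both the initial training and the later feedback.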

Like puppies, employees, or spouses, supervised machine-learning models improve with feedback. High-quality supervised machine learning is difficult to develop but can be easy and cost-effective to deploy.

Machine learning made tangible: Spam filtering

Ten years ago, I systematically abandoned email accounts because of excessive spam. Spammer sophistication made setting manual filters a Sisyphean chore. Were I not overwhelmed by spam, I’d have marveled at the creativity required to find infinite spellings of “Viagra” or “weight loss.”

In 2001, AOL said spam was the “No. 1 complaint of its subscribers.”

Enter Google’s Gmail, which applies supervised machine learning to spam filtering:

Train: A new Gmail account has its own spam model, trained on emails from other Gmail users.
Apply: The model categorizes incoming emails as spam or not-spam, based on subtle signals like the number of exclamation points or the frequency of the word “free.”
Improve: If the model makes a mistake, you mark bad email as spam (or vice versa). The model incorporates feedback from you and other Gmail users for future categorizations.
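To make the spam example concrete, here is a small Python sketch that uses the two signals mentioned above: the number of exclamation points and how often the word “free” appears. Everything in it is an assumption for illustration. The names `signals`, `SpamFilter`, and `mark` are hypothetical, the sample emails are invented, and the nearest-neighbor rule (copy the label of the most similar past email) is a simple stand-in, not a description of how Gmail actually works.

```python
def signals(email):
    """Turn an email into two numbers: exclamation points and the share of words that are 'free'."""
    words = email.lower().split()
    exclamations = email.count("!")
    free_count = sum(1 for w in words if w.strip("!.,?") == "free")
    return (exclamations, free_count / max(len(words), 1))


class SpamFilter:
    def __init__(self, labeled_emails):
        # Train: start from emails that other users have already labeled.
        self.examples = [(signals(text), label) for text, label in labeled_emails]

    def classify(self, email):
        # Apply: copy the label of the past email whose signals are closest.
        x = signals(email)

        def distance(example):
            past_signals, _ = example
            return sum((a - b) ** 2 for a, b in zip(x, past_signals))

        _, label = min(self.examples, key=distance)
        return label

    def mark(self, email, correct_label):
        # Improve: a correction becomes a new labeled example.
        self.examples.append((signals(email), correct_label))


# Invented sample data standing in for "emails from other users".
shared_examples = [
    ("FREE pills!!! Buy now!!!", "spam"),
    ("Get your free prize today!!", "spam"),
    ("Are we still on for lunch?", "not-spam"),
    ("Notes from Monday's meeting attached.", "not-spam"),
]

inbox = SpamFilter(shared_examples)
birthday = "Happy birthday!!! Have a great day!!!"

before = inbox.classify(birthday)   # "spam": all those exclamation points look suspicious
inbox.mark(birthday, "not-spam")    # you rescue it from the spam folder
after = inbox.classify(birthday)    # "not-spam"
print(before, after)
```

Because the correction is stored in this one filter, the sketch also hints at why a filter can feel personalized: your own feedback shapes what your model does next.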

Spam filters are both effective and personalized. Even though spam was 68% of email traffic in 2012, you saw much less in your inbox.

The power of supervised machine learning in fraud detection

In spam filtering, supervised machine learning is common. In fraud detection, it’s not. Using machine learning technology, like Sift Science’s, to detect fraud requires extraordinary technical expertise.

Three reasons why:

Higher stakes: You don’t lose emails marked as spam. In contrast, once you turn down a customer, that’s it. With the risks of forgone revenue, lost lifetime value, and a Twitter tantrum, you must balance the costs of fraud with the realities of your business. You need more than a score. Best-in-class supervised machine learning offers both scores and information.

The question is “Is this customer a fraudster?” so you deserve more than just a number.

More diversity: Spam is less diverse than fraud. Fraud patterns vary by factors like your industry, business model, geography, customer base, and inventory. Fortunately, a quality machine learning model can be customized to your business. You can include whatever signals you want (e.g. “pants size”) and understand how they’re associated with fraud (“A 38 x 32 for a customer in Japan is suspicious based on other orders on your site”).
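One simple way to see how a custom signal is associated with fraud is to look at past orders and ask what share of them turned out to be fraudulent for each value of that signal. The Python sketch below does just that. The function name `fraud_rate_by_value`, the field names `pants_size` and `is_fraud`, and the sample orders are all hypothetical, and a real model would weigh many signals together rather than one at a time.

```python
from collections import defaultdict


def fraud_rate_by_value(orders, signal):
    """For one signal, return the share of past orders labeled fraud at each observed value."""
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for order in orders:
        value = order[signal]
        totals[value] += 1
        frauds[value] += order["is_fraud"]  # True counts as 1, False as 0
    return {value: frauds[value] / totals[value] for value in totals}


# Invented order history for illustration only.
past_orders = [
    {"pants_size": "38x32", "is_fraud": True},
    {"pants_size": "38x32", "is_fraud": True},
    {"pants_size": "38x32", "is_fraud": True},
    {"pants_size": "38x32", "is_fraud": False},
    {"pants_size": "30x30", "is_fraud": False},
    {"pants_size": "30x30", "is_fraud": False},
]

rates = fraud_rate_by_value(past_orders, "pants_size")
print(rates)
```

In this made-up history, one size shows a much higher fraud rate than the other, which is the kind of association a customized model can learn from your own orders.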

More data: Spam emails are not human beings. Fraudsters are, and their aim is to dupe you. Fortunately, the number of potential fraud signals is huge, and a robust machine learning model can make sense of both unconventional signals (“Page navigation”) and millions of patterns for those signals (“Users visiting sale products in alphabetical order are often fraudulent”).
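An unconventional signal like page navigation has to be turned into something a model can count. As a rough sketch, the Python function below checks whether a visitor viewed products in alphabetical order. The function name, the `min_visits` threshold, and the product names are assumptions for illustration; on its own this check proves nothing, and it would only be one input among many.

```python
def visited_in_alphabetical_order(product_names, min_visits=3):
    """Return True if at least `min_visits` products were viewed in A-to-Z order."""
    names = [name.lower() for name in product_names]
    if len(names) < min_visits:
        # Too few page views to call it a pattern.
        return False
    # Every product must sort at or after the one viewed before it.
    return all(a <= b for a, b in zip(names, names[1:]))


suspicious = visited_in_alphabetical_order(["Anorak", "Belt", "Cardigan", "Denim jacket"])
ordinary = visited_in_alphabetical_order(["Belt", "Anorak", "Cardigan"])
print(suspicious, ordinary)
```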

Stay tuned for more explanations, applications, and discussion on machine learning. If there are specific topics you’d like us to cover, let us know at [email protected] or @siftscience!