By Alex Paino /
25 Aug 2015
We at Sift Science provide fraud detection for hundreds of customers spanning many industries and use cases. To do this, we have devised a specialized modeling stack that adapts to individual customers while simultaneously delivering a great out-of-the-box experience for new ones. We achieve this by mixing the output of a “global” model – trained on our entire network of data – with the output of each customer’s individualized model.
Prior to decision forests, we used a custom-built logistic regression classifier combined with highly specialized feature engineering for our global model. While logistic regression has many great attributes, it is fundamentally limited by its inability to model non-linear interactions between features. At Sift, we tend to think of our modeling stack primarily as an enabler of our feature engineering; more powerful modeling allows us to extract the most insight from our features and can even lead to new classes of features. So when in early 2015 we stopped seeing benefits from feature engineering work, it was clear to us that we needed a major upgrade to our modeling stack.
Around January, we began investigating potential improvements to our modeling stack, and quickly identified our global model and its linear modeling constraints as the best place to start. We wanted our global model to be expressive enough to easily model non-linearities in our feature space, but we also needed to retain the ability to explain our global model’s predictions to our customers in a straightforward manner. Additionally, we needed to be able to scale up training of this new model to our very large data set without significantly increasing server costs.
Given these constraints, random decision forests seemed a natural fit. They have the ability to model arbitrary interactions amongst features, are relatively inexpensive to train, and are one of the most interpretable machine learning models around (especially when compared with neural networks, another option we considered). Beyond that, decision forests have been shown to be among the most accurate models on datasets closest to ours, and are widely used in large scale applications. Some of the other benefits of decision forests that we liked for our application include the ability to easily incorporate domain knowledge and the fact that a decision forest can be thought of as a large system of automatically learned rules. This last point is helpful when explaining machine learning to new customers who have only used rule-based fraud detection systems in the past.
Once we were sold on decision forests as the future of modeling at Sift, we set to work building out our own in-house version. We had a host of reasons for building our own implementation rather than using an existing open source implementation, but the most important was to allow us to extend and specialize the model to fit our unique dataset and feature space (as we’ll demonstrate below).
Several months and 100+ experiments later, we now have a global decision forest model working as a productive member of our modeling stack. Along the way we learned quite a few things about working with decision forests for the fraud detection use case; the most interesting lessons are detailed below:
A fundamental challenge with decision forest training is getting the feature vector down to a reasonably small size. In our experience, “reasonably small” means fewer than 1,000 features; adding features beyond that mark, even if they were individually useful, resulted in degraded accuracy in our system.
This requirement poses a problem for how sparse features are treated. At Sift, we define sparse features as any that can take on several hundred unique values (e.g. an email address). The way we handle sparse features for our logistic regression model turns each one into tens of indicator features, which results in a feature vector with several thousand features. Given the constraint above, we clearly needed to develop an alternate method of handling sparse features.
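To make the blow-up concrete, here is a rough sketch of indicator expansion for a single sparse feature; the restriction to a list of common values is an illustrative assumption, not our actual logic:

```python
# A rough sketch of indicator-feature expansion for one sparse feature; the
# "common values" cutoff is an illustrative assumption, not our actual logic.

def indicator_features(feature_name, value, common_values):
    """Expand one sparse categorical value into 0/1 indicator features."""
    return {f"{feature_name}={v}": 1.0 if v == value else 0.0
            for v in common_values}

# Even a few dozen indicators per sparse feature adds up quickly: a handful of
# sparse features already yields thousands of columns in the feature vector.
common_zipcodes = [str(z) for z in range(94100, 94150)]  # 50 common values
features = indicator_features("shipping_zipcode", "94105", common_zipcodes)
print(len(features))  # 50 indicator features from one sparse feature
```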
The approach we settled on involves turning every sparse feature into a numeric “fraud rate” feature equal to the average class prediction of all samples with the same value for said feature. This approach is best explained by a simple example:
Let’s say a feature vector comes in with a value of “94105” for the feature “Shipping zipcode”. The first thing we do is look up (in our densification tables in HBase) the number of fraudsters and normal users we have previously seen with this shipping zipcode. If we have seen this shipping zipcode in either capacity before, we can then produce a fraud rate feature that equals the ratio of fraudsters to all users with the same shipping zipcode. This simplification results in optimal splits for at least the root node in a tree, since it directly optimizes the distribution of predictions over the leaf nodes, which is in turn the input into our split quality measure. At lower levels of a tree, this approach can result in missed opportunities to combine sparse features compared with treating every feature value distinctly, but we’ve found experimentally that this is a worthwhile tradeoff in most cases.
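In code, this densification amounts to something like the sketch below; the in-memory dict stands in for our HBase densification tables, and the names are illustrative rather than our production implementation:

```python
# A minimal sketch of the sparse-feature densification described above. The
# in-memory dict stands in for our HBase densification tables; names here are
# illustrative, not our production code.

from typing import Dict, Optional, Tuple

# (feature_name, value) -> (num_fraud_users, num_normal_users) seen so far
DensificationTable = Dict[Tuple[str, str], Tuple[int, int]]

def fraud_rate_feature(table: DensificationTable,
                       feature_name: str,
                       value: str) -> Optional[float]:
    """Map one sparse categorical value to a numeric fraud-rate feature."""
    num_fraud, num_normal = table.get((feature_name, value), (0, 0))
    total = num_fraud + num_normal
    if total == 0:
        # Never seen this value before: leave the feature out of the vector.
        return None
    return num_fraud / total

# Example: 3 fraudsters and 7 normal users previously seen with zipcode 94105.
table = {("shipping_zipcode", "94105"): (3, 7)}
print(fraud_rate_feature(table, "shipping_zipcode", "94105"))  # 0.3
print(fraud_rate_feature(table, "shipping_zipcode", "00000"))  # None
```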
If a not-previously-seen shipping zipcode comes in, we will leave the shipping zipcode feature empty; that is, the feature will be missing from the feature vector. Which leads to our next lesson…
Dealing with missing features in a decision forest model is also tricky to get right. Typical approaches include filling in a missing feature with that feature’s mean, median, or some default value. In our experience, these approaches erase valuable information from the feature vector; at least in our data model, the fact that a feature is missing is an important signal. This is due in large part to the flexible manner in which we allow our customers to integrate with us. For example, some customers may make the shipping zipcode field optional on their checkout forms. Whether or not the end user chooses to fill in this field is a useful piece of information for fraud detection.
We therefore needed a method of capturing the missing-feature information and presenting it to the decision forest model. The need to keep the feature space small rules out creating a “feature XYZ is missing” indicator feature for every feature we emit. Instead of expanding the feature vector, we modify the decision forest model itself by adding a third “missing feature” branch to every interior node:
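Conceptually, an interior node and its evaluation then look something like the following sketch; the representation and names are illustrative, not our actual implementation:

```python
# A minimal sketch of an interior node with a third "missing" branch; field
# and function names here are illustrative, not our implementation.

from dataclasses import dataclass
from typing import Dict, Optional, Union

@dataclass
class Leaf:
    fraud_probability: float

@dataclass
class Node:
    feature: str                   # feature this node splits on
    threshold: float               # numeric split threshold
    left: Union["Node", Leaf]      # feature value <  threshold
    right: Union["Node", Leaf]     # feature value >= threshold
    missing: Union["Node", Leaf]   # feature absent from the vector

def score(node: Union[Node, Leaf], features: Dict[str, float]) -> float:
    """Walk a single tree, taking the 'missing' branch when needed."""
    while isinstance(node, Node):
        value: Optional[float] = features.get(node.feature)
        if value is None:
            node = node.missing
        elif value < node.threshold:
            node = node.left
        else:
            node = node.right
    return node.fraud_probability

# Example: a zipcode fraud-rate feature of 0.3, or no zipcode at all.
tree = Node("shipping_zipcode_fraud_rate", 0.5,
            left=Leaf(0.1), right=Leaf(0.8), missing=Leaf(0.4))
print(score(tree, {"shipping_zipcode_fraud_rate": 0.3}))  # 0.1
print(score(tree, {}))                                    # 0.4 via missing branch
```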
This third branch is traversed whenever the feature vector we are examining is missing the feature an interior node is split upon. The beauty of this approach is three-fold:
We learned a number of things about training a decision forest model while tuning our training algorithm:
We also discovered some interesting optimizations for evaluating decision forests:
We found this to work well for improving the class separation of scores produced by our model, and even resulted in a slight accuracy boost.
While building out our own decision forest trainer – modeled after Google’s PLANET paper – we learned several interesting optimizations. It’s useful to first clarify our requirements of the training algorithm. We needed to train a decision forest:
Given these requirements, we made a few modifications to the original PLANET paper:
As anyone who has worked with a machine learning model in the real world can attest, being able to visualize the final model you have trained is invaluable. At Sift, we had already learned this while building our other ML models, so we planned early on to build a human-readable visualization of our decision forest model.
The first form of visualization we built was a very dense text-only view of a decision forest:
While this representation is rather ugly and hard to understand at first glance, it presents 100% of the information in our finished models, which makes it easier to debug issues in our training algorithm. It has the additional benefit of working well with standard text-manipulation tools such as grep and diff, which has helped us catch quite a few bugs.
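As a rough illustration, a tree can be dumped to grep-able text with something as simple as the sketch below; the format shown here is simplified relative to our real dump:

```python
# A simplified sketch of a dense, grep/diff-friendly text dump of one tree;
# the node representation and output format are illustrative only.

def dump(node, depth=0):
    indent = "  " * depth
    if "leaf" in node:
        print(f"{indent}leaf p(fraud)={node['leaf']:.3f} n={node['n']}")
    else:
        print(f"{indent}split {node['feature']} < {node['threshold']}")
        for branch in ("left", "right", "missing"):
            print(f"{indent} {branch}:")
            dump(node[branch], depth + 1)

tree = {"feature": "shipping_zipcode_fraud_rate", "threshold": 0.5,
        "left": {"leaf": 0.08, "n": 9120},
        "right": {"leaf": 0.83, "n": 310},
        "missing": {"leaf": 0.41, "n": 1275}}
dump(tree)
```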
Moving beyond debugging a model, we also like to have an accurate way of measuring the impact of each individual feature on a model. For decision forests, we ended up adding a few such methods. The first (Figure 5 below) ranks all features by the total reduction in impurity each induces across the entire forest. This metric has proven to be a reliable way to measure the impact of a feature on our decision forest model:
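In simplified form, that ranking can be computed roughly as follows, assuming each node records its impurity and sample count; the tree representation here is illustrative:

```python
# A minimal sketch of ranking features by total impurity reduction across a
# forest, assuming each node stores its impurity and sample count. The node
# representation is illustrative, not our production format.

from collections import defaultdict

def feature_importances(forest):
    """Rank features by total (sample-weighted) impurity reduction."""
    totals = defaultdict(float)

    def visit(node):
        if "children" not in node:  # leaf node
            return
        reduction = node["n"] * node["impurity"] - sum(
            child["n"] * child["impurity"] for child in node["children"])
        totals[node["feature"]] += reduction
        for child in node["children"]:
            visit(child)

    for tree in forest:
        visit(tree)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Tiny one-split example using the zipcode fraud-rate feature.
forest = [{"feature": "shipping_zipcode_fraud_rate", "impurity": 0.18, "n": 1000,
           "children": [{"impurity": 0.05, "n": 900},
                        {"impurity": 0.10, "n": 100}]}]
print(feature_importances(forest))  # [('shipping_zipcode_fraud_rate', ~125.0)]
```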
We also built out a ranking of the most frequent feature pair combinations seen in the model (Figure 6 below). This ranking isn’t too useful for deciding whether a feature should be added to our model or not, but is interesting to look at to get a sense for what is going on “inside the mind” of a decision forest model.
The latest addition to our fleet of decision forest visualization tools is a D3-based interactive view of a single decision tree (Figure 7). This visualization is most useful for digging into specific paths of a single tree to understand its behavior.
When we first began the project to add decision forests to our modeling stack, we thought the change to our modeling stack would look like this:
However, after months of experimentation, we realized that even though our decision forest model was outperforming our logistic regression model in aggregate, there were still some cases (customers, features, users, etc.) where logistic regression outperformed decision forests. In a nutshell, every model has its own unique strengths and weaknesses. After coming to this realization, we decided to move forward with a modeling stack that looked like this:
Combining all of our models in our modeling stack far outperformed either modeling stack shown in Figure 8. Our belief now is that our decision forest model is best at learning complex behavioral patterns, whereas our logistic regression model is superior at learning over sparse features. As a result, the scores from these two models are relatively de-correlated with one another, which makes them well suited to being mixed together.
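As an illustration of why that matters, the sketch below checks the score correlation of the two models on a holdout set and blends the scores with a simple weighted average; the equal weighting is an assumption for illustration, not our actual mixing scheme:

```python
# An illustrative sketch: measure how correlated two models' scores are on a
# holdout set, then blend them with a weighted average. The equal weighting is
# an assumption, not our actual mixing scheme.

import statistics

def correlation(xs, ys):
    """Pearson correlation between two lists of scores."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def blend(forest_score, logreg_score, forest_weight=0.5):
    """Mix two model scores; the weight would normally be tuned offline."""
    return forest_weight * forest_score + (1 - forest_weight) * logreg_score

forest_scores = [0.91, 0.15, 0.40, 0.88]
logreg_scores = [0.60, 0.10, 0.75, 0.95]
print(correlation(forest_scores, logreg_scores))  # well below 1.0
print([blend(f, l) for f, l in zip(forest_scores, logreg_scores)])
```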
Before we learned the lessons above, our initial “stock” decision forest model actually performed significantly worse than our logistic regression model in offline experiments. After working through the arduous process described here, however, we were able to train a decision forest model that significantly improved overall accuracy, to the tune of 18%. But the benefits of adding a decision forest model to our stack don’t end there; now that our modeling has received a serious upgrade, we have that much more room to grow with our feature engineering. We’ve already begun taking advantage of this new expressivity by experimenting with customer-level features, which let decision forests effectively cluster our customers and learn separate sub-trees for each of them. Beyond that, the exciting challenge of updating our decision forests online now lies ahead of us.
In conclusion: