What is Big Data (Part I)
By Stephanie Yee, 10 Oct 2013
This post is part of a series that discusses, in simple terms, machine learning and big data. Today we’re demystifying big data. To learn about machine learning, check out Machine Learning For Poets.
What is Big Data?
What is big data? Many define it in terms of the computing power it requires. To understand what big data is, however, you first need to know what big data means. In this post, we’ll discuss the implications of big data’s meteoric rise.
What big data means
The excitement around big data isn’t just marketing hype. In fact, it captures a qualitative shift, from model complexity to data complexity.
Answering complicated questions used to require equally complicated models. Despite their elegant mathematical underpinnings, these were usually imperfect, especially when modeling real life. They required many assumptions, which didn’t always hold true (e.g. “Humans are rational”).
Human behavior is more complicated than E = mc^2. Therefore, when making predictions about humans, discovering how things actually work has proven more effective than depending on a caveat-laden model.
In other words, big data frees us to derive insights empirically. With enough information, you can approximate what you want to know by “asking the data directly” rather than relying on assumptions. Fewer assumptions mean fewer places for things to go wrong.
Of course, the quantity of data required to reduce model complexity results in, you guessed it, increased data complexity.
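To make "asking the data directly" a little more concrete, here is a minimal sketch in Python. It estimates how often an outcome occurs within each group purely by counting, without assuming any model of why it occurs. The function name `empirical_rate`, the field names `hour` and `outcome`, and the toy observations are all hypothetical, made up for illustration only.

```python
from collections import defaultdict


def empirical_rate(records, key):
    """Estimate the share of positive outcomes per group by simple counting.

    No behavioral model is assumed: the estimate is just
    (positives in group) / (records in group).
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        if record["outcome"]:
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}


# Hypothetical toy observations, invented for this example.
observations = [
    {"hour": "night", "outcome": True},
    {"hour": "night", "outcome": False},
    {"hour": "night", "outcome": True},
    {"hour": "night", "outcome": True},
    {"hour": "day", "outcome": False},
    {"hour": "day", "outcome": False},
    {"hour": "day", "outcome": True},
    {"hour": "day", "outcome": False},
]

rates = empirical_rate(observations, "hour")
print(rates)
```

With only eight records the estimates are far too noisy to trust, which is the point of the paragraph above: the counting approach only becomes reliable once there is a lot of data behind each group.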
Fight fraud with big data
At Sift, we know that big data is critical to staying ahead of fraudsters. Contemplating what I think fraudsters do is less important than discovering what they actually do. Predicting fraudster attacks based solely on recent trends is less effective than incorporating all information. Constraining your fraud team to a limited set of variables is less efficient than using every piece of information available.
So now you understand the most important aspect of what big data is: its implications. Next up: the logistical challenges that define it.
For more insight, look at Alon Halevy, Peter Norvig, and Fernando Pereira’s excellent paper The Unreasonable Effectiveness of Data. Stay tuned for more explanations, applications, and discussion on machine learning and big data. If there are specific topics you’d like us to cover, let us know at firstname.lastname@example.org or @siftscience!