Data too big for your algorithm? Use more machines

Data science strives to build applications that help solve modern, complex business problems. Often these problems require solutions that scale through distributed, parallel computing. However, many of our familiar techniques do not scale this way out of the box. This post discusses how we can take advantage of the Central Limit Theorem to scale some of our more advanced analysis tools.
What is the point of all of these normal distributions?

A distributed algorithm needs to have a few important properties. First, we need to be able to break it down into smaller components: what was a single, serial process becomes many smaller processes. Second, it needs to be commutative, meaning it is insensitive to the order in which each process runs. Finally, it needs to be associative, meaning it is insensitive to the order in which we combine the partial results.
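As a concrete illustration of those three properties, here is a minimal sketch of computing a mean from partial results that can be merged in any order. The helper names are my own, not from any particular library:

```python
from functools import reduce
import random

def partial_summary(chunk):
    """Reduce one chunk of data to a (sum, count) pair."""
    return (sum(chunk), len(chunk))

def combine(a, b):
    """Merge two partial results; the order of merging does not matter."""
    return (a[0] + b[0], a[1] + b[1])

# One large dataset split into many smaller chunks.
data = [random.random() for _ in range(100_000)]
chunks = [data[i:i + 1_000] for i in range(0, len(data), 1_000)]

# Each chunk is summarized independently, then the pieces are merged.
total, count = reduce(combine, map(partial_summary, chunks))
print(total / count)  # close to the mean of the full dataset
```

Because combine is commutative and associative, the chunks can be summarized on different machines and merged in whatever order they arrive.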

Okay, enough text already; here is a fuller example. Clearly we can do this for simple operations such as min, max, mean, and sum. But can we also use it for something like a significance test or a Monte Carlo simulation? How? Start with a single simulation that approximates the mean using 100,000 iterations:
The outcomes of a single large simulation clearly surround a central point.
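A minimal sketch of what such a single large simulation might look like, assuming a bootstrap-style resample-and-fit over synthetic (x, y) data. The data, the simulate helper, and its parameters are illustrative assumptions, not the original code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data; the true slope and intercept are invented here only so
# the numbers loosely resemble the comparison table further down.
x = rng.uniform(0, 10, size=5_000)
y = 82.13 * x + 0.26 + rng.normal(scale=5.0, size=x.size)

def simulate(n_iterations, sample_size=50):
    """Monte Carlo estimate of the regression line: repeatedly resample a
    handful of points, fit a line to each resample, and average the fits."""
    slopes, intercepts = [], []
    for _ in range(n_iterations):
        idx = rng.integers(0, x.size, size=sample_size)
        slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
        slopes.append(slope)
        intercepts.append(intercept)
    return np.mean(slopes), np.mean(intercepts)

# A single large simulation with 100,000 iterations.
single_slope, single_intercept = simulate(100_000)
```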

Now how about 100 smaller simulations to approximate the same mean?
The outcome of averaging many smaller simulations clearly surrounds a central point.
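Continuing the same sketch, the 100 smaller simulations are just independent calls to the hypothetical simulate helper above, each with 1,000 iterations:

```python
# 100 independent smaller simulations, 1,000 iterations each.  Each call is
# self-contained, so the runs could happen on different machines in any order.
small_results = [simulate(1_000) for _ in range(100)]
```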

They look like they are all over the place. But, what happens if I combine them?
The outcomes of many simulations appear to surround a single shared center.
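Because every small run produces an independent estimate, combining them in this sketch is nothing more than an average, which is both commutative and associative:

```python
import numpy as np

# Averaging the 100 small estimates can be done in any order or grouping.
multi_slope = np.mean([slope for slope, _ in small_results])
multi_intercept = np.mean([intercept for _, intercept in small_results])
```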

Well, they both converged. But, how well did they do? Since ordinary least squares (OLS) is the method with the least error for this problem, I will compare each simulation to the OLS results. The result of the many smaller simulations is actually slightly more accurate than the single larger simulation. This makes sense: each of the smaller simulations had a different random starting location.

            OLS Value   Single Simulation   Multi Simulation
Slope       82.1313     82.1326             82.1304
Intercept   0.2588      0.2580              0.2584

You can distribute a variety of data science processes using this simple technique. It works because of the Central Limit Theorem. In particular, data drawn from a known distribution will approximate the distribution itself. As data scientists, we intuitively use this principle in our daily lives. For example, we use it any time we build averages to approximate the typical rate at which something takes place, or when we estimate the typical size of something. We also take advantage of it when we test for significance. We can combine it with distributed approaches such as Parallel Python, Theano, or PySpark.
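For instance, with PySpark the same pattern fits in a few lines. This is only a sketch under the assumptions above: it reuses the hypothetical simulate helper and presumes a running SparkContext named sc whose workers can see that function and its data.

```python
# Hypothetical sketch: distribute 100 independent simulations with PySpark.
estimates = sc.parallelize(range(100)).map(lambda _: simulate(1_000)).collect()

# The final combine step is a plain average, done back on the driver.
distributed_slope = sum(s for s, _ in estimates) / len(estimates)
distributed_intercept = sum(i for _, i in estimates) / len(estimates)
```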

However, as our data gets larger and more complex, it gets harder to use process-intensive techniques. At some point even distributed processing will not suffice. Future posts will focus on ways to further simplify these methods using approximation techniques.
