Is Machine Learning Code for 100 Rows or a Billion the Same?: We have built an automatically distributed, implicitly parallel data science platform for running large-scale machine learning applications. By abstracting away the computer science required to scale machine learning models, the Ufora platform lets data scientists focus on building models in simple scripting code, without having to worry about building large-scale distributed systems, their race conditions, fault tolerance, etc. This automatic approach requires solving some interesting challenges, like optimal data layout for different ML models. For example, when a data scientist says “do a linear regression on this 100GB dataset”, Ufora needs to figure out how to automatically distribute and lay out that data across a cluster of machines in order to minimize travel over the wire. Running a GBM against the same dataset might require a completely different layout. This talk will cover how the platform works in terms of data and thread distribution, how it generates parallel processes out of single-threaded programs, and more.
2. Why should I have to write a different program for 1000 rows or 1 billion?
3. Our Vision: Simplified Distributed Computing
• Using lots of machines should be as easy as using one.
• Enable scalable, fast machine learning and data processing
• Parallelism should be natural and come from the language itself
I want to treat the cloud like it’s one big, fast, desktop.
4. What is Ufora?
Auto-parallel, compiled, multi-host Python
Key Components
• JIT Compiled
• Implicit Parallelism at the language level
• Fault tolerant
• Automatic co-location of data and compute
5. We are now open source!
• 5 years of work by ~ 5 engineers
• ~350k lines of code
• Apache 2.0 License
• Hosted on GitHub
6. Sound Familiar?
• Similar approach to JIT Compilation
• Scalable but without frameworks like MapReduce
• Package that works easily with existing python workflow
7. How do I use it?
Install the client:
pip install pyfora
Get some workers:
pyfora_aws start … --num-instances 4
or
docker run … ufora/service
In your Python program:
import pyfora
ufora = pyfora.connect('http://<ip_address>:30000')
with ufora.remotely:
    # your code here
8. How do I use it?
def isPrime(p):
    if p < 2: return 0
    x = 2
    while x*x <= p:
        if p % x == 0: return 0
        x = x + 1
    return 1

result = sum(isPrime(x) for x in xrange(100 * 1000 * 1000))
~1 hour
9. How do I use it?
def isPrime(p):
    if p < 2: return 0
    x = 2
    while x*x <= p:
        if p % x == 0: return 0
        x = x + 1
    return 1

with ufora.remotely:
    result = sum(isPrime(x) for x in xrange(100 * 1000 * 1000))
~10 secs
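To build intuition for the speedup, here is a rough sketch of the kind of decomposition pyfora's runtime performs implicitly: an associative reduction like the sum above can be split into independent sub-ranges, reduced separately, and combined. This is plain illustrative Python (the function and helper names are mine, not pyfora API); pyfora does this automatically, across machines.

```python
def isPrime(p):
    if p < 2:
        return 0
    x = 2
    while x * x <= p:
        if p % x == 0:
            return 0
        x = x + 1
    return 1

def split_ranges(n, num_workers):
    # Partition [0, n) into num_workers contiguous chunks.
    step = (n + num_workers - 1) // num_workers
    return [(lo, min(lo + step, n)) for lo in range(0, n, step)]

def count_primes(lo, hi):
    # Each chunk is an independent task a worker could run in parallel.
    return sum(isPrime(x) for x in range(lo, hi))

# Combining the partial results gives the same answer as a serial sum.
total = sum(count_primes(lo, hi) for lo, hi in split_ranges(1000, 4))
```

Because the reduction is associative and `isPrime` is pure, the chunks can run on different machines with no coordination beyond the final combine.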
10. What do you give up?
• No mutability of data-structures
• No side-effects
• No nondeterminism
• Emphasize “functional” programming style
15. Answer: React dynamically as the program runs
• Watch running threads to see what blocks of data they’re accessing.
• Move threads to data, or data to threads, depending on what’s cheaper.
• Detect when two blocks of data absolutely have to be on the same machine.
• Build a statistical model of correlations between block accesses.
• Place data to minimize expected future number of machine boundary crossings.
16. A simple example
v = range(0, 2*10**9)
[Diagram: v is split into 16 blocks, four per machine across Machines 1–4; red boxes are blocks of data]
17. [Diagram: the same 16 blocks across Machines 1–4]
for x in v:
    state = f(state, x)
Computation starts on Machine 1. When the computation exhausts the data on one machine, the runtime moves it to the next.
18. But real access patterns are more complex!
User writes:
res = 0
def f(x, y):
    # some function
for i in xrange(0, len(v)-10):
    res = res + f(v[i], v[i+10])
Now the computation is looking at all pairs v[i] and v[i+10].
19. [Diagram: the same 16 blocks across Machines 1–4]
At first, everything is OK, since v[i] and v[i+10] are close to each other in the data. But when the computation reaches the end of block 4, v[i] and v[i+10] aren’t on the same machine!
20. Every time we have to move the computation, we’re hitting the network. This is really slow!
[Diagram: v[i] sits in block 4 on Machine 1, while v[i+10] sits in block 5 on Machine 2]
21. Solution: Replicate blocks so that they overlap
[Diagram: blocks 5, 9, and 13 are replicated onto the previous machine, so adjacent blocks overlap across machine boundaries]
Data can live on two different machines at the same time because it’s immutable!
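The overlap idea can be sketched in a few lines (my own illustration, with made-up names, not Ufora's actual code): extend each block with the first `offset` elements of the next block, so every pair (v[i], v[i+offset]) falls entirely inside some single machine's copy.

```python
def blocks_with_overlap(v, block_size, offset):
    # Split v into blocks, each extended `offset` elements past its
    # boundary, replicating the start of the next block. Safe only
    # because the data is immutable.
    out = []
    for start in range(0, len(v), block_size):
        out.append(v[start:start + block_size + offset])
    return out

v = list(range(100))
blocks = blocks_with_overlap(v, block_size=25, offset=10)
# Every pair (v[i], v[i+10]) now lives within a single block's copy,
# so the i / i+10 access pattern never crosses a machine boundary.
```

The cost is a little duplicated storage (offset elements per block) in exchange for eliminating the per-pair network hops from the previous slide.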
22. Project Roadmap: Current Version (0.1)
• Coverage of the core Python 2.7 language.
• Run locally (using docker) or in AWS
• Import pyfora and go!
23. Project Roadmap: Upcoming Release (0.2)
• Core numpy and dataframe implementations (in python)
• Coverage for some core scikit data science algorithms (gbm, regressions, etc.)
• Better error handling, lots of bugfixes
24. Project Roadmap: the future
• Python 3 support
• Execution of arbitrary python code out-of-process (for non-pure code we don't want to port)
• More generic model for import/export of data from the cluster.
• Enabling better feedback in the pyfora API for tracking progress of computations.
• Support for running calculations on GPU
25. Ufora is Auto-Parallel, Multi-Host Python
• Star/fork the repo: github.com/ufora/ufora
• Contribute to the codebase
• Find me after this presentation
• Tell us what we should build next. This affects our priorities!!!
• Email me: braxton@ufora.com