Casual mass parallel data processing in Java
Alexey Ragozin
Mar 2014

  1. Casual mass parallel data processing in Java
     Alexey Ragozin, Mar 2014
  2. Building a new bicycle …
  3. Build vs. Buy
     Build
     • No dedicated team to support infrastructure
     • Very specific tasks
     • Exclusive use of infrastructure
     • Reasonable scale
     Buy
     • Product can be bought as a service (internal or external)
     • Large scale
     • Multi-tenancy
     • You are going to use advanced features (e.g. map/reduce)
  4. “Casual” computing
     • Small computation farms (< 100 servers)
     • Team owns both application and grid
     • Java platform
     • Reasonably short batches (< 24 hours)
     • Reasonably small data sets (< 10 TiB)
  5. Simple master-slave topology
     [Diagram: a master process (scheduler, task queue) and three slaves; arrow labels: Advertise, Task, Report]
  6. Simple master-slave topology
     Control plane
     - RMI
     Queue / scheduler
     - Simple in-memory queue
     - May be more complex than just a task queue
     Data plane
     - …
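The topology above can be sketched in a single JVM. This is only an illustration under stated assumptions: slaves are plain threads rather than remote processes reached over RMI, the "task" is squaring an integer, and the class and method names (`MasterSlaveSketch`, `runBatch`) are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class MasterSlaveSketch {

    /** Master side: fills the task queue, starts slaves, then sums up their reports. */
    public static long runBatch(List<Integer> inputs, int slaveCount) {
        // Simple in-memory task queue; all tasks are enqueued before any slave starts
        final BlockingQueue<Integer> taskQueue = new LinkedBlockingQueue<Integer>(inputs);
        final BlockingQueue<Long> reports = new LinkedBlockingQueue<Long>();

        List<Thread> slaves = new ArrayList<Thread>();
        for (int i = 0; i < slaveCount; i++) {
            // Assumption: a thread stands in for a remote slave process
            Thread slave = new Thread(new Runnable() {
                @Override
                public void run() {
                    Integer task;
                    // poll() returns null once the queue is drained, which ends the slave
                    while ((task = taskQueue.poll()) != null) {
                        reports.add((long) task * task); // report the result to the master
                    }
                }
            });
            slave.start();
            slaves.add(slave);
        }

        try {
            for (Thread slave : slaves) {
                slave.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for slaves", e);
        }

        long sum = 0;
        for (Long report : reports) {
            sum += report;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> inputs = new ArrayList<Integer>();
        for (int i = 1; i <= 10; i++) {
            inputs.add(i);
        }
        System.out.println(runBatch(inputs, 3));
    }
}
```

Only small control messages (task ids, results) travel through the queues here, which mirrors the control plane / data plane split: bulk data would move by other means.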
  7. Data plane
     Never, ever, try to send data over RMI
     File system
     - Avoid network mounts!
     In-memory key-value
     - Client-side sharding works best
     Disk database (RDBMS or NoSQL)
     - Consider prefetch of data
     Direct socket streaming
     …
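Client-side sharding for an in-memory key-value store can be as small as the sketch below. It assumes a fixed list of node names and plain hash-modulo placement; the class name `ClientSideSharding` and the node names are hypothetical, and no particular key-value product is implied.

```java
import java.util.ArrayList;
import java.util.List;

public class ClientSideSharding {
    private final List<String> nodes;

    public ClientSideSharding(List<String> nodes) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("at least one node is required");
        }
        this.nodes = new ArrayList<String>(nodes);
    }

    /**
     * Every client computes the same node for the same key,
     * so no routing tier or lookup service is needed.
     */
    public String nodeFor(String key) {
        // floorMod keeps the index non-negative even for negative hash codes
        int index = Math.floorMod(key.hashCode(), nodes.size());
        return nodes.get(index);
    }

    public static void main(String[] args) {
        List<String> nodes = new ArrayList<String>();
        nodes.add("node-0");
        nodes.add("node-1");
        nodes.add("node-2");
        ClientSideSharding sharding = new ClientSideSharding(nodes);
        System.out.println(sharding.nodeFor("some-key"));
    }
}
```

Plain modulo placement remaps most keys when the node list changes; that is acceptable for a batch whose grid is fixed for its lifetime, and consistent hashing is the usual alternative otherwise.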
  8. Distributed objects revised
     Pitfalls of CORBA/RMI
     • IDL – functional contract
     • IDL – protocol
     Separating concerns
     • Functional contract – wrapper object
     • Protocol – hidden remote interface
  9. Distributed objects revised
     Renewed distributed objects paradigm
     Strong
     • Polymorphism
     • Encapsulation
       - Network protocol, caching aspects, etc.
     Weak
     • Homogeneous code base required
     • Synchronous network communications
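One possible reading of "functional contract as a wrapper object, protocol as a hidden remote interface" is sketched below. All type names (`Counter`, `RemoteCounter`, `CounterWrapper`) are invented for the illustration, and the remote side is an in-process stand-in rather than an exported RMI object.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;

public class WrapperSketch {

    /** Functional contract seen by application code: no RMI types leak out. */
    public interface Counter {
        long addAndGet(long delta);
    }

    /** Protocol: the remote interface, kept as an implementation detail. */
    interface RemoteCounter extends Remote {
        long addAndGet(long delta) throws RemoteException;
    }

    /** Wrapper object: implements the contract by delegating to the remote stub. */
    public static final class CounterWrapper implements Counter {
        private final RemoteCounter stub;

        CounterWrapper(RemoteCounter stub) {
            this.stub = stub;
        }

        @Override
        public long addAndGet(long delta) {
            try {
                return stub.addAndGet(delta);
            } catch (RemoteException e) {
                // Network failures are translated here, inside the wrapper
                throw new IllegalStateException("remote call failed", e);
            }
        }
    }

    /** In-process stand-in for a real exported RMI object, for demonstration only. */
    public static Counter localDemo() {
        return new CounterWrapper(new RemoteCounter() {
            private long value;

            @Override
            public synchronized long addAndGet(long delta) {
                value += delta;
                return value;
            }
        });
    }

    public static void main(String[] args) {
        Counter counter = localDemo();
        counter.addAndGet(2);
        System.out.println(counter.addAndGet(3));
    }
}
```

Because callers depend only on `Counter`, the wrapper is also the natural place for caching or batching, and the protocol can change without touching application code.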
  10. Deployment problem
     Brute force
     - Build / package
     - Deploy / SCP
     - Restart slaves
     - Start batch
     - Change code, repeat
     Computation grid software
     - Compile and run batch
     Behind the scenes
     - Your classes are collected
     - Associated with the batch
     - Deployed on participating slaves
  11. Central scheduler topology
     [Diagram: two batch controllers, a queue server (task queue) and three slaves; arrow labels: Add tasks, Consume reports, Task, Report, Pull]
  12. Or more elaborate
  13. Flavors of parallel processing
     Flow-organized tasks
     • Input data available before task starts
     • e.g. Map/Reduce
     Collaborative tasks
     • Tasks communicate intermediate results to each other
     • e.g. physics simulations
  14. Getting back to the data plane
     Rules of thumb
     • Insert / delete – never update
     • Write locally (reducing risks)
     • Read remotely (retry on error)
     • Store input as is
       - File system
       - Document / column-oriented NoSQL
     • Input and temporary data are different
       - Choose the right store for each
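The "read remotely (retry on error)" rule could look like the helper below. It is a minimal sketch: the class name `RetryingRead`, the fixed pause between attempts, and the attempt limit are all assumptions made for the example.

```java
import java.util.concurrent.Callable;

public class RetryingRead {

    /**
     * Calls the read action until it succeeds or maxAttempts is reached.
     * Retrying is safe only because stored data is never updated in place.
     */
    public static <T> T readWithRetry(Callable<T> read, int maxAttempts, long pauseMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return read.call();
            } catch (Exception e) {
                last = e; // remember the failure and try again
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(pauseMillis);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("interrupted while retrying", ie);
                    }
                }
            }
        }
        throw new IllegalStateException("read failed after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) {
        String value = readWithRetry(new Callable<String>() {
            @Override
            public String call() {
                return "payload"; // stand-in for a remote read (e.g. SCP or HTTP fetch)
            }
        }, 3, 100L);
        System.out.println(value);
    }
}
```

The insert-only rule is what makes this simple: a repeated read returns the same bytes as the failed one would have, so no versioning or locking is involved.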
  15. Exploiting file system
     Avoid network file systems
     • File system concept is not designed to be distributed
     • A good network file system cannot exist
     • Use simple remote file access protocols
       - SCP (unencrypted data transfer options added by CERN guys)
       - HTTP (if you really do not want SCP)
     A cheap SAN could be built from open source
  16. Algorithmic optimization
     Parallel computing
     • An N-times speed-up will increase your OPEX and CAPEX cost by N*lg(N)
     Algorithmic optimization
     • Up-front costs only
     • Orders-of-magnitude optimization opportunities
     • Exciting coding
     • Ecological way of computing
  17. Streaming algorithms
     Finding N most frequent elements
     • Count-Min sketch
     Estimating number of unique values
     • HyperLogLog
     Distribution histograms
     https://github.com/addthis/stream-lib
     https://github.com/rwl/ParallelColt
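To show what a frequency sketch does, here is a small self-contained Count-Min sketch. It is an illustration only and does not use or mirror the stream-lib API; the class name `TinyCountMinSketch`, the hash mixing constants, and the sizes in `main` are assumptions.

```java
public class TinyCountMinSketch {
    private final int depth;       // number of hash rows
    private final int width;       // counters per row
    private final long[][] counts;

    public TinyCountMinSketch(int depth, int width) {
        if (depth < 1 || width < 1) {
            throw new IllegalArgumentException("depth and width must be positive");
        }
        this.depth = depth;
        this.width = width;
        this.counts = new long[depth][width];
    }

    /** Derives a per-row bucket from the item's hashCode with a simple integer mix. */
    private int bucket(Object item, int row) {
        int h = item.hashCode() + row * 0x9E3779B9;
        h ^= h >>> 16;
        h *= 0x85EBCA6B;
        h ^= h >>> 13;
        return Math.floorMod(h, width);
    }

    /** Records one occurrence of the item in every row. */
    public void add(Object item) {
        for (int row = 0; row < depth; row++) {
            counts[row][bucket(item, row)]++;
        }
    }

    /**
     * Returns the minimum counter across rows. Collisions can only inflate
     * counters, so the estimate is never below the true count.
     */
    public long estimate(Object item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, counts[row][bucket(item, row)]);
        }
        return min;
    }

    public static void main(String[] args) {
        TinyCountMinSketch sketch = new TinyCountMinSketch(4, 1024);
        String[] stream = {"a", "b", "a", "c", "a", "b"};
        for (String s : stream) {
            sketch.add(s);
        }
        System.out.println("estimate for a: " + sketch.estimate("a"));
    }
}
```

Memory use is fixed at depth × width counters regardless of stream length. For "N most frequent", a sketch like this is typically paired with a small heap of current candidates.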
  18. NanoCloud – drastically simplified coding for computing clusters
  19. As easy as …

         @Test
         public void hello_remote_world() {
             Cloud cloud = CloudFactory.createSimpleSshCloud();
             // The callable is shipped to the remote host and executed there
             cloud.node("myserver.acme.com").exec(new Callable<Void>() {
                 @Override
                 public Void call() throws Exception {
                     String localhost = InetAddress.getLocalHost().toString();
                     System.out.println("Hi! I'm running on " + localhost);
                     return null;
                 }
             });
         }
  20. All you need is …
     NanoCloud requirements
     - SSHd
     - Java (1.6 and above) present
     - Works through NAT and firewalls
     - Works on Amazon EC2
     - Works everywhere SSH works
  21. Master – slave communications
     [Diagram: master process connected over SSH (single TCP) to an agent on the slave host; labels: multiplexed slave streams (std out, std err, std in, diag), two slave controllers, RMI (TCP), two slaves]
  22. Links
     NanoCloud
     • https://code.google.com/p/gridkit/wiki/NanoCloudTutorial
     • Maven Central: org.gridkit.lab:telecontrol-ssh:0.7.23
     • http://blog.ragozin.info/2013/01/remote-code-execution-in-java-made.html
     ANT task
     • https://github.com/gridkit/gridant
  23. Thank you
     http://blog.ragozin.info - my articles
     http://code.google.com/p/gridkit
     http://github.com/gridkit - my open source code
     http://aragozin.timepad.ru - community events in Moscow
     Alexey Ragozin, alexey.ragozin@gmail.com
