Design of a_dsl_by_ruby_for_heavy_computations

Design of a DSL by Ruby
for heavy computations
over map-reduce clusters

the 37th Grace seminar
16th June, 2010

Koichi Fujikawa
Cirius Technologies, Inc.

Today's Agenda
Background
Problem
Approach
My Project
Conclusion

Background
Where are we in the world?

We live in the "Big Data" era
World-wide web page data (text only) was estimated at
400TB (at one point).
Web service companies (like Google, Yahoo, etc.)
have to process this data for their
business, but...
A typical HDD reads data at about 50MB/sec. This
means it would take roughly 2,200 hours (more than
90 days) to read the total web data (400TB) on one
machine.
We need parallel processing and a parallel file system.
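
The estimate above can be checked with a few lines of Ruby. This is only a back-of-the-envelope sketch: it assumes decimal units (1TB = 10^12 bytes, 1MB = 10^6 bytes), and the helper name scan_hours and the cluster size of 1000 machines are illustrative assumptions, not figures from the talk.

```ruby
TB = 10**12 # bytes per terabyte (decimal units assumed)
MB = 10**6  # bytes per megabyte (decimal units assumed)

# Hours needed to read total_bytes sequentially at bytes_per_sec,
# with the data split evenly over `machines` disks.
def scan_hours(total_bytes, bytes_per_sec, machines = 1)
  total_bytes.to_f / bytes_per_sec / machines / 3600
end

one = scan_hours(400 * TB, 50 * MB)
puts format('one machine: %.0f hours (%.1f days)', one, one / 24)
# prints "one machine: 2222 hours (92.6 days)"

# Hypothetical cluster size, only to show how the time scales.
puts format('1000 machines: %.1f hours', scan_hours(400 * TB, 50 * MB, 1000))
# prints "1000 machines: 2.2 hours"
```

The perfect 1/N scaling is an idealization; a real cluster also spends time on scheduling and on moving data between machines.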

MapReduce
MapReduce is one of the parallel skeletons
Popularized by Google's paper (2004)
MapReduce has two phases
Map phase: transforms a key and value into
another (key and) value
Reduce phase: aggregates and calculates
the values for each key
Each record is processed by the map phase first and
then by the reduce phase
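
The two phases can be illustrated with a word count written in plain Ruby. This is a single-process toy sketch, not Hadoop code: the names map_phase, reduce_phase and run_map_reduce are made up for this example, and the grouping step stands in for the work a real cluster does between the two phases.

```ruby
# Map phase: turn one input record (a line of text) into [key, value] pairs.
def map_phase(line)
  line.downcase.scan(/\w+/).map { |word| [word, 1] }
end

# Reduce phase: aggregate all values that share one key.
def reduce_phase(key, values)
  [key, values.sum]
end

# Toy driver: map every record, group the pairs by key, then reduce each group.
def run_map_reduce(records)
  pairs   = records.flat_map { |record| map_phase(record) }
  grouped = pairs.group_by { |key, _value| key }
  grouped.map { |key, kvs| reduce_phase(key, kvs.map { |_key, value| value }) }.to_h
end

counts = run_map_reduce(['Hadoop runs MapReduce', 'MapReduce has two phases'])
puts counts.inspect
```

Here "mapreduce" is counted twice and every other word once. On a cluster the map calls run in parallel on different machines, and each key's values are routed to one reducer.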

Hadoop
Hadoop is an open source clone of Google
MapReduce, hosted by the Apache Software Foundation
Big web service providers (Yahoo, Facebook,
etc.) actively contribute to this project.
Large developer and user communities all
over the world (including Japan)
Hadoop conference Japan 2009
Hadoop source code reading events

Problem
What issues do we face?

Programming Model
General programmers and engineers are not
familiar with the "MapReduce" model, so it is
difficult for them to try and use
Separating the work into Map and Reduce is especially hard
There are no established "patterns of
MapReduce programming" yet, because this
technology is still immature for engineers.
We have to find them individually, which is very
difficult and time-consuming.

Programming Language
Hadoop is written in Java, so
programmers need to write the Map and Reduce
procedures in Java.
Java is a strongly typed, compiled language.
Some web service engineers don't like such
languages.
That is no problem if the code is fixed and
complete, but I wonder whether it is suitable for
ad-hoc prototyping and easy querying.
MapReduce jobs depend on what users want to
get, so I think flexibility is important.

Approach
How do we resolve it?

Hide complexity of MapReduce
I found that the description of a MapReduce job could
be simpler in some specific cases (e.g. log
analysis).
In such cases (and almost all Hadoop usage is
now log analysis), it would be nice if
programmers could write the description without
having to think about MapReduce!

DSL approach by Ruby
For this kind of description, I created a DSL for each
specific usage.
The log analysis DSL is a reference
implementation that I prepared.
As the DSL runtime environment for Hadoop, I
chose Ruby and JRuby, a Ruby
runtime that runs on the JVM.
Ruby is a very flexible and reusable object-oriented
language, so it is very easy to create
a DSL processor.
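
As a rough illustration of why Ruby suits this, here is a minimal sketch of a DSL processor built on instance_eval. Everything in it is hypothetical: the class LogAnalysis and the DSL words pattern and count_by are invented for this example and are not the Hadoop Papyrus API, and the run method is a local stand-in for a cluster.

```ruby
# Hypothetical mini DSL; names are illustrative, not the Papyrus API.
class LogAnalysis
  def initialize(&block)
    @pattern = nil
    @field = nil
    instance_eval(&block) # run the user's block with DSL words bound to this object
  end

  # DSL word: regular expression with named captures that parses one log line.
  def pattern(regexp)
    @pattern = regexp
  end

  # DSL word: name of the captured field whose occurrences are counted.
  def count_by(field)
    @field = field.to_s
  end

  # Generated map step: emit [field value, 1] for a matching line, nothing otherwise.
  def map(line)
    match = @pattern.match(line)
    match ? [[match[@field], 1]] : []
  end

  # Generated reduce step: sum the counts for one key.
  def reduce(key, values)
    [key, values.sum]
  end

  # Local stand-in for the cluster: map, group by key, reduce.
  def run(lines)
    lines.flat_map { |line| map(line) }
         .group_by(&:first)
         .map { |key, pairs| reduce(key, pairs.map(&:last)) }
         .to_h
  end
end

# The user describes what to count and never writes Map or Reduce directly.
job = LogAnalysis.new do
  pattern(/"GET (?<path>[^"\s]+)/)
  count_by :path
end

log = ['1.2.3.4 "GET /index.html"', '5.6.7.8 "GET /about.html"', '1.2.3.4 "GET /index.html"']
puts job.run(log).inspect
```

The user-facing part is the three-line block passed to LogAnalysis.new; the map and reduce steps are derived from it, which is the kind of hiding the slides argue for.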

My Project
What do I do?

Hadoop Papyrus
DSL framework for Hadoop by JRuby
We can write log analysis code in
only several lines.
Open source (Apache License), same as
Hadoop
Hosted on GitHub
Distributed through the common Ruby package site
RubyGems.org
Supported by IPA mitoh 2009

Conclusion
What are we achieving now?

On the way to a big challenge
We need a parallel processing method to
handle massive web-scale data.
MapReduce and Hadoop are good tools,
but...
It is difficult to describe Map and Reduce
Writing Java is irritating for some people :-)
Hadoop Papyrus provides the key!
A Ruby-based DSL framework for Hadoop
You can write Map and Reduce at once

Questions?
Thank you very much!
Twitter ID: @fujibee
