Design of a DSL by Ruby for heavy computations


Presentation of the 37th GRACE seminar at NII, 16th June, 2010.

  1. 1. Design of a DSL by Ruby for heavy computations over MapReduce clusters. The 37th GRACE seminar, 16th June, 2010. Koichi Fujikawa, Cirius Technologies, Inc.
  2. 2. Today's Agenda: Background, Problem, Approach, My Project, Conclusion
  3. 3. Background: Where are we in the world?
  4. 4. We Live in the "Big Data" era. World-wide web page data (text only) was estimated at 400TB (at one point in time). Web service companies (such as Google and Yahoo) have to process this data for their business, but a typical HDD reads data at about 50MB/sec. At that rate, one machine would take about 2,200 hours (more than 90 days) to read the whole web (400TB). We need parallel processing and a parallel file system.
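
The read-time estimate on this slide can be checked with a few lines of Ruby. This is only a back-of-the-envelope sketch: it assumes decimal units (1TB = 10^12 bytes), and the 1,000-machine case assumes ideal, perfectly even parallel reads, which is not a figure from the talk.

```ruby
# Back-of-the-envelope check of the slide's numbers.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 MB = 10**6 bytes).
TOTAL_BYTES = 400 * 10**12   # 400 TB of web page text
READ_RATE   = 50 * 10**6     # 50 MB/sec for one HDD

# Hours needed to read total_bytes when `machines` disks read in parallel.
# Assumes the data is split evenly and there is no coordination overhead.
def read_time_hours(total_bytes, bytes_per_sec, machines = 1)
  seconds = total_bytes.to_f / (bytes_per_sec * machines)
  seconds / 3600
end

single = read_time_hours(TOTAL_BYTES, READ_RATE)
puts format("1 machine: %.0f hours (%.1f days)", single, single / 24)

# Hypothetical cluster size, chosen only for illustration.
cluster = read_time_hours(TOTAL_BYTES, READ_RATE, 1000)
puts format("1000 machines: %.1f hours", cluster)
```

A single disk needs roughly three months, while spreading the same read over many disks brings it down to hours, which is the motivation for a parallel file system.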
  5. 5. MapReduce. MapReduce is one of the parallel skeletons. It became popular through Google's paper (2004). MapReduce has two phases. Map phase: transforms a key and value into another (key and) value. Reduce phase: aggregates and calculates the values for each key. Each record is processed by the map phase first and then by the reduce phase.
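
The two phases can be sketched as a word count that runs in memory in plain Ruby. This is a minimal illustration of the model, not Hadoop code: the method names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this sketch, and the grouping step that a real framework performs between the phases is written out explicitly.

```ruby
# Map phase: turn one input record (a line of text) into [key, value] pairs.
def map_phase(line)
  line.split.map { |word| [word.downcase, 1] }
end

# Shuffle: group the emitted values by key.
# A real MapReduce framework does this between the two phases.
def shuffle(pairs)
  pairs.group_by(&:first).transform_values { |kvs| kvs.map(&:last) }
end

# Reduce phase: aggregate all values collected for one key.
def reduce_phase(key, values)
  [key, values.sum]
end

# Every record goes through the map phase first, then the reduce phase.
def word_count(lines)
  pairs = lines.flat_map { |line| map_phase(line) }
  shuffle(pairs).map { |key, values| reduce_phase(key, values) }.to_h
end

p word_count(["big data big cluster", "Big web data"])
```

Because each call to `map_phase` looks at one record and each call to `reduce_phase` looks at one key, both steps can be spread over many machines.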
  6. 6. Hadoop. Hadoop is an open source clone of Google MapReduce, hosted by the Apache Software Foundation. Big web service providers (Yahoo, Facebook, etc.) contribute to the project actively. It has a large developer and user community all over the world (including Japan): Hadoop Conference Japan 2009, Hadoop source code reading events.
  7. 7. Problem: What issues do we face?
  8. 8. Programming Model. General programmers and engineers are not familiar with the "MapReduce" model, so it is difficult to try and to use, especially separating a problem into Map and Reduce. There are no established "patterns of MapReduce programming" yet, because the technology is not mature for engineers. We have to find them individually, which is very difficult and time-consuming.
  9. 9. Programming Language. Hadoop is written in Java, so programmers need to write the Map and Reduce procedures in Java. Java is a strongly typed, compiled language, and some web service engineers don't like such languages. That is no problem if the code is fixed and complete, but I wonder whether it is suitable for ad-hoc prototyping and easy querying. MapReduce jobs depend on what users want to get, so I think flexibility is important.
  10. 10. Approach: How do we resolve it?
  11. 11. Hide complexity of MapReduce. I found that the description of a MapReduce job can be simpler in some specific cases (e.g. log analysis). In such cases (and almost all Hadoop usage is now log analysis), it would be nice if programmers could write the description without taking care of MapReduce!
  12. 12. DSL approach by Ruby. For this description, I created a DSL for each specific usage. The log analysis DSL is a reference implementation that I prepared. As the DSL runtime environment for Hadoop, I chose Ruby and JRuby, a Ruby runtime that works on the JVM. Ruby is a very flexible and reusable object-oriented language, so it is very easy to create a DSL processor.
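
To show why Ruby makes a DSL processor easy to build, here is a minimal sketch of a log analysis mini-DSL. It is not the Hadoop Papyrus syntax: the class `LogAnalysis` and the words `separated_by`, `columns` and `count_by` are hypothetical names invented for this illustration, and the hidden map and reduce steps run in memory rather than on Hadoop.

```ruby
# Hypothetical mini-DSL, NOT the Hadoop Papyrus syntax. It only illustrates
# how instance_eval lets a Ruby block read like a declaration.
class LogAnalysis
  def initialize(&block)
    @separator = " "
    @columns = []
    @count_key = nil
    instance_eval(&block) # run the DSL block with self set to this object
  end

  # --- DSL vocabulary (all names are made up for this sketch) ---
  def separated_by(sep)
    @separator = sep
  end

  def columns(*names)
    @columns = names
  end

  def count_by(name)
    @count_key = name
  end

  # --- Map and reduce steps hidden from the DSL user ---
  def run(lines)
    index = @columns.index(@count_key)
    # map: emit [value of the chosen column, 1] for every log line
    pairs = lines.map { |line| [line.split(@separator)[index], 1] }
    # reduce: sum the 1s collected for each key
    pairs.group_by(&:first).transform_values { |kvs| kvs.sum(&:last) }
  end
end

job = LogAnalysis.new do
  separated_by " "
  columns :ip, :method, :path
  count_by :path
end

p job.run(["1.1.1.1 GET /a", "2.2.2.2 GET /b", "3.3.3.3 POST /a"])
```

The user of such a DSL states what to count and never writes a Map or Reduce procedure; the framework derives both from the declaration.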
  13. 13. My project: What do I do?
  14. 14. Hadoop Papyrus. A DSL framework for Hadoop using JRuby. We can write log analysis code in only several lines. Open source (Apache License), same as Hadoop. Hosted on GitHub. Distributed through the common Ruby archive site. Supported by IPA mitoh 2009.
  15. 15. DEMO
  16. 16. Conclusion: What are we achieving now?
  17. 17. On the way to a big challenge. We need a parallel processing method to handle massive web-scale data. MapReduce and Hadoop are good tools, but it is difficult to describe Map and Reduce, and writing Java is irritating for some people :-) Hadoop Papyrus provides the key: a Ruby-based DSL framework for Hadoop, with which you can write Map and Reduce at once.
  18. 18. Questions? Thank you very much! Twitter ID: @fujibee