An Improved Hardware Acceleration Scheme for Java Method Calls

Tero Säntti*†, Joonas Tyystjärvi*‡, and Juha Plosila†*
*Dept. of Information Technology, University of Turku, Finland
†Academy of Finland, Research Council for Natural Sciences and Engineering
‡Turku Centre for Computer Science, Finland
{teansa|jttyys|juplos}@utu.fi

Abstract—This paper presents a significantly improved strategy for accelerating the method calls in the REALJava co-processor. The hardware assisted virtual machine architecture is described shortly to provide context for the method call acceleration. The strategy is implemented in an FPGA prototype, which allows measurements of real life performance increase and validates the whole co-processor concept. The system is intended to be used in embedded environments, with limited CPU performance and memory available to the virtual machine. The co-processor is designed in a highly modular fashion, especially separating the communication from the actual core. This modularity of the design makes the co-processor more reusable and allows system level scalability. This work is a part of a project focusing on the design of a hardware accelerated multicore Java Virtual Machine for embedded systems.

I. INTRODUCTION

Java is very popular and portable, as it is a write-once run-anywhere language. This enables coders to develop portable software for any platform. Java code is first compiled into bytecode, which is then run on a Java Virtual Machine (hereafter JVM). The JVM acts as an interpreter from bytecode to native microcode, or more recently uses just in time compilation (JIT) to achieve the same result a bit faster at the cost of memory. This software only approach is quite inefficient in terms of power consumption and execution time. These problems arise from the fact that executing one Java instruction requires several native instructions. Another source of inefficiency is the memory usage. Software based JVMs have to keep the internal registers of the virtual machine in the main memory of the host system. When the execution of the bytecode is performed on a hardware co-processor this is avoided and the overall amount of memory accesses is reduced. Because the methods in Java are generally quite small, in terms of storage requirements for both the code that is running and the data being processed, it is possible to keep all the required items in a relatively small local memory inside the co-processor. Actually just 128 kB of internal memory is enough to store all of the methods used in an embedded application. This includes the Java benchmark for embedded systems found in [15] and the embedded version of the CaffeineMark [14]. Since this local memory is not mirrored to the main memory, which usually resides in a physically external memory chip, it is energy efficient.

This work is a part of the VirtuES project, which focuses on fully utilizing the potential of embedded multicore systems using a virtual machine approach.

Overview of the paper: We proceed as follows. In Section 2 we shortly describe the structure of our hardware assisted JVM, and show how the proposed co-processor fits into the Java specifications. Section 3 describes the methods in Java and sheds light on the differences in the ways methods can be invoked. In Section 4 the strategy for the accelerator is presented with details of the hardware unit, focusing on the differences to the previous solution. In Section 5 some benchmark results are given and analyzed. Finally, in Section 6 we draw some conclusions and describe the future efforts related to the REALJava virtual machine.

II. JAVA VIRTUAL MACHINE

In the Java Virtual Machine Specification, Second Edition [4], the structure and behavior of all JVMs is specified at a quite abstract level. This specification can be met using several techniques. The usual solutions are software only, including some performance enhancing features, such as JIT (Just In Time compilation). We have chosen to use a HW/SW combination [7] in order to maximize the hardware usage and minimize the power consumption.

Fig. 1. Internal architecture of the REALJava JVM.

The HW portion (shown on the right side of Figure 1) handles most of the actual Java bytecode execution, whereas the SW portion (the left side of Figure 1) takes care of memory management, class loading and native method calling. This partitioning gives the possibility to use the co-processor with any type of host CPU(s) and operating systems, as all of the platform dependent properties are implemented in software and most of the platform independent bytecode execution is done in hardware.

978-1-4244-8971-8/10/$26.00 © 2010 IEEE
Because Java supports multithreading at the language level, it makes sense to integrate several co-processors as a SoC. This gives an ideal solution for complex systems running several Java threads and possibly some native code at the same time. This approach brings forth true multithreading and thus improves performance. Large systems also often contain several software subsystems, such as internet protocols, user interface controllers and so on, which can easily be coded in Java, and since they are all executed in parallel the user experience is enhanced.

The system architecture can be chosen to be a network of any kind or bus based, as suitable for the other components in the system. The structure of the underlying communication medium is rather irrelevant, as long as the lower level provides two properties: 1) the datagrams must arrive at their destination in the same order that they were sent, and 2) the datagrams arriving from two different sources at the same destination must be identifiable. The first property can be achieved with a lower level network protocol, like the ATM adaptation layer (AAL) for the internet, or by the physical structure of a bus. The second property seems quite natural, and should be present in all solutions. The communication scheme for the co-processor is discussed in more detail in [5].

The architecture for the co-processor is presented in [6], and the whole system including hardware and software portions can be found in [7] and [10]. The basic design used for the FPGA implementations in this paper is the same, with only minor fine tuning on some of the units. There are 5 control registers in the execution unit. These are the program counter PC, stack top pointer ST, code offset CO, local variable pointer LV and local variable info LO. The PC holds the address of the current instruction relative to the CO. The ST and LV registers are internal addresses to the local memory. The CO contains the starting address of the current method in the method area of the co-processor. The last register holds two values, the number of parameters N_params and the number of local variables N_locals for the current method. After applying the new method invocation structure, the LO register is removed from the design.

The Java virtual machine also provides a rich standard library. In most current research virtual machines the GNU Classpath [16] is used. The GNU Classpath is a free implementation of the standard library, and it is constantly being developed. Currently it covers more than 95% of the methods. The missing methods are quite rare, so in most cases the GNU Classpath is sufficient. As per recommendations for Java programming, the classpath has been built from very small methods, which are invoked often during the execution of a Java program. Also many of the methods in the classpath call even smaller sub-methods. This emphasizes the importance of having a fast method invocation architecture in a virtual machine. The method size statistics for selected benchmarks are shown in Table I, and they clearly support the claim of small methods being invoked often. More statistics about Java methods can be found in [11]. An independent study can also be found in [1].

                    Salesman      Sort    Raytrace    Caffeine
Stack frame size        8.98      4.77        7.01        5.55
Method length          38.86      8.67        9.26       14.83
Total invocations     991228  18412516     1957996    27779867

TABLE I. Statistics from method invocations in selected benchmarks. The first two rows are averages, measured in 32-bit words.

III. METHOD CALLS IN JAVA

The Java virtual machine specification [4] defines the types of methods that can be invoked in Java. Because Java is an object-oriented language, methods are usually invoked on objects, with the actual method implementation chosen based on the runtime type of the object. Methods that are not invoked on objects are called static methods. Besides static methods, the most important categories of methods are defined in the access flags bit field of the method definition. The most important access flags during bytecode execution are acc_synchronized and acc_native. Acc_synchronized means that when the method is invoked, the monitor (the primary synchronization construct in Java) of the object that the method is invoked on is entered, and the monitor is exited on return from the method. Acc_native means that the method is implemented in a native language of the platform. Native methods can be bound to actual native functions at runtime.

Methods are invoked using one of four bytecode instructions: invokevirtual, invokespecial, invokeinterface and invokestatic. All of these instructions perform a method lookup based on a 16-bit index into the constant pool of the currently executing class. Invokevirtual and invokeinterface then perform a further lookup based on the runtime type of the object that the method is being invoked on, while invokestatic and invokespecial invoke the method found immediately. As symbolic method resolution is very slow, it is common to modify the constant pool and the instruction data itself either during class loading or after the execution of a call instruction. A common technique for accelerating invokevirtual instructions is the use of virtual tables [2], which contain a pointer to each non-interface method that a class implements, with a fixed index for each method identifier. Performing a virtual table lookup is much faster than finding the method by symbolic lookup in the class of the object. A somewhat related technique is the so-called "inline cache" [3], which enables just-in-time compilers to quickly inline the most common implementations of virtual methods into their call sites.

As Java is an object-oriented language, invokevirtual is intended to be the primary method invocation instruction. The other instructions are used for special cases: invokespecial is used to invoke object constructors, private methods (which can be hidden by subclasses) and to explicitly invoke a certain implementation of a virtual method, invokeinterface is only used to invoke methods through an interface pointer, and invokestatic does not operate on an object instance. It is important to notice that as long as a class has no subclasses, invokevirtual can be executed like invokespecial. The same applies for interfaces with only one implementation.
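To make the four invocation instructions concrete, consider the following illustrative Java class (our own example, not from the paper). Each marked call site compiles to the named invoke* instruction; the mapping reflects javac versions contemporary to the paper (Java 6 era), and newer compilers may choose different instructions for some private calls.

```java
// Hypothetical example: each commented call site compiles to the named
// invoke* bytecode, which can be checked with `javap -c Calls`.
interface Greeter {
    String greet();
}

public class Calls implements Greeter {
    public String greet() { return "hello"; }     // virtual method
    private String secret() { return "shh"; }     // private method
    public static String version() { return "1.0"; }  // static method

    public static void main(String[] args) {
        Calls c = new Calls();                    // invokespecial (constructor <init>)
        Greeter g = c;
        System.out.println(c.greet());            // invokevirtual (class-typed receiver)
        System.out.println(g.greet());            // invokeinterface (interface-typed receiver)
        System.out.println(c.secret());           // invokespecial (private, no dynamic dispatch)
        System.out.println(version());            // invokestatic (no receiver object)
    }
}
```

Disassembling the compiled class with javap -c shows the four opcodes at the marked sites; the two instructions that need a runtime type lookup, invokevirtual and invokeinterface, are exactly the ones targeted by the virtual table and inline cache techniques above.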
If the overload status of virtual functions is stored in the method definition and updated when new classes are loaded, three types of method invocation instructions can be executed without access to heap data or native functions: non-native invokestatic, invokespecial, and invokevirtual with a single implementation. These instructions can be implemented using only a constant pool lookup and, in the case of invokevirtual, a test of the overload status of the method.

The new architecture presented in this paper makes use of an observation about Java programs we made recently. We noticed that the stack of a given Java method is always empty when a return instruction is executed. This feature is not mentioned in the Java virtual machine specification [4], but it seems that not one of the Java compilers we tried generates code where the stack would not be empty. Assuming the stack to always be empty makes the return much simpler, but we were hesitant due to the fact that a class with a non-empty stack during return would still be a legal construct. When a bytecode modification engine [12] was added to the bytecode verification phase, it was noticed that it could be used for emptying the stack if required. The bytecode verification keeps count of the stack at all points of the bytecode, so adding just the required number of pop instructions before an offending return instruction would fix the situation. So far such code has never been observed, but the check is kept in the verification process for the sake of security and in order to be compliant with all legal Java code.

Returning from a method happens using one of 6 instructions: return, ireturn, freturn, areturn, lreturn or dreturn. These differ only by the data pushed to the stack of the calling method. The first one pushes nothing, while the next three push one word and the last two push two words. Even though the 32-bit versions have several bytecodes reserved, they are implemented using only one mechanism. The difference between these instructions is only used during class loading for verification purposes. The 64-bit instructions are handled similarly. Since the actual returning process is exactly the same for all of the instructions, we only consider the return instruction, and state that the data to be pushed to the calling method's stack is stored into temporary registers during the return process.

IV. INVOCATION AND RETURN PROCEDURES

First, let us have a look at what happens in the stack of the virtual machine during a method invocation. Figure 2 shows how the new stack frame is created. Before the actual invocation, the calling method pushes the required parameters to the top of its stack. In the Figure these are shown as Parameters, and the number of them is denoted with the symbol N_params. The symbol N_locals tells how many local variables the new method uses. Note that the parameters become a part of the local variable array for the new method. The symbol X is just a shorthand for N_locals − N_params.

Fig. 2. The effects of the invocation process on the stack.

Now let us review the mechanism presented in [9]. In the following formulas the CallInfo vector comes from the invoker module shown in Figures 3 and 4. In the original architecture the CallInfo was 56 bits long. The SWCTRL symbol is used for control bits that tell both the hardware and the software that some special actions are required during the return phase of the method. An example would be a return to a native method. This situation cannot be handled in the hardware, since control is returned to the native method executed by the CPU. Please notice that pushing the return info to the stack after the new register values have been calculated updates the ST accordingly.

Formulas for calculating the new register values:

  PC ⇐ 0
  ST ⇐ ST_OLD − CallInfo(15..0) + CallInfo(31..16)
  CO ⇐ CallInfo(55..32)
  LV ⇐ ST_OLD − CallInfo(15..0)
  LO ⇐ CallInfo(31..0)

Data pushed to the stack frame (Return Info):

  SWCTRL & PC_OLD
  ST_OLD − CallInfo(15..0)
  CO_OLD
  LV_OLD
  LO_OLD
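For illustration only, these original update rules can be written out as a small software model. The bit-field layout assumed below (N_params in CallInfo(15..0), N_locals in CallInfo(31..16), CO in CallInfo(55..32)) is inferred from the formulas above; the class and its register model are ours, not the actual RTL, and the Return Info push is omitted.

```java
// Software sketch (hypothetical) of the original 56-bit CallInfo invocation
// update. Field layout is inferred from the formulas in the text.
public class OldInvoke {
    public long pc, st, co, lv, lo;                // JPU control registers

    public void invoke(long callInfo) {
        long nParams = callInfo & 0xFFFFL;         // CallInfo(15..0)
        long nLocals = (callInfo >> 16) & 0xFFFFL; // CallInfo(31..16)
        long stOld = st;
        pc = 0;                                    // PC <= 0: start of the new method
        st = stOld - nParams + nLocals;            // parameters become part of the locals
        co = (callInfo >> 32) & 0xFFFFFFL;         // CallInfo(55..32): 24-bit code offset
        lv = stOld - nParams;                      // locals begin where the params were pushed
        lo = callInfo & 0xFFFFFFFFL;               // CallInfo(31..0): both counts packed
    }

    public static void main(String[] args) {
        OldInvoke jpu = new OldInvoke();
        jpu.st = 100;                              // caller has pushed 2 parameters
        long callInfo = (0x1234L << 32) | (5L << 16) | 2L;  // CO=0x1234, 5 locals, 2 params
        jpu.invoke(callInfo);
        System.out.println("ST=" + jpu.st + " LV=" + jpu.lv);  // prints ST=103 LV=98
    }
}
```

The example run shows the parameters being absorbed into the new frame: ST moves up by X = 5 − 2 = 3 words, and LV points at the first pushed parameter.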
Then we can move on to the new method invocation procedure. Here are the modified invocation formulas. These use the new architecture, so the CallInfo is only 16 bits long, and it is presented in Figure 5.

Formulas for calculating the new register values:

  PC ⇐ 0
  ST ⇐ ST_OLD + X
  CO ⇐ CallInfo(15..0)
  LV ⇐ ST_OLD − N_params

Data pushed to the stack frame (Return Info):

  SWCTRL & CO_OLD
  LV_OLD & PC_OLD

Naturally the procedures for performing a return instruction are also simplified in the new architecture. The process of returning from a method can be seen as going in the opposite direction in Figure 2. The original values of most of the registers associated with the stack frame on the left side are restored. Only the ST is modified, to reflect the fact that the parameters consumed by the invoked method have been removed from the stack. The old system performed the following actions during return instructions.

Fig. 3. The invoker connected to the ALU and the registers.

Formulas for calculating the new register values:

  PC ⇐ Data0
  ST ⇐ Data1
  CO ⇐ Data2
  LV ⇐ Data3
  LO ⇐ Data4

Here the DataN symbols are retrieved from the stack frame using a separate indexing scheme, which offsets the index by N_locals and then uses the normal local variable loading mechanism inside the co-processor. Altogether this sequence requires 5 data items to be retrieved from the data memory.

The modified architecture gets the same results in a much simpler fashion, using the following formulas:

  PC ⇐ Data0(15..0)
  ST ⇐ LV_OLD
  CO ⇐ Data1(15..0)
  LV ⇐ Data0(31..16)

Again the DataN symbols are retrieved from the data memory, but now they can be retrieved using the normal pop mechanism. This simplifies the hardware, since the unit handling local variables is no longer required to handle the additional offsetting. The amount of data to be retrieved is also decreased from 5 to just 2 words. This naturally decreases the amount of memory required for return information in each stack frame from 5 to 2 words. Using the pop mechanism is possible since we are now assuming that the stack of the current method is empty when performing the return. Notice also that the new value of the ST is not calculated at all; it is simply the value of the LV in the current method.

The invoker will speed up the invocation of methods that are already loaded to the local memory of the co-processor. When an invocation command is encountered in the ALU, it sends the constant pool index of the method to the invoker module and sets query high. At this time the invoker performs a lookup in the content addressable memory (CAM) using the method id and the code offset as the key, as shown in Figure 4. In the old architecture the code offset was 24 bits long. In the new architecture the software performs a process called constant pool merging, during which the constants defined by a given class are added to a global constant pool instead of a separate pool for each class. This saves memory by merging constants already defined by other classes, and it also speeds up the constant pool lookup, since finding the constant pool for the current class is not required. This technique also reduces the size of the CAM key, because the code offset of the current method is no longer needed. Only the method id, now in the new unified constant pool, is required. The new structure can be seen in Figure 5. As an additional bonus, the method cache utilization is improved. This happens when one method, let us call it A, is invoked from several different classes. This scenario results in only one cache line, while the old architecture would have required a separate line for each method invoking A.

Fig. 4. The original CAM structure.

Fig. 5. The modified CAM structure.

After the key has been found in the CAM, the match address is sent to a normal RAM, which stores the information needed to perform the method call. This RAM was 56 bits wide, and consisted of 24 bits for the code offset of the new method, 16 bits for the number of local variables, N_locals, and finally 16 bits for the number of parameters, N_params, taken by the new method.
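A toy software model of this method cache may help fix the idea. The real unit is a CAM plus a RAM with query/get_regs/do_trap handshake signals; the class below is our own illustration of the lookup, the miss-trap, and the replacement and clearing behavior detailed later in the text.

```java
import java.util.Arrays;

// Toy model (hypothetical) of the invoker's method cache: an 8-entry fully
// associative store keyed by the unified constant pool method id, with
// circular-oldest replacement approximating LRU.
public class InvokerCache {
    private final int[] keys = new int[8];       // CAM side: method ids
    private final int[] offsets = new int[8];    // RAM side: 16-bit code offsets
    private final boolean[] valid = new boolean[8];
    private int next = 0;                        // circular replacement pointer

    /** Returns the cached code offset, or -1 to model raising do_trap. */
    public int lookup(int methodId) {
        for (int i = 0; i < keys.length; i++)
            if (valid[i] && keys[i] == methodId) return offsets[i];
        return -1;                               // miss: the host CPU must resolve the method
    }

    /** Called when execution resumes after the trap, caching the resolved method. */
    public void insert(int methodId, int codeOffset) {
        keys[next] = methodId;
        offsets[next] = codeOffset;
        valid[next] = true;
        next = (next + 1) % keys.length;         // circular-oldest replacement
    }

    /** Clearing is needed when a newly loaded overloading method invalidates entries. */
    public void clear() {
        Arrays.fill(valid, false);
    }

    public static void main(String[] args) {
        InvokerCache cache = new InvokerCache();
        for (int id = 1; id <= 8; id++) cache.insert(id, id * 100);
        System.out.println(cache.lookup(3));     // hit: prints 300
        cache.insert(9, 900);                    // evicts the oldest entry (method 1)
        System.out.println(cache.lookup(1));     // miss: prints -1
    }
}
```

Note how the unified method id alone suffices as the key, matching the modified CAM structure of Figure 5.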
Our improved scheme stores only the code offset for the method to be invoked. Its length is now limited to 16 bits. Instead of storing N_locals and N_params in the invoker cache, a different approach is chosen: N_params and X are stored in the instruction memory, just before the actual Java code for the method. The value of X is calculated during class loading to minimize the computation required during runtime. This strategy minimizes the size of the method cache unit. The increased memory requirement on the instruction side is only one word per method, and since the stack frames have been reduced by 3 words, the net effect is positive even if each method is invoked only once. Naturally, if there are subsequent invocations of a method already loaded to the instruction memory, the net saving resulting from the new architecture is 3 words per invocation. The code offset found from the RAM is also sent to the instruction memory controller, which in turn returns the values of N_params and X to the ALU for use in the invocation. If a key is found, the get_regs signal is set high to indicate a valid match. This triggers the ALU to capture the CallInfo, the N_params and the X, and to calculate the new register values using the rules presented earlier.

In case a match is not found in the CAM, a trap is produced. To indicate this condition to the ALU, the do_trap signal is set high. Upon receiving this signal the ALU sets the trap signal high to the communication module, and finally the host CPU performs the needed actions to start execution of the new method. At the same time the invoker module saves the key to the CAM. When the execution resumes after the trap, the invoker module captures the required register values and saves them to the RAM. Now the invoker is ready to speed up execution in case the same method is called again. When the invoker module saves a new key to the CAM, it uses a circular oldest algorithm to choose which entry to replace. This scheme provides a reasonably close approximation of the least recently used algorithm with very low complexity.

The invoker module can also clear its contents. This is required for situations where a virtual method has been cached to the module, and a new overloading virtual method needs to be loaded. Overloading of methods causes them to fall out of the cache, because selecting the implementation for a specific call requires access to heap data. The host CPU is better suited for this kind of task, so the task is assigned to it.

The module was integrated into our REALJava co-processor prototype with a depth of 8 entries. This depth was chosen because the statistics presented in [9] show that size to provide the highest impact on performance with the least resources. The prototype is based on a Xilinx ML410 demonstration board. This board provides all the services one might expect of a computer, such as a network controller, a hard drive controller, a PCI bus and so on. The FPGA chip is a Virtex4FX, which includes two hardcore PowerPC CPUs. The co-processor is connected to the CPU via the Processor Local Bus (PLB 3.4). The system runs the co-processor at 100 MHz, while the PowerPC CPU runs at 300 MHz. The CPU runs Linux 2.4.20 as the operating system, providing services (network, filesystem, etc.) to the virtual machine. For more details on the prototype, please see [8] and [10]. The system has also been implemented on a Virtex5 based board. This configuration used the newer PLB bus (4.6) as the communication channel and a MicroBlaze as the CPU. That CPU provides considerably less arithmetic performance, since it is a softcore processor implemented using FPGA resources and runs at 100 MHz. The larger FPGA chip allowed us to include eight co-processor cores in the system. Unfortunately we have only implemented the new invocation architecture on that platform, so we do not present detailed results for it in this paper.

                         REALJava (old)  REALJava (new)  Kaffe on PPC   Units   Gain %
Engine speed                        100             100           300   MHz       N/A
Simple call                     3125000         6666666         59453   1/s     113.3
Instance call                    713042         1160730         19460   1/s      62.8
Synchronized call                366473          564810         15567   1/s      54.1
Final call                       671343         1097950         18090   1/s      63.5
Class call                       671255         1248860         18847   1/s      86.0
Synchronized class call          260401          350181                 1/s      34.5
Salesman                          11438            9027        111824   ms       26.7
Sort                              40569           31386        856684   ms       29.3
Raytrace                           7205            5494        169646   ms       31.1
EmbeddedCaffeineMark                156             231            10             48.1
EmbeddedCaffeineMark ND             184             279            11             51.6

TABLE II. Results from various benchmarks.

V. RESULTS

The results in Table II show that the invoker module has a significant impact on the execution times of the benchmarks. In the table, REALJava (old) stands for a configuration with the original invoker, REALJava (new) stands for a configuration with the improved invoker, and Kaffe on PPC is the Kaffe Virtual Machine running on the same PowerPC processor. REALJava, even though running at a lower clock speed, clearly outperforms Kaffe in all of the benchmarks. The Gain is the percentage of improvement achieved with the improved invoker module.

The first set of benchmarks is a collection of method call tests. They measure mostly the method invocation performance, and do not include (significant amounts of) arithmetics.
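As a rough illustration of what a "simple call" test of this flavor measures, a minimal microbenchmark could look like the sketch below. This is our own example, not the benchmark code used for Table II, and absolute rates from it depend heavily on the JVM and JIT, so they are not comparable to the paper's figures.

```java
// Hypothetical sketch of a "simple call" style microbenchmark: invoke a
// (nearly) empty method in a tight loop and report calls per second.
public class SimpleCallBench {
    private static int sink;                     // keeps the call from being elided entirely

    private static void empty() { sink++; }

    /** Returns the measured call rate in calls per second. */
    public static double run(int calls) {
        long start = System.nanoTime();
        for (int i = 0; i < calls; i++) empty();
        long elapsedNs = System.nanoTime() - start;
        return calls / (elapsedNs / 1e9);
    }

    public static void main(String[] args) {
        run(1_000_000);                          // warm-up round for the JIT
        System.out.printf("%.0f calls/s%n", run(10_000_000));
    }
}
```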
The first one simply calls an empty method and then returns without any processing inside the invoked method. The next 5 are taken from the Java Grande Suite [17] to show the performance gains for various method types. These benchmarks contain a few Java instructions inside the invoked methods, so some time is spent performing actual arithmetics. The arithmetic speed of the JPU is exactly the same for both versions, which explains the lower gain percentages in these tests when compared to the simple call test.

The next set of benchmarks is a collection of tests that have been written to evaluate real life performance. The benchmark programs do not contain any special optimizations for our hardware. Short descriptions of the benchmarks follow. Salesman solves the traveling salesman problem using a naive try-all-combinations method, Sort tests array handling performance by creating arrays of random numbers and then sorting them, and Raytrace renders a 3D sphere above a plane. As the benchmarks emphasize different aspects of the system, together they should give a rather good estimation of the different practical applications that might be found on an embedded Java system. The results show a 26 to 31 percent improvement in the execution speed with the new invocation module.

Several websites and research papers dedicated to Java execution have used the CaffeineMark as a performance measurement. The CaffeineMark is also available as an embedded version, which omits the graphical tests from the desktop version. The test scores are calibrated so that a score of 100 equals the performance of a desktop computer with a 133 MHz Intel Pentium class processor. The individual tests cover a broad spectrum of applications. Since the REALJava is intended for embedded systems, we also calculated the scores without the floating point sub-test. These scores are reported in Table II on the line marked with ND (No Double arithmetics). These results are marked with italics because they were measured using a new version of the software partition of the REALJava virtual machine, which contains some modifications besides the invocation architecture. Because of this, the results do not give an accurate view of the effect of the new invocation architecture alone. For reference we give the scores for the Virtex5 based system, which are 142 and 198 for the embedded CaffeineMark with and without double arithmetics. These results show a decrease from the PowerPC based system, which is due to the significantly slower CPU. Naturally this test was run using only one core on that system, although eight of them could be used in parallel. More results can be found at our results site [13]; the invocation architecture was changed and fine tuned between versions 2.09 and 3.01 of the REALJava.

VI. CONCLUSIONS AND FUTURE WORK

An improved strategy for accelerating method calls in Java using a hardware module was presented. The module was implemented on a Xilinx FPGA to provide several benchmarks, which show significant improvement in both specialized and more general tests. In addition to the improved performance, the new architecture reduces the size of the stack frames, thus reducing the overall memory requirements for the co-processor. Also the hardware is simplified, since the LO register and the offsetting mechanism for local variables were removed.

We plan to continue refining the REALJava virtual machine. Currently we are mostly focusing on improvements to the software partition, but the hardware is also evolving at the same time. On the hardware side the most interesting new topic we are studying is making the co-processor core into a reconfigurable module and providing system level support for dynamically adding and removing co-processors as needed. This kind of a system could better utilize the resources on a given FPGA by providing several special purpose cores to be used based on the user application.

ACKNOWLEDGMENT

The authors would like to thank the Academy of Finland for their financial support of this work through the VirtuES project.

REFERENCES

[1] S. Byrne, C. Daly, D. Gregg and J. Waldron, "Dynamic Analysis of the Java Virtual Machine Method Invocation Architecture", in Proc. WSEAS 2002, Cancun, Mexico, May 2002.
[2] O.-J. Dahl and B. Myhrhaug, "Simula Implementation Guide", Publication S 47, Norwegian Computing Center, March 1973.
[3] J. Lee, B. Yang, S. Kim, K. Ebcioğlu, E. Altman, S. Lee, Y. C. Chung, H. Lee, J. H. Lee, and S. Moon, "Reducing virtual call overheads in a Java VM just-in-time compiler", SIGARCH Comput. Archit. News, vol. 28, no. 1, pp. 21-33, March 2000.
[4] T. Lindholm and F. Yellin, "The Java Virtual Machine Specification", Second Edition, Addison-Wesley, 1997.
[5] T. Säntti and J. Plosila, "Communication Scheme for an Advanced Java Co-Processor", in Proc. IEEE Norchip 2004, Oslo, Norway, November 2004.
[6] T. Säntti and J. Plosila, "Architecture for an Advanced Java Co-Processor", in Proc. International Symposium on Signals, Circuits and Systems 2005, Iasi, Romania, July 2005.
[7] T. Säntti, J. Tyystjärvi and J. Plosila, "Java Co-Processor for Embedded Systems", in Processor Design: System-on-Chip Computing for ASICs and FPGAs, J. Nurmi, Ed. Kluwer Academic Publishers / Springer Publishers, 2007, ch. 13, pp. 287-308, ISBN-10: 1402055293, ISBN-13: 978-1402055294.
[8] T. Säntti, J. Tyystjärvi and J. Plosila, "FPGA Prototype of the REALJava Co-Processor", in Proc. 2007 International Symposium on System-on-Chip, Tampere, Finland, November 2007.
[9] T. Säntti, J. Tyystjärvi and J. Plosila, "A Novel Hardware Acceleration Scheme for Java Method Calls", in Proc. ISCAS 2008, Seattle, Washington, USA, May 2008.
[10] T. Säntti, "A Co-Processor Approach for Efficient Java Execution in Embedded Systems", Ph.D. thesis, University of Turku, November 2008. (https://oa.doria.fi/handle/10024/42248)
[11] J. Tyystjärvi, "A Virtual Machine for Embedded Systems with a Co-Processor", M.Sc. thesis, University of Turku, 2007.
[12] J. Tyystjärvi, T. Säntti and J. Plosila, "Instruction Set Enhancements for High-Performance Multicore Execution on the REALJava Platform", in Proc. NORCHIP 2008, Tallinn, Estonia, November 2008.
[13] "BenchMark Results", consulted 18 August 2010, http://vco.ett.utu.fi/~teansa/REALResults.
[14] "CaffeineMark 3.0", consulted 18 August 2010, http://www.benchmarkhq.ru/cm30/.
[15] "Embedded Java Book Index", consulted 18 August 2010, http://www.practicalembeddedjava.com/.
[16] "GNU Classpath", consulted 18 August 2010, http://www.gnu.org/software/classpath/.
[17] "JavaG Benchmarking", consulted 18 August 2010, http://www2.epcc.ed.ac.uk/computing/research_activities/java_grande/