Hadoop vs Java Batch Processing JSR 352

Hadoop has become synonymous with Big Data. Oracle has released the latest standard in the Java EE stack: Batch Processing (JSR 352). Batch processing has been around for decades, and many Java frameworks, such as Spring Batch, are already available. This talk provides a perspective on Hadoop and JSR 352: knowing when to use one or the other, or both together.

Published in: Technology

  1. AGENDA • Introduction • What is batch processing? • Batch processing using Hadoop • Batch processing using Java Batch Processing JSR 352 • When to use Hadoop or JSR 352? • Conclusion (Armel Nene – ETAPIX Global Ltd - www.etapix.com)
  2. INTRODUCTION The motivations for this presentation are: • Petabytes of data available in the wild (Internet, cars, fridges…) • Need for a competitive edge • Processing large datasets • Analysing large, complex data (ETL) • Generating reports
  3. WHAT IS BATCH PROCESSING? Batch processing is the execution of a series of programs ("jobs") on a computer without manual intervention. Batch processing has these benefits: • It can shift the time of job processing to when the computing resources are less busy. • It avoids idling the computing resources with minute-by-minute manual intervention and supervision. • By keeping a high overall rate of utilization, it amortizes the cost of the computer, especially an expensive one. • It allows the system to use different priorities for batch and interactive work. Source: Wikipedia
  4. BATCH PROCESSING USING HADOOP Hadoop is a massively scalable storage and batch data processing system. It provides an integrated storage and processing fabric that scales horizontally on commodity hardware and provides fault tolerance through software. Rather than replacing existing systems, Hadoop augments them by offloading the particularly difficult problem of simultaneously ingesting, processing and delivering/exporting large volumes of data, so existing systems can focus on what they were designed to do, whether that is serving real-time transactional data or providing interactive business intelligence.
  5. BATCH PROCESSING WITH HADOOP CONT… • Hadoop uses the MapReduce programming model • Parallel job processing: no need to worry about synchronization, concurrency, hardware failure, etc. • Databases: use the RDBMS's built-in tools to dump the data, or Hadoop's native JDBC tools to extract it • Unstructured data such as log files can be processed using Hadoop • Hardware and data agnostic
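The MapReduce programming model mentioned on this slide can be illustrated with a small word count. This is only a sketch in plain Java: it does not use the Hadoop API, and the class and method names (`WordCountSketch`, `map`, `reduce`, `wordCount`) are made up for illustration. It shows the three phases in miniature: a map step that emits (word, 1) pairs, a shuffle that groups the pairs by key, and a reduce step that sums each group.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative only: mimics the MapReduce phases in-process, without Hadoop.
class WordCountSketch {

    // Map phase: turn one input line into (word, 1) pairs.
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Stream.of(line.toLowerCase().split("\\W+"))
                .filter(word -> !word.isEmpty())
                .map(word -> Map.entry(word, 1));
    }

    // Reduce phase: combine all the counts emitted for a single word.
    static int reduce(List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        // Shuffle: group the emitted values by key. The lines are mapped in
        // parallel, and the collector merges the partial results.
        Map<String, List<Integer>> grouped = lines.parallelStream()
                .flatMap(WordCountSketch::map)
                .collect(Collectors.groupingBy(
                        Map.Entry::getKey,
                        TreeMap::new,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((word, counts) -> result.put(word, reduce(counts)));
        return result;
    }

    public static void main(String[] args) {
        // Prints {batch=2, hadoop=1, jobs=1}
        System.out.println(wordCount(List.of("Hadoop batch", "batch jobs")));
    }
}
```

Because `map` and `reduce` share no state, the framework is free to run them on many machines at once, which is why the developer does not have to handle synchronization or concurrency directly.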
  6. BATCH PROCESSING USING JAVA BATCH PROCESSING JSR 352 Batch processing refers to running batch jobs on a computer system. Java EE includes a batch processing framework that provides the batch execution infrastructure common to all batch applications, enabling developers to concentrate on the business logic of their batch applications. The batch framework consists of a job specification language based on XML, a set of batch annotations and interfaces for application classes that implement the business logic, a batch container that manages the execution of batch jobs, and supporting classes and interfaces to interact with the batch container.
  7. BATCH PROCESSING USING JAVA BATCH PROCESSING JSR 352 CONT… Java EE includes a batch processing framework that consists of the following elements: • A batch runtime that manages the execution of jobs. • A job specification language based on XML. • A Java API to interact with the batch runtime. • A Java API to implement steps, decision elements, and other batch artefacts. JSR 352 integrates easily with SOA architectures, JMX for monitoring, Java Message Service (JMS) and the full Java EE stack. The learning curve for a Java EE developer is substantially reduced.
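To give a feel for what implementing a step involves, here is a plain-Java sketch of a read-process-write loop of the kind a batch runtime drives for a step. It is an assumption-laden stand-in, not the JSR 352 API: the reader, processor and writer are ordinary `Iterator`, `Function` and `Consumer` values rather than the specification's interfaces, and `ChunkStepSketch`, `runChunkStep`, `parseAmount` and the `itemCount` parameter are hypothetical names. In this sketch a processor that returns `null` drops the item.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Illustrative only: a simplified stand-in for a batch step, not the JSR 352 API.
class ChunkStepSketch {

    // Reads items one at a time, processes each, and hands the results to the
    // writer in groups of at most itemCount. Returns the number of groups written.
    static <I, O> int runChunkStep(Iterator<I> reader,
                                   Function<I, O> processor,
                                   Consumer<List<O>> writer,
                                   int itemCount) {
        if (itemCount < 1) {
            throw new IllegalArgumentException("itemCount must be at least 1");
        }
        int chunks = 0;
        List<O> buffer = new ArrayList<>();
        while (reader.hasNext()) {
            O processed = processor.apply(reader.next());
            if (processed != null) {          // sketch convention: null means "skip this item"
                buffer.add(processed);
            }
            if (buffer.size() == itemCount) { // buffer full: write it out and start a new one
                writer.accept(new ArrayList<>(buffer));
                buffer.clear();
                chunks++;
            }
        }
        if (!buffer.isEmpty()) {              // write whatever is left at end of input
            writer.accept(new ArrayList<>(buffer));
            chunks++;
        }
        return chunks;
    }

    // Hypothetical business logic: parse a sales amount, dropping bad records.
    static Double parseAmount(String raw) {
        try {
            return Double.valueOf(raw);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        List<String> sales = List.of("10.50", "20.00", "bad", "5.25", "4.25");
        List<List<Double>> written = new ArrayList<>();
        int chunks = runChunkStep(sales.iterator(), ChunkStepSketch::parseAmount, written::add, 2);
        System.out.println(chunks + " chunks: " + written);
    }
}
```

The point of the split is the one the slides make: the loop, buffering and bookkeeping belong to the runtime, while the developer supplies only the business logic (here, `parseAmount`).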
  8. WHEN TO USE HADOOP OR JSR 352? Java EE Batch Processing is not a competing technology to Apache Hadoop. They were built for different use cases. Here are some examples of use cases where I believe each works best: • Hadoop: financial risk modelling, Internet threat analysis • JBatch (JSR 352): creating reports from a database, system housekeeping
  9. WHEN TO USE HADOOP OR JSR 352? CONT… When deciding which technology to implement, you may want to consider the following: • Source of the data • Size of the data • Processing/business logic • Whether the batch process integrates with your existing architecture • What to do with the processed data
  10. CONCLUSION • JSR 352 is not a replacement for Hadoop • You can use them both together, for example with JSR 352 as a trigger for Hadoop jobs • JSR 352 is better suited to small batch jobs such as generating sales reports • Hadoop should be used when large datasets (>1 TB) need to be analysed • JSR 352 can be easily integrated into your Enterprise Service Bus architecture
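One way to picture "JSR 352 as a trigger for Hadoop jobs" is a batch step whose only work is to launch a Hadoop job and turn the job's exit code into a step status. The sketch below is plain Java under stated assumptions: the jar name, main class and directory arguments are placeholders, `HadoopTriggerSketch` and its methods are hypothetical names, and the launcher is passed in as a function so the logic can be exercised without a Hadoop installation. In a real JSR 352 job this logic would sit inside a step implementation.

```java
import java.io.IOException;
import java.util.List;
import java.util.function.ToIntFunction;

// Illustrative only: the body of a batch step that triggers a Hadoop job.
class HadoopTriggerSketch {

    // Builds a "hadoop jar" command line. The jar and class names are placeholders.
    static List<String> hadoopCommand(String inputDir, String outputDir) {
        return List.of("hadoop", "jar", "analytics-job.jar",
                "com.example.RiskModelJob", inputDir, outputDir);
    }

    // Launches the command, waits for it to finish, and returns its exit code.
    // Returns -1 if the process could not be started or the wait was interrupted.
    static int runProcess(List<String> command) {
        try {
            return new ProcessBuilder(command).inheritIO().start().waitFor();
        } catch (IOException e) {
            return -1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    // Step body: run the job through the given launcher and map the exit code
    // to a status string that the surrounding batch job can act on.
    static String process(List<String> command, ToIntFunction<List<String>> launcher) {
        int exitCode = launcher.applyAsInt(command);
        return exitCode == 0 ? "COMPLETED" : "FAILED";
    }

    public static void main(String[] args) {
        List<String> command = hadoopCommand("/data/in", "/data/out");
        // Reports FAILED when the hadoop command is not available on the PATH.
        System.out.println(process(command, HadoopTriggerSketch::runProcess));
    }
}
```

Keeping the launcher separate from the status mapping means the batch job can schedule, monitor and restart the step like any other, while the heavy data processing stays in Hadoop.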
  11. END. Armel Nene is a software architect and developer. He is also the founder of ETAPIX Global Limited – The Big Data Company - www.etapix.com. Armel Nene Recruitment - www.armelnene.com - is a specialist IT recruitment company based in London, UK. @armelnene http://uk.linkedin.com/in/armelnene/
