One of the most interesting projects I have ever worked on was a migration that had to be handled as a batch process. In these slides we will walk through the challenge we faced, the choices we made, why we chose Spring Batch, and an overview of Spring Batch's capabilities, all in less than 15 minutes.
4. Problem Definition
Migrating 1M new users into our Subscription Engine
Migration, no interactions needed while running
Legacy tool:
migrates 20,000 users in ~7 hours
1,000,000 users -> 350 hours -> 14.58 days
6. Spring Batch
the leading batch processing framework on the JVM
Benefits:
Job flow state machine
Transaction handling
Declarative IO
Robust error handling (retry, skip, fail)
Scalability options
Battle tested
Built on Spring
7. What is Spring?
Spring is an open-source application framework and inversion of control container for Java. The framework's core features can be used by any Java application, and there are extensions for building web applications, enterprise applications, and much more.
8. Spring Batch Job
Job Repository
Transitions
Decisions
Nested Jobs
Job Parameters
Diagram: a Job has one-to-many JobInstances, and a JobInstance has one-to-many JobExecutions.
Example: an EndOfDay job with parameter "5/12/2018" — one JobExecution for each attempt.
15. Solution Result
Process finished in less than 12h
Time to handle the failed items
The project replaced the old tool and was used in more than 6 other migration processes
Sessions were organized to hand the tool over to the new engineers
Enhanced for the next versions
16. Comparison
Spring Batch
- Time: 12h
- Clean Code
- Less Code
- Easy to learn
- Less Complex
Legacy Tool
- Time: 14.58d
- More Code
- More Complex
18. Contact Me
• El-sherouk city, Cairo, Egypt
• +201023842575
• Taher.ayoub90@gmail.com
Any Questions?
Editor's Notes
-JSR-352: (the standardization of batch processing on the JVM)
Transaction management: for example, if you have a file with a million records, you do not want that amount of data to be processed in one transaction. Spring Batch provides chunk-based steps to process the file chunk by chunk, track the state of processing, and, if an error happens, record where processing last reached so it can start from that point when triggered to restart.
Declarative IO: Spring Batch provides a collection of readers and writers from and to files, XML, JSON, JDBC, and even JPA, etc., so you can focus on the business logic.
Scalability on single JVM or multiple JVMs
Battle tested: since 2008 it has been used in many verticals (finance, retail, government), with mission-critical applications running on Spring Batch in production, so Spring Batch's components are well tested.
Built on Spring, so all the facilities we have with Spring we also have with Spring Batch: Spring Initializr, Spring Boot, context, configuration, IDE integrations, testing utilities, etc.
So before speaking about Spring Batch: has anyone here heard of the Spring Framework? It is a Java framework, I know, but I think it is well known even to engineers with no Java background.
2.1 Inversion of control container (dependency injection)
2.2 Aspect-oriented programming framework
2.3 Data access framework
2.4 Transaction management
2.5 Model–view–controller framework (MVC)
2.6 Convention-over-configuration rapid application development
2.7.1 Spring Boot
2.7.2 Spring Roo
2.8 Batch framework
2.9 Integration framework
Transitions: Spring Batch is a state machine, so we need to configure how to transition from state to state and from step to step, and under which conditions: when step1 completes, do we go to step2 or step3? We also need to configure the terminal state of the job itself, what happens as a result of this job: did it finish successful, failed, or stopped?
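As a sketch (the job and step names here are illustrative, not from the project), a flow with explicit transitions inside a @Configuration class might look like:

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.context.annotation.Bean;

@Bean
public Job migrationJob(JobBuilderFactory jobs, Step step1, Step step2, Step step3) {
    return jobs.get("migrationJob")
            // when step1 completes go to step2; when it fails go to step3
            .start(step1).on("COMPLETED").to(step2)
            .from(step1).on("FAILED").to(step3)
            .end()   // terminal state of the flow
            .build();
}
```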
Decisions: deciders are a way to orchestrate the job's steps based on the output of the step logic itself, so the routing does not depend on the termination state of the step but on the step's logical output.
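A minimal decider sketch, assuming the step stored a "failedItems" counter in its execution context (a hypothetical key, for illustration):

```java
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.job.flow.FlowExecutionStatus;
import org.springframework.batch.core.job.flow.JobExecutionDecider;

// Routes on a value the step's logic stored, not on the step's exit status.
public class FailureCountDecider implements JobExecutionDecider {
    @Override
    public FlowExecutionStatus decide(JobExecution jobExecution, StepExecution stepExecution) {
        long failed = stepExecution.getExecutionContext().getLong("failedItems", 0L);
        return failed > 0
                ? new FlowExecutionStatus("HAS_FAILURES")
                : FlowExecutionStatus.COMPLETED;
    }
}
```

The decider is then wired into the flow like a step, e.g. `.next(decider).on("HAS_FAILURES").to(reprocessStep)`.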
Nested jobs: a job can be nested inside a step, which makes it cleaner to compose a job instead of having one huge, complex job. So a job can execute another job. How does it work? The parent job waits for the child job to complete; if it succeeds, the step is considered successful and the processing of the parent job continues. If the child job fails, the step is considered failed and the parent job stops, as expected.
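A sketch of wrapping a child job in a step of the parent job (bean names are illustrative), using the step builder's job step support inside a @Configuration class:

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.context.annotation.Bean;

@Bean
public Step childJobStep(StepBuilderFactory steps, Job childJob, JobLauncher jobLauncher) {
    return steps.get("childJobStep")
            .job(childJob)          // the nested job to execute
            .launcher(jobLauncher)  // launcher used to run it
            .build();
}
```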
It is pretty useful to add additional configuration at run time. Spring Batch provides a mechanism for passing parameters to a job so you can customize its configuration. At the same time, job parameters are used to identify a JobInstance. If we look at the diagram, this is how it is designed: we have a Job, a conceptual job; a Job can have JobInstances; and a JobInstance can have many JobExecutions. A JobInstance is a logical run. In this example, if I have an EndOfDay job that should run for each day, I get an instance for each day, a logical run, so I can pass a parameter for each day identifying the new JobInstance. Each time I physically run the job I get a JobExecution, with the same parameters I passed to the JobInstance.
In our case, for example, we needed to send the input file path as a job parameter. A JobInstance can only run once to completion.
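Launching a job with parameters might look like this (the parameter names and file path are illustrative, not the project's actual values):

```java
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;

// jobLauncher and migrationJob are assumed to be injected Spring beans.
JobParameters params = new JobParametersBuilder()
        .addString("inputFile", "/data/users.csv")  // identifies the JobInstance
        .addString("runDate", "5/12/2018")
        .toJobParameters();
JobExecution execution = jobLauncher.run(migrationJob, params);
```

Running the launcher again with identical parameters resumes the same JobInstance rather than creating a new one.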
This sequence diagram shows how chunk-based steps handle the processing: read record by record and pass each to the processor until the chunk is finished, then the whole output is written by the writer at once. Reading and processing item by item helps with error handling, while writing the whole chunk at once is done for performance: it is better to execute one insert statement than one per item.
In chunk-based steps the ItemReader is responsible for providing the input to the step.
JdbcCursorItemReader: very simple and stateful, so if something happens I can restart from the failing point with no issues. But it is not thread safe: the ResultSet has only one cursor, which causes issues if it is called from multiple threads.
Configuration: setSql, data source, RowMapper
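A sketch of that configuration inside a @Configuration class; the `Subscription` type, table, and columns are assumptions for illustration:

```java
import javax.sql.DataSource;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.context.annotation.Bean;

@Bean
public JdbcCursorItemReader<Subscription> cursorItemReader(DataSource dataSource) {
    JdbcCursorItemReader<Subscription> reader = new JdbcCursorItemReader<>();
    reader.setDataSource(dataSource);
    reader.setSql("SELECT id, email FROM legacy_users ORDER BY id");
    reader.setRowMapper((rs, rowNum) ->
            new Subscription(rs.getLong("id"), rs.getString("email")));
    return reader;
}
```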
JdbcPagingItemReader: the key difference is that it is thread safe. If an error happens in a page, the whole page is considered failed and will be the starting point on the next restart.
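A paging-reader sketch using the builder API (Spring Batch 4+); again, the entity, table, and page size are illustrative:

```java
import java.util.Map;
import javax.sql.DataSource;
import org.springframework.batch.item.database.JdbcPagingItemReader;
import org.springframework.batch.item.database.Order;
import org.springframework.batch.item.database.builder.JdbcPagingItemReaderBuilder;
import org.springframework.context.annotation.Bean;

@Bean
public JdbcPagingItemReader<Subscription> pagingItemReader(DataSource dataSource) {
    return new JdbcPagingItemReaderBuilder<Subscription>()
            .name("pagingItemReader")
            .dataSource(dataSource)
            .selectClause("SELECT id, email")
            .fromClause("FROM legacy_users")
            .sortKeys(Map.of("id", Order.ASCENDING))  // a unique sort key is required
            .pageSize(1000)
            .rowMapper((rs, rowNum) ->
                    new Subscription(rs.getLong("id"), rs.getString("email")))
            .build();
}
```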
Multiple resources: read each file to the end, then move on to the next one. You give Spring Batch the resources and it handles the rest, and you can keep track of each record's source.
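A MultiResourceItemReader sketch; the file pattern and delegate reader are assumptions:

```java
import java.io.IOException;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.MultiResourceItemReader;
import org.springframework.context.annotation.Bean;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;

@Bean
public MultiResourceItemReader<Subscription> multiResourceReader(
        FlatFileItemReader<Subscription> delegate) throws IOException {
    MultiResourceItemReader<Subscription> reader = new MultiResourceItemReader<>();
    reader.setResources(new PathMatchingResourcePatternResolver()
            .getResources("file:/data/users-*.csv"));  // illustrative pattern
    reader.setDelegate(delegate);  // applied to each file in turn
    return reader;
}
```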
Managing state within chunk-based steps is the facility that makes the retry and skip features possible. The JobExecution and StepExecution state is saved in the job repository, and the state of the step's own components is saved as well, so the execution state of the reader or the writer is persisted. If an error occurs while reading or processing, the step can resume from the last saved state on restart.
As a batch processing framework, Spring Batch provides many robust ways to handle errors. Relying on the job repository, Spring Batch can find the job or step where it last failed and start over from there if the job is restarted (re-run with the same parameters). For example, take a 1M-record job with 3 steps and a chunk size of 1,000: step1 succeeded and step2 failed at record 500,250. If we restart the job, it will skip step1 and the first 500,000 records that were processed successfully in step2, and start from record 500,001 in step2.
Spring Batch gives the ability to retry an action if something went wrong.
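A fault-tolerant step sketch; the exception types, limits, and bean names are illustrative choices, not the project's configuration:

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.file.FlatFileParseException;
import org.springframework.context.annotation.Bean;
import org.springframework.dao.DeadlockLoserDataAccessException;

@Bean
public Step resilientStep(StepBuilderFactory steps,
                          ItemReader<Subscription> reader,
                          ItemWriter<Subscription> writer) {
    return steps.get("resilientStep")
            .<Subscription, Subscription>chunk(1000)
            .reader(reader)
            .writer(writer)
            .faultTolerant()
            .retry(DeadlockLoserDataAccessException.class)  // transient error: retry
            .retryLimit(3)
            .skip(FlatFileParseException.class)             // bad record: skip it
            .skipLimit(100)
            .build();
}
```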
- As you can see, the code is very readable and easy to understand even before going into the details. Here we are defining a step that can be injected into any subsequent job definition. The API is builder- and factory-pattern based: we use stepBuilderFactory to get a builder object, which builds the step with the supplied parameters. So we are building a step named "step1"; it is a chunk-based step, not a tasklet; the chunk size is 20 records; we use a reader named "subscriptionItemReader" and a processor "subscriptionItemProcessor"; the output is written by the writer named "itemWriter"; and finally we call the builder's build method to build the step.
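The step definition being described likely looked close to this (the item types are an assumption):

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;

@Bean
public Step step1(StepBuilderFactory stepBuilderFactory,
                  ItemReader<Subscription> subscriptionItemReader,
                  ItemProcessor<Subscription, Subscription> subscriptionItemProcessor,
                  ItemWriter<Subscription> itemWriter) {
    return stepBuilderFactory.get("step1")
            .<Subscription, Subscription>chunk(20)  // chunk-based, 20 records per transaction
            .reader(subscriptionItemReader)
            .processor(subscriptionItemProcessor)
            .writer(itemWriter)
            .build();
}
```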
end(): a termination state indicating that the job finished successfully, hence it cannot be restarted again with the same arguments. There are also fail(), meaning the job failed at that step, and stop(), indicating that we programmatically stopped the execution. Both are termination states implying that the job did not finish successfully, so it can be restarted again, and it will resume from the point it failed.
stopAndRestart(step3)
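The terminal states above can be sketched in one flow (job and step names are illustrative):

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.context.annotation.Bean;

@Bean
public Job flowJob(JobBuilderFactory jobs, Step step1, Step step2, Step step3) {
    return jobs.get("flowJob")
            .start(step1)
            .next(step2).on("FAILED").stopAndRestart(step3)  // stop now; resume at step3
            .from(step2).on("*").end()                       // finished successfully
            .build();
}
```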