Skip to content

Exceptions during chunk commit are silently ignored #1189

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
spring-projects-issues opened this issue Aug 7, 2015 · 15 comments
Open

Exceptions during chunk commit are silently ignored #1189

spring-projects-issues opened this issue Aug 7, 2015 · 15 comments

Comments

@spring-projects-issues
Copy link
Collaborator

Erwin Vervaet opened BATCH-2415 and commented

A TaskletStep uses various helpers to loop over all chunks to be processed and process each of them in a
separate transaction while managing things such as skippable and retryable exceptions.
In pseudo code this whole system can be summarized as follows:

while (more chunks) { // RepeatTemplate usage in TaskletStep.doExecute
	try { // RepeatTemplate.executeInternal

		try { // StepContextRepeatCallback.doInIteration
			chunkContext = queue.poll()

			{ // TaskletStep: StepContextRepeatCallback.doInChunkContext
				try { // TransactionTemplate.execute
					{ // TaskletStep: ChunkTransactionCallback.doInTransaction
						{ // ChunkOrientedTasklet.execute
							chunk = chunkProvider.provide()
							chunkProcessor.process(chunk) [A]
							chunkContext.setComplete()
							return RepeatStatus.continueIf(!chunk.isEnd())
						}
					}
				} catch (Exception e) {
					rollback()
				}
				commit() [B]
			}

		} finally {
			if (!chunkContext.isComplete()) {
				queue.add(chunkContext)
			}
		}

	} catch (Exception e) {
		{ // SimpleRetryExceptionHandler.handleException
			if (exception is fatal) {
				throw e;
			}
		}
	}
}

This code exhibits the following type of behavior:

  • If no exceptions occur during chunk processing, the chunk is marked as complete and the transaction
    commits. Since the chunk its complete it is not pushed back on the queue and the repeat loop ends normally.
  • A non-fatal (i.e. retryable) exception that is thrown while processing the chunk (point [A])
    will cause the chunk not to be marked complete and the transaction to rollback. Since the chunk
    is not complete its pushed back onto the queue of chunks to process. Furthermore, since the
    exception is not fatal it will be ignored by the RepeatTemplate and a new iteration will start, actually doing the retry.

So far so good. However, we're experiencing the following case:

  • No exceptions occur during chunk processing. Consequently the chunk is marked as complete.
    Now an exception is thrown during transaction commit (point [B]). Since the chunk is already marked as
    complete the chunk is not pushed back on the queue and no retry is attempted. The commit failure
    caused correct chunk transaction rollback. However, since the exception is not deemed to be fatal
    it is silently ignored by the RepeatTemplate! End result: the batch completes normally but certain chunks were silently skipped!

The exception we get is (see also BATCH-2403):

2015-08-06 13:53:04.440 DEBUG o.s.b.c.s.i.SimpleRetryExceptionHandler - Handled non-fatal exception
javax.persistence.PersistenceException: org.hibernate.exception.LockAcquisitionException: could not execute statement
	at org.hibernate.ejb.AbstractEntityManagerImpl.convert(AbstractEntityManagerImpl.java:1387) ~[hibernate-entitymanager-4.2.12.Final.jar:4.2.12.Final]
	at org.hibernate.ejb.AbstractEntityManagerImpl.convert(AbstractEntityManagerImpl.java:1310) ~[hibernate-entitymanager-4.2.12.Final.jar:4.2.12.Final]
	at org.hibernate.ejb.AbstractEntityManagerImpl.convert(AbstractEntityManagerImpl.java:1316) ~[hibernate-entitymanager-4.2.12.Final.jar:4.2.12.Final]
	at org.hibernate.ejb.AbstractEntityManagerImpl$CallbackExceptionMapperImpl.mapManagedFlushFailure(AbstractEntityManagerImpl.java:1510) ~[hibernate-entitymanager-4.2.12.Final.jar:4.2.12.Final]
	at org.hibernate.engine.transaction.synchronization.internal.SynchronizationCallbackCoordinatorNonTrackingImpl.beforeCompletion(SynchronizationCallbackCoordinatorNonTrackingImpl.java:110) ~[hibernate-core-4.2.12.Final.jar:4.2.12.Final]
	at org.hibernate.engine.transaction.synchronization.internal.RegisteredSynchronization.beforeCompletion(RegisteredSynchronization.java:53) ~[hibernate-core-4.2.12.Final.jar:4.2.12.Final]
	at bitronix.tm.BitronixTransaction.fireBeforeCompletionEvent(BitronixTransaction.java:478) ~[btm-2.1.2.jar:2.1.2]
	at bitronix.tm.BitronixTransaction.commit(BitronixTransaction.java:193) ~[btm-2.1.2.jar:2.1.2]
	at bitronix.tm.BitronixTransactionManager.commit(BitronixTransactionManager.java:120) ~[btm-2.1.2.jar:2.1.2]
	at org.springframework.transaction.jta.JtaTransactionManager.doCommit(JtaTransactionManager.java:1012) ~[spring-tx-3.2.9.RELEASE.jar:3.2.9.RELEASE]
	at org.springframework.transaction.support.AbstractPlatformTransactionManager.processCommit(AbstractPlatformTransactionManager.java:755) ~[spring-tx-3.2.9.RELEASE.jar:3.2.9.RELEASE]
	at org.springframework.transaction.support.AbstractPlatformTransactionManager.commit(AbstractPlatformTransactionManager.java:724) ~[spring-tx-3.2.9.RELEASE.jar:3.2.9.RELEASE]
	at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:148) ~[spring-tx-3.2.9.RELEASE.jar:3.2.9.RELEASE]
	at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:267) ~[spring-batch-core-2.2.7.RELEASE.jar:na]
	at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:77) ~[spring-batch-core-2.2.7.RELEASE.jar:na]
	at org.springframework.batch.repeat.support.RepeatTemplate.getNextResult(RepeatTemplate.java:368) [spring-batch-infrastructure-2.2.7.RELEASE.jar:na]
...
Caused by: org.hibernate.exception.LockAcquisitionException: could not execute statement
	at org.hibernate.dialect.MySQLDialect$1.convert(MySQLDialect.java:412) ~[hibernate-core-4.2.12.Final.jar:4.2.12.Final]
...
Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[na:1.8.0_05]
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[na:1.8.0_05]
...

I fail to see how that can be desired behavior. Or am I just missing something here?


Affects: 2.2.7, 3.0.4

Attachments:

4 votes, 9 watchers

@spring-projects-issues
Copy link
Collaborator Author

Michael Minella commented

  1. Does this issue persist in the 3.0.x branch? 2.2.x is no longer maintained.
  2. How is this issue different from BATCH-2403 which you point out?

@spring-projects-issues
Copy link
Collaborator Author

Erwin Vervaet commented

Michael Minella,

  1. We're still on 2.2.7 so I haven't verified that yet. I'll upgrade and test this next week.
  2. BATCH-2403 is ultimately about the noRetry and noSkip methods on FaultTolerantStepBuilder. This ticket is about a logic path that can lead to a situation where Spring Batch rolls back a chunk and silently ignores the exception that caused the rollback, ultimately marking the batch as having completed successfully while some chunks were actually skipped.

@spring-projects-issues
Copy link
Collaborator Author

Albert Strasser commented

Hi Erwin,

You might be correct that we are looking at 2 different issues here. My assumption was that just checking if an exception was fatal (last catch block of your example) seemed to me as a wrong implementation of skipping (which should default to false for any exception, not only for fatal ones). But you could be right that at this point we are already beyond the concept of skipping and therefore this would mean that we look at a major bug in handling the status of a chunk.

I can assure that this issue does persist in the 3.0.x branch as we are using 3.0.4 and experience exactly the same behaviour.

@spring-projects-issues
Copy link
Collaborator Author

Erwin Vervaet commented

I can now also confirm that the problem is reproducible with Spring Batch 3.0.4.
Attached you can find DEBUG level Spring Batch logging for an occurrence of the problem.

The logging is from two 50-item batches (documents_cuba_RLBTST17.zip and documents_cuba_RLBTST16.zip) that run concurrently. The system first picks up the documents_cuba_RLBTST17.zip batch and about 10 seconds later documents_cuba_RLBTST16.zip is also picked up.

Both batches complete successfully.

documents_cuba_RLBTST17.zip incorrectly skipped 2 items. The STEP_EXECUTION table contains the following (no skips):

STATUS COMMIT_COUNT READ_COUNT WRITE_COUNT ROLLBACK_COUNT
COMPLETED 49 50 48 2

documents_cuba_RLBTST16.zip incorrectly skipped 12 items. The STEP_EXECUTION table contains the following (no skips):

STATUS COMMIT_COUNT READ_COUNT WRITE_COUNT ROLLBACK_COUNT
COMPLETED 39 50 38 12

@spring-projects-issues
Copy link
Collaborator Author

Erwin Vervaet commented

As a temporary work-around, we're now forcing a transaction flush (using EntityManager.flush()) from inside our ItemProcessor. This way Spring Batch will correctly retry or fail the batch if retries are exhausted.

Obviously this is not a real fix since a transaction commit might still fail for other reasons.

@spring-projects-issues
Copy link
Collaborator Author

Erwin Vervaet commented

Michael Minella,
Can you given any feedback related to this issue?
Given that if confirmed this is a critical bug I would say this has some priority.

@spring-projects-issues
Copy link
Collaborator Author

Andrei Amariei commented

Also affected by this issue.
I attached [^step_succeeds_when_commit_fails.patch] that adds a failing test to spring-batch-samples module, maybe it helps moving it forward.

@spring-projects-issues
Copy link
Collaborator Author

Erwin Vervaet commented

Thanks for that Andrei Amariei!

@spring-projects-issues
Copy link
Collaborator Author

Andrei Amariei commented

One concern I have with flushing is that it can add side effects to an unsuccessful write phase.

For example:

  • given inputs are entities and hibernate's Session.saveOrUpdate(item) is used (as in HibernateItemWriter), flushing will increase the version field before throwing any exception.
  • therefore after rollback, when attempting to process the items individually, writing will fail with StaleObjectStateException.
  • the same can also happen in commit phase, so maybe a way to skip all items in the chunk (instead of retrying) in case of write/commit failure would help.

About the attached test, maybe it's wrong to expect the step to fail (even if all the items are skipped), and it would be better to instead have a field StepExecution.commitSkipCount and a method SkipListener.onSkipInCommit() (the method can be placed in a new interface to keep backward compatibility). Similar for retry rules.

@spring-projects-issues
Copy link
Collaborator Author

Erwin Vervaet commented

Andrei Amariei, you're right, the flush is just a hack. In our application it effectively fixes the problem: we've not seen any new occurrences in our production system. But I fully agree that it's not a real solution that applies generally.

@spring-projects-issues
Copy link
Collaborator Author

Erwin Vervaet commented

Michael Minella, still no input on this?

@spring-projects-issues
Copy link
Collaborator Author

Erwin Vervaet commented

still no input on this?
I guess not...

@fmbenhassine
Copy link
Contributor

Failing test with 5.0.3 (to add in ChunkOrientedStepIntegrationTests):

@Test
public void faultTolerantStepShouldFailWhenCommitFails() throws Exception {
	StepBuilder stepBuilder = new StepBuilder("step", jobRepository);
	FaultTolerantStepBuilder<String, String> faultTolerantStepBuilder = new FaultTolerantStepBuilder<>(stepBuilder);
	faultTolerantStepBuilder.transactionManager(transactionManager);
	faultTolerantStepBuilder.reader(getReader(new String[] { "a", "b", "c" }));
	faultTolerantStepBuilder.writer(data -> TransactionSynchronizationManager
			.registerSynchronization(new TransactionSynchronizationAdapter() {
				@Override
				public void beforeCommit(boolean readOnly) {
					throw new RuntimeException("Simulate commit failure");
				}
			}));
	step = faultTolerantStepBuilder.build();

	JobParameters jobParameters = new JobParameters();
	JobExecution jobExecution = jobRepository.createJobExecution(job.getName(), jobParameters);
	StepExecution stepExecution = new StepExecution(step.getName(), jobExecution);

	jobRepository.add(stepExecution);
	step.execute(stepExecution);

	Assert.assertEquals(BatchStatus.FAILED, stepExecution.getStatus());
}

Will be reviewed in #3950 .

@fmbenhassine fmbenhassine removed the status: waiting-for-triage Issues that we did not analyse yet label Sep 21, 2023
@renannprado
Copy link

Is there a workaround for this issue?

@tanjooxiang
Copy link

Hello. facing the same issue. is there any workaround currently?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants