There are at least two hang-up problems in the concurrent vlv search
and delete.
1. From ldbm_back_search, vlv_trim_candidates_byvalue is called via
vlv_trim_candidates. The function performs a binary search over
the candidate idlist. Under some stress condition, entries which
id is in the candidate idlist could be deleted between the time the
candidate idlist is generated and the time performing the binary
search. When it occurs, the errors log stores the the error
message printed at the line 1484 endlessly and the thread hangs up
since it never steps forward in the binary search.
[...] - vlv_trim_candidates_byvalue: Candidate ID 120527 not found
err=-30990
[...] - vlv_trim_candidates_byvalue: Candidate ID 120527 not found
err=-30990
[vlv.c]
1424 static PRUint32
1425 vlv_trim_candidates_byvalue(backend *be, const IDList
*candidates, const sort_spec* sort_control, const struct
vlv_request *vlv_request_control)
1426 {
[...]
1466 /*
1467 * Perform a binary
search over the candidate list
1468 */
1469 do {
[...]
1482 if (
e == NULL )
1483 {
1484
LDAPDebug( LDAP_DEBUG_ANY, "vlv_trim_candidates_byvalue: Candidate ID
%lu not found err=%d\n", (u_long)id, err, 0 );
1485 }
1486 else
1487 {
[...]
1578 }
1579 } while (!found);
Stack Trace (snapshot in the endless loop)
(gdb) bt
#0 0x006287a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x008a6a9f in __fsync_nocancel () from /lib/tls/libpthread.so.0
#2 0x00cefcc2 in pt_Fsync (fd=0x8775d88) at
../../../../../nsprpub/pr/src/pthreads/ptio.c:1536
#3 0x00cd14e8 in PR_Sync (fd=0x8775d88) at
../../../../../nsprpub/pr/src/io/priometh.c:181
#4 0x004cc618 in vslapd_log_error (fp=0x8775d88, subsystem=0x0,
fmt=0x48089c "vlv_trim_candidates_byvalue: Candidate ID %lu not found
err=%d\n", ap=0x5b5d4c8 "��\001") at log.c:1840
#5 0x004cc325 in slapd_log_error_proc_internal (subsystem=0x0,
fmt=0x48089c "vlv_trim_candidates_byvalue: Candidate ID %lu not
found err=%d\n", ap_err=0x5b5d4c8 "��\001", ap_file=0x5b5d4c8 "��\001")
at log.c:1758
#6 0x004cc201 in slapd_log_error_proc (subsystem=0x0,
fmt=0x48089c "vlv_trim_candidates_byvalue: Candidate ID %lu not found
err=%d\n") at log.c:1722
#7 0x00452fa0 in vlv_trim_candidates_byvalue (be=0x88503f0,
candidates=0x98ec0d8, sort_control=0x98d71e0,
vlv_request_control=0x5b5d690) at vlv.c:1484
#8 0x00452b5d in vlv_trim_candidates (be=0x88503f0,
candidates=0x98ec0d8, sort_control=0x98d71e0,
vlv_request_control=0x5b5d690, trimmedCandidates=0x5b5d5c0,
vlv_response_control=0x5b5d5f0) at vlv.c:1314
#9 0x0046b354 in ldbm_back_search (pb=0xb2418970) at
ldbm_search.c:540
#10 0x0051ac18 in op_shared_search (pb=0xb2418970, send_result=1) at
opshared.c:545
#11 0x0805e389 in do_search (pb=0xb2418970) at search.c:276
#12 0x08055ede in connection_dispatch_operation (conn=0xb6e368a8,
op=0xb241ddd8, pb=0xb2418970) at connection.c:521
#13 0x08057312 in connection_threadmain () at connection.c:2146
#14 0x00cf4296 in _pt_root (arg=0x89cb5e8) at
../../../../../nsprpub/pr/src/pthreads/ptthread.c:220
#15 0x008a1371 in start_thread () from /lib/tls/libpthread.so.0
#16 0x009dbffe in clone () from /lib/tls/libc.so.6
Change Description
(a) First, I tried to protect the whole search operation with the read
lock on the vlvSearchList_lock (
PR_RWLock_Rlock(be->vlvSearchList_lock); ), but this causes more
serious deadlock with the delete operation (and other update
operations, as well). This idea does not work.
(b) Instead, I added a code to remove the not-found entry's id from the
idlist and retry the binary search. It solved the hang-up problem.
[vlv.c]
1485 if (
e == NULL )
1486 {
1487
int rval;
1488
LDAPDebug( LDAP_DEBUG_ANY, "vlv_trim_candidates_byvalue: Candidate ID
%lu not found err=%d\n", (u_long)id, err, 0 );
1489
rval = idl_delete(&candidates, id);
1490
if (0 == rval) {
1491
goto retry;
1492
} else {
1493
return candidates->b_nids; /* not found */
1494
}
1495 }
2. With the fix described in [1], I kept running the stress test.
Then, a typical deadlock between the search thread and the delete
thread was observed.
Stack Trace showing the deadlock
(gdb) thread 13
[Switching to thread 13 (Thread -1236546640 (LWP 3812))]#0
0x006287a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
(gdb) bt
#0 0x006287a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x008a3b26 in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib/tls/libpthread.so.0
#2 0x002dedc4 in __db_pthread_mutex_lock (dbenv=0x92dfb80,
mutexp=0xb71e90c8) at ../../../dist/../mutex/mut_pthread.c:225
#3 0x0034c02f in __lock_get_internal (lt=0x92e61e0, locker=63,
flags=0, obj=0x92e8edc, lock_mode=DB_LOCK_READ, timeout=0,
lock=0xb64b6fd0) at ../../../dist/../lock/lock.c:990
#4 0x0034cb95 in __lock_vec (dbenv=0x92dfb80, locker=63, flags=0,
list=0xb64b6fc0, nlist=2, elistp=0xb64b6fbc) at
../../../dist/../lock/lock.c:294
#5 0x00331f87 in __db_lget (dbc=0x92e8e70, action=2,
pgno=4294967292, mode=DB_LOCK_READ, lkflags=0, lockp=0x96f86a4) at
../../../dist/../db/db_meta.c:446
#6 0x002e0b9f in __bam_c_next (dbc=0x92e8e70, initial_move=1,
deleted_okay=0) at ../../../dist/../btree/bt_cursor.c:2276
#7 0x002e3e84 in __bam_c_get (dbc=0x92e8e70, key=0xb64b71f0,
data=0xb64b71d0, flags=18, pgnop=0xb64b7160) at
../../../dist/../btree/bt_cursor.c:917
#8 0x00327cf8 in __db_c_get (dbc_arg=0x96fdf20, key=0xb64b71f0,
data=0xb64b71d0, flags=18) at ../../../dist/../db/db_cam.c:643
#9 0x0032f925 in __db_c_get_pp (dbc=0x96fdf20, key=0xb64b71f0,
data=0xb64b71d0, flags=18) at ../../../dist/../db/db_iface.c:1836
#10 0x00eac39e in vlv_build_idl (start=0, stop=1983, db=0x92e0210,
dbc=0x96fdf20, candidates=0xb64b725c, dosort=1) at vlv.c:1009
#11 0x00ead9e9 in vlv_find_index_by_filter (be=0x92aa0d8,
base=0x9e6dde0 "ou=accounting,dc=example,dc=com", f=0x96f7fa8) at
vlv.c:1875
#12 0x00e933ea in filter_candidates (pb=0x96f8098, be=0x92aa0d8,
base=0x9e6dde0 "ou=accounting,dc=example,dc=com", f=0x96f7fa8,
nextf=0x0, range=0, err=0xb64b7564) at filterindex.c:105
#13 0x00e94b47 in list_candidates (pb=0x96f8098, be=0x92aa0d8,
base=0x9e6dde0 "ou=accounting,dc=example,dc=com", flist=0x9706d38,
ftype=161, err=0xb64b7564) at filterindex.c:707
#14 0x00e93731 in filter_candidates (pb=0x96f8098, be=0x92aa0d8,
base=0x9e6dde0 "ou=accounting,dc=example,dc=com", f=0x9706d38,
nextf=0x0, range=0, err=0xb64b7564) at filterindex.c:160
#15 0x00e94b47 in list_candidates (pb=0x96f8098, be=0x92aa0d8,
base=0x9e6dde0 "ou=accounting,dc=example,dc=com",
flist=0x9e28848,ftype=160, err=0xb64b7564) at filterindex.c:707
#16 0x00e936e1 in filter_candidates (pb=0x96f8098, be=0x92aa0d8,
base=0x9e6dde0 "ou=accounting,dc=example,dc=com", f=0x9e28848,
nextf=0x0, range=0, err=0xb64b7564) at filterindex.c:155
#17 0x00ec5abe in onelevel_candidates (pb=0x96f8098, be=0x92aa0d8,
base=0x9e6dde0 "ou=accounting,dc=example,dc=com", e=0x9e285e0,
filter=0x96f7fa8, managedsait=0, lookup_returned_allidsp=0xb64b7680,
err=0xb64b7564) at ldbm_search.c:784
#18 0x00ec5754 in build_candidate_list (pb=0x96f8098, be=0x92aa0d8,
e=0x9e285e0, base=0x9e6dde0 "ou=accounting,dc=example,dc=com", scope=1,
lookup_returned_allidsp=0xb64b7680, candidates=0xb64b770c) at
ldbm_search.c:651
#19 0x00ec4fc6 in ldbm_back_search (pb=0x96f8098) at ldbm_search.c:415
#20 0x00afcef4 in op_shared_search (pb=0x96f8098, send_result=1) at
opshared.c:545
#21 0x0805e389 in do_search (pb=0x96f8098) at search.c:276
#22 0x08055ede in connection_dispatch_operation (conn=0xb6cc1808,
op=0x9705070, pb=0x96f8098) at connection.c:521
#23 0x08057312 in connection_threadmain () at connection.c:2146
#24 0x00bbb296 in _pt_root (arg=0x93fc3b0) at
../../../../../nsprpub/pr/src/pthreads/ptthread.c:220
#25 0x008a1371 in start_thread () from /lib/tls/libpthread.so.0
#26 0x00708ffe in clone () from /lib/tls/libc.so.6
(gdb) thread 15
[Switching to thread 15 (Thread -1228149840 (LWP 3810))]#0
0x006287a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
(gdb) bt
#0 0x006287a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x008a3b26 in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib/tls/libpthread.so.0
#2 0x00bb3d20 in PR_WaitCondVar (cvar=0x92e28a8,
timeout=4294967295) at
../../../../../nsprpub/pr/src/pthreads/ptsynch.c:405
#3 0x00ba0103 in PR_RWLock_Wlock (rwlock=0x92b3550) at
../../../../../nsprpub/pr/src/threads/prrwlock.c:298
#4 0x00eadbf9 in vlv_delete_search_entry (pb=0x9e70068,
e=0x96f8ca8, inst=0x92aaa40) at vlv.c:1945
#5 0x00ecbc16 in ldbm_back_delete (pb=0x9e70068) at
ldbm_delete.c:548
#6 0x00b02dee in op_shared_delete (pb=0x9e70068) at delete.c:318
#7 0x00b027fe in do_delete (pb=0x9e70068) at delete.c:116
#8 0x08055dee in connection_dispatch_operation (conn=0xb6cc18a8,
op=0x96fb938, pb=0x9e70068) at connection.c:486
#9 0x08057312 in connection_threadmain () at connection.c:2146
#10 0x00bbb296 in _pt_root (arg=0x94508d0) at
../../../../../nsprpub/pr/src/pthreads/ptthread.c:220
#11 0x008a1371 in start_thread () from /lib/tls/libpthread.so.0
#12 0x00708ffe in clone () from /lib/tls/libc.so.6
This is a deadlock caused by the read-write lock on the vlvSearchList
and the db lock. The search thread calls read lock on
vlvSearchList (e.g., at the top of vlv_find_index_by_filter), then
tries to search the db index. On the other hand, the delete
thread calls db transaction which could lock pages with the write lock
(it makes the search thread wait in the stacktrace for thread 13), then
tries to get write-lock on the vlvSearchList (the mutex is already
read-locked by the search thread).
To prevent this deadlock, I eliminated the read-lock from
vlv_find_index_by_filter. Then, as we experienced in [1], some
id's in the candidate idlist could have lost the corresponding entries
in the primary index file when the code is being executed and displayed
this message in the errors log:
[...] - vlv_filter_candidates: Candidate 227347 not found err=-30990
I ran the test suggested in Comment #7 repeatedly over night (about 15
hours) and got this number of conflicts, but no deadlock occurred.
$ egrep "not found" errors | wc -l
6499
This is not an ideal solution, but thinking of the vlv nature, it
should be fast and deadlock free. If we introduce more protection
by locking, it'd make the advantage lose.
I'm attaching my proposed source code diffs for the review, next.