This article describes one cause of vrouter agent crashes in Contrail Networking Release R1909 and identifies the release that contains the fix.
The following error messages are seen in vrouter-log.json (obtained by running the docker inspect vrouter_vrouter-agent_1 | grep json command):
{"log":"contrail-vrouter-agent: controller/src/vnsw/agent/oper/vrf.cc:419: bool VrfEntry: deleteTimeout(): Assertion `0' failed.\r\n","stream":"stdout","time":"XXXXXXX"}
{"log":"/entrypoint.sh: line 400: 143274 Aborted (core dumped) $@\r\n","stream":"stdout","time":"XXXXXXX"}
The backtrace output may be similar to the following:
(gdb) bt
#0 0x00007f8d395f5350 in std::_Rb_tree_insert_and_rebalance(bool, std::_Rb_tree_node_base*, std::_Rb_tree_node_base*, std::_Rb_tree_node_base&) () from /lib64/libstdc++.so.6
#1 0x0000000000ef2f8d in std::_Rb_tree<boost::intrusive_ptr<DBTableWalk>, boost::intrusive_ptr<DBTableWalk>, std::_Identity<boost::intrusive_ptr<DBTableWalk> >, std::less<boost::intrusive_ptr<DBTableWalk> >, std::allocator<boost::intrusive_ptr<DBTableWalk> > >::_M_insert_unique(boost::intrusive_ptr<DBTableWalk> const&) ()
#2 0x0000000000ef28cb in DBTableWalkMgr::WalkTable(boost::intrusive_ptr<DBTableWalk>) ()
#3 0x0000000000ef2c94 in DBTableWalkMgr::WalkAgain(boost::intrusive_ptr<DBTableWalk>) ()
#4 0x0000000000eec7de in DBTable::WalkAgain(boost::intrusive_ptr<DBTableWalk>) ()
#5 0x0000000000c066e2 in AgentSandesh::DoSandeshInternal(boost::shared_ptr<AgentSandesh>, int, int) ()
#6 0x0000000000c06a12 in AgentSandesh::DoSandesh(boost::shared_ptr<AgentSandesh>, int, int) ()
#7 0x0000000000c06a8a in AgentSandesh::DoSandesh(boost::shared_ptr<AgentSandesh>) ()
#8 0x0000000000c934ac in ItfReq::HandleRequest() const ()
#9 0x0000000000de1b0d in Sandesh::ProcessRecv(SandeshRequest*) ()
#10 0x0000000000df5b34 in QueueTaskRunner<SandeshRequest*, WorkQueue<SandeshRequest*> >::Run() ()
#11 0x0000000000ec65cf in TaskImpl::execute() ()
The above error occurs because the route delete walk did not happen when the VRF delete request was received.
In the Contrail Networking R1909 branch, DBTableWalkMgr::ProcessWalkRequestList does not acquire a lock before processing the walk requests, even though that processing accesses the walk_request_list and walk_request_set data structures.
The issue is resolved in Contrail Networking Release R1911 and later.
The fix accounts for the fact that DBTableWalkMgr::ProcessWalkRequestList (which processes the walk requests) and DBTableWalkMgr::WalkTable (which allocates the walker) may run in parallel: DBTableWalkMgr::WalkTable can be called from any task, and such a task can run concurrently with the DB::walker task.
See https://review.opencontrail.org/c/Juniper/contrail-controller/+/54659.
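The locking problem described above can be illustrated with a minimal C++ sketch. This is not the Contrail code and not the actual patch: WalkMgrSketch and RunConcurrentDemo are hypothetical names, walk requests are reduced to plain integers, and std::mutex is an assumption standing in for whatever locking primitive the real fix uses. The sketch only shows the general pattern in which the function that queues a request and the function that drains the queue take the same lock before touching the shared list and set.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

// Hypothetical stand-in for a walk manager; NOT the Contrail DBTableWalkMgr.
class WalkMgrSketch {
public:
    // May be called from any thread (plays the role of WalkTable).
    // Returns true if the request was newly queued.
    bool WalkTable(int walk_id) {
        std::lock_guard<std::mutex> lock(mutex_);
        bool inserted = walk_request_set_.insert(walk_id).second;
        if (inserted) {
            walk_request_list_.push_back(walk_id);
        }
        return inserted;
    }

    // Drains pending requests (plays the role of ProcessWalkRequestList).
    // Returns the number of requests taken off the queue by this call.
    std::size_t ProcessWalkRequestList() {
        std::deque<int> pending;
        {
            // Same mutex as WalkTable, so the list and the set are never
            // modified by two threads at the same time.
            std::lock_guard<std::mutex> lock(mutex_);
            pending.swap(walk_request_list_);
            walk_request_set_.clear();
        }
        // The actual table walks would run here, outside the lock.
        return pending.size();
    }

private:
    std::mutex mutex_;
    std::deque<int> walk_request_list_;
    std::set<int> walk_request_set_;
};

// Queues producers * per_thread distinct requests from several threads while
// one thread drains them concurrently; returns the total number processed.
std::size_t RunConcurrentDemo(int producers, int per_thread) {
    assert(producers >= 0 && per_thread >= 0);
    WalkMgrSketch mgr;
    std::atomic<bool> done(false);
    std::size_t processed = 0;  // written only by the consumer thread

    std::thread consumer([&] {
        while (!done.load()) {
            processed += mgr.ProcessWalkRequestList();
            std::this_thread::yield();
        }
        processed += mgr.ProcessWalkRequestList();  // final drain
    });

    std::vector<std::thread> workers;
    for (int p = 0; p < producers; ++p) {
        workers.emplace_back([&mgr, p, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                mgr.WalkTable(p * per_thread + i);  // distinct id per request
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    done.store(true);
    consumer.join();
    return processed;
}
```

Calling RunConcurrentDemo(4, 1000) from a test program should report every queued request as processed exactly once. If the lock_guard in ProcessWalkRequestList is removed, the std::set and std::deque can be modified from two threads at once, which is undefined behavior and can corrupt the containers.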