[Contrail] contrail-query-engine crash in a race condition in pre-3.2.12.0/4.1.2.0/5.1.0 releases

Article ID: KB34147 KB Last Updated: 30 Apr 2019 Version: 1.0
Summary:

The Contrail query engine handles queries and requires access to the data in the Cassandra database. In Contrail releases earlier than 3.2.12.0, 4.1.2.0, and 5.1.0, a race condition is known to exist in the WhereQuery::subquery_processed function. It can cause the analytics query engine to crash when multiple queries are generated by the Contrail query engine in a short interval.

This article describes how to take a back trace from a query-engine core file and lists the releases that fix this known query-engine crash, so that the problem can be troubleshot when it occurs.

Symptoms:

When users run the contrail-status -d command, they may notice core files generated by the Contrail analytics query engine, listed under "Run time service failures" in the output below. While the other processes have been running for 150 days, the query engine has an uptime of only about 1 hour 23 minutes, which means that it restarted after its most recent crash.

contrail-status -d
== Contrail Analytics ==
supervisor-analytics: active
contrail-alarm-gen:0 active pid 3430, uptime 150 days, 15:00:05
contrail-analytics-api active pid 3429, uptime 150 days, 15:00:05
contrail-analytics-nodemgr active pid 3424, uptime 150 days, 15:00:05
contrail-collector active pid 3427, uptime 150 days, 15:00:05
contrail-query-engine active pid 19624, uptime 1:23:17
contrail-snmp-collector active pid 3425, uptime 150 days, 15:00:05
contrail-topology active pid 3426, uptime 150 days, 15:00:05


========Run time service failures=============
/var/crashes/core.contrail-query-.19993.cnal01.example.com.1531583808
/var/crashes/core.contrail-query-.20389.cnal01.example.com.1532413310
/var/crashes/core.contrail-query-.22384.cnal01.example.com.1539698281
/var/crashes/core.contrail-query-.30428.cnal01.example.com.1544446409
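The trailing number in each core file name appears to be a Unix epoch timestamp, so the crash times can be listed without opening the files. The following is a minimal Python sketch, assuming the name layout core.<process>.<pid>.<hostname>.<epoch> as inferred from the output above. The helper parse_core_name is hypothetical and is not part of Contrail.

```python
import os
import re
from datetime import datetime, timezone

# Assumed layout, inferred from the listing above (not a documented format):
#   core.<process>.<pid>.<hostname>.<epoch-seconds>
CORE_RE = re.compile(
    r"^core\.(?P<process>.+?)\.(?P<pid>\d+)\.(?P<host>.+)\.(?P<epoch>\d+)$"
)


def parse_core_name(path):
    """Split a core file path into process, pid, host, and crash time (UTC).

    Returns None when the file name does not follow the assumed layout.
    """
    match = CORE_RE.match(os.path.basename(path))
    if match is None:
        return None
    epoch = int(match.group("epoch"))
    return {
        "process": match.group("process"),
        "pid": int(match.group("pid")),
        "host": match.group("host"),
        "crashed_at": datetime.fromtimestamp(epoch, tz=timezone.utc),
    }


if __name__ == "__main__":
    cores = [
        "/var/crashes/core.contrail-query-.19993.cnal01.example.com.1531583808",
        "/var/crashes/core.contrail-query-.20389.cnal01.example.com.1532413310",
        "/var/crashes/core.contrail-query-.22384.cnal01.example.com.1539698281",
        "/var/crashes/core.contrail-query-.30428.cnal01.example.com.1544446409",
    ]
    for core in cores:
        info = parse_core_name(core)
        print(info["pid"], info["crashed_at"].isoformat())
```

Listing the crash times this way makes it easier to match each core file against periods of heavy query activity.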

Cause:

In Contrail releases earlier than 3.2.12.0, 4.1.2.0, and 5.1.0, a race condition is known to exist in the WhereQuery::subquery_processed function. It can cause the analytics query engine to crash when multiple queries are generated by the Contrail query engine in a short interval.

The following picture highlights the query engines on a typical Contrail analytics node.

Solution:

This known race condition has been fixed in the following Contrail releases: 3.2.12.0, 4.0.3.0, 4.1.2.0, 5.0.1.0, and 5.1.0.0.
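The release list above can be turned into a quick check. The following Python sketch assumes that each listed release is the first fixed build on its own major.minor branch; that is an interpretation of the list, not something this article states. The names FIXED_RELEASES, parse_version, and has_fix are hypothetical.

```python
# Fixed releases listed in this article, keyed by (major, minor) branch.
# Assumption: each entry is the first fixed release on that branch.
FIXED_RELEASES = {
    (3, 2): (3, 2, 12, 0),
    (4, 0): (4, 0, 3, 0),
    (4, 1): (4, 1, 2, 0),
    (5, 0): (5, 0, 1, 0),
    (5, 1): (5, 1, 0, 0),
}


def parse_version(text):
    """Turn '3.2.10.0' or '3.2.10.0-75' into a 4-part tuple of integers."""
    base = text.strip().split("-", 1)[0]  # drop a trailing build number
    parts = tuple(int(part) for part in base.split("."))
    # Pad short versions such as '5.1.0' so tuple comparison behaves as expected.
    return parts + (0,) * (4 - len(parts))


def has_fix(version_text):
    """Return True or False for a listed branch, None for an unlisted branch."""
    version = parse_version(version_text)
    fixed = FIXED_RELEASES.get(version[:2])
    if fixed is None:
        return None  # branch not covered by the list; no conclusion drawn
    return version >= fixed


if __name__ == "__main__":
    for candidate in ("3.2.10.0-75", "3.2.12.0", "4.1.1.0", "5.1.0"):
        print(candidate, has_fix(candidate))
```

Returning None for branches that are not in the list avoids guessing about releases this article does not mention.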

To troubleshoot this issue, perform the following steps:

  1. First check the Contrail version and build number in the core file by using the strings and contrail-version commands.
root@bcomp79:~# strings core.contrail-query-.22384.cnal01.example.com.1539698281|grep build-info
{"build-info": [{"build-time": "2018-05-13 14:11:27.619893", "build-hostname": "ubuntu", "build-user": "contrail-builder","build-version": "3.2.10.0"}]}

root@bcomp79:~# contrail-version
Package                                Version                        Build-ID | Repo | Package Name
-------------------------------------- ------------------------------ ----------------------------------
contrail-config                        3.2.10.0-75                          75
  2. Log in to a Contrail server that runs the same version and build as reported in the previous step, so that the matching binary is available for analysis.

  3. Use gdb to decode the core file with the matching binary. In this case, the core file was generated by a segmentation fault (memory corruption) in WhereQuery::subquery_processed(QueryUnit*) ().

root@bcomp79:~# gdb contrail-query-engine core.contrail-query-.22384.cnal01.example.com.1539698281

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/bin/contrail-query-engine --conf_file /etc/contrail/contrail-query-engine.'.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000532778 in WhereQuery::subquery_processed(QueryUnit*) ()
Missing separate debuginfos, use: debuginfo-install contrail-analytics-5.1.0-89.el7.centos.x86_64
(gdb) bt
#0 0x0000000000532778 in WhereQuery::subquery_processed(QueryUnit*) ()
#1 0x00000000004cfc5f in DbQueryUnit::WPCompleteCb(WorkPipeline<DbQueryUnit::Input, DbQueryUnit::Output, DbQueryUnit::Output, DbQueryUnit::Output, DbQueryUnit::Output, DbQueryUnit::Output, DbQueryUnit::Output>*, bool) ()
#2 0x00000000004a37bc in boost::function1<void, bool>::operator()(bool) const ()
#3 0x00000000004d2e59 in void WorkPipeline<DbQueryUnit::Input, DbQueryUnit::Output, DbQueryUnit::Output, DbQueryUnit::Output, DbQueryUnit::Output, DbQueryUnit::Output, DbQueryUnit::Output>::NextStage<0, DbQueryUnit::Output>() ()
#4 0x00000000004d368d in WorkPipeline<DbQueryUnit::Input, DbQueryUnit::Output, DbQueryUnit::Output, DbQueryUnit::Output, DbQueryUnit::Output, DbQueryUnit::Output, DbQueryUnit::Output>::WorkStageCb(unsigned int, bool) ()
#5 0x00000000004d1107 in WorkStage<DbQueryUnit::Input, DbQueryUnit::Output, std::vector<query_result_unit_t, std::allocator<query_result_unit_t> >, DbQueryUnit::Stage0Out>::Runner() ()
#6 0x000000000049c1a9 in PipelineWorker::Run() ()
#7 0x00000000004646ff in TaskImpl::execute() ()
#8 0x00007fbe049bc8ca in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all(tbb::task&, tbb::task*) () from /lib64/libtbb.so.2
#9 0x00007fbe049b85b6 in tbb::internal::arena::process(tbb::internal::generic_scheduler&) () from /lib64/libtbb.so.2
#10 0x00007fbe049b7c8b in tbb::internal::market::process(rml::job&) () from /lib64/libtbb.so.2
#11 0x00007fbe049b567f in tbb::internal::rml::private_worker::run() () from /lib64/libtbb.so.2
#12 0x00007fbe049b5879 in tbb::internal::rml::private_worker::thread_routine(void*) () from /lib64/libtbb.so.2
#13 0x00007fbe04bd7e25 in start_thread () from /lib64/libpthread.so.0
#14 0x00007fbe03ca834d in clone () from /lib64/libc.so.6
  4. With the back trace information that is collected, open a Technical Service Request by contacting Support.
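The version check and the back trace check in the steps above can also be scripted against saved command output. The following Python sketch assumes that the build-info line is valid JSON, as in the sample shown, and that gdb prints frames in the "#N 0xADDR in symbol ()" form seen in this back trace. The function names are hypothetical, and frames printed in a different form (for example, with debug symbols installed) would not match.

```python
import json
import re


def build_version(build_info_line):
    """Return the build-version field from the line printed by strings | grep."""
    data = json.loads(build_info_line)
    return data["build-info"][0]["build-version"]


# Matches a frame such as '#0 0x0000000000532778 in Func(args) ()'.
FRAME_RE = re.compile(r"^#(?P<index>\d+)\s+0x[0-9a-fA-F]+ in (?P<symbol>.+?) \(\)")


def top_frame_symbol(backtrace_text):
    """Return the symbol of frame #0, or None if no such frame is found."""
    for line in backtrace_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match and match.group("index") == "0":
            return match.group("symbol")
    return None


def matches_known_race(backtrace_text):
    """True when frame #0 is in WhereQuery::subquery_processed."""
    symbol = top_frame_symbol(backtrace_text)
    return symbol is not None and symbol.startswith("WhereQuery::subquery_processed")


if __name__ == "__main__":
    info = ('{"build-info": [{"build-time": "2018-05-13 14:11:27.619893", '
            '"build-hostname": "ubuntu", "build-user": "contrail-builder",'
            '"build-version": "3.2.10.0"}]}')
    bt = "#0 0x0000000000532778 in WhereQuery::subquery_processed(QueryUnit*) ()"
    print(build_version(info), matches_known_race(bt))
```

A match on frame #0 only suggests that the crash resembles the one described here; the full back trace should still be attached to the service request.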
