Syslog message: /dev/ada0, Offline uncorrectable sectors

Article ID: KB31515  KB Last Updated: 13 May 2019  Version: 2.0
Summary:

Offline uncorrectable sectors messages appear on MX960 and EX9200 platforms.

This article explains the following syslog message:

smartd[xxxx]: %DAEMON-2: Device: /dev/ada0, <value> Offline uncorrectable sectors

Symptoms:

The following log messages appear at 30-minute intervals after upgrading to Junos 15.1 when a StorFly SSD is used on RE-S-1800X4 routing engines:

Feb 20 15:28:19.296 2017  ds-fmb-dc-1 smartd[3914]: %DAEMON-2: Device: /dev/ada0, 2 Offline uncorrectable sectors

Feb 20 15:58:19.303 2017  ds-fmb-dc-1 smartd[3914]: %DAEMON-2: Device: /dev/ada0, 2 Offline uncorrectable sectors

Feb 20 16:28:19.306 2017  ds-fmb-dc-1 smartd[3914]: %DAEMON-2: Device: /dev/ada0, 2 Offline uncorrectable sectors

Feb 20 16:58:19.314 2017  ds-fmb-dc-1 smartd[3914]: %DAEMON-2: Device: /dev/ada0, 2 Offline uncorrectable sectors

Feb 20 17:28:19.342 2017  ds-fmb-dc-1 smartd[3914]: %DAEMON-2: Device: /dev/ada0, 2 Offline uncorrectable sectors

Feb 20 17:58:19.352 2017  ds-fmb-dc-1 smartd[3914]: %DAEMON-2: Device: /dev/ada0, 2 Offline uncorrectable sectors

Feb 20 18:28:19.365 2017  ds-fmb-dc-1 smartd[3914]: %DAEMON-2: Device: /dev/ada0, 2 Offline uncorrectable sectors

Cause:

In Junos, smartd is the daemon that interfaces with the disk's Self-Monitoring, Analysis and Reporting Technology (SMART) system. The SMART system monitors the drive for anything out of the ordinary and uses attributes to check the disk condition and to analyze its reliability. The smartd daemon runs in the background, continuously monitors the drive, and reports any errors.

When a StorFly SSD is used, the smartd daemon may not correctly read the raw data pulled from the disk. This can cause the above messages to be written to the syslog every 30 minutes.

Solution:

If the value of "Offline uncorrectable sectors" (in the above example logs: value = 2) is not incrementing, the logs are harmless and do not indicate SSD issues. The fixed versions of PR1233992 do not print these logs (15.1R7 and later). 

If the value of "Offline uncorrectable sectors" is incrementing, then the messages may indicate a possible failure of the SSD, and further investigation is needed.
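The check for an incrementing value can be automated off-box against a saved copy of the messages file. The following is a minimal Python sketch, assuming the log lines follow exactly the format shown in the Symptoms section. The names offline_uncorrectable_counts and is_incrementing are illustrative only and are not part of Junos or smartmontools.

```python
import re

# Matches the smartd lines shown in the Symptoms section, for example:
#   ... smartd[3914]: %DAEMON-2: Device: /dev/ada0, 2 Offline uncorrectable sectors
# Assumption: the message format is exactly as in the sample logs.
PATTERN = re.compile(
    r"smartd\[\d+\]:.*Device: (?P<dev>\S+), (?P<count>\d+) Offline uncorrectable sectors"
)


def offline_uncorrectable_counts(lines, device="/dev/ada0"):
    """Return the reported sector counts for one device, in log order."""
    counts = []
    for line in lines:
        match = PATTERN.search(line)
        if match and match.group("dev") == device:
            counts.append(int(match.group("count")))
    return counts


def is_incrementing(counts):
    """True if any reported count is higher than the one before it."""
    return any(later > earlier for earlier, later in zip(counts, counts[1:]))


sample = [
    "Feb 20 15:28:19.296 2017  ds-fmb-dc-1 smartd[3914]: %DAEMON-2: "
    "Device: /dev/ada0, 2 Offline uncorrectable sectors",
    "Feb 20 15:58:19.303 2017  ds-fmb-dc-1 smartd[3914]: %DAEMON-2: "
    "Device: /dev/ada0, 2 Offline uncorrectable sectors",
]
counts = offline_uncorrectable_counts(sample)
print(counts, is_incrementing(counts))
```

With the two sample lines, the count stays at 2, so the sketch reports that the value is not incrementing.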

To get a summary of the health status of the disk, run the smartctl command from the shell. It displays all SSD information, including detailed output of all parameters. Below is example output of this command for a StorFly SSD:

root@ds-fmb-dc-1> start shell
root@ds-fmb-dc-1: /var/home/remote # smartctl -a /dev/ada0
smartctl 6.4 2015-06-04 r4109 [FreeBSD JNPR-10.3-20160927.337663_build amd64] Junos Build
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     StorFly - VSFA18PI032G-1A0
Serial Number:    P1T05003961506020199
Firmware Version: 0605-1A0
User Capacity:    29,880,221,696 bytes [29.8 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2 (minor revision not indicated)
Local Time is:    Tue Feb 21 15:08:35 2017 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Unavailable
Write cache is:   Enabled
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Unavailable

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x15) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Abort Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0002) Does not save SMART data before
                                        entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (   1) minutes.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     ------   100   100   000    -    0
  5 Reallocated_Sector_Ct   ------   100   100   000    -    0
  9 Power_On_Hours          ------   100   100   000    -    4658
 12 Power_Cycle_Count       ------   100   100   000    -    24
160 Unknown_Attribute       ------   100   100   000    -    4
161 Unknown_Attribute       ------   100   100   000    -    232
163 Unknown_Attribute       ------   100   100   000    -    38
164 Unknown_Attribute       ------   100   100   000    -    10556
165 Unknown_Attribute       ------   100   100   050    -    20
166 Unknown_Attribute       ------   100   100   050    -    0
167 Unknown_Attribute       ------   100   100   001    -    5
192 Power-Off_Retract_Count ------   100   100   000    -    0
194 Temperature_Celsius     ------   100   100   000    -    27
195 Hardware_ECC_Recovered  ------   100   100   000    -    0
196 Reallocated_Event_Count ------   100   100   016    -    0
198 Offline_Uncorrectable   ------   100   100   050    -    2
199 UDMA_CRC_Error_Count    ------   100   100   050    -    0
241 Total_LBAs_Written      ------   100   100   000    -    4058
242 Total_LBAs_Read         ------   100   100   000    -    829
243 Unknown_Attribute       ------   000   000   000    -    0
244 Unknown_Attribute       ------   000   000   000    -    0


The example output shows that the SMART overall-health self-assessment test result is PASSED. This basic overall health test has only two results: PASSED or FAILED. A FAILED result indicates that the SMART system believes the drive to be in imminent danger of failure, so all important data must be backed up immediately.
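When the output of smartctl -a is saved to a file, the health result and the RAW_VALUE column can be extracted programmatically. The following Python sketch assumes the eight-column attribute table layout shown in the example output above; other smartctl versions or options may print a different layout. The function name parse_smartctl is illustrative only.

```python
import re

# One row of the attribute table as printed in the example output:
# ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
# Assumption: RAW_VALUE is a plain integer, as it is for every row above.
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+(?P<flags>\S+)\s+(?P<value>\d+)"
    r"\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+(?P<fail>\S+)\s+(?P<raw>\d+)\s*$"
)
HEALTH = re.compile(r"SMART overall-health self-assessment test result:\s*(\w+)")


def parse_smartctl(text):
    """Return (health, raw_values); raw_values maps attribute ID to RAW_VALUE."""
    health_match = HEALTH.search(text)
    health = health_match.group(1) if health_match else None
    raw_values = {}
    for line in text.splitlines():
        row = ROW.match(line)
        if row:
            raw_values[int(row.group("id"))] = int(row.group("raw"))
    return health, raw_values


sample = """\
SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   ------   100   100   000    -    0
161 Unknown_Attribute       ------   100   100   000    -    232
167 Unknown_Attribute       ------   100   100   001    -    5
198 Offline_Uncorrectable   ------   100   100   050    -    2
"""
health, raw = parse_smartctl(sample)
print(health, raw)
```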

Furthermore, the most important attributes to check for disk health in the example output above are:

  • Attribute #5 (Reallocated_Sector_Ct): This is a critical parameter that shows the count of reallocated sectors. When a drive finds a read error, it marks that sector as "reallocated" and transfers the data to a spare area. The RAW_VALUE is the count of bad sectors that have been found and remapped; the higher it is, the more sectors the drive has had to reallocate. A drive with bad sectors can continue to operate, but a drive that has had any reallocation is prone to failure. If this value is 0, the SSD is healthy. If this value is not 0, urgent data backup and hardware replacement are recommended.

In the example output provided, Reallocated_Sector_Ct has RAW_VALUE = 0, meaning that the drive is OK.

  • Attribute #198 (Offline_Uncorrectable): This value counts errors encountered while reading a sector (two in the example) for which self-adjustments were made so that the data could eventually be retrieved. If the value of this attribute keeps incrementing in the logs, this indicates defects of the disk surface.

The example logs show that this value is constant, meaning that the drive is OK. However, this counter is permanent and cannot be reset; it can only increment when another read error occurs.

  • Attribute #161 (Spares Count) and Attribute #167 (Erase Count): These attributes can also be checked to confirm SSD health. A failing SSD is expected to show Attribute #161 below 10 and Attribute #167 above 100000.

Example output shows these attributes to be within limits, meaning that the drive is OK.
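The attribute checks described above can be expressed as a short Python sketch. It takes a mapping of attribute ID to RAW_VALUE and applies the limits stated in this article. The function name check_ssd_attributes is illustrative only. As a design choice, the sketch flags Attribute #161 and Attribute #167 independently, which is a more conservative reading than requiring both conditions at once.

```python
def check_ssd_attributes(raw):
    """Return a list of warnings for the attributes discussed in this article.

    `raw` maps SMART attribute ID to RAW_VALUE, e.g. {5: 0, 161: 232, 167: 5}.
    Attributes missing from `raw` are skipped rather than treated as failures.
    """
    warnings = []
    # Attribute #5: any reallocation at all is treated as a warning sign.
    if raw.get(5, 0) != 0:
        warnings.append("Reallocated_Sector_Ct (#5) is not 0")
    # Attribute #161: fewer than 10 spares is the limit given in the article.
    if 161 in raw and raw[161] < 10:
        warnings.append("Spares Count (#161) is below 10")
    # Attribute #167: an erase count above 100000 is the limit given in the article.
    if 167 in raw and raw[167] > 100000:
        warnings.append("Erase Count (#167) is above 100000")
    return warnings


# RAW_VALUEs taken from the example smartctl output.
example = {5: 0, 161: 232, 167: 5, 198: 2}
print(check_ssd_attributes(example))
```

For the example values, no warnings are returned, which matches the conclusion that the drive is OK.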

Solutions:

  1. If the SSD has PASSED the self-assessment test, the attributes above have been verified, and the Offline_Uncorrectable value is not incrementing in the logs, the SSD is healthy and the logs can be ignored.

    The "Device: /dev/ada0, <value> Offline uncorrectable sectors" message can be ignored if its frequency is once every 30 minutes for a low (single digit) number of sectors. The repeating message for the same value of uncorrectable sectors has been proven to be a product limitation for the MX960 and EX9200 platforms and cannot be suppressed.

    Releases 15.1R7, 16.1R5, 16.2R2 and 17.1R2 have improved logging of these messages: if the Offline_Uncorrectable value is non-zero and not incrementing, the log messages are printed every 3 hours.

    Filtering these entries out of the syslog messages file is not recommended. If the Offline_Uncorrectable value were to change, indicating an issue with the SSD, the change could be missed completely. The disk might then fail and eventually have to be replaced.

  2. If the logs show a high count of Offline_Uncorrectable sectors (more than a single digit) and the disk attributes show that Reallocated_Sector_Ct is not 0, urgent data backup and hardware replacement are recommended for the SSD.
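The two outcomes above can be summarized as a small decision function. This is a simplified Python sketch, not an official procedure: the name triage and the returned strings are hypothetical, only Attribute #5 is checked as a stand-in for "attributes verified", and any case that matches neither outcome is reported as needing further investigation.

```python
def triage(health, offline_counts, reallocated):
    """Map the two outcomes in this article onto a short recommendation.

    health:          'PASSED' or 'FAILED' as reported by smartctl
    offline_counts:  Offline uncorrectable values from successive log entries
    reallocated:     RAW_VALUE of attribute #5 (Reallocated_Sector_Ct)
    """
    latest = offline_counts[-1] if offline_counts else 0
    incrementing = any(
        later > earlier
        for earlier, later in zip(offline_counts, offline_counts[1:])
    )
    # Outcome 2: count above a single digit and reallocated sectors present.
    if latest > 9 and reallocated != 0:
        return "replace"
    # Outcome 1: test passed, count stable, no reallocated sectors.
    if health == "PASSED" and not incrementing and reallocated == 0:
        return "ignore"
    # Anything else falls outside both outcomes.
    return "investigate"


print(triage("PASSED", [2, 2, 2], 0))
```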

Modification History:
2019-05-10: added PR1233992; clarified the statement whether 'Offline uncorrectable sectors' is or is not incrementing.
