おのたく日記


2018-11-11 (Sun) [ZFS] Did a disk die?

Receiving an error notification mail from ZFS

From: root

Subject: ZFS device fault for pool 0x9773EB28D655981A on on-o.com

The number of I/O errors associated with a ZFS device exceeded

acceptable levels. ZFS has marked the device as faulted.

impact: Fault tolerance of the pool may be compromised.

eid: 40

class: statechange

state: FAULTED

host: on-o.com

time: 2018-11-11 07:25:07+0900

vpath: /dev/disk/by-id/ata-TOSHIBA_DT01ACA300_66WYX9FS-part1

vguid: 0x1E626A2E7FA04A2D

pool: 0x9773EB28D655981A

Seriously? While I was still wondering about that, a scrub was also run automatically, and another mail arrived:

Subject: ZFS scrub_finish event for tank on on-o.com

ZFS has finished a scrub:

eid: 43

class: scrub_finish

host: on-o.com

time: 2018-11-11 11:00:20+0900

pool: tank

state: DEGRADED

status: One or more devices are faulted in response to persistent errors.

Sufficient replicas exist for the pool to continue functioning in a

degraded state.

action: Replace the faulted device, or use 'zpool clear' to mark the device

repaired.

scan: scrub repaired 288K in 10h36m with 0 errors on Sun Nov 11 11:00:20 2018

config:

NAME                                        STATE     READ WRITE CKSUM
tank                                        DEGRADED     0     0     0
  mirror-0                                  DEGRADED     0     0     0
    ata-ST4000DM000-1F2169_Z997V2F2-part1   ONLINE       0     0     0
    ata-TOSHIBA_DT01ACA300_66SWYX9FS-part1  FAULTED      0     0     9  too many errors

errors: No known data errors

So the TOSHIBA 3 TB HDD has logged 9 checksum errors and has been marked FAULTED with "too many errors".

Taking a closer look:

# zpool status -v

pool: tank

state: DEGRADED

status: One or more devices are faulted in response to persistent errors.

Sufficient replicas exist for the pool to continue functioning in a

degraded state.

action: Replace the faulted device, or use 'zpool clear' to mark the device

repaired.

scan: scrub repaired 288K in 10h36m with 0 errors on Sun Nov 11 11:00:20 2018

config:

NAME                                        STATE     READ WRITE CKSUM
tank                                        DEGRADED     0     0     0
  mirror-0                                  DEGRADED     0     0     0
    ata-ST4000DM000-1F2169_Z997V2F2-part1   ONLINE       0     0     0
    ata-TOSHIBA_DT01ACA300_66SWYX9FS-part1  FAULTED      0     0     9  too many errors

errors: No known data errors
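
As a note to self, the unhealthy rows can be picked out of output like the above with a small filter. This is only a rough sketch: `unhealthy_vdevs` is a name I made up, and it assumes the config table keeps the NAME/STATE/READ/WRITE/CKSUM column order shown here. Because it reads text from stdin, it can also be tried against a saved copy of the output.

```shell
#!/bin/sh
# Hypothetical helper (not part of ZFS): print every row of the config
# table whose STATE is not ONLINE, together with its error counters.
# Input: the text of `zpool status` on stdin.
unhealthy_vdevs() {
    awk '
        # The header row marks the start of the config table.
        $1 == "NAME" && $2 == "STATE" { in_table = 1; next }
        # The "errors:" summary line marks the end of the table.
        $1 == "errors:" { in_table = 0 }
        # Rows have at least five fields: name, state, read, write, cksum.
        in_table && NF >= 5 && $2 != "ONLINE" {
            printf "%s state=%s read=%s write=%s cksum=%s\n", $1, $2, $3, $4, $5
        }
    '
}

# Typical use on a machine with a live pool:
#   zpool status tank | unhealthy_vdevs
```

With the status shown above, this would list tank and mirror-0 (both DEGRADED) followed by the FAULTED TOSHIBA partition, and skip the healthy Seagate one.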

The output suggests trying 'zpool clear', so as a first step I clear the errors, and run SMART tests in parallel.

# zpool clear tank ata-TOSHIBA_DT01ACA300_66SWYX9FS-part1

# zpool status

pool: tank

state: ONLINE

status: One or more devices is currently being resilvered. The pool will

continue to function, possibly in a degraded state.

action: Wait for the resilver to complete.

scan: resilver in progress since Mon Nov 12 22:53:58 2018

338G scanned out of 1.79T at 858M/s, 0h29m to go

5.44G resilvered, 18.37% done

config:

NAME                                        STATE     READ WRITE CKSUM
tank                                        ONLINE       0     0     0
  mirror-0                                  ONLINE       0     0     0
    ata-ST4000DM000-1F2169_Z997V2F2-part1   ONLINE       0     0     0
    ata-TOSHIBA_DT01ACA300_66SWYX9FS-part1  ONLINE       0     0     0  (resilvering)

errors: No known data errors

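I thought I would only have to wait 29 minutes, but when I looked again the estimate had grown to 3 hours, so I run the SMART tests at the same time.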
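
Since the time estimate jumps around, the percentage is probably the better number to watch. Here is a minimal sketch of how I might poll it. Both function names are my own invention, and the parsing assumes the "NN.NN% done" wording seen in the status output above, which other ZFS versions may phrase differently.

```shell
#!/bin/sh
# Hypothetical helper: extract the percentage from a line such as
# "5.44G resilvered, 18.37% done" in `zpool status` text on stdin.
resilver_percent() {
    sed -n 's/.*[ ,]\([0-9][0-9.]*\)% done.*/\1/p'
}

# Hypothetical helper: report progress once a minute until the status
# no longer says a resilver is in progress. Needs a live pool.
wait_for_resilver() {
    pool=$1
    while zpool status "$pool" | grep -q 'resilver in progress'; do
        pct=$(zpool status "$pool" | resilver_percent)
        echo "$(date '+%H:%M:%S') resilver ${pct:-?}% done"
        sleep 60
    done
    echo "no resilver in progress on $pool"
}

# Typical use:
#   wait_for_resilver tank
```

The parser is kept separate from the loop so that it can be checked against pasted output without touching the pool.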

SMART short test

# smartctl -t short /dev/sdb

smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.18.0-2-amd64] (local build)

Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===

Sending command: "Execute SMART Short self-test routine immediately in off-line mode".

Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.

Testing has begun.

Please wait 1 minutes for test to complete.

Test will complete after Mon Nov 12 22:47:49 2018

Use smartctl -X to abort test.

# smartctl -l error /dev/sdb

smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.18.0-2-amd64] (local build)

Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===

SMART Error Log Version: 1

No Errors Logged

# smartctl -l selftest /dev/sdb

smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.18.0-2-amd64] (local build)

Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     42719         -
# 2  Short offline       Completed without error       00%     32323         -
# 3  Extended offline    Aborted by host               90%     32323         -
# 4  Short offline       Completed without error       00%     25958         -
# 5  Short offline       Completed without error       00%     25958         -

For now, the short test is OK.
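
Reading the log by eye is fine for one disk, but the same check could be scripted. A rough sketch follows; `latest_selftest_ok` is a hypothetical name, and it assumes the newest entry is the row starting with "# 1" and that a clean run is reported as "Completed without error", as in the log above.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) only if entry "# 1" of a
# `smartctl -l selftest` log on stdin reads "Completed without error".
latest_selftest_ok() {
    awk '
        # Entry 1 is printed as "# 1", so "#" and "1" are separate fields.
        $1 == "#" && $2 == "1" {
            found = 1
            ok = ($0 ~ /Completed without error/)
        }
        END {
            if (found && ok) exit 0
            exit 1
        }
    '
}

# Typical use (requires root and a real disk):
#   smartctl -l selftest /dev/sdb | latest_selftest_ok && echo "last self-test OK"
```

A passing short test only says the drive's own quick check found nothing; it does not explain the checksum errors that ZFS counted.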

SMART offline test

# smartctl -t offline /dev/sdb

smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.18.0-2-amd64] (local build)

Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===

Sending command: "Execute SMART off-line routine immediately in off-line mode".

Drive command "Execute SMART off-line routine immediately in off-line mode" successful.

Testing has begun.

Please wait 23082 seconds for test to complete.

Test will complete after Tue Nov 13 05:12:42 2018

Use smartctl -X to abort test.

This will take until morning, so I'll sleep on it. Good things come to those who wait.
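
smartctl gives the wait in seconds, which is not very readable. A tiny sketch for converting it, where `format_duration` is just a name I picked:

```shell
#!/bin/sh
# Hypothetical helper: turn a number of seconds into hours, minutes, seconds.
format_duration() {
    secs=$1
    printf '%dh%02dm%02ds\n' $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
}

format_duration 23082   # prints 6h24m42s
```

About six and a half hours, which fits the announced completion time of 05:12:42 if the test started at around 22:48.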
