Fixing a Failure Caused by Inconsistency Between Hive Metadata and the Actual Data on HDFS
This article is reproduced from the WeChat public account 「明哥的IT隨筆」, written by IT明哥. To reproduce it, please contact the 明哥的IT隨筆 account.
Preface
Hi everyone, I'm 明哥!
This post is part of my "Big Data Troubleshooting" series. It walks through the diagnosis and fix of a Hive SQL job failure caused by an inconsistency between the metadata stored in Hive and the actual data on HDFS.
The main text follows.
Symptom
The client reported the following error:
- Unable to move source xxx to destination xxx
(screenshot: client-side error message)
Analysis
The client-side error alone does not reveal the full picture. Logging in to the node where hiveserver2 runs and inspecting the hiveserver2 log, we find the following:
- 2021-09-01 11:47:46,795 INFO org.apache.hadoop.hive.ql.exec.Task: [HiveServer2-Background-Pool: Thread-1105]: Loading data to table hs_ods.ods_ses_acct_assure_scale partition (part_date=20210118) from hdfs://hs01:8020/user/hundsun/dap/hive/hs_ods/ods_ses_acct_assure_scale/part_date=20210118/.hive-staging_hive_2021-09-01_11-47-31_216_694180642957006705-35/-ext-10000
- 2021-09-01 11:47:46,795 INFO hive.metastore: [HiveServer2-Background-Pool: Thread-1105]: HMS client filtering is enabled.
- 2021-09-01 11:47:46,795 INFO hive.metastore: [HiveServer2-Background-Pool: Thread-1105]: Trying to connect to metastore with URI thrift://hs01:9083
- 2021-09-01 11:47:46,795 INFO hive.metastore: [HiveServer2-Background-Pool: Thread-1105]: Opened a connection to metastore, current connections: 54
- 2021-09-01 11:47:46,796 INFO hive.metastore: [HiveServer2-Background-Pool: Thread-1105]: Connected to metastore.
- 2021-09-01 11:47:46,928 INFO org.apache.hadoop.hive.ql.exec.MoveTask: [HiveServer2-Background-Pool: Thread-1105]: Partition is: {part_date=20210118}
- 2021-09-01 11:47:46,945 INFO org.apache.hadoop.hive.common.FileUtils: [HiveServer2-Background-Pool: Thread-1105]: Creating directory if it doesn't exist: hdfs://hs01:8020/user/hundsun/dap/hive/hs_ods/ods_ses_acct_assure_scale/part_date=20210118
- 2021-09-01 11:47:46,947 ERROR hive.ql.metadata.Hive: [HiveServer2-Background-Pool: Thread-1105]: Failed to move: {}
- 2021-09-01 11:47:46,947 ERROR org.apache.hadoop.hive.ql.Driver: [HiveServer2-Background-Pool: Thread-1105]: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask. Unable to move source hdfs://hs01:8020/user/hundsun/dap/hive/hs_ods/ods_ses_acct_assure_scale/part_date=20210118/.hive-staging_hive_2021-09-01_11-47-31_216_694180642957006705-35/-ext-10000 to destination hdfs://hs01:8020/user/hundsun/dap/hive/hs_ods/ods_ses_acct_assure_scale/part_date=20210118
- 2021-09-01 11:47:46,948 INFO org.apache.hadoop.hive.ql.Driver: [HiveServer2-Background-Pool: Thread-1105]: Completed executing command(queryId=hive_20210901114731_d7a78302-fb2a-4b45-9472-db6a9787f710); Time taken: 15.489 seconds
- 2021-09-01 11:47:46,957 ERROR org.apache.hive.service.cli.operation.Operation: [HiveServer2-Background-Pool: Thread-1105]: Error running hive query:
- org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask. Unable to move source hdfs://hs01:8020/user/hundsun/dap/hive/hs_ods/ods_ses_acct_assure_scale/part_date=20210118/.hive-staging_hive_2021-09-01_11-47-31_216_694180642957006705-35/-ext-10000 to destination hdfs://hs01:8020/user/hundsun/dap/hive/hs_ods/ods_ses_acct_assure_scale/part_date=20210118
- at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:329) ~[hive-service-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:258) ~[hive-service-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hive.service.cli.operation.SQLOperation.access$600(SQLOperation.java:92) ~[hive-service-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:345) [hive-service-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_271]
- at javax.security.auth.Subject.doAs(Subject.java:422) [?:1.8.0_271]
- at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875) [hadoop-common-3.0.0-cdh6.3.2.jar:?]
- at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:357) [hive-service-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_271]
- at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_271]
- at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_271]
- at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_271]
- at java.lang.Thread.run(Thread.java:748) [?:1.8.0_271]
- Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source hdfs://hs01:8020/user/hundsun/dap/hive/hs_ods/ods_ses_acct_assure_scale/part_date=20210118/.hive-staging_hive_2021-09-01_11-47-31_216_694180642957006705-35/-ext-10000 to destination hdfs://hs01:8020/user/hundsun/dap/hive/hs_ods/ods_ses_acct_assure_scale/part_date=20210118
- at org.apache.hadoop.hive.ql.metadata.Hive.getHiveException(Hive.java:3449) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hadoop.hive.ql.metadata.Hive.getHiveException(Hive.java:3405) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:3400) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:3697) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:1614) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1525) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1489) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:501) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:97) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2200) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1843) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1563) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1339) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1334) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:256) ~[hive-service-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- ... 11 more
- Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source hdfs://hs01:8020/user/hundsun/dap/hive/hs_ods/ods_ses_acct_assure_scale/part_date=20210118/.hive-staging_hive_2021-09-01_11-47-31_216_694180642957006705-35/-ext-10000/000000_0 to destination hdfs://hs01:8020/user/hundsun/dap/hive/hs_ods/ods_ses_acct_assure_scale/part_date=20210118/000000_0
- at org.apache.hadoop.hive.ql.metadata.Hive.handlePoolException(Hive.java:3422) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:3367) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:3697) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:1614) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1525) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1489) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:501) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:97) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2200) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1843) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1563) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1339) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1334) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:256) ~[hive-service-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- ... 11 more
- Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source hdfs://hs01:8020/user/hundsun/dap/hive/hs_ods/ods_ses_acct_assure_scale/part_date=20210118/.hive-staging_hive_2021-09-01_11-47-31_216_694180642957006705-35/-ext-10000/000000_0 to destination hdfs://hs01:8020/user/hundsun/dap/hive/hs_ods/ods_ses_acct_assure_scale/part_date=20210118/000000_0
- at org.apache.hadoop.hive.ql.metadata.Hive.getHiveException(Hive.java:3449) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hadoop.hive.ql.metadata.Hive.getHiveException(Hive.java:3405) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hadoop.hive.ql.metadata.Hive.access$200(Hive.java:175) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:3350) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:3335) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- ... 4 more
- Caused by: java.io.IOException: rename for src path: hdfs://hs01:8020/user/hundsun/dap/hive/hs_ods/ods_ses_acct_assure_scale/part_date=20210118/.hive-staging_hive_2021-09-01_11-47-31_216_694180642957006705-35/-ext-10000/000000_0 to dest path:hdfs://hs01:8020/user/hundsun/dap/hive/hs_ods/ods_ses_acct_assure_scale/part_date=20210118/000000_0 returned false
- at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:3347) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:3335) ~[hive-exec-2.1.1-cdh6.3.2.jar:2.1.1-cdh6.3.2]
- ... 4 more
- 2021-09-01 11:47:46,962 INFO org.apache.hadoop.hive.conf.HiveConf: [HiveServer2-Handler-Pool: Thread-474]: Using the default value passed in for log id: d9b4646b-3c22-4589-8912-a68b92efcca7
- 2021-09-01 11:47:46,962 INFO org.apache.hadoop.hive.ql.session.SessionState: [HiveServer2-Handler-Pool: Thread-474]: Updating thread name to d9b4646b-3c22-4589-8912-a68b92efcca7 HiveServer2-Handler-Pool: Thread-474
- 2021-09-01 11:47:46,962 INFO org.apache.hive.service.cli.operation.OperationManager: [d9b4646b-3c22-4589-8912-a68b92efcca7 HiveServer2-Handler-Pool: Thread-474]: Closing operation: OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=29e01dfe-8b9b-427a-a610-1d58c526faca]
- 2021-09-01 11:47:46,973 INFO org.apache.hadoop.hive.conf.HiveConf: [d9b4646b-3c22-4589-8912-a68b92efcca7 HiveServer2-Handler-Pool: Thread-474]: Using the default value passed in for log id: d9b4646b-3c22-4589-8912-a68b92efcca7
- 2021-09-01 11:47:46,973 INFO org.apache.hadoop.hive.ql.session.SessionState: [d9b4646b-3c22-4589-8912-a68b92efcca7 HiveServer2-Handler-Pool: Thread-474]: Resetting thread name to HiveServer2-Handler-Pool: Thread-474
- 2021-09-01 11:47:47,025 INFO org.apache.hive.service.cli.thrift.ThriftCLIService: [HiveServer2-Handler-Pool: Thread-474]: Session disconnected without closing properly.
- 2021-09-01 11:47:47,025 INFO org.apache.hive.service.cli.thrift.ThriftCLIService: [HiveServer2-Handler-Pool: Thread-474]: Closing the session: SessionHandle [d9b4646b-3c22-4589-8912-a68b92efcca7]
- 2021-09-01 11:47:47,025 INFO org.apache.hive.service.CompositeService: [HiveServer2-Handler-Pool: Thread-474]: Session closed, SessionHandle [d9b4646b-3c22-4589-8912-a68b92efcca7], current sessions:0
The log shows that the Hive SQL job ran on the Spark execution engine and that the Spark job itself completed successfully; the failure happened in HiveServer2's final step of moving the temporary output files into the target directory. The lowest-level error is `rename for src path xxx to dest path xxx returned false`.
Following that lead, we went to HDFS to investigate. `hdfs dfs -ls` shows that the destination file already exists, and its timestamp reveals it was created several days earlier, so it has nothing to do with the current SQL job:
(screenshot: listing of the HDFS destination path)
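The check above can be reproduced with the standard HDFS shell; the paths below are taken verbatim from the error log:

```shell
# List the destination partition directory; the timestamp column shows
# when each file was created.
hdfs dfs -ls hdfs://hs01:8020/user/hundsun/dap/hive/hs_ods/ods_ses_acct_assure_scale/part_date=20210118

# Check whether the conflicting target file already exists.
hdfs dfs -ls hdfs://hs01:8020/user/hundsun/dap/hive/hs_ods/ods_ses_acct_assure_scale/part_date=20210118/000000_0
```

Note that an HDFS rename, unlike a POSIX `rename(2)`, does not overwrite an existing destination file; it returns `false` instead, which is exactly what the `returned false` message reflects.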
Root Cause
The analysis above makes the direct cause clear: the Spark job underlying the Hive SQL statement succeeded and the data was fully computed, but when the temporary result files were moved to the final target directory, the rename could not complete because the target directory already existed on HDFS and contained a file with the same name, so the job failed.
Looking back at our SQL, it is essentially an `insert overwrite` into one partition of a partitioned table. In principle it should overwrite the data files under the target partition's directory (delete the old data files first, then create the new ones). So why was the delete step skipped here?
The reason is that the table's metadata in Hive is inconsistent with the data on HDFS. `show create table` and `show partitions` reveal that, according to the Hive metadata, the table has only one partition, while HDFS contains directories and files belonging to other partitions of the table:
(screenshots: `show create table` and `show partitions` output)
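The two checks can be issued from any Hive client such as beeline; the table name below is taken from the error log:

```sql
-- Show the table definition, including whether it is an EXTERNAL table
-- and where its data lives on HDFS (the LOCATION clause).
SHOW CREATE TABLE hs_ods.ods_ses_acct_assure_scale;

-- List the partitions the metastore knows about; compare this against
-- the partition directories actually present under the table's LOCATION.
SHOW PARTITIONS hs_ods.ods_ses_acct_assure_scale;
```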
So the root cause is this: the table's metadata in Hive is inconsistent with the actual data on HDFS. When the `insert overwrite` ran, Hive consulted the metadata in the metastore, concluded that the target partition did not exist, and therefore never attempted to delete the partition's directory on HDFS. Since that directory and its files did in fact exist, the job's underlying rename operation failed.
Fix
With both the direct cause and the root cause identified, the fix follows naturally: bring the Hive metadata back in sync with the actual data on HDFS.
The metadata of a Hive table can be repaired with the command `msck repair table xxx`:
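A minimal repair session might look like this (table name from the log above):

```sql
-- Scan the table's location on HDFS and register in the metastore any
-- partition directories that are missing from it.
MSCK REPAIR TABLE hs_ods.ods_ses_acct_assure_scale;

-- Verify that the previously missing partitions are now visible.
SHOW PARTITIONS hs_ods.ods_ses_acct_assure_scale;
```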
Once the metadata was repaired, `show partitions xx` showed that the previously missing partitions were visible in Hive again. Re-running the `insert overwrite` statement then completed successfully.
Takeaways
- When the metadata in Hive is inconsistent with the actual data on HDFS, otherwise-normal Hive SQL operations may fail.
- There are many ways such an inconsistency can arise; common ones include:
- External tables: by design, dropping an external table or some of its partitions in Hive leaves the corresponding directories and files on HDFS, which creates an inconsistency. (This was our case.)
- The Hive metadata and the underlying HDFS data are synchronized from another cluster, but the synchronization is flawed, e.g. the two sides are not aligned in time and end up in different states. (Poorly managed cross-cluster synchronization causes this.)
- The Hive metadata or the HDFS data has been corrupted or lost. (Sloppy cluster operations and maintenance can cause this.)
- A Hive table's metadata can be repaired with `msck repair table xxx`; the full syntax is `MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];`.
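As an illustration of the first cause, the sketch below (using a hypothetical table `t_demo`) shows how an external table drifts out of sync and which repair direction each `MSCK` variant covers. Note that the `ADD/DROP/SYNC PARTITIONS` options are only available in Hive 3.0 and later; plain `MSCK REPAIR TABLE`, which adds missing partitions, also works on older versions such as the Hive 2.1.1 used here:

```sql
-- Dropping a partition of an EXTERNAL table only removes it from the
-- metastore; the data files stay on HDFS, creating the inconsistency.
ALTER TABLE t_demo DROP PARTITION (part_date='20210118');

-- Plain MSCK REPAIR re-adds partitions whose directories exist on HDFS
-- but are missing from the metastore (the repair direction we needed).
MSCK REPAIR TABLE t_demo;

-- Hive 3.0+: DROP removes metastore partitions whose directories no
-- longer exist on HDFS; SYNC performs both directions at once.
MSCK REPAIR TABLE t_demo SYNC PARTITIONS;
```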