Preface

This post records the fix for an exception I hit while writing to Hudi. I actually found the cause and the fix last year, and it has already been merged into the community: PR [HUDI-2675] Fix the exception 'Not an Avro data file' when archive and clean. The reason for writing it up again now is that our production environment runs Hudi 0.9.0 and has not been upgraded, because upgrading Hudi may introduce incompatibilities and requires time-consuming testing; the PR, however, was only merged for 0.11.0. This post therefore focuses on how to deal with the problem on 0.9.0, and the approach also applies to other affected versions earlier than 0.11.0.

Exception message

The exception can occur during both archive and clean. The key error message is:
Caused by: java.io.IOException: Not an Avro data file
異常產(chǎn)生原因
各種原因?qū)е?rollback、.clean、.clean.requested和.clean.inflight?文件大小為0,也就是空文件,而在archive和clean時(shí)無法處理空文件,就報(bào)錯(cuò)上面的異常。有一個(gè)已知原因,就是HDFS配了滿了之后會(huì)產(chǎn)生空文件,更多的是PMC也不清楚的未知原因,上面的PR中有體現(xiàn)。
解決方案
這是在不升級(jí)Hudi版本的前提下,如果可以升級(jí)Hudi版本,直接升級(jí)到Hudi最新版即可。
解決方案1
當(dāng)發(fā)生該異常時(shí),由運(yùn)維人員刪除對(duì)應(yīng)的空文件即可,當(dāng)然這適用于表不多且異常偶發(fā)的情況,具體命令放在最后。但是當(dāng)表比較多時(shí),運(yùn)維人員處理起來比較麻煩,這就需要第二種解決方案了。
解決方案2
基于Hudi0.9.0源碼將文章開頭提到的PR合進(jìn)去,然后install本地倉庫或者公司自己的內(nèi)部倉庫中,然后Maven pom依賴中引用自己的倉庫地址就可以了?;?.9.0的代碼我已經(jīng)提交,有需要的可以自行下載,其他版本就需要大家自己合了。
- gitee: https://gitee.com/dongkelun/hudi/tree/0.9.0-fixNotAvro/
- github: https://github.com/dongkelun/hudi/tree/0.9.0-fixNotAvro
Hudi Maven install command:
mvn clean install -DskipTests
驗(yàn)證
直接本地運(yùn)行測(cè)試用例中的testArchiveCompletedRollbackAndClean和testCleanEmptyInstants,這倆測(cè)試用例通過了應(yīng)該就沒有問題
方案1具體處理方法
不管是什么原因?qū)е碌漠惓?,不管用何種方式,只要確保找到正確的對(duì)應(yīng)的大小為0的空文件刪掉即可,一定不要?jiǎng)h錯(cuò)
異常信息1
ERROR [Timer-Driven Process Thread-4] o.a.hudi.table.HoodieTimelineArchiveLog Failed to archive commits, .commit file: 20220726050533.rollback
java.io.IOException: Not an Avro data file
at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)
at org.apache.hudi.common.table.timeline.TimelineMetadataUtils.deserializeAvroMetadata(TimelineMetadataUtils.java:178)
at org.apache.hudi.client.utils.MetadataConversionUtils.createMetaWrapper(MetadataConversionUtils.java:103)
at org.apache.hudi.table.HoodieTimelineArchiveLog.convertToAvroRecord(HoodieTimelineArchiveLog.java:341)
at org.apache.hudi.table.HoodieTimelineArchiveLog.archive(HoodieTimelineArchiveLog.java:305)
at org.apache.hudi.table.HoodieTimelineArchiveLog.archiveIfRequired(HoodieTimelineArchiveLog.java:128)
at org.apache.hudi.client.AbstractHoodieWriteClient.postCommit(AbstractHoodieWriteClient.java:439)
at org.apache.hudi.client.HoodieJavaWriteClient.postWrite(HoodieJavaWriteClient.java:187)
at org.apache.hudi.client.HoodieJavaWriteClient.insert(HoodieJavaWriteClient.java:129)
at org.apache.nifi.processors.javaHudi.JavaHudi.write(JavaHudi.java:523)
at org.apache.nifi.processors.javaHudi.JavaHudi.onTrigger(JavaHudi.java:404)
at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1167)
at org.apache.nifi.controller.tasks.ConnectableTask.invoke(ConnectableTask.java:208)
at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:117)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Cause

The .rollback file has a size of 0.

Fix

Under the table's metadata path, check the file named in the exception message and confirm that its size is 0:
hadoop fs -ls hdfs://cluster1/apps/hive/tenant/zxqzk_smzt_mztgx/sam_exp/.hoodie/20220726050533.rollback
0 2022-07-26 07:05 hdfs://cluster1/apps/hive/tenant/zxqzk_smzt_mztgx/sam_exp/.hoodie/20220726050533.rollback
確認(rèn)為0后,刪掉該文件即可
hadoop fs -rm -r hdfs://cluster1/apps/hive/tenant/zxqzk_smzt_mztgx/sam_exp/.hoodie/20220726050533.rollback
Be careful not to delete the wrong file. To be safe you can also rename the file instead of deleting it, then restart the component and check that everything works; if the exception persists, look for other zero-byte .rollback files and remove them as well.
It is best not to delete via grep, to avoid removing the wrong files; that approach is only recommended when a quota shortage has produced a very large number of empty files.
Find all matching files (usually there is only one; so far only the quota-exhaustion case produces several):
hadoop fs -ls -R hdfs://cluster1/apps/hive/tenant/zxqzk_smzt_mztgx/sam_exp/.hoodie | grep .rollback | grep -v .rollback.inflight | awk '{ if ($5 == 0) print $8 }'
刪除所有符合條件的文件
hadoop fs -ls -R hdfs://cluster1/apps/hive/tenant/zxqzk_smzt_mztgx/sam_exp/.hoodie | grep .rollback | grep -v .rollback.inflight | awk '{ if ($5 == 0) print $8 }' | xargs hadoop fs -rm
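If you prefer the rename-first approach mentioned above, the following small sketch (a hypothetical helper, not part of Hudi; the class name RenameEmptyInstants is made up) renames zero-byte .rollback files under .hoodie to a .bak suffix so they can be restored if the wrong file was picked:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RenameEmptyInstants {

  // Scans the .hoodie metadata directory for zero-byte .rollback files and
  // renames them with a ".bak" suffix instead of deleting them outright,
  // so they can be restored if the wrong file was picked.
  public static void main(String[] args) throws Exception {
    Path hoodieDir = new Path(args[0]);   // e.g. hdfs://cluster1/.../table_path/.hoodie
    Configuration conf = new Configuration();
    FileSystem fs = hoodieDir.getFileSystem(conf);
    for (FileStatus status : fs.listStatus(hoodieDir)) {
      String name = status.getPath().getName();
      if (status.isFile() && status.getLen() == 0
          && name.endsWith(".rollback")) {   // ".rollback.inflight" does not match this suffix
        Path backup = new Path(status.getPath().getParent(), name + ".bak");
        System.out.println("Renaming empty instant " + status.getPath() + " -> " + backup);
        fs.rename(status.getPath(), backup);
      }
    }
  }
}
```

The same idea covers the .clean, .clean.requested and .clean.inflight cases below by changing the extension filter.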
異常信息2:
ERROR [Timer-Driven Process Thread-4] o.a.hudi.table.HoodieTimelineArchiveLog Failed to archive commits, .commit file: 20220726050533.rollback
java.io.IOException: Not an Avro data file
at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)
at org.apache.hudi.common.table.timeline.TimelineMetadataUtils.deserializeAvroMetadata(TimelineMetadataUtils.java:178)
at org.apache.hudi.client.utils.MetadataConversionUtils.createMetaWrapper(MetadataConversionUtils.java:103)
at org.apache.hudi.table.HoodieTimelineArchiveLog.convertToAvroRecord(HoodieTimelineArchiveLog.java:341)
at org.apache.hudi.table.HoodieTimelineArchiveLog.archive(HoodieTimelineArchiveLog.java:305)
at org.apache.hudi.table.HoodieTimelineArchiveLog.archiveIfRequired(HoodieTimelineArchiveLog.java:128)
at org.apache.hudi.client.AbstractHoodieWriteClient.postCommit(AbstractHoodieWriteClient.java:439)
at org.apache.hudi.client.HoodieJavaWriteClient.postWrite(HoodieJavaWriteClient.java:187)
at org.apache.hudi.client.HoodieJavaWriteClient.insert(HoodieJavaWriteClient.java:129)
at org.apache.nifi.processors.javaHudi.JavaHudi.write(JavaHudi.java:523)
at org.apache.nifi.processors.javaHudi.JavaHudi.onTrigger(JavaHudi.java:404)
at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1167)
at org.apache.nifi.controller.tasks.ConnectableTask.invoke(ConnectableTask.java:208)
at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:117)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
異常原因
.clean文件大為0
解決方法
找到對(duì)應(yīng)表元數(shù)據(jù)路徑下.clean文件大小為0 的文件并刪除,目前遇到的情況只有一個(gè)文件且是最新的.clean文件最好不要用grep的方式刪除,避免誤刪
hadoop fs -ls hdfs://cluster1/apps/hive/tenant/zxqzk_smzt_mztgx/sam_exp/.hoodie/20220726050533.rollback
0 2022-07-26 07:05 hdfs://cluster1/apps/hive/tenant/zxqzk_smzt_mztgx/sam_exp/.hoodie/20220726050533.rollback
異常信息3:
o.a.h.t.a.clean.BaseCleanActionExecutor Failed to perform previous clean operation, instant: [==>20211011143809__clean__REQUESTED]
org.apache.hudi.exception.HoodieIOException: Not an Avro data file
at org.apache.hudi.table.action.clean.BaseCleanActionExecutor.runPendingClean(BaseCleanActionExecutor.java:87)
at org.apache.hudi.table.action.clean.BaseCleanActionExecutor.lambda$execute$0(BaseCleanActionExecutor.java:137)
at java.util.ArrayList.forEach(ArrayList.java:1257)
at org.apache.hudi.table.action.clean.BaseCleanActionExecutor.execute(BaseCleanActionExecutor.java:134)
at org.apache.hudi.table.HoodieJavaCopyOnWriteTable.clean(HoodieJavaCopyOnWriteTable.java:188)
at org.apache.hudi.client.AbstractHoodieWriteClient.clean(AbstractHoodieWriteClient.java:660)
at org.apache.hudi.client.AbstractHoodieWriteClient.clean(AbstractHoodieWriteClient.java:641)
at org.apache.hudi.client.AbstractHoodieWriteClient.clean(AbstractHoodieWriteClient.java:672)
at org.apache.hudi.client.AbstractHoodieWriteClient.autoCleanOnCommit(AbstractHoodieWriteClient.java:505)
at org.apache.hudi.client.AbstractHoodieWriteClient.postCommit(AbstractHoodieWriteClient.java:440)
at org.apache.hudi.client.HoodieJavaWriteClient.postWrite(HoodieJavaWriteClient.java:187)
at org.apache.hudi.client.HoodieJavaWriteClient.insert(HoodieJavaWriteClient.java:129)
at org.apache.nifi.processors.javaHudi.JavaHudi.write(JavaHudi.java:401)
at org.apache.nifi.processors.javaHudi.JavaHudi.onTrigger(JavaHudi.java:305)
at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1166)
at org.apache.nifi.controller.tasks.ConnectableTask.invoke(ConnectableTask.java:208)
at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:117)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Not an Avro data file
at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)
at org.apache.hudi.common.table.timeline.TimelineMetadataUtils.deserializeAvroMetadata(TimelineMetadataUtils.java:178)
at org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:106)
at org.apache.hudi.table.action.clean.BaseCleanActionExecutor.runPendingClean(BaseCleanActionExecutor.java:84)
... 24 common frames omitted
異常原因
. clean.requested? 或者 . clean.inflight
解決方法
刪除對(duì)應(yīng)的大小為0的文件,文件名異常信息里已經(jīng)有了
最好不要用grep的方式刪除,避免誤刪,只有配額不足導(dǎo)致的文件特別多的情況下才建議使用
hadoop fs -ls -R hdfs://cluster1/apps/hive/tenant/zxqzk_smzt_mztgx/sam_exp/.hoodie | grep .clean.requested | awk '{ if ($5 == 0) print $8 }' | xargs hadoop fs -rm
