HBase: A Missing Coprocessor Brings Down All RegionServers
I. Problem Background

We share a big-data cluster with a sister organization, with data access controlled through Dataspace combined with Kerberos. While upgrading Kylin, the OLAP tool we use in production, the old coprocessor was deleted; the existing tables kept trying to load the now-missing coprocessor, and the failed lookups brought down every RegionServer. I could not figure out HBase's coprocessor mechanism, so I dug into references until it became clear.

The content below is a repost, kept mainly for personal reference. Original source: http://blog.itpub.net/12129601/viewspace-1690668/ — please credit the original author when reposting.
II. Using Coprocessors
1 Loading a Coprocessor
1.1 Upload the coprocessor jar to HDFS:
hadoop fs -mkdir /hbasenew/usercoprocesser
hadoop fs -ls /hbasenew/usercoprocesser
hadoop fs -rm /hbasenew/usercoprocesser/coprocessor.jar
hadoop fs -copyFromLocal /home/hbase/coprocessor.jar /hbasenew/usercoprocesser
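For completeness, the same upload can be scripted through the HDFS Java API; a minimal sketch using the paths above (this sketch is mine, not from the original post):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadCoprocessorJar {
  public static void main(String[] args) throws Exception {
    // Assumes the Hadoop client configuration on the classpath points at the cluster.
    FileSystem fs = FileSystem.get(new Configuration());
    Path dir = new Path("/hbasenew/usercoprocesser");
    fs.mkdirs(dir);                                                      // hadoop fs -mkdir
    fs.copyFromLocalFile(new Path("/home/hbase/coprocessor.jar"), dir);  // hadoop fs -copyFromLocal
    fs.close();
  }
}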
1.2 Attach the coprocessor to the table:
1) First unload any existing coprocessor:
disable 'ns_bigdata:tb_test_coprocesser'
alter 'ns_bigdata:tb_test_coprocesser',METHOD => 'table_att_unset',NAME =>'coprocessor$1'
enable 'ns_bigdata:tb_test_coprocesser'
2) Then load the coprocessor:
disable 'ns_bigdata:tb_test_coprocesser'
alter 'ns_bigdata:tb_test_coprocesser',METHOD => 'table_att','coprocessor' => '/hbasenew/usercoprocesser/coprocessor.jar|com.suning.hbase.coprocessor.service.HelloWorldEndPoin|1001|'
enable 'ns_bigdata:tb_test_coprocesser'
Note: when loading the coprocessor I deliberately dropped the letter t from the end of the class name (HelloWorldEndPoin instead of HelloWorldEndPoint), in order to reproduce both the regionserver crash and the inconsistent table state.
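The same attach can also be done from the Java client instead of the shell; below is a minimal sketch against the 0.98/1.x-era Admin API, using the correctly spelled class name (my illustration, not from the original post):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class AttachCoprocessor {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    TableName table = TableName.valueOf("ns_bigdata:tb_test_coprocesser");
    try {
      admin.disableTable(table);
      HTableDescriptor desc = admin.getTableDescriptor(table);
      // Same jar, class and priority as the alter above; the class name must
      // resolve on every regionserver, otherwise loading fails there.
      desc.addCoprocessor("com.suning.hbase.coprocessor.service.HelloWorldEndPoint",
          new Path("/hbasenew/usercoprocesser/coprocessor.jar"), 1001, null);
      admin.modifyTable(table, desc);
      admin.enableTable(table);
    } finally {
      admin.close();
    }
  }
}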
2 Resulting Problems
The operation above causes two problems:
2.1 It crashes the cluster's regionservers
2.2 It leaves the table with the coprocessor in an inconsistent state, stuck in ENABLING
Neither disable nor enable can then be run against the table (screenshot in the original post).
At the same time, the regionserver serving this table reports an error (screenshot in the original post).
3 Root-Cause Analysis
3.1 Why a coprocessor load error crashes the regionserver
In the HBase source, the parameter hbase.coprocessor.abortonerror defaults to true:
public static final String ABORT_ON_ERROR_KEY = "hbase.coprocessor.abortonerror";
public static final boolean DEFAULT_ABORT_ON_ERROR = true;
The parameter is documented as follows:
<property>
  <name>hbase.coprocessor.abortonerror</name>
  <value>true</value>
  <description>
    Set to true to cause the hosting server (master or regionserver)
    to abort if a coprocessor fails to load, fails to initialize, or throws an
    unexpected Throwable object. Setting this to false will allow the server to
    continue execution but the system wide state of the coprocessor in question
    will become inconsistent as it will be properly executing in only a subset
    of servers, so this is most useful for debugging only.
  </description>
</property>
So with the default in effect, loading a broken coprocessor makes the hosting regionserver abort itself, which is exactly what took the whole cluster down.
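To make the decision path concrete, here is a minimal sketch of the behaviour described above (my own illustration, not the verbatim HBase source; only the configuration key, its default, and the Abortable interface are taken from HBase):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Abortable;

public class AbortOnErrorSketch {
  static void onCoprocessorFailure(Configuration conf, Abortable server, Throwable t) {
    // ABORT_ON_ERROR_KEY / DEFAULT_ABORT_ON_ERROR from the constants above.
    boolean abortOnError = conf.getBoolean("hbase.coprocessor.abortonerror", true);
    if (abortOnError) {
      // Default behaviour: the hosting master/regionserver kills itself,
      // which is why one bad class name takes every regionserver down.
      server.abort("Coprocessor failed to load", t);
    } else {
      // With abortonerror=false the server logs and keeps running, at the
      // cost of an inconsistent coprocessor state across the cluster.
      System.err.println("Ignoring coprocessor load failure: " + t);
    }
  }
}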
3.2 Why the table's state becomes inconsistent
Relevant error log (screenshot in the original post).
Looking at the enable-related source:
public void enableTable(final TableName tableName)
throws IOException {
  enableTableAsync(tableName);
  // Wait until all regions are enabled
  waitUntilTableIsEnabled(tableName);
  LOG.info("Enabled table " + tableName);
}

private void waitUntilTableIsEnabled(final TableName tableName) throws IOException {
  boolean enabled = false;
  long start = EnvironmentEdgeManager.currentTimeMillis();
  for (int tries = 0; tries < (this.numRetries * this.retryLongerMultiplier); tries++) {
    try {
      enabled = isTableEnabled(tableName);
    } catch (TableNotFoundException tnfe) {
      // wait for table to be created
      enabled = false;
    }
    enabled = enabled && isTableAvailable(tableName);
    if (enabled) {
      break;
    }
    long sleep = getPauseTime(tries);
    if (LOG.isDebugEnabled()) {
      LOG.debug("Sleeping= " + sleep + "ms, waiting for all regions to be " +
        "enabled in " + tableName);
    }
    try {
      Thread.sleep(sleep);
    } catch (InterruptedException e) {
      // Do this conversion rather than let it out because do not want to
      // change the method signature.
      throw (InterruptedIOException)new InterruptedIOException("Interrupted").initCause(e);
    }
  }
  if (!enabled) {
    long msec = EnvironmentEdgeManager.currentTimeMillis() - start;
    throw new IOException("Table '" + tableName +
      "' not yet enabled, after " + msec + "ms.");
  }
}
===========================================================================
/**
 * Brings a table on-line (enables it).  Method returns immediately though
 * enable of table may take some time to complete, especially if the table
 * is large (All regions are opened as part of enabling process).  Check
 * {@link #isTableEnabled(byte[])} to learn when table is fully online.  If
 * table is taking too long to online, check server logs.
 * @param tableName
 * @throws IOException
 * @since 0.90.0
 */
public void enableTableAsync(final TableName tableName)
throws IOException {
  TableName.isLegalFullyQualifiedTableName(tableName.getName());
  executeCallable(new MasterCallable<Void>(getConnection()) {
    @Override
    public Void call() throws ServiceException {
      LOG.info("Started enable of " + tableName);
      EnableTableRequest req = RequestConverter.buildEnableTableRequest(tableName);
      master.enableTable(null, req);
      return null;
    }
  });
}
So enable first fires the asynchronous enable request at the master and then waits for all regions to be reported enabled by the regionservers. Since the regionservers have already died, the call keeps retrying and waiting, and the table sits in ENABLING the whole time.
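The stuck state is easy to observe from a client without the shell; a small sketch against the same era's Admin API (mine, not from the original post):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class TableStateCheck {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    TableName table = TableName.valueOf("ns_bigdata:tb_test_coprocesser");
    try {
      // Both stay false while the table hangs in ENABLING, which is why
      // waitUntilTableIsEnabled above keeps sleeping and retrying until
      // it finally throws "not yet enabled".
      System.out.println("enabled:   " + admin.isTableEnabled(table));
      System.out.println("available: " + admin.isTableAvailable(table));
    } finally {
      admin.close();
    }
  }
}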
4 Fixing the Problems
4.1 Fixing the regionserver crashes:
Set the following parameter in hbase-site.xml:
<property>
  <name>hbase.coprocessor.abortonerror</name>
  <value>false</value>
</property>
and restart the regionservers. Coprocessor errors are then ignored rather than fatal, keeping the cluster highly available.
4.2 Fixing the inconsistent table state (the table can be neither disabled nor enabled):
This is resolved by switching the master: stop the active master and a backup master takes over; as part of the switchover, tables left in an inconsistent state are moved back to a consistent one.
Master info after the switchover (screenshot in the original post).
During the switchover the following method is called:
/**
 * Recover the tables that are not fully moved to ENABLED state. These tables
 * are in ENABLING state when the master restarted/switched
 *
 * @throws KeeperException
 * @throws org.apache.hadoop.hbase.TableNotFoundException
 * @throws IOException
 */
private void recoverTableInEnablingState()
    throws KeeperException, TableNotFoundException, IOException {
  Set<TableName> enablingTables = ZKTable.getEnablingTables(watcher);
  if (enablingTables.size() != 0) {
    for (TableName tableName : enablingTables) {
      // Recover by calling EnableTableHandler
      LOG.info("The table " + tableName
          + " is in ENABLING state.  Hence recovering by moving the table"
          + " to ENABLED state.");
      // enableTable in sync way during master startup,
      // no need to invoke coprocessor
      EnableTableHandler eth = new EnableTableHandler(this.server, tableName,
        catalogTracker, this, tableLockManager, true);
      try {
        eth.prepare();
      } catch (TableNotFoundException e) {
        LOG.warn("Table " + tableName + " not found in hbase:meta to recover.");
        continue;
      }
      eth.process();
    }
  }
}
During the switchover, tail the logs of the master and of the corresponding regionserver.
Master log (screenshot in the original post).
Part of the log reads:
2015-05-20 10:00:01,398 INFO  [master:nim-pre:60000] master.AssignmentManager: The table ns_bigdata:tb_test_coprocesser is in ENABLING state.  Hence recovering by moving the table to ENABLED state.
2015-05-20 10:00:01,421 DEBUG [master:nim-pre:60000] lock.ZKInterProcessLockBase: Acquired a lock for /hbasen/table-lock/ns_bigdata:tb_test_coprocesser/write-master:600000000000002
2015-05-20 10:00:01,436 INFO  [master:nim-pre:60000] handler.EnableTableHandler: Attempting to enable the table ns_bigdata:tb_test_coprocesser
2015-05-20 10:00:01,465 INFO  [master:nim-pre:60000] handler.EnableTableHandler: Table 'ns_bigdata:tb_test_coprocesser' has 1 regions, of which 1 are offline.
2015-05-20 10:00:01,466 INFO  [master:nim-pre:60000] balancer.BaseLoadBalancer: Reassigned 1 regions. 1 retained the pre-restart assignment.
2015-05-20 10:00:01,466 INFO  [master:nim-pre:60000] handler.EnableTableHandler: Bulk assigning 1 region(s) across 3 server(s), retainAssignment=true
The corresponding log after the switchover (note these entries are tagged with the newly active master, sup02-pre):
2015-05-20 14:39:56,175 INFO  [master:sup02-pre:60000] master.AssignmentManager: The table ns_bigdata:tb_test_coprocesser is in ENABLING state.  Hence recovering by moving the table to ENABLED state.
2015-05-20 14:39:56,211 DEBUG [master:sup02-pre:60000] lock.ZKInterProcessLockBase: Acquired a lock for /hbasen/table-lock/ns_bigdata:tb_test_coprocesser/write-master:600000000000031
2015-05-20 14:39:56,235 INFO  [master:sup02-pre:60000] handler.EnableTableHandler: Attempting to enable the table ns_bigdata:tb_test_coprocesser
2015-05-20 14:39:56,269 INFO  [master:sup02-pre:60000] handler.EnableTableHandler: Table 'ns_bigdata:tb_test_coprocesser' has 1 regions, of which 1 are offline.
2015-05-20 14:39:56,270 INFO  [master:sup02-pre:60000] balancer.BaseLoadBalancer: Reassigned 1 regions. 1 retained the pre-restart assignment.
2015-05-20 14:39:56,270 INFO  [master:sup02-pre:60000] handler.EnableTableHandler: Bulk assigning 1 region(s) across 3 server(s), retainAssignment=true
Conclusions:
1. For high availability, set hbase.coprocessor.abortonerror to false (it defaults to true, as shown in section 3.1); then even a broken coprocessor will neither crash the regionservers nor leave tables that cannot be enabled or disabled.
2. Even if a table does get stuck where it cannot be enabled or disabled, switching the master fixes it; so when building a cluster, always configure at least one or two backup masters.
5 Read/Write Test with All Masters Down
1. With the cluster healthy, a client inserted 2,000,000 rows; the inserts completed normally.
2. Then all of the cluster's masters were stopped (screenshot in the original post).
3. Monitoring the client, inserts continued normally; the client went on to insert another 20,000,000 rows without error.
4. Batch reads from the client were also normal.
Conclusion: after all HBase master nodes go down, client reads and writes remain normal for a period of time (about half an hour in this test).
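The write side of this test is essentially a loop of plain puts; a hedged sketch follows (the table name is reused from above, while the column family "cf" and qualifier "q" are assumed, since the original post does not show its client code). Plain reads and writes go through ZooKeeper, hbase:meta and the regionservers rather than the master, which is why they keep working for a while:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class MasterlessWriteTest {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(),
        TableName.valueOf("ns_bigdata:tb_test_coprocesser"));
    try {
      for (long i = 0; i < 2000000L; i++) {
        Put put = new Put(Bytes.toBytes("row-" + i));
        // Column family "cf" and qualifier "q" are placeholders.
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v" + i));
        table.put(put);
      }
    } finally {
      table.close();
    }
  }
}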