Balancer Test Plan
Feature(s) Tested: enumerate the feature(s)
Which Jira issue(s)?
- HADOOP-1652 - Rebalance data blocks when new data nodes added or data nodes become full
- HADOOP-5463 - Balancer throws "Not a host:port pair" unless port is specified in fs.default.name
- HADOOP-5145 - Balancer sometimes runs out of memory after days or weeks running
- HADOOP-4435 - The JobTracker should display the amount of heap memory used
- HADOOP-4416 - Balancer should provide better resource management
- HADOOP-3487 - Balancer should not allocate a thread per block move
- HADOOP-2716 - Balancer should require superuser privilege
Package and classes tested & corresponding JUnit test class name(s)
package org.apache.hadoop.hdfs.server.balancer
What is the feature?
HDFS data might not always be placed uniformly across the DataNodes. One common reason is the addition of new DataNodes to an existing cluster. While placing new blocks (data for a file is stored as a series of blocks), the NameNode considers various parameters before choosing the DataNodes that receive these blocks. Some of the considerations are:
- Policy to keep one of the replicas of a block on the same node as the node that is writing the block.
- Need to spread different replicas of a block across the racks so that the cluster can survive the loss of a whole rack.
- One of the replicas is usually placed on the same rack as the node writing to the file so that cross-rack network I/O is reduced.
- Spread HDFS data uniformly across the DataNodes in the cluster.
The balancer is a tool that balances disk space usage on an HDFS cluster when some datanodes become full or when new empty nodes join the cluster. The tool is deployed as an application program that can be run by the cluster administrator on a live HDFS cluster while applications are adding and deleting files.
DESCRIPTION
The threshold parameter is a fraction in the range of (0%, 100%) with a default value of 10%. The threshold sets a target for whether the cluster is balanced. A cluster is balanced if, for each datanode, the utilization of the node (ratio of used space at the node to total capacity of the node) differs from the utilization of the cluster (ratio of used space in the cluster to total capacity of the cluster) by no more than the threshold value. The smaller the threshold, the more balanced a cluster will become. It takes more time to run the balancer for small threshold values. Also, for a very small threshold the cluster may not be able to reach the balanced state when applications write and delete files concurrently.
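As a concrete illustration of this definition, here is a small Python sketch (not Hadoop's code) that checks whether a set of nodes is balanced at a given threshold:

```python
# Sketch of the balance criterion described above (illustrative, not Hadoop code).
# A cluster is balanced iff every node's utilization is within `threshold`
# percentage points of the overall cluster utilization.

def utilization(used, capacity):
    """Ratio of used space to total capacity, as a percentage."""
    return 100.0 * used / capacity

def is_balanced(nodes, threshold=10.0):
    """nodes: list of (used_bytes, capacity_bytes) pairs."""
    cluster_used = sum(u for u, _ in nodes)
    cluster_cap = sum(c for _, c in nodes)
    avg = utilization(cluster_used, cluster_cap)
    return all(abs(utilization(u, c) - avg) <= threshold for u, c in nodes)

# Two equal-capacity nodes at 60% and 10% utilization: cluster average is 35%,
# so the 60% node is 25 points over -- not balanced at the default 10% threshold.
print(is_balanced([(60, 100), (10, 100)]))  # False
print(is_balanced([(40, 100), (30, 100)]))  # True
```

This also explains the expected results in the test table below: e.g. 40%/30% after balancing a 60%/10% pair is within the default threshold.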
The tool moves blocks from highly utilized datanodes to poorly utilized datanodes iteratively. In each iteration a datanode moves or receives no more than the lesser of 10G bytes or the threshold fraction of its capacity. Each iteration runs no more than 20 minutes. At the end of each iteration, the balancer obtains updated datanodes information from the namenode.
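The per-iteration cap described above can be written as min(10 GB, threshold x capacity); a small Python sketch (illustrative, not Hadoop's code):

```python
# Per-iteration move limit as described above: in one iteration a datanode
# moves or receives no more than the lesser of 10 GB or the threshold
# fraction of its capacity. (Illustrative sketch, not Hadoop code.)
TEN_GB = 10 * 1024**3

def iteration_limit(capacity_bytes, threshold=0.10):
    return min(TEN_GB, int(threshold * capacity_bytes))

# A 50 GB node at the default 10% threshold is capped by its capacity share
# (5 GB); a 1 TB node is capped by the 10 GB ceiling.
print(iteration_limit(50 * 1024**3))  # 5368709120 (5 GB)
print(iteration_limit(1024**4))       # 10737418240 (10 GB)
```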
A system property that limits the balancer's use of bandwidth is defined in the default configuration file:
<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>1048576</value>
  <description>
    Specifies the maximum bandwidth that each datanode can utilize for the
    balancing purpose, in terms of the number of bytes per second.
  </description>
</property>
This property determines the maximum speed at which a block will be moved from one datanode to another. The default value is 1MB/s. The higher the bandwidth, the faster a cluster can reach the balanced state, but with greater competition with application processes. If an administrator changes the value of this property in the configuration file, the change is observed when HDFS is next restarted.
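To get a feel for the trade-off, a quick back-of-the-envelope calculation (the 64 MB block size and 1 MB/s cap are just the defaults mentioned here, not values read from any cluster):

```python
# Rough time for one block move at the configured bandwidth cap
# (dfs.balance.bandwidthPerSec, default 1048576 bytes/s = 1 MB/s).
def move_seconds(block_bytes, bandwidth_bytes_per_sec=1048576):
    return block_bytes / bandwidth_bytes_per_sec

# A default 64 MB HDFS block at the default 1 MB/s cap:
print(move_seconds(64 * 1024 * 1024))  # 64.0 seconds per block
```

At the default cap a single block move takes about a minute, which is why raising the bandwidth speeds up balancing at the cost of competing with application traffic.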
MONITORING BALANCER PROGRESS
After the balancer is started, an output file name where the balancer progress will be recorded is printed on the screen. The administrator can monitor the running of the balancer by reading the output file. The output shows the balancer's status iteration by iteration. In each iteration it prints the starting time, the iteration number, the total number of bytes that have been moved in the previous iterations, the total number of bytes that are left to move in order for the cluster to be balanced, and the number of bytes that are being moved in this iteration. Normally "Bytes Already Moved" is increasing while "Bytes Left To Move" is decreasing.
Running multiple instances of the balancer in an HDFS cluster is prohibited by the tool.
The balancer automatically exits when any of the following five conditions is satisfied:
- The cluster is balanced;
- No block can be moved;
- No block has been moved for five consecutive iterations;
- An IOException occurs while communicating with the namenode;
- Another balancer is running.
Upon exit, the balancer returns an exit code and prints one of the following messages to the output file, corresponding to the above exit reasons:
- The cluster is balanced. Exiting...
- No block can be moved. Exiting...
- No block has been moved for five consecutive iterations. Exiting...
- Received an IO exception: failure reason. Exiting...
- Another balancer is running. Exiting...
The administrator can interrupt the execution of the balancer at any time by running the command "stop-balancer.sh" on the machine where the balancer is running.
What is the externally visible view of the feature?
To start: bin/start-balancer.sh [-threshold <threshold>]
Examples:
- bin/start-balancer.sh : start the balancer with a default threshold of 10%
- bin/start-balancer.sh -threshold 5 : start the balancer with a threshold of 5%
To stop: bin/stop-balancer.sh
Here, threshold is a fraction in the range of (0%, 100%) with a default value of 10%. The threshold sets a target for whether the cluster is balanced. A cluster is balanced if, for each datanode, the utilization of the node (ratio of used space at the node to total capacity of the node) differs from the utilization of the cluster (ratio of used space in the cluster to total capacity of the cluster) by no more than the threshold value. The smaller the threshold, the more balanced a cluster will become.
Risk Scenarios: enumerate the bad things that could happen in the system that either:
- Could be caused by the feature:
  - A data node left over-utilized or under-utilized
  - Data loss
  - Missing blocks
  - Over-replication or under-replication of blocks while balancing nodes
- Could have an effect on the feature:
This feature is a tool that balances disk space usage on an HDFS cluster when some data nodes become full or when new empty nodes join the cluster. A bug in this feature could cause any of the problems enumerated above.
Test Cases: enumerate all tests in tables
- Balancer tests
All balancer tests are run with a threshold of 10% unless otherwise noted. All nodes have the same capacity unless otherwise noted. Test cases 3, 4, and 5 are automatic while all the other cases are manual. All tests are expected to meet the following requirements unless otherwise noted.
- REQ1. Rebalancing does not cause the loss of a block;
- REQ2. Rebalancing does not change the number of replicas that a block had;
- REQ3. Rebalancing does not decrease the number of racks that a block had.
- REQ4. The rebalancing process makes the cluster less and less imbalanced. (The algorithm tries its best to satisfy this but provides no guarantee.)
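REQ1-REQ3 can be checked mechanically by comparing block reports taken before and after a balancer run. The sketch below is a hypothetical Python helper; the map shapes and names (`before`, `after`, `rack_of`, `check_requirements`) are illustrative assumptions, not Hadoop APIs:

```python
# Hypothetical checker for REQ1-REQ3: `before` and `after` map each block id
# to the set of datanodes holding a replica; `rack_of` maps a datanode to its
# rack. These structures are assumptions for illustration, not Hadoop APIs.

def check_requirements(before, after, rack_of):
    # REQ1: no block is lost (and none appears out of nowhere).
    assert set(after) == set(before), "REQ1: block set changed"
    for blk, locs in before.items():
        # REQ2: the number of replicas of each block is unchanged.
        assert len(after[blk]) == len(locs), f"REQ2: replica count changed for {blk}"
        # REQ3: the number of racks holding the block did not decrease.
        racks_before = {rack_of[n] for n in locs}
        racks_after = {rack_of[n] for n in after[blk]}
        assert len(racks_after) >= len(racks_before), f"REQ3: rack count dropped for {blk}"
    return True

rack_of = {"A": "r1", "B": "r1", "C": "r2", "D": "r2"}
before = {"blk_1": {"A", "C"}}
after = {"blk_1": {"B", "C"}}   # one replica moved within the same rack
print(check_requirements(before, after, rack_of))  # True
```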
Id | Type of Test | Description | Expected Behavior | Is Automated |
---|---|---|---|---|
Balancer_01 | Positive | Start balancer and check if the cluster is balanced after the run. | Cluster should be in a balanced state | No |
Balancer_02 | Positive | Test a cluster with even distribution, then add a new empty node to the cluster. | Balancer should automatically start balancing the cluster by loading data onto the empty node | No |
Balancer_03 | Positive | Bring up a one-node dfs cluster. Set files' replication factor to 1 and fill up the node to 30% full. Then add an empty data node. | Old node is 25% utilized and the new node is 5% utilized. | Yes |
Balancer_04 | Positive | The same as 03 except that the empty new data node is on a different rack. | The same as 03 | Yes |
Balancer_05 | Positive | The same as 03 except that the empty new data node has half the capacity of the old one. | Old one is 25% utilized and the new one is 10% utilized | Yes |
Balancer_06 | Positive | Bring up a 2-node cluster and fill one node to 60% and the other to 10% full. All nodes are on different racks. | One node is 40% utilized and the other one is 30% utilized | No |
Balancer_07 | Positive | Bring up a dfs cluster with nodes A and B. Set files' replication factor to 2 and fill up the cluster to 30% full. Then add an empty data node C. All three nodes are on the same rack. | Old ones are 25% utilized and the new one is 10% | No |
Balancer_08 | Positive | The same as test case 7 except that A, B, and C are on different racks. | The same as above | No |
Balancer_09 | Positive | The same as test case 7 except that rebalancing is interrupted. | The cluster is less imbalanced | No |
Balancer_10 | Positive | Restart rebalancing until it is done. | The same as 7 | No |
Balancer_11 | Positive | The same as test case 7 except that the namenode is shut down while rebalancing. | Rebalancing is interrupted | No |
Balancer_12 | Positive | The same as test case 5 except writing while rebalancing. | The cluster most likely becomes balanced, but may fluctuate | No |
Balancer_13 | Positive | The same as test case 5 except deleting while rebalancing. | The same as above | No |
Balancer_14 | Positive | The same as test case 5 except writing & deleting while rebalancing. | The same as above | No |
Balancer_15 | Positive | Scalability test: populate a 750-node cluster, then (1) run rebalancing after 3 nodes are added; (2) run rebalancing after 2 racks of nodes (60 nodes) are added; (3) run rebalancing after 2 racks of nodes are added while running file writing/deleting at the same time. | Cluster becomes balanced; file I/O performance should not be noticeably slower. | No |
Balancer_16 | Positive | Start balancer with a negative threshold value. | Command execution error output: "Expect a double parameter in the range of [0, 100]: -10 / Usage: java Balancer [-threshold <threshold>] percentage of disk capacity / Balancing took __ milliseconds" | No |
Balancer_17 | Positive | Start balancer with an out-of-range threshold value, e.g. (-123, 0, -324, 100000, -1222222, 1000000000, -10000, 345, 989). | Exit with error message | No |
Balancer_18 | Positive | Start balancer with an alpha-numeric threshold value (e.g. 103dsf, asd234, asfd, ASD, #$asd, 2345&, $35, %34). | Exit with error message | No |
Balancer_19 | Positive | Start 2 instances of balancer on the same gateway. | Exit with error message | No |
Balancer_20 | Positive | Start 2 instances of balancer on two different gateways. | Exit with error message | No |
Balancer_21 | Positive | Start balancer when the cluster is already balanced. | Balancer should print information about all nodes in the cluster and exit with a status of "Cluster is balanced". | No |
Balancer_22 | Positive | Run the balancer with half the data nodes not running. | | No |
Balancer_23 | Positive | Run the balancer while simultaneously simulating load on the cluster, with half the data nodes not running. | | No |
- Block Replacement Protocol Test
- Throttling test
- Namenode Protocol Test: getBlocks
First set up a 3-node cluster with nodes NA, NB, and NC, which are on different racks. Then create a file with one block B with a replication factor of 3. Finally, add a new node ND to the cluster on the same rack as NC.
Id | Type of Test | Description | Expected Behavior | Is Automated |
---|---|---|---|---|
ProtocolTest_01 | Positive | Copy block B from ND to NA with del hint NC | Fail because proxy source ND does not have the block | No |
ProtocolTest_02 | Positive | Copy block B from NA to NB with del hint N | Fail because the destination NB contains the block | No |
ProtocolTest_03 | Positive | Copy block B from NA to ND with del hint NB | Succeed; now block B is on NA, NC, and ND | No |
ProtocolTest_04 | Positive | Copy block B from NB to NC with del hint NA | Succeed, but NA is not a valid del hint, so block B is on NA and NB; the third replica is either on NC or ND | No |
Id | Type of Test | Description | Expected Behavior | Is Automated |
---|---|---|---|---|
ThrottlingTest_01 | Positive | Create a throttler with 1MB/s bandwidth. Send 6MB data, and throttle at 0.5MB, 0.75MB, and in the end. | Actual bandwidth should be less than or equal to the expected bandwidth 1MB/s | No |
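The kind of throttler ThrottlingTest_01 exercises can be sketched as follows. This is modeled loosely on the idea of Hadoop's data-transfer throttler, not its actual code; a fake clock is injected so the behavior can be checked without real sleeps:

```python
# Illustrative bandwidth throttler (not Hadoop's implementation): after each
# chunk is sent, sleep just long enough that the cumulative rate never
# exceeds bytes_per_sec.

class Throttler:
    def __init__(self, bytes_per_sec, clock, sleep):
        self.bytes_per_sec = bytes_per_sec
        self.clock = clock    # callable returning current time in seconds
        self.sleep = sleep    # callable sleeping for the given seconds
        self.start = clock()
        self.sent = 0

    def throttle(self, num_bytes):
        """Account for num_bytes sent; block until we are back under the cap."""
        self.sent += num_bytes
        earliest = self.start + self.sent / self.bytes_per_sec
        now = self.clock()
        if now < earliest:
            self.sleep(earliest - now)

# Simulated run: 6 MB through a 1 MB/s throttler in 0.5 MB chunks,
# using a fake clock advanced by the fake sleep.
MB = 1024 * 1024
t = [0.0]
def clock(): return t[0]
def sleep(s): t[0] += s

th = Throttler(1 * MB, clock, sleep)
for _ in range(12):
    th.throttle(MB // 2)
elapsed = clock()
print(elapsed)                        # 6.0 seconds
print(th.sent / elapsed <= 1 * MB)    # True: effective rate is capped at 1 MB/s
```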
Set up a 2-node cluster and create a file with a length of 2 blocks and a replication factor of 2.
Id | Type of Test | Description | Expected Behavior | Is Automated |
---|---|---|---|---|
NamenodeProtocolTest_01 | Positive | Get blocks from datanode 0 with a size of 2 blocks. | Return 2 blocks | No |
NamenodeProtocolTest_02 | Positive | Get blocks from datanode 0 with a size of 1 block. | Return 1 block | No |
NamenodeProtocolTest_03 | Positive | Get blocks from datanode 0 with a size of 0. | Receive an IOException | No |
NamenodeProtocolTest_04 | Positive | Get blocks from datanode 0 with a size of -1. | Receive an IOException | No |
NamenodeProtocolTest_05 | Positive | Get blocks from a non-existent datanode. | Receive an IOException | No |