Category Archives: HBase

[repost ]HBase Schema Design Case Studies


6.11. Schema Design Case Studies

The following will describe some typical data ingestion use-cases with HBase, and how the rowkey design and construction can be approached. Note: this is just an illustration of potential approaches, not an exhaustive list. Know your data, and know your processing requirements.

It is highly recommended that you read the rest of the Chapter 6, HBase and Schema Design first, before reading these case studies.

The following case studies are described:

  • Log Data / Timeseries Data
  • Log Data / Timeseries on Steroids
  • Customer/Order
  • Tall/Wide/Middle Schema Design
  • List Data

6.11.1. Case Study – Log Data and Timeseries Data

Assume that the following data elements are being collected.

  • Hostname
  • Timestamp
  • Log event
  • Value/message

We can store them in an HBase table called LOG_DATA, but what will the rowkey be? From these attributes the rowkey will be some combination of hostname, timestamp, and log-event – but what specifically?

Timestamp In The Rowkey Lead Position

The rowkey [timestamp][hostname][log-event] suffers from the monotonically increasing rowkey problem described in Section 6.3.1, “ Monotonically Increasing Row Keys/Timeseries Data ”.

There is another pattern frequently mentioned in the dist-lists about “bucketing” timestamps, by performing a mod operation on the timestamp. If time-oriented scans are important, this could be a useful approach. Attention must be paid to the number of buckets, because this will require the same number of scans to return results.

long bucket = timestamp % numBuckets;

… to construct:

[bucket][timestamp][hostname][log-event]
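The bucketed key layout above can be sketched in Python as follows. This is only an illustration: the single-byte bucket prefix, the big-endian 8-byte timestamp, and the sample values are assumptions, not part of the original design.

```python
import struct

NUM_BUCKETS = 100

def make_rowkey(timestamp, hostname, log_event):
    # [bucket][timestamp][hostname][log-event]
    bucket = timestamp % NUM_BUCKETS
    # one byte for the bucket (valid while NUM_BUCKETS <= 256), 8 big-endian
    # bytes for the timestamp, then the variable-length hostname and event
    return (struct.pack(">B", bucket) +
            struct.pack(">q", timestamp) +
            hostname.encode("utf-8") +
            log_event.encode("utf-8"))

key = make_rowkey(1378110081278, "host1", "e1")
print(len(key))  # 1 + 8 + 5 + 2 = 16 bytes
```

Because the bucket prefix leads the key, rows with the same timestamp are spread across 100 key ranges, which is exactly why a timerange query must issue one Scan per bucket.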
As stated above, to select data for a particular timerange, a Scan will need to be performed for each bucket. 100 buckets, for example, will provide a wide distribution in the keyspace but it will require 100 Scans to obtain data for a single timestamp, so there are trade-offs.

Host In The Rowkey Lead Position

The rowkey [hostname][log-event][timestamp] is a candidate if there is a large-ish number of hosts to spread the writes and reads across the keyspace. This approach would be useful if scanning by hostname was a priority.

Timestamp, or Reverse Timestamp?

If the most important access path is to pull the most recent events, then storing the timestamps as reverse-timestamps (e.g., timestamp = Long.MAX_VALUE – timestamp) will create the property of being able to do a Scan on [hostname][log-event] to quickly obtain the most recently captured events.

Neither approach is wrong, it just depends on what is most appropriate for the situation.
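The reverse-timestamp trick can be sketched quickly; here Python stands in for the Java expression above, with Long.MAX_VALUE written as 2**63 - 1 and the timestamps as example values:

```python
import struct

LONG_MAX = 2**63 - 1  # Java's Long.MAX_VALUE

def reverse_ts(timestamp):
    # Newer events produce numerically smaller values, so they sort first
    return LONG_MAX - timestamp

older, newer = 1378110081278, 1378110099999
# encoded big-endian, the newer event's key bytes sort before the older one's
assert struct.pack(">q", reverse_ts(newer)) < struct.pack(">q", reverse_ts(older))
print("newest-first ordering holds")
```

A plain forward Scan starting at [hostname][log-event] then returns the most recent events first.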

Reverse Scan API

HBASE-4811 implements an API to scan a table or a range within a table in reverse, reducing the need to optimize your schema for forward or reverse scanning. This feature is available in HBase 0.98 and later.

Variable Length or Fixed Length Rowkeys?

It is critical to remember that rowkeys are stamped on every column in HBase. If the hostname is “a” and the event type is “e1” then the resulting rowkey would be quite small. However, what if the ingested hostname is “” and the event type is “com.package1.subpackage2.subsubpackage3.ImportantService”?

It might make sense to use some substitution in the rowkey. There are at least two approaches: hashed and numeric. In the Hostname In The Rowkey Lead Position example, it might look like this:

Composite Rowkey With Hashes:

  • [MD5 hash of hostname] = 16 bytes
  • [MD5 hash of event-type] = 16 bytes
  • [timestamp] = 8 bytes
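Building this fixed-width composite key is straightforward; the sketch below uses hypothetical hostname and event-type values, and assumes a big-endian 8-byte timestamp:

```python
import hashlib
import struct

def make_rowkey(hostname, event_type, timestamp):
    # [MD5 of hostname] + [MD5 of event-type] + [timestamp] = 16 + 16 + 8 bytes
    return (hashlib.md5(hostname.encode("utf-8")).digest() +
            hashlib.md5(event_type.encode("utf-8")).digest() +
            struct.pack(">q", timestamp))

key = make_rowkey("a",
                  "com.package1.subpackage2.subsubpackage3.ImportantService",
                  1378110081278)
print(len(key))  # always 40 bytes, regardless of input lengths
```

The key length no longer depends on how long the raw hostname or event-type is, which is the point of the substitution.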

Composite Rowkey With Numeric Substitution:

For this approach another lookup table would be needed in addition to LOG_DATA, called LOG_TYPES. The rowkey of LOG_TYPES would be:

  • [type] (e.g., byte indicating hostname vs. event-type)
  • [bytes] variable length bytes for raw hostname or event-type.

A column for this rowkey could be a long with an assigned number, which could be obtained by using an HBase counter.

So the resulting composite rowkey would be:

  • [substituted long for hostname] = 8 bytes
  • [substituted long for event type] = 8 bytes
  • [timestamp] = 8 bytes

In either the Hash or Numeric substitution approach, the raw values for hostname and event-type can be stored as columns.
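A sketch of the numeric-substitution variant follows. The two lookup dictionaries are stand-ins for the LOG_TYPES table; in HBase the assigned longs would come from counters, as described above, and all values here are hypothetical:

```python
import struct

# Stand-ins for the LOG_TYPES lookup table; in HBase these longs would be
# assigned with an HBase counter per raw hostname / event-type.
hostname_ids = {"myhost.example.com": 1}
event_ids = {"com.package1.ImportantService": 2}

def make_rowkey(hostname, event_type, timestamp):
    # [substituted long for hostname][substituted long for event type][timestamp]
    return struct.pack(">qqq",
                       hostname_ids[hostname],
                       event_ids[event_type],
                       timestamp)

key = make_rowkey("myhost.example.com", "com.package1.ImportantService",
                  1378110081278)
print(len(key))  # 8 + 8 + 8 = 24 bytes
```

At 24 bytes this key is even shorter than the 40-byte hashed variant, at the cost of maintaining the lookup table.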

6.11.2. Case Study – Log Data and Timeseries Data on Steroids

This effectively is the OpenTSDB approach. What OpenTSDB does is re-write data and pack rows into columns for certain time-periods. For a detailed explanation, see Lessons Learned from OpenTSDB from HBaseCon2012.

But this is how the general concept works: data is ingested, for example, in this manner…

[hostname][log-event][timestamp1]
[hostname][log-event][timestamp2]
[hostname][log-event][timestamp3]

… with separate rowkeys for each detailed event, but is re-written like this…

[hostname][log-event][timerange]

… and each of the above events is converted into columns stored with a time-offset relative to the beginning timerange (e.g., every 5 minutes). This is obviously a very advanced processing technique, but HBase makes this possible.
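The row/offset arithmetic can be sketched as follows. A 5-minute row span is assumed, matching the example above; the real OpenTSDB implementation uses wider rows and binary-encoded metric/tag ids:

```python
ROW_SPAN = 300  # seconds per row, i.e. a 5-minute timerange

def row_and_offset(event_ts):
    # The rowkey carries the start of the timerange; the column qualifier
    # carries the event's offset within that range.
    base = event_ts - (event_ts % ROW_SPAN)
    return base, event_ts - base

base, offset = row_and_offset(1299)
print(base, offset)  # → 1200 99
```

Every event inside the same 5-minute window maps to the same rowkey, so the re-write step simply moves each event into a column of that one row.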

6.11.3. Case Study – Customer/Order

Assume that HBase is used to store customer and order information. There are two core record-types being ingested: a Customer record type, and Order record type.

The Customer record type would include all the things that you’d typically expect:

  • Customer number
  • Customer name
  • Address (e.g., city, state, zip)
  • Phone numbers, etc.

The Order record type would include things like:

  • Order number
  • Sales date
  • A series of nested objects for shipping locations and line items

Assuming that the combination of customer number and sales order uniquely identifies an order, these two attributes will compose the rowkey, and specifically a composite key such as:

[customer number][order number]

… for an ORDER table. However, there are more design decisions to make: are the raw values the best choices for rowkeys?

The same design questions in the Log Data use-case confront us here. What is the keyspace of the customer number, and what is the format (e.g., numeric? alphanumeric?)? As it is advantageous to use fixed-length keys in HBase, as well as keys that can support a reasonable spread in the keyspace, similar options appear:

Composite Rowkey With Hashes:

  • [MD5 of customer number] = 16 bytes
  • [MD5 of order number] = 16 bytes

Composite Numeric/Hash Combo Rowkey:

  • [substituted long for customer number] = 8 bytes
  • [MD5 of order number] = 16 bytes

Single Table? Multiple Tables?

A traditional design approach would have separate tables for CUSTOMER and SALES. Another option is to pack multiple record types into a single table (e.g., CUSTOMER++).

Customer Record Type Rowkey:

  • [customer-id]
  • [type] = type indicating ‘1’ for customer record type

Order Record Type Rowkey:

  • [customer-id]
  • [type] = type indicating ‘2’ for order record type
  • [order]

The advantage of this particular CUSTOMER++ approach is that it organizes many different record-types by customer-id (e.g., a single scan could get you everything about that customer). The disadvantage is that it’s not as easy to scan for a particular record-type.

Order Object Design

Now we need to address how to model the Order object. Assume that the class structure is as follows:

  • Order (an Order can have multiple ShippingLocations)
  • ShippingLocation (a ShippingLocation can have multiple LineItems)

… there are multiple options on storing this data.

Completely Normalized

With this approach, there would be separate tables for ORDER, SHIPPING_LOCATION, and LINE_ITEM.

The ORDER table’s rowkey was described above: Section 6.11.3, “Case Study – Customer/Order”

The SHIPPING_LOCATION’s composite rowkey would be something like this:

  • [order-rowkey]
  • [shipping location number] (e.g., 1st location, 2nd, etc.)

The LINE_ITEM table’s composite rowkey would be something like this:

  • [order-rowkey]
  • [shipping location number] (e.g., 1st location, 2nd, etc.)
  • [line item number] (e.g., 1st lineitem, 2nd, etc.)

Such a normalized model is likely to be the approach with an RDBMS, but that’s not your only option with HBase. The con of such an approach is that to retrieve information about any Order, you will need:

  • Get on the ORDER table for the Order
  • Scan on the SHIPPING_LOCATION table for that order to get the ShippingLocation instances
  • Scan on the LINE_ITEM for each ShippingLocation

… granted, this is what an RDBMS would do under the covers anyway, but since there are no joins in HBase you’re just more aware of this fact.

Single Table With Record Types

With this approach, there would exist a single table ORDER that would contain multiple record types.

The Order rowkey was described above: Section 6.11.3, “Case Study – Customer/Order”

  • [order-rowkey]
  • [ORDER record type]

The ShippingLocation composite rowkey would be something like this:

  • [order-rowkey]
  • [SHIPPING record type]
  • [shipping location number] (e.g., 1st location, 2nd, etc.)

The LineItem composite rowkey would be something like this:

  • [order-rowkey]
  • [LINE record type]
  • [shipping location number] (e.g., 1st location, 2nd, etc.)
  • [line item number] (e.g., 1st lineitem, 2nd, etc.)

Denormalized

A variant of the Single Table With Record Types approach is to denormalize and flatten some of the object hierarchy, such as collapsing the ShippingLocation attributes onto each LineItem instance.

The LineItem composite rowkey would be something like this:

  • [order-rowkey]
  • [LINE record type]
  • [line item number] (e.g., 1st lineitem, 2nd, etc. – care must be taken that these are unique across the entire order)

… and the LineItem columns would be something like this:

  • itemNumber
  • quantity
  • price
  • shipToLine1 (denormalized from ShippingLocation)
  • shipToLine2 (denormalized from ShippingLocation)
  • shipToCity (denormalized from ShippingLocation)
  • shipToState (denormalized from ShippingLocation)
  • shipToZip (denormalized from ShippingLocation)

The pros of this approach include a less complex object hierarchy, but one of the cons is that updating gets more complicated in case any of this information changes.

Object BLOB

With this approach, the entire Order object graph is treated, in one way or another, as a BLOB. For example, the ORDER table’s rowkey was described above: Section 6.11.3, “Case Study – Customer/Order”, and a single column called “order” would contain an object that could be deserialized into an Order and its contained ShippingLocations and LineItems.

There are many options here: JSON, XML, Java Serialization, Avro, Hadoop Writables, etc. All of them are variants of the same approach: encode the object graph to a byte-array. Care should be taken with this approach to ensure backward compatibility in case the object model changes, such that older persisted structures can still be read back out of HBase.

Pros are being able to manage complex object graphs with minimal I/O (e.g., a single HBase Get per Order in this example), but the cons include the aforementioned warning about backward compatibility of serialization, language dependencies of serialization (e.g., Java Serialization only works with Java clients), the fact that you have to deserialize the entire object to get any piece of information inside the BLOB, and the difficulty in getting frameworks like Hive to work with custom objects like this.

6.11.4. Case Study – “Tall/Wide/Middle” Schema Design Smackdown

This section will describe additional schema design questions that appear on the dist-list, specifically about tall and wide tables. These are general guidelines and not laws – each application must consider its own needs.

Rows vs. Versions

A common question is whether one should prefer rows or HBase’s built-in versioning. The context is typically where there are “a lot” of versions of a row to be retained (e.g., where it is significantly above the HBase default of 1 max version). The rows-approach would require storing a timestamp in some portion of the rowkey so that rows would not be overwritten with each successive update.

Preference: Rows (generally speaking).

Rows vs. Columns

Another common question is whether one should prefer rows or columns. The context is typically in extreme cases of wide tables, such as having 1 row with 1 million attributes, or 1 million rows with 1 column apiece.

Preference: Rows (generally speaking). To be clear, this guideline applies to extremely wide cases, not to the standard use-case where one needs to store a few dozen or hundred columns. But there is also a middle path between these two options, and that is “Rows as Columns.”

Rows as Columns

The middle path between Rows vs. Columns is packing data that would be a separate row into columns, for certain rows. OpenTSDB is the best example of this case where a single row represents a defined time-range, and then discrete events are treated as columns. This approach is often more complex, and may require the additional complexity of re-writing your data, but has the advantage of being I/O efficient. For an overview of this approach, see Section 6.11.2, “Case Study – Log Data and Timeseries Data on Steroids”.

6.11.5. Case Study – List Data

The following is an exchange from the user dist-list regarding a fairly common question: how to handle per-user list data in Apache HBase.

*** QUESTION ***

We’re looking at how to store a large amount of (per-user) list data in HBase, and we were trying to figure out what kind of access pattern made the most sense. One option is to store the majority of the data in a key, so we could have something like:

<FixedWidthUserName><FixedWidthValueId1>:"" (no value)
<FixedWidthUserName><FixedWidthValueId2>:"" (no value)
<FixedWidthUserName><FixedWidthValueId3>:"" (no value)

The other option we had was to do this entirely using:


where each row would contain multiple values. So in one case reading the first thirty values would be:

scan { STARTROW => 'FixedWidthUsername' LIMIT => 30}

And in the second case it would be

get 'FixedWidthUserName\x00\x00\x00\x00'

The general usage pattern would be to read only the first 30 values of these lists, with infrequent access reading deeper into the lists. Some users would have <= 30 total values in these lists, and some users would have millions (i.e. power-law distribution)

The single-value format seems like it would take up more space on HBase, but would offer some improved retrieval / pagination flexibility. Would there be any significant performance advantages to be able to paginate via gets vs paginating with scans?

My initial understanding was that doing a scan should be faster if our paging size is unknown (and caching is set appropriately), but that gets should be faster if we’ll always need the same page size. I’ve ended up hearing different people tell me opposite things about performance. I assume the page sizes would be relatively consistent, so for most use cases we could guarantee that we only wanted one page of data in the fixed-page-length case. I would also assume that we would have infrequent updates, but may have inserts into the middle of these lists (meaning we’d need to update all subsequent rows).

Thanks for help / suggestions / follow-up questions.

*** ANSWER ***

If I understand you correctly, you’re ultimately trying to store triples in the form “user, valueid, value”, right? E.g., something like:

"user123, firstname, Paul",
"user234, lastname, Smith"

(But the usernames are fixed width, and the valueids are fixed width).

And, your access pattern is along the lines of: “for user X, list the next 30 values, starting with valueid Y”. Is that right? And these values should be returned sorted by valueid?

The tl;dr version is that you should probably go with one row per user+value, and not build a complicated intra-row pagination scheme on your own unless you’re really sure it is needed.

Your two options mirror a common question people have when designing HBase schemas: should I go “tall” or “wide”? Your first schema is “tall”: each row represents one value for one user, and so there are many rows in the table for each user; the row key is user + valueid, and there would be (presumably) a single column qualifier that means “the value”. This is great if you want to scan over rows in sorted order by row key (thus my question above, about whether these ids are sorted correctly). You can start a scan at any user+valueid, read the next 30, and be done. What you’re giving up is the ability to have transactional guarantees around all the rows for one user, but it doesn’t sound like you need that. Doing it this way is generally recommended.
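A sketch of what the "tall" rowkeys and a scan start-row could look like follows. The 16-byte username field and 4-byte valueid are assumed widths standing in for the <FixedWidthUserName><FixedWidthValueId> layout above; the usernames are hypothetical:

```python
USERNAME_WIDTH = 16
VALUEID_WIDTH = 4

def start_row(username, valueid=0):
    # Fixed-width user name padded with NUL bytes, then a fixed-width value
    # id: a scan starting here with LIMIT 30 yields "the next 30 values for
    # user X starting at valueid Y".
    user = username.encode("utf-8").ljust(USERNAME_WIDTH, b"\x00")
    return user + valueid.to_bytes(VALUEID_WIDTH, "big")

keys = [start_row("user123", v) for v in (5, 6, 30)]
assert keys == sorted(keys)  # rowkeys sort in valueid order within a user
print(len(keys[0]))  # 16 + 4 = 20 bytes
```

Because valueids are encoded big-endian at a fixed width, lexicographic rowkey order matches numeric valueid order, which is what makes the "start anywhere, read 30" pattern work.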

Your second option is “wide”: you store a bunch of values in one row, using different qualifiers (where the qualifier is the valueid). The simple way to do that would be to just store ALL values for one user in a single row. I’m guessing you jumped to the “paginated” version because you’re assuming that storing millions of columns in a single row would be bad for performance, which may or may not be true; as long as you’re not trying to do too much in a single request, or do things like scanning over and returning all of the cells in the row, it shouldn’t be fundamentally worse. The client has methods that allow you to get specific slices of columns.

Note that neither case fundamentally uses more disk space than the other; you’re just “shifting” part of the identifying information for a value either to the left (into the row key, in option one) or to the right (into the column qualifiers in option 2). Under the covers, every key/value still stores the whole row key and column family name. (If this is a bit confusing, take an hour and watch Lars George’s excellent video about understanding HBase schema design.)

A manually paginated version has lots more complexities, as you note, like having to keep track of how many things are in each page, re-shuffling if new values are inserted, etc. That seems significantly more complex. It might have some slight speed advantages (or disadvantages!) at extremely high throughput, and the only way to really know that would be to try it out. If you don’t have time to build it both ways and compare, my advice would be to start with the simplest option (one row per user+value). Start simple and iterate! :)

[repost ]How-to: Use the Apache HBase REST Interface, Part 1


There are various ways to access and interact with Apache HBase. The Java API provides the most functionality, but many people want to use HBase without Java.

There are two main approaches for doing that: One is the Thrift interface, which is the faster and more lightweight of the two options. The other way to access HBase is using the REST interface, which uses HTTP verbs to perform an action, giving developers a wide choice of languages and programs to use.

This series of how-to’s will discuss the REST interface and provide Python code samples for accessing it. The first post will cover HBase REST, some Python caveats, and table administration. The second post will explain how to insert multiple rows at a time using XML and JSON. The third post will show how to get multiple rows using XML and JSON. The full code samples can be found on my GitHub account.

HBase REST Basics

For both Thrift and REST to work, another HBase daemon needs to be running to handle these requests. These daemons can be installed in the hbase-thrift and hbase-rest packages. The diagram below illustrates where Thrift and REST are placed in the cluster. Note that the Thrift and REST clients usually don’t run any other services like DataNode or RegionServer, to keep the load down and responsiveness high for REST interactions.

Be sure to install and start these daemons on nodes that have access to both the Hadoop cluster and the web application server. The REST interface doesn’t have any built-in load balancing; that will need to be done with hardware or in code. Cloudera Manager makes it really easy to install and manage the HBase REST and Thrift services. (You can download and try it out for free!) The downside to REST is that it is much heavier-weight than Thrift or Java.

A REST interface can use various data formats: XML, JSON, and protobuf. By specifying the Accept and Content-Type headers, you can choose the format you want to pass in or receive back.

To start using the REST interface, you need to figure out which port it’s running on. The default port for CDH is port 8070. For this post, you’ll see the baseurl variable used, and here is the value I’ll be using:

baseurl = "http://localhost:8070"

The REST interface can be set up to use a Kerberos credential to increase security.

For your code, you’ll need to use the IP address or fully qualified domain name (FQDN) of the node running the REST daemon. Also, confirm that the port is correct. I highly recommend making this URL a variable, as it could change with network changes.

Python and HBase Bug Workarounds

There are two bugs and workarounds that need to be addressed. The first bug is that the built-in Python modules don’t support all of the HTTP verbs. The second is an HBase REST bug when working with JSON.

The built-in Python modules for REST interaction don’t easily support all of the HTTP verbs needed for HBase REST. You’ll need to install the Python requests module. The requests module also cleans up the code and makes all of the interactions much easier.

The HBase REST interface has a bug when adding data via JSON: it is required that the fields maintain their exact order. The built-in Python dict type doesn’t support this feature, so to maintain the order, we’ll need to use the OrderedDict class. (Those with Python 2.6 and older will need to install the ordereddict module.) I’ll cover the bug and workaround later in the post, too.
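As a minimal sketch of the workaround, the cell below is built with OrderedDict so the serialized JSON keeps "column" before "$"; the base64 strings are hypothetical example values (they encode "cf:count" and "1"):

```python
import json
from collections import OrderedDict

# dict ordering was not guaranteed before Python 3.7, so OrderedDict keeps
# the fields in the exact order the HBase REST interface expects
cell = OrderedDict()
cell["column"] = "Y2Y6Y291bnQ="   # base64 of "cf:count"
cell["$"] = "MQ=="                # base64 of "1"

print(json.dumps(cell))  # → {"column": "Y2Y6Y291bnQ=", "$": "MQ=="}
```

Swapping the two assignments would swap the fields in the output, which is exactly what trips the REST interface up with a plain dict.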

It was also difficult to use base64 encode and decode integers, so I wrote some code to do that:

import base64
import struct

# Method for encoding ints with base64 encoding
def encode(n):
    data = struct.pack("i", n)
    s = base64.b64encode(data)
    return s

# Method for decoding ints with base64 encoding
def decode(s):
    data = base64.b64decode(s)
    n = struct.unpack("i", data)
    return n[0]


To make things even easier, I wrote a method to confirm that HTTP responses come back in the 200s, which indicates that the operation worked. The sample code uses this method to check the success of a call before moving on. Here is the method:

# Checks the request object to see if the call was successful
def issuccessful(request):
    if 200 <= request.status_code <= 299:
        return True
    else:
        return False


Working With Tables

Using the REST interface, you can create or delete tables. Let’s take a look at the code to create a table.

content =  '<?xml version="1.0" encoding="UTF-8"?>'
content += '<TableSchema name="' + tablename + '">'
content += '  <ColumnSchema name="' + cfname + '" />'
content += '</TableSchema>'

request = requests.post(baseurl + "/" + tablename + "/schema", data=content, headers={"Content-Type" : "text/xml", "Accept" : "text/xml"})


In this snippet, we create a small XML document that defines the table schema in the content variable. We need to provide the name of the table and the column family name. If there are multiple column families, you create some more ColumnSchema nodes.

Next, we use the requests module to POST the XML to the URL we create. This URL needs to include the name of the new table. Also, note that we are setting the headers for this POST call. We are showing that we are sending in XML with the Content-Type set to “text/xml” and that we want XML back with the Accept set to “text/xml”.

Using the request.status_code, you can check that the table create was successful. The REST interface uses the same HTTP error codes to detect if a call was successful or errored out. A status code in the 200s means that things worked correctly.

We can easily check if a table exists using the following code:

request = requests.get(baseurl + "/" + tablename + "/schema")


This call uses the GET verb to tell the REST interface we want to get the schema information about the table in the URL. Once again, we can use the status code to see if the table exists. A status code in the 200s means it does exist and any other number means it doesn’t.

Using the curl command, we can check the success of a REST operation without writing code. The following command will return a 200 showing the success of the call because the messagestable table does exist in HBase. Here is the call and its output:

[user@localhost]$ curl -I -H "Accept: text/xml" http://localhost:8070/messagestable/schema
HTTP/1.1 200 OK
Content-Length: 0
Cache-Control: no-cache
Content-Type: text/xml


This REST call will error out because the tablenotthere table doesn’t exist in HBase. Here is the call and its output:

[user@localhost]$ curl -I -H "Accept: text/xml" http://localhost:8070/tablenotthere/schema
HTTP/1.1 500 org.apache.hadoop.hbase.TableNotFoundException: tablenotthere
Content-Type: text/html; charset=iso-8859-1
Cache-Control: must-revalidate,no-cache,no-store
Content-Length: 10767


We can delete a table using the following code:

request = requests.delete(baseurl + "/" + tablename + "/schema")


This call uses the DELETE verb to tell the REST interface that we want to delete the table. Deleting a table through the REST interface doesn’t require you to disable it first. As usual, we can confirm success by looking at the status code.

In the next two posts in this series, we’ll cover inserting and getting rows, respectively.

Jesse Anderson is an instructor with Cloudera University.

If you’re interested in HBase, be sure to register for HBaseCon 2013 (June 13, San Francisco) – THE community event for HBase contributors, developers, admins, and users. Early Bird registration is open until April 23.

[repost ]RHadoop Setup (HBase)


hadoop1 (master, secondarynamenode, zookeeper, HBase HMaster)
hadoop2 (HBase HRegion, Hive Shell)
hadoop3 (slave, zookeeper, HBase HRegion)
hadoop4 (slave, zookeeper, HBase HRegion)
hadoop5 (slave, zookeeper, HBase HRegion)
dataserver (metastore, MySQL Server, Oracle)
Related installation documents: hadoop 2.2.0 test environment setup, HBase 0.96.0 + hadoop 2.2.0 installation, RHadoop setup (HDFS + MapReduce)

[root@dataserver app]# sudo yum install automake libtool flex bison pkgconfig gcc-c++ boost-devel libevent-devel zlib-devel python-devel ruby-devel
[root@dataserver app]# tar zxf /mnt/mydisk/soft/program/thrift-0.9.1.tar.gz
[root@dataserver app]# cd thrift-0.9.1
[root@dataserver thrift-0.9.1]# ./configure

[root@dataserver thrift-0.9.1]# make

[root@dataserver thrift-0.9.1] cd test/cpp
[root@dataserver cpp]# cp *.o .libs/
[root@dataserver thrift-0.9.1]# make install
[root@dataserver thrift-0.9.1]# thrift --version

[root@dataserver /]# export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig/
[root@dataserver /]# pkg-config --cflags thrift
[root@dataserver /]# vi /usr/local/lib/pkgconfig/thrift.pc
Cflags: -I${includedir}/thrift
[root@dataserver lib]# cp /usr/local/lib/ /usr/lib/
[root@dataserver lib]# /sbin/ldconfig /usr/lib/
Note: the shared library libthrift-0.9.1.so must be made visible to the system, otherwise you will get an error like: unable to load shared object '/usr/lib64/R/library/rhbase/libs/' cannot open shared object file
[root@dataserver usr]# R CMD INSTALL /mnt/mydisk/soft/R/rhbase_1.2.0.tar.gz


[repost ]HBase in 2013


With 2013 almost over, here is a summary of the major changes in HBase during the year. The most significant event was the release of HBase 0.96, whose codebase has been modularized and which delivers many urgently requested features. Most of these features have already been running for quite a while on internal clusters at companies such as Yahoo, Facebook, Taobao, and Xiaomi, so they can be considered reasonably stable and usable.


1, Compaction Optimization

HBase's compaction has long been criticized, and many complaints about HBase stem from this one feature. But rather than writing HBase off over this shortcoming, the goal is to tame it and adapt it to your own workload: tune the compaction type and parameters according to the business load, and generally disable major compactions during peak hours. In 0.96 the HBase community adopted a pluggable architecture to provide more compaction policies for different scenarios. Storage management on the RegionServer side was also reworked: previously it was simply Region->Store->StoreFile; now, to support more flexible management of StoreFiles and compaction strategies, the RegionServer uses a StoreEngine structure. A StoreEngine comprises a StoreFlusher, a CompactionPolicy, a Compactor, and a StoreFileManager. If none is specified, the default is DefaultStoreEngine, whose four components are DefaultStoreFlusher, ExploringCompactionPolicy, DefaultCompactor, and DefaultStoreFileManager. Note that since 0.96 the default compaction algorithm has changed from RatioBasedCompactionPolicy to ExploringCompactionPolicy. Why? Consider compaction's optimization goal: compaction is about trading some disk IO now for fewer seeks later. In other words, the more files a compaction can merge the better, and for merging the same number of files, the less IO generated the better; only then is the selected file list optimal.


  • RatioBasedCompactionPolicy simply walks the StoreFile list from head to tail and selects the first sequence that satisfies the ratio condition for compaction. This fits the typical scenario where StoreFiles are produced by continual memstore flushes, but not bulk-loaded files, where it can get trapped in a local optimum.
  • ExploringCompactionPolicy, by contrast, records the current best while walking from head to tail, and then selects the globally optimal list.

The logic of the two algorithms can be found in the corresponding applyCompactionPolicy() functions in the code. Research and development of other CompactionPolicy implementations is also very active, for example tier-based compaction (HBASE-6371, from Facebook) and stripe compaction (HBASE-7667).
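To make the difference concrete, here is a toy model of the exploring-style selection (greatly simplified and not the actual HBase Java code; the file-count limits, ratio value, and sizes are illustrative assumptions):

```python
def eligible(window, ratio):
    # every file must be no larger than ratio * (sum of the other files),
    # which filters out windows dominated by one huge file
    total = sum(window)
    return all(size <= ratio * (total - size) for size in window)

def exploring_select(sizes, min_files=2, max_files=5, ratio=1.2):
    """Examine every contiguous window of StoreFile sizes and keep the global
    best: merge as many files as possible; break ties with the least IO."""
    best = None
    for start in range(len(sizes)):
        for end in range(start + min_files,
                         min(start + max_files, len(sizes)) + 1):
            window = sizes[start:end]
            if not eligible(window, ratio):
                continue
            if best is None or (len(window), -sum(window)) > (len(best), -sum(best)):
                best = window
    return best or []

# one 100MB bulk-loaded file next to small flush files: the big file is skipped
print(exploring_select([100, 12, 12, 10, 8]))  # → [12, 12, 10, 8]
```

A head-to-tail ratio-based walk would stop at the first eligible sequence it meets; examining every window is what lets the exploring policy skip the bulk-loaded outlier and merge the most files for the least IO.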

A gripe: why does HBase compaction cause so many problems? I feel what is missing is an overall IO-load feedback and scheduling mechanism. Compaction reads data from HDFS and writes it back to HDFS, competing for IO with every other HDFS workload. With an IO resource management and scheduling mechanism, compactions could run when HDFS load is light and be held off when it is heavy. The same problem exists in Hadoop/HDFS itself: Hadoop's resource management currently covers only CPU and memory, not IO, so a job whose bug makes it write huge amounts of data to HDFS can severely hurt the read/write performance of other, well-behaved jobs.

2, Mean Time To Recovery (MTTR) Optimization

Today a Region Server is a single point for the regions it serves. If an RS dies, the data in its regions cannot be accessed until all of those regions are reassigned to other RSs. The main improvements to this process include:

  • HBASE-5844 and HBASE-5926: delete the znodes of a dead Region Server/Master on ZooKeeper, so we no longer have to wait for the 30s znode timeout to discover that the RS/Master is gone.
  • HBASE-7006: Distributed Log Replay reads the dead server's WAL directly from HDFS and replays it straight to the newly assigned RSs, instead of first creating temporary recovered.edits files and then replaying from those.
  • HBASE-7213/8631: the Region Server hosting the region of HBase's META table keeps two WALs, an ordinary one and one dedicated to the META region, so that during recovery the META table can be restored first.

3, Bucket Cache (L2 cache for HBase)

A RegionServer's memory is split into two parts: the Memstore, used mainly for writes, and the BlockCache, used mainly for reads. The block cache hit rate has a large impact on HBase read performance. The current default, LruBlockCache, manages the BlockCache with a plain JVM HashMap and suffers from heap fragmentation and full GC pauses.

HBASE-7404 introduces the Bucket Cache, which can live in memory or on external storage suited to fast random reads, such as SSDs, so the cache can be made very large and HBase read performance improves significantly. In essence, the Bucket Cache has HBase manage its own memory instead of leaving it to the Java GC, a question that has been hotly debated ever since HBase was born.

4, Java GC Improvements

MemStore-Local Allocation Buffers solved the full GCs caused by memory fragmentation through pre-allocated memory chunks, but under frequent updates, chunks without references at MemStore flush time still trigger many young GCs. HBASE-8163 therefore introduces a MemStoreChunkPool: HBase itself manages a pool of chunks rather than depending on the JVM GC. The essence of this ticket, again, is to have the HBase process manage memory allocation and reuse instead of the Java GC.

5, Enterprise Database Features (Secondary Index, Join, and Transaction)

When enterprise database features in HBase come up, secondary indexes, joins, and transactions are the first things that come to mind. At present these features are all provided by external projects.

Huawei's hindex is the best secondary index implementation I have seen so far. The main idea is to build an index table whose region distribution matches the main table's, i.e., the index-table region corresponding to a given main-table region lives on the same RS. The index table is barred from automatic or manual splits; it splits only when the main table splits.

How is this done? The index table is, after all, just another HBase table, so only its RowKey can be indexed, which makes the index table's RowKey design crucial: index RowKey = start RowKey of the main-table region + index name (a main table can have several indexes, all stored in the same index table) + indexed column value + main-table RowKey. This design guarantees that the index-table region and the corresponding main-table region sit on the same RS, saving RPCs during queries. On every insert, a coprocessor also inserts the corresponding entry into the index table. On every scan by an indexed column, a coprocessor first fetches the matching main-table RowKeys from the index table and then scans the main table. Performance-wise, query performance improves enormously while insert performance drops by roughly 10%.
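The rowkey composition described above can be sketched as simple byte concatenation. All values here are hypothetical, and the real hindex implementation additionally fixes the width of each component so the parts can be parsed back out:

```python
def index_rowkey(region_start_key, index_name, column_value, main_rowkey):
    # index RowKey = main-table region start key + index name
    #                + indexed column value + main-table RowKey
    return region_start_key + index_name + column_value + main_rowkey

key = index_rowkey(b"user000", b"idx_city", b"Beijing", b"user123")
# sharing the region start key as a prefix keeps every index entry inside
# the key range of the region it indexes, co-located on the same RS
assert key.startswith(b"user000")
print(key)
```

Because the region start key leads the index rowkey, a query for one indexed value needs only a local read on the RS that already holds the corresponding main-table region.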


Phoenix also implements joins between one large and one small table, using the familiar trick of broadcasting the small table to all RSs, performing a hash join through coprocessors, and then aggregating the results. This feels a bit superfluous: HBase was designed precisely to avoid joins by using big, denormalized tables. Now that joins are supported, I wonder which Salesforce business requirement called for this scenario.

As for transaction support, the project currently drawing the most attention is Yahoo's Omid, although enthusiasm for this feature does not seem especially high.


Because HBase stores KeyValues in the form Row/Family/Qualifier/TimeStamp/Value, with Row/Family/Qualifier acting as prefixes, storing every row verbatim takes considerable space. HBase 0.94 already introduced DataBlock Encoding (HBASE-4218), which compresses the repeated Row/Family/Qualifier prefixes stored in order, improving memory utilization; four encodings are supported: FAST_DIFF, PREFIX, PREFIX_TRIE, and DIFF. But this feature only lowers memory usage through delta encoding/compression; it does nothing for query efficiency, and the compression/decompression even costs extra CPU.

HBASE-4676: PrefixTreeCompression stores the repeated Row/Family/Qualifier parts as a prefix tree. The prefix tree can be built while parsing, and the children of each tree node are sorted, so looking up data inside a DataBlock can be faster than binary search.


  • HBASE-5305:为了更好的跨版本的兼容性,引进了Protocol Buffer作为序列化/反序列化引擎使用到RPC中(此前Hadoop的RPC也全部用PB重写了)。因为随着HBase Server的不断升级,有些Client的版本可能还比较旧,所以需要RPC在新旧版本之间兼容。
  • HBASE-6055 HBASE-7290:HBase Table Snapshot。创建snaphost对HBase集群没有性能影响,只是生成了snaphost对应的metadata而不会去拷贝数据。用户可以通过创建snaphost实现backup和disaster recovery,例如用户在创建一个snaphost之后可能会误操作导致一些表出现了问题,这样我们可以选择回滚到创建snaphost的那个阶段而不会导致数据全都不可用。也可以定期创建snapshot然后拷贝到其他集群用于定时的offline processing。
  • HBASE-8015: 在0.96中,ROOT表已经改名为hbase:namespace,META则是hbase:meta。而且hbase:namespace是存在zookeeper上的。这个namespace类似于RDBMS里的database的概念,可以更好的做权限管理和安全控制。HBase中table的META信息也是作为一种Region存放在Region Server上的,那么META表的Region和其他普通Region就会产生明显的资源竞争。为了改善META Region的性能,360的HBase中提出了专属MetaServer,在这个Region Server上只存放META Region
  • HBASE-5229: multi-row transactions within a single region. All writes in one operation that touch the same region execute as a transaction once the row locks of all affected rows have been acquired (locks are taken in rowkey order to prevent deadlock).
  • HBASE-4811: Reverse Scan. When asked how to read HBase data in reverse, the stock answer used to be "build a second table with reversed keys"; LevelDB and Cassandra both support reverse scans natively. HBase's reverse scan is about 30% slower than a forward scan, roughly in line with LevelDB.
  • Hoya — HBase on YARN. Multiple HBase instances with different versions and configurations can be deployed on a single YARN cluster.

Looking ahead to 2014, HBase is about to release version 1.0, with better multi-tenancy support and cell-level ACL control.


  • People from Cloudera/Hortonworks/Yahoo/Facebook focus on systems and performance topics.
  • People from Salesforce/Huawei seem more interested in enterprise features, which makes sense given that their customers come from deep-pocketed industries such as telecom, finance, and securities.
  • Chinese companies such as Alibaba, Xiaomi, and Qihoo 360 care more about performance, stability, and operations; the Chinese internet industry uses HBase chiefly to solve concrete business problems.
  • More and more companies build their HBase clusters in the cloud; for example, all of Pinterest's HBase clusters run on AWS. The startup environment abroad is enviable: with AWS, a company barely needs to spend resources on its own infrastructure.
  • The traditional HBase use case is online storage serving real-time reads: Alipay stores users' historical transactions in HBase for user-facing queries, and China Unicom stores users' browsing history for real-time lookup. HBase is now also moving toward real-time data mining; for example, kiji, open-sourced by WibiData, makes it easy to build real-time recommendation engines, real-time user segmentation, and real-time fraud detection on top of HBase.

[repost] A Brief Introduction to HBase



HBase, the Hadoop Database, is a highly reliable, high-performance, column-oriented, scalable distributed storage system. With HBase, you can build large-scale structured storage clusters on inexpensive PC servers.

It is often described as a sparse, distributed, persistent, multidimensional sorted map, indexed by row key (rowkey), column family, and timestamp.


HBase is an open-source implementation of Google Bigtable; its ecosystem is shown in the figure below.


MapReduce provides it with high-performance computation.






The HBase table structure: as the figure below shows, an HBase table is composed of a RowKey, a timestamp (TS), and column families.

RowKey: the HBase row key, the unique identifier of a row, usually kept within 50 bytes.


Column Family: as I understand it, a column family is a grouping of columns. A table can have multiple column families, and each column family contains multiple columns; together they make up the table. Column families are defined in advance, while columns are specified dynamically at write time: you can add a column at any moment, and you can also delete one (remember to take the table offline first).



Row Key   Timestamp   Column Family "URI"   Column Family "Parser"
r1        t3          url=                  title=hello
r2        t5          url=                  content=work




The region is HBase's smallest unit of storage distribution. Regions are managed by RegionServers, and different regions of one table can live on different RegionServers. When a table is created, a single region is created by default; once the data grows past a threshold (256 MB by default), the system splits it into multiple regions.




1. The client sends a request to ZooKeeper, and ZooKeeper returns the location of the region holding the ROOT table.



ROW                                    COLUMN+CELL
.META.,,1                             column=info:regioninfo, timestamp=1378110081278, value={NAME => '.META.,,1', STARTKEY => '', ENDKEY => '', ENCODED => 1028785192,}
.META.,,1                             column=info:server, timestamp=1378349783451, value=cdh-1:60020
.META.,,1                             column=info:serverstartcode, timestamp=1378349783451, value=1378349719946
.META.,,1                             column=info:v, timestamp=1378110081278, value=\x00\x00


Only after the three steps above does the user's request actually reach the user table. And that is still not the whole story: the client caches these locations, so it first works through the same three steps against its local cache and falls back to ZooKeeper only on a miss; a full cold lookup can therefore take up to six steps.


HBase Architecture



With the diagram in mind, let's walk through the architecture. The client first asks ZooKeeper for the region's location and then contacts the RegionServer. Now we are inside the RegionServer, so let's start there.


A RegionServer consists of one HLog plus N regions.


The HLog is the famous WAL (Write-Ahead Log), and it is central to HBase's reliability. When a write request arrives at a region, HBase first records the request in the HLog (which lives on HDFS, replicated and durable), and only then writes the data into the region; that second step is described below. Because HLog writes are sequential appends, they are very fast.

When a RegionServer dies, the failure is detected through ZooKeeper; its HLog is then split, with each piece handed to the corresponding region, and those regions are reassigned to other live RegionServers. A RegionServer that finds a pending log file starts a new thread to replay it.



A region is made up of multiple Stores, one per column family; a column family can be viewed as a unit of centralized storage.

A Store consists of a MemStore and StoreFiles. The MemStore holds data in memory: a write goes to the HLog first and then to the MemStore; when the MemStore fills up, it is flushed to a StoreFile.
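The write path just described (HLog first, then MemStore, then flush to a StoreFile) can be modeled with a toy class; all names here are illustrative, not HBase API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model of the write path described above: append to the log first,
// then to the in-memory store; flush the memstore to an immutable "store file"
// when it exceeds a threshold. Purely illustrative, not HBase code.
public class StoreDemo {
    final List<String> hlog = new ArrayList<>();
    final TreeMap<String, String> memStore = new TreeMap<>();
    final List<TreeMap<String, String>> storeFiles = new ArrayList<>();
    final int flushThreshold;

    StoreDemo(int flushThreshold) { this.flushThreshold = flushThreshold; }

    void put(String rowKey, String value) {
        hlog.add(rowKey + "=" + value);          // 1. WAL first, for durability
        memStore.put(rowKey, value);             // 2. then the sorted memstore
        if (memStore.size() >= flushThreshold) { // 3. flush when full
            storeFiles.add(new TreeMap<>(memStore));
            memStore.clear();
        }
    }
}
```

Note that the "store files" are sorted snapshots of the memstore; this is why reads may need to consult several files until compaction merges them.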


Minor compaction: when many small StoreFiles accumulate, the system compacts some of them (not all) into one larger file, but it does not drop redundant versions or deleted cells.



[repost] Improving HBase Write Performance




1. Write Buffer Size


HTable htable = new HTable(config, tablename);
htable.setWriteBufferSize(6 * 1024 * 1024);
htable.setAutoFlush(false);


* Auto flush must be disabled.

* 6 MB is an empirical value; tune it up or down to suit your write pattern.



The HBase client submits to the RegionServer only once the buffered data reaches the configured threshold, which cuts down the number of RPC round trips. At the same time, account for the server-side memory this consumes: hbase.client.write.buffer * hbase.regionserver.handler.count. Find the balance between fewer RPCs and more server-side memory.
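As a worked example of that memory estimate (plain arithmetic, not an HBase API):

```java
// Rough server-side memory estimate for buffered writes, as described above:
// hbase.client.write.buffer * hbase.regionserver.handler.count.
public class WriteBufferMath {
    public static long serverSideBufferBytes(long writeBufferBytes, int handlerCount) {
        return writeBufferBytes * handlerCount;
    }
}
```

With the 6 MB buffer above and 100 handlers, the RegionServer may need to hold on the order of 600 MB of in-flight write data.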


2. RPC Handler



<property>
  <name>hbase.regionserver.handler.count</name>
  <value>100</value>
</property>



This setting controls the number of RPC handlers on each RegionServer. A RegionServer receives and processes external requests through its RPC handlers, so raising the handler count can, to a degree, increase HBase's capacity to accept requests. Of course, more handlers is not always better; the right number depends on the node's hardware.


3. Compression


HColumnDescriptor hcd = new HColumnDescriptor(familyName);
hcd.setCompressionType(Algorithm.SNAPPY);





4. WAL


Put put = new Put(rowKey);
put.setWriteToWAL(false);

Skipping the WAL speeds up writes, but any data still in the MemStore is lost if the RegionServer crashes before a flush.





5. Replication



6. Compaction

While inserting data, open the HMaster web UI and watch the request count of each RegionServer. Make sure that, most of the time, write requests are spread roughly evenly across the RegionServers.


With that in place, we can consider compaction. Keep watching the request counts and you will notice periods during which some RegionServers receive zero requests (it could also simply mean the client is not writing to them at all, hence the earlier point about even distribution). This is very likely the RegionServer performing a compaction, because compaction blocks writes.





hbase.regionserver.thread.compaction.large
hbase.regionserver.thread.compaction.small



7. Reduce the Number of Region Splits

Region splits are a major obstacle to write performance. There are two ways to reduce how often they happen: pre-splitting regions (covered in the table-design section below), and raising hbase.hregion.max.filesize appropriately.


Raising the region file-size cap also reduces the number of splits. The exact value must be weighed against your data volume, region count, rowkey distribution, and so on; generally, 3–4 GB is a good choice.
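A back-of-the-envelope way to see the effect (illustrative arithmetic only, not an HBase API):

```java
// Rough estimate of how many regions a data set needs at a given
// hbase.hregion.max.filesize, ignoring compression and overheads.
public class RegionCountMath {
    public static long estimatedRegions(long totalDataBytes, long maxFileSizeBytes) {
        // Ceiling division: a partially filled region still counts as a region.
        return (totalDataBytes + maxFileSizeBytes - 1) / maxFileSizeBytes;
    }
}
```

For 1 TB of data, a 4 GB cap means roughly 256 regions, while the old 256 MB default would mean 4096 regions and correspondingly many more splits.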


8. HFile format version

Any version after 0.92.0 should use HFile format version 2, which supports larger regions than v1. The usual rule of thumb is that fewer, larger regions perform better (within reason: if regions get too large, major compactions take unbearably long). So set hfile.format.version to 2 and raise the HFile size. Users still on the v1 format need not worry: there is a tool to migrate data to v2; see HBASE-1621.


9. hbase.ipc.client.tcpnodelay

Set this to true. It disables Nagle's algorithm, which can improve latency. Disable TCP Nagle on HDFS as well.

A TCP/IP optimization called the Nagle Algorithm can also limit data transfer speed on a connection. The Nagle Algorithm is designed to reduce protocol overhead for applications that send small amounts of data, such as Telnet, which sends a single character at a time. Rather than immediately send a packet with lots of header and little data, the stack waits for more data from the application, or an acknowledgment, before proceeding.





1. Pre-split Regions

As mentioned earlier, pre-splitting regions is one of the two main ways to prevent region splits.


The mechanics of region splits will not be repeated here. Design and pre-allocate regions according to your data volume and rowkey patterns; this can drastically reduce the number of splits, sometimes to zero. It is very important.


2. Number of Column Families

Testing shows that the number of column families has a direct impact on performance. Keep it small; a single column family is best.



The former caps how many historical versions of a cell are kept; the latter caps how many bytes of history can be stored in a cell. Lowering these values helps.


4. Rowkey Design

A region's data boundaries are its start key and end key; a record whose rowkey falls between them is stored in that region. When writing data, and especially when importing a customer's existing data, a poorly designed rowkey can easily cause performance problems. As discussed earlier, if during some period most rowkeys fall within one particular range, the region owning that range becomes extremely busy while the other regions sit nearly idle, wasting resources.


So how should a rowkey be designed? A practical example: suppose an HBase table records a city's daily call records, and the conventional rowkey is phone number + yyyyMMddHHmmss (call start time) + …, with regions partitioned by phone-number ranges. Because calls occur essentially at random, this easily produces an extremely uneven rowkey distribution within any time slice. If the phone number is reversed, however, the distribution of data across regions improves greatly.
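The reversal trick can be sketched like this (method names are illustrative, not from any HBase API):

```java
// Sketch of the reversed-phone-number rowkey described above. Reversing the
// number moves the highly variable trailing digits to the front of the rowkey,
// spreading simultaneous writes across regions instead of hammering one range.
public class RowKeyDemo {
    public static String reversedPhoneRowKey(String phone, String startTime) {
        return new StringBuilder(phone).reverse().toString() + startTime;
    }
}
```

Two calls from numbers sharing a long prefix (e.g. the same area code) now produce rowkeys that start with their distinct last digits, so they land in different regions.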


Rowkey designs vary endlessly, but the principle is constant: within any unit of time, the rowkeys being written should be spread as evenly as possible across regions.






1. Spread Write Pressure Evenly Across Region Servers

Recall the RPC handler concept introduced earlier. A good data loader keeps every RPC handler busy, but none overloaded. Also note that the pressure on any individual region must not grow too high, or writes will be retried repeatedly with timeout exceptions (raising the timeout setting can help).


How do you keep the pressure on each RegionServer balanced? It depends on the number of regions, the rowkey design, and the order in which the client inserts data. The designer must weigh the user's data and the cluster's characteristics together.


2. A Parallel Insertion Framework

Multithreading is the simplest solution. The key is to make each thread responsible for part of the rowkey range; since rowkey ranges map to regions, the program can then control the pressure on each region at insert time so that no region sits idle. As this is relatively simple, no further detail is given.



Note: do not use a reducer. Traffic from mappers to reducers crosses the network and is limited by cluster bandwidth. Furthermore, the typical real-world scenario is that a user has exported text data from a relational database and wants to write it into HBase; in that case, the FileInputFormat file-split logic must be designed and implemented with care.


3. BulkLoad


Configure a MapReduce Job to perform an incremental load into the given table. This:
  • Inspects the table to configure a total order partitioner
  • Uploads the partitions file to the cluster and adds it to the DistributedCache
  • Sets the number of reduce tasks to match the current number of regions
  • Sets the output key/value class to match HFileOutputFormat's requirements
  • Sets the reducer up to perform the appropriate sorting (either KeyValueSortReducer or PutSortReducer)
The user should be sure to set the map output value class to either KeyValue or Put before running this function.


This is the MapReduce-based bulk import mechanism that HBase provides, and it bypasses the HBase client entirely (the distributed insertion approach of the previous section was also implemented with MapReduce, but it still wrote through the HBase client).



That said, this approach is hard to use as-is in a real production environment, since source data can rarely be written straight into HBase: a migration involves cleaning, normalization, merging, and much other extra work, none of which a command line can do. Per the API description, a feasible plan is to write a custom Mapper that cleans the data and emits values of HBase's Put type, choose PutSortReducer as the reducer, and then let HFileOutputFormat#configureIncrementalLoad(Job, HTable) take care of the rest.






[repost] HFile Storage Format


All HBase data files are stored on the Hadoop HDFS file system, mainly as the two file types mentioned above:

1. HFile: the storage format for HBase's KeyValue data. HFile is a Hadoop binary file format; a StoreFile is just a lightweight wrapper around an HFile, i.e. the StoreFile's underlying file is an HFile.

2. HLog File: the storage format for HBase's WAL (Write Ahead Log); physically it is a Hadoop SequenceFile.




An HFile is composed of six sections. The KeyValue data is stored in Block 0 … N; the other sections serve to locate the start of the Block Index, to determine which block holds a given key (the block index), and to test whether a key is present in the HFile at all (the Meta Block stores Bloom filter data). The implementation is in HFile.java, and the HFile is written top to bottom in the order: Data Blocks, Meta Blocks, File Info, Data Block Index, Meta Block Index, Fixed File Trailer.

KeyValue: each KeyValue pair in an HFile is a plain byte array, but one containing many fields in a fixed layout. The structure is as follows:

It begins with two fixed-length integers giving the key length and the value length. The key comes next: a fixed-length integer for the RowKey length, then the RowKey itself, a fixed-length integer for the family length, then the family, then the qualifier, and finally two fixed-length fields for the timestamp and the key type (Put/Delete). The value part has no such structure; it is pure binary data.
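A simplified encoder for that layout (an illustration of the byte structure just described, not HBase's actual KeyValue class; field widths follow the description above):

```java
import java.nio.ByteBuffer;

public class KeyValueDemo {
    // Encode a simplified KeyValue:
    // [keyLen:int][valLen:int][rowLen:short][row][famLen:byte][family]
    // [qualifier][timestamp:long][keyType:byte][value]
    public static byte[] encode(byte[] row, byte[] family, byte[] qualifier,
                                long timestamp, byte keyType, byte[] value) {
        int keyLen = 2 + row.length + 1 + family.length + qualifier.length + 8 + 1;
        ByteBuffer buf = ByteBuffer.allocate(4 + 4 + keyLen + value.length);
        buf.putInt(keyLen);                // key length
        buf.putInt(value.length);          // value length
        buf.putShort((short) row.length);  // RowKey length
        buf.put(row);
        buf.put((byte) family.length);     // family length
        buf.put(family);
        buf.put(qualifier);                // qualifier length is implied by keyLen
        buf.putLong(timestamp);
        buf.put(keyType);                  // Put/Delete marker
        buf.put(value);                    // value: raw binary, no structure
        return buf.array();
    }

    public static int keyLength(byte[] kv)   { return ByteBuffer.wrap(kv).getInt(0); }
    public static int valueLength(byte[] kv) { return ByteBuffer.wrap(kv).getInt(4); }
}
```

Note how the qualifier carries no explicit length: it occupies whatever remains of the key after the row, family, timestamp, and type fields are accounted for.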

Data Block: consists of DATABLOCKMAGIC followed by a number of records, where each record is one KeyValue (key length, value length, key, value). The default block size is 64 KB. Small blocks favor random reads while large blocks favor scans, because when reading a KeyValue, HBase loads the whole matching data block into the LRU block cache rather than just that one record.

private void append(final byte[] key, final int koffset, final int klength,
    final byte[] value, final int voffset, final int vlength) throws IOException {
  this.keylength += klength;
  this.valuelength += vlength;
  this.out.write(key, koffset, klength);
  this.out.write(value, voffset, vlength);
}


Meta Block: consists of METABLOCKMAGIC plus the Bloom filter data.

public void close() throws IOException {
  if (metaNames.size() > 0) {
    for (int i = 0; i < metaNames.size(); ++i) {
      // ... write out each meta block ...
    }
  }
  // ...
}






File Info: a MapSize followed by key/value pairs; this section holds basic information about the HFile, such as hfile.LASTKEY, hfile.AVG_KEY_LEN, hfile.AVG_VALUE_LEN, and hfile.COMPARATOR.

private long writeFileInfo(FSDataOutputStream o) throws IOException {
  if (this.lastKeyBuffer != null) {
    // Make a copy. The copy is stuffed into HMapWritable. Needs a clean
    // byte buffer. Won't take a tuple.
    byte[] b = new byte[this.lastKeyLength];
    System.arraycopy(this.lastKeyBuffer, this.lastKeyOffset, b, 0, this.lastKeyLength);
    appendFileInfo(this.fileinfo, FileInfo.LASTKEY, b, false);
  }
  int avgKeyLen = this.entryCount == 0 ? 0 : (int) (this.keylength / this.entryCount);
  appendFileInfo(this.fileinfo, FileInfo.AVG_KEY_LEN, Bytes.toBytes(avgKeyLen), false);
  int avgValueLen = this.entryCount == 0 ? 0 : (int) (this.valuelength / this.entryCount);
  appendFileInfo(this.fileinfo, FileInfo.AVG_VALUE_LEN, Bytes.toBytes(avgValueLen), false);
  appendFileInfo(this.fileinfo, FileInfo.COMPARATOR,
      Bytes.toBytes(this.comparator.getClass().getName()), false);
  long pos = o.getPos();
  // ... write out the file info map itself ...
  return pos;
}


Data/Meta Block Index: consists of INDEXBLOCKMAGIC plus a number of records, where each record has three parts: the block's starting offset, the block's size, and the first key in the block.

static long writeIndex(final FSDataOutputStream o, final List<byte[]> keys,
    final List<Long> offsets, final List<Integer> sizes) throws IOException {
  long pos = o.getPos();
  // Don't write an index if nothing in the index.
  if (keys.size() > 0) {
    // Write the index.
    for (int i = 0; i < keys.size(); ++i) {
      byte[] key = keys.get(i);
      Bytes.writeByteArray(o, key);
      // ... write the matching offset and size ...
    }
  }
  return pos;
}


Fixed File Trailer: fixed in size; it mainly allows the reader to locate the starting positions of the File Info and Block Index sections.

public void close() throws IOException {
  trailer.fileinfoOffset = writeFileInfo(this.outputStream);
  trailer.dataIndexOffset = BlockIndex.writeIndex(this.outputStream,
      this.blockKeys, this.blockOffsets, this.blockDataSizes);
  if (metaNames.size() > 0) {
    trailer.metaIndexOffset = BlockIndex.writeIndex(this.outputStream,
        this.metaNames, metaOffsets, metaDataSizes);
  }
  trailer.dataIndexCount = blockKeys.size();
  trailer.metaIndexCount = metaNames.size();
  trailer.totalUncompressedBytes = totalBytes;
  trailer.entryCount = entryCount;
  trailer.compressionCodec = this.compressAlgo.ordinal();
  // ... serialize the trailer and close the output stream ...
}