HBase

Introduction

Apache HBase is an open-source, distributed, versioned, sorted-map datastore modeled after Google's Bigtable. It runs on top of the Hadoop Distributed File System (HDFS).
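To make the phrase "versioned, sorted map" concrete, here is a minimal in-memory sketch in plain Java. It is not HBase client code: the class VersionedSortedMapSketch and its methods are hypothetical names chosen for illustration, and the sample values are made up. It models a table as a sorted map from row key to column to timestamp to value, and returns the newest version of a cell by default.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// In-memory model only (not the HBase client API):
// row key -> column ("family:qualifier") -> timestamp -> value.
class VersionedSortedMapSketch {
    private final TreeMap<String, TreeMap<String, TreeMap<Long, String>>> rows = new TreeMap<>();

    // Store one version of a cell under the given timestamp.
    void put(String rowKey, String column, long timestamp, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(column, k -> new TreeMap<>())
            .put(timestamp, value);
    }

    // Return the newest version of a cell, or null if the cell is absent.
    String getLatest(String rowKey, String column) {
        TreeMap<String, TreeMap<Long, String>> row = rows.get(rowKey);
        if (row == null) return null;
        TreeMap<Long, String> versions = row.get(column);
        return versions == null ? null : versions.lastEntry().getValue();
    }

    // Row keys come back in sorted order, regardless of insertion order.
    List<String> rowKeys() {
        return new ArrayList<>(rows.keySet());
    }

    public static void main(String[] args) {
        VersionedSortedMapSketch table = new VersionedSortedMapSketch();
        // Hypothetical sample data.
        table.put("00002", "Personal:Name", 1L, "first value");
        table.put("00001", "Personal:Name", 1L, "old value");
        table.put("00001", "Personal:Name", 2L, "new value");
        System.out.println(table.rowKeys());
        System.out.println(table.getLatest("00001", "Personal:Name"));
    }
}
```

Writing the same cell twice does not overwrite the earlier value in this model; it adds a second version, and reads pick the one with the largest timestamp.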

Because HBase is written in Java, it can be accessed through a native Java API (much as a relational database is accessed through JDBC). To access HBase from another programming language, you may use Thrift. To access HBase over HTTP, you may use its RESTful interface.

In this lab, you will learn:

How to interact with HBase using the command-line interface hbase-shell.
How to access HBase through the native Java API.

Interact with HBase using shell

Since HBase is implemented on top of HDFS, we need to start HDFS before launching HBase.

Start HDFS

Open a terminal and navigate to the Hadoop home directory with the following command:

Start HDFS with the following command:

Open HDFS's WebUI address: http://localhost:50070 in the virtual machine to see if HDFS has started successfully.

Start HBase

Navigate to the HBase home directory with the following command:

Start HBase with the following command:

Open HBase's WebUI address: http://localhost:16010 in the virtual machine to see if HBase has started successfully.

Use HBase for the first time using hbase-shell

Connect to the running instance of HBase using the hbase shell command, located in the bin/ directory of your HBase install:

Use the create command to create a new table named Contacts with two column families, Personal and Office. Recall that only table and column family names have to be pre-defined; columns within a column family can be added or deleted dynamically. Also note that table names, row keys, and column names must all be enclosed in quotes.
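The rule that column families are fixed at table creation while column qualifiers are free-form can be sketched with a small in-memory model. This is plain Java, not the HBase API; the class ColumnFamilySchemaSketch and its methods are hypothetical, and the qualifier names in main are invented for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeMap;

// In-memory model only: families are declared up front, qualifiers are not.
class ColumnFamilySchemaSketch {
    private final Set<String> families;
    private final TreeMap<String, TreeMap<String, String>> rows = new TreeMap<>();

    // Column families must be named when the table is created.
    ColumnFamilySchemaSketch(String... families) {
        this.families = new HashSet<>(Arrays.asList(families));
    }

    // Accept any qualifier, but only inside a declared column family.
    void put(String rowKey, String column, String value) {
        int sep = column.indexOf(':');
        if (sep < 0) {
            throw new IllegalArgumentException("expected family:qualifier, got " + column);
        }
        String family = column.substring(0, sep);
        if (!families.contains(family)) {
            throw new IllegalArgumentException("unknown column family: " + family);
        }
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    // Return the stored value, or null if the row or column is absent.
    String get(String rowKey, String column) {
        TreeMap<String, String> row = rows.get(rowKey);
        return row == null ? null : row.get(column);
    }

    public static void main(String[] args) {
        ColumnFamilySchemaSketch contacts = new ColumnFamilySchemaSketch("Personal", "Office");
        contacts.put("00001", "Personal:Name", "sample name");     // declared family
        contacts.put("00001", "Personal:Nickname", "sample nick"); // new qualifier, no schema change
        try {
            contacts.put("00001", "Home:Address", "sample address"); // undeclared family
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```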

To insert data into the table, use the put command.

Use the scan command to scan the table. Here, we can see that different row keys can have different subsets of columns specified. For instance, column Personal:Residence_phone is not specified for row key 00002.

To retrieve the data with respect to a single row key, use the get command.

To drop (delete) a table, we need to disable it first (using the disable command), then use the drop command.
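The disable-then-drop ordering can be pictured as a tiny state machine. The sketch below is a plain Java model, not the HBase API: TableLifecycleSketch and its methods are hypothetical names, and it only captures the ordering rule described above (a table is enabled after creation and can be dropped only once disabled).

```java
import java.util.HashMap;
import java.util.Map;

// In-memory model only: tracks whether each table is enabled or disabled.
class TableLifecycleSketch {
    enum State { ENABLED, DISABLED }

    private final Map<String, State> tables = new HashMap<>();

    // A newly created table starts out enabled.
    void create(String name) {
        if (tables.containsKey(name)) {
            throw new IllegalStateException("table already exists: " + name);
        }
        tables.put(name, State.ENABLED);
    }

    void disable(String name) {
        requireExists(name);
        tables.put(name, State.DISABLED);
    }

    // Dropping is only allowed for a disabled table.
    void drop(String name) {
        requireExists(name);
        if (tables.get(name) != State.DISABLED) {
            throw new IllegalStateException("table must be disabled before drop: " + name);
        }
        tables.remove(name);
    }

    boolean exists(String name) {
        return tables.containsKey(name);
    }

    private void requireExists(String name) {
        if (!tables.containsKey(name)) {
            throw new IllegalStateException("no such table: " + name);
        }
    }

    public static void main(String[] args) {
        TableLifecycleSketch admin = new TableLifecycleSketch();
        admin.create("Contacts");
        admin.disable("Contacts"); // step 1: disable
        admin.drop("Contacts");    // step 2: drop
        System.out.println(admin.exists("Contacts"));
    }
}
```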

Access HBase using Java API

Since HBase is written in Java, it is no surprise that it has a native Java API. This API can do everything that hbase-shell can do and more. It can be divided into two categories:

Java Administrative API: often used by the application administrator to create and delete tables along with their column families.
Java Client API: often used by application clients for CRUD (i.e., Create, Retrieve, Update, and Delete) operations on table rows.

We use a Twitter application as an example to walk through these two kinds of APIs. Download the code, unzip it, and move the folder to the guest machine through the shared folder (see Lab 1). Assume the path of the folder is as follows:

Java Administrative API

The application administrator is responsible for the table design, which can be defined by answering the following questions in the context of a use case:

(i) What should the row key be? (Note that indexing is only done for row keys. So use this to your advantage.)
(ii) What column families should the table have? (Note that different column families for a single row may be stored separately. So store everything with similar access patterns in the same column family.)
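One practical consequence of the row key being the only index is that rows are kept sorted by row key, and the comparison is lexicographic rather than numeric. The plain Java sketch below illustrates why numeric identifiers are often zero-padded, as with the key 00002 in the shell example. RowKeyOrderingSketch and its methods are hypothetical, and Java strings stand in here for the byte arrays a real row key would be.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Illustrates lexicographic ordering of row keys (strings as a stand-in for bytes).
class RowKeyOrderingSketch {
    // Sort keys the way a sorted map would; duplicates collapse because row keys are unique.
    static List<String> sortedKeys(List<String> keys) {
        return new ArrayList<>(new TreeSet<>(keys));
    }

    // Left-pad a non-negative numeric id with zeros to a fixed width.
    static String padded(int id, int width) {
        return String.format(Locale.ROOT, "%0" + width + "d", id);
    }

    public static void main(String[] args) {
        // Unpadded: "10" sorts before "2" because '1' < '2'.
        System.out.println(sortedKeys(Arrays.asList("2", "10", "1")));
        // Padded to a fixed width: lexicographic order matches numeric order.
        System.out.println(sortedKeys(Arrays.asList(padded(2, 5), padded(10, 5), padded(1, 5))));
    }
}
```

With fixed-width keys, a range of row keys corresponds to a contiguous range of numeric ids, which is what makes range scans over the row key useful.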

In our use case, we mainly use the tweet ID to access the information of each tweet, so we set twitterId as the row key [answering question (i)]. Furthermore, we create two column families, general and user. The former contains information about the tweet itself (e.g., text, creation time), and the latter contains information about its sender (e.g., name, registered state). This design reflects our access pattern: information belonging to different column families will rarely be accessed together [answering question (ii)]. The file HbaseTableCreator.java creates the twitter table as required.

Execute the following command to compile this code:

The backtick-quoted string `~/Programs/hbase/bin/hbase classpath` is a shell command substitution: the shell runs the hbase classpath command and substitutes its output, which lists all the Java libraries required to compile the code. Note that the command is quoted with backticks (`), not single quotes (').

Next, run this code as follows:

Log information will appear after running this code. To check whether the table twitter was created successfully, open the hbase shell and look up the table information using the list and describe commands:

Java Client API

After the table twitter has been created, application clients can access it with CRUD (Create, Retrieve, Update, and Delete) operations. The file HbaseClientExample.java contains the complete code. Within this file, we created a simple Twitter class:

In this class, twitterId serves as the row key. The attributes (i.e., column qualifiers) text and createdAt belong to column family general, and name belongs to column family user. Note that in practice, the column qualifiers of each column family can be added dynamically. Here we fix them for ease of illustration.
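The mapping from object attributes to family:qualifier cells can be sketched as follows. This is not the Twitter class from HbaseClientExample.java: TwitterSketch, its constructor, and toCells are hypothetical names, and the sketch only assumes the four attributes named above. Attributes that are null produce no cell at all, which is how a sparse row arises.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the lab's Twitter class; twitterId is the row key.
class TwitterSketch {
    final String twitterId;
    final String text;      // general:text
    final String createdAt; // general:createdAt
    final String name;      // user:name

    TwitterSketch(String twitterId, String text, String createdAt, String name) {
        this.twitterId = twitterId;
        this.text = text;
        this.createdAt = createdAt;
        this.name = name;
    }

    // Map each non-null attribute to a "family:qualifier" cell.
    Map<String, String> toCells() {
        Map<String, String> cells = new TreeMap<>();
        putIfPresent(cells, "general:text", text);
        putIfPresent(cells, "general:createdAt", createdAt);
        putIfPresent(cells, "user:name", name);
        return cells;
    }

    // A missing attribute simply yields no cell; nothing is stored for it.
    private static void putIfPresent(Map<String, String> cells, String column, String value) {
        if (value != null) {
            cells.put(column, value);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample data with no creation time.
        TwitterSketch t = new TwitterSketch("sample-id", "sample text", null, "sample user");
        System.out.println(t.twitterId + " -> " + t.toCells().keySet());
    }
}
```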

Gaining access to the `twitter` table

Next, we walk through the example step by step. First, the constructor of class HbaseClientExample connects to HBase and saves a reference to an HTableInterface in a class member variable (i.e., twitterTable), which handles all subsequent access to the table twitter in HBase. Then, we illustrate four types of access: (1) create, (2) retrieve, (3) update, and (4) delete.

Create (=Put/Insert)

We create five tweets (Twitter objects) and insert them into the table. Note that an HBase table is sparse. For example, no value is specified for the general:createdAt column of twitter2.

Retrieve

Here we illustrate two types of retrieve methods:

Range scan. It retrieves data within a specific range of row keys using the Scan class.
Single row retrieval. It obtains data w.r.t. a specific row key using the Get class.
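The difference between the two retrieval methods can be sketched on an in-memory sorted map. This is plain Java rather than the Scan and Get classes themselves: RangeScanSketch and its methods are hypothetical, and the choice of an inclusive start row and an exclusive stop row is an assumption made here to mirror the usual range-scan convention.

```java
import java.util.Collections;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory model only: rows sorted by row key, each row a map of column -> value.
class RangeScanSketch {
    private final TreeMap<String, Map<String, String>> rows = new TreeMap<>();

    void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    // Range scan: all rows with startRow <= key < stopRow (assumed convention).
    NavigableMap<String, Map<String, String>> scan(String startRow, String stopRow) {
        return rows.subMap(startRow, true, stopRow, false);
    }

    // Single-row retrieval: the cells of exactly one row, or an empty map.
    Map<String, String> get(String rowKey) {
        return rows.getOrDefault(rowKey, Collections.emptyMap());
    }

    public static void main(String[] args) {
        RangeScanSketch table = new RangeScanSketch();
        // Hypothetical row keys and values.
        for (int i = 1; i <= 5; i++) {
            table.put("0000" + i, "general:text", "sample text " + i);
        }
        System.out.println(table.scan("00002", "00004").keySet());
        System.out.println(table.get("00003"));
    }
}
```

Because the rows are sorted by key, the range scan touches only the contiguous slice between the two bounds, while the single-row lookup goes straight to one key.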