Modeled after Google's Bigtable, Apache HBase is an open-source, non-relational, scalable, and distributed database developed as part of the Apache Software Foundation's Apache Hadoop project. It runs on top of HDFS (the Hadoop Distributed File System) and provides Bigtable-like capabilities on Hadoop.
Hadoop on its own supports only batch processing, and data is accessed strictly sequentially, so even the simplest lookup requires scanning the entire dataset. HBase is preferable over other options when the workload involves:
Data volume: data on the order of petabytes
Application type: a variable schema, where rows can have somewhat different columns
Hardware environment: running on top of HDFS with a larger number of nodes (5 or more)
No RDBMS requirements: no need for features like transactions, triggers, complex queries, or complex joins
Quick access to data: random, real-time access to data is required
In complex Big Data analysis systems, HBase and Hive, two important Hadoop-based technologies, can also be used in conjunction to reduce complexity and gain further extended features.
Background:
Apache HBase, now a top-level Apache project, was initiated by a company named Powerset, which needed to process large amounts of data and make it usable for natural language search.
Facebook notably implemented its new messaging platform on HBase.
As of February 2017, the 1.2.x series is considered the stable release line.
Data can be stored in HDFS either directly or through HBase. With HBase, data consumers can read and access data in HDFS randomly; HBase sits on top of the Hadoop File System and provides both read and write access.
HBase vs HDFS:
HDFS is a distributed file system well suited to storing large files, but it offers only sequential, batch-oriented access and no fast lookup of individual records. HBase, built on top of HDFS, provides exactly that: low-latency, random reads and writes of individual records in very large tables.
Storage Mechanism:
HBase is a column-oriented database whose tables are sorted by row key. The table schema defines only column families, which group key-value pairs. A table has several column families, and each column family can contain multiple columns. Column values within a family are stored contiguously on disk, and each cell value in the table carries a timestamp. The data model can be summarized as follows (a short Java sketch illustrating it comes after the list):
Table is a collection of rows
Row is a collection of column families
Column family is a collection of columns
Column is a collection of key-value pairs
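To make this concrete, here is a minimal Java sketch using the HBase 1.x client API; the table name "users", family "info", and qualifier "email" are illustrative assumptions, and the table is assumed to already exist:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DataModelSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {
            // A cell is addressed by (row key, column family, column qualifier)
            // and is versioned by a timestamp assigned on write.
            Put put = new Put(Bytes.toBytes("row-001"));         // row key
            put.addColumn(Bytes.toBytes("info"),                 // column family
                          Bytes.toBytes("email"),                // column (qualifier)
                          Bytes.toBytes("alice@example.com"));   // cell value
            table.put(put);

            Result result = table.get(new Get(Bytes.toBytes("row-001")));
            byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
            System.out.println(Bytes.toString(email));
        }
    }
}

Note how a cell is addressed by the combination of row key, column family, and column qualifier, and how all keys and values are plain byte arrays converted through the Bytes utility.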
Features:
Linearly scalable
Automatic failure support
Consistent reads and writes
Integrates with Hadoop, both as a source and a destination
Provides an easy Java API for clients
Data replication across clusters
Architecture:
In HBase, tables are split into smaller regions, which are served by region servers. Each region is further vertically partitioned by column family into parts known as Stores, and each store is saved as a file in HDFS. The diagram below shows the architecture of HBase:
Note: the term 'store' is used to describe the storage structure within a region.
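The region-to-server mapping is also visible from the client side. Below is a minimal sketch, assuming a table named "demo" already exists, that lists each region of the table and the region server currently hosting it (HBase 1.x client API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;

public class RegionSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             RegionLocator locator = connection.getRegionLocator(TableName.valueOf("demo"))) {
            // Each entry pairs one region of the table with the region server hosting it.
            for (HRegionLocation location : locator.getAllRegionLocations()) {
                System.out.println(location.getRegionInfo().getRegionNameAsString()
                        + " is served by " + location.getHostname());
            }
        }
    }
}

As a table grows and its regions split, re-running this sketch shows how HBase redistributes regions across the servers in the cluster.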
Setting up Runtime Environment:
The following are the prerequisites for HBase:
Create a separate Hadoop user (Recommended)
Setup SSH
Java
Hadoop
Configuring Hadoop
core-site.xml – host and port of HDFS (the HDFS URL), total memory allocation for the file system, and the size of the read/write buffer (see the sketch after this list)
hdfs-site.xml – should contain the replication factor, the namenode path, the datanode path, etc.
yarn-site.xml – used to configure YARN in Hadoop
mapred-site.xml – used to specify which MapReduce framework to use
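As an illustration, a minimal core-site.xml for a single-node setup might look like this; the host, port, and buffer size below are assumptions, not prescribed values:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- HDFS URL: host and port of the namenode -->
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <!-- size of the read/write buffer, in bytes -->
    <value>131072</value>
  </property>
</configuration>

The other files follow the same property/value structure; hdfs-site.xml, for instance, carries dfs.replication and the namenode and datanode directory paths.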
Installing HBase
We recommend using HDP (Hortonworks Data Platform) for learning purposes, as it ships with all the prerequisites already installed. We are using HDP v2.6 for this demo.
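Once HBase is available, whether preinstalled in HDP or from a standalone tarball, its behavior is driven by hbase-site.xml. Below is a minimal sketch for a pseudo-distributed setup; the paths and host names are illustrative assumptions (HDP ships its own preconfigured values):

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <!-- where HBase keeps its data inside HDFS -->
    <value>hdfs://localhost:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <!-- true: run the master and region servers as separate processes -->
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <!-- the ZooKeeper ensemble HBase coordinates through -->
    <value>localhost</value>
  </property>
</configuration>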
Java API:
HBase provides a Java API for communication, which is generally the fastest way to work with it. All DDL operations are primarily handled through HBaseAdmin. Sample code to obtain an HBaseAdmin instance is shown below:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

Configuration conf = HBaseConfiguration.create();
// Point the client at the ZooKeeper quorum; <server_ip> is a placeholder.
conf.set("hbase.zookeeper.quorum", "<server_ip>:2181");
conf.set("zookeeper.znode.parent", "/hbase-unsecure");
HBaseAdmin admin = new HBaseAdmin(conf);
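As a usage example, here is a hedged sketch of one DDL operation through that admin instance: creating a table named "users" with a single column family "info" (both names are illustrative):

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;

TableName tableName = TableName.valueOf("users");
if (!admin.tableExists(tableName)) {
    // The schema declares column families only; columns are added at write time.
    HTableDescriptor descriptor = new HTableDescriptor(tableName);
    descriptor.addFamily(new HColumnDescriptor("info"));
    admin.createTable(descriptor);
}
admin.close();  // release resources when done

Note that HBaseAdmin is the pre-1.0 entry point; on HBase 1.x it still works but is deprecated in favor of Connection.getAdmin().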