Hardware profile

Right now, for POC purposes, Hadoop 0.19.1 and HBase 0.19.2 are running in a single-node configuration on the following hardware:

CPU: 

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 107
model name      : AMD Athlon(tm) 64 X2 Dual Core Processor 4000+
stepping        : 1
cpu MHz         : 1000.000
cache size      : 512 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow rep_good pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy 3dnowprefetch
bogomips        : 2010.51
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc 100mhzsteps

processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 107
model name      : AMD Athlon(tm) 64 X2 Dual Core Processor 4000+
stepping        : 1
cpu MHz         : 1000.000
cache size      : 512 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow rep_good pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy 3dnowprefetch
bogomips        : 2010.51
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc 100mhzsteps

Memory:
Looks like there's only 1 GB of memory available on the machine. Seem to have lost some over the years...!

134G of disk space has been set aside for HDFS for the POC.

Creating HBase Indexes

Another interesting session, this time learning how to create HBase indexes. One thing I've picked up so far (and there is a good possibility I'm wrong) is that you cannot add indexes to a table after it has been created. Then again, maybe you can with alter. That's for another day.

Create a new table + index:

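import java.io.IOException;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.tableindexed.IndexSpecification;
import org.apache.hadoop.hbase.client.tableindexed.IndexedTable;
import org.apache.hadoop.hbase.client.tableindexed.IndexedTableAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public static void createIndex(String TABLE_NAME) throws IOException {
   String familyName = "entry:";
   byte[] FAMILY = Bytes.toBytes(familyName);

   // describe the base table with a single column family
   HTableDescriptor desc = new HTableDescriptor(TABLE_NAME);
   desc.addFamily(new HColumnDescriptor(FAMILY));

   // add one index per column, named after the column it covers
   String[] columns = { "hostname", "msg" };
   for (int i = 0; i < columns.length; i++) {
      byte[] COL_NAME = Bytes.toBytes(familyName + columns[i]);
      String INDEX_COL_NAME = columns[i];
      IndexSpecification colIndex = new IndexSpecification(INDEX_COL_NAME, COL_NAME);
      desc.addIndex(colIndex);
   }

   // getConfig() is a helper defined elsewhere that returns the HBase configuration
   IndexedTableAdmin admin = new IndexedTableAdmin(getConfig());
   // creates the new table and its index tables
   admin.createTable(desc);
   IndexedTable table = new IndexedTable(getConfig(), desc.getName());
}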

So, once this is run, the following happens:

Table:  TABLE_NAME is created with indexes on "entry:hostname" and "entry:msg"
Table:  TABLE_NAME-hostname is created
Table:  TABLE_NAME-msg is created

Great, so now there is one table and two index tables. 

We can now push some data into it.  Before doing this however, we need to make a configuration change to HBase.

In $HBASE_HOME/conf/hbase-site.xml add the following:

  
        hbase.regionserver.class
        org.apache.hadoop.hbase.ipc.IndexedRegionInterface
        enable indexing
  

  
        hbase.regionserver.impl
        org.apache.hadoop.hbase.regionserver.tableindexed.IndexedRegionServer
        enable indexing
  

This will start the indexing service (if that's the right terminology).  Restart HBase and push some data into the table.  Once you've done this and you scan either of the index tables you'll see it working.  
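To picture what ends up in TABLE_NAME-hostname and TABLE_NAME-msg, here is a toy in-memory model of a base table with one sorted index table per indexed column. It is not HBase code: the class IndexTableSketch, its methods, and the index key layout (column value, a separator, then the base row key) are all assumptions made for illustration, and the real index key format may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only (hypothetical names): a base table plus one sorted index table per column.
class IndexTableSketch {
    // Assumed separator between the column value and the base row key in an index key.
    static final char SEP = '\u0000';

    // base table: row key -> (column name -> value)
    final Map<String, Map<String, String>> base = new TreeMap<>();
    // index tables: index name -> (index key -> base row key), sorted by index key
    final Map<String, TreeMap<String, String>> indexes = new TreeMap<>();

    IndexTableSketch(String... indexedColumns) {
        for (String col : indexedColumns) {
            indexes.put(col, new TreeMap<>());
        }
    }

    // Writes a row to the base table and mirrors each indexed column into its index table.
    // Stale index entries from an overwritten row are not cleaned up in this sketch.
    void put(String rowKey, Map<String, String> columns) {
        base.put(rowKey, new TreeMap<>(columns));
        for (Map.Entry<String, TreeMap<String, String>> idx : indexes.entrySet()) {
            String value = columns.get(idx.getKey());
            if (value != null) {
                idx.getValue().put(value + SEP + rowKey, rowKey);
            }
        }
    }

    // Range-scans one index table for an exact column value and returns the base row keys.
    List<String> lookup(String indexName, String value) {
        List<String> rows = new ArrayList<>();
        TreeMap<String, String> idx = indexes.get(indexName);
        if (idx == null) {
            return rows;
        }
        String start = value + SEP;
        String end = value + (char) (SEP + 1);
        rows.addAll(idx.subMap(start, true, end, false).values());
        return rows;
    }

    public static void main(String[] args) {
        IndexTableSketch t = new IndexTableSketch("hostname", "msg");
        t.put("row1", Map.of("hostname", "web01", "msg", "disk full"));
        t.put("row2", Map.of("hostname", "web02", "msg", "disk full"));
        t.put("row3", Map.of("hostname", "web01", "msg", "login ok"));
        System.out.println(t.lookup("hostname", "web01")); // prints [row1, row3]
    }
}
```

The point of the sketch is that a lookup by hostname becomes a short range scan over a sorted index table rather than a full scan of the base table.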

Using Hadoop and HDFS with Java

Data is received in parallel and written to a queue. A single thread then reads the queue and writes those messages to an FSDataOutputStream that is kept open, but the messages never seem to get flushed. I tried flush() and sync() with no joy.

1. outputStream.writeBytes(rawMessage.toString());
2. log.debug("Flushing stream, size = " + s.getOutputStream().size());
s.getOutputStream().sync();
log.debug("Flushed stream, size = " + s.getOutputStream().size());

or

log.debug("Flushing stream, size = " + s.getOutputStream().size());
s.getOutputStream().flush();
log.debug("Flushed stream, size = " + s.getOutputStream().size());

In both cases size() stays the same after the call.
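One thing that might explain the unchanged number: in plain java.io, DataOutputStream.size() reports how many bytes have been written to the stream so far, not how many are still sitting in a buffer, so flush() never changes it. Assuming FSDataOutputStream inherits that behaviour (not verified here), size() can't tell you whether a flush worked. A minimal sketch using only java.io, with a ByteArrayOutputStream standing in for the file and FlushSizeDemo as a made-up class name:

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class: compares DataOutputStream.size() with what reached the sink.
class FlushSizeDemo {
    // Returns {size() before flush, sink bytes before flush, size() after flush, sink bytes after flush}.
    static int[] run(String message) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stand-in for the file
            DataOutputStream out = new DataOutputStream(new BufferedOutputStream(sink, 8192));

            out.writeBytes(message);          // lands in the 8 KB buffer, not in the sink
            int sizeBefore = out.size();      // counts bytes written to the stream
            int sinkBefore = sink.size();     // counts bytes that reached the sink

            out.flush();                      // pushes the buffered bytes through to the sink
            int sizeAfter = out.size();
            int sinkAfter = sink.size();

            return new int[] { sizeBefore, sinkBefore, sizeAfter, sinkAfter };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] r = run("hello hdfs");
        System.out.println("size() before/after flush: " + r[0] + " / " + r[2]);
        System.out.println("sink bytes before/after flush: " + r[1] + " / " + r[3]);
    }
}
```

Here size() is identical before and after the flush while the sink goes from empty to holding the whole message, so a check on the destination itself would be a better signal than size().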

This is using hadoop-0.20.0.

2009-05-12 12:42:17,470 DEBUG [Thread-7] (FSStreamManager.java:28) hdfs.HdfsQueueConsumer: Thread 19 getting an output stream
2009-05-12 12:42:17,470 DEBUG [Thread-7] (FSStreamManager.java:49) hdfs.HdfsQueueConsumer: Re-using existing stream
2009-05-12 12:42:17,472 DEBUG [Thread-7] (FSStreamManager.java:63) hdfs.HdfsQueueConsumer: Flushing stream, size = 1986
2009-05-12 12:42:17,472 DEBUG [Thread-7] (DFSClient.java:3013) hdfs.DFSClient: DFSClient flush() : saveOffset 1613 bytesCurBlock 1986 lastFlushOffset 1731
2009-05-12 12:42:17,472 DEBUG [Thread-7] (FSStreamManager.java:66) hdfs.HdfsQueueConsumer: Flushed stream, size = 1986
2009-05-12 12:42:19,586 DEBUG [Thread-7] (HdfsQueueConsumer.java:39) hdfs.HdfsQueueConsumer: Consumer writing event
2009-05-12 12:42:19,587 DEBUG [Thread-7] (FSStreamManager.java:28) hdfs.HdfsQueueConsumer: Thread 19 getting an output stream
2009-05-12 12:42:19,588 DEBUG [Thread-7] (FSStreamManager.java:49) hdfs.HdfsQueueConsumer: Re-using existing stream
2009-05-12 12:42:19,589 DEBUG [Thread-7] (FSStreamManager.java:63) hdfs.HdfsQueueConsumer: Flushing stream, size = 2235
2009-05-12 12:42:19,589 DEBUG [Thread-7] (DFSClient.java:3013) hdfs.DFSClient: DFSClient flush() : saveOffset 2125 bytesCurBlock 2235 lastFlushOffset 1986
2009-05-12 12:42:19,590 DEBUG [Thread-7] (FSStreamManager.java:66) hdfs.HdfsQueueConsumer: Flushed stream, size = 2235

So although the offset is changing as expected, the output stream doesn't appear to be flushed or cleared out, and nothing is being written to the file...
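For reference, the setup described above (many producers, one queue, a single thread writing to a stream that stays open) can be sketched roughly like this. It uses only the JDK: a ByteArrayOutputStream stands in for the FSDataOutputStream, and the class name, the POISON marker, and the message format are all made up for illustration rather than taken from the real consumer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: parallel producers feed a queue, one consumer owns the open stream.
class QueueWriterSketch {
    // Made-up end-of-input marker so the consumer knows when to stop.
    private static final String POISON = "\u0000STOP";

    // Drains the queue on the calling thread, writing and flushing one message at a time.
    static void consume(BlockingQueue<String> queue, OutputStream out) {
        try {
            while (true) {
                String msg = queue.take();
                if (POISON.equals(msg)) {
                    break;
                }
                out.write((msg + "\n").getBytes(StandardCharsets.UTF_8));
                out.flush();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Starts the producers and the single consumer, then returns everything that was written.
    static String runDemo(int producers, int messagesEach) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stand-in for the HDFS stream
        Thread consumer = new Thread(() -> consume(queue, sink));
        consumer.start();

        Thread[] workers = new Thread[producers];
        for (int p = 0; p < producers; p++) {
            final int id = p;
            workers[p] = new Thread(() -> {
                for (int m = 0; m < messagesEach; m++) {
                    queue.add("producer-" + id + " msg-" + m);
                }
            });
            workers[p].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
            queue.add(POISON);   // all producers are done, so tell the consumer to stop
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new String(sink.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String written = runDemo(4, 25);
        System.out.println(written.split("\n").length + " lines written");
    }
}
```

Only the consumer thread ever touches the stream, which is why the stream itself needs no locking in this arrangement.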


I will investigate using HBase now as a container for all of the information. It adds a little more overhead, but it still uses Hadoop/HDFS as the underlying storage engine while handling lots of concurrent writes (inserts, in HBase terms).