HadoopDB

Ok, so now we're talking. Thanks to slashdot I found this article


Great thing about this is that I can easily use JDBC in my front end code and leverage the benefits of Hadoop and everything else on the backend without major code changes. At least, this is my initial thought.

hadoop thoughts

After two months of playing with hadoop-core, hbase and the rest of the hadoop related projects, I have sat, pondered and wondered.

First, what is hadoop?

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.

Ok, if you want to know more about hadoop and hbase, etc., go read http://hadoop.apache.org/ and also http://www.cloudera.com/

My initial thoughts about hadoop were in no particular order:

awesome. cool. sweet. huh? hmmmm. what?

awesome:
HDFS. Hadoop file system. Finally, I had at my fingers the ability to have a network storage system that didn't cost a lot and was fairly easy to set up. Started playing around and toying with lucerne and other aspects and that led me to:

cool:
yes, it is cool. if you are a geek, it's great. i see many applications for this ... without getting into the map/reduce that is one of the fundamental benefits of hadoop

sweet:
web interfaces out of the box for all of the different services. hadoop, hbase. sweet! as much as i love to get down and dirty with a cli (command line interface) ... loading up multiple webpages in chrome was and still is appealing.

huh?
SPOF (single point of failure) ... by design, or not, the namenode in hadoop is a single point of failure. from the website:

The NameNode machine is a single point of failure for an HDFS cluster. If the NameNode machine fails, manual intervention is necessary. Currently, automatic restart and failover of the NameNode software to another machine is not supported.


hmmmmm:
Ok, so in the future this would be removed. I have to keep reminding myself that it's not that old and there have to be some let downs along the way...so long as when it finally gets to somewhere ... those let downs are brought back online. forward thinking and long term planning.

what?
POSIX (Portable Operating System Interface for Unix) . my issue here is that hdfs is not a POSIX file system. this means that it doesn't behave as a normal fs, yet. i had hoped and in a way had forced myself to believe that updates would be supported and supported well. after a lot of digging, this wasn't the case. a lot of digging. sure, you can add a few configuration lines and it will update files ... but you get into lots of problems.

Conclusion?
In 6-12 months I think that this technology will be amazing...but as it currently stands, I think it may be as great as Windows 7 ... If you lower your expectations and expect some blue screens, I think you'll be fine. If you want it to be perfect out of the gate with wonderful documentation and features that are needed en masse ... give it some time.

I am.


    Hardware profile

    Right now, for POC purposes, Hadoop 0.19.1 and HBase 0.19.2 are running in a single node configuration with the following hardware:

    CPU: 

    processor       : 0
    vendor_id       : AuthenticAMD
    cpu family      : 15
    model           : 107
    model name      : AMD Athlon(tm) 64 X2 Dual Core Processor 4000+
    stepping        : 1
    cpu MHz         : 1000.000
    cache size      : 512 KB
    physical id     : 0
    siblings        : 2
    core id         : 0
    cpu cores       : 2
    fpu             : yes
    fpu_exception   : yes
    cpuid level     : 1
    wp              : yes
    flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3d
    nowext 3dnow rep_good pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy 3dnowprefetch
    bogomips        : 2010.51
    TLB size        : 1024 4K pages
    clflush size    : 64
    cache_alignment : 64
    address sizes   : 40 bits physical, 48 bits virtual
    power management: ts fid vid ttp tm stc 100mhzsteps

    processor       : 1
    vendor_id       : AuthenticAMD
    cpu family      : 15
    model           : 107
    model name      : AMD Athlon(tm) 64 X2 Dual Core Processor 4000+
    stepping        : 1
    cpu MHz         : 1000.000
    cache size      : 512 KB
    physical id     : 0
    siblings        : 2
    core id         : 1
    cpu cores       : 2
    fpu             : yes
    fpu_exception   : yes
    cpuid level     : 1
    wp              : yes
    flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3d
    nowext 3dnow rep_good pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy 3dnowprefetch
    bogomips        : 2010.51
    TLB size        : 1024 4K pages
    clflush size    : 64
    cache_alignment : 64
    address sizes   : 40 bits physical, 48 bits virtual
    power management: ts fid vid ttp tm stc 100mhzsteps

    Memory:
    Looks like there's only 1gb of memory available on the machine.  Seem to have lost some over the years...!

    134G has been set aside for the POC for the HDFS

    Creating HBase Indexes

    Another interesting session trying to learn how to create HBase indexes. A few things i've picked up on so far (and there is a good possibility i'm wrong) is that you can not convert a table after it's been created to have indexes.  but then, maybe you can with alter.  for another day.

    Create a new table + index:

    public static void createIndex(String TABLE_NAME) throws IOException {
       String familyName = "entry:";
       byte[] FAMILY = Bytes.toBytes(familyName);

       IndexedTableAdmin admin;
       IndexedTable table;
       HTableDescriptor desc = new HTableDescriptor(TABLE_NAME);
       desc.addFamily(new HColumnDescriptor(FAMILY));
       String[] columns = { "hostname", "msg" };
       for (int i = 0; i <>
    byte[] COL_NAME = Bytes.toBytes(familyName + columns[i].toString());
    String INDEX_COL_NAME = columns[i].toString();
    IndexSpecification colIndex = new IndexSpecification(INDEX_COL_NAME, COL_NAME);
    desc.addIndex(colIndex);
       }

       admin = new IndexedTableAdmin(getConfig());
       // creates new table
       admin.createTable(desc);
       table = new IndexedTable(getConfig(), desc.getName());
    }

    So, once this is run, the following happens:

    Table:  TABLE_NAME is created with indexes on "entry:hostname" and "entry:msg"
    Table:  TABLE_NAME-hostname is created
    Table:  TABLE_NAME-msg is created

    Great, so now there is one table and two index tables. 

    We can now push some data into it.  Before doing this however, we need to make a configuration change to HBase.

    In $HBASE_HOME/conf/hbase-site.xml add the following:

      
            hbase.regionserver.class
            org.apache.hadoop.hbase.ipc.IndexedRegionInterface
            enable indexing
      

      
            hbase.regionserver.impl
            org.apache.hadoop.hbase.regionserver.tableindexed.IndexedRegionServer
            enable indexing
      

    This will start the indexing service (if that's the right terminology).  Restart HBase and push some data into the table.  Once you've done this and you scan either of the index tables you'll see it working.  

    using hadoop and hdfs with java

    Data is received in parallel and is written to a queue, then a single thread reads the queue and writes those messages to a FSDataOutputStream which is kept open, but the messages never get flushed. Tried flush() and sync() with no joy.

    1. outputStream.writeBytes(rawMessage.toString());
    2. log.debug("Flushing stream, size = " + s.getOutputStream().size());
    s.getOutputStream().sync();
    log.debug("Flushed stream, size = " + s.getOutputStream().size());

    or

    log.debug("Flushing stream, size = " + s.getOutputStream().size());
    s.getOutputStream().flush();
    log.debug("Flushed stream, size = " + s.getOutputStream().size());

    Just see the size() remain the same after performing this action.

    This is using hadoop-0.20.0.

    2009-05-12 12:42:17,470 DEBUG [Thread-7] (FSStreamManager.java:28) hdfs.HdfsQueueConsumer: Thread 19 getting an output stream
    2009-05-12 12:42:17,470 DEBUG [Thread-7] (FSStreamManager.java:49) hdfs.HdfsQueueConsumer: Re-using existing stream
    2009-05-12 12:42:17,472 DEBUG [Thread-7] (FSStreamManager.java:63) hdfs.HdfsQueueConsumer: Flushing stream, size = 1986
    2009-05-12 12:42:17,472 DEBUG [Thread-7] (DFSClient.java:3013) hdfs.DFSClient: DFSClient flush() : saveOffset 1613 bytesCurBlock 1986 lastFlushOffset 1731
    2009-05-12 12:42:17,472 DEBUG [Thread-7] (FSStreamManager.java:66) hdfs.HdfsQueueConsumer: Flushed stream, size = 1986
    2009-05-12 12:42:19,586 DEBUG [Thread-7] (HdfsQueueConsumer.java:39) hdfs.HdfsQueueConsumer: Consumer writing event
    2009-05-12 12:42:19,587 DEBUG [Thread-7] (FSStreamManager.java:28) hdfs.HdfsQueueConsumer: Thread 19 getting an output stream
    2009-05-12 12:42:19,588 DEBUG [Thread-7] (FSStreamManager.java:49) hdfs.HdfsQueueConsumer: Re-using existing stream
    2009-05-12 12:42:19,589 DEBUG [Thread-7] (FSStreamManager.java:63) hdfs.HdfsQueueConsumer: Flushing stream, size = 2235
    2009-05-12 12:42:19,589 DEBUG [Thread-7] (DFSClient.java:3013) hdfs.DFSClient: DFSClient flush() : saveOffset 2125 bytesCurBlock 2235 lastFlushOffset 1986
    2009-05-12 12:42:19,590 DEBUG [Thread-7] (FSStreamManager.java:66) hdfs.HdfsQueueConsumer: Flushed stream, size = 2235

    So although the Offset is changing as expected, the output stream isn't being flushed or cleared out and isn't being written to file...


    Will investigate using hbase now as a container for all of the information.  It adds a little more overhead but allows the ability to still use hadoop/hdfs as the underlying storage engine while satisfying lots of concurrent writes (inserts in the context of hbase)