Great thing about this is that I can easily use JDBC in my front end code and leverage the benefits of Hadoop and everything else on the backend without major code changes. At least, this is my initial thought.
hadoop thoughts
After two months of playing with hadoop-core, hbase and the rest of the hadoop related projects, I have sat, pondered and wondered.
First, what is hadoop?
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
Ok, if you want to know more about hadoop and hbase, etc., go read http://hadoop.apache.org/ and also http://www.cloudera.com/
My initial thoughts about hadoop were in no particular order:
awesome. cool. sweet. huh? hmmmm. what?
awesome:
HDFS. Hadoop file system. Finally, I had at my fingertips a network storage system that didn't cost a lot and was fairly easy to set up. Started playing around and toying with Lucene and other aspects and that led me to:
cool:
yes, it is cool. if you are a geek, it's great. i see many applications for this ... without getting into the map/reduce that is one of the fundamental benefits of hadoop
sweet:
web interfaces out of the box for all of the different services. hadoop, hbase. sweet! as much as i love to get down and dirty with a cli (command line interface) ... loading up multiple webpages in chrome was and still is appealing.
huh?
SPOF (single point of failure) ... by design, or not, the namenode in hadoop is a single point of failure. from the website:
The NameNode machine is a single point of failure for an HDFS cluster. If the NameNode machine fails, manual intervention is necessary. Currently, automatic restart and failover of the NameNode software to another machine is not supported.
hmmmmm:
Ok, so in the future this would be removed. I have to keep reminding myself that it's not that old and there have to be some let downs along the way...so long as when it finally gets to somewhere ... those let downs are brought back online. forward thinking and long term planning.
what?
POSIX (Portable Operating System Interface for Unix). my issue here is that hdfs is not a POSIX file system. this means that it doesn't behave like a normal fs, yet. i had hoped, and in a way had forced myself to believe, that updates to existing files would be supported and supported well. after a lot of digging (a lot of digging), this wasn't the case. sure, you can add a few configuration lines and it will update files ... but you get into lots of problems.
Conclusion?
In 6-12 months I think that this technology will be amazing...but as it currently stands, I think it may be as great as Windows 7 ... If you lower your expectations and expect some blue screens, I think you'll be fine. If you want it to be perfect out of the gate with wonderful documentation and features that are needed en masse ... give it some time.
I am.
Hardware profile
Right now, for POC purposes, Hadoop 0.19.1 and HBase 0.19.2 are running in a single node configuration with the following hardware:
CPU:
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 107
model name : AMD Athlon(tm) 64 X2 Dual Core Processor 4000+
stepping : 1
cpu MHz : 1000.000
cache size : 512 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow rep_good pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy 3dnowprefetch
bogomips : 2010.51
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc 100mhzsteps
processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 107
model name : AMD Athlon(tm) 64 X2 Dual Core Processor 4000+
stepping : 1
cpu MHz : 1000.000
cache size : 512 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow rep_good pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy 3dnowprefetch
bogomips : 2010.51
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc 100mhzsteps
Memory:
Looks like there's only 1gb of memory available on the machine. Seem to have lost some over the years...!
134G has been set aside for the POC for the HDFS
Creating HBase Indexes
Another interesting session trying to learn how to create HBase indexes. One thing i've picked up so far (and there is a good possibility i'm wrong) is that you cannot add indexes to a table after it's been created. but then, maybe you can with alter. for another day.
property: hbase.regionserver.class
value: org.apache.hadoop.hbase.ipc.IndexedRegionInterface
description: enable indexing
property: hbase.regionserver.impl
value: org.apache.hadoop.hbase.regionserver.tableindexed.IndexedRegionServer
description: enable indexing
Create a new table + index:
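// imports as the packages are laid out in HBase 0.19.x
import java.io.IOException;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.tableindexed.IndexSpecification;
import org.apache.hadoop.hbase.client.tableindexed.IndexedTable;
import org.apache.hadoop.hbase.client.tableindexed.IndexedTableAdmin;
import org.apache.hadoop.hbase.util.Bytes;

// getConfig() is not shown here; it is assumed to return the HBase configuration
public static void createIndex(String TABLE_NAME) throws IOException {
    String familyName = "entry:";
    byte[] FAMILY = Bytes.toBytes(familyName);
    IndexedTableAdmin admin;
    IndexedTable table;
    HTableDescriptor desc = new HTableDescriptor(TABLE_NAME);
    desc.addFamily(new HColumnDescriptor(FAMILY));
    String[] columns = { "hostname", "msg" };
    // one index per column, with the column name used as the index id
    for (int i = 0; i < columns.length; i++) {
        byte[] COL_NAME = Bytes.toBytes(familyName + columns[i]);
        String INDEX_COL_NAME = columns[i];
        IndexSpecification colIndex = new IndexSpecification(INDEX_COL_NAME, COL_NAME);
        desc.addIndex(colIndex);
    }
    admin = new IndexedTableAdmin(getConfig());
    // creates the new table along with its index tables
    admin.createTable(desc);
    table = new IndexedTable(getConfig(), desc.getName());
}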
So, once this is run, the following happens:
Table: TABLE_NAME is created with indexes on "entry:hostname" and "entry:msg"
Table: TABLE_NAME-hostname is created
Table: TABLE_NAME-msg is created
Great, so now there is one table and two index tables.
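The naming seen above (base table name, a hyphen, then the index id) can be sketched in plain Java. This is only an illustration of the pattern observed in this run, not HBase's own code; the class and method names (IndexTableNames, indexTableName, allTables) and the "syslog" table name are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexTableNames {
    // Hypothetical helper: mirrors the "<table>-<indexId>" names observed above.
    static String indexTableName(String baseTable, String indexId) {
        return baseTable + "-" + indexId;
    }

    // The base table followed by one index table per index id.
    static List<String> allTables(String baseTable, String[] indexIds) {
        List<String> names = new ArrayList<String>();
        names.add(baseTable);
        for (String id : indexIds) {
            names.add(indexTableName(baseTable, id));
        }
        return names;
    }

    public static void main(String[] args) {
        // "syslog" is an assumed table name, used only for illustration.
        for (String name : allTables("syslog", new String[] { "hostname", "msg" })) {
            System.out.println(name);
        }
    }
}
```

Handy mostly as a reminder of which tables to look for when scanning after a load.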
We can now push some data into it. Before doing this however, we need to make a configuration change to HBase.
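In $HBASE_HOME/conf/hbase-site.xml add the two properties listed earlier (hbase.regionserver.class and hbase.regionserver.impl, with the values given there).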
This will start the indexing service (if that's the right terminology). Restart HBase and push some data into the table. Once you've done this and you scan either of the index tables you'll see it working.
using hadoop and hdfs with java
Data is received in parallel and is written to a queue, then a single thread reads the queue and writes those messages to an FSDataOutputStream which is kept open, but the messages never seem to get flushed. Tried flush() and sync() with no joy.
// 1. write the message
outputStream.writeBytes(rawMessage.toString());
// 2. then either sync() ...
log.debug("Flushing stream, size = " + s.getOutputStream().size());
s.getOutputStream().sync();
log.debug("Flushed stream, size = " + s.getOutputStream().size());
// ... or flush()
log.debug("Flushing stream, size = " + s.getOutputStream().size());
s.getOutputStream().flush();
log.debug("Flushed stream, size = " + s.getOutputStream().size());
Just see the size() remain the same after performing this action.
This is using hadoop-0.20.0.
2009-05-12 12:42:17,470 DEBUG [Thread-7] (FSStreamManager.java:28) hdfs.HdfsQueueConsumer: Thread 19 getting an output stream
2009-05-12 12:42:17,470 DEBUG [Thread-7] (FSStreamManager.java:49) hdfs.HdfsQueueConsumer: Re-using existing stream
2009-05-12 12:42:17,472 DEBUG [Thread-7] (FSStreamManager.java:63) hdfs.HdfsQueueConsumer: Flushing stream, size = 1986
2009-05-12 12:42:17,472 DEBUG [Thread-7] (DFSClient.java:3013) hdfs.DFSClient: DFSClient flush() : saveOffset 1613 bytesCurBlock 1986 lastFlushOffset 1731
2009-05-12 12:42:17,472 DEBUG [Thread-7] (FSStreamManager.java:66) hdfs.HdfsQueueConsumer: Flushed stream, size = 1986
2009-05-12 12:42:19,586 DEBUG [Thread-7] (HdfsQueueConsumer.java:39) hdfs.HdfsQueueConsumer: Consumer writing event
2009-05-12 12:42:19,587 DEBUG [Thread-7] (FSStreamManager.java:28) hdfs.HdfsQueueConsumer: Thread 19 getting an output stream
2009-05-12 12:42:19,588 DEBUG [Thread-7] (FSStreamManager.java:49) hdfs.HdfsQueueConsumer: Re-using existing stream
2009-05-12 12:42:19,589 DEBUG [Thread-7] (FSStreamManager.java:63) hdfs.HdfsQueueConsumer: Flushing stream, size = 2235
2009-05-12 12:42:19,589 DEBUG [Thread-7] (DFSClient.java:3013) hdfs.DFSClient: DFSClient flush() : saveOffset 2125 bytesCurBlock 2235 lastFlushOffset 1986
2009-05-12 12:42:19,590 DEBUG [Thread-7] (FSStreamManager.java:66) hdfs.HdfsQueueConsumer: Flushed stream, size = 2235
So although the offsets in the DFSClient log lines are changing as expected, the output stream doesn't appear to be flushed or cleared out, and nothing is being written to the file...
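One thing worth checking here: FSDataOutputStream builds on java.io.DataOutputStream, whose size() is a running count of bytes written so far rather than the amount waiting in a buffer. So an unchanged size() after flush() or sync() may simply be expected, and on its own it may not say anything about whether the data reached HDFS. Below is a minimal sketch of the queue-plus-single-writer setup, with a plain DataOutputStream over a ByteArrayOutputStream standing in for the HDFS stream. The class name, the STOP marker and the message format are assumptions for illustration, and the consumer runs on the calling thread once the producers have finished, just to keep it short.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueWriterSketch {
    private static final String STOP = "__STOP__"; // hypothetical end-of-input marker

    // Single consumer: drains the queue and writes every message to one open stream.
    // Returns { size() before flush(), size() after flush() }.
    static int[] drain(BlockingQueue<String> queue, DataOutputStream out)
            throws IOException, InterruptedException {
        String msg;
        while (!(msg = queue.take()).equals(STOP)) {
            out.writeBytes(msg);
        }
        int before = out.size();
        out.flush();
        int after = out.size();
        return new int[] { before, after };
    }

    // Producers fill the queue in parallel; the caller's thread then acts as the writer.
    // Returns { size() before flush, size() after flush, bytes held by the sink }.
    static int[] run(int producers, final int perProducer) throws Exception {
        final BlockingQueue<String> queue = new LinkedBlockingQueue<String>();
        Thread[] threads = new Thread[producers];
        for (int p = 0; p < producers; p++) {
            final int id = p;
            threads[p] = new Thread(new Runnable() {
                public void run() {
                    for (int i = 0; i < perProducer; i++) {
                        queue.add("host" + id + " msg\n");
                    }
                }
            });
            threads[p].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        queue.add(STOP);

        // In-memory stand-in for the stream that would be opened on HDFS.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(sink);
        int[] sizes = drain(queue, out);
        return new int[] { sizes[0], sizes[1], sink.size() };
    }

    public static void main(String[] args) throws Exception {
        int[] r = run(3, 4);
        System.out.println("size() before flush = " + r[0]);
        System.out.println("size() after flush  = " + r[1]);
        System.out.println("bytes in the sink   = " + r[2]);
    }
}
```

In this sketch size() is the same before and after flush() even though every byte is already in the sink, so a better test against HDFS is probably to read the file back from a second client instead of watching size().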
Will investigate using hbase now as a container for all of the information. It adds a little more overhead but still allows hadoop/hdfs to be used as the underlying storage engine while satisfying lots of concurrent writes (inserts in the context of hbase).