I am against filling the internet with duplicate copies of other people's text (frequently done so you can spam adverts at readers). I am also against really ploddy basic notes containing information that can easily be googled, as this doesn't add anything.
This is a book specifically about Hadoop data-storage.
I wrote up the lecture for later reference (for CPD). The lecturer was Edgar Meij; I think the sample data he was discussing is on the linked website.
UPDATE: someone on LinkedIn sent me a very long-winded document. This is not the way I learn, but if it helps you, go for it. By comparison, an architecture diagram and a link to the API docs would be my preferred solution. Map-reduce is a published algorithm/pattern.
What is Hadoop?
“It's an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware. Hadoop is an Apache top-level project being built and used by many people 1. It is licensed under the Apache License 2.0” 2. Some of the code implements a column-oriented database. The project is written in Java, and is massively parallel. In the config of the demo setup, each worker was given 180s to live (less than Apache workers by a large margin, for reference). When the lecturer's notes were compiled, 2.3 was the current version.
Where and why does one use Hadoop?
The lecture was mostly focussed on map reduce operations across a data set. As Hadoop includes HDFS, it is quite useful for large-volume storage. They assert it is faster to move the computation to the data's location than vice versa.
The Hadoop project has many different sub-projects. The lecturer mentioned that Twitter, Y.A.O, Facebook, Google and Baidu are using it. Cloudera 3 is a recommended data source. HBase was mentioned as a key-value store (KVS) that works with Hadoop. For further reading, look at 4 for more tools that are part of the Hadoop infrastructure.
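For readers who have not met the pattern, here is a minimal in-process sketch of map reduce in plain Python. It is not Hadoop code and uses no Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample lines are my own, purely for illustration. On a real cluster the map and reduce steps would run in parallel on the nodes holding the data, and the framework would do the grouping in between.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, the job a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all the values seen for one key into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, not the lecturer's data.
    sample = ["hadoop stores data", "hadoop processes data"]
    print(word_count(sample))
```

The point of the split is that `map_phase` only ever sees one line and `reduce_phase` only ever sees one key, so neither needs the whole data set in one place.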
There was discussion on verifying the answers generated by the map reduce. Obviously, manual testing across a petabyte of data is naive. The recommended approach is to take a smaller data set as a unit test.
If you have a Java-based platform, Hadoop seems a reasonable tool.
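As a rough sketch of the small-sample idea: run the same mapper and reducer in-process over a handful of records whose answer you can check by hand. Everything here is an assumption for illustration, not something shown in the lecture: the `run_job` helper, the `sensor,reading` record format and the sample values are all made up, and this uses Python's standard `unittest` rather than any Hadoop test tooling.

```python
import unittest
from itertools import groupby
from operator import itemgetter


def run_job(records, mapper, reducer):
    """Run a map-reduce job in-process: map, sort by key, reduce per key."""
    pairs = sorted(
        (pair for record in records for pair in mapper(record)),
        key=itemgetter(0),
    )
    return {
        key: reducer(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    }


def max_mapper(record):
    """Hypothetical mapper: parse 'sensor,reading' into (sensor, reading)."""
    sensor, reading = record.split(",")
    yield sensor, int(reading)


def max_reducer(key, values):
    """Hypothetical reducer: keep the largest reading for each sensor."""
    return max(values)


class SmallSampleTest(unittest.TestCase):
    def test_known_answer(self):
        # Four made-up records, small enough to verify by eye.
        sample = ["a,3", "b,7", "a,9", "b,1"]
        result = run_job(sample, max_mapper, max_reducer)
        self.assertEqual(result, {"a": 9, "b": 7})


if __name__ == "__main__":
    unittest.main()
```

If the small sample passes, you at least know the map and reduce logic is right before paying for a run over the full data set.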
How is Hadoop structured?
There seems little point in duplicating the official sources; please read the Hadoop homepage. There are a lot of management tools for the parallel or distributed processing.
As a comparison to HFS, Hadoop relaxes some of the POSIX file I/O rules in order to provide better streaming support. This means it is faster to code for larger datasets. In terms of architecture, please read more about map reduce.