Big data, Hadoop, Cassandra: what the hell are they and why should I care?

I have been hearing about big data for a while now and even went to a meetup the other day to see what all the fuss was about. While a lot of the talk didn't mean much to me, one thing caught my attention: the promise of fast, measurable ROI.

The example I homed in on was Amazon and their recommendation feature. Amazon tracks everything you look at and buy on their site. They do some analytics and then recommend other things you might like. The better they get at this, the more you buy from Amazon. If you imagine the number of people that go onto the Amazon website and the number of items they have in their catalogue, that works out to be a lot of data to crunch! Centralised databases find it hard to handle such volume, so you need to distribute the load. That is what Hadoop does.
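
To make the "do some analytics and then recommend" step a bit more concrete, here is a toy sketch in Python. It is not how Amazon actually does it (I have no idea what their real algorithm looks like); it just counts which items turn up in the same basket and suggests the most frequent companions. The baskets and the function names are made up for illustration.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical purchase history: one set of items per customer.
baskets = [
    {"kettle", "teapot", "mug"},
    {"kettle", "mug"},
    {"teapot", "mug"},
    {"kettle", "toaster"},
]


def build_co_purchase_counts(baskets):
    """Count how often each ordered pair of items appears in the same basket."""
    counts = defaultdict(Counter)
    for basket in baskets:
        for item, other in permutations(basket, 2):
            counts[item][other] += 1
    return counts


def recommend(counts, item, top_n=2):
    """Return the items most often bought alongside `item`.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(counts[item].items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]


counts = build_co_purchase_counts(baskets)
print(recommend(counts, "kettle"))  # ['mug', 'teapot']
```

With four baskets this runs instantly on a laptop. The pair counting grows quickly with the number of customers and catalogue items, which is where spreading the work over many machines starts to matter.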

Hadoop is made up of 2 basic components: HDFS (Hadoop Distributed File System), which looks after the distributed storage, and MapReduce, which looks after the jobs that run on the stored data. Each node in the cluster is generally a commodity server or VM, so each has its own CPU and RAM alongside its share of the data. This allows MapReduce to run many small jobs, each close to the data it needs to access.
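
The "many small jobs, each close to the data" idea can be sketched in plain Python. This is only an in-memory imitation of the map, shuffle and reduce steps, not the real Hadoop API; the three lists stand in for data held on three separate nodes, and all the names are invented for the example.

```python
from collections import defaultdict

# Hypothetical page-view logs, already split across three nodes.
partitions = [
    ["book", "kettle", "book"],
    ["kettle", "mug"],
    ["book", "mug", "mug"],
]


def map_phase(records):
    """Runs on each node against its local data: emit (item, 1) per view."""
    return [(item, 1) for item in records]


def shuffle(mapped_outputs):
    """Group every emitted value by key across the output of all nodes."""
    grouped = defaultdict(list)
    for output in mapped_outputs:
        for key, value in output:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values for each key into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


# In a real cluster each map_phase call would run on the node holding that partition.
mapped = [map_phase(partition) for partition in partitions]
totals = reduce_phase(shuffle(mapped))
print(totals)  # {'book': 3, 'kettle': 2, 'mug': 3}
```

The point of the split is that the map step never needs to see the whole data set, so adding more nodes lets you add more data without any single machine doing more work.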

As I understand it Hadoop is more for batch processing, so you might not get your recommendation back for a couple of hours. That is quite a long time; people want that recommendation back in real time. That's where Cassandra comes in. It is distributed like HDFS, but it's a database built for real time. People are starting to use Hadoop's MapReduce with the Cassandra database. Quicker matching means quicker buying!

There are a lot of applications people use for the storage and analysis of big data, and the information about them is all over the place, so I made a list of the most common ones I have heard about:


Hadoop:

A combination of a distributed file system (HDFS) and MapReduce. Hadoop is not a real-time technology. Web giants such as Facebook use in-house Hadoop clusters to crunch epic amounts of data that can later be applied to live web services.


MapReduce:

MapReduce is a framework for processing huge datasets on certain kinds of distributable problems using a large number of computers (nodes), collectively referred to as a cluster (if all nodes use the same hardware) or as a grid (if the nodes use different hardware). Computational processing can occur on data stored either in a file system (unstructured) or within a database (structured).


HDFS:

Hadoop Distributed File System is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.
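
Here is a rough sketch of what "multiple replicas of data blocks" buys you. It splits some data into blocks and puts 3 copies of each on different nodes (3 is the HDFS default replication factor). The round-robin placement is my own simplification; real HDFS picks nodes with a rack-aware policy, and the node names and function names below are made up.

```python
def split_into_blocks(data, block_size):
    """Cut the data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplified placement for illustration only, not the HDFS policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_blocks(placement, failed_nodes):
    """Blocks that still have at least one replica on a healthy node."""
    return [
        block_id
        for block_id, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    ]


nodes = ["node-a", "node-b", "node-c", "node-d"]
blocks = split_into_blocks(b"x" * 10, block_size=4)  # 3 blocks: 4, 4 and 2 bytes
placement = place_replicas(len(blocks), nodes)

# Even with two of the four nodes down, every block is still readable.
print(readable_blocks(placement, failed_nodes={"node-a", "node-b"}))  # [0, 1, 2]
```

The same copies that protect against a dead server also give MapReduce a choice of nodes to run each job on, which helps with the "close to the data" part.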


A distributed database based on a piece of Google's backend. A database for real-time applications.


A Cassandra competitor.


A SQL-like query language designed for use by programming novices.

Apache Pig:

Apache Pig is a platform for analysing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.


Memcached:

A distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.
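
"Alleviating database load" usually means checking the cache first and only going to the database on a miss. A minimal sketch of that pattern, with plain Python dicts standing in for both the cache and the database (the class and its names are hypothetical, not any real client library):

```python
class CachedCatalogue:
    """Look products up in a cache first and fall back to the database."""

    def __init__(self, database):
        self.database = database  # stand-in for a slow database
        self.cache = {}           # stand-in for the in-memory cache
        self.database_reads = 0   # counts how often we hit the database

    def get_product(self, product_id):
        # Cache hit: answer straight from memory.
        if product_id in self.cache:
            return self.cache[product_id]
        # Cache miss: read from the database and remember the answer.
        self.database_reads += 1
        value = self.database[product_id]
        self.cache[product_id] = value
        return value


catalogue = CachedCatalogue({42: "kettle"})
catalogue.get_product(42)
catalogue.get_product(42)
print(catalogue.database_reads)  # 1
```

The second lookup never touches the database, which is the whole trick. A real setup also has to decide when cached entries expire, which this sketch ignores.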


Chukwa:

A Hadoop subproject devoted to large-scale log collection and analysis. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and the MapReduce framework.