Much of the murmur about Hadoop not being the default choice for big data's future stems from its inherent limitations in real-time applications and the evident reluctance of potential customers to move large volumes of data from existing data stores into Hadoop. To be honest, the latter is no small concern. Many surveys reflect how healthcare, financial services and insurance companies are still struggling to figure out how to use an open source distributed processing framework in their data environments. Top this with the fact that companies without sufficiently large data sets do not see Hadoop, which is perceived to be best suited for processing large data sets, as an ideal tool.
Among the alternatives, those making news recently include Apache Spark, Facebook Presto, BashReduce, Disco Project, GraphLab, Impala, Hydra and so on. Not all of these were developed with the same intent. Apache Spark is a MapReduce player, GraphLab was designed to improve parallel machine learning algorithms, and Hydra is a distributed task processing system that supports streaming and batch operations. Undoubtedly, it is HDFS that makes Hadoop such a reliable tool. Even most of Apache Spark's implementations are tied to Hadoop, either by using HDFS as a data store or by running within a Hadoop cluster. Having said that, Spark is not totally reliant on Hadoop. To begin with, as an independent analytics tool it can use a number of different data stores as sources and repositories and not depend on Hadoop at all. Also, it is not bound to HDFS and is consequently free from the burden of HDFS's shortcomings. Another tool, Presto, is built on a distributed SQL query engine optimized for ad-hoc analysis. What differentiates it from its competitors is that Presto can concurrently use a number of data stores as sources via "connectors", without even needing to move data into HDFS prior to querying. Cloudera's Impala, which omits the MapReduce layer entirely, is architected much like the shared-nothing parallel SQL DBMSs serving the data warehouse market.
In addition, Google has more or less abandoned MapReduce, focusing instead on newer systems such as Dremel, Bigtable, and F1/Spanner. In fact, Spanner maintains transactional consistency even as its data spans geographies, and that is something that could be imported into the Hadoop ecosystem as well. Even Microsoft could lend some inspiration to Hadoop from its Bing infrastructure. Bing runs on a combination of tools called Cosmos, Tiger and Scope, and Microsoft is looking beyond Hadoop's original function of mere search, aiming to build an information fabric that changes how data is indexed, searched for and presented.
However, has Hadoop stood still amid the arrival of these alternatives? No. In fact, many argue that Hadoop is what you define it to be. And the new definition of Hadoop, which refers to the entire ecosystem, argues for its longevity. This might sound a little illogical to many. But the fact remains that Hadoop has been part of an industry lacking standardization, or even a collective vision of what it can aspire to be. With their own interests in mind, vendors have been responding dynamically to market activity and have now joined the bandwagon of redefining Hadoop so that it syncs with their own technological advancements. Colloquially, "Hadoop" is now used to mean the entire stack, with HDFS at the bottom and, at the top, the very tools that were once perceived as alternatives to it.
I must conclude with the interesting observation that Hadoop vendors are now gradually moving onto a competitive playing field with the data warehouse vendors by implementing the same architecture as the latter. This will be a trend to watch over the next couple of years.