Big Data: Spark or Hadoop?

Spark has threatened to overtake certain elements of Hadoop for a while, or at least that’s what it looks like on the forums. It’s a regular conversation starter on Quora and industry blogs. In some cases, it’s phrased as though Spark will replace Hadoop, which likely won’t be the case — at least not for a while.

They’re part of the same ecosystem. As CBT Nuggets trainer Garth Schulte points out in a recent blog post, Hadoop is a rich ecosystem of projects, and, importantly, Spark is one of them. For this reason, it might be more accurate to say that “Spark might replace Hadoop MapReduce” rather than “Spark will replace Hadoop.” Here’s why.

Both take you to the next level of Big Data. Hadoop is a distributed data infrastructure, meaning it’s an ecosystem rather than a database. In Garth’s Hadoop training course, he calls this ecosystem an intelligent complement for anyone dealing with the three big challenges of Big Data: volume, velocity, and variety. Hadoop tools work in unison, distributing huge datasets across clusters, then collecting and analyzing them. By comparison, Spark is a tool used solely for data processing. Even when it runs as a standalone service, it has no storage layer of its own.

Spark might replace Hadoop MapReduce. Hadoop has its own processing component, MapReduce. Spark’s primary claim to fame is that it can process data up to 100 times faster than MapReduce, because it processes data differently. The MapReduce workflow operates in steps, and it has to write each operation’s results back to the cluster’s disks before it performs the next operation, which slows processing down. Spark, by contrast, can keep intermediate results in memory, run the whole chain of analytic operations, write the final results to the cluster, and be finished.
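To make the difference in workflow concrete, here is a minimal sketch in plain Python. It is a toy model only: it uses neither Hadoop nor Spark, and the names `staged_pipeline`, `in_memory_pipeline`, and `run_demo` are made up for illustration. A temporary directory stands in for cluster storage. The first function persists every step's output before moving on, the second chains the steps in memory and persists only the final result.

```python
import json
import os
import tempfile


def staged_pipeline(records, steps, workdir):
    """MapReduce-style toy: write each step's output to disk, then reload it."""
    writes = 0
    data = list(records)
    for i, step in enumerate(steps):
        data = [step(x) for x in data]
        path = os.path.join(workdir, f"step_{i}.json")
        with open(path, "w") as f:
            json.dump(data, f)  # intermediate result hits storage
        writes += 1
        with open(path) as f:
            data = json.load(f)  # next step starts by reading it back
    return data, writes


def in_memory_pipeline(records, steps, workdir):
    """Spark-style toy: keep intermediates in memory, write only the final result."""
    data = list(records)
    for step in steps:
        data = [step(x) for x in data]  # nothing touches storage here
    path = os.path.join(workdir, "final.json")
    with open(path, "w") as f:
        json.dump(data, f)  # single write at the end
    return data, 1


def run_demo():
    # Three hypothetical operations applied to every record.
    steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
    records = range(5)
    with tempfile.TemporaryDirectory() as workdir:
        staged, staged_writes = staged_pipeline(records, steps, workdir)
        chained, chained_writes = in_memory_pipeline(records, steps, workdir)
    return staged, staged_writes, chained, chained_writes


if __name__ == "__main__":
    print(run_demo())
```

Both functions produce the same answer. What differs is how often storage is touched: once per step in the staged version, once in total in the in-memory version. On a real cluster those extra round trips to disk are where the time goes, though actual speedups depend heavily on the workload.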

You still might not need Spark’s power. For businesses whose data operations are largely static and can be handled in batch mode without immediate results, MapReduce’s processing capabilities are more than adequate. However, for real-time processing and for running multiple operations over the same data, Spark is the preferred resource because it accesses data in memory. But you’ll need a lot of memory.

Here’s the point: Spark is just one great tool within a whole ecosystem. Maybe one day it’ll stand alone, without Hadoop as its foundation. In the meantime, this discussion serves as a sign that businesses are discovering new and innovative methods for using stored data.

Apache Spark and Hadoop are not mutually exclusive. Instead, each serves as an excellent example of the Open Source principle: two products can exist nested within or alongside one another, letting users extract the maximum value from their data.
