The world of Big Data is still a very hot topic as the volume of data we generate continues to grow at a staggering rate. It comes from smart devices, social media, sensor-enabled things (IoT), wearables, and mountains of structured and unstructured data. Collecting all of that data, storing it, processing it, and gaining insights from it has created a vast field of Big Data technologies.
Apache Hadoop is the flagship technology for open source Big Data, and it’s 10 years old this year! Hadoop itself is the software framework for storing and processing large datasets across a cluster of commodity machines, using the power of distributed computing to handle all of that data.
Many technologies have popped up around Hadoop during that time, forming a rich ecosystem of tools to manage and analyze Big Data. The list is enormous and growing rapidly. Many of these technologies make it easier to work with Hadoop and its core layers: its storage layer, HDFS; its parallel processing layer, MapReduce; and its scheduling layer, YARN. Others can work independently of Hadoop. Some of the most popular supporting Hadoop technologies include:
- HBase: NoSQL datastore on top of Hadoop providing real-time access to specific data.
- Hive: SQL-like queries on top of Hadoop for summarization and analysis, translated into MapReduce jobs.
- Pig: Dataflow engine on top of Hadoop for reading, transforming, and writing data, with scripts translated into MapReduce jobs.
- Spark: Fast in-memory processing that can run on top of Hadoop, advertised as up to 100x faster than MapReduce for some in-memory workloads.
- Mahout: A machine learning and data mining library on top of Hadoop.
- Flume: A streaming data and log collector for ingesting data into Hadoop.
- Sqoop: Bulk data transfer between Hadoop and relational databases.
- ZooKeeper: Centralized service for coordination and configuration of distributed Hadoop services.
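Since Hive and Pig both translate their work into MapReduce jobs, it helps to see what that model looks like. Here is a minimal single-process sketch in plain Python of the map, shuffle, and reduce steps, using the classic word count example. It does not use Hadoop's actual API, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical, chosen only for illustration. On a real cluster, the framework would run the map and reduce steps in parallel across many machines and handle the shuffle itself.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, standing in for the framework's shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data is big", "data is everywhere"]
    print(word_count(sample))
```

The key idea is that each map call looks at only one line and each reduce call looks at only one key, which is what lets a cluster split the work across many machines.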
Cloud providers have also taken notice of the Big Data hotness and are providing cloud-based products and services of their own, ranging from fully hosted and managed Hadoop stacks to suites of custom-built tools spanning the Big Data ecosystem.
Google Cloud Platform offers Big Data products such as Dataproc, Dataflow, Datalab, BigQuery, and Pub/Sub. Amazon Web Services offers Elastic MapReduce (EMR), Data Pipeline, Redshift, Kinesis, and QuickSight. Microsoft Azure offers HDInsight, Data Lake, Stream Analytics, Machine Learning, and SQL Data Warehouse.
It’s an exciting time to be a part of this growing field. The future looks bright for all involved: our data will continue to get bigger and bigger, these technologies will continue to evolve, and the demand for Big Data professionals will continue to grow. Stay tuned for more Big Data goodness here at CBT Nuggets and happy learning!