
This Apache Hadoop video training with Garth Schulte covers how to install, configure, and manage Hadoop clusters, as well as working with projects in Hadoop such as Pig and HBase.

Related area of expertise:
  • Big Data

Recommended skills:
  • Familiarity with Ubuntu Linux

Recommended equipment:
  • Ubuntu Linux 12.04 LTS operating system

Related certifications:
  • None

Related job functions:
  • Big Data architects
  • Big Data administrators
  • Big Data developers
  • IT professionals

This course will get you up to speed on Big Data and Hadoop. Topics include how to install, configure, and manage single- and multi-node Hadoop clusters, configure and manage HDFS, write MapReduce jobs, and work with many of the projects around Hadoop such as Pig, Hive, HBase, Sqoop, and ZooKeeper. Topics also include configuring Hadoop in the cloud and troubleshooting a multi-node Hadoop cluster.
1. Hadoop Course Introduction (24 min)
2. Hadoop Technology Stack (19 min)
3. Hadoop Distributed File System (HDFS) (23 min)
4. Introduction to MapReduce (25 min)
5. Installing Apache Hadoop (Single Node) (29 min)
6. Installing Apache Hadoop (Multi Node) (23 min)
7. Troubleshooting, Administering and Optimizing Hadoop (38 min)
8. Managing HDFS (25 min)
9. MapReduce Development (36 min)
10. Introduction to Pig (31 min)
11. Developing with Pig (36 min)
12. Introduction to Hive (25 min)
13. Developing with Hive (27 min)
14. Introduction to HBase (27 min)
15. Developing with HBase (24 min)
16. Introduction to Zookeeper (24 min)
17. Introduction to Sqoop (24 min)
18. Local Hadoop: Cloudera CDH VM (17 min)
19. Cloud Hadoop: Amazon EMR (21 min)
20. Cloud Hadoop: Microsoft HDInsight (18 min)

Hadoop Course Introduction

00:00:00

Hadoop Series Introduction. Hey, everyone. Garth Schulte from CBT Nuggets. It's an honor to be your guide through this series and the wonderful world of big data. There's a buzz term for you, big data. Kind of like the cloud, two of the most popular buzz terms around today, and for good reason.

00:00:15

There's a big market and a big future for big data. Companies are starting to realize there's an untapped treasure trove of information sitting in unstructured documents on hard drives everywhere. Hadoop was built to handle terabytes and petabytes of data, and that's what a lot of the big companies jumped on it for.

00:00:34

That's the essence of big data. Even small to medium-size companies are realizing they have untapped information in unstructured documents spread across the network. We have emails. Imagine if we could data mine our emails, the kind of information we could find in them.

00:00:49

Documents, PDFs, spreadsheets, text files, all of this unstructured data sitting across a network that contains answers, answers that will help us create new products, refine existing products, discover trends, improve customer relations, understand ourselves and even our company better.

00:01:05

Hadoop answers many of the big data challenges that we see today. How do you store terabytes and petabytes of information? How do you access that information quickly? How do you work with data that's in a variety of different formats, structured, semi-structured, and unstructured? And how do you do all that in a scalable, fault-tolerant, and flexible way? That's what this series is all about: Hadoop, and big data in all of its different formats.

00:01:29

Hadoop is a distributed software solution. It's a distributed framework. It's a way that we can take a cluster of machines, a lot of machines, and rather than having one or a couple of big expensive machines, have a lot of commodity machines, low- to medium-range machines, that work together to store and process our data.

00:01:50

We're going to kick off the Hadoop series introduction here with a look at the state of data. We'll get some statistics. We'll take a look at what this data explosion, another buzz term for you, is all about. It's really around unstructured data.

00:02:02

The internet is growing at an alarming rate. We are entering data at an alarming rate as it's becoming more accessible. It's becoming a lot easier to enter data. Companies and websites all over are tracking everything we do. So data really is exploding. And again, it shows no signs of slowing down.

00:02:18

We'll get some statistics. We'll get some use cases going of companies that currently have big data and what they're doing with it. And we'll talk about structured, unstructured, and semi-structured data and also the three V's, which are the three big challenges of big data, volume, velocity, and variety.

00:02:34

From there we'll jump into a high-level overview of Hadoop. We'll get familiar with the core components. We'll talk about how Hadoop is scalable, fault-tolerant, flexible, fast, and intelligent. And we'll even get some comparisons going between the relational world and the unstructured world, and we can see how Hadoop really is a complement to that world.

00:02:52

We'll even get some use cases going on internal infrastructure, such as Yahoo, who have a 4,500-node cluster set up. They've said they have 40,000 machines with Hadoop on them. So we'll see some of the specs of those machines and how Yahoo uses Hadoop. We'll also talk about cloud Hadoop.

00:03:08

Everyone's got a cloud implementation these days, Amazon, Cloudera, Microsoft, Hortonworks, IBM, the list goes on and on. And we'll see how the New York Times used cloud-based Hadoop to turn all of their articles into PDFs in an incredibly short amount of time and at an extremely low cost.

00:03:26

At the end of this Nugget, we're going to look at the series layout. We'll start with the Nugget layout, so you can get an idea of what the 20 Nuggets in this series consist of. And then we'll look at the network layout. We're going to head over to the virtual Nugget Lab and create a cluster of four machines.

00:03:40

We're going to spread gigabytes of data across those machines and use Hadoop's technology stack to manage and work with the data inside of our cluster. So strap on your seat belt. It's going to be a ride. It's going to be a fun ride. We're going to learn a lot.

00:03:52

We're going to laugh. We're going to cry, probably cry a lot more than laugh. But you haven't truly worked with Hadoop until you've shed a few tears. Let's start with the current state of data. The state of data can be summed up in one word, a lot. There's a lot of data out there.

00:04:06

And the mind-boggling stat that I still can't seem to wrap my brain around here is that 90% of the world's data was created in the last 2 years. That's a lot. And that says something. That alone tells me that, yeah, big data's here, and it's only going to get crazier.

00:04:22

So that's why big data is really spawning off its own field in IT, and it's going to be a big market here in the next 5 to 10 years. So it's a great field to get into. There's going to be plenty of jobs. And there's already a serious lack of talent in the field.

00:04:35

But 90% of the world's data was created in the last 2 years. A lot of that, sure, is due to everything being a lot more accessible with smartphones and tablets-- anybody can access data from pretty much anywhere-- but also the advent of social media. Social media is everywhere, Twitter, Facebook, Instagram, Tumblr, the list goes on and on.

00:04:54

And so we're generating data at an alarming rate. Check this out. This is what we're talking about with the data explosion. Any time you hear that term data explosion, they're referring to the explosion of unstructured and semi-structured data on the web.

00:05:06

Since the mid-'90s, it's been on a tear, this exponential rate that shows no signs of slowing down. Structured data over the last 40 years has been on a pretty standard, manageable, predictable curve. And just to give some examples of these kinds of data, unstructured data is things like emails, PDFs, documents on the hard drive, text files, that kind of stuff.

00:05:26

Semi-structured data is anything that has some form of hierarchy or a way to delimit the data, like XML. XML is a tag-based format that describes the data inside of the tags, and it's hierarchical, so that's got some structure. Any sort of delimited file, tab-delimited or CSV kind of stuff, those are all semi-structured, because they don't have a hard schema attached to them.

00:05:47

Structured data is the relational world, where everything has a schema associated with it, and it's checked against that schema when you put it into the database. So what are some of these big companies doing with this data? And how do they do it? Well, Google, for instance, who seem to be the pioneers of everything, back in the day when they were starting out, said, how do we make the internet searchable? We need to index a billion pages.

00:06:08

They built this technology called MapReduce along with GFS, the Google File System. And that's really what Hadoop is based on. A gentleman by the name of Doug Cutting back in 2003 joined Yahoo, and Yahoo gave him a team of engineers. And they built Hadoop based on those white papers, the Google File System and MapReduce.

00:06:28

The two core technologies in Hadoop are MapReduce and HDFS, the Hadoop Distributed File System. Back to the story here: Google indexed a billion pages using the same technology that we're going to learn about here. Today, Google indexes 60 billion pages to make the internet searchable.

00:06:47

And it still boggles my mind every time I do a Google search that it comes back in like 0.02 seconds. I'm like, how in the heck did it do that? Now we know, right? Facebook is another one. They boast that they have the largest Hadoop cluster on the planet. They have a cluster that contains over 100 petabytes of data.

00:07:05

And on top of that, they generate half a petabyte of data every single day. That's crazy. Everything everybody does on Facebook, from logging in to clicking to liking something, is tracked. And this is how they use that data to target you: they do ad targeting.

00:07:23

Not that anybody clicks on Facebook ads, but that's how they target you: they look at what you're clicking, what your friends are clicking, and find the commonalities between them, something called collaborative filtering, otherwise known as a recommendation engine.

00:07:35

And that's how they do all of their ad targeting. Amazon's the same way. They use the exact same machine learning algorithms. They have these recommendation engines. Again, the fancy term for that is collaborative filtering. But any time you go to Amazon, if you buy something, or even if you just browse to a product, they look at all the other people that have also looked at that product, find the commonalities, and then make recommendations to you.

00:07:56

It's a pretty brilliant and cool system. Twitter is another one, 400 million tweets a day, which is about 100,000 tweets a minute, which equates to about 15 terabytes of data every day. What do you do with that data if you're Twitter? Well, they have a lot of Hadoop clusters set up where they run thousands of MapReduce jobs every night, twisting and turning and looking at that data in a variety of ways to discover trends.

00:08:21

They know all the latest and greatest trends, and they probably sell that information to people who make products so they can target those trends. Another interesting use case is GM, General Motors, American car manufacturer. Just recently, they cut off a multi-billion-dollar contract they had with HP and some other vendors for outsourcing.

00:08:40

What are they doing? They're building two 20,000-square-foot warehouses. They're bringing it all in-house. They're going to load up those warehouses with miles and miles of racks containing low-end x86 machines. They're going to install Hadoop on them and do all their big data analytics inside.

00:08:56

That is just awesome. And thanks to Jeremy Cioara, every time I see a picture of a data center now I just want to lick it. I don't know if you've seen his Nugget where he says, don't you just want to lick it? But thanks, Jeremy. It looks very lickable. And as if you need any more proof that this big data is more than just a fad, check out these stats.

00:09:12

Global IP traffic-- and this is a Cisco stat-- is said to triple by 2015, triple, which they say will take us into the zettabyte range. Right now we're in the exabytes. I can't imagine being in the zettabytes. That is an unbelievable pile of data that's going to be out there.

00:09:29

So big data is definitely here to stay. And also, 2/3 of North American companies that were interviewed said big data is in their five-year plan. So it's a good time to get into big data. There are going to be a ton of jobs in the future. There already are plenty of jobs out there for big data, but it's just going to expand exponentially as time goes on.

00:09:48

The last thing I want to talk about on this slide are the three V's, volume, velocity, and variety, the three big reasons we cannot use traditional computing models, which are big, expensive, tricked-out machines with lots of processors that contain lots of cores and lots of hard drives, RAID-enabled for performance and fault tolerance.

00:10:07

It's a hardware solution, and we can't use hardware solutions. I'll explain why in a minute. They're also the reasons we cannot use the relational world. The relational world was really designed to handle gigabytes of data, not terabytes and petabytes. And what does the relational world do when it starts getting too much data? It archives it to tape.

00:10:21

That is the death of data. It's no longer a part of your analysis. It's no longer a part of your business intelligence and your reporting. And that's bad. Velocity is another one. Velocity is the speed at which we access data. In the traditional computing model, no matter how fast your computer is, your processor is still going to be bound by disk I/O, because disk transfer rates haven't evolved at nearly the pace of processing power.

00:10:45

This is why distributed computing makes more sense. And Hadoop uses the strengths of the current computing world by bringing computation to the data. Rather than bringing data to the computation, which saturates network bandwidth, it can process the data locally.

00:11:01

And when you have a cluster of nodes all working together, harnessing the power of the processors, reducing network bandwidth, and mitigating the weakness of disk transfer rates, we get some pretty impressive processing performance. In fact, Yahoo has broken the record for processing and sorting a terabyte of data multiple times using Hadoop.

00:11:25

In fact, they did it back in 2008. The record at the time was 297 seconds. They smashed it in 209 seconds on a 900-node Hadoop cluster, just a bunch of commodity machines with 8 gigs of RAM, dual quad-core processors, and 4 disks attached to each one, pretty impressive stuff.

00:11:41

The last V is variety. Obviously, relational systems can only handle structured data. Sure, they can get semi-structured and unstructured data in, with some engineers and ETL tools that'll transform and scrub the data to bring it in. But that requires a lot of extra work.

00:11:56

So let's go get introduced to Hadoop, see how it solves these three challenges, and talk about how it's a software-based solution and all the benefits we gain with that over the hardware-based solution of the traditional computing world. As I mentioned earlier, Hadoop is this distributed software solution.

00:12:10

It is a scalable, fault-tolerant, distributed system for data storage and processing. There are two main components in Hadoop: HDFS, which is the storage, and MapReduce, which is the retrieval and processing. HDFS is this self-healing, high-bandwidth, clustered storage.

00:12:29

And it's pretty awesome stuff. What would happen here is if we were to put a petabyte file inside of our Hadoop cluster, HDFS would break it up into blocks and then distribute them across all the nodes in our cluster. On top of that-- and this is where the fault tolerance side is going to come into play-- when we configure HDFS, we're going to set up a replication factor.

00:12:47

By default, it's set at 3. What that means is when we put this file in Hadoop, it's going to make sure that there are three copies of every block that make up that file spread across all the nodes in the cluster. That's pretty awesome. And why that's awesome is because if we lose a node, it's going to self-heal.

00:13:04

It's going to say, oh, I know what data was on that node. I'm just going to re-replicate the blocks that were on that node to the rest of the servers inside of the cluster. And how it does it is this. It has a NameNode and DataNodes. Generally, you have one NameNode per cluster, and then all the rest of these here are going to be DataNodes.

00:13:21

And we'll get into more details on the roles and secondary NameNodes and all that stuff when we get there. But essentially, the NameNode is just a metadata server. It just holds in memory the location of every block on every node. And even if you have multiple racks set up, it'll know what blocks exist on what node on what rack spread across the cluster inside your network.
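To make that concrete, here is a minimal Java sketch, not code from this series, using Hadoop's FileSystem API: it copies a file into HDFS and then asks the NameNode, from its in-memory metadata, which DataNodes hold each block. The replication setting and the /tmp/books.txt and /data/books.txt paths are hypothetical examples.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlocksDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3"); // the default is 3: three copies of every block

        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into the cluster. HDFS splits it into blocks and
        // spreads the replicas across the DataNodes. (Hypothetical paths.)
        fs.copyFromLocalFile(new Path("/tmp/books.txt"), new Path("/data/books.txt"));

        // Ask the NameNode, from its in-memory metadata, where each block lives.
        FileStatus status = fs.getFileStatus(new Path("/data/books.txt"));
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("block at offset " + block.getOffset()
                    + " -> " + java.util.Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}
```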

00:13:40

So that's the secret sauce behind HDFS, and that's how we're fault-tolerant and redundant, and it's just really awesome stuff. Now, how we get data out is through MapReduce. And as the name implies, it's really a two-step process. There's a little more to it than that.

00:13:53

But again, we're going to keep this high level. So we'll get into MapReduce. We've got a few Nuggets on MapReduce. We'll get down into the nitty-gritty details, and we'll also break out Java and write some MapReduce jobs on our own. But it's a two-step process on the surface.

00:14:06

There's a mapper and a reducer. Programmers will write the mapper function, which will go out and tell the cluster what data points we want to retrieve. The reducer will then take all that data and aggregate it. So again, Hadoop is a batch-processing-based system.
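To make that two-step picture concrete, here's a minimal word-count sketch against Hadoop's Java MapReduce API. Treat it as an illustrative example rather than code from this series: the mapper emits a (word, 1) pair for every word it sees in its slice of the data, and the reducer aggregates those counts per word. The class names are just placeholders.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: for its slice of the input, emit (word, 1) for every word it sees.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // one key/value pair per word occurrence
            }
        }
    }
}

// Reducer: all the 1s for a given word arrive together; add them up.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum)); // final (word, total) pair
    }
}
```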

00:14:23

And we're working on all of the data in the cluster. We're not doing any seeking or anything like that. Seeks are what slow down data retrieval. MapReduce is all about working on all of the data inside of our cluster. And MapReduce can scare some folks away, because you think, oh, I've got to know Java in order to write these jobs to pull data out of the cluster.

00:14:42

Well, that's not entirely true. A lot of things have popped up in the Hadoop ecosystem over the last couple of years that attract many people. And this is where the flexibility comes into play, because you don't need to understand Java to get data out of the cluster.

00:14:56

In fact, the engineers at Facebook built a subproject called Hive, which is a SQL interpreter. Facebook said, you know what? We want lots of people to be able to write ad hoc jobs against our cluster, but we're not going to force everybody to learn Java. So that's why they had a team of engineers build Hive.

00:15:14

And now anybody that's familiar with SQL, which most data professionals are, can pull data out of the cluster. Pig is another one. Yahoo went and built Pig as a high-level dataflow language to pull data out of a cluster. And all Hive and Pig do under the hood is create MapReduce jobs and submit them to the cluster.
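In other words, the kind of job-submission code that Hive and Pig save you from writing by hand looks roughly like the driver below. It's a hedged sketch that reuses the hypothetical WordCountMapper and WordCountReducer classes from the earlier example; the input and output paths are made up for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");             // Hadoop 1.x-style job setup
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);         // mapper from the earlier sketch
        job.setReducerClass(WordCountReducer.class);       // reducer from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/books"));        // hypothetical input dir
        FileOutputFormat.setOutputPath(job, new Path("/data/wordcounts")); // hypothetical output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);  // submit to the cluster and wait
    }
}
```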

00:15:31

That's the beauty of an open source framework. People can build and add to it. The Hadoop community keeps growing. More technologies and projects are added to the Hadoop ecosystem all the time, which just makes it more attractive to more and more folks.

00:15:45

Again, as these technologies emerge and Hadoop matures, you're going to see it become attractive to more than just the big businesses. Small and medium-sized businesses, people in all types of industries, are going to jump into Hadoop and start mining all kinds of data in their networks.

00:16:00

All right. So we're fault-tolerant through HDFS. We're flexible in how we can retrieve the data. And we're also flexible in the kind of data we can put in Hadoop. As we saw, structured, unstructured, semi-structured, we can put it all in there. But we're also scalable.

00:16:13

The beauty of scalability is it just kind of happens by default because we're in a distributed computing environment. We don't have to do anything special to make Hadoop scalable. It just is. Let's say our MapReduce job starts slowing down because we keep adding more data into the cluster.

00:16:25

What do we do? We add more nodes, which increases the overall processing power of our entire cluster, which is pretty awesome stuff. And adding nodes is really a piece of cake. We just install the Hadoop binaries, point them to the NameNode, and we're good to go.

00:16:39

Last but not least, Hadoop is extremely intelligent. We've already seen examples of this. The fact that we're bringing computation to the data, we're maximizing the strengths of today's computing world and mitigating the weaknesses, that alone is pretty awesome stuff.

00:16:51

But on top of that, in a multi-rack environment-- let's get some switches up here. Here's some rack switches, and here's our data center switch-- it's rack-aware. This is something that we need to do manually. We need to configure this. And it's pretty simple to do.

00:17:06

In a configuration file, we're just describing the network topology to Hadoop so it knows what DataNodes belong to what racks. And what that allows Hadoop to do is even more data locality. So whenever it receives a MapReduce job, it's going to find as short a path to the data as possible.

00:17:23

If most of the data is on one rack and only a little bit is on another rack, then it can get most of the data from the one rack. And, again, this is where it's going to save on bandwidth, because it's going to keep data as local to the rack as possible.
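As a rough sketch of what that configuration amounts to, assuming the Hadoop 1.x property name topology.script.file.name and a hypothetical /etc/hadoop/rack-topology.sh script that prints a rack path such as /rack1 for each DataNode address it's given:

```java
import org.apache.hadoop.conf.Configuration;

public class RackAwarenessConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Point Hadoop at a script that maps DataNode hosts/IPs to rack IDs.
        // The script receives addresses as arguments and prints one rack path
        // per address, e.g. /rack1 or /rack2. (The property was renamed to
        // net.topology.script.file.name in Hadoop 2.x.)
        conf.set("topology.script.file.name", "/etc/hadoop/rack-topology.sh"); // hypothetical path
        // In practice this property lives in core-site.xml on the NameNode,
        // which is the daemon that actually consults the rack mapping.
        System.out.println(conf.get("topology.script.file.name"));
    }
}
```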

00:17:35

Pretty awesome stuff. Let's check out a couple of use cases, one from an architectural standpoint: Yahoo. Yahoo has over 40,000 machines, as I mentioned, with Hadoop on them. Their largest cluster sits at 4,500 nodes. Each node inside of that cluster has a dual quad-core CPU, 4 one-terabyte disks, and 16 gigabytes of RAM, a pretty good size for a commodity machine, but certainly not the extremely high-end machine that we're talking about in the traditional computing sense.

00:18:01

That's not bad at all. Another use case here is the cloud. The cloud. What about Hadoop in the cloud? There are lots and lots of companies running Hadoop implementations in the cloud, Amazon being one of the more popular ones out there, in that inside of Amazon Web Services they have something called EMR, Elastic MapReduce, which is just their own implementation of Hadoop.

00:18:22

You can literally get it up and running in five minutes in the cloud. And that's going to be attractive for a lot of businesses that can't afford an internal infrastructure like Yahoo's or GM's, as we saw earlier. For instance, here's a good one, the New York Times.

00:18:36

The New York Times wanted to convert all of their articles to PDFs, four terabytes of articles into PDFs. How do you do that? How do you do that in a cost-effective way, without buying an entire infrastructure or hiring an entire army of engineers and developers to implement that infrastructure and all that good stuff? Here's how they did it.

00:18:59

They fired up AWS. They put their four terabytes of TIFF data into S3, spun up EC2 instances, and then just ran a MapReduce job to take those four terabytes and convert them into PDFs. It happened in less than 24 hours, and it cost them a grand total of $240. I know.

00:19:20

That's crazy. The first time I saw it too, I was like, aren't they missing a few zeroes and commas and decimals and things that would make that number bigger? But believe it or not, that's all it is. And that's a beautiful thing. In fact, the cloud's really attractive, not only for small businesses but also for hobbyists or people that are trying to learn.

00:19:38

Because you only pay for compute time. So while you're developing and learning this stuff, if you ever want to run it: spin it up, run it, spin it down. It'll cost you cents. Cents. So the cloud's huge, and it's attractive. And we're going to look later in the series at how to get Hadoop up and running in the cloud.

00:19:54

But first, we need to do it the hard way. So we're going to spend the first half of the series doing everything low level, down at the ground. We'll see how to do this stuff the hard way, and then we'll look at how to do it the easy way in the cloud. Lastly here, let's get a look at what's to come.

00:20:09

So starting with the Nugget layout, you are here, the series introduction. From here, we're going to move into the technology stack. We'll get familiar with all of the projects that make up the Hadoop ecosystem. We'll look at a high level, just to get familiar and make some sense out of it.

00:20:24

We'll look at Pig, Hive, HBase, Sqoop, ZooKeeper, Ambari, Avro, Flume, Oozie. Yeah, you see where I'm going with this? There's a lot of them. And we'll look at them all. We'll make sense of it. And believe me, it'll be a good exercise in what the heck Hadoop is, what all these things around it are, what their roles and responsibilities are, and how they're going to make it easier to work with our cluster and the data within it.

00:20:46

So we'll take the covers off all that next. Then we'll jump into HDFS. We'll get a good hard look at the internals of the Hadoop Distributed File System. From there, we'll learn how to install Hadoop. We're going to do this first on a single node. So we've got a video dedicated to single-node installation.

00:21:05

And then we have a video dedicated to multi-node installation. So you're going to get very familiar with how to get Hadoop up and running from scratch. Then from there, we'll jump into another HDFS Nugget and learn how to configure it and manage HDFS inside of our cluster.

00:21:20

Then we'll get into MapReduce. We've got a couple of Nuggets on MapReduce: one to give you a basic introduction, and then I'm going to throw you right into it by learning how to develop MapReduce applications using Java.

00:21:33

Then we'll get familiar with many of the big, popular projects here in the Hadoop ecosystem. We've got a couple of Nuggets on each one of these: Pig, Hive, HBase. Sqoop and ZooKeeper we only need one Nugget each for, because they're not extremely huge and don't have a lot of concepts associated with them.

00:21:50

But Pig, Hive, and HBase, we've got a couple Nuggets dedicated to each of those so you can get familiar with what they are and how to use them. We've got a Nugget on troubleshooting Hadoop. And then we've got a few nuggets on cloud Hadoop. We'll take a couple of different looks at how to get Hadoop up and running in the cloud.

00:22:04

By the end of this series, we'll have 20 Nuggets of Hadoop goodness that touch on a little bit of everything. Lastly, let's check out our network layout. We're going to create a four-node cluster up in the virtual Nugget Lab. All of these machines are going to be running Ubuntu Linux 12.04 LTS, for long-term support. As far as Hadoop's versioning goes, we're going to use the most recent stable release, which is 1.1.2. But essentially, 1.1.x are the stable releases, 1.2.x are the beta releases, and 2.x.x are the alpha releases. So we'll stick with the most recent stable release, and also Java 7, which is 1.7. And again, I have Ubuntu Linux installed on these four machines, but that's it.

00:22:43

That's as far as I took it. We're going to take these machines literally from scratch. I'll show you how we can download the Hadoop tarball and get it all installed and configured at the single-node level. And then we'll do it again at the multi-node level. How it's going to work here is we're going to name these with a Hadoop Nugget, or HN, prefix.

00:23:02

So HN-Name is our NameNode. And then we're going to have a bunch of Hadoop Nugget DataNodes: HN-Data1, HN-Data2, and HN-Data3. Once we get our cluster set up and HDFS configured, then we'll get some data in here. We'll start with unstructured data.

00:23:18

I'll show you how we can take books. Books are a great way to learn on, because it's easy to write MapReduce jobs to mine books for word counts. Show me how many times words appear in a book, and you can see what your favorite authors' favorite words are across all of the books that they've ever written.

00:23:35

Kind of cool. In fact, if you could data mine Nuggets, you could probably find that my favorite word is either cool or ah or something like that, probably a bunch of filler words. But anyway, the fun stuff. So we'll start with unstructured data. Then we'll move into more structured data.

00:23:48

We'll find some good data out there that we can use. I'll show you where you can find big data sets. Amazon offers some. Infochimps offers some. So we'll definitely get some good data in here to work with. And then we'll use all the tools that we learned along the way to see how to manage the cluster and work with the data inside of it.

00:24:04

It'll be fun. This CBT Nugget was a Hadoop series introduction. We started off by talking about the state of data. We defined big data. We saw the challenges with big data. We saw how companies use big data for analytics. And we had some fun statistics along the way.

00:24:20

We also got familiar with Hadoop. We just did a basic high-level overview to introduce you to the core components of Hadoop, HDFS and MapReduce. We saw how it's a pretty impressive software-based, open source solution to distributed computing that's scalable, fault-tolerant, fast, and flexible.

00:24:39

And at the end here, we took a look at the series layout. We got familiar with the Nuggets in this series and really what we're going to be learning about throughout this series and also the network layout to get familiar with what we're going to be working with over in the virtual Nugget Lab and the kinds of things we'll be doing.

