Apache Hadoop

by Garth Schulte

Total Videos: 21   Course Duration: 08:47:19
1. Hadoop Course Introduction (00:24:54)
2. Hadoop Technology Stack (00:19:37)
3. Hadoop Distributed File System (HDFS) (00:23:49)
4. Introduction to MapReduce (00:25:08)
5. Installing Apache Hadoop (Single Node) (00:29:56)
6. Installing Apache Hadoop (Multi Node) (00:23:24)
7. Troubleshooting, Administering and Optimizing Hadoop (00:38:58)
8. Managing HDFS (00:25:26)
9. MapReduce Development (00:36:08)
10. Introduction to Pig (00:31:49)
11. Developing with Pig (00:36:18)
12. Introduction to Hive (00:25:14)
13. Developing with Hive (00:27:39)
14. Introduction to HBase (00:27:53)
15. Developing with HBase (00:24:35)
16. Introduction to Zookeeper (00:24:51)
17. Introduction to Sqoop (00:24:23)
18. Local Hadoop: Cloudera CDH VM (00:17:16)
19. Cloud Hadoop: Amazon EMR (00:21:35)
20. Cloud Hadoop: Microsoft HDInsight (00:18:26)
This Apache Hadoop video training with Garth Schulte covers how to install, configure, and manage Hadoop clusters, as well as working with projects in Hadoop such as Pig and HBase.

Related area of expertise:
  • Big Data

Recommended skills:
  • Familiarity with Ubuntu Linux

Recommended equipment:
  • Ubuntu Linux 12.04 LTS operating system

Related certifications:
  • None

Related job functions:
  • Big Data architects
  • Big Data administrators
  • Big Data developers
  • IT professionals

This course will get you up to speed on Big Data and Hadoop. Topics include how to install, configure and manage a single and multi-node Hadoop cluster, configure and manage HDFS, write MapReduce jobs and work with many of the projects around Hadoop such as Pig, Hive, HBase, Sqoop, and Zookeeper. Topics also include configuring Hadoop in the cloud and troubleshooting a multi-node Hadoop cluster.

Hadoop Course Introduction

00:00:00 - Hadoop Series Introduction.
00:00:02 - Hey, everyone.
00:00:03 - Garth Schulte from CBT Nuggets.
00:00:04 - It's an honor to be your guide through this series and the
00:00:07 - wonderful world of big data.
00:00:09 - There's a buzz term for you, big data.
00:00:11 - Kind of like the cloud, two of the most popular buzz terms
00:00:14 - around today, and for good reason.
00:00:15 - There's a big market and a big future for big data.
00:00:19 - Companies are starting to realize there's an untapped
00:00:21 - treasure trove of information sitting in unstructured
00:00:24 - documents on hard drives everywhere.
00:00:28 - And while Hadoop was built to handle terabytes and petabytes
00:00:31 - of data-- and that's what a lot of the big companies
00:00:33 - jumped on it for.
00:00:34 - That's the essence of big data--
00:00:35 - even small to medium-sized companies are realizing they
00:00:38 - have untapped information in unstructured documents spread
00:00:42 - across the network.
00:00:43 - We have emails.
00:00:44 - Emails, imagine if we could data mine our emails, the kind
00:00:47 - of information we can find in them.
00:00:49 - Documents, PDFs, spreadsheets, text files, all of this
00:00:52 - unstructured data sitting across a network that contains
00:00:55 - answers, answers that will help us create new products,
00:00:57 - refine existing products, discover trends, improve
00:01:00 - customer relations, understand ourselves and
00:01:03 - even our company better.
00:01:05 - Hadoop answers many of the big data
00:01:07 - challenges that we see today.
00:01:09 - How do you store terabytes and petabytes of information?
00:01:12 - How do you access that information quickly?
00:01:13 - And how do you work with data that's in a variety of
00:01:16 - different formats, structured, semi-structured, unstructured?
00:01:20 - And how do you do all that in a scalable, fault-tolerant,
00:01:23 - and flexible way?
00:01:24 - That's what this series is all about, Hadoop, Hadoop, Hadoop,
00:01:27 - and big data and data in a variety of different formats.
00:01:29 - Hadoop is a distributed software solution.
00:01:33 - It's a distributed framework.
00:01:34 - It's a way that we can take a cluster of
00:01:36 - machines, a lot of machines.
00:01:38 - Rather than having one or a couple of big expensive machines,
00:01:40 - it's a way to have a lot of commodity machines, low to
00:01:43 - medium-range machines, that work together to store and
00:01:48 - process our data.
00:01:50 - We're going to kick off the Hadoop series introduction
00:01:52 - here with a look at the state of data.
00:01:54 - We'll get some statistics.
00:01:55 - We'll take a look at what this data explosion, another buzz
00:01:58 - term for you, what that's all about.
00:02:00 - It's really around unstructured data.
00:02:02 - The internet is growing at an alarming rate.
00:02:04 - We are entering data at an alarming rate as it's becoming
00:02:07 - more accessible.
00:02:08 - It's becoming a lot easier to enter data.
00:02:10 - Companies and websites all over are tracking
00:02:13 - everything we do.
00:02:14 - So data really is exploding.
00:02:16 - And again, it shows no signs of slowing down.
00:02:18 - We'll get some statistics.
00:02:19 - We'll get some use cases going of companies that currently
00:02:22 - have big data and what they're doing with it.
00:02:24 - And we'll talk about structured, unstructured, and
00:02:26 - semi-structured data and also the three V's, which are the
00:02:29 - three big challenges of big data, volume,
00:02:32 - velocity, and variety.
00:02:34 - From there we'll jump into a high-level overview of Hadoop.
00:02:37 - We'll get familiar with the core components.
00:02:39 - We'll talk about how Hadoop is scalable, fault-tolerant,
00:02:43 - flexible, fast, and intelligent.
00:02:45 - And we'll even get some comparisons going between the
00:02:47 - relational world and the unstructured world, and we can
00:02:50 - see how Hadoop really is a complement to that world.
00:02:52 - We'll even get some use cases going on internal
00:02:54 - infrastructure, such as Yahoo, who has a 4,500
00:02:57 - node cluster set up.
00:02:58 - They said they have 40,000 machines with Hadoop on it.
00:03:02 - So we'll see some of the specs of those machines and how
00:03:04 - Yahoo uses Hadoop.
00:03:06 - We'll also talk about cloud Hadoop.
00:03:08 - Everyone's got a cloud implementation these days,
00:03:10 - Amazon, Cloudera, Microsoft, Hortonworks, the list goes on
00:03:15 - and on, IBM.
00:03:16 - And we'll see how the New York Times used cloud-based Hadoop
00:03:20 - to turn all of their articles into PDFs in an incredible
00:03:24 - amount of time and an extremely low cost.
00:03:26 - At the end of this Nugget, we're going to look at the
00:03:28 - series layout.
00:03:28 - We'll start with the Nugget layout, so you can get an idea
00:03:31 - of what the 20 Nuggets in this series consist of.
00:03:33 - And then we'll look at the network layout.
00:03:36 - We're going to head over to the virtual Nugget Lab and
00:03:37 - create a cluster of four machines.
00:03:40 - We're going to spread gigabytes of data across those
00:03:42 - machines and use Hadoop's technology stack to manage and
00:03:46 - work with the data inside of our cluster.
00:03:48 - So strap on your seat belt.
00:03:49 - It's going to be a ride.
00:03:50 - It's going to be a fun ride.
00:03:51 - We're going to learn a lot.
00:03:52 - We're going to laugh.
00:03:53 - We're going to cry, probably cry a lot more than laugh.
00:03:55 - But you haven't truly worked with Hadoop until you've shed
00:03:58 - a few tears.
00:03:59 - Let's start with the current state of data.
00:04:01 - The state of data can be summed up in one word, a lot.
00:04:04 - There's a lot of data out there.
00:04:06 - And the mind-boggling stat that I still can't seem to
00:04:08 - wrap my brain around here is that 90% of the world's data
00:04:11 - was created in the last 2 years.
00:04:14 - That's a lot.
00:04:14 - And that says something.
00:04:15 - That alone tells me that, yeah, big data's here, and
00:04:19 - it's only going to get crazier.
00:04:22 - So that's why big data is really spawning off its own
00:04:25 - field in IT, and it's going to be a big market here in the
00:04:28 - next 5 to 10 years.
00:04:30 - So it's a great field to get into.
00:04:31 - There's going to be plenty of jobs.
00:04:32 - And there's already a serious lack of talent in the field.
00:03:35 - But 90% of the world's data created in the last 2 years.
00:04:38 - A lot of that, sure, is due to everything being a lot more
00:04:41 - accessible with smartphones and tablets--
00:04:43 - anybody can access data from pretty much anywhere--
00:04:46 - but also the advent of social media.
00:04:48 - Social media is everywhere, Twitter, Facebook, Instagram,
00:04:52 - Tumblr, the list goes on and on.
00:04:54 - And so we're generating data at an alarming rate.
00:04:56 - Check this out.
00:04:57 - This is what we're talking about with the data explosion.
00:05:00 - Any time you hear that term data explosion, they're
00:05:01 - referring to the explosion of unstructured and
00:05:04 - semi-structured data on the web.
00:05:06 - Since the mid-'90s, it's been on a tear, this exponential
00:05:10 - rate that shows no signs of slowing down.
00:05:12 - Structured data over the last 40 years has been on a pretty
00:05:15 - standard manageable, predictable curve.
00:05:18 - And just to give some examples of these kinds of data,
00:05:20 - unstructured data, things like emails, PDFs, documents on the
00:05:24 - hard drive, text files, that kind of stuff.
00:05:26 - Semi-structured are things that have some form of
00:05:29 - hierarchy or a way to delimit the data, like XML.
00:05:33 - XML is a tag-based format that describes the data inside of
00:05:36 - the tags, and it's hierarchical, so that's got
00:05:39 - some structure.
00:05:39 - Any sort of delimiter-based file, tab-separated or CSV kind of
00:05:43 - stuff, those are all semi-structured, because they
00:05:45 - don't have a hard schema attached to them.
00:05:47 - Structured data, in the relational world, everything
00:05:50 - has a schema associated with it.
00:05:52 - And it's checked against the schema when you put it into
00:05:54 - the database.
00:05:55 - So what are some of these big companies
00:05:56 - doing with this data?
00:05:58 - And how do they do it?
00:05:59 - Well, Google, for instance, who seem to be the pioneers
00:06:01 - for everything, back in the day when they were starting
00:06:03 - out, they said, how do we make the internet searchable?
00:06:05 - We need to index a billion pages.
00:06:08 - They built this technology called MapReduce along with
00:06:11 - GFS, the Google File System.
00:06:14 - And that's really what Hadoop is based on.
00:06:16 - A gentleman by the name of Doug Cutting back in 2006
00:06:20 - joined Yahoo, and Yahoo gave him a team of engineers.
00:06:22 - And they built Hadoop based on those white papers, the
00:06:26 - Google File System and MapReduce.
00:06:28 - Those are the two core technologies in Hadoop:
00:06:30 - MapReduce and HDFS, the Hadoop Distributed File System.
00:06:35 - Back to the story here, Google indexed a billion pages using
00:06:38 - the same technology that we're going to learn about here.
00:06:41 - Now today 60 billion pages is what Google indexes to make
00:06:46 - searchable for the internet.
00:06:47 - And it still boggles my mind every time I do a Google
00:06:49 - search that it comes back in like 0.02 seconds.
00:06:52 - I'm like, how in the heck did it do that?
00:06:55 - Now we know, right?
00:06:56 - Facebook is another one.
00:06:57 - They boast they have the largest Hadoop
00:06:59 - cluster on the planet.
00:07:01 - They have a cluster that contains overall 100
00:07:03 - petabytes of data.
00:07:05 - And on top of that, they generate half a petabyte of
00:07:09 - data every single day.
00:07:11 - That's crazy.
00:07:12 - Anything everybody does on Facebook, from logging in to
00:07:16 - clicking to liking something is tracked on Facebook.
00:07:19 - And this is how they use that data to target you.
00:07:22 - They do ad targeting.
00:07:23 - Not that anybody clicks on Facebook ads, but here's how
00:07:25 - they target ads at you: they look at what you're clicking,
00:07:28 - what your friends are clicking, find the
00:07:29 - commonalities between them, something called collaborative
00:07:32 - filtering, otherwise known as a recommendation engine.
00:07:35 - And that's how they do all of their ad targeting.
00:07:37 - Amazon's the same way.
00:07:38 - They do the exact same machine learning algorithms.
00:07:42 - They have these recommendation engines.
00:07:44 - Again, the fancy term for that is collaborative filtering.
00:07:46 - But any time you go to Amazon, if you buy something, or even
00:07:49 - if you just browse to a product, they look at all the
00:07:52 - other people that have also looked at that product, find
00:07:54 - the commonalities and then recommend them to you.
00:07:56 - It's a pretty brilliant and cool system.
00:07:59 - Twitter is another one, 400 million tweets a day, which is
00:08:03 - nearly 280,000 tweets a minute, which equates to about 15
00:08:09 - terabytes of data every day.
00:08:10 - What do you do with that data if you're Twitter?
00:08:13 - Well, they have a lot of Hadoop clusters set up where
00:08:15 - they run thousands of MapReduce jobs every night,
00:08:17 - twisting and turning and looking at that data in a
00:08:18 - variety of ways to discover trends.
00:08:21 - They know all the latest and greatest trends, and they
00:08:23 - probably sell that information to people who make products so
00:08:26 - they can target those trends.
00:08:27 - Another interesting use case is GM, General Motors,
00:08:31 - American car manufacturer.
00:08:33 - Just recently, they cut off a multi-billion-dollar contract
00:08:36 - they had with HP and some other vendors for outsourcing.
00:08:40 - What are they doing?
00:08:41 - They're building two 20,000-square-foot warehouses.
00:08:45 - They're bringing it all in-house.
00:08:46 - They're going to load up those warehouses with miles and
00:08:49 - miles of racks containing low-end x86 machines.
00:08:53 - They're going to install Hadoop on them and do all
00:08:54 - their big data analytics inside.
00:08:56 - That is just awesome.
00:08:57 - And thanks to Jeremy Cioara, every time I see a picture of
00:08:59 - a data center now I just want to lick it.
00:09:01 - I don't know if you've seen his Nugget where he says,
00:09:03 - don't you just want to lick it?
00:09:04 - But thanks, Jeremy.
00:09:05 - It looks very lickable.
00:09:07 - And as if you need any more proof that this big data is
00:09:10 - more than just a fad, check out these stats.
00:09:12 - Global IP traffic-- and this is a Cisco stat-- is said to
00:09:15 - triple by 2015, triple, which they say will take us into the
00:09:20 - zettabyte range.
00:09:21 - Right now we're in the exabyte.
00:09:22 - I can't imagine being in the zettabytes.
00:09:24 - That is an unbelievable pile of data that's
00:09:28 - going to be out there.
00:09:29 - So big data is definitely here to stay.
00:09:31 - And also, 2/3 of North American companies that were
00:09:34 - interviewed said big data is in their five-year plan.
00:09:37 - So it's a good time to get into big data. There are going to
00:09:41 - be a ton of jobs in the future.
00:09:43 - There already are plenty of jobs out there for big data,
00:09:45 - but it's just going to expand exponentially as time goes on.
00:09:48 - The last thing I want to talk about on this slide are the
00:09:50 - three V's, volume, velocity, and variety, the three big
00:09:53 - reasons we cannot use traditional computing models,
00:09:57 - which are big, expensive, tricked-out machines with lots of
00:10:01 - processors that contain lots of cores, and we have lots of
00:10:03 - hard drives, RAID-enabled for
00:10:05 - performance and fault tolerance.
00:10:07 - It's a hardware solution.
00:10:08 - We can't use hardware solutions.
00:10:09 - I'll explain why in a minute.
00:10:10 - And also, a reason we cannot use the relational world.
00:10:13 - The relational world was really designed to handle
00:10:14 - gigabytes of data, not terabytes and petabytes.
00:10:17 - And what does the relational world do when they start
00:10:18 - getting too much data?
00:10:19 - They archive it to tape.
00:10:21 - That is the death of data.
00:10:23 - It's no longer a part of your analysis.
00:10:25 - It's no longer a part of your business
00:10:27 - intelligence and your reporting.
00:10:29 - And that's bad.
00:10:30 - Velocity is another one.
00:10:31 - Velocity is the speed at which we access data.
00:10:34 - The traditional computing models, no matter how fast
00:10:37 - your computer is, your processor is still going to be
00:10:39 - bound by disk I/O, because disk transfer rates haven't
00:10:42 - evolved at nearly the pace of processing power.
00:10:45 - This is why distributed computing makes more sense.
00:10:47 - And Hadoop uses the strengths of the current computing world
00:10:52 - by bringing computation to the data.
00:10:54 - Rather than bringing data to the computation, which
00:10:57 - saturates network bandwidth, it can
00:10:59 - process the data locally.
00:11:01 - And when you have a cluster of nodes all working together
00:11:05 - using and harnessing the power of the processor and reducing
00:11:09 - network bandwidth and mitigating the weakness of
00:11:12 - disk transfer rates, we have some pretty impressive
00:11:16 - performance when processing.
00:11:18 - In fact, Yahoo broke records of processing and sorting a
00:11:22 - terabyte of data multiple times using Hadoop.
00:11:25 - In fact, they did it back in 2008.
00:11:27 - The record at the time was 297 seconds.
00:11:29 - They smashed it in 209 seconds on a 900-node Hadoop cluster,
00:11:34 - just a bunch of commodity machines with 8 gigs of RAM,
00:11:37 - dual quad-core processors and 4 disks attached to each one,
00:11:40 - pretty impressive stuff.
00:11:41 - The last V, variety, obviously relational systems can only
00:11:44 - handle the structured data.
00:11:45 - Sure, they can get semi-structured and
00:11:47 - unstructured data in with some engineers that have ETL tools
00:11:50 - that'll transform and scrub the data to bring it in.
00:11:53 - But that requires a lot of extra work.
00:11:56 - So let's go get introduced to Hadoop and see how it solves
00:11:58 - these three challenges and talk about how it's a
00:12:00 - software-based solution and all the benefits we gain with
00:12:03 - that over the hardware-based solution of the traditional
00:12:05 - computing world.
00:12:06 - As I mentioned earlier, Hadoop is this
00:12:08 - distributed software solution.
00:12:10 - It is a scalable, fault-tolerant, distributed
00:12:12 - system for data storage and processing.
00:12:16 - There's two main components in Hadoop, HDFS, which is the
00:12:19 - storage, and MapReduce, which is the retrieval and the
00:12:21 - processing.
00:12:23 - HDFS is this self-healing
00:12:27 - high-bandwidth clustered storage.
00:12:29 - And it's pretty awesome stuff.
00:12:31 - What would happen here is if we were to put a petabyte file
00:12:33 - inside of our Hadoop cluster, HDFS would break it up into
00:12:36 - blocks and then distribute it across all the
00:12:38 - nodes in our cluster.
00:12:40 - On top of that-- and this is where the fault tolerance side
00:12:42 - is going to come into play--
00:12:44 - when we configure HDFS, we're going to set up
00:12:45 - a replication factor.
00:12:47 - By default, it's set at 3.
00:12:48 - What that means is when we put this file in Hadoop, it's
00:12:50 - going to make sure that there are three copies of every
00:12:53 - block that make up that file spread across all the nodes in
00:12:56 - the cluster.
00:12:57 - That's pretty awesome.
00:12:58 - And why that's awesome is because if we lose a node,
00:13:03 - it's going to self-heal.
00:13:04 - It's going to say, oh, I know what data was on that node.
00:13:07 - I'm just going to re-replicate the blocks that were on that
00:13:09 - node to the rest of the servers inside of the cluster.
00:13:13 - And how it does it is this.
00:13:14 - It has a NameNode and a DataNode.
00:13:17 - Generally, you have one NameNode per cluster, and then
00:13:19 - all the rest of these here are going to be DataNodes.
00:13:21 - And we'll get into more details of the roles and
00:13:24 - the Secondary NameNode and all that
00:13:25 - stuff when we get there.
00:13:26 - But essentially, the NameNode is just a metadata server.
00:13:28 - It just holds in memory the location of every block and
00:13:31 - every node.
00:13:32 - And even if you have multiple racks set up, it'll know where
00:13:35 - blocks exist on what node on what rack spread across the
00:13:38 - cluster inside your network.
00:13:40 - So that's the secret sauce behind HDFS, and that's how
00:13:44 - we're fault-tolerant and redundant, and it's just
00:13:46 - really awesome stuff.
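
For reference, here's a minimal Java sketch (not shown in the video) of what a client write to HDFS looks like with the FileSystem API. The NameNode address and file path are hypothetical; HDFS splits the file into blocks and replicates each block according to dfs.replication, which defaults to 3.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3); // default replication factor: 3 copies of each block

        // hdfs://hn-name:9000 is a hypothetical NameNode address used for illustration.
        FileSystem fs = FileSystem.get(URI.create("hdfs://hn-name:9000"), conf);

        // The client streams bytes; HDFS chops them into blocks, and the NameNode
        // keeps the metadata about which DataNodes hold each replica.
        FSDataOutputStream out = fs.create(new Path("/data/sample.txt"));
        out.writeBytes("hello hadoop\n");
        out.close();
    }
}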
00:13:47 - Now how we get data is through MapReduce.
00:13:50 - And as the name implies, it's really a two-step process.
00:13:53 - There's a little more to it than that.
00:13:53 - But again, we're going to keep this high level.
00:13:55 - So we'll get into MapReduce.
00:13:56 - We've got a few Nuggets on MapReduce.
00:13:57 - We'll get in down to the nitty-gritty details, and
00:14:00 - we'll also break out Java and write some
00:14:02 - MapReduce jobs on our own.
00:14:03 - But it's a two-step process at the surface.
00:14:06 - There's a mapper and a reducer.
00:14:08 - Programmers will write the mapper function, which will go
00:14:12 - out and tell the cluster what data
00:14:14 - points we want to retrieve.
00:14:16 - The reducer will then take all that data and aggregate it.
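
To make that concrete, here's a minimal word-count sketch using the Hadoop MapReduce Java API (we'll write real jobs later in the series): the mapper emits a (word, 1) pair for every token it sees, and the reducer aggregates those pairs into a total per word.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map step: for each line of input, emit (word, 1) for every word.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: sum up all the 1s emitted for a given word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}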
00:14:20 - So again, Hadoop is a batch-processing-based system.
00:14:23 - And we're working on all of the data in the cluster.
00:14:26 - We're not doing any seeking or anything like that.
00:14:28 - Seeks are what slows down data retrieval.
00:14:31 - MapReduce is all about working on all of the data inside of
00:14:34 - our cluster.
00:14:35 - And MapReduce can scare some folks away, because
00:14:38 - you think, oh, I've got to know Java in order to write these
00:14:41 - jobs to pull data out of the cluster.
00:14:42 - Well, that's not entirely true.
00:14:44 - A lot of things have popped up in the Hadoop ecosystem over
00:14:47 - the last couple of years that attract many people.
00:14:50 - And this is where the flexibility comes into play,
00:14:52 - because you don't need to understand Java to get data
00:14:55 - out of the cluster.
00:14:56 - In fact, the engineers at Facebook built a subproject
00:15:00 - called Hive, which is a SQL interpreter.
00:15:02 - Facebook said, you know what?
00:15:03 - We want lots of people to be able to write ad hoc jobs
00:15:07 - against our cluster, but we're not going to force everybody
00:15:10 - to learn Java.
00:15:12 - So that's why they had a team of engineers build Hive.
00:15:14 - And now anybody that's familiar with SQL, which most
00:15:16 - data professionals are, can now pull
00:15:18 - data out of the cluster.
00:15:20 - Pig is another one.
00:15:21 - Yahoo went and built Pig as a high-level dataflow language
00:15:25 - to pull data out of a cluster.
00:15:26 - And all Hive and Pig both do under the hood is create
00:15:29 - MapReduce jobs and submit them to the cluster.
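
As a hedged example of what that flexibility buys you, here's a small Java/JDBC sketch of running a SQL-style query through Hive. The host, port, and books table are hypothetical, and the driver class shown is the old HiveServer1-era one; newer Hive releases use a hive2 driver and URL. The point is that Hive compiles the SELECT into MapReduce jobs and runs them on the cluster for you.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // HiveServer1-era JDBC driver class.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection conn =
                DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = conn.createStatement();

        // Plain SQL; under the hood Hive turns this into MapReduce jobs.
        ResultSet rs = stmt.executeQuery(
                "SELECT word, COUNT(*) FROM books GROUP BY word");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        conn.close();
    }
}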
00:15:31 - That's the beauty of an open source framework.
00:15:33 - People can build and add to it.
00:15:35 - The Hadoop community keeps growing.
00:15:37 - More technologies and projects are added to the Hadoop
00:15:40 - ecosystem all the time, which are just making it more
00:15:42 - attractive to more and more folks.
00:15:45 - Again, as these technologies emerge and Hadoop matures,
00:15:48 - you're going to see it become attractive to a lot more than just
00:15:51 - the big businesses.
00:15:52 - Small, medium-sized businesses, people of all
00:15:55 - types of industries are going to jump into Hadoop and start
00:15:58 - mining all kinds of data in their network.
00:16:00 - All right.
00:16:00 - So we're fault-tolerant through HDFS.
00:16:02 - We're flexible in how we can retrieve the data.
00:16:05 - And we're also flexible in the kind of data
00:16:07 - we can put in Hadoop.
00:16:08 - As we saw, structured, unstructured, semi-structured,
00:16:10 - we can put it all in there.
00:16:11 - But we're also scalable.
00:16:13 - The beauty of scalability is it just kind of happens by
00:16:15 - default because we're in the distributed computing
00:16:17 - environment.
00:16:18 - We don't have to do anything special
00:16:19 - to make Hadoop scalable.
00:16:20 - We just are.
00:16:21 - Let's say our MapReduce job starts slowing down because we
00:16:23 - keep adding more data into the cluster.
00:16:25 - What do we do?
00:16:26 - We add more nodes, which increases the overall
00:16:29 - processing power of our entire cluster, which is pretty
00:16:32 - awesome stuff.
00:16:33 - And adding nodes is really a piece of cake.
00:16:34 - We just install the Hadoop binaries, point them to the
00:16:37 - NameNode, and we're good to go.
00:16:39 - Last but not least, Hadoop is extremely intelligent.
00:16:41 - We've already seen examples of this.
00:16:43 - The fact that we're bringing computation to the data, we're
00:16:45 - maximizing the strengths of today's computing world and
00:16:48 - mitigating the weaknesses, that alone is
00:16:50 - pretty awesome stuff.
00:16:51 - But on top of that, in a multi-rack environment--
00:16:54 - let's get some switches up here.
00:16:55 - Here's some rack switches, and here's
00:16:58 - our data center switch--
00:17:01 - it's rack-aware.
00:17:02 - This is something that we need to do manually.
00:17:04 - We need to configure this.
00:17:05 - And it's pretty simple to do.
00:17:06 - In a configuration file, we're just describing the network
00:17:08 - topology to Hadoop so it knows what DataNodes
00:17:11 - belong to what racks.
00:17:13 - And what that allows Hadoop to do is even more data locality.
00:17:18 - So whenever it receives a MapReduce job, it's going to
00:17:21 - find the shortest possible path to the data.
00:17:23 - If most of the data is on one rack and only a little bit on
00:17:26 - another rack, then it can get most of the
00:17:27 - data from one rack.
00:17:28 - And, again, this is where it's going to save on bandwidth.
00:17:31 - It's going to save on bandwidth, because it's going
00:17:32 - to keep data as local to the rack as possible.
00:17:35 - Pretty awesome stuff.
00:17:36 - Let's check out a couple of use cases, one from an
00:17:39 - architectural standpoint, Yahoo.
00:17:41 - Yahoo has over 40,000 machines, as I mentioned, with
00:17:44 - Hadoop on it.
00:17:44 - Their largest cluster sits at a 4,500-node cluster.
00:17:48 - Each node inside of that cluster has a dual quad-core
00:17:50 - CPU, 4 one-terabyte disks, and 16 gigabytes of RAM, pretty
00:17:55 - good size for a commodity machine, but certainly not an
00:17:58 - extremely high-end machine that we're talking about in
00:18:00 - the traditional computing sense.
00:18:01 - That's not bad at all.
00:18:04 - Another use case here is the cloud.
00:18:06 - The cloud.
00:18:07 - What about Hadoop in the cloud?
00:18:08 - Lots and lots of companies running Hadoop implementations
00:18:12 - in the cloud, Amazon being one of the more popular ones out
00:18:14 - there, and inside of Amazon Web Services they have
00:18:17 - something called EMR, Elastic MapReduce, which is just their
00:18:21 - own implementation of Hadoop.
00:18:22 - You can literally get it up and running in five
00:18:24 - minutes in the cloud.
00:18:25 - And that's going to be attractive for a lot of
00:18:27 - businesses that can't afford an internal infrastructure,
00:18:30 - such as Yahoo or GM, as we saw earlier.
00:18:32 - For instance, here's a good one, New York Times.
00:18:36 - The New York Times wanted to convert all of their articles
00:18:41 - to PDFs, four terabytes of articles into PDFs.
00:18:45 - How do you do that?
00:18:46 - How do you do that in a cost-effective way without
00:18:49 - buying an entire infrastructure or an entire
00:18:53 - army of engineers to implement that infrastructure or
00:18:56 - developers and all that good stuff?
00:18:58 - Here's how they did it.
00:18:59 - They fired up an AWS EC2 instance.
00:19:02 - They used S3.
00:19:03 - They put their four terabytes of TIFF data inside of S3.
00:19:06 - They spun up an EC2 instance and then just ran a MapReduce
00:19:11 - job to take those four terabytes,
00:19:13 - convert them into PDFs.
00:19:14 - It happened in less than 24 hours, and it cost them a
00:19:17 - grand total of $240.
00:19:20 - I know.
00:19:20 - That's crazy.
00:19:21 - The first time I saw it too, I was like, aren't they missing
00:19:23 - a few zeroes and commas and decimals and things that would
00:19:27 - make that number bigger?
00:19:29 - But believe it or not, that's all it is.
00:19:31 - And that's a beautiful thing.
00:19:32 - In fact, the cloud's really attractive, not only for small
00:19:35 - businesses but also for hobbyists or people that are
00:19:37 - trying to learn.
00:19:38 - Because you only pay for compute time.
00:19:40 - So while you're developing and learning this stuff, if you
00:19:42 - ever want to run it, spin it up, run it, spin it down.
00:19:45 - It'll cost you cents.
00:19:47 - Cents.
00:19:48 - So the cloud's huge, and it's attractive.
00:19:50 - And we're going to look later in the series how to get
00:19:53 - Hadoop up and running in the cloud.
00:19:54 - But first, we need to do it the hard way.
00:19:56 - So we're going to spend the first half of the series doing
00:19:58 - everything down at the low level, on the ground.
00:20:02 - We'll see how to do this stuff the hard way, and then we'll
00:20:05 - look at how to do it the easy way in the cloud.
00:20:06 - Lastly here, let's get a look at what's to come.
00:20:09 - So starting with the Nugget layout, you are here, the
00:20:12 - series introduction.
00:20:14 - From here, we're going to move into the technology stack.
00:20:16 - We'll get familiar with all of the projects that make up the
00:20:18 - Hadoop ecosystem.
00:20:19 - We'll look at a high level, just to get familiar and make
00:20:23 - some sense out of it.
00:20:24 - We'll look at Pig, Hive, HBase, Sqoop, ZooKeeper,
00:20:27 - Ambari, Avro, Flume, Oozie.
00:20:29 - Yeah, you see where I'm going with this?
00:20:30 - There's a lot of them.
00:20:32 - And we'll look at them all.
00:20:33 - We'll make sense of it.
00:20:34 - And believe me, it'll be a good exercise in what the heck
00:20:37 - is Hadoop and what are all these things around it and
00:20:40 - what are their roles and responsibilities and how are
00:20:43 - they going to make it easier to work with our cluster and
00:20:45 - the data within it.
00:20:46 - So we'll take the covers off all that next.
00:20:48 - Then we'll jump into HDFS.
00:20:50 - We'll get a good hard look at the internals of the Hadoop
00:20:54 - Distributed File System.
00:20:56 - From there, we'll learn how to install Hadoop.
00:20:59 - We're going to do this first on a single node.
00:21:02 - So we've got a video dedicated to single-node installation.
00:21:05 - And then we have a video dedicated to multi-node
00:21:08 - installation.
00:21:09 - So you're going to get very familiar with how to get
00:21:11 - Hadoop up and running from scratch.
00:21:13 - Then from there, we'll jump into another HDFS Nugget and
00:21:16 - learn how to configure it and manage HDFS
00:21:19 - inside of our cluster.
00:21:20 - Then we'll get into MapReduce.
00:21:21 - We've got a couple of Nuggets on MapReduce.
00:21:23 - We want to introduce you to it and to give a basic
00:21:25 - introduction.
00:21:26 - And then it's just going to be straight into it.
00:21:29 - I'm going to throw you right into it by learning how to
00:21:31 - develop MapReduce applications using Java.
00:21:33 - Then we'll get familiar with many of the big popular
00:21:39 - projects here in the Hadoop ecosystem.
00:21:41 - We've got a couple of Nuggets on each one of these:
00:21:43 - Pig, Hive, HBase.
00:21:45 - Sqoop and ZooKeeper we only need one Nugget for, because
00:21:47 - they're not extremely huge and have a lot of concepts
00:21:49 - associated with them.
00:21:50 - But Pig, Hive, and HBase, we've got a couple Nuggets
00:21:52 - dedicated to each of those so you can get familiar with what
00:21:55 - they are and how to use them.
00:21:56 - We've got a Nugget on troubleshooting Hadoop.
00:21:58 - And then we've got a few nuggets on cloud Hadoop.
00:22:00 - We'll take a couple of different looks at how to get
00:22:02 - Hadoop up and running in the cloud.
00:22:04 - By the end of this series, we'll have 20 Nuggets of
00:22:07 - Hadoop goodness that touches on a little bit of everything.
00:22:10 - Lastly, let's check out our network layout.
00:22:11 - We're going to create a four-node cluster up in the
00:22:14 - virtual Nugget Lab.
00:22:15 - All of these machines are going to be running Ubuntu
00:22:17 - Linux 12.04 LTS for long-term support.
00:22:21 - How Hadoop's versioning works, we're going to use the most
00:22:24 - recent stable release, which is 1.1.2.
00:22:26 - But essentially, 1.1.x are the stable releases, 1.2.x are the
00:22:30 - beta releases, and 2.x.x are the alpha releases.
00:22:34 - So we'll stick with the most recent stable release and also
00:22:37 - Java 7, which is 1.7.
00:22:39 - And again, I have Ubuntu Linux installed on these four
00:22:42 - machines, but that's it.
00:22:43 - That's as far as I took it.
00:22:44 - We're going to take these machines
00:22:46 - literally from scratch.
00:22:47 - I'll show you how we can download the Hadoop tar ball,
00:22:50 - get it all installed, configured at
00:22:53 - the single-node level.
00:22:54 - And then we'll do it again at the multi-node level.
00:22:56 - How it's going to work here is we're going to name these
00:22:59 - Hadoop Nugget, HN, machines-- this first one is HN Name.
00:23:02 - So there's our NameNode.
00:23:03 - And then we're going to have a bunch of
00:23:04 - Hadoop Nugget DataNodes.
00:23:06 - So DataNode 1, HN Data 1, HN Data 2, and HN Data 3.
00:23:11 - Once we get our cluster set up, we get HDFS configured,
00:23:14 - then we'll get some data into here.
00:23:16 - We'll start with unstructured data.
00:23:18 - I'll show you how we can take books.
00:23:19 - Books are a great way to learn on, because it's easy to write
00:23:23 - MapReduce jobs to mine books for word counts.
00:23:26 - Show me how many times words appear in a book, and you can
00:23:29 - see what your favorite authors' favorite words are
00:23:33 - across all of the books that they've ever written.
00:23:35 - Kind of cool.
00:23:36 - In fact, if you could data mine Nuggets, you could
00:23:37 - probably find that my favorite word is either cool or ah or
00:23:41 - something like that, probably a bunch of filler words.
00:23:43 - But anyway, the fun stuff.
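
If you're curious what submitting that kind of word-count job looks like in code, here's a minimal driver sketch, assuming mapper and reducer classes like the word-count sketch earlier and hypothetical /books input and /books-out output paths in HDFS.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "book word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenMapper.class);   // mapper from the earlier sketch
        job.setReducerClass(WordCount.SumReducer.class);   // reducer from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/books"));        // hypothetical input dir
        FileOutputFormat.setOutputPath(job, new Path("/books-out"));  // must not exist yet

        // Blocks until the cluster finishes the map and reduce phases.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}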
00:23:45 - So we'll start with unstructured data.
00:23:46 - Then we'll move into more structured data.
00:23:48 - We'll find some good data out there that we can use.
00:23:50 - I'll show you where you can find big data sets.
00:23:52 - Amazon offers some.
00:23:53 - Infochimps offers some.
00:23:55 - So we'll definitely get some good data in
00:23:56 - here to work with.
00:23:57 - And then we'll use all the tools that we learned along
00:23:59 - the way to see how to manage the cluster and work with the
00:24:03 - data inside of it.
00:24:04 - It'll be fun.
00:24:05 - In this CBT Nugget, we took a Hadoop series introduction.
00:24:09 - We started off by talking about the state of data.
00:24:11 - We defined big data.
00:24:13 - We saw the challenges with big data.
00:24:14 - We saw how companies use big data for analytics.
00:24:17 - And we had some fun statistics along the way.
00:24:20 - We also got familiar with Hadoop, just did a basic
00:24:23 - high-level overview to introduce you to the core
00:24:25 - components that make up Hadoop: HDFS and MapReduce.
00:24:28 - We saw how it's a pretty impressive software-based open
00:24:32 - source solution to distributed computing that's scalable,
00:24:36 - fault-tolerant, fast, and flexible.
00:24:39 - And at the end here, we took a look at the series layout.
00:24:40 - We got familiar with the Nuggets in this series and
00:24:43 - really what we're going to be learning about throughout this
00:24:45 - series and also the network layout to get familiar with
00:24:47 - what we're going to be working with over in the virtual
00:24:49 - Nugget Lab and the kinds of things we'll be doing.
00:24:51 - I hope this has been informative for you, and I'd
00:24:52 - like to thank you for viewing.

Hadoop Technology Stack

Hadoop Distributed File System (HDFS)

Introduction to MapReduce

Installing Apache Hadoop (Single Node)

Installing Apache Hadoop (Multi Node)

Troubleshooting, Administering and Optimizing Hadoop

Managing HDFS

MapReduce Development

Introduction to Pig

Developing with Pig

Introduction to Hive

Developing with Hive

Introduction to HBase

Developing with HBase

Introduction to Zookeeper

Introduction to Sqoop

Local Hadoop: Cloudera CDH VM

Cloud Hadoop: Amazon EMR

Cloud Hadoop: Microsoft HDInsight

Garth Schulte

CBT Nuggets Trainer

Certifications:
Google Certified Trainer, MCSD, MCSD.NET, MCDBA, MCSA

Area Of Expertise:
Visual Studio 6, Visual Studio.NET Windows/Web Programming, SQL Server 6.5-2012


Course Features

Speed Control

Play videos at a faster or slower pace.

Bookmarks

Pick up where you left off watching a video.

Notes

Jot down information to refer back to at a later time.

Closed Captions

Follow what the trainers are saying with ease.

Premium Features

Virtual Lab

Use a virtual environment to reinforce what you are learning and get hands-on experience.

Offline Training

Our mobile apps offer the ability to download videos and train anytime, anywhere offline.

Accountability Coaching

Develop and maintain a study plan with assistance from coaches.
