00:00:00 - Hadoop Series Introduction.
00:00:02 - Hey, everyone.
00:00:03 - Garth Schulte from CBT Nuggets.
00:00:04 - It's an honor to be your guide
through this series and the
00:00:07 - wonderful world of big data.
00:00:09 - There's a buzz term
for you, big data.
00:00:11 - Kind of like the cloud, two of
the most popular buzz terms
00:00:14 - around today, and
for good reason.
00:00:15 - There's a big market and a
big future for big data.
00:00:19 - Companies are starting to
realize there's an untapped
00:00:21 - treasure trove of information
sitting in unstructured
00:00:24 - documents on hard drives.
00:00:28 - And while Hadoop was built to
handle terabytes and petabytes
00:00:31 - of data-- and that's what a
lot of the big companies
00:00:33 - jumped on it for.
00:00:34 - That's the essence
of big data--
00:00:35 - even small to medium-size
companies are realizing we
00:00:38 - have untapped information in
unstructured documents spread
00:00:42 - across the network.
00:00:43 - We have emails.
00:00:44 - Emails, imagine if we could data
mine our emails, the kind
00:00:47 - of information we can
find in them.
00:00:49 - Documents, PDFs, spreadsheets,
text files, all of this
00:00:52 - unstructured data sitting across
a network that contains
00:00:55 - answers, answers that will help
us create new products,
00:00:57 - refine existing products,
discover trends, improve
00:01:00 - customer relations, and understand
00:01:03 - our company even better.
00:01:05 - Hadoop answers many
of the big data
00:01:07 - challenges that we see today.
00:01:09 - How do you store terabytes and
petabytes of information?
00:01:12 - How do you access that data?
00:01:13 - And how do you work with data
that's in a variety of
00:01:16 - different formats, structured, semi-structured, and unstructured?
00:01:20 - And how do you do all that in
a scalable, fault-tolerant,
00:01:23 - and flexible way?
00:01:24 - That's what this series is all
about: Hadoop, Hadoop, Hadoop,
00:01:27 - and big data in a
variety of different formats.
00:01:29 - Hadoop is a distributed software solution.
00:01:33 - It's a distributed framework.
00:01:34 - It's a way that we can
take a cluster of
00:01:36 - machines, a lot of machines.
00:01:38 - Rather than having one or a couple
of big, expensive machines,
00:01:40 - it's a way to have a lot of
commodity machines, low to
00:01:43 - medium-range machines, that
work together to store and
00:01:48 - process our data.
00:01:50 - We're going to kick off the
Hadoop series introduction
00:01:52 - here with a look at
the state of data.
00:01:54 - We'll get some statistics.
00:01:55 - We'll take a look at what this
data explosion, another buzz
00:01:58 - term for you, what
that's all about.
00:02:00 - It's really around unstructured and semi-structured data.
00:02:02 - The internet is growing
at an alarming rate.
00:02:04 - We are entering data at an
alarming rate as it's becoming
00:02:07 - more accessible.
00:02:08 - It's becoming a lot easier
to enter data.
00:02:10 - Companies and websites
all over are tracking
00:02:13 - everything we do.
00:02:14 - So data really is exploding.
00:02:16 - And again, it shows no signs
of slowing down.
00:02:18 - We'll get some statistics.
00:02:19 - We'll get some use cases going
of companies that currently
00:02:22 - have big data and what they're
doing with it.
00:02:24 - And we'll talk about structured,
00:02:26 - semi-structured, and unstructured data and also
the three V's, which are the
00:02:29 - three big challenges of
big data, volume,
00:02:32 - velocity, and variety.
00:02:34 - From there we'll jump into a
high-level overview of Hadoop.
00:02:37 - We'll get familiar with
the core components.
00:02:39 - We'll talk about how Hadoop is
00:02:43 - scalable, flexible, fast, and fault-tolerant.
00:02:45 - And we'll even get some
comparisons going between the
00:02:47 - relational world and the
unstructured world, and we can
00:02:50 - see how Hadoop really is a
complement to that world.
00:02:52 - We'll even get some use cases
going on internal
00:02:54 - infrastructure, such as
Yahoo, who have a 4,500-node
00:02:57 - cluster set up.
00:02:58 - They say they have 40,000
machines running Hadoop.
00:03:02 - So we'll see some of the specs
of those machines and how
00:03:04 - Yahoo uses Hadoop.
00:03:06 - We'll also talk about Hadoop in the cloud.
00:03:08 - Everyone's got a cloud
implementation these days,
00:03:10 - Amazon, Cloudera, Microsoft,
Hortonworks, the list goes on
00:03:15 - and on, IBM.
00:03:16 - And we'll see how the New York
Times used cloud-based Hadoop
00:03:20 - to turn all of their articles
into PDFs in an incredible
00:03:24 - amount of time and at an
extremely low cost.
00:03:26 - At the end of this Nugget, we're
going to look at the
00:03:28 - series layout.
00:03:28 - We'll start with the Nugget
layout, so you can get an idea
00:03:31 - of what the 20 Nuggets in
this series consist of.
00:03:33 - And then we'll look at
the network layout.
00:03:36 - We're going to head over to
the virtual Nugget Lab and
00:03:37 - create a cluster of machines.
00:03:40 - We're going to spread gigabytes
of data across those
00:03:42 - machines and use Hadoop's
technology stack to manage and
00:03:46 - work with the data inside
of our cluster.
00:03:48 - So strap on your seat belt.
00:03:49 - It's going to be a ride.
00:03:50 - It's going to be a fun ride.
00:03:51 - We're going to learn a lot.
00:03:52 - We're going to laugh.
00:03:53 - We're going to cry, probably
cry a lot more than laugh.
00:03:55 - But you haven't truly worked
with Hadoop until you've shed
00:03:58 - a few tears.
00:03:59 - Let's start with the current
state of data.
00:04:01 - The state of data can be summed
up in one word, a lot.
00:04:04 - There's a lot of
data out there.
00:04:06 - And the mind-boggling stat that
I still can't seem to
00:04:08 - wrap my brain around here is
that 90% of the world's data
00:04:11 - was created in the
last 2 years.
00:04:14 - That's a lot.
00:04:14 - And that says something.
00:04:15 - That alone tells me that, yeah,
big data's here, and
00:04:19 - it's only going to grow.
00:04:22 - So that's why big data is really
spawning off its own
00:04:25 - field in IT, and it's going to
be a big market here in the
00:04:28 - next 5 to 10 years.
00:04:30 - So it's a great field
to get into.
00:04:31 - There's going to be
plenty of jobs.
00:04:32 - And there's already a serious
lack of talent in the field.
00:04:35 - But 90% of the world's data
created in the last 2 years.
00:04:38 - A lot of that, sure, is due to
everything being a lot more
00:04:41 - accessible with smartphones--
00:04:43 - anybody can access data from
pretty much anywhere--
00:04:46 - but also the advent
of social media.
00:04:48 - Social media is everywhere,
Twitter, Facebook, Instagram,
00:04:52 - Tumblr, the list
goes on and on.
00:04:54 - And so we're generating data
at an alarming rate.
00:04:56 - Check this out.
00:04:57 - This is what we're talking about
with the data explosion.
00:05:00 - Any time you hear that term
data explosion, they're
00:05:01 - referring to the explosion
of unstructured and
00:05:04 - semi-structured data
on the web.
00:05:06 - Since the mid-'90s, it's been
on a tear, this exponential
00:05:10 - rate that shows no signs
of slowing down.
00:05:12 - Structured data over the last 40
years has been on a pretty
00:05:15 - standard, manageable growth curve.
00:05:18 - And just to give some examples
of these kinds of data,
00:05:20 - unstructured data, things like
emails, PDFs, documents on the
hard drive, text files,
that kind of stuff.
00:05:26 - Semi-structured are things
that have some form of
00:05:29 - hierarchy or a way to delimit
the data, like XML.
00:05:33 - XML is a tag-based format that
describes the data inside of
00:05:36 - the tags, and it's hierarchical,
so that's got
00:05:39 - some structure.
00:05:39 - Any sort of delimited
file, tab or CSV kind of
00:05:43 - stuff, those are all
semi-structured, because they
00:05:45 - don't have a hard schema
attached to them.
00:05:47 - Structured data, in a relational
00:05:50 - database, has a schema associated with it.
00:05:52 - And it's checked against the
schema when you put it into
00:05:54 - the database.
00:05:55 - So what are some of these
00:05:56 - companies doing with this data?
00:05:58 - And how do they do it?
00:05:59 - Well, Google, for instance, who
seem to be the pioneers
00:06:01 - for everything, back in the day
when they were starting
00:06:03 - out, they said, how do we make
the internet searchable?
00:06:05 - We need to index a billion pages.
00:06:08 - They built this technology
called MapReduce along with
00:06:11 - GFS, the Google File System.
00:06:14 - And that's really what
Hadoop is based on.
00:06:16 - A gentleman by the name of
Doug Cutting back in 2006
00:06:20 - joined Yahoo, and Yahoo gave
him a team of engineers.
00:06:22 - And they built Hadoop based
on those white papers, the
00:06:26 - Google File System and MapReduce.
00:06:28 - Those are the two core technologies
in Hadoop:
00:06:30 - MapReduce and HDFS, the Hadoop
Distributed File System.
00:06:35 - Back to the story here, Google
indexed a billion pages using
00:06:38 - the same technology that we're
going to learn about here.
00:06:41 - Now today, 60 billion pages is
what Google indexes to make
00:06:46 - the internet searchable.
00:06:47 - And it still boggles my mind
every time I do a Google
00:06:49 - search that it comes back
in like 0.02 seconds.
00:06:52 - I'm like, how in the heck
did it do that?
00:06:55 - Now we know, right?
00:06:56 - Facebook is another one.
00:06:57 - They boast they have
the largest Hadoop
00:06:59 - cluster on the planet.
00:07:01 - They have a cluster that
contains over 100
00:07:03 - petabytes of data.
00:07:05 - And on top of that, they
generate half a petabyte of
00:07:09 - data every single day.
00:07:11 - That's crazy.
00:07:12 - Everything everybody does on
Facebook, from logging in to
00:07:16 - clicking to liking something,
is tracked.
00:07:19 - And this is how they use that
data to target you.
00:07:22 - They do ad targeting.
00:07:23 - Not that anybody clicks on
Facebook ads, but here's how
00:07:25 - they target you: they look
at what you're clicking,
00:07:28 - what your friends are
clicking, and find the
00:07:29 - commonalities between them,
something called collaborative
00:07:32 - filtering, otherwise known as
a recommendation engine.
00:07:35 - And that's how they do all
of their ad targeting.
00:07:37 - Amazon's the same way.
00:07:38 - They do the exact same kind of machine learning.
00:07:42 - They have these recommendation engines.
00:07:44 - Again, the fancy term for that
is collaborative filtering.
00:07:46 - But any time you go to Amazon,
if you buy something, or even
00:07:49 - if you just browse to a product,
they look at all the
00:07:52 - other people that have also
looked at that product, find
00:07:54 - the commonalities and then
recommend them to you.
00:07:56 - It's a pretty brilliant
and cool system.
00:07:59 - Twitter is another one, 400
million tweets a day, which is
00:08:03 - about 280,000 tweets a minute,
which equates to about 15
00:08:09 - terabytes of data every day.
00:08:10 - What do you do with that data
if you're Twitter?
00:08:13 - Well, they have a lot of Hadoop
clusters set up where
00:08:15 - they run thousands of MapReduce
jobs every night,
00:08:17 - twisting and turning and looking
at that data in a
00:08:18 - variety of ways to spot trends.
00:08:21 - They know all the latest and
greatest trends, and they
00:08:23 - probably sell that information
to people who make products so
00:08:26 - they can target those trends.
00:08:27 - Another interesting use case
is GM, General Motors,
00:08:31 - American car manufacturer.
00:08:33 - Just recently, they cut off a contract
00:08:36 - they had with HP and some other
vendors for outsourcing.
00:08:40 - What are they doing?
00:08:41 - They're building two data warehouses.
00:08:45 - They're bringing it all in-house.
00:08:46 - They're going to load up those
warehouses with miles and
00:08:49 - miles of racks containing
low-end x86 machines.
00:08:53 - They're going to install Hadoop
on them and do all
00:08:54 - their big data analytics in-house.
00:08:56 - That is just awesome.
00:08:57 - And thanks to Jeremy Cioara,
every time I see a picture of
00:08:59 - a data center now I just
want to lick it.
00:09:01 - I don't know if you've seen
his Nugget where he says,
00:09:03 - don't you just want
to lick it?
00:09:04 - But thanks, Jeremy.
00:09:05 - It looks very lickable.
00:09:07 - And as if you need any more
proof that this big data is
00:09:10 - more than just a fad, check
out these stats.
00:09:12 - Global IP traffic-- and this is
a Cisco stat-- is said to
00:09:15 - triple by 2015, triple, which
they say will take us into the
00:09:20 - zettabyte range.
00:09:21 - Right now we're in the exabytes.
00:09:22 - I can't imagine being
in the zettabytes.
00:09:24 - That is an unbelievable
pile of data that's
00:09:28 - going to be out there.
00:09:29 - So big data is definitely
here to stay.
00:09:31 - And also, 2/3 of North American
companies that were
00:09:34 - interviewed said big data is
in their five-year plan.
00:09:37 - So it's a good time to get into
big data. There are going to
00:09:41 - be a ton of jobs
in the future.
00:09:43 - There already are plenty of jobs
out there for big data,
00:09:45 - but it's just going to expand
exponentially as time goes on.
00:09:48 - The last thing I want to talk
about on this slide are the
00:09:50 - three V's, volume, velocity,
and variety, the three big
00:09:53 - reasons we cannot use
traditional computing models,
00:09:57 - which are big, expensive, tricked-
out machines with lots of
00:10:01 - processors that contain lots of
cores, and lots of
00:10:03 - hard drives, RAID-enabled for
00:10:05 - performance and fault tolerance.
00:10:07 - It's a hardware solution.
00:10:08 - We can't use hardware solutions here.
00:10:09 - I'll explain why in a minute.
00:10:10 - And also, a reason we cannot
use the relational world.
00:10:13 - The relational world was really
designed to handle
gigabytes of data, not terabytes and petabytes.
00:10:17 - And what does the relational
world do when they start
00:10:18 - getting too much data?
00:10:19 - They archive it to tape.
00:10:21 - That is the death of data.
00:10:23 - It's no longer a part
of your analysis.
00:10:25 - It's no longer a part
of your business
00:10:27 - intelligence and your reporting.
00:10:29 - And that's bad.
00:10:30 - Velocity is another one.
00:10:31 - Velocity is the speed at
which we access data.
00:10:34 - The traditional computing
models, no matter how fast
00:10:37 - your computer is, your processor
is still going to be
00:10:39 - bound by disk I/O, because disk
transfer rates haven't
00:10:42 - evolved at nearly the pace
of processing power.
00:10:45 - This is why distributed
computing makes more sense.
00:10:47 - And Hadoop uses the strengths of
the current computing world
00:10:52 - by bringing computation
to the data.
00:10:54 - Rather than bringing data to
the computation, which
00:10:57 - saturates network bandwidth,
00:10:59 - we process the data locally.
00:11:01 - And when you have a cluster of
nodes all working together
00:11:05 - using and harnessing the power
of the processor and reducing
00:11:09 - network bandwidth and mitigating
the weakness of
00:11:12 - disk transfer rates, we have
some pretty impressive
00:11:16 - performance when processing.
00:11:18 - In fact, Yahoo broke records
of processing and sorting a
00:11:22 - terabyte of data multiple
times using Hadoop.
00:11:25 - In fact, they did
it back in 2008.
00:11:27 - The record at the
time was 297 seconds.
00:11:29 - They smashed it in 209 seconds
on a 900-node Hadoop cluster,
00:11:34 - just a bunch of commodity
machines with 8 gigs of RAM,
00:11:37 - dual quad-core processors and 4
disks attached to each one,
00:11:40 - pretty impressive stuff.
00:11:41 - The last V, variety, obviously
relational systems can only
00:11:44 - handle the structured data.
00:11:45 - Sure, they can get
00:11:47 - unstructured data in with some
engineers that have ETL tools
00:11:50 - that'll transform and scrub
the data to bring it in.
00:11:53 - But that requires a
lot of extra work.
00:11:56 - So let's go get introduced to
Hadoop and see how it solves
00:11:58 - these three challenges,
and talk about how it's a
00:12:00 - software-based solution and all
the benefits we gain with
00:12:03 - that over the hardware-based
solution of the traditional
00:12:05 - computing world.
00:12:06 - As I mentioned earlier,
Hadoop is this
00:12:08 - distributed software solution.
00:12:10 - It is a scalable, fault-tolerant
00:12:12 - system for data storage and processing.
00:12:16 - There's two main components in
Hadoop, HDFS, which is the
00:12:19 - storage, and MapReduce, which
is the retrieval and the
00:12:21 - processing.
00:12:23 - HDFS is this self-healing
00:12:27 - high-bandwidth clustered storage.
00:12:29 - And it's pretty awesome stuff.
00:12:31 - What would happen here is if we
were to put a petabyte file
00:12:33 - inside of our Hadoop cluster,
HDFS would break it up into
00:12:36 - blocks and then distribute
it across all the
00:12:38 - nodes in our cluster.
00:12:40 - On top of that-- and this is
where the fault tolerance side
00:12:42 - is going to come into play--
00:12:44 - when we configure HDFS,
we're going to set up
00:12:45 - a replication factor.
00:12:47 - By default, it's set at 3.
00:12:48 - What that means is when we put
this file in Hadoop, it's
00:12:50 - going to make sure that there
are three copies of every
00:12:53 - block that make up that file
spread across all the nodes in
00:12:56 - the cluster.
00:12:57 - That's pretty awesome.
00:12:58 - And why that's awesome is
because if we lose a node,
00:13:03 - it's going to self-heal.
00:13:04 - It's going to say, oh, I know
what data was on that node.
00:13:07 - I'm just going to re-replicate
the blocks that were on that
00:13:09 - node to the rest of the servers
inside of the cluster.
00:13:13 - And how it does it is this.
00:13:14 - It has a NameNode
and DataNodes.
00:13:17 - Generally, you have one NameNode
per cluster, and then
all the rest of these here are
going to be DataNodes.
00:13:21 - And we'll get into more details
of the roles and
00:13:24 - secondary NameNodes
and all that
00:13:25 - stuff when we get there.
00:13:26 - But essentially, the NameNode
is just a metadata server.
00:13:28 - It just holds in memory the
location of every block on
00:13:31 - every node.
00:13:32 - And even if you have multiple
racks set up, it'll know where
00:13:35 - blocks exist on what node on
what rack spread across the
00:13:38 - cluster inside your network.
00:13:40 - So that's the secret sauce
behind HDFS, and that's how
00:13:44 - we're fault-tolerant and
redundant, and it's just
00:13:46 - really awesome stuff.
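Just to make all of that concrete-- and this is a rough sketch of my own, not code from this series, with a made-up NameNode address and made-up file paths-- here's roughly what putting a file into HDFS looks like from Java using the FileSystem API. The dfs.replication setting is the replication factor we just talked about.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsPutExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; in Hadoop 1.x this property is fs.default.name.
            conf.set("fs.default.name", "hdfs://hnname:9000");
            // Keep three copies of every block, which is also the default.
            conf.set("dfs.replication", "3");

            FileSystem fs = FileSystem.get(conf);
            // Copy a local file into the cluster. HDFS splits it into blocks and
            // spreads the replicas across the DataNodes for us.
            fs.copyFromLocalFile(new Path("/tmp/books/moby-dick.txt"),
                                 new Path("/data/books/moby-dick.txt"));
            fs.close();
        }
    }

That's about all it takes from our side-- the block splitting, placement, and replication all happen behind the scenes.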
00:13:47 - Now how we get data out is with MapReduce.
00:13:50 - And as the name implies, it's
really a two-step process.
00:13:53 - There's a little more to it than that.
00:13:53 - But again, we're going to
keep this high level.
00:13:55 - So we'll get into MapReduce.
00:13:56 - We've got a few Nuggets dedicated to it.
00:13:57 - We'll get down to the
nitty-gritty details, and
00:14:00 - we'll also break out
Java and write some
00:14:02 - MapReduce jobs on our own.
00:14:03 - But it's a two-step process
at the surface.
00:14:06 - There's a mapper
and a reducer.
00:14:08 - Programmers will write the
mapper function, which will go
00:14:12 - out and tell the cluster what data
00:14:14 - points we want to retrieve.
00:14:16 - The reducer will then take all
that data and aggregate it.
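To give you a taste of what that looks like in code-- and this is just a sketch with placeholder class names, not the exact jobs we'll write later in the series-- here's a minimal word-count mapper and reducer using Hadoop's Java MapReduce API. The mapper emits (word, 1) pairs, and the reducer aggregates them into totals.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: for every line of input, emit (word, 1) for each word on the line.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sum up the 1s for each word to get its total count.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }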
00:14:20 - So again, Hadoop is a batch processing system.
00:14:23 - And we're working on all of
the data in the cluster.
00:14:26 - We're not doing any seeking
or anything like that.
00:14:28 - Seeks are what slow down traditional systems.
00:14:31 - MapReduce is all about working
on all of the data inside of
00:14:34 - our cluster.
00:14:35 - And MapReduce can scare
some folks away, because
00:14:38 - you think, oh, I've got to know
Java in order to write these
00:14:41 - jobs to pull data out
of the cluster.
00:14:42 - Well, that's not the case.
00:14:44 - A lot of things have popped up
in the Hadoop ecosystem over
00:14:47 - the last couple of years that
attract many people.
00:14:50 - And this is where the
flexibility comes into play,
00:14:52 - because you don't need to
understand Java to get data
00:14:55 - out of the cluster.
00:14:56 - In fact, the engineers at
Facebook built a subproject
00:15:00 - called Hive, which is
a SQL interpreter.
00:15:02 - Facebook said, you know what?
00:15:03 - We want lots of people to be
able to write ad hoc jobs
00:15:07 - against our cluster, but we're
not going to force everybody
00:15:10 - to learn Java.
00:15:12 - So that's why they had a team
of engineers build Hive.
00:15:14 - And now anybody that's familiar
with SQL, which most
00:15:16 - data professionals
are, can now pull
00:15:18 - data out of the cluster.
00:15:20 - Pig is another one.
00:15:21 - Yahoo went and built Pig as a
high-level dataflow language
00:15:25 - to pull data out of a cluster.
00:15:26 - And all Hive and Pig both do
under the hood is create
00:15:29 - MapReduce jobs and submit
them to the cluster.
00:15:31 - That's the beauty of an
open source framework.
00:15:33 - People can build
and add to it.
00:15:35 - The Hadoop community keeps
growing.
00:15:37 - More technologies and
projects are added to the Hadoop
00:15:40 - ecosystem all the time, which
are just making it more
00:15:42 - attractive to more
and more folks.
00:15:45 - Again, as these technologies
emerge and Hadoop matures,
00:15:48 - you're going to see it become a
lot more attractive to more than just
00:15:51 - the big businesses.
00:15:52 - Small, medium-sized businesses,
people of all
00:15:55 - types of industries are going to
jump into Hadoop and start
00:15:58 - mining all kinds of data
in their network.
00:16:00 - All right.
00:16:00 - So we're fault-tolerant and redundant.
00:16:02 - We're flexible in how we
can retrieve the data.
00:16:05 - And we're also flexible
in the kind of data
00:16:07 - we can put in Hadoop.
00:16:08 - As we saw, structured, semi-structured, unstructured--
00:16:10 - we can put it all in there.
00:16:11 - But we're also scalable.
00:16:13 - The beauty of scalability is
it just kind of happens by
default because we're in a distributed
00:16:17 - environment.
00:16:18 - We don't have to do anything special
00:16:19 - to make Hadoop scalable.
00:16:20 - We just are.
00:16:21 - Let's say our MapReduce job
starts slowing down because we
00:16:23 - keep adding more data
into the cluster.
00:16:25 - What do we do?
00:16:26 - We add more nodes, which
increases the overall
00:16:29 - processing power of our entire
cluster, which is pretty
00:16:32 - awesome stuff.
00:16:33 - And adding nodes is really
a piece of cake.
00:16:34 - We just install the Hadoop
binaries, point them to the
00:16:37 - NameNode, and we're
good to go.
00:16:39 - Last but not least, Hadoop
is extremely intelligent.
00:16:41 - We've already seen
examples of this.
00:16:43 - The fact that we're bringing
computation to the data, we're
00:16:45 - maximizing the strengths of
today's computing world and
00:16:48 - mitigating the weaknesses,
that alone is
00:16:50 - pretty awesome stuff.
00:16:51 - But on top of that-- and let's
00:16:54 - get some switches in here,
00:16:55 - some rack switches and
00:16:58 - our data center switch--
00:17:01 - it's rack-aware.
00:17:02 - This is something that we
need to do manually.
00:17:04 - We need to configure this.
00:17:05 - And it's pretty simple to do.
00:17:06 - In a configuration file, we're
just describing the network
00:17:08 - topology to Hadoop so it
knows what DataNodes
00:17:11 - belong to what racks.
00:17:13 - And what that allows Hadoop to
do is even more data locality.
00:17:18 - So whenever it receives a
MapReduce job, it's going to
00:17:21 - find the shortest possible path to
the data.
00:17:23 - If most of the data is on one
rack and only a little bit on
00:17:26 - another rack, then it
can get most of the
00:17:27 - data from one rack.
00:17:28 - And, again, this is where it's
going to save on bandwidth.
00:17:31 - It's going to save on bandwidth,
because it's going
00:17:32 - to keep data as local to
the rack as possible.
00:17:35 - Pretty awesome stuff.
00:17:36 - Let's check out a couple of
use cases, one from an
00:17:39 - architectural standpoint, and one from the cloud.
00:17:41 - Yahoo has over 40,000 machines,
as I mentioned, with
00:17:44 - Hadoop on them.
00:17:44 - Their largest cluster sits
at a 4,500-node cluster.
00:17:48 - Each node inside of that cluster
has a dual quad-core
00:17:50 - CPU, 4 one-terabyte disks, and
16 gigabytes of RAM, pretty
00:17:55 - good size for a commodity
machine, but certainly not an
extremely high-end machine like
we're talking about in
00:18:00 - the traditional computing world.
00:18:01 - That's not bad at all.
00:18:04 - Another use case here
is the cloud.
00:18:06 - The cloud.
00:18:07 - What about Hadoop
in the cloud?
00:18:08 - Lots and lots of companies
running Hadoop implementations
00:18:12 - in the cloud, Amazon being one
of the more popular ones out
there; inside of Amazon
Web Services they have
00:18:17 - something called EMR, Elastic
MapReduce, which is just their
00:18:21 - own implementation of Hadoop.
00:18:22 - You can literally get it
up and running in five
00:18:24 - minutes in the cloud.
00:18:25 - And that's going to be
attractive for a lot of
00:18:27 - businesses that can't afford
an internal infrastructure,
00:18:30 - such as Yahoo or GM,
as we saw earlier.
00:18:32 - For instance, here's a good
one, New York Times.
00:18:36 - The New York Times wanted to
convert all of their articles
00:18:41 - to PDFs, four terabytes
of articles into PDFs.
00:18:45 - How do you do that?
00:18:46 - How do you do that in a
cost-effective way without
00:18:49 - buying an entire infrastructure
or an entire
00:18:53 - army of engineers to implement
that infrastructure or
00:18:56 - developers and all
that good stuff?
00:18:58 - Here's how they did it.
00:18:59 - They fired up an AWS account.
00:19:02 - They used S3.
00:19:03 - They put their four terabytes
of TIFF data inside of S3.
00:19:06 - They spun up an EC2 instance and
then just ran a MapReduce
00:19:11 - job to take those TIFFs and
00:19:13 - convert them into PDFs.
00:19:14 - It happened in less than 24
hours, and it cost them a
00:19:17 - grand total of $240.
00:19:20 - I know.
00:19:20 - That's crazy.
00:19:21 - The first time I saw it too, I
was like, aren't they missing
00:19:23 - a few zeroes and commas and
decimals and things that would
00:19:27 - make that number bigger?
00:19:29 - But believe it or not,
that's all it is.
00:19:31 - And that's a beautiful thing.
00:19:32 - In fact, the cloud's really
attractive, not only for small
00:19:35 - businesses but also for
hobbyists or people that are
00:19:37 - trying to learn.
00:19:38 - Because you only pay
for compute time.
00:19:40 - So while you're developing and
learning this stuff, if you
00:19:42 - ever want to run it, spin it
up, run it, spin it down.
00:19:45 - It'll cost you cents.
00:19:47 - Cents.
00:19:48 - So the cloud's huge.
00:19:50 - And we're going to look later
in the series at how to get
00:19:53 - Hadoop up and running
in the cloud.
00:19:54 - But first, we need to
do it the hard way.
00:19:56 - So we're going to spend the
first half of the series doing
00:19:58 - everything low level, from
the ground up.
00:20:02 - We'll see how to do this stuff
the hard way, and then we'll
00:20:05 - look at how to do it the
easy way in the cloud.
00:20:06 - Lastly here, let's get a
look at what's to come.
00:20:09 - So starting with the Nugget
layout, you are here, the
00:20:12 - series introduction.
00:20:14 - From here, we're going to move
into the technology stack.
00:20:16 - We'll get familiar with all of
the projects that make up the
00:20:18 - Hadoop ecosystem.
00:20:19 - We'll look at a high level, just
to get familiar and make
00:20:23 - some sense out of it.
00:20:24 - We'll look at Pig, Hive, HBase,
00:20:27 - Ambari, Avro, Flume, Oozie.
00:20:29 - Yeah, you see where I'm
going with this?
00:20:30 - There's a lot of them.
00:20:32 - And we'll look at them all.
00:20:33 - We'll make sense of it.
00:20:34 - And believe me, it'll be a good
exercise in what the heck
00:20:37 - is Hadoop and what are all these
things around it and
00:20:40 - what are their roles and
responsibilities and how are
00:20:43 - they going to make it easier to
work with our cluster and
00:20:45 - the data within it.
00:20:46 - So we'll take the covers
off all that next.
00:20:48 - Then we'll jump into HDFS.
00:20:50 - We'll get a good hard look at
the internals of the Hadoop
00:20:54 - Distributed File System.
00:20:56 - From there, we'll learn
how to install Hadoop.
00:20:59 - We're going to do this first
on a single node.
00:21:02 - So we've got a video dedicated
to single-node installation.
00:21:05 - And then we have a video
dedicated to multi-node
00:21:08 - installation.
00:21:09 - So you're going to get very
familiar with how to get
Hadoop up and running.
00:21:13 - Then from there, we'll jump into
another HDFS Nugget and
00:21:16 - learn how to configure
it and manage HDFS
00:21:19 - inside of our cluster.
00:21:20 - Then we'll get into MapReduce.
00:21:21 - We've got a couple of Nuggets on it.
00:21:23 - We want to introduce you to
it and give you a basic
00:21:25 - introduction.
00:21:26 - And then it's just going
to be straight into it.
00:21:29 - I'm going to throw you right
into it by learning how to
00:21:31 - develop MapReduce applications in Java.
00:21:33 - Then we'll get familiar with
many of the big popular
projects here in the Hadoop ecosystem.
00:21:41 - We've got a couple of Nuggets
on each one of these:
00:21:43 - Pig, Hive, and HBase.
00:21:45 - Sqoop and ZooKeeper we only need
one Nugget for, because
00:21:47 - they're not extremely huge and
have a lot of concepts
00:21:49 - associated with them.
00:21:50 - But Pig, Hive, and HBase, we've
got a couple Nuggets
00:21:52 - dedicated to each of those so
you can get familiar with what
00:21:55 - they are and how to use them.
00:21:56 - We've got a Nugget on
00:21:58 - And then we've got a few Nuggets
on cloud Hadoop.
00:22:00 - We'll take a couple of different
looks at how to get
00:22:02 - Hadoop up and running
in the cloud.
00:22:04 - By the end of this series,
we'll have 20 Nuggets of
00:22:07 - Hadoop goodness that touch on
a little bit of everything.
00:22:10 - Lastly, let's check out
our network layout.
00:22:11 - We're going to create a
four-node cluster up in the
00:22:14 - virtual Nugget Lab.
00:22:15 - All of these machines are going
to be running Ubuntu
Linux 12.04 LTS for the operating system.
00:22:21 - The way Hadoop's versioning works,
we're going to use the most
00:22:24 - recent stable release,
which is 1.1.2.
00:22:26 - But essentially, 1.1.x are the
stable releases, 1.2.x are the
00:22:30 - beta releases, and 2.x.x
are the alpha releases.
00:22:34 - So we'll stick with the most
recent stable release and also
00:22:37 - Java 7, which is 1.7.
00:22:39 - And again, I have Ubuntu Linux
installed on these four
00:22:42 - machines, but that's it.
00:22:43 - That's as far as I took it.
00:22:44 - We're going to take it
00:22:46 - literally from scratch.
00:22:47 - I'll show you how we can
download the Hadoop tar ball,
get it all installed and configured at
00:22:53 - the single-node level.
00:22:54 - And then we'll do it again
at the multi-node level.
00:22:56 - How it's going to work here is
we're going to name these machines
00:22:59 - Hadoop Nugget, HN. First up is HN Name.
00:23:02 - So there's our NameNode.
00:23:03 - And then we're going
to have a bunch of
00:23:04 - Hadoop Nugget DataNodes.
00:23:06 - So DataNode 1, HN Data 1, HN
Data 2, and HN Data 3.
00:23:11 - Once we get our cluster set up,
we get HDFS configured,
00:23:14 - then we'll get some
data into here.
00:23:16 - We'll start with unstructured data.
00:23:18 - I'll show you how we
can take books.
00:23:19 - Books are a great way to learn
on, because it's easy to write
00:23:23 - MapReduce jobs to mine books
for word counts.
00:23:26 - Show me how many times words
appear in a book, and you can
00:23:29 - see what your favorite authors'
favorite words are
00:23:33 - across all of the books that
they've ever written.
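As a quick preview-- and again this is just a sketch with made-up HDFS paths, reusing the placeholder mapper and reducer classes from earlier in this Nugget-- the driver that wires a word-count job together and runs it over a folder of books looks roughly like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "book word count"); // Hadoop 1.x-style constructor
            job.setJarByClass(WordCountDriver.class);

            // The mapper and reducer sketched earlier.
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Hypothetical HDFS paths: a folder of books in, word counts out.
            FileInputFormat.addInputPath(job, new Path("/data/books"));
            FileOutputFormat.setOutputPath(job, new Path("/data/books-wordcount"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The output directory ends up holding (word, count) pairs for everything in the input folder, which is exactly the kind of thing we'll dig into in the MapReduce Nuggets.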
00:23:35 - Kind of cool.
00:23:36 - In fact, if you could data
mine Nuggets, you could
00:23:37 - probably find that my favorite
word is either cool or ah or
00:23:41 - something like that, probably
a bunch of filler words.
00:23:43 - But anyway, the fun stuff.
00:23:45 - So we'll start with the books.
00:23:46 - Then we'll move into more structured data.
00:23:48 - We'll find some good data out
there that we can use.
00:23:50 - I'll show you where you can
find big data sets.
00:23:52 - Amazon offers some.
00:23:53 - Infochimps offers some.
00:23:55 - So we'll definitely get
some good data in
00:23:56 - here to work with.
00:23:57 - And then we'll use all the tools
that we learned along
00:23:59 - the way to see how to manage the
cluster and work with the
00:24:03 - data inside of it.
00:24:04 - It'll be fun.
00:24:05 - In this CBT Nugget, we took a
look at the Hadoop series introduction.
00:24:09 - We started off by talking
about the state of data.
00:24:11 - We defined big data.
00:24:13 - We saw the challenges
with big data.
00:24:14 - We saw how companies use
big data for analytics.
00:24:17 - And we had some fun statistics
along the way.
00:24:20 - We also got familiar with
Hadoop, just did a basic
00:24:23 - high-level overview to introduce
you to the core
00:24:25 - components of Hadoop, HDFS and MapReduce.
00:24:28 - We saw how it's a pretty
impressive software-based open
00:24:32 - source solution to distributed
computing that's scalable,
fault-tolerant, fast, flexible, and intelligent.
00:24:39 - And at the end here, we took a
look at the series layout.
00:24:40 - We got familiar with the Nuggets
in this series and
00:24:43 - really what we're going to be
learning about throughout this
00:24:45 - series and also the network
layout to get familiar with
00:24:47 - what we're going to be working
with over in the virtual
00:24:49 - Nugget Lab and the kinds of
things we'll be doing.
00:24:51 - I hope this has been informative
for you, and I'd
00:24:52 - like to thank you for viewing.