A whirlwind tour of
hadoop
By Eric Marshall
For LOPSA-NJ
an all too brief introduction to the world of big data
Eric Marshall
I work for Airisdata; we’re hiring!
Smallest computer I lost sleep over:
Sinclair ZX81 (a.k.a. Timex Sinclair) – 1 KB of memory
Largest computer I lost sleep over: SGI
Altix 4700 – 1 TB of memory
Vocabulary disclaimer
Just like your favorite swear word, which can act as
many parts of speech and refer to many a thing,
hadoop vocabulary is overloaded.
Casually, people use 'hadoop' to mean the storage, the
processing, the programming model(s), or the clustered
machines themselves. The same problem exists for other
terms in the lexicon, so ask me when I make less sense
than usual.
My plan of attack
An intro: the good, the bad and the ugly at 50,000 ft.
2¢ tour of hadoop’s processing – MapReduce
2¢ tour of hadoop’s storage – HDFS
A blitz tour of the rest of the hadoop ecosystem
Why did this happen?
Old school –> scale up == larger, costlier
monolithic system (or a small cluster thereof), i.e.
vertical scaling
Different approach –
all roads lead to scale out
Assume failures
Smart software, cheap hardware
Don’t move data; bring processing
to data
The Good
Simple development (when
compared to Message
Passing Interface
programming)
Scale – no shared state,
programmers don’t need to
know the topology, easy to
add hardware
Automatic parallelization and
distribution of tasks
Fault tolerance
Works with commodity
hardware
Open source!
The Bad
Not a silver bullet :(
MapReduce is batch data processing
the time scale is minutes to hours
MapReduce is overly simplified/abstracted –
you are stuck with the M/R model and
it is hard to work smarter
MapReduce is low level
compared to high-level languages like SQL
Not all work decomposes well into parallelized M/R
Open source :)
Map()
Imagine a number of servers with lists of first names –
What is the most popular name?
Box 1-isabella William ava mia Emma Alexander
Box 2-Noah NOAH Isabella Isabella emma Emma
Box 3-emma Emma Liam liam mason Isabella
Map() would apply a function to each element independent
of order.
For example, capitalize each word
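In plain Python (a toy sketch of the idea, not hadoop itself), that per-element step looks like:
names = ["isabella", "William", "ava", "mia", "Emma", "Alexander"]  # Box 1
capitalized = list(map(str.capitalize, names))
# ['Isabella', 'William', 'Ava', 'Mia', 'Emma', 'Alexander']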
(MapReduce is covered in greater detail in Chapter 2 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)
Map()
So we would have:
Box 1-Isabella William Ava Mia Emma Alexander
Box 2-Noah Noah Isabella Isabella Emma Emma
Box 3-Emma Emma Liam Liam Mason Isabella
Map() could apply a function to make pairs
For example, Isabella becomes (Isabella, 1)
Map()
So we would have:
Box 1-(Isabella,1) (William,1) (Ava,1) (Mia,1) (Emma,1)
(Alexander,1)
Box 2-(Noah,1) (Noah,1) (Isabella,1) (Isabella,1)
(Emma,1) (Emma,1)
Box 3-(Emma,1) (Emma,1) (Liam,1) (Liam,1) (Mason,1)
(Isabella,1)
Now we are almost ready for the reduce, but first the sort
and shuffle
Shuffle/Sort
So we would have:
Box 1-(Alexander,1) (Ava,1) (Emma,1) (Emma,1)
(Emma,1) (Emma,1) (Emma,1)
Box 2-(Isabella,1) (Isabella,1) (Isabella,1) (Isabella,1)
Box 3-(Liam,1) (Liam,1) (Mason,1) (Mia,1) (Noah,1)
(Noah,1) (William,1)
Now for the reduce, our function would sum all of the 1s,
and return name and count
Reduce
So we would have:
Box 1-(Alexander,1) (Ava,1) (Emma,1) (Emma,1) (Emma,1)
(Emma,1) (Emma,1)
Box 2-(Isabella,1) (Isabella,1) (Isabella,1) (Isabella,1)
Box 3-(Liam,1) (Liam,1) (Mason,1) (Mia,1) (Noah,1)
(Noah,1) (William,1)
Now for the reduce, our function would sum all of the 1s,
and return name and count
Box 1-(Alexander,1) (Ava,1) (Emma,5)
Box 2-(Isabella,4)
Box 3-(Liam,2) (Mason,1) (Mia,1) (Noah,2) (William,1)
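Here is the whole pipeline as a minimal sketch in plain Python – toy code on one machine, whereas real hadoop runs each phase in parallel across the boxes:
from itertools import groupby
from operator import itemgetter

boxes = [["isabella", "William", "ava", "mia", "Emma", "Alexander"],
         ["Noah", "NOAH", "Isabella", "Isabella", "emma", "Emma"],
         ["emma", "Emma", "Liam", "liam", "mason", "Isabella"]]

# Map: normalize each name and emit (name, 1) pairs
pairs = [(name.capitalize(), 1) for box in boxes for name in box]

# Shuffle/sort: bring identical keys together
pairs.sort(key=itemgetter(0))

# Reduce: sum the 1s for each name
counts = {name: sum(one for _, one in group)
          for name, group in groupby(pairs, key=itemgetter(0))}

print(counts)  # {'Alexander': 1, 'Ava': 1, 'Emma': 5, 'Isabella': 4, ...}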
(see https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html for similar code in Java)
(This architecture is covered in greater detail in Chapter 4 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)
Map/reduce failures
Check the job if:
The job throws an uncaught exception.
The job exits with a nonzero exit code.
The job fails to report progress to the
tasktracker for a configurable amount of
time. (i.e. hung, stuck, slow)
Check the node if:
the same node keeps killing jobs
Check the JobTracker/ResourceManager if:
jobs are lost or stuck and then they all fail
Instant MR test
Um, is the system working?
yarn jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100
(your jar most likely will be somewhere else)
(HDFS is covered in greater detail in Chapter 3 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)
There is a HDFS CLI
You already know some of the commands:
hdfs dfs -ls /
hdfs dfs -du /
hdfs dfs -rm /
hdfs dfs -cat /
There are other modes than dfs: dfsadmin, namenode,
datanode, fsck, zkfc, balancer, etc.
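Two of those non-dfs modes worth knowing on day one (real subcommands; output formats vary by version):
hdfs fsck /                 # block-level health report
hdfs dfsadmin -report       # per-datanode capacity and usage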
HDFS failures
Jobs fail due to missing blocks
Jobs fail due to data movement, caused by down datanodes or huge ingest
Without NN HA – a single point of failure for everything
Regular file system mayhem that you already know and love,
plus the usual perms issues
The rest of the garden
Distributed Filesystems
- Apache HDFS
outliers:
- Tachyon
- Apache GridGain
- Ignite
- XtreemFS
- Ceph Filesystem
- Red Hat GlusterFS
- Quantcast File System QFS
- Lustre
Security
outliers:
- Apache Sentry
- Apache Knox Gateway
- Apache Ranger
Distributed Programming
- Apache MapReduce also MRv2/YARN
- Apache Pig
outliers:
- JAQL
- Apache Spark
- Apache Flink (formerly Stratosphere)
- Netflix PigPen
- AMPLab SIMR
- Facebook Corona
- Apache Twill
- Damballa Parkour
- Apache Hama
- Datasalt Pangool
- Apache Tez
- Apache Llama
- Apache DataFu
- Pydoop
- Kangaroo
- TinkerPop
- Pachyderm MapReduce
NewSQL Databases
outliers:
- TokuDB
- HandlerSocket
- Akiban Server
- Drizzle
- Haeinsa
- SenseiDB
- Sky
- BayesDB
- InfluxDB
NoSQL Databases
:Columnar Data Model
- Apache HBase
outliers:
- Apache Accumulo
- Hypertable
- HP Vertica
:Key-Value Data Model
- Apache Cassandra
outliers:
- Riak
- Redis
- LinkedIn Voldemort
- RocksDB
- OpenTSDB
:Document Data Model
outliers:
- MongoDB
- RethinkDB
- ArangoDB
- CouchDB
:Stream Data Model
outliers:
- EventStore
:Graph Data Model
outliers:
- Neo4j
- ArangoDB
- TitanDB
- OrientDB
- Intel GraphBuilder
- Giraph
- Pegasus
- Apache Spark
Scheduling
- Apache Oozie
outliers:
- Linkedin Azkaban
- Spotify Luigi
- Apache Falcon
10 in 10 minutes!
Easier Programming: Pig, Spark
SQL-like tools: Hive, Impala, HBase
Data pipefitting: Sqoop, Flume, Kafka
Bookkeeping: Oozie, Zookeeper
Pig
What is it: a high level programming language for data
manipulation that abstracts M/R from Yahoo
Why: a few lines of code to munge data
Example:
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
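(The words relation above is assumed; a hypothetical load for it, using Pig’s standard TOKENIZE and FLATTEN builtins, might be:)
lines = LOAD '/some/input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;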
(Pig is covered in greater detail in Alan Gates’s Programming Pig by O’Reilly
And in Chapter 16 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)
Spark
What is it: computing framework from AMPLab, UC Berkeley
Why: high level abstractions and better use of memory
Neat trick: in-memory RDDs
Example:
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
Or, in python:
>>> linesWithSpark = textFile.filter(lambda line: "Spark" in line)
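For flavor, the earlier name-count example fits in one line of Spark (Python shown; a sketch assuming textFile is an RDD holding one name per line):
>>> counts = textFile.map(lambda name: (name.capitalize(), 1)).reduceByKey(lambda a, b: a + b)
>>> counts.collect()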
(Spark is covered in greater detail by Matei Zaharia et al. in Learning Spark by O’Reilly.
Also of note is Advanced Analytics with Spark – it shows Spark’s capabilities well
but moves way too quickly to be truly useful. Spark is also covered in Chapter 19 of
Tom White’s Hadoop – The Definitive Guide by O’Reilly – latest edition only)
Hive/HQL
What is it: a data infrastructure and query language from
Facebook
Why: batched SQL queries against HDFS
Neat trick: stores metadata so you don’t have to
Example:
hive> LOAD DATA INPATH '/user/work/input/BX-BooksCorrected.csv'
OVERWRITE INTO TABLE BXDataSet;
hive> select yearofpublication, count(booktitle) from bxdataset group by
yearofpublication;
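(The LOAD above assumes the table already exists; a hypothetical creation, with the delimiter and columns guessed from the query, could be:)
hive> CREATE TABLE IF NOT EXISTS BXDataSet
(isbn STRING, booktitle STRING, bookauthor STRING, yearofpublication INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';' STORED AS TEXTFILE;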
(Hive is covered in greater detail by Jason Rutherglen et al. in Programming Hive by O’Reilly.
Instant Apache Hive Essentials How-To by Darren Lee (Packt) was useful to me as a tutorial.
It is also covered in Chapter 17 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)
Impala
What is it: SQL query engine from Cloudera
Why: fast ad hoc queries on subsets of data stored in hadoop
Example:
[impala-host:21000] > select count(*) from customer_address;
(no book tip here – nada; let me know if you hit pay dirt)
HBase
What is it: a non-relational database from Powerset
Why: fast access to large sparse data sets
Example:
hbase(main):001:0> create 'test', 'cf'
0 row(s) in 0.4170 seconds
Hbase::Table – test
hbase(main):003:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 0.0850 seconds
hbase(main):006:0> scan 'test'
ROW COLUMN+CELL
row1 column=cf:a,
timestamp=1421762485768, value=value1
(HBase is covered in Chapter 20 of Tom White’s Hadoop – The Definitive Guide by O’Reilly
And in covered in greater detail in Lars George’s HBase – The Definitive Guide by O’Reilly)
Sqoop
What is it: glue tool for moving data between relational
databases and hadoop
Why: make the cumbersome easier
Example:
sqoop list-databases --connect jdbc:mysql://mysql/employees --username joe --password myPassword
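Listing databases is just recon; a hedged next step, an actual import (the table and target dir here are made up), looks like:
sqoop import --connect jdbc:mysql://mysql/employees --username joe --password myPassword --table employees --target-dir /user/joe/employees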
(Sqoop is covered in greater detail in Chapter 15 of Tom White’s Hadoop – The Definitive Guide by O’Reilly.
There is also a cookbook that covers a few worthy gotchas: Apache Sqoop Cookbook by Kathleen Ting et al., from O’Reilly)
Flume
What is it: a service for collecting and aggregating logs
Why: because log ingestion is tougher than it seems
Example:
# Define a memory channel on agent called memory-channel.
agent.channels.memory-channel.type = memory
# Define a source on agent and connect to channel memory-channel.
agent.sources.tail-source.type = exec
agent.sources.tail-source.command = tail -F /var/log/system.log
agent.sources.tail-source.channels = memory-channel
# Define a sink that outputs to logger.
agent.sinks.log-sink.channel = memory-channel
agent.sinks.log-sink.type = logger
# Define a sink that outputs to hdfs.
agent.sinks.hdfs-sink.channel = memory-channel
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.hdfs.path = hdfs://localhost:54310/tmp/system.log/
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
# Finally, activate.
agent.channels = memory-channel
agent.sources = tail-source
agent.sinks = log-sink hdfs-sink
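To actually start the agent with that config (assuming it was saved as example.conf):
flume-ng agent --conf ./conf --conf-file example.conf --name agent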
(I haven’t read much on Flume; if you find something clever let me know!)
Kafka
What is it: message broker from LinkedIn
Why: fast handling of data feeds
Neat trick: no need to worry about missing data or double-processing data
Example:
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
This is a message
This is another message
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
This is a message
This is another message
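(Both commands assume the topic exists; in the quickstart of that era you would create it first with:)
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test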
(I disliked the one book I read but I found the online docs very readable! http://kafka.apache.org/
Also check out the design docs http://kafka.apache.org/documentation.html#design )
Oozie
What is it: workflow scheduler from Yahoo! Bangalore
Why: because cron isn’t perfect
Example:
oozie job -oozie http://localhost:8080/oozie -config examples/apps/map-reduce/job.properties -run
(Oozie is covered in greater detail in Islam & Srinivasan’s Apache Oozie: The Workflow Scheduler by O’Reilly)
Zookeeper
What is it: a coordination service from Yahoo
Why: sync info for distributed systems (similar idea behind
DNS or LDAP)
Example:
[zkshell: 14] set /zk_test junk
cZxid = 5
ctime = Fri Jun 05 13:57:06 PDT 2009
mZxid = 6
mtime = Fri Jun 05 14:01:52 PDT 2009
pZxid = 5
[zkshell: 15] get /zk_test
junk
cZxid = 5
ctime = Fri Jun 05 13:57:06 PDT 2009
mZxid = 6
mtime = Fri Jun 05 14:01:52 PDT 2009
pZxid = 5
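(The set/get above assume /zk_test already exists; the same getting-started walkthrough creates it with:)
[zkshell: 13] create /zk_test my_data
Created /zk_test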
(Zookeeper is covered in greater detail in ZooKeeper: Distributed Process Coordination by O’Reilly
And in Chapter 21 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)
And now a bit of common sense for
sys-admin-ing Hadoop clusters
Avoid
The usual -
Don’t let HDFS fill up
Don’t use all the memory
Don’t use up all the CPUs
Don’t drop the network
<insert fav disaster>
Resource Exhaustion by users
Hardware Failure (drives are the king of this domain)
Um, backups?
Usual suspects plus
Namenode’s metadata!! (fsimage)
HDFS? Well, it would be nice but unlikely (if so, distcp)
Snapshots
Monitoring
The usual suspects plus…
JMX support
JVM via jstat, jmap, etc.
HDFS
MapReduce
conf/hadoop-metrics.properties
http://namenode:50070/
http://namenode:50070/jmx
User management
HDFS quotas (example below)
Access controls
Internal and
external
MR schedulers
FIFO, Fair, Capacity
Kerberos can be used as well
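A quick sketch of setting those HDFS quotas (the numbers and path are made up):
hdfs dfsadmin -setQuota 100000 /user/joe        # max number of names (files + dirs)
hdfs dfsadmin -setSpaceQuota 1t /user/joe       # max raw bytes, replication included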
Configuration
/etc/hadoop/conf
Lots of knobs!
¡Ojo! –
Lots of overrides
Get the basic system solid before security and performance
Watch the units – some are in megabytes but some are in
bytes!
Have canary jobs
Ensure the same configs are everywhere (including uniform
DNS/hosts)
Fin
Thanks for listening
Slides:
http://www.slideshare.net/ericwilliammarshall/hadoop-for-sysadmins
Any questions?
What’s in a name?
Doug Cutting seems to have been inspired by his
family. Lucene is his wife’s middle name, and her
maternal grandmother’s first name. His son, as a
toddler, used Nutch as the all-purpose word for meal
and later named a yellow stuffed elephant Hadoop.
Doug said he “was looking for a name that wasn’t
already a web domain and wasn’t trademarked, so I
tried various words that were in my life but not used by
anybody else. Kids are pretty good at making up
words.”
What to do?
Combinations of the usual stuff:
Numerical Summarizations
Filtering
Altering Data Organization
Joining Data
I/O