A whirlwind tour of
hadoop
By Eric Marshall
For LOPSA-NJ
an all too brief introduction to the world of big data
Eric Marshall
I work for Airisdata; we’re hiring!
Smallest computer I lost sleep over: Sinclair-Timex ZX81 with 1 KB of memory
Largest computer I lost sleep over: SGI Altix 4700 with 1 TB of memory
Vocabulary disclaimer
- Just like your favorite swear word, which can act as many parts of speech and refer to many a thing, hadoop vocabulary is overloaded.
- Casually, people use “hadoop” to mean storage, processing, a programming model, or a cluster of machines. The same problem exists for other terms in the lexicon, so ask me when I make less sense than usual.
My plan of attack
 An intro: the good, the bad and the ugly at 50,000 ft.
 2¢ tour of hadoop’s processing - map reduce
 2¢ tour of hadoop’s storage – hdfs
 A blitz tour of the rest of the hadoop ecosystem
Why did this happen?
- Old school: scale up == a larger, costlier monolithic system (or a small cluster thereof), i.e. vertical scaling
- Different approach: all roads lead to scale out
- Assume failures
- Smart software, cheap hardware
- Don’t move data; bring processing to the data
The Good
- Simple development (when compared to Message Passing Interface programming)
- Scale: no shared state, programmers don’t need to know the topology, easy to add hardware
- Automatic parallelization and distribution of tasks
- Fault tolerance
- Works with commodity hardware
- Open source!
The Bad
- Not a silver bullet :(
- MapReduce is batch data processing: the time scale is minutes to hours
- MapReduce is overly simplified/abstracted: you are stuck with the M/R model, and it is hard to work smarter
- MapReduce is low level compared to high-level languages like SQL
- Not all work decomposes well into parallelized M/R
- Open source :)
The Ugly?
Welcome to the rest of our talk!
First stop, Map Reduce
Hadoop’s MapReduce
Lisp’s map and reduce
plus the associative property
applied to clusters.
Map()
 Imagine a number of servers with lists of first names –
What is the most popular name?
 Box 1-isabella William ava mia Emma Alexander
 Box 2-Noah NOAH Isabella Isabella emma Emma
 Box 3-emma Emma Liam liam mason Isabella
Map() would apply a function to each element independent
of order.
For example, capitalize each word
(MapReduce is covered in greater detail in Chapter 2 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)
Map()
 So we would have:
 Box 1-Isabella William Ava Mia Emma Alexander
 Box 2-Noah Noah Isabella Isabella Emma Emma
 Box 3-Emma Emma Liam Liam Mason Isabella
Map() could also apply a function to make pairs
For example, Isabella becomes (Isabella, 1)
Map()
 So we would have:
 Box 1-(Isabella,1) (William,1) (Ava,1) (Mia,1) (Emma,1)
(Alexander,1)
 Box 2-(Noah,1) (Noah,1) (Isabella,1) (Isabella,1)
(Emma,1) (Emma,1)
 Box 3-(Emma,1) (Emma,1) (Liam,1) (Liam,1) (Mason,1)
(Isabella,1)
Now we are almost ready for the reduce, but first the sort
and shuffle
Shuffle/Sort
 So we would have:
 Box 1-(Alexander,1) (Ava,1) (Emma,1) (Emma,1)
(Emma,1) (Emma,1) (Emma,1)
 Box 2-(Isabella,1) (Isabella,1) (Isabella,1) (Isabella,1)
 Box 3-(Liam,1) (Liam,1) (Mason,1) (Mia,1) (Noah,1)
(Noah,1) (William,1)
Now for the reduce: our function would sum all of the 1s
and return the name and count
Reduce
 So we would have:
 Box 1-(Alexander,1) (Ava,1) (Emma,5)
 Box 2-(Isabella,4)
 Box 3-(Liam,2) (Mason,1) (Mia,1) (Noah,2) (William,1)
(See https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html for a similar example coded in Java)
(This architecture is covered in greater detail in Chapter 4 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)
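The name-counting walk-through above can be sketched in plain Python. This is only a single-process illustration of the map, shuffle/sort and reduce steps, not Hadoop code; the function names are made up for this sketch.

```python
from collections import defaultdict

# The three "boxes" (servers) of raw first names from the slides.
boxes = [
    ["isabella", "William", "ava", "mia", "Emma", "Alexander"],
    ["Noah", "NOAH", "Isabella", "Isabella", "emma", "Emma"],
    ["emma", "Emma", "Liam", "liam", "mason", "Isabella"],
]

def map_phase(names):
    """Capitalize each name and emit a (name, 1) pair, independent of order."""
    return [(name.capitalize(), 1) for name in names]

def shuffle_sort(mapped_boxes):
    """Bring all pairs with the same key together, sorted by key."""
    grouped = defaultdict(list)
    for pairs in mapped_boxes:
        for name, one in pairs:
            grouped[name].append(one)
    return sorted(grouped.items())

def reduce_phase(grouped):
    """Sum the 1s for each name and return name -> count."""
    return {name: sum(ones) for name, ones in grouped}

# Each box is mapped on its own; on a cluster these calls would run in parallel.
mapped = [map_phase(box) for box in boxes]
counts = reduce_phase(shuffle_sort(mapped))
most_popular = max(counts, key=counts.get)

print(counts)
print("Most popular:", most_popular)
```

Because addition is associative, the 1s can be summed in any grouping, which is what lets the reduce work be split across boxes.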
Hadoop for sysadmins
Map/reduce failures
Check the job if:
- The job throws an uncaught exception.
- The job exits with a nonzero exit code.
- The job fails to report progress to the tasktracker for a configurable amount of time (i.e. hung, stuck, or slow).
Check the node if:
- The same node keeps killing jobs.
Check the JobTracker/ResourceManager (RM) if:
- Jobs are lost or stuck, and then they all fail.

Instant MR test
- Um, is the system working?
- yarn jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100
(your jar most likely will be somewhere else)
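The `pi 10 100` arguments ask the example job for 10 map tasks of 100 samples each. As a rough idea of what that smoke test computes, here is a plain-Python sketch. It uses ordinary pseudo-random sampling (the bundled example uses a quasi-Monte Carlo method instead), and the function names are invented for this sketch.

```python
import random

def map_task(n_samples, rng):
    """Count random points in the unit square that land inside the quarter circle."""
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside

def estimate_pi(n_maps, n_samples, seed=0):
    """Run n_maps independent 'map tasks', then 'reduce' by summing their counts."""
    rng = random.Random(seed)
    inside_counts = [map_task(n_samples, rng) for _ in range(n_maps)]  # map
    total_inside = sum(inside_counts)                                  # reduce
    return 4.0 * total_inside / (n_maps * n_samples)

estimate = estimate_pi(10, 100)
print(estimate)
```

With so few samples the estimate is crude, which is fine: the point of the test is that the job runs and finishes, not the digits.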
(HDFS is covered in greater detail in Chapter 3 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)
There is an HDFS CLI
- You already know some of the commands:
- hdfs dfs -ls /
- hdfs dfs -du /
- hdfs dfs -rm /
- hdfs dfs -cat /
- There are other modes than dfs: dfsadmin, namenode, datanode, fsck, zkfc, balancer, etc.
HDFS failures
- Jobs fail due to missing blocks
- Jobs fail due to data being moved around (datanodes down, or a huge ingest)
- Without NameNode (NN) HA: a single point of failure for everything
- Regular file system mayhem that you already know and love
- Plus the usual permissions issues
The rest of the garden
Distributed Filesystems
- Apache HDFS
outliers:
- Tachyon
- Apache GridGain
- Ignite
- XtreemFS
- Ceph Filesystem
- Red Hat GlusterFS
- Quantcast File System QFS
- Lustre
Security
outliers:
- Apache Sentry
- Apache Knox Gateway
- Apache Ranger
Distributed Programming
- Apache MapReduce also MRv2/YARN
- Apache Pig
outliers:
- JAQL
- Apache Spark
- Apache Flink (formerly Stratosphere)
- Netflix PigPen
- AMPLab SIMR
- Facebook Corona
- Apache Twill
- Damballa Parkour
- Apache Hama
- Datasalt Pangool
- Apache Tez
- Apache Llama
- Apache DataFu
- Pydoop
- Kangaroo
- TinkerPop
- Pachyderm MapReduce
NewSQL Databases
outliers:
- TokuDB
- HandlerSocket
- Akiban Server
- Drizzle
- Haeinsa
- SenseiDB
- Sky
- BayesDB
- InfluxDB
NoSQL Databases
:Column Data Model
- Apache HBase
outliers:
- Apache Accumulo
- Hypertable
- HP Vertica
:Key Value Data Model
- Apache Cassandra
- Riak
- Redis
- Linkedin Voldemort
:Document Data Model
outliers:
- MongoDB
- RethinkDB
- ArangoDB
- CouchDB
:Stream Data Model
outliers:
- EventStore
:Key-Value Data Model
outliers:
- Redis DataBase
- Linkedin Voldemort
- RocksDB
- OpenTSDB
:Graph Data Model
outliers:
- Neo4j
- ArangoDB
- TitanDB
- OrientDB
- Intel GraphBuilder
- Giraph
- Pegasus
- Apache Spark
Scheduling
- Apache Oozie
outliers:
- Linkedin Azkaban
- Spotify Luigi
- Apache Falcon
10 in 10 minutes!
- Easier Programming: Pig, Spark
- SQL-like tools: Hive, Impala, HBase
- Data pipefitting: Sqoop, Flume, Kafka
- Book keeping: Oozie, Zookeeper
Easier Programming
Pig
What is it: a high-level data-manipulation language from Yahoo that abstracts away M/R
Why: a few lines of code to munge data
Example:
-- assumes a relation named words with one word per tuple
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
(Pig is covered in greater detail in Alan Gates’s Programming Pig by O’Reilly,
and in Chapter 16 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)
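For readers who do not speak Pig, the three statements above (filter, group, count) can be mirrored in a few lines of plain Python. This is a rough single-machine equivalent; the input list is made up for illustration.

```python
import re
from collections import Counter

# Made-up input: one token per element, like the `words` relation in the Pig script.
words = ["hadoop", "pig", "--", "hadoop", "hdfs", "!", "pig", "hadoop"]

# FILTER ... MATCHES: keep only tokens made up entirely of word characters.
filtered_words = [w for w in words if re.fullmatch(r"\w+", w)]

# GROUP ... BY word, then COUNT per group.
word_count = Counter(filtered_words)

print(word_count.most_common())
```

The difference is that Pig turns those statements into M/R jobs and runs them across the cluster, while this runs in one process.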
Spark
What is it: computing framework from AMPLab, UC Berkeley
Why: high-level abstractions and better use of memory
Neat trick: in-memory RDDs
Example:
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
Or, in Python:
>>> linesWithSpark = textFile.filter(lambda line: "Spark" in line)
(Spark is covered in greater detail by Matei Zaharia et al. in Learning Spark by O’Reilly.
Also of note is Advanced Analytics with Spark; it shows Spark’s capabilities well
but moves way too quickly to be truly useful. Spark is also covered in Chapter 19 of
Tom White’s Hadoop – The Definitive Guide by O’Reilly, latest edition only.)
SQL-ish
Hive/HQL
What is it: a data infrastructure and query language from
Facebook
Why: batched SQL queries against HDFS
Neat trick: stores metadata so you don’t have to
Example:
hive> LOAD DATA INPATH '/user/work/input/BX-BooksCorrected.csv' OVERWRITE INTO TABLE BXDataSet;
hive> select yearofpublication, count(booktitle) from bxdataset group by yearofpublication;
(Hive is covered in greater detail by Jason Rutherglen et al. in Programming Hive by O’Reilly.
Instant Apache Hive Essentials How-To by Darren Lee, from Packt, was useful to me as a tutorial.
It is also covered in Chapter 17 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)
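To try the shape of that query without a cluster, the same GROUP BY can be run against SQLite from Python's standard library. SQLite is only a stand-in here, not Hive; the two-column table layout and the rows are assumptions made up for this sketch.

```python
import sqlite3

# In-memory database standing in for the Hive table (assumed two-column layout).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bxdataset (booktitle TEXT, yearofpublication INTEGER)")

# Made-up rows, purely for illustration.
rows = [("Book A", 1999), ("Book B", 1999), ("Book C", 2001)]
conn.executemany("INSERT INTO bxdataset VALUES (?, ?)", rows)

# Same query as on the slide, plus ORDER BY so the output order is predictable.
result = conn.execute(
    "SELECT yearofpublication, COUNT(booktitle) FROM bxdataset "
    "GROUP BY yearofpublication ORDER BY yearofpublication"
).fetchall()

print(result)
conn.close()
```

On Hive the same statement is compiled into batch jobs over files in HDFS, so expect it to take far longer than a local database would on small data.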
Impala
What is it: SQL query engine from Cloudera
Why: fast ad hoc queries on subsets of data stored in hadoop
Example:
[impala-host:21000] > select count(*) from customer_address;
(No book recommendations here; let me know if you hit pay dirt)
HBase
What is it: a non-relational database from Powerset
Why: fast access to large sparse data sets
Example:
hbase(main):001:0> create 'test', 'cf'
0 row(s) in 0.4170 seconds
=> Hbase::Table - test
hbase(main):003:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 0.0850 seconds
hbase(main):006:0> scan 'test'
ROW COLUMN+CELL
row1 column=cf:a, timestamp=1421762485768, value=value1
(HBase is covered in Chapter 20 of Tom White’s Hadoop – The Definitive Guide by O’Reilly,
and in greater detail in Lars George’s HBase – The Definitive Guide by O’Reilly)
Data pipefitting
Sqoop
What is it: glue tool for moving data between relational
databases and hadoop
Why: make the cumbersome easier
Example:
sqoop list-databases --connect jdbc:mysql://mysql/employees --username joe --password myPassword
(Sqoop is covered in greater detail in Chapter 15 of Tom White’s Hadoop – The Definitive Guide by O’Reilly.
There is also a cookbook that covers a few worthy gotchas: Apache Sqoop Cookbook by Kathleen Ting, from O’Reilly)
Flume
What is it: a service for collecting and aggregating logs
Why: because log ingestion is tougher than it seems
Example:
# Define a memory channel on agent called memory-channel.
agent.channels.memory-channel.type = memory
# Define a source on agent and connect to channel memory-channel.
agent.sources.tail-source.type = exec
agent.sources.tail-source.command = tail -F /var/log/system.log
agent.sources.tail-source.channels = memory-channel
# Define a sink that outputs to logger.
agent.sinks.log-sink.channel = memory-channel
agent.sinks.log-sink.type = logger
# Define a sink that outputs to hdfs.
agent.sinks.hdfs-sink.channel = memory-channel
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.hdfs.path = hdfs://localhost:54310/tmp/system.log/
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
# Finally, activate.
agent.channels = memory-channel
agent.sources = tail-source
agent.sinks = log-sink hdfs-sink
(I haven’t read much on Flume; if you find something clever let me know!)
Kafka
What is it: message broker from LinkedIn
Why: fast handling of data feeds
Neat trick: no need to worry about missing data or double
processing data
Example:
> bin/kafka-console-producer.sh --zookeeper localhost:2181 --topic test
This is a message
This is another message
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
This is a message
This is another message
(I disliked the one book I read but I found the online docs very readable! http://kafka.apache.org/
Also check out the design docs http://kafka.apache.org/documentation.html#design )
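The "neat trick" rests on the broker keeping messages in an append-only log while each consumer remembers how far it has read. Below is a toy model of that idea in plain Python. It is not the Kafka API; the class and method names are invented for this sketch.

```python
class TopicLog:
    """Toy append-only log: messages are never changed once written."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Store a message and return its offset (position in the log)."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read_from(self, offset):
        """Return every message at or after the given offset."""
        return self._messages[offset:]

log = TopicLog()
log.append("This is a message")
log.append("This is another message")

# A consumer tracks its own offset, so it resumes exactly where it left off.
offset = 0
batch = log.read_from(offset)
offset += len(batch)
nothing_new = log.read_from(offset)

# Starting again at offset 0 replays everything, like --from-beginning.
replay = log.read_from(0)
print(batch, nothing_new, replay)
```

A consumer that crashes before saving its offset would reread some messages, so how carefully offsets are stored still matters in practice.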
Book keeping
Oozie
What is it: workflow scheduler from Yahoo Bangalore
Why: because cron isn’t perfect
Example:
oozie job -oozie http://localhost:8080/oozie -config examples/apps/map-reduce/job.properties -run
(Oozie is covered in greater detail in Islam & Srinivasan’s Apache Oozie: The Workflow Scheduler by O’Reilly)
Zookeeper
What is it: a coordination service from Yahoo
Why: sync info for distributed systems (similar idea behind
DNS or LDAP)
Example:
[zkshell: 14] set /zk_test junk
cZxid = 5
ctime = Fri Jun 05 13:57:06 PDT 2009
mZxid = 6
mtime = Fri Jun 05 14:01:52 PDT 2009
pZxid = 5
[zkshell: 15] get /zk_test
junk
cZxid = 5
ctime = Fri Jun 05 13:57:06 PDT 2009
mZxid = 6
mtime = Fri Jun 05 14:01:52 PDT 2009
pZxid = 5
(Zookeeper is covered in greater detail in ZooKeeper: Distributed Process Coordination by O’Reilly,
and in Chapter 21 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)
SQL on Hadoop
- Apache Hive
- Apache HCatalog
outliers:
- Cloudera Kudu
- Trafodion
- Apache Drill
- Cloudera Impala
- Facebook Presto
- Datasalt Splout SQL
- Apache Spark
- Apache Tajo
- Apache Phoenix
- Apache MRQL
- Kylin
Data Ingestion
- Apache Flume
- Apache Sqoop
outliers:
- Facebook Scribe
- Apache Chukwa
- Apache Storm
- Apache Kafka
- Netflix Suro
- Apache Samza
- Cloudera Morphline
- HIHO
- Apache NiFi
Etc.
Service Programming and
Frameworks
- Apache Zookeeper
- Apache Avro
- Apache Parquet
outliers:
- Apache Thrift
- Apache Curator
- Apache Karaf
- Twitter Elephant Bird
- Linkedin Norbert
Scheduling
- Apache Oozie
outliers:
- Linkedin Azkaban
- Spotify Luigi
- Apache Falcon
- Schedoscope
System Deployment and
Management
outliers:
- Apache Ambari
- Cloudera Manager
- Cloudera HUE
- Apache Whirr
- Apache Mesos
- Myriad
- Marathon
- Brooklyn
- Hortonworks HOYA
- Apache Helix
- Apache Bigtop
- Buildoop
- Deploop
And now a bit of common sense for
sys-admin-ing Hadoop clusters
Avoid
 The usual -
 Don’t let hdfs fill up
 Don’t use all the memory
 Don’t use up all the cpus
 Don’t drop the network
 <insert fav disaster>
 Resource Exhaustion by users
 Hardware Failure (drives are the king of this domain)
Um, backups?
- Usual suspects plus
- Namenode’s metadata!! (fsimage)
- HDFS itself? Well, it would be nice, but it is unlikely (if so, use distcp)
- Snapshots
Hadoop Management
 Apache Ambari
 Cloudera Manager
Monitoring
 The usual suspects plus…
 JMX support
 Jvm via jstat, jmap etc.
 hdfs
 Mapred
 conf/hadoop-metrics.properties
 http://namenode:50070/
 http://namenode:50070/jmx
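The /jmx URL returns JSON, so a monitoring check can be a few lines of script. Here is a minimal sketch that pulls HDFS usage out of that output. The bean name and the CapacityTotal/CapacityUsed attributes are assumptions based on what Hadoop 2.x NameNodes commonly expose, so check them against your own /jmx page; the sample payload is hand-written with made-up numbers so the sketch runs offline.

```python
import json
from urllib.request import urlopen

# Assumed bean name; verify it against the output of your own NameNode.
FS_STATE_BEAN = "Hadoop:service=NameNode,name=FSNamesystemState"

def fetch_beans(url):
    """Download the JMX servlet output and return its list of beans."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)["beans"]

def hdfs_used_percent(beans):
    """Return the percentage of HDFS capacity in use."""
    for bean in beans:
        if bean.get("name") == FS_STATE_BEAN:
            return 100.0 * bean["CapacityUsed"] / bean["CapacityTotal"]
    raise LookupError("bean not found: " + FS_STATE_BEAN)

# Hand-written sample payload (made-up numbers) in the servlet's JSON layout.
sample = json.loads("""
{"beans": [
  {"name": "Hadoop:service=NameNode,name=FSNamesystemState",
   "CapacityTotal": 1000, "CapacityUsed": 250}
]}
""")

used = hdfs_used_percent(sample["beans"])
print(f"HDFS used: {used:.1f}%")

# On a live cluster (the hostname is a placeholder):
# used = hdfs_used_percent(fetch_beans("http://namenode:50070/jmx"))
```

Feeding a number like this into whatever alerting you already run is one way to act on "Don't let hdfs fill up".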
User management
- HDFS quotas
- Access controls
- Internal and external
- MR schedulers
- FIFO, Fair, Capacity
- Kerberos can be used as well
Configuration
- /etc/hadoop/conf
- Lots of knobs!
- ¡Ojo! (Watch out!)
- Lots of overrides
- Get the basic system solid before security and performance
- Watch the units: some are in megabytes but some are in bytes!
- Have canary jobs
- Ensure the same configs are everywhere (including uniform dns/host)
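As an illustration of the units trap: `dfs.blocksize` is given in bytes, while `mapreduce.map.memory.mb` is given in megabytes. A tiny helper that normalizes both makes it harder to compare them wrongly. The values below are example figures (commonly seen in Hadoop 2.x setups), not a recommendation, and the helper name is made up for this sketch.

```python
# Example settings with their units noted explicitly (values are illustrative).
settings = {
    "dfs.blocksize": (134217728, "bytes"),
    "mapreduce.map.memory.mb": (1024, "mb"),
}

BYTES_PER_UNIT = {"bytes": 1, "mb": 1024 * 1024}

def to_megabytes(value, unit):
    """Convert a setting to megabytes so different knobs can be compared."""
    return value * BYTES_PER_UNIT[unit] / BYTES_PER_UNIT["mb"]

normalized = {name: to_megabytes(value, unit) for name, (value, unit) in settings.items()}

for name, megabytes in normalized.items():
    print(f"{name}: {megabytes:g} MB")
```

Writing the unit next to every value in your notes or config management templates is a cheap habit that pays off the first time someone types 128 where 134217728 was expected.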
Want more?
(Disclaimer: I receive nothing from O’Reilly. Not even a Christmas card…)
Fin
- Thanks for listening
- Slides: http://www.slideshare.net/ericwilliammarshall/hadoop-for-sysadmins
- Any questions?
What’s in a name?
 Doug Cutting seems to have been inspired by his
family. Lucene is his wife’s middle name, and her
maternal grandmother’s first name. His son, as a
toddler, used Nutch as the all-purpose word for meal
and later named a yellow stuffed elephant Hadoop.
Doug said he “was looking for a name that wasn’t
already a web domain and wasn’t trademarked, so I
tried various words that were in my life but not used by
anybody else. Kids are pretty good at making up
words.”
What to do?
Combinations of the usual stuff:
 Numerical Summarizations
 Filtering
 Altering Data Organization
 Joining Data
 I/O
federation
(Image from Chapter 2 of Eric Sammer’s Hadoop Operations by O’Reilly)
Mule Experience Hub and Release Channel with Java 17
Bhajan Mehta
 
How UiPath Discovery Suite supports identification of Agentic Process Automat...
How UiPath Discovery Suite supports identification of Agentic Process Automat...How UiPath Discovery Suite supports identification of Agentic Process Automat...
How UiPath Discovery Suite supports identification of Agentic Process Automat...
DianaGray10
 
Integrating Kafka with MuleSoft 4 and usecase
Integrating Kafka with MuleSoft 4 and usecaseIntegrating Kafka with MuleSoft 4 and usecase
Integrating Kafka with MuleSoft 4 and usecase
shyamraj55
 
Feature sql server terbaru performance.pptx
Feature sql server terbaru performance.pptxFeature sql server terbaru performance.pptx
Feature sql server terbaru performance.pptx
ssuser1915fe1
 
Uncharted Together- Navigating AI's New Frontiers in Libraries
Uncharted Together- Navigating AI's New Frontiers in LibrariesUncharted Together- Navigating AI's New Frontiers in Libraries
Uncharted Together- Navigating AI's New Frontiers in Libraries
Brian Pichman
 
The Path to General-Purpose Robots - Coatue
The Path to General-Purpose Robots - CoatueThe Path to General-Purpose Robots - Coatue
The Path to General-Purpose Robots - Coatue
Razin Mustafiz
 
kk vathada _digital transformation frameworks_2024.pdf
kk vathada _digital transformation frameworks_2024.pdfkk vathada _digital transformation frameworks_2024.pdf
kk vathada _digital transformation frameworks_2024.pdf
KIRAN KV
 
Redefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI CapabilitiesRedefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI Capabilities
Priyanka Aash
 
Types of Weaving loom machine & it's technology
Types of Weaving loom machine & it's technologyTypes of Weaving loom machine & it's technology
Types of Weaving loom machine & it's technology
ldtexsolbl
 
MAKE MONEY ONLINE Unlock Your Income Potential Today.pptx
MAKE MONEY ONLINE Unlock Your Income Potential Today.pptxMAKE MONEY ONLINE Unlock Your Income Potential Today.pptx
MAKE MONEY ONLINE Unlock Your Income Potential Today.pptx
janagijoythi
 
Acumatica vs. Sage Intacct _Construction_July (1).pptx
Acumatica vs. Sage Intacct _Construction_July (1).pptxAcumatica vs. Sage Intacct _Construction_July (1).pptx
Acumatica vs. Sage Intacct _Construction_July (1).pptx
BrainSell Technologies
 
Connector Corner: Leveraging Snowflake Integration for Smarter Decision Making
Connector Corner: Leveraging Snowflake Integration for Smarter Decision MakingConnector Corner: Leveraging Snowflake Integration for Smarter Decision Making
Connector Corner: Leveraging Snowflake Integration for Smarter Decision Making
DianaGray10
 
The Impact of the Internet of Things (IoT) on Smart Homes and Cities
The Impact of the Internet of Things (IoT) on Smart Homes and CitiesThe Impact of the Internet of Things (IoT) on Smart Homes and Cities
The Impact of the Internet of Things (IoT) on Smart Homes and Cities
Arpan Buwa
 

Recently uploaded (20)

Improving Learning Content Efficiency with Reusable Learning Content
Improving Learning Content Efficiency with Reusable Learning ContentImproving Learning Content Efficiency with Reusable Learning Content
Improving Learning Content Efficiency with Reusable Learning Content
 
Retrieval Augmented Generation Evaluation with Ragas
Retrieval Augmented Generation Evaluation with RagasRetrieval Augmented Generation Evaluation with Ragas
Retrieval Augmented Generation Evaluation with Ragas
 
Acumatica vs. Sage Intacct vs. NetSuite _ NOW CFO.pdf
Acumatica vs. Sage Intacct vs. NetSuite _ NOW CFO.pdfAcumatica vs. Sage Intacct vs. NetSuite _ NOW CFO.pdf
Acumatica vs. Sage Intacct vs. NetSuite _ NOW CFO.pdf
 
Vulnerability Management: A Comprehensive Overview
Vulnerability Management: A Comprehensive OverviewVulnerability Management: A Comprehensive Overview
Vulnerability Management: A Comprehensive Overview
 
leewayhertz.com-AI agents for healthcare Applications benefits and implementa...
leewayhertz.com-AI agents for healthcare Applications benefits and implementa...leewayhertz.com-AI agents for healthcare Applications benefits and implementa...
leewayhertz.com-AI agents for healthcare Applications benefits and implementa...
 
Communications Mining Series - Zero to Hero - Session 3
Communications Mining Series - Zero to Hero - Session 3Communications Mining Series - Zero to Hero - Session 3
Communications Mining Series - Zero to Hero - Session 3
 
Tailored CRM Software Development for Enhanced Customer Insights
Tailored CRM Software Development for Enhanced Customer InsightsTailored CRM Software Development for Enhanced Customer Insights
Tailored CRM Software Development for Enhanced Customer Insights
 
Mule Experience Hub and Release Channel with Java 17
Mule Experience Hub and Release Channel with Java 17Mule Experience Hub and Release Channel with Java 17
Mule Experience Hub and Release Channel with Java 17
 
How UiPath Discovery Suite supports identification of Agentic Process Automat...
How UiPath Discovery Suite supports identification of Agentic Process Automat...How UiPath Discovery Suite supports identification of Agentic Process Automat...
How UiPath Discovery Suite supports identification of Agentic Process Automat...
 
Integrating Kafka with MuleSoft 4 and usecase
Integrating Kafka with MuleSoft 4 and usecaseIntegrating Kafka with MuleSoft 4 and usecase
Integrating Kafka with MuleSoft 4 and usecase
 
Feature sql server terbaru performance.pptx
Feature sql server terbaru performance.pptxFeature sql server terbaru performance.pptx
Feature sql server terbaru performance.pptx
 
Uncharted Together- Navigating AI's New Frontiers in Libraries
Uncharted Together- Navigating AI's New Frontiers in LibrariesUncharted Together- Navigating AI's New Frontiers in Libraries
Uncharted Together- Navigating AI's New Frontiers in Libraries
 
The Path to General-Purpose Robots - Coatue
The Path to General-Purpose Robots - CoatueThe Path to General-Purpose Robots - Coatue
The Path to General-Purpose Robots - Coatue
 
kk vathada _digital transformation frameworks_2024.pdf
kk vathada _digital transformation frameworks_2024.pdfkk vathada _digital transformation frameworks_2024.pdf
kk vathada _digital transformation frameworks_2024.pdf
 
Redefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI CapabilitiesRedefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI Capabilities
 
Types of Weaving loom machine & it's technology
Types of Weaving loom machine & it's technologyTypes of Weaving loom machine & it's technology
Types of Weaving loom machine & it's technology
 
MAKE MONEY ONLINE Unlock Your Income Potential Today.pptx
MAKE MONEY ONLINE Unlock Your Income Potential Today.pptxMAKE MONEY ONLINE Unlock Your Income Potential Today.pptx
MAKE MONEY ONLINE Unlock Your Income Potential Today.pptx
 
Acumatica vs. Sage Intacct _Construction_July (1).pptx
Acumatica vs. Sage Intacct _Construction_July (1).pptxAcumatica vs. Sage Intacct _Construction_July (1).pptx
Acumatica vs. Sage Intacct _Construction_July (1).pptx
 
Connector Corner: Leveraging Snowflake Integration for Smarter Decision Making
Connector Corner: Leveraging Snowflake Integration for Smarter Decision MakingConnector Corner: Leveraging Snowflake Integration for Smarter Decision Making
Connector Corner: Leveraging Snowflake Integration for Smarter Decision Making
 
The Impact of the Internet of Things (IoT) on Smart Homes and Cities
The Impact of the Internet of Things (IoT) on Smart Homes and CitiesThe Impact of the Internet of Things (IoT) on Smart Homes and Cities
The Impact of the Internet of Things (IoT) on Smart Homes and Cities
 

Hadoop for sysadmins

  • 1. A whirlwind tour of Hadoop, by Eric Marshall, for LOPSA-NJ: an all-too-brief introduction to the world of big data
  • 2. Eric Marshall
      - I work for Airisdata; we’re hiring!
      - Smallest computer I lost sleep over: Sinclair/Timex ZX81, 1 KB of memory
      - Largest computer I lost sleep over: SGI Altix 4700, 1 TB of memory
  • 3. Vocabulary disclaimer
      - Just like your favorite swear word, which can act as many parts of speech and refer to many things, Hadoop vocabulary is overloaded.
      - Casually, people use "Hadoop" to mean storage, processing, a programming model (or several), or a cluster of machines. The same problem exists for other terms in the lexicon, so ask me when I make less sense than usual.
  • 4. My plan of attack
      - An intro: the good, the bad and the ugly at 50,000 ft.
      - 2¢ tour of Hadoop’s processing: MapReduce
      - 2¢ tour of Hadoop’s storage: HDFS
      - A blitz tour of the rest of the Hadoop ecosystem
  • 5. Why did this happen?
      - Old school: scale up == a larger, costlier monolithic system (or a small cluster thereof), i.e. vertical scaling
      - Different approach: all roads lead to scale out
          Assume failures
          Smart software, cheap hardware
          Don’t move data; bring processing to data
  • 6. The Good
      - Simple development (when compared to Message Passing Interface programming)
      - Scale: no shared state, programmers don’t need to know the topology, easy to add hardware
      - Automatic parallelization and distribution of tasks
      - Fault tolerance
      - Works with commodity hardware
      - Open source!
  • 7. The Bad
      - Not a silver bullet :(
      - MapReduce is batch data processing; the time scale is minutes to hours
      - MapReduce is overly simplified/abstracted: you are stuck with the M/R model and it is hard to work smarter
      - MapReduce is low level compared to high-level languages like SQL
      - Not all work decomposes well into parallelized M/R
      - Open source :)
  • 8. The Ugly? Welcome to the rest of our talk! First stop, Map Reduce
  • 9. Hadoop’s MapReduce: Lisp’s map and reduce, plus the associative property, applied to clusters.
  • 10. Map()
      - Imagine a number of servers with lists of first names. What is the most popular name?
          Box 1: isabella William ava mia Emma Alexander
          Box 2: Noah NOAH Isabella Isabella emma Emma
          Box 3: emma Emma Liam liam mason Isabella
      - Map() would apply a function to each element, independent of order. For example, capitalize each word.
      - (MapReduce is covered in greater detail in Chapter 2 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)
  • 11. Map()
      - So we would have:
          Box 1: Isabella William Ava Mia Emma Alexander
          Box 2: Noah Noah Isabella Isabella Emma Emma
          Box 3: Emma Emma Liam Liam Mason Isabella
      - Map() could also apply a function that makes pairs. For example, Isabella becomes (Isabella, 1).
  • 12. Map()
      - So we would have:
          Box 1: (Isabella,1) (William,1) (Ava,1) (Mia,1) (Emma,1) (Alexander,1)
          Box 2: (Noah,1) (Noah,1) (Isabella,1) (Isabella,1) (Emma,1) (Emma,1)
          Box 3: (Emma,1) (Emma,1) (Liam,1) (Liam,1) (Mason,1) (Isabella,1)
      - Now we are almost ready for the reduce, but first the shuffle and sort.
  • 13. Shuffle/Sort
      - So we would have:
          Box 1: (Alexander,1) (Ava,1) (Emma,1) (Emma,1) (Emma,1) (Emma,1) (Emma,1)
          Box 2: (Isabella,1) (Isabella,1) (Isabella,1) (Isabella,1)
          Box 3: (Liam,1) (Liam,1) (Mason,1) (Mia,1) (Noah,1) (Noah,1) (William,1)
      - Now for the reduce: our function would sum all of the 1s and return each name with its count.
  • 14. Reduce
      - Starting from the shuffled and sorted pairs:
          Box 1: (Alexander,1) (Ava,1) (Emma,1) (Emma,1) (Emma,1) (Emma,1) (Emma,1)
          Box 2: (Isabella,1) (Isabella,1) (Isabella,1) (Isabella,1)
          Box 3: (Liam,1) (Liam,1) (Mason,1) (Mia,1) (Noah,1) (Noah,1) (William,1)
      - The reduce function sums all of the 1s and returns each name with its count:
          Box 1: (Alexander,1) (Ava,1) (Emma,5)
          Box 2: (Isabella,4)
          Box 3: (Liam,2) (Mason,1) (Mia,1) (Noah,2) (William,1)
      - (See https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html for something similar coded in Java)
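The name-counting walk-through above can be replayed on a single machine. This is only a toy sketch in plain Python, not Hadoop code: the function names `map_phase`, `shuffle_sort` and `reduce_phase` are made up for illustration, and each "box" is just a list.

```python
from collections import defaultdict

# The three "boxes" of first names from the slides.
boxes = [
    ["isabella", "William", "ava", "mia", "Emma", "Alexander"],
    ["Noah", "NOAH", "Isabella", "Isabella", "emma", "Emma"],
    ["emma", "Emma", "Liam", "liam", "mason", "Isabella"],
]

def map_phase(names):
    """Map: normalize each name and emit a (name, 1) pair, independent of order."""
    return [(name.capitalize(), 1) for name in names]

def shuffle_sort(mapped_boxes):
    """Shuffle/sort: gather all pairs that share a key, with keys in sorted order."""
    grouped = defaultdict(list)
    for pairs in mapped_boxes:
        for name, one in pairs:
            grouped[name].append(one)
    return sorted(grouped.items())

def reduce_phase(grouped):
    """Reduce: sum the 1s for each name."""
    return {name: sum(ones) for name, ones in grouped}

counts = reduce_phase(shuffle_sort([map_phase(box) for box in boxes]))
most_popular = max(counts, key=counts.get)
```

Because addition is associative, the sums could just as well be computed per box first and combined later, which is what lets the real thing spread the reduce across machines.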
  • 15. (This architecture is covered in greater detail in Chapter 4 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)
  • 17. Map/reduce failures
      - Check the job if:
          The job throws an uncaught exception.
          The job exits with a nonzero exit code.
          The job fails to report progress to the tasktracker for a configurable amount of time (i.e. hung, stuck, slow).
      - Check the node if: the same node keeps killing jobs.
      - Check the JobTracker/ResourceManager if: jobs are lost or stuck and then they all fail.
  • 18. Instant MR test
      - Um, is the system working?
          yarn jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100
      - (Your jar most likely will be somewhere else.)
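For a feel of what a `pi` smoke test is doing, here is a rough single-machine sketch. It assumes the two arguments mean "number of maps" and "samples per map", and it swaps in plain pseudo-random dart throwing rather than whatever sampling scheme the bundled example jar really uses; `map_task` and `reduce_task` are hypothetical names, and the sample count is bumped well above the slide's 100 so the estimate is steadier.

```python
import random

def map_task(seed, samples):
    """One 'map': throw darts at the unit square, count hits inside the quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside, samples

def reduce_task(results):
    """The 'reduce': sum the per-map counts and turn the ratio into an estimate of pi."""
    inside = sum(r[0] for r in results)
    total = sum(r[1] for r in results)
    return 4.0 * inside / total

n_maps, n_samples = 10, 10_000
estimate = reduce_task([map_task(seed, n_samples) for seed in range(n_maps)])
```

The maps share nothing and the reduce is a sum, so the job exercises scheduling, task launch and the shuffle without needing any input data, which is why it makes a handy "is it alive?" check.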
  • 19. (HDFS is covered in greater detail in Chapter 3 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)
  • 20. There is an HDFS CLI
      - You already know some of the commands:
          hdfs dfs -ls /
          hdfs dfs -du /
          hdfs dfs -rm /
          hdfs dfs -cat /
      - There are other modes than dfs: dfsadmin, namenode, datanode, fsck, zkfc, balancer, etc.
  • 21. HDFS failures
      - Jobs fail due to missing blocks
      - Jobs fail because data is being moved around (datanodes down, or a huge ingest)
      - Without NameNode HA, the NameNode is a single point of failure for everything
      - Regular file system mayhem that you already know and love
      - Plus the usual perms issues
  • 23. The rest of the garden
      - Distributed Filesystems: Apache HDFS. Outliers: Tachyon, Apache GridGain, Ignite, XtreemFS, Ceph Filesystem, Red Hat GlusterFS, Quantcast File System QFS, Lustre
      - Security (outliers): Apache Sentry, Apache Knox Gateway, Apache Ranger
      - Distributed Programming: Apache MapReduce (also MRv2/YARN), Apache Pig. Outliers: JAQL, Apache Spark, Apache Flink (formerly Stratosphere), Netflix PigPen, AMPLab SIMR, Facebook Corona, Apache Twill, Damballa Parkour, Apache Hama, Datasalt Pangool, Apache Tez, Apache Llama, Apache DataFu, Pydoop, Kangaroo, TinkerPop, Pachyderm MapReduce
      - NewSQL Databases (outliers): TokuDB, HandlerSocket, Akiban Server, Drizzle, Haeinsa, SenseiDB, Sky, BayesDB, InfluxDB
      - NoSQL Databases
          Column Data Model: Apache HBase. Outliers: Apache Accumulo, Hypertable, HP Vertica
          Key-Value Data Model: Apache Cassandra, Riak, Redis, LinkedIn Voldemort
          Document Data Model (outliers): MongoDB, RethinkDB, ArangoDB, CouchDB
          Stream Data Model (outliers): EventStore
          Key-Value Data Model (outliers): Redis DataBase, LinkedIn Voldemort, RocksDB, OpenTSDB
          Graph Data Model (outliers): Neo4j, ArangoDB, TitanDB, OrientDB, Intel GraphBuilder, Giraph, Pegasus, Apache Spark
      - Scheduling: Apache Oozie. Outliers: LinkedIn Azkaban, Spotify Luigi, Apache Falco
  • 24. 10 in 10 minutes!
      - Easier programming: Pig, Spark
      - SQL-like tools: Hive, Impala, HBase
      - Data pipefitting: Sqoop, Flume, Kafka
      - Bookkeeping: Oozie, Zookeeper
  • 26. Pig
      - What is it: a high-level programming language from Yahoo for data manipulation that abstracts away M/R
      - Why: a few lines of code to munge data
      - Example:
          filtered_words = FILTER words BY word MATCHES '\\w+';
          word_groups = GROUP filtered_words BY word;
          word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
      - (Pig is covered in greater detail in Alan Gates’s Programming Pig by O’Reilly and in Chapter 16 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)
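To see what the three Pig statements compute, here is a plain-Python sketch of the same filter, group and count. The `words` list is invented stand-in data (the Pig snippet assumes a `words` relation already exists), and `re.fullmatch` is used because Pig's `MATCHES` has to match the whole string.

```python
import re
from collections import Counter

# Hypothetical stand-in for the `words` relation the Pig snippet starts from.
words = ["hadoop", "pig", "--", "hadoop", "yarn", "!!", "pig", "hadoop"]

# FILTER words BY word MATCHES '\\w+'  (keep only tokens made of word characters)
filtered_words = [w for w in words if re.fullmatch(r"\w+", w)]

# GROUP filtered_words BY word, then COUNT each group
word_count = Counter(filtered_words)
```

The Pig version reads about the same, but Pig turns it into M/R jobs that run across the cluster instead of a loop on one box.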
  • 27. Spark
      - What is it: a computing framework from AMPLab, UC Berkeley
      - Why: high-level abstractions and better use of memory
      - Neat trick: in-memory RDDs
      - Example:
          scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
        Or, in Python:
          >>> linesWithSpark = textFile.filter(lambda line: "Spark" in line)
      - (Spark is covered in greater detail by Matei Zaharia et al. in Learning Spark by O’Reilly. Also of note is Advanced Analytics with Spark; it shows Spark’s capabilities well but moves way too quickly to be truly useful. Spark is also covered in Chapter 19 of Tom White’s Hadoop – The Definitive Guide by O’Reilly, latest edition only.)
  • 29. Hive/HQL
      - What is it: a data infrastructure and query language from Facebook
      - Why: batched SQL queries against HDFS
      - Neat trick: stores metadata so you don’t have to
      - Example:
          hive> LOAD DATA INPATH '/user/work/input/BX-BooksCorrected.csv' OVERWRITE INTO TABLE BXDataSet;
          hive> select yearofpublication, count(booktitle) from bxdataset group by yearofpublication;
      - (Hive is covered in greater detail by Jason Rutherglen et al. in Programming Hive by O’Reilly. Instant Apache Hive Essentials How-To by Darren Lee (Packt) was useful to me as a tutorial. Hive is also covered in Chapter 17 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)
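The shape of that Hive query can be tried without a cluster. The sketch below runs the same GROUP BY against an in-memory SQLite table from Python's standard library; SQLite is only a stand-in here (it is not Hive, and nothing touches HDFS), the three rows are made up, and an ORDER BY is added so the output order is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bxdataset (booktitle TEXT, yearofpublication INTEGER)")

# Made-up rows standing in for the loaded CSV.
rows = [("Book A", 1999), ("Book B", 1999), ("Book C", 2001)]
conn.executemany("INSERT INTO bxdataset VALUES (?, ?)", rows)

# Same aggregation as the hive> query above: titles per publication year.
result = conn.execute(
    "SELECT yearofpublication, COUNT(booktitle) FROM bxdataset "
    "GROUP BY yearofpublication ORDER BY yearofpublication"
).fetchall()
conn.close()
```

The difference in Hive is where the work happens: the query is compiled into batch jobs over files in HDFS, so expect job-style latency rather than an instant answer.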
  • 30. Impala
      - What is it: a SQL query engine from Cloudera
      - Why: fast ad hoc queries on subsets of data stored in Hadoop
      - Example:
          [impala-host:21000] > select count(*) from customer_address;
      - (Reading recommendations: nada; let me know if you hit pay dirt)
  • 31. HBase
      - What is it: a non-relational database from Powerset
      - Why: fast access to large sparse data sets
      - Example:
          hbase(main):001:0> create 'test', 'cf'
          0 row(s) in 0.4170 seconds
          => Hbase::Table - test
          hbase(main):003:0> put 'test', 'row1', 'cf:a', 'value1'
          0 row(s) in 0.0850 seconds
          hbase(main):006:0> scan 'test'
          ROW     COLUMN+CELL
          row1    column=cf:a, timestamp=1421762485768, value=value1
      - (HBase is covered in Chapter 20 of Tom White’s Hadoop – The Definitive Guide by O’Reilly and in greater detail in Lars George’s HBase – The Definitive Guide by O’Reilly)
  • 33. Sqoop
      - What is it: a glue tool for moving data between relational databases and Hadoop
      - Why: make the cumbersome easier
      - Example:
          sqoop list-databases --connect jdbc:mysql://mysql/employees --username joe --password myPassword
      - (Sqoop is covered in greater detail in Chapter 15 of Tom White’s Hadoop – The Definitive Guide by O’Reilly. There is also a cookbook that covers a few worthy gotchas: Apache Sqoop Cookbook by Kathleen Ting, O’Reilly)
  • 34. Flume
      - What is it: a service for collecting and aggregating logs
      - Why: because log ingestion is tougher than it seems
      - Example:
          # Define a memory channel on agent called memory-channel.
          agent.channels.memory-channel.type = memory
          # Define a source on agent and connect to channel memory-channel.
          agent.sources.tail-source.type = exec
          agent.sources.tail-source.command = tail -F /var/log/system.log
          agent.sources.tail-source.channels = memory-channel
          # Define a sink that outputs to logger.
          agent.sinks.log-sink.channel = memory-channel
          agent.sinks.log-sink.type = logger
          # Define a sink that outputs to hdfs.
          agent.sinks.hdfs-sink.channel = memory-channel
          agent.sinks.hdfs-sink.type = hdfs
          agent.sinks.hdfs-sink.hdfs.path = hdfs://localhost:54310/tmp/system.log/
          agent.sinks.hdfs-sink.hdfs.fileType = DataStream
          # Finally, activate.
          agent.channels = memory-channel
          agent.sources = tail-source
          agent.sinks = log-sink hdfs-sink
      - (I haven’t read much on Flume; if you find something clever let me know!)
  • 35. Kafka
      - What is it: a message broker from LinkedIn
      - Why: fast handling of data feeds
      - Neat trick: no need to worry about missing data or double processing data
      - Example:
          > bin/kafka-console-producer.sh --zookeeper localhost:2181 --topic test
          This is a message
          This is another message
          > bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
          This is a message
          This is another message
      - (I disliked the one book I read, but I found the online docs very readable: http://kafka.apache.org/ Also check out the design docs: http://kafka.apache.org/documentation.html#design )
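One way to picture why `--from-beginning` can replay old messages: the broker keeps an append-only log per topic, and each consumer just remembers how far it has read. The toy model below is an assumption-laden sketch of that idea in plain Python; `TopicLog` and `Consumer` are invented names and have nothing to do with the real Kafka client API.

```python
class TopicLog:
    """Toy append-only log for one topic (not the Kafka client API)."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Producer side: add a message and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset):
        """Consumer side: return every message at or after `offset`."""
        return self._messages[offset:]


class Consumer:
    """Toy consumer that tracks its own position in the log."""

    def __init__(self, log, from_beginning=False):
        self._log = log
        # Start at offset 0 to replay history, else at the current end of the log.
        self.offset = 0 if from_beginning else len(log.read(0))

    def poll(self):
        batch = self._log.read(self.offset)
        self.offset += len(batch)
        return batch


log = TopicLog()
log.append("This is a message")
log.append("This is another message")

replaying = Consumer(log, from_beginning=True)
tailing = Consumer(log)
first_poll = replaying.poll()   # both messages, replayed from offset 0
second_poll = replaying.poll()  # nothing new, and nothing delivered twice
tail_poll = tailing.poll()      # started at the end, so nothing yet
```

Since reading never removes anything from the log, a consumer that crashes can pick up again from its last recorded offset.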
  • 37. Oozie
      - What is it: a workflow scheduler from Yahoo Bangalore
      - Why: because cron isn’t perfect
      - Example:
          oozie job -oozie http://localhost:8080/oozie -config examples/apps/map-reduce/job.properties -run
      - (Oozie is covered in greater detail in Islam & Srinivasan’s Apache Oozie: The Workflow Scheduler by O’Reilly)
  • 38. Zookeeper
      - What is it: a coordination service from Yahoo
      - Why: sync info for distributed systems (a similar idea to DNS or LDAP)
      - Example:
          [zkshell: 14] set /zk_test junk
          cZxid = 5
          ctime = Fri Jun 05 13:57:06 PDT 2009
          mZxid = 6
          mtime = Fri Jun 05 14:01:52 PDT 2009
          pZxid = 5
          [zkshell: 15] get /zk_test
          junk
          cZxid = 5
          ctime = Fri Jun 05 13:57:06 PDT 2009
          mZxid = 6
          mtime = Fri Jun 05 14:01:52 PDT 2009
          pZxid = 5
      - (Zookeeper is covered in greater detail in Zookeeper: Distributed Process Coordination by O’Reilly and in Chapter 21 of Tom White’s Hadoop – The Definitive Guide by O’Reilly)
  • 39. Distributed Programming
      - Apache MapReduce (also MRv2/YARN), Apache Pig
      - Outliers: JAQL, Apache Spark, Apache Flink (formerly Stratosphere), Netflix PigPen, AMPLab SIMR, Facebook Corona, Apache Twill, Damballa Parkour, Apache Hama, Datasalt Pangool, Apache Tez, Apache Llama, Apache DataFu, Pydoop, Kangaroo, TinkerPop, Pachyderm MapReduce
  • 40. Distributed Filesystems
      - Apache HDFS
      - Outliers: Tachyon, Apache GridGain, Ignite, XtreemFS, Ceph Filesystem, Red Hat GlusterFS, Quantcast File System QFS, Lustre
  • 41. NoSQL Databases
      - Column Data Model: Apache HBase. Outliers: Apache Accumulo, Hypertable, HP Vertica
      - Key-Value Data Model: Apache Cassandra, Riak, Redis, LinkedIn Voldemort
      - Document Data Model (outliers): MongoDB, RethinkDB, ArangoDB, CouchDB
      - Stream Data Model (outliers): EventStore
      - Key-Value Data Model (outliers): Redis DataBase, LinkedIn Voldemort, RocksDB, OpenTSDB
      - Graph Data Model (outliers): Neo4j, ArangoDB, TitanDB, OrientDB, Intel GraphBuilder, Giraph, Pegasus, Apache Spark
      - NewSQL Databases (outliers): TokuDB, HandlerSocket, Akiban Server, Drizzle, Haeinsa, SenseiDB, Sky, BayesDB, InfluxDB
  • 42. SQL on Hadoop and Data Ingestion
      - SQL on Hadoop: Apache Hive, Apache HCatalog. Outliers: Cloudera Kudu, Trafodion, Apache Drill, Cloudera Impala, Facebook Presto, Datasalt Splout SQL, Apache Spark, Apache Tajo, Apache Phoenix, Apache MRQL, Kylin
      - Data Ingestion: Apache Flume, Apache Sqoop. Outliers: Facebook Scribe, Apache Chukwa, Apache Storm, Apache Kafka, Netflix Suro, Apache Samza, Cloudera Morphline, HIHO, Apache NiFi
  • 43. Etc.
      - Service Programming and Frameworks: Apache Zookeeper, Apache Avro, Apache Parquet. Outliers: Apache Thrift, Apache Curator, Apache Karaf, Twitter Elephant Bird, LinkedIn Norbert
      - Scheduling: Apache Oozie. Outliers: LinkedIn Azkaban, Spotify Luigi, Apache Falcon, Schedoscope
      - Security (outliers): Apache Sentry, Apache Knox Gateway, Apache Ranger
      - System Deployment and Management (outliers): Apache Ambari, Cloudera Manager, Cloudera HUE, Apache Whirr, Apache Mesos, Myriad, Marathon, Brooklyn, Hortonworks HOYA, Apache Helix, Apache Bigtop, Buildoop, Deploop
  • 44. And now a bit of common sense for sys-admin-ing Hadoop clusters
  • 45. Avoid
      - The usual:
          Don’t let HDFS fill up
          Don’t use all the memory
          Don’t use up all the CPUs
          Don’t drop the network
          <insert fav disaster>
      - Resource exhaustion by users
      - Hardware failure (drives are the king of this domain)
  • 46. Um, backups?
      - Usual suspects, plus:
          The NameNode’s metadata!! (fsimage)
          HDFS itself? Well, it would be nice, but it is unlikely (if so, use distcp)
          Snapshots
  • 47. Hadoop Management
      - Apache Ambari
      - Cloudera Manager
  • 48. Monitoring
      - The usual suspects plus…
      - JMX support
      - JVM via jstat, jmap, etc.
      - hdfs
      - Mapred
      - conf/hadoop-metrics.properties
      - http://namenode:50070/
      - http://namenode:50070/jmx
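The /jmx page returns JSON, which makes it easy to feed into whatever monitoring you already run. The sketch below parses a canned response rather than hitting a live NameNode; the bean name `Hadoop:service=NameNode,name=FSNamesystemState`, the `CapacityTotal`/`CapacityUsed` attributes, the sample numbers and the 80% threshold are all assumptions for illustration, so check them against what your own cluster actually reports.

```python
import json

# Hypothetical, trimmed-down response body; real output has many more beans and fields.
sample_jmx = """
{"beans": [
  {"name": "Hadoop:service=NameNode,name=FSNamesystemState",
   "CapacityTotal": 1000, "CapacityUsed": 850, "CapacityRemaining": 150}
]}
"""

def hdfs_percent_used(jmx_text,
                      bean_name="Hadoop:service=NameNode,name=FSNamesystemState"):
    """Find the named bean in a /jmx JSON document and return percent of capacity used."""
    for bean in json.loads(jmx_text)["beans"]:
        if bean.get("name") == bean_name:
            return 100.0 * bean["CapacityUsed"] / bean["CapacityTotal"]
    raise KeyError(bean_name)

percent_used = hdfs_percent_used(sample_jmx)
needs_attention = percent_used >= 80.0  # made-up alert threshold
```

Swapping the canned string for the body of an HTTP GET against the /jmx URL would turn this into a crude "is HDFS filling up?" check.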
  • 49. User management
      - HDFS quotas
      - Access controls: internal and external
      - MR schedulers: FIFO, Fair, Capacity
      - Kerberos can be used as well
  • 50. Configuration
      - /etc/hadoop/conf
      - Lots of knobs!
      - ¡Ojo! (watch out): lots of overrides
      - Get the basic system solid before security and performance
      - Watch the units: some are in megabytes but some are in bytes!
      - Have canary jobs
      - Ensure the same configs are everywhere (including uniform DNS/host names)
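The units warning is easy to trip over when comparing settings side by side. As a small sketch: to the best of my knowledge `dfs.blocksize` is given in bytes and `mapreduce.map.memory.mb` in megabytes, but treat that table (and the helper names, which are invented here) as assumptions to verify against the docs for your Hadoop version.

```python
MB = 1024 * 1024

# Multiplier that converts each property's raw value to bytes (assumed units).
UNITS_IN_BYTES = {
    "dfs.blocksize": 1,              # value is in plain bytes
    "mapreduce.map.memory.mb": MB,   # value is in megabytes
}

def to_bytes(prop, value):
    """Normalize a raw config value (as read from an XML file) to bytes."""
    return int(value) * UNITS_IN_BYTES[prop]

def to_megabytes(prop, value):
    """Same value expressed in megabytes, for side-by-side comparison."""
    return to_bytes(prop, value) / MB

block_mb = to_megabytes("dfs.blocksize", "134217728")
map_mem_mb = to_megabytes("mapreduce.map.memory.mb", "1024")
```

Normalizing everything to one unit before eyeballing or diffing configs across nodes avoids the classic "why is this container a thousand times too small" surprise.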
  • 51. Want more? (Disclaimer: I receive nothing from O’Reilly. Not even a Christmas card…)
  • 52. Fin
      - Thanks for listening
      - Slides: http://www.slideshare.net/ericwilliammarshall/hadoop-for-sysadmins
      - Any questions?
  • 53. What’s in a name? Doug Cutting seems to have been inspired by his family. Lucene is his wife’s middle name, and her maternal grandmother’s first name. His son, as a toddler, used Nutch as the all-purpose word for meal and later named a yellow stuffed elephant Hadoop. Doug said he “was looking for a name that wasn’t already a web domain and wasn’t trademarked, so I tried various words that were in my life but not used by anybody else. Kids are pretty good at making up words.”
  • 54. What to do? Combinations of the usual stuff:
      - Numerical summarizations
      - Filtering
      - Altering data organization
      - Joining data
      - I/O
  • 55. federation (Image from Chapter 2 of Eric Sammer’s Hadoop Operations by O’Reilly)