9/2017 STL HUG - Back to School

Hadoop 101: Back to School
St. Louis Hadoop Users Group
Wednesday, September 6, 2017
Photo by JJ Thompson on Unsplash

Agenda
1. The V’s of Big Data
2. Hadoop Foundation
3. Hadoop Projects
a. Flume, Hive, Sqoop, Spark, Storm, and Kafka
4. Use Cases
5. Cloud
6. Getting your own environment set up

The V’s of Big Data
Photo by Bruno Martins on Unsplash

1. Volume - quantity of data, too much for one machine
2. Variety - tweets, videos, IoT, databases, logs
3. Velocity - batch, streaming from many devices
4. Variability - meaning of data changes, ex: sentiment
5. Veracity - data quality, accuracy

Hadoop Goals
● Scalability
● Reliability
● Cost
● Parallel processing

Hadoop Support among distros
● Commercial offerings from Amazon, Cloudera, Hortonworks, IBM, & MapR - Merv Adrian’s blog
● Five supporters
○ Apache HDFS, Apache MapReduce, Apache YARN, Apache Avro, Apache Flume, Apache HBase, Apache Hive,
Apache Oozie, Apache Parquet, Apache Pig, Apache Solr, Apache Spark, Apache Sqoop, Apache Zookeeper
● Four supporters
○ Apache Kafka, Apache Mahout, Hue
● Three supporters
○ Apache DataFu, Apache Impala, Cascading
● Be careful about versions!
○ Ex: Spark 1.6 vs Spark 2.x, Sqoop1 vs Sqoop2

38
Total number of projects on the Apache Software Foundation “big data” list
Not counting Apache Hive, Apache HBase + others!

Apache Hadoop - Hadoop Distributed File System (HDFS)
● Store data across many machines
● Designed to store large files
○ Files are split into blocks
○ Blocks are replicated across different nodes in the cluster
● Many other Hadoop projects store their data in HDFS
● Using HDFS
○ Indirectly via other services (Hive, HBase, Spark, etc)
○ Access it directly using the command line:
■ hdfs dfs -help
■ hdfs dfs -ls
■ hdfs dfs -mkdir /tmp/something
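
The block splitting and replication described above can be sketched as a toy model. This is only an illustration, not how HDFS is implemented: the real NameNode uses a rack-aware placement policy, and the block size (128, read as MB), replication factor (3), node names, and function names here are assumptions chosen for the example.

```python
BLOCK_SIZE = 128   # assumed block size, in MB for readability
REPLICATION = 3    # assumed replication factor

def split_into_blocks(file_size_mb, block_size=BLOCK_SIZE):
    """Return the size of each block a file of file_size_mb is split into."""
    full, rest = divmod(file_size_mb, block_size)
    return [block_size] * full + ([rest] if rest else [])

def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical placement: real HDFS also considers racks and free space.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        start = block_id % len(nodes)
        placement[block_id] = [
            nodes[(start + i) % len(nodes)] for i in range(replication)
        ]
    return placement

blocks = split_into_blocks(300)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
for block_id, replicas in placement.items():
    print(block_id, blocks[block_id], replicas)
```

Because every block lives on several nodes, losing one node in this model still leaves at least two copies of each block.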

Apache MapReduce
● Framework for processing data in HDFS
● Largely being replaced by higher level frameworks like Spark, Hive, etc.
● Core concepts are still important
○ A Job is split into multiple tasks to execute in parallel
○ Map - a transformation, filter, and/or sorting
○ Reduce - summarization like count, average, etc.
● Using MapReduce
○ Write a Java app using MapReduce API
○ Submit to run on the cluster
bin/hadoop jar hadoop-mapreduce-examples-<ver>.jar wordcount \
  -files cachefile.txt -libjars mylib.jar -archives myarchive.zip \
  input output
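
To make the Map and Reduce concepts concrete, here is a minimal single-process word count in plain Python. It is a sketch of the programming model only: it does not use the Hadoop MapReduce Java API, nothing runs in parallel, and the function names are made up for this example.

```python
from collections import defaultdict

def map_phase(line):
    """Map: transform one input line into (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: summarize all values for one key (here, a count)."""
    return key, sum(values)

def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

On a cluster, each map task would handle a different block of the input and each reduce task a different subset of the keys.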

Apache Flume
● Tool for reliably ingesting data into Hadoop
● Core concepts
○ Agent - JVM processing event flow
○ Source - input - events from files, avro, thrift, twitter, kafka, etc.
○ Channel - passive store until event is consumed by the sink
○ Sink - output - to HDFS or another agent
● Using Flume
○ Create configuration file (Java properties file)
○ Start flume agent on nodes using command line
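
The source, channel, and sink roles can be modeled in a few lines of Python. This is a conceptual sketch, not Flume code or Flume configuration: the class and function names are hypothetical, and a real channel also provides transactions and durability that are omitted here.

```python
from collections import deque

class Channel:
    """Passive buffer: holds events until a sink takes them."""
    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take(self):
        return self._events.popleft() if self._events else None

def run_source(lines, channel):
    """Source: turn each incoming line into an event on the channel."""
    for line in lines:
        channel.put({"body": line})

def run_sink(channel, output):
    """Sink: drain the channel and write event bodies to the destination."""
    while True:
        event = channel.take()
        if event is None:
            break
        output.append(event["body"])

channel = Channel()
destination = []  # stands in for HDFS or the next agent
run_source(["GET /index.html", "GET /about.html"], channel)
run_sink(channel, destination)
print(destination)
```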

Apache Hive
● Query files in HDFS with “SQL”
● Schema on read
● Supports a variety of file formats
○ Plain text - delimited files like CSV, TSV
○ Columnar file formats - ORC, Parquet
○ Avro
○ JSON (with a serde)
● Using Hive
○ Command line with hive from the edge node
○ beeline (command line tool) - uses JDBC
○ Web UI like Hue or Ambari
○ SQuirreL or other clients
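
"Schema on read" means the raw file is stored untouched and a schema is applied only when the data is queried. The sketch below imitates that idea for a delimited text file in plain Python. The sample data, column names, and helper function are assumptions for illustration; returning None for a value that does not fit the declared type is meant to mirror how a query engine can surface such values as NULL instead of rejecting the file.

```python
import csv
import io

# Raw file contents, stored as-is with no schema enforced at write time.
RAW = "1,alice,34.5\n2,bob,not_a_number\n"

# Schema supplied at read time: (column name, type) pairs.
SCHEMA = [("id", int), ("name", str), ("score", float)]

def read_with_schema(raw_text, schema):
    """Parse delimited text, casting each field according to the schema."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        row = {}
        for (name, cast), value in zip(schema, fields):
            try:
                row[name] = cast(value)
            except ValueError:
                row[name] = None  # value does not match the declared type
        rows.append(row)
    return rows

rows = read_with_schema(RAW, SCHEMA)
print(rows)
```

The same raw text could be read again later with a different schema, which is the flexibility schema on read provides.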

Apache Sqoop
● Move data between Hadoop and structured data stores like relational databases
○ Import - From RDBMS to Hadoop
○ Export - From Hadoop to RDBMS
● Uses JDBC to connect to the database and can write files to HDFS and/or Hive
● Using Sqoop
○ Use the command line tool from the edge node
$ sqoop import \
  --query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
  --split-by a.id --target-dir /user/foo/joinresults
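
The $CONDITIONS placeholder and --split-by option work together: each parallel import task gets its own range of the split column substituted into the query. The sketch below shows that idea under stated assumptions: the column range (0 to 1000), the four tasks, and the helper name are hypothetical, and the exact boundary arithmetic Sqoop uses may differ.

```python
def split_conditions(column, lo, hi, num_mappers=4):
    """Build one WHERE fragment per task, covering [lo, hi] without overlap."""
    step = (hi - lo) / num_mappers
    bounds = [lo + round(step * i) for i in range(num_mappers)] + [hi]
    conditions = []
    for i in range(num_mappers):
        # The last range is closed so the maximum value is not dropped.
        upper_op = "<=" if i == num_mappers - 1 else "<"
        conditions.append(
            f"{column} >= {bounds[i]} AND {column} {upper_op} {bounds[i + 1]}"
        )
    return conditions

QUERY = "SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS"
queries = [
    QUERY.replace("$CONDITIONS", cond)
    for cond in split_conditions("a.id", 0, 1000)
]
for q in queries:
    print(q)
```

An evenly distributed split column keeps the tasks balanced; a skewed one leaves some tasks with most of the rows.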

Apache Spark
● Framework for batch and streaming (micro-batch) data processing
● Faster (in memory!) and easier to use than MapReduce
● Modules
○ Spark SQL for SQL and structured data processing
○ MLlib for machine learning
○ GraphX for graph processing
○ Spark Streaming for stream processing
● Using Spark
○ Write a Spark application using Python, Scala, or Java APIs, then “submit” the application to the cluster
○ Use pyspark, the Python REPL (read-eval-print loop)
○ Use spark-shell, the Scala REPL
○ Notebook like Jupyter, Zeppelin
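
"Micro-batch" streaming means incoming events are collected for a short interval and each interval is then processed as a small batch job. The following plain-Python sketch illustrates only that grouping step. It does not use the Spark API, and the event data, one-second interval, and function name are assumptions for the example.

```python
from collections import defaultdict

def micro_batches(events, interval):
    """Group (timestamp, value) events into batches of `interval` seconds."""
    batches = defaultdict(list)
    for ts, value in events:
        batches[int(ts // interval)].append(value)
    return [batches[k] for k in sorted(batches)]

events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
batches = micro_batches(events, interval=1)

# Each batch is processed as a unit, here with a simple count.
counts = [len(batch) for batch in batches]
print(batches, counts)
```

The batch interval sets a floor on latency: an event is not processed until the interval it arrived in has closed.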

Apache Storm
● Framework for processing streaming data in real-time
● Message at a time, not micro-batch
● Concepts
○ Tuples – an ordered list of elements
○ Streams – an unbounded sequence of tuples
○ Spouts – bring data in, create tuples
○ Bolts – process streams of data
○ Topologies – network of spouts and bolts
● Using Storm
○ Write Java code to build a storm topology
○ Submit uber jar to the cluster with storm CLI
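
The spout, bolt, and topology concepts map naturally onto chained Python generators, where each tuple flows through the whole pipeline as soon as it is emitted. This is a single-process illustration, not the Storm Java API: the function names and sample sentences are made up, and a real topology runs its spouts and bolts in parallel across the cluster.

```python
def sentence_spout():
    """Spout: bring data in and emit it as one-element tuples."""
    for sentence in ["the cow jumped", "over the moon"]:
        yield (sentence,)

def split_bolt(stream):
    """Bolt: split each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)

def count_bolt(stream):
    """Bolt: keep a running count and emit (word, count) per message."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])

# Topology: wire the spout into the chain of bolts.
topology = count_bolt(split_bolt(sentence_spout()))
results = list(topology)
print(results)
```

Unlike a micro-batch model, the count for a word is updated and emitted immediately for every message.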

Apache Kafka
● Publish-subscribe messaging for streaming data
● Installed on a cluster, data stored locally on disk
● Core concepts
○ Topics - streams of records (key, value), split across partitions and stored in order within each partition
○ Producer - puts data on topics
○ Consumer(s) - read data off topics
● Data is retained for a limited amount of time
● Consumers can read data from a given offset
● Using Kafka
○ Use the client API to produce/consume data directly, or let another service use Kafka to persist data for streaming
○ Command line utilities for debugging
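
Topics, partitions, and offsets can be pictured as a set of append-only lists. The toy class below is a sketch of that mental model only, not the Kafka client API: the class, its methods, and the byte-sum partitioner are hypothetical, and retention, replication, and consumer groups are left out.

```python
class Topic:
    """Toy topic: each partition is an append-only list of (key, value)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        """Append a record; records with the same key share a partition."""
        # Deterministic toy partitioner (an assumption, not Kafka's hash).
        p = sum(key.encode()) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def consume(self, partition, offset):
        """Read every record in a partition starting at the given offset."""
        return self.partitions[partition][offset:]

topic = Topic(num_partitions=3)
p1, o1 = topic.produce("sensor-1", "20C")
p2, o2 = topic.produce("sensor-1", "21C")
topic.produce("sensor-2", "18C")

# A consumer can re-read from any offset it chooses.
print(topic.consume(p1, o1))
print(topic.consume(p1, o2))
```

Because the consumer, not the broker, tracks the offset, the same data can be replayed or read by several independent consumers.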

Use Case #1 - Website Analytics
Photo by Igor Ovsyannykov on Unsplash

Quiz #1 Answers
Blue lines are Flume agents used to ingest web logs from servers into Hadoop
Orange line is Sqoop used to move data from Hadoop to a relational database

Use Case #2 - Data Warehouse Augmentation
Photo by Samuel Zeller on Unsplash

Quiz #2 Answers
Blue lines are Sqoop used to move data from relational database to Hadoop
Orange lines would be Hive to query the data in Hadoop with SQL

Use Case #3 - IoT

Quiz #3 Answers
Blue lines are Kafka, good intermediary between IoT devices and your stream processor
Orange lines could be Spark Streaming or Storm to process the data

Cloud
● Cloud offerings of Hadoop: Azure HDInsight, Amazon EMR, Google Cloud Dataproc
● Roll your own with Infrastructure as a Service
● Pros: Quicker time to market, easier to scale, integration with other cloud services
● Separation of storage and compute
○ Sacrifice storage performance for faster/easier scalability

Getting Started
● Useful skills
○ Java - troubleshooting errors
○ Linux - command line, ssh
● Locally
○ PC with 16 GB of RAM
○ VirtualBox, PuTTY, Browser
○ Sandbox from Hortonworks / Cloudera
● Cloud
○ Images available on Azure/Amazon
● Learning
○ Hadoop weekly email newsletter https://hadoopweekly.com/
○ YouTube, Slideshare

Links
Hadoop Apache Project Commercial Support Tracker April 2016
http://blogs.gartner.com/merv-adrian/2016/04/27/hadoop-apache-project-commercial-support-tracker-april-2016/
HDFS http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
MapReduce http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
Flume https://flume.apache.org/FlumeUserGuide.html
Kafka http://kafka.apache.org/intro
Hive https://cwiki.apache.org/confluence/display/Hive/LanguageManual
Sqoop http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
Hadoop Ecosystem Table https://hadoopecosystemtable.github.io/
Sandboxes https://hortonworks.com/products/sandbox/ https://www.cloudera.com/downloads/quickstart_vms/5-12.html

Thanks!
Contact me:
Kit Menke
@kitmenke
kmenke@1904labs.com
