Hadoop 101: Back to School
St. Louis Hadoop Users Group
Wednesday, September 6, 2017
Photo by JJ Thompson on Unsplash
Agenda
1. The V’s of Big Data
2. Hadoop Foundation
3. Hadoop Projects
a. Flume, Hive, Sqoop, Spark, Storm, and Kafka
4. Use Cases
5. Cloud
6. Getting your own environment setup
The V’s of Big Data
Photo by Bruno Martins on Unsplash
The V’s of Big Data
1. Volume - quantity of data, too much for one machine
2. Variety - tweets, videos, IoT, databases, logs
3. Velocity - batch, streaming from many devices
4. Variability - meaning of data changes, ex: sentiment
5. Veracity - data quality, accuracy
Hadoop Goals
● Scalability
● Reliability
● Cost
● Parallel processing
Hadoop Support among distros
● Commercial offerings from Amazon, Cloudera, Hortonworks, IBM, & MapR - Merv Adrian’s blog
● Five supporters
○ Apache HDFS, Apache MapReduce, Apache YARN, Apache Avro, Apache Flume, Apache HBase, Apache Hive,
Apache Oozie, Apache Parquet, Apache Pig, Apache Solr, Apache Spark, Apache Sqoop, Apache Zookeeper
● Four supporters
○ Apache Kafka, Apache Mahout, Hue
● Three supporters
○ Apache DataFu, Apache Impala, Cascading
● Be careful about versions!
○ Ex: Spark 1.6 vs Spark 2.x, Sqoop1 vs Sqoop2
38
Total number of projects on the Apache Software Foundation “big data” list
Not counting Apache Hive, Apache HBase + others!
Apache Hadoop - Hadoop Distributed File System (HDFS)
● Store data across many machines
● Designed to store large files
○ Files are split into blocks
○ Blocks are replicated across different nodes in the cluster
● Many other Hadoop projects store their data in HDFS
● Using HDFS
○ Indirectly via other services (Hive, HBase, Spark, etc)
○ Access it directly using the command line:
■ hdfs dfs -help
■ hdfs dfs -ls
■ hdfs dfs -mkdir /tmp/something
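To make "files are split into blocks" and "blocks are replicated across different nodes" concrete, here is a minimal Python sketch. It is a toy model, not how HDFS is implemented: the function names are hypothetical, and the round-robin placement is an assumption made for illustration (real HDFS placement is rack-aware). A 128 MB block size and a replication factor of 3 are common defaults, but the numbers below are arbitrary.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    full_blocks, remainder = divmod(file_size, block_size)
    # Every block is full-sized except possibly the last one.
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes (round-robin, an assumption)."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        # Consecutive nodes, wrapping around, so replicas never share a node.
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(file_size=300, block_size=128)
    print(blocks)
    print(place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], replication=3))
```

Losing any single node in this model still leaves two copies of every block, which is the reliability goal replication is meant to serve.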
Apache MapReduce
● Framework for processing data in HDFS
● Largely being replaced by higher level frameworks like Spark, Hive, etc.
● Core concepts are still important
○ A Job is split into multiple tasks to execute in parallel
○ Map - a transformation, filter, and/or sorting
○ Reduce - summarization like count, average, etc.
● Using MapReduce
○ Write a Java app using MapReduce API
○ Submit to run on the cluster
bin/hadoop jar hadoop-mapreduce-examples-<ver>.jar wordcount -files cachefile.txt -libjars mylib.jar -archives myarchive.zip input output
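The map and reduce concepts can be sketched in plain Python without a cluster. This is a single-process illustration of the word count idea, not the Hadoop MapReduce Java API; the function names are hypothetical, and the shuffle step (grouping values by key) is something the framework would normally do between the two phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map: transform one input line into (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values by key (handled by the framework on a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: summarize the values for one key, here by summing them."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["the cat", "the dog"]))
```

On a cluster, each map task would handle a different block of the input and each reduce task a different subset of the keys, which is where the parallelism comes from.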
Apache Flume
● Tool for reliably ingesting data into Hadoop
● Core concepts
○ Agent - JVM processing event flow
○ Source - input - events from files, avro, thrift, twitter, kafka, etc.
○ Channel - passive store until event is consumed by the sink
○ Sink - output - to HDFS or another agent
● Using Flume
○ Create configuration file (Java properties file)
○ Start flume agent on nodes using command line
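The source, channel, and sink roles can be modeled in a few lines of Python. This is only a conceptual sketch under simplifying assumptions: the class and function names are hypothetical, the channel is an in-memory queue, and the "sink" appends to a list instead of writing to HDFS. A real agent is configured with a properties file rather than written as code.

```python
from collections import deque


class Channel:
    """Passive store: holds events until a sink takes them."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take(self):
        # Return the oldest event, or None when the channel is empty.
        return self._events.popleft() if self._events else None


def run_source(lines, channel):
    """Source: turn each input line into an event and put it on the channel."""
    for line in lines:
        channel.put({"body": line})


def run_sink(channel, destination):
    """Sink: drain the channel and write event bodies to the destination."""
    event = channel.take()
    while event is not None:
        destination.append(event["body"])
        event = channel.take()


if __name__ == "__main__":
    channel, stored = Channel(), []
    run_source(["GET /index.html", "GET /about.html"], channel)
    run_sink(channel, stored)
    print(stored)
```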
Apache Hive
● Query files in HDFS with “SQL”
● Schema on read
● Supports a variety of file formats
○ Plain text - delimited files like CSV, TSV
○ Columnar file formats - ORC, Parquet
○ Avro
○ JSON (with a serde)
● Using Hive
○ Command line with hive from the edge node
○ beeline (command line tool) - uses JDBC
○ Web UI like Hue or Ambari
○ SQuirreL or other clients
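"Schema on read" means the raw file is stored as-is and the column names and types are applied only when the data is queried. The Python sketch below illustrates that idea on a small delimited string; it is not Hive code, the function name and schema format are hypothetical, and mapping unparseable values to None is a simplification of how a query engine might surface them as NULL.

```python
import csv
import io


def read_with_schema(raw_text, schema, delimiter=","):
    """Apply (column_name, type) pairs to delimited text at read time."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text), delimiter=delimiter):
        row = {}
        for (name, cast), value in zip(schema, fields):
            try:
                row[name] = cast(value)
            except ValueError:
                # The file was never validated on write, so bad values
                # only show up now, when the schema is applied.
                row[name] = None
        rows.append(row)
    return rows


if __name__ == "__main__":
    raw = "1,alice\nx,bob\n"
    print(read_with_schema(raw, [("id", int), ("name", str)]))
```

The same raw text could be read again with a different schema, for example treating `id` as a string, without rewriting the file.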
Apache Sqoop
● Move data between Hadoop and structured data stores like relational databases
○ Import - From RDBMS to Hadoop
○ Export - From Hadoop to RDBMS
● Uses JDBC to connect to the database and can write files to HDFS and/or Hive
● Using Sqoop
○ Use the command line tool from the edge node
$ sqoop import \
--query 'SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id) WHERE $CONDITIONS' \
--split-by a.id --target-dir /user/foo/joinresults
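The `$CONDITIONS` placeholder and `--split-by` work together: the import is divided among parallel tasks, and each task runs the query with `$CONDITIONS` replaced by a range predicate on the split column. The Python sketch below shows the idea with integer ids. It is a simplified model, not Sqoop's actual splitting code: the function names are hypothetical and the exact boundary predicates Sqoop generates may differ.

```python
def split_conditions(column, min_value, max_value, num_mappers):
    """Return one range predicate per task, together covering [min_value, max_value]."""
    total = max_value - min_value + 1
    size, extra = divmod(total, num_mappers)
    conditions = []
    lo = min_value
    for i in range(num_mappers):
        # The first `extra` ranges get one additional value each.
        hi = lo + size + (1 if i < extra else 0)
        if hi > lo:
            conditions.append(f"{column} >= {lo} AND {column} < {hi}")
        lo = hi
    return conditions


def fill_query(query, condition):
    """Substitute one task's predicate for the $CONDITIONS placeholder."""
    return query.replace("$CONDITIONS", condition)


if __name__ == "__main__":
    query = "SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id) WHERE $CONDITIONS"
    for cond in split_conditions("a.id", 1, 10, 4):
        print(fill_query(query, cond))
```

A split column with evenly distributed values keeps the ranges, and therefore the tasks, roughly the same size.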
Apache Spark
● Framework for batch and streaming (micro-batch) data processing
● Faster (in memory!) and easier to use than MapReduce
● Modules
○ Spark SQL for SQL and structured data processing
○ MLlib for machine learning
○ GraphX for graph processing
○ Spark Streaming for stream processing
● Using Spark
○ Write a Spark application using Python, Scala, or Java APIs, then “submit” the application to the cluster
○ Use pyspark, the Python REPL (read-eval-print loop)
○ Use spark-shell, the Scala REPL
○ Notebook like Jupyter, Zeppelin
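"Micro-batch" streaming means incoming events are collected into small time windows and each window is then processed as a tiny batch job. The plain-Python sketch below illustrates only that grouping step; it does not use the Spark API, and the function name, the (timestamp, value) event shape, and the one-second interval are assumptions for illustration.

```python
from collections import defaultdict


def micro_batches(events, interval):
    """Group (timestamp, value) events into consecutive batches of `interval` seconds."""
    batches = defaultdict(list)
    for timestamp, value in events:
        # Events whose timestamps fall in the same interval share a batch.
        batches[int(timestamp // interval)].append(value)
    # Return non-empty batches in time order.
    return [batches[key] for key in sorted(batches)]


if __name__ == "__main__":
    events = [(0.1, "a"), (0.9, "b"), (1.2, "c"), (3.0, "d")]
    for batch in micro_batches(events, interval=1):
        print(len(batch), batch)
```

The trade-off is latency: an event waits until its interval closes before it is processed, in exchange for reusing the same batch-style processing on each group.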
Apache Storm
● Framework for processing streaming data in real-time
● Message at a time, not micro-batch
● Concepts
○ Tuples – an ordered list of elements
○ Streams – an unbounded sequence of tuples
○ Spouts – bring data in, create tuples
○ Bolts – process streams of data
○ Topologies – network of spouts and bolts
● Using Storm
○ Write Java code to build a storm topology
○ Submit uber jar to the cluster with storm CLI
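The spout, bolt, and topology concepts map naturally onto chained Python generators, where each tuple flows through the whole chain one at a time. This is a single-process sketch of the concepts only, not the Storm Java API; all names are hypothetical, and a real topology would run the spouts and bolts in parallel across the cluster.

```python
def sentence_spout(sentences):
    """Spout: bring data in and emit one tuple per sentence."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: split each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keep a running count and emit (word, count) after every tuple."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])


def run_topology(sentences):
    """Topology: wire the spout and bolts together."""
    return list(count_bolt(split_bolt(sentence_spout(sentences))))


if __name__ == "__main__":
    for result in run_topology(["a b", "a"]):
        print(result)
```

Because generators are lazy, each word is counted as soon as it is emitted, which mirrors the "message at a time, not micro-batch" behavior.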
Apache Kafka
● Publish-subscribe messaging for streaming data
● Installed on a cluster, data stored locally on disk
● Core concepts
○ Topics - streams of records (key, value), stored in order and split across partitions
○ Producer - puts data on topics
○ Consumer(s) - read data off topics
● Data is retained for a limited amount of time
● Consumers can read data from a given offset
● Using Kafka
○ Client API to produce/consume data, or use it from another service to persist streaming data
○ Command line utilities for debugging
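Topics, partitions, and offsets can be modeled as a handful of append-only lists. The Python sketch below is a toy in-memory model, not the Kafka client API: the `Topic` class and its methods are hypothetical, and choosing the partition with a CRC32 hash of the key is an assumption for illustration (Kafka's own partitioner uses a different hash).

```python
import zlib


class Topic:
    """A topic as a set of append-only partitions of (key, value) records."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        """Append a record; records with the same key land in the same partition."""
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        # The offset is the record's position within its partition.
        return partition, len(self.partitions[partition]) - 1

    def consume(self, partition, offset):
        """Read every record in a partition starting from the given offset."""
        return self.partitions[partition][offset:]


if __name__ == "__main__":
    topic = Topic(num_partitions=3)
    topic.produce("sensor-1", "20C")
    partition, offset = topic.produce("sensor-1", "21C")
    print(topic.consume(partition, offset))
```

Reading does not remove records, so a consumer can go back to an earlier offset and replay data for as long as it is retained.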
Use Case #1 - Website Analytics
Photo by Igor Ovsyannykov on Unsplash
Quiz #1 Answers
Blue lines are Flume agents used to ingest web logs from servers into Hadoop
Orange line is Sqoop used to move data from Hadoop to a relational database
Use Case #2 - Data Warehouse Augmentation
Photo by Samuel Zeller on Unsplash
Quiz #2 Answers
Blue lines are Sqoop used to move data from relational database to Hadoop
Orange lines would be Hive to query the data in Hadoop with SQL
Use Case #3 - IoT
Quiz #3 Answers
Blue lines are Kafka, good intermediary between IoT devices and your stream processor
Orange lines could be Spark Streaming or Storm to process the data
Cloud
● Cloud offerings of Hadoop: Azure HDInsight, Amazon EMR, Google Cloud Dataproc
● Roll your own with Infrastructure as a Service
● Pros: Quicker time to market, easier to scale, integration with other cloud services
● Separation of storage and compute
○ Sacrifice storage performance for faster/easier scalability
Getting Started
● Useful skills
○ Java - troubleshooting errors
○ Linux - command line, ssh
● Locally
○ PC with 16 GB of RAM
○ VirtualBox, PuTTY, browser
○ Sandbox from Hortonworks / Cloudera
● Cloud
○ Images available on Azure/Amazon
● Learning
○ Hadoop weekly email newsletter https://hadoopweekly.com/
○ YouTube, Slideshare
Links
Hadoop Apache Project Commercial Support Tracker April 2016
http://blogs.gartner.com/merv-adrian/2016/04/27/hadoop-apache-project-commercial-support-tracker-april-2016/
HDFS http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
MapReduce http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
Flume https://flume.apache.org/FlumeUserGuide.html
Kafka http://kafka.apache.org/intro
Hive https://cwiki.apache.org/confluence/display/Hive/LanguageManual
Sqoop http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
Hadoop Ecosystem Table https://hadoopecosystemtable.github.io/
Sandboxes https://hortonworks.com/products/sandbox/ https://www.cloudera.com/downloads/quickstart_vms/5-12.html
Thanks!
Contact me:
Kit Menke
@kitmenke
kmenke@1904labs.com
