SlideShare a Scribd company logo
1 of 47
Tonight’s Required Resources
• Azure Cloud Account
• Putty
• Files from GitHub:
https://github.com/MassStreetAnalytics/Hands-On-Hadoop
Please come in and get set up.
Hands On: Introduction to
the Hadoop Ecosystem
Bob Wakefield
Principal
bob@MassStreet.net
Twitter:
@BobLovesData
Who is Mass Street?
•Boutique data consultancy
•We work on data problems big or
“small”
•Free Big Data training
Mass Street Partnerships and
Capability
•Hortonworks Partner
•Confluent Partner
•ARG Back Office
Bob’s Background
• IT professional 16 years
• Currently working as a Data Engineer
• Education
• BS Business Admin (MIS) from KState
• MBA (finance concentration) from KU
• Coursework in Mathematics at Washburn
• Graduate certificate Data Science from Rockhurst
•Addicted to everything data
My Experience With Hadoop
• Trained by MetaScale. (Rip MetaScale)
• Running a seven node lab cluster for three years
• Built because sandboxes are limited and tired of waiting for AWS
• Used for clients and R&D
• Cluster specs:
• CentOS 7
• Hortonworks Data Platform 2.6.1
• 1 Name Node
• 3 Data Nodes
• 3 Node Kafka cluster
• Total HD space: 19TB
• Total RAM: 196 GB
• Total cost ~ $3,000
Follow Me!
•Personal Twitter: @BobLovesData
•Company Twitter: @MassStreet
•Blog: DataDrivenPerspectives.com
•Website: www.MassStreet.net
•Facebook: @MassStreetAnalyticsLLC
KC Learn Big Data Objectives
•Educate people about what you
can do with all the new technology
surrounding data.
•Grow the big data career field.
•Teach skills not products
KC Learn Big Data Objectives
0
2
4
6
8
10
12
14
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39
Prodcuctivity
Time
Bob's Big Data Learning Curve
0
2
4
6
8
10
12
14
16
1 2 3 4 5 6 7
Prouctivity
Time
Your Learning Curve
After ClassesWith
Bob
Logistics
•Modalities of material delivery.
•We need a better meeting place.
I’m open to ideas!
Upcoming MeetUps
September: Architecting your first Hadoop implementation
October: Joint MeetUp with Data Science KC. Up and Running
with Spark and R
November: Joint MeetUp with Kansas City Apache Spark
MeetUp. Practicing IoT skills with Satori
December: Survey of common data science algorithms.
This Evening’s Learning
Objectives
• Get familiar with Ambari
• Get familiar with Hive
• Demonstrate BI for Big Data
• If there is time, go over a real
world use case.
•Always expect crazy at lab time!
Resources For This Evening’s
Material: Books
• Programming Hive: Data
Warehouse and Query
Language for Hadoop
• Hadoop: The Definitive
Guide: Storage and
Analysis at Internet Scale
Resources For This Evening’s
Material: Online Classes
• Udemy course: Comprehensive Course Hadoop Analytic Tool:
Apache Hive
• No good Hadoop admin classes online
• Links to resources can be found in the Resources folder of the
GitHub repo for this class.
• https://github.com/MassStreetAnalytics/Hands-On-Hadoop
I want to do big data work. What
do I need to know?
1. Basic understanding of Hadoop
ecosystem.
2. Basic understanding of
MapReduce Concepts.
3. Basic understanding of Data
Science Principals.
Pick A Path
1. Data Science
2. Data Engineering
3. Data Interpreter?
Basic Skills For Each Role
Data Scientist
1. Full understanding
of DS/ML concepts
2. R or Python
Data Engineer
1. Java or Scala or Python or R
2. Build tools like Maven, Ant, or
Gradle
3. Source Control preferably
GitHub
4. Full understanding of
MapReduce
5. Full understanding of Hadoop
Ecosystem
6. Full understanding of NoSQL
7. Full understanding of MPP
DBs
What is Hadoop?
• Provides distributed fault tolerant data
storage
• Provides linear scalability on commodity
hardware
• Translation: Take all of your data, throw it
across a bunch of cheap machines, and
analyze it. Get more data? Add more
machines
Common Hadoop Use Cases
• Easier analysis of big data
• Lower TCO of data
• Offloading ETL workloads
• Data warehousing
• Data lakes
• Backbone of various advanced data systems
What these things do
• Sqoop – ETL tool
• Pig – data flow language
• Drill/HAWQ – SQL on Hadoop
• Knox/Ranger – Security
• Zookeeper – Coordination
• Spark – In memory computing
• Kafka – Distributed Pub/Sub messaging
It’s like a child’s erector set!
HDFS
• Hadoop Distributed File System
• The data operating system!
• Manages nodes in the cluster
• Scalable and highly fault tolerant
HDFS
• Mechanics
• Cuts up data into blocks and spreads across
nodes
• Replicates blocks across nodes
• Process optimization
MapReduce
• This is how Google indexes the web
• It’s a low level programming framework for
pulling data out of the cluster
• Communicates with HDFS
• Designed for batch processing
• Can use any language to write MapReduce
jobs
• How does MR work? Pffffft!
Hive
• Hadoop warehouse solution
• SQLesque language called Hive Query
Language
• Adds structure to unstructured data
• Provides a window into HDFS
Hive
• Writes MapReduce behind the scenes (sorta)
• Should be used for OLAP task
• Queries aren’t fast, just faster than normal
• You can connect through ODBC/JDBC
• Which means you can work with Hive using
standard BI tools
Tez
• Tez generalizes the MapReduce paradigm to a
more powerful framework for executing a
complex DAG (directed acyclic graph) of tasks
for near real-time big data processing.
Bob’s Super Crashy Crash Course In SQL
• SQL – Structured Query Language
• SQL is how you retrieve data from relational
databases
• Each statement consist of a minimum of three
elements
• SELECT
• FROM
• WHERE
Bob’s Super Crashy Crash Course In SQL
• In the old world, data is stored in two dimensional
tables; rows and columns
• There are rules about how this information is
stored.
• Data is usually modeled as a set of relationships
• These relationships can be thought of in terms of
parent child.
Bob’s Super Crashy Crash Course In SQL
• There are all kinds of relationships.
• Many to one. Simplest and most common.
• One to one.
• Many to many.
Bob’s Super Crashy Crash Course In SQL
• Normally, each record is identified by some
value called a key.
• The are all kinds of keys but the most
important are:
• Primary Keys
• Foreign Keys
Bob’s Super Crashy Crash Course In SQL
• Primary Keys uniquely identify a record
• Foreign Keys are the primary key of a parent
record.
Bob’s Super Crashy Crash Course In SQL
A brief history of SQL on Hadoop
• Data on HDFS is not stored in a relational style.
• NoSQL
• Data retrieval very difficult
• Attempts to make HDFS talk SQL
• HAWQ
• Drill
• Opinion: Efforts are floundering because of the rise
of MPP databases
Tonight’s Scenario
You work in a large organization with a lot of
data but it is stored in old world relational
databases. You need to analyze this data but
there are over a billion records and your queries
are incredibly slow. You convince your boss to let
you try Hadoop. You spin up a cluster and get to
work.
Analyst Task
1. Get your data into HDFS.
2. Get your data into Hive.
3. Run some initial queries on the data.
4. Share your analysis.
5. (Make sure you destroy your cloud instance!)

More Related Content

What's hot

Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015hadooparchbook
 
Elasticsearch + Cascading for Scalable Log Processing
Elasticsearch + Cascading for Scalable Log ProcessingElasticsearch + Cascading for Scalable Log Processing
Elasticsearch + Cascading for Scalable Log ProcessingCascading
 
Architecting a Next Generation Data Platform
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platformhadooparchbook
 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detectionhadooparchbook
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoophadooparchbook
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...DataWorks Summit/Hadoop Summit
 
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...Edureka!
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialhadooparchbook
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopAmanda Casari
 
Architecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an exampleArchitecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an examplehadooparchbook
 
Hadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointHadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointInside Analysis
 
Hadoop Summit 2014: Building a Self-Service Hadoop Platform at LinkedIn with ...
Hadoop Summit 2014: Building a Self-Service Hadoop Platform at LinkedIn with ...Hadoop Summit 2014: Building a Self-Service Hadoop Platform at LinkedIn with ...
Hadoop Summit 2014: Building a Self-Service Hadoop Platform at LinkedIn with ...David Chen
 
Application architectures with Hadoop and Sessionization in MR
Application architectures with Hadoop and Sessionization in MRApplication architectures with Hadoop and Sessionization in MR
Application architectures with Hadoop and Sessionization in MRmarkgrover
 
Data Science with Apache Spark - Crash Course - HS16SJ
Data Science with Apache Spark - Crash Course - HS16SJData Science with Apache Spark - Crash Course - HS16SJ
Data Science with Apache Spark - Crash Course - HS16SJDataWorks Summit/Hadoop Summit
 
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016StampedeCon
 
Architecting a next generation data platform
Architecting a next generation data platformArchitecting a next generation data platform
Architecting a next generation data platformhadooparchbook
 

What's hot (18)

Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015
 
Elasticsearch + Cascading for Scalable Log Processing
Elasticsearch + Cascading for Scalable Log ProcessingElasticsearch + Cascading for Scalable Log Processing
Elasticsearch + Cascading for Scalable Log Processing
 
Architecting a Next Generation Data Platform
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platform
 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detection
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...
 
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
 
Spark streaming
Spark streamingSpark streaming
Spark streaming
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 
Architecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an exampleArchitecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an example
 
Setting up Hadoop YARN Clustering
Setting up Hadoop YARN ClusteringSetting up Hadoop YARN Clustering
Setting up Hadoop YARN Clustering
 
Hadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter PointHadoop and the Data Warehouse: Point/Counter Point
Hadoop and the Data Warehouse: Point/Counter Point
 
Hadoop Summit 2014: Building a Self-Service Hadoop Platform at LinkedIn with ...
Hadoop Summit 2014: Building a Self-Service Hadoop Platform at LinkedIn with ...Hadoop Summit 2014: Building a Self-Service Hadoop Platform at LinkedIn with ...
Hadoop Summit 2014: Building a Self-Service Hadoop Platform at LinkedIn with ...
 
Application architectures with Hadoop and Sessionization in MR
Application architectures with Hadoop and Sessionization in MRApplication architectures with Hadoop and Sessionization in MR
Application architectures with Hadoop and Sessionization in MR
 
Data Science with Apache Spark - Crash Course - HS16SJ
Data Science with Apache Spark - Crash Course - HS16SJData Science with Apache Spark - Crash Course - HS16SJ
Data Science with Apache Spark - Crash Course - HS16SJ
 
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
 
Architecting a next generation data platform
Architecting a next generation data platformArchitecting a next generation data platform
Architecting a next generation data platform
 

Similar to Hands On: Introduction to the Hadoop Ecosystem

Big Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R UsersBig Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R UsersAdaryl "Bob" Wakefield, MBA
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineeringThang Bui (Bob)
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Andrew Brust
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big DataAndrew Brust
 
Liferay & Big Data Dev Con 2014
Liferay & Big Data Dev Con 2014Liferay & Big Data Dev Con 2014
Liferay & Big Data Dev Con 2014Miguel Pastor
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with HadoopCloudera, Inc.
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlySarah Guido
 
Data Day Seattle 2015: Sarah Guido
Data Day Seattle 2015: Sarah GuidoData Day Seattle 2015: Sarah Guido
Data Day Seattle 2015: Sarah GuidoBitly
 
No sql and sql - open analytics summit
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summitOpen Analytics
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
 
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...Cloudera, Inc.
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברגTaldor Group
 

Similar to Hands On: Introduction to the Hadoop Ecosystem (20)

Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Big Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R UsersBig Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R Users
 
Not Just Another Overview of Apache Hadoop
Not Just Another Overview of Apache HadoopNot Just Another Overview of Apache Hadoop
Not Just Another Overview of Apache Hadoop
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineering
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
Apache drill
Apache drillApache drill
Apache drill
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big Data
 
Liferay & Big Data Dev Con 2014
Liferay & Big Data Dev Con 2014Liferay & Big Data Dev Con 2014
Liferay & Big Data Dev Con 2014
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
 
Data Day Seattle 2015: Sarah Guido
Data Day Seattle 2015: Sarah GuidoData Day Seattle 2015: Sarah Guido
Data Day Seattle 2015: Sarah Guido
 
No sql and sql - open analytics summit
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summit
 
Intro to Big Data
Intro to Big DataIntro to Big Data
Intro to Big Data
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברג
 

More from Adaryl "Bob" Wakefield, MBA

The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need BothThe Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need BothAdaryl "Bob" Wakefield, MBA
 
Perfecting Your Streaming Skills with Spark and Real World IoT Data
Perfecting Your Streaming Skills with Spark and Real World IoT DataPerfecting Your Streaming Skills with Spark and Real World IoT Data
Perfecting Your Streaming Skills with Spark and Real World IoT DataAdaryl "Bob" Wakefield, MBA
 
Reproducible Research with R, The Tidyverse, Notebooks, and Spark
Reproducible Research with R, The Tidyverse, Notebooks, and SparkReproducible Research with R, The Tidyverse, Notebooks, and Spark
Reproducible Research with R, The Tidyverse, Notebooks, and SparkAdaryl "Bob" Wakefield, MBA
 
In Memory Databases: A Real Time Analytics Solution
In Memory Databases: A Real Time Analytics SolutionIn Memory Databases: A Real Time Analytics Solution
In Memory Databases: A Real Time Analytics SolutionAdaryl "Bob" Wakefield, MBA
 

More from Adaryl "Bob" Wakefield, MBA (6)

The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need BothThe Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
 
Getting Started With Sparklyr
Getting Started With SparklyrGetting Started With Sparklyr
Getting Started With Sparklyr
 
Perfecting Your Streaming Skills with Spark and Real World IoT Data
Perfecting Your Streaming Skills with Spark and Real World IoT DataPerfecting Your Streaming Skills with Spark and Real World IoT Data
Perfecting Your Streaming Skills with Spark and Real World IoT Data
 
Reproducible Research with R, The Tidyverse, Notebooks, and Spark
Reproducible Research with R, The Tidyverse, Notebooks, and SparkReproducible Research with R, The Tidyverse, Notebooks, and Spark
Reproducible Research with R, The Tidyverse, Notebooks, and Spark
 
Hands On: Linux survival skills
Hands On: Linux survival skillsHands On: Linux survival skills
Hands On: Linux survival skills
 
In Memory Databases: A Real Time Analytics Solution
In Memory Databases: A Real Time Analytics SolutionIn Memory Databases: A Real Time Analytics Solution
In Memory Databases: A Real Time Analytics Solution
 

Recently uploaded

Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.pptamreenkhanum0307
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 

Recently uploaded (20)

Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.ppt
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 

Hands On: Introduction to the Hadoop Ecosystem

  • 1. Tonight’s Required Resources • Azure Cloud Account • Putty • Files from GitHub: https://github.com/MassStreetAnalytics/Hands-On-Hadoop Please come in and get set up.
  • 2. Hands On: Introduction to the Hadoop Ecosystem Bob Wakefield Principal bob@MassStreet.net Twitter: @BobLovesData
  • 3. Who is Mass Street? •Boutique data consultancy •We work on data problems big or “small” •Free Big Data training
  • 4. Mass Street Partnerships and Capability •Hortonworks Partner •Confluent Partner •ARG Back Office
  • 5. Bob’s Background • IT professional 16 years • Currently working as a Data Engineer • Education • BS Business Admin (MIS) from KState • MBA (finance concentration) from KU • Coursework in Mathematics at Washburn • Graduate certificate Data Science from Rockhurst •Addicted to everything data
  • 6. My Experience With Hadoop • Trained by MetaScale. (Rip MetaScale) • Running a seven node lab cluster for three years • Built because sandboxes are limited and tired of waiting for AWS • Used for clients and R&D • Cluster specs: • CentOS 7 • Hortonworks Data Platform 2.6.1 • 1 Name Node • 3 Data Nodes • 3 Node Kafka cluster • Total HD space: 19TB • Total RAM: 196 GB • Total cost ~ $3,000
  • 7. Follow Me! •Personal Twitter: @BobLovesData •Company Twitter: @MassStreet •Blog: DataDrivenPerspectives.com •Website: www.MassStreet.net •Facebook: @MassStreetAnalyticsLLC
  • 8. KC Learn Big Data Objectives •Educate people about what you can do with all the new technology surrounding data. •Grow the big data career field. •Teach skills not products
  • 9. KC Learn Big Data Objectives 0 2 4 6 8 10 12 14 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 Prodcuctivity Time Bob's Big Data Learning Curve 0 2 4 6 8 10 12 14 16 1 2 3 4 5 6 7 Prouctivity Time Your Learning Curve After ClassesWith Bob
  • 10. Logistics •Modalities of material delivery. •We need a better meeting place. I’m open to ideas!
  • 11. Upcoming MeetUps September: Architecting your first Hadoop implementation October: Joint MeetUp with Data Science KC. Up and Running with Spark and R November: Joint MeetUp with Kansas City Apache Spark MeetUp. Practicing IoT skills with Satori December: Survey of common data science algorithms.
  • 12. This Evening’s Learning Objectives • Get familiar with Ambari • Get familiar with Hive • Demonstrate BI for Big Data • If there is time, go over a real world use case. •Always expect crazy at lab time!
  • 13. Resources For This Evening’s Material: Books • Programming Hive: Data Warehouse and Query Language for Hadoop • Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale
  • 14. Resources For This Evening’s Material: Online Classes • Udemy course: Comprehensive Course Hadoop Analytic Tool: Apache Hive • No good Hadoop admin classes online • Links to resources can be found in the Resources folder of the GitHub repo for this class. • https://github.com/MassStreetAnalytics/Hands-On-Hadoop
  • 15. I want to do big data work. What do I need to know? 1. Basic understanding of Hadoop ecosystem. 2. Basic understanding of MapReduce Concepts. 3. Basic understanding of Data Science Principals.
  • 16. Pick A Path 1. Data Science 2. Data Engineering 3. Data Interpreter?
  • 17. Basic Skills For Each Role Data Scientist 1. Full understanding of DS/ML concepts 2. R or Python Data Engineer 1. Java or Scala or Python or R 2. Build tools like Maven, Ant, or Gradle 3. Source Control preferably GitHub 4. Full understanding of MapReduce 5. Full understanding of Hadoop Ecosystem 6. Full understanding of NoSQL 7. Full understanding of MPP DBs
  • 18.
  • 19.
  • 20.
  • 21.
  • 22.
  • 23.
  • 24.
  • 25.
  • 26.
  • 27. What is Hadoop? • Provides distributed fault tolerant data storage • Provides linear scalability on commodity hardware • Translation: Take all of your data, throw it across a bunch of cheap machines, and analyze it. Get more data? Add more machines
  • 28. Common Hadoop Use Cases • Easier analysis of big data • Lower TCO of data • Offloading ETL workloads • Data warehousing • Data lakes • Backbone of various advanced data systems
  • 29.
  • 30. What these things do • Sqoop – ETL tool • Pig – data flow language • Drill/HAWQ – SQL on Hadoop • Knox/Ranger – Security • Zookeeper – Coordination • Spark – In memory computing • Kafka – Distributed Pub/Sub messaging
  • 31. It’s like a child’s erector set!
  • 32. HDFS • Hadoop Distributed File System • The data operating system! • Manages nodes in the cluster • Scalable and highly fault tolerant
  • 33. HDFS • Mechanics • Cuts up data into blocks and spreads across nodes • Replicates blocks across nodes • Process optimization
  • 34. MapReduce • This is how Google indexes the web • It’s a low level programming framework for pulling data out of the cluster • Communicates with HDFS • Designed for batch processing • Can use any language to write MapReduce jobs • How does MR work? Pffffft!
  • 35.
  • 36. Hive • Hadoop warehouse solution • SQLesque language called Hive Query Language • Adds structure to unstructured data • Provides a window into HDFS
  • 37. Hive • Writes MapReduce behind the scenes (sorta) • Should be used for OLAP task • Queries aren’t fast, just faster than normal • You can connect through ODBC/JDBC • Which means you can work with Hive using standard BI tools
  • 38. Tez • Tez generalizes the MapReduce paradigm to a more powerful framework for executing a complex DAG (directed acyclic graph) of tasks for near real-time big data processing.
  • 39. Bob’s Super Crashy Crash Course In SQL • SQL – Structured Query Language • SQL is how you retrieve data from relational databases • Each statement consist of a minimum of three elements • SELECT • FROM • WHERE
  • 40. Bob’s Super Crashy Crash Course In SQL • In the old world, data is stored in two dimensional tables; rows and columns • There are rules about how this information is stored. • Data is usually modeled as a set of relationships • These relationships can be thought of in terms of parent child.
  • 41. Bob’s Super Crashy Crash Course In SQL • There are all kinds of relationships. • Many to one. Simplest and most common. • One to one. • Many to many.
  • 42. Bob’s Super Crashy Crash Course In SQL • Normally, each record is identified by some value called a key. • The are all kinds of keys but the most important are: • Primary Keys • Foreign Keys
  • 43. Bob’s Super Crashy Crash Course In SQL • Primary Keys uniquely identify a record • Foreign Keys are the primary key of a parent record.
  • 44. Bob’s Super Crashy Crash Course In SQL
  • 45. A brief history of SQL on Hadoop • Data on HDFS is not stored in a relational style. • NoSQL • Data retrieval very difficult • Attempts to make HDFS talk SQL • HAWQ • Drill • Opinion: Efforts are floundering because of the rise of MPP databases
  • 46. Tonight’s Scenario You work in a large organization with a lot of data but it is stored in old world relational databases. You need to analyze this data but there are over a billion records and your queries are incredibly slow. You convince your boss to let you try Hadoop. You spin up a cluster and get to work.
  • 47. Analyst Task 1. Get your data into HDFS. 2. Get your data into Hive. 3. Run some initial queries on the data. 4. Share your analysis. 5. (Make sure you destroy your cloud instance!)