Hadoop for the Absolute Beginner
Ike Ellis, MVP
Agenda
• What is Big Data?
• Why is it a problem?
• What is Hadoop?
– MapReduce
– HDFS

• Pig
• Hive
• Sqoop
• HCAT
• The Players
• Maybe data visualization (depending on time)
• Q&A
What is Big Data?
• Trendy?
• Buzz words?
• Process?
• Big data is “a collection of data sets so large and complex
that it becomes difficult to process using on-hand
database management tools or traditional data processing
applications” – Wikipedia
• So how do you know your data is big data?
• When your existing data processing methodologies are no
longer good enough.
Traditional Data Warehouse Stack
There are a lot of moving pieces back there…
• Sometimes, that's our biggest challenge
– Simple question – massive data

• Do we really need to go through the pain of that huge
stack?
Big Data Characteristics
• Volume
– Large amount of data

• Velocity
– Need to be processed quickly

• Variety
– Excel, SQL, OData feeds, CSVs, web downloads, JSON

• Variability
– Different semantics, in terms of meaning or context

• Value
Big Data Examples
• Structured Data
– Pre-defined Schema
– Highly Structured
– Relational

• Semi-structured Data
– Inconsistent structure
– Cannot be stored in rows and tables in a typical database
– Logs, tweets, data feeds, GPS coordinates

• Unstructured Data
– Lacks structure
– Free-form text
– Customer feedback forms
– Audio
– Video
The Problem

• So you have some data
The Problem
• And you want to clean and/or analyze it
So you use the technology that you know
• Excel
• SQL Server
• SQL Server Integration Services
• SQL Server Reporting Services
But what happens if it’s TONS of data
• Like all the real estate transactions in the US for the last ten
years?
• Or GPS data from every bike in your bike rental store?
• Or every swing and every pitch from every baseball game
since 1890?
Or what happens when the analysis is very complicated?
• Tell me when earthquakes happen!
• Tell me how shoppers view my website!
• Tell me how to win my next election!
So you use SQL Server, and have a lot of data, so….
• YOU SCALE UP!
• But SQL can only have so much RAM, CPU, Disk I/O,
Network I/O
• So you hit a wall, probably with disk I/O
• So you….
Scale Out!
• Add servers until the pain goes away….
All analysis is done away from the data servers
But that’s easier said than done
• What's the process?
• You take one large task, and break it up into lots of smaller
tasks
– How do you break them up?
– Once it's broken up and processed, how do you put them back together?
– How do you make sure you break them up evenly so they all execute at the
same rate?
– And really, you're breaking up two things:
• Physical data
• Computational Analysis
– If one small task fails, how do you restart it? Log it? Recover from failure?
– If one SQL Server fails, how do you divert all the new tasks away from it?
– How do you load balance?

• So you end up writing a lot of plumbing code….and even
when you get done….you have one GIANT PROBLEM!
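The splitting and reassembly questions above can be made concrete with a small sketch. This is only an illustration of the kind of plumbing you would have to write yourself, not how any product does it: the function names (`partition_for`, `split_records`, `process_chunk`, `reassemble`) are made up for this example, and using a CRC32 hash to pick a worker is an assumption.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_workers: int) -> int:
    """Pick a worker for a key using a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_workers


def split_records(records, num_workers):
    """Break one big list of (key, value) records into per-worker chunks."""
    chunks = defaultdict(list)
    for key, value in records:
        chunks[partition_for(key, num_workers)].append((key, value))
    return chunks


def process_chunk(chunk):
    """The 'small task': total the values for each key in one chunk."""
    totals = defaultdict(int)
    for key, value in chunk:
        totals[key] += value
    return dict(totals)


def reassemble(partial_results):
    """Put the pieces back together. A key never spans chunks, so a merge is enough."""
    merged = {}
    for partial in partial_results:
        merged.update(partial)
    return merged


# Hypothetical GPS-ping counts per bike.
records = [("bike-1", 5), ("bike-2", 3), ("bike-1", 7), ("bike-3", 1)]
chunks = split_records(records, num_workers=3)
result = reassemble(process_chunk(c) for c in chunks.values())
print(result)
```

Note what the sketch leaves out: retries, logging, uneven chunk sizes, and moving the data to the workers. Those are the hard parts.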
Data Movement
Data moves to achieve fault tolerance, to segment data, to reassemble data, to derive data,
to output data, etc, etc….and network (and disk) is SLOW..you’ve saturated it.
Oh, and another problem
• In SQL, the performance difference between a query over 1MB of
data and 1TB of data is significant
• The performance difference between a query over one server and over 20
servers is also significant
So to summarize and repeat
• Drive seek time…BIG PROBLEM
• Drive channel latency…BIG PROBLEM
• Data + processing time…BIG PROBLEM
• Network Pipe I/O saturation…BIG PROBLEM
• Lots of human problems
– Building a data warehouse stack is a difficult challenge

• Semi-structured data is difficult to handle
– As data changes, it becomes less structured and less valuable
– Flexible structures often give us fits
Enter Hadoop
• Why write your own framework to handle fault tolerance,
logging, data partitioning, heavy analysis when you can
just use this one?
What is Hadoop?
• Hadoop is a distributed storage and processing technology
for large scale applications
– HDFS
• Self-healing, distributed file system. Breaks files into blocks and
stores them redundantly across cluster
– MapReduce
• Framework for running large data processing jobs in parallel
across many nodes and combining results

• Open Source
• Distributed Data Replication
• Commodity hardware
• Disparate hardware
• Data and analysis co-location
• Scalability
• Reliable error handling
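To picture "breaks files into blocks and stores them redundantly", here is a toy sketch. It is not HDFS code: the block size and replication factor are small made-up values, the node names are hypothetical, and the round-robin placement is an assumption chosen to keep the example short (real placement is smarter than this).

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes, replication: int):
    """Assign each block to `replication` distinct nodes, round-robin.

    Round-robin is an assumption for this sketch, not the real policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


def blocks_still_readable(placement, failed_node):
    """A block survives a node failure if any other replica is left."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


file_bytes = b"x" * 1000  # toy file
blocks = split_into_blocks(file_bytes, block_size=256)
nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_replicas(len(blocks), nodes, replication=3)
print(len(blocks), placement[0], blocks_still_readable(placement, "node-a"))
```

Losing one node still leaves two copies of every block, which is the "self-healing" starting point: the system can re-copy from the survivors.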
Hadoop Ecosystem
Under the covers
Hadoop works by keeping the compute next to the data (to minimize network I/O costs)
MapReduce
Segmentation Problem
MapReduce Process – Very simple example
Programming MapReduce
• Steps
– Define the inputs
• Usually some files in HDFS/HBase (Or Azure Blob Storage)
– Write a map function
– Write a reduce function
– Define outputs
• Usually some files in HDFS/HBase (Or Azure Blob Storage)

• Lots of options for both inputs and outputs
• Functions are usually written in Java
– Or Python
– Even .NET (C#, F#)
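The four steps above can be sketched in plain Python. This is a single-process stand-in, not the Hadoop API: `run_mapreduce` is a hypothetical helper that plays the role of the framework, and the city temperature readings are made-up sample data.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Tiny single-process stand-in for the framework: map, shuffle, reduce."""
    shuffled = defaultdict(list)
    # Map phase: every input record may emit any number of (key, value) pairs.
    for record in inputs:
        for key, value in map_fn(record):
            # Shuffle: collect all values that share a key.
            shuffled[key].append(value)
    # Reduce phase: one call per key, over all of that key's values.
    return {key: reduce_fn(key, values) for key, values in shuffled.items()}


# 1. Define the inputs (a list here; files in a real job).
readings = [("san diego", 71), ("seattle", 58), ("san diego", 75), ("seattle", 61)]


# 2. Write a map function.
def map_city_temp(record):
    city, temp = record
    yield city, temp


# 3. Write a reduce function.
def reduce_max_temp(city, temps):
    return max(temps)


# 4. Define the outputs (a dict here; files in a real job).
output = run_mapreduce(readings, map_city_temp, reduce_max_temp)
print(output)
```

You only write steps 2 and 3. Everything inside `run_mapreduce` is what the framework does for you, across many machines.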
Scalability
• Hadoop scales linearly with data size
– Or analysis complexity
– Scales to hundreds of petabytes

• Data-parallel or compute-parallel
• Extensive machine learning on <100GB of image data
• Simple SQL queries on >100TB of clickstream data
• Hadoop works for both!
Hadoop allows you to write a query like this
Select productname, sum(costpergoods)
From salesorders
Group by productname
• Over a ton of data, or a little data, and have it perform
about the same
• If it slows down, throw more nodes at it
• Map is like the GROUP BY
• While reduce is like the aggregate
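Here is a rough Python sketch of how the query above lines up with map and reduce. The `salesorders` rows are made-up sample data and the function names are hypothetical. Sorting the mapped pairs by key stands in for the shuffle step that happens between map and reduce.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows standing in for the salesorders table.
salesorders = [
    {"productname": "helmet", "costpergoods": 20},
    {"productname": "bike", "costpergoods": 300},
    {"productname": "helmet", "costpergoods": 25},
]


def map_order(row):
    """Map: emit the grouping key with the value to aggregate."""
    return row["productname"], row["costpergoods"]


def reduce_sum(productname, costs):
    """Reduce: aggregate every value that arrived for one key."""
    return productname, sum(costs)


# Sorting by key plays the role of the shuffle between map and reduce.
mapped = sorted(map(map_order, salesorders), key=itemgetter(0))
totals = [
    reduce_sum(name, [cost for _, cost in pairs])
    for name, pairs in groupby(mapped, key=itemgetter(0))
]
print(totals)
```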
Why use Hadoop?
• Who wants to write all that plumbing?
– Segmenting data
– Making it redundant and fault tolerant
– Overcoming job failure
– Logging
– All those data providers
– All the custom scripting languages and tooling
– Synchronization
– Scale-free programming model

• Wide adoption
• You specify the map() and reduce() functions
– Let the framework do the rest
What is Hadoop Good For?
• Enormous datasets
• Log Analysis
• Calculating statistics on enormous datasets
• Running large simulations
• ETL
• Machine learning
• Building inverted indexes
• Sorting
– World record

• Distributed Search
• Tokenization
• Image processing
• No fancy hardware…good in the cloud
• And so much more!
What is Hadoop Bad For?
• Low latency (not current data)
• Sequential algorithms
– Recursion

• Joins (sometimes)
• When all the data is structured and can fit on one database
server with scaling up
– It is NOT a replacement for a good RDBMS
Relational vs Hadoop
Another Problem
• MapReduce functions are written in Java, Python, .NET, and
a few other languages
• Those are languages that are widely known
• Except by analysts and DBAs, the exact kind of people who
struggle with big data
• Enter Pig & Hive
– Abstraction for MapReduce
– Sits over MapReduce
– Spawns MapReduce jobs
What MapReduce Functions look like
function map(String name, String document):
    // name: document name
    // document: document contents
    for each word w in document:
        emit (w, 1)

function reduce(String word, Iterator partialCounts):
    // word: a word
    // partialCounts: a list of aggregated partial counts
    sum = 0
    for each pc in partialCounts:
        sum += ParseInt(pc)
    emit (word, sum)
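The pseudocode above translates almost line for line into runnable Python. This is a local sketch, not Hadoop code: `word_count` is a hypothetical driver that does the grouping the framework would normally do, splitting on whitespace is an assumed definition of "word", and the two documents are made-up sample data.

```python
from collections import defaultdict


def map_words(name, document):
    """Emit (word, 1) for every word in the document, as in the pseudocode."""
    for word in document.split():
        yield word, 1


def reduce_counts(word, partial_counts):
    """Add up the partial counts that arrived for one word."""
    total = 0
    for pc in partial_counts:
        total += int(pc)
    return word, total


def word_count(documents):
    """Run map over each document, group by word, then reduce each group."""
    grouped = defaultdict(list)
    for name, text in documents.items():
        for word, count in map_words(name, text):
            grouped[word].append(count)
    return dict(reduce_counts(w, counts) for w, counts in grouped.items())


docs = {"a.txt": "big data big cluster", "b.txt": "big deal"}
counts = word_count(docs)
print(counts)
```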
Introduction to Pig
• Pig – ETL for big data
– Structure
– Pig Latin

• Parallel data processing for Hadoop
• Not trying to get you to learn Pig. Just want you to want
to learn it.
Here’s what SQL looks like
Select customername, count(orderdate) as totalOrders
From salesOrders so
Join customers c
On so.custid = c.custid

Group by customername
Pig
trx = LOAD 'transaction' AS (customer, orderamount);
grouped = GROUP trx BY customer;
ttl = FOREACH grouped GENERATE group, SUM(trx.orderamount) AS tp;
cust = LOAD 'customers' AS (customer, postalcode);
result = JOIN ttl BY group, cust BY customer;
DUMP result;

Executes one step at a time
Pig is like SSIS
• One step at a time. One thing executes, then the next in
the script, acting on the variable declarations above it
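If Pig Latin is new to you, it may help to see the same step-by-step flow in Python. This is only an analogy for how each statement builds on the variables above it. The sample rows are made up, and unlike a Pig join this sketch does not repeat the join key in the output.

```python
from collections import defaultdict

# Hypothetical sample rows; in Pig these would come from LOAD.
trx = [("alice", 10.0), ("bob", 4.5), ("alice", 2.5)]
cust = [("alice", "92101"), ("bob", "98101")]

# GROUP trx BY customer
grouped = defaultdict(list)
for customer, orderamount in trx:
    grouped[customer].append(orderamount)

# FOREACH grouped GENERATE group, SUM(trx.orderamount)
ttl = {customer: sum(amounts) for customer, amounts in grouped.items()}

# JOIN ttl BY group, cust BY customer
result = [
    (customer, ttl[customer], postalcode)
    for customer, postalcode in cust
    if customer in ttl
]

# DUMP result
for row in result:
    print(row)
```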
How Pig Works
• Pig Latin goes to pre-processor
• Pre-processor creates MapReduce jobs that get submitted
to the JobTracker
Pig Components
• Data Types
• Inputs & Outputs
• Relational Operators
• UDFs
• Scripts & Testing
Pig Data Types
• Scalar
– Int
– Long
– Float
– Double
– CharArray
– ByteArray

• Complex
– Map (key/value pair)
– Tuple (fixed-size ordered collection)
– Bag (collection of tuples)
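A loose way to picture the complex types is with their nearest Python equivalents. This is an analogy only, not Pig syntax, and the values are made up.

```python
# Tuple: a fixed-size ordered collection of fields.
order = ("alice", 10.0)

# Bag: a collection of tuples (duplicates allowed).
orders_bag = [("alice", 10.0), ("alice", 2.5), ("alice", 10.0)]

# Map: key/value pairs.
profile = {"city": "San Diego", "tier": "gold"}

# Roughly what a grouping step produces: the group key plus a bag of matching tuples.
grouped_row = ("alice", orders_bag)
total = sum(amount for _, amount in grouped_row[1])
print(total, profile["city"], len(order))
```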
Pig: Inputs/Outputs
• Load
– PigStorage
– TextLoader
– HBaseStorage

• Store
– PigStorage
– HBaseStorage

• Dump
– Dumps to console
– Don't dump a ton of data…uh oh…
Pig: Relational Operators
• Foreach – projection operator, applies expression to every
row in the pipeline
– Flatten – used with complex types, PIVOT

• Filter – WHERE
• Group, Cogroup – GROUP BY (Cogroup groups across multiple relations)
• ORDER BY
• Distinct
• JOIN (INNER, OUTER, CROSS)
• LIMIT – TOP
• Sample – Random sample
• Parallel – level of parallelism on the reducer side
• Union
Pig: UDFs
• Written in Java/Python
• String manipulation, math, complex type operations,
parsing
Pig: Useful commands
• Describe – shows schema
• Explain – shows the logical and physical MapReduce plan
• Illustrate – runs a sample of your data to test your script
• Stats – produced after every run and includes start/end
times, # of records, MapReduce info
• Supports parameter substitution and parameter files
• Supports macros and functions (define)
• Supports includes for script organization
Pig Demo
Introduction to HIVE
• Very popular
• Hive Query Language
• Defining Tables, Views, Partitioning
• Querying and Integration
• VERY SQL-LIKE
• Developed by Facebook
• Data Warehouse for Hadoop
• Based on SQL-92 specification
SQL vs Hive
• Almost useless to compare the two, because they are so
similar
• Create table Internal/External
• Hive is schema on read
– It defines a schema over your data that already exists in the
database
Hive is not a replacement for SQL
• So don't throw out SQL just yet
• Hive is for batch processing large data sets that may span
hundreds, or even thousands, of machines
– Not for row-level updates

• Hive has high overhead when starting a job. It translates
queries to MR so it takes time
• Hive does not cache data
• Hive performance tuning is mainly Hadoop performance
tuning
• Similarity in the query engine, but different architectures
for different purposes
• Way too slow for OLTP workloads
Hive Components
• Data Types
• DDL
• DML
• Queries
• Views, Indexes, Partitions
• UDFs
Hive Data Types
• Scalar
– TinyInt
– SmallInt
– Int
– BigInt
– Boolean
– Float
– Double
– TimeStamp
– String
– Binary

• Complex
– Struct
– Array (collection)
– Map (key/value pair)
What is a Hive Table?
• CREATE DATABASE NewDB
– LOCATION 'hdfshuaNewDB'

• CREATE TABLE
• A Hive table consists of:
– Data: typically a file in HDFS
– Schema: in the form of metadata stored in a relational database

• Schema and data are separate
– A schema can be defined for existing data
– Data can be added or removed independently
– Hive can be pointed to existing data

• You have to define schema if you have existing data in
HDFS that you want to use in Hive
How does Hive work?
• Hive as a Translation Tool
– Compiles and executes queries
– Hive translates the SQL Query to a MapReduce job

• Hive as a structuring tool
– Creates a schema around the data in HDFS
– Tables stored in directories

• Hive Tables have rows and columns and data types
• Hive Metastore
– Namespace with a set of tables
– Holds table definitions

• Partitioning
– Choose a partition key
– Specify key when you load data
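Partitioning pairs naturally with "tables stored in directories": each partition value gets its own `key=value` subdirectory under the table directory, so a query that filters on the key only has to read one directory. Here is a toy sketch of that idea. The `/warehouse/sales` path, the sample rows, and the helper names are all made up for illustration.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def partition_dir(table_dir: str, key: str, value: str) -> PurePosixPath:
    """Build the directory for one partition, e.g. .../sales/year=2013."""
    return PurePosixPath(table_dir) / f"{key}={value}"


def load_partitioned(rows, table_dir, key):
    """Group rows into per-partition directories by the partition key."""
    layout = defaultdict(list)
    for row in rows:
        layout[partition_dir(table_dir, key, str(row[key]))].append(row)
    return layout


def scan(layout, table_dir, key, value):
    """Read only the one directory a filter on the partition key needs."""
    return layout.get(partition_dir(table_dir, key, value), [])


rows = [
    {"year": 2012, "amount": 5},
    {"year": 2013, "amount": 7},
    {"year": 2013, "amount": 1},
]
layout = load_partitioned(rows, "/warehouse/sales", "year")
partition_paths = sorted(str(p) for p in layout)
rows_2013 = scan(layout, "/warehouse/sales", "year", "2013")
print(partition_paths)
print(rows_2013)
```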
Define a Hive Table
CREATE TABLE myTable (name STRING, age INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ';'
STORED AS TEXTFILE;
Loading Data
• Use LOAD DATA to import data into a Hive table
• LOAD DATA LOCAL INPATH 'input/mydata/data.txt'
INTO TABLE myTable
• The files are not modified in Hive – they are loaded as is
• Use the word OVERWRITE to write over a file of the same
name
• Hive can read all the files in a particular directory
• The schema is checked when the data is queried
– If a row does not match the schema, it will be read as null
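"Schema on read" is easier to see in code. The sketch below is plain Python, not Hive: `read_with_schema` is a hypothetical helper, the sample lines are made up, and it nulls individual fields that do not fit, which is one assumed way to model "read as null".

```python
def read_with_schema(lines, schema, delimiter=";"):
    """Apply a (name, type) schema to raw text lines at query time.

    A field that cannot be converted comes back as None, which plays the
    role of NULL here. Missing trailing fields are also read as None.
    """
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        row = {}
        for index, (column, convert) in enumerate(schema):
            try:
                row[column] = convert(fields[index])
            except (IndexError, ValueError):
                row[column] = None
        yield row


# Mirrors a table with columns (name string, age int) and ';' as the delimiter.
schema = [("name", str), ("age", int)]
raw_file = ["ike;40", "sam;not-a-number", "pat"]

rows = list(read_with_schema(raw_file, schema))
print(rows)
```

The file itself is never changed or rejected. The bad values only show up as nulls when somebody queries it.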
Querying Data
• SELECT
– WHERE
– UNION ALL/DISTINCT
– GROUP BY
– HAVING
– LIMIT
– REGEX

• Subqueries
• JOIN
– INNER
– OUTER

• ORDER BY
– Reducer is 1

• SORT BY
– Multiple reducers with a sorted file from each
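The difference between ORDER BY and SORT BY can be sketched like this. It is plain Python rather than Hive, the helper names are made up, and dealing rows out to reducers round-robin is an assumption for the example. The point is only that each reducer's output is sorted on its own.

```python
def order_by(rows):
    """ORDER BY: one reducer sees everything, so the output is totally sorted."""
    return sorted(rows)


def sort_by(rows, num_reducers):
    """SORT BY: each reducer sorts only the rows it received.

    Rows are dealt out round-robin here, an assumption made for the example.
    """
    buckets = [rows[i::num_reducers] for i in range(num_reducers)]
    return [sorted(bucket) for bucket in buckets]


values = [9, 1, 7, 3, 8, 2]
total_order = order_by(values)
per_reducer = sort_by(values, num_reducers=2)
print(total_order)
print(per_reducer)
```

One reducer gives a clean global order but becomes the bottleneck. Several reducers go faster, but you get several sorted files instead of one.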
Hive Demo
Pig Vs Hive
• Famous Yahoo Blog Post
– http://developer.yahoo.com/blogs/hadoop/pig-hive-yahoo464.html

• PIG
– ETL
– For preparing data for easier analysis
– Good for SQL authors that take the time to learn something new
– Unless you store it, all data goes away when the script is finished

• Hive
– Analysis
• When you have to answer a specific question
– Good for SQL authors
– Excel connectivity
– Persists data in the Hadoop data store
Sqoop
• SQL to Hadoop
– SQL Server/Oracle/Something with a JDBC driver

• Import
– From RDBMS into HDFS

• Export
– From HDFS into RDBMS

• Other Commands
– Create hive table
– Evaluate import statement
HUE
• Hadoop User Experience
HCatalog
• Metadata and table management system for Hadoop
• Provides a shared schema and data type mechanism for
various Hadoop tools (Pig, Hive, MapReduce)
– Enables interoperability across data processing tools
– Enables users to choose the best tools for their environments

• Provides a table abstraction so that users need not be
concerned with how data is stored
– Presents users with a relational view of data
HCatalog DDL
• CREATE/ALTER/DROP Table
• SHOW TABLES
• SHOW FUNCTIONS
• DESCRIBE
• Supports a subset of Hive DDL
Why do we have HCat?
• Tools don't tend to agree on
– What a schema is
– What data types are
– How data is stored

• HCatalog solution
– Provides one consistent data model for various Hadoop tools
– Provides shared schema
– Allows users to see when shared data is available
HCatalog – HBase Integration
• Connects HBase tables to HCatalog
• Uses various Hadoop tools
• Provides flexibility with data in HBase or HDFS
HCat Demo
HBase
• NoSQL Database
• Modeled after Google BigTable
• Written in Java
• Runs on top of HDFS
• Features
– Compression
– In-memory operations
– Bloom filters

• Can serve as input or output for MapReduce jobs
• Facebook's messaging platform uses it
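Bloom filters may be the least familiar item in the feature list. The idea is a compact structure that can say "definitely not here" or "maybe here", which lets a store skip reading data that cannot contain a key. The sketch below shows the general technique only. It is not HBase's implementation, and the bit count, hash choice, and row keys are all assumptions.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: may say 'maybe present', never misses a real member."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key: str):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))


row_keys = BloomFilter()
for key in ("user-1", "user-2", "user-3"):
    row_keys.add(key)

# Every key that was added is always reported as possibly present.
print(row_keys.might_contain("user-2"))
```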
Yarn
• Apache Hadoop Next Gen MapReduce
• Yet Another Resource Negotiator
• Separates resource management and processing
components
– Breaking up the job tracker

• YARN was born of a need to enable a broader array of
interaction patterns for data stored in HDFS beyond
MapReduce
Impala
• Cloudera
• Real-time queries for Hadoop
• Low-latency Queries using SQL to HDFS or HBase
Storm
• Free and open source distributed real-time computation
system
• Makes it easy to process unbounded streams of data
• Storm is fast
– A million tuples processed per second per node
The Players
• Hortonworks
• Cloudera
• MapR
• Microsoft HDInsight
• Microsoft PDW
• IBM
• Oracle
• Amazon
• Rackspace
• Google
The Future
• Hadoop features will push into RDBMS systems
• RDBMS features will continue to push into Hadoop
• Tons of 3rd party vendors and open source projects have
applications for Hadoop and RDBMS/Hadoop integration
• Lots of buy-in, lots of progress, lots of changes
How to Learn Hadoop
• Lots of YouTube videos online
• Hortonworks, MapR, and Cloudera all have good videos
for free
• Hortonworks sandbox
• Azure HDInsight VMs
• Hadoop: The Definitive Guide
• Tons of blog posts
• Lots of open source projects
Ike Ellis
• www.ikeellis.com
• SQL Pass Book Readers – VC Leader
• @Ike_Ellis
• 619.922.9801
• Microsoft MVP
• Quick Tips – YouTube
• San Diego TIG Founder and Chairman
• San Diego .NET User Group Steering Committee Member

More Related Content

What's hot

Simple Works Best
 Simple Works Best Simple Works Best
Simple Works BestEDB
 
Messaging architecture @FB (Fifth Elephant Conference)
Messaging architecture @FB (Fifth Elephant Conference)Messaging architecture @FB (Fifth Elephant Conference)
Messaging architecture @FB (Fifth Elephant Conference)Joydeep Sen Sarma
 
Facebook Presto presentation
Facebook Presto presentationFacebook Presto presentation
Facebook Presto presentationCyanny LIANG
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop OperationsOwen O'Malley
 
SQL Server It Just Runs Faster
SQL Server It Just Runs FasterSQL Server It Just Runs Faster
SQL Server It Just Runs FasterBob Ward
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Data Con LA
 
PostgreSQL Finland October meetup - PostgreSQL monitoring in Zalando
PostgreSQL Finland October meetup - PostgreSQL monitoring in ZalandoPostgreSQL Finland October meetup - PostgreSQL monitoring in Zalando
PostgreSQL Finland October meetup - PostgreSQL monitoring in ZalandoUri Savelchev
 
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...Fwdays
 
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLCompressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLArseny Chernov
 
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...Spark Summit
 
Migrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSMigrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSBouquet
 
Денис Резник "Моя база данных не справляется с нагрузкой. Что делать?"
Денис Резник "Моя база данных не справляется с нагрузкой. Что делать?"Денис Резник "Моя база данных не справляется с нагрузкой. Что делать?"
Денис Резник "Моя база данных не справляется с нагрузкой. Что делать?"Fwdays
 
Brk3043 azure sql db intelligent cloud database for app developers - wash dc
Brk3043 azure sql db   intelligent cloud database for app developers - wash dcBrk3043 azure sql db   intelligent cloud database for app developers - wash dc
Brk3043 azure sql db intelligent cloud database for app developers - wash dcBob Ward
 
Stumbling stones when migrating from Oracle
 Stumbling stones when migrating from Oracle Stumbling stones when migrating from Oracle
Stumbling stones when migrating from OracleEDB
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesKelly Technologies
 
HBase at Mendeley
HBase at MendeleyHBase at Mendeley
HBase at MendeleyDan Harvey
 
Hdfs 2016-hadoop-summit-dublin-v1
Hdfs 2016-hadoop-summit-dublin-v1Hdfs 2016-hadoop-summit-dublin-v1
Hdfs 2016-hadoop-summit-dublin-v1Chris Nauroth
 

What's hot (20)

Simple Works Best
 Simple Works Best Simple Works Best
Simple Works Best
 
Messaging architecture @FB (Fifth Elephant Conference)
Messaging architecture @FB (Fifth Elephant Conference)Messaging architecture @FB (Fifth Elephant Conference)
Messaging architecture @FB (Fifth Elephant Conference)
 
Facebook Presto presentation
Facebook Presto presentationFacebook Presto presentation
Facebook Presto presentation
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop Operations
 
Presto
PrestoPresto
Presto
 
SQL Server It Just Runs Faster
SQL Server It Just Runs FasterSQL Server It Just Runs Faster
SQL Server It Just Runs Faster
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
 
PostgreSQL Finland October meetup - PostgreSQL monitoring in Zalando
PostgreSQL Finland October meetup - PostgreSQL monitoring in ZalandoPostgreSQL Finland October meetup - PostgreSQL monitoring in Zalando
PostgreSQL Finland October meetup - PostgreSQL monitoring in Zalando
 
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...
 
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLCompressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
 
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
 
Migrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSMigrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMS
 
tdtechtalk20160330johan
tdtechtalk20160330johantdtechtalk20160330johan
tdtechtalk20160330johan
 
Денис Резник "Моя база данных не справляется с нагрузкой. Что делать?"
Денис Резник "Моя база данных не справляется с нагрузкой. Что делать?"Денис Резник "Моя база данных не справляется с нагрузкой. Что делать?"
Денис Резник "Моя база данных не справляется с нагрузкой. Что делать?"
 
Brk3043 azure sql db intelligent cloud database for app developers - wash dc
Brk3043 azure sql db   intelligent cloud database for app developers - wash dcBrk3043 azure sql db   intelligent cloud database for app developers - wash dc
Brk3043 azure sql db intelligent cloud database for app developers - wash dc
 
Stumbling stones when migrating from Oracle
 Stumbling stones when migrating from Oracle Stumbling stones when migrating from Oracle
Stumbling stones when migrating from Oracle
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
 
HBase at Mendeley
HBase at MendeleyHBase at Mendeley
HBase at Mendeley
 
Hdfs 2016-hadoop-summit-dublin-v1
Hdfs 2016-hadoop-summit-dublin-v1Hdfs 2016-hadoop-summit-dublin-v1
Hdfs 2016-hadoop-summit-dublin-v1
 
Cloud Optimized Big Data
Cloud Optimized Big DataCloud Optimized Big Data
Cloud Optimized Big Data
 

Viewers also liked

Introduction to Apache hadoop
Introduction to Apache hadoopIntroduction to Apache hadoop
Introduction to Apache hadoopOmar Jaber
 
Hadoop M/R Pig Hive
Hadoop M/R Pig HiveHadoop M/R Pig Hive
Hadoop M/R Pig Hivezahid-mian
 
Machine Learning on dirty data - Dataiku - Forum du GFII 2014
Machine Learning on dirty data - Dataiku - Forum du GFII 2014Machine Learning on dirty data - Dataiku - Forum du GFII 2014
Machine Learning on dirty data - Dataiku - Forum du GFII 2014Le_GFII
 
Dataiku pig - hive - cascading
Dataiku   pig - hive - cascadingDataiku   pig - hive - cascading
Dataiku pig - hive - cascadingDataiku
 
Hive on kafka
Hive on kafkaHive on kafka
Hive on kafkaSzehon Ho
 
An introduction to Apache Hadoop Hive
An introduction to Apache Hadoop HiveAn introduction to Apache Hadoop Hive
An introduction to Apache Hadoop HiveMike Frampton
 
Go Zero to Big Data in 15 Minutes with the Hortonworks Sandbox
Go Zero to Big Data in 15 Minutes with the Hortonworks SandboxGo Zero to Big Data in 15 Minutes with the Hortonworks Sandbox
Go Zero to Big Data in 15 Minutes with the Hortonworks SandboxHortonworks
 
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Sudhir Mallem
 
Proof of Concept for Hadoop: storage and analytics of electrical time-series
Proof of Concept for Hadoop: storage and analytics of electrical time-seriesProof of Concept for Hadoop: storage and analytics of electrical time-series
Proof of Concept for Hadoop: storage and analytics of electrical time-seriesDataWorks Summit
 
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...Edureka!
 
An example of a successful proof of concept
An example of a successful proof of conceptAn example of a successful proof of concept
An example of a successful proof of conceptETLSolutions
 

Viewers also liked (12)

Introduction to Apache hadoop
Introduction to Apache hadoopIntroduction to Apache hadoop
Introduction to Apache hadoop
 
Hadoop M/R Pig Hive
Hadoop M/R Pig HiveHadoop M/R Pig Hive
Hadoop M/R Pig Hive
 
Machine Learning on dirty data - Dataiku - Forum du GFII 2014
Machine Learning on dirty data - Dataiku - Forum du GFII 2014Machine Learning on dirty data - Dataiku - Forum du GFII 2014
Machine Learning on dirty data - Dataiku - Forum du GFII 2014
 
Dataiku pig - hive - cascading
Dataiku   pig - hive - cascadingDataiku   pig - hive - cascading
Dataiku pig - hive - cascading
 
Hive on kafka
Hive on kafkaHive on kafka
Hive on kafka
 
An introduction to Apache Hadoop Hive
An introduction to Apache Hadoop HiveAn introduction to Apache Hadoop Hive
An introduction to Apache Hadoop Hive
 
Go Zero to Big Data in 15 Minutes with the Hortonworks Sandbox
Go Zero to Big Data in 15 Minutes with the Hortonworks SandboxGo Zero to Big Data in 15 Minutes with the Hortonworks Sandbox
Go Zero to Big Data in 15 Minutes with the Hortonworks Sandbox
 
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
 
Proof of Concept for Hadoop: storage and analytics of electrical time-series
Proof of Concept for Hadoop: storage and analytics of electrical time-seriesProof of Concept for Hadoop: storage and analytics of electrical time-series
Proof of Concept for Hadoop: storage and analytics of electrical time-series
 
Big Data Proof of Concept
Big Data Proof of ConceptBig Data Proof of Concept
Big Data Proof of Concept
 
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
 
An example of a successful proof of concept
An example of a successful proof of conceptAn example of a successful proof of concept
An example of a successful proof of concept
 

Similar to Hadoop for the Absolute Beginner

Hive Evolution: ApacheCon NA 2010
Hive Evolution:  ApacheCon NA 2010Hive Evolution:  ApacheCon NA 2010
Hive Evolution: ApacheCon NA 2010John Sichi
 
Introduction to Hadoop and Big Data
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big DataJoe Alex
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Andrew Brust
 
Distributed Data processing in a Cloud
Distributed Data processing in a CloudDistributed Data processing in a Cloud
Distributed Data processing in a Cloudelliando dias
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkThoughtWorks
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברגTaldor Group
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)VMware Tanzu
 
"R, Hadoop, and Amazon Web Services (20 December 2011)"
"R, Hadoop, and Amazon Web Services (20 December 2011)""R, Hadoop, and Amazon Web Services (20 December 2011)"
"R, Hadoop, and Amazon Web Services (20 December 2011)"Portland R User Group
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewNisanth Simon
 
Bigdata workshop february 2015
Bigdata workshop  february 2015 Bigdata workshop  february 2015
Bigdata workshop february 2015 clairvoyantllc
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introductionSandeep Singh
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarKognitio
 
No sql and sql - open analytics summit
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summitOpen Analytics
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce introGeoff Hendrey
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewAbhishek Roy
 

Hadoop for the Absolute Beginner

  • 1. Hadoop for the Absolute Beginner Ike Ellis, MVP
  • 2. Agenda • What is Big Data? • Why is it a problem? • What is Hadoop? – MapReduce – HDFS • Pig • Hive • Sqoop • HCAT • The Players • Maybe data visualization (depending on time) • Q&A
  • 3. What is Big Data? • Trendy? • Buzz words? • Process? • Big data is “a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications” – Wikipedia • So how do you know your data is big data? • When your existing data processing methodologies are no longer good enough.
  • 5. There are a lot of moving pieces back there… • Sometimes, that's our biggest challenge – Simple question – massive data • Do we really need to go through the pain of that huge stack?
  • 6. Big Data Characteristics • Volume – Large amount of data • Velocity – Need to be processed quickly • Variety – Excel, SQL, OData feeds, CSVs, web downloads, JSON • Variability – Different semantics, in terms of meaning or context • Value
  • 7. Big Data Examples • Structured Data – Pre-defined Schema – Highly Structured – Relational • Semi-structured Data – Inconsistent structure – Cannot be stored in rows and tables in a typical database – Logs, tweets, data feeds, GPS coordinates • Unstructured Data – Lacks structure – Free-form text – Customer feedback forms – Audio – Video
  • 8. The Problem • So you have some data
  • 9. The Problem • And you want to clean and/or analyze it
  • 10. So you use the technology that you know • Excel • SQL Server • SQL Server Integration Services • SQL Server Reporting Services
  • 11. But what happens if it’s TONS of data • Like all the real estate transactions in the US for the last ten years? • Or GPS data from every bike in your bike rental store? • Or every swing and every pitch from every baseball game since 1890?
  • 12. Or what happens when the analysis is very complicated? • Tell me when earthquakes happen! • Tell me how shoppers view my website! • Tell me how to win my next election!
  • 13. So you use SQL Server, and have a lot of data, so…. • YOU SCALE UP! • But SQL can only have so much RAM, CPU, Disk I/O, Network I/O • So you hit a wall, probably with disk I/O • So you….
  • 14. Scale Out! • Add servers until the pain goes away…. All analysis is done away from the data servers
  • 15. But that’s easier said than done • What's the process? • You take one large task, and break it up into lots of smaller tasks – How do you break them up? – Once it's broken up and processed, how do you put them back together? – How do you make sure you break them up evenly so they all execute at the same rate? – And really, you're breaking up two things: • Physical data • Computational Analysis – If one small task fails, how do you restart it? Log it? Recover from failure? – If one SQL Server fails, how do you divert all the new tasks away from it? – How do you load balance? • So you end up writing a lot of plumbing code….and even when you get done….you have one GIANT PROBLEM!
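To make the "break it up, process it, put it back together" questions concrete, here is a minimal Python sketch. It is only an illustration: `split_evenly`, `process_chunk`, and `scale_out_sum` are hypothetical names, the "analysis" is just a sum, and threads on one machine stand in for separate servers. It deliberately ignores the harder questions on the slide (restarts, logging, diverting work from failed servers).

```python
from concurrent.futures import ThreadPoolExecutor


def split_evenly(items, n_chunks):
    """Split items into n_chunks contiguous pieces whose sizes differ by at most one."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def process_chunk(chunk):
    """Stand-in for the real analysis: here, just add the numbers up."""
    return sum(chunk)


def scale_out_sum(items, n_workers=4):
    """Break the task up, run the pieces in parallel, then recombine the partial results."""
    chunks = split_evenly(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    return sum(partials)
```

Even this toy version needs its own splitting and recombining logic, which is the plumbing the slide is complaining about.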
  • 16. Data Movement Data moves to achieve fault tolerance, to segment data, to reassemble data, to derive data, to output data, etc, etc….and network (and disk) is SLOW..you’ve saturated it.
  • 17. Oh, and another problem • In SQL, the performance difference between a query over 1MB of data and one over 1TB of data is significant • The performance difference between a query over one server and one over 20 servers is also significant
  • 18. So to summarize and repeat • Drive seek time….BIG PROBLEM • Drive channel latency…BIG PROBLEM • Data + processing time…BIG PROBLEM • Network Pipe I/O saturation…BIG PROBLEM • Lots of human problems – Building a data warehouse stack is a difficult challenge • Semi-structured data is difficult to handle – When data changes, it becomes less structured and less valuable as it changes – Flexible structures often give us fits
  • 19. Enter Hadoop • Why write your own framework to handle fault tolerance, logging, data partitioning, heavy analysis when you can just use this one?
  • 20. What is Hadoop? • Hadoop is a distributed storage and processing technology for large scale applications – HDFS • Self-healing, distributed file system. Breaks files into blocks and stores them redundantly across cluster – MapReduce • Framework for running large data processing jobs in parallel across many nodes and combining results • Open Source • Distributed Data Replication • Commodity hardware • Disparate hardware • Data and analysis co-location • Scalability • Reliable error handling
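The idea of breaking a file into blocks and storing each block redundantly can be sketched in a few lines of Python. This is a toy model, not HDFS code: the function names are hypothetical, the block size is tiny, and the round-robin placement is a simplification chosen for readability rather than the placement policy a real cluster uses.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a file's bytes into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(n_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (a simplification)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(n_blocks)
    }


def readable_after_failure(placement, failed_node):
    """A file is still readable if every block has a replica on a surviving node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )
```

With three replicas per block, losing any single node still leaves two copies of every block, which is the property the "self-healing" bullet relies on.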
  • 22. Under the covers Hadoop works by keeping the compute next to the data (to minimize network I/O costs)
  • 25. MapReduce Process – Very simple example
  • 26. Programming MapReduce • Steps – Define the inputs • Usually some files in HDFS/HBase (Or Azure Blob Storage) – Write a map function – Write a reduce function – Define outputs • Usually some files in HDFS/HBase (Or Azure Blob Storage) • Lots of options for both inputs and outputs • Functions are usually written in Java – Or Python – Even .NET (C#, F#)
  • 27. Scalability • Hadoop scales linearly with data size – Or analysis complexity – Scales to hundreds of petabytes • Data-parallel or compute-parallel • Extensive machine learning on <100GB of image data • Simple SQL queries on >100TB of clickstream data • Hadoop works for both!
  • 28. Hadoop allows you to write a query like this Select productname, sum(costpergoods) From salesorders Group by productname • Over a ton of data, or a little data, and have it perform about the same • If it slows down, throw more nodes at it • Map is like the GROUP BY • While reduce is like the aggregate
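A rough Python sketch of how that query decomposes: the map step emits the grouping key with the value to aggregate, a shuffle step collects values per key, and the reduce step does the SUM. The function names and the sample rows are made up for illustration; a real job would run the map and reduce steps on many nodes.

```python
from collections import defaultdict


def map_order(order):
    """Map: emit (productname, costpergoods). The key plays the GROUP BY role."""
    yield order["productname"], order["costpergoods"]


def shuffle(pairs):
    """Shuffle: gather every value emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_product(productname, costs):
    """Reduce: the aggregate, here SUM(costpergoods)."""
    return productname, sum(costs)


def run_query(orders):
    pairs = [kv for order in orders for kv in map_order(order)]
    return dict(reduce_product(k, vs) for k, vs in shuffle(pairs).items())


# Hypothetical rows standing in for the salesorders table.
sample_orders = [
    {"productname": "bike", "costpergoods": 100},
    {"productname": "helmet", "costpergoods": 20},
    {"productname": "bike", "costpergoods": 50},
]
```

Calling `run_query(sample_orders)` gives one total per product name, the same shape as the SQL result.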
  • 29. Why use Hadoop? • Who wants to write all that plumbing? – Segmenting data – Making it redundant and fault tolerant – Overcoming job failure – Logging – All those data providers – All the custom scripting languages and tooling – Synchronization – Scale-free programming model • Wide adoption • You specify the map() and reduce() functions – Let the framework do the rest
  • 30. What is Hadoop Good For? • Enormous datasets • Log Analysis • Calculating statistics on enormous datasets • Running large simulations • ETL • Machine learning • Building inverted indexes • Sorting – World record • Distributed Search • Tokenization • Image processing • No fancy hardware…good in the cloud • And so much more!
  • 31. What is Hadoop Bad For? • Low latency (not current data) • Sequential algorithms – Recursion • Joins (sometimes) • When all the data is structured and can fit on one database server with scaling up – It is NOT a replacement for a good RDBMS
  • 33. Another Problem • MapReduce functions are written in Java, Python, .NET, and a few other languages • Those are languages that are widely known • Except by analysts and DBAs, the exact kind of people who struggle with big data • Enter Pig & Hive – Abstraction for MapReduce – Sits over MapReduce – Spawns MapReduce jobs
  • 34. What MapReduce Functions look like function map(String name, String document): // name: document name // document: document contents for each word w in document: emit (w, 1) function reduce(String word, Iterator partialCounts): // word: a word // partialCounts: a list of aggregated partial counts sum = 0 for each pc in partialCounts: sum += ParseInt(pc) emit (word, sum)
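The word-count pseudocode can be run on a single machine as the Python sketch below. `map_fn` and `reduce_fn` mirror the slide's two functions; `word_count` is a hypothetical driver that stands in for the framework's shuffle step, which would normally group the emitted pairs by word across the cluster.

```python
from collections import defaultdict


def map_fn(name, document):
    """name: document name; document: document contents. Emits (word, 1) per word."""
    for word in document.split():
        yield word, 1


def reduce_fn(word, partial_counts):
    """word: a word; partial_counts: the counts emitted for it. Emits (word, total)."""
    return word, sum(int(pc) for pc in partial_counts)


def word_count(documents):
    """Driver standing in for the framework: run map, group by key, run reduce."""
    grouped = defaultdict(list)
    for name, document in documents.items():
        for word, count in map_fn(name, document):
            grouped[word].append(count)
    return dict(reduce_fn(word, counts) for word, counts in grouped.items())
```

For example, `word_count({"a.txt": "big data big", "b.txt": "data"})` counts each word across both documents.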
  • 35. Introduction to Pig • Pig – ETL for big data – Structure – Pig Latin • Parallel data processing for Hadoop • Not trying to get you to learn Pig. Just want you to want to learn it.
  • 36. Here’s what SQL looks like Select customername, count(orderdate) as totalOrders From salesOrders so Join customers c On so.custid = c.custid Group by customername
  • 37. Pig trx = LOAD 'transaction' AS (customer, orderamount); grouped = GROUP trx BY customer; ttl = FOREACH grouped GENERATE group, SUM(trx.orderamount) AS tp; cust = LOAD 'customers' AS (customer, postalcode); result = JOIN ttl BY group, cust BY customer; DUMP result; Executes one step at a time
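For readers who do not know Pig Latin yet, the same load, group, sum, and join steps can be followed in plain Python. This is only an analogy: the sample rows and the `run_script` name are invented for illustration, and the in-memory lists stand in for files that Pig would read from the cluster.

```python
# Hypothetical sample rows standing in for the two loaded data sets.
transactions = [("alice", 10.0), ("bob", 5.0), ("alice", 2.5)]
customers = [("alice", "92101"), ("bob", "92037")]


def run_script(transactions, customers):
    # Group step: collect order amounts per customer.
    grouped = {}
    for customer, orderamount in transactions:
        grouped.setdefault(customer, []).append(orderamount)

    # Foreach/generate step: one (group, total) row per customer.
    ttl = [(group, sum(amounts)) for group, amounts in grouped.items()]

    # Join step: match totals to customer rows on the customer name.
    result = [
        (group, tp, customer, postalcode)
        for group, tp in ttl
        for customer, postalcode in customers
        if group == customer
    ]
    return result
```

Each statement builds on the names defined above it, which is the one-step-at-a-time flow the slide describes.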
  • 38. Pig is like SSIS • One step at a time. One thing executes, then the next in the script, acting on the variable declarations above it
  • 39. How Pig Works • Pig Latin goes to pre-processor • Pre-processor creates MapReduce jobs that get submitted to the JobTracker
  • 40. Pig Components • Data Types • Inputs & Outputs • Relational Operators • UDFs • Scripts & Testing
  • 41. Pig Data Types • Scalar – Int – Long – Float – Double – CharArray – ByteArray • Complex – Map (key/value pair) – Tuple (fixed-size ordered collection) – Bag (collection of tuples)
  • 42. Pig: Inputs/Outputs • Load – PigStorage – TextLoader – HBaseStorage • Store – PigStorage – HBaseStorage • Dump – Dumps to console – Don't dump a ton of data…uh oh…
  • 43. Pig: Relational Operators • Foreach – projection operator, applies expression to every row in the pipeline – Flatten – used with complex types, PIVOT • Filter – WHERE • Group, Cogroup – GROUP BY (Cogroup on multiple keys) • ORDER BY • Distinct • JOIN (INNER, OUTER, CROSS) • LIMIT – TOP • Sample – Random sample • Parallel – level of parallelism on the reducer side • Union
  • 44. Pig: UDFs • Written in Java/Python • String manipulation, math, complex type operations, parsing
  • 45. Pig: Useful commands • Describe – shows schema • Explain – shows the logical and physical MapReduce plan • Illustrate – runs a sample of your data to test your script • Stats – produced after every run and includes start/end times, # of records, MapReduce info • Supports parameter substitution and parameter files • Supports macros and functions (define) • Supports includes for script organization
  • 47. Introduction to HIVE • Very popular • Hive Query Language • Defining Tables, Views, Partitioning • Querying and Integration • VERY SQL-LIKE • Developed by Facebook • Data Warehouse for Hadoop • Based on SQL-92 specification
  • 48. SQL vs Hive • Almost useless to compare the two, because they are so similar • Create table Internal/External • Hive is schema on read – It defines a schema over your data that already exists in the database
  • 49. Hive is not a replacement for SQL • So don't throw out SQL just yet • Hive is for batch processing large data sets that may span hundreds, or even thousands, of machines – Not for row-level updates • Hive has high overhead when starting a job. It translates queries to MR so it takes time • Hive does not cache data • Hive performance tuning is mainly Hadoop performance tuning • Similarity in the query engine, but different architectures for different purposes • Way too slow for OLTP workloads
  • 51. Hive Data Types • Scalar – TinyInt – SmallInt – Int – BigInt – Boolean – Float – Double – TimeStamp – String – Binary • Complex – Struct – Array (Collection) – Map (key/value pair)
  • 52. What is a Hive Table? • CREATE DATABASE NewDB – LOCATION 'hdfshuaNewDB' • CREATE TABLE • A Hive table consists of: – Data: typically a file in HDFS – Schema: in the form of metadata stored in a relational database • Schema and data are separate – A schema can be defined for existing data – Data can be added or removed independently – Hive can be pointed to existing data • You have to define schema if you have existing data in HDFS that you want to use in Hive
  • 53. How does Hive work? • Hive as a Translation Tool – Compiles and executes queries – Hive translates the SQL Query to a MapReduce job • Hive as a structuring tool – Creates a schema around the data in HDFS – Tables stored in directories • Hive Tables have rows and columns and data types • Hive Metastore – Namespace with a set of tables – Holds table definitions • Partitioning – Choose a partition key – Specify key when you load data
  • 54. Define a Hive Table CREATE TABLE myTable (name string, age int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ';' STORED AS TEXTFILE;
  • 55. Loading Data • Use LOAD DATA to import data into a Hive table • LOAD DATA LOCAL INPATH 'input/mydata/data.txt' INTO TABLE myTable • The files are not modified in Hive – they are loaded as is • Use the word OVERWRITE to write over a file of the same name • Hive can read all the files in a particular directory • The schema is checked when the data is queried – If a row does not match the schema, it will be read as null
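Schema on read can be illustrated with a short Python sketch: the raw lines are stored untouched, and the schema is applied only when the rows are read. The `read_with_schema` function is hypothetical, and the schema matches the (name string, age int) table with ';' as the field delimiter. One assumption to flag: this sketch nulls the individual field that fails to parse rather than the whole row, which is my understanding of the behavior the slide summarizes.

```python
def read_with_schema(lines, schema, delimiter=";"):
    """Apply (column, type) pairs to raw text lines at read time.

    Fields that are missing or cannot be converted come back as None,
    the Python stand-in for NULL. The stored lines are never modified.
    """
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        row = {}
        for position, (column, cast) in enumerate(schema):
            try:
                row[column] = cast(fields[position])
            except (IndexError, ValueError):
                row[column] = None  # does not match the schema
        rows.append(row)
    return rows


# Schema for the example table: name string, age int.
schema = [("name", str), ("age", int)]
```

Reading `["ike;40", "bob;unknown"]` with this schema returns a proper age for the first row and `None` for the second, and nothing fails at load time.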
  • 56. Querying Data • SELECT – WHERE – UNION ALL/DISTINCT – GROUP BY – HAVING – LIMIT – REGEX • Subqueries • JOIN – INNER – OUTER • ORDER BY – Reducer is 1 • SORT BY – Multiple reducers with a sorted file from each
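The difference between ORDER BY and SORT BY is easier to see with a small Python sketch. Both functions are hypothetical illustrations; in particular, `sort_by` spreads rows across reducers round-robin purely to keep the example deterministic, which is a simplification of how rows actually reach reducers.

```python
def order_by(rows):
    """ORDER BY: everything funnels through one reducer, giving one globally sorted output."""
    return sorted(rows)


def sort_by(rows, n_reducers):
    """SORT BY: each reducer sorts only its own share, giving one sorted file per reducer."""
    partitions = [[] for _ in range(n_reducers)]
    for position, row in enumerate(rows):
        # Round-robin distribution, assumed here for illustration only.
        partitions[position % n_reducers].append(row)
    return [sorted(partition) for partition in partitions]
```

With `rows = [5, 1, 4, 2, 3]`, `order_by(rows)` is fully sorted, while `sort_by(rows, 2)` returns two lists that are each sorted but not sorted relative to each other. That is the trade: more parallelism in exchange for no global order.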
  • 58. Pig Vs Hive • Famous Yahoo Blog Post – http://developer.yahoo.com/blogs/hadoop/pig-hive-yahoo464.html • PIG – ETL – For preparing data for easier analysis – Good for SQL authors that take the time to learn something new – Unless you store it, all data goes away when the script is finished • Hive – Analysis • When you have to answer a specific question – Good for SQL authors – Excel connectivity – Persists data in the Hadoop data store
  • 59. Sqoop • SQL to Hadoop – SQL Server/Oracle/Something with a JDBC driver • Import – From RDBMS into HDFS • Export – From HDFS into RDBMS • Other Commands – Create hive table – Evaluate import statement
  • 60. HUE • Hadoop User Experience
  • 61. HCatalog • Metadata and table management system for Hadoop • Provides a shared schema and data type mechanism for various Hadoop tools (Pig, Hive, MapReduce) – Enables interoperability across data processing tools – Enables users to choose the best tools for their environments • Provides a table abstraction so that users need not be concerned with how data is stored – Presents users with a relational view of data
  • 62.
  • 63. HCatalog DDL • CREATE/ALTER/DROP Table • SHOW TABLES • SHOW FUNCTIONS • DESCRIBE • Supports a subset of Hive DDL
  • 64. Why do we have HCat? • Tools don't tend to agree on – What a schema is – What data types are – How data is stored • HCatalog solution – Provides one consistent data model for various Hadoop tools – Provides shared schema – Allows users to see when shared data is available
  • 65. HCatalog – HBase Integration • Connects HBase tables to HCatalog • Uses various Hadoop tools • Provides flexibility with data in HBase or HDFS
  • 67. HBase • NoSQL Database • Modeled after Google BigTable • Written in Java • Runs on top of HDFS • Features – Compression – In-memory operations – Bloom filters • Can serve as input or output for MapReduce jobs • Facebook's messaging platform uses it
  • 68. Yarn • Apache Hadoop Next Gen MapReduce • Yet aNother Resource Negotiator • Separates resource management and processing components – Breaking up the job tracker • YARN was born of a need to enable a broader array of interaction patterns for data stored in HDFS beyond MapReduce
  • 69. Impala • Cloudera • Real-time queries for Hadoop • Low-latency Queries using SQL to HDFS or HBase
  • 70. Storm • Free and open source distributed real-time computation system • Makes it easy to process unbounded streams of data • Storm is fast – Million tuples processed per second per node
  • 72. The Future • Hadoop features will push into RDBMS systems • RDBMS features will continue to push into Hadoop • Tons of 3rd party vendors and open source projects have applications for Hadoop and RDBMS/Hadoop integration • Lots of buy-in, lots of progress, lots of changes
  • 73. How to Learn Hadoop • Lots of YouTube videos online • Hortonworks, MapR, and Cloudera all have good videos for free • Hortonworks sandbox • Azure HDInsight VMs • Hadoop: The Definitive Guide • Tons of blog posts • Lots of open source projects
  • 74. Ike Ellis • www.ikeellis.com • SQL Pass Book Readers – VC Leader • @Ike_Ellis • 619.922.9801 • Microsoft MVP • Quick Tips – YouTube • San Diego TIG Founder and Chairman • San Diego .NET User Group Steering Committee Member