Pig and Hive
Csaba Toth
Central California .NET User Group
Meeting
Date: April 17th, 2014
Location: Bitwise Industries, Fresno
Agenda
• Little recap of Hadoop and Map-Reduce
• Pig and Hive
• Recommendation engine
• Demo 1: Exercises with on-premise Hadoop emulator
• Demo 2: Azure HDInsight
Hadoop
• Hadoop is an open-source software
framework that supports data-intensive
distributed applications.
• Has two main pieces:
– Storing large amounts of data: HDFS, Hadoop
Distributed File System
– Processing large amounts of data: implementation
of the MapReduce programming model
Hadoop
• All of this in a cost-effective way: Hadoop
manages a cluster of commodity computers.
• The cluster is composed of a single master
node and multiple worker nodes
• It is written in Java and runs on JVMs
HDFS visually
[Diagram: a name node holding the metadata store, above three data nodes (Node 1, Node 2, Node 3); each data node stores a copy of Block A and Block B]
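The picture above can be approximated with a toy sketch: a file is cut into blocks, and a name-node-style metadata table records which nodes hold each block. The function names, the 8-byte block size, and the round-robin placement are assumptions made for illustration only; real HDFS uses far larger blocks and a rack-aware placement policy.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(block_ids, nodes, replication):
    """Toy metadata store: map each block id to the nodes holding a replica.

    Placement is plain round-robin, an assumption for this sketch only.
    """
    metadata = {}
    for i, block_id in enumerate(block_ids):
        metadata[block_id] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return metadata

data = b"0123456789ABCDEF"               # toy 16-byte "file"
blocks = split_into_blocks(data, 8)      # two blocks of 8 bytes each
block_ids = ["Block A", "Block B"]
nodes = ["Node 1", "Node 2", "Node 3"]

# With replication 3 on 3 nodes, every node ends up with both blocks,
# which matches the layout drawn in the diagram.
metadata = place_blocks(block_ids, nodes, replication=3)
for block_id, holders in metadata.items():
    print(block_id, "->", holders)
```

Because only the metadata table knows where the replicas live, losing a data node still leaves every block reachable through the other two.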
Job / task management
[Diagram: the name node runs the Jobtracker and exchanges heartbeat signals and communication with three data nodes; each data node runs a Tasktracker with its own map and reduce tasks (Map 1 / Reduce 1, Map 2 / Reduce 2, Map 3 / Reduce 3)]
MapReduce
• Hadoop leverages the functional programming model
of map/reduce.
• Moves away from shared resources and related
synchronization and contention issues
• Thus inherently scalable and suitable for processing
large data sets, distributed computing on clusters of
computers/nodes.
• The goal of MapReduce is to break huge data sets into
smaller pieces, distribute those pieces to various
worker nodes, and process the data in parallel.
• Hadoop leverages a distributed file system to store the
data on various nodes.
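The "break into pieces, distribute, process in parallel" idea can be sketched on a single machine. This is only an analogy: threads stand in for worker nodes, and the helper names and chunk size are assumptions of this sketch, not anything Hadoop provides.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(records, chunk_size):
    """Break a large list of records into smaller, fixed-size pieces."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

def process_chunk(chunk):
    """Stand-in for the work one worker node does on its own piece."""
    return sum(chunk)

records = list(range(1, 101))            # toy data set: the numbers 1..100
chunks = split_into_chunks(records, 25)  # four pieces of 25 records each

# Each chunk goes to a separate worker; no worker touches another's data,
# so no locking or shared state is needed.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(process_chunk, chunks))

total = sum(partial_sums)                # combine the partial results
print(partial_sums, total)
```

The workers never share mutable state, which is the property that lets the same pattern scale out to many machines.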
MapReduce
• It is about two functions: map and reduce
1. Map Step:
– Processes key/value pairs and generates a set of
intermediate key/value pairs from them
2. Shuffle Step:
– Groups all intermediate values associated with the
same intermediate key into one set
3. Reduce Step:
– Processes the intermediate values associated with the
same intermediate key and produces a set of values
based on the groups (usually some kind of aggregate)
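The three steps above can be sketched in plain Python for the classic word count. This is a single-process illustration, not Hadoop code; the function names and the three sample lines are assumptions of the sketch.

```python
from collections import defaultdict

def map_step(line):
    """Map: emit an intermediate (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_step(pairs):
    """Shuffle: group all intermediate values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_step(key, values):
    """Reduce: aggregate the grouped values for one key (here, a sum)."""
    return key, sum(values)

lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

intermediate = [pair for line in lines for pair in map_step(line)]
grouped = shuffle_step(intermediate)
word_counts = dict(reduce_step(key, values) for key, values in grouped.items())
print(word_counts)
```

On a cluster, each map call could run on a different node, and the shuffle would move the grouped values across the network to the reducers.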
Word count
http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview1.png
Map, Shuffle, and Reduce
https://mm-tom.s3.amazonaws.com/blog/MapReduce.png
Hadoop Ecosystem and MapReduce
• Writing MapReduce jobs in Java, C#, or other
languages is useful, but it is the hard way
• Several domain-specific higher-level languages
and frameworks exist
– They let you express complicated tasks far more
simply and concisely than in Java
– These languages transparently translate everything
into MapReduce jobs under the hood
• We’ll see two examples: Hive and Pig
Hive and Pig
• Hive (http://hive.apache.org/)
– Provides SQL-like approach
– Works best if the input data conforms to some
schema, so it can be consumed (SQL requires tabular
data: tables with columns)
– Good for someone coming from an SQL background
• Pig (http://pig.apache.org/)
– The syntax is closer to a programming language
– It defines a series of transformations, projecting one
schema into another
– Also works best if the data is not totally free-form and has
some kind of schema
Hadoop Ecosystem / Architecture
[Diagram: layered architecture, listed here from data sources up to presentation]
• Data sources: Log Data, RDBMS
• Data Integration Layer: Flume, Sqoop
• Storage Layer (HDFS)
• Computing Layer (MapReduce)
• Advanced Query Engine (Hive, Pig)
• Data Mining (Pegasus, Mahout); Index, Searches (Lucene); DB drivers (Hive driver)
• Presentation Layer: Web Browser (JS)
Simple recommendation engine
• Sites such as Amazon.com and Netflix.com use
complex algorithms
• But the underlying concepts are simple:
finding correlation between data
• Pearson Coefficient, Excel CORREL function
Pearson coefficient
Pearson product-moment correlation coefficient value | Comments
-1 | Perfectly correlated data, but as one rises the other decreases
0 | Uncorrelated data
+1 | Perfectly correlated data
• DEMO
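For reference, the coefficient is commonly defined as below for n paired observations. The symbols (x_i, y_i for the paired values, and barred letters for their means) are notation chosen for this note and do not come from the slides.

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;
               \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\quad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
```

The numerator measures how the two series move together; the denominator rescales it so the result always falls between -1 and +1.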
Ratings data
Name | The Lord of the Rings | The Chronicles of Narnia
Jack | 2 | 3
Mark | 4 | 4.5
Albert | 4 | 3.5
John | 5 | 5
• Pearson Correlation Coefficient: 0.8705715
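As a cross-check, the coefficient for the four ratings above can be computed with a few lines of Python. This is a minimal sketch, assuming the plain sample formula (the same one Excel's CORREL uses); the function and variable names are illustrative.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# name -> (The Lord of the Rings, The Chronicles of Narnia)
ratings = {"Jack": (2, 3), "Mark": (4, 4.5), "Albert": (4, 3.5), "John": (5, 5)}

lotr = [pair[0] for pair in ratings.values()]
narnia = [pair[1] for pair in ratings.values()]

r = pearson(lotr, narnia)
print(r)
```

The result should agree with the value quoted on the slide to the digits shown there.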
Ratings data
Name of movie critic | Name of movie | Rating
Lisa Rose | Lady in the Water | 2.5
Lisa Rose | Snakes on a Plane | 3.5
Lisa Rose | Just My Luck | 3
Lisa Rose | Superman Returns | 3.5
Lisa Rose | You Me and Dupree | 2.5
Lisa Rose | The Night Listener | 3
Gene Seymour | Lady in the Water | 3
Gene Seymour | Snakes on a Plane | 3.5
Gene Seymour | Just My Luck | 1.5
Gene Seymour | Superman Returns | 5
Gene Seymour | The Night Listener | 3
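One naive way to use such rows is to score how similarly two critics rate the movies they have both seen. The sketch below is an assumption about how that could be done in plain Python; it is not the C# demo code, and the helper names are hypothetical.

```python
from math import sqrt

rows = [
    ("Lisa Rose", "Lady in the Water", 2.5),
    ("Lisa Rose", "Snakes on a Plane", 3.5),
    ("Lisa Rose", "Just My Luck", 3.0),
    ("Lisa Rose", "Superman Returns", 3.5),
    ("Lisa Rose", "You Me and Dupree", 2.5),
    ("Lisa Rose", "The Night Listener", 3.0),
    ("Gene Seymour", "Lady in the Water", 3.0),
    ("Gene Seymour", "Snakes on a Plane", 3.5),
    ("Gene Seymour", "Just My Luck", 1.5),
    ("Gene Seymour", "Superman Returns", 5.0),
    ("Gene Seymour", "The Night Listener", 3.0),
]

# Reshape the flat rows into critic -> {movie: rating}.
prefs = {}
for critic, movie, rating in rows:
    prefs.setdefault(critic, {})[movie] = rating

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def similarity(prefs, a, b):
    """Correlate two critics over the movies both of them rated."""
    shared = sorted(set(prefs[a]) & set(prefs[b]))
    if len(shared) < 2:
        return 0.0
    return pearson([prefs[a][m] for m in shared], [prefs[b][m] for m in shared])

shared_movies = sorted(set(prefs["Lisa Rose"]) & set(prefs["Gene Seymour"]))
score = similarity(prefs, "Lisa Rose", "Gene Seymour")
print(shared_movies, score)
```

Only the five movies both critics rated enter the score; "You Me and Dupree" is skipped because the rows above contain no Gene Seymour rating for it.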
Simple recommendation
• DEMO
– C# implementation
– Simple naïve algorithm
– Non parallel, not Hadoop
Pig
• A data-flow language
• Express the processing as a series of
transformations
• Steps are translated into MapReduce jobs
• It can be thought of as similar to LINQ
• We’ll learn it by example
– Pig’s command line shell: grunt
– Pig’s language: Pig Latin
Pig
• Load and store:
– can load/store data from/to HDFS
• Relations
– Transformations are performed on ‘relations’, which is
Pig's name for collections; do not confuse this with
traditional relational DB terminology!
– Think of it as a table with rows and columns of data
– When grouped, relations can contain associative
key-value pairs
Pig
• Joins
– Can accomplish joins in a conceptually intuitive
manner (using a common key)
• Filter
– Can apply filters to data; a predicate must be
specified
• Projection
– Can project from an existing collection, i.e. form a new
collection much like an SQL SELECT does. This is the
“GENERATE” command of Pig (used inside FOREACH)
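For readers more at home in a general-purpose language, the three operations above have direct analogues in Python comprehensions. This is an analogy only, not Pig Latin; the tuple layout (critic, movie, rating) and the variable names are assumptions of the sketch.

```python
# A small "relation": rows of (critic, movie, rating).
ratings = [
    ("Lisa Rose", "Snakes on a Plane", 3.5),
    ("Lisa Rose", "Just My Luck", 3.0),
    ("Gene Seymour", "Snakes on a Plane", 3.5),
    ("Gene Seymour", "Just My Luck", 1.5),
]

# Filter: keep only the rows that satisfy a predicate.
good = [row for row in ratings if row[2] >= 3.0]

# Join: self-join on the common key (the movie), pairing two critics
# who rated the same movie; a[0] < b[0] keeps each pair only once.
joined = [
    (a, b)
    for a in ratings
    for b in ratings
    if a[1] == b[1] and a[0] < b[0]
]

# Projection: form a new relation that keeps only selected fields.
projected = [(a[1], a[2], b[2]) for a, b in joined]

print(good)
print(projected)
```

The projected rows (movie, first rating, second rating) are the shape a correlation step would consume next.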
Pig
• Grouping
– Can group data by one or more keys. Once grouped,
you can maintain the hierarchical structure in the
relation throughout the transformations. Projections
can be made, or sometimes you can flatten some of
the hierarchy.
• Dump
– The DUMP statement outputs the contents of a relation to
the console. Useful when experimenting in the Pig
shell
• Extensible
– UDFs: User Defined Functions
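Grouping, flattening, dumping, and user-defined functions can be mimicked in plain Python as well. Again this is an analogy rather than Pig Latin, and every name below is hypothetical.

```python
from collections import defaultdict

ratings = [
    ("Lisa Rose", "Snakes on a Plane", 3.5),
    ("Lisa Rose", "Just My Luck", 3.0),
    ("Gene Seymour", "Snakes on a Plane", 3.5),
    ("Gene Seymour", "Just My Luck", 1.5),
]

def group_by(rows, key_index):
    """Group rows by one field; each key maps to a nested bag of rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return dict(groups)

def average_rating(rows):
    """A 'user-defined function' applied to each group's bag of rows."""
    return sum(row[2] for row in rows) / len(rows)

# Grouping keeps the hierarchy: critic -> list of that critic's rows.
by_critic = group_by(ratings, 0)

# Projection over the grouped relation, using the user-defined function.
averages = {critic: average_rating(rows) for critic, rows in by_critic.items()}

# Flatten: collapse the hierarchy back into one flat list of rows.
flattened = [row for rows in by_critic.values() for row in rows]

# "Dump": print the contents of a relation to the console.
for critic, avg in averages.items():
    print(critic, avg)
```

Keeping the nested bags around until the last step is what lets several aggregates be computed from one grouping pass.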
Simple recommendation
• DEMO
– Pig implementation
– Local pseudo cluster (HDInsight on-premise)
Hive DEMO
• Analyzing previous meeting’s wordcount
(warpeace) results
• Analyzing the recommendation engine results
HDInsight
• Microsoft’s Hadoop PaaS solution in the cloud
• Hortonworks Hadoop implementation
• New Azure portal is coming, in preview:
http://azure.microsoft.com/en-us/services/preview/
• Build conference: 40 new functions in Azure
• Latest addition: ISS, Azure Intelligent Systems
Service (https://connect.microsoft.com/site1132/)
• Analytics Platform System (APS): evolutionary
combination of SQL Server PDW and Hadoop
References
• Daniel Jebaraj: Ignore HDInsight at Your Own
Peril: Everything You Need to Know
• Tom White: Hadoop: The Definitive Guide, 3rd
Edition, Yahoo Press
• Lynn Langit’s various presentations and YouTube
videos
• Dattatrey Sindol: Big Data Basics - Part 1 -
Introduction to Big Data
• Bruno Terkaly’s presentations (for example
Hadoop on Azure: Introduction)
Thanks for your attention!
Hadoop vs RDBMS
Aspect | Hadoop / MapReduce | RDBMS
Size of data | Petabytes | Gigabytes
Integrity of data | Low | High (referential, typed)
Data schema | Dynamic | Static
Access method | Batch | Interactive and Batch
Scaling | Linear | Nonlinear (worse than linear)
Data structure | Unstructured | Structured
Normalization of data | Not Required | Required
Query Response Time | Has latency (due to batch processing) | Can be near immediate

BoxLang: Review our Visionary Licenses of 2024
 
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi ArabiaTop 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
 
Launch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in MinutesLaunch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in Minutes
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
 

Hive and Pig for .NET User Group

  • 1. Pig and Hive Csaba Toth Central California .NET User Group Meeting Date: April 17th, 2014 Location: Bitwise Industries, Fresno
  • 2. Agenda • Little recap of Hadoop and Map-Reduce • Pig and Hive • Recommendation engine • Demos 1: Exercises with on-premise Hadoop emulator • Demos 2: Azure HDInsight
  • 3. Hadoop • Hadoop is an open-source software framework that supports data-intensive distributed applications. • Has two main pieces: – Storing large amounts of data: HDFS, Hadoop Distributed File System – Processing large amounts of data: implementation of the MapReduce programming model
  • 4. Hadoop • All of this in a cost effective way: Hadoop is managing a cluster of commodity hardware computers. • The cluster is composed of a single master node and multiple worker nodes • It is written in Java, utilizes JVMs
  • 5. HDFS visually (diagram) • Name node with a metadata store • Three data nodes (Node 1, Node 2, Node 3), each holding Block A and Block B
  • 6. Job / task management (diagram) • Name node with the Jobtracker • Heartbeat signals and communication between the name node and the data nodes • Three data nodes, each with a Tasktracker running one map and one reduce task (Map 1 / Reduce 1, Map 2 / Reduce 2, Map 3 / Reduce 3)
  • 7. MapReduce • Hadoop leverages the functional programming model of map/reduce. • Moves away from shared resources and related synchronization and contention issues • Thus inherently scalable and suitable for processing large data sets, distributed computing on clusters of computers/nodes. • The goal of map reduce is to break huge data sets into smaller pieces, distribute those pieces to various worker nodes, and process the data in parallel. • Hadoop leverages a distributed file system to store the data on various nodes.
  • 8. MapReduce • It is about two functions: map and reduce 1. Map step: – Processes key/value pairs and generates a set of intermediate key/value pairs from them 2. Shuffle step: – Groups all intermediate values associated with the same intermediate key into one set 3. Reduce step: – Processes the intermediate values associated with the same intermediate key and produces a set of values based on the groups (usually some kind of aggregate)
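
The three steps on this slide can be sketched in a few lines of plain Python. This is a single-process illustration of the programming model only, not Hadoop code: the function names (`map_step`, `shuffle_step`, `reduce_step`, `word_count`) and the sample lines are assumptions made for the example, and a real cluster would run the map and reduce calls in parallel on different nodes.

```python
from collections import defaultdict


def map_step(line):
    """Map: emit an intermediate (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_step(pairs):
    """Shuffle: group all intermediate values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_step(key, values):
    """Reduce: aggregate the grouped values for one key (here, a sum)."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of text lines."""
    intermediate = [pair for line in lines for pair in map_step(line)]
    grouped = shuffle_step(intermediate)
    return dict(reduce_step(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical input; each string stands in for one record of a large file.
    print(word_count(["war and peace", "peace and quiet"]))
```

Because each `map_step` call depends only on its own line and each `reduce_step` call only on its own key, the work can be split across nodes without shared state, which is the property the previous slide relies on.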
  • 10. Map, Shuffle, and Reduce https://mm-tom.s3.amazonaws.com/blog/MapReduce.png
  • 11. Hadoop Ecosystem and MapReduce • Writing MapReduce jobs in Java, C#, or other languages is useful, but it is the hard way • Several domain-specific higher-level languages and frameworks exist – They allow complicated tasks to be phrased far more simply and concisely than in Java – These languages translate everything into MapReduce jobs under the hood, transparently • We’ll see two examples: Hive and Pig
  • 12. Hive and Pig • Hive (http://hive.apache.org/) – Provides an SQL-like approach – Works best if the input data at least conforms to some schema, so it can be consumed (SQL requires data organized in columns, i.e. tables) – Good for someone coming from an SQL background • Pig (http://pig.apache.org/) – The syntax is closer to a programming language – It defines a series of transformations, projecting one schema into another – Also works best if the data is not totally free-form and has some kind of schema
  • 13. Hadoop Ecosystem / Architecture (layer diagram) • Data sources: log data, RDBMS • Data integration layer: Flume, Sqoop • Storage layer: HDFS • Computing layer: MapReduce • Advanced query engine: Hive, Pig • Data mining: Pegasus, Mahout • Index, searches: Lucene • DB drivers: Hive driver • Presentation layer: web browser (JS)
  • 14. Simple recommendation engine • Sites such as Amazon.com and Netflix.com use complex algorithms • But the underlying concepts are simple: finding correlation between data • Pearson Coefficient, Excel CORREL function
  • 15. Pearson coefficient • Pearson product-moment correlation coefficient value and comments: – -1: perfectly correlated data, but as one rises the other decreases – 0: uncorrelated data – +1: perfectly correlated data • DEMO
  • 16. Ratings data (name: The Lord of the Rings, The Chronicles of Narnia) – Jack: 2, 3 – Mark: 4, 4.5 – Albert: 4, 3.5 – John: 5, 5 • Pearson Correlation Coefficient: 0.8705715
  • 17. Ratings data (name of movie critic: name of movie, rating) • Lisa Rose: – Lady in the Water, 2.5 – Snakes on a Plane, 3.5 – Just My Luck, 3 – Superman Returns, 3.5 – You Me and Dupree, 2.5 – The Night Listener, 3 • Gene Seymour: – Lady in the Water, 3 – Snakes on a Plane, 3.5 – Just My Luck, 1.5 – Superman Returns, 5 – The Night Listener, 3
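
As a check on the coefficient quoted for the two movies, here is a minimal Python sketch of the Pearson product-moment formula applied to the four ratings (Jack, Mark, Albert, John). The function name `pearson` and the variable names are assumptions for illustration; the talk itself demonstrates this with Excel's CORREL function.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation of two equally long rating lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Ratings in the order Jack, Mark, Albert, John.
    lord_of_the_rings = [2, 4, 4, 5]
    chronicles_of_narnia = [3, 4.5, 3.5, 5]
    # The slide reports 0.8705715 for this pair of columns.
    print(pearson(lord_of_the_rings, chronicles_of_narnia))
```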
  • 18. Simple recommendation • DEMO – C# implementation – Simple naïve algorithm – Non parallel, not Hadoop
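
The demo itself is a C# program that is not reproduced in the transcript. As a stand-in, the following Python sketch shows one possible naive, non-parallel approach using the critic ratings from the previous slide: score two critics by the Pearson correlation over the movies both have rated, then suggest movies the most similar critic has rated and the target has not. All function names here are hypothetical, and this is not the author's implementation.

```python
from collections import defaultdict
from math import sqrt

# (critic, movie, rating) rows copied from the ratings slide.
ROWS = [
    ("Lisa Rose", "Lady in the Water", 2.5),
    ("Lisa Rose", "Snakes on a Plane", 3.5),
    ("Lisa Rose", "Just My Luck", 3.0),
    ("Lisa Rose", "Superman Returns", 3.5),
    ("Lisa Rose", "You Me and Dupree", 2.5),
    ("Lisa Rose", "The Night Listener", 3.0),
    ("Gene Seymour", "Lady in the Water", 3.0),
    ("Gene Seymour", "Snakes on a Plane", 3.5),
    ("Gene Seymour", "Just My Luck", 1.5),
    ("Gene Seymour", "Superman Returns", 5.0),
    ("Gene Seymour", "The Night Listener", 3.0),
]


def by_critic(rows):
    """Pivot flat rows into {critic: {movie: rating}}."""
    prefs = defaultdict(dict)
    for critic, movie, rating in rows:
        prefs[critic][movie] = rating
    return dict(prefs)


def similarity(prefs, a, b):
    """Pearson correlation of critics a and b over the movies both rated."""
    shared = sorted(set(prefs[a]) & set(prefs[b]))
    if len(shared) < 2:
        return 0.0
    xs = [prefs[a][m] for m in shared]
    ys = [prefs[b][m] for m in shared]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm = sqrt(sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys))
    return cov / norm if norm else 0.0


def recommend(prefs, target):
    """Movies the most similar other critic rated but target has not, best first."""
    others = [critic for critic in prefs if critic != target]
    if not others:
        return []
    nearest = max(others, key=lambda critic: similarity(prefs, target, critic))
    unseen = {m: r for m, r in prefs[nearest].items() if m not in prefs[target]}
    return sorted(unseen, key=unseen.get, reverse=True)


if __name__ == "__main__":
    prefs = by_critic(ROWS)
    print(similarity(prefs, "Lisa Rose", "Gene Seymour"))
    print(recommend(prefs, "Gene Seymour"))
```

Comparing every critic with every other critic this way grows quadratically with the number of critics, which is the kind of workload the later Pig and Hive demos move onto Hadoop.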
  • 19. Pig • A data-flow language • Express the processing as a series of transformations • Steps are translated into Map Reduce jobs • We can look at it like LINQ • We’ll learn it by example – Pig’s command line shell: grunt – Pig’s language: Pig Latin
  • 20. Pig • Load and store: – Can load/store data from/to HDFS • Relations – The transformations are performed on ‘relations’ – That is what Pig calls its collections; do not confuse this with traditional relational DB terminology! – Think of it as a table with rows and columns of data – When grouped, relations can contain associative key/value pairs
  • 21. Pig • Joins – Can accomplish joins in a conceptually intuitive manner (using a common key) • Filter – Can apply filters to data. A predicate should be specified • Projection – Can project from an existing collection, i.e. form a new collection the way an SQL SELECT does. That is the “GENERATE” command of Pig
  • 22. Pig • Grouping – Can group data by one or more keys. Once grouped, you can maintain the hierarchical structure in the relation throughout the transformations. Projections can be made, or sometimes you can flatten some of the hierarchy. • Dump – The DUMP statement outputs the contents of a relation to the console. Useful when fooling around in the Pig shell • Extensible – UDFs: User Defined Functions
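
The relation operations listed on the Pig slides (filter, projection, join, grouping, flatten) have rough counterparts in plain Python, which may help readers who have not seen Pig Latin. The sketch below is only an analogy on a few rows taken from the ratings slide: it is not Pig syntax, the variable names are assumptions, and the rating threshold of 3.0 is an arbitrary choice for illustration.

```python
from collections import defaultdict

# A small relation of (critic, movie, rating) tuples from the ratings slide.
ratings = [
    ("Lisa Rose", "Just My Luck", 3.0),
    ("Lisa Rose", "Superman Returns", 3.5),
    ("Gene Seymour", "Just My Luck", 1.5),
    ("Gene Seymour", "Superman Returns", 5.0),
]

# Filter: keep only the rows that satisfy a predicate.
liked = [row for row in ratings if row[2] >= 3.0]

# Projection: form a new relation from selected fields of each row.
critic_movie = [(critic, movie) for critic, movie, _ in ratings]

# Join on a common key (the movie): pair up ratings of the same movie by two
# different critics. The c1 < c2 test keeps each pair of critics only once.
# A nested loop is fine for a sketch but far too slow for real data volumes.
pairs = [
    (m1, c1, r1, c2, r2)
    for c1, m1, r1 in ratings
    for c2, m2, r2 in ratings
    if m1 == m2 and c1 < c2
]

# Grouping: each key maps to a nested bag of the rows that share it.
grouped = defaultdict(list)
for row in ratings:
    grouped[row[0]].append(row)

# Flatten: remove the nesting again, giving back a flat relation.
flattened = [row for bag in grouped.values() for row in bag]

if __name__ == "__main__":
    # Printing a relation plays the role DUMP plays in the Pig shell.
    for name, relation in [("liked", liked), ("pairs", pairs), ("flattened", flattened)]:
        print(name, relation)
```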
  • 23. Simple recommendation • DEMO – Pig implementation – Local pseudo cluster (HDInsight on-premise)
  • 24. Hive DEMO • Analyzing previous meeting’s wordcount (warpeace) results • Analyzing the recommendation engine results
  • 25. HDInsight • Microsoft’s Hadoop PaaS solution in the cloud • Hortonworks Hadoop implementation • New Azure portal is coming, in preview: http://azure.microsoft.com/en-us/services/preview/ • Build conference: 40 new functions in Azure • Latest addition: ISS, Azure Intelligent Systems Service (https://connect.microsoft.com/site1132/) • Analytics Platform System (APS): evolutionary combination of SQL Server PDW and Hadoop
  • 26. References • Daniel Jebaraj: Ignore HDInsight at Your Own Peril: Everything You Need to Know • Tom White: Hadoop: The Definitive Guide, 3rd Edition, Yahoo Press • Lynn Langit’s various presentations and YouTube videos • Dattatrey Sindol: Big Data Basics - Part 1 - Introduction to Big Data • Bruno Terkaly’s presentations (for example Hadoop on Azure: Introduction)
  • 27. Thanks for your attention!
  • 28. Hadoop vs RDBMS (each row: Hadoop / MapReduce vs RDBMS) • Size of data: petabytes vs gigabytes • Integrity of data: low vs high (referential, typed) • Data schema: dynamic vs static • Access method: batch vs interactive and batch • Scaling: linear vs nonlinear (worse than linear) • Data structure: unstructured vs structured • Normalization of data: not required vs required • Query response time: has latency (due to batch processing) vs can be near immediate