Clouds, Hadoop and Cascading. Christopher Curtin.
About Me: 19+ years in technology. Background in Factory Automation, Warehouse Management and Food Safety system development before Silverpop. CTO of Silverpop. Silverpop is a leading marketing automation and email marketing company.
Cloud Computing: What exactly is ‘cloud computing’? Beats me. The most overused term since ‘dot com’. Ask 10 people, get 11 answers.
What is Map/Reduce? Pioneered by Google. Parallel processing of large data sets across many computers. Highly fault tolerant. Splits work into two steps: Map and Reduce.
Map: Identifies what in the input you want to process. Can be simple: the occurrence of a word. Can be complicated: evaluate each row and toss those older than 90 days or from the IP range 192.168.1.*. Output is a list of name/value pairs. The name and value do not have to be primitives.
Reduce: Takes the name/value pairs from the Map step and does something useful with them. The Map/Reduce framework determines which Reduce instance to call for which Map values, so a Reduce only ‘sees’ one set of ‘Name’ values. Output is the ‘answer’ to the question. Example: bytes streamed by IP address from Apache logs.
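For context, here is a minimal sketch of that bytes-by-IP example written against the raw Hadoop API of this era (the org.apache.hadoop.mapred classes); the Apache log column positions and class names are assumptions for illustration, not code from the talk:

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  // Map: emit (ip, bytes) for each Apache log line; token 9 is the byte count in common log format
  class IpBytesMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, LongWritable> {
    public void map(LongWritable offset, Text line,
        OutputCollector<Text, LongWritable> out, Reporter reporter) throws IOException {
      String[] cols = line.toString().split(" ");
      long bytes = "-".equals(cols[9]) ? 0 : Long.parseLong(cols[9]);  // "-" means no body sent
      out.collect(new Text(cols[0]), new LongWritable(bytes));
    }
  }

  // Reduce: the framework guarantees every value for one IP arrives at the same reducer
  class BytesSumReducer extends MapReduceBase implements Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text ip, Iterator<LongWritable> bytes,
        OutputCollector<Text, LongWritable> out, Reporter reporter) throws IOException {
      long total = 0;
      while (bytes.hasNext())
        total += bytes.next().get();
      out.collect(ip, new LongWritable(total));
    }
  }

This is roughly the amount of ceremony Cascading is about to hide.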
Hadoop: Apache’s Map/Reduce framework. Apache License. Yahoo! uses a version and releases their enhancements back to the community.
HDFS: A distributed file system WITHOUT NFS etc. Hadoop knows which parts of which files are on which machine (say that 5 times fast!). “Move the processing to the data” if possible. Simple API to move files in and out of HDFS.
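A minimal sketch of that API (paths invented for illustration), using Hadoop’s FileSystem class to copy files in and out of HDFS:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsCopy {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      // push a raw mailing file into HDFS so the cluster can get at it
      fs.copyFromLocalFile(new Path("/tmp/sent_2009_06.csv"), new Path("/input/sent_2009_06.csv"));
      // pull a finished result back out
      fs.copyToLocalFile(new Path("/output/opens/part-00000"), new Path("/tmp/opens.csv"));
    }
  }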
Runtime Distribution © Concurrent 2009
Getting Started with Map/Reduce: First challenge: finding real examples. Second challenge: when to map and when to reduce? Third challenge: what if I need more than one of each? How do I coordinate them? Fourth challenge: non-trivial business logic.
Cascading: Open source. Puts a wrapper on top of Hadoop. And so much more…
Main Concepts: Tuple, Tap, Operations, Pipes, Flows, Cascades.
Tuple: A single ‘row’ of data being processed. Each column is named. You can access data by name or by position.
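A minimal sketch of both access styles, inside a hypothetical custom Function (the class, field names and domain-splitting logic are invented for illustration):

  import cascading.flow.FlowProcess;
  import cascading.operation.BaseOperation;
  import cascading.operation.Function;
  import cascading.operation.FunctionCall;
  import cascading.tuple.Fields;
  import cascading.tuple.Tuple;
  import cascading.tuple.TupleEntry;

  public class DomainExtractor extends BaseOperation implements Function {
    public DomainExtractor() {
      super(new Fields("domain"));   // the one field this Function emits
    }

    public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
      TupleEntry args = functionCall.getArguments();
      String byName = args.getString("recipient_email");    // access by column name
      String byPosition = args.getTuple().getString(0);     // same value, accessed by position
      functionCall.getOutputCollector().add(new Tuple(byName.substring(byName.indexOf('@') + 1)));
    }
  }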
Tap: An abstraction on top of Hadoop files. Allows you to define your own parser for files. Example:
  Tap input = new Hfs(new TextLine(), a_hdfsDirectory + "/" + name);
Operations: Define what to do on the data. Each: for each “tuple” in the data, do this to it. Group: similar to a ‘group by’ in SQL. CoGroup: joins tuple streams together. Every: for every key in the Group or CoGroup, do this.
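A minimal sketch of a CoGroup, the one Operation not shown later in the deck, with pipe and field names invented for illustration:

  import cascading.operation.aggregator.Count;
  import cascading.pipe.CoGroup;
  import cascading.pipe.Every;
  import cascading.pipe.GroupBy;
  import cascading.pipe.Pipe;
  import cascading.tuple.Fields;

  public class JoinAssembly {
    // join opens to sends on the recipient, then count opens per recipient
    public static Pipe build() {
      Pipe sends = new Pipe("sends");   // fields come from each source Tap's scheme
      Pipe opens = new Pipe("opens");
      Pipe joined = new CoGroup(sends, new Fields("recipient_id"),
                                opens, new Fields("open_recipient_id"));
      Pipe grouped = new GroupBy(joined, new Fields("recipient_id"));
      return new Every(grouped, new Count());
    }
  }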
Operations, advanced: Each operations allow logic on the row, such as parsing dates, creating new attributes, etc. Every operations allow you to iterate over the ‘group’ of rows to do non-trivial operations. Both allow multiple operations in the same function, so no nested function calls!
Pipes: Pipes tie Operations together. Pipes can be thought of as ‘tuple streams’. Pipes can be split, allowing parallel execution of Operations.
Example Operation:
  // aggregate all rows for one recipient into a single flattened row
  RowAggregator aggr = new RowAggregator(row);
  Fields groupBy = new Fields(MetricColumnDefinition.RECIPIENT_ID_NAME);

  // apply the a_sentFile function to each raw "line", group by recipient id, then aggregate each group
  Pipe formatPipe = new Each("reformat_", new Fields("line"), a_sentFile);
  formatPipe = new GroupBy(formatPipe, groupBy);
  formatPipe = new Every(formatPipe, Fields.ALL, aggr);
Flows: Flows are reusable combinations of Taps, Pipes and Operations. Allows you to build a library of functions. Flows are where the Cascading scheduler comes into play.
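A minimal sketch of building and running one Flow (paths and the pass-through pipe are invented for illustration; in practice the pipe would be an assembly like the examples above):

  import java.util.Properties;
  import cascading.flow.Flow;
  import cascading.flow.FlowConnector;
  import cascading.pipe.Pipe;
  import cascading.scheme.TextLine;
  import cascading.tap.Hfs;
  import cascading.tap.SinkMode;
  import cascading.tap.Tap;

  public class RunOneFlow {
    public static void main(String[] args) {
      Tap source = new Hfs(new TextLine(), "/input/sent");
      Tap sink = new Hfs(new TextLine(), "/output/sent_copy", SinkMode.REPLACE);
      Pipe pipe = new Pipe("copy");   // pass-through: just moves tuples from source to sink

      // the FlowConnector is where Cascading plans the underlying map/reduce steps
      Flow flow = new FlowConnector(new Properties()).connect("copy_flow", source, sink, pipe);
      flow.complete();   // blocks until Hadoop finishes
    }
  }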
Cascades: Cascades are groups of Flows that address a need. The Cascading scheduler looks for dependencies between Flows in a Cascade (and operations in a Flow). It determines which operations can run in parallel and which need to be serialized.
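A minimal sketch, assuming two Flows (flattenFlow and reportFlow) built as above, where reportFlow reads a Tap that flattenFlow writes; the CascadeConnector works the dependency out from the shared Taps:

  import cascading.cascade.Cascade;
  import cascading.cascade.CascadeConnector;

  Cascade cascade = new CascadeConnector().connect(flattenFlow, reportFlow);
  cascade.complete();   // runs the Flows in dependency order, in parallel where possible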
Cascading Scheduler: Once the Flows and Cascades are defined, it looks for dependencies. When executed, it tells Hadoop what Map, Reduce or Shuffle steps to take based on which Operations were used. It knows what can be executed in parallel, and when a step completes, what other steps can execute.
Dynamic Flow Creation: Flows can be created at run time based on the inputs. 5 input files one week, 10 the next: the Java code creates 10 Flows instead of 5. Group and Every don’t care how many input Pipes there are.
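A minimal sketch of that pattern (file layout and pipe logic invented for illustration): one Flow per input file, all handed to a single Cascade:

  import java.util.ArrayList;
  import java.util.List;
  import java.util.Properties;
  import cascading.cascade.Cascade;
  import cascading.cascade.CascadeConnector;
  import cascading.flow.Flow;
  import cascading.flow.FlowConnector;
  import cascading.pipe.Pipe;
  import cascading.scheme.TextLine;
  import cascading.tap.Hfs;
  import cascading.tap.SinkMode;
  import cascading.tap.Tap;

  public class DynamicFlows {
    public static void main(String[] args) {
      FlowConnector connector = new FlowConnector(new Properties());
      List<Flow> flows = new ArrayList<Flow>();

      // args holds this week's input files: 5 one week, 10 the next
      for (String name : args) {
        Tap source = new Hfs(new TextLine(), "/input/" + name);
        Tap sink = new Hfs(new TextLine(), "/staging/" + name, SinkMode.REPLACE);
        Pipe pipe = new Pipe("reformat_" + name);   // real logic would parse and reshape here
        flows.add(connector.connect("flow_" + name, source, sink, pipe));
      }

      // the Cascading scheduler works out ordering and parallelism across however many Flows exist
      Cascade cascade = new CascadeConnector().connect(flows.toArray(new Flow[flows.size()]));
      cascade.complete();
    }
  }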
Dynamic Tuple Definition: Each operations on input Taps can parse text lines into different Fields, so one source may have 5 fields, another 10. Each operations can use metadata to know how to parse. You can write Each operations to output common Tuples. Every operations can output new Tuples as well.
Example: Dynamic Fields
  // compute how many hours after the send the open happened, then keep only mailing_id and offset
  splice = new Each(splice,
      new ExpressionFunction(new Fields("offset"),
          "(open_time - sent_time)/(60*60*1000)", parameters, classes),
      new Fields("mailing_id", "offset"));
Example: Custom Every Operation RowAggregator.java
Mixing non-Hadoop code: Cascading allows you to mix regular Java between Flows in a Cascade, so you can call out to databases, write intermediates to a file, etc. We use it to load metadata about the columns in the source files.
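A minimal sketch of the idea (the JDBC URL, query and helper method are invented for illustration): run the first Flow, read metadata with plain JDBC, then build and run the next Flow from it:

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;
  import java.util.ArrayList;
  import java.util.List;
  import cascading.flow.Flow;

  public class BetweenFlows {
    public static void runReport(Flow parseFlow) throws Exception {
      parseFlow.complete();   // first Flow: parse and stage the raw files on HDFS

      // ordinary Java between Flows: look up column metadata in the RDBMS
      List<String> columns = new ArrayList<String>();
      Connection conn = DriverManager.getConnection("jdbc:mysql://dbhost/meta", "user", "pw");
      Statement stmt = conn.createStatement();
      ResultSet rs = stmt.executeQuery("select column_name from source_columns where source = 'sent'");
      while (rs.next())
        columns.add(rs.getString(1));
      conn.close();

      buildReportFlow(columns).complete();   // second Flow, shaped by the metadata
    }

    // stub: the real code would build Each operations and Taps from the column list
    private static Flow buildReportFlow(List<String> columns) {
      throw new UnsupportedOperationException("assemble the reporting Flow here");
    }
  }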
Features I don’t use: Failure traps, which allow you to write Tuples into ‘error’ files when something goes wrong. Non-Java scripting. Assert() statements. Sampling: throw away a % of rows being imported.
Quick HDFS/Local Disk: Using a Path() object you can access an HDFS file directly in your Buffer/Aggregator-derived classes, so you can pass configuration information into these operations in bulk. You can also access the local disk, but make sure you have NFS or something similar to access the files later: you have no idea where your job will run!
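A minimal sketch of reading such a side file from HDFS (the path and name=value format are assumptions); the same calls work from inside a Buffer or Aggregator:

  import java.io.BufferedReader;
  import java.io.IOException;
  import java.io.InputStreamReader;
  import java.util.HashMap;
  import java.util.Map;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class SideFileLoader {
    // load name=value configuration lines from an HDFS file into a Map
    public static Map<String, String> load(String hdfsPath) throws IOException {
      FileSystem fs = FileSystem.get(new Configuration());
      BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(new Path(hdfsPath))));
      Map<String, String> config = new HashMap<String, String>();
      String line;
      while ((line = reader.readLine()) != null) {
        String[] parts = line.split("=", 2);
        if (parts.length == 2)
          config.put(parts[0], parts[1]);
      }
      reader.close();
      return config;
    }
  }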
Real Example: For the hundreds of mailings sent last year, to millions of recipients, show me who opened and how often. Break it down by how long they have been a subscriber, their gender, and the number of times they clicked on the offer.
RDBMS solution: Lots of million-plus-row joins. Lots of million-plus-row counts. Temporary tables, since we want multiple answers. Lots of memory. Lots of CPU and I/O. $$ becomes the bottleneck to adding more rows or more clients to the same logic.
Cascading Solution: Let Hadoop parse the input files. Let Hadoop group all inputs by the recipient’s email. Let Cascading call Every functions to look at all rows for a recipient and ‘flatten’ the data. Split the ‘flattened’ data into Pipes to process in parallel: time in list, gender, clicked on links. Bandwidth to export data from the RDBMS becomes the bottleneck.
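A minimal sketch of the split (pipe and field names invented for illustration): naming a new Pipe after an existing one branches the tuple stream, and each branch gets its own GroupBy/Every and sink:

  import cascading.operation.aggregator.Count;
  import cascading.pipe.Every;
  import cascading.pipe.GroupBy;
  import cascading.pipe.Pipe;
  import cascading.tuple.Fields;

  public class SplitAssembly {
    public static Pipe[] build(Pipe flattened) {
      Pipe byListAge = new GroupBy(new Pipe("list_age", flattened), new Fields("months_on_list"));
      byListAge = new Every(byListAge, new Count());

      Pipe byGender = new GroupBy(new Pipe("gender", flattened), new Fields("gender"));
      byGender = new Every(byGender, new Count());

      Pipe byClicks = new GroupBy(new Pipe("clicks", flattened), new Fields("click_count"));
      byClicks = new Every(byClicks, new Count());

      return new Pipe[] { byListAge, byGender, byClicks };   // each branch is written to its own sink Tap
    }
  }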
Pros and Cons. Pros: Mix Java between map/reduce steps. Don’t have to worry about when to map and when to reduce. Don’t have to think about dependencies or how to process. Data definitions can change on the fly. Cons: A level above Hadoop, so sometimes ‘black magic’. Data must (should) be outside of the database to get the most concurrency.
Other Solutions: Apache Pig (http://hadoop.apache.org/pig/): more ‘SQL-like’, not as easy to mix regular Java into processes, more ‘ad hoc’ than Cascading. Amazon Elastic MapReduce (http://aws.amazon.com/elasticmapreduce/): runs on EC2, you provide the Map and Reduce functions, can use Cascading, pay as you go.
Resources: Me: ccurtin@silverpop.com, @ChrisCurtin. Chris Wensel: @cwensel. Web site: www.cascading.org (mailing list off the website). AWSomeAtlanta Group: http://www.meetup.com/awsomeatlanta/. O’Reilly Hadoop Book: http://oreilly.com/catalog/9780596521974/
