SlideShare a Scribd company logo
DRIVINGINNOVATION
THROUGHDATAREDUCINGDEVELOPMENTTIMEFORPRODUCTION-GRADEHADOOP
APPLICATIONS
Ryan Desmond
Solutions Architect, Concurrent Inc
2
Robust Open-Source, Commercially Supported Platform
• A simpler alternative to the Hadoop MapReduce APIs
• Cascading model is easier to “think” in
• Improves reusability and testability
• For creating sophisticated data processing applications
• A Java library that makes integration “first class”
• From ETL to Machine Learning applications
DATAAPPLICATIONS-DEVELOPERNEEDS
Cascading Apps
CASCADING -DE-FACTOFORDATAAPPS
3
New Fabrics
ClojureSQL
StormTez
System Integration
Mainframe DB / DW Data Stores HadoopIn-Memory
• Standard for enterprise
data app development
• Your programming
language of choice
• Cascading applications
that run on MapReduce
will also run on Apache
Spark, Storm, and …
THESTANDARDFORDATAAPPLICATIONDEVELOPMENT
4
www.cascading.org
Build data apps
that are 

scale-free
Design principals ensure
best practices at any scale
Test-Driven
Development
Efficiently test code and
process local files before
deploying on a cluster
Staffing
Bottleneck
Use existing Java,Scala,
SQL, modeling skill sets
Operational
Complexity
Simple - Package up into
one jar and hand to
operations
Application
Portability
Write once, then run on
different computation
fabrics
Systems
Integration
Hadoop never lives alone.
Easily integrate to existing
systems
Proven application development
framework for building data apps
Application platform that addresses:
• Java API
• Separates business logic from integration
• Testable at every lifecycle stage
• Works with any JVM language
• Many integration adapters
CASCADING
5
Process Planner
Processing API Integration API
Scheduler API
Scheduler
Apache Hadoop
Cascading
Data Stores
Scripting
Scala, Clojure, JRuby, Jython, Groovy
Enterprise Java
• Functions
• Filters
• Joins
‣ Inner / Outer / Mixed
‣ Asymmetrical / Symmetrical
• Merge (Union)
• Grouping
‣ Secondary Sorting
‣ Unique (Distinct)
• Aggregations
‣ Count, Average, etc
SOMECOMMONPATTERNS
6
filter
filter
function
functionfilterfunction
data
Pipeline
Split Join
Merge
data
Topology
WORDCOUNTEXAMPLEWITHCASCADING
7
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
Hadoop2TezFlowConnector flowConnector = new Hadoop2TezFlowConnector( properties );
configuration
integration
// create source and sink taps
Tap docTap = new TeradataTap( new TextDelimited( true, "t" ), docPath, x, y, z );
Tap wcTap = new Hfs( new TextDelimited( true, "t" ), wcPath );
processing
// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ [](),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
scheduling
// connect the taps, pipes, etc., into a flow definition
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
.addSource( docPipe, docTap )
 .addTailSink( wcPipe, wcTap );
// create the Flow
Flow wcFlow = flowConnector.connect( flowDef ); // <<-- Unit of Work
wcFlow.complete(); // <<-- Runs jobs on Cluster
CASCADINGDATAAPPLICATIONS
8
Enterprise IT
Extract Transform Load
Log File Analysis
Systems Integration
Operations Analysis
Corporate Apps
HR Analytics
Employee Behavioral Analysis
Customer Support | eCRM
Business Reporting
Telecom
Data processing of Open Data
Geospatial Indexing
Consumer Mobile Apps
Location based services
Marketing / Retail
Mobile, Social, Search Analytics
Funnel Analysis
Revenue Attribution
Customer Experiments
Ad Optimization
Retail Recommenders
Consumer / Entertainment
Music Recommendation
Comparison Shopping
Restaurant Rankings
Real Estate
Rental Listings
Travel Search & Forecast
Finance
Fraud and Anomaly Detection
Fraud Experiments
Customer Analytics
Insurance Risk Metric
Health / Biotech
Aggregate Metrics For Govt
Person Biometrics
Veterinary Diagnostics
Next-Gen Genomics
Argonomics
Environmental Maps
• Understand the anatomy of your Hive app
• Track execution of queries as single business process
• Identify outlier behavior by comparison with historical runs
• Analyze rich operational meta-data
• Correlate Hive app behavior with other events on cluster
DRIVENFORHIVE:OPERATIONALVISIBILITYFORYOURHIVEAPPS
9
• Cascading framework enables developers to intuitively create data applications that scale
and are robust, future-proof, supporting new execution fabrics without requiring a code rewrite
• Scalding — a Scala-based extension to Cascading — provides crisp programming
constructs for algorithm developers and data scientists
• Driven — an application visualization product — provides rich insights into how your
applications executes, improving developer productivity by 10x
• Cascading 3.0 opens up the query planner — write apps once, run on any fabric
SUMMARY-BUILDROBUSTDATAAPPSRIGHTTHEFIRSTTIMEWITHCASCADING
10
Concurrent offers training classes for Cascading & Scalding
BROADSUPPORT
11
Hadoop ecosystem supports Cascading
Confidential
CASCADINGDEPLOYMENTS
1212
CONTACTINFORMATION
Ryan Desmond
ryan@concurrentinc.com
DRIVINGINNOVATION
THROUGHDATATHANKYOU
Ryan Desmond

More Related Content

What's hot

Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jach...
Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jach...Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jach...
Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jach...
Big Data Spain
 
Design for Scale - Building Real Time, High Performing Marketing Technology p...
Design for Scale - Building Real Time, High Performing Marketing Technology p...Design for Scale - Building Real Time, High Performing Marketing Technology p...
Design for Scale - Building Real Time, High Performing Marketing Technology p...
Amazon Web Services
 
Fully Automated QA System For Large Scale Search And Recommendation Engines U...
Fully Automated QA System For Large Scale Search And Recommendation Engines U...Fully Automated QA System For Large Scale Search And Recommendation Engines U...
Fully Automated QA System For Large Scale Search And Recommendation Engines U...
Spark Summit
 
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDBMongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
MongoDB
 
Webinar: How native multi model works in ArangoDB
Webinar: How native multi model works in ArangoDBWebinar: How native multi model works in ArangoDB
Webinar: How native multi model works in ArangoDB
ArangoDB Database
 
Discovery & Consumption of Analytics Data @Twitter
Discovery & Consumption of Analytics Data @TwitterDiscovery & Consumption of Analytics Data @Twitter
Discovery & Consumption of Analytics Data @Twitter
Kamran Munshi
 
Azure Data Factory Data Wrangling with Power Query
Azure Data Factory Data Wrangling with Power QueryAzure Data Factory Data Wrangling with Power Query
Azure Data Factory Data Wrangling with Power Query
Mark Kromer
 
Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...
Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...
Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...
Spark Summit
 
Introduction to RDF*
Introduction to RDF*Introduction to RDF*
Introduction to RDF*
Cambridge Semantics
 
Big Data on the Cloud
Big Data on the CloudBig Data on the Cloud
Big Data on the Cloud
Sercan Karaoglu
 
Building a Scalable Data Science Platform with R
Building a Scalable Data Science Platform with RBuilding a Scalable Data Science Platform with R
Building a Scalable Data Science Platform with R
DataWorks Summit/Hadoop Summit
 
Democratizing data science Using spark, hive and druid
Democratizing data science Using spark, hive and druidDemocratizing data science Using spark, hive and druid
Democratizing data science Using spark, hive and druid
DataWorks Summit
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Spark Summit
 
Fundamentals Big Data and AI Architecture
Fundamentals Big Data and AI ArchitectureFundamentals Big Data and AI Architecture
Fundamentals Big Data and AI Architecture
Guido Schmutz
 
Analyzing StackExchange data with Azure Data Lake
Analyzing StackExchange data with Azure Data LakeAnalyzing StackExchange data with Azure Data Lake
Analyzing StackExchange data with Azure Data Lake
BizTalk360
 
Sparkflows.io
Sparkflows.ioSparkflows.io
Sparkflows.io
sparkflows
 
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Guglielmo Iozzia
 
Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford
Migrating from Closed to Open Source - Fonda Ingram & Ken SanfordMigrating from Closed to Open Source - Fonda Ingram & Ken Sanford
Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford
Sri Ambati
 
Unified Data Analytics: Helping Data Teams Solve the World’s Toughest Problems
Unified Data Analytics: Helping Data Teams Solve the World’s Toughest ProblemsUnified Data Analytics: Helping Data Teams Solve the World’s Toughest Problems
Unified Data Analytics: Helping Data Teams Solve the World’s Toughest Problems
Databricks
 
Managing a Multi-Tenant Data Lake
Managing a Multi-Tenant Data LakeManaging a Multi-Tenant Data Lake
Managing a Multi-Tenant Data Lake
DataWorks Summit/Hadoop Summit
 

What's hot (20)

Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jach...
Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jach...Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jach...
Real-time user profiling based on Spark streaming and HBase by Arkadiusz Jach...
 
Design for Scale - Building Real Time, High Performing Marketing Technology p...
Design for Scale - Building Real Time, High Performing Marketing Technology p...Design for Scale - Building Real Time, High Performing Marketing Technology p...
Design for Scale - Building Real Time, High Performing Marketing Technology p...
 
Fully Automated QA System For Large Scale Search And Recommendation Engines U...
Fully Automated QA System For Large Scale Search And Recommendation Engines U...Fully Automated QA System For Large Scale Search And Recommendation Engines U...
Fully Automated QA System For Large Scale Search And Recommendation Engines U...
 
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDBMongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
MongoDB .local Houston 2019: Wide Ranging Analytical Solutions on MongoDB
 
Webinar: How native multi model works in ArangoDB
Webinar: How native multi model works in ArangoDBWebinar: How native multi model works in ArangoDB
Webinar: How native multi model works in ArangoDB
 
Discovery & Consumption of Analytics Data @Twitter
Discovery & Consumption of Analytics Data @TwitterDiscovery & Consumption of Analytics Data @Twitter
Discovery & Consumption of Analytics Data @Twitter
 
Azure Data Factory Data Wrangling with Power Query
Azure Data Factory Data Wrangling with Power QueryAzure Data Factory Data Wrangling with Power Query
Azure Data Factory Data Wrangling with Power Query
 
Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...
Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...
Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...
 
Introduction to RDF*
Introduction to RDF*Introduction to RDF*
Introduction to RDF*
 
Big Data on the Cloud
Big Data on the CloudBig Data on the Cloud
Big Data on the Cloud
 
Building a Scalable Data Science Platform with R
Building a Scalable Data Science Platform with RBuilding a Scalable Data Science Platform with R
Building a Scalable Data Science Platform with R
 
Democratizing data science Using spark, hive and druid
Democratizing data science Using spark, hive and druidDemocratizing data science Using spark, hive and druid
Democratizing data science Using spark, hive and druid
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
 
Fundamentals Big Data and AI Architecture
Fundamentals Big Data and AI ArchitectureFundamentals Big Data and AI Architecture
Fundamentals Big Data and AI Architecture
 
Analyzing StackExchange data with Azure Data Lake
Analyzing StackExchange data with Azure Data LakeAnalyzing StackExchange data with Azure Data Lake
Analyzing StackExchange data with Azure Data Lake
 
Sparkflows.io
Sparkflows.ioSparkflows.io
Sparkflows.io
 
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
 
Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford
Migrating from Closed to Open Source - Fonda Ingram & Ken SanfordMigrating from Closed to Open Source - Fonda Ingram & Ken Sanford
Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford
 
Unified Data Analytics: Helping Data Teams Solve the World’s Toughest Problems
Unified Data Analytics: Helping Data Teams Solve the World’s Toughest ProblemsUnified Data Analytics: Helping Data Teams Solve the World’s Toughest Problems
Unified Data Analytics: Helping Data Teams Solve the World’s Toughest Problems
 
Managing a Multi-Tenant Data Lake
Managing a Multi-Tenant Data LakeManaging a Multi-Tenant Data Lake
Managing a Multi-Tenant Data Lake
 

Similar to Reducing Development Time for Production-Grade Hadoop Applications

Accelerate Big Data Application Development with Cascading
Accelerate Big Data Application Development with CascadingAccelerate Big Data Application Development with Cascading
Accelerate Big Data Application Development with Cascading
Cascading
 
Cascading concurrent yahoo lunch_nlearn
Cascading concurrent   yahoo lunch_nlearnCascading concurrent   yahoo lunch_nlearn
Cascading concurrent yahoo lunch_nlearn
Cascading
 
Functional programming
 for optimization problems 
in Big Data
Functional programming
  for optimization problems 
in Big DataFunctional programming
  for optimization problems 
in Big Data
Functional programming
 for optimization problems 
in Big Data
Paco Nathan
 
Boulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading
Boulder/Denver BigData: Cluster Computing with Apache Mesos and CascadingBoulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading
Boulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading
Paco Nathan
 
Elasticsearch + Cascading for Scalable Log Processing
Elasticsearch + Cascading for Scalable Log ProcessingElasticsearch + Cascading for Scalable Log Processing
Elasticsearch + Cascading for Scalable Log Processing
Cascading
 
Democratization of Data @Indix
Democratization of Data @IndixDemocratization of Data @Indix
Democratization of Data @Indix
Manoj Mahalingam
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data Science
eRic Choo
 
USQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake EventUSQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake Event
Trivadis
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action
MapR Technologies
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Paco Nathan
 
Developing Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data PlatformsDeveloping Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data Platforms
ScyllaDB
 
Time's Up! Getting Value from Big Data Now
Time's Up! Getting Value from Big Data NowTime's Up! Getting Value from Big Data Now
Time's Up! Getting Value from Big Data Now
Eric Kavanagh
 
A Tighter Weave – How YARN Changes the Data Quality Game
A Tighter Weave – How YARN Changes the Data Quality GameA Tighter Weave – How YARN Changes the Data Quality Game
A Tighter Weave – How YARN Changes the Data Quality Game
Inside Analysis
 
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin MotgiWhither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Felicia Haggarty
 
Track B-1 建構新世代的智慧數據平台
Track B-1 建構新世代的智慧數據平台Track B-1 建構新世代的智慧數據平台
Track B-1 建構新世代的智慧數據平台
Etu Solution
 
Architecting an Open Source AI Platform 2018 edition
Architecting an Open Source AI Platform   2018 editionArchitecting an Open Source AI Platform   2018 edition
Architecting an Open Source AI Platform 2018 edition
David Talby
 
Putting Apache Drill into Production
Putting Apache Drill into ProductionPutting Apache Drill into Production
Putting Apache Drill into Production
MapR Technologies
 
The Cascading (big) data application framework
The Cascading (big) data application frameworkThe Cascading (big) data application framework
The Cascading (big) data application framework
Modern Data Stack France
 
The Cascading (big) data application framework - André Keple, Sr. Engineer, C...
The Cascading (big) data application framework - André Keple, Sr. Engineer, C...The Cascading (big) data application framework - André Keple, Sr. Engineer, C...
The Cascading (big) data application framework - André Keple, Sr. Engineer, C...
Cascading
 
Using Cascalog to build an app with City of Palo Alto Open Data
Using Cascalog to build an app with City of Palo Alto Open DataUsing Cascalog to build an app with City of Palo Alto Open Data
Using Cascalog to build an app with City of Palo Alto Open Data
OSCON Byrum
 

Similar to Reducing Development Time for Production-Grade Hadoop Applications (20)

Accelerate Big Data Application Development with Cascading
Accelerate Big Data Application Development with CascadingAccelerate Big Data Application Development with Cascading
Accelerate Big Data Application Development with Cascading
 
Cascading concurrent yahoo lunch_nlearn
Cascading concurrent   yahoo lunch_nlearnCascading concurrent   yahoo lunch_nlearn
Cascading concurrent yahoo lunch_nlearn
 
Functional programming
 for optimization problems 
in Big Data
Functional programming
  for optimization problems 
in Big DataFunctional programming
  for optimization problems 
in Big Data
Functional programming
 for optimization problems 
in Big Data
 
Boulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading
Boulder/Denver BigData: Cluster Computing with Apache Mesos and CascadingBoulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading
Boulder/Denver BigData: Cluster Computing with Apache Mesos and Cascading
 
Elasticsearch + Cascading for Scalable Log Processing
Elasticsearch + Cascading for Scalable Log ProcessingElasticsearch + Cascading for Scalable Log Processing
Elasticsearch + Cascading for Scalable Log Processing
 
Democratization of Data @Indix
Democratization of Data @IndixDemocratization of Data @Indix
Democratization of Data @Indix
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data Science
 
USQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake EventUSQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake Event
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
 
Developing Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data PlatformsDeveloping Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data Platforms
 
Time's Up! Getting Value from Big Data Now
Time's Up! Getting Value from Big Data NowTime's Up! Getting Value from Big Data Now
Time's Up! Getting Value from Big Data Now
 
A Tighter Weave – How YARN Changes the Data Quality Game
A Tighter Weave – How YARN Changes the Data Quality GameA Tighter Weave – How YARN Changes the Data Quality Game
A Tighter Weave – How YARN Changes the Data Quality Game
 
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin MotgiWhither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin Motgi
 
Track B-1 建構新世代的智慧數據平台
Track B-1 建構新世代的智慧數據平台Track B-1 建構新世代的智慧數據平台
Track B-1 建構新世代的智慧數據平台
 
Architecting an Open Source AI Platform 2018 edition
Architecting an Open Source AI Platform   2018 editionArchitecting an Open Source AI Platform   2018 edition
Architecting an Open Source AI Platform 2018 edition
 
Putting Apache Drill into Production
Putting Apache Drill into ProductionPutting Apache Drill into Production
Putting Apache Drill into Production
 
The Cascading (big) data application framework
The Cascading (big) data application frameworkThe Cascading (big) data application framework
The Cascading (big) data application framework
 
The Cascading (big) data application framework - André Keple, Sr. Engineer, C...
The Cascading (big) data application framework - André Keple, Sr. Engineer, C...The Cascading (big) data application framework - André Keple, Sr. Engineer, C...
The Cascading (big) data application framework - André Keple, Sr. Engineer, C...
 
Using Cascalog to build an app with City of Palo Alto Open Data
Using Cascalog to build an app with City of Palo Alto Open DataUsing Cascalog to build an app with City of Palo Alto Open Data
Using Cascalog to build an app with City of Palo Alto Open Data
 

More from Cascading

Overview of Cascading 3.0 on Apache Flink
Overview of Cascading 3.0 on Apache Flink Overview of Cascading 3.0 on Apache Flink
Overview of Cascading 3.0 on Apache Flink
Cascading
 
Predicting Hospital Readmission Using Cascading
Predicting Hospital Readmission Using CascadingPredicting Hospital Readmission Using Cascading
Predicting Hospital Readmission Using Cascading
Cascading
 
Cascading 2015 User Survey Results
Cascading 2015 User Survey ResultsCascading 2015 User Survey Results
Cascading 2015 User Survey Results
Cascading
 
Breathe new life into your data warehouse by offloading etl processes to hadoop
Breathe new life into your data warehouse by offloading etl processes to hadoopBreathe new life into your data warehouse by offloading etl processes to hadoop
Breathe new life into your data warehouse by offloading etl processes to hadoop
Cascading
 
How To Get Hadoop App Intelligence with Driven
How To Get Hadoop App Intelligence with DrivenHow To Get Hadoop App Intelligence with Driven
How To Get Hadoop App Intelligence with Driven
Cascading
 
7 Best Practices for Achieving Operational Readiness on Hadoop with Driven an...
7 Best Practices for Achieving Operational Readiness on Hadoop with Driven an...7 Best Practices for Achieving Operational Readiness on Hadoop with Driven an...
7 Best Practices for Achieving Operational Readiness on Hadoop with Driven an...
Cascading
 
Cascading - A Java Developer’s Companion to the Hadoop World
Cascading - A Java Developer’s Companion to the Hadoop WorldCascading - A Java Developer’s Companion to the Hadoop World
Cascading - A Java Developer’s Companion to the Hadoop World
Cascading
 
Introduction to Cascading
Introduction to Cascading  Introduction to Cascading
Introduction to Cascading
Cascading
 

More from Cascading (8)

Overview of Cascading 3.0 on Apache Flink
Overview of Cascading 3.0 on Apache Flink Overview of Cascading 3.0 on Apache Flink
Overview of Cascading 3.0 on Apache Flink
 
Predicting Hospital Readmission Using Cascading
Predicting Hospital Readmission Using CascadingPredicting Hospital Readmission Using Cascading
Predicting Hospital Readmission Using Cascading
 
Cascading 2015 User Survey Results
Cascading 2015 User Survey ResultsCascading 2015 User Survey Results
Cascading 2015 User Survey Results
 
Breathe new life into your data warehouse by offloading etl processes to hadoop
Breathe new life into your data warehouse by offloading etl processes to hadoopBreathe new life into your data warehouse by offloading etl processes to hadoop
Breathe new life into your data warehouse by offloading etl processes to hadoop
 
How To Get Hadoop App Intelligence with Driven
How To Get Hadoop App Intelligence with DrivenHow To Get Hadoop App Intelligence with Driven
How To Get Hadoop App Intelligence with Driven
 
7 Best Practices for Achieving Operational Readiness on Hadoop with Driven an...
7 Best Practices for Achieving Operational Readiness on Hadoop with Driven an...7 Best Practices for Achieving Operational Readiness on Hadoop with Driven an...
7 Best Practices for Achieving Operational Readiness on Hadoop with Driven an...
 
Cascading - A Java Developer’s Companion to the Hadoop World
Cascading - A Java Developer’s Companion to the Hadoop WorldCascading - A Java Developer’s Companion to the Hadoop World
Cascading - A Java Developer’s Companion to the Hadoop World
 
Introduction to Cascading
Introduction to Cascading  Introduction to Cascading
Introduction to Cascading
 

Recently uploaded

Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
ScyllaDB
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
Antonios Katsarakis
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
Data Hops
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
Miro Wengner
 
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
Edge AI and Vision Alliance
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
Fwdays
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
c5vrf27qcz
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
Edge AI and Vision Alliance
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Tatiana Kojar
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
Postman
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Alpen-Adria-Universität
 

Recently uploaded (20)

Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
 
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 

Reducing Development Time for Production-Grade Hadoop Applications

  • 2. 2 Robust Open-Source, Commercially Supported Platform • A simpler alternative to the Hadoop MapReduce APIs • Cascading model is easier to “think” in • Improves reusability and testability • For creating sophisticated data processing applications • A Java library that makes integration “first class” • From ETL to Machine Learning applications DATAAPPLICATIONS-DEVELOPERNEEDS
  • 3. Cascading Apps CASCADING -DE-FACTOFORDATAAPPS 3 New Fabrics ClojureSQL StormTez System Integration Mainframe DB / DW Data Stores HadoopIn-Memory • Standard for enterprise data app development • Your programming language of choice • Cascading applications that run on MapReduce will also run on Apache Spark, Storm, and …
  • 4. THESTANDARDFORDATAAPPLICATIONDEVELOPMENT 4 www.cascading.org Build data apps that are 
 scale-free Design principals ensure best practices at any scale Test-Driven Development Efficiently test code and process local files before deploying on a cluster Staffing Bottleneck Use existing Java,Scala, SQL, modeling skill sets Operational Complexity Simple - Package up into one jar and hand to operations Application Portability Write once, then run on different computation fabrics Systems Integration Hadoop never lives alone. Easily integrate to existing systems Proven application development framework for building data apps Application platform that addresses:
  • 5. • Java API • Separates business logic from integration • Testable at every lifecycle stage • Works with any JVM language • Many integration adapters CASCADING 5 Process Planner Processing API Integration API Scheduler API Scheduler Apache Hadoop Cascading Data Stores Scripting Scala, Clojure, JRuby, Jython, Groovy Enterprise Java
  • 6. • Functions • Filters • Joins ‣ Inner / Outer / Mixed ‣ Asymmetrical / Symmetrical • Merge (Union) • Grouping ‣ Secondary Sorting ‣ Unique (Distinct) • Aggregations ‣ Count, Average, etc SOMECOMMONPATTERNS 6 filter filter function functionfilterfunction data Pipeline Split Join Merge data Topology
  • 7. WORDCOUNTEXAMPLEWITHCASCADING 7 String docPath = args[ 0 ]; String wcPath = args[ 1 ]; Properties properties = new Properties(); AppProps.setApplicationJarClass( properties, Main.class ); Hadoop2TezFlowConnector flowConnector = new Hadoop2TezFlowConnector( properties ); configuration integration // create source and sink taps Tap docTap = new TeradataTap( new TextDelimited( true, "t" ), docPath, x, y, z ); Tap wcTap = new Hfs( new TextDelimited( true, "t" ), wcPath ); processing // specify a regex to split "document" text lines into token stream Fields token = new Fields( "token" ); Fields text = new Fields( "text" ); RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ [](),.]" ); // only returns "token" Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS ); // determine the word counts Pipe wcPipe = new Pipe( "wc", docPipe ); wcPipe = new GroupBy( wcPipe, token ); wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL ); scheduling // connect the taps, pipes, etc., into a flow definition FlowDef flowDef = FlowDef.flowDef().setName( "wc" ) .addSource( docPipe, docTap )  .addTailSink( wcPipe, wcTap ); // create the Flow Flow wcFlow = flowConnector.connect( flowDef ); // <<-- Unit of Work wcFlow.complete(); // <<-- Runs jobs on Cluster
  • 8. CASCADINGDATAAPPLICATIONS 8 Enterprise IT Extract Transform Load Log File Analysis Systems Integration Operations Analysis Corporate Apps HR Analytics Employee Behavioral Analysis Customer Support | eCRM Business Reporting Telecom Data processing of Open Data Geospatial Indexing Consumer Mobile Apps Location based services Marketing / Retail Mobile, Social, Search Analytics Funnel Analysis Revenue Attribution Customer Experiments Ad Optimization Retail Recommenders Consumer / Entertainment Music Recommendation Comparison Shopping Restaurant Rankings Real Estate Rental Listings Travel Search & Forecast Finance Fraud and Anomaly Detection Fraud Experiments Customer Analytics Insurance Risk Metric Health / Biotech Aggregate Metrics For Govt Person Biometrics Veterinary Diagnostics Next-Gen Genomics Argonomics Environmental Maps
  • 9. • Understand the anatomy of your Hive app • Track execution of queries as single business process • Identify outlier behavior by comparison with historical runs • Analyze rich operational meta-data • Correlate Hive app behavior with other events on cluster DRIVENFORHIVE:OPERATIONALVISIBILITYFORYOURHIVEAPPS 9
  • 10. • Cascading framework enables developers to intuitively create data applications that scale and are robust, future-proof, supporting new execution fabrics without requiring a code rewrite • Scalding — a Scala-based extension to Cascading — provides crisp programming constructs for algorithm developers and data scientists • Driven — an application visualization product — provides rich insights into how your applications executes, improving developer productivity by 10x • Cascading 3.0 opens up the query planner — write apps once, run on any fabric SUMMARY-BUILDROBUSTDATAAPPSRIGHTTHEFIRSTTIMEWITHCASCADING 10 Concurrent offers training classes for Cascading & Scalding