DRIVING INNOVATION 
THROUGH DATA BUILDING PRODUCTION-GRADE HADOOP APPLICATIONS WITH 
CASCADING 
Supreet Oberoi 
VP Field Engineering, Concurrent Inc
ABOUT ME 
• I am a Data Engineer, not a Data Scientist 
• I help enterprises make decisions about their “Big Data” roadmap and
technical strategy: use cases, products, technology choices, employee skills
• I design Hadoop applications with the intent to operationalize them in enterprise
settings: applications on which businesses depend, and which outlast the
technologies underneath them…
• This talk is about learning how to design your BD strategy that leverages best
BUILDING AN OPEN PLATFORM IS KEY TO PREVENTING LOCK-IN 
• Open Language 
• Open Data 
• Open Hardware 
• Open Compute Platform 
• Open Development Platform
OPEN LANGUAGES ALLOW YOU TO HARNESS THE TALENT OF YOUR ENTERPRISE 
• Don’t equate architecture with language;
develop architecture to support multiple languages
• Support SQL and SQL-like languages
• Encourage development in proven & scalable
languages such as Java
• Develop architecture to support change of 
programming languages (even for same app) 
• Have common performance-management 
tools across all programming environments
OPEN DATA ENABLES REUSE OF DATA AND APPS 
Develop a common operating picture by promoting reuse with open data 
• Prevent exclusive access to data sets 
through proprietary tools 
• Promote a common meta-data repository 
• Forbid storing data in proprietary formats 
• Build seamless integration capabilities
OPEN HARDWARE PROMOTES REUSE OF INFRASTRUCTURE 
• Get commodity hardware — commodity hardware will 
always cost less than “optimized” specialized 
hardware (note: definition of “specialized” is up for 
debate) 
• Develop and maintain a cluster that can be reused by 
different applications and technology stacks — avoid 
custom software installations on the cluster, or setting 
up dedicated clusters for given tech stacks 
• Harness the collective power of the cluster and
avoid fragmenting it if possible
OPEN COMPUTE PLATFORM MAKES YOU SELECT THE RIGHT TOOL FOR THE PROBLEM 
Make tradeoffs between reliability & speed based on your business context 
• Ensure that moving your application from one
Hadoop compute platform (e.g., MapReduce) to
another (e.g., Tez) does not:
• impact application code
• impact production-monitoring tools
• Resist compute platforms that require your
enterprise to acquire significantly new skills (even
if they are easy to learn) to become productive
• Avoid new platforms that partition the cluster 
• Avoid platforms that do not support Open Data
OPEN DEVELOPMENT PLATFORM PROVIDES LONG-TERM SUSTAINABILITY 
Development platforms improve developer productivity and operational excellence — picking a 
correct platform gives you best practices developed by the community, achieving higher quality 
• Invest in picking the correct development platform 
— open, easy, scalable, popular, tools, … 
• Bet on a sustainable open source platform 
• Measure the vitality of the community: 
• number of downloads, extensions (living 
ecosystem), extensible architecture, consumers of 
the technology, code stability… 
A proven platform provides tools to get your apps to production
GET TO KNOW CONCURRENT 
Leader in Application Infrastructure for Big Data 
• Building enterprise software to simplify Big Data application 
development and management 
Products and Technology 
• CASCADING 
Open Source - The most widely used application infrastructure for 
building Big Data apps with over 175,000 downloads each month 
• DRIVEN 
Enterprise data application management for Big Data apps 
Proven — Simple, Reliable, Robust 
• Thousands of enterprises rely on Concurrent to provide their data 
application infrastructure. 
Founded: 2008 
HQ: San Francisco, CA 
CEO: Gary Nakamura 
CTO, Founder: Chris Wensel 
www.concurrentinc.com
BIG DATA — OPERATIONALIZE YOUR DATA APPS WITH CASCADING 
“It’s all about the apps” 
There needs to be a comprehensive solution for building, deploying, running 
and managing this new class of enterprise applications. 
Connecting Business and Data: Business Strategy, Data & Technology
Challenges: skill sets, systems integration,
standard operating procedures, and
operational visibility
DATA APPLICATIONS - ENTERPRISE NEEDS 
Enterprise Data Application Infrastructure 
• Need reliable, reusable tooling to quickly build and consistently deliver 
data products 
• Need the degrees of freedom to solve problems ranging from simple to 
complex with existing skill sets 
• Need the flexibility to easily adapt an application to meet business needs 
(latency, scale, SLA), without having to rewrite the application 
• Need operational visibility for entire data application lifecycle 
CASCADING - DE-FACTO FOR DATA APPS 
[Diagram labels: Cascading Apps (SQL, Clojure, Ruby); New Fabrics (Tez, Storm); System Integration (Mainframe, DB / DW, In-Memory Data Stores, Hadoop)]
• Standard for enterprise 
data app development 
• Your programming 
language of choice 
• Cascading applications 
that run on MapReduce 
will also run on Apache 
Spark, Storm, and …
WORD COUNT EXAMPLE WITH CASCADING 
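import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class Main {
  public static void main( String[] args ) {
    // configuration: input/output paths and application properties
    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];
    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // integration: create source and sink taps (tab-delimited text with a header row)
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

    // processing: specify a regex to split "document" text lines into a token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // scheduling: connect the taps, pipes, etc., into a flow definition
    FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );
    // create the Flow
    Flow wcFlow = flowConnector.connect( flowDef ); // unit of work
    wcFlow.complete(); // runs the jobs on the cluster
  }
}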
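To sanity-check what the word count flow should compute, the same split, group, and count steps can be sketched in plain Java with no Hadoop or Cascading on the classpath. This is only an illustrative analog, not the Cascading API: the class `WordCountSketch` and the method `countTokens` are names invented for this sketch, and dropping empty tokens is an extra step that the slide's flow does not perform.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class WordCountSketch {
    // Delimiter class mirroring the slide's splitter: space, brackets, parentheses, comma, period.
    private static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

    // Split each line into tokens, drop empty tokens (an addition of this sketch),
    // then group identical tokens and count them. TreeMap keeps the keys sorted.
    static Map<String, Long> countTokens(List<String> lines) {
        return lines.stream()
            .flatMap(line -> Arrays.stream(line.split(DELIMITERS)))
            .filter(token -> !token.isEmpty())
            .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("rain shadow (dry area)", "dry, dry rain");
        System.out.println(countTokens(lines));
    }
}
```

Running `main` on the two sample lines prints `{area=1, dry=3, rain=2, shadow=1}`. A small local check like this gives a known-good result to compare against before the real flow is run on a cluster.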
SOME COMMON PATTERNS 
• Functions 
• Filters 
• Joins 
‣ Inner / Outer / Mixed 
‣ Asymmetrical / Symmetrical 
• Merge (Union) 
• Grouping 
‣ Secondary Sorting 
‣ Unique (Distinct) 
• Aggregations 
‣ Count, Average, etc 
[Diagram: data flows through a Pipeline of filters and functions; pipelines Split, Join, and Merge to form a Topology]
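As a rough illustration of the patterns listed on this slide (filter, merge with unique, inner join, grouping with an aggregation), here is a minimal sketch on in-memory collections using plain Java streams. It is not Cascading code: `PatternSketch`, `Reading`, and the method names are hypothetical and exist only to show what each pattern does to the data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class PatternSketch {
    // Minimal keyed record used by the grouping example.
    static final class Reading {
        final String key;
        final int value;
        Reading(String key, int value) { this.key = key; this.value = value; }
    }

    // Filter: keep only the records that pass a predicate.
    static List<Integer> keepPositive(List<Integer> values) {
        return values.stream().filter(v -> v > 0).collect(Collectors.toList());
    }

    // Merge (union) followed by unique (distinct): combine two inputs, drop duplicates, sort.
    static List<String> mergeUnique(List<String> left, List<String> right) {
        return Stream.concat(left.stream(), right.stream())
            .distinct()
            .sorted()
            .collect(Collectors.toList());
    }

    // Inner join: emit "key:left:right" only for keys present on both sides, in key order.
    static List<String> innerJoin(Map<String, String> left, Map<String, String> right) {
        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, String> entry : new TreeMap<>(left).entrySet()) {
            String match = right.get(entry.getKey());
            if (match != null) {
                joined.add(entry.getKey() + ":" + entry.getValue() + ":" + match);
            }
        }
        return joined;
    }

    // Grouping plus aggregation: average of the values seen for each key.
    static Map<String, Double> averageByKey(List<Reading> readings) {
        return readings.stream().collect(
            Collectors.groupingBy(r -> r.key, TreeMap::new, Collectors.averagingInt(r -> r.value)));
    }

    public static void main(String[] args) {
        System.out.println(keepPositive(Arrays.asList(-1, 2, 0, 5)));
        System.out.println(mergeUnique(Arrays.asList("b", "a"), Arrays.asList("a", "c")));

        Map<String, String> users = new HashMap<>();
        users.put("u1", "alice");
        users.put("u2", "bob");
        Map<String, String> events = new HashMap<>();
        events.put("u2", "click");
        events.put("u3", "view");
        System.out.println(innerJoin(users, events));

        System.out.println(averageByKey(Arrays.asList(
            new Reading("a", 2), new Reading("a", 4), new Reading("b", 3))));
    }
}
```

In a framework such as Cascading these steps would be expressed as pipe assemblies and planned onto a cluster; the sketch only shows the semantics on small local data.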
CASCADING 
• Java API 
• Separates business logic from integration 
• Testable at every lifecycle stage 
• Works with any JVM language 
• Many integration adapters 
[Architecture diagram labels: Enterprise Java; Scripting (Scala, Clojure, JRuby, Jython, Groovy); Cascading (Processing API, Integration API, Process Planner, Scheduler API); Apache Hadoop (Scheduler); Data Stores]
THE STANDARD FOR DATA APPLICATION DEVELOPMENT 
www.cascading.org
Proven application development framework for building data apps
Application platform that addresses:
• Build data apps that are scale-free: design principles ensure best practices at any scale
• Test-Driven Development: efficiently test code and process local files before deploying on a cluster
• Staffing Bottleneck: use existing Java, Scala, SQL, and modeling skill sets
• Application Portability: write once, then run on different computation fabrics
• Operational Complexity: simple; package up into one jar and hand to operations
• Systems Integration: Hadoop never lives alone; easily integrate with existing systems
STRONG ORGANIC GROWTH 
175,000+ downloads / month 
7000+ Deployments
CASCADING DATA APPLICATIONS 
• Enterprise IT: Extract Transform Load; Log File Analysis; Systems Integration; Operations Analysis
• Corporate Apps: HR Analytics; Employee Behavioral Analysis; Customer Support | eCRM; Business Reporting
• Telecom: Data processing of Open Data; Geospatial Indexing; Consumer Mobile Apps; Location based services
• Marketing / Retail: Mobile, Social, Search Analytics; Funnel Analysis; Revenue Attribution; Customer Experiments; Ad Optimization; Retail Recommenders
• Consumer / Entertainment: Music Recommendation; Comparison Shopping; Restaurant Rankings; Real Estate; Rental Listings; Travel Search & Forecast
• Finance: Fraud and Anomaly Detection; Fraud Experiments; Customer Analytics; Insurance Risk Metric
• Health / Biotech: Aggregate Metrics For Govt; Person Biometrics; Veterinary Diagnostics; Next-Gen Genomics; Agronomics; Environmental Maps
BUSINESSES DEPEND ON US 
• Cascading Java API 
• Data normalization and cleansing of search and click-through logs for 
use by analytics tools, Hive analysts 
• Easy to operationalize heavy lifting of data in one framework 
BUSINESSES DEPEND ON US 
• Cascalog (Clojure) 
• Weather pattern modeling to protect growers against loss 
• ETL against 20+ datasets daily 
• Machine learning to create models 
• Purchased by Monsanto for $930M US 
BUSINESSES DEPEND ON US 
• Scalding (Scala) 
• Makes complex analysis of very large data sets simple 
• Machine learning, linear algebra to improve 
• 30,000 jobs a day — this works @ scale 
• Ad quality (matching users and ad effectiveness) 
TWITTER
CASCADING DEPLOYMENTS 
BROAD SUPPORT 
Hadoop ecosystem supports Cascading
… AND INCLUDES RICH SET OF EXTENSIONS 
http://www.cascading.org/extensions/
CASCADING 3.0 
“Write once and deploy on your fabric of choice.” 
• The Innovation — Cascading 3.0 will 
allow for data apps to execute on 
existing and emerging fabrics 
through its new customizable query 
planner. 
• Cascading 3.0 will support — Local 
In-Memory, Apache MapReduce and 
soon thereafter (3.1) Apache Tez, 
Apache Spark and Apache Storm 
[Diagram: Enterprise Data Applications running on Computation Fabrics: Local In-Memory, MapReduce, Other, Custom]
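The "write once and deploy on your fabric of choice" idea can be illustrated with a small design sketch in plain Java: the application logic is defined once and handed to whichever execution backend is selected. This is not Cascading's query planner or any real Cascading API; `PortableAppSketch`, `Fabric`, `SequentialFabric`, and `ParallelFabric` are hypothetical names, and a parallel stream merely stands in for a distributed fabric.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.stream.Collectors;

class PortableAppSketch {
    // Hypothetical abstraction: a fabric knows how to apply a per-record step to a data set.
    interface Fabric {
        List<String> run(List<String> records, Function<String, String> step);
    }

    // Applies the step sequentially (a stand-in for a local in-memory mode).
    static final class SequentialFabric implements Fabric {
        @Override
        public List<String> run(List<String> records, Function<String, String> step) {
            return records.stream().map(step).collect(Collectors.toList());
        }
    }

    // Applies the step on a parallel stream (a stand-in for a distributed fabric).
    static final class ParallelFabric implements Fabric {
        @Override
        public List<String> run(List<String> records, Function<String, String> step) {
            return records.parallelStream().map(step).collect(Collectors.toList());
        }
    }

    // Application logic, written once; it never references a concrete fabric.
    static final Function<String, String> NORMALIZE =
        record -> record.trim().toLowerCase(Locale.ROOT);

    // Swapping the fabric changes how the work runs, not what the application computes.
    static List<String> execute(Fabric fabric, List<String> records) {
        return fabric.run(records, NORMALIZE);
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("  Alpha", "BETA ", "Gamma");
        System.out.println(execute(new SequentialFabric(), records));
        System.out.println(execute(new ParallelFabric(), records));
    }
}
```

Both calls produce the same normalized records, which is the property the slide describes: the backend can change without touching application code.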
USE LINGUAL TO MIGRATE ITERATIVE ETL TASKS TO HADOOP 
• Lingual is an extension to Cascading that 
executes ANSI SQL queries as Cascading apps 
• Supports integrating with any data source that can 
be accessed through JDBC — Cascading Tap 
can be created for any source supporting JDBC 
• Great for migration of data, integrating with non- 
Big Data assets — extends life of existing IT 
assets in an organization 
[Architecture diagram labels: CLI / Shell; Enterprise Java; Lingual (Provider API, JDBC API, Lingual API, Query Planner, Catalog); Cascading; Apache Hadoop; Data Stores]
SCALDING 
• Scalding is a language binding to Cascading for Scala 
• The name Scalding comes from the combining of SCALa and cascaDING 
• Scalding is great for Scala developers; can crisply write constructs for matrix 
math… 
• Scalding has very large commercial deployments at: 
• Twitter - Use cases such as the revenue quality team, ad targeting and traffic quality 
• eBay - Use cases include search analytics and other production data pipelines
PATTERN SCORES MODELS AT SCALE 
• Pattern is an open source project that lets you take Predictive Model
Markup Language (PMML) models and translate them into Cascading
apps.
• PMML is a popular XML-based format that allows applications to describe data mining and
machine learning models
• PMML models from popular analytics frameworks can be reused and 
deployed within Cascading workflows 
• Vendor frameworks - SAS, IBM SPSS, MicroStrategy, Oracle 
• Open source frameworks - R, Weka, KNIME, RapidMiner 
• Pattern is great for migrating your model scoring to Hadoop from your 
decision systems
PATTERN SCORES MODELS AT SCALE 
Step 1: Train your model with industry-leading Tools 
Step 2: Score your models at scale with Pattern 
OPERATIONAL EXCELLENCE WITH DRIVEN 
Visibility Through All Stages of App Lifecycle 
From Development — Building and Testing 
• Design & Development 
• Debugging 
• Tuning 
To Production — Monitoring and Tracking 
• Maintain Business SLAs 
• Balance & Controls 
• Application and Data Quality 
• Operational Health 
• Real-time Insights 
DRIVEN FOR HIVE: OPERATIONAL VISIBILITY FOR YOUR HIVE APPS 
• Understand the anatomy of your Hive app 
• Track execution of queries as single business process 
• Identify outlier behavior by comparison with historical runs 
• Analyze rich operational meta-data 
• Correlate Hive app behavior with other events on cluster 
DRIVEN ARCHITECTURE
DEEPER VISUALIZATION INTO YOUR HADOOP CODE 
• Easily comprehend, debug, and tune 
your data applications 
• Get rich insights on your application 
performance 
• Monitor applications in real-time 
• Compare app performance with 
historical (previous) iterations 
Debug and optimize your Hadoop applications more effectively with Driven
GET OPERATIONAL INSIGHTS WITH DRIVEN 
• Quickly break down how often
applications execute based on their tags, 
teams, or names 
• Immediately identify if any application is 
monopolizing cluster resources 
• Understand the utilization of your cluster 
with a timeline of all applications running 
Visualize the activity of your applications to help maintain SLAs
ORGANIZE YOUR APPLICATIONS WITH GREATER FIDELITY 
• Easily keep track of all your 
applications by segmenting them with 
user-defined tags 
• Segment your applications for 
trending analysis, cluster analysis, 
and developing chargeback models 
• Quickly break down how often
applications execute based on their 
tags, teams, or names 
Segment your applications for greater insights across all your applications
COLLABORATE WITH TEAMS 
Utilize teams to collaborate and gain visibility over your set of applications 
• Invite others to view and collaborate 
on a specific application 
• Gain visibility to all the apps and their 
owners associated with each team 
• Simply manage your teams and the 
users assigned to them 
MANAGE PORTFOLIO OF BIG DATA APPLICATIONS 
Fast, powerful, rich search capabilities enable you to easily find the exact set of
applications that you’re looking for
• Identify problematic apps with their
owners and teams
• Search for groups of applications
segmented by user-defined tags
• Compare specific applications with their
previous iterations to ensure that your
application can meet its SLA
SUMMARY - BUILD ROBUST DATA APPS RIGHT THE FIRST TIME WITH CASCADING 
• Cascading framework enables developers to intuitively create data applications that scale 
and are robust, future-proof, supporting new execution fabrics without requiring a code rewrite 
• Scalding — a Scala-based extension to Cascading — provides crisp programming 
constructs for algorithm developers and data scientists 
• Driven — an application visualization product — provides rich insights into how your 
applications execute, improving developer productivity by 10x
• Cascading 3.0 opens up the query planner — write apps once, run on any fabric 
Concurrent offers training classes for Cascading (DEC 9) & Scalding (NOV 4)
CONTACT INFORMATION 
Supreet Oberoi 
supreet@concurrentinc.com 
650-868-7675 (m) 
@supreet_online
DRIVING INNOVATION 
THROUGH DATA 
THANK YOU 
Supreet Oberoi
DRIVING INNOVATION 
THROUGH DATA 
BACKUP 
Supreet Oberoi
DIFFERENT PHILOSOPHY THAN GUI TOOLS 
• Cascading is a general-purpose framework to develop data applications; supports development through the 
entire lifecycle of data - from staging to final data sets 
• Developing with an API is more productive and intuitive than with a UI — incorporate best-practices 
• Can do in three lines of code what takes 20 clicks in an app (a fluent API with an IDE makes it even simpler)
• Can test locally and deploy to production without code changes
• Because it is based in code: debuggable, extensible, deployable, traceable 
• GUI tools do not help in visualizing the must-have insights 
• Real-time application visualization, application bottlenecks, anatomy of the application, application 
dependencies, cost breakdown of an operation (join), bottlenecks due to code, data, network, cluster
PATTERN: ALGOS IMPLEMENTED 
• Hierarchical Clustering 
• K-Means Clustering 
• Linear Regression 
• Logistic Regression 
• Random Forest 
Algorithms are extended based on customer use cases
CASCADING 3.0 IMPACT - DATA APP DEVELOPMENT FOR SPARK ON ROBUST FRAMEWORK 
• Cascading 3.0 will ease application migration to Spark 
• Enterprises can standardize on one API to meet business challenges and solve a 
variety of business problems ranging from simple to complex, regardless of latency or 
scale 
• Third party products, data apps, frameworks and dynamic programming languages 
on Cascading will immediately benefit from this portability 
• Even more operational visibility from development through production with Driven 
BUSINESSES DEPEND ON US 
• Estimate suicide risk from what people write online 
• Cascading + Cassandra 
• You can do more than optimize ad yields
• http://www.durkheimproject.org 

More Related Content

What's hot

IBM InfoSphere BigInsights for Hadoop: 10 Reasons to Love It
IBM InfoSphere BigInsights for Hadoop: 10 Reasons to Love ItIBM InfoSphere BigInsights for Hadoop: 10 Reasons to Love It
IBM InfoSphere BigInsights for Hadoop: 10 Reasons to Love It
IBM Analytics
 
Evolving Hadoop into an Operational Platform with Data Applications
Evolving Hadoop into an Operational Platform with Data ApplicationsEvolving Hadoop into an Operational Platform with Data Applications
Evolving Hadoop into an Operational Platform with Data Applications
DataWorks Summit
 
How to Automate Offloading ETL Processes to Hadoop
How to Automate Offloading ETL Processes to HadoopHow to Automate Offloading ETL Processes to Hadoop
How to Automate Offloading ETL Processes to Hadoop
Driven Inc.
 
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
DataWorks Summit
 
Big SQL 3.0 - Fast and easy SQL on Hadoop
Big SQL 3.0 - Fast and easy SQL on HadoopBig SQL 3.0 - Fast and easy SQL on Hadoop
Big SQL 3.0 - Fast and easy SQL on Hadoop
Wilfried Hoge
 
Building a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemBuilding a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystem
Gregg Barrett
 
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
Cynthia Saracco
 
Webinar - Accelerating Hadoop Success with Rapid Data Integration for the Mod...
Webinar - Accelerating Hadoop Success with Rapid Data Integration for the Mod...Webinar - Accelerating Hadoop Success with Rapid Data Integration for the Mod...
Webinar - Accelerating Hadoop Success with Rapid Data Integration for the Mod...
Hortonworks
 
C* Summit EU 2013: Leveraging the Power of Cassandra: Operational Reporting a...
C* Summit EU 2013: Leveraging the Power of Cassandra: Operational Reporting a...C* Summit EU 2013: Leveraging the Power of Cassandra: Operational Reporting a...
C* Summit EU 2013: Leveraging the Power of Cassandra: Operational Reporting a...
DataStax Academy
 
How to Automate your Enterprise Application / ERP Testing
How to Automate your  Enterprise Application / ERP TestingHow to Automate your  Enterprise Application / ERP Testing
How to Automate your Enterprise Application / ERP Testing
RTTS
 
Open-BDA - Big Data Hadoop Developer Training 10th & 11th June
Open-BDA - Big Data Hadoop Developer Training 10th & 11th JuneOpen-BDA - Big Data Hadoop Developer Training 10th & 11th June
Open-BDA - Big Data Hadoop Developer Training 10th & 11th June
Innovative Management Services
 
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Cynthia Saracco
 
Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend.
Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend.Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend.
Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend.
OW2
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
DataWorks Summit
 
Data Discovery, Visualization, and Apache Hadoop
Data Discovery, Visualization, and Apache HadoopData Discovery, Visualization, and Apache Hadoop
Data Discovery, Visualization, and Apache Hadoop
Hortonworks
 
Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It!
Cécile Poyet
 
Oracle Solaris Build and Run Applications Better on 11.3
Oracle Solaris  Build and Run Applications Better on 11.3Oracle Solaris  Build and Run Applications Better on 11.3
Oracle Solaris Build and Run Applications Better on 11.3
OTN Systems Hub
 
Webinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_finalWebinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_final
Hortonworks
 
Presentacin webinar move_up_to_power8_with_scale_out_servers_final
Presentacin webinar move_up_to_power8_with_scale_out_servers_finalPresentacin webinar move_up_to_power8_with_scale_out_servers_final
Presentacin webinar move_up_to_power8_with_scale_out_servers_final
Diego Alberto Tamayo
 
How Big Data and Hadoop Integrated into BMC ControlM at CARFAX
How Big Data and Hadoop Integrated into BMC ControlM at CARFAXHow Big Data and Hadoop Integrated into BMC ControlM at CARFAX
How Big Data and Hadoop Integrated into BMC ControlM at CARFAX
BMC Software
 

What's hot (20)

IBM InfoSphere BigInsights for Hadoop: 10 Reasons to Love It
IBM InfoSphere BigInsights for Hadoop: 10 Reasons to Love ItIBM InfoSphere BigInsights for Hadoop: 10 Reasons to Love It
IBM InfoSphere BigInsights for Hadoop: 10 Reasons to Love It
 
Evolving Hadoop into an Operational Platform with Data Applications
Evolving Hadoop into an Operational Platform with Data ApplicationsEvolving Hadoop into an Operational Platform with Data Applications
Evolving Hadoop into an Operational Platform with Data Applications
 
How to Automate Offloading ETL Processes to Hadoop
How to Automate Offloading ETL Processes to HadoopHow to Automate Offloading ETL Processes to Hadoop
How to Automate Offloading ETL Processes to Hadoop
 
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
 
Big SQL 3.0 - Fast and easy SQL on Hadoop
Big SQL 3.0 - Fast and easy SQL on HadoopBig SQL 3.0 - Fast and easy SQL on Hadoop
Big SQL 3.0 - Fast and easy SQL on Hadoop
 
Building a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystemBuilding a Big Data platform with the Hadoop ecosystem
Building a Big Data platform with the Hadoop ecosystem
 
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
 
Webinar - Accelerating Hadoop Success with Rapid Data Integration for the Mod...
Webinar - Accelerating Hadoop Success with Rapid Data Integration for the Mod...Webinar - Accelerating Hadoop Success with Rapid Data Integration for the Mod...
Webinar - Accelerating Hadoop Success with Rapid Data Integration for the Mod...
 
C* Summit EU 2013: Leveraging the Power of Cassandra: Operational Reporting a...
C* Summit EU 2013: Leveraging the Power of Cassandra: Operational Reporting a...C* Summit EU 2013: Leveraging the Power of Cassandra: Operational Reporting a...
C* Summit EU 2013: Leveraging the Power of Cassandra: Operational Reporting a...
 
How to Automate your Enterprise Application / ERP Testing
How to Automate your  Enterprise Application / ERP TestingHow to Automate your  Enterprise Application / ERP Testing
How to Automate your Enterprise Application / ERP Testing
 
Open-BDA - Big Data Hadoop Developer Training 10th & 11th June
Open-BDA - Big Data Hadoop Developer Training 10th & 11th JuneOpen-BDA - Big Data Hadoop Developer Training 10th & 11th June
Open-BDA - Big Data Hadoop Developer Training 10th & 11th June
 
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
 
Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend.
Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend.Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend.
Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend.
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
 
Data Discovery, Visualization, and Apache Hadoop
Data Discovery, Visualization, and Apache HadoopData Discovery, Visualization, and Apache Hadoop
Data Discovery, Visualization, and Apache Hadoop
 
Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It!
 
Oracle Solaris Build and Run Applications Better on 11.3
Oracle Solaris  Build and Run Applications Better on 11.3Oracle Solaris  Build and Run Applications Better on 11.3
Oracle Solaris Build and Run Applications Better on 11.3
 
Webinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_finalWebinar turbo charging_data_science_hawq_on_hdp_final
Webinar turbo charging_data_science_hawq_on_hdp_final
 
Presentacin webinar move_up_to_power8_with_scale_out_servers_final
Presentacin webinar move_up_to_power8_with_scale_out_servers_finalPresentacin webinar move_up_to_power8_with_scale_out_servers_final
Presentacin webinar move_up_to_power8_with_scale_out_servers_final
 
How Big Data and Hadoop Integrated into BMC ControlM at CARFAX
How Big Data and Hadoop Integrated into BMC ControlM at CARFAXHow Big Data and Hadoop Integrated into BMC ControlM at CARFAX
How Big Data and Hadoop Integrated into BMC ControlM at CARFAX
 

Similar to Cascading concurrent yahoo lunch_nlearn

Reducing Development Time for Production-Grade Hadoop Applications
Reducing Development Time for Production-Grade Hadoop ApplicationsReducing Development Time for Production-Grade Hadoop Applications
Reducing Development Time for Production-Grade Hadoop Applications
Cascading
 
Elasticsearch + Cascading for Scalable Log Processing
Elasticsearch + Cascading for Scalable Log ProcessingElasticsearch + Cascading for Scalable Log Processing
Elasticsearch + Cascading for Scalable Log Processing
Cascading
 
USQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake EventUSQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake Event
Trivadis
 
Webinar: Comparing DataStax Enterprise with Open Source Apache Cassandra
Webinar: Comparing DataStax Enterprise with Open Source Apache CassandraWebinar: Comparing DataStax Enterprise with Open Source Apache Cassandra
Webinar: Comparing DataStax Enterprise with Open Source Apache Cassandra
DataStax
 
Adam azure presentation
Adam   azure presentationAdam   azure presentation
Developing Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data PlatformsDeveloping Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data Platforms
ScyllaDB
 
OPEN'17_4_Postgres: The Centerpiece for Modernising IT Infrastructures
OPEN'17_4_Postgres: The Centerpiece for Modernising IT InfrastructuresOPEN'17_4_Postgres: The Centerpiece for Modernising IT Infrastructures
OPEN'17_4_Postgres: The Centerpiece for Modernising IT Infrastructures
Kangaroot
 
Democratization of Data @Indix
Democratization of Data @IndixDemocratization of Data @Indix
Democratization of Data @Indix
Manoj Mahalingam
 
Erik Baardse - Bringing Agility to Traditional application by docker
Erik Baardse - Bringing Agility to Traditional application by dockerErik Baardse - Bringing Agility to Traditional application by docker
Erik Baardse - Bringing Agility to Traditional application by docker
Agile Impact Conference
 
Azure App Modernization




Cascading concurrent yahoo lunch_nlearn

  • 1. DRIVING INNOVATION THROUGH DATA BUILDING PRODUCTION-GRADE HADOOP APPLICATIONS WITH CASCADING Supreet Oberoi VP Field Engineering, Concurrent Inc
  • 2. ABOUT ME • I am a Data Engineer, not a Data Scientist • I help Enterprises make decisions about their “Big Data” roadmap and technical strategy: use cases, products, technology decisions, employee skills • I design Hadoop applications with the intent to operationalize them in Enterprise settings: applications on which businesses depend, and which last longer than the technologies underneath them… • This talk is about learning how to design your BD strategy that leverages best
  • 3. BUILDING AN OPEN PLATFORM IS KEY TO PREVENTING LOCK-IN • Open Language • Open Data • Open Hardware • Open Compute Platform • Open Development Platform
  • 4. OPEN LANGUAGES ALLOW YOU TO HARNESS THE TALENT OF YOUR ENTERPRISE • Don’t equate architecture with language; develop architecture to support multiple languages • Support SQL and SQL-like languages • Encourage development in proven & scalable languages such as Java • Develop architecture to support change of programming languages (even for the same app) • Have common performance-management tools across all programming environments
  • 5. OPEN DATA ENABLES REUSE OF DATA AND APPS Develop a common operating picture by promoting reuse with open data • Prevent exclusive access to data sets through proprietary tools • Promote a common meta-data repository • Forbid storing data in proprietary formats • Build seamless integration capabilities
  • 6. OPEN HARDWARE PROMOTES REUSE OF INFRASTRUCTURE • Get commodity hardware: commodity hardware will always cost less than “optimized” specialized hardware (note: definition of “specialized” is up for debate) • Develop and maintain a cluster that can be reused by different applications and technology stacks; avoid custom software installations on the cluster, or setting up dedicated clusters for given tech stacks • Harness the collective power of the cluster; avoid fragmenting the cluster if possible
  • 7. OPEN COMPUTE PLATFORM MAKES YOU SELECT THE RIGHT TOOL FOR THE PROBLEM Make tradeoffs between reliability & speed based on your business context • Ensure that moving your application from one Hadoop compute platform (e.g., MapReduce) to another (e.g., Tez) does not: • impact application code • impact production-monitoring tools • Resist compute platforms that require your enterprise to acquire significant new skills (even if they are easy to learn) to become productive • Avoid new platforms that partition the cluster • Avoid platforms that do not support Open Data
  • 8. OPEN DEVELOPMENT PLATFORM PROVIDES LONG-TERM SUSTAINABILITY Development platforms improve developer productivity and operational excellence; picking the correct platform gives you best practices developed by the community, achieving higher quality • Invest in picking the correct development platform: open, easy, scalable, popular, tools, … • Bet on a sustainable open source platform • Measure the vitality of the community: • number of downloads, extensions (living ecosystem), extensible architecture, consumers of the technology, code stability… A proven platform provides tools to get your apps to production
  • 9. GET TO KNOW CONCURRENT Leader in Application Infrastructure for Big Data • Building enterprise software to simplify Big Data application development and management Products and Technology • CASCADING: Open Source. The most widely used application infrastructure for building Big Data apps, with over 175,000 downloads each month • DRIVEN: Enterprise data application management for Big Data apps Proven: Simple, Reliable, Robust • Thousands of enterprises rely on Concurrent to provide their data application infrastructure. • Founded: 2008 • HQ: San Francisco, CA • CEO: Gary Nakamura • CTO, Founder: Chris Wensel • www.concurrentinc.com
  • 10. BIG DATA: OPERATIONALIZE YOUR DATA APPS WITH CASCADING “It’s all about the apps” There needs to be a comprehensive solution for building, deploying, running and managing this new class of enterprise applications. Business Strategy Connecting Business and Data Data & Technology Challenges Skill sets, systems integration, standard op procedure and operational visibility
  • 11. DATA APPLICATIONS - ENTERPRISE NEEDS Enterprise Data Application Infrastructure • Need reliable, reusable tooling to quickly build and consistently deliver data products • Need the degrees of freedom to solve problems ranging from simple to complex with existing skill sets • Need the flexibility to easily adapt an application to meet business needs (latency, scale, SLA), without having to rewrite the application • Need operational visibility for entire data application lifecycle
  • 12. CASCADING - DE-FACTO FOR DATA APPS • Standard for enterprise data app development • Your programming language of choice • Cascading applications that run on MapReduce will also run on Apache Spark, Storm, and … [Diagram labels: Cascading Apps, SQL, Clojure, Ruby, New Fabrics, Tez, Storm, System Integration, Mainframe, DB / DW, In-Memory Data Stores, Hadoop]
  • 13. WORD COUNT EXAMPLE WITH CASCADING
      // configuration
      String docPath = args[ 0 ];
      String wcPath = args[ 1 ];
      Properties properties = new Properties();
      AppProps.setApplicationJarClass( properties, Main.class );
      HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

      // integration
      // create source and sink taps (tab-delimited text with a header line)
      Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
      Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

      // processing
      // specify a regex to split "document" text lines into token stream
      Fields token = new Fields( "token" );
      Fields text = new Fields( "text" );
      RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
      // only returns "token"
      Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
      // determine the word counts
      Pipe wcPipe = new Pipe( "wc", docPipe );
      wcPipe = new GroupBy( wcPipe, token );
      wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

      // scheduling
      // connect the taps, pipes, etc., into a flow definition
      FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
        .addSource( docPipe, docTap )
        .addTailSink( wcPipe, wcTap );
      // create the Flow
      Flow wcFlow = flowConnector.connect( flowDef ); // <<-- Unit of Work
      wcFlow.complete(); // <<-- Runs jobs on Cluster
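To make the data transformation in the word count flow concrete, here is a minimal plain-JDK sketch of what the split, group-by, and count steps compute. It is an illustration only: `WordCountSketch` and `countWords` are hypothetical names, nothing here is Cascading API, and dropping empty tokens is a simplification assumed for readability rather than something the slide specifies.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Plain-JDK illustration only; none of these names are Cascading API.
class WordCountSketch {
    // Same delimiter character class as the regex passed to RegexSplitGenerator on the slide.
    private static final Pattern DELIMITERS = Pattern.compile("[ \\[\\]\\(\\),.]");

    // Split each line into tokens (the "Each" step), then group by token
    // and count occurrences (the "GroupBy" and "Every"/Count steps).
    static Map<String, Long> countWords(Iterable<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : DELIMITERS.split(line)) {
                if (token.isEmpty()) {
                    continue; // assumption: ignore empty tokens between adjacent delimiters
                }
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = countWords(List.of("rain, rain (go away)", "rain shadow"));
        // Print one tab-separated "token<TAB>count" row per token, sorted by token.
        counts.forEach((token, count) -> System.out.println(token + "\t" + count));
    }
}
```

The sketch runs in a single JVM on an in-memory list; the point of the Cascading version is that the same three logical steps are planned and executed as jobs on a cluster.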
  • 14. SOME COMMON PATTERNS • Functions • Filters • Joins ‣ Inner / Outer / Mixed ‣ Asymmetrical / Symmetrical • Merge (Union) • Grouping ‣ Secondary Sorting ‣ Unique (Distinct) • Aggregations ‣ Count, Average, etc. [Diagram: data flowing through filters and functions, shown as a Pipeline and as a Topology with Split, Join, and Merge]
  • 15. CASCADING • Java API • Separates business logic from integration • Testable at every lifecycle stage • Works with any JVM language • Many integration adapters [Diagram labels: Processing API, Integration API, Process Planner, Scheduler API, Scheduler, Apache Hadoop, Cascading, Data Stores, Scripting (Scala, Clojure, JRuby, Jython, Groovy), Enterprise Java]
  • 16. THE STANDARD FOR DATA APPLICATION DEVELOPMENT www.cascading.org Proven application development framework for building data apps. Application platform that addresses: • Build data apps that are scale-free: design principles ensure best practices at any scale • Test-Driven Development: efficiently test code and process local files before deploying on a cluster • Staffing Bottleneck: use existing Java, Scala, SQL, modeling skill sets • Application Portability: write once, then run on different computation fabrics • Operational Complexity: simple; package up into one jar and hand to operations • Systems Integration: Hadoop never lives alone; easily integrate with existing systems
  • 17. STRONG ORGANIC GROWTH • 175,000+ downloads / month • 7000+ Deployments
  • 18. CASCADING DATA APPLICATIONS • Enterprise IT: Extract Transform Load; Log File Analysis; Systems Integration; Operations Analysis • Corporate Apps: HR Analytics; Employee Behavioral Analysis; Customer Support | eCRM; Business Reporting • Telecom: Data processing of Open Data; Geospatial Indexing; Consumer Mobile Apps; Location based services • Marketing / Retail: Mobile, Social, Search Analytics; Funnel Analysis; Revenue Attribution; Customer Experiments; Ad Optimization; Retail Recommenders • Consumer / Entertainment: Music Recommendation; Comparison Shopping; Restaurant Rankings; Real Estate Rental Listings; Travel Search & Forecast • Finance: Fraud and Anomaly Detection; Fraud Experiments; Customer Analytics; Insurance Risk Metric • Health / Biotech: Aggregate Metrics For Govt; Person Biometrics; Veterinary Diagnostics; Next-Gen Genomics; Agronomics; Environmental Maps
  • 19. BUSINESSES DEPEND ON US • Cascading Java API • Data normalization and cleansing of search and click-through logs for use by analytics tools, Hive analysts • Easy to operationalize heavy lifting of data in one framework
  • 20. BUSINESSES DEPEND ON US • Cascalog (Clojure) • Weather pattern modeling to protect growers against loss • ETL against 20+ datasets daily • Machine learning to create models • Purchased by Monsanto for $930M US
  • 21. BUSINESSES DEPEND ON US (TWITTER) • Scalding (Scala) • Makes complex analysis of very large data sets simple • Machine learning and linear algebra to improve ad quality (matching users and ad effectiveness) • 30,000 jobs a day: this works at scale
  • 23. BROAD SUPPORT Hadoop ecosystem supports Cascading
  • 24. … AND INCLUDES A RICH SET OF EXTENSIONS http://www.cascading.org/extensions/
  • 25. CASCADING 3.0 “Write once and deploy on your fabric of choice.” • The Innovation: Cascading 3.0 will allow data apps to execute on existing and emerging fabrics through its new customizable query planner. • Cascading 3.0 will support Local In-Memory and Apache MapReduce, and soon thereafter (3.1) Apache Tez, Apache Spark and Apache Storm [Diagram: Enterprise Data Applications running on Local In-Memory, MapReduce, Other Custom Computation Fabrics]
  • 26. USE LINGUAL TO MIGRATE ITERATIVE ETL TASKS TO HADOOP • Lingual is an extension to Cascading that executes ANSI SQL queries as Cascading apps • Supports integrating with any data source that can be accessed through JDBC; a Cascading Tap can be created for any source supporting JDBC • Great for migrating data and integrating with non-Big Data assets; extends the life of existing IT assets in an organization [Diagram labels: CLI / Shell, Enterprise Java, Provider API, JDBC API, Lingual API, Query Planner, Cascading, Apache Hadoop, Lingual, Data Stores, Catalog]
  • 27. SCALDING • Scalding is a language binding to Cascading for Scala • The name Scalding comes from combining SCALa and cascaDING • Scalding is great for Scala developers; can crisply write constructs for matrix math… • Scalding has very large commercial deployments at: • Twitter: use cases such as the revenue quality team, ad targeting and traffic quality • eBay: use cases include search analytics and other production data pipelines
  • 28. PATTERN SCORES MODELS AT SCALE • Pattern is an open source project that takes Predictive Model Markup Language (PMML) models and translates them into Cascading apps. • PMML is a popular XML-based format that allows applications to describe data mining and machine learning models • PMML models from popular analytics frameworks can be reused and deployed within Cascading workflows • Vendor frameworks: SAS, IBM SPSS, MicroStrategy, Oracle • Open source frameworks: R, Weka, KNIME, RapidMiner • Pattern is great for migrating your model scoring to Hadoop from your decision systems
  • 29. PATTERN SCORES MODELS AT SCALE • Step 1: Train your model with industry-leading tools • Step 2: Score your models at scale with Pattern
  • 30. OPERATIONAL EXCELLENCE WITH DRIVEN Visibility Through All Stages of App Lifecycle From Development (Building and Testing) • Design & Development • Debugging • Tuning To Production (Monitoring and Tracking) • Maintain Business SLAs • Balance & Controls • Application and Data Quality • Operational Health • Real-time Insights
  • 31. DRIVEN FOR HIVE: OPERATIONAL VISIBILITY FOR YOUR HIVE APPS • Understand the anatomy of your Hive app • Track execution of queries as single business process • Identify outlier behavior by comparison with historical runs • Analyze rich operational meta-data • Correlate Hive app behavior with other events on cluster
  • 33. DEEPER VISUALIZATION INTO YOUR HADOOP CODE Debug and optimize your Hadoop applications more effectively with Driven • Easily comprehend, debug, and tune your data applications • Get rich insights on your application performance • Monitor applications in real-time • Compare app performance with historical (previous) iterations
  • 34. GET OPERATIONAL INSIGHTS WITH DRIVEN Visualize the activity of your applications to help maintain SLAs • Quickly break down how often applications execute based on their tags, teams, or names • Immediately identify if any application is monopolizing cluster resources • Understand the utilization of your cluster with a timeline of all applications running
  • 35. ORGANIZE YOUR APPLICATIONS WITH GREATER FIDELITY Segment your applications for greater insights across all your applications • Easily keep track of all your applications by segmenting them with user-defined tags • Segment your applications for trending analysis, cluster analysis, and developing chargeback models • Quickly break down how often applications execute based on their tags, teams, or names
  • 36. COLLABORATE WITH TEAMS Utilize teams to collaborate and gain visibility over your set of applications • Invite others to view and collaborate on a specific application • Gain visibility into all the apps and their owners associated with each team • Easily manage your teams and the users assigned to them
  • 37. MANAGE PORTFOLIO OF BIG DATA APPLICATIONS Fast, powerful, rich search capabilities enable you to easily find the exact set of applications that you’re looking for • Identify problematic apps with their owners and teams • Search for groups of applications segmented by user-defined tags • Compare specific applications with their previous iterations to ensure that your application can meet its SLA
  • 39. SUMMARY - BUILD ROBUST DATA APPS RIGHT THE FIRST TIME WITH CASCADING • The Cascading framework enables developers to intuitively create data applications that are scalable, robust, and future-proof, supporting new execution fabrics without requiring a code rewrite • Scalding, a Scala-based extension to Cascading, provides crisp programming constructs for algorithm developers and data scientists • Driven, an application visualization product, provides rich insights into how your applications execute, improving developer productivity by 10x • Cascading 3.0 opens up the query planner: write apps once, run on any fabric Concurrent offers training classes for Cascading (Dec 9) and Scalding (Nov 4)
  • 40. CONTACT INFORMATION Supreet Oberoi supreet@concurrentinc.com 650-868-7675 (m) @supreet_online
  • 41. DRIVING INNOVATION THROUGH DATA THANK YOU Supreet Oberoi
  • 42. DRIVING INNOVATION THROUGH DATA BACKUP Supreet Oberoi
  • 43. DIFFERENT PHILOSOPHY THAN GUI TOOLS • Cascading is a general-purpose framework for developing data applications; it supports development through the entire lifecycle of data, from staging to final data sets • Developing with an API is more productive and intuitive than with a UI, and lets you incorporate best practices • Can do in three lines of code what takes 20 clicks in an app (a fluent API with an IDE makes it even simpler) • Can test locally and deploy to production without code changes • Because it is based in code: debuggable, extensible, deployable, traceable • GUI tools do not help in visualizing the must-have insights • Real-time application visualization, application bottlenecks, anatomy of the application, application dependencies, cost breakdown of an operation (e.g., a join), and bottlenecks due to code, data, network, or cluster
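The claim that code can be tested locally and deployed without changes rests on keeping the processing logic separate from where data comes from and goes to. The sketch below illustrates that separation in plain Python, as an assumption-laden analogy rather than Cascading code. In Cascading the corresponding roles are played by pipes (logic) and taps (sources and sinks). The functions `word_count` and `run` are hypothetical.

```python
import io


def word_count(lines):
    """Processing logic only: knows nothing about files, HDFS, or clusters."""
    counts = {}
    for line in lines:
        for word in line.lower().split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def run(source, sink):
    """Bind the logic to a source (any iterable of lines) and a sink
    (any object with a write() method)."""
    for word, n in sorted(word_count(source).items()):
        sink.write(f"{word}\t{n}\n")


# Local test: in-memory source and sink, no cluster required.
local_sink = io.StringIO()
run(["big data", "Big apps"], local_sink)
local_output = local_sink.getvalue()
```

Swapping the in-memory list and `io.StringIO` for file handles changes only the call to `run`, not `word_count`, which is the property the slide describes.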
  • 44. PATTERN: ALGOS IMPLEMENTED • Hierarchical Clustering • K-Means Clustering • Linear Regression • Logistic Regression • Random Forest • Algorithm support is extended based on customer use cases
  • 45. CASCADING 3.0 IMPACT - DATA APP DEVELOPMENT FOR SPARK ON ROBUST FRAMEWORK • Cascading 3.0 will ease application migration to Spark • Enterprises can standardize on one API to meet business challenges and solve a variety of business problems ranging from simple to complex, regardless of latency or scale • Third-party products, data apps, frameworks, and dynamic programming languages on Cascading will immediately benefit from this portability • Even more operational visibility from development through production with Driven
  • 46. BUSINESSES DEPEND ON US • Estimate suicide risk from what people write online • Cascading + Cassandra • You can do more than optimize ad yields • http://www.durkheimproject.org