Introduction to Designing and Building Big Data Applications - Cloudera, Inc.
Learn what the course covers, from capturing data to building a search interface; the spectrum of processing engines, Apache projects, and ecosystem tools available for converged analytics; who is best suited to attend the course and what prior knowledge you should have; and the benefits of building applications with an enterprise data hub.
Hortonworks Oracle Big Data Integration - Hortonworks
Slides from joint Hortonworks and Oracle webinar on November 11, 2014. Covers the Modern Data Architecture with Apache Hadoop and Oracle Data Integration products.
The document discusses the OpenPOWER Foundation and its collaboration with the HPC Advisory Council. Some key points:
- OpenPOWER is pleased to announce its membership in HPCAC to further cross-community collaboration opportunities in HPC. OpenPOWER is contributing several POWER8 systems with NVIDIA GPUs to the HPCAC lab for benchmarking and demonstrations.
- OpenPOWER aims to fuel innovation through the collaboration of its partners. It has over 100 members from over 20 countries working on technologies across 24 work groups.
- One goal is to accelerate the technology roadmap by defining open interconnect standards and expanding the ecosystem of solutions that leverage the POWER architecture.
Data Lake for the Cloud: Extending your Hadoop Implementation - Hortonworks
As more applications are created using Apache Hadoop that derive value from the new types of data from sensors/machines, server logs, click-streams, and other sources, the enterprise "Data Lake" forms with Hadoop acting as a shared service. While these Data Lakes are important, a broader life-cycle needs to be considered that spans development, test, production, and archival and that is deployed across a hybrid cloud architecture.
If you have already deployed Hadoop on-premise, this session will also provide an overview of the key scenarios and benefits of joining your on-premise Hadoop implementation with the cloud, by doing backup/archive, dev/test or bursting. Learn how you can get the benefits of an on-premise Hadoop that can seamlessly scale with the power of the cloud.
Ironfan is the foundation for your Big Data stack, making provisioning and configuring your Big Data infrastructure simple. Spin up clusters when you need them, kill them when you don't, so you can spend your time, money, and engineering focus on finding insights, not getting your machines ready.
Learn more at http://infochimps.com
YARN Ready: Integrating to YARN with Tez - Hortonworks
YARN Ready webinar series helps developers integrate their applications to YARN. Tez is one vehicle to do that. We take a deep dive including code review to help you get started.
Hadoop Reporting and Analysis - Jaspersoft - Hortonworks
Hadoop is deployed for a variety of uses, including web analytics, fraud detection, security monitoring, healthcare, environmental analysis, social media monitoring, and other purposes.
IBM InfoSphere BigInsights for Hadoop: 10 Reasons to Love It - IBM Analytics
Originally Published on Oct 15, 2014
IBM InfoSphere BigInsights is an industry-standard Hadoop offering that combines the best of open-source software with enterprise-grade features.
- #1 InfoSphere BigInsights is 100% standard, open-source Hadoop
- #2 Big SQL - Lightning fast, ANSI-compliant, native Hadoop formats
- #3 BigSheets - Spreadsheet-like data access for business users
- #4 Big Text - Simplify text analytics and natural language
- #5 Adaptive MapReduce - Fully compatible, four times faster
- #6 In-Hadoop Analytics - Deploy the analytics to the data
- #7 HDFS and POSIX - a more capable enterprise file system
- #8 Big R - Deep R Language integration in Hadoop
- #9 IBM Watson Explorer - Search, explore and visualize all your data
- #10 Accelerators - Get to market faster leveraging pre-written code
To learn more about IBM InfoSphere BigInsights, download the free InfoSphere BigInsights QuickStart Edition from http://ibm.com/hadoop.
Evolving Hadoop into an Operational Platform with Data Applications - DataWorks Summit
The document discusses Cask Data Application Platform (CDAP), an open source platform for building data applications on Hadoop. It provides an overview of CDAP's key components including datasets, programs, and applications. Datasets are standardized containers that encapsulate data access patterns and data models through reusable APIs. Programs are containers for different processing paradigms like batch and real-time. Applications in CDAP compose multiple datasets and programs.
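The nesting of CDAP's three abstractions can be sketched roughly as follows; this is an illustrative sketch in Python, not the actual CDAP API (CDAP itself is Java-based, and all names here are invented for illustration):

```python
# Illustrative sketch only; not the CDAP API. Shows how datasets (reusable
# data-access containers) and programs (processing containers) compose
# into an application.
class Dataset:
    """Encapsulates a data model behind a reusable access API."""
    def __init__(self, name):
        self.name, self.rows = name, []

    def write(self, row):
        self.rows.append(row)

    def scan(self):
        return list(self.rows)


class Program:
    """A container for one processing paradigm (batch, real-time, ...)."""
    def __init__(self, name, run):
        self.name, self.run = name, run


class Application:
    """Composes multiple datasets and programs."""
    def __init__(self, datasets, programs):
        self.datasets, self.programs = datasets, programs


events = Dataset("events")
ingest = Program("ingest", lambda ds: ds.write({"id": 1}))
app = Application(datasets=[events], programs=[ingest])

app.programs[0].run(events)
print(events.scan())  # → [{'id': 1}]
```

The point of the sketch is the separation of concerns: data access patterns live in the dataset, processing lives in the program, and the application only wires them together.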
How to Automate Offloading ETL Processes to Hadoop - Driven Inc.
Offloading ETL processes to Hadoop is often one of the first Big Data efforts because of the obvious ROI benefits. However, you have hundreds, maybe thousands, of legacy ETL processes to migrate, which makes achieving the benefits of Hadoop and ROI a distant goal.
What if you could automatically convert up to 70% of your existing ETL processes to run on Hadoop with no code changes?
In this presentation you will see:
- A detailed walk-through of migrating existing ETL processes to Hadoop without changing anything
- How you can cut development time of new ETL processes on Hadoop by up to 50%
- How you can leverage your existing developers’ Java skills to turn them into Hadoop developers
- Best practices for monitoring the performance of your ETL processes to ensure you meet your service level agreements
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo... - DataWorks Summit
The document discusses re-platforming existing enterprise business intelligence and analytic workloads from platforms like Oracle, Teradata, SAP and IBM to the Hadoop platform. It notes that many existing analytic workloads are struggling with increasing data volumes and are too costly. Hadoop offers a modern distributed platform that can address these issues through the use of a production-grade SQL database like VectorH on Hadoop. The document provides guidelines for re-platforming workloads and notes potential benefits such as improved performance, reduced costs and leveraging the Hadoop ecosystem.
Building a Big Data platform with the Hadoop ecosystem - Gregg Barrett
This presentation provides a brief insight into a Big Data platform using the Hadoop ecosystem.
To this end the presentation will touch on:
-views of the Big Data ecosystem and its components
-an example of a Hadoop cluster
-considerations when selecting a Hadoop distribution
-some of the Hadoop distributions available
-a recommended Hadoop distribution
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess... - Cynthia Saracco
Got Big Data? Then check out what Big SQL can do for you . . . . Learn how IBM's industry-standard SQL interface enables you to leverage your existing SQL skills to query, analyze, and manipulate data managed in an Apache Hadoop environment on cloud or on premise. This quick technical tour is filled with practical examples designed to get you started working with Big SQL in no time. Specifically, you'll learn how to create Big SQL tables over Hadoop data in HDFS, Hive, or HBase; populate Big SQL tables with data from HDFS, a remote file system, or a remote RDBMS; execute simple and complex Big SQL queries; work with non-traditional data formats and more. These charts are for session ALB-3663 at the IBM World of Watson 2016 conference.
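The create/populate/query workflow described above follows standard SQL patterns. As a minimal sketch, Python's built-in sqlite3 can stand in for a Big SQL connection (against Big SQL itself you would connect over JDBC/ODBC and define tables on Hadoop data; the table name and rows here are invented for illustration):

```python
import sqlite3

# Illustrative sketch only: sqlite3 stands in for a Big SQL connection.
# Big SQL defines tables over data in HDFS, Hive, or HBase, but the query
# surface is the same industry-standard SQL shown here.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 1. Create a table over your data
cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")

# 2. Populate it
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [("east", 100.0), ("west", 250.0), ("east", 50.0)])

# 3. Run simple or complex queries against it
result = cur.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(result)  # → [('east', 150.0), ('west', 250.0)]
```

The appeal the session describes is exactly this: existing SQL skills carry over unchanged, while the storage layer underneath is Hadoop rather than a relational engine.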
Webinar - Accelerating Hadoop Success with Rapid Data Integration for the Mod... - Hortonworks
Many enterprises are turning to Apache Hadoop to enable Big Data Analytics and reduce the costs of traditional data warehousing. Yet, it is hard to succeed when 80% of the time is spent on moving data and only 20% on using it. It’s time to swap the 80/20! The Big Data experts at Attunity and Hortonworks have a solution for accelerating data movement into and out of Hadoop that enables faster time-to-value for Big Data projects and a more complete and trusted view of your business. Join us to learn how this solution can work for you.
How to Automate your Enterprise Application / ERP Testing - RTTS
This document discusses automating enterprise application and data warehouse testing using QuerySurge. It begins with an introduction to QuerySurge and its modules for automating data interface testing. These modules allow testing across different data sources with no coding required. The document then covers data maturity models and how QuerySurge can help improve testing processes. It demonstrates how QuerySurge can automate testing to gain full coverage while decreasing testing time. In conclusion, it discusses how QuerySurge provides value through increased testing efficiency and data quality.
Are you excited to learn Big Data technologies? Do you find that the free material available on the internet is too complicated for a newcomer?
Many things can go wrong when learning a new technology on your own; free internet material can be a can of worms for a beginner, and training is advised for a jumpstart.
The Open-BDA Big Data Hadoop Developer Training, to be held on 11th & 12th May 2015 at the Marriott Hotel Karachi, will cover everything you need to know to start a career in Hadoop technology and build your expertise to a level where you can take certification exams from MapR, Cloudera, and Hortonworks with confidence. You can start as a beginner, and this course will help you become a certified professional.
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p... - Cynthia Saracco
This document provides an overview of IBM's BigInsights product for analyzing big data. It discusses how BigInsights uses the open source Apache Hadoop and Spark platforms as its core with additional IBM technologies and features added on. BigInsights allows users to analyze both structured and unstructured data at large volumes and in real-time. It also integrates with other IBM analytics and data management products to provide a full big data analytics solution.
Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend - OW2
ETL is the process of extracting data from one location, transforming it, and loading it into a different location, often for the purposes of collection and analysis. As Hadoop becomes a common technology for sophisticated analysis and transformation of petabytes of structured and unstructured data, the task of moving data in and out efficiently becomes more important and writing transformation jobs becomes more complicated. Talend provides a way to build and automate complex ETL jobs for migration, synchronization, or warehousing tasks. Using Talend's Hadoop capabilities allows users to easily move data between Hadoop and a number of external data locations using over 450 connectors. Also, Talend can simplify the creation of MapReduce transformations by offering a graphical interface to Hive, Pig, and HDFS. In this talk, Cédric Carbone will discuss how to use Talend to move large amounts of data in and out of Hadoop and easily perform transformation tasks in a scalable way.
More and more organizations are moving their ETL workloads to a Hadoop-based ELT grid architecture. Hadoop's inherent capabilities, especially its ability to do late binding, address some of the key challenges with traditional ETL platforms. In this presentation, attendees will learn the key factors, considerations, and lessons around ETL for Hadoop, including the pros and cons of different extract and load strategies, the best ways to batch data, buffering and compression considerations, leveraging HCatalog, data transformation, integration with existing data transformations, the advantages of different ways of exchanging data, and leveraging Hadoop as a data integration layer. This is an extremely popular presentation around ETL and Hadoop.
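Stripped of the tooling, the extract/transform/load cycle these talks describe reduces to three steps; a toy sketch in plain Python (the record layout and tax rule are invented for illustration; on a Hadoop grid the same steps run as distributed jobs over HDFS files):

```python
import csv
import io

# Toy ETL: extract CSV records, transform them (filter invalid rows and
# derive a field), and load the results into a target store.
raw = io.StringIO("id,amount\n1,100\n2,-5\n3,40\n")


def extract(src):
    """Extract: parse the source into records."""
    return list(csv.DictReader(src))


def transform(rows):
    """Transform: drop non-positive amounts, derive a tax column."""
    return [{"id": r["id"], "amount": float(r["amount"]),
             "tax": round(float(r["amount"]) * 0.2, 2)}
            for r in rows if float(r["amount"]) > 0]


def load(rows, target):
    """Load: append records to the target store."""
    target.extend(rows)


warehouse = []
load(transform(extract(raw)), warehouse)
print(warehouse)
```

The ELT variant the presentation advocates simply reorders the last two steps: land the raw records first, then let the grid transform them in place with late binding.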
Data Discovery, Visualization, and Apache Hadoop - Hortonworks
In this webinar, we will discuss how Apache Hadoop works with your current infrastructure and how you can use data discovery and visualization tools to gain deeper insights from new data types stored in Hadoop and your existing data center investments.
Boost Performance with Scala – Learn From Those Who’ve Done It! Cécile Poyet
Scalding is a Scala DSL for Cascading. Run on Hadoop, it's a concise, functional, and very efficient way to build big data applications. One significant benefit of Scalding is that it allows easy porting of Scalding apps from MapReduce to newer, faster execution fabrics.
In this webinar, Cyrille Chépélov, of Transparency Rights Management, will share how his organization boosted the performance of their Scalding apps by over 50% by moving away from MapReduce to Cascading 3.0 on Apache Tez. Dhruv Kumar, Hortonworks Partner Solution Engineer, will then explain how you can interact with data on HDP using Scala and leverage Scala as a programming language to develop Big Data applications.
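Scalding pipelines are essentially flatMap/groupBy chains over records. The shape of its classic word-count example can be sketched in plain Python (this is an in-memory illustration, not Scalding itself; Scalding compiles the equivalent pipeline to a Cascading flow that runs on MapReduce or, as the webinar describes, on Tez):

```python
from collections import Counter

# The flatMap (line -> words) plus groupBy (word -> count) pattern that
# Scalding's word count expresses, run in memory on a small sample.
lines = ["to be or not to be", "to do is to be"]

words = (w for line in lines for w in line.split())   # flatMap step
counts = Counter(words)                               # groupBy + size step

print(counts["to"])  # → 4
```

Because the pipeline is declared rather than hand-wired to an engine, swapping the execution fabric underneath (MapReduce to Tez) leaves the application code untouched, which is where the reported performance gains come from.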
Oracle Solaris Build and Run Applications Better on 11.3 - OTN Systems Hub
Build and Run Applications Better on Oracle Solaris 11.3
Tech Day, NYC
Liane Praza, Senior Principal Software Engineer
Ikroop Dhillon, Principal Product Manager
June, 2016
Slides from the joint webinar. Learn how Pivotal HAWQ, one of the world's most advanced enterprise SQL-on-Hadoop technologies, coupled with the Hortonworks Data Platform, the only 100% open source Apache Hadoop data platform, can turbocharge your Data Science efforts.
Together, Pivotal HAWQ and the Hortonworks Data Platform provide businesses with a Modern Data Architecture for IT transformation.
How Big Data and Hadoop Integrated into BMC Control-M at CARFAX - BMC Software
Learn how CARFAX utilized the power of Control-M to help drive big data processing via Cloudera. See why it was a no-brainer to choose Control-M to help manage workflows through Hadoop, some of the challenges faced, and the benefits the business received by using an existing, enterprise-wide workload management system instead of choosing “yet another tool.”
Reducing Development Time for Production-Grade Hadoop Applications - Cascading
Ryan Desmond's Presentation at the Cascading Meetup on August 27, 2015. Brief overview of Cascading to help give a basic understanding to Clojure users that might use PigPen & Clojure to access Cascading.
Elasticsearch + Cascading for Scalable Log Processing - Cascading
Supreet Oberoi's presentation on "Large scale log processing with Cascading & Elastic Search". Elasticsearch is becoming a popular platform for log analysis with its ELK stack: Elasticsearch for search, Logstash for centralized logging, and Kibana for visualization. Complemented with Cascading, the application development platform for building Data applications on Apache Hadoop, developers can correlate at scale multiple log and data streams to perform rich and complex log processing before making it available to the ELK stack.
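The core step in this kind of pipeline, parsing raw log lines into structured records before indexing, can be sketched in plain Python (the log format and fields here are invented for illustration; Cascading runs the equivalent correlation at scale before handing records to the ELK stack):

```python
import re
from collections import Counter

# Toy log-processing step: parse unstructured lines into structured
# records, then aggregate by level; the shape of work done before
# records are indexed into Elasticsearch.
LOG_RE = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<msg>.*)$")

lines = [
    "2015-08-27T10:00:01 INFO user login ok",
    "2015-08-27T10:00:02 ERROR db timeout",
    "2015-08-27T10:00:03 INFO cache hit",
]

records = [m.groupdict() for m in map(LOG_RE.match, lines) if m]
by_level = Counter(r["level"] for r in records)
print(by_level["INFO"], by_level["ERROR"])  # → 2 1
```

Once lines are structured records with timestamps, correlating multiple streams becomes a join on those fields, which is exactly the kind of rich processing the talk describes doing in Cascading rather than at query time in Kibana.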
Webinar: Comparing DataStax Enterprise with Open Source Apache Cassandra - DataStax
Apache Cassandra is the open source database technology that pioneered distributed data at scale. DataStax Enterprise, powered by the best distribution of Apache Cassandra, gives you up to 2x better compaction throughput, 3x better operational analytics performance, ease-of-use, and a secure, comprehensive multi-model data platform including search and operational analytics integrated with Cassandra to help you take on whatever challenges you might face along the way.
This document discusses Microsoft Azure and its capabilities. It highlights that Azure has over 100 datacenters globally, with 19 regions currently online. It also notes that Azure has one of the top 3 networks in the world and offers larger VM sizes than AWS or Google Cloud. The document then summarizes some of Azure's core capabilities like compute, storage, databases, analytics and more. It provides examples of how customers can use Azure's tools and services.
Developing Enterprise Consciousness: Building Modern Open Data Platforms - ScyllaDB
ScyllaDB, alongside some of the other major distributed real-time technologies, gives businesses a unique opportunity to achieve enterprise consciousness: a business platform that delivers data to the people who need it, when they need it, anytime, anywhere.
This talk covers how modern tools in the open data platform can help companies synchronize data across their applications using open source tools and technologies and more modern low-code ETL/ReverseETL tools.
Topics:
- Business Platform Challenges
- What Enterprise Consciousness Solves
- How ScyllaDB Empowers Enterprise Consciousness
- What can ScyllaDB do for Big Companies
- What can ScyllaDB do for smaller companies
OPEN'17_4_Postgres: The Centerpiece for Modernising IT Infrastructures - Kangaroot
Postgres is the leading open source database management system that is being developed by a very active community for more than 15 years. Gaby Schilders is Sales Engineer at EnterpriseDB, supplier of the EDB Postgres data platform.
Gaby Schilders, Sales Engineer at EnterpriseDB, will be explaining why companies take open source as the centerpiece for modernising their IT infrastructure, increasing their scalability and taking full advantage of what today's technologies offer them.
The document discusses the development of an internal data pipeline platform at Indix to democratize access to data. It describes the scale of data at Indix, including over 2.1 billion product URLs and 8 TB of HTML data crawled daily. Previously, the data was not discoverable, schemas changed and were hard to track, and using code limited who could access the data. The goals of the new platform were to enable easy discovery of data, transparent schemas, minimal coding needs, UI-based workflows for anyone to use, and optimized costs. The platform developed was called MDA (Marketplace of Datasets and Algorithms) and enabled SQL-based workflows using Spark. It has continued improving since its first release in 2016.
The document discusses containers and Docker Enterprise Edition (EE). It notes that by 2020, over 50% of organizations will be running containers in production. Containers simplify infrastructure by allowing applications to run on any infrastructure. Docker EE provides additional capabilities for enterprises like security features, automation, and support that are required beyond the open source Docker Engine. It highlights customer examples where Docker EE helped accelerate projects, increase scalability, and migrate applications to the cloud. The document promotes Docker services to help customers develop a containerization strategy and achieve benefits like cost savings, agility, and productivity gains.
The document discusses various options for modernizing applications, including rehosting, refactoring, rearchitecting, and rebuilding apps. Rehosting involves moving apps to cloud infrastructure with minimal changes. Refactoring leverages existing code while taking advantage of cloud capabilities. Rearchitecting involves major code revisions for cloud-native apps and microservices. Rebuilding apps is building new apps using cloud-native platforms from the ground up. The document provides benefits, definitions, considerations, and technologies for each option to help determine the best modernization approach.
Accelerate Big Data Application Development with Cascading and HDP, Hortonwor...Hortonworks
Accelerate Big Data Application Development with Cascading and HDP, webinar hosted by Hortonworks and Concurrent. Visit Hortonworks.com/webinars to access the recording.
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin MotgiFelicia Haggarty
The document discusses challenges with building operational data applications on Hadoop and introduces the Cask Data Application Platform (CDAP) as a solution. It provides an agenda that covers data applications, challenges, CDAP motivation and goals, use cases, and an introduction and architecture overview of CDAP. The document aims to demonstrate how CDAP provides a unified platform that simplifies application development and lifecycle while supporting reusable data and processing patterns.
Cask provides the Cask Data Application Platform (CDAP) which provides an integrated platform for developers and organizations to build, deploy, and manage big data applications. CDAP hides the complexity of Hadoop, provides reusable components, and integrates with Cloudera's data platform. It allows both technical and non-technical users to easily develop applications for ingesting, processing, and analyzing large amounts of data. The document discusses CDAP capabilities and provides an example of how a marketing SaaS company used it to build a real-time customer analytics application.
7 Best Practices for Achieving Operational Readiness on Hadoop with Driven an...Cascading
This video dives into 7 best practices for how IT organizations can achieve true operational readiness on Hadoop using Driven and Cascading.
Intended for any person, organization, or enterprise currently involved in planning, deploying, or managing a Hadoop infrastructure: development teams, IT Ops, and executive management.
Key Takeaways:
- Connecting execution problems with application context
- Defining and enforcing SLAs
- Understanding inter-app dependencies
- Rationing your cluster
- Tracing data access at the operational level
- Building culture and tools supporting collaboration between developers, operators, & other Hadoop team members
DataStax on Azure: Deploying an industry-leading data platform for cloud apps...DataStax
Learn how DataStax Enterprise (DSE) on Microsoft Azure delivers experiences to cloud applications beyond customer expectations. Powered by the industry’s best version of Apache Cassandra™ and leveraging the global scale, hybrid deployment capabilities, and ease of integration of Azure, DSE is the always-on data platform that allows you to focus on what matters most to you by ensuring your applications scale reliably and effortlessly while delivering actionable insight in real-time.
View recording: https://youtu.be/kLEkqTH_2Bc
Explore all DataStax webinars: http://www.datastax.com/resources/webinars
This document provides an overview of big data fundamentals and considerations for setting up a big data practice. It discusses key big data concepts like the four V's of big data. It also outlines common big data questions around business context, architecture, skills, and presents sample reference architectures. The document recommends starting a big data practice by identifying use cases, gaining management commitment, and setting up a center of excellence. It provides an example use case of retail web log analysis and presents big data architecture patterns.
Transforming Business in a Digital Era with Big Data and MicrosoftPerficient, Inc.
The socially integrated world, the rise of mobile, the Internet of Things - this explosion of data can be directed and used, rather than simply managed. That's why Big Data and advanced analytics are key components of most digital transformation strategies.
In the last year, Microsoft has made key moves to extend its data platform into this realm. Stalwart platforms like SQL Server and Excel join up with new PaaS offerings to make up a dynamic and powerful Big Data/advanced analytics ecosystem.
In this webinar, our experts covered:
-Why you should include Big Data and advanced analytics in your digital transformation strategy
-Challenges facing digital transformation initiatives
-What options the Microsoft toolset offers for Big Data (Hadoop) and advanced analytics
-How to leverage products and services you already own for your digital transformation
The document discusses DevOps practices like continuous integration (CI) and continuous delivery/deployment (CD). It explains that DevOps aims to improve software development and operations by increasing automation, reducing deployment times, and enabling more frequent and safer software releases. CI principles include automating builds, testing, and deployments. CD builds on CI by further automating the software release process and reducing risks of major releases.
The intersection of Traditional IT and New-Generation ITKangaroot
Keynote from Franz Meyer - VP, EMEA Strategic Business Development Red Hat about "The intersection of Traditional IT and New-Generation IT : the Red Hat Open Hybrid Journey". This presentation was given during the Open Source Cloud Day of Kangaroot & Red Hat.
Overview of Cascading 3.0 on Apache Flink Cascading
Cascading is a Java API for building batch data applications on Hadoop. This document discusses executing Cascading programs on Apache Flink instead of Hadoop MapReduce. With Cascading on Flink, programs are translated to single Flink jobs instead of multiple MapReduce jobs. This improves performance by allowing pipelined execution without writing intermediate data to HDFS. For example, a TF-IDF program runs 3.5 hours faster on Flink than MapReduce. Cascading on Flink leverages Flink's efficient in-memory operators while requiring minimal code changes.
Predicting Hospital Readmission Using CascadingCascading
Michael Covert will examine how Healthcare Providers are finding ways to use Big Data analytics to reduce readmission rates and improve operational efficiency while complying with regulatory mandates.
This document summarizes the results of a survey of Cascading users. It finds that Cascading is most popular among those building and managing big data applications. Many users explored alternatives like Hive and Pig before adopting Cascading due to its scalability and portability across compute frameworks. The survey also shows that Cascading users value reliability and performance at scale and are interested in new frameworks like Spark.
Breathe new life into your data warehouse by offloading etl processes to hadoopCascading
This document discusses offloading ETL workloads from data warehouses to Hadoop. It provides an overview of Bitwise, an ISO-certified company that provides ETL and data quality services. It also describes Driven, a platform for building, running, and managing big data applications. Driven provides visibility into data pipelines, monitors application performance, and enables collaboration around operational issues. It stores metadata about application telemetry in a scalable and searchable manner to provide end-to-end operational visibility for Hadoop applications.
How To Get Hadoop App Intelligence with DrivenCascading
You built Cascading/Scalding apps to mine all that data you collected in Hadoop. But just when you were seeing results, something went wrong — the app broke, data flows stopped, and business came to a halt.
So what do you do next? How do you find out what went wrong in the shortest time possible? How do you pinpoint the line of code where the error occurred? How do you know which SLA is going to be impacted? How do you view the lineage of data to adhere to compliance requirements?
In this presentation, we show you how to easily find the answers with Driven, the most comprehensive Big Data App Performance Management Platform.
Furthermore, this presentation describes how Driven can help you build higher quality big data apps; run big data apps more reliably; and manage big data apps more effectively.
Who should view this PPT: Any person or organization that is currently involved in planning, deploying or managing a Hadoop application infrastructure.
The Cascading (big) data application framework - André Keple, Sr. Engineer, C...Cascading
André Kelpe's presentation at Hadoop User Group France - 25.11.2014.
Abstract: Cascading is widely deployed, production ready open source data application framework geared towards Java developers. Cascading enables developers to write complex data applications without the need to become a distributed systems expert. Cascading apps are portable between different computation frameworks, so that a given application can be moved from Hadoop onto new processing platforms like Apache Tez or Apache Spark without rewriting any of the application code.
Cascading - A Java Developer’s Companion to the Hadoop WorldCascading
Presentation by Dhruv Kumar, Sr. Field Engineer at Concurrent.
Amid all the hype and investment around Big Data technologies, many Java software engineers are asking what it takes to become big data engineers. As a Java professional, toward which path should I steer my career?
Join Dhruv Kumar as he introduces Cascading, an open source application development framework that allows Java developers to build applications on top of Hadoop through its Java API. We’ll provide an overview of the application development landscape for developing applications on Hadoop and explain why Cascading has become so popular, comparing it to other abstractions such as Pig and Hive. Dhruv will also show you how Java developers can easily get started building applications on Hadoop with live examples of good ‘ole Java code.
Introduction to Cascading by Bryce Lohr
Presentation on Cascading delivered at the Triad Hadoop Users Group. This presentation provides a brief introduction to Cascading, a Java library for developing scalable Map/Reduce applications on Hadoop.
Bryce Lohr is a software developer at Inmar, focused on developing data analysis applications using Hadoop and related technologies.
https://www.linkedin.com/pub/bryce-lohr/3/589/225
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyScyllaDB
Freshworks creates AI-boosted business software that helps employees work more efficiently and effectively. Managing data across multiple RDBMS and NoSQL databases was already a challenge at their current scale. To prepare for 10X growth, they knew it was time to rethink their database strategy. Learn how they architected a solution that would simplify scaling while keeping costs under control.
Taking AI to the Next Level in Manufacturing.pdfssuserfac0301
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
5. Ideas and approaches to help build your organization's AI strategy.
Main news related to the CCS TSI 2023 (2023/1695)Jakub Marek
An English 🇬🇧 translation of a presentation to the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on Communications and signalling systems on Railways, which was held in Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). Attended by around 500 participants and 200 on-line followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframePrecisely
Inconsistent user experience and siloed data, high costs, and changing customer expectations – Citizens Bank was experiencing these challenges while it was attempting to deliver a superior digital banking experience for its clients. Its core banking applications run on the mainframe and Citizens was using legacy utilities to get the critical mainframe data to feed customer-facing channels, like call centers, web, and mobile. Ultimately, this led to higher operating costs (MIPS), delayed response times, and longer time to market.
Ever-changing customer expectations demand more modern digital experiences, and the bank needed to find a solution that could provide real-time data to its customer channels with low latency and operating costs. Join this session to learn how Citizens is leveraging Precisely to replicate mainframe data to its customer channels and deliver on their “modern digital bank” experiences.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/temporal-event-neural-networks-a-more-efficient-alternative-to-the-transformer-a-presentation-from-brainchip/
Chris Jones, Director of Product Management at BrainChip , presents the “Temporal Event Neural Networks: A More Efficient Alternative to the Transformer” tutorial at the May 2024 Embedded Vision Summit.
The expansion of AI services necessitates enhanced computational capabilities on edge devices. Temporal Event Neural Networks (TENNs), developed by BrainChip, represent a novel and highly efficient state-space network. TENNs demonstrate exceptional proficiency in handling multi-dimensional streaming data, facilitating advancements in object detection, action recognition, speech enhancement and language model/sequence generation. Through the utilization of polynomial-based continuous convolutions, TENNs streamline models, expedite training processes and significantly diminish memory requirements, achieving notable reductions of up to 50x in parameters and 5,000x in energy consumption compared to prevailing methodologies like transformers.
Integration with BrainChip’s Akida neuromorphic hardware IP further enhances TENNs’ capabilities, enabling the realization of highly capable, portable and passively cooled edge devices. This presentation delves into the technical innovations underlying TENNs, presents real-world benchmarks, and elucidates how this cutting-edge approach is positioned to revolutionize edge AI across diverse applications.
Generating privacy-protected synthetic data using Secludy and MilvusZilliz
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Fueling AI with Great Data with Airbyte WebinarZilliz
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to production.
HCL Notes and Domino License Cost Reduction in the World of DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and licensing under the CCB and CCX models have been a hot topic for many in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and license fees. You may be wondering how this new kind of licensing works and what benefits it brings you. Above all, you surely want to stay within your budget and save costs wherever possible. We understand that, and we want to help!
We explain how to resolve common configuration problems that can lead to more users being counted than necessary, and how to identify and remove superfluous or unused accounts to save money. There are also some practices that can cause unnecessary expenses, for example using a person document instead of a mail-in database for shared mailboxes. We show you such cases and their solutions. And of course we explain the new license model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder introduce you to this new world. It gives you the tools and the know-how to keep an overview. You will be able to reduce your costs through an optimized Domino configuration and keep them low in the future.
Topics covered:
- Reducing license costs by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how best to use it
- Tips for common problem areas, such as team mailboxes and functional/test users
- Real-world examples and best practices you can apply immediately
Trusted Execution Environment for Decentralized Process MiningLucaBarbaro3
Presentation of the paper "Trusted Execution Environment for Decentralized Process Mining" given during the CAiSE 2024 Conference in Cyprus on June 7, 2024.
Cascading concurrent yahoo lunch_nlearn
1. DRIVING INNOVATION THROUGH DATA
BUILDING PRODUCTION-GRADE HADOOP APPLICATIONS WITH CASCADING
Supreet Oberoi
VP Field Engineering, Concurrent Inc
2. ABOUT ME
2
• I am a Data Engineer, not a Data Scientist
• I help enterprises make decisions about their "Big Data" roadmap and technical strategy — use cases, products, technology decisions, employee skills
• I design Hadoop applications with the intent to operationalize them in enterprise settings — applications on which the business depends, and that last longer than the technologies underneath them…
• This talk is about learning how to design a Big Data strategy that leverages best practices
3. BUILDING AN OPEN PLATFORM IS KEY TO PREVENTING LOCK-IN
3
• Open Language
• Open Data
• Open Hardware
• Open Compute Platform
• Open Development Platform
4. OPEN LANGUAGES ALLOW YOU TO HARNESS THE TALENT OF YOUR ENTERPRISE
4
• Don’t equate architecture with language; develop your architecture to support multiple languages
• Support SQL and SQL-like languages
• Encourage development in proven, scalable languages such as Java
• Develop the architecture to support changing programming languages (even for the same app)
• Use common performance-management tools across all programming environments
5. OPEN DATA ENABLES REUSE OF DATA AND APPS
5
Develop a common operating picture by promoting reuse with open data
• Prevent exclusive access to data sets
through proprietary tools
• Promote a common meta-data repository
• Forbid storing data in proprietary formats
• Build seamless integration capabilities
6. OPEN HARDWARE PROMOTES REUSE OF INFRASTRUCTURE
6
• Get commodity hardware — commodity hardware will
always cost less than “optimized” specialized
hardware (note: definition of “specialized” is up for
debate)
• Develop and maintain a cluster that can be reused by
different applications and technology stacks — avoid
custom software installations on the cluster, or setting
up dedicated clusters for given tech stacks
• Harness the power of collective from the cluster —
avoid fragmenting the cluster if possible
7. OPEN COMPUTE PLATFORM MAKES YOU SELECT THE RIGHT TOOL FOR THE PROBLEM
7
Make tradeoffs between reliability & speed based on your business context
• Ensure that moving your application from one
Hadoop compute platform (e.g. MapReduce) to
another (e.g., Tez) does not:
• impact application code
• impact production-monitoring tools
• Resist compute platforms that require your
enterprise to acquire significantly new skills (even
if it is easy) to become productive
• Avoid new platforms that partition the cluster
• Avoid platforms that do not support Open Data
8. OPEN DEVELOPMENT PLATFORM PROVIDES LONG-TERM SUSTAINABILITY
8
Development platforms improve developer productivity and operational excellence — picking a
correct platform gives you best practices developed by the community, achieving higher quality
• Invest in picking the correct development platform
— open, easy, scalable, popular, tools, …
• Bet on a sustainable open source platform
• Measure the vitality of the community:
• number of downloads, extensions (living
ecosystem), extensible architecture, consumers of
the technology, code stability…
A proven platform provides tools to get your apps to production
9. GET TO KNOW CONCURRENT
9
Leader in Application Infrastructure for Big Data
• Building enterprise software to simplify Big Data application
development and management
Products and Technology
• CASCADING
Open Source - The most widely used application infrastructure for
building Big Data apps with over 175,000 downloads each month
• DRIVEN
Enterprise data application management for Big Data apps
Proven — Simple, Reliable, Robust
• Thousands of enterprises rely on Concurrent to provide their data
application infrastructure.
Founded: 2008
HQ: San Francisco, CA
CEO: Gary Nakamura
CTO, Founder: Chris Wensel
www.concurrentinc.com
10. BIG DATA — OPERATIONALIZE YOUR DATA APPS WITH CASCADING
10
“It’s all about the apps”
There needs to be a comprehensive solution for building, deploying, running
and managing this new class of enterprise applications.
[Diagram: business strategy connected to data and technology; challenges include skill sets, systems integration, standard operating procedure, and operational visibility]
11. DATA APPLICATIONS - ENTERPRISE NEEDS
Enterprise Data Application Infrastructure
• Need reliable, reusable tooling to quickly build and consistently deliver
data products
• Need the degrees of freedom to solve problems ranging from simple to
complex with existing skill sets
• Need the flexibility to easily adapt an application to meet business needs
(latency, scale, SLA), without having to rewrite the application
• Need operational visibility for entire data application lifecycle
11
12. CASCADING - DE-FACTO FOR DATA APPS
12
[Diagram: Cascading apps (SQL, Clojure, Ruby) running on new fabrics (Tez, Storm), with system integration across mainframe, DB/DW, in-memory data stores, and Hadoop]
• Standard for enterprise
data app development
• Your programming
language of choice
• Cascading applications
that run on MapReduce
will also run on Apache
Spark, Storm, and …
13. WORD COUNT EXAMPLE WITH CASCADING
13
String docPath = args[ 0 ];
String wcPath = args[ 1 ];

// configuration
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// integration: create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// processing: specify a regex to split "document" text lines into a token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// scheduling: connect the taps, pipes, etc., into a flow definition
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );

// create the Flow
Flow wcFlow = flowConnector.connect( flowDef ); // <<-- unit of work
wcFlow.complete(); // <<-- runs jobs on the cluster
14. SOME COMMON PATTERNS
• Functions
• Filters
• Joins
‣ Inner / Outer / Mixed
‣ Asymmetrical / Symmetrical
• Merge (Union)
• Grouping
‣ Secondary Sorting
‣ Unique (Distinct)
• Aggregations
‣ Count, Average, etc.
14
[Diagram: a data pipeline of filters and functions, composed into split, join, and merge topologies]
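The patterns above compose like ordinary collection operations. As a plain-Java sketch (standard library only, no Cascading or Hadoop; the class name and sample data are made up for illustration), a function → filter → grouping → aggregation pipeline over text lines looks like:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PatternSketch {
    // function (tokenize), filter (drop empties), grouping + aggregation (count)
    static Map<String, Long> wordCounts(List<String> lines) {
        return lines.stream()
            .flatMap(line -> Arrays.stream(line.split("\\s+"))) // function: split into tokens
            .filter(token -> !token.isEmpty())                  // filter: drop empty tokens
            .collect(Collectors.groupingBy(t -> t, Collectors.counting())); // group + count
    }

    public static void main(String[] args) {
        System.out.println(wordCounts(Arrays.asList("rain shadow", "rain rain")));
    }
}
```

In Cascading the same shape is expressed with Each (functions and filters), GroupBy, and Every (aggregators) pipes, but the flow runs distributed over HDFS data instead of an in-memory list.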
15. CASCADING
• Java API
• Separates business logic from integration
• Testable at every lifecycle stage
• Works with any JVM language
• Many integration adapters
15
[Diagram: Cascading architecture: Processing API and Integration API over a Process Planner, Scheduler API, and Scheduler, running on Apache Hadoop against data stores; usable from scripting languages (Scala, Clojure, JRuby, Jython, Groovy) and enterprise Java]
16. THE STANDARD FOR DATA APPLICATION DEVELOPMENT
16
www.cascading.org
A proven application development framework for building data apps; an application platform that addresses:
• Building scale-free data apps: design principles ensure best practices at any scale
• Test-driven development: efficiently test code and process local files before deploying on a cluster
• The staffing bottleneck: use existing Java, Scala, SQL, and modeling skill sets
• Application portability: write once, then run on different computation fabrics
• Operational complexity: simple; package everything into one jar and hand it to operations
• Systems integration: Hadoop never lives alone; easily integrate with existing systems
18. CASCADING DATA APPLICATIONS
18
Enterprise IT: extract-transform-load, log file analysis, systems integration, operations analysis
Corporate Apps: HR analytics, employee behavioral analysis, customer support / eCRM, business reporting
Telecom: data processing of open data, geospatial indexing
Consumer Mobile Apps: location-based services
Marketing / Retail: mobile, social, and search analytics; funnel analysis; revenue attribution; customer experiments; ad optimization; retail recommenders
Consumer / Entertainment: music recommendation, comparison shopping, restaurant rankings, real estate rental listings, travel search and forecast
Finance: fraud and anomaly detection, fraud experiments, customer analytics, insurance risk metrics
Health / Biotech: aggregate metrics for government, person biometrics, veterinary diagnostics, next-gen genomics, agronomics, environmental maps
19. BUSINESSES DEPEND ON US
• Cascading Java API
• Data normalization and cleansing of search and click-through logs for
use by analytics tools, Hive analysts
• Easy to operationalize heavy lifting of data in one framework
19
20. BUSINESSES DEPEND ON US
• Cascalog (Clojure)
• Weather pattern modeling to protect growers against loss
• ETL against 20+ datasets daily
• Machine learning to create models
• Purchased by Monsanto for $930M US
20
21. BUSINESSES DEPEND ON US
• Scalding (Scala)
• Makes complex analysis of very large data sets simple
• Machine learning, linear algebra to improve
• 30,000 jobs a day — this works @ scale
• Ad quality (matching users and ad effectiveness)
21
TWITTER
24. … AND INCLUDES RICH SET OF EXTENSIONS
24
http://www.cascading.org/extensions/
25. CASCADING 3.0
25
“Write once and deploy on your fabric of choice.”
• The Innovation — Cascading 3.0 will
allow for data apps to execute on
existing and emerging fabrics
through its new customizable query
planner.
• Cascading 3.0 will support — Local
In-Memory, Apache MapReduce and
soon thereafter (3.1) Apache Tez,
Apache Spark and Apache Storm
[Diagram: enterprise data applications running on computation fabrics: local in-memory, MapReduce, and other/custom fabrics]
26. USE LINGUAL TO MIGRATE ITERATIVE ETL TASKS TO HADOOP
• Lingual is an extension to Cascading that
executes ANSI SQL queries as Cascading apps
• Supports integrating with any data source that can
be accessed through JDBC — Cascading Tap
can be created for any source supporting JDBC
• Great for migration of data, integrating with non-
Big Data assets — extends life of existing IT
assets in an organization
26
[Diagram: Lingual architecture: CLI/shell and enterprise Java clients over the Provider API, JDBC API, and Lingual API, with a query planner and catalog, running on Cascading and Apache Hadoop against data stores]
27. SCALDING
• Scalding is a language binding to Cascading for Scala
27
• The name Scalding comes from the combining of SCALa and cascaDING
• Scalding is great for Scala developers; can crisply write constructs for matrix
math…
• Scalding has very large commercial deployments at:
• Twitter - Use cases such as the revenue quality team, ad targeting and traffic quality
• eBay - Use cases include search analytics and other production data pipelines
28. PATTERN SCORES MODELS AT SCALE
28
• Pattern is an open source project that lets you leverage Predictive Model
Markup Language (PMML) models by translating them into Cascading apps
• PMML is a popular XML-based standard that lets applications describe data mining and
machine learning models
• PMML models from popular analytics frameworks can be reused and
deployed within Cascading workflows
• Vendor frameworks - SAS, IBM SPSS, MicroStrategy, Oracle
• Open source frameworks - R, Weka, KNIME, RapidMiner
• Pattern is great for migrating your model scoring to Hadoop from your
decision systems
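Conceptually, what Pattern does is apply a trained model function to every record in a Cascading flow. A stdlib-only sketch of the scoring step for a logistic-regression model (the kind of coefficients a PMML RegressionModel element carries; the weights here are made up, and this is not Pattern's actual API):

```java
import java.util.Map;

public class PmmlScoringSketch {

    // Intercept and per-field coefficients, as a trained model (and its
    // PMML export) would provide; values here are illustrative only.
    static final double INTERCEPT = -1.5;
    static final Map<String, Double> WEIGHTS = Map.of("age", 0.04, "income", 0.00001);

    // Logistic-regression score: sigmoid of the linear combination.
    static double score(Map<String, Double> record) {
        double z = INTERCEPT;
        for (Map.Entry<String, Double> w : WEIGHTS.entrySet()) {
            z += w.getValue() * record.getOrDefault(w.getKey(), 0.0);
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    public static void main(String[] args) {
        // Pattern would evaluate this per tuple, inside a Cascading flow, at scale.
        double p = score(Map.of("age", 40.0, "income", 50000.0));
        System.out.println(p); // a probability strictly between 0 and 1
    }
}
```

The value of Pattern is that the model itself stays in PMML, so the same export from R, SAS, or SPSS can be scored on Hadoop without reimplementing the math by hand as above.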
29. PATTERN SCORES MODELS AT SCALE
Step 1: Train your model with industry-leading tools
Step 2: Score your models at scale with Pattern
29 Confidential
30. OPERATIONAL EXCELLENCE WITH DRIVEN
Visibility Through All Stages of App Lifecycle
From Development — Building and Testing
• Design & Development
• Debugging
• Tuning
To Production — Monitoring and Tracking
• Maintain Business SLAs
• Balance & Controls
• Application and Data Quality
• Operational Health
• Real-time Insights
30
31. DRIVEN FOR HIVE: OPERATIONAL VISIBILITY FOR YOUR HIVE APPS
• Understand the anatomy of your Hive app
• Track execution of queries as single business process
• Identify outlier behavior by comparison with historical runs
• Analyze rich operational meta-data
• Correlate Hive app behavior with other events on cluster
31
33. DEEPER VISUALIZATION INTO YOUR HADOOP CODE
• Easily comprehend, debug, and tune
your data applications
• Get rich insights on your application
performance
• Monitor applications in real-time
• Compare app performance with
historical (previous) iterations
33
Debug and optimize your Hadoop applications more effectively with Driven
34. GET OPERATIONAL INSIGHTS WITH DRIVEN
• Quickly break down how often
applications execute based on their tags,
teams, or names
• Immediately identify if any application is
monopolizing cluster resources
• Understand the utilization of your cluster
with a timeline of all applications running
34
Visualize the activity of your applications to help maintain SLAs
35. ORGANIZE YOUR APPLICATIONS WITH GREATER FIDELITY
• Easily keep track of all your
applications by segmenting them with
user-defined tags
• Segment your applications for
trending analysis, cluster analysis,
and developing chargeback models
• Quickly break down how often
applications execute based on their
tags, teams, or names
35
Segment your applications for greater insights across all your applications
36. COLLABORATE WITH TEAMS
Utilize teams to collaborate and gain visibility over your set of applications
• Invite others to view and collaborate
on a specific application
• Gain visibility to all the apps and their
owners associated with each team
• Simply manage your teams and the
users assigned to them
36
37. MANAGE PORTFOLIO OF BIG DATA APPLICATIONS
Fast, powerful, rich search capabilities enable you to easily find the exact set of applications that you’re looking for
• Identify problematic apps with their
owners and teams
• Search for groups of applications
segmented by user-defined tags
• Compare specific applications with their
previous iterations to ensure that your
application can meet its SLAs
37
38. DRIVEN FOR HIVE: OPERATIONAL VISIBILITY FOR YOUR HIVE APPS
• Understand the anatomy of your Hive app
• Track execution of queries as single business process
• Identify outlier behavior by comparison with historical runs
• Analyze rich operational meta-data
• Correlate Hive app behavior with other events on cluster
38
39. SUMMARY - BUILD ROBUST DATA APPS RIGHT THE FIRST TIME WITH CASCADING
• The Cascading framework enables developers to intuitively create data applications that are scalable,
robust, and future-proof, supporting new execution fabrics without requiring a code rewrite
• Scalding — a Scala-based extension to Cascading — provides crisp programming
constructs for algorithm developers and data scientists
• Driven — an application visualization product — provides rich insights into how your
applications execute, improving developer productivity by 10x
• Cascading 3.0 opens up the query planner — write apps once, run on any fabric
39
Concurrent offers training classes for Cascading (DEC 9) & Scalding (NOV 4)
43. DIFFERENT PHILOSOPHY THAN GUI TOOLS
• Cascading is a general-purpose framework for developing data applications; it supports development through the
entire lifecycle of data — from staging to final data sets
• Developing with an API is more productive and intuitive than with a UI — it incorporates best practices
43
• Can do in three lines of code what takes 20 clicks in a GUI tool (the fluent API with an IDE makes it even simpler)
• Can test locally and deploy to production without code changes
• Because it is based in code: debuggable, extensible, deployable, traceable
• GUI tools do not help in visualizing the must-have insights
• Real-time application visualization, application bottlenecks, anatomy of the application, application
dependencies, cost breakdown of an operation (e.g., a join), bottlenecks due to code, data, network, or cluster
44. PATTERN: ALGOS IMPLEMENTED
• Hierarchical Clustering
• K-Means Clustering
• Linear Regression
• Logistic Regression
• Random Forest
• Algorithms are extended based on customer use cases
44
45. CASCADING 3.0 IMPACT - DATA APP DEVELOPMENT FOR SPARK ON ROBUST FRAMEWORK
• Cascading 3.0 will ease application migration to Spark
• Enterprises can standardize on one API to meet business challenges and solve a
variety of business problems ranging from simple to complex, regardless of latency or
scale
• Third-party products, data apps, frameworks, and dynamic programming languages
built on Cascading will immediately benefit from this portability
• Even more operational visibility from development through production with Driven
45
46. BUSINESSES DEPEND ON US
• Estimate suicide risk from what people write online
• Cascading + Cassandra
• You can do more than optimize ad yields
• http://www.durkheimproject.org
46