Big Data Meetup
October 29, 2014
C-BAG 
Chennai
C-BAG 
Chennai Big Data Analytic Group 
C-BAG is an open group formed in the interest of creating a good BIG
DATA environment.
C-BAG conducts free weekly and monthly online/offline sessions,
creating awareness of BIG DATA technologies and supporting BIG DATA
initiatives. C-BAG's aim is to be a one-stop place for all BIG DATA queries,
discussions and support!
Contact Us : 
chennaibigdataanalyticgroup@gmail.com 
Speakers 
About Dhruv Kumar 
Solutions Architect 
Concurrent Inc. 
Dhruv Kumar has over six years of 
diverse software development 
experience in Big Data, Web and High 
Performance Computing applications. 
Prior to joining Concurrent, he worked at 
Terracotta as a Software Engineer. He 
has an MS degree in Computer
Engineering from the University of 
Massachusetts-Amherst. 
About Vinay Shukla 
Director of Product Management 
Hortonworks 
Vinay Shukla is a seasoned Enterprise 
Software professional with extensive 
experience in Product management, 
Product development and Project 
management. Prior to Hortonworks, Vinay
worked as a security architect, product
manager, developer and project manager.
Vinay admits to being a caffeine addict 
and spends his free time on a Yoga mat 
and on Hikes.
Hortonworks enables adoption of Apache Hadoop with Hortonworks Data Platform 
• Founded in 2011 
• Original 24 architects, developers, 
operators of Hadoop from Yahoo! 
• Leaders in Hadoop community 
• 500+ employees 
Customer Momentum 
• 300+ customers in seven quarters, growing at 75+/quarter 
• Two thirds of customers come from F1000 
Partner Momentum 
• Over 1000 Partners, Hundreds of Certified Solutions 
• Some key 
partners include: 
Hortonworks and Hadoop at Scale 
• HDP in production on largest clusters on planet 
• Most 1000+ node clusters
The Forrester Wave™ 
Big Data Hadoop Solutions 
Q1 2014 
A Leader in Hadoop 
“Hortonworks loves and lives 
open source innovation” 
World Class Support and Services. 
Hortonworks' Customer Support received a 
maximum score and was significantly higher than 
both Cloudera and MapR
HDP IS Apache Hadoop 
There is ONE Enterprise Hadoop: everything else is a vendor derivation 
Release timeline: HDP 2.0 (October 2013), HDP 2.1 (April 2014), HDP 2.2 (October 2014)
[Figure: Hortonworks Data Platform 2.2 component versions by release. Components shown: Hadoop & YARN, Pig, Hive & HCatalog, Tez, HBase, Phoenix, Accumulo, Storm, Solr, Spark, Slider, Falcon, Sqoop, Flume, Kafka, Ambari, Oozie, Zookeeper, Knox and Ranger, grouped into Data Management, Data Access, Governance & Integration, Operations and Security.]
* version numbers are targets and subject to change at time of general availability in accordance with ASF release process
The Modern Data Architecture w/ HDP 
Enterprise Goals for the Modern Data Architecture 
• Consolidate siloed data sets, structured
and unstructured
• Central data set on a single cluster
• Multiple workloads across batch,
interactive and real time
• Central services for security, governance 
and operation 
• Preserve existing investment in current 
tools and platforms 
• Single view of the customer, product, 
supply chain 
[Diagram: the Modern Data Architecture in three layers. APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications. DATA SYSTEM: RDBMS, EDW and MPP alongside Hadoop, where YARN (Data Operating System) runs Batch, Interactive and Real-Time workloads over HDFS (Hadoop Distributed File System) on nodes 1 to N. SOURCES: Existing Systems (CRM, ERP, Other), Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured.]
1. Unlock New Applications from New Types of Data 
INDUSTRY: USE CASE (data types: Sentiment & Web, Clickstream & Behavior, Machine & Sensor, Geographic, Server Logs, Structured & Unstructured)
Financial Services 
New Account Risk Screens ✔ ✔ 
Trading Risk ✔ 
Insurance Underwriting ✔ ✔ ✔ 
Telecom 
Call Detail Records (CDR) ✔ ✔ 
Infrastructure Investment ✔ ✔ 
Real-time Bandwidth Allocation ✔ ✔ ✔ 
Retail 
360° View of the Customer ✔ ✔ ✔ 
Localized, Personalized Promotions ✔ 
Website Optimization ✔ 
Manufacturing 
Supply Chain and Logistics ✔ 
Assembly Line Quality Assurance ✔ 
Crowd-sourced Quality Assurance ✔ 
Healthcare 
Use Genomic Data in Medical Trials ✔ ✔ ✔
Monitor Patient Vitals in Real-Time 
Pharmaceuticals 
Recruit and Retain Patients for Drug Trials ✔ ✔ 
Improve Prescription Adherence ✔ ✔ ✔ ✔ 
Oil & Gas 
Unify Exploration & Production Data ✔ ✔ ✔ ✔ 
Monitor Rig Safety in Real-Time ✔ ✔ ✔ 
Government 
ETL Offload/Federal Budgetary Pressures ✔ ✔ 
Sentiment Analysis for Government Programs ✔
..to shift from reactive to proactive interactions
HDP and Hadoop allow organizations to shift interactions from…
Reactive, Post Transaction …to Proactive, Pre Decision
A shift in Advertising
From mass branding …to 1x1 Targeting
A shift in Financial Services
From Educated Investing …to Automated Algorithms
A shift in Healthcare
From mass treatment …to Designer Medicine
A shift in Retail
From static branding …to Real-time Personalization
A shift in Telco
From break then fix …to repair before break
2. Or to realize a dramatic cost savings… 
EDW Optimization 
Current Reality (EDW workload: Operations 50%, ETL Process 30%, Analytics 20%)
• EDW at capacity: some usage from low value workloads
• Older data archived, unavailable for ongoing exploration
• Source data often discarded
Augment w/ Hadoop (EDW workload: Operations 50%, Analytics 50%; Hadoop: Parse, Cleanse, Apply Structure, Transform)
• Free up EDW resources from low value tasks
• Keep 100% of source data and historical data for ongoing exploration
• Mine data for value after loading it because of schema-on-read
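As a rough illustration of schema-on-read, the sketch below keeps raw lines untouched and applies a schema only when the data is read. It is plain Java, not Hadoop or Hive code; the class name `SchemaOnRead`, the comma-separated record layout, and both "schemas" are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class SchemaOnRead {
    // Raw records are stored exactly as they arrived; the caller supplies
    // the schema (a parsing function) at read time.
    static <T> List<T> read(List<String> rawLines, Function<String, T> schema) {
        List<T> out = new ArrayList<>();
        for (String line : rawLines) {
            out.add(schema.apply(line));
        }
        return out;
    }

    // Hypothetical schema 1: interpret the second field as an event type.
    static String eventType(String line) {
        return line.split(",")[1];
    }

    // Hypothetical schema 2: interpret the third field as a numeric value.
    static int value(String line) {
        return Integer.parseInt(line.split(",")[2].trim());
    }
}
```

The same stored lines can be read with `SchemaOnRead::eventType` today and with `SchemaOnRead::value` later, which is the point of loading first and structuring afterwards.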
2. Or to realize a dramatic cost savings… 
EDW Optimization 
Commodity Compute & Storage 
Hadoop Enables Scalable Compute & 
Storage at a Compelling Cost Structure 
[Chart: Fully-loaded Cost Per Raw TB of Data (Min–Max Cost), on an axis from $0 to $180,000, comparing Cloud Storage, Engineered System, MPP, SAN, Hadoop and NAS.]
Storage Costs/Compute Costs
from $19/GB to $0.23/GB
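To put the quoted figures in proportion, the small sketch below computes the ratio and the fractional saving between two per-GB costs. Only the two dollar amounts come from the slide; the class and method names are hypothetical.

```java
class CostReduction {
    // How many times cheaper the new cost is than the old one.
    static double ratio(double before, double after) {
        return before / after;
    }

    // Fractional saving; for example 0.25 means 25% cheaper.
    static double saving(double before, double after) {
        return 1.0 - after / before;
    }

    public static void main(String[] args) {
        double before = 19.00; // $/GB, as quoted on the slide
        double after = 0.23;   // $/GB, as quoted on the slide
        System.out.println("ratio:  " + ratio(before, after));
        System.out.println("saving: " + saving(before, after));
    }
}
```

With the slide's numbers this works out to roughly an 83-fold reduction, or a saving of about 98.8%.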
3. Data Lake: An architectural shift 
SCALE 
SCOPE 
Unlocking the Data Lake 
RDBMS 
MPP 
EDW 
Data Lake 
Enabled by YARN 
• Single data repository, 
shared infrastructure 
• Multiple biz apps 
accessing all the data 
• Enable a shift from 
reactive to proactive 
interactions 
• Gain new insight across 
the entire enterprise 
New Analytic Apps 
or IT Optimization 
HDP 2.1 
Governance 
& Integration 
Security 
Operations 
Data Access 
YARN 
Data Management
Case Study: 12 month Hadoop evolution at TrueCar 
Data Platform Capabilities: 12 months execution plan
• June 2013: Begin Hadoop Execution
• July 2013: Hortonworks Partnership
• Aug 2013: Training & Dev Begins
• Nov 2013: Production Cluster, 60 Nodes, 2 PB
• Dec 2013: Three Production Apps (3 total)
• Jan 2014: 40% Dev Staff Perficient
• Feb 2014: Three More Production Apps (6 total)
• May ‘14: IPO
12 Month Results at TrueCar
• Six Production Hadoop Applications
• Sixty nodes/2PB data
• Storage Costs/Compute Costs
from $19/GB to $0.23/GB
“We addressed our data platform capabilities
strategically as a precursor to IPO.”
DRIVING
INNOVATION
THROUGH
DATA
REDUCING DEVELOPMENT TIME FOR
PRODUCTION-GRADE HADOOP APPLICATIONS
Dhruv Kumar 
Solutions Architect, Concurrent Inc
GET TO KNOW CONCURRENT 
Leader in Application Infrastructure for Big Data
• Building enterprise software to simplify Big Data application
development and management
Products and Technology
• CASCADING
Open Source - The most widely used application infrastructure for
building Big Data apps with over 200,000 downloads each month.
8000 deployments worldwide.
• DRIVEN
Enterprise data application management for Big Data apps
Proven: Simple, Reliable, Robust
• Thousands of enterprises rely on Concurrent to provide their data
application infrastructure.
Founded: 2008
HQ: San Francisco, CA
CEO: Gary Nakamura
CTO, Founder: Chris Wensel
www.concurrentinc.com
BIG DATA APPLICATION INFRASTRUCTURE 
“It’s all about the apps”
There needs to be a comprehensive solution for building, deploying, running
and managing this new class of enterprise applications.
Business Strategy
Connecting Business and Data
Data & Technology
Challenges: skill sets, systems integration,
standard op procedure and operational visibility
DATA APPLICATIONS - ENTERPRISE NEEDS 
Enterprise Data Application Infrastructure
• Need reliable, reusable tooling to quickly build and consistently deliver
data products
• Need the degrees of freedom to solve problems ranging from simple to
complex with existing skill sets
• Need the flexibility to easily adapt an application to meet business needs
(latency, scale, SLA), without having to rewrite the application
• Need operational visibility for entire data application lifecycle
WORD COUNT EXAMPLE WITH CASCADING 
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class Main {
  public static void main( String[] args ) {
    // configuration
    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];
    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // integration: create source and sink taps (tab-delimited, with header)
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

    // processing: specify a regex to split "document" text lines into token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // scheduling: connect the taps, pipes, etc., into a flow definition
    FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );
    // create the Flow
    Flow wcFlow = flowConnector.connect( flowDef ); // <<-- Unit of Work
    wcFlow.complete(); // <<-- Runs jobs on Cluster
  }
}
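For readers without a cluster at hand, the split, group and count steps of the flow above can be approximated in plain Java. This is only a local sketch of the same logic, not Cascading code: the class name `LocalWordCount` is hypothetical, and skipping empty tokens is a choice made here rather than something the slide shows.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class LocalWordCount {
    // Delimiters: space, square brackets, parentheses, comma and period.
    static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

    // Split every line into tokens, then count occurrences per token.
    static Map<String, Long> count(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : line.split(DELIMITERS)) {
                if (token.isEmpty()) {
                    continue; // adjacent delimiters produce empty tokens; skip them
                }
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("rain shadow (dry)", "rain, rain.")));
    }
}
```

The loop over tokens plays the role of the `Each` with the splitter, and the map keyed by token stands in for `GroupBy` followed by `Count`.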
SOME COMMON PROCESSING PATTERNS 
• Functions 
• Filters 
• Joins 
‣ Inner / Outer / Mixed 
‣ Asymmetrical / Symmetrical 
• Merge (Union) 
• Grouping 
‣ Secondary Sorting 
‣ Unique (Distinct) 
• Aggregations 
‣ Count, Average, etc 
[Diagram: a Pipeline, in which data passes through a chain of filters and functions, and a Topology, in which data streams are combined with Split, Join and Merge.]
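The patterns listed above can be sketched on in-memory lists with the Java streams API. This is an analogy only and does not use Cascading; the `PatternSketch` class, its `Row` type and the method names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class PatternSketch {
    // Hypothetical record type: a key and a numeric value.
    static final class Row {
        final String key;
        final int value;
        Row(String key, int value) { this.key = key; this.value = value; }
    }

    // Filter, then function: drop non-positive values and double the rest.
    static List<Integer> filterThenMap(List<Integer> values) {
        return values.stream()
                .filter(v -> v > 0)
                .map(v -> v * 2)
                .collect(Collectors.toList());
    }

    // Merge (union): concatenate two streams of the same shape.
    static List<Row> merge(List<Row> a, List<Row> b) {
        return Stream.concat(a.stream(), b.stream()).collect(Collectors.toList());
    }

    // Grouping plus aggregation: count of rows per key.
    static Map<String, Long> countByKey(List<Row> rows) {
        return rows.stream().collect(
                Collectors.groupingBy(r -> r.key, TreeMap::new, Collectors.counting()));
    }

    // Grouping plus aggregation: average value per key.
    static Map<String, Double> averageByKey(List<Row> rows) {
        return rows.stream().collect(
                Collectors.groupingBy(r -> r.key, TreeMap::new,
                        Collectors.averagingInt(r -> r.value)));
    }

    // Inner join on key: emit one result per matching pair of rows.
    static List<String> innerJoin(List<Row> left, List<Row> right) {
        List<String> joined = new ArrayList<>();
        for (Row l : left) {
            for (Row r : right) {
                if (l.key.equals(r.key)) {
                    joined.add(l.key + ":" + l.value + "," + r.value);
                }
            }
        }
        return joined;
    }
}
```

The nested-loop join is the simplest form to read; a distributed framework would instead group both sides by key before pairing them.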
CASCADING API 
• Java API 
• Separates business logic from integration 
• Testable at every lifecycle stage 
• Works with any JVM language 
• Many integration adapters 
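[Diagram: the Cascading stack. Enterprise Java and Scripting languages (Scala, Clojure, JRuby, Jython, Groovy) sit on top of Cascading's Processing API and Integration API; below them are the Process Planner and Scheduler API, then the Apache Hadoop Scheduler and Data Stores.]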
FRAMEWORK AND PROGRAMMING LANGUAGE 
INDEPENDENCE 
Cascading Domain Specific Languages (DSLs): SQL, Clojure, Ruby
New Fabrics: Tez, Storm
Supported Fabrics and Data Stores: Mainframe, DB / DW, In-Memory Data Stores, Hadoop
• Any JVM language can
use Cascading API
• Cascading applications
that run on MapReduce
will also run on Apache
Spark, Storm, and …
THE STANDARD FOR DATA APPLICATION DEVELOPMENT 
www.cascading.org
Proven application development framework for building data apps
Application platform that addresses:
• Build data apps that are scale-free: Design principles ensure best practices at any scale
• Test-Driven Development: Efficiently test code and process local files before deploying on a cluster
• Staffing Bottleneck: Use existing Java, SQL, modeling skill sets
• Application Portability: Write once, then run on different computation fabrics
• Operational Complexity: Simple - Package up into one jar and hand to operations
• Systems Integration: Hadoop never lives alone. Easily integrate to existing systems
CASCADING DATA APPLICATIONS 
Enterprise IT: Extract Transform Load, Log File Analysis, Systems Integration, Operations Analysis
Corporate Apps: HR Analytics, Employee Behavioral Analysis, Customer Support | eCRM, Business Reporting
Telecom: Data processing of Open Data, Geospatial Indexing, Consumer Mobile Apps, Location based services
Marketing / Retail: Mobile, Social, Search Analytics, Funnel Analysis, Revenue Attribution, Customer Experiments, Ad Optimization, Retail Recommenders
Consumer / Entertainment: Music Recommendation, Comparison Shopping, Restaurant Rankings, Real Estate, Rental Listings, Travel Search & Forecast
Finance: Fraud and Anomaly Detection, Fraud Experiments, Customer Analytics, Insurance Risk Metric
Health / Biotech: Aggregate Metrics For Govt, Person Biometrics, Veterinary Diagnostics, Next-Gen Genomics, Agronomics, Environmental Maps
STRONG ORGANIC GROWTH 
200,000+ downloads / month
8000+ Deployments
BUSINESSES DEPEND ON US 
TWITTER
• 30000 Jobs per day
• Makes complex analysis of very large data sets simple
• Machine learning, linear algebra to improve
‣ User experience
‣ Ad quality (matching users and ad effectiveness)
• All revenue applications are running on Cascading/Scalding
BUSINESSES DEPEND ON US 
• Cascading Java API
• Data normalization and cleansing of search and click-through logs for
use by analytics tools, Hive analysts
• Easy to operationalize heavy lifting of data in one framework
BUSINESSES DEPEND ON US 
• Cascalog (Clojure)
• Weather pattern modeling to protect growers against loss
• ETL against 20+ datasets daily
• Machine learning to create models
• Purchased by Monsanto for $930M US
BROAD SUPPORT 
Hadoop ecosystem supports Cascading
… AND INCLUDES RICH SET OF EXTENSIONS 
http://www.cascading.org/extensions/
WORD COUNT DEMO ON HDP 
SUMMARY - BUILD ROBUST DATA APPS RIGHT THE FIRST TIME 
WITH CASCADING 
• Cascading framework enables developers to intuitively create data applications that scale
and are robust and future-proof, supporting new execution fabrics without requiring a code rewrite
• Driven, an application visualization product, provides rich insights into how your
applications execute, improving developer productivity by 10x
• Cascading 3.0 opens up the query planner: write apps once, run on any fabric
Concurrent offers training classes for Cascading & Scalding
CONTACT INFORMATION 
Dhruv Kumar
Solutions Architect
Concurrent Inc.
dkumar@concurrentinc.com
DRIVING
INNOVATION
THROUGH
DATA
THANK YOU
Dhruv Kumar
APPENDIX 
USE LINGUAL TO MIGRATE ITERATIVE ETL TASKS TO SPARK 
• Lingual is an extension to Cascading that
executes ANSI SQL queries as Cascading apps
• Supports integrating with any data source that can
be accessed through JDBC: a Cascading Tap
can be created for any source supporting JDBC
• Great for migration of data, integrating with non-Big Data
assets: extends life of existing IT
assets in an organization
[Diagram: the Lingual stack. CLI / Shell and Enterprise Java use the Provider API, JDBC API and Lingual API; Lingual (Query Planner, Catalog) runs on Cascading, which runs on Apache Hadoop and Data Stores.]
SCALDING 
• Scalding is a language binding to Cascading for Scala
• The name Scalding comes from combining SCALa and cascaDING
• Scalding is great for Scala developers; can crisply write constructs for matrix
math…
• Scalding has very large commercial deployments at:
‣ Twitter - Use cases such as the revenue quality team, ad targeting and traffic quality
‣ eBay - Use cases include search analytics and other production data pipelines
PATTERN ENABLES MIGRATING YOUR MODELS TO SPARK 
• Pattern is an open source project that lets you take Predictive Model
Markup Language (PMML) models and translate them into Cascading
apps.
‣ PMML is a popular XML-based format that allows applications to describe data mining and
machine learning models
• PMML models from popular analytics frameworks can be reused and
deployed within Cascading workflows
‣ Vendor frameworks - SAS, IBM SPSS, MicroStrategy, Oracle
‣ Open source frameworks - R, Weka, KNIME, RapidMiner
• Pattern is great for migrating your model scoring to Hadoop from your
decision systems
PATTERN: ALGOS IMPLEMENTED 
• Hierarchical Clustering 
• K-Means Clustering 
• Linear Regression 
• Logistic Regression 
• Random Forest 
Algorithms extended based on customer use cases
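To make "model scoring" concrete, here is a minimal sketch of what scoring one record with a logistic regression model involves. It does not use Pattern and does not parse PMML; the class name `LogisticScore` and the idea of passing the intercept and weights in directly are assumptions for illustration.

```java
class LogisticScore {
    // Logistic regression score: 1 / (1 + exp(-(intercept + weights . features))).
    static double score(double intercept, double[] weights, double[] features) {
        if (weights.length != features.length) {
            throw new IllegalArgumentException("weights and features must have the same length");
        }
        double z = intercept;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }
}
```

In a real deployment the intercept and weights would come from the exported model rather than being typed in by hand, and the same function would be applied to every incoming record.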
BUILDING AND RUNNING PMML MODELS 
[Diagram: PMML workflow in two halves. Model Producer: ETL and prepare data (Lingual), explore data and build a model using regression, clustering, etc. (Training), then export a PMML model. Model Consumer: ETL and prepare new data (Lingual), score it with the PMML model (Pattern), then post-process the scores. A feedback loop measures and improves the model.]

C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Cascading
