Accelerate Big Data Application Development with Cascading and HDP



  1. 1. Page 1 Accelerate Big Data Application Development with Cascading and HDP April 22, 2014
  2. 2. Page 2 Agenda
     • Take advantage of the latest Hadoop processing frameworks like YARN and Tez in HDP 2.1
     • How developers can create future-proof, data-driven applications built on Apache Hadoop with Cascading
     • How Cascading accelerates Hadoop application development by abstracting the platforms underneath
  3. 3. Page 3 Speakers
     • Ajay Singh, Director of Technical Channels, Hortonworks
     • Supreet Oberoi, VP of Field Engineering, Concurrent
  4. 4. Page 4 Our Mission: Enable your Modern Data Architecture by delivering Enterprise Apache Hadoop
     • Open Leadership: Drive innovation in the open exclusively via the Apache community-driven open source process
     • Enterprise Rigor: Engineer, test and certify Apache Hadoop with the enterprise in mind
     • Ecosystem Endorsement: Focus on deep integration with existing data center technologies and skills
     Headquartered in Palo Alto, CA; 300+ employees and growing
  5. 5. Page 5 A data architecture under pressure from new data
     • APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
     • DATA SYSTEM: REPOSITORIES (RDBMS, EDW, MPP)
     • SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs)
     • New data: OLTP, ERP, CRM Systems; Unstructured documents, emails; Clickstream; Server logs; Sentiment, Web Data; Sensor, Machine Data; Geolocation
     • Source: IDC. 2.8 ZB in 2012; 85% from New Data Types; 15x Machine Data by 2020; 40 ZB by 2020
  6. 6. Page 6 A Modern Data Architecture
     • APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
     • DATA SYSTEM: REPOSITORIES (RDBMS, EDW, MPP) alongside ENTERPRISE HADOOP (Governance & Integration, Security, Operations, Data Access, Data Management)
     • SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
     • OPERATIONAL TOOLS: MANAGE & MONITOR
     • DEV & DATA TOOLS: BUILD & TEST
  7. 7. Page 7 Hadoop Value: New types of data
     • Clickstream: Capture and analyze website visitors’ data trails and optimize your website
     • Sensors: Discover patterns in data streaming automatically from remote sensors and machines
     • Server Logs: Research logs to diagnose process failures and prevent security breaches
     • Sentiment: Understand how your customers feel about your brand and products, right now
     • Geographic: Analyze location-based data to manage operations where they occur
     • Unstructured: Understand patterns in files across millions of web pages, emails, and documents
  8. 8. Page 8 Enterprise Hadoop: Core Foundation of Hadoop Applications
  9. 9. Page 9 Core Capabilities of Enterprise Hadoop
     • GOVERNANCE & INTEGRATION: Load data and manage according to policy
     • OPERATIONS: Deploy and effectively manage the platform
     • DATA MANAGEMENT: Store and process all of your Corporate Data Assets
     • DATA ACCESS: Access your data simultaneously in multiple ways (batch, interactive, real-time)
     • SECURITY: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection
     • PRESENTATION & APPLICATION: Enable both existing and new applications to provide value to the organization
     • ENTERPRISE MGMT & SECURITY: Empower existing operations and security tools to manage Hadoop
     • DEPLOYMENT OPTIONS: Provide deployment choice across physical, virtual, cloud
  10. 10. Page 10 HDP 2.1: Enterprise Hadoop (Hortonworks Data Platform)
      • GOVERNANCE & INTEGRATION: Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, NFS, WebHDFS)
      • DATA ACCESS: Batch (MapReduce); Script (Pig); SQL (Hive/Tez, HCatalog); NoSQL (HBase, Accumulo); Stream (Storm); Search (Solr); Others (In-Memory Analytics, ISV engines)
      • YARN: Data Operating System
      • DATA MANAGEMENT: HDFS (Hadoop Distributed File System), nodes 1 … N
      • SECURITY: Authentication, Authorization, Accounting, Data Protection. Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox
      • OPERATIONS: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)
      • Deployment Choice: Linux, Windows, On-Premise, Cloud
  11. 11. Page 11 Hadoop is wholly integrated into the data center
      • APPLICATIONS, DATA SYSTEM and SOURCES layers, with OPERATIONAL TOOLS, DEV & DATA TOOLS and INFRASTRUCTURE
      • DATA SYSTEM: RDBMS, EDW, MPP, and HDP 2.1 (Governance & Integration, Security, Operations, Data Access, Data Management)
      • SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
      • Also shown: HANA, BusinessObjects BI
  12. 12. Page 12 Developing Apps on Hadoop
      • Spring XD Framework
        – Consistent configuration & Java API across wide range of Hadoop ecosystem projects
      • Microsoft .NET SDK For Hadoop
        – API access to HDP on Windows and HDInsight service
        – LINQ libraries for accessing Hive
      • Cascading
        – Delivers an easy to use abstraction layer for developing Hadoop applications
        – Supports development in Scala & Clojure
        – Hortonworks to Certify, Support & Deliver Cascading SDK with Hortonworks Data Platform
  14. 14. HORTONWORKS PARTNERS WITH CONCURRENT
      • The Cascading SDK will now be integrated with the Hortonworks Data Platform (HDP)
      • Hortonworks will certify and support the Cascading™ SDK with HDP
      • Cascading will support Apache Tez; companies using Cascading or domain-specific languages on Cascading can seamlessly migrate to HDP supporting Apache Tez
      The partnership benefits users by combining the power and simplicity of Cascading with the reliability and stability of HDP.
  15. 15. AGENDA
      • Who is Concurrent
      • What is Cascading
      • Where is it used
      • What problems does Cascading solve
      • What is included in the Cascading kit
  16. 16. ABOUT CONCURRENT, INC.
  17. 17. GET TO KNOW CONCURRENT
      Leader in Application Infrastructure for Big Data
      • Building enterprise software to simplify Big Data application development and management
      Products and Technology
      • CASCADING: The most widely used application infrastructure for building Big Data applications, with over 150,000 downloads each month
      • DRIVEN: Enterprise Data Application management for Big Data apps
      Proven: Simple, Reliable, Robust
      • Thousands of enterprises rely on Concurrent to provide their data application infrastructure.
      Founded: 2008. HQ: San Francisco, CA. CEO: Gary Nakamura. CTO, Founder: Chris Wensel.
  18. 18. PRODUCTS AND TECHNOLOGY
      • Open Source: Big Data Application Development. Simple, Reliable, Repeatable
      • Commercial: Unmatched Application Insight. Visibility into your Data Applications
      • Open Source Community: Focused on Data App Development; project home of Cascading; collection of sub-projects / tools
      • Data App Management: Realtime monitoring, Performance Management, Operational Control, Data Provenance, Compliance, Governance
  19. 19. BUSINESSES DEPEND ON US
      • Cascading Java API
      • Data normalization and cleansing of search and click-through logs for use by analytics tools, Hive analysts
      • Easy to operationalize heavy lifting of data
  20. 20. BUSINESSES DEPEND ON US
      • Cascalog (Clojure)
      • Weather pattern modeling to protect growers against loss
      • ETL against 20+ datasets daily
      • Machine learning to create models
      • Purchased by Monsanto for $930M US
  21. 21. BUSINESSES DEPEND ON US
      • Scalding (Scala)
      • Machine learning (linear algebra) to improve
        – User experience
        – Ad quality (matching users and ad effectiveness)
      • All revenue applications are running on Cascading/Scalding
      • IPO (Twitter)
  22. 22. BUSINESSES DEPEND ON US
      • Estimate suicide risk from what people write online
      • Cascading + Cassandra
      • You can do more than optimize ad yields
  24. 24. DRIVING ADVANTAGE WITH DATA APPLICATIONS
      • Enterprise IT: Extract Transform Load; Log File Analysis; Systems Integration; Operations Analysis
      • Corporate Apps: HR Analytics; Employee Behavioral Analysis; Customer Support | eCRM; Business Reporting
      • Telecom: Data processing of Open Data; Geospatial Indexing; Consumer Mobile Apps; Location based services
      • Marketing / Retail: Mobile, Social, Search Analytics; Funnel analysis; Revenue attribution; Customer experiments; Ad Optimization; Retail recommenders
      • Consumer / Entertainment: Music Recommendation; Comparison Shopping; Restaurant Rankings; Real Estate; Rental Listings; Travel Search & Forecast
      • Finance: Fraud and Anomaly Detection; Fraud Experiments; Customer Analytics; Insurance Risk Metric
      • Health / Biotech: Aggregate metrics for Govt; Person biometrics; Veterinary diagnostics; Next-Gen Genomics; Agronomics; Environmental Maps
  25. 25. BIG DATA: THE NEXT PHASE OF MATURITY
      “It’s all about the Apps”
      There needs to be a comprehensive solution for building, deploying, running and managing this new class of enterprise applications.
      • Business Strategy: Loyalty and promotions analysis; Retention campaigns; Marketing campaign optimization; Fraud detection; Risk management; Scientific research; Remote monitoring and diagnosis; and more
      • Data & Technology: Your Data & Systems (Hadoop, EDW, Mainframe, System Logs, NoSQL DBs, etc.)
      • Challenges: Leveraging existing skill sets, existing systems, past investments and existing business processes
      • Connecting Business and Data
  26. 26. PRODUCTS OVERVIEW
  27. 27. CASCADING
      • Java API (alternative to Hadoop MapReduce)
      • Separates business logic from integration
      • Testable at every lifecycle stage
      • Works with any JVM language
      • Many integration adapters
      [Diagram labels: Process Planner, Processing API, Integration API, Scheduler API, Scheduler, Apache Hadoop, Cascading, Data Stores, Scripting (Scala, Clojure, JRuby, Jython, Groovy), Enterprise Java]
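The idea of separating business logic from integration, and testing it at any stage, can be illustrated without a cluster. The following is a minimal plain-Java sketch, not Cascading API: the class `SeparationSketch`, its `run` method, and the `NORMALIZE` function are hypothetical names chosen for illustration, with a `Supplier` and a `Consumer` standing in for a source and a sink.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical illustration only; none of these names come from Cascading.
class SeparationSketch {
    // Business logic: a pure function that knows nothing about where records live.
    static final Function<String, String> NORMALIZE =
        s -> s.trim().toLowerCase(Locale.ROOT);

    // Integration: the caller decides which source and sink to plug in.
    static void run(Supplier<List<String>> source,
                    Function<String, String> logic,
                    Consumer<String> sink) {
        for (String record : source.get()) {
            sink.accept(logic.apply(record));
        }
    }
}
```

Because the logic never touches the storage layer, a unit test can pass an in-memory list as the source and collect results into another list, then swap in real storage later without changing `NORMALIZE`.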
  30. 30. SOME COMMON PATTERNS
      • Functions
      • Filters
      • Joins
        ‣ Inner / Outer / Mixed
        ‣ Asymmetrical / Symmetrical
      • Merge (Union)
      • Grouping
        ‣ Secondary Sorting
        ‣ Unique (Distinct)
      • Aggregations
        ‣ Count, Average, etc.
        ‣ Rolling windows
      [Diagrams: a pipeline of filters and functions between data sets; a topology with split, join, and merge]
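To make a few of these patterns concrete, here is a small sketch using plain Java streams rather than Cascading pipes. It is only an analogy: the class `PatternsSketch`, the method `averageBytesByUser`, and the `user,bytes` record format are assumptions made for this example. It chains a merge, a filter, a function, a grouping, and an aggregation in that order.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Plain-Java analogy for the patterns listed above; not Cascading API.
class PatternsSketch {
    // Input records are assumed to look like "user,bytes".
    static Map<String, Double> averageBytesByUser(List<String> logA, List<String> logB) {
        return Stream.concat(logA.stream(), logB.stream())   // merge (union)
            .filter(line -> line.matches("[^,]+,\\d+"))      // filter: drop malformed lines
            .map(line -> line.split(","))                    // function: parse into fields
            .collect(Collectors.groupingBy(                  // grouping by user
                parts -> parts[0],
                TreeMap::new,
                Collectors.averagingLong(parts -> Long.parseLong(parts[1])))); // aggregation
    }
}
```

In Cascading itself these steps would be expressed as pipe assemblies rather than stream operations; the sketch only shows how the patterns compose.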
  31. 31. WORD COUNT EXAMPLE

          import java.util.Properties;
          import cascading.flow.Flow;
          import cascading.flow.FlowDef;
          import cascading.flow.hadoop.HadoopFlowConnector;
          import cascading.operation.aggregator.Count;
          import cascading.operation.regex.RegexSplitGenerator;
          import cascading.pipe.Each;
          import cascading.pipe.Every;
          import cascading.pipe.GroupBy;
          import cascading.pipe.Pipe;
          import cascading.property.AppProps;
          import cascading.scheme.hadoop.TextDelimited;
          import cascading.tap.Tap;
          import cascading.tap.hadoop.Hfs;
          import cascading.tuple.Fields;

          public class Main {
            public static void main( String[] args ) {
              // configuration
              String docPath = args[ 0 ];
              String wcPath = args[ 1 ];
              Properties properties = new Properties();
              AppProps.setApplicationJarClass( properties, Main.class );
              HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

              // integration: create source and sink taps (tab-delimited, with header)
              Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
              Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

              // processing: specify a regex to split "document" text lines into token stream
              Fields token = new Fields( "token" );
              Fields text = new Fields( "text" );
              RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
              // only returns "token"
              Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
              // determine the word counts
              Pipe wcPipe = new Pipe( "wc", docPipe );
              wcPipe = new GroupBy( wcPipe, token );
              wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

              // scheduling: connect the taps, pipes, etc., into a flow definition
              FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
                .addSource( docPipe, docTap )
                .addTailSink( wcPipe, wcTap );
              // create the Flow
              Flow wcFlow = flowConnector.connect( flowDef ); // <<-- Unit of Work
              wcFlow.complete(); // <<-- Runs jobs on Cluster
            }
          }
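For readers without a cluster at hand, the split, group, and count steps of the word count can be approximated in plain Java. This is a sketch of the semantics only, not Cascading code: `WordCountSketch` and its `count` method are hypothetical names, and skipping empty tokens is a simplification made here, not a statement about what `RegexSplitGenerator` emits.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the word-count semantics; not Cascading API.
class WordCountSketch {
    // Delimiter character class: space, brackets, parentheses, comma, period.
    static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

    // Splits each line into tokens, groups equal tokens, and counts each group.
    static Map<String, Long> count(Iterable<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : line.split(DELIMITERS)) {
                if (token.isEmpty()) {
                    continue; // simplification: ignore empty tokens between adjacent delimiters
                }
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("rain (rain), shadow", "rain shadow.")));
    }
}
```

The loop over tokens plays the role of the `Each` pipe, and the `merge` call combines what `GroupBy` and `Every` with `Count` do in the flow.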
  32. 32. CASCADING OVERVIEW
      Proven application development framework for building Data applications. The framework addresses:
      • Build Data Apps that are scale-free: Design principles ensure best practices at any scale
      • Test-Driven Development: Efficiently test code and process local files before you deploy on a cluster
      • Staffing Bottleneck: Use existing Java, SQL, and modeling skill sets
      • Operational Complexity: Simple. Package up into one jar and hand to operations
      • Application Portability: Write once, then run on different computation fabrics
      • Systems Integration: Hadoop never lives alone. Easily integrate with your existing systems
  33. 33. OPERATIONAL READINESS: DISCIPLINE & ABILITY TO MEASURE
      • Visibility into app development
      • Business SLA
      • Balance & Controls
      • Application testing
      • Data quality
      • Process to “productionalize” apps
      • High fidelity execution analysis
      • Real-time monitoring
      • …
  34. 34. PRODUCTS AND TECHNOLOGY
      • LINGUAL: Simplifying Systems Integration
      • PATTERN: Enabling Machine Scoring Algorithms
      • Open Source: Big Data Application Development. Simple, Reliable, Repeatable
      • Commercial: Unmatched Application Insight. Visibility into your Data Applications
  35. 35. CASCADING ECOSYSTEM IS MORE THAN CASCADING FRAMEWORK
      Lingual, Pattern and other Dynamic Programming Languages such as Scalding are part of the Cascading Ecosystem and are included as part of the Cascading kit
  36. 36. LINGUAL
      • Lingual is an extension to Cascading that executes ANSI SQL queries as Cascading apps
      • Supports integrating with any data source that can be accessed through JDBC; a Cascading Tap can be created for any source supporting JDBC
      • Great for migration of data and for integrating with non-Big Data assets; extends the life of existing IT assets in an organization
      [Diagram labels: Query Planner, JDBC API, Lingual API, Provider API, Cascading, Apache Hadoop, Lingual, Data Stores, CLI / Shell, Enterprise Java, Catalog]
  37. 37. SCALDING
      • Scalding is a language binding to Cascading for Scala
        - The name Scalding comes from combining SCALa and cascaDING
      • Scalding is great for Scala developers; it can crisply express constructs for matrix math…
      • Scalding has very large commercial deployments at:
        - Twitter: use cases such as the revenue quality team, ad targeting and traffic quality
        - eBay: use cases include search analytics and other production data pipelines
  38. 38. DRIVEN OVERVIEW
      What is Driven? The first application performance management product for Big Data applications
      Capabilities
      • Visualize your Data App: No more black box! Instantly visualize your running app in real-time
      • Diagnose App Failures: Identify where and how your app failed… all without sorting through logs!
      • Track App Performance: For all your apps, view and compare history of your app’s runtime performance
      • Insight into your Applications: At any moment, quickly understand what your app is doing on your cluster
      Benefits
      • Accelerate Time to Market
      • Build Reliable Applications
      • Optimize Application Performance
      Key Features
      • Application visualization
      • Dashboard performance view
      • Application performance history
      • Insights for each application (workflow, telemetry, error types)
      • Team collaboration and management
      Works with: LINGUAL, PATTERN, SCALDING, CASCALOG
  39. 39. Driven is free for developer use (cloud)
  40. 40. CASCADING AVAILABILITY
      Cascading, Lingual and Pattern are open source projects freely available to the general public under Apache License 2.0
      • Cascading 2.5: Available Now. License: Apache License 2.0. Support: Community Forums & Mailing List, Enterprise Support
      • Lingual 1.1: Available Now. License: Apache License 2.0. Support: Community Forums & Mailing List, Enterprise Support
      • Pattern 1.0-WIP: WIP Available Now. License: Apache License 2.0. Support: Community Forums & Mailing List, Enterprise Support
  41. 41. Summary
      • APM for Big Data | The first application performance management product for Big Data applications
        – Accelerate Time to Market: Process visualization and monitoring capabilities in a rich UI
        – Build Reliable Apps: Detailed insight into data processing logic and algorithms
        – Optimize App Performance: Key application behavior metrics with historical data to trend performance
      • For Developers and Operators | Significantly improves developer productivity and operations control by providing an unprecedented level of insight into building and managing enterprise-grade data applications
      • Collaboration | Facilitates and encourages user collaboration to build enterprise data applications
      • Community Integration | Driven is a free cloud service integrated with the Cascading open source community
      • Licensing | Driven is free for development (cloud only) and licensable for production or on-premise deployments
      • Deployment Options | Deploy in the cloud or on-premise
  42. 42. GET STARTED WITH CASCADING ON HDP 2.1
      1. Download HDP 2.1
      2. Take Cascading for a spin by running the Impatient tutorial at impatient/
  43. 43. CONTACT INFORMATION Supreet Oberoi 650-868-7675 (m) @supreet_online
  44. 44. DRIVING INNOVATION THROUGH DATA. THANK YOU. Supreet Oberoi | April 18, 2014
  45. 45. Page 13 The Largest Hadoop Community Events in Europe and North America
      AMSTERDAM April 2-3; SAN JOSE June 3-5
      • 6 tracks, 3 days, and 120+ sessions to choose from
      • Community Focused: Sessions voted on by the public and selected by a committee of industry luminaries
      • Deep Dive Technical Content: Including a Committer track with content presented by Apache committers
      • Business and Technical Topics
      • Community Activities: Hadoop Summit will host community meet-ups and birds of a feather sessions
  46. 46. Page 14 Questions? Use the Q/A panel to ask your questions
      Download the Hortonworks Sandbox and Cascading
      • Cascading and HDP 2.1 Sandbox
      • Hortonworks Sandbox
      • Cascading Impatient Tutorial