Introduction to Designing and Building Big Data Applications - Cloudera, Inc.
Learn what the course covers, from capturing data to building a search interface; the spectrum of processing engines, Apache projects, and ecosystem tools available for converged analytics; who is best suited to attend the course and what prior knowledge you should have; and the benefits of building applications with an enterprise data hub.
Hortonworks Oracle Big Data Integration - Hortonworks
Slides from joint Hortonworks and Oracle webinar on November 11, 2014. Covers the Modern Data Architecture with Apache Hadoop and Oracle Data Integration products.
The document discusses the OpenPOWER Foundation and its collaboration with the HPC Advisory Council. Some key points:
- OpenPOWER is pleased to announce its membership in HPCAC to further cross-community collaboration opportunities in HPC. OpenPOWER is contributing several POWER8 systems with NVIDIA GPUs to the HPCAC lab for benchmarking and demonstrations.
- OpenPOWER aims to fuel innovation through the collaboration of its partners. It has over 100 members from over 20 countries working on technologies across 24 work groups.
- One goal is to accelerate the technology roadmap by defining open interconnect standards and expanding the ecosystem of solutions that leverage the POWER architecture.
Data Lake for the Cloud: Extending your Hadoop Implementation - Hortonworks
As more applications are created using Apache Hadoop that derive value from the new types of data from sensors/machines, server logs, click-streams, and other sources, the enterprise "Data Lake" forms with Hadoop acting as a shared service. While these Data Lakes are important, a broader life-cycle needs to be considered that spans development, test, production, and archival and that is deployed across a hybrid cloud architecture.
If you have already deployed Hadoop on-premise, this session will also provide an overview of the key scenarios and benefits of joining your on-premise Hadoop implementation with the cloud, by doing backup/archive, dev/test or bursting. Learn how you can get the benefits of an on-premise Hadoop that can seamlessly scale with the power of the cloud.
Ironfan is the foundation for your Big Data stack, making provisioning and configuring your Big Data infrastructure simple. Spin up clusters when you need them, kill them when you don't, so you can spend your time, money, and engineering focus on finding insights, not getting your machines ready.
Learn more at http://infochimps.com
YARN Ready: Integrating to YARN with Tez - Hortonworks
YARN Ready webinar series helps developers integrate their applications to YARN. Tez is one vehicle to do that. We take a deep dive including code review to help you get started.
Hadoop Reporting and Analysis - Jaspersoft - Hortonworks
Hadoop is deployed for a variety of uses, including web analytics, fraud detection, security monitoring, healthcare, environmental analysis, social media monitoring, and other purposes.
IBM InfoSphere BigInsights for Hadoop: 10 Reasons to Love It - IBM Analytics
Originally Published on Oct 15, 2014
IBM InfoSphere BigInsights is an industry-standard Hadoop offering that combines the best of open-source software with enterprise-grade features.
- #1 InfoSphere BigInsights is 100% standard, open-source Hadoop
- #2 Big SQL - Lightning fast, ANSI-compliant, native Hadoop formats
- #3 BigSheets - Spreadsheet-like data access for business users
- #4 Big Text - Simplify text analytics and natural language
- #5 Adaptive MapReduce - Fully compatible, four times faster
- #6 In-Hadoop Analytics - Deploy the analytics to the data
- #7 HDFS and POSIX - a more capable enterprise file system
- #8 Big R - Deep R Language integration in Hadoop
- #9 IBM Watson Explorer - Search, explore and visualize all your data
- #10 Accelerators - Get to market faster leveraging pre-written code
To learn more about IBM InfoSphere BigInsights, download the free InfoSphere BigInsights QuickStart Edition from http://ibm.com/hadoop.
Evolving Hadoop into an Operational Platform with Data Applications - DataWorks Summit
The document discusses Cask Data Application Platform (CDAP), an open source platform for building data applications on Hadoop. It provides an overview of CDAP's key components including datasets, programs, and applications. Datasets are standardized containers that encapsulate data access patterns and data models through reusable APIs. Programs are containers for different processing paradigms like batch and real-time. Applications in CDAP compose multiple datasets and programs.
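The nesting of CDAP's three abstractions can be sketched roughly as follows; this is an illustrative sketch in Python, not the actual CDAP API (CDAP itself is Java-based, and all names here are invented for illustration):

```python
# Illustrative sketch only; not the CDAP API. Shows how datasets (reusable
# data-access containers) and programs (processing containers) compose
# into an application.
class Dataset:
    """Encapsulates a data model behind a reusable access API."""
    def __init__(self, name):
        self.name, self.rows = name, []

    def write(self, row):
        self.rows.append(row)

    def scan(self):
        return list(self.rows)


class Program:
    """A container for one processing paradigm (batch, real-time, ...)."""
    def __init__(self, name, run):
        self.name, self.run = name, run


class Application:
    """Composes multiple datasets and programs."""
    def __init__(self, datasets, programs):
        self.datasets, self.programs = datasets, programs


events = Dataset("events")
ingest = Program("ingest", lambda ds: ds.write({"id": 1}))
app = Application(datasets=[events], programs=[ingest])

app.programs[0].run(events)
print(events.scan())  # → [{'id': 1}]
```

The point of the sketch is the separation of concerns: data access patterns live in the dataset, processing lives in the program, and the application only wires them together.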
How to Automate Offloading ETL Processes to Hadoop - Driven Inc.
Offloading ETL processes to Hadoop is often one of the first Big Data efforts because of the obvious ROI benefits. However, you have hundreds, maybe thousands, of legacy ETL processes to migrate, which makes achieving the benefits of Hadoop and ROI a distant goal.
What if you could automatically convert up to 70% of your existing ETL processes to run on Hadoop with no code changes?
In this presentation you will see:
- A detailed walk-through of migrating existing ETL processes to Hadoop without changing anything
- How you can cut development time of new ETL processes on Hadoop by up to 50%
- How you can leverage your existing developers’ Java skills to turn them into Hadoop developers
- Best practices for monitoring the performance of your ETL processes to ensure you meet your service level agreements
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo... - DataWorks Summit
The document discusses re-platforming existing enterprise business intelligence and analytic workloads from platforms like Oracle, Teradata, SAP and IBM to the Hadoop platform. It notes that many existing analytic workloads are struggling with increasing data volumes and are too costly. Hadoop offers a modern distributed platform that can address these issues through the use of a production-grade SQL database like VectorH on Hadoop. The document provides guidelines for re-platforming workloads and notes potential benefits such as improved performance, reduced costs and leveraging the Hadoop ecosystem.
Building a Big Data platform with the Hadoop ecosystem - Gregg Barrett
This presentation provides a brief insight into a Big Data platform using the Hadoop ecosystem.
To this end the presentation will touch on:
-views of the Big Data ecosystem and its components
-an example of a Hadoop cluster
-considerations when selecting a Hadoop distribution
-some of the Hadoop distributions available
-a recommended Hadoop distribution
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess... - Cynthia Saracco
Got Big Data? Then check out what Big SQL can do for you . . . . Learn how IBM's industry-standard SQL interface enables you to leverage your existing SQL skills to query, analyze, and manipulate data managed in an Apache Hadoop environment on cloud or on premise. This quick technical tour is filled with practical examples designed to get you started working with Big SQL in no time. Specifically, you'll learn how to create Big SQL tables over Hadoop data in HDFS, Hive, or HBase; populate Big SQL tables with data from HDFS, a remote file system, or a remote RDBMS; execute simple and complex Big SQL queries; work with non-traditional data formats and more. These charts are for session ALB-3663 at the IBM World of Watson 2016 conference.
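The create/populate/query workflow described above follows standard SQL patterns. As a minimal sketch, Python's built-in sqlite3 can stand in for a Big SQL connection (against Big SQL itself you would connect over JDBC/ODBC and define tables on Hadoop data; the table name and rows here are invented for illustration):

```python
import sqlite3

# Illustrative sketch only: sqlite3 stands in for a Big SQL connection.
# Big SQL defines tables over data in HDFS, Hive, or HBase, but the query
# surface is the same industry-standard SQL shown here.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 1. Create a table over your data
cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")

# 2. Populate it
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [("east", 100.0), ("west", 250.0), ("east", 50.0)])

# 3. Run simple or complex queries against it
result = cur.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(result)  # → [('east', 150.0), ('west', 250.0)]
```

The appeal the session describes is exactly this: existing SQL skills carry over unchanged, while the storage layer underneath is Hadoop rather than a relational engine.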
Webinar - Accelerating Hadoop Success with Rapid Data Integration for the Mod... - Hortonworks
Many enterprises are turning to Apache Hadoop to enable Big Data Analytics and reduce the costs of traditional data warehousing. Yet, it is hard to succeed when 80% of the time is spent on moving data and only 20% on using it. It’s time to swap the 80/20! The Big Data experts at Attunity and Hortonworks have a solution for accelerating data movement into and out of Hadoop that enables faster time-to-value for Big Data projects and a more complete and trusted view of your business. Join us to learn how this solution can work for you.
How to Automate your Enterprise Application / ERP Testing - RTTS
This document discusses automating enterprise application and data warehouse testing using QuerySurge. It begins with an introduction to QuerySurge and its modules for automating data interface testing. These modules allow testing across different data sources with no coding required. The document then covers data maturity models and how QuerySurge can help improve testing processes. It demonstrates how QuerySurge can automate testing to gain full coverage while decreasing testing time. In conclusion, it discusses how QuerySurge provides value through increased testing efficiency and data quality.
Are you excited to learn Big Data technologies? Do you find that the free material available on the internet is too complicated for a newcomer?
Many things can go wrong when learning a new technology on your own; free internet material can be a can of worms for a beginner, and training is advised for a jumpstart.
The Open-BDA Big Data Hadoop Developer Training, to be held on 11th & 12th May 2015 at the Marriott Hotel Karachi, will cover everything you need to know to start a career in Hadoop technology and build your expertise to a level where you can take certification exams from MapR, Cloudera, and Hortonworks with confidence. You can start as a beginner, and this course will help you become a certified professional.
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p... - Cynthia Saracco
This document provides an overview of IBM's BigInsights product for analyzing big data. It discusses how BigInsights uses the open source Apache Hadoop and Spark platforms as its core with additional IBM technologies and features added on. BigInsights allows users to analyze both structured and unstructured data at large volumes and in real-time. It also integrates with other IBM analytics and data management products to provide a full big data analytics solution.
Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend - OW2
ETL is the process of extracting data from one location, transforming it, and loading it into a different location, often for the purposes of collection and analysis. As Hadoop becomes a common technology for sophisticated analysis and transformation of petabytes of structured and unstructured data, the task of moving data in and out efficiently becomes more important and writing transformation jobs becomes more complicated. Talend provides a way to build and automate complex ETL jobs for migration, synchronization, or warehousing tasks. Using Talend's Hadoop capabilities allows users to easily move data between Hadoop and a number of external data locations using over 450 connectors. Also, Talend can simplify the creation of MapReduce transformations by offering a graphical interface to Hive, Pig, and HDFS. In this talk, Cédric Carbone will discuss how to use Talend to move large amounts of data in and out of Hadoop and easily perform transformation tasks in a scalable way.
More and more organizations are moving their ETL workloads to a Hadoop-based ELT grid architecture. Hadoop's inherent capabilities, especially its ability to do late binding, address some of the key challenges with traditional ETL platforms. In this presentation, attendees will learn the key factors, considerations, and lessons around ETL for Hadoop, including the pros and cons of different extract and load strategies, the best ways to batch data, buffering and compression considerations, leveraging HCatalog, data transformation, integration with existing data transformations, the advantages of different ways of exchanging data, and leveraging Hadoop as a data integration layer. This is an extremely popular presentation around ETL and Hadoop.
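Stripped of the tooling, the extract/transform/load cycle these talks describe reduces to three steps; a toy sketch in plain Python (the record layout and tax rule are invented for illustration; on a Hadoop grid the same steps run as distributed jobs over HDFS files):

```python
import csv
import io

# Toy ETL: extract CSV records, transform them (filter invalid rows and
# derive a field), and load the results into a target store.
raw = io.StringIO("id,amount\n1,100\n2,-5\n3,40\n")


def extract(src):
    """Extract: parse the source into records."""
    return list(csv.DictReader(src))


def transform(rows):
    """Transform: drop non-positive amounts, derive a tax column."""
    return [{"id": r["id"], "amount": float(r["amount"]),
             "tax": round(float(r["amount"]) * 0.2, 2)}
            for r in rows if float(r["amount"]) > 0]


def load(rows, target):
    """Load: append records to the target store."""
    target.extend(rows)


warehouse = []
load(transform(extract(raw)), warehouse)
print(warehouse)
```

The ELT variant the presentation advocates simply reorders the last two steps: land the raw records first, then let the grid transform them in place with late binding.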
Data Discovery, Visualization, and Apache Hadoop - Hortonworks
In this webinar, we will discuss how Apache Hadoop works with your current infrastructure and how you can use data discovery and visualization tools to gain deeper insights from new data types stored in Hadoop and your existing data center investments.
Boost Performance with Scala – Learn From Those Who’ve Done It! Cécile Poyet
Scalding is a Scala DSL for Cascading. Run on Hadoop, it's a concise, functional, and very efficient way to build big data applications. One significant benefit of Scalding is that it allows easy porting of Scalding apps from MapReduce to newer, faster execution fabrics.
In this webinar, Cyrille Chépélov, of Transparency Rights Management, will share how his organization boosted the performance of their Scalding apps by over 50% by moving away from MapReduce to Cascading 3.0 on Apache Tez. Dhruv Kumar, Hortonworks Partner Solution Engineer, will then explain how you can interact with data on HDP using Scala and leverage Scala as a programming language to develop Big Data applications.
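Scalding pipelines are essentially flatMap/groupBy chains over records. The shape of its classic word-count example can be sketched in plain Python (this is an in-memory illustration, not Scalding itself; Scalding compiles the equivalent pipeline to a Cascading flow that runs on MapReduce or, as the webinar describes, on Tez):

```python
from collections import Counter

# The flatMap (line -> words) plus groupBy (word -> count) pattern that
# Scalding's word count expresses, run in memory on a small sample.
lines = ["to be or not to be", "to do is to be"]

words = (w for line in lines for w in line.split())   # flatMap step
counts = Counter(words)                               # groupBy + size step

print(counts["to"])  # → 4
```

Because the pipeline is declared rather than hand-wired to an engine, swapping the execution fabric underneath (MapReduce to Tez) leaves the application code untouched, which is where the reported performance gains come from.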
Oracle Solaris Build and Run Applications Better on 11.3 - OTN Systems Hub
Build and Run Applications Better on Oracle Solaris 11.3
Tech Day, NYC
Liane Praza, Senior Principal Software Engineer
Ikroop Dhillon, Principal Product Manager
June, 2016
Slides from the joint webinar. Learn how Pivotal HAWQ, one of the world's most advanced enterprise SQL-on-Hadoop technologies, coupled with the Hortonworks Data Platform, the only 100% open source Apache Hadoop data platform, can turbocharge your Data Science efforts.
Together, Pivotal HAWQ and the Hortonworks Data Platform provide businesses with a Modern Data Architecture for IT transformation.
How Big Data and Hadoop Integrated into BMC Control-M at CARFAX - BMC Software
Learn how CARFAX utilized the power of Control-M to help drive big data processing via Cloudera. See why it was a no-brainer to choose Control-M to help manage workflows through Hadoop, some of the challenges faced, and the benefits the business received by using an existing, enterprise-wide workload management system instead of choosing “yet another tool.”
Reducing Development Time for Production-Grade Hadoop Applications - Cascading
Ryan Desmond's Presentation at the Cascading Meetup on August 27, 2015. Brief overview of Cascading to help give a basic understanding to Clojure users that might use PigPen & Clojure to access Cascading.
Elasticsearch + Cascading for Scalable Log Processing - Cascading
Supreet Oberoi's presentation on "Large scale log processing with Cascading & Elastic Search". Elasticsearch is becoming a popular platform for log analysis with its ELK stack: Elasticsearch for search, Logstash for centralized logging, and Kibana for visualization. Complemented with Cascading, the application development platform for building Data applications on Apache Hadoop, developers can correlate at scale multiple log and data streams to perform rich and complex log processing before making it available to the ELK stack.
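The core step in this kind of pipeline, parsing raw log lines into structured records before indexing, can be sketched in plain Python (the log format and fields here are invented for illustration; Cascading runs the equivalent correlation at scale before handing records to the ELK stack):

```python
import re
from collections import Counter

# Toy log-processing step: parse unstructured lines into structured
# records, then aggregate by level; the shape of work done before
# records are indexed into Elasticsearch.
LOG_RE = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<msg>.*)$")

lines = [
    "2015-08-27T10:00:01 INFO user login ok",
    "2015-08-27T10:00:02 ERROR db timeout",
    "2015-08-27T10:00:03 INFO cache hit",
]

records = [m.groupdict() for m in map(LOG_RE.match, lines) if m]
by_level = Counter(r["level"] for r in records)
print(by_level["INFO"], by_level["ERROR"])  # → 2 1
```

Once lines are structured records with timestamps, correlating multiple streams becomes a join on those fields, which is exactly the kind of rich processing the talk describes doing in Cascading rather than at query time in Kibana.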
Webinar: Comparing DataStax Enterprise with Open Source Apache Cassandra - DataStax
Apache Cassandra is the open source database technology that pioneered distributed data at scale. DataStax Enterprise, powered by the best distribution of Apache Cassandra, gives you up to 2x better compaction throughput, 3x better operational analytics performance, ease-of-use, and a secure, comprehensive multi-model data platform including search and operational analytics integrated with Cassandra to help you take on whatever challenges you might face along the way.
This document discusses Microsoft Azure and its capabilities. It highlights that Azure has over 100 datacenters globally, with 19 regions currently online. It also notes that Azure has one of the top 3 networks in the world and offers larger VM sizes than AWS or Google Cloud. The document then summarizes some of Azure's core capabilities like compute, storage, databases, analytics and more. It provides examples of how customers can use Azure's tools and services.
Developing Enterprise Consciousness: Building Modern Open Data Platforms - ScyllaDB
ScyllaDB, alongside some of the other major distributed real-time technologies, gives businesses a unique opportunity to achieve enterprise consciousness: a business platform that delivers data to the people who need it, when they need it, anytime, anywhere.
This talk covers how modern tools in the open data platform can help companies synchronize data across their applications using open source tools and technologies and more modern low-code ETL/ReverseETL tools.
Topics:
- Business Platform Challenges
- What Enterprise Consciousness Solves
- How ScyllaDB Empowers Enterprise Consciousness
- What can ScyllaDB do for Big Companies
- What can ScyllaDB do for smaller companies
OPEN'17_4_Postgres: The Centerpiece for Modernising IT Infrastructures - Kangaroot
Postgres is the leading open source database management system that is being developed by a very active community for more than 15 years. Gaby Schilders is Sales Engineer at EnterpriseDB, supplier of the EDB Postgres data platform.
Gaby Schilders, Sales Engineer at EnterpriseDB, will be explaining why companies take open source as the centerpiece for modernising their IT infrastructure, increasing their scalability and taking full advantage of what today's technologies offer them.
The document discusses the development of an internal data pipeline platform at Indix to democratize access to data. It describes the scale of data at Indix, including over 2.1 billion product URLs and 8 TB of HTML data crawled daily. Previously, the data was not discoverable, schemas changed and were hard to track, and using code limited who could access the data. The goals of the new platform were to enable easy discovery of data, transparent schemas, minimal coding needs, UI-based workflows for anyone to use, and optimized costs. The platform developed was called MDA (Marketplace of Datasets and Algorithms) and enabled SQL-based workflows using Spark. It has continued improving since its first release in 2016.
The document discusses containers and Docker Enterprise Edition (EE). It notes that by 2020, over 50% of organizations will be running containers in production. Containers simplify infrastructure by allowing applications to run on any infrastructure. Docker EE provides additional capabilities for enterprises like security features, automation, and support that are required beyond the open source Docker Engine. It highlights customer examples where Docker EE helped accelerate projects, increase scalability, and migrate applications to the cloud. The document promotes Docker services to help customers develop a containerization strategy and achieve benefits like cost savings, agility, and productivity gains.
The document discusses various options for modernizing applications, including rehosting, refactoring, rearchitecting, and rebuilding apps. Rehosting involves moving apps to cloud infrastructure with minimal changes. Refactoring leverages existing code while taking advantage of cloud capabilities. Rearchitecting involves major code revisions for cloud-native apps and microservices. Rebuilding apps is building new apps using cloud-native platforms from the ground up. The document provides benefits, definitions, considerations, and technologies for each option to help determine the best modernization approach.
Accelerate Big Data Application Development with Cascading and HDP, Hortonwor...Hortonworks
Accelerate Big Data Application Development with Cascading and HDP, webinar hosted by Hortonworks and Concurrent. Visit Hortonworks.com/webinars to access the recording.
Whither the Hadoop Developer Experience, June Hadoop Meetup, Nitin MotgiFelicia Haggarty
The document discusses challenges with building operational data applications on Hadoop and introduces the Cask Data Application Platform (CDAP) as a solution. It provides an agenda that covers data applications, challenges, CDAP motivation and goals, use cases, and an introduction and architecture overview of CDAP. The document aims to demonstrate how CDAP provides a unified platform that simplifies application development and lifecycle while supporting reusable data and processing patterns.
Cask provides the Cask Data Application Platform (CDAP) which provides an integrated platform for developers and organizations to build, deploy, and manage big data applications. CDAP hides the complexity of Hadoop, provides reusable components, and integrates with Cloudera's data platform. It allows both technical and non-technical users to easily develop applications for ingesting, processing, and analyzing large amounts of data. The document discusses CDAP capabilities and provides an example of how a marketing SaaS company used it to build a real-time customer analytics application.
7 Best Practices for Achieving Operational Readiness on Hadoop with Driven an...Cascading
This video dives into 7 best practices for how IT organizations can achieve true operational readiness on Hadoop using Driven and Cascading.
Intended for any person, organization, or enterprise currently involved in planning, deploying, or managing a Hadoop infrastructure: development teams, IT Ops, and executive management.
Key Takeaways:
- Connecting execution problems with application context
- Defining and enforcing SLAs
- Understanding inter-app dependencies
- Rationing your cluster
- Tracing data access at the operational level
- Building culture and tools supporting collaboration between developers, operators, & other Hadoop team members
DataStax on Azure: Deploying an industry-leading data platform for cloud apps...DataStax
Learn how DataStax Enterprise (DSE) on Microsoft Azure delivers experiences to cloud applications beyond customer expectations. Powered by the industry’s best version of Apache Cassandra™ and leveraging the global scale, hybrid deployment capabilities, and ease of integration of Azure, DSE is the always-on data platform that allows you to focus on what matters most to you by ensuring your applications scale reliably and effortlessly while delivering actionable insight in real-time.
View recording: https://youtu.be/kLEkqTH_2Bc
Explore all DataStax webinars: http://www.datastax.com/resources/webinars
This document provides an overview of big data fundamentals and considerations for setting up a big data practice. It discusses key big data concepts like the four V's of big data. It also outlines common big data questions around business context, architecture, skills, and presents sample reference architectures. The document recommends starting a big data practice by identifying use cases, gaining management commitment, and setting up a center of excellence. It provides an example use case of retail web log analysis and presents big data architecture patterns.
Transforming Business in a Digital Era with Big Data and MicrosoftPerficient, Inc.
The socially integrated world, the rise of mobile, the Internet of Things - this explosion of data can be directed and used, rather than simply managed. That's why Big Data and advanced analytics are key components of most digital transformation strategies.
In the last year, Microsoft has made key moves to extend its data platform into this realm. Stalwart platforms like SQL Server and Excel join up with new PaaS offerings to make up a dynamic and powerful Big Data/advanced analytics ecosystem.
In this webinar, our experts covered:
-Why you should include Big Data and advanced analytics in your digital transformation strategy
-Challenges facing digital transformation initiatives
-What options the Microsoft toolset offers for Big Data (Hadoop) and advanced analytics
-How to leverage products and services you already own for your digital transformation
The document discusses DevOps practices like continuous integration (CI) and continuous delivery/deployment (CD). It explains that DevOps aims to improve software development and operations by increasing automation, reducing deployment times, and enabling more frequent and safer software releases. CI principles include automating builds, testing, and deployments. CD builds on CI by further automating the software release process and reducing risks of major releases.
The intersection of Traditional IT and New-Generation ITKangaroot
Keynote from Franz Meyer - VP, EMEA Strategic Business Development Red Hat about "The intersection of Traditional IT and New-Generation IT : the Red Hat Open Hybrid Journey". This presentation was given during the Open Source Cloud Day of Kangaroot & Red Hat.
Overview of Cascading 3.0 on Apache Flink Cascading
Cascading is a Java API for building batch data applications on Hadoop. This document discusses executing Cascading programs on Apache Flink instead of Hadoop MapReduce. With Cascading on Flink, programs are translated to single Flink jobs instead of multiple MapReduce jobs. This improves performance by allowing pipelined execution without writing intermediate data to HDFS. For example, a TF-IDF program runs 3.5 hours faster on Flink than MapReduce. Cascading on Flink leverages Flink's efficient in-memory operators while requiring minimal code changes.
Predicting Hospital Readmission Using CascadingCascading
Michael Covert will examine how Healthcare Providers are finding ways to use Big Data analytics to reduce readmission rates and improve operational efficiency while complying with regulatory mandates.
This document summarizes the results of a survey of Cascading users. It finds that Cascading is most popular among those building and managing big data applications. Many users explored alternatives like Hive and Pig before adopting Cascading due to its scalability and portability across compute frameworks. The survey also shows that Cascading users value reliability and performance at scale and are interested in new frameworks like Spark.
Breathe new life into your data warehouse by offloading etl processes to hadoopCascading
This document discusses offloading ETL workloads from data warehouses to Hadoop. It provides an overview of Bitwise, an ISO-certified company that provides ETL and data quality services. It also describes Driven, a platform for building, running, and managing big data applications. Driven provides visibility into data pipelines, monitors application performance, and enables collaboration around operational issues. It stores metadata about application telemetry in a scalable and searchable manner to provide end-to-end operational visibility for Hadoop applications.
How To Get Hadoop App Intelligence with DrivenCascading
You built Cascading/Scalding apps to mine all that data you collected in Hadoop. But just when you were seeing results, something went wrong — the app broke, data flows stopped, and business came to a halt.
So what do you do next? How do you find out what went wrong in the shortest time possible? How do you pinpoint the line of code where the error occurred? How do you know which SLA is going to be impacted? How do you view the lineage of data to adhere to compliance requirements?
In this presentation, we show you how to easily find the answers with Driven, the most comprehensive Big Data App Performance Management Platform.
Furthermore, this presentation describes how Driven can help you build higher quality big data apps; run big data apps more reliably; and manage big data apps more effectively.
Who should view this PPT: Any person or organization that is currently involved in planning, deploying or managing a Hadoop application infrastructure.
The Cascading (big) data application framework - André Keple, Sr. Engineer, C...Cascading
André Kelpe's presentation at Hadoop User Group France - 25.11.2014.
Abstract: Cascading is widely deployed, production ready open source data application framework geared towards Java developers. Cascading enables developers to write complex data applications without the need to become a distributed systems expert. Cascading apps are portable between different computation frameworks, so that a given application can be moved from Hadoop onto new processing platforms like Apache Tez or Apache Spark without rewriting any of the application code.
Cascading - A Java Developer’s Companion to the Hadoop WorldCascading
Presentation by Dhruv Kumar, Sr. Field Engineer at Concurrent.
Amid all the hype and investment around Big Data technologies, many Java software engineers are asking what it takes to become big data engineers. As a Java professional, toward which path should I steer my career?
Join Dhruv Kumar as he introduces Cascading, an open source application development framework that allows Java developers to build applications on top of Hadoop through its Java API. We’ll provide an overview of the application development landscape for developing applications on Hadoop and explain why Cascading has become so popular, comparing it to other abstractions such as Pig and Hive. Dhruv will also show you how Java developers can easily get started building applications on Hadoop with live examples of good ‘ole Java code.
Introduction to Cascading by Bryce Lohr
Presentation on Cascading delivered at the Triad Hadoop Users Group. This presentation provides a brief introduction to Cascading, a Java library for developing scalable Map/Reduce applications on Hadoop.
Bryce Lohr is a software developer at Inmar, focused on developing data analysis applications using Hadoop and related technologies.
https://www.linkedin.com/pub/bryce-lohr/3/589/225
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyScyllaDB
Freshworks creates AI-boosted business software that helps employees work more efficiently and effectively. Managing data across multiple RDBMS and NoSQL databases was already a challenge at their current scale. To prepare for 10X growth, they knew it was time to rethink their database strategy. Learn how they architected a solution that would simplify scaling while keeping costs under control.
Taking AI to the Next Level in Manufacturing.pdfssuserfac0301
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
5. Ideas and approaches to help build your organization's AI strategy.
Main news related to the CCS TSI 2023 (2023/1695)Jakub Marek
An English 🇬🇧 translation of a presentation to the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on Communications and signalling systems on Railways, which was held in Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). Attended by around 500 participants and 200 on-line followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframePrecisely
Inconsistent user experience and siloed data, high costs, and changing customer expectations – Citizens Bank was experiencing these challenges while it was attempting to deliver a superior digital banking experience for its clients. Its core banking applications run on the mainframe and Citizens was using legacy utilities to get the critical mainframe data to feed customer-facing channels, like call centers, web, and mobile. Ultimately, this led to higher operating costs (MIPS), delayed response times, and longer time to market.
Ever-changing customer expectations demand more modern digital experiences, and the bank needed to find a solution that could provide real-time data to its customer channels with low latency and operating costs. Join this session to learn how Citizens is leveraging Precisely to replicate mainframe data to its customer channels and deliver on their “modern digital bank” experiences.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/temporal-event-neural-networks-a-more-efficient-alternative-to-the-transformer-a-presentation-from-brainchip/
Chris Jones, Director of Product Management at BrainChip , presents the “Temporal Event Neural Networks: A More Efficient Alternative to the Transformer” tutorial at the May 2024 Embedded Vision Summit.
The expansion of AI services necessitates enhanced computational capabilities on edge devices. Temporal Event Neural Networks (TENNs), developed by BrainChip, represent a novel and highly efficient state-space network. TENNs demonstrate exceptional proficiency in handling multi-dimensional streaming data, facilitating advancements in object detection, action recognition, speech enhancement and language model/sequence generation. Through the utilization of polynomial-based continuous convolutions, TENNs streamline models, expedite training processes and significantly diminish memory requirements, achieving notable reductions of up to 50x in parameters and 5,000x in energy consumption compared to prevailing methodologies like transformers.
Integration with BrainChip’s Akida neuromorphic hardware IP further enhances TENNs’ capabilities, enabling the realization of highly capable, portable and passively cooled edge devices. This presentation delves into the technical innovations underlying TENNs, presents real-world benchmarks, and elucidates how this cutting-edge approach is positioned to revolutionize edge AI across diverse applications.
Generating privacy-protected synthetic data using Secludy and MilvusZilliz
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Fueling AI with Great Data with Airbyte WebinarZilliz
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to production.
HCL Notes and Domino License Cost Reduction in the World of DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and licensing under the CCB and CCX models have been a hot topic for many in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and license fees. You may be wondering how this new kind of licensing works and what benefits it brings you. Above all, you surely want to stay within your budget and save costs wherever possible. We understand that, and we want to help!
We explain how to resolve common configuration problems that can lead to more users being counted than necessary, and how to identify and remove superfluous or unused accounts to save money. There are also some practices that can cause unnecessary expenses, for example using a person document instead of a mail-in database for shared mailboxes. We show you such cases and their solutions. And of course we explain the new license model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder introduce you to this new world. It gives you the tools and the know-how to keep an overview. You will be able to reduce your costs through an optimized Domino configuration and keep them low in the future.
Topics covered:
- Reducing license costs by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how best to use it
- Tips for common problem areas, such as team mailboxes and functional/test users
- Real-world examples and best practices you can apply immediately
Trusted Execution Environment for Decentralized Process MiningLucaBarbaro3
Presentation of the paper "Trusted Execution Environment for Decentralized Process Mining" given during the CAiSE 2024 Conference in Cyprus on June 7, 2024.
Cascading concurrent yahoo lunch_nlearn
1. DRIVING INNOVATION THROUGH DATA
BUILDING PRODUCTION-GRADE HADOOP APPLICATIONS WITH CASCADING
Supreet Oberoi
VP Field Engineering, Concurrent Inc
2. ABOUT ME
2
• I am a Data Engineer, not a Data Scientist
• I help enterprises make decisions about their "Big Data" roadmap and technical strategy — use cases, products, technology decisions, employee skills
• I design Hadoop applications with the intent to operationalize them in enterprise settings — applications on which the business depends, and that last longer than the technologies underneath them…
• This talk is about learning how to design a Big Data strategy that leverages best practices
3. BUILDING AN OPEN PLATFORM IS KEY TO PREVENTING LOCK-IN
3
• Open Language
• Open Data
• Open Hardware
• Open Compute Platform
• Open Development Platform
4. OPEN LANGUAGES ALLOW YOU TO HARNESS THE TALENT OF YOUR ENTERPRISE
4
• Don’t equate architecture with language; develop your architecture to support multiple languages
• Support SQL and SQL-like languages
• Encourage development in proven, scalable languages such as Java
• Develop the architecture to support changing programming languages (even for the same app)
• Use common performance-management tools across all programming environments
5. OPEN DATA ENABLES REUSE OF DATA AND APPS
5
Develop a common operating picture by promoting reuse with open data
• Prevent exclusive access to data sets
through proprietary tools
• Promote a common meta-data repository
• Forbid storing data in proprietary formats
• Build seamless integration capabilities
6. OPEN HARDWARE PROMOTES REUSE OF INFRASTRUCTURE
6
• Get commodity hardware — commodity hardware will
always cost less than “optimized” specialized
hardware (note: definition of “specialized” is up for
debate)
• Develop and maintain a cluster that can be reused by
different applications and technology stacks — avoid
custom software installations on the cluster, or setting
up dedicated clusters for given tech stacks
• Harness the power of collective from the cluster —
avoid fragmenting the cluster if possible
7. OPEN COMPUTE PLATFORM MAKES YOU SELECT THE RIGHT TOOL FOR THE PROBLEM
7
Make tradeoffs between reliability & speed based on your business context
• Ensure that moving your application from one
Hadoop compute platform (e.g. MapReduce) to
another (e.g., Tez) does not:
• impact application code
• impact production-monitoring tools
• Resist compute platforms that require your
enterprise to acquire significantly new skills (even
if it is easy) to become productive
• Avoid new platforms that partition the cluster
• Avoid platforms that do not support Open Data
8. OPEN DEVELOPMENT PLATFORM PROVIDES LONG-TERM SUSTAINABILITY
8
Development platforms improve developer productivity and operational excellence — picking a
correct platform gives you best practices developed by the community, achieving higher quality
• Invest in picking the correct development platform
— open, easy, scalable, popular, tools, …
• Bet on a sustainable open source platform
• Measure the vitality of the community:
• number of downloads, extensions (living
ecosystem), extensible architecture, consumers of
the technology, code stability…
A proven platform provides tools to get your apps to production
9. GET TO KNOW CONCURRENT
9
Leader in Application Infrastructure for Big Data
• Building enterprise software to simplify Big Data application
development and management
Products and Technology
• CASCADING
Open Source - The most widely used application infrastructure for
building Big Data apps with over 175,000 downloads each month
• DRIVEN
Enterprise data application management for Big Data apps
Proven — Simple, Reliable, Robust
• Thousands of enterprises rely on Concurrent to provide their data
application infrastructure.
Founded: 2008
HQ: San Francisco, CA
CEO: Gary Nakamura
CTO, Founder: Chris Wensel
www.concurrentinc.com
10. BIG DATA — OPERATIONALIZE YOUR DATA APPS WITH CASCADING
10
“It’s all about the apps”
There needs to be a comprehensive solution for building, deploying, running
and managing this new class of enterprise applications.
[Diagram: business strategy connected to data and technology; challenges include skill sets, systems integration, standard operating procedure, and operational visibility]
11. DATA APPLICATIONS - ENTERPRISE NEEDS
Enterprise Data Application Infrastructure
• Need reliable, reusable tooling to quickly build and consistently deliver
data products
• Need the degrees of freedom to solve problems ranging from simple to
complex with existing skill sets
• Need the flexibility to easily adapt an application to meet business needs
(latency, scale, SLA), without having to rewrite the application
• Need operational visibility for entire data application lifecycle
11
12. CASCADING - DE-FACTO FOR DATA APPS
12
[Diagram: Cascading apps (SQL, Clojure, Ruby) running on new fabrics (Tez, Storm), with system integration across mainframe, DB/DW, in-memory data stores, and Hadoop]
• Standard for enterprise
data app development
• Your programming
language of choice
• Cascading applications
that run on MapReduce
will also run on Apache
Spark, Storm, and …
13. WORD COUNT EXAMPLE WITH CASCADING
13
String docPath = args[ 0 ];
String wcPath = args[ 1 ];

// configuration
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// integration: create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// processing: specify a regex to split "document" text lines into a token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// scheduling: connect the taps, pipes, etc., into a flow definition
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );

// create the Flow
Flow wcFlow = flowConnector.connect( flowDef ); // <<-- unit of work
wcFlow.complete(); // <<-- runs jobs on the cluster
14. SOME COMMON PATTERNS
• Functions
• Filters
• Joins
‣ Inner / Outer / Mixed
‣ Asymmetrical / Symmetrical
• Merge (Union)
• Grouping
‣ Secondary Sorting
‣ Unique (Distinct)
• Aggregations
‣ Count, Average, etc.
14
[Diagram: a data pipeline of filters and functions, composed into split, join, and merge topologies]
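The patterns above compose like ordinary collection operations. As a plain-Java sketch (standard library only, no Cascading or Hadoop; the class name and sample data are made up for illustration), a function → filter → grouping → aggregation pipeline over text lines looks like:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PatternSketch {
    // function (tokenize), filter (drop empties), grouping + aggregation (count)
    static Map<String, Long> wordCounts(List<String> lines) {
        return lines.stream()
            .flatMap(line -> Arrays.stream(line.split("\\s+"))) // function: split into tokens
            .filter(token -> !token.isEmpty())                  // filter: drop empty tokens
            .collect(Collectors.groupingBy(t -> t, Collectors.counting())); // group + count
    }

    public static void main(String[] args) {
        System.out.println(wordCounts(Arrays.asList("rain shadow", "rain rain")));
    }
}
```

In Cascading the same shape is expressed with Each (functions and filters), GroupBy, and Every (aggregators) pipes, but the flow runs distributed over HDFS data instead of an in-memory list.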
15. CASCADING
• Java API
• Separates business logic from integration
• Testable at every lifecycle stage
• Works with any JVM language
• Many integration adapters
15
[Diagram: Cascading architecture: Processing API and Integration API over a Process Planner, Scheduler API, and Scheduler, running on Apache Hadoop against data stores; usable from scripting languages (Scala, Clojure, JRuby, Jython, Groovy) and enterprise Java]
16. THE STANDARD FOR DATA APPLICATION DEVELOPMENT
16
www.cascading.org
A proven application development framework for building data apps; an application platform that addresses:
• Building scale-free data apps: design principles ensure best practices at any scale
• Test-driven development: efficiently test code and process local files before deploying on a cluster
• The staffing bottleneck: use existing Java, Scala, SQL, and modeling skill sets
• Application portability: write once, then run on different computation fabrics
• Operational complexity: simple; package everything into one jar and hand it to operations
• Systems integration: Hadoop never lives alone; easily integrate with existing systems
18. CASCADING DATA APPLICATIONS
18
Enterprise IT: extract-transform-load, log file analysis, systems integration, operations analysis
Corporate Apps: HR analytics, employee behavioral analysis, customer support / eCRM, business reporting
Telecom: data processing of open data, geospatial indexing
Consumer Mobile Apps: location-based services
Marketing / Retail: mobile, social, and search analytics; funnel analysis; revenue attribution; customer experiments; ad optimization; retail recommenders
Consumer / Entertainment: music recommendation, comparison shopping, restaurant rankings, real estate rental listings, travel search and forecast
Finance: fraud and anomaly detection, fraud experiments, customer analytics, insurance risk metrics
Health / Biotech: aggregate metrics for government, person biometrics, veterinary diagnostics, next-gen genomics, agronomics, environmental maps
19. BUSINESSES DEPEND ON US
• Cascading Java API
• Data normalization and cleansing of search and click-through logs for
use by analytics tools, Hive analysts
• Easy to operationalize heavy lifting of data in one framework
19
20. BUSINESSES DEPEND ON US
• Cascalog (Clojure)
• Weather pattern modeling to protect growers against loss
• ETL against 20+ datasets daily
• Machine learning to create models
• Purchased by Monsanto for $930M US
20
21. BUSINESSES DEPEND ON US
• Scalding (Scala)
• Makes complex analysis of very large data sets simple
• Machine learning, linear algebra to improve
• 30,000 jobs a day — this works @ scale
• Ad quality (matching users and ad effectiveness)
21
TWITTER
24. … AND INCLUDES RICH SET OF EXTENSIONS
24
http://www.cascading.org/extensions/
25. CASCADING 3.0
25
“Write once and deploy on your fabric of choice.”
• The Innovation — Cascading 3.0 will
allow for data apps to execute on
existing and emerging fabrics
through its new customizable query
planner.
• Cascading 3.0 will support — Local
In-Memory, Apache MapReduce and
soon thereafter (3.1) Apache Tez,
Apache Spark and Apache Storm
[Diagram: enterprise data applications running on computation fabrics: local in-memory, MapReduce, and other/custom fabrics]
26. USE LINGUAL TO MIGRATE ITERATIVE ETL TASKS TO HADOOP
• Lingual is an extension to Cascading that
executes ANSI SQL queries as Cascading apps
• Supports integrating with any data source that can
be accessed through JDBC — Cascading Tap
can be created for any source supporting JDBC
• Great for migration of data, integrating with non-
Big Data assets — extends life of existing IT
assets in an organization
26
[Diagram: Lingual architecture: CLI/shell and enterprise Java clients over the Provider API, JDBC API, and Lingual API, with a query planner and catalog, running on Cascading and Apache Hadoop against data stores]
27. SCALDING
• Scalding is a language binding to Cascading for Scala
27
• The name Scalding comes from the combining of SCALa and cascaDING
• Scalding is great for Scala developers; can crisply write constructs for matrix
math…
• Scalding has very large commercial deployments at:
• Twitter - Use cases such as the revenue quality team, ad targeting and traffic quality
• eBay - Use cases include search analytics and other production data pipelines
28. PATTERN SCORES MODELS AT SCALE
28
• Pattern is an open source project that lets you leverage Predictive Model
Markup Language (PMML) models by translating them into Cascading apps
• PMML is a popular XML-based standard that lets applications describe data mining and
machine learning models
• PMML models from popular analytics frameworks can be reused and
deployed within Cascading workflows
• Vendor frameworks - SAS, IBM SPSS, MicroStrategy, Oracle
• Open source frameworks - R, Weka, KNIME, RapidMiner
• Pattern is great for migrating your model scoring to Hadoop from your
decision systems
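Conceptually, what Pattern does is apply a trained model function to every record in a Cascading flow. A stdlib-only sketch of the scoring step for a logistic-regression model (the kind of coefficients a PMML RegressionModel element carries; the weights here are made up, and this is not Pattern's actual API):

```java
import java.util.Map;

public class PmmlScoringSketch {

    // Intercept and per-field coefficients, as a trained model (and its
    // PMML export) would provide; values here are illustrative only.
    static final double INTERCEPT = -1.5;
    static final Map<String, Double> WEIGHTS = Map.of("age", 0.04, "income", 0.00001);

    // Logistic-regression score: sigmoid of the linear combination.
    static double score(Map<String, Double> record) {
        double z = INTERCEPT;
        for (Map.Entry<String, Double> w : WEIGHTS.entrySet()) {
            z += w.getValue() * record.getOrDefault(w.getKey(), 0.0);
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    public static void main(String[] args) {
        // Pattern would evaluate this per tuple, inside a Cascading flow, at scale.
        double p = score(Map.of("age", 40.0, "income", 50000.0));
        System.out.println(p); // a probability strictly between 0 and 1
    }
}
```

The value of Pattern is that the model itself stays in PMML, so the same export from R, SAS, or SPSS can be scored on Hadoop without reimplementing the math by hand as above.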
29. PATTERN SCORES MODELS AT SCALE
Step 1: Train your model with industry-leading tools
Step 2: Score your models at scale with Pattern
29 Confidential
30. OPERATIONAL EXCELLENCE WITH DRIVEN
Visibility Through All Stages of App Lifecycle
From Development — Building and Testing
• Design & Development
• Debugging
• Tuning
To Production — Monitoring and Tracking
• Maintain Business SLAs
• Balance & Controls
• Application and Data Quality
• Operational Health
• Real-time Insights
30
31. DRIVEN FOR HIVE: OPERATIONAL VISIBILITY FOR YOUR HIVE APPS
• Understand the anatomy of your Hive app
• Track execution of queries as single business process
• Identify outlier behavior by comparison with historical runs
• Analyze rich operational meta-data
• Correlate Hive app behavior with other events on cluster
31
33. DEEPER VISUALIZATION INTO YOUR HADOOP CODE
• Easily comprehend, debug, and tune
your data applications
• Get rich insights on your application
performance
• Monitor applications in real-time
• Compare app performance with
historical (previous) iterations
33
Debug and optimize your Hadoop applications more effectively with Driven
34. GET OPERATIONAL INSIGHTS WITH DRIVEN
• Quickly break down how often
applications execute based on their tags,
teams, or names
• Immediately identify if any application is
monopolizing cluster resources
• Understand the utilization of your cluster
with a timeline of all applications running
34
Visualize the activity of your applications to help maintain SLAs
35. ORGANIZE YOUR APPLICATIONS WITH GREATER FIDELITY
• Easily keep track of all your
applications by segmenting them with
user-defined tags
• Segment your applications for
trending analysis, cluster analysis,
and developing chargeback models
• Quickly break down how often
applications execute based on their
tags, teams, or names
35
Segment your applications for greater insights across all your applications
36. COLLABORATE WITH TEAMS
Utilize teams to collaborate and gain visibility over your set of applications
• Invite others to view and collaborate
on a specific application
• Gain visibility to all the apps and their
owners associated with each team
• Simply manage your teams and the
users assigned to them
36
37. MANAGE PORTFOLIO OF BIG DATA APPLICATIONS
Fast, powerful, rich search capabilities enable you to easily find the exact set of applications that you’re looking for
• Identify problematic apps with their
owners and teams
• Search for groups of applications
segmented by user-defined tags
• Compare specific applications with their
previous iterations to ensure that your
application can meet its SLAs
37
38. DRIVEN FOR HIVE: OPERATIONAL VISIBILITY FOR YOUR HIVE APPS
• Understand the anatomy of your Hive app
• Track execution of queries as single business process
• Identify outlier behavior by comparison with historical runs
• Analyze rich operational meta-data
• Correlate Hive app behavior with other events on cluster
38
39. SUMMARY - BUILD ROBUST DATA APPS RIGHT THE FIRST TIME WITH CASCADING
• The Cascading framework enables developers to intuitively create data applications that are scalable,
robust, and future-proof, supporting new execution fabrics without requiring a code rewrite
• Scalding — a Scala-based extension to Cascading — provides crisp programming
constructs for algorithm developers and data scientists
• Driven — an application visualization product — provides rich insights into how your
applications execute, improving developer productivity by 10x
• Cascading 3.0 opens up the query planner — write apps once, run on any fabric
39
Concurrent offers training classes for Cascading (DEC 9) & Scalding (NOV 4)
43. DIFFERENT PHILOSOPHY THAN GUI TOOLS
• Cascading is a general-purpose framework for developing data applications; it supports development through the
entire lifecycle of data — from staging to final data sets
• Developing with an API is more productive and intuitive than with a UI — it incorporates best practices
43
• Can do in three lines of code what takes 20 clicks in a GUI tool (the fluent API with an IDE makes it even simpler)
• Can test locally and deploy to production without code changes
• Because it is based in code: debuggable, extensible, deployable, traceable
• GUI tools do not help in visualizing the must-have insights
• Real-time application visualization, application bottlenecks, anatomy of the application, application
dependencies, cost breakdown of an operation (e.g., a join), bottlenecks due to code, data, network, or cluster
44. PATTERN: ALGOS IMPLEMENTED
• Hierarchical Clustering
• K-Means Clustering
• Linear Regression
• Logistic Regression
• Random Forest
• Algorithms are extended based on customer use cases
44
45. CASCADING 3.0 IMPACT - DATA APP DEVELOPMENT FOR SPARK ON ROBUST FRAMEWORK
• Cascading 3.0 will ease application migration to Spark
• Enterprises can standardize on one API to meet business challenges and solve a
variety of business problems ranging from simple to complex, regardless of latency or
scale
• Third-party products, data apps, frameworks, and dynamic programming languages
built on Cascading will immediately benefit from this portability
• Even more operational visibility from development through production with Driven
45
46. BUSINESSES DEPEND ON US
• Estimate suicide risk from what people write online
• Cascading + Cassandra
• You can do more than optimize ad yields
• http://www.durkheimproject.org
46