This document summarizes the results of a survey of Cascading users. It finds that Cascading is most popular among those building and managing big data applications. Many users explored alternatives such as Hive and Pig before adopting Cascading for its scalability and portability across compute frameworks. The survey also shows that Cascading users value reliability and performance at scale and are interested in newer frameworks like Spark.
DevOps and Modern Application Development in the Cloud: Red Hat, T-Systems, a... - Stefan Zosel
The document discusses a partnership between Red Hat, T-Systems, and Microsoft to offer a managed hybrid PaaS solution for application development in Europe. Through this collaboration, customers can leverage Red Hat OpenShift's container-based PaaS capabilities, Microsoft Azure's scalable cloud infrastructure hosted in Germany and managed by T-Systems, and T-Systems' AppAgile managed services for a compliant solution that addresses skills gaps. The combination aims to provide the benefits of public cloud with the security and compliance of local infrastructure for European customers developing modern applications.
IDC interviewed nine organizations that are using Red Hat OpenShift as their primary
application development platform. These organizations reported that OpenShift helps
them deliver timely and compelling applications and features across their complex and
heterogeneous IT environments and supports key IT initiatives such as containerization,
microservices, and cloud migration strategies.
Overview of the core elements of the alliance. Presented to enterprise customers at the Microsoft NorCal MTC on November 11th, 2016
Kevin McCauley
Red Hat
An IT provider of a large bank centrally distributes Hadoop solutions as a self-service for all the group's subsidiaries, despite lacking in-house expertise on complex Hadoop deployments. Alien4cloud accelerated Hadoop cluster provisioning from days to minutes and made it easy to adapt the standard Hadoop topology to individual business departments by modifying components, versions, etc. Separately, one of the largest professional services firms in the world implemented an enterprise application marketplace powered by alien4cloud; it will provide applications and data sets through a self-service interface with configurable business rules.
This document discusses combining Hadoop with big data analytics. It begins by exploring how Hadoop has become a popular framework for handling big data challenges. It then discusses some of the key skills needed for successful big data analytics programs, including technical skills with tools like Hadoop as well as business knowledge. Specifically, it recommends including business analysts, BI developers, predictive model builders, data architects, data integration developers, and technology architects on any big data analytics team.
This document discusses big data analysis and Hadoop. It begins by describing different stages of data analysis and roles of various personnel. It then discusses challenges of analyzing big data using traditional tools and how Hadoop addresses these challenges through its distributed architecture and MapReduce programming model. Several case studies are presented where companies have used Hadoop to perform large-scale data analysis. Key components of Hadoop like MapReduce, Pig, Hive and Mahout are also introduced.
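The MapReduce programming model mentioned above can be illustrated with a minimal sketch. This is plain Python standing in for Hadoop's distributed runtime, and the function names are illustrative only; in real Hadoop the map and reduce phases run in parallel across a cluster:

```python
from collections import defaultdict

def map_phase(document):
    # Mapper: emit a (word, 1) pair for every word in the input split
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle step: group values by key; Reducer: sum the counts per word
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

docs = ["Hadoop handles big data", "big data needs Hadoop"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(pairs)  # e.g. counts["hadoop"] == 2
```

The appeal of the model is that each mapper and reducer works independently on its own slice of data, which is what lets Hadoop scale the same logic across many machines.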
This document discusses the shift to on-demand business intelligence (BI). It identifies five key drivers of this shift: 1) It just makes sense from a cost and usability perspective; 2) The need for simple, easy to use BI tools; 3) The shift from transactional systems (OLTP) to analytical systems (OLAP); 4) Analyst predictions that on-demand BI will grow significantly; 5) Most traditional BI vendors are now adopting on-demand strategies to follow market trends. The document argues that on-demand BI can help address the gap in meeting the needs of business users who want simple, flexible and low-cost access to information and insights.
7 Habits for Big Data in Production - keynote, Big Data London, Nov 2018 - Ellen Friedman
You can improve your chances for success with data-intensive, large-scale applications (AI, machine learning, and analytics) in production.
This keynote presentation from Big Data London shows you how.
Moustafa Soliman, "HP Vertica - Solving Facebook Big Data Challenges" - Dataconomy Media
Moustafa Soliman, Business Intelligence Developer at Hewlett Packard, presented "HP Vertica - Solving Facebook Big Data Challenges" as part of the "Big Data Stockholm" meetup on April 1st at SUP46.
R, Spark, TensorFlow, H2O.ai Applied to Streaming Analytics - Kai Wähner
Slides from my talk at Codemotion Rome in March 2017. Development of analytic machine learning / deep learning models with R, Apache Spark ML, TensorFlow, H2O.ai, RapidMiner, KNIME and TIBCO Spotfire. Deployment to real-time event processing / stream processing / streaming analytics engines like Apache Spark Streaming, Apache Flink, Kafka Streams, TIBCO StreamBase.
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha... - Shirshanka Das
So, you finally have a data ecosystem with Kafka and Hadoop both deployed and operating correctly at scale. Congratulations. Are you done? Far from it.
As the birthplace of Kafka and an early adopter of Hadoop, LinkedIn has 13 years of combined experience using Kafka and Hadoop at scale to run a data-driven company. Both Kafka and Hadoop are flexible, scalable infrastructure pieces, but using these technologies without a clear idea of what the higher-level data ecosystem should be is perilous. Shirshanka Das and Yael Garten share best practices around data models and formats, choosing the right granularity for Kafka topics and Hadoop tables, and moving data efficiently and correctly between Kafka and Hadoop. They also explore Dali, a data abstraction layer that can help you process data seamlessly across Kafka and Hadoop.
Beyond pure technology, Shirshanka and Yael outline the three components of a great data culture and ecosystem, explain how to create maintainable data contracts between data producers and data consumers (like data scientists and data analysts), and show how to standardize data effectively in a growing organization to enable (and not slow down) innovation and agility. They then look to the future, envisioning a world where you can successfully deploy a data abstraction of views on Hadoop data: a data API that acts as a protective and enabling shield. Along the way, they discuss how to enable teams to be good data citizens in producing, consuming, and owning datasets, and offer an overview of LinkedIn's governance model: the tools, processes, and teams that ensure its data ecosystem can handle change and sustain #datasciencehappiness.
This document provides an overview of Hortonworks and Hadoop. It discusses Hortonworks' customer momentum, the Hortonworks Data Platform (HDP), and Hortonworks' role as a partner for customer success. It also summarizes challenges with traditional data systems, how Hadoop emerged as a foundation for a new data architecture, and how HDP delivers a comprehensive data management platform.
Early adopters of cloud technology—companies that have planned, implemented and seen the benefits in real deployments—are beginning to establish a track record of “lessons learned”. The Economist Intelligence Unit, sponsored by SAP, has analysed the experiences of six companies that have implemented cloud solutions specifically designed to foster collaboration in the workplace.
Jan van der Vegt: Challenges Faced with Machine Learning in Practice - Lviv Startup Club
Machine learning projects often fail to make it from development to production, so looking at the full machine learning lifecycle is essential for success. The lifecycle includes development, deployment, infrastructure, monitoring, automation, standardization, lineage and reproducibility. A machine learning operations (MLOps) platform can provide an end-to-end system view for increased efficiency, collaboration, and trust across the lifecycle. Key takeaways: focus on what is important, and avoid both doing nothing (which fails to scale) and doing everything (which stifles progress).
Digital Transformation - #StrataData London 2017 - Data 101 - Ellen Friedman
Presented at Strata Data London conference May 2017 in the Data 101 track, this presentation explores what is needed in planning, architecture, and cultural organization for effective digital transformation.
What Makes Machine Learning Work? Berlin Buzzwords 2018 #bbuzz talk - Ellen Friedman
This document provides an overview of a presentation given by Ellen Friedman on machine learning. Some key points discussed include:
- Domain knowledge is very important for machine learning to work effectively. Small differences in input data or labels can significantly impact model performance.
- Stream processing and microservices architectures are useful for managing the many models needed for machine learning. Having the right messaging infrastructure is also important.
- Deploying and managing machine learning models at scale poses logistical challenges. The Rendezvous architecture and DataOps approaches aim to help with continuous model evaluation, deployment and adaptation.
- Both software engineers and data scientists have important roles to play in machine learning projects, and cross-functional teams are needed to bring those skills together.
ACCELERATE SAP® APPLICATIONS WITH CDNETWORKS - CDNetworks
CDNetworks and SAP conducted a proof of concept project to test how CDNetworks' content delivery network (CDN) service could accelerate SAP applications. Testing showed the CDN provided significant performance improvements, reducing response times for login and file downloads by 50-66% on average globally. The CDN also improved reliability, with no errors observed during stress testing of 10,000 transactions, whereas the internet saw around a 4% failure rate. The CDN's global infrastructure and security features were found to enhance the delivery, speed, and reliability of SAP applications for distributed users worldwide.
The document discusses embedding machine learning in business processes using the example of baking cakes. It notes that while bakers follow exact recipes and processes, the results are not always perfect due to various factors. It then discusses how manufacturers are "data rich but information poor" as they cannot derive meaningful insights from their operational data. The document advocates generating "actionable intelligence" through deep analysis of production data to determine the root causes of issues like cracked cakes, rather than just reporting what problems occurred. This would help manufacturers diagnose and address process flaws more precisely.
Haven OnDemand is a machine learning platform that provides APIs and services to help developers easily build data-rich applications. It has over 60 composable machine learning APIs that can be combined to power use cases like text analysis, image recognition, and predictive modeling. Developers can build powerful applications with minimal coding by leveraging these APIs. Haven OnDemand also offers purpose-built solutions like Haven Search OnDemand that are built on top of the API platform.
Apache Hadoop and its role in Big Data architecture - Himanshu Bari - jaxconf
In today's world of exponentially growing big data, enterprises are increasingly aware of the business utility and necessity of harnessing, storing and analyzing this information. Apache Hadoop has rapidly evolved into a leading platform for managing and processing big data, with the management, monitoring, metadata and integration services organizations need to glean maximum business value and intelligence from their burgeoning information on customers, web trends, products and competitive markets. In this session, Hortonworks' Himanshu Bari will discuss the opportunities for deriving business value from big data by looking at how organizations use Hadoop to store, transform and refine large volumes of this multi-structured information. He will also discuss the evolution of Apache Hadoop and where it is headed, the component requirements of a Hadoop-powered platform, and solution architectures that allow Hadoop to integrate with existing data discovery and data warehouse platforms. In addition, he will look at real-world use cases where Hadoop has helped produce more business value, augment productivity or identify new and potentially lucrative opportunities.
Understanding the Cloud for Enterprise Businesses - Triaxil
Cloud is getting lots of attention these days. Cloud is a transformational platform that can support the opportunities of today's digital business, shaped and driven by mobile, social, IoT (Internet of Things), Big Data and other forces. Cloud computing is not only a powerful agent of change; it can also accelerate transformation.
The benefits are big. "Cloud computing is a disruptive phenomenon, with the potential to make IT organizations more responsive than ever," says research firm Gartner. "Cloud computing promises economic advantages, speed, agility, flexibility, infinite elasticity and innovation." As a result, more and more enterprises are moving to the cloud. According to Gartner, 78 percent of enterprises are planning to increase their investment in cloud through 2017.
Introduction to the graph technologies landscape - Linkurious
Graph technologies allow modeling of complex relationships and connections through nodes and edges. There are three main layers of graph technologies: graph databases to store graph data, graph analysis frameworks to analyze large graphs, and graph visualization solutions to interact with graphs. Popular tools in each layer include Neo4j and Titan for databases, Giraph and GraphX for analysis, and Gephi and Cytoscape for visualization. Graph technologies are gaining more attention due to their ability to extract insights from connected data.
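The node-and-edge model described above can be sketched in a few lines. This is a hypothetical property-graph toy in plain Python, not the data model of any specific tool named here; node ids, properties, and relationship labels are invented for illustration:

```python
# Nodes carry properties; edges are (source, relationship, target) triples
nodes = {
    "alice": {"type": "person"},
    "bob":   {"type": "person"},
    "acme":  {"type": "company"},
}
edges = [
    ("alice", "knows",    "bob"),
    ("alice", "works_at", "acme"),
    ("bob",   "works_at", "acme"),
]

def neighbors(node_id):
    # Simple traversal: follow outgoing edges from a node
    return [(rel, dst) for src, rel, dst in edges if src == node_id]

# Traversals like this are what graph databases optimize and
# graph visualization tools render interactively
connections = neighbors("alice")  # [("knows", "bob"), ("works_at", "acme")]
```

A graph database persists and indexes this structure, an analysis framework runs algorithms (PageRank, connected components) over it at scale, and a visualization layer lets a user explore it interactively; the three layers share the same underlying model.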
The document discusses the future of data and modern data applications. It notes that data is growing exponentially and will reach 44 zettabytes by 2020. This growth is driving the need for new data architectures like Apache Hadoop which can handle diverse data types from sources like the internet of things. Hadoop provides distributed storage and processing to enable real-time insights from all available data.
A successful enterprise Journey to Cloud requires more than technical execution, and we’ll help you learn what to consider, the pitfalls and how to succeed. We’ve helped many companies – in Australia and globally – execute their digital vision and accelerate change on their Journey to Cloud. We’ll share some of their experiences to help you discover how an optimised migration can transform your business.
Speakers:
Chris Fleishmann, Managing Director, Journey to Cloud Chief Architect
Attilio Di Lorenzo, Senior Manager, Journey to Cloud Architect
The document discusses how businesses are increasingly adopting public and private cloud services. It provides statistics showing that 58% of organizations currently use cloud services for small applications and workloads. The use of cloud infrastructure as a service (IaaS) and platform as a service (PaaS) is growing significantly and driving digital business innovation. The top challenges with public cloud include bandwidth costs, performance constraints, and cloud services going down. The document argues that adding flash memory to cloud infrastructure can enhance performance, reliability, and cost effectiveness by providing predictable performance, high throughput, and redundancy for critical workloads.
The document discusses big data and open source tools and technologies. It provides an overview of key challenges for data leaders, introduces the top 10 big data tools including Apache Spark, R, and Talend Open Studio. It outlines the benefits of open source including low costs, flexibility, and innovation. The document advocates adopting both corporate and open source software using a "bi-modal" approach to support innovative and engineered analytics. It provides a template for a 1-page big data strategy.
SnapLogic has been gaining traction in big-data integration. It recently announced the Fall 2015 release of its Elastic Integration Platform, which adds capabilities for big-data integration that now include Spark (an open source in-memory data-processing framework), a new Snap (preconfigured connector) for Cassandra (an open source distributed 'big' database) and support for Microsoft Cortana Analytics. SnapLogic is positioning this release as a self-service hybrid cloud integration offering, and it is intended to strengthen its position among Microsoft customers and others seeking cloud-based big-data analytics.
The document discusses the development of an internal data pipeline platform at Indix to democratize access to data. It describes the scale of data at Indix, including over 2.1 billion product URLs and 8 TB of HTML data crawled daily. Previously, the data was not discoverable, schemas changed and were hard to track, and using code limited who could access the data. The goals of the new platform were to enable easy discovery of data, transparent schemas, minimal coding needs, UI-based workflows for anyone to use, and optimized costs. The platform developed was called MDA (Marketplace of Datasets and Algorithms) and enabled SQL-based workflows using Spark. It has continued improving since its first release in 2016
You’re not the only one still loading your data into data warehouses and building marts or cubes out of it. But today’s data requires a much more accessible environment that delivers real-time results. Prepare for this transformation because your data platform and storage choices are about to undergo a re-platforming that happens once in 30 years.
With the MapR Converged Data Platform (CDP) and Cisco Unified Compute System (UCS), you can optimize today’s infrastructure and grow to take advantage of what’s next. Uncover the range of possibilities from re-platforming by intimately understanding your options for density, performance, functionality and more.
Functional programming for optimization problems in Big DataPaco Nathan
Enterprise Data Workflows with Cascading.
Silicon Valley Cloud Computing Meetup talk at Cloud Tech IV, 4/20 2013
http://www.meetup.com/cloudcomputing/events/111082032/
BIG Data & Hadoop Applications in Social MediaSkillspeed
This document discusses how major social media networks like Facebook, Twitter, LinkedIn, Pinterest, and Instagram utilize big data and Hadoop technologies. It provides examples of how each network uses Hadoop for tasks like storing user data, performing analytics, and generating personalized recommendations at massive scales as their user bases and data volumes grow enormously. The document also briefly outlines SkillSpeed's Hadoop training course, which covers topics like HDFS, MapReduce, Pig, Hive, HBase and more to prepare students for jobs working with big data.
The document summarizes the key findings from a survey on the future of cloud computing in 2012. Some of the main points covered include:
1) Software is increasingly becoming cloud-based, with SaaS spending growing much faster than traditional software and over 50% of categories being disrupted.
2) SaaS is widely adopted, with 82% currently using it and 84% of new software predicted to be SaaS. PaaS adoption is also increasing significantly.
3) Hybrid cloud models are becoming more popular, with 100% of deployments predicted to be hybrid by 2017.
4) While cloud adoption is increasing, concerns around security, compliance and other issues remain barriers for some.
The document discusses cloud computing trends, including:
- Most large enterprises are transitioning infrastructure to cloud computing to cut costs and risks. Critical workloads are also moving to cloud.
- Hybrid cloud strategies that maintain some workloads on-premise while moving others to cloud are becoming more common and supported.
- Hardware companies are struggling to remain relevant as cloud platforms commoditize infrastructure. They are pursuing mergers and spin-offs.
- DevOps practices emphasize continuous delivery over traditional ITIL change processes. The role of IT is shifting from systems maintenance to innovation brokerage and service management between internal and cloud resources.
FlexPod Select for Hadoop is a pre-validated solution from Cisco and NetApp that provides an enterprise-class architecture for deploying Apache Hadoop workloads at scale. The solution includes Cisco UCS servers and fabric interconnects for compute, NetApp storage arrays, and Cloudera's Distribution of Apache Hadoop for the software stack. It offers benefits like high performance, reliability, scalability, simplified management, and reduced risk for organizations running business-critical Hadoop workloads.
SnapLogic Raises $37.5M to Fuel Big Data Integration PushSnapLogic
SnapLogic has grown well and rapidly since it pivoted in 2012 to focus on cloud-based iPaaS; however, the company continues to compete with on-premises providers, especially for big-data integration, thanks to its hybrid execution framework, which separates the design and management of integration pipelines from the runtime environment. Microsoft’s involvement in the latest funding round is sure to be a blessing, and builds on an existing agreement to provide integration for the Cortana Analytics Suite and Azure cloud.
This document provides an overview of IT/Network Operations concepts and strategies to improve cloud production. It begins with Joe Dietz introducing himself as a Network Security Professional and listing his current certifications. It then discusses various local user groups and events related to cloud security. The document covers topics such as selecting public vs private clouds, choosing cloud providers and applications, operational considerations, and approaches to connecting networks to the cloud such as extending datacenters or enabling edge services. It emphasizes that moving to the cloud still requires planning and not all applications are good candidates. The summary concludes by mentioning related reading on hybrid cloud services and tools.
The business analytics marketplace is experiencing a challenge as classic BI tools meet up with evolving big data technologies, in particular Hadoop. We explore how IBM works to meet this challenge, providing a big picture perspective of their big data offerings around Hadoop, its open data platform and BigInsights.
Building a Big Data platform with the Hadoop ecosystemGregg Barrett
This presentation provides a brief insight into a Big Data platform using the Hadoop ecosystem.
To this end the presentation will touch on:
-views of the Big Data ecosystem and it’s components
-an example of a Hadoop cluster
-considerations when selecting a Hadoop distribution
-some of the Hadoop distributions available
-a recommended Hadoop distribution
Learn why 451 Research believes Infochimps is well-positioned with an easy-to-consume managed service for those without Hadoop expertise, as well as a stack of technologically interesting projects for the 'devops' crowd.
Opening with a market positioning statement and ending with a competitive and SWOT analysis, Matt Aslett provides a comprehensive impact report.
Infochimps report 451 research impact reportAccenture
Infochimps, a big data PaaS provider, has updated its platform with stream processing capabilities from technologies developed at Twitter and LinkedIn. With its first paying customer, the company is now seeking partnerships to support its enterprise-focused offering. It provides an easy-to-use managed service for Hadoop that masks complexity and can generate insights from data in 30 days without specialized hiring or infrastructure. While competition is increasing, Infochimps' strengths include its Chef-based cluster platform and integration of existing tools via its Data Delivery Service.
Infochimps report 451 research impact reportAccenture
Infochimps, a big data PaaS provider, has updated its platform with stream processing capabilities from technologies developed at Twitter and LinkedIn. With its first paying customer, the company is now seeking partnerships to support its enterprise-focused offering. It provides an easy-to-use managed service for Hadoop that masks complexity and can generate insights from data in 30 days without specialized hiring or infrastructure. While competition is increasing, Infochimps' strengths include its Chef-based cluster platform and integration of existing tools via its Data Delivery Service.
2. Confidential

WHAT'S BEHIND THE RISE OF CASCADING?

Enterprise IT teams designing their big data platforms must choose from a daunting array of development frameworks and compute fabrics. On the one hand, they want a development framework that leverages existing skill sets. At the same time, they want the flexibility to benefit from the performance gains of the latest, greatest compute fabrics.

Cascading is a robust framework with over 10,000 known production deployments and over 275,000 downloads per month. Twitter, AirBnB, Climate Corp, Apple, eBay, and Netflix are a few of the enterprises that have built their Hadoop practices with Cascading. The Cascading user group is a diverse, self-supporting community that is helping advance Cascading's scalability, portability, performance and value. In addition, the large number of open source projects contributed by mainstream enterprises such as Netflix, Commonwealth Bank of Australia, and Expedia attests to the vibrancy of the Cascading ecosystem.

In this paper, we'll reveal what's behind Cascading's growth by digging into the results of a new Cascading user survey. In general, Cascading users turn out to be extremely concerned about reliability and performance at scale. Many experimented with early Hadoop frameworks like Hive and Pig, but found Cascading to be a more scalable approach. And lately, the easy portability of Cascading applications between compute fabrics has generated a lot of excitement in the community.
CASCADING IS MOST POPULAR AMONG BUILDERS AND MANAGERS OF BIG DATA APPLICATIONS

[Chart: What title best describes your role? (N=121)] Responses spanned Developer/Engineer, IT Manager or Director, Architect, IT Specialist, CIO/SVP of IT, BI/EDW Manager/Director, Application Manager/Director, Head of IT Infrastructure, and Head/VP of IT, with Developer/Engineer and IT Manager or Director the most common roles.

Photo: Liverpool Street station crowd blur, by David Sim.
CASCADING COMMUNITY MEMBERS ARE MATURE, PRODUCTION USERS

[Chart: How long have you been using Hadoop? (N=69)] 0-12 months: 8%; 12-24 months: 26%; 24-36 months: 25%; over 3 years: 41%.

Most respondents have been using Hadoop for over 3 years. Assuming the sample is representative, the Cascading community largely consists of early Hadoop adopters. Furthermore, the Cascading community isn't just dabbling: over 84% have already put their Cascading applications into production or plan to do so. As for why, many likely found out the hard way that developing directly on Hadoop was painful, tedious and poorly suited to scale.

[Chart: What challenges did you have that made you look for an application development framework?] Responses included: slow development in the existing platform; high cost of development in the existing platform; lack of skilled Hadoop resources; poor troubleshooting capabilities; difficulty integrating with existing systems; lack of portability across compute fabrics; lack of scalability; poor integration into existing IT infrastructure; and other.
THE PATH TO CASCADING: HIVE, PIG, AND GUI TOOLS

Given the maturity of Cascading users, it's no surprise that many explored alternatives before settling on Cascading. The majority (51%) tried Hive and Pig, both of which were early abstraction layers for MapReduce. Today, many Pig applications run alongside Cascading, and many Hive applications run within Cascading.

Why didn't they stick with Hive and Pig? Most organizations determined they could not scale with Hive and Pig, typically because those frameworks required scarce technical resources and because development in them was slow. Those who opted for other API frameworks found them not yet ready for the enterprise.

A smaller group experimented with GUI-based ETL tools. While these tools made it easy to leverage existing resources and skill sets, their capabilities were too limited. They also required building special scripts to achieve complex functionality, which negated the benefits of simplicity. Additionally, many users did not like being locked into a single-vendor solution.

[Chart: Before selecting Cascading, what alternative solutions did you explore? (select all that apply; N=69)] Pig: 26%; Hive: 25%; other API frameworks (Spark, Crunch): 22%; GUI-based ETL tools (Talend, Informatica, Pentaho): 19%; no other alternatives were explored: 8%.
PORTABILITY ACROSS FABRICS

[Chart: Which compute fabric(s) are you using or planning to use in the next 18 months? (N=69)] Responses covered Spark, MapReduce, Kafka, Storm, Tez, Flink, and other.

New compute fabrics appear all the time, though not all are production-ready. The responses reflect high interest in Spark and a desire for true streaming (not micro-batches). MapReduce isn't going away any time soon, especially where reliability is a requirement. Still, many are experimenting with other compute fabrics. Because each fabric offers application-specific advantages, most organizations will likely wind up running multiple fabrics.

Cascading 3.0 supports Tez, MapReduce, and local/in-memory mode, so users can port applications from MapReduce to Tez simply by changing a few lines of code. Easy portability makes Cascading an ideal platform for moving from MapReduce to Tez without incurring the cost of rewriting applications. Soon, Cascading will support the same portability for Spark and Flink (for Flink, support will be community contributed).
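As a configuration-level sketch of what "changing a few lines" looks like in practice: the pipe assembly and taps stay the same, and only the planner class changes. This is an illustrative example, not code from the survey; it assumes the Cascading 3.0 planner classes (HadoopFlowConnector for MapReduce, Hadoop2TezFlowConnector for Tez) are on the classpath, and the PortableFlow class and run helper are hypothetical names.

```java
// Illustrative sketch only: assumes cascading-core plus the MapReduce and
// Tez planner jars on the classpath. The flowDef (sources, sinks, pipe
// assembly) is defined elsewhere and is identical for both fabrics.
import java.util.Properties;

import cascading.flow.FlowConnector;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;   // MapReduce planner
import cascading.flow.tez.Hadoop2TezFlowConnector;  // Tez planner
import cascading.property.AppProps;

public class PortableFlow {
  static void run(FlowDef flowDef, boolean useTez) {
    Properties props = new Properties();
    AppProps.setApplicationJarClass(props, PortableFlow.class);

    // The only lines that change between fabrics:
    FlowConnector connector = useTez
        ? new Hadoop2TezFlowConnector(props)
        : new HadoopFlowConnector(props);

    connector.connect(flowDef).complete();
  }
}
```

The FlowDef itself is fabric-agnostic; only the FlowConnector binds the assembly to a runtime, which is what makes the port a few-line change rather than a rewrite.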
CASCADING BRIDGES OTHER DEVELOPMENT FRAMEWORKS

Despite their shortcomings, MapReduce, Hive and Pig are still widely used as development frameworks, largely because many early Hadoop applications were built through these interfaces. No surprise that we see a lot of excitement about Spark as a new development framework as well; many users are experimenting with developing directly in the Spark API. Cascading will support Spark in a future WIP release, adding an important framework option for Spark developers. Developers who build in Cascading will be able to port their applications from MapReduce to Spark without having to rewrite them in the Spark API.

In summary, there is no one-size-fits-all framework. Flexibility is key as organizations build out their big data strategies and platforms.

[Chart: What data application development framework do you use? (N=69)] Responses covered Cascalog, Scalding, Pig, Hive, MapReduce, Cascading, and Spark.

"[Cascading] Best Hadoop API for enterprise data-intensive apps." – Architect, Fortune 500 Healthcare Payer
COMMON USE CASES: ETL, ANALYTICS & DATA INTEGRATION

Most organizations rely on Hadoop for heavy processing steps within ETL, analytics or data integration flows. Some have moved their entire ETL processing to Hadoop, while others have moved only portions of their workflows. For example, AirBnB uses Cascading for complicated infrastructure tasks such as data normalization and cleansing. AirBnB also leverages Cascading for reconstructing corrupted files and merging data.

In combination with Cascading, Pig and Hive are used by analysts to run batch scripts to perform ad hoc analysis. With these tools, analysts are able to more easily study crucial metrics like click-through rates, page statistics, and drop-off rates.

[Chart: What best describes the projects where you are using Cascading? (N=69)] Responses included ETL, analytics, data integration, machine learning and scoring, data quality, recommendation engines, search optimization, and other. Highlights: 45% are offloading ETL to Hadoop, 40% use it to support analytics/BI projects, and 33% run data integration projects.
9. Confidential
Extremely
likely - 10
23%
9
10%
8
20%
7
19%
6
11%
5
6%
4
1%
3
3%
2
4%
Not at all
likely - 0
3%
How likely is it that you would
recommend Cascading to a friend or
colleague?
WHY THEY LOVE CASCADING: TDD, JAVA API, PORTABILITY (N=79)
Top 3 Most Impactful Capabilities
- Test-Driven Development (49%): Efficiently test code and process local files before you deploy on a cluster with Cascading's local or in-memory mode. Incorporate inline data assertions to define results at any point in your pipeline. Failed assertions are easily visible and available for analysis.
- Java API (44%): Cascading is a Java library and does not require installation. Cascading fits directly into a standard development process; all you have to do is code to the API.
- Application Portability (43%): When you compile a Cascading job, it automatically creates a run-time executable for your specified compute fabric. Simply by changing a few lines of code, you can test your application on multiple fabrics and choose the best for your needs.
53% of respondents are promoters (ratings 8-10).
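The local-mode testing and portability described above both come down to which FlowConnector binds the pipe assembly at run time. A minimal sketch of that idea, assuming a Cascading 2.x-style API on the classpath; the file names, field names, and filter expression are hypothetical:

```java
import cascading.flow.FlowDef;
import cascading.flow.local.LocalFlowConnector;
import cascading.operation.expression.ExpressionFilter;
import cascading.pipe.Each;
import cascading.pipe.Pipe;
import cascading.scheme.local.TextDelimited;
import cascading.tap.local.FileTap;
import cascading.tuple.Fields;

public class PortableFlow {
    public static void main(String[] args) {
        // Hypothetical input/output files and field names.
        FileTap source = new FileTap(
            new TextDelimited(new Fields("user", "clicks"), ","), "clicks.csv");
        FileTap sink = new FileTap(
            new TextDelimited(new Fields("user", "clicks"), ","), "active-users.csv");

        // Pipe assembly: ExpressionFilter removes tuples where the
        // expression is true, i.e. drops rows with fewer than 10 clicks.
        Pipe pipe = new Pipe("active-users");
        pipe = new Each(pipe, new Fields("clicks"),
            new ExpressionFilter("clicks < 10", Integer.TYPE));

        FlowDef flowDef = FlowDef.flowDef()
            .addSource(pipe, source)
            .addTailSink(pipe, sink);

        // Local mode runs in-process against local files, which is what
        // enables test-driven development without a cluster. Swapping in
        // HadoopFlowConnector (with Hfs taps) runs the same pipe assembly
        // on a cluster -- the "few lines of code" change.
        new LocalFlowConnector().connect(flowDef).complete();
    }
}
```

The pipe assembly itself never mentions a compute fabric; only the taps and the connector do, which is why retargeting a job is confined to those few lines.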
11.
CASCADING SLASHES TIME TO MARKET (N=79)
Most improved time to market by at least 40%.
What percentage would you estimate your time to market has improved?
[Chart: Over 300%: 5% · Over 100%: 17% · 80%-100%: 12% · 60%-80%: 18% · 40%-60%: 17% · 20%-40%: 18% · Less than 20%: 13%]
12.
What future challenges do you anticipate in managing your data applications? (N=69)
[Chart responses: Optimizing application performance, Identify and resolve Hadoop application issues faster, Monitoring SLAs for Hadoop applications, Forecasting big data infrastructure needs, Supporting chargeback models, Other.]
THE FUTURE: BETTER PERFORMANCE, DATA PIPELINE VISIBILITY
Application performance management is a top-of-mind concern for most respondents. While performance tuning happens on the operations side, optimizing applications to meet service-level commitments is usually a collaborative effort between development and operations teams. Developers need better tools to visualize data pipelines and detect undesirable behavior before they promote applications to production. Operations teams need better tools to monitor, manage, and optimize data delivery.
An important, though secondary, concern is tracking the rate of Hadoop resource consumption so clusters can be right-sized and costs distributed across divisions. This is particularly true as more of an organization's departments and teams build and rely on big data applications, transforming their Hadoop cluster from a side project into core production IT infrastructure.
With new application performance management tools such as Driven, teams can visualize data pipelines and identify unwanted behavior more effectively. Tools like Driven also arm teams with the data necessary to pinpoint issues quickly and resolve them collaboratively.
14.
DISTRIBUTIONS (N=69)
[Chart: count of respondents by Hadoop distribution: Cloudera, Amazon EMR, Apache Hadoop, Hortonworks, MapR, Other.]
15.
NUMBER OF APPLICATIONS AND VOLUME
[Chart: "Average Number of Cascading Applications and Pipelines" (N=69). Applications per respondent bucketed as 1-5, 5-15, 15-30, 30-60, 60-100, and Over 100; pipelines bucketed from Less than 250 up to Over 10,000. Most respondents (39 of 69) report fewer than 250 pipelines.]
16.
PRODUCTION STATUS (N=69)
Are you using your Cascading data applications in a production environment?
[Chart responses: Yes, Not yet but planned, No and not planned.]