The document discusses real-time processing in Hadoop using the Hortonworks Data Platform (HDP). It provides an overview of using HDP for real-time streaming analytics in a logistics scenario. Example applications and architectures are presented, including using Kafka for ingesting sensor data, Storm for stream processing, and HBase for real-time querying. Demos will also illustrate integrating predictive analytics into streaming scenarios.
Storm Demo Talk - Colorado Springs May 2015 (Mac Moore)
The document discusses real-time processing capabilities in Hadoop and the Hortonworks Data Platform (HDP). It begins with an introduction to Hortonworks and an overview of real-time streaming architectures on HDP, then demonstrates streaming capabilities through two demos relevant to logistics companies: a basic streaming scenario, and the same scenario with predictive analytics added. It highlights how HDP's centralized architecture and open data platform enable real-time and batch processing of any type of data for analytics applications.
Trucking demo w Spark ML - Paul Hargis - Hortonworks (Kelly Kohlleffel)
A trucking company generates millions of event logs from its fleet of trucks that are monitored in real-time. These include normal driving events as well as violation events like speeding. The company analyzes these event logs using Hadoop to understand routes, trucks, and drivers that are more prone to violations. Streaming data is processed using Storm on Hadoop and violations are detected. Machine learning models are trained using Spark on the historical enriched event data to predict violations in real-time and provide recommendations to reduce violations.
Enabling the Real Time Analytical Enterprise (Hortonworks)
This document discusses enabling real-time analytics in the enterprise. It begins with an overview of the challenges of real-time analytics due to non-integrated systems, varied data types and volumes, and data management complexity. A case study on real-time quality analytics in automotive is presented, highlighting the need to analyze varied data sources quickly to address issues. The Hortonworks/Attunity solution is then introduced using Attunity Replicate to integrate data from various sources in real-time into Hortonworks Data Platform for analysis. A brief demonstration of data streaming from a database into Kafka and then Hortonworks Data Platform is shown.
Introduction to Hortonworks Data Platform (Hortonworks)
This document introduces the Hortonworks Data Platform. It summarizes the key features of the platform, including its ability to simplify deployment, monitor and manage large clusters, integrate with any data source, and provide metadata services. The document demonstrates the Hortonworks Management Center and features for high availability, data integration, and metadata services. It concludes by discussing training, support, and certification services available from Hortonworks.
Powering Fast Data and the Hadoop Ecosystem with VoltDB and Hortonworks (Hortonworks)
Developers increasingly are building dynamic, interactive real-time applications on fast streaming data to extract maximum value from data in the moment. To do so requires a data pipeline, the ability to make transactional decisions against state, and an export functionality that pushes data at high speeds to long-term Hadoop analytics stores like Hortonworks Data Platform (HDP). This enables data to arrive in your analytic store sooner, and allows these analytics to be leveraged with radically lower latency.
But successfully writing fast data applications that manage, process, and export streams of data generated by mobile and smart devices, sensors, and social interactions is a big challenge.
Join Hortonworks and VoltDB, an in-memory scale-out relational database that simplifies fast data application development, to learn how you can ingest large volumes of fast-moving, streaming data and process it in real time. We will also cover how developing fast data applications becomes simpler and faster, and delivers more value, when built on a fast in-memory, scale-out SQL database.
Schlumberger is the world's largest oilfield services company, helping customers find and produce oil and gas. It faces big data challenges in its upstream operations, which span everything from subsurface activity to the wellhead. Schlumberger uses Hadoop to analyze vast amounts of sensor data to improve operations, and has seen positive results such as reduced costs and improved recovery rates.
Deep learning with Hortonworks and Apache Spark - Hortonworks technical workshop (Hortonworks)
Rich media is exploding all around us. From our personal usage to retailers monitoring store traffic for optimized associate placement, there is wide and growing application of rich media. Despite this pervasive usage, enterprises have had a limited choice of generally available tools to analyze rich media. In this session we will look into leveraging deep learning algorithms for rich media analysis and provide a practical, hands-on example of image recognition using Apache Hadoop and Spark.
Analytics Modernization: Configuring SAS® Grid Manager for Hadoop (Hortonworks)
Improve the efficiency and accelerate job execution by moving traditional SAS workloads into Hadoop to modernize and optimize SAS analytics. How can we run traditional SAS® jobs, including SAS® Workspace Servers, on Hadoop worker nodes? The answer is SAS® Grid Manager for Hadoop, which is integrated with the Hadoop ecosystem to provide resource management, high availability and enterprise scheduling for SAS customers. By moving SAS workloads inside the Hadoop cluster, efficiency is improved and job execution is accelerated. We will also cover the role of Hadoop YARN, Hadoop Distributed File System (HDFS) storage, and Hadoop client services. We review SAS metadata definitions for SAS Grid Manager, SAS® Object Spawner, and SAS® Workspace Servers. Audio broadcast: https://hortonworks.com/webinar/configuring-sas-grid-manager-hadoop/
Hortonworks Data In Motion Series Part 4 (Hortonworks)
How real-world enterprises leverage Hortonworks DataFlow/Apache NiFi to create real-time data flows in record time, enabling new business opportunities, improving customer retention, and accelerating big data projects from months to minutes through increased efficiency and reduced costs.
On-Demand webinar: http://hortonworks.com/webinar/paradigm-shift-business-usual-real-time-dataflows-record-time/
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features (Hortonworks)
Hortonworks DataFlow (HDF) is the complete solution that addresses the most complex streaming architectures of today’s enterprises. More than 20 billion IoT devices are active on the planet today, and thousands of use cases across IIoT, healthcare, and manufacturing warrant capturing data-in-motion and delivering actionable intelligence right NOW. “Data decay” happens in a matter of seconds in today’s digital enterprises.
To meet all the needs of such fast-moving businesses, we have made significant enhancements and new streaming features in HDF 3.1.
https://hortonworks.com/webinar/series-hdf-3-1-technical-deep-dive-new-streaming-features/
Hortonworks for Financial Analysts Presentation (Hortonworks)
Hortonworks was founded in 2011 by former Yahoo engineers to support the growth of Apache Hadoop. Their strategy is to overcome technology gaps by making Hadoop easier to install and use, enable an ecosystem of partners by defining open APIs, and overcome knowledge gaps by expanding technical content and training. This will help drive wider adoption of Apache Hadoop as the platform for managing big data in the enterprise.
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ... (Hortonworks)
Companies in every industry look for ways to explore new data types and large data sets that were previously too big to capture, store and process. They need to unlock insights from data such as clickstream, geo-location, sensor, server log, social, text and video data. However, becoming a data-first enterprise comes with many challenges.
Join this webinar organized by three leaders in their respective fields and learn from our experts how you can accelerate the implementation of a scalable, cost-efficient and robust Big Data solution. Cisco, Hortonworks and Red Hat will explore how new data sets can enrich existing analytic applications with new perspectives and insights and how they can help you drive the creation of innovative new apps that provide new value to your business.
Streamline Apache Hadoop Operations with Apache Ambari and SmartSense (Hortonworks)
Apache Ambari 2.5 helps customers simplify the experience of provisioning, managing, monitoring, securing and troubleshooting Hadoop deployments. Find out how the combination of Ambari and SmartSense delivers a path to success to help IT get Hadoop up and running effectively. The end result: you get the full business impact and benefits of Big Data for your organization.
https://hortonworks.com/webinar/streamline-apache-hadoop-operations-apache-ambari-smartsense/
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca... (Hortonworks)
The document discusses a Big Data Meetup organized by C-BAG (Chennai Big Data Analytic Group) on October 29, 2014 in Chennai. It provides details about two speakers, Dhruv Kumar from Concurrent Inc. and Vinay Shukla from Hortonworks, who will discuss reducing development time for production-grade Hadoop applications and Hortonworks' Hadoop platform respectively. The remainder of the document consists of presentation slides that cover topics including the modern data architecture with Hadoop, enterprise goals for data architecture, unlocking applications from new data types, and case studies.
Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next (Hortonworks)
The document discusses new features in Apache Hive 0.14 that improve SQL query performance. It introduces a cost-based optimizer that can optimize join orders, enabling faster query times. An example TPC-DS query is shown to demonstrate how the optimizer selects an efficient join order based on statistics about table and column sizes. Faster SQL queries are now possible in Hive through this query optimization capability.
Powering Big Data Success On-Prem and in the Cloud (Hortonworks)
How do you optimize Apache Spark workloads in the cloud? How do you tune your resources for maximum performance and efficiency? Find out how the new Hortonworks Flex support subscription enables IT agility and success in the cloud. We will cover:
* Options for running Data Science, Analytics and ETL workloads in the cloud
* Hortonworks support offerings including new Flex Support Subscription
* How to run Cloud workloads more efficiently with SmartSense
* Case study on the impact of SmartSense
https://hortonworks.com/webinar/powering-big-data-success-cloud/
Hadoop Operations, Innovations and Enterprise Readiness with Hortonworks Data... (Hortonworks)
1. Hortonworks Data Platform 1.2 focuses on continued innovation with Apache Ambari and enhanced security and performance for Hive and HCatalog.
2. Key features include root cause analysis, usage heat maps, and improved ecosystem integration in Ambari, as well as enhanced security models and concurrency improvements.
3. Hortonworks ensures tight alignment with open source Apache projects by certifying the latest stable components and contributing leadership and code back to projects.
This document provides an agenda and overview of topics for a Hortonworks data movement and management meetup. The agenda includes networking, introductions, discussions on Falcon use cases and releases, Hive disaster recovery, server-side extensions, ADF/instance search, Hive-based ingestion/export, Spark integration, and Sqoop 2 features. An overview of Falcon describes its high-level abstraction of Hadoop data processing services. Usage scenarios focus on dataset replication, lifecycle management, and lineage/traceability. The document also discusses Falcon examples for replication, retention, and late data handling.
Predicting Customer Experience through Hadoop and Customer Behavior Graphs (Hortonworks)
Enhancing the customer experience has become essential for communication service providers to effectively manage customer churn and build a strong, long-lasting relationship with their customers. This has become increasingly challenging as customer interactions occur across multiple channels. Understanding customer behavior and how it applies across channels is the key to ensuring the best level of experience is achieved by each customer.
In this webinar Hortonworks and Apigee discuss how service providers can capture and visualize customer behavior across customer interaction points like call center events (IVR and chat) and combine it with network data, to predict customer calls and patterns of digital channel abandonment using Hadoop and predictive analysis and visualization tools.
We will identify ways to develop a 360 degree view across a customer’s household through an HDP Data Lake and visualize customer interaction patterns and predict expected behavior using Apigee Insights to identify and initiate the Next-Best-Action for a customer to ensure a superior level of customer experience.
Hortonworks Data In Motion Webinar Series Pt. 2 (Hortonworks)
This document discusses Hortonworks' HDF 2.0 platform for managing data in motion and at rest. The platform includes tools for data ingestion, streaming, and storage. It also allows partners to integrate their solutions and get certified. Use cases highlighted include log analytics, IoT, and connected vehicles. The ecosystem supports ingesting data from various sources and processing it using tools like NiFi, Kafka, and Storm.
This document discusses the author's 10 year journey with Hadoop, from 2006 to 2016. It describes the evolution of key Hadoop technologies like HDFS, MapReduce, YARN and the addition of engines for SQL, NoSQL, streaming and in-memory processing. The document also addresses trends around growth of data from devices, users and the internet of things. It presents a vision of the future where Hadoop (YARN.next) will assemble and securely operate a flexible menu of data access applications and engines.
Your Self-Driving Car - How Did it Get So Smart? (Hortonworks)
This document summarizes a presentation given by Michael Ger, Dr. Andreas Pawlik, and Dr. Seunghan Han of NorCom and Hortonworks about their DaSense data science platform. DaSense is designed to help researchers developing autonomous vehicle systems by allowing them to more efficiently run simulations and test algorithms on large datasets using distributed high performance computing resources. It aims to accelerate the development process by enabling experiments that previously took days to be completed within hours or minutes by leveraging large compute clusters. DaSense provides tools for building end-to-end data science pipelines for tasks like data filtering, model training, evaluation and analysis.
Hortonworks provides an overview of their Tez framework for improving Hadoop query processing. Tez aims to accelerate queries by expressing them as dataflow graphs that can be optimized, rather than relying solely on MapReduce. It also aims to empower users by allowing flexible definition of data pipelines and composition of inputs, processors, and outputs. Early results show a 100x speedup on benchmark queries compared to traditional MapReduce.
Eric Baldeschwieler Keynote from Storage Developers Conference (Hortonworks)
- Apache Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It allows for the reliable storage of petabytes of data and large-scale computations across commodity hardware.
- Apache Hadoop is used widely by internet companies to analyze web server logs, power search engines, and gain insights from large amounts of social and user data. It is also used for machine learning, data mining, and processing audio, video, and text data.
- The future of Apache Hadoop includes making it more accessible and easy to use for enterprises, addressing gaps like high availability and management, and enabling partners and the community to build on it through open APIs and a modular architecture.
1) The webinar covered Apache Hadoop on the open cloud, focusing on key drivers for Hadoop adoption like new types of data and business applications.
2) Requirements for enterprise Hadoop include core services, interoperability, enterprise readiness, and leveraging existing skills in development, operations, and analytics.
3) The webinar demonstrated Hortonworks Apache Hadoop running on Rackspace's Cloud Big Data Platform, which is built on OpenStack for security, optimization, and an open platform.
Apache Flink: Real-World Use Cases for Streaming Analytics (Slim Baltagi)
This face-to-face talk about Apache Flink in Sao Paulo, Brazil is the first event of its kind in Latin America! It explains how Apache Flink 1.0, announced on March 8th, 2016 by the Apache Software Foundation, marks a new era of Big Data analytics and in particular real-time streaming analytics. The talk maps Flink's capabilities to real-world use cases that span multiple verticals such as: Financial Services, Healthcare, Advertisement, Oil and Gas, Retail and Telecommunications.
In this talk, you learn more about:
1. What is Apache Flink Stack?
2. Batch vs. Streaming Analytics
3. Key Differentiators of Apache Flink for Streaming Analytics
4. Real-World Use Cases with Flink for Streaming Analytics
5. Who is using Flink?
6. Where do you go from here?
This document provides an overview of resource aware scheduling in Apache Storm. It discusses the challenges of scheduling Storm topologies at Yahoo scale, including increasing heterogeneous clusters, low cluster utilization, and unbalanced resource usage. It then introduces the Resource Aware Scheduler (RAS) built for Storm, which allows fine-grained resource control and isolation for topologies through APIs and cgroups. Key features of RAS include pluggable scheduling strategies, per user resource guarantees, and topology priorities. Experimental results from Yahoo Storm clusters show significant improvements to throughput and resource utilization with RAS. The talk concludes with future work on improved scheduling strategies and real-time resource monitoring.
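As a rough illustration of the fine-grained resource control described above, here is a minimal sketch using the RAS hints in Storm's 1.x Java API; EventSpout and ParseBolt are hypothetical components, and the memory/CPU figures are arbitrary placeholders.

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class RasTopologySketch {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // EventSpout and ParseBolt are hypothetical components for this sketch.
        builder.setSpout("events", new EventSpout(), 2)
               .setMemoryLoad(512.0)  // on-heap memory request, in MB
               .setCPULoad(50.0);     // CPU request: 100 roughly equals one core

        builder.setBolt("parse", new ParseBolt(), 4)
               .shuffleGrouping("events")
               .setMemoryLoad(256.0)
               .setCPULoad(25.0);

        Config conf = new Config();
        conf.setTopologyPriority(10);               // lower number = higher priority
        conf.setTopologyWorkerMaxHeapSize(1024.0);  // cap per-worker heap, in MB
        StormSubmitter.submitTopology("ras-demo", conf, builder.createTopology());
    }
}
```

With declarations like these, the scheduler can pack components onto heterogeneous nodes by actual need rather than by a flat slot count, which is the utilization problem the talk describes.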
This document discusses how to use Storm and Hadoop together to enable real-time and batch processing of large datasets. It describes using Hadoop to precompute batch views of data, and Storm to incrementally update real-time views as new data streams in. This allows for low-latency queries by combining precomputed batch views with real-time views that compensate for recent data not yet absorbed into the batch views.
Storm: distributed and fault-tolerant realtime computation (nathanmarz)
Storm is a distributed real-time computation system that provides guaranteed message processing, horizontal scalability, and fault tolerance. It allows users to define data processing topologies and submit them to a Storm cluster for distributed execution. Spouts emit streams of tuples that are processed by bolts. Storm tracks processing to ensure reliability and replays failed tasks. It provides tools for deployment, monitoring, and optimization of real-time data processing.
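To make the spout/bolt/ack model above concrete, here is a minimal bolt sketch against Storm's 1.x Java API; the component and field names are made up for the example. Anchoring each emitted tuple to its input is what lets Storm track the tuple tree and replay from the spout when processing fails.

```java
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// A minimal bolt illustrating Storm's ack-based reliability model.
public class UppercaseBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            String word = input.getStringByField("word");
            // Anchoring to 'input' ties this emit into the input's tuple tree.
            collector.emit(input, new Values(word.toUpperCase()));
            collector.ack(input);   // mark the input as fully processed
        } catch (Exception e) {
            collector.fail(input);  // ask Storm to replay the tuple
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("upper"));
    }
}
```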
Bobby Evans and Tom Graves, the engineering leads for Spark and Storm development at Yahoo, will talk about how these technologies are used on Yahoo's grids and the reasons to use one or the other.
Bobby Evans is the low-latency data processing architect at Yahoo. He is a PMC member on many Apache projects, including Storm, Hadoop, Spark, and Tez. His team is responsible for delivering Storm as a service to all of Yahoo and maintaining Spark on YARN for Yahoo (although Tom really does most of that work).
Tom Graves is a Senior Software Engineer on the Platform team at Yahoo. He is an Apache PMC member on Hadoop, Spark, and Tez. His team is responsible for delivering and maintaining Spark on YARN for Yahoo.
Apache Storm 0.9 basic training - Verisign (Michael Noll)
Apache Storm 0.9 basic training (130 slides) covering:
1. Introducing Storm: history, Storm adoption in the industry, why Storm
2. Storm core concepts: topology, data model, spouts and bolts, groupings, parallelism
3. Operating Storm: architecture, hardware specs, deploying, monitoring
4. Developing Storm apps: Hello World, creating a bolt, creating a topology, running a topology, integrating Storm and Kafka, testing, data serialization in Storm, example apps, performance and scalability tuning
5. Playing with Storm using Wirbelsturm
Audience: developers, operations, architects
Created by Michael G. Noll, Data Architect, Verisign, https://www.verisigninc.com/
Verisign is a global leader in domain names and internet security.
Tools mentioned:
- Wirbelsturm (https://github.com/miguno/wirbelsturm)
- kafka-storm-starter (https://github.com/miguno/kafka-storm-starter)
Blog post at:
http://www.michael-noll.com/blog/2014/09/15/apache-storm-training-deck-and-tutorial/
Many thanks to the Twitter Engineering team (the creators of Storm) and the Apache Storm open source community!
Hadoop Summit Europe 2014: Apache Storm Architecture (P. Taylor Goetz)
Storm is an open-source distributed real-time computation system. It uses a distributed messaging system to reliably process streams of data. The core abstractions in Storm are spouts, which are sources of streams, and bolts, which are basic processing elements. Spouts and bolts are organized into topologies which represent the flow of data. Storm provides fault tolerance through message acknowledgments, which guarantee at-least-once processing. Trident is a high-level abstraction built on Storm that supports operations like aggregations, joins, and state management, and adds exactly-once semantics through its micro-batch oriented, stream-based API.
This document provides an overview of real-time processing capabilities on Hortonworks Data Platform (HDP). It discusses how a trucking company uses HDP to analyze sensor data from trucks in real-time to monitor for violations and integrate predictive analytics. The company collects data using Kafka and analyzes it using Storm, HBase and Hive on Tez. This provides real-time dashboards as well as querying of historical data to identify issues with routes, trucks or drivers. The document explains components like Kafka, Storm and HBase and how they enable a unified YARN-based architecture for multiple workloads on a single HDP cluster.
Internet of Things Crash Course Workshop at Hadoop Summit (DataWorks Summit)
This document provides an overview of how a trucking company can use Hortonworks Data Platform (HDP) to gain insights from real-time streaming data generated by sensors in its trucks. The company wants to monitor trucks for locations, violations, and other events. HDP allows the company to ingest streaming data from trucks using Kafka and analyze it in real-time with Storm for alerts or serve it to applications with HBase. The company can also run interactive queries on historical data with Hive and Tez. All of this is run on a single HDP cluster for consistent governance, security, and operations across batch and real-time workloads.
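As a hedged sketch of the "serve it to applications with HBase" step, the snippet below writes one truck event and scans it back with the standard HBase 1.x client API. The table name, column family, and row-key scheme are assumptions for illustration, not the workshop's actual schema.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class TruckEventStoreSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("truck_events"))) {

            // Hypothetical row key: driverId plus a reversed timestamp,
            // so the newest events for a driver sort first.
            String rowKey = "driver42|" + (Long.MAX_VALUE - System.currentTimeMillis());
            Put put = new Put(Bytes.toBytes(rowKey));
            put.addColumn(Bytes.toBytes("e"), Bytes.toBytes("type"), Bytes.toBytes("speeding"));
            put.addColumn(Bytes.toBytes("e"), Bytes.toBytes("route"), Bytes.toBytes("I-10"));
            table.put(put);  // a Storm bolt could do this per event

            // Real-time lookup: scan recent events for one driver.
            Scan scan = new Scan().setRowPrefixFilter(Bytes.toBytes("driver42|"));
            try (ResultScanner results = table.getScanner(scan)) {
                for (Result r : results) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}
```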
Rescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop (Hortonworks)
How can you simplify the management and monitoring of your Hadoop environment, and ensure IT can focus on the right business priorities supported by Hadoop? Take a look at this presentation to find out.
Hortonworks - What's Possible with a Modern Data Architecture? (Hortonworks)
This is Mark Ledbetter's presentation from the September 22, 2014 Hortonworks webinar “What’s Possible with a Modern Data Architecture?” Mark is vice president for industry solutions at Hortonworks. He has more than twenty-five years' experience in the software industry, with a focus on retail and supply chain.
Mr. Slim Baltagi is a Systems Architect at Hortonworks, with over 4 years of Hadoop experience working on 9 Big Data projects: Advanced Customer Analytics, Supply Chain Analytics, Medical Coverage Discovery, Payment Plan Recommender, Research Driven Call List for Sales, Prime Reporting Platform, Customer Hub, Telematics, Historical Data Platform; with Fortune 100 clients and global companies from Financial Services, Insurance, Healthcare and Retail.
Mr. Baltagi has worked in various architecture, design, development, and consulting roles at Accenture, CME Group, TransUnion, Syntel, Allstate, TransAmerica, Credit Suisse, Chicago Board Options Exchange, Federal Reserve Bank of Chicago, CNA, Sears, USG, ACNielsen, and Deutsche Bahn.
Mr. Baltagi has also over 14 years of IT experience with an emphasis on full life cycle development of Enterprise Web applications using Java and Open-Source software. He holds a master’s degree in mathematics and is an ABD in computer science from Université Laval, Québec, Canada.
Languages: Java, Python, JRuby, JEE, PHP, SQL, HTML, XML, XSLT, XQuery, JavaScript, UML, JSON
Databases: Oracle, MS SQL Server, MySQL, PostgreSQL
Software: Eclipse, IBM RAD, JUnit, JMeter, YourKit, PVCS, CVS, UltraEdit, Toad, ClearCase, Maven, iText, Visio, Jasper Reports, Alfresco, YSlow, Terracotta, SoapUI, Dozer, Sonar, Git
Frameworks: Spring, Struts, AppFuse, SiteMesh, Tiles, Hibernate, Axis, Selenium RC, DWR Ajax, XStream
Distributed Computing/Big Data: Hadoop, MapReduce, HDFS, Hive, Pig, Sqoop, HBase, R, RHadoop, Cloudera CDH4, MapR M7, Hortonworks HDP 2.1
Supporting Financial Services with a More Flexible Approach to Big Data (Hortonworks)
The document discusses how Hortonworks Data Platform (HDP) enables a modern data architecture with Apache Hadoop. HDP provides a common data set stored in HDFS that can be accessed through various applications for batch, interactive, and real-time processing. This allows organizations to store all their data in one place and access it simultaneously through multiple means. YARN is the architectural center of HDP and enables this modern data architecture. HDP also provides enterprise capabilities like security, governance, and operations to make Hadoop suitable for business use.
Apache Ambari is a single framework for IT administrators to provision, manage and monitor a Hadoop cluster. Apache Ambari 1.7.0 is included with Hortonworks Data Platform 2.2.
In this 30-minute webinar, Hortonworks Product Manager Jeff Sposetti and Apache Ambari committer Mahadev Konar discussed new capabilities including:
Improvements to Ambari core - such as support for ResourceManager HA
Extensions to Ambari platform - introducing Ambari Administration and Ambari Views
Enhancements to Ambari Stacks - dynamic configuration recommendations and validations via a "Stack Advisor"
Introduction to the Hortonworks YARN Ready Program (Hortonworks)
The recently launched YARN Ready Program will accelerate multi-workload Hadoop in the Enterprise. The program enables developers to integrate new and existing applications with YARN-based Hadoop. We will cover:
--the program and its benefits
--why it is important to customers
--tools and guides to help you get started
--technical resources to support you
--marketing recognition you can leverage
Mrinal Devadas, Hortonworks - Making Sense Of Big Data (PatrickCrompton)
This document provides an overview of Hortonworks and its Hortonworks Data Platform (HDP). Hortonworks develops, distributes and supports HDP, which is the only 100% open source Apache Hadoop distribution. Hortonworks focuses on innovation in Apache Hadoop projects, addressing enterprise requirements, enabling ecosystem interoperability, and ensuring no vendor lock-in through its open source approach. The document discusses Hortonworks' contributions to Apache Hadoop and other projects, as well as how HDP can be used for operational data refinery, big data exploration, and application enrichment.
Trafodion – an enterprise class SQL based on Hadoop (Krishna-Kumar)
Trafodion is a joint HP Labs and HP-IT research project to develop an enterprise-class SQL on Hadoop DBMS engine that specifically targets operational workloads as opposed to analytic workloads. Operational SQL describes workloads previously described as OLTP (online transaction processing) and Operational Data Store (ODS) workloads, but expands that definition from the broad range of enterprise-level transactional applications (ERP, CRM, etc.) to include the new transactions generated from social and mobile data interactions and observations, and the new mixing of structured and semi-structured data.
Pivotal deep dive on Pivotal HD world class HDFS platform (EMC)
The document discusses Pivotal HD, a Hadoop distribution from Pivotal. It provides an overview of key features of Pivotal HD 2.0 including improved support for real-time analytics using Gemfire XD, enhanced machine learning and SQL capabilities, and integration with the Isilon storage platform. The presentation highlights how Pivotal HD can help customers build a "data lake" to store all of their data and gain insights to create new data-driven services and applications.
Hortonworks provides an open source Apache Hadoop distribution called Hortonworks Data Platform (HDP). Their mission is to enable modern data architectures through delivering enterprise Apache Hadoop. They have over 300 employees and are headquartered in Palo Alto, CA. Hortonworks focuses on driving innovation through the open source Apache community process, integrating Hadoop with existing technologies, and engineering Hadoop for enterprise reliability and support.
Supporting Financial Services with a More Flexible Approach to Big Data (WANdisco Plc)
In this webinar, WANdisco and Hortonworks look at three examples of using 'Big Data' to get a more comprehensive view of customer behavior and activity in the banking and insurance industries. Then we'll pull out the common threads from these examples, and see how a flexible next-generation Hadoop architecture lets you get a step up on improving your business performance. Join us to learn:
- How to leverage data from across an entire global enterprise
- How to analyze a wide variety of structured and unstructured data to get quick, meaningful answers to critical questions
- What industry leaders have put in place
The document discusses how Hadoop can be used for interactive and real-time data analysis. It notes that the amount of digital data is growing exponentially and will reach 40 zettabytes by 2020. Traditional data systems are struggling to manage this new data. Hadoop provides a solution by tying together inexpensive servers to act as one large computer for processing big data using various Apache projects for data access, governance, security and operations. Examples show how Hadoop can be used to analyze real-time streaming data from sensors on trucks to monitor routes, vehicles and drivers.
Hortonworks and Red Hat Webinar - Part 2 (Hortonworks)
Learn more about creating reference architectures that optimize the delivery of the Hortonworks Data Platform. You will hear more about Hive and JBoss Data Virtualization security, and you will also see in action how to combine sentiment data from Hadoop with data from traditional relational sources.
Hortonworks Hadoop @ Oslo Hadoop User Group (Mats Johansson)
This document provides an overview of Hortonworks and Hadoop. It covers Hortonworks' customer momentum, the Hortonworks Data Platform (HDP) as a multi-tenant platform for any application and data, and Hortonworks' role as a partner for customer success through its open source community leadership and support. It also summarizes the challenges of traditional data systems, how Hadoop emerged as the foundation of a modern data architecture that unifies data processing and analytics for both traditional and new data sources, and how HDP delivers a comprehensive data management platform that drives business value.
Hortonworks and Platfora in Financial Services - Webinar (Hortonworks)
Big Data Analytics is transforming how banks and financial institutions unlock insights, make more meaningful decisions, and manage risk. Join this webinar to see how you can gain a clear understanding of the customer journey by leveraging Platfora to interactively analyze the mass of raw data that is stored in your Hortonworks Data Platform. Our experts will highlight use cases, including customer analytics and security analytics.
Speakers: Mark Lochbihler, Partner Solutions Engineer at Hortonworks, and Bob Welshmer, Technical Director at Platfora
Similar to Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG
Why do we now have electric cars at production scale, quadcopters and drones cheap enough for the home hobbyist, and VR displays being bought by companies like Facebook?
Because the technology is there now, thanks to advances made in other industries solving problems at scale in a big marketplace.
At Scale (in this case):
$270 billion smartphone market in 2014
$120 billion internet advertising (projected 2015)
Before we dive into Hadoop and its role within the modern data architecture, let’s set the context for why Hadoop has become important.
Existing approaches to data management have become both technically and commercially impractical.
Technically, these systems were never designed to store or process vast quantities of data.
Commercially, the licensing structures of the traditional approach are no longer feasible.
These two challenges, combined with the rate at which data is being produced, created the need for a new approach to data systems. If we fast-forward another 3 to 5 years, more than half of the data under management within the enterprise will be from these new data sources.
Our goal since our inception has been very simple: to enable a Modern Data Architecture with Enterprise Hadoop. Everything we do is with this architectural goal in mind.
Single focus - enabling Apache Hadoop as an enterprise data platform for any app and any data type
In the open, partner for success.
Everything in the open.
Joint deep engineering with Microsoft (HD Insight), HP, SAP, and Teradata
In 2011, Hortonworks was founded with the 24 original Hadoop architects and engineers from Yahoo!
This original team had been working on a technology called YARN (Yet Another Resource Negotiator) that enables multiple applications to access all your enterprise data through an efficient centralized platform. It is the data operating system for Hadoop, providing the versatility to handle any application and dataset no matter the size or type.
Moreover, YARN provided the centralized architecture around which the critical enterprise services of Security, Operations, and Governance could be centrally addressed and integrate with existing enterprise policies.
This work allowed a new approach to data to emerge: the modern data architecture. At the heart of this approach is the capability for Hadoop to unify data and processing in an efficient data platform.
Our product, the Hortonworks Data Platform (or HDP for short) is a completely open source, enterprise-grade data platform that’s comprised of dozens of Apache open source projects including Apache Hadoop and YARN at its center.
We have a comprehensive engineering, testing, and certification process that integrates and packages all of these components into a cohesive platform that the enterprise can consume and deploy at scale. And our model enables us to proactively manage new innovations and new open source projects into HDP as they emerge.
To ensure the highest quality, we have a test suite, unique to Hortonworks, comprising tens of thousands of system and integration tests that we run at scale on a regular basis, including on the world’s largest Hadoop clusters at Yahoo! as part of our co-development relationship.
While our pure-play competitors focus on proprietary components for security, operations, and governance, we invest in new open source projects that address these areas.
For example, earlier in 2014, we acquired a small company called XA Secure that provided a comprehensive security and administration product. We then contributed the technology wholesale to open source as Apache Ranger.
Since our security, operations and governance technologies are open source projects, our partners are able to work with us on those projects to ensure deep integration within our joint solution architectures.
An Elasticsearch Flume sink does exist.
The key abstraction in Kafka is the topic. Producers publish their records to a topic, and consumers subscribe to one or more topics.
A Kafka topic is just a sharded write-ahead log.
Messages are not deleted when they are read but retained with some configurable SLA (say a few days or a week). This allows usage in situations where the consumer of data may need to reload data.
It also makes it possible to support space-efficient publish-subscribe as there is a single shared log no matter how many consumers; in traditional messaging systems there is usually a queue per consumer, so adding a consumer doubles your data size. This makes Kafka a good fit for things outside the bounds of normal messaging systems such as acting as a pipeline for offline data systems such as Hadoop. These offline systems may load only at intervals as part of a periodic ETL cycle, or may go down for several hours for maintenance, during which time Kafka is able to buffer even TBs of unconsumed data if needed
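A minimal producer sketch for the topic abstraction described above, using the standard Kafka Java client; the broker address, topic name, key, and payload are placeholders for this example. (The retention window mentioned above is broker-side configuration, e.g. log.retention.hours.)

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SensorEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");  // placeholder broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by truck ID sends all of a truck's events to one partition
            // of the topic's log, so they stay in order within that partition.
            producer.send(new ProducerRecord<>("truck-events", "truck-17", "speeding|I-10|88mph"));
        }
    }
}
```

Because the broker keeps the shared log regardless of how many consumers attach, a slow offline consumer such as a periodic Hadoop ETL job can fall behind and catch up later without any producer-side changes.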
Replication for HA/fault tolerance is built in.
Pull-based system for consumers instead of push-based.
Crude benchmark:
Basically, single-threaded synchronous messages reach about 400k per second when using 6 "datanode-ish" servers. This goes up to 2+ million per second when using partitions and asynchronous messages. Server specs in the benchmark:
Intel Xeon 2.5 GHz processor with six cores
Six 7200 RPM SATA drives
32GB of RAM
1Gb Ethernet
A traditional queue retains messages in-order on the server, and if multiple consumers consume from the queue then the server hands out messages in the order they are stored. However, although the server hands out messages in order, the messages are delivered asynchronously to consumers, so they may arrive out of order on different consumers. This effectively means the ordering of the messages is lost in the presence of parallel consumption. Messaging systems often work around this by having a notion of "exclusive consumer" that allows only one process to consume from a queue, but of course this means that there is no parallelism in processing.
Kafka does it better. By having a notion of parallelism—the partition—within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order.
http://www.quora.com/Kafka-writes-every-message-to-broker-disk-Still-performance-wise-it-is-better-than-some-of-the-in-memory-message-storing-message-queues-Why-is-that
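A minimal consumer sketch of the consumer-group model just described, using the Kafka Java client (poll(Duration) assumes client 2.0 or newer); the broker address, group id, and topic are placeholders. Every consumer started with the same group.id divides the topic's partitions among the group, so each partition is read in order by exactly one group member.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SensorEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");  // placeholder broker
        props.put("group.id", "dashboard");  // same group.id => partitions are load-balanced
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("truck-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    // Records from a single partition always arrive in order.
                    System.out.printf("partition=%d key=%s value=%s%n",
                            r.partition(), r.key(), r.value());
                }
            }
        }
    }
}
```

Starting a second copy of this process with the same group.id triggers a rebalance: the partitions split between the two consumers, giving parallelism without losing per-partition ordering.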
This all changed with the introduction of Hadoop 2 and YARN in October 2013.
YARN was introduced in MR-279 by Arun Murthy in 2009; Arun and the team at Hortonworks architected and led its development as the core change in Hadoop 2. Our view was that to truly enable Hadoop as a component of a broad data architecture, YARN was the fundamental requirement, as it turns Hadoop from a single-application data system into a multi-application data system. This is foundational to our approach of innovating from the core outwards to build Enterprise Hadoop.
With YARN it is now possible to land all data in one cluster and then access it in multiple ways: from batch to interactive to real-time.
Today, YARN, at the core of Hadoop is the center of our focus on innovation in and around Hadoop. It is clearly the enabling technology that has started a transition to a data lake within organizations.
Simply stated: Hortonworks architected and led the development of YARN in order to enable the Modern Data Architecture.
Data is ingested, it’s on the dashboard, and it’s in HDFS.
We’re going to explore a SUBSET of the data: <1M records.
BinaryClassification example from Spark
LogisticRegression model
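The notes reference Spark's BinaryClassification example. As a hedged sketch in the same spirit, here is a logistic regression trained with the DataFrame-based spark.ml API (Spark 2.x); the HDFS input path is a placeholder, and the data is assumed to be labeled (label, features) rows in LIBSVM format.

```java
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ViolationModelSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ViolationModel")
                .getOrCreate();

        // Placeholder path: labeled historical driving events in LIBSVM format.
        Dataset<Row> training = spark.read().format("libsvm")
                .load("hdfs:///demo/truck_events_libsvm");

        // Binary classifier: does this event pattern predict a violation?
        LogisticRegression lr = new LogisticRegression()
                .setMaxIter(10)      // gradient iterations
                .setRegParam(0.01);  // L2 regularization strength
        LogisticRegressionModel model = lr.fit(training);

        System.out.println("Coefficients: " + model.coefficients());
        spark.stop();
    }
}
```

The trained model can then be loaded inside a Storm bolt to score incoming events, which is the "predict violations in real-time" pattern the trucking demo describes.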