This document provides an introduction to Hadoop and big data concepts. It discusses what big data is and how companies like Amazon and Netflix have seen returns on investment from applying data science to large amounts of data. It then covers Hadoop and HDFS, explaining what they are, their architecture, and common commands used to work with HDFS like put, get, ls, and cat. The document is an introductory presentation on big data and Hadoop.
Trucking demo w Spark ML - Paul Hargis - HortonworksKelly Kohlleffel
A trucking company generates millions of event logs from its fleet of trucks that are monitored in real-time. These include normal driving events as well as violation events like speeding. The company analyzes these event logs using Hadoop to understand routes, trucks, and drivers that are more prone to violations. Streaming data is processed using Storm on Hadoop and violations are detected. Machine learning models are trained using Spark on the historical enriched event data to predict violations in real-time and provide recommendations to reduce violations.
Enterprise Apache Hadoop: State of the UnionHortonworks
So what's in store for 2014? This deck was from Shaun Connolly's (VP of Strategy, Hortonworks) State of the Union webinar.
In this deck, you'll find:
- Reflection on Enterprise Hadoop Market in 2013
- The latest releases and innovations within the open source community
- Highlights of what's in store for Apache Hadoop and Big Data in 2014
Big Data, Hadoop, Hortonworks and Microsoft HDInsightHortonworks
Big Data is everywhere. And at the center of the big data discussion is Apache Hadoop, a next-generation enterprise data platform that allows you to capture, process and share the enormous amounts of new, multi-structured data that doesn’t fit into transitional systems.
With Microsoft HDInsight, powered by Hortonworks Data Platform, you can bridge this new world of unstructured content with the structured data we manage today. Together, we bring Hadoop to the masses as an addition to your current enterprise data architectures so that you can amass net new insight without net new headache.
This document summarizes a webinar presented by Hortonworks and Sqrrl on using big data analytics for cybersecurity. It discusses how the growth of data sources and targeted attacks require new security approaches. A modern data architecture with Hadoop can provide a common platform to analyze all security-related data and gain new insights. Sqrrl's linked data model and analytics run on Hortonworks to help investigate security incidents like a network breach, mapping different data sources and identifying abnormal activity patterns.
Talend Open Studio and Hortonworks Data PlatformHortonworks
Data Integration is a key step in a Hadoop solution architecture. It is the first obstacle encountered once your cluster is up and running. OK, I have a cluster…now what? Complex scripts? For wide scale adoption of Apache Hadoop, an intuitive set of tools that abstract away the complexity of integration is necessary.
Accelerating the Value of Big Data Analytics for P&C Insurers with Hortonwork...Hortonworks
As the Big Data Analytics and the Apache Hadoop ecosystem has matured and gained increasing traction in established industries with faster adoption in the insurance market than originally anticipated, it is clear that the potential benefits for data management and business intelligence are staggering. At the same time, many big data programs have stalled or failed to deliver on their aspirational value proposition, resulting in a substantial gap between expectations of analytics consumers and the ability of big data analytics programs to deliver. Join Hortonworks and Clarity as we review the common needs of Property and Casualty (P&C) Insurers and how to unlock the true value of big data analytics:
Information agility – Centralization of data and decentralization of analysis
Expanded capability – Conventional analysis combined with real-time analytics demands
Reduced expense – Lower costs through cheaper storage while maintaining scalability
We will discuss a modern data architecture that constitutes a mature, enterprise strength Hadoop framework for P&C Insurers that answers the need for governance processes across the enterprise stack. We will cover how a modern data architecture allows organizations to collect, store, analyze and manipulate massive quantities of data on their own terms—regardless of the source of that data - accelerating the real lifetime value of big data and Hadoop analytics for claims, customer sentiment and telematics.
Simplify and Secure your Hadoop Environment with Hortonworks and CentrifyHortonworks
Join this webinar to explore Hadoop security challenges and trends, learn how to simply the connection of your Hortonworks Data Platform to your existing Active Directory infrastructure and hear about real world examples of organizations that are achieving the following benefits:
- Secured Hortonworks environments thanks to Active Directory infrastructure for identity and authentication.
- Increased productivity and security via single sign-on for IT admins and Hadoop users.
- Least privilege and session monitoring for privileged access to Hortonworks clusters.
Webinar URL: http://hortonworks.com/webinar/simplify-and-secure-your-hadoop-environment-with-hortonworks-and-centrify/
Trucking demo w Spark ML - Paul Hargis - HortonworksKelly Kohlleffel
A trucking company generates millions of event logs from its fleet of trucks that are monitored in real-time. These include normal driving events as well as violation events like speeding. The company analyzes these event logs using Hadoop to understand routes, trucks, and drivers that are more prone to violations. Streaming data is processed using Storm on Hadoop and violations are detected. Machine learning models are trained using Spark on the historical enriched event data to predict violations in real-time and provide recommendations to reduce violations.
Enterprise Apache Hadoop: State of the UnionHortonworks
So what's in store for 2014? This deck was from Shaun Connolly's (VP of Strategy, Hortonworks) State of the Union webinar.
In this deck, you'll find:
- Reflection on Enterprise Hadoop Market in 2013
- The latest releases and innovations within the open source community
- Highlights of what's in store for Apache Hadoop and Big Data in 2014
Big Data, Hadoop, Hortonworks and Microsoft HDInsightHortonworks
Big Data is everywhere. And at the center of the big data discussion is Apache Hadoop, a next-generation enterprise data platform that allows you to capture, process and share the enormous amounts of new, multi-structured data that doesn’t fit into transitional systems.
With Microsoft HDInsight, powered by Hortonworks Data Platform, you can bridge this new world of unstructured content with the structured data we manage today. Together, we bring Hadoop to the masses as an addition to your current enterprise data architectures so that you can amass net new insight without net new headache.
This document summarizes a webinar presented by Hortonworks and Sqrrl on using big data analytics for cybersecurity. It discusses how the growth of data sources and targeted attacks require new security approaches. A modern data architecture with Hadoop can provide a common platform to analyze all security-related data and gain new insights. Sqrrl's linked data model and analytics run on Hortonworks to help investigate security incidents like a network breach, mapping different data sources and identifying abnormal activity patterns.
Talend Open Studio and Hortonworks Data PlatformHortonworks
Data Integration is a key step in a Hadoop solution architecture. It is the first obstacle encountered once your cluster is up and running. OK, I have a cluster…now what? Complex scripts? For wide scale adoption of Apache Hadoop, an intuitive set of tools that abstract away the complexity of integration is necessary.
Accelerating the Value of Big Data Analytics for P&C Insurers with Hortonwork...Hortonworks
As the Big Data Analytics and the Apache Hadoop ecosystem has matured and gained increasing traction in established industries with faster adoption in the insurance market than originally anticipated, it is clear that the potential benefits for data management and business intelligence are staggering. At the same time, many big data programs have stalled or failed to deliver on their aspirational value proposition, resulting in a substantial gap between expectations of analytics consumers and the ability of big data analytics programs to deliver. Join Hortonworks and Clarity as we review the common needs of Property and Casualty (P&C) Insurers and how to unlock the true value of big data analytics:
Information agility – Centralization of data and decentralization of analysis
Expanded capability – Conventional analysis combined with real-time analytics demands
Reduced expense – Lower costs through cheaper storage while maintaining scalability
We will discuss a modern data architecture that constitutes a mature, enterprise strength Hadoop framework for P&C Insurers that answers the need for governance processes across the enterprise stack. We will cover how a modern data architecture allows organizations to collect, store, analyze and manipulate massive quantities of data on their own terms—regardless of the source of that data - accelerating the real lifetime value of big data and Hadoop analytics for claims, customer sentiment and telematics.
Simplify and Secure your Hadoop Environment with Hortonworks and CentrifyHortonworks
Join this webinar to explore Hadoop security challenges and trends, learn how to simply the connection of your Hortonworks Data Platform to your existing Active Directory infrastructure and hear about real world examples of organizations that are achieving the following benefits:
- Secured Hortonworks environments thanks to Active Directory infrastructure for identity and authentication.
- Increased productivity and security via single sign-on for IT admins and Hadoop users.
- Least privilege and session monitoring for privileged access to Hortonworks clusters.
Webinar URL: http://hortonworks.com/webinar/simplify-and-secure-your-hadoop-environment-with-hortonworks-and-centrify/
Enterprise Hadoop with Hortonworks and Nimble StorageHortonworks
Join us to learn how Hortonworks Data Platform and Nimble Storage provide an enterprise-ready data platform for multi-workload data processing. HDP supports an array of processing methods — from batch through interactive to real-time, with key capabilities required of an enterprise data platform — spanning Governance, Security and Operations. Nimble Storage provides the performance, capacity, and availability for HDP and allows you to take advantage of Hadoop with minimal changes to existing data architectures and skillsets.
The Next Generation of Big Data AnalyticsHortonworks
Apache Hadoop has evolved rapidly to become a leading platform for managing and processing big data. If your organization is examining how you can use Hadoop to store, transform, and refine large volumes of multi-structured data, please join us for this session where we will discuss, the emergence of "big data" and opportunities for deriving business value, the evolution of Apache Hadoop and future directions, essential components required in a Hadoop-powered platform, and solution architectures that integrate Hadoop with existing data discovery and data warehouse platforms.
2015 02 12 talend hortonworks webinar challenges to hadoop adoptionHortonworks
Hadoop is no longer optional. Companies of all sizes are in various phases of their own Big Data journey. Whether you are just starting to explore the platform or have multiple clusters up and running, everyone is presented with a similar challenge - developing their internal skillset. Hadoop specialists are hard to find. Hand coding is too prone to error when it comes to storing, integrating or analyzing your data. However, it doesn’t need to be this difficult.
In this recorded webinar, Talend and Hortonworks help you learn how to unify all your data in Hadoop, with no specialized Big Data skills.
Find the recording here. www.talend.com/resources/webinars/challenges-to-hadoop-adoption-if-you-can-dream-it-you-can-build-it
This webinar covers: How Hadoop opens a new world of analytic applications, How to bridge the skills gap with our Big Data solutions, Experience a real-world, simple technical demo
Hortonworks and Red Hat Webinar_Sept.3rd_Part 1Hortonworks
As the enterprise's big data program matures and Apache Hadoop becomes more deeply embedded in critical operations, the ability to support and operate it efficiently and reliably becomes increasingly important. To aid enterprise in operating modern data architecture at scale, Red hat and Hortonworks have collaborated to integrate Hortonworks Data Platform with Red Hat's proven platform technologies. Join us in this interactive 3-part webinar series, as we'll demonstrate how Red Hat JBoss Data Virtualization can integrate with Hadoop through Hive and provide users easy access to data.
Eliminating the Challenges of Big Data Management Inside HadoopHortonworks
Your Big Data strategy is only as good as the quality of your data. Today, deriving business value from data depends on how well your company can capture, cleanse, integrate and manage data. During this webinar, we discussed how to eliminate the challenges to Big Data management inside Hadoop.
Go over these slides to learn:
· How to use the scalability and flexibility of Hadoop to drive faster access to usable information across the enterprise.
· Why a pure-YARN implementation for data integration, quality and management delivers competitive advantage.
· How to use the flexibility of RedPoint and Hortonworks to create an enterprise data lake where data is captured, cleansed, linked and structured in a consistent way.
Join Cloudian, Hortonworks and 451 Research for a panel-style Q&A discussion about the latest trends and technology innovations in Big Data and Analytics. Matt Aslett, Data Platforms and Analytics Research Director at 451 Research, John Kreisa, Vice President of Strategic Marketing at Hortonworks, and Paul Turner, Chief Marketing Officer at Cloudian, will answer your toughest questions about data storage, data analytics, log data, sensor data and the Internet of Things. Bring your questions or just come and listen!
Hortonworks and Red Hat Webinar - Part 2Hortonworks
Learn more about creating reference architectures that optimize the delivery the Hortonworks Data Platform. You will hear more about Hive, JBoss Data Virtualization Security, and you will also see in action how to combine sentiment data from Hadoop with data from traditional relational sources.
Transform You Business with Big Data and HortonworksHortonworks
This document summarizes a presentation about Hortonworks and how it can help companies transform their businesses with big data and Hortonworks' Hadoop distribution. Hortonworks is the sole distributor of an open source, enterprise-grade Hadoop distribution called Hortonworks Data Platform (HDP). HDP addresses enterprise requirements for mixed workloads, high availability, security and more. The presentation discusses how Hortonworks enables interoperability and supports customers. It also provides an overview of how Pactera can help clients with big data implementation, architecture, and analytics.
Insurance companies of all sizes are challenged to keep up with emerging technologies that deliver a competitive advantage. Recording: https://www.brighttalk.com/webcast/9573/192877
Big data holds the key to greater customer insight and stronger customer relationships. But risk of sensitive data exposure — and compliance violations — keeps many insurers from pursuing big data initiatives and reaping the rewards of business-driven analytics. Join Dataguise and Hortonworks for this live webinar to learn how you can free your organization from traditional information security constraints and unlock the power of your most valuable business assets.
• What do you need to know about PII/PHI privacy before embarking on big data initiatives?
• Why do so many big data initiatives fail before they’ve even begun—and what can you do about it?
• How can IT security organizations help data scientists extract more business value from their data?
• How are leading insurance companies leveraging big data to gain competitive advantage?
Hortonworks Data Platform for Systems Integrators Webinar 9-5-2012.pptxHortonworks
Hortonworks partners with systems integrators to accelerate adoption of Apache Hadoop. Hortonworks provides the only 100% open source Apache Hadoop distribution with the most experienced leadership team. Hortonworks' business strategy is to enable the next generation data management platform and accelerate adoption of Apache Hadoop by creating an ecosystem of partners.
Accelerate Big Data Application Development with Cascading and HDP, Hortonwor...Hortonworks
Accelerate Big Data Application Development with Cascading and HDP, webinar hosted by Hortonworks and Concurrent. Visit Hortonworks.com/webinars to access the recording.
Hortonworks and Clarity Solution Group Hortonworks
Many organizations are leveraging social media to understand consumer sentiment and opinions about brands and products. Analytics in this area, however, is in its infancy and does not always provide a compelling result for effective business impact. Learn how consumer organizations can benefit by integrating social data with enterprise data to drive more profitable consumer relationships. This webinar is presented by Hortonworks and Clarity Solution Group, and will focus on the evolution of Hadoop, the clear advantage of Hortonworks distribution, and business challenges solved by “Consumer720.”
Hadoop 2.0: YARN to Further Optimize Data ProcessingHortonworks
Data is exponentially increasing in both types and volumes, creating opportunities for businesses. Watch this video and learn from three Big Data experts: John Kreisa, VP Strategic Marketing at Hortonworks, Imad Birouty, Director of Technical Product Marketing at Teradata and John Haddad, Senior Director of Product Marketing at Informatica.
Multiple systems are needed to exploit the variety and volume of data sources, including a flexible data repository. Learn more about:
- Apache Hadoop 2 and YARN
- Data Lakes
- Intelligent data management layers needed to manage metadata and usage patterns as well as track consumption across these data platforms.
Create a Smarter Data Lake with HP Haven and Apache HadoopHortonworks
An organization’s information is spread across multiple repositories, on-premise and in the cloud, with limited ability to correlate information and derive insights. The Smart Content Hub solution from HP and Hortonworks enables a shared content infrastructure that transparently synchronizes information with existing systems and offers an open standards-based platform for deep analysis and data monetization.
- Leverage 100% of your data: Text, images, audio, video, and many more data types can be automatically consumed and enriched using HP Haven (powered by HP IDOL and HP Vertica), making it possible to integrate this valuable content and insights into various line of business applications.
- Democratize and enable multi-dimensional content analysis: - Empower your analysts, business users, and data scientists to search and analyze Hadoop data with ease, using the 100% open source Hortonworks Data Platform.
- Extend the enterprise data warehouse: Synchronize and manage content from content management systems, and crack open the files in whatever format they happen to be in.
- Dramatically reduce complexity with enterprise-ready SQL engine: Tap into the richest analytics that support JOINs, complex data types, and other capabilities only available with HP Vertica SQL on the Hortonworks Data Platform.
Speakers:
- Ajay Singh, Director, Technical Channels, Hortonworks
- Will Gardella, Product Management, HP Big Data
Discover HDP 2.2: Comprehensive Hadoop Security with Apache Ranger and Apache...Hortonworks
The document discusses security enhancements in Hortonworks Data Platform (HDP) 2.2, including centralized security with Apache Ranger and API security with Apache Knox. It provides an overview of Ranger's ability to centrally administer, authorize, and audit access across Hadoop. Knox is described as securing access to Hadoop APIs from various devices and applications by acting as a gateway. The document highlights new features for Ranger and Knox in HDP 2.2 such as deeper integration with Hadoop components and Ambari management capabilities.
Hortonworks and Voltage Security webinarHortonworks
Securing Hadoop data is a hot topic for good reason – no matter where you are in your Hadoop implementation plans, it’s best to define your data security approach now, not later. Hortonworks and Voltage Security are focused on deeply integrating Hadoop with your existing data center technologies and team capabilities. Attend this discussion to learn about a central policy administration framework across security requirements for authentication, authorization, auditing and data protection.
Slides from the joint webinar. Learn how Pivotal HAWQ, one of the world’s most advanced enterprise SQL on Hadoop technology, coupled with the Hortonworks Data Platform, the only 100% open source Apache Hadoop data platform, can turbocharge your Data Science efforts.
Together, Pivotal HAWQ and the Hortonworks Data Platform provide businesses with a Modern Data Architecture for IT transformation.
Hadoop Operations, Innovations and Enterprise Readiness with Hortonworks Data...Hortonworks
1. Hortonworks Data Platform 1.2 focuses on continued innovation with Apache Ambari and enhanced security and performance for Hive and HCatalog.
2. Key features include root cause analysis, usage heat maps, and improved ecosystem integration in Ambari, as well as enhanced security models and concurrency improvements.
3. Hortonworks ensures tight alignment with open source Apache projects by certifying the latest stable components and contributing leadership and code back to projects.
1) The webinar covered Apache Hadoop on the open cloud, focusing on key drivers for Hadoop adoption like new types of data and business applications.
2) Requirements for enterprise Hadoop include core services, interoperability, enterprise readiness, and leveraging existing skills in development, operations, and analytics.
3) The webinar demonstrated Hortonworks Apache Hadoop running on Rackspace's Cloud Big Data Platform, which is built on OpenStack for security, optimization, and an open platform.
Big Data Analytics - Is Your Elephant Enterprise Ready?Hortonworks
Hadoop’s cost effective scalability and flexibility to analyze all data types is driving organizations everywhere to embrace big data analytics. From proof of concept to deployment across the enterprise, join Datameer and Hortonworks as we answer the ‘now what?’ when rolling out your Hadoop big data analytics project. This webinar will address critical project components such as data security, data privacy, high availability, user training and use case development.
Apache Hadoop 0.23 at Hadoop World 2011Hortonworks
This document discusses Apache Hadoop 0.23, the first stable release of Hadoop in over 30 months. It introduces the speaker, Arun Murthy, and describes significant new features in Hadoop 0.23 like HDFS federation and YARN. It also covers performance improvements, HDFS high availability, and the extensive testing done for the release across many projects like HBase, Pig and Hive to enable very large deployments of 6000+ nodes.
The document discusses PIG on Storm. It begins with an example of using PIG to perform tokenization, grouping, and counting of sentences. It then shows how the same operations can be done in Storm directly and in PIG running on Storm. The document outlines different Storm execution modes and how PIG can provide state management and sliding windows. It also introduces the concept of a hybrid mode where parts of a PIG script can run on Storm or MapReduce automatically.
Enterprise Hadoop with Hortonworks and Nimble StorageHortonworks
Join us to learn how Hortonworks Data Platform and Nimble Storage provide an enterprise-ready data platform for multi-workload data processing. HDP supports an array of processing methods — from batch through interactive to real-time, with key capabilities required of an enterprise data platform — spanning Governance, Security and Operations. Nimble Storage provides the performance, capacity, and availability for HDP and allows you to take advantage of Hadoop with minimal changes to existing data architectures and skillsets.
The Next Generation of Big Data AnalyticsHortonworks
Apache Hadoop has evolved rapidly to become a leading platform for managing and processing big data. If your organization is examining how you can use Hadoop to store, transform, and refine large volumes of multi-structured data, please join us for this session where we will discuss, the emergence of "big data" and opportunities for deriving business value, the evolution of Apache Hadoop and future directions, essential components required in a Hadoop-powered platform, and solution architectures that integrate Hadoop with existing data discovery and data warehouse platforms.
2015 02 12 talend hortonworks webinar challenges to hadoop adoptionHortonworks
Hadoop is no longer optional. Companies of all sizes are in various phases of their own Big Data journey. Whether you are just starting to explore the platform or have multiple clusters up and running, everyone is presented with a similar challenge - developing their internal skillset. Hadoop specialists are hard to find. Hand coding is too prone to error when it comes to storing, integrating or analyzing your data. However, it doesn’t need to be this difficult.
In this recorded webinar, Talend and Hortonworks help you learn how to unify all your data in Hadoop, with no specialized Big Data skills.
Find the recording here. www.talend.com/resources/webinars/challenges-to-hadoop-adoption-if-you-can-dream-it-you-can-build-it
This webinar covers: How Hadoop opens a new world of analytic applications, How to bridge the skills gap with our Big Data solutions, Experience a real-world, simple technical demo
Hortonworks and Red Hat Webinar_Sept.3rd_Part 1Hortonworks
As the enterprise's big data program matures and Apache Hadoop becomes more deeply embedded in critical operations, the ability to support and operate it efficiently and reliably becomes increasingly important. To aid enterprise in operating modern data architecture at scale, Red hat and Hortonworks have collaborated to integrate Hortonworks Data Platform with Red Hat's proven platform technologies. Join us in this interactive 3-part webinar series, as we'll demonstrate how Red Hat JBoss Data Virtualization can integrate with Hadoop through Hive and provide users easy access to data.
Eliminating the Challenges of Big Data Management Inside HadoopHortonworks
Your Big Data strategy is only as good as the quality of your data. Today, deriving business value from data depends on how well your company can capture, cleanse, integrate and manage data. During this webinar, we discussed how to eliminate the challenges to Big Data management inside Hadoop.
Go over these slides to learn:
· How to use the scalability and flexibility of Hadoop to drive faster access to usable information across the enterprise.
· Why a pure-YARN implementation for data integration, quality and management delivers competitive advantage.
· How to use the flexibility of RedPoint and Hortonworks to create an enterprise data lake where data is captured, cleansed, linked and structured in a consistent way.
Join Cloudian, Hortonworks and 451 Research for a panel-style Q&A discussion about the latest trends and technology innovations in Big Data and Analytics. Matt Aslett, Data Platforms and Analytics Research Director at 451 Research, John Kreisa, Vice President of Strategic Marketing at Hortonworks, and Paul Turner, Chief Marketing Officer at Cloudian, will answer your toughest questions about data storage, data analytics, log data, sensor data and the Internet of Things. Bring your questions or just come and listen!
Hortonworks and Red Hat Webinar - Part 2Hortonworks
Learn more about creating reference architectures that optimize the delivery the Hortonworks Data Platform. You will hear more about Hive, JBoss Data Virtualization Security, and you will also see in action how to combine sentiment data from Hadoop with data from traditional relational sources.
Transform You Business with Big Data and HortonworksHortonworks
This document summarizes a presentation about Hortonworks and how it can help companies transform their businesses with big data and Hortonworks' Hadoop distribution. Hortonworks is the sole distributor of an open source, enterprise-grade Hadoop distribution called Hortonworks Data Platform (HDP). HDP addresses enterprise requirements for mixed workloads, high availability, security and more. The presentation discusses how Hortonworks enables interoperability and supports customers. It also provides an overview of how Pactera can help clients with big data implementation, architecture, and analytics.
Insurance companies of all sizes are challenged to keep up with emerging technologies that deliver a competitive advantage. Recording: https://www.brighttalk.com/webcast/9573/192877
Big data holds the key to greater customer insight and stronger customer relationships. But risk of sensitive data exposure — and compliance violations — keeps many insurers from pursuing big data initiatives and reaping the rewards of business-driven analytics. Join Dataguise and Hortonworks for this live webinar to learn how you can free your organization from traditional information security constraints and unlock the power of your most valuable business assets.
• What do you need to know about PII/PHI privacy before embarking on big data initiatives?
• Why do so many big data initiatives fail before they’ve even begun—and what can you do about it?
• How can IT security organizations help data scientists extract more business value from their data?
• How are leading insurance companies leveraging big data to gain competitive advantage?
Hortonworks Data Platform for Systems Integrators Webinar 9-5-2012.pptxHortonworks
Hortonworks partners with systems integrators to accelerate adoption of Apache Hadoop. Hortonworks provides the only 100% open source Apache Hadoop distribution with the most experienced leadership team. Hortonworks' business strategy is to enable the next generation data management platform and accelerate adoption of Apache Hadoop by creating an ecosystem of partners.
Accelerate Big Data Application Development with Cascading and HDP, Hortonwor...Hortonworks
Accelerate Big Data Application Development with Cascading and HDP, webinar hosted by Hortonworks and Concurrent. Visit Hortonworks.com/webinars to access the recording.
Hortonworks and Clarity Solution Group Hortonworks
Many organizations are leveraging social media to understand consumer sentiment and opinions about brands and products. Analytics in this area, however, is in its infancy and does not always provide a compelling result for effective business impact. Learn how consumer organizations can benefit by integrating social data with enterprise data to drive more profitable consumer relationships. This webinar is presented by Hortonworks and Clarity Solution Group, and will focus on the evolution of Hadoop, the clear advantage of Hortonworks distribution, and business challenges solved by “Consumer720.”
Hadoop 2.0: YARN to Further Optimize Data ProcessingHortonworks
Data is exponentially increasing in both types and volumes, creating opportunities for businesses. Watch this video and learn from three Big Data experts: John Kreisa, VP Strategic Marketing at Hortonworks, Imad Birouty, Director of Technical Product Marketing at Teradata and John Haddad, Senior Director of Product Marketing at Informatica.
Multiple systems are needed to exploit the variety and volume of data sources, including a flexible data repository. Learn more about:
- Apache Hadoop 2 and YARN
- Data Lakes
- Intelligent data management layers needed to manage metadata and usage patterns as well as track consumption across these data platforms.
Create a Smarter Data Lake with HP Haven and Apache HadoopHortonworks
An organization’s information is spread across multiple repositories, on-premise and in the cloud, with limited ability to correlate information and derive insights. The Smart Content Hub solution from HP and Hortonworks enables a shared content infrastructure that transparently synchronizes information with existing systems and offers an open standards-based platform for deep analysis and data monetization.
- Leverage 100% of your data: Text, images, audio, video, and many more data types can be automatically consumed and enriched using HP Haven (powered by HP IDOL and HP Vertica), making it possible to integrate this valuable content and insights into various line of business applications.
- Democratize and enable multi-dimensional content analysis: - Empower your analysts, business users, and data scientists to search and analyze Hadoop data with ease, using the 100% open source Hortonworks Data Platform.
- Extend the enterprise data warehouse: Synchronize and manage content from content management systems, and crack open the files in whatever format they happen to be in.
- Dramatically reduce complexity with enterprise-ready SQL engine: Tap into the richest analytics that support JOINs, complex data types, and other capabilities only available with HP Vertica SQL on the Hortonworks Data Platform.
Speakers:
- Ajay Singh, Director, Technical Channels, Hortonworks
- Will Gardella, Product Management, HP Big Data
Discover HDP 2.2: Comprehensive Hadoop Security with Apache Ranger and Apache...Hortonworks
The document discusses security enhancements in Hortonworks Data Platform (HDP) 2.2, including centralized security with Apache Ranger and API security with Apache Knox. It provides an overview of Ranger's ability to centrally administer, authorize, and audit access across Hadoop. Knox is described as securing access to Hadoop APIs from various devices and applications by acting as a gateway. The document highlights new features for Ranger and Knox in HDP 2.2 such as deeper integration with Hadoop components and Ambari management capabilities.
Hortonworks and Voltage Security webinarHortonworks
Securing Hadoop data is a hot topic for good reason – no matter where you are in your Hadoop implementation plans, it’s best to define your data security approach now, not later. Hortonworks and Voltage Security are focused on deeply integrating Hadoop with your existing data center technologies and team capabilities. Attend this discussion to learn about a central policy administration framework across security requirements for authentication, authorization, auditing and data protection.
Slides from the joint webinar. Learn how Pivotal HAWQ, one of the world’s most advanced enterprise SQL on Hadoop technology, coupled with the Hortonworks Data Platform, the only 100% open source Apache Hadoop data platform, can turbocharge your Data Science efforts.
Together, Pivotal HAWQ and the Hortonworks Data Platform provide businesses with a Modern Data Architecture for IT transformation.
Hadoop Operations, Innovations and Enterprise Readiness with Hortonworks Data...Hortonworks
1. Hortonworks Data Platform 1.2 focuses on continued innovation with Apache Ambari and enhanced security and performance for Hive and HCatalog.
2. Key features include root cause analysis, usage heat maps, and improved ecosystem integration in Ambari, as well as enhanced security models and concurrency improvements.
3. Hortonworks ensures tight alignment with open source Apache projects by certifying the latest stable components and contributing leadership and code back to projects.
1) The webinar covered Apache Hadoop on the open cloud, focusing on key drivers for Hadoop adoption like new types of data and business applications.
2) Requirements for enterprise Hadoop include core services, interoperability, enterprise readiness, and leveraging existing skills in development, operations, and analytics.
3) The webinar demonstrated Hortonworks Apache Hadoop running on Rackspace's Cloud Big Data Platform, which is built on OpenStack for security, optimization, and an open platform.
Big Data Analytics - Is Your Elephant Enterprise Ready?Hortonworks
Hadoop’s cost effective scalability and flexibility to analyze all data types is driving organizations everywhere to embrace big data analytics. From proof of concept to deployment across the enterprise, join Datameer and Hortonworks as we answer the ‘now what?’ when rolling out your Hadoop big data analytics project. This webinar will address critical project components such as data security, data privacy, high availability, user training and use case development.
Apache Hadoop 0.23 at Hadoop World 2011Hortonworks
This document discusses Apache Hadoop 0.23, the first stable release of Hadoop in over 30 months. It introduces the speaker, Arun Murthy, and describes significant new features in Hadoop 0.23 like HDFS federation and YARN. It also covers performance improvements, HDFS high availability, and the extensive testing done for the release across many projects like HBase, Pig and Hive to enable very large deployments of 6000+ nodes.
The document discusses PIG on Storm. It begins with an example of using PIG to perform tokenization, grouping, and counting of sentences. It then shows how the same operations can be done in Storm directly and in PIG running on Storm. The document outlines different Storm execution modes and how PIG can provide state management and sliding windows. It also introduces the concept of a hybrid mode where parts of a PIG script can run on Storm or MapReduce automatically.
Process and Visualize Your Data with Revolution R, Hadoop and GoogleVisHortonworks
In this session, attendees will learn how to use R in the distributed environment of Hadoop using the rmr package. Additionally, the R package googleVis will be used to show how application development teams can incorporate the power of R and the power of Google Chart Tools into their applications quickly and easily. The result is a rich custom data visualization with far less coding than what would otherwise be required. The session will begin by discussing R basics and then moving to concrete examples of statistical analysis on data sets. This will be accompanied by an application development example showing custom visualization of the analysis using googleVis. The application development example will show a browser based app both kicking off the data set analysis using R as well as the visualization of the result. Visualization examples will use both googleVis as well as basic Google Chart Tools. Attendees will leave the session with a concrete example of how to incorporate R into their existing application development practices and how to use Hadoop and its ecosystem to build custom visualizations.
This document contains a presentation about using open source software and commodity hardware to process big data in a cost effective manner. It discusses how Apache Hadoop can be used to collect, store, process and analyze large amounts of data without expensive proprietary software or hardware. The presentation provides examples of how Hadoop is being used by various companies and explores different approaches for refining, exploring and enriching data with Hadoop.
Don't Let Security Be The 'Elephant in the Room'Hortonworks
Don't let security be the "elephant in the room" for enterprise big data. As big data now includes sensitive data from various sources, there are hidden risks to simply adopting big data technologies without also implementing proper data protection. While traditional IT security approaches provide some coverage, they also have gaps and do not fully address protecting data across its lifecycle and wherever it may travel. A data-centric security approach that encrypts data at capture can lock down data and keep it protected as it is stored, processed, and shared across systems.
Adoption de Hadoop : des Possibilités Illimitées - Hortonworks and TalendHortonworks
Hadoop has become unavoidable. Companies of all sizes are at different stages of their thoughts on Big Data. Whether you're just starting to explore the platform or you already have several existing clusters, everyone faces the same challenge - to develop its internal expertise.
Specialists of Big Data, Talend, and Hortonworks, watch this webinar to discover how to unify all your data in Hadoop, without specific skills Big Data.
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...Hortonworks
The document discusses a Big Data Meetup organized by C-BAG (Chennai Big Data Analytic Group) on October 29, 2014 in Chennai. It provides details about two speakers, Dhruv Kumar from Concurrent Inc. and Vinay Shukla from Hortonworks, who will discuss reducing development time for production-grade Hadoop applications and Hortonworks' Hadoop platform respectively. The remainder of the document consists of presentation slides that cover topics including the modern data architecture with Hadoop, enterprise goals for data architecture, unlocking applications from new data types, and case studies.
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...Hortonworks
Companies in every industry look for ways to explore new data types and large data sets that were previously too big to capture, store and process. They need to unlock insights from data such as clickstream, geo-location, sensor, server log, social, text and video data. However, becoming a data-first enterprise comes with many challenges.
Join this webinar organized by three leaders in their respective fields and learn from our experts how you can accelerate the implementation of a scalable, cost-efficient and robust Big Data solution. Cisco, Hortonworks and Red Hat will explore how new data sets can enrich existing analytic applications with new perspectives and insights and how they can help you drive the creation of innovative new apps that provide new value to your business.
Mrinal devadas, Hortonworks Making Sense Of Big DataPatrickCrompton
This document provides an overview of Hortonworks and its Hortonworks Data Platform (HDP). Hortonworks develops, distributes and supports HDP, which is the only 100% open source Apache Hadoop distribution. Hortonworks focuses on innovation in Apache Hadoop projects, addressing enterprise requirements, enabling ecosystem interoperability, and ensuring no vendor lock-in through its open source approach. The document discusses Hortonworks' contributions to Apache Hadoop and other projects, as well as how HDP can be used for operational data refinery, big data exploration, and application enrichment.
Bridging the Big Data Gap in the Software-Driven WorldCA Technologies
Implementing and managing a Big Data environment effectively requires essential efficiencies such as automation, performance monitoring and flexible infrastructure management. Discover new innovations that enable you to manage entire Big Data environments with unparalleled ease of use and clear enterprise visibility across a variety of data repositories.
To learn more about Mainframe solutions from CA Technologies, visit: http://bit.ly/1wbiPkl
Hortonworks Hadoop @ Oslo Hadoop User GroupMats Johansson
This document provides an overview of Hortonworks and Hadoop. It discusses Hortonworks' customer momentum, the Hortonworks Data Platform (HDP) which provides a multi-tenant platform for any application and data, and Hortonworks' focus on customer success through its open source community leadership and support. It also discusses how Hadoop has emerged as the foundation for a modern data architecture to unify data processing and analytics for both traditional and new data sources in order to drive business value.
This document provides an overview of Hortonworks and Hadoop. It discusses Hortonworks' customer momentum, the Hortonworks Data Platform (HDP), and Hortonworks' role as a partner for customer success. It also summarizes challenges with traditional data systems, how Hadoop emerged as a foundation for a new data architecture, and how HDP delivers a comprehensive data management platform.
Hadoop Reporting and Analysis - JaspersoftHortonworks
Hadoop is deployed for a variety of uses, including web analytics, fraud detection, security monitoring, healthcare, environmental analysis, social media monitoring, and other purposes.
Transform Your Business with Big Data and Hortonworks Pactera_US
Customer insight and marketplace predictions are a few of the profitable benefits found in big data technology. Leading companies are using the advanced analytics solution to find new revenue streams, increase customer satisfaction and optimize the supply chain.
Supporting Financial Services with a More Flexible Approach to Big DataWANdisco Plc
In this webinar, WANdisco and Hortonworks look at three examples of using 'Big Data' to get a more comprehensive view of customer behavior and activity in the banking and insurance industries. Then we'll pull out the common threads from these examples, and see how a flexible next-generation Hadoop architecture lets you get a step up on improving your business performance. Join us to learn:
- How to leverage data from across an entire global enterprise
- How to analyze a wide variety of structured and unstructured data to get quick, meaningful answers to critical questions
- What industry leaders have put in place
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big DataHortonworks
Hadoop is a great platform for storing and processing massive amounts of data. Elasticsearch is the ideal solution for Searching and Visualizing the same data. Join us to learn how you can leverage the full power of both platforms to maximize the value of your Big Data.
In this webinar we'll walk you through:
How Elasticsearch fits in the Modern Data Architecture.
A demo of Elasticsearch and Hortonworks Data Platform.
Best practices for combining Elasticsearch and Hortonworks Data Platform to extract maximum insights from your data.
Supporting Financial Services with a More Flexible Approach to Big DataHortonworks
The document discusses how Hortonworks Data Platform (HDP) enables a modern data architecture with Apache Hadoop. HDP provides a common data set stored in HDFS that can be accessed through various applications for batch, interactive, and real-time processing. This allows organizations to store all their data in one place and access it simultaneously through multiple means. YARN is the architectural center of HDP and enables this modern data architecture. HDP also provides enterprise capabilities like security, governance, and operations to make Hadoop suitable for business use.
Hortonworks provides an overview of their Tez framework for improving Hadoop query processing. Tez aims to accelerate queries by expressing them as dataflow graphs that can be optimized, rather than relying solely on MapReduce. It also aims to empower users by allowing flexible definition of data pipelines and composition of inputs, processors, and outputs. Early results show a 100x speedup on benchmark queries compared to traditional MapReduce.
Eliminating the Challenges of Big Data Management Inside HadoopHortonworks
Your Big Data strategy is only as good as the quality of your data. Today, deriving business value from data depends on how well your company can capture, cleanse, integrate and manage data. During this webinar, we discuss how to eliminate the challenges to Big Data management inside Hadoop.
This document discusses modern data architecture and Apache Hadoop's role within it. It presents WANdisco and its Non-Stop Hadoop solution, which extends HDFS across multiple data centers to provide 100% uptime for Hadoop deployments. Non-Stop Hadoop uses WANdisco's patented distributed coordination engine to synchronize HDFS metadata across sites separated by wide area networks, enabling continuous availability of HDFS data and global HDFS deployments.
Hortonworks for Financial Analysts PresentationHortonworks
Hortonworks was founded in 2011 by former Yahoo engineers to support the growth of Apache Hadoop. Their strategy is to overcome technology gaps by making Hadoop easier to install and use, enable an ecosystem of partners by defining open APIs, and overcome knowledge gaps by expanding technical content and training. This will help drive wider adoption of Apache Hadoop as the platform for managing big data in the enterprise.
The Modern Data Architecture for Advanced Business Intelligence with Hortonwo...Hortonworks
The document provides an overview of a webinar presented by Anurag Tandon and John Kreisa of Hortonworks and MicroStrategy respectively. It discusses the drivers for adopting a modern data architecture including the growth of new types of data and the need for efficiency. It outlines how Apache Hadoop can power a modern data architecture by providing scalable storage and processing. Key requirements for Hadoop adoption in the enterprise are also reviewed like the need for integration, interoperability, essential services, and leveraging existing skills. MicroStrategy's role in enabling analytics on big data and across all data sources is also summarized.
Rescue your Big Data from Downtime with HP Operations Bridge and Apache HadoopHortonworks
How can you simplify the management and monitoring of your Hadoop environment? Ensure IT can focus on the right business priorities supported by Hadoop? Take a look at this presentation and learn how you can simplify the management and monitoring of your Hadoop environment, and ensure IT can focus on the right business priorities supported by Hadoop.
Hortonworks provides an open source Apache Hadoop distribution called Hortonworks Data Platform (HDP). Their mission is to enable modern data architectures through delivering enterprise Apache Hadoop. They have over 300 employees and are headquartered in Palo Alto, CA. Hortonworks focuses on driving innovation through the open source Apache community process, integrating Hadoop with existing technologies, and engineering Hadoop for enterprise reliability and support.
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks
The HDF 3.3 release delivers several exciting enhancements and new features. But, the most noteworthy of them is the addition of support for Kafka 2.0 and Kafka Streams.
https://hortonworks.com/webinar/hortonworks-dataflow-hdf-3-3-taking-stream-processing-next-level/
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyHortonworks
Forrester forecasts* that direct spending on the Internet of Things (IoT) will exceed $400 Billion by 2023. From manufacturing and utilities, to oil & gas and transportation, IoT improves visibility, reduces downtime, and creates opportunities for entirely new business models.
But successful IoT implementations require far more than simply connecting sensors to a network. The data generated by these devices must be collected, aggregated, cleaned, processed, interpreted, understood, and used. Data-driven decisions and actions must be taken, without which an IoT implementation is bound to fail.
https://hortonworks.com/webinar/iot-predictions-2019-beyond-data-heart-iot-strategy/
Getting the Most Out of Your Data in the Cloud with CloudbreakHortonworks
Cloudbreak, a part of Hortonworks Data Platform (HDP), simplifies the provisioning and cluster management within any cloud environment to help your business toward its path to a hybrid cloud architecture.
https://hortonworks.com/webinar/getting-data-cloud-cloudbreak-live-demo/
Johns Hopkins - Using Hadoop to Secure Access Log EventsHortonworks
In this webinar, we talk with experts from Johns Hopkins as they share techniques and lessons learned in real-world Apache Hadoop implementation.
https://hortonworks.com/webinar/johns-hopkins-using-hadoop-securely-access-log-events/
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysHortonworks
Cybersecurity today is a big data problem. There’s a ton of data landing on you faster than you can load, let alone search it. In order to make sense of it, we need to act on data-in-motion, use both machine learning, and the most advanced pattern recognition system on the planet: your SOC analysts. Advanced visualization makes your analysts more efficient, helps them find the hidden gems, or bombs in masses of logs and packets.
https://hortonworks.com/webinar/catch-hacker-real-time-live-visuals-bots-bad-guys/
We have introduced several new features as well as delivered some significant updates to keep the platform tightly integrated and compatible with HDP 3.0.
https://hortonworks.com/webinar/hortonworks-dataflow-hdf-3-2-release-raises-bar-operational-efficiency/
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerHortonworks
With the growth of Apache Kafka adoption in all major streaming initiatives across large organizations, the operational and visibility challenges associated with Kafka are on the rise as well. Kafka users want better visibility in understanding what is going on in the clusters as well as within the stream flows across producers, topics, brokers, and consumers.
With no tools in the market that readily address the challenges of the Kafka Ops teams, the development teams, and the security/governance teams, Hortonworks Streams Messaging Manager is a game-changer.
https://hortonworks.com/webinar/curing-kafka-blindness-hortonworks-streams-messaging-manager/
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsHortonworks
The healthcare industry—with its huge volumes of big data—is ripe for the application of analytics and machine learning. In this webinar, Hortonworks and Quanam present a tool that uses machine learning and natural language processing in the clinical classification of genomic variants to help identify mutations and determine clinical significance.
Watch the webinar: https://hortonworks.com/webinar/interpretation-tool-genomic-sequencing-data-clinical-environments/
IBM+Hortonworks = Transformation of the Big Data LandscapeHortonworks
Last year IBM and Hortonworks jointly announced a strategic and deep partnership. Join us as we take a close look at the partnership accomplishments and the conjoined road ahead with industry-leading analytics offers.
View the webinar here: https://hortonworks.com/webinar/ibmhortonworks-transformation-big-data-landscape/
The document provides an overview of Apache Druid, an open-source distributed real-time analytics database. It discusses Druid's architecture including segments, indexing, and nodes like brokers, historians and coordinators. It also covers integrating Druid with Hortonworks Data Platform for unified querying and visualization of streaming and historical data.
Accelerating Data Science and Real Time Analytics at ScaleHortonworks
Gaining business advantages from big data is moving beyond just the efficient storage and deep analytics on diverse data sources to using AI methods and analytics on streaming data to catch insights and take action at the edge of the network.
https://hortonworks.com/webinar/accelerating-data-science-real-time-analytics-scale/
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATAHortonworks
Thanks to sensors and the Internet of Things, industrial processes now generate a sea of data. But are you plumbing its depths to find the insight it contains, or are you just drowning in it? Now, Hortonworks and Seeq team to bring advanced analytics and machine learning to time-series data from manufacturing and industrial processes.
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Hortonworks
Trimble Transportation Enterprise is a leading provider of enterprise software to over 2,000 transportation and logistics companies. They have designed an architecture that leverages Hortonworks Big Data solutions and Machine Learning models to power up multiple Blockchains, which improves operational efficiency, cuts down costs and enables building strategic partnerships.
https://hortonworks.com/webinar/blockchain-with-machine-learning-powered-by-big-data-trimble-transportation-enterprise/
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseHortonworks
For years, the healthcare industry has had problems of data scarcity and latency. Clearsense solved the problem by building an open-source Hortonworks Data Platform (HDP) solution while providing decades worth of clinical expertise. Clearsense is delivering smart, real-time streaming data, to its healthcare customers enabling mission-critical data to feed clinical decisions.
https://hortonworks.com/webinar/delivering-smart-real-time-streaming-data-healthcare-customers-clearsense/
Making Enterprise Big Data Small with EaseHortonworks
Every division in an organization builds its own database to keep track of its business. When the organization becomes big, those individual databases grow as well. The data from each database may become silo-ed and have no idea about the data in the other database.
https://hortonworks.com/webinar/making-enterprise-big-data-small-ease/
Driving Digital Transformation Through Global Data ManagementHortonworks
Using your data smarter and faster than your peers could be the difference between dominating your market and merely surviving. Organizations are investing in IoT, big data, and data science to drive better customer experience and create new products, yet these projects often stall in ideation phase to a lack of global data management processes and technologies. Your new data architecture may be taking shape around you, but your goal of globally managing, governing, and securing your data across a hybrid, multi-cloud landscape can remain elusive. Learn how industry leaders are developing their global data management strategy to drive innovation and ROI.
Presented at Gartner Data and Analytics Summit
Speaker:
Dinesh Chandrasekhar
Director of Product Marketing, Hortonworks
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHortonworks
Hortonworks DataFlow (HDF) is the complete solution that addresses the most complex streaming architectures of today’s enterprises. More than 20 billion IoT devices are active on the planet today and thousands of use cases across IIOT, Healthcare and Manufacturing warrant capturing data-in-motion and delivering actionable intelligence right NOW. “Data decay” happens in a matter of seconds in today’s digital enterprises.
To meet all the needs of such fast-moving businesses, we have made significant enhancements and new streaming features in HDF 3.1.
https://hortonworks.com/webinar/series-hdf-3-1-technical-deep-dive-new-streaming-features/
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks
Join the Hortonworks product team as they introduce HDF 3.1 and the core components for a modern data architecture to support stream processing and analytics.
You will learn about the three main themes that HDF addresses:
Developer productivity
Operational efficiency
Platform interoperability
https://hortonworks.com/webinar/series-hdf-3-1-redefining-data-motion-modern-data-architectures/
Unlock Value from Big Data with Apache NiFi and Streaming CDCHortonworks
The document discusses Apache NiFi and streaming change data capture (CDC) with Attunity Replicate. It provides an overview of NiFi's capabilities for dataflow management and visualization. It then demonstrates how Attunity Replicate can be used for real-time CDC to capture changes from source databases and deliver them to NiFi for further processing, enabling use cases across multiple industries. Examples of source systems include SAP, Oracle, SQL Server, and file data, with targets including Hadoop, data warehouses, and cloud data stores.
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
Monitoring and Managing Anomaly Detection on OpenShift.pdfTosin Akinosho
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
How information systems are built or acquired puts information, which is what they should be about, in a secondary place. Our language adapted accordingly, and we no longer talk about information systems but applications. Applications evolved in a way to break data into diverse fragments, tightly coupled with applications and expensive to integrate. The result is technical debt, which is re-paid by taking even bigger "loans", resulting in an ever-increasing technical debt. Software engineering and procurement practices work in sync with market forces to maintain this trend. This talk demonstrates how natural this situation is. The question is: can something be done to reverse the trend?
In the realm of cybersecurity, offensive security practices act as a critical shield. By simulating real-world attacks in a controlled environment, these techniques expose vulnerabilities before malicious actors can exploit them. This proactive approach allows manufacturers to identify and fix weaknesses, significantly enhancing system security.
This presentation delves into the development of a system designed to mimic Galileo's Open Service signal using software-defined radio (SDR) technology. We'll begin with a foundational overview of both Global Navigation Satellite Systems (GNSS) and the intricacies of digital signal processing.
The presentation culminates in a live demonstration. We'll showcase the manipulation of Galileo's Open Service pilot signal, simulating an attack on various software and hardware systems. This practical demonstration serves to highlight the potential consequences of unaddressed vulnerabilities, emphasizing the importance of offensive security practices in safeguarding critical infrastructure.
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor IvaniukFwdays
At this talk we will discuss DDoS protection tools and best practices, discuss network architectures and what AWS has to offer. Also, we will look into one of the largest DDoS attacks on Ukrainian infrastructure that happened in February 2022. We'll see, what techniques helped to keep the web resources available for Ukrainians and how AWS improved DDoS protection for all customers based on Ukraine experience
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/how-axelera-ai-uses-digital-compute-in-memory-to-deliver-fast-and-energy-efficient-computer-vision-a-presentation-from-axelera-ai/
Bram Verhoef, Head of Machine Learning at Axelera AI, presents the “How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-efficient Computer Vision” tutorial at the May 2024 Embedded Vision Summit.
As artificial intelligence inference transitions from cloud environments to edge locations, computer vision applications achieve heightened responsiveness, reliability and privacy. This migration, however, introduces the challenge of operating within the stringent confines of resource constraints typical at the edge, including small form factors, low energy budgets and diminished memory and computational capacities. Axelera AI addresses these challenges through an innovative approach of performing digital computations within memory itself. This technique facilitates the realization of high-performance, energy-efficient and cost-effective computer vision capabilities at the thin and thick edge, extending the frontier of what is achievable with current technologies.
In this presentation, Verhoef unveils his company’s pioneering chip technology and demonstrates its capacity to deliver exceptional frames-per-second performance across a range of standard computer vision networks typical of applications in security, surveillance and the industrial sector. This shows that advanced computer vision can be accessible and efficient, even at the very edge of our technological ecosystem.
Ivanti’s Patch Tuesday breakdown goes beyond patching your applications and brings you the intelligence and guidance needed to prioritize where to focus your attention first. Catch early analysis on our Ivanti blog, then join industry expert Chris Goettl for the Patch Tuesday Webinar Event. There we’ll do a deep dive into each of the bulletins and give guidance on the risks associated with the newly-identified vulnerabilities.
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...Alex Pruden
Folding is a recent technique for building efficient recursive SNARKs. Several elegant folding protocols have been proposed, such as Nova, Supernova, Hypernova, Protostar, and others. However, all of them rely on an additively homomorphic commitment scheme based on discrete log, and are therefore not post-quantum secure. In this work we present LatticeFold, the first lattice-based folding protocol based on the Module SIS problem. This folding protocol naturally leads to an efficient recursive lattice-based SNARK and an efficient PCD scheme. LatticeFold supports folding low-degree relations, such as R1CS, as well as high-degree relations, such as CCS. The key challenge is to construct a secure folding protocol that works with the Ajtai commitment scheme. The difficulty, is ensuring that extracted witnesses are low norm through many rounds of folding. We present a novel technique using the sumcheck protocol to ensure that extracted witnesses are always low norm no matter how many rounds of folding are used. Our evaluation of the final proof system suggests that it is as performant as Hypernova, while providing post-quantum security.
Paper Link: https://eprint.iacr.org/2024/257
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU und die Lizenzen nach dem CCB- und CCX-Modell sind für viele in der HCL-Community seit letztem Jahr ein heißes Thema. Als Notes- oder Domino-Kunde haben Sie vielleicht mit unerwartet hohen Benutzerzahlen und Lizenzgebühren zu kämpfen. Sie fragen sich vielleicht, wie diese neue Art der Lizenzierung funktioniert und welchen Nutzen sie Ihnen bringt. Vor allem wollen Sie sicherlich Ihr Budget einhalten und Kosten sparen, wo immer möglich. Das verstehen wir und wir möchten Ihnen dabei helfen!
Wir erklären Ihnen, wie Sie häufige Konfigurationsprobleme lösen können, die dazu führen können, dass mehr Benutzer gezählt werden als nötig, und wie Sie überflüssige oder ungenutzte Konten identifizieren und entfernen können, um Geld zu sparen. Es gibt auch einige Ansätze, die zu unnötigen Ausgaben führen können, z. B. wenn ein Personendokument anstelle eines Mail-Ins für geteilte Mailboxen verwendet wird. Wir zeigen Ihnen solche Fälle und deren Lösungen. Und natürlich erklären wir Ihnen das neue Lizenzmodell.
Nehmen Sie an diesem Webinar teil, bei dem HCL-Ambassador Marc Thomas und Gastredner Franz Walder Ihnen diese neue Welt näherbringen. Es vermittelt Ihnen die Tools und das Know-how, um den Überblick zu bewahren. Sie werden in der Lage sein, Ihre Kosten durch eine optimierte Domino-Konfiguration zu reduzieren und auch in Zukunft gering zu halten.
Diese Themen werden behandelt
- Reduzierung der Lizenzkosten durch Auffinden und Beheben von Fehlkonfigurationen und überflüssigen Konten
- Wie funktionieren CCB- und CCX-Lizenzen wirklich?
- Verstehen des DLAU-Tools und wie man es am besten nutzt
- Tipps für häufige Problembereiche, wie z. B. Team-Postfächer, Funktions-/Testbenutzer usw.
- Praxisbeispiele und Best Practices zum sofortigen Umsetzen
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfChart Kalyan
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsDianaGray10
Join us to learn how UiPath Apps can directly and easily interact with prebuilt connectors via Integration Service--including Salesforce, ServiceNow, Open GenAI, and more.
The best part is you can achieve this without building a custom workflow! Say goodbye to the hassle of using separate automations to call APIs. By seamlessly integrating within App Studio, you can now easily streamline your workflow, while gaining direct access to our Connector Catalog of popular applications.
We’ll discuss and demo the benefits of UiPath Apps and connectors including:
Creating a compelling user experience for any software, without the limitations of APIs.
Accelerating the app creation process, saving time and effort
Enjoying high-performance CRUD (create, read, update, delete) operations, for
seamless data management.
Speakers:
Russell Alfeche, Technology Leader, RPA at qBotic and UiPath MVP
Charlie Greenberg, host
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframePrecisely
Inconsistent user experience and siloed data, high costs, and changing customer expectations – Citizens Bank was experiencing these challenges while it was attempting to deliver a superior digital banking experience for its clients. Its core banking applications run on the mainframe and Citizens was using legacy utilities to get the critical mainframe data to feed customer-facing channels, like call centers, web, and mobile. Ultimately, this led to higher operating costs (MIPS), delayed response times, and longer time to market.
Ever-changing customer expectations demand more modern digital experiences, and the bank needed to find a solution that could provide real-time data to its customer channels with low latency and operating costs. Join this session to learn how Citizens is leveraging Precisely to replicate mainframe data to its customer channels and deliver on their “modern digital bank” experiences.
Skybuffer SAM4U tool for SAP license adoptionTatiana Kojar
Manage and optimize your license adoption and consumption with SAM4U, an SAP free customer software asset management tool.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
I can’t really talk about Hortonworks without first taking a moment to talk about the history of Hadoop.What we now know of as Hadoop really started back in 2005, when Eric Baldeschwieler – known as “E14” – started to work on a project that to build a large scale data storage and processing technology that would allow them to store and process massive amounts of data to underpin Yahoo’s most critical application, Search. The initial focus was on building out the technology – the key components being HDFS and MapReduce – that would become the Core of what we think of as Hadoop today, and continuing to innovate it to meet the needs of this specific application.By 2008, Hadoop usage had greatly expanded inside of Yahoo, to the point that many applications were now using this data management platform, and as a result the team’s focus extended to include a focus on Operations: now that applications were beginning to propagate around the organization, sophisticated capabilities for operating it at scale were necessary. It was also at this time that usage began to expand well beyond Yahoo, with many notable organizations (including Facebook and others) adopting Hadoop as the basis of their large scale data processing and storage applications and necessitating a focus on operations to support what as by now a large variety of critical business applications.In 2011, recognizing that more mainstream adoption of Hadoop was beginning to take off and with an objective of facilitating it, the core team left – with the blessing of Yahoo – to form Hortonworks. The goal of the group was to facilitate broader adoption by addressing the Enterprise capabilities that would would enable a larger number of organizations to adopt and expand their usage of Hadoop.[note: if useful as a talk track, Cloudera was formed in 2008 well BEFORE the operational expertise of running Hadoop at scale was established inside of Yahoo]
In that capacity,Arun allows Hortonworks to be instrumental in working with the community to drive the roadmap for Core Hadoop, where the focus today is on things like YARN, MapReduce2, HDFS2 and more.For Core Hadoop, in absolute terms, Hortonworkers have contributed more than twice as many lines of code as the next closest contributor, and even more if you include Yahoo, our development partner. Taking such a prominent role also enables us to ensure that our distribution integrates deeply with the ecosystem: on both choice of deployment platforms such as Windows, Azure and more, but also to create deeply engineered solutions with key partners such as Teradata.And consistent with our approach, all of this is done in 100% open source.
Across all of our user base, we have identified just 3 separate usage patterns – sometimes more than one is used in concert during a complex project, but the patterns are distinct nonetheless. These are Refine, Explore and Enrich.The first of these, the Refine case, is probably the most common today. It is about taking very large quantities of data and using Hadoop to distill the information down into a more manageable data set that can then be loaded into a traditional data warehouse for usage with existing tools. This is relatively straightforward and allows an organization to harness a much larger data set for their analytics applications while leveraging their existing data warehousing and analytics tools.Using the graphic here, in step 1 data is pulled from a variety of sources, into the Hadoop platform in step 2, and then in step 3 loaded into a data warehouse for analysis by existing BI tools
A second use case is what we would refer to as Data Exploration – this is the use case in question most commonly when people talk about “Data Science”.In simplest terms, it is about using Hadoop as the primary data store rather than performing the secondary step of moving data into a data warehouse. To support this use case you’ve seen all the BI tool vendor rally to add support for Hadoop – and most commonly HDP – as a peer to the database and in so doing allow for rich analytics on extremely large datasets that would be both unwieldy and also costly in a traditional data warehouse. Hadoop allows for interaction with a much richer dataset and has spawned a whole new generation of analytics tools that rely on Hadoop (HDP) as the data store.To use the graphic, in step 1 data is pulled into HDP, it is stored and processed in Step 2, before being surfaced directly into the analytics tools for the end user in Step 3.
The final use case is called Application Enrichment.This is about incorporating data stored in HDP to enrich an existing application. This could be an on-line application in which we want to surface custom information to a user based on their particular profile. For example: if a user has been searching the web for information on home renovations, in the context of your application you may want to use that knowledge to surface a custom offer for a product that you sell related to that category. Large web companies such as Facebook and others are very sophisticated in the use of this approach.In the diagram, this is about pulling data from disparate sources into HDP in Step 1, storing and processing it in Step 2, and then interacting with it directly from your applications in Step 3, typically in a bi-directional manner (e.g. request data, return data, store response).
It is for that reason that we focus on HDP interoperability across all of these categories:Data systemsHDP is endorsed and embedded with SQL Server, Teradata and moreBI tools: HDP is certified for use with the packaged applications you already use: from Microsoft, to Tableau, Microstrategy, Business Objects and moreWith Development tools: For .Net developers: Visual studio, used to build more than half the custom applications in the world, certifies with HDP to enable microsoft app developers to build custom apps with HadoopFor Java developers: Spring for Apache Hadoop enables Java developers to quickly and easily build Hadoop based applications with HDPOperational toolsIntegration with System Center, and with Teradata viewpoint
At its core, Hadoop is about HDFS and MapReduce, 2 projects that are really about distributed storage and data processing which are the underpinnings of Hadoop.In addition to Core Hadoop, we must identify and include the requisite “Platform Services” that are central to any piece of enterprise software. These include High Availability, Disaster Recovery, Security, etc, which enable use of the technology for a much broader (and mission critical) problem set.This is accomplished not by introducing new open source projects, but rather ensuring that these aspects are addressed within existing projects.HDFS: Self-healing, distributed file system for multi-structured data; breaks files into blocks & stores redundantly across clusterMapReduce: Framework for running large data processing jobs in parallel across many nodes & combining resultsYARN: New application management framework that enables Hadoop to go beyond MapReduce appsEnterprise-ready servicesHigh availability, disaster recovery, snapshots, security, …
Beyond Core and Platform Services, we must add a set of Data Services that enable the full data lifecycle. This includes capabilities to:Store dataProcess dataAccess dataFor example: how do we maintain consistent metadata information required to determine how best to query data stored in HDFS? The answer: a project called Apache HCatalogOr how do we access data stored in Hadoop from SQL-oriented tools? The answer: with projects such as Hive, which is the defacto standard for accessing data stored in HDFS.All of these are broadly captured under the category of “data services”.Apache HCatalog: Metadata & Table ManagementMetadata service that enables users to access Hadoop data as a set of tables without needing to be concerned with where or how their data is storedEnables consistent data sharing and interoperability across data processing tools such as Pig, MapReduce and HiveEnables deep interoperability and data access with systems such as Teradata, SQL Server, etc.Apache Hive: SQL Interface for HadoopThe de-facto SQL-like interface for Hadoop that enables data summarization, ad-hoc query, and analysis of large datasetsConnects to Excel, Microstrategy, PowerPivot, Tableau and other leading BI tools via Hortonworks Hive ODBC DriverHive currently serves batch and non-interactive use cases; in 2013, Hortonworks is working with Hive community to extend use cases to interactive query. Cloudera, on the other hand, has chosen to abandon Hive in lieu of Cloudera Impala (a Cloudera controlled technology aimed at the analytics market and solely focused on non-operational interactive query use cases)Apache HBase: NoSQL DB for Interactive AppsNon-relational, columnar database that provides a way for developers to create, read, update, and delete data in Hadoop in a way that performs well for interactive applicationsCommonly used for serving “intelligent applications” that predict user behavior, detect shifting usage patterns, or recommend ways for users to engageWebHDFS: Web service interface for HDFSScalable REST API that enables easy and scalable access to HDFS Move files in & out and delete from HDFS; leverages parallelism of clusterPerform file and directory functionswebhdfs://<HOST>:<HTTP PORT>/PATHIncluded in versions 1.0 and 2.0 of Hadoop; created & driven by HortonworkersTalend Open Studio for Big Data: open source ETL tool available as an optional download with HDPIntuitive graphical data integration tools for HDFS, Hive, HBase, HCatalog and PigOozie scheduling allows you to manage and stage jobs Connectors for any database, business application or systemIntegrated HCatalog storage
HCatalog – metadata shared across whole platformFile locations become abstract (not hard-coded)Data types become shared (not redefined per tool)Partitioning and HDFS-optimized
Any data management platform that is operated at any reasonable scale requires a management technology – for example SQL Server Management Studio for SQL Server, or Oracle Enterprise Manager for Oracle DB, etc. Hadoop is no exception, and means Apache Ambari, which is increasingly being recognized as foundational to the operation of Hadoop infrastructures. It allows users to provision, manage and monitor a cluster and provides a set of tools to visualize and diagnose operational issues. There are other projects in this category (such as Oozie) but Ambari is really the most influential.Apache Ambari: Management & MonitoringMake Hadoop clusters easy to operateSimplified cluster provisioning with a step-by-step install wizardPre-configured operational metrics for insight into health of Hadoop servicesVisualization of job and task execution for visibility into performance issuesComplete RESTful API for integrating with existing operational toolsIntuitive user interface that makes controlling a cluster easy and productive
So how does this get brought together into our distribution? It is really pretty straightforward, but also very unique:We start with this group of open source projects that I described and that we are continually driving in the OSS community. [CLICK] We then package the appropriate versions of those open source projects, integrate and test them using a full suite, including all the IP for regression testing contributed by Yahoo, and [CLICK] contribute back all of the bug fixes to the open source tree. From there, we package and certify a distribution in the from of the Hortonworks Data Platform (HDP) that includes both Hadoop Core as well as the related projects required by the Enterprise user, and provide to our customers.Through this application of Enterprise Software development process to the open source projects, the result is a 100% open source distribution that has been packaged, tested and certified by Hortonworks. It is also 100% in sync with the open source trees.
And finally, because any enterprise runs a heterogeneous set of infrastructures, we ensure that HDP runs on your choice of infrastructure. Whether this is Linux, Windows (HDP is the only distribution certified for Windows), on a cloud platform such as Azure or Rackspace, or in an appliance, we ensure that all of them are supported and that this work is all contributed back to the open source community.
Without the namenode, the filesystem cannot be used. In fact, if the machine running the namenode were obliterated, all the files on the filesystem would be lost since there would be no way of knowing how to reconstruct the files from the blocks on the datanodes. For this reason, it is important to make the namenode resilient to failure, and Hadoop provides two mechanisms for this. The first way is to back up the files that make up the persistent state of the filesystem metadata. Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems. These writes are synchronous and atomic. The usual configu- ration choice is to write to local disk as well as a remote NFS mount.
Despite its name the SNN does not act as a namenode. Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large. The secondary namenode usually runs on a separate physical machine, since it requires plenty of CPU and as much memory as the namenode to perform the merge. It keeps a copy of the merged name- space image, which can be used in the event of the namenode failing. However, the state of the secondary namenode lags that of the primary, so in the event of total failure of the primary, data loss is almost certain. The usual course of action in this case is to copy the namenode’s metadata files that are on NFS to the secondary and run it as the new primary. Edit LogWhen a filesystem client performs a write operation (such as creating or moving a file), it is first recorded in the edit log. The namenode also has an in-memory representation of the filesystem metadata, which it updates after the edit log has been modified. The in-memory metadata is used to serve read requests. The edit log is flushed and synced after every write before a success code is returned to the client. For namenodes that write to multiple directories, the write must be flushed and synced to every copy before returning successfully. This ensures that no operation is lost due to machine failure. fsimageThe fsimagefile is a persistent checkpoint of the filesystem metadata. However, it is not updated for every filesystem write operation, since writing out the fsimagefile, which can grow to be gigabytes in size, would be very slow. This does not compromise resilience, however, because if the namenode fails, then the latest state of its metadata can be reconstructed by loading the fsimagefrom disk into memory, then applying each of the operations in the edit log. In fact, this is precisely what the namenode does when it starts up. According to Apache, SecondaryNameNode is deprecated. See: http://hadoop.apache.org/hdfs/docs/r0.22.0/hdfs_user_guide.html#Secondary+NameNode (Accessed May 2012). – LWSNN configuration options:dfs.namenode.checkpoint.period valuedfs.namenode.checkpoint.size valuedfs.http.addressvaluepoint to name nodes port 50070Gets Fsimage and EditLogdfs.namenode.checkpoint.dirvalueelse defaults to /tmp
Data Disk Failure, Heartbeats and Re-ReplicationEach DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is not available to HDFS any more. DataNode death may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The necessity for re-replication may arise due to many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.Cluster RebalancingThe HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one DataNode to another if the free space on a DataNode falls below a certain threshold. In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster. These types of data rebalancing schemes are not yet implemented.Data IntegrityIt is possible that a block of data fetched from a DataNode arrives corrupted. This corruption can occur because of faults in a storage device, network faults, or buggy software. The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file. If not, then the client can opt to retrieve that block from another DataNode that has a replica of that block.Metadata Disk FailureThe fsImage and the editLog are central data structures of HDFS. A corruption of these files can cause the HDFS instance to be non-functional. For this reason, the NameNode can be configured to support maintaining multiple copies of the FsImage and EditLog. Any update to either the FsImage or EditLog causes each of the FsImages and EditLogs to get updated synchronously. This synchronous updating of multiple copies of the FsImage and EditLog may degrade the rate of namespace transactions per second that a NameNode can support. However, this degradation is acceptable because even though HDFS applications are very data intensive in nature, they are not metadata intensive. When a NameNode restarts, it selects the latest consistent FsImage and EditLog to use.The NameNode machine is a single point of failure for an HDFS cluster. If the NameNode machine fails, manual intervention is necessary. Currently, automatic restart and failover of the NameNode software to another machine is not supported.
Use the put commandIf bar is a directory then foo will be placed in the directory barIf no file or directory exists of name bar, file foo will be named If bar is already a file in HDFS then an error is returnedIf subNamedoes not exist then the file will be created in the directoryNote: -copyFromLocalcan be used instead of -put
Can display using catCopy a file from HDFS to the local file system with –get command$ bin/hadoopfs –get foo LocalFooOther commands$ hadoopfs -rmrdirectory|file Recursively
A MapReducejob usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system (HDFS). The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasksrunning on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.M/R supports multiple nodes processing contiguous data sets in parallel. Often, one node will “Map”, while another “Reduces”. In between, is the “shuffle”. We’ll cover all these in more detail.Basic “phases”, or steps, are displayed here. Other custom phases may be added.InMapReduce data is passed around as key/value pairs.QUESTION: is the “Reduce” phase required? NOINSTRUCTOR SUGGESTION: During the second day start – have one or more students re-draw this on the white board to re-iterate its importance.
MapReduce consists of Java classes that can process big data in parallel processes. It receives three inputs, a source collection, a Map function and a Reduce function, then returns a new result data collection.The algorithm is composed of a few steps, the first one executes the map() function to each item within the source collection. Map will return zero or may instances of Key/Value objects.Map’s responsibility is to convert an item from the source collection to zero or many instances of Key/Value pairs.In the next step , an algorithm will sort all Key/Value pairs to create new object instances where all values will be grouped by Key.reduce() then groups each by Key/Value instance. It returns a new instance to be included into the result collection.
A MapReducejob usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner.An InputFormat determines how the data is split across map tasksTextInputFormat is the default InputFormatIt divides the input data into an InputSplit[] arrayEach InputSplit is given an individual mapEach InputSplit is associated with a destinationnodethat should be used to read the data so Hadoop can try to assign the computation to local data locationThe RecordReader class:Reads input data and produces key-value pairs to be passed into the MapperMay control how data is decompressedConverts data to Java types that MapReduce can work withMapReduce relies on the InputFormat to:Validate the input-specification of the job.Split-up the input file(s) into logical InputSplit instances, each of which is then assigned to an individual Mapper.Provide the RecordReader implementation used to glean input records from the logical InputSplit for processing by the Mapper.The default behavior of InputFormatis to split the input into logical InputSplit instances based on the total size, in bytes, of the input files. [However, HDFS blocksize of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via the config parameter mapred.min.split.size.] * validate thisLogical splits based on input-size is insufficient for many applications since record boundaries must be respected. In such cases, the application should implement a RecordReader, who is responsible for respecting record-boundaries and presents a record-oriented view of the logical InputSplit to the individual task.If the defaultTextInputFormat is used for a given job, the framework detects compressed inputfiles (i.e., with the .gz extension)and automatically decompresses them using an appropriate CompressionCodec. Note that, depending on the codec used, compressed files cannot be split and each compressed file is processed in its entirety by a single mapper.gzip cannot be split, bzip and lzocan.
Mapping is often used for typical “ETL” processes, such as filtering, transformation, and categorization.Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.The OutputCollector is provided by theframework to collect data output by the Mapper or the Reducer.The Reporter class is a facility for MapReduce applications to report progress, set application-level status messages and update Counters.Mapper and Reducer implementations can use the Reporter to report progress or just indicate that they are alive. In scenarios where the application takes a significant amount of time to process individual key/value pairs, this is crucial since the framework might assume that the task has timed-out and kill that task. Another way to avoid this is to set the configuration parameter mapred.task.timeout to a high-enough value (or even set it to zero for no time-outs).Applications can also update Counters using the Reporter.Parallel Map ProcessesThe number of maps is usually driven by the number of inputs, orthe total number of blocks in the input files.The right level of parallelism for maps seems to be around 10-100 maps per-node, although it has been set up to 300 maps for very cpu-light map tasks. Task setup takes awhile, so it is best if the maps take at least a minute to execute.If you expect 10TB of input data and have a block size of 128MB, you'll end up with 82,000 maps, unless you use setNumMapTasks(int) (which only provides a hint to the framework) is used to set it even higher.
The Partitioner controls the balancing of the keys of intermediate map-outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function on the key (or portion of it)AllMapper outputs are sorted and then partitioned per ReducerThe total number of partitions is the same as the number of Reduce tasks for the job. ThePartitioner is like a load balancer, that controls which one of the Reduce tasks the intermediate key (and hence the record) is sent to for reduction.Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to the Reducer to determine final output. Users can control that grouping by specifying a Comparator via JobConf.setOutputKeyComparatorClass(Class).MapReduce comes bundled with a library of generally useful Mappers, Reducers, and Partitioner classes.
MapReduce partitions data among indiidual Reducers, usually running on separate nodes.The Shuffle phase is determined by how a Partitioner (embedded or custom) partitions value pairs to ReducersThe shuffle is where refinements and improvements are continually being made, In many ways, the shuffle is the heart of MapReduce and is where the “magic” happens.
The Sort phase guarantees that the input to every Reducer is sorted by key.
ARecuder extends MapReduceBase(). It implements the Reducer interfaceReceives output from multiple mappersreduce() method is called for each <key, (list of values)> pair in the grouped inputsIt sorts data by key of key/value pairTypically iterates through the values associated with a keyIt is perfectly legal to set the number of reduce-tasks to zero if no reduction is desired.In this case the outputs of the map-tasks go directly to HDFS, to the output path set by setOutputPath(Path). The framework does not sort the map-outputs before writing them out toHDFS.Reducer has 3 primary phases: Shuffle, Sort and Reduce:ShuffleInput to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.SortThe framework groups Reducer inputs by keys (since different mappers may have output the same key) in this stage.The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged.Secondary SortIf equivalence rules for grouping the intermediate keys are required to be different from those for grouping keys before reduction, then one may specify a Comparator via JobConf.setOutputValueGroupingComparator(Class). Since JobConf.setOutputKeyComparatorClass(Class) can be used to control how intermediate keys are grouped, these can be used in conjunction to simulate secondary sort on values.ReduceIn this phase the reduce() method is called for each <key, (list of values)> pair in the grouped inputs.The output of the reduce task is typically written to HDFS via OutputCollector.collect(WritableComparable, Writable).Applications can use the Reporter to report progress, set application-level status messages and update Counters, or just indicate that they are alive.The output of the Reducer is not sorted.
OutputFormat has two responsibilities: determining where and how data is written. Itdetermines where by examining the JobConf and checking to make sure it is a legitimate destination. The 'how' is handled by the getRecordWriter() function.From Hadoop's point of view, these classes come together when the MapReduce job finishes and each reducer has produced a stream of key-value pairs. Hadoop calls checkOutputSpecs with the job's configuration. If the function runs without throwing an Exception, it moves on and calls getRecordWriter which returns an object which can write that stream of data. When all of the pairs have been written, Hadoop calls the close function on the writer, committing that data to HDFS and finishing the responsibility of that reducer.MapReducerelies on the OutputFormat of the job to:Validate the output-specification of the job; for example, check that the output directory doesn't already existProvide the RecordWriter implementation used to write the output files of the job. Output files are stored in a FileSystem (usually HDFS)TextOutputFormat is the default OutputFormat.OutputFormats specify how to serialize data by providing a implementation of RecordWriter. RecordWriter classes handle the job of taking an individual key-value pair and writing it to the location prepared by the OutputFormat. There are two main functions to a RecordWriter implements: 'write' and 'close'. The 'write' function takes key-values from the MapReduce job and writes the bytes to disk. The default RecordWriter is LineRecordWriter, part of the TextOutputFormat mentioned earlier. It writes:The key's bytes (returned by the getBytes() function)a tab character delimiterthe value's bytes (again, produced by getBytes())a newline character.The 'close' function closes the Hadoop data stream to the output file.We've talked about the format of output data, but where is it stored? Again, you've probably seen the output of a job stored in many ‘part' files under the output directory like so:|-- output-directory| |-- part-00000| |-- part-00001| |-- part-00002| |-- part-00003| |-- part-00004 '-- part-00005from http://www.infoq.com/articles/HadoopOutputFormat
Mostly "boilerplate" with fill in the blank for datatypesHadoopdatatypes have well defined methods to get/put values into themStandard pattern to declare your output objects outside map() loop scopeStandard pattern to 1) get Java datatypes 2) Do your logic 3) Put into HadoopdatatypesThis map() method is called in a "loop" with successive key, value pairs.Each time through, you typically write key, value pairs to the reduce phaseThe key, value pairs go to the Hadoop Framework where the key is hashed to choose a reducer
In the Reducer:Reduce code is copied to multiple nodes in the clusterThe copies are identicalAll key,value pairs with identical keys are placed into a pair of (key, Iterator<values>) when passed to the reducer.Each invocation of the reduce() method is passed all values associated with a particular key.The keys are guaranteed to arrive in sorted orderEach incoming value is a numeric count of words in a particular line of textThe multiple values represent multiple lines of text processed by multiple map() methodsThe values are in an Iterator, and are summed in a loopThe sum, with its associated key, is sent to a disk file via the Hadoop FrameworkThere are typically many copies of your reduce codeIncoming key,value pairs are of the datatypes that map emitsOutput key,value pairs go to the HDFS, which writes them to diskThe Hadoop Framework constructs the Iterator from map values that are sent to this Reducer
An inverted index is a data structure that stores a mapping from content to its locations in a database file, or in a document or a set of documentsCan provide what to display for a searchNOTE: need to deal with OutputCollector() here.
Pig Latin is a data flow language – a sequence of steps where each step is an operation or commandDuring execution each statement is processed by the Pig interpreter and checked for syntax errorsIf a statement is valid, it gets added to a logical plan built by the interpreterHowever, the step does not actually execute until the entire script is processed, unless the step is a DUMP or STORE command, in which case the logical plan is compiled into a physical plan and executed
After the LOAD statement, point out that nothing is actually loaded! The LOAD command only defines a relationThe FILTER and GROUP commands are fairly self explanatoryThe HDFS is not even touched until the STORE command executes, at which point a MapReduce application is built from the Pig Latin statements shown here.
GruntPig’s interactive shellEnables users to enter Pig Latin interactivelyGrunt will do basic syntax and semantic checking as you enter each lineProvides a shell for users to interact with HDFSTo enter Grunt and use local file system instead:$ pig –x localIn the example script:A is called a relation or aliasimmutable, recreated if reusedmyfile is read from your home directory in HDFSA has no schema associated with itbytearrayLOAD uses default function PigStorage() to load dataAssumes TAB delimited in HDFSThe entire file 'myfile' will be read"pigs eat anything"The elements in A can be referenced by position if no schema is associated$0, $1, $2, ...
Pig return codes:(type in page 18 of Pig)
The complex types can contain data of any type
The DESCRIBE output is:describe employeesemployees: {name: chararray,age: bytearray,zip: int,salary: bytearray}
Pig includes the concept of a data element being null. Data of any type can be null. It is important to understand that in Pig the concept of null is the same as in SQL, which is completely different from the concept of null in C, Java, Python, etc. In Pig a null data element means the value is unknown. This might be because the data is missing, an error occurred in processing it, etc. In most procedural languages, a data value is said to be null when it is unset or does not point to a valid address or object. This difference in the concept of null is important and affects the way Pig treats null data, especially when operating on it.from Programming Pig, Alan Gates 2011
twokey.pig collects all records with the same value for the provided key together into a bagWithin a single relation, group together tuples with the same group keyCan use keywords BY and ALL with GROUPCan optionally be passed as an aggregate function (count.pig)Can group on multiple keys if they are surrounded by parentheses