The document discusses the need for a model management framework to ease the development and deployment of analytical models at scale. It describes how such a framework could capture and template models created by data scientists, enable faster model iteration through a brute force approach, and visually compare models. The framework would reduce complexity for data scientists and allow business analysts to participate in modeling. It is presented as essential for enabling predictive modeling on data from thousands of sensors in an Internet of Things platform.
This document discusses Apache Oozie usage at Yahoo for managing complex data pipelines. It describes how Oozie is deployed at a large scale with high availability. It outlines the types of data pipelines used for tasks like ad targeting and content management. Challenges for large pipelines like dependency management, SLA monitoring, and reprocessing are discussed. User-built monitoring systems are described that integrate with Oozie for tasks like alerting and long job detection. Future work areas like improved testing and coordination are proposed.
This document discusses enabling real-time machine learning visualization with Spark. It presents a callback interface for Spark ML algorithms to send messages during training and a task channel to deliver messages from the Spark driver to a client. The messages are pushed to a browser using server-sent events and HTTP chunked responses. This allows visualizing training metrics, determining early stopping, and monitoring algorithm convergence in real time.
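As a rough illustration of the push mechanism this summary describes (not the talk's actual code), here is a minimal Python sketch of streaming training metrics to a browser with server-sent events over a chunked HTTP response; the Flask app, route, and message format are assumptions for the example:

```python
# A minimal sketch of the server-sent-events push mechanism described above,
# using Flask; the queue name and message shape are illustrative assumptions.
import json
import queue
from flask import Flask, Response

app = Flask(__name__)
training_metrics = queue.Queue()  # hypothetical channel fed by the Spark driver


@app.route("/metrics/stream")
def stream_metrics():
    def generate():
        while True:
            metric = training_metrics.get()  # e.g. {"iteration": 10, "loss": 0.42}
            # SSE frames are "data: <payload>\n\n"; Flask delivers them as chunked HTTP.
            yield f"data: {json.dumps(metric)}\n\n"

    return Response(generate(), mimetype="text/event-stream")
```

In the browser, an `EventSource` pointed at this route would receive each metric as it arrives, which is what makes live loss curves and early-stopping decisions possible.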
Slim Baltagi, director of Enterprise Architecture at Capital One, gave a presentation at Hadoop Summit on major trends in big data analytics. He discussed 1) increasing portability between execution engines using Apache Beam, 2) the emergence of stream analytics driven by data streams, technology advances, business needs and consumer demands, 3) the growth of in-memory analytics using tools like Alluxio and RocksDB, 4) rapid application development using APIs, notebooks, GUIs and microservices, 5) open sourcing of machine learning systems by tech giants, and 6) hybrid cloud computing models for deploying big data applications both on-premise and in the cloud.
This document discusses building a streaming analytics platform and its requirements and capabilities. It covers 5 key attributes of a streaming platform including ingest, process, analyze, visualize and respond. It discusses 6 focus areas for building a streaming platform including common abstraction layer, latency, lambda architecture, scale-out, rapid application development and data visualization. It provides examples of capabilities including continuous processing, event sources, storage, filtering, transformations, correlations and pattern detection. It uses a cybersecurity use case to demonstrate how streaming analytics can be applied for threat protection through exposing, ingesting, prioritizing, analyzing and remediating steps. It also discusses core data platform requirements and interesting problems around offline and local processing for Neustar.
The document discusses accelerating enterprise adoption of Apache Hadoop through a capability-driven approach. It outlines four core tenets for a Hadoop journey: having a capability-driven framework, using a heterogeneous set of technologies, choosing the right fit of open source and commercial solutions, and developing a flexible operating model. Case studies show how following these tenets can help reduce data processing times and give business users improved analytics capabilities.
Apache Eagle is a distributed real-time monitoring and alerting engine for Hadoop that was created by eBay and later open sourced as an Apache Incubator project. It provides security for Hadoop systems by instantly identifying access to sensitive data, recognizing attacks/malicious activity, and blocking access in real time through complex policy definitions and stream processing. Eagle was designed to handle the huge volume of metrics and logs generated by large-scale Hadoop deployments through its distributed architecture and use of technologies like Apache Storm and Kafka.
Modern data warehouses need to be modernized to handle big data, integrate multiple data silos, reduce costs, and reduce time to market. A modern data warehouse blueprint includes a data lake to land and ingest structured, unstructured, external, social, machine, and streaming data alongside a traditional data warehouse. Key challenges for modernization include making data discoverable and usable for business users, rethinking ETL to allow for data blending, and enabling self-service BI over Hadoop. Common tactics for modernization include using a data lake as a landing zone, offloading infrequently accessed data to Hadoop, and exploring data in Hadoop to discover new insights.
This document discusses end-to-end processing of 3.7 million telemetry events per second using a lambda architecture at Symantec. It provides an overview of Symantec's security data lake infrastructure, the telemetry data processing architecture using Kafka, Storm and HBase, tuning targets for the infrastructure components, and performance benchmarks for Kafka, Storm and Hive.
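For context on the kind of tuning such a pipeline involves, below is a hedged kafka-python sketch of a throughput-oriented producer; the broker address, topic, and settings are illustrative starting points, not Symantec's actual configuration:

```python
# A hedged sketch of a throughput-oriented Kafka producer; values illustrative.
import json
from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers=["broker1:9092"],
    acks=1,                    # trade some durability for latency/throughput
    linger_ms=50,              # wait briefly so sends batch together
    batch_size=256 * 1024,     # larger batches amortize per-request overhead
    compression_type="lz4",    # cheap compression helps network-bound loads
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("telemetry", {"event": "file_scan", "ts": 1467000000})
producer.flush()
```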
This document discusses predictive maintenance of robots in the automotive industry using big data analytics. It describes Cisco's Zero Downtime solution which analyzes telemetry data from robots to detect potential failures, saving customers over $40 million by preventing unplanned downtimes. The presentation outlines Cisco's cloud platform and a case study of how robot and plant data is collected and analyzed using streaming and batch processing to predict failures and schedule maintenance. It proposes a next generation predictive platform using machine learning to more accurately detect issues before downtime occurs.
This document discusses using active learning for fraud prevention at PayPal. It introduces fraud prevention techniques at PayPal, including machine learning models at the transaction, account, and network levels. It then describes an active learning framework that uses deep learning and gradient boosted trees models along with a query by committee strategy. The experiments show that active learning is able to improve the area under the ROC curve performance of these models while significantly reducing labeling costs compared to random sampling for training data.
The document discusses LLAP (Live Long and Process), a new execution engine in Apache Hive 2.0 that enables sub-second analytical queries. LLAP keeps a small subset of frequently accessed data in memory to enable faster query processing times compared to traditional Hive architectures that rely on disk access. It works by running Hive query fragments concurrently in long-running daemon processes with an in-memory cache, rather than in short-lived YARN containers. This allows queries to retrieve data from memory rather than disk, providing significant performance improvements for interactive analytics workloads. The document provides details on how LLAP is implemented and evaluates its performance benefits based on benchmarks and customer case studies.
The document discusses Apache Beam, a solution for next generation data processing. It provides a unified programming model for both batch and streaming data processing. Beam allows data pipelines to be written once and run on multiple execution engines. The presentation covers common challenges with historical data processing approaches, how Beam addresses these issues, a demo of running a Beam pipeline on different engines, and how to get involved with the Apache Beam community.
This document discusses the new features of Apache Hive 2.0, including:
1) The addition of procedural SQL capabilities through HPLSQL to add features like cursors and loops.
2) Performance improvements for interactive queries through LLAP which uses in-memory caching and persistent daemons.
3) Using HBase as the metastore to speed up query planning by reducing metadata access times.
4) Enhancements to Hive on Spark such as dynamic partition pruning and vectorized joins.
5) Improvements to the cost-based optimizer including better statistics collection.
The document provides an introduction to Spark and related Apache projects. It discusses the Spark ecosystem including Spark Core, Spark SQL, Spark Streaming and MLlib. It explains key Spark concepts like RDDs, DataFrames and the Spark SQL optimizer Catalyst. It also introduces the Hortonworks Data Platform which includes components like YARN, HDFS and Zeppelin which can be used with Spark.
Sharing metadata across the data lake and streams (DataWorks Summit)
Traditionally systems have stored and managed their own metadata, just as they traditionally stored and managed their own data. A revolutionary feature of big data tools such as Apache Hadoop and Apache Kafka is the ability to store all data together, where users can bring the tools of their choice to process it.
Apache Hive's metastore can be used to share the metadata in the same way. It is already used by many SQL and SQL-like systems beyond Hive (e.g. Apache Spark, Presto, Apache Impala, and via HCatalog, Apache Pig). As data processing changes from only data in the cluster to include data in streams, the metastore needs to expand and grow to meet these use cases as well. There is work going on in the Hive community to separate out the metastore, so it can continue to serve Hive but also be used by a more diverse set of tools. This talk will discuss that work, with particular focus on adding support for storing schemas for Kafka messages.
Speaker
Alan Gates, Co-Founder, Hortonworks
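To make the shared-metastore idea concrete, here is a small PySpark sketch: a SparkSession pointed at the same Hive metastore sees tables defined by other engines through its catalog. The database and table names are assumptions for the example.

```python
# A small sketch of the "shared metastore" idea from the Spark side: with Hive
# support enabled, tables created by Hive (or Presto, Impala, etc.) against the
# same metastore are visible through Spark's catalog.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("shared-metastore-demo")
         .enableHiveSupport()          # use the Hive metastore as the catalog
         .getOrCreate())

for table in spark.catalog.listTables("analytics"):   # hypothetical database
    print(table.name, table.tableType)

spark.sql("SELECT COUNT(*) FROM analytics.page_views").show()
```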
Stream Processing is emerging as a popular paradigm for data processing architectures, because it handles the continuous nature of most data and computation and gets rid of artificial boundaries and delays. In this talk, we are going to look at some of the most common misconceptions about stream processing and debunk them.
- Myth 1: Streaming is approximate and exactly-once is not possible.
- Myth 2: Streaming is for real-time only.
- Myth 3: You need to choose between latency and throughput.
- Myth 4: Streaming is harder to learn than Batch Processing.
We will look at these and other myths and debunk them at the example of Apache Flink. We will discuss Apache Flink's approach to high performance stream processing with state, strong consistency, low latency, and sophisticated handling of time. With such building blocks, Apache Flink can handle classes of problems previously considered out of reach for stream processing. We also take a sneak preview at the next steps for Flink.
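As one concrete example of the latency-versus-throughput myth, the sketch below shows Flink's buffer timeout, which bounds how long records may sit in a network buffer so the trade-off becomes a tunable knob rather than a binary choice. It assumes the PyFlink DataStream API; the pipeline and values are illustrative.

```python
# A minimal PyFlink sketch of the latency-vs-throughput knob: Flink batches
# records into network buffers, and the buffer timeout bounds how long any
# record may wait before a partially filled buffer is flushed.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.set_buffer_timeout(5)  # flush partially filled buffers after 5 ms

counts = (env.from_collection(["a", "b", "a"])
          .map(lambda w: (w, 1))
          .key_by(lambda t: t[0])
          .reduce(lambda x, y: (x[0], x[1] + y[1])))
counts.print()
env.execute("buffer-timeout-demo")
```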
Innovation in the Enterprise Rent-A-Car Data Warehouse (DataWorks Summit)
Big Data adoption is a journey. Depending on the business the process can take weeks, months, or even years. With any transformative technology the challenges have less to do with the technology and more to do with how a company adapts itself to a new way of thinking about data. Building a Center of Excellence is one way for IT to help drive success.
This talk will explore Enterprise Holdings Inc. (which operates the Enterprise Rent-A-Car, National Car Rental and Alamo Rent A Car brands) and its experience with Big Data. EHI’s journey started in 2013 with Hadoop as a POC, and today the company is working to create its next generation data warehouse in Microsoft’s Azure cloud utilizing a lambda architecture.
We’ll discuss the Center of Excellence, the roles in the new world, share the things which worked well, and rant about those which didn’t.
No deep Hadoop knowledge is necessary; the talk is aimed at an architect or executive level.
This document provides an overview of NEC's Heterogeneous Mixture Learning (HML) technology and its implementation on Apache Spark. It introduces the speakers and their backgrounds working on distributed computing and machine learning. The agenda discusses HML, applications of HML, the HML algorithm, and benchmark performance evaluations showing HML achieves competitive prediction accuracy compared to other Spark ML algorithms while maintaining good scalability. Distributed HML on Spark aims to enable fast, large-scale machine learning by balancing work across executors and leveraging high-performance matrix libraries.
This document discusses building a scalable data science platform with R. It describes R as a popular statistical programming language with over 2.5 million users. It notes that while R is widely used, its open source nature means it lacks enterprise capabilities for large-scale use. The document then introduces Microsoft R Server as a way to bring enterprise capabilities like scalability, efficiency, and support to R in order to make it suitable for production use on big data problems. It provides examples of using R Server with Hadoop and HDInsight on the Azure cloud to operationalize advanced analytics workflows from data cleaning and modeling to deployment as web services at scale.
This document provides an overview of Hadoop and its ecosystem. It discusses the evolution of Hadoop from version 1 which focused on batch processing using MapReduce, to version 2 which introduced YARN for distributed resource management and supported additional data processing engines beyond MapReduce. It also describes key Hadoop services like HDFS for distributed storage and the benefits of a Hadoop data platform for unlocking the value of large datasets.
Rocketfuel processes over 120 billion ad auctions per day and needs to detect fraud in real time to prevent losses. They developed Helios, which ingests event data from Kafka and HDFS into Storm in real time, joins the streams in HBase, then runs MapReduce jobs hourly to populate an OLAP cube for analyzing feature vectors and detecting fraud patterns. This architecture on Hadoop allows them to easily scale real-time processing and experiment with different configurations to quickly react to fraud.
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden (Turkish Testing Board)
Agile, Continuous Integration, DevOps, and big data are no longer buzzwords but part of the day-to-day process of everyone working in software development and delivery. To cope with applications that need to be deployed to production almost the moment they are created, software development has changed, impacting the way of working for everyone on the team. In this talk, Roland will discuss the challenges performance testers face with big data applications and how architecture, Agile, Continuous Integration and DevOps come together to create solutions.
The document discusses resource tracking for Hadoop and Storm clusters at Yahoo. It describes how Yahoo developed tools over three years to track resource usage at the application, cluster, queue, user and project levels. This includes capturing CPU and memory usage for Hadoop YARN applications and Storm topologies. The data is stored and made available through dashboards and APIs. Yahoo also calculates total cost of ownership for Hadoop and converts resource usage to estimated monthly costs for projects. This visibility into usage and costs helps with capacity planning, operational efficiency, and ensuring fairness across grid users.
eBay maintains hundreds of millions of accounts across its properties that are unstructured and in different formats. Identifying which accounts belong to the same person enables eBay to personalize customer experiences, provide customer service, and fight fraud. MapReduce provides a robust design pattern to simplify high-scale entity resolution through parallelized modular operations, including linking accounts pairwise, identifying connected components through iterative MapReduce jobs, and validating the results.
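A compact single-machine sketch of the iterative connected-components idea may help: each pass below plays the role of one MapReduce iteration, propagating the smallest label across pairwise account links until nothing changes. The account IDs are made up for illustration.

```python
# Single-machine sketch of iterative connected components for entity
# resolution: every account starts as its own component, and pairwise links
# repeatedly propagate the smallest label until a fixpoint is reached.
links = [("acct1", "acct2"), ("acct2", "acct7"), ("acct5", "acct6")]

label = {a: a for pair in links for a in pair}  # every account labels itself
changed = True
while changed:                      # each pass mirrors one MapReduce iteration
    changed = False
    for a, b in links:
        smaller = min(label[a], label[b])
        for node in (a, b):
            if label[node] != smaller:
                label[node] = smaller
                changed = True

print(label)  # acct1, acct2, acct7 share one label; acct5, acct6 share another
```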
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse (DataWorks Summit)
Yahoo Mail has 200+ million users a month and generates hundreds of terabytes of data per day, which continues to grow steadily. The nature of email messages has also evolved: for example, today the majority of them are generated by machines, consisting of newsletters, social media notifications, purchase invoices, travel bookings, and the like, which drove innovations in product development to help users organize their inboxes.
Since 2014, the Yahoo Mail Data Engineering team took on the task of revamping the Mail data warehouse and analytics infrastructure in order to drive the continued growth and evolution of Yahoo Mail. Along the way we have built a 50 PB Hadoop warehouse, and surrounding analytics and machine learning programs that have transformed the way data plays in Yahoo Mail.
In this session we will share our experience from this three-year journey, covering the system architecture, the analytics systems we built, and the lessons learned from development and the drive for adoption.
Modern data ecosystems require new paradigms to address diverse data sources and user needs. Traditional assumptions about data originating from internal systems and a single data warehouse no longer apply. A new model called "Data Regions" establishes multiple environments for different data usage scenarios, including source onboarding, exploration, reporting, analytics and more. By supporting varied access, structures, domains and integrity across regions, Data Regions can address today's complex data challenges and modernize companies' data ecosystems.
The document provides an overview of new features in Apache Ambari 2.1, including rolling upgrades, alerts, metrics, an enhanced dashboard, smart configurations, views, Kerberos automation, and blueprints. Key highlights include the ability to perform rolling upgrades of Hadoop clusters without downtime by managing different software versions side-by-side, new alert types and a user interface for viewing and customizing alerts, integration of a metrics service for collecting and querying metrics from Hadoop services, customizable service dashboards with new widget types, smart configurations that provide recommended values and validate configurations based on cluster attributes and dependencies, and automated Kerberos configuration.
Big Data Day LA 2016 / Hadoop / Spark / Kafka track - Why is my Hadoop cluster s... (Data Con LA)
This talk draws on our experience in debugging and analyzing Hadoop jobs to describe some methodical approaches to this and present current and new tracing and tooling ideas that can help semi-automate parts of this difficult problem.
This document summarizes a presentation given by Chris Nauroth and Sheetal Dolas of Hortonworks on keeping Hadoop clusters running optimally. It describes several common operational challenges faced by Hadoop users through examples, and how the SmartSense tool can help address these issues by continuously evaluating cluster configurations, identifying risks, and providing recommendations. The presentation covers topics such as unstable NameNodes, high CPU usage, HDFS upgrades, container sizing, accidental data deletion, and time synchronization issues across nodes.
1. The document discusses security considerations for deploying big data as a service (BDaaS) across multiple tenants and applications. It focuses on maintaining a single user identity to prevent data duplication and enforce access policies consistently.
2. It describes using Apache Ranger to centrally define and enforce policies across Hadoop services like HDFS, HBase, Hive. Ranger integrates with LDAP/AD for authentication.
3. The key challenge is propagating user identities from the application layer to the data layer. This can be done by connecting HDFS directly via Kerberos or using a "super-user" that impersonates other users when accessing HDFS.
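As a hedged illustration of that identity-propagation pattern (not the talk's implementation), the snippet below uses WebHDFS's doas parameter so a service account acts on behalf of an end user; the host, path, and user names are assumptions, and a real deployment would authenticate the super-user with Kerberos/SPNEGO rather than user.name:

```python
# Illustrative "super-user impersonates the end user" call via WebHDFS, which
# accepts a doas parameter for proxy access. All names here are made up.
import requests

resp = requests.get(
    "http://namenode:9870/webhdfs/v1/data/finance/report.csv",
    params={"op": "GETFILESTATUS", "user.name": "svc-bdaas", "doas": "alice"},
)
print(resp.json())  # the HDFS access check runs against alice, not the service user
```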
This document describes 7 predictive analytics, Spark, and streaming use cases:
1) Live train time tables reduced spread by 40% for Dutch Railways
2) Intelligent equipment saved $40M/year for oil and gas companies
3) Algorithmic loyalty found products customers didn't know they needed for North Face
4) Predictive risk compliance avoided $440M loss in 40 minutes for ConvergEx
5) Live flight optimization helped get passengers home on time for United Airlines
6) Continuous transaction optimization monitored 20,000 systems for Morgan Stanley
7) IoT parcel tracking improved real-time tracking from 20% to 100% for Royal Mail
Efficient Point Cloud Pre-processing using The Point Cloud Library (CSCJournals)
Robotics, video games, environmental mapping and medicine are some of the fields that use 3D data processing. In this paper we propose a novel optimization approach for the open source Point Cloud Library (PCL), which is frequently used for processing 3D data. Three main aspects of the PCL are discussed: point cloud creation from the disparity of color image pairs; voxel grid downsample filtering to simplify point clouds; and passthrough filtering to adjust the size of the point cloud. Additionally, OpenGL shader-based rendering is examined. An optimization technique based on CPU cycle measurement is proposed and applied in order to optimize those parts of the pre-processing chain where measured performance is slowest. Results show that with the optimized modules, the performance of the pre-processing chain increased 69-fold.
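To make the voxel-grid step concrete, here is a short NumPy sketch of the same idea PCL's VoxelGrid filter implements in C++: points are bucketed into cubic voxels and each voxel is replaced by the centroid of its points. The leaf size and random cloud are illustrative.

```python
# Voxel-grid downsampling: bucket points into cubic voxels of side `leaf`,
# then replace each voxel's points by their centroid.
import numpy as np


def voxel_downsample(points: np.ndarray, leaf: float) -> np.ndarray:
    keys = np.floor(points / leaf).astype(np.int64)           # voxel index per point
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    out = np.zeros((inverse.max() + 1, 3))
    counts = np.bincount(inverse).reshape(-1, 1)
    np.add.at(out, inverse, points)                           # sum points per voxel
    return out / counts                                       # centroid per voxel


cloud = np.random.rand(100_000, 3)
print(voxel_downsample(cloud, leaf=0.05).shape)  # far fewer points than the input
```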
6. Staffing system and retention management (Preeti Bhaskar)
This document discusses staffing, turnover, retention, and downsizing. It begins by defining types of turnover like voluntary, involuntary, discharge, and downsizing. Voluntary turnover can be avoidable or unavoidable. Causes of turnover include perceptions of desirability and ease of leaving a job as well as available alternatives. The document then discusses measuring turnover, analyzing reasons for leaving, and estimating costs and benefits. It provides guidelines for increasing retention through intrinsic and extrinsic rewards. Finally, it outlines retention initiatives for discharge situations and downsizing, including progressive discipline, alternatives to layoffs, and supporting employees who remain after downsizing.
The document discusses techniques for training killer whales at SeaWorld. It explains that trainers build trust with the whales by showing them love and care. They reward positive behaviors with food and praise immediately after the whales perform desired actions. When mistakes occur, trainers redirect the whales' energy to a different task rather than punishing them. Young whales are taught new tricks gradually by first rewarding them for closer approximations until they fully perform the behavior. Some key lessons for motivating people at work are to accentuate the positive, redirect mistakes, use praise immediately after good performance, and start training incrementally toward larger goals.
The document defines several formulas for calculating metrics related to testing efforts: % Effort Variation compares actual and estimated effort, % Duration Variation compares actual and planned durations, and % Schedule Variation compares actual and planned end dates. Other metrics include Load Factor, % Size Variation, Test Case Coverage %, Residual Defects Density, Test Effectiveness, Overall Productivity, Test Case Preparation Productivity, and Test Execution Productivity.
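A minimal sketch of these variation metrics, assuming the usual convention of (actual - planned) / planned * 100; the document names the metrics without spelling out the denominators, so that convention is an assumption:

```python
# Hedged one-line implementation of the percentage-variation metrics above,
# assuming variation is measured relative to the planned/estimated value.
def pct_variation(actual: float, planned: float) -> float:
    return (actual - planned) / planned * 100.0


print(pct_variation(actual=220, planned=200))  # % Effort Variation = 10.0
print(pct_variation(actual=12, planned=10))    # % Duration Variation = 20.0
```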
Overcoming the Challenges of your Master Data Management Journey (Jean-Michel Franco)
This presentation runs you through all the key steps of an MDM initiative. It considers and showcases the key milestones and building blocks that you will have to roll out along your MDM journey.
-> Please contact Talend for a dedicated interactive session with a storyboard by customer domain
After studying this chapter you will be able to:
Explain how markets work with international trade
Identify the advantages implicit in international trade, and point out who gains and who loses from it
Explain the effects of barriers to international trade
Explain and evaluate the arguments used to justify restrictions on international trade
This document provides an overview of Apache Atlas and how it addresses big data governance issues for enterprises. It discusses how Atlas provides a centralized metadata repository that allows users to understand data across Hadoop components. It also describes how Atlas integrates with Apache Ranger to enable dynamic security policies based on metadata tags. Finally, it outlines new capabilities in upcoming Atlas releases, including cross-component data lineage tracking and a business taxonomy/catalog.
This medical billing flow chart outlines the process from a patient visit to a provider's front office to insurance billing and payment. It shows that the billing office handles converting visit details to insurance formats, submitting claims to insurance companies, and following up on payments or denials with cash posting and accounts receivable management. Key steps include preliminary screening of visits, dispatching to a clearing house, and claim adjudication by insurance companies.
Predictive Analytics: Extending asset management framework for multi-industry... (Capgemini)
The document provides information about a webinar on predictive asset management. It discusses how asset data from sensors can be analyzed using HP's HAVEn platform to optimize equipment reliability, performance management, operations control, and predictive maintenance. Examples of predictive asset analytics solutions for various industries are also presented.
Application Developers Guide to HIPAA Compliance (TrueVault)
Software developers building mobile health applications need to be HIPAA compliant if their application will be collecting and sharing protected health information. This free plain language guide gives developers everything they need to know about mobile health app development and HIPAA.
Not every mHealth app needs to be HIPAA compliant. Not sure whether your mHealth application needs to be HIPAA compliant or not? Read the guide to find out!
This was a presentation I gave for a seminar in my Pharmaceutical Analysis class. I have tried to include as much as possible, but have not gone into great depth, since that would be beyond the syllabus. If there are any mistakes, please do leave a comment.
Mobile Commerce: A Security Perspective (Pragati Rai)
The document discusses mobile commerce (m-commerce) and security perspectives. It defines m-commerce as commerce conducted on mobile devices, which is growing rapidly and expected to reach $700 billion by 2017. The document outlines the m-commerce ecosystem and various security challenges at each layer from infrastructure to applications. It emphasizes the importance of end-to-end security and compliance with the PCI security standard to help protect users and businesses in the complex mobile commerce space.
Introduction to streaming data, the difference between batch processing and stream processing, research issues in streaming data processing, performance evaluation metrics, and tools for stream processing.
Gain New Insights by Analyzing Machine Logs using Machine Data Analytics and BigInsights.
Half of Fortune 500 companies experience more than 80 hours of system downtime annually. Spread evenly over a year, that amounts to approximately 13 minutes every day. As a consumer, the thought of online bank operations being inaccessible so frequently is disturbing. As a business owner, when systems go down, all processes come to a stop. Work in progress is destroyed, and failure to meet SLAs and contractual obligations can result in expensive fees, adverse publicity, and loss of current and potential future customers. Ultimately, the inability to provide a reliable and stable system results in lost revenue. While the failure of these systems is inevitable, the ability to predict failures in time and intercept them before they occur is now a requirement.
A possible solution to the problem can be found is in the huge volumes of diagnostic big data generated at hardware, firmware, middleware, application, storage and management layers indicating failures or errors. Machine analysis and understanding of this data is becoming an important part of debugging, performance analysis, root cause analysis and business analysis. In addition to preventing outages, machine data analysis can also provide insights for fraud detection, customer retention and other important use cases.
Building a Real-Time Security Application Using Log Data and Machine Learning... (Sri Ambati)
Building a Real-Time Security Application Using Log Data and Machine Learning - Karthik Aaravabhoomi
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Why Your Data Science Architecture Should Include a Data Virtualization Tool... (Denodo)
Watch full webinar here: https://bit.ly/35FUn32
Presented at CDAO New Zealand
Advanced data science techniques, like machine learning, have proven an extremely useful tool to derive valuable insights from existing data. Platforms like Spark and complex libraries for R, Python, and Scala put advanced techniques at the fingertips of data scientists.
However, most architectures laid out to enable data scientists miss two key challenges:
- Data scientists spend most of their time looking for the right data and massaging it into a usable format
- Results and algorithms created by data scientists often stay out of the reach of regular data analysts and business users
Watch this session on-demand to understand how data virtualization offers an alternative that addresses these issues and can accelerate data acquisition and massaging, along with a customer story on the use of machine learning with data virtualization.
Simplifying Real-Time Architectures for IoT with Apache Kudu (Cloudera, Inc.)
3 Things to Learn About:
*Building scalable real time architectures for managing data from IoT
*Processing data in real time with components such as Kudu & Spark (see the sketch after this list)
*Customer case studies highlighting real-time IoT use cases
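A minimal sketch of the Kudu-plus-Spark pattern, reading a Kudu table through the kudu-spark connector's DataSource options; the master address, table, and columns are illustrative assumptions:

```python
# Reading an (illustrative) Kudu table from Spark via the kudu-spark connector.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kudu-iot-demo").getOrCreate()

readings = (spark.read.format("org.apache.kudu.spark.kudu")
            .option("kudu.master", "kudu-master:7051")
            .option("kudu.table", "sensor_readings")
            .load())

# Typical real-time IoT query: hot devices by reading count.
readings.filter("temperature > 80").groupBy("device_id").count().show()
```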
This document discusses DataOps, which is an agile methodology for developing and deploying data-intensive applications. DataOps supports cross-functional collaboration and fast time to value. It expands on DevOps practices to include data-related roles like data engineers and data scientists. The key goals of DataOps are to promote continuous model deployment, repeatability, productivity, agility, self-service, and to make data central to applications. It discusses how DataOps brings flexibility and focus to data-driven organizations through principles like continuous model deployment, improved efficiency, and faster time to value.
Crossing the Analytics Chasm and Getting the Models You Developed Deployed (Robert Grossman)
There are two cultures in data science and analytics: those who develop analytic models and those who deploy analytic models into operational systems. In this talk, we review the life cycle of analytic models and provide an overview of some of the approaches that have been developed for managing analytic models and workflows and for deploying them, including using analytic engines and analytic containers. We give a quick overview of languages for analytic models (PMML) and analytic workflows (PFA). We also describe the emerging discipline of AnalyticOps, which has borrowed some of the techniques of DevOps.
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It... (Precisely)
This document discusses engineering machine learning data pipelines and addresses five big challenges: 1) scattered and difficult to access data, 2) data cleansing at scale, 3) entity resolution, 4) tracking data lineage, and 5) ongoing real-time changed data capture and streaming. It presents DMX Change Data Capture as a solution to capture changes from various data sources and replicate them in real-time to targets like Kafka, HDFS, databases and data lakes to feed machine learning models. Case studies demonstrate how DMX-h has helped customers like a global hotel chain and insurance and healthcare companies build scalable data pipelines.
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec... (Databricks)
I will share the vision and the production journey of how we build enterprise shared AI As A Service platforms with distributed deep learning technologies, covering these topics:
1) The vision of Enterprise Shared AI As A Service and typical AI services use cases at FinTech industry
2) The high level architecture design principles for AI As A Service
3) The technical evaluation journey to choose an enterprise deep learning framework with comparisons, such as why we choose Deep learning framework based on Spark ecosystem
4) Share some production AI use cases, such as how we implemented new user-item propensity models with deep learning algorithms on Spark to improve the quality, performance and accuracy of offer and campaign design, offer targeting, matching and linking, etc.
5) Share some experiences and tips on using deep learning technologies on top of Spark, such as how we brought Intel BigDL into real production.
SPSS Modeler 16: What's New?
IBM SPSS Modeler is a comprehensive predictive analytics platform, designed to bring predictive intelligence to everyday business problems, enabling front-line employees or systems to make more effective decisions and improve outcomes. Modeler scales from desktop installations through to larger deployments that are integrated within operational systems and provides a range of advanced analytics including text analytics, entity analytics, social network analysis, automated modeling and data preparation in addition to decision management and optimization
Webinar: Faster Big Data Analytics with MongoDB (MongoDB)
Learn how to leverage MongoDB and Big Data technologies to derive rich business insight and build high performance business intelligence platforms. This presentation includes:
- Uncovering Opportunities with Big Data analytics
- Challenges of real-time data processing
- Best practices for performance optimization
- Real world case study
This presentation was given in partnership with CIGNEX Datamatics.
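For a flavor of the analytics patterns such a webinar covers, here is a small PyMongo sketch of an aggregation pipeline with an index supporting the initial filter; the collection, fields, and index are assumptions for the example:

```python
# Illustrative analytics aggregation in PyMongo, with an index backing the
# $match stage so the filter does not scan the whole collection.
from pymongo import ASCENDING, MongoClient

coll = MongoClient("mongodb://localhost:27017")["shop"]["orders"]
coll.create_index([("status", ASCENDING), ("created_at", ASCENDING)])

pipeline = [
    {"$match": {"status": "shipped"}},           # index-supported filter first
    {"$group": {"_id": "$region", "revenue": {"$sum": "$total"}}},
    {"$sort": {"revenue": -1}},
]
for row in coll.aggregate(pipeline):
    print(row["_id"], row["revenue"])
```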
Enhanced Data Visualization provided for 200,000 Machines with OpenTSDB and C... (YASH Technologies)
The client, a large agricultural machinery manufacturer, sought to enhance data visualization of performance measurements from 200,000 machines by implementing OpenTSDB and Cloudera to store, index, and serve metrics collected every 5 seconds. YASH Technologies tuned applications and databases, distributed storage, and eliminated down-sampling to maximize performance. This provided benefits like real-time graphs, capacity planning, and measuring service levels.
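As a hedged sketch of feeding such a store, the snippet below writes one data point to OpenTSDB's HTTP /api/put endpoint; the host, metric name, and tags are illustrative, not the client's actual schema:

```python
# Writing one (illustrative) data point to OpenTSDB's HTTP API, the kind of
# 5-second metric feed described above.
import time
import requests

point = {
    "metric": "machine.engine.rpm",
    "timestamp": int(time.time()),
    "value": 1837,
    "tags": {"machine_id": "m-001234", "fleet": "harvesters"},
}
resp = requests.post("http://opentsdb:4242/api/put", json=[point])
resp.raise_for_status()
```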
This document summarizes a presentation about using cloud computing for big data and business insights. It discusses how modern analytics applications have new requirements like handling large and diverse datasets, iterative experimentation, and variable workloads that are well-suited for cloud computing. The cloud provides massive storage and computing capacity, agility, and ability to handle variable workloads efficiently. It also discusses how the right analytics tools in the cloud can provide unlimited storage, security, flexibility with data formats, powerful and fast processing, and pay per use. Case studies show how companies have benefited from using cloud computing for big data analytics. The Financial Times case study discusses their transformation to digital and how embracing data empowerment helped them achieve business outcomes.
Engineering Machine Learning Data Pipelines Series: Tracking Data Lineage fro... (Precisely)
Tackling the challenge of designing a machine learning model and putting it into production is the key to getting value back – and the roadblock that stops many promising machine learning projects. After the data scientists have done their part, engineering robust production data pipelines has its own set of challenges. Syncsort software helps the data engineer every step of the way.
Once you’ve found and matched duplicates to resolve entities, the next step is to track the lineage of data as it moves from source to final analysis – which is required by several regulations and by the need to verify the source of final decisions or recommendations.
Using Apache Atlas or Cloudera Navigator and tracking data outside the cluster can be done with enterprise governance tools like Collibra or ASG Data Intelligence. However, for auditing purposes you don’t need part of the data provenance – you need ALL of it. Getting a coherent source to consumption view of where specific data came from and how it was changed along the way is much harder than it should be in modern data architectures.
View this 15-minute webcast on-demand to learn how to tackle these challenges in large scale production implementations.
This document discusses analytics and IoT. It covers key topics like data collection from IoT sensors, data storage and processing using big data tools, and performing descriptive, predictive, and prescriptive analytics. Cloud platforms and visualization tools that can be used to build end-to-end IoT and analytics solutions are also presented. The document provides an overview of building IoT solutions for collecting, analyzing, and gaining insights from sensor data.
Platforming the Major Analytic Use Cases for Modern Engineering (DATAVERSITY)
We’ll describe some use cases as examples of the broad range of modern use cases that need a platform, along with some popular, viable technology stacks that enterprises use to accomplish them: customer churn, predictive analytics, fraud detection, and supply chain management.
In many industries, to achieve top-line growth, it is imperative that companies get the most out of existing customer relationships. Customer churn use cases are about generating high levels of profitable customer satisfaction through the use of knowledge generated from corporate and external data to help drive a more positive customer experience (CX).
Many organizations are turning to predictive analytics to increase their bottom line and efficiency and, therefore, their competitive advantage. It can make the difference between business success and failure.
Fraudulent activity detection is exponentially more effective when risk actions are taken immediately (i.e., stopping the fraudulent transaction) instead of after the fact. Fast digestion of a wide range of risk exposures across the network is required in order to minimize adverse outcomes.
Supply chain leaders are under constant pressure to reduce overall supply chain management (SCM) costs while maintaining a flexible and diverse supplier ecosystem. They will leverage IoT, sensors, cameras, and blockchain. Major investments in advanced analytics, warehouse relocation, and automation, both in distribution centers and stores, will be essential for survival.
InfoSphere BigInsights is IBM's distribution of Hadoop that:
- Enhances ease of use and usability for both technical and non-technical users.
- Includes additional tools, technologies, and accelerators to simplify developing and running analytics on Hadoop.
- Aims to help users gain business insights from their data more quickly through an integrated platform.
The Shifting Landscape of Data Integration (DATAVERSITY)
This document discusses the shifting landscape of data integration. It begins with an introduction by William McKnight, described as the "#1 Global Influencer in Data Warehousing", and then discusses how, in IDC's view, the challenges of data integration are shifting from the traditional 3Vs – volume, velocity, and variety – to the 3Ds: dynamic, distributed, and diverse data in the cloud. The rest of the document discusses Matillion, a vendor that provides a modern solution for cloud data integration challenges.
Metrology sampling models using tool sensor data (Arvind Mozumdar)
This document discusses the development and deployment of a predictive metrology model using sensor data from a vacuum deposition tool. The model was created using random forest regression to predict deposition thickness based on tool sensor features. It was validated on historical data and deployed to provide real-time wafer scoring and identify candidates that can be skipped for metrology sampling. This helps reduce metrology costs while maintaining process control. The deployment involved integrating the predictive model with the tool sensor data infrastructure, metrology data systems, and the factory sampling controller. Ongoing work includes expanding the model to additional toolsets and processes.
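A minimal sketch of the modeling step described above, using scikit-learn's random forest regressor; the synthetic data and feature count stand in for the actual tool-sensor and metrology data, which are not public.

    # Sketch: random forest regression predicting deposition thickness from
    # tool sensor features, with a holdout validation step. Data is synthetic;
    # the real model was trained on historical tool/metrology data.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_absolute_error

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 12))   # 12 hypothetical sensor features per run
    y = 50 + 2 * X[:, 0] - X[:, 3] + rng.normal(scale=0.5, size=5000)  # thickness

    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=0)

    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_train, y_train)

    pred = model.predict(X_val)
    print("validation MAE:", mean_absolute_error(y_val, pred))
    # Wafers whose predicted thickness sits well inside spec could be scored
    # as candidates to skip metrology sampling.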
This document discusses running Apache Spark and Apache Zeppelin in production. It begins by introducing the author and their background. It then covers security best practices for Spark deployments, including authentication using Kerberos, authorization using Ranger/Sentry, encryption, and audit logging. Different Spark deployment modes like Spark on YARN are explained. The document also discusses optimizing Spark performance by tuning executor size and multi-tenancy. Finally, it covers security features for Apache Zeppelin like authentication, authorization, and credential management.
This document discusses Spark security and provides an overview of authentication, authorization, encryption, and auditing in Spark. It describes how Spark leverages Kerberos for authentication and uses services like Ranger and Sentry for authorization. It also outlines how communication channels in Spark are encrypted and some common issues to watch out for related to Spark security.
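To make the preceding security discussion concrete, here is a hedged sketch of the kinds of settings involved. The property keys are standard Spark configuration names; the values, application name, and Kerberos principal are placeholders.

    # Sketch: enabling Spark's built-in auth and encryption knobs from PySpark.
    # Property keys are standard Spark configs; values are placeholders.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("secured-app")
        .config("spark.authenticate", "true")              # shared-secret RPC auth
        .config("spark.network.crypto.enabled", "true")    # encrypt RPC traffic
        .config("spark.io.encryption.enabled", "true")     # encrypt shuffle spills
        .config("spark.eventLog.enabled", "true")          # audit via event logs
        .getOrCreate()
    )

    # Kerberos authentication on YARN is typically supplied at submit time, e.g.:
    #   spark-submit --principal etl@EXAMPLE.COM --keytab /path/to/etl.keytab ...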
The document discusses the Virtual Data Connector project which aims to leverage Apache Atlas and Apache Ranger to provide unified metadata and access governance across data sources. Key points include:
- The project aims to address challenges of understanding, governing, and controlling access to distributed data through a centralized metadata catalog and policies.
- Apache Atlas provides a scalable metadata repository while Apache Ranger enables centralized access governance. The project will integrate these using a virtualization layer.
- Enhancements to Atlas and Ranger are proposed to better support the project's goals around a unified open metadata platform and metadata-driven governance.
- An initial minimum viable product will be built this year, with the goal of an open, collaborative ecosystem around shared metadata.
This document discusses using a data science platform to enable digital diagnostics in healthcare. It provides an overview of healthcare data sources and Yale/YNHH's data science platform. It then describes the data science journey process using a clinical laboratory use case as an example. The goal is to use big data and machine learning to improve diagnostic reproducibility, throughput, turnaround time, and accuracy for laboratory testing by developing a machine learning algorithm and real-time data processing pipeline.
This document discusses using Apache Spark and MLlib for text mining on big data. It outlines common text mining applications, describes how Spark and MLlib enable scalable machine learning on large datasets, and provides examples of text mining workflows and pipelines that can be built with Spark MLlib algorithms and components like tokenization, feature extraction, and modeling. It also discusses customizing ML pipelines and the Zeppelin notebook platform for collaborative data science work.
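A minimal sketch of such a pipeline using Spark ML's built-in stages; the toy training data and column names are placeholders, not examples from the talk.

    # Sketch: a Spark ML text classification pipeline - tokenization, hashed
    # term frequencies, IDF weighting, then logistic regression.
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF, IDF
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("text-mining-sketch").getOrCreate()

    train = spark.createDataFrame(
        [("spark makes big data simple", 1.0),
         ("the weather is nice today", 0.0)],
        ["text", "label"],
    )

    pipeline = Pipeline(stages=[
        Tokenizer(inputCol="text", outputCol="words"),
        HashingTF(inputCol="words", outputCol="tf", numFeatures=1 << 18),
        IDF(inputCol="tf", outputCol="features"),
        LogisticRegression(maxIter=10),
    ])

    model = pipeline.fit(train)
    model.transform(train).select("text", "prediction").show()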
This document compares the performance of Hive and Spark when running the BigBench benchmark. It outlines the structure and use cases of the BigBench benchmark, which aims to cover common Big Data analytical properties. It then describes sequential performance tests of Hive+Tez and Spark on queries from the benchmark using a HDInsight PaaS cluster, finding variations in performance between the systems. Concurrency tests are also run by executing multiple query streams in parallel to analyze throughput.
The document discusses modern data applications and architectures. It introduces Apache Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. Hadoop provides massive scalability and easy data access for applications. The document outlines the key components of Hadoop, including its distributed storage, processing framework, and ecosystem of tools for data access, management, analytics and more. It argues that Hadoop enables organizations to innovate with all types and sources of data at lower costs.
This document provides an overview of data science and machine learning. It discusses what data science and machine learning are, including extracting insights from data and computers learning without being explicitly programmed. It also covers Apache Spark, which is an open source framework for large-scale data processing. Finally, it discusses common machine learning algorithms like regression, classification, clustering, and dimensionality reduction.
This document provides an overview of Apache Spark, including its capabilities and components. Spark is an open-source cluster computing framework that allows distributed processing of large datasets across clusters of machines. It supports various data processing workloads including streaming, SQL, machine learning and graph analytics. The document discusses Spark's APIs like DataFrames and its libraries like Spark SQL, Spark Streaming, MLlib and GraphX. It also provides examples of using Spark for tasks like linear regression modeling.
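As a hint of what the linear regression example looks like in practice, a minimal sketch with Spark ML; the toy data and column names are placeholders.

    # Sketch: fitting a linear regression with Spark ML on a tiny DataFrame.
    from pyspark.sql import SparkSession
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.appName("lr-sketch").getOrCreate()

    df = spark.createDataFrame(
        [(1.0, Vectors.dense([1.0])),
         (2.1, Vectors.dense([2.0])),
         (2.9, Vectors.dense([3.0]))],
        ["label", "features"],
    )

    lr = LinearRegression(maxIter=50, regParam=0.0)
    model = lr.fit(df)
    print(model.coefficients, model.intercept)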
This document provides an overview of Apache NiFi and dataflow. It begins with an introduction to the challenges of moving data effectively within and between systems. It then discusses Apache NiFi's key features for addressing these challenges, including guaranteed delivery, data buffering, prioritized queuing, and data provenance. The document outlines NiFi's architecture and components like repositories and extension points. It also previews a live demo and invites attendees to further discuss Apache NiFi at a Birds of a Feather session.
Many organizations are currently processing various types of data in different formats, most often free-form. As the consumers of this data grow, it's imperative that this free-flowing data adhere to a schema: it gives data consumers an expectation about the type of data they are getting, and shields them from immediate impact if the upstream source changes its format. A uniform schema representation also gives the data pipeline an easy way to integrate and support various systems that use different data formats.
Schema Registry is a central repository for storing and evolving schemas. It provides an API and tooling to help developers and users register a schema and consume it without being impacted when the schema changes. Users can tag different schemas and versions, register for notifications of schema changes by version, and more.
In this talk, we will go through the need for a schema registry and schema evolution, and showcase the integration with Apache NiFi, Apache Kafka, and Apache Storm.
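To illustrate the kind of evolution a registry checks, a sketch of a backward-compatible Avro schema change; the record and fields are hypothetical, and fastavro is used only to confirm the schemas parse.

    # Sketch: a backward-compatible Avro schema evolution - version 2 adds a
    # field with a default, so consumers on v2 can still read v1 records.
    from fastavro import parse_schema

    schema_v1 = {
        "type": "record", "name": "SensorReading",
        "fields": [
            {"name": "device_id", "type": "string"},
            {"name": "value", "type": "double"},
        ],
    }

    schema_v2 = {
        "type": "record", "name": "SensorReading",
        "fields": [
            {"name": "device_id", "type": "string"},
            {"name": "value", "type": "double"},
            # The default is what makes this change backward compatible -
            # a schema registry would accept this evolution.
            {"name": "unit", "type": "string", "default": "celsius"},
        ],
    }

    parse_schema(schema_v1)   # raises if the schema is malformed
    parse_schema(schema_v2)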
There is increasing need for large-scale recommendation systems. Typical solutions rely on periodically retrained batch algorithms, but for massive amounts of data, training a new model can take hours. This is a problem when the model needs to be more up to date: when recommending TV programs while they are being transmitted, for example, the model should take into consideration the users watching a program at that time.
The promise of online recommendation systems is fast adaptation to changes, but online machine learning from streams is commonly believed to be more restricted, and hence less accurate, than batch-trained models. Combining batch and online learning could lead to a quickly adapting recommendation system with increased accuracy. However, designing a scalable data system that unites batch and online recommendation algorithms is a challenging task. In this talk we present our experiences in building such a recommendation engine with Apache Flink and Apache Spark.
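A toy sketch of the online half of such a system: an SGD update to matrix-factorization factors, applied per rating event as it arrives on the stream. The factor dimensionality and learning rate are arbitrary, and the production engine described in the talk runs on Flink/Spark rather than plain Python.

    # Toy sketch: online matrix-factorization update applied per incoming
    # rating event, as the streaming half of a batch+online recommender.
    import numpy as np

    K, LR, REG = 10, 0.05, 0.01          # latent dims, learning rate, L2 reg
    rng = np.random.default_rng(0)
    user_f: dict[str, np.ndarray] = {}   # user -> latent factors
    item_f: dict[str, np.ndarray] = {}   # item -> latent factors

    def on_event(user: str, item: str, rating: float) -> None:
        """One SGD step on a single (user, item, rating) stream event."""
        u = user_f.setdefault(user, rng.normal(scale=0.1, size=K))
        v = item_f.setdefault(item, rng.normal(scale=0.1, size=K))
        err = rating - u @ v
        u_old = u.copy()                 # use pre-update u for v's gradient
        u += LR * (err * v - REG * u)
        v += LR * (err * u_old - REG * v)

    on_event("alice", "program-42", 5.0)   # e.g. a live TV-program rating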
Deep learning is not just hype – it outperforms state-of-the-art ML algorithms, one by one. In this talk we will show how deep learning can be used for detecting anomalies on IoT sensor data streams at high speed, using DeepLearning4J on top of big data engines like Apache Spark and Apache Flink. Key to this talk is the absence of any large training corpus, since we are using unsupervised machine learning – a domain that current DL research treats step-motherly. One drawback of deep learning is that normally a very large labeled training data set is required, so it is particularly interesting that we can show how unsupervised machine learning can be used in conjunction with deep learning: no labeled data set is necessary. As the demo shows, LSTM networks can learn very complex system behavior – in this case, data coming from a physical model simulating bearing vibration – and we are able to detect anomalies and predict failing bearings with tenfold confidence. All examples and all code will be made publicly available and open source. Only open source components are used.
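The unsupervised setup can be sketched as an LSTM autoencoder: train on windows of normal vibration data, then flag windows with high reconstruction error as anomalies. Shown here in Keras for brevity rather than the DeepLearning4J used in the talk; shapes, data, and threshold are placeholders.

    # Sketch (Keras rather than the DL4J used in the talk): an LSTM autoencoder
    # trained on windows of normal bearing-vibration data; windows that
    # reconstruct poorly are flagged as anomalies. No labels are required.
    import numpy as np
    from tensorflow.keras import layers, models

    WINDOW, FEATURES = 100, 3                       # placeholder shapes
    x_normal = np.random.normal(size=(1000, WINDOW, FEATURES)).astype("float32")

    model = models.Sequential([
        layers.Input(shape=(WINDOW, FEATURES)),
        layers.LSTM(32),                            # encode the window
        layers.RepeatVector(WINDOW),                # expand back to a sequence
        layers.LSTM(32, return_sequences=True),
        layers.TimeDistributed(layers.Dense(FEATURES)),  # reconstruct inputs
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(x_normal, x_normal, epochs=5, batch_size=64, verbose=0)

    def is_anomaly(window: np.ndarray, threshold: float = 0.1) -> bool:
        """Flag a window whose reconstruction error exceeds the threshold."""
        err = float(np.mean((model.predict(window[None], verbose=0) - window) ** 2))
        return err > threshold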
QE automation for large systems is a great step forward in increasing system reliability. In the big data world, multiple components have to come together to provide end users with business outcomes. This means that QE automation scenarios need to be detailed around actual use cases, cutting across components. The system tests potentially generate large amounts of data on a recurring basis, and verifying it is a tedious job. Given the multiple levels of indirection, false positives are more frequent than actual defects, and chasing them is generally wasteful.
At Hortonworks, we’ve designed and implemented an automated log analysis system, Mool, using statistical data science and ML. The current work in progress has a batch data pipeline followed by an ensemble ML pipeline that feeds into a recommendation engine. The system identifies the root cause of test failures by correlating failing test cases with current and historical error records across multiple components. It works in unsupervised mode, with no perfect model, stable build, or source-code version to refer to. In addition, the system provides limited recommendations to file new tickets or reopen past ones, and compares run profiles with past runs.
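One simple ingredient of such correlation, sketched with scikit-learn (an illustrative technique, not Mool's actual implementation): vectorize error messages with TF-IDF and link a new failure to its most similar historical records.

    # Illustrative sketch (not Mool itself): correlate a new test failure with
    # historical error records via TF-IDF similarity over the log messages.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    history = [
        "java.net.ConnectException: connection refused to namenode port 8020",
        "OutOfMemoryError: GC overhead limit exceeded in executor 7",
        "kerberos ticket expired while renewing delegation token",
    ]
    new_failure = "connection refused while contacting namenode on port 8020"

    vec = TfidfVectorizer().fit(history + [new_failure])
    sims = cosine_similarity(vec.transform([new_failure]),
                             vec.transform(history))[0]

    best = sims.argmax()
    print(f"most similar past record ({sims[best]:.2f}): {history[best]}")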
Improving business performance is never easy! The Natixis Pack is like Rugby. Working together is key to scrum success. Our data journey would undoubtedly have been so much more difficult if we had not made the move together.
This session is the story of how ‘The Natixis Pack’ has driven change in its current IT architecture so that legacy systems can leverage some of the many components in Hortonworks Data Platform in order to improve the performance of business applications. During this session, you will hear:
• How and why the business and IT requirements originated
• How we leverage the platform to fulfill security and production requirements
• How we organize a community to:
o Guard all the players, no one gets left on the ground!
o Use the platform appropriately (not every problem is eligible for Big Data, and standard databases are not dead)
• What are the most usable, the most interesting and the most promising technologies in the Apache Hadoop community
We will finish the story of a successful rugby team with insight into the special skills needed from each player to win the match!
DETAILS
This session is part business, part technical. We will talk about infrastructure, security, and project management, as well as the industrial usage of Hive, HBase, Kafka, and Spark within a Corporate and Investment Bank environment framed by regulatory constraints.
HBase is a distributed, column-oriented database that stores data in tables divided into rows and columns. It is optimized for random, real-time read/write access to big data. The document discusses HBase's key concepts like tables, regions, and column families. It also covers performance tuning aspects like cluster configuration, compaction strategies, and intelligent key design to spread load evenly. Different use cases are suitable for HBase depending on access patterns, such as time series data, messages, or serving random lookups and short scans from large datasets. Proper data modeling and tuning are necessary to maximize HBase's performance.
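For example, the "intelligent key design" point often comes down to salting: prefixing the row key with a stable hash bucket so monotonically increasing keys (like timestamps) spread across regions instead of hot-spotting one. A small sketch; the bucket count and key layout are illustrative.

    # Sketch: salting an HBase row key so monotonically increasing keys
    # (like timestamps) spread across regions instead of hot-spotting one.
    import hashlib

    BUCKETS = 16  # illustrative; typically matches the number of pre-split regions

    def salted_key(device_id: str, timestamp_ms: int) -> bytes:
        """Prefix a stable hash bucket so writes distribute evenly."""
        bucket = int(hashlib.md5(device_id.encode()).hexdigest(), 16) % BUCKETS
        return f"{bucket:02d}|{device_id}|{timestamp_ms:013d}".encode()

    # All readings for one device still sort together within its bucket,
    # so short scans per device remain efficient.
    print(salted_key("sensor-42", 1700000000000))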
There has been an explosion of data digitising our physical world – from cameras, environmental sensors and embedded devices, right down to the phones in our pockets. This means that companies now have new ways to transform their businesses – both operationally, and through their products and services – by leveraging this data and applying fresh analytical techniques to make sense of it. But are they ready? The answer is “no” in most cases.
In this session, we’ll be discussing the challenges facing companies trying to embrace the Analytics of Things, and how Teradata has helped customers work through and turn those challenges to their advantage.
In this talk, we will present Hops, a new distribution of Hadoop that can scale the Hadoop Filesystem (HDFS) by 16X, from 70K ops/s to 1.2 million ops/s on Spotify's industrial Hadoop workload. Hops is an open-source distribution of Apache Hadoop that supports distributed metadata for HDFS (HopsFS) and for the ResourceManager in Apache YARN. HopsFS is the first production-grade distributed hierarchical filesystem to store its metadata normalized in an in-memory, shared-nothing database. For YARN, we will discuss optimizations that enable a 2X throughput increase for the Capacity Scheduler, enabling scalability to clusters with more than 20K nodes. We will discuss the journey to this milestone, including some of the challenges involved in efficiently and safely mapping hierarchical filesystem metadata state and operations onto a shared-nothing, in-memory database, as well as the key database features needed for extreme scaling, such as multi-partition transactions, partition-pruned index scans, distribution-aware transactions, and the streaming changelog API. Hops (www.hops.io) is Apache-licensed open source and supports a pluggable database backend for distributed metadata, although it currently supports only MySQL Cluster. Hops opens up new directions for Hadoop once metadata is available for tinkering in a mature relational database.
In high-risk manufacturing industries, regulatory bodies stipulate continuous monitoring and documentation of critical product attributes and process parameters. On the other hand, sensor data coming from production processes can be used to gain deeper insights into optimization potentials. By establishing a central production data lake based on Hadoop and using Talend Data Fabric as a basis for a unified architecture, the German pharmaceutical company HERMES Arzneimittel was able to cater to compliance requirements as well as unlock new business opportunities, enabling use cases like predictive maintenance, predictive quality assurance or open world analytics. Learn how the Talend Data Fabric enabled HERMES Arzneimittel to become data-driven and transform Big Data projects from challenging, hard to maintain hand-coding jobs to repeatable, future-proof integration designs.
Talend Data Fabric combines Talend products into a common set of powerful, easy-to-use tools for any integration style: real-time or batch, big data or master data management, on-premises or in the cloud.
While you might be tempted to assume data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the "single use case" or PoC phase, where data governance – as far as backup and disaster recovery (BDR) is concerned – is not (yet) important. This talk first introduces the overarching issues and difficulties of backup and data safety, looking at each of the many components in Hadoop – including HDFS, HBase, YARN, Oozie, and the management components – and finally shows a viable approach using built-in tools. You will also learn not to take this topic lightheartedly, and what is needed to implement and guarantee continuous operation of Hadoop-cluster-based solutions.
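One of the few built-in tools alluded to above is HDFS snapshots; a sketch of driving them from Python follows. The path and snapshot name are placeholders, and per the caveat in the abstract, open files are not frozen by a snapshot.

    # Sketch: driving HDFS snapshots (a built-in tool) from Python.
    # Paths are placeholders. Caveat from the talk: open files are NOT
    # treated as immutable by a snapshot, so quiesce writers first.
    import subprocess

    path = "/data/warehouse"                  # hypothetical snapshottable dir

    # An admin must mark the directory snapshottable once:
    subprocess.run(["hdfs", "dfsadmin", "-allowSnapshot", path], check=True)

    # Then snapshots can be taken cheaply, e.g. nightly before batch loads:
    subprocess.run(["hdfs", "dfs", "-createSnapshot", path, "nightly-backup"],
                   check=True)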
Driving Business Innovation: Latest Generative AI Advancements & Success Story (Safe Software)
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
HCL Notes and Domino License Cost Reduction in the World of DLAU (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar, with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
The Microsoft 365 Migration Tutorial For Beginners (operationspcvita)
This presentation will help you understand the power of Microsoft 365. We cover every productivity app included in Office 365, outline common migration scenarios related to Office 365, and explain how we can help you.
You can also read: https://www.systoolsgroup.com/updates/office-365-tenant-to-tenant-migration-step-by-step-complete-guide/
Main news related to the CCS TSI 2023 (2023/1695) (Jakub Marek)
An English 🇬🇧 translation of the presentation accompanying the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on communications and signalling systems on railways, held at the Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). It was attended by around 500 participants and 200 online followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an... (Jason Yip)
The typical problem in product engineering is not bad strategy, so much as “no strategy”. This leads to confusion, lack of motivation, and incoherent action. The next time you look for a strategy and find an empty space, instead of waiting for it to be filled, I will show you how to fill it in yourself. If you’re wrong, it forces a correction. If you’re right, it helps create focus. I’ll share how I’ve approached this in the past, both what works and lessons for what didn’t work so well.
Skybuffer SAM4U tool for SAP license adoption (Tatiana Kojar)
Manage and optimize your license adoption and consumption with SAM4U, a complimentary SAP software asset management tool for customers.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
"Choosing proper type of scaling", Olena SyrotaFwdays
Imagine an IoT processing system that is already quite mature and production-ready, whose client coverage is growing, and for which scaling and performance are life-and-death questions. The system has Redis, MongoDB, and stream processing based on ksqlDB. In this talk, we will first analyze scaling approaches and then select the proper ones for our system.
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application... (Alex Pruden)
Folding is a recent technique for building efficient recursive SNARKs. Several elegant folding protocols have been proposed, such as Nova, Supernova, Hypernova, Protostar, and others. However, all of them rely on an additively homomorphic commitment scheme based on discrete log, and are therefore not post-quantum secure. In this work we present LatticeFold, the first lattice-based folding protocol based on the Module SIS problem. This folding protocol naturally leads to an efficient recursive lattice-based SNARK and an efficient PCD scheme. LatticeFold supports folding low-degree relations, such as R1CS, as well as high-degree relations, such as CCS. The key challenge is to construct a secure folding protocol that works with the Ajtai commitment scheme. The difficulty is ensuring that extracted witnesses are low norm through many rounds of folding. We present a novel technique using the sumcheck protocol to ensure that extracted witnesses are always low norm no matter how many rounds of folding are used. Our evaluation of the final proof system suggests that it is as performant as Hypernova, while providing post-quantum security.
Paper Link: https://eprint.iacr.org/2024/257
What is an RPA CoE? Session 1 – CoE Vision (DianaGray10)
In the first session, we will review the organization's vision and how it impacts the CoE structure.
Topics covered:
• The role of a steering committee
• How do the organization’s priorities determine CoE Structure?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect, Anika Systems
Northern Engraving | Nameplate Manufacturing Process - 2024 (Northern Engraving)
Manufacturing custom quality metal nameplates and badges involves several standard operations. Processes include sheet prep, lithography, screening, coating, punch press and inspection. All decoration is completed in the flat sheet with adhesive and tooling operations following. The possibilities for creating unique durable nameplates are endless. How will you create your brand identity? We can help!
HCL Notes and Domino License Cost Reduction in the World of DLAU, German-language edition (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and licensing under the CCB and CCX model have been a hot topic for many in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and license fees. You may be wondering how this new kind of licensing works and what benefits it brings you. Above all, you surely want to stay within your budget and save costs wherever possible. We understand that, and we want to help!
We will explain how to resolve common configuration problems that can cause more users to be counted than necessary, and how to identify and remove superfluous or unused accounts to save money. There are also approaches that can lead to unnecessary expense, for example using a person document instead of a mail-in for shared mailboxes. We will show you such cases and their solutions. And of course we will explain the new licensing model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder will introduce you to this new world. It will give you the tools and know-how to keep track of it all. You will be able to reduce your costs through an optimized Domino configuration and keep them low going forward.
These topics will be covered
- Reducing license costs by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, such as team mailboxes, functional/test users, etc.
- Practical examples and best practices to implement right away
The Lambda Architecture delivers the promise of analytics that is both real-time over streamed data and batch over comprehensive data. But its use relies largely on individuals with architecture savvy and sysadmin skills for the capture, scheduling, and deployment of analytical models.
We introduce a Model Management Framework that presents a simplified interface that supports end-to-end analytical modeling at scale using the Lambda Architecture. The framework hides the complexity of Lambda and unlocks its power for data scientists, domain experts, and business analysts.
- Data scientists and domain experts who generate the models can select from already captured modeling approaches or onboard their own. The platform makes it easy to compare models in a champion-challenger fashion.
- Business analysts who rely on the models' results can select from a catalog of models created by experts.
The Model Management Framework comprises a model building service, a prediction service, and a resource allocation service.
The result enables a catalog approach to finding analytics, simplified onboarding of new analytics, and a brute-force approach to retraining and comparing models.
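A toy sketch of the champion/challenger comparison the framework automates: retrain candidates by brute force, score them on the same holdout, and promote the best. The models, data, and metric are placeholders, not the framework's actual catalog.

    # Toy sketch of the champion/challenger comparison the framework automates:
    # retrain candidates by brute force, score them on the same holdout, and
    # promote the best. Models, data, and metric are placeholders.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    catalog = {
        "champion":    LogisticRegression(max_iter=1000),
        "challenger1": RandomForestClassifier(random_state=0),
        "challenger2": GradientBoostingClassifier(random_state=0),
    }

    scores = {}
    for name, model in catalog.items():
        model.fit(X_tr, y_tr)
        scores[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

    best = max(scores, key=scores.get)
    print(scores, "-> promote", best)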
How to apply and next steps:
• Identify desired insights
• Data collection
• Model-driven analytics
• Take action