Synopsis: HUI 1.0 is a convergent analytics application that provides comprehensive insights into the usage, load, and performance of applications running on Hadoop clusters. It has been developed as a web-enabled tool leveraging the Eagle framework.
1. Hadoop Usage Insight (HUI 1.0)
Session on Descriptive Analytics
ArulKumar
Synopsis: HUI 1.0 is a convergent analytics application that provides comprehensive insights into the usage, load, and performance of applications running on Hadoop clusters. It has been developed as a web-enabled tool leveraging the Eagle framework.
2. Contents
• Why we do this?
• Our Customers
• Initial Use Case
• Eagle Monitoring Framework as a Solution
• How we did this? EagleApp!
• Functional Coverage
• As-Is Features
• Methods & Metrics
• To-Be Features
3. Why we do this?
Volume
• 2+ large Hadoop clusters
• 3,000+ nodes
• 20,000+ jobs per day
• 5,000,000+ tasks per day
• 200+ types of Hadoop metrics
• Millions of audit events per day
Complexity
• Varieties of data sources and collectors
• Joins across multiple data sources
• Threshold-based and window-based rules
• Correlation across multiple metrics
• Metric pre-aggregations
• Alert rules cannot be hot-deployed
4. Key Stakeholders & Use Cases
Stakeholders: Sr. Management, SMEs & Leads, Hadoop Operations, Hadoop Developers, Infrastructure Teams, Product Managers
Use Cases
• Cluster: availability and HDFS usage (rack-wise, node-wise), host anomaly detection
• Queue: load analysis, queue load w.r.t. batch accounts, map/reduce count and failure status per queue, queue-wise elapsed time and job count, % completion of maps and reduces
• Job: usage and progress status, usage distribution of jobs across queues, job listing and status with elapsed time on queue
• Alerts: job alert categorization and distribution across queues and users
• Optimization: optimization suggestions for jobs in each queue (job start time and other counter details on screen)
• Counter analytics: map task attempt (file system), reduce task attempt (file system), MapReduce (file system, job, task)
ROI
• Time to market
• Reduction in MTTR
• Freedom for innovation
• Optimized resource usage
• Reduction in SLA time
• Insight on running jobs
• Cluster capacity insight
• Proactive remediation
5. The Initial Use Case
Anomaly detection algorithm:
• Continuously crawl job history immediately after job completion
• Calculate the minute-level job failure ratio for each node
• Flag a node as anomalous when either of two conditions holds:
  - tasks fail continuously within the node, or
  - its failure ratio is higher than that of the rest of the nodes in the cluster
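As a rough illustration of the two failure-ratio conditions above, here is a minimal Java sketch. The class, field names, and thresholds are hypothetical; the deck does not specify HUI's actual values or implementation.

```java
import java.util.*;

/** Illustrative per-node anomaly check based on minute-level task failure ratios. */
public class NodeAnomalyDetector {
    // Hypothetical thresholds; HUI's actual values are not given in the deck.
    private static final int CONSECUTIVE_FAILURE_LIMIT = 5;
    private static final double CLUSTER_RATIO_MARGIN = 0.3;

    /**
     * tasksPerNode: node -> {failedTasks, totalTasks} for the current minute window.
     * consecutiveFailures: node -> count of back-to-back task failures on that node.
     */
    public static Set<String> findAnomalousNodes(Map<String, int[]> tasksPerNode,
                                                 Map<String, Integer> consecutiveFailures) {
        // Cluster-wide failure ratio serves as the baseline for condition 2.
        double clusterFailed = 0, clusterTotal = 0;
        for (int[] counts : tasksPerNode.values()) {
            clusterFailed += counts[0];
            clusterTotal += counts[1];
        }
        double clusterRatio = clusterTotal == 0 ? 0 : clusterFailed / clusterTotal;

        Set<String> anomalous = new HashSet<>();
        for (Map.Entry<String, int[]> e : tasksPerNode.entrySet()) {
            int failed = e.getValue()[0], total = e.getValue()[1];
            double ratio = total == 0 ? 0 : (double) failed / total;
            // Condition 1: tasks fail continuously within this node.
            boolean continuousFailures =
                consecutiveFailures.getOrDefault(e.getKey(), 0) >= CONSECUTIVE_FAILURE_LIMIT;
            // Condition 2: failure ratio well above the rest of the cluster.
            boolean worseThanCluster = ratio > clusterRatio + CLUSTER_RATIO_MARGIN;
            if (continuousFailures || worseThanCluster) {
                anomalous.add(e.getKey());
            }
        }
        return anomalous;
    }
}
```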
10. As-Is Features
• Cluster: availability and HDFS usage (rack-wise, node-wise), host anomaly detection
• Queue: load analysis, queue load w.r.t. batch accounts, map/reduce count and failure status per queue, queue-wise elapsed time, job count, % completion of maps and reduces
• Job: usage and progress status, usage distribution of jobs across queues, job listing and status with elapsed time on queue
• Alerts: job alert categorization and distribution across queues and users
• Optimization: optimization suggestions for jobs in each queue (job start time and other counter details on screen)
• Counter analytics: map task attempt (file system), reduce task attempt (file system), MapReduce (file system, job, task)
15. Typical Alert Types and Categories
Job Performance Alerts
• Long execution compared with historical data. Trigger: alert when there is a peak. Email frequency: 1 hour or > 10 jobs. Action: notify user with insight.
• Execution time > 12 hours. Trigger: execution time > 12 hours. Email frequency: 1 hour or > 10 jobs. Action: notify user with insight.
• Slow progress. Trigger: HDFS R/W or file R/W shows no progress within 15 minutes. Email frequency: 1 hour or > 10 jobs. Action: resource availability check.
• Long scheduling. Trigger: map & reduce progress still 0% after 15 minutes. Email frequency: 1 hour or > 10 jobs. Action: resource availability check.
• Long cleanup. Trigger: map & reduce progress at 100% but job not completed within 15 minutes. Email frequency: 1 hour or > 10 jobs. Action: system resource availability check.
• Abnormal number of HDFS R/W operations. Trigger: > 0.5M operations. Email frequency: 1 hour or > 10 jobs. Action: notify and optimize job.
• Slow file processing. Trigger: R/W throughput between 100 and 200 KB per CPU second. Email frequency: 1 hour or > 10 jobs. Action: notify and optimize job.
• Very large shuffle. Trigger: shuffle size > 10 GB. Email frequency: 1 hour or > 10 jobs. Action: notify and optimize job.
Bad Node
• Node anomaly alert. Trigger: node has a high failure ratio. Email frequency: on-demand. Action: restart daemon / decommission node.
Job Exception
• Job anomaly alert. Trigger: buggy job has a much higher failure ratio than any other job. Email frequency: on-demand. Action: send email to owner.
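As a rough illustration, the Java sketch below checks a few of the job-performance thresholds from the table against per-job metrics. The JobMetrics record and its field names are hypothetical, not HUI's actual schema; only the threshold values come from the slide.

```java
import java.time.Duration;
import java.util.*;

/** Illustrative evaluation of some job-performance thresholds listed above. */
public class JobAlertEvaluator {

    /** Hypothetical per-job metrics snapshot; field names are illustrative. */
    public record JobMetrics(Duration executionTime, double mapReduceProgress,
                             long hdfsReadWriteOps, long shuffleBytes,
                             Duration timeSinceLastProgress) {}

    public static List<String> evaluate(JobMetrics m) {
        List<String> alerts = new ArrayList<>();
        if (m.executionTime().toHours() > 12)
            alerts.add("Execution time > 12 hours");
        if (m.mapReduceProgress() == 0.0 && m.timeSinceLastProgress().toMinutes() >= 15)
            alerts.add("Long scheduling: progress 0% after 15 minutes");
        if (m.hdfsReadWriteOps() > 500_000)                 // "> 0.5M" in the slide
            alerts.add("Abnormal number of HDFS R/W operations");
        if (m.shuffleBytes() > 10L * 1024 * 1024 * 1024)    // "> 10GB" in the slide
            alerts.add("Very large shuffle size");
        return alerts;
    }
}
```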
17. Real-Time Data Skew Detection: Approach
Use case: detect data skew via statistics and distributions of task-attempt execution durations and counters.
Assumption: durations and counters should follow a normal distribution.
Counters & features: mapDuration, reduceDuration, mapInputRecords, reduceInputRecords, combineInputRecords, mapSpilledRecords, reduceShuffleRecords, mapLocalFileBytesRead, reduceLocalFileBytesRead, mapHDFSBytesRead, reduceHDFSBytesRead
Modeling & statistics: avg, min, max, distributions, max z-score, top-N, correlation
Threshold & detection: flag skew when correlation > 0.9 and max(z-score) > 90%
[Figure: task-attempt scatter plots; axes include HDFS bytes read, input records, map duration (ms), combine input records, shuffle records, local file bytes read, and duration (ms).]
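The detection rule combines a max z-score with a correlation threshold. The minimal Java sketch below shows one way to compute both for a pair of task-attempt counters. The slide's "max(z-score) > 90%" is ambiguous, so the cutoff is left as a parameter here; the code is illustrative, not HUI's implementation.

```java
/** Illustrative skew check: max z-score of one counter plus Pearson correlation between two. */
public class SkewDetector {

    static double mean(double[] v) {
        double s = 0;
        for (double x : v) s += x;
        return s / v.length;
    }

    static double stdDev(double[] v, double mean) {
        double s = 0;
        for (double x : v) s += (x - mean) * (x - mean);
        return Math.sqrt(s / v.length);
    }

    /** Largest z-score across task attempts, e.g. over mapDuration values. */
    static double maxZScore(double[] v) {
        double m = mean(v), sd = stdDev(v, m);
        double max = 0;
        for (double x : v) max = Math.max(max, sd == 0 ? 0 : Math.abs(x - m) / sd);
        return max;
    }

    /** Pearson correlation, e.g. between mapInputRecords and mapDuration. */
    static double correlation(double[] a, double[] b) {
        double ma = mean(a), mb = mean(b), cov = 0;
        for (int i = 0; i < a.length; i++) cov += (a[i] - ma) * (b[i] - mb);
        double denom = a.length * stdDev(a, ma) * stdDev(b, mb);
        return denom == 0 ? 0 : cov / denom;
    }

    /** Slide thresholds: correlation > 0.9; z-score cutoff left configurable. */
    static boolean isSkewed(double[] counter, double[] duration, double zCutoff) {
        return correlation(counter, duration) > 0.9 && maxZScore(duration) > zCutoff;
    }
}
```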
As a framework, Eagle does not assume:
• Data source (where, what)
• Business-logic execution path (how)
• Policy-engine implementation (how)
• Data sink (where, what)
As a framework, Eagle does the following:
• SQL-like service API
• High-performing query framework
• Lightweight stream-processing Java API
• Extensible policy-engine implementation
• Scalable and distributed rule evaluation
• Metadata-driven stream processing
• Data-source extensibility
• Data-sink extensibility
• Interactive dashboard
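To make that separation of concerns concrete, here is a toy Java sketch of a source-agnostic, sink-agnostic policy pipeline in the spirit of the list above. These interfaces are hypothetical simplifications for illustration only, not Eagle's actual APIs.

```java
import java.util.function.Consumer;
import java.util.function.Predicate;

/**
 * Toy illustration of the design the list above implies: the framework wires
 * sources, policies, and sinks together without assuming what any of them are.
 */
public class MiniPolicyPipeline<E> {
    private final Predicate<E> policy;   // pluggable policy-engine implementation
    private final Consumer<E> sink;      // pluggable data sink (email, dashboard, queue, ...)

    public MiniPolicyPipeline(Predicate<E> policy, Consumer<E> sink) {
        this.policy = policy;
        this.sink = sink;
    }

    /** Any data source just pushes events here; matching events flow to the sink. */
    public void onEvent(E event) {
        if (policy.test(event)) sink.accept(event);
    }

    public static void main(String[] args) {
        // Example policy: alert when a job's failure ratio exceeds a threshold.
        MiniPolicyPipeline<Double> pipeline =
            new MiniPolicyPipeline<>(ratio -> ratio > 0.5,
                                     ratio -> System.out.println("ALERT: failure ratio " + ratio));
        pipeline.onEvent(0.72);  // triggers the alert
        pipeline.onEvent(0.10);  // passes silently
    }
}
```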