This document outlines approaches for upgrading or migrating the infrastructure and data stores behind big data systems: upgrading in place, or building a new cluster using strategies such as dark-writing to the new cluster before cutting over, stopping the world to move data, or doing incremental data moves. It emphasizes planning, testing, and a solid data flow architecture, and walks through two worked examples: a migration from Cassandra to Hadoop and a major-version Hadoop upgrade.
1. Changing the Tires on a Big Data Racecar
@davemcnelis
Sr. Software Engineer, Proofpoint
2. Who am I?
Software engineer at Proofpoint, formerly Emerging Threats
14 years of experience, 7 of them with Cassandra or Hadoop
Currently using Scala more than any other language
Big data focuses have been social media analysis, marketing data, smart-meter analytics, and information security research
Current projects revolve around building threat intelligence APIs and data stores
3. Goals
Outline approaches to migrating or upgrading your infrastructure / data store
Pros and Cons of these approaches
Demystify the process, identify ‘gotchas’
Establish guidelines and provide ideas for handling these situations, not create a gospel
4. Core System Components
Back end store (Hadoop, Cassandra, etc.)
Queuing / messaging service (Kafka, Kinesis, AMQP, RabbitMQ)
Event / Data Producers (APIs, log data, sensors)
Generate the base data for the messaging service
Analytics (queuing system consumers, batch jobs)
Access (APIs, front ends, batch job output); a minimal sketch wiring these together follows below
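Below is a minimal sketch (not from the deck) of how these components hang together, assuming Kafka as the queuing layer and the kafka-python client library; the topic name and the store_event() helper are hypothetical.

import json
from kafka import KafkaProducer, KafkaConsumer

# Event / data producer: an API handler or log shipper pushes raw events onto the bus.
producer = KafkaProducer(bootstrap_servers="kafka:9092",
                         value_serializer=lambda v: json.dumps(v).encode())
producer.send("raw-events", {"sensor_id": 42, "reading": 7.3})
producer.flush()

# Analytics consumer: reads from the bus and writes into the back end store.
consumer = KafkaConsumer("raw-events",
                         bootstrap_servers="kafka:9092",
                         group_id="analytics",
                         value_deserializer=lambda b: json.loads(b))

def store_event(event):
    """Hypothetical write into the back end store (Cassandra, HBase, etc.)."""

for message in consumer:
    store_event(message.value)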
6. Upgrading in place
Pros
Least expensive
Data can stay where it already lives
Often has sufficient documentation
Cons
Stability concerns of the new back end
Downtime / customer visibility
Good luck rolling back in event of a problem
Degradation of performance during the upgrade
Testing in production is bad, mmkay
Generally limited to minor upgrades / updates
Even drop-in upgrades aren’t clean
7. So you want to build a new cluster
All of these inherently will cost more than upgrading in place.
Start your engines! -- Spin up a new cluster, dark write to it until enough data, cut over consumption
Red Flag -- Stop ingestion and consumption, move data, restart ingestion and consumption
Black Flag -- Incremental copies of data, potentially pausing ingestion/consumption for brief periods of time
Green Flag -- Let your foundation do most of the work for you
8. Keys to Success
Pre-planning is essential. Don't expect this to be a couple of days' work; plan for weeks.
Solid data flow foundations are key. Consider archiving all incoming data to something like S3 so you can replay an arbitrary amount of data (a sketch follows below).
Automated / unit testing on data interaction components will create a lot more confidence and help identify problem areas early.
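A minimal sketch of the S3 archiving idea above, assuming Kafka for ingest and the kafka-python and boto3 libraries; the bucket and topic names are made up. Every raw message is copied out keyed by partition and offset so any slice of history can be replayed later.

import boto3
from kafka import KafkaConsumer

s3 = boto3.client("s3")
consumer = KafkaConsumer("raw-events",
                         bootstrap_servers="kafka:9092",
                         group_id="s3-archiver",
                         enable_auto_commit=False)

for msg in consumer:
    key = f"raw-events/{msg.partition}/{msg.offset}"
    s3.put_object(Bucket="my-raw-archive", Key=key, Body=msg.value)
    consumer.commit()  # commit only after the object is durably stored
    # (a real archiver would batch many messages into larger objects)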
9. Start your engines!
Spinning up a new cluster and writing data until there is enough to sustain operations (see the dark-write sketch below)
Fine if no historic data longer than spin-up time is required
Least amount of risk, if older data isn’t needed
Can back-fill legacy data after the cut-over has occurred
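A sketch of the dark-write phase under the same Kafka assumption; write_to_old() and write_to_new() stand in for whatever client calls the two back ends use. The new cluster receives every event but serves no customer reads yet.

from kafka import KafkaConsumer

def write_to_old(event):
    """Hypothetical write into the existing production store."""

def write_to_new(event):
    """Hypothetical write into the new cluster being warmed up."""

consumer = KafkaConsumer("raw-events",
                         bootstrap_servers="kafka:9092",
                         group_id="dual-writer")

for msg in consumer:
    write_to_old(msg.value)
    try:
        write_to_new(msg.value)   # dark write: failures are logged, never customer-visible
    except Exception as exc:
        print(f"dark write failed at offset {msg.offset}: {exc}")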
10. Red flag -- Stop the race!
Shut it all down, move data to the new format, start everything back up
High customer impact
User visible downtime will occur, not just analytics/ingestion/processing downtime
Might be OK for non-critical, offline systems
11. Black Flag -- Dealing with the stop and go penalty
Attempts to lessen downtime/customer impact
Significant engineering time to set up properly
If you don’t have timestamped write times, or your data is non-linear, this might not be feasible (see the sketch below)
Longest path, in terms of calendar days
High complexity, high potential for mistakes
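A sketch of the incremental ("black flag") copy loop, assuming rows carry a write timestamp that can be range-scanned; read_rows_between() and write_rows() are hypothetical helpers for the old and new stores, and writes are assumed idempotent so any window can be re-run.

from datetime import datetime, timedelta

def read_rows_between(start, end):
    """Hypothetical: rows written to the old store in [start, end)."""
    return []

def write_rows(rows):
    """Hypothetical: idempotent bulk write into the new store."""

def incremental_copy(migration_start, window=timedelta(hours=1)):
    cursor = migration_start
    while cursor < datetime.utcnow():
        write_rows(read_rows_between(cursor, cursor + window))
        cursor += window
    # only the last small window needs ingestion paused while it is copied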
12. Green flag -- Letting your foundation work for you
Only “Start your engines” has less planned downtime
Difficult or impossible if you don’t have a solid data flow architecture in place
Results for this should be reproducible (in other words, can test things multiple times if needed)
Data needs to come in from either a queue or batch loads
If everything is from batch loads, should be able to avoid any customer disruption
13. Watch out for that pile up!
Queue / Message bus -- Need ample capacity for when you’re not ingesting from the bus, e.g. Kinesis’ TTL is 24 hours while Kafka’s retention is configurable (a headroom check is sketched below)
Testing -- Build in time to test and verify migrations, and then check it all a second time.
Testing must be multifaceted -- The code, the data, and the infrastructure
Chasing the white rabbit -- Beware the jabberwocky! It’s easy to fall into the bleeding-edge trap, but this is high risk for often little reward
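A sketch of a retention headroom check with kafka-python (topic and group names are assumptions): if a paused consumer's committed offset ever falls behind the oldest retained offset, the bus has already dropped unprocessed data.

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="kafka:9092", group_id="analytics")
partitions = [TopicPartition("raw-events", p)
              for p in consumer.partitions_for_topic("raw-events")]

oldest = consumer.beginning_offsets(partitions)   # already trimmed by retention
newest = consumer.end_offsets(partitions)
for tp in partitions:
    committed = consumer.committed(tp) or 0
    lag = newest[tp] - committed
    headroom = committed - oldest[tp]
    print(f"{tp}: lag={lag}, headroom={headroom}")
    if headroom < 0:
        print(f"{tp}: retention has already dropped unprocessed messages!")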
14. Example -- Migrating from Cassandra to Hadoop
“Start your engines!” approach
Began with duplicate writing to both systems
Eventually added kafka with different consumers pushing data to both backends
Dev work to re-implement things with Hadoop/HBase took most resources/time
Once in a “stable” place, started comparing batch job outputs from the two systems (a comparison sketch follows below)
Brief maintenance window to cut over
Entire process took several months including dev, ops and testing work
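A sketch of the output comparison step; fetch_cassandra_rollup() and fetch_hbase_rollup() are hypothetical helpers that dump a batch job's daily aggregates as key/value dicts from each system.

def fetch_cassandra_rollup(day):
    """Hypothetical: daily aggregates produced by the old Cassandra-backed job."""
    return {"clicks": 1042.0, "impressions": 99310.0}

def fetch_hbase_rollup(day):
    """Hypothetical: the same aggregates produced by the new Hadoop/HBase job."""
    return {"clicks": 1042.0, "impressions": 99309.0}

def compare_rollups(day, tolerance=0.001):
    old, new = fetch_cassandra_rollup(day), fetch_hbase_rollup(day)
    mismatches = []
    for key in set(old) | set(new):
        a, b = old.get(key), new.get(key)
        if a is None or b is None or abs(a - b) > tolerance * max(abs(a), 1):
            mismatches.append((key, a, b))
    return mismatches   # an empty list means it is safe to schedule the cutover window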
15. Example -- Migrating from Cassandra to Hadoop (cont.)
Unique challenges
Exporting data from Cassandra was hard
Prior to a decent option like Spark
Greatly complicated by vNodes
Used a set of Python scripts to actually export all the data (sketched below)
Had multiple kinds of products to deliver
API under constant customer use, couldn’t afford any downtime
Batch job outputs, hourly and daily
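A sketch of the kind of export script described above, assuming the DataStax cassandra-driver and the Murmur3 partitioner; with vNodes the ring is split into hundreds of small ranges, so this simply walks the full token space in fixed chunks. Keyspace, table, and partition key names are made up.

from cassandra.cluster import Cluster

MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1   # Murmur3 token space
CHUNKS = 512

cluster = Cluster(["cassandra-host"])
session = cluster.connect("my_keyspace")

step = (MAX_TOKEN - MIN_TOKEN) // CHUNKS
for i in range(CHUNKS):
    lo = MIN_TOKEN + i * step
    hi = MAX_TOKEN if i == CHUNKS - 1 else lo + step
    rows = session.execute(
        "SELECT * FROM events WHERE token(event_id) > %s AND token(event_id) <= %s",
        (lo, hi))
    for row in rows:
        pass   # append the row to flat files / HDFS for bulk load into the new cluster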
16. Example -- Upgrading major versions of Hadoop
Green flag approach
Had to minimize downtime
Not enough calendar time for “Start your engines”
Leveraged Snapshots (both Cassandra and HBase have this construct)
Loaded snapshots into testing environments multiple times
Majority of engineering time was in upgrading libraries and verifying there were no breaking changes because of the version changes
Second most engineering time was spent building and running test clusters
17. Example -- Upgrading major versions of Hadoop (steps)
1. Determine a “time” to start ingesting into both environments (an offset-seek sketch follows after these steps)
2. Took snapshots of original cluster, loaded into new cluster (can take a long time)
3. Started raw data consumers for the new cluster (i.e. enabling data insertion)
4. Once lag was reduced on insertion, started analytics based consumers
5. Enabled any batch processing
6. Continue to write to both stores for a couple of weeks
7. Verify new cluster output by comparing batch jobs to old cluster
8. Cut over customer facing APIs
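A sketch of step 1 with kafka-python (topic, group, and timestamp are assumptions): choose a wall-clock cut-in time, then have the new cluster's consumers seek to the Kafka offsets matching that timestamp so both environments ingest the same stream from the same point.

from kafka import KafkaConsumer, TopicPartition

CUT_IN_MS = 1500000000000   # hypothetical epoch millis chosen as the start time

consumer = KafkaConsumer(bootstrap_servers="kafka:9092",
                         group_id="new-cluster-writer",
                         enable_auto_commit=False)
partitions = [TopicPartition("raw-events", p)
              for p in consumer.partitions_for_topic("raw-events")]
consumer.assign(partitions)

offsets = consumer.offsets_for_times({tp: CUT_IN_MS for tp in partitions})
for tp, ot in offsets.items():
    if ot is not None:
        consumer.seek(tp, ot.offset)
    else:
        consumer.seek_to_end(tp)   # nothing newer than the chosen time yet

for msg in consumer:
    pass   # insert into the new cluster, committing offsets as inserts succeed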
18. Summary
Strong foundations are essential
Number of possible ways to win the race
Plan as far out as you can foresee
Upgrading and Migrating are operationally similar, have similar approaches available
Archiving raw incoming data can save you a lot of headaches if you can afford it
Racing analogies only work so long in a presentation before they get worn out