New DNA sequencing technologies are revolutionizing the life sciences by generating extremely large data sets. Traditional tools for processing this data will have difficulty scaling to the coming deluge of genomics data. We discuss how the innovations of Hadoop and Spark are solving core problems that enable scientists to address questions that were previously out of reach.
Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...Databricks
In this session, IBM will present details on advanced Apache Spark analytics currently being performed through a collaborative project with the SETI Institute, NASA, Swinburne University, Stanford University and IBM. The Allen Telescope Array in northern California has been continuously scanning the skies for over two decades, generating data archives with over 200 million signal events.
Come and learn how astronomers and researchers are using Apache Spark, in conjunction with assets such as IBM’s Cognitive Compute Cluster with over 700 GPUs, to train neural net models for signal classification, and to perform computationally intensive Spark workloads on multi-terabyte binary signal files. The speakers will also share details on one of the key components of this implementation: Stocator, an open source (Apache License 2.0) object store connector for Hadoop and Apache Spark, specifically designed to optimize their performance with object stores. Learn how Stocator works, and see how it was able to greatly improve performance and reduce the quantity of resources used, both for ground-to-cloud uploads of very large signal files, and for subsequent access of radio data for analysis using Spark.
Terark (Y Combinator W17) has built a new storage engine based on a nested succinct trie that provides a 10x-500x performance improvement, a 10:1 compression ratio, and far lower latency than Google's LevelDB and Facebook's RocksDB. It can be used as a standalone key-value store, or as a storage engine for MySQL and MongoDB.
Arun Murthy, from the Hadoop team at Yahoo!, will introduce a compendium of best practices for applications running on Apache Hadoop. In particular, he introduces the notion of a Grid Pattern which, similar to a Design Pattern, represents a general reusable solution for applications running on the Grid. He will also cover the anti-patterns of applications running on Apache Hadoop clusters. Arun will enumerate characteristics of well-behaved applications and provide guidance on appropriate uses of various features and capabilities of the Hadoop framework. The talk is largely prescriptive in nature; a useful way to look at the presentation is that applications which follow, in spirit, the best practices prescribed here are very likely to be efficient, well-behaved in the multi-tenant environment of Apache Hadoop clusters, and unlikely to fall afoul of most policies and limits.
Using Apache Spark to analyze large datasets in the cloud presents a range of challenges. Different stages of your pipeline may be constrained by CPU, memory, disk and/or network IO. But what if all those stages have to run on the same cluster? In the cloud, you have limited control over the hardware your cluster runs on.
You may have even less control over the size and format of your raw input files. Performance tuning is an iterative and experimental process. It’s frustrating with very large datasets: what worked great with 30 billion rows may not work at all with 400 billion rows. But with strategic optimizations and compromises, 50+ TiB datasets can be no big deal.
Using the Spark UI and simple metrics, we'll explore how to diagnose and remedy issues on jobs (a brief configuration sketch follows this list):
Sizing the cluster based on your dataset (shuffle partitions)
Ingestion challenges – well begun is half done (globbing S3, small files)
Managing memory (sorting and GC – when to go parallel, when to go G1, when off-heap can help you)
Shuffle (give a little to get a lot – configs for better out of box shuffle) – Spill (partitioning for the win)
Scheduling (FAIR vs FIFO, is there a difference for your pipeline?)
Caching and persistence (it’s the cost of doing business, so what are your options?)
Fault tolerance (blacklisting, speculation, task reaping)
Making the best of a bad deal (skew joins, windowing, UDFs, very large query plans)
Writing to S3 (dealing with write partitions, HDFS and s3DistCp vs writing directly to S3)
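To make the knobs above concrete, here is a minimal PySpark sketch (not taken from the talk; every value is a placeholder you would tune for your own data and cluster) showing where several of these settings live:

    from pyspark.sql import SparkSession

    # Hedged sketch: the values below are placeholders, not recommendations.
    spark = (
        SparkSession.builder
        .appName("large-dataset-tuning-sketch")
        # Size shuffle partitions to the dataset instead of the default 200.
        .config("spark.sql.shuffle.partitions", "4000")
        # FAIR scheduling can matter when several jobs share one cluster.
        .config("spark.scheduler.mode", "FAIR")
        # Speculation and blacklisting help with slow or flaky executors.
        .config("spark.speculation", "true")
        .config("spark.blacklist.enabled", "true")
        # G1 GC often behaves better than the default collector on large heaps.
        .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
        .getOrCreate()
    )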
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...Spark Summit
Spark data processing is shifting from on-premises to cloud services to take advantage of their horizontal resource scalability, better data accessibility and easier manageability. However, fully utilizing the computational power, fast storage and networking offered by cloud services can be challenging without a deep understanding of workload characteristics and proper software optimization expertise. In this presentation, we will use a Spark-based programming framework – Genome Analysis Toolkit version 4 (GATK4, under development) – as an example to present a process for configuring and optimizing an efficient Spark cluster on Google Cloud to speed up genome data processing. We will first introduce an in-house data profiling framework named PAT, and discuss how to use PAT to quickly establish the best combination of VM configurations and Spark configurations to fully utilize cloud hardware resources and Spark's computational parallelism. In addition, we use PAT and other data profiling tools to identify and fix software hotspots in the application. We will show a case study in which we identify a thread scalability issue with the Java instanceof operator; the fix, written in Scala, greatly improves the performance of GATK4 and other Spark-based workloads.
Auto Scaling Systems With Elastic Spark Streaming: Spark Summit East talk by ...Spark Summit
Come explore a feature we’ve created that is not supported out-of-the-box: the ability to add or remove nodes from always-on, real-time Spark Streaming jobs. Elastic Spark Streaming jobs can automatically adjust to the demands of traffic or volume. Using a set of configurable utility classes, these jobs scale down when lulls are detected and scale up when load is too high. We process multiple TBs per day with billions of events. Our traffic pattern experiences natural peaks and valleys with the occasional sustained, unexpected spike. Elastic jobs have freed us from manual intervention, given back developer time, and made a large financial impact through maximized resource utilization.
Deep Learning with Apache Spark and GPUs with Pierce SpitlerDatabricks
Apache Spark is a powerful, scalable real-time data analytics engine that is fast becoming the de facto hub for data science and big data. However, in parallel, GPU clusters are fast becoming the default way to quickly develop and train deep learning models. As data science teams and data savvy companies mature, they will need to invest in both platforms if they intend to leverage both big data and artificial intelligence for competitive advantage.
This session will cover:
– How to leverage Spark and TensorFlow for hyperparameter tuning and for deploying trained models
– DeepLearning4J, CaffeOnSpark, IBM’s SystemML and Intel’s BigDL
– Sidecar GPU cluster architecture and Spark-GPU data reading patterns
– The pros, cons and performance characteristics of various approaches
You’ll leave the session better informed about the available architectures for Spark and deep learning, and Spark with and without GPUs for deep learning. You’ll also learn about the pros and cons of deep learning software frameworks for various use cases, and discover a practical, applied methodology and technical examples for tackling big data deep learning.
There are many common workloads in R that are "embarrassingly parallel": group-by analyses, simulations, and cross-validation of models are just a few examples. In this talk I'll describe several techniques available in R to speed up workloads like these, by running multiple iterations simultaneously, in parallel.
Many of these techniques require the use of a cluster of machines running R, and I'll provide examples of using cloud-based services to provision clusters for parallel computations. In particular, I will describe how you can use the SparklyR package to distribute data manipulations using the dplyr syntax, on a cluster of servers provisioned in the Azure cloud.
Presented by David Smith at Data Day Texas in Austin, January 27 2018.
Scaling Data Analytics Workloads on DatabricksDatabricks
Imagine an organization with thousands of users who want to run data analytics workloads. These users shouldn’t have to worry about provisioning instances from a cloud provider, deploying a runtime processing engine, scaling resources based on utilization, or ensuring their data is secure. Nor should the organization’s system administrators.
In this talk we will highlight some of the exciting problems we’re working on at Databricks in order to meet the demands of organizations that are analyzing data at scale. In particular, data engineers attending this session will walk away having learned how we:
Manage a typical query lifetime through the Databricks software stack
Dynamically allocate resources to satisfy the elastic demands of a single cluster
Isolate the data and the generated state within a large organization with multiple clusters
Modeling Catastrophic Events in Spark: Spark Summit East Talk by Georg Hofman...Spark Summit
A reinsurance company’s core competencies include the quantification of risk associated with catastrophes, such as hurricanes and earthquakes. Various so-called catastrophe models are available publicly, some commercial and some open-source. The volume of data processed by such “cat models” requires Big Data and High-Performance Computing capabilities. This is clearly reflected in the landscape of public models, and the observed trend is toward more and more detailed inputs as well as outputs, which makes scalability an important concern.
Companies that deal with catastrophe risk commonly use one or several public cat models. If they wish to differentiate themselves from the market, they may build internal proprietary models, in particular in areas that are not covered by existing models. The result is a deeper understanding and an independent quantification of risk, both of which can lead to a competitive edge.
Building highly reliable data pipelines @Datadog, by Quentin François (Paris Data Engineers !)
Several features at the core of Datadog's product rely on data pipelines built with Spark that process trillions of data points every day. In this presentation, we will look at the main principles we apply at Datadog to keep our pipelines reliable despite the exponential growth in data volume, hardware failures, corrupted data, and human error.
Paris Data Eng' Meetup, February 26, 2019, @Datadog
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkDatabricks
With the rapid evolution of AI in recent years, we need to embrace advanced and emerging AI technologies to gain insights and make decisions based on massive amounts of data. Ray (https://github.com/ray-project/ray) is a fast and simple framework open-sourced by UC Berkeley RISELab particularly designed for easily building advanced AI applications in a distributed fashion.
Trends for Big Data and Apache Spark in 2017 by Matei ZahariaSpark Summit
Big data remains a rapidly evolving field with new applications and infrastructure appearing every year. In this talk, I’ll cover new trends in 2016 / 2017 and how Apache Spark is moving to meet them. In particular, I’ll talk about work Databricks is doing to make Apache Spark interact better with native code (e.g. deep learning libraries), support heterogeneous hardware, and simplify production data pipelines in both streaming and batch settings through Structured Streaming.
Monitoring and scaling postgres at datadogSeth Rosenblum
An overview of how Datadog has built and scaled its Postgres clusters to support the ingestion of trillions of metric data points per day, by Seth Rosenblum, Lead Data Reliability Engineer.
Some of the most common questions we hear from users relate to capacity planning and hardware choices. How many replicas do I need? Should I consider sharding right away? How much RAM will I need for my working set? SSD or HDD? No one likes spending a lot of cash on hardware, and cloud bills can be just as painful. MongoDB is different from traditional RDBMSs in its resource management, so you need to be mindful when deciding on the cluster layout and hardware. In this talk we will review the factors that drive capacity requirements: volume of queries, access patterns, indexing, working set size, among others. Attendees will gain additional insight as we go through a few real-world scenarios, as experienced with MongoDB Inc. customers, and come up with their ideal cluster layout and hardware.
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...Spark Summit
Drizzle is a low-latency execution engine for Apache Spark targeted at stream processing and iterative workloads. Currently, Spark uses a BSP computation model and notifies the scheduler at the end of each task. Invoking the scheduler at the end of each task adds overhead and results in decreased throughput and increased latency. In Drizzle, we introduce group scheduling, where multiple batches (or a group) of computation are scheduled at once.
This helps decouple the granularity of task execution from scheduling and amortize the costs of task serialization and launch. Our experiments on a 128-node EC2 cluster show that Drizzle can achieve end-to-end streaming latencies of less than 100 ms and up to 3.5x lower latency than Spark Streaming. Compared to Apache Flink, a record-at-a-time streaming system, we show that Drizzle can recover around 4x faster from failures and has up to 13x lower latency during recovery.
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Databricks
Apache Spark is an excellent tool to accelerate your analytics, whether you’re doing ETL, Machine Learning, or Data Warehousing. However, to really make the most of Spark it pays to understand best practices for data storage, file formats, and query optimization. This talk will cover best practices I’ve applied over years in the field helping customers write Spark applications as well as identifying what patterns make sense for your use case.
Using SparkR to Scale Data Science Applications in Production. Lessons from t...Spark Summit
R is a hugely popular platform for Data Scientists to create analytic models in many different domains. But when these applications move from the science lab to the production environment of large enterprises, a new set of challenges arises. Independently of R, Spark has been very successful as a powerful general-purpose computing platform. With the introduction of SparkR, an exciting new option for productionizing Data Science applications has become available. This talk will give insight into two real-life projects at major enterprises where Data Science applications in R have been migrated to SparkR.
• Dealing with platform challenges: R was not installed on the cluster. We show how to execute SparkR on a Yarn cluster with a dynamic deployment of R.
• Integrating Data Engineering and Data Science: we highlight the technical and cultural challenges that arise from closely integrating these two different areas.
• Separation of concerns: we describe how to disentangle ETL and data preparation from analytic computing and statistical methods.
• Scaling R with SparkR: we present what options SparkR offers to scale R applications and how we applied them to different areas such as time series forecasting and web analytics.
• Performance Improvements: we will show benchmarks for an R application that took over 20 hours on a single-server, single-threaded setup. With moderate effort we have been able to reduce that to 15 minutes with SparkR. And we will show how we plan to further reduce this to less than a minute in the future.
• Mixing SparkR, SparkSQL and MLlib: we show how we combined the three different libraries to maximize efficiency.
• Summary and Outlook: we describe what we have learnt so far, what the biggest gaps currently are and what challenges we expect to solve in the short- to mid-term.
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...Dataconomy Media
"Spark, DeepLearning and Life Sciences, Systems Biology in the Big Data age" Dev Lakhani, Founder of Batch Insights
YouTube Link: https://www.youtube.com/watch?v=z6aTv0ZKndQ
About the author:
Dev Lakhani has a background in Software Engineering and Computational Statistics and is a founder of Batch Insights, a Big Data consultancy that has worked on numerous Big Data architectures and data science projects in Tier 1 banking, global telecoms, retail, media and fashion. Dev has been actively working with the Hadoop infrastructure since its inception and is currently researching and contributing to the Apache Spark and Tachyon communities.
Steve Rozen's keynote talk at IEEE CIBCB 2016
Big Genome Data Sheds Light on Cancer Causes
Steven G. Rozen, PhD
Professor, Cancer & Stem Cell Programme, Duke-NUS Medical School, Singapore
Director, Duke-NUS Centre for Computational Biology
The last eight years have seen a revolution in the availability of DNA sequencing data. This revolution has been driven by costs that have plummeted from US$10 million per human genome in 2008 to US$1,200 today. Abundant sequencing data brings with it a previously unimaginable range of research possibilities in all areas of biomedical research. Naturally, these research possibilities make heavy demands on computation and data storage, because the cost of sequencing is falling much faster than Moore's law. In this talk I will present a high-level overview of these computational demands. I will then go into detail on a few of the cancer-related big data projects my lab is working on. One of these is "mutation signature analysis", which has important applications in cancer prevention and epidemiology and in research into the fundamental processes by which cancers arise. One example of the importance of this approach is the recent finding that a highly mutagenic herbal remedy is implicated in many more geographical regions and types of cancer than suspected a few years ago.
"Petascale Genomics with Spark", Sean Owen,Director of Data Science at ClouderaDataconomy Media
"Petascale Genomics with Spark", Sean Owen, Director of Data Science at Cloudera
YouTube Link: https://www.youtube.com/watch?v=HY93FdK5i60
About the Author:
Sean is Director of Data Science at Cloudera, based in London. Before Cloudera, he founded Myrrix Ltd, a company commercializing large-scale real-time recommender systems on Apache Hadoop. He has been a primary committer and VP for Apache Mahout, and is co-author of Mahout in Action. Previously, Sean was a senior engineer at Google. He holds an MBA from the London Business School and a BA in Computer Science from Harvard.
Talk given by Luciano Palma at Intel Software Day 2013 (October 22, 2013).
Learn about the architecture of the Intel Xeon Phi, a coprocessor capable of delivering more than 2 TFLOPS of processing power for your HPC (High Performance Computing) solution.
Performance analysis in a multitenant cloud environment Using Hadoop Cluster ...Orgad Kimchi
Analyzing the performance of a virtualized multitenant cloud environment can be challenging because of the layers of abstraction. This article shows how to use Oracle Solaris 11 to overcome those limitations.
For more information see:
http://www.oracle.com/technetwork/articles/servers-storage-admin/perf-analysis-multitenant-cloud-2082193.html
Have you recently started working with Spark and your jobs take forever to finish? This presentation is for you.
Himanshu Arora and Nitya Nand YADAV have gathered the many best practices, optimizations and adjustments they have applied over the years in production to make their jobs faster and less resource-hungry.
In this presentation, they walk us through advanced Spark optimization techniques, data serialization formats, storage formats, hardware optimizations, control over parallelism, resource manager settings, better data locality, GC tuning, and more.
They also show the appropriate use of RDD, DataFrame and Dataset so as to fully benefit from Spark's internal optimizations.
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...DataStax
In this talk, we review a real-world use case that tested the Cassandra+Spark stack on DataStax Enterprise (DSE). We also cover implementation details around application high availability and fault tolerance using the new DSE File System (DSEFS). From a field and testing perspective, we discuss the strategies we can leverage to meet our requirements. Such requirements include (but are not limited to) functional coverage, system integration, usability, and performance. We will discuss best practices and lessons we learned, covering everything from application development to DSE setup and tuning.
About the Speaker
Rocco Varela Software Engineer in Test, DataStax
After earning his PhD in bioinformatics from UCSF, Rocco Varela took his passion for technology to DataStax. At DataStax he works on several aspects of performance and test automation around DataStax Enterprise (DSE) integrated offerings such as Apache Spark, Hadoop, Solr, and more recently DSE Graph.
Application Logging in the 21st century - 2014.keyTim Bunce
Slides for my talk at the Austrian Perl Workshop in Salzburg on October 10th.
A video of the talk can be found at https://www.youtube.com/watch?v=4Qj-_eimGuE
Tips And Tricks For Bioinformatics Software Engineeringjtdudley
This is a talk I've given twice at Stanford recently. It's essentially a brain dump of my thoughts on being a Bioinformatician with lots of links to useful tools.
ThoughtWorks Tech Talks NYC: DevOops, 10 Ops Things You Might Have Forgotten ...Rosemary Wang
Thoughtworks Tech Talks NYC, 11/30
We built an application or a platform! However, we soon realize that it is t-minus two weeks before release and we have no way of supporting it when it goes to production. Operations has not been trained, no one will know if a component goes down, and somehow the pipeline used in testing does not work in production. Oops. In this talk, we'll cover ten tips from the operations battlefront to remember as you develop an application or platform. With a focus on operations as a user and designing for support, these tips range from reminders on systems quirks to practices on engaging operations early in the development process. By taking a bit of an "operations" mindset in the development process, we can ease the release process and move closer to DevOps culture.
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-MallaSpark Summit
Spark has deservedly been adopted as the leading massively parallel processing framework, and HDFS is one of the most popular Big Data storage technologies; their combination is therefore one of the most common Big Data use cases. But what happens with security? Can these two technologies coexist in a secure environment? Furthermore, with the proliferation of BI technologies adapted to Big Data environments, which demand that several users interact with the same cluster concurrently, can we continue to ensure that our Big Data environments are still secure? In this lecture, Abel and Jorge will explain which adaptations of Spark's core they had to perform in order to guarantee the security of multiple concurrent users sharing a single Spark cluster, using any of its cluster managers, without degrading Spark's outstanding performance.
The monitoring we previously used at our company was Zabbix.
As we moved toward container monitoring, a change was needed, and we naturally started looking at how to monitor with Prometheus.
Lee Young-ju gave a tech session on this topic, and these are the slides from that presentation.
It is organized into five parts, including instructions on how to set everything up.
01. Prometheus?
02. Usage
03. Alertmanager
04. Cluster
05. Performance
Similar to DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with Hadoop and Spark (20)
Bringing Sequential Analysis to A/B Testing with examples from his work at Optimizely.
These slides are from a talk given at the SF Data Engineering meetup. http://www.meetup.com/SF-Data-Engineering/events/231047195/
DataEngConf SF16 - Multi-temporal Data StructuresHakka Labs
A mind-bending way of dealing with time syncing when aggregating data from many disparate sources. Talk by Jasmine Tsai and Alyssa Kwan, Clover Health. To hear about future conferences go to http://dataengconf.com
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...Hakka Labs
Tips for succeeding in your data science job interview. Talk by Bridge Mellichamp, Stitch Labs. To hear about future conferences go to http://dataengconf.com
DataEngConf SF16 - Methods for Content Relevance at LinkedInHakka Labs
Learn how LinkedIn makes article recommendations for its users. Talk by Ajit Singh, LinkedIn. To hear about future conferences go to http://dataengconf.com
Before we dive in, let me ask a couple of questions:
Biologists?
Spark experts?
Gonna tell you a lot of lies today.
There are always at least three different constituencies in the room:
* biologists
* programmers
* someone thinking about how to build a business around this
Won’t satisfy everyone. Where I skip over the truth, maybe there will be at least a breadcrumb of truth left over.
This will not be a very technical talk.
Scared/pissed off some bio people in the past.
Bioinformatics is a field with a long history, thirty or more years as a separate discipline.
At the same time, the fundamental technology is changing.
So if I talk about ‘problems of bioinformatics’ today, it’s OK because
WE COME IN PEACE!
Bioinformatics software development has been *remarkably* effective, for decades.
If there are problems to be solved, these are the result of new technologies, new ambitions of scale.
What even is genomics?
Who here has heard the terms ‘chromosome’ and ‘gene’ before, and could explain the difference?
So before we dive into the main part of the talk, I’m going to spend a few minutes discussing some of the basic biological concepts.
Fundamentally, we’re interested in studying individuals (and populations of individuals)
[ADVANCE]
But each individual is actually a population: of cells
[ADVANCE]
But each of those cells has, ideally, an identical genome.
The genome is a collection of 23 linear molecules. These are called ‘polymers’: they’re built (like Legos) out of a small number of repeating, interlocking parts – the A, T, G, and C you’ve probably heard about.
The content of the genome is determined by the linear order in which these letters are arranged. (Linear is important!)
Without losing much, assume that our genomes are contained on just a single chromosome.
Now, not only do all the cells in your body have identical genomes…
[ADVANCE]
But individual humans have genomes that are very similar to each other.
So similar that I can define “the same” chromosome between individuals… and that means…
[ADVANCE]
That we can define a ‘base’ or a ‘reference’ chromosome.
Now that there is a reference that all of us adhere to…
[ADVANCE]
We can define a concept of ‘location’ across chromosomes.
This is possibly the most important concept in genome informatics, the idea that DNA defines a common linear coordinate system.
This also means that we can talk about differences between individuals in terms of diffs to a common reference genome.
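As a toy illustration (this is not any real file format, just the shape of the idea), a shared reference coordinate system lets us describe an individual as a list of diffs:

    # Toy sketch: coordinates and variants below are invented for illustration.
    from dataclasses import dataclass

    @dataclass
    class Variant:
        chromosome: str   # which reference chromosome, e.g. "chr7"
        position: int     # position on the shared linear coordinate system
        reference: str    # base(s) the reference genome has at that position
        alternate: str    # base(s) this individual has instead

    # An individual's genome, expressed as diffs against the reference.
    individual = [
        Variant("chr7", 117_000_123, "CTT", "C"),  # small deletion
        Variant("chr12", 25_000_456, "C", "T"),    # single-base substitution
    ]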
But where does this reference genome come from?
Here is Bill Clinton (and Craig Venter and Francis Collins), announcing in June of 2000 the “rough draft” of the Human Genome – this is the Human Genome Project.
Took >10 years and $2 billion
What did this actually do?
1570: Theatrum Orbis Terrarum: “Theater of the world”
First modern atlas.
A direct byproduct of the first 100 years of PRINTING, and a tool for describing and exploring the world around us.
Its direct descendants are still with us today!
Google maps!
So how is the map created/used?
Anyone recognize this?
Genome analogy: a text file containing part of the linear sequence of ACGTs.
Difficult to understand.
Mapmakers work to add ANNOTATIONS to the map.
And often, it’s only the annotations that are interesting, so mapmakers focus on *annotation* of the maps themselves.
The core technologies are 2D planar and spherical geometry, geometric operations composed out of latitudes and longitudes.
What does the annotated map of the genome look like?
Chromosome on top. Highlighted red portion is what we’re zoomed in on.
See the scale: total of about 600,000 bases (ACGTs) arranged from left to right.
Multiple annotation “tracks” are overlaid on the genome sequence, marking functional elements, positions of observed human differences, similarity to other animals.
In part it’s the product of numerous additional large biology annotation projects (e.g., HapMap project, 1000 Genomes, ENCODE).
How are these annotations actually generated? Shift gears and talk about the technology.
DNA SEQUENCING
If satellites provide images of the world for cartography, sequences are the microscopes that give you “images” of the genome.
Over the past decade: massive EXPONENTIAL increase in throughput (much faster than Moore’s law)
Get sample
Extract DNA (possibly other manipulations)
Dump into sequencer
Spits out text file (actually looks just like that)
But how to get from the text file to an annotation track that reconstructs a genome or shows position of certain functional elements?
Bioinformatics is the computational process to reconstruct the genomic information. But…
[ADVANCE]
Often considered simply a black box.
What does it actually look like inside?
Pipelines, of course.
Example pipeline: raw sequencing data => a single individual’s “diff” from the reference.
How are these typically structured?
Each step is typically written as a standalone program – passing files from stage to stage
These are written as part of a globally distributed research effort, by researchers and grad students around the world, who have to assume the lowest common denominator: the command line and the filesystem
What does one of these files look like?
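For example, a single record in FASTQ (one of the plain-text formats in question) spans four lines: a read identifier, the called bases, a separator, and per-base qualities encoded as ASCII characters. The read below is made up:

    @read_00001 flowcell1:lane2
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    +
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65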
Text is highly inefficient
Compresses poorly
Values must be parsed
Text is semi-structured
Flexible schemas make parsing difficult
Difficult to make assumptions on data structure
Text poorly separates the roles of delimiters and data
Requires escaping of control characters
(ASCII actually includes RS 0x1E and FS 0x1F, but they’re never used)
But still almost always better than Excel
Imposes severe constraint: global sort invariant. => Many impls depend on this, even if it’s not necessary or conducive to distributed computing.
Bioinformaticians LOVE hand-coded file formats.
But only store several fundamental data types.
Strong assumptions in the formats. Inconsistent implementations in multiple languages.
Doesn’t allow different storage backends.
OK, we discussed what the data/files are like that are passed around. What about the computation itself?
Let’s take one of the transformations in the pipeline. Basically a more complex version of a DISTINCT operation.
Actual code from the standard Picard implementation of MarkDuplicates.
Two things to look at:
The overall algorithm/method
The actual code implementation.
Start by building some data structures from the input files.
Then iterate over file and rewrite is as necessary.
But what if we jump into one of these functions. You’ll find a dependence on…
[ADVANCE]
An input option related to Unix file handle limits?
WTF?
Why should this METHOD need to know anything about the platform it’s running on? LEAKY ABSTRACTIONS
Most bioinformatics tools make strong assumptions about their environments, and also the structure of the data (e.g., global sort), when it shouldn’t be necessary.
Ok, but that’s not all…
[ADVANCE]
We’ve looked at the data and a bit of code for one of these tools. But this runs the pipeline on a single individual.
But of course, it’s never one pipeline…
[ADVANCE]
It’s a pipeline per person!
But since each pipeline runs (essentially) serially, scaling it up is easy…
[ADVANCE]
Scale out!
Typically managed with a pretty low-level job scheduler.
MANUAL split and merge
MANUAL resource request
BABYSIT for failures/errors
CUSTOM intermediate ser/de
But this basically works and the parallelism is pretty simple. This architecture has kept up with the pace of sequencing for some time now.
So why am I even up here talking? Two reasons…
SCALE!
New levels of ambition for large biology projects.
100k genomes at Genomics England in collaboration with National Health Service.
Raw data for a single individual can be in the hundreds of GB
But even before we hit that huge scale (which is soon)…
We don’t want to analyze each sample separately. We want to use ALL THE DATA we generate.
Well, these pipelines often include lots of aggregation, perhaps we can just…
[ADVANCE]
Do the easy thing! Not ideal, especially as the amount of data goes up (data transfer) and the number of files increases (we saw file handles). Things may start hitting the cracks.
But even worse…
[ADVANCE]
God help you if you want to jointly use all the data in earlier part of the pipeline.
So what do we do? Two things
Things like global sort order are overly restrictive and lead to algorithms relying on it when it’s not necessary.
Example of an algo. Bioinformatics loves evaluating probabilistic models on the chromosomes.
We can easily extract parallelism at different parts of our pipelines.
Use higher level distributed computing primitives and let the system figure out all the platform issues for you: storage, job scheduling, fault tolerance, shuffles.
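As a hedged sketch of what that buys you (this is not Picard's or ADAM's actual implementation, and the column names and paths are invented): a duplicate-marking step can be phrased as a grouped, windowed aggregation and left to the engine to shuffle, spill, and schedule:

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.appName("mark-duplicates-sketch").getOrCreate()

    # Hypothetical input: aligned reads with reference_name, start, strand,
    # and a precomputed sum of base qualities per read.
    reads = spark.read.parquet("s3://example-bucket/aligned_reads.parquet")

    # Reads sharing an alignment position and orientation are candidate
    # duplicates; keep the highest-quality one and flag the rest.
    w = (
        Window.partitionBy("reference_name", "start", "strand")
        .orderBy(F.col("sum_base_quality").desc())
    )

    marked = (
        reads
        .withColumn("rank", F.row_number().over(w))
        .withColumn("is_duplicate", F.col("rank") > 1)
        .drop("rank")
    )

    marked.write.parquet("s3://example-bucket/reads_markdup.parquet")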
Layered abstractions.
Use multiple storage engines with different characteristics. Multiple execution engines.
Application code/algos should only touch the top of the abstraction layer.
Cheap scalable STORAGE at bottom
Resource management middle
EXECUTION engines that can run your code on the cluster and provide parallelism
Consistent SERIALIZATION framework
Scientist should NOT WORRY about lower levels (coordination, file formats, storage details, fault tolerance)
Another computation for a statistical aggregate on genome variant data. Details not important.
Spark data flow:
Distributed data load
High level joins/spatial computations that are parallelized as necessary.
But the really nice thing is that, because our data is stored using the Avro data model…
[ADVANCE]
You can execute the exact same computation using, for example, SQL!
Pick the best tool for the job.
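A hedged illustration of that flexibility (the table, column names, and path are invented): the same per-chromosome aggregate over variant records can be written against the DataFrame API or as plain SQL over the same data:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("variant-aggregate-sketch").getOrCreate()

    variants = spark.read.parquet("s3://example-bucket/variants.parquet")
    variants.createOrReplaceTempView("variants")

    # DataFrame version: variant count per chromosome.
    per_chrom = variants.groupBy("chromosome").agg(F.count("*").alias("n"))

    # The same computation, expressed as SQL.
    per_chrom_sql = spark.sql(
        "SELECT chromosome, COUNT(*) AS n FROM variants GROUP BY chromosome"
    )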
We’ve implemented this vision with Spark, starting from the Amplab (same people that gave you Spark) into a project called
ADAM
The reason this works is that Spark naturally handles pipelines, and automatically performs shuffles when appropriate, but also…
In addition to some of the standard pipeline transformations, we implemented the core spatial join operations (analogous to a geospatial library).
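To give a flavor of what a "spatial" join means on a genome (this is not ADAM's actual API, just the relational shape such a region join reduces to, with invented column names):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("region-join-sketch").getOrCreate()

    # Hypothetical inputs: reads and annotation features, each carrying
    # (reference_name, start, end) on the shared reference coordinate system.
    reads = spark.read.parquet("s3://example-bucket/aligned_reads.parquet")
    features = spark.read.parquet("s3://example-bucket/features.parquet")

    # Two intervals on the same chromosome overlap iff each one starts
    # before the other ends (half-open coordinates assumed).
    overlaps = reads.alias("r").join(
        features.alias("f"),
        (F.col("r.reference_name") == F.col("f.reference_name"))
        & (F.col("r.start") < F.col("f.end"))
        & (F.col("f.start") < F.col("r.end")),
    )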
Single-node performance improvements.
Free scalability: fixed price, significant wall-clock improvements
See most recent SIGMOD.
Not to be outdone, Craig Venter proposes 1 million genomes at Human Longevity Inc.