This document summarizes a presentation about using AWS Batch and AWS Step Functions for genomic analysis workflows. It discusses:
- AWS Batch for running containerized jobs on EC2 instances in a managed way. Jobs are run based on definitions, queues, and compute environments.
- AWS Step Functions for visualizing and coordinating the components of distributed applications using state machines and workflows.
- An example architecture using AWS Batch for the job execution layer and AWS Step Functions to orchestrate the workflow, providing flexibility, ease of deployment, and integration with non-Batch applications.
- Potential considerations for data sharing, multitenancy, and volume reuse when using AWS Batch for genomic analysis jobs.
These are the slides for the Productionizing your Streaming Jobs webinar on 5/26/2016.
Apache Spark Streaming is one of the most popular stream processing frameworks, enabling scalable, high-throughput, fault-tolerant processing of live data streams. In this talk, we will focus on the following aspects of Spark Streaming:
- Motivation and most common use cases for Spark Streaming
- Common design patterns that emerge from these use cases and tips to avoid common pitfalls while implementing these design patterns
- Performance Optimization Techniques
Yet another presentation about Event Sourcing? Yes and no. Event Sourcing is a really great concept. Some could say it’s a Holy Grail of the software architecture. I might agree with that, while remembering that everything comes with a price. This session is a summary of my experience with ES gathered while working on 3 different commercial products. Instead of theoretical aspects, I will focus on possible challenges with ES implementation. What could explode (very often with delayed ignition)? How and where to store events effectively? What are possible schema evolution solutions? How to achieve the highest level of scalability and live with eventual consistency? And many other interesting topics that you might face when experimenting with ES.
Andrzej Ludwikowski - Event Sourcing - what could possibly go wrong? (Codemotion)
Yet another presentation about Event Sourcing? Yes and no. Event Sourcing is a really great concept. Some could say it’s a Holy Grail of the software architecture. True, but everything comes with a price. This session is a summary of my experience with ES gathered while working on 3 different commercial products. Instead of theoretical aspects, I will focus on possible challenges with ES implementation. What could explode? How and where to store events effectively? What are possible schema evolution solutions? How to achieve the highest level of scalability and live with eventual consistency?
Last year, in Apache Spark 2.0, Databricks introduced Structured Streaming, a new stream processing engine built on Spark SQL, which revolutionized how developers could write stream processing applications. Structured Streaming enables users to express their computations the same way they would express a batch query on static data. Developers can express queries using powerful high-level APIs including DataFrames, Datasets, and SQL. Then, the Spark SQL engine is capable of converting these batch-like transformations into an incremental execution plan that can process streaming data, while automatically handling late, out-of-order data and ensuring end-to-end exactly-once fault-tolerance guarantees.
Since Spark 2.0, Databricks has been hard at work building first-class integration with Kafka. With this new connectivity, performing complex, low-latency analytics is now as easy as writing a standard SQL query. This functionality, in addition to the existing connectivity of Spark SQL, makes it easy to analyze data using one unified framework. Users can now seamlessly extract insights from data, independent of whether it is coming from messy / unstructured files, a structured / columnar historical data warehouse, or arriving in real-time from Kafka/Kinesis.
In this session, Das will walk through a concrete example where – in less than 10 lines – you read Kafka, parse JSON payload data into separate columns, transform it, enrich it by joining with static data and write it out as a table ready for batch and ad-hoc queries on up-to-the-last-minute data. He’ll use techniques including event-time based aggregations, arbitrary stateful operations, and automatic state management using event-time watermarks.
Apache Spark Streaming: Architecture and Fault Tolerance (Sachin Aggarwal)
Agenda:
• Spark Streaming Architecture
• How different is Spark Streaming from other streaming applications
• Fault Tolerance
• Code Walk through & demo
• We will supplement theory concepts with sufficient examples
Speakers :
Paranth Thiruvengadam (Architect (STSM), Analytics Platform at IBM Labs)
Profile : https://in.linkedin.com/in/paranth-thiruvengadam-2567719
Sachin Aggarwal (Developer, Analytics Platform at IBM Labs)
Profile : https://in.linkedin.com/in/nitksachinaggarwal
Github Link: https://github.com/agsachin/spark-meetup
Distributed Real-Time Stream Processing: Why and How 2.0 (Petr Zapletal)
The demand for stream processing is increasing rapidly these days. Immense amounts of data have to be processed quickly from a rapidly growing set of disparate data sources. This pushes the limits of traditional data processing infrastructures. These stream-based applications include trading, social networks, the Internet of Things, system monitoring, and many other examples.
In this talk we are going to discuss various state-of-the-art open-source distributed streaming frameworks, their similarities and differences, implementation trade-offs and their intended use-cases. Apart from that, I'm going to speak about Fast Data, the theory of streaming, framework evaluation and so on. My goal is to provide a comprehensive overview of modern streaming frameworks and to help fellow developers pick the best possible one for their particular use-case.
Greg Hogan – To Petascale and Beyond: Apache Flink in the Clouds (Flink Forward)
http://flink-forward.org/kb_sessions/to-petascale-and-beyond-apache-flink-in-the-clouds/
Apache Flink performs with low latency but can also scale to great heights. Gelly is Flink’s laboratory for building and tuning scalable graph algorithms and analytics. In this talk we’ll discuss writing algorithms optimized for the Flink architecture, assembling and configuring a cloud compute cluster, and boosting performance through benchmarking and system profiling. This talk will cover recent developments in the Gelly library to include scalable graph generators and a mixed collection of modular algorithms written with native Flink operators. We’ll think like a data stream, keep a cool cache, and send the garbage collector on holiday. To this we’ll add a lightweight benchmarking harness to stress and validate core Flink and to identify and refactor hot code with aplomb.
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel... (Dan Halperin)
Apache Beam (incubating) is a unified batch and streaming data processing programming model that is efficient and portable. Beam evolved from a decade of system-building at Google, and Beam pipelines run today on both open source (Apache Flink, Apache Spark) and proprietary (Google Cloud Dataflow) runners. This talk will focus on I/O and connectors in Apache Beam, specifically its APIs for efficient, parallel, adaptive I/O. Google will discuss how these APIs enable a Beam data processing pipeline runner to dynamically rebalance work at runtime, to work around stragglers, and to automatically scale up and down cluster size as a job’s workload changes. Together these APIs and techniques enable Apache Beam runners to efficiently use computing resources without compromising on performance or correctness. Practical examples and a demonstration of Beam will be included.
Abstract –
Spark 2 is here. While Spark has been the leading cluster computation framework for several years, its second version takes Spark to new heights. In this seminar, we will go over Spark internals and learn the new concepts of Spark 2 to create better scalable big data applications.
Target Audience
Architects, Java/Scala developers, Big Data engineers, team leaders
Prerequisites
Java/Scala knowledge and SQL knowledge
Contents:
- Spark internals
- Architecture
- RDD
- Shuffle explained
- Dataset API
- Spark SQL
- Spark Streaming
Dependency Injection in Apache Spark Applications (Databricks)
Dependency Injection is a programming paradigm that allows for cleaner, reusable, and more easily extensible code. Though Dependency injection has existed for a while now, its use for wiring dependencies in Apache Spark applications is relatively new. In this talk, we present our adventures writing testable Spark applications with dependency injection and explain why it is different than wiring dependencies for web applications due to Spark’s unique programming model.
Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch - Dynami... (Flink Forward)
We have built a Flink-based system to allow our business users to configure processing rules on a Kafka stream dynamically. Additionally it allows the state to be built dynamically using replay of targeted messages from a long term storage system. This allows for new rules to deliver results based on prior data or to re-run existing rules that had breaking changes or a defect. Why we submitted this talk: We developed a unique solution that allows us to handle on the fly changes of business rules for stateful stream processing. This challenge required us to solve several problems -- data coming in from separate topics synchronized on a tracer-bullet, rebuilding state from events that are no longer on Kafka, and processing rule changes without interrupting the stream.
Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Ka... (Reactivesummit)
Akka Streams and its amazing handling of stream back-pressure should be no surprise to anyone. But it takes a couple of use cases to really see it in action - especially use cases where the amount of work increases as you process, making you really value the back-pressure.
This talk takes a sample web crawler use case where each processing pass expands to a larger and larger workload to process, and discusses how we use the buffering capabilities in Kafka and the back-pressure with asynchronous processing in Akka Streams to handle such bursts.
In addition, we will also provide some constructive “rants” about the architectural components, the maturity, or immaturity you’ll expect, and tidbits and open source goodies like memory-mapped stream buffers that can be helpful in other Akka Streams and/or Kafka use cases.
Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache... (Lightbend)
Things were easier when all our data used to be offline, analyzed overnight in batches. Now our data is online, in motion, and generated constantly. For architects, developers and their businesses, this means that there is an urgent need for tools and applications that can deliver real-time (or near real-time) streaming ETL capabilities.
In this session by Konrad Malawski, author, speaker and Senior Akka Engineer at Lightbend, you will learn how to build these streaming ETL pipelines with Akka Streams, Alpakka and Apache Kafka, and why they matter to enterprises that are increasingly turning to streaming Fast Data applications.
Meet Up - Spark Stream Processing + Kafka (Knoldus Inc.)
Stream processing is the real-time processing of data continuously, concurrently, and in a record-by-record fashion.
It treats data not as static tables or files, but as a continuous infinite stream of data integrated from both live and historical sources.
In these slides we'll be looking into Spark Stream Processing with Kafka.
Big Data Day LA 2016 / Big Data Track - Portable Stream and Batch Processing w... (Data Con LA)
This talk explores deploying a series of small and large batch and streaming pipelines locally, to Spark and Flink clusters and to Google Cloud Dataflow services to give the audience a feel for the portability of Beam, a new portable Big Data processing framework recently submitted by Google to the Apache foundation. This talk will look at how the programming model handles late arriving data in a stream with event time, windows, and triggers.
Event sourcing - what could possibly go wrong? Devoxx PL 2021 (Andrzej Ludwikowski)
Yet another presentation about Event Sourcing? Yes and no. Event Sourcing is a really great concept. Some could say it’s a Holy Grail of the software architecture. I might agree with that, while remembering that everything comes with a price. This session is a summary of my experience with ES gathered while working on 3 different commercial products. Instead of theoretical aspects, I will focus on possible challenges with ES implementation. What could explode (very often with delayed ignition)? How and where to store events effectively? What are possible schema evolution solutions? How to achieve the highest level of scalability and live with eventual consistency? And many other interesting topics that you might face when experimenting with ES.
Understanding Akka Streams, Back Pressure, and Asynchronous Architectures (Lightbend)
The term 'streams' has been getting pretty overloaded recently–it's hard to know where to best use different technologies with streams in the name. In this talk by noted hAkker Konrad Malawski, we'll disambiguate what streams are and what they aren't, taking a deeper look into Akka Streams (the implementation) and Reactive Streams (the standard).
You'll be introduced to a number of real life scenarios where applying back-pressure helps to keep your systems fast and healthy at the same time. While the focus is mainly on the Akka Streams implementation, the general principles apply to any kind of asynchronous, message-driven architectures.
Spark real world use cases and optimizations (Gal Marder)
Using Spark for Big Data has become the standard in the industry. The internet is full of "hello world" examples, but when your Spark job meets production, all hell breaks loose. We will cover real-world use cases, how they were designed, why they didn't work, and how we made them run fast.
S3, Cassandra or outer space? Dumping time series data using Spark (Demi Ben-Ari)
A vast volume of our processed data is time series data, and once you start working with distributed systems you start tackling many scale and performance problems. Many questions arise:
How to handle missing data?
Should my system handle both serving and backend processing, or should they be separated?
Which one of the solutions will be cheaper? Best Performance for Money?
In the talk we will tell the tale of all of the transformations we’ve made to our data model @Windward, show some of the problems we’ve handled, review the multiple data persistency layers like: S3, MongoDB, Apache Cassandra, MySQL.
And I’ll try my best NOT to answer the question “Which one of them is the Best?”
Sharing our Pain and Lessons learned is promised!
Bio:
Demi Ben-Ari, Sr. Data Engineer @Windward.
I have over 9 years of experience building various systems, from near-real-time applications to Big Data distributed systems.
Co-Founder of the “Big Things” Big Data community: http://somebigthings.com/big-things-intro/
I’m a software development groupie, interested in tackling cutting-edge technologies.
Kenneth Knowles - Apache Beam - A Unified Model for Batch and Streaming Data... (Flink Forward)
http://flink-forward.org/kb_sessions/apache-beam-a-unified-model-for-batch-and-streaming-data-processing/
Unbounded, unordered, global-scale datasets are increasingly common in day-to-day business, and consumers of these datasets have detailed requirements for latency, cost, and completeness. Apache Beam (incubating) defines a new data processing programming model that evolved from more than a decade of experience within Google, including MapReduce, FlumeJava, MillWheel, and Cloud Dataflow. Beam handles both batch and streaming use cases and neatly separates properties of the data from runtime characteristics, allowing pipelines to be portable across multiple runtimes, both open-source (e.g., Apache Flink, Apache Spark, et al.) and proprietary (e.g., Google Cloud Dataflow). This talk will cover the basics of Apache Beam, touch on its evolution, describe main concepts in the programming model, and compare with similar systems. We’ll go from a simple scenario to a relatively complex data processing pipeline, and finally demonstrate execution of that pipeline on multiple runtimes.
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t... (Spark Summit)
Clustering is often an essential first step in data mining, intended to reduce redundancy or define data categories. Hierarchical clustering, a widely used clustering technique, can offer a richer representation by suggesting potential group structures. However, parallelization of such an algorithm is challenging, as it exhibits inherent data dependency during the hierarchical tree construction. In this paper, we design a parallel implementation of single-linkage hierarchical clustering by formulating it as a Minimum Spanning Tree problem. We further show that Spark is a natural fit for parallelizing the single-linkage clustering algorithm due to its natural expression of iterative processes. Our algorithm can be deployed easily in Amazon's cloud environment, and a thorough performance evaluation on Amazon EC2 verifies that the scalability of our algorithm is sustained as the datasets scale up.
AWS Batch is a service that enables developers, scientists, and engineers to easily and efficiently run batch computing workloads at scale on AWS. AWS Batch automatically provisions compute resources and optimizes the workload distribution based on the quantity and scale of the workloads, allowing you to focus on analyzing results and solving problems.
In this session, led by the AWS Batch service team, you will learn core concepts behind AWS Batch and details of how the service functions. We will cover multiple patterns used by customers to leverage storage and GPUs as part of their batch workloads. We will also cover how to integrate AWS Batch with other services such as AWS Step Functions for decision based workloads or Amazon CloudWatch Events to trigger batch jobs based on events or schedules.
AWS re:Invent 2016: IoT Visualizations and Analytics (IOT306) (Amazon Web Services)
In this workshop, we focus on visualizations of IoT data using ELK (Amazon Elasticsearch Service, Logstash, and Kibana) or Amazon Kinesis. We will dive into how these visualizations can give you new capabilities and understanding when interacting with your device data, using the context they provide on the world around them.
Announcing AWS Batch - Run Batch Jobs At Scale - December 2016 Monthly Webina... (Amazon Web Services)
AWS Batch enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS. AWS Batch dynamically provisions the optimal quantity and type of compute resources (e.g., CPU or memory optimized instances) based on the volume and specific resource requirements of the batch jobs submitted. With AWS Batch, there is no need to install and manage batch computing software or server clusters that you use to run your jobs, allowing you to focus on analyzing results and solving problems.
Learning Objectives:
• Learn about the capabilities and features of AWS Batch
• Learn about the benefits of AWS Batch
• Learn about the different use cases
• Learn how to get started using AWS Batch
AWS Batch is a fully-managed service that enables developers, scientists, and engineers to easily and efficiently run batch computing workloads of any scale on AWS. AWS Batch automatically provisions compute resources and optimizes the workload distribution based on the quantity and scale of the workloads. With AWS Batch, there is no need to install or manage batch computing software, allowing you to focus on analyzing results and solving problems. AWS Batch plans, schedules, and executes your batch computing workloads across the full range of AWS compute services and features, such as Amazon EC2, Spot Instances, and AWS Lambda. AWS Batch reduces operational complexities, saving time and reducing costs. In this session, you will learn core concepts behind AWS Batch and details of how the service functions.
NEW LAUNCH! Introducing AWS Batch: Easy and efficient batch computing on Amaz... (Amazon Web Services)
AWS Batch is a fully-managed service that enables developers, scientists, and engineers to easily and efficiently run batch computing workloads of any scale on AWS. AWS Batch automatically provisions compute resources and optimizes the workload distribution based on the quantity and scale of the workloads. With AWS Batch, there is no need to install or manage batch computing software, allowing you to focus on analyzing results and solving problems. AWS Batch plans, schedules, and executes your batch computing workloads across the full range of AWS compute services and features, such as Amazon EC2, Spot Instances, and AWS Lambda. AWS Batch reduces operational complexities, saving time and reducing costs. In this session, Principal Product Managers Jamie Kinney and Dougal Ballantyne describe the core concepts behind AWS Batch and details of how the service functions. The presentation concludes with relevant use cases and sample code.
Microservices, Continuous Delivery, and Elasticsearch at Capital One (Noriaki Tatsumi)
This presentation focuses on the implementation of Continuous Delivery and Microservices principles in Capital One's cybersecurity data platform, which ingests ~6 TB of data every day and in which Elasticsearch is a core component.
The Boss: A Petascale Database for Large-Scale Neuroscience, Powered by Serve... (Amazon Web Services)
The IARPA Machine Intelligence from Cortical Networks (MICrONS) program is a research endeavor created to improve neurally-plausible machine-learning algorithms by understanding data representations and learning rules used by the brain through structurally and functionally interrogating a cubic millimeter of mammalian neocortex. This effort requires efficiently storing, visualizing, and processing petabytes of neuroimaging data. The Johns Hopkins University Applied Physics Laboratory (APL) has developed an open-source, highly available service to manage these data, called the Boss. The Boss uses AWS to provide a cloud-native spatial database with an innovative storage hierarchy and auto-scaling capability to balance cost and performance. This system extensively uses serverless components to meet both scalability and cost requirements. In this session, we provide an overview of the Boss, and we focus on how the APL used Amazon DynamoDB, AWS Lambda, and AWS Step Functions for several high-throughput components of the system. We discuss both the challenges and successes with serverless technologies.
Amazon Elastic Kubernetes Service (EKS) is fully compatible with applications running in standard Kubernetes environments. This hands-on session covers creating a Kubernetes cluster on AWS; deploying, managing, and scaling containerized applications; logging and monitoring; and implementing the recently released capability of assigning AWS IAM permissions to Pods on Amazon EKS.
Batch Processing with Containers on AWS - June 2017 AWS Online Tech Talks (Amazon Web Services)
Learning Objectives:
- Learn about the options for running batch workloads on AWS
- Learn how to architect a containerized batch processing service on Amazon ECS
- Learn best practices for optimizing and scaling complex batch workload requirements
Batch processing is useful when you need to periodically analyze large amounts of data, but configuring and scaling a cluster of virtual machines to process complex batch jobs can be difficult. Containers provide a great solution for running batch jobs by providing easily managed, scalable, and portable code environments.
In this tech talk, we’ll show you how to use containers on AWS for batch processing jobs that can scale quickly and cost-effectively. We’ll discuss AWS Batch, our fully managed batch-processing service, and show you how to architect your own batch processing service using the Amazon EC2 Container Service. We’ll also discuss best practices for ensuring efficient and opportunistic scheduling, fine-grained monitoring, compute resource auto-scaling, and security for your batch jobs.
BDA402 Deep Dive: Log Analytics with Amazon Elasticsearch Service (Amazon Web Services)
Everything generates logs. Applications, infrastructure, security ... everything. Keeping track of the flood of log data is a big challenge, yet critical to your ability to understand your systems and troubleshoot (or prevent) issues. In this session, we will use both Amazon CloudWatch and application logs to show you how to build an end-to-end log analytics solution. First, we cover how to configure an Amazon Elasticsearch Service domain and ingest data into it using Amazon Kinesis Firehose, demonstrating how easy it is to transform data with Firehose. We look at best practices for choosing instance types, storage options, shard counts, and index rotations based on the throughput of incoming data and configure a secure analytics environment. We demonstrate how to set up a Kibana dashboard and build custom dashboard widgets. Finally, we dive deep into the Elasticsearch query DSL and review approaches for generating custom, ad-hoc reports.
Using AWS Batch and AWS Step Functions to Design and Run High-Throughput Work... (Amazon Web Services)
Learning Objectives:
- How to simply scale out your batch workflows on AWS
- How to think about container/job management within managed, high-throughput workflows
- How to build a scalable orchestration framework within AWS Step Functions
2. Big Data Genomics NYC
• We meet quarterly
• We are passionate about Big-Data technologies, the human genome and personalized medicine
• We have 775 genomies (members) #GenomicsNYC
3. Past MeetUps include:
• Dipping into Guacamole – a Spark-powered Somatic Variant Caller
• Next Generation Tools and Strategies for Genomic Analysis
• Leverage ADAM and Spark for Genomic Analysis
4. Building Genomics Pipelines in the Cloud: Using AWS Batch and AWS Step Functions to Design and Run High-Throughput Workflows
with Angel Pizarro
5. What we will cover
• Some context
• Service Overview – AWS Batch
• Service Overview – AWS Step Functions
• Architecture Deep Dive
6. We see similar data analysis patterns
Life Sciences
Financial Services
9. Introducing AWS Batch
• Fully Managed: No software to install or servers to manage. AWS Batch provisions and scales your infrastructure.
• Integrated with AWS: AWS Batch jobs can easily and securely interact with services such as Amazon S3, DynamoDB, and Rekognition.
• Cost-Efficient: AWS Batch launches compute resources tailored to your jobs and can provision Amazon EC2 and EC2 Spot instances.
11. Example AWS Batch Job Architecture
[Architecture diagram: S3 events trigger a Lambda function that submits an AWS Batch job. The job definition supplies the application image, the IAM role for the Batch job, resource requirements, and other parameters. Submitted jobs wait in a queue of runnable jobs until the AWS Batch scheduler places them into compute environments for execution; input files are read from, and output written to, Amazon S3.]
12. Job Definitions
Similar to ECS Task Definitions, AWS Batch Job Definitions specify how jobs are to be run. While each job must reference a job definition, many parameters can be overridden.
Some of the attributes specified in a job definition:
• IAM role associated with the job
• vCPU and memory requirements
• Mount points
• Container properties
• Environment variables
$ aws batch register-job-definition --job-definition-name gatk --container-properties ...
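For concreteness, here is a hedged sketch of what the elided --container-properties JSON might contain; the ECR image URI, role ARN, resource sizes, and environment variable are illustrative assumptions, not values from the talk:

$ aws batch register-job-definition \
    --job-definition-name gatk \
    --type container \
    --container-properties '{
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/gatk:latest",
        "vcpus": 8,
        "memory": 32000,
        "command": ["HaplotypeCaller", "Ref::sampleArgs"],
        "jobRoleArn": "arn:aws:iam::123456789012:role/BatchJobRole",
        "environment": [{"name": "REF_GENOME", "value": "s3://example-bucket/hg38.fa"}],
        "mountPoints": [{"sourceVolume": "scratch", "containerPath": "/scratch"}],
        "volumes": [{"name": "scratch", "host": {"sourcePath": "/scratch"}}]
    }'

The Ref:: prefix marks a placeholder that can be substituted with a parameter value at job submission time.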
13. Jobs
Jobs are the unit of work executed by AWS Batch as containerized applications running on Amazon EC2.
As your job starts, AWS Batch creates a container using the command and parameters specified in your job definition. You can optionally override properties such as CPU and memory requirements.
$ aws batch submit-job --job-name variant-calling --job-definition gatk:12 --job-queue genomics
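As a hedged illustration of those per-job overrides (the values and the SAMPLE_ID variable are assumptions made for this sketch):

$ aws batch submit-job --job-name variant-calling \
    --job-definition gatk:12 --job-queue genomics \
    --container-overrides '{
        "vcpus": 16,
        "memory": 64000,
        "environment": [{"name": "SAMPLE_ID", "value": "NA12878"}]
    }'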
14. Job Queues
Jobs are submitted to a Job Queue, where they reside until they are able to be scheduled onto a compute resource. Information related to completed jobs persists in the queue for 24 hours.
$ aws batch create-job-queue --job-queue-name genomics --priority 500 --compute-environment-order ...
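A hedged sketch of the elided compute-environment ordering, mapping the queue to an On-Demand environment first and a Spot environment second (both environment names are hypothetical):

$ aws batch create-job-queue --job-queue-name genomics \
    --priority 500 --state ENABLED \
    --compute-environment-order order=1,computeEnvironment=genomics-ondemand \
                                order=2,computeEnvironment=genomics-spot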
15. Compute Environments
Job queues are mapped to one or more Compute Environments, which contain the EC2 instances used to run your AWS Batch jobs.
Managed compute environments enable you to describe your business requirements (instance types, min/max/desired vCPUs, and EC2 Spot bid as a % of On-Demand). AWS Batch will then launch an elastic quantity of instances from a range of instance types based on your jobs’ requirements.
You can select specific instance types (e.g. c4.8xlarge), instance families (e.g. C4, M4, R4), or simply choose “optimal” and AWS Batch will launch appropriately sized instances from our more-modern instance families.
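A hedged sketch of creating such a managed Spot compute environment; the subnet IDs, security group, IAM roles, and 40% bid are placeholder assumptions:

$ aws batch create-compute-environment \
    --compute-environment-name genomics-spot \
    --type MANAGED \
    --state ENABLED \
    --compute-resources '{
        "type": "SPOT",
        "bidPercentage": 40,
        "minvCpus": 0,
        "maxvCpus": 256,
        "desiredvCpus": 0,
        "instanceTypes": ["optimal"],
        "subnets": ["subnet-11111111", "subnet-22222222"],
        "securityGroupIds": ["sg-33333333"],
        "instanceRole": "ecsInstanceRole",
        "spotIamFleetRole": "arn:aws:iam::123456789012:role/AmazonEC2SpotFleetRole"
    }' \
    --service-role arn:aws:iam::123456789012:role/AWSBatchServiceRole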
16. AWS Batch Concepts
The Scheduler evaluates when, where, and how to run jobs that have been submitted to a job queue.
Jobs run in approximately the order in which they are submitted, as long as all dependencies on other jobs have been met.
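Dependencies are declared at submission time, and the scheduler holds a job until the jobs it depends on have succeeded. A hedged sketch (the parent job ID is a hypothetical placeholder):

$ aws batch submit-job --job-name joint-genotyping \
    --job-definition gatk:12 --job-queue genomics \
    --depends-on jobId=11111111-2222-3333-4444-555555555555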
19. AWS Step Functions…
…makes it easy to coordinate the components of distributed applications using visual workflows.
20. Application Lifecycle in AWS Step Functions
Define in JSON → Visualize in the Console → Monitor Executions
21. Seven State Types
• Task: A single unit of work
• Choice: Adds branching logic
• Parallel: Fork and join the data across tasks
• Wait: Delay for a specified time
• Fail: Stops an execution and marks it as a failure
• Succeed: Stops an execution successfully
• Pass: Passes its input to its output
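Tying this to the "define in JSON" step on slide 20, here is a hedged, minimal Amazon States Language sketch that drives an AWS Batch job with Task, Wait, Choice, Succeed, and Fail states. The Lambda function ARNs and the $.status field are assumptions made for illustration:

{
  "Comment": "Hypothetical sketch: submit a Batch job, then poll until it finishes",
  "StartAt": "SubmitJob",
  "States": {
    "SubmitJob": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:SubmitBatchJob",
      "Next": "WaitForJob"
    },
    "WaitForJob": { "Type": "Wait", "Seconds": 30, "Next": "CheckJobStatus" },
    "CheckJobStatus": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:CheckBatchJobStatus",
      "Next": "JobComplete?"
    },
    "JobComplete?": {
      "Type": "Choice",
      "Choices": [
        { "Variable": "$.status", "StringEquals": "SUCCEEDED", "Next": "Done" },
        { "Variable": "$.status", "StringEquals": "FAILED", "Next": "JobFailed" }
      ],
      "Default": "WaitForJob"
    },
    "Done": { "Type": "Succeed" },
    "JobFailed": { "Type": "Fail", "Error": "BatchJobFailed", "Cause": "The AWS Batch job did not succeed" }
  }
}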
22. Build Visual Workflows Using State Types
[Slide graphic: an example AWS Step Functions visual workflow combining Task, Choice, Parallel, and Fail states to classify images of mountains, people, and snow.]
27. Considerations for Batch Layer: Data Sharing
Consideration: Jobs are managed at the container, not the instance, level. You cannot guarantee that consecutive containers in a workflow will run on the same instance.
Solution: Stage all data in Amazon S3, and read and write everything from there. This is also important for traceability, logging, etc.
28. Considerations for Batch Layer: Multitenancy
Consideration: Multiple containers may run batch processes on the same instance in the same base working directory.
Solution: Within the scratch directory, each batch process creates a subfolder with a unique ID. All scratch data is written to this subdirectory.
29. Considerations for Batch Layer: Volume Reuse
Consideration: Scratch data should live only as long as the job using it, in order to optimize instance and Amazon EBS storage costs.
Solution: Within the scratch directory, each batch process creates a subfolder with a unique ID. All scratch data is written to this subdirectory, which is deleted at the end of the job.
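The data sharing, multitenancy, and volume reuse patterns from slides 27-29 can be combined in the container's entrypoint script. A minimal sketch, assuming a /scratch mount point and input/output locations passed in as environment variables; the bucket variables and the run_analysis step are hypothetical:

#!/bin/bash
# Isolate each job's scratch data in a uniquely named subdirectory, and delete it on exit.
set -euo pipefail
SCRATCH_ROOT=/scratch
# AWS Batch injects AWS_BATCH_JOB_ID into the container environment;
# fall back to a random UUID when running outside of Batch.
JOB_ID="${AWS_BATCH_JOB_ID:-$(uuidgen)}"
JOB_SCRATCH="${SCRATCH_ROOT}/${JOB_ID}"
mkdir -p "${JOB_SCRATCH}/output"
trap 'rm -rf "${JOB_SCRATCH}"' EXIT  # volume reuse: remove scratch subdirectory at end of job

# Data sharing: stage inputs from S3 and write all results back to S3.
aws s3 cp "s3://${INPUT_BUCKET}/${INPUT_KEY}" "${JOB_SCRATCH}/"
run_analysis "${JOB_SCRATCH}"        # hypothetical placeholder for the actual pipeline step
aws s3 cp "${JOB_SCRATCH}/output/" "s3://${OUTPUT_BUCKET}/${JOB_ID}/" --recursive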
32. A Flexible Workflow Deployment Model
• Decouple batch engine and workflow orchestration
• Workflow creation now done as JSON
• Easier to deploy
• Easier to automate
• Easier to test
• Can integrate non-Batch applications as well
38. Control Plane for Other Infrastructure
Human Microbiome Project Public Data Set
Targeted 16S sequencing of 300 healthy adults at 18 specific sites (oral cavity, airways, urogenital tract, skin, and gut)
https://s3-us-west-2.amazonaws.com/human-microbiome-project
40. IARPA MICrONS
Intelligence Advanced Research Projects Activity: Machine Intelligence from Cortical Networks
• MICrONS seeks to revolutionize machine learning by understanding the representations, transformations, and learning rules employed by the brain
• The program is expressly designed as a dialogue between computer science, data science, and neuroscience
[Slide diagram: a loop connecting a neurally-plausible machine learning framework with behavior experiments, functional imaging, structural imaging, and data analysis.]
41. Why Is This Different?
• Current neural networks are “neurally inspired” but not considered biofidelic or neurally plausible
• Previous projects to build algorithms based on the brain exist, but have been focused on macro and micro information, or lower-fidelity statistics
• Little is known about the brain at the mesoscale
• A “cortical column” is theorized to be of order ~1mm3
• In this program, structure and function co-registration provides a uniquely rich picture of computing circuits
• Researchers are directly measuring mesoscale activity and circuits
[Slide diagram: scales of brain measurement, from the microscale (1-100s of neurons) through the mesoscale (1k – 1M neurons) to the macroscale (brain regions), with the Human Connectome Project noted and the mesoscale marked “?”.]
42. Why Is This Different: Functional Imaging
Video Credit: Tianyu Wang (Xu Lab, Cornell University) & Jacob Reimer (Tolias Lab, Baylor College of Medicine)
43. Why Is This Different: Structural Imaging
• Peta-scale structural imaging
• A 1mm3 region is large enough to contain meaningful circuits never before observed
  • ~50k-100k neurons
  • ~100,000,000 synapses
  • ~4x4x30nm voxels
  • ~2 – 2.5 PB
• Three different techniques
  • Scanning Electron Microscopy (SEM)
  • Transmission Electron Microscopy (TEM)
  • Fluorescent in situ sequencing (FISSEQ) barcoding
Video Credit: Kasthuri, et al. - Cell 2015
Bobby Kasthuri, Daniel Berger, Jeff Lichtman
44. Why Is This Different: Co-registered Data
• Co-registration links structure to function
• For the first time, researchers will measure in the same sample at scale:
  • Stimulus (”input”)
  • Behavior (“output”)
  • Connectome (“circuit diagram”)
  • Neuronal Activity (“voltages”)
Calcium Imaging Data – Tolias Lab, Baylor College of Medicine
X-ray Tomography and co-registration – Allen Institute for Brain Science
45. Why Can We Succeed Now?
• New imaging techniques and engineering capabilities can interrogate mesoscale circuits
• Increased computing power has enabled automated analysis with machine learning
• Reduced storage costs have made collection and analysis of many petabytes of data possible
• Use of the cloud has provided the ability to scale when needed and facilitates sharing and collaboration
We can directly observe and reconstruct mesoscale neuronal circuits in vivo for the first time.
https://www.karlrupp.net
46. The Boss: Block and Object Storage Service
The Boss is a multi-dimensional spatial database, provided as a managed service on AWS.
The Boss stores annotation data co-registered to image data.
• An annotation is a unique 64-bit identifier applied to a set of voxels, representing its spatial distribution
[Slide graphic: example annotation IDs (1267, 345345, 534534799) applied to voxel regions.]