(SPOT211) State of the Union: Amazon Compute Services | AWS re:Invent 2014 | Amazon Web Services
Join Peter De Santis, Vice President of Amazon Compute Services, and Matt Garman, Vice President of Amazon EC2, as they share a "behind the scenes" look at the evolution of compute at AWS. You'll hear about the drivers behind the innovations we've introduced, and learn how we've scaled our compute services to meet dramatic usage growth.
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East... | Spark Summit
Almost all organizations now need data science, and the main challenge after selecting an algorithm is to scale it up and make it operational. We at Comcast use several tools and technologies, such as Python, R, SAS, H2O, and so on.
In this talk we will show how many common use cases can be served by standard algorithms such as Logistic Regression, Random Forest, Decision Trees, Clustering, and NLP.
Spark has several machine learning algorithms built in and offers excellent scalability. Hence we at Comcast built a platform that provides DSaaS on top of Spark, with a REST API for controlling and submitting jobs, shielding most users from the rigor of writing (and repeating) code so they can focus on the actual requirements. We will show how we solved some of the problems of establishing feature vectors, choosing algorithms, and then deploying models into production.
We will showcase our use of Scala, R and Python to implement models in the language of choice while deploying quickly into production on 500-node Spark clusters.
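To make the "repeated code" such a platform abstracts away concrete, here is a minimal spark.ml sketch of the feature-vector-plus-model boilerplate involved; the table path, column names, and choice of logistic regression are illustrative assumptions, not Comcast's actual code:
```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object DsaasModelSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("dsaas-sketch").getOrCreate()

    // Hypothetical training table; path and column names are illustrative only.
    val df = spark.read.parquet("/data/churn_features")

    // Assemble raw columns into the single feature vector spark.ml expects.
    val assembler = new VectorAssembler()
      .setInputCols(Array("tenureMonths", "monthlyUsageGb", "supportCalls"))
      .setOutputCol("features")

    val lr = new LogisticRegression()
      .setLabelCol("churned")
      .setFeaturesCol("features")

    // A Pipeline bundles the stages so the whole flow is fit and saved once.
    val model = new Pipeline().setStages(Array(assembler, lr)).fit(df)
    model.write.overwrite().save("/models/churn_lr")

    spark.stop()
  }
}
```
Wrapping this pattern behind a REST endpoint is what lets platform users submit only data locations and algorithm choices instead of rewriting this scaffolding for every model.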
The .NET ecosystem spent years on the sidelines, watching the NoSQL and distributed computing movements flourish in ecosystems like Java, Node.JS, and others.
Over the past year or so, the .NET ecosystem took matters into its own hands and has feverishly started adopting new ideas like NoSQL, reactive programming, the actor model, and more!
In this talk we're going to explore what the modern .NET enterprise stack looks like: Cassandra, Akka.NET, and Windows Azure. We'll also share the exciting new possibilities this has created for some of the largest .NET shops in the world.
Tuning Java Driver for Apache Cassandra by Nenad Bozic at Big Data Spain 2017 | Big Data Spain
Apache Cassandra is a distributed, masterless column-store database that is becoming mainstream for analytics and IoT data.
https://www.bigdataspain.org/2017/talk/tuning-java-driver-for-apache-cassandra
Big Data Spain 2017
November 16th - 17th, Kinépolis Madrid
R&D to Product Pipeline Using Apache Spark in AdTech: Spark Summit East talk ... | Spark Summit
The central premise of DataXu is to apply data science to better marketing. At its core is the Real Time Bidding Platform, which processes 2 petabytes of data per day and responds to ad auctions at a rate of 2.1 million requests per second across 5 continents. On top of this platform sits DataXu's analytics engine, which gives clients insightful analytics reports that address their marketing business questions. Common requirements for both platforms are real-time processing, scalable machine learning, and ad-hoc analytics. This talk will showcase DataXu's successful use cases of employing the Apache Spark framework and Databricks to address all of the above challenges while maintaining the agility and rapid-prototyping strengths needed to take a product from the initial R&D phase to full production. The team will share their best practices and highlight the steps of large-scale Spark ETL processing and model testing, all the way through to interactive analytics.
Learn about features with demos and announcements, from cross-cluster replication and frozen indices in Elasticsearch to Kibana Spaces and the ever-growing set of data integrations in Beats and Logstash.
Trends for Big Data and Apache Spark in 2017 by Matei Zaharia | Spark Summit
Big data remains a rapidly evolving field with new applications and infrastructure appearing every year. In this talk, I’ll cover new trends in 2016 / 2017 and how Apache Spark is moving to meet them. In particular, I’ll talk about work Databricks is doing to make Apache Spark interact better with native code (e.g. deep learning libraries), support heterogeneous hardware, and simplify production data pipelines in both streaming and batch settings through Structured Streaming.
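For readers unfamiliar with Structured Streaming's unified batch/streaming model mentioned above, a minimal Scala sketch; the broker address and topic name are placeholders:
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

object StructuredStreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("structured-streaming-sketch").getOrCreate()
    import spark.implicits._

    // Stream events from Kafka; broker and topic are placeholders.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()

    // The Kafka source exposes a `timestamp` column; count events per 1-minute window.
    // The same groupBy/count would work unchanged on a batch DataFrame.
    val counts = events
      .groupBy(window($"timestamp", "1 minute"))
      .count()

    counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```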
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St... | Spark Summit
Spark data processing is shifting from on-premises to cloud services to take advantage of their horizontal resource scalability, better data accessibility and easier manageability. However, fully utilizing the computational power, fast storage and networking offered by a cloud service can be challenging without a deep understanding of workload characteristics and proper software optimization expertise. In this presentation, we will use a Spark-based programming framework, Genome Analysis Toolkit version 4 (GATK4, under development), as an example to present a process for configuring and optimizing a proficient Spark cluster on Google Cloud to speed up genome data processing. We will first introduce an in-house data profiling framework named PAT, and discuss how to use PAT to quickly establish the best combination of VM configurations and Spark configurations to fully utilize cloud hardware resources and Spark computational parallelism. In addition, we use PAT and other data profiling tools to identify and fix software hotspots in the application. We will show a case study in which we identify a thread scalability issue with the Java instanceof operator. The fix, in Scala, greatly improves the performance of GATK4 and other Spark-based workloads.
See time series forecasting and automatic log data categorization in action firsthand. Elastic machine learning features have grown into a powerful tool that automates notifications for anomalies and simplifies tasks like pre-configuring NGINX log analysis at scale. Learn how to put them to work on your data.
Optimizing Elastic for Search at McQueen Solutions | Elasticsearch
Learn best practices for squeezing every last drop of performance out of Elasticsearch queries and aggregations -- all based on real-world production clusters.
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi... | Databricks
While systems like Apache Spark have moved beyond a simple map-reduce model, many data scientists and scientific users still struggle with complex cluster management and configuration tools when trying to do data processing in the cloud. Recently, cloud providers have offered infrastructure such as AWS Lambda to run event-driven, stateless functions as micro-services. In this model, a function is deployed once and is invoked repeatedly whenever new inputs arrive, scaling elastically with input size. In this session, the speakers claim that microservices on serverless infrastructure present a viable platform for eliminating cluster management overhead and fulfilling the promise of elasticity in cloud computing for all users. Their key insight is that they can dynamically inject code into these stateless functions and, combined with remote storage, build a data processing system that inherits the elasticity of the serverless model while offering the simplicity required by end users.
Using PyWren, their implementation on AWS Lambda, they show that this model is general enough to implement a number of distributed computing models, such as BSP, efficiently. Learn about a number of scientific and machine learning applications that they have built with PyWren, and how this model could be used to develop a serverless-Spark in the future.
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz... | Databricks
Debugging big data analytics in Data-Intensive Scalable Computing (DISC) systems is a time-consuming effort. Today's DISC systems offer very little tooling for debugging and, as a result, programmers spend countless hours analyzing log files and performing trial-and-error debugging. To aid this effort, UCLA developed BigDebug, an interactive debugging tool and automated fault localization service to help Apache Spark developers debug big data analytics.
To emulate interactive step-wise debugging without reducing throughput, BigDebug provides simulated breakpoints that enable a user to inspect a program without actually pausing the entire distributed computation. It also supports on-demand watchpoints that enable a user to retrieve intermediate data using a guard predicate and transfer the selected data on demand. To understand the flow of individual records within a pipeline of RDD transformations, BigDebug provides a data provenance capability, which can help explain how errors propagate through data processing steps. To support efficient trial-and-error debugging, BigDebug enables users to change program logic in response to an error at runtime through a real-time code fix feature, and to selectively replay the execution from that step. Finally, BigDebug proposes an automated fault localization service that leverages all the above features together to isolate failure-inducing inputs, diagnose the root cause of an error, and resume the workflow for only the affected data and code.
The BigDebug system should contribute to improving Spark developer productivity and the correctness of Big Data applications. This big data debugging effort is led by UCLA Professors Miryung Kim and Tyson Condie, and has produced several research papers in top Software Engineering and Database conferences. The current version of BigDebug is publicly available at https://sites.google.com/site/sparkbigdebug/.
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te... | Databricks
The CERN experiments and their particle accelerator, the Large Hadron Collider (LHC), will soon have collected a total of one exabyte of data. Moreover, the next upgrade of the accelerator, the high-luminosity LHC, will dramatically increase the rate of particle collisions, thus boosting the potential for discoveries but also generating unprecedented data challenges.
In order to process and analyse all this data, CERN is investigating approaches complementary to the traditional ones, which mainly rely on Grid and batch jobs for data reconstruction, calibration and skimming, combined with a phase of local analysis of reduced data. The new techniques should allow for interactive analysis on much bigger datasets by transparently exploiting dynamically pluggable resources.
In that sense, Spark is being used at CERN to process large physics datasets in a distributed fashion. The most widely used tool for high-energy physics analysis, ROOT, implements a layer on top of Spark in order to distribute computations across a cluster of machines. This makes it possible for physics analysis written in either C++ or Python to be parallelised on Spark clusters, while reading the input data from CERN’s mass storage system: EOS. On the other hand, another important use case of Spark at CERN has recently emerged.
The LHC logging service, which collects data from the accelerator to get information on how to improve the performance of the machine, is currently migrating its architecture to leverage Spark for its analytics workflows. This talk will discuss the unique challenges of the aforementioned use cases and how SWAN, the CERN service for interactive web-based analysis, now supports them thanks to a new feature: the possibility for users to dynamically plug Spark clusters into their sessions in order to offload computations to those resources.
Should You Read Kafka as a Stream or in Batch? Should You Even Care? | Ido Na... | HostedbyConfluent
Should you consume Kafka as a stream or in batch? When should you choose each one? Which is more efficient and cost-effective?
In this talk we'll give you the tools and metrics to decide which solution to apply when, and show you a real-life example with cost and time comparisons.
To highlight the differences, we’ll dive into a project we’ve done, transitioning from reading Kafka in a stream to reading it in batch.
By turning conventional thinking on its head and reading our multi-petabyte Kafka stream in batch using Spark and Airflow, we’ve achieved a huge cost reduction of 65% while at the same time getting a more scalable and resilient solution.
We’ll explore the tradeoffs and give you the metrics and intuition you’ll need to make such decisions yourself.
We’ll cover:
Costs of processing in stream compared to batch
Scaling up for bursts and reprocessing
Making the tradeoff between wait times and costs
Recovering from outages
And much more…
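To make the stream-versus-batch distinction concrete, here is a minimal Scala sketch of reading a bounded slice of a Kafka topic as a batch with Spark, the pattern the talk describes; the broker, topic, offsets, and output path are placeholders, and an orchestrator such as Airflow would schedule the job:
```scala
import org.apache.spark.sql.SparkSession

object KafkaBatchReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("kafka-batch-sketch").getOrCreate()

    // spark.read (not readStream) performs a bounded batch scan of the topic.
    val batch = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .option("startingOffsets", "earliest")
      .option("endingOffsets", "latest")
      .load()

    // Decode the raw Kafka bytes and persist the slice for downstream jobs.
    val decoded = batch.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    decoded.write.mode("overwrite").parquet("/data/events_batch")

    spark.stop()
  }
}
```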
Scalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar | Databricks
This session showcases a new dimension of Apache Spark usage: see how Apache Spark and other open source projects can be used together to provide a scalable, real-time monitoring system. Apache Spark plays the central role in this scalable solution, since without Spark Streaming we would not be able to process millions of events in real time. This approach offers many lessons for the DevOps/infrastructure domain on how to build a scalable and automated logging and monitoring solution using Apache Spark, Apache Kafka, Grafana and other open-source technologies.
Sony PlayStation's monitoring pipeline processes about 40 billion events every day and generates metrics in near real time (within 30 seconds). All the components used along with Apache Spark are horizontally scalable via auto-scaling techniques, which enhances the reliability of this efficient and highly available monitoring solution. Sony Interactive Entertainment has been using Apache Spark, and specifically Spark Streaming, for the last three years. Hear some of the important lessons they have learned; for example, they still use Spark Streaming's receiver-based method in certain use cases instead of direct streaming, and they will cover the application of both methods, giving the knowledge back to the community.
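A minimal Scala sketch of the direct (receiver-less) Kafka integration that the talk contrasts with the receiver-based approach, using the spark-streaming-kafka-0-10 API; the broker address, group id, topic, and 30-second batch interval are illustrative assumptions:
```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object DirectStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("monitoring-sketch")
    val ssc = new StreamingContext(conf, Seconds(30))

    // Consumer settings; broker, group id and topic are placeholders.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "monitoring",
      "auto.offset.reset" -> "latest"
    )

    // Direct approach: each micro-batch reads its own Kafka offset range,
    // with no long-running receiver holding data in executor memory.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("metrics"), kafkaParams))

    // Count events per 30-second micro-batch, e.g. to emit a throughput metric.
    stream.map(_.value).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```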
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi... | Databricks
In healthcare, DICOM is an international standard format for storing medical images (MRI/CT representations). Each image has associated with it embedded metadata and pixel data. There is currently a tremendous amount of effort in healthcare to incorporate image analytics within clinical data analysis. Apache Spark is a natural framework to integrate these efforts.
This session presents an analytics workflow that uses Apache Spark to perform ETL on DICOM images and then Eigen decomposition to derive meaningful insights from the pixel data. The workflow integrates the Java-based framework DCM4CHE with Apache Spark to parallelize the big data workload for fast processing. Users can extract features based on the metadata and run efficient clean/filter/drill-down steps for preprocessing. See a demonstration of predictive analytics with visualization using the metadata to derive insights, such as the likelihood of a condition or the efficacy of an administered medication.
The speakers will also present performance benchmarks of this workflow on various datasets and cluster configurations to demonstrate the benefits of running this kind of analysis workflow on Apache Spark.
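A minimal sketch of what parallelizing DICOM metadata extraction over Spark can look like; the paths are placeholders and parseModality is a hypothetical stand-in for the DCM4CHE tag parsing the talk actually uses:
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object DicomEtlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("dicom-etl-sketch").getOrCreate()

    // Spark's binaryFile source loads each DICOM file as a (path, content) row.
    val files = spark.read.format("binaryFile").load("/data/dicom/*.dcm")

    // Hypothetical helper standing in for DCM4CHE parsing; a real
    // implementation would read tags such as modality from the header.
    val extractModality = udf { (bytes: Array[Byte]) => parseModality(bytes) }

    val tagged = files.select(
      files("path"),
      extractModality(files("content")).as("modality"))

    // Example drill-down: keep only MRI studies for downstream analysis.
    tagged.filter(tagged("modality") === "MR").write.parquet("/data/dicom_mr")
    spark.stop()
  }

  // Placeholder; a real implementation would delegate to DCM4CHE.
  def parseModality(bytes: Array[Byte]): String = "MR"
}
```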
In this talk Josep draws on his experience of building a data platform based on Cassandra and Spark to serve the UK's foremost player in the connected-homes market: bringing streams of data online, productionising data science algorithms on Spark, and delivering outputs via APIs or Kafka messages.
Josep will explore the ups and the downs of bringing all this together and share what he's learned from 12 months of Cassandra and Spark development and operations.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2l2Rr6L.
Doug Daniels discusses the cloud-based platform they have built at Datadog and how it differs from a traditional datacenter-based analytics stack. He walks through the decisions they have made at each layer, covers the pros and cons of these decisions, and discusses the tooling they have built. Filmed at qconsf.com.
Doug Daniels is a Director of Engineering at Datadog, where he works on high-scale data systems for monitoring, data science, and analytics. Prior to joining Datadog, he was CTO at Mortar Data and an architect and developer at Wireless Generation, where he designed data systems to serve more than 4 million students in 49 states.
Scale confidently. From laptop to lots of nodes to multi-cluster, multi-use case deployments, Elastic experts are sharing best practices to master and pitfalls to avoid when it comes to scaling Elasticsearch.
Advertising Fraud Detection at Scale at T-Mobile | Databricks
The development of big data products and solutions at scale brings many challenges to teams of platform architects, data scientists, and data engineers. While it is easy to find ourselves working in silos, successful organizations collaborate intensively across disciplines so that problems can be understood and a proposed model and solution can be scaled and optimized on multiple terabytes of data.
Performance evaluation of cloud-based log file analysis with Apache Hadoop an... | Kishor Datta Gupta
Log files are generated in many different formats by a plethora of devices and software. The proper analysis of these files can lead to useful information about various aspects of each system. Cloud computing appears to be suitable for this type of analysis, as it is capable of managing the high production rate, the large size and the diversity of log files. In this paper we investigated log file analysis with the cloud computational frameworks Apache™ Hadoop® and Apache Spark™. We developed realistic log file analysis applications in both frameworks and we performed SQL-type queries on real Apache Web Server log files. Various experiments were performed with different parameters in order to study and compare the performance of the two frameworks.
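In the spirit of the paper's experiments, a minimal Scala sketch of parsing Apache Web Server logs with Spark and running a SQL-type query over them; the log path and the particular query are illustrative assumptions:
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.regexp_extract

object AccessLogSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("access-log-sketch").getOrCreate()
    import spark.implicits._

    // Apache Common Log Format: host ident user [time] "method url proto" status bytes
    val p = """^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+)"""

    val logs = spark.read.text("/data/access_log") // path is a placeholder
      .select(
        regexp_extract($"value", p, 1).as("host"),
        regexp_extract($"value", p, 4).as("url"),
        regexp_extract($"value", p, 5).cast("int").as("status"))

    // SQL-type query like those benchmarked: top error-producing URLs.
    logs.createOrReplaceTempView("logs")
    spark.sql(
      """SELECT url, COUNT(*) AS errors
        |FROM logs WHERE status >= 400
        |GROUP BY url ORDER BY errors DESC LIMIT 10""".stripMargin).show()

    spark.stop()
  }
}
```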
Strata Singapore: Gearpump - Real-time DAG Processing with Akka at Scale | Sean Zhong
Gearpump is an Akka-based real-time streaming engine that uses Actors to model everything. It offers excellent performance and flexibility, achieving 18 million messages/second with a latency of 8 ms on a cluster of 4 machines.
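To illustrate the "model everything as an Actor" idea, here is a toy classic Akka sketch of a processing stage as an actor; this is plain Akka, not Gearpump's actual API, and the stage logic is invented for the example:
```scala
import akka.actor.{Actor, ActorSystem, Props}

// A processing stage is just an actor that transforms each message;
// a real DAG would forward results to the next stage's actor instead.
class UppercaseStage extends Actor {
  def receive: Receive = {
    case line: String => println(line.toUpperCase)
  }
}

object ActorStageSketch extends App {
  val system = ActorSystem("pipeline")
  val stage = system.actorOf(Props[UppercaseStage], "uppercase")

  // Messages flow through the stage asynchronously via the actor mailbox.
  Seq("hello", "gearpump").foreach(stage ! _)

  system.terminate()
}
```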
Fast data in times of crisis with GPU accelerated database QikkDB | Business ... | Matej Misik
Graphics cards (GPUs) open up new ways of processing and analytics over big data, delivering millisecond selections over billions of rows, as well as telling stories about data. #QikkDB
How do you present data so that everyone understands it? Data analysis is for scientists, but data storytelling is for everyone: managers, product owners, sales teams, the general public. #TellStory
Learn about high-performance computing with GPUs and how to present data, with a rich Covid-19 data story example, in the upcoming webinar.
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ... | Chester Chen
Machine Learning at the Limit
John Canny, UC Berkeley
How fast can machine learning and graph algorithms be? In "roofline" design, every kernel is driven toward the limits imposed by CPU, memory, network, etc. This can lead to dramatic improvements: BIDMach is a toolkit for machine learning that uses rooflined design and GPUs to achieve two to three orders of magnitude improvements over other toolkits on single machines. These speedups are larger than have been reported for *cluster* systems (e.g. Spark/MLLib, Powergraph) running on hundreds of nodes, and BIDMach with a GPU outperforms these systems for most common machine learning tasks. For algorithms (e.g. graph algorithms) which do require cluster computing, we have developed a rooflined network primitive called "Kylix". We can show that Kylix approaches the roofline limits for sparse Allreduce, and empirically holds the record for distributed Pagerank. Beyond rooflining, we believe there are great opportunities in deep algorithm/hardware codesign. Gibbs Sampling (GS) is a very general tool for inference, but is typically much slower than alternatives. SAME (State Augmentation for Marginal Estimation) is a variation of GS which was developed for marginal parameter estimation. We show that it has high parallelism and a fast GPU implementation. Using SAME, we developed a GS implementation of Latent Dirichlet Allocation whose running time is 100x faster than other samplers, and within 3x of the fastest symbolic methods. We are extending this approach to general graphical models, an area where there is currently a void of (practically) fast tools. It seems at least plausible that a general-purpose solution based on these techniques can closely approach the performance of custom algorithms.
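For reference, the roofline model mentioned above bounds attainable kernel performance by the lesser of peak compute and memory traffic; this is the standard formulation, not anything specific to BIDMach:
```latex
% Roofline bound on attainable performance P (flops/s):
%   P_peak : peak compute throughput (flops/s)
%   \beta  : memory bandwidth (bytes/s)
%   I      : arithmetic intensity of the kernel (flops/byte)
P \;=\; \min\bigl(P_{\text{peak}},\; \beta \cdot I\bigr)
```
A kernel with low arithmetic intensity is bandwidth-bound (the sloped part of the roof), so rooflined design means restructuring it until it approaches whichever limit applies.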
Bio
John Canny is a professor in computer science at UC Berkeley. He is an ACM dissertation award winner and a Packard Fellow. He is currently a Data Science Senior Fellow in Berkeley's new Institute for Data Science and holds an INRIA (France) International Chair. Since 2002, he has been developing and deploying large-scale behavioral modeling systems. He designed and prototyped production systems for Overstock.com, Yahoo, Ebay, Quantcast and Microsoft. He currently works on several applications of data mining for human learning (MOOCs and early language learning), health and well-being, and applications in the sciences.
Keynote talk at the International Conference on Supercomputing 2009, at IBM Yorktown in New York. This is a major update of a talk first given in New Zealand last January. The abstract follows.
The past decade has seen increasingly ambitious and successful methods for outsourcing computing. Approaches such as utility computing, on-demand computing, grid computing, software as a service, and cloud computing all seek to free computer applications from the limiting confines of a single computer. Software that thus runs "outside the box" can be more powerful (think Google, TeraGrid), dynamic (think Animoto, caBIG), and collaborative (think Facebook, myExperiment). It can also be cheaper, due to economies of scale in hardware and software. The combination of new functionality and new economics inspires new applications, reduces barriers to entry for application providers, and in general disrupts the computing ecosystem. I discuss the new applications that outside-the-box computing enables, in both business and science, and the hardware and software architectures that make these new applications possible.
This presentation describes an intelligent IT monitoring solution that uses Nagios as the source of information, Esper as the CEP engine, and a PCA algorithm.
(BDT318) How Netflix Handles Up To 8 Million Events Per Second | Amazon Web Services
In this session, Netflix provides an overview of Keystone, their new data pipeline. The session covers how Netflix migrated from Suro to Keystone, including the reasons behind the transition and the challenges of achieving zero loss while processing over 400 billion events daily. It also covers in detail how they deploy, operate, and scale Kafka, Samza, Docker, and Apache Mesos in AWS to manage 8 million events and 17 GB per second during peak.
Energy efficient AI workload partitioning on multi-core systems | Deepak Shankar
To create an AI system, the semiconductor, software, and systems teams need to work together. Multi-core systems can provide extremely low latency and higher throughput at lower power consumption, but concurrent access to shared resources by multiple AI workloads running on different cores can increase worst-case execution time (WCET) and cause system failures. Architecture exploration can be used to efficiently balance compute, communication, synchronization, and storage. In this webinar, we will use workloads from automotive and data centers to demonstrate the methodology.
VisualSim Architect enables designers to assemble architecture models that extend from the smallest IoT device to full automotive and radar systems to data centers. These models can include any combination of software, processors, ECUs, RTOS and networks. Using this platform, software designers can explore the partitioning of AI tasks (software or model) onto cores based on latency, bandwidth, and power constraints. Within an IoT device, the processor, A/D, Bluetooth and software can be modeled, while an automotive design will require the network, ECU and firmware. Both have a unique mechanism to define the traffic, test scenarios and AI workloads. Hardware engineers can select cores, cores per cluster, cache hierarchy, memory controller, accelerators, and the interface topology. Software engineers can tune the partitioning, synchronization overhead, memory access schedules and scheduling.
Cloud Experience: Data-driven Applications Made Simple and Fast | Databricks
Implementing a complex real-time data workflow is very challenging. This session will describe the architecture of a data platform that provides a single, secure, high-performance system that can be deployed in hybrid cloud architectures. We will present how to support simultaneous, consistent and high-performance access through multiple industry open source and cloud-compatible standards of streaming, table, TSDB, object, and file APIs. A new serverless technology is also used in the architecture to support dynamic and flexible implementations. The presenter will also outline how the platform was integrated with the Spark ecosystem, including AI and ML tools, to simplify the development process.
YOW2018 Cloud Performance Root Cause Analysis at Netflix | Brendan Gregg
Keynote by Brendan Gregg for YOW! 2018. Video: https://www.youtube.com/watch?v=03EC8uA30Pw . Description: "At Netflix, improving the performance of our cloud means happier customers and lower costs, and involves root cause analysis of applications, runtimes, operating systems, and hypervisors, in an environment of 150k cloud instances that undergo numerous production changes each week. Apart from the developers who regularly optimize their own code, we also have a dedicated performance team to help with any issue across the cloud, and to build tooling to aid in this analysis. In this session we will summarize the Netflix environment, procedures, and tools we use and build to do root cause analysis on cloud performance issues. The analysis performed may be cloud-wide, using self-service GUIs such as our open source Atlas tool, or focused on individual instances, and use our open source Vector tool, flame graphs, Java debuggers, and tooling that uses Linux perf, ftrace, and bcc/eBPF. You can use these open source tools in the same way to find performance wins in your own environment."
Learn how Amazon Redshift, our fully managed, petabyte-scale data warehouse, can help you quickly and cost-effectively analyze all of your data using your existing business intelligence tools. Get an introduction to how Amazon Redshift uses massively parallel processing, scale-out architecture, and columnar direct-attached storage to minimize I/O time and maximize performance. Learn how you can gain deeper business insights and save money and time by migrating to Amazon Redshift. Take away strategies for migrating from on-premises data warehousing solutions, tuning schema and queries, and utilizing third party solutions.
Webinar: Cutting Time, Complexity and Cost from Data Science to Production | iguazio
Imagine a system where one collects real-time data, develops a machine learning model… Runs analysis and training on powerful GPUs… Clicks on a magic button and then deploys code and ML models to production… All without any heavy lifting from data and DevOps engineers. Today, data scientists work on laptops with just a subset of data and time is wasted while waiting for data and compute.
It’s about efficient use of time! Join Iguazio and NVIDIA so that you can get home early today! Learn how to speed up data science from development to production:
- Access to large scale, real-time and operational data without waiting for ETL
- Run high performance analytics and ML on NVIDIA GPUs (Rapids)
- Work on a shared, pre-integrated Kubernetes cluster with Jupyter notebooks and leading data science tools
- One-click (really!) deployment to production
Speakers: Yaron Haviv, CTO at Iguazio, Or Zilberman, Data Scientist at Iguazio and Jacci Cenci, Sr. Technical Marketing Engineer at NVIDIA
A deterministic, high-performance parallel data processing approach to increase guidance, navigation, and control robustness.
Compare-and-swap (CAS) is an instruction used in multithreading to achieve synchronisation. It compares the contents of a memory location with a given value and, only if they are the same, modifies the contents of that memory location to a new given value. This is done as a single atomic operation.
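A minimal sketch of the CAS retry loop described above, using the JVM's AtomicInteger, whose compareAndSet maps to the hardware compare-and-swap; the counter and thread counts are arbitrary example values:
```scala
import java.util.concurrent.atomic.AtomicInteger

object CasSketch extends App {
  val counter = new AtomicInteger(0)

  // Lock-free increment: retry until no other thread modified the value
  // between our read and our swap.
  def increment(): Unit = {
    var done = false
    while (!done) {
      val current = counter.get()                         // read current value
      done = counter.compareAndSet(current, current + 1)  // swap only if unchanged
    }
  }

  // Four threads each perform 1000 increments concurrently.
  val threads = (1 to 4).map(_ =>
    new Thread(() => (1 to 1000).foreach(_ => increment())))
  threads.foreach(_.start())
  threads.foreach(_.join())

  // Always prints 4000: CAS makes each increment atomic.
  // (AtomicInteger.incrementAndGet implements this same loop internally.)
  println(counter.get())
}
```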
Google APAC Machine Learning Day was a two-day machine learning event held by Google in early March this year at the Google office in Singapore. This meetup invites Evan Lin, who attended the event, and his colleague Benjamin Chen to share their takeaways, including:
Tensorflow Summit RECAP
Observations from Machine Learning Expert Day
How Linker Networks uses Tensorflow
https://gdg-taipei.kktix.cc/events/google-apac-machine-learning-day
1. Wireless Communication System_Wireless communication is a broad term that i... | JeyaPerumal1
Wireless communication involves the transmission of information over a distance without the help of wires, cables or any other forms of electrical conductors.
Wireless communication is a broad term that incorporates all procedures and forms of connecting and communicating between two or more devices using a wireless signal through wireless communication technologies and devices.
Features of Wireless Communication
The evolution of wireless technology has brought many advancements with its effective features.
The transmitted distance can be anywhere between a few meters (for example, a television's remote control) and thousands of kilometers (for example, radio communication).
Wireless communication can be used for cellular telephony, wireless access to the internet, wireless home networking, and so on.
ER (Entity Relationship) Diagram for online shopping - TAE | Himani415946
https://bit.ly/3KACoyV
The ER diagram for the project is the foundation for building the project's database. The properties, datatypes, and attributes are defined by the ER diagram.
Multi-cluster Kubernetes Networking - Patterns, Projects and Guidelines | Sanjeev Rampal
Talk presented at Kubernetes Community Day, New York, May 2024.
Technical summary of Multi-Cluster Kubernetes Networking architectures with a focus on 4 key topics:
1) Key patterns for multi-cluster architectures
2) Architectural comparison of several OSS/CNCF projects that address these patterns
3) Evolution trends for the APIs of these projects
4) Some design recommendations and guidelines for adopting and deploying these solutions
2. About me
Cloud Architect @ Linker Networks
Golang User Group Co-Organizer
Top 5 Taiwan Golang open source contributor (GitHub award)
Developer, Curator, Blogger
13. Before machine cluster
DB Master: IP 192.168.1.222
DB Slave: IP 192.168.1.223
Web Server 1: IP 192.168.1.101
Web Server 2: IP 192.168.1.102
Web Server 3: IP 192.168.1.103
Load Balancer: IP 1.2.3.4
27. Maximize Utilization
Analyze utilization and reduce working machines to save our customers' budget:
- Predict utilization trend
- Provide auto-scaling threshold adjustment
35. MIC System Architecture (diagram)
Data collection: probes, sensors and smart gateways feeding Go/Akka connectors and Kafka (queueing).
Data processing: Spark (ETL/streaming) with Cassandra (storage), running on DCOS/Kubernetes.
Data analysis & machine learning: Spark ML, Tensorflow, Scikit-Learn and R on HPC (GPU) servers with SDN-attached storage; ML job scheduling via Chronos.
Visualization: D3.js, interactive dashboards, Jupyter Notebook and Zeppelin.