The document discusses NASA's use of Apache Spark for big data analytics. It provides context on Chris Mattmann's involvement with Spark through his roles at NASA JPL and the Apache Software Foundation. It outlines some of NASA's big data challenges around handling large volumes of Earth observation data from instruments and simulations. NASA is interested in using Spark for tasks like data triage, archiving, and knowledge extraction to help address these challenges and enable new scientific insights.
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at... (Yael Garten)
2017 StrataHadoop SJC conference talk. https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/56047
Description:
So, you finally have a data ecosystem with Kafka and Hadoop both deployed and operating correctly at scale. Congratulations. Are you done? Far from it.
As the birthplace of Kafka and an early adopter of Hadoop, LinkedIn has 13 years of combined experience using Kafka and Hadoop at scale to run a data-driven company. Both Kafka and Hadoop are flexible, scalable infrastructure pieces, but using these technologies without a clear idea of what the higher-level data ecosystem should be is perilous. Shirshanka Das and Yael Garten share best practices around data models and formats, choosing the right granularity for Kafka topics and Hadoop tables, and moving data efficiently and correctly between Kafka and Hadoop, and they explore Dali, a data abstraction layer that can help you process data seamlessly across Kafka and Hadoop.
Beyond pure technology, Shirshanka and Yael outline the three components of a great data culture and ecosystem, explain how to create maintainable data contracts between data producers and data consumers (such as data scientists and data analysts), and show how to standardize data effectively in a growing organization to enable (rather than slow down) innovation and agility. They then look to the future, envisioning a world where you can successfully deploy a data abstraction of views on Hadoop data: a data API that acts as a protective and enabling shield. Along the way, they discuss how to enable teams to be good data citizens in producing, consuming, and owning datasets, and offer an overview of LinkedIn's governance model: the tools, processes, and teams that ensure its data ecosystem can handle change and sustain #DataScienceHappiness.
Unified Batch & Stream Processing with Apache Samza (DataWorks Summit)
The traditional lambda architecture has been a popular solution for joining offline batch operations with real-time operations. This setup incurs significant developer and operational overhead, since it involves maintaining code that must produce the same result in two potentially different distributed systems. To alleviate these problems, we need a unified framework for processing and building data pipelines across batch and stream data sources.
Based on our experience running and developing Apache Samza at LinkedIn, we have enhanced the framework to support: a) pluggable data sources and sinks; b) a deployment model supporting different execution environments such as YARN or VMs; c) a unified processing API for developers to work seamlessly with batch and stream data. In this talk, we will cover how these design choices in Apache Samza help tackle the overhead of the lambda architecture. We will use real production use cases to illustrate how LinkedIn leverages Apache Samza to build unified data processing pipelines.
Speaker
Navina Ramesh, Sr. Software Engineer, LinkedIn
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ... (Edureka!)
This Edureka Spark Tutorial will help you understand the basics of Apache Spark. It is suitable both for beginners and for professionals who want to learn or brush up on Apache Spark concepts. Below are the topics covered in this tutorial:
1) Big Data Introduction
2) Batch vs Real Time Analytics
3) Why Apache Spark?
4) What is Apache Spark?
5) Using Spark with Hadoop
6) Apache Spark Features
7) Apache Spark Ecosystem
8) Demo: Earthquake Detection Using Apache Spark
Making Apache Spark Better with Delta Lake (Databricks)
Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies the streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
In this talk, we will cover:
* What data quality problems Delta helps address
* How to convert your existing application to Delta Lake (see the sketch after this list)
* How the Delta Lake transaction protocol works internally
* The Delta Lake roadmap for the next few releases
* How to get involved!
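To make the conversion step concrete, here is a minimal sketch, assuming the open-source delta-spark package and hypothetical paths; the exact session setup varies by Spark/Delta version:

    from delta import configure_spark_with_delta_pip
    from pyspark.sql import SparkSession

    # Build a Delta-enabled session (delta-spark provides this helper).
    builder = (SparkSession.builder.appName("to-delta")
               .config("spark.sql.extensions",
                       "io.delta.sql.DeltaSparkSessionExtension")
               .config("spark.sql.catalog.spark_catalog",
                       "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    # Convert an existing Parquet directory in place (path is hypothetical)...
    spark.sql("CONVERT TO DELTA parquet.`/data/events`")

    # ...after which reads and writes go through the Delta format and gain
    # ACID guarantees on top of the same files.
    spark.read.format("delta").load("/data/events").show(5)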
Apache Spark is an in-memory data processing solution that can work with existing data sources like HDFS and can make use of your existing computation infrastructure, such as YARN or Mesos. This talk covers a basic introduction to Apache Spark and its various components, such as MLlib, Shark, and GraphX, with a few examples.
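For flavor, here is the kind of minimal example such an introduction typically walks through: the classic word count on the core RDD API (input path is hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount").getOrCreate()

    # Split lines into words, pair each word with 1, and sum the counts.
    lines = spark.sparkContext.textFile("/data/sample.txt")
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    print(counts.take(10))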
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn (Confluent)
(Celia Kung, LinkedIn) Kafka Summit SF 2018
For several years, LinkedIn has been using Kafka MirrorMaker as the mirroring solution for copying data between Kafka clusters across data centers. However, as LinkedIn data continued to grow, mirroring trillions of Kafka messages per day across data centers uncovered the scale limitations and operability challenges of Kafka MirrorMaker. To address such issues, we have developed a new mirroring solution, built on top of our stream ingestion service, Brooklin. Brooklin MirrorMaker aims to provide improved performance and stability, while facilitating better management through finer control of data pipelines. Through flushless Kafka produce, dynamic management of data pipelines, per-partition error handling, and flow control, we are able to increase throughput, better withstand consume and produce failures, and reduce overall operating costs. As a result, we have eliminated the major pain points of Kafka MirrorMaker. In this talk, we will dive deeper into the challenges LinkedIn has faced with Kafka MirrorMaker, how we tackled them with Brooklin MirrorMaker, and our plans for iterating further on this new mirroring solution.
"The common use cases of Spark SQL include ad hoc analysis, logical warehouse, query federation, and ETL processing. Spark SQL also powers the other Spark libraries, including structured streaming for stream processing, MLlib for machine learning, and GraphFrame for graph-parallel computation. For boosting the speed of your Spark applications, you can perform the optimization efforts on the queries prior employing to the production systems. Spark query plans and Spark UIs provide you insight on the performance of your queries. This talk discloses how to read and tune the query plans for enhanced performance. It will also cover the major related features in the recent and upcoming releases of Apache Spark.
"
Architect’s Open-Source Guide for a Data Mesh Architecture (Databricks)
Data Mesh is an innovative concept addressing many data challenges from an architectural, cultural, and organizational perspective. But is the world ready to implement Data Mesh?
In this session, we will review the importance of core Data Mesh principles, what they can offer, and when it is a good idea to try a Data Mesh architecture. We will discuss common challenges with implementation of Data Mesh systems and focus on the role of open-source projects for it. Projects like Apache Spark can play a key part in standardized infrastructure platform implementation of Data Mesh. We will examine the landscape of useful data engineering open-source projects to utilize in several areas of a Data Mesh system in practice, along with an architectural example. We will touch on what work (culture, tools, mindset) needs to be done to ensure Data Mesh is more accessible for engineers in the industry.
The audience will leave with a good understanding of the benefits of Data Mesh architecture, common challenges, and the role of Apache Spark and other open-source projects for its implementation in real systems.
This session is targeted for architects, decision-makers, data-engineers, and system designers.
Change Data Feed is a new feature of Delta Lake on Databricks, available in public preview since DBR 8.2. It enables a new class of ETL workloads, such as incremental table/view maintenance and change auditing, that were not possible before. In short, users can now query row-level changes across different versions of a Delta table.
In this talk we will dive into how Change Data Feed works under the hood, how to use it with existing ETL jobs to make them more efficient, and some new workloads it can enable.
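For illustration, a minimal sketch of querying row-level changes, assuming a Delta table named `events` (hypothetical) and a Delta-enabled SparkSession `spark`:

    # Enable the change feed on an existing table (one-time table property).
    spark.sql("ALTER TABLE events "
              "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

    # Read changes committed since version 1; each row carries _change_type,
    # _commit_version, and _commit_timestamp metadata columns.
    changes = (spark.read.format("delta")
               .option("readChangeFeed", "true")
               .option("startingVersion", 1)
               .table("events"))
    changes.show()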
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang (Databricks)
As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements.
1) Generality: support reading/writing most data management/storage systems.
2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities.
Data Source API V2 is one of the most important features coming with Spark 2.3. This talk will dive into the design and implementation of Data Source API V2, comparing it with Data Source API V1. We will also demonstrate how to implement a file-based data source using the Data Source API V2 to show its generality and flexibility.
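From the user's side, the abstraction shows up as one uniform reader API regardless of the backing system; a sketch with hypothetical paths and a hypothetical custom source class:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # A built-in file source...
    events = spark.read.format("parquet").load("/data/events")

    # ...and a custom V2 source are addressed the same way: by short name or
    # by the implementation class (this one is made up for illustration).
    metrics = (spark.read.format("com.example.MyDataSourceV2")
               .option("endpoint", "https://metrics.example.com")
               .load())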
"Structured Streaming was a new streaming API introduced to Spark over 2 years ago in Spark 2.0, and was announced GA as of Spark 2.2. Databricks customers have processed over a hundred trillion rows in production using Structured Streaming. We received dozens of questions on how to best develop, monitor, test, deploy and upgrade these jobs. In this talk, we aim to share best practices around what has worked and what hasn't across our customer base.
We will tackle questions around how to plan ahead, what kind of code changes are safe for structured streaming jobs, how to architect streaming pipelines which can give you the most flexibility without sacrificing performance by using tools like Databricks Delta, how to best monitor your streaming jobs and alert if your streams are falling behind or are actually failing, as well as how to best test your code."
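One concrete monitoring technique in this space is polling a query's progress; a minimal sketch using the built-in rate source, with a made-up alerting rule:

    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    counts = (spark.readStream.format("rate")
              .option("rowsPerSecond", 100).load()
              .groupBy().count())

    query = (counts.writeStream.outputMode("complete")
             .format("memory").queryName("counts").start())

    for _ in range(3):
        time.sleep(10)
        p = query.lastProgress  # snapshot of the last completed micro-batch
        if p and p["inputRowsPerSecond"] > p["processedRowsPerSecond"]:
            print("stream falling behind at batch", p["batchId"])

    query.stop()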
Slides for the Data Syndrome one-hour course on PySpark. Introduces basic operations, Spark SQL, Spark MLlib, and exploratory data analysis with PySpark. Shows how to use pylab with Spark to create histograms.
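In the same spirit, a minimal sketch of exploratory analysis ending in a histogram (the dataset path and column names are hypothetical):

    import matplotlib.pyplot as plt
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.json("/data/flights.json")

    df.select("origin", "dep_delay").show(5)

    # Collect only the plotting column; keep this small on real data.
    delays = [r.dep_delay for r in df.select("dep_delay").dropna().collect()]
    plt.hist(delays, bins=50)
    plt.xlabel("departure delay (min)")
    plt.show()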
Optimizing Spark jobs through a true understanding of Spark core. Learn: What is a partition? What is the difference between read, shuffle, and write partitions? How do you increase parallelism and decrease the number of output files? Where does shuffle data go between stages? What is the "right" size for your Spark partitions and files? Why does a job slow down with only a few tasks left and never finish? Why doesn't adding nodes decrease my compute time?
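A sketch of the main partitioning levers those questions point at (all values and paths are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Shuffle partitions set the parallelism between stages (default 200).
    spark.conf.set("spark.sql.shuffle.partitions", "400")

    df = spark.read.parquet("/data/events")    # read partitions: file splits
    print(df.rdd.getNumPartitions())

    wide = df.repartition(400, "customer_id")  # full shuffle, more parallelism
    narrow = wide.coalesce(32)                 # fewer output files, no shuffle
    narrow.write.mode("overwrite").parquet("/data/events_compacted")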
Tech talk on what Azure Databricks is, why you should learn it, and how to get started. We'll use PySpark and talk about some real-life examples from the trenches, including the pitfalls of leaving your clusters running accidentally and receiving a huge bill ;)
After this you will hopefully switch to Spark-as-a-service and get rid of your HDInsight/Hadoop clusters.
This is part 1 of an 8 part Data Science for Dummies series:
Databricks for dummies
Titanic survival prediction with Databricks + Python + Spark ML
Titanic with Azure Machine Learning Studio
Titanic with Databricks + Azure Machine Learning Service
Titanic with Databricks + MLS + AutoML
Titanic with Databricks + MLFlow
Titanic with DataRobot
Deployment, DevOps/MLops and Operationalization
Apache Spark presentation at HasGeek Fifth Elephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
A Thorough Comparison of Delta Lake, Iceberg and Hudi (Databricks)
Recently, a set of modern table formats such as Delta Lake, Hudi, and Iceberg has sprung up. Along with the Hive Metastore, these table formats try to solve problems that have long stood in traditional data lakes, with declared features like ACID transactions, schema evolution, upsert, time travel, and incremental consumption.
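As one example of those features, a time travel read in Delta Lake looks like the sketch below (path hypothetical; Iceberg and Hudi expose equivalents of their own):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the table as of an earlier version...
    v0 = (spark.read.format("delta")
          .option("versionAsOf", 0).load("/data/events"))

    # ...or as of a timestamp.
    jan1 = (spark.read.format("delta")
            .option("timestampAsOf", "2023-01-01").load("/data/events"))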
Hive Bucketing in Apache Spark with Tejas Patil (Databricks)
Bucketing is a partitioning technique that can improve performance in certain data transformations by avoiding data shuffling and sorting. The general idea of bucketing is to partition, and optionally sort, the data based on a subset of columns while it is written out (a one-time cost), while making successive reads of the data more performant for downstream jobs if the SQL operators can make use of this property. Bucketing can enable faster joins (i.e. single stage sort merge join), the ability to short circuit in FILTER operation if the file is pre-sorted over the column in a filter predicate, and it supports quick data sampling.
In this session, you’ll learn how bucketing is implemented in both Hive and Spark. In particular, Patil will describe the changes in the Catalyst optimizer that enable these optimizations in Spark for various bucketing scenarios. Facebook’s performance tests have shown bucketing to improve Spark performance by 3-5x when the optimization is enabled. Many tables at Facebook are sorted and bucketed, and migrating these workloads to Spark has resulted in 2-3x savings compared to Hive. You’ll also hear about real-world applications of bucketing, like loading cumulative tables with daily deltas, and the characteristics that can help identify suitable candidate jobs that can benefit from bucketing.
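For reference, a sketch of what writing and joining bucketed tables looks like in Spark (table and column names are hypothetical; bucketBy requires saveAsTable):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    orders = spark.read.parquet("/data/orders")
    (orders.write
        .bucketBy(64, "customer_id")
        .sortBy("customer_id")
        .mode("overwrite")
        .saveAsTable("orders_bucketed"))

    # When both sides are bucketed (and sorted) on the join key with
    # compatible bucket counts, the sort-merge join can skip the shuffle.
    joined = (spark.table("orders_bucketed")
              .join(spark.table("customers_bucketed"), "customer_id"))
    joined.explain()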
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud (Noritaka Sekiyama)
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud (Hadoop / Spark Conference Japan 2019)
# English version #
http://hadoop.apache.jp/hcj2019-program/
This is the presentation I gave at JavaDay Kiev 2015 on the architecture of Apache Spark. It covers the memory model, the shuffle implementations, data frames, and some other high-level stuff, and can be used as an introduction to Apache Spark.
OSA Con 2022: Arrow in Flight: New Developments in Data Connectivity (Altinity Ltd)
David Li - Voltron Data
Apache Arrow has allowed developers to increase the speed and efficiency of analytics by making it easier for libraries and databases to share data without serialization. This talk provides an overview of the Arrow project, highlighting use cases where Arrow has accelerated analytics workflows by as much as 100x, and previews where Arrow is going, with special attention to database connectivity.
Organizations need to perform increasingly complex analysis on data — streaming analytics, ad-hoc querying, and predictive analytics — in order to get better customer insights and actionable business intelligence. Apache Spark has recently emerged as the framework of choice to address many of these challenges. In this session, we show you how to use Apache Spark on AWS to implement and scale common big data use cases such as real-time data processing, interactive data science, predictive analytics, and more. We will talk about common architectures, best practices to quickly create Spark clusters using Amazon EMR, and ways to integrate Spark with other big data services in AWS.
Learning Objectives:
• Why Spark is great for ad-hoc interactive analysis and real-time stream processing.
• How to deploy and tune scalable clusters running Spark on Amazon EMR.
• How to use the EMR File System (EMRFS) with Spark to query data directly in Amazon S3 (see the sketch after this list).
• Common architectures to leverage Spark with Amazon DynamoDB, Amazon Redshift, Amazon Kinesis, and more.
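A minimal sketch of the EMRFS point above, with a hypothetical bucket; on EMR the standard s3:// URIs resolve through EMRFS:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("emr-s3").getOrCreate()

    # Query Parquet data in place in S3, without copying it into HDFS first.
    df = spark.read.parquet("s3://my-bucket/events/year=2023/")
    df.createOrReplaceTempView("events")
    spark.sql("""
        SELECT event_type, count(*) AS n
        FROM events GROUP BY event_type
    """).show()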
Building a SIMD Supported Vectorized Native Engine for Spark SQL (Databricks)
Spark SQL works very well with structured row-based data. Vectorized readers and writers for Parquet/ORC can make I/O much faster. Spark SQL also uses WholeStageCodeGen to improve performance through Java JIT-compiled code. However, the Java JIT compiler is usually not very good at utilizing the latest SIMD instructions under complicated queries. Apache Arrow provides a columnar in-memory layout and SIMD-optimized kernels, as well as Gandiva, an LLVM-based SQL engine. These native libraries can accelerate Spark SQL by reducing CPU usage for both I/O and execution.
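Arrow's columnar layout already reaches everyday Spark users through pandas UDFs, where data crosses the JVM/Python boundary as Arrow batches and is processed with vectorized NumPy/pandas kernels; a minimal sketch (this illustrates the Arrow exchange, not Gandiva itself):

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.getOrCreate()
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

    @pandas_udf("double")
    def to_fahrenheit(c: pd.Series) -> pd.Series:
        # Vectorized over a whole Arrow batch at a time.
        return c * 9.0 / 5.0 + 32.0

    df = spark.range(1_000_000).selectExpr("cast(id AS double) AS celsius")
    df.select(to_fahrenheit("celsius")).show(3)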
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai (Databricks)
Catalyst is becoming one of the most important components of Apache Spark, as it underpins all the major new APIs in Spark 2.0 and later versions, from DataFrames and Datasets to Streaming. At its core, Catalyst is a general library for manipulating trees.
In this talk, Yin explores a modular compiler frontend for Spark based on this library that includes a query analyzer, optimizer, and an execution planner. Yin offers a deep dive into Spark SQL’s Catalyst optimizer, introducing the core concepts of Catalyst and demonstrating how developers can extend it. You’ll leave with a deeper understanding of how Spark analyzes, optimizes, and plans a user’s query.
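A quick way to watch Catalyst at work is extended explain output, which prints the parsed, analyzed, and optimized logical plans alongside the physical plan; in the sketch below, the optimized plan shows the two filters collapsed into a single predicate by an optimizer rule:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(100).filter("id > 10").filter("id > 20")

    # Compare the analyzed plan (two Filter nodes) with the optimized one.
    df.explain(extended=True)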
Data Quality With or Without Apache Spark and Its Ecosystem (Databricks)
A few solutions exist in the open-source community, either in the form of libraries or complete stand-alone platforms, that can be used to assure a certain level of data quality, especially when continuous imports happen. Organisations may consider picking one of the available options: Apache Griffin, Deequ, DDQ, and Great Expectations. In this presentation we’ll compare these open-source products across different dimensions, like maturity, documentation, extensibility, and features like data profiling and anomaly detection.
In an environment where cloud-scaling applications are becoming more and more important, client-server architecture paradigms, as shown by memcached, are back with a vengeance. In this talk, Galder will discuss Hot Rod, Infinispan's new client/server binary protocol, explaining the key differences compared to memcached's binary protocol, such as the possibility of receiving cluster topology changes. The audience will learn about the importance of Hot Rod in 'cloud-scale' application server clustering, where stateless application server instances can use Infinispan Hot Rod clients to retrieve state from an elastic farm of Infinispan Hot Rod servers, improving the ability to run application server instances as a PaaS. The talk will finish with a brief demo of a cluster of Infinispan Hot Rod servers running on EC2 being accessed from a non-Java client. The audience is expected to have an intermediate understanding of client-server software architectures and cloud deployments.
Advanced Data Retrieval and Analytics with Apache Spark and OpenStack Swift (Daniel Krook)
Lightning talk from the OpenStack NYC meetup on October 8, 2014.
http://bit.ly/ibm-os-meetup
By Gil Vernik
The integration between Apache Spark and Swift, and the use of Storlets for smart retrieval via filtering and privacy support.
The content of this talk is a statement from the IBM Research division, not IBM product divisions, and is not a statement from IBM regarding its plans, directions or product intents. Any activities described by this talk are subject to change.
ELC-E 2010: The Right Approach to Minimal Boot Times (andrewmurraympc)
This was presented at ELC-E 2010 in Cambridge and describes an approach to cold boot time reduction. It also demonstrates the approach through a case study with an MS7724 reference board.
Opening Keynote Lecture
15th Annual ON*VECTOR International Photonics Workshop
Calit2’s Qualcomm Institute
University of California, San Diego
February 29, 2016
Scalable Data Mining and Archiving in the Era of the Square Kilometre Array (Chris Mattmann)
Keynote presentation at the HPC User Forum 2012 in Dearborn, MI, September 19, 2012. http://www.hpcuserforum.com/registration/dearborn2012/dearbornagenda.pdf
How to Create the Google for Earth Data (XLDB 2015, Stanford) (Rainer Sternfeld)
This talk was built around the example of the NOAA Big Data Project, in which Planet OS is a partner with Amazon Web Services. The aim of the NOAA Big Data Project is to bring NOAA's atmospheric and oceanic data to the cloud and make it discoverable and machine-readable.
This presentation outlines some of the organizational and technical challenges that exist in the project, as well as potential solutions and ideas to approach this set of challenges.
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming (Paco Nathan)
London Spark Meetup 2014-11-11 @Skimlinks
http://www.meetup.com/Spark-London/events/217362972/
To paraphrase the immortal crooner Don Ho: "Tiny Batches, in the wine, make me happy, make me feel fine." http://youtu.be/mlCiDEXuxxA
Apache Spark provides support for streaming use cases, such as real-time analytics on log files, by leveraging a model called discretized streams (D-Streams). These "micro-batch" computations operate on small time intervals, generally from 500 milliseconds up. One major innovation of Spark Streaming is that it leverages a unified engine: the same business logic can be used across multiple use cases, not only streaming but also interactive, iterative, machine learning, etc.
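A minimal D-Stream sketch of that model, using 1-second micro-batches over a local socket source (run `nc -lk 9999` to feed it text; host and port are illustrative):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="dstream-wordcount")
    ssc = StreamingContext(sc, 1)   # 1-second micro-batch interval

    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda l: l.split())
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()                 # print each batch's counts

    ssc.start()
    ssc.awaitTermination()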
This talk will compare case studies for production deployments of Spark Streaming, emerging design patterns for integration with popular complementary OSS frameworks, plus some of the more advanced features such as approximation algorithms, and take a look at what's ahead — including the new Python support for Spark Streaming that will be in the upcoming 1.2 release.
Also, let's chat a bit about the new Databricks + O'Reilly developer certification for Apache Spark…
How to expand the Galaxy from genes to Earth in six simple steps (and live sm... (Raffaele Montella)
FACE-IT is an effort to develop a new IT infrastructure to accelerate existing disciplinary research and enable information transfer among traditionally separate fields. At present, finding data and processing it into usable form can dominate research efforts. By providing ready access not only to data but also to the software tools used to process it for specific uses (e.g., climate impact and economic model inputs), FACE-IT allows researchers to concentrate their efforts on analysis. Lowering barriers to data access allows researchers to stretch in new directions and to learn and respond to the needs of other fields. FACE-IT builds on the Globus Galaxies platform, which has been developed over the past several years at the University of Chicago. FACE-IT also benefits from substantial software development undertaken by the communities who have developed most of the domain-specific tools required to populate FACE-IT with useful capabilities. The FACE-IT Galaxy manages earth system datatypes (such as NetCDF), new tool parameters (dates, map, opendap), aggregated datatypes (RAFT), service providers, and cool map visualizers.
Scaling People, Not Just Systems, to Take On Big Data Challenges (Matthew Vaughn)
Here, I describe how the Texas Advanced Computing Center has shifted its focus from traditional modeling and simulation towards fully embracing big data analytics performed by users with diverse technical backgrounds.
5th Multicore World
15-17 February 2016 – Shed 6, Wellington, New Zealand
http://openparallel.com/multicore-world-2016/
We start by dividing applications into data plus model components and classifying each component (whether from Big Data or Big Simulations) in the same way. This leads to 64 properties divided into 4 views: Problem Architecture (macro patterns); Execution Features (micro patterns); Data Source and Style; and finally the Processing (runtime) view.
We discuss convergence software built around HPC-ABDS (High Performance Computing enhanced Apache Big Data Stack) http://hpc-abds.org/kaleidoscope/ and show how one can merge Big Data and HPC (Big Simulation) concepts into a single stack.
We give examples of data analytics running on HPC systems including details on persuading Java to run fast.
Some details can be found at http://dsc.soic.indiana.edu/publications/HPCBigDataConvergence.pdf
A High-Performance Campus-Scale Cyberinfrastructure: The Technical, Political... (Larry Smarr)
10.10.11
Presentation by Larry Smarr to the NSF Campus Bridging Workshop
Title: A High-Performance Campus-Scale Cyberinfrastructure: The Technical, Political, and Economic
Anaheim, CA
Similar to Spark at NASA/JPL (Chris Mattmann, NASA/JPL)
FPGA-Based Acceleration Architecture for Spark SQL, Qi Xie and Quanfu Wang (Spark Summit)
In this session we will present a configurable FPGA-based Spark SQL acceleration architecture. It targets leveraging FPGAs' highly parallel computing capability to accelerate Spark SQL queries and, given FPGAs' higher power efficiency than CPUs, to lower power consumption at the same time. The architecture consists of SQL query decomposition algorithms and fine-grained FPGA-based engine units which perform basic computations such as substring, arithmetic, and logic operations. Using the SQL query decomposition algorithm, we are able to decompose a complex SQL query into basic operations, and according to their patterns each is fed into an engine unit. SQL engine units are highly configurable and can be chained together to perform complex Spark SQL queries, so that one SQL query is ultimately transformed into a hardware pipeline. We will present benchmark results comparing queries run with the FPGA-based Spark SQL acceleration architecture on Xeon E5 plus FPGA against Spark SQL queries on Xeon E5 alone, showing 10x-100x improvement, and we will demonstrate one SQL query workload from a real customer.
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M... (Spark Summit)
In this talk, we’ll present techniques for visualizing large-scale machine learning systems in Spark. These are techniques employed by Netflix to understand and refine the machine learning models behind Netflix’s famous recommender systems, which are used to personalize the Netflix experience for its 99 million members around the world. Essential to these techniques is Vegas, a new OSS Scala library that aims to be the “missing Matplotlib” for Spark/Scala. We’ll talk about the design of Vegas and its usage in Scala notebooks to visualize machine learning models.
This presentation introduces how we design and implement a real-time processing platform using the latest Spark Structured Streaming framework to intelligently transform production lines in the manufacturing industry. In a traditional production line there is a variety of isolated structured, semi-structured and unstructured data, such as sensor data, machine screen output, log output, database records, etc. There are two main data scenarios: 1) picture and video data, low in frequency but large in volume; 2) continuous data at high frequency, which is small per record but very large in total, such as vibration data used to detect equipment quality. These data have the characteristics of streaming data: real-time, volatile, bursty, disordered and unbounded. Making effective real-time decisions to retrieve value from these data is critical to smart manufacturing. The latest Spark Structured Streaming framework greatly lowers the bar for building highly scalable and fault-tolerant streaming applications. Thanks to Spark, we were able to build a low-latency, high-throughput and reliable operation system involving data acquisition, transmission, analysis and storage. An actual user case proved that the system meets the needs of real-time decision-making. The system greatly improves predictive fault repair in the production process and the efficiency of production-line material tracking, and can reduce the labor required on the production lines by about half.
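As a sketch of what the ingestion step of such a platform can look like, here is Structured Streaming reading sensor events from Kafka (broker, topic, schema, and paths are hypothetical; the spark-sql-kafka package must be on the classpath):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import (DoubleType, StringType, StructType,
                                   TimestampType)

    spark = SparkSession.builder.getOrCreate()

    schema = (StructType()
              .add("machine_id", StringType())
              .add("vibration", DoubleType())
              .add("ts", TimestampType()))

    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")
           .option("subscribe", "sensor-events")
           .load())

    # Parse the JSON payload into typed columns.
    events = (raw.select(from_json(col("value").cast("string"),
                                   schema).alias("e"))
              .select("e.*"))

    # Persist the parsed stream; the checkpoint enables fault tolerance.
    (events.writeStream.format("parquet")
        .option("path", "/data/sensor-events")
        .option("checkpointLocation", "/chk/sensor-events")
        .start()
        .awaitTermination())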
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra (Spark Summit)
As common sense would suggest, weather has a definite impact on traffic. But how much? And under what circumstances? Can we improve traffic (congestion) prediction given weather data? Predictive traffic is envisioned to significantly impact how drivers plan their day by alerting users before they travel, finding the best times to travel, and, over time, learning from new IoT data such as road conditions, incidents, etc. This talk will cover the traffic prediction work conducted jointly by IBM and the traffic data provider. As a part of this work, we conducted a case study over five large metropolitan areas in the US, 2.58 billion traffic records, and 262 million weather records, to quantify the boost in accuracy of traffic prediction using weather data. We will provide an overview of our lambda architecture, with Apache Spark being used to build prediction models from weather and traffic data, and Spark Streaming used to score the model and provide real-time traffic predictions. This talk will also cover a suite of extensions to Spark to analyze geospatial and temporal patterns in traffic and weather data, as well as the suite of machine learning algorithms that were used with the Spark framework. Initial results of this work were presented at the National Association of Broadcasters meeting in Las Vegas in April 2017, and there is work to scale the system to provide predictions in over 100 cities. The audience will learn about our experience scaling Spark in offline and streaming modes, building statistical and deep-learning pipelines with Spark, and techniques to work with geospatial and time-series data.
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP, Artem... (Spark Summit)
Graph is on the rise, and it’s time to start learning about scalable graph analytics! In this session we will go over two Spark-based graph analytics frameworks: Tinkerpop and GraphFrames. While both frameworks can express very similar traversals, they have different performance characteristics and APIs. In this deep-dive-by-example presentation, we will demonstrate some common traversals and explain how, at the Spark level, each traversal is actually computed under the hood! Learn both the fluent Gremlin API and the powerful GraphFrame motif API as we show examples of both simultaneously. No need to be familiar with graphs or Spark for this presentation, as we’ll be explaining everything from the ground up!
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark, Marcin ... (Spark Summit)
Building accurate machine learning models has been an art for data scientists: algorithm selection, hyperparameter tuning, feature selection and so on. Recently, efforts to break through these “black arts” have begun. In cooperation with our partner, NEC Laboratories America, we have developed a Spark-based automatic predictive modeling system. The system automatically searches for the best algorithm, parameters and features without any manual work. In this talk, we will share how the automation system is designed to exploit the attractive advantages of Spark. An evaluation with real open data demonstrates that our system can explore hundreds of predictive models and discover the most accurate ones in minutes on an Ultra High Density Server, which employs 272 CPU cores, 2TB memory and 17TB SSD in a 3U chassis. We will also share open challenges in learning such a massive number of models on Spark, particularly from reliability and stability standpoints. This talk will cover the presentation already shown at Spark Summit SF’17 (#SFds5) but from a more technical perspective.
Apache Spark and Tensorflow as a Service with Jim Dowling (Spark Summit)
In Sweden, from the RISE ICE Data Center at www.hops.site, we provide researchers with both Spark-as-a-Service and, more recently, TensorFlow-as-a-Service as part of the Hops platform. In this talk, we examine the different ways in which TensorFlow can be included in Spark workflows, from batch to streaming to Structured Streaming applications. We will analyse the different frameworks for integrating Spark with TensorFlow, from TensorFrames to TensorFlowOnSpark to Databricks’ Deep Learning Pipelines. We introduce the different programming models supported and highlight the importance of cluster support for managing different versions of Python libraries on behalf of users. We will also present cluster management support for sharing GPUs, including Mesos and YARN (in Hops Hadoop). Finally, we will perform a live demonstration of training and inference for a TensorFlowOnSpark application written in Jupyter that can read data from either HDFS or Kafka, transform the data in Spark, and train a deep neural network on TensorFlow. We will show how to debug the application using both the Spark UI and TensorBoard, and how to examine logs and monitor training.
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library... (Spark Summit)
With the rapid growth of available datasets, it is imperative to have good tools for extracting insight from big data. The Spark ML library has excellent support for performing at-scale data processing and machine learning experiments, but more often than not, data scientists find themselves struggling with issues such as low-level data manipulation, lack of support for image processing, text analytics and deep learning, and the inability to use Spark alongside other popular machine learning libraries. To address these pain points, Microsoft recently released the Microsoft Machine Learning Library for Apache Spark (MMLSpark), an open-source machine learning library built on top of SparkML that seeks to simplify the data science process and integrate SparkML Pipelines with deep learning and computer vision libraries such as the Microsoft Cognitive Toolkit (CNTK) and OpenCV. With MMLSpark, data scientists can build models with 1/10th of the code through Pipeline objects that compose seamlessly with other parts of the SparkML ecosystem. In this session, we explore some of the main lessons learned from building MMLSpark. Join us if you would like to know how to extend Pipelines to ensure seamless integration with SparkML, how to auto-generate Python and R wrappers from Scala Transformers and Estimators, how to integrate and use previously non-distributed libraries in a distributed manner, and how to efficiently deploy a Spark library across multiple platforms.
Next CERN Accelerator Logging Service with Jakub Wozniak (Spark Summit)
The Next Accelerator Logging Service (NXCALS) is a new Big Data project at CERN aiming to replace the existing Oracle-based service.
The main purpose of the system is to store and present Controls/Infrastructure-related data gathered from thousands of devices across the whole accelerator complex.
The data is used to operate the machines, improve their performance and conduct studies for new beam types or future experiments.
During this talk, Jakub will speak about NXCALS requirements and the design choices that led to the selected architecture based on Hadoop and Spark. He will present the Ingestion API, the abstractions behind the Meta-data Service, and the Spark-based Extraction API, where simple changes to the schema handling greatly improved the overall usability of the system. The system itself is not CERN-specific and can be of interest to other companies or institutes confronted with similar Big Data problems.
Powering a Startup with Apache Spark with Kevin Kim (Spark Summit)
In Between (a mobile app for couples, downloaded 20M times globally), Spark is used for everything from daily batches for extracting metrics to analysis and dashboards. Spark is widely used by engineers and data analysts at Between; thanks to the performance and expandability of Spark, data operations have become extremely efficient. The entire team, including biz dev, global operations, and designers, enjoys the data results, so Spark is empowering the whole company toward data-driven operation and thinking. Kevin, co-founder and data team leader of Between, will present how things are going at Between. Listeners will learn how a small and agile team lives with data (how they build their organization, culture, and technical base).
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
In many cases, Big Data becomes just another buzzword for lack of tools that support both the technological requirements of developing and deploying projects and fluent communication between the different profiles of people involved in those projects.
In this talk, we will present Moriarty, a set of tools for fast prototyping of Big Data applications that can be deployed in an Apache Spark environment. These tools support the creation of Big Data workflows using the already existing functional blocks or supporting the creation of new functional blocks. The created workflow can then be deployed in a Spark infrastructure and used through a REST API.
For a better understanding of Moriarty, the prototyping process, and the way it hides the Spark environment from Big Data users and developers, we will present it together with a couple of examples: one based on an Industry 4.0 success case and another on a logistics success case.
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
Large-scale testing of new data products or enhancements to existing products in a research and development environment can be a technical challenge for data scientists. In some cases, tools available to data scientists lack production-level capacity, whereas other tools do not provide the algorithms needed to run the methodology. At Nielsen, the Databricks platform provided a solution to both of these challenges. This breakout session will cover a specific Nielsen business case where two methodology enhancements were developed and tested at large-scale using the Databricks platform. Development and large-scale testing of these enhancements would not have been possible using standard database tools.
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
Data lineage tracking is one of the significant problems that financial institutions face when using modern big data tools. This presentation describes Spline – a data lineage tracking and visualization tool for Apache Spark. Spline captures and stores lineage information from internal Spark execution plans and visualizes it in a user-friendly manner.
Goal Based Data Production with Sim SimeonovSpark Summit
Since the invention of SQL and relational databases, data production has been about specifying how data is transformed through queries. While Apache Spark can certainly be used as a general distributed query engine, the power and granularity of Spark’s APIs enables a revolutionary increase in data engineering productivity: goal-based data production. Goal-based data production concerns itself with specifying WHAT the desired result is, leaving the details of HOW the result is achieved to a smart data warehouse running on top of Spark. That not only substantially increases productivity, but also significantly expands the audience that can work directly with Spark: from developers and data scientists to technical business users. With specific data and architecture patterns spanning the range from ETL to machine learning data prep and with live demos, this session will demonstrate how Spark users can gain the benefits of goal-based data production.
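The talk's smart data warehouse is not public, so as a loose contrast only, a stock PySpark snippet of today's "HOW" style; data and names are invented:

```python
# A contrast sketch, not the talk's system: the "HOW" style hand-sequences
# each transformation. Goal-based production, as described, would instead
# declare the desired result (e.g., "total clicks per user") and leave the
# planning of these steps to a smart warehouse running on Spark.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("how-style-sketch").getOrCreate()
events = spark.createDataFrame(
    [("u1", "click", 3), ("u1", "view", 1), ("u2", "click", 2)],
    ["user", "action", "n"])

# HOW: the engineer spells out filter -> group -> aggregate explicitly.
clicks = (events
          .filter(F.col("action") == "click")
          .groupBy("user")
          .agg(F.sum("n").alias("clicks")))
clicks.show()
```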
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
Have you imagined a simple machine learning solution able to prevent revenue leakage and monitor your distributed application? To answer this question, we offer a practical and simple machine learning solution for building an intelligent monitoring application based on straightforward data analysis with Apache Spark MLlib. Our application uses linear regression models to make predictions and check whether the platform is experiencing operational problems that can cause revenue losses. The application monitors distributed systems and sends notifications describing the detected problem, so users can act quickly to avoid serious issues that directly impact the company's revenue, reducing the time to action. We will present an architecture for a system that is not only a monitor but also an active actor in our outage recoveries. At the end of the presentation you will have access to our training program source code, which you can adapt and implement in your own company (a sketch of the core idea follows below). This solution already helped prevent about US$3 million in losses last year.
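As a hedged sketch of the approach described (not the authors' code), a linear model fit on a toy platform metric with a residual-based alert; the threshold, columns, and data are invented:

```python
# Fit expected transaction volume as a function of hour-of-day, then flag
# fresh observations whose residual is suspiciously large.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.sql.functions import abs as sql_abs, col

spark = SparkSession.builder.appName("monitoring-sketch").getOrCreate()

# Toy history: hour of day -> transaction count.
history = spark.createDataFrame(
    [(h, 1000.0 + 50.0 * h) for h in range(24)], ["hour", "tx_count"])

assembler = VectorAssembler(inputCols=["hour"], outputCol="features")
model = LinearRegression(labelCol="tx_count").fit(assembler.transform(history))

# Score fresh observations; a large residual suggests leakage or an outage.
fresh = assembler.transform(
    spark.createDataFrame([(10, 700.0)], ["hour", "tx_count"]))
scored = model.transform(fresh).withColumn(
    "alert", sql_abs(col("tx_count") - col("prediction")) > 300.0)
scored.select("hour", "tx_count", "prediction", "alert").show()
```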
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
Getting Ready to Use Redis with Apache Spark is a technical tutorial on integrating Redis with an Apache Spark deployment to increase the performance of serving complex decision models. To set the context for the session, we start with a quick introduction to Redis and the capabilities it provides, covering its basic data types and its module system. Using an ad-serving use case, we look at how Redis can improve the performance and reduce the cost of using complex ML models in production. Attendees will be guided through the key steps of setting up and integrating Redis with Spark, including how to train a model using Spark and then load and serve it using Redis, as well as how to work with the Spark Redis module. The capabilities of the Redis Machine Learning Module (redis-ml) will be discussed, focusing primarily on decision trees and regression (linear and logistic), with code examples demonstrating these features. At the end of the session, developers should feel confident building a prototype/proof-of-concept application using Redis and Spark. Attendees will understand how Redis complements Spark and how to use Redis to serve complex ML models with high performance.
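A minimal sketch of the train-in-Spark, serve-from-Redis handoff the tutorial describes, assuming plain redis-py and invented key names; the redis-ml module covered in the session keeps models server-side instead, which this sketch does not reproduce:

```python
# Train a logistic model with pyspark.ml, then publish its parameters to
# Redis where a low-latency serving process (e.g., an ad server) reads them.
import json
import redis
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("redis-serving-sketch").getOrCreate()
train = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.0]), 0.0), (Vectors.dense([1.0, 0.0]), 1.0)],
    ["features", "label"])
model = LogisticRegression().fit(train)

# Push the learned coefficients under hypothetical keys.
r = redis.Redis(host="localhost", port=6379)
r.set("model:ctr:weights", json.dumps(list(model.coefficients)))
r.set("model:ctr:intercept", model.intercept)
```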
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
Here we present a general supervised framework for record deduplication and author disambiguation via Spark. This work differentiates itself in three ways:
– The use of Databricks and AWS makes this a scalable implementation; compute resources are considerably lower than with traditional legacy technology running big boxes 24/7. Scalability is crucial, as Elsevier's Scopus data, the biggest scientific abstract repository, covers roughly 250 million authorships from 70 million abstracts spanning a few hundred years.
– We create a fingerprint for each piece of content using deep learning and/or word2vec algorithms to expedite pairwise similarity calculation (a sketch of this step follows below). These encoders substantially reduce compute time while maintaining semantic similarity, unlike traditional TF-IDF or predefined taxonomies. We will briefly discuss how to optimize word2vec training with high parallelization, and show how these encoders can derive a standard representation for all our entities: documents, authors, users, journals, and so on. This standard representation reduces recommendation to a pairwise similarity search, offering a basic recommender for cross-product applications that lack a dedicated recommender engine.
– Traditional author-disambiguation and record-deduplication algorithms are batch processes with little to no training data. We have roughly 25 million authorships that are manually curated or corrected through user feedback, and because it is crucial to maintain historical profiles, we developed a machine learning implementation that handles data streams and processes them in mini-batches or one document at a time. We will discuss how to measure the accuracy of such a system, how to tune it, and how to turn the raw pairwise similarities into final clusters. Lessons from this talk can help any company that wants to integrate its data or deduplicate its user, customer, or product databases.
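A minimal pyspark.ml sketch of the fingerprint idea (not Elsevier's implementation): embed records with Word2Vec and compare fingerprints by cosine similarity; the data, vector size, and cutoff are illustrative:

```python
# Embed each record's tokens into a fixed-size vector ("fingerprint"),
# then compare fingerprints with cosine similarity. Pairs above a cutoff
# would be passed on to the supervised duplicate classifier.
from pyspark.sql import SparkSession
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.appName("dedup-sketch").getOrCreate()
docs = spark.createDataFrame([
    ("a1", "deep learning for image segmentation".split(" ")),
    ("a2", "image segmentation with deep learning".split(" ")),
], ["id", "words"])

w2v = Word2Vec(vectorSize=16, minCount=0, inputCol="words", outputCol="vec")
fingerprints = w2v.fit(docs).transform(docs).select("id", "vec").collect()

def cosine(u, v):
    return float(u.dot(v)) / (u.norm(2) * v.norm(2))

sim = cosine(fingerprints[0]["vec"], fingerprints[1]["vec"])
print(fingerprints[0]["id"], fingerprints[1]["id"], sim)
```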
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
The use of large-scale machine learning and data mining methods is becoming ubiquitous in application domains ranging from business intelligence and bioinformatics to self-driving cars. These methods heavily rely on matrix computations, so it is critical to make those computations scalable and efficient. Matrix computations are often complex and involve multiple steps that need to be optimized and sequenced properly for efficient execution. This work presents new efficient and scalable matrix processing and optimization techniques based on Spark. The proposed techniques estimate the sparsity of intermediate matrix-computation results and optimize communication costs. An evaluation plan generator for complex matrix computations is introduced, as well as a distributed plan optimizer that exploits dynamic cost-based analysis and rule-based heuristics. The result of a matrix operation often serves as input to another matrix operation, thus defining data dependencies within a matrix program. The matrix query plan generator produces execution plans that minimize memory usage and communication overhead by partitioning the matrix based on the data dependencies in the execution plan. We implemented the proposed matrix techniques inside Spark SQL and optimize the matrix execution plan using the Spark SQL Catalyst optimizer. We conduct case studies on a series of ML models and matrix computations with special features on different datasets: PageRank, GNMF, BFGS, sparse matrix chain multiplications, and a biological data analysis. The open-source library ScaLAPACK and the array-based database SciDB are used for performance evaluation. Our experiments are performed on six real-world datasets: social network graphs (e.g., soc-pokec, cit-Patents, LiveJournal), Twitter2010, Netflix recommendation data, and a 1000 Genomes Project sample. Experiments demonstrate that our proposed techniques achieve up to an order-of-magnitude performance improvement.
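MatFast itself is not part of stock Spark; for contrast, a sketch of the built-in distributed matrix path (pyspark.mllib's BlockMatrix), which executes a multiply with a fixed plan and none of the sparsity- or dependency-aware optimization the talk describes:

```python
# Stock Spark distributed matrix multiply: blocks are keyed by (row, col)
# block indices; multiply() shuffles and combines blocks with a fixed plan.
from pyspark.sql import SparkSession
from pyspark.mllib.linalg import Matrices
from pyspark.mllib.linalg.distributed import BlockMatrix

spark = SparkSession.builder.appName("matfast-contrast").getOrCreate()
sc = spark.sparkContext

# One 2x2 dense block per matrix (toy sizes; real workloads shard blocks).
blocks_a = sc.parallelize([((0, 0), Matrices.dense(2, 2, [1, 2, 3, 4]))])
blocks_b = sc.parallelize([((0, 0), Matrices.dense(2, 2, [5, 6, 7, 8]))])
A = BlockMatrix(blocks_a, 2, 2)
B = BlockMatrix(blocks_b, 2, 2)

product = A.multiply(B)  # no cost-based reordering of a longer chain
print(product.toLocalMatrix())
```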
Opendatabay - Open Data Marketplace.pptxOpendatabay
Opendatabay.com unlocks the power of data for everyone. The Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
It is the first open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence, and it leverages AI technologies to enhance data exploration, analysis, and discovery.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex: Opendatabay simplifies data acquisition with an intuitive interface and robust search tools, letting you explore, discover, and access the data you need and focus on extracting valuable insights. Opendatabay also offers dedicated, AI-generated synthetic datasets.
These privacy-preserving datasets can be used for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, giving users a comprehensive understanding of the data they're working with, and it verifies every dataset through a combination of distributed ledger technology and rigorous third-party audits. Security is at the core of the marketplace, which implements stringent measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices that have already converged can save iteration time. Skipping in-identical vertices (those sharing the same in-links) reduces duplicate computation and thus can also cut iteration time. Road networks often have chains that can be short-circuited before the PageRank computation, since the final ranks of chain nodes are easy to calculate afterwards; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce iteration time and the number of iterations, and also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether. A sketch of the convergence-skipping idea follows below.
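A plain-Python sketch of one ingredient above, skipping vertices whose rank change falls below a tolerance; the tiny graph and tolerance are illustrative, and this is not the STICD reference implementation:

```python
# PageRank with per-vertex convergence skipping: once a vertex's rank
# change drops below tol, it is marked converged and its update is skipped
# in later iterations (a heuristic saving, as described above).
def pagerank_skip_converged(out_links, d=0.85, tol=1e-6, max_iter=100):
    n = len(out_links)
    rank = {v: 1.0 / n for v in out_links}
    in_links = {v: [] for v in out_links}
    for u, outs in out_links.items():
        for v in outs:
            in_links[v].append(u)
    converged = set()
    for _ in range(max_iter):
        new_rank = {}
        for v in out_links:
            if v in converged:
                new_rank[v] = rank[v]  # settled vertex: reuse its rank
                continue
            s = sum(rank[u] / len(out_links[u]) for u in in_links[v])
            new_rank[v] = (1.0 - d) / n + d * s
            if abs(new_rank[v] - rank[v]) < tol:
                converged.add(v)
        rank = new_rank
        if len(converged) == n:
            break
    return rank

print(pagerank_skip_converged({"a": ["b"], "b": ["c"], "c": ["a"]}))
```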
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank, commonly run over Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
1. NASA and Apache Spark
Chris Mattmann
Chief Architect, Instrument and Science Data Systems Section, NASA JPL
Adjunct Associate Professor, USC
Director, Apache Software Foundation
Spark Summit 2015
2. And you are?
• Apache Board of Directors; involved in OODT (VP, PMC), Tika (PMC), Nutch (PMC), Incubator (PMC), SIS (PMC), Gora (PMC), Airavata (PMC)
• Chief Architect, Instrument and Science Data Systems Section at NASA JPL in Pasadena, CA, USA
• Software Architecture/Engineering Prof at Univ. of Southern California
4. Instrument & Ground Data Systems (Section 398)
• Largest Section on Lab
• 250+ people
• Data Science, Machine Learning, Visualization, Operations groups
• OCO-2, NPP Sounder PEATE, SMAP, MER, MSL, Mars 2020, Image Processing
5. Instrument and Ground Systems: Earth Monitoring
[Architecture diagram: space-based satellites, airborne instruments and platforms, in-situ sensors, and the Deep Space Network (DSN) deliver data (GBs for Earth, MBs for planetary; other paths carry 10s of TB and 10s-100s of GB) to instrument ground systems providing file & metadata management, workflow management, and resource management. Raw and processed data feed triage algorithms and scalable archives, and processed, reduced data flow into scalable data analytics (graph-based, array-based, text-based) serving the science community, researchers, and decision makers.]
6. Some "Big Data" Grand Challenges I'm interested in
• How do we handle 700 TB/sec of data coming off the wire when we actually have to keep it around? – Required by the Square Kilometre Array
• Joe scientist says: I've got an IDL or Matlab algorithm that I will not change, and I need to run it on 10 years of data from the Colorado River Basin and store and disseminate the output products – Required by the Western Snow Hydrology project
• How do we compare petabytes of climate model output data in a variety of formats (HDF, NetCDF, Grib, etc.) with petabytes of remote sensing data to improve climate models for the next IPCC assessment? – Required by the 5th IPCC assessment, the Earth System Grid, and NASA
• How do we catalog all of NASA's current planetary science data? – Required by the NASA Planetary Data System
Image Credit: http://www.jpl.nasa.gov/news/news.cfm?release=2011-295. Copyright 2012, Jet Propulsion Laboratory, California Institute of Technology. US Government Sponsorship Acknowledged.
7. Big Data Strategic Initiative
Task | Title | PI | Section
1 | Power Minimization in Signal Processing for Data-Intensive Science | Larry D'Addario | 335
2 | Machine Learning for Smart Triage of Big Data | Kiri Wagstaff | 388
3 | Archiving, Processing and Dissemination for the Big Data Era | Chris Mattmann | 388
4 | Knowledge-driven Automated Movie Production Environment distribution and Display (AMPED) Pipeline | Eric De Jong | 3223
Initiative Long Term Objectives:
• Apply power-efficient digital architectures to future JPL flight instrument developments and proposals.
• Expand and promote JPL expertise with machine learning algorithm development for real-time triage.
• Utilize intelligent anomaly classification algorithms in other fields, including data-intensive industry.
• Build on JPL investments in large data archive systems to capture a role in future science facilities.
• Enhance the efficiency and impact of JPL's data visualization and knowledge extraction programs.
Initiative Leader: Chris Mattmann. Steering Committee Leader: Joseph Lazio
Initial Major Milestones for FY13:
• Report on end-to-end power optimization of instruments – Jun 2013
• Hierarchical classification method for VAST and ChemCam – Jan 2013
• Demonstrate smart compression for Hyperion and CRISM – Mar 2013
• Cloud computing research and scalability experiments – Feb 2013
• Data formats and text, metadata extraction in big data sys. – Aug 2013
• Develop AMPED pipeline and install in VIP Center – Dec 2012
Future Opportunities: mission and instrument competitions, data-intensive industries, LSST, future radio observatories.
JPL Concept: big data technology for data triage, archiving, etc.
Key Challenges this work enables: broaden JPL business base (relevant to 1X, 3X, 4X, 7X, 8X, 9X Directorates)
Credit: Dayton Jones
10. DARPA XDATA
• Grab the principals behind the leading infrastructure/viz technologies
– Shove them in a tight space
– Provide beer, coffee, and snacks
– Provide awesome data and challenges
– Provide infrastructure and connectivity
• Check in every day and 1x a week
• Wall of Shame/Fame
• New Challenges Each Week
• Midterm Presentations
• Peanut Gallery
• Make people talk/socialize
• Put that all together
11. DARPA XDATA: Analytics + Viz
[Chart: data volume over time, 2006-2012, scale 0-12,000; "GoFFish: XDATA Summer Workshop Mid-Term, 2013-07-10"; "Akamai IP Connectivity using AR".]
12. Met these fine people
• Ion Stoica – CEO, DataBricks; Co-Director, AMP Lab
• Matt Massie – Dev Manager, AMP Lab
• Dr. Chris White – DARPA XDATA PM
13. The Apache Software Foundation
• Largest open source software development entity in the world
– Over 2600+ committers
– Over 4200+ contributors
– Over 400+ members
• 100+ Top Level Projects
– 57 Incubating
– 32 Lab Projects
• 12 retired projects in the "Attic"
• Over 1.2 million revisions
• 501(c)3 non-profit organization incorporated in Delaware
• Over 10M successful requests served a day across the world
• HTTPD web server used on 100+ million web sites (52+% of the market)
14. Apache Maturity Model
• Start out with Incubation
• Grow community
• Make releases
• Gain interest
• Diversify
• When the project is ready, graduate into
– Top-Level Project (TLP)
– Sub-project of TLP
• Increasingly, sub-projects are discouraged compared to TLPs
15. Apache is a well recognized brand
17. Where does Spark fit into science?
• U.S. National Climate Assessment (pic credit: Dr. Tom Painter)
• SKA South Africa: Square Kilometre Array (pic credit: Dr. Jasper Horrell, Simon Ratcliffe)
20. Science Data File Formats
• Hierarchical Data Format (HDF)
– http://www.hdfgroup.org
– Versions 4 and 5
– Lots of NASA data is in 4, newer NASA data in 5
– Encapsulates
• Observation (Scalars, Vectors, Matrices, NxMxZ…)
• Metadata (Summary info, date/time ranges, spatial
ranges)
– Custom readers/writers/APIs in many languages
• C/C++, Python, Java
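A short h5py sketch of reading such a granule; the file name, dataset path, and attribute names are placeholders rather than a specific NASA product:

```python
# Walk an HDF5 file's hierarchy, read one dataset, and inspect its metadata.
import h5py

with h5py.File("granule.h5", "r") as f:
    f.visit(print)                             # list groups/datasets
    obs = f["/science/radiance"][...]          # hypothetical NxM observation
    meta = dict(f["/science/radiance"].attrs)  # summary info, ranges, units
    print(obs.shape, meta.get("units"))
```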
21. Science Data File Formats
• network Common Data Form (netCDF)
– www.unidata.ucar.edu/software/netcdf/
– Versions 3 and 4
– Heavily used in DOE, NOAA, etc.
– Encapsulates
• Observation (Scalars, Vectors, Matrices, NxMxZ…)
• Metadata (Summary info, date/time ranges, spatial
ranges)
– Custom readers/writers/APIs in many languages
• C/C++, Python, Java
– Not Hierarchical representation: all flat
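The netCDF counterpart, showing the flat namespace of dimensions and variables noted above; the file and variable names are again placeholders:

```python
# Open a netCDF file, list its flat dimension/variable namespace, and read
# one variable plus its units attribute.
from netCDF4 import Dataset

with Dataset("model_output.nc", "r") as ds:
    print(list(ds.dimensions), list(ds.variables))
    precip = ds.variables["precip"]            # hypothetical variable
    print(precip.shape, getattr(precip, "units", "n/a"))
    values = precip[:]                         # reads the array into memory
```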
22. OCO-1 Workflow (Credit: B. Chafin)
24. Data Reuse Between Stages
• All of these science data pipelines
– Read/write NetCDF, HDF files
– Write to distributed file systems (only recently HDFS, GlusterFS, etc.)
• Have timing constraints
• Include jobs with varying timing
– Some early completing jobs (<1ms)
– Some long running jobs
• What does this sound like? SPARK
25. Apache OODT
• Entered “incubation” at the Apache
Software Foundation in 2010
• Selected as a top level Apache Software
Foundation project in January 2011
• Developed by a community of participants
from many companies, universities, and
organizations
• Used for a diverse set of science data
system activities in planetary science,
earth science, radio astronomy,
biomedicine, astrophysics, and more
OODT development & user community: http://oodt.apache.org
28. Motivation for SciSpark
• Experiment with in-memory and frequent data reuse operations
– Regridding, interactive analytics such as MCC search, and variable clustering (min/max) over decadal datasets could benefit from in-memory testing (rather than frequent disk I/O)
– Data ingestion (preparation, formatting)
29. Architecture of SciSpark
[Architecture diagram: scientific RDD (sRDD) creation in SciSpark — load HDF and NetCDF from HDFS/Shark, split by time or region (Scala), then map, regrid, compute metrics, and save/cache. An Extract, Transform and Load (ETL) layer and the SciSpark user interface connect data scientists and expert users, as well as scientists, decision makers, educators, and students, to data centers and systems such as RCMES, obs4MIPs, ESGF, ExArch, and DAACs, with D3.js for visualization. The slide embeds a page from a regional climate model evaluation study of precipitation over Africa (model biases, RMSE, and annual-cycle skill of RCMs against CRU observations; figures from Kim et al., Climate Dynamics, via http://rcmes.jpl.nasa.gov).]
30. SciSpark – Visualization (D3 and friends)
[Diagram: query and storage stack — raster, vector, and text filtering over Gora (storage framework) and Solr (indexing framework), with Tika (parsing framework), web pages and indexes in HBase/Hadoop, and a protocol layer.]
• Tika, Nutch, Blaze, Bokeh, Solr, Tangelo
31. Use Cases
• (A) Multi-stage generation to produce time-split data
• (B) Multi-stage operation to select data from Shark and cluster by deviation from the mean
32. Climate Metrics on SciSpark
[Pipeline diagram:
A) OODT-based ETL into HDFS → sRDD: split by time (Jan 1998 - 2012) → sRDD: regrid (bilinear, with sRDD.cache) → held in memory (node-local or replicated), backed by HDFS on replicated spinning disk or SSD.
B) sRDD: select lat, lng, value from Shark over North America → cluster values on mean and output clusters → top 10 clusters sorted by mean, again in memory (node-local or replicated).]
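A hedged PySpark sketch of the two stages on this slide, standing in for the sRDD API (which is not shown here); the path, record layout, region bounds, and bucket width are invented:

```python
# Stage A: load (time, lat, lng, value) records and cache in memory, mirroring
# "ETL into HDFS -> split by time -> sRDD.cache". Stage B: select a region,
# bucket values by deviation from the mean, and keep the top 10 buckets.
from pyspark import SparkContext

sc = SparkContext(appName="scispark-style-sketch")

records = (sc.textFile("hdfs:///rcmes/precip-1998-2012.csv")  # hypothetical
           .map(lambda line: line.split(","))
           .map(lambda f: (f[0], float(f[1]), float(f[2]), float(f[3])))
           .cache())

# Rough North America bounding box (illustrative bounds).
na = records.filter(lambda r: 15.0 <= r[1] <= 72.0 and -170.0 <= r[2] <= -50.0)
mean = na.map(lambda r: r[3]).mean()

top10 = (na.map(lambda r: (int((r[3] - mean) // 10.0), (r[3], 1)))
         .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
         .mapValues(lambda s: s[0] / s[1])       # per-cluster mean
         .sortBy(lambda kv: kv[1], ascending=False)
         .take(10))
print(top10)
```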
33. SciSpark – just getting started
• Funded NASA AIST14 award to construct SciSpark
• SciSpark climate scenarios
– Climate extremes / impact analysis and clustering
– Mesoscale Convective Complex Search
34. SKA/Astronomy
• Go Where the Science is Best!
• Deploy Where there is No Infrastructure
• Reconfigure as Needed to Optimize Performance
• Simplify by Using Raw Voltage Capture
36. Conclusions
• Lots of places in science and NASA for Spark
• Great connections already
• Existing Apache projects to integrate upstream
• Downstream use cases
• Come chat with me today!