Dr. Francesco Bongiovanni has expertise in scalable distributed systems and algorithms, cloud computing, applied formal methods, and distributed optimization. He holds a B.Sc. in Computer Systems, an M.Sc. in Software Engineering of Distributed Systems, and a Ph.D. in Computer Science, and has worked at INRIA and the Verimag Laboratory. This presentation provides an overview of big data frameworks and tools, including HDFS, Mesos, Spark, Spark Streaming, Spark SQL, GraphX, MLlib, Chapel, ZooKeeper, and SparkR, that can be run on the eScience cluster to process large datasets in a scalable, fault-tolerant manner. Examples demonstrate operations such as averaging 1 billion elements.
With Anaconda (in particular Numba and Dask) you can scale up your NumPy and Pandas stack to many CPUs and GPUs, as well as scale out to run on clusters of machines, including Hadoop.
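The scale-up idea behind Dask can be sketched in plain Python: split the data into chunks, reduce each chunk independently (potentially on separate cores), then combine the small partial results. This is an illustrative stdlib sketch of that chunked-computation pattern, not Dask's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_stats(chunk):
    """Per-chunk partial result: (sum, count)."""
    return (sum(chunk), len(chunk))

def parallel_mean(data, chunk_size=4):
    """Mean of `data` computed chunk by chunk, Dask-style."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(chunk_stats, chunks))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

print(parallel_mean(list(range(10))))  # 4.5
```

Dask generalizes exactly this decomposition to whole NumPy/Pandas workloads and schedules the chunk tasks across threads, processes, or a cluster.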
Open Source Lambda Architecture for Deep Learning (Patrick Nicolas)
This presentation describes the layers and open source components that can be used to design and implement a lambda architecture supporting batch processing for model training and stream processing for prediction.
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ... (Databricks)
Explore the trade-offs of performing linear algebra for data analysis and machine learning using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Apache Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks.
This session will examine three widely-used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiquity) and CX (for data interpretability). Learn how these methods are applied to terabyte-sized problems in particle physics, climate modeling and bioimaging, use cases where interpretable analytics is of interest. The data matrices are tall-and-skinny, a shape that enables the algorithms to map conveniently onto Spark’s data-parallel model. We perform scaling experiments on up to 1600 Cray XC40 nodes, describe the sources of slowdowns, and provide tuning guidance to obtain high performance. Based on joint work with Alex Gittens and many others.
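Why the tall-and-skinny shape maps so well onto data-parallel engines: the Gram matrix A^T A is only n x n (n = number of columns), so each worker can accumulate outer products of its rows and only the small partial sums need to be combined; PCA and related factorizations then run on that small result. A hedged pure-Python sketch of the row-streaming accumulation (the talk itself uses Spark, and real code would use optimized linear algebra):

```python
def gram_update(acc, row):
    """Add the outer product row^T * row into the n x n accumulator."""
    n = len(row)
    for i in range(n):
        for j in range(n):
            acc[i][j] += row[i] * row[j]
    return acc

def tall_skinny_gram(rows):
    """Compute A^T A by streaming over rows, as a Spark aggregate would."""
    n = len(rows[0])
    acc = [[0.0] * n for _ in range(n)]
    for row in rows:  # in Spark: a per-partition fold followed by a tree reduce
        gram_update(acc, row)
    return acc

A = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
print(tall_skinny_gram(A))  # [[35.0, 44.0], [44.0, 56.0]]
```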
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ... (Databricks)
Apache Spark performance on SQL and DataFrame/DataSet workloads has made impressive progress, thanks to Catalyst and Tungsten, but there is still a significant gap towards what is achievable by best-of-breed query engines or hand-written low-level C code on modern server-class hardware. This session presents Flare, a new experimental back-end for Spark SQL that yields significant speed-ups by compiling Catalyst query plans to native code.
Flare’s low-level implementation takes full advantage of native execution, using techniques such as NUMA-aware scheduling and data layouts to leverage ‘mechanical sympathy’ and bring execution closer to the metal than current JVM-based techniques on big memory machines. Thus, with available memory increasingly in the TB range, Flare makes scale-up on server-class hardware an interesting alternative to scaling out across a cluster, especially in terms of data center costs. This session will describe the design of Flare, and will demonstrate experiments on standard SQL benchmarks that exhibit order of magnitude speedups over Spark 2.1.
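Flare's core idea, compiling a query plan into one specialized piece of code instead of interpreting it operator by operator, can be illustrated in miniature. The sketch below "compiles" a filter-plus-projection plan into a single fused Python function; Flare itself emits native code from Catalyst plans, and all names here are illustrative:

```python
def compile_plan(filter_expr, project_cols):
    """Generate one fused loop for a filter + projection plan."""
    src = (
        "def query(rows):\n"
        "    out = []\n"
        "    for row in rows:\n"
        f"        if {filter_expr}:\n"
        f"            out.append(({', '.join(project_cols)},))\n"
        "    return out\n"
    )
    namespace = {}
    exec(src, namespace)  # stand-in for emitting and loading native code
    return namespace["query"]

rows = [{"name": "ann", "age": 42}, {"name": "bob", "age": 25}]
q = compile_plan("row['age'] > 30", ["row['name']"])
print(q(rows))  # [('ann',)]
```

The generated function has no per-operator dispatch overhead, which is the same reason compiled query engines beat interpreted ones on modern hardware.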
Making NumPy-style and Pandas-style code faster and run in parallel. Continuum has been working on scaled versions of NumPy and Pandas for 4 years. This talk describes how Numba and Dask provide scaled Python today.
Accelerating Data Analysis of Brain Tissue Simulations with Apache Spark with... (Databricks)
In recent years, increasing computational power has made possible larger scientific experiments with high computing demands, such as brain tissue simulations. Larger simulations generally produce larger amounts of data that then need to be analyzed by neuroscientists. Currently, neuroscientists analyze simulation reports with Python scripts, thanks to Python's programming simplicity and the performance of the NumPy library.
However, this analysis workflow will become unfeasible in the near future, as we foresee a 10x increase of the dataset size in the next year. Therefore, we are exploring how to accelerate data analysis of brain activity simulations with big data technologies, like Spark. In this talk, we will present how we address this challenge: from building RDDs/DataFrames from custom binary files to data queries and transformations to achieve the desired scientific analyses. In order to reach our goals, we have implemented our workflow in five different ways, combining RDDs, DataFrames, different data structures and representations and different data partitioning.
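Building records from a custom binary format, the first step described above before handing data to RDDs or DataFrames, typically comes down to fixed-size record parsing. A hedged stdlib sketch; the record layout here (a uint32 id plus a float64 value) is invented for illustration, not the speakers' actual format:

```python
import struct

# Hypothetical layout: one uint32 neuron id + one float64 voltage per record.
RECORD = struct.Struct("<Id")

def read_records(blob):
    """Yield (neuron_id, voltage) tuples from a packed binary blob."""
    for offset in range(0, len(blob), RECORD.size):
        yield RECORD.unpack_from(blob, offset)

blob = RECORD.pack(1, -65.0) + RECORD.pack(2, -70.5)
print(list(read_records(blob)))  # [(1, -65.0), (2, -70.5)]
```

In a Spark job, a function like `read_records` would run inside `mapPartitions` over byte ranges of the file, producing the rows of an RDD or DataFrame.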
After significant engineering and programming efforts, we would like to share with the community our lessons learned: how Spark features can support data analysis in our neuroscience research area and what types of decisions can negatively impact performance. Moreover, we would also like to open a discussion on some critical limitations we have found in Spark applied to our use cases, and how to address them in the future as a joint community effort. In brief, as takeaway messages, we will highlight the suitability of Spark for our data analysis, how data generation can strongly affect subsequent data analysis, and how the choice of data types and formats can have a significant impact on Spark performance. We will present our experiments run on Cooley, the Argonne National Laboratory (ANL) data analysis cluster.
Lightning fast genomics with Spark, ADAM and Scala (Andy Petrella)
We are at a time when biotechnology allows us to get personal genomes for $1000. Tremendous progress has been made in DNA sequencing since the 70s, e.g. more samples per experiment and higher genomic coverage at higher speeds. The genomic analysis standards developed over the years weren't designed with scalability and adaptability in mind. In this talk, we’ll present a game-changing technology in this area: ADAM, initiated by the AMPLab at Berkeley. ADAM is a framework based on Apache Spark and Parquet storage. We’ll see how it can speed up sequence reconstruction by a factor of 150.
TensorFrames: Google TensorFlow on Apache Spark (Databricks)
Presentation at Bay Area Spark Meetup by Databricks Software Engineer and Spark committer Tim Hunter.
This presentation covers how you can use TensorFrames with TensorFlow to perform distributed computing on GPUs.
A lecture given for Stats 285 at Stanford on October 30, 2017. I discuss how OSS technology developed at Anaconda, Inc. has helped to scale Python to GPUs and Clusters.
Using Anaconda to light up dark data. My talk given to the Berkeley Institute of Data Science describing Anaconda and the Blaze ecosystem for bringing a virtual analytical database to your data.
Deep learning beyond the learning - Jörg Schad - Codemotion Rome 2018 (Codemotion)
Open source frameworks such as TensorFlow, MXNet, or PyTorch enable anyone to model and train deep neural networks. While there are many great tutorials and talks showing the best ways to train models, there is little information on what happens after we have trained our model: how can we store, utilize, and update it? In this talk, we look at the complete deep learning pipeline and cover topics such as deployment, multi-tenancy, Jupyter notebooks, model serving, and more.
Convolutional Neural Networks at Scale in Spark MLlib (DataWorks Summit)
Jeremy Nixon will focus on the engineering and applications of a new algorithm built on top of MLlib. The presentation will focus on the methods the algorithm uses to automatically generate features to capture nonlinear structure in data, as well as the process by which it’s trained. Major aspects of that are the compositional transformations over the data, convolution, and distributed backpropagation via SGD with adaptive gradients and an adaptive learning rate. Applications will look into how to use convolutional neural networks to model data in computer vision, natural language and signal processing. Details around optimal preprocessing, the type of structure that can be learned, and managing its ability to generalize will inform developers looking to apply nonlinear modeling tools to problems that they face.
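The "SGD with adaptive gradients and an adaptive learning rate" mentioned above is, in its simplest form, an AdaGrad-style rule: each parameter's step is scaled down by the history of its own squared gradients, so frequently-updated coordinates slow down while rare ones keep moving. A minimal sketch (the talk's distributed version aggregates gradients across partitions before applying this update):

```python
import math

def adagrad_step(params, grads, cache, lr=0.1, eps=1e-8):
    """One AdaGrad update: per-parameter learning rate decays with
    the accumulated squared gradient history stored in `cache`."""
    for i, g in enumerate(grads):
        cache[i] += g * g
        params[i] -= lr * g / (math.sqrt(cache[i]) + eps)
    return params

params = [1.0, 1.0]
cache = [0.0, 0.0]
adagrad_step(params, [0.5, 2.0], cache)
# On the very first step each coordinate moves by ~lr regardless of
# gradient magnitude, since the step normalizes by |g| itself.
print(params)
```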
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal (Srivatsan Ramanujam)
These slides give an overview of the technology and the tools used by Data Scientists at Pivotal Data Labs. This includes Procedural Languages like PL/Python, PL/R, PL/Java, PL/Perl and the parallel, in-database machine learning library MADlib. The slides also highlight the power and flexibility of the Pivotal platform from embracing open source libraries in Python, R or Java to using new computing paradigms such as Spark on Pivotal HD.
Ray (https://github.com/ray-project/ray) is a framework developed at UC Berkeley and maintained by Anyscale for building distributed AI applications. Over the last year, the broader machine learning ecosystem has been rapidly adopting Ray as the primary framework for distributed execution. In this talk, we will overview how libraries such as Horovod (https://horovod.ai/), XGBoost, and Hugging Face Transformers, have integrated with Ray. We will then showcase how Uber leverages Ray and these ecosystem integrations to simplify critical production workloads at Uber. This is a joint talk between Anyscale and Uber.
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer... (MLconf)
Spark and GraphX in the Netflix Recommender System: We at Netflix strive to deliver maximum enjoyment and entertainment to our millions of members across the world. We do so by having great content and by constantly innovating on our product. A key strategy to optimize both is to follow a data-driven method. Data allows us to find optimal approaches to applications such as content buying or our renowned personalization algorithms. But, in order to learn from this data, we need to be smart about the algorithms we use, how we apply them, and how we can scale them to our volume of data (over 50 million members and 5 billion hours streamed over three months). In this talk we describe how Spark and GraphX can be leveraged to address some of our scale challenges. In particular, we share insights and lessons learned on how to run large probabilistic clustering and graph diffusion algorithms on top of GraphX, making it possible to apply them at Netflix scale.
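Graph diffusion of the kind mentioned can be sketched as iterated neighbor averaging: each node's value blends with the mean of its neighbors' values. A toy pure-Python version of one diffusion step (GraphX would express the same computation with aggregateMessages over a partitioned edge list; the graph and mixing weight here are illustrative):

```python
def diffuse(values, edges, alpha=0.5):
    """One diffusion step: blend each node's value with its neighbors' mean."""
    neighbors = {v: [] for v in values}
    for a, b in edges:  # undirected edges
        neighbors[a].append(b)
        neighbors[b].append(a)
    new = {}
    for v, old in values.items():
        if neighbors[v]:
            mean = sum(values[u] for u in neighbors[v]) / len(neighbors[v])
            new[v] = (1 - alpha) * old + alpha * mean
        else:
            new[v] = old
    return new

values = {"a": 1.0, "b": 0.0, "c": 0.0}
print(diffuse(values, [("a", "b"), ("b", "c")]))
```

Iterating this step spreads mass from "a" through the chain; at Netflix scale the same per-edge message pattern runs over billions of edges.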
How Machine Learning and AI Can Support the Fight Against COVID-19 (Databricks)
In this session, we show how to leverage the CORD dataset, containing more than 400,000 scientific papers on COVID and related topics, and recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease.
The idea explored in our talk is to apply modern NLP methods, such as named entity recognition (NER) and relation extraction, to articles' abstracts (and, possibly, full text) to extract meaningful insights and to enable semantically rich search over the paper corpus. We first investigate how to train a NER model using the Medical NER dataset from Kaggle and a specialized version of BERT (PubMedBERT) as a feature extractor, allowing automatic extraction of entities such as medical condition names, medicine names, and pathogens. Entity extraction alone can provide some interesting findings, such as how approaches to COVID treatment evolved over time in terms of mentioned medicines. We demonstrate how to use Azure Machine Learning to train the model.
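The shape of the entity-extraction output can be shown with a toy dictionary matcher; this stands in for the PubMedBERT-based model purely for illustration, and the entity lists are invented:

```python
# Hypothetical gazetteers; a trained NER model replaces these lookups.
MEDICINES = {"remdesivir", "dexamethasone"}
PATHOGENS = {"sars-cov-2"}

def extract_entities(text):
    """Tag known medicine and pathogen mentions in an abstract."""
    entities = []
    for token in text.lower().replace(",", " ").split():
        if token in MEDICINES:
            entities.append((token, "MEDICINE"))
        elif token in PATHOGENS:
            entities.append((token, "PATHOGEN"))
    return entities

print(extract_entities("Remdesivir shows activity against SARS-CoV-2"))
# [('remdesivir', 'MEDICINE'), ('sars-cov-2', 'PATHOGEN')]
```

The per-paper entity lists produced this way are what get aggregated over time to study, e.g., which medicines dominate the COVID-treatment discussion in each month.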
To take this investigation one step further, we also investigate the use of pre-trained medical models, available as the Text Analytics for Health service on the Microsoft Azure cloud. In addition to many entity types, it can also extract relations (such as the dosage of a medicine), entity negation, and entity mappings to well-known medical ontologies. We investigate the best way to use Azure ML at scale to score a large paper collection and to store the results.
This contains the agenda of the Spark Meetup I organised in Bangalore on Friday, the 23rd of Jan 2014, along with the slides for the talk I gave on distributed deep learning over Spark.
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter (Databricks)
Deep learning has shown tremendous successes, yet it often requires a lot of effort to leverage its power. Existing deep learning frameworks require writing a lot of code to run a model, let alone in a distributed manner. Deep Learning Pipelines is a Spark Package library that makes practical deep learning simple, based on the Spark MLlib Pipelines API. Leveraging Spark, Deep Learning Pipelines scales out many compute-intensive deep learning tasks. In this talk we dive into: the various use cases of Deep Learning Pipelines, such as prediction at massive scale, transfer learning, and hyperparameter tuning, many of which can be done in just a few lines of code; how to work with complex data such as images in Spark and Deep Learning Pipelines; and how to deploy deep learning models through familiar Spark APIs such as MLlib and Spark SQL to empower everyone from machine learning practitioners to business analysts. Finally, we discuss integration with popular deep learning frameworks.
Snorkel: Dark Data and Machine Learning with Christopher Ré (Jen Aman)
Building applications that can read and analyze a wide variety of data may change the way we do science and make business decisions. However, building such applications is challenging: real world data is expressed in natural language, images, or other “dark” data formats which are fraught with imprecision and ambiguity and so are difficult for machines to understand. This talk will describe Snorkel, whose goal is to make routine Dark Data and other prediction tasks dramatically easier. At its core, Snorkel focuses on a key bottleneck in the development of machine learning systems: the lack of large training datasets. In Snorkel, a user implicitly creates large training sets by writing simple programs that label data, instead of performing manual feature engineering or tedious hand-labeling of individual data items. We’ll provide a set of tutorials that will allow folks to write Snorkel applications that use Spark.
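Snorkel's core move, replacing hand labels with many noisy labeling functions whose votes are combined into training labels, can be sketched like this. The real system learns an accuracy for each function rather than taking a plain majority vote, and the functions below are invented examples:

```python
ABSTAIN = None

def lf_keyword(text):
    """Vote spam (1) if a spam phrase appears, else abstain."""
    return 1 if "free money" in text else ABSTAIN

def lf_short(text):
    """Very short messages: weakly vote not-spam (0)."""
    return 0 if len(text) < 20 else ABSTAIN

def majority_label(text, lfs):
    """Combine labeling-function votes; Snorkel instead fits a label model."""
    votes = [lf(text) for lf in lfs]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

lfs = [lf_keyword, lf_short]
print(majority_label("win free money now!!", lfs))  # 1
print(majority_label("hi", lfs))                    # 0
```

Applied over a large unlabeled corpus (e.g. as a Spark map), this produces the probabilistic training set that a downstream model is then trained on.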
Snorkel is open source on GitHub and available from Snorkel.Stanford.edu.
Running Emerging AI Applications on Big Data Platforms with Ray on Apache Spark (Databricks)
With the rapid evolution of AI in recent years, we need to embrace advanced and emerging AI technologies to gain insights and make decisions based on massive amounts of data. Ray (https://github.com/ray-project/ray) is a fast and simple framework open-sourced by UC Berkeley RISELab particularly designed for easily building advanced AI applications in a distributed fashion.
BKK16-408B Data Analytics and Machine Learning From Node to Cluster (Linaro)
Linaro is building an OpenStack based Developer Cloud. Here we present what was required to bring OpenStack to 64-bit ARM, the pitfalls, successes and lessons learnt; what’s missing and what’s next.
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms (DataStax Academy)
Apache Spark has grown to be one of the largest open source communities in big data, with over 190 developers and dozens of companies contributing. The latest 1.0 release alone includes contributions from 117 people. A clean API, interactive shell, distributed in-memory computation, stream processing, interactive SQL, and libraries delivering everything from machine learning to graph processing make it an excellent unified platform to solve a number of problems. Apache Spark works very well with a growing number of big data solutions, including Cassandra and Hadoop. Come learn about Apache Spark and see how easy it is for you to get started using Spark to build your own high performance big data applications today.
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin (Alex Zeltov)
This workshop will provide an introduction to Big Data Analytics using Apache Spark and Apache Zeppelin.
https://github.com/zeltovhorton/intro_spark_zeppelin_meetup
There will be a short lecture that includes an introduction to Spark and its components.
Spark is a unified framework for big data analytics. Spark provides one integrated API for use by developers, data scientists, and analysts to perform diverse tasks that would have previously required separate processing engines such as batch analytics, stream processing and statistical modeling. Spark supports a wide range of popular languages including Python, R, Scala, SQL, and Java. Spark can read from diverse data sources and scale to thousands of nodes.
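Spark's single API for these diverse tasks rests on a small set of data-parallel primitives; the canonical example is word count via map and reduceByKey. A pure-Python sketch of what those primitives compute (in real Spark each phase runs across partitions on the cluster, with a shuffle before the reduce):

```python
from collections import defaultdict

def map_phase(lines):
    """flatMap + map: emit (word, 1) pairs."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_by_key(pairs):
    """Sum values per key, as Spark's reduceByKey does after a shuffle."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

lines = ["big data", "big spark"]
print(reduce_by_key(map_phase(lines)))  # {'big': 2, 'data': 1, 'spark': 1}
```

The PySpark equivalent is a one-liner over an RDD of lines, which is precisely why Spark can subsume batch, streaming, and SQL workloads under one programming model.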
The lecture will be followed by a demo. There will be a short lecture on Hadoop and how Spark and Hadoop interact and complement each other. You will learn how to move data into HDFS using Spark APIs, create a Hive table, explore the data with Spark and SQL, transform the data, and then issue some SQL queries. We will be using Scala and/or PySpark for the labs.
Unified Big Data Processing with Apache Spark (C4Media)
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1yNuLGF.
Matei Zaharia talks about the latest developments in Spark and shows examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code. Filmed at qconsf.com.
Matei Zaharia is an assistant professor of computer science at MIT, and CTO of Databricks, the company commercializing Apache Spark.
The world has changed, and having one huge server won’t do the job anymore; when you’re talking about vast amounts of data that keep growing, the ability to scale out will be your savior. Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing.
This lecture will be about the basics of Apache Spark and distributed computing and the development tools needed to have a functional environment.
In the past, emerging technologies took years to mature. In the case of big data, while effective tools are still emerging, the analytics requirements are changing rapidly resulting in businesses to either make it or be left behind
Demi Ben Ari - Apache Spark 101 - First Steps into distributed computing:
The world has changed, having one huge server won’t do the job, the ability to Scale Out would be your savior. Apache Spark is a fast and general engine for big data processing, with streaming, SQL, machine learning and graph processing. Showing the basics of Apache Spark and distributed computing.
Demi is a Software engineer, Entrepreneur and an International Tech Speaker.
Demi has over 10 years of experience in building various systems both from the field of near real time applications and Big Data distributed systems.
Co-Founder of the “Big Things” Big Data community and Google Developer Group Cloud.
Big Data Expert, but interested in all kinds of technologies, from front-end to backend, whatever moves data around.
1. Big Data @ LIST
- Overview -
Dr. Francesco Bongiovanni
2. Credentials
Dr. Francesco Bongiovanni
B.Sc. in Computer Systems from Haute Ecole de Namur (Belgium)
ECTS certificate from Kemi-Tornio Univ. of Applied Sciences (Finland)
M.Sc. in Software Engineering of Distributed Systems, KTH - Royal Institute of Technology (Sweden)
Ph.D. in Computer Science from INRIA Sophia-Antipolis and Univ. of Nice Sophia Antipolis (France) - Oasis team (now Scale)
Post-Doc @ INRIA Grenoble and Joseph Fourier University (UJF) - Erods team (formerly Sardes)
Post-Doc @ Verimag Laboratory (CNRS) - Synchrone team
Expertise
● Scalable distributed systems & algorithms
○ P2P systems (Structured/Unstructured)
● Cloud computing
● Applied formal methods (Isabelle Theorem Prover / TLA+)
● Scalable simulations
● Distributed optimizations
● Cluster computing
francesco.bongiovanni@list.lu (Scholar / homepage / LinkedIn)
3. Forewords and disclaimer
● The following frameworks and tools are all open source projects (a big chunk comes from the Apache Software Foundation).
● This presentation is just a glimpse of what is available as of today.
● Full-blown details are intentionally omitted (configurations, setups, debugging bindings/frameworks, advanced examples... which took months to get right).
As a user of this software stack, you are mainly concerned about two things:
1. Programmability: does this stack provide me with the necessary programming abstractions/tools for expressing my problems?
2. Fault tolerance: if I run some tasks and a process or computer fails, does that affect my computation(s)?
→ Lots of abstractions (for computations on big graphs, iterative algorithms on large datasets,...) are provided through the various distributed frameworks.
→ Fault tolerance is baked into most of the presented frameworks.
4. Scientific processing
Scientists want to spend their time exploring, not coding nor waiting for their computations to be done.
[Workflow diagram: Idea → Write code → Run code → Study results → Publish paper. Running and waiting is unproductive time, and the code is often never run again.]
5. The free lunch is over (and has been since 2005)!
● The number of cores is now typically 2-12 *
● Moore’s law continues
○ the number of transistors doubles every 18 months
Holy Grail: develop software whose performance scales with the number of cores
=> Parallelize or perish...
* Intel announced on 16/02/2015 that an 18-core Xeon chip would be available before the summer
13. eScience Cluster (for prototyping purposes)
Mission
● Provide the team with Big Data programming
capabilities
● Program the cluster as if it was one (big) computer
○ Reliable & Scalable Programming Abstractions
● Co-locate various distributed frameworks efficiently
14. Hardware specs:
● 3 * 2 X QuadCore E5440 @ 2.83 GHz, 40 GB RAM
● 1 * 2 X QuadCore (dualCore) E5550 @ 2.67 GHz, 48 GB RAM
=> Total: 40 CPUs, 168 GB RAM, +/- 1 TB storage, GLAN
Apache Mesos
● Resource-efficient scheduler
● Reliable & scalable
● Powered by Mesos: Hadoop, Chronos, Marathon, Storm, Spark, Aurora, Jenkins,…
Programming capabilities:
● Streaming: able to process data streams (news feeds, Twitter feeds,...)
● Iterative: able to program iterative algorithms (PageRank, K-means,...)
● Interactive: able to interactively query large volumes of data
● …
16. HDFS, v 1.2.1
● Hadoop Distributed File System -> scalable, reliable file system through replication
○ note: a file is immutable once stored on HDFS
● +/- 1 TB of available storage for our datasets (wiki dumps, geographical datasets,...)
● Used by Spark and other frameworks to store/retrieve datasets
[Stack diagram, detailed overview: HDFS at the bottom, Mesos above it, then Spark with Spark Streaming (real-time), Spark SQL (SQL), GraphX (graph) and MLlib (machine learning); alongside: Cray Chapel, ZooKeeper, SparkR, H2O, ...]
17. Mesos, v 0.25.0
● Distributed OS scheduler & task executor
● Leverages Linux cgroups, no virtualization required
● Implements the Dominant Resource Fairness algorithm for fair task allocation
● See the data center as one big computer
● Uses Apache ZooKeeper for master fail-over (distributed coordination)
● !! Common resource layer over which diverse frameworks can run !!
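The Dominant Resource Fairness idea, in rough terms: repeatedly hand the next task to the framework whose dominant share (its largest fraction of any single cluster resource) is currently smallest. A minimal plain-Python sketch of that allocation loop follows; it is not Mesos code, and the capacity/demand numbers are the illustrative example from the DRF paper, not this cluster's:

```python
# Minimal sketch of Dominant Resource Fairness (DRF) allocation.
# Not Mesos code; capacities and demands are illustrative only.

def drf_allocate(capacity, demands, max_rounds=100):
    """Repeatedly give a task to the user with the lowest dominant share."""
    usage = {user: [0.0] * len(capacity) for user in demands}
    total = [0.0] * len(capacity)
    tasks = {user: 0 for user in demands}
    for _ in range(max_rounds):
        # Dominant share = max over resources of (used / capacity).
        def dominant(user):
            return max(u / c for u, c in zip(usage[user], capacity))
        # Only users whose next task still fits in the cluster are eligible.
        eligible = [u for u in demands
                    if all(t + n <= c
                           for t, n, c in zip(total, demands[u], capacity))]
        if not eligible:
            break
        user = min(eligible, key=dominant)
        usage[user] = [u + n for u, n in zip(usage[user], demands[user])]
        total = [t + n for t, n in zip(total, demands[user])]
        tasks[user] += 1
    return tasks

# Example from the DRF paper: 9 CPUs and 18 GB RAM; user A wants
# <1 CPU, 4 GB> per task, user B wants <3 CPUs, 1 GB> per task.
print(drf_allocate([9.0, 18.0], {"A": [1.0, 4.0], "B": [3.0, 1.0]}))
```

On this input the loop ends up giving A three tasks and B two, which equalizes their dominant shares, matching the worked example in the DRF paper.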
18. Spark, v 1.5.1 *
● Fast and general engine for large-scale data processing
● Leverages in-memory computing
● Apps can be written in Java / Python / Scala
● Combines SQL, streaming and complex analytics
● Can read from HDFS, Cassandra, HBase, S3…
● Integrates with YARN and Mesos
* discovered a blocking bug (MAJOR) in the previous release (SPARK-1052) and helped the community narrow it down
19. Spark embeds specialized frameworks:
● Spark Streaming: leverages Spark to perform computations on streaming data (Twitter feeds, raw network greps, …)
● Spark SQL: large-scale data warehouse system for Spark that can execute HiveQL queries up to 100x faster than Hive
● GraphX: library for large-scale graph processing. PageRank, for instance, can easily be implemented using GraphX (< 40 LoC)
● MLlib: Spark implementation of some common machine learning (ML) functionality (linear regression, binary classification, clustering,...)
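As a point of comparison for the "< 40 LoC" GraphX claim, the core PageRank iteration is short in any language. A plain-Python sketch on a toy adjacency list (this is not the GraphX API, just the algorithm it distributes):

```python
# PageRank power iteration on an adjacency list (plain Python, not
# GraphX): each node splits its rank evenly among its out-links.

def pagerank(links, iters=20, d=0.85):
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        contrib = {n: 0.0 for n in nodes}
        for n, outs in links.items():
            for m in outs:
                contrib[m] += rank[n] / len(outs)
        # Damping: mix a uniform jump with the received contributions.
        rank = {n: (1 - d) / len(nodes) + d * contrib[n] for n in nodes}
    return rank

# Toy graph: b and c both link to a, a links back to b.
ranks = pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
print(ranks)
```

On this toy graph, a ends up ranked highest (it receives two in-links) and c lowest (it receives none); the ranks sum to 1.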
20. SparkR, v 0.1
- R front-end for Spark (linked against Spark 1.1.0)
- Leverages the Spark engine for distributed/parallel computations
- Allows R functions to run against larger datasets
As of Spark 1.4, SparkR has been integrated into Spark.
21. - Emerging open source parallel programming language from Cray Inc.
“Chapel's primary goal is to make parallel programming far more productive, from multicore desktops and laptops to commodity clusters and the cloud to high-end supercomputers”
- Thanks to (working*) bindings in Go, Chapel programs can run on top of Mesos
- Partitioned Global Address Space (PGAS) model, i.e. threads/processes/tasks share one big address space across nodes
* Found buggy bindings online, managed to make them work (https://github.com/francesco-bongiovanni/mesos-chapel)
Chapel, v 1.9 (http://chapel.cray.com/index.html)
23. Other frameworks can be built and run on top of Mesos:
Aurora & Marathon: Mesos frameworks for long-running apps (Unix equivalent of “init.d”)
Exelixi: distributed framework for running genetic algorithms at scale
Chronos: distributed fault-tolerant scheduler supporting complex job topologies (Unix equivalent of “cron”)
Storm: distributed real-time computation system (similar to Spark Streaming)
Kafka: high-throughput distributed messaging system
MPI: message-passing system
...port your own…
Distributed frameworks can be written in JVM-based languages (Scala, Clojure, Java, …), Python, C++, Go, Erlang
25. Available Tools and PLs
- an HDFS cluster (with fault tolerance and replication enabled) to store datasets
- Applications using Spark can be programmed in Java, Scala or Python
- For building distributed frameworks on top of Mesos:
- Mesos officially supports C, C++, Java/Scala and Python
- third-party bindings in Go, Erlang, Haskell,...
- IPython Notebook working with Spark (http://ipython.org/notebook.html)
- Jupyter notebook with the Spark kernel is also available (http://www.jupyter.org)
- RStudio bundled with the SparkR package
- there is also an RStudio Server installed on the cluster
- Compilers/VMs within the cluster (and on the coming VM):
- Go 1.3.1, R 3.1.1, GCC 4.8.2 (C++11 fully supported), JDK 1.7.0_65, Scala 2.10.3, Python 2.7.6, Chapel 1.9 with the GASNet network library (high-performance networking), Erlang 17.3
27. Results from the trenches

Class of problem        | Use case                                                                          | Before        | After               | Improvement
Statistics              | Averaging 1 billion double values                                                 | 15-20 minutes | 10 seconds          | 100x speedup
Visualization algorithm | Weighted Maps algorithm                                                           | 36K elements  | 2.4M elements       | 66.6x more elements
Clustering algorithm    | Pearson correlation coefficient on (31K rows x 10 cols) full matrix: 991M tuples  | NA            | 6.9 min             | making it possible
Clustering algorithm    | K-means on forest data type (508K rows x 55 cols)                                 | NA            | 43.5 secs (average) | making it possible
Clustering algorithm    | CoAbundance clustering on 31K rows x 7 cols                                       | 5 minutes     | 20 seconds          | 15x speedup
28. Example #1 - Averaging 1 billion elements
Problem: a 17 GB text file with 10^9 double values.
GOAL: compute the average of these elements.
Steps: parseToDouble → sumAllElements → divideBy10^9
30. Example #1 - Averaging 1 billion elements
Back-of-the-envelope calculations: reading a 17 GB file on a single computer
reading 1 MB -> 30 ms
reading 17 GB -> 30 * 17,000 = 510,000 ms = 510 secs = 8 minutes 30 secs
31. Example #1 - Averaging 1 billion elements
Back-of-the-envelope calculations: reading a 17 GB file from a 4-node HDFS cluster
Reading 1 MB from the network -> 10 ms
Reading 1 MB from local disk -> 30 ms
Reading 17 GB from the cluster -> 40 * (17,000 / 4) = 170,000 ms = 2 min 50 secs
32. Example #1 - Averaging 1 billion elements
Back-of-the-envelope calculations:
Reading a 17 GB file on 1 node: 8 min 30 secs
Reading a 17 GB file from a 4-node cluster: 2 min 50 secs
That’s a 3x speed improvement...
33. Example #1 - Averaging 1 billion elements
Back-of-the-envelope calculations: reading a 17 GB file from a 4-node HDFS cluster IN MEMORY
Reading 1 MB from the network -> 10 ms
Reading 1 MB from local memory -> 250 µs = 0.25 ms
Reading 17 GB from the cluster -> 10.25 * (17,000 / 4) = 43,562 ms = 43.5 secs
34. Example #1 - Averaging 1 billion elements
Back-of-the-envelope calculations:
Reading a 17 GB file on 1 node: 8 min 30 secs
Reading a 17 GB file from a 4-node cluster IN MEMORY: 43.5 secs
That’s almost a 12x speed improvement...
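These estimates can be checked mechanically. A small Python script reproducing the arithmetic (the per-MB latencies are the slides' rough assumptions, not measurements):

```python
# Back-of-the-envelope check of the reading-time estimates above.
# Per-MB latencies are the slides' rough assumptions, not measurements.
SIZE_MB = 17 * 1000       # 17 GB expressed in MB (decimal, as on the slides)
DISK_MS_PER_MB = 30       # read 1 MB from local disk
NET_MS_PER_MB = 10        # read 1 MB over the network
MEM_MS_PER_MB = 0.25      # read 1 MB from local memory
NODES = 4

single_node = SIZE_MB * DISK_MS_PER_MB / 1000                           # seconds
cluster_disk = (SIZE_MB / NODES) * (NET_MS_PER_MB + DISK_MS_PER_MB) / 1000
cluster_mem = (SIZE_MB / NODES) * (NET_MS_PER_MB + MEM_MS_PER_MB) / 1000

print(single_node)                   # 510.0   (8 min 30 s)
print(cluster_disk)                  # 170.0   (2 min 50 s)
print(cluster_mem)                   # 43.5625 (about 43.5 s)
print(single_node / cluster_disk)    # 3.0     (the 3x improvement)
print(single_node / cluster_mem)     # ~11.7   (the "almost 12x" improvement)
```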
35. Example #1 - Averaging 10^9 elements
On a single machine: 15-20 min for read/parse/avg, for each operation (avg, sum of squares,...)
Using Spark on the cluster: 1st iteration: read/parse/persist in memory/avg; following iterations: < 10 secs for any similar operation
36. Example #1 - Averaging 10^9 elements

// read the file from HDFS
val rawDistData = sc.textFile("hdfs://10.10.0.141:9000/datasets/1B.txt")
// convert the values to Double
val distData: org.apache.spark.rdd.RDD[Double] = rawDistData.map(_.toDouble)
// put the data in cache, i.e. in memory
distData.cache
// compute the sum and the average
distData.map(x => x).reduce(_ + _) / 1000000000
// average of the sum of squares
distData.map(x => x * x).reduce(_ + _) / 1000000000
...
37. Example #1 - Averaging 1 billion elements
[Dataflow diagram, 17 GB file: read local chunks from HDFS → put data in memory → compute avg → send results back → aggregate result]
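The pattern in this dataflow (each node reduces its local chunk to a partial result, and the driver combines the partials) can be sketched in plain Python, with in-memory chunks standing in for HDFS blocks:

```python
# Each "node" reduces its local chunk to (sum, count); the driver then
# aggregates the partials. A plain-Python stand-in for the Spark dataflow.

def partial_sum(chunk):
    return sum(chunk), len(chunk)

def distributed_average(chunks):
    partials = [partial_sum(c) for c in chunks]   # would run on the workers
    total = sum(s for s, _ in partials)           # driver-side aggregation
    count = sum(n for _, n in partials)
    return total / count

chunks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0, 7.0]]
print(distributed_average(chunks))   # 4.0
```

Returning (sum, count) pairs rather than per-chunk averages is what makes the combination step correct for unevenly sized chunks.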
39. Example #2 - WeightedMaps
Weighted Maps: treemap visualization of geolocated quantitative data (by WP4 colleagues)
● French communes with 36K elements
○ Does it scale to a larger number of elements?
40. Example #2 - WeightedMaps
● Tested with 2 million elements
○ beyond 2×10^6 elements, bottlenecks occur in the spatial division (but may be due to lack of memory) -> under investigation
● No modifications to the WM source implementation
● `just` branch it with Spark
○ minor modifications for data-retrieval convenience
○ most of the time spent on tweaking Spark (data serialization, memory consumption, GC heap space...)
note: the data makes no sense, a union of unions of French communes + US counties,...
41. Example #2 - WeightedMaps
● Tested with 2.4 million elements
○ beyond 2.4×10^6 elements, bottlenecks occur in the spatial division (but may be due to lack of memory) -> under investigation
● No modifications to the WM source implementation
● `just` branch it with Spark
○ minor modifications for data-retrieval convenience
○ most of the time spent on tweaking Spark (data serialization, memory consumption, GC heap space...)
note: the data makes no sense, a union of unions of French communes + US counties,...
42. Example #2 - WeightedMaps
● < 150 LoC: from parsing to painting!
○ time perf. for 2.4×10^6 elements: +/- 145 secs
● < 18 secs is spent on the Spark side (no optimization whatsoever)
Next steps in the pipeline:
● Mine the GeoNames dataset with Spark
○ extract meaningful data
● Spark implementation of WeightedMap
○ scale WM to the next level
○ rewrite the WM source implementation to leverage Spark data structures (RDDs)
43. Example #3 - Computing Pearson’s correlation coefficient on EVA contigs
Data: EVA contigs, 31K rows by 10 columns
Pearson correlation done on a full matrix, i.e. 31K * 31K = 991,179,289 vector tuples (991 million)
`Naive` implementation took 6.9 minutes
Leverages the ScalaNLP library
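For reference, Pearson's correlation coefficient for two vectors is their covariance divided by the product of their standard deviations. A plain-Python version of the per-pair computation (the slide's implementation uses the ScalaNLP library, not this code):

```python
import math

# Pearson correlation coefficient between two equal-length vectors:
# cov(x, y) / (std(x) * std(y)), always in [-1, 1].

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pearson([1, 2, 3], [2, 4, 6]))   # 1.0 (perfectly correlated)
print(pearson([1, 2, 3], [3, 2, 1]))   # -1.0 (perfectly anti-correlated)
```

The 991M-tuple figure on the slide comes from evaluating this pairwise function over every (row, row) combination of the 31K-row matrix.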
44. Example #4 - K-Means Clustering on Spark
Data: forest dataset (Covertype Data Set), 581,012 rows by 55 columns
Parsing the data lines into a proper Vector using Scala/Spark
45. Example #4 - K-Means Clustering on Spark
Data: forest dataset (Covertype Data Set), 581,012 rows by 54 columns
K = 7, max iterations = 100
Mean running time: +/- 43.65 secs
Implementation based on the k-means|| algorithm by Bahmani et al. (VLDB 2012)
→ When multiple concurrent runs are requested, they are executed together with joint passes over the data for efficiency
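The k-means|| algorithm referenced above concerns the initialization of the centers; the iteration that then runs is the classic Lloyd loop (assign each point to its nearest center, then move each center to the mean of its cluster). A plain-Python sketch on toy 1-D data, with fixed initial centers for determinism (this illustrates the algorithm, not MLlib's code):

```python
# Lloyd-style k-means iteration in plain Python (toy 1-D illustration
# of what MLlib distributes; MLlib adds k-means|| initialization on top).

def kmeans(points, centers, iters=10):
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster
        # (keep the old center if a cluster ends up empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

pts = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
print(sorted(kmeans(pts, [0.0, 10.0])))   # centers near 1.0 and 9.0
```

On a cluster, the assignment step is data-parallel over the points and only the per-cluster (sum, count) pairs are shipped back, which is why the algorithm maps well onto Spark.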
46. Ongoing work - CoAbundance clustering
Hydviga project: biogas production, contig binning
● Visual Analytics-based tool (Parallel Coordinates)
● Data clustering algorithm (in R): great for prototyping, not so great for handling large data sets...
Let’s rewrite it: formal specs + Spark-based implementation
47. Ongoing work - CoAbundance clustering
Using TLA+ and PlusCal:
- simulation
- model checking
=> helps you think above the code level
=> think of it as executable pseudo-code which can be checked
48. A kind reminder
● I am NOT a distributed/parallel programming fanatic
● YOU can do a lot with modern multicore machines, provided:
○ you know your way around your algorithms and concurrent programming
○ you have a good idea of how your machine works
=> Don’t come to me:
- if your data is too small
- if your data fits in your computer and it’s a one-time thing
=> Come to me:
- if your data does not fit into your memory/disk
- if your computations are really expensive and frequent
However, even for small data there are benefits to using Spark locally on your machine: behind the scenes there is the Actor model of computation, so you can use it for `simple` parallel programming.
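Even without Spark, the same split-and-aggregate pattern pays off on a single multicore machine. A sketch using only Python's standard library (the data and worker count are made up for illustration):

```python
from multiprocessing import Pool

# Sum chunks of a list in parallel on one multicore machine: the same
# partial-result pattern as the cluster examples, but in-process.

def chunk_sum(chunk):
    return sum(chunk)

def parallel_sum(data, workers=4):
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(workers) as pool:
        # Each worker process reduces one chunk; the parent aggregates.
        return sum(pool.map(chunk_sum, chunks))

if __name__ == "__main__":
    print(parallel_sum(list(range(100000))))   # 4999950000
```

For CPU-bound work this sidesteps Python's GIL by using processes rather than threads; for genuinely large data, the cluster frameworks above take over.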
[Abstraction stack: Problem → Algorithm → Program → Instruction Set Architecture (ISA) → Microarchitecture → Circuits → Electrons]
49. Some references
Foundational paper about Spark:
Zaharia, Matei, et al. "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing." Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI 12). USENIX Association, 2012.
Foundational paper about Spark Streaming:
Zaharia, Matei, et al. "Discretized streams: Fault-tolerant streaming computation at scale." Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP 13). ACM, 2013.
Paper about Chapel:
Chamberlain, Bradford L., David Callahan, and Hans P. Zima. "Parallel programmability and the Chapel language." International Journal of High Performance Computing Applications 21.3 (2007): 291-312.
Paper about GraphX:
Gonzalez, Joseph E., et al. "GraphX: Graph processing in a distributed dataflow framework." Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). USENIX Association, 2014.
Paper about Spark SQL:
Xin, Reynold S., et al. "Shark: SQL and rich analytics at scale." Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, 2013.
http://spark.apache.org
http://mesos.apache.org
http://chapel.cray.com