1. Introduction to SparkR
2. Demo
Starting to use SparkR
DataFrames: dplyr style, SQL style
RDD vs. DataFrames
SparkR on MLlib: GLM, K-means
3. Use Case
Median: approxQuantile()
ID Match: dplyr style, SQL style, SparkR function
SparkR + Shiny
4. The Future of SparkR
Yarn Resource Management Using Machine Learning (ojavajava)
HadoopCon 2016 in Taiwan - Maximizing the utilization of Hadoop computing power is the biggest challenge for a Hadoop administrator. In this talk I explain how we use machine learning to build a prediction model for computing-power requirements and set the MapReduce scheduler parameters dynamically, to fully utilize our Hadoop cluster's computing power.
Query optimizers and people have one thing in common: the better they understand their data, the better they can do their jobs. Optimizing queries is hard if you don't have good estimates for the sizes of the intermediate join and aggregate results. Data profiling is a technique that scans data, looking for patterns within the data such as keys, functional dependencies, and correlated columns. These richer statistics can be used in Apache Calcite's query optimizer, and the projects that use it, such as Apache Hive, Phoenix and Drill. We describe how we built a data profiler as a table function in Apache Calcite, review the recent research and algorithms that made it possible, and show how you can use the profiler to improve the quality of your data.
Keeping Spark on Track: Productionizing Spark for ETL (Databricks)
ETL is the first phase when building a big data processing platform. Data is available from various sources and formats, and transforming the data into a compact binary format (Parquet, ORC, etc.) allows Apache Spark to process it in the most efficient manner. This talk will discuss common issues and best practices for speeding up your ETL workflows, handling dirty data, and debugging tips for identifying errors.
Speakers: Kyle Pistor & Miklos Christine
This talk was originally presented at Spark Summit East 2017.
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J... (Databricks)
Watch video at: http://youtu.be/Wg2boMqLjCg
Want to learn how to write faster and more efficient programs for Apache Spark? Two Spark experts from Databricks, Vida Ha and Holden Karau, provide some performance tuning and testing tips for your Spark applications
R is a very popular platform for Data Science. Apache Spark is a highly scalable data platform. How could we have the best of both worlds? How could a Data Scientist leverage the rich 10000+ packages on CRAN, and integrate Spark into their existing Data Science toolset?
SparkR is a new language binding for Apache Spark, designed to be familiar to native R users. In this talk we will walk through many examples of how several new features in Apache Spark 2.x enable scalable machine learning on Big Data. In addition to covering the R interface to the ML Pipeline model, we will explore how SparkR supports running user code on large-scale data in a distributed manner, give examples of how that can be used to work with your favorite R packages, and discuss best practices for using this new feature. We will also look at changes already in, and coming next in, the Apache Spark 2.x releases.
Frustration-Reduced PySpark: Data engineering with DataFrames (Ilya Ganelin)
In this talk I discuss my recent experience working with Spark DataFrames in Python. For DataFrames, the focus will be on usability; specifically, a lot of the documentation does not cover common use cases such as the intricacies of creating data frames, adding or manipulating individual columns, and doing quick-and-dirty analytics.
A really really fast introduction to PySpark - lightning fast cluster computi... (Holden Karau)
Apache Spark is a fast and general engine for distributed computing & big data processing with APIs in Scala, Java, Python, and R. This tutorial will briefly introduce PySpark (the Python API for Spark) with some hands-on-exercises combined with a quick introduction to Spark's core concepts. We will cover the obligatory wordcount example which comes in with every big-data tutorial, as well as discuss Spark's unique methods for handling node failure and other relevant internals. Then we will briefly look at how to access some of Spark's libraries (like Spark SQL & Spark ML) from Python. While Spark is available in a variety of languages this workshop will be focused on using Spark and Python together.
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2kyRTuW
This CloudxLab Advanced Spark Programming tutorial helps you to understand Advanced Spark Programming in detail. Below are the topics covered in this slide:
1) Shared Variables - Accumulators & Broadcast Variables
2) Accumulators and Fault Tolerance
3) Custom Accumulators - Version 1.x & Version 2.x
4) Examples of Broadcast Variables
5) Key Performance Considerations - Level of Parallelism
6) Serialization Format - Kryo
7) Memory Management
8) Hardware Provisioning
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca... (Databricks)
We can think of an Apache Spark application as the unit of work in complex data workflows. Building a configurable and reusable Apache Spark application comes with its own challenges, especially for developers that are just starting in the domain. Configuration, parametrization, and reusability of the application code can be challenging. Solving these will allow the developer to focus on value-adding work instead of mundane tasks such as writing a lot of configuration code, initializing the SparkSession or even kicking-off a new project.
Using code samples, this presentation describes a developer's journey from the first steps into Apache Spark all the way to a simple open-source framework that can help kick off an Apache Spark project very easily, with a minimal amount of code. The main ideas covered in this presentation are derived from the separation-of-concerns principle.
The first idea is to make it even easier to code and test new Apache Spark applications by separating the application logic from the configuration logic.
The second idea is to make it easy to configure the applications, providing SparkSessions out-of-the-box, easy to set-up data readers, data writers and application parameters through configuration alone.
The third idea is that taking a new project off the ground should be very easy and straightforward. These three ideas are a good start in building reusable and production-worthy Apache Spark applications.
The resulting framework, spark-utils, is already available and ready to use as an open-source project, but even more important are the ideas and principles behind it.
Apache Carbondata: An Indexed Columnar File Format for Interactive Query with... (Spark Summit)
Real-time analytics over large datasets has become an increasingly widespread demand. Over the past several years the Hadoop ecosystem has been continuously evolving: even complex queries over large datasets can now be answered interactively with distributed processing frameworks like Apache Spark, and new storage paradigms have been introduced to support them, such as Apache Parquet and ORC, which provide fast scans over columnar data, and Apache HBase, which offers fast ingest and millisecond-scale random access.
In this talk, we will outline Apache Carbondata, a new addition to the open source Hadoop ecosystem: an indexed columnar file format aimed at bridging the gap to fully enable real-time analytics. It has been deeply integrated with Spark SQL and enables dramatic acceleration of query processing by leveraging efficient encoding/compression and effective predicate push-down through Carbondata's multi-level index technique.
Parallelizing Existing R Packages with SparkR (Databricks)
R is the latest language added to Apache Spark, and the SparkR API is slightly different from PySpark. With the release of Spark 2.0, the R API officially supports executing user code on distributed data. This is done through a family of apply() functions. In this talk, Hossein Falaki gives an overview of this new functionality in SparkR. Using this API requires some changes to regular code with dapply(). This talk will focus on how to correctly use this API to parallelize existing R packages. Most important topics of consideration will be performance and correctness when using the apply family of functions in SparkR.
Speaker: Hossein Falaki
This talk was originally presented at Spark Summit East 2017.
Last year, in Apache Spark 2.0, Databricks introduced Structured Streaming, a new stream processing engine built on Spark SQL, which revolutionized how developers write stream processing applications. Structured Streaming enables users to express their computations the same way they would express a batch query on static data. Developers can express queries using powerful high-level APIs including DataFrames, Datasets and SQL. The Spark SQL engine then converts these batch-like transformations into an incremental execution plan that can process streaming data, while automatically handling late, out-of-order data and ensuring end-to-end exactly-once fault-tolerance guarantees.
Since Spark 2.0, Databricks has been hard at work building first-class integration with Kafka. With this new connectivity, performing complex, low-latency analytics is now as easy as writing a standard SQL query. This functionality, in addition to the existing connectivity of Spark SQL, makes it easy to analyze data using one unified framework. Users can now seamlessly extract insights from data, independent of whether it is coming from messy / unstructured files, a structured / columnar historical data warehouse, or arriving in real-time from Kafka/Kinesis.
In this session, Das will walk through a concrete example where – in less than 10 lines – you read Kafka, parse JSON payload data into separate columns, transform it, enrich it by joining with static data and write it out as a table ready for batch and ad-hoc queries on up-to-the-last-minute data. He’ll use techniques including event-time based aggregations, arbitrary stateful operations, and automatic state management using event-time watermarks.
Apache Spark presentation at HasGeek FifthElephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
Video to talk: https://www.youtube.com/watch?v=gd4Jqtyo7mM
Apache Spark is a next-generation engine for large-scale data processing built with Scala. This talk will first show how Spark takes advantage of Scala's functional idioms to produce an expressive and intuitive API. You will learn about the design of Spark RDDs and how the abstraction enables the Spark execution engine to be extended to support a wide variety of use cases (Spark SQL, Spark Streaming, MLlib and GraphX). The Spark source will be referenced to illustrate how these concepts are implemented in Scala.
http://www.meetup.com/Scala-Bay/events/209740892/
This presentation shows the main Spark characteristics, such as RDDs, Transformations and Actions.
I used this presentation for many Spark intro workshops for the Cluj-Napoca Big Data community: http://www.meetup.com/Big-Data-Data-Science-Meetup-Cluj-Napoca/
Properly shaping partitions and jobs enables powerful optimizations, eliminates skew and maximizes cluster utilization. We will explore various Spark partition-shaping methods along with several optimization strategies, including join optimizations, aggregate optimizations, salting and multi-dimensional parallelism.
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:... (Spark Summit)
We all dread “Lost task” and “Container killed by YARN for exceeding memory limits” messages in our scaled-up spark yarn applications. Even answering the question “How much memory did my application use?” is surprisingly tricky in the distributed yarn environment. Sqrrl has developed a testing framework for observing vital statistics of spark jobs including executor-by-executor memory and CPU usage over time for both the JDK and python portions of pyspark yarn containers. This talk will detail the methods we use to collect, store, and report spark yarn resource usage. This information has proved to be invaluable for performance and regression testing of the spark jobs in Sqrrl Enterprise.
Strata NYC 2015 - Supercharging R with Apache Spark (Databricks)
R is the favorite language of many data scientists. In addition to a language and runtime, R is a rich ecosystem of libraries for a wide range of use cases, from statistical inference to data visualization. However, handling large or distributed data with R is challenging, so most data scientists use R along with other frameworks and languages, and in this mode most of the friction is at the interface between R and the other systems. For example, when data is sampled by a big data platform, results need to be transferred to and imported into R as native data structures. In this talk we show an alternative, and complementary, approach to SparkR for integrating Spark and R.
Since SparkR was released in version 1.4 of Apache Spark, distributed data remains inside the JVM rather than in individual R processes running on the workers. This approach is more convenient when dealing with external data sources such as Cassandra, Hive, and Spark's own distributed DataFrames. We show two specific techniques to remove the data-transfer friction between R and the JVM: collecting Spark DataFrames as R data frames, and user-space filesystems. We think this model complements and improves the day-to-day workload of many data scientists who use R. Spark's interactive query processing, especially with in-memory datasets, closely matches the R interactive session model. When integrated together, Spark and R can provide state-of-the-art tools for the entire end-to-end data science pipeline. We will show how such a pipeline works in real-world use cases in a live demo at the end of the talk.
Using SparkR to Scale Data Science Applications in Production. Lessons from t... (Spark Summit)
R is a hugely popular platform for Data Scientists to create analytic models in many different domains. But when these applications move from the science lab to the production environment of large enterprises, a new set of challenges arises. Independently of R, Spark has been very successful as a powerful general-purpose computing platform. With the introduction of SparkR, an exciting new option to productionize Data Science applications has become available. This talk will give insight into two real-life projects at major enterprises where Data Science applications in R were migrated to SparkR.
• Dealing with platform challenges: R was not installed on the cluster. We show how to execute SparkR on a Yarn cluster with a dynamic deployment of R.
• Integrating Data Engineering and Data Science: we highlight the technical and cultural challenges that arise from closely integrating these two different areas.
• Separation of concerns: we describe how to disentangle ETL and data preparation from analytic computing and statistical methods.
• Scaling R with SparkR: we present what options SparkR offers to scale R applications and how we applied them to different areas such as time series forecasting and web analytics.
• Performance Improvements: we will show benchmarks for an R application that took over 20 hours on a single-server, single-threaded setup. With moderate effort we have been able to reduce that to 15 minutes with SparkR, and we will show how we plan to further reduce it to less than a minute in the future.
• Mixing SparkR, SparkSQL and MLlib: we show how we combined the three different libraries to maximize efficiency.
• Summary and Outlook: we describe what we have learnt so far, what the biggest gaps currently are and what challenges we expect to solve in the short- to mid-term.
Scalable Data Science in Python and R on Apache Spark (felixcss)
In the world of Data Science, Python and R are very popular. Apache Spark is a highly scalable data platform. How could a Data Scientist integrate Spark into their existing Data Science toolset? How does Python work with Spark? How could one leverage the rich 10000+ packages on CRAN for R?
We will start with PySpark, beginning with a quick walkthrough of data preparation practices and an introduction to Spark MLLib Pipeline Model. We will also discuss how to integrate native Python packages with Spark.
Compared to PySpark, SparkR is a newer language binding for Apache Spark, designed to be familiar to native R users. In this talk we will walk through many examples of how several new features in Apache Spark 2.x enable scalable machine learning on Big Data. In addition to covering the R interface to the ML Pipeline model, we will explore how SparkR supports running user code on large-scale data in a distributed manner, and give examples of how that can be used to work with your favorite R packages.
Topics: Python, R, Apache Spark, ML, DL
SSR: Structured Streaming on R for Machine Learning with Felix Cheung (Databricks)
Stepping beyond ETL in batches, large enterprises are looking at ways to generate more up-to-date insights. As we step into the age of Continuous Applications, this session will explore the ever more popular Structured Streaming API in Apache Spark, its application to R, and examples of machine learning use cases built on it.
Starting with an introduction to the high-level concepts, the session will dive into the core of the execution plan internals and examine how SparkR extends the existing system to add the streaming capability. Learn how to build various data science applications on data streams integrating with R packages to leverage the rich R ecosystem of 10k+ packages.
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung (Spark Summit)
R is a very popular platform for Data Science. Apache Spark is a highly scalable data platform. How could we have the best of both worlds? How could a Data Scientist leverage the rich 9000+ packages on CRAN, and integrate Spark into their existing Data Science toolset?
In this talk we will walk through many examples of how several new features in Apache Spark 2.x enable this. We will also look at changes already in, and coming next in, the Apache Spark 2.x releases.
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets by Jules Damji (Data Con LA)
Abstract: Of all the developers' delights, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing to operate on large data sets in languages such as Scala, Java, Python, and R for doing distributed big data processing at scale. In this talk, I will explore the evolution of three sets of APIs - RDDs, DataFrames, and Datasets - available in Apache Spark 2.x. In particular, I will emphasize why and when you should use each set as best practices, outline their performance and optimization benefits, and underscore scenarios when to use DataFrames and Datasets instead of RDDs for your big data distributed processing. Through simple notebook demonstrations with API code examples, you'll learn how to process big data using RDDs, DataFrames, and Datasets and interoperate among them.
No more struggles with Apache Spark workloads in production (Chetan Khatri)
Paris Scala Group Event May 2019, No more struggles with Apache Spark workloads in production.
Apache Spark
Primary data structures (RDD, DataSet, Dataframe)
Pragmatic explanation of executors, cores, containers, stages, jobs and tasks in Spark.
Parallel read from JDBC: Challenges and best practices.
Bulk Load API vs JDBC write
An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin
Avoid unnecessary shuffle
Alternative to spark default sort
Why dropDuplicates() doesn't guarantee consistent results, and what the alternative is
Optimize Spark stage generation plan
Predicate pushdown with partitioning and bucketing
Why not to use Scala Concurrent ‘Future’ explicitly!
Apache Spark is an open-source parallel processing framework that supports in-memory processing to boost the performance of big-data analytic applications. We will cover approaches to processing Big Data on a Spark cluster for real-time analytics, machine learning and iterative BI, and also discuss the pros and cons of using Spark in the Azure cloud.
Event: #SE2016
Stage: IoT & BigData
Date: 2 September 2016
Speaker: Vitalii Bondarenko
Topic: HDInsight Spark - Advanced in-memory Big Data analytics with Microsoft Azure
INHACKING site: https://inhacking.com
SE2016 site: http://se2016.inhacking.com/
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks (Data Con LA)
Abstract:
This talk will provide a technical overview of Spark’s DataFrame API in the context of data science, from exploratory data analysis to ETL to machine learning. We will review the API with a demo using a real-world dataset, covering data input/output, summary statistics, missing data handling, and statistical functions. We will then dive into the internals of DataFrame implementations, followed by how we view DataFrame in the long-term Spark roadmap and ecosystem.
Bio:
Reynold Xin is a cofounder of Databricks and a committer on Apache Spark, driving the design of Spark's next-gen API and execution engine. He holds the current world record in 100TB sorting (Daytona GraySort), beating the previous record by a factor of 3. On leave from his PhD at the UC Berkeley AMPLab, he also wrote the highest cited papers in SIGMOD 2011 and SIGMOD 2013.
As the de facto standard for large-scale data processing in the Java world, Apache Spark is the logical choice when you want to investigate big data processing. As a matter of fact, most online resources refer to the Scala API exposed by Spark. What do you do if you and your company are much more comfortable with Java than with Scala? These slides give pointers on whether it makes sense to learn and introduce an entirely new language just for your big data processing.
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ... (Databricks)
Of all the developers’ delight, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing to operate on large data sets in languages such as Scala, Java, Python, and R for doing distributed big data processing at scale. In this talk, I will explore the evolution of three sets of APIs-RDDs, DataFrames, and Datasets-available in Apache Spark 2.x. In particular, I will emphasize three takeaways: 1) why and when you should use each set as best practices 2) outline its performance and optimization benefits; and 3) underscore scenarios when to use DataFrames and Datasets instead of RDDs for your big data distributed processing. Through simple notebook demonstrations with API code examples, you’ll learn how to process big data using RDDs, DataFrames, and Datasets and interoperate among them. (this will be vocalization of the blog, along with the latest developments in Apache Spark 2.x Dataframe/Datasets and Spark SQL APIs: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html)
4Developers 2018: Pyt(h)on vs słoń: the current state of big data processing in Python... (PROIDEA)
Historically, the world of big data was reserved for technologies from the Java world. On the other hand, Python has for years been growing strongly in data analysis and scientific computing, which usually operate on smaller data. Nevertheless, a lot has changed recently: Python has become an increasingly important language in the Spark project, new Python projects for working with large data, such as Dask, are becoming more and more popular, and more managed cloud platforms like Google BigQuery are widely available and easy to use from Python. In this presentation I summarize the current state of big data analysis in Python, backed by real examples, the advantages and disadvantages of the different approaches, and thoughts on what the future may bring.
These slides were presented by Hossein Falaki of Databricks to the Atlanta Apache Spark User Group on Thursday, March 9, 2017: https://www.meetup.com/Atlanta-Apache-Spark-User-Group/events/238120227/
Using Spark 1.2 with Java 8 and Cassandra (Denis Dus)
A brief introduction to Spark's data processing ideology, a comparison of Java 7 and Java 8 usage with Spark, and examples of loading and processing data with the Spark Cassandra Loader.
Adjusting primitives for graph: SHORT REPORT / NOTES (Subhajit Sahu)
Covers graph algorithms such as PageRank and Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Opendatabay - Open Data Marketplace.pptx (Opendatabay)
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex: Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
7. Spark Origin
• Apache Spark is an open source cluster computing framework
• Originally developed at the University of California, Berkeley's AMPLab
• The first 2 contributors of SparkR: Shivaram Venkataraman & Zongheng Yang
https://amplab.cs.berkeley.edu/
13. RDD (Resilient Distributed Dataset)
https://spark.apache.org/docs/2.0.0/api/scala/#org.apache.spark.rdd.RDD
Internally, each RDD is characterized by five main properties:
1. A list of partitions
2. A function for computing each split
3. A list of dependencies on other RDDs
4. Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
5. Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
https://docs.cloud.databricks.com/docs/latest/courses
14. RDD dependencies
• Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD. This means the task can be executed locally and we don't have to shuffle. (E.g. map, flatMap, filter, sample)
• Wide dependency: multiple child partitions may depend on one partition of the parent RDD. This means we have to shuffle data unless the parents are hash-partitioned. (E.g. sortByKey, reduceByKey, groupByKey, join)
http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
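As a rough illustration of the two dependency types in SparkR's DataFrame API (a small sketch; it assumes a SparkDataFrame sdf with origin and dest columns, as in the demo later in this deck):

# Narrow: filter() keeps each output partition dependent on a single input partition (no shuffle)
jfk_flights <- filter(sdf, sdf$origin == "JFK")

# Wide: groupBy() + summarize() must repartition rows by key, so a shuffle happens
dest_counts <- summarize(groupBy(sdf, sdf$dest), count = n(sdf$dest))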
21. How does SparkR work?
https://people.csail.mit.edu/matei/papers/2016/sigmod_sparkr.pdf
22. Upgrading From SparkR 1.6 to 2.0
Before 1.6.2 → Since 2.0.0
• data type naming: DataFrame → SparkDataFrame
• read csv: package from Databricks → built-in
• functions like approxQuantile(): not available → available
• ML functions: glm → more (or use sparklyr)
• SQLContext / HiveContext: sparkRSQL.init(sc) → merged into sparkR.session()
• execution messages: very detailed → simple
• launch on EC2: API → removed
https://spark.apache.org/docs/latest/sparkr.html
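As a concrete sketch of the read-csv change (using the same data_flights.csv file as the demo below):

# SparkR 1.6: csv support comes from the third-party spark-csv package
sdf <- read.df(sqlContext, "data_flights.csv",
               source = "com.databricks.spark.csv", header = "true")

# SparkR 2.0: csv is a built-in source and the sqlContext argument is gone
sdf <- read.df("data_flights.csv", source = "csv", header = "true")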
25. Documents
• If you have to use RDDs, refer to the AMPLab GitHub docs: http://amplab-extras.github.io/SparkR-pkg/rdocs/1.2/ and use ":::", e.g. SparkR:::textFile, SparkR:::lapply
• Otherwise, refer to the official SparkR documents: https://spark.apache.org/docs/2.0.0/api/R/index.html
26. Starting to Use SparkR (v1.6.2)
# Set Spark path
Sys.setenv(SPARK_HOME="/usr/local/spark-1.6.2-bin-hadoop2.6/")
# Load SparkR library into your R session
library(SparkR,
lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
# Initialize SparkContext, sc:sparkContext
sc <- sparkR.init(appName = "Demo_SparkR")
# Initialize SQLContext
sqlContext <- sparkRSQL.init(sc)
# your sparkR script
# ...
# ...
sparkR.stop()
27. Starting to Use SparkR (v2.0.0)
# Set Spark path
Sys.setenv(SPARK_HOME="/usr/local/spark-2.0.0-bin-hadoop2.7/")
# Load SparkR library into your R session
library(SparkR,
lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
# Initialize the session (sparkR.session replaces sparkR.init since 2.0.0)
sc <- sparkR.session(appName = "Demo_SparkR")
# Initialize SQLContext (don’t need anymore since 2.0.0)
# sqlContext <- sparkRSQL.init(sc)
# your sparkR script
# ...
# ...
sparkR.stop()
28. DataFrames
# Load the flights CSV file using read.df
sdf <- read.df(sqlContext,"data_flights.csv",
"com.databricks.spark.csv", header = "true")
# Filter flights from JFK
jfk_flights <- filter(sdf, sdf$origin == "JFK")
# Group and aggregate flights to each destination
dest_flights <- summarize(
groupBy(jfk_flights, jfk_flights$dest),
count = n(jfk_flights$dest))
# Running SQL Queries
registerTempTable(sdf, "tempTable")
training <- sql(sqlContext,
"SELECT dest, count(dest) as cnt FROM tempTable
WHERE dest = 'JFK' GROUP BY dest")
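To pull results back into the local R session (a small sketch; head() and collect() are standard SparkR functions):

# Preview the aggregated SparkDataFrame as a local R data.frame
head(dest_flights)

# Materialize the full SQL result locally
local_counts <- collect(training)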
29. Word Count
# read data into RDD
rdd <- SparkR:::textFile(sc, "data_word_count.txt")
# split word
words <- SparkR:::flatMap(rdd, function(line) {
strsplit(line, " ")[[1]]
})
# map: give 1 for each word
wordCount <- SparkR:::lapply(words, function(word) {
list(word, 1)
})
# reduce: count the value by key(word)
counts <- SparkR:::reduceByKey(wordCount, "+", 2)
# convert RDD to list
op <- SparkR:::collect(counts)
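The collected result op is a plain R list of (word, count) pairs, so turning it into a local data.frame is ordinary base R (a small sketch):

# Each element of op is a two-element list: list(word, count)
word_df <- do.call(rbind, lapply(op, function(pair) {
  data.frame(word = pair[[1]], count = pair[[2]], stringsAsFactors = FALSE)
}))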
42. Some Tricks
• Customize spark config for launch
• cache()
• Some codes can’t run in Rstudio, try to use terminal
• Third-party packages, like the csv reader package from Databricks (a sketch of these tricks follows below)
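A sketch of the config, cache() and third-party-package tricks (the memory values and package version below are only examples):

# SparkR 2.0: pass Spark properties through sparkConfig when starting the session
sparkR.session(appName = "Demo_SparkR",
               sparkConfig = list(spark.driver.memory = "4g",
                                  spark.executor.memory = "4g"))

# SparkR 1.6: pull in the Databricks spark-csv package at launch
# sc <- sparkR.init(appName = "Demo_SparkR",
#                   sparkPackages = "com.databricks:spark-csv_2.11:1.4.0")

# Keep a frequently reused SparkDataFrame in memory
sdf <- read.df("data_flights.csv", source = "csv", header = "true")
cache(sdf)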
43. The Future of SparkR
• More MLlib API
• Advanced user-defined functions (a sketch follows below)
• The "sparklyr" package from RStudio
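For the user-defined-function direction, SparkR in Spark 2.0 already exposes dapply() for running R code on each partition of a SparkDataFrame; a minimal sketch (the columns reuse dest_flights from the demo, and the doubled count is purely illustrative):

# Apply an R function to every partition of a SparkDataFrame
schema <- structType(structField("dest", "string"),
                     structField("cnt", "double"))
doubled <- dapply(dest_flights, function(pdf) {
  # pdf arrives as a local R data.frame for one partition
  data.frame(dest = pdf$dest, cnt = pdf$count * 2)
}, schema)
head(doubled)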
44. Reference
• SparkR: Scaling R Programs with Spark. Shivaram Venkataraman, Zongheng Yang, Davies Liu, Eric Liang, Hossein Falaki, Xiangrui Meng, Reynold Xin, Ali Ghodsi, Michael Franklin, Ion Stoica, and Matei Zaharia. SIGMOD 2016, June 2016. https://people.csail.mit.edu/matei/papers/2016/sigmod_sparkr.pdf
• SparkR: Interactive R Programs at Scale. Shivaram Venkataraman, Zongheng Yang. Spark Summit, June 2014, San Francisco. https://spark-summit.org/2014/wp-content/uploads/2014/07/SparkR-SparkSummit.pdf
• Apache Spark official research page: http://spark.apache.org/research.html, including Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
• Apache Spark official documentation: http://spark.apache.org/docs/latest/api/scala/
• AMPLab UC Berkeley - SparkR project: https://github.com/amplab-extras/SparkR-pkg
• Databricks official blog: https://databricks.com/blog/category/engineering/spark
• R-bloggers: Launch Apache Spark on AWS EC2 and Initialize SparkR Using RStudio: https://www.r-bloggers.com/launch-apache-spark-on-aws-ec2-and-initialize-sparkr-using-rstudio-2/
46. Join Us
• Fansboard
• Web Designer (php & JavaScript)
• Editor w/ facebook & instagram
• Vpon - Data Scientist
• Taiwan Spark User Group
• Taiwan R User Group
47. Thanks for your attention
& Taiwan Spark User Group
& Vpon Data Team