In this one-hour webinar, you will be introduced to Spark, the data engineering that supports it, and the data science advances that it has spurred. You’ll discover the interesting story of its academic origins and then get an overview of the organizations that are using the technology. After being briefed on some impressive Spark case studies, you’ll learn about the next-generation Spark 2.0 (to be released in just a few months). We will also tell you about the impact that learning Spark can have on your current salary, and the best ways to get trained in this ground-breaking new technology.
Introduction to Apache Spark Developer Training (Cloudera, Inc.)
Apache Spark is a next-generation processing engine optimized for speed, ease of use, and advanced analytics well beyond batch. The Spark framework supports streaming data and complex, iterative algorithms, enabling applications to run up to 100x faster than traditional MapReduce programs. With Spark, developers can write sophisticated parallel applications for faster business decisions and better user outcomes, applied to a wide variety of architectures and industries.
Learn what Apache Spark is and how it compares to Hadoop MapReduce; how to filter, map, reduce, and save Resilient Distributed Datasets (RDDs); who is best suited to attend the course and what prior knowledge you should have; and the benefits of building Spark applications as part of an enterprise data hub.
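The filter, map, reduce, and save steps named in this abstract can be tried out without a cluster. The sketch below is a plain-Python analogy built on the `filter` and `map` built-ins and `functools.reduce`; it is not Spark's RDD API, and the sample readings, the sentinel value, and the output file name are all invented for illustration.

```python
import json
import os
import tempfile
from functools import reduce

# Hypothetical sample records; -999.0 is an assumed sentinel for a failed reading.
readings = [
    {"sensor": "a", "celsius": 21.5},
    {"sensor": "b", "celsius": -999.0},
    {"sensor": "c", "celsius": 19.0},
    {"sensor": "d", "celsius": 24.5},
]


def summarize(records):
    """Filter out bad readings, map to Fahrenheit, reduce to a count and a mean."""
    valid = filter(lambda r: r["celsius"] > -100.0, records)
    fahrenheit = map(lambda r: r["celsius"] * 9 / 5 + 32, valid)
    # The accumulator is a (running total, running count) pair.
    total, count = reduce(
        lambda acc, f: (acc[0] + f, acc[1] + 1), fahrenheit, (0.0, 0)
    )
    return {"count": count, "mean_fahrenheit": total / count if count else None}


def save(summary, path):
    """Persist the reduced result as JSON (the 'save' step)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(summary, fh)


summary = summarize(readings)

# Write to a throwaway directory and read the result back.
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "summary.json")
    save(summary, out_path)
    with open(out_path, encoding="utf-8") as fh:
        reloaded = json.load(fh)
```

As with RDD transformations, `filter` and `map` here are lazy: nothing is computed until `reduce` consumes the iterator. The analogy stops there, since this version runs on a single machine and has no partitioning or fault recovery.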
Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara... (Databricks)
Apache Spark’s machine learning (ML) pipelines provide a lot of power, but sometimes the tools you need for your specific problem aren’t available yet. This talk introduces Spark’s ML pipelines, and then looks at how to extend them with your own custom algorithms. By integrating your own data preparation and machine learning tools into Spark’s ML pipelines, you will be able to take advantage of useful meta-algorithms, like parameter searching and pipeline persistence (with a bit more work, of course).
Even if you don’t have your own machine learning algorithms that you want to implement, this session will give you an inside look at how the ML APIs are built. It will also help you make even more awesome ML pipelines and customize Spark models for your needs. And if you don’t want to extend Spark ML pipelines with custom algorithms, you’ll still benefit by developing a stronger background for future Spark ML projects.
The examples in this talk will be presented in Scala, but any non-standard syntax will be explained.
Spark Streaming Programming Techniques You Should Know with Gerard Maas (Spark Summit)
At its heart, Spark Streaming is a scheduling framework, able to efficiently collect and deliver data to Spark for further processing. While the DStream abstraction provides high-level functions to process streams, several operations also grant us access to deeper levels of the API, where we can operate directly on RDDs, transform them to Datasets to make use of that abstraction, or store the data for later processing. Between these API layers lie many hooks that we can manipulate to enrich our Spark Streaming jobs. In this presentation we will demonstrate how to tap into the Spark Streaming scheduler to run arbitrary data workloads, show practical uses of the forgotten ‘ConstantInputDStream’, and explain how to combine Spark Streaming with probabilistic data structures to reduce the memory footprint of long-running streaming jobs. Attendees of this session will come out with a richer toolbox of techniques to widen the use of Spark Streaming and improve the robustness of new or existing jobs.
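To make the point about probabilistic data structures concrete, here is a minimal sketch of one such structure, a Bloom filter, used to track which event ids a long-running job has already seen. The abstract does not say which structures the talk covers, so the choice of a Bloom filter is an assumption, and the class, the function names, and the sample batches are hypothetical. The code is plain Python and does not use the Spark Streaming API.

```python
import hashlib


class BloomFilter:
    """Fixed-size set membership with possible false positives, never false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def count_first_sightings(batches, seen):
    """For each micro-batch, count ids the filter has not (probably) seen before."""
    counts = []
    for batch in batches:
        new_ids = 0
        for event_id in batch:
            if not seen.might_contain(event_id):
                seen.add(event_id)
                new_ids += 1
        counts.append(new_ids)
    return counts


# Hypothetical micro-batches of event ids.
seen = BloomFilter(num_bits=8192, num_hashes=4)
batches = [["u1", "u2", "u3"], ["u2", "u4", "u1"]]
counts = count_first_sightings(batches, seen)
```

The filter occupies 1 KiB no matter how many ids pass through it, whereas an exact set grows with the stream. The price is that a false positive can make a genuinely new id look like a repeat, so the counts are a lower bound whose error depends on how `num_bits` and `num_hashes` are sized for the expected volume.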
A new look at Spark 2 features and what is under the hood. We examine the latest Apache Spark release with a critical eye, while still loving it.
Spark Summit EU 2016: The Next AMPLab: Real-time Intelligent Secure Execution (Databricks)
Committed to building open-source frameworks, tools, and algorithms that make it possible to build real-time applications that make decisions on live data with stronger security, the RISELab is set to innovate on and enhance Spark.
Writing Continuous Applications with Structured Streaming Python APIs in Apac... (Databricks)
Abstract:
We are in the midst of the Big Data Zeitgeist era, in which data comes at us fast, in myriad forms and formats, at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created the notion of a streaming application that is continuous and that reacts to and interacts with data in real time. We call this a continuous application.
In this talk we will explore the concepts and motivations behind the continuous application, show how the Structured Streaming Python APIs in Apache Spark 2.x enable writing continuous applications, examine the programming model behind Structured Streaming, and look at the APIs that support it.
Through a short demo and code examples, I will demonstrate how to write an end-to-end Structured Streaming application that reacts and interacts with both real-time and historical data to perform advanced analytics using Spark SQL, DataFrames and Datasets APIs.
You’ll walk away with an understanding of what a continuous application is, an appreciation of the easy-to-use Structured Streaming APIs, and a sense of why Structured Streaming in Apache Spark 2.x is a step forward in developing new kinds of streaming applications.
Large-Scale Data Science in Apache Spark 2.0 (Databricks)
Data science is one of the only fields where scalability can lead to fundamentally better results. Scalability allows users to train models on more data or to experiment with more types of models, both of which result in better models. It is no accident that the organizations most successful with AI have been those with huge distributed computing resources. In this talk, Matei Zaharia will describe how Apache Spark is democratizing large-scale data science to make it easier for more organizations to build high-quality data and AI products. He will talk about the new structured APIs in Spark 2.0 that enable more optimization underneath familiar programming interfaces, as well as libraries to scale up deep learning or traditional machine learning libraries on Apache Spark.
Speaker: Matei Zaharia
Composable Parallel Processing in Apache Spark and Weld (Databricks)
The main reason people are productive writing software is composability -- engineers can take libraries and functions written by other developers and easily combine them into a program. However, composability has taken a back seat in early parallel processing APIs. For example, composing MapReduce jobs required writing the output of every job to a file, which is both slow and error-prone. Apache Spark helped simplify cluster programming largely because it enabled efficient composition of parallel functions, leading to a large standard library and high-level APIs in various languages. In this talk, I'll explain how composability has evolved in Spark's newer APIs, and also present a new research project I'm leading at Stanford called Weld to enable much more efficient composition of software on emerging parallel hardware (multicores, GPUs, etc).
Speaker: Matei Zaharia
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ... (Databricks)
Of all the developers’ delights, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing to operate on large data sets in languages such as Scala, Java, Python, and R for doing distributed big data processing at scale. In this talk, I will explore the evolution of three sets of APIs (RDDs, DataFrames, and Datasets) available in Apache Spark 2.x. In particular, I will emphasize three takeaways: 1) why and when you should use each set, as best practices; 2) the performance and optimization benefits of each; and 3) scenarios in which to use DataFrames and Datasets instead of RDDs for your distributed big data processing. Through simple notebook demonstrations with API code examples, you’ll learn how to process big data using RDDs, DataFrames, and Datasets and interoperate among them. (This talk will be a spoken version of the blog post, along with the latest developments in the Apache Spark 2.x DataFrame/Dataset and Spark SQL APIs: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html)
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell (Databricks)
In this webcast, Patrick Wendell from Databricks will be speaking about Apache Spark's new 1.6 release.
Spark 1.6 will include (but is not limited to) a type-safe API called Dataset on top of DataFrames that leverages all the work in Project Tungsten to have more robust and efficient execution (including memory management, code generation, and query optimization) [SPARK-9999], adaptive query execution [SPARK-9850], and unified memory management by consolidating cache and execution memory [SPARK-10000].
Strata NYC 2015: What's new in Spark Streaming (Databricks)
As the adoption of Spark Streaming in the industry is increasing, so is the community’s demand for more features. Since the beginning of this year, we have made significant improvements in performance, usability, and semantic guarantees. In particular, some of these features are:
- New Kafka integration for exactly-once guarantees
- Improved Kinesis integration for stronger guarantees
- Addition of more sources to the Python API
- Significantly improved UI for greater monitoring and debuggability
In this talk, I am going to discuss these improvements as well as the plethora of features we plan to add in the near future.
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0 (Databricks)
Apache Spark 2.0 was released this summer and is already being widely adopted. In this presentation Matei talks about how changes in the API have made it easier to write batch, streaming and realtime applications. The Dataset API, which is now integrated with DataFrames, makes it possible to benefit from powerful optimizations such as pushing queries into data sources, while the Structured Streaming extension to this API makes it possible to run many of the same computations in a streaming fashion automatically.
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
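The arbitration between execution and storage memory described here can be illustrated with a toy model. The sketch below is a deliberately simplified, hypothetical pool loosely inspired by the unified memory management mentioned for Spark 1.6: both uses share one region, execution may reclaim cached data down to a protected floor, and caching never evicts execution memory. The class name, method names, and numbers are invented, and real Spark behaviour (for example, evicting older cached blocks to admit new ones) is not modelled.

```python
class UnifiedPool:
    """Toy shared pool for execution and storage memory (units are arbitrary)."""

    def __init__(self, total, protected_storage):
        self.total = total
        # Cached data at or below this amount is never evicted by execution.
        self.protected_storage = protected_storage
        self.storage_used = 0
        self.execution_used = 0

    @property
    def free(self):
        return self.total - self.storage_used - self.execution_used

    def acquire_storage(self, amount):
        """Cache data only if it fits in free memory; never evicts execution."""
        if amount > self.free:
            return False
        self.storage_used += amount
        return True

    def acquire_execution(self, amount):
        """Grant execution memory, evicting cached data above the protected floor."""
        evictable = max(0, self.storage_used - self.protected_storage)
        shortfall = max(0, amount - self.free)
        evicted = min(shortfall, evictable)
        self.storage_used -= evicted
        granted = min(amount, self.free)
        self.execution_used += granted
        return granted


pool = UnifiedPool(total=100, protected_storage=30)
cached = pool.acquire_storage(80)         # storage borrows most of the pool
first_grant = pool.acquire_execution(50)  # evicts 30 units of cached data
refused = pool.acquire_storage(10)        # no free memory left, so this fails
second_grant = pool.acquire_execution(40) # can only evict down to the floor
```

After these calls the pool holds 30 units of cached data and 70 units of execution memory. The second execution request is only partly satisfied because the remaining cache sits at the protected floor. In a static design with fixed, separate regions, the first cache request of 80 units could not have succeeded at all, which is the usability trade-off this kind of talk examines.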
Spark Tuning For Enterprise System Administrators, Spark Summit East 2016 (Anya Bida)
by Anya Bida and Rachel Warren from Alpine Data
https://spark-summit.org/east-2016/events/spark-tuning-for-enterprise-system-administrators/
Spark offers the promise of speed, but many enterprises are reluctant to make the leap from Hadoop to Spark. Indeed, System Administrators will face many challenges with tuning Spark performance. This talk is a gentle introduction to Spark Tuning for the Enterprise System Administrator, based on experience assisting two enterprise companies running Spark in yarn-cluster mode. The initial challenges can be categorized in two FAQs. First, with so many Spark Tuning parameters, how do I know which parameters are important for which jobs? Second, once I know which Spark Tuning parameters I need, how do I enforce them for the various users submitting various jobs to my cluster? This introduction to Spark Tuning will enable enterprise system administrators to overcome common issues quickly and focus on more advanced Spark Tuning challenges. The audience will understand the “cheat-sheet” posted here: http://techsuppdiva.github.io/
Key takeaways:
FAQ 1: With so many Spark Tuning parameters, how do I know which parameters are important for which jobs?
Solution 1: The Spark Tuning cheat-sheet! A visualization that guides the System Administrator to quickly overcome the most common hurdles to algorithm deployment. http://techsuppdiva.github.io/
FAQ 2: Once I know which Spark Tuning parameters I need, how do I enforce them at the user level? job level? algorithm level? project level? cluster level?
Solution 2: We’ll approach these challenges using job & cluster configuration, the Spark context, and 3rd party tools – of which Alpine will be one example. We’ll operationalize Spark parameters according to user, job, algorithm, workflow pipeline, or cluster levels.
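The question in FAQ 2 comes down to which configuration layer wins when the same parameter is set in more than one place. The sketch below models that layering in plain Python with `collections.ChainMap`; it is not the talk's method or a Spark API. The precedence order follows the Spark configuration documentation (properties set in application code, then `spark-submit` flags, then `spark-defaults.conf`), the property names are real Spark settings, and the values, the `locked` policy, and the function names are hypothetical.

```python
from collections import ChainMap

# Cluster-level defaults, as an administrator might put in spark-defaults.conf.
cluster_defaults = {
    "spark.executor.memory": "4g",
    "spark.executor.cores": "2",
    "spark.sql.shuffle.partitions": "200",
}
# Job-level settings, as a user might pass to spark-submit with --conf.
submit_flags = {"spark.executor.memory": "8g"}
# Settings made in application code on the Spark configuration object.
application_conf = {"spark.sql.shuffle.partitions": "64"}


def resolve(application, submit, defaults):
    """Return the effective settings; earlier mappings take precedence."""
    return dict(ChainMap(application, submit, defaults))


def overridden_locked_keys(application, submit, locked):
    """List administrator-locked keys that a job or application tries to change.

    'locked' is a hypothetical policy set, not a Spark feature.
    """
    return sorted(k for k in locked if k in application or k in submit)


effective = resolve(application_conf, submit_flags, cluster_defaults)
violations = overridden_locked_keys(
    application_conf, submit_flags, locked={"spark.executor.memory"}
)
```

Because user-level settings override cluster defaults, defaults alone cannot enforce anything. A check like `overridden_locked_keys`, run by whatever tool submits jobs on behalf of users, is one assumed way to flag overrides before a job reaches the cluster.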
Jump Start on Apache® Spark™ 2.x with Databricks (Databricks)
Apache Spark 2.0 and the subsequent releases of Spark 2.1 and 2.2 have laid the foundation for many new features and functionality. Their three main themes (easier, faster, and smarter) are pervasive in the unified and simplified high-level APIs for structured data.
In this introductory workshop, part lecture and part hands-on, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Agenda:
• Overview of Spark Fundamentals & Architecture
• What’s new in Spark 2.x
• Unified APIs: SparkSessions, SQL, DataFrames, Datasets
• Introduction to DataFrames, Datasets and Spark SQL
• Introduction to Structured Streaming Concepts
• Four Hands On Labs
You will use Databricks Community Edition, which will give you unlimited free access to a ~6 GB Spark 2.x local mode cluster. And in the process, you will learn how to create a cluster, navigate in Databricks, explore a couple of datasets, perform transformations and ETL, save your data as tables and parquet files, read from these sources, and analyze datasets using DataFrames/Datasets API and Spark SQL.
Level: Beginner to intermediate, not for advanced Spark users.
Prerequisite: You will need a laptop with at least 8 GB of memory and the Chrome or Firefox browser installed. Introductory or basic knowledge of Scala or Python is required; the notebooks will be in Scala, and Python is optional.
Bio:
Jules S. Damji is an Apache Spark Community Evangelist with Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies, such as Sun Microsystems, Netscape, LoudCloud/Opsware, VeriSign, Scalix, and ProQuest, building large-scale distributed systems. Before joining Databricks, he was a Developer Advocate at Hortonworks.
Apache Spark 2.0 set the architectural foundations of Structure in Spark, Unified high-level APIs, Structured Streaming, and the underlying performant components like Catalyst Optimizer and Tungsten Engine. Since then the Spark community has continued to build new features and fix numerous issues in releases Spark 2.1 and 2.2.
Continuing in that spirit, the upcoming release of Apache Spark 2.3 makes similar strides, introducing new features and resolving over 1300 JIRA issues. In this talk, we want to share with the community some salient features of the soon-to-be-released Spark 2.3:
• Kubernetes Scheduler Backend
• PySpark Performance and Enhancements
• Continuous Structured Streaming Processing
• DataSource v2 APIs
• Structured Streaming v2 APIs
This introductory workshop is aimed at data analysts & data engineers new to Apache Spark and shows them how to analyze big data with Spark SQL and DataFrames.
In these partly instructor-led, partly self-paced labs, we will cover Spark concepts, and you’ll do labs for Spark SQL and DataFrames in Databricks Community Edition.
Toward the end, you’ll get a glimpse of the newly minted Databricks Developer Certification for Apache Spark: what to expect & how to prepare for it.
* Apache Spark Basics & Architecture
* Spark SQL
* DataFrames
* Brief Overview of Databricks Certified Developer for Apache Spark
We are a company driven by inquisitive data scientists, having developed a pragmatic and interdisciplinary approach, which has evolved over the decades working with over 100 clients across multiple industries. Combining several Data Science techniques from statistics, machine learning, deep learning, decision science, cognitive science, and business intelligence, with our ecosystem of technology platforms, we have produced unprecedented solutions. Welcome to the Data Science Analytics team that can do it all, from architecture to algorithms.
Our practice delivers data driven solutions, including Descriptive Analytics, Diagnostic Analytics, Predictive Analytics, and Prescriptive Analytics. We employ a number of technologies in the area of Big Data and Advanced Analytics such as DataStax (Cassandra), Databricks (Spark), Cloudera, Hortonworks, MapR, R, SAS, Matlab, SPSS and Advanced Data Visualizations.
This presentation is designed to help Spark enthusiasts get started; details of the course are below.
1. Introduction to Apache Spark
2. Functional Programming + Scala
3. Spark Core
4. Spark SQL + Parquet
5. Advanced Libraries
6. Tips & Tricks
7. Where do I go from here?
What's New in Apache Spark 2.3 & Why Should You Care (Databricks)
The Apache Spark 2.3 release marks a big step forward in speed, unification, and API support.
This talk will quickly walk through what’s new and how you can benefit from the upcoming improvements:
* Continuous Processing in Structured Streaming.
* PySpark support for vectorization, giving Python developers the ability to run native Python code fast.
* Native Kubernetes support, marrying the best of container orchestration and distributed data processing.
A Journey into Databricks' Pipelines: Journey and Lessons Learned (Databricks)
With components like Spark SQL, MLlib, and Streaming, Spark is a unified engine for building data applications. In this talk, we will take a look at how we use Spark on our own Databricks platform throughout our data pipeline for use cases such as ETL, data warehousing, and real time analysis. We will demonstrate how these applications empower engineering and data analytics. We will also share some lessons learned from building our data pipeline around security and operations. This talk will include examples on how to use Structured Streaming (a.k.a Streaming DataFrames) for online analysis, SparkR for offline analysis, and how we connect multiple sources to achieve a Just-In-Time Data Warehouse.
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi... (Databricks)
“As Apache Spark becomes more widely adopted, we have focused on creating higher-level APIs that provide increased opportunities for automatic optimization. In this talk, I give an overview of some of the exciting new APIs available in Spark 2.0, namely Datasets and Structured Streaming. Together, these APIs are bringing the power of Catalyst, Spark SQL's query optimizer, to all users of Spark. I'll focus on specific examples of how developers can build their analyses more quickly and efficiently simply by providing Spark with more information about what they are trying to accomplish.” - Michael
Databricks Blog: "Deep Dive into Spark SQL’s Catalyst Optimizer"
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
// About the Presenter //
Michael Armbrust is the lead developer of the Spark SQL project at Databricks. He received his PhD from UC Berkeley in 2013, and was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage and query optimization.
Follow Michael on -
Twitter: https://twitter.com/michaelarmbrust
LinkedIn: https://www.linkedin.com/in/michaelarmbrust
Overview of Apache Flink: The 4G of Big Data Analytics Frameworks (Slim Baltagi)
Slides of my talk at the Hadoop Summit Europe in Dublin, Ireland on April 13th, 2016. The talk introduces Apache Flink as both a multi-purpose Big Data analytics framework and a real-world streaming analytics framework. It focuses on Flink's key differentiators and suitability for streaming analytics use cases. It also shows how Flink enables novel use cases such as distributed CEP (Complex Event Processing) and querying the state by behaving like a key-value data store.
Building a High-Performance Database with Scala, Akka, and Spark (Evan Chan)
Here is my talk at Scala by the Bay 2016, Building a High-Performance Database with Scala, Akka, and Spark. Covers integration of Akka and Spark, when to use actors and futures, back pressure, reactive monitoring with Kamon, and more.
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Databricks
Description:
We are amidst the Big Data Zeitgeist era in which data comes at us fast, in myriad forms and formats at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created a notion of writing a streaming application that’s continuous, reacts and interacts with data in real-time. We call this continuous application, which we will discuss.
Abstract:
We are amidst the Big Data Zeitgeist era in which data comes at us fast, in myriad forms and formats at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created a notion of writing a streaming application that’s continuous, reacts and interacts with data in real-time. We call this continuous application.
In this talk we will explore the concepts and motivations behind the continuous application, how Structured Streaming Python APIs in Apache Spark 2.x enables writing continuous applications, examine the programming model behind Structured Streaming, and look at the APIs that support them.
Through a short demo and code examples, I will demonstrate how to write an end-to-end Structured Streaming application that reacts and interacts with both real-time and historical data to perform advanced analytics using Spark SQL, DataFrames and Datasets APIs.
You’ll walk away with an understanding of what’s a continuous application, appreciate the easy-to-use Structured Streaming APIs, and why Structured Streaming in Apache Spark 2.x is a step forward in developing new kinds of streaming applications.
Large-Scale Data Science in Apache Spark 2.0Databricks
Data science is one of the only fields where scalability can lead to fundamentally better results. Scalability allows users to train models on more data or to experiment with more types of models, both of which result in better models. It is no accident that the organizations most successful with AI have been those with huge distributed computing resources. In this talk, Matei Zaharia will describe how Apache Spark is democratizing large-scale data science to make it easier for more organizations to build high-quality data and AI products. Matei Zaharia will talk about the new structured APIs in Spark 2.0 that enable more optimization underneath familia programming interfaces, as well as libraries to scale up deep learning or traditional machine learning libraries on Apache Spark.
Speaker: Matei Zaharia
Composable Parallel Processing in Apache Spark and WeldDatabricks
The main reason people are productive writing software is composability -- engineers can take libraries and functions written by other developers and easily combine them into a program. However, composability has taken a back seat in early parallel processing APIs. For example, composing MapReduce jobs required writing the output of every job to a file, which is both slow and error-prone. Apache Spark helped simplify cluster programming largely because it enabled efficient composition of parallel functions, leading to a large standard library and high-level APIs in various languages. In this talk, I'll explain how composability has evolved in Spark's newer APIs, and also present a new research project I'm leading at Stanford called Weld to enable much more efficient composition of software on emerging parallel hardware (multicores, GPUs, etc).
Speaker: Matei Zaharia
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
Of all the developers’ delight, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing to operate on large data sets in languages such as Scala, Java, Python, and R for doing distributed big data processing at scale. In this talk, I will explore the evolution of three sets of APIs-RDDs, DataFrames, and Datasets-available in Apache Spark 2.x. In particular, I will emphasize three takeaways: 1) why and when you should use each set as best practices 2) outline its performance and optimization benefits; and 3) underscore scenarios when to use DataFrames and Datasets instead of RDDs for your big data distributed processing. Through simple notebook demonstrations with API code examples, you’ll learn how to process big data using RDDs, DataFrames, and Datasets and interoperate among them. (this will be vocalization of the blog, along with the latest developments in Apache Spark 2.x Dataframe/Datasets and Spark SQL APIs: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html)
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellDatabricks
In this webcast, Patrick Wendell from Databricks will be speaking about Apache Spark's new 1.6 release.
Spark 1.6 will include (but not limited to) a type-safe API called Dataset on top of DataFrames that leverages all the work in Project Tungsten to have more robust and efficient execution (including memory management, code generation, and query optimization) [SPARK-9999], adaptive query execution [SPARK-9850], and unified memory management by consolidating cache and execution memory [SPARK-10000].
Strata NYC 2015: What's new in Spark StreamingDatabricks
As the adoption of Spark Streaming in the industry is increasing, so is the community’s demand for more features. Since the beginning of this year, we have made significant improvements in performance, usability, and semantic guarantees. In particular, some of these features are:
- New Kafka integration for exactly-once guarantees
- Improved Kinesis integration for stronger guarantees
- Addition of more sources to the Python API
Significantly improved UI for greater monitoring and debuggability.
In this talk, I am going to discuss these improvements as well as the plethora of features we plan to add in the near future.
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Databricks
Apache Spark 2.0 was released this summer and is already being widely adopted. In this presentation Matei talks about how changes in the API have made it easier to write batch, streaming and realtime applications. The Dataset API, which is now integrated with DataFrames, makes it possible to benefit from powerful optimizations such as pushing queries into data sources, while the Structured Streaming extension to this API makes it possible to run many of the same computations in a streaming fashion automatically.
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
Spark Tuning For Enterprise System Administrators, Spark Summit East 2016Anya Bida
by Anya Bida and Rachel Warren from Alpine Data
https://spark-summit.org/east-2016/events/spark-tuning-for-enterprise-system-administrators/
Spark offers the promise of speed, but many enterprises are reluctant to make the leap from Hadoop to Spark. Indeed, System Administrators will face many challenges with tuning Spark performance. This talk is a gentle introduction to Spark Tuning for the Enterprise System Administrator, based on experience assisting two enterprise companies running Spark in yarn-cluster mode. The initial challenges can be categorized in two FAQs. First, with so many Spark Tuning parameters, how do I know which parameters are important for which jobs? Second, once I know which Spark Tuning parameters I need, how do I enforce them for the various users submitting various jobs to my cluster? This introduction to Spark Tuning will enable enterprise system administrators to overcome common issues quickly and focus on more advanced Spark Tuning challenges. The audience will understand the “cheat-sheet” posted here: http://techsuppdiva.github.io/
Key takeaways:
FAQ 1: With so many Spark Tuning parameters, how do I know which parameters are important for which jobs?
Solution 1: The Spark Tuning cheat-sheet! A visualization that guides the System Administrator to quickly overcome the most common hurdles to algorithm deployment. http://techsuppdiva.github.io/
FAQ 2: Once I know which Spark Tuning parameters I need, how do I enforce them at the user level? job level? algorithm level? project level? cluster level?
Solution 2: We’ll approach these challenges using job & cluster configuration, the Spark context, and 3rd party tools – of which Alpine will be one example. We’ll operationalize Spark parameters according to user, job, algorithm, workflow pipeline, or cluster levels.
Jump Start on Apache® Spark™ 2.x with Databricks Databricks
Apache Spark 2.0 and subsequent releases of Spark 2.1 and 2.2 have laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Agenda:
• Overview of Spark Fundamentals & Architecture
• What’s new in Spark 2.x
• Unified APIs: SparkSessions, SQL, DataFrames, Datasets
• Introduction to DataFrames, Datasets and Spark SQL
• Introduction to Structured Streaming Concepts
• Four Hands On Labs
You will use Databricks Community Edition, which will give you unlimited free access to a ~6 GB Spark 2.x local mode cluster. And in the process, you will learn how to create a cluster, navigate in Databricks, explore a couple of datasets, perform transformations and ETL, save your data as tables and parquet files, read from these sources, and analyze datasets using DataFrames/Datasets API and Spark SQL.
Level: Beginner to intermediate, not for advanced Spark users.
Prerequisite: You will need a laptop with at least 8 GB of memory and the Chrome or Firefox browser installed. Introductory or basic knowledge of Scala or Python is required, since the notebooks will be in Scala; Python is optional.
Bio:
Jules S. Damji is an Apache Spark Community Evangelist with Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies, such as Sun Microsystems, Netscape, LoudCloud/Opsware, VeriSign, Scalix, and ProQuest, building large-scale distributed systems. Before joining Databricks, he was a Developer Advocate at Hortonworks.
Apache Spark 2.0 set the architectural foundations of Structure in Spark, Unified high-level APIs, Structured Streaming, and the underlying performant components like Catalyst Optimizer and Tungsten Engine. Since then the Spark community has continued to build new features and fix numerous issues in releases Spark 2.1 and 2.2.
Continuing forward in that spirit, the upcoming release of Apache Spark 2.3 has made similar strides too, introducing new features and resolving over 1300 JIRA issues. In this talk, we want to share with the community some salient aspects of soon to be released Spark 2.3 features:
• Kubernetes Scheduler Backend
• PySpark Performance and Enhancements
• Continuous Structured Streaming Processing
• DataSource v2 APIs
• Structured Streaming v2 APIs
This introductory workshop is aimed at data analysts & data engineers new to Apache Spark and shows them how to analyze big data with Spark SQL and DataFrames.
In these partly instructor-led and partly self-paced labs, we will cover Spark concepts, and you’ll do labs for Spark SQL and DataFrames in Databricks Community Edition.
Toward the end, you’ll get a glimpse into newly minted Databricks Developer Certification for Apache Spark: what to expect & how to prepare for it.
* Apache Spark Basics & Architecture
* Spark SQL
* DataFrames
* Brief Overview of Databricks Certified Developer for Apache Spark
We are a company driven by inquisitive data scientists, having developed a pragmatic and interdisciplinary approach, which has evolved over the decades working with over 100 clients across multiple industries. Combining several Data Science techniques from statistics, machine learning, deep learning, decision science, cognitive science, and business intelligence, with our ecosystem of technology platforms, we have produced unprecedented solutions. Welcome to the Data Science Analytics team that can do it all, from architecture to algorithms.
Our practice delivers data driven solutions, including Descriptive Analytics, Diagnostic Analytics, Predictive Analytics, and Prescriptive Analytics. We employ a number of technologies in the area of Big Data and Advanced Analytics such as DataStax (Cassandra), Databricks (Spark), Cloudera, Hortonworks, MapR, R, SAS, Matlab, SPSS and Advanced Data Visualizations.
This presentation is designed for Spark Enthusiasts to get started and details of the course are below.
1. Introduction to Apache Spark
2. Functional Programming + Scala
3. Spark Core
4. Spark SQL + Parquet
5. Advanced Libraries
6. Tips & Tricks
7. Where do I go from here?
What's New in Apache Spark 2.3 & Why Should You CareDatabricks
The Apache Spark 2.3 release marks a big step forward in speed, unification, and API support.
This talk will quickly walk through what’s new and how you can benefit from the upcoming improvements:
* Continuous Processing in Structured Streaming.
* PySpark support for vectorization, giving Python developers the ability to run native Python code fast.
* Native Kubernetes support, marrying the best of container orchestration and distributed data processing.
A Journey into Databricks' Pipelines: Journey and Lessons LearnedDatabricks
With components like Spark SQL, MLlib, and Streaming, Spark is a unified engine for building data applications. In this talk, we will take a look at how we use Spark on our own Databricks platform throughout our data pipeline for use cases such as ETL, data warehousing, and real time analysis. We will demonstrate how these applications empower engineering and data analytics. We will also share some lessons learned from building our data pipeline around security and operations. This talk will include examples on how to use Structured Streaming (a.k.a Streaming DataFrames) for online analysis, SparkR for offline analysis, and how we connect multiple sources to achieve a Just-In-Time Data Warehouse.
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Databricks
“As Apache Spark becomes more widely adopted, we have focused on creating higher-level APIs that provide increased opportunities for automatic optimization. In this talk, I give an overview of some of the exciting new APIs available in Spark 2.0, namely Datasets and Structured Streaming. Together, these APIs are bringing the power of Catalyst, Spark SQL's query optimizer, to all users of Spark. I'll focus on specific examples of how developers can build their analyses more quickly and efficiently simply by providing Spark with more information about what they are trying to accomplish.” - Michael
Databricks Blog: "Deep Dive into Spark SQL’s Catalyst Optimizer"
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
// About the Presenter //
Michael Armbrust is the lead developer of the Spark SQL project at Databricks. He received his PhD from UC Berkeley in 2013, and was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage and query optimization.
Follow Michael on -
Twitter: https://twitter.com/michaelarmbrust
LinkedIn: https://www.linkedin.com/in/michaelarmbrust
Overview of Apache Flink: The 4G of Big Data Analytics FrameworksSlim Baltagi
Slides of my talk at the Hadoop Summit Europe in Dublin, Ireland on April 13th, 2016. The talk introduces Apache Flink as both a multi-purpose Big Data analytics framework and real-world streaming analytics framework. It is focusing on Flink's key differentiators and suitability for streaming analytics use cases. It also shows how Flink enables novel use cases such as distributed CEP (Complex Event Processing) and querying the state by behaving like a key value data store.
Building a High-Performance Database with Scala, Akka, and SparkEvan Chan
Here is my talk at Scala by the Bay 2016, Building a High-Performance Database with Scala, Akka, and Spark. Covers integration of Akka and Spark, when to use actors and futures, back pressure, reactive monitoring with Kamon, and more.
Unified Batch and Real-Time Stream Processing Using Apache FlinkSlim Baltagi
This talk was given at Capital One on September 15, 2015 at the launch of the Washington DC Area Apache Flink Meetup. Apache Flink is positioned at the forefront of two major trends in Big Data Analytics:
- Unification of Batch and Stream processing
- Multi-purpose Big Data Analytics frameworks
In these slides, we will also find answers to the burning question: Why Apache Flink? You will also learn more about how Apache Flink compares to Hadoop MapReduce, Apache Spark and Apache Storm.
2016-03-17 Structural Value EngineeringPiet Lambert
Presentation by Piet Lambert (Lambert Engineering) at the 5th anniversary event on 17 March 2016.
Structural Value Engineering is the method by which a structural engineer reduces the shell construction costs incurred during execution already at the design stage. This creates additional (architectural) value.
Breakthrough OLAP performance with Cassandra and SparkEvan Chan
Find out about breakthrough architectures for fast OLAP performance querying Cassandra data with Apache Spark, including a new open source project, FiloDB.
TupleJump: Breakthrough OLAP performance on Cassandra and SparkDataStax Academy
Apache Cassandra is rock-solid and widely deployed for OLTP and real-time applications, but it is typically not thought of as an OLAP database for analytical queries. This talk will show architectures and techniques for combining Apache Cassandra and Spark to yield a 10-1000x improvement in OLAP analytical performance. We will then introduce a new open-source project that combines the above performance improvements with the ease of use of Apache Cassandra, and compare it to implementations based on Hadoop and Parquet.
First, the existing Cassandra Spark connector allows one to easily load data from Cassandra to Spark. We'll cover how to accelerate queries through different caching options in Spark, and the tradeoffs and limitations around performance, memory, and updating data in real time. We then dive into the use of columnar storage layout and efficient coding techniques that dramatically speed up I/O for OLAP use cases. Cassandra features like triggers and custom secondary indexes allow for easy data ingestion into columnar format. Next, we explore how to integrate this new storage with Spark SQL and its pluggable data storage API. Future developments will enable extreme analytical database performance, including smart caching of column projections, a columnar version of Spark's Catalyst execution planner, and how vectorization makes for fast cache- and GPU-friendly calculations - see Spark's Project Tungsten.
FiloDB is a new open-source database using the above techniques to combine very fast Spark SQL analytical queries with the ease of use of Cassandra. We will briefly cover interesting use cases, such as:
* Easy exactly-once ingestion from Kafka for streaming and IoT applications
* Incremental computed columns and geospatial annotations. We'll discuss how FiloDB improves aggregations needed for choropleth maps over standard PostGIS solutions.
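The benefit of a columnar storage layout for OLAP queries, mentioned in the abstract above, can be sketched in a few lines. This is only an illustration on in-memory Python data, assuming a tiny made-up table. The field names and the `to_columns` helper are hypothetical and have nothing to do with FiloDB's or Cassandra's actual storage format.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"city": "Austin", "device": "phone", "latency_ms": 120},
    {"city": "Boston", "device": "tablet", "latency_ms": 80},
    {"city": "Austin", "device": "phone", "latency_ms": 100},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {name: [] for name in records[0]}
    for record in records:
        for name, value in record.items():
            columns[name].append(value)
    return columns


columns = to_columns(rows)

# Row layout: every full record is visited just to average one field.
row_avg = sum(row["latency_ms"] for row in rows) / len(rows)

# Columnar layout: the aggregate reads a single contiguous column
# and never touches "city" or "device".
col_avg = sum(columns["latency_ms"]) / len(columns["latency_ms"])
```

Both computations give the same average. The difference is that the columnar version only needs the one column it aggregates, which is what makes column projection cheap when the data lives on disk.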
Continuous Application with Structured Streaming 2.0Anyscale
Introduction to Continuous Application with Apache Spark 2.0 Structured Streaming. This presentation is a culmination and curation from talks and meetups presented by Databricks engineers.
The notebooks on Structured Streaming demonstrate aspects of the Structured Streaming APIs.
Video of the presentation can be seen here: https://www.youtube.com/watch?v=uxuLRiNoDio
The Data Source API in Spark is a convenient feature that enables developers to write libraries to connect to data stored in various sources with Spark. Equipped with the Data Source API, users can load/save data from/to different data formats and systems with minimal setup and configuration. In this talk, we introduce the Data Source API and the unified load/save functions built on top of it. Then, we show examples to demonstrate how to build a data source library.
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Uwe Printz
Talk held at the Java User Group on 05.09.2013 in Novi Sad, Serbia
Agenda:
- What is Big Data & Hadoop?
- Core Hadoop
- The Hadoop Ecosystem
- Use Cases
- What‘s next? Hadoop 2.0!
This talk, given at the Hadoop Summit in San Jose on June 28, 2016, analyzes a few major trends in Big Data analytics.
These are a few takeaways from this talk:
- Adopt Apache Beam for easier development and portability between Big Data Execution Engines.
- Adopt stream analytics for faster time to insight, competitive advantages and operational efficiency.
- Accelerate your Big Data applications with In-Memory open source tools.
- Adopt Rapid Application Development of Big Data applications: APIs, Notebooks, GUIs, Microservices…
- Have Machine Learning part of your strategy or passively watch your industry completely transformed!
- How to advance your strategy for hybrid integration between cloud and on-premise deployments?
Step-by-Step Introduction to Apache Flink Slim Baltagi
This is a talk that I gave at the 2nd Apache Flink meetup in Washington DC Area hosted and sponsored by Capital One on November 19, 2015. You will quickly learn in a step-by-step way:
1. How to set up and configure your Apache Flink environment?
2. How to use Apache Flink tools?
3. How to run the examples in the Apache Flink bundle?
4. How to set up your IDE (IntelliJ IDEA or Eclipse) for Apache Flink?
5. How to write your Apache Flink program in an IDE?
Apache Flink: Real-World Use Cases for Streaming AnalyticsSlim Baltagi
This face to face talk about Apache Flink in Sao Paulo, Brazil is the first event of its kind in Latin America! It explains how Apache Flink 1.0, announced on March 8th, 2016 by the Apache Software Foundation, marks a new era of Big Data analytics and in particular Real-Time streaming analytics. The talk maps Flink's capabilities to real-world use cases that span multiple verticals such as: Financial Services, Healthcare, Advertisement, Oil and Gas, Retail and Telecommunications.
In this talk, you learn more about:
1. What is Apache Flink Stack?
2. Batch vs. Streaming Analytics
3. Key Differentiators of Apache Flink for Streaming Analytics
4. Real-World Use Cases with Flink for Streaming Analytics
5. Who is using Flink?
6. Where do you go from here?
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Edureka!
This Edureka Spark Tutorial will help you to understand all the basics of Apache Spark. This Spark tutorial is ideal for both beginners as well as professionals who want to learn or brush up Apache Spark concepts. Below are the topics covered in this tutorial:
1) Big Data Introduction
2) Batch vs Real Time Analytics
3) Why Apache Spark?
4) What is Apache Spark?
5) Using Spark with Hadoop
6) Apache Spark Features
7) Apache Spark Ecosystem
8) Demo: Earthquake Detection Using Apache Spark
End-to-End Data Pipelines with Apache SparkBurak Yavuz
This presentation is about building a data product backed by Apache Spark. The source code for the demo can be found at http://brkyvz.github.io/spark-pipeline
Delight: An Improved Apache Spark UI, Free, and Cross-PlatformDatabricks
Delight (https://www.datamechanics.co/delight) is a free & cross-platform monitoring dashboard for Apache Spark, which displays system metrics (CPU Usage, Memory Usage) along with Spark information (jobs, stages, tasks) on the same timeline. Delight is a great complement to the Spark UI when it comes to troubleshooting your Spark application and understanding its performance bottleneck. It works freely on top of any Spark platform (whether it’s open-source or commercial, in the cloud or on-premise). You can install it using an open-sourced Spark agent (https://github.com/datamechanics/delight).
In this session, the co-founders of Data Mechanics will take you through performance troubleshooting sessions with Delight on real-world data engineering pipelines. You will see how Delight and the Spark UI can jointly help you spot the performance bottleneck of your applications, and how you can use these insights to make your applications more cost-effective and stable.
Solution Brief: Real-Time Pipeline AcceleratorBlueData, Inc.
Get started with Spark Streaming, Kafka, and Cassandra for real-time data analytics.
BlueData makes it easy to deploy Spark infrastructure and applications on-premises. The BlueData EPIC software platform is purpose-built to simplify and accelerate the deployment of Spark, Hadoop, and other tools for Big Data analytics—leveraging Docker containers and virtualized infrastructure.
Our new Real-Time Pipeline Accelerator solution provides the software and professional services you need for building data pipelines in a multi-tenant environment for Spark Streaming, Kafka, and Cassandra. With help from the BlueData team, you’ll also have two end-to-end real-time data pipelines as a starting point.
Learn more about BlueData at www.bluedata.com
Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...Databricks
At Sam’s Club we have a long history of using Apache Spark and Hadoop. Projects from all parts of the company use Apache Spark, from fraud detection to product recommendations. Because of the scale of our business, with billions of transactions and trillions of events, it is often essential to use big data technologies. Until recently all of this work has run on several large on-premise Hadoop clusters. As part of our transition to public cloud we needed to build out an enterprise scale data platform. Azure Databricks is a key component of this platform, giving our data scientists, engineers, and business users the ability to easily work with the company’s data. We will discuss our architecture considerations that led to using multiple Databricks workspaces and external Azure blob storage. We will also discuss how we move massive amounts of data to Azure on a daily basis with Airflow. Further we will discuss the self-service tools that we created to help users get their data to Azure and for us to manage the platform. Finally we will discuss our security considerations and how that played out in our architecture.
Authors: Andrew Ray, Craig Covey
Powering a Startup with Apache Spark with Kevin KimSpark Summit
At Between (a mobile app for couples, downloaded 20M times globally), Spark is used for everything from daily batch jobs that extract metrics to analysis and dashboards. Spark is widely used by engineers and data analysts at Between; thanks to the performance and expandability of Spark, data operations have become extremely efficient. The entire team, including Biz Dev, Global Operations, and Designers, enjoys the data results, so Spark is empowering the entire company for data-driven operation and thinking. Kevin, co-founder and data team leader of Between, will present how things are going at Between. After this presentation, listeners will know how a small and agile team lives with data (how we build the organization, culture, and technical base).
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | QuboleVasu S
This ebook deep dives into Apache Spark optimizations that improve performance, reduce costs and deliver unmatched scale
https://www.qubole.com/resources/ebooks/accelerating-time-to-value-of-big-data-of-apache-spark
In the past, emerging technologies took years to mature. In the case of big data, while effective tools are still emerging, the analytics requirements are changing rapidly, forcing businesses to either make it or be left behind.
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Landon Robinson
At SpotX, we have built and maintained a portfolio of Spark Streaming applications -- all of which process records in the millions per minute. From pure data ingestion, to ETL, to real-time reporting, to live customer-facing products and features, continuous applications are in our DNA. Come along with us as we outline our journey from square one to present in the world of Spark Streaming. We'll detail what we've learned about efficient processing and monitoring, reliability and stability, and long term support of a streaming app. Come learn from our mistakes, and leave with some handy settings and designs you can implement in your own streaming apps.
Presented by Landon Robinson and Jack Chapa
Similar to Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics (20)
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
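The levelwise idea in the abstract above (decompose the graph into strongly connected components, then process them in topological order) can be sketched in plain Python. This is an illustrative, single-threaded sketch and not the report's implementation. It assumes a small graph with no dead ends, uses a recursive Tarjan's algorithm for the decomposition, and all function names and the example graph are hypothetical.

```python
def strongly_connected_components(out_links):
    """Tarjan's algorithm; components come out in reverse topological order."""
    index, low, on_stack = {}, {}, set()
    stack, components = [], []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in out_links[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of a component
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            components.append(component)

    for v in out_links:
        if v not in index:
            visit(v)
    return components


def levelwise_pagerank(out_links, damping=0.85, tol=1e-12, max_iter=1000):
    """Converge one component at a time, upstream components first."""
    n = len(out_links)
    in_links = {v: [] for v in out_links}
    for u, targets in out_links.items():
        for v in targets:
            in_links[v].append(u)

    ranks = {v: 1.0 / n for v in out_links}
    for component in reversed(strongly_connected_components(out_links)):
        for _ in range(max_iter):
            updated = {}
            error = 0.0
            for v in component:
                # In-links come from this component or from finished upstream ones.
                contribution = sum(ranks[u] / len(out_links[u]) for u in in_links[v])
                updated[v] = (1 - damping) / n + damping * contribution
                error += abs(updated[v] - ranks[v])
            ranks.update(updated)
            if error < tol:
                break
    return ranks


# Two components: {a, b} feeds into {c, d}; every vertex has an out-link.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["d"], "d": ["c"]}
ranks = levelwise_pagerank(graph)
```

Because a component only depends on components upstream of it, each one is iterated to convergence exactly once and its ranks are never revisited. That property is what removes the need for per-iteration communication in a distributed setting.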
Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics
1. Big Data 2.0
HOW SPARK TECHNOLOGIES ARE RESHAPING THE WORLD OF BIG DATA ANALYTICS
Presented By: Lillian Pierson, P.E.
2. Today’s webinar
Apache Spark: Journey from “Hadoop Eco System component” to “Big Data platform”
The story of how Spark began
Is Spark a data engineering or data science platform?
Who is using Spark and for what?
Got Spark skills? Here’s why you should
5. Why in-memory applications?
“In-memory computing appliances are … faster than the traditional Hadoop system because in-memory appliances don’t use MapReduce… By storing data in memory, in-memory appliances are able to bypass the time-consuming disk accesses that are required as part of the map and reduce operations that comprise the MapReduce process. In-memory data storage, processing, and analysis is fast enough to generate data analytics in real-time, derived from streaming data sources.” – Excerpt from my book: Big Data/Hadoop for Dummies
17. Changes with Spark 2.0
Spark 1.0: RDD API, DataFrame API
Spark 1.3: RDD API, DataFrame API
Spark 1.6: RDD API, DataFrame API, Dataset API
Spark 2.0: Dataset API, DataFrame API, RDD API
18. Changes with Spark 2.0
Spark 1.0: RDD API
Spark 2.0: Dataset API, DataFrame API, RDD API
19. Changes with Spark 2.0
Structured Stream Processing
DataFrame API
Dataset API
21. Taking things from the beginning…
2009
Mesos
UC Berkeley
Interactive, iterative parallel processing (in-memory)
◦ Machine learning requirements
Integrates with Hadoop ecosystem
Dr. Ion Stoica
Computer Science Professor
UC Berkeley
22. Databricks… the cutting edge of Spark
Delivers Apache Spark-as-a-Service
Most popular solution for deploying Spark on the cloud
Dr. Ion Stoica
Executive Chairman, Databricks
23. Databricks… the cutting edge of Spark
Spark on an as-needed basis
Automates
◦ Cluster building and configuration
◦ Security
◦ Process monitoring
◦ Resource monitoring
Notebooks
◦ For data analysis and machine learning using Python, R, and Scala
Data visualization capabilities
◦ Data visualization and dashboard design options
24. Is Spark a data engineering or data science platform?
DATA ENGINEERING COMPONENTS AND TECHNOLOGIES
DATA SCIENCE COMPONENTS AND TECHNOLOGIES
25. Spark’s data engineering elements
Automate cluster sizing and configuration requirements
Data Storage: HDFS
Resource Management:
◦ Spark Standalone
◦ Apache Mesos
◦ Hadoop YARN
26. Spark’s data engineering elements
Spark Streaming Submodule – Reuse the same code you use for batch processing, but get real-time results!
◦ Integrates with big data sources, like:
◦ HDFS
◦ Flume
◦ Kafka
◦ Twitter and
◦ ZeroMQ
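The claim on this slide, that batch code can be reused for streaming, can be illustrated without Spark at all. The sketch below is only an analogy in plain Python: the same `count_words` function is applied once to a whole dataset and then to small micro-batches of an incoming stream, with a running total kept between batches. The function names and sample data are hypothetical, and batches are cut by record count here for simplicity, whereas Spark Streaming cuts them by time interval.

```python
from collections import Counter


def count_words(lines):
    """The 'batch' logic: count words in any collection of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def micro_batches(stream, batch_size):
    """Group an incoming stream of records into small batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


lines = ["spark streaming", "spark sql", "kafka to spark", "hdfs and kafka"]

# Batch mode: process everything at once.
batch_result = count_words(lines)

# Streaming mode: reuse the same function on each micro-batch as it arrives.
running = Counter()
for batch in micro_batches(iter(lines), batch_size=2):
    running.update(count_words(batch))
```

After the stream is exhausted, the running totals match the batch result, while intermediate totals were already available after every micro-batch.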
27. Doing data science with Spark
Useful for machine learning and analysis of big data
Build big data analytics products
Programmable in Python, R, Scala, and SQL
Submodules:
◦ SQL and DataFrames
◦ MLlib for machine learning
◦ GraphX for in-memory big (graph) data computations
28. Doing data science with Spark
Spark integrates with the following data sources and formats:
◦ Hive, Avro, Parquet, CSV, JSON, JDBC, and HBase
◦ BI Tools: Tableau, QLIK, ZoomData, etc. (through JDBC)
29. Who is using Spark and for what?
AUTOMATIC LABS
LENDUP
SELLPOINTS
FINDIFY
30. Automatic Labs on Databricks
Making cars smarter with real-time analytics
Connect to, and make smart use of, your car’s data
31. Automatic Labs on Databricks
Automatic apps do things like:
◦ Decoding engine problems
◦ Locating parked cars
◦ Crash detection and response
◦ Low fuel warnings, etc.
Automatic is using Spark to make cars smarter with real-time analytics
During product development, Automatic needs to query, explore, and visualize large amounts of data, QUICKLY. By moving this work over to Spark, Automatic was able to:
◦ Validate products in days, not weeks
◦ Complete complex queries in minutes
◦ Free up 1 full-time data scientist
◦ Save $10K/month on infrastructure costs
32. LendUp on Databricks
Improving the lending process and experience
“Moving up the LendUp Ladder means earning access to more money, at better rates, for longer periods of time” - LendUp
33. LendUp on Databricks
LendUp uses Spark for:
◦ Feature engineering at scale
◦ Fast model building and testing
By using Spark to do this work, LendUp is able to:
◦ Build more accurate models, faster
◦ Offer more lines of credit
◦ Develop new products more quickly
◦ Increase in-house productivity of data science team
35. sellpoints on Databricks
Increasing ROI on ad spend
Sellpoints offers services in:
◦ Identifying qualified shoppers
◦ Driving traffic
◦ Increasing sales conversion
By moving to Databricks, sellpoints was able to:
◦ Productize a new predictive analytics offering, improving the ad spend ROI by threefold compared to competitive offerings.
◦ Reduce the time and effort required to deliver actionable insights to the business team while lowering costs.
◦ Improve productivity of the engineering and data science team by eliminating the time spent on DevOps and maintaining open source software.
36. Findify on Databricks
Improving shopping experience for ecommerce customers
Uses machine learning to continually improve search accuracy
37. Findify on Databricks
By moving to Databricks, Findify was able to:
◦ Focus on development instead of infrastructure, allowing them to complete
their feature development projects faster and reduce customer frustration
over delayed analytics.
◦ Focus on building innovative features, because the managed Spark platform
eliminated time spent on DevOps and infrastructure issues.
41. Getting training and experience in Spark
Get hands-on training in the following areas:
◦ Using RDDs
◦ Writing applications using Scala
◦ Spark SQL
◦ Spark Streaming
◦ Machine Learning in Spark (MLlib)
◦ Spark GraphX
◦ Spark Project Implementation
44. Why Data Science From Simplilearn
Key Features
◦ 40 hours of real-life industry project experience
◦ 25 hours of high-quality e-learning
◦ Visualize and optimize data effectively using the built-in tools in R, SAS, and Excel
◦ 48 hours of live instructor-led online sessions
◦ Get proficient in using R, SAS, and Excel to model data and predict solutions to business problems
◦ Master the concepts of statistical analysis, like linear and logistic regression, cluster analysis, and forecasting
45. OUR JOURNEY SO FAR
◦ Project Management
◦ Digital Marketing
◦ Big Data & Analytics
◦ Business Productivity Tools
◦ Quality Management
◦ Virtualization and Cloud Computing
◦ IT Security
◦ Financial Management
◦ CompTIA Certification
◦ IT Hardware and N/W
◦ ERP
◦ IT Services and Architecture
◦ Agile and Scrum Certification
◦ OS and Database
◦ Web and App Programming
Simplilearn: World’s Largest Certification Training Destination
One of the largest collections of accredited certification training in the world.