This document summarizes the key new features in Spark 2.0: a new SparkSession entry point that unifies SQLContext and HiveContext, unified Dataset and DataFrame APIs, enhanced SQL features such as subqueries and window functions, built-in CSV support, machine learning pipeline persistence across languages, approximate query functions, whole-stage code generation for performance improvements, and initial support for Structured Streaming. It provides examples of several of these new features and discusses Combient's role in helping its owners and members with analytics.
(Presented by David Smith at useR!2016, June 2016. Recording: https://channel9.msdn.com/Events/useR-international-R-User-conference/useR2016/R-at-Microsoft)
Since the acquisition of Revolution Analytics in April 2015, Microsoft has embarked upon a project to build R technology into many Microsoft products, so that developers and data scientists can use the R language and R packages to analyze data in their data centers and in cloud environments.
In this talk I will give an overview (and a demo or two) of how R has been integrated into various Microsoft products. Microsoft data scientists are also big users of R, and I'll describe a couple of examples of R being used to analyze operational data at Microsoft. I'll also share some of my experiences in working with open source projects at Microsoft, and my thoughts on how Microsoft works with open source communities including the R Project.
Evaluation of TPC-H on Spark and Spark SQL in ALOJA (DataWorks Summit)
The Evaluation of TPC-H on Spark and Spark SQL in ALOJA was conducted at the Big Data Lab as part of a master's degree in Management Information Systems at the Johann Wolfgang Goethe University in Frankfurt, Germany. The analysis was partially carried out in collaboration and close coordination with the Barcelona Supercomputing Center.
The goal of this research was to integrate a TPC-H benchmark on Spark Scala into ALOJA, an open-source public platform for automated and cost-efficient benchmarking, and to compare the runtime of Spark Scala, with and without the Hive Metastore, against Spark SQL. The impact of alternative file formats and compression codecs applied to the underlying data is also evaluated. The performance evaluation exposed diverse and intriguing outcomes for both benchmarks. Further investigation attempts to detect possible bottlenecks and other irregularities, with the aim of deepening understanding of Spark's engine by examining the physical plans. Our experiments show, inter alia, that (1) Spark Scala performs better for heavy expression calculation, and (2) Spark SQL is the better choice for strong data access locality combined with heavyweight parallel execution. In short, the diverse results indicate that each API has its advantages and disadvantages.
Surprisingly, our findings are spread fairly evenly between Spark SQL and Spark Scala: contrary to our expectations, Spark Scala did not outperform Spark SQL in all aspects. This supports the idea that the applied optimizations are implemented differently by Spark for its core and for its extension Spark SQL. The SQL API on top of Spark provides extra information about the underlying structured data, which is probably used to perform additional optimizations.
In conclusion, our research demonstrates that there are differences in the generation of query execution plans, in line with similar findings of inefficient joins, and it underlines the value of our benchmark for identifying disparities and bottlenecks.
Speaker
Raphael Radowitz, Quality Specialist, SAP Labs Korea
A brief presentation for an internship project at BEL on data visualization using Seaborn and Matplotlib.
Some sensitive information has been redacted.
Extending Spark Graph for the Enterprise with Morpheus and Neo4j (Databricks)
Spark 3.0 introduces a new module: Spark Graph. Spark Graph adds the popular query language Cypher, its accompanying Property Graph Model and graph algorithms to the data science toolbox. Graphs have a plethora of useful applications in recommendation, fraud detection and research.
Morpheus is an open-source library that is API compatible with Spark Graph and extends its functionality by:
- A Property Graph catalog to manage multiple Property Graphs and Views
- Property Graph Data Sources that connect Spark Graph to Neo4j and SQL databases
- Extended Cypher capabilities, including multiple-graph support and graph construction
- Built-in support for the Neo4j Graph Algorithms library

In this talk, we will walk you through the new Spark Graph module and demonstrate how we extend it with Morpheus to help enterprise users integrate Spark Graph into their existing Spark and Neo4j installations.
We will demonstrate how to explore data in Spark, use Morpheus to transform data into a Property Graph, and then build a Graph Solution in Neo4j.
One of the most promising areas in the NoSQL world is graph-based storage and processing systems built on graph theory. Neo4j is perhaps the most popular graph database at the moment. It provides high-performance storage and processing of graphs through various Java APIs and the declarative query language Cypher.
Adobe, Cisco, classmates.com, Deutsche Telekom, and many others use Neo4j.
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L... (Databricks)
GE Aviation has hundreds of data scientists and engineers developing algorithms. The majority of these people do not have the time to learn Apache Spark and continue to develop on local machines in Python or R. We also have lots of historical code that was not developed for Spark. However, the business wanted to deploy to a Spark environment for scalability, as quickly as possible. So how did we bridge the gap? A data scientist and software engineer will co-present to share how we approached the problem of building, unifying and scaling these algorithms.
Graph-Powered Machine Learning - Meetup Paris - March 5, 2018
Graph-based machine learning is becoming an important trend in artificial intelligence, transcending many other techniques. Using graphs as the basic representation of data serves multiple purposes:
- the data is already modeled for further analysis
- graphs can easily combine multiple sources into a single graph representation and learn over them, creating Knowledge Graphs;
- improving computational performance and quality.

The talk will present these advantages and show applications in the context of recommendation engines and natural language processing.
Speaker: Dr. Vlasta Kus (@VlastaKus) is a Data Scientist at GraphAware, specializing in graph-based Natural Language Processing and related topics, including deep learning techniques. He speaks English, Czech and some French and currently lives in Prague.
This presentation is an attempt to demystify the practice of building reliable data processing pipelines. We go through the necessary pieces needed to build a stable processing platform: data ingestion, processing engines, workflow management, schemas, and pipeline development processes. The presentation also includes component choice considerations and recommendations, as well as best practices and pitfalls to avoid, most learnt through expensive mistakes.
An introduction to Microsoft R Services,
Microsoft R Open and Microsoft R Server.
This presentation will briefly cover the following:
- Why consider MRO and R Server
- R Server
- MRO
- Microsoft R Services/R Server Platform
- DistributedR
- RevoScaleR/ScaleR
- ConnectR
- DevelopR
- DeployR
- Resources
- References
In the big data world, it's not always easy for Python users to move huge amounts of data around. Apache Arrow defines a common format for data interchange, while Arrow Flight, introduced in version 0.11.0, provides a means to move that data efficiently between systems. Arrow Flight is a framework for Arrow-based messaging built with gRPC. It enables data microservices where clients can produce and consume streams of Arrow data to share it over the wire. In this session, I'll give a brief overview of Arrow Flight from a Python perspective, and show that it's easy to build high-performance connections when systems can talk Arrow. I'll also cover some ongoing work in using Arrow Flight to connect PySpark with TensorFlow - two systems with great Python APIs but very different underlying internal data.
Lessons learnt and system built while solving the last mile problem in machine learning - taking models to production. Used for the talk at - http://sched.co/BLvf
As companies adopt data processing technologies and add data-driven features to user-facing products, the need for effective automated test techniques for data processing applications increases. We go through the anatomy of scalable data streaming applications, and how to set up test harnesses for reliable integration testing of such applications. We cover a few common anti-patterns that make asynchronous tests fragile, and corresponding patterns for remediation. We will also mention virtualisation components suitable for our testing scenarios.
Data lineage has gained popularity in the Machine Learning community as a way to make models and datasets easier to interpret and to help developers debug their ML pipelines by enabling them to go from a model to the dataset/user who trained it. Data provenance and lineage is the process of building up the history of how a data artifact came to be. This history of derivations and interactions can provide a better context for data discovery, debugging, as well as auditing. In this area, others, such as Google and Databricks, have made small steps.
In the Hopsworks approach presented here, provenance information is collected implicitly through unobtrusive instrumentation of Jupyter notebooks and Python code - what we call 'implicit provenance'.
Dataset Descriptions in Open PHACTS and HCLS (Alasdair Gray)
This presentation gives an overview of the dataset description specification developed in the Open PHACTS project (http://www.openphacts.org/). The creation of the specification was driven by a real need within the project to track the datasets used.
Details of the dataset metadata captured and the vocabularies used to model this metadata are given together with the tools developed to enable the specification's uptake.
Over the course of the last 12 months, the W3C Healthcare and Life Science Interest Group have been developing a community profile for dataset descriptions. This has drawn on the ideas developed in the Open PHACTS specification. A brief overview of the forthcoming community profile is given in the presentation.
This presentation was given to the Network Data Exchange project http://www.ndexbio.org/ on 2 April 2014.
Neo4j-Databridge: Enterprise-scale ETL for Neo4j (Neo4j)
Neo4j-Databridge is a fully-featured ETL tool specifically built for Neo4j, and designed for usability, expressive power and high performance. It has been created to help solve the most common problems faced by large enterprises when importing data into Neo4j - data locality, multiple data sources and formats, performance when loading very large data sets, bespoke data conversions, inclusion of non-tabular data, filtering, merging and de-duplication...
In this webinar, we’ll take a quick tour of the main features of Neo4j-Databridge and see how it can help solve these problems and make importing your data into Neo4j quick and easy.
Airline Reservations and Routing: A Graph Use Case (Jason Plurad)
We've all been there before... you hear the announcement that your flight is canceled. Fellow passengers race to the gate agent to rebook on the next available flight. How do they quickly determine the best route from Berlin to San Francisco? Ultimately the flight route network is best solved as a graph problem. We will discuss our lessons learned from working with a major airline to solve this problem using the JanusGraph database. JanusGraph is an open source graph database designed for massive scale. It is compatible with several pieces of the open source big data stack: Apache TinkerPop (graph computing framework), HBase, Cassandra, and Solr. We will go into depth about our approach to benchmarking graph performance and discuss the utilities we developed. We will share our comparison results for evaluating which storage backend to use with JanusGraph. Whether you are productizing a new database or you are a frustrated traveler, a fast resolution is needed to satisfy everybody involved. Presented at DataWorks Summit Berlin on April 18, 2018.
What’s new in Spark 2.0?
Rerngvit Yanggratoke @ Combient AB
Örjan Lundberg @ Combient AB
Machine Learning Stockholm Meetup
27 October, 2016
Schibsted Media Group
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming (Paco Nathan)
London Spark Meetup 2014-11-11 @Skimlinks
http://www.meetup.com/Spark-London/events/217362972/
To paraphrase the immortal crooner Don Ho: "Tiny Batches, in the wine, make me happy, make me feel fine." http://youtu.be/mlCiDEXuxxA
Apache Spark provides support for streaming use cases, such as real-time analytics on log files, by leveraging a model called discretized streams (D-Streams). These "micro-batch" computations operate on small time intervals, generally from 500 milliseconds up. One major innovation of Spark Streaming is that it leverages a unified engine. In other words, the same business logic can be used across multiple use cases: streaming, but also interactive, iterative, machine learning, etc.
This talk will compare case studies for production deployments of Spark Streaming, emerging design patterns for integration with popular complementary OSS frameworks, plus some of the more advanced features such as approximation algorithms, and take a look at what's ahead — including the new Python support for Spark Streaming that will be in the upcoming 1.2 release.
Also, let's chat a bit about the new Databricks + O'Reilly developer certification for Apache Spark…
Present and future of unified, portable, and efficient data processing with A... (DataWorks Summit)
The world of big data involves an ever-changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In a way, Apache Beam is a glue that can connect the big data ecosystem together; it enables users to "run any data processing pipeline anywhere."
This talk will briefly cover the capabilities of the Beam model for data processing and discuss its architecture, including the portability model. We’ll focus on the present state of the community and the current status of the Beam ecosystem. We’ll cover the state of the art in data processing and discuss where Beam is going next, including completion of the portability framework and the Streaming SQL. Finally, we’ll discuss areas of improvement and how anybody can join us on the path of creating the glue that interconnects the big data ecosystem.
Speaker
Davor Bonaci, Apache Software Foundation; Simbly, V.P. of Apache Beam; Founder/CEO at Operiant
Databricks Meetup @ Los Angeles Apache Spark User Group (Paco Nathan)
Los Angeles Apache Spark Users Group 2014-12-11 http://meetup.com/Los-Angeles-Apache-Spark-Users-Group/events/218748643/
A look ahead at Spark Streaming in Spark 1.2 and beyond, with case studies, demos, plus an overview of approximation algorithms that are useful for real-time analytics.
Overview of Apache Flink: The 4G of Big Data Analytics Frameworks (Slim Baltagi)
Slides of my talk at the Hadoop Summit Europe in Dublin, Ireland on April 13th, 2016. The talk introduces Apache Flink as both a multi-purpose Big Data analytics framework and a real-world streaming analytics framework. It focuses on Flink's key differentiators and suitability for streaming analytics use cases. It also shows how Flink enables novel use cases such as distributed CEP (Complex Event Processing) and querying state by behaving like a key-value data store.
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More (Paco Nathan)
Spark and Databricks component of the O'Reilly Media webcast "2015 Data Preview: Spark, Data Visualization, YARN, and More", as a preview of the 2015 Strata + Hadoop World conference in San Jose http://www.oreilly.com/pub/e/3289
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S... (BigDataEverywhere)
Paco Nathan, Director of Community Evangelism at Databricks
Apache Spark is intended as a fast and powerful general purpose engine for processing Hadoop data. Spark supports combinations of batch processing, streaming, SQL, ML, Graph, etc., for applications written in Scala, Java, Python, Clojure, and R, among others. In this talk, I'll explore how Spark fits into the Big Data landscape. In addition, I'll describe other systems with which Spark pairs nicely, and will also explain why Spark is needed for the work ahead.
First presentation for Savi's sponsorship of the Washington DC Spark Interactive. Discusses tips and lessons learned using Spark Streaming (24x7) to ingest and analyze Industrial Internet of Things (IIoT) data as part of a Lambda Architecture
How Apache Spark fits into the Big Data landscape (Paco Nathan)
How Apache Spark fits into the Big Data landscape http://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/217858832/
2014-12-02 in Herndon, VA and sponsored by Raytheon, Tetra Concepts, and MetiStream
1. What’s new in Spark 2.0?
Machine Learning Stockholm Meetup
27 October, 2016
Schibsted Media Group
Rerngvit Yanggratoke @ Combient AB
Örjan Lundberg @ Combient AB
2. Combient - Who We Are
• A joint venture owned by several Swedish global enterprises in the Wallenberg sphere
• One of our key missions is the Analytics Centre of Excellence (ACE)
• Provide data science and data engineering resources to our owners and members
• Share best practices, advanced methods, and state-of-the-art technology to generate business value from data
3. Apache Spark Evolution
[Timeline figure, 2010-2017: the project begins in the AMPLab at UC Berkeley; Spark technical report, UC Berkeley; Spark paper at HotCloud 2010; Spark paper at OSDI 2012; Apache Incubator project; first Spark Summit; top-level Apache project; Spark 1.0 released; Spark 2.0 released. Vendors (named by logo on the slide) announced efforts to unite Spark with Hadoop, announced strong investment in Spark, integrated Spark 1.5.2 into their distribution, and added Spark to their 4.0 release.]
* Very rapid takeoff: from research publication to major adoption in industry in 4 years
4. Spark 2.0 “Easier, Faster, and Smarter”
• Spark Session - a new entry point
• Unification of APIs => Dataset & DataFrame
• Spark SQL enhancements: subqueries, native window functions, better error messages
• Built-in CSV file type support
• Machine learning pipeline persistence across languages
• Approximate DataFrame statistics functions
• Performance enhancement (whole-stage code generation)
• Structured Streaming (alpha)
5. Spark Session
Spark 1.6 => SQLContext, HiveContext

val conf = new SparkConf().setMaster(master).setAppName(appName)
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

Spark 2.0 => SparkSession

val ss = SparkSession.builder().master(master)
  .appName(appName).getOrCreate()
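As a quick usage sketch (not from the slides): the single SparkSession now fronts SQL queries, Hive access, and data loading; the table name and file path below are hypothetical.

ss.sql("SELECT * FROM country_pop").show()   // run SQL through the unified entry point
val df = ss.read.json("/path/to/data.json")  // load data through the same session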
6. Unified API: DataFrame & Dataset
• DataFrame - introduced in Spark 1.3 (March 2015); available in Java, Scala, Python, R
• Dataset - experimental in Spark 1.6 (March 2016); available in Java, Scala
• Observation: very rapid changes of APIs
7. Dataset API example
Which countries have a shrinking population from 2014 to 2015?

Table: country_pop
name         | population_2014 | population_2015
Afghanistan  | XXXX            | XXXX
Albania      | XXXX            | XXXX
...          | ...             | ...

• Key benefit over the DataFrame API: static typing => compile-time error detection
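The slide's accompanying code did not survive extraction; below is a minimal sketch of how this query could look with the typed Dataset API (the case class, field names, and reuse of the ss session value are assumptions):

import ss.implicits._  // encoders for case classes

case class Country(name: String, population_2014: Long, population_2015: Long)

// Typed Dataset: field access is checked at compile time
val countries = ss.table("country_pop").as[Country]
val shrinking = countries.filter(c => c.population_2015 < c.population_2014)
shrinking.show()
// A typo such as c.population_215 would fail at compile time, not at runtime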
8. Subqueries
Spark 1.6 => subqueries only inside the FROM clause
Spark 2.0 => subqueries inside the FROM and WHERE clauses

Example: Which countries have more than 100M people in 2016?
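A minimal sketch of the new capability (table and column names are illustrative, reusing the country_pop table from the previous slide):

// FROM-clause subqueries already worked in Spark 1.6:
ss.sql("SELECT name FROM (SELECT * FROM country_pop) tmp")

// New in Spark 2.0: subqueries in the WHERE clause, e.g. an IN subquery
ss.sql("""
  SELECT name FROM country_pop
  WHERE name IN (SELECT name FROM country_pop_2016 WHERE population > 100000000)
""").show()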
13. Spark 2.0 Machine Learning (ML) APIs Status
• spark.mllib (RDD-based APIs)
  • Entering bug-fix mode (no new features)
• spark.ml (DataFrame-based): models + transformations
  • Introduced in January 2015
  • Becomes the primary API
[Figure: an example Pipeline model]
14. ML Pipeline Persistence Across Languages

Data scientist:
• Prototypes and experiments with candidate ML pipelines
• Focus on rapid prototyping and fast exploration of data
• Prefers Python / R / ...
• Proposes ML pipelines

Data engineer:
• Implements and deploys the pipelines in a production environment
• Focus on maintainable and reliable systems
• Prefers Java / Scala

Why is this a problem?
- Diverging source code
- Synchronisation efforts/delays

(Image sources: classroomclipart.com, worldartsme.com)
15. ML Pipeline Persistence Across Languages (cont.)

Pipeline persistence allows importing/exporting pipelines across languages:
• Introduced only for Scala in Spark 1.6
• Status in Spark 2.0: Scala/Java (full), Python (near-full), R (with hacks)
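A minimal sketch of what persistence looks like in Spark 2.0 (the pipeline stages, training data, and save path are assumptions; the point is that the saved format is language-neutral):

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.{Tokenizer, HashingTF}
import org.apache.spark.ml.classification.LogisticRegression

val pipeline = new Pipeline().setStages(Array(
  new Tokenizer().setInputCol("text").setOutputCol("words"),
  new HashingTF().setInputCol("words").setOutputCol("features"),
  new LogisticRegression().setMaxIter(10)))

val model = pipeline.fit(trainingDF)             // trainingDF assumed to exist
model.write.overwrite().save("/models/demo")     // persist the fitted pipeline
val reloaded = PipelineModel.load("/models/demo")

The same saved directory can then be loaded from PySpark with pyspark.ml.PipelineModel.load.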
18. Approximate DataFrame Statistics Functions

Motivation: rapid exploratory data analysis

• Approximate quantile (Greenwald & Khanna, 2001)
• Bloom filter - approximate set-membership data structure (Bloom, 1970)
• CountMinSketch - approximate frequency-estimation data structure (Cormode, 2009)

References:
Michael Greenwald and Sanjeev Khanna. 2001. Space-efficient online computation of quantile summaries. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data (SIGMOD '01), Timos Sellis and Sharad Mehrotra (Eds.). ACM, New York, NY, USA, 58-66.
Burton H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (July 1970), 422-426. DOI=http://dx.doi.org/10.1145/362686.362692
Cormode, Graham (2009). "Count-min sketch". Encyclopedia of Database Systems. Springer. pp. 511-516.
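The slides covering CountMinSketch are missing from this transcript; for completeness, a minimal sketch of the corresponding Spark 2.0 call (table, column name, and parameters are illustrative):

val df = ss.table("country_pop")
// eps = relative error, a confidence level, and a random seed
val cms = df.stat.countMinSketch("name", eps = 0.01, confidence = 0.95, seed = 42)
cms.estimateCount("Sweden")  // approximate number of rows with name = "Sweden"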
19. Bloom filter - approximate set-membership data structure

• k hash functions
• Inserting an item: O(k)
• Membership query: O(k)
• Trades space efficiency for a rate of false positives

(Diagram source: Wikipedia)
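The "Spark 2: example code" shown on this slide is not in the transcript; a minimal sketch of the DataFrame API (column name and sizing parameters are assumptions):

val ids = ss.range(0, 1000000).toDF("id")
// expected number of items and desired false-positive probability
val bf = ids.stat.bloomFilter("id", 1000000L, 0.03)
bf.mightContain(42L)   // true: 42 was inserted
bf.mightContain(-1L)   // usually false; false positives possible, never false negatives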
23. Approximate Quantile

Spark 2: example code (the details are much more complicated than the above two :))

res8: Array[Double] = Array(167543.0, 1260934.0, 3753121.0, 5643475.0)
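A minimal sketch of a call that could produce output of this shape (the table, column, probabilities, and error bound are assumptions):

val df = ss.table("country_pop")
// quartiles of a numeric column, with 1% relative error
val q = df.stat.approxQuantile("population_2015", Array(0.25, 0.5, 0.75, 1.0), 0.01)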
Databricks example notebook:
https://docs.cloud.databricks.com/docs/latest/sample_applications/04%20Apache%20Spark%202.0%20Examples/05%20Approximate%20Quantile.html
24. Whole-stage code generation

• Collapsing multiple function calls into one
• Minimising overheads (e.g., virtual function calls) and using CPU registers

[Figure: the Volcano iterator model (Spark 1.x) vs. whole-stage code generation (Spark 2.0)]

Example source: Databricks example notebook:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/293651311471490/5382278320999420/latest.html
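A quick way to see whole-stage code generation at work (a sketch; the exact plan text varies by version). Operators prefixed with '*' in the physical plan are compiled together into a single generated function:

ss.range(1000).selectExpr("id + 1 AS x").filter("x % 2 = 0").explain()
// == Physical Plan == (illustrative output)
// *Filter ((x % 2) = 0)
// +- *Project [(id + 1) AS x]
//    +- *Range (0, 1000, ...)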
25. Structured Streaming [Alpha]

• SQL on streaming data

Source: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
Databricks example notebook:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4012078893478893/295447656425301/5985939988045659/latest.html
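A minimal sketch of the streaming API as of Spark 2.0 (source path, schema, and sink are assumptions):

import org.apache.spark.sql.types._

val schema = new StructType().add("word", StringType)
val lines = ss.readStream.schema(schema).json("/data/incoming")  // streaming DataFrame

// The same DataFrame operations as in batch: SQL on streaming data
val counts = lines.groupBy("word").count()

val query = counts.writeStream
  .outputMode("complete")   // emit the full updated result on each trigger
  .format("console")
  .start()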
26. Structured Streaming [Alpha] (cont.)

Source: https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html

Not ready for production.
27. We are currently recruiting 🙂
combient.com/#careers
Presentation materials are available below.
https://github.com/rerngvit/technical_presentations/tree/master/2016-10-27_ML_Meetup_Stockholm%40Schibsted