A new look on Spark 2 features and Under the hood. We try to look at Apache spark latest release with an examining look, while still loving it, but also criticising it.
Jump Start with Apache Spark 2.0 on DatabricksDatabricks
Apache Spark 2.0 has laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
What’s new in Spark 2.0
SparkSessions vs SparkContexts
Datasets/Dataframes and Spark SQL
Introduction to Structured Streaming concepts and APIs
Spark 2.0 is a major release of Apache Spark. This release has brought many changes to API(s) and libraries of Spark. So in this KnolX, we will be looking at some improvements that are made in Spark 2.0. Also, in these slides we will be getting an introduction to some new features in Spark 2,0 like SparkSession API and Structured Streaming.
A Journey into Databricks' Pipelines: Journey and Lessons LearnedDatabricks
With components like Spark SQL, MLlib, and Streaming, Spark is a unified engine for building data applications. In this talk, we will take a look at how we use Spark on our own Databricks platform throughout our data pipeline for use cases such as ETL, data warehousing, and real time analysis. We will demonstrate how these applications empower engineering and data analytics. We will also share some lessons learned from building our data pipeline around security and operations. This talk will include examples on how to use Structured Streaming (a.k.a Streaming DataFrames) for online analysis, SparkR for offline analysis, and how we connect multiple sources to achieve a Just-In-Time Data Warehouse.
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFramesDatabricks
In the Apache Spark 2.x releases, Machine Learning (ML) is focusing on DataFrame-based APIs. This webinar is aimed at helping users take full advantage of the new APIs. Topics will include migrating workloads from RDDs to DataFrames, ML persistence for saving and loading models, and the roadmap ahead.
Migrating ML workloads to use Spark DataFrames and Datasets allows users to benefit from simpler APIs, plus speed and scalability improvements. As the DataFrame/Dataset API becomes the primary API for data in Spark, this migration will become increasingly important to MLlib users, especially for integrating ML with the rest of Spark data processing workloads. We will give a tutorial covering best practices and some of the immediate and future benefits to expect.
ML persistence is one of the biggest improvements in the DataFrame-based API. With Spark 2.0, almost all ML algorithms can be saved and loaded, even across languages. ML persistence dramatically simplifies collaborating across teams and moving ML models to production. We will demonstrate how to use persistence, and we will discuss a few existing issues and workarounds.
At the end of the webinar, we will discuss major roadmap items. These include API coverage, major speed and scalability improvements to certain algorithms, and integration with structured streaming.
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016 Databricks
Tathagata 'TD' Das presented at Bay Area Apache Spark Meetup. This talk covers the merits and motivations of Structured Streaming, and how you can start writing end-to-end continuous applications using Structured Streaming APIs.
Jump Start with Apache Spark 2.0 on DatabricksDatabricks
Apache Spark 2.0 has laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
What’s new in Spark 2.0
SparkSessions vs SparkContexts
Datasets/Dataframes and Spark SQL
Introduction to Structured Streaming concepts and APIs
Spark 2.0 is a major release of Apache Spark. This release has brought many changes to API(s) and libraries of Spark. So in this KnolX, we will be looking at some improvements that are made in Spark 2.0. Also, in these slides we will be getting an introduction to some new features in Spark 2,0 like SparkSession API and Structured Streaming.
A Journey into Databricks' Pipelines: Journey and Lessons LearnedDatabricks
With components like Spark SQL, MLlib, and Streaming, Spark is a unified engine for building data applications. In this talk, we will take a look at how we use Spark on our own Databricks platform throughout our data pipeline for use cases such as ETL, data warehousing, and real time analysis. We will demonstrate how these applications empower engineering and data analytics. We will also share some lessons learned from building our data pipeline around security and operations. This talk will include examples on how to use Structured Streaming (a.k.a Streaming DataFrames) for online analysis, SparkR for offline analysis, and how we connect multiple sources to achieve a Just-In-Time Data Warehouse.
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFramesDatabricks
In the Apache Spark 2.x releases, Machine Learning (ML) is focusing on DataFrame-based APIs. This webinar is aimed at helping users take full advantage of the new APIs. Topics will include migrating workloads from RDDs to DataFrames, ML persistence for saving and loading models, and the roadmap ahead.
Migrating ML workloads to use Spark DataFrames and Datasets allows users to benefit from simpler APIs, plus speed and scalability improvements. As the DataFrame/Dataset API becomes the primary API for data in Spark, this migration will become increasingly important to MLlib users, especially for integrating ML with the rest of Spark data processing workloads. We will give a tutorial covering best practices and some of the immediate and future benefits to expect.
ML persistence is one of the biggest improvements in the DataFrame-based API. With Spark 2.0, almost all ML algorithms can be saved and loaded, even across languages. ML persistence dramatically simplifies collaborating across teams and moving ML models to production. We will demonstrate how to use persistence, and we will discuss a few existing issues and workarounds.
At the end of the webinar, we will discuss major roadmap items. These include API coverage, major speed and scalability improvements to certain algorithms, and integration with structured streaming.
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016 Databricks
Tathagata 'TD' Das presented at Bay Area Apache Spark Meetup. This talk covers the merits and motivations of Structured Streaming, and how you can start writing end-to-end continuous applications using Structured Streaming APIs.
Sudarshan Kadambi presented this talk at the Bay Area Spark Meetup @ Bloomberg. He covered Bloomberg Apache Spark Server and contributions to Apache Spark. The talk also talked about challenges of doing high-volume online analytics while still observing high-levels of SLAs
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Databricks
“As Apache Spark becomes more widely adopted, we have focused on creating higher-level APIs that provide increased opportunities for automatic optimization. In this talk, I give an overview of some of the exciting new API’s available in Spark 2.0, namely Datasets and Structured Streaming. Together, these APIs are bringing the power of Catalyst, Spark SQL's query optimizer, to all users of Spark. I'll focus on specific examples of how developers can build their analyses more quickly and efficiently simply by providing Spark with more information about what they are trying to accomplish.” - Michael
Databricks Blog: "Deep Dive into Spark SQL’s Catalyst Optimizer"
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
// About the Presenter //
Michael Armbrust is the lead developer of the Spark SQL project at Databricks. He received his PhD from UC Berkeley in 2013, and was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage and query optimization.
Follow Michael on -
Twitter: https://twitter.com/michaelarmbrust
LinkedIn: https://www.linkedin.com/in/michaelarmbrust
Exceptions are the Norm: Dealing with Bad Actors in ETLDatabricks
Stable and robust data pipelines are a critical component of the data infrastructure of enterprises. Most commonly, data pipelines ingest messy data sources with incorrect, incomplete or inconsistent records and produce curated and/or summarized data for consumption by subsequent applications.
In this talk, we go over new and upcoming features in Spark that enabled it to better serve such workloads. Such features include isolation of corrupt input records and files, useful diagnostic feedback to users and improved support for nested type handling which is common in ETL jobs.
Speaker: Sameer Agarwal
This talk was originally presented at Spark Summit East 2017.
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...Lillian Pierson
In this one-hour webinar, you will be introduced to Spark, the data engineering that supports it, and the data science advances that it has spurned. You’ll discover the interesting story of its academic origins and then get an overview of the organizations who are using the technology. After being briefed on some impressive Spark case studies, you’ll come to know of the next-generation Spark 2.0 (to be released in just a few months). We will also tell you about the tremendous impact that learning Spark can have upon your current salary, and the best ways to get trained in this ground-breaking new technology.
Overview of Apache Spark 2.3: What’s New? with Sameer AgarwalDatabricks
Apache Spark 2.0 set the architectural foundations of Structure in Spark, Unified high-level APIs, Structured Streaming, and the underlying performant components like Catalyst Optimizer and Tungsten Engine. Since then the Spark community contributors have continued to build new features and fix numerous issues in releases Spark 2.1 and 2.2.
Continuing forward in that spirit, Apache Spark 2.3 has made similar strides too, introducing new features and resolving over 1300 JIRA issues. In this talk, we want to share with the community some salient aspects of Spark 2.3 features:
Kubernetes Scheduler Backend
PySpark Performance and Enhancements
Continuous Structured Streaming Processing
DataSource v2 APIs
Spark History Server Performance Enhancements
Large-Scale Data Science in Apache Spark 2.0Databricks
Data science is one of the only fields where scalability can lead to fundamentally better results. Scalability allows users to train models on more data or to experiment with more types of models, both of which result in better models. It is no accident that the organizations most successful with AI have been those with huge distributed computing resources. In this talk, Matei Zaharia will describe how Apache Spark is democratizing large-scale data science to make it easier for more organizations to build high-quality data and AI products. Matei Zaharia will talk about the new structured APIs in Spark 2.0 that enable more optimization underneath familia programming interfaces, as well as libraries to scale up deep learning or traditional machine learning libraries on Apache Spark.
Speaker: Matei Zaharia
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
Apache Spark 2.x has laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Apache Spark Fundamentals & Concepts
What’s new in Spark 2.x
SparkSessions vs SparkContexts
Datasets/Dataframes and Spark SQL
Introduction to Structured Streaming concepts and APIs
Vectorized Query Execution in Apache Spark at FacebookDatabricks
A standard query execution system processes one row at a time. Vectorized query execution batches multiples rows together in a columnar format, and each operator uses simple loops to iterate over data within a batch. This feature greatly reduces the CPU usage for reading, writing and query operations like scanning, filtering. In this talk, we will take a deep dive into Facebook's ORC-based vectorized reader and writer implementation, discuss how vectorization affects performance of various data types in Hive/Spark, and quantify the improvements vectorization brings to the Facebook Warehouse.
Speaker: Chen Yang
Spark r under the hood with Hossein FalakiDatabricks
SparkR is a new and evolving interface to Apache Spark. It offers a wide range of APIs and capabilities to Data Scientists and Statisticians. Being a distributed system with a JVM core some R users find SparkR errors unfamiliar. In this talk we will show what goes on under the hood when you interact with SparkR. We will look at SparkR architecture, performance bottlenecks and API semantics. Equipped with those, we will show how some common errors can be eliminated. I will use debugging examples based on our experience with real SparkR use cases.
Apache Spark Usage in the Open Source EcosystemDatabricks
Apache Spark is an active member of the broad open source community beyond the Apache Foundation. Every day thousands of users combine capabilities of Spark with other open source software to get their job done. This is not by chance. Spark has been designed to behave well with existing ecosystems. For example, PySpark is designed to work well with Pandas, Numpy and other python packages. In this talk we will present an analysis of libraries and open source tools that are commonly used along with Spark in JVM, Python and R ecosystems. Our quantitative results are based on usage of thousands of Spark users. We will show the Spark Summit attendees what the rest of their community finds useful to complement the power of Spark and what parts of Spark API is used in conjunction with most popular open source libraries.
Composable Parallel Processing in Apache Spark and WeldDatabricks
The main reason people are productive writing software is composability -- engineers can take libraries and functions written by other developers and easily combine them into a program. However, composability has taken a back seat in early parallel processing APIs. For example, composing MapReduce jobs required writing the output of every job to a file, which is both slow and error-prone. Apache Spark helped simplify cluster programming largely because it enabled efficient composition of parallel functions, leading to a large standard library and high-level APIs in various languages. In this talk, I'll explain how composability has evolved in Spark's newer APIs, and also present a new research project I'm leading at Stanford called Weld to enable much more efficient composition of software on emerging parallel hardware (multicores, GPUs, etc).
Speaker: Matei Zaharia
From Pipelines to Refineries: Scaling Big Data ApplicationsDatabricks
Big data tools are challenging to combine into a larger application: ironically, big data applications themselves do not tend to scale very well. These issues of integration and data management are only magnified by increasingly large volumes of data.
Apache Spark provides strong building blocks for batch processes, streams and ad-hoc interactive analysis. However, users face challenges when putting together a single coherent pipeline that could involve hundreds of transformation steps, especially when confronted by the need of rapid iterations.
This talk explores these issues through the lens of functional programming. It presents an experimental framework that provides full-pipeline guarantees by introducing more laziness to Apache Spark. This framework allows transformations to be seamlessly composed and alleviates common issues, thanks to whole program checks, auto-caching, and aggressive computation parallelization and reuse.
User Defined Functions is an important feature of Spark SQL which helps extend the language by adding custom constructs. UDFs are very useful for extending spark vocabulary but come with significant performance overhead. These are black boxes for Spark optimizer, blocking several helpful optimizations like WholeStageCodegen, Null optimization etc. They also come with a heavy processing cost associated with String functions requiring UTF-8 to UTF-16 conversions which slows down spark jobs and increases memory requirements. In this talk, we will go over how at Informatica we optimized UDFs to be as performant as Spark native functions both in terms of time and memory and allow these functions to participate in spark optimization steps.
Parallelizing Existing R Packages with SparkRDatabricks
R is the latest language added to Apache Spark, and the SparkR API is slightly different from PySpark. With the release of Spark 2.0, the R API officially supports executing user code on distributed data. This is done through a family of apply() functions. In this talk, Hossein Falaki gives an overview of this new functionality in SparkR. Using this API requires some changes to regular code with dapply(). This talk will focus on how to correctly use this API to parallelize existing R packages. Most important topics of consideration will be performance and correctness when using the apply family of functions in SparkR.
Speaker: Hossein Falaki
This talk was originally presented at Spark Summit East 2017.
Apache Spark 2.0: Faster, Easier, and SmarterDatabricks
In this webcast, Reynold Xin from Databricks will be speaking about Apache Spark's new 2.0 major release.
The major themes for Spark 2.0 are:
- Unified APIs: Emphasis on building up higher level APIs including the merging of DataFrame and Dataset APIs
- Structured Streaming: Simplify streaming by building continuous applications on top of DataFrames allow us to unify streaming, interactive, and batch queries.
- Tungsten Phase 2: Speed up Apache Spark by 10X
Sudarshan Kadambi presented this talk at the Bay Area Spark Meetup @ Bloomberg. He covered Bloomberg Apache Spark Server and contributions to Apache Spark. The talk also talked about challenges of doing high-volume online analytics while still observing high-levels of SLAs
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Databricks
“As Apache Spark becomes more widely adopted, we have focused on creating higher-level APIs that provide increased opportunities for automatic optimization. In this talk, I give an overview of some of the exciting new API’s available in Spark 2.0, namely Datasets and Structured Streaming. Together, these APIs are bringing the power of Catalyst, Spark SQL's query optimizer, to all users of Spark. I'll focus on specific examples of how developers can build their analyses more quickly and efficiently simply by providing Spark with more information about what they are trying to accomplish.” - Michael
Databricks Blog: "Deep Dive into Spark SQL’s Catalyst Optimizer"
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
// About the Presenter //
Michael Armbrust is the lead developer of the Spark SQL project at Databricks. He received his PhD from UC Berkeley in 2013, and was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage and query optimization.
Follow Michael on -
Twitter: https://twitter.com/michaelarmbrust
LinkedIn: https://www.linkedin.com/in/michaelarmbrust
Exceptions are the Norm: Dealing with Bad Actors in ETLDatabricks
Stable and robust data pipelines are a critical component of the data infrastructure of enterprises. Most commonly, data pipelines ingest messy data sources with incorrect, incomplete or inconsistent records and produce curated and/or summarized data for consumption by subsequent applications.
In this talk, we go over new and upcoming features in Spark that enabled it to better serve such workloads. Such features include isolation of corrupt input records and files, useful diagnostic feedback to users and improved support for nested type handling which is common in ETL jobs.
Speaker: Sameer Agarwal
This talk was originally presented at Spark Summit East 2017.
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...Lillian Pierson
In this one-hour webinar, you will be introduced to Spark, the data engineering that supports it, and the data science advances that it has spurned. You’ll discover the interesting story of its academic origins and then get an overview of the organizations who are using the technology. After being briefed on some impressive Spark case studies, you’ll come to know of the next-generation Spark 2.0 (to be released in just a few months). We will also tell you about the tremendous impact that learning Spark can have upon your current salary, and the best ways to get trained in this ground-breaking new technology.
Overview of Apache Spark 2.3: What’s New? with Sameer AgarwalDatabricks
Apache Spark 2.0 set the architectural foundations of Structure in Spark, Unified high-level APIs, Structured Streaming, and the underlying performant components like Catalyst Optimizer and Tungsten Engine. Since then the Spark community contributors have continued to build new features and fix numerous issues in releases Spark 2.1 and 2.2.
Continuing forward in that spirit, Apache Spark 2.3 has made similar strides too, introducing new features and resolving over 1300 JIRA issues. In this talk, we want to share with the community some salient aspects of Spark 2.3 features:
Kubernetes Scheduler Backend
PySpark Performance and Enhancements
Continuous Structured Streaming Processing
DataSource v2 APIs
Spark History Server Performance Enhancements
Large-Scale Data Science in Apache Spark 2.0Databricks
Data science is one of the only fields where scalability can lead to fundamentally better results. Scalability allows users to train models on more data or to experiment with more types of models, both of which result in better models. It is no accident that the organizations most successful with AI have been those with huge distributed computing resources. In this talk, Matei Zaharia will describe how Apache Spark is democratizing large-scale data science to make it easier for more organizations to build high-quality data and AI products. Matei Zaharia will talk about the new structured APIs in Spark 2.0 that enable more optimization underneath familia programming interfaces, as well as libraries to scale up deep learning or traditional machine learning libraries on Apache Spark.
Speaker: Matei Zaharia
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
Apache Spark 2.x has laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Apache Spark Fundamentals & Concepts
What’s new in Spark 2.x
SparkSessions vs SparkContexts
Datasets/Dataframes and Spark SQL
Introduction to Structured Streaming concepts and APIs
Vectorized Query Execution in Apache Spark at FacebookDatabricks
A standard query execution system processes one row at a time. Vectorized query execution batches multiples rows together in a columnar format, and each operator uses simple loops to iterate over data within a batch. This feature greatly reduces the CPU usage for reading, writing and query operations like scanning, filtering. In this talk, we will take a deep dive into Facebook's ORC-based vectorized reader and writer implementation, discuss how vectorization affects performance of various data types in Hive/Spark, and quantify the improvements vectorization brings to the Facebook Warehouse.
Speaker: Chen Yang
Spark r under the hood with Hossein FalakiDatabricks
SparkR is a new and evolving interface to Apache Spark. It offers a wide range of APIs and capabilities to Data Scientists and Statisticians. Being a distributed system with a JVM core some R users find SparkR errors unfamiliar. In this talk we will show what goes on under the hood when you interact with SparkR. We will look at SparkR architecture, performance bottlenecks and API semantics. Equipped with those, we will show how some common errors can be eliminated. I will use debugging examples based on our experience with real SparkR use cases.
Apache Spark Usage in the Open Source EcosystemDatabricks
Apache Spark is an active member of the broad open source community beyond the Apache Foundation. Every day thousands of users combine capabilities of Spark with other open source software to get their job done. This is not by chance. Spark has been designed to behave well with existing ecosystems. For example, PySpark is designed to work well with Pandas, Numpy and other python packages. In this talk we will present an analysis of libraries and open source tools that are commonly used along with Spark in JVM, Python and R ecosystems. Our quantitative results are based on usage of thousands of Spark users. We will show the Spark Summit attendees what the rest of their community finds useful to complement the power of Spark and what parts of Spark API is used in conjunction with most popular open source libraries.
Composable Parallel Processing in Apache Spark and WeldDatabricks
The main reason people are productive writing software is composability -- engineers can take libraries and functions written by other developers and easily combine them into a program. However, composability has taken a back seat in early parallel processing APIs. For example, composing MapReduce jobs required writing the output of every job to a file, which is both slow and error-prone. Apache Spark helped simplify cluster programming largely because it enabled efficient composition of parallel functions, leading to a large standard library and high-level APIs in various languages. In this talk, I'll explain how composability has evolved in Spark's newer APIs, and also present a new research project I'm leading at Stanford called Weld to enable much more efficient composition of software on emerging parallel hardware (multicores, GPUs, etc).
Speaker: Matei Zaharia
From Pipelines to Refineries: Scaling Big Data ApplicationsDatabricks
Big data tools are challenging to combine into a larger application: ironically, big data applications themselves do not tend to scale very well. These issues of integration and data management are only magnified by increasingly large volumes of data.
Apache Spark provides strong building blocks for batch processes, streams and ad-hoc interactive analysis. However, users face challenges when putting together a single coherent pipeline that could involve hundreds of transformation steps, especially when confronted by the need of rapid iterations.
This talk explores these issues through the lens of functional programming. It presents an experimental framework that provides full-pipeline guarantees by introducing more laziness to Apache Spark. This framework allows transformations to be seamlessly composed and alleviates common issues, thanks to whole program checks, auto-caching, and aggressive computation parallelization and reuse.
User Defined Functions is an important feature of Spark SQL which helps extend the language by adding custom constructs. UDFs are very useful for extending spark vocabulary but come with significant performance overhead. These are black boxes for Spark optimizer, blocking several helpful optimizations like WholeStageCodegen, Null optimization etc. They also come with a heavy processing cost associated with String functions requiring UTF-8 to UTF-16 conversions which slows down spark jobs and increases memory requirements. In this talk, we will go over how at Informatica we optimized UDFs to be as performant as Spark native functions both in terms of time and memory and allow these functions to participate in spark optimization steps.
Parallelizing Existing R Packages with SparkRDatabricks
R is the latest language added to Apache Spark, and the SparkR API is slightly different from PySpark. With the release of Spark 2.0, the R API officially supports executing user code on distributed data. This is done through a family of apply() functions. In this talk, Hossein Falaki gives an overview of this new functionality in SparkR. Using this API requires some changes to regular code with dapply(). This talk will focus on how to correctly use this API to parallelize existing R packages. Most important topics of consideration will be performance and correctness when using the apply family of functions in SparkR.
Speaker: Hossein Falaki
This talk was originally presented at Spark Summit East 2017.
Apache Spark 2.0: Faster, Easier, and SmarterDatabricks
In this webcast, Reynold Xin from Databricks will be speaking about Apache Spark's new 2.0 major release.
The major themes for Spark 2.0 are:
- Unified APIs: Emphasis on building up higher level APIs including the merging of DataFrame and Dataset APIs
- Structured Streaming: Simplify streaming by building continuous applications on top of DataFrames allow us to unify streaming, interactive, and batch queries.
- Tungsten Phase 2: Speed up Apache Spark by 10X
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...Chris Fregly
Title
Real-time, Advanced Analytics and Recommendations using Machine Learning, Graph Processing, Natural Language Processing, and Approximations with Apache Spark, Stanford CoreNLP, and Twitter Algebird
BONUS: Netflix Recommendations: Then and Now
Agenda
Intro
Live, Interactive Recommendations Demo
Spark ML, GraphX, Streaming, Kafka, Cassandra, Docker
Types of Similarity
Euclidean vs. Non-Euclidean Similarity
User-to-User Similarity
Content-based, Item-to-Item Similarity (Amazon)
Collaborative-based, User-to-Item Similarity (Netflix)
Graph-based, Item-to-Item Similarity Pathway (Spotify)
Similarity Approximations at Scale
Twitter Algebird
MinHash and Bucketing
Locality Sensitive Hashing (LSH)
BONUS: Netflix Recommendations: From Ratings to Real-Time
DVD-Ratings-based $1M Netflix Prize (2009)
Streaming-based "Trending Now" (2016)
Wrap Up
Q & A
Bio
Chris Fregly is a Principal Data Solutions Engineer for the newly-formed IBM Spark Technology Center, an Apache Spark Contributor, and a Netflix Open Source Committer.
Chris is also the founder of the global Advanced Apache Spark Meetup and author of the upcoming book, Advanced Spark @ advancedspark.com.
Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.
Related Links
https://github.com/fluxcapacitor/pipeline/wiki
http://cdn.oreillystatic.com/en/assets/1/event/105/Algebra%20for%20Scalable%20Analytics%20Presentation.pdf
http://static.echonest.com/BoilTheFrog/
http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf
http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/
http://www.cc.gatech.edu/~zha/CSE8801/CF/kdd-fp074-koren.pdf
http://hortonworks.com/hadoop/spark/
Recording:
https://hortonworks.webex.com/hortonworks/lsr.php?RCID=03debab5ba04b34a033dc5c2f03c7967
As the ratio of memory to processing power rapidly evolves, many within the Hadoop community are gravitating towards Apache Spark for fast, in-memory data processing. And with YARN, they use Spark for machine learning and data science use cases along side other workloads simultaneously. This is a continuation of our YARN Ready Series, aimed at helping developers learn the different ways to integrate to YARN and Hadoop. Tools and applications that are YARN Ready have been verified to work within YARN.
Cloud computing gives you a number of advantages, such as the ability to scale your web application or website on demand. If you have a new web application and want to use cloud computing, you might be asking yourself, "Where do I start?" Join us in this session to understand best practices for scaling your resources from zero to millions of users. We show you how to best combine different AWS services, how to make smarter decisions for architecting your application, and how to scale your infrastructure in the cloud.
Hortonworks Technical Workshop: HBase For Mission Critical ApplicationsHortonworks
HBase adoption continues to explode amid rapid customer success and unbridled innovation. HBase with its limitless scalability, high reliability and deep integration with Hadoop ecosystem tools, offers enterprise developers a rich platform on which to build their next generation applications. In this workshop we will explore HBase SQL capabilities, deep Hadoop ecosystem integrations and deployment & management best practices.
Scala - The Simple Parts, SFScala presentationMartin Odersky
These are the slides of the talk I gave on May 22, 2014 to the San Francisco Scala user group. Similar talks were given before at GOTO Chicago, keynote, at Gilt Groupe and Hunter College in New York, at JAX Mainz and at FlatMap Oslo.
AWS re:Invent 2016: Learn How FINRA Aligns Billions of Time Ordered Events wi...Amazon Web Services
FINRA is a leader in the Financial Services industry who sought to move toward real-time data insights of billions of time-ordered market events by migrating from SQL batch processes on-prem, to Apache Spark in the cloud. By using Apache Spark on Amazon EMR, FINRA can now test on realistic data from market downturns, enhancing their ability to provide investor protection and promote market integrity (FINRA enacts rules and provides guidance that securities exchanges & brokers must follow). By using AWS Spot instances, FINRA has saved up to 50% from its on premises solution, increased elasticity/scalability, and accelerated reprocessing requests (from months to days). Learn best practices on how FINRA moves toward real-time data analytics with Spark and AWS, while managing production workloads in parallel, increasing performance and IT efficiency, reducing cost, and modernizing and scaling their infrastructure to prepare for real-time processing in the future.
Apache Spark presentation at HasGeek FifthElelephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
In this talk at 2015 Spark Summit East, the lead developer of Spark streaming, @tathadas, talks about the state of Spark streaming:
Spark Streaming extends the core Apache Spark API to perform large-scale stream processing, which is revolutionizing the way Big “Streaming” Data application are being written. It is rapidly adopted by companies spread across various business verticals – ad and social network monitoring, real-time analysis of machine data, fraud and anomaly detections, etc. These companies are mainly adopting Spark Streaming because – Its simple, declarative batch-like API makes large-scale stream processing accessible to non-scientists. – Its unified API and a single processing engine (i.e. Spark core engine) allows a single cluster and a single set of operational processes to cover the full spectrum of uses cases – batch, interactive and stream processing. – Its stronger, exactly-once semantics makes it easier to express and debug complex business logic. In this talk, I am going to elaborate on such adoption stories, highlighting interesting use cases of Spark Streaming in the wild. In addition, this presentation will also showcase the exciting new developments in Spark Streaming and the potential future roadmap.
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Helena Edelson
O'Reilly Webcast with Myself and Evan Chan on the new SNACK Stack (playoff of SMACK) with FIloDB: Scala, Spark Streaming, Akka, Cassandra, FiloDB and Kafka.
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellDatabricks
In this webcast, Patrick Wendell from Databricks will be speaking about Apache Spark's new 1.6 release.
Spark 1.6 will include (but not limited to) a type-safe API called Dataset on top of DataFrames that leverages all the work in Project Tungsten to have more robust and efficient execution (including memory management, code generation, and query optimization) [SPARK-9999], adaptive query execution [SPARK-9850], and unified memory management by consolidating cache and execution memory [SPARK-10000].
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKzmhassan
As spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF), can be applied to monitor and archive system performance data in a containerized spark environment. In our examples, we will gather spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
This presentations is first in the series of Apache Spark tutorials and covers the basics of Spark framework.Subscribe to my youtube channel for more updates https://www.youtube.com/channel/UCNCbLAXe716V2B7TEsiWcoA
Spark Streaming has supported Kafka since it's inception, but a lot has changed since those times, both in Spark and Kafka sides, to make this integration more fault-tolerant and reliable.Apache Kafka 0.10 (actually since 0.9) introduced the new Consumer API, built on top of a new group coordination protocol provided by Kafka itself.
So a new Spark Streaming integration comes to the playground, with a similar design to the 0.8 Direct DStream approach. However, there are notable differences in usage, and many exciting new features. In this talk, we will cover what are the main differences between this new integration and the previous one (for Kafka 0.8), and why Direct DStreams have replaced Receivers for good. We will also see how to achieve different semantics (at least one, at most one, exactly once) with code examples.
Finally, we will briefly introduce the usage of this integration in Billy Mobile to ingest and process the continuous stream of events from our AdNetwork.
Presenter: Kenn Knowles, Software Engineer, Google & Apache Beam (incubating) PPMC member
Apache Beam (incubating) is a programming model and library for unified batch & streaming big data processing. This talk will cover the Beam programming model broadly, including its origin story and vision for the future. We will dig into how Beam separates concerns for authors of streaming data processing pipelines, isolating what you want to compute from where your data is distributed in time and when you want to produce output. Time permitting, we might dive deeper into what goes into building a Beam runner, for example atop Apache Apex.
Jumpstart on Apache Spark 2.2 on DatabricksDatabricks
In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Agenda:
• Overview of Spark Fundamentals & Architecture
• What’s new in Spark 2.x
• Unified APIs: SparkSessions, SQL, DataFrames, Datasets
• Introduction to DataFrames, Datasets and Spark SQL
• Introduction to Structured Streaming Concepts
• Four Hands On Labs
You will use Databricks Community Edition, which will give you unlimited free access to a ~6 GB Spark 2.x local mode cluster. And in the process, you will learn how to create a cluster, navigate in Databricks, explore a couple of datasets, perform transformations and ETL, save your data as tables and parquet files, read from these sources, and analyze datasets using DataFrames/Datasets API and Spark SQL.
Level: Beginner to intermediate, not for advanced Spark users.
Prerequisite: You will need a laptop with Chrome or Firefox browser installed with at least 8 GB. Introductory or basic knowledge Scala or Python is required, since the Notebooks will be in Scala; Python is optional.
Bio:
Jules S. Damji is an Apache Spark Community Evangelist with Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies, such as Sun Microsystems, Netscape, LoudCloud/Opsware, VeriSign, Scalix, and ProQuest, building large-scale distributed systems. Before joining Databricks, he was a Developer Advocate at Hortonworks.
Jump Start on Apache® Spark™ 2.x with Databricks Databricks
Apache Spark 2.0 and subsequent releases of Spark 2.1 and 2.2 have laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Agenda:
• Overview of Spark Fundamentals & Architecture
• What’s new in Spark 2.x
• Unified APIs: SparkSessions, SQL, DataFrames, Datasets
• Introduction to DataFrames, Datasets and Spark SQL
• Introduction to Structured Streaming Concepts
• Four Hands On Labs
You will use Databricks Community Edition, which will give you unlimited free access to a ~6 GB Spark 2.x local mode cluster. And in the process, you will learn how to create a cluster, navigate in Databricks, explore a couple of datasets, perform transformations and ETL, save your data as tables and parquet files, read from these sources, and analyze datasets using DataFrames/Datasets API and Spark SQL.
Level: Beginner to intermediate, not for advanced Spark users.
Prerequisite: You will need a laptop with Chrome or Firefox browser installed with at least 8 GB. Introductory or basic knowledge Scala or Python is required, since the Notebooks will be in Scala; Python is optional.
Bio:
Jules S. Damji is an Apache Spark Community Evangelist with Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies, such as Sun Microsystems, Netscape, LoudCloud/Opsware, VeriSign, Scalix, and ProQuest, building large-scale distributed systems. Before joining Databricks, he was a Developer Advocate at Hortonworks.
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
This presentation about Apache Spark covers all the basics that a beginner needs to know to get started with Spark. It covers the history of Apache Spark, what is Spark, the difference between Hadoop and Spark. You will learn the different components in Spark, and how Spark works with the help of architecture. You will understand the different cluster managers on which Spark can run. Finally, you will see the various applications of Spark and a use case on Conviva. Now, let's get started with what is Apache Spark.
Below topics are explained in this Spark presentation:
1. History of Spark
2. What is Spark
3. Hadoop vs Spark
4. Components of Apache Spark
5. Spark architecture
6. Applications of Spark
7. Spark usecase
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course have been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
Simplilearn’s Apache Spark and Scala certification training are designed to:
1. Advance your expertise in the Big Data Hadoop Ecosystem
2. Help you master essential Apache and Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming and Shell Scripting Spark
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos
What skills will you learn?
By completing this Apache Spark and Scala course you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark
5. Master Structured Query Language (SQL) using SparkSQL
6. Gain a thorough understanding of Spark streaming features
7. Master and describe the features of Spark ML programming and GraphX programming
Who should take this Scala course?
1. Professionals aspiring for a career in the field of real-time big data analytics
2. Analytics professionals
3. Research professionals
4. IT developers and testers
5. Data scientists
6. BI and reporting professionals
7. Students who wish to gain a thorough understanding of Apache Spark
Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training
What’s new in Spark 2.0?
Rerngvit Yanggratoke @ Combient AB
Örjan Lundberg @ Combient AB
Machine Learning Stockholm Meetup
27 October, 2016
Schibsted Media Group
Sviluppare applicazioni nell'era dei "Big Data" con Scala e Spark - Mario Car...Codemotion
Scala è un linguaggio di programmazione general purpose multi-paradigma pensato per realizzare applicazioni ad alte prestazioni che girano all'interno della Java Virtual Machine. Spark è il framework "Big Data", basato su Scala, più flessibile e performante disponibile oggi sul mercato. Durante il talk verrà introdotto il linguaggio Scala e verranno mostrate le potenzialità legate al suo utilizzo nell'ambito dello sviluppo di applicazioni web di ultima generazione compresa la possibilità di processamento parallelo di grandi quantità di dati attraverso l'utilizzo del framework Spark.
At Spark Summit East in New York, we unveil PowerStream, an Internet of Things (IoT) simulation with visualizations and alerts based on real-time data from 2 million sensors across global wind farms.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen how developers implement features on the front-end just following the standard rules for a framework and think that this is enough to successfully launch the project, and then the project fails. How to prevent this and what approach to choose? I have launched dozens of complex projects and during the talk we will analyze which approaches have worked for me and which have not.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
3. In just OVER two words…
Not Much,
but let’s talk about it.
4. Let’s start from the beginning
• What is Apache Spark?
• An open source cluster computing framework.
• Originally developed at the University of California, Berkeley's AMPLab.
• Aimed and designed to be a Big Data computational framework.
8. Version History (major changes)
1.0 –
SparkSQL
(formally Shark
project)
1.1 –
Streaming
support for
python
1.2 – Core
engine
improvements.
GraphX
graduates.
1.3 –
DataFrame.
Python engine
improvements.
1.4 – SparkR
1.5 – Bugs and
Performance
1.6 – Dataset
(experimental)
9. Spark 2.0.x
• Major notables:
• Scala 2.11
• SparkSession
• Performance
• API Stability
• SQL:2003 support
• Structured Streaming
• R UDFs and Mllib algorithms implementation
10. API
• Spark doesn't like API changes
• The good news:
• To migrate, you’ll have to perform
little to no changes to your code.
• The (not so) bad news:
• To benefit from all the
performance improvements, some
old code might need more
refactoring.
11. Programming API
• Unifying DataFrame and Dataset:
• Dataset[Row] = DataFrame
• SparkSession replaces
SqlContext/HiveContext
• Both kept for backwards compatibility.
• Simpler, more performant accumulator
API
• A new, improved Aggregator API for
typed aggregation in Datasets
12. SQL Language
• Improved SQL functionalities (SQL
2003 support)
• Can now run all 99 TPC-DS
queries
• The parser support ANSI-SQL as
well as HiveQL
• Subquery support
13. SparkSQL new features
• Native CSV data source (Based on
Databricks’ spark-csv package)
• Better off-heap memory
management
• Bucketing support (Hive
implementation)
• Performance performance
performance
15. Judges Ruling
Dataset was supposed to
be the future like a 6
months ago
SQL 2003 is so 2003
The API lives on!
SQL 2003 is cool
16. Structured Streaming (Alpha)
Structured Streaming is a scalable and
fault-tolerant stream processing engine
built on the Spark SQL engine.
You can express your streaming
computation the same way you would
express a batch computation on static
data.
17. Structured Streaming (cont.)
You can use the Dataset / DataFrame API in Scala, Java or Python to
express streaming aggregations, event-time windows, stream-to-
batch joins, etc.
The computation is executed on the same optimized Spark SQL
engine.
Exactly-once fault-tolerance guarantees through checkpointing and
Write Ahead Logs.
18. Windowing streams
How many vehicles entered each toll booth every 5 minutes?
val windowedCounts = cars.groupBy(
window($"timestamp", ”5 minutes", ”5 minutes"),
$”car"
).count()
23. Tungsten Project - Phase 2
Performance improvements
Heap Memory management
Connectors optimizations
Let’s Pop the hood
24. Project Tungsten
Bring Spark performance closer to the bare metal, through:
• Native memory Management
• Runtime code generation
Started @ Version 1.4
The cornerstone that enabled the Catalyst engine
25. Project Tungsten - Phase 2
Whole stage code generation
a. A technique that blends state-of-the-art from modern compilers and MPP databases.
b. Gives a performance boost of up to x9 faster
c. Emit optimized bytecode at runtime that collapses the entire query into a single function
d. Eliminating virtual function calls and leveraging CPU registers for intermediate data
26. Project Tungsten - Phase 2
Optimized input / output
a. Caching for Dataframes is based on Parquet
b. Faster Parquet reader
c. Google Gueva is OUT
d. Smarter HadoopFS connector
you have to be running on DataFrame / Dataset
27. Overall Judges Ruling
I Want to complain but i don’t know
what about!!
Internal performance improvements
aside, this feels more like Spark 1.7
I like flink...
All is good
SparkSQL is for sure the future of
spark
The competition has done well
for Spark