Big Data Meets Learning Science: Keynote by Al Essa

How do we learn and how can we learn better? Educational technology is undergoing a revolution fueled by learning science and data science. The promise is to make a high-quality personalized education accessible and affordable to all. In this presentation Alfred will describe how Apache Spark and Databricks are at the center of the innovation pipeline at McGraw-Hill for developing next-generation learner models and algorithms in support of millions of learners and instructors worldwide.

Big Data Meets Learning Science
Apache Spark Summit East 2017
Alfred Essa
VP, Research and Data Science
McGraw-Hill Education
@malpaso

Our Journey from Print to Digital
1 Innovation Pipeline
2 McGraw-Hill Learning Science
3 Spark, Databricks

Speed of innovation, not data, is the differentiator.

[Figure: Spark Factor. Labels: Technology (Apache Spark, Databricks), People, Process, Time to Market.]

Innovation Pipeline
Research → Product Validation → Product Development

Databricks underpins our innovation pipeline and workflow.

From Print to Digital: 128-year Journey
K-12, Higher Ed & Professional businesses
~4,800 employees

Adaptive Platform Leverages MHE Reach and Scale
Introduction of SmartBook: May 2013
Now: 1,500+ adaptive products available
Learners who have used MHE Adaptive: ~5,500,000
Student interactions: ~10,000,000,000
Authors trained to use MHE Adaptive: ~4,000

Research Phase
Research Step: Build Models and Algorithms
Model Iteration: Ship “Data Instrumented” product to improve and tune models
Product Focus: Iterate prototypes with business and customers
Research Question: How do we learn and how can we learn better?

Learning Tool for Optimizing Acquisition and Recall
Stacked Algorithm
Learning Science Principles: Effortful Recall, Spaced Practice, Interleaving
Cognitive Science Model
Mobile App

The Problem
1 Students drop out or fail their course
2 At-risk students can be difficult for instructors to identify
Identify at-risk students pre-emptively

The Solution
A classifier to predict abandonment
Jacqueline Feild, Data Scientist
Nicholas Lewkow, Data Scientist

Solution: A Classifier to Predict Abandonment
[Diagram: rows of six features (F1 to F6), each with a binary label, feed a classification algorithm]
F1 F2 F3 F4 F5 F6 | 1
F1 F2 F3 F4 F5 F6 | 1
F1 F2 F3 F4 F5 F6 | 0
F1 F2 F3 F4 F5 F6 | 1
F1 F2 F3 F4 F5 F6 | 0
• Logistic Regression used for initial classification algorithm
• Simple algorithm to interpret
• Provides probability estimates instead of hard classification label
• Allows for simple interpretation of feature importance
• One classifier works for all disciplines
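The bullets above can be illustrated with a minimal sketch of logistic regression in plain Python. This is not the speakers' implementation, which the slides do not show: the two features, the toy data, the learning rate, and the function names are all hypothetical. The sketch only shows why the model yields a probability rather than a hard label, and why the sign and size of each weight can be read as a rough indication of feature importance.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function mapping a score to (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, x):
    """Estimated probability that the label is 1 for feature vector x."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic_regression(rows, labels, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = predict_proba(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Hypothetical toy data (not from the talk). Two made-up features per student:
# fraction of assignments missed, and inactivity scaled to [0, 1].
# Label 1 = abandoned the course, 0 = did not.
X = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.6], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]]
y = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic_regression(X, y)
p_high = predict_proba(weights, bias, [0.85, 0.9])  # resembles the abandoners
p_low = predict_proba(weights, bias, [0.1, 0.1])    # resembles the completers
print(f"weights={weights}, bias={bias:.2f}")
print(f"p_high={p_high:.2f}, p_low={p_low:.2f}")
```

Because the output is a probability, the threshold for flagging a student can be chosen separately from training.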

Parallel Pipeline for Creating Classifier

Spark Transformation

Speedup with Spark
Sn: Speedup from n cores
t1: Time to run on 1 core
tn: Time to run on n cores
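The formula these symbols belong to did not survive extraction. Assuming the slide used the standard definition of parallel speedup, S_n = t_1 / t_n, it can be sketched as follows. The timings are made up for illustration and are not results from the talk, and the efficiency helper is an addition that the slide does not mention.

```python
def speedup(t1, tn):
    """S_n = t_1 / t_n: how many times faster the job runs on n cores."""
    if t1 <= 0 or tn <= 0:
        raise ValueError("run times must be positive")
    return t1 / tn


def efficiency(t1, tn, n):
    """Speedup per core; 1.0 would be perfect linear scaling (not on the slide)."""
    return speedup(t1, tn) / n


# Hypothetical timings in minutes, for illustration only.
t1 = 120.0  # time to run on 1 core
t8 = 20.0   # time to run on 8 cores

s8 = speedup(t1, t8)
e8 = efficiency(t1, t8, 8)
print(f"S_8 = {s8}, efficiency = {e8}")
```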

Evaluate Model Accuracy
• Use area under the receiver operating characteristic curve (AUC-ROC) as another measure of model accuracy
  • 0.9 - 1.0 = excellent
  • 0.8 - 0.9 = good
  • 0.7 - 0.8 = fair
  • 0.6 - 0.7 = poor
  • 0.5 - 0.6 = fail
• Look at how the AUC-ROC for a model changes
throughout the semester
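As a rough illustration of the metric, AUC-ROC can be computed as the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The sketch below is in plain Python with made-up labels and scores. The function names are hypothetical, and two choices are assumptions that the slide does not settle: a boundary value such as 0.9 is assigned to the higher band, and values below 0.5 get a separate label.

```python
def auc_roc(labels, scores):
    """AUC-ROC via pairwise comparison of positive and negative scores."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(pos) * len(neg))


def rate_auc(auc):
    """Map an AUC value to the rating bands listed on the slide."""
    if auc >= 0.9:
        return "excellent"
    if auc >= 0.8:
        return "good"
    if auc >= 0.7:
        return "fair"
    if auc >= 0.6:
        return "poor"
    if auc >= 0.5:
        return "fail"
    return "below chance"  # assumption: the slide's bands stop at 0.5


# Hypothetical data: 1 = abandoned, 0 = did not; scores are predicted risk.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]

auc = auc_roc(labels, scores)
rating = rate_auc(auc)
print(f"AUC-ROC = {auc:.3f} ({rating})")
```

Repeating this calculation on the predictions available at each point in the term is one way to look at how the AUC-ROC changes throughout the semester.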

Evaluate Intervention Window
Intervention Window:
How much time in
advance can we provide
for an intervention to
occur prior to
abandonment?
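The slide poses the question without showing how the window is measured. One possible reading, sketched below purely as an assumption, is the number of weeks between the first week a student's predicted risk crosses a threshold and the week the student abandons the course. The weekly granularity, the 0.5 threshold, the function name, and the data are all hypothetical.

```python
def intervention_window(weekly_risk, abandonment_week, threshold=0.5):
    """Weeks of lead time between the first flag and abandonment.

    weekly_risk[i] is the predicted risk in week i + 1. Returns None when the
    student is never flagged before the abandonment week.
    """
    for week, risk in enumerate(weekly_risk, start=1):
        if week >= abandonment_week:
            break
        if risk >= threshold:
            return abandonment_week - week
    return None


# Hypothetical student: risk rises over five weeks, abandonment in week 6.
risk_by_week = [0.1, 0.2, 0.6, 0.7, 0.9]
window = intervention_window(risk_by_week, abandonment_week=6)
print(f"intervention window: {window} weeks")
```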

Conclusions
Technology is important, but build an agile innovation workflow with Databricks.

Viewers also liked

Deep learning is a fast growing subset of machine learning. There is an emerging trend to conduct deep learning in the same cluster along with existing data processing pipelines to support feature engineering and traditional machine learning. As the leading framework for Distributed ML, we believe that the addition of deep learning to the super-popular Spark framework is important, because it allows Spark developers to perform a range of data analysis tasks within a single framework that helps avoid the complexity inherent in using multiple frameworks and libraries. As one of the early and top contributors to Apache Spark, Intel is thrilled to share with the community a big deal contribution to open source Spark…”BigDL” -… A distributed deep Learning framework organically built on Big Data (Apache Spark) platform. It combines the benefits of “high performance computing” and “Big Data” architecture for rich deep learning support. With BigDL on Spark, customers can eliminate large volume of unnecessary dataset transfer between separate systems, eliminate separate HW clusters and move towards a CPU cluster, reduce system complexity and the latency for end-to-end learning. Ultimately, customers can achieve better scale, higher resource utilization, ease of use/development, and better TCO. Feature parity with Caffe and Torch, significant performance boost when combined with Intel’s Math Kernel Library (MKL), scale-out, fault tolerance, elasticity and dynamic resource sharing are some of the prominent features of BigDL. BigDL open source project will be launched at the 2017 Spark Summit East and this keynote will help spotlight this new contribution and benefits to the Spark developer community and encourage their wide contribution and collaboration. We will also showcase some real world applications of Big DL from early customers’ adoption.

Accelerating Machine Learning and Deep Learning At Scale...With Apache Spark:...

Spark Summit

Big data remains a rapidly evolving field with new applications and infrastructure appearing every year. In this talk, I’ll cover new trends in 2016 / 2017 and how Apache Spark is moving to meet them. In particular, I’ll talk about work Databricks is doing to make Apache Spark interact better with native code (e.g. deep learning libraries), support heterogeneous hardware, and simplify production data pipelines in both streaming and batch settings through Structured Streaming.

Trends for Big Data and Apache Spark in 2017 by Matei Zaharia

Spark Summit

Artificial intelligence (AI) is not new. It emerged as a computer science discipline in the 50’s and has been a persistent theme in science fiction. What is new is that enterprises now have the prerequisites needed to create pragmatic AI applications: plenty of data, deep learning frameworks, and blazing fast distributed compute clusters à la Apache Spark. Forrester Vice President and Principal Analyst, Mike Gualtieri will enumerate and demystify nine essential AI technology building blocks that enterprises can use to add a modicum of intelligence to existing and new applications.

Artificial Intelligence: How Enterprises Can Crush It With Apache Spark: Keyn...

Spark Summit

An Increasing number of Apache Spark deployments are in Cloud and hybrid environments. This often means that Spark workloads are ephemeral but the data exists in a durable storage either in cloud and on-prem. The data also moves between cloud storage and on-prem. With this architecture in place, security and governance have become paramount to run Spark workloads across on-prem and cloud. In this keynote, we will walk through several issues and highlight a Spark workload running in an ephemeral cluster with security and governance across Cloud/On-Prem and how the same security and governance is shared with other workloads.

Apache Spark in Cloud and Hybrid: Why Security and Governance Become More Imp...

Spark Summit

A long-standing grand challenge in computing is to enable machines to act autonomously and intelligently: to rapidly and repeatedly take appropriate actions based on information in the world around them. To address this challenge, at UC Berkeley we are starting a new five year effort that focuses on the development of data-intensive systems that provide Real-Time Intelligence with Secure Execution (RISE). Following in the footsteps of AMPLab, RISELab is an interdisciplinary effort bringing together researchers across AI, robotics, security, and data systems. In this talk I’ll present our research vision and then discuss some of the applications that will be enabled by RISE technologies.

RISELab: Enabling Intelligent Real-Time Decisions keynote by Ion Stoica

Spark Summit

So you know you want to write a streaming app but any non-trivial streaming app developer would have to think about these questions: How do I manage offsets? How do I manage state? How do I make my spark streaming job resilient to failures? Can I avoid some failures? How do I gracefully shutdown my streaming job? How do I monitor and manage (e.g. re-try logic) streaming job? How can I better manage the DAG in my streaming job? When to use checkpointing and for what? When not to use checkpointing? Do I need a WAL when using streaming data source? Why? When don’t I need one? In this talk, we’ll share practices that no one talks about when you start writing your streaming app, but you’ll inevitably need to learn along the way.

What No One Tells You About Writing a Streaming App: Spark Summit East talk b...

Spark Summit

In 2001, it cost ~$100M to sequence a single human genome. In 2014, due to dramatic improvements in sequencing technology far outpacing Moore’s law, we entered the era of the $1,000 genome. At the same time, the power of genetics to impact medicine has become evident: for example, drugs with supporting genetic evidence have twice the clinical trial success rate. These factors have led to an explosion in the volume of genetic data, in the face of which existing analysis tools are breaking down. Therefore, we began the open-source Hail project (https://hail.is) to be a scalable platform built on Apache Spark to enable the worldwide genetics community to build, share, and apply new tools. Hail is focused on variant-level (post-read) data; querying genetic data, annotations and sample data; and performing rare and common variant association analyses. Hail has already been used to analyze datasets with hundreds of thousands of exomes and tens of thousands of whole genomes. We will give an overview of the goals of the Hail project and its architecture. The challenge of efficiently manipulating genetic data in Spark has led to several innovations that may have wider applicability, including an RDD-like abstraction for representing multidimensional data and an OrderedRDD abstraction for ordered data, (for example, data indexed by position in the genome). Finally, we will discuss Hail performance and future directions.

Hail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton Seed

Spark Summit

This presentation will provide technical design and development insights in order to set up a Kerberosied (secured) JupyterHub notebook using Spark. Joy will show how Bloomberg set up the Kerberos-based Spark-notebook-integrating JupyterHub, Sparkmagic, and Levy. Sparkmagic provides the Spark kernel for Scala and Python. Livy is one of the most promising open source software to allow to submit Spark jobs over http-based REST interfaces. In this presentation, Joy will highlight the capabilities of Sparkmagic and Livy, along with the gap or development required in order to integrate the software seamlessly to work with your secured Spark cluster. The Kerberos integration techniques that he’ll discuss can be applied to other types of authenticators, such as OAuth. No prior knowledge of any of these technologies is requied in order to understand this presentation.

Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t...

Spark Summit

Scaling out doesn’t have to mean giving up transactions and efficient joins! Relational databases can scale horizontally, and using them as a store for Spark Streaming or batch computations can help cover areas in which Spark is typically weaker. Examples will be drawn from our experience using Citus (https://github.com/citusdata/citus), an open-source extension to Postgres, but lessons learned should be applicable to many databases.

Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...

Spark Summit

If you are running Apache Spark in cloud environments, Object Stores —such as Amazon S3 or Azure WASB— are a core part of your system. What you can’t do is treat them like “just another filesystem” —do that and things will, eventually, go horribly wrong. This talk looks at the object stores in the cloud infrastructures, including underlying architectures., compares them to what a “real filesystem” is expected to do and shows how to use object stores efficiently and safely as sources of and destinations of data. It goes into depth on recent “S3a” work, showing how including improvements in performance, security, functionality and measurement —and demonstrating how to use make best use of it from a spark application. If you are planning to deploy Spark in cloud, or doing so today: this is information you need to understand. The performance of you code and integrity of your data depends on it.

Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...

Spark Summit

Due to Spark, writing big data applications has never been easier…at least until they stop being easy! At Lightbend we’ve helped our customers out of a number of hidden Spark pitfalls. Some crop up often; the ever-persistent OutOfMemoryError, the confusing NoSuchMethodError, shuffle and partition management, etc. Others occur less frequently; an obscure configuration affecting SQL broadcasts, struggles with speculating, a failing stream recovery due to RDD joins, S3 file reading leading to hangs, etc. All are intriguing! In this session we will provide insights into their origins and show how you can avoid making the same mistakes. Whether you are a seasoned Spark developer or a novice, you should learn some new tips and tricks that could save you hours or even days of debugging.

Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...

Spark Summit

Many data scientists are already making heavy usage of the Jupyter ecosystem for analyzing data using interactive notebooks. Apache Toree (incubating) is a Jupyter kernel designed to act as a gateway to Spark by enabling users Spark from standard Jupyter notebooks. This allows users to easily integrate Spark into their existing Jupyter deployments, This allows users to easily move between languages and contexts without needing to switch to a different set of tools. Apache Toree is designed expressly for interactive work. It supports interpreters in Scala, Python, and R. In this talk, I will cover the design of Toree, how it interacts with the Jupyter ecosystem and various ways in which users can extend the functionality of Apache Toree via a powerful plugin system.

APACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van Niekerk

Spark Summit

Fraudsters attempt to pay for goods, flights, hotels – you name it – using stolen credit cards. This hurts both the trust of card holders and the business of vendors around the world. We built a Real-Time Fraud Prevention Engine using Open Source (Big Data) Software: Spark, Spark ML, H2O, Hive, Esper. In my talk I will highlight both the business and the technical challenges that we’ve faced and dealt with.

Building a Real-Time Fraud Prevention Engine Using Open Source (Big Data) Sof...

Spark Summit

BigDL is a distributed deep learning framework built for Big Data platforms using Apache Spark. It combines the benefits of “high performance computing” and “Big Data” architecture, providing native support for deep learning functionalities in Spark, orders-of-magnitude speedup over out-of-the-box open source DL frameworks (e.g., Caffe/Torch) in single-node performance (by leveraging Intel MKL), and the scale-out of deep learning workloads based on the Spark architecture. We’ll also share how our users adopt BigDL for their deep learning applications (such as image recognition, object detection, NLP, etc.), which allows them to use their Big Data (e.g., Apache Hadoop and Spark) platform as the unified data analytics platform for data storage, data processing and mining, feature engineering, traditional (non-deep) machine learning, and deep learning workloads.

BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...

Spark Summit

Netflix is the world’s largest streaming service, with 80 million members in over 250 countries. Netflix uses machine learning to inform nearly every aspect of the product, from the recommendations you get, to the boxart you see, to the decisions made about which TV shows and movies are created. Given this scale, we chose Apache Spark as the engine of our recommendation pipeline. Apache Spark enables Netflix to use a single, unified framework/API for ETL, feature generation, model training, and validation. With the pipeline framework in Spark ML, each step within the Netflix recommendation pipeline (e.g. label generation, feature encoding, model training, model evaluation) is encapsulated as Transformers, Estimators and Evaluators, enabling modularity, composability and testability. Thus, Netflix engineers can build our own feature engineering logic as Transformers, learning algorithms as Estimators, and customized metrics as Evaluators, and with these building blocks, we can more easily experiment with new pipelines and rapidly deploy them to production. In this talk, we will discuss how Apache Spark is used as a distributed framework on top of which we build our own algorithms to generate personalized recommendations for each of our 80+ million subscribers, specific techniques we use at Netflix to scale, and the various pitfalls we’ve found along the way.

Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...

Spark Summit

Spark has become, deservedly, the main massively parallel processing framework, and HDFS is one of the most popular Big Data storage technologies. Their combination is therefore one of the most common Big Data use cases. But what about security? Can these two technologies coexist in a secure environment? Furthermore, with the proliferation of BI technologies adapted to Big Data environments, which require several users to interact with the same cluster concurrently, can we continue to ensure that our Big Data environments are still secure? In this lecture, Abel and Jorge will explain which adaptations of Spark’s core they had to make in order to guarantee the security of multiple concurrent users on a single Spark cluster, which can use any of its cluster managers, without degrading Spark’s outstanding performance.

Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla

Spark Summit

Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. This talk will examine how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, as well as some common errors and how to detect them. Spark’s own internal logging can often be quite verbose, and this talk will examine how to effectively search logs from Apache Spark to spot common problems. In addition to the internal logging, this talk will look at options for logging from within our program itself. Spark’s accumulators have gotten a bad rap because of how they behave in the event of cache misses or partial recomputes, but this talk will look at how to effectively use Spark’s current accumulators for debugging, as well as look ahead to data-property-type accumulators, which may be coming to Spark in a future version. In addition to reading logs and instrumenting our program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems.

Debugging PySpark: Spark Summit East talk by Holden Karau

Spark Summit

Since April 2016, Spark-as-a-service has been available to researchers in Sweden from the Swedish ICT SICS Data Center at www.hops.site. Researchers work in an entirely UI-driven environment on a platform built with only open-source software. Spark applications can be either deployed as jobs (batch or streaming) or written and run directly from Apache Zeppelin. Spark applications are run within a project on a YARN cluster with the novel property that Spark applications are metered and charged to projects. Projects are also securely isolated from each other and include support for project-specific Kafka topics. That is, Kafka topics are protected from access by users that are not members of the project. In this talk we will discuss the challenges in building multi-tenant Spark Streaming applications on YARN that are metered and easy to debug. We show how we use the ELK stack (Elasticsearch, Logstash, and Kibana) for logging and debugging running Spark Streaming applications, how we use Grafana and Graphite for monitoring Spark Streaming applications, and how users can debug and optimize terminated Spark Streaming jobs using Dr Elephant. We will also discuss the experiences of our users (over 120 users as of Sept 2016): how they manage their Kafka topics and quotas, patterns for how users share topics between projects, and our novel solutions for helping researchers debug and optimize Spark applications. To conclude, we will also give an overview of our course ID2223 on Large Scale Learning and Deep Learning, in which 60 students designed and ran SparkML applications on the platform.

Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...

Spark Summit

Apache Spark has become a popular and successful way for Python programmers to parallelize and scale up their data processing. In many use cases, though, a PySpark job can perform worse than an equivalent job written in Scala. It is also costly to push and pull data between the user’s Python environment and the Spark master. In this talk, we’ll examine some of the data serialization and other interoperability issues, especially with Python libraries like pandas and NumPy, that are impacting PySpark performance, and the work that is being done to address them. This will relate closely to other work in binary columnar serialization and data exchange tools in development, such as Apache Arrow and Feather files.

Improving Python and Spark Performance and Interoperability: Spark Summit Eas...

Spark Summit

R is a hugely popular platform for Data Scientists to create analytic models in many different domains. But when these applications move from the science lab to the production environment of large enterprises, a new set of challenges arises. Independently of R, Spark has been very successful as a powerful general-purpose computing platform. With the introduction of SparkR, an exciting new option to productionize Data Science applications has been made available. This talk will give insight into two real-life projects at major enterprises where Data Science applications in R have been migrated to SparkR.
• Dealing with platform challenges: R was not installed on the cluster. We show how to execute SparkR on a YARN cluster with a dynamic deployment of R.
• Integrating Data Engineering and Data Science: we highlight the technical and cultural challenges that arise from closely integrating these two different areas.
• Separation of concerns: we describe how to disentangle ETL and data preparation from analytic computing and statistical methods.
• Scaling R with SparkR: we present what options SparkR offers to scale R applications and how we applied them to different areas such as time series forecasting and web analytics.
• Performance Improvements: we will show benchmarks for an R application that took over 20 hours on a single server/single-threaded setup. With moderate effort we have been able to reduce that number to 15 minutes with SparkR. And we will show how we plan to further reduce this to less than a minute in the future.
• Mixing SparkR, SparkSQL and MLlib: we show how we combined the three different libraries to maximize efficiency.
• Summary and Outlook: we describe what we have learnt so far, what the biggest gaps currently are and what challenges we expect to solve in the short- to mid-term.

Using SparkR to Scale Data Science Applications in Production. Lessons from t...

Spark Summit

Viewers also liked (20)