Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: Spark Summit East talk by Jorg Schad

The document discusses powering predictive mapping at scale with Spark, Kafka, and Elasticsearch, built around the SMACK stack (Spark, Mesos, Akka, Cassandra, and Kafka) running on DC/OS. It describes how the stack ingests millions of events per second from connected devices with Apache Kafka, stores the data in Apache Cassandra, and processes it in real time and in batch with Apache Spark. It also provides an example of real-time tracking of geo-enabled IoT devices, covering the data flow and a demo of the system.

Data & Analytics

© 2016 Mesosphere, Inc. All Rights Reserved. 1
@joerg_schad @dcos #smack
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search
Spark Summit East
February 08, 2017

Jörg Schad
Distributed Systems Engineer
@joerg_schad

HYPERSCALE MEANS VOLUME AND VELOCITY
Processing modes: Batch, Micro-Batch, Event Processing
Time scale: Days, Hours, Minutes, Seconds, Microseconds
Batch end: Reports what has happened using descriptive analytics
Event processing end: Solves problems using predictive and prescriptive analytics
Use cases: Billing, Chargeback; Product Recommendations; Real-time Pricing and Routing; Real-time Advertising; Predictive User Interface

SMACK stack (running on DC/OS)
EVENTS: Ubiquitous data streams from connected devices (Sensors, Devices, Clients)
INGEST: Apache Kafka. Ingest millions of events per second
STORE: Apache Cassandra. Distributed & highly scalable database
ANALYZE: Apache Spark. Real-time and batch process data
ACT: Akka. Visualize data and build data driven applications
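
The ingest, store, analyze, act flow can be sketched with in-memory stand-ins. This is only an illustration under stated assumptions: the names `DeviceEvent`, `ingest`, `store`, `analyze`, and `act` are hypothetical, a `deque` stands in for a Kafka topic, a dict for a Cassandra table, a plain function for a Spark job, and a filter function for an Akka actor. It follows the usual SMACK layout (Cassandra as the store, Spark as the analysis engine) and uses none of the real client libraries.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceEvent:
    """Hypothetical event emitted by a sensor, device, or client."""
    device_id: str
    timestamp: float  # seconds since some epoch
    speed_kmh: float


def ingest(queue, event):
    """INGEST: append to a queue that stands in for a Kafka topic."""
    queue.append(event)


def store(queue, table):
    """STORE: drain the queue into a dict keyed by device (stand-in for Cassandra)."""
    stored = 0
    while queue:
        event = queue.popleft()
        table[event.device_id].append(event)
        stored += 1
    return stored


def analyze(table):
    """ANALYZE: average speed per device (stand-in for a Spark job)."""
    return {
        device_id: sum(e.speed_kmh for e in events) / len(events)
        for device_id, events in table.items()
    }


def act(averages, limit_kmh):
    """ACT: list devices whose average exceeds a limit (stand-in for an Akka actor)."""
    return sorted(d for d, avg in averages.items() if avg > limit_kmh)


topic = deque()
table = defaultdict(list)
for ev in [
    DeviceEvent("bus-1", 0.0, 40.0),
    DeviceEvent("bus-1", 1.0, 60.0),
    DeviceEvent("bus-2", 0.5, 90.0),
]:
    ingest(topic, ev)

stored_count = store(topic, table)
averages = analyze(table)
alerts = act(averages, limit_kmh=80.0)
print(stored_count, averages, alerts)
```

Each stage only touches the output of the previous one, which is what lets the real components be scaled and replaced independently.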

NAIVE APPROACH
Typical Datacenter: siloed, over-provisioned servers, low utilization
Industry Average: 12-15% utilization
Silos: MySQL, microservice, Cassandra, Spark/Hadoop, Kafka

Mesos & DC/OS

MULTIPLEXING OF DATA, SERVICES, USERS, ENVIRONMENTS
Typical Datacenter: siloed, over-provisioned servers, low utilization
Mesos/DC/OS: automated schedulers, workload multiplexing onto the same machines
Workloads: MySQL, microservice, Cassandra, Spark/Hadoop, Kafka
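
To make the utilization argument concrete, the sketch below compares one dedicated server per workload with packing the same workloads onto shared servers. The demand figures, the 16-core server size, and the first-fit packing rule are all assumptions chosen for illustration; they are not taken from the talk, and this is not how the Mesos allocator actually places tasks.

```python
def siloed_utilization(demands, capacity):
    """One dedicated server per workload: total demand / total capacity."""
    return sum(demands.values()) / (capacity * len(demands))


def pack_first_fit(demands, capacity):
    """Place workloads, largest first, on the first shared server with room."""
    servers = []  # workload names placed on each server
    free = []     # remaining capacity of each server
    for name, need in sorted(demands.items(), key=lambda kv: -kv[1]):
        for i, room in enumerate(free):
            if need <= room:
                servers[i].append(name)
                free[i] -= need
                break
        else:
            # No existing server has room, so open a new one.
            servers.append([name])
            free.append(capacity - need)
    return servers


# Hypothetical average CPU demand (cores) per workload on 16-core servers.
demands = {"MySQL": 2, "microservice": 1, "Cassandra": 4, "Spark/Hadoop": 3, "Kafka": 2}
capacity = 16

siloed = siloed_utilization(demands, capacity)
shared_servers = pack_first_fit(demands, capacity)
shared = sum(demands.values()) / (capacity * len(shared_servers))
print(f"siloed: {siloed:.0%}, shared: {shared:.0%} on {len(shared_servers)} server(s)")
```

With these made-up numbers the siloed layout sits at 15% utilization across five servers, while the shared layout reaches 75% on a single server. Real clusters keep headroom for peaks, so the gain in practice is smaller than this toy case suggests.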

DC/OS ENABLES MODERN DISTRIBUTED APPS
Modern App Components
● Microservices (in containers): Functions & Logic
● Big Data + Analytics Engines: Streaming, Batch, Machine Learning, Analytics, Search, Time Series, SQL / NoSQL Databases
Datacenter Operating System (DC/OS)
● Ecosystem of frameworks & apps
● Consistent architecture to run on top of kernel
● User Interface (GUI & CLI)
● Core system services (e.g., distributed init, cron, service discovery, package mgt & installer, storage)
Distributed Systems Kernel (Mesos)
● Distributed systems kernel to abstract resources
Any Infrastructure (Physical, Virtual, Cloud)

EXAMPLE: REAL-TIME TRACKING

GEO-ENABLED IoT

DATA FLOW

DEMO

THANK YOU!
ANY QUESTIONS?
@dcos
users@dcos.io
/groups/8295652
/dcos
/dcos/examples
/dcos/demos
chat.dcos.io

Keep it running!

SERVICE OPERATIONS
● Configuration Updates (ex: Scaling, re-configuration)
● Binary Upgrades
● Cluster Maintenance (ex: Backup, Restore, Restart)
● Monitor progress of operations
● Debug any runtime blockages

APACHE SPARK (STREAMING)
Typical Use: distributed, large-scale data processing; micro-batching
Why Spark Streaming?
● Micro-batching keeps latency very low
● Well defined role means it fits in well with other pieces of the pipeline
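
Micro-batching can be illustrated without Spark itself. The sketch below groups timestamped events into fixed-length intervals and processes each interval as one small batch, which is the idea behind a streaming batch interval. The function name `micro_batches`, the one-second interval, and the sample events are assumptions for illustration; this is not Spark's API.

```python
def micro_batches(events, interval_s):
    """Group (timestamp, value) pairs into consecutive fixed-length intervals.

    Returns a dict mapping each batch start time to the values that
    arrived in [start, start + interval_s), in time order.
    """
    batches = {}
    for timestamp, value in sorted(events):
        start = (timestamp // interval_s) * interval_s
        batches.setdefault(start, []).append(value)
    return batches


# Hypothetical (timestamp in seconds, reading) pairs from one device.
events = [(0.2, 10), (0.9, 12), (1.1, 7), (2.5, 3), (2.6, 5)]

batches = micro_batches(events, interval_s=1.0)
# Each batch is processed as a unit; here the "job" is a simple sum.
totals = {start: sum(values) for start, values in batches.items()}
print(totals)
```

The trade-off is visible in the grouping: an event is only processed once its interval closes, so the interval length sets a floor on end-to-end latency, while larger intervals amortize per-batch overhead over more events.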

Streaming applications have often been complex to design and maintain because of the significant upfront infrastructure investment required. However, with the advent of Spark, an easy transition to stream processing is now available, enabling personalization applications and experiments to consume near real-time data without massive development cycles. Our decision to evaluate Spark as our stream processing engine was primarily led by the following considerations: 1) ease of development for the team (already familiar with Spark for batch), 2) the scope/requirements of our problem, 3) re-usability of code from Spark batch jobs, and 4) Spark support from infrastructure teams within the company. In this session, we will present our experience using Spark for stream processing unbounded datasets in the personalization space. The datasets consisted of, but were not limited to, the stream of playback events that are used as feedback for all personalization algorithms. These plays are used to extract specific behaviors which are highly predictive of a customer’s enjoyment of our service. This dataset is massive and has to be further enriched by other online and offline Netflix data sources. These datasets, when consumed by our machine learning models, directly affect the customer’s personalized experience, which means that the impact is high and tolerance for failure is low. We’ll talk about the experiments we did to compare Spark with other streaming solutions like Apache Flink, the impact that we had on our customers, and most importantly, the challenges we faced. Take-aways for the audience: 1) A great example of stream processing large, personalization datasets at scale. 2) An increased awareness of the costs/requirements for making the transition from batch to streaming successfully. 3) Exposure to some of the technical challenges that should be expected along the way.

Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...

Spark Summit

Spark data processing is shifting from on-premises to cloud services to take advantage of their horizontal resource scalability, better data accessibility and easy manageability. However, fully utilizing the computational power, fast storage and networking offered by cloud services can be challenging without a deep understanding of workload characteristics and proper software optimization expertise. In this presentation, we will use a Spark-based programming framework, Genome Analysis Toolkit version 4 (GATK4, under development), as an example to present a process of configuring and optimizing an efficient Spark cluster on Google Cloud to speed up genome data processing. We will first introduce an in-house developed data profiling framework named PAT, and discuss how to use PAT to quickly establish the best combination of VM configurations and Spark configurations to fully utilize cloud hardware resources and Spark computational parallelism. In addition, we use PAT and other data profiling tools to identify and fix software hotspots in the application. We will show a case study in which we identify a thread scalability issue of the Java instanceof operator. The fix in the Scala language hugely improves performance of GATK4 and other Spark-based workloads.

A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...

Spark Summit

Legacy enterprise data warehouse (EDW) architectures, geared toward day-to-day workloads associated with operational querying, reporting, and analytics, are often ill-equipped to handle the volume of data, traffic, and varied data types associated with a modern, ad-hoc analytics platform. Faced with challenges of increasing pipeline speed, aggregation, and visualization in a simplified, self-service fashion, organizations are increasingly turning to some combination of Spark, Hadoop, Kafka, and proven analytical databases like Vertica as key enabling technologies to optimize their EDW architecture. Join us to learn how successful organizations have developed real-time streaming solutions with these technologies for a range of use cases, including IoT predictive maintenance.

Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...

Spark Summit

In this presentation, we are going to talk about the state-of-the-art infrastructure we have established at Walmart Labs for the Search product using Spark Streaming and DataFrames. First, we have been able to successfully use multiple micro-batch Spark Streaming pipelines to update and process information like product availability, pick up today, etc., along with updating our product catalog information in our search index, at up to 10,000 Kafka events per second in near real-time. Earlier, all the product catalog changes in the index had a 24 hour delay; using Spark Streaming we have made it possible to see these changes in near real-time. This addition has provided a great boost to the business by giving the end customers instant access to features like availability of a product, store pick up, etc. Second, we have built a scalable anomaly detection framework purely using Spark DataFrames that is being used by our data pipelines to detect abnormality in search data. Anomaly detection is an important problem not only in the search domain but also in many other domains such as performance monitoring, fraud detection, etc. During this, we realized that not only are Spark DataFrames able to process information faster but they are also more flexible to work with. One could write Hive-like queries, Pig-like code, UDFs, UDAFs, Python-like code, etc. all in the same place very easily and can build DataFrame templates which can be used and reused by multiple teams effectively. We believe that if implemented correctly Spark DataFrames can potentially replace Hive/Pig in the big data space and have the potential of becoming a unified data language. We conclude that Spark Streaming and DataFrames are the key to processing extremely large streams of data in real-time with ease of use.

Trends for Big Data and Apache Spark in 2017 by Matei Zaharia

Spark Summit

Big data remains a rapidly evolving field with new applications and infrastructure appearing every year. In this talk, I’ll cover new trends in 2016 / 2017 and how Apache Spark is moving to meet them. In particular, I’ll talk about work Databricks is doing to make Apache Spark interact better with native code (e.g. deep learning libraries), support heterogeneous hardware, and simplify production data pipelines in both streaming and batch settings through Structured Streaming.

R&D to Product Pipeline Using Apache Spark in AdTech: Spark Summit East talk ...

Spark Summit

The central premise of DataXu is to apply data science to better marketing. At its core is the Real Time Bidding Platform that processes 2 Petabytes of data per day and responds to ad auctions at a rate of 2.1 million requests per second across 5 different continents. Serving on top of this platform is DataXu’s analytics engine that gives their clients insightful analytics reports addressed towards client marketing business questions. Some common requirements for both these platforms are the ability to do real-time processing, scalable machine learning, and ad-hoc analytics. This talk will showcase DataXu’s successful use-cases of using the Apache Spark framework and Databricks to address all of the above challenges while maintaining its agility and rapid prototyping strengths to take a product from initial R&D phase to full production. The team will share their best practices and highlight the steps of large scale Spark ETL processing, model testing, all the way through to interactive analytics.

Big Telco - Yousun Jeong

Spark Summit

Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...

Spark Summit

Redis accelerates Apache Spark execution by 45 times, when used as a shared distributed in-memory datastore for Spark in analyses like time series data range queries. With the redis module for machine learning, redis-ml, implementation of spark-ml models gains a new real time serving layer that offloads processing of models directly in Redis, allows multiple applications to reuse the same models and speeds up classification and execution of these models by 13x. Join this session to learn more about the Redis Labs’ connector for Apache Spark that enhances production implementations of real-time big data processing.

Modeling Catastrophic Events in Spark: Spark Summit East Talk by Georg Hofman...

Spark Summit

A reinsurance company’s core competencies include the quantification of risk associated with catastrophes, such as hurricanes and earthquakes. Various so-called catastrophe models are available publicly, some commercial and some open-source. The volume of data processed by such “cat models” requires Big Data and High Performance capabilities. This is clearly reflected in the landscape of public models. And the observed trend is towards more and more detailed inputs, as well as outputs. This makes scalability an important concern. Companies that deal with catastrophe risk commonly use one or several public cat models. If they wish to differentiate themselves from the market, they may build internal proprietary models, in particular in areas that are not covered by existing models. The result is a deeper understanding and an independent quantification of risk, both of which can lead to a competitive edge.

ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...

Alluxio, Inc.

RISELab:Enabling Intelligent Real-Time Decisions

Jen Aman

Spark Summit East Keynote by Ion Stoica. A long-standing grand challenge in computing is to enable machines to act autonomously and intelligently: to rapidly and repeatedly take appropriate actions based on information in the world around them. To address this challenge, at UC Berkeley we are starting a new five year effort that focuses on the development of data-intensive systems that provide Real-Time Intelligence with Secure Execution (RISE). Following in the footsteps of AMPLab, RISELab is an interdisciplinary effort bringing together researchers across AI, robotics, security, and data systems. In this talk I’ll present our research vision and then discuss some of the applications that will be enabled by RISE technologies.

Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...

Spark Summit

Drizzle is a low latency execution engine for Apache Spark that is targeted at stream processing and iterative workloads. Currently, Spark uses a BSP computation model, and notifies the scheduler at the end of each task. Invoking the scheduler at the end of each task adds overheads and results in decreased throughput and increased latency. In Drizzle, we introduce group scheduling, where multiple batches (or a group) of computation are scheduled at once. This helps decouple the granularity of task execution from scheduling and amortize the costs of task serialization and launch. Our experiments on a 128 node EC2 cluster show that Drizzle can achieve end-to-end streaming latencies of less than 100ms and can get up to 3.5x lower latency than Spark Streaming. Compared to Apache Flink, a record-at-a-time streaming system, we show that Drizzle can recover around 4x faster from failures and that Drizzle has up to 13x lower latency during recovery.

Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)

Spark Summit

Spark Summit San Francisco 2016 - Ali Ghodsi Keynote

Databricks

Auto Scaling Systems With Elastic Spark Streaming: Spark Summit East talk by ...

Spark Summit

Come explore a feature we’ve created that is not supported out-of-the-box: the ability to add or remove nodes to always-on real time Spark Streaming jobs. Elastic Spark Streaming jobs can automatically adjust to the demands of traffic or volume. Using a set of configurable utility classes, these jobs scale down when lulls are detected and scale up when load is too high. We process multiple TBs per day with billions of events. Our traffic pattern experiences natural peaks and valleys with the occasional sustained unexpected spike. Elastic jobs have freed us from manual intervention, given back developer time, and made a large financial impact through maximized resource utilization.
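
A scale-up/scale-down rule of this kind can be sketched as a small decision function. Everything here is an assumption for illustration: the `ScalingPolicy` fields, the thresholds, and the choice of signal (batch processing time relative to the batch interval) are hypothetical, since the abstract does not describe the actual utility classes or their settings.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    """Hypothetical thresholds; the talk's real settings are not published here."""
    batch_interval_s: float = 10.0
    scale_up_ratio: float = 0.9    # processing time / interval above this: overloaded
    scale_down_ratio: float = 0.4  # below this: a lull
    min_nodes: int = 2
    max_nodes: int = 20
    window: int = 3                # consecutive batches required before acting


def decide(policy, nodes, recent_processing_s):
    """Return the new node count given the processing times of recent batches."""
    if len(recent_processing_s) < policy.window:
        return nodes  # not enough evidence yet
    latest = recent_processing_s[-policy.window:]
    ratios = [t / policy.batch_interval_s for t in latest]
    if all(r > policy.scale_up_ratio for r in ratios):
        return min(nodes + 1, policy.max_nodes)
    if all(r < policy.scale_down_ratio for r in ratios):
        return max(nodes - 1, policy.min_nodes)
    return nodes


policy = ScalingPolicy()
after_spike = decide(policy, nodes=4, recent_processing_s=[9.5, 9.8, 11.0])
after_lull = decide(policy, nodes=4, recent_processing_s=[2.0, 3.0, 1.5])
after_mixed = decide(policy, nodes=4, recent_processing_s=[2.0, 9.5, 5.0])
print(after_spike, after_lull, after_mixed)
```

Requiring several consecutive batches on the same side of a threshold, and leaving a gap between the two thresholds, keeps the job from oscillating when load hovers near a boundary.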

Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...

Spark Summit

Cybercrime is big business. Gartner reports worldwide security spending at $80B, with annual losses totalling more than $1.2T (in 2015). Small to medium sized businesses now account for more than half of the attacks targeting enterprises today. The threat actors behind these attacks are continually shifting their techniques and toolkits to evade the security defenses that businesses commonly use. Thanks to the growing frequency and complexity of attacks, the task of identifying and mitigating security-related events has become increasingly difficult. At eSentire, we use a combination of data and human analytics to identify, respond to and mitigate cyber threats in real-time. We capture all network traffic on our customers’ networks, hence ingesting a large amount of time-series data. We process the data as it is being streamed into our system to extract relevant threat insights and block attacks in real-time. Furthermore, we enable our cybersecurity analysts to perform in-depth investigations to: i) confirm attacks and ii) identify threats that analytical models miss. Having security experts in the loop provides feedback to our analytics engine, thereby improving the overall threat detection effectiveness. So how exactly can you build an analytics pipeline to handle a large amount of time-series/event-driven data? How do you build the tools that allow people to query this data with the expectation of mission-critical response times? In this presentation, William Callaghan will focus on the challenges faced and lessons learned in building a human-in-the-loop cyber threat analytics pipeline. They will discuss the topic of analytics in cybersecurity and highlight the use of technologies such as Spark Streaming/SQL, Cassandra, Kafka and Alluxio in creating an analytics architecture with mission-critical response times.

Realtime Analytical Query Processing and Predictive Model Building on High Di...

Spark Summit

Spark SQL and MLlib are optimized for running feature extraction and machine learning algorithms on row based columnar datasets through full scan but do not provide constructs for column indexing and time series analysis. For dealing with document datasets with timestamps where the features are represented as a variable number of columns in each document and use-cases demand searching over columns and time to retrieve documents to generate learning models in realtime, a close integration between Spark and Lucene was needed. We introduced LuceneDAO at Spark Summit Europe 2016 to build distributed Lucene shards from a data frame, but the time series attributes were not part of the data model. In this talk we present our extension to LuceneDAO to maintain time stamps with the document-term view for search and allow time filters. Lucene shards maintain the time aware document-term view for search and vector space representation for machine learning pipelines. We used Spark as our distributed query processing engine where each query is represented as a boolean combination over terms with filters on time. LuceneDAO is used to load the shards to Spark executors and power sub-second distributed document retrieval for the queries. Our synchronous API uses Spark-as-a-Service to power analytical queries while our asynchronous API uses Kafka, Spark Streaming and HBase to power time series prediction algorithms. In this talk we will demonstrate LuceneDAO write and read performance on millions of documents with 1M+ terms and configurable time stamp aggregate columns. We will demonstrate the latency of APIs on a suite of queries generated from terms. Key takeaways from the talk will be a thorough understanding of how to make Lucene powered time aware search a first class citizen in Spark to build interactive analytical query processing and time series prediction algorithms.

Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem

Cloudera, Inc.

How to teach your data scientist to leverage an analytics cluster with Presto...

Alluxio, Inc.

Apache Spark At Scale in the Cloud

Databricks

Using Apache Spark to analyze large datasets in the cloud presents a range of challenges. Different stages of your pipeline may be constrained by CPU, memory, disk and/or network IO. But what if all those stages have to run on the same cluster? In the cloud, you have limited control over the hardware your cluster runs on. You may have even less control over the size and format of your raw input files. Performance tuning is an iterative and experimental process. It’s frustrating with very large datasets: what worked great with 30 billion rows may not work at all with 400 billion rows. But with strategic optimizations and compromises, 50+ TiB datasets can be no big deal. By using Spark UI and simple metrics, explore how to diagnose and remedy issues on jobs:
● Sizing the cluster based on your dataset (shuffle partitions)
● Ingestion challenges – well begun is half done (globbing S3, small files)
● Managing memory (sorting GC – when to go parallel, when to go G1, when offheap can help you)
● Shuffle (give a little to get a lot – configs for better out of box shuffle)
● Spill (partitioning for the win)
● Scheduling (FAIR vs FIFO, is there a difference for your pipeline?)
● Caching and persistence (it’s the cost of doing business, so what are your options?)
● Fault tolerance (blacklisting, speculation, task reaping)
● Making the best of a bad deal (skew joins, windowing, UDFs, very large query plans)
● Writing to S3 (dealing with write partitions, HDFS and s3DistCp vs writing directly to S3)

Architecture at Scale

Elasticsearch

From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data

Databricks

Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...

Spark Summit

Machine learning is being deployed in a growing number of applications which demand real-time, accurate, and robust predictions under heavy query load. However, most machine learning frameworks and systems only address model training and not deployment. In this talk, we present Clipper, a general-purpose low-latency prediction serving system. Interposing between end-user applications and a wide range of machine learning frameworks, Clipper introduces a modular architecture to simplify model deployment across frameworks. Furthermore, by introducing caching, batching, and adaptive model selection techniques, Clipper reduces prediction latency and improves prediction throughput, accuracy, and robustness without modifying the underlying machine learning frameworks. We evaluated Clipper on four common machine learning benchmark datasets and demonstrate its ability to meet the latency, accuracy, and throughput demands of online serving applications. We also compared Clipper to the Tensorflow Serving system and demonstrate comparable prediction throughput and latency on a range of models while enabling new functionality, improved accuracy, and robustness.

Spark and Cassandra: An Amazing Apache Love Story by Patrick McFadin

Spark Summit

Smack Stack and Beyond—Building Fast Data Pipelines with Jorg Schad

Spark Summit

There is an ever-increasing number of use cases, like online fraud detection, for which the response times of traditional batch processing are too slow. In order to be able to react to such events in close to real-time, you need to go beyond classical batch processing and utilize stream processing systems such as Apache Spark Streaming, Apache Flink, or Apache Storm. These systems, however, are not sufficient on their own. For an efficient and fault-tolerant setup, you also need a message queue and storage system. One common example for setting up a fast data pipeline is the SMACK stack. SMACK stands for:
● Spark (Streaming): the stream processing system
● Mesos: the cluster orchestrator
● Akka: the system for providing custom actors for reacting upon the analyses
● Cassandra: the storage system
● Kafka: the message queue
Setting up this kind of pipeline in a scalable, efficient and fault-tolerant manner is not trivial. First, this workshop will discuss the different components in the SMACK stack. Then, participants will get hands-on experience in setting up and maintaining data pipelines.

Using Pluggable Apache Spark SQL Filters to Help GridPocket Users Keep Up wit...

Spark Summit

Analyzing and comparing your energy consumption with that of other consumers provides healthy peer pressure and useful insight leading to energy conservation and impacting the bottom line. We helped GridPocket (http://www.gridpocket.com/), a smart grid company developing energy management applications for electricity, water, and gas utilities, implement high scale anonymized energy comparison queries with an order of magnitude lower cost and higher performance than was previously possible. IoT use cases like that of GridPocket are swamping our planet with data, and drive demand for analytics on extremely scalable and low cost storage. Enter Spark SQL over Object Storage: highly scalable and low cost storage which provides RESTful APIs to store and retrieve objects and their metadata. Key performance indicators (KPIs) of query performance and cost are the number of bytes shipped from Object Storage to Spark and the number of incurred REST requests. We propose Pluggable Spark SQL Filters, which extend the existing Spark SQL partitioning mechanism with an ability to dynamically filter irrelevant objects during query execution. Our approach handles any data format supported by Spark SQL (Parquet, JSON, CSV, etc.), and unlike pushdown compatible formats such as Parquet which require touching each object to determine its relevance, it avoids accessing irrelevant objects altogether. We developed a pluggable interface for developing and deploying Filters, and implemented GridPocket’s filter which screens objects according to their metadata, for example geo-spatial bounding boxes which describe the area covered by an object’s data points. This leads to drastically lower KPIs since there is no need to ship the entire dataset from Object Storage to Spark if you are only comparing yourself with your neighborhood.
We demonstrate GridPocket analytics notebooks, report on our implementation and resulting 10-20x speedups, explain how to implement a Pluggable File Filter, and how we applied this to other use cases.
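
The metadata-based screening described above can be illustrated with a plain bounding-box overlap test. This is a sketch under stated assumptions, not the authors' Pluggable Filter interface: the `BoundingBox` class, the `select_objects` helper, the object names, and the coordinates are all hypothetical, and boxes that cross the antimeridian are not handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Area covered by an object's data points (degrees of latitude/longitude)."""
    min_lat: float
    min_lon: float
    max_lat: float
    max_lon: float

    def intersects(self, other):
        """True when the two boxes overlap or touch."""
        return (
            self.min_lat <= other.max_lat and other.min_lat <= self.max_lat
            and self.min_lon <= other.max_lon and other.min_lon <= self.max_lon
        )


def select_objects(object_metadata, query_box):
    """Keep only the object names whose metadata box overlaps the query box.

    Objects that fail the test are never opened, so nothing is read for them.
    """
    return [name for name, box in object_metadata.items() if box.intersects(query_box)]


# Hypothetical object names with the bounding box stored in each object's metadata.
object_metadata = {
    "meters/paris.parquet": BoundingBox(48.80, 2.22, 48.91, 2.47),
    "meters/lyon.parquet": BoundingBox(45.70, 4.77, 45.81, 4.90),
    "meters/nice.parquet": BoundingBox(43.64, 7.18, 43.76, 7.32),
}
neighborhood = BoundingBox(48.85, 2.30, 48.87, 2.35)  # hypothetical query area

selected = select_objects(object_metadata, neighborhood)
print(selected)
```

Because the decision uses only metadata, the cost of rejecting an object is independent of its size, which is where the reduction in bytes shipped and REST requests comes from.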

Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla

Spark Summit

Spark has been chosen, deservedly, as the main massively parallel processing framework, and HDFS is one of the most popular Big Data storage technologies. Their combination is therefore one of the most common Big Data use cases. But what about security? Can these two technologies coexist in a secure environment? Furthermore, with the proliferation of BI technologies adapted to Big Data environments, which require several users to interact with the same cluster concurrently, can we continue to ensure that our Big Data environments are still secure? In this lecture, Abel and Jorge will explain which adaptations of Spark’s core they had to perform in order to guarantee the security of multiple concurrent users on a single Spark cluster, which can use any of its cluster managers, without degrading Spark’s outstanding performance.

Spark Streaming and MLlib - Hyderabad Spark Group

Phaneendra Chiruvella

Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli

Spark Summit

In the race to invent multi-million dollar business opportunities with exclusive insights, data scientists and engineers are hampered by a multitude of challenges just to make one use case a reality – the need to ingest data from multiple sources, apply real-time analytics, build machine learning algorithms, and intermix different data processing models, all while navigating around their legacy data infrastructure that is just not up to the task. This need has created the demand for Virtual Analytics, where the complexities of disparate data and technology silos have been abstracted away, coupled with a powerful range of analytics and processing horsepower, all in one unified data platform. This talk describes how Databricks is powering this revolutionary new trend with Apache Spark.

IoT and the Autonomous Vehicle in the Clouds: Simultaneous Localization and M...

Spark Summit

Processing real-time analytics of big data streams from sensor data will continue to be an important task as embedded technology increases and we continue to generate new types and ways of data analysis, particularly in regard to the Internet of Things (IoT). Robotics models many of these key challenges well and incorporates the possibility of high-throughput streams as well as complex online machine learning and analytics algorithms. These challenges make it an almost ideal candidate for in-depth analysis of real-time streaming analytics. We look at a simultaneous localization and mapping (SLAM) problem, an ongoing research area in robotics for autonomous vehicles, and well recognized as a non-trivial problem space in both industry and research. We will use a new integrated framework on Kafka and Spark Streaming to explore a constrained SLAM problem using online algorithms to navigate and map a space in real time. We present benchmarks of our open-source robot’s integration with Kafka and Spark Streaming for performance against other SLAM algorithms currently in use, explore some of the challenges we faced in our implementation, and make recommendations for improvement of performance and optimization on our framework. Finally, new to this talk, we demo real-time usage of our implementation with the Turtlebot II and explore relevant benchmarks and their implications on the future of autonomous vehicles in the IoT and cloud analytics space.

High Resolution Energy Modeling that Scales with Apache Spark 2.0 Spark Summi...

Spark Summit

As advanced sensor technologies are becoming widely deployed in the energy industry, the availability of higher-frequency data results in both analytical benefits and computational costs. To an energy forecaster or data scientist, some of these benefits might include enhanced predictive performance from forecasting models as well as improved pattern recognition in energy consumption across building types, economic sectors, and geographies. To a utility or electricity service provider, these benefits might include significantly deeper insights into their diverse customer base. However, these advantages can come with a high computational price tag. With Spark 2.0, User-Defined Functions can be applied across grouped SparkDataFrames in the SparkR API to solve the multivariate optimization and model selection problems typically required for fitting site-level models. This recently added feature of Spark 2.0 on Databricks has allowed DNV GL to efficiently fit predictive models that relate weather, electricity, water, and gas consumption across virtually any number of buildings.
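The grouped-UDF idea in the abstract above (split the data by site, then fit one model per group) can be sketched in plain Python. This is only an illustration under stated assumptions, not the SparkR API or DNV GL's method: the helper names `fit_line` and `fit_per_site`, the single temperature feature, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x over (x, y) pairs.

    Assumes at least two distinct x values; otherwise the slope is undefined.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_per_site(rows):
    """Group (site, temperature, kwh) rows by site and fit one model per group."""
    groups = defaultdict(list)
    for site, temperature, kwh in rows:
        groups[site].append((temperature, kwh))
    # Each group is independent, which is what lets an engine run them in parallel.
    return {site: fit_line(points) for site, points in groups.items()}


# Hypothetical sample data: two sites with different temperature sensitivity.
rows = [
    ("site_a", 10.0, 25.0), ("site_a", 20.0, 45.0), ("site_a", 30.0, 65.0),
    ("site_b", 10.0, 12.0), ("site_b", 20.0, 17.0), ("site_b", 30.0, 22.0),
]
models = fit_per_site(rows)
for site, (intercept, slope) in sorted(models.items()):
    print(site, round(intercept, 3), round(slope, 3))
```

Because no group depends on another, the per-site fits can be distributed across workers without coordination, which is the property the abstract relies on.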

Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung

Spark Summit

R is a very popular platform for Data Science. Apache Spark is a highly scalable data platform. How could we have the best of both worlds? How could a Data Scientist leverage the rich 9000+ packages on CRAN, and integrate Spark into their existing Data Science toolset? In this talk we will walk through many examples of how several new features in Apache Spark 2.x enable this. We will also look at exciting changes in current and upcoming Apache Spark 2.x releases.

ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...

Spark Summit

Building a machine learning model is an iterative process. A data scientist will build many tens to hundreds of models before arriving at one that meets some acceptance criteria. However, the current style of model building is ad hoc and there is no practical way for a data scientist to manage models that are built over time. In addition, there are no means to run complex queries on models and related data. In this talk, we present ModelDB, a novel end-to-end system for managing machine learning (ML) models. Using client libraries, ModelDB automatically tracks and versions ML models in their native environments (e.g. spark.ml, scikit-learn). A common set of abstractions enables ModelDB to capture models and pipelines built across different languages and environments. The structured representation of models and metadata then provides a platform for users to issue complex queries across various modeling artifacts. Our rich web frontend provides a way to query ModelDB at varying levels of granularity. ModelDB has been open-sourced at https://github.com/mitdbg/modeldb.

Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...

Spark Summit

Algorithms for sketching probability distributions from large data sets are a fundamental building block of modern data science. Sketching plays a role in diverse applications ranging from visualization, optimizing data encodings, estimating quantiles, data synthesis and imputation. The T-Digest is a versatile sketching data structure. It operates on any numeric data, models tricky distribution tails with high fidelity, and most crucially it works smoothly with aggregators and map-reduce. T-Digest is a perfect fit for Apache Spark; it is single-pass and intermediate results can be aggregated across partitions in batch jobs or aggregated across windows in streaming jobs. In this talk I will describe a native Scala implementation of the T-Digest sketching algorithm and demonstrate its use in Spark applications for visualization, quantile estimations and data synthesis. Attendees of this talk will leave with an understanding of data sketching with T-Digest sketches, and insights about how to apply T-Digest to their own data analysis applications.
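The property that makes a sketch fit Spark (one pass per partition, then merging of intermediate results) can be illustrated with a much simpler structure than T-Digest. The sketch below is a fixed-bin histogram, not T-Digest and not the talk's Scala implementation; it assumes the value range is known in advance, and the class name `HistogramSketch` is hypothetical.

```python
class HistogramSketch:
    """Fixed-bin histogram used as a tiny mergeable quantile sketch."""

    def __init__(self, lo, hi, bins):
        self.lo, self.hi, self.bins = lo, hi, bins
        self.counts = [0] * bins
        self.total = 0

    def add(self, value):
        # Clamp out-of-range values into the first or last bin.
        clamped = min(max(value, self.lo), self.hi)
        idx = int((clamped - self.lo) * self.bins / (self.hi - self.lo))
        self.counts[min(idx, self.bins - 1)] += 1
        self.total += 1

    def merge(self, other):
        # Merging is bin-wise addition, so partition results combine in any order.
        merged = HistogramSketch(self.lo, self.hi, self.bins)
        merged.counts = [a + b for a, b in zip(self.counts, other.counts)]
        merged.total = self.total + other.total
        return merged

    def quantile(self, q):
        # Return the midpoint of the bin in which the cumulative count reaches q.
        target = q * self.total
        width = (self.hi - self.lo) / self.bins
        seen = 0
        for i, count in enumerate(self.counts):
            seen += count
            if count > 0 and seen >= target:
                return self.lo + (i + 0.5) * width
        return self.hi


def sketch_of(values):
    sketch = HistogramSketch(lo=0, hi=100, bins=100)
    for value in values:
        sketch.add(value)
    return sketch


# Two "partitions" sketched independently, then merged.
part_a, part_b = list(range(0, 50)), list(range(50, 100))
merged = sketch_of(part_a).merge(sketch_of(part_b))
whole = sketch_of(part_a + part_b)
print(merged.quantile(0.5))
```

Unlike this histogram, T-Digest adapts its resolution to the data and keeps the tails accurate, which is why the abstract singles it out; the merge step is the part the two approaches share.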

Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La...

Spark Summit

HP ships millions of PCs, printers, and other devices every year to customers in all market segments. More customers are seeking services with our products, which opens new opportunities for HP to create services from the data we can collect from our devices. Every device we ship is an IoT endpoint with a powerful CPU to capture rich data. Insights from this data are used internally to improve our products and focus on customer needs. In this presentation, John will focus on HP’s journey to enabling Big Data analytics from within a large enterprise environment. He will review the challenges and how HP decided on AWS, Apache Spark and Databricks as the foundation for their entry into Big Data Analytics. John will also review how HP uses Spark to build analytic services from the data they generate from their devices.

Using SparkR to Scale Data Science Applications in Production. Lessons from t...

Spark Summit

R is a hugely popular platform for Data Scientists to create analytic models in many different domains. But when these applications move from the science lab to the production environment of large enterprises, a new set of challenges arises. Independently of R, Spark has been very successful as a powerful general-purpose computing platform. With the introduction of SparkR, an exciting new option to productionize Data Science applications has become available. This talk will give insight into two real-life projects at major enterprises where Data Science applications in R have been migrated to SparkR.
• Dealing with platform challenges: R was not installed on the cluster. We show how to execute SparkR on a Yarn cluster with a dynamic deployment of R.
• Integrating Data Engineering and Data Science: we highlight the technical and cultural challenges that arise from closely integrating these two different areas.
• Separation of concerns: we describe how to disentangle ETL and data preparation from analytic computing and statistical methods.
• Scaling R with SparkR: we present what options SparkR offers to scale R applications and how we applied them to different areas such as time series forecasting and web analytics.
• Performance Improvements: we will show benchmarks for an R application that took over 20 hours on a single-server, single-threaded setup. With moderate effort we have been able to reduce that number to 15 minutes with SparkR. We will also show how we plan to reduce this further to less than a minute in the future.
• Mixing SparkR, SparkSQL and MLlib: we show how we combined the three different libraries to maximize efficiency.
• Summary and Outlook: we describe what we have learnt so far, what the biggest gaps currently are and what challenges we expect to solve in the short to mid term.

Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...

Spark Summit

Apache Spark MLlib provides scalable implementations of popular machine learning algorithms, which let users train models on big datasets and iterate fast. The existing implementations assume that the number of parameters is small enough to fit in the memory of a single machine. However, many applications, such as Ads CTR prediction and deep neural networks, require solving problems with billions of parameters on a huge amount of data. This requirement far exceeds the capacity of existing MLlib algorithms, many of which use L-BFGS as the underlying solver. In order to fill this gap, we developed Vector-free L-BFGS for MLlib. It can solve optimization problems with billions of parameters in the Spark SQL framework, where the training data are often generated. The algorithm scales very well and enables a variety of MLlib algorithms to handle a massive number of parameters over large datasets. In this talk, we will illustrate the power of Vector-free L-BFGS via logistic regression with a real-world dataset and requirements. We will also discuss how this approach could be applied to other ML algorithms.

Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...

Spark Summit

Spark 2.0 provided strong performance enhancements to the Spark core while advancing Spark ML usability to use data frames. But what happens when you run Spark 2.0 machine learning algorithms on a large cluster with a very large data set? Do you even get any benefit from using a very large data set? It depends. How do new hardware advances affect the topology of high-performance Spark clusters? In this talk we will explore Spark 2.0 Machine Learning at scale and share our findings with the community. As our test platform we will be using a new cluster design, different from typical Hadoop clusters, with more cores, more RAM, latest-generation NVMe SSDs and a 100GbE network, with the goal of more performance in a more space- and energy-efficient footprint.

Tuning and Monitoring Deep Learning on Apache Spark

Databricks

Deep Learning on Apache Spark has the potential for huge impact in research and industry. This talk will describe best practices for building deep learning pipelines with Spark. Rather than comparing deep learning systems or specific optimizations, this talk will focus on issues that are common to many deep learning frameworks when running on a Spark cluster: optimizing cluster setup and data ingest, tuning the cluster, and monitoring long-running jobs. We will demonstrate the techniques we cover using Google’s popular TensorFlow library. More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters. Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput. Interactive monitoring facilitates both the work of configuration and checking the stability of deep learning jobs. Speaker: Tim Hunter. This talk was originally presented at Spark Summit East 2017.

Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...

Spark Summit

The demand for stream processing is growing rapidly. Immense amounts of data have to be processed fast from a rapidly growing set of disparate data sources. This pushes the limits of traditional data processing infrastructures. These stream-based applications include trading, social networks, the Internet of Things, system monitoring, and many other examples. A number of powerful, easy-to-use open source platforms have emerged to address this. But the platforms solve the same problem in different ways, target various and sometimes overlapping use cases, and use different vocabularies for similar concepts. This may lead to confusion, longer development time or costly wrong decisions.

Real-time Platform for Second Look Business Use Case Using Spark and Kafka: S...

Spark Summit

In this talk we will introduce the business use case of how we created a real-time platform for our Second Look project using Spark and Kafka. Second Look is a feature created by Capital One to detect potential mistakes and unexpected charges and notify cardholders of them. We bring them to the attention of the customers automatically through email alerts and push notifications to ensure customers can take timely action. The situations can be resolved through a conversation with the merchant, or by disputing the charge directly with Capital One. We help to guide the user along this resolution path through our user experiences. We use Spark extensively to build the infrastructure for this project. Before we used Spark and Kafka, the alerts were not sent in real time and there were delays of days between when customers transacted and when they received the alerts. With the power of Spark and Kafka, we are able to send the alerts in a more timely manner. We will share how we connect each part together, from data ingestion to processing, alert generation, and alert delivery. We will demonstrate how Spark plays a critical role in the whole infrastructure. What’s next? We will leverage more of the power of machine learning using Spark to generate various types of alerts.

Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...

Spark Summit

BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...

Spark Summit

BigDL is a distributed deep learning framework built for Big Data platforms using Apache Spark. It combines the benefits of “high performance computing” and “Big Data” architecture, providing native support for deep learning functionalities in Spark, an orders-of-magnitude speedup over out-of-the-box open source DL frameworks (e.g., Caffe/Torch) in single-node performance (by leveraging Intel MKL), and the scale-out of deep learning workloads based on the Spark architecture. We’ll also share how our users adopt BigDL for their deep learning applications (such as image recognition, object detection, NLP, etc.), which allows them to use their Big Data (e.g., Apache Hadoop and Spark) platform as the unified data analytics platform for data storage, data processing and mining, feature engineering, traditional (non-deep) machine learning, and deep learning workloads.

Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...

Spark Summit

Netflix is the world’s largest streaming service, with 80 million members in over 250 countries. Netflix uses machine learning to inform nearly every aspect of the product, from the recommendations you get, to the boxart you see, to the decisions made about which TV shows and movies are created. Given this scale, we utilized Apache Spark to be the engine of our recommendation pipeline. Apache Spark enables Netflix to use a single, unified framework/API – for ETL, feature generation, model training, and validation. With pipeline framework in Spark ML, each step within the Netflix recommendation pipeline (e.g. label generation, feature encoding, model training, model evaluation) is encapsulated as Transformers, Estimators and Evaluators – enabling modularity, composability and testability. Thus, Netflix engineers can build our own feature engineering logics as Transformers, learning algorithms as Estimators, and customized metrics as Evaluators, and with these building blocks, we can more easily experiment with new pipelines and rapidly deploy them to production. In this talk, we will discuss how Apache Spark is used as a distributed framework we build our own algorithms on top of to generate personalized recommendations for each of our 80+ million subscribers, specific techniques we use at Netflix to scale, and the various pitfalls we’ve found along the way.
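The Transformer / Estimator / Evaluator split described in the abstract above can be sketched without Spark. The plain-Python classes below only mimic the pattern (an Estimator's `fit` returns a fitted Transformer, and a pipeline chains stages); they are hypothetical stand-ins, not the Spark ML API and not Netflix's code.

```python
class Scale:
    """Transformer: stateless step that multiplies every value by a factor."""

    def __init__(self, factor):
        self.factor = factor

    def transform(self, values):
        return [v * self.factor for v in values]


class Center:
    """Fitted model (also a Transformer): subtracts a learned mean."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, values):
        return [v - self.mean for v in values]


class MeanEstimator:
    """Estimator: learns the mean from data and returns a Center model."""

    def fit(self, values):
        return Center(sum(values) / len(values))


class Pipeline:
    """Chains stages; estimators are fitted on the output of earlier stages."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, values):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(values)  # replace the estimator by its model
            values = stage.transform(values)
            fitted.append(stage)
        return PipelineModel(fitted)


class PipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, values):
        for stage in self.stages:
            values = stage.transform(values)
        return values


def mean_absolute(values):
    """Evaluator: a single metric computed over transformed output."""
    return sum(abs(v) for v in values) / len(values)


model = Pipeline([Scale(2.0), MeanEstimator()]).fit([1.0, 2.0, 3.0])
output = model.transform([1.0, 2.0, 3.0])
print(output, mean_absolute(output))
```

Each stage can be tested and swapped in isolation, which is the modularity and testability benefit the abstract attributes to this structure.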

Spark-Streaming-as-a-Service with Kafka and YARN: Spark Summit East talk by J...

Spark Summit

Since April 2016, Spark-as-a-service has been available to researchers in Sweden from the Swedish ICT SICS Data Center at www.hops.site. Researchers work in an entirely UI-driven environment on a platform built with only open-source software. Spark applications can be either deployed as jobs (batch or streaming) or written and run directly from Apache Zeppelin. Spark applications are run within a project on a YARN cluster with the novel property that Spark applications are metered and charged to projects. Projects are also securely isolated from each other and include support for project-specific Kafka topics. That is, Kafka topics are protected from access by users that are not members of the project. In this talk we will discuss the challenges in building multi-tenant Spark streaming applications on YARN that are metered and easy to debug. We show how we use the ELK stack (Elasticsearch, Logstash, and Kibana) for logging and debugging running Spark streaming applications, how we use Grafana and Graphite for monitoring Spark streaming applications, and how users can debug and optimize terminated Spark Streaming jobs using Dr. Elephant. We will also discuss the experiences of our users (over 120 users as of Sept 2016): how they manage their Kafka topics and quotas, patterns for how users share topics between projects, and our novel solutions for helping researchers debug and optimize Spark applications. To conclude, we will also give an overview of our course ID2223 on Large Scale Learning and Deep Learning, in which 60 students designed and ran SparkML applications on the platform.

Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...

Spark Summit

AI plays a central role in today’s Internet applications and emerging intelligent systems, which are driving the need for scalable, distributed big data analytics with deep learning capabilities. There is increasing demand from organizations to discover and explore data using advanced big data analytics and deep learning. In this talk, we will share how we work with our users to build deep learning powered big data analytics applications (e.g., object detection, image recognition, NLP, etc.) using BigDL, an open source distributed deep learning library for Apache Spark.

Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...

Spark Summit

This talk will cover the tools we used, the hurdles we faced and the workarounds we developed with help from Databricks support in our attempt to build a custom machine learning model and use it to predict the TV ratings for different networks and demographics. The Apache Spark machine learning and dataframe APIs make it incredibly easy to produce a machine learning pipeline to solve an archetypal supervised learning problem. In our applications at Cadent, we face a challenge with high dimensional labels and relatively low dimensional features; at first pass such a problem is all but intractable, but thanks to a large number of historical records and the tools available in Apache Spark, we were able to construct a multi-stage model capable of forecasting with sufficient accuracy to drive the business application. Over the course of our work we have come across many tools that made our lives easier, and others that forced workarounds. In this talk we will review our custom multi-stage methodology, review the challenges we faced and walk through the key steps that made our project successful.

Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...

Spark Summit

One of the key challenges in working with real-time and streaming data is that the data format for capturing data is not necessarily the optimal format for ad hoc analytic queries. For example, Avro is a convenient and popular serialization format that is great for initially bringing data into HDFS. Avro has native integration with Flume and other tools that make it a good choice for landing data in Hadoop. But columnar file formats, such as Parquet and ORC, are much better optimized for ad hoc queries that aggregate over large numbers of similar rows.
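The difference between a row-oriented capture format and a columnar query format can be sketched with in-memory structures. This is a toy illustration, not the Avro or Parquet file layout: the records, field names, and helper functions are hypothetical, and every record is assumed to carry the same fields.

```python
# Row-oriented: each record is stored whole, which suits appending events as they arrive.
rows = [
    {"user": "a", "country": "SE", "bytes": 120},
    {"user": "b", "country": "US", "bytes": 300},
    {"user": "c", "country": "SE", "bytes": 80},
]


def to_columns(records):
    """Pivot records into one list per field (assumes identical fields per record)."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


def sum_row_oriented(records, field):
    # Every whole record is visited, even though only one field is needed.
    return sum(record[field] for record in records)


def sum_column_oriented(columns, field):
    # Only the requested column is read; the other fields are never touched.
    return sum(columns[field])


columns = to_columns(rows)
print(sum_row_oriented(rows, "bytes"), sum_column_oriented(columns, "bytes"))
```

On disk the same effect is larger: a columnar reader can skip the bytes of unneeded columns entirely, and values of one type stored together tend to compress better.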

Viewers also liked (20)