AI plays a central role in today’s Internet applications and emerging intelligent systems, driving the need for scalable, distributed big data analytics with deep learning capabilities. There is increasing demand from organizations to discover and explore data using advanced big data analytics and deep learning. In this talk, we will share how we work with our users to build deep learning powered big data analytics applications (e.g., object detection, image recognition, NLP, etc.) using BigDL, an open source distributed deep learning library for Apache Spark.
Scaling Through Simplicity—How a 300 million User Chat App Reduced Data Engin...Spark Summit
Moving at the speed of a startup often means rapid iterative development, which can lead to a patchwork of systems and processes. In the early days at Kik (one of the most popular chat apps among U.S. teens), the data team was able to move extremely quickly but often at the expense of scalable data engineering. In this session, Kik’s head of data will share the eight things they did to save time and money. The team took their data stack from a complex combination of systems and processes to a scalable, simple, and robust platform leveraging Apache Spark and Databricks to make data super easy for everyone in the company to use.
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...Spark Summit
BigDL is a distributed deep learning framework built for Big Data platforms using Apache Spark. It combines the benefits of “high performance computing” and “Big Data” architectures, providing native support for deep learning functionality in Spark, an orders-of-magnitude speedup over out-of-the-box open source DL frameworks (e.g., Caffe/Torch) in single-node performance (by leveraging Intel MKL), and scale-out of deep learning workloads based on the Spark architecture. We’ll also share how our users adopt BigDL for their deep learning applications (such as image recognition, object detection, NLP, etc.), which allows them to use their Big Data (e.g., Apache Hadoop and Spark) platform as the unified data analytics platform for data storage, data processing and mining, feature engineering, traditional (non-deep) machine learning, and deep learning workloads.
Accelerating Machine Learning and Deep Learning At Scale...With Apache Spark:...Spark Summit
Deep learning is a fast-growing subset of machine learning. There is an emerging trend to run deep learning in the same cluster as existing data processing pipelines that support feature engineering and traditional machine learning. Since Spark is the leading framework for distributed ML, we believe that adding deep learning to this hugely popular framework is important: it allows Spark developers to perform a range of data analysis tasks within a single framework, avoiding the complexity inherent in using multiple frameworks and libraries. As one of the early and top contributors to Apache Spark, Intel is thrilled to share with the community a major open source contribution to Spark: BigDL, a distributed deep learning framework organically built on the Big Data (Apache Spark) platform. It combines the benefits of “high performance computing” and “Big Data” architectures for rich deep learning support. With BigDL on Spark, customers can eliminate large volumes of unnecessary dataset transfer between separate systems, eliminate separate hardware clusters in favor of a single CPU cluster, and reduce system complexity and end-to-end learning latency. Ultimately, customers can achieve better scale, higher resource utilization, ease of use and development, and better TCO. Feature parity with Caffe and Torch, a significant performance boost when combined with Intel’s Math Kernel Library (MKL), scale-out, fault tolerance, elasticity, and dynamic resource sharing are some of BigDL’s prominent features.
The BigDL open source project will be launched at the 2017 Spark Summit East, and this keynote will spotlight this new contribution, highlight its benefits to the Spark developer community, and encourage wide contribution and collaboration. We will also showcase some real-world applications of BigDL from early customer adoption.
High Resolution Energy Modeling that Scales with Apache Spark 2.0 Spark Summi...Spark Summit
As advanced sensor technologies are becoming widely deployed in the energy industry, the availability of higher-frequency data results in both analytical benefits and computational costs. To an energy forecaster or data scientist, some of these benefits might include enhanced predictive performance from forecasting models as well as improved pattern recognition in energy consumption across building types, economic sectors, and geographies. To a utility or electricity service provider, these benefits might include significantly deeper insights into their diverse customer base. However, these advantages can come with a high computational price tag. With Spark 2.0, User-Defined Functions can be applied across grouped SparkDataFrames in the SparkR API to solve the multivariate optimization and model selection problems typically required for fitting site-level models. This recently added feature of Spark 2.0 on Databricks has allowed DNV GL to efficiently fit predictive models that relate weather, electricity, water, and gas consumption across virtually any number of buildings.
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...Databricks
As Apache Spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF: https://www.cncf.io/projects/), can be applied to monitor and archive system performance data in a containerized Spark environment.
In our examples, we will gather Spark metrics output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
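As a rough illustration of the scraping side of such a setup, a minimal Prometheus configuration might look as follows (job name, hosts, and ports are assumptions; the actual endpoints depend on how metrics are exposed on each Spark node, e.g., via the Prometheus JMX exporter javaagent):

```yaml
scrape_configs:
  - job_name: "spark"
    scrape_interval: 15s
    static_configs:
      # hypothetical hosts/ports where each node exposes Spark JVM metrics
      - targets: ["spark-master:9091", "spark-worker-1:9091", "spark-worker-2:9091"]
```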
Parallelizing Large Simulations with Apache SparkR with Daniel Jeavons and Wa...Spark Summit
Across all assets globally, Shell carries a huge stock of spare-part inventory, which ties up large quantities of working capital. Over the past two years an interdisciplinary project team has produced a tool, the Inventory Optimization Analytics solution (IOTA), based on advanced analytical methods, that helps assets optimise stock levels and purchase strategies. To calculate the recommended stocking inventory level for a material, the Data Science team have written a Markov Chain Monte Carlo (MCMC) bootstrapping statistical model in R. Cumulatively, the computational task is large but, fortunately, embarrassingly parallel, because the model can be applied independently to each material. The original solution, which utilised the R “parallel” package, was deployed on a single 48-core PC and took 48 hours to run. In this presentation, we describe how we moved the original solution to a distributed, cloud-based Apache Spark framework. Using the new R User-Defined Functions API in Apache Spark, and with only a minimal amount of code changes, the computational run time was reduced to 4 hours. A restructuring of the architecture to “pipeline” the problem resulted in a run time of less than 1 hour. This use case is important because it verifies the scalability and performance of SparkR.
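The per-material independence is what makes this workload easy to distribute. As a hedged illustration (in Python rather than the original R, with made-up demand data and a much simplified stocking rule), the structure looks like this; in the real solution, Spark’s R UDF API applies the per-material function across the cluster:

```python
# Illustrative sketch only -- not Shell's actual model. Each material's
# bootstrap depends on nothing but that material's own demand history,
# so the calls can run in parallel (here with a thread pool; on Spark,
# the same function shape maps onto per-group R UDFs).
import random
import statistics
from concurrent.futures import ThreadPoolExecutor

def bootstrap_stock_level(material, demand_history, n_boot=500, q=0.95):
    """Bootstrap the mean demand and return a high quantile as a
    (much simplified) stand-in for a recommended stocking level."""
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    means = sorted(
        statistics.mean(rng.choices(demand_history, k=len(demand_history)))
        for _ in range(n_boot)
    )
    return means[int(q * (n_boot - 1))]

# Hypothetical demand histories for two materials
demand = {"pump-seal": [3, 5, 2, 8, 4], "gasket": [1, 0, 2, 1, 3]}
with ThreadPoolExecutor() as pool:
    levels = dict(zip(demand, pool.map(
        lambda kv: bootstrap_stock_level(*kv), demand.items())))
print(levels)
```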
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...Databricks
In this session, you will learn how CERN easily applied end-to-end deep learning and analytics pipelines on Apache Spark at scale for High Energy Physics using BigDL and Analytics Zoo open source software running on Intel Xeon-based distributed clusters.
Technical details and development learnings will be shared using an example of topology classification to improve real-time event selection at the Large Hadron Collider experiments. The classifier has demonstrated very good performance figures for efficiency, while also reducing the false positive rate compared to the existing methods. It could be used as a filter to improve the online event selection infrastructure of the LHC experiments, where one could benefit from a more flexible and inclusive selection strategy while reducing the amount of downstream resources wasted in processing false positives.
This is part of CERN’s research on applying Deep Learning and analytics using open source and industry-standard technologies as an alternative to the existing customized rule-based methods. We show how we could quickly build and implement distributed deep learning solutions and data pipelines at scale on Apache Spark using Analytics Zoo and BigDL, open source frameworks that unify analytics and AI on Spark with easy-to-use APIs and development interfaces seamlessly integrated with Big Data platforms.
In the big data world, it's not always easy for Python users to move huge amounts of data around. Apache Arrow defines a common format for data interchange, while Arrow Flight, introduced in version 0.11.0, provides a means to move that data efficiently between systems. Arrow Flight is a framework for Arrow-based messaging built with gRPC. It enables data microservices where clients can produce and consume streams of Arrow data to share it over the wire. In this session, I'll give a brief overview of Arrow Flight from a Python perspective, and show that it's easy to build high-performance connections when systems can talk Arrow. I'll also cover some ongoing work in using Arrow Flight to connect PySpark with TensorFlow, two systems with great Python APIs but very different underlying internal data representations.
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...Databricks
Debugging big data analytics in Data-Intensive Scalable Computing (DISC) systems is a time-consuming effort. Today’s DISC systems offer very little tooling for debugging and, as a result, programmers spend countless hours analyzing log files and performing trial and error debugging. To aid this effort, UCLA developed BigDebug, an interactive debugging tool and automated fault localization service to help Apache Spark developers in debugging big data analytics.
To emulate interactive step-wise debugging without reducing throughput, BigDebug provides simulated breakpoints that enable a user to inspect a program without actually pausing the entire distributed computation. It also supports on-demand watchpoints that enable a user to retrieve intermediate data using a guard predicate and transfer the selected data on demand. To understand the flow of individual records within a pipeline of RDD transformations, BigDebug provides a data provenance capability, which can help explain how errors propagate through data processing steps. To support efficient trial-and-error debugging, BigDebug enables users to change program logic in response to an error at runtime through a real-time code-fix feature, and to selectively replay the execution from that step. Finally, BigDebug offers an automated fault localization service that leverages all the above features together to isolate failure-inducing inputs, diagnose the root cause of an error, and resume the workflow for only the affected data and code.
The BigDebug system should contribute to improving Spark developer productivity and the correctness of their Big Data applications. This big data debugging effort is led by UCLA Professors Miryung Kim and Tyson Condie, and has produced several research papers in top Software Engineering and Database conferences. The current version of BigDebug is publicly available at https://sites.google.com/site/sparkbigdebug/.
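To make the guard-predicate watchpoint idea concrete, here is a toy Python sketch (this is not BigDebug’s actual API): records matching the guard are captured on the side for later inspection, while the pipeline itself never pauses.

```python
# Toy illustration of a guard-predicate "watchpoint" (not BigDebug's real API):
# records flow through the stage unchanged, but any record matching the guard
# is captured for on-demand inspection -- the computation never stops.
def with_watchpoint(records, guard, captured):
    for rec in records:
        if guard(rec):
            captured.append(rec)   # capture for later inspection
        yield rec                  # pass through untouched

captured = []
data = [1, 2, -5, 3, -1]
# Watch for negative values flowing into the next transformation
doubled = [x * 2 for x in with_watchpoint(data, lambda x: x < 0, captured)]
print(doubled)   # [2, 4, -10, 6, -2]
print(captured)  # [-5, -1]
```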
From R Script to Production Using rsparkling with Navdeep GillDatabricks
The rsparkling R package is an extension package for sparklyr (an R interface for Apache Spark) that creates an R front-end for the Sparkling Water Spark package from H2O. This provides an interface to H2O’s high performance, distributed machine learning algorithms on Spark, using R. The main purpose of this package is to provide a connector between sparklyr and H2O’s machine learning algorithms.
In this session, Gill will introduce the basic architectures of rsparkling, H2O Sparkling Water and sparklyr, and go over how these frameworks work together to build a cohesive machine learning framework. In addition, you’ll learn about various implementations for using rsparkling in production. The session will conclude with a live demo of rsparkling that will display an end-to-end use case of data ingestion, munging and machine learning.
Data Warehousing with Spark Streaming at ZalandoDatabricks
Zalando’s AI-driven products and distributed landscape of analytical data marts cannot wait for long-running, hard-to-recover, monolithic batch jobs that take all night to calculate already-outdated data. Modern data integration pipelines need to deliver fast, easy-to-consume, high-quality data sets. Based on Spark Streaming and Delta, the central data warehousing team was able to deliver widely used master data as S3 or Kafka streams and snapshots at the same time.
The talk will cover challenges in our fashion data platform and a detailed architectural deep dive about separating integration from enrichment, providing streams as well as snapshots, and feeding the data to distributed data marts. Finally, lessons learned and best practices around Delta’s MERGE command, the Scala API vs. Spark SQL, and schema evolution give more insights and guidance for similar use cases.
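As a rough illustration of the MERGE pattern the talk touches on, a Spark SQL upsert of a change stream into a Delta snapshot table might look like this (table and column names are hypothetical):

```sql
MERGE INTO master_data AS t
USING updates AS u
  ON t.article_id = u.article_id
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *
```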
AI on Spark for Malware Analysis and Anomalous Threat DetectionDatabricks
At Avast, we believe everyone has the right to be safe. We are dedicated to creating a world that provides safety and privacy for all, no matter where you are, who you are, or how you connect. With over 1.5 billion attacks stopped and 30 million new executable files monthly, big data pipelines are crucial for the security of our customers. At Avast we are leveraging Apache Spark machine learning libraries and TensorFlowOnSpark for a variety of tasks ranging from marketing and advertisement, through network security, to malware detection. This talk will cover our main cybersecurity use cases for Spark. After describing our cluster environment, we will first demonstrate anomaly detection on time series of threats. With thousands of types of attacks and malware, AI helps human analysts select and focus on the most urgent or dire threats. We will walk through our setup, from distributed training of deep neural networks with TensorFlow to deploying and monitoring a streaming anomaly detection application with the trained model. Next we will show how we use Spark for analysis and clustering of malicious files, and for large-scale experimentation to automatically process and handle changes in malware. In the end, we will compare with other tools we used for solving those problems.
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...Databricks
BigDL is a distributed deep learning framework for Apache Spark, open sourced by Intel. BigDL helps make deep learning more accessible to the Big Data community by allowing them to continue using familiar tools and infrastructure to build deep learning applications. With BigDL, users can write their deep learning applications as standard Spark programs, which can then run directly on top of existing Spark or Hadoop clusters.
In this session, we will introduce BigDL, show how our customers use BigDL to build end-to-end ML/DL applications, and cover the platforms on which BigDL is deployed. We will also provide an update on the latest improvements in BigDL v0.1 and talk about further developments and upcoming features of the BigDL v0.2 release (e.g., support for TensorFlow models, 3D convolutions, etc.).
Data Wrangling with PySpark for Data Scientists Who Know Pandas with Andrew RayDatabricks
Data scientists spend more time wrangling data than making models. Traditional tools like Pandas provide a very powerful data manipulation toolset. Transitioning to big data tools like PySpark allows one to work with much larger datasets, but can come at the cost of productivity.
In this session, learn about data wrangling in PySpark from the perspective of an experienced Pandas user. Topics will include best practices, common pitfalls, performance consideration and debugging.
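To give a flavor of the kind of mapping such a transition involves, here is a small, hedged sketch: the pandas half below runs as-is, while the commented PySpark equivalents assume an existing SparkSession holding the same data as a Spark DataFrame.

```python
# A few common pandas operations with (in comments) their approximate
# PySpark equivalents -- filtering and grouped aggregation.
import pandas as pd

df = pd.DataFrame({"dept": ["a", "a", "b"], "salary": [10, 20, 30]})

# Filter rows.          PySpark: df.filter(df.salary > 15)
high = df[df["salary"] > 15]

# Grouped aggregation.  PySpark: df.groupBy("dept").avg("salary")
means = df.groupby("dept", as_index=False)["salary"].mean()

print(high["salary"].tolist())   # [20, 30]
print(means["salary"].tolist())  # [15.0, 30.0]
```

One pitfall the session's framing hints at: pandas operations are eager and in-memory, while the PySpark equivalents are lazy and distributed, so performance intuitions do not carry over directly.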
We provide an update on developments at the intersection of R and the broader machine learning ecosystem. These collections of packages enable R users to leverage the latest technologies for big data analytics and deep learning in their existing workflows, and also facilitate collaboration within multidisciplinary data science teams. Topics covered include:
- MLflow: managing the ML lifecycle, with improved dependency management and more deployment targets
- TensorFlow: TF 2.0 update and probabilistic (deep) machine learning with TensorFlow Probability
- Spark: latest improvements and extensions, including text processing at scale with SparkNLP
Analyzing 2TB of Raw Trace Data from a Manufacturing Process: A First Use Cas...Databricks
As semiconductor devices have advanced, manufacturing systems have improved the productivity and efficiency of wafer fabrication. Owing to such improvements, the number of wafers yielded from the fabrication process has been increasing rapidly. However, current software systems for semiconductor wafers are not designed to process large numbers of wafers. To resolve this issue, BISTel (a world-class provider of manufacturing intelligence solutions and services for manufacturers) has built several big data products, such as Trace Analyzer (TA) and Map Analyzer (MA), using Apache Spark. TA analyzes raw trace data from a manufacturing process: it captures details on all variable changes, big and small, and gives each trace's statistical summary (i.e., min, max, slope, average, etc.). Several of BISTel's customers, which are top-tier semiconductor companies, use TA to analyze the massive raw trace data from their manufacturing processes. In particular, TA is able to manage terabytes of data by applying Apache Spark's APIs. MA is an advanced pattern recognition tool that sorts wafer yield maps and automatically identifies common yield loss patterns. Some semiconductor companies also use MA to identify clustering patterns across more than 100,000 wafers, which can be considered big data in the semiconductor area. This talk will introduce these two products, both developed on Apache Spark, and present how to handle large-scale semiconductor data from a software engineering perspective.
Speakers: Seungchul Lee, Daeyoung Kim
Geospatial Analytics at Scale with Deep Learning and Apache SparkDatabricks
Deep Learning is now the standard in object detection, but it is not easy to analyze large numbers of images, especially in an interactive fashion. Traditionally, there has been a gap between Deep Learning frameworks, which excel at image processing, and more traditional ETL and data science tools, which are usually not designed to handle huge batches of complex data types such as images.
In this talk, we show how manipulating large corpora of images can be accomplished in a few lines of code because of recent developments in Apache Spark. Thanks to Spark’s unique ability to blend different libraries, we show how to start from satellite images and rapidly build complex queries on high-level information such as houses or buildings. This is possible thanks to Magellan, a geospatial package, and Deep Learning Pipelines, a library that streamlines the integration of Deep Learning frameworks in Spark. At the end of this session, you will walk away with the confidence that you can solve your own image detection problems at any scale thanks to the power of Spark.
Accelerating Data Science with Better Data Engineering on DatabricksDatabricks
Whether you’re processing IoT data from millions of sensors or building a recommendation engine to provide a more engaging customer experience, the ability to derive actionable insights from massive volumes of diverse data is critical to success. MediaMath, a leading adtech company, relies on Apache Spark to process billions of data points ranging from ads, user cookies, impressions, clicks, and more — translating to several terabytes of data per day. To support the needs of the data science teams, data engineering must build data pipelines for both ETL and feature engineering that are scalable, performant, and reliable.
Join this webinar to learn how MediaMath leverages Databricks to simplify mission-critical data engineering tasks that surface data directly to clients and drive actionable business outcomes. This webinar will cover:
- Transforming TBs of data with RDDs and PySpark responsibly
- Using the JDBC connector to write results to production databases seamlessly
- Comparisons with a similar approach using Hive
Blue Pill/Red Pill: The Matrix of Thousands of Data StreamsDatabricks
Designing a streaming application that has to process data from one or two streams is easy: any streaming framework that provides scalability, high throughput, and fault tolerance would work. But when the number of streams grows into the hundreds or thousands, managing them can be daunting. How would you share resources among thousands of streams running 24×7, manage their state, apply advanced streaming operations, and add or delete streams without restarting? This talk explains common scenarios and shows techniques that can handle thousands of streams using Spark Structured Streaming.
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...Spark Summit
In Spark 2.0, we introduced Structured Streaming, which allows users to continually and incrementally update their view of the world as new data arrives, while still using the same familiar Spark SQL abstractions. I’ll talk about the progress we’ve made since then on robustness, latency, expressiveness, and observability, using examples of production end-to-end continuous applications.
Model serving made easy using Kedro pipelines - Mariusz Strzelecki, GetInDataGetInData
If you want to stay up to date, subscribe to our newsletter here: https://bit.ly/3tiw1I8
Presentation from the performance given by Mariusz during the Data Science Summit ML Edition.
Author: Mariusz Strzelecki
Linkedin: https://www.linkedin.com/in/mariusz-strzelecki/
___
Company:
Getindata is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets.
Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies including i.a. Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries.
https://getindata.com
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...Databricks
The prevailing issue when working with Operating Room (OR) scheduling within a hospital setting is that it is difficult to schedule and predict available OR block times. This leads to empty and unused operating rooms leading to longer waiting times for patients for their procedures. In this three-part session, Ayad Shammout and Denny will show:
1) How we tried to solve this problem using traditional DW techniques
2) How we took advantage of the DW capabilities in Apache Spark AND easily transition to Spark MLlib so we could more easily predict available OR block times resulting in better OR utilization and shorter wait times for patients.
3) Some of the key learnings we had when migrating from DW to Spark.
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Databricks
GE Aviation has hundreds of data scientists and engineers developing algorithms. The majority of these people do not have the time to learn Apache Spark and continue to develop on local machines in Python or R. We also have lots of historical code that was not developed for Spark. However, the business wanted to deploy to a Spark environment for scalability, as quickly as possible. So how did we bridge the gap? A data scientist and software engineer will co-present to share how we approached the problem of building, unifying and scaling these algorithms.
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...Spark Summit
As enterprises move to cloud-based analytics, the risk of cloud security breaches poses a serious threat. Encrypting data at rest and in transit is a major first step. However, data must still be decrypted in memory for processing, exposing it to an attacker who has compromised the operating system or hypervisor. Trusted hardware such as Intel SGX has recently become available in latest-generation processors. Such hardware enables arbitrary computation on encrypted data while shielding it from a malicious OS or hypervisor. However, it still suffers from a significant side channel: access pattern leakage.
We present Opaque, a package for Apache Spark SQL that enables very strong security for SQL queries: data encryption, computation verification, and access pattern leakage protection (a.k.a. obliviousness). Opaque achieves these guarantees by introducing new oblivious distributed relational operators that provide 2000x performance gain over state of the art oblivious systems, as well as novel query planning techniques for these operators implemented using Catalyst.
Tuning and Monitoring Deep Learning on Apache SparkDatabricks
Deep Learning on Apache Spark has the potential for huge impact in research and industry. This talk will describe best practices for building deep learning pipelines with Spark.
Rather than comparing deep learning systems or specific optimizations, this talk will focus on issues that are common to many deep learning frameworks when running on a Spark cluster: optimizing cluster setup and data ingest, tuning the cluster, and monitoring long-running jobs. We will demonstrate the techniques we cover using Google’s popular TensorFlow library.
More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters. Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput. Interactive monitoring facilitates both the work of configuration and checking the stability of deep learning jobs.
Speaker: Tim Hunter
This talk was originally presented at Spark Summit East 2017.
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...Databricks
Debugging big data analytics in Data-Intensive Scalable Computing (DISC) systems is a time-consuming effort. Today’s DISC systems offer very little tooling for debugging and, as a result, programmers spend countless hours analyzing log files and performing trial and error debugging. To aid this effort, UCLA developed BigDebug, an interactive debugging tool and automated fault localization service to help Apache Spark developers in debugging big data analytics.
To emulate interactive step-wise debugging without reducing throughput, BigDebug provides simulated breakpoints that enable a user to inspect a program without actually pausing the entire distributed computation. It also supports on-demand watchpoints that enable a user to retrieve intermediate data using a guard predicate and transfer the selected data on demand. To understand the flow of individual records within a pipeline of RDD transformations, BigDebug provides data provenance capability, which can help understand how errors propagate through data processing steps. To support efficient trial-and-error debugging, BigDebug enables users to change program logic in response to an error at runtime through a realtime code fix feature, and selectively replay the execution from that step. Finally, BigDebug proposes an automated fault localization service that leverages all the above features together to isolate failure-inducing inputs, diagnose the root cause of an error, and resume the workflow for only affected data and code.
The BigDebug system should contribute to improving Spark developer productivity and the correctness of big data applications. This big data debugging effort is led by UCLA Professors Miryung Kim and Tyson Condie, and has produced several research papers in top software engineering and database conferences. The current version of BigDebug is publicly available at https://sites.google.com/site/sparkbigdebug/.
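The guard-predicate watchpoint idea above can be sketched in plain Python (no Spark dependency; all names are illustrative, not BigDebug's actual API): a transformation is wrapped so that records matching a predicate are captured for inspection without pausing the pipeline.

```python
def watchpoint(records, transform, guard, captured):
    """Apply `transform` to each record; inputs satisfying `guard` are
    copied into `captured` for later inspection, without interrupting
    the computation (a rough analogue of an on-demand watchpoint)."""
    out = []
    for r in records:
        if guard(r):
            captured.append(r)
        out.append(transform(r))
    return out

# Toy pipeline: square the numbers, capture negative inputs for debugging.
captured = []
result = watchpoint([3, -1, 4, -5], lambda x: x * x, lambda x: x < 0, captured)
# result == [9, 1, 16, 25]; captured == [-1, -5]
```

In BigDebug the captured data is transferred on demand from the distributed workers; this sketch only shows the predicate-based selection idea.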
From R Script to Production Using rsparkling with Navdeep GillDatabricks
The rsparkling R package is an extension package for sparklyr (an R interface for Apache Spark) that creates an R front-end for the Sparkling Water Spark package from H2O. This provides an interface to H2O’s high performance, distributed machine learning algorithms on Spark, using R. The main purpose of this package is to provide a connector between sparklyr and H2O’s machine learning algorithms.
In this session, Gill will introduce the basic architectures of rsparkling, H2O Sparkling Water and sparklyr, and go over how these frameworks work together to build a cohesive machine learning framework. In addition, you’ll learn about various implementations for using rsparkling in production. The session will conclude with a live demo of rsparkling that will display an end-to-end use case of data ingestion, munging and machine learning.
Data Warehousing with Spark Streaming at ZalandoDatabricks
Zalando's AI-driven products and distributed landscape of analytical data marts cannot wait for long-running, hard-to-recover, monolithic batch jobs that take all night to calculate already-outdated data. Modern data integration pipelines need to deliver fast, easy-to-consume, high-quality data sets. Based on Spark Streaming and Delta, the central data warehousing team was able to deliver widely used master data as S3 or Kafka streams and snapshots at the same time.
The talk will cover challenges in our fashion data platform and a detailed architectural deep dive about separation of integration from enrichment, providing streams as well as snapshots and feeding the data to distributed data marts. Finally, lessons learned and best practices about Delta’s MERGE command, Scala API vs Spark SQL and schema evolution give more insights and guidance for similar use cases.
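The MERGE pattern referenced above is an upsert: matched rows are updated in place, unmatched rows inserted. A rough sketch of what such a statement looks like in Delta's SQL dialect (table and column names are hypothetical), expressed here as a Python string:

```python
# Illustrative Delta Lake MERGE statement for upserting a change stream
# into a master-data table. Table and column names are made up.
merge_sql = """
MERGE INTO master_data AS t
USING updates AS s
  ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET t.value = s.value, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, value, updated_at) VALUES (s.id, s.value, s.updated_at)
"""
```

In a real pipeline this would run via `spark.sql(merge_sql)` (or the Scala/Python DeltaTable API) against Delta tables.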
AI on Spark for Malware Analysis and Anomalous Threat DetectionDatabricks
At Avast, we believe everyone has the right to be safe. We are dedicated to creating a world that provides safety and privacy for all, no matter where you are, who you are, or how you connect. With over 1.5 billion attacks stopped and 30 million new executable files monthly, big data pipelines are crucial for the security of our customers. At Avast we are leveraging Apache Spark machine learning libraries and TensorFlowOnSpark for a variety of tasks ranging from marketing and advertisement, through network security, to malware detection. This talk will cover our main cybersecurity use cases for Spark. After describing our cluster environment, we will first demonstrate anomaly detection on time series of threats: with thousands of types of attacks and malware, AI helps human analysts select and focus on the most urgent or dire threats. We will walk through our setup, from distributed training of deep neural networks with TensorFlow to deploying and monitoring a streaming anomaly detection application with the trained model. Next we will show how we use Spark for analysis and clustering of malicious files, and for large-scale experimentation to automatically process and handle changes in malware. Finally, we will compare Spark with the other tools we have used to solve these problems.
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...Databricks
BigDL is a distributed deep learning framework for Apache Spark open sourced by Intel. BigDL helps make deep learning more accessible to the Big Data community, by allowing them to continue the use of familiar tools and infrastructure to build deep learning applications. With BigDL, users can write their deep learning applications as standard Spark programs, which can then directly run on top of existing Spark or Hadoop clusters.
In this session, we will introduce BigDL, describe how our customers use it to build end-to-end ML/DL applications and the platforms on which it is deployed, provide an update on the latest improvements in BigDL v0.1, and discuss further developments and upcoming features of the BigDL v0.2 release (e.g., support for TensorFlow models, 3D convolutions, etc.).
Data Wrangling with PySpark for Data Scientists Who Know Pandas with Andrew RayDatabricks
Data scientists spend more time wrangling data than making models. Traditional tools like Pandas provide a very powerful data manipulation toolset. Transitioning to big data tools like PySpark allows one to work with much larger datasets, but can come at the cost of productivity.
In this session, learn about data wrangling in PySpark from the perspective of an experienced Pandas user. Topics will include best practices, common pitfalls, performance considerations and debugging.
We provide an update on developments in the intersection of the R and the broader machine learning ecosystems. These collections of packages enable R users to leverage the latest technologies for big data analytics and deep learning in their existing workflows, and also facilitate collaboration within multidisciplinary data science teams. Topics covered include – MLflow: managing the ML lifecycle with improved dependency management and more deployment targets – TensorFlow: TF 2.0 update and probabilistic (deep) machine learning with TensorFlow Probability – Spark: latest improvements and extensions, including text processing at scale with SparkNLP
Analyzing 2TB of Raw Trace Data from a Manufacturing Process: A First Use Cas...Databricks
As semiconductor devices have advanced, manufacturing systems have improved the productivity and efficiency of wafer fabrication. As a result, the number of wafers yielded by the fabrication process has been rapidly increasing. However, current software systems for semiconductor wafers are not designed to process large numbers of wafers. To address this, BISTel (a world-class provider of manufacturing intelligence solutions and services for manufacturers) has built several big data products, such as Trace Analyzer (TA) and Map Analyzer (MA), using Apache Spark. TA analyzes raw trace data from a manufacturing process: it captures details on all variable changes, big and small, and provides a statistical summary of the traces (min, max, slope, average, etc.). Several of BISTel's customers, among them top-tier semiconductor companies, use TA to analyze the massive raw trace data from their manufacturing processes; by applying Apache Spark's APIs, TA is able to manage terabytes of data. MA is an advanced pattern recognition tool that sorts wafer yield maps and automatically identifies common yield-loss patterns. Some semiconductor companies use MA to identify clustering patterns across more than 100,000 wafers, which qualifies as big data in the semiconductor field. This talk will introduce these two Apache Spark-based products and present software techniques for handling large-scale semiconductor data.
Speakers: Seungchul Lee, Daeyoung Kim
Geospatial Analytics at Scale with Deep Learning and Apache SparkDatabricks
Deep Learning is now the standard in object detection, but it is not easy to analyze large numbers of images, especially interactively. Traditionally, there has been a gap between Deep Learning frameworks, which excel at image processing, and more traditional ETL and data science tools, which are usually not designed to handle huge batches of complex data types such as images.
In this talk, we show how manipulating large corpora of images can be accomplished in a few lines of code thanks to recent developments in Apache Spark. Thanks to Spark's unique ability to blend different libraries, we show how to start from satellite images and rapidly build complex queries on high-level information such as houses or buildings. This is possible thanks to Magellan, a geospatial package, and Deep Learning Pipelines, a library that streamlines the integration of Deep Learning frameworks in Spark. At the end of this session, you will walk away with the confidence that you can solve your own image detection problems at any scale thanks to the power of Spark.
Accelerating Data Science with Better Data Engineering on DatabricksDatabricks
Whether you’re processing IoT data from millions of sensors or building a recommendation engine to provide a more engaging customer experience, the ability to derive actionable insights from massive volumes of diverse data is critical to success. MediaMath, a leading adtech company, relies on Apache Spark to process billions of data points spanning ads, user cookies, impressions, clicks, and more — translating to several terabytes of data per day. To support the needs of the data science teams, data engineering must build data pipelines for both ETL and feature engineering that are scalable, performant, and reliable.
Join this webinar to learn how MediaMath leverages Databricks to simplify mission-critical data engineering tasks that surface data directly to clients and drive actionable business outcomes. This webinar will cover:
- Transforming TBs of data with RDDs and PySpark responsibly
- Using the JDBC connector to write results to production databases seamlessly
- Comparisons with a similar approach using Hive
Blue Pill/Red Pill: The Matrix of Thousands of Data StreamsDatabricks
Designing a streaming application that has to process data from one or two streams is easy: any streaming framework that provides scalability, high throughput, and fault tolerance will work. But when the number of streams grows into the hundreds or thousands, managing them can be daunting. How would you share resources among thousands of streams, all running 24×7? How would you manage their state, apply advanced streaming operations, or add and delete streams without restarting? This talk explains common scenarios and shows techniques that can handle thousands of streams using Spark Structured Streaming.
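One common technique for the many-streams problem is multiplexing many logical streams onto a small, fixed pool of physical jobs, so streams can come and go without launching new queries. A toy sketch of the bucketing step in plain Python (no Spark; the function name and job count are illustrative):

```python
import zlib

def assign_streams(stream_ids, num_jobs):
    """Bucket many logical streams onto a fixed pool of physical jobs
    using a stable hash, so adding or removing a stream never requires
    more than `num_jobs` running queries (a toy analogue of serving
    thousands of streams from a handful of Structured Streaming jobs)."""
    jobs = {i: [] for i in range(num_jobs)}
    for sid in stream_ids:
        jobs[zlib.crc32(sid.encode()) % num_jobs].append(sid)
    return jobs

# 1000 logical streams shared across 8 physical jobs.
jobs = assign_streams(["stream-%d" % i for i in range(1000)], num_jobs=8)
```

A real implementation would also need per-stream state keyed by stream id inside each shared query; this sketch only shows the resource-sharing idea.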
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...Spark Summit
In Spark 2.0, we introduced Structured Streaming, which allows users to continually and incrementally update their view of the world as new data arrives, while still using the same familiar Spark SQL abstractions. In this keynote, I talk about the progress we’ve made since then on robustness, latency, expressiveness and observability, using examples of production end-to-end continuous applications.
Model serving made easy using Kedro pipelines - Mariusz Strzelecki, GetInDataGetInData
If you want to stay up to date, subscribe to our newsletter here: https://bit.ly/3tiw1I8
Presentation from the talk given by Mariusz during the Data Science Summit ML Edition.
Author: Mariusz Strzelecki
LinkedIn: https://www.linkedin.com/in/mariusz-strzelecki/
___
Company:
GetInData is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together some of the best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets.
Our experts have vast production experience implementing Big Data projects for Polish as well as foreign companies, including, among others, Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone and iZettle, plus many others from the pharmaceutical, media, finance and FMCG industries.
https://getindata.com
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...Databricks
The prevailing issue with Operating Room (OR) scheduling in a hospital setting is that it is difficult to schedule and predict available OR block times. This leads to empty, unused operating rooms and longer waiting times for patients awaiting their procedures. In this three-part session, Ayad Shammout and Denny will show:
1) How we tried to solve this problem using traditional DW techniques
2) How we took advantage of the DW capabilities in Apache Spark and easily transitioned to Spark MLlib, so we could better predict available OR block times, resulting in better OR utilization and shorter wait times for patients.
3) Some of the key learnings we had when migrating from DW to Spark.
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Databricks
GE Aviation has hundreds of data scientists and engineers developing algorithms. The majority of these people do not have the time to learn Apache Spark and continue to develop on local machines in Python or R. We also have lots of historical code that was not developed for Spark. However, the business wanted to deploy to a Spark environment for scalability, as quickly as possible. So how did we bridge the gap? A data scientist and software engineer will co-present to share how we approached the problem of building, unifying and scaling these algorithms.
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...Spark Summit
As enterprises move to cloud-based analytics, the risk of cloud security breaches poses a serious threat. Encrypting data at rest and in transit is a major first step. However, data must still be decrypted in memory for processing, exposing it to an attacker who has compromised the operating system or hypervisor. Trusted hardware such as Intel SGX has recently become available in latest-generation processors. Such hardware enables arbitrary computation on encrypted data while shielding it from a malicious OS or hypervisor. However, it still suffers from a significant side channel: access pattern leakage.
We present Opaque, a package for Apache Spark SQL that enables very strong security for SQL queries: data encryption, computation verification, and access pattern leakage protection (a.k.a. obliviousness). Opaque achieves these guarantees by introducing new oblivious distributed relational operators that provide 2000x performance gain over state of the art oblivious systems, as well as novel query planning techniques for these operators implemented using Catalyst.
Tuning and Monitoring Deep Learning on Apache SparkDatabricks
Deep Learning on Apache Spark has the potential for huge impact in research and industry. This talk will describe best practices for building deep learning pipelines with Spark.
Rather than comparing deep learning systems or specific optimizations, this talk will focus on issues that are common to many deep learning frameworks when running on a Spark cluster: optimizing cluster setup and data ingest, tuning the cluster, and monitoring long-running jobs. We will demonstrate the techniques we cover using Google’s popular TensorFlow library.
More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters. Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput. Interactive monitoring facilitates both the work of configuration and checking the stability of deep learning jobs.
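The GPU task-conflict point can be illustrated with a 2017-era configuration trick (values are examples, not recommendations): setting `spark.task.cpus` equal to the executor's cores forces one task per executor, so concurrent tasks never contend for the same GPU.

```python
# Illustrative Spark configuration to avoid GPU task conflicts: with
# spark.task.cpus equal to spark.executor.cores, each executor runs one
# task at a time, so that task can claim the executor's GPU exclusively.
# (Values are examples; newer Spark versions also offer explicit GPU
# resource scheduling, which supersedes this workaround.)
conf = {
    "spark.executor.cores": "8",
    "spark.task.cpus": "8",          # one task per executor
    "spark.executor.memory": "32g",  # headroom for the DL framework
}
```

These keys would typically be passed via `spark-submit --conf` or a `SparkConf` object when launching the job.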
Speaker: Tim Hunter
This talk was originally presented at Spark Summit East 2017.
Spark + Flashblade: Spark Summit East talk by Brian GoldSpark Summit
Modern infrastructure and applications generate extraordinary volumes of log and telemetry data. At Pure Storage, we know this first hand: we have over 5PB of log data from production customers running our all-flash storage systems, from our engineering testbeds, and from test stations at manufacturing partners. Every part of our company — from engineering to sales — now depends on the insights we gather from this data. Given the diversity of our end users, it’s no surprise that our analysis tools comprise a broad mix of reporting queries, stream-processing operations, ad-hoc analyses, and deeper machine-learning algorithms. In this session, we will cover lessons learned from scaling our data warehouse and how we are leveraging Apache Spark’s capabilities as a central hub to meet our analytics demands.
Scalable Data Science with SparkR: Spark Summit East talk by Felix CheungSpark Summit
R is a very popular platform for Data Science. Apache Spark is a highly scalable data platform. How could we have the best of both worlds? How could a Data Scientist leverage the rich 9000+ packages on CRAN, and integrate Spark into their existing Data Science toolset?
In this talk we will walk through many examples of how several new features in Apache Spark 2.x enable this. We will also look at exciting changes already in, and coming next in, the Apache Spark 2.x releases.
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark Summit
Apache Spark 2.1.0 boosted the performance of Apache Spark SQL thanks to Project Tungsten software improvements. A further 16x speedup has been achieved by using Oracle’s innovations for Apache Spark SQL, made possible by Oracle’s Software in Silicon accelerator offload technologies.
Apache Spark SQL in-memory performance is becoming more important due to many factors. Users are now performing more advanced SQL processing on multi-terabyte workloads. In addition, on-prem and cloud servers are getting larger physical memory, enabling these huge workloads to be stored in memory. In this talk we will look at using Spark SQL for feature creation and feature generation within pipelines for Spark ML.
This presentation will explore workloads at scale and with complex interactions. We also provide best practices and tuning suggestions to support these kinds of workloads in real applications in cloud deployments, and discuss ideas for the next generation of the Tungsten project.
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli Spark Summit
In the race to invent multi-million dollar business opportunities with exclusive insights, data scientists and engineers are hampered by a multitude of challenges just to make one use case a reality – the need to ingest data from multiple sources, apply real-time analytics, build machine learning algorithms, and intermix different data processing models, all while navigating around their legacy data infrastructure that is just not up to the task. This need has created the demand for Virtual Analytics, where the complexities of disparate data and technology silos have been abstracted away, coupled with a powerful range of analytics and processing horsepower, all in one unified data platform. This talk describes how Databricks is powering this revolutionary new trend with Apache Spark.
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Spark Summit
Scaling out doesn’t have to mean giving up transactions and efficient joins! Relational databases can scale horizontally, and using them as a store for Spark Streaming or batch computations can help cover areas in which Spark is typically weaker. Examples will be drawn from our experience using Citus (https://github.com/citusdata/citus), an open-source extension to Postgres, but lessons learned should be applicable to many databases.
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...Spark Summit
Legacy enterprise data warehouse (EDW) architectures, geared toward the day-to-day workloads of operational querying, reporting, and analytics, are often ill-equipped to handle the volume of data, traffic, and varied data types associated with a modern, ad-hoc analytics platform. Faced with the challenges of increasing pipeline speed, aggregation, and visualization in a simplified, self-service fashion, organizations are increasingly turning to some combination of Spark, Hadoop, Kafka, and proven analytical databases like Vertica as key enabling technologies to optimize their EDW architecture. Join us to learn how successful organizations have developed real-time streaming solutions with these technologies for a range of use cases, including IoT predictive maintenance.
Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko...Spark Summit
The talk will present a MPI-based extension of the Spark platform developed in the context of light source facilities. The background and rationale of this extension are described in the attached paper “Bringing the HPC reconstruction algorithms to Big Data platforms”[1], which has been presented at New York Scientific Data Summit (NYSDS), August 14-17, 2016 (talk: https://www.bnl.gov/nysds16/files/pdf/talks/NYSDS16%20Malitsky.pdf) Specifically, the paper highlighted a gap between two modern driving forces of the scientific discovery process: HPC and Big Data technologies. As a result, it proposed to extend the Spark platform with inter-worker communication for supporting scientific-oriented parallel applications. The approach was illustrated in the context of the Spark-based deployment of the SHARP MPI/GPU ptychographic solver. Aside from its practical value, this application represents a reference use case that captures the major technical aspects of other reconstruction tasks. In the NYSDS’16 paper, the implemented approach followed the CaffeOnSpark RDMA peer-to-peer model and augmented it with the RDMA address exchange server. By the Spark Summit, we plan to further advance this direction with the Spark-MPI generic solution based on the Hydra process management framework for supporting two major MPI implementations, MPICH and MVAPICH.
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...Spark Summit
Spark Streaming makes it easy to build scalable, robust stream processing applications — but only once you’ve made your data accessible to the framework. If your data is already in one of Spark Streaming’s well-supported message queuing systems, this is easy. If not, an ad hoc solution to import data may work for a single application, but trying to scale that approach to complex data pipelines integrating dozens of data sources and sinks with multi-stage processing quickly breaks down. Spark Streaming solves the realtime data processing problem, but to build large scale data pipeline we need to combine it with another tool that addresses data integration challenges.
The Apache Kafka project recently introduced a new tool, Kafka Connect, to make data import/export to and from Kafka easier. This talk will first describe some data pipeline anti-patterns we have observed and motivate the need for a tool designed specifically to bridge the gap between other data systems and stream processing frameworks. We will introduce Kafka Connect, starting with basic usage, its data model, and how a variety of systems can map to this model. Next, we’ll explain how building a tool specifically designed around Kafka allows for stronger guarantees, better scalability, and simpler operationalization compared to other general purpose data copying tools. Finally, we’ll describe how combining Kafka Connect and Spark Streaming, and the resulting separation of concerns, allows you to manage the complexity of building, maintaining, and monitoring large scale data pipelines.
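Kafka Connect connectors are configured declaratively rather than coded. The shape of a source-connector config looks roughly like this (the keys are standard Connect settings; the connector name, database URL, and topic prefix are made up for the example):

```python
# Illustrative Kafka Connect source-connector configuration, as it would
# be submitted as JSON to the Connect REST API. Connection details and
# names below are hypothetical.
connector_config = {
    "name": "orders-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "tasks.max": "4",                      # parallelism handled by Connect
        "connection.url": "jdbc:postgresql://db:5432/shop",
        "mode": "incrementing",                # track new rows by a column
        "incrementing.column.name": "id",
        "topic.prefix": "shop-",               # one topic per table
    },
}
```

The resulting topics (e.g. `shop-orders`) can then be consumed directly by a Spark Streaming job, which is the separation of concerns the talk describes.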
Building a Real-Time Fraud Prevention Engine Using Open Source (Big Data) Sof...Spark Summit
Fraudsters attempt to pay for goods, flights, hotels – you name it – using stolen credit cards. This hurts both the trust of card holders and the business of vendors around the world. We built a Real-Time Fraud Prevention Engine using Open Source (Big Data) Software: Spark, Spark ML, H2O, Hive, Esper. In my talk I will highlight both the business and the technical challenges that we’ve faced and dealt with.
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...Spark Summit
Building a machine learning model is an iterative process. A data scientist will build many tens to hundreds of models before arriving at one that meets some acceptance criteria. However, the current style of model building is ad-hoc and there is no practical way for a data scientist to manage models that are built over time. In addition, there are no means to run complex queries on models and related data.
In this talk, we present ModelDB, a novel end-to-end system for managing machine learning (ML) models. Using client libraries, ModelDB automatically tracks and versions ML models in their native environments (e.g. spark.ml, scikit-learn). A common set of abstractions enable ModelDB to capture models and pipelines built across different languages and environments. The structured representation of models and metadata then provides a platform for users to issue complex queries across various modeling artifacts. Our rich web frontend provides a way to query ModelDB at varying levels of granularity.
ModelDB has been open-sourced at https://github.com/mitdbg/modeldb.
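The core bookkeeping ModelDB automates can be sketched as a toy tracker in plain Python (names and API are illustrative; real ModelDB hooks into spark.ml and scikit-learn automatically): each fit is recorded with its parameters and metrics so models stay queryable later.

```python
class ToyModelRegistry:
    """Minimal sketch of ModelDB-style tracking: record every trained
    model's parameters and metrics, then query across them."""
    def __init__(self):
        self._runs = []

    def log(self, name, params, metrics):
        # Version numbers increase per model name, like iterative builds.
        version = sum(1 for r in self._runs if r["name"] == name) + 1
        self._runs.append({"name": name, "version": version,
                           "params": params, "metrics": metrics})

    def best(self, name, metric):
        # Example of a query across modeling artifacts.
        runs = [r for r in self._runs if r["name"] == name]
        return max(runs, key=lambda r: r["metrics"][metric])

reg = ToyModelRegistry()
reg.log("churn-lr", {"regParam": 0.1}, {"auc": 0.81})
reg.log("churn-lr", {"regParam": 0.01}, {"auc": 0.84})
best = reg.best("churn-lr", "auc")  # the version-2 run with auc 0.84
```

ModelDB adds to this idea native-environment capture, a common cross-language abstraction, and a web frontend for querying at varying granularity.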
How to Integrate Spark MLlib and Apache Solr to Build Real-Time Entity Type R...Spark Summit
Understanding the types of entities expressed in a search query (Company, Skill, Job Title, etc.) enables more intelligent information retrieval based upon those entities than a traditional keyword-based search. Because search queries are typically very short, leveraging a traditional bag-of-words model to identify entity types would be inappropriate due to the lack of contextual information. We implemented a novel entity type recognition system which combines clues from different sources of varying complexity in order to collect real-world knowledge about query entities. We employ distributional semantic representations of query entities through two models: 1) contextual vectors generated from encyclopedic corpora like Wikipedia, and 2) high-dimensional word embedding vectors generated from millions of job postings using Spark MLlib. To enable real-time recognition of entity types, we use Apache Solr to cache the embedding vectors generated by Spark MLlib. This approach enables us to recognize entity types for entities expressed in search queries in less than 60 milliseconds, which makes the system applicable to real-time entity type recognition.
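The serving-time step amounts to comparing a query entity's embedding against per-type reference vectors (the talk does not specify the similarity measure; cosine similarity is the typical choice for word embeddings, and all vectors below are made up):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def entity_type(query_vec, type_centroids):
    """Return the entity type whose reference embedding is most similar
    to the query entity's embedding. In the described system, embeddings
    come from Spark MLlib and are cached in Solr for low-latency lookup."""
    return max(type_centroids, key=lambda t: cosine(query_vec, type_centroids[t]))

# Toy 3-dimensional embeddings (real ones are high-dimensional).
centroids = {"Company": [0.9, 0.1, 0.0], "Skill": [0.1, 0.9, 0.2]}
label = entity_type([0.8, 0.2, 0.1], centroids)  # closest to "Company"
```

The sub-60 ms latency in the talk comes from caching the precomputed vectors in Solr so only this lightweight comparison runs per query.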
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...Spark Summit
Due to Spark, writing big data applications has never been easier…at least until they stop being easy! At Lightbend we’ve helped our customers out of a number of hidden Spark pitfalls. Some crop up often; the ever-persistent OutOfMemoryError, the confusing NoSuchMethodError, shuffle and partition management, etc. Others occur less frequently; an obscure configuration affecting SQL broadcasts, struggles with speculating, a failing stream recovery due to RDD joins, S3 file reading leading to hangs, etc. All are intriguing! In this session we will provide insights into their origins and show how you can avoid making the same mistakes. Whether you are a seasoned Spark developer or a novice, you should learn some new tips and tricks that could save you hours or even days of debugging.
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...Spark Summit
Data scientists write SQL queries every day. Very often they know how to write correct queries but don’t know why their queries are slow. This is more obvious in Spark than in Redshift, as Spark requires additional tuning, such as caching, while Redshift does the heavy lifting behind the scenes.
In this talk I will cover a few lessons we learned from migrating one of our biggest tables (900M+ rows/day) from AWS Redshift to Spark.
Specifically:
– Why and how do we migrate?
– How do we tune the query for Spark to gain a 10x speedup over a direct translation from Redshift?
– How do we scale the team on Spark (with 80+ people in our data science team)?
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...Spark Summit
So you know you want to write a streaming app, but any non-trivial streaming app developer has to think about these questions:
How do I manage offsets?
How do I manage state?
How do I make my Spark Streaming job resilient to failures? Can I avoid some failures?
How do I gracefully shutdown my streaming job?
How do I monitor and manage (e.g., retry logic for) my streaming job?
How can I better manage the DAG in my streaming job?
When to use checkpointing and for what? When not to use checkpointing?
Do I need a WAL when using streaming data source? Why? When don’t I need one?
In this talk, we’ll share practices that no one talks about when you start writing your streaming app, but you’ll inevitably need to learn along the way.
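The offset-management question, for instance, often comes down to one pattern: commit offsets durably only after a batch is fully processed, so that a crash replays work (at-least-once) rather than losing it. A minimal sketch of that pattern, with an in-memory dict standing in for a real offset store (ZooKeeper, a database, a checkpoint directory):

```python
# In-memory stand-in for a durable offset store:
# topic-partition -> last committed offset.
offset_store = {}

def load_offset(tp):
    return offset_store.get(tp, 0)

def commit_offset(tp, offset):
    offset_store[tp] = offset

def process_batch(tp, records):
    # Resume from the last committed offset.
    start = load_offset(tp)
    batch = records[start:]
    results = [r.upper() for r in batch]  # placeholder for real work
    # Commit ONLY after processing succeeds; a crash before this
    # line means the batch is replayed, not lost.
    commit_offset(tp, len(records))
    return results

records = ["a", "b", "c"]
print(process_batch("events-0", records))  # -> ['A', 'B', 'C']
print(load_offset("events-0"))             # -> 3
```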
APACHE TOREE: A JUPYTER KERNEL FOR SPARK by Marius van NiekerkSpark Summit
Many data scientists are already making heavy usage of the Jupyter ecosystem for analyzing data using interactive notebooks.
Apache Toree (incubating) is a Jupyter kernel designed to act as a gateway to Spark by enabling users to access Spark from standard Jupyter notebooks. This allows users to easily integrate Spark into their existing Jupyter deployments and to move between languages and contexts without needing to switch to a different set of tools.
Apache Toree is designed expressly for interactive work. It supports interpreters in Scala, Python, and R.
In this talk, I will cover the design of Toree, how it interacts with the Jupyter ecosystem and various ways in which users can extend the functionality of Apache Toree via a powerful plugin system.
Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...Spark Summit
Today there are several compliance use cases — archiving, e-discovery, supervision + surveillance, to name a few — that appear naturally suited as Hadoop workloads but haven’t seen wide adoption. In this talk, we’ll discuss common limitations, how Apache Spark helps, and propose some new blueprints as to how to modernize this architecture and disrupt existing solutions. Additionally, we’ll discuss the rising role of Apache Spark in this ecosystem; leveraging machine learning and advanced analytics in a space that has traditionally been restricted to fairly rote reporting.
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...Spark Summit
Apache Spark has become a popular and successful way for Python programmers to parallelize and scale up their data processing. In many use cases, though, a PySpark job can perform worse than an equivalent job written in Scala. It is also costly to push and pull data between the user’s Python environment and the Spark master. In this talk, we’ll examine some of the data serialization and other interoperability issues, especially with Python libraries like pandas and NumPy, that are impacting PySpark performance, and the work that is being done to address them. This relates closely to other work on binary columnar serialization and data-exchange tools in development, such as Apache Arrow and Feather files.
Distributed Deep Learning At Scale On Apache Spark With BigDLYulia Tell
Intel recently released BigDL, an open source distributed deep learning framework for Apache Spark (https://github.com/intel-analytics/BigDL). It brings native support for deep learning functionalities to Spark, delivers orders-of-magnitude speedups over out-of-the-box open source DL frameworks (e.g., Caffe/Torch/TensorFlow) with respect to single-node Xeon performance, and efficiently scales out deep learning workloads on the Spark architecture. It also allows data scientists to perform distributed deep learning analysis on big data using familiar tools, including Python and notebooks.
In this talk, we will give an introduction to BigDL and show how Big Data users and data scientists can leverage it for deep learning analysis (such as image recognition, object detection, and NLP) on large amounts of data in a distributed fashion, allowing them to use their Big Data (e.g., Apache Hadoop and Spark) cluster as the unified data analytics platform for data storage, data processing and mining, feature engineering, traditional (non-deep) machine learning, and deep learning workloads.
BKK16-408B Data Analytics and Machine Learning From Node to ClusterLinaro
Linaro is building an OpenStack based Developer Cloud. Here we present what was required to bring OpenStack to 64-bit ARM, the pitfalls, successes and lessons learnt; what’s missing and what’s next.
Analyzing Big data in R and Scala using Apache Spark 17-7-19Ahmed Elsayed
We can use data mining to make predictions about future data, mined from historical data, especially Big Data, using machine learning algorithms based on two clusters. One is intrinsic to managing the Big Data file system and is called Hadoop; the other is essential for fast analysis of Big Data and is called Apache Spark. To achieve this purpose we will use R (via RStudio) or Scala (via Zeppelin).
ABSTRACT: The ongoing big data revolution has revolutionized the way in which technology is used to empower new business segments like social networking and transform old business segments like traditional retail. However, the DNA that is used to build data processing platforms is evolving quite rapidly. There is a plethora of competing tools, technologies, and “religions” for how to build state-of-the-art data analysis frameworks. In this talk, I will go over five ways to build scalable, high-performance, long-lasting data analysis frameworks the wrong way. Surprisingly, the industry is full of examples of organizations building frameworks in this “wrong” way. Since the “right” way to build a technology framework depends on the key business drivers, it is my hope that this talk will spur a discussion on what the “right” way is for Pinterest. The talk will focus on technologies including “data plumbing” (e.g. tools in the Hadoop ecosystem) and statistical modeling methods (e.g. R and Python). In this talk, I’ll try to connect with platform builders, data scientists, and business decision makers.
BIO: Jignesh Patel is a Professor in Computer Sciences at the University of Wisconsin-Madison, where he also earned his Ph.D. He has worked in the area of databases (now fashionably called “big data”) for over two decades. He has won several best paper awards, and industry research awards. He is the recipient of the Wisconsin COW teaching award, and the U. Michigan College of Engineering Education Excellence Award. He has a strong interest in seeing research ideas transition to actual products. His Ph.D. thesis work was acquired by NCR/Teradata in 1997, and he also co-founded Locomatix -- a startup that built a platform to power real-time data-driven mobile services. Locomatix became part of Twitter in 2013. He is an ACM Distinguished Scientist and an IEEE Senior Member. He also serves on the board of Lands’ End, and advises a number of startups.
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
At Databricks, we have a unique view into over a hundred different companies trying out Spark for development and production use-cases, from their support tickets and forum posts. Having seen so many different workflows and applications, some discernible patterns emerge when looking at common performance and scalability issues that our users run into. This talk will discuss some of these common issues from an engineering and operations perspective, describing solutions and clarifying misconceptions.
In the past, emerging technologies took years to mature. In the case of big data, while effective tools are still emerging, the analytics requirements are changing rapidly, forcing businesses to either keep up or be left behind.
Making NumPy-style and Pandas-style code faster and able to run in parallel: Continuum has been working on scaled versions of NumPy and Pandas for four years. This talk describes how Numba and Dask provide scaled Python today.
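The core idea behind Dask’s scaled NumPy/Pandas can be sketched with the standard library alone: split the data into chunks, compute partial results in parallel, then combine them. Dask additionally builds lazy task graphs, schedules across machines, and handles spilling; the sketch below is only the conceptual skeleton:

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(data, n_chunks):
    # Split a list into roughly equal contiguous chunks.
    size = (len(data) + n_chunks - 1) // n_chunks
    return [data[i:i + size] for i in range(0, len(data), size)]

def parallel_sum(data, n_chunks=4):
    # Map the reduction over chunks in parallel, then combine
    # the partial results (the "tree reduction" Dask generalizes).
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = pool.map(sum, chunked(data, n_chunks))
    return sum(partials)

data = list(range(1_000_000))
print(parallel_sum(data))  # -> 499999500000
```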
BigDL webinar - Deep Learning Library for SparkDESMOND YUEN
BigDL is a distributed deep learning library for Apache Spark* and a unified Big Data platform driving analytics and data science.
Jeremy Nixon, Machine Learning Engineer, Spark Technology Center at MLconf AT...MLconf
Convolutional Neural Networks at scale in Spark MLlib:
Jeremy Nixon will focus on the engineering and applications of a new algorithm built on top of MLlib. The presentation will focus on the methods the algorithm uses to automatically generate features to capture nonlinear structure in data, as well as the process by which it’s trained. Major aspects of that include compositional transformations over the data, convolution, and distributed backpropagation via SGD with adaptive gradients and an adaptive learning rate. Applications will look into how to use convolutional neural networks to model data in computer vision, natural language and signal processing. Details around optimal preprocessing, the type of structure that can be learned, and managing its ability to generalize will inform developers looking to apply nonlinear modeling tools to problems that they face.
GOAI: GPU-Accelerated Data Science DataSciCon 2017Joshua Patterson
The GPU Open Analytics Initiative, GOAI, is accelerating data science like never before. CPUs are not improving at the same rate as networking and storage, and by leveraging GPUs, data scientists can analyze more data than ever with less hardware. Learn more about how GPUs are accelerating data science (not just deep learning), and how to get started.
This is a presentation about big data with Java. In these slides, you can find why big data is so important and some of the tools that are used for creating big data applications, such as Apache Hadoop, Apache Spark, and Apache Kafka.
Introduction to Spark: Or how I learned to love 'big data' after all.Peadar Coyle
Slides from a talk I will give in early 2016 at the Luxembourg Data Science Meetup. The aim is to give an introduction to Apache Spark from a machine learning expert’s point of view. Based on various other tutorials out there. This will be aimed at non-specialists.
Similar to Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang and Yiheng Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
In this session we will present a configurable FPGA-based Spark SQL acceleration architecture. It targets leveraging the FPGA’s highly parallel computing capability to accelerate Spark SQL queries and, given the FPGA’s higher power efficiency compared to the CPU, to lower power consumption at the same time. The architecture consists of SQL query decomposition algorithms and fine-grained FPGA-based Engine Units which perform basic computations: substring, arithmetic, and logic operations. Using the SQL query decomposition algorithm, we are able to decompose a complex SQL query into basic operations, each of which is fed into an Engine Unit according to its pattern. SQL Engine Units are highly configurable and can be chained together to perform complex Spark SQL queries, so that one SQL query is ultimately transformed into a hardware pipeline. We will present performance benchmark results comparing queries on the FPGA-based Spark SQL acceleration architecture (Xeon E5 plus FPGA) to Spark SQL queries on a Xeon E5 alone, showing 10x to 100x improvements, and we will demonstrate one SQL query workload from a real customer.
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
In this talk, we’ll present techniques for visualizing large scale machine learning systems in Spark. These are techniques that are employed by Netflix to understand and refine the machine learning models behind Netflix’s famous recommender systems that are used to personalize the Netflix experience for their 99 million members around the world. Essential to these techniques is Vegas, a new OSS Scala library that aims to be the “missing MatPlotLib” for Spark/Scala. We’ll talk about the design of Vegas and its usage in Scala notebooks to visualize Machine Learning Models.
This presentation introduces how we design and implement a real-time processing platform using the latest Spark Structured Streaming framework to intelligently transform production lines in the manufacturing industry. In a traditional production line there is a variety of isolated structured, semi-structured, and unstructured data, such as sensor data, machine screen output, log output, and database records. There are two main data scenarios: 1) picture and video data, low in frequency but large in size; and 2) continuous data at high frequency, where each record is small but the total volume is very large, such as vibration data used to detect equipment quality. These data have the characteristics of streaming data: real-time, volatile, bursty, unordered, and unbounded. Making effective real-time decisions to retrieve value from these data is critical to smart manufacturing. The latest Spark Structured Streaming framework greatly lowers the bar for building highly scalable and fault-tolerant streaming applications. Thanks to Spark, we were able to build a low-latency, high-throughput, and reliable operation system covering data acquisition, transmission, analysis, and storage. An actual user case proved that the system meets the needs of real-time decision-making. The system greatly enhances predictive fault repair and production-line material-tracking efficiency, and can reduce the labor force needed for the production lines by about half.
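A toy sketch of the kind of per-record decision such a pipeline makes on high-frequency sensor data: keep a sliding window of recent vibration readings and flag a reading that deviates far from the window mean. The window size and threshold are illustrative assumptions, and in the real system Spark Structured Streaming would distribute this windowed logic:

```python
from collections import deque

class VibrationMonitor:
    def __init__(self, window=5, threshold=3.0):
        # Fixed-size sliding window of recent readings.
        self.readings = deque(maxlen=window)
        self.threshold = threshold

    def update(self, value):
        # Flag the reading if it deviates from the window mean
        # by more than the threshold; then slide the window.
        alert = False
        if len(self.readings) == self.readings.maxlen:
            mean = sum(self.readings) / len(self.readings)
            alert = abs(value - mean) > self.threshold
        self.readings.append(value)
        return alert

m = VibrationMonitor()
stream = [1.0, 1.1, 0.9, 1.0, 1.1, 9.5, 1.0]
print([m.update(v) for v in stream])
# -> [False, False, False, False, False, True, False]
```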
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
As common sense would suggest, weather has a definite impact on traffic. But how much? And under what circumstances? Can we improve traffic (congestion) prediction given weather data? Predictive traffic is envisioned to significantly impact how drivers plan their day by alerting users before they travel, finding the best times to travel, and, over time, learning from new IoT data such as road conditions, incidents, etc. This talk will cover the traffic prediction work conducted jointly by IBM and the traffic data provider. As part of this work, we conducted a case study over five large metropolitan areas in the US, 2.58 billion traffic records, and 262 million weather records, to quantify the boost in accuracy of traffic prediction using weather data. We will provide an overview of our lambda architecture, with Apache Spark being used to build prediction models with weather and traffic data, and Spark Streaming used to score the model and provide real-time traffic predictions. This talk will also cover a suite of extensions to Spark to analyze geospatial and temporal patterns in traffic and weather data, as well as the suite of machine learning algorithms that were used with the Spark framework. Initial results of this work were presented at the National Association of Broadcasters meeting in Las Vegas in April 2017, and there is work to scale the system to provide predictions in over 100 cities. The audience will learn about our experience scaling with Spark in offline and streaming mode, building statistical and deep-learning pipelines with Spark, and techniques for working with geospatial and time-series data.
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
Graph is on the rise and it’s time to start learning about scalable graph analytics! In this session we will go over two Spark-based graph analytics frameworks: Tinkerpop and GraphFrames. While both frameworks can express very similar traversals, they have different performance characteristics and APIs. In this deep-dive-by-example presentation, we will demonstrate some common traversals and explain how, at the Spark level, each traversal is actually computed under the hood! Learn both the fluent Gremlin API as well as the powerful GraphFrame Motif API as we show examples of both simultaneously. No need to be familiar with graphs or Spark for this presentation, as we’ll be explaining everything from the ground up!
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
Building accurate machine learning models has been an art of data scientists, i.e., algorithm selection, hyperparameter tuning, feature selection, and so on. Recently, efforts to break through these “black arts” have begun. In cooperation with our partner, NEC Laboratories America, we have developed a Spark-based automatic predictive modeling system. The system automatically searches for the best algorithm, parameters, and features without any manual work. In this talk, we will share how the automation system is designed to exploit the attractive advantages of Spark. The evaluation with real open data demonstrates that our system can explore hundreds of predictive models and discover the most accurate ones in minutes on an Ultra High Density Server, which employs 272 CPU cores, 2TB memory, and 17TB SSD in a 3U chassis. We will also share open challenges in learning such a massive number of models on Spark, particularly from reliability and stability standpoints. This talk will cover the presentation already shown at Spark Summit SF’17 (#SFds5), but from a more technical perspective.
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
In Sweden, from the Rise ICE Data Center at www.hops.site, we are providing researchers with both Spark-as-a-Service and, more recently, Tensorflow-as-a-Service as part of the Hops platform. In this talk, we examine the different ways in which Tensorflow can be included in Spark workflows, from batch to streaming to structured streaming applications. We will analyse the different frameworks for integrating Spark with Tensorflow, from Tensorframes to TensorflowOnSpark to Databricks’ Deep Learning Pipelines. We introduce the different programming models supported and highlight the importance of cluster support for managing different versions of Python libraries on behalf of users. We will also present cluster management support for sharing GPUs, including Mesos and YARN (in Hops Hadoop). Finally, we will perform a live demonstration of training and inference for a TensorflowOnSpark application written in Jupyter that can read data from either HDFS or Kafka, transform the data in Spark, and train a deep neural network on Tensorflow. We will show how to debug the application using both the Spark UI and Tensorboard, and how to examine logs and monitor training.
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
With the rapid growth of available datasets, it is imperative to have good tools for extracting insight from big data. The Spark ML library has excellent support for performing at-scale data processing and machine learning experiments, but more often than not, Data Scientists find themselves struggling with issues such as: low level data manipulation, lack of support for image processing, text analytics and deep learning, as well as the inability to use Spark alongside other popular machine learning libraries. To address these pain points, Microsoft recently released The Microsoft Machine Learning Library for Apache Spark (MMLSpark), an open-source machine learning library built on top of SparkML that seeks to simplify the data science process and integrate SparkML Pipelines with deep learning and computer vision libraries such as the Microsoft Cognitive Toolkit (CNTK) and OpenCV. With MMLSpark, Data Scientists can build models with 1/10th of the code through Pipeline objects that compose seamlessly with other parts of the SparkML ecosystem. In this session, we explore some of the main lessons learned from building MMLSpark. Join us if you would like to know how to extend Pipelines to ensure seamless integration with SparkML, how to auto-generate Python and R wrappers from Scala Transformers and Estimators, how to integrate and use previously non-distributed libraries in a distributed manner and how to efficiently deploy a Spark library across multiple platforms.
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
The Next Accelerator Logging Service (NXCALS) is a new Big Data project at CERN aiming to replace the existing Oracle-based service.
The main purpose of the system is to store and present Controls/Infrastructure related data gathered from thousands of devices in the whole accelerator complex.
The data is used to operate the machines, improve their performance and conduct studies for new beam types or future experiments.
During this talk, Jakub will speak about NXCALS requirements and the design choices that led to the selected architecture based on Hadoop and Spark. He will present the Ingestion API, the abstractions behind the Meta-data Service, and the Spark-based Extraction API, where simple changes to the schema handling greatly improved the overall usability of the system. The system itself is not CERN-specific and can be of interest to other companies or institutes confronted with similar Big Data problems.
Powering a Startup with Apache Spark with Kevin KimSpark Summit
Between (a mobile app for couples, downloaded 20M times globally) uses Spark for everything from daily batch jobs for extracting metrics to analysis and dashboards. Spark is widely used by engineers and data analysts at Between; thanks to the performance and extensibility of Spark, data operations have become extremely efficient. The entire team, including biz dev, global operations, and designers, enjoys the data results, so Spark is empowering the whole company toward data-driven operation and thinking. Kevin, co-founder and data team leader of Between, will present how things are going at Between. Listeners will learn how a small and agile team lives with data (how we build our organization, culture, and technical base) after this presentation.
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
In many cases, Big Data becomes just another buzzword because of the lack of tools that can support both the technological requirements for developing and deploying the projects and the fluency of communication between the different profiles of people involved in them.
In this talk, we will present Moriarty, a set of tools for fast prototyping of Big Data applications that can be deployed in an Apache Spark environment. These tools support the creation of Big Data workflows using existing functional blocks or by creating new functional blocks. The created workflow can then be deployed on a Spark infrastructure and used through a REST API.
For a better understanding of Moriarty, the prototyping process, and the way it hides the Spark environment from Big Data users and developers, we will present it together with a couple of examples based on an Industry 4.0 success case and a logistics success case.
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
Large-scale testing of new data products or enhancements to existing products in a research and development environment can be a technical challenge for data scientists. In some cases, tools available to data scientists lack production-level capacity, whereas other tools do not provide the algorithms needed to run the methodology. At Nielsen, the Databricks platform provided a solution to both of these challenges. This breakout session will cover a specific Nielsen business case where two methodology enhancements were developed and tested at large-scale using the Databricks platform. Development and large-scale testing of these enhancements would not have been possible using standard database tools.
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
Data lineage tracking is one of the significant problems that financial institutions face when using modern big data tools. This presentation describes Spline – a data lineage tracking and visualization tool for Apache Spark. Spline captures and stores lineage information from internal Spark execution plans and visualizes it in a user-friendly manner.
Goal Based Data Production with Sim SimeonovSpark Summit
Since the invention of SQL and relational databases, data production has been about specifying how data is transformed through queries. While Apache Spark can certainly be used as a general distributed query engine, the power and granularity of Spark’s APIs enables a revolutionary increase in data engineering productivity: goal-based data production. Goal-based data production concerns itself with specifying WHAT the desired result is, leaving the details of HOW the result is achieved to a smart data warehouse running on top of Spark. That not only substantially increases productivity, but also significantly expands the audience that can work directly with Spark: from developers and data scientists to technical business users. With specific data and architecture patterns spanning the range from ETL to machine learning data prep and with live demos, this session will demonstrate how Spark users can gain the benefits of goal-based data production.
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
Have you imagined a simple machine learning solution able to prevent revenue leakage and monitor your distributed application? To answer this question, we offer a practical and simple machine learning solution for creating an intelligent monitoring application based on simple data analysis using Apache Spark MLlib. Our application uses linear regression models to make predictions and check whether the platform is experiencing any operational problems that could result in revenue losses. The application monitors distributed systems and provides notifications stating the problem detected, so that users can act quickly to avoid serious problems that directly impact the company’s revenue, reducing the time to action. We will present an architecture for not only a monitoring system, but also an active actor in our outage recoveries. At the end of the presentation you will have access to our training program’s source code, which you will be able to adapt and implement at your company. This solution already helped prevent about US$3M in losses last year.
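The underlying idea can be sketched in a few lines: fit a linear model to a healthy metric trend (e.g. transactions per hour), then flag observations whose residual exceeds a tolerance as possible revenue leaks. The talk uses Apache Spark MLlib’s linear regression; the closed-form least-squares fit below just keeps the sketch self-contained, and the numbers are made up:

```python
def fit_line(xs, ys):
    # Ordinary least squares for a single feature: y = slope*x + intercept.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def is_anomalous(model, x, observed, tolerance):
    # Large residual against the predicted trend -> raise an alert.
    slope, intercept = model
    predicted = slope * x + intercept
    return abs(observed - predicted) > tolerance

hours = [0, 1, 2, 3, 4]
volume = [100, 110, 120, 130, 140]   # healthy trend: +10/hour
model = fit_line(hours, volume)

print(is_anomalous(model, 5, 150, tolerance=15))  # -> False (on trend)
print(is_anomalous(model, 5, 60, tolerance=15))   # -> True  (possible leak)
```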
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
Getting Ready to Use Redis with Apache Spark is a technical tutorial designed to address integrating Redis with an Apache Spark deployment to increase the performance of serving complex decision models. To set the context for the session, we start with a quick introduction to Redis and the capabilities it provides. We cover the basic data types provided by Redis and the module system. Using an ad-serving use case, we look at how Redis can improve the performance and reduce the cost of using complex ML models in production. Attendees will be guided through the key steps of setting up and integrating Redis with Spark, including how to train a model using Spark, then load and serve it using Redis, as well as how to work with the Spark-Redis module. The capabilities of the Redis Machine Learning Module (redis-ml) will be discussed, focusing primarily on decision trees and regression (linear and logistic), with code examples to demonstrate how to use these features. At the end of the session, developers should feel confident building a prototype/proof-of-concept application using Redis and Spark. Attendees will understand how Redis complements Spark and how to use Redis to serve complex ML models with high performance.
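The train-then-serve pattern at the heart of the session can be sketched as follows: a model’s parameters are trained offline (with Spark, in the talk), serialized, and published to a low-latency store for serving. A plain dict stands in for Redis here, and the model key and weights are illustrative assumptions; with a real Redis client the set/get calls would be analogous:

```python
import json

# In-memory stand-in for Redis: key -> serialized model parameters.
model_store = {}

def publish_model(store, key, weights, intercept):
    # Serialize the trained parameters and push them to the store.
    store[key] = json.dumps({"weights": weights, "intercept": intercept})

def serve_score(store, key, features):
    # Fetch the model at serving time and compute a linear score.
    model = json.loads(store[key])
    return sum(w * f for w, f in zip(model["weights"], features)) \
        + model["intercept"]

publish_model(model_store, "ad-ctr-model", [0.5, -0.2], 0.1)
print(serve_score(model_store, "ad-ctr-model", [2.0, 1.0]))  # -> 0.9
```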
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
Here we present a general supervised framework for record deduplication and author disambiguation via Spark. This work differentiates itself in three ways. (1) The use of Databricks and AWS makes this a scalable implementation; compute costs are considerably lower than with traditional legacy technology running big boxes 24/7. Scalability is crucial, as Elsevier's Scopus data, the biggest scientific abstract repository, covers roughly 250 million authorships from 70 million abstracts spanning a few hundred years. (2) We create a fingerprint for each piece of content with deep learning and/or word2vec algorithms to expedite pairwise similarity calculation. These encoders substantially reduce compute time while maintaining semantic similarity (unlike traditional TF-IDF or predefined taxonomies). We will briefly discuss how to optimize word2vec training with high parallelization. Moreover, we show how these encoders can be used to derive a standard representation for all our entities, namely documents, authors, users, journals, etc. This standard representation reduces the recommendation problem to a pairwise similarity search, and hence offers a basic recommender for cross-product applications where no dedicated recommender engine has been designed. (3) Traditional author-disambiguation or record-deduplication algorithms are batch processes with little to no training data. We, however, have roughly 25 million authorships that have been manually curated or corrected based on user feedback. Since it is crucial to maintain historical profiles, we have developed a machine learning implementation that handles data streams and processes them in mini-batches or one document at a time. We will discuss how to measure the accuracy of such a system, how to tune it, and how to turn the raw pairwise-similarity output into final clusters.
Lessons from this talk can help any company that wants to integrate its data or deduplicate its user, customer, or product databases.
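The core of the fingerprint approach is that once every record has a dense vector, candidate duplicates reduce to a pairwise similarity search. A minimal illustration (not Elsevier's code; the vectors and threshold are made up) using cosine similarity:

```python
# Sketch: score candidate duplicates by cosine similarity of their
# "fingerprint" vectors (e.g., word2vec-derived) and threshold the result.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def duplicate_pairs(records, threshold=0.95):
    ids = list(records)
    pairs = []
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            if cosine(records[ids[i]], records[ids[j]]) >= threshold:
                pairs.append((ids[i], ids[j]))
    return pairs

fingerprints = {
    "rec1": [0.9, 0.1, 0.0],
    "rec2": [0.89, 0.11, 0.01],  # near-duplicate of rec1
    "rec3": [0.0, 0.2, 0.95],
}
print(duplicate_pairs(fingerprints))  # → [('rec1', 'rec2')]
```

At Scopus scale the all-pairs loop would of course be replaced by blocking and distributed similarity joins on Spark; the scoring function is the part the encoders make cheap.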
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
The use of large-scale machine learning and data mining methods is becoming ubiquitous in many application domains, ranging from business intelligence and bioinformatics to self-driving cars. These methods heavily rely on matrix computations, and it is hence critical to make these computations scalable and efficient. These matrix computations are often complex and involve multiple steps that need to be optimized and sequenced properly for efficient execution. This work presents new efficient and scalable matrix processing and optimization techniques based on Spark. The proposed techniques estimate the sparsity of intermediate matrix-computation results and optimize communication costs. An evaluation plan generator for complex matrix computations is introduced, as well as a distributed plan optimizer that exploits dynamic cost-based analysis and rule-based heuristics. The result of a matrix operation often serves as input to another matrix operation, thus defining the matrix data dependencies within a matrix program. The matrix query plan generator produces query execution plans that minimize memory usage and communication overhead by partitioning the matrix based on the data dependencies in the execution plan. We implemented the proposed matrix techniques inside Spark SQL, and we optimize the matrix execution plan based on Spark SQL Catalyst. We conduct case studies on a series of ML models and matrix computations with special features on different datasets: PageRank, GNMF, BFGS, sparse matrix chain multiplications, and a biological data analysis. The open-source library ScaLAPACK and the array-based database SciDB are used for performance evaluation. Our experiments are performed on six real-world datasets: social network graphs (soc-pokec, cit-Patents, LiveJournal), Twitter2010, Netflix recommendation data, and a 1000 Genomes Project sample. Experiments demonstrate that our proposed techniques achieve up to an order-of-magnitude performance improvement.
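One idea the abstract mentions, estimating the sparsity of intermediate results so the optimizer can choose dense or sparse operators, can be sketched with a standard density estimate. This is an illustrative formula under an independence assumption, not necessarily the exact estimator MatFast uses: for C = A (m×k) · B (k×n), an entry of C is nonzero unless all k partial products are zero.

```python
# Sketch: estimate the density of a matrix product from its inputs' densities,
# assuming non-zeros are independently and uniformly distributed.
def estimate_product_density(density_a, density_b, k):
    # P(C[i][j] != 0) = 1 - P(all k partial products are zero)
    return 1.0 - (1.0 - density_a * density_b) ** k

# Two 1%-dense matrices with a large inner dimension can yield a
# noticeably denser product, which matters for operator selection:
print(round(estimate_product_density(0.01, 0.01, 10000), 3))  # → 0.632
```

Estimates like this let a plan optimizer decide, before executing anything, whether an intermediate should be materialized in a sparse or dense format and how much communication it will cost.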
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms like PageRank commonly operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation that is compact and cache-friendly.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Techniques to optimize the PageRank algorithm usually fall into two categories. One tries to reduce the work per iteration, and the other tries to reduce the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which have the same in-links, helps reduce duplicate computations and thus could reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
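The first optimization above, skipping converged vertices, can be sketched in plain Python. This is a minimal illustration, not the STICD implementation; it assumes the graph has no dangling nodes, and it skips recomputing a vertex once it and all of its in-neighbours have converged (at that point its rank can no longer move).

```python
# Power-iteration PageRank that skips work for converged vertices.
def pagerank_skip(in_edges, out_degree, d=0.85, tol=1e-10, iters=200):
    n = len(in_edges)
    rank = {v: 1.0 / n for v in in_edges}
    converged = dict.fromkeys(in_edges, False)
    for _ in range(iters):
        new_rank = dict(rank)
        for v, nbrs in in_edges.items():
            if converged[v] and all(converged[u] for u in nbrs):
                continue  # nothing feeding v has moved: skip the work
            new_rank[v] = (1 - d) / n + d * sum(rank[u] / out_degree[u] for u in nbrs)
            converged[v] = abs(new_rank[v] - rank[v]) < tol
        rank = new_rank
        if all(converged.values()):
            break
    return rank

# Tiny example graph: a -> b, a -> c, b -> c, c -> a
in_edges = {"a": ["c"], "b": ["a"], "c": ["a", "b"]}
out_degree = {"a": 2, "b": 1, "c": 1}
ranks = pagerank_skip(in_edges, out_degree)
print({v: round(r, 3) for v, r in ranks.items()})
```

The same per-vertex convergence flag is what makes the component-by-component (topological order) scheduling above possible: a whole component can be marked converged and never touched again.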
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Why BigDL?
Big data boosts deep learning, but a production ML/DL system is complex.
(Andrew Ng, Baidu; NIPS 2015 paper)
Why BigDL?
BigDL was open sourced on Dec 30, 2016.
§ Write deep learning applications as standard Spark programs
§ Run on top of existing Spark or Hadoop clusters (no changes to the clusters)
§ Rich deep learning support
§ High performance powered by Intel MKL and multi-threaded programming
§ Efficient scale-out with all-reduce communication on Spark
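The all-reduce pattern in the last bullet can be shown conceptually. This is not BigDL's implementation (which runs on Spark's block manager): in all-reduce training, every worker contributes its local gradient and receives the same combined gradient back, so all model replicas stay identical after each step.

```python
# Conceptual all-reduce: element-wise sum across workers, result broadcast
# back so every worker holds the same synchronized gradient.
def all_reduce(vectors):
    summed = [sum(col) for col in zip(*vectors)]
    return [list(summed) for _ in vectors]  # every worker gets the same result

# Three workers, each with a local gradient over 4 parameters:
local_grads = [[0.1, 0.2, 0.0, 0.4],
               [0.3, 0.0, 0.1, 0.1],
               [0.0, 0.1, 0.2, 0.0]]
synced = all_reduce(local_grads)
print(synced[0])
```

Real implementations (ring or tree all-reduce) avoid gathering everything in one place, but the invariant is the same: after the call, every worker holds the identical summed gradient.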
Fraud Transaction Detection
Fraud transaction detection is very important to finance companies. A good fraud detection solution can save a lot of money.
ML solution challenges
§ Data cleaning
§ Feature engineering
§ Unbalanced data
§ Hyperparameter tuning
Fraud Transaction Detection
§ Historical data is stored in Hive
§ Easy data preprocessing/cleaning with Spark SQL
§ Spark ML pipeline for complex feature engineering
§ Undersampling + bagging to solve the class-imbalance problem
§ Grid search for hyperparameter tuning
Powered by BigDL
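The "undersampling + bagging" bullet can be sketched in plain Python. This is a hedged illustration of the sampling scheme only, not the talk's code: each model in the ensemble trains on all fraud cases plus an equal-sized random sample of legitimate ones, and predictions are later combined by voting.

```python
# Sketch: build balanced training sets for a bagged ensemble from a
# heavily imbalanced dataset (5 fraud vs. 95 legitimate records).
import random

def make_bagged_sets(fraud, legit, n_models, seed=0):
    rng = random.Random(seed)
    # Each model sees all fraud cases plus a fresh undersample of legit ones.
    return [fraud + rng.sample(legit, len(fraud)) for _ in range(n_models)]

fraud = [("f%d" % i, 1) for i in range(5)]
legit = [("l%d" % i, 0) for i in range(95)]
sets = make_bagged_sets(fraud, legit, n_models=3)
print([len(s) for s in sets])  # → [10, 10, 10]
```

Because each model sees a different slice of the majority class, the ensemble uses far more of the legitimate data than a single undersampled model would, while every individual training set stays balanced.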
Product Defect Detection and Classification
Data source
§ Cameras installed on the manufacturing pipeline
Task
§ Detect defects in the photos
§ Classify the defects
Faster R-CNN
§ Faster R-CNN is a popular object detection framework
§ It shares features between the detection network and the region proposal network
Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
Object Detection with Faster R-CNN
See the code at: https://github.com/intel-analytics/BigDL/pull/387
Language Model with RNN
Pipeline: Text Preprocessing → RNN Model Training → Sentence Generating
§ Sentence Tokenizer
§ Dictionary Building
§ Input Document Transformer
Generates sentences with regard to trigger words.
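The preprocessing and generation steps of such a pipeline can be shown with a toy stand-in. This sketch uses simple bigram counts instead of an RNN (BigDL's actual example trains an RNN in Scala); the corpus and trigger word are made up.

```python
# Toy pipeline: tokenize -> build dictionary -> "train" (count bigrams)
# -> generate a sentence from a trigger word.
from collections import defaultdict

corpus = "the king and the queen . the king spoke .".split()

# Dictionary building: word -> index
vocab = {w: i for i, w in enumerate(dict.fromkeys(corpus))}
print(len(vocab))  # → 6

# Count bigrams as a stand-in "language model"
bigrams = defaultdict(lambda: defaultdict(int))
for a, b in zip(corpus, corpus[1:]):
    bigrams[a][b] += 1

def generate(trigger, length=4):
    out = [trigger]
    for _ in range(length):
        nxt = bigrams.get(out[-1])
        if not nxt:
            break
        out.append(max(nxt, key=nxt.get))  # greedy: most frequent next word
    return " ".join(out)

print(generate("the"))
```

An RNN replaces the bigram table with a learned hidden state, which is what lets it capture context longer than one word, but the surrounding tokenizer/dictionary/generator plumbing is the same.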
RNN Model
See the code at:
https://github.com/intel-analytics/BigDL/tree/master/dl/src/main/scala/com/intel/analytics/bigdl/models/rnn
Learn from Shakespeare Poems
Output of RNN:
Long live the King . The King and Queen , and the Strange of the Veils of the rhapsodic .
and grapple, and the entreatments of the pressure .
Upon her head , and in the world ? `` Oh, the gods ! O Jove ! To whom the king : `` O
friends !
Her hair, nor loose ! If , my lord , and the groundlings of the skies . jocund and Tasso in
the Staggering of the Mankind . and
Fine-tune Caffe/Torch Model on Spark
[Diagram: a pre-trained Caffe or Torch model is loaded as a BigDL model and fine-tuned to predict image styles such as Melancholy, Sunny, and Macro]
• Train on a different dataset based on a pre-trained model
• Predict image style instead of type
• Save training time and improve accuracy
Image source: https://www.flickr.com/photos/
Integration with Spark Streaming
BigDL integrates with Spark Streaming for runtime training and prediction.
[Diagram: streaming sources (HDFS/S3, Kafka, Flume, Kinesis, Twitter) feed Spark Streaming RDDs, which are consumed by a BigDL model for training and prediction, with an evaluator and a stream writer on the output side]
Tight Integration with Spark SQL and DataFrames
df.select($"image")
  .withColumn("image_type", ImgClassifier("image"))
  .filter($"image_type" === "dog")
  .show()
Image classification on ImageNet (http://www.image-net.org)
More BigDL Examples
BigDL provides examples to help developers play with BigDL and get started with popular models.
https://github.com/intel-analytics/BigDL/wiki/Examples
Models (train and inference example code):
§ LeNet, Inception, VGG, ResNet, RNN, Auto-encoder
Examples:
• Text Classification
• Image Classification
• Load Torch/Caffe model