End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta (Databricks)
Hopsworks is an open-source data platform that can be used to both develop and operate horizontally scalable machine learning pipelines. A key part of our pipelines is the world’s first open-source Feature Store, based on Apache Hive, that acts as a data warehouse for features, providing a natural API between data engineers – who write feature engineering code in Spark (in Scala or Python) – and Data Scientists, who select features from the feature store to generate training/test data for models. In this talk, we will discuss how Databricks Delta solves several of the key challenges in building both feature engineering pipelines that feed our Feature Store and in managing the feature data itself.
Firstly, we will show how expectations and schema enforcement in Databricks Delta can be used to provide data validation, ensuring that feature data does not have missing or invalid values that could negatively affect model training. Secondly, time-travel in Databricks Delta can be used to provide version management and experiment reproducibility for training/test datasets. That is, given a model, you can re-run the training experiment for that model using the same version of the data that was used to train the model.
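As a rough sketch of these two features (not the Hopsworks pipeline itself), the following PySpark snippet shows expectation-style validation, a schema-enforced write, and a time-travel read; the table path, column names, and version number are placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("feature-validation-demo").getOrCreate()

    # Hypothetical raw feature data; path and columns are illustrative only.
    features = spark.read.parquet("/tmp/raw_clicks")

    # Expectation-style validation: drop rows with missing values before they
    # can reach the feature store and harm model training.
    valid = features.filter(col("user_id").isNotNull() & col("click_rate").isNotNull())

    # Schema enforcement: Delta rejects appends whose schema does not match the table.
    valid.write.format("delta").mode("append").save("/tmp/feature_store/clicks")

    # Time travel: re-read the exact snapshot that was used to train a given model.
    training_data = (spark.read.format("delta")
                     .option("versionAsOf", 12)   # version recorded at training time
                     .load("/tmp/feature_store/clicks"))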
We will also discuss the next steps needed to take this work further. Finally, we will give a live demo showing how Delta can be used in end-to-end ML pipelines with Spark on Hopsworks.
Choose Your Weapon: Comparing Spark on FPGAs vs GPUs (Databricks)
Today, general-purpose CPU clusters are the most widely used environment for data analytics workloads. Recently, acceleration solutions employing field-programmable hardware have emerged providing cost, performance and power consumption advantages. Field programmable gate arrays (FPGAs) and graphics processing units (GPUs) are two leading technologies being applied. GPUs are well-known for high-performance dense-matrix, highly regular operations such as graphics processing and matrix manipulation. FPGAs are flexible in terms of programming architecture and are adept at providing performance for operations that contain conditionals and/or branches. These architectural differences have significant performance impacts, which manifest all the way up to the application layer. It is therefore critical that data scientists and engineers understand these impacts in order to inform decisions about if and how to accelerate.
This talk will characterize the architectural aspects of the two hardware types as applied to analytics, with the ultimate goal of informing the application programmer. Recently, both GPUs and FPGAs have been applied to Apache SparkSQL, via services on Amazon Web Services (AWS) cloud. These solutions’ goal is providing Spark users high performance and cost savings. We first characterize the key aspects of the two hardware platforms. Based on this characterization, we examine and contrast the sets and types of SparkSQL operations they accelerate well, how they accelerate them, and the implications for the user’s application. Finally, we present and analyze a performance comparison of the two AWS solutions (one FPGA-based, one GPU-based). The tests employ the TPC-DS (decision support) benchmark suite, a widely used performance test for data analytics.
In the last two years, Netflix has seen a mass migration to Spark from Pig and other MR engines. This talk will focus on the challenges of that migration and the work that has made it possible. This will include contributions that Netflix has made to Spark to enable wider adoption and on-going projects to make Spark appeal to a broader range of analysts, beyond data and ML engineers.
Speaker: Ryan Blue
- A brief introduction to Spark Core
- Introduction to Spark Streaming
- A Demo of Streaming by evaluating the top hashtags being used
- Introduction to Spark MLlib
- A Demo of MLlib by building a simple movie recommendation engine
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters... (Databricks)
At the end of the day, the only thing that data scientists want is tabular data for their analysis. They do not want to spend hours or days preparing data. How does a data engineer handle the massive amount of data that is being streamed at them from IoT devices and apps, and at the same time add structure to it so that data scientists can focus on finding insights and not preparing data? By the way, you need to do this within minutes (sometimes seconds). Oh… and there are a lot of other data sources that you need to ingest, and the current providers of data are changing their structure.
GoPro has massive amounts of heterogeneous data being streamed from their consumer devices and applications, and they have developed the concept of “dynamic DDL” to structure their streamed data on the fly using Spark Streaming, Kafka, HBase, Hive and S3. The idea is simple: Add structure (schema) to the data as soon as possible; allow the providers of the data to dictate the structure; and automatically create event-based and state-based tables (DDL) for all data sources to allow data scientists to access the data via their lingua franca, SQL, within minutes.
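As an illustrative sketch of the "dynamic DDL" idea (not GoPro's actual implementation, which used Spark Streaming, Kafka, HBase, Hive and S3), the PySpark snippet below lets a sample payload from the producer dictate the schema and lands each micro-batch in an automatically created, SQL-queryable table; the broker, topic, and table names are placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json

    spark = (SparkSession.builder.appName("dynamic-ddl-demo")
             .enableHiveSupport().getOrCreate())

    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
           .option("subscribe", "device_events")               # placeholder topic
           .load())

    # Infer the schema from a sample payload, so the data provider drives the DDL.
    sample = '{"device_id": "cam-001", "event": "record_start", "ts": 1540000000}'
    schema = spark.read.json(spark.sparkContext.parallelize([sample])).schema

    events = (raw.select(from_json(col("value").cast("string"), schema).alias("e"))
                 .select("e.*"))

    # Each micro-batch is appended to a table that data scientists can query in SQL.
    query = (events.writeStream
             .foreachBatch(lambda df, _: df.write.mode("append").saveAsTable("device_events"))
             .option("checkpointLocation", "/tmp/checkpoints/device_events")
             .start())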
Scaling Apache Spark on Kubernetes at Lyft (Databricks)
Lyft is on a mission to improve people's lives with the world's best transportation. As part of this mission, Lyft invests heavily in open source infrastructure and tooling. At Lyft, Kubernetes has emerged as the next generation of cloud-native infrastructure to support a wide variety of distributed workloads, and Apache Spark has evolved to handle both machine learning and large-scale ETL workloads. By combining the flexibility of Kubernetes with the data processing power of Apache Spark, Lyft is able to take ETL data processing to a different level. In this talk, Li Gao and Rohit Menon will discuss the challenges the Lyft team faced and the solutions they developed to support Apache Spark on Kubernetes in production and at scale. Topics include:
- Key traits of Apache Spark on Kubernetes.
- A deep dive into Lyft's multi-cluster setup and operations for handling petabytes of production data.
- How Lyft extends and enhances Apache Spark to support capabilities such as Spark pod lifecycle metrics and state management, resource prioritization, and queuing and throttling.
- Dynamic job scale estimation and runtime dynamic job configuration.
- How Lyft powers internal Data Scientists, Business Analysts, and Data Engineers via a multi-cluster setup.
Speakers: Li Gao, Rohit Menon
Build Large-Scale Data Analytics and AI Pipeline Using RayDP (Databricks)
A large-scale end-to-end data analytics and AI pipeline usually involves data processing frameworks such as Apache Spark for massive data preprocessing, and ML/DL frameworks for distributed training on the preprocessed data. A conventional approach is to use two separate clusters and glue multiple jobs together. Other solutions include running deep learning frameworks in an Apache Spark cluster, or using workflow orchestrators like Kubeflow to stitch distributed programs together. All these options have their own limitations. We introduce Ray as a single substrate for distributed data processing and machine learning. We also introduce RayDP, which allows you to start an Apache Spark job on Ray in your Python program and utilize Ray's in-memory object store to efficiently exchange data between Apache Spark and other libraries. We will demonstrate how this makes building an end-to-end data analytics and AI pipeline simpler and more efficient.
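A minimal sketch of this pattern, based on RayDP's documented entry point (https://github.com/oap-project/raydp); the cluster sizing values and the toy DataFrame are placeholders.

    import ray
    import raydp

    ray.init()  # or ray.init(address="auto") to join an existing Ray cluster

    # Start an Apache Spark job inside the Ray cluster.
    spark = raydp.init_spark(app_name="raydp-demo",
                             num_executors=2,
                             executor_cores=2,
                             executor_memory="4GB")

    df = spark.range(0, 1_000_000).withColumnRenamed("id", "feature")

    # Hand the preprocessed data to other Ray libraries via Ray's object store,
    # e.g. as a Ray Dataset that a distributed training library can consume.
    ds = ray.data.from_spark(df)
    print(ds.count())

    raydp.stop_spark()
    ray.shutdown()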
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow (Databricks)
The data science lifecycle consists of multiple iterative steps: data collection, data cleaning/exploration, feature engineering, model training, model deployment and scoring among others. The process is often tedious and error-prone and requires considerable human effort. Apart from these challenges, when it comes to leveraging ML in enterprise applications, especially in regulated environments, the level of scrutiny for data handling, model fairness, user privacy, and debuggability is very high. In this talk, we present the basic features of Flock, an end-to-end platform that facilitates adoption of ML in enterprise applications. We refer to this new class of applications as Enterprise Grade Machine Learning (EGML). Flock leverages MLflow to simplify and automate some of the steps involved in supporting EGML applications, allowing data scientists to spend most of their time on improving their ML models. Flock makes use of MLflow for model and experiment tracking but extends and complements it by providing automatic logging, deeper integration with relational databases that often store confidential data, model optimizations and support for the ONNX model format and the ONNX Runtime for inference. We will also present our ongoing work on automatically tracking lineage between data and ML models which is crucial in regulated environments. We will showcase Flock’s features through a demo using Microsoft’s Azure Data Studio and MLflow.
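To make the MLflow side of this concrete, here is a hedged sketch of plain MLflow autologging plus ONNX model logging (using the skl2onnx converter); it is not the Flock extensions described in the talk, and the toy model is illustrative only.

    import mlflow
    import mlflow.onnx
    from skl2onnx import to_onnx
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    mlflow.autolog()  # automatic logging of parameters, metrics, and the fitted model

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    with mlflow.start_run():
        model = LogisticRegression(max_iter=200).fit(X, y)

        # Convert to ONNX and log it so the ONNX Runtime can be used for inference.
        onnx_model = to_onnx(model, X[:1].astype("float32"))
        mlflow.onnx.log_model(onnx_model, artifact_path="model_onnx")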
Resource-Efficient Deep Learning Model Selection on Apache Spark (Databricks)
Deep neural networks (deep nets) are revolutionizing many machine learning (ML) applications. But there is a major bottleneck to broader adoption: the pain of model selection.
Building Reliable Data Lakes at Scale with Delta Lake (Databricks)
Most data practitioners grapple with data reliability issues—it’s the bane of their existence. Data engineers, in particular, strive to design, deploy, and serve reliable data in a performant manner so that their organizations can make the most of their valuable corporate data assets.
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads. Built on open standards, Delta Lake employs co-designed compute and storage and is compatible with Spark APIs. It provides high data reliability and query performance to support big data use cases, from batch and streaming ingest and fast interactive queries to machine learning. In this tutorial we will discuss the requirements of modern data engineering, the challenges data engineers face when it comes to data reliability and performance, and how Delta Lake can help. Through presentation, code examples and notebooks, we will explain these challenges and the use of Delta Lake to address them. You will walk away with an understanding of how you can apply this innovation to your data architecture and the benefits you can gain.
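As a rough, hedged illustration of the ACID behaviour the tutorial covers (not the tutorial's own notebooks), the PySpark sketch below performs an atomic upsert (MERGE) into a Delta table; it assumes the delta-spark package is available, and the paths and columns are made up.

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = (SparkSession.builder.appName("delta-acid-demo")
             .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .getOrCreate())

    # Initial batch load creates the table.
    spark.createDataFrame([(1, "new"), (2, "new")], ["order_id", "status"]) \
         .write.format("delta").save("/tmp/delta/orders")

    # Later updates (from a batch job or a streaming micro-batch) are merged
    # atomically: concurrent readers never observe a half-applied update.
    updates = spark.createDataFrame([(2, "shipped"), (3, "new")], ["order_id", "status"])

    target = DeltaTable.forPath(spark, "/tmp/delta/orders")
    (target.alias("t")
           .merge(updates.alias("u"), "t.order_id = u.order_id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())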
This tutorial will be both an instructor-led and a hands-on interactive session. Instructions on how to get the tutorial materials will be covered in class.
What you’ll learn:
Understand the key data reliability challenges
How Delta Lake brings reliability to data lakes at scale
Understand how Delta Lake fits within an Apache Spark™ environment
How to use Delta Lake to realize data reliability improvements
Prerequisites
A fully-charged laptop (8-16GB memory) with Chrome or Firefox
Pre-register for Databricks Community Edition
Hardware Acceleration of Apache Spark on Energy-Efficient FPGAs with Christof... (Spark Summit)
In this talk, we will present the SPynq framework: a framework for the efficient mapping and acceleration of Spark applications on heterogeneous all-programmable MPSoC-based platforms, such as Zynq. Spark has been mapped to the Pynq platform, and the proposed framework allows the seamless utilization of the programmable logic for the hardware acceleration of computationally intensive Spark kernels. We have also developed the required libraries in Spark, by extending the MLlib library, that hide the accelerator's details and minimize the design effort needed to utilize the accelerators. A cluster of 4 nodes (workers) based on the all-programmable MPSoCs has been implemented, and the proposed platform is evaluated on a typical machine learning application based on logistic regression. The logistic regression kernel has been developed as an accelerator and incorporated into Spark. The developed system is compared to a high-performance Xeon cluster that is typically used in cloud computing. The performance evaluation shows that the heterogeneous accelerator-based MPSoC can achieve up to 2.3x system speedup compared with a Xeon system (at 90% accuracy) and 20x better energy efficiency. For embedded applications, the proposed system can achieve up to 40x speedup compared to a software-only implementation on low-power embedded processors and 30x lower energy consumption.
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSA (Databricks)
EFSA is the European agency providing independent scientific advice on existing and emerging risks across the entire food chain. On 27/03/2021, a new EU regulation (EU 2019/1381) came into force, requiring EFSA to significantly increase the transparency of its risk assessment processes towards all citizens.
To comply with this new regulation, delaware has been supporting EFSA in undergoing a large Digital Transformation program. We have been designing and rolling out a modern data platform running on Azure and powered by Databricks. This platform acts as a central control tower brokering data between a variety of applications. It is built around modularity principles, making it adaptable and versatile while keeping the overall ecosystem aligned with changing processes and data models. At the heart of the platform lie two important patterns:
1. An Event-Driven Architecture (EDA): enabling an extremely loosely coupled system landscape. By centrally brokering events in near real-time, consumer applications can react immediately to events from producer applications as they occur. Event producers are decoupled from consumers via a publish/subscribe mechanism.
2. A central data store built around a lakehouse architecture. The lakehouse collects, organizes and serves data across all stages of the data processing cycle, all data types and all data volumes. Event streams from the EDA layer feed into the store as curated data blocks and are complemented by other sources; a minimal sketch of this ingestion pattern is shown below. This store in turn feeds into APIs, reporting and applications, including the new Open EFSA portal: a public website developed by delaware hosting all relevant scientific data, updated in near real-time.
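The sketch below is illustrative only (not EFSA's actual pipeline): it shows, in PySpark, how events published on the EDA layer can be continuously curated into a Delta "bronze" table of the lakehouse; the broker endpoint, topic, and paths are placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("eda-to-lakehouse").getOrCreate()

    events = (spark.readStream.format("kafka")   # e.g. a Kafka-compatible event broker
              .option("kafka.bootstrap.servers", "eventbroker:9093")   # placeholder
              .option("subscribe", "risk-assessment-events")           # placeholder topic
              .load()
              .select(col("key").cast("string"),
                      col("value").cast("string"),
                      col("timestamp")))

    # Land the raw events as a curated, queryable Delta table in the lakehouse.
    (events.writeStream.format("delta")
           .option("checkpointLocation", "/lake/checkpoints/bronze_events")
           .start("/lake/bronze/risk_assessment_events"))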
At delaware we are very excited about this project and proud of what we have achieved with EFSA so far.
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum... (Spark Summit)
Devops engineers have applied a great deal of creativity and energy to invent tools that automate infrastructure management, in the service of deploying capable and functional applications. For data-driven applications running on Apache Spark, the details of instantiating and managing the backing Spark cluster can be a distraction from focusing on the application logic. In the spirit of devops, automating Spark cluster management tasks allows engineers to focus their attention on application code that provides value to end-users.
Using Openshift Origin as a laboratory, we implemented a platform where Apache Spark applications create their own clusters and then dynamically manage their own scale via host-platform APIs. This makes it possible to launch a fully elastic Spark application with little more than the click of a button.
We will present a live demo of turn-key deployment for elastic Apache Spark applications, and share what we’ve learned about developing Spark applications that manage their own resources dynamically with platform APIs.
The audience for this talk will be anyone looking for ways to streamline their Apache Spark cluster management, reduce the workload for Spark application deployment, or create self-scaling elastic applications. Attendees can expect to learn about leveraging APIs in the Kubernetes ecosystem that enable application deployments to manipulate their own scale elastically.
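As a hedged sketch of the self-managing idea (not the exact tooling demoed in the talk), an application can call the platform API to change its own worker count; the example below uses the official Kubernetes Python client, with placeholder deployment and namespace names.

    from kubernetes import client, config

    def scale_workers(replicas, deployment="spark-workers", namespace="my-app"):
        """Ask the platform API to change the number of Spark worker pods."""
        config.load_incluster_config()   # the application runs inside the cluster
        apps = client.AppsV1Api()
        apps.patch_namespaced_deployment_scale(
            name=deployment,
            namespace=namespace,
            body={"spec": {"replicas": replicas}})

    # The application decides its own scale, e.g. growing before a heavy stage
    # and shrinking back once the stage finishes.
    scale_workers(10)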
SparkR - Play Spark Using R (20160909 HadoopCon) (wqchen)
1. Introduction to SparkR
2. Demo
Starting to use SparkR
DataFrames: dplyr style, SQL style
RDD vs. DataFrames
SparkR on MLlib: GLM, K-means
3. Use Case
Median: approxQuantile()
ID Match: dplyr style, SQL style, SparkR function
SparkR + Shiny
4. The Future of SparkR
Have you considered keeping your children at home and giving it a whirl? Are you nervous about all the schooly stuff you need to do? Relax and enjoy as you learn together. You can do it! Magazines, free curriculum, software, and good old fun.
This is the presentation I gave at LESS2010 on the 18th of October 2010. The deck describes how we do lean product development and customer development, and what the current results are.
Here's all you want to know about cloud computing: why it is used, its advantages, structure, etc. All queries regarding cloud computing are addressed in this presentation. For a demo of such software in the accounting field, visit www.arcus-universe.com
How Online Education Can Be Used to Reduce Illiteracy in Remote Areas (shuvo510)
Nowadays, online education is a popular medium of education. How can online education be used to reduce the illiteracy problem in the remote areas of our country?
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2022/06/seamless-deployment-of-multimedia-and-machine-learning-applications-at-the-edge-a-presentation-from-qualcomm/
Megha Daga, Senior Director of Product Management for AIoT at Qualcomm, presents the “Seamless Deployment of Multimedia and Machine Learning Applications at the Edge” tutorial at the May 2022 Embedded Vision Summit.
There has been an explosion of opportunities for edge compute solutions across the internet of things. This growth in opportunities and the diversity of applications is leading to fragmentation in the IoT space both in hardware and software, which creates challenges for developers. In addition, customers and developers are facing challenges in efficient data management and optimized application deployment on embedded edge platforms.
In this session, Daga introduces the Qualcomm Intelligent Multimedia SDK, which empowers developers to tackle these challenges and deploy edge compute applications in a scalable, flexible and optimized way. The Qualcomm Intelligent Multimedia SDK easily decodes and organizes sensor data and executes applications efficiently on edge platforms.
Mobius talk at the Seattle Spark Meetup (Feb 2016). Mobius adds a C# language binding to Apache Spark, enabling the implementation of Spark driver code and data processing operations in C#. More info @ https://github.com/Microsoft/Mobius. Tweet to @MobiusForSpark.
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn (Grokking VN)
This tech talk by Khai Tran covers LinkedIn's data pipeline, which is used to collect tens of billions of messages every day, and how they run a real-time processing system to aggregate this data for metrics monitoring.
Some of the points the talk will cover:
- An introduction to LinkedIn's unified metrics platform
- How LinkedIn sets up its big data pipeline using Kafka, HDFS, Apache Calcite and Apache Samza
- The concept of nearline storage, and how LinkedIn moved from an offline architecture to a nearline architecture
Speaker: Khai Tran, Staff Software Engineer - LinkedIn.
- Currently a staff software engineer at LinkedIn, responsible for the metrics monitoring system. Previously worked at Amazon AWS and Oracle.
- PhD, University of Wisconsin-Madison, with research on database systems.
Total Knockout: Start-to-Finish Development of Suitability Applications Using... (Blue Raster)
2014 Esri International Developer Summit User Presentation
Learn strategies for developing cross-platform suitability analysis applications using ArcGIS Image Server, D3 and Knockout.
Round 1: Suitability Web Services, automating raster pre-processing and mosaic creation, creating custom raster functions, publishing and consuming image services
Round 2: Visualization, calculating statistics for user drawn polygons, displaying results using D3, creating map and raster data export
Round 3: Building the App, building suitability user interfaces in JavaScript, mobile development strategies, managing data using Knockout
Offline maps for mobile developers (Android/iOS) (Vadim Nikolaev)
Accessing maps offline is a common problem for mobile developers. The presentation considers the reasons this need arises and possible libraries for solving the task.
Abhishek Singhal, Riversoft
Ben O’Donnell, BIMobject
Albert Szilvasy, Autodesk
AutoCAD DWG files are widely used in many industries today, and the Forge platform provides REST APIs to unlock the data inside them or to create new ones. This class will show how to securely connect Forge to various data storage services where your DWGs reside. It will demonstrate how to access a database while processing a DWG, a critical piece for many customers who embed database keys in their DWG files. Finally, we will look at how a real-world customer uses the Design Automation APIs today.
Reconfigurable Coprocessors Synthesis in the MPEG-RVC Domain (MDC_UNICA)
Flexibility and high efficiency are common design drivers in the embedded systems domain. Coarse-grained reconfigurable coprocessors can tackle these issues, but they suffer from complex design, debugging and application-mapping problems. In this paper, we propose an automated design flow that aids developers in designing and managing coarse-grained reconfigurable coprocessors. It provides both the hardware IP and the software drivers, featuring two different levels of coupling with the host processor. The presented solution has been tested on a JPEG codec, targeting a commercial Xilinx Virtex-5 FPGA.
As a leading IT service provider in the consumer finance field, Shanghai Rongzhijia Financial Information Service Co., Ltd. built China's first Internet loan search platform. It went from zero to over 30 million users, who have taken out nearly 15 billion RMB in loans through the platform within only two years.
These slides introduce how they evolved their IT system from a monolithic application to a Dubbo-based microservice architecture.
Ford Motor Company's mission to become both an Automotive and Mobility company has required an evolution in our analytics data flow, from traditional batch processing systems to dynamically routed stream processing based systems. Valuable data is continually being generated across the enterprise, from consumer WiFi in dealerships, robots working on the assembly line, and vehicle diagnostic data, and is now flowing into Ford's Real Time Streaming Architecture (RTSA). Our goal was to develop a provider agnostic, end to end solution to ingest and dynamically route individual streams of data in less than one second from edge node to Ford's on premise data center, or vice versa. The architecture dynamically scales in the cloud to reliably handle thousands of outbound and inbound transactions per second, with data provenance capabilities to audit data flow from end to end.
Romuald Zdebskiy, Games Lead for CEE, Microsoft
There is a lot of buzz about the cloud for game development. At the same time, there are scenarios that benefit more from the cloud as well as those that benefit less. The goal of this session is to walk you through more than 10 gamedev scenarios where the cloud works well and where it does not. We will use Microsoft Cloud – Azure – as an example.
In this presentation, keynoting a session at the 2012 Model Based Enterprise Summit on lightweight visualization, Consortium Executive Director Dave Opsahl explains how what we typically refer to as "visualization", while a needed and extremely valuable category of solutions, is not the same as "communication". The MBE Summit, hosted annually by the National Institute of Standards and Technology, is a gathering of managers, technologists, engineers, and thought leaders on how to enable the Model Based Enterprise (MBE).
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia (Jen Aman)
2017 continues to be an exciting year for Apache Spark. I will talk about new updates in two major areas in the Spark community this year: stream processing with Structured Streaming, and deep learning with high-level libraries such as Deep Learning Pipelines and TensorFlowOnSpark. In both areas, the community is making powerful new functionality available in the same high-level APIs used in the rest of the Spark ecosystem (e.g., DataFrames and ML Pipelines), and improving both the scalability and ease of use of stream processing and machine learning.
Snorkel: Dark Data and Machine Learning with Christopher Ré (Jen Aman)
Building applications that can read and analyze a wide variety of data may change the way we do science and make business decisions. However, building such applications is challenging: real world data is expressed in natural language, images, or other “dark” data formats which are fraught with imprecision and ambiguity and so are difficult for machines to understand. This talk will describe Snorkel, whose goal is to make routine Dark Data and other prediction tasks dramatically easier. At its core, Snorkel focuses on a key bottleneck in the development of machine learning systems: the lack of large training datasets. In Snorkel, a user implicitly creates large training sets by writing simple programs that label data, instead of performing manual feature engineering or tedious hand-labeling of individual data items. We’ll provide a set of tutorials that will allow folks to write Snorkel applications that use Spark.
Snorkel is open source on github and available from Snorkel.Stanford.edu.
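To give a flavour of "simple programs that label data", here is a hedged sketch using the labeling-function API of the current snorkel package (the API at the time of this talk may have differed); the spam example and thresholds are illustrative only.

    import pandas as pd
    from snorkel.labeling import labeling_function, PandasLFApplier

    SPAM, NOT_SPAM, ABSTAIN = 1, 0, -1

    @labeling_function()
    def lf_contains_link(x):
        # Messages containing a URL are probably spam.
        return SPAM if "http" in x.text.lower() else ABSTAIN

    @labeling_function()
    def lf_short_message(x):
        # Very short messages are probably not spam.
        return NOT_SPAM if len(x.text.split()) < 5 else ABSTAIN

    df = pd.DataFrame({"text": ["check out http://spam.example",
                                "thanks!",
                                "buy now http://offer.example"]})

    applier = PandasLFApplier([lf_contains_link, lf_short_message])
    label_matrix = applier.apply(df)   # noisy labels, later denoised by Snorkel's label model
    print(label_matrix)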
Deep Learning on Apache® Spark™: Workflows and Best Practices (Jen Aman)
The combination of Deep Learning with Apache Spark has the potential for tremendous impact in many sectors of the industry. This webinar, based on the experience gained in assisting customers with the Databricks Virtual Analytics Platform, will present some best practices for building deep learning pipelines with Spark.
Rather than comparing deep learning systems or specific optimizations, this webinar will focus on issues that are common to deep learning frameworks when running on a Spark cluster, including:
* optimizing cluster setup;
* configuring the cluster;
* ingesting data; and
* monitoring long-running jobs.
We will demonstrate the techniques we cover using Google’s popular TensorFlow library. More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters.
Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput, and monitoring facilitates both the work of configuration and the stability of deep learning jobs.
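For a concrete but hedged picture of GPU-aware cluster configuration: this webinar predates Spark 3.x resource scheduling, so treat the properties below as one current way to express the idea of giving each task its own GPU; the discovery-script path is a placeholder.

    from pyspark import TaskContext
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder.appName("dl-on-spark")
             # Each executor owns 2 GPUs and each task claims 1, so two deep learning
             # tasks can run per executor without contending for the same device.
             .config("spark.executor.resource.gpu.amount", "2")
             .config("spark.task.resource.gpu.amount", "1")
             .config("spark.executor.resource.gpu.discoveryScript", "/opt/spark/getGpus.sh")
             .getOrCreate())

    def train_partition(rows):
        # Inside a task, the assigned device indices come from the task context.
        gpus = TaskContext.get().resources()["gpu"].addresses   # e.g. ["0"]
        # ... pin the deep learning framework to `gpus` here ...
        return [len(list(rows))]

    print(spark.sparkContext.parallelize(range(8), 4).mapPartitions(train_partition).collect())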
RISELab: Enabling Intelligent Real-Time Decisions (Jen Aman)
Spark Summit East Keynote by Ion Stoica
A long-standing grand challenge in computing is to enable machines to act autonomously and intelligently: to rapidly and repeatedly take appropriate actions based on information in the world around them. To address this challenge, at UC Berkeley we are starting a new five year effort that focuses on the development of data-intensive systems that provide Real-Time Intelligence with Secure Execution (RISE). Following in the footsteps of AMPLab, RISELab is an interdisciplinary effort bringing together researchers across AI, robotics, security, and data systems. In this talk I’ll present our research vision and then discuss some of the applications that will be enabled by RISE technologies.
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try to reduce the work per iteration, and the other is to try to reduce the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
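As a small, plain-Python illustration of one of these ideas (skipping vertices whose ranks have already converged); this is a toy sketch, not the STICD implementation, and it uses a naive per-vertex delta check for convergence.

    def pagerank_skip_converged(out_links, damping=0.85, tol=1e-10, max_iter=100):
        """Power iteration that stops updating vertices once their rank stabilises."""
        n = len(out_links)
        in_links = {v: [] for v in out_links}
        for u, outs in out_links.items():
            for v in outs:
                in_links[v].append(u)

        rank = {v: 1.0 / n for v in out_links}
        converged = set()

        for _ in range(max_iter):
            new_rank = {}
            for v in out_links:
                if v in converged:            # skip work for already-converged vertices
                    new_rank[v] = rank[v]
                    continue
                s = sum(rank[u] / len(out_links[u]) for u in in_links[v])
                new_rank[v] = (1 - damping) / n + damping * s
                if abs(new_rank[v] - rank[v]) < tol:
                    converged.add(v)
            rank = new_rank
            if len(converged) == n:
                break
        return rank

    # Tiny 4-vertex example with no dangling nodes.
    print(pagerank_skip_converged({0: [1], 1: [2], 2: [0, 3], 3: [2]}))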
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
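A hedged, non-distributed sketch of the levelwise idea using networkx for the SCC decomposition; it assumes the input graph has no dead ends, as the abstract requires, and is meant only to show the component-by-component processing order.

    import networkx as nx

    def levelwise_pagerank(G, damping=0.85, tol=1e-12, max_iter=200):
        n = G.number_of_nodes()
        rank = {v: 1.0 / n for v in G}

        # Condense the graph into strongly connected components and process them
        # in topological order; upstream ranks are final before a block starts.
        cond = nx.condensation(G)
        for c in nx.topological_sort(cond):
            block = cond.nodes[c]["members"]
            for _ in range(max_iter):
                diff = 0.0
                for v in block:
                    s = sum(rank[u] / G.out_degree(u) for u in G.predecessors(v))
                    new = (1 - damping) / n + damping * s
                    diff += abs(new - rank[v])
                    rank[v] = new
                if diff < tol:
                    break
        return rank

    # Small example; the self-loop on vertex 3 avoids a dead end.
    G = nx.DiGraph([(0, 1), (1, 2), (2, 1), (2, 3), (3, 3)])
    print(levelwise_pagerank(G))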
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Large Scale Multimedia Data Intelligence And Analysis On Spark
1. Large Scale Multimedia Data Processing on Spark and Related Applications at Baidu
Quan Wang, Baidu USA
2. Motivations
• Why Multimedia?
• Examples of large scale multimedia processing at Baidu:
– HD map generation and simulation for self-driving cars
– Image feature extraction and transform for CTR predictions
4. Data Challenge in the Self-driving Cars Project
• Backend map generation and simulation:
– Point cloud input of 40MB/s for a single 3D LiDAR sensor
– Counting all 2D/3D sensors => TB level of data, per hour, per car
• Example: HD map generation on Spark
5. Another Example: Baidu Image Search
• Billions of images need feature extractions & transformations for deep learning applications:
– Image recognition and classification
– Ranking for best picture to show
– CTR prediction
6. Challenges
• Core functions for feature extraction
• Efficient large-scale distributed execution of feature extraction with multimedia input support
• Plug and play for any feature extraction executable never designed for Spark; flexible and easy to use for platform users
7. Feature Extraction Core Function
• The feature extraction C++ program depends on the CDNN + OpenCV libraries
– Compute the per-pixel difference based on a pre-computed mean
– Feed the difference values into a pre-computed CDNN model
– Produce image features after multiple layers of computation
• Need streaming/pipe based function support
13-15. Technical Details: Binary Flow on Each Executor Node
[Diagram across slides 13-15: a partition of multimedia files is encoded and serialized into a binary stream, which is deserialized and decoded for the user logic; the output is encoded and serialized back into a binary stream, gathered as a local collection of output bytes on each executor and then as a master collection of bytes, and finally decoded/deserialized and written as output binary files on DFS.]
16. Implementation Highlights
• All data serialization and encoding is done inside the Spark RDD, and is thus linearly scalable
• Flexible output format (feature vectors, ranking scores, processed binaries, etc.)
• Easy to plug customized encoding/serialization functions directly into the platform
• Support for passing Spark internal information (e.g. partition id, task attempt id) into the user program
17. Flexibility
Flexible output format (feature vectors, ranking scores, processed binaries, etc.)
Easy to plug customized encoding/serialization functions directly into Spark
19. Conclusion
• Introduce the Binary Piped RDD for:
– Platform-level abstraction of input data formats in their original binary form
– Seamless streaming to and from existing executables/libraries for high-level data analysis and understanding
– Linear scalability with input data
• General data intelligence and analysis
– Binary input format + pipe-based bin/lib execution
Missing functionality in Spark/Hadoop
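The deck's own solution is a custom Binary Piped RDD; as a hedged approximation using only stock Spark APIs, the PySpark sketch below base64-encodes whole image files so they survive the line-oriented RDD.pipe() call into an external feature-extraction executable (the executable path and input/output locations are placeholders).

    import base64
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("binary-pipe-demo").getOrCreate()
    sc = spark.sparkContext

    # Read whole image files as (path, bytes) pairs.
    images = sc.binaryFiles("hdfs:///data/images/*.jpg")

    # Encode each binary payload as one base64 text line so it survives the
    # line-oriented pipe, then stream it through an external extractor binary.
    encoded = images.map(lambda kv: base64.b64encode(kv[1]).decode("ascii"))
    features = encoded.pipe("/opt/extract_features")   # placeholder executable

    features.saveAsTextFile("hdfs:///data/image_features")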