SlideShare a Scribd company logo
1 of 21
Download to read offline
#EUstr1
Intel Big Data Engineering Team
Spark Structured Streaming
Helps Smart Manufacturing
Xiaochang Wu
xiaochang.wu@intel.com
Hao Cheng
hao.cheng@intel.com
Who am I?
• Currently working at Intel Big Data Engineering team
based in Shanghai, China
• Joined Intel from 2006, focus on performance
optimization for CPU / GPU / System / App for ~10 years
• Delivering the best Spark performance on Intel platforms
• Building reference big data platform for Intel customers
2#EUstr1
Agenda
• Motivation
• Architecture and Features Overview
• Adopting Spark Structured Streaming
• Improvement Proposals
3#EUstr1
Motivation
• Building Spark-centric IOT Data Platform for PCB
Manufacturing
– Data collection, processing, analytics, visualization and alerting
– Short-term goal: Enhance predictive fault repair and material tracking
efficiency
• Take advantage of latest Structured Streaming
– Streaming ETL
– Stateless and Stateful processing
• Learn from real-world problems
4#EUstr1
Overview
• Transform the PCB surface mount lines
• SPI: Solder Paste Inspection
• AOI: Automated Optical Inspection
5#EUstr1
Solder
Paste
Printing
SPI
Component
Placement
Pre-Reflow
AOI
Reflow
Soldering
Post-
Reflow
AOI
Data Platform Features
• Device Management
– Register and manage devices
– RPC commands to access device
attributes
• Data Collection
– Collect and store device data in
reliable way
• Data Processing & Analytics
– Domain-specific scripts for both
batch and real-time processing
– Support using Machine Learning
algorithms for advanced analytics
• Rule Engine & Actions
– Process incoming data with flexible
user-defined rules.
– Trigger alarms or send commands to
devices using rules.
• Data Visualization
– Customizable user-friendly UI &
Dashboards
– Generate interactive data reports to
fulfil business needs
• Multi-tenancy, Scale-out, Fault-
tolerance & Security
6#EUstr1
Data Platform Architecture
7#EUstr1
*Other names and brands may be claimed as the property of others.
Data Characteristics
• Data Sources
– Manufacturing Execution System (MES)
– Sensor, Camera, Log
– OCR for machine screen output
• Structured, semi-structured and unstructured data
– Low frequency but large amount
• E.g. Picture and video
– High frequency continuous data
• Not large amount per unit. Total amount is very large
• E.g. vibration sensor data used to detect the quality of the equipment.
8#EUstr1
Spark Structured Streaming
• Built on the Spark SQL engine
– Express streaming the same way as batch
– Dataset/DataFrame API in Scala, Java, Python or R
• Incrementally and continuously updating the final result
– Handling Event-time and Late Data
– Delivering end-to-end exactly-once fault tolerance semantics
– Output modes: Append, Complete, Update mode
• Support Stateless and Stateful operations
9put your #assignedhashtag here by setting the footer in view-header/footer
Fast, scalable, fault-tolerant, end-to-end exactly-once
Streaming ETL
• Data are produced from sources as JSON objects, transferred with
MQTT and saved to Kafka
• JSON objects are complex and have nested structures
• Use Spark SQL functions to extract
– selectExpr, get_json_object, cast, alias
10#EUstr1
Stateless and Stateful Operations
• Stateless
– Projection, Filter
• Stateful (Implicit)
– Aggregation
– Time window
– Join
• Advanced arbitrary stateful operations
– mapGroupsWithState / flatMapGroupsWithState
11#EUstr1
Structured Streaming Considerations
• Output mode: append, update, complete
• Handling late data
– Method 1: use time window and automatic watermark
– Method 2: use manual watermark
• Post-processing
– Debug: console sink / memory sink
– File sink / Foreach sink
– Collect to driver side and process
12#EUstr1
Case 1: Find out offset AVG and STDDEV in Solder Paste
Inspection (SPI)
13#EUstr1
Case 2: Find out machine off time from event logs (1)
• Track sessions from streams of events and find out machine off time
– mapGroupsWithState / flatMapGroupsWithState
– Off time = Duration from status “off” to next “on”
14#EUstr1
Case 2: Find out machine off time from event logs (2)
• Event: define data types for events
• SessionInfo: Session information as state
• SessionUpdate: Update information returned as result
15#EUstr1
16#EUstr1
Case 2: Find out machine off time from event logs (3)
• Event time ordering
• Every subsequent event depends on previous event
• Required to sort events based on timestamp due to logical dependence
• Handling late events
• Late data will break previous calculation result, forced to maintain all
events and recalculate
• Solution: manual watermarking for late data, remove old events if
timeout, incremental calculation
17#EUstr1
Improvement Proposals
• Flexible window and trigger operation
• Complex Event Processing (CEP) APIs
– Stream Join
– Session Tracking
– Pattern
– Alert
• Streaming Domain Specific Language (DSL)
– SQL-like language for Streams
• We are working on them!
18#EUstr1
See Also from Intel team
• Today 14:00 @ AUDITORIUM
An Adaptive Execution Engine For Apache Spark SQL
by Carson Wang
• Tomorrow 11:00 @ LIFFEY HALL
FPGA-Based Acceleration Architecture for Spark SQL
by Qi Xie
19#EUstr1
Legal Notices and Disclaimers
• You may not use or facilitate the use of this document in connection with
any infringement or other legal analysis concerning Intel products described
herein. You agree to grant Intel a non-exclusive, royalty-free license to any
patent claim thereafter drafted which includes subject matter disclosed
herein.
• No license (express or implied, by estoppel or otherwise) to any intellectual
property rights is granted by this document.
• Intel disclaims all express and implied warranties, including without
limitation, the implied warranties of merchantability, fitness for a particular
purpose, and non-infringement, as well as any warranty arising from course
of performance, course of dealing, or usage in trade.
20#EUstr1
References
• Surface mount process: http://www.surfacemountprocess.com/
• Structured Streaming Programming Guide:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
• Five Spark SQL Utility Functions to Extract and Explore Complex Data Types:
https://databricks.com/blog/2017/06/13/five-spark-sql-utility-functions-extract-explore-
complex-data-types.html
• Spark Structured Sessionization example:
https://github.com/apache/spark/blob/v2.2.0/examples/src/main/scala/org/apache/spa
rk/examples/sql/streaming/StructuredSessionization.scala
21#EUstr1

More Related Content

What's hot

Building a Business Logic Translation Engine with Spark Streaming for Communi...
Building a Business Logic Translation Engine with Spark Streaming for Communi...Building a Business Logic Translation Engine with Spark Streaming for Communi...
Building a Business Logic Translation Engine with Spark Streaming for Communi...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Spark Summit
 
An AI-Powered Chatbot to Simplify Apache Spark Performance Management
An AI-Powered Chatbot to Simplify Apache Spark Performance ManagementAn AI-Powered Chatbot to Simplify Apache Spark Performance Management
An AI-Powered Chatbot to Simplify Apache Spark Performance Management
Databricks
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Speeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
Speeding Up Spark with Data Compression on Xeon+FPGA with David OjikaSpeeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
Speeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
Databricks
 

What's hot (20)

Building a Business Logic Translation Engine with Spark Streaming for Communi...
Building a Business Logic Translation Engine with Spark Streaming for Communi...Building a Business Logic Translation Engine with Spark Streaming for Communi...
Building a Business Logic Translation Engine with Spark Streaming for Communi...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
 
Web-Scale Graph Analytics with Apache Spark with Tim Hunter
Web-Scale Graph Analytics with Apache Spark with Tim HunterWeb-Scale Graph Analytics with Apache Spark with Tim Hunter
Web-Scale Graph Analytics with Apache Spark with Tim Hunter
 
Dr. Elephant: Achieving Quicker, Easier, and Cost-Effective Big Data Analytic...
Dr. Elephant: Achieving Quicker, Easier, and Cost-Effective Big Data Analytic...Dr. Elephant: Achieving Quicker, Easier, and Cost-Effective Big Data Analytic...
Dr. Elephant: Achieving Quicker, Easier, and Cost-Effective Big Data Analytic...
 
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
 
Spark Summit EU talk by Stephan Kessler
Spark Summit EU talk by Stephan KesslerSpark Summit EU talk by Stephan Kessler
Spark Summit EU talk by Stephan Kessler
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
 
The Next CERN Accelerator Logging Service—A Road to Big Data with Jakub Wozni...
The Next CERN Accelerator Logging Service—A Road to Big Data with Jakub Wozni...The Next CERN Accelerator Logging Service—A Road to Big Data with Jakub Wozni...
The Next CERN Accelerator Logging Service—A Road to Big Data with Jakub Wozni...
 
Accelerated Spark on Azure: Seamless and Scalable Hardware Offloads in the C...
 Accelerated Spark on Azure: Seamless and Scalable Hardware Offloads in the C... Accelerated Spark on Azure: Seamless and Scalable Hardware Offloads in the C...
Accelerated Spark on Azure: Seamless and Scalable Hardware Offloads in the C...
 
An AI-Powered Chatbot to Simplify Apache Spark Performance Management
An AI-Powered Chatbot to Simplify Apache Spark Performance ManagementAn AI-Powered Chatbot to Simplify Apache Spark Performance Management
An AI-Powered Chatbot to Simplify Apache Spark Performance Management
 
Spark Summit EU talk by Yiannis Gkoufas
Spark Summit EU talk by Yiannis GkoufasSpark Summit EU talk by Yiannis Gkoufas
Spark Summit EU talk by Yiannis Gkoufas
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Apache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
Apache Spark At Apple with Sam Maclennan and Vishwanath LakkundiApache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
Apache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
 
Spark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni SchieferSpark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni Schiefer
 
Spark Summit EU talk by Ruben Pulido Behar Veliqi
Spark Summit EU talk by Ruben Pulido Behar VeliqiSpark Summit EU talk by Ruben Pulido Behar Veliqi
Spark Summit EU talk by Ruben Pulido Behar Veliqi
 
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis MagdaApache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
 
Spark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar CastanedaSpark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar Castaneda
 
Speeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
Speeding Up Spark with Data Compression on Xeon+FPGA with David OjikaSpeeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
Speeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
 

Similar to Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu

SplunkLive! Customer Presentation - Garmin International
SplunkLive! Customer Presentation - Garmin InternationalSplunkLive! Customer Presentation - Garmin International
SplunkLive! Customer Presentation - Garmin International
Splunk
 

Similar to Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu (20)

Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex
 
IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
 IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
 
Shantanu's Resume
Shantanu's ResumeShantanu's Resume
Shantanu's Resume
 
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache ApexHadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
 
Introduction to Apache Apex
Introduction to Apache ApexIntroduction to Apache Apex
Introduction to Apache Apex
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
 
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
 
Spark Autotuning - Spark Summit East 2017
Spark Autotuning - Spark Summit East 2017 Spark Autotuning - Spark Summit East 2017
Spark Autotuning - Spark Summit East 2017
 
Centralized Logging System Using ELK Stack
Centralized Logging System Using ELK StackCentralized Logging System Using ELK Stack
Centralized Logging System Using ELK Stack
 
Architectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark StreamingArchitectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark Streaming
 
Splunk App for Stream
Splunk App for StreamSplunk App for Stream
Splunk App for Stream
 
SplunkLive! Customer Presentation - Garmin International
SplunkLive! Customer Presentation - Garmin InternationalSplunkLive! Customer Presentation - Garmin International
SplunkLive! Customer Presentation - Garmin International
 
How to create innovative architecture using VisualSim?
How to create innovative architecture using VisualSim?How to create innovative architecture using VisualSim?
How to create innovative architecture using VisualSim?
 
How to create innovative architecture using VisualSim?
How to create innovative architecture using VisualSim?How to create innovative architecture using VisualSim?
How to create innovative architecture using VisualSim?
 
How to create innovative architecture using ViualSim?
How to create innovative architecture using ViualSim?How to create innovative architecture using ViualSim?
How to create innovative architecture using ViualSim?
 
Łukasz Romaszewski on Internet of Things Raspberry Pi and Java Embedded JavaC...
Łukasz Romaszewski on Internet of Things Raspberry Pi and Java Embedded JavaC...Łukasz Romaszewski on Internet of Things Raspberry Pi and Java Embedded JavaC...
Łukasz Romaszewski on Internet of Things Raspberry Pi and Java Embedded JavaC...
 
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
 
XDF 2019 Xilinx Accelerated Database and Data Analytics Ecosystem
XDF 2019 Xilinx Accelerated Database and Data Analytics EcosystemXDF 2019 Xilinx Accelerated Database and Data Analytics Ecosystem
XDF 2019 Xilinx Accelerated Database and Data Analytics Ecosystem
 
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC Systems
 

More from Spark Summit

Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Spark Summit
 
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Spark Summit
 
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Spark Summit
 

More from Spark Summit (20)

VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
 
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
 
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
 
Variant-Apache Spark for Bioinformatics with Piotr Szul
Variant-Apache Spark for Bioinformatics with Piotr SzulVariant-Apache Spark for Bioinformatics with Piotr Szul
Variant-Apache Spark for Bioinformatics with Piotr Szul
 
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
 
Best Practices for Using Alluxio with Apache Spark with Gene Pang
Best Practices for Using Alluxio with Apache Spark with Gene PangBest Practices for Using Alluxio with Apache Spark with Gene Pang
Best Practices for Using Alluxio with Apache Spark with Gene Pang
 
Smack Stack and Beyond—Building Fast Data Pipelines with Jorg Schad
Smack Stack and Beyond—Building Fast Data Pipelines with Jorg SchadSmack Stack and Beyond—Building Fast Data Pipelines with Jorg Schad
Smack Stack and Beyond—Building Fast Data Pipelines with Jorg Schad
 
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
 

Recently uploaded

sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444
saurabvyas476
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
q6pzkpark
 
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
LuisMiguelPaz5
 
Simplify hybrid data integration at an enterprise scale. Integrate all your d...
Simplify hybrid data integration at an enterprise scale. Integrate all your d...Simplify hybrid data integration at an enterprise scale. Integrate all your d...
Simplify hybrid data integration at an enterprise scale. Integrate all your d...
varanasisatyanvesh
 
Displacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second DerivativesDisplacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second Derivatives
23050636
 
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
mikehavy0
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
pwgnohujw
 
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotecAbortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 

Recently uploaded (20)

Predictive Precipitation: Advanced Rain Forecasting Techniques
Predictive Precipitation: Advanced Rain Forecasting TechniquesPredictive Precipitation: Advanced Rain Forecasting Techniques
Predictive Precipitation: Advanced Rain Forecasting Techniques
 
sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
 
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
Huawei Ransomware Protection Storage Solution Technical Overview Presentation...
 
Simplify hybrid data integration at an enterprise scale. Integrate all your d...
Simplify hybrid data integration at an enterprise scale. Integrate all your d...Simplify hybrid data integration at an enterprise scale. Integrate all your d...
Simplify hybrid data integration at an enterprise scale. Integrate all your d...
 
Displacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second DerivativesDisplacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second Derivatives
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
 
DS Lecture-1 about discrete structure .ppt
DS Lecture-1 about discrete structure .pptDS Lecture-1 about discrete structure .ppt
DS Lecture-1 about discrete structure .ppt
 
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
DAA Assignment Solution.pdf is the best1
DAA Assignment Solution.pdf is the best1DAA Assignment Solution.pdf is the best1
DAA Assignment Solution.pdf is the best1
 
Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024
 
Pentesting_AI and security challenges of AI
Pentesting_AI and security challenges of AIPentesting_AI and security challenges of AI
Pentesting_AI and security challenges of AI
 
jll-asia-pacific-capital-tracker-1q24.pdf
jll-asia-pacific-capital-tracker-1q24.pdfjll-asia-pacific-capital-tracker-1q24.pdf
jll-asia-pacific-capital-tracker-1q24.pdf
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptx
 
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
 
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotecAbortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
 

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu

  • 1. #EUstr1 Intel Big Data Engineering Team Spark Structured Streaming Helps Smart Manufacturing Xiaochang Wu xiaochang.wu@intel.com Hao Cheng hao.cheng@intel.com
  • 2. Who am I? • Currently working at Intel Big Data Engineering team based in Shanghai, China • Joined Intel from 2006, focus on performance optimization for CPU / GPU / System / App for ~10 years • Delivering the best Spark performance on Intel platforms • Building reference big data platform for Intel customers 2#EUstr1
  • 3. Agenda • Motivation • Architecture and Features Overview • Adopting Spark Structured Streaming • Improvement Proposals 3#EUstr1
  • 4. Motivation • Building Spark-centric IOT Data Platform for PCB Manufacturing – Data collection, processing, analytics, visualization and alerting – Short-term goal: Enhance predictive fault repair and material tracking efficiency • Take advantage of latest Structured Streaming – Streaming ETL – Stateless and Stateful processing • Learn from real-world problems 4#EUstr1
  • 5. Overview • Transform the PCB surface mount lines • SPI: Solder Paste Inspection • AOI: Automated Optical Inspection 5#EUstr1 Solder Paste Printing SPI Component Placement Pre-Reflow AOI Reflow Soldering Post- Reflow AOI
  • 6. Data Platform Features • Device Management – Register and manage devices – RPC commands to access device attributes • Data Collection – Collect and store device data in reliable way • Data Processing & Analytics – Domain-specific scripts for both batch and real-time processing – Support using Machine Learning algorithms for advanced analytics • Rule Engine & Actions – Process incoming data with flexible user-defined rules. – Trigger alarms or send commands to devices using rules. • Data Visualization – Customizable user-friendly UI & Dashboards – Generate interactive data reports to fulfil business needs • Multi-tenancy, Scale-out, Fault- tolerance & Security 6#EUstr1
  • 7. Data Platform Architecture 7#EUstr1 *Other names and brands may be claimed as the property of others.
  • 8. Data Characteristics • Data Sources – Manufacturing Execution System (MES) – Sensor, Camera, Log – OCR for machine screen output • Structured, semi-structured and unstructured data – Low frequency but large amount • E.g. Picture and video – High frequency continuous data • Not large amount per unit. Total amount is very large • E.g. vibration sensor data used to detect the quality of the equipment. 8#EUstr1
  • 9. Spark Structured Streaming • Built on the Spark SQL engine – Express streaming the same way as batch – Dataset/DataFrame API in Scala, Java, Python or R • Incrementally and continuously updating the final result – Handling Event-time and Late Data – Delivering end-to-end exactly-once fault tolerance semantics – Output modes: Append, Complete, Update mode • Support Stateless and Stateful operations 9put your #assignedhashtag here by setting the footer in view-header/footer Fast, scalable, fault-tolerant, end-to-end exactly-once
  • 10. Streaming ETL • Data are produced from sources as JSON objects, transferred with MQTT and saved to Kafka • JSON objects are complex and have nested structures • Use Spark SQL functions to extract – selectExpr, get_json_object, cast, alias 10#EUstr1
  • 11. Stateless and Stateful Operations • Stateless – Projection, Filter • Stateful (Implicit) – Aggregation – Time window – Join • Advanced arbitrary stateful operations – mapGroupsWithState / flatMapGroupsWithState 11#EUstr1
  • 12. Structured Streaming Considerations • Output mode: append, update, complete • Handling late data – Method 1: use time window and automatic watermark – Method 2: use manual watermark • Post-processing – Debug: console sink / memory sink – File sink / Foreach sink – Collect to driver side and process 12#EUstr1
  • 13. Case 1: Find out offset AVG and STDDEV in Solder Paste Inspection (SPI) 13#EUstr1
  • 14. Case 2: Find out machine off time from event logs (1) • Track sessions from streams of events and find out machine off time – mapGroupsWithState / flatMapGroupsWithState – Off time = Duration from status “off” to next “on” 14#EUstr1
  • 15. Case 2: Find out machine off time from event logs (2) • Event: define data types for events • SessionInfo: Session information as state • SessionUpdate: Update information returned as result 15#EUstr1
  • 17. Case 2: Find out machine off time from event logs (3) • Event time ordering • Every subsequent event depends on previous event • Required to sort events based on timestamp due to logical dependence • Handling late events • Late data will break previous calculation result, forced to maintain all events and recalculate • Solution: manual watermarking for late data, remove old events if timeout, incremental calculation 17#EUstr1
  • 18. Improvement Proposals • Flexible window and trigger operation • Complex Event Processing (CEP) APIs – Stream Join – Session Tracking – Pattern – Alert • Streaming Domain Specific Language (DSL) – SQL-like language for Streams • We are working on them! 18#EUstr1
  • 19. See Also from Intel team • Today 14:00 @ AUDITORIUM An Adaptive Execution Engine For Apache Spark SQL by Carson Wang • Tomorrow 11:00 @ LIFFEY HALL FPGA-Based Acceleration Architecture for Spark SQL by Qi Xie 19#EUstr1
  • 20. Legal Notices and Disclaimers • You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein. • No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. • Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade. 20#EUstr1
  • 21. References • Surface mount process: http://www.surfacemountprocess.com/ • Structured Streaming Programming Guide: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html • Five Spark SQL Utility Functions to Extract and Explore Complex Data Types: https://databricks.com/blog/2017/06/13/five-spark-sql-utility-functions-extract-explore- complex-data-types.html • Spark Structured Sessionization example: https://github.com/apache/spark/blob/v2.2.0/examples/src/main/scala/org/apache/spa rk/examples/sql/streaming/StructuredSessionization.scala 21#EUstr1