In the manufacturing industry, reliability and time to market are key factors in accomplishing business goals. Nowadays, analytics is increasingly deployed to extract insights from data and foster a data-driven culture, achieving greater effectiveness and efficiency in business operations.
In the analytics domain, the real challenges often lie in data collection: heterogeneous and widespread data sources, the choice of ingestion technologies and strategies, the need to ensure a continuous data inflow, and the release of production-ready analytics services that can be integrated into daily operations.
To address those challenges, the Magneti Marelli ICT Innovation team has adopted a structured approach starting from the foundations, that is, by building a distinctive big data architecture known as the Magneti Marelli Architecture (MARC). Unlike common big data architectures, which are built on batch or streaming paradigms, MARC is an event- and service-oriented architecture with the flexibility to manage complex tasks running in the plant DMZ, in the plant network, and in the cloud. It combines traditional patterns for handling data, such as “Service Broker”, “Forwarder”, “Singleton”, “Wrapping”, and “Store and Forward”, with best-of-breed technologies such as Databricks, Microsoft Azure Data Lake Store, Azure SQL, Power BI, and Azure Functions.
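To make the pattern mix concrete, here is a minimal, hypothetical sketch of a Singleton that wraps a SparkSession behind a small facade; the class name and methods are invented for illustration and are not MARC's actual API.

```python
from pyspark.sql import SparkSession

class MarcSparkFacade:
    """Hypothetical Singleton + Facade: one shared SparkSession behind a narrow API."""
    _instance = None

    def __new__(cls):
        # Singleton: build the wrapper (and its session) only once per process.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._spark = (
                SparkSession.builder
                .appName("marc-service")
                .getOrCreate()
            )
        return cls._instance

    def read_parquet(self, path):
        # Facade: callers depend on this narrow method, not on Spark internals.
        return self._spark.read.parquet(path)

# Every caller gets the same underlying session.
facade = MarcSparkFacade()
assert facade is MarcSparkFacade()
```

The point of wrapping is that plant-side and cloud-side services program against one narrow interface, while the Singleton guarantees a single shared session per process.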
This presentation introduces MARC's key components together with its main integrated services. It also shows how the routine data-management issues mentioned above are addressed and solved with the aid of MARC's structure and related services, with practical examples of incremental data ingestion, incremental data processing, hybrid Spark deployments, and the use of heterogeneous application servers. Finally, it shows how the adoption of a structured approach to building a big data architecture for data management has dramatically increased the demand for analytics services and their effective use by the business to reduce manufacturing costs.
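As a taste of the first of those examples, incremental ingestion is commonly implemented with a high-water mark: persist the newest timestamp seen so far and load only rows beyond it. The sketch below is illustrative, not the MARC implementation; the JDBC URL, table, and column names are placeholders.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("incremental-ingestion").getOrCreate()

# Assumed inputs: a JDBC source table with a last_update column and a
# previously persisted high-water mark (hard-coded here for illustration).
jdbc_url = "jdbc:sqlserver://example;databaseName=plant"  # placeholder
high_water_mark = "2018-01-01 00:00:00"

increment = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.measurements")  # placeholder table
    .load()
    .filter(F.col("last_update") > F.lit(high_water_mark))
)

# Append only the new rows, then persist the new mark for the next run.
increment.write.mode("append").parquet("/datalake/raw/measurements")
new_mark = increment.agg(F.max("last_update")).first()[0]
```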
This document is a project report submitted by MD Dilshad in partial fulfillment of a diploma in computer science and engineering from Maulana Azad National Urdu University. It details their 24-week industrial training with Web Monk Technology in New Delhi, where they worked on website design and development using HTML, CSS, PHP, and WordPress. The objectives of the training were to gain practical experience in a real work environment, apply academic knowledge, and prepare for future employment.
This document summarizes the internship work of programming a virtual test box to automate testing of Body Control Modules (BCMs) in vehicles. The test box uses National Instruments hardware, the CANoe software, and custom scripts to test BCM signals, messages, and configurations over the CAN network and K-Line interface. The intern programmed test cases in XML, CAPL, and .NET to check over 100 signal and message values across 17 test cases for vehicle exterior lights and door controls. A challenge was fully integrating the K-Line server tool to read diagnostic trouble codes, which is a work in progress. The intern gained experience programming automated tests and interfacing with vehicle networks.
= Manage ontologies and use semantic data in SharePoint with GRASP =
GRASP ("Graph for SharePoint") is the SharePoint solution that introduces ontologies and semantic data into SharePoint. Ontologies are uploaded and managed directly in SharePoint. This fosters collaboration among ontologists and ensures preservation and compliance with the ECM-strategy of your company.
= SPARQL queries in SharePoint =
Ontologies are uploaded into an attached triple store (RDF store) directly from within SharePoint. With the standard query language SPARQL, you can query them and retrieve their data. Additionally, any semantic data accessible via a SPARQL endpoint or triple store can be processed in SharePoint. SPARQL query results are available in SharePoint web parts and SharePoint lists, generating insights for your SharePoint users and workflows.
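As a rough idea of what such a query looks like when issued programmatically against a SPARQL endpoint (GRASP itself surfaces results through web parts, not code), here is a minimal Python sketch using the SPARQLWrapper library; the endpoint URL is a placeholder.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Placeholder endpoint; in GRASP the triple store is attached to SharePoint.
sparql = SPARQLWrapper("http://example.org/sparql")
sparql.setQuery("""
    SELECT ?class ?label WHERE {
        ?class a <http://www.w3.org/2002/07/owl#Class> ;
               <http://www.w3.org/2000/01/rdf-schema#label> ?label .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)

# Each binding row maps variable names to typed values.
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["class"]["value"], row["label"]["value"])
```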
= Applications with GRASP =
GRASP is optimized for companies that pursue a SharePoint-based strategy and want to extend it to cover their ontologies, or that want to use semantic data to improve business processes. Typical industries include pharma, insurance, and manufacturing.
*Central ontology life-cycle management in SharePoint.
*Controlled and standardized user access, backup, and recovery strategies for ontologies.
*Semantic data from ontologies and SPARQL endpoints becomes accessible to SharePoint users and workflows (requires Triplestore Basic, OpenLink Virtuoso, or TopBraid).
DoIP pairs the ISO 13400-2 transport layer with the ISO 14229-5 UDS application layer. Find out how DoIP supports next-generation remote vehicle diagnostics and automotive ECU applications.
https://www.embitel.com/blog/embedded-blog/how-uds-on-ip-or-doip-is-enabling-remote-vehicle-diagnostics
The document discusses the need to reform and improve vocational education in India. It notes that currently, vocational education makes up a small percentage of the education system and is not aligned well with industry needs. The document outlines several problems with the current system, including a lack of private sector involvement, rigid regulations, and few opportunities for career progression or skill upgrading. It also discusses government initiatives to establish a National Vocational Qualification Framework and compares vocational education frameworks in other countries like the UK, Australia, and China. The goal is to make recommendations to help introduce higher-quality vocational education programs in India.
Nettur Technical Training Foundation (NTTF) is a premier technical educational institution established in 1959 that provides corporate training services. It has over 20 training centers across India and over 100 experienced trainers. NTTF offers both on-campus and off-campus customized technical, functional, and soft skills training programs to over 100 corporate clients across various sectors such as automotive, aerospace, construction, food processing, and more. Some of NTTF's major clients include Maruti Suzuki, Ashok Leyland, Tata Motors, and MRF.
The document is a resume for Priyanka Mahajan, who has a B.Tech in Electronics and Communication Engineering. She is currently working as a Technical Recruiter at IMS People in Ahmedabad, India. Her experience includes identifying, screening, and qualifying candidates for engineering positions. She has skills in areas like PCB design, computer networking, programming languages, and simulation software. She also has several certifications and participated in many technical competitions and workshops during her time in college.
Hybrid Transactional/Analytics Processing with Spark and IMDGs - Ali Hodroj
This document discusses hybrid transactional/analytical processing (HTAP) with Apache Spark and in-memory data grids. It begins by introducing the speaker and GigaSpaces. It then discusses how modern applications require both online transaction processing and real-time operational intelligence. The document presents examples from retail and IoT and the goals of minimizing latency while maximizing data analytics locality. It provides an overview of in-memory computing options and describes how GigaSpaces uses an in-memory data grid combined with Spark to achieve HTAP. The document includes deployment diagrams and discusses data grid RDDs and pushing predicates to the data grid. It describes how this was productized as InsightEdge and provides additional innovations and reference architectures.
The document discusses advanced database technologies and techniques. It provides examples of using MySQL, PostgreSQL, and Tokutek databases. It discusses approaches to improving speed, availability, reliability, and scalability of databases. It also covers monitoring databases, optimizing database and query performance, and profiling queries. Examples demonstrate how to optimize queries and access data from different databases.
The Fine Art of Time Travelling - Implementing Event Sourcing - Andrea Saltar... - ITCamp
If there is one common practice in architecting software systems, it is to have them store the last known state of business entities in a relational database. Though widely adopted and well supported by existing development tools, this practice trades ease of implementation for the loss of those entities' history.
Event Sourcing provides a pivotal solution to this problem, giving systems the capability of restoring the state they had at any given point in time. Furthermore, injecting mock-up events and having them replayed by the business logic allows for an easy implementation of simulations and “what if” scenarios.
In this session, Andrea will demonstrate how to design time travelling systems by examining real-world, production-tested solutions.
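To make the idea concrete, here is a minimal, generic event-sourcing sketch (not the speaker's code): state is never stored, only events, and "time travelling" is a replay of events up to a chosen point.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Event:
    at: datetime
    kind: str      # e.g. "deposited" / "withdrawn"
    amount: int

def replay(events, until):
    """Rebuild an account balance as of `until` by replaying stored events."""
    balance = 0
    for e in sorted(events, key=lambda e: e.at):
        if e.at > until:
            break
        balance += e.amount if e.kind == "deposited" else -e.amount
    return balance

log = [
    Event(datetime(2018, 1, 1), "deposited", 100),
    Event(datetime(2018, 2, 1), "withdrawn", 30),
    Event(datetime(2018, 3, 1), "deposited", 50),
]
print(replay(log, until=datetime(2018, 2, 15)))  # state as of Feb 15: 70
```

Injecting extra mock-up events into `log` before replaying is exactly the "what if" simulation mechanism described above.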
The document discusses the challenges data scientists face in operationalizing big data projects and making the results accessible for broader organizational use. It argues that within the next 18 months, big data will become integrated into standard reporting and analysis used by all employees, not just data scientists. However, current tools like Hadoop are too slow for interactive work. New technologies are needed that provide massively parallel processing and tightly integrate with Hadoop, but also allow for use of existing reporting tools. This will require analytical platforms with in-memory processing capabilities and low latency.
Reinventing DDC in the Age of Data Analytics - Memoori
Reinventing DDC in the Age of Data Analytics! Memoori Talks to Jim Lee, CEO Cimetrics, Anto Budiardjo, CEO Anka Labs & Alper Üzmezler, CTO Anka Labs. Can we rethink the DDC to become data-centric, able to perform analytics and capable of sending data to cloud systems?
The document discusses Cisco's vision for the Internet of Everything (IoE) and how it applies to the manufacturing sector. It addresses some of the challenges manufacturers face, such as disconnected systems and technology silos. It then presents Cisco's proposed architecture for industrial networks, including security frameworks, edge computing, and fog computing to enable distributed data processing at the network edge. This architecture is meant to help manufacturers overcome challenges and leverage IoT/IoE for operational improvements and business benefits.
Lessons learned building a big data analytics engine, from proprietary to open source - Álvaro Santamaria & Joel Brunger - J On The Beach
After spending four years building a proprietary all-in-one streaming analytics engine for financial services, it became clear that open-source was starting to pull ahead. Alvaro will talk about the challenges of creating an IT operations solution for financial services; what to build, what not to build, and how to use open source tools to get past the infrastructure and focus on the business problems that matter.
RightScale Roadtrip Boston: Accelerate to Cloud - RightScale
The Accelerate to Cloud keynote will help you understand the current state of cloud adoption, identify the business value for your organization, and provide you a framework to plot your course to cloud adoption.
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming - Paco Nathan
London Spark Meetup 2014-11-11 @Skimlinks
http://www.meetup.com/Spark-London/events/217362972/
To paraphrase the immortal crooner Don Ho: "Tiny Batches, in the wine, make me happy, make me feel fine." http://youtu.be/mlCiDEXuxxA
Apache Spark provides support for streaming use cases, such as real-time analytics on log files, by leveraging a model called discretized streams (D-Streams). These "micro-batch" computations operate on small time intervals, generally from 500 milliseconds up. One major innovation of Spark Streaming is that it leverages a unified engine. In other words, the same business logic can be used across multiple use cases: streaming, but also interactive, iterative, machine learning, etc.
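A minimal PySpark illustration of the micro-batch model just described, assuming a socket text source; the 1-second batch interval is the discretization step.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dstream-demo")
ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches

# Each micro-batch of lines becomes an RDD, processed by the same engine
# (and largely the same code) as a batch job.
lines = ssc.socketTextStream("localhost", 9999)
counts = (
    lines.flatMap(lambda l: l.split())
         .map(lambda w: (w, 1))
         .reduceByKey(lambda a, b: a + b)
)
counts.pprint()

ssc.start()
ssc.awaitTermination()
```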
This talk will compare case studies for production deployments of Spark Streaming, emerging design patterns for integration with popular complementary OSS frameworks, plus some of the more advanced features such as approximation algorithms, and take a look at what's ahead — including the new Python support for Spark Streaming that will be in the upcoming 1.2 release.
Also, let's chat a bit about the new Databricks + O'Reilly developer certification for Apache Spark…
Get the most out of Oracle Data Guard - OOW version - Ludovico Caldara
If you use the Oracle Data Guard feature just for data protection, you are using less than half of its potential. You already pay for it, so why not get the most out of it? In this session I will show how you can use Oracle Data Guard capabilities for common tasks such as database cloning, database migration, and reporting, with the help of other features included in Oracle Database Enterprise Edition.
Apache Spark 2.0: Faster, Easier, and Smarter - Databricks
In this webcast, Reynold Xin from Databricks will be speaking about Apache Spark's new 2.0 major release.
The major themes for Spark 2.0 are:
- Unified APIs: Emphasis on building up higher level APIs including the merging of DataFrame and Dataset APIs
- Structured Streaming: Simplify streaming by building continuous applications on top of DataFrames, allowing streaming, interactive, and batch queries to be unified (see the sketch after this list)
- Tungsten Phase 2: Speed up Apache Spark by 10X
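As a hedged illustration of the Structured Streaming theme, the same DataFrame operations can run as a continuous query; this sketch assumes a socket source and the Spark 2.0-era API.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("structured-streaming-demo").getOrCreate()

# A streaming DataFrame: the same API as batch DataFrames.
lines = (
    spark.readStream.format("socket")
    .option("host", "localhost").option("port", 9999)
    .load()
)
counts = (
    lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
         .groupBy("word").count()
)

# The query runs continuously, updating its result as data arrives.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```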
This document discusses advanced database techniques and monitoring. It begins with an introduction and safe harbor statement. It then discusses issues like speed, availability, reliability and scalability for databases. It provides examples of using MySQL, MySQL Cluster, PostgreSQL and columnar storage. It also discusses monitoring databases and optimizing database queries and models. Sources are provided at the end.
Real-Time Analytics with Confluent and MemSQL - SingleStore
This document discusses enabling real-time analytics for IoT applications. It describes how industries like auto, transportation, energy, warehousing and logistics, and healthcare need real-time analytics to handle streaming data from IoT sensors. It also discusses how Confluent's Kafka stream processing platform can be used to build applications that ingest IoT data at high speeds, transform the data, and power real-time analytics and user interfaces. MemSQL's in-memory database is presented as a fast and scalable storage option to support real-time analytics on the large volumes of IoT data.
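As an illustrative sketch (not Confluent's or MemSQL's code) of how the consumer side of such streaming IoT ingestion commonly looks, here Spark reads a Kafka topic of sensor readings; the broker address and topic name are placeholders, and the spark-sql-kafka package is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iot-ingest").getOrCreate()

readings = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "iot-sensors")                # placeholder topic
    .load()
    .selectExpr("CAST(value AS STRING) AS json")       # raw sensor payloads
)

# Downstream, the parsed stream would be written to a fast analytical store.
query = readings.writeStream.format("console").start()
query.awaitTermination()
```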
Lew Tucker discusses the rise of cloud computing and its impact. He defines various cloud service models like SaaS, PaaS, and IaaS. Tucker analogizes the shift to cloud computing from individual data centers generating their own power to today's electrical grid. Major drivers of cloud computing include the growth of web APIs and massive amounts of user-generated data. Tucker outlines how cloud computing changes what developers can access and how applications are designed and scaled.
Splunk App for Stream - Insights into Your Network Traffic - Georg Knon
The document discusses the Splunk App for Stream, which enables real-time insights into private, public and hybrid cloud infrastructures by capturing and analyzing critical events from wire data not found in logs or with other collection methods. It provides an overview of the app, what's new, important features, architecture and deployment, customer success examples, and FAQs.
Visual, Interactive, Predictive Analytics for Big Data - Arimo, Inc.
Adatao Demo at the First Apache Spark Summit, Nikko Hotel, San Francisco, December 2, 2013
A real-time, live demo of the Adatao big data analytics system for both business users and data scientists/engineers. We showed terabyte-scale data modeling in seconds on a 40-node cluster, through a beautiful, user-friendly web app as well as R/RStudio and Python interfaces.
The document outlines the experience and qualifications of an embedded systems engineer, including over 30 years of experience in areas such as microcontroller programming, real-time operating systems, communication protocols, modeling languages, and IDEs/compilers. It provides a detailed list of technical skills and experience with various microcontroller platforms, programming languages, frameworks, and tools for embedded software development.
The document is an agenda for an intro to Spark development class. It includes an overview of Databricks, the history and capabilities of Spark, and the agenda topics which will cover RDD fundamentals, transformations and actions, DataFrames, Spark UIs, and Spark Streaming. The class will include lectures, labs, and surveys to collect information on attendees' backgrounds and goals for the training.
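For readers new to the topics on that agenda, here is a minimal, generic example of the RDD fundamentals such a class covers: transformations are lazy, and actions trigger execution.

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-basics")

rdd = sc.parallelize(range(10))           # create an RDD
evens = rdd.filter(lambda x: x % 2 == 0)  # transformation: lazy
squares = evens.map(lambda x: x * x)      # transformation: lazy
print(squares.collect())                  # action: runs the job -> [0, 4, 16, 36, 64]
```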
Angular (v2 and up) - Morning to Understand - LINAGORA
Slides of the talk about Angular, given at the "Matinée Pour Comprendre" organized by Linagora on 22/03/17.
Discover what's new in Angular, why it is more than just a framework (a platform), and how to manage your data with RxJS and Redux.
This document discusses predictive maintenance of robots in the automotive industry using big data analytics. It describes Cisco's Zero Downtime solution which analyzes telemetry data from robots to detect potential failures, saving customers over $40 million by preventing unplanned downtimes. The presentation outlines Cisco's cloud platform and a case study of how robot and plant data is collected and analyzed using streaming and batch processing to predict failures and schedule maintenance. It proposes a next generation predictive platform using machine learning to more accurately detect issues before downtime occurs.
The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.
Data Lakehouse Symposium | Day 1 | Part 1 - Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2 - Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
The document discusses the challenges of modern data, analytics, and AI workloads. Most enterprises struggle with siloed data systems that make integration and productivity difficult. The future of data lies with a data lakehouse platform that can unify data engineering, analytics, data warehousing, and machine learning workloads on a single open platform. The Databricks Lakehouse platform aims to address these challenges with its open data lake approach and capabilities for data engineering, SQL analytics, governance, and machine learning.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop - Databricks
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized Platform - Databricks
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform's capabilities (a short code sketch follows the list), including:
- Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
- Performing data quality validations using libraries built to work with Spark
- Dynamically generating pipelines that can be abstracted away from users
- Flagging data that doesn't meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
- Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
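As a hedged sketch of what such Spark-based validations can look like (generic, not Zillow's platform code), a simple expectation, price present and non-negative, is checked and the failing rows counted:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-check").getOrCreate()
df = spark.createDataFrame(
    [(1, 250000), (2, None), (3, -10)], ["listing_id", "price"]
)

# Expectation: price is present and non-negative.
failed = df.filter(F.col("price").isNull() | (F.col("price") < 0))

# Expose a simple health metric alongside the dataset.
total, bad = df.count(), failed.count()
print(f"data quality: {total - bad}/{total} rows passed")
```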
Learn to Use Databricks for Data Science - Databricks
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data.. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Why APM Is Not the Same As ML Monitoring - Databricks
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data - Enabling ML Ops at Stitch Fix - Databricks
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI Integration - Databricks
In this talk, I will dive into the stage-level scheduling feature added to Apache Spark 3.1. Stage-level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, memory, or GPUs. One of the most popular use cases is enabling end-to-end scalable deep learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the deep learning algorithm needs for training or inference, and then send the data into a deep learning algorithm. Using stage-level scheduling combined with accelerator-aware scheduling enables users to seamlessly go from ETL to deep learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance when there is data skew or when the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API, and use cases. I will demo how the stage-level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the TensorFlow Keras API on GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
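A hedged sketch of the stage-level scheduling API as added in Spark 3.1: one RDD stage requests GPU-equipped executors while the rest of the job keeps the default profile (cluster support and dynamic allocation requirements apply).

```python
from pyspark.sql import SparkSession
from pyspark.resource import (
    ResourceProfileBuilder, TaskResourceRequests, ExecutorResourceRequests
)

spark = SparkSession.builder.appName("stage-level-scheduling").getOrCreate()

# ETL stage runs with the application's default resources.
etl = spark.range(1_000_000).selectExpr("id", "id * 2 AS feature")

# Training stage asks for GPU executors via a ResourceProfile.
ereqs = ExecutorResourceRequests().cores(4).memory("8g").resource("gpu", 1)
treqs = TaskResourceRequests().cpus(1).resource("gpu", 1)
profile = ResourceProfileBuilder().require(ereqs).require(treqs).build  # property, not a call

# Stand-in map; real training logic would run here on the GPU executors.
trained = etl.rdd.withResources(profile).map(lambda row: row)
trained.count()
```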
Simplify Data Conversion from Spark to TensorFlow and PyTorch - Databricks
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.
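The converter API referred to above looks roughly like this (a sketch based on the Petastorm Spark converter; the cache directory is a placeholder and TensorFlow is assumed to be installed):

```python
from pyspark.sql import SparkSession
from petastorm.spark import SparkDatasetConverter, make_spark_converter

spark = SparkSession.builder.appName("spark-to-tf").getOrCreate()
spark.conf.set(
    SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
    "file:///tmp/petastorm_cache",  # placeholder cache location
)

df = spark.range(100).selectExpr("float(id) AS feature", "float(id % 2) AS label")
converter = make_spark_converter(df)  # materializes the DataFrame behind the scenes

with converter.make_tf_dataset(batch_size=32) as dataset:
    # `dataset` is a tf.data.Dataset of named tuples, ready for model.fit(...)
    for batch in dataset.take(1):
        print(batch.feature.shape, batch.label.shape)
```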
Scaling your Data Pipelines with Apache Spark on Kubernetes - Databricks
There is no doubt Kubernetes has emerged as the next generation of cloud-native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both machine learning and large-scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes with scalable data processing in Apache Spark, you can run any data and machine learning pipeline on this infrastructure while effectively utilizing the resources at your disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered:
- Understanding key traits of Apache Spark on Kubernetes
- Things to know when running Apache Spark on Kubernetes, such as autoscaling
- Demonstrating analytics pipelines running on Apache Spark orchestrated with Apache Airflow on a Kubernetes cluster
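A hedged sketch of the kind of configuration involved when pointing Spark at a Kubernetes API server (the master URL, image, and namespace are placeholders; on GKE/Dataproc much of this is preconfigured):

```python
from pyspark.sql import SparkSession

# Placeholder master URL and image; in practice these come from your cluster.
spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.example:6443")
    .appName("spark-on-k8s")
    .config("spark.kubernetes.container.image", "example/spark-py:3.1.1")
    .config("spark.kubernetes.namespace", "data-pipelines")
    .config("spark.executor.instances", "4")
    .getOrCreate()
)

spark.range(10).show()
```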
Scaling and Unifying SciKit Learn and Apache Spark Pipelines - Databricks
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications; however, it is not directly supported by Ray's parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray's compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
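As a generic illustration (not the speaker's abstraction), pipeline stages can be expressed as Ray tasks whose futures chain together, letting independent branches run in parallel:

```python
import ray

ray.init()

@ray.remote
def fit_transform(data, scale):
    # Stand-in for a pipeline stage's fit/transform step.
    return [x * scale for x in data]

@ray.remote
def combine(left, right):
    return left + right

# Two branches execute in parallel; `combine` waits on both futures.
a = fit_transform.remote(list(range(5)), 2)
b = fit_transform.remote(list(range(5)), 3)
print(ray.get(combine.remote(a, b)))
```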
Sawtooth Windows for Feature Aggregations - Databricks
In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency while still maintaining high throughput, low read latency, and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift due to operations that are not abelian groups operating over change data.
We want to present multiple anti-patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark; a short sketch of the counter niche follows the outline below.
Niche 1: Long-Running Spark Batch Job - Dispatch New Jobs by Polling a Redis Queue
· Why?
o Custom queries on top of a table; we load the data once and query N times
· Why not Structured Streaming?
· Working solution using Redis
Niche 2: Distributed Counters
· Problems with Spark Accumulators
· Utilize Redis hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
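A minimal sketch of the distributed-counter niche using redis-py from within Spark tasks; `HINCRBY` is atomic on the Redis side, and, as the outline cautions, retried or speculative tasks can double-count unless writes are made idempotent.

```python
import redis
from pyspark import SparkContext

sc = SparkContext(appName="redis-counters")

def count_partition(rows):
    # One connection per partition; HINCRBY is atomic in Redis.
    r = redis.Redis(host="localhost", port=6379)  # placeholder host
    for row in rows:
        r.hincrby("event_counts", row["event_type"], 1)
    # Caution: with task retries/speculation this increments twice;
    # guard with an idempotency key per (stage, partition, attempt) in production.
    return []

data = sc.parallelize([{"event_type": "click"}, {"event_type": "view"}] * 10)
data.mapPartitions(count_partition).count()  # action forces the writes
```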
Re-imagine Data Monitoring with whylogs and Spark - Databricks
In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly, and cumbersome while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
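A tiny sketch of data profiling with whylogs, assuming the current whylogs v1 Python API on a pandas DataFrame (the Spark integration follows the same profile-centric idea):

```python
import pandas as pd
import whylogs as why

df = pd.DataFrame({"user_id": [1, 2, 3], "amount": [10.5, None, 42.0]})

# Profile the batch: lightweight statistics only, no raw data retained.
profile = why.log(df).profile()

# The profile view summarizes counts, types, null ratios, distributions, etc.
print(profile.view().to_pandas())
```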
Raven: End-to-end Optimization of ML Prediction Queries - Databricks
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that:
(i) reduce unnecessary computations by passing information between the data processing and ML operators,
(ii) leverage operator transformations (e.g., turning a decision tree into a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
Processing Large Datasets for ADAS Applications using Apache Spark - Databricks
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta Lake - Databricks
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile offering. At the heart of this is complex ingestion of a mix of normalized and denormalized data, with various linkage scenarios powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated across platforms and channels such as email and advertisements. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake, and share our experiences (a sketch of the core upsert follows the outline below).
• What are we storing?
• Multi-source, multi-channel problem
• Data representation and nested schema evolution
• Performance trade-offs with various formats
• Anti-patterns used (String FTW)
• Data manipulation using UDFs
• Writer worries and how to wipe them away
• Staging tables FTW (see the sketch below)
• Data lake replication lag tracking
• Performance time!
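As a hedged illustration of the staging-table pattern mentioned above (the paths and merge key are assumptions, not Adobe's actual pipeline), new batches can land in a staging Delta table and then be merged into the main table so writer conflicts stay contained:

    # Staging-table pattern with Delta Lake: land batches in staging,
    # then MERGE so a single writer owns the big profile table.
    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake is configured

    batch = spark.read.json("/mnt/ingest/batch-0042/")           # hypothetical path
    batch.write.format("delta").mode("append").save("/delta/staging_profiles")

    main = DeltaTable.forPath(spark, "/delta/profiles")
    staged = spark.read.format("delta").load("/delta/staging_profiles")

    (main.alias("m")
         .merge(staged.alias("s"), "m.profile_id = s.profile_id")  # hypothetical key
         .whenMatchedUpdateAll()
         .whenNotMatchedInsertAll()
         .execute())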
06-18-2024 Princeton Meetup: Introduction to Milvus (Timothy Spann)
tim.spann@zilliz.com
https://www.linkedin.com/in/timothyspann/
https://x.com/paasdev
https://github.com/tspannhw
https://github.com/milvus-io/milvus
Get Milvused!
https://milvus.io/
Read my Newsletter every week!
https://github.com/tspannhw/FLiPStackWeekly/blob/main/142-17June2024.md
For more unstructured data, AI, and vector database videos, check out the Milvus vector database channel here:
https://www.youtube.com/@MilvusVectorDatabase/videos
Unstructured Data Meetups -
https://www.meetup.com/unstructured-data-meetup-new-york/
https://lu.ma/calendar/manage/cal-VNT79trvj0jS8S7
https://www.meetup.com/pro/unstructureddata/
https://zilliz.com/community/unstructured-data-meetup
https://zilliz.com/event
Twitter/X: https://x.com/milvusio https://x.com/paasdev
LinkedIn: https://www.linkedin.com/company/zilliz/ https://www.linkedin.com/in/timothyspann/
GitHub: https://github.com/milvus-io/milvus https://github.com/tspannhw
Invitation to join Discord: https://discord.com/invite/FjCMmaJng6
Blogs: https://milvusio.medium.com/ https://www.opensourcevectordb.cloud/ https://medium.com/@tspann
Expand LLMs' knowledge by incorporating external data sources into your AI applications.
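A minimal sketch of that idea with Milvus Lite (the embed() function and the collection layout are assumptions): embed documents, store them, and retrieve the closest ones to ground a prompt:

    # Grounding an LLM with external data via Milvus (retrieval step).
    from pymilvus import MilvusClient

    client = MilvusClient("milvus_demo.db")  # Milvus Lite; use a server URI in production
    client.create_collection(collection_name="docs", dimension=384)

    docs = ["Milvus is an open-source vector database.", "It powers RAG pipelines."]
    client.insert(
        collection_name="docs",
        data=[{"id": i, "vector": embed(d), "text": d} for i, d in enumerate(docs)],
    )  # embed() is a hypothetical sentence-embedding function returning 384 floats

    hits = client.search(collection_name="docs", data=[embed("What is Milvus?")],
                         limit=2, output_fields=["text"])
    context = "\n".join(h["entity"]["text"] for h in hits[0])
    # feed `context` plus the user question to the LLM of your choice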
We are pleased to share with you the latest VCOSA statistical report on the cotton and yarn industry for the month of May 2024.
Starting from January 2024, the full weekly and monthly reports will only be available for free to VCOSA members. To access the complete weekly report with figures, charts, and detailed analysis of the cotton fiber market in the past week, interested parties are kindly requested to contact VCOSA to subscribe to the newsletter.
Discover the cutting-edge telemetry solution implemented for Alan Wake 2 by Remedy Entertainment in collaboration with AWS. This comprehensive presentation dives into our objectives, detailing how we utilized advanced analytics to drive gameplay improvements and player engagement.
Key highlights include:
Primary Goals: Implementing gameplay and technical telemetry to capture detailed player behavior and game performance data, fostering data-driven decision-making.
Tech Stack: Leveraging AWS services such as EKS for hosting, WAF for security, Karpenter for instance optimization, S3 for data storage, and OpenTelemetry Collector for data collection. EventBridge and Lambda were used for data compression, while Glue ETL and Athena facilitated data transformation and preparation.
Data Utilization: Transforming raw data into actionable insights with technologies like Glue ETL (PySpark scripts), Glue Crawler, and Athena, culminating in detailed visualizations with Tableau.
Achievements: Successfully managing 700 million to 1 billion events per month at a cost-effective rate, with significant savings compared to commercial solutions. This approach has enabled simplified scaling and substantial improvements in game design, reducing player churn through targeted adjustments.
Community Engagement: Enhanced ability to engage with player communities by leveraging precise data insights, despite having a small community management team.
This presentation is an invaluable resource for professionals in game development, data analytics, and cloud computing, offering insights into how telemetry and analytics can revolutionize player experience and game performance optimization.
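As a hedged illustration of the transformation stage (written in plain PySpark rather than the exact Glue script; the bucket names and event fields are assumptions), raw JSON telemetry can be compacted into partitioned Parquet for Athena:

    # Compact raw JSON telemetry into partitioned Parquet for querying.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("telemetry-etl").getOrCreate()

    events = spark.read.json("s3://aw2-telemetry/raw/2024/06/")   # hypothetical bucket
    daily = (events
             .withColumn("day", F.to_date("event_timestamp"))    # assumed field name
             .groupBy("day", "event_type")
             .agg(F.count("*").alias("events"),
                  F.approx_count_distinct("player_id").alias("players")))

    daily.write.mode("overwrite").partitionBy("day") \
         .parquet("s3://aw2-telemetry/curated/daily_event_stats/")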
Generative Classifiers: Classifying with Bayesian decision theory, Bayes’ rule, Naïve Bayes classifier.
Discriminative Classifiers: Logistic Regression, Decision Trees: Training and Visualizing a Decision Tree, Making Predictions, Estimating Class Probabilities, The CART Training Algorithm, Attribute selection measures- Gini impurity; Entropy, Regularization Hyperparameters, Regression Trees, Linear Support vector machines.
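A short worked example of the syllabus above, putting a generative classifier (Gaussian Naive Bayes) next to two discriminative ones (logistic regression and a CART decision tree with the Gini impurity measure) on the same toy dataset:

    # Generative vs. discriminative classifiers on the iris dataset.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for model in (GaussianNB(),
                  LogisticRegression(max_iter=1000),
                  DecisionTreeClassifier(criterion="gini", max_depth=3)):
        acc = model.fit(X_tr, y_tr).score(X_te, y_te)
        print(type(model).__name__, f"accuracy={acc:.3f}")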
Road to Enterprise Architecture for Big Data Applications: Mixing Apache Spark with Singletons, Wrapping, and Facade with Andrea Condorelli
1. Magneti Marelli, ICT Innovation
Road to Enterprise Architecture for Big Data Applications
Mixing Apache Spark with singletons, wrapping, facade
London, United Kingdom
#SAISEnt4
2. Company Overview
Magneti Marelli is an international company committed to the design and production of hi-tech systems and components for the automotive sector.
• AUTOMOTIVE LIGHTING (Headlamp, Rearlamp, Lighting and Body Electronics)
• ELECTRONICS (Instrument Clusters, Infotainment & Telematics)
• SUSPENSION SYSTEMS AND SHOCK ABSORBERS (Suspension Systems, Shock Absorbers and Dynamic Systems)
• PLASTIC COMPONENTS AND MODULES (Bumper, Dashboard, Central Console, Pedals, Hand Brake Levers and Fuel System)
• AFTERMARKET PARTS & SERVICES (Mechanical, Body Work, Electrics and Electronics, and Consumables)
• EXHAUST SYSTEMS (Manifolds, Catalytic Converter, Diesel Particulate Filter and Mufflers)
• POWERTRAIN (Gasoline and Diesel Engine Control, Electric Motor, Inverter and Transmission)
• MOTORSPORT (Injection Systems, Electronic Control Units, Hybrid Systems, Telemetry Systems, Electric Actuators)
3. Magneti Marelli Worldwide Footprint
[World map of Magneti Marelli sites: production plants, R&D centers and application centers across the USA, Mexico, Brazil, Argentina, Italy, France, Spain, Germany, Poland, Czech Rep., Slovakia, Romania, Serbia, Russia, Turkey, China, India, Japan, Korea and Malaysia.]
Legend: PP: Production Plant; R&D: R&D Center; AC: Application Center
4. Big Data storyline
[Timeline chart: cumulative rows processed grows from 0bn to roughly 30bn between Jan 2017 and Jul 2018, across a business exploration phase, a proof-of-concept phase and a production phase.]
• Jan 2017: Big Data group was created
• Feb 2017: first POC approved, data loaded from USB
• Aug 2017: Welding Machines POC
• Sep 2017: Telemetry POC
• Nov 2017: SMT POC
• 29 Jan 2018: Databricks was adopted
• Apr 2018: MARC 1.0 released
• Jun 2018: the SMT project
• Aug 2018: the Metalizers project
6. The Surface-Mount Technology (SMT) project
[Diagram: the SMT line, from PCB preparation to the assembly line, with stations Lasermarker, Serigraphy, Automated Optical Inspection (post printing), Pick and Place, Oven and Automated Optical Inspection (post reflow). The stations emit heterogeneous data: one file per item with timestamp and PCB ID; one file per day with timestamp, temperature and humidity; one file per item with timestamp, PCB ID, soldering paste, sensor data, temperature, …; a SQL database with NIP, pick-up info and feeder info; an MDB with NIP, images, sensor data and anomalies; a SQL database with NIP, images, sensor data and PCB final status.]
7. The Surface-Mount Technology (SMT) project
[Same SMT line diagram as slide 6, annotated with the machine learning problems.]
Machine Learning problems:
1. Machine status monitoring
2. Bottleneck harmonic model
3. Anomaly recommender engine
8. A Dream Project
1. Production process is well known
2. Data source is clearly defined
3. Need is raised by plant people
4. Algorithmic challenges are clear
9. Becoming a Nightmare
Success!!! And then:
• So… where is the data?
• How can I read/access the data?
• How can I be supported by my data scientist colleagues?
• How can I attach a Spark cluster to my Jupyter notebook?
• Who is going to port the notebook to production?
• What do you mean with production?
Suddenly all four "dream project" points (known process, defined data source, plant-driven need, clear algorithmic challenges) are in question.
10. Enterprise Architecture
[Diagram: a textbook "logical data warehouse" reference architecture. Data sources (structured data, SAP, MES, IoT feeds, operational systems, audio/video, images, IT logs, text documents, external sources) flow through batch/micro-batch/CDC and stream ingestion into a raw data lake, a curated/enriched/transformed data lake, a traditional enterprise data warehouse and data marts, governed by master data management and data quality, and topped by a data access layer, self-service data preparation, a machine learning layer, a data science environment and analytic capabilities: analyze, optimize, forecast, report, plan, discover, collaborate, predict, model.]
The Technology Bazaar: ARCHITECTURE != LIST OF TECHNOLOGIES AND FANCY ARROWS
11. Keep it simple, stupid
[Diagram: the MARC layers: INGESTION AND STORAGE, EXPLORATION, PRODUCTION, PRESENTATION.]
• Ingestion and storage: Hammer Gateway (HG Job 1…n) feeding the DATA LAKE (Azure Data Lake Store)
• Exploration: DATA EXPLORATION notebooks
• Production (architectural objects): MESSAGE QUEUE, FORWARDER, BUSINESS LOGICS (Dbutils+Spark), MCO (mco.read, mco.write, mco.log) backed by AZURE FUNCTIONS
• Presentation: Datamart and APP
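The talk does not publish MARC's internal schema, so the following is a purely hypothetical sketch of what one entry in the MESSAGE QUEUE could carry: just enough metadata for the FORWARDER to route the service and for the MCO to resolve its data.

    # Hypothetical MARC-style service descriptor (all field names are guesses):
    service = {
        "name": "clean-smt-data",        # the business-logic job to run
        "status": "TO-DO",               # lifecycle: TO-DO -> WIP -> DONE
        "target": "databricks",          # or "on-premise-spark" for restricted data
        "inputs": ["smt.raw"],           # logical names resolved by mco.read
        "outputs": ["smt.clean"],        # logical names resolved by mco.write
    }
    message_queue.put(service)           # hypothetical queue client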
12. Enterprise Architecture: data, where are you?
[Diagram: the Pick & Place data source feeding the architecture. Before MARC there were two routes to the notebooks: writing a sort of ETL tool from scratch (the production-ready tentative, 10% of the time) or USB loading (quick and dirty, 90% of the time). With MARC, Hammer Gateway jobs (HG Job 1…n) land the data in the DATA LAKE (Azure Data Lake Store).]
• Where are the CSV files?
• What do you mean with bcp out?
• How can I get a copy in the cloud?
• How can I update data on a regular basis?
13. Enterprise Architecture: the Jupyter case
• How could I work together with other data scientists?
• How can I deal with computation spikes?
• How can I attach an Apache Spark cluster to my Jupyter?
• Damn, Java Heap Memory Exception: what do you mean?
[Diagram: the same MARC view, highlighting the DATA EXPLORATION notebooks and the MCO (mco.read, mco.write, mco.log) backed by Azure Functions.]
14. One singleton to rule them all
mco
Pattern: Singleton
Use: bring tokens and technical access to the notebook.
Benefits: enhanced security; access control; reduced vendor lock-in.

mco.read
Pattern: Wrapping
Use: take data from the data lake knowing only logical data names.
Benefits: no one needs to know where or how data are stored; incremental-read capability out of the box; less time to port code to production; reduced reading time; reduced vendor lock-in (propagating a new HDFS PaaS vendor to all services is a matter of hours).

mco.log
Pattern: Wrapping
Use: bring developer-grade logging capability.
Benefits: reduced debug time; enables process audits.

mco.write
Pattern: Wrapping
Use: save data anywhere.
Benefits: no one needs to know where or how data must be written; avoids dangerous behaviours such as writing to SQL inside a transformation action (connection pool, my beloved friend…).
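A minimal sketch of these patterns, assuming nothing about MARC internals (the catalog, fetch_token and call_azure_function helpers are hypothetical): a singleton holding credentials and wrapping read/write/log so notebooks never hard-code storage locations.

    class MCO:
        """A minimal sketch of the MCO singleton (names guessed)."""
        _instance = None

        def __new__(cls):
            # Singleton: every notebook and service shares one configured instance.
            if cls._instance is None:
                cls._instance = super().__new__(cls)
                cls._instance._token = fetch_token()        # hypothetical credential broker
                cls._instance._catalog = {                  # logical name -> physical location
                    "smt.raw":   ("adl://marc/datalake/smt/raw",   "parquet"),
                    "smt.clean": ("adl://marc/datamart/smt/clean", "parquet"),
                }
            return cls._instance

        def read(self, name):
            # Wrapping: callers pass only a logical name; storage details live here,
            # so swapping the HDFS/PaaS vendor means editing this class, not every job.
            path, fmt = self._catalog[name]
            return spark.read.format(fmt).load(path)        # `spark` from the notebook

        def write(self, name, df):
            path, fmt = self._catalog[name]
            df.write.format(fmt).mode("append").save(path)  # after actions, never
                                                            # inside a transformation
        def log(self, message):
            call_azure_function("mco-log", message, self._token)  # hypothetical endpoint

    mco = MCO()
    clean = mco.read("smt.raw")   # no paths, no secrets in the notebook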
15. Enterprise Architecture: the model is ready!
• Is the code production ready?
• Who is going to port the notebook to production?
• The developer's algorithm is wrong: it produces different numbers…
• OK, I got it! I'll need a crontab… but where?
[Diagram: the MARC view, highlighting the production path: MESSAGE QUEUE, FORWARDER, BUSINESS LOGICS (Dbutils+Spark), MCO and APP.]
16. A For-what? What the hell?
[Diagram: the MESSAGE QUEUE (Clean SMT Data, Cycle Time Anomaly Det, Super Secret Service, …) feeding the FORWARDER, with per-service status flags (TO-DO, WIP, DONE), the BUSINESS LOGICS (Dbutils+Spark), the DATAMART and the APP.]
How a service runs:
1. The Clean SMT Data service is first in the MESSAGE QUEUE and is sent to the FORWARDER. Its status is set to WIP.
2. The service is forwarded to Databricks.
3. Databricks gets the data through the MCO.
4. Once the data is loaded, the Spark code starts the cleaning job.
5. The cleaned data is cached in the DATAMART through the MCO. The status is set to DONE.
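A hedged sketch of that control loop (all names illustrative, not MARC's actual code): take the next TO-DO service, mark it WIP, submit it to an application server, and mark it DONE when the job finishes.

    import time

    def forwarder_loop(queue, servers):
        while True:
            service = queue.next_todo()              # e.g. "Clean SMT Data"
            if service is None:
                time.sleep(5)
                continue
            queue.set_status(service, "WIP")
            run = servers[service.target].submit(    # e.g. a Databricks workspace;
                service.job, inputs=service.inputs)  # the job reads/writes via the MCO
            run.wait()
            queue.set_status(service, "DONE")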
17. A For-what? What the hell?
[Same diagram as slide 16, with the Clean SMT Data service now marked WIP.]
18. A For-what? What the hell?
[Same diagram; the Super Secret Service is marked as not runnable in the cloud.]
• What if some datasets could not be moved into the cloud?
• How to deal with super-secret business logic?
The FORWARDER is the main component for cloud hybridization! The data needed to run the Super Secret Service cannot be moved outside Magneti Marelli servers.
19. A For-what? What the hell? (ON PREMISE)
[Same diagram, extended with an on-premise enterprise data warehouse.]
1. The Super Secret Service is forwarded to an on-premise Apache Spark cluster.
2. The Apache Spark cluster gets the data through the MCO.
3. The outcome is persisted on an on-premise SQL Server.
4. A custom web app allows users to see the job output.
20. A For-what? What the hell? Predictive balancing
[Diagram: the MESSAGE QUEUE feeding two Databricks clusters, a Slow Services Cluster and a Fast Services Cluster, both running Dbutils+Spark.]
1. A first service (a RAM-intensive job: 300 GB of RAM, 90% of the cluster, 2 hours long) is submitted by the forwarder.
2. A second service (Clean PS Data: 100 GB of RAM, 5 minutes long) is submitted before the first finishes; the cluster is busy with the other computation.
3. The forwarder can submit the job to any application server, so it creates a new Databricks cluster and submits to it. Cluster creation is anticipated using predictive algorithms.
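An illustrative sketch of the predictive-balancing idea (the thresholds and the predict_profile() model are invented): long, RAM-hungry services go to the slow cluster, short ones to the fast cluster, and when the chosen cluster is busy the forwarder creates a fresh one instead of queuing behind a two-hour job.

    def pick_cluster(service, slow_cluster, fast_cluster, create_cluster):
        est = predict_profile(service)          # hypothetical: estimated RAM and runtime
        target = slow_cluster if est.ram_gb > 150 or est.minutes > 30 else fast_cluster
        if target.is_busy():
            # Prediction lets the cluster be created ahead of demand.
            target = create_cluster(ram_gb=est.ram_gb)  # e.g. via the Databricks API
        return target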
21. Enterprise Architecture: don't mind about nerd stuff
[Diagram: the MARC view, highlighting the PRESENTATION layer (DATAMART and APP) downstream of the notebooks and business logics.]
Data scientists' presentation concerns:
• How do I write a web page?
• Do I need to bootstrap?
• MV-what? I thought Spring was just a season!
• Single sign-on? What do you mean?
22. Enterprise Architecture ("…and in the darkness bind them")
[Diagram: the full MARC picture: no-complexity DATA INGESTION AND STORAGE (Pick & Place, Hammer Gateway with HG Job 1…n, Azure Data Lake Store), the DATA SCIENTIST TOYBOX (DATA EXPLORATION notebooks), PRODUCTION (MESSAGE QUEUE, FORWARDER, BUSINESS LOGICS with Dbutils+Spark, MCO with mco.read/mco.write/mco.log on Azure Functions) and PRESENTATION (Datamart, APP).]
1. One day to add a new source
2. The architecture "embeds" the guideline
3. The presentation layer is drag and drop
4. Service queuing ensures enterprise-grade, managed scalability
5. Data scientists do not waste time on boring activities
23. "I have done the deed. Did you hear a noise?"
"The Guide says there is an art to flying," said Ford, "or rather a knack. The knack lies in learning how to throw yourself at the ground and miss."
1. Production process is well known
2. Data source is clearly defined
3. Need is raised by plant people
4. Algorithmic challenges are clear
24. The Surface-Mount Technology (SMT) project
[Same SMT line diagram and machine learning problems as slide 7: machine status monitoring, bottleneck harmonic model, anomaly recommender engine.]
25. Anomaly Recommender Engine
Use case: support the maintenance team in prioritizing standard and extraordinary maintenance activities.
Description: a summary dashboard shows the health of each part of the line; a drill-down with details is available.
Benefit: reduced machine-stoppage losses per year per line; downtime reduction.
26. #SAISEnt4
Much ado about nothing… ?
26
Surface-Mount Technology PCB Preparation Assembly Line
Pre Production & Assembly Line
Break-even point reached after
8 months
Cost per line reduced by 90%
after the first one
Return On Investment: 12X in 3
years
• Databricks and Microsoft PowerBI allow a very cost
effective first project
• Hammer Gateway allows cost effective ingestion
• MCO enabled Data Scientists to convert notebooks
in services with a very very low effort
• Microsoft Azure and Databricks ensure endless
scalability