This document discusses vulnerabilities in voice over IP (VoIP) and unified communications systems. It begins by introducing the speaker and their background in VoIP security. It then outlines various attack vectors such as exploiting vulnerabilities in signaling protocols, message content, and unified messaging features to inject malicious content or execute code. The document emphasizes that securing UC involves more than just securing VoIP, and recommends approaches like secure infrastructure design, authentication, and client protection to help secure these systems.
Apache Kafka With Spark Structured Streaming With Emma Liu, Nitin Saksena, Ra... - HostedbyConfluent
This document discusses building real-time data processing and analytics with Databricks and Kafka. It describes how Databricks' lakehouse platform and Spark Structured Streaming can be used with Apache Kafka to ingest streaming data and perform real-time analytics. It also provides an example of how a large retailer, Albertsons, uses Databricks to distribute offers in real-time, power dashboards with streaming data, and enable hyper-personalization with real-time data models. The partnership between Databricks and Confluent is also discussed as a way to modernize data platforms and power new real-time applications and analytics.
Departed Communications: Learn the ways to smash them! - Fatih Ozavci
Unified Communications (UC) is widely used by larger organisations for video conferences, office collaboration, cloud services and mobile communications. These services also have key roles in the IP Multimedia Subsystem (IMS) implementations of next generation mobile networks. As a result of these, customers require unified collaboration; and the telecommunications industry offers managed communications services and infrastructure using UC and IMS technologies. These offerings also come with design issues, well-known security vulnerabilities and legacy services.
Security testing of communication networks, however, is underestimated and mostly under-scoped. Due to the lack of time and resources, the results of such security tests provide only an illusion of security. On the other hand, advanced VoIP and UC attacks can be much faster and more efficient when a proper methodology is used. Therefore, this talk aims to improve the testing skills of assurance teams for better penetration testing results. The theme of the talk is transferring VoIP and UC knowledge from a phreak to penetration testers. This will be performed through practical attack demonstrations, testing tips and automated actions.
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and Superset - Hortonworks
The document discusses building real-time dashboards on data streams. It describes using Apache Kafka to ingest streaming data from Wikipedia edits. The data is enriched using Kafka Streams and stored in Apache Druid for powering interactive visualizations in Superset. Key components are Kafka for the event flow, Kafka Streams for processing, Druid for the data store, and Superset for visualization.
The RADIUS accounting messages contain cleartext user identity and device identifiers that can be used to track users even if the authentication traffic is encrypted. Some ways to mitigate this are:
- Use EAP-TLS or EAP-PWD authentication to avoid sending IMSI or username in cleartext
- Enable IMSI privacy in EAP-SIM/EAP-AKA methods
- Tunnel or encrypt RADIUS accounting messages to prevent eavesdropping of user identities
- Use temporary identifiers instead of permanent ones in accounting messages
Proper configuration of authentication and accounting privacy features can help prevent user tracking via network monitoring of 802.1X and RADIUS traffic. However, complete anonymity is difficult to achieve in centralized network authentication systems.
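As a rough illustration of the last mitigation in the list above, the sketch below (plain Python, not tied to any RADIUS implementation) derives a per-session pseudonym from a permanent identifier before it is written into an accounting record; the secret key, record fields, and session ID are assumptions for the example.

```python
import hmac
import hashlib
import os

# Assumption: a site-local secret used only for deriving pseudonyms.
PSEUDONYM_KEY = os.environ.get("PSEUDONYM_KEY", "change-me").encode()

def pseudonymize(permanent_id: str, session_id: str) -> str:
    """Derive a per-session pseudonym so accounting records never carry
    the permanent identity (IMSI, username, MAC) in a linkable form."""
    msg = f"{permanent_id}:{session_id}".encode()
    return hmac.new(PSEUDONYM_KEY, msg, hashlib.sha256).hexdigest()[:16]

# Example: replace the User-Name field of a (purely illustrative) accounting
# record before it is logged or forwarded.
record = {"User-Name": "alice@example.org", "Acct-Session-Id": "4D9A1F22"}
record["User-Name"] = pseudonymize(record["User-Name"], record["Acct-Session-Id"])
print(record)
```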
VoIP is a technology that replaces conventional telephony. VoIP uses the same network technology we ordinarily use to access the internet. The general concept of VoIP is divided into five parts; the PBX in VoIP is called an IP-PBX or softswitch server. The explanation follows below.
Sources:
https://www.crowedteam.xyz/2019/03/cara-membuat-server-voip-dengan-asterisk.html
https://cypressnorth.com/technology/setting-up-a-small-office-or-home-office-voip-system-with-asterisk-pbx-part-2/
https://davidadinugroho.wordpress.com/tag/kekurangan-pbx/
Deploying Enterprise Scale Deep Learning in Actuarial Modeling at Nationwide - Databricks
The traditional approach to insurance pricing involves fitting a generalized linear model (GLM) to data collected on historical claims payments and premiums received. The explosive growth in data availability and increasing competitiveness in the marketplace are challenging actuaries to find new insights in their data and make predictions with more granularity, improved speed and efficiency, and with tighter integration among business units to support strategic decisions.
In this session we will share our experience implementing deep hierarchical neural networks using TensorFlow and PySpark on Databricks. We will discuss the benefits of the ML Runtime, our experience using the goofys mount, our process for hyperparameter tuning, specific considerations for the large dataset size and extreme volatility present in insurance data, among other topics.
Authors: Bryn Clark, Krish Rajaram
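As a loose illustration of the modeling idea, here is a minimal Keras sketch of a small feed-forward network with an exponential output activation and a Poisson loss, mirroring the log-link/Poisson structure of a frequency GLM; the synthetic data, layer sizes, and hyperparameters are placeholders, not the Nationwide model described in the session.

```python
import numpy as np
import tensorflow as tf

# Toy data: 5 rating factors per policy, Poisson-distributed claim counts.
X = np.random.rand(1000, 5).astype("float32")
y = np.random.poisson(lam=0.3, size=1000).astype("float32")

# Exponential output activation + Poisson loss ~ log-link Poisson regression,
# with hidden layers adding the non-linear interactions a GLM cannot capture.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(5,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="exponential"),
])
model.compile(optimizer="adam", loss="poisson")
model.fit(X, y, epochs=5, batch_size=64, verbose=0)
```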
This document discusses Microsoft Office 365, a cloud-based productivity suite. It provides an overview of cloud computing benefits and Office 365 features and subscription plans for small businesses and enterprises. Key capabilities of Office 365 plans include Exchange email, SharePoint collaboration, online meetings and Office Web Apps. The presentation compares Office 365 to on-premise installations and Google Apps and is sponsored by SNP Technologies, a technology consulting firm.
Building a Security Operations Center (SOC).pdf - TapOffice
Ben Rothke presented on building a Security Operations Center (SOC). He discussed the need for a SOC due to the large amounts of security data organizations face. A SOC provides continuous monitoring, protection, detection and response against threats. It is the nucleus of security operations. Rothke outlined the components and functions of an effective SOC, and discussed deciding whether to build an internal SOC or outsource to a managed security service provider. He provided questions to consider for each approach to help organizations determine the best option.
CI/CD Templates: Continuous Delivery of ML-Enabled Data Pipelines on Databricks - Databricks
Data & ML projects bring many new complexities beyond the traditional software development lifecycle. Unlike traditional software projects, they cannot be left alone once successfully delivered and deployed; they must be continuously monitored to confirm that model performance still satisfies all requirements. New data with new statistical characteristics can arrive at any time and break our pipelines or degrade model performance. These qualities of data & ML projects make continuous testing and monitoring of our models and pipelines a necessity.
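As a minimal sketch of what such continuous monitoring can look like (assuming a scikit-learn classifier, a pandas DataFrame of features, and made-up thresholds), a scheduled check might score the model on a freshly labelled batch and fail loudly when quality degrades:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical thresholds; in practice these come from the project's SLAs.
MIN_AUC = 0.75
MAX_NULL_FRACTION = 0.02

def check_model_health(model, features, labels):
    """Score the model on a freshly labelled batch (pandas DataFrame `features`)
    and return a list of problems if performance or input quality has degraded."""
    null_fraction = float(features.isna().mean().max())
    auc = roc_auc_score(labels, model.predict_proba(features.fillna(0))[:, 1])
    problems = []
    if null_fraction > MAX_NULL_FRACTION:
        problems.append(f"null fraction {null_fraction:.3f} exceeds limit")
    if auc < MIN_AUC:
        problems.append(f"AUC {auc:.3f} below minimum")
    return problems

# In a CI/CD pipeline this check would run on a schedule and page the team
# (or trigger a rollback) whenever it returns a non-empty list.
```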
Fiber optic network design is a specialized process that involves planning network requirements, choosing appropriate components, reviewing the design, specifying testing requirements, writing specifications for contractors, and estimating costs. It requires coordinating with various stakeholders and an in-depth knowledge of fiber optic systems, installation standards, and local regulations. The goal is to produce a design that supports the network and meets the client's communication needs.
The data lake has become extremely popular, but there is still confusion on how it should be used. In this presentation I will cover common big data architectures that use the data lake, the characteristics and benefits of a data lake, and how it works in conjunction with a relational data warehouse. Then I’ll go into details on using Azure Data Lake Store Gen2 as your data lake, and various typical use cases of the data lake. As a bonus I’ll talk about how to organize a data lake and discuss the various products that can be used in a modern data warehouse.
The document provides an overview of Splunk IT Service Intelligence (ITSI). Some key points:
- ITSI makes Splunk "service-aware" and provides insights into IT services to help accelerate customers' path to operational intelligence.
- ITSI provides search-based KPIs, full-fidelity service health monitoring, and leverages Splunk's universal data platform to provide a data-driven approach.
- Core concepts in ITSI include services, KPIs, health scores, service analyzers for monitoring services, glass tables dashboards, and deep dives for investigation.
- Notable events are also generated by correlation searches to indicate service degradation.
Over the past 10 years the Session Initiation Protocol (SIP) has moved from the toy of researchers and academics to the de-facto standard for telephony and multimedia services in mobile and fixed networks.
Probably one of the most emotionally fraught discussions in the context of SIP was whether Session Border Controllers (SBC) are good or evil.
SIP was designed with the vision of revolutionizing the way communication services are developed, deployed and operated. Following the end-to-end spirit of the Internet, SIP was supposed to tear down the walled gardens of PSTN networks and free communication services from the grip of large telecom operators. By moving intelligence to the end systems, developers were supposed to be able to build new communication services that would change the way we communicate with each other.
This was to be achieved without having to wait for the approval of the various telecommunication standardization groups such as ETSI or the support of incumbent telecoms.
Session border controllers are usually implemented as SIP Back-to-Back User Agents (B2BUA) that are placed between a SIP user agent and a SIP proxy. The SBC then acts as the contact point for both the user agents and the proxy. Thereby the SBC actually breaks the end-to-end behavior of SIP, which has led various people to deem the SBC as an evil incarnation of the old telecom way of thinking. Regardless of this opposition, SBCs have become a central part of any SIP deployment.
In this paper we will first give a brief overview of how SIP works and continue with a description of what SBCs do and the different use cases for deploying SBCs.
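As a rough illustration of the B2BUA behaviour described above, the toy sketch below rewrites the Via and Contact headers of a SIP INVITE so that both sides only ever see the SBC's (assumed) address; a real SBC does far more, including media relaying, NAT traversal, and policing.

```python
# Toy illustration of the topology-hiding part of a B2BUA: replace the
# caller's Via and Contact headers with the SBC's own address before
# forwarding, so caller and callee each only ever talk to the SBC.
SBC_ADDRESS = "sbc.example.net:5060"  # assumed SBC address

def rewrite_headers(sip_message: str) -> str:
    rewritten = []
    for line in sip_message.splitlines():
        if line.startswith("Via:"):
            rewritten.append(f"Via: SIP/2.0/UDP {SBC_ADDRESS};branch=z9hG4bK-sbc")
        elif line.startswith("Contact:"):
            rewritten.append(f"Contact: <sip:gw@{SBC_ADDRESS}>")
        else:
            rewritten.append(line)
    return "\r\n".join(rewritten)

invite = (
    "INVITE sip:bob@example.com SIP/2.0\r\n"
    "Via: SIP/2.0/UDP 10.0.0.5:5060;branch=z9hG4bK776\r\n"
    "Contact: <sip:alice@10.0.0.5:5060>\r\n"
    "From: <sip:alice@example.com>\r\n"
    "To: <sip:bob@example.com>\r\n"
)
print(rewrite_headers(invite))
```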
Multi-Cloud Strategy for Unrestricted Possibilities - Harsh V Sehgal
Multi-cloud strategies allow organizations to use multiple cloud providers to gain various benefits like choosing the best cloud for each workload, cost savings, and redundancy. A multi-cloud strategy involves selecting clouds deliberately based on needs rather than using them haphazardly. It requires addressing challenges like management complexity, security, and interoperability. Planning is key to avoiding issues and achieving business goals through a multi-cloud approach.
Machine Learning Data Lineage with MLflow and Delta Lake - Databricks
This document discusses machine learning data lineage using Delta Lake. It introduces Richard Zang and Denny Lee, then outlines the machine learning lifecycle and challenges of model management. It describes how MLflow Model Registry can track model versions, stages, and metadata. It also discusses how Delta Lake allows data to be processed continuously and incrementally in a data lake. Delta Lake uses a transaction log and file format to provide ACID transactions and allow optimistic concurrency control for conflicts.
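A minimal sketch of the Model Registry flow mentioned above, assuming an MLflow tracking server with a registry backend, a toy scikit-learn model, and a hypothetical model name:

```python
import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

# Log the model and register a new version under a (hypothetical) name.
with mlflow.start_run() as run:
    mlflow.sklearn.log_model(model, artifact_path="model")
    version = mlflow.register_model(f"runs:/{run.info.run_id}/model", "churn_classifier")

# Promote that version through the registry stages.
client = MlflowClient()
client.transition_model_version_stage(
    name="churn_classifier", version=version.version, stage="Staging"
)
```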
Auto Dialer vs Power Dialer: What's the Difference - Ashish Kumar
Are you confused about choosing suitable outbound dialing software, an Auto Dialer or a Power Dialer, for your call center? Let's understand the difference between the Auto Dialer and the Power Dialer. This comparison of the two dialers will help you choose the right one for your call center's size and requirements.
This document provides an overview of Microsoft Information Protection and its capabilities for knowing, protecting, preventing loss of, and governing data across an organization. It discusses using sensitivity labels and encryption to protect emails and files, data loss prevention tools, and retention policies for compliance. The document also outlines the current and planned solutions for information protection in the tenant and demos how to apply labels using Office apps, Windows Explorer, and Outlook.
This presentation provides an overview of the key components of a service support process: configuration management, problem management, release management, change management, and incident management. It describes the basic functions and benefits of each component. Configuration management involves managing IT components, configurations, and the configuration management database (CMDB). Problem management involves identifying, diagnosing, and resolving problems and errors. Change management involves controlling and tracking changes to minimize impacts. Release management coordinates testing and deployment of releases. Incident management handles detection, classification, and resolution of incidents. Together, these components work to improve IT service quality, user productivity, and support efficiency.
Matomo External Dashboards & Data Visualisation.pdf - Michael Weber
The data visualization possibilities in Matomo are limited. We are therefore looking at external data visualization solutions and whether and how they can be integrated with Matomo. Specifically for Google Data Studio, we look at the process and the possibilities in detail.
Introduction to DataOps and AIOps (or MLOps) - Adrien Blind
This presentation introduces the audience to the DataOps and AIOps practices. It covers organizational and technical aspects, and provides hints to start your data journey.
This document provides guidelines for deploying Microsoft Lync unified communications over an Aruba wireless network. It discusses Lync architecture and features, wireless network design considerations including access point placement and RF planning, quality of service configuration, and troubleshooting tools. The document focuses on using the Lync Software-Defined Networking API and Aruba's network visibility to provide end-to-end monitoring of real-time Lync calls and diagnose performance issues.
Bryan Starbuck from WhiteHat Engineering discusses cloud security and privacy standards. He notes that there are many cloud standards and startups can use a vendor to run Amazon EC2 instances with applied privacy and security standards. Bryan lists compliance standards that cloud applications and infrastructure may need to adhere to, such as SOC 2, Cobit, HIPAA, and NIST 800-53.
The Data Phoenix Events team invites everyone to the first webinar in "The A-Z of Data" series, on August 17 at 19:00, which will be devoted to MLOps. In this introductory webinar we will look at what MLOps is, its core principles and practices, the best tools, and possible architectures. We will start from a simple ML development lifecycle and finish with a complex, highly automated cycle that makes MLOps possible.
https://dataphoenix.info/the-a-z-of-data/
https://dataphoenix.info/the-a-z-of-data-introduction-to-mlops/
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2OUz6dt.
Chris Riccomini talks about the current state-of-the-art in data pipelines and data warehousing, and shares some of the solutions to current problems dealing with data streaming and warehousing. Filmed at qconsf.com.
Chris Riccomini works as a Software Engineer at WePay.
The document provides an overview of the Databricks platform, which offers a unified environment for data engineering, analytics, and AI. It describes how Databricks addresses the complexity of managing data across siloed systems by providing a single "data lakehouse" platform where all data and analytics workloads can be run. Key features highlighted include Delta Lake for ACID transactions on data lakes, auto loader for streaming data ingestion, notebooks for interactive coding, and governance tools to securely share and catalog data and models.
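A minimal sketch of the Auto Loader-to-Delta ingestion pattern mentioned above, assuming a Databricks notebook where `spark` is predefined; the paths and table name are placeholders:

```python
# Incrementally ingest newly arriving JSON files with Auto Loader ("cloudFiles")
# and append them to a Delta table; all paths below are placeholders.
raw = (spark.readStream
       .format("cloudFiles")
       .option("cloudFiles.format", "json")
       .option("cloudFiles.schemaLocation", "/mnt/schemas/events")
       .load("/mnt/raw/events"))

(raw.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/events")
    .trigger(once=True)          # or a processing-time trigger for continuous runs
    .toTable("bronze_events"))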
Python + MPP Database = Large Scale AI/ML Projects in Production Faster - Paige_Roberts
ODSC East virtual presentation - The best machine learning and advanced analytics projects are often stopped when it comes time to move into large-scale production, preventing them from ever impacting the business in a meaningful way. Hundreds of hours of work may never get put to use.
Python is rapidly becoming the language of choice for scientists and researchers of many types to build, test, train and score models. But when data science models need to go into production, challenges of performance and scale can be a huge roadblock.
By combining a Python application with an underlying massively parallel (MPP) database, Python users can achieve a simplified path to production. An MPP database also allows you to do data preparation and data analysis at far greater speeds, accelerating development and testing as well as production performance. It also allows greater numbers of concurrent jobs to run, while also continuously loading data for IoT or other streaming use cases.
Analyze data in the database where it sits, rather than first moving it to another framework, then analyzing it, then moving the results, taking multiple performance hits from both CPU and IO for every move and transformation.
In this talk, you will learn about combination architectures that can get your work into production, shorten development time, and provide the performance and scale advantages of an MPP database with the convenience and power of Python. Use case examples use the open source Vertica-Python project created by Uber with contributions from Twitter, Palantir, Etsy, Vertica, Kayak and Gooddata.
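As a small sketch of the push-down idea (aggregate in the MPP database, pull back only the result), here is a hedged example using the open source vertica-python client mentioned above; the connection details, table, and query are placeholders:

```python
import vertica_python

conn_info = {
    "host": "vertica.example.com",   # placeholder host
    "port": 5433,
    "user": "dbadmin",
    "password": "change-me",
    "database": "analytics",
}

# Run the aggregation inside the MPP database and fetch only the summary rows,
# instead of moving the raw table into Python first.
query = """
    SELECT customer_id, AVG(order_total) AS avg_order
    FROM orders
    GROUP BY customer_id
"""

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute(query)
    rows = cur.fetchall()

print(rows[:5])
```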
Leveraging Mainframe Data for Modern Analytics - confluent
The document provides an overview of leveraging mainframe data for modern analytics using Attunity Replicate and Confluent streaming platform powered by Apache Kafka. It discusses the history of mainframes and data migration, how Attunity enables real-time data migration from mainframes, the Confluent streaming platform for building applications using data streams, and how Attunity and Confluent can be combined to modernize analytics using mainframe data streams. Use cases discussed include query offloading and cross-system customer data integration.
Architecting an Open Source AI Platform 2018 edition - David Talby
How to build a scalable AI platform using open source software. The end-to-end architecture covers data integration, interactive queries & visualization, machine learning & deep learning, deploying models to production, and a full 24x7 operations toolset in a high-compliance environment.
The document outlines the roadmap for SQL Server, including enhancements to performance, security, availability, development tools, and big data capabilities. Key updates include improved intelligent query processing, confidential computing with secure enclaves, high availability options on Kubernetes, machine learning services, and tools in Azure Data Studio. The roadmap aims to make SQL Server the most secure, high performing, and intelligent data platform across on-premises, private cloud and public cloud environments.
This document describes Hopsworks, an end-to-end data platform for analytics and machine learning built by KTH and RISE SICS. It provides data ingestion, preparation, experimentation, model training, and deployment capabilities. The platform is built on Apache technologies like Apache Beam, Spark, Flink, and Kafka, and uses Kubernetes for orchestration. It also includes a feature store for ML features. The document then discusses Apache Flink and its use for stream processing applications. It provides examples of using Flink's APIs like SQL, CEP, and machine learning. Finally, it introduces the concept of continuous deep analytics and the Arcon framework for unified analytics across streams, tensors, graphs and more through an intermediate representation.
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D... - Databricks
Many have dubbed the 2020s the decade of data. This is indeed an era of data zeitgeist.
From code-centric software development 1.0, we are entering software development 2.0, a data-centric and data-driven approach, where data plays a central theme in our everyday lives.
As the volume and variety of data garnered from myriad data sources continue to grow at an astronomical scale and as cloud computing offers cheap computing and data storage resources at scale, the data platforms have to match in their abilities to process, analyze, and visualize at scale and speed and with ease — this involves data paradigm shifts in processing and storing and in providing programming frameworks to developers to access and work with these data platforms.
In this talk, we will survey some emerging technologies that address the challenges of data at scale, how these tools help data scientists and machine learning developers with their data tasks, why they scale, and how they facilitate the future data scientists to start quickly.
In particular, we will examine in detail two open-source tools MLflow (for machine learning life cycle development) and Delta Lake (for reliable storage for structured and unstructured data).
Other emerging tools such as Koalas help data scientists to do exploratory data analysis at scale in a language and framework they are familiar with as well as emerging data + AI trends in 2021.
You will understand the challenges of machine learning model development at scale, why you need reliable and scalable storage, and what other open source tools are at your disposal to do data science and machine learning at scale.
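For readers unfamiliar with MLflow's tracking API, here is a minimal, generic sketch of logging parameters and metrics for a run; the experiment name, parameters, and artifact file are hypothetical:

```python
import mlflow

mlflow.set_experiment("demo-experiment")   # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_param("max_depth", 6)
    mlflow.log_param("learning_rate", 0.1)
    mlflow.log_metric("rmse", 0.42)
    mlflow.log_artifact("feature_importance.png")  # assumes this file exists locally
```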
Streaming Data and Stream Processing with Apache Kafka - confluent
Apache Kafka is an open-source streaming platform that can be used to build real-time data pipelines and streaming applications. It addresses challenges with diverse data sets arriving at increasing rates. The document discusses how Apache Kafka can help with challenges around data integration, stream processing, and managing streaming platforms at scale. It also outlines key features of Apache Kafka like the Kafka Connect API for data integration, the Kafka Streams API for stream processing, and Confluent Control Center for monitoring and management.
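The Kafka Connect and Kafka Streams APIs themselves are configured in JSON and written in Java; as a language-consistent stand-in, here is a minimal sketch of producing and consuming events with Confluent's Python client, with the broker address and topic name as placeholders:

```python
from confluent_kafka import Producer, Consumer

# Produce a single event to a placeholder topic.
producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("clickstream", key="user-42", value='{"page": "/home"}')
producer.flush()

# Consume it back with a placeholder consumer group.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "demo-group",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["clickstream"])
msg = consumer.poll(timeout=5.0)
if msg is not None and msg.error() is None:
    print(msg.key(), msg.value())
consumer.close()
```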
Today, data lakes are widely used and have become extremely affordable as data volumes have grown. However, they are only meant for storage and by themselves provide no direct value. With up to 80% of data stored in the data lake today, how do you unlock the value of the data lake? The value lies in the compute engine that runs on top of a data lake.
Join us for this webinar where Ahana co-founder and Chief Product Officer Dipti Borkar will discuss how to unlock the value of your data lake with the emerging Open Data Lake analytics architecture.
Dipti will cover:
- Open Data Lake analytics - what it is and what use cases it supports
- Why companies are moving to an open data lake analytics approach
- Why the open source data lake query engine Presto is critical to this approach
This presentation shows new features in SQL 2019, and a recap of features from SQL 2000 through 2017 as well. You would be wise to hear someone from Microsoft deliver this material.
The document summarizes Microsoft's SQL Server 2005 Analysis Services (SSAS). It provides an overview of SSAS capabilities such as data mining algorithms, unified dimensional modeling, scalability features, and integrated manageability with SQL Server. It also describes demos of the OLAP and data mining capabilities and how SSAS can be deployed and managed for scalability, availability, and serviceability.
Cloud-Native Patterns for Data-Intensive Applications - VMware Tanzu
Are you interested in learning how to schedule batch jobs in container runtimes?
Maybe you’re wondering how to apply continuous delivery in practice for data-intensive applications? Perhaps you’re looking for an orchestration tool for data pipelines?
Questions like these are common, so rest assured that you’re not alone.
In this webinar, we’ll cover the recent feature improvements in Spring Cloud Data Flow. More specifically, we’ll discuss data processing use cases and how they simplify the overall orchestration experience in cloud runtimes like Cloud Foundry and Kubernetes.
Please join us and be part of the community discussion!
Presenters:
Sabby Anandan, Product Manager
Mark Pollack, Software Engineer, Pivotal
In this presentation, we show how Data Reply helped an Austrian fintech customer to overcome previous performance limitations in their data analytics landscape, leverage real-time pipelines, break down monoliths, and foster a self-service data culture to enable new event-driven and business-critical use cases.
GS08: Modernize Your Data Platform with SQL Technologies (Wash DC) - Bob Ward
The document discusses the challenges of modern data platforms including disparate systems, multiple tools, high costs, and siloed insights. It introduces the Microsoft Data Platform as a way to manage all data in a scalable and secure way, gain insights across data without movement, utilize existing skills and investments, and provide consistent experiences on-premises, in the cloud, and hybrid environments. Key elements of the Microsoft Data Platform include SQL Server, Azure SQL Database, Azure SQL Data Warehouse, Azure Data Lake, and Analytics Platform System.
Sparkflows provides a solution to reduce the cost and time required to develop big data analytics applications from months to hours. It offers a visual workflow editor that allows data analysts, data scientists, and data engineers to easily build analytics workflows by dragging and dropping nodes without extensive coding. Some key benefits include interactive execution, rich visualizations, pre-built workflows for common use cases, and the ability to deploy complex pipelines in minutes.
Streaming Data Ingest and Processing with Apache Kafka - Attunity
Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system. It offers high throughput, reliability and replication. To manage growing data volumes, many companies are leveraging Kafka for streaming data ingest and processing.
Join experts from Confluent, the creators of Apache™ Kafka, and the experts at Attunity, a leader in data integration software, for a live webinar where you will learn how to:
- Realize the value of streaming data ingest with Kafka
- Turn databases into live feeds for streaming ingest and processing
- Accelerate data delivery to enable real-time analytics
- Reduce skill and training requirements for data ingest
The recorded webinar on slide 32 includes a demo using automation software (Attunity Replicate) to stream live changes from a database into Kafka and also includes a Q&A with our experts.
For more information, please go to www.attunity.com/kafka.
Delta Lake is an open-source innovation that brings new capabilities for transactions, version control and indexing to your data lakes. We uncover Delta Lake's benefits and why it matters to you. Through this session, we showcase some of these benefits and how they can improve your modern data engineering pipelines. Delta Lake provides snapshot isolation, which helps concurrent read/write operations and enables efficient insert, update, delete, and rollback capabilities. It allows background file optimization through compaction and Z-order partitioning, achieving better performance. In this presentation, we will learn about Delta Lake's benefits, how it solves common data lake challenges, and, most importantly, the new Delta Time Travel capability.
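A minimal PySpark sketch of two of the capabilities mentioned above, an upsert (MERGE) and time travel, assuming delta-spark is installed and a Delta table already exists at the placeholder path:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes delta-spark is configured

# Upsert (MERGE) a batch of changes into an existing Delta table.
updates_df = spark.createDataFrame([(1, "clicked"), (7, "purchased")],
                                   ["event_id", "action"])
target = DeltaTable.forPath(spark, "/tmp/delta/events")   # placeholder path
(target.alias("t")
    .merge(updates_df.alias("u"), "t.event_id = u.event_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Time travel: read the table as it was at an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")
```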
The Open Data Lake Platform Brief - Data Sheets | Whitepaper - Vasu S
An open data lake platform provides a robust and future-proof data management paradigm to support a wide range of data processing needs, including data exploration, ad-hoc analytics, streaming analytics, and machine learning.
Similar to Lessons Learned from Modernizing USCIS Data Analytics Platform
The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.
Data Lakehouse Symposium | Day 1 | Part 1 - Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2 - Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
The document discusses the challenges of modern data, analytics, and AI workloads. Most enterprises struggle with siloed data systems that make integration and productivity difficult. The future of data lies with a data lakehouse platform that can unify data engineering, analytics, data warehousing, and machine learning workloads on a single open platform. The Databricks Lakehouse platform aims to address these challenges with its open data lake approach and capabilities for data engineering, SQL analytics, governance, and machine learning.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop - Databricks
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized Platform - Databricks
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
- Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
- Performing data quality validations using libraries built to work with Spark (see the sketch after this list)
- Dynamically generating pipelines that can be abstracted away from users
- Flagging data that doesn't meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
- Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
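The sketch below is not Zillow's platform; it is a minimal, hedged illustration of the kind of Spark-based validation the list refers to, with a made-up expectation format (maximum allowed null fraction per column):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
events_df = spark.createDataFrame(
    [("u1", "2021-01-01"), ("u2", None), (None, "2021-01-03")],
    ["user_id", "event_time"],
)

def run_quality_checks(df, expectations):
    """expectations maps column name -> maximum allowed fraction of nulls
    (an illustrative expectation format, not a real library's API)."""
    total = df.count()
    failures = []
    for column, max_null_fraction in expectations.items():
        nulls = df.filter(F.col(column).isNull()).count()
        if total and nulls / total > max_null_fraction:
            failures.append((column, round(nulls / total, 3)))
    return failures

failures = run_quality_checks(events_df, {"user_id": 0.0, "event_time": 0.05})
if failures:
    raise ValueError(f"Data quality checks failed: {failures}")
```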
Learn to Use Databricks for Data Science - Databricks
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks' open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Why APM Is Not the Same As ML Monitoring - Databricks
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix - Databricks
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI Integration - Databricks
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance there is data skew or perhaps the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the Tensorflow Keras API using GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
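A minimal sketch of the stage-level scheduling API described in the talk, assuming Spark 3.1+, dynamic allocation, and a cluster manager that supports it; the resource amounts and the map functions are placeholders:

```python
from pyspark import SparkContext
from pyspark.resource import (ExecutorResourceRequests, TaskResourceRequests,
                              ResourceProfileBuilder)

sc = SparkContext.getOrCreate()

# Request GPU-equipped executors for just the training/inference stage.
# Requires dynamic allocation and a cluster manager that supports
# stage-level scheduling; the amounts below are placeholders.
ereqs = ExecutorResourceRequests().cores(4).memory("16g").resource("gpu", 1)
treqs = TaskResourceRequests().cpus(1).resource("gpu", 1)
gpu_profile = ResourceProfileBuilder().require(ereqs).require(treqs).build

etl_rdd = sc.parallelize(range(1000)).map(lambda x: x * 2)   # cheap ETL stage
result = (etl_rdd.withResources(gpu_profile)                 # switch resources here
                 .map(lambda x: x + 1)                       # stand-in for GPU work
                 .collect())
```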
Simplify Data Conversion from Spark to TensorFlow and PyTorch - Databricks
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in Parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduce data scientists' productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify this tedious data conversion process. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.
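As a rough sketch of the converter workflow the talk describes (the cache directory, toy DataFrame, and model below are placeholders, not the presenter's actual example):

```python
import tensorflow as tf
from pyspark.sql import SparkSession
from petastorm.spark import SparkDatasetConverter, make_spark_converter

spark = SparkSession.builder.getOrCreate()

# Directory where the converter materializes its intermediate Parquet cache (placeholder).
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, "file:///tmp/petastorm_cache")

# Any preprocessed Spark DataFrame with numeric feature/label columns.
df = spark.createDataFrame([(float(i), float(i % 2)) for i in range(1000)], ["feature", "label"])

converter = make_spark_converter(df)
with converter.make_tf_dataset(batch_size=32) as dataset:
    # Each element is a named tuple of columns; map it to (features, label) pairs.
    dataset = dataset.map(lambda x: (tf.reshape(x.feature, [-1, 1]), x.label))
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
    model.compile(optimizer="adam", loss="mse")
    model.fit(dataset, steps_per_epoch=10, epochs=1)

converter.delete()  # clean up the cached files
```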
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both machine learning and large scale analytics workloads, and there is growing interest in running it natively on Kubernetes. By combining the flexibility of Kubernetes with scalable data processing in Apache Spark, you can run data and machine learning pipelines on this infrastructure while effectively utilizing the resources at your disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and how to orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered:
– Understanding key traits of Apache Spark on Kubernetes
– Things to know when running Apache Spark on Kubernetes, such as autoscaling
– A demonstration of analytics pipelines running on Apache Spark, orchestrated with Apache Airflow on a Kubernetes cluster
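The talk focuses on GKE and Dataproc specifics rather than code; as a loose sketch of what pointing a Spark session at a Kubernetes cluster looks like (cluster-mode jobs are normally launched with spark-submit, and the API server URL, image, and namespace below are placeholders):

```python
from pyspark.sql import SparkSession

# Client-mode session against a Kubernetes API server (placeholder values throughout).
spark = (
    SparkSession.builder
    .master("k8s://https://KUBE_API_SERVER:443")
    .appName("spark-on-k8s-sketch")
    .config("spark.kubernetes.container.image", "gcr.io/my-project/spark:3.1.1")
    .config("spark.kubernetes.namespace", "spark-jobs")
    .config("spark.executor.instances", "4")
    .config("spark.executor.memory", "4g")
    .config("spark.executor.cores", "2")
    .getOrCreate()
)

spark.range(1_000_000).selectExpr("sum(id)").show()
spark.stop()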
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications; however, it is not directly supported by Ray's parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray's compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
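This is not the abstraction Raghu describes, but a minimal sketch of the underlying idea of mapping fit/transform pipeline stages onto Ray's core task API (the stages are toy functions):

```python
import ray

ray.init()

@ray.remote
def fit_scaler(data):
    # "Fit" stage: compute simple statistics over the data.
    return sum(data) / len(data)

@ray.remote
def transform(data, mean):
    # "Transform" stage: Ray runs it once the fit stage's output is ready.
    return [x - mean for x in data]

@ray.remote
def score(transformed):
    return sum(abs(x) for x in transformed)

data = list(range(10))
mean_ref = fit_scaler.remote(data)
transformed_ref = transform.remote(data, mean_ref)   # depends on fit_scaler's output
score_ref = score.remote(transformed_ref)            # depends on transform's output

print(ray.get(score_ref))
```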
Sawtooth Windows for Feature AggregationsDatabricks
In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high throughput, low read latency and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift caused by operations over change data that are not "abelian groups".
We want to present multiple anti-patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark.
Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue
· Why?
o Custom queries on top of a table; we load the data once and query N times
· Why not Structured Streaming
· Working Solution using Redis
Niche 2 : Distributed Counters
· Problems with Spark Accumulators
· Utilize Redis Hashes as distributed counters (see the sketch after this list)
· Precautions for retries and speculative execution
· Pipelining to improve performance
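For Niche 2, a minimal sketch of using Redis hashes as distributed counters from Spark executors might look like the following (the Redis host, hash key, and DataFrame are placeholders, and production code would add the retry and speculative-execution guards noted above):

```python
import redis
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("US",), ("DE",), ("US",), ("IN",)], ["country"])

REDIS_HOST = "redis.internal"            # placeholder
COUNTER_KEY = "job:123:country_counts"   # placeholder hash key

def count_partition(rows):
    # One connection and one pipeline per partition to amortize round trips.
    client = redis.Redis(host=REDIS_HOST, port=6379)
    pipe = client.pipeline(transaction=False)
    for row in rows:
        pipe.hincrby(COUNTER_KEY, row["country"], 1)
    pipe.execute()

df.foreachPartition(count_partition)

# Driver-side read of the aggregated counters.
print(redis.Redis(host=REDIS_HOST, port=6379).hgetall(COUNTER_KEY))
```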
Re-imagine Data Monitoring with whylogs and SparkDatabricks
In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly and cumbersome, while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
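The talk covers the full Spark integration; as a heavily simplified sketch of the underlying idea (profiling data with whylogs and inspecting the resulting metrics), assuming the whylogs v1 Python API and profiling only a small driver-side sample rather than the distributed integration:

```python
import pandas as pd
import whylogs as why

# In the real integration, profiling runs distributed across Spark partitions;
# here we profile a small pandas sample just to show the logging API.
sample = pd.DataFrame({
    "amount": [10.5, 20.0, None, 13.2],
    "country": ["US", "DE", "US", None],
})

profile_view = why.log(sample).view()

# Per-column summary statistics (counts, null counts, distributions, etc.).
print(profile_view.to_pandas())
```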
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators
(ii) leverage operator transformations (e.g., turning a decision tree into a SQL expression or an equivalent neural network; see the sketch after this list) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
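To make rule (ii) concrete, here is a small sketch (not Raven's actual implementation) of turning a trained scikit-learn decision tree into an equivalent SQL CASE expression that a SQL engine could evaluate directly:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cols = ["sepal_len", "sepal_wid", "petal_len", "petal_wid"]
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

def tree_to_sql(model, feature_names):
    """Recursively emit a CASE WHEN expression equivalent to the tree's predictions."""
    tr = model.tree_

    def walk(node):
        if tr.children_left[node] == -1:  # leaf node
            return str(int(tr.value[node].argmax()))
        feat = feature_names[tr.feature[node]]
        thr = tr.threshold[node]
        left = walk(tr.children_left[node])
        right = walk(tr.children_right[node])
        return f"CASE WHEN {feat} <= {thr:.4f} THEN {left} ELSE {right} END"

    return walk(0)

print(tree_to_sql(tree, cols))
# e.g. CASE WHEN petal_wid <= 0.8000 THEN 0 ELSE CASE WHEN ... END END
```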
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta LakeDatabricks
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile offering. At the heart of this is a set of complex ingestion flows over a mix of normalized and denormalized data, with various linkage scenarios powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated across multiple platforms and channels such as email and advertisements. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake, and share our experiences.
What are we storing?
Multi Source – Multi Channel Problem
Data Representation and Nested Schema Evolution (see the sketch after this list)
Performance Trade Offs with Various formats
Go over anti-patterns used (String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe them Away
Staging Tables FTW
Datalake Replication Lag Tracking
Performance Time!
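For the schema evolution topic referenced in the list above, a minimal sketch of how Delta Lake lets you add columns without rewriting the table might look like this (the table path and DataFrames are placeholders, assuming a Delta-enabled Spark session):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()   # assumes Delta Lake is on the classpath
path = "s3://bucket/profiles_delta"          # placeholder table path

# Initial write with the original schema.
spark.createDataFrame([(1, "a@example.com")], ["id", "email"]) \
     .write.format("delta").mode("overwrite").save(path)

# A later batch arrives with an extra column; mergeSchema appends the new column
# to the table schema instead of failing or rewriting existing files.
spark.createDataFrame([(2, "b@example.com", "mobile")], ["id", "email", "channel"]) \
     .write.format("delta").mode("append").option("mergeSchema", "true").save(path)

spark.read.format("delta").load(path).printSchema()
```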
Generative Classifiers: Classifying with Bayesian decision theory, Bayes' rule, the Naïve Bayes classifier.
Discriminative Classifiers: Logistic Regression; Decision Trees: training and visualizing a decision tree, making predictions, estimating class probabilities, the CART training algorithm, attribute selection measures (Gini impurity, entropy), regularization hyperparameters, regression trees; Linear Support Vector Machines.
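A brief illustration of the generative and discriminative classifiers listed above, using scikit-learn's naive Bayes and decision tree implementations on a toy dataset (the dataset and hyperparameters are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB          # generative: models p(x | y) and applies Bayes' rule
from sklearn.tree import DecisionTreeClassifier     # discriminative: learns decision boundaries (CART)

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

nb = GaussianNB().fit(X_tr, y_tr)
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(X_tr, y_tr)

print("Naive Bayes accuracy:", nb.score(X_te, y_te))
print("Decision tree accuracy:", tree.score(X_te, y_te))
print("Class probabilities for one sample:", tree.predict_proba(X_te[:1]))
```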
Build applications with generative AI on Google CloudMárton Kodok
We will explore Vertex AI and its Model Garden-powered experiences, and learn more about the integration of these generative AI APIs. We will see in action what the Gemini family of generative models offers developers for building and deploying AI-driven applications. Vertex AI includes a suite of foundation models, referred to as the PaLM and Gemini families of generative AI models, which come in different versions. We will cover how to use the API to:
- execute prompts in text and chat
- cover multimodal use cases with image prompts
- fine-tune and distill models to improve domain knowledge
- run function calls with foundation models to optimize them for specific tasks
At the end of the session, developers will understand how to innovate with generative AI and develop apps that follow current generative AI industry trends.
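The session is demo-driven; as a rough sketch of calling a Gemini model through the Vertex AI Python SDK (the project, region, and model name are placeholders, and the exact module path may differ across SDK versions):

```python
import vertexai
from vertexai.generative_models import GenerativeModel

# Placeholder project and region; requires Google Cloud credentials to be configured.
vertexai.init(project="my-project", location="us-central1")

model = GenerativeModel("gemini-1.5-flash")  # placeholder model name

# Simple text prompt.
response = model.generate_content("Summarize the benefits of streaming data pipelines in two sentences.")
print(response.text)

# Chat-style usage.
chat = model.start_chat()
print(chat.send_message("Now rephrase that for a non-technical audience.").text)
```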
Lessons Learned from Modernizing USCIS Data Analytics Platform
1. A Journey To Modernization
Shawn Benjamin & Prabha Rajendran
2. Problems Faced
q Informatica ETL pipeline was brittle
q Lengthy Informatica ETL development cycle
q Lengthy load-time workflows for ingestion of data
q Lack of ability for real-time / near-real-time data
q Lack of a data science platform
4. [Architecture diagram: January 2016 snapshot of the legacy environment, mapping dozens of source systems (C3, eCIMS, ELIS, NFTS, CPMS, Pay.gov, RAPS, MFAS, VSS and others) and service-center C3 LANs through active and decommissioned ODSes, eCISCOR, SAS libraries, direct connects and data marts (Benefits, Scheduler, Payment, Validation) to SMART subject areas, BI tools and users, with counts of data sources, SMART subject areas, ETL processes and users.]
5. Implemented
– Deployed a Databricks private cloud VPC (26 nodes) in AWS
– Connected the Databricks cluster to the Oracle database
– Created all relevant DB tables in Hive metadata pointing to the Oracle database
– Copied relevant tables from the Oracle database to S3 using Scala code; data is stored in the Apache Parquet columnar format. For context, a 120-million-row, 83-column table can be dumped to S3 in just 10 minutes.
– Identified an appropriate partition scheme; large tables were partitioned to optimize Spark query performance
– Created multiple notebooks to perform data analysis and visualize the results, e.g. a histogram of case life-cycle duration
Successful Proof of Concept
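The PoC copied tables with Scala; an equivalent PySpark sketch of the JDBC-to-Parquet step might look like the following (connection details, table name, and partition columns are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oracle-to-s3-sketch").getOrCreate()

# Read a table from Oracle over JDBC; numPartitions splits the read across executors.
cases = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//oracle-host:1521/SERVICE")  # placeholder
    .option("dbtable", "ECISCOR.CASE_HISTORY")                      # placeholder table
    .option("user", "etl_user")
    .option("password", "***")
    .option("fetchsize", "10000")
    .option("numPartitions", "16")
    .option("partitionColumn", "CASE_ID")
    .option("lowerBound", "1")
    .option("upperBound", "120000000")
    .load()
)

# Write to S3 as Parquet, partitioned to optimize downstream Spark queries.
cases.write.mode("overwrite").partitionBy("RECEIPT_YEAR").parquet("s3://uscis-poc-bucket/case_history/")
```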
7.
• 75 Data Sources
• 35 Application Interfaces
• 7 Data Marts
• 4 BI Tools
• 6,086 Tableau Dashboards
• 118 SMART Subject Areas
• 56 SAS Libraries
• 6,233 Users
8. Databricks Accomplishments
• Implementation of Delta Lake
• Easy integration with OBIEE, SAS and Tableau with native connectors
• Integration with GitHub for continuous integration & deployment
• Automating account provisioning
• Machine learning (MLflow) integration
9. Change Data Capture using Delta Lake
Databricks Delta – Success Factors
v Faster ingestion of CDC changes
v Resiliency
v Improved data quality, reporting availability and runtime performance
v Schema evolution – adding additional columns without rewriting the entire table
Databricks Delta – Lessons Learned
v Storage requirements increased
v Vacuum and optimization are mandatory to improve performance
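The slides do not include code; a minimal sketch of the MERGE-based CDC upsert pattern that Delta Lake enables (the table path, key column, and change DataFrame are placeholders) could look like:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()   # assumes Delta Lake is available

target_path = "s3://bucket/cases_delta"      # placeholder path to an existing Delta table
changes = spark.createDataFrame(
    [(101, "APPROVED", "U"), (102, "PENDING", "I"), (55, None, "D")],
    ["case_id", "status", "op"],             # op: I=insert, U=update, D=delete
)

target = DeltaTable.forPath(spark, target_path)

(target.alias("t")
    .merge(changes.alias("s"), "t.case_id = s.case_id")
    .whenMatchedDelete(condition="s.op = 'D'")
    .whenMatchedUpdateAll(condition="s.op = 'U'")
    .whenNotMatchedInsertAll(condition="s.op = 'I'")
    .execute())
```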
12. Data Science Experiments using ML
[Screenshots: ML graphs produced after running the models; prediction model samples.]
13. Text and Log Mining
[Chart: sentiment analysis results, scored from 0 to 1 for negative vs. positive sentiment.]
14. Time Series Models and H2O Integration
Integrated H2O with Databricks and built a model predicting the count of "no shows" on N400, using traditional time series forecasting to predict inefficiencies in normal day-to-day planning and operations.
15. Enabling Security & Governance
Access Control (ACL) · Credentials Passthrough · Secrets Management
v Control users' access to data using the Databricks view-based access control model (table- and schema-level ACLs)
v Control users' access to clusters that are not enabled for table access control
v Enforced data object privileges at the onboarding phase
v Used the Databricks secrets manager to store credentials and reference them in notebooks
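For the secrets management point, a minimal notebook-level sketch of reading credentials from a Databricks secret scope instead of hard-coding them (the scope and key names are placeholders; dbutils and spark are provided by the Databricks runtime):

```python
# Inside a Databricks notebook, dbutils is available from the runtime.
jdbc_user = dbutils.secrets.get(scope="oracle-prod", key="username")      # placeholder scope/key
jdbc_password = dbutils.secrets.get(scope="oracle-prod", key="password")

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//oracle-host:1521/SERVICE")  # placeholder
    .option("dbtable", "ECISCOR.CASE_HISTORY")                      # placeholder table
    .option("user", jdbc_user)
    .option("password", jdbc_password)
    .load()
)
```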
16. Databricks Management API Usage
Cluster/Jobs management → Create, delete and manage clusters, and get the execution status of daily scheduled jobs, which helped automate monitoring.
Library/Secret management → Easy upload of any third-party libraries, and management of encrypted scopes/credentials to connect to source and target endpoints.
Integration and Deployments → API integration with Git and Jenkins for continuous integration and continuous deployment.
Enabled the MLflow Tracking API for our Data Science experiments.
Integrated the Databricks Management API with Jenkins and other scripting tools to automate all our administration and management tasks.
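As a rough sketch of the kind of automation described above, calling the Databricks REST API from a script (the workspace URL and token are placeholders; the endpoints shown are the standard clusters and jobs APIs):

```python
import requests

HOST = "https://<workspace>.cloud.databricks.com"   # placeholder workspace URL
TOKEN = "dapiXXXXXXXX"                              # placeholder personal access token
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# List clusters and their states, e.g. for automated monitoring.
clusters = requests.get(f"{HOST}/api/2.0/clusters/list", headers=HEADERS).json()
for c in clusters.get("clusters", []):
    print(c["cluster_name"], c["state"])

# Check the status of recent job runs for daily scheduled jobs.
runs = requests.get(f"{HOST}/api/2.0/jobs/runs/list", headers=HEADERS, params={"limit": 25}).json()
for r in runs.get("runs", []):
    print(r["run_id"], r["state"].get("result_state", r["state"]["life_cycle_state"]))
```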
17. Lessons learned through this Journey
– Training plan
– Cloud-based experience
– Subject matter expertise
– Automation
18. Success Strategy
Success Criteria → Benefit
Performance
ü Auto-scalability leveraging on-demand and spot instances
ü Efficient processing of larger datasets, comparable to RDBMS systems
ü Scalable read/write performance on S3
Support for a variety of statistical programming languages
ü Data science platform (R, Python, Scala and SQL)
ü Supports MLlib: machine learning & deep learning
Integration with existing tools
ü Allows connections to industry-standard technologies via ODBC/JDBC connections and built-in connectors
Easily integrate new data sources
ü Supports seamless integration with data streaming technologies like Kafka/Kinesis using Spark Streaming; this supports both structured and unstructured data
ü Leverages S3 extensively
Secure
ü Supports integration with multiple single sign-on platforms
ü Supports native encryption/decryption features (AES-256 and KMS)
ü Supports access control lists (ACLs)
ü Implemented in the USCIS private cloud