This document provides an agenda and overview for a tutorial on building shared AI services. The session will cover AI engineering platforms, data pipelines, traditional AI roles and their challenges, skills required for AI engineers, and benchmarking machine learning and deep learning approaches. It includes a live demo of building an end-to-end AI pipeline with Kafka, NiFi, Spark Streaming and Keras on Spark.
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a... | Kevin Mao
Strata Hadoop World 2017 San Jose
Today’s enterprise architectures are often composed of a myriad of heterogeneous devices. Bring-your-own-device policies, vendor diversification, and the transition to the cloud all contribute to a sprawling infrastructure, the complexity and scale of which can only be addressed by using modern distributed data processing systems.
Kevin Mao outlines the system that Capital One has built to collect, clean, and analyze the security-related events occurring within its digital infrastructure. Raw data from each component is collected and preprocessed using Apache NiFi flows. This raw data is then written into an Apache Kafka cluster, which serves as the primary communications backbone of the platform. The raw data is parsed, cleaned, and enriched in real time via Apache Metron and Apache Storm and ingested into ElasticSearch, allowing operations teams to detect and monitor events as they occur. The refined data is also transformed into the Apache ORC data format and stored in Amazon S3, allowing data scientists to perform long-term, batch-based analysis.
Kevin discusses the challenges involved with architecting and implementing this system, such as data quality, performance tuning, and the impact of additional financial regulations relating to data governance, and shares the results of these efforts and the value that the data platform brings to Capital One.
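The parse/clean/enrich stage described above can be sketched without any of the actual stack. A minimal, purely illustrative Python version (the field names and the enrichment lookup table are invented for illustration, not Capital One's schema):

```python
import json
from datetime import datetime, timezone

# Hypothetical lookup standing in for an enrichment source
# (e.g. a GeoIP or asset database); names are illustrative only.
ASSET_OWNERS = {"10.0.0.5": "payments-team"}

def enrich_event(raw: str) -> dict:
    """Parse a raw JSON security event, normalize fields, and enrich it."""
    event = json.loads(raw)
    # Clean: normalize field names and drop empty values
    event = {k.lower(): v for k, v in event.items() if v not in ("", None)}
    # Enrich: attach an ingest timestamp and asset ownership, if known
    event["ingest_ts"] = datetime.now(timezone.utc).isoformat()
    event["asset_owner"] = ASSET_OWNERS.get(event.get("src_ip"), "unknown")
    return event

enriched = enrich_event('{"Src_IP": "10.0.0.5", "Action": "deny", "Note": ""}')
```

In the real pipeline this logic would run inside Storm/Metron between the Kafka consume and the Elasticsearch index call; the sketch only shows the shape of the transformation.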
Solution Brief: Real-Time Pipeline Accelerator | BlueData, Inc.
Get started with Spark Streaming, Kafka, and Cassandra for real-time data analytics.
BlueData makes it easy to deploy Spark infrastructure and applications on-premises. The BlueData EPIC software platform is purpose-built to simplify and accelerate the deployment of Spark, Hadoop, and other tools for Big Data analytics, leveraging Docker containers and virtualized infrastructure.
Our new Real-Time Pipeline Accelerator solution provides the software and professional services you need for building data pipelines in a multi-tenant environment for Spark Streaming, Kafka, and Cassandra. With help from the BlueData team, you’ll also have two end-to-end real-time data pipelines as a starting point.
Learn more about BlueData at www.bluedata.com
Many enterprises are implementing Hadoop projects to manage and process large datasets. The big question is: how do you configure Hadoop clusters to connect to an enterprise directory containing 100k+ users and groups for access management? Several large enterprises have complex directory servers for managing users and groups, and many advanced features have recently been added to Hadoop user management to support various complex directory server structures.
In this session attendees will learn about: setting up a Hadoop node with users from Active Directory for executing Hadoop jobs, setting up authentication for enterprise users, and setting up authorization for users and groups using Apache Ranger. Attendees will also learn about common challenges faced in enterprise environments when interacting with Active Directory, including filtering out which users to bring into Hadoop from Active Directory, restricting access to a set of users from Active Directory, handling users from nested group structures, etc.
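Handling users from nested group structures amounts to flattening group membership transitively. A minimal, purely illustrative Python sketch (real deployments would query LDAP; the in-memory membership table here is invented):

```python
# Illustrative only: a tiny in-memory stand-in for nested AD group
# membership; entries that are themselves keys are sub-groups.
GROUP_MEMBERS = {
    "hadoop-admins": ["etl-team", "alice"],
    "etl-team": ["bob", "data-eng"],
    "data-eng": ["carol"],
}

def resolve_users(group, seen=None):
    """Flatten nested group membership into the set of end users."""
    seen = set() if seen is None else seen
    if group in seen:          # guard against membership cycles
        return set()
    seen.add(group)
    users = set()
    for member in GROUP_MEMBERS.get(group, []):
        if member in GROUP_MEMBERS:    # member is itself a group
            users |= resolve_users(member, seen)
        else:
            users.add(member)
    return users
```

The cycle guard matters in practice: directory servers commonly allow groups to reference each other, and a naive traversal would recurse forever.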
Speakers
Sailaja Polavarapu, Staff Software Engineer, Hortonworks
Velmurugan Periasamy, Director - Engineering, Hortonworks
Deep Dive into the New Features of Apache Spark 3.1 | Databricks
Continuing with the objectives to make Spark faster, easier, and smarter, Apache Spark 3.1 extends its scope with more than 1500 resolved JIRAs. We will talk about the exciting new developments in Apache Spark 3.1 as well as some other major initiatives that are coming in the future. In this talk, we want to share with the community many of the more important changes, with examples and demos.
The following features are covered: SQL features for ANSI SQL compliance, new streaming features, Python usability improvements, and performance enhancements and new tuning tricks in the query compiler.
Accelerating SparkML Workloads on the Intel Xeon+FPGA Platform with Srivatsan... | Databricks
FPGAs have recently gained attention throughout the industry because of their performance-per-watt efficiency, re-programmable flexibility, and wide range of applications. In response to this trend, Intel has been planning a new product line that offers a Xeon processor with an integrated FPGA, enabling datacenters to easily deploy high-performance accelerators at a relatively low cost of ownership. The new Xeon+FPGA platform is supported by a software ecosystem that eliminates difficulties traditional FPGA devices had, such as datacenter-wide accelerator deployment.
In this session, Intel will present their design and implementation of FPGA as a supplement to vcores in Spark YARN mode to accelerate SparkML applications on the Intel Xeon+FPGA platform. In particular, they have added new options to Spark core that provide an interface for the user to describe the accelerator dependencies of the application. The FPGA info in the Spark context is used by the new APIs and the DRF policy implemented on YARN to schedule Spark executors to hosts with Xeon+FPGA installed. Experimental results using ALS scoring applications that accelerate GEneral Matrix to Matrix Multiplication (GEMM) operations demonstrate that Xeon+FPGA improves FLOPS throughput by 1.5× compared to a CPU-only cluster.
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks | Databricks
The cloud has become one of the most attractive ways for enterprises to purchase software, but it requires building products in a very different way from traditional software.
[This is work presented at SIGMOD'13.]
The use of large-scale data mining and machine learning has proliferated through the adoption of technologies such as Hadoop, with its simple programming semantics and rich and active ecosystem. This paper presents LinkedIn's Hadoop-based analytics stack, which allows data scientists and machine learning researchers to extract insights and build product features from massive amounts of data. In particular, we present our solutions to the "last mile" issues in providing a rich developer ecosystem. This includes easy ingress from and egress to online systems, and managing workflows as production processes. A key characteristic of our solution is that these distributed system concerns are completely abstracted away from researchers. For example, deploying data back into the online system is simply a 1-line Pig command that a data scientist can add to the end of their script. We also present case studies on how this ecosystem is used to solve problems ranging from recommendations to news feed updates to email digesting to descriptive analytical dashboards for our members.
SQL Server Reporting Services Disaster Recovery Webinar | Denny Lee
This is the PASS DW/BI Webinar for SQL Server Reporting Services (SSRS) Disaster Recovery webinar. You can find the video at: http://www.youtube.com/watch?v=gfT9ETyLRlA
Overall 12+ years of experience as an Oracle Database Administrator. My experience spans client support, analysis, design, development, testing, installation, migration, maintenance, and administration of Oracle databases, including requirements gathering, upgrades, performance tuning, backup & recovery, cloning, and space management using Oracle 11g, 10g, and 9i, performing the role of Oracle Database Administrator according to set standards and timelines.
Title: Scalable R
Event description:
During this short session you will be introduced to Microsoft R for big data and its integration into the (not only) Microsoft environment (SQL Server / Hadoop), with a showcase of tools and code.
About speaker:
Michal Marusan's background is in data warehousing and business intelligence on massively parallel database engines, but for more than the last five years he has been working on numerous Big Data and Advanced Analytics projects with customers, mainly from the Telco, Banking, and Transportation industries.
Michal's focus and passion is helping customers implement new analytical methods in their business environments to drive data-driven decisions and generate new business insights, both in the cloud and in on-premises systems.
Michal is a member of the Global Black Belt team, CEE Advanced Analytics and Big Data TSP at Microsoft.
Registration:
@Meetup.com group's event here & @Eventbrite registration here (if you use both, your seat is guaranteed). You can also find our event @Facebook here.
[Disclaimer: If you use both (Meetup.com & Eventbrite) or at least one of them, your seat is guaranteed; if you just mark "going" on this Facebook event, we can't guarantee your seat.]
Language of the event: R & Slovak
------------------------------------
R <- Slovakia [R enthusiasts and users, data scientists and statisticians of all levels from Slovakia]
------------------------------------
This meetup group is for Data Scientists, Statisticians, Economists and Data Enthusiasts using R for data analysis and data visualization. The goals are to provide R enthusiasts a place to share ideas and learn from each other about how best to apply the language and tools to ever-evolving challenges in the vast realm of data management, processing, analytics, and visualization.
--
PyData is a group for users and developers of data analysis tools to share ideas and learn from each other. We gather to discuss how best to apply Python tools, as well as those using R and Julia, to meet the evolving challenges in data management, processing, analytics, and visualization. PyData groups, events, and conferences aim to provide a venue for users across all the various domains of data analysis to share their experiences and their techniques. PyData is organized by NumFOCUS.org, a 501(c)3 non-profit in the United States.
This talk provides an architecture overview of data-centric microservices illustrated with an example application. The following Microservices concepts are illustrated - domain driven design, event-driven services, Saga transactions, Application tracing and Health monitoring with different microservices using a variety of data types supported in the database - business data, documents, spatial, graph, and events. A running example of a mobile food delivery application (called GrubDash) is used, with a hands-on-lab that is available for attendees to work through on the Oracle Cloud after these sessions. The rest of the talks will build upon this Microservices architecture framework.
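Of the concepts listed above, Saga transactions are the easiest to miniaturize: each step pairs an action with a compensating action that undoes it if a later step fails. A minimal in-memory sketch (step names are invented; a real implementation would persist events in the database):

```python
# A minimal saga coordinator: run steps in order; on failure,
# run the compensations of completed steps in reverse order.
def run_saga(steps):
    """steps: list of (action, compensation) callables."""
    done = []
    try:
        for action, compensation in steps:
            action()
            done.append(compensation)
    except Exception:
        for compensation in reversed(done):   # undo in reverse order
            compensation()
        return "rolled back"
    return "committed"

log = []
ok = run_saga([
    (lambda: log.append("order created"), lambda: log.append("order cancelled")),
    (lambda: log.append("payment charged"), lambda: log.append("payment refunded")),
])
failed = run_saga([
    (lambda: log.append("order created"), lambda: log.append("order cancelled")),
    ((lambda: 1 / 0), lambda: None),  # delivery step fails
])
```

In a GrubDash-style food delivery flow, "charge payment" and "refund payment" would be calls to separate microservices rather than lambdas, but the coordination logic is the same.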
Seamless replication and disaster recovery for Apache Hive Warehouse | DataWorks Summit
As Apache Hadoop clusters become central to an organization's operations, organizations increasingly run clusters in more than one data center. Historically, this has been largely driven by business continuity planning or geo-localization requirements. It has also recently been gaining interest from a hybrid cloud perspective, i.e., where people are augmenting their traditional on-prem setup with cloud-based additions. A robust replication solution is a fundamental requirement in such cases.
Seamless disaster recovery has several challenges. Data, metadata, and transaction information need to be moved in sync. It should also be easy for users and applications to reason about the state of the replica. "Hadoop scale" also brings unique challenges, as bandwidth between clusters can be a limiting factor. Data transfer has to be minimized for replication, failover, and failback scenarios.
In this talk we will discuss how the above challenges are addressed for supporting seamless replication and disaster recovery for Hive.
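To make the transfer-minimization idea concrete, here is a hedged, in-memory sketch of incremental, event-id-based replication. The event shape and state layout are invented for illustration and are not Hive's actual replication API; the point is only that the replica tracks the last applied event id so each cycle ships just the delta:

```python
# Ship only the events the replica has not yet applied, then
# advance its checkpoint. Everything here is an in-memory stand-in.
def replicate(source_events, replica_state):
    """Apply events past the replica's checkpoint; return how many shipped."""
    last_id = replica_state.get("last_event_id", 0)
    shipped = [e for e in source_events if e["id"] > last_id]
    for e in shipped:
        replica_state["tables"].setdefault(e["table"], []).append(e["change"])
        replica_state["last_event_id"] = e["id"]
    return len(shipped)

events = [
    {"id": 1, "table": "sales", "change": "insert r1"},
    {"id": 2, "table": "sales", "change": "insert r2"},
]
state = {"last_event_id": 1, "tables": {"sales": ["insert r1"]}}
n = replicate(events, state)   # only event 2 crosses the wire
```

The same checkpoint makes failback cheap: after a failover, only events the old primary missed need to flow back.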
Speakers
Sankar Hariappan, Hortonworks, Staff Software Engineer
Anishek Agarwal, Hortonworks, Engineering Manager
(SPOT305) Event-Driven Computing on Change Logs in AWS | AWS re:Invent 2014 | Amazon Web Services
An increasingly common form of computing is computation in response to recently occurring events. These might be newly arrived or changed data, such as an uploaded Amazon S3 image file or an update to an Amazon DynamoDB table, or they might be changes in the state of some system or service, such as termination of an EC2 instance. Support for this form of computing requires both a means of efficiently surfacing events as a sequence of change records, as well as frameworks for processing such change logs. This session provides an overview of how AWS intends to facilitate event-driven computing through support for both change logs as well as various means of processing them.
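The change-record idea above can be sketched in a few lines: records carry an event type, and a handler per type keeps a materialized view in sync. Field names here are loosely modeled on stream change records but are assumptions, not an actual AWS API:

```python
# Dispatch change records to per-event-type handlers that maintain
# a simple key/value view; record shape is illustrative only.
def process_change_log(records, handlers, state):
    for rec in records:
        handler = handlers.get(rec["eventName"])
        if handler:                 # unknown event types are skipped
            handler(rec, state)
    return state

handlers = {
    "INSERT": lambda r, s: s.__setitem__(r["key"], r["new"]),
    "REMOVE": lambda r, s: s.pop(r["key"], None),
}
state = process_change_log(
    [{"eventName": "INSERT", "key": "img1", "new": "v1"},
     {"eventName": "INSERT", "key": "img2", "new": "v2"},
     {"eventName": "REMOVE", "key": "img1"}],
    handlers, {})
```

Because the view is rebuilt purely from the ordered change log, replaying the log after a failure yields the same state, which is the property event-driven frameworks rely on.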
Slides presented at SDBigData Meetup:
http://www.meetup.com/sdbigdata/events/225691323/
There was a request for more Couchbase use case information and NoSQL primer, so I added a number of slides to let me talk to those aspects right before doing the presentation.
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset | HostedbyConfluent
Streaming data systems have been growing rapidly in importance to the modern data stack. Kafka's ksqlDB provides an interface for analytic tools that speak SQL. Apache Superset, the most popular modern open-source visualization and analytics solution, plugs into nearly any data source that speaks SQL, including Kafka. Here, we review and compare methods for connecting Kafka to Superset to enable streaming analytics use cases including anomaly detection, operational monitoring, and online data integration.
Enterprise guide to building a Data Mesh | Sion Smith
Making Data Mesh simple, open source, and available to all: without vendor lock-in, without complex tooling, using an approach centered around 'specifications', existing tools, and a baked-in 'domain' model.
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ... | Precisely
Tackling the challenge of designing a machine learning model and putting it into production is the key to getting value back – and the roadblock that stops many promising machine learning projects. After the data scientists have done their part, engineering robust production data pipelines has its own set of challenges. Syncsort software helps the data engineer every step of the way.
Building on the process of finding and matching duplicates to resolve entities, the next step is to set up a continuous streaming flow of data from data sources so that as the sources change, new data automatically gets pushed through the same transformation and cleansing data flow – into the arms of machine learning models.
Some of your sources may already be streaming, but the rest are sitting in transactional databases that change hundreds or thousands of times a day. The challenge is that you can't affect the performance of data sources that run key applications, so putting something like database triggers in place is not the best idea. Using Apache Kafka or similar technologies as the backbone for moving data around doesn't, by itself, solve the problem of grabbing changes from the source, pushing them into Kafka, and consuming the data from Kafka to be processed. If something unexpected happens, like connectivity being lost on either the source or the target side, you don't want to have to fix it or start over because the data is out of sync.
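The "don't start over after a connectivity loss" requirement is usually met by checkpointing consumer offsets so processing resumes where it stopped. A minimal, Kafka-free sketch of the idea (the checkpoint store here is just a dict standing in for durable storage):

```python
# Resume-after-failure consumption: skip everything at or before the
# checkpointed offset, and advance the checkpoint as messages land.
def consume(messages, checkpoint, process):
    """Process messages past the checkpointed offset, advancing it."""
    for offset, payload in enumerate(messages):
        if offset <= checkpoint["offset"]:
            continue                      # already processed before the outage
        process(payload)
        checkpoint["offset"] = offset     # persist after each message

seen = []
ckpt = {"offset": -1}
consume(["a", "b"], ckpt, seen.append)       # first run processes a, b
consume(["a", "b", "c"], ckpt, seen.append)  # after reconnect, only c is new
```

Committing the checkpoint after processing (rather than before) means a crash mid-message causes a reprocess rather than a loss, which is the usual at-least-once trade-off.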
View this 15-minute webcast on-demand to learn how to tackle these challenges in large scale production implementations.
Powering Interactive BI Analytics with Presto and Delta Lake | Databricks
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources.
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ... | confluent
Apache Kafka is now nearly ubiquitous in modern data pipelines and use cases. While the Kafka development model is elegantly simple, operating Kafka clusters in production environments is a challenge. It’s hard to troubleshoot misbehaving Kafka clusters, especially when there are potentially hundreds or thousands of topics, producers and consumers and billions of messages.
The root cause of lag in a real-time application may be an application problem, like poor data partitioning or load imbalance, or a Kafka problem, like resource exhaustion or suboptimal configuration. Therefore, getting the best performance, predictability, and reliability for Kafka-based applications can be difficult. In the end, the operation of your Kafka-powered analytics pipelines could itself benefit from machine learning (ML).
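One concrete signal that separates the two failure classes is per-partition consumer lag: an application-side partitioning or load-imbalance problem tends to show up as a few partitions lagging far beyond the rest, while a cluster-side problem raises lag everywhere. A small illustrative check (the partition numbers and lag values are made up):

```python
# Flag partitions whose consumer lag exceeds `factor` times the mean,
# a crude but useful load-imbalance indicator.
def skewed_partitions(lag_by_partition, factor=2.0):
    """Return partitions lagging more than factor * mean lag."""
    mean = sum(lag_by_partition.values()) / len(lag_by_partition)
    return sorted(p for p, lag in lag_by_partition.items()
                  if lag > factor * mean)

hot = skewed_partitions({0: 100, 1: 120, 2: 5000, 3: 90})
```

An ML-driven approach, as the talk suggests, would learn baselines per topic and flag anomalies automatically instead of using a fixed factor.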
Sf big analytics_2018_04_18: Evolution of GoPro's data platform | Chester Chen
Talk 1: Evolution of GoPro's data platform
In this talk, we will share GoPro's experiences in building a data analytics cluster in the cloud. We will discuss:
* Evolution of the data platform from fixed-size Hadoop clusters to a cloud-based Spark cluster with a centralized Hive Metastore + S3: cost benefits and DevOps impact
* Configurable, Spark-based batch ingestion/ETL framework
* Migration of the streaming framework to Cloud + S3
* Analytics metrics delivery with Slack integration
* BedRock: data platform management, visualization & self-service portal
* Visualizing machine learning features via Google Facets + Spark
Speakers: Chester Chen
Chester Chen is the Head of Data Science & Engineering, GoPro. Previously, he was the Director of Engineering at Alpine Data Lab.
David Winters
David is an Architect on the Data Science and Engineering team at GoPro and the creator of their Spark-Kafka data ingestion pipeline. Previously, he worked at Apple and Splice Machine.
Hao Zou
Hao is a senior big data engineer on the Data Science and Engineering team. Previously, he worked at Alpine Data Labs and Pivotal.
Precima data scientist and architect discussing their data science and big data tech stack at the Toronto Data Science & Big Data meetup on Jan 30, 2019 hosted by WeCloudData https://weclouddata.com and sponsored by precima and loyaltyone.
As part of this session, I will be giving an introduction to Data Engineering and Big Data. It covers up to date trends.
* Introduction to Data Engineering
* Role of Big Data in Data Engineering
* Key Skills related to Data Engineering
* Overview of Data Engineering Certifications
* Free Content and ITVersity Paid Resources
Don't worry if you miss the live session - you can use the link below to watch the recording afterward.
https://youtu.be/dj565kgP1Ss
* Upcoming Live Session - Overview of Big Data Certifications (Spark Based) - https://www.meetup.com/itversityin/events/271739702/
Relevant Playlists:
* Apache Spark using Python for Certifications - https://www.youtube.com/playlist?list=PLf0swTFhTI8rMmW7GZv1-z4iu_-TAv3bi
* Free Data Engineering Bootcamp - https://www.youtube.com/playlist?list=PLf0swTFhTI8pBe2Vr2neQV7shh9Rus8rl
* Join our Meetup group - https://www.meetup.com/itversityin/
* Enroll for our labs - https://labs.itversity.com/plans
* Subscribe to our YouTube Channel for Videos - http://youtube.com/itversityin/?sub_confirmation=1
* Access Content via our GitHub - https://github.com/dgadiraju/itversity-books
* Lab and Content Support using Slack
Proactive ops for container orchestration environmentsDocker, Inc.
Break -> inspect -> fix is the Ops workflow for infrastructure stacks of the past. Distributed infrastructure and applications claim to be the new generation, but why is it so much more painful to maintain and troubleshoot them? Much of the pain comes from outdated operational models relying on reactive or, worse yet, manual monitoring and Ops.
This talk lays out a proactive Ops model for container infrastructure. By focusing on event monitoring, infrastructure state monitoring, trend analysis, and distributed log collection, a proactive Ops model delivers observability for distributed apps that was not possible before. Using real-world examples from Swarm and Kubernetes, we'll demonstrate the tools used and how we relieve Ops pain in container orchestration.
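Of the pillars the talk names, trend analysis is the easiest to sketch: alert when a rolling average of a metric drifts past a threshold, rather than reacting to single spikes after something breaks. A minimal illustration (the metric, window, and threshold are hypothetical):

```python
from collections import deque

def rolling_alerts(samples, window=3, threshold=80.0):
    """Return indices where the rolling mean of a metric (e.g. CPU %)
    crosses the threshold -- a proactive signal fired on a sustained
    trend, not on an isolated spike."""
    buf = deque(maxlen=window)
    alerts = []
    for i, value in enumerate(samples):
        buf.append(value)
        if len(buf) == window and sum(buf) / window > threshold:
            alerts.append(i)
    return alerts

# One harmless spike at index 1, then a genuine upward trend:
cpu = [40, 95, 50, 70, 85, 90, 95]
```

Here the lone spike never fires an alert, but the sustained climb at the end does — the proactive model reacts before a reactive "break -> inspect -> fix" loop would even start.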
First in Class: Optimizing the Data Lake for Tighter IntegrationInside Analysis
The Briefing Room with Dr. Robin Bloor and Teradata RainStor
Live Webcast October 13, 2015
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=012bb2c290097165911872b1f241531d
Hadoop data lakes are emerging as peers to corporate data warehouses. However, successful data management solutions require a fusion of all relevant data, new and old, which has proven challenging for many companies. With a data lake that’s been optimized for fast queries, solid governance and lifecycle management, users can take data management to a whole new level.
Register for this episode of The Briefing Room to learn from veteran Analyst Dr. Robin Bloor as he discusses the relevance of data lakes in today’s information landscape. He’ll be briefed by Mark Cusack of Teradata, who will explain how his company’s archiving solution has developed into a storage point for raw data. He’ll show how the proven compression, scalability and governance of Teradata RainStor combined with Hadoop can enable an optimized data lake that serves as both reservoir for historical data and as a "system of record” for the enterprise.
Visit InsideAnalysis.com for more information.
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
Whether to take data ingestion cycles off the ETL tool and the data warehouse or to facilitate competitive Data Science and building algorithms in the organization, the data lake – a place for unmodeled and vast data – will be provisioned widely in 2020.
Though it doesn’t have to be complicated, the data lake has a few key design points that are critical, and it does need to follow some principles for success. Avoid building the data swamp, but not the data lake! The tool ecosystem is building up around the data lake and soon many will have a robust lake and data warehouse. We will discuss policy to keep them straight, send data to its best platform, and keep users’ confidence up in their data platforms.
Data lakes will be built in cloud object storage. We’ll discuss the options there as well.
Get this data point for your data lake journey.
Apache Spark 2.0 set the architectural foundations of Structure in Spark, Unified high-level APIs, Structured Streaming, and the underlying performant components like Catalyst Optimizer and Tungsten Engine. Since then the Spark community has continued to build new features and fix numerous issues in releases Spark 2.1 and 2.2.
Continuing forward in that spirit, the upcoming release of Apache Spark 2.3 has made similar strides too, introducing new features and resolving over 1300 JIRA issues. In this talk, we want to share with the community some salient aspects of soon to be released Spark 2.3 features:
• Kubernetes Scheduler Backend
• PySpark Performance and Enhancements
• Continuous Structured Streaming Processing
• DataSource v2 APIs
• Structured Streaming v2 APIs
On Monday evening, 15 July, AMIS organized the seminar 'Oracle database 12c revealed'. This evening offered AMIS Oracle professionals the first opportunity to see the innovations in Oracle database 12c in action! The AMIS specialists who carried out more than a year of beta testing showed what is new and how we will put it to use in the coming years!
This presentation was given as a plenary session that evening!
Off-Label Data Mesh: A Prescription for Healthier DataHostedbyConfluent
"Data mesh is a relatively recent architectural innovation, espoused as one of the best ways to fix analytic data. We renegotiate aged social conventions by focusing on treating data as a product, with a clearly defined data product owner, akin to that of any other product. In addition, we focus on building out a self-service platform with integrated governance, letting consumers safely access and use the data they need to solve their business problems.
Data mesh is prescribed as a solution for _analytical data_, so that conventionally analytical results (think weekly sales or monthly revenue reports) can be more accurately and predictably computed. But what about non-analytical business operations? Would they not also benefit from data products backed by self-service capabilities and dedicated owners? If you've ever provided a customer with an analytical report that differed from their operational conclusions, then this talk is for you.
Adam discusses the resounding successes he has seen from applying data mesh _off-label_ to both analytical and operational domains. The key? Event streams. Well-defined, incrementally updating data products that can power both real-time and batch-based applications, providing a single source of data for a wide variety of application and analytical use cases. Adam digs into the common areas of success seen across numerous clients and customers and provides you with a set of practical guidelines for implementing your own minimally viable data mesh.
Finally, Adam covers the main social and technical hurdles that you'll encounter as you implement your own data mesh. Learn about important data use cases, data domain modeling techniques, self-service platforms, and building an iteratively successful data mesh."
An AMIS Overview of Oracle database 12c (12.1)Marco Gralike
Presentation used by Lucas Jellema and Marco Gralike during the AMIS Oracle Database 12c Launch event on Monday the 15th of July 2013 (much thanks to Tom Kyte, Oracle, for being allowed to use some of his material)
Walk Through a Real World ML Production ProjectBill Liu
Success in productionizing ML models is difficult to achieve due to tools, processes and operational procedures. In this session, we demonstrate how data scientists and ML engineers collaborate and efficiently deploy models to production with the Wallaroo platform.
Using a real-world scenario, we will drill down into the ML production journey that data scientists and ML engineers go through to take ML models into production. In this session you will learn:
The current pain points and blockers to production
The two persona roles in the ML production process: Data Scientist (DS) and ML Engineer
How the ML engineer creates a workspace in Wallaroo, and invites the DS to collaborate
How the DS uploads and deploys models to Wallaroo, performing simple validation checks on the output
How the ML Engineer can check model health (inference speed, etc)
How the DS checks logs, looks for anomalies
How the DS switches model in the pipeline
Speakers: Nina Zumel, Martin Bald
Redefining MLOps with Model Deployment, Management and Observability in Produ...Bill Liu
Tech talk: https://www.aicamp.ai/event/eventdetails/W2022052410
What happens after your machine learning models are deployed in production? How do you make sure that your model performance does not degrade as data and the world change?
The constantly changing data creates challenges for data scientists and engineering teams on how to detect which models have been affected and how to get their ML applications up and running seamlessly.
In this session we will take a deep dive into the new ML model monitoring and drift detection technology. We will discuss:
- How to track the ongoing accuracy of their models in production
- How to immediately detect drift before it causes significant damage to the business
- How to locate the cause of model drift in live environments
We will also discuss how data scientists and ML engineers can collaborate effectively using their respective tools to identify issues and take the necessary actions with a live demo and a real world use case.
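Wallaroo's drift-detection technology is its own; as a neutral illustration of the underlying idea, here is a minimal population stability index (PSI) sketch that compares a baseline feature histogram against live traffic. The bin counts and the usual 0.1 / 0.25 cutoffs are illustrative conventions, not anything from the talk.

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index over pre-binned histograms.
    Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift,
    > 0.25 significant drift."""
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, eps)   # guard against empty bins
        a_pct = max(a / a_total, eps)
        score += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return score

baseline = [25, 25, 25, 25]   # training-time feature histogram
stable   = [26, 24, 25, 25]   # live traffic that looks the same
shifted  = [5, 10, 30, 55]    # live traffic that has drifted
```

Tracking a statistic like this per feature, per window, is one simple way to "immediately detect drift before it causes significant damage".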
Speaker: Younes Amar, Head of Product Wallaroo AI.
Resources: https://docs.wallaroo.ai/
These days, training machine learning models at the device edge is still a risky endeavor. It is frequently considered a purely academic subject with little value for real-life product development.
In her talk, Vera will challenge this misconception, discuss the advantages of learning at the edge, and guide you through the edge-learning decision-making framework and design principles.
https://www.aicamp.ai/event/eventdetails/W2021102210
Attention Is All You Need.
With these simple words, the Deep Learning industry was forever changed. Transformers were initially introduced in the field of Natural Language Processing to enhance language translation, but they demonstrated astonishing results even outside language processing. In particular, they have recently spread through the Computer Vision community, advancing the state of the art on many vision tasks. But what are Transformers? What is the mechanism of self-attention, and do we really need it? How did they revolutionize Computer Vision? Will they ever replace convolutional neural networks?
These and many other questions will be answered during the talk.
In this tech talk, we will discuss:
- A piece of history: Why did we need a new architecture?
- What is self-attention, and where does this concept come from?
- The Transformer architecture and its mechanisms
- Vision Transformers: An Image is worth 16x16 words
- Video Understanding using Transformers: the space + time approach
- The scale and data problem: Is Attention what we really need?
- The future of Computer Vision through Transformers
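Since the talk centers on self-attention, here is a minimal, pure-Python sketch of scaled dot-product attention — softmax(QK^T / sqrt(d_k)) V — on toy 2x2 matrices. No batching, no learned projections, no multiple heads; the matrices are made-up numbers purely for illustration.

```python
import math

def softmax(xs):
    m = max(xs)                       # subtract max for stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    scores = [[sum(q * k for q, k in zip(qrow, krow)) / math.sqrt(d_k)
               for krow in K] for qrow in Q]
    weights = [softmax(row) for row in scores]
    return weights, matmul(weights, V)

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
weights, out = attention(Q, K, V)
```

Each query attends most to the key it aligns with, and every row of attention weights sums to 1 — the "where should I look?" mechanism the talk unpacks.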
Speaker: Davide Coccomini, Nicola Messina
Website: https://www.aicamp.ai/event/eventdetails/W2021101110
Deep AutoViML For Tensorflow Models and MLOps WorkflowsBill Liu
deep_autoviml is a powerful new deep learning library with a very simple design goal: Make it as easy as possible for novices and experts alike to experiment with and build tensorflow.keras preprocessing pipelines and models in as few lines of code as possible.
deep_autoviml will enable data scientists, ML engineers and data engineers to fast prototype tensorflow models and data pipelines for MLOps workflows using the latest TF 2.4+ and keras preprocessing layers. You can now upload your saved model to any Cloud provider and make predictions out of the box since all the data preprocessing layers are attached to the model itself!
In this webinar, we will discuss the problems that deep_AutoViML can solve, its architecture design and demo how to build powerful TF.Keras models on structured data, NLP and Image data domains.
https://www.aicamp.ai/event/eventdetails/W2021080918
Metaflow: The ML Infrastructure at NetflixBill Liu
Metaflow was started at Netflix to answer a pressing business need: how to enable an organization of data scientists, who are not software engineers by training, to build and deploy end-to-end machine learning workflows and applications independently. We wanted to provide the best possible user experience for data scientists, allowing them to focus on the parts they like (modeling using their favorite off-the-shelf libraries) while providing robust built-in solutions for the foundational infrastructure: data, compute, orchestration, and versioning.
Today, the open-source Metaflow powers hundreds of business-critical ML projects at Netflix and other companies from bioinformatics to real estate.
In this talk, you will learn about:
- What to expect from a modern ML infrastructure stack.
- Using Metaflow to boost the productivity of your data science organization, based on lessons learned from Netflix.
- Deployment strategies for a full stack of ML infrastructure that plays nicely with your existing systems and policies.
https://www.aicamp.ai/event/eventdetails/W2021080510
AI stands on three pillars: algorithms, hardware and training data. While the first two have already become commodities on the market, the latter - reliable labelled data - is still a bottleneck in the industry.
Need to add twice as much data to the training set to improve your model? Want to validate the accuracy of a new classifier in an hour? Or maybe you are building a human-in-the-loop process with 90% of cases processed automatically and the trickiest 10% fine-tuned by people in real time. You can do it all with crowdsourcing, but only with crowdsourcing done right.
In this talk, we will discuss how a new generation of methods and tools makes it possible to collect high-quality human-labelled data at large scale, and why every ML specialist should know how to use crowdsourcing.
From this talk, you will learn how to:
* Understand the applicability, benefits and limits of the crowdsourcing approach.
* Integrate an on-demand workforce into your processes and build human-in-the-loop processes.
* Control the quality and accuracy of data labeling to develop high performing ML models.
* Understand the full-cycle crowdsourcing project
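The simplest quality-control technique behind "crowdsourcing done right" is overlap plus aggregation: have several workers label each item, take a majority vote, and route low-agreement items back for review (the human-in-the-loop slice). A toy sketch, with a hypothetical agreement threshold:

```python
from collections import Counter

def aggregate(labels, min_agreement=0.7):
    """Majority-vote the crowd labels for one item.

    Returns (winning_label, agreement_ratio, needs_review), where
    needs_review flags items whose agreement falls below threshold
    so they can be escalated to expert annotators.
    """
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    agreement = votes / len(labels)
    return label, agreement, agreement < min_agreement

label, agreement, review = aggregate(["cat", "cat", "dog", "cat", "cat"])
```

Real platforms use more sophisticated worker-skill models, but overlap-and-vote is the baseline they all build on.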
Speaker: Daria Baidakova (Toloka)
Building large scale transactional data lake using apache hudiBill Liu
Data is a critical infrastructure for building machine learning systems. From ensuring accurate ETAs to predicting optimal traffic routes, providing safe, seamless transportation and delivery experiences on the Uber platform requires reliable, performant large-scale data storage and analysis. In 2016, Uber developed Apache Hudi, an incremental processing framework, to power business-critical data pipelines at low latency and high efficiency, and to help distributed organizations build and manage petabyte-scale data lakes.
In this talk, I will describe what Apache Hudi is and its architectural design, and then deep dive into improving data operations through features such as data versioning and time travel.
We will also go over how Hudi brings kappa architecture to big data systems and enables efficient incremental processing for near real time use cases.
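The core primitive Hudi brings to a data lake is the upsert: merging an incremental batch into a snapshot by record key, keeping the newest version. This toy sketch shows only the idea — the field names (`key`, `ts`) are hypothetical stand-ins, not Hudi's record-key/precombine API, and real Hudi does this at file-group granularity with indexing.

```python
def upsert(snapshot, incoming):
    """Merge incoming records into a snapshot by record key,
    keeping the version with the newest commit time ('ts').
    Each record is a dict with at least 'key' and 'ts'."""
    table = {r["key"]: r for r in snapshot}
    for r in incoming:
        current = table.get(r["key"])
        if current is None or r["ts"] >= current["ts"]:
            table[r["key"]] = r
    return sorted(table.values(), key=lambda r: r["key"])

snapshot = [{"key": "a", "ts": 1, "fare": 10},
            {"key": "b", "ts": 1, "fare": 20}]
incoming = [{"key": "b", "ts": 2, "fare": 25},   # update
            {"key": "c", "ts": 2, "fare": 30}]   # insert
merged = upsert(snapshot, incoming)
```

Because each commit carries a time, downstream consumers can also ask "give me everything changed since commit T" — the incremental pull that enables the near-real-time, kappa-style processing mentioned above.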
Speaker: Satish Kotha (Uber)
Apache Hudi committer and Engineer at Uber. Previously, he worked on building real time distributed storage systems like Twitter MetricsDB and BlobStore.
website: https://www.aicamp.ai/event/eventdetails/W2021043010
Deep Reinforcement Learning and Its ApplicationsBill Liu
What is the most exciting AI news in recent years? AlphaGo!
What are key techniques for AlphaGo? Deep learning and reinforcement learning (RL)!
What are application areas for deep RL? A lot! In fact, besides games, deep RL has been making tremendous achievements in diverse areas like recommender systems and robotics.
In this talk, we will introduce deep reinforcement learning, present several applications, and discuss issues and potential solutions for successfully applying deep RL in real life scenarios.
https://www.aicamp.ai/event/eventdetails/W2021042818
Big Data and AI in Fighting Against COVID-19Bill Liu
Website: https://learn.xnextcon.com/event/eventdetails/W20070810
As the COVID-19 pandemic sweeps the globe, big data and AI have emerged as crucial tools for everything from diagnosis and epidemiology to therapeutic and vaccine development.
In this talk, we collect and review how big data is fighting back against COVID-19. We also provide a deep dive into two interesting use cases: 1) using NLP and BERT to answer scientific questions; 2) COVID-19 data lakes from Databricks, Google and Amazon.
Agenda:
Introduction
Supercomputers for Scientific Research
Covid-19 Tracking and Prediction
Covid-19 Research and Diagnosis
Use Case 1: NLP and BERT to answer scientific questions
Use Case 2: COVID-19 Data Lake and Platform
Highly-scalable Reinforcement Learning RLlib for Real-world ApplicationsBill Liu
website: https://learn.xnextcon.com/event/eventdetails/W20051110
video: https://www.youtube.com/watch?v=8tG8PJC6oaU
In reinforcement learning (RL), an agent learns how to optimize performance solely by collecting experience in the real world or via a simulator. RL is being applied to problems such as decision making, process optimization (e.g., manufacturing and supply chains), ad serving, recommendations, self-driving cars, and algorithmic trading.
In this talk, I will discuss RLlib, a reinforcement learning library built on Ray with a strong focus on large-scale execution and scalability, ease-of-use for general users, as well as customizability for developers and researchers.
RLlib offers autonomous task-learning via many common RL algorithms and it scales from a laptop to a cluster with hundreds of machines. It is used by dozens of organizations, from startups to research labs to large organizations. You will see RLlib in action with a live demo.
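RLlib's algorithms are far more sophisticated, but the agent/experience loop it scales is the same one in textbook tabular Q-learning. As a minimal, library-free illustration (the corridor environment and all hyperparameters are invented for this sketch), an agent learns to walk right toward a reward:

```python
import random

def train(n_states=5, episodes=300, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    """Tabular Q-learning on a corridor: actions 0=left, 1=right,
    reward 1.0 for reaching the rightmost state."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # epsilon-greedy action selection
            if rng.random() < eps:
                a = rng.randrange(2)
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == n_states - 1 else 0.0
            # Q-learning update: bootstrap from the best next action
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q

Q = train()
```

Everything RLlib adds — vectorized rollouts, distributed workers, neural function approximation — scales up exactly this collect-experience-then-update loop.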
Build computer vision models to perform object detection and classification w...Bill Liu
event: https://learn.xnextcon.com/event/eventdetails/W20042918
video:
description: Computer Vision has received significant attention over recent years, both within academia and industry. As the state of the art rapidly improves, the art of the possible follows, offering innovative forms of computer vision applications for different scenarios.
In this talk, Ramine will cover the background and development of computer vision, and demonstrate how to use AWS to build robust, computer vision models to perform object detection and classification.
Key Takeaways:
Understand the history of Computer Vision
Learn how to use Amazon SageMaker to build and Deploy Computer Vision Models
How to orchestrate multiple models for implementing a real-world use case
Causal Inference in Data Science and Machine LearningBill Liu
Event: https://learn.xnextcon.com/event/eventdetails/W20042010
Video: https://www.youtube.com/channel/UCj09XsAWj-RF9kY4UvBJh_A
Modern machine learning techniques are able to learn highly complex associations from data, which has led to amazing progress in computer vision, NLP, and other predictive tasks. However, there are limitations to inference from purely probabilistic or associational information. Without understanding causal relationships, ML models are unable to provide actionable recommendations, perform poorly in new, but related environments, and suffer from a lack of interpretability.
In this talk, I provide an introduction to the field of causal inference, discuss its importance in addressing some of the current limitations in machine learning, and provide some real-world examples from my experience as a data scientist at Brex.
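A compact way to see why associational models fail to "provide actionable recommendations" is Simpson's paradox: stratifying by a confounder can reverse an aggregate comparison. The numbers below are a hypothetical treatment dataset confounded by case severity, constructed only to exhibit the reversal:

```python
# (successes, total) per (treatment, severity) stratum.
data = {
    ("A", "mild"):   (81, 87),
    ("A", "severe"): (192, 263),
    ("B", "mild"):   (234, 270),
    ("B", "severe"): (55, 80),
}

def aggregate_rate(t):
    """Success rate of a treatment, ignoring severity."""
    s = sum(data[(t, sev)][0] for sev in ("mild", "severe"))
    n = sum(data[(t, sev)][1] for sev in ("mild", "severe"))
    return s / n

def stratified_rates(t):
    """Success rate of a treatment within each severity stratum."""
    return {sev: data[(t, sev)][0] / data[(t, sev)][1]
            for sev in ("mild", "severe")}
```

Treatment B looks better in aggregate, yet A wins within every stratum — a purely predictive model trained on the pooled data would recommend the wrong treatment, which is exactly the gap causal inference addresses.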
https://learn.xnextcon.com/event/eventdetails/W20040610
This talk explains how to practically bring the power of convolutional neural networks and deep learning to memory- and power-constrained devices like smartphones. You will learn various strategies to circumvent obstacles and build mobile-friendly shallow CNN architectures that significantly reduce the memory footprint and are therefore easier to store on a smartphone.
The talk also dives into how to use a family of model compression techniques to prune the network size for live image processing, enabling you to build a CNN version optimized for inference on mobile devices. Along the way, you will learn practical strategies to preprocess your data in a manner that makes the models more efficient in the real world.
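One member of that model-compression family is magnitude pruning: zero out the smallest-magnitude weights so the network stores (and multiplies) far fewer parameters. A toy sketch on a flat weight list — real pruning operates on tensors, usually layer by layer, and is followed by fine-tuning:

```python
def prune_by_magnitude(weights, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of weights.

    `weights` is a flat list of floats; returns (pruned_weights,
    number_of_weights_kept)."""
    n_prune = int(len(weights) * sparsity)
    # Indices of the n_prune smallest |w| values.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_prune])
    pruned = [0.0 if i in drop else w for i, w in enumerate(weights)]
    return pruned, len(weights) - n_prune

w = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02]
pruned, kept = prune_by_magnitude(w, sparsity=0.5)
```

At 50% sparsity half the weights become zero and can be stored in a sparse format, which is where the on-device memory savings come from.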
Weekly #105: AutoViz and Auto_ViML Visualization and Machine LearningBill Liu
https://learn.xnextcon.com/event/eventdetails/W20040310
I will describe what is available in terms of open source and proprietary tools for automating data science tasks, and introduce two new tools: one to visualize a data set of any size with one click, and another to try multiple ML models and techniques with a single call. I will share the GitHub repos for both, free, in the talk.
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work and a knack for helping others understand how things work. He has around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities, spread across over 85 cities around the world. In addition, Sectrio runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
To Graph or Not to Graph: Knowledge Graph Architectures and LLMs
C19013010 the tutorial to build shared ai services session 2
1. The tutorial to build shared AI services -- Session 2
Suqiang Song (Jack)
Director & Chapter Leader of Data/AI Engineering @ Mastercard
jackssqcyy@gmail.com
https://www.linkedin.com/in/suqiang-song-72041716/
2. Agenda
Session 2: Feb. 1st, Friday, 10am-12pm PT
Module 3: AI Engineering platform and AI Engineers (40 mins)
• Key factors to consider for an AI Engineering platform
• Architect a data pipeline framework
• Apache NiFi introduction
• The traditional AI tribe and its challenges
• Knowledge and skills required for an AI Engineer
• Growth path for an AI Engineer
Module 4: Benchmark between Spark machine learning and deep learning + Code Lab 2 (30 mins)
• Traditional Collaborative Filtering approach with Spark MLlib ALS (Scala)
• Build an NCF deep learning approach with Intel Analytics Zoo on Spark (Scala)
Live Demo (40 mins)
• Build an end-to-end AI pipeline with Kafka, NiFi, Spark Streaming and Keras on Spark
Q & A (10 mins)
3. Course Prerequisites
• Install Docker on your local laptop
• Download the two Docker images (kafka.tar and demo-whole.tar) and demo_pipeline.xml from the shared drive:
https://1drv.ms/f/s!AsXKHMXBWUIBiBpaYk9FFjdoUifg
passcode: jack
• Load the images into your Docker environment:
$ docker load -i demo-whole.tar
$ docker load -i kafka.tar
5. Key factors to consider for an AI Engineering platform
[Diagram: an AI Engineering platform consolidates, leverages, and automates across organization/process, API/pipeline enablement, talent, data, and technology infrastructure.]
6. Continued: AI Engineering platform
[Architecture diagram: data sources (online systems, CRM, file transfer; historical + incremental) feed a Data Pipeline Bus covering real-time event integration, batch data integration, and business rule integration, landing in a data lake and message bus. A Data Pipeline Engine plus ML/DL learning pipelines (machine learning and deep learning libs/frameworks) drive a Serving Engine with real-time, streaming, and batch serving. Cross-cutting components: monitoring metrics, performance analyzer, predefined integrated pipelines, predefined serving APIs & templates, predefined AI service templates, workbench, admin, and cloud-native deployment.]
7. Data Flow Pipeline
• Flow-based "programming"
• Source-Channel-Sink "structure"
• Ingests data from various sources and transforms data to various destinations
• Extract - Transform - Load
• High-throughput, straight-through data flows
• Data governance
• Combines batch and stream processing
• Visual coding with a flow editor
• Event processing (ESP & CEP)
8. Architect a data pipeline framework: what is the DFX?
Along with functional requirements, there are various quality attributes. Differences in these attributes can make products very different (think Tesla vs. Leaf). DFX is Design For Quality Attributes.
The X (quality attributes) for a data pipeline framework includes:
• Clustering
• High availability & recovery
• Delivery guarantee
• Data buffering, flow control and back pressure
• Data governance
• Usability
• Extensibility
• Multi-tenancy
• Version control & deployment
• Security
• Monitoring & diagnostic capabilities
• Integration capabilities
• Cloud native
• Performance, latency and throughput
9. Example: High Availability and Recovery
High availability
– Pipeline level: each step or processor in a flow that is likely to encounter failures has a "failure" routing relationship
– A pipeline failure is handled by looping that failure relationship back to the same step or routing it to new steps
– Node-level failover depends on an elected "cluster coordinator" and "primary node"
– Pipeline failover between nodes?
Recovery
– Replay: the content repository should be designed to act as a rolling buffer of history, which supports replay very well
– Data recovery after failover, and eventual consistency
– Breakpoint resume: last-saved offset; how you resume the pipeline from the broken pieces after the failure is fixed
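The "loop the failure relationship back to the same step" idea above can be sketched in plain Python. NiFi configures this visually on a processor, so the function and names here are purely illustrative:

```python
def run_with_failure_relationship(step, item, max_retries=3):
    """Route failures back onto the same step, the way a NiFi 'failure'
    relationship looped back to the processor re-queues the flowfile."""
    for attempt in range(1, max_retries + 1):
        try:
            return step(item)          # the "success" relationship
        except Exception:
            # "failure" relationship: re-queue on the same step
            if attempt == max_retries:
                raise                  # give up: route to an error handler

# A flaky step that fails twice, then succeeds.
calls = {"n": 0}
def flaky_step(item):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return item.upper()

print(run_with_failure_relationship(flaky_step, "event"))  # EVENT
```

The same loop, pointed at a different step instead of `step` itself, models routing the failure relationship to a new step.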
10. Example: Data Buffering, Flow Control and Back Pressure
Buffering with prioritization
– Configure a prioritizer per connection, such as FirstInFirstOut, NewestFirst, OldestFirst, etc.
– Determine what is important for your data: time based, arrival order, or importance of a data set
– Funnel many connections down to a single connection to prioritize across data sets
– Develop your own prioritizer if needed
Flow control & back pressure
– Configure back pressure, such as expiration and thresholds, per connection
– Based on the number of flows or the total size of flows
– The upstream processor is no longer scheduled to run until the connection drops below the threshold
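A minimal sketch of the mechanics described above: a connection as a prioritized buffer with a back-pressure threshold. The class and method names are illustrative, not NiFi's actual API:

```python
import heapq

class Connection:
    """A NiFi-style connection: a prioritized buffer that refuses new
    flowfiles once its back-pressure threshold is reached."""
    def __init__(self, threshold, prioritizer):
        self.threshold = threshold
        self.prioritizer = prioritizer   # smaller key = dequeued first
        self._heap = []
        self._seq = 0                    # tie-breaker keeps insertion order stable

    def offer(self, flowfile):
        if len(self._heap) >= self.threshold:
            return False                 # back pressure: upstream not scheduled
        heapq.heappush(self._heap, (self.prioritizer(flowfile), self._seq, flowfile))
        self._seq += 1
        return True

    def poll(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

# OldestFirst-style prioritizer: dequeue by arrival timestamp.
conn = Connection(threshold=2, prioritizer=lambda f: f["ts"])
print(conn.offer({"ts": 5, "data": "b"}))   # True
print(conn.offer({"ts": 1, "data": "a"}))   # True
print(conn.offer({"ts": 9, "data": "c"}))   # False -> back pressure engaged
print(conn.poll()["data"])                  # a (oldest first)
```

Swapping the `prioritizer` lambda gives NewestFirst or importance-based ordering; funneling is just many producers offering into one `Connection`.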
11. Example: Security
Control plane
– Pluggable authentication: 2-way SSL, LDAP, Kerberos
– File-based authority provider out of the box
– Multiple roles to define access controls
Data plane
– Optional 2-way SSL between cluster nodes
– Optional 2-way SSL on Site-to-Site (or edge-to-edge) connections
– Encryption/decryption of data through processors
Data privacy and compliance
– PCI/PII compliance
– GDPR (General Data Protection Regulation)
Yes, you don't want your CEO to have to testify before Congress ☺
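For reference, the 2-way SSL settings above are driven by keystore/truststore properties in NiFi's nifi.properties; a hedged fragment (paths and passwords are placeholders, and exact property names may vary by NiFi version):

```properties
# TLS identity for this node (placeholder paths/passwords)
nifi.security.keystore=/opt/nifi/conf/keystore.jks
nifi.security.keystoreType=JKS
nifi.security.keystorePasswd=changeit
nifi.security.truststore=/opt/nifi/conf/truststore.jks
nifi.security.truststoreType=JKS
nifi.security.truststorePasswd=changeit

# Secure cluster-node and Site-to-Site communication
nifi.cluster.protocol.is.secure=true
nifi.remote.input.secure=true
```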
12. Example: Multi-Tenancy
The ability for multiple groups of entities (people or systems) to command, control, and observe the state of different parts of the dataflow.
Multi-tenant authorization
– Enables a self-service model for dataflow management, allowing each team or organization to manage flows with full awareness of the rest of the flow, to which they do not have access
Multi-tenant isolation and separated SLA/QoS
– Data is absolutely critical and loss intolerant
– Enables fine-grained, flow-specific configuration for each tenant
– Data buffering, flow control and back pressure should be considered at the tenant level
Multi-tenant isolated resource management
– Integrate with popular third-party resource management frameworks such as YARN
– Split the functionality of resource management and job scheduling/monitoring into separate daemons
15. Assessment score ratings for NiFi
[Scorecard: NiFi rated on a 1-5 scale for each attribute: Clustering; High Availability and Recovery; Delivery Guarantee; Data Buffering, Flow Control and Back Pressure; Data Governance; Usability; Extensibility; Multi-Tenancy; Version Control & Deployment; Authentication & Authorization; Encryption and Decryption; Monitoring & Diagnostic; Integration Capabilities; Cloud Native; Performance, Latency and Throughput (real time & streaming); Performance, Latency and Throughput (batch files / DB actions).]
16. Traditional AI Tribe
Process: Ingest → Transform → Analyze → Output
Workflow: understand the problem → ingest data → explore and understand data → clean and shape data → evaluate data → create and build models → communicate results → deliver & deploy the model
Roles:
• Data Engineer: architects how data is organized & ensures operability
• Data Scientist: deep analytics and modeling for hidden insights
• Business Analyst: works with data to apply insights to business strategy
• App Developer: integrates data & insights with existing or new applications
18. Ratings for the traditional data scientist (0-4 scale)
Data Mining: Algorithm 4, ML Modeling 4, Research 4, Math 4, Statistics 4, Hypertuning 3, Feature Engineering 3
Programming / Coding: QA 2, Troubleshooting 1, Abstract Thinking 1, Debugging 1, Data Structures 2, Unit Test 2, Software Algorithm 2
Languages: Shell/Perl 1, C++ 1, Java/Scala 1, Python 3, R 4
Data Engineering: Data Modeling 2, Data Warehousing… 1, SQL 2, RDB & NoSQL 1, ETL 2, Data Governance 1, Data Pipeline 2, Job/Workflow 2
Big Data Stacks: Appliance (scale up) 3, Distribute… 1, Kafka/Streaming 1, Spark 2, Hadoop 2, SAS 4, MPP 2
ML/DL Frameworks: Pandas 3, H2O 3, Spark MLlib 2, Scikit-learn 3, R Lib 4, DL-Caffe 2, DL-Keras 2, DL-TensorFlow 2
APIs/Services/App (Model Serving): Automation 1, Real Time Messaging 1, Cache 1, API/App Framework 1, RPC/RESTful… 1, CI/CD 1, Cloud Native 1
Visualization & Communications: Data Visualization 3, Model Interpretability 4, Business Acumen 3, Communication Skills 2, Business Analysis 2, Presentation 3
19. Ratings for the traditional data engineer (0-4 scale)
Data Mining: Algorithm 1, ML Modeling 1, Research 1, Math 2, Statistics 1, Hypertuning 1, Feature Engineering 2
Programming / Coding: QA 3, Troubleshooting 3, Abstract Thinking 3, Debugging 3, Data Structures 4, Unit Test 3, Software Algorithm 3
Languages: Shell/Perl 4, C++ 2, Java/Scala 3, Python 3, R 1
Data Engineering: Data Modeling 4, Data Warehousing… 4, SQL 4, RDB & NoSQL 4, ETL 4, Data Governance 4, Data Pipeline 4, Job/Workflow 4
Big Data Stacks: Appliance (scale up) 3, Distribute… 4, Kafka/Streaming 3, Spark 4, Hadoop 4, SAS 2, MPP 3
ML/DL Frameworks: Pandas 2, H2O 1, Spark MLlib 2, Scikit-learn 2, R Lib 1, DL-Caffe 2, DL-Keras 2, DL-TensorFlow 2
APIs/Services/App (Model Serving): Automation 3, Real Time Messaging 2, Cache 2, API/App Framework 3, RPC/RESTful… 3, CI/CD 3, Cloud Native 3
Visualization & Communications: Data Visualization 2, Model Interpretability 1, Business Acumen 2, Communication Skills 1, Business Analysis 1, Presentation 2
20. Ratings for the traditional application developer (0-4 scale)
Data Mining: Algorithm 1, ML Modeling 1, Research 1, Math 2, Statistics 1, Hypertuning 1, Feature Engineering 1
Programming / Coding: QA 3, Troubleshooting 4, Abstract Thinking 4, Debugging 4, Data Structures 3, Unit Test 4, Software Algorithm 4
Languages: Shell/Perl 2, C++ 3, Java/Scala 4, Python 2, R 1
Data Engineering: Data Modeling 2, Data Warehousing… 1, SQL 3, RDB & NoSQL 3, ETL 2, Data Governance 2, Data Pipeline 2, Job/Workflow 3
Big Data Stacks: Appliance (scale up) 1, Distribute… 1, Kafka/Streaming 2, Spark 1, Hadoop 1, SAS 1, MPP 1
ML/DL Frameworks: Pandas 1, H2O 1, Spark MLlib 1, Scikit-learn 1, R Lib 1, DL-Caffe 1, DL-Keras 1, DL-TensorFlow 1
APIs/Services/App (Model Serving): Automation 3, Real Time Messaging 4, Cache 4, API/App Framework 4, RPC/RESTful… 4, CI/CD 4, Cloud Native 3
Visualization & Communications: Data Visualization 2, Model Interpretability 1, Business Acumen 3, Communication Skills 2, Business Analysis 1, Presentation 2
21. Ratings for the modern AI engineer (0-4 scale)
Data Mining: Algorithm 3, ML Modeling 2, Research 2, Math 3, Statistics 2, Hypertuning 3, Feature Engineering 3
Programming / Coding: QA 3, Troubleshooting 4, Abstract Thinking 4, Debugging 3, Data Structures 4, Unit Test 3, Software Algorithm 4
Languages: Shell/Perl 3, C++ 2, Java/Scala 4, Python 4, R 2
Data Engineering: Data Modeling 3, Data Warehousing… 3, SQL 3, RDB & NoSQL 3, ETL 3, Data Governance 3, Data Pipeline 4, Job/Workflow 3
Big Data Stacks: Appliance (scale up) 3, Distribute… 4, Kafka/Streaming 4, Spark 4, Hadoop 4, SAS 2, MPP 3
ML/DL Frameworks: Pandas 3, H2O 2, Spark MLlib 4, Scikit-learn 3, R Lib 2, DL-Caffe 3, DL-Keras 4, DL-TensorFlow 4
APIs/Services/App (Model Serving): Automation 3, Real Time Messaging 4, Cache 4, API/App Framework 4, RPC/RESTful… 4, CI/CD 3, Cloud Native 3
Visualization & Communications: Data Visualization 4, Model Interpretability 3, Business Acumen 3, Communication Skills 3, Business Analysis 2, Presentation 3
22. Growing path 1: traditional data scientist -> AI Engineer
Areas that need enhancement, and the training / improving approach:
• Programming / Coding and Languages (Java/Scala ++, Python +) -> a programming course or hands-on project; no need for a CS Master
• Data Engineering and Big Data Stacks (Spark *****, Hadoop ***) -> a 6-month certificate, plus using Deep Learning to avoid the gap
• API / application design and implementation (Essentials) and Model Serving
• Deep Learning -> become an expert, from the bottom to the top
23. Growing path 2: traditional data engineer -> AI Engineer
Areas that need enhancement, and the training / improving approach:
• Languages (Java/Scala +, Python +, R ++)
• ML Frameworks and Deep Learning -> Fast.ai, GitHub, Kaggle, Coursera/Udemy; at least a 6-month DL certificate; use Deep Learning to simplify data mining
• API / application design and implementation (Advanced) and Model Serving
• Visualization & Communication (visualization, communication skills, presentation) -> a secondary DS/BA Master, or at least a 6-month certificate
24. Growing path 3: traditional application developer -> AI Engineer
Areas that need enhancement, and the training / improving approach:
• Languages (Python +++, R ++)
• Data Engineering and Big Data Stacks (ETL Essentials, Data Pipeline ++) -> Hadoop + Spark Essentials; at least a 6-month certificate; use Deep Learning to avoid the gap
• ML Frameworks and Deep Learning -> Fast.ai, GitHub, Kaggle, Coursera/Udemy; at least a 6-month DL certificate; use Deep Learning to simplify data mining
• Visualization & Communication (visualization, communication skills, presentation) -> a secondary DS/BA Master, or at least a 6-month certificate
27. Collaborative Filtering: model selection
• Spark MLlib ALS
• Spark MLlib KMeans
• NCF on Spark
28. Spark MLlib
Enables parallel, distributed ML for large datasets on Spark clusters.
• Offers a set of parallelized machine learning algorithms
• Supports model selection (hyperparameter tuning) using cross-validation and train-validation split
• Supports Java, Scala, and Python apps using the DataFrame-based API
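The train-validation-split model selection mentioned above boils down to: fit one model per candidate hyperparameter on the training portion, keep whichever scores best on the held-out portion. A plain-Python sketch (MLlib's TrainValidationSplit does this over DataFrames and Estimators; the ridge-style 1-D model and names here are illustrative):

```python
# Synthetic data: y is roughly 2x, with a small offset.
data = [(x, 2.0 * x + 0.1) for x in range(10)]
train, valid = data[:8], data[8:]        # train-validation split

def fit(train, reg):
    # Closed-form ridge-style slope: sum(xy) / (sum(x^2) + reg)
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    return sxy / (sxx + reg)

def val_error(slope, valid):
    return sum((y - slope * x) ** 2 for x, y in valid)

# Grid of candidate regularization values; keep the best on validation data.
best_reg = min([0.0, 1.0, 100.0],
               key=lambda reg: val_error(fit(train, reg), valid))
print(best_reg)
```

Cross-validation is the same loop, but averaging the validation error over several train/validation folds instead of a single split.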
30. Spark MLlib ML Pipeline
DataFrame: Spark ML uses DataFrames rather than regular RDDs because they can hold a variety of data types (e.g. feature vectors, true labels, and predictions).
Transformer: a Transformer converts a DataFrame into another DataFrame, usually by appending columns (since a Spark DataFrame is immutable, it actually creates a new DataFrame). A Transformer implements the method transform().
Estimator: an Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. It implements the method fit(), which takes a DataFrame as input and returns a Model (which is also a Transformer).
Pipeline: chains multiple Transformers and Estimators, each as a stage, to specify an ML workflow. These stages are run in order, and the input DataFrame is transformed as it passes through each stage.
Parameter: all Transformers and Estimators share a common API for specifying parameters.
Evaluator: evaluates model performance. The Evaluator can be:
• RegressionEvaluator for regression problems,
• BinaryClassificationEvaluator for binary data, or
• MulticlassClassificationEvaluator for multiclass problems.
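The Transformer / Estimator / Pipeline contract above can be sketched in plain Python over lists of row dicts instead of Spark DataFrames. The class names mirror the Spark ML concepts, but this is a conceptual sketch, not Spark's API:

```python
class Tokenizer:                       # a Transformer
    def transform(self, df):
        # Appends a "words" column; the input rows are left untouched.
        return [dict(row, words=row["text"].split()) for row in df]

class CountingEstimator:               # an Estimator
    def fit(self, df):
        # "Learn" something from the data (here, a vocabulary) ...
        vocab = sorted({w for row in df for w in row["words"]})
        class Model:                   # ... and return a fitted Model,
            def transform(self, df):   # which is itself a Transformer.
                return [dict(row, n=len(row["words"])) for row in df]
        return Model()

class Pipeline:
    def __init__(self, stages):
        self.stages = stages
    def fit(self, df):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):  # Estimator: fit it, use the Model
                stage = stage.fit(df)
            fitted.append(stage)
            df = stage.transform(df)   # feed the next stage in order
        return fitted                  # the "PipelineModel": all Transformers

df = [{"text": "spark ml pipeline"}, {"text": "hello"}]
model = Pipeline([Tokenizer(), CountingEstimator()]).fit(df)
out = df
for stage in model:                    # apply the fitted pipeline
    out = stage.transform(out)
print([row["n"] for row in out])       # [3, 1]
```

Note how fit() walks the stages in order, transforming the data as it goes, and how every fitted stage ends up as a Transformer: exactly the contract the slide describes.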