The talk will introduce Ludwig, a deep learning toolbox that allows users to train models and use them for prediction without the need to write code. It is unique in its ability to make deep learning easier to understand for non-experts and to enable faster model-improvement iteration cycles for experienced machine learning developers and researchers alike. By using Ludwig, experts and researchers can simplify the prototyping process and streamline data processing so that they can focus on developing deep learning architectures.
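For a concrete picture, here is a minimal sketch of Ludwig's declarative style using its Python API; the column names, data file, and exact call signatures are illustrative and may vary across Ludwig versions.

```python
# A minimal sketch of Ludwig's code-free, config-driven workflow.
# The columns ("review", "sentiment") and data.csv are hypothetical.
from ludwig.api import LudwigModel

config = {
    "input_features": [{"name": "review", "type": "text"}],
    "output_features": [{"name": "sentiment", "type": "category"}],
}

model = LudwigModel(config)                      # model built from a declarative config
results = model.train(dataset="data.csv")        # no custom training code needed
predictions = model.predict(dataset="data.csv")  # batch prediction on the same file
print(predictions)
```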
Bio:
Piero Molino is a Senior Research Scientist at Uber AI with a focus on machine learning for language and dialogue. Piero completed a PhD on Question Answering at the University of Bari, Italy. He founded QuestionCube, a startup that built a framework for semantic search and QA. He worked for Yahoo Labs in Barcelona on learning to rank and for IBM Watson in New York on natural language processing with deep learning, and then joined Geometric Intelligence, where he worked on grounded language understanding. After Uber acquired Geometric Intelligence, he became one of the founding members of Uber AI Labs.
This talk discusses the different ways models can be served with MLflow. We will cover both the open source MLflow and Databricks-managed MLflow ways to serve models, and the basic differences between batch scoring and real-time scoring, with special emphasis on the new, upcoming Databricks production-ready model serving.
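As a rough illustration of the two modes, here is a hedged sketch using MLflow's Python API and CLI; the model URI and input columns are hypothetical placeholders.

```python
# Batch vs. real-time scoring with MLflow; model name and features are made up.
import mlflow.pyfunc
import pandas as pd

# Batch scoring: load the model into the current process and score a DataFrame.
model = mlflow.pyfunc.load_model("models:/my_model/Production")
batch_input = pd.DataFrame({"feature_a": [1.0, 2.0], "feature_b": [3.0, 4.0]})
print(model.predict(batch_input))

# Real-time scoring: serve the same model behind a local REST endpoint
# (run from a shell):
#   mlflow models serve -m "models:/my_model/Production" -p 5000
# then POST JSON records to http://localhost:5000/invocations
```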
Experimentation to Industrialization: Implementing MLOps | Databricks
In this presentation, drawing upon Thorogood’s experience with a customer’s global Data & Analytics division as their MLOps delivery partner, we share important learnings and takeaways from delivering productionized ML solutions and shaping MLOps best practices and organizational standards needed to be successful.
We open by providing high-level context & answering key questions such as “What is MLOps exactly?” & “What are the benefits of establishing MLOps Standards?”
The subsequent presentation focuses on our learnings & best practices. We start by discussing common challenges when refactoring experimentation use-cases & how to best get ahead of these issues in a global organization. We then outline an Engagement Model for MLOps addressing: People, Processes, and Tools. ‘Processes’ highlights how to manage the often siloed data science use case demand pipeline for MLOps & documentation to facilitate seamless integration with an MLOps framework. ‘People’ provides context around the appropriate team structures & roles to be involved in an MLOps initiative. ‘Tools’ addresses key requirements of tools used for MLOps, considering the match of services to use-cases.
In this introduction to Apache Hive, the following topics are covered (a brief HiveQL sketch follows the list):
1. Hive Origin
2. Hive philosophy and architecture
3. Hive vs. RDBMS
4. HiveQL and Hive Shell
5. Managing tables
6. Data types and schemas
7. Querying data
8. HiveODBC
9. Resources
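As promised above, a brief HiveQL sketch, driven from Python via PyHive; the host, table, and columns are hypothetical, and a running HiveServer2 on port 10000 is assumed.

```python
# A minimal HiveQL example through PyHive; names are hypothetical.
from pyhive import hive

conn = hive.Connection(host="hive-server.example.com", port=10000)
cur = conn.cursor()

# Managed table creation and a simple aggregate query (HiveQL).
cur.execute("""
    CREATE TABLE IF NOT EXISTS pageviews (
        url STRING,
        views INT
    ) STORED AS ORC
""")
cur.execute("SELECT url, SUM(views) FROM pageviews GROUP BY url")
for row in cur.fetchall():
    print(row)
```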
Empower Splunk and other SIEMs with the Databricks Lakehouse for Cybersecurity | Databricks
Cloud, cost, complexity, and threat coverage are top of mind for every security leader. The Lakehouse architecture has emerged in recent years to help address these concerns with a single unified architecture for all your threat data, analytics, and AI in the cloud. In this talk, we will show how the Lakehouse is essential for effective cybersecurity and popular security use cases. We will also share how Databricks empowers the security data scientists and analysts of the future, and how this technology allows cyber datasets to be used to solve business problems.
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginners | Simplilearn
This presentation about Spark SQL will help you understand what Spark SQL is, Spark SQL features, its architecture, the DataFrame API, the data source API, the Catalyst optimizer, and running SQL queries, and it closes with a Spark SQL demo. Spark SQL is Apache Spark's module for working with structured and semi-structured data. It originated to overcome the limitations of Apache Hive. Now, let us get started and understand Spark SQL in detail.
The below topics are explained in this Spark SQL presentation (a short PySpark sketch follows the list):
1. What is Spark SQL?
2. Spark SQL features
3. Spark SQL architecture
4. Spark SQL - Dataframe API
5. Spark SQL - Data source API
6. Spark SQL - Catalyst optimizer
7. Running SQL queries
8. Spark SQL demo
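As promised above, a short PySpark sketch touching two of the listed topics, the DataFrame API and running SQL queries; the sample data is made up.

```python
# DataFrame API vs. SQL over the same data; Catalyst plans both the same way.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])

# DataFrame API style.
df.filter(df.age > 30).show()

# SQL style over a temporary view.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```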
This Apache Spark and Scala certification training is designed to advance your expertise working with the Big Data Hadoop Ecosystem. You will master essential skills of the Apache Spark open source framework and the Scala programming language, including Spark Streaming, Spark SQL, machine learning programming, GraphX programming, and Shell Scripting Spark. This Scala Certification course will give you vital skillsets and a competitive advantage for an exciting career as a Hadoop Developer.
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
Simplilearn’s Apache Spark and Scala certification training is designed to:
1. Advance your expertise in the Big Data Hadoop Ecosystem
2. Help you master essential Apache and Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming and Shell Scripting Spark
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos
What skills will you learn?
By completing this Apache Spark and Scala course you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark
5. Master Structured Query Language (SQL) using SparkSQL
6. Gain a thorough understanding of Spark streaming features
7. Master and describe the features of Spark ML programming and GraphX programming
Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training
This developer-focused webinar will explain how to use the Cypher graph query language. Cypher, a query language designed specifically for graphs, allows for expressing complex graph patterns using simple ASCII art-like notation and offers a simple but expressive approach for working with graph data.
During this webinar you'll learn (a small query sketch follows the list):
-Basic Cypher syntax
-How to construct graph patterns using Cypher
-Querying existing data
-Data import with Cypher
-Using aggregations such as statistical functions
-Extending the power of Cypher using procedures and functions
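As a small taste of the syntax covered, here is a hedged sketch running Cypher's ASCII-art patterns through the official Neo4j Python driver; the URI, credentials, and data are hypothetical.

```python
# Cypher's ASCII-art pattern notation, executed via the Neo4j Python driver.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # (a)-[:KNOWS]->(b) is the ASCII-art notation for a directed relationship.
    session.run("CREATE (:Person {name: 'Ada'})-[:KNOWS]->(:Person {name: 'Alan'})")
    result = session.run(
        "MATCH (a:Person)-[:KNOWS]->(b:Person) RETURN a.name AS a, b.name AS b"
    )
    for record in result:
        print(record["a"], "knows", record["b"])

driver.close()
```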
Databricks is a Software-as-a-Service-like experience (or Spark-as-a-service): a tool for curating and processing massive amounts of data, developing, training, and deploying models on that data, and managing the whole workflow process throughout the project. It is for those who are comfortable with Apache Spark, as it is 100% based on Spark and is extensible with support for Scala, Java, R, and Python alongside Spark SQL, GraphX, Streaming, and the Machine Learning Library (MLlib). It has built-in integration with many data sources, has a workflow scheduler, allows for real-time workspace collaboration, and has performance improvements over traditional Apache Spark.
Build Large-Scale Data Analytics and AI Pipeline Using RayDP | Databricks
A large-scale end-to-end data analytics and AI pipeline usually involves data processing frameworks such as Apache Spark for massive data preprocessing, and ML/DL frameworks for distributed training on the preprocessed data. A conventional approach is to use two separate clusters and glue multiple jobs. Other solutions include running deep learning frameworks in an Apache Spark cluster, or using workflow orchestrators like Kubeflow to stitch distributed programs together. All these options have their own limitations. We introduce Ray as a single substrate for distributed data processing and machine learning. We also introduce RayDP, which allows you to start an Apache Spark job on Ray in your Python program and utilize Ray’s in-memory object store to efficiently exchange data between Apache Spark and other libraries. We will demonstrate how this makes building an end-to-end data analytics and AI pipeline simpler and more efficient.
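A hedged sketch of that pattern, assuming RayDP's documented init_spark entry point; the cluster sizing values are illustrative.

```python
# Start an Apache Spark job inside a Ray program via RayDP.
import ray
import raydp

ray.init()

# Spin up Spark executors on Ray and get a SparkSession back.
spark = raydp.init_spark(
    app_name="raydp-example",
    num_executors=2,
    executor_cores=2,
    executor_memory="2GB",
)

df = spark.range(0, 1000)   # preprocess with Spark as usual
print(df.count())

raydp.stop_spark()          # hand the resources back to Ray
ray.shutdown()
```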
CQRS and Event Sourcing, An Alternative Architecture for DDD | Dennis Doomen
Most of us will be familiar with the standard 3- or 4-layer architecture you often see in larger enterprise systems. Some are already practicing Domain Driven Design and work together with the business to clarify the domain concepts. Perhaps you’ve noticed that it is difficult to get the intention of the 'verbs' from that domain into this standard architecture. If performance is an important requirement as well, then you might have discovered that an Object-Relational Mapper and a relational database are not always the best solution.
One of the main reasons for this is the fact that the interests of a consistent domain that takes into account the many business rules, and those of data reporting and presentation, are conflicting. That’s why Bertrand Meyer introduced the Command Query Separation principle.
An architecture based on this principle combined with the Event Sourcing concept provides the ideal architecture for building high-performance systems designed using DDD. Well-known bloggers like Udi Dahan and Greg Young have already spent quite a lot of posts on this, and this year’s Developer Days had some coverage as well.
But how do you build such a system with the .NET Framework? Is it really as complex as some claim, or is it just different work?
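While the talk targets .NET, the core idea is language-agnostic; here is a minimal Python sketch of the CQRS + Event Sourcing pattern: commands mutate state only by appending events, and queries read projections folded from the event stream.

```python
# Commands append events to an append-only store; queries fold projections.
from dataclasses import dataclass

@dataclass
class MoneyDeposited:          # an event: a fact that has happened
    account_id: str
    amount: int

event_store: list = []         # append-only log, the single source of truth

def deposit(account_id: str, amount: int) -> None:
    """Command side: validate, then append an event (never update in place)."""
    if amount <= 0:
        raise ValueError("deposit must be positive")
    event_store.append(MoneyDeposited(account_id, amount))

def balance(account_id: str) -> int:
    """Query side: a projection folded from the event history."""
    return sum(e.amount for e in event_store
               if isinstance(e, MoneyDeposited) and e.account_id == account_id)

deposit("acc-1", 100)
deposit("acc-1", 50)
print(balance("acc-1"))        # 150, derived entirely from events
```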
Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark.
The catalyst for the success of automobiles came not through the invention of the car but rather through the establishment of an innovative assembly line. History shows us that the ability to mass-produce and distribute a product is the key to driving adoption of any innovation, and machine learning is no different. MLOps is the assembly line of machine learning, and in this presentation we will discuss the core capabilities your organization should focus on to implement a successful MLOps system.
Building a Feature Store around Dataframes and Apache Spark | Databricks
A Feature Store enables machine learning (ML) features to be registered, discovered, and used as part of ML pipelines, thus making it easier to transform and validate the training data that is fed into machine learning systems. Feature stores can also enable consistent engineering of features between training and inference, but to do so, they need a common data processing platform.
ML development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models. To address these problems, many companies are building custom “ML platforms” that automate this lifecycle, but even these platforms are limited to a few supported algorithms and to each company’s internal infrastructure. In this talk, I present MLflow, a new open source project from Databricks that aims to design an open ML platform where organizations can use any ML library and development tool of their choice to reliably build and share ML applications. MLflow introduces simple abstractions to package reproducible projects, track results, and encapsulate models that can be used with many existing tools, accelerating the ML lifecycle for organizations of any size.
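As a flavor of the abstractions mentioned, here is a minimal sketch of MLflow tracking and model packaging; the parameter and metric are arbitrary, and exact signatures may vary slightly across MLflow versions.

```python
# Track a run's parameters, metrics, and model artifact with MLflow.
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

with mlflow.start_run():
    mlflow.log_param("C", 0.5)                               # track a hyperparameter
    model = LogisticRegression(C=0.5, max_iter=200).fit(X, y)
    mlflow.log_metric("train_accuracy", model.score(X, y))   # track a result
    mlflow.sklearn.log_model(model, "model")                 # package a reusable model
```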
REST API debate: OData vs GraphQL vs ORDS | Sumit Sarkar
Learn the latest industry trends surrounding REST API standardization and what this means for your roadmap. OData is an OASIS standard REST API and has been established among tech companies such as Microsoft, SAP, CA, IBM and Salesforce. GraphQL was created by Facebook in 2015 and has already been deployed at tech companies such as Facebook, Shopify and Intuit. ORDS is the Oracle REST API and delivers similar standardization for Oracle-centric applications.
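To make the contrast concrete, here is a hedged sketch of the same kind of query in two of the styles debated; the endpoints and fields are hypothetical.

```python
# OData vs. GraphQL: the same "products over $10" query, two styles.
import requests

# OData: the query lives in URL system options ($filter, $select).
odata = requests.get(
    "https://example.com/odata/Products",
    params={"$filter": "Price gt 10", "$select": "Name,Price"},
)

# GraphQL: a single endpoint; the client POSTs the shape of the data it wants.
graphql = requests.post(
    "https://example.com/graphql",
    json={"query": "{ products(minPrice: 10) { name price } }"},
)

print(odata.status_code, graphql.status_code)
```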
Every business today wants to leverage data to drive strategic initiatives with machine learning, data science and analytics — but runs into challenges from siloed teams, proprietary technologies and unreliable data.
That’s why enterprises are turning to the lakehouse because it offers a single platform to unify all your data, analytics and AI workloads.
Join our How to Build a Lakehouse technical training, where we’ll explore how to use Apache Spark™, Delta Lake, and other open source technologies to build a better lakehouse. This virtual session will include concepts, architectures and demos.
Here’s what you’ll learn in this 2-hour session (a small Delta Lake sketch follows these points):
How Delta Lake combines the best of data warehouses and data lakes for improved data reliability, performance and security
How to use Apache Spark and Delta Lake to perform ETL processing, manage late-arriving data, and repair corrupted data directly on your lakehouse
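As promised above, a small PySpark sketch of the Delta Lake basics; the paths are hypothetical and the Delta Lake jars are assumed to be on the Spark classpath.

```python
# Delta Lake on open-source Spark: ACID writes and reliable reads.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "ok"), (2, "late")], ["id", "status"])
df.write.format("delta").mode("overwrite").save("/tmp/events")  # ACID write

# Upserts, schema enforcement, and time travel come from the Delta log.
spark.read.format("delta").load("/tmp/events").show()
```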
Mark and Wes will talk about Cypher optimization techniques based on real queries as well as the theoretical underlying processes. They'll start from the basics of "what not to do" and how to take advantage of indexes, and continue to the subtle ways of ordering MATCH/WHERE/WITH clauses for optimal performance as of the 2.0.0 release.
This Hadoop presentation will help you understand the different tools present in the Hadoop ecosystem. This Hadoop video will take you through an overview of the important tools of the Hadoop ecosystem, which include Hadoop HDFS, Pig, YARN, Hive, Apache Spark, Mahout, Apache Kafka, Storm, Sqoop, Apache Ranger, and Oozie, and will also discuss the architecture of these tools. It will cover the different tasks of Hadoop such as data storage, data processing, cluster resource management, data ingestion, machine learning, streaming, and more. Now, let us get started and understand each of these tools in detail.
Below topics are explained in this Hadoop ecosystem presentation:
1. What is Hadoop ecosystem?
1. Pig (Scripting)
2. Hive (SQL queries)
3. Apache Spark (Real-time data analysis)
4. Mahout (Machine learning)
5. Apache Ambari (Management and monitoring)
6. Kafka & Storm
7. Apache Ranger & Apache Knox (Security)
8. Oozie (Workflow system)
9. Hadoop MapReduce (Data processing)
10. Hadoop Yarn (Cluster resource management)
11. Hadoop HDFS (Data storage)
12. Sqoop & Flume (Data collection and ingestion)
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to (a short RDD sketch follows the list):
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, Flume architecture, sources, Flume sinks, channels, and Flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand Resilient Distributed Datasets (RDD) in detail
12. Implement and build Spark applications
13. Learn Spark SQL, creating, transforming, and querying data frames
14. Understand the common use-cases of Spark and the various interactive algorithms
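As promised above, a classic RDD word count illustrating the "functional programming in Spark" objective; the HDFS input path is hypothetical.

```python
# Word count with the RDD API: flatMap, map, reduceByKey.
from pyspark import SparkContext

sc = SparkContext(appName="wordcount")

counts = (
    sc.textFile("hdfs:///data/input.txt")      # read lines from HDFS
      .flatMap(lambda line: line.split())      # split lines into words
      .map(lambda word: (word, 1))             # pair each word with a count of 1
      .reduceByKey(lambda a, b: a + b)         # sum the counts per word
)
print(counts.take(10))
```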
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training.
Strata CA 2019: From Jupyter to Production | Manu Mukerji
Jupyter is very popular for data science, data exploration, and visualization. This talk is about how to use it for AI/ML in a production environment.
General flow of the talk (a Papermill sketch follows the list):
How things can go wrong with QA and production releases when using a notebook
Common Jupyter ML examples
Standard ML flow
Training in production
Model creation
Testing in production
Papermill and Jupyter
Production workflows with Sagemaker
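As promised above, a minimal sketch of the Papermill pattern: executing a parameterized notebook as a production job. The notebook names and parameters are hypothetical.

```python
# Execute a parameterized notebook as a reproducible production job.
import papermill as pm

pm.execute_notebook(
    "train_model.ipynb",            # source notebook with a "parameters" cell
    "runs/train_model_2019.ipynb",  # executed copy, kept as an audit artifact
    parameters={"train_date": "2019-03-01", "epochs": 10},
)
```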
Speaker
Manu Mukerji is senior director of data, machine learning, and analytics at 8×8. Manu’s background lies in cloud computing and big data, working on systems handling billions of transactions per day in real time. He enjoys building and architecting scalable, highly available data solutions and has extensive experience working in online advertising and social media.
Start machine learning in 5 simple steps | Renjith M P
Simple steps to get started with machine learning.
The use case uses Python programming. The target audience is expected to have very basic Python knowledge.
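A very basic sketch of what such a first use case might look like with scikit-learn, with five typical steps annotated inline; this is illustrative, not the author's exact example.

```python
# A first machine learning use case in five small steps.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                                 # 1. get data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)   # 2. split
model = DecisionTreeClassifier()                                  # 3. choose a model
model.fit(X_tr, y_tr)                                             # 4. train
print(accuracy_score(y_te, model.predict(X_te)))                  # 5. evaluate
```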
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spark | Spark Summit
Recent workload trends indicate rapid growth in the deployment of machine learning, genomics, and scientific workloads using Apache Spark. However, efficiently running these applications on cloud computing infrastructure like Amazon EC2 is challenging, and we find that choosing the right hardware configuration can significantly improve performance and cost. The key to addressing the above challenge is having the ability to predict the performance of applications under various resource configurations, so that we can automatically choose the optimal configuration. We present Ernest, a performance prediction framework for large-scale analytics. Ernest builds performance models based on the behavior of the job on small samples of data and then predicts its performance on larger datasets and cluster sizes. Our evaluation on Amazon EC2 using several workloads shows that our prediction error is low, while the training overhead is less than 5% for long-running jobs.
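For intuition, here is a sketch of the style of model Ernest fits according to the NSDI '16 paper: runtime as a non-negative linear combination of simple scaling terms, trained on a few small runs. The sample measurements below are invented.

```python
# Fit an Ernest-style performance model with non-negative least squares.
import numpy as np
from scipy.optimize import nnls

def features(machines, scale):
    # Terms from the Ernest paper: fixed cost, parallel work, coordination.
    return [1.0, scale / machines, np.log(machines), machines]

# (machines, input scale, measured runtime) from small training experiments.
runs = [(2, 0.125, 30.0), (4, 0.125, 18.0), (4, 0.25, 32.0), (8, 0.25, 20.0)]

A = np.array([features(m, s) for m, s, _ in runs])
b = np.array([t for _, _, t in runs])
theta, _ = nnls(A, b)                      # non-negative least squares fit

# Predict runtime on a bigger cluster with the full dataset.
print(np.dot(features(64, 1.0), theta))
```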
Introduction to Machine Learning for newcomers. It will show you some basic concepts such as supervised learning, unsupervised learning, classification, regression, under/overfitting, clustering, and anomaly detection, and how to evaluate models with some common measures. It illustrates examples through scikit-learn and TensorFlow code.
This slide deck gives an overview of the Azure Machine Learning service. It highlights the benefits of the Azure Machine Learning workspace, Automated Machine Learning, and notebook script integration.
Case Study: How CA Went From 40 Days to Three Days Building Crystal-Clear Tes... | CA Technologies
Here at CA Technologies, our development teams share many of the same challenges in producing quality software as our customers do.
For more information on DevOps: Continuous Delivery, please visit: http://cainc.to/CAW17-CD
PythonQuants conference - QuantUniversity presentation - Stress Testing in th... | QuantUniversity
In this talk, we will discuss the key aspects of model verification and validation and introduce a novel approach to do stress and scenario tests leveraging parallel and distributed computing technologies and the cloud.
We have developed a platform that leverages cloud-based technologies to run stress tests on a massive scale without having to invest in fixed in-house architectures. Through a case study, we will illustrate best practices for stress and scenario testing for model verification and validation. These best practices are meant to provide practical tips for companies embarking on a formal model risk management program or enhancing their stress testing methodologies.
This tutorial introduces the basic idea of machine learning with a very simple example. Machine learning teaches machines (and me too) to learn to carry out tasks and concepts by themselves. It is that simple, so here is an overview:
http://www.softwareschule.ch/examples/machinelearning.jpg
(DAT311) Large-Scale Genomic Analysis with Amazon Redshift | Amazon Web Services
Genomics analysis is one of the biggest data problems out there. With DNA sequencing finally down to an affordable cost, the current bottleneck is shifting from sequencing genomes to deriving meaning from genomes at a large scale. Learn how Human Longevity, Inc., uses Amazon Redshift to analyze thousands of whole genomes every month. Dive into their detailed architecture, including how they ingest terabytes of genomic information each day. Learn how they optimize their schema, rapidly analyzing thousands of genomes in a single query using a "select, aggregate, annotate" paradigm. Finally, learn best practices for using Amazon Redshift to accelerate research.
Certification Study Group - Professional ML Engineer Session 3 (Machine Learn... | gdgsurrey
Dive into the essentials of ML model development, processes, and techniques to combat underfitting and overfitting, explore distributed training approaches, and understand model explainability. Enhance your skills with practical insights from a seasoned expert.
Training and tuning models with lengthy training cycles like those in deep learning can be extremely expensive and may sometimes involve techniques that degrade performance. We'll explore recent research on optimization strategies to efficiently tune these types of deep learning models. We will provide benchmarks and comparisons to other popular methods for optimizing the models, and we'll recommend valuable areas for further applied research.
Similar to Ludwig: A code-free deep learning toolbox | Piero Molino, Uber AI
How to use the Economic Complexity Index to guide innovation plans | Data Science Milan
In this talk Mauro Pelucchi will present the Economic Complexity Index (ECI) and the Product Complexity Index (PCI), two network measures that provide unique insights into economic development patterns. We will show how to compute these metrics and explore the network theory behind these indices (Hidalgo and Hausmann, 2009).
The measures are also related to various dimensionality reduction methods and can be used to determine distances between nodes based on their similarity. Finally, we will discover how to interpret these metrics to compare countries, markets, and products, and to guide our plans in a data-driven context.
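For intuition, here is a hedged numpy sketch of the eigenvector formulation of the ECI from Hidalgo and Hausmann's framework; the binary country-by-product matrix is a toy example, and the eigenvector's sign convention varies in practice.

```python
# ECI via the eigenvector formulation of the method of reflections.
import numpy as np

# Toy binary matrix: M[c, p] = 1 if country c exports product p competitively.
M = np.array([[1, 1, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 0, 1]], dtype=float)

diversity = M.sum(axis=1)   # k_c,0: number of products per country
ubiquity = M.sum(axis=0)    # k_p,0: number of countries per product

# M~[c, c'] = sum_p M[c,p] * M[c',p] / (k_c * k_p); ECI is its 2nd eigenvector.
M_tilde = (M / diversity[:, None]) @ (M / ubiquity).T
eigvals, eigvecs = np.linalg.eig(M_tilde)
order = np.argsort(eigvals.real)[::-1]
eci = eigvecs.real[:, order[1]]            # eigenvector of 2nd-largest eigenvalue
eci = (eci - eci.mean()) / eci.std()       # standardized, as commonly reported
print(eci)
```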
"You don't need a bigger boat": serverless MLOps for reasonable companiesData Science Milan
It is indeed a wonderful time to build machine learning systems, as the growing ecosystems of tools and shared best practices make even small teams incredibly productive at scale. In this talk, we present our philosophy for modern, no-nonsense data pipelines, highlighting the advantages of an (almost) pure serverless and open-source approach, and showing how the entire toolchain works - from raw data to model serving - on a real-world dataset.
Finally, we argue that the crucial component for analyzing data pipelines is not the model per se, but the surrounding DAG, and present our proposal for producing automated "DAG cards" from Metaflow classes.
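For readers unfamiliar with Metaflow, here is a minimal flow showing the explicit step DAG from which artifacts like "DAG cards" could be derived; the steps are trivial stand-ins, not the speakers' pipeline.

```python
# A minimal Metaflow DAG; run with:  python train_flow.py run
from metaflow import FlowSpec, step

class TrainFlow(FlowSpec):

    @step
    def start(self):
        self.data = [1, 2, 3]          # stand-in for raw data loading
        self.next(self.train)

    @step
    def train(self):
        self.model = sum(self.data)    # stand-in for model fitting
        self.next(self.end)

    @step
    def end(self):
        print("model:", self.model)

if __name__ == "__main__":
    TrainFlow()
```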
Bio:
Jacopo Tagliabue was co-founder and CTO of Tooso, an A.I. company in San Francisco acquired by Coveo in 2019. Jacopo is currently the Lead A.I. Scientist at Coveo. When not busy building A.I. products, he is exploring research topics at the intersection of language, reasoning and learning, with several publications at major conferences (e.g. WWW, SIGIR, RecSys, NAACL). In previous lives, he managed to get a Ph.D., do scienc-y things for a pro basketball team, and simulate a pre-Columbian civilization.
Topics: MLOps, Metaflow, model cards.
Question generation using Natural Language Processing by QuestGen.AI | Data Science Milan
Manual question generation (worksheets and quizzes) in edtech is not scalable for online transformation and leads to increased workload on teachers due to the pandemic. In this session, we will explore natural language processing (NLP) techniques to generate Multiple Choice Questions automatically from any text content using the T5 transformer model. We will also explore methods to deploy the T5 question generation model for fast CPU inference using ONNX conversion and quantization.
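A hedged sketch of T5-based generation with Hugging Face transformers; the checkpoint and prompt prefix are placeholders, since they depend on how the question-generation model was fine-tuned.

```python
# Generate text from a T5 model; swap in a question-generation fine-tune.
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "t5-small"   # stand-in for a question-generation checkpoint
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

context = "The mitochondria is the powerhouse of the cell."
inputs = tokenizer("generate question: " + context, return_tensors="pt")
outputs = model.generate(**inputs, max_length=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```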
Bio:
Ramsri is a Lead Data Scientist with 8+ years of work experience across Silicon Valley, Singapore, and India. Most recently he had been a co-founder and CTO of a funded AI-assisted assessments startup. He has spent the last 2 years developing question generation models in edtech and also released an open-source library on the same.
Abstract: Data preparation and modelling are the activities that take most of the time in a typical data scientist workday. In this session we’ll see how AWS services for Analytics and data management can be effectively used and integrated in AI/ML pipelines. We’ll focus on AWS Glue, AWS Glue DataBrew and AWS Data Wrangler with a bit of theory and hands-on demos.
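A hedged sketch of the AWS Data Wrangler (awswrangler) pattern the session covers; the bucket, database, and table names are hypothetical, and AWS credentials are assumed to be configured.

```python
# Prepare data with awswrangler: write Parquet to S3, register in Glue, query via Athena.
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})

# Write a DataFrame to S3 as a Parquet dataset and register it in the Glue catalog.
wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/prepared/",
    dataset=True,
    database="analytics",
    table="values",
)

# Read it back with a SQL query through Athena.
out = wr.athena.read_sql_query("SELECT * FROM values", database="analytics")
print(out)
```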
Bio:
Francesco Marelli is a senior solutions architect at Amazon Web Services. He has lived and worked in the UK, Italy, Switzerland, and other countries in EMEA. He is specialized in the design and implementation of Analytics, Data Management and Big Data systems. Francesco also has strong experience in systems integration and in the design and implementation of applications.
Topics: machine learning pipelines, AWS, cloud.
MLOps with a Feature Store: Filling the Gap in ML Infrastructure | Data Science Milan
A Feature Store enables machine learning (ML) features to be registered, discovered, and used as part of ML pipelines, thus making it easier to transform and validate the training data that is fed into machine learning systems. Feature stores can also enable consistent engineering of features between training and inference, but to do so, they need a common data processing platform. The first Feature Stores, developed at hyperscale AI companies such as Uber, Airbnb, and Facebook, enabled feature engineering using domain specific languages, providing abstractions tailored to the companies’ feature engineering domains. However, a general purpose Feature Store needs a general purpose feature engineering, feature selection, and feature transformation platform.
In this talk, we describe how we built a general purpose, open-source Feature Store for ML around dataframes and Apache Spark. We will demonstrate how data engineers can transform and engineer features from backend databases and data lakes, while data scientists can use PySpark to select and transform features into train/test data in a file format of choice (.tfrecords, .npy, .petastorm, etc.) on a file system of choice (S3, HDFS). Finally, we will show how the Feature Store enables end-to-end ML pipelines to be factored into feature engineering and data science stages that each can run at different cadences.
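A generic PySpark sketch of this dataframe-centric flow (deliberately not any specific feature-store API): aggregate raw backend data into reusable features and materialize training data on a file system; the data and paths are made up.

```python
# Engineer features with dataframes, then materialize train data.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("feature-store-sketch").getOrCreate()

orders = spark.createDataFrame(
    [(1, 10.0), (1, 20.0), (2, 5.0)], ["customer_id", "amount"]
)

# Feature engineering: aggregate raw data into reusable, registered features.
features = orders.groupBy("customer_id").agg(
    F.sum("amount").alias("total_spend"),
    F.count("*").alias("num_orders"),
)

# Feature selection + materialization into train data on a file system.
train_df = features.select("customer_id", "total_spend", "num_orders")
train_df.write.mode("overwrite").parquet("/tmp/train_data")
```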
Bio:
Fabio Buso is the head of engineering at Logical Clocks AB, where he leads the Feature Store development. Fabio holds a master's degree in cloud computing and services with a focus on data intensive applications, awarded by a joint program between KTH Stockholm and TU Berlin.
Topics: feature store, MLOps.
Reinforcement Learning is a growing subset of Machine Learning and one of the most important frontiers of Artificial Intelligence. Its goal is to capture higher logic and use more adaptable algorithms than classical Machine Learning.
Formally it denotes a set of algorithms that deal with sequential decision-making and have the potential capability to make highly intelligent decisions depending on their local environment.
Reinforcement Learning problems can be described as an agent that has to make decisions in its environment in order to optimize a cumulative reward, and it is clear that this formalization applies to a great variety of tasks in many different fields.
In this talk, the main features of the most important Reinforcement Learning algorithms will be illustrated and deepened, with some concrete and explanatory examples.
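To ground the agent/environment/reward formalization, here is a compact Q-learning sketch on a toy chain environment invented for illustration.

```python
# Tabular Q-learning on a 5-state chain: reach the rightmost state for reward.
import random

n_states, n_actions = 5, 2          # actions: move left (0) or right (1)
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.1

for _ in range(2000):
    s = 0
    while s != n_states - 1:        # episode ends at the goal state
        if random.random() < epsilon:                     # explore
            a = random.randrange(n_actions)
        else:                                             # exploit
            a = max(range(n_actions), key=lambda i: Q[s][i])
        s2 = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s2 == n_states - 1 else 0.0            # reward only at goal
        # Temporal-difference update toward the cumulative reward.
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

print(Q)  # the learned values prefer action 1 (right) in every state
```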
Bio:
Marco Del Pra
Marco was born in Venice 41 years ago, has two master's degrees (Computer Science and Mathematics), and has two important publications in applied mathematics.
He has been working in Artificial Intelligence for 10 years, mainly as a freelancer. Among others, he worked for the European Commission's Joint Research Center, for Cuebiq, and as Data Science Lead for Microsoft's Artificial Intelligence projects in Italy.
Time Series Classification with Deep Learning | Marco Del Pra | Data Science Milan
Today a lot of data is stored in the form of time series, and with the current wide diffusion of real-time applications, many areas are strongly increasing their interest in applications based on this kind of data: for example finance, advertising, marketing, health care, automated disease detection, biometrics, retail, and identification of anomalies of any kind. It is therefore very interesting to understand the role and potential of machine learning in this sector.
Many methods can be used for the classification of the time series, but all of them, apart from deep learning, require some kind of feature engineering as a separate stage before the classification is performed, and this can imply the loss of some important information and the increase of the development and test time. On the contrary, deep learning models such as recurrent and convolutional neural networks already incorporate this kind of feature engineering internally, optimizing it and eliminating the need to do it manually. Therefore they are able to extract information from the time series in a faster, more direct, and more complete way.
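A hedged Keras sketch of a small convolutional time-series classifier in the spirit of the talk; the data is random and the architecture is purely illustrative.

```python
# A tiny 1D-CNN time-series classifier: feature extraction is learned end to end.
import numpy as np
from tensorflow import keras

X = np.random.randn(100, 64, 1)              # 100 series, 64 steps, 1 channel
y = np.random.randint(0, 2, size=100)        # binary class labels

model = keras.Sequential([
    keras.layers.Conv1D(16, kernel_size=5, activation="relu",
                        input_shape=(64, 1)),   # learns features internally
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=3, batch_size=16, verbose=0)
```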
Bio:
Marco Del Pra
I am 41 years old, I was born in Venice, I have 2 master's degrees (Computer Science and Mathematics). I have been working for about 10 years in Artificial Intelligence, first as Data Scientist, then as Team Leader and finally as Head of Data. Among others, I worked for Microsoft, for the European Commission (JRC of Ispra) and for Cuebiq. I am currently working as a freelancer and I am creating with 2 other cofounders an innovative AI startup. I have 2 important publications in applied mathematics.
Topics: recurrent and convolutional neural networks, deep learning, time-series.
Audience projection of target consumers over multiple domains: a NER and Baye... | Data Science Milan
Traditional market research is generally conducted by questionnaires or other forms of explicit feedback, directly asked to an ad hoc panel of individuals that in aggregate are representative of a larger group of people. Unfortunately, those traditional approaches are often invasive, nonscalable, and biased. Indirect approaches based on sparse and implicit consumer feedback (e.g., social network interactions, web browsing, or online purchases) are more scalable, authentic, and more suitable for real-time consumer insights.
Although those sources of implicit consumer feedback provide relevant and detailed pictures of the population, they individually provide only a limited set of observable behaviors.
The Holy Grail of market research is the ability to merge different sources of consumers interests into an augmented view that connects all the dots across multiple domains.
Unfortunately, user-centric "fusion" algorithms present many limitations in the case of heterogeneous datasets strongly differing in terms of size and density and when the number of sources to merge increases.
We propose a novel approach of Audience Projection able to define a target audience as a subset of the population in a source domain and to project this target to a set of users into a destination dataset.
We will show how libraries such as spaCy can provide Deep Learning implementations for Named Entity Recognition (NER) to match related brands and we will use Bayesian Inference to transfer knowledge from the source domain. This way, we can estimate the probability of the user to belong to the target using the source distribution of volume of interests of common entities as model evidence and the source target size as prior probability.
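A hedged sketch of the two ingredients named above: spaCy NER to extract brand-like entities and a Bayes update for target membership. The text and probabilities are hypothetical, the entity results vary by model, and the small English spaCy model is assumed installed.

```python
# Ingredient 1: extract brand-like entities with spaCy NER.
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes: python -m spacy download en_core_web_sm
doc = nlp("She posted about Nike sneakers and a Tesla test drive.")
brands = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
print(brands)

# Ingredient 2: P(target | interest) via Bayes' rule, with the source-domain
# target size as the prior and interest-volume distributions as likelihoods.
p_target = 0.10                  # prior: target share in the source domain
p_interest_given_target = 0.40   # from source-domain volume of interests
p_interest = 0.15                # overall interest rate in the population
posterior = p_interest_given_target * p_target / p_interest
print(posterior)                 # ~0.27
```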
Bio:
Gianmario Spacagna is the chief scientist and head of AI at Helixa. His team’s mission is building the next generation of behavior algorithms and models of human decision making with careful attention to their potential and effects on society. His experience covers a diverse portfolio of machine learning algorithms and data products across different industries. Previously, he worked as a data scientist in IoT automotive (Pirelli Cyber Technology), retail and business banking (Barclays Analytics Centre of Excellence), threat intelligence (Cisco Talos), predictive marketing (AgilOne), plus some occasional freelancing. He’s a co-author of the book Python Deep Learning, contributor to the “Professional Manifesto for Data Science,” and founder of the Data Science Milan community. Gianmario holds a master’s degree in telematics (Polytechnic of Turin) and software engineering of distributed systems (KTH of Stockholm). After having spent half of his career abroad, he now lives in Milan. His favorite hobbies include home cooking, hiking, and exploring the surrounding nature on his motorcycle.
Weakly Supervised Learning: Introduction and Best Practices
In the talk we will introduce the definitions of the three main types of weakly supervised learning: incomplete, inexact, and inaccurate. We will examine how models can be trained under weak supervision and review real applications of weakly supervised learning, showing how it can improve results and decrease costs.
Bio:
Kristina Khvatova works as a Software Engineer at Softec S.p.A. Currently she is involved in the development of a project for data analysis and visualisation; it includes quantitative and qualitative analysis based on classification, optimisation, time series prediction, and anomaly detection techniques. She obtained a master's degree in Mathematics at Saint Petersburg State University and a master's degree in Computer Science at the University of Milano-Bicocca.
GANs beyond nice pictures: real value of data generation, Alex Honchar | Data Science Milan
GANs beyond nice pictures: real value of data generation (theory and business applications)
About the speaker, Alex Honchar:
I am a machine learning expert currently applying AI in medtech, fintech, and other areas. I also enjoy teaching and blogging (50k+ views monthly) about deep learning applications. As an academia member, I have a track record of scientific publications as well. Besides science, I travel, do sports, and perform card magic.
Continual/Lifelong Learning with Deep Architectures, Vincenzo Lomonaco | Data Science Milan
Humans have the extraordinary ability to learn continually from experience. Not only can we apply previously learned knowledge and skills to new situations, we can also use these as the foundation for later learning. One of the grand goals of AI is building an artificial continually learning agent that constructs a sophisticated understanding of the world from its own experience through the autonomous incremental development of ever more complex skills and knowledge.
"Continual Learning" (CL) is indeed a fast emerging topic in AI concerning the ability to efficiently improve the performance of a deep model over time, dealing with a long (and possibly unlimited) sequence of data/tasks. In this workshop, after a brief introduction of the topic, we’ll implement different Continual Learning strategies and assess them on common vision benchmarks. We’ll conclude the workshop with a look at possible real world applications of CL.
Vincenzo Lomonaco is a Deep Learning PhD student at the University of Bologna and founder of ContinualAI.org. He is also the PhD students representative at the Department of Computer Science of Engineering (DISI) and teaching assistant of the courses “Machine Learning” and “Computer Architectures” in the same department. Previously, he was a Machine Learning software engineer at IDL in-line Devices and a Master Student at the University of Bologna where he graduated cum laude in 2015 with the dissertation “Deep Learning for Computer Vision: a Comparison Between CNNs and HTMs on Object Recognition Tasks".
Processing 3D images has many use cases: for example, improving autonomous driving, enabling digital conversion of old factory buildings, and enabling augmented reality solutions for medical surgeries. 3D images also help in 3D modeling and in the safety evaluation of products.
3D image processing brings enormous benefits but also amplifies computing cost. The size of the point cloud, the number of points, sparse and irregular point clouds, and the adverse impact of light reflections, (partial) occlusions, etc., make it difficult for engineers to process point clouds.
Moving from using hand crafted features to using deep learning techniques to semantically segment the images, to classify objects, to detect objects, to detect actions in 3D videos, etc., we have come a long way in 3D image processing.
3D Point Cloud image processing is increasingly used to solve Industry 4.0 use cases to help architects, builders and product managers. I will share some of the innovations that are helping the progress of 3D point cloud processing. I will share the practical implementation issues we faced while developing deep learning models to make sense of 3D Point Clouds.
Attendees: Beginners and Intermediate skilled in Image Processing and 3D Point Clouds
Profile of the speaker:
SK Reddy is the Chief Product Officer, AI, at Hexagon (www.hexagon.com). He is an AI and ML expert and a successful two-time startup entrepreneur. He is an AI startup advisor too. He is also a frequent speaker at conferences and an AI blogger.
Deep time-to-failure: predicting failures, churns and customer lifetime with ... | Data Science Milan
The notebook and documentation of the original tutorial is available at https://github.com/gm-spacagna/deep-ttf.
Deep Time-to-Failure: predicting failures, churns and customer lifetime using recurrent neural networks.
Machineries and customers are among the most valuable assets for many businesses. A common trait of these assets is that sooner or later they will fail or, in the case of customers, they will churn.
In order to catch those failure events we would ideally consider the whole history of the machine/customer available information and learn smart representations of the system status over time.
Traditional machine learning and statistical models approach the prediction of time-to-failure, also known as expected lifetime, as a supervised regression problem using handcrafted features.
Training those models is hard for three main reasons:
The complexity of extracting predictive features from time-series without overfitting.
The difficulty of modeling uncertainty and confidence levels in the predictions.
The scarcity of labeled data: failure events are by definition rare, which results in highly unbalanced training datasets.
The first issue can be solved by adopting recurrent neural architectures.
A solution to the last two problems could be to exploit censored data and to build survival regression models.
In this talk we will present a novel technique based on recurrent neural networks that can turn any variable-length sequence of data into a probability distribution representing the estimated remaining time to the failure event. The network will be trained in the presence of ground truth as well as with right-censored data.
We will demonstrate this using a case study on the simulated degradation of 100 jet engines, with data provided by NASA.
During the tutorial you will learn:
What is Survival Analysis and what are the most popular Survival Regression techniques.
How a Weibull distribution can be used as a generic distribution for modeling time-to-failure events.
How to build a deep learning algorithm in Keras leveraging recurrent units (LSTM or GRU) that can map raw time series of covariates into Weibull probability distributions (a minimal sketch follows this abstract).
The tutorial will also cover a few common pitfalls, visualizations and evaluation tools useful for testing and adapting this approach to generic use cases.
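To make this concrete, here is a minimal sketch (not the tutorial's actual notebook code; the layer sizes, feature count and simplified loss are illustrative assumptions) of a Keras recurrent network that maps variable-length sequences of covariates to the two parameters of a Weibull time-to-failure distribution, trained with a censoring-aware negative log-likelihood:

    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras import layers

    def weibull_censored_nll(y_true, y_pred):
        # y_true[:, 0] = observed time t; y_true[:, 1] = 1 if failure observed, 0 if right-censored
        t, observed = y_true[:, 0], y_true[:, 1]
        alpha, beta = y_pred[:, 0], y_pred[:, 1]           # Weibull scale and shape, both > 0
        cumulative_hazard = (t / alpha) ** beta            # contributes for censored and uncensored
        log_density_part = tf.math.log(beta / alpha) + (beta - 1.0) * tf.math.log(t / alpha)
        return -tf.reduce_mean(observed * log_density_part - cumulative_hazard)

    inputs = keras.Input(shape=(None, 8))                  # variable-length sequences of 8 covariates
    hidden = layers.GRU(32)(inputs)                        # an LSTM works equally well here
    outputs = layers.Dense(2, activation="softplus")(hidden)  # keeps alpha and beta strictly positive
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss=weibull_censored_nll)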
You are free to bring your laptop if you would like to do some live coding and experiment yourself. In that case, we strongly encourage you to check that you have all of the requirements installed on your machine.
More details on the required packages can be found in the GitHub repository gm-spacagna/deep-ttf.
50 Shades of Text - Leveraging Natural Language Processing (NLP), Alessandro ...Data Science Milan
50 Shades of Text - Leveraging Natural Language Processing (NLP) to validate, improve, and expand the functionalities of a product
Nowadays, every company stores or produces text data, from web logs and user queries to translations and support tickets, yet not everyone knows how to extract valuable insights from it. In this session, we will present a practical case of moving from raw text data to a valuable business application, leveraging some of the major NLP methodologies (word embeddings, word2vec, doc2vec, fastText, etc.).
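As a flavor of the methodologies mentioned above, a few lines of gensim are enough to train word embeddings on a toy corpus (an illustrative sketch, not the code presented in the session; the corpus and parameters are made up):

    from gensim.models import Word2Vec

    corpus = [["printer", "ticket", "not", "working"],
              ["user", "query", "about", "missing", "invoice"],
              ["support", "ticket", "for", "broken", "printer"]]
    model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, epochs=50)
    print(model.wv.most_similar("printer", topn=3))   # nearest words in embedding space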
Bio: Alessandro is a data veteran. He holds two Master’s degrees in computer engineering, one from Politecnico di Milano and the other from University of Illinois at Chicago (UIC).
He started his career in data consultancy, where he mastered Apache Spark for Machine Learning projects and subsequently joined WW Grainger, one of the largest MRO e-commerce companies in the United States. In September 2017, after more than 5 years in the USA, Alessandro returned to his native country, Italy, where he is now leading a team of data scientists. His current work focuses on achieving energy efficiency through the automation of energy management processes for commercial customers.
Pricing Optimization: Close-out, Online and Renewal strategies, Data ReplyData Science Milan
“Product close-out strategy” by Ilaria Gianoli, Data Scientist, Data Reply
Abstract:
How do you deal with products in their decline phase? Ilaria will share her experience optimizing the close-out strategy for a multinational retail leader, with a particular focus on price optimization.
Bio:
Ilaria is a Data Scientist at Data Reply, where she works as a consultant across different industries, particularly retail. She uses her mathematical, statistical and machine learning background to turn data into business opportunities. She also works closely with the business to provide quantitative support for decision making, adapting the complexity of the mathematical models to customer needs.
She holds a MSc in Applied Statistics - Mathematical Engineering from Politecnico di Milano.
“Online pricing: from theory to application” by Giovanni Corradini, Data Scientist, Data Reply
Abstract:
Multi-armed bandit algorithms are populating the world of e-commerce. How do they work?
Giovanni will share the basics of this field and an application of a state-of-the-art algorithm to a real-world simulation from the ticket industry.
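For readers unfamiliar with multi-armed bandits, the sketch below shows Thompson sampling over a handful of candidate prices (a hypothetical illustration, not the state-of-the-art algorithm presented in the talk; the prices and demand curve are invented):

    import numpy as np

    prices = np.array([9.99, 12.99, 14.99])      # candidate arms
    successes = np.ones(len(prices))              # Beta(1, 1) prior on conversion rate per arm
    failures = np.ones(len(prices))
    rng = np.random.default_rng(42)

    for _ in range(10_000):
        sampled_rates = rng.beta(successes, failures)           # Thompson sampling step
        arm = int(np.argmax(sampled_rates * prices))            # pick the revenue-maximizing arm
        converted = rng.random() < (0.5 - 0.03 * prices[arm])   # simulated demand curve
        successes[arm] += converted
        failures[arm] += 1 - converted

    print("best price:", prices[np.argmax(successes / (successes + failures))])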
Bio: Giovanni is a Data Scientist at Data Reply.
He holds a MSc in Applied Statistics - Mathematical Engineering from Politecnico di Milano.
He has a background in statistics, machine learning and data mining, and he provides decision-making support to companies in many different fields.
“Renewal Price Optimization for Subscription products” by Riccardo Lorenzon, Data Scientist, Data Reply
Abstract:
We are observing a huge shift in the modern economy from a pay-per-product model to a subscription-based model. When it comes to pricing strategies, it is important both to close the single deal and to monetize the long-term relationship with the customer. Riccardo will present an application of subscription renewal pricing optimization models for a company in the publishing industry.
Bio:
Riccardo holds a MSc in Mathematical Models for Decision Making from Politecnico di Milano.
He has developed hands-on experience on end-to-end data projects across multiple industries. His proactive creativity makes him very effective in business case design and in the early stages of projects.
"How Pirelli uses Domino and Plotly for Smart Manufacturing" by Alberto Arrig...Data Science Milan
"How Pirelli uses Domino and Plotly for Smart Manufacturing" by Alberto Arrigoni, Senior Data Scientist, Pirelli (pirelli.com)
Abstract:
Pirelli, a global performance tire manufacturer, uses data science in its 20 factories to improve quality and efficiency, and reduce energy consumption. For this “Smart Manufacturing” initiative, Pirelli’s data science team has developed predictive models and analytics tools to monitor processes, machines and materials on the factory floors. In this talk we will show some of the solutions we deploy, demonstrate how we used Domino’s data science platform and Plot.ly to build these solutions, and discuss the next steps in this journey towards predictive maintenance.
Bio:
Alberto Arrigoni is a data scientist at Pirelli, where he processes sensor and telemetry data for IoT, smart factory and connected-vehicle applications.
He works closely with all major business units such as R&D, industrial engineering and BI to develop tailored machine learning algorithms and production systems.
He holds a PhD in biostatistics from the University of Milano-Bicocca. Prior to joining Pirelli, he was a staff data scientist at the National Institute of Molecular Genetics (Milan), as well as a Fulbright student at Santa Clara University and a visiting PhD student at Pacific Biosciences (Menlo Park, CA).
A brief introduction to Cerved data, the role of the data scientist at Cerved, and how a data scientist can take advantage of graph databases.
Bio:
Stefano Gatti: born in 1970, he has been involved for more than 15 years in several big data and technology-driven projects at leading business information companies such as Lince and Cerved. He is very fond of agile methodologies and tries to apply them at all organizational levels. In recent years he has been strongly engaged in spreading innovation within Cerved and in taking advantage of new big and smart data technologies, especially from a business usage perspective. Datatelling, open innovation, and partnerships with smart actors of the worldwide data-driven innovation ecosystem are his current mantras.
Nunzio Pellegrino: Data Scientist at Cerved, part of the Innovation team, focused on extracting value from data and solving problems with the latest technologies available. He holds a degree in Statistics with a background in machine learning, and worked primarily on data integration and business intelligence projects for 3 years. He is currently the product owner of a web application based on a graph database and is involved in Italian open data projects. He is an R enthusiast, a Python practitioner, and fascinated by the graph ecosystem.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work and a knack for helping others understand how things work. He has around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for secure software delivery. Gopi also has a strong connection with customers, leading design and architecture for strategic implementations. He is a frequent speaker and a well-known leader in continuous delivery and in integrating security into software delivery.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
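As a taste of how the two disciplines meet, the Object Calisthenics rule "wrap all primitives" pushes you straight toward the DDD Value Object pattern. A hypothetical sketch in Python (the talk itself is language-agnostic):

    from dataclasses import dataclass

    @dataclass(frozen=True)                 # value objects are immutable and compared by value
    class Money:
        amount_cents: int                   # wrapping the primitive int gives it domain meaning
        currency: str

        def __post_init__(self):
            if self.amount_cents < 0:
                raise ValueError("Money cannot be negative")

        def add(self, other: "Money") -> "Money":
            if other.currency != self.currency:
                raise ValueError("Cannot add amounts in different currencies")
            return Money(self.amount_cents + other.amount_cents, self.currency)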
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimizing testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features
available on those devices, but many features provide convenience and capability at the expense of security. This best practices guide outlines steps users can take to better protect personal devices and information.
GridMate - End to end testing is a critical piece to ensure quality and avoid...ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean, optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses.
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
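As a preview of what the notebook covers, the Python binding makes a basic study only a few lines long (a sketch assuming a recent pypowsybl release; function names may differ slightly between versions):

    import pypowsybl as pp

    network = pp.network.create_ieee14()      # bundled IEEE 14-bus test network
    results = pp.loadflow.run_ac(network)     # run an AC power flow
    print(results[0].status)                  # convergence status of the main component
    print(network.get_buses().head())         # bus data as a pandas DataFrame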
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
3.
UNDERSTANDABLE: We provide standard visualizations to understand model performance and to compare predictions.
EASY: No coding skills are required to train a model and use it for obtaining predictions.
OPEN: Released under the open source Apache License 2.0.
GENERAL: A new data-type-based approach to deep learning model design makes the tool suited for many different applications.
FLEXIBLE: Experienced users have deep control over model building and training, while newcomers will find it easy to use.
EXTENSIBLE: Easy to add new model architectures and new feature data types.
5. Example
NEWS → CLASS
Toronto Feb 26 - Standard Trustco said it expects earnings in 1987 to increase at least 15 to 20 pct from the 9 140 000 dlrs or 252 dlrs... → earnings
New York Feb 26 - American Express Co remained silent on market rumors it would spinoff all or part of its Shearson Lehman Brothers Inc... → acquisition
BANGKOK March 25 - Vietnam will resettle 300 000 people on state farms known as new economic zones in 1987 to create jobs and grow more high-value export crops... → coffee
6. Example experiment command
The experiment command:
1. splits the data into training, validation and test sets
2. trains a model on the training set
3. validates on the validation set (early stopping)
4. predicts on the test set

ludwig experiment \
  --data_csv reuters-allcats.csv \
  --model_definition "{input_features: [{name: news, type: text}], output_features: [{name: class, type: category}]}"
# you can specify a --model_definition_file file_path argument instead,
# which reads a YAML file
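For reference, the equivalent YAML file passed via --model_definition_file would look like this (a minimal sketch matching the inline definition above):

    input_features:
      - name: news
        type: text
    output_features:
      - name: class
        type: category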
7. Epoch 1
Training: 100%|██████████████████████████████| 23/23 [00:02<00:00, 9.86it/s]
Evaluation train: 100%|██████████████████████| 23/23 [00:00<00:00, 46.39it/s]
Evaluation vali : 100%|████████████████████████| 4/4 [00:00<00:00, 29.05it/s]
Evaluation test : 100%|████████████████████████| 7/7 [00:00<00:00, 43.06it/s]
Took 2.8826s
╒═════════╤════════╤════════════╤══════════╕
│ class │ loss │ accuracy │ hits@3 │
╞═════════╪════════╪════════════╪══════════╡
│ train │ 0.2604 │ 0.9306 │ 0.9899 │
├─────────┼────────┼────────────┼──────────┤
│ vali │ 0.3678 │ 0.8792 │ 0.9769 │
├─────────┼────────┼────────────┼──────────┤
│ test │ 0.3532 │ 0.8929 │ 0.9903 │
╘═════════╧════════╧════════════╧══════════╛
Validation accuracy on class improved, model saved
First epoch of training.
8. Epoch 4
Training: 100%|██████████████████████████████| 23/23 [00:01<00:00, 14.87it/s]
Evaluation train: 100%|██████████████████████| 23/23 [00:00<00:00, 48.87it/s]
Evaluation vali : 100%|████████████████████████| 4/4 [00:00<00:00, 59.50it/s]
Evaluation test : 100%|████████████████████████| 7/7 [00:00<00:00, 51.70it/s]
Took 1.7578s
╒═════════╤════════╤════════════╤══════════╕
│ class │ loss │ accuracy │ hits@3 │
╞═════════╪════════╪════════════╪══════════╡
│ train │ 0.0048 │ 0.9997 │ 1.0000 │
├─────────┼────────┼────────────┼──────────┤
│ vali │ 0.2560 │ 0.9177 │ 0.9949 │
├─────────┼────────┼────────────┼──────────┤
│ test │ 0.1677 │ 0.9440 │ 0.9964 │
╘═════════╧════════╧════════════╧══════════╛
Validation accuracy on class improved, model saved
Fourth epoch of training; validation accuracy improved.
9. Epoch 14
Training: 100%|██████████████████████████████| 23/23 [00:01<00:00, 14.69it/s]
Evaluation train: 100%|██████████████████████| 23/23 [00:00<00:00, 48.67it/s]
Evaluation vali : 100%|████████████████████████| 4/4 [00:00<00:00, 61.29it/s]
Evaluation test : 100%|████████████████████████| 7/7 [00:00<00:00, 50.45it/s]
Took 1.7749s
╒═════════╤════════╤════════════╤══════════╕
│ class │ loss │ accuracy │ hits@3 │
╞═════════╪════════╪════════════╪══════════╡
│ train │ 0.0001 │ 1.0000 │ 1.0000 │
├─────────┼────────┼────────────┼──────────┤
│ vali │ 0.4208 │ 0.9100 │ 0.9974 │
├─────────┼────────┼────────────┼──────────┤
│ test │ 0.2081 │ 0.9404 │ 0.9988 │
╘═════════╧════════╧════════════╧══════════╛
Last improvement of accuracy on class happened 10 epochs ago
Fourteenth epoch of training; validation accuracy has not improved in a while, so we EARLY STOP.
13. Training
Field mappings, model hyperparameters and weights are saved during training.
[Diagram: Raw Data (CSV) → Preprocessing → Processed Data (HDF5 / NumPy) plus Fields Mappings (JSON); Processed Data → Model Training → Weights (Tensor) and Hyperparameters (JSON)]
14. Prediction
The same field mappings obtained during training are used to preprocess each datapoint and to postprocess each prediction of the model, in order to map back to labels.
[Diagram: Raw Data (CSV) → Preprocessing (using the Fields Mappings, JSON, from training) → Processed Data (HDF5 / NumPy) → Model Prediction (using Weights, Tensor, and Hyperparameters, JSON) → Predictions and Scores (NumPy) → Postprocessing → Labels (CSV)]
15. Running Experiments
The experiment trains a model on the training set and predicts on the test set. It outputs predictions as well as training and test statistics that can be used by the Visualization component to obtain graphs.
[Diagram: Raw Data (CSV) → Experiment → Labels (CSV), Weights (Tensor), Hyperparameters (JSON), Fields Mappings (JSON), Training Statistics (JSON) and Test Statistics (JSON); the statistics feed Visualization → Graphs (PNG / PDF)]
18. Back to the example

ludwig experiment \
  --data_csv reuters-allcats.csv \
  --model_definition "{input_features: [{name: news, type: text}], output_features: [{name: class, type: category}]}"
28. Each input type can have multiple encoders to choose from, and each output type can have multiple decoders to choose from.
29. Text input feature encoders
Each encoder maps a sequence of words or characters to a vector encoding:
Stacked CNN: 6 x 1D Conv → 2 x FC
Parallel CNN: 1D Conv of widths 2, 3, 4 and 5 in parallel → 2 x FC
Stacked Parallel CNN: 3 x Parallel CNN (3 x Conv each) → 2 x FC
RNN: 2 x RNN → 2 x FC
CNN RNN: 3 x Conv → 2 x RNN → 2 x FC
Transformer / BERT: 6 x Transformer → 2 x FC
36. Sequence, Time Series and Audio / Speech encoders
The same encoder families apply to any sequence input, mapping it to a vector encoding:
Stacked CNN: 6 x 1D Conv → 2 x FC
Parallel CNN: 1D Conv of widths 2, 3, 4 and 5 in parallel → 2 x FC
Stacked Parallel CNN: 3 x Parallel CNN (3 x Conv each) → 2 x FC
RNN: 2 x RNN → 2 x FC
CNN RNN: 3 x Conv → 2 x RNN → 2 x FC
Transformer: 6 x Transformer → 2 x FC
38. Category features embedding encoding
Grey parameters are optional:

name: category_csv_column_name
type: category
representation: dense  # sparse = one-hot encoding
embedding_size: 256
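To make the dense vs. sparse distinction concrete, here is what the two representations amount to (an illustrative NumPy sketch, not Ludwig's internal code):

    import numpy as np

    vocab_size, embedding_size = 10, 4
    category_id = 7                                    # id of the category to encode

    one_hot = np.zeros(vocab_size)                     # sparse: one-hot encoding
    one_hot[category_id] = 1.0

    embedding_matrix = np.random.rand(vocab_size, embedding_size)
    dense = embedding_matrix[category_id]              # dense: lookup into a learned embedding matrix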
39. Numerical and binary feature encoders
Numerical features are encoded with one FC layer + norm for scaling purposes, or are used as they are.
Binary features are used as they are.
40. Set and bag feature encoders
Both are encoded by embedding the items, aggregating them, and passing them through FC layers. In the case of bags, the aggregation is a weighted sum.
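The weighted-sum aggregation for bags can be pictured as follows (an illustrative sketch, not Ludwig's internal code):

    import numpy as np

    item_embeddings = np.random.rand(4, 16)        # 4 distinct items, 16-dim embeddings
    counts = np.array([2.0, 1.0, 3.0, 1.0])        # a bag is a multiset, so items carry counts
    bag_encoding = (counts[:, None] * item_embeddings).sum(axis=0)  # weighted sum, then FC layers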
50. Supports text preprocessing in English, Italian, Spanish, German, French, Portuguese, Dutch, Greek and multi-language.
51. Custom Encoder Example

class MySequenceEncoder:

    def __init__(
        self,
        my_param=None,
        **kwargs
    ):
        self.my_param = my_param

    def __call__(
        self,
        input_sequence,
        regularizer,
        is_training=True
    ):
        # input_sequence is a [b x s] int32 tensor;
        # every element of s is the int32 id of a token.
        # Your code goes here; it can use self.my_param.
        # The hidden tensor to return is [b x s x h] or [b x h],
        # where hidden_size is h.
        return hidden, hidden_size
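A filled-in body might look like the following (a hypothetical example using TF1-era ops, in line with the TensorFlow backend Ludwig used at the time; it is not an encoder shipped with Ludwig):

    import tensorflow as tf

    class BagOfEmbeddingsEncoder:
        def __init__(self, vocab_size=10000, embedding_size=64, **kwargs):
            self.vocab_size = vocab_size
            self.embedding_size = embedding_size

        def __call__(self, input_sequence, regularizer, is_training=True):
            # input_sequence: [b x s] int32 token ids
            embeddings = tf.get_variable(
                "embeddings",
                [self.vocab_size, self.embedding_size],
                regularizer=regularizer)
            embedded = tf.nn.embedding_lookup(embeddings, input_sequence)  # [b x s x h]
            hidden = tf.reduce_mean(embedded, axis=1)                      # [b x h]
            return hidden, self.embedding_size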
53. Text Classification

ludwig experiment \
  --data_csv reuters-allcats.csv \
  --model_definition "{input_features: [{name: news, type: text, encoder: parallel_cnn, level: word}], output_features: [{name: class, type: category}]}"

NEWS → CLASS
Toronto Feb 26 - Standard Trustco said it expects earnings in 1987 to increase at least 15... → earnings
New York Feb 26 - American Express Co remained silent on market rumors... → acquisition
BANGKOK March 25 - Vietnam will resettle 300 000 people on state farms known as new economic... → coffee
[Diagram: Text → 4 parallel CNNs → FC → Softmax → Class]
54. Image Object Classification (MNIST, ImageNet, ...)

ludwig experiment \
  --data_csv imagenet.csv \
  --model_definition "{input_features: [{name: image_path, type: image, encoder: stacked_cnn}], output_features: [{name: class, type: category}]}"

IMAGE_PATH → CLASS
imagenet/image_000001.jpg → car
imagenet/image_000002.jpg → dog
imagenet/image_000003.jpg → boat
[Diagram: Image → CNNs → FC → Softmax → Class]
55. Expected Time of Delivery

ludwig experiment \
  --data_csv eats_etd.csv \
  --model_definition "{input_features: [{name: restaurant, type: category, encoder: embed}, {name: order, type: bag}, {name: hour, type: numerical}, {name: min, type: numerical}], output_features: [{name: etd, type: numerical, loss: {type: mean_absolute_error}}]}"

RESTAURANT | ORDER | HOUR | MIN | ETD
Halal Guys | Chicken kebab, Lamb kebab | 20 | 00 | 20
Il Casaro | Margherita | 18 | 53 | 15
CurryUp | Chicken Tikka Masala, Veg Curry | 21 | 10 | 35
[Diagram: Restaurant → embedding, Order → set embedding, plus the numerical features → FC → Regress → ETD]
56. Sequence2Sequence Chit-Chat Dialogue Model

ludwig experiment \
  --data_csv chitchat.csv \
  --model_definition "{input_features: [{name: user1, type: text, encoder: rnn, cell_type: lstm}], output_features: [{name: user2, type: text, decoder: generator, cell_type: lstm, attention: bahdanau}]}"

USER1 → USER2
Hello! How are you doing? → Doing well, thanks!
I got promoted today → Congratulations!
Not doing well today → I’m sorry, can I do something to help you?
[Diagram: User1 → encoder RNN → decoder RNN → User2]
57. Sequence Tagging for sequences or time series

ludwig experiment \
  --data_csv sequence_tags.csv \
  --model_definition "{input_features: [{name: utterance, type: text, encoder: rnn, cell_type: lstm}], output_features: [{name: tag, type: sequence, decoder: tagger}]}"

UTTERANCE → TAG
Blade Runner is a 1982 neo-noir science fiction film directed by Ridley Scott → N N O O D O O O O O O P P
Harrison Ford and Rutger Hauer starred in it → P P O P P O O O
Philip Dick 's novel Do Androids Dream of Electric Sheep ? was published in 1968 → P P O O N N N N N N N O O O D
[Diagram: Utterance → RNN → Softmax → Tag]
58. Programmatic API

Easy to install:
pip install ludwig

Programmatic API:
from ludwig.api import LudwigModel

model = LudwigModel(model_definition)
model.train(train_data)
# or model = LudwigModel.load(path)
predictions = model.predict(other_data)
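Putting the API together with the text-classification example from earlier (a sketch assuming the Ludwig 0.1-era API, where train and predict accept a data_csv argument):

    from ludwig.api import LudwigModel

    model_definition = {
        "input_features": [{"name": "news", "type": "text"}],
        "output_features": [{"name": "class", "type": "category"}],
    }
    model = LudwigModel(model_definition)
    train_stats = model.train(data_csv="reuters-allcats.csv")    # returns training statistics
    predictions = model.predict(data_csv="unlabeled_news.csv")   # pandas DataFrame of predictions
    model.close()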
59. Additional Goodies

Model serving:
ludwig serve --model_path <model_path>

Integration with other tools:
▪︎ Train models in a distributed fashion using Horovod
▪︎ Deploy models seamlessly using PyML
▪︎ Hyper-parameter optimization using BOCS / Quicksilver and AutoTune
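Once a model is served, predictions can be requested over plain HTTP (a hedged sketch based on later Ludwig releases' REST endpoint; the form fields mirror the input feature names):

    ludwig serve --model_path ./results/experiment_run/model
    # in another shell:
    curl http://0.0.0.0:8000/predict -X POST -F "news=Toronto Feb 26 - Standard Trustco said it expects earnings..."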
61. Why it is great for experimentation
Easy to plug in new encoders, decoders and combiners.
Change just one string in the YAML file to run a new experiment.
63. Outputs
A directory containing the trained model
A CSV file for each output feature, containing predictions
An NPY file for each output feature, containing probabilities
A JSON file containing training statistics
A JSON file containing test statistics
Monitoring during training is available through TensorBoard
65. Next steps
More features: vectors, nested lists, point clouds, videos
More pretrained input encoders
Adding missing decoders
Petastorm integration to work directly on Hive tables without intermediate CSVs
68. Machine Learning democratization
Non-expert users:
- From idea to model training in minutes
- No need for coding and deep knowledge
- Easy declarative model construction hides complexity
Expert users:
- Efficient exploration of the model space through quick iteration
- Tweak every aspect of preprocessing, modeling and training
- Extend with your own models