Are we reaching a Data Science Singularity? How Cognitive Computing is emerging from Machine Learning Algorithms, Big Data Tools, and Cloud Services by Natalino Busa
Blue Pill/Red Pill: The Matrix of Thousands of Data StreamsDatabricks
Designing a streaming application which has to process data from 1 or 2 streams is easy. Any streaming framework which provides scalability, high-throughput, and fault-tolerance would work. But when the number of streams start growing in order 100s or 1000s, managing them can be daunting. How would you share resources among 1000s of streams with all of them running 24×7? Manage their state, Apply advanced streaming operations, Add/Delete streams without restarting? This talk explains common scenarios & shows techniques that can handle thousands of streams using Spark Structured Streaming.
Distributed Models Over Distributed Data with MLflow, Pyspark, and PandasDatabricks
Does more data always improve ML models? Is it better to use distributed ML instead of single node ML?
In this talk I will show that while more data often improves DL models in high variance problem spaces (with semi or unstructured data) such as NLP, image, video more data does not significantly improve high bias problem spaces where traditional ML is more appropriate. Additionally, even in the deep learning domain, single node models can still outperform distributed models via transfer learning.
Data scientists have pain points running many models in parallel automating the experimental set up. Getting others (especially analysts) within an organization to use their models Databricks solves these problems using pandas udfs, ml runtime and MLflow.
In this talk, we will discuss the technical and non-technical challenges faced in designing a system used to obfuscate LinkedIn member data stored in Hadoop at scale.
To assist in this task, we built WhereHows (OpenSource) which serves as a data discovery and metadata catalog for all the datasets at LinkedIn. We integrated WhereHows with a compliance monitoring tool which uses Machine Learning to identify datasets with PII (Personal identity Information). We will cover the building blocks of the system and the lessons learned in the process.
Audience takeaways:
- What is takes to obfuscate member data at scale
- The challenges involved in designing the system
- The lessons learned throughout this journey
- What not to assume!
Counting Unique Users in Real-Time: Here's a Challenge for You!DataWorks Summit
Finding the number of unique users out of 10 billion events per day is challenging. At this session, we're going to describe how re-architecting our data infrastructure, relying on Druid and ThetaSketch, enables our customers to obtain these insights in real-time.
To put things into context, at NMC (Nielsen Marketing Cloud) we provide our customers (marketers and publishers) real-time analytics tools to profile their target audiences. Specifically, we provide them with the ability to see the number of unique users who meet a given criterion.
Historically, we have used Elasticsearch to answer these types of questions, however, we have encountered major scaling and stability issues.
In this presentation we will detail the journey of rebuilding our data infrastructure, including researching, benchmarking and productionizing a new technology, Druid, with ThetaSketch, to overcome the limitations we were facing.
We will also provide guidelines and best practices with regards to Druid.
Topics include :
* The need and possible solutions
* Intro to Druid and ThetaSketch
* How we use Druid
* Guidelines and pitfalls
Blue Pill/Red Pill: The Matrix of Thousands of Data StreamsDatabricks
Designing a streaming application which has to process data from 1 or 2 streams is easy. Any streaming framework which provides scalability, high-throughput, and fault-tolerance would work. But when the number of streams start growing in order 100s or 1000s, managing them can be daunting. How would you share resources among 1000s of streams with all of them running 24×7? Manage their state, Apply advanced streaming operations, Add/Delete streams without restarting? This talk explains common scenarios & shows techniques that can handle thousands of streams using Spark Structured Streaming.
Distributed Models Over Distributed Data with MLflow, Pyspark, and PandasDatabricks
Does more data always improve ML models? Is it better to use distributed ML instead of single node ML?
In this talk I will show that while more data often improves DL models in high variance problem spaces (with semi or unstructured data) such as NLP, image, video more data does not significantly improve high bias problem spaces where traditional ML is more appropriate. Additionally, even in the deep learning domain, single node models can still outperform distributed models via transfer learning.
Data scientists have pain points running many models in parallel automating the experimental set up. Getting others (especially analysts) within an organization to use their models Databricks solves these problems using pandas udfs, ml runtime and MLflow.
In this talk, we will discuss the technical and non-technical challenges faced in designing a system used to obfuscate LinkedIn member data stored in Hadoop at scale.
To assist in this task, we built WhereHows (OpenSource) which serves as a data discovery and metadata catalog for all the datasets at LinkedIn. We integrated WhereHows with a compliance monitoring tool which uses Machine Learning to identify datasets with PII (Personal identity Information). We will cover the building blocks of the system and the lessons learned in the process.
Audience takeaways:
- What is takes to obfuscate member data at scale
- The challenges involved in designing the system
- The lessons learned throughout this journey
- What not to assume!
Counting Unique Users in Real-Time: Here's a Challenge for You!DataWorks Summit
Finding the number of unique users out of 10 billion events per day is challenging. At this session, we're going to describe how re-architecting our data infrastructure, relying on Druid and ThetaSketch, enables our customers to obtain these insights in real-time.
To put things into context, at NMC (Nielsen Marketing Cloud) we provide our customers (marketers and publishers) real-time analytics tools to profile their target audiences. Specifically, we provide them with the ability to see the number of unique users who meet a given criterion.
Historically, we have used Elasticsearch to answer these types of questions, however, we have encountered major scaling and stability issues.
In this presentation we will detail the journey of rebuilding our data infrastructure, including researching, benchmarking and productionizing a new technology, Druid, with ThetaSketch, to overcome the limitations we were facing.
We will also provide guidelines and best practices with regards to Druid.
Topics include :
* The need and possible solutions
* Intro to Druid and ThetaSketch
* How we use Druid
* Guidelines and pitfalls
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...Databricks
As data grows in size and connectedness dramatically in all dimensions, the potential for graph-enriched machine learning grows likewise, but scalable technologies are needed to both build models and apply them in real-time. Real-time deep-link graph pattern matching and analytics provides new opportunities for enriching your machine learning models with graph features.
‘In addition to the real-time deep-link aspect, the ability to process large datasets in a production pipeline provides a synergistic approach for the two distributed and performant platforms: Spark and TigerGraph. The TigerGraph graph database provides scalable real-time deep link graph analytics and augments Spark with graph analytics and predictions for a wide range of Machine Learning use cases.
In this session, we will explain the architecture and technical implementation for a TigerGraph+Spark graph-enhanced Machine Learning pipeline: Use TigerGraph both before training to extract (graph and non-graph) features and after training to apply the model on streaming data; use Spark to train and tune machine learning models at scale. As an example, we will present a solution in production at China Mobile that detects and prevents phone-based scams using machine learning with TigerGraph.
Specifically, the solution generates 118 graph features for 600 million users, to feed a machine learning system which detects three types of unwanted phone calls. TigerGraph then helps to deploy the model by extracting these 118 features in real-time for up to 10,000 calls per second, to give customers a real-time diagnosis of their incoming calls.
Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...Databricks
At Sams Club we have a long history of using Apache Spark and Hadoop. Projects from all parts of the company use Apache Spark, from fraud detection to product recommendations. Because of the scale of our business with billions of transactions and trillions of events it is often essential to use big data technologies. Until recently all of this work has run on several large on-premise Hadoop clusters. As part of our transition to public cloud we needed to build out an enterprise scale data platform. Azure Databricks is a key component of this platform giving our data scientist, engineers, and business users the ability to easily work with the companies data. We will discuss our architecture considerations that lead to using multiple Databricks workspaces and external Azure blob storage. We will also discuss how we move massive amounts of data to Azure on a daily basis with Airflow. Further we will discuss the self-service tools that we created to help users get their data to Azure and for us to manage the platform. Finally we will discuss our security considerations and how that played out in our architecture.
Authors: Andrew Ray, Craig Covey
Graph Data: a New Data Management FrontierDemai Ni
Graph Data: a New Data Management Frontier -- Huawei’s view and Call for Collaboration by Demai Ni:
Huawei provides Enterprise Databases, and are actively exploring the latest technology to provide end-to-end Data Management Solution on Cloud. We are looking at to bridge classic RDMS to Graph Database on a distributed platform.
Margriet Groenendijk - Open data is available from an incredible number of data sources that can be linked to your own datasets. This talk will present examples of how to visualise and combine data from very different sources such as weather and climate, and statistics collected by individual countries using Python notebooks in Analytics for Apache Spark.
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Shirshanka Das
So, you finally have a data ecosystem with Kafka and Hadoop both deployed and operating correctly at scale. Congratulations. Are you done? Far from it.
As the birthplace of Kafka and an early adopter of Hadoop, LinkedIn has 13 years of combined experience using Kafka and Hadoop at scale to run a data-driven company. Both Kafka and Hadoop are flexible, scalable infrastructure pieces, but using these technologies without a clear idea of what the higher-level data ecosystem should be is perilous. Shirshanka Das and Yael Garten share best practices around data models and formats, choosing the right level of granularity of Kafka topics and Hadoop tables, and moving data efficiently and correctly between Kafka and Hadoop and explore a data abstraction layer, Dali, that can help you to process data seamlessly across Kafka and Hadoop.
Beyond pure technology, Shirshanka and Yael outline the three components of a great data culture and ecosystem and explain how to create maintainable data contracts between data producers and data consumers (like data scientists and data analysts) and how to standardize data effectively in a growing organization to enable (and not slow down) innovation and agility. They then look to the future, envisioning a world where you can successfully deploy a data abstraction of views on Hadoop data, like a data API as a protective and enabling shield. Along the way, Shirshanka and Yael discuss observations on how to enable teams to be good data citizens in producing, consuming, and owning datasets and offer an overview of LinkedIn’s governance model: the tools, process and teams that ensure that its data ecosystem can handle change and sustain #datasciencehappiness.
Big Data Streams Architectures. Why? What? How?Anton Nazaruk
With a current zoo of technologies and different ways of their interaction it's a big challenge to architect a system (or adopt existed one) that will conform to low-latency BigData analysis requirements. Apache Kafka and Kappa Architecture in particular take more and more attention over classic Hadoop-centric technologies stack. New Consumer API put significant boost in this direction. Microservices-based streaming processing and new Kafka Streams tend to be a synergy in BigData world.
Advanced Analytics for Any Data at Real-Time Speeddanpotterdwch
The kenyote presentation from Predictive Analytics World entitled "Advanced Analytics for Any Data at Real-Time Speed" Dan Potter, CMO from Datawatch, presents a new approach to prepare, incorporate, enrich and visualize streaming data for advanced visual analysis is essential for making timelier, high-impact business decisions in tough competitive markets.
Talk I gave at StratHadoop in Barcelona on November 21, 2014.
In this talk I discuss the experience we made with realtime analysis on high volume event data streams.
In this session you will learn about how H&M have created a reference architecture for deploying their machine learning models on azure utilizing databricks following devOps principles. The architecture is currently used in production and has been iterated over multiple times to solve some of the discovered pain points. The team that are presenting is currently responsible for ensuring that best practices are implemented on all H&M use cases covering 100''s of models across the entire H&M group. <br> This architecture will not only give benefits to data scientist to use notebooks for exploration and modeling but also give the engineers a way to build robust production grade code for deployment. The session will in addition cover topics like lifecycle management, traceability, automation, scalability and version control.
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...confluent
Do you know who is knocking on your network’s door? Have new regulations left you scratching your head on how to handle what is happening in your network? Network flow data helps answer many questions across a multitude of use cases including network security, performance, capacity planning, routing, operational troubleshooting and more. Today’s modern day streaming data pipelines need to include tools that can scale to meet the demands of these service providers while continuing to provide responsive answers to difficult questions. In addition to stream processing, data needs to be stored in a redundant, operationally focused database to provide fast, reliable answers to critical questions. Together, Kafka and Druid work together to create such a pipeline.
In this talk Eric Graham and Rachel Pedreschi will discuss these pipelines and cover the following topics:
-Network flow use cases and why this data is important.
-Reference architectures from production systems at a major international Bank.
-Why Kafka and Druid and other OSS tools for Network Flows.
-A demo of one such system.
How Spark Enables the Internet of Things: Efficient Integration of Multiple ...sparktc
IBM researchers in Haifa, together with partners from the COSMOS EU-funded project, are using Spark to analyze the new wave of IoT data and solve problems in a way that is generic, integrated, and practical.
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...Databricks
As data grows in size and connectedness dramatically in all dimensions, the potential for graph-enriched machine learning grows likewise, but scalable technologies are needed to both build models and apply them in real-time. Real-time deep-link graph pattern matching and analytics provides new opportunities for enriching your machine learning models with graph features.
‘In addition to the real-time deep-link aspect, the ability to process large datasets in a production pipeline provides a synergistic approach for the two distributed and performant platforms: Spark and TigerGraph. The TigerGraph graph database provides scalable real-time deep link graph analytics and augments Spark with graph analytics and predictions for a wide range of Machine Learning use cases.
In this session, we will explain the architecture and technical implementation for a TigerGraph+Spark graph-enhanced Machine Learning pipeline: Use TigerGraph both before training to extract (graph and non-graph) features and after training to apply the model on streaming data; use Spark to train and tune machine learning models at scale. As an example, we will present a solution in production at China Mobile that detects and prevents phone-based scams using machine learning with TigerGraph.
Specifically, the solution generates 118 graph features for 600 million users, to feed a machine learning system which detects three types of unwanted phone calls. TigerGraph then helps to deploy the model by extracting these 118 features in real-time for up to 10,000 calls per second, to give customers a real-time diagnosis of their incoming calls.
Building an Enterprise Data Platform with Azure Databricks to Enable Machine ...Databricks
At Sams Club we have a long history of using Apache Spark and Hadoop. Projects from all parts of the company use Apache Spark, from fraud detection to product recommendations. Because of the scale of our business with billions of transactions and trillions of events it is often essential to use big data technologies. Until recently all of this work has run on several large on-premise Hadoop clusters. As part of our transition to public cloud we needed to build out an enterprise scale data platform. Azure Databricks is a key component of this platform giving our data scientist, engineers, and business users the ability to easily work with the companies data. We will discuss our architecture considerations that lead to using multiple Databricks workspaces and external Azure blob storage. We will also discuss how we move massive amounts of data to Azure on a daily basis with Airflow. Further we will discuss the self-service tools that we created to help users get their data to Azure and for us to manage the platform. Finally we will discuss our security considerations and how that played out in our architecture.
Authors: Andrew Ray, Craig Covey
Graph Data: a New Data Management FrontierDemai Ni
Graph Data: a New Data Management Frontier -- Huawei’s view and Call for Collaboration by Demai Ni:
Huawei provides Enterprise Databases, and are actively exploring the latest technology to provide end-to-end Data Management Solution on Cloud. We are looking at to bridge classic RDMS to Graph Database on a distributed platform.
Margriet Groenendijk - Open data is available from an incredible number of data sources that can be linked to your own datasets. This talk will present examples of how to visualise and combine data from very different sources such as weather and climate, and statistics collected by individual countries using Python notebooks in Analytics for Apache Spark.
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Shirshanka Das
So, you finally have a data ecosystem with Kafka and Hadoop both deployed and operating correctly at scale. Congratulations. Are you done? Far from it.
As the birthplace of Kafka and an early adopter of Hadoop, LinkedIn has 13 years of combined experience using Kafka and Hadoop at scale to run a data-driven company. Both Kafka and Hadoop are flexible, scalable infrastructure pieces, but using these technologies without a clear idea of what the higher-level data ecosystem should be is perilous. Shirshanka Das and Yael Garten share best practices around data models and formats, choosing the right level of granularity of Kafka topics and Hadoop tables, and moving data efficiently and correctly between Kafka and Hadoop and explore a data abstraction layer, Dali, that can help you to process data seamlessly across Kafka and Hadoop.
Beyond pure technology, Shirshanka and Yael outline the three components of a great data culture and ecosystem and explain how to create maintainable data contracts between data producers and data consumers (like data scientists and data analysts) and how to standardize data effectively in a growing organization to enable (and not slow down) innovation and agility. They then look to the future, envisioning a world where you can successfully deploy a data abstraction of views on Hadoop data, like a data API as a protective and enabling shield. Along the way, Shirshanka and Yael discuss observations on how to enable teams to be good data citizens in producing, consuming, and owning datasets and offer an overview of LinkedIn’s governance model: the tools, process and teams that ensure that its data ecosystem can handle change and sustain #datasciencehappiness.
Big Data Streams Architectures. Why? What? How?Anton Nazaruk
With a current zoo of technologies and different ways of their interaction it's a big challenge to architect a system (or adopt existed one) that will conform to low-latency BigData analysis requirements. Apache Kafka and Kappa Architecture in particular take more and more attention over classic Hadoop-centric technologies stack. New Consumer API put significant boost in this direction. Microservices-based streaming processing and new Kafka Streams tend to be a synergy in BigData world.
Advanced Analytics for Any Data at Real-Time Speeddanpotterdwch
The kenyote presentation from Predictive Analytics World entitled "Advanced Analytics for Any Data at Real-Time Speed" Dan Potter, CMO from Datawatch, presents a new approach to prepare, incorporate, enrich and visualize streaming data for advanced visual analysis is essential for making timelier, high-impact business decisions in tough competitive markets.
Talk I gave at StratHadoop in Barcelona on November 21, 2014.
In this talk I discuss the experience we made with realtime analysis on high volume event data streams.
In this session you will learn about how H&M have created a reference architecture for deploying their machine learning models on azure utilizing databricks following devOps principles. The architecture is currently used in production and has been iterated over multiple times to solve some of the discovered pain points. The team that are presenting is currently responsible for ensuring that best practices are implemented on all H&M use cases covering 100''s of models across the entire H&M group. <br> This architecture will not only give benefits to data scientist to use notebooks for exploration and modeling but also give the engineers a way to build robust production grade code for deployment. The session will in addition cover topics like lifecycle management, traceability, automation, scalability and version control.
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...confluent
Do you know who is knocking on your network’s door? Have new regulations left you scratching your head on how to handle what is happening in your network? Network flow data helps answer many questions across a multitude of use cases including network security, performance, capacity planning, routing, operational troubleshooting and more. Today’s modern day streaming data pipelines need to include tools that can scale to meet the demands of these service providers while continuing to provide responsive answers to difficult questions. In addition to stream processing, data needs to be stored in a redundant, operationally focused database to provide fast, reliable answers to critical questions. Together, Kafka and Druid work together to create such a pipeline.
In this talk Eric Graham and Rachel Pedreschi will discuss these pipelines and cover the following topics:
-Network flow use cases and why this data is important.
-Reference architectures from production systems at a major international Bank.
-Why Kafka and Druid and other OSS tools for Network Flows.
-A demo of one such system.
How Spark Enables the Internet of Things: Efficient Integration of Multiple ...sparktc
IBM researchers in Haifa, together with partners from the COSMOS EU-funded project, are using Spark to analyze the new wave of IoT data and solve problems in a way that is generic, integrated, and practical.
The Email Analyzer interface helps an analyst visualize the email network and identify local group of people who frequently exchange emails amongst themselves. This interface was developed as an entry for VAST Challenge 2014.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
Concurrent Inference of Topic Models and Distributed Vector RepresentationsParang Saraf
Abstract: Topic modeling techniques have been widely used to uncover dominant themes hidden inside an unstructured document collection. Though these techniques first originated in the probabilistic analysis of word distributions, many deep learning approaches have been adopted recently. In this paper, we propose a novel neural network based architecture that produces distributed representation of topics to capture topical themes in a dataset. Unlike many state-of-the-art techniques for generating distributed representation of words and documents that directly use neighboring words for training, we leverage the outcome of a sophisticated deep neural network to estimate the topic labels of each document. The networks, for topic modeling and generation of distributed representations, are trained concurrently in a cascaded style with better runtime without sacrificing the quality of the topics. Empirical studies reported in the paper show that the distributed representations of topics represent intuitive themes using smaller dimensions than conventional topic modeling approaches.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
Slides: Concurrent Inference of Topic Models and Distributed Vector Represent...Parang Saraf
Abstract: Topic modeling techniques have been widely used to uncover dominant themes hidden inside an unstructured document collection. Though these techniques first originated in the probabilistic analysis of word distributions, many deep learning approaches have been adopted recently. In this paper, we propose a novel neural network based architecture that produces distributed representation of topics to capture topical themes in a dataset. Unlike many state-of-the-art techniques for generating distributed representation of words and documents that directly use neighboring words for training, we leverage the outcome of a sophisticated deep neural network to estimate the topic labels of each document. The networks, for topic modeling and generation of distributed representations, are trained concurrently in a cascaded style with better runtime without sacrificing the quality of the topics. Empirical studies reported in the paper show that the distributed representations of topics represent intuitive themes using smaller dimensions than conventional topic modeling approaches.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
Slides: Concurrent Inference of Topic Models and Distributed Vector Represent...
Similar to Are we reaching a Data Science Singularity? How Cognitive Computing is emerging from Machine Learning Algorithms, Big Data Tools, and Cloud Services by Natalino Busa
Data science apps powered by Jupyter NotebooksNatalino Busa
Jupyter notebooks are transforming the way we look at computing, coding, and science. But is this the only "data scientist experience" that this technology can provide? In this presentation we will look at how to create interactive web applications for data exploration and machine learning. In the background this code is still powered by the well-understood and well-documented Jupyter Notebooks.
Code on github: https://github.com/natbusa/kernelgateway_demos
7 steps for highly effective deep neural networksNatalino Busa
Natalino will provide a fast-paced lecture on recognizing handwritten numbers with ANNs. In the past years much progress has been booked on the recognition of images: multi-layer perceptrons, inception networks, skip and residual nets, lstm, attention networks.
In this talk, Natalino will go through 7 different architectures on how to classify handwritten numbers using keras and tensorflow. For each neural network architecture, he will spend some time on the theory, the coding, and the quality of the results. Code and Fun included!.
original file:
https://drive.google.com/file/d/0BwNrPuGaMi8PbVhUYUVKWUhGRjQ
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015Big Data Spain
The term 'Data Science' was first described in scientific literature about 15 years ago. It started to become a major trend in industry about 7 years ago.
O'Reilly Media surveys the industry extensively each year. In addition we get a good birds-eye view of industry trends through our conference programs and publications, working closely with some of the best practitioners in Data Science.
By now, the field has evolved far beyond its origins eclipsing an earlier generation of Business Intelligence and Data Warehousing approaches. Data Science is moving up, into the business verticals and government spheres of influence where it has true global impact.
This talk considers Data Science trends from the past three years in particular. What is emerging? Which parts are evolving? Which seem cluttered and poised for consolidation or other change?
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-2.html
How You Can Use Open Source Materials to Learn Python & Data Science - EuroPy...Kamila Stępniowska
Please find the slides from my talk "How You Can Use Open Source Materials to Learn Python & Data Science". The talk has been originally performed at EuroPython 2018 in Edinburgh (July 26th, 2018).
You can think of this slides as a resource library and a limited guideline.
Data Science at Scale - The DevOps ApproachMihai Criveti
DevOps Practices for Data Scientists and Engineers
1 Data Science Landscape
2 Process and Flow
3 The Data
4 Data Science Toolkit
5 Cloud Computing Solutions
6 The rise of DevOps
7 Reusable Assets and Practices
8 Skills Development
https://bigscience.huggingface.co/
EN: Presentation of the BigScience project: a research initiative launched by HuggingFace and aiming to build a large language model (inspired by OpenAI and GPTx) over multiple languages and a very large processing cluster. The participants plan to investigate the dataset and the model from all angles: bias, social impact, capabilities, limitations, ethics, potential improvements, specific domain performances, carbon impact, general AI/cognitive research landscape.
FR : Présentation du projet Bigscience : un projet de recherche ouvert lancé par HuggingFace et qui a pour objectif de contruire un modèle de langue (ie un peu comme openAI et GPT-3) mais en explorant les problèmes liés au jeux de données et au modèle selon les angles des biais cognitifs, de l'impact social et environemental, des limites éthiques, des possibles gain de performance et de l'impact général de ce type d'approche lorsque le but n'est pas seulement "d'avoir un plus gros modèle".
Moving forward data centric sciences weaving AI, Big Data & HPCGenoveva Vargas-Solar
This novel and multidisciplinary data centric and scientific movement, promises new and not yet imagined applications that rely on massive amounts of evolving data that need to be cleaned, integrated and analysed for modelling purposes. Yet, data management issues are not usually perceived as central. In this keynote I will explore the key challenges and opportunities for data management in this new scientific world, and discuss how a possible data centric artificial intelligence supported by high performance computing (HPC) can best contribute to these exciting domains. If the moto is not academic, huge numbers of dollars being devoted to related applications are moving industry and academia to analyse these directions.
Coming right from the Recommender Systems conference in San Francisco, I present some latest developments in the field of large scale recommendation engines and machine learning.
The Agenda for the Webinar:
1. Introduction to Python.
2. Python and Big Data.
3. Python and Data Science.
4. Key features of Python and their usage in Business Analytics.
5. Business Analytics with Python – Real world Use Cases.
Python has been one of the premier, flexible, and powerful open-source language that is easy to learn, easy to use, and has powerful libraries for data manipulation and analysis. It’s easy to learn simple syntax is very accessible to new programmers and is similar to Matlab, C/C++, Java, or Visual Basic. Python is general purpose and comparatively easy to learn with an increased adoption for analytical and quantitative computing. For over a decade, Python has been used in scientific computing and highly quantitative domains such as finance, oil and gas, physics, and signal processing.
"Big Data" is term heard more and more in industry – but what does it really mean? There is a vagueness to the term reminiscent of that experienced in the early days of cloud computing. This has led to a number of implications for various industries and enterprises. These range from identifying the actual skills needed to recruit talent to articulating the requirements of a "big data" project. Secondary implications include difficulties in finding solutions that are appropriate to the problems at hand – versus solutions looking for problems. This presentation will take a look at Big Data and offer the audience with some considerations they may use immediately to assess the use of analytics in solving their problems.
The talk begins with an idea of how big "Big Data" can be. This leads to an appreciation of how important "Management Questions" are to assessing analytic needs. The fields of data and analysis have become extremely important and impact nearly all facets of life and business. During the talk we will look at the two pillars of Big Data – Data Warehousing and Predictive Analytics. Then we will explore the open source tools and datasets available to NATO action officers to work in this domain. Use cases relevant to NATO will be explored with the purpose of show where analytics lies hidden within many of the day-to-day problems of enterprises. The presentation will close with a look at the future. Advances in the area of semantic technologies continue. The much acclaimed consultants at Gartner listed Big Data and Semantic Technologies as the first- and third-ranked top technology trends to modernize information management in the coming decade. They note there is an incredible value "locked inside all this ungoverned and underused information." HQ SACT can leverage this powerful analytic approach to capture requirement trends when establishing acquisition strategies, monitor Priority Shortfall Areas, prepare solicitations, and retrieve meaningful data from archives.
A brief intro about me, my past experience, and a small history of data analytics and data science and the latest trends in data science such as deep learning, graph analysis, distributed computing with spark and streaming computing.
Similar to Are we reaching a Data Science Singularity? How Cognitive Computing is emerging from Machine Learning Algorithms, Big Data Tools, and Cloud Services by Natalino Busa (20)
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017Big Data Spain
Insights can only be as good as the data. The data quality domain is enormously large, so you need to understand your company pain points to know what to focus on first.
https://www.bigdataspain.org/2017/talk/big-data-big-quality
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...Big Data Spain
2gether is a financial platform based on Blockchain, Big Data and Artificial Intelligence that allows interaction between users and third-party services in a single interface.
https://www.bigdataspain.org/2017/talk/scaling-a-backend-for-a-big-data-and-blockchain-environment
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017Big Data Spain
All modern Big Data solutions, like Hadoop, Kafka or the rest of the ecosystem tools, are designed as distributed processes and as such include some sort of redundancy for High Availability.
https://www.bigdataspain.org/2017/talk/disaster-recovery-for-big-data
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...Big Data Spain
In this presentation, attendees will see how to speed up existing Hadoop and Spark deployments by just making Apache Ignite responsible for RAM utilization. No code modifications, no new architecture from scratch!
https://www.bigdataspain.org/2017/talk/boost-hadoop-and-spark-with-in-memory-technologies
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...Big Data Spain
The power of this new set of tools for Data Science. Is really easy to start applying these technics in your current workflow.
https://www.bigdataspain.org/2017/talk/data-science-for-lazy-people-automated-machine-learning
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...Big Data Spain
GPUs on the cloud as Infrastructure as a Service (IaaS) seem a commodity. However to efficiently distribute deep learning tasks on several GPUs is challenging.
https://www.bigdataspain.org/2017/talk/training-deep-learning-models-on-multiple-gpus-in-the-cloud
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...Big Data Spain
Unbalanced data is a specific data configuration that appears commonly in nature. Applying machine learning techniques to this kind of data is a difficult process, usually addressed by unbalanced reduction techniques.
https://www.bigdataspain.org/2017/talk/unbalanced-data-same-algorithms-different-techniques
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
State of the art time-series analysis with deep learning by Javier Ordóñez at...Big Data Spain
Time series related problems have traditionally been solved using engineered features obtained by heuristic processes.
https://www.bigdataspain.org/2017/talk/state-of-the-art-time-series-analysis-with-deep-learning
Big Data Spain 2017
November 16th - 17th
Trading at market speed with the latest Kafka features by Iñigo González at B...Big Data Spain
Not long ago only banks and hedge funds could afford doing automated and High Frequency Trading, that is, the ability to send buy commodities in microseconds intervals.
https://www.bigdataspain.org/2017/talk/trading-at-market-speed-with-the-latest-kafka-features
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...Big Data Spain
The shift to stream processing at LinkedIn has accelerated over the past few years. We now have over 200 Samza applications in production processing more than 260B events per day.
https://www.bigdataspain.org/2017/talk/apache-samza-jake-maes
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...Big Data Spain
IBM has built a “Data Science Experience” cloud service that exposes Notebook services at web scale.
https://www.bigdataspain.org/2017/talk/the-analytic-platform-behind-ibms-watson-data-platform
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...Big Data Spain
Artificial Intelligence and Data-centric businesses.
https://www.bigdataspain.org/2017/talk/tbc
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017Big Data Spain
Ten years ago there were rumours of the death of causal inference. Big data was supposed to enable us to rely on purely correlational data to predict and control the world.
https://www.bigdataspain.org/2017/talk/why-big-data-didnt-end-causal-inference
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...Big Data Spain
The Meme of the Internet Index will be the new normal to analyze and predict facts and sensations which go around the Internet.
https://www.bigdataspain.org/2017/talk/meme-index-analyzing-fads-and-sensations-on-the-internet
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...Big Data Spain
Geotab is a leader in the expanding world of Internet of Things (IoT) and telematics industry with Big Data.
https://www.bigdataspain.org/2017/talk/vehicle-big-data-that-drives-smart-city-advancement
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...Big Data Spain
The talk will focus on explaining why operational databases do not scale due to limitations in legacy transactional management.
https://www.bigdataspain.org/2017/talk/end-of-the-myth-ultra-scalable-transactional-management
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...Big Data Spain
In recent years Machine Learning (ML) and especially Deep Learning (DL) have achieved great success in many areas such as visual recognition, NLP or even aiding in medical research.
https://www.bigdataspain.org/2017/talk/attacking-machine-learning-used-in-antivirus-with-reinforcement
Big Data Spain 2017
16th - 17th Kinépolis Madrid
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...Big Data Spain
Primary function of banking sector is promoting economic activity; which means “commerce”, exchanging what someone produces-has for something that someone consumes-desires.
https://www.bigdataspain.org/2017/talk/more-people-less-banking-blockchain
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017Big Data Spain
Bol.com has been an early Hadoop user: since 2008 where it was first built for a recommendation algorithm.
https://www.bigdataspain.org/2017/talk/make-the-elephant-fly-once-again
Big Data Spain 2017
16th - 17th Kinépolis Madrid
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Generating a custom Ruby SDK for your web service or Rails API using Smithyg2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
JMeter webinar - integration with InfluxDB and Grafana
Are we reaching a Data Science Singularity? How Cognitive Computing is emerging from Machine Learning Algorithms, Big Data Tools, and Cloud Services by Natalino Busa
1.
2. 1
Natalino Busa - @natbusa
Natalino Busa
Head of Data Science Teradata
9. 8
Natalino Busa - @natbusa
Learning: The Scientific Method
Ørsted's "First Introduction to General Physics" (1811)
https://en.m.wikipedia.org/wiki/History_of_scientific_method
observation hypothesis deduction synthesis
Hans Christian Ørsted
experiment
Icons made by Gregor Cresnar from www.flaticon.com is licensed by CC 3.0 BY
10. 9
Natalino Busa - @natbusa
Innovation in Data Analytics
Cloud Community AI & ML
12. 11
Natalino Busa - @natbusa
“we live in an age of open source datacenters, so
we can stack all these things together and we
have open source from the ground to ceiling.”
Sam Ramji, CEO of Cloud Foundry
https://www.youtube.com/watch?v=7oCSFcUW-Qk
13. 12
Natalino Busa - @natbusa
Analytics in the cloud
Bare Metal: Physical Machines
IAAS: Virtual Resources
CAAS: Containers,
dPAAS: Datastores, Data Engines
iPAAS: Tools Integration, Flows & Processes
DAAAS: Data Analytics as a Service
14. 13
Natalino Busa - @natbusa
DAAAS: AI and ML API’s
Cloud Computing for Deep Neural Networks
> Models, Compute (Train, Score), and Data
AI and ML models for:
● Speech (audio)
● Language (text)
● Vision (images/video)
● Data (classification, regression, clustering, anomaly detection)
15. 14
Natalino Busa - @natbusa
Ephemeral Computing Clusters on a Cloud
data
create load compute store
timeline
destroy
16. 15
Natalino Busa - @natbusa
dPaaS: Analytical clusters
Ephemeral
Short-Lived
Data Exploration
Isolated, Personal
Simple Access Management
Permanent
Long Lived
Production / Operations
Co-Ordinated
Complex Access Management
vs
17. 16
Natalino Busa - @natbusa
GPU’s and Distributed Computing
GPU support is coming in Kubernetes, Mesos, Spark
https://www.oreilly.com/learning/accelerating-spark-workloads-using-gpus
http://www.slideshare.net/databricks/tensorframes-google-tensorflow-on-apache-spark
out
up
CPU
R,Python
Spark
TensorFrames
26. 25
Natalino Busa - @natbusa
Ask me Anything
Dynamic Memory Networks
for Natural Language
Processing
https://arxiv.org/pdf/1603.01417v1.pdf
https://youtu.be/oGk1v1jQITw
Caiming Xiong,
Stephen Merity,
Richard Socher
27. 26
Natalino Busa - @natbusa
Ask me Anything
http://www.socher.org/index.php/DeepLearningTutorial/DeepLearningTutorial
Dynamic Memory Networks for Natural Language Processing
https://arxiv.org/pdf/1603.01417v1.pdf
http://www.socher.org/
Local
context
Wider
context
NLP, Attention Masks
Semantic Embeddings from Text, Images
29. 28
Natalino Busa - @natbusa
Network Intrusion Detection
http://billsdata.net/?p=105
It contains 130 million flow records involving
12,027 distinct computers over 36 days (not
the full 58 days claimed for the entire data
release).
Each record consists of: time (to nearest
second), duration, source and destination
computer ids, source and destination ports,
protocol, number of packets and number of
bytes
Techniques: TDA, Dimensionality Reduction
https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction
30. 29
Natalino Busa - @natbusa
Approaching (Almost) Any Machine Learning Problem
- Abhishek Thakur, Kaggle Grandmaster -
data labels
raw data: tables, files Useful dataData munging Feature
Engineering
Tabular Data ready for ML
http://blog.kaggle.com/2016/07/21/approaching-almost-any-machine-l
earning-problem-abhishek-thakur/
31. 30
Natalino Busa - @natbusa
AutoML challenge
- based on scikit-learn
- 15 classifiers,
- 14 feature preprocessing methods
- 4 data preprocessing methods
- 110 hyperparameters
- Supervised classification challenge:
100 different datasets
Natalino Busa - @natbusa
33. 32
Natalino Busa - @natbusa
Human cognitive biases :
Too much information
Not enough meaning
What should we
remember?
Need to act fast
https://en.wikipedia.org/wiki/List_of_cognitive_biases
34. 33
Natalino Busa - @natbusa
Man vs Machine cognitive limits
Model generation
Explanation
Unsupervised
Planning
Too much information
Not enough meaning
Need to act quickly
Memory limits
35. 34
Natalino Busa - @natbusa
Theorems often tell us complex truths about the simple things,
but only rarely tell us simple truths about the complex ones
Marvin Minsky
K-Linesː A Theory of Memory (1980)
36. 35
Natalino Busa - @natbusa
Data Science: wear the AI/ML Lenses
We are entering a new era of intelligent machines
Boost our understanding of data
Focus on higher level analyses
37. 36
Natalino Busa - @natbusa
Intelligent Data Systems:
Long live the “database”
Wikipedia:
A database is an organized collection of data.
DATA
New-SQL
ML
AI
SQL
Python - Scala - R
NLP
UX
Speech
COG
38. 37
Natalino Busa - @natbusa
The Database.
is never going to be the same.