Nearest neighbor models are conceptually just about the simplest kind of model possible. The problem is that they generally aren’t feasible to apply. Or at least, they weren’t feasible until the advent of Big Data techniques. These slides will describe some of the techniques used in the knn project to reduce thousand-year computations to a few hours. The knn project uses the Mahout math library and Hadoop to speed up these enormous computations to the point that they can be usefully applied to real problems. These same techniques can also be used to do real-time model scoring.
A study summary on Generative Adversarial Networks (GANs)
Along with a YouTube video in Mandarin (https://www.youtube.com/watch?v=c6mngSqcSIw&feature=youtu.be)
Covered from four angles:
GAN Training
GAN vs Latent Space
GAN Autoencoder
Super Resolution GAN
Mining Frequent Closed Graphs on Evolving Data Streams (Albert Bifet)
Graph mining is a challenging task by itself, and even more so when processing data streams which evolve in real-time. Data stream mining faces hard constraints regarding time and space for processing, and also needs to provide for concept drift detection. In this talk we present a framework for studying graph pattern mining on time-varying streams and large datasets.
Presentation of the paper:
Szymon Klarman and Thomas Meyer. Querying Temporal Databases via OWL 2 QL (with appendix). In Proceedings of the 8th International Conference on Web Reasoning and Rule Systems (RR-14), 2014.
Sergei Vassilvitskii, Research Scientist, Google, at MLconf NYC - 4/15/16 (MLconf)
Teaching K-Means New Tricks: Over 50 years old, the k-means algorithm remains one of the most popular clustering algorithms. In this talk we’ll cover some recent developments, including better initialization, the notion of coresets, clustering at scale, and clustering with outliers.
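As a rough illustration of the "better initialization" mentioned above, here is a minimal k-means++-style seeding sketch in NumPy (this is not code from the talk; the data, k, and seeding details are illustrative assumptions):

import numpy as np

def kmeans_pp_init(X, k, rng=np.random.default_rng(0)):
    # Pick k seeds, each chosen with probability proportional to its squared
    # distance from the nearest seed picked so far (k-means++ style seeding).
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                     # first seed: uniform at random
    for _ in range(k - 1):
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1).min(axis=1)
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])   # far points are more likely seeds
    return np.array(centers)

X = np.random.default_rng(1).normal(size=(500, 2))     # toy data
print(kmeans_pp_init(X, k=5).shape)                    # (5, 2)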
In this project I use a stack of denoising autoencoders to learn low-dimensional representations of images. These encodings are used as input to a locality-sensitive hashing algorithm to find images similar to a given query image. The results clearly show that this approach outperforms basic LSH by far.
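A minimal sketch of the retrieval stage described above, using random-hyperplane (sign) LSH over the learned encodings; the encoder itself is assumed to already exist, and all names and sizes here are illustrative rather than taken from the project:

import numpy as np

rng = np.random.default_rng(0)
codes = rng.normal(size=(1000, 32))        # stand-in for the autoencoder encodings
planes = rng.normal(size=(32, 16))         # 16 random hyperplanes -> 16-bit signatures

def signature(x):
    # the sign of the projection onto each hyperplane gives one bit of the hash
    return tuple((x @ planes) > 0)

buckets = {}
for i, code in enumerate(codes):
    buckets.setdefault(signature(code), []).append(i)

query = codes[0]
candidates = buckets.get(signature(query), [])   # images sharing the query's bucket
print(len(candidates))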
Slides by Víctor Garcia about the paper:
Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial text to image synthesis." ICML 2016.
Using Feature Grouping as a Stochastic Regularizer for High Dimensional Noisy Data (WiMLDS Montreal)
"Using Feature Grouping as a Stochastic Regularizer for High Dimensional Noisy Data"
By Sergül Aydöre, Assistant Professor at Stevens Institute of Technology
Abstract:
The use of complex models (with many parameters) is challenging with high-dimensional, small-sample problems: indeed, they face rapid overfitting. Such situations are common when data collection is expensive, as in neuroscience, biology, or geology. Dedicated regularization can be crafted to tame overfit, typically via structured penalties. But rich penalties require mathematical expertise and entail large computational costs. Stochastic regularizers such as dropout are easier to implement: they prevent overfitting by random perturbations. Used inside a stochastic optimizer, they come with little additional cost. We propose a structured stochastic regularization that relies on feature grouping. Using a fast clustering algorithm, we define a family of groups of features that capture feature covariations. We then randomly select these groups inside a stochastic gradient descent loop. This procedure acts as a structured regularizer for high-dimensional correlated data without additional computational cost, and it has a denoising effect. We demonstrate the performance of our approach for logistic regression both on a sample-limited face image dataset with varying additive noise and on a typical high-dimensional learning problem, brain image classification.
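A minimal sketch of the idea as described in the abstract: cluster the feature columns once, then randomly replace selected groups by their group mean inside an SGD loop. The clustering choice, group-selection probability, and model below are illustrative assumptions, not the authors' exact recipe:

import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))                    # small-sample, high-dimensional data
y = (X[:, :5].sum(axis=1) > 0).astype(int)

# 1) group correlated features with a fast clustering of the feature columns
groups = MiniBatchKMeans(n_clusters=50, n_init=3, random_state=0).fit_predict(X.T)

def perturb(batch, p=0.5):
    # randomly pick feature groups and replace each member by its group mean
    out = batch.copy()
    for g in np.unique(groups):
        if rng.random() < p:
            idx = np.where(groups == g)[0]
            out[:, idx] = out[:, idx].mean(axis=1, keepdims=True)
    return out

# 2) apply the structured perturbation inside a stochastic gradient descent loop
clf = SGDClassifier(loss="log_loss", random_state=0)   # logistic regression (recent scikit-learn)
for epoch in range(20):
    order = rng.permutation(len(X))
    for start in range(0, len(X), 32):
        idx = order[start:start + 32]
        clf.partial_fit(perturb(X[idx]), y[idx], classes=np.array([0, 1]))
print("train accuracy:", clf.score(X, y))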
Talk at Hug FR on December 4, 2012 about the new Apache Drill project. Notably, this talk includes an introduction to the converging specification for the logical plan in Drill.
A brief introduction to clustering with scikit-learn. In this presentation, we provide an overview, with real examples, of how to use and tune k-means clustering.
"Number Crunching in Python": slides presented at EuroPython 2012, Florence, Italy
Slides have been authored by me and by Dr. Enrico Franchi.
Scientific and Engineering Computing, Numpy NDArray implementation and some working case studies are reported.
With datasets becoming larger, new visualization techniques are needed. One problem is that the number of data points is so large that a scatter plot becomes cluttered. Another problem is that with over a billion objects, only a few CPU cycles are available per object if one wants to process them within a second, making traditional methods unviable.
I will show that it is possible to visualize a billion objects in about 1 second on a modern desktop computer, using memory mapping of HDF5 files together with a simple binning algorithm in Python. Density fields in 1, 2, and 3 dimensions are computed, and statistics of any property of the objects, or any mathematical operation on them, can be computed on the fly. This enables efficient interactive exploration of large datasets, making science exploration feasible.
This idea is implemented in a Python library called vaex, which integrates well with the Jupyter/NumPy/Astropy stack. Built on top of this is the vaex application, which allows for interactive exploration and visualization.
The motivation for vaex is the upcoming astronomical catalogue of the Gaia satellite, which will contain properties of over a billion stars. However, vaex can also be used on N-body simulations, other catalogues, or any tabular data.
https://www.astro.rug.nl/~breddels/vaex
https://github.com/maartenbreddels/vaex
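A minimal sketch of the memory-mapping-plus-binning idea in plain NumPy (the data here is synthetic and the grid and ranges are illustrative; vaex's own API is not shown):

import numpy as np

# stand-ins for two large columns; with real data these would be np.memmap views onto an HDF5 file
n = 10_000_000
rng = np.random.default_rng(0)
x, y = rng.normal(size=n), rng.normal(size=n)

# density field on a 256 x 256 grid: one pass over the data, tiny memory footprint
bins = 256
ix = np.clip(((x + 5) / 10 * bins).astype(np.int64), 0, bins - 1)
iy = np.clip(((y + 5) / 10 * bins).astype(np.int64), 0, bins - 1)
density = np.bincount(ix * bins + iy, minlength=bins * bins).reshape(bins, bins)

# the same binning aggregates any statistic on the fly, e.g. a per-cell mean of a third column
z = x * y
sums = np.bincount(ix * bins + iy, weights=z, minlength=bins * bins).reshape(bins, bins)
mean_z = sums / np.maximum(density, 1)
print(density.sum(), mean_z.shape)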
Lecture 10b: Classification. k-Nearest Neighbor classifier, Logistic Regression, Support Vector Machines (SVM), Naive Bayes (ppt,pdf)
Chapters 4,5 from the book “Introduction to Data Mining” by Tan, Steinbach, Kumar.
Christoph Koch is a professor of Computer Science at EPFL, specializing in data management. Until 2010, he was an Associate Professor in the Department of Computer Science at Cornell University. Previously to this, from 2005 to 2007, he was an Associate Professor of Computer Science at Saarland University. Earlier, he obtained his PhD in Artificial Intelligence from TU Vienna and CERN (2001), was a postdoctoral researcher at TU Vienna and the University of Edinburgh (2001-2003), and an assistant professor at TU Vienna (2003-2005). He has won Best Paper Awards at PODS 2002, ICALP 2005, and SIGMOD 2011, an Outrageous Ideas and Vision Paper Award at CIDR 2013, a Google Research Award (in 2009), and an ERC Grant (in 2011). He is a PI of the FET Flagship Human Brain Project and of NCCR MARVEL, a new Swiss national research center for materials research. He (co-)chaired the program committees of DBPL 2005, WebDB 2008, ICDE 2011, VLDB 2013, and was PC vice-chair of ICDE 2008 and ICDE 2009. He has served on the editorial board of ACM Transactions on Internet Technology and as Editor-in-Chief of PVLDB.
Sean Kandel - Data profiling: Assessing the overall content and quality of a data set (HUG UK)
The task of “data profiling”—assessing the overall content and quality of a data set—is a core aspect of the analytic experience. Traditionally, profiling was a fairly cut-and-dried task: load the raw numbers into a stat package, run some basic descriptive statistics, and report the output in a summary file or perhaps a simple data visualization. However, data volumes can be so large today that traditional tools and methods for computing descriptive statistics become intractable; even with scalable infrastructure like Hadoop, aggressive optimization and statistical approximation techniques must be used. In this talk Sean will cover technical challenges in keeping data profiling agile in the Big Data era. He will discuss both research results and real-world best practices used by analysts in the field, including methods for sampling, summarizing and sketching data, and the pros and cons of using these various approaches.
Sean is Trifacta’s Chief Technical Officer. He completed his Ph.D. at Stanford University, where his research focused on user interfaces for database systems. At Stanford, Sean led development of new tools for data transformation and discovery, such as Data Wrangler. He previously worked as a data analyst at Citadel Investment Group.
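As a small illustration of the sampling methods mentioned in the talk description, here is textbook reservoir sampling, which keeps a fixed-size uniform sample of a stream too large to hold in memory (a generic technique, not Trifacta's implementation):

import random

def reservoir_sample(stream, k, seed=0):
    # keep a uniform random sample of k items from a stream of unknown length
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)    # replace an existing item with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(1_000_000), k=5))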
In pattern recognition, the k-nearest neighbors algorithm is a non-parametric method, proposed by Thomas Cover, that is used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space.
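A minimal brute-force sketch of k-NN classification in NumPy (the data and k are illustrative):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    # classify x by majority vote among its k nearest training examples
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
y_train = (X_train[:, 0] > 0).astype(int)
print(knn_predict(X_train, y_train, np.array([0.5, -0.2])))   # expect class 1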
"Practical Machine Learning With Ruby" by Iqbal Farabi (ID Ruby Community)Tech in Asia ID
These slides were shared at Tech in Asia Jakarta 2016 on 17 November 2016.
Big data classification refers to the process of categorizing or classifying data based on predefined categories or classes. It involves using machine learning algorithms and statistical techniques to analyze large volumes of data and assign them to specific classes or categories.
Here are some key aspects of big data classification:
Training Data: Classification algorithms require a labeled dataset for training. This dataset consists of examples where the class or category of each data point is known. The training data is used to build a classification model that can generalize patterns and relationships between the input features and the corresponding classes.
Feature Selection: In classification, the features or attributes of the data play a crucial role in determining the class labels. Feature selection involves identifying the most relevant and informative features that contribute to accurate classification. This helps reduce dimensionality and improve the performance of the classification model.
Classification Algorithms: Various machine learning algorithms can be used for big data classification, such as decision trees, random forests, support vector machines (SVM), logistic regression, and deep learning techniques like neural networks. Each algorithm has its strengths, limitations, and suitability for different types of data and classification tasks.
Model Training and Evaluation: The classification model is trained using the labeled training data. The model learns the patterns and relationships between the features and the corresponding class labels. After training, the model is evaluated using evaluation metrics such as accuracy, precision, recall, F1 score, and area under the receiver operating characteristic (ROC) curve to assess its performance and generalization ability.
Predictive Classification: Once the classification model is trained and evaluated, it can be used to predict the classes of new, unseen data. The model takes the input features of the new data and applies the learned patterns to assign the appropriate class label.
Handling Big Data Challenges: Big data classification comes with specific challenges, including the volume, velocity, variety, and veracity of data. Processing large-scale datasets requires distributed computing frameworks like Apache Hadoop or Apache Spark. Additionally, data preprocessing, feature engineering, and model optimization techniques are used to handle the complexity and scalability of big data.
Big data classification finds applications in various domains, including customer segmentation, fraud detection, image recognition, sentiment analysis, and medical diagnosis, among others. It enables organizations to extract valuable insights from massive datasets and automate the process of categorizing and organizing data.
To successfully perform big data classification, it is important to have a good understanding of the data, the domain, and the appropriate choice of algorithms and techniques.
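A minimal end-to-end sketch of the train/evaluate/predict steps described above, using scikit-learn on a synthetic dataset (the model choice and metrics are illustrative; a real big-data pipeline would run the equivalent steps on a framework such as Spark or Hadoop):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# labeled training data (a stand-in for a real extract)
X, y = make_classification(n_samples=5000, n_features=30, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# model training
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# evaluation
pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]
print("accuracy:", accuracy_score(y_test, pred))
print("F1:", f1_score(y_test, pred))
print("ROC AUC:", roc_auc_score(y_test, proba))

# predictive classification on new, unseen data
print(model.predict(X_test[:3]))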
How Data-Driven Approaches are Changing Your Data Management Strategies
Introducing data-driven strategies into your business model alters the way your organization manages and provides information to your customers, partners and employees. Gone are the days of “waterfall” implementation strategies from relational data to applications within a data center. Now, data-driven business models require agile implementation of applications based on information from all across an organization–on-premises, cloud, and mobile–and includes information from outside corporate walls from partners, third-party vendors, and customers. Data management strategies need to be ready to meet these challenges or your new and disruptive business models will fail at the most critical time: when your customers want to access it.
ML Workshop 2: Machine Learning Model Comparison & Evaluation (MapR Technologies)
How Rendezvous Architecture Improves Evaluation in the Real World
In this edition of our machine learning logistics webinar series, we build on the ideas of the key requirements for effective management of machine learning logistics presented in the Overview webinar and in the Part I workshop. Here we focus on model-to-model comparison and evaluation, use of decoy models, and more. Listen here: http://info.mapr.com/machine-learning-workshop2.html?_ga=2.35695522.324200644.1511891424-416597139.1465233415
Self-Service Data Science for Leveraging ML & AI on All of Your Data (MapR Technologies)
MapR has launched the MapR Data Science Refinery which leverages a scalable data science notebook with native platform access, superior out-of-the-box security, and access to global event streaming and a multi-model NoSQL database.
Enabling Real-Time Business with Change Data Capture (MapR Technologies)
Machine learning (ML) and artificial intelligence (AI) enable intelligent processes that can autonomously make decisions in real-time. The real challenge for effective ML and AI is getting all relevant data to a converged data platform in real-time, where it can be processed using modern technologies and integrated into any downstream systems.
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ... (MapR Technologies)
Big data technologies are being applied to a wide variety of use cases. We will review tangible examples of machine learning, discuss an autonomous driving project and illustrate the role of MapR in next generation initiatives. More: http://info.mapr.com/WB_Machine-Learning-for-Chickens_Global_DG_17.11.02_RegistrationPage.html
ML Workshop 1: A New Architecture for Machine Learning Logistics (MapR Technologies)
Having heard the high-level rationale for the rendezvous architecture in the introduction to this series, we will now dig in deeper to talk about how and why the pieces fit together. In terms of components, we will cover why streams work, why they need to be persistent, performant and pervasive in a microservices design and how they provide isolation between components. From there, we will talk about some of the details of the implementation of a rendezvous architecture including discussion of when the architecture is applicable, key components of message content and how failures and upgrades are handled. We will touch on the monitoring requirements for a rendezvous system but will save the analysis of the recorded data for later. Listen to the webinar on demand: https://mapr.com/resources/webinars/machine-learning-workshop-1/
Machine Learning Success: The Key to Easier Model Management (MapR Technologies)
Join Ellen Friedman, co-author (with Ted Dunning) of a new short O’Reilly book Machine Learning Logistics: Model Management in the Real World, to look at what you can do to have effective model management, including the role of stream-first architecture, containers, a microservices approach and a DataOps style of work. Ellen will provide a basic explanation of a new architecture that not only leverages stream transport but also makes use of canary models and decoy models for accurate model evaluation and for efficient and rapid deployment of new models in production.
Data Warehouse Modernization: Accelerating Time-To-Action (MapR Technologies)
Data warehouses have been the standard tool for analyzing data created by business operations. In recent years, increasing data volumes, new types of data formats, and emerging analytics technologies such as machine learning have given rise to modern data lakes. Connecting application databases, data warehouses, and data lakes using real-time data pipelines can significantly improve the time to action for business decisions. More: http://info.mapr.com/WB_MapR-StreamSets-Data-Warehouse-Modernization_Global_DG_17.08.16_RegistrationPage.html
Live Tutorial – Streaming Real-Time Events Using Apache APIs (MapR Technologies)
For this talk we will explore the power of streaming real time events in the context of the IoT and smart cities.
http://info.mapr.com/WB_Streaming-Real-Time-Events_Global_DG_17.08.02_RegistrationPage.html
Bringing Structure, Scalability, and Services to Cloud-Scale Storage (MapR Technologies)
Deploying storage with a forklift is so 1990s, right? Today’s applications and infrastructure demand systems and services that scale. Customers require performance and capacity to fit the use case and workloads, not the other way around. Architects need multi-temperature, multi-location, highly available, and compliance friendly platforms that grow with the generational shift in data growth and utility.
Churn prediction is big business. It minimizes customer defection by predicting which customers are likely to cancel a service. Though originally used within the telecommunications industry, it has become common practice for banks, ISPs, insurance firms, and other verticals. More: http://info.mapr.com/WB_PredictingChurn_Global_DG_17.06.15_RegistrationPage.html
The prediction process is data-driven and often uses advanced machine learning techniques. In this webinar, we'll look at customer data, do some preliminary analysis, and generate churn prediction models – all with Spark machine learning (ML) and a Zeppelin notebook.
Spark’s ML library aims to make machine learning scalable and easy. Zeppelin with Spark provides a web-based notebook that enables interactive machine learning and visualization.
In this tutorial, we'll do the following:
Review classification and decision trees
Use Spark DataFrames with Spark ML pipelines
Predict customer churn with Apache Spark ML decision trees
Use Zeppelin to run Spark commands and visualize the results
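A minimal PySpark sketch of the pipeline outlined in the bullets above; the CSV path and column names are placeholders, not the webinar's actual dataset:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("churn-demo").getOrCreate()

# hypothetical customer data with a string "churn" label and numeric usage columns
df = spark.read.csv("churn.csv", header=True, inferSchema=True)

label = StringIndexer(inputCol="churn", outputCol="label")
features = VectorAssembler(inputCols=["day_minutes", "intl_calls", "svc_calls"],
                           outputCol="features")
tree = DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=5)

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=[label, features, tree]).fit(train)

auc = BinaryClassificationEvaluator().evaluate(model.transform(test))
print("area under ROC:", auc)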
An Introduction to the MapR Converged Data Platform (MapR Technologies)
Listen to the webinar on-demand: http://info.mapr.com/WB_Partner_CDP_Intro_EMEA_DG_17.05.31_RegistrationPage.html
In this 90-minute webinar, we discuss:
- The MapR Converged Data Platform and its components
- Use cases for the Converged Data Platform
- MapR Converged Partner Program
- How to get started with MapR
- Becoming a partner
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon... (MapR Technologies)
IT budgets are shrinking, and the move to next-generation technologies is upon us. The cloud is an option for nearly every company, but just because it is an option doesn’t mean it is always the right solution for every problem.
Most cloud providers would prefer that every customer be tightly coupled with their proprietary services and APIs to create lock-in with that cloud provider. The savvy customer will leverage the cloud as infrastructure and stay loosely bound to a cloud provider. This creates an opportunity for the customer to execute a multicloud strategy or even a hybrid on-premises and cloud solution.
Jim Scott explores different use cases that may be best run in the cloud versus on-premises, points out opportunities to optimize cost and operational benefits, and explains how to get the data moved between locations. Along the way, Jim discusses security, backups, event streaming, databases, replication, and snapshots across a variety of use cases that run most businesses today.
Is your organization at the analytics crossroads? Have you made strides collecting and sharing massive amounts of data from electronic health records, insurance claims, and health information exchanges but found these efforts made little impact on efficiency, patient outcomes, or costs?
Changes in how business is done combined with multiple technology drivers make geo-distributed data increasingly important for enterprises. These changes are causing serious disruption across a wide range of industries, including healthcare, manufacturing, automotive, telecommunications, and entertainment. Technical challenges arise with these disruptions, but the good news is there are now innovative solutions to address these problems. http://info.mapr.com/WB_Geo-distributed-Big-Data-and-Analytics_Global_DG_17.05.16_RegistrationPage.html
MapR announced a few new releases in 2017, and we want to go over those exciting new products and features that are available now. We’d like to invite our customers and partners to this webinar in which members of the MapR product team will share details about the latest updates.
3 Benefits of Multi-Temperature Data Management for Data Analytics (MapR Technologies)
SAP® HANA and SAP® IQ are popular platforms for various analytical and transactional use cases. If you’re an SAP customer, you’ve experienced the benefits of deploying these solutions. However, as data volumes grow, you’re likely asking yourself: How do I scale storage to support these applications? How can I have one platform for various applications and use cases?
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsMapR Technologies
SAP HANA is an increasingly popular platform for various analytical and transactional use cases with its in-memory architecture. If you’re an SAP customer you’ve experienced the benefits.
However, the underlying storage for SAP HANA is painfully expensive. This slows down your ability to grow your SAP HANA footprint and serve up more applications.
You’re not the only one still loading your data into data warehouses and building marts or cubes out of it. But today’s data requires a much more accessible environment that delivers real-time results. Prepare for this transformation because your data platform and storage choices are about to undergo a re-platforming that happens once in 30 years.
With the MapR Converged Data Platform (CDP) and Cisco Unified Compute System (UCS), you can optimize today’s infrastructure and grow to take advantage of what’s next. Uncover the range of possibilities from re-platforming by intimately understanding your options for density, performance, functionality and more.
Communications Mining Series - Zero to Hero - Session 1 (DianaGray10)
This session provides an introduction to UiPath Communications Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communications Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Essentials of Automations: The Art of Triggers and Actions in FME (Safe Software)
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
How to Get CNIC Information System with Paksim Ga (danishmna97)
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Generative AI Deep Dive: Advancing from Proof of Concept to Production (Aggregage)
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deployment Firewall and DBOM (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 (Albert Hoitingh)
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
UiPath Test Automation using UiPath Test Suite series, part 6 (DianaGray10)
Welcome to the UiPath Test Automation using UiPath Test Suite series, part 6. In this session, we will cover test automation with generative AI and OpenAI.
The UiPath Test Automation with generative AI and OpenAI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, as a test automation solution, with OpenAI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by me and Rik Marselis from the DASA Connect conference on 30 May 2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We ended with a lovely workshop in which the participants tried to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0! (SOFTTECHHUB)
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... (SOFTTECHHUB)
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Epistemic Interaction - tuning interfaces to provide information for AI support (Alan Dix)
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
GraphRAG is All You Need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
4. What Works at Scale
• Logging
• Counting
• Session grouping
5. What Works at Scale
• Logging
• Counting
• Session grouping
• Really. Don’t bet on anything much more complex than these
6. What Works at Scale
• Logging
• Counting
• Session grouping
• Really. Don’t bet on anything much more complex than these
• These are harder than they look
8. Recommendations
• Special case of reflected intelligence
• Traditionally “people who bought x also bought y”
• But soooo much more is possible
9. Examples
• Customers buying books (Linden et al)
• Web visitors rating music (Shardanand and Maes) or movies (Riedl et al), (Netflix)
• Internet radio listeners not skipping songs (Musicmatch)
• Internet video watchers watching >30 s
10. Dyadic Structure
• Functional
– Interaction: actor -> item*
• Relational
– Interaction ⊆ Actors x Items
• Matrix
– Rows indexed by actor, columns by item
– Value is count of interactions
• Predict missing observations
11. Recommendations Analysis
• R(x,y) = # people who bought x also bought y
select x, y, count(*) from (
(select distinct(user_id, item_id) as x from log) A
join
(select distinct(user_id, item_id) as y from log) B
on user_id
) group by x, y
12. Recommendations Analysis
• R(x,y) = People who bought x also bought y
select x, y, count(*) from (
(select distinct(user_id, item_id) as x from log) A
join
(select distinct(user_id, item_id) as y from log) B
on user_id
) group by x, y
18. Fundamental Algorithmic Structure
• Cooccurrence
• Matrix approximation by factoring
• LLR
K = A’A
A ≈ U S V’
K ≈ V S² V’
r = V S² V’ h
r = sparsify(A’A) h
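A minimal NumPy sketch of this formulation: build the user-by-item matrix A, form the cooccurrence K = A’A, keep only the strongest cooccurrences, and score a user history h. The simple percentile threshold below stands in for the LLR test named on the slide; all sizes are illustrative:

import numpy as np

rng = np.random.default_rng(0)
A = (rng.random((1000, 50)) < 0.05).astype(float)   # users x items interaction matrix

K = A.T @ A                                         # item-item cooccurrence counts (A'A)
np.fill_diagonal(K, 0)                              # ignore self-cooccurrence

# sparsify: keep only cooccurrences that stand out (an LLR test would go here)
indicators = (K >= np.percentile(K, 99)).astype(float)

h = A[0]                                            # one user's interaction history
scores = indicators @ h                             # r = sparsify(A'A) h
print(np.argsort(scores)[::-1][:5])                 # top-5 recommended items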
20. For example
• Users enter queries (A)
– (actor = user, item = query)
• Users view videos (B)
– (actor = user, item = video)
• A’A gives query recommendation
– “did you mean to ask for”
• B’B gives video recommendation
– “you might like these videos”
21. The punch-line
• B’A recommends videos in response to a query
– (isn’t that a search engine?)
– (not quite, it doesn’t look at content or meta-data)
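A minimal sketch of this cross-recommendation, again in NumPy: with A = users x queries and B = users x videos, B’A links videos to the queries that tend to precede them, so a query vector can score videos directly (shapes and the count threshold are illustrative; a real system would use LLR):

import numpy as np

rng = np.random.default_rng(1)
n_users, n_queries, n_videos = 2000, 100, 300
A = (rng.random((n_users, n_queries)) < 0.03).astype(float)   # users x queries
B = (rng.random((n_users, n_videos)) < 0.03).astype(float)    # users x videos

cross = B.T @ A                                  # videos x queries cooccurrence (B'A)
cross = (cross >= 5).astype(float)               # keep only strong links

q = np.zeros(n_queries)
q[42] = 1.0                                      # a one-hot query
video_scores = cross @ q                         # videos recommended for that query
print(np.argsort(video_scores)[::-1][:5])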
22. Real-life example
• Query: “Paco de Lucia”
• Conventional meta-data search results:
– “hombres del paco” times 400
– not much else
• Recommendation based search:
– Flamenco guitar and dancers
– Spanish and classical guitar
– Van Halen doing a classical/flamenco riff
24. Hypothetical Example
• Want a navigational ontology?
• Just put labels on a web page with traffic
– This gives A = users x label clicks
• Remember viewing history
– This gives B = users x items
• Cross recommend
– B’A = label to item mapping
• After several users click, results are whatever users think they should be
27. What is Quality?
• Robust clustering not a goal
– we don’t care if the same clustering is replicated
• Generalization is critical
• Agreement to “gold standard” is a non-issue
37. Ball k-means
• Provably better for highly clusterable data
• Tries to find initial centroids in the “core” of each real cluster
• Avoids outliers in centroid computation
initialize centroids randomly with distance-maximizing tendency
for each of a very few iterations:
    for each data point:
        assign point to nearest cluster
    recompute centroids using only points much closer than closest cluster
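A minimal NumPy sketch of the ball k-means update described above: after assigning points, each centroid is recomputed from only the points well inside its cluster, so outliers cannot drag it around. The trim rule (keep points within half the median member distance) and the iteration count are illustrative assumptions:

import numpy as np

def ball_kmeans(X, centroids, iters=3, trim=0.5):
    for _ in range(iters):
        # assign each point to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign, dist = d.argmin(axis=1), d.min(axis=1)
        for c in range(len(centroids)):
            members, pts = dist[assign == c], X[assign == c]
            if len(pts) == 0:
                continue
            # "ball": recompute the centroid from points much closer than typical
            ball = pts[members <= trim * np.median(members)]
            if len(ball):
                centroids[c] = ball.mean(axis=0)
    return centroids

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-3, 1, (500, 2)), rng.normal(3, 1, (500, 2))])
init = X[rng.choice(len(X), 2, replace=False)].copy()   # distance-maximizing seeding omitted
print(ball_kmeans(X, init))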
38. Still Not a Win
• Ball k-means is nearly guaranteed with k = 2
• Probability of successful seeding drops exponentially with k
• Alternative strategy has high probability of success, but takes O(nkd + k³d) time
39. Still Not a Win
• Ball k-means is nearly guaranteed with k = 2
• Probability of successful seeding drops exponentially with k
• Alternative strategy has high probability of success, but takes O(nkd + k³d) time
• But for big data, k gets large
40. Surrogate Method
• Start with sloppy clustering into lots of clusters, κ = k log n
• Use this sketch as a weighted surrogate for the data
• Results are provably good for highly clusterable data
41. Algorithm Costs
• Surrogate methods
– fast, sloppy, single-pass clustering with κ = k log n
– fast, sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
– fast, in-memory, high-quality clustering of κ weighted centroids: O(κkd + k³d) = O(k²d log n + k³d) for small k, high quality; O(κd log k) or O(d log κ log k) for larger k, looser quality
– result is k high-quality centroids
• Even the sloppy surrogate may suffice
42. Algorithm Costs
• Surrogate methods
– fast, sloppy, single-pass clustering with κ = k log n
– fast, sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
– fast, in-memory, high-quality clustering of κ weighted centroids: O(κkd + k³d) = O(k²d log n + k³d) for small k, high quality; O(κd log k) or O(d log k (log k + log log n)) for larger k, looser quality
– result is k high-quality centroids
• For many purposes, even the sloppy surrogate may suffice
43. Algorithm Costs
• How much faster for the sketch phase?
– take k = 2000, d = 10, n = 100,000
– k d log n = 2000 x 10 x 26 = 500,000
– d (log k + log log n) = 10(11 + 5) = 170
– 3,000 times faster is a bona fide big deal
45. How It Works
• For each point
– Find approximately nearest centroid (distance = d)
– If (d > threshold) new centroid
– Else if (u > d/threshold) new cluster
– Else add to nearest centroid
• If centroids > κ ≈ C log N
– Recursively cluster centroids with higher threshold
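A minimal sketch of this per-point logic: distant points (and, with probability proportional to distance, some nearer ones) start new centroids, near points are folded into their nearest centroid, and when the sketch exceeds κ the centroids are reclustered with a looser threshold. The constants are illustrative, and the recursive step resets the accumulated weights for brevity, which a real implementation would carry through:

import numpy as np

def streaming_sketch(points, threshold=1.0, kappa=100, rng=np.random.default_rng(0)):
    # one-pass sketch: keep a small weighted set of centroids instead of all points
    centroids, weights = [points[0].copy()], [1.0]
    for p in points[1:]:
        d = np.linalg.norm(np.array(centroids) - p, axis=1)
        i = int(d.argmin())
        if d[i] > threshold or rng.random() < d[i] / threshold:
            centroids.append(p.copy())               # far (or unlucky near) point: new centroid
            weights.append(1.0)
        else:
            weights[i] += 1.0                         # fold point into nearest centroid
            centroids[i] += (p - centroids[i]) / weights[i]
        if len(centroids) > kappa:
            # too many centroids: recursively recluster them with a higher threshold
            c, w = streaming_sketch(np.array(centroids), threshold * 1.5, kappa, rng)
            centroids, weights = list(c), list(w)
    return np.array(centroids), np.array(weights)

data = np.random.default_rng(1).normal(size=(5000, 2))
cents, wts = streaming_sketch(data)
print(len(cents))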
47. But Wait, …
• Finding nearest centroid is inner loop
• This could take O(dκ) per point, and κ can be big
• Happily, approximate nearest centroid works fine
52. Quality
• Ball k-means implementation appears significantly better than simple k-means
• Streaming k-means + ball k-means appears to be about as good as ball k-means alone
• All evaluations on 20 newsgroups with held-out data
• Figure of merit is mean and median squared distance to nearest cluster
53. Contact Me!
• We’re hiring at MapR in US and Europe
• MapR software available for research use
• Get the code as part of Mahout trunk (or 0.8 very soon)
• Contact me at tdunning@maprtech.com or @ted_dunning
• Share news with @apachemahout