(Presented by Antonio Piccolboni to Strata 2012 Conference, Feb 29 2012).
RHadoop is an open source project spearheaded by Revolution Analytics to give data scientists access to Hadoop’s scalability from their favorite language, R. RHadoop comprises three packages:
- rhdfs provides file-level manipulation for HDFS, the Hadoop file system
- rhbase provides access to HBase, the Hadoop database
- rmr lets you write mapreduce programs in R
rmr lets R developers program in the mapreduce framework, and gives all developers an alternative way to implement mapreduce programs that strikes a delicate compromise between power and usability. It lets you write general mapreduce programs, offering the full power and ecosystem of an existing, established programming language. It doesn’t force you to replace the R interpreter with a special run-time: it is just a library. You can write logistic regression in half a page and even understand it. It feels and behaves almost like the usual R iteration and aggregation primitives. It consists of a handful of functions with a modest number of arguments and sensible defaults that combine in many useful ways. But there is no way to prove that an API works: one can only show examples of what it makes possible, and we will do that with a few from machine learning and statistics. Finally, we will discuss how to get involved.
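To make the programming model concrete, here is a minimal sketch of the map, group-by-key, and reduce steps in plain Python. It is not rmr's API and it does not touch Hadoop; the names `map_reduce`, `word_count_mapper`, and `sum_reducer` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Apply mapper to every record, group the emitted (key, value)
    pairs by key, then apply reducer to each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(line):
    # Emit one (word, 1) pair per word in the line.
    return [(word.lower(), 1) for word in line.split()]


def sum_reducer(key, values):
    # Combine all values emitted for the same key.
    return sum(values)


counts = map_reduce(["R and Hadoop", "Hadoop and HDFS"],
                    word_count_mapper, sum_reducer)
print(counts)
```

On a real cluster the grouping step is a distributed shuffle and the mapper and reducer run on many machines; the in-memory dictionary here only stands in for that step.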
Functional Patterns for the non-mathematician
Brian Lonsdorf
Fluentconf 2014 talk:
Functional design patterns such as lenses, arrows, functors, and monads all come from category theory. To fully grok them, you’ll probably have to wade through the whitest white papers, fighting the mathematical syntax and abstract examples.
I’m hoping to translate these ideas into JavaScript. I’ll be showing direct and practical applications for everyday programming.
ref:https://www.ggplot2-exts.org/ggtree.html
ggtree
https://bioconductor.org/packages/release/bioc/html/ggtree.html
ggtree is designed for visualizing phylogenetic trees and different types of associated annotation data.
How fast ist it really? Benchmarking in practice
Tobias Pfeiffer
“What’s the fastest way of doing this?” - you might ask yourself during development. Sure, you can guess what’s fastest or how long something will take, but do you know? How long does it take to sort a list of 1 Million elements? Are tail-recursive functions always the fastest?
Benchmarking is here to answer these questions. However, there are many pitfalls around setting up a good benchmark and interpreting the results. This talk will guide you through, introduce best practices and show you some surprising benchmarking results along the way.
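As a rough illustration of the kind of measurement the talk is about, the sketch below times a sort over several runs and reports the median and minimum instead of a single number. It is an assumption-laden example, not material from the talk: the helper names `benchmark` and `make_list` are hypothetical, and the list has 100,000 elements (not 1 million) so that it finishes quickly.

```python
import random
import statistics
import timeit


def benchmark(func, make_input, repeats=5):
    """Time func on a freshly built input for each run.
    Returns the list of elapsed times in seconds."""
    times = []
    for _ in range(repeats):
        data = make_input()  # setup stays outside the timed region
        start = timeit.default_timer()
        func(data)
        times.append(timeit.default_timer() - start)
    return times


def make_list(n=100_000, seed=42):
    # Fixed seed so every run sorts the same values.
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


runs = benchmark(sorted, make_list)
print(f"median {statistics.median(runs):.4f}s, "
      f"min {min(runs):.4f}s over {len(runs)} runs")
```

Repeating the measurement and keeping input construction out of the timed region addresses two of the common pitfalls: one-off noise and accidentally timing the setup.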
Desk reference for data transformation in Stata. Co-authored with Tim Essam (@StataRGIS, linkedin.com/in/timessam). See all cheat sheets at http://bit.ly/statacheatsheets. Updated 2016/06/03.
Stata cheat sheet: programming. Co-authored with Tim Essam (linkedin.com/in/timessam). See all cheat sheets at http://bit.ly/statacheatsheets. Updated 2016/06/04
Final tagless. The topic strikes fear into the hearts of Scala developers everywhere—and not without reason. Final tagless allows developers to build composable Domain Specific Languages (DSLs) that model interaction with the outside world. Programs written using the final tagless style can be tested deterministically and reasoned about at compile-time. Yet the technique requires confusing, compiler-choking higher-kinded types, like `F[_]`, and pervasive, non-inferable context bounds like `F[_]: Concurrent: Console: Logging`. Many have looked at final tagless and wondered if all the layers of complexity and ceremony are really worth the benefits.
In this presentation, John A. De Goes provides a gentle and accessible introduction to final tagless, explaining what it is and the problem it intends to solve. John shows that while final tagless is easier to use than free monads, the technique suffers from a litany of drawbacks that push developers away from functional programming in Scala. John then introduces a novel approach that shares some of the benefits of final tagless, but which is idiomatic Scala, easy to explain, doesn’t need any complex type machinery, provides flawless type inference, and works beautifully across Scala 2.x and Scala 3.
Come join John for an evening of fun as you learn how to write functional code in Scala that's easy to test and easy to reason about—all without the complexity of free monads or final tagless.
These are my slides from a presentation to the Chicago R User Group on Oct 3, 2012. The talk covers how to use R and Gephi to visualize a map of influence in the history of philosophy.
More detail is available on the Design & Analytics Blog.
WSO2Con EU 2016: Understanding the WSO2 API Management Platform
WSO2
In this session, we depict the key challenges of deploying an API management solution and how WSO2’s API Management platform can address them by supporting API provisioning, security and analytics. We also describe the various deployment options – on-premise and in the cloud – as well as the key deployment patterns that you need to adopt.
Introduction to Big Data Analytics on Apache Hadoop
Avkash Chauhan
In the age of Big Data and large-volume analytics there is a lot to cover and a lot to learn. While developing Windows HDInsight at Microsoft, and now developing a one-of-a-kind Big Data product at my own company, Big Data Perspective in San Francisco, I have spent the last several years covering Big Data at various levels. This talk is tailored to help database and business intelligence (BI) professionals, programmers, Hadoop administrators, researchers, technical architects, operations engineers, data analysts, and data scientists understand the core concepts of Big Data Analytics on Hadoop. The webinar will be useful for those who want to know what Hadoop is and how they can take advantage of it by spending just a few dollars to run a cluster. It is also a good fit for those who are looking to deploy their first data cluster and run MapReduce jobs to discover insights.
WSO2 - Forrester Guest Webinar: API Management is not Enough: You Need an API...
WSO2
To view the recording of this webinar, use the URL below:
http://wso2.com/library/webinars/2016/05/api-management-is-not-enough-you-need-an-api-platform/
APIs are critical for any modern application, whether it be for mobile apps, B2B integration, digital disruption or agile integration strategies. With this greater focus on APIs comes the increased need for API management. When adopting API management, many make the mistake of giving it too little significance and set themselves up for big problems.
In this webinar, WSO2’s guest speaker Randy Heffner, analyst at Forrester Research, will set the record straight by providing a clear definition of API management, and of the broader API platform, that enterprises need to create and implement a successful API strategy. Using WSO2 customer user stories, Isabelle Mauny, the vice president of product management at WSO2, will then explain how choosing the proper API management platform can set you on the road to success for API-led business transformation.
Analysts predict that the Hadoop market will reach $50.2 billion USD by 2020.¹ Applications driving these large expenditures are some of the most important workloads for businesses today, including:
• Analyzing clickstream data, including site-side clicks and web media tags.
• Measuring sentiment by scanning product feedback, blog feeds, social media comments, and Twitter streams.
• Analyzing behavior and risk by capturing vehicle telematics.
• Optimizing product performance and utilization by gathering data from built-in sensors.
• Tracking and analyzing people and material movement with location-aware systems.
• Identifying system performance and intrusion attempts by analyzing server and network logs.
• Enabling automatic document and speech categorization.
• Extracting learning from digitized images, voice, video, and other media types.
Predictive analytics on large data sets provides organizations with a key opportunity to improve a broad variety of business outcomes, and many have embraced Apache Hadoop as the platform of choice.
In the last few years, large businesses have adopted Apache Hadoop as a next-generation data platform, one capable of managing large data assets in a way that is flexible, scalable, and relatively low cost. However, to realize predictive benefits of big data, organizations must be able to develop or hire individuals with the requisite statistics skills, then provide them with a platform for analyzing massive data assets collected in Hadoop “data lakes.”
As users adopted Hadoop, many discovered that performance and complexity limited Hadoop’s use for broad predictive analytics. In response, the Hadoop community has focused on the Apache Spark platform to provide Hadoop with significant performance improvements. With Spark atop Hadoop, users can leverage Hadoop’s big-data management capabilities while achieving new performance levels by running analytics in Apache Spark.
What remains is a challenge—conquering the complexity of Hadoop when developing predictive analytics applications.
In this white paper, we’ll describe how Microsoft R Server helps data scientists, actuaries, risk analysts, quantitative analysts, product planners, and other R users to capture the benefits of Apache Spark on Hadoop by providing a straightforward platform that eliminates much of the complexity of using Spark and Hadoop to conduct analyses on large data assets.
UKOUG - Implementing Enterprise API Management in the Oracle Cloud
luisw19
API-led connectivity has become the main mechanism to integrate with SaaS applications. Mobile applications, modern web applications and the Internet of Things also need APIs. In the Oracle Cloud there are at least 6 cloud services offering a solution for APIs (Mobile Cloud Service, API Manager Cloud Service, API Platform Cloud Service, API Catalog Cloud Service, IoT Cloud Service and Integration Cloud Service).
This presentation will first describe what an enterprise-wide API management solution looks like and elaborate on a solid API taxonomy. It will then show how to position each of these cloud services to deliver an end-to-end API management solution in the Oracle Cloud that is also capable of handling hybrid cloud use cases.
In addition, real-life use cases will be referenced to help contextualise the content presented.
To view the recording of this webinar, use the URL below:
http://wso2.com/library/webinars/2015/08/wso2-api-platform-vision-and-roadmap/
WSO2 API platform adopters are driving digital business and creating innovative business models. API platforms create a secure, self-service, managed, and monetized environment that increases safe connected business interactions.
In this presentation, Chris and Shiro will describe:
- Key goals and challenges driving API platform adoption
- WSO2 API Platform capabilities and advantages
- Visionary platform use cases
- Innovative customer success stories
IBM API Connect is a comprehensive API solution. It is an integrated creation, runtime, management, and security foundation for enterprise-grade APIs and microservices to power modern digital applications.
This webinar covers:
- API management concepts
- IBM API Connect overview and features
- Kellton Tech’s API strategy with IBM API Connect
Technology: IBM API Connect 5.0
RHadoop is an effective platform for doing exploratory data analysis over big data sets. The convenience of an interactive command-line interpreter and the overwhelming number of statistical and machine learning routines implemented in R libraries make it a highly effective environment for elementary data science.
We'll discuss the basics of RHadoop: what it is, how to install it, and the API fundamentals. Next we'll discuss common use cases that you might want to use RHadoop for. Last, we'll run through an interactive example.
OBIEE Answers Vs Data Visualization: A Cage Match
Michelle Kolbe
With Oracle's new tool in 12c called Data Visualization, when do you use Answers and when do you use Data Visualization? This presentation included a live demo of the two tools. The slides walk step by step through this demo. You can follow along yourself using the Sample App data.
See my blog post walking through these slides here: https://medium.com/@datacheesehead/the-cage-match-between-obiee-answers-and-data-visualization-73496bbf4dfe#.thiuznp0z
Overview of accessing relational databases from R. Focuses on and demonstrates the DBI family (RMySQL, RPostgreSQL, ROracle, RJDBC, etc.) but also introduces RODBC. Highlights DBI's dbApply() function, which combines the strengths of SQL and *apply() on large data sets. Demonstrates the sqldf package, which provides SQL access to standard R data.frames.
Presented at the May 2011 meeting of the Greater Boston useR Group.
Adoption of the R language has grown rapidly in the last few years, and R is ranked as the number-one data science language in several surveys. This accelerating R adoption curve has been driven by the Big Data revolution, and by the fact that so many data scientists, having learned R at university, are actively unlocking the secrets hidden in these new, vast data troves. In more than 6 years of writing for the Revolutions blog, I’ve discovered hundreds of applications of R in business, in government, and in the non-profit sector. Sometimes the use of R is obvious, and sometimes it takes a little bit of detective work to learn how R is operating behind the scenes. In this talk, I'll recount some of my favourite applications of R, and show how R is behind some amazing innovations in today’s world.
HP Distributed R is a high-performance scalable platform for the R language. It enables R to leverage multiple cores and multiple servers to perform Big Data Advanced Analytics. It consists of new R language constructs to easily parallelize algorithms across multiple R processes.
HP Distributed R simplifies large-scale analysis by extending R. Because R is a single-threaded environment, it has limited utility for Big Data analytics. HP Distributed R allows you to specify that parts of programs be run in multiple single-threaded R processes. This approach results in significantly reduced execution times for Big Data analysis.
Sure, APIs are a technology. But APIs are part of a value chain, and every value chain is becoming infused with APIs that drive business results. What does it take to create a business strategy that makes the most of APIs? There are clear patterns for success that will enable you to get ahead of change—rather than react to competitors and disruptors.
We cover:
- why APIs are becoming an indispensable part of the value chain
- how APIs open new opportunities for business growth
- three things you should do in the next 100 days
Slides from my lightning talk at the Boston Predictive Analytics Meetup hosted at Predictive Analytics World, Boston, October 1, 2012.
Full code and data are available on github: http://bit.ly/pawdata
RHadoop is an open source project aiming to combine two rising stars in the analytics firmament: R and Hadoop. With more than 2M users, R is arguably the dominant language to express complex statistical computations. Hadoop needs no introduction at HUG. With RHadoop we are trying to combine the expressiveness of R with the scalability of Hadoop and to pave the way for the statistical community to tackle big data with the tools they are familiar with. At this time RHadoop is a collection of three packages that interface with HDFS, HBase and mapreduce, respectively. For mapreduce, the package is called rmr and we tried to give it a simple, high level interface that's true to the mapreduce model and integrated with the rest of the language. We will cover the API and provide some examples.
Implement the following sorting algorithms Bubble Sort Insertion S.pdf
kesav24
Implement the following sorting algorithms: Bubble Sort, Insertion Sort, Selection Sort, Merge Sort, Heap Sort, Quick Sort. For each of the above algorithms, measure the execution time based on input sizes n, n + 10(i), n + 20(i), n + 30(i), ..., n + 100(i) for n = 50000 and i = 100. Let the array to be sorted be randomly initialized. Use the same machine to measure all the algorithms. Plot a graph to compare the execution times you collected in part (2).
Solution
This code will create plots comparing the running times of the different sorting methods and save those plots in the current directory.
from random import shuffle
from time import time
import numpy as np
import matplotlib.pyplot as plt

def bubblesort(arr):
    for i in range(len(arr)):
        for k in range(len(arr)-1, i, -1):
            if (arr[k] < arr[k-1]):
                tmp = arr[k]
                arr[k] = arr[k-1]
                arr[k-1] = tmp
    return arr

def selectionsort(arr):
    for fillslot in range(len(arr)-1,0,-1):
        positionOfMax=0
        for location in range(1,fillslot+1):
            if arr[location]>arr[positionOfMax]:
                positionOfMax = location
        temp = arr[fillslot]
        arr[fillslot] = arr[positionOfMax]
        arr[positionOfMax] = temp
    return arr

def insertionsort(arr):
    for i in range( 1, len( arr ) ):
        tmp = arr[i]
        k = i
        while k > 0 and tmp < arr[k - 1]:
            arr[k] = arr[k-1]
            k -= 1
        arr[k] = tmp
    return arr

# def mergesort(arr):
#
#     if len(arr)>1:
#         mid = len(arr)//2
#         lefthalf = arr[:mid]
#         righthalf = arr[mid:]
#
#         mergesort(lefthalf)
#         mergesort(righthalf)
#
#         i=0
#         j=0
#         k=0
#         while i < len(lefthalf) and j < len(righthalf):
#             if lefthalf[i] < righthalf[j]:
#                 arr[k]=lefthalf[i]
#                 i=i+1
#             else:
#                 arr[k]=righthalf[j]
#                 j=j+1
#             k=k+1
#
#         while i < len(lefthalf):
#             arr[k]=lefthalf[i]
#             i=i+1
#             k=k+1
#
#         while j < len(righthalf):
#             arr[k]=righthalf[j]
#             j=j+1
#             k=k+1
#
#     return arr
def mergesort(x):
    result = []
    if len(x) < 2:
        return x
    mid = int(len(x)/2)
    y = mergesort(x[:mid])
    z = mergesort(x[mid:])
    i = 0
    j = 0
    while i < len(y) and j < len(z):
        if y[i] > z[j]:
            result.append(z[j])
            j += 1
        else:
            result.append(y[i])
            i += 1
    result += y[i:]
    result += z[j:]
    return result

def quicksort(arr):
    less = []
    equal = []
    greater = []
    if len(arr) > 1:
        pivot = arr[0]
        for x in arr:
            if x < pivot:
                less.append(x)
            if x == pivot:
                equal.append(x)
            if x > pivot:
                greater.append(x)
        return quicksort(less)+equal+quicksort(greater)  # Just use the + operator to join lists
    else:
        return arr

#### Heap sort
def heapsort(arr):
    # convert arr to heap
    length = len(arr) - 1
    leastParent = length // 2  # integer division, so range() receives an int
    for i in range(leastParent, -1, -1):
        moveDown(arr, i, length)
    # flatten heap into sorted array
    for i in range(length, 0, -1):
        if arr[0] > arr[i]:
            swap(arr, 0, i)
            moveDown(arr, 0, i - 1)

def moveDown(arr, first, last):
    largest = 2 * first + 1
    while largest <= last:
        # right child exists and is larger than left child
        if (largest < last) and (arr[largest] < arr[largest + 1]):
            largest += 1
        # right child is larger than parent
        if arr[largest] > arr[first]:
            swap(arr, largest, first)
            # move down to largest child
            first = largest
lar.
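The listing above is cut off before its measurement step. As a minimal sketch of what the exercise asks for, and not a reconstruction of the original solution, the code below times one sort function over the input sizes n, n + 10(i), ..., n + 100(i). The helper names `input_sizes` and `time_sort` are hypothetical, only the standard library is used, and the demo runs with small stand-in values (n = 200, i = 2) because a quadratic sort at n = 50000 takes a long time in pure Python.

```python
import random
import time


def insertion_sort(arr):
    # Same algorithm as insertionsort above; sorts in place and returns arr.
    for i in range(1, len(arr)):
        tmp = arr[i]
        k = i
        while k > 0 and tmp < arr[k - 1]:
            arr[k] = arr[k - 1]
            k -= 1
        arr[k] = tmp
    return arr


def input_sizes(n, i):
    # n, n + 10*i, n + 20*i, ..., n + 100*i
    return [n + 10 * step * i for step in range(11)]


def time_sort(sort_fn, sizes, seed=0):
    """Return (size, seconds) pairs for sort_fn on random arrays."""
    rng = random.Random(seed)
    results = []
    for size in sizes:
        data = [rng.randint(0, 1_000_000) for _ in range(size)]
        work = list(data)  # copy outside the timed region
        start = time.perf_counter()
        out = sort_fn(work)
        elapsed = time.perf_counter() - start
        assert out == sorted(data)  # sanity check on the result
        results.append((size, elapsed))
    return results


sizes = input_sizes(200, 2)  # stand-in for n = 50000, i = 100
timings = time_sort(insertion_sort, sizes)
for size, seconds in timings:
    print(f"{size:6d} elements: {seconds:.6f} s")
```

Calling `time_sort` once per algorithm on the same machine gives one series per sort, which can then be plotted against the input sizes.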
Solutions to 4 programming challenges from Advent of Code and Google Code Jam, used to illustrate Pythonic code constructs and some little-known utilities of the language.
ggtimeseries-->ggplot2 extensions
This R package offers novel time series visualisations. It is based on ggplot2 and offers geoms and pre-packaged functions for easily creating any of the offered charts. Some examples are listed below.
This package can be installed from GitHub by installing the devtools library and then running the following command: devtools::install_github('Ather-Energy/ggTimeSeries').
reference: https://github.com/Ather-Energy/ggTimeSeries
Python 101 language features and functional programming
Lukasz Dynowski
The presentation reveals syntax solutions for commonly encountered programming challenges, gives insight into Python data types, and explains the core design principles behind the program.
Presented to eRum (Budapest), May 2018
There are many common workloads in R that are "embarrassingly parallel": group-by analyses, simulations, and cross-validation of models are just a few examples. In this talk I'll describe the doAzureParallel package, a backend to the "foreach" package that automates the process of spawning a cluster of virtual machines in the Azure cloud to process iterations in parallel. This will include an example of optimizing hyperparameters for a predictive model using the "caret" package.
By David Smith. Presented at Microsoft Build (Seattle), May 7 2018.
Your data scientists have created predictive models using open-source tools, proprietary software, or some combination of both, and now you are interested in lifting and shifting those models to the cloud. In this talk, I'll describe how data scientists can transition their existing workflows — while using mostly the same tools and processes — to train and deploy machine learning models based on open source frameworks to Azure. I'll provide guidance on keeping connections to data sources up-to-date, evaluating and monitoring models, and deploying applications that make use of those models.
Presentation delivered by David Smith to NY R Conference https://www.rstats.nyc/, April 2018:
Minecraft is an open-world creativity game, and a hit with kids. To get kids interested in learning to program with R, we created the "miner" package. This package is a collection of simple functions that allow you to connect with a Minecraft instance, manipulate the world within by creating blocks and controlling the player, and to detect events within the world and react accordingly.
The miner package is intended mainly for kids, to inspire them to learn R while playing Minecraft. But the development of the package also provides some useful insights into how to build an R package to interface with a persistent API, and how to instruct others on its use. In this talk I'll describe how to set up your own Minecraft server, and how to use and extend the package. I'll also provide a few examples of the package in action in a live Minecraft session.
While Python is a widely-used tool for AI development, in this talk I'll make the case for considering R as a platform for developing models for intelligent applications. Firstly, R provides a first-class experience working with deep learning frameworks through its keras integration. Equally importantly, it provides the most comprehensive suite of statistical data analysis tools, which are extremely useful for many intelligent applications such as transfer learning. I'll give a few high-level examples in this talk, and we'll go into further detail in the accompanying interactive code lab.
There are many common workloads in R that are "embarrassingly parallel": group-by analyses, simulations, and cross-validation of models are just a few examples. In this talk I'll describe several techniques available in R to speed up workloads like these, by running multiple iterations simultaneously, in parallel.
Many of these techniques require the use of a cluster of machines running R, and I'll provide examples of using cloud-based services to provision clusters for parallel computations. In particular, I will describe how you can use the SparklyR package to distribute data manipulations using the dplyr syntax, on a cluster of servers provisioned in the Azure cloud.
Presented by David Smith at Data Day Texas in Austin, January 27 2018.
A look at the changing perceptions of R, from the early days of the R project to today. Microsoft sponsor talk, presented by David Smith to the useR!2017 conference in Brussels, July 5 2017.
Predicting Loan Delinquency at One Million Transactions per Second
Revolution Analytics
Real-time applications of predictive models must be able to generate predictions at the rate that transactions are generated. Previously, such applications of models trained using R needed to be converted to other languages like C++ or Java to achieve the required throughput. In this talk, I’ll describe how to use the in-database R processing capabilities of Microsoft R Server to detect fraud in a SQL Server database of loan records at a rate exceeding one million transactions per second. I will also show the process of training the underlying gradient-boosted tree model on a large training set using the out-of-memory algorithms of Microsoft R.
Presented by David Smith at The Data Science Summit, Chicago, April 20 2017.
The ability to independently reproduce results is a critical issue within the scientific community today, and is equally important for collaboration and compliance in business. In this talk, I'll introduce several features available in R that help you make reproducibility a standard part of your data science workflow. The talk will include tips on working with data and files, combining code and output, and managing R's changing package ecosystem.
Presented by David Smith, R Community Lead (Microsoft), at Monktoberfest October 2016.
The value of open source isn’t just in the software itself. The communities that form around open source software provide just as much value and sometimes even more: in ongoing development, in documentation, in support, in marketing, and as a supply of ready-trained employees. Companies who build on open source tend to focus on the software, but neglect communities at their peril.
In this talk, I share some of my experiences in building community for an open-source software company, Revolution Analytics, and perspectives since the acquisition by Microsoft in 2015.
R is more than just a language. Many of the reasons why R has become such a popular tool for data science come from the ecosystem surrounding the R project. R users benefit from the many resources and packages created by the community, while commercial companies (including Microsoft) provide tools to extend and support R, and services to help people use R.
In this talk, I will give an overview of the R Ecosystem and describe how it has been a critical component of R’s success, and include several examples of Microsoft’s contributions to the ecosystem.
(Presented to EARL London, September 2016)
(Presented by David Smith at useR!2016, June 2016. Recording: https://channel9.msdn.com/Events/useR-international-R-User-conference/useR2016/R-at-Microsoft )
Since the acquisition of Revolution Analytics in April 2015, Microsoft has embarked upon a project to build R technology into many Microsoft products, so that developers and data scientists can use the R language and R packages to analyze data in their data centers and in cloud environments.
In this talk I will give an overview (and a demo or two) of how R has been integrated into various Microsoft products. Microsoft data scientists are also big users of R, and I'll describe a couple of examples of R being used to analyze operational data at Microsoft. I'll also share some of my experiences in working with open source projects at Microsoft, and my thoughts on how Microsoft works with open source communities including the R Project.
Hadoop is famously scalable. Cloud Computing is famously scalable. R – the thriving and extensible open source Data Science software – not so much. But what if we seamlessly combined Hadoop, Cloud Computing, and R to create a scalable Data Science platform? Imagine exploring, transforming, modeling, and scoring data at any scale from the comfort of your favorite R environment. Now, imagine calling a simple R function to operationalize your predictive model as a scalable, cloud-based Web Service. Learn how to leverage the magic of Hadoop on-premises or in the cloud to run your R code, thousands of open source R extension packages, and distributed implementations of the most popular machine learning algorithms at scale.
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
This session provides an introduction to UiPath Communication Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How it can help today’s businesses, and its benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
Monitoring and observability aren’t traditionally found in software curriculums, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company’s observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring and observability to ops, infra and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on:
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speed up fuzzing campaigns by pinpointing and eliminating the uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features available on those devices, but many of the features provide convenience and capability but sacrifice security. This best practices guide outlines steps the users can take to better protect personal devices and information.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for technology and making things work, along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The climate impact and sustainability of software testing are discussed in the talk. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Sustainability can be added to the quality characteristics and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
DevOps and Testing slides at DASA Connect
Kari Kakkonen
Slides by Rik Marselis and me from the DASA Connect conference, 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We ended with a lovely workshop in which the participants tried to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
21.
#!/usr/bin/python
# Embedded-Pig driver script (runs under Jython): iterative one-dimensional k-means over student GPAs.
import sys
from math import fabs
from org.apache.pig.scripting import Pig

filename = "student.txt"
k = 4
tolerance = 0.01
MAX_SCORE = 4
MIN_SCORE = 0
MAX_ITERATION = 100

# initial centroids: divide the score range equally
initial_centroids = ""
last_centroids = [None] * k
for i in range(k):
    last_centroids[i] = MIN_SCORE + float(i) / k * (MAX_SCORE - MIN_SCORE)
    initial_centroids = initial_centroids + str(last_centroids[i])
    if i != k - 1:
        initial_centroids = initial_centroids + ":"

# compile the Pig script once; $centroids is bound anew on every iteration
P = Pig.compile("""register udf.jar
    DEFINE find_centroid FindCentroid('$centroids');
    raw = load 'student.txt' as (name:chararray, age:int, gpa:double);
    centroided = foreach raw generate gpa, find_centroid(gpa) as centroid;
    grouped = group centroided by centroid;
    result = foreach grouped generate group, AVG(centroided.gpa);
    store result into 'output';
""")

converged = False
iter_num = 0
while iter_num < MAX_ITERATION:
    Q = P.bind({'centroids': initial_centroids})
    results = Q.runSingle()
    # slide 22
    # isSuccessful() returns a boolean, so test it directly
    if not results.isSuccessful():
        raise RuntimeError("Pig job failed")
    result_iter = results.result("result").iterator()
    centroids = [None] * k
    distance_move = 0
    # get the new centroids of this iteration and calculate how far they moved since the last iteration
    # (assumes the job returns exactly k rows, in the same order as last_centroids)
    for i in range(k):
        row = result_iter.next()
        centroids[i] = float(str(row.get(1)))
        distance_move = distance_move + fabs(last_centroids[i] - centroids[i])
    distance_move = distance_move / k
    # remove the output directory so the next iteration can store into it again
    Pig.fs("rmr output")
    print("iteration " + str(iter_num))
    print("average distance moved: " + str(distance_move))
    if distance_move < tolerance:
        sys.stdout.write("k-means converged at centroids: [")
        sys.stdout.write(",".join(str(v) for v in centroids))
        sys.stdout.write("]\n")
        converged = True
        break
    # feed the new centroids into the next iteration as a colon-separated string
    last_centroids = centroids[:]
    initial_centroids = ""
    for i in range(k):
        initial_centroids = initial_centroids + str(last_centroids[i])
        if i != k - 1:
            initial_centroids = initial_centroids + ":"
    iter_num += 1

if not converged:
    print("not converge after " + str(iter_num) + " iterations")
    sys.stdout.write("last centroids: [")
    sys.stdout.write(",".join(str(v) for v in last_centroids))
    sys.stdout.write("]\n")
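23.
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Pig UDF: maps a value to the nearest of the centroids passed to the constructor.
public class FindCentroid extends EvalFunc<Double> {
    double[] centroids;

    // initialCentroid is a colon-separated list of centroid values
    public FindCentroid(String initialCentroid) {
        String[] centroidStrings = initialCentroid.split(":");
        centroids = new double[centroidStrings.length];
        for (int i = 0; i < centroidStrings.length; i++)
            centroids[i] = Double.parseDouble(centroidStrings[i]);
    }

    @Override
    public Double exec(Tuple input) throws IOException {
        // return null for empty or null input instead of failing on the cast below
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        double value = (Double) input.get(0);
        double min_distance = Double.MAX_VALUE;
        double closest_centroid = 0;
        // linear scan; on a tie the earlier centroid wins
        for (double centroid : centroids) {
            double distance = Math.abs(centroid - value);
            if (distance < min_distance) {
                min_distance = distance;
                closest_centroid = centroid;
            }
        }
        return closest_centroid;
    }
}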
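The driver script and the FindCentroid UDF above together implement one-dimensional k-means: assign each value to its nearest centroid, average each group, and repeat until the centroids stop moving. As a rough way to check that logic without a Hadoop cluster, the same loop can be sketched in plain Python. This is only an illustration, not part of the slides: the function names (initial_centroids, nearest_index, kmeans_1d) are hypothetical, and the rule that an empty cluster keeps its previous centroid is an assumption (the Pig version would simply return fewer rows in that case).

```python
def initial_centroids(k, lo, hi):
    """Spread k starting centroids evenly over [lo, hi), like the driver script."""
    return [lo + float(i) / k * (hi - lo) for i in range(k)]


def nearest_index(value, centroids):
    """Index of the closest centroid; on a tie the earlier one wins, as in the UDF."""
    return min(range(len(centroids)), key=lambda i: abs(centroids[i] - value))


def kmeans_1d(values, k=4, lo=0.0, hi=4.0, tolerance=0.01, max_iterations=100):
    """Run 1-D k-means and return (centroids, iterations_used, converged)."""
    centroids = initial_centroids(k, lo, hi)
    for iteration in range(1, max_iterations + 1):
        # Assignment step: the role played by find_centroid plus GROUP BY in Pig.
        groups = [[] for _ in range(k)]
        for v in values:
            groups[nearest_index(v, centroids)].append(v)
        # Update step: the role played by AVG in Pig.
        # Assumption: an empty cluster keeps its previous centroid.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
        # Average distance moved, the same convergence measure as the driver.
        moved = sum(abs(a - b) for a, b in zip(centroids, updated)) / k
        centroids = updated
        if moved < tolerance:
            return centroids, iteration, True
    return centroids, max_iterations, False


if __name__ == "__main__":
    gpas = [0.1, 0.2, 1.0, 1.1, 2.0, 2.2, 3.5, 3.7]  # made-up sample data
    print(kmeans_1d(gpas))
```

The main difference from the Pig version is where the work happens: here the assignment and averaging run in one process, whereas the embedded script launches one Pig job per iteration and passes the centroids between iterations as a colon-separated string.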