This document discusses the rapid growth of digital data and the challenges of analyzing large, unstructured datasets. It notes that in its first week of operation in 2000, the Sloan Digital Sky Survey collected more data than had been collected in the entire prior history of astronomy. The Large Hadron Collider generates 40 terabytes of data every second, and the Twitter community generates over 1 terabyte of tweets every day. Cisco predicted that annual internet traffic would reach 667 exabytes by 2013. Hadoop provides a framework for analyzing these vast and diverse datasets by distributing processing across commodity clusters, close to where the data is stored.
Galaxy of bits
1. Galaxy of bits
Surviving the flood of information
Michał Żyliński, Microsoft
(michal.zylinski@microsoft.com)
2. In 2000 the Sloan Digital Sky Survey collected more data in its 1st week than was collected in the entire history of astronomy
By 2016 the new Large Synoptic Survey Telescope in Chile will acquire 140 terabytes in 5 days, more than Sloan acquired in 10 years
The Large Hadron Collider at CERN generates 40 terabytes of data every second
Sources: The Economist, Feb ‘10; IDC
3. Bing ingests more than 7 petabytes a month
The Twitter community generates over 1 terabyte of tweets every day
Cisco predicts that by 2013 annual internet traffic will reach 667 exabytes
Sources: The Economist, Feb ‘10; DBMS2; Microsoft Corp
4. The size of the Digital Universe in 2011: 1,800,000,000,000,000,000,000 bytes
[Chart: size of the Digital Universe in ZB (axis 0 to 9) for 2010, 2011, 2012 and 2015]
Within 24 months, # of intelligent devices > traditional IT devices
In 2015 nearly 20% of the information will be touched by cloud
Sources: IDC Digital Universe Study 2011, Worldwide Big Data Technology and Services 2012–2015 Forecast
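The byte count on this slide corresponds to 1.8 zettabytes if decimal (SI) prefixes are assumed. The sketch below checks that arithmetic, and the 667 exabyte traffic figure, with BigInt so that the values stay exact. The helper name `toBytes` and its "tenths of a unit" argument are illustrative choices and do not come from the slides.

```javascript
// Decimal (SI) byte prefixes expressed as powers of ten (an assumption;
// the slides do not say whether decimal or binary prefixes are meant).
const UNIT_EXPONENT = { TB: 12n, PB: 15n, EB: 18n, ZB: 21n };

// Hypothetical helper: convert a quantity given in tenths of a unit to bytes.
// BigInt keeps values beyond Number.MAX_SAFE_INTEGER exact.
function toBytes(tenths, unit) {
  return (BigInt(tenths) * 10n ** UNIT_EXPONENT[unit]) / 10n;
}

const digitalUniverse2011 = toBytes(18, "ZB");   // 1.8 ZB
const internetTraffic2013 = toBytes(6670, "EB"); // 667 EB

console.log(digitalUniverse2011.toString());
console.log(internetTraffic2013.toString());
```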
10. So how does it work?
SECOND, TAKE THE PROCESSING TO THE DATA
// MapReduce word count in JavaScript
// map: split each input line into words and emit (word, 1) for every word
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};
// reduce: add up all the counts emitted for one word
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};
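To see what the framework does around these two functions, the following is a minimal local sketch that runs the same word count without a cluster. The `runLocally` driver and the two context objects are hypothetical stand-ins for the Hadoop runtime. They only mimic the `context.write`, `hasNext` and `next` calls used on the slide and are not part of any Hadoop API.

```javascript
// Word-count map and reduce, following the slide.
function map(key, value, context) {
  const words = value.split(/[^a-zA-Z]/);
  for (const word of words) {
    if (word !== "") context.write(word.toLowerCase(), 1);
  }
}

function reduce(key, values, context) {
  let sum = 0;
  while (values.hasNext()) sum += parseInt(values.next(), 10);
  context.write(key, sum);
}

// Hypothetical local driver standing in for the cluster runtime.
function runLocally(lines) {
  // Map phase: group every emitted value under its key (the "shuffle").
  const groups = new Map();
  const mapContext = {
    write(k, v) {
      if (!groups.has(k)) groups.set(k, []);
      groups.get(k).push(v);
    },
  };
  lines.forEach((line, index) => map(index, line, mapContext));

  // Reduce phase: one reduce call per key, fed through a simple iterator.
  const result = {};
  const reduceContext = { write(k, v) { result[k] = v; } };
  for (const [k, vals] of groups) {
    let pos = 0;
    const iterator = {
      hasNext: () => pos < vals.length,
      next: () => vals[pos++],
    };
    reduce(k, iterator, reduceContext);
  }
  return result;
}

const counts = runLocally(["Take the processing", "to the data"]);
console.log(counts);
```

On a real cluster the map calls run in parallel on the nodes that hold the input, and the grouping step moves data between nodes. Here everything happens in one process, which is enough to check the logic of the two functions.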
11. Hadoop in detail
Analysis of semi-structured and unstructured data distributed across a commodity cluster
Based on Google’s MapReduce paper and the Google File System (GFS)
Programs = sequence of “map” and “reduce” tasks
Simplifies writing distributed applications
Highly fault tolerant: multiple copies of the data
Moves computation close to the data
Implemented in Java and optimized for Linux
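The two ideas of keeping multiple copies and moving computation close to the data can be illustrated with a toy placement function. This is only a sketch under stated assumptions: the block names, node names and the `pickNode` function are made up for illustration, and Hadoop's actual scheduler is considerably more involved.

```javascript
// Hypothetical replica map: each data block is stored on several nodes.
const replicas = {
  "block-1": ["node-a", "node-b", "node-c"],
  "block-2": ["node-b", "node-c", "node-d"],
};

// Choose where to run a task for a block: pick a live node that already
// stores a copy, so the data does not have to travel over the network.
// Returns null when no live node holds the block.
function pickNode(block, liveNodes) {
  const holder = (replicas[block] || []).find((node) => liveNodes.has(node));
  return holder === undefined ? null : holder;
}

const live = new Set(["node-a", "node-b", "node-c", "node-d"]);
console.log(pickNode("block-1", live));

// If a node fails, another copy of the block is still available.
live.delete("node-a");
console.log(pickNode("block-1", live));
```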
14. Traditional RDBMS vs. MapReduce
Attribute    Traditional RDBMS          MapReduce
Data Size    Gigabytes (Terabytes)      Petabytes (Exabytes)
Access       Interactive and Batch      Batch
Updates      Read / Write many times    Write once, Read many times
Structure    Static Schema              Dynamic Schema
Integrity    High (ACID)                Low
Scaling      Nonlinear                  Linear
DBA Ratio    1:40                       1:3000
16. Hadoop + Microsoft
Our own distribution of Hadoop
• Submit changes back to Apache Foundation
• Download for free
Optimized for Windows & Azure
• AD & System Center integration
• Hadoop-as-a-service-on-Azure
Focus on .NET Developers
• Integration with Visual Studio
• Support for C#
• Performance and Scale
• High Availability
• Ease of use
17. Why Hadoop as a Service?
• Task based billing
• Easy admin
• Zero install
• Support a wide variety of job types
– Machine Learning (Mahout), Graph Mining (Pegasus), Hive, Pig, Java, JS, etc.
• Greatly simplified UI
cheap fast
27. Some other fancy stuff...
Benefits: Models augmented with publicly available data from social media sites
Key Features: Microsoft Codename "Social Analytics"
29. Reality check A.D. 2012
ANALYTICS: SELF-SERVICE, MOBILE, OPERATIONAL, REAL-TIME, PREDICTIVE, COLLABORATIVE
MARKETPLACE: External Data and Services
DATA ENRICHMENT: DISCOVER AND RECOMMEND, TRANSFORM AND CLEAN, SHARE AND GOVERN
DATA MANAGEMENT: RELATIONAL, NON RELATIONAL, MULTIDIMENSIONAL, STREAMING
30. Use Case:
• Extremely large volume of unstructured web log analysis
• Ad hoc analysis of unstructured web logs to prototype patterns
• Hadoop data feeds large 24 TB Cube
[Diagram labels: Hadoop Distribution, 24 TB Cube, Microsoft BI Tools]
Share and collaborate via Windows Azure Marketplace: The Microsoft Big Data solution enables customers to share data and insights through Windows Azure Marketplace, which exposes hundreds of applications and data mining algorithms from Microsoft and third parties to help unlock unprecedented insights for customers. Microsoft’s Hadoop-based service for Windows Azure offers seamless connection to Azure Marketplace through the Open Data Protocol (OData).
Integrate with social media: Microsoft’s Big Data solution enables customers to augment their analysis with publicly available data from social media sites (such as Twitter and Facebook) and hundreds of trusted data providers on Windows Azure Marketplace. Microsoft Codename "Social Analytics" allows for integration of social information with business applications.