A Year in Review - Building a Comprehensive Data Management Program @ Microsoft Research
What Exactly Is Big Data?
Wikipedia: “Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications”
Critical tool for Microsoft’s businesses
Opportunity to deliver transformative new capabilities to our enterprise customers
MSR and Big Data
First, the sword: Shame on us…
Many undergrads with better big data capabilities
Martians versus Earthlings
Finally… big data has been fully embraced by MSR as:
• A vital tool to enable research
• A vital area in which to do research
We are MAKING THE INVESTMENT
Microsoft Research’s Centralized Data Management and Data Processing Platform
Founded June 2013
Project Vision
Motivation:
• Numerous areas of research are driven by data (Research Need)
• Data comes in very different forms from very different sources (Adapting to Change)
• Identified the need for a standardized data storage and data processing resource for MSR (Community)
• Many different research groups were processing and storing the same data sets (Shared Knowledge / Data Sharing)
• Some research groups were not aware that so many different types of data were available (Communication / Collaboration)
[Figure: Research Data (internal and external) at the center, surrounded by Research Need, Adapting to Change, Community, Collaboration, Shared Knowledge, and Data Sharing]
Guiding Principles:
• Secure and Compliant (e.g. Data Security, Privacy and Ethics)
• World-wide Access (equal opportunity for access and use given to all MSR labs)
• Created through Partnerships with teams throughout Microsoft
• Driven by Researcher Needs and Requirements (e.g. Tools, Hardware, Software, Datasets)
• Flexibility
[Figure: Research Data (internal and external) at the center, surrounded by Security, Compliance, Ethics, Global Access, Driven by Researcher Needs, and Research and Product Team Partnerships]
Goals:
• Centralized, Compliant, and Curated Data Storage Facilities
• Multi-Purpose Data Processing Architecture (mix of different types of Hardware)
• Flexibility with Software
• Active User Community (supported through Outreach and Training)
[Figure: Research Data (internal and external) at the center, surrounded by Centralized, Compliant, Curated, User Community, Flexibility with Software and Tools, and Blend of Technology and Services]
• Centralized Data Management
• Research and Innovation Support
• Innovative Hardware and Tools
• Partnerships
• Data Privacy and Security
• Community and Outreach
System Architecture
[Figure: Research Data (internal and external) connected to the platform components: Hadoop, GPU, HPC, Azure, Sandbox, and Bing]
[Figure: Research Data (internal and external), with MNIST shown as an example data set]
Bing
Data Management
[Figure: Research Data (internal and external), with MNIST as an example data set, under Data Management, which is framed by Compliance, Security, Ethics, and Policy]
Compliance:
• Policy / Procedure
• Standardization / Common Platform
• Technology
Security:
• Corporate Technology and Compliance
• Standardization / Common Platform
• Technology
Ethics:
• Ethical Review Board / Legal and Corporate Affairs
• Standardization / Common Platform
• Technology
Fun Examples:
• F#
• Naiad
• Skype Translator
• Azure ML
Discussion / Questions / Next Steps
