SlideShare a Scribd company logo
My Other Computer Is A Data Center: The Sector Perspective on Big Data July 25, 2010 1 RobertGrossman Laboratory for Advanced Computing University of Illinois at Chicago Open Data Group Institute for Genomics & Systems BiologyUniversity of Chicago
What Is Small Data? 100 million movie ratings 480 thousand customers 17,000 movies From 1998 to 2005 Less than 2 GB data. Fits into memory, but very sophisticated models required to win.
What is Large Data? One simple definition is consider a dataset large if it is to big to fit into a database Search engines, social networks, and other Internet applications generate large datasets in this sense.
Part 1What’s Different About Data Center Computing? 4
Data center scale computing provides storage and computational resources at the scale and with the reliability of a data center.
A very nice recent book by Barroso and Holzle
7 Scale is new
Elastic, Usage Based Pricing Is New 8 costs the same as 1 computer in a rack for 120 hours 120 computers in  three racks for 1 hour
Simplicity of the Parallel Programming Framework is New 9 A new programmer can develop a program to process a container full of data with less than day of training using MapReduce.
Elastic Clouds Large Data Clouds Goal: Minimize cost of virtualized machines & provide on-demand.   HPC Goal: Maximize data (with matching compute) and control cost. Goal: Minimize latency and control heat.
2003 10x-100x 1976 10x-100x data science 1670 250x simulation science 1609 30x experimental science
12
13
Part 2.  Stacks for Big Data 14
The Google Data Stack The Google File System (2003) MapReduce: Simplified Data Processing… (2004) BigTable: A Distributed Storage System… (2006) 15
Map-Reduce Example Input is file with one document per record User specifies map function key = document URL Value = terms that document contains “it”, 1“was”, 1“the”, 1“best”, 1 (“doc cdickens”,“it was the best of times”) map
Example (cont’d) MapReduce library gathers together all pairs with the same key value (shuffle/sort phase) The user-defined reduce function combines all the values associated with the same key key = “it”values = 1, 1 “it”, 2“was”, 2“best”, 1“worst”, 1 key = “was”values = 1, 1 reduce key = “best”values = 1 key = “worst”values = 1
Applying MapReduce to the Data in Storage Cloud map/shuffle reduce 18
Google’s Large Data Cloud Compute Services Data Services Storage Services 19 Applications Google’s MapReduce Google’s BigTable Google File System (GFS) Google’s Stack
Hadoop’s Large Data Cloud Applications Compute Services 20 Hadoop’sMapReduce Data Services NoSQL Databases Hadoop Distributed File System (HDFS) Storage Services Hadoop’s Stack
Amazon Style Data Cloud Load Balancer Simple Queue Service 21 SDB EC2 Instance EC2 Instance EC2 Instance EC2 Instance EC2 Instance EC2 Instance EC2 Instance EC2 Instance EC2 Instance EC2 Instance EC2 Instances EC2 Instances S3 Storage Services
Evolution of NoSQL Databases Standard architecture for simple web apps: Front end load balanced web servers Business logic layer in the middle  Backend database Databases do not scale well with very large numbers of users or very large amounts of data Alternatives include Sharded (partitioned) databases master-slave databases memcache 22
NoSQL Systems Suggests No SQL support, also Not Only SQL One or more of the ACID properties not supported Joins generally not supported Usually flexible schemas Some well known examples: Google’s BigTable, Amazon’s S3 & Facebook’s Cassandra Several recent open source systems 23
Different Types of NoSQL Systems Distributed Key-Value Systems Amazon’s S3 Key-Value Store (Dynamo) Voldemort Column-based Systems BigTable HBase Cassandra Document-based systems CouchDB 24
Cassandra vsMySQLComparison MySQL > 50 GB Data Writes Average : ~300 msReads Average : ~350 ms Cassandra > 50 GB DataWrites Average : 0.12 msReads Average : 15 ms Source: AvinashLakshman, PrashantMalik, Cassandra Structured Storage System over a P2P Network, static.last.fm/johan/nosql-20090611/cassandra_nosql.pdf
CAP Theorem Proposed by Eric Brewer, 2000 Three properties of a system: consistency, availability and partitions You can have at most two of these three properties for any shared-data system Scale out requires partitions Most large web-based systems choose availability over consistency 26 Reference: Brewer, PODC 2000; Gilbert/Lynch, SIGACT News 2002
Eventual Consistency All updates eventually propagate through the system and all nodes will eventually be consistent (assuming no more updates) Eventually, a node is either updated or removed from service.   Can be implemented with Gossip protocol Amazon’s Dynamo popularized this approach Sometimes this is called BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID 27
Part 3.  Sector Architecture 28
Design Objectives 1. Provide Internet scale data storage for large data Support multiple data centers connected by high speed wide networks 2. Simplify data intensive computing for a larger class of problems than covered by MapReduce Support applying User Defined Functions to the data managed by a storage cloud, with transparent load balancing and fault tolerance
Sector’s Large Data Cloud 30 Applications Compute Services Sphere’s UDFs Data Services Sector’s Distributed File System (SDFS) Storage Services Routing & Transport Services UDP-based Data Transport Protocol (UDT) Sector’s Stack
Apply User Defined Functions (UDF) to Files in Storage Cloud map/shuffle reduce 31 UDF UDF
System Architecture User account Data protection System Security Metadata Scheduling Service provider System access tools App. Programming Interfaces Security Server Masters Clients SSL SSL Data UDT Encryption optional slaves slaves Storage and Processing
33
Terasort Benchmark Sector/Sphere 1.24a, Hadoop 0.20.1 with no replication on Phase 2 of Open Cloud Testbed with co-located racks.
MalStone 36 entities sites dk-2 dk-1 dk time
MalStone Sector/Sphere 1.20, Hadoop 0.18.3 with no replication on Phase 1 of Open Cloud Testbed in a single rack.  Data consisted of 20 nodes with 500 million 100-byte records / node.
Parallel Data Processing using Sphere Input data segment is processing by Sphere Processing Engine (SPE) on each slave in parallel. Results can be either stored in local node or sent to multiple bucket files on different nodes All inputs and outputs are Sector files. Sphere processes can combined by communicating via Sector files.
Disks Input Segments ,[object Object]
 Directory directives
 In-memory objects UDF Bucket Writers Output Segments Disks
Application-Aware File System Files are not split into blocks Users supply appropriate sized files to Sector Directory and File Family Sector will keep related files together during upload and replication (e.g., a directory of files, data and index for a database table, etc.) Allow users to specify file location, if necessary Better data locality no need to move blocks for applications that process files or even multiple files as minimum input unit In-memory object Store repeatedly-accessed data in memory
Sector Summary Sector is fastest open source large data cloud As measured by MalStone & Terasort Sector is easy to program UDFs, MapReduce & Python over streams Sector does not require extensive tuning Sector is secure A HIPAA compliant Sector cloud is being launched Sector is reliable Sector supports multiple active master node servers 41
Part 4. Sector Applications
App 1: Bionimbus 43 www.cistrack.org  & www.bionimbus.org
App 2: Bulk Download of the SDSS 44 ,[object Object]
 Sector LLPR varies between 0.61 and 0.98 Recent Sloan Digital Sky Survey (SDSS) data release is 14 TB in size.
App 3: Anomalies in Network Data 45

More Related Content

What's hot

OCC Overview OMG Clouds Meeting 07-13-09 v3
OCC Overview OMG Clouds Meeting 07-13-09 v3OCC Overview OMG Clouds Meeting 07-13-09 v3
OCC Overview OMG Clouds Meeting 07-13-09 v3
Robert Grossman
 
Bioclouds CAMDA (Robert Grossman) 09-v9p
Bioclouds CAMDA (Robert Grossman) 09-v9pBioclouds CAMDA (Robert Grossman) 09-v9p
Bioclouds CAMDA (Robert Grossman) 09-v9p
Robert Grossman
 
Project Matsu: Elastic Clouds for Disaster Relief
Project Matsu: Elastic Clouds for Disaster ReliefProject Matsu: Elastic Clouds for Disaster Relief
Project Matsu: Elastic Clouds for Disaster Relief
Robert Grossman
 
What Are Science Clouds?
What Are Science Clouds?What Are Science Clouds?
What Are Science Clouds?
Robert Grossman
 
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)
Robert Grossman
 
The Open Science Data Cloud: Empowering the Long Tail of Science
The Open Science Data Cloud: Empowering the Long Tail of ScienceThe Open Science Data Cloud: Empowering the Long Tail of Science
The Open Science Data Cloud: Empowering the Long Tail of Science
Robert Grossman
 
PIC Tier-1 (LHCP Conference / Barcelona)
PIC Tier-1 (LHCP Conference / Barcelona)PIC Tier-1 (LHCP Conference / Barcelona)
PIC Tier-1 (LHCP Conference / Barcelona)
Josep Flix
 
Coding the Continuum
Coding the ContinuumCoding the Continuum
Coding the Continuum
Ian Foster
 
Open Science Data Cloud (June 21, 2010)
Open Science Data Cloud (June 21, 2010)Open Science Data Cloud (June 21, 2010)
Open Science Data Cloud (June 21, 2010)
Robert Grossman
 
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery DataThe Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
Robert Grossman
 
OpenTopography - Scalable Services for Geosciences Data
OpenTopography - Scalable Services for Geosciences DataOpenTopography - Scalable Services for Geosciences Data
OpenTopography - Scalable Services for Geosciences Data
OpenTopography Facility
 
Bionimbus - Northwestern CGI Workshop 4-21-2011
Bionimbus - Northwestern CGI Workshop 4-21-2011Bionimbus - Northwestern CGI Workshop 4-21-2011
Bionimbus - Northwestern CGI Workshop 4-21-2011
Robert Grossman
 
Stanford/SLAC Cryo-EM Computing and Storage, Yee-Ting Li
Stanford/SLAC Cryo-EM Computing and Storage, Yee-Ting LiStanford/SLAC Cryo-EM Computing and Storage, Yee-Ting Li
Stanford/SLAC Cryo-EM Computing and Storage, Yee-Ting Li
PacificResearchPlatform
 
Earth Science Platform
Earth Science PlatformEarth Science Platform
Earth Science Platform
Ted Habermann
 
NERSC, AI and the Superfacility, Debbie Bard
NERSC, AI and the Superfacility, Debbie BardNERSC, AI and the Superfacility, Debbie Bard
NERSC, AI and the Superfacility, Debbie Bard
PacificResearchPlatform
 
Slide 1
Slide 1Slide 1
Slide 1butest
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Ian Foster
 
Using the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchUsing the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science Research
Robert Grossman
 
Big Data, Big Computing, AI, and Environmental Science
Big Data, Big Computing, AI, and Environmental ScienceBig Data, Big Computing, AI, and Environmental Science
Big Data, Big Computing, AI, and Environmental Science
Ian Foster
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
Ian Foster
 

What's hot (20)

OCC Overview OMG Clouds Meeting 07-13-09 v3
OCC Overview OMG Clouds Meeting 07-13-09 v3OCC Overview OMG Clouds Meeting 07-13-09 v3
OCC Overview OMG Clouds Meeting 07-13-09 v3
 
Bioclouds CAMDA (Robert Grossman) 09-v9p
Bioclouds CAMDA (Robert Grossman) 09-v9pBioclouds CAMDA (Robert Grossman) 09-v9p
Bioclouds CAMDA (Robert Grossman) 09-v9p
 
Project Matsu: Elastic Clouds for Disaster Relief
Project Matsu: Elastic Clouds for Disaster ReliefProject Matsu: Elastic Clouds for Disaster Relief
Project Matsu: Elastic Clouds for Disaster Relief
 
What Are Science Clouds?
What Are Science Clouds?What Are Science Clouds?
What Are Science Clouds?
 
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)
 
The Open Science Data Cloud: Empowering the Long Tail of Science
The Open Science Data Cloud: Empowering the Long Tail of ScienceThe Open Science Data Cloud: Empowering the Long Tail of Science
The Open Science Data Cloud: Empowering the Long Tail of Science
 
PIC Tier-1 (LHCP Conference / Barcelona)
PIC Tier-1 (LHCP Conference / Barcelona)PIC Tier-1 (LHCP Conference / Barcelona)
PIC Tier-1 (LHCP Conference / Barcelona)
 
Coding the Continuum
Coding the ContinuumCoding the Continuum
Coding the Continuum
 
Open Science Data Cloud (June 21, 2010)
Open Science Data Cloud (June 21, 2010)Open Science Data Cloud (June 21, 2010)
Open Science Data Cloud (June 21, 2010)
 
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery DataThe Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
 
OpenTopography - Scalable Services for Geosciences Data
OpenTopography - Scalable Services for Geosciences DataOpenTopography - Scalable Services for Geosciences Data
OpenTopography - Scalable Services for Geosciences Data
 
Bionimbus - Northwestern CGI Workshop 4-21-2011
Bionimbus - Northwestern CGI Workshop 4-21-2011Bionimbus - Northwestern CGI Workshop 4-21-2011
Bionimbus - Northwestern CGI Workshop 4-21-2011
 
Stanford/SLAC Cryo-EM Computing and Storage, Yee-Ting Li
Stanford/SLAC Cryo-EM Computing and Storage, Yee-Ting LiStanford/SLAC Cryo-EM Computing and Storage, Yee-Ting Li
Stanford/SLAC Cryo-EM Computing and Storage, Yee-Ting Li
 
Earth Science Platform
Earth Science PlatformEarth Science Platform
Earth Science Platform
 
NERSC, AI and the Superfacility, Debbie Bard
NERSC, AI and the Superfacility, Debbie BardNERSC, AI and the Superfacility, Debbie Bard
NERSC, AI and the Superfacility, Debbie Bard
 
Slide 1
Slide 1Slide 1
Slide 1
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
 
Using the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchUsing the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science Research
 
Big Data, Big Computing, AI, and Environmental Science
Big Data, Big Computing, AI, and Environmental ScienceBig Data, Big Computing, AI, and Environmental Science
Big Data, Big Computing, AI, and Environmental Science
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
 

Similar to My Other Computer is a Data Center: The Sector Perspective on Big Data

hadoop seminar training report
hadoop seminar  training reporthadoop seminar  training report
hadoop seminar training report
Sarvesh Meena
 
An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)
An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)
An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)
Robert Grossman
 
Google Cloud Computing on Google Developer 2008 Day
Google Cloud Computing on Google Developer 2008 DayGoogle Cloud Computing on Google Developer 2008 Day
Google Cloud Computing on Google Developer 2008 Dayprogrammermag
 
My Other Computer is a Data Center (2010 v21)
My Other Computer is a Data Center (2010 v21)My Other Computer is a Data Center (2010 v21)
My Other Computer is a Data Center (2010 v21)
Robert Grossman
 
BDA Mod2@AzDOCUMENTS.in.pdf
BDA Mod2@AzDOCUMENTS.in.pdfBDA Mod2@AzDOCUMENTS.in.pdf
BDA Mod2@AzDOCUMENTS.in.pdf
KUMARRISHAV37
 
Sector - Presentation at Cloud Computing & Its Applications 2009
Sector - Presentation at Cloud Computing & Its Applications 2009Sector - Presentation at Cloud Computing & Its Applications 2009
Sector - Presentation at Cloud Computing & Its Applications 2009
Robert Grossman
 
Big data & Hadoop
Big data & HadoopBig data & Hadoop
Big data & Hadoop
Ahmed Gamil
 
Fundamental question and answer in cloud computing quiz by animesh chaturvedi
Fundamental question and answer in cloud computing quiz by animesh chaturvediFundamental question and answer in cloud computing quiz by animesh chaturvedi
Fundamental question and answer in cloud computing quiz by animesh chaturvedi
Animesh Chaturvedi
 
Hadoop Cluster Analysis and Assessment
Hadoop Cluster Analysis and AssessmentHadoop Cluster Analysis and Assessment
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical DemonstrationMaximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Denodo
 
spectrum Storage Whitepaper
spectrum Storage Whitepaperspectrum Storage Whitepaper
spectrum Storage WhitepaperCarina Kordan
 
A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate Data
Robert Grossman
 
grid mining
grid mininggrid mining
grid mining
ARNOLD
 
H017144148
H017144148H017144148
H017144148
IOSR Journals
 
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...
IOSR Journals
 
Introduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone ModeIntroduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone Mode
inventionjournals
 
A Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis TechniquesA Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis Techniques
ijsrd.com
 
BIG DATA
BIG DATABIG DATA
BIG DATA
Shashank Shetty
 

Similar to My Other Computer is a Data Center: The Sector Perspective on Big Data (20)

hadoop seminar training report
hadoop seminar  training reporthadoop seminar  training report
hadoop seminar training report
 
An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)
An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)
An Introduction to Cloud Computing by Robert Grossman 08-06-09 (v19)
 
Google Cloud Computing on Google Developer 2008 Day
Google Cloud Computing on Google Developer 2008 DayGoogle Cloud Computing on Google Developer 2008 Day
Google Cloud Computing on Google Developer 2008 Day
 
My Other Computer is a Data Center (2010 v21)
My Other Computer is a Data Center (2010 v21)My Other Computer is a Data Center (2010 v21)
My Other Computer is a Data Center (2010 v21)
 
BDA Mod2@AzDOCUMENTS.in.pdf
BDA Mod2@AzDOCUMENTS.in.pdfBDA Mod2@AzDOCUMENTS.in.pdf
BDA Mod2@AzDOCUMENTS.in.pdf
 
Sector - Presentation at Cloud Computing & Its Applications 2009
Sector - Presentation at Cloud Computing & Its Applications 2009Sector - Presentation at Cloud Computing & Its Applications 2009
Sector - Presentation at Cloud Computing & Its Applications 2009
 
Big data & Hadoop
Big data & HadoopBig data & Hadoop
Big data & Hadoop
 
Fundamental question and answer in cloud computing quiz by animesh chaturvedi
Fundamental question and answer in cloud computing quiz by animesh chaturvediFundamental question and answer in cloud computing quiz by animesh chaturvedi
Fundamental question and answer in cloud computing quiz by animesh chaturvedi
 
Hadoop Cluster Analysis and Assessment
Hadoop Cluster Analysis and AssessmentHadoop Cluster Analysis and Assessment
Hadoop Cluster Analysis and Assessment
 
Big data
Big dataBig data
Big data
 
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical DemonstrationMaximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
 
spectrum Storage Whitepaper
spectrum Storage Whitepaperspectrum Storage Whitepaper
spectrum Storage Whitepaper
 
A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate Data
 
grid mining
grid mininggrid mining
grid mining
 
H017144148
H017144148H017144148
H017144148
 
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...
 
Introduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone ModeIntroduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone Mode
 
Big Data & Hadoop
Big Data & HadoopBig Data & Hadoop
Big Data & Hadoop
 
A Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis TechniquesA Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis Techniques
 
BIG DATA
BIG DATABIG DATA
BIG DATA
 

More from Robert Grossman

Some Frameworks for Improving Analytic Operations at Your Company
Some Frameworks for Improving Analytic Operations at Your CompanySome Frameworks for Improving Analytic Operations at Your Company
Some Frameworks for Improving Analytic Operations at Your Company
Robert Grossman
 
Some Proposed Principles for Interoperating Cloud Based Data Platforms
Some Proposed Principles for Interoperating Cloud Based Data PlatformsSome Proposed Principles for Interoperating Cloud Based Data Platforms
Some Proposed Principles for Interoperating Cloud Based Data Platforms
Robert Grossman
 
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Robert Grossman
 
A Data Biosphere for Biomedical Research
A Data Biosphere for Biomedical ResearchA Data Biosphere for Biomedical Research
A Data Biosphere for Biomedical Research
Robert Grossman
 
What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?
Robert Grossman
 
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
Robert Grossman
 
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
Robert Grossman
 
AnalyticOps - Chicago PAW 2016
AnalyticOps - Chicago PAW 2016AnalyticOps - Chicago PAW 2016
AnalyticOps - Chicago PAW 2016
Robert Grossman
 
Keynote on 2015 Yale Day of Data
Keynote on 2015 Yale Day of Data Keynote on 2015 Yale Day of Data
Keynote on 2015 Yale Day of Data
Robert Grossman
 
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
Robert Grossman
 
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
Robert Grossman
 
Clouds and Commons for the Data Intensive Science Community (June 8, 2015)
Clouds and Commons for the Data Intensive Science Community (June 8, 2015)Clouds and Commons for the Data Intensive Science Community (June 8, 2015)
Clouds and Commons for the Data Intensive Science Community (June 8, 2015)
Robert Grossman
 
Architectures for Data Commons (XLDB 15 Lightning Talk)
Architectures for Data Commons (XLDB 15 Lightning Talk)Architectures for Data Commons (XLDB 15 Lightning Talk)
Architectures for Data Commons (XLDB 15 Lightning Talk)
Robert Grossman
 
Practical Methods for Identifying Anomalies That Matter in Large Datasets
Practical Methods for Identifying Anomalies That Matter in Large DatasetsPractical Methods for Identifying Anomalies That Matter in Large Datasets
Practical Methods for Identifying Anomalies That Matter in Large Datasets
Robert Grossman
 
What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care?
Robert Grossman
 
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Robert Grossman
 
Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)
Robert Grossman
 
Adversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World TalkAdversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World Talk
Robert Grossman
 
Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)
Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)
Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)
Robert Grossman
 
Big Data - Lab A1 (SC 11 Tutorial)
Big Data - Lab A1 (SC 11 Tutorial)Big Data - Lab A1 (SC 11 Tutorial)
Big Data - Lab A1 (SC 11 Tutorial)Robert Grossman
 

More from Robert Grossman (20)

Some Frameworks for Improving Analytic Operations at Your Company
Some Frameworks for Improving Analytic Operations at Your CompanySome Frameworks for Improving Analytic Operations at Your Company
Some Frameworks for Improving Analytic Operations at Your Company
 
Some Proposed Principles for Interoperating Cloud Based Data Platforms
Some Proposed Principles for Interoperating Cloud Based Data PlatformsSome Proposed Principles for Interoperating Cloud Based Data Platforms
Some Proposed Principles for Interoperating Cloud Based Data Platforms
 
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
 
A Data Biosphere for Biomedical Research
A Data Biosphere for Biomedical ResearchA Data Biosphere for Biomedical Research
A Data Biosphere for Biomedical Research
 
What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?What is Data Commons and How Can Your Organization Build One?
What is Data Commons and How Can Your Organization Build One?
 
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
 
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
 
AnalyticOps - Chicago PAW 2016
AnalyticOps - Chicago PAW 2016AnalyticOps - Chicago PAW 2016
AnalyticOps - Chicago PAW 2016
 
Keynote on 2015 Yale Day of Data
Keynote on 2015 Yale Day of Data Keynote on 2015 Yale Day of Data
Keynote on 2015 Yale Day of Data
 
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
 
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
 
Clouds and Commons for the Data Intensive Science Community (June 8, 2015)
Clouds and Commons for the Data Intensive Science Community (June 8, 2015)Clouds and Commons for the Data Intensive Science Community (June 8, 2015)
Clouds and Commons for the Data Intensive Science Community (June 8, 2015)
 
Architectures for Data Commons (XLDB 15 Lightning Talk)
Architectures for Data Commons (XLDB 15 Lightning Talk)Architectures for Data Commons (XLDB 15 Lightning Talk)
Architectures for Data Commons (XLDB 15 Lightning Talk)
 
Practical Methods for Identifying Anomalies That Matter in Large Datasets
Practical Methods for Identifying Anomalies That Matter in Large DatasetsPractical Methods for Identifying Anomalies That Matter in Large Datasets
Practical Methods for Identifying Anomalies That Matter in Large Datasets
 
What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care?
 
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
 
Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)
 
Adversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World TalkAdversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World Talk
 
Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)
Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)
Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)
 
Big Data - Lab A1 (SC 11 Tutorial)
Big Data - Lab A1 (SC 11 Tutorial)Big Data - Lab A1 (SC 11 Tutorial)
Big Data - Lab A1 (SC 11 Tutorial)
 

Recently uploaded

Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 

Recently uploaded (20)

Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 

My Other Computer is a Data Center: The Sector Perspective on Big Data

  • 1. My Other Computer Is A Data Center: The Sector Perspective on Big Data July 25, 2010 1 RobertGrossman Laboratory for Advanced Computing University of Illinois at Chicago Open Data Group Institute for Genomics & Systems BiologyUniversity of Chicago
  • 2. What Is Small Data? 100 million movie ratings 480 thousand customers 17,000 movies From 1998 to 2005 Less than 2 GB data. Fits into memory, but very sophisticated models required to win.
  • 3. What is Large Data? One simple definition is consider a dataset large if it is to big to fit into a database Search engines, social networks, and other Internet applications generate large datasets in this sense.
  • 4. Part 1What’s Different About Data Center Computing? 4
  • 5. Data center scale computing provides storage and computational resources at the scale and with the reliability of a data center.
  • 6. A very nice recent book by Barroso and Holzle
  • 8. Elastic, Usage Based Pricing Is New 8 costs the same as 1 computer in a rack for 120 hours 120 computers in three racks for 1 hour
  • 9. Simplicity of the Parallel Programming Framework is New 9 A new programmer can develop a program to process a container full of data with less than day of training using MapReduce.
  • 10. Elastic Clouds Large Data Clouds Goal: Minimize cost of virtualized machines & provide on-demand. HPC Goal: Maximize data (with matching compute) and control cost. Goal: Minimize latency and control heat.
  • 11. 2003 10x-100x 1976 10x-100x data science 1670 250x simulation science 1609 30x experimental science
  • 12. 12
  • 13. 13
  • 14. Part 2. Stacks for Big Data 14
  • 15. The Google Data Stack The Google File System (2003) MapReduce: Simplified Data Processing… (2004) BigTable: A Distributed Storage System… (2006) 15
  • 16. Map-Reduce Example Input is file with one document per record User specifies map function key = document URL Value = terms that document contains “it”, 1“was”, 1“the”, 1“best”, 1 (“doc cdickens”,“it was the best of times”) map
  • 17. Example (cont’d) MapReduce library gathers together all pairs with the same key value (shuffle/sort phase) The user-defined reduce function combines all the values associated with the same key key = “it”values = 1, 1 “it”, 2“was”, 2“best”, 1“worst”, 1 key = “was”values = 1, 1 reduce key = “best”values = 1 key = “worst”values = 1
  • 18. Applying MapReduce to the Data in Storage Cloud map/shuffle reduce 18
  • 19. Google’s Large Data Cloud Compute Services Data Services Storage Services 19 Applications Google’s MapReduce Google’s BigTable Google File System (GFS) Google’s Stack
  • 20. Hadoop’s Large Data Cloud Applications Compute Services 20 Hadoop’sMapReduce Data Services NoSQL Databases Hadoop Distributed File System (HDFS) Storage Services Hadoop’s Stack
  • 21. Amazon Style Data Cloud Load Balancer Simple Queue Service 21 SDB EC2 Instance EC2 Instance EC2 Instance EC2 Instance EC2 Instance EC2 Instance EC2 Instance EC2 Instance EC2 Instance EC2 Instance EC2 Instances EC2 Instances S3 Storage Services
  • 22. Evolution of NoSQL Databases Standard architecture for simple web apps: Front end load balanced web servers Business logic layer in the middle Backend database Databases do not scale well with very large numbers of users or very large amounts of data Alternatives include Sharded (partitioned) databases master-slave databases memcache 22
  • 23. NoSQL Systems Suggests No SQL support, also Not Only SQL One or more of the ACID properties not supported Joins generally not supported Usually flexible schemas Some well known examples: Google’s BigTable, Amazon’s S3 & Facebook’s Cassandra Several recent open source systems 23
  • 24. Different Types of NoSQL Systems Distributed Key-Value Systems Amazon’s S3 Key-Value Store (Dynamo) Voldemort Column-based Systems BigTable HBase Cassandra Document-based systems CouchDB 24
  • 25. Cassandra vsMySQLComparison MySQL > 50 GB Data Writes Average : ~300 msReads Average : ~350 ms Cassandra > 50 GB DataWrites Average : 0.12 msReads Average : 15 ms Source: AvinashLakshman, PrashantMalik, Cassandra Structured Storage System over a P2P Network, static.last.fm/johan/nosql-20090611/cassandra_nosql.pdf
  • 26. CAP Theorem Proposed by Eric Brewer, 2000 Three properties of a system: consistency, availability and partitions You can have at most two of these three properties for any shared-data system Scale out requires partitions Most large web-based systems choose availability over consistency 26 Reference: Brewer, PODC 2000; Gilbert/Lynch, SIGACT News 2002
  • 27. Eventual Consistency All updates eventually propagate through the system and all nodes will eventually be consistent (assuming no more updates) Eventually, a node is either updated or removed from service. Can be implemented with Gossip protocol Amazon’s Dynamo popularized this approach Sometimes this is called BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID 27
  • 28. Part 3. Sector Architecture 28
  • 29. Design Objectives 1. Provide Internet scale data storage for large data Support multiple data centers connected by high speed wide networks 2. Simplify data intensive computing for a larger class of problems than covered by MapReduce Support applying User Defined Functions to the data managed by a storage cloud, with transparent load balancing and fault tolerance
  • 30. Sector’s Large Data Cloud 30 Applications Compute Services Sphere’s UDFs Data Services Sector’s Distributed File System (SDFS) Storage Services Routing & Transport Services UDP-based Data Transport Protocol (UDT) Sector’s Stack
  • 31. Apply User Defined Functions (UDF) to Files in Storage Cloud map/shuffle reduce 31 UDF UDF
  • 32. System Architecture User account Data protection System Security Metadata Scheduling Service provider System access tools App. Programming Interfaces Security Server Masters Clients SSL SSL Data UDT Encryption optional slaves slaves Storage and Processing
  • 33. 33
  • 34.
  • 35. Terasort Benchmark Sector/Sphere 1.24a, Hadoop 0.20.1 with no replication on Phase 2 of Open Cloud Testbed with co-located racks.
  • 36. MalStone 36 entities sites dk-2 dk-1 dk time
  • 37. MalStone Sector/Sphere 1.20, Hadoop 0.18.3 with no replication on Phase 1 of Open Cloud Testbed in a single rack. Data consisted of 20 nodes with 500 million 100-byte records / node.
  • 38. Parallel Data Processing using Sphere Input data segment is processing by Sphere Processing Engine (SPE) on each slave in parallel. Results can be either stored in local node or sent to multiple bucket files on different nodes All inputs and outputs are Sector files. Sphere processes can combined by communicating via Sector files.
  • 39.
  • 41. In-memory objects UDF Bucket Writers Output Segments Disks
  • 42. Application-Aware File System Files are not split into blocks Users supply appropriate sized files to Sector Directory and File Family Sector will keep related files together during upload and replication (e.g., a directory of files, data and index for a database table, etc.) Allow users to specify file location, if necessary Better data locality no need to move blocks for applications that process files or even multiple files as minimum input unit In-memory object Store repeatedly-accessed data in memory
  • 43. Sector Summary Sector is fastest open source large data cloud As measured by MalStone & Terasort Sector is easy to program UDFs, MapReduce & Python over streams Sector does not require extensive tuning Sector is secure A HIPAA compliant Sector cloud is being launched Sector is reliable Sector supports multiple active master node servers 41
  • 44. Part 4. Sector Applications
  • 45. App 1: Bionimbus 43 www.cistrack.org & www.bionimbus.org
  • 46.
  • 47. Sector LLPR varies between 0.61 and 0.98 Recent Sloan Digital Sky Survey (SDSS) data release is 14 TB in size.
  • 48. App 3: Anomalies in Network Data 45
  • 49. Sector Applications Distributing the 15 TB Sloan Digital Sky Survey to astronomers around the world (with JHU, 2005) Managing and analyzing high throughput sequence data (Cistrack, University of Chicago, 2007). Detecting emergent behavior in distributed network data (Angle, won SC 07 Analytics Challenge) Wide area clouds (won SC 09 BWC with 100 Gbps wide area computation) New ensemble-based algorithms for trees Graph processing Image processing (OCC Project Matsu) 46
  • 50. Credits Sector was developed by YunhongGu from the University of Illinois at Chicago and verycloud.com
  • 51. For More Information For more information, please visit sector.sourceforge.net rgrossman.com