The Open Cloud Consortium is a not-for-profit that operates the Open Science Data Cloud, a cloud computing infrastructure supporting scientific research. It manages cloud computing testbeds and resources donated by universities, companies, government agencies, and international partners, with the goal of democratizing access to data and computing power for scientific discovery.
Using the Open Science Data Cloud for Data Science Research, by Robert Grossman
The Open Science Data Cloud is a petabyte scale science cloud for managing, analyzing, and sharing large datasets. We give an overview of the Open Science Data Cloud and how it can be used for data science research.
The Matsu Project - Open Source Software for Processing Satellite Imagery Data, by Robert Grossman
The Matsu Project is an Open Cloud Consortium project that is developing open source software for processing satellite imagery data using Hadoop, OpenStack and R.
Architectures for Data Commons (XLDB 15 Lightning Talk), by Robert Grossman
These are the slides from a 5 minute Lightning Talk that I gave at XLDB 2015 on May 19, 2015 at Stanford. It is based in part on our experiences developing the NCI Genomic Data Commons (GDC).
These are the slides from a plenary panel that I participated in at IEEE Cloud 2011 on July 5, 2011 in Washington, D.C. I discussed the Open Science Data Cloud and concluded the talk with three research questions.
Large Scale On-Demand Image Processing For Disaster Relief, by Robert Grossman
This is a status update (as of Feb 22, 2010) of a new Open Cloud Consortium project that will provide on-demand, large scale image processing to assist with disaster relief efforts.
This is a talk titled "Cloud-Based Services For Large Scale Analysis of Sequence & Expression Data: Lessons from Cistrack" that I gave at CAMDA 2009 on October 6, 2009.
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes..., by Geoffrey Fox
“Next Generation Grid – HPC Cloud” proposes a toolkit capturing the current capabilities of Apache Hadoop, Spark, Flink, and Heron, as well as MPI and Asynchronous Many-Task systems from HPC. This supports a Cloud-HPC-Edge (Fog, Device) Function-as-a-Service architecture. Note that this "new grid" is focused on data and IoT, not computing. It uses interoperable common abstractions but multiple polymorphic implementations.
New learning technologies seem likely to transform much of science, as they are already doing for many areas of industry and society. We can expect these technologies to be used, for example, to obtain new insights from massive scientific data and to automate research processes. However, success in such endeavors will require new learning systems: scientific computing platforms, methods, and software that enable the large-scale application of learning technologies. These systems will need to enable learning from extremely large quantities of data; the management of large and complex data, models, and workflows; and the delivery of learning capabilities to many thousands of scientists. In this talk, I review these challenges and opportunities and describe systems that my colleagues and I are developing to enable the application of learning throughout the research process, from data acquisition to analysis.
Materials Data Facility: Streamlined and automated data sharing, discovery, ..., by Ian Foster
Reviews recent results from the Materials Data Facility. Thanks in particular to Ben Blaiszik, Jonathon Goff, and Logan Ward, and the Globus data search team. Some features shown here are still in beta. We are grateful to NIST for its support.
5th Multicore World
15-17 February 2016 – Shed 6, Wellington, New Zealand
http://openparallel.com/multicore-world-2016/
We start by dividing applications into data plus model components and classifying each component (whether from Big Data or Big Simulations) in the same way. This leads to 64 properties divided into 4 views: Problem Architecture (macro patterns); Execution Features (micro patterns); Data Source and Style; and finally the Processing (runtime) View.
We discuss convergence software built around HPC-ABDS (High Performance Computing enhanced Apache Big Data Stack) http://hpc-abds.org/kaleidoscope/ and show how one can merge Big Data and HPC (Big Simulation) concepts into a single stack.
We give examples of data analytics running on HPC systems including details on persuading Java to run fast.
Some details can be found at http://dsc.soic.indiana.edu/publications/HPCBigDataConvergence.pdf
In 2001, as early high-speed networks were deployed, George Gilder observed that “when the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” Two decades later, our networks are 1,000 times faster, our appliances are increasingly specialized, and our computer systems are indeed disintegrating. As hardware acceleration overcomes speed-of-light delays, time and space merge into a computing continuum. Familiar questions like “where should I compute,” “for what workloads should I design computers,” and "where should I place my computers” seem to allow for a myriad of new answers that are exhilarating but also daunting. Are there concepts that can help guide us as we design applications and computer systems in a world that is untethered from familiar landmarks like center, cloud, edge? I propose some ideas and report on experiments in coding the continuum.
We present a software model built on the Apache software stack (ABDS), widely used in modern cloud computing, which we enhance with HPC concepts to derive HPC-ABDS.
We discuss layers in this stack
We give examples of integrating ABDS with HPC
We discuss how to implement this in a world of multiple infrastructures and evolving software environments for users, developers and administrators
We present Cloudmesh as supporting Software-Defined Distributed System as a Service or SDDSaaS with multiple services on multiple clouds/HPC systems.
We explain the functionality of Cloudmesh as well as the 3 administrator and 3 user modes supported
Presentation at a public event at C asean, hosted by the National Innovation Agency of Thailand. This talk provides an overview of the Open and Collaborative Science in Development Network, its history, goals, research objectives and the network partners. In particular, it highlights the rationale behind the drafting of a set of principles underlying a vision of open science that has at its core a commitment to equitable participation in the production and circulation of scientific knowledge.
Data quality is very important for downstream analyses such as sequence assembly and single-nucleotide polymorphism identification. These slides cover parameters for NGS data quality checks and the data formats of the major sequencing machines.
Next-generation sequencing format and visualization with ngs.plot, by Li Shen
Lecture given at the Department of Neuroscience, Icahn School of Medicine at Mount Sinai. ngs.plot has been published in BMC Genomics. Link: http://www.biomedcentral.com/1471-2164/15/284
Presentation covering the data and file formats commonly used in next-generation sequencing (high-throughput sequencing) analyses: from nucleotide ambiguity codes, FASTA and FASTQ, and quality scores to SAM and BAM, CIGAR strings, and the variant call format. This was given as part of the EPIZONE Workshop on Next Generation Sequencing applications and Bioinformatics in Brussels, Belgium, in April 2016.
Practical Methods for Identifying Anomalies That Matter in Large Datasets, by Robert Grossman
Robert L. Grossman, Practical Methods for Identifying Anomalies That Matter in Large Datasets, O’Reilly, Strata + Hadoop World, San Jose, California, February 20, 2015.
Adversarial Analytics - 2013 Strata & Hadoop World Talk, by Robert Grossman
This is a talk I gave at the Strata Conference and Hadoop World in New York City on October 28, 2013. It describes predictive modeling in the context of modeling an adversary's behavior.
Positioning University of California Information Technology for the Future: State, National, and International IT Infrastructure Trends and Directions, by Larry Smarr
05.02.15
Invited Talk
The Vice Chancellor of Research and Chief Information Officer Summit
“Information Technology Enabling Research at the University of California”
Title: Positioning University of California Information Technology for the Future: State, National, and International IT Infrastructure Trends and Directions
Oakland, CA
High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biomedical Sciences, by Larry Smarr
11.04.06
Joint Presentation
UCSD School of Medicine Research Council
Larry Smarr, Calit2 & Phil Papadopoulos, SDSC/Calit2
Title: High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biomedical Sciences
Grid optical network service architecture for data intensive applications, by Tal Lavian, Ph.D.
An integrated software system provides the "glue."
A dynamic optical network serves as a fundamental Grid service in data-intensive Grid applications: scheduled, managed, and coordinated to support collaborative operations.
From supercomputer to super-network: in the past, computer processors were the fastest part, with peripheral bottlenecks; in the future, optical networks will be the fastest part, and computers, processors, storage, visualization, and instrumentation will be the slower "peripherals."
eScience cyberinfrastructure focuses on computation, storage, data, analysis, and workflow. The network is vital for better eScience.
Impact of Grid Computing on Network Operators and HW Vendors, by Tal Lavian, Ph.D.
The network is a prime resource for large-scale distributed systems.
An integrated software system provides the "glue."
A dynamic optical network serves as a fundamental Grid service in data-intensive Grid applications: scheduled, managed, and coordinated to support collaborative operations.
The next-generation sequencing data deluge requires storage and compute services to be provisioned at an ever-increasing rate. Can cloud computing (and last decade's buzzword, grid) help us?
Talk given at the NHGRI Cloud computing workshop, 2010.
A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Intensive Research, by Larry Smarr
11.12.12
Seminar Presentation
Princeton Institute for Computational Science and Engineering (PICSciE)
Princeton University
Title: A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Intensive Research
Princeton, NJ
Some Frameworks for Improving Analytic Operations at Your Company, by Robert Grossman
I review three frameworks for analytic operations that are designed to improve the value obtained when deploying analytic models into products, services and internal operations.
This is a talk that I gave at BioIT World West on March 12, 2019. The talk was called: A Gen3 Perspective of Disparate Data: From Pipelines in Data Commons to AI in Data Ecosystems.
Crossing the Analytics Chasm and Getting the Models You Developed Deployed, by Robert Grossman
There are two cultures in data science and analytics: those who develop analytic models and those who deploy analytic models into operational systems. In this talk, we review the life cycle of analytic models and provide an overview of some of the approaches that have been developed for managing analytic models and workflows and for deploying them, including using analytic engines and analytic containers. We give a quick overview of languages for analytic models (PMML) and analytic workflows (PFA). We also describe the emerging discipline of AnalyticOps, which has borrowed some of the techniques of DevOps.
This is an overview of the Data Biosphere Project, its goals, its architecture, and the three core projects that form its foundation. We also discuss data commons.
What is a Data Commons and How Can Your Organization Build One? by Robert Grossman
This is a talk that I gave at the Molecular Medicine Tri Conference on data commons and data sharing to accelerate research discoveries and improve patient outcomes. It also covers how your organization can build a data commons using the Open Commons Consortium's Data Commons Framework and the University of Chicago's Gen3 data commons platform.
This is a talk I gave at a Northwestern University - Complete Genomics Workshop on April 21, 2011 about using clouds to support research in genomics and related areas.
The Open Science Data Cloud: Empowering the Long Tail of Science
1. A 501(c)(3) not-for-profit operating clouds for science. The Open Science Data Cloud: Empowering the Long Tail of Science. October 12, 2012. Robert L. Grossman, University of Chicago and Open Cloud Consortium.
2. Question 1. What is the cyberinfrastructure required to manage, analyze, archive and share big data? Call this analytic infrastructure.
3. Question 2. What is the analogy of the GLIF* for analytic infrastructure?
*GLIF (www.glif.is), the Global Lambda Integrated Facility, is an international virtual organization that promotes the paradigm of lambda networking. GLIF provides lambdas internationally as an integrated facility to support data-intensive scientific research, and supports middleware development for lambda networking.
4. Number of projects vs. data size and infrastructure:
• 1000's of individual scientists & small projects: small data, public infrastructure.
• 100's of community-based science projects, via Science as a Service: medium to large data, shared community infrastructure.
• 10's of very large projects: very large data, dedicated infrastructure.
5. The long tail of data science: a few large data science projects; many smaller data science projects.
6. Part 1. What Instrument Do We Use to Make Big Data Discoveries? How do we build a "datascope?"
8. Another way: opencompute.org. Think of data as big if you measure it in MW, as in Facebook's Prineville Data Center, which is 30 MW.
9. An algorithm and computing infrastructure is "big-data scalable" if adding a rack (or container) of data (and corresponding processors) allows you to do the same computation in the same time but over more data.
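The weak-scaling property described above can be sketched in a few lines. This is an illustrative model, not OSDC code: the per-rack throughput figure is an assumption, and the 950 TB/rack figure is taken from the 2012 rack design later in the deck.

```python
# Illustrative sketch of "big-data scalable": adding racks of data together
# with matching processors keeps the running time flat while total data grows.
# Model: each rack processes its own local share of the data in parallel.

TB_PER_RACK = 950          # data that arrives with each rack (2012 rack design)
TB_PER_HOUR_PER_RACK = 50  # assumed per-rack processing throughput

def completion_time_hours(racks: int) -> float:
    """Time to process all data when every rack scans its own share."""
    total_tb = racks * TB_PER_RACK
    aggregate_throughput = racks * TB_PER_HOUR_PER_RACK
    return total_tb / aggregate_throughput

# Doubling the racks doubles the data processed, but not the time.
assert completion_time_hours(4) == completion_time_hours(8)
```

If the computation were not big-data scalable (say, with a serial merge step whose cost grows with total data), the assertion above would fail, which is exactly the property the slide's definition rules out.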
10. Commercial Cloud Service Provider (CSP): a 15 MW data center.
• Monitoring, accounting and billing
• Network security and forensics
• Customer-facing portal
• Automatic provisioning and infrastructure management
• 100,000 servers, 1 PB DRAM, 100's of PB of disk
• ~1 Tbps egress bandwidth
• 25 operators for 15 MW
• Commercial cloud data center network
11. My vote for a datascope: a (boutique) data center scale facility with a big-data scalable analytic infrastructure. What would a global integrated facility for datascopes look like?
12. Some Examples of Big Data Science (discipline, duration, size, # devices):
• HEP - LHC: 10 years, 15 PB/year*, one device
• Astronomy - LSST: 10 years, 12 PB/year**, one device
• Genomics - NGS: 2-4 years, 0.5 TB/genome, 1000's of devices
*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html
**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
14. The datascope as a Science Cloud Service Provider (Sci CSP): a data scientist working with Sci CSP services.
15. What are some of the important differences between commercial and research-focused Sci CSPs?
16. Science CSP vs. Commercial CSP:
• POV: democratize access to data, integrate data to make discoveries, long-term archive (Science CSP) vs. as long as you pay the bill, and as long as the business model holds (Commercial CSP).
• Data & storage: data-intensive science clouds with computing & high-performance storage vs. Internet-style scale-out and object-based storage.
• Flows: large data flows in and out vs. lots of small web flows.
• Streams: streaming processing required vs. not applicable.
• Accounting: essential for both.
• Lock-in: moving environments between CSPs is essential vs. lock-in is good.
17. Part 2. The Open Cloud Consortium's Open Science Data Cloud
18. The Open Cloud Consortium:
• U.S.-based not-for-profit corporation.
• Manages cloud computing infrastructure to support scientific research: the Open Science Data Cloud.
• Manages cloud computing testbeds: the Open Cloud Testbed.
www.opencloudconsortium.org
19. OCC Members & Partners
• Companies: Cisco, Yahoo!, Citrix, …
• Universities: University of Chicago, Northwestern Univ., Johns Hopkins, Calit2, ORNL, University of Illinois at Chicago, …
• Federal agencies and labs: NASA, LLNL, ORNL
• International partners: AIST (Japan), U. Edinburgh, U. Amsterdam, …
• Partners: National Lambda Rail
20. OCC 2011 Resources (resource, type, comments):
• OSDC Adler & Sullivan: utility cloud, 1248 cores and 0.4 PB disk
• OCC-Y: data cloud, 928 cores and 1.0 PB disk
• OCC-Matsu: mixed, 1 rack
• OSDC Root: storage, 0.8 PB
OCC-Adler, Sullivan & Root will more than double in size in 2012.
22. One Million Genomes
• Sequencing a million genomes would most likely fundamentally change the way we understand genomic variation.
• The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).
• One million genomes is about 1000 PB, or 1 EB.
• With compression, it may be about 100 PB.
• At $1000/genome, the sequencing would cost about $1B.
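The arithmetic behind these figures can be checked directly. Decimal units are assumed (1 PB = 1000 TB, 1 EB = 1000 PB), and the roughly 10x compression ratio is an assumption chosen to match the 100 PB figure on the slide.

```python
# Back-of-the-envelope check of the One Million Genomes figures.
TB_PER_GENOME = 1            # ~1 TB per patient (tumor + normal tissue)
GENOMES = 1_000_000
COMPRESSION_RATIO = 10       # assumed ~10x, matching the slide's 100 PB
COST_PER_GENOME = 1_000      # dollars, at $1000/genome

total_tb = GENOMES * TB_PER_GENOME       # 1,000,000 TB
total_pb = total_tb / 1_000              # 1000 PB
total_eb = total_pb / 1_000              # 1 EB
compressed_pb = total_pb / COMPRESSION_RATIO
total_cost = GENOMES * COST_PER_GENOME

assert total_pb == 1_000 and total_eb == 1   # 1000 PB = 1 EB
assert compressed_pb == 100                  # ~100 PB compressed
assert total_cost == 1_000_000_000           # about $1B
```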
23. Big data driven discovery on 1,000,000 genomes and 1 EB of data: genomic-driven diagnosis, improved understanding of genomic science, and genomic-driven drug development, leading to precision diagnosis and treatment, and preventive health care.
25. UDR
• UDT is a high performance network transport protocol.
• UDR = rsync + UDT.
• It is easy for an average systems administrator to keep 100's of TB of distributed data synchronized.
• We are using it to distribute c. 1 PB from the OSDC.
26. OpenFlow-Enabled Hadoop WG
• When running Hadoop, some map and reduce jobs take significantly longer than others.
• These are stragglers, and they can significantly slow down a MapReduce computation.
• Stragglers are common (a dirty secret about Hadoop).
• Infoblox and UChicago are leading an OCC Working Group on OpenFlow-enabled Hadoop that will provide additional bandwidth to stragglers.
• We have a testbed for a wide area version of this project.
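A toy model shows why stragglers matter: a wave of parallel tasks finishes only when its slowest task does, so one slow task sets the completion time for the whole job, and speeding up just that task (as extra bandwidth from the OpenFlow approach aims to do) pulls the makespan back toward the normal task time. The task times and the 4x speedup below are illustrative assumptions, not measurements from the working group.

```python
# Toy straggler model (illustrative; not the working group's code).
# A wave of parallel MapReduce tasks completes when the slowest task completes.

def makespan(task_times):
    """Completion time of a wave of parallel tasks."""
    return max(task_times)

normal = [10.0] * 99            # 99 tasks take 10 time units each
straggler_time = 50.0           # one straggler takes 5x longer

baseline = makespan(normal + [straggler_time])        # straggler dominates: 50.0

# Suppose added network bandwidth speeds the straggler up 4x.
boosted = makespan(normal + [straggler_time / 4])     # back near normal: 12.5

assert baseline == 50.0
assert boosted == 12.5
```

The point of the model: although the straggler is 1 task out of 100, it alone quintuples the job's completion time, which is why targeting extra bandwidth at stragglers rather than at all tasks is attractive.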
27. OSDC PIRE Project
We select OSDC PIRE Fellows (US citizens or permanent residents):
• We give them tutorials and training on big data science.
• We provide them fellowships to work with OSDC international partners.
• We give them preferred access to the OSDC.
Nominate your favorite scientist as an OSDC PIRE Fellow: www.opensciencedatacloud.org (look for PIRE).
29. Open Science Data Cloud (OSDC)
• Accounting and billing
• Monitoring, compliance, & security
• Customer-facing portal (Tukey)
• Science cloud SW & services
• Automatic provisioning and infrastructure management
• 3 PB in 2011, 10 PB in 2012; able to scale to 100 PB?
• ~100 Gbps bandwidth
• 5-12 operators to operate 1-5 MW
• Science cloud data center network
• OSDC data stack based upon OpenStack, Hadoop, GlusterFS, UDT, …
30. Cloud Services Operations Centers (CSOC)
• The OSDC operates a Cloud Services Operations Center (or CSOC).
• It is a CSOC focused on supporting science clouds for researchers.
• Compare to a Network Operations Center, or NOC.
• Both are an important part of cyberinfrastructure for big data science.
31. OSDC Racks
• How quickly can we set up a rack?
• How efficiently can we operate a rack? (racks/admin)
2012 OSDC rack design (draft):
• 950 TB / rack
• 600 cores / rack
32. Essential Services for a Science CSP
• Support for data intensive computing
• Support for big data flows
• Account management, authentication and authorization services
• Health and status monitoring
• Billing and accounting
• Ability to rapidly provision infrastructure
• Security services, logging, event reporting
• Access to large amounts of public data
• High performance storage
• Simple data export and import services
34. Acknowledgements
Major funding and support for the Open Science Data Cloud (OSDC) is provided by the Gordon and Betty Moore Foundation. This funding is used to support the OSDC-Adler, Sullivan and Root facilities. Additional funding for the OSDC has been provided by the following sponsors:
• The OCC-Y Hadoop Cluster (approximately 1000 cores and 1 PB of storage) was donated by Yahoo! in 2011.
• Cisco provides the OSDC access to the Cisco C-Wave, which connects OSDC data centers with 10 Gbps wide area networks.
• NSF awarded the OSDC a 5-year (2010-2016) PIRE award to train scientists to use the OSDC and to further develop the underlying technology.
• OSDC technology for high performance data transport is supported in part by NSF Award 1127316.
• The StarLight Facility in Chicago enables the OSDC to connect to over 30 high performance research networks around the world at 10 Gbps or higher, with an increasing number of 100 Gbps connections.
The OSDC is managed by the Open Cloud Consortium, a 501(c)(3) not-for-profit corporation. If you are interested in providing funding or donating equipment or services, please contact us at info@opensciencedatacloud.org.
35. For more information
• You can find some more information on my blog: rgrossman.com.
• Some of my technical papers are also available there.
• My email address is robert.grossman at uchicago dot edu.
Center for Research Informatics