Architectures for Data Commons (XLDB 15 Lightning Talk) - Robert Grossman
These are the slides from a 5 minute Lightning Talk that I gave at XLDB 2015 on May 19, 2015 at Stanford. It is based in part on our experiences developing the NCI Genomic Data Commons (GDC).
Using the Open Science Data Cloud for Data Science Research - Robert Grossman
The Open Science Data Cloud is a petabyte scale science cloud for managing, analyzing, and sharing large datasets. We give an overview of the Open Science Data Cloud and how it can be used for data science research.
Adversarial Analytics - 2013 Strata & Hadoop World Talk - Robert Grossman
This is a talk I gave at the Strata Conference and Hadoop World in New York City on October 28, 2013. It describes predictive modeling in the context of modeling an adversary's behavior.
Crossing the Analytics Chasm and Getting the Models You Developed Deployed - Robert Grossman
There are two cultures in data science and analytics: those who develop analytic models and those who deploy analytic models into operational systems. In this talk, we review the life cycle of analytic models and provide an overview of some of the approaches that have been developed for managing analytic models and workflows and for deploying them, including using analytic engines and analytic containers. We give a quick overview of languages for analytic models (PMML) and analytic workflows (PFA). We also describe the emerging discipline of AnalyticOps, which has borrowed some of the techniques of DevOps.
What is Data Commons and How Can Your Organization Build One? - Robert Grossman
This is a talk that I gave at the Molecular Medicine Tri Conference on data commons and data sharing to accelerate research discoveries and improve patient outcomes. It also covers how your organization can build a data commons using the Open Commons Consortium's Data Commons Framework and the University of Chicago's Gen3 data commons platform.
This is a talk that I gave at BioIT World West on March 12, 2019. The talk was titled: A Gen3 Perspective of Disparate Data: From Pipelines in Data Commons to AI in Data Ecosystems.
Some Frameworks for Improving Analytic Operations at Your Company - Robert Grossman
I review three frameworks for analytic operations that are designed to improve the value obtained when deploying analytic models into products, services and internal operations.
Multipleregression covidmobility and Covid-19 policy recommendation - Kan Yuenyong
Multiple regression analysis and Covid-19 policy are a contemporary agenda. The deck demonstrates how to use Python for data wrangling and R for statistical analysis, in a form suitable for publication in a standard academic journal. The model examines whether a lockdown policy is relevant to controlling the Covid-19 outbreak…
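As a rough illustration of the kind of analysis the abstract describes (Python for data wrangling, a regression relating mobility and lockdown policy to case growth), here is a minimal sketch in Python. The variable names and the synthetic data are assumptions for illustration only, not the author's actual model or data.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Synthetic stand-in data: a real analysis would use observed mobility,
    # lockdown status, and Covid-19 case-growth figures.
    rng = np.random.default_rng(0)
    n = 200
    df = pd.DataFrame({
        "mobility_change": rng.normal(-20, 10, n),  # % change vs. pre-pandemic baseline
        "lockdown": rng.integers(0, 2, n),          # 1 = lockdown in force, 0 = not
    })
    df["case_growth"] = (0.05 * df["mobility_change"]
                         - 1.5 * df["lockdown"]
                         + rng.normal(0, 1, n))

    # Multiple regression: does lockdown policy explain case growth
    # once mobility is controlled for?
    X = sm.add_constant(df[["mobility_change", "lockdown"]])
    print(sm.OLS(df["case_growth"], X).fit().summary())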
These are the slides from a plenary panel that I participated in at IEEE Cloud 2011 on July 5, 2011 in Washington, D.C. I discussed the Open Science Data Cloud and concluded the talk with three research questions.
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ... - Geoffrey Fox
Keynote at the Sixth International Workshop on Cloud Data Management (CloudDB 2014), Chicago, March 31, 2014.
Abstract: We introduce the NIST collection of 51 use cases and describe their scope over industry, government and research areas. We look at their structure from several points of view or facets, covering problem architecture, analytics kernels, micro-system usage such as flops/bytes, application class (GIS, expectation maximization) and, very importantly, data source.
We then propose that in many cases it is wise to combine the well-known commodity best-practice (often Apache) Big Data Stack (with ~120 software subsystems) with high performance computing technologies.
We describe this and give early results based on clustering running with different paradigms.
We identify key layers where HPC-Apache integration is particularly important: file systems, cluster resource management, file and object data management, inter-process and thread communication, analytics libraries, workflow and monitoring.
See
[1] A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures, Shantenu Jha, Judy Qiu, Andre Luckow, Pradeep Mantha and Geoffrey Fox, accepted in IEEE BigData 2014, available at: http://arxiv.org/abs/1403.1528
[2] High Performance High Functionality Big Data Software Stack, G Fox, J Qiu and S Jha, in Big Data and Extreme-scale Computing (BDEC), 2014. Fukuoka, Japan. http://grids.ucs.indiana.edu/ptliupages/publications/HPCandApacheBigDataFinal.pdf
From Vaccine Management to ICU Planning: How CRISP Unlocked the Power of Data... - Databricks
Chesapeake Regional Information System for our Patients (CRISP) is a nonprofit healthcare information exchange (HIE) whose customers include states like Maryland and healthcare providers such as Johns Hopkins. CRISP’s work supports the local healthcare community by securely sharing the kind of data that facilitates care and improves health outcomes.
When the pandemic started, the Maryland Department of Health reached out to CRISP with a request: Get us the demographic data we need to track COVID-19 and proactively support our communities. As a result, CRISP employees spent long hours attempting to handle multiple data sources with complex data enrichment processes. To automate these requests, CRISP partnered with Slalom to build a data platform powered by Databricks and Delta Lake.
Using the power of the Databricks Lakehouse platform and the flexibility of Delta Lake, Slalom helped CRISP provide the Maryland Department of Health with near real-time reporting of key COVID-19 measures. With this information, Maryland has been able to track the path of the pandemic, target the locations of new testing sites, and ultimately improve access for vulnerable communities.
The work did not stop there—once CRISP’s customers saw the value of the platform, more requests started coming in. Now, nearly one year since the platform was created, CRISP has processed billions of records from hundreds of data sources in an effort to combat the pandemic. Notable outcomes from the work include hourly contact tracing with data already cross-referenced for individual risk factors, automated reporting on COVID-19 hospitalizations, real-time ICU capacity reporting for EMTs, tracking of COVID-19 patterns in student populations, tracking of the vaccination campaign, connecting Maryland MCOs to vulnerable people who need to be prioritized for the vaccine, and analysis of the impact of COVID-19 on pregnancies.
High Performance Data Analytics and a Java Grande Run Time - Geoffrey Fox
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development.
However, the same is not so true for data-intensive computing, even though commercial clouds devote many more resources to data analytics than supercomputers devote to simulations.
Here we use a sample of over 50 big data applications to identify characteristics of data intensive applications and to deduce needed runtime and architectures.
We propose a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks.
Our analysis builds on the Apache software stack that is well used in modern cloud computing.
We give some examples including clustering, deep-learning and multi-dimensional scaling.
One suggestion from this work is the value of a high performance Java (Grande) runtime that supports both simulations and big data.
Role of Data Accessibility During Pandemic - Databricks
This talk focuses on the importance of data access and how crucial it is to have granular data available in the open-source space, as it helps researchers and data teams fuel their work.
We present the research conducted by the DS4C (Data Science for Covid-19) team, who made a large and detailed set of South Korea Covid-19 data available to the wider community. The DS4C dataset was one of the most impactful datasets on Kaggle, with over fifty thousand cumulative downloads and 300 unique contributors. What makes the DS4C dataset so potent is the sheer amount of data collected for each patient. The Korean government has been collecting and releasing patient information with unprecedented levels of detail. The data released includes infected people’s travel routes, the public transport they took, and the medical institutions that are treating them. This extremely fine-grained detail is what makes the DS4C dataset valuable, as it makes it easier for researchers and data scientists to identify trends and gather more evidence to support hypotheses, track down causes, and gain additional insights. We will cover the data challenges and the impact that making this data available on a public forum had on the community, and conclude with an insightful visual representation.
Practical Methods for Identifying Anomalies That Matter in Large Datasets - Robert Grossman
Robert L. Grossman, Practical Methods for Identifying Anomalies That Matter in Large Datasets, O’Reilly, Strata + Hadoop World, San Jose, California, February 20, 2015.
In this presentation we will see why, according to several sources, our work is becoming more and more risk-prone.
While the sources agree about the causes of this tendency, they also agree on the solution: experimentation.
Experiments can go wrong, and therefore our duty is not just "to be agile" but to lower the cost of failed experiments.
In conclusion, we will see three ideas (TrialSourcing, product management and the #NoEstimates movement), all leveraging experiments to gain organizational advantages.
The Matsu Project - Open Source Software for Processing Satellite Imagery Data - Robert Grossman
The Matsu Project is an Open Cloud Consortium project that is developing open source software for processing satellite imagery data using Hadoop, OpenStack and R.
Time to Science/Time to Results: Transforming Research in the Cloud - Amazon Web Services
This session demonstrates how the cloud can accelerate breakthroughs in scientific research by providing on-demand access to powerful computing. You will gain insight into how scientific researchers are using the cloud to solve complex science, engineering, and business problems that require high bandwidth, low latency networking and very high compute capabilities. You will hear how leveraging the cloud reduces the costs and time to conduct large scale, worldwide collaborative research. Researchers can then access computational power, data storage, supercomputing resources, and data sharing capabilities in a cost-efficient manner without implementation delays. Disease research can be accomplished in a fraction of the time, and innovative researchers in small schools or distant corners of the world have access to the same computing power as those at major research institutions by leveraging Amazon EC2, Amazon S3, optimizing C3 instances and more to increase collaboration. This session will provide best practices and insight from the UC Berkeley AMP Lab on the services used to connect disparate sets of data to drive meaningful new insight and impact.
Accelerating Time to Science: Transforming Research in the Cloud - Jamie Kinney
Researchers working on projects ranging from work at individual labs to some of the world's largest scientific investigations are using AWS to accelerate the pace of scientific discovery and ask questions that were previously impossible to explore. This talk explains why scientists are using Amazon Web Services and showcases a range of real-world examples.
Evolving Storage and Cyber Infrastructure at the NASA Center for Climate Simu... - inside-BigData.com
Ellen Salmon from NASA gave this talk at the 2017 MSST conference. "This talk will describe recent developments at the NASA Center for Climate Simulation, which is funded by NASA’s Science Mission Directorate, and supports the specialized data storage and computational needs of weather, ocean, and climate researchers, as well as astrophysicists, heliophysicists, and planetary scientists. To meet requirements for higher-resolution, higher-fidelity simulations, the NCCS augments its High Performance Computing and storage/retrieval environment. As the petabytes of model and observational data grow, the NCCS is broadening data services offerings and deploying and expanding virtualization resources for high performance analytics."
Watch the video: http://wp.me/p3RLHQ-gPj
Learn more: https://www.nccs.nasa.gov/ and http://storageconference.us/
Making Earth observation data available by using Amazon S3 is accelerating scientific discovery and enabling the creation of new products. Attend and learn how the scale and performance of Amazon S3 lets earth scientists, researchers, startups, and GIS professionals gather and analyse planetary-scale data without worrying about limitations of bandwidth, storage, memory, or processing power. Co-presented with support of the Australian Geoscience Data Cube collaboration, DigitalGlobe’s Geospatial Big Data Platform and the developer of the popular ObservedEarth mobile app.
Speakers:
Craig Lawton, Public Sector Solutions Architect, Amazon Web Services
Lachlan Hurst, Observed Earth
Matt Paget, Senior Experimental Scientist, CSIRO
Dan Getman, Digital Globe
This is a talk titled "Cloud-Based Services For Large Scale Analysis of Sequence & Expression Data: Lessons from Cistrack" that I gave at CAMDA 2009 on October 6, 2009.
Distributed Near Real-Time Processing of Sensor Network Data Flows for Smart ... - Otávio Carvalho
Work presented in partial fulfillment of the requirements for the degree of Bachelor in Computer Science - Federal University of Rio Grande do - Brazil.
The Discovery Cloud: Accelerating Science via Outsourcing and Automation - Ian Foster
Director's Colloquium at Los Alamos National Laboratory, September 18, 2014.
We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to leverage the “cloud” (whether private or public) to achieve economies of scale and reduce cognitive load. In this talk, I explore the past, current, and potential future of large-scale outsourcing and automation for science.
High Performance Computing and Big Data - Geoffrey Fox
We propose a hybrid software stack with Large scale data systems for both research and commercial applications running on the commodity (Apache) Big Data Stack (ABDS) using High Performance Computing (HPC) enhancements typically to improve performance. We give several examples taken from bio and financial informatics.
We look in detail at parallel and distributed run-times including MPI from HPC and Apache Storm, Heron, Spark and Flink from ABDS stressing that one needs to distinguish the different needs of parallel (tightly coupled) and distributed (loosely coupled) systems.
We also study "Java Grande" or the principles to use to allow Java codes to perform as fast as those written in more traditional HPC languages. We also note the differences between capacity (individual jobs using many nodes) and capability (lots of independent jobs) computing.
We discuss how this HPC-ABDS concept allows one to discuss convergence of Big Data, Big Simulation, Cloud and HPC Systems. See http://hpc-abds.org/kaleidoscope/
Similar to What is a Data Commons and Why Should You Care? (20)
Opendatabay - Open Data Marketplace.pptx - Opendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... - pchutichetpong
M Capital Group (“MCG”) expects to see demand grow and supply evolve, facilitated by institutional investment rotating out of offices and into work from home (“WFH”), alongside the ever-expanding need for data storage as global internet usage expands, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
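A minimal sketch of the first idea above (skipping computation on vertices whose ranks have already converged); this assumes a plain power-iteration PageRank over an adjacency list, ignores dangling-node mass, and is not the full STICD algorithm.

    import numpy as np

    def pagerank_skip_converged(adj, d=0.85, tol=1e-10, max_iter=100):
        # adj[u] is the list of vertices that u links to.
        n = len(adj)
        out_deg = np.array([max(len(nbrs), 1) for nbrs in adj], dtype=float)
        incoming = [[] for _ in range(n)]
        for u, nbrs in enumerate(adj):
            for v in nbrs:
                incoming[v].append(u)
        rank = np.full(n, 1.0 / n)
        active = np.ones(n, dtype=bool)       # vertices still being updated
        for _ in range(max_iter):
            if not active.any():
                break
            contrib = rank / out_deg
            new_rank = rank.copy()
            for v in np.flatnonzero(active):  # skip vertices that have converged
                new_rank[v] = (1 - d) / n + d * sum(contrib[u] for u in incoming[v])
            active = np.abs(new_rank - rank) > tol
            rank = new_rank
        return rank

    # Example: a 4-vertex graph.
    print(pagerank_skip_converged([[1, 2], [2], [0], [0, 2]]))

In practice a "converged" vertex may need to be re-activated when its in-neighbours change again; that bookkeeping (and the handling of dangling-node mass) is omitted here to keep the sketch short.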
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
As Europe's leading economic powerhouse and the fourth-largest #economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like #Russia and #China, #Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in #cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to #AdvancedPersistentThreats (#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
4. We have a problem… The commoditization of sensors is creating an explosive growth of data. It can take weeks to download large geospatial datasets. Analyzing the data is more expensive than producing it. There is not enough funding for every researcher to house all the data they need.
5. Data Commons: Data commons co-locate data, storage and computing infrastructure, and commonly used tools for analyzing and sharing data to create a resource for the research community. Source: interior of one of Google's data centers, www.google.com/about/datacenters/
6. The Tragedy of the Commons: Individuals, when they act independently following their self-interests, can deplete a common resource, contrary to a whole group's long-term best interests. (Garrett Hardin) Source: Garrett Hardin, The Tragedy of the Commons, Science, Volume 162, Number 3859, pages 1243-1248, 13 December 1968.
7. www.opencloudconsortium.org
• U.S.-based not-for-profit corporation.
• Manages cloud computing infrastructure to support scientific research: Open Science Data Cloud, Project Matsu, & OCC NOAA Data Commons.
• Manages cloud computing infrastructure to support medical and health care research: Biomedical Commons Cloud.
• Manages cloud computing testbeds: Open Cloud Testbed.
8. What Scale?
• New data centers are sometimes divided into “pods,” which can be built out as needed.
• A reasonable scale for what is needed for a commons is one of these pods (“cyberpod”).
• Let's use the term “datapod” for the analytic infrastructure that scales to a cyberpod.
• Think of it as the scale-out of a database. [Diagram: Pod A, Pod B]
10. Core Data Commons Services
• Digital IDs
• Metadata services
• High performance transport
• Data export
• Pay for compute, with images/containers containing commonly used tools, applications and services, specialized for each research community
11. [Diagram: Cloud 1, Cloud 2, Cloud 3, Data Commons 1, Data Commons 2] Commons provide data to other commons and to clouds. Research projects produce data; research scientists at research centers A, B and C download data; the community develops open source software stacks for commons and clouds.
12. [Chart: GB / TB / PB of data vs. W / KW / MW of power, spanning memory databases, datapods and cyberpods] Complex statistical models over small data that are highly manual and update infrequently, versus simpler statistical models over large data that are highly automated and update frequently.
13. Is More Different? Do New Phenomena Emerge at Scale in Biomedical Data? Source: P. W. Anderson, More is Different, Science, Volume 177, Number 4047, 4 August 1972, pages 393-396.
16. • Public-private data collaborative announced April 21, 2015 by Secretary of Commerce Pritzker.
• AWS, Google, Microsoft and the Open Cloud Consortium will form four collaborations.
17.
18. OSDC Commons Architecture
• Object storage (permanent)
• Scalable lightweight workflow
• Community data products (data harmonization)
• Data submission portal
• Open APIs for data access and data access portal
• Co-located “pay for compute”
• Digital ID Service & Metadata Service
• DevOps supporting virtual machines and containers
20. What is the Project Matsu? Matsu is an open source project for processing satellite imagery to support earth sciences researchers using a data commons. Matsu is a joint project between the Open Cloud Consortium and NASA's EO-1 Mission (Dan Mandl, Lead).
23. Flood/Drought Dashboard Examples
• GeoSocial API Consumer embedded in Dashboard
• Initial crowdsourcing functionality (pictures, GPS features and water edge locations)
• GeoSocial API used to discover Radarsat product in area (user can see registration error)
24. 1. The Open Science Data Cloud (OSDC) stores Level 0 data from EO-1 and uses an OpenStack-based cloud to create Level 1 data. 2. The OSDC also provides OpenStack resources for the Namibia Flood Dashboard developed by Dan Mandl's team. 3. Project Matsu uses a Hadoop/Accumulo system to run analytics nightly and to create tiles with an OGC-compliant WMTS.
25. [Chart: amount of data retrieved vs. number of queries] mashup, re-analysis, “wheel”; row-oriented vs. column-oriented; done by staff vs. self-service by the community.
26. Matsu Hadoop Architecture [Architecture diagram]
• Hadoop HDFS: Level 0, Level 1 and Level 2 images; MapReduce used to process Level n to Level n+1 data and to partition images for different zoom levels.
• NoSQL database (Accumulo): storage for WMTS tiles and derived data products; images at different zoom layers suitable for an OGC Web Mapping Server.
• Analytic services: NoSQL-based, streaming and MR-based analytic services.
• Presentation services: Matsu Web Map Tile Service, Matsu MR-based Tiling Service, Web Coverage Processing Service (WCPS).
• Workflow services.
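As a rough sketch of the tiling step described above (partitioning imagery into tiles at different zoom levels for a web map tile service), the standard Web-Mercator tile arithmetic is shown below; this is generic slippy-map math, not code taken from Project Matsu.

    import math

    def tile_index(lat_deg, lon_deg, zoom):
        # Map a latitude/longitude to (x, y) tile indices at a given zoom level,
        # using the common Web-Mercator tiling scheme (2**zoom tiles per axis).
        n = 2 ** zoom
        x = int((lon_deg + 180.0) / 360.0 * n)
        lat_rad = math.radians(lat_deg)
        y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
        return x, y

    # Example: which zoom-10 tile covers a point in central Namibia?
    print(tile_index(-22.56, 17.08, 10))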
34. Matsu Wheel Spectral Anomaly Detector
§ “Contours and Clusters” – looks for physical contours around
spectral clusters
§ PCA analysis applied to the set of reflectivity values (spectra) for
every pixel, and the top 5 components are extracted for further
analysis.
35. Matsu Wheel Spectral Anomaly Detector
§ “Contours and Clusters” – looks for physical contours around
spectral clusters
§ PCA analysis applied to the set of reflectivity values (spectra) for
every pixel, and the top 5 components are extracted for further
analysis.
§ Pixels are clustered in the transformed 5-D spectral space using a
k-means clustering algorithm.
§ For each image, k = 50 spectral clusters are formed and ranked
from most to least extreme using the Mahalanobis distance of the
cluster from the spectral center.
§ For each spectral cluster, adjacent pixels are grouped together into
contiguous objects.
→ Returns geographic regions of spectral anomalies that are scored again as anomalous (0 least, 1000 most) compared to a set of
“normal” spectra, constructed for comparison over a baseline of time
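A minimal sketch of the pipeline described on this slide (PCA to five components, k-means with k = 50, clusters ranked by Mahalanobis distance from the spectral center), using scikit-learn and NumPy; the array shapes and parameter defaults are assumptions, and this is not the Matsu Wheel code base.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    def rank_spectral_clusters(spectra, n_components=5, k=50):
        # spectra: (n_pixels, n_bands) reflectivity values, one row per pixel.
        X = PCA(n_components=n_components).fit_transform(spectra)
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        # Mahalanobis distance of each cluster center from the spectral center.
        center = X.mean(axis=0)
        cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))
        diff = km.cluster_centers_ - center
        mahal = np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))
        order = np.argsort(mahal)[::-1]  # clusters ranked most to least extreme
        return km.labels_, order, mahal

    # Toy example with random "spectra" (5000 pixels, 10 bands).
    labels, order, scores = rank_spectral_clusters(np.random.rand(5000, 10))
    print(order[:5], scores[order[:5]])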
37. Matsu Wheel Land Cover Classifier
§ Support Vector Machine supervised classifier (uses Python's scikit-learn)
§ “Training set” constructed from a variety of scenes with a range of locations, time of year, sun angle
§ Cloud, Desert/Dry Land, Water, Vegetation are manually (visually) classified using RGB image
→ Returns classified image
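A minimal sketch of the classifier on this slide (an SVM trained on manually labeled pixels, using scikit-learn); the class names follow the slide, but the feature layout and data here are synthetic placeholders, not the Matsu Wheel implementation.

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    classes = ["cloud", "desert_dry_land", "water", "vegetation"]

    # Hypothetical training set: per-pixel spectra from scenes that were
    # manually (visually) labeled using the RGB image.
    rng = np.random.default_rng(0)
    X_train = rng.random((400, 10))          # 400 labeled pixels, 10 bands
    y_train = rng.choice(classes, size=400)

    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    clf.fit(X_train, y_train)

    # Classify every pixel of a new scene and reshape back to the image grid.
    scene = rng.random((100 * 100, 10))
    classified_image = clf.predict(scene).reshape(100, 100)
    print(classified_image[0, :5])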
40. [Chart: amount of data retrieved vs. number of queries vs. number of sites] download data vs. data peering.
41. Data Peering [Diagram: Cloud 1, Data Commons 1, Data Commons 2]
• Tier 1 Commons exchange data for the research community at no charge.
42. Three Requirements. Two Research Data Commons with a Tier 1 data peering relationship agree as follows:
1. To transfer research data between them at no cost.
2. To peer with at least two other Tier 1 Research Data Commons at 10 Gbps or higher.
3. To support Digital IDs (of a form to be determined by mutual agreement) so that a researcher using infrastructure associated with one Tier 1 Research Data Commons can access data transparently from any of the Tier 1 Research Data Commons that hold the desired data.
44. The 5P Challenges
• Permanent objects with Digital IDs
• Cyber Pods with scalable storage and analytics
• Data Peering
• Portable data
• Support for Pay for compute
45. Challenge 1: Permanent Secure Objects
• How do I assign Digital IDs and key metadata to “controlled access” data objects and collections of data objects to support distributed computation of large datasets by communities of researchers?
– Metadata may be both public and controlled access
– Objects must be secure
• Think of this as a “DNS for data.”
• The test: one commons serving the earth science community can transfer 1 PB of data files to another commons and no data scientist needs to change their code.
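To make the “DNS for data” analogy concrete, here is a toy sketch of a digital-ID resolver that maps an ID to the locations and public metadata of an object held by one or more commons; the interface and the example ID are hypothetical, not an OSDC or GDC API.

    class DigitalIdResolver:
        # Toy in-memory "DNS for data": a digital ID resolves to the locations
        # where the object is held, plus its public metadata.
        def __init__(self):
            self._records = {}

        def register(self, digital_id, locations, metadata=None):
            self._records[digital_id] = {"locations": list(locations),
                                         "metadata": dict(metadata or {})}

        def resolve(self, digital_id):
            record = self._records.get(digital_id)
            if record is None:
                raise KeyError("unknown digital ID: " + digital_id)
            return record["locations"], record["metadata"]

    resolver = DigitalIdResolver()
    resolver.register("id:eo1-scene-0001",
                      ["s3://commons-a/eo1/scene-0001.tif",
                       "https://commons-b.example.org/objects/eo1-scene-0001"],
                      {"mission": "EO-1", "level": 1})
    print(resolver.resolve("id:eo1-scene-0001"))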
46. Challenge 2: Cyber Pods and Datapods
• How can I add a rack of computing/storage/networking equipment to a cyber pod (that has a manifest) so that:
– After attaching to power
– After attaching to network
– No other manual configuration is required
– The data services can make use of the additional infrastructure
– The compute services can make use of the additional infrastructure
• In other words, we need an open source software stack that scales to cyberpods and data analysis that scales to datapods.
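A toy sketch of the “manifest” idea in Challenge 2: when a rack is attached to power and network, its manifest is read and its storage and compute capacity are added to the pod inventory with no further manual configuration. The manifest fields below are hypothetical, not a specification from the talk.

    def register_rack(pod_inventory, manifest):
        # Add every node declared in the rack manifest to the pod inventory,
        # grouped by role, so data and compute services can start using it.
        for node in manifest["nodes"]:
            pod_inventory.setdefault(node["role"], []).append({
                "host": node["host"],
                "cores": node.get("cores", 0),
                "storage_tb": node.get("storage_tb", 0),
            })
        return pod_inventory

    pod = {}
    rack_manifest = {"nodes": [
        {"host": "rack7-n01", "role": "compute", "cores": 64},
        {"host": "rack7-s01", "role": "storage", "storage_tb": 480},
    ]}
    register_rack(pod, rack_manifest)
    print(pod)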
47. Challenge 3: Data Peering
• How can a critical mass of data commons support data peering so that a researcher at one of the commons can transparently access data managed by one of the other commons?
– We need to access data independent of where it is stored
– “Tier 1 data commons” need to pass community managed data at no cost
– We need to be able to transport large data efficiently “end to end” between commons
48. Challenge 4: Data Portability
• We need an “Indigo Button” to move our data between two commons that peer.
49. Challenge 5: Pay for Compute Challenges – Low Cost Data Integration
• Commons should support a “free storage for research data, pay for compute” model, perhaps with “chits” available to researchers.
• Today, we by and large integrate data with graduate students and technical staff.
• How can two datasets from two different commons be “joined” at “low cost”?
– Linked data
– Controlled vocabularies
– Dataspaces
– Universal correlation keys
– Statistical methods
50. [Timeline] 2000: collect data and distribute files via DAAC and apply data mining. 2010-2015: make data available via open APIs and apply data science. 2020: operate data commons & support data peering.