A single-slide image of the data lake and data estate proposed to serve the data storage and access requirements of the Washington State Department of Health
Pfizer's HR division runs a massive data warehouse that averages over 1 million queries per day from over 3,000 users. They needed to cut data costs and mitigate regulatory and audit risks.
See how they met their goals, and reduced extract, transform and load (ETL) times, using Appfluent!
Seven Ways DOS™ Simplifies the Complexities of Healthcare IT - Health Catalyst
Health Catalyst Data Operating System (DOS™) is a revolutionary architecture that addresses the digital and data problems confronting healthcare now and in the future. It is an analytics galaxy that encompasses data platforms, machine learning, analytics applications, and the fabric to stitch all these components together.
DOS addresses these seven critical areas of healthcare IT:
Healthcare data management and acquisition
Integrating data in mergers and acquisitions
Enabling a personal health record
Scaling existing, homegrown data warehouses
Ingesting the human health data ecosystem
Providers becoming payers
Extending the life and current value of EHR investments
This white paper illustrates these healthcare system needs in detail and explains the attributes of DOS. Read how DOS is the right technology for tackling healthcare’s big issues, including big data, physician burnout, rising healthcare expenses, and the productivity backfire created by other healthcare technologies.
It is indeed boom time for Big Data in healthcare. According to CB Insights, Big Data startups garnered USD 400M in investor funding in the first half of 2014, compared with USD 133M in the whole of 2013.
Baptist Health: Solving Healthcare Problems with Big Data - MapR Technologies
Editor’s Note: Download the complimentary MapR Guide to Big Data in Healthcare for more information: https://mapr.com/mapr-guide-big-data-healthcare/
There is no better example of the important role that data plays in our lives than in matters of our health and our healthcare. There’s a growing wealth of health-related data out there, and it’s playing an increasing role in improving patient care, population health, and healthcare economics.
Join this webinar to hear how Baptist Health is using big data and advanced analytics to address a myriad of healthcare challenges, from patient to payer, through their consumer-centric approach.
MapR Technologies will cover broader big data healthcare trends and production use cases that demonstrate how to converge data and compute power to deliver data-driven healthcare applications.
These slides use concepts from my (Jeff Funk) course, Analyzing Hi-Tech Opportunities, to analyze how Big Data is becoming economically feasible for health care. They describe how the costs of sensors, data processing, data storage, and data analysis are falling, how new and better forms of storage and algorithms are being implemented, and what this means for sustainable health care. These changes are enabling a move toward personalized health care.
This presentation is a reflection of my vision of how Big Data impacts healthcare and of the efforts that Oracle and VX Healthcare Analytics put into making Big Data work in the patient profiling space.
BIG Data & Hadoop Applications in Healthcare - Skillspeed
Explore the applications of BIG Data & Hadoop in Healthcare via Skillspeed.
BIG Data & Hadoop in Healthcare is a key differentiator, especially in terms of providing superior patient care. They are used for optimizing clinical trials, detecting disease, and boosting healthcare profitability.
To get more details regarding BIG Data & Hadoop, please visit - www.SkillSpeed.com
A brief tutorial on Big Data and its applications to healthcare. The discussion is centered around technical aspects related to this method of computing rather than concrete examples of its use in medical practice.
This presentation looks at the role of Big Data in healthcare. Healthcare is a big spending area for both the private and public sectors, so it is important to look at ways to improve the delivery of care to patients.
Big Data in Healthcare Made Simple: Where It Stands Today and Where It’s Going - Health Catalyst
Health system leaders have questions about big data: When will I need it? How should I prepare? What’s the best way to use it? It’s important to separate the hype of big data from the reality. Where big data stands in healthcare today is a far cry from where it will be in the future. Right now, the best use cases are in academic- or research-focused healthcare institutions. Most healthcare organizations are still tackling issues with their transactional databases and learning how to use those databases effectively. But soon—once the issues of expertise and security have been addressed—big data will play a huge role in care management, predictive analytics, prescriptive analytics, and genomics for everyday patients. The transition to big data will be easier if health systems adopt a late-binding approach to the data now.
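To make the late-binding idea concrete, here is a minimal, hypothetical Python sketch (the data and rule set are invented): raw source values are stored unchanged, and business rules are bound only when the data is used, so a rule change never forces a reload.

```python
# Hypothetical illustration of "late binding": raw values are stored as-is,
# and business-rule vocabularies are applied only when data is consumed.

RAW_ENCOUNTERS = [
    {"patient_id": 1, "source_code": "E11.9"},   # stored exactly as received
    {"patient_id": 2, "source_code": "I10"},
]

# The binding (rule set) lives outside the stored data and can change freely.
DIABETES_CODES = {"E11.9", "E10.9"}

def bind_cohort(rows, code_set):
    """Apply a business rule at read time instead of load time."""
    return [r["patient_id"] for r in rows if r["source_code"] in code_set]

print(bind_cohort(RAW_ENCOUNTERS, DIABETES_CODES))  # -> [1]
```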
A must-see for every executive who has a Big Data Hadoop cluster, and for their staff: getting your big data house in order.
Misalignment and clutter waste much of the precious time needed for critical decisions.
Carl Kesselman and I (along with our colleagues Stephan Erberich, Jonathan Silverstein, and Steve Tuecke) participated in an interesting workshop at the Institute of Medicine on July 14, 2009. Along with Patrick Soon-Shiong, we presented our views on how grid technologies can help address the challenges inherent in healthcare data integration.
Standards metadata management - version control and its governance - Kevin Lee
Over the past decade, CDISC standards have been widely accepted and implemented in clinical research. The FDA’s final “Guidance for Industry on electronic submission” mandates that submission data conform to CDISC standards, including SDTM, ADaM, and SEND. Life sciences organizations therefore need to ensure that submission data comply with regulatory standards (e.g., CDISC and eCTD). One of the biggest challenges organizations face, however, is the evolution of standards, which leads to multiple versions of each standard. The presentation will discuss how organizations manage the different versions of industry standards and company standards, and will introduce governance for metadata management.
Standards governance simply means “do the right things” in standards implementation and management. The presentation will discuss how life sciences organizations can better fulfill their goals for standards implementation and management using governance. It will also discuss the main aspects of data governance from the CDISC standards perspective, addressing the roles of people (e.g., requestor, developer, and approver), processes (e.g., the workflow of requesting, developing, and approving), and technology (e.g., spreadsheets, SharePoint, and an MDR).
Metadata becomes alive via a web service between MDR and SAS - Kevin Lee
Life science organizations use a Metadata Repository (MDR) to manage and govern metadata, which can be used to generate artifacts such as SDTM, ADaM, and TFLs. To develop those artifacts, the metadata in the MDR needs to be accessed by analytic systems such as SAS. The presentation will show how an MDR and SAS can exchange metadata over the internet. It will introduce the basic concepts of a web service and the Simple Object Access Protocol (SOAP), show how SAS can receive metadata from the MDR using SOAP and convert the SOAP XML response file to SAS datasets, and introduce the SOAP XML request/response files, the SOAPWEB function, and XMLMAP.
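The presentation itself works in SAS; as a language-neutral sketch of the same exchange, the following Python snippet posts a SOAP request and walks the XML response. The endpoint, namespace, and element names are invented placeholders, not the real MDR interface.

```python
# Hypothetical sketch of a SOAP metadata request to an MDR endpoint.
# URL, namespace, and element names are placeholders, not a real MDR API.
import requests
import xml.etree.ElementTree as ET

SOAP_BODY = """<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetMetadata xmlns="http://example.org/mdr">
      <Standard>SDTM</Standard>
      <Version>1.4</Version>
    </GetMetadata>
  </soap:Body>
</soap:Envelope>"""

resp = requests.post(
    "https://mdr.example.org/service",            # placeholder endpoint
    data=SOAP_BODY,
    headers={"Content-Type": "text/xml; charset=utf-8"},
    timeout=30,
)
resp.raise_for_status()

# Walk the response envelope and pull out metadata elements, the same step
# SAS performs with an XMLMAP when reading the SOAP XML response.
root = ET.fromstring(resp.text)
for var in root.iter("{http://example.org/mdr}Variable"):
    print(var.attrib)
```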
Achieving Medical Imaging Interoperability with PACS and RIS Integrations - Chetu
Medical images can be difficult to access; however, the adoption of technology solutions, industry standards, and APIs is improving imaging interoperability.
https://www.slideshare.net/chetuInc/why-dicom-matters-for-your-ehr-and-medical-imaging-apps
Data centric SDLC for automated clinical data development - Kevin Lee
Many life science organizations have been building systems to automate clinical data development (e.g., SDTM and ADaM). Such systems are considered IT products and go through a typical system development life cycle (SDLC): requirements, analysis, design, programming, testing, and implementation. However, the SDLC was initially developed for systems that automate business processes, not data development. So the question naturally arises: if life science organizations develop systems to automate data development, should those systems still be developed with a process-centric SDLC, and will the current process-centric SDLC satisfy the business need? The presentation will introduce a data-centric SDLC. First, it will discuss how some steps of the typical process-centric SDLC should be modified and adjusted in a data-centric SDLC; for example, testing the system requires quality assurance of the target data, and, due to the unpredictability of source data, maintenance and system updates will be required after implementation. Second, it will introduce additional steps and approaches for a data-centric system development process, such as data profiling and compliance.
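As a rough illustration of the data-profiling step such a data-centric SDLC adds, here is a small self-contained pandas sketch; the dataset and checks are invented, with USUBJID standing in for a CDISC-style subject identifier.

```python
# Minimal data-profiling pass of the kind a data-centric SDLC runs
# before and after each load. The dataset here is invented.
import pandas as pd

df = pd.DataFrame({
    "USUBJID": ["001-001", "001-002", "001-003"],
    "AGE": [34, None, 57],
    "ARM": ["Placebo", "Active", "Active"],
})

# Profile every column: type, missingness, and cardinality.
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "nulls": df.isna().sum(),
    "null_pct": (df.isna().mean() * 100).round(2),
    "distinct": df.nunique(),
})
print(profile)

# Simple compliance-style assertions on the target data.
assert df["USUBJID"].notna().all(), "Missing subject identifiers"
assert df["USUBJID"].is_unique, "Duplicate subject identifiers"
```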
Whitepaper: The Bridge From PACS to VNA: Scale-Out Storage - EMC
This whitepaper discusses how a vendor-neutral archive (VNA) for image archive and management requires a phased storage approach due to the capital and operational expenditures involved. The EMC Isilon scale-out approach provides a simple, predictable, and manageable path from PACS (Picture Archiving and Communications System) to VNA.
Kaizentric is a data analytics firm based in Chennai, India. Statistical analysis is performed on a well-built, client-specific data warehouse, supported by data mining.
Automatic Data Reconciliation, Data Quality, and Data Observability.pdf - 4dalert
Automating Data Reconciliation, Data Observability, and Data Quality Check After Each Data Load, read more: https://medium.com/@nihar.rout_analytics/automatic-data-reconciliation-data-quality-and-data-observability-3eeca4650cd
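In the spirit of the linked article, a minimal post-load reconciliation check can compare row counts and a numeric checksum between source and target. This self-contained sketch uses SQLite in place of real warehouse connections, and the table names are invented.

```python
# Sketch of an automatic post-load reconciliation check: row count and a
# numeric checksum must match between source and target tables.
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real source/target connections
conn.executescript("""
    CREATE TABLE staging_orders (id INTEGER, amount REAL);
    CREATE TABLE dw_orders      (id INTEGER, amount REAL);
    INSERT INTO staging_orders VALUES (1, 10.0), (2, 32.5);
    INSERT INTO dw_orders      VALUES (1, 10.0), (2, 32.5);
""")

def recon(conn, source_table, target_table, amount_col):
    """Compare row count and a SUM() checksum between two tables."""
    checks = {}
    for label, table in (("source", source_table), ("target", target_table)):
        row = conn.execute(
            f"SELECT COUNT(*), COALESCE(SUM({amount_col}), 0) FROM {table}"
        ).fetchone()
        checks[label] = row
    ok = checks["source"] == checks["target"]
    print(f"source={checks['source']} target={checks['target']} match={ok}")
    return ok

if not recon(conn, "staging_orders", "dw_orders", "amount"):
    raise SystemExit("Reconciliation failed: halt downstream jobs")
```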
This is part 3 of the series on Data Mesh, looking at the intersection of microservices architecture concepts, data integration/replication technologies, and log-based stream integration techniques. This webinar was mostly a demonstration, but several slides used to set up the demo are included here as a PDF for viewers.
BDaaS (Big Data as a Service) by Sherya Pal from Saama. The presentation was given at the #doppa17 DevOps++ Global Summit 2017. All copyrights are reserved with the author.
BDW Chicago 2016 - Ramu Kalvakuntla, Sr. Principal - Technical - Big Data Pra... - Big Data Week
We are all aware of the challenges enterprises are having with growing data and siloed data stores. The business cannot make reliable decisions with untrusted data, and on top of that, it does not have access to all the data within and outside the enterprise needed to stay ahead of the competition and make key business decisions.
This session will take a deep dive into the challenges businesses have today and how to build a modern data architecture using emerging technologies such as Hadoop, Spark, NoSQL data stores, MPP data stores, and scalable, cost-effective cloud solutions such as AWS, Azure, and Bigstep.
Self Service Analytics and a Modern Data Architecture with Data Virtualizatio... - Denodo
Watch full webinar here: https://bit.ly/32TT2Uu
Data virtualization is not just for self-service; it’s also a first-class citizen when it comes to modern data platform architectures. Technology has forced many businesses to rethink their delivery models. Startups like Amazon and Lyft emerged, leveraging the internet and mobile technology to better meet customer needs, disrupting entire categories of business and growing to dominate their categories.
Schedule a complimentary Data Virtualization Discovery Session with g2o.
Traditional companies are still struggling to meet rising customer expectations. During this webinar with the experts from g2o and Denodo, we covered the following:
- How modern data platforms enable businesses to address these new customer expectations
- How you can drive value from your investment in a data platform now
- How you can use data virtualization to enable multi-cloud strategies
Leveraging the strategy insights of g2o and the power of the Denodo platform, companies do not need to undergo the costly removal and replacement of legacy systems to modernize their systems. g2o and Denodo can provide a strategy to create a modern data architecture within a company’s existing infrastructure.
Data and Analytics at Holland & Barrett: Building a '3-Michelin-star' Data Pl... - Dobo Radichkov
This presentation, delivered at the AWS London Summit 2023, provides an in-depth look at how Holland & Barrett built a robust, high-performing data platform on AWS to drive insights at the speed of thought. Dobo Radichkov, Chief Data Officer, shares key aspects of the data strategy, outlining how the company utilised AWS Redshift, Metabase, and Retool to create an efficient data lake, data warehouse, and analytics layer. The presentation also discusses the transformative impact of this data infrastructure on various business areas, including Finance, Commercial, Supply Chain, Customer, Digital, and Wellness. Through this data-driven journey, Holland & Barrett aims to become the beating heart of the organization, unlocking success for colleagues, customers, and partners alike.
In the presentation, Dobo Radichkov lays out Holland & Barrett's vision to make their Data & Analytics team the heartbeat of the organization, a vision that has guided their strategy and tool selection. He explains how this vision is brought to life through their organizational structure, comprising six specialized teams: Data Engineering, Data Warehouse, Business Intelligence, Data Science, Web & App Analytics, and Digital Analytics.
Dobo takes the audience through the company's strategic roadmap, a three-phase plan guiding the growth and development of their data capabilities. This roadmap isn’t just a technological plan but signifies a transformational journey for the team, aiming to embed data-driven decision-making in the DNA of Holland & Barrett.
Lastly, he showcases the '3-Michelin-star' data platform's architecture, painting a clear picture of how data moves from raw systems to the operational master data and, finally, to the analytics layer. The presentation concludes by highlighting how the newly formed data platform drives core business value and innovation across various business domains, reinforcing Holland & Barrett's commitment to becoming a data-led organization.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will present related topics such as vector databases, LLMs, and managing data at scale. The intended audience includes machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup and is sponsored by Zilliz, maintainers of Milvus.
The Building Blocks of QuestDB, a Time Series Database - javier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps as just another data type. However, when performing real-time analytics, timestamps should be first-class citizens, and we need rich time semantics to get the most out of our data. We also need to deal with ever-growing datasets while staying performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open-source time-series database designed for speed. We will also review some of the changes we have made over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
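To make the talk's themes concrete, here is a small hedged sketch using QuestDB's official Python client and its REST /exec endpoint. It assumes a local QuestDB on the default ports, and the trades table and its columns are invented for illustration.

```python
# Sketch: write possibly late/unordered rows to QuestDB, then run a
# time-bucketed SAMPLE BY query. Assumes a local QuestDB on default ports.
import requests
from questdb.ingress import Sender, TimestampNanos  # pip install questdb

with Sender.from_conf("http::addr=localhost:9000;") as sender:
    # QuestDB reorders late-arriving rows on the designated timestamp column.
    sender.row("trades",
               symbols={"pair": "BTC-USD"},
               columns={"price": 67123.5},
               at=TimestampNanos.now())
    sender.flush()

# Aggregate over the designated timestamp via the REST API.
r = requests.get("http://localhost:9000/exec",
                 params={"query": "SELECT timestamp, avg(price) "
                                  "FROM trades SAMPLE BY 1m"})
print(r.json())
```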
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... - sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Analysis insight about a Flyball dog competition team's performance - roli9797
Insights from my analysis of a Flyball dog competition team's performance last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
Learn SQL from basic queries to advanced queries - manishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
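To follow along with a guide like the one above without any database setup, Python's built-in SQLite module is enough to practice retrieval, filtering, and aggregation; the tiny sales table below is invented for illustration.

```python
# Practice the guide's basics (retrieval, filtering, aggregation) using only
# the standard library; the small sales table is invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, product TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('North', 'Widget', 120.0),
        ('North', 'Gadget',  80.0),
        ('South', 'Widget', 250.0);
""")

# Aggregation: revenue by region, largest first.
for row in conn.execute("""
        SELECT region, SUM(amount) AS revenue
        FROM sales
        GROUP BY region
        ORDER BY revenue DESC"""):
    print(row)   # ('South', 250.0) then ('North', 200.0)
```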
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf - Enterprise Wired
In this guide, we'll explore the key considerations and features to look for when choosing a trusted analytics platform that meets your organization's needs and delivers actionable intelligence you can trust.
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf - GetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built a robust Data Copilot on these three concepts, one that can help democratize access to company data assets and boost the performance of everyone working with data platforms. A minimal sketch of the retrieval step appears after the outline below.
Why do we need yet another (open-source) Copilot?
How can we build one?
Architecture and evaluation
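As a bare-bones sketch of the retrieval step in such a RAG setup (assuming the sentence-transformers library; the schema snippets and model choice are illustrative only):

```python
# Minimal retrieval step of a RAG data copilot: embed schema docs, then pick
# the most relevant ones to put into the LLM prompt. All content is made up.
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers
import numpy as np

docs = [
    "Table orders(order_id, customer_id, amount, created_at)",
    "Table customers(customer_id, name, segment)",
    "Table shipments(order_id, carrier, shipped_at)",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(docs, normalize_embeddings=True)

question = "What is total order value per customer segment?"
q_vec = model.encode([question], normalize_embeddings=True)[0]

# Cosine similarity (vectors are normalized, so a dot product suffices).
scores = doc_vecs @ q_vec
top = np.argsort(scores)[::-1][:2]
context = "\n".join(docs[i] for i in top)
print(context)  # schema context to prepend to the SQL-generation prompt
```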
Global Situational Awareness of A.I. and where it's headed - vikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, AI, big data, real-time, robots, and Milvus.
A lively discussion with NJ Gen AI Meetup Lead Prasad and Procure.FYI's Co-Founder.
2. CEDAR Data Estate
[Single-slide diagram: source systems, each marked RAW (CDC, CHARS, WAIIS/IMMS, CREST, REDCap), feed the CEDAR data lake over an Azure pipeline. Lake folders: 01_RAW (RAW), 02_USABLE_PREP (COOKED), 03_USABLE, 04_USEABLE_OUTPUT, 05_OUTPUT, and 06_SANDBOX, plus PARQUET_SRC (COOKED) and CEDAR (SERVED). Azure Share feeds the consumer units (Data Sciences Support Unit, Data Analysis Unit, Data Sciences Unit, Data Audit Unit) and their environments, labeled LAUREL, MADRONA, DEV, TEST, STAGE, PROD, RAW (RAW), AUDIT, and RPT_OUT. Legend: RAW = identical to source; COOKED = cleaned and conformed in common parquet files; SERVED = business-rule driven under governance teams.]
3. CEDAR Data Estate: Description and Notes
Data sources on the left side of the diagram are typically drawn from delimited text or relational databases and are accessed via API or direct network connection using Azure Data Factory pipelines.
The CEDAR data lake is focused on extract and load activities, with only enough transformation to fulfill the Kimball standards for cleaning and conforming. Consumers can receive data as raw, cooked, or served.
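To illustrate that extract-and-load posture, here is a minimal, hypothetical sketch of a raw-to-cooked hop in pandas; the folder names echo the zones in the diagram, but the source file and columns are invented, not the actual CEDAR schema.

```python
# Hypothetical RAW -> COOKED step: read a delimited extract as-is, apply only
# light cleaning/conforming, and land it as a common parquet file.
import pandas as pd

raw = pd.read_csv("01_RAW/chars/extract.txt", sep="|", dtype=str)

cooked = (
    raw.rename(columns=str.lower)                       # conform column names
       .assign(event_date=lambda d: pd.to_datetime(d["event_date"],
                                                   errors="coerce"))
       .drop_duplicates()
)

# Common parquet layout shared by all consumer units (requires pyarrow).
cooked.to_parquet("02_USABLE_PREP/chars/extract.parquet", index=False)
```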
CEDAR is a “hub” data lake and as such is optimized for reads from a hierarchical Azure Data Lake Gen 2 data store.
Each of the client data consumer units receives a read-only “spoke” from CEDAR that is implemented using Azure Share.
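As a hedged illustration of how a consumer unit might read from such a spoke, here is a sketch using the Azure Data Lake Storage Gen 2 Python SDK; the account URL, file system, and file path are placeholders, not the actual CEDAR layout.

```python
# Sketch of a consumer unit reading one served file from its read-only spoke;
# the account, file system, and path below are placeholders.
from azure.identity import DefaultAzureCredential              # pip install azure-identity
from azure.storage.filedatalake import DataLakeServiceClient   # pip install azure-storage-file-datalake

service = DataLakeServiceClient(
    account_url="https://exampleaccount.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client("cedar-share")
file = fs.get_file_client("served/vaccinations/part-0000.parquet")

with open("part-0000.parquet", "wb") as out:
    out.write(file.download_file().readall())
```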
The Data Sciences Support Unit acts as CEDAR’s “first best customer” and serves as a center for standards and best practices. DSSU maintains a code repository implemented in GitHub for the benefit of all the CEDAR data consumer units.
Client data consumers vary in requirements and implementation. Each is envisioned (though not required) to be built as a compute-optimized data lake that adds value by using both local and shared data to create data products that are composed of data science experiments and machine learning models, traditional data analytics, healthcare-driven insights, and various public and private dashboards.
Third-party public data consumers like local healthcare authorities, hospitals, clinics, and autonomous indigenous healthcare organizations would access the products our data consumers produce via secure API and Azure Identity Governance-derived accounts.
Content in the CEDAR data lake’s “served” folders is anticipated to carry the following additional attributes:
- Data is organized by business fact subject groups, like vaccinations, investigations, hospitalizations, and such, rather than “siloed” by individual budgetary units.
- Data is cleaned and conformed using Kimball standards and best practices for “Data Mart” units.
- Governance is crucial to the “served” folders and is anticipated to be chaired at the level of the office of technology innovation, with stakeholders from the budgetary units who contributed data and who thereby assisted in breaking down the budgetary “silos”.