Dynamic Talks: "Implementing data quality automation with open source stack" ... - Grid Dynamics
The quality of business decisions, machine learning insights, and executive reports depends on the quality and integrity of the underlying data. There are many ways that data can get corrupted in an analytical data platform, from de-synchronization with the system of record to defects in data pipelines. We will show how to detect and prevent data corruption with automation, open source tools, and machine learning.
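The kind of automated corruption detection the abstract describes can be sketched in a few lines. This is a minimal, hypothetical example, not the talk's actual stack: the function name, fields, and thresholds are assumptions for illustration. It checks two common failure modes, row-count drift against the system of record and excessive null rates.

```python
# Hypothetical sketch of an automated data-quality check: compare an
# analytical table against its system of record and flag anomalies.
# Field names and the 5% null-rate threshold are illustrative assumptions.

def check_quality(records, source_count, max_null_rate=0.05):
    """Return a list of human-readable data-quality issues found in `records`."""
    issues = []
    # Completeness: row count should match the system of record.
    if len(records) != source_count:
        issues.append(f"row count {len(records)} != source {source_count}")
    # Validity: the null rate per field should stay below a threshold.
    if records:
        for field in records[0].keys():
            nulls = sum(1 for r in records if r.get(field) is None)
            rate = nulls / len(records)
            if rate > max_null_rate:
                issues.append(f"field '{field}' null rate {rate:.0%} exceeds {max_null_rate:.0%}")
    return issues

rows = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": None}, {"id": 3, "amount": 7.5}]
print(check_quality(rows, source_count=4))
```

In a real pipeline, checks like these would run on a schedule and alert on failure; the abstract's machine-learning angle would replace the fixed threshold with a learned baseline.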
Data Quality - The True Big Data Challenge - Stefan Kühn
Data quality is one of the most overlooked key aspects of any Big Data project or approach. This talk addresses the problem from various perspectives, discusses the main challenges, and identifies possible solutions.
Applying Data Quality Best Practices at Big Data Scale - Precisely
Global organizations are investing aggressively in data lake infrastructures in the pursuit of new, breakthrough business insights. At the same time, however, 2 out of 3 business executives are not highly confident in the accuracy and reliability of their own Big Data. Regaining that confidence requires utilizing proven data quality tools at Big Data scale.
In this on-demand webinar, discover how to ensure your data lake is a trusted source for advanced business insights that lead to new revenue, cost savings and competitiveness. You will have the opportunity to:
• Compare your organization’s data lake “readiness” against initial findings from our upcoming annual Big Data Trends survey
• Gain insight into where and how to leverage data quality best practices for Big Data use cases
• Explore how a ‘Develop Once, Deploy Anywhere’ approach, including to native Big Data infrastructures such as Hadoop and Spark, facilitates consistent data quality patterns
Enterprise Analytics: Serving Big Data Projects for Healthcare - DATA360US
Andrew Rosenberg's Presentation on "Enterprise Analytics: Serving Big Data Projects for Healthcare" at DATA 360 Healthcare Informatics Conference - March 5th, 2015
Data Quality Strategy: A Step-by-Step Approach - FindWhitePapers
Learn about the importance of having a data quality strategy and setting the overall goals. The six factors of data are also explained in detail, along with how to tie them together for implementation.
Gayatri Patel, eBay, presents at the Big Analytics 2012 Roadshow
The wonders of what data can do for an organization are measured in the productivity and competitiveness of its teams' decisions. Some believe more data is the key. Agreed... but good decisions require more than just deriving intelligence from big data. In this dynamic market, the need to socialize and evolve ideas with other teams, quickly correlate information across sources, and test ideas to fail fast early are strong enablers of competitive footing. eBay's analytic and technology advancements garner insights and approaches that continue to help our employees tell their "data stories" and make better decisions.
Big Data - it's the big buzz. But is it dead on arrival?
In this presentation Daragh O Brien looks at the history of information management, the challenges of data quality and governance, and the implications for big data...
Understanding big data and data analytics - Seta Wicaksana
Big Data helps companies to generate valuable insights. Companies use Big Data to refine their marketing campaigns and techniques. Companies use it in machine learning projects to train machines, predictive modeling, and other advanced analytics applications.
Enacting the data subjects' access rights for GDPR with data services and data... - Jean-Michel Franco
GDPR is more than another regulation to be handled by your back office. As stated by the European Commission, "The primary objective of this new set of rules is to give citizens back control over their personal data." And surveys show that European citizens are eager to exercise those new fundamental rights, such as access to information, data portability, and the right to be forgotten. Will you be ready to deliver, or will you be forced to tell your customers that unfortunately, you are not yet ready to respect their rights?
Enacting the GDPR’s Data Subject Access Rights (DSAR) requires practical actions. There’s a mandate for an integrated data governance strategy to establish your data inventory, operationalize controls, foster accountability across teams and ensure compliance, and finally unleash personal data to your customers, employees, visitors, and prospects. Only a strong data governance program on top of a modern, collaborative data hub ensures that you have the policies, standards, and controls in place to enforce compliance.
This presentation outlines the practical steps to deploy governed data services that:
Know your customers and employees with a data inventory
Track and trace data using audit trails and data lineage
Manage and propagate opt-in consent across customer-facing applications
Reconcile and protect your sensitive data in a data hub with automated controls, data stewardship, and data masking
Respect the rights of your data subjects with collaborative data management and portals
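One of the controls listed above, data masking, can be illustrated with a keyed hash that pseudonymizes a sensitive field while keeping records joinable. This is a minimal sketch under stated assumptions: the salt value, its rotation, and the field names are hypothetical, and this is not the API of any particular data-hub product.

```python
# Illustrative data-masking sketch: a keyed hash (HMAC-SHA256) replaces a
# sensitive value with a stable pseudonym, so downstream joins still work
# without exposing the raw value. The salt here is a placeholder; in practice
# it would be managed and rotated under the governance program.
import hashlib
import hmac

SECRET_SALT = b"rotate-me-regularly"  # hypothetical key managed by governance

def mask(value: str) -> str:
    """Deterministically pseudonymize a sensitive value with a keyed hash."""
    return hmac.new(SECRET_SALT, value.encode(), hashlib.sha256).hexdigest()[:16]

record = {"customer_id": "C-1042", "email": "alice@example.com"}
masked = {**record, "email": mask(record["email"])}
print(masked["email"] != record["email"])   # True: the raw email is not stored
print(mask("alice@example.com") == masked["email"])  # True: deterministic, joins survive
```

Determinism is the design choice to note: the same input always yields the same pseudonym, which preserves referential integrity across systems while the key stays inside the hub.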
Big data is like a two-edged sword: It can bring many new opportunities for business, but it can also harm individuals and businesses in unanticipated ways
Big Data Applications | Big Data Application Examples | Big Data Use Cases | ... - Simplilearn
In this Big Data presentation, we discuss Big Data growth over the last few years, followed by various big data applications. We look into the various sectors where big data is used, such as weather forecasting, healthcare, media and entertainment, logistics, travel & tourism, and finally the government & law enforcement sector.
We will discuss how the industries below are using Big Data:
1. Weather forecast
2. Media and entertainment
3. Healthcare
4. Logistics
5. Travel and tourism
6. Government and law enforcement
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem, such as Hadoop 2.7, YARN, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, its architecture, sources, sinks, channels, and configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying Data frames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
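The "functional programming in Spark" objective above boils down to chaining map/filter/reduce-style transformations over a dataset. The plain-Python sketch below mimics the classic RDD word count to show the shape of that pattern; it is an illustration, not PySpark itself (in Spark the same pipeline would read roughly `rdd.flatMap(...).map(...).reduceByKey(...)`).

```python
# RDD-style word count, illustrated with plain Python. Each step mirrors a
# Spark transformation: flatMap -> map -> reduceByKey.
from functools import reduce

lines = ["big data with spark", "spark makes big data simple"]

# flatMap: split each line into words
words = [w for line in lines for w in line.split()]

# map: pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: fold the pairs into per-word totals
counts = reduce(lambda acc, kv: acc | {kv[0]: acc.get(kv[0], 0) + kv[1]}, pairs, {})

print(counts["big"], counts["spark"])  # 2 2
```

The point of the pattern is that each step is a pure transformation over the whole collection, which is what lets Spark distribute the same code across partitions.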
Change Management: The Secret to a Successful SAS® Implementation - ThotWave
Whether you are deploying a new capability with SAS® or modernizing the tool set that people already use in your organization, change management is a valuable practice. Sharing the news of a change with employees can be a daunting task and is often put off until the last possible second. Organizations frequently underestimate the impact of the change, and the results of that miscalculation can be disastrous. Too often, employees find out about a change just before mandatory training and are expected to embrace it. But change management is far more than training. It is early and frequent communication, an inclusive discussion, encouraging and enabling the development of an individual, and facilitating learning before, during, and long after the change.
This paper not only showcases the importance of change management but also identifies key objectives for a purposeful strategy. We outline our experiences with both successful and not-so-successful organizational changes. We present best practices for implementing change management strategies and highlight common gaps. For example, developing and engaging "Change Champions" from the beginning alleviates many headaches and avoids disruptions. Finally, we discuss how the overall company culture can either support or hinder the positive experience change management should be, and how to engender support for formal change management in your organization.
Big Data Impact on Purchasing and SCM - PASIA World Conference Discussion - Bill Kohnen
The volume, velocity, and variety of data available is almost unthinkable. 90% of the world's data is less than 2 years old, we are able to analyze less than 5% of it, and 80% of what people generally look at is less than 6 weeks old. Harnessing this data for effective decision making is a goal for organizations worldwide and has created a $50 billion industry of tools and consulting.
Even before “Big Data” Purchasing groups were swimming in data and struggled to put it to effective use. The success of Strategic Sourcing methodology had the effect of also identifying and standardizing the types and format of information that can be used to drive improvement.
This discussion will connect how big data sources and methodology can be used to develop specific and relevant spend analytics. Also presented will be an illustration of how you can use data and tools you already have - to get immediate results and make you better prepared to evaluate the need for more powerful analytic tools.
Finally, it will conclude with comments on how Big Data, along with other disruptive digital trends, will create new required skill sets for purchasing and supply chain professionals and transform how they already operate.
Which technologies does IBM expect to have the greatest impact going forward?
Get insight into how IBM's previous predictions have fared, and hear a view of what the future holds from IBM Research.
Anders Quitzau, Chief Technologist, IBM
Why should we care about integrating data? What should we be trying to achieve? Population Health. The Softer, Human Side of Being “Data Driven” not “Driven By Data." The New Era of Decision Support in Healthcare. Top 10 Challenges To Integrating External Data.
5 Things to Know About the Clinical Analytics Data Management Challenge - Ext... - Michael Dykstra
5 Things to Know About the Clinical Analytics Data Management Challenge - Extracting Real Benefit From Your EHR Data
The EHR revolution has created immense promise for improved patient outcomes and reduced costs but most healthcare organizations are struggling to experience significant benefits. The power of Applied Clinical Analytics lies in a simple but powerful concept: the importance of focusing on the accuracy and availability of the underlying data, first and foremost.
Microsoft: A Waking Giant in Healthcare Analytics and Big Data - Health Catalyst
In 2005, Northwestern Memorial Healthcare embarked upon a strategic Enterprise Data Warehousing (EDW) initiative with the Microsoft technology platform as the foundation. Dale Sanders was CIO at Northwestern and led the development of Northwestern's Microsoft-based EDW. At that time, Microsoft as an EDW platform was not in vogue, and there were many who doubted the success of the Northwestern project. While other organizations were spending millions of dollars and years developing EDWs and analytics on other platforms, Northwestern achieved great and rapid value at a fraction of the cost of the more typical technology platforms. Now, there are more healthcare data warehouses built around Microsoft products than any other vendor. The risky bet on Microsoft in 2005 paid off.
Ten years ago, critics didn’t believe that Microsoft could scale in the second generation of relational data warehouses, but they did. More recently, many of these same pundits have criticized Microsoft for missing the technology wave du jour in cloud offerings, mobile technology, and big data. But, once again, Microsoft has been quietly reengineering its culture and products, and as a result, they now offer the best value and most visionary platform for cloud services, big data, and analytics in healthcare.
In this context, Dale will talk about:
His up-and-down journey with Microsoft as an Air Force and healthcare CIO, and why he is now more bullish on Microsoft than ever before
A quick review of the Healthcare Analytics Adoption Model and Closed Loop Analytics in healthcare, and how Microsoft products relate to both
The rise of highly specialized, cloud-based analytic services and their value to healthcare organizations’ analytics strategies
Microsoft’s transformation from a closed-system, desktop PC company to an open-system consumer and business infrastructure company
The current transition period of enterprise data warehouses between the decline of relational databases and the rise of non-relational databases, and the new Microsoft products, notably Azure and the Analytic Platform System (APS), that bridge the transition of skills and technology while still integrating with core products like Office, Active Directory, and System Center
Microsoft’s strategy with its PowerX product line, and geospatial analysis and machine learning visualization tools
The third webcast in this series focuses on ways to meet your health system’s specific needs and achieve a 360-degree view of your patients, processes, physicians, and costs without purchasing multiple, disparate solutions, and creating information silos.
Our speakers discuss their collective experience working with organizations to create tailored platforms that provide convenient access to data collected by, and stored in, disparate clinical information systems, and enable that data to be securely used by users throughout the broader healthcare community. Actionable data, available to all users when they need it, serves as a foundation for analysis and decision-making aimed at improving how care is delivered.
You can find it online at http://www.informationbuilders.com/webevents/online/24637#sthash.RnwoH27x.dpuf
Microsoft: A Waking Giant in Healthcare Analytics and Big Data - Dale Sanders
Ten years ago, critics didn’t believe that Microsoft could scale in the second generation of relational data warehouses, but they did. More recently, many of these same pundits have criticized Microsoft for missing the technology wave du jour in cloud offerings, mobile technology, and big data. But, once again, Microsoft has been quietly reengineering its culture and products, and as a result, they now offer the best value and most visionary platform for cloud services, big data, and analytics in healthcare.
A hybrid approach to data management is emerging in healthcare as organizations recognize the value of an enterprise data warehouse in combination with a data lake.
In this SlideShare, we discuss data lakes in healthcare and we:
Provide an overview of a Hadoop-based data lake architecture and integration platform, and its application in machine learning, predictive modeling, and data discovery
Discuss several key use cases driving the adoption of data lakes for both providers and health plans
Discuss available data storage forms and the required tools for a data lake environment
Detail best practices for conducting data lake assessments and review key implementation considerations for healthcare
How much is that data in the window: Healthcare data valuation - Sean Manion, PhD
Presentation on healthcare data valuation, data confidence fabrics, layers of trust in healthcare, and health data marketplaces as part of the Health Data Valuation event, Session 10 of the IEEE Healthcare: Blockchain & AI Virtual Series on 25 August 2021
Trends changed from non-compliance to RR --> gap to RR --> Data Integrity --> DIB --> Smart Audit & Smart Data.
RR = Regulatory Requirements
DIB = Data Integrity Breach
Take data integrity seriously, whether you are a small or a big organization: your data is the heart of your business. Regulatory bodies are highly conscious of such issues. For beginners on this path, my small note can help you a lot.
How to Create a Big Data Culture in Pharma - Chris Waller
A talk presented at the Big Data and Analytics conference in Boston on January 28, 2014. Emphasis on data and information sharing cultures in companies.
Despite massive investment in both people and technology, health systems are still struggling to maximize the value of their greatest asset: their data. Delivering high-quality, valuable insight from data and pushing those insights to the frontline healthcare professionals remains challenging and expensive. According to a recent survey conducted by HealthLeaders Media, health systems are hiring more analytics staff than almost any other role in health care. We know there’s an alternative to the massive hiring of analytics staff, a better way to dramatically increase the efficiency of your existing resources and provide an ROI that grows over time. The better way is the Rapid Response Analytics Solution.
Rapid Response Analytics Solution (RRA Solution) consists of two elements: curated, modular data called DOS™ Marts and Population Builder, a powerful self-service tool that lets any type of user, from physician executives to frontline nurses and population health teams, explore their data and quickly build and share populations without needing to know how to write SQL or data science code. RRA Solution increases an analytics team's productivity by up to 10x and reduces its time to develop analytics by as much as 90 percent. Analysts can spend more time focusing on key strategic analysis and less time on repetitive tasks that can lead to inconsistent results and a backlog of requests.
Learning Objectives:
- Discover how RRA Solution allows you to take components and customize them to quickly tailor and deliver meaningful insights.
- Learn about DOS™ Marts and Population Builder and how they drive consistency and efficiency, without needing to know SQL and data science coding.
- Understand how to use RRA Solution to increase the value of your analytics team and get them operating at the top of their function.
View this webinar and learn how RRA Solution can help you achieve a 10x increase in productivity and reduce your time to develop new analytics reports by more than 90 percent.
The Data Operating System: Changing the Digital Trajectory of HealthcareDale Sanders
This is the next evolution in health information exchanges and data warehouses, specifically designed to support analytics, transaction processing, and third party application development, in one platform, the Data Operating System.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... - Sameer Shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup and is sponsored by Zilliz, maintainers of Milvus.
Adjusting primitives for graph : SHORT REPORT / NOTES - Subhajit Sahu
Graph algorithms, like PageRank, operate on a graph representation. Compressed Sparse Row (CSR) is an adjacency-list-based graph representation that stores each vertex's neighbors in contiguous arrays. The following experiments adjust the underlying primitives:
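The CSR layout the note refers to can be sketched in a few lines. This is an illustrative Python version, not code from the report; it builds the two standard arrays (per-vertex offsets and flattened edge targets) from a directed edge list with vertex ids 0..n-1.

```python
def build_csr(n, edges):
    """Return (offsets, targets): offsets[v]..offsets[v+1] indexes v's out-edges."""
    degree = [0] * n
    for u, _ in edges:
        degree[u] += 1
    # Prefix-sum the degrees to get each vertex's slice into the targets array.
    offsets = [0] * (n + 1)
    for v in range(n):
        offsets[v + 1] = offsets[v] + degree[v]
    # Scatter edge targets into place, tracking a write cursor per vertex.
    targets = [0] * len(edges)
    cursor = list(offsets[:-1])
    for u, v in edges:
        targets[cursor[u]] = v
        cursor[u] += 1
    return offsets, targets

# Example: edges 0→1, 0→2, 2→1
offsets, targets = build_csr(3, [(0, 1), (0, 2), (2, 1)])
# neighbors of vertex 0 are targets[offsets[0]:offsets[1]]
```

The contiguous layout is what makes CSR attractive for the report's experiments: neighbor scans are sequential reads, which suits both OpenMP threads and CUDA kernels.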
Multiply with different modes (map)
1. Performance of sequential vs. OpenMP-based vector multiply.
2. Comparing various launch configs for CUDA-based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs. bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential vs. OpenMP-based vector element sum.
2. Performance of memcpy-based vs. in-place CUDA vector element sum.
3. Comparing various launch configs for CUDA-based vector element sum (memcpy).
4. Comparing various launch configs for CUDA-based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA-based vector element sum (in-place).
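The float-vs-bfloat16 storage experiment hinges on bfloat16 being the top 16 bits of a float32 pattern (8 mantissa bits), trading precision for halved memory bandwidth. A pure-Python simulation of that truncation, illustrative only; the report's actual benchmarks are CUDA/OpenMP code:

```python
import struct

def to_bfloat16(x: float) -> float:
    """Simulate bfloat16 storage: keep the top 16 bits of the float32 bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# With only 8 mantissa bits, small offsets near 1.0 are truncated away:
assert to_bfloat16(1.0) == 1.0
assert to_bfloat16(1.001) == 1.0

# Summing many bfloat16-stored values accumulates the per-element storage
# error, which is what the storage-type benchmark trades against bandwidth.
exact = sum([0.1] * 1000)
approx = sum(to_bfloat16(0.1) for _ in range(1000))
```

In the benchmarked kernels the accumulator typically stays in full precision; only the stored elements are bfloat16, as simulated here.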
The Building Blocks of QuestDB, a Time Series Database - Javier Ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review some of the changes we have made over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
Adjusting OpenMP PageRank : SHORT REPORT / NOTES - Subhajit Sahu
For massive graphs that fit in RAM but not in GPU memory, it is possible to take advantage of a shared-memory system with multiple CPUs, each with multiple cores, to accelerate PageRank computation. If the NUMA architecture of the system is properly taken into account with good vertex partitioning, the speedup can be significant. As a step in this direction, experiments are conducted to implement PageRank in OpenMP using two different approaches: uniform and hybrid. The uniform approach runs all primitives required for PageRank in OpenMP mode (with multiple threads). The hybrid approach, on the other hand, runs certain primitives (i.e., sumAt, multiply) in sequential mode.
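A sequential Python rendering of the computation being parallelized may help fix the vocabulary: each iteration scales ranks by out-degree (the multiply-style primitive), gathers contributions along CSR edges, and folds in teleport and dangling mass. Illustrative only; the report's implementations are OpenMP, and `sumAt`/`multiply` are names from its own primitive library.

```python
def pagerank(n, offsets, targets, damping=0.85, iters=50):
    """Power-iteration PageRank over a CSR graph (offsets, targets)."""
    out_deg = [offsets[v + 1] - offsets[v] for v in range(n)]
    rank = [1.0 / n] * n
    for _ in range(iters):
        # multiply-style primitive: each vertex's contribution, rank / out-degree
        contrib = [rank[v] / out_deg[v] if out_deg[v] else 0.0 for v in range(n)]
        # start each new rank at the teleport term, then gather along edges
        new = [(1.0 - damping) / n] * n
        for u in range(n):
            for i in range(offsets[u], offsets[u + 1]):
                new[targets[i]] += damping * contrib[u]
        # redistribute dangling-vertex mass uniformly so ranks keep summing to 1
        dangling = damping * sum(rank[v] for v in range(n) if out_deg[v] == 0)
        rank = [x + dangling / n for x in new]
    return rank

# Two-vertex cycle 0→1, 1→0: both ranks converge to 0.5
ranks = pagerank(2, [0, 1, 2], [1, 0])
```

In the uniform approach every one of these steps becomes an OpenMP parallel loop; in the hybrid approach the cheap per-vertex primitives stay sequential because thread launch overhead can outweigh their work.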
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake - Walaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world, where data privacy and compliance are a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) they are auto-generated from declarative data annotations; (2) they respect user-level consent and preferences; (3) they are context-aware, encoding a different set of transformations for different use cases; (4) they are portable: while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
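The catalog-routing idea can be sketched in miniature. Everything below (the `PolicyCatalog` class, the annotation shape, the view-naming scheme) is invented for illustration; the actual ViewShift engine generates real SQL views per engine dialect and plugs into engine catalogs.

```python
class PolicyCatalog:
    """Toy catalog that routes reads of annotated tables to compliance views."""

    def __init__(self):
        self.annotations = {}  # table -> {column: policy transformation}

    def annotate(self, table, column, policy):
        """Declarative data annotation: apply `policy` to `column` on read."""
        self.annotations.setdefault(table, {})[column] = policy

    def resolve(self, table, use_case):
        """Route a table resolution to its compliance view, if one is needed."""
        if table not in self.annotations:
            return table                          # no policy: read the raw table
        return f"{table}__compliant_{use_case}"   # context-aware view per use case

    def view_sql(self, table, use_case):
        """Auto-generate the enforcing view from the annotations."""
        cols = ", ".join(
            f"{policy}({col}) AS {col}"
            for col, policy in self.annotations[table].items()
        )
        return (f"CREATE VIEW {self.resolve(table, use_case)} AS "
                f"SELECT {cols} FROM {table}")

catalog = PolicyCatalog()
catalog.annotate("members", "email", "redact")
# catalog.resolve("members", "analytics") yields the view name, not the raw table
```

The point of the real design is that queries need no rewriting: the engine asks the catalog for "members" and transparently receives the enforcing view for its context.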
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf - Enterprise Wired
In this guide, we'll explore the key considerations and features to look for when choosing a trusted analytics platform that meets your organization's needs and delivers actionable intelligence you can trust.
2. Paul Balas
pbalas@303Computing.com
Over 25 Years Experience Leading Digital Transformations
Multiple MDM Implementations, Data Governance, and Data Warehouse Initiatives
● Digital Transformation Consultant
● IT Executive
● Enterprise Architect
● Developer
3. This is a presentation to uncover the systemic challenges in our Government’s response to the COVID Pandemic through the use of data...
And a proposal to fix it.
4. The Team!
Mingo Sanchez, Elizabeth Michel, Keith Worfolk, Katie Everett, Kamal Maheshwari, Alexandra-Cosmina Comaniciu, Pooja K Swamy, Evan Hu, Bryan Haagsman, PMP
5. Agenda
● Why is the Data Wrong?
● Why the New HHS System Won’t Work
● How To Fix It - a POC
7. ➢ Data Quality is suspect
➢ The CDC had an aging system
➢ The virus is spreading quickly - huge volumes of new data to process
➢ We aren’t capturing the data we need
8. Why Do We Need The Data?
➢ Managing our supply chain (PPE, Testing Supplies, Hospital ICU Beds)
➢ Ensuring we have enough doctors, nurses, and other healthcare professionals
➢ Issuing protective orders to stop the spread of the virus (Shelter-in-Place, Social Distancing, Shuttering Business)
9. We’ll be asking ourselves, “Could we have saved more lives?”
But without trusted data, no decision will be driven with conviction.
10. Dr. Deborah Birx said in a White House Coronavirus task force meeting: “There is nothing from the CDC that I can trust.”
15. The Scope of The Problem
➢ CDC has 1,200 Users and about 950 State CDC Partners
➢ There are over 6,000 Hospitals in the US
➢ About 2,000 Hospitals provide Covid data directly to HHS Protect
➢ TeleTracking provides data from about 1,100 Hospitals
(the contract with TeleTracking was a quick way to get more data, more quickly, into the system)
16. How The CDC is Trying to Solve The Problem
The main focus behind the improvement of the new system is speed, data format standardization, and validation, not on improving Data Governance and Collaboration.
18. Health and Human Services Goals for Data Governance
“Use of data across programs… remains a challenge.”
“Data are often housed in … data silos”
19. Standardized Data Entry Vs. Agile Data Management
Even with a standardized platform like TeleTracking, it’s just a data entry app that is driven by procedural rules (though it is much better than handing everyone an instruction book and Excel).
The main problem in achieving Data Quality is PEOPLE.
20. The Fallacy of Standardized Data Entry Solutions
And here we have the problem: definitions that may or may not be adhered to when data is entered into the system.
Medical forms are extremely complex and require a great deal of training for health practitioners to get right.
Instructions for TeleTracking
21. If you’re a manager who believes that a standardized data entry screen fixes your organization’s data quality problems...
I strongly encourage you to speak with your data scientists or data warehouse developers.
23. The New HHS System Will Fall Short
The new system doesn’t solve the hard problems of achieving Data Quality:
➢ People efficiently aligning to create standards
➢ Embedding standards into the system (Procedures vs. ML)
Let’s see how we envision a faster approach.
24. Imagine This is Your Problem to Solve
You are the CDO of the CDC, tasked to improve our Nation’s ability to better manage this and the next pandemic through the use of data.
Your first goal is to understand the key issues in the current system (“as-is”) and develop a roadmap to address them.
The stakes are high.
25. Architectural Principles
You outline your key architectural principles to keep the broad team focused on outcomes:
Goal 1 - Build better trust in the data
Goal 2 - Understand which issues to fix first (Prioritize)
Goal 3 - The system should be agile to change (Days and Weeks, not Months or Years, for new features)
Goal 4 - Efficient e-collaboration
26. Two Paths
You decide to split the problem in two:
Path 1 - Standardize data entry systems - the long path
Path 2 - Build a framework for efficient Data Governance, and do it quickly
27. Path 2 - The Reference Platform
● Master Data Management
● Taxonomy Management
● Data Quality
● Cloud Data Integration
● Data Transformation
● Orchestration
● Natural Language / Feature Extraction
● Data Lake
● Compute and Storage
28. POC - Understand Data Issues
You want to focus on the systemic issues in the underlying data flow across all stakeholders.
You choose to look at issues around ‘Testing’, as you believe you can get immediate benefits for public health if you can build confidence in the test data.
● What problems are states having in processing Test Data?
● Is Test Data being reported consistently and accurately?
29. JHU
Johns Hopkins University has become the de-facto authority on COVID-19 data, but did you know they are pulling it from other agencies? What types of problems do they see?
“The website relies upon publicly available data from multiple sources that are not always consistent in how and when they are released and updated. States may report components of testing data with different cadences, or they may even change how they report categories of data over time, all of which can affect calculations of the rate of positivity. For example, some states report testing positives separately from testing negatives, which may make it appear that 100% of their tests were positive or 100% negative on that day. Also, states have changed how they count positives and negative test results and may retroactively change the numbers reported.” - JHU COVID Website
If you could classify all the issues across stakeholders, you believe you could have a tool to get alignment with stakeholders by listening to them through their own words.
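The JHU quote boils down to an arithmetic hazard: positivity = positives / (positives + negatives), so a state that reports positives but not negatives on a given day appears to be 100% positive. A minimal guarded calculation, illustrative only:

```python
def positivity_rate(positives, negatives):
    """Return the share of tests that were positive, or None if not computable."""
    if positives is None or negatives is None:
        return None  # incomplete report: flag it, don't fabricate a 100% day
    total = positives + negatives
    return positives / total if total else None

# A complete report computes normally:
assert positivity_rate(50, 950) == 0.05
# A report missing negatives is refused rather than read as 100% positivity:
assert positivity_rate(50, None) is None
```

A data quality pipeline would route the `None` cases to a review queue instead of letting them distort the statewide rate.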
30. When is a Test Not a Test?
The CDC, Johns Hopkins, the COVID Tracking Project, and hundreds of other sites all deal with testing differently. It seems simple, but even the CDC made this mistake.
“The Centers for Disease Control and Prevention is conflating the results of two different types of coronavirus tests, distorting several important metrics and providing the country with an inaccurate picture of the state of the pandemic. We’ve learned that the CDC is making, at best, a debilitating mistake: combining test results that diagnose current coronavirus infections with test results that measure whether someone has ever had the virus. The upshot is that the government’s disease-fighting agency is overstating the country’s ability to test people who are sick with COVID-19.” - Alexis C. Madrigal, Robinson Meyer, May 21, 2020
This is something we want to correct for and monitor in our POC. Can our system compare test reports from various agencies and help explain why they differ?
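One shape such a POC monitor could take: compare each agency's reported total against the test types it actually discloses, flagging feeds where viral (diagnostic) and antibody (serology) results appear to be combined. The record layout and function names here are invented for illustration.

```python
def conflation_issues(reports):
    """Flag reports whose test-type breakdown doesn't support the stated total."""
    issues = []
    for r in reports:
        disclosed = (r.get("viral_tests") or 0) + (r.get("antibody_tests") or 0)
        if r.get("antibody_tests") is None:
            # No serology breakout: totals may silently mix both test types.
            issues.append((r["source"], "antibody count not broken out"))
        elif disclosed != r["total_tests"]:
            issues.append((r["source"], "total does not match disclosed test types"))
    return issues

reports = [
    {"source": "state_A", "total_tests": 1000, "viral_tests": 900, "antibody_tests": 100},
    {"source": "state_B", "total_tests": 1000, "viral_tests": 1000, "antibody_tests": None},
]
flagged = conflation_issues(reports)
# state_B is flagged: it may be folding antibody results into its totals
```

The value is less the check itself than the audit trail: each flag names the source and the reason, which is exactly the stakeholder-facing evidence the slides call for.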
32. It’s Going Well
The framework is proving valuable. You can now see the systemic data quality issues and, importantly, communicate with stakeholders effectively to get alignment.
You see an opportunity to do more, quickly: can we use the system to show how people in the public eye might influence people to get tested? You know that will be critical once our testing capacity increases.
You ask the team to classify news outlets by Public Influencers, Events, and Locations.
Using a traditional tool, this wouldn’t be possible, but you’ve seen how efficiently you can master and classify data through this platform.
34. DataOps and More
Your new DataOps system will provide more than just good data quality for COVID and other pandemics.
It will also allow you to conduct data science experiments to see if there is a correlation between public policy actions, infection rates, and ultimately deaths.
35. What Did You Achieve as CDC CDO?
➢ You delivered a DataOps framework that will expedite realization of data standards
➢ It puts the power of data governance and master data management into the hands of the experts at the CDC, HHS, Hospitals and Labs
➢ It works in complement with systems like TeleTracking
➢ It will scale beyond infectious disease data and can serve as a model for HHS to ensure and promote data quality for all citizens
36. How Was It Built
Built on Google Cloud Platform:
● Internet data sources: Twitter API, News API, State Health Department web pages, JHU GitHub
● Data orchestration: InfoWorks
● Data mastering: Tamr
● Visualization: Tableau
● Storage and compute: BigQuery, VM instances, Google Cloud Storage
● Services and tooling: Natural Language, Python
39. Tamr - Data Experts Spend More Time Analyzing/Strategizing
Before: experts spend too much time manually fixing data.
Today: ML can do 80% of the data mastering lift... enabling experts to put the final touches on the last 20%.
40. The Tamr Agile Approach to Data Mastering
OLD WAY (rules-based): Source data → Mastered data
● Months 1-4: identify developers, get business input, write rules
● Months 5-12+: review with business, modify rules, create exceptions, iterate
● Time: months to years; Quality: 60%-80% accuracy
NEW WAY (machine-driven): Source data → Unified data
● Weeks 1-12: iterate with human-guided machine learning
● Time: days to weeks; Quality: 90%+ accuracy
43. Taxonomies: Before vs. After Tamr
Tamr enabled us to create standardized taxonomies that can be managed by a networked group of hospitals, labs, and health officials.
These taxonomies are critical to having good-quality, conformed data across a widely distributed data network.
There is an efficient mechanism for building consensus across experts at the same time as fixing the data.
There is no solution like it in the market.
45. Mastering People: 530K to 9K in a Few Days
Using Tamr, I was able to take a corpus of over 500K entity records identified by Google Natural Language across 60,000 news articles, hundreds of web pages, and thousands of tweets, reducing it to about 9K Golden Master People Records with links back to each news article they were referenced in, regardless of spelling or abbreviation.
I estimate the system can be maintained in one to two hours a week at scale, decreasing to minutes a week as the model learns.
I don’t even have to monitor it. Tamr can notify me of my quality score and if I have any pairs that it’s unsure how to match.
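The 530K→9K reduction is, at its core, entity resolution: cluster name mentions that refer to the same person despite spelling differences, then emit one golden record per cluster with links back to the source documents. Tamr does this with human-guided ML; this pure-Python sketch uses a crude normalization key only to illustrate the shape of the output, and all names are invented.

```python
from collections import defaultdict

def normalize(name):
    """Crude matching key: lowercase and strip punctuation."""
    return " ".join(name.lower().replace(".", "").replace(",", "").split())

def master_people(mentions):
    """mentions: list of (name, document_id) -> golden records with backlinks."""
    clusters = defaultdict(set)
    golden_names = {}
    for name, doc in mentions:
        key = normalize(name)
        clusters[key].add(doc)
        golden_names.setdefault(key, name)  # first spelling seen becomes golden
    return [{"name": golden_names[k], "documents": sorted(docs)}
            for k, docs in clusters.items()]

records = master_people([
    ("Deborah Birx", "article-1"),
    ("deborah birx", "article-2"),
    ("Dr. Birx", "article-3"),  # a real ML matcher would also merge this one
])
```

The gap the example leaves open, merging "Dr. Birx" with "Deborah Birx", is exactly where the human-guided machine learning earns its keep over rule-based keys.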
47. Conclusions
● The COVID Pandemic data challenges are a macro-view of the same challenges we all face in our own companies as we use data as information to improve outcomes
● People need to work together more effectively so we can erase this Pandemic from our lives
● Trusted data can truly help us save more lives