This document discusses using predictive analytics in the operating rooms (ORs) at Beth Israel Deaconess Medical Center. It describes developing a predictive model that identifies available OR time two weeks in advance, so that waitlisted cases and staff can be scheduled sooner. Building the model from historical OR data using linear regression with stochastic gradient descent could also help forecast case load three weeks out, allowing improved OR utilization, reduced staff overtime and idle time, shorter patient wait times, and fewer cancellations.
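As a rough illustration of the modeling approach named above, here is a minimal sketch using MLlib's SGD-based linear regression (the RDD API of the Spark 1.x era; it was removed in Spark 3.x). The feature layout and column meanings are illustrative assumptions, not the actual BIDMC schema.

# Minimal sketch: forecast unused OR block minutes with SGD-based linear
# regression from the RDD-era MLlib API. All features below are assumptions.
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

sc = SparkContext(appName="or-block-forecast")

# Hypothetical history rows: (unused_minutes, [day_of_week, pool_id, scheduled_cases, block_hours])
history = sc.parallelize([
    (120.0, [1.0, 0.0, 4.0, 8.0]),
    (30.0,  [3.0, 1.0, 7.0, 10.0]),
    (240.0, [5.0, 2.0, 2.0, 13.0]),
])
training = history.map(lambda r: LabeledPoint(r[0], r[1]))

model = LinearRegressionWithSGD.train(training, iterations=100, step=0.01)
print(model.predict([2.0, 1.0, 5.0, 10.0]))  # predicted unused minutes for a future block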
Using Spark in Healthcare Predictive Analytics in the OR - Data Science Pop-u... (Domino Data Lab)
The prevailing issue with Operating Room (OR) scheduling in a hospital setting is that it is difficult to schedule and predict available OR block times. This leads to empty, unused operating rooms and longer waits for patients needing procedures. Using multivariate linear regression, we will show how hospitals can predict available OR block times with Spark MLlib, resulting in better OR utilization and shorter wait times for patients. Presented by Denny Lee, Data Scientist and Evangelist at Databricks.
Regulating the Workload of Your Clinical Research Coordinator (CRC) (TrialJoin)
A CRC (clinical research coordinator) is one of the most important people in a clinical trial. He or she is in charge of conducting the trial under the guidance of the PI (principal investigator). As the person responsible for coordinating all on-site activities, a CRC can sometimes carry a huge workload. This can quickly become a big problem, not only for the CRC but for everyone else involved in the study.
Determining the right workload for your CRC is one of the most important actions that you (as a site owner or PI) should take. Ideally, a CRC should work 4 to 6 effective hours per day. However, there will be periods when the CRC has to work more than 8 hours and periods when he or she will be free most of the day. In this article, we'll help you find the right balance so that your CRC isn't overworked or underworked.
Building a Data Warehouse at Clover (PDF) (Otis Anderson)
A brief tour of why we focused on building out a data warehouse early on at Clover, and why we think the Data Science function has room to grow in health insurance.
A review of some of the pitfalls in planning local practice development programmes, and suggestions for how to produce a comprehensive and coherent plan that will achieve meaningful goals.
In our "Sitter Usage" webinar, you can expect to learn about the challenges of providing a patient with a sitter and the labor costs associated with it.
Nash Analytics™ has the capability to monitor how often PCTs, RNs and other support staff are pulled out of staffing to sit with a patient. We’ll use this webinar to show how to:
-Track sitter usage
-Review reports that track hours and cost
-Identify units that can benefit from alternative sitter options
-Determine when it's best to “hard wire” sitters into the budget
-Review alternatives to sitters as outlined in a Medscape article.
ehCOS: Global Pioneer in the development of "Next-Generation Electronic Healt... (everis / ehCOS)
Gartner, in "Market Trends: Vertical-Specific Software Will Be the Heart of New Global Healthcare Bodies," highlights key aspects of the future EHR: modular, flexible, open, and ready to incorporate the technological trends that will shape the coming decades. In this white paper, we explain why ehCOS CLINIC has emerged as a new-generation EHR today.
young or old, rich or poor, brain or heart, limbs or lungs
ISCHAEMIA is an unsolved problem.
We are on a mission to set up a world class Diagnostic, Treatment and Research Center in Calcutta for Ischaemic Diseases.
Improve inclusion exclusion criteria to safeguard successful patient recruitm... (Kunal Sampat)
9 strategies to develop excellent inclusion-exclusion criteria for your clinical trial protocol.
Each strategy is actionable, and you can start implementing it today.
Optimize your protocol for increased site satisfaction and faster patient recruitment.
Stop wasting money on clinical trials that don't enroll because you don't have fully vetted inclusion-exclusion criteria.
New Drug Application - How to Speed Up FDA Approval (TrialJoin)
We all know that sponsors invest a lot of money in animal and human clinical studies to test the safety and efficacy of a new drug. Sites are chosen to conduct the trials and, in the end, gather data for further analysis. But what's the purpose of it all? What happens after the trials end?
After a trial ends, the sponsor determines whether it went well enough to submit a New Drug Application (NDA) to the FDA. An NDA is submitted so the FDA can approve the new investigational product for the market. Here we'll discuss how the FDA reviews an NDA and ways you can increase your chances of approval.
Introduction to a panel of architects, public health professionals, and civic leaders about designing for health. Hosted by the American Institute of Architects, Washington, DC, on October 8, 2014.
IBM Insight 2014 session (4152) - Accelerating Insights in Healthcare with “B... (Alex Zeltov)
Accelerating insights in healthcare with "Big Data" and Hadoop: a use-case description of Hadoop at IBC (Independence Blue Cross), presented by Alex Zeltov and Darwin Leung of IBC.
Hospital Readmission Reduction: How Important are Follow Up Calls? (Hint: Very) (SironaHealth)
Starting in 2012, the Centers for Medicare and Medicaid Services (CMS) will begin withholding payments for potentially avoidable readmissions. This presentation reviews these new regulations, what causes excessive readmissions, and how hospitals can positively impact patient health by reaching out 24-72 hours after discharge.
Predicting Hospital Readmission Using Cascading (Cascading)
Michael Covert will examine how Healthcare Providers are finding ways to use Big Data analytics to reduce readmission rates and improve operational efficiency while complying with regulatory mandates.
Big Data, CEP and IoT: Redefining Healthcare Information Systems and Analytics (Tauseef Naquishbandi)
Big Data is a term encompassing techniques to capture, process, analyze, and visualize potentially large datasets within time frames not achievable with standard technologies.
It refers to the ability to crunch vast collections of information, analyze it instantly, and draw from it sometimes profoundly surprising conclusions.
Big data solutions can help stakeholders personalize care, engage patients, reduce variability and costs, and improve quality of health delivery.
Big data analytics can also contribute to providing a rich context to shape many areas of health care like analysis of effects, side-effects of drugs, genome analysis etc.
Medicine of the Future—The Transformation from Reactive to Proactive (P4) Med... (Ryan Squire)
Medicine of the Future—The Transformation from Reactive to Proactive (P4) Medicine as presented at the Ohio State University Medical Center Personalized Health Care National Conference.
Leroy Hood, MD, PhD, is the president and founder of the Institute of Systems Biology. Dr. Hood is a member of the National Academy of Sciences, the American Philosophical Society, the American Academy of Arts and Sciences, the Institute of Medicine and the National Academy of Engineering. His professional career began at Caltech where he and his colleagues pioneered four instruments — the DNA gene sequencer and synthesizer and the protein synthesizer and sequencer — which comprise the technological foundation for contemporary molecular biology. In particular, the DNA sequencer played a crucial role in contributing to the successful mapping of the human genome during the 1990s.
http://www.systemsbiology.org/Scientists_and_Research
Realizing the Promise of Big Data with Hadoop - Cloudera Summer Webinar Serie... (Cloudera, Inc.)
Apache Hadoop, an open-source platform, is increasingly gaining adoption within organizations trying to draw insight from all the big data being generated. Hadoop, and a handful of open-source tools that complement it, are promising to make gigantic and diverse datasets easily and economically available for quick analysis. A burgeoning partner ecosystem is also essential to helping organizations turn big data into business value.
QCon São Paulo: Real-Time Analytics with Spark Streaming (Paco Nathan)
"Real-Time Analytics with Spark Streaming" presented at QCon São Paulo, 2015-03-26
http://qconsp.com/presentation/real-time-analytics-spark-streaming
This talk presents an overview of Spark and its history and applications, then focuses on the Spark Streaming component used for real-time analytics. We compare it with earlier frameworks such as MillWheel and Storm, and explore industry motivations for open-source micro-batch streaming at scale.
The talk will include demos for streaming apps that include machine-learning examples. We also consider public case studies of production deployments at scale.
We’ll review the use of open-source sketch algorithms and probabilistic data structures that get leveraged in streaming – for example, the trade-off of 4% error bounds on real-time metrics for two orders of magnitude reduction in required memory footprint of a Spark app.
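To make that trade-off concrete, here is a small hedged sketch (not from the talk) using Spark SQL's approx_count_distinct, which is backed by a HyperLogLog++ sketch; the rsd parameter requests a relative error similar to the 4% bound quoted above. The event data is synthetic.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sketch-demo").getOrCreate()

# Synthetic click events: (page_id, user_id)
events = spark.createDataFrame(
    [(i % 100, "user_%d" % (i % 5000)) for i in range(20000)],
    ["page_id", "user_id"],
)

# rsd=0.04 asks for ~4% relative standard deviation; the underlying
# HyperLogLog++ sketch stays a few KB regardless of true cardinality.
uniques = events.groupBy("page_id").agg(
    F.approx_count_distinct("user_id", rsd=0.04).alias("approx_unique_users")
)
uniques.show(5)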
Introduction to Big Data Analytics: Batch, Real-Time, and the Best of Both Wo... (WSO2)
In this webinar, Srinath Perera, director of research at WSO2, will discuss:
Big data landscape: concepts, use cases, and technologies
Real-time analytics with WSO2 CEP
Batch analytics with WSO2 BAM
Combining batch and real-time analytics
Introducing WSO2 Machine Learner
This is the presentation I gave to the HIMSS Management Engineering and Process Improvement (ME-PI) Community on the use of predictive analytics in healthcare.
Evaluating Big Data Predictive Analytics Platforms (Teradata Aster)
Mike Gualtieri, Principal Analyst, Forrester Research, presents at the Big Analytics Roadshow, 2012 in New York City on December 12, 2012
Presentation title: Evaluating Big Data Predictive Analytics Platforms
Abstract: Great. You have Big Data. Now what? You have to analyze it to find game-changing predictive models that you can use to make smart decisions, reduce risk, or deliver breakthrough customer experiences. Big Data Predictive Analytics solutions are software and/or hardware solutions that allow firms to discover, evaluate, optimize, and deploy predictive models by analyzing big data sources. In this session, Forrester Principal Analyst Mike Gualtieri will discuss the key criteria you should use to evaluate Big Data Predictive Analytics platforms to meet your specific needs.
Children's Mercy Patient Progression Hub - HIT December 2023 (KC Digital Drive)
These slides were presented at the December 2023 meeting of the KC Digital Drive Health Innovation Team.
This presentation focuses on Children's Mercy's innovative use of data. Bill Saltmarsh, MBA, Vice President and Chief Data Officer, says, "We are using data to create value for our patients, their families, and our community. We believe that the key to delivering that value is contingent upon our ability to capture, safeguard, and derive novel insights from our data. It is also contingent upon our ability to take advantage of advanced analytical methods and technologies, including the use of Artificial Intelligence. One example of this type of innovation is our Patient Progression Hub, which is enabling us to improve the connected care experience for our patients by consistently providing the right information to the right people at the right time."
Bill leads the Data Intelligence Team for Children's Mercy, which includes groups such as Data Science, Clinical Reporting and Analytics, Data Platform Engineering, and Data Governance. Before joining Children's Mercy in March of 2023, he led data teams at ResMed and Pluralsight.
The Attune LIS is a leading cloud-based lab information system that integrates all departments and centers spread across locations on a stable and secure platform, giving decision makers a unified picture of their business. Attune LIS improves accuracy while accelerating turnaround time (TAT) and helping scale the business rapidly.
What is the best Healthcare Data Warehouse Model for Your Organization? (Health Catalyst)
Join Steve Barlow as he addresses the strengths and weaknesses of each of the following three primary Data Model approaches for data warehousing in healthcare:
1. Enterprise Data Model
2. Independent Data Marts
3. Late-binding Solutions
Speaker Presentation from U.S. News Healthcare of Tomorrow leadership summit, Nov. 1-3, 2017 in Washington, DC. Find out more about this forum at www.usnewshot.com.
Staffing Decision-Making Using Simulation Modeling (Alexander Kolker)
The use of Management Engineering methodology for staffing decision-making:
• Part 1 - Quality and Cost: Outpatient Flu Clinic
• Part 2 - Quality and Cost: Optimal PACU Nursing Staffing
• Summary of Fundamental Management Engineering
10 Things to Consider When Building a CTMS Business Case (Perficient, Inc.)
Sponsors and research organizations are often tasked with building a business case for a clinical trial management system (CTMS) before they even evaluate the various solutions in the marketplace.
After multiple successful Oracle Siebel CTMS implementations, Perficient has identified 10 ways you can benefit from a CTMS solution.
In this slideshare we share information that you can leverage as you develop a business case for a CTMS.
We also demonstrate the two most popular CTMS benefits and corresponding features.
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict... (Databricks)
The prevailing issue with Operating Room (OR) scheduling in a hospital setting is that it is difficult to schedule and predict available OR block times. This leads to empty, unused operating rooms and longer waits for patients needing procedures. In this three-part session, Ayad Shammout and Denny Lee will show:
1) How we tried to solve this problem using traditional DW techniques
2) How we took advantage of the DW capabilities in Apache Spark and easily transitioned to Spark MLlib so we could more easily predict available OR block times, resulting in better OR utilization and shorter wait times for patients
3) Some of the key learnings we had when migrating from DW to Spark
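For flavor only (not the session's actual code), a DW-style rollup of block utilization could be expressed directly in Spark SQL like this; the or_blocks table and its columns are assumptions for the sketch.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("or-dw-rollup").getOrCreate()

# Assumed historical table of OR blocks with usage minutes.
spark.read.parquet("/data/or_blocks").createOrReplaceTempView("or_blocks")

utilization = spark.sql("""
    SELECT surgical_pool,
           date_trunc('week', block_date)          AS week,
           SUM(used_minutes) / SUM(block_minutes)  AS utilization
    FROM or_blocks
    GROUP BY surgical_pool, date_trunc('week', block_date)
""")
utilization.show()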
eHealth Summit: "How a mathematical patient flow modelling study can eliminat... (3GDR)
Slides from National eHealth Summit, 30 Sept 2015 at Carton House, Kildare: Professor Gary Courtney, Lead, National Acute Medicine Programme (NAMP).
#eHealthSummit15
http://www.ehealthsummit.ie
http://mhealthinsight.com/2015/09/25/mhealth-insights-from-the-ehealth-summit/
Conducting a Summative Study of EHR Usability: Case Study (UXPA Boston)
At last year's conference, a group of us explored the complexity involved in evaluating the usability of Electronic Health Records: the wide range of user profiles and characteristics, a seemingly infinite number of tasks, and challenges in obtaining realistic data while respecting HIPAA regulations. In December, the usability team at athenahealth conducted a summative usability study of [product]. In this case study, Kris will discuss how the team navigated the challenges of summative EHR evaluation. Topics include task selection, recruiting, metric selection, logistics, and lessons learned.
Health Care: Cost Reductions through Data Insights - The Data Analysis Group (James Karis)
An overview of the cost reduction opportunities for a Health Care provider. These opportunities can be identified, quantified and optimised through data-driven insights. The slide pack also provides a strategic overview of how one would set up such a project within a large organisation, whilst mitigating patient-care concerns.
Nearly the Holy Grail – Clinical Portals for Faster, Better and Borderless Care (NHSScotlandEvent)
This session explores the piloting of a clinical portal giving clinicians across southern Scotland instant electronic access to patient data in order to deliver better, faster care.
North Tees and Hartlepool NHS Foundation Trust look to O2’s Casebook 3 to support Electronic Patient Records. NHS Trusts around the country face increasing challenges related to an ageing and increasing population. The balance of budgets and resources, against employee morale and patient care, has meant the need to explore how technology can be used to drive efficiencies and maximise clinician-patient facing time rather than admin time. Many NHS Trusts currently run at a deficit, and many are facing challenges and obstructions to achieving their performance targets, which means they are unable to access certain exemplar funding.
FPGA-Based Acceleration Architecture for Spark SQL with Qi Xie and Quanfu Wang (Spark Summit)
In this session we will present a configurable FPGA-based Spark SQL acceleration architecture. It targets leveraging FPGAs' highly parallel computing capability to accelerate Spark SQL queries and, thanks to FPGAs' higher power efficiency than CPUs, to lower power consumption at the same time. The architecture consists of SQL query decomposition algorithms and fine-grained FPGA-based engine units that perform basic computations of substring, arithmetic, and logic operations. Using the SQL query decomposition algorithm, we are able to decompose a complex SQL query into basic operations, each fed into an engine unit according to its pattern. SQL engine units are highly configurable and can be chained together to perform complex Spark SQL queries, so that one SQL query is transformed into a hardware pipeline. We will present performance benchmark results comparing queries on the FPGA-based Spark SQL acceleration architecture (XEON E5 plus FPGA) with Spark SQL queries on XEON E5 alone, showing 10X ~ 100X improvement, and we will demonstrate one SQL query workload from a real customer.
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M... (Spark Summit)
In this talk, we'll present techniques for visualizing large-scale machine learning systems in Spark. These are techniques employed by Netflix to understand and refine the machine learning models behind Netflix's famous recommender systems, which are used to personalize the Netflix experience for its 99 million members around the world. Essential to these techniques is Vegas, a new OSS Scala library that aims to be the "missing Matplotlib" for Spark/Scala. We'll talk about the design of Vegas and its usage in Scala notebooks to visualize machine learning models.
This presentation introduces how we design and implement a real-time processing platform using the latest Spark Structured Streaming framework to intelligently transform production lines in the manufacturing industry. A traditional production line has a variety of isolated structured, semi-structured, and unstructured data, such as sensor data, machine screen output, log output, and database records. There are two main data scenarios: 1) picture and video data, low frequency but large in size; 2) continuous high-frequency data, small per record but very large in total volume, such as vibration data used to detect equipment quality. These data have the characteristics of streaming data: real-time, volatile, bursty, disordered, and unbounded. Making effective real-time decisions to retrieve value from these data is critical to smart manufacturing. The latest Spark Structured Streaming framework greatly lowers the bar for building highly scalable and fault-tolerant streaming applications. Thanks to Spark, we were able to build a low-latency, high-throughput, and reliable operating system covering data acquisition, transmission, analysis, and storage. An actual user case proved that the system meets the needs of real-time decision-making. The system greatly enhances predictive fault repair and production-line material tracking efficiency, and can reduce the labor needed for the production lines by about half.
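As a hedged sketch of the kind of pipeline described (the source path, schema, and alert threshold are assumptions), a Structured Streaming job that windows high-frequency sensor data and flags anomalies could look like this:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("line-monitor").getOrCreate()

schema = StructType([
    StructField("machine_id", StringType()),
    StructField("ts", TimestampType()),
    StructField("vibration", DoubleType()),
])

alerts = (
    spark.readStream.schema(schema).json("/stream/sensors")  # assumed landing path
    .withWatermark("ts", "1 minute")
    .groupBy(F.window("ts", "30 seconds"), "machine_id")
    .agg(F.avg("vibration").alias("avg_vibration"))
    .where("avg_vibration > 0.8")  # assumed threshold for equipment wear
)

query = alerts.writeStream.outputMode("append").format("console").start()
query.awaitTermination()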
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra (Spark Summit)
As common sense would suggest, weather has a definite impact on traffic. But how much? And under what circumstances? Can we improve traffic (congestion) prediction given weather data? Predictive traffic is envisioned to significantly impact how drivers plan their day by alerting users before they travel, finding the best times to travel, and, over time, learning from new IoT data such as road conditions and incidents. This talk will cover the traffic prediction work conducted jointly by IBM and the traffic data provider. As part of this work, we conducted a case study over five large metropolitan areas in the US, 2.58 billion traffic records, and 262 million weather records, to quantify the boost in traffic prediction accuracy from weather data. We will provide an overview of our lambda architecture, with Apache Spark used to build prediction models from weather and traffic data, and Spark Streaming used to score the models and provide real-time traffic predictions. The talk will also cover a suite of extensions to Spark for analyzing geospatial and temporal patterns in traffic and weather data, as well as the machine learning algorithms used with the Spark framework. Initial results of this work were presented at the National Association of Broadcasters meeting in Las Vegas in April 2017, and there is work underway to scale the system to provide predictions in over 100 cities. The audience will learn about our experience scaling Spark in offline and streaming modes, building statistical and deep-learning pipelines with Spark, and techniques for working with geospatial and time-series data.
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem... (Spark Summit)
Graph is on the rise and it’s time to start learning about scalable graph analytics! In this session we will go over two Spark-based Graph Analytics frameworks: Tinkerpop and GraphFrames. While both frameworks can express very similar traversals, they have different performance characteristics and APIs. In this Deep-Dive by example presentation, we will demonstrate some common traversals and explain how, at a Spark level, each traversal is actually computed under the hood! Learn both the fluent Gremlin API as well as the powerful GraphFrame Motif api as we show examples of both simultaneously. No need to be familiar with Graphs or Spark for this presentation as we’ll be explaining everything from the ground up!
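For readers new to the motif API mentioned above, here is a minimal GraphFrames sketch (assuming the graphframes package is available, e.g. via --packages) that finds directed triangles; the toy graph is illustrative.

from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("motif-demo").getOrCreate()

v = spark.createDataFrame([("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
e = spark.createDataFrame([("a", "b"), ("b", "c"), ("c", "a")], ["src", "dst"])
g = GraphFrame(v, e)

# Motif syntax: named vertices and edges form a pattern to match against the graph.
triangles = g.find("(x)-[e1]->(y); (y)-[e2]->(z); (z)-[e3]->(x)")
triangles.show()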
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ... (Spark Summit)
Building accurate machine learning models has been an art of data scientists, i.e., algorithm selection, hyperparameter tuning, feature selection, and so on. Recently, efforts to break through these "black arts" have begun. In cooperation with our partner, NEC Laboratories America, we have developed a Spark-based automatic predictive modeling system. The system automatically searches for the best algorithm, parameters, and features without any manual work. In this talk, we will share how the automation system is designed to exploit the attractive advantages of Spark. An evaluation with real open data demonstrates that our system can explore hundreds of predictive models and discover the most accurate ones in minutes on an Ultra High Density Server, which employs 272 CPU cores, 2 TB of memory, and 17 TB of SSD in a 3U chassis. We will also share open challenges in learning such a massive number of models on Spark, particularly from reliability and stability standpoints. This talk will cover the presentation already shown at Spark Summit SF'17 (#SFds5), but from a more technical perspective.
Apache Spark and Tensorflow as a Service with Jim Dowling (Spark Summit)
In Sweden, from the RISE ICE Data Center at www.hops.site, we are providing researchers both Spark-as-a-Service and, more recently, Tensorflow-as-a-Service as part of the Hops platform. In this talk, we examine the different ways in which Tensorflow can be included in Spark workflows, from batch to streaming to structured streaming applications. We will analyse the different frameworks for integrating Spark with Tensorflow, from Tensorframes to TensorflowOnSpark to Databricks' Deep Learning Pipelines. We introduce the different programming models supported and highlight the importance of cluster support for managing different versions of Python libraries on behalf of users. We will also present cluster management support for sharing GPUs, including Mesos and YARN (in Hops Hadoop). Finally, we will perform a live demonstration of training and inference for a TensorflowOnSpark application written in Jupyter that can read data from either HDFS or Kafka, transform the data in Spark, and train a deep neural network on Tensorflow. We will show how to debug the application using both the Spark UI and Tensorboard, and how to examine logs and monitor training.
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library... (Spark Summit)
With the rapid growth of available datasets, it is imperative to have good tools for extracting insight from big data. The Spark ML library has excellent support for performing at-scale data processing and machine learning experiments, but more often than not, Data Scientists find themselves struggling with issues such as: low level data manipulation, lack of support for image processing, text analytics and deep learning, as well as the inability to use Spark alongside other popular machine learning libraries. To address these pain points, Microsoft recently released The Microsoft Machine Learning Library for Apache Spark (MMLSpark), an open-source machine learning library built on top of SparkML that seeks to simplify the data science process and integrate SparkML Pipelines with deep learning and computer vision libraries such as the Microsoft Cognitive Toolkit (CNTK) and OpenCV. With MMLSpark, Data Scientists can build models with 1/10th of the code through Pipeline objects that compose seamlessly with other parts of the SparkML ecosystem. In this session, we explore some of the main lessons learned from building MMLSpark. Join us if you would like to know how to extend Pipelines to ensure seamless integration with SparkML, how to auto-generate Python and R wrappers from Scala Transformers and Estimators, how to integrate and use previously non-distributed libraries in a distributed manner and how to efficiently deploy a Spark library across multiple platforms.
Next CERN Accelerator Logging Service with Jakub Wozniak (Spark Summit)
The Next Accelerator Logging Service (NXCALS) is a new Big Data project at CERN aiming to replace the existing Oracle-based service.
The main purpose of the system is to store and present Controls/Infrastructure related data gathered from thousands of devices in the whole accelerator complex.
The data is used to operate the machines, improve their performance and conduct studies for new beam types or future experiments.
During this talk, Jakub will speak about NXCALS requirements and the design choices that led to the selected architecture based on Hadoop and Spark. He will present the Ingestion API, the abstractions behind the Meta-data Service, and the Spark-based Extraction API, where simple changes to the schema handling greatly improved the overall usability of the system. The system itself is not CERN-specific and can be of interest to other companies or institutes confronted with similar Big Data problems.
Powering a Startup with Apache Spark with Kevin Kim (Spark Summit)
At Between (a mobile app for couples with 20M downloads globally), Spark powers everything from daily batches for extracting metrics to analysis and dashboards. Spark is widely used by engineers and data analysts at Between; thanks to its performance and extensibility, data operations have become extremely efficient. The entire team, including biz dev, global operations, and design, enjoys the data results, so Spark is empowering the whole company toward data-driven operation and thinking. Kevin, co-founder and data team leader at Between, will present how things are going at Between. Listeners will learn how a small and agile team lives with data (how we build the organization, culture, and technical base).
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—... (Spark Summit)
In many cases, Big Data becomes just another buzzword because of the lack of tools that can support both the technological requirements for developing and deploying projects and the fluency of communication between the different profiles of people involved in the projects.
In this talk, we will present Moriarty, a set of tools for fast prototyping of Big Data applications that can be deployed in an Apache Spark environment. These tools support the creation of Big Data workflows using existing functional blocks or the creation of new functional blocks. The created workflow can then be deployed on a Spark infrastructure and used through a REST API.
For a better understanding of Moriarty, the prototyping process, and the way it hides the Spark environment from Big Data users and developers, we will present it together with a couple of examples: one based on Industry 4.0 success cases and another on a logistics success case.
How Nielsen Utilized Databricks for Large-Scale Research and Development with... (Spark Summit)
Large-scale testing of new data products or enhancements to existing products in a research and development environment can be a technical challenge for data scientists. In some cases, tools available to data scientists lack production-level capacity, whereas other tools do not provide the algorithms needed to run the methodology. At Nielsen, the Databricks platform provided a solution to both of these challenges. This breakout session will cover a specific Nielsen business case where two methodology enhancements were developed and tested at large-scale using the Databricks platform. Development and large-scale testing of these enhancements would not have been possible using standard database tools.
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov... (Spark Summit)
Data lineage tracking is one of the significant problems that financial institutions face when using modern big data tools. This presentation describes Spline – a data lineage tracking and visualization tool for Apache Spark. Spline captures and stores lineage information from internal Spark execution plans and visualizes it in a user-friendly manner.
Goal Based Data Production with Sim Simeonov (Spark Summit)
Since the invention of SQL and relational databases, data production has been about specifying how data is transformed through queries. While Apache Spark can certainly be used as a general distributed query engine, the power and granularity of Spark’s APIs enables a revolutionary increase in data engineering productivity: goal-based data production. Goal-based data production concerns itself with specifying WHAT the desired result is, leaving the details of HOW the result is achieved to a smart data warehouse running on top of Spark. That not only substantially increases productivity, but also significantly expands the audience that can work directly with Spark: from developers and data scientists to technical business users. With specific data and architecture patterns spanning the range from ETL to machine learning data prep and with live demos, this session will demonstrate how Spark users can gain the benefits of goal-based data production.
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le... (Spark Summit)
Have you imagined a simple machine learning solution able to prevent revenue leakage and monitor your distributed application? To answer this question, we offer a practical and simple machine learning solution to create an intelligent monitoring application based on simple data analysis using Apache Spark MLlib. Our application uses linear regression models to make predictions and check whether the platform is experiencing any operational problems that could result in revenue losses. The application monitors distributed systems and provides notifications describing the problem detected, so users can act quickly to avoid serious problems that directly impact the company's revenue, reducing the time to action. We will present an architecture for not only a monitoring system but also an active actor in our outage recoveries. At the end of the presentation you will have access to our training program source code and will be able to adapt and implement it in your company. This solution has already helped prevent about US$3 million in losses in the last year.
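A hedged sketch of that monitoring idea (metric names and the 3-sigma alert rule are assumptions, not the talk's exact method): fit a linear model of the expected metric, then flag observations whose residual is too large.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("revenue-monitor").getOrCreate()

# Hypothetical hourly platform metrics.
df = spark.createDataFrame(
    [(9.0, 100.0, 1020.0), (10.0, 110.0, 1105.0), (11.0, 95.0, 960.0)],
    ["hour", "transactions", "revenue"],
)

assembled = VectorAssembler(
    inputCols=["hour", "transactions"], outputCol="features"
).transform(df)
model = LinearRegression(featuresCol="features", labelCol="revenue").fit(assembled)

# Alert when observed revenue deviates more than 3 sigma from the prediction.
sigma = model.summary.rootMeanSquaredError
alerts = model.transform(assembled).where(
    F.abs(F.col("revenue") - F.col("prediction")) > 3 * sigma
)
alerts.show()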
Getting Ready to Use Redis with Apache Spark with Dvir Volk (Spark Summit)
Getting Ready to Use Redis with Apache Spark is a technical tutorial designed to address integrating Redis with an Apache Spark deployment to increase the performance of serving complex decision models. To set the context for the session, we start with a quick introduction to Redis and the capabilities it provides. We cover the basic data types provided by Redis and cover the module system. Using an ad-serving use case, we look at how Redis can improve the performance and reduce the cost of using complex ML models in production. Attendees will be guided through the key steps of setting up and integrating Redis with Spark, including how to train a model using Spark and then load and serve it using Redis, as well as how to work with the Spark Redis module. The capabilities of the Redis Machine Learning Module (redis-ml) will be discussed, focusing primarily on decision trees and regression (linear and logistic), with code examples to demonstrate how to use these features. At the end of the session, developers should feel confident building a prototype/proof-of-concept application using Redis and Spark. Attendees will understand how Redis complements Spark and how to use Redis to serve complex ML models with high performance.
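A minimal sketch of that pairing, assuming the spark-redis connector is on the classpath (its DataFrame source is "org.apache.spark.sql.redis"); host, table, and key names are illustrative.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("redis-serving")
    .config("spark.redis.host", "localhost")
    .config("spark.redis.port", "6379")
    .getOrCreate()
)

# Hypothetical per-user ad scores produced by a Spark-trained model.
scores = spark.createDataFrame([("u1", 0.91), ("u2", 0.17)], ["user_id", "score"])

# Write scores to Redis so a low-latency serving layer can look them up by key.
(
    scores.write.format("org.apache.spark.sql.redis")
    .option("table", "ad_scores")
    .option("key.column", "user_id")
    .mode("overwrite")
    .save()
)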
Deduplication and Author-Disambiguation of Streaming Records via Supervised M... (Spark Summit)
Here we present a general supervised framework for record deduplication and author disambiguation via Spark. This work differentiates itself in three ways:
- The application of Databricks and AWS makes this a scalable implementation. Compute resources are considerably lower than with traditional legacy technology using big boxes 24/7. Scalability is crucial, as Elsevier's Scopus data, the biggest scientific abstract repository, covers roughly 250 million authorships from 70 million abstracts spanning a few hundred years.
- We create a fingerprint for each piece of content using deep learning and/or word2vec algorithms to expedite pairwise similarity calculation. These encoders substantially reduce compute time while maintaining semantic similarity (unlike traditional TF-IDF or predefined taxonomies). We will briefly discuss how to optimize word2vec training with high parallelization. Moreover, we show how these encoders can be used to derive a standard representation for all our entities, such as documents, authors, users, journals, etc. This standard representation can simplify the recommendation problem into a pairwise similarity search and hence offer a basic recommender for cross-product applications where we may not have a dedicated recommender engine designed.
- Traditional author-disambiguation or record-deduplication algorithms are batch processes with little to no training data. However, we have roughly 25 million authorships that are manually curated or corrected upon user feedback. Hence it is crucial to maintain historical profiles, and we have developed a machine learning implementation to deal with data streams and process them in mini-batches or one document at a time. We will discuss how to measure the accuracy of such a system, how to tune it, and how to process the raw pairwise-similarity output into final clusters. Lessons learned from this talk can help all sorts of companies that want to integrate their data or deduplicate their user/customer/product databases.
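To illustrate the fingerprint idea (only a sketch; the column names and similarity check are assumptions), Spark ML's Word2Vec can embed each record, and candidate pairs can then be compared by cosine similarity:

import numpy as np
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, Word2Vec

spark = SparkSession.builder.appName("dedup-fingerprints").getOrCreate()

docs = spark.createDataFrame(
    [(1, "deep learning for record linkage"),
     (2, "record linkage with deep learning"),
     (3, "matrix computations on spark")],
    ["id", "title"],
)

tokens = Tokenizer(inputCol="title", outputCol="words").transform(docs)
w2v = Word2Vec(vectorSize=50, minCount=0, inputCol="words", outputCol="vec")
fingerprints = w2v.fit(tokens).transform(tokens)  # one dense vector per record

# Compare two candidate records by cosine similarity of their fingerprints.
a, b = [r["vec"].toArray() for r in fingerprints.orderBy("id").limit(2).collect()]
print(float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))))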
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization... (Spark Summit)
The use of large-scale machine learning and data mining methods is becoming ubiquitous in many application domains, ranging from business intelligence and bioinformatics to self-driving cars. These methods heavily rely on matrix computations, and it is hence critical to make these computations scalable and efficient. Matrix computations are often complex and involve multiple steps that need to be optimized and sequenced properly for efficient execution. This work presents new efficient and scalable matrix processing and optimization techniques based on Spark. The proposed techniques estimate the sparsity of intermediate matrix-computation results and optimize communication costs. An evaluation plan generator for complex matrix computations is introduced, as well as a distributed plan optimizer that exploits dynamic cost-based analysis and rule-based heuristics. The result of a matrix operation will often serve as an input to another matrix operation, thus defining the matrix data dependencies within a matrix program. The matrix query plan generator produces query execution plans that minimize memory usage and communication overhead by partitioning the matrix based on the data dependencies in the execution plan. We implemented the proposed matrix techniques inside Spark SQL and optimize the matrix execution plan based on the Spark SQL Catalyst. We conduct case studies on a series of ML models and matrix computations with special features on different datasets: PageRank, GNMF, BFGS, sparse matrix chain multiplications, and a biological data analysis. The open-source library ScaLAPACK and the array-based database SciDB are used for performance evaluation. Our experiments are performed on six real-world datasets: social network data (e.g., soc-pokec, cit-Patents, LiveJournal), Twitter2010, Netflix recommendation data, and a 1000 Genomes Project sample. Experiments demonstrate that our proposed techniques achieve up to an order-of-magnitude performance improvement.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Opendatabay - Open Data Marketplace.pptx (Opendatabay)
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Adjusting primitives for graph : SHORT REPORT / NOTES (Subhajit Sahu)
Notes on adjusting primitives for graph algorithms, such as PageRank. Compressed Sparse Row (CSR) is an adjacency-list-based graph representation.
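A tiny illustration of the CSR layout (plain Python, independent of the repository's actual code): adjacency lists packed into one flat edge array plus per-vertex offsets.

# Example digraph with 4 vertices and edges 0->1, 0->2, 1->2, 3->0.
offsets = [0, 2, 3, 3, 4]  # offsets[v]..offsets[v+1] delimit v's neighbours
edges = [1, 2, 2, 0]       # all adjacency lists concatenated

def neighbours(v):
    return edges[offsets[v]:offsets[v + 1]]

print(neighbours(0))  # [1, 2]
print(neighbours(2))  # [] -- vertex 2 has no out-edges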
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
StarCompliance is a leading firm specializing in the recovery of stolen cryptocurrency. Our comprehensive services are designed to assist individuals and organizations in navigating the complex process of fraud reporting, investigation, and fund recovery. We combine cutting-edge technology with expert legal support to provide a robust solution for victims of crypto theft.
Our Services Include:
Reporting to Tracking Authorities:
We immediately notify all relevant centralized exchanges (CEX), decentralized exchanges (DEX), and wallet providers about the stolen cryptocurrency. This ensures that the stolen assets are flagged as scam transactions, making it impossible for the thief to use them.
Assistance with Filing Police Reports:
We guide you through the process of filing a valid police report. Our support team provides detailed instructions on which police department to contact and helps you complete the necessary paperwork within the critical 72-hour window.
Launching the Refund Process:
Our team of experienced lawyers can initiate lawsuits on your behalf and represent you in various jurisdictions around the world. They work diligently to recover your stolen funds and ensure that justice is served.
At StarCompliance, we understand the urgency and stress involved in dealing with cryptocurrency theft. Our dedicated team works quickly and efficiently to provide you with the support and expertise needed to recover your assets. Trust us to be your partner in navigating the complexities of the crypto world and safeguarding your investments.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
2. About Ayad Shammout
• Director of Business Intelligence, Beth Israel Deaconess Medical Center
• Helped build highly available / disaster recovery infrastructure for BIDMC
3. About Denny Lee
• Technology Evangelist, Databricks
• Former Sr. Director of Data Sciences Engineering, Concur
• Helped bring Hadoop onto Windows and Azure
4. About Databricks
• Founded by Apache Spark creators
• Largest contributor to the Spark project, committed to keeping Spark 100% open source
• Databricks is an end-to-end hosted platform
5. Time is an OR's most valuable resource
• $15-$20 / minute for a basic surgical procedure
• Lack of OR availability means loss of patients
• OR efficiency differs depending on the OR staffing and allocation (8, 10, 13, or 16 h), not the workload (i.e., cases)
6. "You are not going to get the elephant to shrink or change its size. You need to face the fact that the elephant is 8 OR tall and 11hr wide" - Steven Shafer, MD
7. Operating Room
• Better utilization = better profit margins
• Reduced support and maintenance costs
Medical Staff
• Better utilization = better profit margins
• Better medical staff efficiencies = better outcomes
Patients
• Shorter wait times and fewer cancellations
• Better medical staff efficiencies = better outcomes
8. Develop Predictive Model
• Develop a predictive model that would identify available OR time two weeks in advance.
• Allow us to confirm wait-list cases two weeks in advance, instead of when the blocks normally release four days out.
9. Forecast OR Schedule
• Case load three weeks in advance
• Book more cases weeks in advance to prevent under-utilization
• Reduce staff overtime and idle time
10. Background
• Three surgical pools:
  – GYN, urology, general surgery, colorectal, surgical oncology
  – Eyes, plastics, ENT
  – Orthopedics, podiatry
• Currently built using SQL Server Data Mining
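Not from the deck itself, but a hedged sketch of how such a per-pool forecast could be expressed on Spark's DataFrame-based ML API; the input table and feature columns are assumptions for illustration.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("or-pool-forecast").getOrCreate()

blocks = spark.read.parquet("/data/or_blocks")  # assumed historical block table

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="surgical_pool", outputCol="pool_idx"),
    VectorAssembler(inputCols=["pool_idx", "day_of_week", "scheduled_cases"],
                    outputCol="features"),
    LinearRegression(featuresCol="features", labelCol="unused_minutes"),
])
model = pipeline.fit(blocks)  # predictions feed the two-week wait-list confirmations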
22. Why the model is working
• Can coordinate waitlist scheduling logistics with physicians and patients within two weeks of the surgery
• Plan staff scheduling and resources so there are fewer last-minute staffing issues for nursing and anesthesia
• Utilization metrics are showing us where we can maximize our elective surgical schedule and level demand