Bootstrapping of PySpark Models for Factorial A/B Tests (Databricks)
A/B testing, i.e., measuring the impact of proposed variants of e.g. e-commerce websites, is fundamental for increasing conversion rates and other key business metrics.
We have developed a solution that makes it possible to run dozens of simultaneous A/B tests, obtain conclusive results sooner, and get results that are more interpretable than mere statistical significance: the probability that the change has a positive effect, how much revenue is at risk, and so on.
To compute those metrics, we need to estimate the posterior distributions of the metrics, which are computed using Generalized Linear Models (GLMs). Since we process gigabytes of data, we use a PySpark implementation, which, however, does not provide standard errors of the coefficients. We therefore use bootstrapping to estimate the distributions.
In this talk, I’ll describe how we’ve parallelized an already parallelized GLM computation so that it scales horizontally over a large Databricks cluster, and I’ll cover various tweaks and how they’ve improved performance.
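Since the talk's own PySpark code is not included here, the core bootstrap idea can be sketched in plain Python (a minimal illustration with hypothetical toy data, not the Databricks implementation described above):

```python
import random
import statistics

def bootstrap_effect(control, treatment, n_boot=1000, seed=42):
    """Estimate the sampling distribution of the difference in means
    by resampling each group with replacement."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        c = [rng.choice(control) for _ in control]
        t = [rng.choice(treatment) for _ in treatment]
        diffs.append(statistics.mean(t) - statistics.mean(c))
    return diffs

# Hypothetical toy data: conversion indicators for two variants.
control = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
treatment = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]

diffs = bootstrap_effect(control, treatment)
# Probability that the change has a positive effect,
# read off directly from the bootstrap distribution.
p_positive = sum(d > 0 for d in diffs) / len(diffs)
```

The same resample-and-refit loop applies to GLM coefficients; the talk's contribution is distributing that loop across a cluster.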
Churn is Dead, Long Live Net Dollar Retention, SaaStr Annual @ Home, SaaStr 2... (Dave Kellogg)
A slightly revised version of my presentation at SaaStr Annual 2020, which focuses on understanding a SaaS business from a metrics viewpoint, with a particular focus on the health of the installed base as measured by churn rates and net dollar retention rates.
Identifying your North Star Metric & Building a Model to Predict Growth (GrowthHackers)
A few things are undisputed when it comes to sustainably growing a company. If your product is providing value to customers and is considered a 'must-have', then the next step is laying the groundwork for sustainable growth.
During this GrowthHackers original webinar, Sean Ellis of GrowthHackers and Chris More of Mozilla help you build the foundations of:
1. Identifying your company's North Star Metric
2. Building a model to predict your company's growth
For help identifying your North Star Metric, visit: https://blog.growthhackers.com/what-is-a-north-star-metric-b31a8512923f
3 Proven Sales Email Templates Used by Successful Companies (HubSpot)
76% of emails never get opened. That makes life for salespeople very difficult. So we've partnered up with Breakthrough Email to bring you email templates that are proven to engage prospects and close more deals. Start using them today and grow your revenue.
Data Visualization for Management Consultants & Analysts (Asen Gyczew)
What is the aim of this course?
In consulting you will spend a lot of time creating presentations to show the results of your analyses to the customer. That is why data visualization is so important. With a proper display of data, you have a better chance of convincing customers that your approach makes sense. In this course I will teach you how to use different data visualization techniques to show the results of your analyses during consulting projects.
In the course you will learn the following things:
1. What types of slides you should use to present your thoughts
2. What types of charts you should use for data visualization
3. How to read the charts
4. How to create charts in Excel
5. How to create charts in PowerPoint
6. How to create dynamic charts in Excel
For more, check the following course: https://bit.ly/DataVisualizationMC
Getting into consulting is one of the most difficult tasks. The recruitment process is not only very selective but also quite tough and long. You will have at least 5 job interviews, spend on average 2-3 months in the recruiting process, and your chances of succeeding will be around 5-10%. I will help you significantly boost the odds in your favor.
This course will help you prepare for the cases that you will be asked to solve during job interviews with consultants. I will improve your knowledge and skills in analysis through a series of practical cases. It is based on my 11 years of experience as a consultant in top consulting companies and as a Board Member responsible for strategy, improvement, and turnarounds in the biggest companies from the FMCG, SMG, and B2B sectors that I worked for. I have participated in over 200 recruitments, and the materials in this course encompass all the tricks that you should use during the interview. On the basis of what you will find in this course, I have trained over 100 business analysts who are now Investment Directors, Senior Analysts, Directors in consulting companies, Board Members, etc.
There is little theory – mainly examples, a lot of tips from my own experience as well as other notable examples worth mentioning. My intention is that thanks to the course you will know:
1. How to approach any type of case that you may come across in the job interview
2. How to apply the most useful concepts and methods used later by consultants in their work
3. How to be efficient during the interview
4. How the consultant's mind works
Presentation given at the IIR Business Process Management Conference, San Diego, CA, November 13th, 2007. It focuses on the difference between rules and processes, the integration points of BPMS and BRMS, and ways to get started.
The Business Model of Consulting is Dead. Hourly rates are in conflict with customer solutions. How can you create a model that adds value to today's customer needs? Here are the slides of a keynote I gave in Kiev, Ukraine.
Silicon Valley’s Tools for Translating Startup Ideas into Billion Dollar Comp... (Rod King, Ph.D.)
This presentation features the POKER-Scorecard, a shared language and platform for presenting and applying any business tool, especially those used in Silicon Valley.
E-mail Restart 2023: Jakub Malý - Gamification, omnichannel communication and kineti... (Taste)
Practical examples with concrete numbers showing how, thanks to gamification and follow-up omnichannel communication, we managed to energize our clients' e-mailing, differentiate it from the competition, and achieve better results and deeper relationships with customers.
Essential Finance & Accounting for Management Consultants and Business Analysts (Asen Gyczew)
During many consulting projects you will have to analyze financial statements (balance sheet, income statement, cash flows) and draw conclusions about a specific company. This is especially true during due diligence, strategic projects, and turnarounds. Financial analyses require a relatively good understanding of finance and accounting. Business Analysts and Management Consultants who did not study Finance or Business tend to have some problems navigating this area. This course will help you overcome this problem. Those of you who have finished Business School or Economics will find here a great refresher with a lot of practical tips on how to do certain analyses during a consulting project.
This course will help you drastically improve your knowledge and skills in finance and accounting. It is designed for people who are or want to become management consultants or business analysts. In the course you will learn 5 main things:
1. How to read financial statements such as a balance sheet, an income (profit & loss) statement, cash flows
2. How to draw conclusions from financial statements
3. Main principles of accounting
4. How to analyze financial indicators
5. How to estimate the value of the firm / do valuation
I will NOT teach you everything about finance & accounting because it is simply not efficient (and frankly you don’t need it). This course is organized around the 80/20 rule, and I want to teach you the most useful things (from a business analyst / consultant perspective) that will enable you to understand and analyze financial data.
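As a hedged illustration of the valuation point above, a bare-bones discounted cash flow (DCF) estimate, one standard way to estimate the value of a firm, might look like the following sketch (hypothetical numbers, not material from the course):

```python
def dcf_value(cash_flows, discount_rate, terminal_growth):
    """Discounted cash flow: present value of forecast cash flows
    plus a Gordon-growth terminal value discounted back to today."""
    pv = sum(cf / (1 + discount_rate) ** t
             for t, cf in enumerate(cash_flows, start=1))
    terminal = (cash_flows[-1] * (1 + terminal_growth)
                / (discount_rate - terminal_growth))
    pv_terminal = terminal / (1 + discount_rate) ** len(cash_flows)
    return pv + pv_terminal

# Hypothetical firm: flat cash flow of 100 for 3 years,
# 10% discount rate, 2% perpetual growth thereafter.
value = dcf_value([100, 100, 100], 0.10, 0.02)
```

Note how the terminal value dominates the result, which is why the growth and discount assumptions deserve the most scrutiny.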
Essential Real Estate Modeling in Excel (Asen Gyczew)
What is the aim of this course?
Modeling real estate in Excel is not easy. In real estate, you will have to forecast cash flow over long investment periods. In most models, you will have to forecast loan payments, maintenance costs, operational costs, and potential revenues from operating the real estate. On top of that, in real estate we have different business models with different business drivers. To make your work easier, in this course I will teach you how to model real estate investments in Excel quickly and efficiently.
In this course you will learn the following things:
1. Essential concepts related to real estate and modeling them in Excel
2. What are the main drivers of the profits in real estate for different business models
3. How to model buy & rent of real estate in Excel
4. How to model in Excel hotels and other short-term renting businesses (Airbnb, booking.com)
5. How to model in Excel flips/flipping
6. How to make decisions on the investment in real estate, based on the Excel model
For more information check my online course: https://bit.ly/RealEstateModels
Essential Excel for Business Analysts and Consultants (Asen Gyczew)
Excel is the first-choice tool of almost every business analyst and consultant. Maybe it is not the fanciest or most sophisticated tool, yet it is universally understood by everybody, especially your boss and your customers.
Excel is still a pretty advanced tool with countless features and functions. I mastered quite a lot of them during my studies and while working. After some time in consulting I discovered that most of them are not that useful; some of them bring more problems than solutions. On top of that, some features we are taught at university are inflexible and pretty time-consuming. While working as a business analyst I developed my own set of tricks for Excel and learned how to make my analyses idiot-proof and extremely universal.
I will NOT teach you all of Excel, as that is simply not efficient (and frankly you don’t need it). This course is organized around the 80/20 rule, and I want to teach you the most useful formulas (from a business analyst / consultant perspective) as fast as possible. I also want you to acquire good habits in Excel that will save you loads of time.
If done properly, this course will transform you in 1 day into a pretty good business analyst who knows how to use Excel in a smart way. It is based on my 12 years of experience as a consultant in top consulting companies and as a Board Member responsible for strategy, improvement, and turnarounds in the biggest companies from the FMCG, SMG, and B2B sectors that I worked for. On the basis of what you will find in this course I have trained over 100 business analysts who are now Investment Directors, Senior Analysts, Directors in consulting companies, Board Members, etc.
I teach step by step on the basis of Excel files attached to the course. To get the most out of the course you should follow my steps and repeat what I do with the data after every lecture. Don’t move on to the next lecture until you have done what I show in the lecture you have just gone through.
I assume that you know basic Excel, so basic features (e.g., how to write a formula in Excel) are not explained in this course. I concentrate on intermediate and advanced solutions and purposefully leave out some things that are advanced yet later become very inflexible and useless (e.g., naming variables). At the end, I will show 4 full-blown analyses in Excel that use the tricks I show in the lectures.
Attached to every lecture (in additional resources) you will find the Excel file shown in that lecture, so as part of this course you will also get a library of ready-made analyses that, with certain modifications, you can apply in your own work.
As a business analyst or a management consultant you have to master some vital skills that will help you survive the first 2-3 years of your work. Some of them have to do with tools (Excel, presentations), others with your productivity and skills (financial modeling, business modeling). You also have to learn tools, techniques, and frameworks that will help you understand the business faster and provide more value. In this presentation you will find an overview of the most important things that you have to master, along with suggested courses that I recommend taking. They will help you become more efficient and develop your skills and knowledge.
This slide deck covers why primary market research (aka customer development, customer research or customer empathy) is important and necessary, outlines how to organize a successful research program, provides a sampling of common qualitative and quantitative primary market research techniques, and provides an FAQ section on common questions.
Liquidity Management for Management Consultants & ManagersAsen Gyczew
Many companies, despite having profits, still have problems with cash. In other words, they have to improve their liquidity. This topic is not widely discussed, and few management consultants or managers know how to do it in practice. As a result, more companies die because of liquidity problems than because of low profits. Luckily, there are a lot of techniques that will help you look for ways to generate more cash from the business in a structured way.
In this course I will show you different methods that you can use to improve the cash position of your business. You will learn the following things:
1. How to identify potential ways to improve your liquidity, especially quick wins
2. How to estimate in Excel how much cash you can generate from a specific solution
3. How to pick the optimal solution
4. How to prevent the firm from going bankrupt due to liquidity problems
You will learn here how to cut costs, reduce inventory, reduce receivables, improve payables, restructure debt, and much more.
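As one hedged example of the kind of quick estimate described in point 2, the cash freed by collecting receivables faster can be computed as daily revenue times the reduction in days sales outstanding (DSO); the figures below are hypothetical:

```python
def cash_freed_by_dso_reduction(annual_revenue, dso_days_before, dso_days_after):
    """Cash released by collecting receivables faster:
    daily revenue times the reduction in days sales outstanding (DSO)."""
    daily_revenue = annual_revenue / 365.0
    return daily_revenue * (dso_days_before - dso_days_after)

# Hypothetical firm: $36.5M annual revenue, DSO cut from 60 to 45 days.
freed = cash_freed_by_dso_reduction(36_500_000, 60, 45)
```

The same daily-rate logic applies to inventory days and payables days, which is why these levers are usually analyzed together as the cash conversion cycle.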
For more check the following course: http://bit.ly/LiquidityManagementConsulting
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi... (Databricks)
“As Apache Spark becomes more widely adopted, we have focused on creating higher-level APIs that provide increased opportunities for automatic optimization. In this talk, I give an overview of some of the exciting new APIs available in Spark 2.0, namely Datasets and Structured Streaming. Together, these APIs are bringing the power of Catalyst, Spark SQL's query optimizer, to all users of Spark. I'll focus on specific examples of how developers can build their analyses more quickly and efficiently simply by providing Spark with more information about what they are trying to accomplish.” - Michael
Databricks Blog: "Deep Dive into Spark SQL’s Catalyst Optimizer"
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
// About the Presenter //
Michael Armbrust is the lead developer of the Spark SQL project at Databricks. He received his PhD from UC Berkeley in 2013, and was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage and query optimization.
Follow Michael on -
Twitter: https://twitter.com/michaelarmbrust
LinkedIn: https://www.linkedin.com/in/michaelarmbrust
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis (Jason Riedy)
Applications in many areas analyze an ever-changing environment. On billion-vertex graphs, providing snapshots imposes a large performance cost. We propose the first formal model for graph analysis running concurrently with streaming data updates. We consider an algorithm valid if its output is correct for the initial graph plus some implicit subset of the concurrent changes. We show theoretical properties of the model, demonstrate the model on various algorithms, and extend it to updating results incrementally.
ONS Local has been established by the Office for National Statistics (ONS) to support evidence-based decision-making at the local level. We aim to host insightful events that connect our users with exciting developments happening in subnational statistics and analysis at the ONS and across other organisations.
Have you ever wondered which local authorities are similar to each other? This presentation discusses cluster analysis the ONS has published to draw insight into which local authorities are performing in a similar way against key policy themes, promoting greater joined-up working between local authorities with similar characteristics to address the common problems they face. Our analysis also provides local authorities with control groups for investigating the impact of policy interventions.
In this webinar, we will cover the methods used to create our outputs, demonstrate some of our findings in our interactive visualisation tool and present information on our future plans to expand on this work.
This event is open to all; however, we anticipate it will be of most interest to anyone working at a local level, or with data on the policy themes of economy, transport connectivity, education, skills, health, and wellbeing.
If you have any questions, please contact ons.local@ons.gov.uk
Big data refers to data sets that are so voluminous and complex that traditional data processing application software is inadequate to deal with them. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, and information privacy.
BigDataTraining.IN, which offers Hadoop training in Chennai, is a leading global talent development corporation building a skilled manpower pool for global industry requirements. It has grown to be among the world's leading talent development companies, offering learning solutions to individuals, institutions, and corporate clients.
Enhancing and Automating Decision Making with Machine Learning - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
Despite the existence of data analysis tools such as R, SQL, Excel, and others, they are still insufficient to cope with today's big data analysis needs.
The author proposes a CUI (Character User Interface) toolset with dozens of functions to neatly handle tabular data in TSV (Tab-Separated Values) files.
It implements many basic and useful functions that have not been implemented in existing software, with each function borrowing ideas from the Unix philosophy and covering the most frequent pre-analysis tasks during the initial exploratory stage of data analysis projects.
Also, it greatly speeds up basic analysis tasks, such as drawing cross tables, Venn diagrams, etc., while existing software inevitably requires rather complicated programming and debugging processes for even these basic tasks.
Here, tabular data mainly means TSV (Tab-Separated Values) files as well as other CSV (Comma Separated Value)-type files which are all widely used for storing data and suitable for data analysis.
Euro30 2019 - Benchmarking tree approaches on street dataFabion Kauker
By examining the use of algorithms to solve the Prize-Collecting Steiner Tree (PCST) problem, we consider the facets that determine effectiveness; specifically, we measure a number of solution approaches and compare them based on metrics. In order to understand a solution approach, we must assess why it is useful. Our goal is to determine the effectiveness of Mixed Integer Programming (MIP) and heuristic methods. Utilizing freely available street and address data, a base graph representation is created and then computed on, such that a tree connects every address using the minimum total length of edges from the street network. This is the basis of many approaches used to solve infrastructure problems, including telecommunications network design and costing. The analysis is conducted on methods developed by Hegde et al. 2015, Ljubić et al. 2006, and Teitz et al. 1963. We present a data processing architecture, as well as a concise set of results and a framework for assessing the facets and trade-offs of a given approach. In this case the heuristic approaches prove advantageous in the simple case but fail when more complex requirements are added. This is where the MIP approach is able to capitalize, albeit at the cost of flexibility due to the strictness and specificity of its modelling.
Similar to Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts (20)
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaJen Aman
2017 continues to be an exciting year for Apache Spark. I will talk about new updates in two major areas in the Spark community this year: stream processing with Structured Streaming, and deep learning with high-level libraries such as Deep Learning Pipelines and TensorFlowOnSpark. In both areas, the community is making powerful new functionality available in the same high-level APIs used in the rest of the Spark ecosystem (e.g., DataFrames and ML Pipelines), and improving both the scalability and ease of use of stream processing and machine learning.
Snorkel: Dark Data and Machine Learning with Christopher RéJen Aman
Building applications that can read and analyze a wide variety of data may change the way we do science and make business decisions. However, building such applications is challenging: real world data is expressed in natural language, images, or other “dark” data formats which are fraught with imprecision and ambiguity and so are difficult for machines to understand. This talk will describe Snorkel, whose goal is to make routine Dark Data and other prediction tasks dramatically easier. At its core, Snorkel focuses on a key bottleneck in the development of machine learning systems: the lack of large training datasets. In Snorkel, a user implicitly creates large training sets by writing simple programs that label data, instead of performing manual feature engineering or tedious hand-labeling of individual data items. We’ll provide a set of tutorials that will allow folks to write Snorkel applications that use Spark.
Snorkel is open source on github and available from Snorkel.Stanford.edu.
Deep Learning on Apache® Spark™: Workflows and Best PracticesJen Aman
The combination of Deep Learning with Apache Spark has the potential for tremendous impact in many sectors of the industry. This webinar, based on the experience gained in assisting customers with the Databricks Virtual Analytics Platform, will present some best practices for building deep learning pipelines with Spark.
Rather than comparing deep learning systems or specific optimizations, this webinar will focus on issues that are common to deep learning frameworks when running on a Spark cluster, including:
* optimizing cluster setup;
* configuring the cluster;
* ingesting data; and
* monitoring long-running jobs.
We will demonstrate the techniques we cover using Google’s popular TensorFlow library. More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters.
Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput, and monitoring facilitates both the work of configuration and the stability of deep learning jobs.
RISELab:Enabling Intelligent Real-Time DecisionsJen Aman
Spark Summit East Keynote by Ion Stoica
A long-standing grand challenge in computing is to enable machines to act autonomously and intelligently: to rapidly and repeatedly take appropriate actions based on information in the world around them. To address this challenge, at UC Berkeley we are starting a new five year effort that focuses on the development of data-intensive systems that provide Real-Time Intelligence with Secure Execution (RISE). Following in the footsteps of AMPLab, RISELab is an interdisciplinary effort bringing together researchers across AI, robotics, security, and data systems. In this talk I’ll present our research vision and then discuss some of the applications that will be enabled by RISE technologies.
Learn SQL from basic queries to advanced queriesmanishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Analysis insight about a Flyball dog competition team's performanceroli9797
Insights from my analysis of a Flyball dog competition team's performance over the last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Techniques to optimize the PageRank algorithm usually fall into two categories. One tries to reduce the work per iteration, and the other tries to reduce the number of iterations; these goals are often at odds with one another. Skipping computation on vertices which have already converged can save iteration time. Skipping in-identical vertices, i.e. those with the same in-links, helps avoid duplicate computations and thus can reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can then be calculated directly. This can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
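As a rough illustration (not from the notes above), the skip-converged optimization can be sketched in plain Python as follows; the function name and graph representation are invented, and note that a vertex marked converged is frozen even if its in-neighbors later move, which is the usual trade-off of this technique:

```python
def pagerank_skip_converged(out_links, d=0.85, tol=1e-6, max_iter=100):
    """Power-iteration PageRank that stops recomputing vertices whose
    rank change has fallen below tol (the skip-converged optimization).
    out_links maps each vertex to the list of vertices it links to."""
    n = len(out_links)
    ranks = {v: 1.0 / n for v in out_links}
    # Precompute the reverse adjacency (in-links) once.
    in_links = {v: [] for v in out_links}
    for u, outs in out_links.items():
        for v in outs:
            in_links[v].append(u)
    converged = set()
    for _ in range(max_iter):
        new_ranks = {}
        for v in out_links:
            if v in converged:
                new_ranks[v] = ranks[v]   # skip work for settled vertices
                continue
            s = sum(ranks[u] / len(out_links[u]) for u in in_links[v])
            new_ranks[v] = (1 - d) / n + d * s
            if abs(new_ranks[v] - ranks[v]) < tol:
                converged.add(v)
        ranks = new_ranks
        if len(converged) == n:
            break
    return ranks
```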
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank, commonly run on Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
3. What is IHME?
• Independent research center at the University of Washington
• Core funding by the Bill & Melinda Gates Foundation and the State of Washington
• ~300 faculty, researchers, students and staff
• Providing rigorous, scientific measurement:
  o What are the world’s major health problems?
  o How well is society addressing these problems?
  o How should we best dedicate resources to improving health?
Goal: improve health by providing the best information on population health
4. Global Burden of Disease
• A systematic scientific effort to quantify the comparative magnitude of health loss due to diseases, injuries and risk factors:
  o 188 countries
  o 300+ diseases
  o 1990 – 2015+
  o 50+ risk factors
11. Forecasting the Burden of Disease
1. Generate a baseline scenario projecting the GBD 25 years into the future
  o Mortality, morbidity, population, and risk factors
  o Every country, cause, sex, age
2. Create a comprehensive simulation framework to assess alternative scenarios
  o Capture the complex interrelationships between risk factors, interventions, and diseases to explore the effects of changes to the system
3. Build a modular and flexible platform that can incorporate new models to answer detailed “what if?” questions
  o E.g. introduction of a new vaccine, effects of global warming, scale-up of ART coverage, risk of global pandemics, etc.
12. Simulations
• Takes into account interdependencies between risk factors, diseases, etc.
  o Use draws of model parameters to advance from t to t+1, allowing forecasts of other quantities (e.g. risk factors and mortality) to interact
• Modular structure allows for detailed “what ifs”
  o E.g. scale-up of coverage of a new intervention
• Allows us to incorporate alternative models to test sensitivity to model choice and specification
13. Basic Simulation Process
1. Fit statistical models (PyMB) capturing the most important relationships as statistical parameters
2. Generate many draws from those parameters, reflecting their uncertainty and correlation structure
3. Run Monte Carlo simulations to update each quantity over time, taking into account its dependencies
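The three steps above can be sketched end-to-end with numpy alone. In this sketch the parameter means, covariance, and the reading of the two parameters as a GDP growth rate and a mortality trend are illustrative stand-ins for the output of a real statistical fit, not IHME's actual models:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 2: draw correlated parameter sets standing in for a fitted
# posterior (step 1). mean/cov are invented for illustration.
mean = np.array([0.02, -0.01])          # "GDP growth", "mortality trend"
cov = np.array([[1e-4, -2e-5],
                [-2e-5, 1e-4]])
draws = rng.multivariate_normal(mean, cov, size=1000)   # (draw, param)

# Step 3: Monte Carlo update from t to t+1 per draw, so uncertainty and
# the correlation between quantities propagate through the forecast.
years = 25
gdp = np.empty((1000, years + 1))
mort = np.empty((1000, years + 1))
gdp[:, 0] = 1.0
mort[:, 0] = 1.0
for t in range(years):
    gdp[:, t + 1] = gdp[:, t] * (1 + draws[:, 0])
    mort[:, t + 1] = mort[:, t] * (1 + draws[:, 1])

# Forecast summaries come from the empirical distribution over draws.
gdp_mean = gdp[:, -1].mean()
gdp_lo, gdp_hi = np.percentile(gdp[:, -1], [2.5, 97.5])
```

Because each draw carries a consistent parameter set across quantities, correlations between, say, growth and mortality survive into the joint forecast distribution.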
15. SimBuilder
• Directed Acyclic Graph (DAG) construction and validation
  o yaml and sympy specification
  o curses interface
  o graphviz visualization
• Flexible execution backends
  o pandas
  o pyspark
16. Example Global Health DAG
[Figure: a two-year DAG with nodes for GDP, Deaths, Births, Population and Workforce in 2016 feeding the corresponding nodes in 2017]
17. Creating a DAG with SimBuilder
1. Create a YAML file for the overall simulation
  o Specify child models and the dimensions of the simulation
2. Build separate YAML files for each child model
  o Dimensions along which the model varies
  o Location of historical data
  o Expression for how to generate the quantity in t+1
3. Run SimBuilder to validate and explore the DAG
18. Creating a DAG with SimBuilder

name: gdp_pop
models:
  - name: gdp_per_capita
  - name: workforce
  - name: mortality
  - name: gdp
  - name: pop
global_parameters:
  - name: loc
    set: List
    values: ["USA", "MEX", "CAN", ..., ]
  - name: t
    set: Range
    start: 2014
    end: 2040
  - name: draw
    set: Range
    start: 0
    end: 999

sim.yaml
19. Creating a DAG with SimBuilder

name: pop
version: 0.1
variables: [loc, t]
history:
  - type: csv
    path: pop.csv

pop.yaml
20. Creating a DAG with SimBuilder

name: gdp_per_capita
version: 0.1
expr: gdp(draw, t, loc)/pop(draw, t-1, loc)
variables: [draw, t, loc]
history:
  - type: csv
    path: gdp_pc_w_draw.csv

gdp_per_capita.yaml
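The expr field above is what lets the DAG builder recover each node's parents and time lags. A minimal sketch of that extraction, using the stdlib ast module in place of the deck's sympy to show the idea; the function name parent_lags and the assumption that time is always the second argument are mine:

```python
import ast

def parent_lags(expr_src):
    """Extract parent model names and their time offsets from an
    expression string like the `expr` field in gdp_per_capita.yaml.
    (SimBuilder itself uses sympy; ast is used here for illustration.)"""
    tree = ast.parse(expr_src, mode="eval")
    parents = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            t_arg = node.args[1]           # assume time is the 2nd argument
            lag = 0                        # plain `t` means no lag
            if isinstance(t_arg, ast.BinOp) and isinstance(t_arg.op, ast.Sub):
                lag = -t_arg.right.value   # `t - 1` gives a lag of -1
            parents[node.func.id] = lag
    return parents
```

A lag of -1 on pop is exactly the kind of cross-year edge drawn in the example DAG (a 2017 node depending on a 2016 node).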
26. SimBuilder Backends
• SimBuilder traverses the DAG
  o Backends plug in via an API
• Backends provide:
  o Data loading strategy
  o Methods to combine data from different nodes to calculate new nodes
  o Method to save computed data
27. Local Pandas Backend
• Useful for debugging and small simulations
• Master process maintains a list of Model objects:
  o pandas DataFrames of all model data, indexed by global variables
  o sympy expression for how to calculate t+1, vectorized when possible with fallback to row-wise apply
• Simulating over time traverses the DAG to join the necessary DataFrames and compute results to fill in new columns
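The join-then-vectorize step of the local backend might look like this in pandas; the model names, columns, and numbers are illustrative, not real data:

```python
import pandas as pd

# Two parent models' data, indexed by the global variables (loc, t).
gdp = pd.DataFrame({"loc": ["USA", "MEX"], "t": [2016, 2016],
                    "gdp": [18.7, 1.1]})
pop = pd.DataFrame({"loc": ["USA", "MEX"], "t": [2016, 2016],
                    "pop": [323.0, 127.5]})

# Join on the index variables, then evaluate the model expression as a
# vectorized column operation (no row-wise apply needed here).
merged = gdp.merge(pop, on=["loc", "t"])
merged["gdp_per_capita"] = merged["gdp"] / merged["pop"]
```

Simulating forward would repeat this per time step, appending t+1 rows computed from the joined frame.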
28. Spark DataFrame Backend
• Similar to the local backend, swapping in Spark’s DataFrame for pandas’
• Spark context maintains a list of Model objects:
  o pyspark DataFrames of all model data, with columns for each global variable
  o sympy expression for how to calculate t+1, calculated using row-wise apply
• Joins DataFrames where necessary
29. Spark + Numpy Backend
• Uses Spark to distribute and schedule the execution of vectorized numpy operations
• One RDD for each node and year:
  o Containing an ndarray and an index vector for each axis
  o Can optionally partition the data array into key-value pairs along one or more axes (similar to bolt)
• To calculate a new node, all the parent RDDs are unioned and then reduced into a new child RDD
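The union-then-reduce step can be illustrated without Spark by modeling each node's RDD as a list of keyed numpy blocks: a key identifies a partition along one axis (here: location) and the block is an ndarray over the remaining axes (here: age x sex). Node names, shapes, and the births-minus-deaths arithmetic are invented for illustration:

```python
import numpy as np

# Each parent node: {partition key -> ndarray block over (age, sex)}.
births = {"USA": np.full((20, 2), 2.0), "MEX": np.full((20, 2), 3.0)}
deaths = {"USA": np.full((20, 2), 1.0), "MEX": np.full((20, 2), 1.5)}

# "Union": concatenate the parents' keyed blocks, tagged with node name,
# as a union of RDDs would.
unioned = [("births", k, a) for k, a in births.items()] + \
          [("deaths", k, a) for k, a in deaths.items()]

# "Reduce" by key into the child node (here: net population change),
# applying a vectorized numpy operation per partition.
child = {}
for name, key, arr in unioned:
    sign = 1.0 if name == "births" else -1.0
    child[key] = child.get(key, 0) + sign * arr
```

In the real backend the reduce runs inside Spark tasks, so each partition's numpy work executes in parallel across the cluster.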
31. Benchmarking
• Single executor (8 cores, 32 GB)
• Synthetic data:
  o 65 nodes in the DAG
  o 3 countries x 20 ages x 2 sexes x N draws
  o Forecasted forward 25 years
35. Limitations of Spark DataFrame
• The majority of execution time is spent joining the DataFrames and aligning the dimension columns
• These times were achieved after careful tuning of partitions
  o Perhaps custom partitioners could reduce runtime further
36. Taking Advantage of Numpy
• For multidimensional datasets of consistent shape and size, simply aligning indices can eliminate the overhead of joins
  o Including generalizations to axes of size 1 to simplify coding
• Numpy’s vectorized operations are highly tuned and scale well
  o Including multithreading when e.g. compiled against MKL
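A small numpy example of how index alignment plus size-1 axes removes the need for joins: when every node stores data on the same (loc, age, sex) grid, a quantity that does not vary along an axis keeps that axis at size 1 and broadcasts. Shapes and names are illustrative:

```python
import numpy as np

# Full-resolution node: loc x age x sex.
mortality = np.random.default_rng(1).random((3, 20, 2))
# Node that varies by location only; age and sex axes kept at size 1.
scale = np.array([0.9, 1.0, 1.1]).reshape(3, 1, 1)

# Aligned axes combine directly: numpy broadcasts the size-1 axes,
# so no join or index matching is needed.
adjusted = mortality * scale
```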
37. Future Development
• Experiment with better partitioning of Numpy RDDs
  o Perhaps use bolt?
• Improve partial graph execution
  o Executing the entire DAG at once often crashes the driver
  o Executing each year’s partial DAG in sequence results in idle CPU time towards the end of that DAG
• Investigate more efficient Spark DataFrame joining methods for panel data