Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ... | Databricks
Tuning a Spark ML model with cross-validation can be extremely computationally expensive. As the number of hyperparameter combinations increases, so does the number of models to evaluate. By default, Spark evaluates these models one by one to select the best performer. With a large number of models, if training and evaluating a single model does not fully utilize the available cluster resources, that waste is compounded across every model and leads to long run times.
Enabling model parallelism in Spark cross-validation, available since Spark 2.3, allows more than one model to be trained and evaluated at the same time and makes better use of cluster resources. We will go over how to enable this setting in Spark, what effect it has on an example ML pipeline, and best practices to keep in mind when using this feature.
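The idea of evaluating several hyperparameter combinations at once can be sketched in plain Python. This is a toy illustration under stated assumptions, not Spark code: `train_and_score`, `grid_search`, and the scoring formula are hypothetical stand-ins, and a thread pool plays the role of the concurrent model evaluation. In Spark itself, the corresponding setting is the `parallelism` parameter of `CrossValidator`.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def train_and_score(params):
    """Hypothetical stand-in for fitting one model and computing its metric."""
    reg, max_iter = params
    # Toy score that peaks at reg=0.1, max_iter=50; a real job would
    # train on the cluster and evaluate on a held-out fold here.
    return -abs(reg - 0.1) - abs(max_iter - 50) / 1000.0


def grid_search(param_grid, parallelism=1):
    """Score every combination, running up to `parallelism` of them at once."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # pool.map preserves input order, so scores line up with param_grid.
        scores = list(pool.map(train_and_score, param_grid))
    best_index = max(range(len(param_grid)), key=scores.__getitem__)
    return param_grid[best_index], scores


grid = list(product([0.01, 0.1, 1.0], [10, 50, 100]))
best_params, scores = grid_search(grid, parallelism=4)
print(best_params)
```

With `parallelism=1` the same function falls back to one-by-one evaluation, which makes it easy to compare the two modes on the same grid.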
Additionally, we will discuss ongoing work to reduce the amount of computation required when tuning ML pipelines by eliminating redundant transformations and intelligently caching intermediate datasets. This can be combined with model parallelism to further reduce the run time of cross-validation for complex machine learning pipelines.
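A minimal sketch of why caching intermediate results helps, assuming a two-stage pipeline in which only some hyperparameters affect the feature stage. The names `featurize`, `fit_and_score`, and `DATA` are hypothetical, and `functools.lru_cache` stands in for caching an intermediate dataset; this is not the implementation discussed in the talk.

```python
from functools import lru_cache

# Count how often the expensive feature stage actually runs.
calls = {"featurize": 0}
DATA = (1.0, 2.0, 3.0, 4.0)


@lru_cache(maxsize=None)
def featurize(scale):
    """Hypothetical feature stage; its output depends only on `scale`."""
    calls["featurize"] += 1
    return tuple(x * scale for x in DATA)


def fit_and_score(scale, reg):
    """Hypothetical estimator stage; reuses cached features for a given scale."""
    features = featurize(scale)
    return sum(features) / (1.0 + reg)


# Six combinations, but only two distinct feature-stage settings.
grid = [(scale, reg) for scale in (1.0, 2.0) for reg in (0.0, 0.1, 1.0)]
results = {params: fit_and_score(*params) for params in grid}
print(calls["featurize"])
```

The feature stage runs once per distinct `scale` rather than once per combination, so the saving grows with the number of estimator-only hyperparameters in the grid.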
The journey of Moving from AWS ELK to GCP Data Pipeline | Randy Huang
This is a real case from VMfive of moving an ELK architecture off AWS. GCP Data Pipeline currently gives us a more efficient and stable environment for running our service.
Productionizing Spark ML pipelines with the portable format for analytics | DataWorks Summit
The common perception of machine learning is that it starts with data and ends with a model. In real-world production systems, the traditional data science and machine learning workflow of data preparation, feature engineering, and model selection, while important, is only one aspect. A critical missing piece is the deployment and management of models, as well as the integration between the model creation and deployment phases.
This is particularly challenging in the case of deploying Apache Spark ML pipelines for low-latency scoring. Because execution of Spark ML pipelines is tightly coupled with the Spark SQL runtime, deployment using Spark is ill-suited to the needs of real-time predictive applications.
In this talk I will introduce the Portable Format for Analytics (PFA) for portable, open, and standardized deployment of data science pipelines and analytic applications. I will also introduce and evaluate Aardpfark, a library for exporting Spark ML pipelines to PFA, as well as compare and contrast it to other available alternatives including PMML, MLeap, ONNX, and Apple’s CoreML.
Speaker
Nick Pentreath, Principal Engineer, IBM
Building Your First Apache Apex (Next Gen Big Data/Hadoop) Application | Apache Apex
This webinar will be a hands-on demonstration of how to clone and build the Apache Apex source code repositories, how to run the maven archetype to create a new Apex project, how to enhance it to build a word counting application and finally, how to run it and view results. We will also do a brief code walkthrough.
Bio:
Dr. Munagala V. Ramanath is a Committer for Apache Apex and a Software Engineer at DataTorrent. He has many years of experience working for a variety of companies in California and a Ph.D. in Computer Science from the University of Wisconsin, Madison.
A constant flow of fresh water will keep your lake alive and both fish and fishers happy. See how easy it is to feed changes from your operational databases directly to your Data Lake and/or Kafka to support real-time and near-real-time analytics.
Writing an Interactive Interface for SQL on Flink | Eventador
This presentation goes into detail on how and why Eventador created SQLStreamBuilder for easy streaming SQL—and the lessons learned along the way.
This presentation was given by Eventador CEO and Co-founder Kenny Gorman at Flink Forward Europe 2019.
Willump: Optimizing Feature Computation in ML Inference | Databricks
Systems for performing ML inference are increasingly important, but are far slower than they could be because they use techniques designed for conventional data serving workloads, neglecting the statistical nature of ML inference. As an alternative, this talk presents Willump, an optimizer for ML inference.
Flink Forward San Francisco 2018: Andrew Torson - "Extending Flink metrics: R... | Flink Forward
The Flink metrics module allows the use of Dropwizard-like metrics and reporters in Flink pipelines. It opens a rich opportunity not only to monitor the health of Flink pipelines but also to attach real-time business intelligence metrics that run alongside existing Flink data jobs, thus avoiding the need to build a separate BI data-flow infrastructure. However, most of the work still resides with application developers: they have to define how those metrics are registered, calculated, and attached to Flink functional operators, and then Flink does the computation and reporting. The most challenging task is to add BI-like metrics, with key dimensions dynamically extracted from streamed data and metric values that likely need to be aggregated over multiple Flink tasks. We present a toolkit that extends Flink metrics and allows existing Flink operators to be decorated with both simple health-check and complex BI-like metrics, updated and observable in real time through IT monitoring and BI visualization dashboards. This work is based on a real-world production use case of computing retail merchandise prices in real time for the Walmart e-commerce catalog using Flink.
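The two requirements named above, dimensions extracted dynamically from the data and values aggregated across tasks, can be sketched without any Flink dependency. This is an illustrative toy, not the toolkit from the talk: `task_metrics`, `aggregate`, and the event fields `category` and `region` are assumptions made for the example.

```python
from collections import Counter


def task_metrics(events):
    """Count events per dimension key; keys are discovered from the data itself."""
    counts = Counter()
    for event in events:
        # The dimension tuple is not declared up front; it comes from each record.
        counts[(event["category"], event["region"])] += 1
    return counts


def aggregate(per_task_counts):
    """Merge the counters reported by several parallel tasks into one view."""
    total = Counter()
    for counts in per_task_counts:
        total.update(counts)
    return total


task_a = task_metrics([
    {"category": "toys", "region": "us"},
    {"category": "toys", "region": "us"},
    {"category": "garden", "region": "ca"},
])
task_b = task_metrics([
    {"category": "toys", "region": "us"},
    {"category": "garden", "region": "us"},
])
merged = aggregate([task_a, task_b])
print(merged[("toys", "us")])
```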
Flink is used by many users for ML use cases such as real-time feature engineering and near-line inference. For ML use cases that are more batch-oriented, such as model training and validation, other systems are usually used. This talk, given at Flink Forward 2019, shows the efforts in the Flink community to let Flink cover all ML use cases.
This datasheet highlights new and enhanced features in IBM SPSS Statistics 24. Extend the value of time-series forecasting using temporal causal modeling, new analytical techniques including geospatial analytics and enhancements to programmability, performance and accessibility.
Large-Scale Ads CTR Prediction with Spark and Deep Learning: Lessons Learned ... | Databricks
CTR prediction algorithms are essential, and are used extensively for ads bidding and sponsored search. While logistic regression models have proven effective for this kind of problem, rapid growth in the amount of data has created a lot of challenges. For example, how to train a logistic regression model with billions of parameters in a commodity hardware cluster, or how to improve the model’s accuracy with better feature engineering. Other challenges include figuring out how to benefit from popular deep learning technologies to reduce the dependence on human labor and expert knowledge, and how to improve job performance given such a complicated workload.
At Spark Summit East 2017, Hortonworks introduced vector-free L-BFGS to conquer the scalability challenge of MLlib and provide a very scalable logistic regression implementation. In this talk, hear about their experience integrating this implementation with different feature learning technologies to solve Ad CTR prediction problems, and the lessons they learned.
Virtual Flink Forward 2020: How Streaming Helps Your Staging Environment and ... | Flink Forward
In this session, we will look at how Apache Flink can be used to stream anonymized API request and response data from a production environment to make sure staging environments are up-to-date and reflect the most recent features (and bugs) that comprise a service. The talk will also examine how to deal with issues of data retention, throttling, and persistence, finishing with recommendations for how to use these sandbox environments to rapidly prototype and test new features and fixes.
Data scientists and machine learning practitioners nowadays seem to be churning out models by the dozen, and they continuously experiment to find ways to improve their accuracy. They also use a variety of ML and DL frameworks and languages, and a typical organization may find that this results in a heterogeneous, complicated set of assets that require different types of runtimes, resources, and sometimes even specialized compute to operate efficiently.
But what does it mean for an enterprise to actually take these models to "production"? How does an organization scale out inference engines and make them available for real-time applications without significant latency? There need to be different techniques for batch (offline) inference and instant, online scoring. Data needs to be accessed from various sources, and cleansing and transformation of data need to be enabled prior to any predictions. In many cases, there may be no substitute for customized data handling with scripting either.
Enterprises also require built-in auditing and authorization as well as approval processes, while still supporting a "continuous delivery" paradigm whereby a data scientist can enable insights faster. Not all models are created equal, nor are the consumers of a model, so enterprises require both metering and allocation of compute resources for SLAs.
In this session, we will take a look at how machine learning is operationalized in IBM Data Science Experience (DSX), a Kubernetes-based offering for the Private Cloud, optimized for the Hortonworks Hadoop Data Platform. DSX essentially brings typical software engineering development practices to data science, organizing the dev->test->production flow for machine learning assets in much the same way as typical software deployments. We will also see what it means to deploy models and custom scorers, monitor their accuracy, and even roll them back, as well as how API-based techniques enable consuming business processes and applications to remain relatively stable amidst all the chaos.
Speaker
Piotr Mierzejewski, Program Director Development IBM DSX Local, IBM
SAP ABAP Online Training by Lead Online Training, with instructors who have real-world experience and are comfortable going off-script. Our faculty team has a lot of experience in SAP ABAP online training to explore the most complicated concepts.
Storing State Forever: Why It Can Be Good For Your Analytics | Yaroslav Tkachenko
State is an essential part of the modern streaming pipelines: it enables a variety of foundational capabilities like windowing, aggregation, enrichment, etc. But usually, the state is either transient, so we only keep it until the window is closed, or it's fairly small and doesn't grow much. But what if we treat the state differently? The keyed state in Flink can be scaled vertically and horizontally, it's reliable and fault-tolerant... so is scaling a stateful Flink application that different from scaling any data store like Kafka or MySQL?
At Shopify, we've worked on a massive analytical data pipeline that's needed to support complex streaming joins and correctly handle arbitrarily late-arriving data. We came up with an idea to never clear state and support joins this way. We've made a successful proof of concept, ingested all historical transactional Shopify data and ended up storing more than 10 TB of Flink state. In the end, it allowed us to achieve 100% data correctness.
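The idea of supporting joins by never clearing state can be illustrated with a small in-memory sketch. This is a toy model under stated assumptions, not Flink code or the pipeline described above: the class `ForeverJoin`, its method names, and the sample keys are hypothetical, and plain dictionaries stand in for keyed state.

```python
from collections import defaultdict


class ForeverJoin:
    """Two-sided keyed join that keeps every record it has ever seen."""

    def __init__(self):
        # Keyed state for each side; nothing is ever evicted.
        self.left = defaultdict(list)
        self.right = defaultdict(list)

    def on_left(self, key, value):
        """Store a left record and join it with all right records seen so far."""
        self.left[key].append(value)
        return [(key, value, r) for r in self.right.get(key, [])]

    def on_right(self, key, value):
        """Store a right record and join it with all left records seen so far."""
        self.right[key].append(value)
        return [(key, l, value) for l in self.left.get(key, [])]


join = ForeverJoin()
first = join.on_left("order-1", "created")   # no match yet
second = join.on_right("order-1", "paid")    # matches the stored left record
late = join.on_left("order-1", "updated")    # arbitrarily late, still joins
print(second, late)
```

Because neither side is ever cleared, a record that arrives arbitrarily late still finds its match; the cost is state that grows with the full history.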
Understanding and Improving Code Generation | Databricks
Code generation is integral to Spark’s physical execution engine. When implemented, the Spark engine creates optimized bytecode at runtime improving performance when compared to interpreted execution. Spark has taken the next step with whole-stage codegen which collapses an entire query into a single function.
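The contrast between interpreted execution and a query collapsed into a single generated function can be shown with a toy Python analogue. This is a sketch of the general technique only, not Spark's code generator: `interpreted`, `generate_fused`, and the filter and projection expressions are hypothetical, and Python source compiled with `exec` stands in for generated bytecode.

```python
def interpreted(rows):
    """Operator-at-a-time execution: one generator per operator."""
    filtered = (r for r in rows if r % 2 == 0)   # filter operator
    projected = (r * 10 for r in filtered)       # projection operator
    return sum(projected)                        # aggregate operator


def generate_fused(predicate_src, projection_src):
    """Generate and compile one function that fuses filter, project, and sum."""
    source = (
        "def fused(rows):\n"
        "    total = 0\n"
        "    for r in rows:\n"
        f"        if {predicate_src}:\n"
        f"            total += {projection_src}\n"
        "    return total\n"
    )
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["fused"]


fused = generate_fused("r % 2 == 0", "r * 10")
print(interpreted(range(10)), fused(range(10)))
```

Both versions compute the same result; the fused version does it in a single loop with no per-operator iterator hand-offs, which is the property whole-stage code generation aims for.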
Originally presented at Strata EU 2017: https://conferences.oreilly.com/strata/strata-eu/public/schedule/detail/57631
Cloud providers currently offer convenient on-demand managed big data clusters (PaaS) with a pay-as-you-go model. In PaaS, analytical engines such as Spark and Hive come ready to use, with a general-purpose configuration and upgrade management. Over the last year, the Spark framework and APIs have been evolving very rapidly, with major improvements on performance and the release of v2, making it challenging to keep up-to-date production services both on-premises and in the cloud for compatibility and stability.
Nicolas Poggi evaluates the out-of-the-box support for Spark and compares the offerings, reliability, scalability, and price-performance from major PaaS providers, including Azure HDinsight, Amazon Web Services EMR, Google Dataproc, and Rackspace Cloud Big Data, with an on-premises commodity cluster as baseline. Nicolas uses BigBench, the brand new standard (TPCx-BB) for big data systems, with both Spark and Hive implementations for benchmarking the systems. BigBench combines SQL queries, MapReduce, user code (UDF), and machine learning, which makes it ideal to stress Spark libraries (SparkSQL, DataFrames, MLlib, etc.).
The work is framed within the ALOJA research project, which features an open source benchmarking and analysis platform that has been recently extended to support SQL-on-Hadoop engines and BigBench. The ALOJA project aims to lower the total cost of ownership (TCO) of big data deployments and study their performance characteristics for optimization. Nicolas highlights how to easily repeat the benchmarks through ALOJA and benefit from BigBench to optimize your Spark cluster for advanced users. The work is a continuation of a paper to be published at the IEEE Big Data 16 conference. (A preprint copy can be obtained here.)
New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal... | Cloudera, Inc.
Recording Link: http://bit.ly/LSImpala
Author: Greg Rahn, Cloudera Director of Product Management
In this session, we'll review the recent set of benchmark tests the Apache Impala (incubating) performance team completed that compare Apache Impala to a traditional analytic database (Greenplum), as well as to other SQL-on-Hadoop engines (Hive LLAP, Spark SQL, and Presto). We'll go over the methodology and results, and we'll also discuss some of the performance features and best practices that make this performance possible in Impala. Lastly, we'll look at some recent advancements in Impala over the past few releases.
The state of Hive and Spark in the Cloud (July 2017) | Nicolas Poggi
Originally presented at the BDOOP and Spark Barcelona meetup groups: http://meetu.ps/3bwCTM
Cloud providers currently offer convenient on-demand managed big data clusters (PaaS) with a pay-as-you-go model. In PaaS, analytical engines such as Spark and Hive come ready to use, with a general-purpose configuration and upgrade management. Over the last year, the Spark framework and APIs have been evolving very rapidly, with major improvements on performance and the release of v2, making it challenging to keep up-to-date production services both on-premises and in the cloud for compatibility and stability. The talk compares:
• The performance of both v1 and v2 for Spark and Hive
• PaaS cloud services: Azure HDinsight, Amazon Web Services EMR, Google Cloud Dataproc
• Out-of-the-box support for Spark and Hive versions from providers
• PaaS reliability, scalability, and price-performance of the solutions
The comparison uses BigBench, the new Big Data benchmark standard. BigBench combines SQL queries, MapReduce, user code (UDF), and machine learning, which makes it ideal to stress Spark libraries (SparkSQL, DataFrames, MLlib, etc.).
I shared my experience from a recent project defining a set of Azure-based architectures and implementing prototypes for a customer that wants to publicly expose its business data to clients and support different use cases (BI, data extraction, internal use, etc.).
I walked the audience through the analyzed and proposed architectural approaches in Azure, which leverage native cloud capabilities and PaaS offerings. In addition, I shared the results of the implemented prototypes, the advantages and limitations of each architectural approach, and the final recommendations made to the customer.
Microsoft has embraced OSS by placing a big bet on Apache YARN to govern the resources of our computing clusters, and we did so by working with the community and adding many new capabilities in YARN. We now look to undertake a similar journey and build the next generation of our job execution engine on top of Apache Tez. We will be building a common platform for executing batch, interactive, ML, and streaming queries at exabyte scale for Microsoft's BigData system called Cosmos. This requires us to push the limits of Tez API to support new graph models, change the executing DAG by dynamically adding new vertices, scheduling for interactive and streaming workloads, squeeze out all the computing power in the cluster by integrating Tez with opportunistic containers in YARN, and scaling a DAG across tens of thousands of machines. We have started out on this journey and want to share our progress, lessons learned, seek help from the community to add these new capabilities, and push Apache Tez to new levels.
SPEAKERS
Hitesh Sharma, Principal Software Engineering Manager, Microsoft. Engineering manager in the Big Data team at Microsoft.
Anupam, Senior Software Engineer, Microsoft
The concept of the talk is as follows:
• to give a general idea of the user segmentation task in the DMP project and how solving this problem helps our business
• to explain how we use AutoML to solve this task and to describe its components
• to share insights into the techniques we apply to make our pipeline fast and stable on huge datasets
Productionizing Spark ML Pipelines with the Portable Format for Analytics wit...Databricks
The common perception of machine learning is that it starts with data and ends with a model. In real-world production systems, the traditional data science and machine learning workflow of data preparation, feature engineering and model selection, while important, is only one aspect. A critical missing piece is the deployment and management of models, as well as the integration between the model creation and deployment phases.
This is particularly challenging in the case of deploying Apache Spark ML pipelines for low-latency scoring. While MLlib’s DataFrame API is powerful and elegant, it is relatively ill-suited to the needs of many real-time predictive applications, in part because it is tightly coupled with the Spark SQL runtime. In this talk I will introduce the Portable Format for Analytics (PFA) for portable, open and standardized deployment of data science pipelines & analytic applications.
I’ll also introduce and evaluate Aardpfark, a library for exporting Spark ML pipelines to PFA, as well as compare and contrast it to other available alternatives including PMML, MLeap, ONNX and Apple’s CoreML.
Productionizing Spark ML Pipelines with the Portable Format for AnalyticsNick Pentreath
The common perception of machine learning is that it starts with data and ends with a model. In real-world production systems, the traditional data science and machine learning workflow of data preparation, feature engineering and model selection, while important, is only one aspect. A critical missing piece is the deployment and management of models, as well as the integration between the model creation and deployment phases.
This is particularly challenging in the case of deploying Apache Spark ML pipelines for low-latency scoring. While MLlib’s DataFrame API is powerful and elegant, it is relatively ill-suited to the needs of many real-time predictive applications, in part because it is tightly coupled with the Spark SQL runtime. In this talk I will introduce the Portable Format for Analytics (PFA) for portable, open and standardized deployment of data science pipelines & analytic applications.
I’ll also introduce and evaluate Aardpfark, a library for exporting Spark ML pipelines to PFA.
Overcoming write availability challenges of PostgreSQLEDB
There's no shortage of physical replication solutions for PostgreSQL; they scale horizontally and provide high read availability. But where they fall short is write availability, which leads many users to consider PostgreSQL logical replication. Existing solutions have a single point of failure or are dependent on a forked, vendor-provided PostgreSQL extension, making reliable, enterprise-class logical replication hard to come by. Furthermore, these solutions put limits on scaling PostgreSQL.
By combining Kafka, an open source event streaming system with PostgreSQL, customers can get a fault tolerant, scalable logical replication service. Learn how EDB Replicate leverages Kafka for high write availability needed for today's demanding consumers who expect their applications to be always available and won't tolerate latency.
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...HostedbyConfluent
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam K Dey | Current 2022
Robinhood’s mission is to democratize finance for all. Data-driven decision making is key to achieving this goal. The data needed is hosted in various OLTP databases. Replicating this data reliably and in near real time to the data lakehouse powers many critical use cases for the company. At Robinhood, CDC is not only used for ingestion into the data lake but is also being adopted for inter-system message exchanges between different online microservices.
In this talk, we will describe the evolution of change data capture based ingestion at Robinhood, not only in terms of the scale of data stored and queries made, but also the use cases that it supports. We will go in-depth into the CDC architecture built around our Kafka ecosystem using the open source systems Debezium and Apache Hudi. We will cover online inter-system message exchange use cases, our experience running this service at scale at Robinhood, and the lessons learned.
OpenPOWER Acceleration of HPCC SystemsHPCC Systems
JT Kellington, IBM and Allan Cantle, Nallatech present at the 2015 HPCC Systems Engineering Summit Community Day about porting HPCC Systems to the POWER8-based ppc64el architecture.
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...Amazon Web Services
In this session, you will learn the key differences between a relational database management service (RDBMS) and non-relational (NoSQL) databases like Amazon DynamoDB. You will learn about suitable and unsuitable use cases for NoSQL databases. You'll learn strategies for migrating from an RDBMS to DynamoDB through a 5-phase, iterative approach. See how Sony migrated an on-premises MySQL database to the cloud with Amazon DynamoDB, and see the results of this migration.
Similar to Index conf sparkml-feb20-n-pentreath (20)
GPUs used with Apache Spark are leveraged to speed up machine learning (ML) model training and inference. Data preparation stages are traditionally run on CPUs. The RAPIDS Accelerator for Apache Spark is a plugin jar that takes advantage of Apache Spark 3.x's ability to schedule on GPUs. The RAPIDS Accelerator replaces CPU expressions in a physical plan with GPU equivalents for dataframe operations. Code change is not required, making transition to GPUs seamless.
We'll give an overview of what the RAPIDS Accelerator is, how it works, and benefits from using the accelerator. We will discuss benchmarks showing the performance and cost benefits of leveraging GPUs for Spark ETL processing. We'll showcase a user tool that will help estimate speedups and cost savings.
Talk at SF Big Analytics https://www.meetup.com/sf-big-analytics/events/285731741/
Distributed systems are made up of many components such as authentication, a persistence layer, stateless services, load balancers, and stateful coordination services. These coordination services are central to the operation of the system, performing tasks such as maintaining system configuration state, ensuring service availability, name resolution, and storing other system metadata. Given their central role in the system it is essential that these systems remain available, fault tolerant and consistent. By providing a highly available file system-like abstraction as well as powerful recipes such as leader election, Apache Zookeeper is often used to implement these services. Although powerful, the Zookeeper interface may not be flexible enough or provide sufficient performance for all applications and many systems are replacing Zookeeper based solutions with Raft which provides a more generic interface to high availability and fault tolerance through the use of State Machine replication. This talk will go over a generic example of stateful coordination service moving from Zookeeper to Raft.
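The state machine replication idea mentioned above can be illustrated with a minimal sketch. It is not Alluxio's implementation and it does not implement Raft itself (no leader election or log agreement); it only assumes that some consensus layer has already produced an agreed, ordered log. The class name `KeyValueStateMachine`, the `replay` helper, and the command tuples are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class KeyValueStateMachine:
    """Deterministic state machine: the same log always yields the same state."""
    state: dict = field(default_factory=dict)
    last_applied: int = 0  # index of the last log entry that was applied

    def apply(self, index, command):
        # Entries must be applied strictly in log order.
        if index <= self.last_applied:
            return  # already applied; replaying an entry is a no-op
        if index != self.last_applied + 1:
            raise ValueError(f"gap in log: expected {self.last_applied + 1}, got {index}")
        op, key, value = command
        if op == "set":
            self.state[key] = value
        elif op == "delete":
            self.state.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
        self.last_applied = index


def replay(log):
    """Build a replica by applying every committed entry in order (hypothetical helper)."""
    machine = KeyValueStateMachine()
    for index, command in enumerate(log, start=1):
        machine.apply(index, command)
    return machine


# Assumed to be the log that the consensus layer has already agreed on.
committed_log = [
    ("set", "leader", "node-a"),
    ("set", "config/version", "7"),
    ("delete", "leader", None),
]

replica_1 = replay(committed_log)
replica_2 = replay(committed_log)
```

Because every replica applies the same commands in the same order, the replicas end up with identical state, which is the property a coordination service relies on when a new node takes over.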
Speaker: Tyler Crain ( Alluxio)
Tyler Crain is a software engineer at Alluxio, working on distributed systems within the Alluxio core team. Before this, Tyler held Post-Doc positions at the University of Sydney and Sorbonne Universities where he performed research on topics including distributed key-value stores, distributed consensus and blockchain. Tyler received his PhD from the University of Rennes where he worked on Transactional Memory. He also holds a Master's degree in Computer Science from the University of California, Santa Barbara.
talk at SF Big Analytics:
Related Blog: https://www.alluxio.io/blog/from-zookeeper-to-raft-how-alluxio-stores-file-system-state-with-high-availability-and-fault-tolerance/
SF Big Analytics 2022-03-15: Persia: Scaling DL Based Recommenders up to 100 ...Chester Chen
Recent years have witnessed an exponential growth of the model scale in recommendation/Ads/search, from Google’s 2016 model with 1 billion parameters to Facebook’s latest model with 12 trillion parameters. A significant quality boost has come with each jump in model capacity, which makes people believe the era of 100 trillion parameters is around the corner. To prepare for the exponential growth of the model size, an efficient distributed training system is urgently needed. However, the training of such huge models is challenging even within industrial-scale data centers. In this talk, I will introduce Persia -- an open training system developed by my team -- to resolve this challenge by careful co-design of both the optimization algorithm and the distributed system architecture. Persia admits nearly linear speedup properties while scaling the number of workers and the model size. Besides the capability of training 100 trillion parameters, it also shows a clear advantage in efficiency over other open-source engines.
paper link:
https://arxiv.org/pdf/2111.05897.pdf
Speaker: Ji Liu
Dr. Ji Liu received his Ph.D. in computer science from the University of Wisconsin-Madison and his bachelor's degree in automation from the University of Science and Technology of China. After graduation, he joined the University of Rochester as an assistant professor, conducting research in machine learning, optimization, and reinforcement learning. The asynchronous and decentralized algorithms he developed have been widely used in industry, for example at IBM and Microsoft. He left academia and joined Tencent in 2017, exploring AI’s boundary. The AI agent Tstarbot developed there was considered to be a milestone for mastering the most challenging RTS game -- Starcraft II. His second stop in industry is Kwai, the second largest short video company in China. He founded and led multiple international teams with different functions: a platform team, a product team, and a research team. His team contributed to 15+% annual revenue growth in Ads. He has published 100+ papers in top-tier CS conferences and journals, and received multiple best paper awards (e.g., SIGKDD 2010 and UAI 2015 Facebook best paper). He was an awardee of MIT TR 35 under 35 in China and the IBM faculty award in 2017. He was nominated as one of China's top 5 AI innovators under 35 in 2018.
SF Big Analytics talk: NVIDIA FLARE: Federated Learning Application Runtime E...Chester Chen
Topic:
NVIDIA FLARE: Federated Learning Application Runtime Environment for Developing Robust AI Models
Summary:
Federated learning (FL) enables building robust and generalizable AI models by leveraging diverse datasets from multiple collaborators without moving data. We created NVIDIA FLARE as an open-source SDK to make it easier for data scientists to use FL in their research. The SDK allows existing machine learning and deep learning workflows to be adapted for distributed learning across enterprises and enables platform developers to build a secure, privacy-preserving offering for multiparty collaboration utilizing homomorphic encryption or differential privacy. The SDK is a lightweight, flexible, and scalable Python package that allows researchers to bring their data science workflows implemented in any training library (PyTorch, TensorFlow, or even NumPy) and apply them in real-world FL settings. This talk will introduce the key design principles of NVIDIA FLARE and illustrate use cases (e.g., COVID analysis) with customizable FL workflows that implement different privacy-preserving algorithms.
Speaker: Dr. Holger Roth ( Nvidia)
Holger Roth is a Sr. Applied Research Scientist at NVIDIA focusing on deep learning for medical imaging. He has been working closely with clinicians and academics over the past several years to develop deep learning based medical image computing and computer-aided detection models for radiological applications. He is an Associate Editor for IEEE Transactions of Medical Imaging and holds a Ph.D. from University College London, UK. In 2018, he was awarded the MICCAI Young Scientist Publication Impact Award.
A missing link in the ML infrastructure stack?Chester Chen
Talk at SF Big Analytics
Machine learning is quickly becoming a product engineering discipline. Although several new categories of infrastructure and tools have emerged to help teams turn their models into production systems, doing so is still extremely challenging for most companies. In this talk, we survey the tooling landscape and point out several parts of the machine learning lifecycle that are still underserved. We propose a new category of tool that could help alleviate these challenges and connect the fragmented production ML tooling ecosystem. We conclude by discussing similarities and differences between our proposed system and those of a few top companies.
Bio: Josh Tobin is the founder and CEO of a stealth machine learning startup. Previously, Josh worked as a deep learning & robotics researcher at OpenAI and as a management consultant at McKinsey. He is also the creator of Full Stack Deep Learning (fullstackdeeplearning.com), the first course focused on the emerging engineering discipline of production machine learning. Josh did his PhD in Computer Science at UC Berkeley advised by Pieter Abbeel.
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...Chester Chen
Uber developed a new Spark ingestion system, Marmaray, for data ingestion from various sources. It’s designed to ingest billions of Kafka messages every 30 minutes. The amount of data handled by the pipeline is on the order of hundreds of TBs. Omkar details how to tackle such scale and shares insights into the optimization techniques. Some key highlights are how to understand bottlenecks in Spark applications, whether to cache your Spark DAG to avoid rereading your input data, how to effectively use accumulators to avoid unnecessary Spark actions, how to inspect your heap and nonheap memory usage across hundreds of executors, how you can change the layout of data to save long-term storage cost, how to effectively use serializers and compression to save network and disk traffic, how to amortize the cost of your application by multiplexing your jobs, and different techniques for reducing memory footprint, runtime, and on-disk usage. CGI was able to significantly (~10%–40%) reduce memory footprint, runtime, and disk usage.
Speaker: Omkar Joshi (Uber)
Omkar Joshi is a senior software engineer on Uber’s Hadoop platform team, where he’s architecting Marmaray. Previously, he led object store and NFS solutions at Hedvig and was an initial contributor to Hadoop’s YARN scheduler.
SF Big Analytics 2019112: Uncovering performance regressions in the TCP SACK...Chester Chen
Uncovering performance regressions in the TCP SACKs vulnerability fixes
In early July 2019, Databricks noticed some Apache Spark workloads regressing by as much as 6x. In this talk, we'll discuss how we traced these regressions back to the Linux kernel and the fixes for the TCP SACKs vulnerabilities. We will explain the symptoms we were seeing, walk through how we debugged the TCP connections, and dive into the Linux source to uncover the root cause.
Speaker: Chris Stevens (Databricks)
Chris Stevens is a software engineer at Databricks where he works on the reliability, scalability, and security of Apache Spark clusters. His work focuses on auto-scaling compute, auto-scaling storage, node initialization performance, and node health monitoring. Prior to Databricks, Chris founded the Minoca OS project, where he built a POSIX compliant, general purpose OS - from scratch - to run on resource-constrained devices. He got his start at Microsoft working on the Windows kernel team, porting the Windows boot environment from BIOS to UEFI.
SFBigAnalytics_20190724: Monitor kafka like a ProChester Chen
Kafka operators need to provide guarantees to the business that Kafka is working properly and delivering data in real time, and they need to identify and triage problems so they can solve them before end users notice them. This elevates the importance of Kafka monitoring from a nice-to-have to an operational necessity. In this talk, Kafka operations experts Xavier Léauté and Gwen Shapira share their best practices for monitoring Kafka and the streams of events flowing through it. How to detect duplicates, catch buggy clients, and triage performance issues – in short, how to keep the business’s central nervous system healthy and humming along, all like a Kafka pro.
Speakers: Gwen Shapira, Xavier Leaute (Confluent)
Gwen is a software engineer at Confluent working on core Apache Kafka. She has 15 years of experience working with code and customers to build scalable data architectures. She currently specializes in building real-time reliable data processing pipelines using Apache Kafka. Gwen is an author of “Kafka - the Definitive Guide”, "Hadoop Application Architectures", and a frequent presenter at industry conferences. Gwen is also a committer on the Apache Kafka and Apache Sqoop projects.
One of the first engineers on the Confluent team, Xavier Leaute is responsible for analytics infrastructure, including real-time analytics in KafkaStreams. He was previously a quantitative researcher at BlackRock. Prior to that, he held various research and analytics roles at Barclays Global Investors and MSCI.
SF Big Analytics 2019-06-12: Managing uber's data workflows at scaleChester Chen
Talk 2. Managing Uber’s Data workflow at Scale.
Uber's microservices serve millions of rides a day, generating 100+ PB of data. To democratize data pipelines, Uber needed a central tool that provides a way to author, manage, schedule, and deploy data workflows at scale. This talk details Uber’s journey toward a unified and scalable data workflow system used to manage this data and shares the challenges faced and how the company has rearchitected several components of the system, such as scheduling and serialization, to make them highly available and more scalable.
Speaker Alex Kira (Uber)
Alex Kira is an engineering tech lead at Uber, where he works on the data workflow management team. His team provides a data infrastructure platform. Over 19 years, he has gained experience across several software disciplines, including distributed systems, data infrastructure, and full stack development.
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...Chester Chen
Building highly efficient data lakes using Apache Hudi (Incubating)
Even with the exponential growth in data volumes, ingesting/storing/managing big data remains unstandardized & inefficient. Data lakes are a common architectural pattern to organize big data and democratize access to the organization. In this talk, we will discuss different aspects of building honest data lake architectures, pinpointing technical challenges and areas of inefficiency. We will then re-architect the data lake using Apache Hudi (Incubating), which provides streaming primitives right on top of big data. We will show how upserts & incremental change streams provided by Hudi help optimize data ingestion and ETL processing. Further, Apache Hudi manages the growth and file sizes of the resulting data lake using purely open-source file formats, while also providing optimized query performance & file system listing. We will also provide hands-on tools and guides for trying this out on your own data lake.
Speaker: Vinoth Chandar (Uber)
Vinoth is Technical Lead at Uber Data Infrastructure Team
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftChester Chen
Talk 1. Scaling Apache Spark on Kubernetes at Lyft
As part of this mission, Lyft invests heavily in open source infrastructure and tooling. At Lyft, Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark at Lyft has evolved to solve both machine learning and large-scale ETL workloads. By combining the flexibility of Kubernetes with the data processing power of Apache Spark, Lyft is able to drive ETL data processing to a different level. In this talk, we will talk about challenges the Lyft team faced and solutions they developed to support Apache Spark on Kubernetes in production and at scale. Topics include: - Key traits of Apache Spark on Kubernetes. - Deep dive into Lyft's multi-cluster setup and operationality to handle petabytes of production data. - How Lyft extends and enhances Apache Spark to support capabilities such as Spark pod life cycle metrics and state management, resource prioritization, and queuing and throttling. - Dynamic job scale estimation and runtime dynamic job configuration. - How Lyft powers internal Data Scientists, Business Analysts, and Data Engineers via a multi-cluster setup.
Speaker: Li Gao
Li Gao is the tech lead of the cloud native Spark compute initiative at Lyft. Prior to Lyft, Li worked at Salesforce, Fitbit, Marin Software, and a few startups in various technical leadership positions on cloud native and hybrid cloud data platforms at scale. Besides Spark, Li has scaled and productionized other open source projects, such as Presto, Apache HBase, Apache Phoenix, Apache Kafka, Apache Airflow, Apache Hive, and Apache Cassandra.
SFBigAnalytics- hybrid data management using cdapChester Chen
Cloud has emerged as a critical enabler of digital transformation, with the aim of reducing IT overheads and costs. However, cloud migration is not instantaneous for a variety of reasons including data sensitivity, compliance and application performance. This results in the creation of diverse hybrid and multi-cloud environments and amplifies data management and integration challenges. This talk demonstrates how CDAP’s flexibility can allow you to utilize your existing on-premises infrastructure, as you evolve to the latest Big Data and Cloud services at your own pace, all while providing you a single, unified view of all your data, wherever it resides.
Speaker: Bhooshan Mogal, Google
Bhooshan Mogal is a Product Manager at Google, where he is focused on delivering best-in-class Data and Analytics services to GCP users. Prior to Google, he worked on data systems at Cask Data Inc, Pivotal and Yahoo.
Bighead: Airbnb's end-to-end machine learning platform
Airbnb has a wide variety of ML problems ranging from models on traditional structured data to models built on unstructured data such as user reviews, messages and listing images. The ability to build, iterate on, and maintain healthy machine learning models is critical to Airbnb’s success. Bighead aims to tie together various open source and in-house projects to remove incidental complexity from ML workflows. Bighead is built on Python, Spark, and Kubernetes. The components include a lifecycle management service, an offline training and inference engine, an online inference service, a prototyping environment, and a Docker image customization tool. Each component can be used individually. In addition, Bighead includes a unified model building API that smoothly integrates popular libraries including TensorFlow, XGBoost, and PyTorch. Each model is reproducible and iterable through standardization of data collection and transformation, model training environments, and production deployment. This talk covers the architecture, the problems that each individual component and the overall system aims to solve, and a vision for the future of machine learning infrastructure. Bighead is widely adopted at Airbnb, and we have a variety of models running in production. We plan to open source Bighead to allow the wider community to benefit from our work.
Speaker: Andrew Hoh
Andrew Hoh is the Product Manager for the ML Infrastructure and Applied ML teams at Airbnb. Previously, he has spent time building and growing Microsoft Azure's NoSQL distributed database. He holds a degree in computer science from Dartmouth College.
Sf big analytics_2018_04_18: Evolution of the GoPro's data platformChester Chen
Talk 1 : Evolution of the GoPro's data platform
In this talk, we will share GoPro’s experiences in building Data Analytics Cluster in Cloud. We will discuss: evolution of data platform from fixed-size Hadoop clusters to Cloud-based Spark Cluster with Centralized Hive Metastore +S3: Cost Benefits and DevOp Impact; Configurable, spark-based batch Ingestion/ETL framework;
Migration Streaming framework to Cloud + S3;
Analytics metrics delivery with Slack integration;
BedRock: Data Platform Management, Visualization & Self-Service Portal
Visualizing Machine learning Features via Google Facets + Spark
Speakers: Chester Chen
Chester Chen is the Head of Data Science & Engineering at GoPro. Previously, he was the Director of Engineering at Alpine Data Labs.
David Winters
David is an Architect in the Data Science and Engineering team at GoPro and the creator of their Spark-Kafka data ingestion pipeline. Previously, he worked at Apple & Splice Machines.
Hao Zou
Hao is a Senior Big Data Engineer in the Data Science and Engineering team. Previously, he worked at Alpine Data Labs and Pivotal.
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Chester Chen
GoPro’s camera, drone, mobile devices as well as web, desktop applications are generating billions of event logs. The analytics metrics and insights that inform product, engineering, and marketing team decisions need to be distributed quickly and efficiently. We need to visualize the metrics to find the trends or anomalies.
While trying to build up the feature store for machine learning, we need to visualize the features. Google Facets is an excellent project for visualizing features, but can we visualize a larger feature dataset?
These are issues we encountered at GoPro as part of the data platform evolution. In this talk, we will discuss some of the progress we have made at GoPro. We will talk about how to use Slack + Plot.ly to deliver analytics metrics and visualizations. We will also discuss our work to visualize large feature sets using Google Facets with Apache Spark.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
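To make the map and reduce primitives listed above concrete, here is a minimal pure-Python sketch of an element-wise vector multiply and a vector element sum over float versus bfloat16 storage. It is not the report's code and says nothing about its measured performance. It assumes bfloat16 storage is emulated by keeping the upper 16 bits of an IEEE 754 float32 (simple truncation rather than round-to-nearest) and that accumulation happens in Python's double-precision float; all function names are hypothetical.

```python
import struct


def to_bfloat16_bits(x):
    """Store x as bfloat16 by keeping the upper 16 bits of its float32 encoding.

    Assumption: plain truncation; real hardware usually rounds to nearest.
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16


def from_bfloat16_bits(b):
    """Expand 16 stored bits back to a float by zero-filling the low mantissa bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x


def multiply(xs, ys):
    """Map primitive: element-wise product of two equal-length vectors."""
    return [x * y for x, y in zip(xs, ys)]


def sum_float(xs):
    """Reduce primitive: sequential sum over values stored as floats."""
    total = 0.0
    for x in xs:
        total += x
    return total


def sum_bfloat16(stored):
    """Reduce primitive: sequential sum over bfloat16-stored values.

    Each element is widened before being added, so only storage is 16-bit.
    """
    total = 0.0
    for b in stored:
        total += from_bfloat16_bits(b)
    return total


values = [0.1] * 1000
stored = [to_bfloat16_bits(v) for v in values]

full_precision_sum = sum_float(values)
reduced_storage_sum = sum_bfloat16(stored)
storage_error = abs(full_precision_sum - reduced_storage_sum)
```

The difference between the two sums comes entirely from the precision lost when each element is stored, which is the trade-off a float versus bfloat16 storage comparison has to weigh against the halved memory traffic.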
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details, visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...pchutichetpong
M Capital Group (“MCG”) expects to see demand and the changing evolution of supply, facilitated through institutional investment rotation out of offices and into work from home (“WFH”), while the ever-expanding need for data storage as global internet usage expands, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”