The document discusses the benefits and challenges of running big data workloads on cloud native platforms. Some key points discussed include:
- Big data workloads are migrating to the cloud to take advantage of scalability, flexibility and cost effectiveness compared to on-premises solutions.
- Enterprise cloud platforms need to provide centralized management and monitoring of multiple clusters, secure data access, and replication capabilities.
- Running big data on cloud introduces challenges around storage, networking, compute resources, and security that systems need to address, such as consistency issues with object storage, network throughput reductions, and hardware variations across cloud vendors.
- The open source community is helping users address these challenges to build cloud native data architectures.
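One of the storage challenges named above is eventual consistency in object stores: a freshly written object may briefly appear missing to a subsequent read. A common application-side mitigation is to retry reads with exponential backoff. Below is a minimal, hedged sketch of that pattern; the `flaky_fetch` store is a simulation, not a real object-store client.

```python
import time

def read_with_retry(fetch, key, retries=5, delay=0.01):
    """Retry a read until the object becomes visible.

    Eventually consistent object stores may return "not found" for a
    key that was just written; retrying with a short exponential
    backoff is a common mitigation.
    """
    for attempt in range(retries):
        value = fetch(key)
        if value is not None:
            return value
        time.sleep(delay * (2 ** attempt))  # exponential backoff
    raise KeyError(f"{key} not visible after {retries} attempts")

# Simulated store that only "sees" the key from the third read onward.
calls = {"n": 0}
def flaky_fetch(key):
    calls["n"] += 1
    return b"payload" if calls["n"] >= 3 else None

print(read_with_retry(flaky_fetch, "bucket/report.csv"))  # b'payload'
```

In production this pattern would wrap a real client call (and some stores now offer strong read-after-write consistency, making the retry unnecessary), but the shape of the loop is the same.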
Speeding Up Atlas Deep Learning Platform with Alluxio + Fluid (Alluxio, Inc.)
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Speeding Up Atlas Deep Learning Platform with Alluxio + Fluid
Yuandong Xie, Platform Researcher (Unisound)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Unified Data Access with Gimel
Deepak Chandramouli, Engineering Lead
Anisha Nainani, Sr. Software Engineer
Dr. Vladimir Bacvanski, Principal Architect (Paypal)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey (Alluxio, Inc.)
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Sandipan Chakraborty, Director of Engineering (Rakuten)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Accelerate Analytics and ML in the Hybrid Cloud Era (Alluxio, Inc.)
Alluxio Webinar
April 6, 2021
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Alex Ma, Alluxio
Peter Behrakis, Alluxio
Many companies we talk to have on-premises data lakes and use the cloud(s) to burst compute. Many are now establishing new object data lakes as well. As a result, analytics workloads such as Hive, Spark, and Presto, along with machine learning, experience sluggish response times when data and compute sit in multiple locations. We also know there is an immense and growing data management burden to support these workflows.
In this talk, we will walk through what Alluxio’s Data Orchestration for the hybrid cloud era is and how it solves the performance and data management challenges we see.
In this tech talk, we'll go over:
- What is Alluxio Data Orchestration?
- How does it work?
- Alluxio customer results
How to Set Up ApsaraDB for RDS on Alibaba Cloud (Alibaba Cloud)
See Webinar Recording at https://resource.alibabacloud.com/webinar/detail.htm?webinarId=26
Gain an introduction to ApsaraDB for RDS, a cloud-based relational database product provided by Alibaba Cloud. In this webinar you will watch over the shoulder of a Solution Architect and Trainer, as he covers the basic concepts and features of ApsaraDB for RDS including:
- HA features (Master/Slave Architecture, Backup/Recovery, Temporary Instance),
- Scalability features (Read-only Instance), and
- Security and Monitoring features.
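The scalability feature above, read-only instances, is typically exploited by routing writes to the primary endpoint and spreading reads across replicas. The sketch below illustrates that routing decision only; the endpoint strings are hypothetical placeholders, not real ApsaraDB endpoints, and real drivers would manage connections rather than strings.

```python
import itertools

class ReadWriteRouter:
    """Send writes to the primary and round-robin reads across
    read-only replicas -- the pattern read-only instances enable."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas) if replicas else None

    def endpoint_for(self, sql):
        # Naive classification: treat SELECT statements as reads.
        is_read = sql.lstrip().lower().startswith("select")
        if is_read and self._replicas is not None:
            return next(self._replicas)
        return self.primary

router = ReadWriteRouter("primary:3306", ["replica-1:3306", "replica-2:3306"])
print(router.endpoint_for("SELECT * FROM orders"))    # replica-1:3306
print(router.endpoint_for("INSERT INTO orders ..."))  # primary:3306
```

Note the classification here is deliberately naive; replica lag also means read-your-own-writes queries may still need the primary.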
This webinar is ideally suited for database engineers and beginners to the Alibaba Cloud product suite.
ApsaraDB for RDS: www.alibabacloud.com/product/apsaradb-for-rds
Introducing the Hub for Data Orchestration (Alluxio, Inc.)
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Introducing the Hub for Data Orchestration
Danny Linden, Chapter Lead Software Engineer (Ryte)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Presto: Fast SQL-on-Anything Across Data Lakes, DBMS, and NoSQL Data Stores (Alluxio, Inc.)
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Presto: Fast SQL-on-Anything Across Data Lakes, DBMS, and NoSQL Data Stores
Kamil Bajda-Pawlikowski, CTO, Starburst Data
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
The Future of Computing is Distributed
Professor Ion Stoica, UC Berkeley RISELab
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd Known (DataStax)
A brief intro to how Barracuda Networks uses Cassandra and how they are replacing their MySQL infrastructure with Cassandra. This presentation includes the lessons they've learned along the way during this migration.
Speaker: Michael Kjellman, Software Engineer at Barracuda Networks
Michael Kjellman is a Software Engineer, from San Francisco, working at Barracuda Networks. Michael works across multiple products, technologies, and languages. He primarily works on Barracuda's spam infrastructure and web filter classification data.
Microsoft: Building a Massively Scalable System with DataStax and Microsoft's... (DataStax Academy)
We have the challenge of how to reliably store massive quantities of data that are available even in the face of infrastructure failures. We have similar challenges on the application side. The most successful cloud architectures break applications down into microservices. How then do we deploy, upgrade and manage the scale of those microservices? This session will illustrate how to tackle these challenges by taking advantage of both Cassandra and Microsoft's next generation PaaS infrastructure called Azure Service Fabric.
Big Data Quickstart Series 3: Perform Data Integration (Alibaba Cloud)
See webinar video recording of this presentation at https://resource.alibabacloud.com/webinar/detail.htm?webinarId=37
As the third installment of the Alibaba Cloud Big Data Quickstart Series, this webinar presentation introduces the basic concepts and architecture of the offline processing engine MaxCompute and online integrated development environment DataWorks. This includes an explanation and demonstration of how to use the Data Integration component of DataWorks to integrate unstructured data stored in OSS and structured data stored with ApsaraDB for RDS (MySQL) to MaxCompute.
This Big Data case study outlines the Hadoop infrastructure deployment for a Fortune 100 media and telecommunications company.
Hadoop adoption in this company had grown organically across multiple different teams, starting with “science projects” and lab initiatives that quickly grew and expanded. Going forward, some of the options they considered for their Big Data deployment included expanding their on-premises infrastructure and using a Hadoop-as-a-Service cloud offering.
Fortunately, they realized that there is a third option: providing the benefits of Hadoop-as-a-Service with on-premises infrastructure. They selected the BlueData EPIC software platform to virtualize their Hadoop infrastructure and provide on-demand access to virtual Hadoop clusters in a secure, multi-tenant model.
Learn more about this case study in the blog post at: http://www.bluedata.com/blog/2015/05/big-data-case-study-hadoop-infrastructure
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...) (DataStax)
During this session Ben Lackey (DataStax) and Ravi Madasu (Google) will cover best practices for quickly setting up a cluster on Google Cloud Platform (GCP) using both Google Compute Engine (GCE) and Google Container Engine (GKE) which is based on Kubernetes and Docker.
About the Speakers
Ben Lackey Partner Architect, DataStax
I work in the Cloud Strategy group at DataStax where I concentrate on improving the integration between DataStax Enterprise and cloud platforms including Azure, GCP and Pivotal.
Ravi Madasu
Ravi Madasu is a program manager at Google, primarily focused on Google Cloud Launcher. He works closely with ISV partners to make their products and services available on the Google Cloud Platform, providing a developer-friendly deployment experience. He has 15+ years of experience working in a variety of roles, such as software engineer, project manager, and product manager. Ravi received a Master's degree in Information Systems from Northeastern University and an MBA from Carnegie Mellon University.
Cloudian HyperStore offers 100% S3 compatibility for low-cost, scalable smart object storage.
With HyperStore 6.0, we are focused on bringing down operational costs so that you can more effectively track, manage, and optimize your data storage as you scale.
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H... (DataStax)
Big data doesn't mean big money. In fact, choosing a NoSQL solution will almost certainly save your business money, in terms of hardware, licensing, and total cost of ownership. What's more, choosing the correct technology for your use case will almost certainly increase your top line as well.
Big words, right? We'll back them up with customer case studies and lots of details.
This webinar will give you the basics for growing your business in a profitable way. What's the use of growing your top line but outspending any gains on cumbersome, ineffective, outdated IT? We'll take you through the specific use cases and business models that are the best fit for NoSQL solutions.
By the way, no prior knowledge is required. If you don't even know what RDBMS or NoSQL stand for, you are in the right place. Get your questions answered, and get your business on the right track to meeting your customers' needs in today's data environment.
Dyn delivers exceptional Internet Performance. Enabling high quality services requires data centers around the globe. In order to manage services, customers need timely insight collected from all over the world. Dyn uses DataStax Enterprise (DSE) to deploy complex clusters across multiple datacenters to enable sub 50 ms query responses for hundreds of billions of data points. From granular DNS traffic data, to aggregated counts for a variety of report dimensions, DSE at Dyn has been up since 2013 and has shined through upgrades, data center migrations, DDoS attacks and hardware failures. In this webinar, Principal Engineers Tim Chadwick and Rick Bross cover the requirements which led them to choose DSE as their go-to Big Data solution, the path which led to SPARK, and the lessons that we’ve learned in the process.
Cloudian HyperStore Storage System is a peer-to-peer software defined storage platform, providing an enterprise grade S3-compliant object storage system on low cost commodity servers. Its multi-tenanted and multi-interface design can support many applications on the same platform.
Webinar | How Clear Capital Delivers Always-on Appraisals on 122 Million Prop... (DataStax)
In online residential and commercial real estate, even fractions of seconds in response times affect customer satisfaction and conversion to revenue. The need for continuous availability is paramount to deliver the levels of service customers demand from modern online applications.
Join David Prinzing, Enterprise Architect at Clear Capital, to discover why Clear Capital, a premium provider of real estate asset valuation and collateral risk assessment, chose DataStax as the database backbone for their ClearCollateral Platform. David will discuss how DataStax Enterprise, the world's fastest, most scalable distributed database technology built on Apache Cassandra, ensures 100% uptime for over 122 million properties (90% of all the properties in the United States) and supports reporting on over 1 billion total valuations while never going down.
- The challenges in building real-time applications using relational technologies are forcing financial services firms to migrate to distributed database technologies
- How Clear Capital delivers 100% availability and real-time decision support across multiple data centers in the Amazon Cloud using DataStax Enterprise
- Why Apache Cassandra’s architecture delivers always-on, customer engaging applications that capture new business opportunities
The Future of Postgres Sharding / Bruce Momjian (PostgreSQL) (Ontico)
Database sharding involves spreading database contents across multiple servers, with each server holding only part of the database. While it is possible to vertically scale Postgres, and to scale read-only workloads across multiple servers, only sharding allows multi-server read-write scaling. This presentation will cover the advantages of sharding and future Postgres sharding implementation requirements, including foreign data wrapper enhancements, parallelism, and global snapshot and transaction control. This is a follow-up to my Postgres Scaling Opportunities presentation.
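The core mechanic the abstract describes, each server holding only part of the database, reduces to a deterministic key-to-shard mapping at the client or middleware layer. A minimal sketch of hash-based shard routing follows; the shard names are hypothetical, and real Postgres sharding (e.g., via foreign data wrappers) pushes this routing into the database itself rather than application code.

```python
import hashlib

def shard_for(key, num_shards):
    """Map a key to a shard deterministically. A stable hash is used
    because Python's built-in hash() is salted per process."""
    digest = hashlib.sha256(str(key).encode()).hexdigest()
    return int(digest, 16) % num_shards

# Hypothetical shard servers, one slice of the data each.
SHARDS = ["pg-shard-0", "pg-shard-1", "pg-shard-2", "pg-shard-3"]

def server_for(customer_id):
    return SHARDS[shard_for(customer_id, len(SHARDS))]

# Every lookup for the same key lands on the same server.
assert server_for(42) == server_for(42)
```

Modulo hashing like this forces large data movement when `num_shards` changes, which is one reason production systems prefer consistent hashing or directory-based schemes.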
Exploring Alluxio for Daily Tasks at Robinhood (Alluxio, Inc.)
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Exploring Alluxio for Daily Tasks at Robinhood
Jiawei Zhang, Data Platform Engineer (Robinhood)
Yichuan Huang, Data Platform Engineer (Robinhood)
Grace Lu, Data Platform Engineer (Robinhood)
Wenlong Xiong, Data Platform Engineer (Robinhood)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Solving Enterprise Challenges Through Scale-Out Storage & Big Compute (Avere Systems)
Google Cloud Platform, Avere Systems, and Cycle Computing experts will share best practices for advancing solutions to big challenges faced by enterprises with growing compute and storage needs. In this “best practices” webinar, you’ll hear how these companies are working to improve results that drive businesses forward through scalability, performance, and ease of management.
The slides were from a webinar presented January 24, 2017. The audience learned:
- How enterprises are using Google Cloud Platform to gain compute and storage capacity on demand
- Best practices for efficient use of cloud compute and storage resources
- How to overcome the need for file systems within a hybrid cloud environment
- How to eliminate latency between cloud and data center architectures
- How to best manage simulation, analytics, and big data workloads in dynamic environments
- Market dynamics drawing companies to new storage models over the next several years
The presenters laid out a foundation for building infrastructure that supports ongoing demand growth.
Slides: Accelerating Queries on Cloud Data Lakes (DATAVERSITY)
Using “zero-copy” hybrid bursting on remote data to solve data lake analytics capacity and performance problems.
Data scientists want answers on demand. But in today’s enterprise architectures, the reality is that most data remains on-prem, despite the promise of cloud-based analytics. Moving all that data to the cloud has typically not been possible for many reasons including cost, latency, and technical difficulty. So, what if there was a technology that would connect these on-prem environments to any major cloud platform, enabling high-powered computing without the need to move massive amounts of data?
Join us for this webinar where Alex Ma of Alluxio, an open-source data orchestration platform, will discuss how a data orchestration approach offers a solution for connecting traditional on-prem data centers and cloud data lakes with other clouds and data centers. With Alluxio’s “zero-copy” burst solution, companies can bridge remote data centers and data lakes with computing frameworks in other locations, enabling them to offload, compute, and leverage the flexibility, scalability, and power of the cloud for their remote data.
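The "zero-copy" idea above, leaving data in the remote lake while serving compute from a cache near the workload, can be illustrated with a toy read-through cache. This is a simplified sketch for intuition only: the remote fetcher is a stand-in lambda, and a real data orchestration layer adds distributed tiering, eviction policies, and metadata sync.

```python
class ReadThroughCache:
    """Serve repeated reads locally; hit the remote store only on a
    miss. Data is not bulk-copied up front -- only what compute
    actually touches gets cached."""

    def __init__(self, remote_fetch, capacity=1024):
        self._fetch = remote_fetch
        self._cache = {}          # insertion-ordered (Python 3.7+)
        self._capacity = capacity
        self.remote_reads = 0     # observable cost of cold reads

    def get(self, path):
        if path not in self._cache:
            if len(self._cache) >= self._capacity:
                # Evict the oldest-inserted entry (simple FIFO policy).
                self._cache.pop(next(iter(self._cache)))
            self.remote_reads += 1
            self._cache[path] = self._fetch(path)
        return self._cache[path]

cache = ReadThroughCache(lambda p: f"bytes of {p}")
cache.get("s3://lake/t1.parquet")
cache.get("s3://lake/t1.parquet")  # served locally; no second remote read
assert cache.remote_reads == 1
```

The observable win is exactly what the webinar pitches: repeated analytics over the same remote data pay the WAN cost once, not per query.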
Caching for Microservices Architectures: Session I (VMware Tanzu)
In this 60 minute webinar, we will cover the key areas of consideration for data layer decisions in a microservices architecture, and how a caching layer, satisfies these requirements. You’ll walk away from this webinar with a better understanding of the following concepts:
- How microservices are easy to scale up and down, so both the service layer and the data layer need to support this elasticity.
- Why microservices simplify and accelerate the software delivery lifecycle by splitting up effort into smaller isolated pieces that autonomous teams can work on independently. Event-driven systems promote autonomy.
- Where microservices can be distributed across availability zones and data centers for addressing performance and availability requirements. Similarly, the data layer should support this distribution of workload.
- How microservices can be part of an evolution that includes your legacy applications. Similarly, the data layer must accommodate this graceful on-ramp to microservices.
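The autonomy point above, that event-driven systems let teams evolve services independently, rests on publishers not knowing their consumers. A minimal in-process sketch of that decoupling follows; the topic name and handlers are illustrative, and a real microservices deployment would use a network broker rather than in-process dispatch.

```python
from collections import defaultdict

class EventBus:
    """Minimal pub/sub bus: the publishing service does not know which
    services consume its events, so consumers can be added or removed
    without touching the publisher."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self._subscribers[topic]:
            handler(event)

bus = EventBus()
audit_log = []
# Two independent consumers of the same event; neither knows the other.
bus.subscribe("order.created", lambda e: audit_log.append(e["id"]))
bus.subscribe("order.created", lambda e: print("notify warehouse:", e["id"]))
bus.publish("order.created", {"id": 1001})
assert audit_log == [1001]
```

Swapping in a durable broker preserves this shape while adding the delivery guarantees distributed services need.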
Presenter : Jagdish Mirani is a Product Marketing Manager in charge of Pivotal’s in-memory products
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
Learn how organizations are deriving unique customer insights, improving product and services efficiency, and reducing business risk with a modern big data architecture powered by Cloudera on AWS. In this webinar, you see how fast and easy it is to deploy a modern data management platform—in your cloud, on your terms.
Estimating the Total Costs of Your Cloud Analytics PlatformDATAVERSITY
Organizations today need a broad set of enterprise data cloud services with key data functionality to modernize applications and utilize machine learning. They need a platform designed to address multi-faceted needs by offering multi-function Data Management and analytics to solve the enterprise’s most pressing data and analytic challenges in a streamlined fashion. They need a worry-free experience with the architecture and its components.
Simplifying Your Cloud Architecture with a Logical Data Fabric (APAC)Denodo
Watch full webinar here: https://bit.ly/3dudL6u
It's not if you move to the cloud, but when. Most organisations are well underway with migrating applications and data to the cloud. In fact, most organisations - whether they realise it or not - have a multi-cloud strategy. Single, hybrid, or multi-cloud…the potential benefits are huge - flexibility, agility, cost savings, scaling on-demand, etc. However, the challenges can be just as large and daunting. A poorly managed migration to the cloud can leave users frustrated at their inability to get to the data that they need and IT scrambling to cobble together a solution.
In this session, we will look at the challenges facing data management teams as they migrate to cloud and multi-cloud architectures. We will show how the Denodo Platform can:
- Reduce the risk and minimise the disruption of migrating to the cloud.
- Make it easier and quicker for users to find the data that they need - wherever it is located.
- Provide a uniform security layer that spans hybrid and multi-cloud environments.
A Successful Journey to the Cloud with Data VirtualizationDenodo
Watch full webinar here: https://bit.ly/3mPLIlo
A shift to the cloud is a common element of any current data strategy. However, a successful transition to the cloud is not easy and can take years. It comes with security challenges, changes in downstream and upstream applications, and new ways to operate and deploy software. An abstraction layer that decouples data access from storage and processing can be a key element to enable a smooth journey to the cloud.
Attend this webinar to learn more about:
- How to use Data Virtualization to gradually change data systems without impacting business operations
- How Denodo integrates with the larger cloud ecosystems to enable security
- How simple it is to create and manage a Denodo cloud deployment
AWS Summit 2013 | Auckland - Building Web Scale Applications with AWSAmazon Web Services
AWS provides a platform that is ideally suited for deploying highly available and reliable systems that can scale with a minimal amount of human interaction. This talk describes a set of architectural patterns that support highly available services that are also scalable, low cost, low latency and allow for agile development practices. We walk through the various architectural decisions taken for each tier and explain our choices for appropriate AWS services and building blocks to ensure the security, scale, availability and reliability of the application.
Big data journey to the cloud 5.30.18 asher bartchCloudera, Inc.
We hope this session was valuable in teaching you more about Cloudera Enterprise on AWS, and how fast and easy it is to deploy a modern data management platform—in your cloud and on your terms.
What Does Real World Mass Adoption of Decentralized Tech Look Like?All Things Open
Presented at All Things Open 2023
Presented by Karl Mozurkewich - Storj
Title: What Does Real World Mass Adoption of Decentralized Tech Look Like?
Abstract: We delve into the transformative potential of decentralized technology. Beginning with a brief overview of the rise of centralization with the advent of the internet and the counter-shift marked by blockchain we explore the intrinsic characteristics of decentralized and distributed systems, such as trustless operations, peer-to-peer networks, and enterprise application scalability. Various sectors, including finance, supply chains, media and entertainment, data science and cloud infrastructure are on the brink of disruption. The societal implications are vast, with the potential for greater individual empowerment, a greener planet and more viable resource utilization, but concerns about data security persist.
Find more info about All Things Open:
On the web: https://www.allthingsopen.org/
Twitter: https://twitter.com/AllThingsOpen
LinkedIn: https://www.linkedin.com/company/all-things-open/
Instagram: https://www.instagram.com/allthingsopen/
Facebook: https://www.facebook.com/AllThingsOpen
Mastodon: https://mastodon.social/@allthingsopen
Threads: https://www.threads.net/@allthingsopen
2023 conference: https://2023.allthingsopen.org/
Cloud Migration headache? Ease the pain with Data Virtualization! (EMEA)Denodo
Watch full webinar here: https://bit.ly/3CWIBzd
Moving data to the Cloud is a priority for many organizations. Benefits - in terms of flexibility, agility, and cost savings - are driving Cloud adoption. This journey to the Cloud is not easy: moving application(s) and data to the Cloud can be challenging and entails disruption of business, when not carefully managed.
When systems are being migrated, the resultant hybrid (or even multi-) Cloud architecture is, by definition, more complex AND making it harder/more costly to retrieve the data we need.
Data Virtualization can help organizations at all stages of a Cloud journey - during migration as well as in our “new hybrid multi-Cloud reality”
Watch on-demand this webinar to learn how Data Virtualization can:
- Help organizations manage risk and minimize the disruption caused as systems are moved to the Cloud
- Provide a single point of access for data that is both on-premise and in the Cloud, making it easier for users to find and access the data that they need
- Provide a secure layer to protect and manage data when it's distributed across hybrid or multi-Cloud architectures
… watch a live demo about how to ease the migration.
Dremio, une architecture simple et performance pour votre data lakehouse.
Dans le monde de la donnée, Dremio, est inclassable ! C’est à la fois une plateforme de diffusion des données, un moteur SQL puissant basé sur Apache Arrow, Apache Calcite, Apache Parquet, un catalogue de données actif et aussi un Data Lakehouse ouvert ! Après avoir fait connaissance avec cette plateforme, il s’agira de préciser comment Dremio aide les organisations à relever les défis qui sont les leurs en matière de gestion et gouvernance des données facilitant l’exécution de leurs analyses dans le cloud (et/ou sur site) sans le coût, la complexité et le verrouillage des entrepôts de données.
SpringPeople - Introduction to Cloud ComputingSpringPeople
Cloud computing is no longer a fad that is going around. It is for real and is perhaps the most talked about subject. Various players in the cloud eco-system have provided a definition that is closely aligned to their sweet spot –let it be infrastructure, platforms or applications.
This presentation will provide an exposure of a variety of cloud computing techniques, architecture, technology options to the participants and in general will familiarize cloud fundamentals in a holistic manner spanning all dimensions such as cost, operations, technology etc
Data Lake and the rise of the microservicesBigstep
By simply looking at structured and unstructured data, Data Lakes enable companies to understand correlations between existing and new external data - such as social media - in ways traditional Business Intelligence tools cannot.
For this you need to find out the most efficient way to store and access structured or unstructured petabyte-sized data across your entire infrastructure.
In this meetup we’ll give answers on the next questions:
1. Why would someone use a Data Lake?
2. Is it hard to build a Data Lake?
3. What are the main features that a Data Lake should bring in?
4. What’s the role of the microservices in the big data world?
Quarkus Hidden and Forbidden ExtensionsMax Andersen
Quarkus has a vast extension ecosystem and is known for its subsonic and subatomic feature set. Some of these features are not as well known, and some extensions are less talked about, but that does not make them less interesting - quite the opposite.
Come join this talk to see some tips and tricks for using Quarkus and some of the lesser known features, extensions and development techniques.
top nidhi software solution freedownloadvrstrong314
This presentation emphasizes the importance of data security and legal compliance for Nidhi companies in India. It highlights how online Nidhi software solutions, like Vector Nidhi Software, offer advanced features tailored to these needs. Key aspects include encryption, access controls, and audit trails to ensure data security. The software complies with regulatory guidelines from the MCA and RBI and adheres to Nidhi Rules, 2014. With customizable, user-friendly interfaces and real-time features, these Nidhi software solutions enhance efficiency, support growth, and provide exceptional member services. The presentation concludes with contact information for further inquiries.
Accelerate Enterprise Software Engineering with PlatformlessWSO2
Key takeaways:
Challenges of building platforms and the benefits of platformless.
Key principles of platformless, including API-first, cloud-native middleware, platform engineering, and developer experience.
How Choreo enables the platformless experience.
How key concepts like application architecture, domain-driven design, zero trust, and cell-based architecture are inherently a part of Choreo.
Demo of an end-to-end app built and deployed on Choreo.
Into the Box Keynote Day 2: Unveiling amazing updates and announcements for modern CFML developers! Get ready for exciting releases and updates on Ortus tools and products. Stay tuned for cutting-edge innovations designed to boost your productivity.
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?XfilesPro
Worried about document security while sharing them in Salesforce? Fret no more! Here are the top-notch security standards XfilesPro upholds to ensure strong security for your Salesforce documents while sharing with internal or external people.
To learn more, read the blog: https://www.xfilespro.com/how-does-xfilespro-make-document-sharing-secure-and-seamless-in-salesforce/
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisGlobus
JASMIN is the UK’s high-performance data analysis platform for environmental science, operated by STFC on behalf of the UK Natural Environment Research Council (NERC). In addition to its role in hosting the CEDA Archive (NERC’s long-term repository for climate, atmospheric science & Earth observation data in the UK), JASMIN provides a collaborative platform to a community of around 2,000 scientists in the UK and beyond, providing nearly 400 environmental science projects with working space, compute resources and tools to facilitate their work. High-performance data transfer into and out of JASMIN has always been a key feature, with many scientists bringing model outputs from supercomputers elsewhere in the UK, to analyse against observational or other model data in the CEDA Archive. A growing number of JASMIN users are now realising the benefits of using the Globus service to provide reliable and efficient data movement and other tasks in this and other contexts. Further use cases involve long-distance (intercontinental) transfers to and from JASMIN, and collecting results from a mobile atmospheric radar system, pushing data to JASMIN via a lightweight Globus deployment. We provide details of how Globus fits into our current infrastructure, our experience of the recent migration to GCSv5.4, and of our interest in developing use of the wider ecosystem of Globus services for the benefit of our user community.
How Recreation Management Software Can Streamline Your Operations.pptxwottaspaceseo
Recreation management software streamlines operations by automating key tasks such as scheduling, registration, and payment processing, reducing manual workload and errors. It provides centralized management of facilities, classes, and events, ensuring efficient resource allocation and facility usage. The software offers user-friendly online portals for easy access to bookings and program information, enhancing customer experience. Real-time reporting and data analytics deliver insights into attendance and preferences, aiding in strategic decision-making. Additionally, effective communication tools keep participants and staff informed with timely updates. Overall, recreation management software enhances efficiency, improves service delivery, and boosts customer satisfaction.
How to Position Your Globus Data Portal for Success Ten Good PracticesGlobus
Science gateways allow science and engineering communities to access shared data, software, computing services, and instruments. Science gateways have gained a lot of traction in the last twenty years, as evidenced by projects such as the Science Gateways Community Institute (SGCI) and the Center of Excellence on Science Gateways (SGX3) in the US, The Australian Research Data Commons (ARDC) and its platforms in Australia, and the projects around Virtual Research Environments in Europe. A few mature frameworks have evolved with their different strengths and foci and have been taken up by a larger community such as the Globus Data Portal, Hubzero, Tapis, and Galaxy. However, even when gateways are built on successful frameworks, they continue to face the challenges of ongoing maintenance costs and how to meet the ever-expanding needs of the community they serve with enhanced features. It is not uncommon that gateways with compelling use cases are nonetheless unable to get past the prototype phase and become a full production service, or if they do, they don't survive more than a couple of years. While there is no guaranteed pathway to success, it seems likely that for any gateway there is a need for a strong community and/or solid funding streams to create and sustain its success. With over twenty years of examples to draw from, this presentation goes into detail for ten factors common to successful and enduring gateways that effectively serve as best practices for any new or developing gateway.
We describe the deployment and use of Globus Compute for remote computation. This content is aimed at researchers who wish to compute on remote resources using a unified programming interface, as well as system administrators who will deploy and operate Globus Compute services on their research computing infrastructure.
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Anthony Dahanne
Les Buildpacks existent depuis plus de 10 ans ! D’abord, ils étaient utilisés pour détecter et construire une application avant de la déployer sur certains PaaS. Ensuite, nous avons pu créer des images Docker (OCI) avec leur dernière génération, les Cloud Native Buildpacks (CNCF en incubation). Sont-ils une bonne alternative au Dockerfile ? Que sont les buildpacks Paketo ? Quelles communautés les soutiennent et comment ?
Venez le découvrir lors de cette session ignite
Large Language Models and the End of ProgrammingMatt Welsh
Talk by Matt Welsh at Craft Conference 2024 on the impact that Large Language Models will have on the future of software development. In this talk, I discuss the ways in which LLMs will impact the software industry, from replacing human software developers with AI, to replacing conventional software with models that perform reasoning, computation, and problem-solving.
Software Engineering, Software Consulting, Tech Lead.
Spring Boot, Spring Cloud, Spring Core, Spring JDBC, Spring Security,
Spring Transaction, Spring MVC,
Log4j, REST/SOAP WEB-SERVICES.
Cyaniclab : Software Development Agency Portfolio.pdfCyanic lab
CyanicLab, an offshore custom software development company based in Sweden,India, Finland, is your go-to partner for startup development and innovative web design solutions. Our expert team specializes in crafting cutting-edge software tailored to meet the unique needs of startups and established enterprises alike. From conceptualization to execution, we offer comprehensive services including web and mobile app development, UI/UX design, and ongoing software maintenance. Ready to elevate your business? Contact CyanicLab today and let us propel your vision to success with our top-notch IT solutions.
Developing Distributed High-performance Computing Capabilities of an Open Sci...Globus
COVID-19 had an unprecedented impact on scientific collaboration. The pandemic and its broad response from the scientific community has forged new relationships among public health practitioners, mathematical modelers, and scientific computing specialists, while revealing critical gaps in exploiting advanced computing systems to support urgent decision making. Informed by our team’s work in applying high-performance computing in support of public health decision makers during the COVID-19 pandemic, we present how Globus technologies are enabling the development of an open science platform for robust epidemic analysis, with the goal of collaborative, secure, distributed, on-demand, and fast time-to-solution analyses to support public health.
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Globus
Large Language Models (LLMs) are currently the center of attention in the tech world, particularly for their potential to advance research. In this presentation, we'll explore a straightforward and effective method for quickly initiating inference runs on supercomputers using the vLLM tool with Globus Compute, specifically on the Polaris system at ALCF. We'll begin by briefly discussing the popularity and applications of LLMs in various fields. Following this, we will introduce the vLLM tool, and explain how it integrates with Globus Compute to efficiently manage LLM operations on Polaris. Attendees will learn the practical aspects of setting up and remotely triggering LLMs from local machines, focusing on ease of use and efficiency. This talk is ideal for researchers and practitioners looking to leverage the power of LLMs in their work, offering a clear guide to harnessing supercomputing resources for quick and effective LLM inference.
Why React Native as a Strategic Advantage for Startup Innovation.pdfayushiqss
Do you know that React Native is being increasingly adopted by startups as well as big companies in the mobile app development industry? Big names like Facebook, Instagram, and Pinterest have already integrated this robust open-source framework.
In fact, according to a report by Statista, the number of React Native developers has been steadily increasing over the years, reaching an estimated 1.9 million by the end of 2024. This means that the demand for this framework in the job market has been growing making it a valuable skill.
But what makes React Native so popular for mobile application development? It offers excellent cross-platform capabilities among other benefits. This way, with React Native, developers can write code once and run it on both iOS and Android devices thus saving time and resources leading to shorter development cycles hence faster time-to-market for your app.
Let’s take the example of a startup, which wanted to release their app on both iOS and Android at once. Through the use of React Native they managed to create an app and bring it into the market within a very short period. This helped them gain an advantage over their competitors because they had access to a large user base who were able to generate revenue quickly for them.
Strategies for Successful Data Migration Tools.pptxvarshanayak241
Data migration is a complex but essential task for organizations aiming to modernize their IT infrastructure and leverage new technologies. By understanding common challenges and implementing these strategies, businesses can achieve a successful migration with minimal disruption. Data Migration Tool like Ask On Data play a pivotal role in this journey, offering features that streamline the process, ensure data integrity, and maintain security. With the right approach and tools, organizations can turn the challenge of data migration into an opportunity for growth and innovation.
Experience our free, in-depth three-part Tendenci Platform Corporate Membership Management workshop series! In Session 1 on May 14th, 2024, we began with an Introduction and Setup, mastering the configuration of your Corporate Membership Module settings to establish membership types, applications, and more. Then, on May 16th, 2024, in Session 2, we focused on binding individual members to a Corporate Membership and Corporate Reps, teaching you how to add individual members and assign Corporate Representatives to manage dues, renewals, and associated members. Finally, on May 28th, 2024, in Session 3, we covered questions and concerns, addressing any queries or issues you may have.
For more Tendenci AMS events, check out www.tendenci.com/events
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTier1 app
Even though at surface level ‘java.lang.OutOfMemoryError’ appears as one single error; underlyingly there are 9 types of OutOfMemoryError. Each type of OutOfMemoryError has different causes, diagnosis approaches and solutions. This session equips you with the knowledge, tools, and techniques needed to troubleshoot and conquer OutOfMemoryError in all its forms, ensuring smoother, more efficient Java applications.
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Globus
The Earth System Grid Federation (ESGF) is a global network of data servers that archives and distributes the planet’s largest collection of Earth system model output for thousands of climate and environmental scientists worldwide. Many of these petabyte-scale data archives are located in proximity to large high-performance computing (HPC) or cloud computing resources, but the primary workflow for data users consists of transferring data, and applying computations on a different system. As a part of the ESGF 2.0 US project (funded by the United States Department of Energy Office of Science), we developed pre-defined data workflows, which can be run on-demand, capable of applying many data reduction and data analysis to the large ESGF data archives, transferring only the resultant analysis (ex. visualizations, smaller data files). In this talk, we will showcase a few of these workflows, highlighting how Globus Flows can be used for petabyte-scale climate analysis.
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Big Data on Cloud Native Platform
1. Big Data on Cloud Native Platform
Rajesh Balamohan
Sunil Govindan
2. Speaker Bio
Rajesh Balamohan
Principal Engineer 2 @ Cloudera
Apache Hive, ORC Committer & Apache Tez PMC and Committer
@rajeshbalamohan
Sunil Govindan
Engineering Manager @ Cloudera
Apache Hadoop, Submarine, YuniKorn PMC member & Committer
@sunilgovind
3. Agenda
● Why Big Data workloads need to migrate to Cloud
● Aspects of Enterprise Ready Cloud Platform
● Challenges of Big Data on Cloud Platform
4. Why Big Data workloads need to migrate to the cloud?
5. About (Big) Data itself...
Key thoughts we hear from customers about today’s DATA:
“The ability to consistently extract an accurate business proposition from data”
“Data will grow over time - probably exponentially”
“Data analytics returns profound business insights only when you have access to more data”
So how do we keep data available as needed (to get value from that data)?
6. Data Architecture Evolution: Gen 1
Data volumes are growing exponentially, and on-prem is not cost effective & scalable!
7. Cloud Adoption Trend
“The worldwide infrastructure as a service (IaaS) market grew 37.3% in 2019 to total $44.5 billion, up from $32.4 billion in 2018, according to Gartner, Inc.”
Why is cloud adoption growing at such a rapid pace?
“Cloud computing offers access to data storage and compute on a more scalable, flexible and cost-effective basis than can be achieved with an on-premises deployment”
8. Why Big Data workloads need Cloud?
Some high level advantages:
● Pay as you go: no hardware acquisitions, thus zero CAPEX
● Self serve: easier accessibility
● Cost effective & on-demand
● Highly elastic: can scale 100s of nodes up/down easily
● No more installation/upgrade hassles
● Disaggregated Storage
10. Big Data in Cloud
Hadoop: “Decade Two, Day Zero”
Philosophy towards a modern Data Architecture
● Disaggregate storage, compute, security and governance
● Build for extremely large-scale using distributed systems
● Leverage open source for open standards and community scale
● Continuously evolve the ecosystem for innovation at every layer, independently
13. Critical Aspects of Enterprise Cloud Platform
● Manage and monitor multiple clusters
● Secure data via a single window
● Authentication & authorization via a single window
● Replicate data across multiple clusters on a need basis
● Profile and debug queries across multiple clusters via a single window
● Multiple experiences depending on the user (Data Engineering, Streaming, Fast Analytics, Data Profiling, etc.)
● Classic Clusters (optional)
17. Challenges in the dimensions of
- Storage
- Network
- Compute
- Throttling
- Security
- Hardware Specs
* These are some of the dimensions that we would like to cover in today’s talk.
18. Consistency & Latency Issues with ObjectStores
● Eventual consistency issues
○ Certain ObjectStores provide eventual consistency (e.g. S3)
■ New files may not be visible for listing (until safely propagated internally).
■ Opening a deleted file may still be possible due to consistency issues.
○ S3Guard
■ Uses DynamoDB to persist metadata changes, providing a consistent view of S3 objects for processing.
■ Supports DynamoDB on-demand (i.e. no need to explicitly set capacity limits).
● Renames can be expensive
○ Rename = “Copy + Delete” in ObjectStores like S3.
○ Need to build a stack which reduces rename operations or favours direct writes to the destination.
● The OS page cache is not leveraged, as data is read over the network.
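The rename cost can be made concrete with a toy in-memory object store (a mock for illustration, not the real S3 API): since there is no native rename, every byte under the source prefix must be copied before the source keys are deleted.

```python
# Sketch: object stores have no native rename, so frameworks emulate it as
# copy + delete. This in-memory mock shows why renaming a "directory" costs
# O(bytes copied) rather than being a cheap metadata operation.

class MockObjectStore:
    def __init__(self):
        self.objects = {}      # key -> bytes
        self.bytes_copied = 0  # total payload moved over the "network"

    def put(self, key, data):
        self.objects[key] = data

    def rename(self, src_prefix, dst_prefix):
        # Every object under the prefix is copied, then deleted.
        for key in [k for k in self.objects if k.startswith(src_prefix)]:
            data = self.objects[key]
            self.objects[dst_prefix + key[len(src_prefix):]] = data
            self.bytes_copied += len(data)
            del self.objects[key]

store = MockObjectStore()
for i in range(3):
    store.put(f"tmp/part-{i}", b"x" * 1024)
store.rename("tmp/", "final/")   # "commit" pattern: write to tmp, rename to final
print(sorted(store.objects))     # keys now live under final/
print(store.bytes_copied)        # 3072 bytes re-copied just to "rename"
```

This is why committers that write directly to the destination (avoiding the write-to-temp-then-rename pattern) matter so much on object stores.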
19. Intelligent Caching for Query Performance
● Avoid reading the same data from ObjectStores repeatedly
○ Systems like Hive/LLAP and Impala cache data locally to improve query performance.
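A minimal sketch of the caching idea (illustrative only — this is not LLAP's or Impala's actual design): keep recently read remote blocks in a local LRU cache so repeated reads skip the object store entirely.

```python
from collections import OrderedDict

class BlockCache:
    """Toy LRU cache for blocks read from a remote object store.
    Names and structure are hypothetical, for illustration only."""
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.cache = OrderedDict()
        self.hits = self.misses = 0

    def read(self, file_id, block_id, fetch_remote):
        key = (file_id, block_id)
        if key in self.cache:
            self.cache.move_to_end(key)   # mark as most recently used
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        data = fetch_remote(file_id, block_id)  # the expensive remote read
        self.cache[key] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)      # evict least recently used
        return data

cache = BlockCache(capacity_blocks=2)
fetch = lambda f, b: f"{f}:{b}".encode()        # stand-in for an object-store GET
cache.read("orc1", 0, fetch)                    # miss: fetched remotely
cache.read("orc1", 0, fetch)                    # hit: served locally
print(cache.hits, cache.misses)                 # 1 1
```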
20. Reduce Network Latency
● Reduce number of SSL connections to ObjectStores
○ Added lazySeek implementation to reduce connection breakages.
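The lazy-seek idea can be sketched as follows (names and structure are illustrative, not the actual connector code): seek() only records the requested position, and the connection is opened or repositioned only when read() is called, so back-to-back seeks never tear down an SSL connection.

```python
class LazySeekStream:
    """Sketch of lazy seek over a remote object: seek() is pure bookkeeping;
    the (mock) connection is established lazily on read(). Illustrative only."""
    def __init__(self, data):
        self.data = data
        self.pos = 0            # logical position requested by the caller
        self.connections = 0    # how many times we "opened" a connection
        self.conn_pos = None    # position of the currently open connection

    def seek(self, pos):
        self.pos = pos          # no connection work here

    def read(self, n):
        if self.conn_pos != self.pos:
            self.connections += 1   # only now do we pay for a (re)connection
        out = self.data[self.pos:self.pos + n]
        self.pos += n
        self.conn_pos = self.pos    # connection now sits at the new offset
        return out

s = LazySeekStream(b"abcdefghij")
s.seek(2); s.seek(4); s.seek(0)    # three seeks, zero connections so far
print(s.connections)                # 0
print(s.read(3), s.connections)     # b'abc' 1
```

Sequential reads after the first one reuse the open connection; only a read at a non-contiguous offset pays for a new one.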
21. AutoScaling
● Determining the right cluster size can be challenging.
● AutoScaling helps in scaling instances up/down depending on the workload
○ Concurrency-based AutoScaling
■ Helps in controlling the number of parallel queries
○ Query Isolation
■ When queries scan beyond a certain limit, new clusters are automatically spun up.
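A toy concurrency-based scaling policy might look like this (the thresholds, parameter names, and function are made up for illustration — not any product's actual policy):

```python
import math

def desired_executors(queued_queries, running_queries,
                      queries_per_executor=4, min_nodes=2, max_nodes=100):
    """Hypothetical concurrency-based autoscaling: size the cluster so each
    executor handles at most `queries_per_executor` concurrent queries,
    clamped to [min_nodes, max_nodes]."""
    load = queued_queries + running_queries
    want = math.ceil(load / queries_per_executor)
    return max(min_nodes, min(max_nodes, want))

print(desired_executors(0, 3))     # light load -> floor of 2 nodes
print(desired_executors(30, 10))   # 40 queries / 4 per node -> 10 nodes
```

A real policy would also add hysteresis (scale down more slowly than up) to avoid thrashing as load fluctuates.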
22. Affinity Policies for better Network Throughput
- AutoScaling policies allow you to spin up instances across different availability zones
- By default, cloud providers tend to spread instances across AZs for availability
- This impacts network throughput for nodes with 10Gbps links
- Set an affinity policy to keep the instances in the same availability zone
23. Spin up Time
● Cluster/compute spin-up time plays a crucial role in adoption and in reducing cost.
● Containerized deployments with K8S help reduce spin-up time significantly
○ 10s of seconds as opposed to minutes
24. K8S: Pods can have the same hostname/port
● Pods can have the same hostname/port after a restart
● This causes trouble for processes tracking nodes based on hostname/port
● Added flexibility in the stack to take care of this situation
○ E.g. TEZ-4179: [Kubernetes] Extend NodeId in tez to support unique worker identity
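The fix can be sketched in a few lines (a simplified illustration of the idea behind TEZ-4179, not its actual implementation): extend the node identity beyond hostname/port with a token that is unique per worker incarnation, so a restarted pod is never mistaken for its previous self.

```python
import uuid

def worker_id(hostname, port):
    """Hypothetical sketch: hostname:port alone is not a stable identity on
    K8S, so append a unique token minted at worker startup."""
    return f"{hostname}:{port}#{uuid.uuid4().hex}"

a = worker_id("worker-0", 50051)
b = worker_id("worker-0", 50051)   # same host/port after a pod restart
print(a != b)                      # True: distinct worker identities
```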
25. Throttling
● Cloud services throttle requests
○ Throttling limits vary across cloud vendors
● Critical to monitor throttling metrics
○ Desirable to enable metrics logging in the ObjectStore
○ Accuracy is limited to per-minute granularity in most ObjectStores
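A common client-side mitigation — sketched here as an illustration, not any vendor's prescribed policy — is exponential backoff with jitter on 503 throttling responses, instead of immediate retries that just resend the same bytes and amplify the throttling:

```python
import random

def backoff_delays(attempts, base=0.5, cap=30.0, rng=random.Random(42)):
    """Sketch of "full jitter" exponential backoff for throttled requests.
    Parameter names and values are illustrative; a seeded RNG is used so the
    example is reproducible."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))  # 0.5s, 1s, 2s, ... capped
        delays.append(rng.uniform(0, ceiling))     # randomize to avoid herds
    return delays

print([round(d, 2) for d in backoff_delays(5)])
```

Spacing retries out gives the service time to recover and, as the speaker notes below observe, avoids burning CPU re-encrypting and resending the same payload.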
27. Security
● Perimeter security
● Data encrypted at rest
● Intermediate data encrypted in transit
● Need to use optimised libraries to improve transport security
28. Hardware Specs across Cloud Vendors
● Watch out for hardware specs across cloud vendors.
○ E.g. an SSD in Azure can have different performance characteristics than one in AWS
● OS settings have to be tweaked accordingly
○ E.g. network and disk settings
● Choose the optimal instance for the workload
○ E.g. instances with high-density disks may not be needed, as data is stored in the ObjectStore
○ Too little disk space can hurt when intermediate data is written out.
29. Tomorrow ...
● Plenty of challenges to run Big Data workloads on Cloud
○ Great efforts from Open Source community!
● Users need “No vendor lock in”
○ An Open Data layer for multi-cloud (SODA, CSI etc with infinite possibilities)
○ Network standards across clouds (CNI)
○ Data Lineage and governance for user (Apache Atlas)
○ Security and access as open standard (Apache Ranger)
● Users are looking for an Open Data Architecture for multiple clouds which is enterprise ready!
30. Thank You
● References
○ Cloudera Data Platform (Multi Cloud): https://docs.cloudera.com/cdp/latest/index.html
○ Hadoop: Decade two, Day zero: https://blog.cloudera.com/hadoop-decade-two-day-zero/
● Cloudera careers
For a true enterprise-ready cloud platform:
We need a way to register, manage and control multiple clusters in a central place.
We need a way to handle security policies via a central place.
We need to provide different user experiences depending on the data processing requirements, like “Machine Learning”, “Data Warehouse”, “Data Engineering” and so on.
We observed this in Azure, where throttling can have an adverse impact on CPU utilization.
The system was sending a good amount of data to the Azure ObjectStore and got throttled with 503 exceptions. Due to retry logic, the system kept retrying and sending the same data over the wire.
This caused high CPU usage due to encryption.
Hardware specs across different cloud vendors can be very different. For instance, an SSD in AWS gave around 288 MB/s, whereas in Azure it gave 89 MB/s.
We would recommend measuring performance before choosing instances.
OS settings need to be tweaked accordingly as well.
For example, we recently had to disable certain disk settings to avoid unwanted kernel calls, as we were on SSDs.
It would be good to choose the optimal instance type for the workload.