Benefits of Transferring Real-Time Data to Hadoop at Scale – Hortonworks
Today’s Big Data teams demand solutions designed for Big Data that are optimized, secure, and adaptable to changing workload requirements. Working together, Hortonworks, IBM, and Attunity have designed an integrated solution that transfers large volumes of data to a platform that can handle rapid ingest, processing and analysis of data of all types from all sources, at scale.
https://hortonworks.com/webinar/benefits-transferring-real-time-data-hadoop-scale-ibm-hortonworks-attunity/
Journey to Big Data: Main Issues, Solutions, Benefits – DataWorks Summit
One of the most fruitful aspects of being chosen as a partner bank is that you can have a backend that communicates directly with the client's systems. Through this partnership, Banco Santander has been running a large set of third-party applications on its banking system for many years.
Banking is arguably the most regulated sector, which makes day-to-day operations all the more interesting. Adapting systems to regulation is mandatory, not optional, and for today's banks internal and external audits are a routine part of operations. Furthermore, since SCIB is a global player, this pattern is repeated in every country where the group operates.
The result is a genuinely interesting mix: third-party systems of various kinds are installed in many countries, coexisting with our centralized system, exchanging information with one another, and being adjusted manually, with data aggregated and integrated at the back office. "Spaghetti" comes to mind when you consider that all of this data flows back and forth without pause. Increasingly, regulators and auditors expect to be able to identify precisely where each piece of data originated, which often means intervening manually to fully trace it.
Javier Nieto, who works in Banco Santander's corporate investment banking architecture and innovation department, talks about the integration challenges Santander experienced when building an on-demand data lake on its path to global big data.
The Vortex of Change - Digital Transformation (Presented by Intel) – Cloudera, Inc.
The vortex of change continues all around us – inside the company, with our customers and partners. A new norm is upon us. Business models are being turned upside down – the hunters are now the hunted; with global equalization, size is no longer a guarantee of success. The innovative survive and thrive… the nervous and slow go under… What does all this change mean for you? Find out how Intel's strengths help our customers in this world of change.
This document discusses climbing the AI ladder and preparing data for artificial intelligence. It notes that 81% of organizations do not yet understand the data required for AI. The first step in the ladder is to ensure data is properly structured and accessible. Future steps include applying machine learning everywhere, scaling insights on demand, and building a trusted analytics foundation. The document promotes IBM tools for data science, machine learning, and building a Hadoop data platform to analyze large volumes and varieties of data. It presents a vision of enabling SQL queries of all data, provisioning data as a utility, and using DevOps practices for data science.
This document discusses making banks more predictive and real-time using Hadoop. It describes the challenges of siloed data and batch processing at banks. The author details his bank's journey from setting up a small "play area" Hadoop cluster to experiment, to building a secure predictive analytics lab, to plans for a production system. Key challenges discussed include securing Hadoop, hardware limitations, and rapid tool innovation. Real-time analytics require tools beyond Hadoop like Storm or Spark. Business cases around marketing, spending forecasts and risk are highlighted. The author concludes Hadoop can save costs and accelerate predictive analytics when driven by business cases.
Continuously improving factory operations is of critical importance to manufacturers. Consider the facts: the total cost of poor quality amounts to a staggering 20% of sales (American Society of Quality), and unplanned downtime costs plants approximately $50 billion per year (Deloitte).
The most pressing questions are: which process variables affect quality and yield, and which process variables predict equipment failure? Getting to those answers gives forward-thinking manufacturers a leg up over competitors.
The speakers address the data management challenges facing today's manufacturers, including proprietary systems and siloed data sources, as well as an inability to make sensor-based data usable.
Integrating enterprise data from ERP, MES, maintenance systems, and other sources with real-time operations data from sensors, PLCs, SCADA systems, and historians represents a major first step. But how to get started? What is the value of a data lake? How are AI/ML being applied to enable real time action?
Join us for this educational session, which includes a view into a roadmap for an open source industrial IoT data management platform.
Key Takeaways:
• Understand key use cases commonly undertaken by manufacturing enterprises
• Understand the value of using multivariate manufacturing data sources, as opposed to a single sensor on a piece of equipment
• Understand advances in big data management and streaming analytics that are paving the way to next-generation factory performance
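The multivariate point above can be made concrete: a single-sensor threshold can miss failures that only show up in the combination of variables. Here is a minimal, hedged sketch (sensor names, values, and the z-score rule are all illustrative assumptions, not the speakers' method) of flagging a machine snapshot against its own multivariate history:

```python
import statistics

# Hypothetical multivariate readings from one machine: each row is a
# snapshot of several process variables (names/values illustrative only).
history = [
    {"temp_c": 71.0, "vibration_mm_s": 2.1, "pressure_bar": 5.0},
    {"temp_c": 70.5, "vibration_mm_s": 2.0, "pressure_bar": 5.1},
    {"temp_c": 71.2, "vibration_mm_s": 2.2, "pressure_bar": 4.9},
    {"temp_c": 70.8, "vibration_mm_s": 1.9, "pressure_bar": 5.0},
]

def zscores(snapshot, history):
    """Z-score of each variable in `snapshot` against its own history."""
    out = {}
    for var, value in snapshot.items():
        series = [row[var] for row in history]
        mean = statistics.mean(series)
        stdev = statistics.stdev(series)
        out[var] = (value - mean) / stdev if stdev else 0.0
    return out

def is_anomalous(snapshot, history, threshold=3.0):
    """Flag a snapshot if ANY variable deviates strongly, even when the
    other variables look perfectly normal in isolation."""
    scores = zscores(snapshot, history)
    return any(abs(z) > threshold for z in scores.values()), scores

# Temperature and pressure look normal here, but vibration spikes.
flagged, scores = is_anomalous(
    {"temp_c": 71.1, "vibration_mm_s": 4.5, "pressure_bar": 5.0}, history
)
print(flagged)  # → True
```

In practice this logic would run over streaming sensor data and use far richer models, but the shape of the problem – comparing many variables jointly rather than one sensor at a time – is the same.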
The document discusses Informatica's data integration platform and its capabilities for big data and analytics projects. Some key points:
- Informatica is a leading data integration vendor with over 5,000 customers including over 70% of the Global 500.
- The Informatica platform provides capabilities across the entire data lifecycle from ingestion to delivery including data quality, master data management, integration, and analytics.
- It supports a variety of data sources including structured, unstructured, cloud, and big data and can run on-premises or in the cloud.
- Customers report the Informatica platform improves agility, scalability, and operational confidence for data integration projects compared to
MongoDB IoT City Tour STUTTGART: Hadoop and Future Data Management, by Cloudera – MongoDB
Bernard Doering, Senior Sales Director DACH, Cloudera.
Hadoop and the Future of Data Management. As Hadoop takes the data management market by storm, organisations are evolving the role it plays in the modern data centre. Explore how this disruptive technology is quickly transforming an industry and how you can leverage it today, in combination with MongoDB, to drive meaningful change in your business.
Pivotal: The New Pivotal Big Data Suite – Revolutionary Foundation to Leverage... – EMC
The document discusses Pivotal's big data suite and business data lake offerings. It provides an overview of the components of a business data lake, including storage, ingestion, distillation, processing, unified data management, and action components. It also defines various data processing approaches like streaming, micro-batching, batch, and real-time response. The goal is to help organizations build analytics and transactional applications on big data to drive business insights and revenue.
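The processing approaches listed above differ mainly in how records are grouped before work is done. As one hedged illustration (a generic sketch, not Pivotal's implementation), micro-batching sits between pure streaming and batch by draining a stream into small fixed-size groups:

```python
def micro_batches(stream, batch_size):
    """Group a (possibly unbounded) iterable of events into small batches,
    trading a little latency for per-batch processing efficiency."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# Batch processing would instead collect the whole input first; pure
# streaming would hand each event to the processor one at a time.
events = range(7)
print(list(micro_batches(events, 3)))  # → [[0, 1, 2], [3, 4, 5], [6]]
```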
Gov & Private Sector Regulatory Compliance: Using Hadoop to Address Requirements – DataWorks Summit
This presentation discusses forward-looking statements that are subject to risks and uncertainties. It addresses issues around who owns data, who has access to data, and what type of data analysis can be done. It provides details on government-to-government, bank-to-government, and regional data exchanges. It discusses Rante's divisions and approach to unique experiences. Rante aims to anticipate industry trends and push boundaries through research and technology innovations.
This document discusses strategies for transitioning from a traditional data warehousing architecture to a modern data architecture. It outlines a 4 sprint approach including developing social sensing capabilities, integrating additional data sources, implementing statistical and machine learning methods, and designing an operating model. It emphasizes the importance of a "kill strategy" to decommission legacy systems, a user adoption strategy to transition users to the new system, and implementing a "data concierge" service to streamline data provisioning and maximize value from the new platform. The strategies described aim to rationalize costs, simplify the data landscape, and enable more agile analytics and business transformation.
Partner Keynote: How Logical Data Fabric Knits Together Data Visualization wi... – Denodo
Watch full webinar here: https://bit.ly/3aALFEC
Data Visualization and Data Virtualization are complementary technologies. But how do they come together under a common data fabric? This presentation will discuss how organizations are advancing their data fabric capabilities leveraging innovations in these two technologies in areas of self-service, data catalog, cloud, and AI/ML.
How to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud – DataStax
Most enterprises understand the value of hybrid cloud. In fact, your enterprise is already working in a multi-cloud or hybrid cloud environment, whether you know it or not. View this SlideShare to gain a greater understanding of the requirements of a geo-distributed cloud database in hybrid and multi-cloud environments.
View recording: https://youtu.be/tHukS-p6lUI
Explore all DataStax webinars: https://www.datastax.com/resources/webinars
Enterprise Data Science at Scale Meetup - IBM and Hortonworks - Oct 2017 – Hortonworks
The document discusses Hortonworks' Data Science Experience (DSX) platform. It describes challenges data scientists face around data access, tool usage, collaboration and model deployment. DSX aims to address these by providing tools for exploring, modeling and deploying data science projects on Hortonworks Data Platform (HDP) clusters at scale. It also announces an extension of IBM and Hortonworks' partnership to integrate DSX and other IBM data science tools with HDP.
This document discusses how data science and AI are fueling new business models driven by data. It summarizes that (1) connected devices, customers, and sensors are generating massive amounts of data across manufacturing, distribution, marketing, sales, and service; (2) technologies like cloud computing, streaming data, IoT, and machine learning are enabling new ways to harness this data; and (3) a modern data architecture is needed to encompass all data sources, enable analytics and machine learning, and power actionable intelligence across edge, cloud, and on-premises environments.
Put Alternative Data to Use in Capital Markets – Cloudera, Inc.
This document discusses alternative data in capital markets. It provides an overview of alternative data sources like social media, satellite imagery, and location data. It also describes how firms are using alternative data to enhance traditional analysis and develop new investment strategies. The document notes that most alternative data users have seen returns from using this data. However, accessing and analyzing large alternative data sets remains a challenge. It promotes the use of data platforms and visual analytics to more effectively ingest, store, and operationalize alternative data.
Cloudera Data Impact Awards 2021 - Finalists – Cloudera, Inc.
The document outlines the 2021 finalists for the annual Data Impact Awards program, which recognizes organizations using Cloudera's platform and the impactful applications they have developed. It provides details on the challenges, solutions, and outcomes for each finalist project in the categories of Data Lifecycle Connection, Cloud Innovation, Data for Enterprise AI, Security & Governance Leadership, Industry Transformation, People First, and Data for Good. There are multiple finalists highlighted in each category demonstrating innovative uses of data and analytics.
Optimizing Your Hadoop Infrastructure: An Industry Panel Presentation – DataWorks Summit
This document introduces the panel speakers for a discussion on modernizing Hadoop infrastructures. It provides brief biographies of five speakers: Armando Acosta of Dell, who has 15 years of experience in IT solutions and big data; Brandon Draeger of Intel, who manages partnerships between Intel, Cloudera, and their shared ecosystems; TJ Laher of Cloudera, who helps organizations implement Hadoop; Vin Sharma of Intel; and Mark Muncy of Syncsort, who leads technical marketing for big data and has experience in data architecture. The panel will discuss how to evolve Hadoop capabilities to meet emerging customer needs.
Get Started with Cloudera’s Cyber Solution – Cloudera, Inc.
Cloudera empowers cybersecurity innovators to proactively secure the enterprise by accelerating threat detection, investigation, and response through machine learning and complete enterprise visibility. Cloudera’s cybersecurity solution, based on Apache Spot, enables anomaly detection, behavior analytics, and comprehensive access across all enterprise data using an open, scalable platform. But what’s the easiest way to get started?
Join Cloudera, StreamSets, and Arcadia Data as we show you first hand how we have made it easier to get your first use case up and running. During this session you will learn:
Signs you need Cloudera’s cybersecurity solution
How StreamSets can help increase enterprise visibility
Providing your security analyst the right context at the right time with modern visualizations
In order to deal with customers expecting a seamless omnichannel experience, increased regulation, and the speed with which innovative fintechs enter the market, ING has formulated a customer-centric strategy based on data and analytics.
Last year we talked about how ING developed a new architecture, the ING Data Lake, and how, in parallel, the Hadoop-based Big Data paradigm appeared within ING and was mapped onto the Data Lake architecture to make sure Hadoop is leveraged to the maximum.
This year we want to tell you how the international working group helped realize the advanced analytics pattern on the ING private cloud, without prior management approval.
This presentation will discuss the community strategy, how to stay under the radar, how to surface when actual content is strong enough to force change, open issues and the private cloud challenges ING is dealing with. Join us in this ride from community idea through architecture to private cloud implementation with some organizational challenges along the way.
A modern approach to streaming data integration, event processing with a big data (kappa style) data architecture. Key patterns are discussed with pros/cons of newer approaches and open source technologies. Focus on Oracle and GoldenGate technology. OpenWorld 2018 presentation.
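The kappa-style pattern mentioned here treats a single append-only event log (for example, change records captured by a CDC tool such as GoldenGate) as the source of truth; any materialized view can be rebuilt by replaying the log. A minimal sketch under that assumption (the event shape and field names are illustrative, not a specific product's format):

```python
# Append-only change log: in a kappa architecture this would live in a
# durable stream (e.g. Kafka); here it is just an in-memory list.
change_log = [
    {"op": "insert", "id": 1, "balance": 100},
    {"op": "insert", "id": 2, "balance": 50},
    {"op": "update", "id": 1, "balance": 120},
    {"op": "delete", "id": 2},
]

def replay(log):
    """Rebuild the current-state view purely by replaying the log.
    Reprocessing with new logic is just another replay of the same log."""
    state = {}
    for event in log:
        if event["op"] == "delete":
            state.pop(event["id"], None)
        else:  # insert/update are both upserts against the view
            state[event["id"]] = event["balance"]
    return state

print(replay(change_log))  # → {1: 120}
```

The appeal of the kappa style over a lambda-style dual pipeline is exactly this: one code path serves both reprocessing and live processing.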
Without the right data management strategy, investments in Internet of Things (IoT) can yield limited results. Cloudera is pioneering next generation data management solutions, enabling organizations to build an enterprise data hub (EDH) as the backbone to any IoT initiative.
Denodo Design Studio: Modeling and Creation of Data Services – Denodo
Watch full webinar here: https://bit.ly/39T7SON
Change is the only constant and it is very important for enterprises to keep up with the changing times in an agile fashion. To ensure faster time to market, quick business insights and rapid data driven decision making, it is important that the Data Delivery channel is optimized in the best way possible.
With the advent of API Management technologies the demand for data being delivered in the form of a Data Service/APIs is increasing. The ability to make data available in an API format at the click of a button is the need of the hour. Join us to see how easy it is to make enterprise wide data available as Data Services/APIs no matter what format the data is stored in with no prior coding experience. Faster development, zero learning curve and huge value.
Watch on-demand this webinar to learn:
- How to explore datasets available using Denodo Data Catalog
- How to build new data sets using Denodo Design Studio, drag and drop interface
- How to make datasets available in RESTful, OData 4, GeoJSON, GraphQL.
- How to enable different authentication protocols including OAuth 2.0.
- Automatic documentation (Open API) and availability in the Denodo Data Catalog.
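To give a feel for what consuming such a data service looks like from the client side, here is a hedged sketch of building an authenticated GET against a hypothetical RESTful endpoint with an OAuth 2.0 bearer token. The base URL, path layout, parameter names, and token are placeholders, not Denodo's actual API:

```python
import urllib.parse
import urllib.request

def build_data_service_request(base_url, view, filters, token):
    """Build an authenticated GET against a hypothetical REST data service.
    Endpoint layout and parameter names are illustrative placeholders."""
    query = urllib.parse.urlencode(filters)
    url = f"{base_url}/views/{view}?{query}"
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {token}",  # OAuth 2.0 bearer token
            "Accept": "application/json",
        },
    )

req = build_data_service_request(
    "https://example.com/server/sales", "customer_orders",
    {"region": "EMEA"}, "TOKEN_PLACEHOLDER",
)
print(req.full_url)
print(req.get_header("Authorization"))
```

Sending the request (e.g. with `urllib.request.urlopen(req)`) would return the dataset as JSON; the point of the webinar is that the server side of this exchange is generated without coding.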
In this webinar, we will hear from Mark McKinney, Director – Enterprise Data Analytics at Sprint about the business drivers, key success factors, and challenges faced while undertaking Sprint’s data modernization journey. You will hear how Sprint set about establishing a Hadoop data lake, ingested data from multiple environments, and overcame key skill shortages. You will also hear from Diyotta and Hortonworks about best practices for modernizing your data architecture to support transformational business initiatives.
https://hortonworks.com/webinar/sprints-data-modernization-journey/
The document outlines the agenda for Cloudera's Enterprise Data Cloud event in Vienna. It includes welcome remarks, keynotes on Cloudera's vision and customer success stories. There will be presentations on the new Cloudera Data Platform and customer case studies, followed by closing remarks. The schedule includes sessions on Cloudera's approach to data warehousing, machine learning, streaming and multi-cloud capabilities.
Achieving Separation of Compute and Storage in a Cloud World – Alluxio, Inc.
Alluxio Tech Talk
Feb 12, 2019
Speaker:
Dipti Borkar, Alluxio
The rise of compute-intensive workloads and the adoption of the cloud have driven organizations to adopt a decoupled architecture for modern workloads – one in which compute scales independently from storage. While this enables elastic scaling, it introduces new problems: how do you co-locate data with compute, how do you unify data across multiple remote clouds, and how do you keep storage and I/O service costs down, among others?
Enter Alluxio, a virtual unified file system that sits between compute and storage and allows you to realize the benefits of a hybrid cloud architecture with the same performance and lower costs.
In this webinar, we will discuss:
- Why leading enterprises are adopting hybrid cloud architectures with compute and storage disaggregated
- The new challenges that this new paradigm introduces
- An introduction to Alluxio and the unified data solution it provides for hybrid environments
Hadoop and Spark are big data frameworks whose uses span a variety of scenarios, from ingestion and data prep to data management, processing, analysis, and visualization. Each step requires specialized toolsets to be productive. In this talk I will share solution examples from the Big Data ecosystem, such as Cask, StreamSets, Datameer, AtScale, and Dataiku on Microsoft’s Azure HDInsight, that simplify your Big Data solutions. Azure HDInsight is a cloud Spark and Hadoop service for the enterprise; taking advantage of all the benefits of HDInsight gives you the best of both worlds. Join this session for practical information that will enable faster time to insights for you and your business.
Continuously improving factory operations is of critical importance to manufacturers. Consider the facts: the total cost of poor quality amounts to a staggering 20% of sales (American Society of Quality), and unplanned downtime costs plants approximately $50 billion per year (Deloitte).
The most pressing questions are: which process variables effect quality and yield and which process variables predict equipment failure? Getting to those answers is providing forward thinking manufacturers a leg up over competitors.
The speakers address the data management challenges facing today's manufacturers, including proprietary systems and siloed data sources, as well as an inability to make sensor-based data usable.
Integrating enterprise data from ERP, MES, maintenance systems, and other sources with real-time operations data from sensors, PLCs, SCADA systems, and historians represents a major first step. But how to get started? What is the value of a data lake? How are AI/ML being applied to enable real time action?
Join us for this educational session, which includes a view into a roadmap for an open source industrial IoT data management platform.
Key Takeaways:
• Understand key use cases commonly undertaken by manufacturing enterprises
• Understand the value of using multivariate manufacturing data sources, as opposed to a single sensor on a piece of equipment
• Understand advances in big data management and streaming analytics that are paving the way to next-generation factory performance
The document discusses Informatica's data integration platform and its capabilities for big data and analytics projects. Some key points:
- Informatica is a leading data integration vendor with over 5,000 customers including over 70% of the Global 500.
- The Informatica platform provides capabilities across the entire data lifecycle from ingestion to delivery including data quality, master data management, integration, and analytics.
- It supports a variety of data sources including structured, unstructured, cloud, and big data and can run on-premises or in the cloud.
- Customers report the Informatica platform improves agility, scalability, and operational confidence for data integration projects compared to
MongoDB IoT City Tour STUTTGART: Hadoop and future data management. By, ClouderaMongoDB
Bernard Doering, Senior Slaes Director DACH, Cloudera.
Hadoop and the Future of Data Management. As Hadoop takes the data management market by storm, organisations are evolving the role it plays in the modern data centre. Explore how this disruptive technology is quickly transforming an industry and how you can leverage it today, in combination with MongoDB, to drive meaningful change in your business.
Pivotal the new_pivotal_big_data_suite_-_revolutionary_foundation_to_leverage...EMC
The document discusses Pivotal's big data suite and business data lake offerings. It provides an overview of the components of a business data lake, including storage, ingestion, distillation, processing, unified data management, and action components. It also defines various data processing approaches like streaming, micro-batching, batch, and real-time response. The goal is to help organizations build analytics and transactional applications on big data to drive business insights and revenue.
Gov & Private Sector Regulatory Compliance: Using Hadoop to Address RequirementsDataWorks Summit
This presentation discusses forward-looking statements that are subject to risks and uncertainties. It addresses issues around who owns data, who has access to data, and what type of data analysis can be done. It provides details on government-to-government, bank-to-government, and regional data exchanges. It discusses Rante's divisions and approach to unique experiences. Rante aims to anticipate industry trends and push boundaries through research and technology innovations.
This document discusses strategies for transitioning from a traditional data warehousing architecture to a modern data architecture. It outlines a 4 sprint approach including developing social sensing capabilities, integrating additional data sources, implementing statistical and machine learning methods, and designing an operating model. It emphasizes the importance of a "kill strategy" to decommission legacy systems, a user adoption strategy to transition users to the new system, and implementing a "data concierge" service to streamline data provisioning and maximize value from the new platform. The strategies described aim to rationalize costs, simplify the data landscape, and enable more agile analytics and business transformation.
Partner Keynote: How Logical Data Fabric Knits Together Data Visualization wi...Denodo
Watch full webinar here: https://bit.ly/3aALFEC
Data Visualization and Data Virtualization are complementary technologies. But how do they come together under a common data fabric? This presentation will discuss how organizations are advancing their data fabric capabilities leveraging innovations in these two technologies in areas of self-service, data catalog, cloud, and AI/ML.
How to Power Innovation with Geo-Distributed Data Management in Hybrid CloudDataStax
Most enterprises understand the value of hybrid cloud. In fact, your enterprise is already working in a multi-cloud or hybrid cloud environment, whether you know it or not. View this SlideShare to gain a greater understanding of the requirements of a geo-distributed cloud database in hybrid and multi-cloud environments.
View recording: https://youtu.be/tHukS-p6lUI
Explore all DataStax webinars: https://www.datastax.com/resources/webinars
Enterprise Data Science at Scale Meetup - IBM and Hortonworks - Oct 2017 Hortonworks
The document discusses Hortonworks' Data Science Experience (DSX) platform. It describes challenges data scientists face around data access, tool usage, collaboration and model deployment. DSX aims to address these by providing tools for exploring, modeling and deploying data science projects on Hortonworks Data Platform (HDP) clusters at scale. It also announces an extension of IBM and Hortonworks' partnership to integrate DSX and other IBM data science tools with HDP.
This document discusses how data science and AI are fueling new business models driven by data. It summarizes that (1) connected devices, customers, and sensors are generating massive amounts of data across manufacturing, distribution, marketing, sales, and service; (2) technologies like cloud computing, streaming data, IoT, and machine learning are enabling new ways to harness this data; and (3) a modern data architecture is needed to encompass all data sources, enable analytics and machine learning, and power actionable intelligence across edge, cloud, and on-premises environments.
Put Alternative Data to Use in Capital Markets Cloudera, Inc.
This document discusses alternative data in capital markets. It provides an overview of alternative data sources like social media, satellite imagery, and location data. It also describes how firms are using alternative data to enhance traditional analysis and develop new investment strategies. The document notes that most alternative data users have seen returns from using this data. However, accessing and analyzing large alternative data sets remains a challenge. It promotes the use of data platforms and visual analytics to more effectively ingest, store, and operationalize alternative data.
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
The document outlines the 2021 finalists for the annual Data Impact Awards program, which recognizes organizations using Cloudera's platform and the impactful applications they have developed. It provides details on the challenges, solutions, and outcomes for each finalist project in the categories of Data Lifecycle Connection, Cloud Innovation, Data for Enterprise AI, Security & Governance Leadership, Industry Transformation, People First, and Data for Good. There are multiple finalists highlighted in each category demonstrating innovative uses of data and analytics.
Optimizing your Hadoop Infastructure: An Industry Panel PresentationDataWorks Summit
This document introduces the panel speakers for a discussion on modernizing Hadoop infrastructures. It provides brief biographies of five speakers: Armando Acosta of Dell, who has 15 years of experience in IT solutions and big data; Brandon Draeger of Intel, who manages partnerships between Intel, Cloudera, and their shared ecosystems; TJ Laher of Cloudera, who helps organizations implement Hadoop; Vin Sharma of Intel; and Mark Muncy of Syncsort, who leads technical marketing for big data and has experience in data architecture. The panel will discuss how to evolve Hadoop capabilities to meet emerging customer needs.
Get Started with Cloudera’s Cyber Solution Cloudera, Inc.
Cloudera empowers cybersecurity innovators to proactively secure the enterprise by accelerating threat detection, investigation, and response through machine learning and complete enterprise visibility. Cloudera’s cybersecurity solution, based on Apache Spot, enables anomaly detection, behavior analytics, and comprehensive access across all enterprise data using an open, scalable platform. But what’s the easiest way to get started?
Join Cloudera, StreamSets, and Arcadia Data as we show you firsthand how we have made it easier to get your first use case up and running. During this session you will learn:
Signs you need Cloudera’s cybersecurity solution
How StreamSets can help increase enterprise visibility
Providing your security analyst the right context at the right time with modern visualizations
To serve customers who expect a seamless omnichannel experience, and to cope with increasing regulation and the speed with which innovative fintechs enter the market, ING has formulated a customer-centric strategy based on data and analytics.
Last year we talked about the new architecture ING developed, the ING Data Lake, and about how, in parallel, the Hadoop-based Big Data paradigm appeared within ING and was mapped onto the Data Lake architecture to make sure Hadoop is leveraged to the maximum.
This year we want to tell you how the international working group helped realize the advanced analytics pattern on the ING private cloud, without prior management approval.
This presentation will discuss the community strategy, how to stay under the radar, how to surface when actual content is strong enough to force change, open issues and the private cloud challenges ING is dealing with. Join us in this ride from community idea through architecture to private cloud implementation with some organizational challenges along the way.
A modern approach to streaming data integration, event processing with a big data (kappa style) data architecture. Key patterns are discussed with pros/cons of newer approaches and open source technologies. Focus on Oracle and GoldenGate technology. OpenWorld 2018 presentation.
Without the right data management strategy, investments in Internet of Things (IoT) can yield limited results. Cloudera is pioneering next generation data management solutions, enabling organizations to build an enterprise data hub (EDH) as the backbone to any IoT initiative.
Denodo Design Studio: Modeling and Creation of Data Services Denodo
Watch full webinar here: https://bit.ly/39T7SON
Change is the only constant and it is very important for enterprises to keep up with the changing times in an agile fashion. To ensure faster time to market, quick business insights and rapid data driven decision making, it is important that the Data Delivery channel is optimized in the best way possible.
With the advent of API Management technologies the demand for data being delivered in the form of a Data Service/APIs is increasing. The ability to make data available in an API format at the click of a button is the need of the hour. Join us to see how easy it is to make enterprise wide data available as Data Services/APIs no matter what format the data is stored in with no prior coding experience. Faster development, zero learning curve and huge value.
Watch on-demand this webinar to learn:
- How to explore datasets available using Denodo Data Catalog
- How to build new datasets using Denodo Design Studio's drag-and-drop interface
- How to make datasets available in RESTful, OData 4, GeoJSON, and GraphQL formats.
- How to enable different authentication protocols including OAuth 2.0.
- Automatic documentation (Open API) and availability in the Denodo Data Catalog.
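As a rough illustration of what consuming such a published data service looks like, the sketch below builds an authenticated request against a hypothetical REST endpoint. The URL scheme, view name, and parameter names here are invented for the example and are not Denodo's actual API; only the general pattern (resource path, query filters, OAuth 2.0 bearer token) is the point.

```python
import urllib.parse
import urllib.request

def build_data_service_request(base_url, view, filters=None, token=None):
    """Build an HTTP request for a published REST data service.

    The endpoint shape and parameter names are illustrative only.
    """
    url = f"{base_url}/{view}"
    if filters:
        url += "?" + urllib.parse.urlencode(filters)
    headers = {"Accept": "application/json"}
    if token:
        # OAuth 2.0 bearer token, one of the auth protocols mentioned above
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers)

req = build_data_service_request(
    "https://example.com/server/customer360/views",  # hypothetical server
    "customer",
    filters={"country": "US"},
    token="my-access-token",
)
print(req.full_url)
print(req.get_header("Authorization"))
```

Swapping the serialization (OData, GeoJSON, GraphQL) changes the URL and payload shape, not this basic request-building pattern.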
In this webinar, we will hear from Mark McKinney, Director – Enterprise Data Analytics at Sprint about the business drivers, key success factors, and challenges faced while undertaking Sprint’s data modernization journey. You will hear how Sprint set about establishing a Hadoop data lake, ingested data from multiple environments, and overcame key skill shortages. You will also hear from Diyotta and Hortonworks about best practices for modernizing your data architecture to support transformational business initiatives.
https://hortonworks.com/webinar/sprints-data-modernization-journey/
The document outlines the agenda for Cloudera's Enterprise Data Cloud event in Vienna. It includes welcome remarks, keynotes on Cloudera's vision and customer success stories. There will be presentations on the new Cloudera Data Platform and customer case studies, followed by closing remarks. The schedule includes sessions on Cloudera's approach to data warehousing, machine learning, streaming and multi-cloud capabilities.
Achieving Separation of Compute and Storage in a Cloud World Alluxio, Inc.
Alluxio Tech Talk
Feb 12, 2019
Speaker:
Dipti Borkar, Alluxio
The rise of compute-intensive workloads and the adoption of the cloud have driven organizations to adopt a decoupled architecture for modern workloads – one in which compute scales independently from storage. While this enables elastic scaling, it introduces new problems: how do you co-locate data with compute, unify data across multiple remote clouds, keep storage and I/O service costs down, and more?
Enter Alluxio, a virtual unified file system, which sits between compute and storage that allows you to realize the benefits of a hybrid cloud architecture with the same performance and lower costs.
In this webinar, we will discuss:
- Why leading enterprises are adopting hybrid cloud architectures with compute and storage disaggregated
- The new challenges that this new paradigm introduces
- An introduction to Alluxio and the unified data solution it provides for hybrid environments
Hadoop and Spark are big data frameworks whose use spans a variety of scenarios: ingestion, data prep, data management, processing, analysis, and visualization. Each step requires specialized toolsets to be productive. In this talk I will share solution examples from the Big Data ecosystem, such as Cask, StreamSets, Datameer, AtScale, and Dataiku on Microsoft’s Azure HDInsight, that simplify your Big Data solutions. Azure HDInsight is a cloud Spark and Hadoop service for the enterprise, so you can take advantage of all the benefits of HDInsight and get the best of both worlds. Join this session for practical information that will enable faster time to insights for you and your business.
The Cloudera Impala project is pioneering the next generation of Hadoop capabilities: the convergence of interactive SQL queries with the capacity, scalability, and flexibility of a Hadoop cluster. In this webinar, join Cloudera and MicroStrategy to learn how Impala works, how it is uniquely architected to provide an interactive SQL experience native to Hadoop, and how you can leverage the power of MicroStrategy 9.3.1 to easily tap into more data and make new discoveries.
Pivotal Big Data Suite is a comprehensive platform that allows companies to modernize their data infrastructure, gain insights through advanced analytics, and build analytic applications at scale. It includes components for data processing, storage, analytics, in-memory processing, and application development. The suite is based on open source software, supports multiple deployment options, and provides an agile approach to help companies transform into data-driven enterprises.
This document provides an overview of open source data warehousing and business intelligence (DW/BI). It defines cloud computing and explains how open DW consists of pre-designed data warehouse architectures that are free to use. Open DW reduces costs and risks by shortening design and development time. While the architectures are free, vendors charge for services like customization, support, and maintenance. The document discusses the need for and benefits of open DW/BI, including faster deployment, lower costs, and mitigated risks through rapid development. It also outlines some popular open source databases, tools, and vendors in this space.
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More Alluxio, Inc.
Alluxio - Data Orchestration for Analytics and AI in the Cloud
Oct 8, 2019
Speakers:
Haoyuan Li & Bin Fan, Alluxio
Visit https://www.alluxio.io/events/ for more Alluxio events.
This document provides an overview of Alluxio, a unified data solution that allows applications to access data closer to the computation. It summarizes Alluxio's key innovations including providing a unified namespace, translating between different storage APIs, and using an intelligent caching system. The document also outlines several use cases where Alluxio has helped customers including accelerating machine learning and analytics workloads.
Over the past two decades, the Big Data stack has reshaped itself and evolved quickly, with numerous innovations driven by the rise of many different open source projects and communities. In this meetup, speakers from Uber, Alibaba, and Alluxio will share best practices for addressing the challenges and opportunities in developing data architectures from new and emerging open source building blocks. Topics include data format optimization (ORC and Parquet), storage security (HDFS), and unified data access (Alluxio) layers.
Customer migration to Azure SQL database, December 2019George Walters
This is a real-life story of how a software-as-a-service application moved to the cloud, to Azure, over a period of two years. We discuss the migration, business drivers, technology, and how it got done. We also talk through more modern ways to refactor or change code to get into the cloud today.
Watch full webinar here: https://bit.ly/2Y0vudM
What is Data Virtualization and why do I care? In this webinar we intend to help you understand not only what Data Virtualization is but why it's a critical component of any organization's data fabric and how it fits. We will cover how data virtualization liberates and empowers your business users, from data discovery and data wrangling to the generation of reusable reporting objects and data services. Digital transformation demands that we empower all consumers of data within the organization, and it demands agility too. Data Virtualization gives you meaningful access to information that can be shared by a myriad of consumers.
Register to attend this session to learn:
- What is Data Virtualization?
- Why do I need Data Virtualization in my organization?
- How do I implement Data Virtualization in my enterprise?
Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloads Alluxio, Inc.
Alluxio provides a data orchestration platform that allows applications to access data closer to compute across different storage systems through a unified namespace. Key features include intelligent multi-tier caching that provides local performance for remote data, API translation that enables popular frameworks to access different storages without changes, and data elasticity through a global namespace. Alluxio powers analytics and AI workloads in hybrid cloud environments.
Watch full webinar here: https://bit.ly/3puUCIc
What is Data Virtualization and why do I care? In this webinar we intend to help you understand not only what Data Virtualization is but why it's a critical component of any organization's data fabric and how it fits. We will cover how data virtualization liberates and empowers your business users, from data discovery and data wrangling to the generation of reusable reporting objects and data services. Digital transformation demands that we empower all consumers of data within the organization, and it demands agility too. Data Virtualization gives you meaningful access to information that can be shared by a myriad of consumers.
Watch on-demand this session to learn:
- What is Data Virtualization?
- Why do I need Data Virtualization in my organization?
- How do I implement Data Virtualization in my enterprise? Where does it fit?
Watch full webinar here: https://bit.ly/2vN59VK
What started as the most agile, real-time enterprise data integration approach is proving to go beyond its initial promise: data virtualization is becoming one of the most important enterprise big data fabrics.
Attend this session to learn:
- What data virtualization really is.
- How it differs from other enterprise data integration technologies.
- Why data virtualization is finding enterprise-wide deployment inside some of the largest organizations.
How to Swiftly Operationalize the Data Lake for Advanced Analytics Using a Lo... Denodo
Watch full webinar here: https://bit.ly/3mfFJqb
Presented at Chief Data Officer Live Series 2021, ASEAN (August Edition)
While big data initiatives have become necessary for any business to generate actionable insights, big data fabric has become a necessity for any successful big data initiative. The best-of-breed big data fabrics should deliver actionable insights to the business users with minimal effort, provide end-to-end security to the entire enterprise data platform, and provide real-time data integration while delivering a self-service data platform to business users.
Watch this on-demand session to learn how big data fabric enabled by Data Virtualization:
- Provides lightning fast self-service data access to business users
- Centralizes data security, governance, and data privacy
- Fulfills the promise of data lakes to provide actionable insights
Big data is driving transformative changes in traditional data warehousing. Traditional ETL processes and highly structured data schemas are being replaced with schema flexibility to handle all types of data from diverse sources. This allows for real-time experimentation and analysis beyond just operational reporting. Microsoft is applying lessons from its own big data journey to help customers by providing a comprehensive set of Apache big data tools in Azure along with intelligence and analytics services to gain insights from diverse data sources.
Bridging the Last Mile: Getting Data to the People Who Need It Denodo
Watch full webinar here: https://bit.ly/3cUA0Qi
Many organizations are embarking on strategically important journeys to embrace data and analytics. The goal can be to improve internal efficiencies, improve the customer experience, drive new business models and revenue streams, or – in the public sector – provide better services. All of these goals require empowering employees to act on data and analytics and to make data-driven decisions. However, getting data – the right data at the right time – to these employees is a huge challenge and traditional technologies and data architectures are simply not up to this task. This webinar will look at how organizations are using Data Virtualization to quickly and efficiently get data to the people that need it.
Attend this session to learn:
- The challenges organizations face when trying to get data to the business users in a timely manner
- How Data Virtualization can accelerate time-to-value for an organization’s data assets
- Examples of leading companies that used data virtualization to get the right data to the users at the right time
Achieving compute and storage independence for data-driven workloads Alluxio, Inc.
Alluxio provides a unified interface to access data across multiple storage systems, allowing compute and storage to scale independently for data-driven applications. It uses a virtual unified file system with a global namespace and server-side API translation to abstract data location and access. Alluxio intelligently manages data placement across memory, SSDs and HDDs using multi-tier caching for local performance on remote data. This allows flexible deployment of compute like Spark on any cloud while keeping data fully controlled on-premises. Alluxio is seeing wide adoption with many large production deployments handling thousands of nodes. Upcoming features include POSIX API support and preview of version 2.0.
How to Build Continuous Ingestion for the Internet of Things Cloudera, Inc.
The Internet of Things is moving into the mainstream and this new world of data-driven products is transforming a vast number of industry sectors and technologies.
However, IoT creates a new challenge: how to build and operationalize continual data ingestion from such a wide and ever-changing array of endpoints so that the data arrives consumption-ready and can drive analysis and action within the business.
In this webinar, Sean Anderson from Cloudera and Kirit Busu, Director of Product Management at StreamSets, will discuss Hadoop's ecosystem and IoT capabilities and provide advice about common patterns and best practices. Using specific examples, they will demonstrate how to build and run end-to-end IoT data flows using StreamSets and Cloudera infrastructure.
AI/ML Infra Meetup | ML explainability in Michelangelo Alluxio, Inc.
AI/ML Infra Meetup
May. 23, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Eric Wang (Software Engineer, @Uber)
Uber has numerous deep learning models, most of which are highly complex with many layers and a vast number of features. Understanding how these models work is challenging and demands significant resources to experiment with various training algorithms and feature sets. With ML explainability, the ML team aims to bring transparency to these models, helping to clarify their predictions and behavior. This transparency also assists the operations and legal teams in explaining the reasons behind specific prediction outcomes.
In this talk, Eric Wang will discuss the methods Uber used for explaining deep learning models and how we integrated these methods into the Uber AI Michelangelo ecosystem to support offline explaining.
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG Alluxio, Inc.
AI/ML Infra Meetup
May. 23, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Junchen Jiang (Assistant Professor of Computer Science, @University of Chicago)
Prefill in LLM inference is known to be resource-intensive, especially for long LLM inputs. While better scheduling can mitigate prefill’s impact, it would be fundamentally better to avoid (most of) prefill. This talk introduces our preliminary effort towards drastically minimizing prefill delay for LLM inputs that naturally reuse text chunks, such as in retrieval-augmented generation. While keeping the KV cache of all text chunks in memory is difficult, we show that it is possible to store them on cheaper yet slower storage. By improving the loading process of the reused KV caches, we can still significantly speed up prefill delay while maintaining the same generation quality.
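As a sketch of the chunk-level reuse idea (a toy stand-in, not the speakers' system), the store below keys each text chunk's KV cache by a content hash and spills older entries from memory to disk, so a reused chunk can be loaded from the cheaper tier instead of re-prefilled. The "KV cache" here is just a placeholder value; a real system holds tensors and overlaps loading with compute.

```python
import hashlib
import os
import pickle
import tempfile

class ChunkKVStore:
    """Toy two-tier store for per-chunk KV caches (illustrative only)."""

    def __init__(self, mem_capacity=2):
        self.mem = {}                       # fast tier: in-memory
        self.disk_dir = tempfile.mkdtemp()  # slow tier: local disk
        self.mem_capacity = mem_capacity

    def _key(self, chunk_text):
        return hashlib.sha256(chunk_text.encode()).hexdigest()

    def put(self, chunk_text, kv):
        key = self._key(chunk_text)
        if len(self.mem) >= self.mem_capacity:
            # Spill the oldest entry to disk instead of discarding it,
            # so it can still be loaded later rather than recomputed.
            old_key, old_kv = next(iter(self.mem.items()))
            del self.mem[old_key]
            with open(os.path.join(self.disk_dir, old_key), "wb") as f:
                pickle.dump(old_kv, f)
        self.mem[key] = kv

    def get(self, chunk_text):
        key = self._key(chunk_text)
        if key in self.mem:
            return self.mem[key]        # memory hit: no prefill needed
        path = os.path.join(self.disk_dir, key)
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)   # disk hit: load instead of prefill
        return None                     # miss: must run prefill for this chunk

store = ChunkKVStore(mem_capacity=2)
for text in ["chunk A", "chunk B", "chunk C"]:
    store.put(text, f"kv for {text!r}")
print(store.get("chunk A"))  # spilled to disk, loaded rather than recomputed
```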
AI/ML Infra Meetup | Perspective on Deep Learning Framework Alluxio, Inc.
AI/ML Infra Meetup
May. 23, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Triston Cao (Senior Deep Learning Software Engineering Manager, @NVIDIA)
From Caffe to MXNet, to PyTorch, and more, Xiande Cao, Senior Deep Learning Software Engineer Manager, will share his perspective on the evolution of deep learning frameworks.
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S... Alluxio, Inc.
AI/ML Infra Meetup
May. 23, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Lu Qiu (Data & AI Platform Tech Lead, @Alluxio)
- Siyuan Sheng (Senior Software Engineer, @Alluxio)
Speed and efficiency are two requirements for the underlying infrastructure for machine learning model development. Data access can bottleneck end-to-end machine learning pipelines as training data volume grows and when large model files are more commonly used for serving. For instance, data loading can constitute nearly 80% of the total model training time, resulting in less than 30% GPU utilization. Also, loading large model files for deployment to production can be slow because of slow network or storage read operations. These challenges are prevalent when using popular frameworks like PyTorch, Ray, or HuggingFace, paired with cloud object storage solutions like S3 or GCS, or downloading models from the HuggingFace model hub.
In this presentation, Lu and Siyuan will offer comprehensive insights into improving speed and GPU utilization for model training and serving. You will learn:
- The data loading challenges hindering GPU utilization
- The reference architecture for running PyTorch and Ray jobs while reading data from S3, with benchmark results of training ResNet50 and BERT
- Real-world examples of boosting model performance and GPU utilization through optimized data access
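The mitigation pattern behind most of these solutions is overlapping I/O with compute so the GPU is never waiting on storage. Below is a minimal, framework-free sketch of a prefetching loader; plain Python threads stand in for the Alluxio/PyTorch machinery, and `slow_fetch` simulates a slow object-store read.

```python
import queue
import threading
import time

def prefetching_loader(fetch, items, depth=4):
    """Overlap data fetching with consumption via a background thread.

    A sketch of the general idea, not Alluxio's implementation;
    `depth` bounds how far the fetcher runs ahead of the consumer.
    """
    q = queue.Queue(maxsize=depth)
    SENTINEL = object()

    def worker():
        for item in items:
            q.put(fetch(item))  # fetch runs ahead of the consumer
        q.put(SENTINEL)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        batch = q.get()
        if batch is SENTINEL:
            return
        yield batch

def slow_fetch(i):
    time.sleep(0.01)  # simulate a slow storage read
    return i * i

print(list(prefetching_loader(slow_fetch, range(5))))  # [0, 1, 4, 9, 16]
```

In a real pipeline the consumer is the training step, so the next batch is already resident (in cache or host memory) by the time the GPU needs it.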
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud Alluxio, Inc.
Alluxio Monthly Webinar
May. 14, 2024
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- ChanChan Mao (Developer Advocate, Alluxio)
- Bin Fan (VP of Technology, Alluxio)
Running AI/ML workloads in different clouds presents unique challenges. The key to a manageable multi-cloud architecture is the ability to seamlessly access data across environments with high performance and low cost.
This webinar is designed for data platform engineers, data infra engineers, data engineers, and ML engineers who work with multiple data sources in hybrid or multi-cloud environments. Chanchan and Bin will guide the audience through using Alluxio to greatly simplify data access and make model training and serving more efficient in these environments.
You will learn:
- How to access data in multi-region, hybrid, and multi-cloud like accessing a local file system
- How to run PyTorch to read datasets and write checkpoints to remote storage with Alluxio as the distributed data access layer
- Real-world examples and insights from tech giants like Uber, AliPay and more
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data Alluxio, Inc.
Alluxio Monthly Webinar
Apr. 23, 2024
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- ChanChan Mao (Developer Advocate, Alluxio)
- Shawn Sun (Tech Lead of Cloud Native, Alluxio)
Cloud-native model training jobs require fast data access to achieve shorter training cycles. Accessing data can be challenging when your datasets are distributed across different regions and clouds. Additionally, as GPUs remain scarce and expensive resources, it becomes more common to set up training clusters remote from where the data resides. This multi-region/cloud scenario introduces the challenge of losing data locality, resulting in operational overhead, latency, and expensive cloud costs.
In the third webinar of the multi-cloud webinar series, Chanchan and Shawn dive deep into:
- The data locality challenges in the multi-region/cloud ML pipeline
- Using a cloud-native distributed caching system to overcome these challenges
- The architecture and integration of PyTorch/Ray+Alluxio+S3 using POSIX or RESTful APIs
- Live demo with ResNet and BERT benchmark results showing performance gains and cost savings analysis
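As a tiny illustration of the POSIX side of such an integration: once remote data is exposed through a mount, training and checkpointing code is ordinary file I/O. In the sketch below a temp directory stands in for a hypothetical Alluxio FUSE mount point (e.g. something like `/mnt/alluxio` backed by S3 in a real deployment; the path is an assumption, not a prescribed layout).

```python
import os
import tempfile

# Stand-in for a FUSE mount point; in a real deployment this might be
# /mnt/alluxio with a cloud bucket behind it.
mount = tempfile.mkdtemp()

# The "remote" dataset appears as ordinary files under the mount.
with open(os.path.join(mount, "sample_000.txt"), "w") as f:
    f.write("label=1 features=0.3,0.7")

def load_sample(name):
    # Training code uses plain POSIX reads; it does not need to know
    # whether the bytes come from a local cache or a remote bucket.
    with open(os.path.join(mount, name)) as f:
        return f.read()

def save_checkpoint(name, blob):
    # Checkpoints are written back through the same mount.
    with open(os.path.join(mount, name), "wb") as f:
        f.write(blob)

print(load_sample("sample_000.txt"))
save_checkpoint("ckpt_001.bin", b"\x00\x01")
```

The RESTful alternative mentioned in the agenda replaces these `open()` calls with HTTP requests, but the training loop's structure stays the same.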
Optimizing Data Access for Analytics And AI with Alluxio Alluxio, Inc.
Alluxio x Tobiko - ETL Happy Hour
April 16, 2024
For more Alluxio events: https://alluxio.io/events/
Speaker:
Lucy Ge (Staff Software Engineer @ Alluxio)
In this presentation, Lucy Ge will discuss the data access challenges in the data pipeline and how to optimize the speed and costs of analytics and AI workloads.
Speed Up Presto at Uber with Alluxio Caching Alluxio, Inc.
Alluxio x Tobiko - ETL Happy Hour
April 16, 2024
For more Alluxio events: https://alluxio.io/events/
Speaker:
Chen Liang (Staff Software Engineer @ Uber)
In this presentation, Chen Liang will share the design and implementation of the Alluxio-Presto local cache to reduce query latency.
Correctly Loading Incremental Data at Scale Alluxio, Inc.
Alluxio x Tobiko - ETL Happy Hour
April 16, 2024
For more Alluxio events: https://alluxio.io/events/
Speaker:
Toby Mao (CTO @ Tobiko Data)
Writing efficient and correct incremental pipelines is challenging. Data practitioners who take on this challenge are viewed as performing an "advanced" function, which discourages broader teams from adopting incremental loads. In this lightning talk, CTO of Tobiko Data, Toby Mao, will demystify incremental loading data at scale.
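The core mechanic of an incremental load can be sketched in a few lines: track a high-water mark and process only rows past it. This toy version (not Tobiko's implementation) deliberately ignores the hard parts the talk is about, such as late-arriving data, backfills, and restatements, which is exactly why correct incremental pipelines are considered "advanced".

```python
def incremental_load(source_rows, state, target):
    """Watermark-based incremental load: process only rows newer than
    the last high-water mark. Toy sketch; assumes `updated_at` is a
    monotonically assigned timestamp with no late arrivals.
    """
    watermark = state.get("watermark", 0)
    new_rows = [r for r in source_rows if r["updated_at"] > watermark]
    target.extend(new_rows)
    if new_rows:
        state["watermark"] = max(r["updated_at"] for r in new_rows)
    return len(new_rows)

state, target = {}, []
rows = [{"id": 1, "updated_at": 10}, {"id": 2, "updated_at": 20}]
print(incremental_load(rows, state, target))  # first run loads both rows
rows.append({"id": 3, "updated_at": 30})
print(incremental_load(rows, state, target))  # second run loads only row 3
```

Note that if a row with `updated_at` below the watermark arrives late, this sketch silently drops it; handling that case is where real tooling earns its keep.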
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML Alluxio, Inc.
Big Data Bellevue Meetup
March 21, 2024
For more Alluxio events: https://alluxio.io/events/
Speakers:
Bin Fan (VP of Open Source, Alluxio)
In this presentation, Bin Fan (VP of Open Source @ Alluxio) will address a critical challenge of optimizing data loading for distributed Python applications within AI/ML workloads in the cloud, focusing on popular frameworks like Ray and Hugging Face. Integration of Alluxio’s distributed caching for Python applications is accomplished using the fsspec interface, thus greatly improving data access speeds. This is particularly useful in machine learning workflows, where repeated data reloading across slow, unstable or congested networks can severely affect GPU efficiency and escalate operational costs.
Attendees can look forward to practical, hands-on demonstrations showcasing the tangible benefits of Alluxio’s caching mechanism across various real-world scenarios. These demos will highlight the enhancements in data efficiency and overall performance of data-intensive Python applications. This presentation is tailored for developers and data scientists eager to optimize their AI/ML workloads. Discover strategies to accelerate your data processing tasks, making them not only faster but also more cost-efficient.
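The caching behavior described above can be sketched without Alluxio at all: on first access, copy the file from "remote" storage into a local cache directory, then serve repeat reads locally. (The real integration exposes this through the fsspec interface so Ray and Hugging Face loaders pick it up transparently; the hand-rolled class below is only an illustration of the idea, with a temp directory standing in for slow object storage.)

```python
import os
import shutil
import tempfile

class CachingReader:
    """Sketch of transparent local caching for remote reads."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        self.remote_reads = 0  # counts slow-path fetches

    def open(self, remote_path):
        local = os.path.join(self.cache_dir, os.path.basename(remote_path))
        if not os.path.exists(local):
            self.remote_reads += 1           # slow path: fetch once
            shutil.copy(remote_path, local)  # stand-in for a network read
        return open(local, "rb")             # fast path: local cache

# Demo: a temp directory stands in for remote object storage.
remote_dir, cache_dir = tempfile.mkdtemp(), tempfile.mkdtemp()
with open(os.path.join(remote_dir, "shard-0001.bin"), "wb") as f:
    f.write(b"training bytes")

reader = CachingReader(cache_dir)
reader.open(os.path.join(remote_dir, "shard-0001.bin")).read()
reader.open(os.path.join(remote_dir, "shard-0001.bin")).read()
print(reader.remote_reads)  # the second read never touched "remote"
```

Over an unstable or congested network, avoiding that second remote read is what keeps GPUs fed between epochs.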
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat... Alluxio, Inc.
Alluxio Monthly Webinar
Feb. 27, 2024
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Tarik Bennett (Senior Solutions Engineer, Alluxio)
As GenAI and AI continue to transform businesses, scaling these workloads requires optimized underlying infrastructure. A multi-cloud architecture allows organizations to leverage different cloud services to meet diverse workload demands while maximizing efficiency, reducing costs, and avoiding vendor lock-in. However, achieving a multi-cloud vision can be challenging.
In this webinar, Tarik will share how an agnostic data layer, like Alluxio, allows you to embrace the separation of storage from compute and simplify the adoption of multi-cloud for AI.
- Learn why leveraging multiple cloud providers is critical for balancing performance, scalability, and cost of your AI platform
- Discover how an agnostic data layer like Alluxio provides seamless data access in multi-cloud that bridges storage and compute without data replication
- Gain insights into real-world examples and best practices for deploying AI across on-prem, hybrid, and multi-cloud environments
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader... Alluxio, Inc.
Alluxio Monthly Webinar
Jan. 30, 2024
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Kevin Petrie (VP of Research, Eckerson Group)
- Omid Razavi (SVP of Customer Success, Alluxio)
2024 is gearing up to be an impactful year for AI and analytics. Join us on January 30, as Kevin Petrie (VP of Research at Eckerson Group) and Omid Razavi (SVP of Customer Success at Alluxio) share key trends that data and AI leaders should know. This event will efficiently guide you with market data and expert insights to drive successful business outcomes.
- Assess current and future trends in data and AI with industry experts
- Discover valuable insights and practical recommendations
- Learn best practices to make your enterprise data more accessible for both analytics and AI applications
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction Alluxio, Inc.
Data Infra Meetup
Jan. 25, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Juncheng Yang (Ph.D. Candidate, @CMU)
As a cache eviction algorithm, FIFO has a lot of attractive properties, such as simplicity, speed, scalability, and flash-friendliness. The most prominent criticism of FIFO is its low efficiency (high miss ratio). In this talk, I will describe a simple, scalable FIFO-based algorithm with three static queues (S3-FIFO). Evaluated on 6594 cache traces from 14 datasets, we show that S3-FIFO has lower miss ratios than state-of-the-art algorithms across traces. Moreover, S3-FIFO's efficiency is robust: it has the lowest mean miss ratio on 10 of the 14 datasets. FIFO queues enable S3-FIFO to achieve good scalability, with 6× higher throughput than optimized LRU at 16 threads. Our insight is that most objects in skewed workloads will only be accessed once in a short window, so it is critical to evict them early (also called quick demotion). The key to S3-FIFO is a small FIFO queue that filters out most objects from entering the main cache, which provides a guaranteed demotion speed and high demotion precision.
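For readers who want the shape of the algorithm, here is a compact Python sketch of S3-FIFO: a small probationary FIFO queue, a main FIFO queue with lazy promotion/reinsertion, and a ghost queue of recently evicted keys. It follows the description above but none of the lock-free engineering that gives the real implementation its throughput; the queue sizing (~10% small) and the frequency cap of 3 are simplifications.

```python
from collections import deque

class S3FIFO:
    """Simplified S3-FIFO cache sketch (small + main + ghost queues)."""

    def __init__(self, capacity):
        self.cap = capacity
        self.small_cap = max(1, capacity // 10)  # ~10% small queue
        self.small, self.main = deque(), deque()
        self.ghost = deque(maxlen=capacity)      # evicted keys only
        self.data, self.freq = {}, {}

    def get(self, key):
        if key in self.data:
            self.freq[key] = min(self.freq[key] + 1, 3)  # capped counter
            return self.data[key]
        return None

    def put(self, key, value):
        if key in self.data:
            self.data[key] = value
            return
        while len(self.data) >= self.cap:
            self._evict()
        self.data[key], self.freq[key] = value, 0
        if key in self.ghost:
            self.main.append(key)   # seen recently: admit straight to main
        else:
            self.small.append(key)  # new object: quarantine in small queue

    def _evict(self):
        if len(self.small) >= self.small_cap:
            key = self.small.popleft()
            if self.freq[key] > 0:
                self.main.append(key)  # re-accessed: promote to main
            else:
                self.ghost.append(key)  # one-hit wonder: quick demotion
                del self.data[key]
                del self.freq[key]
                return
        while self.main:
            key = self.main.popleft()
            if self.freq[key] > 0:
                self.freq[key] -= 1     # FIFO-reinsertion, not LRU
                self.main.append(key)
            else:
                del self.data[key]
                del self.freq[key]
                return
```

With only FIFO appends/pops and a counter, no per-access reordering is needed, which is where the scalability over LRU comes from.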
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge Alluxio, Inc.
Data Infra Meetup
Jan. 25, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Jingwen Ouyang (Product Manager, @Alluxio)
In this session, Jingwen presents an overview of using Alluxio Edge caching to accelerate Trino or Presto queries. She offers practical best practices for using distributed caching with compute engines. In addition, this session also features insights from real-world examples.
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud Alluxio, Inc.
Data Infra Meetup
Jan. 25, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Siyuan Sheng (Senior Software Engineer, @Alluxio)
- Chunxu Tang (Research Scientist, @Alluxio)
In this session, cloud optimization specialists Chunxu and Siyuan break down the challenges and present a fresh architecture designed to optimize I/O across the data pipeline, ensuring GPUs function at peak performance. The integrated solution of PyTorch/Ray + Alluxio + S3 offers a promising way forward, and the speakers delve deep into its practical applications. Attendees will not only gain theoretical insights but will also be treated to hands-on instructions and demonstrations of deploying this cutting-edge architecture in Kubernetes, specifically tailored for Tensorflow/PyTorch/Ray workloads in the public cloud.
Data Infra Meetup | ByteDance's Native Parquet Reader Alluxio, Inc.
Data Infra Meetup
Jan. 25, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Shengxuan Liu (Software Engineer, @ByteDance)
Shengxuan Liu from ByteDance presents ByteDance’s new native Parquet Reader. The talk covers the architecture and key features of the Reader, and how it improves data processing efficiency.
Data Infra Meetup | Uber's Data Storage Evolution Alluxio, Inc.
Data Infra Meetup
Jan. 25, 2024
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Jing Zhao (Principal Engineer, @Uber)
Uber builds one of the biggest data lakes in the industry, which stores exabytes of data. In this talk, we will introduce the evolution of our data storage architecture, and delve into multiple key initiatives during the past several years.
Specifically, we will introduce:
- Our on-prem HDFS cluster scalability challenges and how we solved them
- Our efficiency optimizations that significantly reduced the storage overhead and unit cost without compromising reliability and performance
- The challenges we are facing during the ongoing Cloud migration and our solutions
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...Alluxio, Inc.
Alluxio Monthly Webinar
Nov. 15, 2023
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Tarik Bennett (Senior Solutions Engineer)
- Beinan Wang (Senior Staff Engineer & Architect)
Many companies are developing architectures for AI platforms but have concerns about efficiency at scale as data volumes increase. They use centralized cloud data lakes, such as S3, to store training data. However, GPU shortages add further complications. Storage and compute can be separate, or even remote, making data loading slow and expensive:
1) Optimizing a developmental setup can include manual copies, which are slow and error-prone
2) Directly transferring data across regions or from cloud to on-premises can incur expensive egress fees
This webinar covers solutions to improve data loading for model training. You will learn:
- The data loading challenges with distributed infrastructure
- Typical solutions, including NFS/NAS on object storage, and why they are not the best options
- Common architectures that can improve data loading and cost efficiency
- Using Alluxio to accelerate model training and reduce costs
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...Alluxio, Inc.
AI Infra Day
Oct. 25, 2023
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Adit Madan (Director of Product Management, @Alluxio)
In this session, Adit Madan, Director of Product Management at Alluxio, presents an overview of using distributed caching to accelerate model training and serving. He explores the requirements of data access patterns in the ML pipeline and offers practical best practices for using distributed caching in the cloud. This session features insights from real-world examples, such as AliPay, Zhihu, and more.
AI Infra Day | The AI Infra in the Generative AI EraAlluxio, Inc.
AI Infra Day
Oct. 25, 2023
Organized by Alluxio
For more Alluxio Events: https://www.alluxio.io/events/
Speaker:
- Bin Fan (Chief Architect, VP of Open Source, @Alluxio)
As the AI landscape rapidly evolves, the advancements in generative AI technologies, such as ChatGPT, are driving a need for a robust AI infra stack. This opening keynote will explore the key trends of the AI infra stack in the generative AI era.
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...XfilesPro
Wondering how X-Sign gained popularity in such a short time span? This eSign functionality of XfilesPro DocuPrime offers many advancements for Salesforce users. Explore them now!
Measures in SQL (SIGMOD 2024, Santiago, Chile)Julian Hyde
SQL has attained widespread adoption, but Business Intelligence tools still use their own higher level languages based upon a multidimensional paradigm. Composable calculations are what is missing from SQL, and we propose a new kind of column, called a measure, that attaches a calculation to a table. Like regular tables, tables with measures are composable and closed when used in queries.
SQL-with-measures has the power, conciseness and reusability of multidimensional languages but retains SQL semantics. Measure invocations can be expanded in place to simple, clear SQL.
To define the evaluation semantics for measures, we introduce context-sensitive expressions (a way to evaluate multidimensional expressions that is consistent with existing SQL semantics), a concept called evaluation context, and several operations for setting and modifying the evaluation context.
A talk at SIGMOD, June 9–15, 2024, Santiago, Chile
Authors: Julian Hyde (Google) and John Fremlin (Google)
https://doi.org/10.1145/3626246.3653374
Need for Speed: Removing speed bumps from your Symfony projects ⚡️Łukasz Chruściel
No one wants their application to drag like a car stuck in the slow lane! Yet it’s all too common to encounter bumpy, pothole-filled solutions that slow the speed of any application. Symfony apps are not an exception.
In this talk, I will take you for a spin around the performance racetrack. We’ll explore common pitfalls - those hidden potholes in your application that can cause unexpected slowdowns. Learn how to spot these performance bumps early, and more importantly, how to navigate around them to keep your application running at top speed.
We will focus in particular on tuning your engine at the application level, making the right adjustments to ensure that your system responds like a well-oiled, high-performance race car.
SOCRadar's Aviation Industry Q1 Incident Report is out now!
The aviation industry has always been a prime target for cybercriminals due to its critical infrastructure and high stakes. In the first quarter of 2024, the sector faced an alarming surge in cybersecurity threats, revealing its vulnerabilities and the relentless sophistication of cyber attackers.
SOCRadar’s Aviation Industry Quarterly Incident Report provides an in-depth analysis of these threats, detected and examined through our extensive monitoring of hacker forums, Telegram channels, and dark web platforms.
Using Query Store in Azure PostgreSQL to Understand Query PerformanceGrant Fritchey
Microsoft has added an excellent new extension in PostgreSQL on their Azure Platform. This session, presented at Posette 2024, covers what Query Store is and the types of information you can get out of it.
Mobile App Development Company In Noida | Drona InfotechDrona Infotech
Drona Infotech is a premier mobile app development company in Noida, providing cutting-edge solutions for businesses.
Visit Us For : https://www.dronainfotech.com/mobile-application-development/
Transform Your Communication with Cloud-Based IVR SolutionsTheSMSPoint
Discover the power of Cloud-Based IVR Solutions to streamline communication processes. Embrace scalability and cost-efficiency while enhancing customer experiences with features like automated call routing and voice recognition. Accessible from anywhere, these solutions integrate seamlessly with existing systems, providing real-time analytics for continuous improvement. Revolutionize your communication strategy today with Cloud-Based IVR Solutions. Learn more at: https://thesmspoint.com/channel/cloud-telephony
UI5con 2024 - Keynote: Latest News about UI5 and its EcosystemPeter Muessig
Learn about the latest innovations in and around OpenUI5/SAPUI5: UI5 Tooling, UI5 linter, UI5 Web Components, Web Components Integration, UI5 2.x, UI5 GenAI.
Recording:
https://www.youtube.com/live/MSdGLG2zLy8?si=INxBHTqkwHhxV5Ta&t=0
Top Benefits of Using Salesforce Healthcare CRM for Patient Management.pdfVALiNTRY360
Salesforce Healthcare CRM, implemented by VALiNTRY360, revolutionizes patient management by enhancing patient engagement, streamlining administrative processes, and improving care coordination. Its advanced analytics, robust security, and seamless integration with telehealth services ensure that healthcare providers can deliver personalized, efficient, and secure patient care. By automating routine tasks and providing actionable insights, Salesforce Healthcare CRM enables healthcare providers to focus on delivering high-quality care, leading to better patient outcomes and higher satisfaction. VALiNTRY360's expertise ensures a tailored solution that meets the unique needs of any healthcare practice, from small clinics to large hospital systems.
For more info visit us https://valintry360.com/solutions/health-life-sciences
KuberTENes Birthday Bash Guadalajara - Introduction to Argo CD
Accelerate and Scale Big Data Analytics and Machine Learning Pipelines with Disaggregated Compute and Storage
1. Dipti Borkar | Head of Product, Alluxio
Shailesh Manjrekar | Head of AI/ML Product and Solutions, SwiftStack
NextGen Data Analytics Stack – Alluxio and SwiftStack
Edge to Core to Cloud
2. Unstoppable Data Growth – Edge to core to cloud
Emphasis on capturing value and showing return on investment (ROI) from data
* IDC Worldwide Storage in Big Data Forecast, 2015–2019, October 2015, and IDC Directions
Value capture is key. By 2020: 30B IoT connected devices*, 100–250EB of big data storage capacity*, and 44ZB of data created*
3. Status quo – existing solutions
Business leaders and storage architects struggle to show return on investment
[Diagram: five DAS silos, each pairing compute + storage]
- Poor utilization
- DIY fatigue
- High CapEx
- High OpEx
- Data gravity
INEFFICIENT AND EXPENSIVE
4. Four big trends driving the need for a new data analytics stack
- Separation of compute & storage
- Hybrid and multi-cloud environments
- Self-service data across the enterprise
- Rise of the object store
5. Customer challenges with existing solutions
Lack of enterprise-ready products and the continued pressure of ever-increasing cloud OpEx
CHALLENGE 1: Ever-increasing operating expenditures on (a) poorly utilized existing DAS solutions or (b) cloud storage deployment costs at scale
CHALLENGE 2: Need for a high-throughput stack with API compatibility to support batch, interactive, and advanced analytical workloads
CHALLENGE 3: Lack of enterprise-ready, multi-cloud data lake systems – at-scale deployments with lifecycle management, self-healing, geo-replication, and faster re-builds
6. Data Ecosystem – Beta Data Ecosystem 1.0
[Diagram: compute tiers paired with storage tiers]
7. Big data journey and innovation options for enterprises
1. Co-located compute & HDFS on the same cluster (MR / Hive on HDFS): typically compute-bound clusters over 100% capacity; compute & I/O need to be scaled together even when not needed
2. Disaggregated compute & HDFS on the same cluster (Hive on HDFS): compute & I/O can be scaled independently, but I/O is still needed on HDFS, which is expensive
3. HDFS for hybrid cloud: burst HDFS data in the cloud, public or private
4. Support more frameworks: support Presto, Spark, and other computes without app changes
5. Transition to object store: enable & accelerate big data on object stores
8. The SwiftStack Data Analytics Solution with Alluxio
Multi-cloud storage and data management; accelerated compute, data accessibility, and elasticity
Interfaces: Java File API, HDFS interface, S3 interface, REST API, FUSE interface
Drivers: HDFS driver, Swift driver, S3 driver
9. SwiftStack Data Analytics Solution – business use cases
Use cases: customer, security, and fraud analysis; precision medicine and bio-informatics; customer churn / sentiment analysis; analytics as a service; operational analytics; Internet of Things / Everything
Industry verticals: financial services (FSI); healthcare and life sciences, genomics; cloud service providers; oil and gas, industrial internet and manufacturing; media and entertainment
10. SwiftStack Data Analytics solution – value to be captured by enterprises
Data and analytics as a source of competitive advantage
Source: IDC Directions
“Organizations that analyze all relevant data and deliver actionable information will achieve an extra $430B in productivity gains over less analytically oriented peers by 2020”
IDC: Worldwide Big Data and Analytics 2016 Predictions
Value can be created in the following ways, with some industry relevance:
- Improve operational efficiency
- Reduce cost
- New product development
- Insights into new services
- Better customer experience
11. Alluxio and SwiftStack partnership
2014: Originated as the Tachyon project at UC Berkeley’s AMPLab by then-Ph.D. student and now Alluxio CTO, Haoyuan (H.Y.) Li
2015: Open source project established, and company founded to commercialize Alluxio
Goal: orchestrate data at memory speed for the cloud, for data-driven apps such as big data analytics, ML, and AI
12. Alluxio – key innovations
- Data elasticity with a unified namespace: abstract data silos & storage systems to independently scale data on demand with compute
- Data accessibility for popular APIs & API translation: run Spark, Hive, Presto, and ML workloads on your data located anywhere
- Data locality with intelligent multi-tiering: accelerate big data workloads with transparent tiered local data
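The unified-namespace idea can be illustrated with a minimal sketch: a mount table maps logical paths to the URIs of the backing stores, so applications address one namespace while data lives anywhere. The mount points and URIs below are hypothetical stand-ins, not Alluxio's actual implementation.

```python
# Minimal sketch of a unified namespace: a mount table maps logical paths
# to backing-store URIs. All mount points and URIs here are hypothetical.

MOUNT_TABLE = {
    "/warehouse": "s3://analytics-bucket/warehouse",   # hypothetical S3 mount
    "/archive":   "swift://backup-container/archive",  # hypothetical Swift mount
    "/raw":       "hdfs://namenode:8020/raw",          # hypothetical HDFS mount
}

def resolve(logical_path: str) -> str:
    """Translate a logical path to the URI of its backing store."""
    # Longest-prefix match so nested mounts resolve to the most specific entry.
    for mount in sorted(MOUNT_TABLE, key=len, reverse=True):
        if logical_path == mount or logical_path.startswith(mount + "/"):
            return MOUNT_TABLE[mount] + logical_path[len(mount):]
    raise KeyError(f"no mount covers {logical_path!r}")

print(resolve("/warehouse/events/2019/part-0.parquet"))
# s3://analytics-bucket/warehouse/events/2019/part-0.parquet
```

Compute frameworks then see a single tree of paths; swapping S3 for Swift or HDFS is a mount-table change rather than an application change.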
14. “Infrastructure challenges are the primary inhibitor for broader adoption of AI/ML workflows. SwiftStack’s multi-cloud data management solution is the first of its kind in the industry and effectively handles the storage I/O challenges faced by edge-to-core-to-cloud, large-scale AI/ML data pipelines”
Amita Potnis, Research Director at IDC’s Infrastructure System Platform and Technologies Group
15. Multi-Cloud Storage and Data Management
Property of SwiftStack Inc.
Storage and multi-cloud data management for data-driven applications and workflows
SwiftStack Storage – on-premises cloud storage:
- Highest throughput performance
- Easy to deploy, operate, and scale
- From tens of terabytes to hundreds of petabytes
- Spans multiple geographic regions
- Proven platform to realize more value from data!
SwiftStack 1space – multi-cloud data management:
- Transparent access to a single storage namespace
- Public and private infrastructure
- Policy-driven data placement
- Metadata search across the namespace
- Leverage unique services across clouds!
16. SwiftStack Object Storage Architecture
Continuous auditing, automatic replication, fault tolerant
- Automated storage system management for standard servers
- Replicas and erasure codes on direct-attached storage
- Masterless, quorum writes; nearest reads
- As-dispersed-as-possible data placement across nodes / zones / regions
- Distributed partitions in a modified consistent hash ring
Services: replication, reconstruction, auditing, device inventory management, storage system metrics collection, hardware fault detection
Runs on standard servers, drives & networking across sites (Site 1, Site 2, Site 3)
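The "consistent hash ring" placement mentioned above can be sketched generically: hash each node many times onto a ring of virtual nodes, then map each object key to the first virtual node clockwise from its hash. This is an illustration of the core idea only; real Swift-style rings use fixed partitions, zones, and replica dispersion rather than this simplified version.

```python
import hashlib
from bisect import bisect

# Illustrative consistent hash ring: keys map to the first virtual node
# clockwise from their hash. A simplified sketch, not SwiftStack's ring.

class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Each physical node gets `vnodes` positions for better balance.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # First virtual node at or past the key's hash, wrapping to 0.
        idx = bisect(self.keys, self._hash(key)) % len(self.keys)
        return self.ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("container/object-0001"))  # a stable choice of one node
```

The payoff of this scheme is that adding or removing a node only remaps the keys adjacent to its virtual nodes, rather than reshuffling the whole keyspace.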
18. Data Analytics Hub – Total Cost of Ownership (TCO) Analysis
The 5-year TCO of the hosted private cloud solution is one quarter of that of a public cloud deployment
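A TCO comparison of this shape boils down to simple arithmetic: upfront CapEx plus yearly OpEx over the horizon. The sketch below uses entirely hypothetical dollar figures chosen to reproduce the slide's 4x ratio; they are not SwiftStack or cloud pricing.

```python
# Back-of-envelope TCO comparison illustrating the slide's 4x claim.
# All dollar figures are hypothetical placeholders, not vendor pricing.

YEARS = 5

def tco(capex, opex_per_year, years=YEARS):
    """Total cost of ownership over the horizon: upfront + recurring."""
    return capex + opex_per_year * years

private_cloud = tco(capex=1_000_000, opex_per_year=200_000)    # $2.0M over 5 years
public_cloud  = tco(capex=0,         opex_per_year=1_600_000)  # $8.0M over 5 years

ratio = private_cloud / public_cloud
print(f"private/public 5-year TCO ratio: {ratio:.2f}")  # 0.25
```

The private option trades a large CapEx outlay for much lower recurring spend, which is why the gap widens as the horizon lengthens.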
20. 1. SwiftStack Data Analytics solution – on-premises deployment
For customers starting their on-premises analytics journeys
Benefits:
- Same performance as HDFS
- No more HDFS: operational simplicity
- Compute can be fully virtualized / containerized!
- Durability++ (erasure coding)
- Scale (billions of objects / racks / geo)
[Diagram: Alluxio co-located with Presto / Spark / Hive in the same container or machine, dramatically speeding up big data on object stores on premises]
21. 2. Cloud bursting with SwiftStack Data Analytics solution
Hybrid workflow: customers host data on-premises and leverage the public cloud for economies of scale, with Alluxio providing data locality
Benefits:
- Data, as a strategic asset, stays on-premises
- Leverage cloud economies of scale for compute
[Diagram: Hadoop cluster nodes running Alluxio (“alluxio://”) and compute (Spark, Presto, Hive, ...) in the public cloud, connected over the WAN to the private cloud]
22. 3. HDFS off-load to SwiftStack Data Analytics solution
HDFS off-load: existing HDFS customers on DAS looking to move to S3 need a migration; leverage DistCp (distributed copy) as the data mover, then keep the same workflow using Alluxio
Benefits:
- A known and well-understood process for administrators: existing HDFS workflows plus an rsync-like backup workflow
[Diagram: a co-located environment (Impala, Hive, Spark) in the same data center / region; Presto and Spark on Alluxio enable big data on object stores across single or multiple clouds]
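The "DistCp for the bulk copy, then an rsync-like pass for the remainder" pattern can be sketched as a simple incremental sync: copy only objects that are missing at the destination or whose size differs. Local directories stand in for HDFS and the object store here; a real migration would use `hadoop distcp` and object-store listings instead.

```python
import shutil
from pathlib import Path

# Sketch of the rsync-like incremental pass in an HDFS-to-object-store
# migration. Local directories stand in for the real source and destination.

def incremental_sync(src: Path, dst: Path) -> list[str]:
    """Copy files that are missing at dst or whose size differs; return them."""
    copied = []
    for f in src.rglob("*"):
        if not f.is_file():
            continue
        rel = f.relative_to(src)
        target = dst / rel
        # Size mismatch is a cheap (if imperfect) change detector;
        # a real tool would also compare checksums or timestamps.
        if not target.exists() or target.stat().st_size != f.stat().st_size:
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, target)
            copied.append(str(rel))
    return copied
```

Run after the bulk DistCp completes, this pass converges the destination on whatever changed during the copy window, and a second run copies nothing.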
25. Get started with the Data Analytics Solution
§ Come talk to us about analytics on SwiftStack
§ Data Analytics Solution – Alluxio and SwiftStack deliver a winning combination of performance and capacity: “deliver on the promise of a future-ready data lake”
§ Multiple use cases across industry verticals show the success of a highly scalable, lowest-TCO big data solution