Modern data lakes are now built on cloud storage, helping organizations leverage the scale and economics of object storage while simplifying the overall flow of data storage and analysis.
Breakthrough OLAP performance with Cassandra and Spark (Evan Chan)
Find out about breakthrough architectures for fast OLAP performance querying Cassandra data with Apache Spark, including a new open source project, FiloDB.
Mario Molina, Software Engineer
CDC systems are typically used to identify changes in data sources and to capture and replicate those changes to other systems. Companies use CDC to sync data across systems, migrate to the cloud, or feed stream processing, among other use cases.
In this presentation we'll look at CDC patterns, see how to use CDC with Apache Kafka, and do a live demo!
https://www.meetup.com/Mexico-Kafka/events/277309497/
The pervasiveness of cloud and containers has led to systems that are much more distributed and dynamic in nature. Highly elastic microservice and serverless architectures mean containers spin up on demand and scale to zero when that demand goes away. In this world, servers are very much cattle, not pets. This shift has exposed deficiencies in some of the tools and practices we used in the world of servers-as-pets. Specifically, there are questions around how we monitor and debug these types of systems at scale. And with the rise of DevOps and product mindset, making data-driven decisions is becoming increasingly important for agile development teams.
In this talk, we discuss a new approach to system monitoring and data collection: the observability pipeline. For organizations that are heavily siloed, this approach can help empower teams when it comes to operating their software. The observability pipeline provides a layer of abstraction that allows you to get operational data such as logs and metrics everywhere it needs to be without impacting developers and the core system. Unlocking this data can also be a huge win for the business with things like auditability, business analytics, and pricing. Lastly, it allows you to change backing data systems easily or test multiple in parallel. With the amount of data and the number of tools modern systems demand these days, we'll see how the observability pipeline becomes just as essential to the operations of a service as the CI/CD pipeline.
Databricks CEO Ali Ghodsi introduces Databricks Delta, a new data management system that combines the scale and cost-efficiency of a data lake, the performance and reliability of a data warehouse, and the low latency of streaming.
Every business today wants to leverage data to drive strategic initiatives with machine learning, data science and analytics — but runs into challenges from siloed teams, proprietary technologies and unreliable data.
That’s why enterprises are turning to the lakehouse because it offers a single platform to unify all your data, analytics and AI workloads.
Join our How to Build a Lakehouse technical training, where we’ll explore how to use Apache Spark™, Delta Lake, and other open source technologies to build a better lakehouse. This virtual session will include concepts, architectures and demos.
Here’s what you’ll learn in this 2-hour session:
How Delta Lake combines the best of data warehouses and data lakes for improved data reliability, performance and security
How to use Apache Spark and Delta Lake to perform ETL processing, manage late-arriving data, and repair corrupted data directly on your lakehouse
A Practical Enterprise Feature Store on Delta Lake (Databricks)
The feature store is a data architecture concept used to accelerate data science experimentation and harden production ML deployments. Nate Buesgens and Bryan Christian describe a practical approach to building a feature store on Delta Lake at a large financial organization. This implementation has reduced feature engineering “wrangling” time by 75% and has increased the rate of production model delivery by 15x. The approach described focuses on practicality. It is informed by innovative approaches such as Feast, but our primary goal is evolutionary extensions of existing patterns that can be applied to any Delta Lake architecture.
Key Takeaways:
– Understand the key use cases that motivate the feature store from both a data science and engineering perspective.
– Consider edge cases where there may be opportunities for simplification such as “online” predictions.
– Review a typical logical data model for a feature store and how that can be applied to your business domain.
– Consider options for physical storage of the feature store in the Delta Lake.
– Understand common access patterns including metadata-based feature discovery.
As part of this session, I will be giving an introduction to Data Engineering and Big Data. It covers up-to-date trends.
* Introduction to Data Engineering
* Role of Big Data in Data Engineering
* Key Skills related to Data Engineering
* Overview of Data Engineering Certifications
* Free Content and ITVersity Paid Resources
Don't worry if you miss the live session - you can use the link below to watch the video afterwards.
https://youtu.be/dj565kgP1Ss
* Upcoming Live Session - Overview of Big Data Certifications (Spark Based) - https://www.meetup.com/itversityin/events/271739702/
Relevant Playlists:
* Apache Spark using Python for Certifications - https://www.youtube.com/playlist?list=PLf0swTFhTI8rMmW7GZv1-z4iu_-TAv3bi
* Free Data Engineering Bootcamp - https://www.youtube.com/playlist?list=PLf0swTFhTI8pBe2Vr2neQV7shh9Rus8rl
* Join our Meetup group - https://www.meetup.com/itversityin/
* Enroll for our labs - https://labs.itversity.com/plans
* Subscribe to our YouTube Channel for Videos - http://youtube.com/itversityin/?sub_confirmation=1
* Access Content via our GitHub - https://github.com/dgadiraju/itversity-books
* Lab and Content Support using Slack
In this session, Sergio covered the Lakehouse concept and how companies implement it, from data ingestion to insight. He showed how you can use Azure Data Services to speed up your analytics project, from ingesting and modelling data to delivering insights to end users.
Tech talk on what Azure Databricks is, why you should learn it and how to get started. We'll use PySpark and talk about some real-life examples from the trenches, including the pitfalls of accidentally leaving your clusters running and receiving a huge bill ;)
After this you will hopefully switch to Spark-as-a-service and get rid of your HDInsight/Hadoop clusters.
This is part 1 of an 8-part Data Science for Dummies series:
Databricks for dummies
Titanic survival prediction with Databricks + Python + Spark ML
Titanic with Azure Machine Learning Studio
Titanic with Databricks + Azure Machine Learning Service
Titanic with Databricks + MLS + AutoML
Titanic with Databricks + MLFlow
Titanic with DataRobot
Deployment, DevOps/MLops and Operationalization
Modernizing to a Cloud Data Architecture (Databricks)
Organizations with on-premises Hadoop infrastructure are bogged down by system complexity, unscalable infrastructure, and the increasing burden on DevOps to manage legacy architectures. Costs and resource utilization continue to go up while innovation has flatlined. In this session, you will learn why, now more than ever, enterprises are looking for cloud alternatives to Hadoop and are migrating off of the architecture in large numbers. You will also learn how elastic compute models’ benefits help one customer scale their analytics and AI workloads and best practices from their experience on a successful migration of their data and workloads to the cloud.
OSMC 2022 | VictoriaMetrics: scaling to 100 million metrics per second by Ali... (NETWAYS)
The growth of observability trends and Kubernetes adoption generates more demanding requirements for monitoring systems. Volumes of time series data increase exponentially, and old solutions just can't keep up with the pace. The talk covers how and why we created a new open-source time series database from scratch: which architectural decisions and trade-offs we had to make in order to meet the new expectations and handle 100 million metrics per second with VictoriaMetrics. The talk will be interesting for software engineers and DevOps practitioners familiar with observability and modern monitoring systems, or for those interested in building scalable, high-performance databases for time series.
BespinGlobal Consulting Division
Consultant Jungsik Choi (js.choi@bespinglobal.com)
Data Migration Seminar - "Let's Fly with Data"
Helping You Adopt Cloud | Asia's No. 1 cloud MSP as selected by Gartner, providing strategy, implementation, operations, and management services for successful cloud adoption
In KDD2011, Vijay Narayanan (Yahoo!) and Milind Bhandarkar (Greenplum Labs, EMC) conducted a tutorial on "Modeling with Hadoop". This is the first half of the tutorial.
3 Things to Learn About:
-How Kudu is able to fill the analytic gap between HDFS and Apache HBase
-The trade-offs between real-time transactional access and fast analytic performance
-How Kudu provides an option to achieve fast scans and random access from a single API
Roko Kruze of vectorized.io describes real-time analytics using Redpanda event streams and ClickHouse data warehouse. 15 December 2021 SF Bay Area ClickHouse Meetup
Deep Dive and Best Practices for Real Time Streaming Applications (Amazon Web Services)
Get answers to technical questions, frequently asked by those starting to work with streaming data. Learn best practices for building a real-time streaming data architecture on AWS with Amazon Kinesis, Spark Streaming, AWS Lambda, and Amazon EMR. First, we will focus on building a scalable, durable streaming data ingestion workflow from data producers like mobile devices, servers, or even web browsers. We will provide guidelines to minimize duplicates and achieve exactly-once processing semantics in your stream-processing applications. Then, we will show some of the proven architectures for processing streaming data using a combination of tools including Amazon Kinesis Stream, AWS Lambda, and Spark Streaming running on Amazon EMR.
Delta Lake is an open-source innovation that brings new capabilities for transactions, version control, and indexing to your data lakes. We uncover how Delta Lake benefits you and why it matters. Through this session, we showcase some of its benefits and how they can improve your modern data engineering pipelines. Delta Lake provides snapshot isolation, which helps with concurrent read/write operations and enables efficient insert, update, delete, and rollback capabilities. It allows background file optimization through compaction and z-order partitioning, achieving better performance. In this presentation, we will learn about the Delta Lake benefits, how it solves common data lake challenges, and, most importantly, the new Delta Time Travel capability.
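To make the Time Travel capability concrete, here is a minimal PySpark sketch; it is not from the session itself, the table path is hypothetical, and it assumes the delta-spark package and its jars are available:

```python
from pyspark.sql import SparkSession

# Assumes the delta-spark package is installed (pip install delta-spark).
spark = (SparkSession.builder
         .appName("delta-time-travel-demo")
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "/tmp/events_delta"  # hypothetical table location

# Write two versions of the same table.
spark.range(100).write.format("delta").mode("overwrite").save(path)
spark.range(200).write.format("delta").mode("overwrite").save(path)

# Time Travel: read the table as of the earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
print(v0.count())  # 100, even though the current version holds 200 rows
```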
Delta Lake delivers reliability, security and performance to data lakes. Join this session to learn how customers have achieved 48x faster data processing, leading to 50% faster time to insight after implementing Delta Lake. You’ll also learn how Delta Lake provides the perfect foundation for a cost-effective, highly scalable lakehouse architecture.
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap... (Flink Forward)
Flink Forward San Francisco 2022.
Being in the payments space, Stripe requires strict correctness and freshness guarantees. We rely on Flink as the natural solution for delivering on this in support of our Change Data Capture (CDC) infrastructure. We heavily rely on CDC as a tool for capturing data change streams from our databases without critically impacting database reliability, scalability, and maintainability. Data derived from these streams is used broadly across the business and powers many of our critical financial reporting systems totalling over $640 Billion in payment volume annually. We use many components of Flink’s flexible DataStream API to perform aggregations and abstract away the complexities of stream processing from our downstreams. In this talk, we’ll walk through our experience from the very beginning to what we have in production today. We’ll share stories around the technical details and trade-offs we encountered along the way.
by Jeff Chao
Data Quality With or Without Apache Spark and Its Ecosystem (Databricks)
A few solutions exist in the open-source community, either in the form of libraries or complete stand-alone platforms, that can be used to assure a certain level of data quality, especially when continuous imports happen. Organisations may consider picking up one of the available options: Apache Griffin, Deequ, DDQ, and Great Expectations. In this presentation we'll compare these different open-source products across dimensions like maturity, documentation, extensibility, and features such as data profiling and anomaly detection.
Deep-dive into Microservices Patterns with Replication and Stream Analytics
Target Audience: Microservices and Data Architects
This is an informational presentation about microservices event patterns, GoldenGate event replication, and event stream processing with Oracle Stream Analytics. This session will discuss some of the challenges of working with data in a microservices architecture (MA), and how the emerging concept of a “Data Mesh” can go hand-in-hand to improve microservices-based data management patterns. You may have already heard about common microservices patterns like CQRS, Saga, Event Sourcing and Transaction Outbox; we’ll share how GoldenGate can simplify these patterns while also bringing stronger data consistency to your microservice integrations. We will also discuss how complex event processing (CEP) and stream processing can be used with event-driven MA for operational and analytical use cases.
Business pressures for modernization and digital transformation drive demand for rapid, flexible DevOps, which microservices address, but also for data-driven Analytics, Machine Learning and Data Lakes which is where data management tech really shines. Join us for this presentation where we take a deep look at the intersection of microservice design patterns and modern data integration tech.
Building Cloud-Native App Series - Part 3 of 11
Microservices Architecture Series
AWS Kinesis Data Streams
AWS Kinesis Firehose
AWS Kinesis Data Analytics
Apache Flink - Analytics
The computational requirements of next-generation sequencing are placing a huge demand on IT organisations.
Building compute clusters is now a well-understood and relatively straightforward problem. However, NGS sequencing applications require large amounts of storage and high IO rates.
This talk details our approach for providing storage for next-gen sequencing applications.
Talk given at BIO-IT World, Europe, 2009.
This is the course that was presented by James Liddle and Adam Vile for Waters in September 2008.
The book of this course can be found at: http://www.lulu.com/content/4334860
IBM Cloud Object Storage System, presented Oct 16, 2017 at IBM Systems Technical University in New Orleans, LA. This covers IBM's object storage offering, which came from the acquisition of Cleversafe, formerly known as the DSnet product.
Shaping the Role of a Data Lake in a Modern Data Fabric Architecture (Denodo)
Watch full webinar here:
Data lakes have been both praised and loathed. They can be incredibly useful to an organization, but they can also be the source of major headaches. The ease of scaling storage at minimal cost has opened the door to many new solutions, but also to a proliferation of runaway objects that has coined the term "data swamp".
However, the addition of an MPP engine, based on Presto, to Denodo’s logical layer can change the way you think about the role of the data lake in your overall data strategy.
Watch this session on-demand to learn:
- The new MPP capabilities that Denodo includes
- How to use them to your advantage to improve security and governance of your lake
- New scenarios and solutions where your data fabric strategy can evolve
Building a Distributed File System for the Cloud-Native Era (Alluxio, Inc.)
Big Data Bellevue Meetup
May 19, 2022
For more Alluxio events: https://alluxio.io/events/
Speaker: Bin Fan (Founding Engineer & VP of Open Source, Alluxio)
Today, data engineering in modern enterprises has become increasingly more complex and resource-consuming, particularly because (1) the rich amount of organizational data is often distributed across data centers, cloud regions, or even cloud providers, and (2) the complexity of the big data stack has been quickly increasing over the past few years with an explosion in big-data analytics and machine-learning engines (like MapReduce, Hive, Spark, Presto, Tensorflow, PyTorch to name a few).
To address these challenges, it is critical to provide a single, logical namespace to federate different storage services, on-prem or cloud-native, to abstract away the data heterogeneity while providing data locality to improve computation performance. [Bin Fan] will share his observations and lessons learned in designing, architecting, and implementing such a system, the Alluxio open-source project, since 2015.
Alluxio originated from the UC Berkeley AMPLab (it used to be called Tachyon) and was initially proposed as a daemon service to enable Spark to share RDDs across jobs for performance and fault tolerance. Today, it has become a general-purpose, high-performance, and highly available distributed file system that provides a generic data service to abstract away complexity in data and I/O. Many companies and organizations today, like Uber, Meta, Tencent, TikTok, and Shopee, are using Alluxio in production as a building block in their data platforms to create a data abstraction and access layer. We will talk about the journey of this open source project, especially its design challenges in tiered metadata storage (based on RocksDB), the embedded replicated state machine (based on Raft) for HA, and the evolution of its RPC framework (based on gRPC).
HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks.
Introduction to Big Data Hadoop Training Online by www.itjobzone.biz (ITJobZone.biz)
Want to learn Hadoop online? This PPT gives you an introduction to Big Data Hadoop training online by expert trainers at ITJobZone.biz. Start your Hadoop online training with this presentation.
Essentials of Automations: The Art of Triggers and Actions in FME (Safe Software)
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam (takuyayamamoto1800)
In these slides, we show a simulation example and how to compile the solver.
The Helmholtz equation can be solved by helmholtzFoam; the Helmholtz equation with uniformly dispersed bubbles can be simulated by helmholtzBubbleFoam.
Enhancing Research Orchestration Capabilities at ORNL.pdf (Globus)
Cross-facility research orchestration comes with ever-changing constraints regarding the availability and suitability of various compute and data resources. In short, a flexible data and processing fabric is needed to enable the dynamic redirection of data and compute tasks throughout the lifecycle of an experiment. In this talk, we illustrate how we easily leveraged Globus services to instrument the ACE research testbed at the Oak Ridge Leadership Computing Facility with flexible data and task orchestration capabilities.
How to Position Your Globus Data Portal for Success: Ten Good Practices (Globus)
Science gateways allow science and engineering communities to access shared data, software, computing services, and instruments. Science gateways have gained a lot of traction in the last twenty years, as evidenced by projects such as the Science Gateways Community Institute (SGCI) and the Center of Excellence on Science Gateways (SGX3) in the US, The Australian Research Data Commons (ARDC) and its platforms in Australia, and the projects around Virtual Research Environments in Europe. A few mature frameworks have evolved with their different strengths and foci and have been taken up by a larger community such as the Globus Data Portal, Hubzero, Tapis, and Galaxy. However, even when gateways are built on successful frameworks, they continue to face the challenges of ongoing maintenance costs and how to meet the ever-expanding needs of the community they serve with enhanced features. It is not uncommon that gateways with compelling use cases are nonetheless unable to get past the prototype phase and become a full production service, or if they do, they don't survive more than a couple of years. While there is no guaranteed pathway to success, it seems likely that for any gateway there is a need for a strong community and/or solid funding streams to create and sustain its success. With over twenty years of examples to draw from, this presentation goes into detail for ten factors common to successful and enduring gateways that effectively serve as best practices for any new or developing gateway.
Unleash Unlimited Potential with One-Time Purchase
BoxLang is more than just a language; it's a community. By choosing a Visionary License, you're not just investing in your success, you're actively contributing to the ongoing development and support of BoxLang.
Quarkus Hidden and Forbidden Extensions (Max Andersen)
Quarkus has a vast extension ecosystem and is known for its subsonic and subatomic feature set. Some of these features are not as well known, and some extensions are less talked about, but that does not make them less interesting - quite the opposite.
Come join this talk to see some tips and tricks for using Quarkus and some of the lesser known features, extensions and development techniques.
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart... (Globus)
The Earth System Grid Federation (ESGF) is a global network of data servers that archives and distributes the planet’s largest collection of Earth system model output for thousands of climate and environmental scientists worldwide. Many of these petabyte-scale data archives are located in proximity to large high-performance computing (HPC) or cloud computing resources, but the primary workflow for data users consists of transferring data, and applying computations on a different system. As a part of the ESGF 2.0 US project (funded by the United States Department of Energy Office of Science), we developed pre-defined data workflows, which can be run on-demand, capable of applying many data reduction and data analysis to the large ESGF data archives, transferring only the resultant analysis (ex. visualizations, smaller data files). In this talk, we will showcase a few of these workflows, highlighting how Globus Flows can be used for petabyte-scale climate analysis.
Globus Compute with IRI Workflows - GlobusWorld 2024 (Globus)
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work the team is investigating ways to speedup the time to solution for many different parts of the DIII-D workflow including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
Code reviews are vital for ensuring good code quality. They serve as one of our last lines of defense against bugs and subpar code reaching production.
Yet, they often turn into annoying tasks riddled with frustration, hostility, unclear feedback and lack of standards. How can we improve this crucial process?
In this session we will cover:
- The Art of Effective Code Reviews
- Streamlining the Review Process
- Elevating Reviews with Automated Tools
By the end of this presentation, you'll have the knowledge of how to organize and improve your code review process.
Large Language Models and the End of Programming (Matt Welsh)
Talk by Matt Welsh at Craft Conference 2024 on the impact that Large Language Models will have on the future of software development. In this talk, I discuss the ways in which LLMs will impact the software industry, from replacing human software developers with AI, to replacing conventional software with models that perform reasoning, computation, and problem-solving.
We describe the deployment and use of Globus Compute for remote computation. This content is aimed at researchers who wish to compute on remote resources using a unified programming interface, as well as system administrators who will deploy and operate Globus Compute services on their research computing infrastructure.
How Recreation Management Software Can Streamline Your Operations.pptx (wottaspaceseo)
Recreation management software streamlines operations by automating key tasks such as scheduling, registration, and payment processing, reducing manual workload and errors. It provides centralized management of facilities, classes, and events, ensuring efficient resource allocation and facility usage. The software offers user-friendly online portals for easy access to bookings and program information, enhancing customer experience. Real-time reporting and data analytics deliver insights into attendance and preferences, aiding in strategic decision-making. Additionally, effective communication tools keep participants and staff informed with timely updates. Overall, recreation management software enhances efficiency, improves service delivery, and boosts customer satisfaction.
An Enterprise Resource Planning (ERP) system includes various modules that reduce any business's workload. Additionally, it organizes workflows, which drives enhanced productivity. Here is a detailed explanation of the ERP modules; going through the points will help you understand how the software is changing work dynamics.
More details here: https://blogs.nyggs.com/nyggs/enterprise-resource-planning-erp-system-modules/
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite (Google)
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
https://sumonreview.com/ai-pilot-review/
AI Pilot Review: Key Features
✅Deploy AI expert bots in Any Niche With Just A Click
✅With one keyword, generate complete funnels, websites, landing pages, and more.
✅More than 85 AI features are included in the AI pilot.
✅No setup or configuration; use your voice (like Siri) to do whatever you want.
✅You Can Use AI Pilot To Create your version of AI Pilot And Charge People For It…
✅ZERO Manual Work With AI Pilot. Never write, Design, Or Code Again.
✅ZERO Limits On Features Or Usages
✅Use Our AI-powered Traffic To Get Hundreds Of Customers
✅No Complicated Setup: Get Up And Running In 2 Minutes
✅99.99% Up-Time Guaranteed
✅30 Days Money-Back Guarantee
✅ZERO Upfront Cost
See My Other Reviews Article:
(1) TubeTrivia AI Review: https://sumonreview.com/tubetrivia-ai-review
(2) SocioWave Review: https://sumonreview.com/sociowave-review
(3) AI Partner & Profit Review: https://sumonreview.com/ai-partner-profit-review
(4) AI Ebook Suite Review: https://sumonreview.com/ai-ebook-suite-review
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx (rickgrimesss22)
Discover the essential features to incorporate in your Winzo clone app to boost business growth, enhance user engagement, and drive revenue. Learn how to create a compelling gaming experience that stands out in the competitive market.
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App (Google)
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
https://sumonreview.com/ai-fusion-buddy-review
AI Fusion Buddy Review: Key Features
✅Create Stunning AI App Suite Fully Powered By Google's Latest AI technology, Gemini
✅Use Gemini to build high-converting sales video scripts, ad copies, trending articles, blogs, etc. 100% unique!
✅Create Ultra-HD graphics with a single keyword or phrase that commands 10x eyeballs!
✅Fully automated AI articles bulk generation!
✅Auto-post or schedule stunning AI content across all your accounts at once—WordPress, Facebook, LinkedIn, Blogger, and more.
✅With one keyword or URL, generate complete websites, landing pages, and more…
✅Automatically create & sell AI content, graphics, websites, landing pages, & all that gets you paid non-stop 24*7.
✅Pre-built High-Converting 100+ website Templates and 2000+ graphic templates logos, banners, and thumbnail images in Trending Niches.
✅Say goodbye to wasting time logging into multiple Chat GPT & AI Apps once & for all!
✅Save over $5000 per year and kick out dependency on third parties completely!
✅Brand New App: Not available anywhere else!
✅ Beginner-friendly!
✅ZERO upfront cost or any extra expenses
✅Risk-Free: 30-Day Money-Back Guarantee!
✅Commercial License included!
See My Other Reviews Article:
(1) AI Genie Review: https://sumonreview.com/ai-genie-review
(2) SocioWave Review: https://sumonreview.com/sociowave-review
(3) AI Partner & Profit Review: https://sumonreview.com/ai-partner-profit-review
(4) AI Ebook Suite Review: https://sumonreview.com/ai-ebook-suite-review
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ... (Juraj Vysvader)
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc., and I didn't get rich from it, but my work did have 63K downloads (powering possibly tens of thousands of websites).
2. FIRST ORDER LOGIC
First-order logic (also known as first-order predicate calculus and predicate logic) is a collection of formal systems used in mathematics, philosophy, linguistics, and computer science.
Married("Harry", "Sally", "12-Dec-1995").
IsMotherOf("Sally", "Peter").
IsFatherOf("Harry", "Peter").
The Relational Model says that in your database this is how you think about and represent all your data.
"There exists one or more X such that the marriage happened in 1995."
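As a toy illustration of the contrast drawn above, the same facts can be stored as relations and the existential query expressed declaratively. This sketch is our own, not from the deck, using Python's built-in sqlite3:

```python
import sqlite3

# Store the first-order facts from the slide as relations (tables of tuples).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Married(husband TEXT, wife TEXT, on_date TEXT)")
db.execute("CREATE TABLE IsMotherOf(mother TEXT, child TEXT)")
db.execute("INSERT INTO Married VALUES ('Harry', 'Sally', '12-Dec-1995')")
db.execute("INSERT INTO IsMotherOf VALUES ('Sally', 'Peter')")

# "There exists one or more X such that the marriage happened in 1995"
# becomes an ordinary relational query:
rows = db.execute("SELECT * FROM Married WHERE on_date LIKE '%1995'").fetchall()
print(rows)  # [('Harry', 'Sally', '12-Dec-1995')]
```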
5. CTO GETS HADOOP IN.
1. Scale-out architecture  2. Shared nothing  3. Compute + storage together  4. Google-like!!
6. ALL WENT SMOOTH UNTIL...
A zip file containing one million JPEG files was sent from a third-party vendor. We wrote a MapReduce program to process it.
The file is 8 GB, separated into 128 MB blocks (about 63 blocks); with 3x replication the total size is about 26 GB.
We executed the application. What might have happened?
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
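A hedged sketch of the underlying issue, with hypothetical paths: a single zip is not a splittable input, so one task reads the whole archive; extracting first lets Spark parallelize, but a million small files shift the pain to the NameNode.

```python
from pyspark import SparkContext

sc = SparkContext(appName="small-files-demo")

# A .zip is not a splittable input format: however many HDFS blocks it spans,
# a single map task has to read the whole archive while the rest of the
# cluster sits idle. If the million JPEGs are extracted first, binaryFiles()
# can read them in parallel as (path, bytes) pairs -- though a million tiny
# files then strain the NameNode, which tracks every file and block in memory.
images = sc.binaryFiles("hdfs:///data/images/")  # hypothetical path

print(images.mapValues(len).take(3))  # (file path, size in bytes)
```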
8. THE SUBSEQUENT MONTHS.
1. We copied data from Netezza to HIVE
2. We created reports from Tableau with HIVE ODBC
3. We created a copy of HIVE into HBASE
4. We have HDP, but Cloudera supports Impala
5. MapReduce is slow
6. All data is not at one place
7. May be some more tools are needed
8. We need a unified data architecture solution
9. Rebalancing took entire week
10. Important file types are not splittable
11. 3 copies is too much space
12. Cost of maintenance is high
13. We may need to go to cloud
14. SLA not met
15. Too much operational work
9. WHAT MAKES ORGANIZATION FAMOUS?
CTO wants AI, but AI is different from AI!!
AI: Autonomous systems which REPLACE the human cognitive thought process
AI (IA): Autonomous systems which SUPPORT the human cognitive thought process
[Diagram: Input + Algorithm → Output?  vs.  Input + Output → Algorithm?]
Both need machine learning and deep learning. These are means to do AI or IA.
OUTPUT MAY BE NEEDED INSTANTLY, BUT LEARNING IT MAY TAKE HOURS/DAYS/MONTHS
11. FILE SYSTEMS.
• The problem is the file system. Traditional block-based file systems use lookup tables to store file locations. They break each file up into small blocks, generally 4k in size, and store the byte offset of each block in a large table.
• This is fine for small volumes, but when you attempt to scale to the petabyte range, these lookup tables become extremely large. It's like a database: the more rows you insert, the slower your queries run. Eventually your performance degrades to the point where your file system becomes unusable.
• When this happens, users are forced to split their data sets up into multiple LUNs to maintain an acceptable level of performance. This adds complexity and makes these systems difficult to manage.
12. BLOCK BASED STORAGE SYSTEMS.
• To solve this problem, some organizations are deploying scale-out file systems, like HDFS. This fixes the scalability problem, but keeping these systems up and running is a labor-intensive process.
• Scale-out file systems are complex and require constant maintenance. In addition, most of them rely on replication to protect your data. The standard configuration is triple replication, where you store 3 copies of every file.
• This requires an extra 200% of raw disk capacity for overhead! Everyone thinks that they're saving money by using commodity drives, but by the time you store three full copies of your data set, the cost savings disappear. When we're talking about petabyte-scale applications, this is an expensive approach.
13. SOLUTION TO STORAGE.
• Object stores achieve their scalability by decoupling file management from the low-level block management. Each disk is formatted with a standard local file system, like ext4. Then a set of object storage services is layered on top of it, combining everything into a single, unified volume.
• Files are stored as "objects" in the object store rather than files on a file system. By offloading the low-level block management onto the local file systems, the object store only has to keep track of the high-level details.
• This layer of separation keeps the file lookup tables at a manageable size, allowing you to scale to hundreds of petabytes without experiencing degraded performance.
14. SOLUTION TO STORAGE.
• To maximize usable space, object stores use a technique called Erasure Coding to protect your data. You can think of it as the next generation of RAID.
• In an erasure-coded volume, files are divided into shards, with each shard being placed on a different disk. Additional shards are added, containing error-correction information, which provide protection from data corruption and disk failures. Only a subset of the shards is required to retrieve each file, which means the volume can survive multiple disk failures without the risk of data loss.
• Erasure-coded volumes can survive more disk failures than RAID and typically provide more than double the usable capacity of triple replication, making them the ideal choice for petabyte-scale storage.
15. MINIO - ERASURE CODING.
• EC is based on a technology called Forward Error Correction (FEC), developed more than 50 years ago (1940s - Richard Hamming). It was used originally for controlling errors in data transmission over noisy or unreliable telecommunication channels. Reed-Solomon codes are a kind of EC, used widely in CDs/DVDs, Blu-ray, satellite communication, etc.
• A message of k symbols can be transformed into a longer message (code word or parity) with n symbols such that the original message can be recovered from a subset of the n symbols. If n = k + 1, then there is a special case called a parity check.
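A toy sketch of the n = k + 1 parity-check special case described above, using XOR in place of real Reed-Solomon arithmetic (MinIO itself uses Reed-Solomon across shards):

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# k data shards plus 1 parity shard: the n = k + 1 parity-check special case.
data_shards = [b"shard-A!", b"shard-B!", b"shard-C!"]
parity = reduce(xor_bytes, data_shards)

# Lose any single shard ...
lost_index = 1
surviving = [s for i, s in enumerate(data_shards) if i != lost_index]

# ... and recover it by XOR-ing the parity with the survivors.
recovered = reduce(xor_bytes, surviving, parity)
assert recovered == data_shards[lost_index]
print(recovered)  # b'shard-B!'
```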
17. TOP 5 : COST.
• https://amzn.to/2Q7AWGo
• S3: $23 per TB per month ($12.5 per TB for cold access).
• HDFS: Using d2.8xl instance types ($5.52/hr with a 71% discount, 48TB HDD), it costs 5.52 x 0.29 x 24 x 30 / 48 x 3 / 0.7 = $103/month for 1TB of data. (Note that with reserved instances, it is possible to achieve a lower price on the d2 family.)
• S3 is 5x cheaper than HDFS.
• S3's human cost is virtually zero, whereas it usually takes a team of Hadoop engineers or vendor support to maintain HDFS. Once we factor in human cost, S3 is 10x cheaper than HDFS clusters on EC2 with comparable capacity.
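The slide's arithmetic, reproduced as a small calculation (the figures are the slide's own, not independently verified):

```python
# Reproduce the HDFS-on-EC2 cost arithmetic from the slide.
hourly_rate = 5.52        # d2.8xl on-demand $/hr
discount_factor = 0.29    # after the 71% discount
hours_per_month = 24 * 30
node_capacity_tb = 48     # 48 TB HDD per d2.8xl
replication = 3           # HDFS triple replication
utilization = 0.7         # usable fraction of raw capacity

cost_per_tb = (hourly_rate * discount_factor * hours_per_month
               / node_capacity_tb * replication / utilization)
print(f"HDFS: ${cost_per_tb:.0f}/TB/month vs S3: $23/TB/month")
# -> HDFS: $103/TB/month, roughly 5x the S3 price
```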
18. TOP 5 : ELASTICITY.
• From Databricks:
• 99.999999999% durability and 99.99% availability. Note that this is higher than the vast majority of organizations' in-house services.
• The majority of Hadoop clusters have availability lower than 99.9%, i.e. at least 9 hours of downtime per year.
• With cross-AZ replication that automatically replicates across different data centers, S3's availability and durability are far superior to HDFS'.
• Hortonworks - Data Plane Services in 2019!
19. TOP 5 : PERFORMANCE.
• When using HDFS and getting perfect data locality, it is possible to get ~3GB/node local read throughput on some of the instance types (e.g. i2.8xl, roughly 90MB/s per core). Spark DBIO, the cloud I/O optimization module, provides optimized connectors to S3 and can sustain ~600MB/s read throughput on i2.8xl (roughly 20MB/s per core).
• That is to say, on a per-node basis, HDFS can yield 6x higher read throughput than S3. Thus, given that S3 is 10x cheaper than HDFS, we find that S3 is almost 2x better compared to HDFS on performance per dollar.
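The same per-dollar reasoning as a back-of-the-envelope calculation, using the slide's own 6x and 10x figures:

```python
# Back-of-the-envelope "performance per dollar" from the slide's figures.
hdfs_read_advantage = 6   # HDFS yields ~6x higher per-node read throughput
s3_cost_advantage = 10    # S3 is ~10x cheaper once human cost is included

s3_perf_per_dollar = s3_cost_advantage / hdfs_read_advantage
print(f"S3 wins by ~{s3_perf_per_dollar:.1f}x on performance per dollar")
# -> ~1.7x, i.e. "almost 2x better" as the slide puts it
```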
20. TOP 5 : TRANSACTIONS.
• hadoop fs -mkdirs sample/a/b/c/
• Now you put the file into a/b/c.
• Buckets... not directories.
• In a Minio server instance, a single RESTful PUT request will create an object "a/b/c/data.txt" in "mybucket" without having to create "a/b/c" in advance.
• This happens because object stores support hierarchical naming and operations without the need for directories.
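A minimal sketch of that single-PUT behaviour using the MinIO Python SDK; the endpoint and credentials below are hypothetical defaults for a local test server:

```python
import io
from minio import Minio  # pip install minio

# Hypothetical local MinIO endpoint and default test credentials.
client = Minio("localhost:9000", access_key="minioadmin",
               secret_key="minioadmin", secure=False)

if not client.bucket_exists("mybucket"):
    client.make_bucket("mybucket")

# A single PUT creates "a/b/c/data.txt" -- no mkdir of a/, a/b/, or a/b/c/
# beforehand, because "a/b/c/" is just a key prefix, not a directory.
payload = b"hello object storage"
client.put_object("mybucket", "a/b/c/data.txt",
                  io.BytesIO(payload), length=len(payload))
```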
21. TOP 5 : TRANSACTIONS.
• Data movement is very interesting...
• What happens if a write in Spark (saveAsTextFile) fails for a partition?
• Rename is atomic - the most critical part of the Hadoop write flow.
• Minio (or any object store) does not provide an atomic rename. In fact, rename should be avoided in object storage altogether, since it consists of two separate operations: copy and delete.
• A normal COPY is mapped to a RESTful PUT request or RESTful COPY request and triggers internal data movements between storage nodes. The subsequent delete command maps to the RESTful DELETE request, but usually relies on the bucket listing operation to identify which data must be deleted. This makes a rename highly inefficient in object stores, and the lack of atomicity may leave data in a corrupted state.
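Continuing the hypothetical client from the previous sketch, the copy-then-delete pair that stands in for rename looks like this; note that nothing makes the two calls atomic:

```python
from minio import Minio
from minio.commonconfig import CopySource

client = Minio("localhost:9000", access_key="minioadmin",
               secret_key="minioadmin", secure=False)

# "Rename" in an object store is two separate requests; a failure between
# them leaves either both objects or neither, so the pair is not atomic.
client.copy_object("mybucket", "a/b/c/renamed.txt",
                   CopySource("mybucket", "a/b/c/data.txt"))
client.remove_object("mybucket", "a/b/c/data.txt")
```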
22. TOP 5 : TRANSACTIONS: PERFORMANCE.
• Version 1 moves staged task output files to their final locations at the end of the job; version 2 moves files as individual job tasks complete.
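In Spark, the committer version is a Hadoop configuration knob; here is a minimal sketch of selecting version 2 (the output bucket is hypothetical, and writing to s3a additionally requires the hadoop-aws jars):

```python
from pyspark.sql import SparkSession

# Committer algorithm version is a Hadoop conf; version 2 commits each task's
# output as the task finishes instead of in one serial move at job end. On an
# object store, where every "move" is a copy + delete, v1's final rename
# phase can dominate the job's runtime.
spark = (SparkSession.builder
         .appName("committer-demo")
         .config("spark.hadoop.mapreduce.fileoutputcommitter"
                 ".algorithm.version", "2")
         .getOrCreate())

# Hypothetical bucket; writing to s3a also needs the hadoop-aws jars.
spark.range(10_000).write.mode("overwrite").parquet("s3a://mybucket/out/")
```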
24. TOP 5 : DATA INTEGRITY - ELEGANT SOLUTION FROM SPARK.
• Version 2.1: https://docs.databricks.com/spark/latest/spark-sql/dbio-commit.html
28. DEMO TIME
1) MINIO INTEROPERABILITY WITH HADOOP - PUTTING AND GETTING DATA (see the sketch below)
2) MINIO INTEROPERABILITY WITH HIVE
3) MINIO WITH UNIFIED DATA ARCHITECTURE - PRESTO
4) MINIO WITH SPARK - FILES
5) MINIO WITH SPARK - OBJECTS
6) MINIO WITH SEARCH
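As a sketch of the setup behind demo 1, Spark can be pointed at a MinIO endpoint through the s3a connector; the endpoint, credentials, and bucket are hypothetical local-test defaults, and the hadoop-aws jars must be on the classpath:

```python
from pyspark.sql import SparkSession

# Point Spark's s3a connector at a local MinIO server instead of AWS S3.
spark = (SparkSession.builder
         .appName("minio-s3a-demo")
         .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000")
         .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
         .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
         .config("spark.hadoop.fs.s3a.path.style.access", "true")
         .getOrCreate())

spark.range(100).write.mode("overwrite").parquet("s3a://mybucket/demo/")
print(spark.read.parquet("s3a://mybucket/demo/").count())  # 100
```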