What are the design considerations that go into architecting a modern data warehouse? This presentation will cover some of the requirements analysis, design decisions, and execution challenges of building a modern data lake/data warehouse.
Learn the current state of the NoSQL landscape and discover the different data models within it, from document stores and key-value databases to graph and wide-column. Then you’ll learn why wide-column databases are the most appropriate for scalable, high-performance use cases, including capabilities for massive scale-out architecture, peer-to-peer clustering to avoid bottlenecks, and built-in multi-datacenter replication.
Architect’s Open-Source Guide for a Data Mesh Architecture - Databricks
Data Mesh is an innovative concept addressing many data challenges from an architectural, cultural, and organizational perspective. But is the world ready to implement Data Mesh?
In this session, we will review the importance of core Data Mesh principles, what they can offer, and when it is a good idea to try a Data Mesh architecture. We will discuss common challenges with implementing Data Mesh systems and focus on the role open-source projects play. Projects like Apache Spark can play a key part in implementing a standardized infrastructure platform for Data Mesh. We will examine the landscape of useful data engineering open-source projects to utilize in several areas of a Data Mesh system in practice, along with an architectural example. We will touch on what work (culture, tools, mindset) needs to be done to ensure Data Mesh is more accessible for engineers in the industry.
The audience will leave with a good understanding of the benefits of Data Mesh architecture, common challenges, and the role of Apache Spark and other open-source projects for its implementation in real systems.
This session is targeted at architects, decision-makers, data engineers, and system designers.
What is Talend | Talend Tutorial for Beginners | Talend Online Training | Edu... - Edureka!
(Talend Training: https://www.edureka.co/talend-for-big-data)
This Edureka video on What Is Talend will give you complete insight into what Talend actually is, its various products, and how it is used in the industry.
This video covers the following topics:
1. What Is Talend?
2. Evolution Of Talend
3. Talend Products
4. Use Cases
5. Demo
Subscribe to our channel to get video updates. Hit the subscribe button above and click the bell icon.
Every business today wants to leverage data to drive strategic initiatives with machine learning, data science and analytics — but runs into challenges from siloed teams, proprietary technologies and unreliable data.
That’s why enterprises are turning to the lakehouse because it offers a single platform to unify all your data, analytics and AI workloads.
Join our How to Build a Lakehouse technical training, where we’ll explore how to use Apache Spark™, Delta Lake, and other open source technologies to build a better lakehouse. This virtual session will include concepts, architectures and demos.
Here’s what you’ll learn in this 2-hour session:
How Delta Lake combines the best of data warehouses and data lakes for improved data reliability, performance and security
How to use Apache Spark and Delta Lake to perform ETL processing, manage late-arriving data, and repair corrupted data directly on your lakehouse
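As a flavor of what that looks like in practice, here is a minimal, hedged sketch of handling late-arriving records with a Delta Lake MERGE from PySpark; the paths and column names are hypothetical, and the Spark session is assumed to be configured with the delta-spark package.

```python
# Minimal sketch: upsert late-arriving events into a Delta table.
# Paths and column names are hypothetical; the session must have the
# Delta Lake extensions (delta-spark) configured.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("late-arrivals").getOrCreate()

late_events = spark.read.json("/landing/events/late/")      # hypothetical path
target = DeltaTable.forPath(spark, "/lakehouse/events")      # hypothetical path

(target.alias("t")
    .merge(late_events.alias("s"), "t.event_id = s.event_id")
    .whenMatchedUpdateAll()        # repair rows that were loaded with stale values
    .whenNotMatchedInsertAll()     # insert records that arrived late
    .execute())
```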
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard - Paris Data Engineers!
Delta Lake is an open source framework living on top of Parquet in your data lake to provide reliability and performance. It was open-sourced by Databricks this year and is gaining traction to become the de facto data lake format.
We’ll see all the good Delta Lake can do for your data, with ACID transactions, DDL operations, schema enforcement, batch and stream support, and more!
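To make those features concrete, a small, hedged sketch of Delta DDL and schema enforcement on Spark; the table name and location are hypothetical, and the session is assumed to have the Delta Lake extensions configured.

```python
# Minimal sketch: Delta table DDL plus schema enforcement.
# Table name and location are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS events (
        event_id STRING,
        ts TIMESTAMP,
        payload STRING
    ) USING DELTA LOCATION '/lakehouse/events'
""")

# Schema enforcement: an append with mismatched columns is rejected
bad_batch = spark.createDataFrame([("e1", "oops")], ["event_id", "not_a_column"])
try:
    bad_batch.write.format("delta").mode("append").saveAsTable("events")
except AnalysisException as err:
    print("Rejected by schema enforcement:", err)
```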
Build Real-Time Applications with Databricks Streaming - Databricks
In this presentation, we will study a use case we implemented recently, in which we are working with a large metropolitan fire department. Our company has already created a complete analytics architecture for the department based upon Azure Data Factory, Databricks, Delta Lake, Azure SQL, and SQL Server Analysis Services (SSAS). While this architecture works very well for the department, they would like to add a real-time channel to their reporting infrastructure.
This channel should serve up the following information:
• The most up-to-date locations and status of equipment (fire trucks, ambulances, ladders, etc.)
• The current locations and status of firefighters, EMT personnel and other relevant fire department employees
• The current list of active incidents within the city
The above information should be visualized through an automatically updating dashboard. The central component of the dashboard will be a map that automatically updates with the locations and incidents. This view should be as close to real time as possible and will be used by the fire chiefs to assist with real-time decision-making on resource and equipment deployments.
In this presentation, we will leverage Databricks, Spark Structured Streaming, Delta Lake and the Azure platform to create this real-time delivery channel.
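A minimal, hedged sketch of the kind of Structured Streaming job behind such a channel, assuming a Kafka-compatible telemetry feed; the broker, topic, schema, and paths are hypothetical.

```python
# Minimal sketch: stream unit telemetry into a Delta table the dashboard reads.
# Broker, topic, schema, and paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("fd-realtime").getOrCreate()

schema = (StructType()
          .add("unit_id", StringType())
          .add("status", StringType())
          .add("lat", DoubleType())
          .add("lon", DoubleType())
          .add("ts", TimestampType()))

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9093")
       .option("subscribe", "unit-telemetry")
       .load())

positions = (raw.selectExpr("CAST(value AS STRING) AS json")
             .select(from_json(col("json"), schema).alias("d"))
             .select("d.*"))

(positions.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/chk/unit-telemetry")
    .start("/lakehouse/unit_positions"))
```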
Data Lakehouse, Data Mesh, and Data Fabric (r1) - James Serra
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. I’ll include use cases so you can see what approach will work best for your big data needs.
Modern Data Warehousing with the Microsoft Analytics Platform System - James Serra
The traditional data warehouse has served us well for many years, but new trends are causing it to break in four different ways: data growth, fast query expectations from users, non-relational/unstructured data, and cloud-born data. How can you prevent this from happening? Enter the modern data warehouse, which is built to handle these new trends and excel at them. It handles all types of data (Hadoop), provides a way to easily interface with all of them (PolyBase), and can handle “big data” while providing fast queries. Is there one appliance that can support this modern data warehouse? Yes! It is the Analytics Platform System (APS) from Microsoft (formerly called Parallel Data Warehouse, or PDW), a Massively Parallel Processing (MPP) appliance that has recently been updated (v2 AU1). In this session I will dig into the details of the modern data warehouse and APS. I will give an overview of the APS hardware and software architecture, identify what makes APS different, and demonstrate the increased performance. In addition, I will discuss how Hadoop, HDInsight, and PolyBase fit into this new modern data warehouse.
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy... - Databricks
A traditional data team has roles including data engineer, data scientist, and data analyst. However, many organizations are finding success by integrating a new role – the analytics engineer. The analytics engineer develops a code-based data infrastructure that can serve both analytics and data science teams. He or she develops re-usable data models using the software engineering practices of version control and unit testing, and provides the critical domain expertise that ensures that data products are relevant and insightful. In this talk we’ll talk about the role and skill set of the analytics engineer, and discuss how dbt, an open source programming environment, empowers anyone with a SQL skillset to fulfill this new role on the data team. We’ll demonstrate how to use dbt to build version-controlled data models on top of Delta Lake, test both the code and our assumptions about the underlying data, and orchestrate complete data pipelines on Apache Spark™.
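dbt models are most commonly SQL files with Jinja; purely as a hedged illustration, here is a sketch of a dbt Python model (supported on the Spark/Databricks adapters), with hypothetical model and column names.

```python
# models/fct_daily_orders.py: minimal sketch of a dbt Python model.
# Model, upstream, and column names are hypothetical; requires a dbt adapter
# with Python model support (e.g. dbt-databricks / dbt-spark).
def model(dbt, session):
    dbt.config(materialized="table")
    orders = dbt.ref("stg_orders")        # version-controlled upstream model
    return (orders
            .where("order_status = 'complete'")
            .groupBy("order_date")
            .count())
```

Tests and documentation for such a model would normally live alongside it in a schema.yml file, which is where the version-control and testing workflow described above comes in.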
Growing the Delta Ecosystem to Rust and Python with Delta-RS - Databricks
In this session we will introduce the delta-rs project which is helping bring the power of Delta Lake outside of the Spark ecosystem. By providing a foundational Delta Lake library in Rust, delta-rs can enable native bindings in Python, Ruby, Golang, and more. We will review what functionality delta-rs supports in its current Rust and Python APIs and the upcoming roadmap.
We will also give an overview of one of the first projects to use it in production: kafka-delta-ingest, which builds on delta-rs to provide a high throughput service to bring data from Kafka into Delta Lake.
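A minimal sketch of what the Python bindings (the deltalake package) look like when reading and writing a Delta table without Spark; the table path is hypothetical.

```python
# Minimal sketch: write and read a Delta table with delta-rs (no Spark needed).
# The table path is hypothetical.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({"event_id": ["a", "b"], "value": [1, 2]})
write_deltalake("/lakehouse/events_py", df, mode="append")

table = DeltaTable("/lakehouse/events_py")
print(table.version())           # current table version
print(table.to_pandas().head())  # read back as a pandas DataFrame
```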
Data Lakehouse Symposium | Day 1 | Part 2 - Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Delta Lake delivers reliability, security and performance to data lakes. Join this session to learn how customers have achieved 48x faster data processing, leading to 50% faster time to insight after implementing Delta Lake. You’ll also learn how Delta Lake provides the perfect foundation for a cost-effective, highly scalable lakehouse architecture.
Snowflake: The most cost-effective agile and scalable data warehouse ever! - Visual_BI
In this webinar, the presenter will take you through the most revolutionary data warehouse, Snowflake, with a live demo and technical and functional discussions with a customer. Ryan Goltz from Chesapeake Energy and Tristan Handy, creator of dbt Cloud and owner of Fishtown Analytics, will also be joining the webinar.
Relational databases vs Non-relational databases - James Serra
There is a lot of confusion about the place and purpose of the many recent non-relational database solutions ("NoSQL databases") compared to the relational database solutions that have been around for so many years. In this presentation I will first clarify what exactly these database solutions are, compare them, and discuss the best use cases for each. I'll discuss topics involving OLTP, scaling, data warehousing, polyglot persistence, and the CAP theorem. We will even touch on a new type of database solution called NewSQL. If you are building a new solution it is important to understand all your options so you take the right path to success.
Large Scale Lakehouse Implementation Using Structured Streaming - Databricks
Business leads, executives, analysts, and data scientists rely on up-to-date information to make business decisions, adjust to the market, meet the needs of their customers, or run effective supply chain operations.
Come hear how Asurion used Delta, Structured Streaming, Auto Loader, and SQL Analytics to improve production data latency from day-minus-one to near real time. Asurion’s technical team will share battle-tested tips and tricks you only get at a certain scale. Asurion’s data lake executes 4,000+ streaming jobs and hosts over 4,000 tables in its production data lake on AWS.
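For context, a minimal, hedged sketch of the Auto Loader pattern referenced above; the cloudFiles source is Databricks-specific, and the paths, formats, and table names are hypothetical.

```python
# Minimal sketch: Databricks Auto Loader streaming raw JSON into a Delta table.
# Paths, format, and table names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # provided by the Databricks runtime in practice

stream = (spark.readStream.format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.schemaLocation", "/chk/claims/_schema")
          .load("s3://raw-bucket/claims/"))

(stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/chk/claims")
    .trigger(availableNow=True)
    .toTable("bronze.claims"))
```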
Data Lakehouse, Data Mesh, and Data Fabric (r2) - James Serra
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a modern data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. They all may sound great in theory, but I'll dig into the concerns you need to be aware of before taking the plunge. I’ll also include use cases so you can see what approach will work best for your big data needs. And I'll discuss Microsoft's version of the data mesh.
In this Knolx session, we will get to know Delta Lake and its features. Delta Lake is one of the greatest innovations by Databricks, making existing data lakes more scalable and reliable. It is an open source storage layer that brings reliability to data lakes, providing ACID transactions, scalable metadata handling, and unified streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021 - StreamNative
You may be familiar with the Presto plugin used to run fast interactive queries over Pulsar using ANSI SQL, with the ability to join against other data sources. This plugin will soon get a rename to align with the rename of the PrestoSQL project to Trino. What is the purpose of this rename and what does it mean for those using the Presto plugin? We cover the history of the community shift from PrestoDB to PrestoSQL, as well as the future plans for the Pulsar community to donate this plugin to the Trino project. One of the connector maintainers will then demo the connector and show what is possible when using Trino and Pulsar!
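As a hedged sketch of what querying Pulsar through Trino can look like from Python (host, credentials, and topic name are hypothetical; Pulsar SQL conventionally exposes topics as tables under a pulsar catalog, with tenant/namespace as the schema):

```python
# Minimal sketch: query a Pulsar topic through Trino with the `trino` client.
# Host, user, and topic name are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com",
    port=8080,
    user="analyst",
    catalog="pulsar",            # the Pulsar connector's catalog name
    schema="public/default",     # tenant/namespace
)
cur = conn.cursor()
cur.execute('SELECT * FROM "clicks" LIMIT 10')
for row in cur.fetchall():
    print(row)
```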
Apache Kafka® Use Cases for Financial Services - Confluent
Traditional systems were designed in an era that predates large-scale distributed systems. These systems often lack the ability to scale to meet the needs of the modern data-driven organisation. Adding to this is the accumulation of technologies and the explosion of data which can result in complex point-to-point integrations where data becomes siloed or separated across the enterprise.
The demand for fast results and decision-making has driven financial institutions to adopt real-time event streaming and data processing in order to stay on the competitive edge. Apache Kafka and the Confluent Platform are designed to solve the problems associated with traditional systems and provide a modern, distributed architecture and real-time data streaming capability. In addition, these technologies open up a range of use cases for financial services organisations, many of which will be explored in this talk.
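A minimal, hedged sketch of the core produce/consume pattern with the confluent-kafka Python client; the broker address, topic, and payload are hypothetical.

```python
# Minimal sketch: produce and consume payment events with confluent-kafka.
# Broker, topic, and group id are hypothetical.
import json
from confluent_kafka import Producer, Consumer

producer = Producer({"bootstrap.servers": "broker:9092"})
producer.produce("payments", key="acct-42",
                 value=json.dumps({"amount": 120.50, "currency": "GBP"}))
producer.flush()

consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "fraud-screening",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["payments"])
msg = consumer.poll(timeout=5.0)
if msg is not None and msg.error() is None:
    print(json.loads(msg.value()))
consumer.close()
```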
Step-1 Tableau Introduction
Step-2 Connecting to Data
Step-3 Building basic views
Step-4 Data manipulations and Calculated fields
Step-5 Tableau Dashboards
Step-6 Advanced Data Options
Step-7 Advanced graph Options
OSA Con 2022 - Apache Iceberg: An Architectural Look Under the Covers - Alex ... - Altinity Ltd
OSA Con 2022: Apache Iceberg: An Architectural Look Under the Covers
Alex Merced - Dremio
The data lakehouse is one of the most exciting trends in the data space promising to merge the best aspects of data lakes and data warehouses without either of their problems. Open source tech is making this promise a reality and in this talk Dremio Developer Advocate, Alex Merced, explores these technologies.
In this talk Alex Merced will cover:
- What is a Data Lakehouse?
- Why open matters in preserving the promise of lakehouses (better costs, vendor freedom, data freedom)
- What are technologies that enable lakehouses like Apache Iceberg, Apache Parquet, Apache Arrow and Project Nessie
Data Mesh is a new socio-technical approach to data architecture, first described by Zhamak Dehghani and popularised through a guest blog post on Martin Fowler's site.
Since then, community interest has grown, due to Data Mesh's ability to explain and address the frustrations that many organisations are experiencing as they try to get value from their data. The 2022 publication of Zhamak's book on Data Mesh further provoked conversation, as have the growing number of experience reports from companies that have put Data Mesh into practice.
So what's all the fuss about?
On one hand, Data Mesh is a new approach in the field of big data. On the other hand, Data Mesh is an application of the lessons we have learned from domain-driven design and microservices to a data context.
In this talk, Chris and Pablo will explain how Data Mesh relates to current thinking in software architecture and the historical development of data architecture philosophies. They will outline what benefits Data Mesh brings, what trade-offs it comes with and when organisations should and should not consider adopting it.
Migrating Your Oracle Database to PostgreSQL - AWS Online Tech Talks - Amazon Web Services
Learning Objectives:
- Learn about the capabilities of the PostgreSQL database
- Learn about PostgreSQL offerings on AWS
- Learn how to migrate from Oracle to PostgreSQL with minimal disruption
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o... - Daniel Zivkovic
Two #ModernDataStack talks and one DevOps talk: https://youtu.be/4R--iLnjCmU
1. "From Data-driven Business to Business-driven Data: Hands-on #DataModelling exercise" by Jacob Frackson of Montreal Analytics
2. "Trends in the #DataEngineering Consulting Landscape" by Nadji Bessa of Infostrux Solutions
3. "Building Secure #Serverless Delivery Pipelines on #GCP" by Ugo Udokporo of Google Cloud Canada
We ran out of time for the 4th presenter, so the event will CONTINUE in March... stay tuned! Compliments of #ServerlessTO.
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc... - Denodo
Watch full webinar here: https://bit.ly/3offv7G
Presented at AI Live APAC
Advanced data science techniques, like machine learning, have proven to be extremely useful tools for deriving valuable insights from existing data. Platforms like Spark, and complex libraries for R, Python, and Scala, put advanced techniques at the fingertips of data scientists. However, these data scientists spend most of their time looking for the right data and massaging it into a usable format. Data virtualization offers a new alternative to address these issues in a more efficient and agile way.
Watch this on-demand session to learn how companies can use data virtualization to:
- Create a logical architecture to make all enterprise data available for advanced analytics exercises
- Accelerate data acquisition and massaging, providing the data scientist with a powerful tool to complement their practice
- Integrate popular tools from the data science ecosystem: Spark, Python, Zeppelin, Jupyter, etc.
How Data Virtualization Puts Machine Learning into Production (APAC) - Denodo
Watch full webinar here: https://bit.ly/3mJJ4w9
Advanced data science techniques, like machine learning, have proven to be extremely useful tools for deriving valuable insights from existing data. Platforms like Spark, and complex libraries for R, Python, and Scala, put advanced techniques at the fingertips of data scientists. However, these data scientists spend most of their time looking for the right data and massaging it into a usable format. Data virtualization offers a new alternative to address these issues in a more efficient and agile way.
Attend this session to learn how companies can use data virtualization to:
- Create a logical architecture to make all enterprise data available for advanced analytics exercises
- Accelerate data acquisition and massaging, providing the data scientist with a powerful tool to complement their practice
- Integrate popular tools from the data science ecosystem: Spark, Python, Zeppelin, Jupyter, etc
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture - DATAVERSITY
Whether it's to take data ingestion cycles off the ETL tool and the data warehouse, or to facilitate competitive data science and algorithm building in the organization, the data lake – a place for vast, unmodeled data – will be provisioned widely in 2020.
Though it doesn’t have to be complicated, the data lake has a few key design points that are critical, and it does need to follow some principles for success. Avoid building the data swamp; don't avoid the data lake! The tool ecosystem is building up around the data lake and soon many will have a robust lake and data warehouse. We will discuss policy to keep them straight, send data to its best platform, and keep users’ confidence in their data platforms high.
Data lakes will be built in cloud object storage. We’ll discuss the options there as well.
Get this data point for your data lake journey.
Data Engineer's Lunch #60: Series - Developing Enterprise Consciousness - Anant Corporation
In Data Engineer's Lunch #60, Rahul Singh, CEO here at Anant, will discuss modern data processing/pipeline approaches.
Want to learn about modern data engineering patterns & practices for global data platforms? A high-level overview of different types, frameworks, and workflows in data processing and pipeline design.
Bridging the Last Mile: Getting Data to the People Who Need It (APAC) - Denodo
Watch full webinar here: https://bit.ly/34iCruM
Many organizations are embarking on strategically important journeys to embrace data and analytics. The goal can be to improve internal efficiencies, improve the customer experience, drive new business models and revenue streams, or – in the public sector – provide better services. All of these goals require empowering employees to act on data and analytics and to make data-driven decisions. However, getting data – the right data at the right time – to these employees is a huge challenge and traditional technologies and data architectures are simply not up to this task. This webinar will look at how organizations are using Data Virtualization to quickly and efficiently get data to the people that need it.
Attend this session to learn:
- The challenges organizations face when trying to get data to the business users in a timely manner
- How Data Virtualization can accelerate time-to-value for an organization’s data assets
- Examples of leading companies that used data virtualization to get the right data to the users at the right time
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC) - Denodo
Watch full webinar here: https://bit.ly/3aePFcF
Historically data lakes have been created as a centralized physical data storage platform for data scientists to analyze data. But lately the explosion of big data, data privacy rules, and departmental restrictions, among many other things, have made the centralized data repository approach less feasible. In this webinar, we will discuss why decentralized multipurpose data lakes are the future of data analysis for a broad range of business users.
Attend this session to learn:
- The restrictions of physical single purpose data lakes
- How to build a logical multi purpose data lake for business users
- The newer use cases that makes multi purpose data lakes a necessity
Doing Analytics Right - Building the Analytics Environment - Tasktop
Implementing analytics for development processes is challenging. As discussed in the previous webinars, the right analytics are determined by the goals of the organization, not by the available data. So implementing your analytics solutions will require an efficient analytics and data architecture, including the ability to combine and stage data from heterogeneous sources. An architecture that excludes the ability to gain access to the necessary data will create a barrier to deploying your newly designed analytics program, and will force you back into the “light is brighter here” anti-pattern.
This webinar will describe the technical considerations of implementing the data architecture for your analytics program, and explain how Tasktop can help.
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization - Denodo
Watch here: https://bit.ly/2NGQD7R
In an era increasingly dominated by advancements in cloud computing, AI and advanced analytics it may come as a shock that many organizations still rely on data architectures built before the turn of the century. But that scenario is rapidly changing with the increasing adoption of real-time data virtualization - a paradigm shift in the approach that organizations take towards accessing, integrating, and provisioning data required to meet business goals.
As data analytics and data-driven intelligence takes centre stage in today’s digital economy, logical data integration across the widest variety of data sources, with proper security and governance structure in place has become mission-critical.
Attend this session to learn:
- How you can meet cloud and data science challenges with data virtualization
- Why data virtualization is increasingly finding enterprise-wide adoption
- How customers are reducing costs and improving ROI with data virtualization
Using Data Platforms That Are Fit-For-Purpose - DATAVERSITY
We must grow the data capabilities of our organization to fully deal with the many and varied forms of data. This cannot be accomplished without an intense focus on the many and growing technical bases that can be used to store, view, and manage data. There are many, now more than ever, that have merit in organizations today.
This session sorts out the valuable data stores, how they work, what workloads they are good for, and how to build the data foundation for a modern competitive enterprise.
Guest speaker in the 2nd national-level webinar titled "Big Data Driven Solutions to Combat Covid 19" on 4th July 2020, Ethiraj College for Women (Auto), Chennai.
Build User-Facing Analytics Application That Scales Using StarRocks (DLH) - Albert Wong
This article isn't just about building an application, it's about empowering your users to unleash the full potential of their data. We'll delve into the world of StarRocks, a cutting-edge engine for real-time analytics, and guide you through the process of constructing user-facing applications that deliver fast, insightful, and visually captivating experiences.
Dive into the Why:
First, we'll explore the compelling reasons to choose StarRocks. We'll demystify its ability to handle high-velocity data streams and deliver sub-second query responses, making it ideal for applications where real-time insights are king.
Laying the Foundation:
Next, we'll embark on a step-by-step journey through the application development process. We'll discuss:
Data Ingestion: Learn how to seamlessly integrate real-time data sources like Kafka and Pulsar into your application, ensuring a constant flow of fresh insights.
Building the Core: We'll explore various options for structuring your application's backend, showcasing the pros and cons of different frameworks and libraries.
Querying Made Easy: Delve into StarRocks' intuitive SQL interface and discover how to craft powerful queries that unlock the treasure trove of data hiding within.
Visualization Wonderland: Explore the vast landscape of data visualization tools compatible with StarRocks. We'll showcase ways to transform complex data into interactive dashboards, compelling charts, and captivating reports that resonate with your users.
Beyond the Basics:
We'll push the boundaries by exploring advanced features like:
Materialized Views: Discover how to pre-compute data summaries for even faster querying.
Security and Access Control: Learn how to implement robust security measures to protect sensitive data while granting appropriate access levels to different user groups.
Customization and Scalability: Explore options for tailoring the application to your specific needs and ensuring it can gracefully handle growing data volumes and user traffic.
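Tying the ingestion, querying, and materialized-view pieces above together, here is a minimal, hedged sketch; StarRocks speaks the MySQL wire protocol, so a standard client such as pymysql works, and the host, credentials, and table names below are hypothetical.

```python
# Minimal sketch: create and query an async materialized view in StarRocks
# over the MySQL protocol. Connection details and table names are hypothetical.
import pymysql

conn = pymysql.connect(host="starrocks-fe.example.com", port=9030,
                       user="analyst", password="***", database="analytics")
with conn.cursor() as cur:
    cur.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS daily_clicks
        REFRESH ASYNC
        AS SELECT dt, count(*) AS clicks
           FROM click_events
           GROUP BY dt
    """)
    cur.execute("SELECT dt, clicks FROM daily_clicks ORDER BY dt DESC LIMIT 7")
    for row in cur.fetchall():
        print(row)
conn.close()
```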
Conclusion:
By the end of this article, you'll be equipped with the knowledge and tools to build user-facing analytics applications that empower your users to harness the power of real-time data and make confident, data-driven decisions. This journey isn't just about building an application, it's about transforming how your users interact with data, unlocking a world of actionable insights and limitless possibilities.
So, are you ready to conquer complexity and unleash the power of StarRocks? Dive into this article and start building your next game-changing user-facing analytics application!
TidalScale has created a software-defined computer.
At TidalScale, we have created a simple, cost-effective way for a data scientist, an analyst, an engineer, a scientist, a database administrator, or a software developer to access a group of servers through a single operating system instance as if it were a single supercomputer. This dramatically simplifies development while reducing software scaling complexity, not to mention delivering dramatic cost savings in hardware and software.
We configure hosted hardware into one or more TidalPods. Each TidalPod is a virtual supercomputer comprising a set of commodity servers configured with the TidalScale HyperKernel. What the user sees is standard Linux, FreeBSD or Windows running with the sum of all memory, processors, networks, and I/O. The secret sauce is the HyperKernel that fools the guest OS into thinking it’s running directly on a huge, expensive machine when in fact it’s running on a set of smaller, less expensive servers.
We offer an incredibly simple user experience.
• Define the computer size you want (number of CPUs, amount of memory), boot the virtual machine, then log in to the computer…
Thus, we enable a simple, cost-effective way for a data scientist, an analyst, an engineer, a scientist, a database administrator, or a software developer to access a group of servers in a datacenter through a single operating system instance as if it were a single supercomputer. This dramatically simplifies development while reducing software scaling complexity, not to mention delivering dramatic cost savings in hardware and software.
Bridging the Last Mile: Getting Data to the People Who Need It - Denodo
Watch full webinar here: https://bit.ly/3cUA0Qi
Many organizations are embarking on strategically important journeys to embrace data and analytics. The goal can be to improve internal efficiencies, improve the customer experience, drive new business models and revenue streams, or – in the public sector – provide better services. All of these goals require empowering employees to act on data and analytics and to make data-driven decisions. However, getting data – the right data at the right time – to these employees is a huge challenge and traditional technologies and data architectures are simply not up to this task. This webinar will look at how organizations are using Data Virtualization to quickly and efficiently get data to the people that need it.
Attend this session to learn:
- The challenges organizations face when trying to get data to the business users in a timely manner
- How Data Virtualization can accelerate time-to-value for an organization’s data assets
- Examples of leading companies that used data virtualization to get the right data to the users at the right time
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,... - Mihai Criveti
Automate your Data Science pipeline with Ansible, Python and Kubernetes - ODSC Talk
What is Data Science and the Data Science Landscape
Process and Flow
Understanding Data
The Data Science Toolkit
The Big Data Challenge
Cloud Computing Solutions
The rise of DevOps in Data Science
Automate your data pipeline with Ansible
Similar to Data Engineer's Lunch #85: Designing a Modern Data Stack
QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) | Cassandra Lunch 137 - Anant Corporation
Discussion of LLM fine-tuning with an overview of fine-tuning types and datasets: specifically, we will talk about the method that we used to turn an existing collection of Cassandra information into a set of instructions and responses that we can use for fine-tuning.
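The exact pipeline is covered in the talk; as a generic, hedged sketch of the instruction/response format such a dataset typically ends up in (the field names and content below are illustrative only, not the talk's exact method):

```python
# Minimal sketch: write instruction/response pairs to a JSONL file for fine-tuning.
# The pairs and field names are illustrative, not the method described in the talk.
import json

pairs = [
    {"instruction": "How do I set the replication factor for a Cassandra keyspace?",
     "response": "Use CREATE KEYSPACE ... WITH replication = {'class': 'NetworkTopologyStrategy', ...}."},
    {"instruction": "What does nodetool repair do?",
     "response": "It synchronizes data across replicas to fix inconsistencies."},
]

with open("cassandra_finetune.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")
```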
What's AGI? How is it different from an Agent or an AI Assistant? If you're looking to understand how AI Agents/AGI can help your company, check this out.
Data Engineer's Lunch 96: Intro to Real Time Analytics Using Apache Pinot - Anant Corporation
In this meetup, we will introduce the concepts of real-time analytics, why it is important, the evolution of analytics, and how companies such as LinkedIn, Stripe, Uber, and more are using real-time analytics with Apache Pinot to grow their audience and improve usability. What is Apache Pinot? Followed by a demo and Q&A.
NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval... - Anant Corporation
Series: Using AI / ChatGPT at Work - GPT Automation
Are you a small business owner or web developer interested in leveraging the power of GPT (Generative Pretrained Transformer) technology to enhance your business processes? If so, join us for a series of events focused on using GPT in business. Whether you're a small business owner or a web developer, you'll learn how to leverage GPT to improve your workflow and provide better services to your customers.
GPT Automation: What it is and How it Works
How Time-Saving GPT Automation Can Improve Your Business
Cost-Effective GPT Automation: How it Can Save Your Business Money
Using GPT Automation for Customer Service: Benefits and Best Practices
The Power of GPT Automation for Content Creation
Data Analysis Made Easy with GPT Automation
Top GPT-3 Automation Tools for Businesses
The Ethical Considerations of GPT Automation
Overcoming Bias in GPT Automation: Best Practices
The Future of GPT Automation: Trends and Predictions
Since we focus on "no code" here, we'll explore the tools that are already out there such as ChatGPT plugins for Chrome, OpenAI GPT API, low-code/no-code platforms like Make/Integromat and Zapier, existing apps like Jasper/Rytr, and ecosystem tools like Everyprompt. We'll also discuss the resources available for those interested in learning more about GPT, including other people’s prompts.
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT - Anant Corporation
Episode 3: The LLM / GPT / AI Prompt / Data Engineer Roadmap
In this episode, we'll discuss the history, fundamentals, and different flavors of LLMs available beyond GPT/ChatGPT. This is a dry run of a session that will be part of an LLM Bootcamp (fill out the survey on the link if you are interested in an in-person vs. virtual session).
Intro / Fundamentals of LLM
LLM Foundations
History of LLMs
Tuning, Training, or "In Context Learning" with LLMs
What is "Prompt Engineering"
Case for Augmenting LLMs
In Apache Cassandra Lunch #131: YugabyteDB Developer Tools, we discussed third party developer tools that are compatible with YugabyteDB. We talked about using Yugabyte Developer Tools for data visualization and schema management. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST.
Developer tools play a critical role in simplifying and streamlining database development and management. They allow developers and administrators to be more productive, reducing the time and effort required to create and maintain database schemas, write SQL queries, test database performance, and enable collaboration. Developer tools also make it possible to track changes over time, improving the ability to manage the entire development lifecycle.
Episode 2: The LLM / GPT / AI Prompt / Data Engineer Roadmap - Anant Corporation
In this episode we'll discuss the different flavors of prompt engineering in the LLM/GPT space. Depending on your skill level, you should be able to pick up at any of the following:
Leveling up with GPT
1: Use ChatGPT / GPT Powered Apps
2: Become a Prompt Engineer on ChatGPT/GPT
3: Use GPT API with NoCode Automation, App Builders
4: Create Workflows to Automate Tasks with NoCode
5: Use GPT API with Code, make your own APIs
6: Create Workflows to Automate Tasks with Code
7: Use GPT API with your Data / a Framework
8: Use GPT API with your Data / a Framework to Make your own APIs
9: Create Workflows to Automate Tasks with your Data /a Framework
10: Use Another LLM API other than GPT (Cohere, HuggingFace)
11: Use open source LLM models on your computer
12: Finetune / Build your own models
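As a concrete illustration of step 5 above ("Use GPT API with Code"), a minimal, hedged sketch with the OpenAI Python SDK; the model name and prompt are illustrative.

```python
# Minimal sketch: call the OpenAI chat completions API from Python.
# Reads OPENAI_API_KEY from the environment; model and prompt are illustrative.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful business assistant."},
        {"role": "user", "content": "Draft a short follow-up email to a customer."},
    ],
)
print(response.choices[0].message.content)
```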
In Data Engineer’s Lunch #89: Machine Learning Orchestration with Airflow, we discussed using Apache Airflow to manage and schedule machine learning tasks. By following the best practices of ML Ops, teams can streamline their ML workflows and build scalable, efficient, and accurate models that deliver real-world business value. Properly implemented ML Ops can help organizations stay ahead of the curve and achieve their goals in the fast-paced world of machine learning. Apache Airflow is an open-source tool for scheduling and automating workflows. Airflow allows you to define workflows in Python, with tasks defined as Python functions that can include Operators for all sorts of external tools. This makes it easy to automate repeated processes and define dependencies between tasks, creating directed acyclic graphs (DAGs) of tasks that can be scheduled using cron syntax or at a fixed frequency. Airflow also features a user-friendly UI for monitoring task progress and viewing logs, giving you greater control over your data pipeline.
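A minimal, hedged sketch of such a DAG: two dependent Python tasks on a cron schedule. The task bodies, IDs, and schedule are illustrative.

```python
# Minimal sketch: an Airflow DAG with two dependent Python tasks on a cron schedule.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_features(**_):
    print("pull raw data and build features")

def train_model(**_):
    print("fit the model on the latest features")

with DAG(
    dag_id="ml_training",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",   # cron syntax: nightly at 02:00
    catchup=False,
) as dag:
    features = PythonOperator(task_id="extract_features",
                              python_callable=extract_features)
    train = PythonOperator(task_id="train_model",
                           python_callable=train_model)
    features >> train   # train only after the features task succeeds
```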
Cassandra Lunch 130: Recap of Cassandra Forward Talks - Anant Corporation
If you didn't attend, don't miss this much shorter synopsis of what was covered, along with some thoughts from us on why the talks are important. We'll cover the main topics of the event.
1. ACID transactions on Cassandra by Aaron Ploetz, Datastax
2. Apache Flink with Apache Cassandra by Satyajit Thadeswar, Netflix
3. Durable Execution built on Apache Cassandra by Loren Sands-Ramshaw, Temporal
4. Switching from Mongo to Cassandra with Mongoose & new Stargate JSON API, Valeri Karpov
5. Cloud Native and Realtime AI/ML with Patrick Mcfadin and Davor Boncaci, Datastax
Data Engineer's Lunch 90: Migrating SQL Data with Arcion - Anant Corporation
In Data Engineer's Lunch 90, Eric Ramseur teaches our audience how to use Arcion.
From best practices to real-world examples, this talk will provide you with the knowledge and insights you need to ensure a successful migration of your SQL data. So whether you're new to data migration or looking to improve your existing process, join us and discover how Arcion can help you achieve your goals.
Data Engineer's Lunch 89: Machine Learning Orchestration with Airflow - Anant Corporation
In Data Engineer's Lunch 89, Obioma Anomnachi will discuss how to manage and schedule Machine Learning operations via Airflow. Learn how you can write complete end-to-end pipelines starting with retrieving raw data to serving ML predictions to end-users, entirely in Airflow.
Data Engineer's Lunch #86: Building Real-Time Applications at Scale: A Case S... - Anant Corporation
As the demand for real-time data processing continues to grow, so do the challenges of building production-ready applications that can handle large volumes of data quickly. In this talk, we will explore common problems faced when building real-time applications at scale, with a focus on a specific use case: detecting and responding to cyclist crashes.
Using telemetry data collected from a fitness app, we'll demonstrate how we used a combination of Apache Kafka and Python-based microservices running on Kubernetes to build a pipeline for processing and analyzing this data in real time. We'll also discuss how we used machine learning techniques to build a model for detecting collisions, and how we implemented notifications to alert family members of a crash. Our ultimate goal is to help you navigate the challenges that come with building data-intensive, real-time applications that use ML models. By showcasing a real-world example, we aim to provide practical solutions and insights that you can apply to your own projects.
Key takeaways:
An understanding of the common challenges faced when building real-time applications at scale
Strategies for using Apache Kafka and Python-based microservices to process and analyze data in real-time
Tips for implementing machine learning models in a real-time application
Best practices for responding to and handling critical events in a real-time application
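To make the pipeline shape concrete, here is a hedged sketch of a Python microservice consuming telemetry from Kafka; the topic, broker address, and crash heuristic are illustrative stand-ins, not the model from the talk.

import json

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "cyclist-telemetry",                  # hypothetical topic name
    bootstrap_servers="localhost:9092",   # hypothetical broker
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

def looks_like_crash(event: dict) -> bool:
    # Stand-in for the ML model: flag hard deceleration down to near-zero speed.
    return event.get("decel_ms2", 0.0) > 8.0 and event.get("speed_ms", 1.0) < 0.5

for record in consumer:
    event = record.value
    if looks_like_crash(event):
        # In a real system this would call a notification service.
        print(f"Possible crash for rider {event.get('rider_id')}; alerting contacts")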
In Apache Cassandra Lunch #121: Migrating to Azure Managed Instance for Apache Cassandra, we discussed different methods for migrating data from existing Cassandra instances to Azure-hosted options.
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg - Anant Corporation
In this talk, Dremio Developer Advocate Alex Merced discusses strategies for migrating your existing data over to Apache Iceberg. He'll go over the following (a minimal migration sketch follows the list):
How to Migrate Hive, Delta Lake, JSON, and CSV sources to Apache Iceberg
Pros and Cons of an In-place or Shadow Migration
Migrating between Apache Iceberg catalogs (e.g., Hive/Glue to Arctic/Nessie)
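As a hedged sketch of the shadow-migration path, the PySpark snippet below copies an existing table into Iceberg and notes the in-place alternative; the catalog, database, and table names are assumptions, and an Iceberg-configured Spark session is presumed.

from pyspark.sql import SparkSession

# Assumes Spark is already configured with an Iceberg catalog named "demo".
spark = SparkSession.builder.appName("iceberg-migration").getOrCreate()

# Shadow migration: write a copy of the source table into a new Iceberg table,
# leaving the source intact so results can be validated before cutover.
spark.sql("""
    CREATE TABLE demo.db.events_iceberg
    USING iceberg
    AS SELECT * FROM hive_db.events
""")

# In-place alternative: Iceberg's migrate procedure converts the table's
# metadata without rewriting the underlying data files.
# spark.sql("CALL demo.system.migrate('hive_db.events')")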
Apache Cassandra Lunch 120: Apache Cassandra Monitoring Made Easy with AxonOps - Anant Corporation
In this lunch, Johnny will show us how easy it is to start monitoring your Cassandra cluster in minutes. He will explain the various aspects and features of Cassandra that need to be monitored, how to do it, and most importantly why! Approaches for backups and Cassandra repairs will be discussed and explored in detail.
Learn how AxonOps significantly reduces the complexity and overhead when looking after Cassandra and ensures your Cassandra cluster is reliable and resilient.
Experienced developer, DevOps, architect, and AxonOps co-founder, Johnny Miller, has worked with a wide variety of companies – from small start-ups to large enterprises. He has been working with Cassandra for many years and has a deep understanding of the challenges facing modern companies looking to adopt Apache Cassandra.
In Apache Cassandra Lunch #119, Rahul Singh will give a refresher on GUI desktop and web tools for users who want to get their hands dirty with Cassandra but don't want to deal with CQLSH for simple queries. Some of the tools are web-based and others are installed on your desktop. A lot has changed since the early days of Cassandra, and there are now many options for command-line-haters to use it.
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms - Anant Corporation
During this lunch, we’ll review open-source reverse ETL tools to uncover how to send data back to SaaS systems.
Data Engineer’s Lunch #67: Machine Learning - Feature Selection - Anant Corporation
In Data Engineer’s Lunch #67: Machine Learning - Feature Selection, we discussed the process of selecting particular, relevant features from a wider data set to be used for model training.
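For a concrete flavor of the idea, here is a minimal scikit-learn sketch of univariate feature selection; the dataset and the choice of k are illustrative.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

# Keep the 10 features with the strongest ANOVA F-statistic against the target.
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)  # (569, 30) -> (569, 10)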
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... - Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It does, however, come with the precondition that the input graph contain no dead ends. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Opendatabay - Open Data Marketplace.pptx - Opendatabay
Opendatabay.com unlocks the power of data for everyone. The Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
It is the first open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. It leverages cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay also breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details, visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try to reduce the work per iteration, and the other is to try to reduce the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which have the same in-links, helps reduce duplicate computations and thus could reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
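As a rough sketch of just one of these techniques, skipping vertices that have already converged, consider the Python implementation below; the graph representation, tolerance, and example graph are illustrative, it assumes no dangling nodes, and it is a simple heuristic rather than the full STICD algorithm.

def pagerank_skip_converged(out_links, n, d=0.85, tol=1e-10, max_iter=100):
    # Build the reverse adjacency list so each vertex can sum its in-neighbors.
    in_links = [[] for _ in range(n)]
    for u, vs in enumerate(out_links):
        for v in vs:
            in_links[v].append(u)
    rank = [1.0 / n] * n
    converged = [False] * n
    for _ in range(max_iter):
        new_rank = rank[:]
        active = 0
        for v in range(n):
            if converged[v]:
                continue  # skip vertices whose rank has stabilized (heuristic)
            active += 1
            s = sum(rank[u] / len(out_links[u]) for u in in_links[v])
            new_rank[v] = (1 - d) / n + d * s
            if abs(new_rank[v] - rank[v]) < tol:
                converged[v] = True
        rank = new_rank
        if active == 0:
            break
    return rank

# Example: a 3-vertex cycle, where all ranks should converge to 1/3.
print(pagerank_skip_converged([[1], [2], [0]], 3))

Note that the skip rule trades a little accuracy for speed: a "converged" vertex is frozen even if its in-neighbors' ranks later drift, which is exactly the kind of tension between work per iteration and iteration count described above.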
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Adjusting primitives for graph: SHORT REPORT / NOTES - Subhajit Sahu
These notes cover adjusting primitives for graph algorithms such as PageRank, using Compressed Sparse Row (CSR), an adjacency-list based graph representation. The experiments compare:
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
4. What is a regular software stack?
A software “stack” is the set of software or software components needed to run an application.
Notable examples:
● LAMP
  ○ Linux
  ○ Apache
  ○ MySQL
  ○ PHP
● MERN
  ○ MongoDB
  ○ Express.js
  ○ React.js
  ○ Node.js
5. Are data stacks just regular software stacks?
Yes and no.
Data engineering is a specialty within software engineering, and everything is software running on computers at the end of the day, so yes, data stacks are software stacks.
But there are notable differences that are worth addressing, especially because every data tool company wants to market their tool as part of the “Modern Data Stack”.
6. What is Modern about the “Modern” data stack?
Four major trends make the ‘modern data stack’ make sense:
1. Modern cloud platforms.
2. Column-store data warehouses.
3. Cost of disk trending to zero.
4. Proliferation of managed data tools.
7. Defining Characteristics of the Modern Data Stack
1. Cloud & SQL based: a column-store based cloud data warehouse at the center
  ○ With an optional file / object store based data lake.
2. Modular: managed SaaS tools for almost every part of the data lifecycle.
  ○ Optional: run open source components and write your own integrations.
8. What is so special about cloud data warehouses?
Modern column-store data warehouses running on a cloud computing platform have some great benefits for building data-intensive applications:
● Flexible & scalable pay-as-you-go compute:
  ○ No upfront hardware or major purchases required.
  ○ No outgrowing your data center at awkward times.
● Managed services:
  ○ Running your own infrastructure reliably and effectively is hard, so paying a cloud computing company to do it for you is usually a great deal.
  ○ Allows data teams to move quickly without needing as much specialized operational experience.
9. The cost of storage
Cost per GB has fallen ~100,000x since the mid 90s. The cellphone in your pocket has more storage and processing power than a Cray-2 supercomputer from the mid 80s. The Big Data Revolution is mostly driven by this trend.
11. Data Solution design process
1. Determine desired capabilities & design constraints.
2. Create an iteration plan.
3. Execute the plan.
4. Evaluate the delivered data solution.
5. Return to step 1.
Same as the OODA (Observe, Orient, Decide, Act) / PDCA (Plan, Do, Check, Act) frameworks. Iteration cycle scale and length can range from minutes to years (I recommend shorter and smaller).
12. Step 1. Problem Definition
The first step in developing a solution is to identify the problem. This step can include:
● Requirements gathering
● Software vision documentation
● User research & interviews
● Industry research
● Documentation
● More documentation…
13. Step 2. Create an iteration plan
Create a plan to deliver a working system that has the capabilities to solve all of the necessary problems. This can include:
● System design diagrams & documents
● Jira tickets and a work breakdown structure
● Doodles on a napkin
14. Step 3. Execute the plan
Once you have a plan that looks good enough, build the thing! This should include:
● Software development
● Software development to improve the software development process
● Procurement: buying off-the-shelf tools
● Testing: systems integration & technical tests
● Testing: user / client demos
15. Step 4. Evaluate
After developing a functional data solution, it is important to evaluate whether you did an acceptable job. This includes:
● Requirements review: does the data solution meet the requirements?
● Capability value: do the data solution’s new capabilities actually provide value?
● Identifying future improvement opportunities
● Identifying future development process improvement opportunities
16. Step 5. Repeat the cycle
Data platform development is an iterative process, and much of the value depends on the end users: unused data is worthless, so if the developed system goes unused, it usually won’t have been worth building.
Iteration is a great way to discover unknown requirements and opportunities, and to work with the end users of data to build good data systems that help cultivate a vibrant ecosystem.
18. The Modern Data Stack for Business Intelligence
Core Components:
1. Storage: Cloud Data Warehouse (Snowflake, Redshift, BigQuery)
2. Ingestion: Managed ETL (Stitch, Fivetran)
3. Transformation: dbt / SQL
4. Visualization: BI tool of choice
19. Auxiliary Components
You’ll also want:
● Data Observability: tools like Monte Carlo & BigEye
● Data Cataloging: tools like Castor or Alation
● Systems Observability: ELK / Prometheus & Grafana
A modern data platform is a large distributed system with numerous third-party vendors and constantly changing API integrations. Treat it with respect or it will break on you.
21. High Level Design - Personal Data Warehouse
Primary design constraints:
1. Low cost
2. Low maintenance
3. Data variety: lots of unstructured data
Notable freeing design characteristics:
1. Low velocity: weekly update maximum for most bulk sources
2. Low volume: ~1-5 GB per source per update for a full refresh
3. Low user count: single user (me)
Resulting architecture:
1. Raw storage in Google Cloud Storage
2. Data transformation pipelines in Dataflow (managed Apache Beam)
3. BigQuery data warehouse for relational data
4. Looker Studio (formerly Google Data Studio) for BI
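To illustrate the shape of such a pipeline, here is a hedged Apache Beam (Python SDK) sketch that reads raw JSON from Cloud Storage and loads it into BigQuery; the project, bucket, table, and schema are assumptions rather than the actual configuration of this warehouse.

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DirectRunner",            # swap for "DataflowRunner" to run managed
    project="my-gcp-project",         # hypothetical project
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadRaw" >> beam.io.ReadFromText("gs://my-bucket/raw/*.json")
        | "Parse" >> beam.Map(json.loads)
        | "Shape" >> beam.Map(lambda r: {"id": str(r.get("id")), "payload": json.dumps(r)})
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my-gcp-project:warehouse.events",   # hypothetical table
            schema="id:STRING,payload:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )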
25. Caveats:
1. Modern Data Stack – like many other terms – is mostly a marketing term / fad.
2. The major components of modern data stacks have sharp edges:
  a. Costs can quickly spiral out of control if data access is overly democratic.
  b. Powerful configuration options: updates to data pipelines are easier to make, not necessarily more correct.
3. There are still huge opportunities for tooling improvements:
  a. The last ~10 years have seen a huge unbundling of data tools and new ‘best in breed’ SaaS providers.
    i. Integrating all these components into a cohesive platform is a lot of work, so we will see bundled all-in-one data platforms become increasingly competitive.
  b. Metadata / data cataloging tools need improvement to support better data management.
26. The best data stack is the one that works best for you.
● Data Stack Design is system design
  ○ The best systems are those that provide the desired capabilities.
    ■ Actually think about what the design goals of your data stack are.
● Data Stack Development is iterative
  ○ Sometimes everyone will be happiest with a simple solution like a cron job querying the production database (preferably a replica).
    ■ This can work well for years.
    ■ This can also turn into a hot mess operationally and require urgent replacement with a better solution.
  ○ Finding an optimal balance between planning and learning is hard.
    ■ Finding a close enough to optimal balance is feasible.
27. Thank you!
Have any data problems? I’m looking for new Data Engineering / Technical Product Manager roles.
Email: Will@williamangel.net
Website: www.williamangel.net | www.d8aeng.com
Twitter: @DataDrivenAngel
LinkedIn: https://www.linkedin.com/in/william-angel/