Video: https://www.youtube.com/watch?v=P5Q5GZheAEk
Apache Hive is an open-source relational database system that is widely adopted by many organizations for big data analytics workloads. It combines traditional MPP (massively parallel processing) techniques with more recent cloud computing concepts to achieve the increased scalability and high performance needed by modern data-intensive applications. Even though it was originally tailored towards long-running data warehousing queries, its architecture recently changed with the introduction of the LLAP (Live Long and Process) layer. Instead of regular containers, LLAP uses long-running executors to exploit data sharing and caching opportunities within and across queries. Executors eliminate unnecessary disk IO overhead and thus reduce the latency of interactive BI (business intelligence) queries by orders of magnitude. However, now that container startup cost and IO overhead are minimized, effectively utilizing memory and CPU resources across the long-running executors in the cluster becomes increasingly essential. For instance, across a variety of production workloads we noticed that the memory bandwidth consumed by eagerly decoding all table columns for every row, even when that row is later dropped, was starting to dominate single-query execution. In this talk, we focus on some of the optimizations we introduced in Hive 4.0 to increase CPU efficiency and save memory allocations. In particular, we describe the lazy decoding (or row-level filtering) and composite Bloom filter optimizations that greatly improve the performance of queries containing broadcast joins, reducing their runtime by up to 50%. Over several production and synthetic workloads, we show the benefit of the newly introduced optimizations as part of Cloudera's cloud-native Data Warehouse engine. At the same time, the community can directly benefit from the presented features, as they are 100% open-source!
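For readers who want a feel for the mechanism, the following is a minimal, illustrative Python sketch (not Hive's actual implementation; all names and data are hypothetical) of how a runtime Bloom filter built from the small side of a broadcast join can drop probe-side rows before the remaining columns are decoded, which is the essence of row-level filtering with lazy decoding:

# Toy Bloom filter used for runtime row-level filtering on the probe side of a
# broadcast join. Hive's real implementation lives inside LLAP and also builds
# composite (multi-column) filters; this only illustrates the basic idea.
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def broadcast_join_with_runtime_filter(build_rows, probe_batches, decode_column):
    """Build a Bloom filter from the small (build) side, then decode only the
    join-key column of each probe batch; the other columns are decoded lazily,
    and only when at least one row in the batch survives the filter."""
    bloom, build_index = BloomFilter(), {}
    for row in build_rows:                                   # small, broadcast side
        bloom.add(row["key"])
        build_index[row["key"]] = row

    for batch in probe_batches:                              # large, probe side
        keys = decode_column(batch, "key")                   # decode key column eagerly
        surviving = [i for i, k in enumerate(keys) if bloom.might_contain(k)]
        if not surviving:
            continue                                         # skip decoding everything else
        other = {c: decode_column(batch, c) for c in batch if c != "key"}
        for i in surviving:
            if keys[i] in build_index:                       # resolve Bloom false positives
                yield {**build_index[keys[i]], **{c: col[i] for c, col in other.items()}}


if __name__ == "__main__":
    build = [{"key": 1, "region": "EU"}, {"key": 2, "region": "US"}]
    probe = [{"key": [1, 3, 4, 2], "amount": [10, 20, 30, 40]}]    # columnar batch
    decode = lambda batch, column: list(batch[column])             # stands in for real column decoding
    print(list(broadcast_join_with_runtime_filter(build, probe, decode)))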
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De... (Databricks)
Columbia is a data-driven enterprise, integrating data from all line-of-business systems to manage its wholesale and retail businesses. This includes integrating real-time and batch data to better manage purchase orders and generate accurate consumer demand forecasts.
Building a Federated Data Directory Platform for Public Health (Databricks)
Healthcare directories underpin most healthcare systems around the world and are often a core component that enables initiatives like ‘Care Coordination’.
Democratizing data science using Spark, Hive and Druid (DataWorks Summit)
MZ is re-inventing how the entire world experiences data via our mobile games division MZ Games Studios, our digital marketing division Cognant, and our live data platform division Satori.
The growing need for data science capabilities across the organization requires an architecture that can democratize building these applications and disseminating insight from their outcomes to the wider organization.
Attend this session to learn how we built a platform for data science using Spark, Hive, and Druid specifically for our performance marketing division, Cognant. This platform powers several data science applications, such as fraud detection and bid optimization, at large scale.
We will share lessons learned over the past three years of building this platform and walk through some of the actual data science applications built on top of it.
Attendees from ML engineering and data science backgrounds can gain deep insight from our experience of building this platform.
Speakers
Pushkar Priyadarshi, Director of Engineering, Machine Zone Inc.
Igor Yurinok, Staff Software Engineer, MZ
Options for Data Prep - A Survey of the Current Market (Dremio Corporation)
Data comes in many shapes and sizes, and every company struggles to find ways to transform, validate, and enrich data for multiple purposes. The problem has been around as long as data, and the market has an overwhelming number of options. In this presentation we look at the problem and key options from vendors in the market today. Dremio is a new approach that eliminates the need for standalone data prep tools.
Optimizing industrial operations using the big data ecosystem (DataWorks Summit)
GE Digital is undertaking a journey to optimize the reliability, availability, and efficiency of assets in the industrial sector and converge IT and OT. To do so, GE Digital is building cloud-based products that enable customers to analyze asset data, detect anomalies, and provide recommendations for operating plants efficiently while increasing productivity. In an energy sector such as oil and gas, power, or renewables, a single plant comprises multiple complex assets, such as steam turbines, gas turbines, and compressors, to generate power. Each system contains various sensors to detect the operating conditions of the assets, generating large volumes and a wide variety of data. A highly scalable distributed environment is required to analyze such a large volume of data and provide operating insights in near real time.
In this session I will share the challenges encountered when analyzing these large volumes of data, our approach to in-stream data analysis, how we standardized the industrial data using data frames, and performance tuning.
High-Performance Advanced Analytics with Spark-Alchemy (Databricks)
Pre-aggregation is a powerful analytics technique as long as the measures being computed are reaggregable. Counts reaggregate with SUM, minimums with MIN, maximums with MAX, etc. The odd one out is distinct counts, which are not reaggregable.
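A tiny worked example (mine, not from the talk) makes the difference concrete: partial counts can be summed, but partial distinct counts cannot, because the partitions may share members.

# Two pre-aggregated partitions of the same event stream.
part_a = {"count": 3, "distinct_users": 2}   # underlying users: {"u1", "u2"}
part_b = {"count": 4, "distinct_users": 3}   # underlying users: {"u2", "u3", "u4"}

# Counts reaggregate correctly with SUM.
total_count = part_a["count"] + part_b["count"]                       # 7, correct

# Distinct counts do NOT: "u2" appears in both partitions, so SUM overcounts.
naive_distinct = part_a["distinct_users"] + part_b["distinct_users"]  # 5, wrong
true_distinct = len({"u1", "u2"} | {"u2", "u3", "u4"})                # 4, correct
print(total_count, naive_distinct, true_distinct)

Mergeable sketches such as HLL restore reaggregability because the sketches themselves, rather than the final numbers, are what gets stored and combined.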
Traditionally, the non-reaggregability of distinct counts leads to an implicit restriction: whichever system computes distinct counts has to have access to the most granular data and touch every row at query time. Because of this, in typical analytics architectures, where fast query response times are required, raw data has to be duplicated between Spark and another system such as an RDBMS. This talk is for everyone who computes or consumes distinct counts and for everyone who doesn’t understand the magical power of HyperLogLog (HLL) sketches.
We will break through the limits of traditional analytics architectures using the advanced HLL functionality and cross-system interoperability of the spark-alchemy open-source library, whose capabilities go beyond what is possible with OSS Spark, Redshift or even BigQuery. We will uncover patterns for 1000x gains in analytic query performance without data duplication and with significantly less capacity.
We will explore real-world use cases from Swoop’s petabyte-scale systems, improve data privacy when running analytics over sensitive data, and even see how a real-time analytics frontend running in a browser can be provisioned with data directly from Spark.
Building a Cross Cloud Data Protection Engine (Databricks)
Data protection is still at the forefront of many companies' minds, with potential GDPR fines of up to 4% of global annual turnover (creating a current theoretical maximum fine of $20bn). GDPR affects countries across the world, not just those in Europe, leaving many companies still playing catch-up. Additional acts and legislation, such as the CCPA, are coming into force, meaning data protection is a constantly evolving landscape, with fines that can literally decimate some businesses. In this session we will go through how we have worked with our customers to create an Azure and AWS implementation of a Data Protection Engine covering Protection, Detection, Re-Identification and Erasure of PII data.
Democratizing Machine Learning: Perspective from a scikit-learn Creator (Databricks)
Once an obscure branch of applied mathematics, machine learning is now the darling of tech. I will talk about lessons learned democratizing machine learning: how libraries like scikit-learn were designed to empower users, simplifying while avoiding ambiguous behaviors; how the Python data ecosystem was built from scientific computing tools, and the importance of good numerics; and how some machine-learning patterns easily provide value in real-world situations. I will also discuss remaining challenges to address and the progress that we are making. Scaling up brings different bottlenecks to numerics. Integrating data into statistical models, a hurdle to data-science practice, requires rethinking data-cleaning pipelines. This talk will draw from my experience as a scikit-learn developer, but also as a researcher in machine learning and its applications.
Building an Enterprise Data Platform with Azure Databricks to Enable Machine ... (Databricks)
At Sam's Club we have a long history of using Apache Spark and Hadoop. Projects from all parts of the company use Apache Spark, from fraud detection to product recommendations. Because of the scale of our business, with billions of transactions and trillions of events, it is often essential to use big data technologies. Until recently all of this work ran on several large on-premises Hadoop clusters. As part of our transition to the public cloud we needed to build out an enterprise-scale data platform. Azure Databricks is a key component of this platform, giving our data scientists, engineers, and business users the ability to easily work with the company's data. We will discuss the architecture considerations that led to using multiple Databricks workspaces and external Azure Blob storage. We will also discuss how we move massive amounts of data to Azure on a daily basis with Airflow. Further, we will discuss the self-service tools that we created to help users get their data to Azure and that help us manage the platform. Finally, we will discuss our security considerations and how they played out in our architecture.
Authors: Andrew Ray, Craig Covey
Unifying Streaming and Historical Telemetry Data For Real-time Performance Re... (Databricks)
"In the oil and gas industry, utilizing vast amounts of data has long been identified as an important indicator of operational performance. The measurement of key performance indicators is a routine practice in well construction, but a systematic way of statistically analyzing performance against a large data bank of offset wells is not a common practice. The performance of statistical analysis in real-time is even less common. With the adoption of distributed computing platforms, like Apache Spark, new analysis opportunities become available to leverage large-scale time-series data sets to optimize performance. Two case studies are presented in this talk: the rate of penetration (ROP) and the amount of vibration per run.
By collecting real-time, telemetry data and comparing it with historic sample datasets within the Databricks Unified Analytics Platform, the optimization team was able to quickly determine whether the performance being delivered matched or exceeded past performance with statistical certainty. This is extremely important while trying new techniques with data that is highly variable. By substituting anecdotal evidence with statistical analysis, decision making is more precise and better informed. In this talk we'll share how we accomplished this and the lessons learned along the way."
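The statistical-certainty idea can be sketched in a few lines of plain Python (my illustration, with made-up numbers, not the authors' code): compare a new run's rate of penetration against the offset-well history and flag it only when it falls outside the historical spread.

# Illustrative only: compare a new run's average ROP (rate of penetration)
# against historical offset-well runs using a simple z-score. The talk's actual
# analysis runs at scale on Spark/Databricks against streaming telemetry.
from statistics import mean, stdev

historical_rop = [52.1, 48.7, 55.3, 50.9, 49.4, 53.8, 51.2]   # ft/hr, offset wells
current_run_rop = 58.6                                         # current run, from telemetry

mu, sigma = mean(historical_rop), stdev(historical_rop)
z = (current_run_rop - mu) / sigma

if z > 2:
    print(f"ROP {current_run_rop} is significantly better than history (z={z:.2f})")
elif z < -2:
    print(f"ROP {current_run_rop} is significantly worse than history (z={z:.2f})")
else:
    print(f"ROP {current_run_rop} is within the historical range (z={z:.2f})")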
Industrializing Machine Learning on an Enterprise Azure Platform with Databri... (Databricks)
At Sodexo, the world leader in services, we believe that quality of life is created when we integrate our food services, facilities management, employee benefits and more… Our ambition is to positively impact one billion consumers worldwide. Sodexo has launched a POC for food services that is satisfying our business, but how do we go from a local, pure-Python Jupyter notebook to a product that can scale to cover all our consumers? To do so we have not only used Databricks but also made it interact with many Azure services: Data Factory, Log Analytics, DevOps, Cosmos DB, Kubernetes, Data Lake and Key Vault. This talk covers what those services are, how we use Databricks, how it interacts with those services, and our feedback and experience.
Prior to 2014, Walgreens had traditional enterprise data warehouse systems that had reached their capacity limits. Over the last three years we have evolved, learned lessons, and experienced successes and failures. Our initial adoption of Hadoop came from the need to run complex analytics which simply did not scale on MPP RDBMS. Our business data demands were rapidly increasing, and the concomitant 8-to-12-week extract, transform, and load turnaround cycles were not an acceptable delivery timeframe in the retail space. A self-service model where data lands on a distributed platform, schema is applied where necessary, and processing happens at scale was a necessary paradigm for business value enablement. Our journey started with a single use case which has now evolved into an enterprise data hub. We will discuss the following points: the evolution of our infrastructure profile, streamlining the hardware provisioning cycle, and our hybrid deployment model (on premise and cloud); operations, how SmartSense has helped us proactively tune our cluster, and which operational tests we use for benchmarking the cluster; monitoring, how we monitor and the tools required for enterprise-grade monitoring; security and governance, how we progressed from non-compliance to enterprise grade using Ranger, Knox, Kerberos, HP Voltage, encryption at rest, and many other services; third-party integration with HDP, what we learned and how we overcame the challenges; and lastly, how we approach our disaster recovery strategy, what is driving the need for DR, and the key capabilities required.
Building Robust Production Data Pipelines with Databricks Delta (Databricks)
"Most data practitioners grapple with data quality issues and data pipeline complexities—it's the bane of their existence. Data engineers, in particular, strive to design and deploy robust data pipelines that serve reliable data in a performant manner so that their organizations can make the most of their valuable corporate data assets.
Databricks Delta, part of Databricks Runtime, is a next-generation unified analytics engine built on top of Apache Spark. Built on open standards, Delta employs co-designed compute and storage and is compatible with Spark APIs. It powers high data reliability and query performance to support big data use cases, from batch and streaming ingest and fast interactive queries to machine learning. In this tutorial we will discuss the requirements of modern data pipelines, the challenges data engineers face when it comes to data reliability and performance, and how Delta can help. Through presentation, code examples and notebooks, we will explain pipeline challenges and the use of Delta to address them. You will walk away with an understanding of how you can apply this innovation to your data architecture and the benefits you can gain.
This tutorial will be both an instructor-led and a hands-on interactive session. Instructions on how to get the tutorial materials will be covered in class.
WHAT YOU’LL LEARN:
– Understand the key data reliability and performance data pipeline challenges
– How Databricks Delta helps build robust pipelines at scale
– Understand how Delta fits within an Apache Spark™ environment
– How to use Delta to realize data reliability improvements
– How to deliver performance gains using Delta
PREREQUISITES:
– A fully-charged laptop (8-16GB memory) with Chrome or Firefox
– Pre-register for Databricks Community Edition"
Speakers: Steven Yu, Burak Yavuz
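To give a rough flavour of the pipeline pattern this tutorial describes, here is a minimal PySpark sketch; it assumes the delta-spark package is installed, and the paths, schemas and column names are invented.

# Minimal sketch: batch write, streaming append, and an upsert with Delta Lake.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

events_path = "/tmp/delta/events"                      # hypothetical table location

# 1) Batch ingest into a Delta table.
batch_df = spark.read.json("/tmp/raw/events/batch/")
batch_df.write.format("delta").mode("append").save(events_path)

# 2) Streaming ingest into the same table (exactly-once via the checkpoint).
stream_df = (spark.readStream.format("json")
             .schema(batch_df.schema)
             .load("/tmp/raw/events/incoming/"))
(stream_df.writeStream.format("delta")
 .option("checkpointLocation", "/tmp/delta/_checkpoints/events")
 .start(events_path))

# 3) Upsert (MERGE) late or corrected records to keep the table reliable.
updates_df = spark.read.json("/tmp/raw/events/corrections/")
(DeltaTable.forPath(spark, events_path).alias("t")
 .merge(updates_df.alias("u"), "t.event_id = u.event_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())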
Whoops, The Numbers Are Wrong! Scaling Data Quality @ Netflix (DataWorks Summit)
Netflix is a famously data-driven company. Data is used to make informed decisions on everything from content acquisition to content delivery, and everything in-between. As with any data-driven company, it’s critical that data used by the business is accurate. Or, at worst, that the business has visibility into potential quality issues as soon as they arise. But even in the most mature data warehouses, data quality can be hard. How can we ensure high quality in a cloud-based, internet-scale, modern big data warehouse employing a variety of data engineering technologies?
In this talk, Michelle Ufford will share how the Data Engineering & Analytics team at Netflix is doing exactly that. We’ll kick things off with a quick overview of Netflix’s analytics environment, then dig into details of our data quality solution. We’ll cover what worked, what didn’t work so well, and what we plan to work on next. We’ll conclude with some tips and lessons learned for ensuring data quality on big data.
TechDays NL 2016 - Building your scalable secure IoT Solution on Azure (Tom Kerkhove)
The Internet-of-Things was one of the big hypes in 2015 but it’s more than that – Customers want to build out their own infrastructures and act on their data.
Today we’ll look at how Microsoft Azure helps us to build scalable solutions to process events from thousands of devices in a secure manner and the challenges it has. Once the data is in the cloud we’ll also take a look at ways we can learn from our measurements.
Data Saturday Malta - ADX Azure Data Explorer overview (Riccardo Zamana)
This is a step-by-step approach to the entire ecosystem of features offered by Azure Data Explorer. You can find many examples using the Kusto dialect to acquire data, process it, and build complete web interfaces using only one service: ADX.
Integration Monday - Analysing StackExchange data with Azure Data Lake (Tom Kerkhove)
Big data is the new big thing where storing the data is the easy part. Gaining insights in your pile of data is something different.
Based on a data dump of the well-known StackExchange websites, we will store & analyse 150+ GB of data with Azure Data Lake Store & Analytics to gain some insights about their users. After that we will use Power BI to give an at a glance overview of our learnings.
If you are a developer that is interested in big data, this is your time to shine! We will use our existing SQL & C# skills to analyse everything without having to worry about running clusters.
Analyzing StackExchange data with Azure Data Lake (BizTalk360)
Big data is the new big thing where storing the data is the easy part. Gaining insights in your pile of data is something different. Based on a data dump of the well-known StackExchange websites, we will store & analyse 150+ GB of data with Azure Data Lake Store & Analytics to gain some insights about their users. After that we will use Power BI to give an at a glance overview of our learnings.
If you are a developer that is interested in big data, this is your time to shine! We will use our existing SQL & C# skills to analyse everything without having to worry about running clusters.
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis... (Spark Summit)
Redis accelerates Apache Spark execution by 45 times when used as a shared, distributed in-memory datastore for Spark in analyses like time-series data range queries. With redis-ml, the Redis module for machine learning, Spark ML models gain a new real-time serving layer that offloads model processing directly to Redis, allows multiple applications to reuse the same models, and speeds up classification and execution of these models by 13x. Join this session to learn more about the Redis Labs connector for Apache Spark that enhances production implementations of real-time big data processing.
AWS Lambda Powertools is a developer toolkit to implement serverless best practices and increase developer velocity. It started as an open-source project in 2020, focused on making Tracing, Logging, and Metrics easier. Fast-forward to today: Powertools has added 13 more features and grown a vibrant community that regularly contributes up to 60% of our releases, now covering a plethora of use cases: REST and GraphQL APIs, Batch processing, Idempotency, Feature Flags, Data Validation, and more.
You’ll learn why this developer toolkit was created, key use cases, and find out how you can adopt common industry and AWS best practices in seconds. We’ll also cover two of the most anticipated new features coming in 2023, and live demo(s).
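For context, here is a minimal example of the Python flavour of the toolkit, showing the original Tracing, Logging and Metrics utilities together; the service, namespace and metric names are invented for illustration.

# Minimal aws-lambda-powertools (Python) handler: structured logging, tracing,
# and custom metrics. Names and values are illustrative only.
from aws_lambda_powertools import Logger, Metrics, Tracer
from aws_lambda_powertools.metrics import MetricUnit

logger = Logger(service="payments")
tracer = Tracer(service="payments")
metrics = Metrics(namespace="DemoApp", service="payments")


@logger.inject_lambda_context(log_event=True)    # structured, correlated logs
@tracer.capture_lambda_handler                   # X-Ray tracing of the handler
@metrics.log_metrics                             # flushes metrics in EMF format
def handler(event, context):
    order_id = event.get("order_id", "unknown")
    logger.info("Processing order", extra={"order_id": order_id})
    metrics.add_metric(name="OrdersProcessed", unit=MetricUnit.Count, value=1)
    return {"statusCode": 200, "body": order_id}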
DEM07 Best Practices for Monitoring Amazon ECS Containers Launched with Fargate (Amazon Web Services)
Containers and other forms of dynamic infrastructure can prove challenging to monitor. How do you define “normal” when your infrastructure is intentionally in motion and changing every minute, or when there are no hosts to monitor at all? Join us as we share proven strategies for monitoring your containerized infrastructure on AWS, Amazon ECS, and AWS Fargate. This session is brought to you by AWS Partner, Datadog.
Supercharge GuardDuty with Partners: Threat Detection and Response at Scale (... (Amazon Web Services)
Amazon GuardDuty has the ability to detect threats. However, threat detection is only the first step. In this session, we combine the high fidelity findings of GuardDuty with partner products, and we demonstrate how to quickly respond, remediate, and prevent security incidents in order to supercharge and centralize your cloud security operations center (SOC).
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel... (Timothy Spann)
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipelines
https://www.meetup.com/futureofdata-newyork/events/298660453/
Unlocking Financial Data with Real-Time Pipelines
(Flink Analytics on Stocks with SQL)
By Timothy Spann
Financial institutions thrive on accurate and timely data to drive critical decision-making processes, risk assessments, and regulatory compliance. However, managing and processing vast amounts of financial data in real-time can be a daunting task. To overcome this challenge, modern data engineering solutions have emerged, combining powerful technologies like Apache Flink, Apache NiFi, Apache Kafka, and Iceberg to create efficient and reliable real-time data pipelines. In this talk, we will explore how this technology stack can unlock the full potential of financial data, enabling organizations to make data-driven decisions swiftly and with confidence.
Introduction: Financial institutions operate in a fast-paced environment where real-time access to accurate and reliable data is crucial. Traditional batch processing falls short when it comes to handling rapidly changing financial markets and responding to customer demands promptly. In this talk, we will delve into the power of real-time data pipelines, utilizing the strengths of Apache Flink, Apache NiFi, Apache Kafka, and Iceberg, to unlock the potential of financial data. I will be utilizing NiFi 2.0 with Python and Vector Databases.
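As a hedged sketch of the kind of Flink-SQL-on-Kafka analytics described above (the topic, fields and broker address are invented; the real demo combines NiFi 2.0, Kafka, Flink and Iceberg):

# Minimal PyFlink sketch: read stock quotes from Kafka and compute a per-symbol
# one-minute average price with Flink SQL. Requires the Kafka SQL connector jar.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE stocks (
        symbol STRING,
        price  DOUBLE,
        ts     TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'stocks',
        'properties.bootstrap.servers' = 'localhost:9092',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    )
""")

result = t_env.sql_query("""
    SELECT symbol,
           TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start,
           AVG(price) AS avg_price
    FROM stocks
    GROUP BY symbol, TUMBLE(ts, INTERVAL '1' MINUTE)
""")
result.execute().print()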
Timothy Spann
Principal Developer Advocate, Cloudera
Tim Spann is a Principal Developer Advocate in Data In Motion for Cloudera. He works with Apache NiFi, Apache Kafka, Apache Pulsar, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, DeltaLake, Apache Spark, Big Data, IoT, Cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming. Previously, he was a Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science.
https://twitter.com/PaaSDev
https://www.linkedin.com/in/timothyspann/
https://medium.com/@tspann
https://github.com/tspannhw/FLiPStackWeekly/
Elastic @ Adobe: Making Search Smarter with Machine Learning at Scale (Elasticsearch)
Hear how Adobe scales, manages multiple use cases, and puts machine learning features to work with Elastic and learn about extensions to Elasticsearch that allow them to search at scale natively.
Keynote held at API Days 2014 in Paris. It covers the rapid growth of IoT and how developers can start applying their APIs in order to be ready for this new era of connected hardware.
Running complex data queries in a distributed system (ArangoDB Database)
With the ever-growing amount of data, it is getting increasingly hard to store data and retrieve it efficiently. While the first versions of distributed databases put all the burden of sharding on the application code, there are now smarter solutions that handle most of the data distribution and resilience tasks inside the database.
This poses some interesting questions, e.g.
- how are other than by-primary-key queries actually organized and executed in a distributed system, so that they can run most efficiently?
- how do the contemporary distributed databases actually achieve transactional semantics for non-trivial operations that affect different shards/servers?
This talk will give an overview of these challenges and the available solutions that some open source distributed databases have picked to solve them.
Impala has proven to be a high-performance analytics query engine since the beginning. Even as an initial production release in 2013, it demonstrated performance 2x faster than a traditional DBMS, and each subsequent release has continued to demonstrate the wide performance gap between Impala’s analytic-database architecture and SQL-on-Apache Hadoop alternatives. Today, we are excited to continue that track record via some important performance gains for Impala 2.5 (with more to come on the roadmap), summarized below.
Overall, compared to Impala 2.3, in Impala 2.5:
TPC-DS queries run on average 4.3x faster.
TPC-H queries run 2.2x faster on flat tables, and 1.71x faster on nested tables.
How Sitecore depends on MongoDB for scalability and performance, and what it... (Antonios Giannopoulos)
Percona Live 2017 - How Sitecore depends on MongoDB for scalability and performance, and what it can teach you, by Antonios Giannopoulos and Grant Killian
Uncover the Root Cause of Kafka Performance Anomalies, Daniel Kim & Antón Rod... (HostedbyConfluent)
Kafka makes it possible to ingest and process large amounts of data concurrently by decoupling your data streams. This successful approach introduces some challenges when monitoring Kafka pipelines - to understand what’s happening, we need to monitor every single component and how they interact with each other.
In this talk, we will take a close look at Kafka’s architecture as well as the key infrastructure, JVM, and system metrics you should monitor for each of its components. Then, we will walk through how to diagnose common Kafka performance anomalies through observing patterns in the metrics for the various components. Finally, we will walk through setting up an open source observability pipeline with OpenTelemetry to enable you to collect and process Kafka metrics at scale.
Extracting Insights from Industrial Data Using AWS IoT Services (IOT368) - AW... (Amazon Web Services)
Industrial IoT (IIoT) bridges the gap between legacy industrial equipment and infrastructure and newer technologies such as machine learning, cloud, mobile, and edge computing. In this session, we focus on how you can extract data from your industrial data sources and build operational insights using AWS IoT services. We cover how to bridge traditional on-premises applications and data stores with new cloud-based IoT applications.
Presenting three real-life use cases of Apache Beam in production. Code reusability for bounded and unbounded data, as well as running Apache Beam to write into different cloud providers, are some of the aspects that will be covered in this presentation.
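A minimal illustration (mine, not from the talk) of the bounded/unbounded reuse point: the same composite Beam transform is applied to a batch text source below, and would work unchanged on an unbounded source such as Kafka or Pub/Sub.

# Illustrative Apache Beam (Python) sketch: one reusable composite transform.
import apache_beam as beam


class CountEventsPerUser(beam.PTransform):
    """Reusable logic: parse 'user,event' lines and count events per user."""

    def expand(self, lines):
        return (lines
                | "ExtractUser" >> beam.Map(lambda line: line.split(",")[0])
                | "CountPerUser" >> beam.combiners.Count.PerElement())


if __name__ == "__main__":
    with beam.Pipeline() as p:   # bounded run; a streaming run would swap the source
        (p
         | "Read" >> beam.io.ReadFromText("events.csv")
         | "Count" >> CountEventsPerUser()
         | "Print" >> beam.Map(print))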
New and cool in OSGi R7 - David Bosschaert & Carsten Ziegeler (mfrancis)
OSGi Community Event 2016 Presentation by David Bosschaert (Adobe) & Carsten Ziegeler (Adobe)
The OSGi expert groups are working on the next big release. Learn in this session about the various new specification efforts going on and how they will make your developer life easier. The new specifications range from configuration handling, object conversion, JAX-RS, distributed eventing, to cloud and IoT.
Optimize Your SaaS Offering with Serverless Microservices (GPSTEC405) - AWS r... (Amazon Web Services)
In this hands-on session, we crack open the IDE and transform a SaaS web app comprised of several monolithic single-tenant environments into an efficient, scalable, and secure multi-tenant SaaS platform using ReactJS and NodeJS serverless microservices. We use Amazon API Gateway and Amazon Cognito to simplify the operation and security of the service’s API and identity functionality. We enforce tenant isolation and data partitioning with OIDC’s JWT tokens. We leverage AWS SAM and AWS Amplify to simplify authoring, testing, debugging, and deploying serverless microservices, keeping operational burden to a minimum, maximizing developer productivity, and maintaining a great developer experience.
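To make the tenant-isolation idea concrete, here is a small hedged sketch (not the session's code): the tenant identifier is taken from a verified JWT claim and used to scope every data access. The claim name and HS256 secret are assumptions; the session itself uses Amazon Cognito-issued OIDC tokens.

# Illustrative JWT-based tenant isolation: every data access is scoped by a
# tenant_id claim from a verified token. Claim name and key are hypothetical.
import jwt  # PyJWT

SECRET = "demo-signing-key"        # real systems verify against the IdP's public keys


def tenant_from_token(token: str) -> str:
    claims = jwt.decode(token, SECRET, algorithms=["HS256"])
    return claims["custom:tenant_id"]


def list_orders(token: str, table):
    """Return only the rows partitioned under the caller's tenant."""
    tenant_id = tenant_from_token(token)
    return [row for row in table if row["tenant_id"] == tenant_id]


if __name__ == "__main__":
    token = jwt.encode({"custom:tenant_id": "tenant-42"}, SECRET, algorithm="HS256")
    orders = [{"tenant_id": "tenant-42", "id": 1}, {"tenant_id": "tenant-7", "id": 2}]
    print(list_orders(token, orders))   # only tenant-42's order is returned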
Best Practices for Scalable Monitoring (ENT310-S) - AWS re:Invent 2018 (Amazon Web Services)
A successful transition to a modern elastic, containerized, microservice architecture requires automating all things, including your monitoring and alerting infrastructure. In this talk, we share some of the techniques and best practices we learned at New Relic for applying "infrastructure as code" (IaC) techniques to monitoring and alerting during our 10-year journey from a single-region monolithic application to a global multi-region deployment of hundreds of microservices. This session is brought to you by AWS partner, New Relic.
Similar to Accelerating distributed joins in Apache Hive: Runtime filtering enhancements (20)
The Building Blocks of QuestDB, a Time Series Database (javier ramirez)
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps as just another data type. However, when performing real-time analytics, timestamps should be first-class citizens, and we need rich time semantics to get the most out of our data. We also need to deal with ever-growing datasets while staying performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open-source time-series database designed for speed. We will also review some of the changes we have gone through over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
Unleashing the Power of Data: Choosing a Trusted Analytics Platform (Enterprise Wired)
In this guide, we'll explore the key considerations and features to look for when choosing a trusted analytics platform that meets your organization's needs and delivers actionable intelligence you can trust.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake (Walaa Eldin Moustafa)
Dynamic policy enforcement is becoming an increasingly important topic in today's world, where data privacy and compliance are a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) they are auto-generated from declarative data annotations; (2) they respect user-level consent and preferences; (3) they are context-aware, encoding a different set of transformations for different use cases; (4) they are portable: while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
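A toy illustration (not ViewShift itself) of the "auto-generated from declarative annotations" property: a small function that turns per-column policy annotations into a compliance-enforcing view definition. The policy names and SQL functions here are invented.

# Toy sketch: generate a compliance-enforcing view from column annotations.
# The real engine also encodes consent, use-case context, and engine portability.
def build_policy_view(table, columns, annotations):
    exprs = []
    for col in columns:
        policy = annotations.get(col, "allow")
        if policy == "allow":
            exprs.append(col)
        elif policy == "hash":                        # pseudonymize identifiers
            exprs.append(f"sha2({col}, 256) AS {col}")
        elif policy == "redact":                      # drop the value entirely
            exprs.append(f"CAST(NULL AS STRING) AS {col}")
    return f"CREATE VIEW {table}_compliant AS SELECT {', '.join(exprs)} FROM {table}"


print(build_policy_view(
    table="member_events",
    columns=["member_id", "email", "page", "ts"],
    annotations={"member_id": "hash", "email": "redact"},
))
# -> CREATE VIEW member_events_compliant AS SELECT sha2(member_id, 256) AS member_id,
#    CAST(NULL AS STRING) AS email, page, ts FROM member_events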
Global Situational Awareness of A.I. and Where It's Headed (vikram sood)
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Enhanced Enterprise Intelligence with your personal AI Data Copilot (GetInData)
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is growing interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach to LLM context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source) Copilot?
How can we build one?
Architecture and evaluation
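To ground the RAG pattern mentioned above, here is a deliberately tiny, self-contained sketch (my own, with a bag-of-words stand-in for a real embedding model, an in-memory list instead of a vector database, and no actual LLM call):

# Minimal RAG sketch: toy embeddings, cosine-similarity retrieval, prompt assembly.
import math
from collections import Counter

DOCS = [
    "Table fct_orders holds one row per order with revenue in EUR.",
    "dim_customer maps customer_id to country and segment.",
    "Daily revenue is reported in the finance_mart.daily_revenue view.",
]

def embed(text):                      # stand-in embedding: bag-of-words counts
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, k=2):          # top-k documents by similarity to the question
    q = embed(question)
    ranked = sorted(DOCS, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(question):           # augment the prompt with retrieved context
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("Which table contains daily revenue?"))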
Analysis insight about a Flyball dog competition team's performance (roli9797)
Insights from my analysis of a Flyball dog competition team's performance last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main