This presentation describes the various layers and open source components that can be used to design and implement a lambda architecture that supports batch processing for model training and streaming for prediction.
This presentation is based on Lawrence To's Maximum Availability Architecture (MAA) Oracle Open World presentation covering the latest updates on high availability (HA) best practices across multiple architectures, features and products in Oracle Database 19c. It considers all workloads (OLTP, DWH and analytics, and mixed workloads) as well as on-premises and cloud-based deployments.
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark (Databricks)
The trade-off between development speed and pipeline maintainability is a constant for data engineers, especially for those in a rapidly evolving organization.
This talk was given during DevOps Con 2017.
Have you ever spent time digging through various terminals, grepping, lessing and awking, trying to find those few log lines that may be important? Have you ever done that under time pressure, because mission-critical services were not working? Have you ever heard from your developers that they can’t tell you anything, because they don’t have access to application logs? Have you ever considered centralized storage for logs, but time and resources are not on your side?
If you said yes to any of the above questions, then this talk is for you. During the talk we’ll introduce you to the world of log centralization and analysis, covering both open source and commercial tools. We will go from top to bottom and learn how to set up log centralization and analysis for servers, virtualized environments and containers. We will go from log shipping, through centralized buffering, to storage and analysis, to show you that having a centralized log analysis tool is not rocket science.
Finally, you will see how useful it is to combine the logs from all your servers in a single place for blazingly fast correlation.
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S... (Spark Summit)
What if you could get the simplicity, convenience, interoperability, and storage niceties of an old-fashioned CSV with the speed of a NoSQL database and the storage requirements of a gzipped file? Enter Parquet.
At The Weather Company, Parquet files are a quietly awesome and deeply integral part of our Spark-driven analytics workflow. Using Spark + Parquet, we’ve built a blazing fast, storage-efficient, query-efficient data lake and a suite of tools to accompany it.
We will give a technical overview of how Parquet works and how recent improvements from Tungsten enable SparkSQL to take advantage of this design to provide fast queries by overcoming two major bottlenecks of distributed analytics: communication costs (IO bound) and data decoding (CPU bound).
A Collaborative Data Science Development Workflow (Databricks)
Collaborative data science workflows have several moving parts, and many organizations struggle with developing an efficient and scalable process. Our solution consists of data scientists individually building and testing Kedro pipelines and measuring performance using MLflow tracking. Once a strong solution is created, the candidate pipeline is trained on cloud-agnostic, GPU-enabled containers. If this pipeline is production worthy, the resulting model is served to a production application through MLflow.
Getting Started with Databricks SQL Analytics (Databricks)
It has long been said that business intelligence needs a relational warehouse, but that view is changing. With the Lakehouse architecture being shouted from the rooftops, Databricks have released SQL Analytics, an alternative workspace for SQL-savvy users to interact with an analytics-tuned cluster. But how does it work? Where do you start? What does a typical Data Analyst’s user journey look like with the tool?
This session will introduce the new workspace and walk through the various key features – how you set up a SQL Endpoint, the query workspace, creating rich dashboards and connecting up BI tools such as Microsoft Power BI.
If you’re truly trying to create a Lakehouse experience that satisfies your SQL-loving Data Analysts, this is a tool you’ll need to be familiar with and include in your design patterns, and this session will set you on the right path.
DOAG Oracle Unified Audit in Multitenant Environments (Stefan Oehrli)
Oracle Audit is a well-known and proven database functionality. Or maybe not? What does auditing look like in combination with Oracle Multitenant databases? Do database audit and Unified Audit work analogously to existing configurations? This presentation examines auditing in container database environments more closely. It shows what has to be considered and how an auditing concept has to be adapted to the new architecture. Focusing on the current versions of the Oracle database, specific problems and workarounds in the area of Unified Audit are presented, complemented by corresponding examples and live demos.
This presentation mainly intends to explain how to gather the author name, title, submission year, keywords, subjects, department, university, language, and the number of pages of every thesis document in Theseus and then reuse the gathered data for building a Web portal.
Make Your Application “Oracle RAC Ready” & Test For It (Markus Michalewicz)
This presentation talks about the secrets behind Oracle RAC’s horizontal scaling algorithm, Cache Fusion, and how you can ensure that your application is “Oracle RAC ready”. It discusses dos and don'ts and how to test your application for "Oracle RAC readiness". This version was first presented at Sangam19.
Why Architecting for Disaster Recovery is Important for Your Time Series Data... (InfluxData)
Time series data at Capital One consists of infrastructure, application and business process metrics. The combination of these metrics is what internal stakeholders rely on for observability, which allows them to deliver better service and uptime for their customers, so protecting this critical data with a proven and tested recovery plan is not a “nice to have” but a “must have.”
In this talk, IT staff members Saravanan Krisharaju, Rajeev Tomer, and Karl Daman share how they built a fault-tolerant solution based on InfluxEnterprise and AWS that collects and stores metrics and events. They added machine learning, which uses the collected time series to model predictions that are then brought back into the InfluxDB time series database for real-time access. The Capital One team shares the journey they took to architect and build this solution, as well as to plan and execute their disaster recovery plan.
Achieving Lakehouse Models with Spark 3.0 (Databricks)
It’s very easy to be distracted by the latest and greatest approaches with technology, but sometimes there’s a reason old approaches stand the test of time. Star Schemas & Kimball is one of those things that isn’t going anywhere, but as we move towards the “Data Lakehouse” paradigm, how appropriate is this modelling technique, and how can we harness the Delta Engine & Spark 3.0 to maximise its performance?
Every business today wants to leverage data to drive strategic initiatives with machine learning, data science and analytics — but runs into challenges from siloed teams, proprietary technologies and unreliable data.
That’s why enterprises are turning to the lakehouse because it offers a single platform to unify all your data, analytics and AI workloads.
Join our How to Build a Lakehouse technical training, where we’ll explore how to use Apache Spark™, Delta Lake, and other open source technologies to build a better lakehouse. This virtual session will include concepts, architectures and demos.
Here’s what you’ll learn in this 2-hour session:
- How Delta Lake combines the best of data warehouses and data lakes for improved data reliability, performance and security
- How to use Apache Spark and Delta Lake to perform ETL processing, manage late-arriving data, and repair corrupted data directly on your lakehouse (a sketch of this pattern follows the list)
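The session’s own notebooks aren’t reproduced here, but the late-arriving-data pattern it mentions is easy to sketch. A minimal, hedged example assuming PySpark with the delta-spark package, a Delta table at /delta/events, and a staging path for late records (all names are illustrative):

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Late-arriving or corrected records land in a staging area (path illustrative).
updates = spark.read.json("/staging/late_events")

# MERGE upserts them into the Delta table in one ACID transaction:
# matching keys are repaired in place, new keys are appended.
events = DeltaTable.forPath(spark, "/delta/events")
(events.alias("t")
 .merge(updates.alias("s"), "t.event_id = s.event_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```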
Building Reliable Lakehouses with Apache Flink and Delta Lake (Flink Forward)
Flink Forward San Francisco 2022.
Apache Flink and Delta Lake together allow you to build the foundation for your data lakehouses by ensuring the reliability of your concurrent streams from processing to the underlying cloud object-store. Together, the Flink/Delta Connector enables you to store data in Delta tables such that you harness Delta’s reliability by providing ACID transactions and scalability while maintaining Flink’s end-to-end exactly-once processing. This ensures that the data from Flink is written to Delta Tables in an idempotent manner such that even if the Flink pipeline is restarted from its checkpoint information, the pipeline will guarantee no data is lost or duplicated thus preserving the exactly-once semantics of Flink.
by Scott Sandre & Denny Lee
Oracle Active Data Guard: Best Practices and New Features Deep Dive (Glen Hawkins)
Oracle Data Guard and Oracle Active Data Guard have long been the answer for the real-time protection, availability, and usability of Oracle data. This presentation provides an in-depth look at several key new features that will make your life easier and protect your data in new and more flexible ways. Learn how Oracle Active Data Guard 19c has been integrated with Oracle Database In-Memory and offers a faster application response after a role transition. See how DML can now be redirected from an Oracle Active Data Guard standby to its primary for more flexible data protection in today’s data centers or your data clouds. This technical deep dive on Active Data Guard is designed to give you a glimpse into upcoming new features brought to you by Oracle Development.
Frame - Feature Management for Productive Machine Learning (David Stein)
Presented at the ML Platforms Meetup at Pinterest HQ in San Francisco on August 16, 2018.
Abstract: At LinkedIn we observed that much of the complexity in our machine learning applications was in their feature preparation workflows. To address this problem, we built Frame, a shared virtual feature store that provides a unified abstraction layer for accessing features by name. Frame removes the need for feature consumers to deal directly with underlying data sources, which are often different across computing environments. By simplifying feature preparation, Frame has made ML applications at LinkedIn easier to build, modify, and understand.
Introduction to Machine Learning with Azure & Databricks (CCG)
Join CCG and Microsoft for a hands-on demonstration of Azure’s machine learning capabilities. During the workshop, we will:
- Hold a Machine Learning 101 session to explain what machine learning is and how it fits in the analytics landscape
- Demonstrate Azure Databricks’ capabilities for building custom machine learning models
- Take a tour of the Azure Machine Learning’s capabilities for MLOps, Automated Machine Learning, and code-free Machine Learning
By the end of the workshop, you’ll have the tools you need to begin your own journey to AI.
Interactive real-time dashboards on data streams using Kafka, Druid, and Supe... (DataWorks Summit)
When interacting with analytics dashboards, two key requirements for a smooth user experience are quick response time and data freshness. To meet the requirements of creating fast interactive BI dashboards over streaming data, organizations often struggle to select a proper serving layer.
Cluster computing frameworks such as Hadoop or Spark work well for storing large volumes of data, although they are not optimized for making it available for queries in real time. Long query latencies also make these systems suboptimal choices for powering interactive dashboards and BI use cases.
This talk presents an open source real-time data analytics stack using Apache Kafka, Druid, and Superset. The stack combines the low-latency streaming and processing capabilities of Kafka with Druid, which enables immediate exploration and provides low-latency queries over the ingested data streams. Superset provides the visualization and dashboarding that integrates nicely with Druid. In this talk we will discuss why this architecture is well suited to interactive applications over streaming data, present an end-to-end demo of the complete stack, discuss its key features, and review performance characteristics from real-world use cases. (Nishant Bangarwa, Software Engineer, Hortonworks)
Video and slides synchronized, mp3 and slide download available at https://bit.ly/2OUz6dt.
Chris Riccomini talks about the current state-of-the-art in data pipelines and data warehousing, and shares some of the solutions to current problems dealing with data streaming and warehousing. Filmed at qconsf.com.
Chris Riccomini works as a Software Engineer at WePay.
GraphFrames: DataFrame-based graphs for Apache® Spark™ (Databricks)
These slides support the GraphFrames: DataFrame-based graphs for Apache Spark webinar. In this webinar, the developers of the GraphFrames package will give an overview, a live demo, and a discussion of design decisions and future plans. This talk will be generally accessible, covering major improvements from GraphX and providing resources for getting started. A running example of analyzing flight delays will be used to explain the range of GraphFrame functionality: simple SQL and graph queries, motif finding, and powerful graph algorithms.
PostgreSQL + Kafka: The Delight of Change Data Capture (Jeff Klukas)
PostgreSQL is an open source relational database. Kafka is an open source log-based messaging system. Because both systems are powerful and flexible, they’re devouring whole categories of infrastructure. And they’re even better together.
In this talk, you’ll learn about commit logs and how that fundamental data structure underlies both PostgreSQL and Kafka. We’ll use that basis to understand what Kafka is, what advantages it has over traditional messaging systems, and why it’s perfect for modeling database tables as streams. From there, we’ll introduce the concept of change data capture (CDC) and run a live demo of Bottled Water, an open source CDC pipeline, watching INSERT, UPDATE, and DELETE operations in PostgreSQL stream into Kafka. We’ll wrap up with a discussion of use cases for this pipeline: messaging between systems with transactional guarantees, transmitting database changes to a data warehouse, and stream processing.
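Bottled Water’s exact wire format is beyond the scope of this abstract, but the consuming side of such a CDC pipeline can be sketched. A minimal, hedged example assuming JSON-encoded change events on a topic named pg_changes and the kafka-python client (both assumptions for illustration):

```python
import json
from kafka import KafkaConsumer

# Subscribe to the topic carrying the change-data-capture stream.
consumer = KafkaConsumer(
    "pg_changes",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

# Each event describes one INSERT, UPDATE, or DELETE on a PostgreSQL row;
# a downstream warehouse loader or stream processor would consume it here.
for message in consumer:
    print(message.value)
```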
Databricks Meetup @ Los Angeles Apache Spark User Group (Paco Nathan)
Los Angeles Apache Spark Users Group 2014-12-11 http://meetup.com/Los-Angeles-Apache-Spark-Users-Group/events/218748643/
A look ahead at Spark Streaming in Spark 1.2 and beyond, with case studies, demos, plus an overview of approximation algorithms that are useful for real-time analytics.
Data Engineering: A Deep Dive into Databricks (Knoldus Inc.)
During this session, you'll gain a comprehensive understanding of Databricks' capabilities for efficiently processing and managing data, with a focus on Apache Spark for data transformation. We'll cover data ingestion methods, storage, orchestration, and best practices to ensure your data engineering workflows are optimized for success.
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming (Paco Nathan)
London Spark Meetup 2014-11-11 @Skimlinks
http://www.meetup.com/Spark-London/events/217362972/
To paraphrase the immortal crooner Don Ho: "Tiny Batches, in the wine, make me happy, make me feel fine." http://youtu.be/mlCiDEXuxxA
Apache Spark provides support for streaming use cases, such as real-time analytics on log files, by leveraging a model called discretized streams (D-Streams). These “micro batch” computations operate on small time intervals, generally from 500 milliseconds up. One major innovation of Spark Streaming is that it leverages a unified engine. In other words, the same business logic can be used across multiple use cases: streaming, but also interactive, iterative, machine learning, etc.
This talk will compare case studies for production deployments of Spark Streaming, emerging design patterns for integration with popular complementary OSS frameworks, plus some of the more advanced features such as approximation algorithms, and take a look at what's ahead — including the new Python support for Spark Streaming that will be in the upcoming 1.2 release.
Also, let's chat a bit about the new Databricks + O'Reilly developer certification for Apache Spark…
Interest is growing in the Apache Spark community in using Deep Learning techniques and in the Deep Learning community in scaling algorithms with Apache Spark. A few of them to note include:
· Databricks’ efforts in scaling deep learning with Spark
· Intel announcing BigDL, a deep learning library for Spark
· Yahoo’s recent efforts to open-source TensorFlowOnSpark
In this lecture we will discuss the key use cases and developments that have emerged in the last year in using Deep Learning techniques with Spark.
ApacheCon 2021 Apache Deep Learning 302 (Timothy Spann)
ApacheCon 2021 Apache Deep Learning 302
Tuesday 18:00 UTC
Apache Deep Learning 302
Timothy Spann
This talk will discuss and show examples of using Apache Hadoop, Apache Kudu, Apache Flink, Apache Hive, Apache MXNet, Apache OpenNLP, Apache NiFi and Apache Spark for deep learning applications. This is the follow-up to previous talks on Apache Deep Learning 101, 201 and 301 at ApacheCon, DataWorks Summit, Strata and other events. As part of this talk, the presenter will walk through using Apache MXNet pre-built models, integrating new open source deep learning libraries with Python and Java, as well as running real-time AI streams from edge devices to servers utilizing Apache NiFi and Apache NiFi - MiNiFi. This talk is geared towards data engineers interested in the basics of architecting deep learning pipelines with open source Apache tools in a big data environment. The presenter will also walk through source code examples available on GitHub and run the code live on Apache NiFi and Apache Flink clusters.
Tim Spann is a Developer Advocate @ StreamNative where he works with Apache NiFi, Apache Pulsar, Apache Flink, Apache MXNet, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a Principal Field Engineer at Cloudera, a senior solutions architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, Data Works Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science.
* https://github.com/tspannhw/ApacheDeepLearning302/
* https://github.com/tspannhw/nifi-djl-processor
* https://github.com/tspannhw/nifi-djlsentimentanalysis-processor
* https://github.com/tspannhw/nifi-djlqa-processor
* https://www.linkedin.com/pulse/2021-schedule-tim-spann/
In the past, emerging technologies took years to mature. In the case of big data, while effective tools are still emerging, the analytics requirements are changing rapidly, leaving businesses to either keep up or be left behind.
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra (Joe Stein)
Slides for the solution we developed, using Mesos, Docker, Kafka, Spark, Cassandra and Solr (DataStax Enterprise Edition), all written in Go, for doing real-time log analysis at scale. Many organizations either need or want log analysis in real time, where you can see within a second what is happening within your entire infrastructure. Today, with the hardware available and the software systems we have in place, you can develop, build and offer these solutions as a service.
Strata NYC 2015 - Supercharging R with Apache Spark (Databricks)
R is the favorite language of many data scientists. In addition to a language and runtime, R is a rich ecosystem of libraries for a wide range of use cases, from statistical inference to data visualization. However, handling large or distributed data with R is challenging, so R is used along with other frameworks and languages by most data scientists. In this mode most of the friction is at the interface of R and the other systems. For example, when data is sampled by a big data platform, results need to be transferred to and imported into R as native data structures. In this talk we show an alternative, and complementary, approach to SparkR for integrating Spark and R.
Since SparkR was released in version 1.4 of Apache Spark, distributed data remains inside the JVM instead of individual R processes running on workers. This approach is more convenient when dealing with external data sources such as Cassandra, Hive, and Spark’s own distributed DataFrames. We show two specific techniques to remove the data transfer friction between R and the JVM: collecting Spark DataFrames as R data frames, and user-space filesystems. We think this model complements and improves the day-to-day workload of many data scientists who use R. Spark’s interactive query processing, especially with in-memory datasets, closely matches the R interactive session model. When integrated together, Spark and R can provide state-of-the-art tools for the entire end-to-end data science pipeline. We will show how such a pipeline works in real-world use cases in a live demo at the end of the talk.
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ... (DataStax Academy)
We will be talking about the solution we developed, using Mesos, Docker, Kafka, Spark, Cassandra and Solr (DataStax Enterprise Edition), all written in Go, for doing real-time log analysis at scale. Many organizations either need or want log analysis in real time, where you can see within a second what is happening within your entire infrastructure. Today, with the hardware available and the software systems we have in place, you can develop, build and offer these solutions as a service.
Enabling exploratory data science with Spark and R (Databricks)
R is a favorite language of many data scientists. In addition to a language and runtime, R is a rich ecosystem of libraries for a wide range of use cases, from statistical inference to data visualization. However, handling large datasets with R is challenging, especially when data scientists use R with frameworks or tools written in other languages. In this mode most of the friction is at the interface of R and the other systems. For example, when data is sampled by a big data platform, results need to be transferred to and imported into R as native data structures. In this talk we show how SparkR solves these problems to enable a much smoother experience. We will present an overview of the SparkR architecture, including how data and control are transferred between R and the JVM. This knowledge will help data scientists make better decisions when using SparkR. We will demo and explain some of the existing and supported use cases with real, large datasets inside a notebook environment. The demonstration will emphasize how Spark clusters, R and interactive notebook environments, such as Jupyter or Databricks, facilitate exploratory analysis of large data.
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE... (Michael Rys)
This presentation shows how you can build solutions that follow the modern data warehouse architecture and introduces the .NET for Apache Spark support (https://dot.net/spark, https://github.com/dotnet/spark)
Writing Apache Spark and Apache Flink Applications Using Apache Bahir (Luciano Resende)
Big data is all about being able to access and process data in various formats, and from various sources. Apache Bahir provides extensions to distributed analytic platforms, giving them access to different data sources. In this talk we will introduce you to Apache Bahir and its various connectors available for Apache Spark and Apache Flink. We will also go over the details of how to build, test and deploy a Spark application using the MQTT data source for the new Apache Spark 2.0 Structured Streaming functionality.
Autonomous medical coding with discriminative transformers (Patrick Nicolas)
Application of transformers and deep learning to the extraction of medical codes and insurance claims from electronic health records. This presentation lists modeling challenges and pitfalls and analyzes various configurations of the BERT encoder. It compares techniques for pre-training and fine-tuning, run in the context of classification.
Comparison of rule-based/ontology systems and machine learning models for the extraction of insights from electronic health records and related charts, covering both inference and prediction.
Non-linear classification models commonly rely on kernel functions. Models are highly dependent on the training (labeled) data set, so models, and therefore their underlying kernels, have to adapt to the most recent labeled observations.
This presentation describes a solution to automate the evaluation and selection of a kernel function appropriate to a specific training set in online training.
This presentation describes some key features of Scala used in the creation of machine learning algorithms:
1. Functorial definition of tensors for learning non-linear models (manifolds)
2. Monads to compose explicit kernel functions in Euclidean space
3. Implicit classes to extend the Scala standard library
4. Stackable traits and dependency injection to build formal models and dynamic workflows
5. Tail recursion to implement dynamic programming techniques
6. Streaming to reduce memory consumption for big data
7. Control of back pressure in data flows
http://patricknicolas.blogspot.com
http://bit.ly/12GjRu9
Stock Market Prediction using Hidden Markov Models and Investor sentiment (Patrick Nicolas)
This presentation describes hidden Markov models used to predict financial market indices using the weekly sentiment survey from the American Association of Individual Investors.
The first section describes the hidden Markov model (HMM), followed by selection of features (investors' sentiment) and labeled data (S&P 500 index).
The second section dives into HMMs for continuous observations and the detection of regime shifts/structural breaks using an auto-regressive Markov chain.
The last section is devoted to alternative models to HMM.
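The talk’s own implementation isn’t reproduced here, but the core idea of fitting an HMM with Gaussian emissions to a weekly sentiment/return series can be sketched with the hmmlearn library. The synthetic features below stand in for the AAII survey and S&P 500 data; the three hidden states are a stand-in for market regimes:

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(42)

# Rows: weeks. Columns: bullish-sentiment ratio and weekly index return
# (synthetic stand-ins for the AAII survey and the S&P 500).
observations = np.column_stack([
    rng.uniform(0.2, 0.6, size=200),
    rng.normal(0.001, 0.02, size=200),
])

# Three hidden states as a stand-in for market regimes (bull/bear/neutral).
model = GaussianHMM(n_components=3, covariance_type="full", n_iter=100)
model.fit(observations)

# Viterbi decoding yields the most likely regime for each week.
regimes = model.predict(observations)
```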
Adaptive Intrusion Detection Using Learning Classifiers (Patrick Nicolas)
This is an introduction to adaptive intrusion detection systems using rules-based learning classifiers. After listing the limitations of current clustering and supervised learning techniques, the presentation describes a new class of learning algorithms, combining genetic algorithms and reinforcement learning, used for detecting and preventing intrusion in computer networks and data centers, where security policies are constantly upgraded or downgraded to adjust to an ever-changing IT environment, organization and regulations.
This is an introduction to the concept of symbolic regression for effectively managing data streams. Symbolic regression combines genetic algorithms, reinforcement learning and flexible policies to extract meaning or knowledge from data in an ever-changing environment. As the knowledge extracted from real-time data is human-readable and consumable, decision makers can validate the findings of the algorithm and act appropriately. Symbolic regression is used in signal processing, process monitoring and adaptive caching in data centers.
This is an introduction to an algorithm and methodology to extract semantics from one or several documents using natural language processing and machine learning techniques. The presentation describes the different components of the semantic analyzer, using Wikipedia and DBpedia as data sets.
There is a lot more to Hadoop than MapReduce. An increasing number of engineers and researchers involved in processing and analyzing large amounts of data regard Hadoop as an ever-expanding ecosystem of open source libraries, including NoSQL, scripting and analytics tools.
This presentation introduces the different modes of deployment of applications on a private cloud. Each solution is evaluated in terms of access control, performance and scalability.
Adjusting primitives for graph: SHORT REPORT / NOTES (Subhajit Sahu)
Notes on graph algorithms, like PageRank. Compressed Sparse Row (CSR) is an adjacency-list-based graph representation that is…
Multiply with different modes (map)
1. Performance of sequential vs OpenMP-based vector multiply.
2. Comparing various launch configs for CUDA-based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential vs OpenMP-based vector element sum.
2. Performance of memcpy vs in-place CUDA-based vector element sum.
3. Comparing various launch configs for CUDA-based vector element sum (memcpy).
4. Comparing various launch configs for CUDA-based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA-based vector element sum (in-place).
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... (pchutichetpong)
M Capital Group (“MCG”) expects demand to keep growing and supply to keep evolving, facilitated by institutional investment rotating out of offices and into work from home (“WFH”), while the need for data storage expands with global internet usage, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
2. Overview
“… and the wise man said, thou shall embrace open source”.
21st century proverb
4. Overview
The world of data scientists accustomed to Python scientific libraries has been shaken up by the emergence of ‘big data’ frameworks such as Apache Hadoop, Spark and Kafka.
This presentation introduces a variant of the 𝜆 architecture and describes the seamless integration of various open source components to train, validate and test deep learning models.
5. Disclaimer
The concept and architecture are versatile enough to accommodate a variety of open source and commercial solutions and services besides the frameworks prescribed in this presentation.
For instance, deep learning frameworks such as Keras or TensorFlow are excellent alternatives to PyTorch.
6. Requirements
A ‘big data’ framework should be able to:
• Process batch and stream data concurrently
• Enforce data immutability
• Recover gracefully from human errors
• Handle hardware failures
• Minimize latency for real-time requests
• Scale to very large data sets
• Optimize the full life cycle of data sets
• Guarantee the quality and integrity of data
7. Optimizing the data life cycle
The need for optimizing the data life cycle: 79% of a data scientist’s time is spent collecting and organizing data (source: Quora).
8. Data quality
Guaranteeing data quality and integrity:
• Accuracy: correct models and representative data
• Completeness: no missing data
• Consistency: applied to semantics and format
• Timeliness: up-to-date data and notifications
• Accessibility: ease of use and high availability
• Validity: compliance with constraints, rules and regulations
9. Solution …
The 𝜆-architecture is a large-scale data processing architecture that balances batch and real-time streamed data.
It is a one-stop shop for various data sources that balances latency, redundancy, ease of access and throughput.
It breaks down into 3 layers:
• Speed (streaming, real-time, …)
• Batch (training, analysis, …)
• Serving (query, visualization, …)
10. … using open source
A 𝜆 architecture using open source components?
The task consists of reviewing and evaluating the trove of available open source libraries to build a robust architecture that supports the rigor of training and tuning deep learning models.
The libraries are woven together through a set of language-agnostic REST APIs to form a coherent pipeline.
11. … for deep learning
A 𝜆 architecture for deep learning?
• Python scientific libraries have been the go-to tools for data scientists to analyze data and build models.
• The PyTorch framework builds on these libraries to support the design and execution of deep learning models.
• Apache Spark and Kafka complement these frameworks for very large data sets and real-time processing.
12. Bird’s-eye view
Feel overwhelmed? … Let’s break it down.
(Diagram: example open source 𝜆 architecture)
14. Batch layer
Batch layer objective: load batches of data to be distributed and preprocessed in order to train deep learning models.
15. Batch layer
Typical use case:
1. Apache Spark loads the training set from Amazon S3
2. The Spark master partitions the training data
3. Spark workers preprocess the data and notify completion through a Kafka event queue
4. PyTorch updates model parameters from the preprocessed training data
5. PyTorch broadcasts model parameters and quality metrics through Kafka
6. Apache Hive, powered by Spark, stores model-related data and metrics
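A minimal sketch of steps 1 through 3, assuming the training set is CSV on S3, PySpark with the kafka-python client installed, and a Kafka topic named training-events (bucket, paths and topic are all illustrative):

```python
import json
from kafka import KafkaProducer
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("batch-layer").getOrCreate()

# 1. Apache Spark loads the training set from Amazon S3.
raw = spark.read.csv("s3a://my-bucket/training/", header=True, inferSchema=True)

# 2. The master partitions the training data across the workers.
partitioned = raw.repartition(64)

# 3. Workers preprocess the data (a trivial filter as a stand-in), persist
#    the result, then signal completion through the Kafka event queue.
prepared = partitioned.filter(col("label").isNotNull())
prepared.write.mode("overwrite").parquet("s3a://my-bucket/preprocessed/")

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("training-events", json.dumps({"status": "preprocessed"}).encode())
producer.flush()
```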
16. Speed layer
Speed layer objective: process queries to predictive models with very low latency.
17. Speed layer
Use case:
1. Kafka routes data streams to the Spark master
2. Spark pre-processes requests and forwards them to the deep model micro-service
3. Flask converts each request into a prediction query to the PyTorch model
4. The PyTorch model generates a prediction
5. Run-time metrics are broadcast through Kafka
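Steps 3 and 4 can be as small as a single Flask route. A hedged sketch assuming a TorchScript model saved as model.pt and a JSON body carrying a feature vector (both are illustrative assumptions, not part of the original deck):

```python
import torch
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the trained model once at startup (path is illustrative).
model = torch.jit.load("model.pt")
model.eval()

@app.route("/predict", methods=["POST"])
def predict():
    # Step 3: convert the incoming request into a prediction query.
    features = torch.tensor(request.get_json()["features"]).unsqueeze(0)
    # Step 4: the PyTorch model generates the prediction.
    with torch.no_grad():
        score = model(features).squeeze().tolist()
    return jsonify({"prediction": score})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```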
18. Serving layer
Serving layer objective: process queries to analyze data and model performance, and execute statistical inference.
19. Serving layer
Use case:
1. An analyst queries the relational database, MySQL, for the most recent data and statistics using the FineReport UI (low latency)
2. The analyst queries the Hive data warehouse asynchronously for archived data and statistics (high latency)
3. Hive processes queries through Spark datasets
4. Spark regularly updates the short-term data in MySQL
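Steps 3 and 4 might look like the following in PySpark, assuming a Hive-enabled Spark session, a Hive table named model_metrics, and a MySQL JDBC endpoint (the table, connection details and query are illustrative):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("serving-layer")
         .enableHiveSupport()   # lets Spark execute queries over Hive tables
         .getOrCreate())

# 3. Hive processes the analytical query through Spark datasets.
recent = spark.sql("""
    SELECT model_id, AVG(accuracy) AS avg_accuracy
    FROM model_metrics
    WHERE run_date >= date_sub(current_date(), 7)
    GROUP BY model_id
""")

# 4. Spark regularly refreshes the short-term MySQL store that backs
#    the low-latency FineReport dashboards.
(recent.write.format("jdbc")
 .option("url", "jdbc:mysql://localhost:3306/analytics")
 .option("dbtable", "recent_model_metrics")
 .option("user", "analyst")
 .option("password", "secret")
 .mode("overwrite")
 .save())
```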
21. PyTorch
PyTorch is an optimized tensor library for deep learning using GPUs and CPUs.
It extends the functionality of NumPy and Scikit-learn to support the training, evaluation and commercialization of complex machine learning models.
https://pytorch.org/tutorials/
Alternatives:
TensorFlow: https://www.tensorflow.org/
Keras: https://keras.io
MXNet: https://mxnet.apache.org
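A minimal training-loop sketch in PyTorch, with synthetic data standing in for the preprocessed training set produced by the batch layer:

```python
import torch
from torch import nn

# Synthetic stand-in for a preprocessed training set (256 samples, 10 features).
X = torch.randn(256, 10)
y = torch.randint(0, 2, (256,)).float()

model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(20):
    optimizer.zero_grad()
    loss = loss_fn(model(X).squeeze(1), y)
    loss.backward()
    optimizer.step()   # update model parameters, as in step 4 of the batch layer
```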
22. Apache Spark
Apache Spark is an open source cluster computing framework for fast, real-time processing.
It supports the Scala, Java, Python and R programming languages and includes streaming, graph and machine learning libraries.
https://www.scala-lang.org
https://spark.apache.org
Alternative:
PySpark: https://databricks.com/glossary/pyspark
23. Streaming
Apache Kafka is an open-source distributed event streaming framework for large-scale, real-time data processing and analytics.
It captures data from various sources in real time as a continuous flow and routes it to the appropriate processor.
https://kafka.apache.org
Alternatives:
Amazon SQS: https://aws.amazon.com/sqs/
RabbitMQ: https://www.rabbitmq.com
24. Model tuning
Ray Tune is a distributed hyper-parameter tuning framework particularly suitable for deep learning models.
It significantly reduces the cost of optimizing the configuration of a model, and acts as a wrapper around other open source libraries.
https://docs.ray.io/en/master/tune/index.html
Alternatives:
Amazon SageMaker: https://aws.amazon.com/sagemaker/
HyperOpt: https://github.com/hyperopt/hyperopt
Optuna: https://optuna.readthedocs.io
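A hedged sketch using Ray Tune’s function-based API of that period; the trainable and search space are illustrative stand-ins for training the actual PyTorch model and reporting its validation loss:

```python
from ray import tune

def trainable(config):
    # Stand-in objective: in practice, train the PyTorch model here and
    # report its validation loss for the sampled hyper-parameters.
    loss = (config["lr"] - 0.01) ** 2
    tune.report(loss=loss)

analysis = tune.run(
    trainable,
    config={"lr": tune.loguniform(1e-4, 1e-1)},  # search space
    num_samples=20,                              # sampled configurations
)
print(analysis.get_best_config(metric="loss", mode="min"))
```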
25. Python REST service
Flask is an easy-to-use implementation of the RESTful interface for Python applications.
It supports most web and deployment standards, such as Docker, React.js, Angular, HTML5 and WSGI containers.
https://palletsprojects.com/p/flask/
Alternatives:
Falcon: https://falcon.readthedocs.io
FastAPI: https://fastapi.tiangolo.com
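As a usage note, a client call against the prediction micro-service sketched in the speed layer section could look as follows (endpoint and payload are the illustrative ones introduced there, using the requests library):

```python
import requests

# Query the Flask prediction micro-service from the speed layer sketch.
response = requests.post(
    "http://localhost:5000/predict",
    json={"features": [0.1, 0.5, -1.2, 0.0, 0.3, 1.1, -0.7, 0.9, 0.2, -0.4]},
)
print(response.json())  # e.g. {"prediction": ...}
```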
26. RDBMS
MySQL is an open source relational database supporting partitioning, sharding and replication. It can be extended with real-time analytics (HeatWave) and enterprise clustering (CGE).
https://www.mysql.com
Alternatives:
PostgreSQL: https://www.postgresql.org
HyperSQL: http://www.hsqldb.org
Amazon RDS: http://aws.amazon.com/rds
27. Data warehouse
Apache Hive is a data warehouse framework that leverages Spark to execute highly distributed SQL queries.
It optimizes SQL queries through lazy evaluation of an acyclic execution graph, and is integrated with Spark datasets and HDFS.
https://hive.apache.org
Alternatives:
Vertica: http://www.vertica.com
Amazon Redshift: https://aws.amazon.com/redshift/
28. Dashboard
FineReport is a business intelligence and dashboard tool that supports real-time analytics, reporting and visualization. It accommodates the needs of business managers and data scientists.
https://www.finereport.com
Alternatives:
Sisense: https://www.sisense.com
Tableau: https://www.tableau.com
29. Final disclaimer
This presentation is not an endorsement of the various tools, libraries or frameworks described or suggested here.
Although the tools listed in these slides are known to work in the context of the architecture, there are excellent alternative libraries that may better meet your specific needs.
30. Thank you!
Q&A