This whitepaper describes how big data engines are used for exploring and preparing data, building pipelines, and delivering data sets to ML applications.
https://www.qubole.com/resources/white-papers/big-data-engineering-for-machine-learning
BIG DATA ENGINEERING FOR MACHINE LEARNING
AN INTRODUCTION TO COMMONLY USED ENGINES AND FRAMEWORKS
by Jorge Villamariona, Qubole Technical Marketing
White Paper
TABLE OF CONTENTS
INTRODUCTION .................................................................................... 3
ENGINES FOR BUILDING MACHINE LEARNING (ML) DATA PIPELINES ......... 4
EXPLORING YOUR DATA ........................................................................ 6
BUILDING ROBUST DATA PIPELINES ....................................................... 7
Pipelines for batch processing ............................................................... 7
Pipelines for streaming data .................................................................. 7
ORCHESTRATING DATA PIPELINES ......................................................... 9
DELIVERING DATA SETS ....................................................................... 10
TAKEAWAYS ........................................................................................ 11
3. INTRODUCTION
Even as individual consumers of goods and services we get to experience the results of Machine Learning when it is
used by the institutions we rely on to conduct our daily activities. We may have experienced a text message from a bank
requiring verification right after the bank has paused a credit card transaction. Or, on the more positive side, an online
travel site may have sent us an email that offers the perfect accommodations for our next personal or business trip.
What is much more difficult to appreciate is the work that happens behind the scenes to make these experiences possible.
Yes, data science teams build the applications that lead to these outcomes, but their work relies on massive data sets
supplied by data engineering teams. Those data sets are then used to train, build, and test the data science models that
in turn drive these outcomes.
Data engineering teams are nothing new; they have been around for several decades. Their role has recently been
extended from building data pipelines that only support traditional data warehouses to also building more technically
demanding continuous data pipelines that feed today's applications built on Artificial Intelligence and Machine
Learning algorithms.
These data pipelines must be cost effective, fast, and reliable regardless of the type of workload and use case. This
document covers the most popular engines used to build these pipelines. It delineates the synergies between data
engineering and data science teams. It provides insights on how Qubole customers build their pipelines leveraging the
steps outlined below:
In the final section this document provides a guide to help decide which engine to use based on the business need
and type of workload.
EXPLORE → BUILD → ORCHESTRATE → DELIVER
ENGINES FOR BUILDING MACHINE LEARNING (ML) DATA PIPELINES
Due to the diversity of data sources, and the volume of data that needs to be processed, traditional data processing
tools fail to meet the performance and reliability requirements for modern machine learning and advanced analytics
applications. The need to build reliable pipelines that can handle these workloads, coupled with advances in
distributed high-performance computing, gave rise to data lake processing engines such as Hadoop. Let's quickly
review the different engines and frameworks often used in data engineering aimed at supporting ML efforts:
Apache Hadoop/Hive
Hive is an Apache open-source project built on top of Hadoop for querying, summarizing
and analyzing large data sets using a SQL-like interface. It is noted for bringing the
familiarity of relational technology to data lake processing with its Hive Query Language, as
well as structures and operations comparable to those used by relational databases such
as tables, joins and partitions. Apache Hive is used mostly for batch processing of large
ETL jobs and batch SQL queries on very large data sets as well as exploration on large
volumes of structured, semi-structured, and unstructured data. Hive includes a Metastore
which provides schemas and statistics that are useful for optimizing queries and data
exploration. Hive is Distributed, Scalable, Fault Tolerant, Flexible and it can persist
large data sets on cloud file systems. It is important to note that Hive was created by the
founders of Qubole while working at Facebook’s data team. Qubole provides an enhanced,
cloud-optimized, self-managing, and self-optimizing implementation of Apache Hive.
Qubole’s implementation of Hive leverages AIR (Alerts, Insights, Recommendations) and
allows data teams to focus on generating business value from data rather than managing
the platform. Qubole Hive seamlessly integrates with existing data sources and third-party
tools, while providing best-in-class security.
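To make the batch SQL pattern above concrete, here is a minimal sketch that creates a partitioned external table over files already sitting in cloud storage and runs a batch aggregation through HiveServer2 using the PyHive client. The host, credentials, table, columns, and S3 location are illustrative assumptions, not details taken from this paper.

from pyhive import hive  # assumes the PyHive client and a reachable HiveServer2 endpoint

conn = hive.connect(host="hive.example.internal", port=10000, username="etl_user")
cur = conn.cursor()

# External, partitioned table over files that already live in cloud storage.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_events (
        order_id STRING,
        customer STRING,
        amount   DOUBLE
    )
    PARTITIONED BY (event_date STRING)
    STORED AS ORC
    LOCATION 's3://example-bucket/sales_events/'
""")

# Typical batch query: aggregate one day's partition.
cur.execute("""
    SELECT customer, SUM(amount) AS total_spend
    FROM sales_events
    WHERE event_date = '2020-01-01'
    GROUP BY customer
""")
for customer, total_spend in cur.fetchall():
    print(customer, total_spend)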
Apache Spark
Spark is a general purpose open-source computational engine for Hadoop data. Spark
provides a simple and expressive programming model that supports a wide range of
applications, including ETL, machine learning, stream processing, and graph computation.
Spark is also Distributed, Scalable, Fault Tolerant, Flexible, and because of its in-
memory processing capabilities, it is also Fast. Spark natively supports a wide array of
programming languages, including Java, Python, R, and Scala. Spark includes native
integration with a number of data persistence solutions such as HDFS. Because Spark is
memory intensive, it may not be the most cost-effective engine for all use cases. Qubole’s
Spark implementation greatly improves the performance of Spark workloads with
enhancements such as fast storage, distributed caching, advanced indexing, and metadata
caching capabilities. Other enhancements include job isolation on multi-tenant clusters and
SparkLens, an open source Spark profiler that provides insights into the Spark application.
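To make Spark's programming model concrete, the following PySpark sketch performs a small batch ETL job: read raw JSON, deduplicate, derive a date column, and write partitioned Parquet back to storage. The paths and column names are illustrative assumptions.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-example").getOrCreate()

# Extract: read raw events from cloud storage.
raw = spark.read.json("s3://example-bucket/raw/events/")

# Transform: deduplicate, derive a date column, and drop bad records.
cleaned = (
    raw.dropDuplicates(["event_id"])
       .withColumn("event_date", F.to_date("event_ts"))
       .filter(F.col("amount") > 0)
)

# Load: write the curated result back out, partitioned by date.
cleaned.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://example-bucket/curated/events/"
)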
Presto
Presto is an open-source SQL query engine developed by Facebook. Presto is used for
running interactive analytic queries against data sources of all sizes ranging from gigabytes
to petabytes. Presto was built to provide SQL query capabilities against disparate data
sources, which allows Presto users to combine data from multiple data sources in one query.
This is known as "Federated Queries". Presto is tuned like an MPP (Massively Parallel
Processing) SQL engine and is optimized for SQL execution. Since the questions are often
ad hoc, there is some trial and error involved; arriving at the final results may involve a series
of SQL queries. Presto offers connectors to work directly on files that reside on the file
system (e.g. S3, Azure storage) and can join terabytes of data in seconds, or cache queries
intermittently for rapid response upon later runs. Qubole Presto also expedites
performance by reordering the execution sequence of query joins based on table statistics.
By optimizing the response time of these queries, Presto accelerates the time to insight to
better serve your business needs. Presto is also Distributed, Scalable, and Flexible.
Presto is also Fast due to its in-memory processing and it can be ready in just a few minutes.
Qubole has optimized Presto for the cloud. Qubole’s enhancements allow for dynamic
cluster sizing based on workload and termination of idle clusters — ensuring high reliability
while reducing compute costs. Qubole’s Presto clusters support multi-tenancy and provide
logs and metrics to track performance of queries.
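As a sketch of what a "federated query" looks like in practice, the example below uses the PyHive Presto client to join a table in the data lake (hive catalog) with a table in an operational database (mysql catalog) in a single statement. The coordinator host, catalogs, schemas, and table names are illustrative assumptions.

from pyhive import presto  # assumes the PyHive client and a reachable Presto coordinator

conn = presto.connect(host="presto.example.internal", port=8080, username="analyst")
cur = conn.cursor()

# One query spans two catalogs: data lake tables joined with an operational MySQL table.
cur.execute("""
    SELECT c.region, SUM(s.amount) AS revenue
    FROM hive.default.sales_events AS s
    JOIN mysql.crm.customers AS c ON s.customer = c.customer_id
    GROUP BY c.region
    ORDER BY revenue DESC
""")
print(cur.fetchall())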
Airflow
While technically not a big data engine, Airflow is an open-source tool to programmatically
author, schedule and monitor data workflows. With Airflow, users can author workflows
as directed acyclic graphs (DAGs) of tasks. A DAG is the set of tasks needed to complete a
pipeline organized to reflect their relationships and interdependencies. Airflow’s rich user
interface makes it easy to visualize pipelines running in production, monitor progress, and
troubleshoot issues when needed. It connects out-of-the-box with multiple data sources, it
can alert via email or slack when a task completes or fails. Because workflows are defined as
code, they become more maintainable, version-able, testable, and collaborative. Airflow is
Distributed, Scalable and Flexible which makes it well suited to handle the orchestration
of complex business logic. Qubole provides its users an enhanced, cloud-optimized version
of Apache Airflow. Qubole provides single-click deployment of Airflow, automates cluster and
configuration management, and includes dashboards to visualize Airflow DAGs.
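A minimal sketch of a workflow defined as a DAG follows. It assumes a recent Airflow release (the import path shown is the Airflow 2.x one) and uses placeholder shell commands, since the point is the dependency structure rather than the tasks themselves.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator  # Airflow 2.x import path

with DAG(
    dag_id="daily_sales_pipeline",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    transform = BashOperator(task_id="transform", bash_command="echo transforming")
    load = BashOperator(task_id="load", bash_command="echo loading")

    # The DAG itself: extract runs first, then transform, then load.
    extract >> transform >> load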
Besides providing these scalable engines and frameworks, it is important to mention that
Qubole also partners with third-party ETL/ELT vendors that offer visual tools that facilitate
building data pipelines. These tools are very powerful, and many teams leverage them to
expedite the work with the engines listed above.
EXPLORING YOUR DATA
Data engineering always starts with data exploration. Data
exploration is inspecting the data in order to understand its
characteristics and what it represents. At the end of the exploration
stage data engineers should understand the structure of the data,
its volume, its patterns as well as its quality (e.g. are all the records
complete? are there duplicate records?). The learnings acquired
during this stage will impact the amount and type of work that the
data scientist will do during their data preparation phase.
Because SQL is a widely used standard for ad hoc queries, it is a good tool for users to start interacting with structured
datasets, especially larger structured datasets (in the terabytes) that need to be persisted. Because of its inexpensive
storage and its compatibility with SQL, Hadoop/Hive is an excellent choice in this case.
Spark works well for data sets that require a programmatic approach -- for example, when a hierarchical tree needs to
be traversed just to read the data, as is the case with the 835 and 837 file formats widely used in healthcare insurance
processing. Spark also offers facilities (Python, R, Scala) that can be used to quickly understand the nature and
statistical distribution of the data (such as a five-number summary).
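A small PySpark profiling sketch along those lines is shown below; the input path and column name are illustrative assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("explore").getOrCreate()
df = spark.read.parquet("s3://example-bucket/curated/claims/")

df.printSchema()              # structure of the data
print(df.count())             # volume
df.describe("amount").show()  # count, mean, stddev, min, max for one column
df.summary().show()           # adds approximate quartiles (a five-number summary)

# A quick quality check: how many exact duplicate rows are there?
print(df.count() - df.dropDuplicates().count())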
On the other hand, Presto provides a quick and easy way to allow access to data from a variety of sources using industry
standard SQL query language. Users don't have to learn any new complex language or tool; they can simply use
existing tools with which they are already comfortable. For data exploration, Presto is very similar to Hive, but Presto
offers the advantage of speed because of its in-memory processing.
Because fault tolerance is not critical during this phase, all three engines can be used for data exploration. Thus, the decision
regarding which engine to use will depend on the nature of your data, the use case and the team’s skill set. Hive may be
more economical for larger datasets while Presto offers a significant speed advantage for quick interactive exploratory
queries to better understand the data. Spark may be a better option if a programmatic facility is best suited for
exploration. The insert below summarizes how Qubole customers often leverage our engines for data exploration.
Explore
• Hive: with SQL on larger datasets (petabytes)
• Spark: using SQL or programmatic constructs
• Presto: with SQL when interactivity and response time are important
BUILDING ROBUST DATA PIPELINES
Data pipelines carry and process data from the data sources to the BI and ML applications that take advantage of it.
These pipelines consist of multiple steps: reading data, moving it from one system to the next, reformatting it, joining it
with other data sources and also adding derived columns (feature engineering). These new derived fields could
represent characteristics such as “Age” or “Age Group” derived from date of birth or “Deal Size” derived from the
amount field on a sales record. These derived fields support different data science models as well as more in-depth BI.
Data scientists refer to this step as data preparation. A data pipeline includes all of these steps, and its mission is to
ensure that every step happens consistently and reliably for all data sets. Data engineers often distinguish between two
different types of pipelines: batch and streaming. Let’s take a quick look at each type of pipeline:
Pipelines for batch processing
Traditional data engineering workloads consist mostly of "well-defined" data sets, often coming from applications
supporting key business functions such as accounting, inventory management, and CRM systems, as well as "well-defined"
semi-structured or unstructured files such as server logs. By "well-defined" we mean that most of these data sets
remain relatively static between the start and end time of the execution of a pipeline. For "well-defined" data sets, batch
data pipelines are sufficient to support BI and ML applications.
When persistence of large data sets is important, Hive offers diverse computational techniques and it is cost effective.
On the other hand, Spark offers an in-memory computational engine that allows the creation of programs that can read
data, build a pipeline, and export the results, and it may be the better choice if speed of processing is critical.
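The sketch below illustrates such a batch preparation step in Spark, adding derived fields like the "Age", "Age Group", and "Deal Size" columns described earlier; the source path, column names, and bucket boundaries are illustrative assumptions.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("batch-prep").getOrCreate()
sales = spark.read.parquet("s3://example-bucket/curated/sales/")

prepared = (
    sales
    # Derived column: age in whole years from date of birth.
    .withColumn("age", F.floor(F.months_between(F.current_date(), F.col("date_of_birth")) / 12))
    # Derived column: coarse age-group buckets.
    .withColumn("age_group",
                F.when(F.col("age") < 30, "under_30")
                 .when(F.col("age") < 60, "30_to_59")
                 .otherwise("60_plus"))
    # Derived column: deal size derived from the amount field on the sales record.
    .withColumn("deal_size",
                F.when(F.col("amount") >= 100000, "large")
                 .when(F.col("amount") >= 10000, "medium")
                 .otherwise("small"))
)

prepared.write.mode("overwrite").parquet("s3://example-bucket/prepared/sales/")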
Pipelines for streaming data
With the emergence of big data and the Internet of Things (IoT), batch data pipelines, while still necessary and
relevant, are no longer sufficient. Data engineers have to develop streaming data pipelines to deal with the near-real-time
requirements of ML applications built on top of these new data frameworks.
Streaming data is generated on a continuous basis, often by several data sources, and is transmitted simultaneously
and in relatively small packets. Examples of streaming data include call detail records generated by mobile phones, logs
from a web application such as an ecommerce site, social network updates, and eGame activities. This data is processed
sequentially and in near-real time and often while the data is still “in flight” (before it is persisted in a database). In many
streaming use cases the value of the data decreases with time. For example, an alert from a machine on a factory
floor calls for immediate attention, or a potentially fraudulent bank transaction needs to be paused in order for it to be
verified by the account owner before it completes. Spark offers the best facilities for streaming data pipelines.
Spark Streaming brings Apache Spark’s language-integrated API to stream processing, allowing engineers to create
streaming jobs the same way they write batch jobs. Spark Streaming supports Java, Scala and Python.
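A minimal sketch in Python using Spark's newer Structured Streaming API is shown below. It assumes a Kafka source (with the spark-sql-kafka package available on the cluster) and an illustrative topic name, and simply counts events per minute while the data is still in flight.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("stream-example").getOrCreate()

# Read a continuous stream of events from Kafka.
events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "kafka.example.internal:9092")
         .option("subscribe", "clickstream")
         .load()
)

# Count events per one-minute window while the data is "in flight".
counts = (
    events.selectExpr("CAST(value AS STRING) AS payload", "timestamp")
          .groupBy(F.window(F.col("timestamp"), "1 minute"))
          .count()
)

# Write running counts to the console; a real pipeline would target a sink such as
# cloud storage, a database, or a dashboard feed.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()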
Please note that while Spark offers the best facilities for
near-real time data streaming, micro-batching (i.e. batch
processing with smaller time windows such as intervals of
a few minutes) may be a workable and more economical
option leveraging Hive. Hive is also effective for training ML
models with large, representative and persisted datasets.
The insert below outlines how Qubole customers use these engines for building data pipelines.
Build
• Hive: batch pipelines; streaming pipelines via micro-batching; training ML models
• Spark: batch pipelines; near-real-time streaming pipelines; training ML models
ORCHESTRATING DATA PIPELINES
Orchestration of data pipelines refers to the sequencing,
coordination, scheduling and management of complex data
pipelines from diverse sources with the aim of delivering data sets
that are ready for consumption by business intelligence
applications and/or data science ML models that support data
lake applications.
Efficient, cost-effective and well-orchestrated data pipelines help
data scientists come up with better tuned and more accurate ML
models because those models have been trained with complete
data sets, not just small samples.
Because Airflow is natively integrated to work with engines such
as Hive, Presto and Spark, it is an ideal framework to
orchestrate jobs running on any of these engines. Organizations
are increasingly adopting Airflow to orchestrate their ETL/ELT
jobs. At the time of this writing, about 18% of all ETL/ELT jobs on
the Qubole platform leverage Airflow, and the number of DAGs is
increasing 15% month over month. Also, with Qubole-contributed
features such as the Airflow QuboleOperator, customers can
submit commands to Qubole, giving them greater programmatic
flexibility.
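As an illustration, a DAG can submit a Hive command to the Qubole platform through the QuboleOperator roughly as sketched below. The import path varies by Airflow version (the contrib path shown assumes Airflow 1.10), and the query, table names, and cluster label are illustrative assumptions.

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.qubole_operator import QuboleOperator  # Airflow 1.10 path

with DAG(
    dag_id="qubole_daily_aggregates",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Submit a Hive command to Qubole from the orchestration layer.
    daily_totals = QuboleOperator(
        task_id="daily_totals",
        command_type="hivecmd",
        query=(
            "INSERT OVERWRITE TABLE daily_totals "
            "SELECT customer, SUM(amount) FROM sales_events GROUP BY customer"
        ),
        cluster_label="default",
    )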
Airflow is an ideal solution for orchestration of workloads that
follow the batch processing model, as outlined in the insert below.
Orchestrate
• Airflow: batch pipelines
• Orchestration of data pipelines: the sequencing, coordination, scheduling and management of complex data pipelines from diverse sources, with the aim of delivering datasets ready for use
DELIVERING DATA SETS
Qubole’s multi-engine platform allows data engineers to build,
update and refine data pipelines in order to reliably and cost-
effectively deliver those data sets on predefined schedules or
on-demand. Qubole provides the ability to publish data through
notebooks or templates and deliver the data to downstream
advanced analytics and ML applications. The data delivery
stage has the greatest impact on the training, deployment and
operation of ML models as well as the applications built on top
of them.
At Qubole we believe your use case should determine the
delivery engine. For example, if the information is going to
be delivered as a dashboard, or the intention is to probe the
resulting datasets with low-latency SQL queries, then Presto
would be the optimal choice: Presto queries run faster than
Spark queries because Presto does not plan for mid-query
fault tolerance. Spark, on the other hand, supports mid-query
fault tolerance and will recover in case of a failure, but actively
planning for failure impacts Spark's query performance,
especially when no failures actually occur.
Spark also offers excellent support for delivering streamed data,
datasets resulting from long-running (batch) queries that may
require fault tolerance, as well as any dataset that may require
programmatic constructs for interpretation. Common output
destinations for Spark could also be file systems, databases,
dashboards, notebooks and certainly ML/Predictive applications.
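For example, delivering a curated result set from Spark to a relational database for downstream BI consumption can look roughly like the sketch below; the connection details are illustrative assumptions, and the appropriate JDBC driver must be available on the Spark classpath.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("deliver").getOrCreate()
results = spark.read.parquet("s3://example-bucket/prepared/daily_totals/")

# Deliver the data set to a warehouse table that BI tools can query directly.
(results.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://warehouse.example.internal:5432/analytics")
        .option("dbtable", "public.daily_totals")
        .option("user", "loader")
        .option("password", "example-password")
        .mode("overwrite")
        .save())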
If cost-effective persistence of the result set is important, then
Hadoop/Hive is a great choice because it supports a number of
traditional BI tools via ODBC and JDBC. Also, because of its ability to
store large volumes of data, Hadoop/Hive is an ideal platform
for a persisted single view of an entity (customer, patient, asset,
etc.) such as those offered by MDM systems.
All three engines offer distinctive advantages for data delivery.
The insert below shows how Qubole customers often leverage these three engines in this phase.
Deliver
• Hive: larger datasets (petabytes) that need persistence, such as ML training datasets or those that underpin a single view of an entity (e.g. Customer 360 systems); large-scale data cleansing and enrichment processes
• Spark: streaming datasets in near-real time; programmatic (Scala, Python, R) treatment of data workloads; training datasets
• Presto: ad-hoc interactive SQL queries that return data quickly; federated SQL queries joining multiple disparate systems/environments
TAKEAWAYS
Each of the engines covered in this document provides distinctive
advantages depending on the use case. Companies leveraging data lakes
have a wide variety of use cases, and for a given company, or even within a
department, not all of the use cases will be known initially. As
companies become more mature and sophisticated in their data
lake deployments, data engineers will discover more use cases
and opportunities to leverage different engines.
For these reasons, building data pipelines calls for a multi-engine
platform with the ability to auto-scale, such as the one offered by Qubole.
Qubole is an open, simple, and secure data lake platform for machine
learning, streaming analytics, data exploration, and ad-hoc analytics. No
other platform radically simplifies data management, data engineering
and run-time services like Qubole. We enable reliable, secure data access
and collaboration among users while reducing time to value, improving
productivity, and lowering cloud data lake costs from day one. The
Qubole Open Data Lake Platform:
• Provides a unified environment for creation and orchestration of
multiple data pipelines to dynamically build trusted datasets for ad-
hoc analytics, streaming analytics, and machine learning.
• Optimizes and rebalances the underlying multi-cloud infrastructure for
the best financial outcome while supporting unmatched levels of
concurrent users and workloads at any point in time.
• Enables collaboration through workbench between data scientists,
data engineers, and data analysts with shareable notebooks,
queries, and dashboards.
• Improves the security, governance, reliability, and accessibility of
data residing in data lakes.
• Provides APIs and connectors to third-party tools such as Tableau and
Looker for analytics, and RStudio and H2O.ai for machine learning use cases.
Table 1 below encapsulates how Qubole customers apply our different
engines to fulfill different stages of the data engineering function when
building ML data pipelines.
Table 1 - Most Common Engine and Framework usage patterns
This is based on how Qubole customers leverage our engines and frameworks in their ETL/ELT processes.
HIVE
• Explore: with SQL on larger datasets
• Build: batch pipelines; training ML models; streaming pipelines via micro-batching
• Deliver: larger datasets requiring persistence; large-scale cleansing and enriching processes

SPARK
• Explore: with programmatic constructs; with SQL
• Build: batch pipelines; near-real-time streaming pipelines; training ML models
• Deliver: streaming datasets; workloads requiring programmatic constructs; Spark SQL

PRESTO
• Explore: with SQL when interactivity and response time is important
• Deliver: ad-hoc interactive SQL queries that return data quickly; federated SQL queries joining multiple disparate systems

AIRFLOW
• Orchestrate: batch workflows; the sequencing, coordination, scheduling and management of complex data pipelines from diverse sources with the aim of delivering datasets ready for use
Table 2 dives deeper into the characteristics and merits of each tool and it can help you decide which engine to use
depending on your use case and data workload.
Table 2 - Engine Decision Guide

HIVE
• Response time (interactive queries): Slower
• ETL SLA adherence: High
• Fault tolerance: High
• Choice of language: SQL
• Type of workload / use case: batch processing; larger data sets (petabytes)

PRESTO
• Response time (interactive queries): Faster
• ETL SLA adherence: N/A
• Fault tolerance: Limited (Qubole offers retry logic at the query level)
• Choice of language: SQL
• Type of workload / use case: exploration; interactive SQL; BI/analytics

SPARK
• Response time (interactive queries): Faster (Spark clusters take longer to start)
• ETL SLA adherence: High
• Fault tolerance: High
• Choice of language: Scala, R, Python, SQL
• Type of workload / use case: machine learning; batch processing; stream processing; graph processing; interactive programming
About Qubole
Qubole is revolutionizing the way companies activate their data — the process of putting data into active use across their organizations. With Qubole's cloud-native
big data platform, companies exponentially activate petabytes of data faster, for everyone and any use case, while continuously lowering costs. Qubole overcomes
the challenges of expanding users, use cases, and variety and volume of data while constrained by limited budgets and a global shortage of big data skills.
Qubole offers the only platform that delivers freedom of choice, eliminating legacy lock in — use any engine, any tool, and any cloud to match your company’s
needs. Qubole investors include CRV, Harmony Partners, IVP, Lightspeed Venture Partners, Norwest Venture Partners, and Singtel Innov8.
For more information visit www.qubole.com.
469 El Camino Real, Suite 201
Santa Clara, CA 95050
(855) 423-6674 | info@qubole.com
WWW.QUBOLE.COM
In addition to offering several engines that allow our customers to
select the most adequate tool for each job, Qubole is the only
cloud data platform that delivers a multi-cloud, multi-
engine, self-service machine learning and analytics
architecture. It automatically provisions, manages and optimizes
cloud resources balancing cost, workloads, and performance
requirements. Qubole’s open data lake platform enables
orchestration and execution of all types of data engineering tasks
whether it is data exploration, building of data pipelines,
orchestration or data delivery.
When building a house, you would
choose different tools for different
tasks; it is impossible to build a
house using only one tool.
Similarly, when building data
pipelines, you should choose
the optimal big data engine by
considering your specific use case
and the specific business needs of
your company or department.