This presentation describes four high performance computing techniques. Each technique was applied to answer one particular question related to the NYC Yellow Taxi Data Set.
High Performance Computing on NYC Yellow Taxi Data Set
1. EXTRACTING INSIGHTS FROM BIG DATA: A CASE OF NEW YORK CITY YELLOW TAXI DATASET
Parag Ahire
January 11, 2020
2. PRESENTATION OUTLINE
Brief introduction
• Big Data & high performance computing
• Describe a few techniques for high performance computing
Compare and contrast a few techniques
Sample Dataset introduction – NYC Yellow Taxi
Apply four techniques
• High-level code review
• Demonstration on Hortonworks virtual machine on Azure
3. BIG DATA
Data is growing
• Digital Age : 2002 onwards
• 2019 – 1770 Exabytes
• 2020 – 2000 Exabytes
What is it?
• A data set that cannot be processed by a “normal” machine in a “reasonable” amount of time
• Three V’s
Volume
Velocity
Variety
• May vary by time and prevalent technology
Used to be gigabytes/terabytes
Now exabytes/petabytes
Future zettabytes/yottabytes
o Zettabyte – 1,000 data centers occupying 20% of Manhattan
o Yottabyte – 1M data centers occupying Delaware and Rhode Island
4. HIGH PERFORMANCE COMPUTING
The ability to process massive amounts of data and perform complex calculations at high speed
New Challenges (7 V's)
Previously – Volume
Now – Velocity, Variety, Variability, Veracity, Visualization, Value
How to perform?
• Supercomputers – expensive, require specialized expertise to use and solve specialized problems
• Clusters of small or medium-sized business computers
• Modern “supercomputers” are mostly “clusters of computers”
5. PARALLEL AND DISTRIBUTED COMPUTING
Parallel Computing – All processors have access to shared memory
Distributed Computing – Each processor has its own memory. Information is exchanged by passing messages between processors
Images taken from: Wikipedia
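To make the distinction concrete, here is a minimal Python sketch (not from the deck): the first half counts with threads that all update one shared variable, the second half counts with worker processes that keep private state and pass their partial results back as messages.

```python
import threading
from multiprocessing import Process, Queue

counter = 0                      # shared memory, visible to every thread
lock = threading.Lock()

def add_shared(n):
    """Shared-memory model: threads update one variable, guarded by a lock."""
    global counter
    for _ in range(n):
        with lock:
            counter += 1

def add_local(n, q):
    """Message-passing model: each process keeps a private total and sends it back."""
    total = 0
    for _ in range(n):
        total += 1
    q.put(total)

if __name__ == "__main__":
    threads = [threading.Thread(target=add_shared, args=(1000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("shared-memory total:", counter)           # 4000

    q = Queue()
    procs = [Process(target=add_local, args=(1000, q)) for _ in range(4)]
    for p in procs:
        p.start()
    results = [q.get() for _ in procs]                # collect messages before joining
    for p in procs:
        p.join()
    print("message-passing total:", sum(results))     # 4000
```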
6. DISTRIBUTED COMPUTING MODELS
Parallel algorithms
Shared-memory model
• All processors access shared memory
• Programmer decides what program is executed by each processor
Message-passing model
• Programmer chooses
o Network structure
o Program executed by each computer
Distributed algorithms
Programmer chooses the computer program
All computers run the same program
7. HIGH PERFORMANCE COMPUTING TECHNIQUES (HPCT)
Map Reduce
A framework or programming model
Suitable for processing large volumes of structured/unstructured data
Pig
Procedural rather than declarative coding approach
Provides a high degree of abstraction for map reduce
Hive
A traditional data warehouse interface for map reduce
Spark
An open-source big data framework
A unified analytics engine for large-scale data processing
8. MAP REDUCE
Map function
Input – A Key Value pair
• (k1, v1) -> list(k2, v2)
Output – A list of key value pairs (one or more elements)
Reduce function
Input – A Key and a list of values
• (k2, list(v2)) -> list(v2)
Sort
Merging and sorting of output produced in the map phase
Shuffle
Transfers intermediate output of map phase to reducer
Passes on intermediate output of one or more keys to a single reducer
9. MAP REDUCE
Concerns
Map phase – done in parallel, typically 20% of the work
Reduce phase – executed sequentially for each key, typically 80% of the work
Tips
Increase the work done in the map phase and leave less for the reduce phase
Include the optional combine phase to reduce work done by the reducer
Combine (Optional)
A mini-reducer that summarizes mapper output records for a single key
Reduces data transfer between mapper and reducer
Decreases the amount of data to be processed by the reducer
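As a rough illustration of the (k1, v1) -> list(k2, v2) flow and the optional combiner, the following single-process Python sketch simulates the map, combine, shuffle/sort and reduce phases for summing passengers per pickup day; the record fields are illustrative and not the actual TLC schema.

# Hypothetical single-process simulation of the MapReduce phases.
from collections import defaultdict

# Toy input: one record per trip (fields are illustrative, not the real TLC schema)
trips = [
    {"pickup_day": 10, "passengers": 2},
    {"pickup_day": 10, "passengers": 1},
    {"pickup_day": 11, "passengers": 3},
    {"pickup_day": 11, "passengers": 2},
]

def map_fn(trip):
    # (k1, v1) -> list(k2, v2): emit (day, passenger_count) for each trip
    yield trip["pickup_day"], trip["passengers"]

def combine_fn(pairs):
    # Optional mini-reducer: pre-aggregate on the mapper side to cut shuffle traffic
    partial = defaultdict(int)
    for day, count in pairs:
        partial[day] += count
    return list(partial.items())

def reduce_fn(day, counts):
    # (k2, list(v2)) -> final aggregate for the key
    return day, sum(counts)

# Pretend there are two mappers, each handling half of the input split
mapper_outputs = [
    combine_fn(pair for trip in split for pair in map_fn(trip))
    for split in (trips[:2], trips[2:])
]

# Shuffle/sort: route all values for a key to a single reducer, grouped by key
grouped = defaultdict(list)
for output in mapper_outputs:
    for day, count in output:
        grouped[day].append(count)

for day in sorted(grouped):
    print(reduce_fn(day, grouped[day]))   # -> (10, 3), (11, 5)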
12. QUESTION : MAP REDUCE
For each unique day of the month across all months in the year 2014, print the maximum, taken across all months, of the total number of passengers (across all eligible trips) picked up by a Yellow Taxi between 09:00 am (inclusive) and 10:00 am (exclusive), for a trip distance of less than 3 miles where a tip was paid. Print the day of the month as a number between 1 and 31 (respecting the maximum number of days occurring in each month of 2014) together with that maximum monthly passenger total. Any trip data that does not have a pickup date between 1st January 2014 and 31st December 2014 should be ignored. The days of the month need not be sorted when printing the output.
14. TOP HADOOP VENDORS
Amazon Elastic Map Reduce (EMR)
Cloudera* CDH Hadoop Distribution
Hortonworks* Data Platform (HDP)
MapR Hadoop Distribution
IBM Open Platform
Microsoft Azure HDInsight
Pivotal Big Data Suite
*Merged
15. PIG
Grew out of Yahoo
A platform for analyzing large data sets
Pig Latin – A procedural language
Provides a sequence of data transformations
• To merge, filter, apply functions, group records
• Supports User Defined Functions (UDF) for special processing
Programs are compiled into map reduce jobs
Support for Python, Java, Groovy, JavaScript, Ruby
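Since Pig can call out to Python for special processing, a UDF might look like the sketch below; the decorator import, the registration lines shown in the comments, and the field semantics are assumptions based on Pig's documented Python UDF pattern, not code from the original demo.

# taxi_udfs.py -- hypothetical Python UDF for Pig (illustrative only).
# Assumed registration from a Pig Latin script, e.g.:
#   REGISTER 'taxi_udfs.py' USING streaming_python AS taxi;
#   buckets = FOREACH trips GENERATE taxi.trip_bucket(trip_distance);
from pig_util import outputSchema   # assumption: helper shipped with Pig's Python UDF support

@outputSchema("bucket:chararray")
def trip_bucket(distance):
    # Classify a trip by distance so it can be grouped or filtered in the Pig script.
    if distance is None:
        return "unknown"
    return "short" if float(distance) < 3.0 else "long"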
16. PIG
Abstraction for map reduce programming
Improves developer productivity
Suitable for use by data analysts
Lower performance than map reduce
Use additional machines in the cluster to increase performance
Used to perform tasks for
Data Storage
Data Execution
Data Manipulation
17. QUESTION : PIG
For all data available for the year 2014 (consider all months), which drop-off location had the maximum total amount collected by credit card for trips exceeding 1 mile where no toll was paid, a tip was paid, and the standard rate was applied for Yellow Taxi rides? Any trip data that does not have a drop-off date between 1st January 2014 and 31st December 2014, or does not have a valid month or a valid day of the month, should be ignored. Print the drop-off location ID (or IDs) and the aggregated total amount for the top location.
18. ANSWER : PIG
Drop-Off Latitude Drop-Off Longitude Sum Total Amount
40.78508 -73.95587 $65221.65
19. HIVE
Developed at Facebook
A SQL engine with its own metastore on HDFS
Can be queried through HQL (Hive Query Language)
Provides a traditional data warehouse interface
Hive compiler
Converts hive queries to map reduce programs
Executed in parallel across machines in the Hadoop cluster
20. HIVE
Abstraction for map reduce programming
Improves developer productivity
Suitable for individuals with a SQL background
Lower performance than map reduce
Use additional machines in the cluster to increase performance
Supports User Defined Functions (UDFs)
Used for processing structured data
Data is loaded into tables
Unstructured data needs to be structured first
It is then loaded into tables
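As a sketch of how such a table could be queried from Python, the snippet below assumes the third-party PyHive client, a HiveServer2 instance reachable on localhost:10000, and a hypothetical yellow_trips table with the columns shown; none of these details come from the original demo.

# Hypothetical sketch: running an HQL aggregation from Python via PyHive.
from pyhive import hive

HQL = """
SELECT pickup_day,
       SUM(total_amount) AS total_collected
FROM yellow_trips
GROUP BY pickup_day
ORDER BY total_collected DESC
LIMIT 5
"""

conn = hive.connect(host="localhost", port=10000)   # assumption: connection details
cursor = conn.cursor()
cursor.execute(HQL)                                  # Hive compiles this into map reduce jobs
for pickup_day, total_collected in cursor.fetchall():
    print(pickup_day, total_collected)
conn.close()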
21. QUESTION : HIVE
Which three pairs of pickup location / drop-off location had the largest ratio of total amount paid per passenger for trips taken by a Yellow Taxi, for all data available for the year 2014? Only trips that used a payment type of credit card and a standard rate code should be considered. Any trip data that does not have a drop-off date between 1st January 2014 and 31st December 2014, or does not have a valid month or a valid day of the month, should be ignored. Print the rank, pickup location, drop-off location and the ratio of total amount paid to passenger count for these three top pairs of pickup / drop-off locations. Locations should be printed in descending order of the ratio of total amount paid to passenger count. The pickup and drop-off locations should be printed as a string of the form "latitude:longitude" based on the latitude and longitude of the pickup or drop-off location. A dense ranking should be performed.
22. ANSWER : HIVE
RANK | Pickup Latitude | Pickup Longitude | Drop-Off Latitude | Drop-Off Longitude | Ratio of Total Amount to Passenger Count
1 | 40.72941 | -73.98386 | 41.30529 | -72.92268 | $401.5
2 | 40.73249 | -73.98791 | 40.72129 | -73.95615 | $354.25
3 | 40.67019 | -73.91853 | 40.87084 | -73.90391 | $354.0
23. COMPARISON – MAP REDUCE, PIG, HIVE
MAP REDUCE | PIG | HIVE
Compiled language | Scripting language | Query language
Lower level of abstraction | Higher level of abstraction | Higher level of abstraction
Higher learning curve | Lower learning curve | Lowest learning curve
Best performance for very large data | Intermediate performance for very large data (50% lower) | Least performance for very large data
Programmer writes more lines of code | Programmer writes intermediate lines of code | Programmer writes least lines of code
Highest code efficiency (more flexibility) | Relatively less code efficiency (lesser flexibility) | Relatively less code efficiency (lesser flexibility)
Possible to handle unstructured data | Not very friendly with unstructured data like images | Not very friendly with unstructured data like images
Possible to deal with poor schema design of XML, JSON | Cannot deal with poor design of XML, JSON | Not easy to deal with poor design of XML, JSON
More potential of introducing defects due to having to write very custom code | Limited possibility of introducing defects due to fixed syntactic possibilities | Limited possibility of introducing defects due to fixed syntactic possibilities
24. SPARK
Developed at UC Berkeley AMPLab
An open source big data framework
Utilizes DAG (Directed Acyclic Graph) programming style
Now maintained by non-profit Apache Software Foundation
A unified analytics engine for
Large scale processing
Faster, general purpose processing
Reduces read/write operations from/to disk
Intermediate data stored in memory to achieve speed
RDDs (Resilient Distributed Datasets)
DataFrame
Used to build batch, iterative, interactive, graph and streaming
applications
25. SPARK
Supports cross-platform development
Programming in Scala, Java, Python, R, SQL – Core API’s
PySpark (Python)
SparkR
Spark SQL (fka Shark)
Rest of the eco-system
MLLib (Machine Learning)
GraphX (Graph Computation)
Spark Streaming
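A minimal PySpark sketch of the DataFrame API is shown below; the file path and column names are assumptions for illustration rather than the exact schema used in the demonstration.

# Minimal PySpark sketch (illustrative; path and column names are assumptions).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("nyc-taxi-sketch").getOrCreate()

# Ingest: header/schema options and the path are placeholders for the 2014 trip file
trips = (spark.read
         .option("header", True)
         .option("inferSchema", True)
         .csv("/data/yellow_tripdata_2014.csv"))

# Transform: DataFrames sit on top of RDDs and let Spark build one optimized DAG for the job
per_day = (trips
           .withColumn("pickup_day", F.dayofmonth("pickup_datetime"))
           .groupBy("pickup_day")
           .agg(F.sum("passenger_count").alias("total_passengers"))
           .orderBy("pickup_day"))

per_day.show()
spark.stop()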
26. COMPARISON – SPARK, MAP REDUCE
CRITERIA | SPARK | MAP REDUCE
Written in | Scala | Java
License | Apache 2 | Apache 2
OS support | Cross-platform | Cross-platform
Programming languages | Scala, Java, Python, R, SQL | Java, C, C++, Ruby, Groovy, Python, Perl
Lines of code (LOC) | Approximately 20,000 | Approximately 120,000
Hardware requirements | Requires the use of mid to high level hardware | Runs well on commodity hardware
Data storage | Hadoop Distributed File System (HDFS), Google Cloud Storage, Amazon S3, Microsoft Azure | Hadoop Distributed File System (HDFS), MapR, HBase
Community | Strong community, one of the most active projects at Apache | MapReduce community has shifted to Spark
Scalability | Highly scalable, one of the largest clusters has 8K nodes | Even higher scalability, one of the largest clusters has 14K nodes
27. COMPARISON – SPARK, MAP REDUCE
CRITERIA | SPARK | MAP REDUCE
Speed | 100x faster in memory, 10x faster on disk | Faster than traditional approaches
Difficulty / ease of use | Easy to program with the use of high level operators (RDDs and DataFrames) | Difficult due to the need to program each and every operation
Ease of management | Easy since it is a single analytics engine that performs various tasks | It is a batch engine and needs to be coupled with other engines like Storm, Giraph, Impala etc. to achieve various tasks
Fault tolerance | No need to start from scratch (except for programming errors) but some limitations due to in-memory operations | No need to start from scratch (except for programming errors)
Data processing modes | Batch, real time, iterative, interactive, graph, streaming | Batch
APIs and caching | Caches data in memory | No support for caching
SQL support | Supported via Spark SQL (fka Shark) | Supported via Hive
28. COMPARISON – SPARK, MAP REDUCE
CRITERIA | SPARK | MAP REDUCE
Real time analysis | Possible to handle at scale | No support for real-time analysis
Streaming | Spark Streaming handles streaming | No support for streaming
Interactive mode | Supported | Not supported
Recovery | Allows recovery of failed nodes by re-computation of the DAG | Resilient to system faults or failures; it is a highly tolerant system
Latency | Low | High
Scheduler | Due to in-memory computations it acts as its own flow scheduler | Requires an external job scheduler like Oozie for its flows
Security / access permission | Less secure since the only mechanism supported is shared secret authentication | More secure because of Kerberos and ACLs (access control lists)
Cost | Requires plenty of RAM for in-memory computations, so costs increase as cluster size increases | Cheaper in terms of cost
Category choice | Choice of data scientists since it is a complete analytics engine | Choice of data engineers since it is a basic data processing engine
29. QUESTION : SPARK
Which day (or days) of the month across all months in the year 2014 yielded the largest total tip amount (across all eligible trips) as a percentage of the total amount (across all eligible trips), for trips on a Yellow Taxi that charged the standard rate, where the total amount for each trip exceeded 5 and no toll was paid? Print the day (or days) of the month (only the day, ranging from 1 to 31) in 2014 and the total tip amount as a percentage of the total amount. Use the pickup date time to decide which day of the month a trip counts against; the drop-off date time need not be considered. Any trip data that does not have a pickup date between 1st January 2014 and 31st December 2014, or does not have a valid month or a valid day of the month, should be ignored.
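One possible PySpark approach to this question is sketched below; the column names follow the common TLC schema but are assumptions, and the snippet is not the presenter's actual solution.

# Hedged PySpark sketch (column names and input path are assumptions).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tip-percentage-by-day").getOrCreate()

trips = (spark.read
         .option("header", True)
         .option("inferSchema", True)
         .csv("/data/yellow_tripdata_2014.csv"))   # assumption: input path

# Keep only trips that meet the question's conditions
eligible = trips.filter(
    (F.col("rate_code") == 1) &
    (F.col("total_amount") > 5) &
    (F.col("tolls_amount") == 0) &
    (F.year("pickup_datetime") == 2014)
)

# Tip amount as a percentage of total amount, per pickup day of the month
by_day = (eligible
          .groupBy(F.dayofmonth("pickup_datetime").alias("pickup_day"))
          .agg((F.sum("tip_amount") / F.sum("total_amount") * 100).alias("tip_pct")))

# Show the best day; a full solution would also keep any ties for the top percentage
by_day.orderBy(F.desc("tip_pct")).limit(1).show()
spark.stop()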
30. ANSWER : SPARK
PICK UP DAY OF THE MONTH | PERCENTAGE OF SUM TIP AMOUNT TO SUM TOTAL AMOUNT
10 | 9.991076
32. REFERENCES
What is Big Data?
Data Center storage capacity worldwide from 2016 to 2021, by segment
How big is a Yottabyte?
What is High Performance Computing?
The 7 V’s of Big Data
Distributed Computing
Hadoop Combiner – Best Explanation to MapReduce Combiner
Pig Documentation
UC Berkeley AMPLab
NYC TLC Trip Record Data
Map Reduce vs Pig vs Hive
Spark vs Hadoop MapReduce: Which big data framework to choose
Apache Spark vs Hadoop MapReduce – Feature Wise Comparison
Spark vs Hadoop MapReduce
MapReduce vs Spark – 20 Useful Comparisons To Learn
Spark vs Hadoop : Which is the Best Big Data Framework