Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5 – Cloudera, Inc.
Inefficient data workloads are all too common across enterprises, causing costly delays, breakages, hard-to-maintain complexity, and ultimately lost productivity. For a typical enterprise with multiple data warehouses, thousands of reports, and hundreds of thousands of ETL jobs executed every day, this loss of productivity is a real problem. Add to all of this the complex handwritten SQL queries, and there can be nearly a million queries executed every month that desperately need to be optimized, especially to take advantage of Apache Hadoop. How can enterprises dig through their workloads and inefficiencies to see which are the best fit for Hadoop, and what is the fastest path to get there?
Cloudera Navigator Optimizer is the solution: it analyzes existing SQL workloads to provide instant insights and turns them into an intelligent optimization strategy, so you can unlock peak performance and efficiency with Hadoop. As the newest addition to Cloudera's enterprise Hadoop platform, and now available in limited beta, Navigator Optimizer has helped customers profile over 1.5 million queries and ultimately save millions by optimizing for Hadoop.
SQL on Hadoop
Looking for the right tool for your SQL-on-Hadoop use case?
There is a long list of alternatives to choose from, so how do you select the right tool?
Tool selection should always be based on the use-case requirements.
Read more on the alternatives and our recommendations.
This talk was given by Marcel Kornacker at the 11th meeting, on April 7, 2014.
Impala (impala.io) raises the bar for SQL query performance on Apache Hadoop. With Impala, you can query Hadoop data – including SELECT, JOIN, and aggregate functions – in real time to do BI-style analysis. As a result, Impala makes a Hadoop-based enterprise data hub function like an enterprise data warehouse for native Big Data.
This white paper explains how JethroData can help you achieve truly interactive response times for BI on big data, and how the underlying technology works.
It analyzes the challenges of implementing indexes for big data and how JethroData solved them. It then discusses how JethroData's design of separating compute from storage works with Hadoop and with Amazon S3. Finally, it briefly reviews some of the main features behind JethroData's performance, including I/O, query planning, and execution features.
A brave new world in mutable big data relational storage (Strata NYC 2017) – Todd Lipcon
The ever-increasing interest in running fast analytic scans on constantly updating data is stretching the capabilities of HDFS and NoSQL storage. Users want the fast online updates and serving of real-time data that NoSQL offers, as well as the fast scans, analytics, and processing of HDFS. Additionally, users are demanding that big data storage systems integrate natively with their existing BI and analytic technology investments, which typically use SQL as the standard query language of choice. This demand has led big data back to a familiar friend: relationally structured data storage systems.
Todd Lipcon explores the advantages of relational storage and reviews new developments, including Google Cloud Spanner and Apache Kudu, which provide a scalable relational solution for users who have too much data for a legacy high-performance analytic system. Todd explains how to address use cases that fall between HDFS and NoSQL with technologies like Apache Kudu or Google Cloud Spanner and how the combination of relational data models, SQL query support, and native API-based access enables the next generation of big data applications. Along the way, he also covers suggested architectures, the performance characteristics of Kudu and Spanner, and the deployment flexibility each option provides.
High concurrency, Low latency analytics using Spark/Kudu – Chris George
With the right combination of open source projects, you can have high-concurrency, low-latency Spark jobs for data analysis. We'll show both REST and JDBC access to data from a persistent Spark context, and then show how the combination of Spark Job Server, Spark Thrift Server, and Apache Kudu can create a scalable backend for low-latency analytics.
Apache Hadoop 3 is coming! As the next major milestone for Hadoop and big data, it has attracted everyone's attention by showcasing several bleeding-edge technologies and significant features across all components of Apache Hadoop: Erasure Coding in HDFS, Docker container support, Apache Slider integration and native service support, Application Timeline Service version 2, Hadoop library updates and client-side classpath isolation, etc. In this talk, we will first give an update on the status of the Hadoop 3.0 release work in the Apache community and the feasible path through alpha and beta towards GA. Then we will dive deep into each new feature, including its development progress and maturity status in Hadoop 3. Last but not least, as a new major release, Hadoop 3.0 will contain some incompatible API or CLI changes, which could be challenging for downstream projects and existing Hadoop users to upgrade around; we will go through these major changes and explore their impact on other projects and users.
Speaker: Sanjay Radia, Founder and Chief Architect, Hortonworks
Disaster Recovery and Cloud Migration for your Apache Hive Warehouse – DataWorks Summit
As Apache Hadoop clusters become central to an organization's operations, organizations increasingly run clusters in more than one data center. Historically, this has been largely driven by the requirements of business continuity planning or geo-localization. It has also recently been gaining a lot of interest from a hybrid-cloud perspective, wherein people try to augment their traditional on-prem setup with cloud-based additions. A robust replication solution is a fundamental requirement in such cases.
The Apache Hive community has been working on new capabilities for efficient and fault-tolerant replication of data in the Hive warehouse. In this talk, we will discuss these new capabilities: how they work, what replication at Hive scale looks like, what challenges it poses, and what we have done to solve those issues. We will also cover what you need to be aware of in your use case to make replication optimal.
Speaker
Sankar Hariappan, Senior Software Engineer, Hortonworks
Hadoop has traditionally been an on-premises workload, with very few notable implementations in the cloud. With organizations having either jumped on the cloud bandwagon or started planning their expansion into that ecosystem, it is imperative to explore how Hadoop conforms to the cloud paradigm. With the coming of age of some very useful cloud paradigms, and given the highly seasonal nature of Big Data workloads, this is becoming a very common ask from customers. Robust architectures, elastic scale, open platforms, OSS integrations, and addressing complex pain points will all be part of this lively talk. To implement effective solutions for Big Data in the cloud, you must understand the core principles and grasp the design principles of how the cloud can enhance the benefits of parallelized analytics. Join this session to understand the nitty-gritty of implementing Big Data in the cloud and the various options therein. Big Data + Cloud is definitely a killer combination.
Big Data Day LA 2016 / Big Data Track - How To Use Impala and Kudu To Optimize... – Data Con LA
This session describes how Impala integrates with Kudu for analytic SQL queries on Hadoop and how this integration, taking full advantage of the distinct properties of Kudu, has significant performance benefits.
Introducing Kudu, Big Data Warehousing Meetup – Caserta
Not just a SQL interface or file system, Kudu, the new updatable column store for Hadoop, is changing the storage landscape. It's easy to operate and makes new data immediately available for analytics or operations.
At the Caserta Concepts Big Data Warehousing Meetup, our guests from Cloudera outlined the functionality of Kudu and talked about why it will become an integral component in big data warehousing on Hadoop.
To learn more about what Caserta Concepts has to offer, visit http://casertaconcepts.com/
Dancing elephants - efficiently working with object stores from Apache Spark ... – DataWorks Summit
As Hadoop applications move into cloud deployments, object stores increasingly become the source and destination of data. But object stores are not filesystems: sometimes they are slower, security is different, and core filesystem assumptions do not hold.
What are the secret settings for getting maximum performance from queries against data living in cloud object stores? They span the filesystem client, the file format, and the query engine layers, and even extend to how you lay out the files: the directory structure and the names you give them.
We know these things from our work in all these layers, from the benchmarking we've done, and from the support calls we get when people have problems. And now we'll show you.
This talk starts from the ground-up question "why isn't an object store a filesystem?", showing how that breaks fundamental assumptions in code and so causes performance issues you don't get when working with HDFS. We'll look at ways to get Apache Hive and Spark to work better, covering optimizations that have been done to enable this and the work that is ongoing. Finally, we'll consider what your own code needs to do in order to adapt to cloud execution.
Impala 2.0 - The Best Analytic Database for Hadoop – Cloudera, Inc.
A look at why SQL access in Hadoop is critical and the benefits of a native Hadoop analytic database; what's new in Impala 2.0, along with some recent performance benchmarks; some common Impala use cases and production customer stories; and insight into what's next for Impala.
Apache Hadoop YARN is the modern Distributed Operating System. It enables the Hadoop compute layer to be a common resource-management platform that can host a wide variety of applications. Multiple organizations are able to leverage YARN in building their applications on top of Hadoop without themselves repeatedly worrying about resource management, isolation, multi-tenancy issues etc.
In this talk, we'll first hit the ground with the current status of Apache Hadoop YARN – how it is faring today in deployments large and small. We will cover different types of YARN deployments, in different environments and at different scales.
We'll then move on to the exciting present and future of YARN – features that further strengthen YARN as the first-class resource-management platform for datacenters running enterprise Hadoop. We'll discuss the current status as well as the future promise of features and initiatives such as 10x scheduler throughput improvements, Docker container support on YARN, native support for long-running services (alongside applications) without any changes, seamless application upgrades, fine-grained isolation for multi-tenancy using cgroups on disk and network resources, powerful scheduling features like application priorities and intra-queue preemption across applications, and operational enhancements including insights through Timeline Service v2, a new web UI, and better queue management.
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ... – Cloudera, Inc.
The Hadoop ecosystem has improved its real-time access capabilities recently, narrowing the gap with relational database technologies. However, gaps remain in the storage layer that complicate the transition to Hadoop-based architectures. In this session, the presenter will describe these gaps and discuss the tradeoffs between real-time transactional access and fast analytic performance from the perspective of storage engine internals. The session will also cover Kudu (currently in beta), the new addition to the open source Hadoop ecosystem with out-of-the-box integration with Apache Spark and Apache Impala (incubating), which achieves fast scans and fast random access from a single API.
http://bit.ly/1BTaXZP – Hadoop has been a huge success in the data world. It’s disrupted decades of data management practices and technologies by introducing a massively parallel processing framework. The community and the development of all the Open Source components pushed Hadoop to where it is now.
That's why the Hadoop community is excited about Apache Spark. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine learning and graph analysis. Spark is quickly establishing itself as a leading environment for doing fast, iterative in-memory and streaming analysis.
This talk will give an introduction to the Spark stack, explain how Spark achieves lightning-fast results, and show how it complements Apache Hadoop.
Keys Botzum - Senior Principal Technologist with MapR Technologies
Keys is Senior Principal Technologist with MapR Technologies, where he wears many hats. His primary responsibility is interacting with customers in the field, but he also teaches classes, contributes to documentation, and works with engineering teams. He has over 15 years of experience in large scale distributed system design. Previously, he was a Senior Technical Staff Member with IBM, and a respected author of many articles on the WebSphere Application Server as well as a book.
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data... – DataWorks Summit
In the last few years, the DevOps movement has introduced groundbreaking approaches to the way we manage the lifecycle of software development and deployment. Today, organisations aspire to fully automate the deployment of microservices and web applications with tools such as Chef, Puppet, and Ansible. However, the deployment of data-processing pipelines remains a relic from the dark ages of software development.
Processing large-scale data pipelines is the main engineering task of the Big Data era, and it should be treated with the same respect and craftsmanship as any other piece of software. That is why we created Apache Amaterasu (Incubating) - an open source framework that takes care of the specific needs of Big Data applications in the world of continuous delivery.
In this session, we will take a close look at Apache Amaterasu (Incubating), a simple and powerful framework to build and deploy pipelines. Amaterasu aims to help data engineers and data scientists compose, configure, test, package, deploy, and execute data pipelines written using multiple tools, languages, and frameworks.
We will see what Amaterasu provides today, how it can help existing Big Data applications, and demo some of the new bits that are coming in the near future.
Speaker:
Yaniv Rodenski, Senior Solutions Architect, Couchbase
Learn how Cloudera Impala empowers you to:
- Perform interactive, real-time analysis directly on source data stored in Hadoop
- Interact with data in HDFS and HBase at the “speed of thought”
- Reduce data movement between systems & eliminate double storage
A Benchmark Test on Presto, Spark SQL and Hive on Tez – Gw Liu
We conducted a benchmark test on mainstream big data SQL engines, including Presto, Spark SQL, and Hive on Tez, measuring the execution speed of common query patterns on data ranging from tens of thousands to billions of rows.
We focused on performance over medium-sized data (from tens of GB to 1 TB), which is the major case in most services.
MiNiFi is a recently started sub-project of Apache NiFi. It is a complementary data collection approach that supplements the core tenets of NiFi in dataflow management, focusing on the collection of data at the source of its creation. Simply put, MiNiFi agents take the guiding principles of NiFi and push them to the edge in a purpose-built, design-and-deploy manner. This talk will focus on MiNiFi's features, go over recent developments and prospective plans, and give a live demo of MiNiFi.
The config.yml is available here: https://gist.github.com/JPercivall/f337b8abdc9019cab5ff06cb7f6ff09a
Real-time Big Data Analytics Engine using Impala – Jason Shih
Cloudera Impala is an open-source engine under the Apache License that enables real-time, interactive analytical SQL queries over data stored in HBase or HDFS. The work was inspired by the Google Dremel paper, which is also the basis for Google BigQuery. Impala provides access to the same unified storage platform through its own distributed query engine, and it does not use MapReduce. It also uses the same metadata, SQL syntax (HiveQL-like), ODBC driver, and user interface (Hue Beeswax) as Hive. Beyond the traditional Hadoop approach, which aims to provide a low-cost solution for resilient, batch-oriented distributed data processing, more and more effort in the Big Data world is going into finding the right solution for ad-hoc, fast queries and real-time data processing over large datasets. In this presentation, we'll explore how to run interactive queries inside Impala, the advantages of the approach, and its architecture, and understand how it optimizes data systems, including a practical performance analysis.
Couchbase Chennai Meetup: Developing with Couchbase - made easy – Karthik Babu Sekar
This session provided an overview of Couchbase solutions and what's latest and greatest in the new release. It also discussed how easy it is to develop with Couchbase and query the database.
Summary of recent progress on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
Couchbase Singapore Meetup #2: Why Developing with Couchbase is easy!! – Karthik Babu Sekar
This session provided an overview of Couchbase solutions and what's latest and greatest in the new release. It also discussed how easy it is to develop with Couchbase and query the database.
Apache Hadoop started as batch: simple, powerful, efficient, scalable, and a shared platform. However, Hadoop is more than that. Its true strengths are:
- Scalability: it's affordable because it is open source and uses commodity hardware for reliable distribution.
- Schema on read: you can afford to save everything in raw form.
- Data is better than algorithms: more data and a simple algorithm can be much more meaningful than less data and a complex algorithm.
Content presented at a talk on Aug. 29th. The purpose was to inform a fairly technical audience about the primary tenets of Big Data and the Hadoop stack. It also included a walk-through of Hadoop and parts of the Hadoop stack: Pig, Hive, HBase.
This presentation simplifies the concepts of Big Data, NoSQL databases, and Hadoop components.
The Original Source:
http://zohararad.github.io/presentations/big-data-introduction/
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos – Lester Martin
A walk-thru of core Hadoop, the ecosystem tools, and Hortonworks Data Platform (HDP) followed by code examples in MapReduce (Java and C#), Pig, and Hive.
Presented at the Atlanta .NET User Group meeting in July 2014.
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc... – Agile Testing Alliance
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Processing, by Sampat Kumar from Harman. The presentation was given at the #doppa17 DevOps++ Global Summit 2017. All copyrights are reserved with the author.
http://bit.ly/1BTaXZP – As organizations look for even faster ways to derive value from big data, they are turning to Apache Spark, an in-memory processing framework that offers lightning-fast big data analytics, providing speed, developer productivity, and real-time processing advantages. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine learning and graph analysis. Spark is quickly establishing itself as a leading environment for doing fast, iterative in-memory and streaming analysis. This talk will give an introduction to the Spark stack, explain how Spark achieves lightning-fast results, and show how it complements Apache Hadoop. By the end of the session, you'll come away with a deeper understanding of how you can unlock deeper insights from your data, faster, with Spark.
Hadoop in Practice (SDN Conference, Dec 2014) – Marcel Krcah
Do you sit on a big pile of data and want to know how to leverage it in your company? Are you interested in use cases, examples, and practical demos of the full Hadoop stack? Looking for big-data inspiration?
In this talk we will cover:
- Use cases showing how implementing a Hadoop stack at TheNewMotion drastically helped us, software engineers, with our everyday challenges, and how Hadoop enables our management, marketing, and operations teams to become more data-driven.
- A practical introduction to our data warehouse, analytics, and visualization stack: Apache Pig, Impala, Hue, Apache Spark, IPython notebook, and Angular with D3.js.
- Easy deployment of the Hadoop stack to the cloud.
- Hermes, our homegrown command-line tool that helps us automate data-related tasks.
- Examples of exciting machine learning challenges that we are currently tackling.
- Hadoop with Azure and the Microsoft stack.
3. About Me
Ofir Manor
- Worked in the database industry for 20 years
- Started as a developer and DBA
- Worked for Oracle and Greenplum in pre-sales roles
- Blogging on Big Data and Hadoop
- Currently Product Manager at JethroData
ofir@jethrodata.com
http://ofirm.wordpress.com/
@ofirm
4. Agenda
- How did we get here?
- How do parallel databases on Hadoop work?
- Impala, Hive/Tez, Spark SQL
- A use-case perspective
- JethroData – What? How? Demo
6. Sounds Familiar?
"Let's bring all the data that our operational systems create into one place, and keep it all forever. When we analyze that repository, we will surely uncover critical business insights..."
What is this concept called?
- Today: "Big Data"
- For the last 20 years: "Enterprise Data Warehouse" (EDW)
7. Big Data vs. Data Warehouse?
Everyone calls themselves "Big Data" now...
But "Big Data" is led by large web companies with new requirements:
- Web scale: orders of magnitude more data (page views and clicks, mobile apps, sensor data, etc.)
- Cost: dramatically lower price per node
- Aversion to vendor lock-in: a preference for open source
- Methodology: correctness vs. agility
- Classical EDW / schema-on-write: data must be cleaned and integrated before business users can access it (EDW vs. ADW)
- Big Data / schema-on-read: let the data science team handle the unfiltered crap (in addition to some vetted schema-on-write data sets)
8. Evolution – The 90s
"Big Data" was "Enterprise Data Warehouse":
- You could either build your own using a big Unix machine and enterprise storage ("scale up")
- Or just buy a Teradata appliance (world's first production 1TB EDW, 1992)
9. Evolution – The 2000s
A new generation of parallel databases: "10x-100x faster, 10x cheaper"
- Netezza, Vertica, Greenplum, Aster, ParAccel, etc.
- All used a cluster of cheap servers without enterprise storage to run parallel analytic SQL
- Shared-nothing architecture / MPP (massively parallel processing) / scale-out
10. Evolution – Our Days
- It's all about Hadoop
- Hadoop started as a batch platform: HDFS and MapReduce
- Lately, it has become a shared platform for any type of parallel processing framework
- [Example: Hortonworks slide]
11. SQL-on-Hadoop: Underlying Design
- Hadoop uses the same parallel design pattern as the parallel databases of the last decade
- Surprisingly, all SQL-on-Hadoop engines also use the same design pattern as those parallel databases
Platform layers: HDFS, MapReduce, Tez, Spark
SQL engines: Hive, Impala, Presto, Tajo, Drill, Shark, Spark SQL, Pivotal HAWQ, IBM BigSQL, Teradata Aster, Hadapt, RainStor, ...
12. The Parallel Design (Shared-Nothing MPP)
Client: SELECT day, sum(sales) FROM t1 WHERE prod='abc' GROUP BY day
[Diagram: the client's query is handled by a query planner/manager and fanned out to query executors on each of five data nodes; every executor works on its node's local portion of the data.]
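To make the pattern concrete, here is a minimal Python sketch (mine, not the presenter's) of the same scatter-gather flow for the query above: each "node" filters and pre-aggregates its local rows, and a coordinator merges the partial sums. The rows and node layout are invented for illustration.

    from collections import defaultdict

    # Toy rows (day, prod, sales) spread across five "data nodes".
    nodes = [
        [("mon", "abc", 10), ("mon", "xyz", 5), ("tue", "abc", 7)],
        [("mon", "abc", 3), ("tue", "abc", 1)],
        [("tue", "xyz", 9), ("wed", "abc", 4)],
        [("wed", "abc", 2)],
        [("mon", "abc", 6), ("wed", "xyz", 8)],
    ]

    def node_executor(local_rows, prod):
        """Query executor: scan only the local rows and pre-aggregate."""
        partial = defaultdict(int)
        for day, p, sales in local_rows:
            if p == prod:              # WHERE prod = 'abc'
                partial[day] += sales  # partial SUM per GROUP BY key
        return partial

    def coordinator(prod="abc"):
        """Query planner/manager: fan out, then merge partial results."""
        final = defaultdict(int)
        for local_rows in nodes:
            for day, subtotal in node_executor(local_rows, prod).items():
                final[day] += subtotal
        return dict(final)

    print(coordinator())  # {'mon': 19, 'tue': 8, 'wed': 6}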
13. The Parallel Design Principles (Shared-Nothing MPP)
Full scan:
- Each node reads all of its local portion of the table
- Optimize with large sequential reads
Maximize locality:
- Minimize inter-node work
- Work should be evenly distributed to avoid bottlenecks
- Avoid data and processing skew
This leads to a hard design-time trade-off that greatly depends on the physical table organization: choose partition keys, distribution keys, and sort keys.
14. How to Get Decent Performance: Hive vs. Impala
1. Minimize query I/O: skip reading unneeded data
2. Optimize query execution: pick the best plan, optimize each step, and connect the steps efficiently
15. 1. Minimize Query I/O
Using partitioning (Hive, Impala, etc.):
- Full scan of less data
- Manual tuning: too few vs. too many partitions
Using columnar, scan-optimized file formats ("write once, full scan a thousand times"):
- Skip unneeded columns
- Full scan smaller files: encode and compress per column
- Skip blocks (for sorted data)
- A big effort in the Hadoop space, mostly done: two comparable formats, ORC and Parquet
- Use the right one: Hive/ORC or Impala/Parquet
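As a small illustration of partition pruning plus columnar layout (an aside, not from the deck; it uses the pyarrow library and an invented table), writing data as a partitioned Parquet dataset gives a query engine both levers at once:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Invented toy sales table; 'day' is the partition key.
    table = pa.table({
        "day":   ["mon", "mon", "tue", "wed"],
        "prod":  ["abc", "xyz", "abc", "abc"],
        "sales": [10, 5, 7, 4],
    })

    # One sub-directory per partition value (day=mon/, day=tue/, ...), so
    # a filter on 'day' can skip whole directories; within each file the
    # columnar layout lets a scan skip the columns it does not need.
    pq.write_to_dataset(table, root_path="sales_parquet",
                        partition_cols=["day"])

Too many partition values means many tiny files, which is the "too few vs. too many" tuning trade-off mentioned above.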
16. How Do the Parquet and ORC Columnar Formats Work?
- Data is divided into blocks: chunks of rows
- Within each block, data is physically stored column by column
- Additional metadata is stored per column in a block, such as min/max values
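That per-block metadata is easy to inspect with pyarrow (again an aside; the file name is hypothetical, since the dataset writer above generates its own): each row group carries per-column statistics that an engine can consult to skip blocks whose min/max range cannot match the predicate.

    import pyarrow.parquet as pq

    # Hypothetical file name under the dataset written earlier.
    pf = pq.ParquetFile("sales_parquet/day=mon/part-0.parquet")
    meta = pf.metadata
    for rg in range(meta.num_row_groups):
        col = meta.row_group(rg).column(0)  # first column chunk in block
        stats = col.statistics              # per-column min/max per block
        print(rg, col.path_in_schema, stats.min, stats.max)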
17. 2. Optimize Query Execution
1. Pick the best execution plan (query optimizer): cost-based optimizations are currently generally weak
2. Optimize each step (efficient processing): vectorized operations, expression compilation, etc.
3. Combine the steps efficiently (execution engines):
- Batch-oriented (MapReduce): focus on recoverability; write intermediate results to disk after every step
- Streaming-oriented (Tez, Impala, HAWQ, etc.): focus on performance; move intermediate results directly between processes, which requires many more resources at once
- Hybrid (Spark): enjoy both worlds; stream and re-use intermediate results, optimized for in-memory, but can recover/recompute on node failure
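The batch-versus-streaming contrast can be modelled in a few lines of Python (a toy model of the idea, not any engine's real code): materialize every intermediate result, or chain the steps as generators so rows flow straight through.

    def scan():
        yield from range(5)           # step 1: read rows

    def double(rows):
        for r in rows:                # step 2: transform each row
            yield r * 2

    def total(rows):
        return sum(rows)              # step 3: aggregate

    # Batch-oriented: fully materialize ("write to disk") after each
    # step; recoverable, but pays for every intermediate result.
    step1 = list(scan())
    step2 = list(double(step1))
    print(total(step2))               # 20

    # Streaming-oriented: each row flows through all steps with nothing
    # materialized; faster, but a failure re-runs the whole query.
    print(total(double(scan())))      # 20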
18. Impala
Impala highlights:
- Basic partitioning (partition per key)
- Optimized I/O with Parquet
- Built its own streaming execution engine
- Efficient processing
- Basic cost-based query optimizations
Impala vs. Presto vs. Drill vs. Tajo:
- Generally the same high-level design, with four independent teams implementing it in their own way
- Impala is way ahead of the pack in performance, functionality, market share, and support
19. Impala
Sample benchmark results from Cloudera (May 2014): 20 nodes running TPC-DS data at scale factor 15,000 (GB); Impala 1.3.0 vs. Hive 0.13 on Tez vs. Shark 0.9.2 vs. Presto 0.6.0.
Source: Cloudera blog, "New SQL Choices in the Apache Hadoop Ecosystem: Why Impala Continues to Lead"
20. Hive
Hive highlights:
- Rich SQL support
- Basic partitioning (partition per key)
- Optimized I/O with ORC
- Cost-based optimizer coming (Hive 0.14)
- Efficient processing: vectorized query execution
- Reliable execution (MapReduce) or fast execution (Tez)
- Hive on Spark (Shark) has recently reached "end-of-life", sort of
21. Spark SQL
Spark SQL highlights:
- Very early days / alpha state; announced March 2014
- A SQL-on-Spark solution written from scratch
- Should support reading from and writing to Hadoop, specifically from/to Hive and Parquet
- Could become an interesting player next year, mostly vs. Impala
22. Query Use Cases: Querying Raw Data
Ad-hoc "data science" investigative work:
- Typically in the minutes-to-hours range
- Need to make sense of many, ever-changing data sources
- Need to make sense of text / semi-structured / dynamic-schema sources
- Need to mix SQL, text processing (UDFs), and machine learning algorithms in a manual, multi-step fashion
[Pipeline diagram: Raw Data]
23. Query Use Cases: Reporting
- Internal analyst teams repeatedly run reports on common data sets
- Data is likely cleaned and vetted, potentially aggregated, and shared across many users
- Use the latest Hive/ORC/Tez or Impala/Parquet: improves response time from hours to minutes
[Pipeline diagram: Raw Data → Reporting]
24. Query Use Cases: Pre-Compute Specific Queries (1)
- Queries embedded in external websites / apps / dashboards; customers expect page loads of a second or two
- Must pre-compute results to achieve that response time
"Old style" implementation:
1. Daily massive batch computation (MapReduce)
2. Results are typically pushed to an external OLAP solution (cubes) or to a key-value store (HBase / Redis, etc.)
- Great performance, but only for a few queries or counters
- "Freshness": data is typically a day or more behind production
- Complexity: multiple steps and systems; a massive daily spike in computation, storage, and data movement
[Pipeline diagram: Raw Data → Reporting → Immediate]
25. Query Use Cases: Pre-Compute Specific Queries (2)
"New style" implementation: stream processing
- Continuously update pre-computed results as new events arrive
- Events pile up in a queue (Apache Kafka)
- Events are processed in micro-batches (Apache Storm)
- The new computation results are constantly stored/updated in a key-value store
- Solves the "freshness" problem
- Not suitable for complex pre-computations; best for counters
- Bleeding-edge technology
[Pipeline diagram: Raw Data → Reporting → Immediate]
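A toy Python sketch of the "new style" (plain lists and dicts standing in for Kafka, Storm, and the key-value store; none of this is real API) shows why the counters stay fresh: each micro-batch folds straight into the stored results.

    from collections import defaultdict

    store = defaultdict(int)                  # stand-in key-value store
    queue = [("page_view", "home"), ("page_view", "pricing"),
             ("page_view", "home"), ("click", "signup")]  # stand-in queue

    def process_micro_batch(events):
        """Fold one micro-batch of events into pre-computed counters."""
        for event_type, key in events:
            store[(event_type, key)] += 1

    while queue:                              # drain in micro-batches of 2
        batch, queue = queue[:2], queue[2:]
        process_micro_batch(batch)

    print(store[("page_view", "home")])       # 2, fresh within one batch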
26. Query Use Cases: Fast Ad-hoc Queries
- Interactive BI is the "holy grail" of SQL on Hadoop
- Analysts and business users want to interact with select data sets from their BI tool: drag columns, "slice and dice", drill down
- They expect response times of seconds to tens of seconds from their BI tool
- Existing solutions are too slow, so users are stuck with reporting
- This is very hard to achieve with existing Hadoop technologies
- We need a different solution; maybe something that has worked for databases for the last 30 years?
It's time to introduce JethroData...
[Pipeline diagram: Raw Data → Reporting → Immediate → Interactive BI]
27. JethroData
Index-based, columnar SQL-on-Hadoop:
- Delivers interactive BI on Hadoop
- Focused on ad-hoc queries from BI tools
- The more you drill down, the faster you go
SQL-on-Hadoop:
- Column data and indexes are stored on HDFS
- Compute runs on dedicated edge nodes: non-intrusive, nothing installed on the Hadoop cluster; its own SQL engine, not using MapReduce / Tez
- Supports standard ANSI SQL, ODBC/JDBC
31. Jethro Indexes
- Indexes map each column value to a set of rows
- Jethro stores indexes as hierarchical compressed bitmaps
- Very fast query operations (AND / OR / NOT) can process the entire WHERE clause down to a final list of rows
- Fully indexed: all columns are automatically indexed
INSERT performance:
- Jethro indexes are append-only; if needed, new repeating entries are allowed
- INSERT is very fast: files are appended, with no random reads/writes, which is compatible with HDFS
- Periodic background merge (non-blocking)
Example index:
Value | Rows
FR    | 5, 9, 10, 11, 14
IL    | 1, 3, 7, 12, 13
US    | 2, 4, 6, 8, 15
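To see why a WHERE clause collapses to a few bitmap operations, here is a toy Python sketch (an illustration of the general technique, not Jethro's hierarchical compressed format): one bitmap per column value, with bit i set when row i holds that value.

    def build_index(column_values):
        """Map each value to a bitmap (int) of the rows containing it."""
        index = {}
        for row, value in enumerate(column_values):
            index[value] = index.get(value, 0) | (1 << row)
        return index

    country = build_index(["IL", "US", "IL", "US", "FR", "US"])
    status  = build_index(["new", "new", "old", "new", "new", "old"])

    # WHERE country = 'US' AND status = 'new'  ->  one bitwise AND
    matches = country["US"] & status["new"]
    print([row for row in range(6) if matches >> row & 1])  # [1, 3]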
32. Jethro Execution Logic
Index phase:
- Use all relevant indexes to compute which rows are needed for the query results
- Divide the work into multiple parallel units of work
Parallel execution phase:
- Parallelize both fetch and compute
- Works with or without partitions
- Currently multi-threaded on a single server; designed for multi-node execution
[Diagram: Parse Query → Identify Results from Indexes → Parallel Execution (several workers) → Results]
34. Minimize Query I/O (Part 1)
Automatically combine all relevant indexes for query execution, dramatically reducing fetch I/O:
- Generate a bitmap of all relevant rows from the entire WHERE clause
- Skip indexes when their value is minimal (WHERE a > 0) or when indexes are not applicable (WHERE function(col) = 5)
Use the Jethro columnar file format to:
- Skip columns
- Encode and compress data (smaller files)
- Skip blocks (when a column must be scanned)
HDFS I/O optimizations:
- Optimized fetch size for bulk fetches
- Use skip scans to convert random reads into a single I/O, if possible
35. Minimize Query I/O (Part 2)
Automatic local cache:
- Simple to set up
- Metadata is automatically cached locally by priority, in the background
- Specific columns or tables can be cached on request
- Improves latency and reduces HDFS round trips
Use partitioning?
- Only a small effect: Jethro reads only the relevant rows, with or without partitioning
- At the high end, it helps by operating on smaller bitmaps
- Mostly useful for rolling-window operations
36. Optimize Query Execution
- Query optimizer: pick the best plan (set of steps) using up-to-date, detailed statistics from the indexes (for example, star transformation)
- Efficient processing: optimize each step with efficient bitmap operations
- Execution engine: combine the steps efficiently with multi-threaded, streaming, parallel execution that parallelizes everything: fetch, filter, join, aggregate, sort, etc.
37. Scalability
Scale to hundreds of billions of rows:
- Scale out HDFS (the Hadoop cluster), if needed, to provide extra storage or extra I/O performance (rare)
- Scale out Jethro nodes: Jethro query nodes are stateless, so add nodes to support additional concurrent queries
- Leverage partitioning: supports rolling-window maintenance at scale; works best with a few billion rows per partition (usually partition for maintenance, not performance)
38. High Availability
Leverage all the HDFS HA goodies:
- DataNode replication
- NameNode HA
- HDFS snapshots
- Any HDFS DR solution
All Jethro nodes are stateless:
- Services auto-start on restart
- Nodes can be started and stopped on demand
Can start and stop them on demand