Impala is a SQL query engine for Apache Hadoop that allows for interactive queries on large datasets. It uses a distributed architecture where each node runs an Impala daemon and queries are distributed across nodes. Impala aims to provide general-purpose SQL with high performance by using C++ instead of Java and avoiding MapReduce execution. It runs directly on Hadoop storage systems and supports common file formats like Parquet and Avro.
James Kinley from Cloudera:
An introduction to Cloudera Impala. Cloudera Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Apache Hive. This provides a familiar and unified platform for batch-oriented or real-time queries.
The link to the video: http://zurichtechtalks.ch/post/37339409724/an-introduction-to-cloudera-impala-sql-on-top-of
Cloudera Impala: A Modern SQL Engine for HadoopCloudera, Inc.
This is a technical deep dive about Cloudera Impala, the project that makes scalable parallel databse technology available to the Hadoop community for the first time. Impala is an open-sourced code base that allows users to issue low-latency queries to data stored in HDFS and Apache HBase using familiar SQL operators.
Presenter Marcel Kornacker, creator of Impala, begins with an overview of Impala from the user's perspective, followed by an overview of Impala's architecture and implementation, and will conclude with a comparison of Impala with Dremel and Apache Hive, commercial MapReduce alternatives and traditional data warehouse infrastructure.
James Kinley from Cloudera:
An introduction to Cloudera Impala. Cloudera Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Apache Hive. This provides a familiar and unified platform for batch-oriented or real-time queries.
The link to the video: http://zurichtechtalks.ch/post/37339409724/an-introduction-to-cloudera-impala-sql-on-top-of
Cloudera Impala: A Modern SQL Engine for HadoopCloudera, Inc.
This is a technical deep dive about Cloudera Impala, the project that makes scalable parallel databse technology available to the Hadoop community for the first time. Impala is an open-sourced code base that allows users to issue low-latency queries to data stored in HDFS and Apache HBase using familiar SQL operators.
Presenter Marcel Kornacker, creator of Impala, begins with an overview of Impala from the user's perspective, followed by an overview of Impala's architecture and implementation, and will conclude with a comparison of Impala with Dremel and Apache Hive, commercial MapReduce alternatives and traditional data warehouse infrastructure.
Apachecon Europe 2012: Operating HBase - Things you need to knowChristian Gügi
If you’re running HBase in production, you have to be aware of many things. In this talk we will share our experience in running and operating an HBase production cluster for a customer. To avoid common pitfalls, we’ll discuss problems and challenges we’ve faced as well as practical solutions (real-world techniques) for repair.
Even though HBase provides internal tools for diagnosing issues and for repair, running a healthy cluster can still be challenging for an administrator. We'll cover some background on these tools as well as on HBase internals such as compaction, region splits and their distribution.
We'll also introduce our tool to visualize region sizing and distribution in the cluster, that we recently open sourced.
Learn how Cloudera Impala empowers you to:
- Perform interactive, real-time analysis directly on source data stored in Hadoop
- Interact with data in HDFS and HBase at the “speed of thought”
- Reduce data movement between systems & eliminate double storage
Impala Architecture Presentation at Toronto Hadoop User Group, in January 2014 by Mark Grover.
Event details:
http://www.meetup.com/TorontoHUG/events/150328602/
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...HBaseCon
Phoenix has evolved to become a full-fledged relational database layer over HBase data. We'll discuss the fundamental principles of how Phoenix pushes the computation to the server and why this leads to performance enabling direct support of low-latency applications, along with some major new features. Next, we'll outline our approach for transaction support in Phoenix, a work in-progress, and discuss the pros and cons of the various approaches. Lastly, we'll examine the current means of integrating Phoenix with the rest of the Hadoop ecosystem.
Real-time Big Data Analytics Engine using ImpalaJason Shih
Cloudera Impala is an open-source under Apache Licence enable real-time, interactive analytical SQL queries of the data stored in HBase or HDFS. The work was inspired by Google Dremel paper which is also the basis for Google BigQuery. It provide access same unified storage platform base on it's own distributed query engine but does not use mapreduce. In addition, it use also the same metadata, SQL syntax (HiveQL-like) ODBC driver and user interface (Hue Beeswax) as Hive. Besides the traditional Hadoop approach, aim to provide low-cost solution for resiliency and batch-oriented distributed data processing, we found more and more effort in the Big Data world pursuing the right solution for ad-hoc, fast queries and realtime data processing for large datasets. In this presentation, we'll explore how to run interactive queries inside Impala, advantages of the approach, architecture and understand how it optimizes data systems including also practical performance analysis.
Key attributes for modern real time streaming processing and interactive analytics
What is so exciting to me about Spark?
What are some of the myths?
What is missing in Spark for real time?
SnappyData’s mission – fuse Spark with in-memory data management in one unified cluster to offer – OLTP + OLAP + Stream processing + Probabilistic data
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Chicago Hadoop Users Group
John Leach Co-Founder and CTO of Splice Machine with 15+ years software development and machine learning experience will discuss how to use HBase co-processors to build an ANSI-99 SQL database with 1) parallelization of SQL execution plans, 2) ACID transactions with snapshot isolation and 3) consistent secondary indexing.
Transactions are critical in traditional RDBMSs because they ensure reliable updates across multiple rows and tables. Most operational applications require transactions, but even analytics systems use transactions to reliably update secondary indexes after a record insert or update.
In the Hadoop ecosystem, HBase is a key-value store with real-time updates, but it does not have multi-row, multi-table transactions, secondary indexes or a robust query language like SQL. Combining SQL with a full transactional model over HBase opens a whole new set of OLTP and OLAP use cases for Hadoop that was traditionally reserved for RDBMSs like MySQL or Oracle. However, a transactional HBase system has the advantage of scaling out with commodity servers, leading to a 5x-10x cost savings over traditional databases like MySQL or Oracle.
HBase co-processors, introduced in release 0.92, provide a flexible and high-performance framework to extend HBase. In this talk, we show how we used HBase co-processors to support a full ANSI SQL RDBMS without modifying the core HBase source. We will discuss how endpoint transactions are used to serialize SQL execution plans over to regions so that computation is local to where the data is stored. Additionally, we will show how observer co-processors simultaneously support both transactions and secondary indexing.
The talk will also discuss how Splice Machine extended the work of Google Percolator, Yahoo Labs’ OMID, and the University of Waterloo on distributed snapshot isolation for transactions. Lastly, performance benchmarks will be provided, including full TPC-C and TPC-H results that show how Hadoop/HBase can be a replacement of traditional RDBMS solutions.
To view the accompanying slide deck: http://www.slideshare.net/ChicagoHUG/
An Introduction to Cloudera Impala, shows how Impala works, and the internal processing of query of Impala, including architecture, frontend, query compilation, backend, code generation, HDFS-related stuff and performance comparison.
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLArseny Chernov
Fast, demo-enabled 60-min lecture, aligned to curriculum of RDBMS / SQL course taught at Singapore University of Technology and Design (SUTD), a collaboration with MIT. More details about this lecture and some photos here: http://bit.ly/sutd-mit-lecture
HBaseCon 2012 | HBase Filtering - Lars George, ClouderaCloudera, Inc.
This talk will run through the list of filters that are shipped with HBase and show how they are used from a client application. Filters expose varying feature sets, but also exhibit an equally varying impact on read performance – but neither are directly intuitive. A skilled HBase practitioner should know how to select the proper filter for a given use-case, or how to combine sets of filters to achieve what is needed. The talk will conclude with an example for a custom filter and explain how to deploy it on a cluster.
Presentations from the Cloudera Impala meetup on Aug 20 2013Cloudera, Inc.
Presentations from the Cloudera Impala meetup on Aug 20 2013:
- Nong Li on Parquet+Impala and UDF support
- Henry Robinson on performance tuning for Impala
BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use caseDavid Lauzon
High-level use case description of one department of a hospital, and comparisons of two solutions : 1) Big data solution using Cloudera Impala; and 2) Traditional RDBMS solution using Oracle DB.
Big Data is an increasingly powerful enterprise asset and this talk will explore the relationship between big data and cyber security, how we preserve privacy whilst exploiting the advantages of data collection and processing. Big Data technologies provide both governments and corporations powerful tools to offer more efficient and personalized services. The rapid adoption of these technologies has of course created tremendous social benefits. Unfortunately unwanted side effects are the potential rich pickings available to those with malicious intentions. Increasingly, the sophisticated cyber attacker is able to exploit the rich array public data to build detailed profiles on their adversaries to support their malicious intentions
Apachecon Europe 2012: Operating HBase - Things you need to knowChristian Gügi
If you’re running HBase in production, you have to be aware of many things. In this talk we will share our experience in running and operating an HBase production cluster for a customer. To avoid common pitfalls, we’ll discuss problems and challenges we’ve faced as well as practical solutions (real-world techniques) for repair.
Even though HBase provides internal tools for diagnosing issues and for repair, running a healthy cluster can still be challenging for an administrator. We'll cover some background on these tools as well as on HBase internals such as compaction, region splits and their distribution.
We'll also introduce our tool to visualize region sizing and distribution in the cluster, that we recently open sourced.
Learn how Cloudera Impala empowers you to:
- Perform interactive, real-time analysis directly on source data stored in Hadoop
- Interact with data in HDFS and HBase at the “speed of thought”
- Reduce data movement between systems & eliminate double storage
Impala Architecture Presentation at Toronto Hadoop User Group, in January 2014 by Mark Grover.
Event details:
http://www.meetup.com/TorontoHUG/events/150328602/
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...HBaseCon
Phoenix has evolved to become a full-fledged relational database layer over HBase data. We'll discuss the fundamental principles of how Phoenix pushes the computation to the server and why this leads to performance enabling direct support of low-latency applications, along with some major new features. Next, we'll outline our approach for transaction support in Phoenix, a work in-progress, and discuss the pros and cons of the various approaches. Lastly, we'll examine the current means of integrating Phoenix with the rest of the Hadoop ecosystem.
Real-time Big Data Analytics Engine using ImpalaJason Shih
Cloudera Impala is an open-source under Apache Licence enable real-time, interactive analytical SQL queries of the data stored in HBase or HDFS. The work was inspired by Google Dremel paper which is also the basis for Google BigQuery. It provide access same unified storage platform base on it's own distributed query engine but does not use mapreduce. In addition, it use also the same metadata, SQL syntax (HiveQL-like) ODBC driver and user interface (Hue Beeswax) as Hive. Besides the traditional Hadoop approach, aim to provide low-cost solution for resiliency and batch-oriented distributed data processing, we found more and more effort in the Big Data world pursuing the right solution for ad-hoc, fast queries and realtime data processing for large datasets. In this presentation, we'll explore how to run interactive queries inside Impala, advantages of the approach, architecture and understand how it optimizes data systems including also practical performance analysis.
Key attributes for modern real time streaming processing and interactive analytics
What is so exciting to me about Spark?
What are some of the myths?
What is missing in Spark for real time?
SnappyData’s mission – fuse Spark with in-memory data management in one unified cluster to offer – OLTP + OLAP + Stream processing + Probabilistic data
Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...Chicago Hadoop Users Group
John Leach Co-Founder and CTO of Splice Machine with 15+ years software development and machine learning experience will discuss how to use HBase co-processors to build an ANSI-99 SQL database with 1) parallelization of SQL execution plans, 2) ACID transactions with snapshot isolation and 3) consistent secondary indexing.
Transactions are critical in traditional RDBMSs because they ensure reliable updates across multiple rows and tables. Most operational applications require transactions, but even analytics systems use transactions to reliably update secondary indexes after a record insert or update.
In the Hadoop ecosystem, HBase is a key-value store with real-time updates, but it does not have multi-row, multi-table transactions, secondary indexes or a robust query language like SQL. Combining SQL with a full transactional model over HBase opens a whole new set of OLTP and OLAP use cases for Hadoop that was traditionally reserved for RDBMSs like MySQL or Oracle. However, a transactional HBase system has the advantage of scaling out with commodity servers, leading to a 5x-10x cost savings over traditional databases like MySQL or Oracle.
HBase co-processors, introduced in release 0.92, provide a flexible and high-performance framework to extend HBase. In this talk, we show how we used HBase co-processors to support a full ANSI SQL RDBMS without modifying the core HBase source. We will discuss how endpoint transactions are used to serialize SQL execution plans over to regions so that computation is local to where the data is stored. Additionally, we will show how observer co-processors simultaneously support both transactions and secondary indexing.
The talk will also discuss how Splice Machine extended the work of Google Percolator, Yahoo Labs’ OMID, and the University of Waterloo on distributed snapshot isolation for transactions. Lastly, performance benchmarks will be provided, including full TPC-C and TPC-H results that show how Hadoop/HBase can be a replacement of traditional RDBMS solutions.
To view the accompanying slide deck: http://www.slideshare.net/ChicagoHUG/
An Introduction to Cloudera Impala, shows how Impala works, and the internal processing of query of Impala, including architecture, frontend, query compilation, backend, code generation, HDFS-related stuff and performance comparison.
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLArseny Chernov
Fast, demo-enabled 60-min lecture, aligned to curriculum of RDBMS / SQL course taught at Singapore University of Technology and Design (SUTD), a collaboration with MIT. More details about this lecture and some photos here: http://bit.ly/sutd-mit-lecture
HBaseCon 2012 | HBase Filtering - Lars George, ClouderaCloudera, Inc.
This talk will run through the list of filters that are shipped with HBase and show how they are used from a client application. Filters expose varying feature sets, but also exhibit an equally varying impact on read performance – but neither are directly intuitive. A skilled HBase practitioner should know how to select the proper filter for a given use-case, or how to combine sets of filters to achieve what is needed. The talk will conclude with an example for a custom filter and explain how to deploy it on a cluster.
Presentations from the Cloudera Impala meetup on Aug 20 2013Cloudera, Inc.
Presentations from the Cloudera Impala meetup on Aug 20 2013:
- Nong Li on Parquet+Impala and UDF support
- Henry Robinson on performance tuning for Impala
BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use caseDavid Lauzon
High-level use case description of one department of a hospital, and comparisons of two solutions : 1) Big data solution using Cloudera Impala; and 2) Traditional RDBMS solution using Oracle DB.
Big Data is an increasingly powerful enterprise asset and this talk will explore the relationship between big data and cyber security, how we preserve privacy whilst exploiting the advantages of data collection and processing. Big Data technologies provide both governments and corporations powerful tools to offer more efficient and personalized services. The rapid adoption of these technologies has of course created tremendous social benefits. Unfortunately unwanted side effects are the potential rich pickings available to those with malicious intentions. Increasingly, the sophisticated cyber attacker is able to exploit the rich array public data to build detailed profiles on their adversaries to support their malicious intentions
Learn what are the challenges to embrace big data and how AWS enables your organisation to resolve challenges and leverage Big Data for Digital Transformation and innovation.
BDM8 - Near-realtime Big Data Analytics using ImpalaDavid Lauzon
Quick overview of all informations I've gathered on Cloudera Impala. It describes use cases for Impala and what not to use Impala for. Presented at Big Data Montreal #8 at RPM Startup Center.
Making Big Data Projects Successful - Data Science Pop-up SeattleDomino Data Lab
There are exciting and boring elements in any project, both of which must be addressed. This session will cover how to overcome the difficult but necessary problems that are essential to success. Presented by Aaron Cordova
CTO, co-founder at Koverse Inc.
How To Plan a Successful Big Data Pilotfarsitegroup
Big data initiatives often begin with a pilot project. This can generate internal support to invest in larger big data initiatives. Nevertheless, executing pilot projects can be difficult, and many pilots don’t convert into larger big data projects. In this session we’ll explore the challenges of big data pilots and suggest ways to plan and execute a successful pilot.
Combat Cyber Threats with Cloudera Impala & Apache HadoopCloudera, Inc.
Learn how you can use Cloudera Impala to:
- Operate with all data in your domain
- Address cyber security analysis and forensics needs
- Combat fraud, waste, and abuse
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinneyHakka Labs
video: https://www.hakkalabs.co/articles/ibis-operating-python-data-ecosystem-hadoop-scale-wes-mckinney
While Python is a de-facto language for modern data engineering and data science, Python development has been confined to local data processing—thereby limiting its users to smaller data sets. Historically, to address bigger data workloads, Python developers have had to extract samples or aggregates, forcing compromises in data fidelity, adding ETL costs, and ultimately leading to a loss of productivity and addressable use cases.
Ibis, a new open source data analytics framework for Python developers, has the goal of enabling the Python data ecosystem (NumPy, pandas, etc.) to operate efficiently at Hadoop scale. To enable high performance Python at scale without the age-old JVM interoperability problems, Ibis takes advantage of unique synergies between Python and Impala, the leading open source MPP analytical query engine. In this talk, Ibis creator Wes McKinney, who was also the creator of pandas, will demo the current capabilities of Ibis as well as explain its roadmap.
October 2016 HUG: Pulsar, a highly scalable, low latency pub-sub messaging s...Yahoo Developer Network
Yahoo recently open-sourced Pulsar, a highly scalable, low latency pub-sub messaging system running on commodity hardware. It provides simple pub-sub messaging semantics over topics, guaranteed at-least-once delivery of messages, automatic cursor management for subscribers, and cross-datacenter replication. Pulsar is used across various Yahoo applications for large scale data pipelines. Learn more about Pulsar architecture and use-cases in this talk.
Speakers:
Matteo Merli from Pulsar team at Yahoo
Impala 2.0 - The Best Analytic Database for HadoopCloudera, Inc.
A look at why SQL access in Hadoop is critical and the benefits of a native Hadoop analytic database, what’s new with Impala 2.0 and some of the recent performance benchmarks, some common Impala use cases and production customer stories, and insight into what’s next for Impala.
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Cloudera, Inc.
Inefficient data workloads are all too common across enterprises - causing costly delays, breakages, hard-to-maintain complexity, and ultimately lost productivity. For a typical enterprise with multiple data warehouses, thousands of reports, and hundreds of thousands of ETL jobs being executed every day, this loss of productivity is a real problem. Add to all of this the complex handwritten SQL queries, and there can be nearly a million queries executed every month that desperately need to be optimized, especially to take advantage of the benefits of Apache Hadoop. How can enterprises dig through their workloads and inefficiencies to easily see which are the best fit for Hadoop and what’s the fastest path to get there?
Cloudera Navigator Optimizer is the solution - analyzing existing SQL workloads to provide instant insights into your workloads and turns that into an intelligent optimization strategy so you can unlock peak performance and efficiency with Hadoop. As the newest addition to Cloudera’s enterprise Hadoop platform, and now available in limited beta, Navigator Optimizer has helped customers profile over 1.5 million queries and ultimately save millions by optimizing for Hadoop.
How to Make Awesome SlideShares: Tips & TricksSlideShare
Turbocharge your online presence with SlideShare. We provide the best tips and tricks for succeeding on SlideShare. Get ideas for what to upload, tips for designing your deck and more.
Presentation on Cloudera Impala at PDX Data Science Group at Portland. Delievered on February 27, 2013. Presentation slides borrowed from Cloudera Impala's architect, Marcel Kornacker.
An Engine to process big data in faster(than MR), easy and extremely scalable way. An Open Source, parallel, in-memory processing, cluster computing framework. Solution for loading, processing and end to end analyzing large scale data. Iterative and Interactive : Scala, Java, Python, R and with Command line interface.
Introduction to HBase. HBase is a NoSQL databases which experienced a tremendous increase in popularity during the last years. Large companies like Facebook, LinkedIn, Foursquare are using HBase. In this presentation we will address questions like: what is HBase?, and compared to relational databases?, what is the architecture?, how does HBase work?, what about the schema design?, what about the IT ressources?. Questions that should help you consider whether this solution might be suitable in your case.
Cloudera Impala: The Open Source, Distributed SQL Query Engine for Big Data. The Cloudera Impala project is pioneering the next generation of Hadoop capabilities: the convergence of fast SQL queries with the capacity, scalability, and flexibility of a Apache Hadoop cluster. With Impala, the Hadoop ecosystem now has an open-source codebase that helps users query data stored in Hadoop-based enterprise data hubs in real time, using familiar SQL syntax.
This talk will begin with an overview of the challenges organizations face as they collect and process more data than ever before, followed by an overview of Impala from the user's perspective and a dive into Impala's architecture. It concludes with stories of how Cloudera's customers are using Impala and the benefits they see.
This presentation discusses the follow topics
What is Hadoop?
Need for Hadoop
History of Hadoop
Hadoop Overview
Advantages and Disadvantages of Hadoop
Hadoop Distributed File System
Comparing: RDBMS vs. Hadoop
Advantages and Disadvantages of HDFS
Hadoop frameworks
Modules of Hadoop frameworks
Features of 'Hadoop‘
Hadoop Analytics Tools
Presented at the SPIFFE Meetup in Tokyo.
Athenz (www.athenz.io) is an open source platform for X.509 certificate-based service authentication and fine-grained access control in dynamic infrastructures.
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Yahoo Developer Network
Athenz (www.athenz.io) is an open source platform for X.509 certificate-based service authentication and fine-grained access control in dynamic infrastructures that provides options to run multi-environments with a single access control model.
Jithin Emmanuel, Sr. Software Development Manager, Developer Platform Services, provides an overview of Screwdriver (http://www.screwdriver.cd), and shares how it’s used at scale for CI/CD at Oath. Jithin leads the product development and operations of Screwdriver, which is a flagship CI/CD product used at scale in Oath.
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathYahoo Developer Network
Offline and stream processing of big data sets can be done with tools such as Hadoop, Spark, and Storm, but what if you need to process big data at the time a user is making a request? Vespa (http://www.vespa.ai) allows you to search, organize and evaluate machine-learned models from e.g TensorFlow over large, evolving data sets with latencies in the tens of milliseconds. Vespa is behind the recommendation, ad targeting, and search at Yahoo where it handles billions of daily queries over billions of documents.
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Yahoo Developer Network
Offline and stream processing of big data sets can be done with tools such as Hadoop, Spark, and Storm, but what if you need to process big data at the time a user is making a request?
This presentation introduces Vespa (http://vespa.ai) – the open source big data serving engine.
Vespa allows you to search, organize, and evaluate machine-learned models from e.g TensorFlow over large, evolving data sets with latencies in the tens of milliseconds. Vespa is behind the recommendation, ad targeting, and search at Yahoo where it handles billions of daily queries over billions of documents and was recently open sourced at http://vespa.ai.
In recent times, YARN Capacity Scheduler has improved a lot in terms of some critical features and refactoring. Here is a quick look into some of the recent changes in scheduler:
Global Scheduling Support
General placement support
Better preemption model to handle resource anomalies across and within queue.
Absolute resources’ configuration support
Priority support between Queues and Applications
In this talk, we will deep dive into each of these new features to give a better picture of their usage and performance comparison. We will also provide some more brief overview about the ongoing efforts and how they can help to solve some of the core issues we face today.
Speakers:
Sunil Govind (Hortonworks), Jian He (Hortonworks)
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Yahoo Developer Network
In recent years, Yahoo has brought the big data ecosystem and machine learning together to discover mathematical models for search ranking, online advertising, content recommendation, and mobile applications. We use distributed computing clusters with CPUs and GPUs to train these models from 100’s of petabytes of data.
A collection of distributed algorithms have been developed to achieve 10-1000x the scale and speed of alternative solutions. Our algorithms construct regression/classification models and semantic vectors within hours, even for billions of training examples and parameters. We have made our distributed deep learning solutions, CaffeOnSpark and TensorFlowOnSpark, available as open source.
In this talk, we highlight Yahoo use cases where big data and machine learning technologies are best exemplified. We explain algorithm/system challenges to scale ML algorithms for massive datasets. We provide a technical overview of CaffeOnSpark and TensorFlowOnSpark to jumpstart your journey of large-scale machine learning.
Speakers:
Andy Feng is a VP of Architecture at Yahoo, leading the architecture and design of big data and machine learning initiatives. He has architected large-scale systems for personalization, ad serving, NoSQL, and cloud infrastructure. Prior to Yahoo, he was a Chief Architect at Netscape/AOL, and Principal Scientist at Xerox. He received a Ph.D. degree in computer science from Osaka University, Japan.
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...Yahoo Developer Network
Spark and SQL-on-Hadoop have made it easier than ever for enterprises to create or migrate apps to the big data stack. Thousands of apps are being generated every day in the form of ETL and modeling pipelines, business intelligence and data cubes, deep machine learning, graph analytics, and real-time data streaming. However, the task of reliably operationalizing these big data apps involves many painpoints. Developers may not have the experience in distributed systems to tune apps for efficiency and performance. Diagnosing failures or unpredictable performance of apps can be a laborious process that involves multiple people. Apps may get stuck or steal resources and cause mission-critical apps to miss SLAs.
This talk with introduce the audience to these problems and their common causes. We will also demonstrate how to find and fix these problems quickly, as well as prevent such problems from happening in the first place.
Speakers:
Dr. Shivnath Babu is a Co-founder and CTO of Unravel and Associate Professor of Computer Science at Duke University. With more than a decade of experience researching the ease of use and manageability of data-intensive systems, he leads the Starfish project at Duke, which pioneered the automation of Hadoop application tuning, problem diagnosis, and resource management. Shivnath has more than 80 peer-reviewed publications to his credit and has received the U.S. National Science Foundation CAREER Award, the HP Labs Innovation Award, and three IBM Faculty Awards.
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexYahoo Developer Network
Apache Apex (http://apex.apache.org/) is a stream processing platform that helps organizations to build processing pipelines with fault tolerance and strong processing guarantees. It was built to support low processing latency, high throughput, scalability, interoperability, high availability and security. The platform comes with Malhar library - an extensive collection of processing operators and a wide range of input and output connectors for out-of-the-box integration with an existing infrastructure. In the talk I am going to describe how connectors together with the distributed checkpointing (a mechanism used by the Apex to support fault tolerance and high availability) provide exactly-once end-to-end processing guarantees.
Speakers:
Vlad Rozov is Apache Apex PMC member and back-end engineer at DataTorrent where he focuses on the buffer server, Apex platform network layer, benchmarks and optimizing the core components for low latency and high throughput. Prior to DataTorrent Vlad worked on distributed BI platform at Huawei and on multi-dimensional database (OLAP) at Hyperion Solutions and Oracle.
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsYahoo Developer Network
In the analysis of big data there are problematic queries that don’t scale because they require huge compute resources and time to generate exact results. Examples include count distinct, quantiles, most frequent items, joins, matrix computations, and graph analysis. If approximate results are acceptable, there is a class of sub-linear, stochastic streaming algorithms, called "sketches", that can produce results orders-of magnitude faster and with mathematically proven error bounds. For interactive queries there may not be other viable alternatives, and in the case of extracting results for these problem queries in real-time, sketches are the only known solution. For any analysis system that requires these problematic queries from big data, sketches are a required toolkit that should be tightly integrated into the system's analysis capabilities. This technology has helped Yahoo successfully reduce data processing times from days to hours, or minutes to seconds on a number of its internal platforms. This talk covers the current state of our Open Source DataSketches.github.io library, which includes adaptations and example code for Pig, Hive, Spark and Druid and gives architectural examples of use and a case study.
Speakers:
Jon Malkin is a scientist at Yahoo working to extend the DataSketches library. His previous roles have involved large scale data processing for sponsored search, display advertising, user counting, ad targeting, and cross-device user identity modeling.
Alexander Saydakov is a senior software engineer at Yahoo working on the open source Data Sketches project. In his previous roles he has been involved in building large-scale back-end data processing systems and frameworks for data analytics and experimentation based on Torque, Hadoop, Pig, Hive and Druid. Alexander’s education background is in the field of applied mathematics.
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
Jan 2013 HUG: Impala - Real-time Queries for Apache Hadoop
1. Cloudera Impala:
A Modern SQL Engine for Apache Hadoop
Mark Grover
Software Engineer, Cloudera
January 16, 2013
2. What is Impala?
● General-purpose SQL engine
● Real-time queries in Apache Hadoop
● Beta version released since October 2012
● General availability (GA) release slated for April 2013
3. Overview
● User View of Impala
● Architecture of Impala
● Comparing Impala with Dremel
● Comparing Impala with Hive
● Impala Roadmap
4. Impala Overview: Goals
● General-purpose SQL query engine:
● should work both for analytical and transactional workloads
● will support queries that take from milliseconds to hours
● Runs directly within Hadoop:
● reads widely used Hadoop file formats
● talks to widely used Hadoop storage managers
● runs on same nodes that run Hadoop processes
● High performance:
● C++ instead of Java
● runtime code generation
● completely new execution engine that doesn't build on MapReduce
5. User View of Impala: Overview
● Runs as a distributed service in cluster: one Impala daemon on each
node with data
● User submits query via ODBC/Beeswax Thrift API to any of the
daemons
● Query is distributed to all nodes with relevant data
● If any node fails, the query fails
● Impala uses Hive's metadata interface, connects to Hive's metastore
● Supported file formats:
● text files (GA: with compression, including lzo)
● sequence files with snappy/gzip compression
● GA: Avro data files
● GA: Trevni (columnar format; more on that later)
6. User View of Impala: SQL
● SQL support:
● patterned after Hive's version of SQL
● limited to Select, Project, Join, Union, Subqueries, Aggregation and
Insert
● only equi-joins; no non-equi joins, no cross products
● Order By only with Limit
● GA: DDL support (CREATE, ALTER)
● Functional limitations:
● no custom UDFs, file formats, SerDes
● no beyond SQL (buckets, samples, transforms, arrays, structs, maps,
xpath, json)
● only hash joins; joined table has to fit in memory:
● beta: of single node
● GA: aggregate memory of all (executing) nodes
7. User View of Impala: Apache HBase
● HBase functionality:
● uses Hive's mapping of HBase table into metastore table
● predicates on rowkey columns are mapped into start/stop
row
● predicates on other columns are mapped into
SingleColumnValueFilters
● HBase functional limitations:
● no nested-loop joins
● all data stored as text
8. Impala Architecture
● Two binaries: impalad and statestored
● Impala daemon (impalad)
● handles client requests and all internal requests related to
query execution
● exports Thrift services for these two roles
● State store daemon (statestored)
● provides name service and metadata distribution
● also exports a Thrift service
9. Impala Architecture
● Query execution phases
● request arrives via odbc/beeswax Thrift API
● planner turns request into collections of plan fragments
● coordinator initiates execution on remote impalad's
● during execution
● intermediate results are streamed between executors
● query results are streamed back to client
● subject to limitations imposed to blocking operators
(top-n, aggregation)
10. Impala Architecture: Planner
● join order = FROM clause order GA target: rudimentary cost-
based optimizer
● 2-phase planning process:
● single-node plan: left-deep tree of plan operators
● plan partitioning: partition single-node plan to maximize scan locality,
minimize data movement
● plan operators: Scan, HashJoin, HashAggregation, Union,
TopN, Exchange
● distributed aggregation: pre-aggregation in all nodes, merge
aggregation in single node. GA: hash-partitioned aggregation:
re-partition aggregation input on grouping columns in order to
reduce per-node memory requirement
11. Impala Architecture: Planner
● Example: query with join and aggregation
SELECT state, SUM(revenue)
FROM HdfsTbl h JOIN HbaseTbl b ON (...)
GROUP BY 1 ORDER BY 2 desc LIMIT 10
TopN
Agg
TopN
Agg Hash
Hash Agg Join
Join HDFS HBase
Exch Exch
Scan Scan
HDFS HBase at coordinator at DataNodes at region servers
Scan Scan
14. Impala Architecture: Query Execution
Intermediate results are streamed between impalad's Query
results are streamed back to client
SQL App Hive
HDFS NN Statestore
ODBC Metastore
query
results
Query Planner Query Planner Query Planner
Query Coordinator Query Coordinator Query Coordinator
Query Executor Query Executor Query Executor
HDFS DN HBase HDFS DN HBase HDFS DN HBase
15. Impala Architecture
● Metadata handling:
● utilizes Hive's metastore
● caches metadata: no synchronous metastore API calls
during query execution
● beta: impalad's read metadata from metastore at startup
● GA: metadata distribution through statestore
● post-GA: HCatalog and AccessServer
16. Impala Architecture
● Execution engine
● written in C++
● runtime code generation for "big loops"
● example: insert batch of rows into hash table
● code generation with llvm
● inlines all expressions; no function calls inside loop
● all data is copied into canonical in-memory tuple format;
all fixed-width data is located at fixed offsets
● uses intrinsics/special cpu instructions for text parsing,
crc32 computation, etc.
17. Impala's Statestore
● Central system state repository
● name service (membership)
● GA: metadata
● GA: other scheduling-relevant or diagnostic state
● Soft-state
● all impalad's register at startup
● impalad's re-register after losing connection
● Impala service continues to function in absence of statestored (but:
with increasingly stale state)
● State pushed to impalad's periodically
● Repeated failed heartbeat means impalad evicted from cluster view
● Thrift API for service / subscription registration
18. Statestore: Why not ZooKeeper?
● ZK is not a good pub-sub system
● Watch API is awkward and requires a lot of client logic
● multiple round-trips required to get data for changes to
node's children
● push model is more natural for our use case
● Don't need all the guarantees ZK provides:
● serializability
● persistence
● prefer to avoid complexity where possible
● ZK is bad at the things we care about and good at the
things we don't
19. Comparing Impala to Dremel
● What is Dremel?
● columnar storage for data with nested structures
● distributed scalable aggregation on top of that
● Columnar storage in Hadoop: joint project between Cloudera
and Twitter
● new columnar format, derived from Doug Cutting's Trevni
● stores data in appropriate native/binary types
● can also store nested structures similar to Dremel's ColumnIO
● Distributed aggregation: Impala
● Impala plus columnar format: a superset of the published
version of Dremel (which didn't support joins)
20. Comparing Impala to Hive
● Hive: MapReduce as an execution engine
● High latency, low throughput queries
● Fault-tolerance model based on MapReduce's on-disk
checkpointing; materializes all intermediate results
● Java runtime allows for easy late-binding of functionality:
file formats and UDFs.
● Extensive layering imposes high runtime overhead
● Impala:
● direct, process-to-process data exchange
● no fault tolerance
● an execution engine designed for low runtime overhead
21. Comparing Impala to Hive
● Impala's performance advantage over Hive: no hard
numbers, but
● Impala can get full disk throughput (~100MB/sec/disk);
I/O-bound workloads often faster by 3-4x
● queries that require multiple map-reduce phases in Hive
see a higher speedup
● queries that run against in-memory data see a higher
speedup (observed up to 100x)
22. Impala Roadmap: GA – Q2 2013
● New data formats:
● lzo-compressed text
● Avro
● columnar
● Better metadata handling:
● automatic metadata distribution through statestore
● Connectivity: jdbc
● Improved query execution: partitioned joins
● Further performance improvements
23. Impala Roadmap: GA – Q2 2013
● Guidelines for production deployment:
● load balancing across impalad's
● resource isolation within MR cluster
● Additional packages: RHEL 5.7, Ubuntu, Debian
24. Impala Roadmap: 2013
● Improved HBase support:
● composite keys, Avro data in columns,
index nested-loop joins,
INSERT/UPDATE/DELETE
● Additional SQL:
● UDFs
● SQL authorization and DDL
● ORDER BY without LIMIT
● window functions
● support for structured data types
25. Impala Roadmap: 2013
● Runtime optimizations:
● straggler handling
● join order optimization
● improved cache management
● data collocation for improved join performance
● Resource management:
● cluster-wide quotas
● Teradata-style policies ("user x can never have more than 5
concurrent queries running", etc.)
● goal: run exploratory and production workloads in same
cluster, against same data, w/o impacting production jobs
26. Questions
● Further questions Download Impala (beta)
impala-user@cloudera.org
cloudera.com/downloads
● My email: Learn more about Cloudera
mgrover@cloudera.com Enterprise RTQ, Powered
by Impala
Thank you for attending! cloudera.com/impala
Join the Impala Hackathon,
Monday, Feb 25 @ Cloudera
HQ in Palo Alto
Register: bit.ly/DataHackDay