Maxime Dumas gives a presentation on Cloudera Impala, which provides fast SQL query capability for Apache Hadoop. Impala allows for interactive queries on Hadoop data in seconds rather than minutes by using a native MPP query engine instead of MapReduce. It offers SQL support, performance gains over MapReduce ranging from 3-4x on I/O-bound workloads up to 90x on in-memory data, and the flexibility to query existing Hadoop data without migrating or duplicating it. The latest release, Impala 2.0, adds features such as window functions, subqueries, and the ability to spill joins and aggregations to disk when memory is exhausted.
Impala 2.0 - The Best Analytic Database for HadoopCloudera, Inc.
A look at why SQL access in Hadoop is critical and the benefits of a native Hadoop analytic database, what’s new with Impala 2.0 and some of the recent performance benchmarks, some common Impala use cases and production customer stories, and insight into what’s next for Impala.
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Cloudera, Inc.
Inefficient data workloads are all too common across enterprises - causing costly delays, breakages, hard-to-maintain complexity, and ultimately lost productivity. For a typical enterprise with multiple data warehouses, thousands of reports, and hundreds of thousands of ETL jobs being executed every day, this loss of productivity is a real problem. Add to all of this the complex handwritten SQL queries, and there can be nearly a million queries executed every month that desperately need to be optimized, especially to take advantage of the benefits of Apache Hadoop. How can enterprises dig through their workloads and inefficiencies to easily see which are the best fit for Hadoop and what’s the fastest path to get there?
Cloudera Navigator Optimizer is the solution: it analyzes existing SQL workloads to provide instant insights into them and turns those insights into an intelligent optimization strategy, so you can unlock peak performance and efficiency with Hadoop. As the newest addition to Cloudera’s enterprise Hadoop platform, and now available in limited beta, Navigator Optimizer has helped customers profile over 1.5 million queries and ultimately save millions by optimizing for Hadoop.
A brave new world in mutable big data relational storage (Strata NYC 2017)Todd Lipcon
The ever-increasing interest in running fast analytic scans on constantly updating data is stretching the capabilities of HDFS and NoSQL storage. Users want the fast online updates and serving of real-time data that NoSQL offers, as well as the fast scans, analytics, and processing of HDFS. Additionally, users are demanding that big data storage systems integrate natively with their existing BI and analytic technology investments, which typically use SQL as the standard query language of choice. This demand has led big data back to a familiar friend: relationally structured data storage systems.
Todd Lipcon explores the advantages of relational storage and reviews new developments, including Google Cloud Spanner and Apache Kudu, which provide a scalable relational solution for users who have too much data for a legacy high-performance analytic system. Todd explains how to address use cases that fall between HDFS and NoSQL with technologies like Apache Kudu or Google Cloud Spanner and how the combination of relational data models, SQL query support, and native API-based access enables the next generation of big data applications. Along the way, he also covers suggested architectures, the performance characteristics of Kudu and Spanner, and the deployment flexibility each option provides.
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Data Con LA
This session describes how Impala integrates with Kudu for analytic SQL queries on Hadoop and how this integration, taking full advantage of the distinct properties of Kudu, has significant performance benefits.
Learn how Cloudera Impala empowers you to:
- Perform interactive, real-time analysis directly on source data stored in Hadoop
- Interact with data in HDFS and HBase at the “speed of thought”
- Reduce data movement between systems & eliminate double storage
Data Science at Scale Using Apache Spark and Apache HadoopCloudera, Inc.
Learn about the skills and tools a data scientist needs and how to start training to be one.
There's so much noise about what a data scientist is or isn't that it can be challenging to identify the skills needed to start training a team or becoming one yourself. What exactly is a data scientist and where do you start?
Cloudera's Director of Data Science, Sean Owen, will start by walking through the different skills data scientist should have and why businesses need them. Afterwards, Tom Wheeler, Cloudera's Principal Curriculum Developer, will introduce the latest data science course developed by Cloudera University designed to help people take their first steps to becoming a data scientist.
Intel and Cloudera: Accelerating Enterprise Big Data SuccessCloudera, Inc.
The data center has gone through several inflection points in the past decades: adoption of Linux, migration from physical infrastructure to virtualization and Cloud, and now large-scale data analytics with Big Data and Hadoop.
Please join us to learn about how Cloudera and Intel are jointly innovating through open source software to enable Hadoop to run best on IA (Intel Architecture) and to foster the evolution of a vibrant Big Data ecosystem.
High concurrency, Low latency analytics using Spark/KuduChris George
With the right combination of open source projects, you can have high-concurrency, low-latency Spark jobs for data analysis. We'll show both REST and JDBC access to data from a persistent Spark context, and then show how the combination of Spark Job Server, Spark Thrift Server, and Apache Kudu can create a scalable backend for low-latency analytics.
This talk was given by Marcel Kornacker at the 11th meetup, on April 7, 2014.
Impala (impala.io) raises the bar for SQL query performance on Apache Hadoop. With Impala, you can query Hadoop data – including SELECT, JOIN, and aggregate functions – in real time to do BI-style analysis. As a result, Impala makes a Hadoop-based enterprise data hub function like an enterprise data warehouse for native Big Data.
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Cloudera, Inc.
The Hadoop ecosystem has improved real-time access capabilities recently, narrowing the gap with relational database technologies. However, gaps remain in the storage layer that complicate the transition to Hadoop-based architectures. In this session, the presenter will describe these gaps and discuss the tradeoffs between real-time transactional access and fast analytic performance from the perspective of storage engine internals. The session also will cover Kudu (currently in beta), the new addition to the open source Hadoop ecosystem with out-of-the-box integration with Apache Spark and Apache Impala (incubating), which achieves fast scans and fast random access from a single API.
Performance Optimizations in Apache ImpalaCloudera, Inc.
Apache Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Impala provides low latency and high concurrency for BI/analytic, read-mostly queries on Hadoop that batch frameworks such as Hive or Spark do not deliver. Impala is written from the ground up in C++ and Java. It maintains Hadoop’s flexibility by utilizing standard components (HDFS, HBase, Metastore, Sentry) and is able to read the majority of the widely used file formats (e.g. Parquet, Avro, RCFile).
To reduce latency, such as that incurred from utilizing MapReduce or by reading data remotely, Impala implements a distributed architecture based on daemon processes that are responsible for all aspects of query execution and that run on the same machines as the rest of the Hadoop infrastructure. Impala employs runtime code generation using LLVM in order to improve execution times and uses static and dynamic partition pruning to significantly reduce the amount of data accessed. The result is performance that is on par or exceeds that of commercial MPP analytic DBMSs, depending on the particular workload. Although initially designed for running on-premises against HDFS-stored data, Impala can also run on public clouds and access data stored in various storage engines such as object stores (e.g. AWS S3), Apache Kudu and HBase. In this talk, we present Impala's architecture in detail and discuss the integration with different storage engines and the cloud.
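To make the partition-pruning point concrete, here is a small hedged sketch in SQL; the table and partition column are hypothetical, but filtering on a partition key like this is what lets Impala skip reading non-matching partitions entirely.
-- Hypothetical table partitioned by year (illustrative only).
CREATE TABLE sales (id BIGINT, amount DOUBLE)
PARTITIONED BY (sale_year INT)
STORED AS PARQUET;
-- Static partition pruning: only the sale_year = 2014 partition is scanned.
SELECT SUM(amount) FROM sales WHERE sale_year = 2014;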
This deck is from an AWS re:Invent re:Cap event, presented by Ilho Kim, Solutions Architect at Amazon Web Services.
Summary: best practices and architecture design patterns for data analysis using services such as Hadoop on Elastic MapReduce, Redshift, Kinesis, Data Pipeline, and S3, plus a look at the compute-optimized Amazon EC2 C4 instances introduced at re:Invent and the newly announced Amazon EBS volume size and performance improvements.
Cloudera Federal Forum 2014: Cloud Deployment for the Enterprise Data HubCloudera, Inc.
Chief Technologist, Office of the CTO at Cloudera, Eli Collins, shares information about the enterprise data hub in the cloud and Cloudera's relationship with AWS.
Challenges for running Hadoop on AWS - AdvancedAWS MeetupAndrei Savu
Nowadays we've got all the tools we need to spin up and tear down clusters with hundreds of nodes in minutes, and this puts more pressure on the tools we use to configure and monitor our applications. This challenge is even more interesting when we have to deal with long-running distributed data storage and processing systems like Hadoop. In this talk we will look into some of the challenges we need to deal with when creating and managing Hadoop clusters in AWS, discuss improvement opportunities in monitoring (e.g. detecting and dealing with instance failure, resource contention, and noisy neighbors), and touch on the future and how we should go about disconnecting workload dispatch from cluster lifecycle.
Manufacturers have an abundance of data from connected sensors, plant systems, manufacturing systems, claims systems, and external industry and government sources. They face mounting challenges, from continually improving product quality and reducing warranty and recall costs to efficiently leveraging their supply chain. Giving the manufacturer a complete view of product and customer information, for example, means integrating manufacturing and plant-floor data, as-built product configurations, and sensor data from customer use, so that warranty claim information can be analyzed efficiently to shorten detection-to-correction time, detect fraud, and even get ahead of issues proactively. That requires a capable enterprise data hub that integrates large volumes of both structured and unstructured information. Learn how an enterprise data hub built on Hadoop provides the tools to support analysis at every level of the manufacturing organization.
This deck covers key considerations and provides advice for enterprises looking to run production-scale Cloudera on AWS. We touch on everything from security to governance to selecting the right instance type for your Hadoop workload (Spark, Impala, Search, etc).
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...Amazon Web Services
Enterprises are starting to deploy large scale Hadoop clusters to extract value out of the data that they are generating. These clusters often span hundreds of nodes. To speed up the time to value, a lot of the newer deployments are happening in AWS, moving from the traditional on-premises, bare-metal world. Cloudera supports just such deployments. In this session, Cloudera shares the lessons learned and best practices for deploying multi-tenant Hadoop clusters in AWS. They will cover what reference deployments look like, what services are relevant for Hadoop deployments, network configurations, instance types, backup and disaster recovery considerations, and security considerations. They will also talk about what works well, what doesn't, and what has to be done going forward to improve the operability of Hadoop on AWS.
Hive Training -- Motivations and Real World Use Casesnzhang
Hive is an open source data warehouse system built on Hadoop, a MapReduce implementation.
This presentation introduces the motivations for developing Hive and how Hive is used in real-world situations, particularly at Facebook.
Docker Hands-On Tutorial: Even First-Time Docker Users Can Bring Up a Web Server in 60 Minutes!pyrasis
Docker Hands-On Tutorial
- Even people new to Docker can bring up a web server in 60 minutes!
Covers everything from basic Docker concepts to installation and usage.
For more details, please see Docker for the Really Impatient:
http://www.pyrasis.com/private/2014/11/30/publish-docker-for-the-really-impatient-book
Marcel Kornacker is a tech lead at Cloudera
In this talk from Impala architect Marcel Kornacker, you will explore: How Impala's architecture supports query speed over Hadoop data that not only convincingly exceeds that of Hive, but also that of a proprietary analytic DBMS over its own native columnar format. The current state of, and roadmap for, Impala's analytic SQL functionality. An example configuration and benchmark suite that demonstrate how Impala offers a high level of performance, functionality, and ability to handle a multi-user workload, while retaining Hadoop’s traditional strengths of flexibility and ease of scaling.
This is a presentation on Apache Hadoop technology. It may help beginners learn Hadoop terminology, and it contains pictures that describe how the technology works. I hope it is helpful for beginners.
Thank you.
This presentation is about Apache Hadoop technology and may be helpful for beginners, who will learn some Hadoop terminology. It also includes diagrams that show how the technology works.
Thank you.
Introduction to Kudu - StampedeCon 2016StampedeCon
Over the past several years, the Hadoop ecosystem has made great strides in its real-time access capabilities, narrowing the gap compared to traditional database technologies. With systems such as Impala and Spark, analysts can now run complex queries or jobs over large datasets within a matter of seconds. With systems such as Apache HBase and Apache Phoenix, applications can achieve millisecond-scale random access to arbitrarily-sized datasets.
Despite these advances, some important gaps remain that prevent many applications from transitioning to Hadoop-based architectures. Users are often caught between a rock and a hard place: columnar formats such as Apache Parquet offer extremely fast scan rates for analytics, but little to no ability for real-time modification or row-by-row indexed access. Online systems such as HBase offer very fast random access, but scan rates that are too slow for large scale data warehousing workloads.
This talk will investigate the trade-offs between real-time transactional access and fast analytic performance from the perspective of storage engine internals. It will also describe Kudu, the new addition to the open source Hadoop ecosystem that fills the gap described above, complementing HDFS and HBase to provide a new option to achieve fast scans and fast random access from a single API.
Doug Cutting discusses:
- A brief history of Spark and its rise in popularity across developers and enterprises
- Spark's advantages over MapReduce
- The One Platform Initiative and the roadmap for Spark
- The future of data processing in Hadoop
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Data Con LA
Apache Kudu (incubating) is a new storage engine for the Hadoop ecosystem that enables extremely high-speed analytics without imposing data-visibility latencies. This talk provides an introduction to Kudu, and provides an overview of how, when, and why practitioners use Kudu as a platform for building analytics solutions.
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloudgluent.
Hive was the first popular SQL layer built on Hadoop and has long been known as a heavyweight SQL engine suitable mainly for long-running batch jobs. This has greatly changed since Hive was announced to the world over 8 years ago. Hortonworks and the open source community have evolved Apache Hive into a fast, dynamic SQL on Hadoop engine capable of running highly concurrent query workloads over large datasets with sub-second response time.
The latest Hortonworks and Azure HDInsight platform versions fully support Hive with LLAP execution engine for production use. In this webinar, we will go through the architecture of Hive + LLAP engine and explain how it differs from previous Hive versions. We will then dive deeper and show how features like query vectorization and LLAP columnar caching bring further automatic performance improvements.
In the end, we will show how Gluent brings these new performance benefits to traditional enterprise database platforms via transparent data virtualization, allowing even your largest databases to benefit from all this without changing any application code. Join this webinar to learn about significant improvements in modern Hive architecture and how Gluent and Hive LLAP on Hortonworks or Azure HDInsight platforms can accelerate cloud migrations and greatly improve hybrid query performance!
VMworld 2013: Virtualizing Databases: Doing IT Right VMworld
VMworld 2013
Michael Corey, Ntirety, Inc
Jeff Szastak, VMware
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
Impala Architecture Presentation at Toronto Hadoop User Group, in January 2014 by Mark Grover.
Event details:
http://www.meetup.com/TorontoHUG/events/150328602/
You want to use MySQL in Amazon RDS, Rackspace Cloud, Google Cloud SQL or HP Helion Public Cloud? Check this out, from Percona Live London 2014. (Note that Google Cloud SQL pricing changed on the same day, after the presentation.)
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...Yahoo Developer Network
Splice Machine is an open-source database that combines the benefits of modern lambda architectures with the full expressiveness of ANSI-SQL. Like lambda architectures, it employs separate compute engines for different workloads - some call this an HTAP database (Hybrid Transactional and Analytical Platform). This talk describes the architecture and implementation of Splice Machine V2.0. The system is powered by a sharded key-value store for fast short reads and writes, and short range scans (Apache HBase) and an in-memory, cluster data flow engine for analytics (Apache Spark). It differs from most other clustered SQL systems such as Impala, SparkSQL, and Hive because it combines analytical processing with a distributed Multi-Version Concurrency Control method that provides the fine-grained concurrency required to power real-time applications. This talk will highlight the Splice Machine storage representation, transaction engine, cost-based optimizer, and present the detailed execution of operational queries on HBase, and the detailed execution of analytical queries on Spark. We will compare and contrast how Splice Machine executes queries with other HTAP systems such as Apache Phoenix and Apache Trafodion. We will end with some roadmap items under development involving new row-based and column-based storage encodings.
Speakers:
Monte Zweben, is a technology industry veteran. Monte’s early career was spent with the NASA Ames Research Center as the Deputy Chief of the Artificial Intelligence Branch, where he won the prestigious Space Act Award for his work on the Space Shuttle program. He then founded and was the Chairman and CEO of Red Pepper Software, a leading supply chain optimization company, which merged in 1996 with PeopleSoft, where he was VP and General Manager, Manufacturing Business Unit. In 1998, he was the founder and CEO of Blue Martini Software – the leader in e-commerce and multi-channel systems for retailers. Blue Martini went public on NASDAQ in one of the most successful IPOs of 2000, and is now part of JDA. Following Blue Martini, he was the chairman of SeeSaw Networks, a digital, place-based media company. Monte is also the co-author of Intelligent Scheduling and has published articles in the Harvard Business Review and various computer science journals and conference proceedings. He currently serves on the Board of Directors of Rocket Fuel Inc. as well as the Dean’s Advisory Board for Carnegie-Mellon’s School of Computer Science.
Similar to Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014 (20)
Rocana Deep Dive OC Big Data Meetup #19 Sept 21st 2016cdmaxime
What we do:
-We build a system for the operation of modern data centers
-Triage and diagnostics, exploration, trends, advanced analytics of complex systems
-Our data: logs, metrics, human activity, anything that occurs in the data center
-Enterprise Software (i.e. we build for others.)
Today's presentation: how we built what we built on top of Apache Hadoop
Art of Living Happiness App Challenge - San Diego Meetup Nov 20th 2014cdmaxime
The Happiness Apps Challenge is an international App building challenge that is aimed to inspire minds in tech and design to create products that will increase world’s happiness quotient!
Create an App to make people happy. One positive person can spread Happiness to more than 1,000 people, but through the power of technology we can reach out to millions.
$ 50,000 in Prizes
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
1. Cloudera Impala
LV Big Data Monthly Meetup #1
November 5th 2014
Maxime Dumas
Systems Engineer
2. Thirty Seconds About Max
• Systems Engineer
• aka Sales Engineer
• SoCal, AZ, NV
• former coder of PHP
• teaches meditation + yoga
• from Montreal, Canada
3. What Does Cloudera Do?
• product
• distribution of Hadoop components, Apache licensed
• enterprise tooling
• support
• training
• services (aka consulting)
• community
4. What This Talk Isn’t About
• deploying
• Puppet, Chef, Ansible, homegrown scripts, intern labor
• sizing & tuning
• depends heavily on data and workload
• coding
• unless you count XML or CSV or SQL
• algorithms
7. cloud·e·ra im·pal·a
/kloudˈi(ə)rə imˈpalə/
noun
a modern, open source, MPP SQL query engine
for Apache Hadoop.
“Cloudera Impala provides fast, ad hoc SQL query
capability for Apache Hadoop, complementing
traditional MapReduce batch processing.”
8. Impala adoption
Vendor support by component (and founder):
Component (and Founder)      Cloudera  MapR  Amazon  IBM  Pivotal  Hortonworks
Impala (Cloudera)               ✔       ✔      ✔      X      X         X
Hue (Cloudera)                  ✔       ✔      X      X      X         ✔
Sentry (Cloudera)               ✔       ✔      X      ✔      ✔         X
Flume (Cloudera)                ✔       ✔      X      ✔      ✔         ✔
Parquet (Cloudera/Twitter)      ✔       ✔      X      ✔      ✔         X
Sqoop (Cloudera)                ✔       ✔      ✔      ✔      ✔         ✔
Ambari (Hortonworks)            X       X      X      X      ✔         ✔
Knox (Hortonworks)              X       X      X      X      X         ✔
Tez (Hortonworks)               X       X      X      X      X         ✔
Drill (MapR)                    X       ✔      X      X      X         X
9. The Apache Hadoop Ecosystem
Quick and dirty, for context.
11. Why “Ecosystem?”
• In the beginning, just Hadoop
• HDFS
• MapReduce
• Today, dozens of interrelated components
• I/O
• Processing
• Specialty Applications
• Configuration
• Workflow
12. HDFS
• Distributed, highly fault-tolerant filesystem
• Optimized for large streaming access to data
• Based on Google File System
• http://research.google.com/archive/gfs.html
14. MapReduce (MR)
• Programming paradigm
• Batch oriented, not realtime
• Works well with distributed computing
• Lots of Java, but other languages supported
• Based on Google’s paper
• http://research.google.com/archive/mapreduce.html
15. Apache Hive
• Abstraction of Hadoop’s Java API
• HiveQL “compiles” down to MR
• a “SQL-like” language
• Eases analysis using MapReduce
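As a hedged illustration of the "SQL-like" point (the weblogs table and its columns are hypothetical), a HiveQL query like the following is compiled into one or more MapReduce jobs rather than executed by a long-running engine:
-- HiveQL: compiles down to MapReduce behind the scenes.
SELECT state, COUNT(*) AS visits
FROM weblogs
GROUP BY state
ORDER BY visits DESC
LIMIT 10;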
21. Cloudera Impala
• Interactive query on Hadoop
• think seconds, not minutes
• ANSI-92 standard SQL
• compatible with HiveQL
• Native MPP query engine
• built for low-latency queries
• HDFS and HBase storage
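A minimal sketch of what "seconds, not minutes" looks like in practice, assuming a hypothetical weblogs table already registered in the shared metastore; the same statement can be issued from impala-shell or over ODBC/JDBC:
-- Interactive, BI-style query; HiveQL-compatible syntax.
SELECT referrer, COUNT(*) AS hits
FROM weblogs
WHERE url LIKE '%/checkout%'
GROUP BY referrer
ORDER BY hits DESC
LIMIT 20;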
22. Cloudera Impala – Design Choices
• Native daemons, written in C/C++
• No JVM, no MapReduce
• Saturate disks on reads
• Uses in-memory HDFS caching
• Re-uses Hive metastore
• Not as fault-tolerant as MapReduce
23. Benefits of Impala
Unlocks BI/analytics on Hadoop
• Interactive SQL in seconds
• Highly concurrent to handle 100s of users
Native Hadoop flexibility
• No data migration, conversion, or duplication required
• Query existing Hadoop data
• Run multiple frameworks on the same data at the same time
• Supports Parquet for best-of-breed columnar performance
Native MPP query engine designed into Hadoop:
• Unified Hadoop storage
• Unified Hadoop metadata (uses Hive and HCatalog)
• Unified Hadoop security
• Fine-grained role-based access controls with Sentry
Apache-licensed open source
Proven in Production
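As a sketch of the fine-grained, role-based access controls mentioned above (the role, group, and table names are hypothetical, and this assumes a cluster with the Sentry service enabled and GRANT/REVOKE support):
-- Hypothetical role-based grant via Sentry.
CREATE ROLE analyst;
GRANT ROLE analyst TO GROUP analysts;
GRANT SELECT ON TABLE sales TO ROLE analyst;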
24. Cloudera Impala – Architecture
• Impala Daemon
• runs on every node
• handles client requests
• handles query planning & execution
• State Store Daemon
• provides name service
• metadata distribution
• used for finding data
26. Impala Query Execution
2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data
[Architecture diagram: a SQL app connects over ODBC; the Hive Metastore, HDFS NameNode, and Statestore provide metadata; each of three nodes runs a Query Planner, Query Coordinator, and Query Executor on top of the local HDFS DataNode and HBase.]
27. Impala Query Execution
4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to client
[Architecture diagram: the same three impalad nodes (Query Planner, Query Coordinator, Query Executor over the local HDFS DataNode and HBase); intermediate results stream between executors, and final query results stream back through the coordinating node to the SQL app over ODBC.]
28. Cloudera Impala – Results
• Allows for fast iteration/discovery
• How much faster?
• 3-4x faster on I/O bound workloads
• up to 45x faster on multi-MR queries
• up to 90x faster on in-memory cache
29. Latest SQL Performance
[Bar chart, time in seconds, lower bars = better; single-user and 10-user response times: Impala 5 / 11; Spark SQL 25 / 120 (5.0x / 10.6x vs. Impala); Presto 37 / 202 (7.4x / 18.3x); Hive-on-Tez 77 / 302 (15.4x / 27.4x).]
Independent validation by IBM Research SQL-on-Hadoop VLDB paper:
“Impala’s database architecture provides significant performance gains”
30. Previous Milestones
Impala 1.0 (GA): Spring 2013
Impala 1.1 (Security): Summer 2013
Impala 1.2 (Usability): Fall 2013
Impala 1.3 (Resource Management): Spring 2014
Impala 1.4 (Extensibility): Summer 2014
Impala 2.0 (SQL): Fall 2014, bringing analytic database capabilities
31. Cloudera Impala 2.0
Window Functions
“Aggregate function applied to a partition of the result set” (SQL 2003)
Ex:
sum(population) OVER (PARTITION BY city)
rank() OVER (PARTITION BY state ORDER BY population)
We’ve implemented most of the spec
• PARTITION BY, ORDER BY
• WINDOW
• PRECEDING, FOLLOWING
• ROWS
• Any number of analytic functions in one query
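Putting the pieces above together, a hedged, self-contained example (the cities table is hypothetical); both analytic functions run in a single query, as the last bullet promises:
-- Hypothetical table of (city, state, population) rows.
SELECT
  city,
  state,
  population,
  SUM(population) OVER (PARTITION BY state) AS state_population,
  RANK() OVER (PARTITION BY state ORDER BY population DESC) AS rank_in_state
FROM cities;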
32. Cloudera Impala 2.0
Subqueries
A query that is part of another query. Ex:
select col from t1
where col in
(select c2 from t2)
Support:
• Correlated and uncorrelated subqueries.
• IN, NOT IN, EXISTS, NOT EXISTS
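For instance, a correlated EXISTS subquery of the kind listed above (the customers and orders tables are hypothetical; per the Editor's Notes later in this deck, Impala rewrites subqueries as joins and supports them in the WHERE clause):
-- Customers with at least one order in 2014 (correlated EXISTS).
SELECT c.name
FROM customers c
WHERE EXISTS (
  SELECT 1 FROM orders o
  WHERE o.customer_id = c.id
  AND o.order_year = 2014
);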
33. Cloudera Impala 2.0
Spill to disk joins & aggregations
• Previously, if a query ran out of memory, Impala would abort it
• This meant some big joins (fact table to fact table) could never run
• All operators that accumulate memory can now spill to disk if necessary
• Order by (Impala 1.4)
• Join/Agg (Impala 2.0)
• Analytic Functions (Impala 2.0)
• Transparent to existing workloads
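A hedged illustration of the "transparent to existing workloads" point: MEM_LIMIT is a real Impala query option, but the value and tables below are hypothetical. A fact-to-fact join whose hash tables exceed the limit now spills to disk instead of aborting the query.
-- In impala-shell: cap per-node memory for queries in this session.
SET MEM_LIMIT=2g;
-- A big fact-to-fact join; with Impala 2.0 it spills to disk if needed.
SELECT s.store_id, SUM(r.amount) AS refunded
FROM sales s
JOIN returns r ON s.txn_id = r.txn_id
GROUP BY s.store_id;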
34. Cloudera Impala 2.1 +
• Nested data – enables queries on complex nested structures including maps, structs, and arrays (early 2015)
• MERGE statement – enables merging updates into existing tables
• Additional analytic SQL functionality – ROLLUP, CUBE, and GROUPING SET
• SQL SET operators – MINUS, INTERSECT
• Apache HBase CRUD – allows use of Impala for inserts and updates into HBase
• UDTFs (user-defined table functions) – for more advanced user functions and extensibility
• Intra-node parallelized aggregations and joins – to provide even faster joins and aggregations on top of the performance gains of Impala
• Parquet enhancements – continued performance gains including index pages
• Amazon S3 integration
39. Thank You!
Maxime Dumas
mdumas@cloudera.com
We’re hiring.
Editor's Notes
Similar to the Red Hat model.
Hadoop elephant logo licensed for public use via Apache license: Apache Software Foundation, http://www.apache.org/foundation/marks/
Furthermore, for projects that carry the Apache License, open-ness does not always guarantee freedom from lock-in to a single support provider. For example, Drill, Knox, Tez, and Falcon are all open source, and all shipped by a single vendor – what’s a better example of “lock-in” than that?
We’re going to breeze through these really quick, just to show how Search plugs in later…
Lose a server, no problem. Lose a rack, no problem.
We’re going to breeze through these really quick, just to show how Search plugs in later…
More & Faster Value from Big Data
Provides an interactive BI/Analytics experience on Hadoop
Previously BI/Analytics was impractical due to the batch orientation of MapReduce
Enables more users to gain value from organizational data assets (SQL/BI users)
Makes more data available for analysis (raw data, multi-structured data, historical data)
Removes delays from data migration
Into specialized analytical DBMSs
Into proprietary file formats that happen to be stored in HDFS
Into transient in-memory stores
Flexibility
Query across existing data in Hadoop
HDFS and HBase
Access data immediately and directly in its native format
Select best-fit file formats
Use raw data formats when unsure of access patterns (text files, RCFiles, LZO)
Increase performance with optimized file formats when access patterns are known (Parquet, Avro)
Run multiple frameworks on the same data at the same time
All file formats are compatible across the entire Hadoop ecosystem – i.e. MapReduce, Pig, Hive, Impala, etc. on the same data at the same time
Cost Efficiency
Reduce movement, duplicate storage & compute
Data movement: no time or resource penalty for migrating data into specialized systems or formats
Duplicate storage: no need to duplicate data across systems or within the same system in different file formats
Compute: use the same compute resources as the rest of the Hadoop system –
You don’t need a separate set of nodes to run interactive query vs. batch processing (MapReduce)
You don’t need to overprovision your hardware to enable memory-intensive, on-the-fly format conversions
10% to 1% the cost of an analytic DBMS
Less than $1,000/TB
Full Fidelity Analysis
No loss of fidelity from aggregations or conforming to fixed schemas
If the attribute exists in the raw data, you can query against it
These run continuously, always ready. In C/C++ for the most part.
Impala 1.0
~SQL-92 (minus correlated sub-queries)
Native Hadoop file formats (Parquet, Avro, text, Sequence, …)
Enterprise-readiness (authentication, ODBC/JDBC drivers, etc)
Service-level resource isolation with other Hadoop frameworks
Impala 1.1
Fine-grained, role-based authorization via Apache Sentry
Auditing (Impala 1.1.1 and CM 4.7+)
Impala 1.2
Custom language extensibility (UDFs, UDAFs)
Cost-based join-order optimization
On-par performance compared to traditional MPP query engines while maintaining native Hadoop data flexibility
Impala 1.3 / CDH 5.0 (also has version for CDH 4.x)
Resource management
Do not support RANGE windows.
Range windows let you specify a range based on the current row’s value (as opposed to ROWS, which is the ordinal).
Example:
sum(c) OVER (ORDER BY year RANGE BETWEEN 1 PRECEDING AND 2 FOLLOWING)
Error: “RANGE is only supported with both the lower and upper bounds UNBOUNDED or one UNBOUNDED and the other CURRENT ROW."
No UDA support
Not all aggregate functions are supported (ndv, etc)
Looking at both for 2.1.
All subqueries are rewritten as joins.
No “Independent evaluation”
We’ve added additional join types to support this:
LEFT/RIGHT ANTI-JOIN
RIGHT SEMI-JOIN
NULL AWARE LEFT ANTI JOIN
Subqueries are only supported in the WHERE clause.
Impala can’t reason if a subquery returns one row in all cases:
select col limit 1 works
select min(col) works
select min(col) where x = 1 group by x doesn’t
Can manually add a limit 1 to the subquery.
See docs for more details
These should all have error messages explaining why
We implemented the common use cases.
Impala hash partitions the input to the operator, spilling partitions as necessary
When all the input is partitioned, Impala processes the partitions that are still in memory (did not spill)
Impala then processed the spilled partitions 1 by 1, repartitioning if necessary.
Impala tries to minimize the number of spilled bytes.
Peak memory usage when the first spill happened
Stays high until we handled all the non-spilled partitions
Lower as we handle the spilled partitions 1 by 1.
We’re going to breeze through these really quick, just to show how Search plugs in later…