Start of a New era: Apache YARN 3.1 and Apache HBase 2.0 (DataWorks Summit)
The adoption of Machine Learning (ML) and Deep Learning (DL) is a necessary step in an organization’s digital transformation journey. The insights gained from these applications enable businesses to improve their internal and external processes to maintain a competitive advantage. To address ML and DL apps, organizations are procuring expensive hardware resources to handle the extensive processing power required by these workloads.
As organizations make these investments, it is becoming important that they consider the needs of business stakeholders in addition to those of purely infrastructure-focused stakeholders. Hortonworks released HDP 3.0, a major version change of HDP, in July of this year. HDP 3.0 includes major version changes of Apache Hadoop, Apache Hive, Apache HBase, and other components, and adds a large number of features. In this session, we talk about what's new in Apache YARN 3.1 and Apache HBase 2.0.
HDInsight & CosmosDB - Global IoT · Big data processing infrastructure (DataWorks Summit)
We introduce HDInsight, a PaaS offering for Hadoop and Spark, and an IoT and big data processing infrastructure built on CosmosDB, a globally deployable, distributed, multi-model database.
In this webinar, we talk about how Apache Hive, which many of you have known well since Hadoop 1.x, has been made faster in recent years. We cover the execution engine, which changed from MapReduce to Tez; ORC, a columnar file format with indexes; Vectorization, which makes full use of modern CPUs; execution plan optimization by the Cost Based Optimizer built on Apache Calcite; and LLAP, which achieves sub-second query response. All of these features can be used with a few lines of configuration or commands, but this time we talk about what mechanisms run behind them and how they are implemented.
A Benchmark Test on Presto, Spark SQL and Hive on Tez (Gw Liu)
To compare the performance of Presto, Spark SQL, and Hive on Tez, we measured the execution speed of commonly used query patterns on data ranging from tens of thousands to billions of records.
We conducted a benchmark test on mainstream big data SQL engines, including Presto, Spark SQL, and Hive on Tez.
We focused on performance over medium-sized data (from tens of GB to 1 TB), which is the most common case in most services.
Beginner must-see! A future that can be opened by learning Hadoop (DataWorks Summit)
What is "Hadoop", now that it can feel too late to ask? For those who are interested, those who picture a future career as a data engineer, and those who are encountering it for the first time, this session introduces Hadoop and its surrounding ecosystem along with their merits and examples, answers the question "What should I learn now?", and presents the future that opens up through learning Hadoop and its surrounding ecosystem.
The document discusses running Hive/Spark on S3 object storage using S3A committers and running HBase on NFS file storage instead of HDFS. This separates compute and storage and avoids HDFS operations and complexity. S3A committers allow fast, atomic writes to S3 without renaming files. Benchmark results show the magic committer is faster than the file committer for S3 writes. HBase performance tests show FlashBlade NFS providing low latency for random reads/writes compared to Amazon EFS.
This document provides an introduction to Apache Kafka. It begins with an overview of Kafka as a distributed messaging system that is real-time, scalable, low latency, and fault tolerant. It then covers key concepts such as topics, partitions, producers, consumers, and replication. The document explains how Kafka achieves fast reads and writes through its design and use of disk flushing and replication for durability. It also discusses how Kafka can be used to build real-time systems and provides examples like connected cars. Finally, it introduces Apache Metron as an example of a cyber security solution built on Kafka.
Hive2 Introduction -- Interactive SQL for Big Data (Yifeng Jiang)
Introducing the new features of Hive 2 and how it achieves interactive SQL for big data. Features include the new LLAP engine, ACID merge, Hive + Druid integration, and more. I will explain what each feature is, how it works, and what use cases it is for. I will also show some benchmark numbers.
Introduction to Streaming Analytics Manager (Yifeng Jiang)
This document introduces Streaming Analytics Manager (SAM), an open source project led by Hortonworks to simplify building streaming analytics applications. SAM aims to provide the same easy experience for streaming analytics as NiFi does for flow management applications. It allows users to create a streaming analytics application in 10 minutes and supports prescriptive, predictive, and descriptive analytics functions including routing, filtering, predictive modeling, and real-time dashboards. SAM applications are scalable through one-click deployment on distributed streaming platforms.
This document discusses Hortonworks DataFlow (HDF) 3.0 for building IoT platforms. It introduces HDF 3.0 and its key components for data ingestion, management, security, and real-time analysis. These include NiFi for data movement, Streaming Analytics Manager (SAM) for building streaming analytics apps visually, and Schema Registry for managing schemas. The document also presents example IoT use cases and demonstrates building a real-time analytics app in SAM to analyze vehicle event data.
Hortonworks Data Cloud for AWS 1.11 Updates (Yifeng Jiang)
This document discusses Hortonworks Data Cloud, which provides an enterprise-ready Hadoop distribution on AWS. Key points include: HDC offers pre-configured Hortonworks Data Platform clusters on AWS that can be easily deployed and managed; the latest release of HDC (version 1.11) introduces compute nodes that allow using spot instances to reduce costs; and node recipes enable running custom scripts during cluster installation and configuration.
This document discusses security requirements and solutions for Apache Spark production deployments. It covers authenticating users with Kerberos/AD, authorizing access to Spark jobs and data with Ranger, auditing access, and encrypting data at rest and in motion. It provides examples of configuring Kerberos authentication for Spark, using Ranger to control authorization to HDFS and SparkSQL, and demonstrates dynamic row filtering and masking of sensitive data in SparkSQL queries based on user policies.
Introduction to Hortonworks Data Cloud for AWS (Yifeng Jiang)
Hortonworks Data Cloud is a new cloud product from Hortonworks that offers pay-as-you-go pricing for launching and managing Hadoop clusters on AWS. It handles common big data use cases and focuses on ease of use by providing prescriptive cluster types. The product aims to improve enterprise readiness in the cloud by providing scalable storage, security and governance features, and reliability through auto-recovery of unhealthy nodes. It also matches Hadoop with cloud capabilities like scalable storage, customizability, and cost-effective compute.
This document discusses real-time analytics in the financial industry. It describes a use case of detecting abnormal stock transactions in real-time and an architecture to handle it. The architecture uses Kafka as the messaging bus, Storm for real-time processing, and HBase for the data store. It discusses challenges like data ingestion, lookups, deduplication, and late events. Predictive analytics is also mentioned as an extension where machine learning models can be integrated to enhance detection.
Yifeng Jiang gives a presentation introducing Apache NiFi. He begins with an overview of himself and the agenda. He then provides an introduction to NiFi, including terminology such as FlowFile and Processor. Key aspects of NiFi are demonstrated, including the user interface, provenance tracking, queue prioritization, cluster architecture, and a demo of real-time data processing. Example use cases are discussed, such as indexing JSON tweets and indexing data from a relational database. The presentation concludes that NiFi is an easy-to-use and powerful system for processing and distributing data, with 90 built-in processors.
This document discusses strategies for achieving sub-second SQL query performance on Hadoop at scale. It describes two use cases: highly parallel batch reporting on a massive dataset, and online reporting with low latency requirements. For the latter use case, the document evaluates Hive LLAP and Phoenix, finding that Phoenix generally has lower latency, especially for queries with large result sets, through optimizations like skip scans, merging improvements, and table splitting. Tuning HBase and Phoenix configurations can further reduce latency.
This document provides a summary of Amazon Kinesis and Apache Kafka, two platforms for processing real-time streaming data at large scale. It describes key features of each system such as durability, interfaces, processing options, and deployment. Kinesis is a fully managed cloud service that provides high durability for data across AWS availability zones. Kafka is an open source platform that offers lower latency and more flexibility in how data is processed but requires more operational overhead. The document also includes a deep dive on concepts and internals of the Kafka platform.
Yifeng Jiang presented on Apache Hive's present and future capabilities. Hive has achieved 100x performance improvements through technologies like ORC file format, Tez execution engine, and vectorized processing. Upcoming features like LLAP caching and a persistent Hive server aim to provide sub-second query response times for interactive analytics. Hive continues to evolve as the standard SQL interface for Hadoop, supporting a wide range of use cases from ETL and reporting to real-time analytics.
Hadoop Present - Open Enterprise Hadoop (Yifeng Jiang)
The document is a presentation on enterprise Hadoop given by Yifeng Jiang, a Solutions Engineer at Hortonworks. The presentation covers updates to Hadoop Core including HDFS and YARN, data access technologies like Hive, Spark and stream processing, security features in Hadoop, and Hadoop management with Apache Ambari.
7. About Hortonworks
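Customers
• 556 customers (as of August 5, 2015)
• 119 new customers added in Q2 2015
• Listed on NASDAQ (HDP)
Hortonworks Data Platform
• A completely open multi-tenant platform. Any data, any application.
• Consistent enterprise services: security, operations, governance
A partner for our customers
• A leader in the open source community, focused on innovation to meet enterprise requirements
• Unmatched Hadoop support subscription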
Founded in 2011
Original 24 architects, developers, operators of Hadoop from Yahoo!
740+ employees
1350+ ecosystem partners