Scalable OLAP with Druid

Zero ETL analytics with LLAP in Azure HDInsight

It’s 2017, and big data challenges are as real as they get. Our customers have petabytes of data living in elastic and scalable commodity storage systems such as Azure Data Lake Store and Azure Blob storage. One of the central questions today is how to find insights from data in these storage systems interactively, at a fraction of the cost. Interactive Query leverages Hive on LLAP in Apache Hive 2.1 to bring interactivity to complex data warehouse style queries on large datasets stored on commodity cloud storage. In this session, you will learn how technologies such as Low Latency Analytical Processing (LLAP) and Hive 2.x make it possible to analyze petabytes of data with sub-second latency in common file formats such as CSV and JSON, without converting to columnar file formats like ORC or Parquet. We will go deep into LLAP’s performance and architecture benefits and how it compares with Spark and Presto in Azure HDInsight. We will also look at how business analysts can use familiar tools such as Microsoft Excel and Power BI to run interactive queries over their data lake without moving data outside it. Speaker: Ashish Thapliyal, Principal Program Manager, Microsoft Corp

From Device to Data Center to Insights

Druid at Hadoop Ecosystem

Slim Bouguerra

Smack Stack and Beyond—Building Fast Data Pipelines with Jorg Schad

Spark Summit

There is an ever-increasing number of use cases, such as online fraud detection, for which the response times of traditional batch processing are too slow. To react to such events in close to real time, you need to go beyond classical batch processing and use stream processing systems such as Apache Spark Streaming, Apache Flink, or Apache Storm. These systems, however, are not sufficient on their own. For an efficient and fault-tolerant setup, you also need a message queue and a storage system. One common example of a fast data pipeline is the SMACK stack. SMACK stands for: Spark (Streaming), the stream processing system; Mesos, the cluster orchestrator; Akka, the system for providing custom actors that react to the analyses; Cassandra, the storage system; and Kafka, the message queue. Setting up this kind of pipeline in a scalable, efficient and fault-tolerant manner is not trivial. First, this workshop will discuss the different components in the SMACK stack. Then, participants will get hands-on experience in setting up and maintaining data pipelines.

Introduction to the Hadoop EcoSystem

Shivaji Dutta

Intro to Big Data Analytics using Apache Spark and Apache Zeppelin

Alex Zeltov

This workshop will provide an introduction to Big Data Analytics using Apache Spark and Apache Zeppelin (https://github.com/zeltovhorton/intro_spark_zeppelin_meetup). There will be a short lecture introducing Spark and its components. Spark is a unified framework for big data analytics. It provides one integrated API for developers, data scientists, and analysts to perform diverse tasks that previously required separate processing engines, such as batch analytics, stream processing and statistical modeling. Spark supports a wide range of popular languages including Python, R, Scala, SQL, and Java. Spark can read from diverse data sources and scale to thousands of nodes. The lecture will be followed by a demo. There will also be a short lecture on Hadoop and how Spark and Hadoop interact and complement each other. You will learn how to move data into HDFS using Spark APIs, create a Hive table, explore the data with Spark and SQL, transform the data and then issue some SQL queries. We will be using Scala and/or PySpark for labs.

Streaming SQL

Improving Organizational Knowledge with Natural Language Processing Enriched ...

The information age has allowed everyone to tap into the exponential production of data. Unfortunately, much actionable insight is the result of unexpected or anomalous behavior that can only be recognized through experience. A collection of NLP microservices was crafted to complement an organization’s existing technology infrastructure in order to translate and bring additional meaning to its existing, real-time collection of unstructured text. In this session, in collaboration with Partners & Co., a Chicago-based real estate firm, we will demonstrate how to leverage an organization’s collective knowledge and turn unstructured text generated across various communication mediums into real-time actionable insight. We will demonstrate how a combination of open source tools such as Apache NiFi, Kafka, OpenNLP, and Superset can be used to build a full streaming NLP pipeline that consumes unstructured text, detects the language and sentences within the text, deconstructs the grammatical makeup, and derives the meaning of the entities identified within the text.

Hive edw-dataworks summit-eu-april-2017

alanfgates

ebay

Data Regions: Modernizing your company's data ecosystem

How do you decide where your customer was?

Design Patterns For Real Time Streaming Data Analytics

Hadoop from Hive with Stinger to Tez

Jan Pieter Posthuma

Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...

There have been many voices discussing how to architect streaming applications on Hadoop. Until now, there have been very few worked examples in open source. Apache Metron (Incubating) is a streaming advanced analytics cybersecurity application which uses the components within the Hadoop stack as its platform. We will attempt to go beyond theoretical discussions of Kappa vs Lambda architectures and describe the nuts and bolts of a streaming architecture that enables advanced analytics in Hadoop. We will discuss the componentry that we had to build and what we could reuse. We will discuss why we made the architectural decisions that we made and how they fit together into a coherent application on top of many different Hadoop ecosystem projects. We will also discuss the domain specific language that we created out of necessity to provide a pluggable layer for user defined enrichments, and how this helped make Metron less rigid and easier to use. We will also candidly discuss mistakes that we made early on.

Apache Spark Crash Course

This workshop will provide a hands-on introduction to Apache Spark and Apache Zeppelin in the cloud. Format: A short introductory lecture on Apache Spark covering core modules (SQL, Streaming, MLlib, GraphX) followed by a demo, lab exercises and a Q&A session. The lecture will be followed by lab time to work through the lab exercises and ask questions. Objective: To provide a quick and short hands-on introduction to Apache Spark. This lab will use the following Spark and Apache Hadoop components: Spark, Spark SQL, Apache Hadoop HDFS, Apache Hadoop YARN, Apache ORC, and Apache Ambari Zeppelin. You will learn how to move data into HDFS using Spark APIs, create Apache Hive tables, explore the data with Spark and Spark SQL, transform the data and then issue some SQL queries. Lab prerequisites: Registrants must bring a laptop with a Chrome or Firefox web browser installed (with proxies disabled). Alternatively, they may download and install an HDP Sandbox as long as they have at least 16GB of RAM available (note that the sandbox is over 10GB in size, so we recommend downloading it before the crash course). Speakers: Robert Hryniewicz

Large-Scale Stream Processing in the Hadoop Ecosystem

Druid: Sub-Second OLAP queries over Petabytes of Streaming Data

For a smooth user experience when interacting with analytics dashboards, two key requirements are sub-second response time and data freshness. Cluster computing frameworks such as Hadoop or Hive/HBase work well for storing large volumes of data, but they are not optimized for ingesting streaming data and making it available for queries in real time. Also, long query latencies make these systems sub-optimal choices for powering interactive dashboards and BI use cases. In this talk we will present Druid as a complementary solution to existing Hadoop-based technologies. Druid is an open-source analytics data store, designed from scratch for OLAP and business intelligence queries over massive data streams. It provides low-latency real-time data ingestion and fast, sub-second, flexible ad hoc data exploration queries. Many large companies are switching to Druid for analytics, and we will cover how Druid is able to handle massive data streams and why it is a good fit for BI use cases. Agenda: 1) Introduction and ideal use cases for Druid; 2) Data architecture; 3) Streaming ingestion with Kafka; 4) Demo using Druid, Kafka and Superset; 5) Recent improvements in Druid, moving from lambda architecture to exactly-once ingestion; 6) Future work.
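
The kind of query an interactive dashboard issues can be sketched in plain Python: events are grouped by a time bucket and a dimension, and a metric is summed per group. This is only an illustration of the query shape, not Druid's API or storage model; the event fields (`timestamp`, `page`, `clicks`) and the function name `hourly_rollup` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def hourly_rollup(events):
    """Group events by (hour bucket, page) and sum the 'clicks' metric."""
    totals = defaultdict(int)
    for event in events:
        ts = datetime.fromisoformat(event["timestamp"])
        # Truncate the timestamp to the start of its hour.
        bucket = ts.replace(minute=0, second=0, microsecond=0)
        totals[(bucket.isoformat(), event["page"])] += event["clicks"]
    return dict(totals)


# Hypothetical sample events, as they might arrive from a stream.
EVENTS = [
    {"timestamp": "2017-06-01T10:05:00", "page": "home", "clicks": 3},
    {"timestamp": "2017-06-01T10:45:00", "page": "home", "clicks": 2},
    {"timestamp": "2017-06-01T10:50:00", "page": "cart", "clicks": 1},
    {"timestamp": "2017-06-01T11:10:00", "page": "home", "clicks": 4},
]

if __name__ == "__main__":
    for key, clicks in sorted(hourly_rollup(EVENTS).items()):
        print(key, clicks)
```

A dashboard backend would typically run many such group-by aggregations per page load, which is why the abstract stresses sub-second response time.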

Druid Scaling Realtime Analytics

Aaron Brooks

What's hot

Streaming in the Wild with Apache Flink

Interactive real time dashboards on data streams using Kafka, Druid, and Supe...

Future of Apache Storm

Zero ETL analytics with LLAP in Azure HDInsight

From Device to Data Center to Insights

Druid at Hadoop Ecosystem

Slim Bouguerra

Smack Stack and Beyond—Building Fast Data Pipelines with Jorg Schad

Spark Summit

Introduction to the Hadoop EcoSystem

Shivaji Dutta

Intro to Big Data Analytics using Apache Spark and Apache Zeppelin

Alex Zeltov

Streaming SQL

Improving Organizational Knowledge with Natural Language Processing Enriched ...

Hive edw-dataworks summit-eu-april-2017

alanfgates

ebay

Data Regions: Modernizing your company's data ecosystem

How do you decide where your customer was?

Design Patterns For Real Time Streaming Data Analytics

Hadoop from Hive with Stinger to Tez

Jan Pieter Posthuma

Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...

Apache Spark Crash Course

Large-Scale Stream Processing in the Hadoop Ecosystem

Similar to Scalable OLAP with Druid

Druid: Sub-Second OLAP queries over Petabytes of Streaming Data

Druid Scaling Realtime Analytics

Aaron Brooks

Social Media Monitoring with NiFi, Druid and Superset

Thiago Santiago

Enabling the Real Time Analytical Enterprise

Combining IoT, Customer Experience and Real-Time Enterprise Data within Hadoop. What if you could derive real-time insights using ALL of your data? Join us for this webinar and learn how companies are combining “new” real-time data sources (e.g. IoT, social, web logs) with continuously updated enterprise data from SAP and other enterprise transactional systems, providing deep and up-to-the-second analytical insights. This presentation will include a demonstration of how this can be achieved quickly, easily and affordably with a joint solution from Attunity and Hortonworks.

An Apache Hive Based Data Warehouse

Using Apache Hadoop and related technologies as a data warehouse has been an area of interest since the early days of Hadoop. In recent years Hive has made great strides towards enabling data warehousing by expanding its SQL coverage, adding transactions, and enabling sub-second queries with LLAP. But data warehousing requires more than a full powered SQL engine. Security, governance, data movement, workload management, monitoring, and user tools are required as well. These functions are being addressed by other Apache projects such as Ranger, Atlas, Falcon, Ambari, and Zeppelin. This talk will examine how these projects can be assembled to build a data warehousing solution. It will also discuss features and performance work going on in Hive and the other projects that will enable more data warehousing use cases. These include use cases like data ingestion using merge, support for OLAP cubing queries via Hive’s integration with Druid, expanded SQL coverage, replication of data between data warehouses, advanced access control options, data discovery, and user tools to manage, monitor, and query the warehouse.

Big data processing engines, Atlanta Meetup 4/30

Ashish Narasimham

Hive acid and_2.x new_features

Alberto Romero

Unlocking insights in streaming data

Carolyn Duby

Presentation from Future of Data Boston Meetup on Oct 24, 2017. Streaming data is rich with insights but these insights can be difficult to find due to the difficulty of developing and deploying streaming applications. During this presentation we will show how to build and deploy a complex streaming application in a few minutes using open source tools. First we will build an application using Streaming Analytics Manager and Schema Registry that ingests data into Apache Druid. Then we will use Apache Superset to build beautiful, informative dashboards.

Hive 3.0 - New Features and Performance Improvements in the Latest Version of HDP

HortonworksJapan

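Apache Hive is a rapidly evolving project that is widely used across the big data ecosystem. Hive continues to expand its support for analytics, reporting, and interactive queries, and the community is working to improve that support along with many other aspects and use cases. This seminar gives an overview of the latest features and optimizations in Hive, including LLAP, materialized views and Apache Druid integration, workload management, ACID improvements, using Hive in the cloud, and benchmarks covering the performance improvements.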

Future of Data New Jersey - HDF 3.0 Deep Dive

Aldrin Piri

HDF Powered by Apache NiFi Introduction

Milind Pandit

Harnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFI

Haimo Liu

Hortonworks Data in Motion Webinar Series - Part 1

Hortonworks and Red Hat Webinar - Part 2

Hortonworks - What's Possible with a Modern Data Architecture?

Paris FOD Meetup #5 Hortonworks Presentation

Abdelkrim Hadjidj

BI on Hadoop: which tool for which use case? BI on Hadoop is currently a hot topic. However, there is no one-size-fits-all solution. As with many other topics in Hadoop, there are several solutions for various use cases. Hive, Hive LLAP, HBase/Phoenix, Druid, and Spark SQL are all potential solutions, each with its own sweet spot. In this presentation, we will explore these options and provide guidance on choosing the right technology for several use cases. We will also do a live demo to show how you can use Druid to build OLAP cubes on HDP.

Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...

Data Con LA

Connecting enterprise systems has always been a tough task. Modern IoT applications have exacerbated the issue by the need to integrate legacy systems with novel high velocity data streams. Various patterns like messaging, REST, etc. have been proposed, but they necessitate rearchitecting the integration layer which is extremely arduous. In this talk we will show you how to use Apache NiFi to solve your data integration, movement and ingestion problems. Next, we will examine how Apache NiFi can be used to construct durable, scalable and responsive IoT apps in conjunction with other stream processing and messaging frameworks.

Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3