Hello, Enterprise! Meet Presto. (Presto Boston Meetup 10062015)

Bullet is an open sourced, lightweight, pluggable querying system for streaming data without a persistence layer implemented on top of Storm. It allows you to filter, project, and aggregate on data in transit. It includes a UI and WS. Instead of running queries on a finite set of data that arrived and was persisted or running a static query defined at the startup of the stream, our queries can be executed against an arbitrary set of data arriving after the query is submitted. In other words, it is a look-forward system. Bullet is a multi-tenant system that scales independently of the data consumed and the number of simultaneous queries. Bullet is pluggable into any streaming data source. It can be configured to read from systems such as Storm, Kafka, Spark, Flume, etc. Bullet leverages Sketches to perform its aggregate operations such as distinct, count distinct, sum, count, min, max, and average. An instance of Bullet is currently running at Yahoo against its user engagement data pipeline. We’ll highlight how it is powering internal use-cases such as web page and native app instrumentation validation. Finally, we’ll show a demo of Bullet and go over query performance numbers.

Presto updates to 0.178

Presto@Uber

Zhenxiao Luo

How to ensure Presto scalability in multi use case

From Batch to Streaming ET(L) with Apache Apex

Stream data processing is increasingly required to support business needs for faster actionable insight, with a growing volume of information from more sources. Apache Apex is a true stream processing framework for low-latency, high-throughput and reliable processing of complex analytics pipelines on clusters. Apex is designed for quick time-to-production, and is used in production by large companies for real-time and batch processing at scale. This session will use an Apex production use case to walk through the incremental transition from a batch pipeline with hours of latency to an end-to-end streaming architecture that processes billions of events per day to deliver real-time analytical reports. The example is representative of many similar extract-transform-load (ETL) use cases with other data sets that can use a common library of building blocks. The transform (or analytics) piece of such pipelines varies in complexity and often involves custom components specific to the business logic. Topics include:

* Pipeline functionality from event source through queryable state for real-time insights.
* API for application development and development process.
* Library of building blocks including connectors for sources and sinks such as Kafka, JMS, Cassandra, HBase, JDBC and how they enable end-to-end exactly-once results.
* Stateful processing with event time windowing.
* Fault tolerance with exactly-once result semantics, checkpointing, incremental recovery.
* Scalability and low-latency, high-throughput processing with advanced engine features for auto-scaling, dynamic changes, compute locality.
* Who is using Apex in production, and roadmap.

Following the session, attendees will have a high-level understanding of Apex and how it can be applied to use cases at their own organizations.

HBaseConEast2016: Splice machine open source rdbms

Michael Stack

Building Realtim Data Pipelines with Kafka Connect and Spark Streaming

Spark Streaming makes it easy to build scalable, robust stream processing applications, but only once you’ve made your data accessible to the framework. Spark Streaming solves the realtime data processing problem, but to build a large-scale data pipeline we need to combine it with another tool that addresses data integration challenges. The Apache Kafka project recently introduced a new tool, Kafka Connect, to make data import/export to and from Kafka easier.

Membase Meetup 2010

Membase

Presto in my_use_case

wyukawa

Building Distributed Data Streaming System

Ashish Tadose

Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine

Data Con LA

In this talk, we will discuss how we use Spark as part of a hybrid RDBMS architecture that includes Hadoop and HBase. The optimizer evaluates each query and sends OLTP traffic (including CRUD queries) to HBase and OLAP traffic to Spark. We will focus on the challenges of handling the tradeoffs inherent in an integrated architecture that simultaneously handles real-time and batch traffic. Lessons learned include:

- Embedding Spark into an RDBMS
- Running Spark on Yarn and isolating OLTP traffic from OLAP traffic
- Accelerating the generation of Spark RDDs from HBase
- Customizing the Spark UI

The lessons learned can also be applied to other hybrid systems, such as Lambda architectures.

Bio: John Leach is the CTO and Co-Founder of Splice Machine. With over 15 years of software experience under his belt, John’s expertise in analytics and BI drives his role as Chief Technology Officer. Prior to Splice Machine, John founded Incite Retail in June 2008 and led the company’s strategy and development efforts. At Incite Retail, he built custom Big Data systems (leveraging HBase and Hadoop) for Fortune 500 companies. Prior to Incite Retail, he ran the business intelligence practice at Blue Martini Software and built strategic partnerships with integration partners. John was a key subject matter expert for Blue Martini Software in many strategic implementations across the world. His focus at Blue Martini was helping clients incorporate decision support knowledge into their current business processes utilizing advanced algorithms and machine learning. John received dual bachelor’s degrees in biomedical and mechanical engineering from Washington University in Saint Louis. Leach is the organizer emeritus for the Saint Louis Hadoop Users Group and is active in the Washington University Elliot Society.

Automatic Scaling Iterative Computations

Presto Testing Tools: Benchto & Tempto (Presto Boston Meetup 10062015)

Matt Fuller

Tempto is a product test framework that allows developers to write and execute tests for SQL databases running on Hadoop. Individual test requirements such as data generation, HDFS file copy/storage of generated data and schema creation are expressed declaratively and are automatically fulfilled by the framework. Developers can write tests in Java (using a TestNG-like paradigm and AssertJ-style assertions) or by providing query files with expected results. We will show how we use it for Presto product tests. Benchto is a benchmark framework that provides an easy and manageable way to define, run and analyze macro benchmarks in a clustered environment. Understanding the behavior of distributed systems is hard and requires good visibility into the state of the cluster and the internals of the tested system. This project was developed for repeatable benchmarking of Hadoop SQL engines, most importantly Presto.

Presto as a Service - Tips for operation and monitoring

Taro L. Saito

What's hot

Presto meetup 2015-03-19 @Facebook

Treasure Data, Inc.

Presto at Facebook - Presto Meetup @ Boston (10/6/2015)

Martin Traverso

Presto@Netflix Presto Meetup 03-19-15

Zhenxiao Luo

Presto at Twitter

Bill Graham

Presto @ Treasure Data - Presto Meetup Boston 2015

Taro L. Saito

Presto: Distributed sql query engine

kiran palaka

Presto

Knoldus Inc.

Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...

viirya

Bullet: A Real Time Data Query Engine

Viewers also liked

Presto Testing Tools: Benchto & Tempto (Presto Boston Meetup 10062015)

Matt Fuller

Presto as a Service - Tips for operation and monitoring

Taro L. Saito

Understanding Presto - Presto meetup @ Tokyo #1

Sadayuki Furuhashi

Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla

ScyllaDB

Presto overview

Shixiong Zhu

Presto

MK JUNG

Presto Meetup 2016 Small Start

Hiroshi Toyama

Presto Meetup @ Facebook (3/22/2016)

Martin Traverso

AWS Meet-up: Logging At Scale on AWS

Chris Riddell

Prestogres internals

Sadayuki Furuhashi

Future of Data Meetup : Boontadata

Abdelkrim Hadjidj

As some big data stream processing engines may become an alternative to batch engines, companies may have to choose the technology they will rely on. There are many considerations to take into account, including how to develop, and what the engine can do. Boontadata (http://boontadata.io) is an environment, available on GitHub, where anyone can experiment with stream processing engines. A common scenario is used to compare how to develop for and run different processing engines.

Hadoop World 2011: Big Data Architecture: Integrating Hadoop with Other Enter...

Cloudera, Inc.

Recent research has pointed out the complementary nature of Hadoop and other data management solutions and the importance of leveraging existing systems, SQL, engineering, and operational skills, as well as incorporating novel uses of MapReduce to improve analytic processing. Come to this session to learn how companies optimize the use of Hadoop with other enterprise systems to improve overall analytical throughput and build new data-driven products. This session covers: ways to achieve high-performance integration between Hadoop and relational-based systems; Hadoop+NoSQL vs Hadoop+SQL architectures; high-speed, massively parallel data transfer to analytical platforms that can aggregate web log data with granular fact data; and strategies for freeing up capacity for more explorative, iterative analytics and ad hoc queries.

Big Data: SQL query federation for Hadoop and RDBMS data

Cynthia Saracco

Hybrid Data Architecture: Integrating Hadoop with a Data Warehouse

Amazon EMR Facebook Presto Meetup

stevemcpherson

Presto changes

N Masahiro

Presto in my_use_case2

wyukawa

Presto - SQL on anything

Grzegorz Kokosiński

One of the key differences between Presto and Hive, and a crucial functional requirement Facebook set when launching this new SQL engine project, was the ability to query different kinds of data sources via a uniform ANSI SQL interface. Presto, an open source distributed analytical SQL engine, implements this with its connector architecture, creating an abstraction layer for anything that can be expressed in a row-like format, ranging from MySQL tables, HDFS and Amazon S3 to NoSQL stores, Kafka streams and proprietary data sources. The Presto connector SPI allows anyone to implement a Presto connector and benefit from the capabilities of the Presto SQL engine, enabling them to join data from various sources within a single SQL query.

Internals of Presto Service

Treasure Data, Inc.

Teradata Big Data London Seminar

Hortonworks

Similar to Hello, Enterprise! Meet Presto. (Presto Boston Meetup 10062015)

Open Source SQL for Hadoop: Where are we and Where are we Going?

Hortonworks.bdb

Emil Andreas Siemes

Interactive Analytics with the Starburst Presto + Alluxio stack for the Cloud

Alluxio, Inc.

Alluxio Tech Talk, Mar 12, 2019

Speakers: Bin Fan, Alluxio; Matt Fuller, Starburst

As data analytic needs have increased with the explosion of data, the importance of the speed of analytics and the interactivity of queries has increased dramatically. In this tech talk, we will introduce the Starburst Presto, Alluxio, and cloud object store stack for building a highly-concurrent and low-latency analytics platform. This stack provides a strong solution to run fast SQL across multiple storage systems including HDFS, S3, and others in public cloud, hybrid cloud, and multi-cloud environments. You’ll learn about:

- The architecture of Presto, an open source distributed SQL engine, as well as innovations by Starburst such as its cost-based optimizer
- How Presto can query data from cloud object storage like S3 at high performance and cost-effectively with Alluxio
- How to achieve data locality and cross-job caching with Alluxio no matter where the data is persisted and reduce egress costs

In addition, we’ll present some real world architectures & use cases from internet companies like JD.com and NetEase.com running the Presto and Alluxio stack at the scale of hundreds of nodes.

Building Scalable Big Data Infrastructure Using Open Source Software Presenta...

ssuserd3a367

Teradata - Presentation at Hortonworks Booth - Strata 2014

Hortonworks

Bi on Big Data - Strata 2016 in London

Dremio Corporation

Hitachi Data Systems Hadoop Solution

Hitachi Vantara

Hitachi Data Systems Hadoop Solution. Customers are seeing exponential growth of unstructured data from their social media websites to operational sources. Their enterprise data warehouses are not designed to handle such high volumes and varieties of data. Hadoop, the latest software platform that scales to process massive volumes of unstructured and semi-structured data by distributing the workload through clusters of servers, is giving customers a new option to tackle data growth and deploy big data analysis to help better understand their business. Hitachi Data Systems is launching its latest Hadoop reference architecture, which is pre-tested with the Cloudera Hadoop distribution to provide a faster time to market for customers deploying Hadoop applications. HDS, Cloudera and Hitachi Consulting will present together and explain how to get you there. Attend this WebTech and learn how to: Solve big-data problems with Hadoop. Deploy Hadoop in your data warehouse environment to better manage your unstructured and structured data. Implement Hadoop using the HDS Hadoop reference architecture. For more information on Hitachi Data Systems Hadoop Solution please read our blog: http://blogs.hds.com/hdsblog/2012/07/a-series-on-hadoop-architecture.html

Twitter with hadoop for oow

Gwen (Chen) Shapira

Munich HUG 21.11.2013

Emil Andreas Siemes

Big Data Integration Webinar: Getting Started With Hadoop Big Data

Pentaho

Self-Service BI for big data applications using Apache Drill (Big Data Amster...

Dataconomy Media

Modern big data applications such as social, mobile, web and IoT deal with a larger number of users and larger amount of data than the traditional transactional applications. The datasets associated with these applications evolve rapidly, are often self-describing and can include complex types such as JSON and Parquet. In this demo we will show how Apache Drill can be used to provide low latency queries natively on rapidly evolving multi-structured datasets at scale.

Self-Service BI for big data applications using Apache Drill (Big Data Amster...

Mats Uddenfeldt

Modernizing Global Shared Data Analytics Platform and our Alluxio Journey

Alluxio, Inc.

Savanna - Elastic Hadoop on OpenStack

Sergey Lukjanov

UNC Chapel Hill Ctc Retreat 2014 SAS Visual Analytics and Business Intelligence

Jonathan Pletzke

Hear about and see the latest SAS solutions in use at UNC-CH. In support of ConnectCarolina and InfoPorte for administrative data, two SAS server-based platforms have been installed:

- SAS Business Intelligence, which is being used for Extract-Transform-Load (ETL) manipulation of data
- SAS Visual Analytics, which is being used for reporting and visualization of data

Hear about the high speed and high capacity of the server-based solutions, along with how they are being used and benefiting UNC Chapel Hill.

Talend for big_data_intorduction

Lakshman Dhullipalla

Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack

Alluxio, Inc.

Alluxio Tech Talk, January 21, 2020

Speakers: Matt Fuller, Starburst; Dipti Borkar, Alluxio

With the advent of the public clouds and data increasingly siloed across many locations (on premises and in the public cloud), enterprises are looking for more flexible and higher-performance approaches to analyze their structured data. Join us for this tech talk where we’ll introduce the Starburst Presto, Alluxio, and cloud object store stack for building a highly-concurrent and low-latency analytics platform. This stack provides a strong solution to run fast SQL across multiple storage systems including HDFS, S3, and others in public cloud, hybrid cloud, and multi-cloud environments. You’ll learn more about:

- The architecture of Presto, an open source distributed SQL engine
- How the Presto + Alluxio stack queries data from cloud object storage like S3 for faster and more cost-effective analytics
- Achieving data locality and cross-job caching with Alluxio regardless of where data is persisted

Piranha vs. mammoth predator appliances that chew up big data

Jack (Yaakov) Bezalel

If you also got the Big Data itch, here is something to ease the pain :-) Answers to these questions will be available soon (more info in the attached link):

- Which Big Data Appliance should YOU use? (click on the attached link for poll results)
- Appliances are Small and Quick, Right?
- Revealing the 6 Types of Big Data Appliances
- Uncovering the Main Players
- Challenges, Pitfalls, and Winning the Big Data Game
- Where is all this leading YOU to?

Summer Shorts: Big Data Integration

ibi

Today's organizations contend with more diverse applications, data, and systems than ever before – silos that are often fragmented and difficult to leverage together. iWay Big Data Integrator (BDI) simplifies the creation, management, and use of Hadoop-based data lakes. It provides a modern, native approach to Hadoop-based data integration and management that ensures high levels of capability, compatibility, and flexibility to help your organization. Join us to learn how you can simplify adoption of Apache Hadoop using iWay Big Data Integrator. Learn about our ability to streamline the deployment of ingestion, transformation, and extraction tasks. See the pre-recorded webcast online at: http://www.informationbuilders.com/webevents/online/24427#sthash.J0cRy1PG.dpuf

Webinar: What's new in CDAP 3.5?

Cask Data

Cask Webinar

Date: 08/10/2016

Link to video recording: https://www.youtube.com/watch?v=XUkANr9iag0

In this webinar, Nitin Motgi, CTO of Cask, walks through the new capabilities of CDAP 3.5 and explains how your organization can benefit. Some of the highlights include:

- Enterprise-grade security: Authentication, authorization, secure keystore for storing configurations. Plus integration with Apache Sentry and Apache Ranger.
- Preview mode: Ability to preview and debug data pipelines before deploying them.
- Joins in Cask Hydrator: Capabilities to join multiple data sources in data pipelines.
- Real-time pipelines with Spark Streaming: Drag & drop real-time pipelines using Spark Streaming.
- Data usage analytics: Ability to report application usage of data sets.
- And much more!
