Veracity think bugdata #2 6.7.2015

This document discusses data warehousing over Hadoop and strategies for low-latency querying. It covers columnar formats like ORC and Parquet that provide optimizations for queries. It also discusses different query engines like Hive, Impala, Presto, and Spark SQL and their capabilities. The document notes that converting data to optimized formats during or after collection can help minimize latency conflicts between processing and query performance. Technologies like Sqoop, Hive ACID transactions, streaming ingest into Hive, and Flume can help convert or ingest data in optimized formats with lower latency.

COLUMNAR FORMATS (ORC/PARQUET)
Projection Push Down
Predicate Push Down
Excellent Compression Ratios
Column Indices
Max/Avg/Min values
Rows must be batched to benefit from these optimizations
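
The push-down ideas above can be sketched with a toy in-memory model. This is not how ORC or Parquet are implemented; `RowGroup`, `write_row_groups` and `scan` are hypothetical names invented for illustration. The sketch only shows why per-batch min/max values let a reader skip data, and why rows must be batched for that to pay off.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """A batch of rows stored column by column, with min/max stats per column."""
    columns: dict  # column name -> list of values
    stats: dict    # column name -> (min, max)


def write_row_groups(rows, batch_size):
    """Batch row dicts into row groups and record min/max for every column."""
    groups = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        columns = {name: [r[name] for r in batch] for name in batch[0]}
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        groups.append(RowGroup(columns, stats))
    return groups


def scan(groups, select, where_col, lower_bound):
    """Return the selected columns of rows where where_col >= lower_bound."""
    out, groups_read = [], 0
    for g in groups:
        # Predicate push down: skip the whole group if its max rules it out.
        if g.stats[where_col][1] < lower_bound:
            continue
        groups_read += 1
        # Projection push down: only the filter column and the selected
        # columns are touched; other columns are never read.
        for i, value in enumerate(g.columns[where_col]):
            if value >= lower_bound:
                out.append(tuple(g.columns[c][i] for c in select))
    return out, groups_read


rows = [{"ts": i, "user": f"u{i % 3}", "bytes": i * 10} for i in range(100)]
groups = write_row_groups(rows, batch_size=25)
result, groups_read = scan(groups, select=["user"], where_col="ts", lower_bound=90)
print(len(result), "rows matched;", groups_read, "of", len(groups), "row groups read")
```

With `batch_size=1` every group holds a single row, so the statistics cost as much as the data and nothing is saved by skipping.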

PARQUET
Strongly endorsed by Cloudera
One of the few formats Impala supports (and the best fit for it)
Also supported by Hive, Spark, Tajo, Drill & Presto.
Speaking from my own personal experience, a bit more expensive to generate.

ORC
Endorsed by Hortonworks
The best fit for Presto
Spark support was recently introduced.

HIVE
Hive provides a SQL-like interface for accessing the data (files), called HiveQL.
The HQL is translated into M/R code and executed immediately.
Batch oriented
Fault tolerant and thus reliable
Not a DB!
Does not support updates and deletes, and has no transactions (or does it?)

LOW LATENCY SQL
Map-Reduce can be compared to a tractor:
It's very strong and can plow a field better than any other vehicle, but it's also very slow.
As prices of memory dropped, a demand emerged to better utilize it for faster response times.

CLOUDERA IMPALA
Written in C++
Utilizes Hive's metadata
Very fast
Not fault tolerant
Doesn't support custom data formats
Doesn't support complex data types (maps/arrays/structs)
A bit complicated to set up on non-CDH distributions

FACEBOOK PRESTO
Can connect to:
Cassandra
Hive
JMX sources
Postgres & MySQL
Allows cross-engine joins
Used at Facebook to serve online dashboards
Easy to set up

SPARK SQL
Not affiliated with any Hadoop vendor
Supports all of the optimized file formats (ORC/Parquet/Avro)
Can auto-discover schema
Aims to provide second/sub-second latency
Still not very mature

THE USUAL DATA FLOW
Collect -> Store -> Convert -> Select
The data latency conflict: lots of fragmented small files, or big optimized files with big latency
Processing efforts involved in the conversion process should be minimized
Example..

A BETTER DATA FLOW
Collec-tor-vert -> Select
Convert the data as it is being collected where possible
Or convert the data as it is being stored (streaming) but without losing optimizations
How can this be achieved?
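
One generic way to balance the small-files-versus-latency conflict is micro-batching: buffer incoming rows and flush when either a row count or an age limit is hit. The sketch below is an assumption for illustration, not something taken from the talk; `MicroBatcher` and its parameters are hypothetical names, and the `flush` callback stands in for whatever writes an optimized file.

```python
import time


class MicroBatcher:
    """Buffer rows and flush when the batch is big enough or old enough."""

    def __init__(self, flush, max_rows=1000, max_age_s=5.0, clock=time.monotonic):
        self.flush_fn = flush        # called with a list of rows on every flush
        self.max_rows = max_rows     # bigger batches -> better columnar files
        self.max_age_s = max_age_s   # smaller age -> lower data latency
        self.clock = clock
        self.buffer = []
        self.first_ts = None

    def add(self, row):
        if not self.buffer:
            self.first_ts = self.clock()  # age is measured from the first row
        self.buffer.append(row)
        too_big = len(self.buffer) >= self.max_rows
        too_old = self.clock() - self.first_ts >= self.max_age_s
        # Note: age is only checked when a row arrives; a real collector
        # would also flush from a timer.
        if too_big or too_old:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []


batches = []
batcher = MicroBatcher(batches.append, max_rows=3, max_age_s=60)
for i in range(7):
    batcher.add(i)
batcher.flush()  # drain the remainder on shutdown
print(batches)
```

Tuning `max_rows` up favors query performance, while tuning `max_age_s` down favors freshness; the two limits make the trade-off explicit.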

SQOOP
Import data from RDBMS into Hadoop
Creates Java classes and Hive tables on import
Export data back to RDBMS
Runs a "Map Only" job to perform the task
Supports incremental imports
Now supports importing directly as Parquet

HIVE & ACID
Recently a conceptual change has been introduced into Hive: CRUD with ACID transactions.
It is not meant to replace your OLTP but rather to supply a better data modification mechanism for a subset of the data.
Explanation on how it works
Demo simple insert
Still requires M/R :(
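
The slide leaves the explanation to the live talk, so here is a heavily simplified model of the general idea behind Hive's ACID tables: changes are written as deltas next to an immutable base, readers merge them at query time, and compaction folds deltas back into a new base. The function names and the in-memory data layout are assumptions for illustration; real Hive keeps base and delta data as ORC files on HDFS.

```python
def read_table(base, deltas):
    """Merge-on-read: apply deltas in transaction order on top of the base."""
    rows = dict(base)
    for _txn_id, ops in sorted(deltas, key=lambda d: d[0]):
        for op, key, value in ops:
            if op == "delete":
                rows.pop(key, None)
            else:  # "insert" and "update" both set the row's current value
                rows[key] = value
    return rows


def major_compact(base, deltas):
    """Rewrite base + deltas into a single new base with no deltas left."""
    return read_table(base, deltas), []


base = {1: "a", 2: "b"}
deltas = [
    (11, [("update", 2, "B")]),
    (10, [("insert", 3, "c")]),
    (12, [("delete", 1, None)]),
]

before = read_table(base, deltas)
new_base, new_deltas = major_compact(base, deltas)
after = read_table(new_base, new_deltas)
print(before, after, new_deltas)
```

Readers see the same rows before and after compaction; compaction only reduces the number of pieces that have to be merged on every read.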

HIVE & STREAMING INGEST
With the new ACID capabilities it is now possible to continuously insert data into Hive
Data appears almost immediately
Data is optimized in a columnar format
Data is compacted by different triggers
Code snippet

FLUME
Distributed
Durable
Scalable
Fault tolerant
Serves for ingestion and basic pre-processing of the data
Composed of source -> channel -> sink
(Draw Architecture)
Utilizes Hive's ACID capabilities to instantly stream data into Hive - demo
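
The source -> channel -> sink composition can be mimicked in a few lines. This is a toy model and not Flume's API: `Channel`, `source` and `sink` are hypothetical names, the channel lives in memory, and the `write` callback stands in for a real destination such as a Hive table. It shows the two properties the slide lists: basic pre-processing in the source, and events that stay in the channel when delivery fails.

```python
from collections import deque


class Channel:
    """Bounded buffer that decouples the source from the sink."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()

    def put(self, event):
        if len(self.events) >= self.capacity:
            raise OverflowError("channel full")
        self.events.append(event)

    def take_batch(self, n):
        batch = []
        while self.events and len(batch) < n:
            batch.append(self.events.popleft())
        return batch

    def rollback(self, batch):
        # Put an undelivered batch back at the front, in its original order.
        self.events.extendleft(reversed(batch))


def source(lines, channel):
    """Ingest raw lines, dropping blanks as basic pre-processing."""
    for line in lines:
        line = line.strip()
        if line:
            channel.put(line)


def sink(channel, write, batch_size):
    """Deliver events in batches; on failure the batch stays in the channel."""
    delivered = 0
    while True:
        batch = channel.take_batch(batch_size)
        if not batch:
            return delivered
        try:
            write(batch)
        except Exception:
            channel.rollback(batch)
            raise
        delivered += len(batch)


channel = Channel(capacity=100)
source(["a\n", "\n", "b\n", "c\n"], channel)
written = []
count = sink(channel, written.append, batch_size=2)
print(count, written)
```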


Splice Machine Overview

Kunal Gupta

Splice Machine is a SQL relational database management system built on Hadoop. It aims to provide the scalability, flexibility and cost-effectiveness of Hadoop with the transactional consistency, SQL support and real-time capabilities of a traditional RDBMS. Key features include ANSI SQL support, horizontal scaling on commodity hardware, distributed transactions using multi-version concurrency control, and massively parallel query processing by pushing computations down to individual HBase regions. It combines Apache Derby for SQL parsing and processing with HBase/HDFS for storage and distribution. This allows it to elastically scale out while supporting rich SQL, transactions, analytics and real-time updates on large datasets.

NoSQL, Hadoop, Cascading June 2010

Christopher Curtin

This document discusses NoSQL, Hadoop, and Cascading. It begins by explaining why NoSQL databases were created, as not all data fits relational schemas and some problems are not relational. It then describes different types of NoSQL databases like key-value stores, document databases, and graph databases. The document outlines how Hadoop uses MapReduce to process large datasets in parallel. It introduces Cascading as a way to define complex multi-step data processing flows in Hadoop more easily than raw MapReduce. The document provides an example of how to use Cascading to analyze user engagement data across millions of records.

SnappyData overview NikeTechTalk 11/19/15

SnappyData

ORC 2015: Faster, Better, Smaller

The Apache Software Foundation

Oracle Database 12c "New features"

Anar Godjaev

Oracle Database 12c includes many new features across SQL, PL/SQL, database management, partitioning, patching, compression, Data Guard, and pluggable databases. Key features include increased datatype size limits, identity columns, implicit result sets in PL/SQL, adaptive plans, row pattern matching, pluggable databases that can be plugged into and unplugged from container databases, and many enhancements to compression, partitioning, Data Guard, and patching functionality.

SQL Server 2014 In-Memory Tables (XTP, Hekaton)

Tony Rogerson

Interactive SQL-on-Hadoop and JethroData

Ofir Manor

SQL on Hadoop

nvvrajesh

The document summarizes several popular options for SQL on Hadoop including Hive, SparkSQL, Drill, HAWQ, Phoenix, Trafodion, and Splice Machine. Each option is reviewed in terms of key features, architecture, usage patterns, and strengths/limitations. While all aim to enable SQL querying of Hadoop data, they differ in support for transactions, latency, data types, and whether they are native to Hadoop or require separate processes. Hive and SparkSQL are best for batch jobs while Drill, HAWQ and Splice Machine provide lower latency but with different integration models and capabilities.

Using HBase Co-Processors to Build a Distributed, Transactional RDBMS - Splic...

Chicago Hadoop Users Group

John Leach Co-Founder and CTO of Splice Machine with 15+ years software development and machine learning experience will discuss how to use HBase co-processors to build an ANSI-99 SQL database with 1) parallelization of SQL execution plans, 2) ACID transactions with snapshot isolation and 3) consistent secondary indexing. Transactions are critical in traditional RDBMSs because they ensure reliable updates across multiple rows and tables. Most operational applications require transactions, but even analytics systems use transactions to reliably update secondary indexes after a record insert or update. In the Hadoop ecosystem, HBase is a key-value store with real-time updates, but it does not have multi-row, multi-table transactions, secondary indexes or a robust query language like SQL. Combining SQL with a full transactional model over HBase opens a whole new set of OLTP and OLAP use cases for Hadoop that was traditionally reserved for RDBMSs like MySQL or Oracle. However, a transactional HBase system has the advantage of scaling out with commodity servers, leading to a 5x-10x cost savings over traditional databases like MySQL or Oracle. HBase co-processors, introduced in release 0.92, provide a flexible and high-performance framework to extend HBase. In this talk, we show how we used HBase co-processors to support a full ANSI SQL RDBMS without modifying the core HBase source. We will discuss how endpoint transactions are used to serialize SQL execution plans over to regions so that computation is local to where the data is stored. Additionally, we will show how observer co-processors simultaneously support both transactions and secondary indexing. The talk will also discuss how Splice Machine extended the work of Google Percolator, Yahoo Labs’ OMID, and the University of Waterloo on distributed snapshot isolation for transactions. 
Lastly, performance benchmarks will be provided, including full TPC-C and TPC-H results that show how Hadoop/HBase can be a replacement of traditional RDBMS solutions. To view the accompanying slide deck: http://www.slideshare.net/ChicagoHUG/


Veracity think bugdata #2 6.7.2015

1. DWH OVER HADOOP

2. THE BASICS

3. COLUMNAR FORMATS (ORC/PARQUET)
Projection Push Down
Predicate Push Down
Excellent Compression Ratios
Column Indices
Max/Avg/Min values
Rows must be batched to benefit from these optimizations
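The two push-down optimizations listed on this slide can be sketched in a few lines of Python. This is a toy in-memory model, not the ORC or Parquet reader: `RowGroup` and `scan` are hypothetical names, and the min/max statistics are computed on demand here, whereas a real columnar file stores them when the batch is written.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RowGroup:
    """A batch of rows stored column by column."""
    columns: Dict[str, List[int]]

    def stats(self, name: str):
        # Real formats persist these statistics at write time;
        # computing them here keeps the sketch short.
        values = self.columns[name]
        return min(values), max(values)


def scan(groups, select, where_col, lower_bound):
    """Return the `select` columns of rows where where_col >= lower_bound."""
    rows, groups_read = [], 0
    for group in groups:
        _, col_max = group.stats(where_col)
        if col_max < lower_bound:
            continue  # predicate push down: skip the whole batch using stats only
        groups_read += 1
        predicate_values = group.columns[where_col]
        # projection push down: only the requested columns are touched
        projected = [group.columns[name] for name in select]
        for i, value in enumerate(predicate_values):
            if value >= lower_bound:
                rows.append(tuple(col[i] for col in projected))
    return rows, groups_read


groups = [
    RowGroup({"ts": [1, 2, 3], "user": [10, 11, 12], "bytes": [5, 6, 7]}),
    RowGroup({"ts": [4, 5, 6], "user": [13, 14, 15], "bytes": [8, 9, 10]}),
]
rows, groups_read = scan(groups, select=["user"], where_col="ts", lower_bound=5)
print(rows, groups_read)
```

The first batch is skipped without reading any row, which is why rows have to be batched before these optimizations pay off: statistics over a single row cannot exclude anything.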

4. PARQUET
Strongly endorsed by Cloudera
One of the few formats Impala supports (and the most optimal for it)
Also supported by Hive, Spark, Tajo, Drill & Presto.
Speaking from my own personal experience, a bit more expensive to generate.
ORC
Endorsed by Hortonworks
Most optimal for Presto
Spark support was recently introduced.

5. QUERYING ENGINES

6. HIVE
Hive provides a SQL-like interface, called HiveQL, for accessing the data (files).
The HQL is translated into M/R code and executed immediately.
Batch Oriented
Fault tolerant and thus reliable
Not a DB!
Does not support updates & deletes and has no transactions (or does it?)

7. LOW LATENCY SQL
Map-Reduce can be compared to a tractor:
It's very strong and can plow a field better than any other vehicle, but it's also very slow.
As prices of memory dropped, a demand emerged to better utilize it for faster response times.

8. CLOUDERA IMPALA
Written in C++
Utilizes Hive's metadata
Very fast
Not fault tolerant
Doesn't support custom data formats
Doesn't support complex data types (maps/arrays/structs)
A bit complicated to set up on non-CDH distributions

9. FACEBOOK PRESTO
Can connect to:
Cassandra
Hive
JMX Sources
Postgres & MySQL
Allows cross-engine joins
Used at Facebook to serve online dashboards
Easy to set up

10. SPARK SQL
Not affiliated with any Hadoop vendor
Supports all of the optimized file formats (ORC/Parquet/Avro)
Can auto-discover schema
Aims to provide second/sub-second latency
Still not very mature

11. THE USUAL DATA FLOW
Collect -> Store -> Convert -> Select
The Data Latency conflict: lots of fragmented small files, or big optimized files with big latency
Processing efforts involved in the conversion process should be minimized
Example..

12. A BETTER DATA FLOW
Collec-tor-vert -> Select
Convert the data as it is being collected where possible
Or convert the data as it is being stored (streaming) but without losing optimizations
How can this be achieved?

13. SQOOP
Import data from RDBMS into Hadoop
Create Java classes and Hive tables on import
Export data back to RDBMS
Runs a "Map Only" job to perform the task
Supports incremental imports
Now supports importing directly as Parquet
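As a rough illustration of the last two points, the Python helper below assembles a Sqoop 1 command line for an incremental import written directly as Parquet. The helper name, JDBC URL, table and target directory are hypothetical; the flags are the ones documented for `sqoop import`, but they are worth checking against the Sqoop version actually installed.

```python
import shlex


def sqoop_import_args(jdbc_url, table, target_dir, check_column, last_value):
    """Build the argument list for an incremental Sqoop import as Parquet."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        "--as-parquetfile",              # write Parquet instead of text files
        "--incremental", "append",       # only fetch rows added since last run
        "--check-column", check_column,  # column used to detect new rows
        "--last-value", str(last_value), # highest value already imported
    ]


# Hypothetical source database and paths, for illustration only.
args = sqoop_import_args(
    "jdbc:mysql://db.example.com/shop", "orders", "/warehouse/orders", "id", 0
)
print(" ".join(shlex.quote(a) for a in args))
```

Keeping the command as a list rather than a single string makes it safe to hand to `subprocess.run` without shell quoting issues.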

14. HIVE & ACID
Recently a conceptual change has been introduced into Hive: CRUD with ACID Transactions.
It is not meant to replace your OLTP but rather to supply a better data modification mechanism for a subset of the data.
Explanation on how it works
Demo simple insert
Still requires M/R :(
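The slide leaves the "how it works" part to the talk. One way to picture it, assuming the usual description of Hive ACID tables (an immutable base plus small delta files that readers merge at query time), is the toy Python model below. The function names and the tuple layout are hypothetical and do not reflect Hive's on-disk format.

```python
def read_table(base, deltas):
    """Merge a base snapshot with ordered deltas into the current view."""
    view = dict(base)
    for delta in deltas:  # deltas are applied in transaction order
        for op, key, value in delta:
            if op == "delete":
                view.pop(key, None)
            else:  # "insert" or "update"
                view[key] = value
    return view


def major_compact(base, deltas):
    """Fold all deltas into a new base so readers no longer need to merge."""
    return read_table(base, deltas), []


base = {1: "alice", 2: "bob"}
deltas = [
    [("insert", 3, "carol")],                        # transaction 1
    [("update", 2, "robert"), ("delete", 1, None)],  # transaction 2
]
current = read_table(base, deltas)
new_base, remaining = major_compact(base, deltas)
print(current)
```

Existing files are never rewritten in place in this model, which fits a file system like HDFS where files cannot be updated; the price is that every read merges the deltas until a compaction folds them into the base.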

15. HIVE & STREAMING INGEST
With the new ACID capabilities it is now possible to continuously insert data into Hive
Data appears almost immediately
Data is optimized in a columnar format
Data is compacted by different triggers
Code snippet
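The original code snippet is not part of this transcript. As a stand-in, here is a small Python model of the behaviour the bullets describe: each committed batch is visible to readers right away, and a compaction is triggered once too many small deltas pile up. `StreamingTable` and its `max_deltas` threshold are hypothetical; this is not the Hive streaming API.

```python
class StreamingTable:
    """Toy table that accepts small committed batches and compacts them."""

    def __init__(self, max_deltas=3):
        self.base = []                # large, compacted storage
        self.deltas = []              # one small delta per committed batch
        self.max_deltas = max_deltas  # hypothetical compaction trigger

    def commit_batch(self, rows):
        """Commit one batch; its rows are visible to scan() immediately."""
        self.deltas.append(list(rows))
        if len(self.deltas) > self.max_deltas:
            self.compact()

    def compact(self):
        """Merge all deltas into the base, leaving no small files behind."""
        for delta in self.deltas:
            self.base.extend(delta)
        self.deltas = []

    def scan(self):
        """Read the base plus any deltas that are not yet compacted."""
        rows = list(self.base)
        for delta in self.deltas:
            rows.extend(delta)
        return rows


table = StreamingTable(max_deltas=3)
for batch_id in range(5):
    table.commit_batch([(batch_id, "event")])
print(len(table.scan()), len(table.base), len(table.deltas))
```

With five single-row batches and a threshold of three, the fourth commit triggers a compaction, so four rows end up in the base and one delta is still pending while all five rows stay queryable throughout.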

16. FLUME
Distributed
Durable
Scalable
Fault Tolerant
Serves for ingestion and basic pre-processing of the data
Composed of Source -> Channel -> Sink (Draw Architecture)
Utilizes Hive's ACID capabilities to instantly stream data into Hive - demo
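Since the architecture drawing is not in the transcript, the Source -> Channel -> Sink chain can be sketched in Python instead. This is a minimal in-memory model under stated assumptions: `Channel`, `source` and `sink` are hypothetical names, and the channel here is a plain queue, whereas Flume's durability comes from channels that persist events.

```python
from collections import deque


class Channel:
    """Buffers events between source and sink (in memory, for brevity)."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take_batch(self, size):
        batch = []
        while self._events and len(batch) < size:
            batch.append(self._events.popleft())
        return batch

    def rollback(self, batch):
        # Put the events back at the head of the queue in their original order.
        self._events.extendleft(reversed(batch))


def source(lines, channel):
    """Ingest raw lines, applying basic pre-processing before buffering."""
    for line in lines:
        line = line.strip()
        if line:  # drop blank lines
            channel.put(line)


def sink(channel, write, batch_size=2):
    """Drain the channel in batches; keep events if the destination fails."""
    delivered = 0
    while True:
        batch = channel.take_batch(batch_size)
        if not batch:
            return delivered
        try:
            write(batch)
        except IOError:
            channel.rollback(batch)  # nothing is lost, retry later
            return delivered
        delivered += len(batch)


channel = Channel()
source(["a\n", "\n", " b \n", "c\n"], channel)
stored = []
delivered = sink(channel, stored.extend)
print(stored, delivered)
```

The sink only removes events for good once the write succeeds, which is the property that lets the channel absorb a slow or temporarily unavailable destination.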
