Spark + HBase

Zhan Zhang presents improvements made to bring HBase data efficiently into Spark with DataFrame support. The improvements include high performance by moving computation to data and reducing network overhead through partition pruning and column pruning. Full DataFrame support is provided, allowing Spark SQL and integrated language queries to run on existing HBase tables with Java primitive type support.

Spark + HBase
Bringing HBase Data Efficiently into Spark
with DataFrame Support
Zhan Zhang
Software Engineer
04/08/2016

Page2 © Hortonworks Inc. 2014
About Zhan Zhang
• Zhan Zhang (Software Engineer at Hortonworks)
• Currently Focuses on Apache Spark, Hadoop, etc.
• Contributes to Apache Spark, YARN, HBase, Ambari, etc.
• Experience in Computer Networks, Distributed Systems, and Machine Learning Platforms

Why Revamp the Existing HBase Connector?
• Limited Spark Support in HBase Upstream
– Scalability
– RDD level, but Spark is moving to DataFrame/Dataset
– Data Loss and Data Duplication
• Stability
– Correctness
– Stability Impact with Co-processor
– Serialized RDD Lineage to HBase
– Maintenance Overhead: Internal Hacks

What Improvements Have We Made?
• Combine Spark and HBase
– Spark Catalyst Engine for Query Planning and Optimization
– HBase as a Fast-Access KV Store
– Implement the Standard External Data Source API with Built-in Filters
• High Performance
– Data Locality: Move Computation to Data
– Partition Pruning: Tasks Run Only on Region Servers (RS) Holding Requested Data
– Column Pruning / Predicate Pushdown: Reduce Network Overhead
• Full-Fledged DataFrame Support
– Spark SQL
– Language-Integrated Query
• Run on Top of Existing HBase Tables
– Native Support for Java Primitive Types

More …
• Composite Key
• Avro Format

Usage - Define the Catalog
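
The catalog listing shown on this slide did not survive extraction. As a rough sketch only: the connector's public README describes the catalog as a JSON document that maps DataFrame fields to HBase column families and qualifiers. The table name, field names, and types below are hypothetical, and the exact keys should be checked against the connector version in use.

```python
import json

# Hypothetical table and column names, chosen only for illustration.
catalog = {
    "table": {"namespace": "default", "name": "table1"},
    "rowkey": "key",
    "columns": {
        # The row key is assumed to be mapped through the reserved family "rowkey".
        "col0": {"cf": "rowkey", "col": "key", "type": "string"},
        "col1": {"cf": "cf1", "col": "col1", "type": "boolean"},
        "col2": {"cf": "cf2", "col": "col2", "type": "double"},
        "col3": {"cf": "cf3", "col": "col3", "type": "int"},
    },
}


def rowkey_fields(cat):
    """Return the DataFrame field names that are stored in the HBase row key."""
    return [name for name, spec in cat["columns"].items() if spec["cf"] == "rowkey"]


# The connector is assumed to receive the catalog as a JSON string option.
catalog_json = json.dumps(catalog, indent=2)
print(catalog_json)
print(rowkey_fields(catalog))
```

Keeping the mapping in one JSON document means the same definition can be reused for both reads and writes.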

Usage - Write to HBase

Usage - Construct DataFrame

Usage - Language-Integrated Query

Usage - Spark SQL

Usage - With Other Data Sources

Spark HBase Connector Architecture

Byte Array Order: SHORT/INT/LONG
Values in byte-array order: 0, 1, 2, …, MAX, MIN, …, -2, -1
WHERE X <= 2
WHERE X >= -2
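
The ordering on this slide can be checked with a small sketch. It assumes signed integers are stored as big-endian two's-complement bytes and that keys are compared as unsigned bytes. The helper names are hypothetical and are not part of the connector. Under that assumption, a numeric range that crosses zero is not contiguous in byte order and must be split into two byte ranges.

```python
import struct

INT_MAX, INT_MIN = 2**31 - 1, -2**31


def encode_int(x):
    """Big-endian two's-complement bytes of a signed 32-bit integer (assumed layout)."""
    return struct.pack(">i", x)


values = [-2, -1, 0, 1, 2, INT_MAX, INT_MIN]

# Python compares bytes objects lexicographically as unsigned bytes.
byte_order = sorted(values, key=encode_int)
print(byte_order)  # [0, 1, 2, 2147483647, -2147483648, -2, -1]


def byte_ranges(lo, hi):
    """Split the numeric range [lo, hi] into inclusive ranges contiguous in byte order."""
    if lo >= 0 or hi < 0:
        return [(encode_int(lo), encode_int(hi))]
    # Non-negative part first, then the negative part, which sorts after it.
    return [(encode_int(0), encode_int(hi)), (encode_int(lo), encode_int(-1))]


def matches(x, ranges):
    """True if the encoded value falls inside any of the byte ranges."""
    key = encode_int(x)
    return any(lo <= key <= hi for lo, hi in ranges)


le_2 = byte_ranges(INT_MIN, 2)    # WHERE X <= 2
ge_m2 = byte_ranges(-2, INT_MAX)  # WHERE X >= -2
print(len(le_2), len(ge_m2))      # 2 2
```

Both predicates on the slide therefore turn into two scan ranges rather than one.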

Implementation
• Partition Pruning:
– Split into Multiple Ranges, e.g., WHERE X < 2
• Data Locality:
– Each RDD Partition Has a Preferred Location
• Column Pruning:
– Only Required Columns in Scan/BulkGet
• Predicate Pushdown:
– HBase Built-in Filters
• Scan/BulkGets:
– Grouped by Region Server
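
A minimal sketch of two of the ideas above, partition pruning and grouping point lookups by region server, follows. The region layout, server names, and helper functions are hypothetical, and the connector's own implementation may differ. The code only illustrates that regions whose key range does not overlap the requested range need no task.

```python
from collections import defaultdict

# Hypothetical region layout: (start_key, end_key, region_server).
# An empty bytes value means the range is unbounded on that side.
REGIONS = [
    (b"", b"g", "rs1"),
    (b"g", b"n", "rs2"),
    (b"n", b"t", "rs1"),
    (b"t", b"", "rs3"),
]


def overlaps(region, start, stop):
    """True if region [r_start, r_end) intersects the scan range [start, stop)."""
    r_start, r_end, _ = region
    starts_before_region_ends = r_end == b"" or start < r_end
    ends_after_region_starts = stop == b"" or stop > r_start
    return starts_before_region_ends and ends_after_region_starts


def prune(regions, start, stop):
    """Keep only the regions that can hold rows in [start, stop)."""
    return [r for r in regions if overlaps(r, start, stop)]


def locate(regions, key):
    """Return the region server hosting the given row key."""
    for r_start, r_end, server in regions:
        if key >= r_start and (r_end == b"" or key < r_end):
            return server
    raise KeyError(key)


def group_gets(regions, keys):
    """Group point lookups by region server so each server gets one batch."""
    grouped = defaultdict(list)
    for key in keys:
        grouped[locate(regions, key)].append(key)
    return dict(grouped)


print([r[2] for r in prune(REGIONS, b"h", b"p")])  # ['rs2', 'rs1']
print(group_gets(REGIONS, [b"a", b"h", b"p", b"z", b"c"]))
```

In this sketch, each surviving region would become one partition whose preferred location is its region server.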

Kerberos Cluster
• Kerberos Ticket
• Token Retrieval and Renewal
• Long Running Service

FLOAT/DOUBLE: IEEE-754
Values in byte-array order: 0.0, 0.2, …, MAX, -0.0, …, -2.0, …, MIN
WHERE X <= 2.0D
WHERE X >= -2.0D
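
The floating-point case can be sketched the same way. This assumes doubles are stored as their raw big-endian IEEE-754 bits, which is what `struct.pack(">d", ...)` produces; the connector's actual encoding may differ. Under that assumption, non-negative values sort in ascending order, and negative values sort after them in reverse numeric order, starting at -0.0. The helper names are hypothetical.

```python
import struct


def encode_double(x):
    """Raw big-endian IEEE-754 bytes of a double (assumed storage layout)."""
    return struct.pack(">d", x)


values = [0.0, 0.2, 2.0, -0.0, -0.2, -2.0]

# Sort by unsigned byte comparison, as a byte-ordered key-value store would.
byte_order = sorted(values, key=encode_double)
print(byte_order)  # [0.0, 0.2, 2.0, -0.0, -0.2, -2.0]


def ranges_ge(bound):
    """Inclusive byte ranges matching X >= bound, for a negative bound."""
    return [
        # Negative side: byte order runs from -0.0 down to the bound.
        (encode_double(-0.0), encode_double(bound)),
        # Non-negative side: everything from 0.0 up to +infinity.
        (encode_double(0.0), encode_double(float("inf"))),
    ]


def matches(x, ranges):
    """True if the encoded value falls inside any of the byte ranges."""
    key = encode_double(x)
    return any(lo <= key <= hi for lo, hi in ranges)


ge_m2 = ranges_ge(-2.0)  # WHERE X >= -2.0D
print([x for x in [-3.0, -2.0, -0.5, 0.0, 1.0] if matches(x, ge_m2)])
```

The negative byte range is bounded by -0.0 on one side and by the predicate's bound on the other, because byte order on that side is the reverse of numeric order.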

HBase Meta Table


Introduction to the Hortonworks YARN Ready Program

Hortonworks

Discover.hdp2.2.h base.final[2]

Hortonworks

Spark crash course workshop at Hadoop Summit

DataWorks Summit

This document provides an overview of installing and programming with Apache Spark on Hortonworks Data Platform (HDP). It introduces Spark and its components, benefits over other frameworks, and Hortonworks' commitment to Spark. The document outlines an example Spark programming workflow using Resilient Distributed Datasets (RDDs) in Scala, and covers common RDD transformations, actions, and persistence methods. It also discusses Spark deployment modes like standalone and on YARN, and reference HDP architectures using Spark.

Data Governance in Apache Falcon - Hadoop Summit Brussels 2015

Seetharam Venkatesh

Apache Falcon is a data management platform that allows users to centrally manage data lifecycles across Hadoop clusters. It defines data entities like clusters, feeds, and processes to represent data pipelines. Falcon then automatically generates workflows to orchestrate the movement of data according to defined policies for replication, retention, and late data handling. It also provides data governance features like lineage tracing, auditing, and tagging. The latest version of Falcon includes new capabilities for disaster recovery mirroring and replication to cloud storage services.

Discover.hdp2.2.ambari.final[1]

Hortonworks

Apache Ambari is a single framework for IT administrators to provision, manage and monitor a Hadoop cluster. Apache Ambari 1.7.0 is included with Hortonworks Data Platform 2.2. In this 30-minute webinar, Hortonworks Product Manager Jeff Sposetti and Apache Ambari committer Mahadev Konar discussed new capabilities including: Improvements to Ambari core - such as support for ResourceManager HA Extensions to Ambari platform - introducing Ambari Administration and Ambari Views Enhancements to Ambari Stacks - dynamic configuration recommendations and validations via a "Stack Advisor"

Dancing elephants - efficiently working with object stores from Apache Spark ...

DataWorks Summit

As Hadoop applications move into cloud deployments, object stores become more and more the source and destination of data. But object stores are not filesystems: sometimes they are slower; security is different, What are the secret settings to get maximum performance from queries against data living in cloud object stores? That's at the filesystem client, the file format and the query engine layers? It's even how you lay out the files —the directory structure and the names you give them. We know these things, from our work in all these layers, from the benchmarking we've done —and the support calls we get when people have problems. And now: we'll show you. This talk will start from the ground up "why isn't an object store a filesystem?" issue, showing how that breaks fundamental assumptions in code, and so causes performance issues which you don't get when working with HDFS. We'll look at the ways to get Apache Hive and Spark to work better, looking at optimizations which have been done to enable this —and what work is ongoing. Finally, we'll consider what your own code needs to do in order to adapt to cloud execution.

Driving Enterprise Data Governance for Big Data Systems through Apache Falcon

DataWorks Summit

Integration of Hive and HBase

Hortonworks

Apache Hive provides SQL-like access to your stored data in Apache Hadoop. Apache HBase stores tabular data in Hadoop and supports update operations. The combination of these two capabilities is often desired, however, the current integration show limitations such as performance issues. In this talk, Enis Soztutar will present an overview of Hive and HBase and discuss new updates/improvements from the community on the integration of these two projects. Various techniques used to reduce data exchange and improve efficiency will also be provided.

Integration of HIve and HBase

Hortonworks

This document discusses integrating Apache Hive and HBase. It provides an overview of Hive and HBase, describes use cases for querying HBase data using Hive SQL, and outlines features and improvements for Hive and HBase integration. Key points include mapping Hive schemas and data types to HBase tables and columns, pushing filters and other operations down to HBase, and using a storage handler to interface between Hive and HBase. The integration allows analysts to query both structured Hive and unstructured HBase data using a single SQL interface.

SoCal BigData Day

John Park

Discover HDP 2.1: Apache Solr for Hadoop Search

Hortonworks

This document appears to be a presentation about Apache Solr for Hadoop search using the Hortonworks Data Platform (HDP). The agenda includes an overview of Apache Solr and Hadoop search, a demo of Hadoop search, and a question and answer section. The presentation discusses how Solr provides scalable indexing of data stored in HDFS and powerful search capabilities. It also includes a reference architecture showing how Solr integrates with Hadoop for search and indexing.


Recommendation System using RAG Architecture

fredae14

Mariano G Tinti - Decoding SpaceX

Mariano Tinti

Webinar: Designing a schema for a Data Warehouse

Federico Razzoli

Are you new to data warehouses (DWH)? Do you need to check whether your data warehouse follows the best practices for a good design? In both cases, this webinar is for you. A data warehouse is a central relational database that contains all measurements about a business or an organisation. This data comes from a variety of heterogeneous data sources, which includes databases of any type that back the applications used by the company, data files exported by some applications, or APIs provided by internal or external services. But designing a data warehouse correctly is a hard task, which requires gathering information about the business processes that need to be analysed in the first place. These processes must be translated into so-called star schemas, which means, denormalised databases where each table represents a dimension or facts. We will discuss these topics: - How to gather information about a business; - Understanding dictionaries and how to identify business entities; - Dimensions and facts; - Setting a table granularity; - Types of facts; - Types of dimensions; - Snowflakes and how to avoid them; - Expanding existing dimensions and facts.

HCL Notes and Domino License Cost Reduction in the World of DLAU

panagenda

Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/ The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this! We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model. Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able lower your cost through an optimized configuration and keep it low going forward. These topics will be covered - Reducing license cost by finding and fixing misconfigurations and superfluous accounts - How do CCB and CCX licenses really work? - Understanding the DLAU tool and how to best utilize it - Tips for common problem areas, like team mailboxes, functional/test users, etc - Practical examples and best practices to implement right away

WeTestAthens: Postman's AI & Automation Techniques

Postman

Introduction of Cybersecurity with OSS at Code Europe 2024

Hiroshi SHIBATA

I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems. The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS. Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application. I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.

Programming Foundation Models with DSPy - Meetup Slides

Zilliz

Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...

saastr

How to Get CNIC Information System with Paksim Ga.pptx

danishmna97

How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf

Chart Kalyan

OpenID AuthZEN Interop Read Out - Authorization

David Brossard

Columbus Data & Analytics Wednesdays - June 2024

Jason Packer

Recently uploaded (20)

Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack

HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU

Artificial Intelligence for XMLDevelopment

“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...

June Patch Tuesday

Cosa hanno in comune un mattoncino Lego e la backdoor XZ?

Choosing The Best AWS Service For Your Website + API.pptx

Project Management Semester Long Project - Acuity

Recommendation System using RAG Architecture

Mariano G Tinti - Decoding SpaceX

Webinar: Designing a schema for a Data Warehouse

HCL Notes and Domino License Cost Reduction in the World of DLAU

WeTestAthens: Postman's AI & Automation Techniques

Introduction of Cybersecurity with OSS at Code Europe 2024

Programming Foundation Models with DSPy - Meetup Slides

Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...

How to Get CNIC Information System with Paksim Ga.pptx

How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf

OpenID AuthZEN Interop Read Out - Authorization

Columbus Data & Analytics Wednesdays - June 2024

Spark + HBase

1. Spark + HBase
Bringing HBase Data Efficiently into Spark with DataFrame Support
Zhan Zhang, Software Engineer
04/08/2016

2. About Zhan Zhang
• Zhan Zhang (Software Engineer at Hortonworks)
• Currently focuses on Apache Spark, Hadoop, etc.
• Contributes to Apache Spark, YARN, HBase, Ambari, etc.
• Experience in computer networks, distributed systems, and machine learning platforms

3. Why Revamp the Existing HBase Connector?
• Limited Spark Support in HBase Upstream
– Scalability
– RDD level, but Spark is moving to DataFrame/Dataset
– Data Loss and Data Duplication
• Stability
– Correctness
– Stability Impact with Co-processor
– Serialized RDD Lineage to HBase
– Maintenance Overhead: Internal Hacks

4. What Improvement Have We Made?
• Combine Spark and HBase
– Spark Catalyst Engine for Query Plan and Optimization
– HBase for Fast Access KV Store
– Implement Standard External Data Source with Built-in Filter
• High Performance
– Data Locality: Move Computation to Data
– Partition Pruning: Task only Performed in RS Holding Requested Data
– Column Pruning / Predicate Pushdown: Reduce Network Overhead
• Full Fledged DataFrame Support
– Spark-SQL
– Integrated Language Query
• Run on Top of Existing HBase Table
– Native Support Java Primitive Types

16. Implementation
• Partition Pruning:
– Split into Multiple Ranges, e.g., WHERE X < 2
• Data Locality:
– Each RDD Partition Has a Preferred Location
• Column Pruning:
– Only Required Columns in Scan/BulkGet
• Predicate Pushdown:
– HBase Built-in Filters
• Scan/BulkGets:
– Grouped by Region Server
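The partition pruning and per-region-server grouping listed on this slide can be illustrated with a small, self-contained sketch. This is not the connector's code: `Region`, `KeyRange`, and `prune_and_group` are hypothetical names, and the region boundaries are made up. It only shows the idea that row-key ranges derived from a predicate are intersected with region boundaries, so that work is planned only for region servers holding requested data.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Region(NamedTuple):
    """Hypothetical stand-in for an HBase region and the server hosting it."""
    start: bytes           # inclusive; b"" means "from the first row"
    end: Optional[bytes]   # exclusive; None means "to the last row"
    server: str


class KeyRange(NamedTuple):
    """Row-key range derived from a predicate (hypothetical type)."""
    start: bytes           # inclusive
    end: Optional[bytes]   # exclusive; None means unbounded


def overlaps(region: Region, rng: KeyRange) -> bool:
    # The region lies entirely before the range, or entirely after it.
    region_before = region.end is not None and region.end <= rng.start
    region_after = rng.end is not None and rng.end <= region.start
    return not (region_before or region_after)


def clip(region: Region, rng: KeyRange) -> KeyRange:
    # Restrict the requested range to the part stored in this region.
    start = max(region.start, rng.start)
    ends = [e for e in (region.end, rng.end) if e is not None]
    return KeyRange(start, min(ends) if ends else None)


def prune_and_group(regions, ranges):
    """Keep only regions that overlap a requested range, grouped by server."""
    plan = defaultdict(list)
    for region in regions:
        for rng in ranges:
            if overlaps(region, rng):
                plan[region.server].append(clip(region, rng))
    return dict(plan)


# Made-up layout: three regions on three region servers.
regions = [
    Region(b"", b"g", "rs1"),
    Region(b"g", b"p", "rs2"),
    Region(b"p", None, "rs3"),
]

# A predicate such as "key < 'c'" becomes one range; only rs1 is touched.
plan = prune_and_group(regions, [KeyRange(b"", b"c")])
```

In this sketch, servers absent from `plan` get no task at all, which is the effect the slide attributes to partition pruning.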

20. Kerberos Cluster
• Kerberos Ticket
• Token Retrieval and Renewal
• Long Running Service

21. FLOAT/DOUBLE: IEEE-754
[Figure: ordering of encoded FLOAT/DOUBLE values, with the labels 0.0, 0.2…, MAX, -0.0, -2.0…, MIN and the example predicates WHERE X <= 2.0D and WHERE X >= -2.0D]
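The point of this slide can be reproduced with a short sketch. It assumes values are stored as raw big-endian IEEE-754 bytes and compared as unsigned byte strings, as HBase row keys are; the helper names `encode_double`, `byte_ranges_le`, and `matches` are hypothetical and not taken from the connector. Under that assumption, byte order runs through the non-negative values first and then through the negative values in order of increasing magnitude, so a predicate such as WHERE X <= 2.0D has to be split into more than one byte range.

```python
import struct


def encode_double(x: float) -> bytes:
    """Raw big-endian IEEE-754 encoding of a double (8 bytes)."""
    return struct.pack(">d", x)


# Sorting by encoded bytes does not give numeric order:
# non-negative values come first, then -0.0, then negatives by magnitude.
values = [-2.0, -0.0, 0.0, 0.2, 2.0, 1e308, -1e308]
by_bytes = sorted(values, key=encode_double)


def byte_ranges_le(bound: float):
    """Inclusive byte ranges covering X <= bound (NaN is ignored here)."""
    neg_inf = encode_double(float("-inf"))
    if bound == 0:
        bound = 0.0  # treat -0.0 like 0.0 so the first range stays small
    if bound >= 0:
        return [
            (encode_double(0.0), encode_double(bound)),  # 0.0 .. bound
            (encode_double(-0.0), neg_inf),              # every negative value
        ]
    # Negative bound: a single range from bound down to -infinity.
    return [(encode_double(bound), neg_inf)]


def matches(x: float, ranges) -> bool:
    key = encode_double(x)
    return any(lo <= key <= hi for lo, hi in ranges)


ranges = byte_ranges_le(2.0)
selected = [v for v in values if matches(v, ranges)]
```

The mirror-image predicate on the slide, WHERE X >= -2.0D, would need the same kind of split: one range for all non-negative values and one for the negatives from -0.0 through -2.0.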
