Keynote slides from Big Data Spain, November 2016. Offers thoughts on how the Hadoop ecosystem is growing and changing to support the enterprise, including Hive, Spark, NiFi, security and governance, streaming, and the cloud.
This document discusses new features in Apache Hive 2.0, including:
- The addition of procedural SQL (HPL/SQL), bringing capabilities like loops and branches.
- A new execution engine called LLAP that uses persistent daemons to enable sub-second queries by caching data in memory.
- The option to use HBase as the metastore to speed up query planning times for queries involving thousands of partitions.
- Improvements to Hive on Spark, the cost-based optimizer, and many bug fixes and performance enhancements.
The document discusses new features in Hive 2.0 including Hive LLAP (Live Long And Process) and Hive on ACID (Atomic, Consistent, Isolated, Durable). Hive LLAP introduces an in-memory caching mechanism that provides sub-second query performance for Hive. Hive on ACID allows for transactions on Hive tables including updates, deletes, and streaming ingestion while maintaining consistency and concurrency. The document provides overviews of how both features work and improvements they provide for analytics workloads on Hive.
ORC files were originally introduced in Hive, but have now migrated to an independent Apache project. This has sped up the development of ORC and simplified integrating ORC into other projects, such as Hadoop, Spark, Presto, and NiFi. There are also many new tools built on top of ORC, such as Hive’s ACID transactions and LLAP, which provides incredibly fast reads for your hot data. LLAP also provides strong security guarantees that allow each user to see only the rows and columns they have permission for.
This talk will discuss the details of the ORC and Parquet formats and the relevant tradeoffs. In particular, it will cover how to format your data and which options to use to maximize read performance, including when and how to use ORC’s schema evolution, bloom filters, and predicate push down. It will also show how to use the tools to translate ORC files into human-readable formats, such as JSON, and to display the rich metadata from the file, including the types in the file and the min, max, and count for each column.
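For a concrete feel for such tooling, here is a minimal Python sketch using the third-party pyorc package (not the Java orc-tools shown in the talk) to print an ORC file's schema and dump its rows as JSON; the file name is a placeholder.

```python
import json
import pyorc

# Read a local ORC file; "example.orc" is a placeholder path.
with open("example.orc", "rb") as f:
    reader = pyorc.Reader(f, struct_repr=pyorc.StructRepr.DICT)
    print(reader.schema)        # e.g. struct<name:string,age:int>
    for row in reader:          # rows come back as dicts
        print(json.dumps(row))  # one JSON object per row
```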
- Hive originally only supported updating partitions by overwriting entire files, which caused issues for concurrent readers and limited functionality like row-level updates.
- The need for ACID transactions in Hive arose from wanting to support updating data in near real-time as it arrives and making ad hoc data changes without complex workarounds.
- Hive's ACID implementation stores changes as delta files, uses the metastore to manage transactions and locks, and runs compactions to merge deltas into base files.
- There were initial issues around correctness, performance, usability and resilience, but many have been addressed with ongoing work focused on further improvements and new features like multi-statement transactions and better integration with LLAP.
Apache Hive is an Enterprise Data Warehouse built on top of Hadoop. Hive supports Insert/Update/Delete SQL statements with transactional semantics and read operations that run at snapshot isolation. This talk will describe the intended use cases, the architecture of the implementation, new features such as the SQL MERGE statement, and recent improvements. The talk will also cover the Streaming Ingest API, which allows writing batches of events into a Hive table without using SQL. This API is used by Apache NiFi, Storm, and Flume to stream data directly into Hive tables and make it visible to readers in near real time.
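As a hedged illustration of the MERGE statement mentioned above, the sketch below issues one through the third-party PyHive client; the host, tables, and columns are hypothetical, and the target table must be a transactional (ACID) table.

```python
from pyhive import hive  # third-party HiveServer2 client

# Connect to HiveServer2; host and database are placeholders.
conn = hive.Connection(host="hs2.example.com", port=10000, database="default")
cur = conn.cursor()

# Apply staged inserts, updates, and deletes in a single atomic statement.
cur.execute("""
    MERGE INTO customers AS t
    USING customer_updates AS s
    ON t.id = s.id
    WHEN MATCHED AND s.is_deleted = true THEN DELETE
    WHEN MATCHED THEN UPDATE SET email = s.email
    WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.email, s.created_at)
""")
```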
Apache Zeppelin and Spark for Enterprise Data Science (Bikas Saha)
Apache Zeppelin and Spark are turning out to be useful tools in the toolkit of the modern data scientist when working on large scale datasets for machine learning. Zeppelin makes Big Data accessible with minimal effort using web browser based notebooks to interact with data in Hadoop. It enables data scientists to interactively explore and visualize their data and collaborate with others to develop models. Zeppelin has great integration with Apache Spark that delivers many machine learning algorithms out of the box to Zeppelin users as well as providing a fast engine to run custom machine learning on Big Data. The talk will describe the latest in Zeppelin and focus on how it has been made ready for the enterprise. With support for secure Hadoop clusters, LDAP/AD integration, user impersonation and session separation, Zeppelin can now be confidently used in secure and multi-tenant enterprise domains.
This document discusses new features in Apache Hive 2.0, including:
1) Adding procedural SQL capabilities through HPL/SQL for writing stored procedures.
2) Improving query performance through LLAP which uses persistent daemons and in-memory caching to enable sub-second queries.
3) Speeding up query planning by using HBase as the metastore instead of a relational database.
4) Enhancements to Hive on Spark such as dynamic partition pruning and vectorized operations.
5) Default use of the cost-based optimizer and continued improvements to statistics collection and estimation.
Apache Hadoop YARN is a modern resource-management platform that can host multiple data processing engines for various workloads like batch processing (MapReduce), interactive SQL (Hive, Tez), real-time processing (Storm), existing services and a wide variety of custom applications. These applications can all co-exist on YARN and share a single data center in a cost-effective manner with the platform worrying about resource management, isolation and multi-tenancy.
YARN is now adding support for services in a first-class manner. This talk will first cover the challenges of running services on YARN, and then move on to the changes that were made to the ResourceManager to support scheduling services on YARN (such as affinity and anti-affinity). The talk will then cover the changes made in the NodeManager and features such as container restart and container upgrades. It will also cover new additions to YARN like the new application manager (which will allow users to bring services workloads onto YARN by providing features such as container orchestration and management) and the DNS server that uses the YARN registry to enable service discovery.
Demand for cloud is through the roof. Cloud is turbocharging the Enterprise IT landscape with agility and flexibility, and discussions of cloud architecture now dominate Enterprise IT. Cloud enables many ephemeral, on-demand use cases, which is a game-changing opportunity for analytic workloads. But all of this comes with the challenge of running enterprise workloads in the cloud securely and with ease.
In this session, we will take you through Cloudbreak as a solution to simplify provisioning and managing enterprise workloads while providing an open and common experience for deploying workloads across clouds. We will discuss the challenges (and opportunities) to run enterprise workloads in the cloud and will go through how the latest from Cloudbreak enables enterprises to easily and securely run big data workloads. This includes deep-dive discussion on autoscaling, Ambari Blueprints, recipes, custom images, and enabling Kerberos -- which are all key capabilities for Enterprise deployments.
As a last topic we will discuss how we deployed and operate Cloudbreak as a Service internally which enables rapid cluster deployment for prototyping and testing purposes.
Speakers
Peter Darvasi, Cloudbreak Partner Engineer, Hortonworks
Richard Doktorics, Staff Engineer, Hortonworks
This document discusses running Oracle E-Business Suite on Oracle Cloud. It provides an overview of Oracle Cloud offerings including Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). It outlines reasons for moving E-Business Suite to Oracle Cloud like enabling business agility, lowering costs and risks, and supporting growth. The document also covers solution details such as deployment choices, roadmap for automation, and use cases for transitioning to Oracle Cloud.
An Overview on Optimization in Apache Hive: Past, Present, Future (DataWorks Summit)
Apache Hive has been continuously evolving to support a broad range of use cases, bringing it beyond its batch processing roots to its current support for interactive queries with sub-second response times using LLAP. However, the development of its execution internals is not sufficient to guarantee efficient performance, since poorly optimized queries can create a bottleneck in the system. Hence, each release of Hive has included new features for its optimizer aimed to generate better plans and deliver improvements to query execution. In this talk, we present the development of the optimizer since its initial release. We describe its current state and how Hive leverages the latest Apache Calcite features to generate the most efficient execution plans. We show numbers demonstrating the improvements brought to Hive performance, and we discuss future directions for the next-generation Hive optimizer, which include an enhanced cost model, materialized views support, and complex query decorrelation.
Apache Falcon - Simplifying Managing Data Jobs on Hadoop (DataWorks Summit)
Apache Falcon is a data management platform on Hadoop that provides a holistic way to declaratively define and manage data pipelines and workflows. It allows users to specify feeds, processes, and clusters to orchestrate the flow of data across Hadoop clusters. Falcon handles scheduling, dependency management, replication, and data governance. The architecture uses Oozie to schedule workflows and notifications are sent through JMS. Case studies demonstrate how Falcon can be used for multi-cluster failover and distributed processing across data centers.
Apache Phoenix Query Server, PhoenixCon 2016 (Josh Elser)
This document discusses Apache Phoenix Query Server, which provides a client-server abstraction for Apache Phoenix using Apache Calcite's Avatica sub-project. It allows Phoenix to have thin clients by offloading computational resources to query servers running on Hadoop clusters. This enables non-Java clients through a standardized HTTP API. The query server implementation uses HTTP, Protocol Buffers for serialization, and common libraries like Jetty and Dropwizard Metrics. It aims to simplify Phoenix client development and improve performance and scalability.
Apache Ambari is used by thousands of Hadoop Operators to manage the deployment, lifecycle, and automation of DevOps for Hadoop ecosystem projects. The Ambari engineering team will talk about improvements being made to the automation, metrics, logging, upgrade, and other core frameworks within Ambari as the project is being re-imagined.
Starting out, Apache Ambari installed a handful of Apache Hadoop ecosystem projects, on a few operating systems, and helped with the most basic Hadoop operational tasks. Today, the product manages over 20 different services, runs on multiple major operating systems and versions, and automates many of the most challenging Hadoop operational tasks in the most secure customer environments.
As part of this talk, the engineering team will walk you through what we've learned, the challenges we've overcome, and how the Apache Ambari community has changed the product to handle them. The future is fast approaching, and with it comes new on-premise and cloud deployment architectures. See how Apache Ambari is being re-imagined to handle these new challenges.
Speakers
Paul Codding, Product Management Director, Hortonworks
Oliver Szabo, Senior Software Engineer, Hortonworks
The document discusses recent releases and major new features of HBase 2.0 and Phoenix 5.0. HBase 2.0 focuses on off-heap memory usage to improve performance, as well as new features like async client, region assignment improvements, and backup/restore capabilities. Phoenix 5.0 includes API cleanup, improved join processing using cost-based optimizations, enhanced index handling including failure recovery, and integration with Apache Kafka.
With its large install base in production, the Storm 1.x line has proven itself as a stable and reliable workhorse that scales well horizontally. Much has been learnt from evolving the 1.x line that we can now leverage to build the next generation execution engine. Under the STORM-2284 umbrella, we are working hard to bring you this new engine which is being redesigned at a fundamental level for Storm 2.0. The goal is to dramatically improve performance and enhance Storm's abilities without breaking compatibility.
This improved vertical scaling will help meet the needs of the growing user base by delivering more performance with less hardware.
In this talk, we will take an in-depth look at the existing and proposed designs for Storm's threading model and the messaging subsystem. We will also do a quick run-down of the major proposed improvements and share some early results from the work in progress.
Speaker
Roshan Naik, Senior MTS, Hortonworks
Building Data Pipelines for Solr with Apache NiFi (Bryan Bende)
This document provides an overview of using Apache NiFi to build data pipelines that index data into Apache Solr. It introduces NiFi and its capabilities for data routing, transformation and monitoring. It describes how Solr accepts data through different update handlers like XML, JSON and CSV. It demonstrates how NiFi processors can be used to stream data to Solr via these update handlers. Example use cases are presented for indexing tweets, commands, logs and databases into Solr collections. Future enhancements are discussed like parsing documents and distributing commands across a Solr cluster.
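To make the update-handler idea concrete, here is a minimal Python sketch that posts JSON documents to Solr's update endpoint, the same kind of API the NiFi processors stream to; the Solr URL and collection name are placeholders.

```python
import requests

# Two toy documents destined for a hypothetical "tweets" collection.
docs = [
    {"id": "1", "text": "hello solr"},
    {"id": "2", "text": "hello nifi"},
]

# Solr's JSON update handler accepts an array of documents at /update.
resp = requests.post(
    "http://localhost:8983/solr/tweets/update?commit=true",
    json=docs,
)
resp.raise_for_status()
print(resp.json())  # status 0 indicates the update was accepted
```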
AIOUG HA Day Oct 2015: GoldenGate - High Availability Day 2015 (AIOUG Hyderabad Chapter)
This document provides an overview of Oracle GoldenGate and discusses its key components and topologies. It begins with background information about the presenter and then covers topics such as Oracle GoldenGate's supported platforms, common topologies used with Oracle GoldenGate including unidirectional data integration and high availability, and the benefits it provides such as zero downtime upgrades and live reporting. It also discusses Oracle GoldenGate's components including the extract, replicat, trail files, and pump. Finally, it touches on performance tuning techniques for Oracle GoldenGate including adjusting TCP buffer sizes and using checkpoints.
Apache Hive is a rapidly evolving project which continues to enjoy great adoption in the big data ecosystem. As Hive continues to grow its support for analytics, reporting, and interactive query, the community is hard at work improving it along many different dimensions and use cases. This talk will provide an overview of the latest and greatest features and optimizations which have landed in the project over the last year. Materialized views, the extension of ACID semantics to non-ORC data, and workload management are some noteworthy new features.
We will discuss optimizations which provide major performance gains, including significantly improved performance for ACID tables. The talk will also provide a glimpse of what is expected to come in the near future.
Apache Phoenix’s relational database view over Apache HBase delivers a powerful tool which enables users and developers to quickly and efficiently access their data using SQL. However, Phoenix only provides a Java client, in the form of a JDBC driver, which limits Phoenix access to JVM-based applications. The Phoenix QueryServer is a standalone service which provides the building blocks to use Phoenix from any language, not just those running in a JVM. This talk will serve as a general purpose introduction to the Phoenix QueryServer and how it complements existing Apache Phoenix applications. Topics covered will range from design and architecture of the technology to deployment strategies of the QueryServer in production environments. We will also include explorations of the new use cases enabled by this technology like integrations with non-JVM based languages (Ruby, Python or .NET) and the high-level abstractions made possible by these basic language integrations.
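As one example of the non-JVM integrations described above, this hedged sketch uses the third-party phoenixdb Python driver, which speaks the QueryServer's HTTP/Avatica protocol; the endpoint and table names are placeholders.

```python
import phoenixdb  # third-party DB-API driver for Phoenix Query Server

# The QueryServer listens on HTTP (port 8765 by default).
conn = phoenixdb.connect("http://queryserver.example.com:8765/", autocommit=True)
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name VARCHAR)")
cur.execute("UPSERT INTO users VALUES (?, ?)", (1, "admin"))
cur.execute("SELECT id, name FROM users")
print(cur.fetchall())
```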
Apache HBase Internals You Hoped You Never Needed to Understand (Josh Elser)
Covers numerous internal features, concepts, and implementations of Apache HBase. The focus will be driven from an operational standpoint, investigating each component enough to understand its role in Apache HBase and the generic problems that each are trying to solve. Topics will range from HBase’s RPC system to the new Procedure v2 framework, to filesystem and ZooKeeper use, to backup and replication features, to region assignment and row locks. Each topic will be covered at a high-level, attempting to distill the often complicated details down to the most salient information.
The document discusses new features in Apache Ambari 2.2, including Express Upgrade which allows automated upgrades of Hadoop clusters with downtime but potentially faster completion compared to Rolling Upgrade. Other new features include exporting metric graph data, user-selectable timezones and time ranges for dashboards, automatic logout of inactive Ambari web users, and saving Kerberos admin credentials for easier cluster changes.
Aman Sharma: Oracle 12c RAC - High Availability Day 2015 (AIOUG Hyderabad Chapter)
This document discusses new features in Oracle RAC and ASM in Oracle Database 12c. It introduces Flex Clusters, which use a hub-and-spoke topology to improve scalability over traditional RAC clusters. Leaf nodes run application workloads and connect to hub nodes, which run databases and ASM. Server pools can now manage both hub and leaf nodes to isolate workloads. Other new features include shared Grid Naming Service (GNS) configurations, policy-based cluster administration using server categorization and policies, and Multitenant databases with RAC.
Every development shop is unique, and sometimes that uniqueness can hinder using tools. SQL Developer and Data Modeler have multiple mechanisms that allow for customizations. These customizations can range from simple to complex and can help tailor the tooling to any environment. Some are as simple as a colored warning to remind the user what is production vs. development. Some could auto-generate code by walking over a data model. The most complex can change anything at all in the tool. Ever think of a command that should be in SQL*Plus scripting? Want to auto-generate table APIs?
This document discusses adding ACID transaction support to Hive to allow for updates, deletes, and inserts of rows. It describes how transactions will be implemented using delta files stored in HDFS and a transaction manager using the metastore database. The new features will initially support auto-commit transactions with snapshot isolation in Hive 0.13, with explicit transaction commands like BEGIN, COMMIT, and ROLLBACK to follow in a later release. Streaming ingest of data is also supported using a new interface for small batch writes and commits. Initial limitations include support only for bucketed, unsorted tables.
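A minimal sketch of the kind of table this design targets, assuming the Hive 0.13/0.14-era requirements (ORC storage, bucketing, and the transactional table property); the connection details and table are hypothetical.

```python
from pyhive import hive  # third-party HiveServer2 client; host is a placeholder

cur = hive.Connection(host="hs2.example.com", port=10000).cursor()

# ACID tables initially had to be bucketed and stored as ORC, with the
# transactional property set; updates and deletes land in delta files
# that compaction later merges into base files.
cur.execute("""
    CREATE TABLE page_views (
        user_id BIGINT,
        url     STRING
    )
    CLUSTERED BY (user_id) INTO 8 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true')
""")
cur.execute("DELETE FROM page_views WHERE user_id = 42")  # written as a delta file
```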
This document provides an introduction to Hive, including:
- What Hive is and why it is used to run SQL queries on Hadoop data as MapReduce jobs.
- Hive's logical table/physical location/data format architecture.
- An overview of Hive's architecture and metastore configuration.
- A comparison of Hive's schema-on-read approach versus traditional databases' schema-on-write.
- Descriptions of Hive's data types and table types, including managed and external tables.
Hive & HBase for Transaction Processing, Hadoop Summit EU Apr 2015 (alanfgates)
The document discusses using Hive, HBase, Phoenix, and Calcite to build a single data store for both analytics and transaction processing. It describes some recent improvements to Hive like LLAP (Live Long and Process) that aim to achieve sub-second query response times, as well as using HBase as the Hive metastore to improve performance.
Keynote from Apache Big Data EU. This introduces training that we are doing at Hortonworks to help our employees understand and work well as part of the Apache Software Foundation.
Apache Spark Usage in the Open Source Ecosystem (Databricks)
Apache Spark is an active member of the broad open source community beyond the Apache Foundation. Every day thousands of users combine capabilities of Spark with other open source software to get their job done. This is not by chance. Spark has been designed to behave well with existing ecosystems. For example, PySpark is designed to work well with Pandas, Numpy and other python packages. In this talk we will present an analysis of libraries and open source tools that are commonly used along with Spark in JVM, Python and R ecosystems. Our quantitative results are based on usage of thousands of Spark users. We will show the Spark Summit attendees what the rest of their community finds useful to complement the power of Spark and what parts of Spark API is used in conjunction with most popular open source libraries.
Hive Analytic Workloads, Hadoop Summit San Jose 2014 (alanfgates)
- Hive has undergone significant development over the past few years focused on improving performance, scale, and SQL support. Major releases include 0.11, 0.12, and 0.13.
- The 0.13 release focuses on performance improvements like Hive on Tez and vectorized processing to improve query performance by 100x, as well as security features like SQL standard authorization.
- Ongoing work is focused on further SQL support, ACID compliance, and optimizations to the optimizer.
Hive 0.14 adds ACID transactional support which allows for inserting, updating, and deleting rows in Hive tables. It uses a new transaction manager and lock manager to provide snapshot isolation across DML statements. Data is stored in HDFS in a layout of base files and transactional delta files which are compacted periodically. This allows Hive to support use cases beyond batch loads such as streaming data ingest and updating dimension tables.
The document provides an overview of machine learning concepts and techniques using Apache Spark. It discusses supervised and unsupervised learning methods like classification, regression, clustering and collaborative filtering. Specific algorithms like k-means clustering, decision trees and random forests are explained. It also introduces Apache Spark MLlib and how to build machine learning pipelines and models with Spark ML APIs.
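To illustrate the pipeline concept mentioned here, a minimal PySpark sketch that assembles numeric columns into a feature vector and fits a classifier; the toy data and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

# Toy training data: two numeric features and a binary label.
train = spark.createDataFrame(
    [(0.0, 1.1, 0), (2.0, 1.0, 1), (2.1, 1.3, 1), (0.1, 1.2, 0)],
    ["f1", "f2", "label"],
)

# Stage 1 builds the feature vector, stage 2 fits the model.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(train)

model.transform(train).select("f1", "f2", "prediction").show()
```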
This document discusses best practices for using PySpark. It covers:
- Core concepts of PySpark including RDDs and the execution model. Functions are serialized and sent to worker nodes using pickle.
- Recommended project structure with modules for data I/O, feature engineering, and modeling.
- Writing testable, serializable code with static methods and avoiding non-serializable objects like database connections (see the sketch after this list).
- Tips for testing like unit testing functions and integration testing the full workflow.
- Best practices for running jobs like configuring the Python environment, managing dependencies, and logging to debug issues.
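Here is a hedged sketch of the testable-code recommendation from the list above: keep transformations in pure functions of DataFrames so a unit test can exercise them with a local session; the function and column names are hypothetical.

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

def add_session_length(df: DataFrame) -> DataFrame:
    """Pure transformation: no hidden state, nothing non-serializable captured."""
    return df.withColumn("session_length", F.col("end_ts") - F.col("start_ts"))

if __name__ == "__main__":
    # Unit-test style check with a local session and tiny in-memory data.
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame([(100, 160)], ["start_ts", "end_ts"])
    assert add_session_length(df).first()["session_length"] == 60
```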
Hive Training -- Motivations and Real World Use Cases (nzhang)
Hive is an open source data warehouse system based on Hadoop, a MapReduce implementation.
This presentation introduces the motivations for developing Hive and how Hive is used in real-world situations, particularly at Facebook.
In this one day workshop, we will introduce Spark at a high level context. Spark is fundamentally different than writing MapReduce jobs so no prior Hadoop experience is needed. You will learn how to interact with Spark on the command line and conduct rapid in-memory data analyses. We will then work on writing Spark applications to perform large cluster-based analyses including SQL-like aggregations, machine learning applications, and graph algorithms. The course will be conducted in Python using PySpark.
Python and Bigdata - An Introduction to Spark (PySpark) (hiteshnd)
This document provides an introduction to Spark and PySpark for processing big data. It discusses what Spark is, how it differs from MapReduce by using in-memory caching for iterative queries. Spark operations on Resilient Distributed Datasets (RDDs) include transformations like map, filter, and actions that trigger computation. Spark can be used for streaming, machine learning using MLlib, and processing large datasets faster than MapReduce. The document provides examples of using PySpark on network logs and detecting good vs bad tweets in real-time.
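In the spirit of the network-log example, a tiny PySpark sketch showing lazy transformations and an action that triggers computation; the log path and line layout are hypothetical.

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "log-sketch")

# Transformations are lazy: nothing executes until an action runs.
lines = sc.textFile("access.log")                   # placeholder path
errors = lines.filter(lambda line: "ERROR" in line)
fields = errors.map(lambda line: line.split()[-1])  # hypothetical line layout

print(errors.count())  # action: triggers the whole computation
print(fields.take(5))  # another action: first five extracted fields
```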
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a... (Kevin Mao)
Strata Hadoop World 2017 San Jose
Today’s enterprise architectures are often composed of a myriad of heterogeneous devices. Bring-your-own-device policies, vendor diversification, and the transition to the cloud all contribute to a sprawling infrastructure, the complexity and scale of which can only be addressed by using modern distributed data processing systems.
Kevin Mao outlines the system that Capital One has built to collect, clean, and analyze the security-related events occurring within its digital infrastructure. Raw data from each component is collected and preprocessed using Apache NiFi flows. This raw data is then written into an Apache Kafka cluster, which serves as the primary communications backbone of the platform. The raw data is parsed, cleaned, and enriched in real time via Apache Metron and Apache Storm and ingested into ElasticSearch, allowing operations teams to detect and monitor events as they occur. The refined data is also transformed into the Apache ORC data format and stored in Amazon S3, allowing data scientists to perform long-term, batch-based analysis.
Kevin discusses the challenges involved with architecting and implementing this system, such as data quality, performance tuning, and the impact of additional financial regulations relating to data governance, and shares the results of these efforts and the value that the data platform brings to Capital One.
Architecting a Next Generation Data Platform (hadooparchbook)
This document discusses a presentation on architecting Hadoop application architectures for a next-generation data platform. It provides an overview of the presentation topics, including a case study on using Hadoop for an Internet of Things and entity 360 application. It introduces the key components of the proposed high-level architecture, including ingesting streaming and batch data using Kafka and Flume, stream processing with Kafka Streams, and storage in Hadoop.
RISELab: Enabling Intelligent Real-Time Decisions, keynote by Ion Stoica (Spark Summit)
A long-standing grand challenge in computing is to enable machines to act autonomously and intelligently: to rapidly and repeatedly take appropriate actions based on information in the world around them. To address this challenge, at UC Berkeley we are starting a new five year effort that focuses on the development of data-intensive systems that provide Real-Time Intelligence with Secure Execution (RISE). Following in the footsteps of AMPLab, RISELab is an interdisciplinary effort bringing together researchers across AI, robotics, security, and data systems. In this talk I’ll present our research vision and then discuss some of the applications that will be enabled by RISE technologies.
New Directions in pySpark for Time Series Analysis: Spark Summit East talk by... (Spark Summit)
The document provides an overview of using PySpark for time series analysis. It discusses that time series data can come from sources like IOT feeds, sensor data, and economic indicators. Time series analysis in PySpark allows for windowed aggregations and temporal joins on massive time series datasets that can be both wide and narrow. While basic analytics are possible in PySpark, libraries like Flint provide additional functions specialized for time series analysis on large datasets in a distributed environment. The document encourages attendees to speak with the author after the talk to see a time series analysis library in PySpark demonstrated.
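To make the windowed-aggregation point concrete, here is a hedged PySpark sketch computing a rolling average per series with built-in window functions (not the Flint library the talk refers to); the data and column names are hypothetical.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
ts = spark.createDataFrame(
    [("s1", 1, 10.0), ("s1", 2, 12.0), ("s1", 3, 11.0), ("s2", 1, 5.0)],
    ["series", "t", "value"],
)

# Rolling mean over the current row and the two preceding rows, per series.
w = Window.partitionBy("series").orderBy("t").rowsBetween(-2, 0)
ts.withColumn("rolling_avg", F.avg("value").over(w)).show()
```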
This document summarizes Hortonworks' Data Cloud, which allows users to launch and manage Hadoop clusters on cloud platforms like AWS for different workloads. It discusses the architecture, which uses services like Cloudbreak to deploy HDP clusters and stores data in scalable storage like S3 and metadata in databases. It also covers improving enterprise capabilities around storage, governance, reliability, and fault tolerance when running Hadoop on cloud infrastructure.
Using Apache Hadoop and related technologies as a data warehouse has been an area of interest since the early days of Hadoop. In recent years Hive has made great strides towards enabling data warehousing by expanding its SQL coverage, adding transactions, and enabling sub-second queries with LLAP. But data warehousing requires more than a full powered SQL engine. Security, governance, data movement, workload management, monitoring, and user tools are required as well. These functions are being addressed by other Apache projects such as Ranger, Atlas, Falcon, Ambari, and Zeppelin. This talk will examine how these projects can be assembled to build a data warehousing solution. It will also discuss features and performance work going on in Hive and the other projects that will enable more data warehousing use cases. These include use cases like data ingestion using merge, support for OLAP cubing queries via Hive’s integration with Druid, expanded SQL coverage, replication of data between data warehouses, advanced access control options, data discovery, and user tools to manage, monitor, and query the warehouse.
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics... (VMware Tanzu)
SpringOne Platform 2016
Speaker: Ian Fyfe; Director, Product Marketing, Hortonworks
Apache Hadoop is the most powerful and popular platform for ingesting, storing and processing enormous amounts of “big data”. However, due to its original roots as a batch processing system, doing interactive business analytics with Hadoop has historically suffered from slow response times, or forced business analysts to extract data summaries out of Hadoop into separate data marts. This talk will discuss the different options for implementing speed-of-thought business analytics and machine learning tools directly on top of Hadoop including Apache Hive on Tez, Apache Hive on LLAP, Apache HAWQ and Apache MADlib.
Hadoop & cloud storage: object store integration in production (Chris Nauroth)
Today's typical Apache Hadoop deployments use HDFS for persistent, fault-tolerant storage of big data files. However, recent emerging architectural patterns increasingly rely on cloud object storage such as S3, Azure Blob Store, GCS, which are designed for cost-efficiency, scalability and geographic distribution. Hadoop supports pluggable file system implementations to enable integration with these systems for use cases such as off-site backup or even complex multi-step ETL, but applications may encounter unique challenges related to eventual consistency, performance and differences in semantics compared to HDFS. This session explores those challenges and presents recent work to address them in a comprehensive effort spanning multiple Hadoop ecosystem components, including the Object Store FileSystem connector, Hive, Tez and ORC. Our goal is to improve correctness, performance, security and operations for users that choose to integrate Hadoop with Cloud Storage. We use S3 and S3A connector as case study.
Future of Data New Jersey - HDF 3.0 Deep Dive (Aldrin Piri)
This document provides an overview and agenda for an HDF 3.0 Deep Dive presentation. It discusses new features in HDF 3.0 like record-based processing using a record reader/writer and QueryRecord processor. It also covers the latest efforts in the Apache NiFi community like component versioning and introducing a registry to enable capabilities like CI/CD, flow migration, and auditing of flows. The presentation demonstrates record processing in NiFi and concludes by discussing the evolution of Apache NiFi and its ecosystem.
This document discusses Hadoop integration with cloud storage. It describes the Hadoop-compatible file system architecture, which allows Hadoop applications to work with both HDFS and cloud storage transparently. Recent enhancements to the S3A file system connector for Amazon S3 are discussed, including performance improvements and support for encryption. Benchmark results show significant performance gains for Hive queries with S3A compared to earlier versions. Upcoming work on output committers, object store abstraction, and consistency are outlined.
The document discusses Hadoop integration with cloud storage. It describes the Hadoop-compatible file system architecture, which allows applications to work with different storage systems transparently. Recent enhancements to the S3A connector for Amazon S3 are discussed, including performance improvements and support for encryption. Benchmark results show significant performance gains for Hive queries running on S3A compared to earlier versions. Upcoming work on consistency, output committers, and abstraction layers is outlined to further improve object store integration.
The document discusses key considerations for running Hadoop in the cloud. It notes that running Hadoop in the cloud provides unlimited elastic scale, ephemeral and long-running workloads, no upfront hardware costs, and IT and business agility. It outlines some of the major cloud Hadoop solutions according to a Forrester Wave report and discusses architectural considerations like shared data and storage, on-demand ephemeral workloads, elastic resource management, and shared metadata, security, and governance.
Hortonworks and Platfora in Financial Services - Webinar (Hortonworks)
Big Data Analytics is transforming how banks and financial institutions unlock insights, make more meaningful decisions, and manage risk. Join this webinar to see how you can gain a clear understanding of the customer journey by leveraging Platfora to interactively analyze the mass of raw data that is stored in your Hortonworks Data Platform. Our experts will highlight use cases, including customer analytics and security analytics.
Speakers: Mark Lochbihler, Partner Solutions Engineer at Hortonworks, and Bob Welshmer, Technical Director at Platfora
Hortonworks Data Platform 2.2 includes Apache HBase for fast NoSQL data access. In this 30-minute webinar, we discussed HBase innovations that are included in HDP 2.2, including: support for Apache Slider; Apache HBase high availability (HA); block cache compression; and wire-level encryption.
SQL on Hadoop: Batch, Interactive and Beyond.
Public Presentation showing history and where Hortonworks is looking to go with 100% Open Source Technology.
Apache Hive, Apache SparkSQL, Apache Phoenix, and Apache Druid
This document provides an overview of the past, present, and future of Apache Hadoop YARN. It discusses how YARN has evolved from Apache Hadoop 2.6/2.7 to now support 2.8 with features like dynamic resource configuration, container resizing, and Docker support. Upcoming work includes support for arbitrary resource types, federation of multiple YARN clusters, and a new ResourceManager UI. The future of YARN scheduling may include distributed scheduling, intra-queue preemption, and scheduling based on actual resource usage.
Hadoop Present - Open Enterprise Hadoop (Yifeng Jiang)
The document is a presentation on enterprise Hadoop given by Yifeng Jiang, a Solutions Engineer at Hortonworks. The presentation covers updates to Hadoop Core including HDFS and YARN, data access technologies like Hive, Spark and stream processing, security features in Hadoop, and Hadoop management with Apache Ambari.
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ... (DataWorks Summit)
This document discusses challenges and solutions for using object storage with Apache Spark and Hive. It covers:
- Eventual consistency issues in object storage and lack of atomic operations
- Improving performance of object storage connectors through caching, optimized metadata operations, and consistency guarantees
- Techniques like S3Guard and committers that address consistency and correctness problems with output commits in object storage (a configuration sketch follows this list)
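As a configuration sketch of the committer technique above, the following PySpark snippet routes output commits through an S3A committer rather than the rename-based default; it assumes the spark-hadoop-cloud module is on the classpath, and the bucket name is a placeholder.

```python
from pyspark.sql import SparkSession

# Use the S3A "directory" staging committer to avoid slow, non-atomic
# rename-based commits against object storage.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.committer.name", "directory")
    .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    .config("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
    .getOrCreate()
)

spark.range(1000).write.mode("overwrite").parquet("s3a://my-bucket/out/")
```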
This document summarizes Hortonworks' Hadoop distribution called Hortonworks Data Platform (HDP). It discusses how HDP provides a comprehensive data management platform built around Apache Hadoop and YARN. HDP includes tools for storage, processing, security, operations and accessing data through batch, interactive and real-time methods. The document also outlines new capabilities in HDP 2.2 like improved engines for SQL, Spark and streaming and expanded deployment options.
This document provides an overview of Hadoop and its ecosystem. It discusses the evolution of Hadoop from version 1 which focused on batch processing using MapReduce, to version 2 which introduced YARN for distributed resource management and supported additional data processing engines beyond MapReduce. It also describes key Hadoop services like HDFS for distributed storage and the benefits of a Hadoop data platform for unlocking the value of large datasets.
The document discusses how Hadoop can be used for interactive and real-time data analysis. It notes that the amount of digital data is growing exponentially and will reach 40 zettabytes by 2020. Traditional data systems are struggling to manage this new data. Hadoop provides a solution by tying together inexpensive servers to act as one large computer for processing big data using various Apache projects for data access, governance, security and operations. Examples show how Hadoop can be used to analyze real-time streaming data from sensors on trucks to monitor routes, vehicles and drivers.
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015 (Seetharam Venkatesh)
Apache Falcon is a data management platform that allows users to centrally manage data lifecycles across Hadoop clusters. It defines data entities like clusters, feeds, and processes to represent data pipelines. Falcon then automatically generates workflows to orchestrate the movement of data according to defined policies for replication, retention, and late data handling. It also provides data governance features like lineage tracing, auditing, and tagging. The latest version of Falcon includes new capabilities for disaster recovery mirroring and replication to cloud storage services.
Speaker: Alan Gates, Hortonworks Co-Founder
Title: 10 Years of Apache Hadoop and Beyond
Duration: 40 minutes
Abstract:
In 2006, Apache Hadoop had its first line of code committed to what has become a breakthrough technology. A decade later, we are witness to open source innovation that has literally changed the face of business. Hadoop and related technologies have become the enterprise data platform, fueled by a rich ecosystem capable of supporting any application, any data, anywhere. Join Hortonworks Co-Founder Alan Gates as he drills down into the current and future state of Hadoop and reviews community initiatives aimed at enabling the next wave of modern data applications that are well governed and easy to deploy on-premises and in the cloud.
Our Hadoop journey began in 2006 focused on executing batch MapReduce jobs on petabytes of data.
Yahoo’s decision to contribute Hadoop to the Apache Software Foundation was critical because a vibrant set of related technologies began to appear around Hadoop.
[NEXT]
Fast forward to 2011 and the concept of YARN began to emerge.
Its goal? Enable Hadoop to move from its batch-only roots and become a data platform capable of running batch, interactive, and real-time applications.
YARN further accelerated the innovation around Hadoop, with Spark, Kafka, Storm, and many other projects starting life as Apache Incubator proposals.
[NEXT]
I want to focus for a minute on one area of how Hadoop has developed. Apache Hive has participated in that move from batch to interactive, from ETL-only to enterprise-ready EDW.
Spark, meanwhile, has become the Swiss Army knife of big data: it can do streaming, SQL, ETL, and ML,
and is available from multiple languages (Python, Java, Scala).
So the enterprise has invested in integrating Hadoop into its data lake architecture.
Landing petabyte of data from streams, pipelines, data feeds into HDFS files, Hive and HBase tables, etc.
The question arises of how we can set up policies for these data sets that enable us to secure and govern access to them.
[NEXT]
The community has been hard at work on integrating Apache Atlas as a metadata catalog and Apache Ranger as the centralized security system to address this need.
The result is a tag-based authorization model driven by the metadata catalog (i.e. Atlas) with access and audit policies applied to those tags (via Ranger).
This enables a more flexible way to govern access to data and data sets than traditional role/group based access policies.
For example, as data pipelines land data, they can tag it on arrival, and the access policies set up for those tags apply immediately.
Moreover, Ranger has added the notions of time-based and location-based access policies, so users can do things like limit access to data that’s older than 90 days (for example) or limit access to data from certain geographies.
This provides important enterprise-focused capabilities that will help businesses deploy more modern data applications in a way where they have the confidence their data is secure and well-governed.
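As a rough illustration, here is a minimal sketch of creating such a tag-based policy programmatically. The Ranger host, credentials, service name, tag, and user are all hypothetical, and the endpoint path and JSON shape follow my understanding of Ranger's public v2 REST API; treat this as something to verify against your Ranger version's documentation, not a definitive recipe.

```python
# Hypothetical sketch: create a Ranger tag-based policy over REST.
# Host, credentials, service name, tag, and user are made up; the
# payload shape approximates Ranger's public v2 API and should be
# checked against your Ranger version.
import requests

RANGER_URL = "http://ranger.example.com:6080"   # hypothetical host
ADMIN_AUTH = ("admin", "admin-password")        # hypothetical credentials

policy = {
    "service": "cl1_tag",                        # a tag-based service (assumption)
    "name": "pii-select-analysts",
    "resources": {"tag": {"values": ["PII"], "isExcludes": False}},
    "policyItems": [{
        "users": ["analyst1"],
        # access type naming per tag-service conventions; verify
        "accesses": [{"type": "hive:select", "isAllowed": True}],
    }],
}

resp = requests.post(
    f"{RANGER_URL}/service/public/v2/api/policy",
    json=policy,
    auth=ADMIN_AUTH,
)
resp.raise_for_status()
print("Created policy id:", resp.json().get("id"))
```

With a policy like this in place, any data that a pipeline tags as PII is immediately covered, no matter which table or path it lands in.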
[NEXT]
TALK TRACK
People are no longer willing to wait until data is in the store before processing it
Hortonworks DataFlow is a platform for data in motion.
It is powered by Apache NiFi, Kafka, and Storm for dataflow management and stream processing.
MiNiFi/NiFi: create dynamic, configurable data pipelines.
Kafka supports adaptation to differing rates of data creation and delivery.
Storm provides real-time stream processing to create immediate insights at a massive scale.
There are scenarios where NiFi will provide all that you need – especially in situations that only require dataflow management – but you will notice the orange and blue horizontal triangles provide a continuum of capability from edge to core, indicating varying degrees of need for the different products.
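As a small illustration of the Kafka leg of such a pipeline, the sketch below publishes one sensor event using the kafka-python client. The broker address, topic name, and event fields are hypothetical; in an HDF deployment, NiFi would typically be the publisher and Storm the downstream consumer.

```python
# Minimal sketch: publish a sensor event to Kafka with kafka-python.
# Broker, topic, and payload fields are hypothetical.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker.example.com:9092",  # hypothetical broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Emit a reading; Kafka buffers it, absorbing bursts so the stream
# processor can consume at its own rate.
producer.send("truck-sensor-events", {"truck_id": 42, "speed_mph": 61})
producer.flush()
```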
So after 10 years, the Hadoop ecosystem is available everywhere.
In the Data Center, within appliances, across public and private clouds.
This maximizes choice for people interested in getting started with Hadoop and deploying it at scale for transformational use cases.
[NEXT]
While there are a range of great choices in the market today, there’s more that we, as a community, can and should do to make Hadoop in the cloud better and first class.
I’ll spend the remainder of this talk on the key architectural considerations
Shared Data & Storage – the shared-data-lake is on cloud storage, it is not HDFS.
Also memory and local storage play a different role – that of caching
An important distinction in the cloud is On-Demand Ephemeral Workloads – this changes a number of things in fundamental ways.
Shared Metadata, Security, and Governance remain important but need to be adjusted in the face of ephemeral clusters.
And finally, I’ll touch on Elastic Resource Management
We need to shift our thinking away from cluster resource management and more towards SLA-driven workloads
[NEXT]
In the cloud, the shared DataLake is on cloud storage. It is not HDFS of a specific Hadoop cluster.
Note this is very different from a traditional on-premise cluster where each cluster has an internal shared store representing its internal DataLake.
Moreover, it’s desirable to have this shared data be accessible by all apps, not just Hadoop apps – Cloud Native and 3rd party
Good news: geo-distribution is built into cloud storage, and DR becomes simpler.
Cloud storage has two limitations:
eventual consistency, and an API that does not match the filesystem API expected by Hadoop and normal apps.
Addressing these two issues is a key area of ongoing investment. I encourage you to attend today’s breakout session by my fellow Hortonworkers that’s focused on this topic.
Cloud storage is designed for low cost and scale – unfortunately, performance is not its strong point because storage is segregated from compute.
Memory and local storage play a different role in the cloud – a cache to enhance performance.
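As a concrete, if minimal, illustration of reading shared data directly from cloud storage, here is a PySpark sketch. It assumes a Spark build with the hadoop-aws (s3a) connector on the classpath; the bucket, path, and credential values are hypothetical, while the fs.s3a.* properties are standard Hadoop s3a settings.

```python
# Minimal sketch: read shared tabular data from an s3a-backed data
# lake rather than cluster-local HDFS. Bucket/credentials are fake.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cloud-data-lake-read")
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
    .getOrCreate()
)

# Any cluster (or non-Hadoop app with an s3 client) can read the same
# data, which is the point of putting the shared lake in cloud storage.
df = spark.read.parquet("s3a://shared-data-lake/sales/2016/")
df.groupBy("region").count().show()
```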
[NEXT]
With respect to caching, we need to consider both tabular data and non-tabular data.
For tabular data, LLAP comes to the rescue – it provides a tabular cache that's shared across jobs, apps, and engines such as Hive and Spark.
LLAP only caches the needed columns, so it's very efficient in its use of memory.
Further, data is stored in an internal serialized form to optimize compute.
The design center is anti-caching – keep it all in memory and spill to disk/SSD when memory is full.
LLAP currently provides read caching, but is being extended to support a write-through cache.
And LLAP addresses a key security gap for the Hadoop ecosystem: it provides a convenient place to enforce column-level and row-level access control
that works across all kinds of engines: Hive, Spark, Flink, or even old-fashioned MapReduce. Note this was not previously possible…
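As a sketch of what steering a session onto LLAP can look like from a client, the snippet below connects to a hypothetical HiveServer2 endpoint with PyHive and sets the Hive 2.x properties that route execution to the LLAP daemons. Exact property names and defaults vary by version and distribution, so verify against your cluster; the host and table are made up.

```python
# Sketch: route a Hive session's work to LLAP daemons via PyHive.
# Host, user, and table are hypothetical; the SET properties are the
# Hive 2.x knobs for LLAP execution, but check your distro's defaults.
from pyhive import hive

conn = hive.connect(host="hiveserver2.example.com", port=10000,
                    username="analyst1")
cur = conn.cursor()

# Send query fragments to the long-lived LLAP daemons so repeated
# scans are served from the shared in-memory columnar cache.
cur.execute("SET hive.execution.mode=llap")
cur.execute("SET hive.llap.execution.mode=all")

cur.execute("SELECT region, COUNT(*) FROM sales GROUP BY region")
print(cur.fetchall())
```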
From a non-tabular data perspective, HDFS can be used to cache cloud data – both a read cache and a write-through cache.
This essentially evolves HDFS to play a different role: a place to store intermediate data,
and also a finely-tuned caching layer between the applications and the cloud storage.
[NEXT]
Always-on multitenant clusters are important for a range of mission critical use cases.
However, bringing forth an ephemeral cluster to support a specific workload is game changing.
The agile nature of the cloud allows us to create prescriptive workload environments.
For someone interested in modeling and analyzing data sets,
- they simply want to interact with a PRE-TUNED environment optimized for the application.
- The complexities of configuring Spark, Hive and Hadoop need to be hidden under the hood.
Whether it’s data science, data warehouse, ETL, or other common workload types,
- provide pre-configured and pre-tuned compute environments
- Further, we need to be able to manage them in an ephemeral fashion.
The NET: deliver user experiences that are focused on business agility,
- rather than infinite configurability and cluster management.
[NEXT]
So far, I have covered shared data and storage, and how to optimize performance by caching.
Shared data fundamentally requires a shared approach to metadata, security and governance.
The metadata is not just the classic Hive metadata that describes the tabular data;
- it is also about storing and tracking the lineage and provenance of data,
- and about details related to data pipeline processing and job management.
Tabular data needs to be available to all applications so that SQL is an option regardless of where your data is
Also, as data is ingested and processed, metadata needs to be created and adjusted
Governing and securing the data remain critical, and its metadata needs to be managed across all workloads.
- The work done by projects such as Ranger and Atlas needs to be evolved to fit the cloud environment.
If we don't do this, then the cloud will not be adopted aggressively for enterprise use.
Getting back to the shared metadata – each ephemeral cluster cannot have its own private copy of the metadata.
In the cloud world, metadata must be centrally stored so it is used across all ephemeral clusters.
[NEXT]
Final area: resource management.
We need to up-level resource management.
- So far, YARN has focused on optimizing resources in the context of a cluster.
- The cloud is not about the cluster; it is about the workloads. And further, resources are elastic.
The scheduler needs to change its focus to managing resources in the context of a workload and meeting the workload's SLA.
- It may need to get extra resources from the cloud – getting the right resources to match the needs of the workload.
Sometimes adding compute power is not sufficient to meet an SLA, because latency/bandwidth to data may be the bottleneck
– e.g., spin up LLAP's memory in order to improve caching and hence meet the SLA.
Cloud offers another dimension – that of cost and budgets.
There are different costs tied to CPU, memory, and data-access bandwidth, so elasticity and spot-pricing tradeoffs should be factored in.
Resource management in this new dimension is important if you want to reap the benefits of low-cost cloud computing.
To summarize: the better one understands the nature of a workload, the better one can take advantage of elasticity and spot pricing.
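A back-of-the-envelope sketch makes the tradeoff concrete. All prices here are hypothetical, and the interruption overhead is a crude assumption; the point is only that knowing a workload's shape lets you put a price on elasticity.

```python
# Back-of-the-envelope spot-vs-on-demand comparison for one workload.
# All prices and the interruption-retry factor are hypothetical.
ON_DEMAND_PER_NODE_HR = 0.50   # hypothetical $/node-hour
SPOT_PER_NODE_HR = 0.15        # hypothetical $/node-hour
SPOT_INTERRUPT_OVERHEAD = 1.3  # assume ~30% rerun cost from interruptions

def workload_cost(nodes: int, hours: float, use_spot: bool) -> float:
    """Estimated cost of one ephemeral-cluster run of a workload."""
    if use_spot:
        return nodes * hours * SPOT_PER_NODE_HR * SPOT_INTERRUPT_OVERHEAD
    return nodes * hours * ON_DEMAND_PER_NODE_HR

# A nightly ETL job: 20 nodes for 3 hours. Spot wins easily here
# because the job is restartable; a tight-SLA interactive workload
# might not tolerate the interruption overhead.
print(f"on-demand: ${workload_cost(20, 3, use_spot=False):.2f}")
print(f"spot:      ${workload_cost(20, 3, use_spot=True):.2f}")
```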
CONCLUDE: While one could lift and shift Hadoop onto the cloud, I hope I have convinced you that we really need to evolve Hadoop to run first class in the cloud
and also to take advantage of unique cloud features such as elasticity.
We at Hortonworks have been working on this over the past few months, and Ram will show you a quick demo of the tech preview we are releasing this week.
[NEXT]
Today we have talked about evolving Hadoop to run well in the cloud.
At Hortonworks, we are focused on enabling a connected data architecture that seamlessly spans the cloud and data center.
This is illustrated on the screen.
It stresses two important points: the connectedness of the cloud and the on-premise infrastructure and data,
and the connectedness of data in motion and data at rest.
The Era of the Internet-of-Things demands that we manage the entire lifecycle of all data
- (data in motion and data at rest)
It’s about being able to collect and curate data across traditional silos so the various groups and lines of business can have a place where they can assemble a single view of data in order to drive deep historical insights.
It’s also about proactively managing data from its point of inception and securely acquiring and delivering it. Moreover, it’s not just about point-to-point delivery, but it’s also about enabling bi-directional data flows that can leverage both real-time and historical insights to help shape and prioritize the flow of data.
So in this diagram, for example, the upper-left edge could represent the connected car, whereas the lower-left edge can represent data from the manufacturing line. Having a connected data architecture that enables you to deal with all of this data unlocks the ability to figure out what manufacturing line issues may be causing operational issues in cars in the field, for example.
In this world of next-generation applications, I am excited about evolving the Hadoop ecosystem to enable these types of use cases and usage models.
[NEXT SLIDE]