HAWQ is an enterprise platform that provides the fewest barriers, lowest risk, and fastest way to perform big data analytics on Hadoop. It combines SQL with Hadoop by providing ANSI SQL capabilities on Hadoop for high performance analytics. HAWQ stores all data directly on HDFS and runs on various Hadoop distributions like Pivotal HD, HDP and IBM BigInsights.
SQL and Machine Learning on Hadoop using HAWQ (pivotalny)
It has become almost rhetorical to say:
“Many enterprises have adopted HDFS as the foundational layer for their data lakes. HDFS provides the flexibility to store any kind of data and, more importantly, it is infinitely scalable on commodity hardware.”
But the open question to date has been a low-latency query engine for HDFS.
At Pivotal, we cracked that problem, and the answer is HAWQ, which we intend to open source this year. During this event, we will present and demo HAWQ’s architecture, its powerful ANSI SQL features, and its ability to transcend traditional BI in the form of in-database analytics (or machine learning).
HAWQ: a massively parallel processing SQL engine in Hadoop (BigData Research)
HAWQ, developed at Pivotal, is a massively parallel processing SQL engine sitting on top of HDFS. As a hybrid of an MPP database and Hadoop, it inherits the merits of both. It adopts a layered architecture and relies on the distributed file system for data replication and fault tolerance. In addition, it is standard SQL compliant, and unlike other SQL engines on Hadoop, it is fully transactional. This paper presents the novel design of HAWQ, including query processing, the scalable software interconnect based on the UDP protocol, transaction management, fault tolerance, read-optimized storage, the extensible framework for supporting various popular Hadoop-based data stores and formats, and various optimization choices we considered to enhance query performance. The extensive performance study shows that HAWQ is about 40x faster than Stinger, which is reported to be 35x-45x faster than the original Hive.
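Since HAWQ is standard SQL compliant and descends from the PostgreSQL/Greenplum lineage, it can typically be queried with an ordinary PostgreSQL client. A minimal sketch of that access path, assuming a reachable HAWQ master and a hypothetical web_logs table (host, database, user, and table names are illustrative):

```python
# Minimal sketch: querying HAWQ with a standard PostgreSQL client.
# Host, database, user, and table names are illustrative assumptions.
import psycopg2

conn = psycopg2.connect(host="hawq-master.example.com", port=5432,
                        dbname="analytics", user="gpadmin")
try:
    with conn.cursor() as cur:
        # An ordinary ANSI SQL aggregation; HAWQ plans it as an MPP query
        # and executes it in parallel across segments over HDFS-resident data.
        cur.execute("""
            SELECT status_code, count(*) AS hits
            FROM web_logs
            WHERE request_date >= %s
            GROUP BY status_code
            ORDER BY hits DESC
        """, ("2015-01-01",))
        for status_code, hits in cur.fetchall():
            print(status_code, hits)
finally:
    conn.close()
```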
HBase provides many features for multi-tenancy and isolation. However, operating these features requires integration into the broader operations of a cluster. This talk will cover some methods we use at Bloomberg for multi-tenancy and discuss some HBase-Oozie integration. Of particular interest is our work on an Oozie action for secure snapshot export -- this extends the HBase security model via Oozie, allowing self-service (non-hbase user) snapshot export on secure clusters. (A minimal export sketch follows the key topics below.)
Key topics:
* Bloomberg's Oozie HBase export snapshot action
* Oozie-coordinated, time-based major compactions
* How we use LDAP with HBase (and why to take care with HADOOP-12291)
* Some of our multi-tenancy setups around monitoring for SLAs
* Suggesting HBase stays the course of being "just" a datastore -- and all projects following the Unix philosophy (this has made things like our Oozie integration much easier!)
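For orientation, the sketch below shows the core operation such an Oozie action would wrap: exporting a named HBase snapshot to another HDFS location with the stock ExportSnapshot tool. The snapshot name and destination URI are illustrative assumptions; the Oozie and security integration described in the talk is not shown.

```python
# Minimal sketch: export an HBase snapshot to a remote HDFS cluster using
# the ExportSnapshot tool shipped with HBase. Snapshot name, destination
# URI, and mapper count are illustrative assumptions.
import subprocess

def export_snapshot(snapshot: str, dest_hdfs: str, mappers: int = 16) -> None:
    cmd = [
        "hbase", "org.apache.hadoop.hbase.snapshot.ExportSnapshot",
        "-snapshot", snapshot,
        "-copy-to", dest_hdfs,
        "-mappers", str(mappers),
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    export_snapshot("orders_snapshot_20160101",
                    "hdfs://backup-cluster.example.com:8020/hbase")
```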
This is the presentation I gave at the Hadoop User Group Ireland meetup in Dublin. It covers the main ideas of MPP, Hadoop, and distributed systems in general, and also how to choose the best option for your use case.
Sharing metadata across the data lake and streams (DataWorks Summit)
Traditionally systems have stored and managed their own metadata, just as they traditionally stored and managed their own data. A revolutionary feature of big data tools such as Apache Hadoop and Apache Kafka is the ability to store all data together, where users can bring the tools of their choice to process it.
Apache Hive's metastore can be used to share the metadata in the same way. It is already used by many SQL and SQL-like systems beyond Hive (e.g. Apache Spark, Presto, Apache Impala, and via HCatalog, Apache Pig). As data processing changes from only data in the cluster to include data in streams, the metastore needs to expand and grow to meet these use cases as well. There is work going on in the Hive community to separate out the metastore, so it can continue to serve Hive but also be used by a more diverse set of tools. This talk will discuss that work, with particular focus on adding support for storing schemas for Kafka messages.
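One concrete way the metastore can already describe streaming data is Hive's Kafka storage handler, which registers a Kafka topic as a table so its schema becomes visible through the metastore. A hedged sketch, assuming the handler is available and using illustrative broker, topic, and column names:

```python
# Sketch: registering a Kafka topic in the Hive metastore via the Kafka
# storage handler, so its schema is visible to any metastore client.
# Broker address, topic, and column list are illustrative assumptions.
from pyhive import hive

conn = hive.connect(host="hiveserver2.example.com", port=10000)
cur = conn.cursor()
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS clickstream_kafka (
        user_id STRING,
        page    STRING,
        ts      TIMESTAMP
    )
    STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
    TBLPROPERTIES (
        'kafka.topic' = 'clickstream',
        'kafka.bootstrap.servers' = 'kafka01.example.com:9092'
    )
""")
cur.close()
conn.close()
```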
Speaker
Alan Gates, Co-Founder, Hortonworks
Insights into Real-world Data Management Challenges (DataWorks Summit)
Oracle began with the belief that the foundation of IT was managing information. The Oracle Cloud Platform for Big Data is a natural extension of our belief in the power of data. Oracle’s Integrated Cloud is one cloud for the entire business, meeting everyone’s needs. It’s about connecting people to information through tools that help you combine and aggregate data from any source.
This session will explore how organizations can transition to the cloud by delivering fully managed and elastic Hadoop and real-time streaming cloud services to build robust offerings that provide measurable value to the business. We will explore key data management trends and dive deeper into pain points we are hearing about from our customer base.
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud (DataWorks Summit)
The world’s largest enterprises run their infrastructure on Oracle, DB2 and SQL and their critical business operations on SAP applications. Organisations need this data to be available in real-time to conduct necessary analytics. However, delivering this heterogeneous data at the speed it’s required can be a huge challenge because of the complex underlying data models and structures and legacy manual processes which are prone to errors and delays.
Unlock these silos of data and enable the new advanced analytics platforms by attending this session.
Find out:
• How to overcome the common challenges faced by enterprises trying to access their SAP data
• How you can integrate SAP data in real time with change data capture (CDC) technology
• How organisations are using Attunity Replicate for SAP to stream SAP data into Kafka
Speakers:
John Hol, Regional Director, Attunity
Mike Hollobon, Director Business Development, IBT
Securing data in hybrid environments using Apache Ranger (DataWorks Summit)
Companies are increasingly moving to the cloud to store and process data. In this talk, we will walk through how companies can use tag-based policies in Apache Ranger to protect access to data both in on-premises environments and in AWS-based cloud environments. We will go into the details of how tag-based policies work and the integration with Apache Atlas and various services. We will also talk through how companies can leverage Ranger’s policies to anonymize or tokenize data while moving into the cloud and de-anonymize it dynamically using Apache Kafka, Apache Hive, Apache Spark, or plain old ETL using MapReduce. We will also deep dive into Ranger’s proposed integration with S3 and other cloud-native systems. We will wrap up with an end-to-end demo showing how tags and tag-based masking policies can be used to anonymize sensitive data, how tags are propagated within the system, and how sensitive data can be protected using tag-based policies.
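For illustration only, the sketch below posts a tag-based masking policy to Ranger's public REST API. The endpoint is Ranger's standard policy API, but the exact payload field names vary by version; treat the values below as an approximation, not a definitive schema.

```python
# Illustrative sketch only: creating a tag-based masking policy through
# Ranger's public REST API. The payload field names are an approximation
# and should be checked against the Ranger version in use.
import requests

RANGER = "http://ranger.example.com:6080"   # assumed admin host/port
AUTH = ("admin", "admin-password")          # placeholder credentials

policy = {
    "service": "cm_tag",                    # assumed tag service name
    "name": "mask-PII-columns",
    "policyType": 1,                        # 1 = data masking policy
    "resources": {"tag": {"values": ["PII"], "isExclude": False}},
    "dataMaskPolicyItems": [{
        "accesses": [{"type": "hive:select", "isAllowed": True}],
        "groups": ["analysts"],
        "dataMaskInfo": {"dataMaskType": "MASK_HASH"},
    }],
}

resp = requests.post(f"{RANGER}/service/public/v2/api/policy",
                     json=policy, auth=AUTH)
resp.raise_for_status()
print("created policy id:", resp.json().get("id"))
```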
Speakers
Don Bosco Durai, Chief Security Architect, Privacera
Madhan Neethiraj, Sr. Director of Engineering, Hortonworks
Apache Hive is a rapidly evolving project which continues to enjoy great adoption in the big data ecosystem. Although Hive started primarily as a batch ingestion and reporting tool, the community is hard at work improving it along many different dimensions and use cases. This talk will provide an overview of the latest features and optimizations which have landed in the project over the last year. Materialized views, micro-managed tables, and workload management are some noteworthy features.
I will deep dive into some optimizations which promise to provide major performance gains. Support for ACID tables has also improved considerably. Although some of these features and enhancements are not novel and have existed for years in other DB systems, implementing them in Hive poses some unique challenges and yields lessons which are generally applicable in many other contexts. I will also provide a glimpse of what is expected to come in the near future.
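To make two of the features above concrete, the hedged sketch below creates a transactional (ACID) table and a materialized view over it; the connection details and table names are illustrative assumptions.

```python
# Sketch: an ACID (transactional) table and a materialized view in Hive 3.
# Host, schema, and table names are illustrative assumptions.
from pyhive import hive

conn = hive.connect(host="hiveserver2.example.com", port=10000)
cur = conn.cursor()

# Full ACID tables must be stored as ORC and marked transactional.
cur.execute("""
    CREATE TABLE IF NOT EXISTS sales (
        id BIGINT, region STRING, amount DOUBLE
    )
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true')
""")

# A materialized view the optimizer can use to rewrite matching queries.
cur.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS sales_by_region AS
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
""")

cur.close()
conn.close()
```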
Speaker: Ashutosh Chauhan, Engineering Manager, Hortonworks
Data processing at the speed of 100 Gbps @ Apache Crail (Incubating) (DataWorks Summit)
Once the staple of HPC clusters, high-performance network and storage devices are now everywhere. For a fraction of the cost, one can rent 40/100 Gbps RDMA networks and high-end NVMe flash devices supporting tens of GB/s of bandwidth, latencies under 100 microseconds, and millions of IOPS. How does one leverage this phenomenal performance for the popular data processing frameworks we all know and love, such as Apache Spark, Flink, and Hadoop?
In this talk, I will introduce Apache Crail (Incubating), a fast, distributed data store designed specifically for high-performance network and storage devices. The goal of the project is to deliver the true hardware performance to Apache data processing frameworks in the most accessible way. With its modular design, Crail supports multiple storage back ends (DRAM, NVMe flash, and 3D XPoint) and networking protocols (RDMA and TCP/sockets). Crail provides multiple flexible APIs (file system, KV, HDFS, streaming) for better integration with the high-level data access operations in Apache compute frameworks. As a result, on a 100 Gbps network infrastructure, Crail delivers all-to-all shuffle operations at 80+ Gbps, broadcast operations at less than 10 usec latency, and more than 8M lookups per namenode. Moreover, Crail is a generic solution that integrates well with the Apache ecosystem, including frameworks like Spark, Hadoop, and Hive.
I will present the case for Crail, its current status, and future plans. As Crail is a young Apache project, we are seeking to build a community and expand its application to other interesting domains.
Speaker
Animesh Trivedi, IBM Research, Research Staff Member (RSM)
Hadoop users leverage tools such as MapReduce, Hive, HBase, etc. for various data processing requirements. These tools do not share a common notion of storage formats, schemas, data models, and data types. Apache HAWQ (Incubating), along with its extension framework (PXF), provides a high-performance, massively parallel SQL processing framework over unmanaged data stores and formats in the Hadoop ecosystem. HCatalog provides glue for the entire Hadoop ecosystem by offering a relational abstraction for HDFS data. This talk introduces the integration of HCatalog metadata into HAWQ's in-memory catalog, which provides a simple and seamless access paradigm to data managed by Hive.
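For context, the hedged sketch below contrasts the two access paths: an explicit PXF external table over a Hive table versus referencing the same table directly through the integrated HCatalog namespace. Hosts, the PXF port, and table names are illustrative assumptions.

```python
# Sketch: two ways HAWQ can reach Hive-managed data. Connection details,
# the PXF host/port, and the retail.orders table are illustrative assumptions.
import psycopg2

conn = psycopg2.connect(host="hawq-master.example.com", dbname="analytics",
                        user="gpadmin")
cur = conn.cursor()

# 1) Explicit PXF external table over a Hive table.
cur.execute("""
    CREATE EXTERNAL TABLE ext_orders (id BIGINT, total FLOAT8)
    LOCATION ('pxf://pxf-host.example.com:51200/retail.orders?PROFILE=Hive')
    FORMAT 'CUSTOM' (FORMATTER = 'pxfwritable_import')
""")

# 2) With the HCatalog integration, the same Hive table can be referenced
#    through the hcatalog namespace, with no DDL required.
cur.execute("SELECT count(*) FROM hcatalog.retail.orders")
print(cur.fetchone())

conn.commit()
cur.close()
conn.close()
```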
Druid and Hive Together: Use Cases and Best Practices (DataWorks Summit)
Two popular open source technologies, Druid and Apache Hive, are often mentioned as viable solutions for large-scale analytics. Hive works well for storing large volumes of data, although it is not optimized for ingesting streaming data and making it available for queries in real time. Druid, on the other hand, excels at low-latency, interactive queries over streaming data and at making data available in real time. Although the high-level messaging from both projects may lead you to believe they are competing for the same use case, the technologies are in fact extremely complementary.
By combining the rich query capabilities of Hive with the powerful real-time streaming and indexing capabilities of Druid, we can build more powerful, flexible, and extremely low-latency real-time streaming analytics solutions. In this talk we will discuss the motivation for combining Hive and Druid, along with the benefits, use cases, best practices, and benchmark numbers (a brief sketch of the integration follows the agenda below).
The agenda of the talk will be:
1. Motivation behind integrating Druid with Hive
2. Druid and Hive together - benefits
3. Use Cases with Demos and architecture discussion
4. Best Practices - Do's and Don'ts
5. Performance vs Cost Tradeoffs
6. SSB Benchmark Numbers
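As a taste of the integration (agenda item 1), the hedged sketch below maps an existing Druid datasource into Hive via the Druid storage handler so it can be queried with Hive SQL; the datasource name and connection details are illustrative assumptions.

```python
# Sketch: mapping an existing Druid datasource into Hive via the Druid
# storage handler, so Hive SQL can query data Druid ingests in real time.
# The HiveServer2 host and the 'events' datasource are illustrative assumptions.
from pyhive import hive

conn = hive.connect(host="hiveserver2.example.com", port=10000)
cur = conn.cursor()
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events_druid
    STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
    TBLPROPERTIES ('druid.datasource' = 'events')
""")
# Low-latency slices come from Druid; Hive supplies the rich SQL surface.
cur.execute("SELECT `__time`, count(*) FROM events_druid GROUP BY `__time` LIMIT 10")
print(cur.fetchall())
cur.close()
conn.close()
```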
Hortonworks Technical Workshop - Operational Best Practices Workshop (Hortonworks)
Hortonworks Data Platform is a key component of the Modern Data Architecture. Organizations rely on HDP for mission-critical business functions and expect the system to be constantly available and performant. In this session we will cover operational best practices for administering the Hortonworks Data Platform, including the initial setup and ongoing maintenance.
This presentation will describe the analytics-to-cloud migration initiative underway at Fannie Mae. The goal of this effort is threefold: (1) build a sustainable process for data lake hydration in the cloud, (2) modernize the Fannie Mae enterprise data warehouse infrastructure, and (3) retire Netezza.
Fannie Mae partnered with Impetus for modernization of its Netezza legacy analytics platform. This involved the use of the Impetus Workload Migration solution—a sophisticated translation engine that automated the migration of their complex Netezza stored procedures, shell and scheduler scripts to Apache Spark compatible scripts. This delivered substantial savings in time, effort and cost, while reducing overall project risk.
Included in the scope of the automation project was an automated assessment capability to perform detailed profiling of the current workloads. The output from the assessment stage was a data-driven offloading blueprint and roadmap for which workloads to migrate, and a hybrid cloud-based big data solution was designed based on it. In addition to fulfilling the essential requirement of historical (and incremental) data migration and automated logic translation, the solution also recommends optimal storage formats for the data in the cloud, performs SCD Type 1 and Type 2 handling for mission-critical parameters, and reloads the transformed data for reporting/analytical consumption.
This will include the following topics:
i. Fannie Mae analytics overview
ii. Why cloud migration for analytics?
iii. Approach, major challenges, lessons learned
Speaker
Kevin Bates, Vice President for Enterprise Data Strategy Execution, Fannie Mae
HDFS is well designed to operate efficiently at scale for normal hardware failures within a datacenter, but it is not designed to handle significant negative events such as datacenter failures. To overcome this limitation, a common practice for HDFS disaster recovery (DR) is replicating data from one location to another through DistCp, which provides a robust and reliable backup capability for HDFS data through batch operations. However, DistCp also has several drawbacks: (1) taking HDFS snapshots is time- and space-consuming on large HDFS clusters; (2) applying file changes through MapReduce may introduce additional execution overhead and potential issues; (3) DistCp requires administrator intervention to trigger, perform, and verify DistCp jobs, which is not user-friendly in practice.
In this presentation, we will share our experience with HDFS DR and introduce our lightweight HDFS disaster recovery system, which addresses the aforementioned problems. Unlike DistCp, our lightweight DR system is designed around HDFS logs (e.g. the edit log and Inotify), a lightweight producer/consumer framework, and the FileSystem API. During synchronization, it fetches limited subsets of the namespace and incremental file changes from the NameNode, and our executors then apply these changes incrementally to remote clusters through the FileSystem API. Furthermore, it provides a powerful user interface with trigger conditions, path filters, a job scheduler, and more. Compared to DistCp, it is more straightforward, lightweight, reliable, efficient, and user-friendly.
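The system described above is not shown here, but as a rough illustration of the consumer side of such a design, the sketch below performs one incremental sync pass between two HDFS clusters through a generic FileSystem API (pyarrow is used purely for illustration; cluster addresses, the path, and the checkpoint are assumptions):

```python
# Rough sketch of one sync pass: copy files modified after the last
# checkpoint from a source HDFS cluster to a remote DR cluster via a
# generic FileSystem API. Cluster addresses, the path, and the checkpoint
# value are illustrative assumptions.
import posixpath
import pyarrow.fs as pafs

SRC = pafs.HadoopFileSystem(host="nn-primary.example.com", port=8020)
DST = pafs.HadoopFileSystem(host="nn-dr.example.com", port=8020)
CHECKPOINT = 1_600_000_000.0  # mtime (epoch seconds) of the last successful sync

def sync_incremental(path: str) -> None:
    selector = pafs.FileSelector(path, recursive=True)
    for info in SRC.get_file_info(selector):
        if info.type != pafs.FileType.File:
            continue
        if info.mtime is None or info.mtime.timestamp() <= CHECKPOINT:
            continue  # unchanged since the last pass
        DST.create_dir(posixpath.dirname(info.path), recursive=True)
        with SRC.open_input_stream(info.path) as src, \
             DST.open_output_stream(info.path) as dst:
            dst.write(src.read())

sync_incremental("/warehouse/events")
```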
Speaker
Qiyuan Gong, Big Data Software Engineer, Intel
Data Science at Scale on MPP databases - Use Cases & Open Source Tools (Esther Vasiete)
Pivotal workshop slide deck for Structure Data 2016 held in San Francisco.
Abstract:
Learn how data scientists at Pivotal build machine learning models at massive scale on open source MPP databases like Greenplum and HAWQ (under Apache incubation) using in-database machine learning libraries like MADlib (under Apache incubation) and procedural languages like PL/Python and PL/R to take full advantage of the rich set of libraries in the open source community. This workshop will walk you through use cases in text analytics and image processing on MPP.
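As a flavour of the in-database approach, the hedged sketch below trains a logistic regression model with MADlib without moving data out of the database; the table and column names follow the MADlib documentation example and are illustrative here.

```python
# Sketch: in-database model training with MADlib on Greenplum/HAWQ.
# Table and column names are illustrative assumptions.
import psycopg2

conn = psycopg2.connect(host="gp-master.example.com", dbname="analytics",
                        user="gpadmin")
cur = conn.cursor()

# Train a logistic regression entirely inside the MPP database: the data
# never leaves the segments, and training runs in parallel.
cur.execute("""
    SELECT madlib.logregr_train(
        'patients',                           -- source table
        'patients_logregr',                   -- output (model) table
        'second_attack',                      -- dependent variable
        'ARRAY[1, treatment, trait_anxiety]'  -- independent variables
    )
""")
cur.execute("SELECT coef FROM patients_logregr")
print(cur.fetchone())

conn.commit()
cur.close()
conn.close()
```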
How is it that one system can query terabytes of data, yet still provide interactive query support? This talk will discuss two of the underlying technologies that allow Apache Hive to support fast query response, both on-premise in HDFS and in cloud object stores such as S3 and WASB.
LLAP was introduced in Hive 2.0. It provides standing processes that securely cache Hive’s columnar data and can do query processing without ever needing to start tasks in Hadoop. We will cover LLAP’s architecture, intended use cases, and performance numbers both on-premises and in the cloud.
The second technology is the integration of Hive with Apache Druid. Druid excels at low-latency, interactive queries over streaming data. Its method of storing data makes it very well suited for OLAP style queries. We will cover how Hive can be integrated with Druid to support real-time streaming of data from Kafka and OLAP queries.
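For reference, the hedged sketch below points a Hive session at LLAP once the daemons are running; the configuration keys are the LLAP execution-mode settings, while the host and query are illustrative assumptions.

```python
# Sketch: directing a Hive session's work to LLAP daemons.
# Host and query are illustrative assumptions; LLAP daemons must be running.
from pyhive import hive

conn = hive.connect(host="hiveserver2.example.com", port=10000)
cur = conn.cursor()

# Ask the planner to run operators inside LLAP daemons where possible.
cur.execute("SET hive.execution.mode=llap")
cur.execute("SET hive.llap.execution.mode=all")

cur.execute("SELECT page, count(*) FROM clicks GROUP BY page")
print(cur.fetchall())
cur.close()
conn.close()
```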
Speaker: Alan Gates, Co-Founder, Hortonworks
I gave this talk at the Highload++ 2015 conference in Moscow. The slides have been translated into English. They cover the Apache HAWQ components, its architecture, query processing logic, and also competitive information.
[2016 Data Grand Conference] 2-3 (Big Data): EXEM big data use cases and platform implementation (K data)
In big data environments, the DW systems that support enterprise decision-making have become more important than ever, and large-scale data analysis has become essential. To overcome the scalability and performance limits of traditional DBMSs, it has become mainstream to build DW appliances that combine software with the latest hardware devices. In this environment, TmaxSoft, the leading Korean DBMS vendor, has released a database appliance that can compete with foreign DB appliances. It recently announced an appliance partnership model with HP hardware, and it is characterized by the very large capacity, high performance, and data scalability that existing DBMSs could not deliver. ZetaData is a consolidated data solution that provides fast processing of large data volumes and system stability through high-performance database servers, intelligent storage servers, and an ultra-high-speed network.
Zeppelin Interpreters:
- PSQL (to become JDBC in 0.6.x)
- Geode
- SpringXD
Apache Ambari:
- Zeppelin Service
- Geode, HAWQ and Spring XD services
- Webpage Embedder View
Apache HAWQ (Incubating), along with its extension framework (PXF), provides a high-performance, massively parallel SQL processing framework over unmanaged data stores and formats in the Hadoop ecosystem. HCatalog provides glue for the entire Hadoop ecosystem by offering a relational abstraction for HDFS data. This presentation introduces the integration of HCatalog metadata into HAWQ's in-memory catalog, which provides a simple and seamless access paradigm to data managed by Hive.
Data Engineers Lab's (DLAB) company and service information, including various big data case studies from both vertical and horizontal business perspectives.
This is the company and service introduction for Data Engineers Lab (DLAB). It is the latest version, including commentary on big data use cases and their applications by industry and business area.
Pivotal is a trusted partner for IT innovation and transformation. From the technology, to the people, to the way people interact with technology, Pivotal is transforming how the world builds software.
At Strata NYC 2015, Pivotal announced it will supercharge the Hadoop ecosystem by contributing the HAWQ advanced SQL-on-Hadoop analytics and MADlib machine learning technologies to The Apache Software Foundation.
Pivotal HAWQ is a high-performance SQL engine on top of Hadoop. It supports SQL-92 and multi-way joins, and has one of the best query processing engines on top of Hadoop. This presentation explains some of the design principles behind HAWQ HA and offers insight into how it works with Hadoop HA.
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme... (VMworld)
VMworld 2013
Abhishek Kashyap, Pivotal
Kevin Leong, VMware
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins... (EMC)
Pivotal has set up and operationalized a 1,000-node Hadoop cluster called the Analytics Workbench. It takes special setup and skills to manage such a large deployment. This session shares how we set it up and how we manage it.
After this session you will be able to:
Objective 1: Understand what it takes to operationalize a 1,000-node Hadoop cluster.
Objective 2: Understand how to set up and manage the day-to-day challenges of a large Hadoop deployment.
Objective 3: Have a view of the tools that are necessary to solve the challenges of managing a large Hadoop cluster.
HP Converged Systems and Hortonworks - Webinar Slides (Hortonworks)
Our experts will walk you through some key design considerations when deploying a Hadoop cluster in production. We'll also share practical best practices around HP and Hortonworks Data Platform to get you started on building your modern data architecture.
Learn how to:
- Leverage best practices for deployment
- Choose a deployment model
- Design your Hadoop cluster
- Build a Modern Data Architecture and vision for the Data Lake
Fundamentals of big data, Hadoop project design, and a case study / use case.
General planning considerations and essentials for the Hadoop ecosystem and Hadoop projects.
This will provide the basis for choosing the right Hadoop implementation, integrating and adopting Hadoop technologies, and creating an infrastructure.
Building applications using Apache Hadoop, with Wi-Fi log analysis as a real-life use case.
Chicago Data Summit: Geo-based Content Processing Using HBase (Cloudera, Inc.)
NAVTEQ uses Cloudera Distribution including Apache Hadoop (CDH) and HBase with Cloudera Enterprise support to process and store location content data. With HBase and its distributed and column-oriented architecture, NAVTEQ is able to process large amounts of data in a scalable and cost-effective way.
Virtualized Big Data Platform at VMware Corp IT @ VMworld 2015 (Rajit Saha)
At the VMware Corporate IT Data Solution and Delivery team, we have built the Enterprise Advanced Data Analytics Platform on top of vSphere 6.0 with VMware Big Data Extensions, Isilon HDFS, Pivotal HD 3.0, Spring XD 1.2, and Alpine Data Lab.
From limited Hadoop compute capacity to increased data scientist efficiency (Alluxio, Inc.)
Alluxio Tech Talk
Oct 17, 2019
Speaker:
Alex Ma, Alluxio
Want to leverage your existing investments in Hadoop with your data on-premise and still benefit from the elasticity of the cloud?
Like other Hadoop users, you most likely experience very large and busy Hadoop clusters, particularly when it comes to compute capacity. Bursting HDFS data to the cloud can bring challenges – network latency impacts performance, copying data via DistCp means maintaining duplicate data, and you may have to make application changes to accommodate the use of S3.
“Zero-copy” hybrid bursting with Alluxio keeps your data on-prem and syncs data to compute in the cloud so you can expand compute capacity, particularly for ephemeral Spark jobs.
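To illustrate the "zero-copy" idea at its simplest, the hedged sketch below mounts an on-prem HDFS path into the Alluxio namespace and reads it from a cloud Spark job through the alluxio:// scheme; hosts, ports, and paths are illustrative assumptions, and the Alluxio client must be on Spark's classpath.

```python
# Sketch: mount on-prem HDFS into Alluxio, then read it from a cloud Spark
# job through Alluxio's namespace. Hosts, ports, and paths are illustrative
# assumptions; the Alluxio client jar must be on Spark's classpath.
import subprocess
from pyspark.sql import SparkSession

# One-time: expose the on-prem directory under the Alluxio namespace (read-only).
subprocess.run([
    "alluxio", "fs", "mount", "--readonly",
    "/onprem/warehouse",
    "hdfs://nn-onprem.example.com:8020/warehouse",
], check=True)

# Ephemeral cloud Spark jobs read through Alluxio; hot data is cached near
# compute, so repeated reads avoid the round trip to the on-prem cluster.
spark = SparkSession.builder.appName("burst-job").getOrCreate()
df = spark.read.parquet(
    "alluxio://alluxio-master.example.com:19998/onprem/warehouse/events")
print(df.count())
spark.stop()
```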
This is the presentation from the "Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS" webinar on May 28, 2014. Rohit Bahkshi, a senior product manager at Hortonworks, and Vinod Vavilapalli, a PMC member for Apache Hadoop, discuss an overview of YARN and HDFS and new features in HDP 2.1. Those new features include: HDFS extended ACLs, HTTPS wire encryption, HDFS DataNode caching, resource manager high availability, application timeline server, and capacity scheduler pre-emption.
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBase (HBaseCon)
Trafodion, open sourced by HP, reflects 20+ years of investment in a full-fledged RDBMS built on Tandem's OLTP heritage and geared towards a wide set of mixed query workloads. In this talk, we will discuss how HP integrated Trafodion with HBase to take full advantage of the Trafodion database engine and the HBase storage engine, covering 3-tier architecture, storage, salting/partitioning, data movement, and more.
Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015 (NoSQLmatters)
There are many frameworks that can offer real time on top of Hadoop. This talk will show the usage of Pivotal HAWQ and how easy it is to use SQL for querying your Hadoop data. Come and see the power and ease of use that can help you get the most out of the Hadoop ecosystem.