Cloudy with a Chance of Hadoop - Real World Considerations



Over the last eighteen months, we have seen significant adoption of Hadoop-ecosystem-centric big data processing in Microsoft Azure and Amazon AWS. In this talk we present some of the lessons learned and architectural considerations for cloud-based deployments, including security, fault tolerance and auto-scaling. We look at how Hortonworks Data Cloud and Cloudbreak can automate the scaling of Hadoop clusters, showing how they can react dynamically to workloads, and what that can deliver in cost-effective Hadoop-in-cloud deployments.

Cloudbreak
Janos Matyas & Krisztian Horvath

2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Presenters
Krisztian Horvath
Senior Member of Technical Staff, Cloudbreak
Co-Founder at SequenceIQ
Janos Matyas
Senior Director of Engineering, Cloudbreak
Co-Founder and CTO at SequenceIQ

Goals and Motivations – What We Wanted to Do…
• Declarative, full Hadoop stack provisioning on all major cloud providers
• Automate and unify the process
• Zero-configuration approach
• Same process throughout the cluster lifecycle (Dev, QA, UAT, Prod)
• Provide tooling: UI, REST API and CLI/shell
• Secure and multi-tenant
• SLA policy-based autoscaling
• Advanced and custom monitoring of clusters
• Auto-recovery, fault tolerance
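
To make "declarative provisioning" concrete, here is a minimal sketch that builds a cluster description in the style of an Apache Ambari blueprint and prints it as JSON. The `build_blueprint` helper, the host group names and the component selection are illustrative assumptions, not the templates Cloudbreak actually ships.

```python
import json


def build_blueprint(name, stack_version, worker_count):
    """Return an Ambari-blueprint-style dict describing the desired cluster.

    The layout (Blueprints / host_groups / components) follows the Ambari
    blueprint format; the concrete groups below are hypothetical.
    """
    return {
        "Blueprints": {
            "blueprint_name": name,
            "stack_name": "HDP",
            "stack_version": stack_version,
        },
        "host_groups": [
            {
                "name": "master",
                "cardinality": "1",
                "components": [
                    {"name": "NAMENODE"},
                    {"name": "RESOURCEMANAGER"},
                ],
            },
            {
                "name": "worker",
                "cardinality": str(worker_count),
                "components": [
                    {"name": "DATANODE"},
                    {"name": "NODEMANAGER"},
                ],
            },
        ],
    }


# The same description can be reused for Dev, QA, UAT and Prod;
# only the name and the worker count change.
blueprint = build_blueprint("dev-cluster", "2.5", worker_count=3)
print(json.dumps(blueprint, indent=2))
```

The point of the sketch is that the cluster is described as data (what should exist) rather than as a sequence of provisioning steps, so the same description can be replayed on any provider.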

Goals and Motivations – What We Wanted to Do…
• All cloud providers are fundamentally different…
• Compute, network, security, performance
• We want to share what we found, and how we made it work!

Technology Stack
• Apache Ambari
• Cloud provider API
• Salt
• Docker
• Packer

Cloudbreak – Components overview
• Cloudbreak Deployer (CBD)
– Tool to deploy the Cloudbreak application
– Microservice architecture (using Docker)
– DevOps friendly
• Cloudbreak Application
– Extensible, available through UI, CLI, REST API
– SLA auto-scaling policy management
• Cluster deployed with Cloudbreak
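
As an illustration of what an SLA-style auto-scaling policy boils down to, the following is a minimal threshold-based sketch. The `ScalingPolicy` fields, the metric name and the thresholds are assumptions made for this example; they are not Cloudbreak's policy model.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    metric: str              # hypothetical metric name
    scale_up_above: float    # add nodes when the metric exceeds this value
    scale_down_below: float  # remove nodes when the metric drops below this value
    step: int                # nodes added or removed per decision
    min_nodes: int
    max_nodes: int


def desired_node_count(policy, current_nodes, metric_value):
    """Return the node count the policy asks for, clamped to its bounds."""
    if metric_value > policy.scale_up_above:
        target = current_nodes + policy.step
    elif metric_value < policy.scale_down_below:
        target = current_nodes - policy.step
    else:
        target = current_nodes
    return max(policy.min_nodes, min(policy.max_nodes, target))


policy = ScalingPolicy("pending_containers", 100, 10,
                       step=2, min_nodes=3, max_nodes=10)

print(desired_node_count(policy, current_nodes=4, metric_value=250))  # 6
print(desired_node_count(policy, current_nodes=9, metric_value=250))  # 10 (capped)
print(desired_node_count(policy, current_nodes=3, metric_value=0))    # 3 (floor)
```

A real policy would also need a cooldown period between decisions so that the cluster does not oscillate; that is left out here to keep the sketch short.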

Lessons Learned
• Not all cloud providers are the same
– Differences in performance, storage and functionality
• (Capacity) planning
– Based on workload type (batch / interactive and ad hoc / long-running)
– Use heterogeneous clusters
– Trial and error: mistakes are cheap, so iterate until you find your best fit
– Leverage the cloud: scale your cluster on demand
– Infinite capacity myth: your cluster is just not big enough
• Number one consideration: storage
– Multiple choices (ephemeral, block storage and BLOB store)
– Bringing compute to storage might not work (everywhere); in the cloud, everything is delivered as a service
– Scale storage independently from compute; partition your data
• Security
– Consider using strict security rules (private subnets, access, etc.) and use edge nodes
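
The storage point above can be made tangible with a back-of-the-envelope calculation: when all data lives on cluster-local disks, storage alone dictates a minimum node count, whereas with a BLOB store only scratch data has to fit locally. The sketch below assumes HDFS-style replication; the disk size, fill ratio and data volumes are made-up numbers for illustration.

```python
import math


def nodes_for_storage(data_tb, replication=3, usable_tb_per_node=12.0, max_fill=0.7):
    """Minimum number of nodes needed just to hold `data_tb` on local disks.

    replication        -- copies kept of each block (3 is the usual HDFS default)
    usable_tb_per_node -- assumed local disk capacity per node
    max_fill           -- assumed fraction of the disks we allow to fill up
    """
    raw_tb = data_tb * replication
    return math.ceil(raw_tb / (usable_tb_per_node * max_fill))


# All 100 TB kept on cluster-local storage: storage drives the cluster size.
print(nodes_for_storage(100))  # 36

# Same data kept in a BLOB store, with only ~10 TB of scratch data on local disks:
# the node count can now follow compute needs instead.
print(nodes_for_storage(10))   # 4
```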

Lessons Learned - AWS
• Compute
– Find the right instance types for your workload; use heterogeneous clusters
– Different instance types for transient (e.g. C4, M4) and long-running (e.g. H2, D2) clusters
– Dedicated instances (to avoid noisy neighbors or to meet regulations, e.g. HIPAA)
• Storage
– Use the latest version of Hadoop (Hortonworks contributed cloud-specific optimizations)
– Note that S3 gives you only eventual consistency
– Different driver implementations: S3n (native, JetS3t-based), S3a (successor of S3n), S3 (block-based)
• Network
– Use enhanced networking (available by default on Amazon Linux; RHEL-based images need a patch)
– Placement groups, cross-AZ deployments
– Not all instance types can use the 10 Gbit network (e.g. use 8xlarge sizes)
• Security
– Use instance roles to access S3; deploy in a private subnet/VPC
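
One practical consequence of the eventual-consistency note above is that a job may not see an object immediately after another job wrote it. A common workaround is to poll with backoff before reading. The sketch below is generic: `exists` stands for whatever existence check your client offers, and `fake_exists` only simulates a store whose listing lags behind the write; both names are hypothetical.

```python
import time


def wait_until_visible(exists, key, attempts=5, initial_delay=0.1, sleep=time.sleep):
    """Poll `exists(key)` with exponential backoff.

    Returns True as soon as the key is reported, False if it never shows up
    within `attempts` checks.
    """
    delay = initial_delay
    for attempt in range(attempts):
        if exists(key):
            return True
        if attempt < attempts - 1:
            sleep(delay)
            delay *= 2
    return False


# Simulated store: the key becomes visible only on the third check.
calls = {"n": 0}


def fake_exists(key):
    calls["n"] += 1
    return calls["n"] >= 3


visible = wait_until_visible(fake_exists, "output/part-00000", sleep=lambda s: None)
print(visible)  # True
```

Polling only narrows the window; it does not make listings consistent, so jobs that depend on directory listings still need care.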

Lessons Learned - AWS
* d2.8xlarge used as instance type


Lessons Learned - Azure
• Compute
– Find the right instance types for your workload; use heterogeneous clusters
– Different instance types for transient (e.g. A and D families) and long-running (e.g. Dv2) clusters
– Use ARM (Azure Resource Manager) instead of the old API
• Storage
– Use the latest version of Hadoop (Hortonworks contributed cloud-specific optimizations)
– Be aware of storage account scaling limitations
– Use WASB, WASB with DASH, or ADL
– Ephemeral disk is faster than the root disk, but does not survive auto-updates
• Network
– No PTR record/reverse lookup support
• Security
– Integrate/sync with your corporate AD
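
One way to work around per-account scaling limits is to spread data over several storage accounts and pick the account deterministically from the path. The sketch below only builds `wasb://` URIs; the account names, the container name and the hashing scheme are assumptions for illustration, and this is not how DASH itself is implemented.

```python
import hashlib


def wasb_uri(path, accounts, container="data"):
    """Map a path to one of several storage accounts and return its wasb:// URI.

    The account is chosen from a hash of the path, so the same path always
    resolves to the same account.
    """
    digest = hashlib.md5(path.encode("utf-8")).hexdigest()
    account = accounts[int(digest, 16) % len(accounts)]
    return "wasb://{}@{}.blob.core.windows.net/{}".format(
        container, account, path.lstrip("/"))


# Hypothetical storage account names.
accounts = ["examplestore0", "examplestore1", "examplestore2"]

for p in ["logs/2016/06/01", "logs/2016/06/02", "logs/2016/06/03"]:
    print(wasb_uri(p, accounts))
```

Note that a plain modulo mapping reshuffles most paths when the number of accounts changes, so the account list should be fixed up front or replaced with a stable partitioning scheme.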


Demo
Autoscaling and fault tolerance





Cloudy with a Chance of Hadoop - Real World Considerations

1. Cloudbreak Janos Matyas & Krisztian Horvath

2. Presenters
Krisztian Horvath
Senior Member of technical staff, Cloudbreak
Co-Founder at SequenceIQ
Janos Matyas
Senior Director of Engineering, Cloudbreak
Co-Founder and CTO at SequenceIQ

3. Goals and Motivations – What We Wanted to Do…
 Declarative/full Hadoop stack provisioning in all major cloud providers
 Automate and unify the process
 Zero-configuration approach
 Same process through a cluster lifecycle (Dev, QA, UAT, Prod)
 Provide tooling - UI, REST API and CLI/shell
 Secure and multi-tenant
 SLA policy based autoscaling
 Advanced and custom monitoring of clusters
 Auto recovery, fault tolerance

4. Goals and Motivations – What We Wanted to Do…
 All cloud providers are fundamentally different…
 Compute, network, security, performance
 We want to share what we found, and how we made it work!

6. Cloudbreak – Components overview
 Cloudbreak Deployer (CBD)
– Tool to deploy the Cloudbreak application
– Microservice architecture (using Docker)
– DevOps friendly
 Cloudbreak Application
– Extensible, available through UI, CLI, REST API
– SLA auto-scaling policy management
 Cluster deployed with Cloudbreak
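
To make "SLA auto-scaling policy" concrete, here is a minimal sketch of a threshold-based scaling decision. It is not Cloudbreak's API or its actual policy model: the `ScalingPolicy` class, its fields and the `target_node_count` function are hypothetical names introduced only for illustration, and the metric name is an assumed example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    """Hypothetical SLA policy; field names are illustrative, not Cloudbreak's."""
    metric: str        # assumed metric name, e.g. "pending_containers"
    threshold: float   # value that triggers the policy
    adjustment: int    # nodes to add (positive) or remove (negative)
    min_nodes: int     # lower bound on cluster size
    max_nodes: int     # upper bound on cluster size


def target_node_count(policy: ScalingPolicy, current_nodes: int, metric_value: float) -> int:
    """Return the node count the cluster should move to under this policy."""
    if policy.adjustment > 0:
        # Scale-up policies fire when the metric rises above the threshold.
        breached = metric_value > policy.threshold
    else:
        # Scale-down policies fire when the metric drops below the threshold.
        breached = metric_value < policy.threshold
    if not breached:
        return current_nodes
    proposed = current_nodes + policy.adjustment
    # Clamp so the cluster never leaves its configured size range.
    return max(policy.min_nodes, min(policy.max_nodes, proposed))


scale_up = ScalingPolicy("pending_containers", threshold=10, adjustment=2,
                         min_nodes=3, max_nodes=8)
print(target_node_count(scale_up, current_nodes=4, metric_value=25))
```

The clamp to `min_nodes` and `max_nodes` is a design choice of this sketch: it keeps a noisy metric from growing the cluster without bound or shrinking it below a usable size.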

7. Lessons Learned
 Not all cloud providers are the same
– Differences in performance, storage and functionality
 (Capacity) planning
– Based on workload type (batch / interactive and ad-hoc / long running)
– Use heterogeneous clusters
– Trial and error – mistakes are cheap, iterate until you find your best fit
– Leverage the cloud - scale your cluster on demand
– Infinite capacity myth - your cluster is just not big enough
 Number one consideration – storage
– Multiple choices (ephemeral, block storage and BLOB store)
– Bring compute to storage – might not work (everywhere) – in the cloud everything is a service
– Independently scale storage from compute, partition your data
 Security
– Consider using strict security rules (private subnets, access, etc.) and use edge nodes

8. Lessons Learned - AWS
 Compute
– Find your instance types for the workload, use heterogeneous clusters
– Different instance types for transient (e.g. C4, M4) and long running (e.g. H2, D2) clusters
– Dedicated instances (to avoid noise, regulations e.g. HIPAA)
 Storage
– Use latest version of Hadoop (Hortonworks contributed cloud specific optimizations)
– Note that S3 gives you only eventual consistency
– Different driver implementations: S3n (native, jets3t based), S3a (successor of S3n), S3 (block based)
 Network
– Use enhanced networking (Amazon Linux by default, RHEL based – apply patch)
– Placement groups, cross AZ deployments
– Not all instance types can use the 10Gbit network (e.g. use 8x)
 Security
– Use instance roles to access S3, deploy in a private subnet/VPC
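
As an illustration of "use instance roles to access S3" with the S3a driver, the sketch below renders a `core-site.xml` fragment that points S3a at the EC2 instance profile instead of static keys. It assumes a Hadoop release that supports the `fs.s3a.aws.credentials.provider` property and ships the AWS SDK class named here; check both against your distribution. The helper name `render_core_site` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the cluster's Hadoop version honours this property and bundles
# the AWS SDK v1 provider class. No access key or secret key is configured,
# so credentials come from the instance role attached to each node.
S3A_INSTANCE_ROLE_PROPERTIES = {
    "fs.s3a.aws.credentials.provider":
        "com.amazonaws.auth.InstanceProfileCredentialsProvider",
}


def render_core_site(properties: dict) -> str:
    """Render Hadoop-style <configuration> XML from a name -> value mapping."""
    configuration = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(configuration, encoding="unicode")


print(render_core_site(S3A_INSTANCE_ROLE_PROPERTIES))
```

Data would then be addressed with `s3a://` URIs rather than `s3n://` or `s3://`, and no credentials need to be stored in the cluster configuration.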

11. Lessons Learned - Azure
 Compute
– Find your instance types for the workload, use heterogeneous clusters
– Different instance types for transient (e.g. A and D family) and long running (e.g. Dv2) clusters
– Use ARM (Azure Resource Manager) instead of the old API
 Storage
– Use latest version of Hadoop (Hortonworks contributed cloud specific optimizations)
– Storage account scaling limitations
– Use WASB, WASB with DASH or ADL
– Ephemeral disk is faster than root disk – does not survive auto-updates
 Network
– No PTR record/reverse lookup support
 Security
– Integrate/sync with your corporate AD
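
The storage points above can be sketched as follows. The first helper builds a WASB path in the standard `wasb://<container>@<account>.blob.core.windows.net/<path>` form. The second shows one manual way to work around per-account scaling limits by spreading data across several storage accounts; it is an assumption made for illustration, not the algorithm DASH uses. The function names and the account and container names in the example are hypothetical.

```python
import zlib


def wasb_uri(container: str, account: str, path: str = "", secure: bool = False) -> str:
    """Build a WASB (or WASBS, over TLS) URI for a blob container path."""
    scheme = "wasbs" if secure else "wasb"
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def pick_account(partition_key: str, accounts: list) -> str:
    """Map a partition key to one storage account, stably across runs.

    CRC32 is used because it is deterministic between processes, unlike
    Python's built-in hash() for strings.
    """
    if not accounts:
        raise ValueError("at least one storage account is required")
    index = zlib.crc32(partition_key.encode("utf-8")) % len(accounts)
    return accounts[index]


# Hypothetical account and container names, for illustration only.
accounts = ["examplestore0", "examplestore1", "examplestore2"]
account = pick_account("events/2016-06-28", accounts)
print(wasb_uri("data", account, "/events/2016-06-28"))
```

Keying the account choice on the partition means all files of one partition stay together, while different partitions spread their I/O over several accounts.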
