Transforming Data Architecture Complexity at Sears - StampedeCon 2013 (StampedeCon)
At the StampedeCon 2013 Big Data conference in St. Louis, Justin Sheppard discussed Transforming Data Architecture Complexity at Sears. High ETL complexity and costs, data latency and redundancy, and batch window limits are just some of the IT challenges caused by traditional data warehouses. Gain an understanding of big data tools through the use cases and technology that enable Sears to solve the problems of the traditional enterprise data warehouse approach. Learn how Sears uses Hadoop as a data hub to minimize data architecture complexity – reducing time to insight by 30–70% – and discover “quick wins” such as mainframe MIPS reduction.
While you might be tempted to assume that data is already safe once it sits in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes cross-table updates)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the "single use-case" or PoC phase, where data governance concerns such as backup and disaster recovery (BDR) are not (yet) important. This talk first introduces the overarching issues and difficulties of backup and data safety, then looks at each of the many components in Hadoop – including HDFS, HBase, YARN, Oozie, the management components, and so on – and finally shows a viable approach using built-in tools. You will also learn not to take this topic lightly, and what is needed to implement and guarantee continuous operation of Hadoop cluster-based solutions.
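As a rough, hedged illustration of one building block such a strategy can use (not taken from the talk itself), the sketch below scripts HDFS directory snapshots through the standard hdfs command-line tools. The directory paths and snapshot naming are hypothetical, and snapshots alone do not give cross-component consistency across HBase, Oozie, and the rest.

```python
# Minimal sketch: scripted HDFS snapshots as one building block of a backup strategy.
# Assumes the `hdfs` CLI is on the PATH; the directory list and naming scheme are made up.
import subprocess
from datetime import datetime, timezone

SNAPSHOT_DIRS = ["/data/warehouse", "/data/landing"]  # hypothetical snapshottable directories

def run(cmd):
    """Run a command and fail loudly so broken backups are noticed, not silently skipped."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def snapshot_all():
    tag = datetime.now(timezone.utc).strftime("backup-%Y%m%dT%H%M%SZ")
    for path in SNAPSHOT_DIRS:
        # Admin step that marks the directory as snapshottable.
        run(["hdfs", "dfsadmin", "-allowSnapshot", path])
        # Point-in-time snapshot of the directory tree; note that files still open
        # for write are not frozen, which is exactly the caveat mentioned above.
        run(["hdfs", "dfs", "-createSnapshot", path, tag])

if __name__ == "__main__":
    snapshot_all()
```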
Microsoft has embraced OSS by placing a big bet on Apache YARN to govern the resources of our computing clusters, and we did so by working with the community and adding many new capabilities to YARN. We now look to undertake a similar journey and build the next generation of our job execution engine on top of Apache Tez. We will be building a common platform for executing batch, interactive, ML, and streaming queries at exabyte scale for Microsoft's big data system, Cosmos. This requires us to push the limits of the Tez API to support new graph models, change the executing DAG by dynamically adding new vertices, schedule for interactive and streaming workloads, squeeze out all the computing power in the cluster by integrating Tez with opportunistic containers in YARN, and scale a DAG across tens of thousands of machines. We have started on this journey and want to share our progress and lessons learned, seek help from the community to add these new capabilities, and push Apache Tez to new levels.
SPEAKERS
Hitesh Sharma, Principal Software Engineering Manager, Microsoft – an engineering manager in the Big Data team at Microsoft.
Anupam, Senior Software Engineer, Microsoft
The Hadoop Distributed File System is the foundational storage layer in typical Hadoop deployments. Performance and stability of HDFS are crucial to the correct functioning of applications at higher layers in the Hadoop stack. This session is a technical deep dive into recent enhancements committed to HDFS by the entire Apache contributor community. We describe real-world incidents that motivated these changes and how the enhancements prevent those problems from reoccurring. Attendees will leave this session with a deeper understanding of the implementation challenges in a distributed file system and identify helpful new metrics to monitor in their own clusters.
Today, almost any application can be “Dockerized.” However, there are special challenges when deploying a distributed application such as Spark on containers. This session will describe how to overcome these challenges in deploying Spark on Docker containers, with many practical tips and techniques for running Spark in a container environment.
Containers are typically used to run stateless applications on a single host. There are significant real-world enterprise requirements that need to be addressed when running a stateful, distributed application in a secure multi-host container environment.
There are decisions that need to be made concerning which tools and infrastructure to use. There are many choices with respect to container managers, orchestration frameworks, and resource schedulers that are readily available today – and some that may be available tomorrow – including:
• Mesos
• Kubernetes
• Docker Swarm
Each has its own strengths and weaknesses; each has unique characteristics that may make it suitable, or unsuitable, for Spark. Understanding these differences is critical to the successful deployment of Spark on Docker containers.
This session will describe the work done by the BlueData engineering team to run Spark inside containers, on a distributed platform, including the evaluation of various orchestration frameworks and lessons learned. You will learn how to apply practical networking and storage techniques to achieve high performance and agility in a distributed, container environment.
Speaker
Thomas Phelan, Chief Architect, Blue Data, Inc
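For readers who want a concrete starting point, here is a minimal, hedged sketch (not BlueData's platform) of driving Spark in a container environment through one of the orchestrators listed above, Kubernetes. The API server address, image name, and namespace are placeholders.

```python
# Minimal sketch: pointing a PySpark session at a Kubernetes-managed container environment.
# The Kubernetes API server URL, container image, and namespace below are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-on-containers-demo")
    # Use the Kubernetes API server as the cluster manager instead of YARN or standalone.
    .master("k8s://https://k8s-apiserver.example.com:6443")
    # Image that bundles Spark and its dependencies for driver/executor pods.
    .config("spark.kubernetes.container.image", "example/spark:3.5.0")
    .config("spark.kubernetes.namespace", "analytics")
    .config("spark.executor.instances", "4")
    .getOrCreate()
)

# Trivial job to confirm executors actually come up inside containers.
print(spark.range(1_000_000).selectExpr("sum(id) AS total").collect())
spark.stop()
```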
Hadoop has traditionally been an on-premises workload, with very few notable implementations in the cloud. With organizations having either jumped on the cloud bandwagon or started planning their expansion into the ecosystem, it is imperative for us to explore how Hadoop conforms to the cloud paradigm. With the coming of age of some very useful cloud paradigms, and the nature of big data with its high seasonality of workloads, this is becoming a very common ask from customers. Robust architectures, elastic scale, open platforms, OSS integrations, and addressing complex pain points will all be part of this lively talk. To implement effective solutions for big data in the cloud, it is imperative that you understand the core principles and grasp the design principles of how the cloud can enhance the benefits of parallelized analytics. Join this session to understand the nitty-gritty of implementing big data in the cloud and the various options therein. Big Data + Cloud is definitely a powerful combination.
The Cisco Open SDN Controller is a commercial distribution of OpenDaylight that delivers business agility through automation of standards-based network infrastructure.
Built as a highly scalable software-defined networking (SDN) platform, the Open SDN Controller abstracts away the complexity of managing heterogeneous networks to improve service delivery and reduce operating costs.
The controller exposes REST APIs that allow other applications to take advantage of the controller's capabilities and unlock the power of the underlying network infrastructure, and Java APIs that allow for the creation of new network services.
This session will present the basic constructs of the controller and the capabilities of the REST and Java APIs to demonstrate how the Open SDN Controller abstracts away the complexity of managing heterogeneous networks to improve service delivery and reduce operating costs.
This presentation will describe the analytics-to-cloud migration initiative underway at Fannie Mae. The goal of this effort is threefold: (1) build a sustainable process for data lake hydration on the cloud, (2) modernize the Fannie Mae enterprise data warehouse infrastructure, and (3) retire Netezza.
Fannie Mae partnered with Impetus to modernize its legacy Netezza analytics platform. This involved the use of the Impetus Workload Migration solution – a sophisticated translation engine that automated the migration of their complex Netezza stored procedures, shell scripts, and scheduler scripts to Apache Spark-compatible scripts. This delivered substantial savings in time, effort, and cost, while reducing overall project risk.
Included in the scope of the automation project was an automated assessment capability to perform detailed profiling of the current workloads. The output from the assessment stage was a data-driven offloading blueprint and roadmap for which workloads to migrate. A hybrid cloud-based big data solution was designed based on that. In addition to fulfilling the essential requirement of historical (and incremental) data migration and automated logic translation, the solution also recommends optimal storage formats for the data in the cloud, performing SCD Type 1 and Type 2 for mission-critical parameters and reloading the transformed data back for reporting/analytical consumption.
This will include the following topics:
i. Fannie Mae analytics overview
ii. Why cloud migration for analytics?
iii. Approach, major challenges, lessons learned
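To make the SCD handling mentioned above more concrete, here is a hedged sketch of an SCD Type 2 update in PySpark. It is an illustration only, not the Impetus-generated code, and the table and column names (customer_id, address, effective_date, end_date, is_current) are hypothetical.

```python
# Illustrative SCD Type 2 pass in PySpark (brand-new customers omitted for brevity).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scd2-sketch").getOrCreate()

dim = spark.table("warehouse.customer_dim")        # history table: one row per version
updates = spark.table("staging.customer_updates")  # latest snapshot of changed customers
today = F.current_date()

current = dim.filter(F.col("is_current"))
# Keys whose tracked attribute changed in the new feed.
changed_keys = (current.alias("c")
                .join(updates.alias("u"), "customer_id")
                .filter(F.col("c.address") != F.col("u.address"))
                .select("customer_id"))

# 1) Expire the currently active versions of the changed customers.
to_expire = current.join(changed_keys, "customer_id", "left_semi")
expired = (to_expire.withColumn("end_date", today)
                    .withColumn("is_current", F.lit(False)))

# 2) Carry every other dimension row forward unchanged.
carried = dim.exceptAll(to_expire)

# 3) Insert new "current" versions for the changed customers.
new_rows = (updates.join(changed_keys, "customer_id", "left_semi")
            .withColumn("effective_date", today)
            .withColumn("end_date", F.lit(None).cast("date"))
            .withColumn("is_current", F.lit(True)))

(carried.unionByName(expired).unionByName(new_rows)
        .write.mode("overwrite").saveAsTable("warehouse.customer_dim_staged"))
```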
Speaker
Kevin Bates, Vice President for Enterprise Data Strategy Execution, Fannie Mae
Apache Hive is a rapidly evolving project which continues to enjoy broad adoption in the big data ecosystem. Although Hive started primarily as a batch ingestion and reporting tool, the community is hard at work improving it along many different dimensions and use cases. This talk will provide an overview of the latest and greatest features and optimizations that have landed in the project over the last year. Materialized views, micro-managed tables, and workload management are some noteworthy features.
I will dive deep into some optimizations which promise to provide major performance gains. Support for ACID tables has also improved considerably. Although some of these features and enhancements are not novel and have existed for years in other DB systems, implementing them in Hive poses some unique challenges and yields lessons which are generally applicable in many other contexts. I will also provide a glimpse of what is expected to come in the near future.
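As a hedged, minimal example of exercising one of these newer features (a materialized view) from Python via PyHive: the host, database, and table names are hypothetical, and the source table is assumed to be a transactional (ACID) table as Hive requires.

```python
# Sketch: creating and using a Hive materialized view through PyHive.
from pyhive import hive

conn = hive.connect(host="hiveserver2.example.com", port=10000, database="sales")
cur = conn.cursor()

# Materialized view over a (transactional) fact table; Hive can rewrite matching
# aggregate queries to read the precomputed result instead of scanning `orders`.
cur.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS daily_revenue AS
    SELECT order_date, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date
""")

cur.execute("SELECT order_date, SUM(amount) FROM orders GROUP BY order_date")
print(cur.fetchall())
```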
Speaker: Ashutosh Chauhan, Engineering Manager, Hortonworks
Operationalizing Data Science Using Cloud Foundry (VMware Tanzu)
SpringOne Platform 2016
Speaker: Lawrence Spracklen; Vice President of Engineering, Alpine Data Labs.
Data science is undoubtedly becoming a key component of every company’s core strategy for growth and increased revenue potential. To meet this market demand, the big data industry has exploded with a variety of tools to address various pieces of the data science value chain, from model scoring, to notebook interfaces, to niche algorithmic techniques. However, despite the increase in innovation in this area, many insights generated by data science teams end up “dying on the vine”. There has to be a better way of deploying operational models to end users through intuitive interfaces that they can use every day.
In this session, we will demo how the joint solution between Alpine’s Chorus Platform and Cloud Foundry addresses this problem and closes the gap between data science insights and business value. We will demo an example of creating a machine learning model leveraging data within MPP databases such as Apache HAWQ or Greenplum Database integrated with the Chorus Platform, and then deploying it as a microservice within Cloud Foundry as a scoring engine. This turn-key solution will show attendees how easy it is to plug analytic insights into end-user applications that scale, without going through lengthy development cycles.
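As a purely generic illustration of the "model as a microservice" idea (not the actual Chorus/Cloud Foundry integration), a scoring endpoint can be as small as the following. The model file and feature names are hypothetical, and Cloud Foundry-style platforms inject the listening port through the environment.

```python
# Generic scoring microservice sketch: load a pre-trained model and expose a /score endpoint.
import os
import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)
with open("model.pkl", "rb") as f:   # model trained elsewhere, e.g. against an MPP database
    model = pickle.load(f)

@app.route("/score", methods=["POST"])
def score():
    payload = request.get_json()
    features = [[payload["feature_a"], payload["feature_b"]]]
    return jsonify({"prediction": float(model.predict(features)[0])})

if __name__ == "__main__":
    # Platforms such as Cloud Foundry typically pass the port via the PORT variable.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```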
Practice of large Hadoop cluster in China Mobile (DataWorks Summit)
China Mobile Limited is the leading telecommunications services provider in China, with more than 800 million active users. In China Mobile, distributed big data clusters are built by branch companies in each province for their unique requirements. Meanwhile, we have built a centralized Hadoop cluster of more than 1,600 nodes, on which we collect data from dozens of distributed clusters and perform analysis for our business.
In this session, we will introduce the architecture of the centralized Hadoop cluster and our experience constructing and tuning this large-scale cluster. Key points are as follows:
1. About Ambari: we improved Ambari with features such as HDFS Federation support and Ambari HA, improved its performance, and enabled it to support up to 1,600 nodes.
2. About HDFS: we built a large HDFS cluster with data volumes of up to 60 PB, using federation, ViewFS, and FairCallQueue. Our best practices for cluster operation and management will also be included.
3. About Flume: we use our modified Flume to collect as much as 200 TB of data per day.
Speakers
Yuxuan Pan, Software Engineer, China Mobile Software Technology
Duan Yunfeng, Chief Designer of China Mobile's big data system, China Mobile Communications Corporation
Real-time Freight Visibility: How TMW Systems uses NiFi and SAM to create sub... (DataWorks Summit)
TMW Systems, a Trimble Company, provides the industry-leading transportation management software. 3PLs, brokers, distribution and supply operations, dedicated and private fleets, commercial carriers, and energy service providers rely on our transportation management systems, our fleet maintenance management software, or our routing and scheduling software to make them more efficient and profitable. Billions of data points exist in the trucking industry, and we at TMW Systems are pioneers in tracking millions of trucks, freight loads, and assets.
The architecture team at TMW leverages NiFi and SAM to deliver this immense volume of data in real time. In this session, you will get a thorough understanding of all the streaming components. We have utilized Apache Kafka, Apache NiFi, and Streaming Analytics Manager to build our real-time data pipeline. We will also discuss real-time event processing using SAM and Schema Registry. Lastly, we will show custom processors in NiFi and SAM that helped us with complex event processing.
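To give a feel for the Kafka-facing edge of such a pipeline, here is a hedged sketch using the kafka-python client. The broker address, topic, and event fields are invented, and the NiFi/SAM processing downstream is omitted.

```python
# Sketch: publishing telematics-style events to a Kafka topic that NiFi/SAM would consume.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka-broker.example.com:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"truck_id": "TRK-1042", "lat": 41.49, "lon": -81.69, "speed_mph": 58}
producer.send("freight-positions", value=event)  # downstream flow picks this topic up
producer.flush()
```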
Speaker
Krishna Potluri, TMW Systems, A Trimble Company, Big Data Architect
Donnie Wheat, Trimble, Senior Big Data Architect
How do you calculate the cost of a Hadoop infrastructure on Amazon AWS, given some data volume estimates and a rough use case?
This presentation attempts to compare the different options available on AWS.
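The underlying arithmetic is simple enough to sketch. Every number below (replication factor, per-node storage, instance price) is an assumption to be replaced with real estimates and current AWS pricing.

```python
# Back-of-the-envelope sizing for a Hadoop cluster on AWS.
import math

raw_tb = 100          # estimated data volume to store
replication = 3       # HDFS default replication factor
overhead = 1.3        # headroom for shuffle space, logs, and growth

usable_tb_per_node = 8    # assumed local storage per worker node
node_hourly_usd = 1.00    # assumed on-demand price per worker instance
hours_per_month = 730

nodes = math.ceil(raw_tb * replication * overhead / usable_tb_per_node)
monthly_compute = nodes * node_hourly_usd * hours_per_month
print(f"{nodes} worker nodes, roughly ${monthly_compute:,.0f}/month for compute alone")
```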
Hortonworks Technical Workshop - Operational Best Practices Workshop (Hortonworks)
Hortonworks Data Platform is a key component of the Modern Data Architecture. Organizations rely on HDP for mission-critical business functions and expect the system to be constantly available and performant. In this session we will cover the operational best practices for administering the Hortonworks Data Platform, including the initial setup and ongoing maintenance.
Evolution from Apache Hadoop to the Enterprise Data Hub by Cloudera - ArabNet... (ArabNet ME)
A new foundation for the Modern Information Architecture.
Speaker: Amr Awadallah, CTO & Cofounder, Cloudera
Our legacy information architecture is not able to cope with the realities of today's business. It cannot scale to meet our SLAs due to the separation of storage and compute, economically store the volumes and types of data we currently confront, provide the agility necessary for innovation, or, most importantly, provide a full 360-degree view of our customers, products, and business. In this talk Dr. Amr Awadallah will present the Enterprise Data Hub (EDH) as the new foundation for the modern information architecture. Built with Apache Hadoop at the core, the EDH is an extremely scalable, flexible, and fault-tolerant data processing system designed to put data at the center of your business.
Cloudera Federal Forum 2014: The Building Blocks of the Enterprise Data Hub (Cloudera, Inc.)
Eli Collins, Chief Technologist in the Office of the CTO at Cloudera, shares the story of the enterprise data hub and how it relates to the enterprise data warehouse.
Rethink Analytics with an Enterprise Data Hub (Cloudera, Inc.)
Have you run into one or more of the following barriers or limitations with your existing data warehousing architecture:
> Increasingly high data storage and/or processing costs?
> Silos of data sources?
> Complexity of management and security?
> Lack of analytics agility?
The 3 T's - Using Hadoop to modernize with faster access to data and value (DataWorks Summit)
Near real-time big data analytics is a reality via a new data pattern that avoids the latency and overhead of legacy ETL – the 3 T's of Hadoop: Transfer, Transform, and Translate.
Transfer: Once a Hadoop infrastructure is in place, a mandate is needed to immediately and continuously transfer all enterprise data – from external and internal sources and through different existing systems – into Hadoop. Previously, enterprise data was isolated, disconnected, and monolithically segmented. Through this T, various source data are consolidated and centralized in Hadoop almost as they are generated, in near real time.
Transform: Most of the enterprise data, when flowing into Hadoop, is transactional in nature. Analytics requires that data be transformed from record-based OLTP form to column-based OLAP form. This T is not the same as the T in ETL, as we need to retain the granularity in the data feeds. The key is to transform in place within Hadoop, without further data movement from Hadoop to other legacy systems.
Translate: We pre-compute or provide on-the-fly views of analytical data, exposed for consumption. We facilitate analysis and reporting, for both scheduled and ad hoc needs, to be interactive with the data for analysts and end users, integrated in and on top of Hadoop.
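As a hedged sketch of the Transform step only (the paths and column names are hypothetical), the same idea in PySpark is simply to rewrite the raw row-oriented feed into a columnar, partitioned layout without ever leaving Hadoop:

```python
# Transform in place: row-oriented landing data rewritten as columnar Parquet for analytics.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("transform-in-place").getOrCreate()

# The Transfer step has already landed raw transactional records in HDFS (JSON here).
raw = spark.read.json("hdfs:///landing/sales/2014-06-01/")

# Keep full granularity, but store it columnar and partitioned for OLAP-style access.
(raw.withColumn("sale_date", F.to_date("sale_ts"))
    .write.mode("append")
    .partitionBy("sale_date")
    .parquet("hdfs:///warehouse/sales_olap/"))
```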
Data volumes have experienced explosive growth in recent years, and that data is being generated from sources that are increasingly complex and varied. Harnessing and refining value from this data requires a new approach, as data extraction, transformation, and loading (ETL) becomes increasingly costly and difficult to scale.
Organizations are looking to leverage Hadoop as an enterprise data hub—also called a “data lake” or “data reservoir”—as a key component of their data architecture to augment their data warehouse, ETL and analytical systems in order to maximize their existing investments, reduce costs, and unlock new business value from their data.
In this webinar, you will learn:
Real-world examples that illustrate why Hadoop is the best low-cost data hub, data lake, or data landing zone (staging area) option for ETL processing
Proof points that demonstrate advantages of Hadoop and its ability to scale to manage increasing data volumes and support exploratory big data analytics
Proven best practices for a cost-effective, reliable way to implement a data management platform for your entire big data analytical ecosystem
Hidden issues to be aware of in deploying your data hub/data lake
Eliminating the Challenges of Big Data Management Inside Hadoop (Hortonworks)
Your Big Data strategy is only as good as the quality of your data. Today, deriving business value from data depends on how well your company can capture, cleanse, integrate and manage data. During this webinar, we discuss how to eliminate the challenges to Big Data management inside Hadoop.
Big Data 2.0: Hadoop as part of a Near-Real-Time Integrated Data Era (DataWorks Summit)
A new era of big data is coming, an era we would call "Big Data 2.0," with characteristics including:
1. The lines between data and metadata, storage and processing logic become further blurred
2. The data integration pattern is shifting from ETL (extract, transform and load) to the 3 T's in Hadoop (transfer, transform and translate)
3. The batch-oriented data pipeline is challenged, even surpassed, by stream-based data flow
4. In-memory big data processing emerges as a new promising trend
5. Latency from raw data to business intelligence is dramatically shortened toward real-time or near real-time
6. Hadoop and other NoSQL solutions are further integrated into the same environment
7. Mapping and conversion between relational/row-based and column-based data becomes end-user friendly
8. More ad hoc, interactive, query-based analytics outgrow pure MapReduce
9. Hadoop evolves from data server-centric to client rich
10. Hadoop becomes the centerpiece of enterprise data systems, with the roles of database, data warehouse, and data center storage all in one, as an integrated platform and solution
This vision of Big Data 2.0 is based on Sears' research, development and production experience, and best practice in enterprise data solutions, which indicate that Hadoop is ready for its prime time in this new era.
Data on the Move: Transitioning from a Legacy Architecture to a Big Data Plat... (MapR Technologies)
Atzmon Hen-Tov & Lior Schachter, Pontis
Businesses everywhere are increasingly challenged by their dependencies on legacy platforms. The dramatic increase in data volume, speed, and types of data is quickly outstripping the capabilities of these legacy systems. By transitioning from a legacy RDBMS to a Hadoop-based platform, Pontis was able to process and analyze billions of mobile subscriber events every day. In this talk, we’ll provide a quick overview of our legacy system, as well as our process for migrating to our target architecture. We’ll continue with a review of our Hadoop platform selection process, which involved a thorough RFP and a detailed analysis of the top Hadoop platform vendors. This session will focus on how we gradually transitioned to our big data platform over the course of several product versions, resulting in higher scalability and a lower TCO in each version. We’ll outline the benefits of the target architecture, and detail how we successfully integrated Hadoop into our organization. Our session will conclude with a look at technical solutions for dealing with big data deficiencies.
Standing Up an Effective Enterprise Data Hub -- Technology and Beyond (Cloudera, Inc.)
Federal organizations increasingly are focused on creating environments that enable more data-driven decisions. Yet ensuring that all data is considered and is current, complete, and accurate is a tall order for most. To make data analytics meaningful to support real-world transformation, agency staff need business tools that provide user-friendly dashboards, on-demand reporting, and methods to manage efficiently the rise of voluminous and varied data sets and types commonly associated with big data. In most cases, existing systems are insufficient to support these requirements. Enter the enterprise data hub (EDH), a software architecture specifically designed to be a unified platform that can economically store unlimited data and enable diverse access to it at scale. Plan to attend this discussion to understand the key considerations to making an EDH the architectural center of your agency’s modern data strategy.
Enterprise Data Hub: The Next Big Thing in Big Data (Cloudera, Inc.)
If you missed Strata + Hadoop World, you missed quite a bit. This year's event was packed with Big Data practitioners across industries who shared their experiences and how they are driving new innovations like never before. Just because you weren't there, doesn't mean you missed out.
In this session, we'll touch on a few of the key highlights from the show, including:
Key trends in Big Data adoption
The enterprise data hub
How the enterprise data hub is used in practice
Apache HBase in the Enterprise Data Hub at Cerner (HBaseCon)
Swarnim Kulkarni (Cerner)
Cerner has been an active consumer of HBase for a very long time, storing petabytes of healthcare data in its multiple isolated HBase clusters. This talk will walk through the design of Cerner's enterprise data hub with a focus on the multi-tenant HBase as a service offering within the hub.
Big Data and advanced analytics are critical topics for executives today. But many still aren't sure how to turn that promise into value. This presentation provides an overview of 16 examples and use cases that lay out the different ways companies have approached the issue and found value: everything from pricing flexibility to customer preference management to credit risk analysis to fraud protection and discount targeting. For the latest on Big Data & Advanced Analytics: http://mckinseyonmarketingandsales.com/topics/big-data
The Big Data and Hadoop training course is designed to provide the knowledge and skills needed to become a successful Hadoop developer. In-depth coverage of concepts such as the Hadoop Distributed File System, setting up the Hadoop cluster, MapReduce, Pig, Hive, HBase, ZooKeeper, Sqoop, etc. will be included in the course.
Companies that want to turn excellent customer experience into growth need to master Customer Journeys. Customer Journeys (the set of interactions a customer has with a brand to complete a task), rather than isolated moments of truth, are what matter to a customer. Companies that master them not only see an improvement in customer experience, loyalty, and operational productivity; they also see above-market growth.
Fundamentals of big data, Hadoop project design, and a case study / use case.
General planning considerations and the key necessities in the Hadoop ecosystem and Hadoop projects.
This will provide the basis for choosing the right Hadoop implementation, integrating Hadoop technologies, driving adoption, and creating an infrastructure.
Building applications using Apache Hadoop, with a Wi-Fi log analysis use case as a real-life example.
We provide Hadoop training in Hyderabad and Bangalore, including corporate training delivered by faculty with 12+ years of experience.
Real-time industry experts from MNCs
Resume preparation by expert professionals
Lab exercises
Interview preparation
Expert advice
5 Things that Make Hadoop a Game Changer
Webinar by Elliott Cordo, Caserta Concepts
There is much hype and mystery surrounding Hadoop's role in analytic architecture. In this webinar, Elliott presented, in detail, the services and concepts that make Hadoop a truly unique solution – a game changer for the enterprise. He talked about the real benefits of a distributed file system, the multi-workload processing capabilities enabled by YARN, and the three other important things you need to know about Hadoop.
To access the recorded webinar, visit the event site: https://www.brighttalk.com/webcast/9061/131029
For more information about the services and solutions that Caserta Concepts offers, please visit http://casertaconcepts.com/
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ... (Precisely)
Tackling the challenge of designing a machine learning model and putting it into production is the key to getting value back – and the roadblock that stops many promising machine learning projects. After the data scientists have done their part, engineering robust production data pipelines has its own set of challenges. Syncsort software helps the data engineer every step of the way.
Building on the process of finding and matching duplicates to resolve entities, the next step is to set up a continuous streaming flow of data from data sources so that as the sources change, new data automatically gets pushed through the same transformation and cleansing data flow – into the arms of machine learning models.
Some of your sources may already be streaming, but the rest are sitting in transactional databases that change hundreds or thousands of times a day. The challenge is that you can’t affect the performance of data sources that run key applications, so putting something like database triggers in place is not the best idea. Using Apache Kafka or similar technologies as the backbone for moving data around doesn’t, by itself, solve the problem of grabbing changes from the source, pushing them into Kafka, and consuming the data from Kafka to be processed. If something unexpected happens – like connectivity being lost on either the source or the target side – you don’t want to have to fix it or start over because the data is out of sync.
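On the consuming side, the usual safeguard is to commit offsets only after a record has been fully processed, so a dropped connection means resuming rather than re-syncing. Here is a hedged kafka-python sketch; the broker, topic, and group id are hypothetical.

```python
# Sketch: resumable consumption of change events with manual offset commits.
import json
from kafka import KafkaConsumer

def apply_to_feature_store(change):
    # Placeholder for the cleansing/transformation flow that feeds the ML models.
    print("processed change for key", change.get("primary_key"))

consumer = KafkaConsumer(
    "cdc-changes",
    bootstrap_servers=["kafka-broker.example.com:9092"],
    group_id="ml-feature-pipeline",
    enable_auto_commit=False,              # commit only after a record is safely processed
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for record in consumer:
    apply_to_feature_store(record.value)
    consumer.commit()                      # restart-safe: the group resumes from here
```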
View this 15-minute webcast on-demand to learn how to tackle these challenges in large scale production implementations.
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives (Cloudera, Inc.)
This session will provide an executive overview of the Apache Hadoop ecosystem, its basic concepts, and its real-world applications. Attendees will learn how organizations worldwide are using the latest tools and strategies to harness their enterprise information to solve business problems and the types of data analysis commonly powered by Hadoop. Learn how various projects make up the Apache Hadoop ecosystem and the role each plays to improve data storage, management, interaction, and analysis. This is a valuable opportunity to gain insights into Hadoop functionality and how it can be applied to address compelling business challenges in your agency.
Gluent Extending Enterprise Applications with Hadoop (Gluent)
This presentation shows how to transparently extend enterprise applications with the power of modern data platforms such as Hadoop. Application re-writing is not needed and there is no downtime when virtualizing data with Gluent.
If you are searching for the best engineering college in India, you can trust RCE (Roorkee College of Engineering) services and facilities. They provide the best education facilities, highly educated and experienced faculty, well-furnished hostels for both boys and girls, a top computerized library, great placement opportunities, and more, at an affordable fee.
The next-generation user experience should move to customer engagement zones along customers' preferred channels, with desired action-to-outcome approaches. With scores of information – ranging from inventory to inquiry, weather to warehouse alerts, product to promotion info – at its disposal, enterprise digitization can create value at every customer touch point. Attendees witnessed the manifestation of TCS' Thought Leadership in the Game of Retail.
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey (Alluxio, Inc.)
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Sandipan Chakraborty, Director of Engineering (Rakuten)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Similar to Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point of truth: The reality of the enterprise data hub (20)
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
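To give a flavour of the notebook, here is a minimal sketch using the pypowsybl Python binding; the webinar's own notebook may use different networks and analyses. It loads a bundled IEEE 14-bus test grid, inspects it as pandas DataFrames, and runs an AC power flow.

```python
# Minimal sketch with pypowsybl (pip install pypowsybl); the workshop notebook
# itself may differ in the grids and studies it walks through.
import pypowsybl as pp

# Start from a bundled example grid rather than modelling one from scratch.
network = pp.network.create_ieee14()

# Grid components are exposed as pandas DataFrames.
print(network.get_buses().head())
print(network.get_lines().head())

# Run an AC power flow and check that each component converged.
results = pp.loadflow.run_ac(network)
for component in results:
    print(component.status)
```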
Justin Sheppard & Ankur Gupta from Sears Holdings Corporation - Single point of truth: The reality of the enterprise data hub
1. Single Point of Truth: The Reality of the Enterprise Data Hub
Justin Sheppard
Ankur Gupta
Sears Holdings Corporation
2. Where Did We Start?
• Not meeting production schedules
• Multiple copies of data, no single point of truth
• ETL complexity, cost of software and cost to manage
• Time to set up ETL data sources for projects
• Latency in data (up to weeks in some cases)
• Enterprise Data Warehouses unable to handle load
• Mainframe workload over-consuming capacity
• IT budgets not growing – BUT data volumes escalating
3. What Is Hadoop?
Hadoop is a platform for data storage and processing that is scalable, fault tolerant, and open source.
Hadoop Distributed File System (HDFS): file sharing and data protection across physical servers.
MapReduce: fault-tolerant distributed computing across physical servers.
Flexibility
o A single repository for storing, processing and analyzing any type of data (structured and complex)
o Not bound by a single schema
Scalability
o Scale-out architecture divides workloads across multiple nodes
o Flexible file system eliminates ETL bottlenecks
Low Cost
o Can be deployed on commodity hardware
o Open source platform guards against vendor lock-in
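To make the MapReduce half of the slide concrete, below is an illustrative word-count-style Hadoop Streaming job in Python (a sketch, not code from the deck): the mapper emits key/value pairs, the framework shuffles and sorts them by key, and the reducer aggregates each key's values across the cluster.

```python
#!/usr/bin/env python3
# Illustrative Hadoop Streaming sketch (not from the deck). The same script is
# used as mapper or reducer depending on its first argument, e.g.:
#   hadoop jar hadoop-streaming.jar -input /data/raw -output /data/counts \
#     -mapper "wordcount.py map" -reducer "wordcount.py reduce" -file wordcount.py
import sys


def mapper():
    # Emit one "word<TAB>1" pair per word; the framework sorts by key.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")


def reducer():
    # Input arrives sorted by key, so all counts for a word are contiguous.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")


if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```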
4. Hadoop
Is:
• Store vast amounts of data
• Run queries on huge data sets
• Ask questions previously impossible
• Archive data but still analyze it
• Capture data streams at incredible speeds
• Massively reduce data latency
• Transform your thinking about ETL
Is Not:
• A high-speed SQL database
• Simple
• Easily connected to legacy systems
• A replacement for your current data warehouse
• Going to be built or operated by your DBAs
• Going to make any sense to your data architects
• Going to be possible if you do not have Linux skills
5. Use The Right Tool For The Right Job
Hadoop: when to use?
• Affordable storage/compute
• High-performance queries on large data
• Complex data
• Resilient auto-scalability
Databases: when to use?
• Transactional, high-speed analytics
• Interactive reporting (<1 sec)
• Multi-step transactions
• Numerous inserts/updates/deletes
The two can be combined.
6. Use The Right Tool For The Right Job
[Diagram: Hadoop and a database combined in a single architecture]
7. Data Hub
• Underlying premise as Hadoop adoption continues: source data once, use it many times.
• Over time, as more and more data is sourced, development times decrease, since the data-sourcing effort is significantly less than typical.
9. The First Usage in Production
Use Case
• An interactive presentation layer was required to present item/price/sales data in a highly flexible user interface with rapid response time
• Needed to deliver the solution within a very short period of time
• The legacy architecture would have required a MicroStrategy solution utilizing 1,000s of cubes on many expensive servers
Approach
• Rapid development project initiated to present item/price/sales data in a highly flexible user interface with rapid response time
• Built the system from the ground up
• Migrated all required data to a centralized HDFS repository from legacy databases
• Developed MapReduce code to process daily data files into 4 primary data tables
• Tables extracted to a service layer (MySQL/Infobright) for presentation through the Pricing Portal
Results
• File preparation completes in minutes each day and ensures portal data is ready very soon after daily sales processing completes (100K records daily)
• This was the first production usage of MapReduce and associated technologies – the project initiated in March and was live on May 9 (<10 weeks from concept to realization)
Technologies Used
• Hadoop, Hive, MapReduce, MySQL, Infobright, Linux, REST Web Service, DotNetNuke
A learning experience for all parties; it successfully demonstrated platform abilities in a production environment – but we would NOT do it this way again…
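The deck includes no code, but the shape of such a daily job is roughly the following: a mapper keys each sales record by item and day, and a reducer rolls the records up into the summary rows that are then bulk-loaded into MySQL/Infobright. This is a hypothetical Hadoop Streaming sketch in Python; the field layout and the split into 4 primary tables are invented for illustration.

```python
#!/usr/bin/env python3
# Hypothetical pricing-portal style rollup; SHC's real record layout and table
# split are not described in the deck. Run as mapper or reducer via Streaming.
import sys
from collections import defaultdict


def mapper():
    # Assumed input: CSV sales records "item_id,store_id,sale_date,qty,amount".
    for line in sys.stdin:
        item_id, _store, sale_date, qty, amount = line.rstrip("\n").split(",")
        # Key by item and day so the reducer sees all matching records together.
        print(f"{item_id}|{sale_date}\t{qty},{amount}")


def reducer():
    totals = defaultdict(lambda: [0, 0.0])  # key -> [units, revenue]
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        qty, amount = value.split(",")
        totals[key][0] += int(qty)
        totals[key][1] += float(amount)
    for key, (units, revenue) in sorted(totals.items()):
        item_id, sale_date = key.split("|")
        # One summary row per item/day, ready to bulk-load into MySQL/Infobright.
        print(f"{item_id},{sale_date},{units},{revenue:.2f}")


if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```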
10. Mainframe Migration
[Diagram: a simple mainframe process in which Steps 1 through 5 read Sources 1 through 4 and produce an Output]
As our experience with Hadoop increased, hypotheses were formed that the technology could aid SHC’s mainframe migration initiative. The example above represents a simple mainframe process.
[Diagram: the same process with Steps 4 and 5 crossed out, i.e. moved off the mainframe]
Migrated sections of mainframe processing, including data transfer to Hadoop and back, eliminating MIPS and IMPROVING overall cycle time.
11. ETL Replacement
• A major ongoing system effort in our Marketing department was heavily reliant on DataStage processing for ETL
– In the early stages of deployment the ETL platform performed within acceptable limits
– As volume increased the system began to have performance issues as the ETL platform degraded
– With full rollout imminent, the options were to heavily invest in additional hardware – or – rework CPU-intensive portions in Hadoop
• Experience with mainframe migration evolved into ETL replacement.
• SHC successfully demonstrated reducing load on costly ETL software with Pig scripts (and data movement from / to the ETL platform as an intermediate step).
• AND with improved processing time…
12. ETL Replacement
• The section shown in RED (in the original chart) had a processing duration that far exceeded the SLA.
• Using a similar approach to the mainframe migration, components of the process were migrated to Hadoop (in Pig).
• Data movement plus processing on Hadoop was more predictable and efficient, regardless of volume, than the prior environment – with no additional investment.
13. The Journey
• From legacy code (>1,000 lines) to Ruby / MapReduce (400 lines)
– Cryptic code, difficult to support, difficult to train
• We tried Hive (~400 lines, SQL-like abstraction)
– Easy to use, easy to experiment and test with
– Poor performance, difficult to implement business logic
• We evolved to Pig with Java UDF extensions
– Compressed, very efficient, easy to code and read (~200 lines)
– Demonstrated success in turning mainframe developers into Pig developers in under 2 weeks
• As we progressed, our business partners requested more and more data from the cluster, which required developer time
– We are now using Datameer as a business-user reporting and query front-end to the cluster
– Developed for Hadoop, runs efficiently, flexible spreadsheet interface with dashboards
We are in a much different place now than when we started our Hadoop journey.
14. The Learning
HADOOP
We can dramatically reduce batch processing times for mainframe and EDW
We can retain and analyze data at a much more granular level, with longer history
Hadoop must be part of an overall solution and ecosystem
IMPLEMENTATION
We can reliably meet our production deliverable time-windows by using Hadoop
We can largely eliminate the use of traditional ETL tools
New tools allow an improved user experience on very large data sets
We developed tools and skills – the learning curve is not to be underestimated
We developed experience in moving workload from expensive, proprietary mainframe and EDW platforms to Hadoop, with spectacular results
UNIQUE VALUE
Over three years of experience using Hadoop for enterprise legacy workload.
15. Thank You!
For further information
email: contact@metascale.com
visit: www.metascale.com