High-level Hadoop analysis requires custom solutions to deliver the data that you need, and the amount of time that even senior engineers require to create ETL jobs in a DIY hardware and software environment can be substantial.
We found that the Dell | Cloudera | Syncsort solution was so easy to use that an entry-level employee could use it to create optimized ETL jobs after only a few days of training. And he could do it quickly—our technician, who had no previous experience using Hadoop, developed three optimized ETL jobs in 31 hours. That is less than half of the 68 hours our expert with years of Hadoop experience needed to create the same jobs using open source tools.
Using the Dell | Cloudera | Syncsort solution means that your organization can implement a Hadoop solution using employees already on your staff rather than trying to recruit expensive, difficult-to-find specialists. Not only that, but the projects can be completed in a fraction of the time. This makes the Dell | Cloudera | Syncsort solution a winning business proposition.
Accelerating big data with ioMemory and Cisco UCS and NOSQL – Sumeet Bansal
When great companies work together, an even greater outcome is possible. I am presenting this at Oracle OpenWorld 2012 at the Cisco theatre. Could one possibly support a Twitter-like workload with just one server and a few ioDrives? It's all here.
Innovation Without Compromise: The Challenges of Securing Big Data – Cloudera, Inc.
Hadoop is a powerful tool for today’s enterprise – providing unified storage of all data and metadata, regardless of format or source, and multiple frameworks for robust processing and analytics. However, this flexibility and scale also present challenges for securing and governing this data.
Join IDC analysts Carl Olofson and Mike Versace as they discuss the changing world of big data security with Eddie Garcia, Information Security Architect at Cloudera, and Anil Earla, Chief Data and Analytics Officer of IS at Visa.
During this live roundtable discussion, you will:
Gain an understanding of how securing big data differs from traditional enterprise security
Learn about the latest tools and initiatives around Hadoop platform security
Hear how one of the largest payment processors approaches big data security and regulatory concerns
Cloudera + Syncsort: Fuel Business Insights, Analytics, and Next Generation T... – Precisely
Effective AI and ML projects require a perfect blend of scalable, clean data funneled from a variety of sources across the business. The only problem? Uncleaned data often lives in hard-to-access legacy systems, and it costs time and money to build the right foundation to deliver that data to answer ever-changing questions from business users. Together, Cloudera and Syncsort enable you to build a scalable foundation of data connections to reinvent the data lifecycle of all your projects in the most efficient way possible.
View this webinar on-demand to learn how innovative solutions from Cloudera and Syncsort enable AI and ML success. You will learn:
• Best practices for transforming complex data into clear, actionable insights for AI and ML projects
• How to visually assess the quality of the sources in your data lake and their completeness, consistency, and accuracy
• The value of an Enterprise Data Cloud and the newly unveiled Cloudera Data Platform
• How Syncsort Connect integrates natively with the Cloudera Data Platform
Cloudera Data Impact Awards 2021 – Finalists – Cloudera, Inc.
This annual program recognizes organizations that are moving swiftly toward the future and building innovative solutions, making what was impossible yesterday possible today.
The winning organizations' implementations demonstrate outstanding achievements in fulfilling their mission, technical advancement, and overall impact.
The 2021 Data Impact Awards recognize organizations' achievements with the Cloudera Data Platform in seven categories:
Data Lifecycle Connection
Data for Enterprise AI
Cloud Innovation
Security & Governance Leadership
People First
Data for Good
Industry Transformation
Introducing the data science sandbox as a service 8.30.18 – Cloudera, Inc.
How can companies integrate data science into their businesses more effectively? Watch this recorded webinar and demonstration to hear more about operationalizing data science with Cloudera Data Science Workbench on Cazena’s fully-managed cloud platform.
Strategies for Enterprise Grade Azure-based Analytics – Cloudera, Inc.
Over the past decade, big data implementations have become more sophisticated, particularly for organizations operationalizing machine learning, analytics, and data engineering. The pressures of data-driven cultures and multiple workload applications, such as customer care, fraud management, and cross-platform marketing, are changing the game. Mixing machine learning with business processes and operationalizing analytics alongside data engineering practices place burdens on IT teams. Making these advanced data environments all work together is an ongoing challenge.
While you can still “swipe and go” to implement data management environments in the cloud, the easy path is often littered with additional costs and higher maintenance and synchronization overhead. Data-savvy organizations are taking a more measured and coordinated approach to their machine learning, analytics, and data engineering infrastructures. These proactive approaches speed adoption among business stakeholders and reduce administration and governance issues for technologists.
Join John L Myers, managing research director at leading IT analyst firm Enterprise Management Associates (EMA), and Nik Rouda, director of product marketing at Cloudera, to discover how the world of cloud implementations has changed for the better and what an enterprise-grade cloud environment, built with the right resources, could mean for your organization.
Attend this webinar to learn about:
Drivers for implementing machine learning, analytics and data engineering with a proactive approach
Pitfalls associated with “immediate gratification” implementations
How business stakeholders benefit from proactive approaches
How proactive implementations improve the workloads of technologists
Examples of real world customer implementations
Extending Cloudera SDX beyond the Platform – Cloudera, Inc.
Cloudera SDX is by no means restricted to the platform; it extends well beyond it. In this webinar, we show you how Bardess Group’s Zero2Hero solution leverages the shared data experience to coordinate Cloudera, Trifacta, and Qlik to deliver complete customer insight.
3 Things to Learn About:
*The IoT ecosystem and data management considerations for IoT
*Top IoT use cases and data architecture strategies for managing the sheer volume and variety of IoT data
*Real-life case studies on how our customers are using Cloudera Enterprise to drive insights and analytics from all of their IoT data
The Data Center Is The Heartbeat of Today's IT Transformation (ENT215) | AWS ... – Amazon Web Services
(Presented by CoreSite) The cloud-enabled data center sits at the center of IT transformation. It facilitates the interconnection and communities that come together, propelling growth for both buyers and sellers. Learn how CoreSite is bringing together best-of-breed partners through the Open Cloud Exchange, resulting in public, private, and hybrid IT interconnection and management as well as integration of AWS Direct Connect.
The CSC Big Data Analytics Insights service enables clients who do not yet have an analytics capability to implement the business, data, and technology changes needed to gain business benefit from an initial set of analytics, based on a roadmap of changes created by CSC or provided from a compatible set of inputs.
CSC Analytic Insights Implementation has four stages:
Stage 1: Analytic Engagement
Stage 2: Analytic Discovery
Stage 3: Implementation Planning
Stage 4: Embedding Analysis
Migrating to the Cloud – Is Application Performance Monitoring still required? – eG Innovations
As more businesses adopt cloud technologies for their various benefits, it is important to note that not all cloud offerings are the same; they provide different services and infrastructure SLAs. Did you know that not all SLAs from cloud providers guarantee your application the same availability?
Depending on whether you are leveraging SaaS, PaaS, IaaS, etc., to deliver your applications, you will have different levels of visibility and control over how you manage performance and deliver the experience your users expect.
Join the session to find out what remains within your responsibility and how you can monitor the various cloud infrastructures and services to gain the visibility needed to deliver the expected user experience without over-provisioning.
IoT is reshaping manufacturing and industrial processes, effectively changing the paradigm from repair-and-replace to predict-and-prevent. Using data streaming from connected equipment and machinery, organizations can now monitor the health of their assets and effectively predict when and how an asset might fail. However, without the right data management strategy and tools, investments in IoT can yield limited results. Join Cloudera and Tata Consultancy Services (TCS) for a joint webinar to learn more about how organizations are using advanced analytics and machine learning to drive IoT-enabled predictive maintenance.
For the full video of this presentation, please visit:
https://www.edge-ai-vision.com/2021/07/the-data-driven-engineering-revolution-a-presentation-from-edge-impulse/
Zach Shelby, Co-founder and CEO of Edge Impulse, presents the “Data-Driven Engineering Revolution” tutorial at the May 2021 Embedded Vision Summit.
In this talk, IoT industry pioneer and Edge Impulse co-founder Zach Shelby shares insights about how machine learning is revolutionizing embedded engineering. Advances in silicon and deep learning are enabling embedded machine learning (TinyML) to be deployed where data is born, from industrial sensor data to audio and video.
Shelby explains the new paradigm of data-driven engineering with ML, showing how developers are using data instead of code to drive algorithm innovation. To support widespread deployment, ML workloads need to run on embedded computing targets from MCUs to GPUs, with MLOps processes to support efficient development and deployment. Industrial, logistics and health markets are particularly ripe to deploy this data-driven approach, and Shelby highlights several exciting case studies.
Delivering improved patient outcomes through advanced analytics 6.26.18 – Cloudera, Inc.
Rush University Medical Center, along with Cloudera and MetiStream, talk about adopting a comprehensive and interactive analytic platform for improved patient outcomes and better genomic analysis, highlighting examples in both genomics and clinical notes. John Spooner of 451 Research provides context to the discussion and shares market insights that complement the customer stories.
Data Centers in the age of the Industrial Internet – GE_India
The convergence of machines and intelligent data is known as the Industrial Internet, and it's changing the way we work by improving efficiency and operations.
In the age of the Industrial Internet, the data center and its key components are evolving.
To learn more, click here: http://invent.ge/1c8vfvO
To know more about GE in India log on to: http://www.ge.com/in/
Connect with GE India online:
https://www.facebook.com/GEIndia
https://www.twitter.com/GEIndia
https://www.youtube.com/GEIndia
The cloud Infrastructure-as-a-Service (IaaS) model is gaining traction among businesses. The advent of more powerful, streamlined, and cost-effective cloud-enabled services is fueling this migration by giving organizations another infrastructure management alternative.
Unified Analytics in GE’s Predix for the IIoT: Tying Operational Technology t... – Altoros
Learn how to achieve holistic operational visibility into IIoT business environments by correlating the data from Operational Technology and IT, and organizing it as a single pane of glass in accordance with business processes.
Performance advantages of Hadoop ETL offload with the Intel processor-powered... – Principled Technologies
High-level Hadoop analysis requires custom solutions to deliver the data that you need, and the faster these jobs run the better. What if ETL jobs created by an entry-level employee after only a few days of training could run even faster than the same jobs created by a Hadoop expert with 18 years of database experience?
This is exactly what we found in our testing with the Dell | Cloudera | Syncsort solution. Not only was this solution faster, easier, and less expensive to implement, but the ETL use cases our beginner created with it ran up to 60.3 percent more quickly than those our expert created with open-source tools.
Using the Dell | Cloudera | Syncsort solution means that your organization can pay a lower-level employee for half as much time as a senior engineer would need to do less-optimized work. That is a clear path to savings.
Docker's value for Development Teams in a DevOps Process – Laurent Goujon
This flash study shows how Docker technology can serve as a lever for digital transformation and DevOps deployments, and be used as an industrial tool to transform the way we build and run applications and software.
Big Data Made Easy: A Simple, Scalable Solution for Getting Started with Hadoop – Precisely
With so many new, evolving frameworks, tools, and languages, a new big data project can lead to confusion and unwarranted risk.
Many organizations have found Data Warehouse Optimization with Hadoop to be a good starting point on their Big Data journey. Offloading ETL workloads from the enterprise data warehouse (EDW) into Hadoop is a well-defined use case that produces tangible results for driving more insights while lowering costs. You gain significant business agility, avoid costly EDW upgrades, and free up EDW capacity for faster queries. This quick win builds credibility and generates savings to reinvest in more Big Data projects.
A proven reference architecture that includes everything you need in a turnkey solution – the Hadoop distribution, data integration software, servers, networking and services – makes it even easier to get started.
Continuing to run a legacy infrastructure may be possible, but it isn’t optimal—not when new technologies like the Dell and Nutanix solution are available. By upgrading to this new hyperconverged infrastructure, you could do eight times the work of a legacy solution in just 6U and scale for more work by simply adding another node. What’s more, eliminating the need for centralized SAN storage means more space to grow, less hardware to manage, and the potential for lower power and cooling bills.
Take the first step on the path to an upgraded environment. Run DPACK in your own datacenter, and discover your performance requirements and potential bottlenecks. Then consider how the increased mixed workload performance from the hyperconverged, Intel processor-powered Dell and Nutanix solution could help your business thrive.
Using Dell ProDeploy Plus for Infrastructure can improve deployment times for... – Principled Technologies
Save valuable in-house admin time by using a Dell Technologies-certified engineer for installation and configuration of a Dell solution
New data center resources can be great for your critical workloads and applications, but deploying those resources can burden IT staff. Organizations adding Dell Technologies solutions to their data center can use the deployment service Dell ProDeploy Plus for Infrastructure to simplify deployment and save time.
We found that Dell ProDeploy Plus:
• Planned the deployment with minimal input from our in-house staff, which freed them up to focus on organizational demands and other strategic initiatives during deployment
• Reduced deployment planning by 67 percent—down to just three hours and 34 minutes—saving significant pre-deployment time for our admin as well as in totality
• Deployed our Dell Technologies solution in just one business day, needing just four hours and 17 minutes
• Reduced installation time by nearly 16 hours, or two days, compared to our admin—or in other words, deployed the solution three times as fast
With ProDeploy Plus for Infrastructure, your organization can realize a faster, more efficient deployment of Dell technology with minimal impact on your in-house IT staff.
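The reported 67 percent planning-time reduction implies an in-house baseline that the text does not state directly; a minimal sketch of that back-calculation, where the baseline below is derived rather than reported:

```python
# Back-calculate the implied in-house planning time from the reported
# 67 percent reduction. The 3 h 34 m ProDeploy Plus figure comes from the
# text; the in-house baseline is derived, not a reported number.
prodeploy_planning_min = 3 * 60 + 34              # 3 hours 34 minutes, in minutes
reduction = 0.67                                  # reported planning-time reduction
implied_inhouse_min = prodeploy_planning_min / (1 - reduction)
print(f"Implied in-house planning time: ~{implied_inhouse_min / 60:.1f} hours")
```

Under these assumptions, the in-house team would have needed roughly ten to eleven hours of planning for the same deployment.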
Unifying the management of a data center’s software and hardware components can help organizations deliver the technology infrastructures necessary to capitalize on the promises of cloud computing, big data, and the Internet of Things (IoT).
How Pig and Hadoop fit in a data processing architecture – Kovid Academy
Pig, developed by Yahoo! Research in 2006, enables programmers to write data transformation programs for Hadoop quickly and easily, without the cost and complexity of raw MapReduce programs.
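For illustration, the kind of load/filter/group/count transform that Pig Latin expresses in a few declarative statements can be sketched in Python; the records and field names below are invented for this example:

```python
# A minimal Python sketch of a typical Pig-style ETL transform:
# LOAD records, FILTER by action, GROUP by user, COUNT per group.
# The sample records and field names are invented for illustration.
from collections import Counter

records = [
    {"user": "alice", "action": "click"},
    {"user": "bob",   "action": "view"},
    {"user": "alice", "action": "click"},
]

# Keep only "click" events, then count events per user.
clicks_per_user = Counter(r["user"] for r in records if r["action"] == "click")
print(clicks_per_user)  # prints Counter({'alice': 2})
```

In Pig Latin this would be a handful of declarative statements, whereas the equivalent hand-written MapReduce job would require a mapper, a reducer, and driver boilerplate.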
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey – DataStax
Data management may be the hardest part of making the transition to the cloud, but enterprises including Intuit and Macy’s have figured out how to do it right. So what do they know that you might not? Join Robin Schumacher, Chief Product Officer at DataStax, as he explores best practices for defining and implementing data management strategies for the cloud. He outlines a four-step journey that will take you from your first deployment in the cloud through to a true intercloud implementation, and walks through a real-world use case in which a major retailer evolved through the four phases over four years and is now benefiting from a highly resilient multi-cloud deployment.
View webinar: https://youtu.be/RrTxQ2BAxjg
Investing in GenAI: Cost‑benefit analysis of Dell on‑premises deployments vs.... – Principled Technologies
Conclusion
Diving into the world of GenAI has the potential to yield a great many benefits for your organization, but it first requires consideration for how best to implement those GenAI workloads. Whether your AI goals are to create a chatbot for online visitors, generate marketing materials, aid troubleshooting, or something else, implementing an AI solution requires careful planning and decision-making. A major decision is whether to host GenAI in the cloud or keep your data on premises. Traditional on-premises solutions can provide superior security and control, a substantial concern when dealing with large amounts of potentially sensitive data. But will supporting a GenAI solution on site be a drain on an organization’s IT budget?
In our research, we found that the value proposition is just the opposite: Hosting GenAI workloads on premises, either in a traditional Dell solution or using a managed Dell APEX pay-per-use solution, could significantly lower your GenAI costs over 3 years compared to hosting these workloads in the cloud. In fact, we found that a comparable AWS SageMaker solution would cost up to 3.8 times as much and an Azure ML solution would cost up to 3.6 times as much as GenAI on a Dell APEX pay-per-use solution. These results show that organizations looking to implement GenAI and reap the business benefits to come can find many advantages in an on-premises Dell solution, whether they opt to purchase and manage it themselves or choose a subscription-based Dell APEX pay-per-use solution. Choosing an on-premises Dell solution could save your organization significantly over hosting GenAI in the cloud, while giving you control over the security and privacy of your data as well as any updates and changes to the environment, and while ensuring your environment is managed consistently.
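The 3.8x and 3.6x multipliers above translate directly into 3-year savings; a minimal sketch, where the Dell APEX baseline cost is a hypothetical placeholder rather than a figure from the study:

```python
# Sketch of the 3-year cost comparison. The 3.8x (AWS SageMaker) and
# 3.6x (Azure ML) multipliers are the reported findings; the Dell APEX
# baseline below is a hypothetical placeholder, not a number from the study.
dell_apex_3yr = 1_000_000                    # hypothetical 3-year cost, USD
aws_sagemaker_3yr = 3.8 * dell_apex_3yr      # up to 3.8x, per the study
azure_ml_3yr = 3.6 * dell_apex_3yr           # up to 3.6x, per the study

savings_vs_aws = aws_sagemaker_3yr - dell_apex_3yr
savings_vs_azure = azure_ml_3yr - dell_apex_3yr
print(f"Potential 3-year savings vs. AWS:   ${savings_vs_aws:,.0f}")
print(f"Potential 3-year savings vs. Azure: ${savings_vs_azure:,.0f}")
```

Scaling the baseline up or down scales the savings proportionally, since the comparison is expressed as cost multipliers.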
Workstations powered by Intel can play a vital role in CPU-intensive AI devel...Principled Technologies
In three AI development workflows, Intel processor-powered workstations delivered strong performance, without using their GPUs, making them a good choice for this part of the AI process
Conclusion
We executed three AI development workflows on tower workstations and mobile workstations from three vendors, with each workflow utilizing only the Intel CPU cores, and found that these platforms were suitable for carrying out various AI tasks. For two of the workflows, we learned that completing the tasks on the tower workstations took roughly half as much time as on the mobile workstations. This supports the idea that the tower workstations would be appropriate for a development environment for more complex models with a greater volume of data and that the mobile workstations would be well-suited for data scientists fine-tuning simpler models. In the third workflow, we explored tower workstation performance with different precision levels and learned that using 16-bit floating point precision allowed the workstations to execute the workflow in less time and also reduced memory usage dramatically. For all three AI workflows we executed, we consider the time the workstations needed to complete the tasks to be acceptable, and believe that these workstations can be appropriate, cost-effective choices for these kinds of activities.
Enable security features with no impact to OLTP performance with Dell PowerEd...Principled Technologies
Get comparable online transaction processing (OLTP) performance with or without enabling AMD Secure Memory Encryption and AMD Secure Encrypted Virtualization - Encrypted State
Conclusion
You’ve likely already implemented many security measures for your servers, which may include physical security for the data center, hardware-level security, and software-level security. With the cost of data breaches high and still growing, however, wise IT teams will consider what additional security measures they may be able to implement.
AMD SME and SEV-ES are technologies that are already available within your AMD processor-powered 16th Generation Dell PowerEdge servers—and in our testing, we saw that they can offer extra layers of security without affecting performance. We compared the online transaction processing performance of a Dell PowerEdge R7625 server, powered by AMD EPYC 9274F processors, with and without these two security features enabled. We found that enabling AMD Secure Memory Encryption and Secure Encrypted Virtualization-Encrypted State did not impact performance at all.
If your team is assessing areas where you might be able to enhance security—without paying a large performance cost—consider enabling AMD SME and AMD SEV-ES in your Dell PowerEdge servers.
Improving energy efficiency in the data center: Endure higher temperatures wi...Principled Technologies
In high-temperature test scenarios, a Dell PowerEdge HS5620 server continued running an intensive workload without component warnings or failures, while a Supermicro SYS‑621C-TN12R server failed
Conclusion: Remain resilient in high temperatures with the Dell PowerEdge HS5620 to help increase efficiency
Increasing your data center’s temperature can help your organization make strides in energy efficiency and cooling cost savings. With servers that can hold up to these higher everyday temperatures—as well as high temperatures due to unforeseen circumstances—your business can continue to deliver the performance your apps and clients require.
When we ran an intensive floating-point workload on a Dell PowerEdge HS5620 and a Supermicro SYS-621C-TN12R in three scenario types simulating typical operations at 25°C, a fan failure, and an HVAC malfunction, the Dell server experienced no component warnings or failures. In contrast, the Supermicro server experienced warnings in all three scenario types and experienced component failures in the latter two tests, rendering the system unusable. When we inspected and analyzed each system, we found that the Dell PowerEdge HS5620 server’s motherboard layout, fans, and chassis offered cooling design advantages.
For businesses aiming to meet sustainability goals by running hotter data centers, as well as those concerned with server cooling design, the Dell PowerEdge HS5620 is a strong contender to take on higher temperatures during day-to-day operations and unexpected malfunctions.
Dell APEX Cloud Platform for Red Hat OpenShift: An easily deployable and powe...Principled Technologies
The 4th Generation Intel Xeon Scalable processor‑powered solution deployed in less than two hours and ran a Kubernetes container-based generative AI workload effectively
Conclusion
The appeal of incorporating GenAI into your organization’s operations is likely great. Getting started with an efficient solution for your next LLM workload or application can seem daunting because of the changing hardware and software landscape, but Dell APEX Cloud Platform for Red Hat OpenShift powered by 4th Gen Intel Xeon Scalable processors could provide the solution you need. We started with a Dell Validated Design as a reference, and then went on to modify the deployment as necessary for our Llama 2 workload. The Dell APEX Cloud Platform for Red Hat OpenShift solution worked well for our LLM, and by using this deployment guide in conjunction with numerous Dell documents and some flexibility, you could be well on your way to innovating your next GenAI breakthrough.
Upgrade your cloud infrastructure with Dell PowerEdge R760 servers and VMware...Principled Technologies
Compared to a cluster of PowerEdge R750 servers running VMware Cloud Foundation (VCF)
For organizations running clusters of moderately configured, older Dell PowerEdge servers with a previous version of VCF, upgrading to better-configured modern servers can provide a significant performance boost and more.
Realize 2.1X the performance with 20% less power with AMD EPYC processor-back...Principled Technologies
Three AMD EPYC processor-based two-processor solutions outshined comparable Intel Xeon Scalable processor-based solutions by handling more Redis workload transactions and requests while consuming less power
Conclusion
Performance and energy efficiency are significant factors in processor selection for servers running data-intensive workloads, such as Redis. We compared the Redis performance and energy consumption of a server cluster in three AMD EPYC two-processor configurations against that of a server cluster in two Intel Xeon Scalable two-processor configurations. In each of our three test scenarios, the server cluster backed by AMD EPYC processors outperformed the server cluster backed by Intel Xeon Scalable processors. In addition, one of the AMD EPYC processor-based clusters consumed 20 percent less power than its Intel Xeon Scalable processor-based counterpart. Combining these measurements gave us power efficiency metrics that demonstrate how valuable AMD EPYC processor-based servers could be: you could see better performance per watt with these AMD EPYC processor-based server clusters and potentially get more from your Redis or other data-intensive applications and workloads while reducing data center power costs.
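To illustrate how such a power-efficiency metric combines the two measurements, the sketch below divides throughput by average power draw. This is a minimal, generic illustration; the figures are hypothetical placeholders, not the measured results from our testing.

```python
def performance_per_watt(requests_per_sec: float, avg_watts: float) -> float:
    """Power-efficiency metric: work completed per watt consumed."""
    if avg_watts <= 0:
        raise ValueError("average power draw must be positive")
    return requests_per_sec / avg_watts

# Hypothetical example figures (not our measured results): cluster A
# handles more requests per second while drawing less power.
cluster_a = performance_per_watt(requests_per_sec=210_000, avg_watts=800)
cluster_b = performance_per_watt(requests_per_sec=100_000, avg_watts=1_000)
relative_advantage = cluster_a / cluster_b
```

With these placeholder numbers, cluster A delivers a higher requests-per-watt figure even before considering its raw performance lead, which is why combining the two measurements can change the value calculation.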
Improve performance and gain room to grow by easily migrating to a modern Ope...Principled Technologies
We deployed this modern environment, then migrated database VMs from legacy servers and saw performance improvements that support consolidation
Conclusion
If your organization’s transactional databases are running on gear that is several years old, you have much to gain by upgrading to modern servers with new processors and networking components and an OpenShift environment. In our testing, a modern OpenShift environment with a cluster of three Dell PowerEdge R7615 servers with 4th Generation AMD EPYC processors and high-speed 100Gb Broadcom NICs outperformed a legacy environment with MySQL VMs running on a cluster of three Dell PowerEdge R7515 servers with 3rd Generation AMD EPYC processors and 25Gb Broadcom NICs. We also easily migrated a VM from the legacy environment to the modern environment, with only a few steps required to set up and less than ten minutes of hands-on time. The performance advantage of the modern servers would allow a company to reduce the number of servers necessary to perform a given amount of database work, thus lowering operational expenditures such as power and cooling and IT staff time for maintenance. The high-speed 100Gb Broadcom NICs in this solution also give companies better network performance and networking capacity to grow as they embrace emerging technologies such as AI that put great demands on networks.
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
With more memory available, system performance of three Dell devices increased, which can translate to a better user experience
Conclusion
When your system has plenty of RAM to meet your needs, you can efficiently access the applications and data you need to finish projects and to-do lists without sacrificing time and focus. Our test results show that with more memory available, three Dell PCs delivered better performance and took less time to complete the Procyon Office Productivity benchmark. These advantages translate to users being able to complete workflows more quickly and multitask more easily. Whether you need the mobility of the Latitude 5440, the creative capabilities of the Precision 3470, or the high performance of the OptiPlex Tower Plus 7010, configuring your system with more RAM can help keep processes running smoothly, enabling you to do more without compromising performance.
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
A Principled Technologies deployment guide
Conclusion
Deploying VMware Cloud Foundation 5.1 on next gen Dell PowerEdge servers brings together critical virtualization capabilities and high-performing hardware infrastructure. Relying on our hands-on experience, this deployment guide offers a comprehensive roadmap that can guide your organization through the seamless integration of advanced VMware cloud solutions with the performance and reliability of Dell PowerEdge servers. In addition to the deployment efficiency, the Cloud Foundation 5.1 and PowerEdge solution delivered strong performance while running a MySQL database workload. By leveraging VMware Cloud Foundation 5.1 and PowerEdge servers, you could help your organization embrace cloud computing with confidence, potentially unlocking a new level of agility, scalability, and efficiency in your data center operations.
Upgrade your cloud infrastructure with Dell PowerEdge R760 servers and VMware...Principled Technologies
Compared to a cluster of PowerEdge R750 servers running VMware Cloud Foundation 4.5
Conclusion
If your company is struggling with underperforming infrastructure, upgrading to 16th Generation Dell PowerEdge servers running VCF 5.1 could be just what you need to handle more database throughput and reduce vSAN latencies. We found that a Dell PowerEdge R760 server cluster running VCF 5.1 processed over 78 percent more TPM and 79 percent more NOPM than a Dell PowerEdge R750 server cluster running VCF 4.5. It’s also worth noting that the PowerEdge R750 cluster bottlenecked on vSAN storage, with max write latency at 8.9ms. For reference, the PowerEdge R760 cluster clocked in at 3.8ms max write latency. This higher latency is due in part to the single disk group per host on the moderately configured PowerEdge R750 cluster, while the better-configured PowerEdge R760 cluster supported four disk groups per host. As an additional benefit to IT admins, we also found that the embedded VMware Aria Operation adapter provided useful infrastructure insights.
Based on our research using publicly available materials, it appears that Dell supports nine of the ten PC security features we investigated, HP supports six of them, and Lenovo supports three features.
Increase security, sustainability, and efficiency with robust Dell server man...Principled Technologies
Compared to the Supermicro management portfolio
Conclusion
Choosing a vendor for server purchases is about more than just the hardware platform. Decision-makers must also consider more long-term concerns, including system/data security, energy efficiency, and ease of management. These concerns make the systems management tools a vendor offers as important as the hardware.
We investigated the features and capabilities of server management tools from Dell and Supermicro, comparing Dell iDRAC9 against Supermicro IPMI for embedded server management and Dell OpenManage Enterprise and CloudIQ against Supermicro Server Manager for one-to-many device and console management and monitoring. We found that the Dell management tools provided more comprehensive security, sustainability, and management/monitoring features and capabilities than the Supermicro tools did. In addition, Dell tools automated more tasks to ease server management, resulting in significant time savings for administrators versus having to do the same tasks manually with Supermicro tools.
When making a server purchase, a vendor’s associated management products are critical to protect data, support a more sustainable environment, and to ease the maintenance of systems. Our tests and research showed that the Dell management portfolio for PowerEdge servers offered more features to help organizations meet these goals than the comparable Supermicro management products.
Scale up your storage with higher-performing Dell APEX Block Storage for AWS ...Principled Technologies
In our tests, Dell APEX Block Storage for AWS outperformed similarly configured solutions from Vendor A, achieving more IOPS, better throughput, and more consistent performance on both NVMe-supported configurations and configurations backed by Elastic Block Store (EBS) alone.
Dell APEX Block Storage for AWS supports a full NVMe backed configuration, but Vendor A doesn’t—its solution uses EBS for storage capacity and NVMe as an extended read cache—which means APEX Block Storage for AWS can deliver faster storage performance.
Scale up your storage with higher-performing Dell APEX Block Storage for AWSPrincipled Technologies
Dell APEX Block Storage for AWS offered stronger and more consistent storage performance for better business agility than a Vendor A solution
Conclusion
Enterprises desiring the flexibility and convenience of the cloud for their block storage workloads can find fast-performing solutions with the enterprise storage features they’re used to in on-premises infrastructure by selecting Dell APEX Block Storage for AWS.
Our hands-on tests showed that compared to the Vendor A solution, Dell APEX Block Storage for AWS offered stronger, more consistent storage performance in both NVMe-supported and EBS-backed configurations. Using NVMe-supported configurations, Dell APEX Block Storage for AWS achieved 4.7x the random read IOPS and 5.1x the throughput on sequential read operations per node vs. Vendor A. In our EBS-backed comparison, Dell APEX Block Storage for AWS offered 2.2x the throughput per node on sequential read operations vs. Vendor A.
Plus, the ability to scale beyond three nodes—up to 512 storage nodes with capacity of up to 8 PBs—enables Dell APEX Block Storage for AWS to help ensure performance and capacity as your team plans for the future.
Get in and stay in the productivity zone with the HP Z2 G9 Tower WorkstationPrincipled Technologies
We compared CPU performance and noise output of an HP Z2 G9 Tower Workstation in High Performance Mode to Dell Precision 3660 and 5860 tower workstations in optimized performance modes
Conclusion
HP Z2 G9 Tower Workstation users can change the BIOS settings to dial in the performance mode that best suits their needs: High Performance Mode, Performance Mode, or Quiet Mode. In good news for both creative and technical professionals, we found that an Intel Core i9-13900 processor-powered HP Z2 G9 Tower Workstation set to High Performance Mode received higher CPU-based benchmark scores than both a similarly configured Dell Precision 3660 and a Dell Precision 5860 equipped with an Intel Xeon w5-2455X processor. Plus, the HP Z2 G9 Tower Workstation was quieter while running CPU-intensive Cinebench 2024 and SPECapc for SolidWorks 2022 workloads than both Dell Precision tower workstations. This means HP Z2 G9 Tower Workstation users who prize performance over everything else can do so without sacrificing a quiet workspace.
Open up new possibilities with higher transactional database performance from...Principled Technologies
In our PostgreSQL tests, R7i instances boosted performance over R6i instances with previous-gen processors
If you use the open-source PostgreSQL database to run your critical business operations, you have many cloud options from which to choose. While many of these instances can do the job, some can deliver stronger performance, which can mean getting a greater return on your cloud investment.
We conducted hands-on testing with the HammerDB TPROC-C benchmark to see how the PostgreSQL performance of Amazon EC2 R7i instances, enabled by 4th Gen Intel Xeon Scalable processors, stacked up to that of R6i instances with previous-generation processors. We learned that small, medium-sized, and large R7i instances with the newer processors delivered better OLTP performance, with improvements as high as 13.8 percent. By choosing the R7i instances, your organization has the potential to support more users, deliver a better experience to those users, and even lower your cloud operating expenditures by requiring fewer instances to get the job done.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front end. I have also often seen developers implement front-end features by simply following the standard rules of a framework, thinking that this is enough to launch the project successfully, and then the project fails. How can you prevent this, and which approach should you choose? I have launched dozens of complex projects, and during the talk we will analyze which approaches have worked for me and which have not.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation introduction
UI automation sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology pushes into IT, I found myself wondering, as an “infrastructure container Kubernetes guy,” how this fancy AI technology gets managed from an infrastructure operations point of view. Is it possible to apply our beloved cloud native principles as well? What benefits could the two technologies bring to each other?
Let me take these questions and give you a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premises strategy we may need to apply them to our own infrastructure and get them to work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies and of what could be beneficial for, or limiting to, your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I have already gotten working for real.
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Design advantages of Hadoop ETL offload with the Intel processor-powered Dell | Cloudera | Syncsort solution
JULY 2015
A PRINCIPLED TECHNOLOGIES TEST REPORT
Commissioned by Dell
DESIGN ADVANTAGES OF HADOOP ETL OFFLOAD WITH THE INTEL PROCESSOR-POWERED DELL | CLOUDERA | SYNCSORT SOLUTION
Many companies are adopting Hadoop solutions to handle large amounts of data stored across clusters of servers. Hadoop is a distributed, scalable approach to managing Big Data that is very powerful and can bring great value to organizations. Companies use extract, transform, and load (ETL) jobs to bring together data from many different applications or systems on different hardware in order to modify or adjust the data in some way, and then put it into a new format that they can mine for useful information.

Using traditional ETL can require highly experienced, expensive, and hard-to-find programmers to create jobs in order to extract data. Dell, Cloudera, and Syncsort offer an integrated Hadoop ETL solution that allows entry-level technicians—after only a few days of training—to perform the same tasks that these Hadoop specialists perform, often even more quickly. In our tests, we found that the unique design of the Dell | Cloudera | Syncsort solution can allow an end user with little experience using Hadoop to develop and deploy optimized ETL jobs up to 58.8 percent faster than an expert-driven do-it-yourself (DIY) solution deployed using open-source tools.
SAVE TIME CREATING ETL JOBS WITH THE DELL | CLOUDERA | SYNCSORT SOLUTION
Hadoop implementations suffer from several barriers to effectiveness. They often have primitive integration with infrastructure, and there is currently a lack of available talent to run Hadoop clusters and perform data ingest and processing tasks using the cluster.1 The Dell | Cloudera | Syncsort solution offers help with both of these problems.
The Dell | Cloudera | Syncsort solution is a reference architecture that offers a reliable, tested configuration that incorporates Dell hardware on the Cloudera Hadoop platform, with Syncsort’s DMX-h ETL software. The Dell | Cloudera | Syncsort reference architecture includes four Dell PowerEdge R730xd servers and two Dell PowerEdge R730 servers, powered by the Intel Xeon processor E5-2600 v3 product family.

For organizations that want to optimize their data warehouse environments, the Dell | Cloudera | Syncsort reference architecture can greatly reduce the time needed to deploy Hadoop when using the included setup and configuration documentation as well as the validated best practices. Leveraging the Syncsort DMX-h software means Hadoop ETL jobs can be developed using a graphical interface in a matter of hours, with minor amounts of training, and with no need to spend days developing code. The Dell | Cloudera | Syncsort solution also offers professional services with Hadoop and ETL experts to help fast-track your project to successful completion.2
To understand how fast and easy designing ETL jobs with the Dell | Cloudera |
Syncsort solution can be, we had an entry-level technician and a highly experienced
Hadoop expert work to create three Hadoop ETL jobs using different approaches to
meet the goals of several use cases.
The entry-level worker, who had no familiarity with Hadoop and less than one
year of general server experience, used Syncsort DMX-h to carry out these tasks. Our
expert had 18 years of experience designing, deploying, administering, and
benchmarking enterprise-level relational database management systems (RDBMS). He
has deployed, managed, and benchmarked Hadoop clusters, covering several Hadoop
distributions and several Big Data strategies. He designed and created the use cases
using only free open-source DIY tools.
1 Source: survey of attendees for the 2014 Gartner webinar Hadoop 2.0 Signals Time for Serious Big Data Consideration.
www.informationweek.com/big-data/software-platforms/cloudera-trash-talks-with-enterprise-data-hub-release/d/d-id/1113677
2 Learn more at en.community.dell.com/dell-blogs/dell4enterprise/b/dell4enterprise/archive/2015/06/09/fast-track-data-
strategies-etl-offload-hadoop-reference-architecture
Extract, Transform, and
Load
ETL refers to the
following process in
database usage and data
warehousing:
• Extract the data from
multiple sources
• Transform the data so
it can be stored properly
for querying and analysis
• Load the data into the
final database,
operational data store,
data mart, or data
warehouse
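The three-step process above can be sketched in a few lines of Python. This is a hypothetical, self-contained illustration of the extract-transform-load pattern, not part of the tested solution:

```python
import csv
import io

# Hypothetical mini-ETL pipeline illustrating the three steps above:
# extract rows from pipe-delimited sources, transform (trim fields and
# drop rows with a non-positive amount), and load into a target list
# that stands in for a data warehouse table.

def extract(text):
    """Extract: parse pipe-delimited rows from one source."""
    return list(csv.reader(io.StringIO(text), delimiter="|"))

def transform(rows):
    """Transform: strip whitespace and keep only rows with amount > 0."""
    cleaned = [[field.strip() for field in row] for row in rows]
    return [row for row in cleaned if float(row[1]) > 0]

warehouse = []  # Load target (stands in for the final database or data mart)

source_a = "widget | 3.50\nbroken | -1\n"
source_b = "gadget | 7.25\n"
for source in (source_a, source_b):
    warehouse.extend(transform(extract(source)))

print(warehouse)  # [['widget', '3.50'], ['gadget', '7.25']]
```

Real ETL jobs add schema handling, error routing, and incremental loads, but the extract/transform/load division of labor is the same.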
The implementation experience of these two workers showed us that using the
Dell | Cloudera | Syncsort solution was faster, easier, and—because a less experienced
employee could use it to create ETL jobs—far less expensive to implement. In this cost
analysis, we apply those findings to a hypothetical large enterprise.
Figure 1: Our entry-level
employee completed the three
ETL design jobs using the Dell |
Cloudera | Syncsort solution in
less than half the time the much
more experienced engineer
required using other tools.
(Lower numbers are better.)
As Figure 1 shows, the beginner using the Dell | Cloudera | Syncsort solution
was able to complete the three ETL design jobs in a total of 31 hours, whereas the
senior engineer using open-source tools needed 68 hours—more than twice as long.
Note that in addition to actually coding/designing the use cases, both of the workers
spent time familiarizing themselves with the requirements, revising to correct output
issues, and validating results.
Along with the savings3
that come from having a less highly compensated
employee perform the design work more quickly, the Dell | Cloudera | Syncsort solution
offers another avenue to cost-effectiveness: performance. Additional testing in the PT
labs4
revealed that due to their extreme efficiency, the ETL jobs our entry-level worker
created using Syncsort DMX-h ran more quickly than those our highly compensated
expert created. This can lead to savings in server utilization.
In this paper, we provide a quick overview of the Dell | Cloudera | Syncsort
solution and then discuss the experience the entry-level technician had while receiving
training and using the Dell | Cloudera | Syncsort solution.
3 Cost advantages of Hadoop ETL offload with the Intel processor-powered Dell | Cloudera | Syncsort solution
http://www.principledtechnologies.com/Dell/Dell_Cloudera_Syncsort_cost_0715.pdf
4 Performance advantages of Hadoop ETL offload with the Intel processor-powered Dell | Cloudera | Syncsort solution
www.principledtechnologies.com/Dell/Dell_Cloudera_Syncsort_performance_0715.pdf
ABOUT SYNCSORT DMX-h
Syncsort DMX-h is high-performance data integration software that runs
natively in Hadoop, providing everything needed to collect, prepare, blend, transform,
and distribute data. DMX-h, with its Intelligent Execution, allows users to graphically
design sophisticated data flows once and deploy on any compute framework (Apache
MapReduce, Spark, etc. on premise or in the cloud), future-proofing the applications
while eliminating the need for coding.
Using an architecture that runs ETL processing natively in Hadoop, without code
generation, Syncsort DMX-h lets users maximize performance without compromising on
the capabilities and typical use cases of conventional ETL tools. In addition, the software
packages industrial-grade capabilities to deploy, manage, monitor, and secure your
Hadoop environment.
Figure 2: How the Dell | Cloudera | Syncsort solution for Hadoop works.
Syncsort SILQ®
Syncsort SILQ is a technology that pairs well with DMX-h. SILQ is a SQL offload
utility designed to help users visualize and move their expensive data warehouse
(SQL) data integration workloads into Hadoop. SILQ supports a wide range of SQL
flavors and can parse thousands of lines of SQL code in seconds, outputting logic
flowcharts, job analysis, and DMX-h jobs. SILQ has the potential to take an
overwhelming SQL workload migration process and make it simple and efficient.
About Dell | Cloudera | Syncsort training and professional services
An integral part of the Dell | Cloudera | Syncsort solution is the on-site training
that the company makes available for a fee. This training typically lasts four days.
However, because of our tight time frame, our technician worked with the trainer for
two and a half intensive days. The points below are based on the reflections of the
technician:
• The training sessions were a mix of lecture, lessons, and question and
answer.
• The technician logged into Syncsort’s training labs, where VMs were set up
with Hadoop clusters ready for practicing on.
• The technician was able to ask the instructor anything, which helped him
get a sense of best practices and how other people are currently using the
product.
• In addition to learning how to use DMX-h, the technician also learned how
to use Syncsort’s documentation effectively, which made building jobs much
easier.
• By the end of the training, the technician felt he had been well exposed to
every area of Syncsort DMX-h.
Another important component of the Dell | Cloudera | Syncsort solution is
ongoing enablement assistance that the company makes available for a fee. (Technical
support to answer questions about the product is included with the purchase price.) The
points below are based on the reflections of the technician:
• After setting up his first job in a matter of hours, our technician sought
approval from Syncsort. Their enablement assistance team helped him
finalize and tweak his job for performance.
• When our technician had a question about whether his output was formatted
correctly, the Syncsort team looked at his job definitions and provided
feedback.
THE ETL JOBS OUR WORKERS DESIGNED
In our tests, both the entry-level technician and the senior engineer used the
same source dataset on the HDFS cluster, generated from the TPC-H dataset at scale
factor 200. The entry-level technician used Syncsort DMX-h to develop the ETL job for
each use case, while the senior engineer had the flexibility to use the publicly
available open-source tools of his choice. The senior engineer chose Apache®
Pig over other commonly used tools, such as Apache Hive, because of its procedural
nature.
For each of the three use cases, we compared the resulting output of the two
ETL jobs to ensure that they achieved the same business goals. See Appendix A for more
details on our ETL jobs.
Use case 1: Fact dimension load with Type 2 Slowly Changing Dimensions (SCD)
Businesses often feed their Business Intelligence decisions using the
Dimensional Fact Model, which uses a set of dimensional tables (tables with active and
historical data about a specific category, such as Geography or Time) to feed a fact table
(a table that joins the active records of all the dimensional tables on a common field)
summarizing current business activity. While some dimension tables never change,
others are updated often. The dimension tables must be up-to-date to feed the
fact table correctly, and many businesses would like to keep historical data as well for
archival purposes.
A common method to identify changes in data is to compare historical and
current data using Changed Data Capture (CDC). A common method for updating
records while retaining the outdated record is Type 2 Slowly Changing Dimensions
(SCD). We used outer joins and conditional formatting to implement CDC and identify
changed records. We then implemented Type 2 SCD to mark outdated records with an
end-date and insert the updated data as a new, current record (in our case we used a
perpetual end-date of 12/31/9999 to specify a current record). We then fed the
information from the dimensional tables downstream to a fact table that could be
queried for Business Intelligence.
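As a rough sketch of the CDC and Type 2 SCD logic described above, the following Python compares current and historical records, end-dates changed rows, and inserts new versions. Field names and dates are hypothetical, and the actual jobs were built in DMX-h and Pig:

```python
OPEN_REC = "12/31/9999"  # perpetual end-date marks a record as current
TODAY = "07/01/2015"     # fixed "load date" for this example

def scd2_update(history, updates):
    """Apply Type 2 SCD: expire changed records and insert new versions."""
    current = {r["key"]: r for r in history if r["end"] == OPEN_REC}
    out = [r for r in history if r["end"] != OPEN_REC]  # keep expired rows
    for key, row in current.items():
        new_value = updates.get(key)
        if new_value is None or new_value == row["data"]:
            out.append(row)  # unchanged: carry the current record forward
        else:
            # CDC found a change: end-date the old record, insert the new one
            out.append({**row, "end": TODAY})
            out.append({"key": key, "data": new_value,
                        "start": TODAY, "end": OPEN_REC})
    for key, data in updates.items():  # brand-new keys become new records
        if key not in current:
            out.append({"key": key, "data": data,
                        "start": TODAY, "end": OPEN_REC})
    return out

history = [{"key": 1, "data": "ACME", "start": "01/01/2014", "end": OPEN_REC}]
updated = scd2_update(history, {1: "ACME Corp", 2: "Globex"})
# Record 1 is expired and re-inserted with the new value; key 2 is brand new.
```

The production jobs implement the same comparison with outer joins over the update and history tables rather than in-memory dictionaries.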
Figure 3 shows the Syncsort DMX-h job editor with the completed use case 1
design. The individual boxes represent tasks, while the arrows between the tasks
represent data flow.
Figure 3: Syncsort DMX-h
job editor showing the
completed use case 1
design.
Both the entry-level technician using the Dell | Cloudera | Syncsort solution and
the senior engineer using open-source DIY tools were able to design effective jobs to
perform Type 2 Slowly Changing Dimensions to feed the Dimensional Fact Model. The
beginner was able to design the job 47.8 percent faster than the senior engineer (see
Figure 4).
Figure 4: Time to design the
Fact dimension load with Type
2 SCD use case. (Lower
numbers are better.)
Use case 2: Data validation and pre-processing
Data validation is an important part of any ETL process. It is important for a
business to ensure that only usable data makes it into its data warehouse, and to mark
any unusable data with an error message explaining why it is invalid. Data validation can
be performed with conditions and filters to check for and discard data that is badly
formatted or nonsensical.
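The pattern of conditions and filters described above can be sketched in Python. The rule set, field names, and messages here are hypothetical; the actual job expressed them as DMX-h conditions and filters:

```python
# Each rule is a (check, error-message) pair; rows that fail any check are
# routed to an error output tagged with every reason they failed, while
# rows that pass all checks continue downstream.

RULES = [
    (lambda r: r["quantity"] > 0, " invalid QUANTITY value;"),
    (lambda r: 0.0 <= r["discount"] <= 1.0, " invalid DISCOUNT value;"),
    (lambda r: r["returnflag"] in ("A", "N", "R"),
     " RETURNFLAG must be one of A,N,R;"),
]

def validate(rows):
    """Split rows into (good, bad); bad rows carry an error message."""
    good, bad = [], []
    for row in rows:
        errors = [msg for check, msg in RULES if not check(row)]
        if errors:
            bad.append({**row,
                        "error": "ERROR(S) in source record:" + "".join(errors)})
        else:
            good.append(row)
    return good, bad

good, bad = validate([
    {"quantity": 5, "discount": 0.10, "returnflag": "N"},   # passes all checks
    {"quantity": -2, "discount": 1.50, "returnflag": "X"},  # fails all three
])
# good holds the first row; bad holds the second with all three messages.
```

Accumulating every failure reason into one message, rather than stopping at the first, is what makes the error output useful for fixing upstream data.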
Figure 5 shows the Syncsort DMX-h job editor with the completed use case 2
design, which pictures the metadata panel showing a sample of the conditions used to
reformat and validate the data.
Figure 5: Syncsort DMX-h job
editor showing the completed
use case 2 design.
Both the entry-level technician using the Dell | Cloudera | Syncsort solution and
the senior engineer using open-source DIY tools were able to design effective jobs to
perform accurate data validation, but the beginner was able to design the job 55.6
percent faster than the senior engineer using Apache Pig (see Figure 6). The Syncsort
DMX-h graphical user interface made it easy to quickly try new filters and sample the
outputs. Being able to rapidly prototype the data validation job meant less time spent
revising.
Figure 6: Time to create the
Data validation and pre-
processing use case. (Lower
numbers are better.)
Use case 3: Vendor mainframe file integration
The need to reformat heterogeneous data sources into a more usable format is
a common requirement of businesses with many varied data sources. Often companies
have high-value data locked away in the mainframe and want to assimilate that data
into the cluster. Some mainframe data types use a COBOL copybook to specify field
formats. Syncsort DMX-h makes the process of reformatting mainframe data sources
easy by allowing the user to link a COBOL copybook into metadata and automatically
interpret the flat file’s record layout—one of many data integration features Syncsort
DMX-h offers. Syncsort DMX-h enabled an inexperienced IT person with no prior
exposure to mainframe-formatted files to interpret these legacy data types, all without
disturbing the integrity of the original data.
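As a rough illustration of what the copybook-driven interpretation automates, decoding one fixed-width EBCDIC record by hand might look like the following Python. The field layout here is hypothetical; DMX-h derives the real layout from the COBOL copybook:

```python
# Field layout of the kind a COBOL copybook describes: (name, offset, length).
LAYOUT = [("CUST-ID", 0, 6), ("CUST-NAME", 6, 10), ("BALANCE", 16, 5)]

def decode_record(raw: bytes) -> dict:
    """Decode one fixed-width EBCDIC (code page 037) record to ASCII fields."""
    return {name: raw[off:off + length].decode("cp037").strip()
            for name, off, length in LAYOUT}

# Build a sample EBCDIC record by encoding an ASCII string, standing in
# for a record read from a mainframe flat file.
raw = "000042JOHN DOE  00350".encode("cp037")
print(decode_record(raw))
# {'CUST-ID': '000042', 'CUST-NAME': 'JOHN DOE', 'BALANCE': '00350'}
```

Real copybooks also describe packed-decimal (COMP-3) and binary fields, which is exactly the tedium the DMX-h metadata import removes.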
Figure 7 shows the Syncsort DMX-h job editor with the completed use case 3
design. The task shows a single mainframe-formatted data source being split into two
ASCII-formatted outputs.
Figure 7: Syncsort DMX-h
job editor showing the
completed use case 3
design.
Both the entry-level technician using the Dell | Cloudera | Syncsort solution and
the senior engineer using open source DIY tools were able to design effective jobs to
reformat the mainframe data successfully, but the beginner was able to design the job
58.8 percent faster (see Figure 8).
Figure 8: Time to create the
Vendor mainframe file
integration use case. (Lower
numbers are better.)
CONCLUSION
High-level Hadoop analysis requires custom solutions to deliver the data that
you need, and the amount of time that even senior engineers require to create ETL jobs
in a DIY hardware and software situation can be substantial.
We found that the Dell | Cloudera | Syncsort solution was so easy to use that an
entry-level employee could use it to create optimized ETL jobs after only a few days of
training. And he could do it quickly—our technician, who had no previous experience
using Hadoop, developed three optimized ETL jobs in 31 hours. That is less than half of
the 68 hours our expert with years of Hadoop experience needed to create the same
jobs using open source tools.
Using the Dell | Cloudera | Syncsort solution means that your organization can
implement a Hadoop solution using employees already on your staff rather than trying
to recruit expensive, difficult-to-find specialists. Not only that, but the projects can be
completed in a fraction of the time. This makes the Dell | Cloudera | Syncsort solution a
winning business proposition.
APPENDIX A – HOW WE TESTED
Installing the Dell | Cloudera Apache Hadoop Solution
We installed Cloudera Hadoop (CDH) version 5.4 onto our cluster by following the “Dell | Cloudera Apache
Hadoop Solution Deployment Guide – Version 5.4” with some modifications. The following is a high-level summary of
this process.
Configuring the Dell Force10 S55 and Dell PowerConnect S4810 switches
We used the Dell Force10 S55 switch for 1GbE external management access from our lab to the Edge Node. We
configured two Dell PowerConnect S4810 switches for redundant 10GbE cluster traffic.
Configuring the BIOS, firmware, and RAID settings on the hosts
We used the Dell Deployment Tool Kit to configure our hosts before OS installation. We performed these steps
on each host.
1. Boot into the Dell DTK USB drive using BIOS boot mode.
2. Once the CentOS environment loads, choose the node type (infrastructure or storage), and enter the iDRAC
connection details.
3. Allow the system to boot into Lifecycle Controller and apply the changes. Once this is complete, the system will
automatically reboot once more.
Installing the OS on the hosts
We installed CentOS 6.5 using a kickstart file with the settings recommended by the Deployment Guide. We
performed these steps on each node.
1. Boot into a minimal CentOS ISO and press Tab at the splash screen to enter boot options.
2. Enter the kickstart string and required options, and press Enter to install the OS.
3. When the OS is installed, run yum updates on each node, and reboot to fully update the OS.
Installing Cloudera Manager and distributing CDH to all nodes
We used Installation Path A in the Cloudera support documentation to guide our Hadoop installation. We chose
to place Cloudera Manager on the Edge Node so that we could easily access it from our lab network.
1. On the Edge Node, use wget to download the latest cloudera-manager-installer.bin, located on
archive.cloudera.com.
4. Run the installer and select all defaults.
5. Navigate to Cloudera Manager by pointing a web browser to
http://<Edge_Node_IP_address>:7180.
6. Log into Cloudera Manager using the default credentials admin/admin.
7. Install the Cloudera Enterprise Data Hub Edition Trial with the following options:
a. Enter each host’s IP address.
b. Leave the default repository options.
c. Install the Oracle Java SE Development Kit (JDK).
d. Do not check the single user mode checkbox.
e. Enter the root password for host connectivity.
8. After the Host Inspector checks the cluster for correctness, choose the following Custom Services:
a. HDFS
b. Hive
c. Hue
d. YARN (MR2 Included)
9. Assign roles to the hosts using Figure 9.
HBase
  Master: nn01
  HBase REST Server: nn01
  HBase Thrift Server: nn01
  Region Server: nn01
HDFS
  NameNode: nn01
  Secondary NameNode: en01
  Balancer: en01
  HttpFS: nn01
  NFS Gateway: nn01
  DataNode: dn[01-04]
Hive
  Gateway: [all nodes]
  Hive Metastore Server: en01
  WebHCat Server: en01
  HiveServer2: en01
Hue
  Hue Server: en01
Impala
  Catalog Server: nn01
  Impala StateStore: nn01
  Impala Daemon: nn01
Key-Value Store Indexer
  Lily HBase Indexer: nn01
Cloudera Management Service
  Service Monitor: en01
  Activity Monitor: en01
  Host Monitor: en01
  Reports Manager: en01
  Event Server: en01
  Alert Publisher: en01
Oozie
  Oozie Server: nn01
Solr
  Solr Server: nn01
Spark
  History Server: nn01
  Gateway: nn01
Sqoop 2
  Sqoop 2 Server: nn01
YARN (MR2 Included)
  ResourceManager: nn01
  JobHistory Server: nn01
  NodeManager: dn[01-04]
ZooKeeper
  Server: nn01, en01, en01
Figure 9: Role assignments.
10. At the Database Setup screen, copy down the embedded database credentials and test the connection. If the
connections are successful, proceed through the wizard to complete the Cloudera installation.
Installing the Syncsort DMX-h environment
Installation of the Syncsort DMX-h environment involves installing the Job Editor onto a Windows server,
distributing the DMX-h parcel to all Hadoop nodes, and installing the dmxd service onto the NameNode. We used the
30-day trial license in our setup.
Installing Syncsort DMX-h onto Windows
We used a Windows VM with access to the NameNode to run the Syncsort DMX-h job editor.
1. Run dmexpress_8-1-0_windows_x86.exe on the Windows VM and follow the wizard steps to install the
job editor.
Distributing the DMX-h parcel via Cloudera Manager
We downloaded the DMX-h parcel to the Cloudera parcel repository and used Cloudera Manager to pick it up
and send it to every node.
1. Copy dmexpress-8.1.7-el6.parcel_en.bin to the EdgeNode and set execute permissions for the
root user.
2. Run dmexpress-8.1.7-el6.parcel_en.bin and set the extraction directory to /opt/cloudera/parcel-repo.
3. In Cloudera Manager, navigate to the Parcels section and distribute the DMExpress parcel to all nodes.
Installing the dmxd daemon on the NameNode
We placed the dmxd daemon on the NameNode in order to have it in the same location as the YARN
ResourceManager.
1. Copy dmexpress-8.1.7-1.x86_64_en.bin to the NameNode and set execute permissions for the root
user.
2. Run dmexpress-8.1.7-1.x86_64_en.bin to install the dmxd daemon.
Post-install configuration
We made a number of changes to the cluster in order to suit our environment and increase performance.
Relaxing HDFS permissions
We allowed the root user to read and write to HDFS, in order to simplify the process of performance testing.
1. In Cloudera Manager, search for “Check HDFS Permissions” and uncheck the HDFS (Service-Wide) checkbox.
Setting YARN parameters
We made a number of parameter adjustments to increase resource limits for our map-reduce jobs. These
parameters can be found using the Cloudera Manager search bar. Figure 10 shows the parameters we changed.
Parameter New value
yarn.nodemanager.resource.memory-mb 80 GiB
yarn.nodemanager.resource.cpu-vcores 35
yarn.scheduler.maximum-allocation-mb 16 GiB
Figure 10: YARN resource parameter adjustments.
Custom XML file for DMX-h jobs
We created an XML file to set cluster parameters for each job. In the Job Editor, set the environment variable
DMX_HADOOP_CONF_FILE to the XML file path. The contents of the XML file are below.
<?xml version="1.0"?>
<configuration>
<!-- Specify map vcores resources -->
<property>
<name>mapreduce.map.cpu.vcores</name>
<value>2</value>
</property>
<!-- Specify reduce vcores resources -->
<property>
<name>mapreduce.reduce.cpu.vcores</name>
<value>4</value>
</property>
<!-- Specify map JVM Memory resources -->
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx2048m</value>
</property>
<!-- Specify reduce JVM Memory resources -->
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx7168m</value>
</property>
<!-- Specify map Container Memory resources -->
<property>
<name>mapreduce.map.memory.mb</name>
<value>2560</value>
</property>
<!-- Specify reduce Container Memory resources -->
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>8704</value>
</property>
<!-- Specify reducers to be used -->
<property>
<name>mapreduce.job.reduces</name>
<value>32</value>
</property>
</configuration>
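One design note on these values: each container is sized larger than its JVM heap, leaving headroom for JVM overhead (thread stacks, metaspace, native buffers) so the NodeManager does not kill containers for exceeding their memory limit. A quick check of the values above:

```python
# Container sizes versus JVM heap sizes from the XML above (in MB).
settings = {
    "map":    {"heap_mb": 2048, "container_mb": 2560},
    "reduce": {"heap_mb": 7168, "container_mb": 8704},
}
for phase, s in settings.items():
    headroom = s["container_mb"] - s["heap_mb"]
    assert headroom > 0  # container must exceed heap
    print(f"{phase}: {headroom} MB of headroom")  # map: 512, reduce: 1536
```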
Creating the Syncsort DMX-h use cases
In our testing, we measured the time required to design and create DMX-h jobs for three use cases. Screenshots
of the DMX-h jobs for each use case appear below.
Use case 1: Fact dimension load with Type 2 Slowly Changing Dimensions (SCD)
We used outer joins and conditional reformatting to implement Type 2 SCD for Use Case 1. Figure 11 shows the
UC1 job layout.
Figure 11: Use Case 1 job
layout.
Use case 2: Data validation and pre-processing
We used a copy task with conditional filters to implement Data Validation for Use Case 2. Figure 12 shows the
UC2 job layout.
Figure 12: Use Case 2 job
layout.
Use case 3: Vendor mainframe file integration
We used a copy task with imported metadata to implement Vendor Mainframe File Integration for use case 3.
Figure 13 shows the UC3 job layout.
Figure 13: Use Case 3 job
layout.
Creating the DIY use cases
We used Pig, Python, and Java to implement the DIY approaches to each use case. The DIY code for each use case
appears below.
Use case 1: Fact dimension load with Type 2 Slowly Changing Dimensions (SCD)
set pig.maxCombinedSplitSize 2147483648
set pig.exec.mapPartAgg true
-- uc1
-----------------
-----------------
-- parameters and constants
-- input and output files
%DECLARE IN_O_UPDATES '/UC1/200/Source/orders.tbl'
%DECLARE IN_P_UPDATES '/UC1/200/Source/part.tbl'
%DECLARE IN_S_UPDATES '/UC1/200/Source/supplier.tbl'
%DECLARE IN_O_HISTORY '/UC1/200/Source/ORDERS_DIM_HISTORY'
%DECLARE IN_P_HISTORY '/UC1/200/Source/PART_DIM_HISTORY'
%DECLARE IN_S_HISTORY '/UC1/200/Source/SUPPLIER_DIM_HISTORY'
%DECLARE IN_LINEITEM '/UC1/200/Source/lineitem.tbl'
%DECLARE OUT_O_HISTORY '/DIY/200/UC1/ORDERS_DIM_HISTORY'
%DECLARE OUT_P_HISTORY '/DIY/200/UC1/PART_DIM_HISTORY'
%DECLARE OUT_S_HISTORY '/DIY/200/UC1/SUPPLIER_DIM_HISTORY'
%DECLARE OUT_FACT '/DIY/200/UC1/SOLD_PARTS_DIM_FACT'
-- "interior" fields for the i/o tables. needed to assist with column projections
%DECLARE HI_O_FIELDS
'ho_custkey,ho_orderstatus,ho_totalprice,ho_orderdate,ho_orderpriority,ho_clerk,ho_shippr
iority'
%DECLARE HI_P_FIELDS
'hp_name,hp_mfgr,hp_brand,hp_type,hp_size,hp_container,hp_retailprice'
%DECLARE HI_S_FIELDS 'hs_name,hs_address,hs_nationkey,hs_phone,hs_acctbal'
%DECLARE UP_O_FIELDS
'uo_custkey,uo_orderstatus,uo_totalprice,uo_orderdate,uo_orderpriority,uo_clerk,uo_shippr
iority'
%DECLARE UP_P_FIELDS
'up_name,up_mfgr,up_brand,up_type,up_size,up_container,up_retailprice'
%DECLARE UP_S_FIELDS 'us_name,us_address,us_nationkey,us_phone,us_acctbal'
-- Option to use replicated JOIN for the supplier name lookup.
-- use it
%DECLARE USE_REP_JOIN 'USING 'REPLICATED'' ;
-- don’t use it
-- %DECLARE USE_REP_JOIN ' ' ;
-- tag in end-date fields to signify an active record
%DECLARE OPEN_REC '12/31/9999'
-- tags for start/end date fields
%DECLARE TODAY_REC `date +%m/%d/%Y`
%DECLARE YESTE_REC `date --date="1 days ago" +%m/%d/%Y`
IMPORT 'uc1_macros.pig';
-----------------
-----------------
-- supplier, new history
supplier = update_history('$IN_S_UPDATES', '$IN_S_HISTORY', '$UP_S_FIELDS',
'$HI_S_FIELDS');
STORE supplier INTO '$OUT_S_HISTORY' USING PigStorage('|');
-- part, new history
part = update_history('$IN_P_UPDATES', '$IN_P_HISTORY', '$UP_P_FIELDS', '$HI_P_FIELDS');
STORE part INTO '$OUT_P_HISTORY' USING PigStorage('|');
-- orders, new history
orders = update_history('$IN_O_UPDATES', '$IN_O_HISTORY', '$UP_O_FIELDS',
'$HI_O_FIELDS');
STORE orders INTO '$OUT_O_HISTORY' USING PigStorage('|');
----------------------------------------------
----------------------------------------------
-- drop expired records
supplier = FILTER supplier BY h_enddate == '$OPEN_REC';
part = FILTER part BY h_enddate == '$OPEN_REC';
orders = FILTER orders BY h_enddate == '$OPEN_REC';
-- data for fact table
lineitem = LOAD '$IN_LINEITEM' USING PigStorage('|') AS (
l_orderkey,l_partkey,l_suppkey,
l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,
l_linestatus,l_shipdate,l_commitdate,
l_receiptdate,l_shipinstruct,l_shipmode,l_comment
);
-- dereference supplier and save required keys (drop l_suppkey)
lineitem = JOIN lineitem by l_suppkey LEFT OUTER, supplier by h_key;
lineitem = FOREACH lineitem GENERATE
l_orderkey,l_partkey,
hs_name,hs_nationkey,hs_acctbal,supplier::h_startdate,l_linenumber .. l_comment;
-- dereference part and save required keys (drop l_partkey)
lineitem = JOIN lineitem by l_partkey LEFT OUTER, part by h_key;
lineitem = FOREACH lineitem GENERATE
l_orderkey,
hp_name,hp_mfgr,hp_brand,hp_type,hp_retailprice,part::h_startdate,hs_name ..
l_comment;
-- dereference orders and save required keys (drop l_orderkey)
lineitem = JOIN lineitem by l_orderkey LEFT OUTER, orders by h_key;
lineitem = FOREACH lineitem GENERATE
ho_custkey,ho_orderstatus,ho_totalprice,ho_orderdate,orders::h_startdate,hp_name ..
l_comment;
----------------------------------------------
----------------------------------------------
STORE lineitem INTO '$OUT_FACT' USING PigStorage('|');
uc1_macros.pig
DEFINE update_history(in_updates, in_history, update_fields, history_fields) RETURNS
history {
-- update tables
updates = LOAD '$in_updates' USING PigStorage('|') AS (
u_key, $update_fields, u_comment
);
-- historical tables
historical = LOAD '$in_history' USING PigStorage('|') AS (
h_key, $history_fields, h_comment, h_startdate:chararray,h_enddate:chararray
);
-- remove expired records from the historical data and save for final table
SPLIT historical INTO
in_unexpired IF (h_enddate matches '^$OPEN_REC$'),
in_expired OTHERWISE;
-- full join by primary key to determine matches and left/right uniques
joined = JOIN
updates by u_key FULL OUTER,
in_unexpired by h_key;
SPLIT joined INTO
new IF h_key IS NULL, -- totally new entry from updates
old IF u_key IS NULL, -- unmatched historical entry
matched OTHERWISE; -- match primary key: either redundant or updated entry
-- format new and old entries for output
new = FOREACH new GENERATE u_key .. u_comment, '$TODAY_REC', '$OPEN_REC';
old = FOREACH old GENERATE h_key .. h_enddate;
-- find updated entries
SPLIT matched INTO
updates IF ( TOTUPLE($update_fields, u_comment) != TOTUPLE($history_fields,
h_comment) ),
redundant OTHERWISE;
-- format redundant entries for output
redundant = FOREACH redundant GENERATE h_key .. h_enddate;
-- updated entry: expire historical; tag update
modified = FOREACH updates GENERATE u_key .. u_comment, '$TODAY_REC', '$OPEN_REC';
expired = FOREACH updates GENERATE h_key .. h_startdate, '$YESTE_REC';
-- combine the record classes
$history = UNION in_expired, new, old, redundant, modified, expired;
};
Use case 2: Data validation and pre-processing
set pig.maxCombinedSplitSize 1610612736
set pig.exec.mapPartAgg true
-----------------------------
-----------------------------
-- parameter and constants definitions
-- input and output files
%DECLARE IN_LINEITEM_VENDOR '/UC2/200/Source/lineitem_vendorx.tbl'
%DECLARE IN_SUPPLIER '/UC2/200/Source/supplier.tbl'
%DECLARE OUT_GOOD '/DIY/200/UC2/lineitem_validated'
%DECLARE OUT_BAD '/DIY/200/UC2/lineitem_errors'
-- Option to use replicated JOIN for the supplier name lookup.
-- use it
%DECLARE USE_REP_JOIN 'USING 'REPLICATED'' ;
-- don’t use it
-- %DECLARE USE_REP_JOIN ' ' ;
-- REGEX to match valid Gregorian dates in the form yyyy-mm-dd for
-- years 0000 to 9999 (as a mathematical concept, not political).
%DECLARE VALID_DATE '^(?:\\d{4}-(?:(?:0[1-9]|1[0-2])-(?:0[1-9]|1\\d|2[0-8])|(?:0[13-9]|1[0-2])-(?:29|30)|(?:0[13578]|1[02])-31)|(?:\\d{2}(?:[02468][48]|[13579][26]|[2468]0)|(?:[02468][048]|[13579][26])00)-02-29)$'
-- header for invalid data error message field and other error messages
%DECLARE ERR_MSG_HEAD 'ERROR(S) in LINEITEM source record:'
%DECLARE ORD_ERR ' invalid L_ORDERKEY value;'
%DECLARE PAR_ERR ' invalid L_PARTKEY value;'
%DECLARE SUP_ERR ' invalid L_SUPPKEY value;'
%DECLARE LIN_ERR ' invalid L_LINENUMBER value;'
%DECLARE QUN_ERR ' invalid L_QUANTITY value;'
%DECLARE DIS_ERR ' invalid L_DISCOUNT value;'
%DECLARE TAX_ERR ' invalid L_TAX value;'
%DECLARE RET_ERR ' L_RETURNFLAG must be one of A,N,R;'
%DECLARE LIS_ERR ' L_LINESTATUS must be one of O,F;'
%DECLARE SHI_ERR ' L_SHIPMODE must be one of AIR,FOB,MAIL,RAIL,REG AIR,SHIP,TRUCK;'
%DECLARE NAM_ERR ' supplier lookup failed;'
%DECLARE COD_ERR ' invalid L_COMMITDATE;'
%DECLARE SHD_ERR ' invalid L_SHIPDATE;'
%DECLARE RED_ERR ' invalid L_RECEIPTDATE;'
%DECLARE SRD_ERR ' L_SHIPDATE after L_RECEIPTDATE;'
-----------------------------
-----------------------------
-- start of the main program
-- data to be validated
lineitem = LOAD '$IN_LINEITEM_VENDOR' USING PigStorage('|') AS (
l_orderkey:int, l_partkey:int, l_suppkey:int, l_linenumber:int, l_quantity:int,
l_extendedprice:float, l_discount:float, l_tax:float,
l_returnflag:chararray, l_linestatus:chararray,
l_shipdate:chararray, l_commitdate:chararray, l_receiptdate:chararray,
l_shipinstruct:chararray, l_shipmode:chararray,
l_comment);
-- lookup table for supplier name and validation
-- only first two columns from supplier
suppnames = LOAD '$IN_SUPPLIER' USING PigStorage('|') AS (
s_suppkey:int, s_name:chararray);
-- anti-join idiom (completed below) for the supplier lookup (s_name is null if unmatched)
-- We use the known cardinality of the TPC-H tables (supplier is small;
-- lineitem is huge) to choose the JOIN type. For example, at TPC-H SF 1000,
-- the entire supplier table is about 1.4GB and 10,000,000 rows, so its
-- two-field projection easily fits in RAM on each node -- hence replicated JOIN
good1 = JOIN lineitem BY l_suppkey LEFT OUTER, suppnames BY s_suppkey $USE_REP_JOIN ;
good2 = FOREACH good1 GENERATE
--
-- start of error message
CONCAT('$ERR_MSG_HEAD', CONCAT(
(l_orderkey > 0 ? '' : '$ORD_ERR'), CONCAT(
(l_partkey > 0 ? '' : '$PAR_ERR'), CONCAT(
(l_suppkey > 0 ? '' : '$SUP_ERR'), CONCAT(
(l_linenumber > 0 ? '' : '$LIN_ERR'), CONCAT(
(l_quantity > 0 ? '' : '$QUN_ERR'), CONCAT(
((l_discount >= 0.0F AND l_discount <= 1.0F)
? '' : '$DIS_ERR'), CONCAT(
(l_tax >= 0.0F ? '' : '$TAX_ERR'), CONCAT(
(l_returnflag IN ('A','N','R')
? '' : '$RET_ERR'), CONCAT(
(l_linestatus IN ('F','O')
? '' : '$LIS_ERR'), CONCAT(
(l_shipmode IN ('AIR','FOB','MAIL','RAIL','REG AIR','SHIP','TRUCK')
? '' : '$SHI_ERR'), CONCAT(
(s_name is NOT NULL ? '' : '$NAM_ERR'), CONCAT(
(l_commitdate matches '$VALID_DATE'
? '' : '$COD_ERR'), CONCAT(
(l_shipdate matches '$VALID_DATE'
? '' : '$SHD_ERR'), CONCAT(
(l_receiptdate matches '$VALID_DATE'
? '' : '$RED_ERR'), -- last CONCAT
((l_shipdate matches '$VALID_DATE' AND l_receiptdate matches '$VALID_DATE')
? ((DaysBetween(ToDate(l_receiptdate,'yyyy-MM-dd'),ToDate(l_shipdate,'yyyy-MM-dd')) >= 0)
? '' : '$SRD_ERR' )
: '')
))))))))))))))) AS err_message,
-- end of error message
--
l_orderkey .. l_suppkey, s_name, l_linenumber .. l_shipmode,
--
-- start of shipping time and rating as a TUPLE
((l_shipdate matches '$VALID_DATE' AND l_receiptdate matches '$VALID_DATE')
? TOTUPLE((int)DaysBetween(ToDate(l_receiptdate,'yyyy-MM-dd'),ToDate(l_shipdate,'yyyy-MM-dd')),
(CASE (int)DaysBetween(ToDate(l_receiptdate,'yyyy-MM-dd'),ToDate(l_shipdate,'yyyy-MM-dd'))
WHEN 0 THEN 'A' WHEN 1 THEN 'A' WHEN 2 THEN 'A'
WHEN 3 THEN 'A'
WHEN 4 THEN 'B' WHEN 5 THEN 'B' WHEN 6 THEN 'B'
WHEN 7 THEN 'B'
WHEN 8 THEN 'C' WHEN 9 THEN 'C' WHEN 10 THEN 'C'
WHEN 11 THEN 'C' WHEN 12 THEN 'C' WHEN 13 THEN 'C'
ELSE 'D' END ) )
: TOTUPLE( 999, 'Z')) AS t1:tuple(shippingtime_days:int,
shipping_rating:chararray),
-- end of shipping time and rating as a TUPLE
--
l_comment;
SPLIT good2 INTO
good IF ENDSWITH(err_message, ':'),
bad OTHERWISE;
good = FOREACH good GENERATE l_orderkey .. l_shipmode, FLATTEN(t1), l_comment;
bad = FOREACH bad GENERATE err_message .. l_suppkey, l_linenumber .. l_shipmode,
l_comment;
STORE good into '$OUT_GOOD' USING PigStorage('|');
STORE bad into '$OUT_BAD' USING PigStorage('|');
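The VALID_DATE pattern above does true calendar validation (including the Gregorian leap-year rule), and the CASE expression buckets shipping time into ratings A through D. The same logic can be sanity-checked outside of Pig; the following Python harness is an illustrative sketch only and is not part of the tested solution:

```python
import re

# The same pattern as the Pig VALID_DATE parameter (Python-level escaping).
VALID_DATE = re.compile(
    r'^(?:\d{4}-(?:(?:0[1-9]|1[0-2])-(?:0[1-9]|1\d|2[0-8])'
    r'|(?:0[13-9]|1[0-2])-(?:29|30)'
    r'|(?:0[13578]|1[02])-31)'
    r'|(?:\d{2}(?:[02468][48]|[13579][26]|[2468]0)'
    r'|(?:[02468][048]|[13579][26])00)-02-29)$'
)

def is_valid_date(s: str) -> bool:
    """True if s is a valid yyyy-mm-dd Gregorian date, years 0000-9999."""
    return VALID_DATE.match(s) is not None

def shipping_rating(days: int) -> str:
    """Mirrors the Pig CASE: 0-3 -> A, 4-7 -> B, 8-13 -> C, else D.
    Assumes a non-negative day count (invalid dates never reach this step)."""
    if days <= 3:
        return 'A'
    if days <= 7:
        return 'B'
    if days <= 13:
        return 'C'
    return 'D'
```

Note how the pattern rejects 1900-02-29 but accepts 2000-02-29: the second alternation handles years divisible by 4, with a separate branch for centuries divisible by 400.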
Use case 3: Vendor mainframe file integration
set pig.maxCombinedSplitSize 2147483648
set pig.exec.mapPartAgg true
-- uc3
-- djs, 13 july 2015
-----------------
-----------------
-- parameters and constants
-- input and output files
%DECLARE IN_ORDERITEM '/UC3/200/Source/ORDRITEM'
%DECLARE OUT_ORDERS '/DIY/200/UC3/orders'
%DECLARE OUT_LINEITEM '/DIY/200/UC3/lineitem'
-----------------
-----------------
-- job parameters
-- set the property for fixed-length records; the property name varies by
-- Hadoop version, so we set all three candidates
set mapreduce.input.fixedlengthinputformat.record.length 178;
set mapreduce.lib.input.fixedlengthinputformat.record.length 178;
set fixedlengthinputformat.record.length 178;
-- UDF loader: fixed-length reads and EBCDIC decoding
register myudfs.jar;
-----------------
-----------------
-- this UDF loader reads fixed-length records (178 bytes), and performs
-- EBCDIC conversion on the string and formats the fields, using the fixed
-- copybook format for this table. A loader that works for
-- general-purpose copybooks is out of scope, but could be done with the
-- Jrecord classes.
cobol = LOAD '$IN_ORDERITEM' USING myudfs.SimpleFixedLengthLoader;
SPLIT cobol INTO
orders if (chararray)$0 == 'O',
lineitem if (chararray)$0 == 'L';
-- remove record-type field
orders = FOREACH orders GENERATE $1 .. ;
lineitem = FOREACH lineitem GENERATE $1 .. ;
store orders into '$OUT_ORDERS' USING PigStorage('|');
store lineitem into '$OUT_LINEITEM' USING PigStorage('|');
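Conceptually, the loader-plus-SPLIT pipeline reads 178-byte fixed-length EBCDIC records and routes each to orders or lineitem by its record-type byte, dropping that byte on output. A minimal Python sketch of that flow (illustrative only; it assumes the common US EBCDIC code page cp037, and does not reproduce the copybook field formatting the Java UDF performs):

```python
import codecs
import io

REC_LEN = 178  # fixed-length record size used by the Pig job above

def split_records(stream, rec_len=REC_LEN):
    """Read fixed-length EBCDIC records and route them by record type,
    mirroring the SPLIT ... INTO orders/lineitem step in the Pig script."""
    orders, lineitems = [], []
    while True:
        raw = stream.read(rec_len)
        if len(raw) < rec_len:        # EOF or trailing partial record
            break
        text = codecs.decode(raw, 'cp037')  # EBCDIC -> Unicode
        if text[0] == 'O':
            orders.append(text[1:])   # drop the record-type field
        elif text[0] == 'L':
            lineitems.append(text[1:])
    return orders, lineitems
```

In the real job, this framing is handled by FixedLengthInputFormat and the per-record decoding by the SimpleFixedLengthLoader UDF listed below.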
SimpleFixedLengthLoader.java
package myudfs;
/* compile with, for example,
CP=/opt/cloudera/parcels/CDH-5.4.3-1.cdh5.4.3.p0.6/lib/pig/pig-0.12.0-cdh5.4.3.jar
export PATH=$PATH:/usr/java/jdk1.7.0_67-cloudera/bin
// add sign
if ( low == 0x0d ) sb.insert(0, '-');
return (sb.toString()).replaceFirst("^(-?)0+(?!$)","$1");
}
private String doSingle(String str, int start) {
return str.substring(start, start+1);
}
private String cleanR(String str, int start, int end) {
return str.substring(start, end).replaceFirst("\\s+$", "");
}
private String cleanL(String str, int start, int end) {
return str.substring(start, end).replaceFirst("^0+(?!$)", "");
}
private String doDate(String str, int start) {
return str.substring(start,start+4) + '-' + str.substring(start+4,start+6) +
'-' + str.substring(start+6,start+8);
}
private void addTupleValue(ArrayList<Object> tuple, String buf) {
tuple.add(new DataByteArray(buf));
}
@Override
public InputFormat getInputFormat() {
return new FixedLengthInputFormat();
}
@Override
public void prepareToRead(RecordReader reader, PigSplit split) {
in = reader;
}
@Override
public void setLocation(String location, Job job)
throws IOException {
FileInputFormat.setInputPaths(job, location);
}
}
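The cleanR/cleanL helpers and the sign-handling fragment above deal with COBOL field conventions: trailing blanks, leading zeros, and the packed-decimal (COMP-3) sign nibble, in which 0xD marks a negative value. Since part of the Java listing is not reproduced here, the following hypothetical Python sketch shows the packed-decimal decoding idea end to end (it is not the report's code):

```python
import re

def unpack_comp3(data: bytes) -> str:
    """Decode an IBM packed-decimal (COMP-3) field into a digit string.
    Each byte holds two 4-bit digits; the final nibble is the sign code
    (0xD means negative). Mirrors the sign handling and leading-zero
    trimming shown in the Java fragment above."""
    nibbles = []
    for b in data:
        nibbles.append((b >> 4) & 0x0F)
        nibbles.append(b & 0x0F)
    sign = nibbles.pop()                      # last nibble is the sign code
    s = ''.join(str(n) for n in nibbles)
    if sign == 0x0D:                          # add sign
        s = '-' + s
    # strip leading zeros but keep the sign and at least one digit
    return re.sub(r'^(-?)0+(?=\d)', r'\1', s)
```

For example, the bytes 0x00 0x12 0x3D decode to the digits 00123 with a negative sign nibble, yielding "-123".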
ABOUT PRINCIPLED TECHNOLOGIES
Principled Technologies, Inc.
1007 Slater Road, Suite 300
Durham, NC, 27703
www.principledtechnologies.com
We provide industry-leading technology assessment and fact-based
marketing services. We bring to every assignment extensive experience
with and expertise in all aspects of technology testing and analysis, from
researching new technologies, to developing new methodologies, to
testing with existing and new tools.
When the assessment is complete, we know how to present the results to
a broad range of target audiences. We provide our clients with the
materials they need, from market-focused data to use in their own
collateral to custom sales aids, such as test reports, performance
assessments, and white papers. Every document reflects the results of
our trusted independent analysis.
We provide customized services that focus on our clients’ individual
requirements. Whether the technology involves hardware, software, Web
sites, or services, we offer the experience, expertise, and tools to help our
clients assess how it will fare against its competition, its performance, its
market readiness, and its quality and reliability.
Our founders, Mark L. Van Name and Bill Catchings, have worked
together in technology assessment for over 20 years. As journalists, they
published over a thousand articles on a wide array of technology subjects.
They created and led the Ziff-Davis Benchmark Operation, which
developed such industry-standard benchmarks as Ziff Davis Media’s
Winstone and WebBench. They founded and led eTesting Labs, and after
the acquisition of that company by Lionbridge Technologies were the
head and CTO of VeriTest.
Principled Technologies is a registered trademark of Principled Technologies, Inc.
All other product names are the trademarks of their respective owners.
Disclaimer of Warranties; Limitation of Liability:
PRINCIPLED TECHNOLOGIES, INC. HAS MADE REASONABLE EFFORTS TO ENSURE THE ACCURACY AND VALIDITY OF ITS TESTING, HOWEVER,
PRINCIPLED TECHNOLOGIES, INC. SPECIFICALLY DISCLAIMS ANY WARRANTY, EXPRESSED OR IMPLIED, RELATING TO THE TEST RESULTS AND
ANALYSIS, THEIR ACCURACY, COMPLETENESS OR QUALITY, INCLUDING ANY IMPLIED WARRANTY OF FITNESS FOR ANY PARTICULAR PURPOSE.
ALL PERSONS OR ENTITIES RELYING ON THE RESULTS OF ANY TESTING DO SO AT THEIR OWN RISK, AND AGREE THAT PRINCIPLED
TECHNOLOGIES, INC., ITS EMPLOYEES AND ITS SUBCONTRACTORS SHALL HAVE NO LIABILITY WHATSOEVER FROM ANY CLAIM OF LOSS OR
DAMAGE ON ACCOUNT OF ANY ALLEGED ERROR OR DEFECT IN ANY TESTING PROCEDURE OR RESULT.
IN NO EVENT SHALL PRINCIPLED TECHNOLOGIES, INC. BE LIABLE FOR INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES IN
CONNECTION WITH ITS TESTING, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. IN NO EVENT SHALL PRINCIPLED TECHNOLOGIES,
INC.’S LIABILITY, INCLUDING FOR DIRECT DAMAGES, EXCEED THE AMOUNTS PAID IN CONNECTION WITH PRINCIPLED TECHNOLOGIES, INC.’S
TESTING. CUSTOMER’S SOLE AND EXCLUSIVE REMEDIES ARE AS SET FORTH HEREIN.