Detailed report on IBM's 30TB Hadoop-DS benchmark, showing that IBM InfoSphere BigInsights (SQL-on-Hadoop) is able to execute all 99 TPC-DS queries at scale over native Hadoop data formats. Written by Simon Harris, Abhayan Sundararajan, John Poelman, and Matthew Emmerton.
Blistering fast access to Hadoop with SQL (Simon Harris)
Big SQL, Impala, and Hive were benchmarked on their ability to execute 99 queries from the TPC-DS benchmark at various scale factors. Big SQL was able to express all queries without rewriting, complete the full workload at 10TB and 30TB, and achieved the highest throughput. Impala and Hive required rewriting some queries and could only complete 70-73% of the workload at 10TB. The results indicate that query support, scale, and throughput are important factors to consider for SQL-on-Hadoop implementations.
This document contains a resume for Arti Patel summarizing her work experience and qualifications. She has over 6 years of experience working with SAP MM and SAP PI, including implementing subsidiary rollouts for various companies. Her experience includes requirements gathering, process mapping, configuration, interface development, and support. She holds a Bachelor's degree in Information Technology.
SQL Server 2016 introduces new features for business intelligence and reporting. PolyBase allows querying data across SQL Server and Hadoop using T-SQL. Integration Services has improved support for AlwaysOn availability groups and incremental package deployment. Reporting Services adds HTML5 rendering, PowerPoint export, and the ability to pin report items to Power BI dashboards. Mobile Report Publisher enables developing and publishing mobile reports.
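To make the PolyBase summary concrete, here is a hedged sketch of querying Hadoop-resident data with T-SQL, submitted through pyodbc; the server, database, table, column, and path names are all hypothetical, and PolyBase must be installed and enabled on the SQL Server 2016 instance:

```python
import pyodbc

# Hypothetical connection details.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=sqlhost;"
    "DATABASE=Sales;UID=user;PWD=secret"
)
cur = conn.cursor()

# External data source pointing at a Hadoop namenode (address is an assumption).
cur.execute("""
CREATE EXTERNAL DATA SOURCE HadoopSrc
WITH (TYPE = HADOOP, LOCATION = 'hdfs://namenode:8020')
""")

# Delimited-text file format for the raw log files.
cur.execute("""
CREATE EXTERNAL FILE FORMAT CsvFmt
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','))
""")

# External table over HDFS data, queryable with ordinary T-SQL.
cur.execute("""
CREATE EXTERNAL TABLE dbo.WebLogs (
    user_id INT, url NVARCHAR(400), hit_time DATETIME2)
WITH (LOCATION = '/logs/', DATA_SOURCE = HadoopSrc, FILE_FORMAT = CsvFmt)
""")

# Join Hadoop-resident logs with a relational table in one statement.
for row in cur.execute("""
    SELECT TOP 10 c.name, COUNT(*) AS hits
    FROM dbo.WebLogs w JOIN dbo.Customers c ON c.id = w.user_id
    GROUP BY c.name ORDER BY hits DESC
"""):
    print(row)
```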
Oracle Exalytics - Tips and Experiences from the Field (Enkitec E4 Conference... (Mark Rittman)
Presentation by Rittman Mead's Mark Rittman and Stewart Bryson on our experiences one year on with Exalytics. Includes sections on aggregate caching and datamart loading into TimesTen (TT), the use of Essbase as a TT alternative, and deployment patterns we see at client sites.
The document discusses IBM InfoSphere DataStage and the IBM Information Server architecture. It describes the key components of IBM InfoSphere including DataStage, QualityStage and Information Services Director. It outlines the client-server architecture with client tier, server tiers, repository, engine and working areas. It also discusses data transformation stages, job execution parallelism techniques like pipeline and partition parallelism, and hash partitioning.
Cast Iron Cloud Integration Best Practices (Sarath Ambadas)
This document provides best practices for developing and managing WebSphere Cast Iron integrations. It discusses naming conventions, error handling, orchestration development, appliance configuration, performance tuning, and upgrade processes. Development best practices include splitting large orchestrations, using configuration properties, and testing before deploying. Appliance best practices involve monitoring resources and purging logs. Performance can be improved by configuring connection pooling, batch processing, and tuning job concurrency. Upgrades involve backing up repositories and deploying existing projects to new versions.
The document is a resume for Ragavendiran Kulothungan Shakila, an SAP consultant with over 3 years of experience in SAP implementations including S/4HANA conversions, SAP HANA migrations, upgrades, and ongoing system support. Key skills include SAP S/4HANA, SAP HANA, ABAP, and experience leading projects such as S/4HANA conversions, HANA migrations, and upgrades. Education includes a Bachelor's degree in Computer Science and Engineering.
Update datacenter technology to consolidate and save on virtualization (Principled Technologies)
Extending a server’s life cycle sounds financially conservative, but it overlooks the benefits of transitioning to a new composable infrastructure platform. A datacenter running legacy HPE blade servers could do more work after transitioning, which can decrease licensing and support OpEx.
A company replacing many HPE ProLiant BL460c G6 servers with new HPE Synergy 480 Gen10 Compute Modules could see a consolidation ratio of five to one. This would let them transfer one-fifth of the VMware vSphere licenses they’ve already purchased to the new servers while keeping the remaining licenses for future growth. They could quintuple their workload before having to license additional servers at $8,738, which includes one year of support.
In addition, yearly support costs for vSphere would shrink to one-fifth of what they have been paying. For every five HPE BL460c G6 server blades the company consolidates onto one HPE Synergy Gen10, the reduction in annual support costs would equal $6,992. This is on top of potential savings on reduced power usage and IT management. A company replacing HPE ProLiant BL460c G7 or Gen8 servers would enjoy the same savings to a lesser degree. Although this study focused on a virtualized database environment and hypervisor licensing, bare-metal applications that use per-socket or per-CPU licensing schemes could potentially benefit from similar consolidation approaches as well.
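To make the license arithmetic concrete, the sketch below works through the figures quoted above for a hypothetical fleet; the fleet size and the one-license-per-blade assumption are mine, while the per-license cost and per-group support savings come from the summary:

```python
# Figures quoted in the summary above.
consolidation_ratio = 5           # five BL460c G6 blades onto one Synergy 480 Gen10
license_cost = 8_738              # vSphere license incl. one year of support (USD)
support_saving_per_group = 6_992  # annual support reduction per five blades consolidated (USD)

blades = 100                      # hypothetical fleet; one vSphere license per blade assumed
new_nodes = blades // consolidation_ratio
licenses_freed = blades - new_nodes            # licenses banked for future growth
annual_support_savings = new_nodes * support_saving_per_group

print(f"{blades} blades -> {new_nodes} Synergy modules")
print(f"licenses freed for future growth: {licenses_freed}")
print(f"annual vSphere support savings: ${annual_support_savings:,}")
print(f"cost of each additional license, if growth exceeds the bank: ${license_cost:,}")
```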
Purchasing new solutions may seem challenging, but spending less each year on software support, in addition to other OpEx from server consolidation, can greatly offset your investment. Take a look at how much your organization spends on virtualization software support for your older servers—you might realize that you can’t afford not to upgrade.
This document provides an overview of SAP HANA, including its architecture, deployment options, hardware sizing, and tools for monitoring and administration. It defines SAP HANA as an in-memory computing platform that combines database, application processing, and integration services. The document outlines SAP HANA's key features and discusses its partnerships before reviewing deployment models and certified hardware sizes. It also introduces several tools for administering, monitoring, and troubleshooting SAP HANA systems.
The Quick Sizer tool translates business requirements like number of users or processes into technical hardware requirements to ensure an optimal system configuration is purchased. It uses proven sizing methodologies developed with SAP technology partners. Quick Sizer provides simplified greenfield sizing to select the necessary processor performance, memory, storage, and other hardware components for a balanced system infrastructure that meets business needs cost-effectively.
Suse linux enterprise_server_12_x_for_sap_applications_configuration_guide_fo... (Jaleel Ahmed Gulammohiddin)
This document provides guidance on installing and configuring SUSE Linux Enterprise Server for SAP Applications (SLES for SAP Applications) to be used with SAP HANA. It outlines hardware, software, and other prerequisites. It then provides a sample installation process for SLES 12 on x86_64 and Power architectures, including partitioning disks, selecting software packages, configuring system settings, and installing SAP HANA. Additional sections cover installing high availability, backup, and other optional software.
Tips for Installing Cognos Analytics: Configuring and Installing the Server (Senturus)
Learn the following about Cognos Analytics: install options, gateway and IIS setup, database drivers, release upgrade strategy and schedule migration tips. Download this deck and view the video recording at: http://www.senturus.com/resources/how-to-install-ibm-cognos-analytics/.
Senturus, a business analytics consulting firm, has a resource library with hundreds of free recorded webinars, trainings, demos and unbiased product reviews. Take a look and share them with your colleagues and friends: http://www.senturus.com/resources/.
Ramesh Gurajana is an Oracle DBA with over 3.8 years of experience working with Oracle databases from 10g to 12c. He currently works as an Oracle DBA for SBIICM in Hyderabad, India. Some of his key skills and responsibilities include Oracle database installation, configuration, backup/recovery, performance tuning, query optimization, and high availability solutions like Oracle RAC and Data Guard. He has worked on several projects for banks and telecom companies involving large Oracle databases and e-learning portals.
Effective Integration of SAP MDM & BODS (NavneetGiria)
The document discusses the effective integration of SAP Master Data Management (MDM) and SAP Business Objects Data Services (BODS). It provides examples of how BODS can be integrated with MDM for ETL/data integration and data quality processes. The integration enables capabilities like initial data loads, incremental updates, and central master data maintenance. BODS tools help with tasks like data profiling, impact analysis, and transformation. Together, MDM and BODS provide combined data governance, consolidation, and maintenance capabilities.
The material was created around the end of 2010 but published in February 2011.
The main purposes of creating this document were as follows:
- Very few people were working on SAP HANA at that time.
- Information about SAP HANA was not readily available, and what did exist was scattered.
- I was constantly pulled into presentations and technical demos, which hampered my own hands-on work :). And I was alone in San Jose, while Ulrich was busy in Walldorf coordinating with SAP.
By the way, this was my first full-fledged SAP HANA presentation, written at the end of 2010 though published in 2011. The document is quite old now, but most of it still holds true today.
Himanshu Jain is a database administrator with over 12 years of experience working with Oracle databases. He has extensive experience working with clients in Germany, the UK, France and Sri Lanka. He is currently employed by TCS in Dusseldorf, Germany and holds a Master's degree in Computer Application from India.
SharePoint and Large Scale SQL Deployments - NZSPC (guest7c2e070)
This document discusses considerations for large-scale SharePoint deployments on SQL Server. It provides examples of real-world deployments handling over 10TB of content. It discusses database types, performance issues like indexing and backups, and architectural design best practices like separating databases onto unique volumes. It also provides statistics on deployments handling over 70 million documents and 40TB of content across multiple farms and databases.
Informatica Power Center - Workflow Manager (ZaranTech LLC)
The document discusses various workflow tasks in Informatica PowerCenter, including session, command, email, decision, assignment, timer, control, event-raise, and event-wait tasks. It provides examples of how to use these tasks to control workflow execution based on conditions, variables, events, and timing requirements. Specifically, it presents a business case where sessions need to wait for indicator files, but only within a specific time window each day, using assignment, file-wait, timer, and command tasks along with link logic.
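The business case above (wait for an indicator file, but only inside a daily window) is easy to misread, so here is the control flow sketched in plain Python. This is an illustration of the logic the PowerCenter tasks implement, not PowerCenter code, and the file path and window times are hypothetical:

```python
import time
from datetime import datetime, time as dtime
from pathlib import Path

INDICATOR = Path("/data/in/orders.done")             # hypothetical indicator file
WINDOW_START, WINDOW_END = dtime(6, 0), dtime(9, 0)  # hypothetical daily window

def within_window(now: datetime) -> bool:
    return WINDOW_START <= now.time() <= WINDOW_END

def wait_for_indicator(poll_seconds: int = 30) -> bool:
    """Mirror of the event-wait + timer pattern: succeed if the file
    appears inside the window, give up when the window closes."""
    while within_window(datetime.now()):
        if INDICATOR.exists():
            return True               # event-wait satisfied -> run the session
        time.sleep(poll_seconds)      # timer-task equivalent: poll again later
    return False                      # control task would take the failure branch

if wait_for_indicator():
    print("indicator found: start dependent session")
else:
    print("window closed without indicator: take failure branch")
```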
The document introduces workload management features in SAP HANA SPS 10. It describes different workload types like OLTP, OLAP, and mixed workloads. It then discusses how system resources like CPU, memory, and priority were previously controlled using configuration parameters. The main new feature introduced is workload classes, which allow administrators to define resource policies for different workloads and applications. Workload mappings ensure the correct workload class is applied based on connection properties. Administrative tools and monitoring views are provided for workload classes.
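As a hedged sketch of the workload-class feature described above, created here through the hdbcli driver: the class and mapping names, limits, and connection details are hypothetical, and the property names are my recollection of the SPS 10 era syntax, so verify them against your revision's documentation:

```python
from hdbcli import dbapi

# Hypothetical connection details.
conn = dbapi.connect(address="hanahost", port=30015, user="SYSTEM", password="secret")
cur = conn.cursor()

# A workload class capping resources for ad-hoc reporting sessions.
# Property names are assumptions based on SPS 10 documentation.
cur.execute("""
CREATE WORKLOAD CLASS "REPORTING_LIMITED"
  SET 'PRIORITY' = '3',
      'STATEMENT MEMORY LIMIT' = '50',
      'STATEMENT THREAD LIMIT' = '8'
""")

# A mapping that applies the class based on a connection property,
# so sessions from the named application pick up the limits automatically.
cur.execute("""
CREATE WORKLOAD MAPPING "REPORTING_MAP"
  WORKLOAD CLASS "REPORTING_LIMITED"
  SET 'APPLICATION NAME' = 'AdHocReports'
""")
conn.commit()
```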
Prepare for your interview with these top 20 SAP HANA interview questions. For more IT profiles, sample resumes, practice exams, interview questions, live training, and more, visit ITLearnMore, a trusted website for the learning needs of students, graduates, and working professionals.
Looking to add weight to your resume? Check out ITLearnMore for a variety of online IT courses at affordable prices, intended to boost your career. There is plenty in store for both fresh graduates and professionals. Get up to date with current IT job market requirements and related courses. For more information, visit http://www.ITLearnMore.com.
See tips to improve your Cognos (v10 and v11) environment. Topics include the new Interactive Performance Assistant (v11), hardware and server specifics, failover and high availability, high and low affinity requests, overview of services, Java heap settings, IIS configurations and non-Cognos related tuning. View the video recording and download this deck at: http://www.senturus.com/resources/cognos-analytics-performance-tuning/
Senturus, a business analytics consulting firm, has a resource library with hundreds of free recorded webinars, trainings, demos and unbiased product reviews. Take a look and share them with your colleagues and friends: http://www.senturus.com/resources/.
Thejokumar Muduru is seeking a position as a Database Consultant or Senior Technical Lead with over 10 years of experience administering Oracle databases. He has extensive experience installing, configuring, patching, tuning, backing up and recovering Oracle databases from versions 8i through 11g. He is proficient in performance monitoring, high availability solutions, disaster recovery and database upgrades. His technical skills include Oracle, Linux, SQL, PL/SQL and shell scripting.
This presentation gives a brief overview of the various technology features that constitute the SAP in-memory database HANA, with more detail on how SAP HANA achieves compression, insert-only on delta, columnar/row stores, and the data provisioning tools into the HANA appliance. It is best suited for technology teams who will manage an SAP HANA appliance.
This document contains a professional summary and details for Satheesh Talluri. It outlines over 8 years of experience as an Oracle Database Administrator with skills in Oracle 11g, 10g, 9i, and other technologies. Recent roles include working as an Oracle DBA for AT&T in Dallas, Texas and managing critical online applications. Previous experience includes work as an Oracle DBA for IEEE.org and Mylan.
Data-driven analytics is making a measurable impact on business performance, helping companies pinpoint new sources of revenue and streamline operations. But traditional computing systems are challenged to keep up with a rapidly evolving data management landscape.
How do you foster superior efficiency, flexibility, and economy while meeting diverse and pressing analytics needs?
SAP® Sybase IQ and Dobler Consulting can help:
Traditional database systems were meant for processing transactions, but SAP® Sybase® IQ server is a highly efficient RDBMS optimized for extreme-scale EDWs and Big Data analytics – offering you faster data loading and query performance while slashing maintenance, hardware, and storage costs. Realize exponential improvement, even as thousands of employees and massive amounts of data (structured and unstructured) enter your ecosystem.
With SAP Sybase IQ 16 you can:
• Exploit the value of Big Data and incorporate into everyday business decision-making
• Transform your business through deeper insight by enabling analytics on real-time information
• Extend the power of analytics across your enterprise with speed, availability and security.
Please join us to learn the value offered by SAP Sybase IQ 16. And, see how by tying together your organization’s data assets – from operational data to external feeds and Big Data – SAP dramatically simplifies data management landscapes for both current and next-generation business applications, delivering information at unprecedented speeds and empowering a Big Data-enabled Enterprise Data Warehouse.
This document discusses common production failures encountered with SAP BW, including transactional RFC errors due to non-updated IDocs, time stamp errors, short dumps, job cancellations in the R/3 source system, incorrect data in the PSA, ODS activation failures, missing caller profiles, failed attribute change runs, failed R/3 extraction jobs, missing files, table space issues, and restarting failed process chains. It provides explanations of each error, the impacts, and recommended actions to resolve the issues. The document serves as a quick reference guide for BW consultants to help solve complex production problems.
Workplace motivation refers to companies' ability to maintain their employees' positive drive and performance at work. There are four types of motivation: extrinsic, intrinsic, transitive, and transcendent. Motivation is important to companies because it improves employees' individual and group productivity. Some motivating factors include having responsibilities, autonomy, and clear objectives, while interpersonal problems, lack of trust, and excessive control are demotivating.
- Total delivered strong 2016 results in a challenging environment, with adjusted net income of $8.3 billion and production growth of 4.5%.
- Safety remains a core value, with the Total Recordable Injury Rate improving to 0.9 per million man-hours worked.
- Total is focused on reducing costs, with upstream operating costs targeted to reach $5.5/boe in 2017 and $5/boe by 2018.
- Production is expected to continue growing in 2017 with ramp-ups of new projects and start-ups.
Jupyter for Education: Beyond Gutenberg and Erasmus (Paco Nathan)
O'Reilly Learning is focusing on evolving learning experiences using Jupyter notebooks. Jupyter notebooks allow combining code, outputs, and explanations in a single document. O'Reilly is using Jupyter notebooks as a new authoring environment and is exploring features like computational narratives, code as a medium for teaching, and interactive online learning environments. The goal is to provide a better learning architecture and content workflow that leverages the capabilities of Jupyter notebooks.
The document describes the basic components of wind turbines. It explains that wind energy comes from solar energy and the differential heating of the air by the sun. It also mentions that there are different types of wind turbines according to their power and number of blades. It then lists the main components, such as the rotor, the blades, the low-speed shaft, the gearbox, the yaw system, and the support structure.
An introduction to a blog about destinations, arts, culture, people, and cuisines... everything you would want to know about Kerala.
Discover Life. Feel Divinity. Find Yourself...........Experience God's Own Country
Agile Data Science 2.0 (O'Reilly 2017) defines a methodology and a software stack with which to apply the methods. *The methodology* seeks to deliver data products in short sprints by going meta and putting the focus on the applied research process itself. *The stack* is but an example of one meeting the requirements that it be utterly scalable and utterly efficient in use by application developers as well as data engineers. It includes everything needed to build a full-blown predictive system: Apache Spark, Apache Kafka, Apache Incubating Airflow, MongoDB, ElasticSearch, Apache Parquet, Python/Flask, JQuery. This talk will cover the full lifecycle of large data application development and will show how to use lessons from agile software engineering to apply data science using this full-stack to build better analytics applications. The entire lifecycle of big data application development is discussed. The system starts with plumbing, moving on to data tables, charts and search, through interactive reports, and building towards predictions in both batch and realtime (and defining the role for both), the deployment of predictive systems and how to iteratively improve predictions that prove valuable.
1) JSON-LD has seen widespread adoption with over 2 million HTML pages including it and it being a required format for Linked Data platforms.
2) A primary goal of JSON-LD was to allow JSON developers to use it much as they use plain JSON, while also providing mechanisms to reshape JSON documents into a deterministic structure for processing (see the minimal example after this list).
3) JSON-LD 1.1 includes additional features like using objects to index into collections, scoped contexts, and framing capabilities.
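To ground the "deterministic structure" point, here is a minimal JSON-LD document and its expansion, handled with the pyld reference processor; the names and URLs in the document are illustrative:

```python
import json
from pyld import jsonld  # reference JSON-LD processor for Python

# Ordinary-looking JSON; the @context maps plain keys to IRIs.
doc = {
    "@context": {
        "name": "http://schema.org/name",
        "homepage": {"@id": "http://schema.org/url", "@type": "@id"},
    },
    "name": "Jane Developer",
    "homepage": "https://example.org/jane",
}

# Expansion produces the deterministic, context-free form processors work with.
expanded = jsonld.expand(doc)
print(json.dumps(expanded, indent=2))
```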
This document discusses building agile analytics applications. It recommends taking an iterative approach where data is explored interactively from the start to discover insights. Rather than designing insights upfront, the goal is to build an application that facilitates exploration of the data to uncover insights. This is done by setting up an environment where insights can be repeatedly produced and shared with the team. The focus is on using simple, flexible tools that work from small local data to large datasets.
Enabling Multimodel Graphs with Apache TinkerPop (Jason Plurad)
Graphs are everywhere, but in a modern data stack, they are not the only tool in the toolbox. With Apache TinkerPop, adding graph capability on top of your existing data platform is not as daunting as it sounds. We will do a deep dive on writing Traversal Strategies to optimize performance of the underlying graph database. We will investigate how various TinkerPop systems offer unique possibilities in a multimodel approach to graph processing. We will discuss how using Gremlin frees you from vendor lock-in and enables you to swap out your graph database as your requirements evolve. Presented at Graph Day Texas, January 14, 2017. http://graphday.com/graph-day-at-data-day-texas/#plurad
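As a small, hedged illustration of the vendor-neutrality point using the gremlinpython driver: the server address and graph content are hypothetical, and any TinkerPop-enabled backend answers the same traversal:

```python
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# Connect to any TinkerPop-enabled graph server (address is hypothetical).
remote = DriverRemoteConnection("ws://graph-server:8182/gremlin", "g")
g = traversal().withRemote(remote)

# The same Gremlin runs unchanged if the backing graph database is swapped.
names = (
    g.V().has("person", "name", "marko")
     .out("knows")          # follow outgoing 'knows' edges
     .values("name")        # project the neighbours' names
     .toList()
)
print(names)
remote.close()
```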
Agile Data Science: Building Hadoop Analytics Applications (Russell Jurney)
This document discusses building agile analytics applications with Hadoop. It outlines several principles for developing data science teams and applications in an agile manner. Some key points include:
- Data science teams should be small, around 3-4 people with diverse skills who can work collaboratively.
- Insights should be discovered through an iterative process of exploring data in an interactive web application, rather than trying to predict outcomes upfront.
- The application should start as a tool for exploring data and discovering insights, which then becomes the palette for what is shipped.
- Data should be stored in a document format like Avro or JSON rather than a relational format, to reduce joins and better represent semi-structured data (a brief sketch follows this list).
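A minimal sketch of the document-format point from the last bullet, with illustrative field names and file path: one self-contained JSON document per line means records can be read back without joins:

```python
import json

# One self-contained document per event: no joins needed to read it back.
records = [
    {"user": {"id": 7, "name": "alice"}, "event": "search", "terms": ["hadoop", "sql"]},
    {"user": {"id": 9, "name": "bob"}, "event": "click", "url": "/results/3"},
]

with open("events.jsonl", "w") as f:     # newline-delimited JSON ("JSON lines")
    for rec in records:
        f.write(json.dumps(rec) + "\n")

with open("events.jsonl") as f:          # each line parses independently
    for line in f:
        print(json.loads(line)["user"]["name"])
```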
Consumers are interested in autonomous cars but still fear letting go of the wheel completely. While traffic is a major issue for city satisfaction, autonomous vehicles may help by freeing up drivers and improving the commute experience. Those most interested in autonomous cars tend to be professionals with children who already use cars to commute. Allowing cars to be shared more easily through technologies like digital keys could change whether people own cars or use them as a service. A variety of companies from traditional automakers to technology firms and public transport providers are seen as potential future providers of autonomous mobility options.
See 2020 update: https://derwen.ai/s/h88s
SF Python Meetup, 2017-02-08
https://www.meetup.com/sfpython/events/237153246/
PyTextRank is a pure Python open source implementation of *TextRank*, based on the [Mihalcea 2004 paper](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf), a graph algorithm which produces ranked keyphrases from texts. Keyphrases are generally more useful than simple keyword extraction. PyTextRank integrates use of `TextBlob` and `SpaCy` for NLP analysis of texts, including full parse, named entity extraction, etc. It also produces auto-summarization of texts, making use of an approximation algorithm, `MinHash`, for better performance at scale. Overall, the package is intended to complement machine learning approaches, specifically deep learning used for custom search and recommendations, by developing better feature vectors from raw texts. This package is in production use at O'Reilly Media for text analytics.
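A hedged usage sketch follows; note that the pipeline-registration API shown is from a later release of the package than the talk describes, and the sample text is illustrative:

```python
import spacy
import pytextrank  # registers the "textrank" pipeline component on import

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")  # current-API registration; older releases differ

doc = nlp(
    "Compatibility of systems of linear constraints over the set of "
    "natural numbers is considered."
)

# Ranked keyphrases rather than bare keywords.
for phrase in doc._.phrases[:5]:
    print(f"{phrase.rank:.4f}  {phrase.text}")
```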
Slides for Data Syndrome one hour course on PySpark. Introduces basic operations, Spark SQL, Spark MLlib and exploratory data analysis with PySpark. Shows how to use pylab with Spark to create histograms.
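For instance, the histogram technique the deck shows can be sketched like this (column and file names are hypothetical), using RDD.histogram to bin on the cluster and matplotlib/pylab to draw locally:

```python
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt

spark = SparkSession.builder.appName("eda-histogram").getOrCreate()
df = spark.read.json("on_time_performance.jsonl")   # hypothetical input

# Bin on the cluster; only ~20 bucket edges and counts return to the driver.
buckets, counts = (
    df.select("ArrDelay")
      .rdd.flatMap(lambda row: [row[0]] if row[0] is not None else [])
      .histogram(20)
)

# Draw locally, as in the course.
widths = [buckets[i + 1] - buckets[i] for i in range(len(counts))]
plt.bar(buckets[:-1], counts, width=widths, align="edge")
plt.xlabel("Arrival delay (minutes)")
plt.ylabel("Flights")
plt.show()
```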
Zipcar is a car sharing service that allows users to rent vehicles by the hour or day. Members pay an annual fee of $70 plus hourly rates of $8.50 per hour or daily rates of $59. Zipcar has over 1 million members across 500 cities in 9 countries, with a fleet of 10,000 vehicles. The document outlines Zipcar's approach, history, competitors, and future outlook which includes increasing their fleet size and adding more hybrid and electric vehicles.
Modeling Social Data, Lecture 3: Data manipulation in R (jakehofman)
The document discusses data manipulation in R. It notes that R has some quirks with naming conventions and variable types but is well-suited for exploratory data analysis, generating visualizations, and statistical modeling. The tidyverse collection of R packages, including dplyr and ggplot2, helps make data analysis easier by providing tools for reshaping data into a tidy format with one variable per column and observation per row. Dplyr's verbs like filter, arrange, select, mutate and summarize allow for splitting, applying transformations, and combining data in a functional programming style.
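For readers coming from Python, here is a rough pandas analogue of the dplyr verbs named above (filter, mutate, group_by, summarize); this is a translation for comparison, not the lecture's R code, and the column names are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["NYC", "NYC", "SF", "SF"],
    "trips": [120, 80, 60, 90],
    "minutes": [600, 320, 300, 360],
})

summary = (
    df[df["trips"] > 70]                                           # dplyr: filter(trips > 70)
      .assign(mins_per_trip=lambda d: d["minutes"] / d["trips"])   # mutate()
      .groupby("city", as_index=False)                             # group_by(city)
      .agg(mean_rate=("mins_per_trip", "mean"))                    # summarize(mean(...))
)
print(summary)
```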
Numbers that Actually Matter. Finding Your North Star (Mamoon Hamid)
This presentation uncovers a common misconception in fast-growing SaaS businesses: that revenue reigns over everything. We talk about what really matters in building a sustainable SaaS company.
Sizing SAP on x86 IBM PureFlex with Reference Architecture (Doddi Priyambodo)
This document provides an overview of the IBM PureSystems platform and its suitability for running SAP workloads. It describes the Intel-based compute and storage nodes that comprise PureSystems, as well as connectivity and reliability features. Typical SAP landscape topologies that can be implemented on PureSystems are explained, including core, high availability, heterogeneous, and HANA-based landscapes. The document also discusses supported virtualization technologies, SAP software configurations, and IBM software integrations for PureSystems.
The document discusses IBM DB2 Analytics Accelerator and its support for Netezza In-Database Analytics (INZA) functions and Accelerator Only Tables (AoT) in QMF 11.2. It describes how INZA functions enable acceleration of predictive analytics applications by allowing SPSS/Netezza Analytics data mining and modeling to be processed within the Accelerator. It also outlines the installation and configuration needed to utilize these new capabilities.
Smarter Documentation: Shedding Light on the Black Box of Reporting Data (Kelly Raposo)
Developing reports to make sense of project data can be a difficult task. IBM’s reporting tools enable users to report on the data from Rational Team Concert, Rational Quality Manager, and Rational Requirements Composer, but our clients often have trouble determining how to get the right data into the right reports. Through a collaborative effort between our clients and several Rational teams (incl. Support, Development, User Experience and Documentation), we explored the challenges and developed a plan to get all the necessary information into our users’ hands. Using tools to automate documentation of the data models, collect and expose SME knowledge about the product REST APIs, and filter the information based on goals, the team delivered a full set of guidance and reference material in the Information Centres that sheds some light on the black box of data. Ongoing efforts will connect the pieces using linked data, allowing fast and easy exploration of the data relationships.
Benchmarking Hadoop - Which Hadoop SQL engine leads the herd (Gord Sissons)
Stewart Tate (tates@us.ibm.com), a key architect behind the industry's first-ever Hadoop-DS benchmark at 30TB scale, describes the benchmark and the comparative testing between IBM Big SQL, Cloudera Impala, and Hortonworks Hive.
Improve Predictability & Efficiency with Kanban Metrics using IBM Rational In... (Marc Nehme)
This presentation discusses IBM Rational Insight and how it was leveraged to provide reports with metrics supporting the adoption of the Kanban Method, by teams using IBM Rational Team Concert.
Improving Predictability and Efficiency with Kanban Metrics using Rational In... (Paulo Lacerda)
This document discusses using Kanban metrics in Rational Insight to improve predictability and efficiency. It describes challenges with enterprise reporting like disparate data sources and outlines how Rational Insight provides an automated solution. It then explains how Kanban concepts like work-in-progress limits and cumulative flow diagrams can provide metrics for monitoring flow and identifying bottlenecks. The demonstration shows analyzing metrics in Rational Insight to find inefficiencies in a software development process and improve predictability.
This document discusses scientific research using Database-as-a-Service (DBaaS) on IBM PureApplication System and PureData System for Transactions. It notes that scientific data has different characteristics than business data and often requires high reliability and availability. DBaaS can help with scientific research by providing database services in a cloud environment and allowing computations to be performed close to the data. Examples of large scientific databases discussed include the Sloan Digital Sky Survey and 1000 Genomes Project.
Connect2014 Spot101: Cloud Readiness 101: Analyzing and Visualizing Your IT I... (panagenda)
This document summarizes a presentation about analyzing and visualizing an IT infrastructure in preparation for a move to the cloud. It discusses key factors to consider like user activity analysis and application design analysis. User activity analysis can identify how applications are used and who the key users are. Application design analysis evaluates dependencies and constraints. These analyses combined with organizational considerations can indicate the best cloud strategy and destinations like web or mobile applications. The presentation demonstrates these analyses with customer examples and emphasizes understanding the full infrastructure picture is key to a successful cloud transformation.
Highly successful performance tuning of an informix database (IBM_Info_Management)
This document contains several notices and disclaimers related to IBM products, services, and information. It states that IBM owns the copyright to the document and its contents. It also notes that performance results may vary depending on the environment. The document is provided without warranty and IBM is not liable for damages from its use. Statements regarding IBM's future plans are subject to change.
IBM Connect 2014 - AD206 - Build Apps Rapidly by Leveraging Services from IBM... (Niklas Heidloff)
IBM Connect 2014
AD206 : Build Apps Rapidly by Leveraging Services from IBM Collaboration Solutions
Niklas Heidloff, IBM
Henning Schmidt, hedersoft GmbH
Demo: http://www.youtube.com/watch?v=Wl5hasivtPQ
Don’t reinvent the wheel when building your own apps. Instead use the services provided by IBM Collaboration Solutions and focus on your specific business requirements. IBM Collaboration Solutions provide a unique set of social and collaborative services like profiles, file sharing, community discussions and much more. Come to this session to see different types of apps, e.g. XPages apps, that have been developed rapidly by leveraging these services from IBM Connections, on premises or in the cloud. Technically, the services can be easily accessed from apps via the IBM Social Business Toolkit SDK. In this session you’ll learn how the SDK simplifies calling the back-end services via APIs and how reusable user interface controls can be leveraged.
Wed, 29/Jan 05:30 PM – 06:30 PM
Ims13 ims tools ims v13 migration workshop - IMS UG May 2014 Sydney & Melbo... (Robert Hain)
Together, the IBM IMS Tools Solution Packs and IMS 13 deliver simplification, automation and intelligence, with all the tools needed to support IMS databases now in one package. It doesn’t make sense to run reorganization utilities if your databases do not need to be reorganized. Now you can quickly and easily improve IMS application performance, IMS resource utilization and deliver higher system availability with the end-to-end analysis of IMS transactions. Comprehensive performance reporting and easier interactive analysis determine what happened, what needs fixing and how to fix it – all part of the intelligence and automation of the IMS Tools Performance Solution Pack.
Spark working with a Cloud IDE: Notebook/Shiny Apps (Data Con LA)
Abstract:
The Problem: Energy inefficiency within public and private buildings in the City of New York.
The Goal: Take meter (sensor) data and solve the inefficiencies through better insights.
The Solution: Visualization and reporting through the Shiny app to gain knowledge of past and present usage patterns, and, beyond those patterns, to compare and gain insights/predictions on energy usage.
Spark's DataFrames and RDDs will be used in concert with the pandas library to clean and model/prepare data for the R Shiny app. The message of this meetup discussion is to show the capabilities of Spark while using DSX and RStudio/Shiny to create visualizations and reporting that give insights to the end user.
There are a few techniques that we will present in this notebook with both modeling and ML: linear regression, K-Means clustering for identifying inefficient buildings, and (statistical) classification modeling, followed by a confusion matrix (error matrix).
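A hedged sketch of the K-Means step described above, using Spark MLlib's DataFrame API; the feature columns and file name are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("building-efficiency").getOrCreate()
meters = spark.read.csv("meter_readings.csv", header=True, inferSchema=True)

# Assemble per-building usage features into a single vector column.
assembler = VectorAssembler(
    inputCols=["kwh_per_sqft", "peak_ratio", "weekend_ratio"],
    outputCol="features",
)
features = assembler.transform(meters)

# Cluster buildings; outlying clusters flag inefficiency candidates.
model = KMeans(k=4, seed=42, featuresCol="features").fit(features)
labeled = model.transform(features)
labeled.groupBy("prediction").count().show()
```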
Bio:
Thomas Liakos has been an open source systems engineer for 11 years and has 8 years of experience in cloud and hybrid environments. Prior to IBM, Thomas was a Sr. Systems Architect at Gem.co and a DevOps/Systems Engineer (Cloud Operations) at CrowdStrike. Thomas has expertise in Spark, Python, systems and configuration management, architecture, data warehousing, and data engineering.
[IBM Pulse 2014] #1579 DevOps Technical Strategy and Roadmap (Daniel Berg)
Hey everyone. Here is the deck I had the pleasure of presenting with Maciej Zawadzki and Ruth Willenborg, describing IBM's DevOps technical strategy and roadmap.
Enjoy!!!
A description of what REST is and is not useful for, followed by a walkthrough of how to use REST APIs to access Informix databases. Includes new features released in Informix 12.10.xC7.
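As a hedged sketch of what such REST access can look like from a client: the Informix wire listener exposes a document-style HTTP interface, but the host, port, database, and table names below are hypothetical, and the exact URL shape should be checked against the 12.10.xC7 wire-listener documentation:

```python
import requests

BASE = "http://informix-host:8080"   # hypothetical wire-listener endpoint

# List rows from a table exposed as a collection (URL shape is an assumption).
resp = requests.get(f"{BASE}/stores_demo/customer", params={"batchSize": 5})
resp.raise_for_status()
for row in resp.json():
    print(row)

# Insert a row via POST, mirroring the listener's MongoDB-style API.
new_row = {"fname": "Ada", "lname": "Lovelace", "city": "London"}
requests.post(f"{BASE}/stores_demo/customer", json=new_row).raise_for_status()
```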
ICS usergroup dev day2014_application development für die ibm smartcloud for ... (ICS User Group)
The document discusses how to build apps rapidly by leveraging services from IBM Collaboration Solutions. It introduces Niklas Heidloff and Henning Schmidt and provides two sample scenarios: a Partner Community app built using IBM SmartCloud for Social Business and a Loan Manager app built using IBM Connections. The document demonstrates how these apps were implemented using the IBM Social Business Toolkit SDK and XPages for rapid development. It summarizes the key services accessed from the SDK, such as profiles, files, tasks and forums. The document encourages developers to reuse existing collaboration services rather than reinventing functionality.
Integrating BigInsights and Puredata system for analytics with query federati... (Seeling Cheung)
This document summarizes a presentation given by David Darden and Don Smith of Big Fish Games about their efforts to integrate the BigInsights and PureData System for Analytics platforms. They discussed augmenting their data warehouse by using these platforms for landing zones, exploration of "awkward" datasets, and offloading some processing. They demonstrated several options for moving data between the platforms using tools like Sqoop, Fluid Query, and Big SQL. They identified documentation, performance, and usability as ongoing challenges and next steps to improve their users' experience with the systems.
Similar to IBM Hadoop-DS Benchmark Report - 30TB
Fueling AI with Great Data with Airbyte Webinar (Zilliz)
This talk will focus on how to collect data from a variety of sources, leverage that data for RAG and other GenAI use cases, and finally chart your course to productionization.
Generating privacy-protected synthetic data using Secludy and Milvus (Zilliz)
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/temporal-event-neural-networks-a-more-efficient-alternative-to-the-transformer-a-presentation-from-brainchip/
Chris Jones, Director of Product Management at BrainChip , presents the “Temporal Event Neural Networks: A More Efficient Alternative to the Transformer” tutorial at the May 2024 Embedded Vision Summit.
The expansion of AI services necessitates enhanced computational capabilities on edge devices. Temporal Event Neural Networks (TENNs), developed by BrainChip, represent a novel and highly efficient state-space network. TENNs demonstrate exceptional proficiency in handling multi-dimensional streaming data, facilitating advancements in object detection, action recognition, speech enhancement and language model/sequence generation. Through the utilization of polynomial-based continuous convolutions, TENNs streamline models, expedite training processes and significantly diminish memory requirements, achieving notable reductions of up to 50x in parameters and 5,000x in energy consumption compared to prevailing methodologies like transformers.
Integration with BrainChip’s Akida neuromorphic hardware IP further enhances TENNs’ capabilities, enabling the realization of highly capable, portable and passively cooled edge devices. This presentation delves into the technical innovations underlying TENNs, presents real-world benchmarks, and elucidates how this cutting-edge approach is positioned to revolutionize edge AI across diverse applications.
Dandelion Hashtable: beyond billion requests per second on a commodity serverAntonios Katsarakis
This slide deck presents DLHT, a concurrent in-memory hashtable. Despite efforts to optimize hashtables, that go as far as sacrificing core functionality, state-of-the-art designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state-of-the-art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open-addressing and adopts a fully-featured and memory-aware closed-addressing design based on bounded cache-line-chaining. This design offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing. In a commodity server and a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).
Building Production Ready Search Pipelines with Spark and MilvusZilliz
Spark is the widely used ETL tool for processing, indexing and ingesting data to serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to Milvus vector database for search serving.
Trusted Execution Environment for Decentralized Process MiningLucaBarbaro3
Presentation of the paper "Trusted Execution Environment for Decentralized Process Mining" given during the CAiSE 2024 Conference in Cyprus on June 7, 2024.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
HCL Notes and Domino License Cost Reduction in the World of DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
Taking AI to the Next Level in Manufacturing.pdfssuserfac0301
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
6. Ideas and approaches to help build your organization's AI strategy.
In the realm of cybersecurity, offensive security practices act as a critical shield. By simulating real-world attacks in a controlled environment, these techniques expose vulnerabilities before malicious actors can exploit them. This proactive approach allows manufacturers to identify and fix weaknesses, significantly enhancing system security.
This presentation delves into the development of a system designed to mimic Galileo's Open Service signal using software-defined radio (SDR) technology. We'll begin with a foundational overview of both Global Navigation Satellite Systems (GNSS) and the intricacies of digital signal processing.
The presentation culminates in a live demonstration. We'll showcase the manipulation of Galileo's Open Service pilot signal, simulating an attack on various software and hardware systems. This practical demonstration serves to highlight the potential consequences of unaddressed vulnerabilities, emphasizing the importance of offensive security practices in safeguarding critical infrastructure.
Best 20 SEO Techniques To Improve Website Visibility In SERPPixlogix Infotech
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU und die Lizenzen nach dem CCB- und CCX-Modell sind für viele in der HCL-Community seit letztem Jahr ein heißes Thema. Als Notes- oder Domino-Kunde haben Sie vielleicht mit unerwartet hohen Benutzerzahlen und Lizenzgebühren zu kämpfen. Sie fragen sich vielleicht, wie diese neue Art der Lizenzierung funktioniert und welchen Nutzen sie Ihnen bringt. Vor allem wollen Sie sicherlich Ihr Budget einhalten und Kosten sparen, wo immer möglich. Das verstehen wir und wir möchten Ihnen dabei helfen!
Wir erklären Ihnen, wie Sie häufige Konfigurationsprobleme lösen können, die dazu führen können, dass mehr Benutzer gezählt werden als nötig, und wie Sie überflüssige oder ungenutzte Konten identifizieren und entfernen können, um Geld zu sparen. Es gibt auch einige Ansätze, die zu unnötigen Ausgaben führen können, z. B. wenn ein Personendokument anstelle eines Mail-Ins für geteilte Mailboxen verwendet wird. Wir zeigen Ihnen solche Fälle und deren Lösungen. Und natürlich erklären wir Ihnen das neue Lizenzmodell.
Nehmen Sie an diesem Webinar teil, bei dem HCL-Ambassador Marc Thomas und Gastredner Franz Walder Ihnen diese neue Welt näherbringen. Es vermittelt Ihnen die Tools und das Know-how, um den Überblick zu bewahren. Sie werden in der Lage sein, Ihre Kosten durch eine optimierte Domino-Konfiguration zu reduzieren und auch in Zukunft gering zu halten.
Diese Themen werden behandelt
- Reduzierung der Lizenzkosten durch Auffinden und Beheben von Fehlkonfigurationen und überflüssigen Konten
- Wie funktionieren CCB- und CCX-Lizenzen wirklich?
- Verstehen des DLAU-Tools und wie man es am besten nutzt
- Tipps für häufige Problembereiche, wie z. B. Team-Postfächer, Funktions-/Testbenutzer usw.
- Praxisbeispiele und Best Practices zum sofortigen Umsetzen
Nordic Marketo Engage User Group_June 13_ 2024.pptx
IBM Hadoop-DS Benchmark Report - 30TB
Hadoop-DS Benchmark Report
for
IBM System x3650 M4
using
IBM BigInsights Big SQL 3.0
and
Red Hat Enterprise Linux Server Release 6.4

FIRST EDITION
Published on Oct 24, 2014
Authors
Simon Harris: Simon is the Big SQL performance lead working in the IBM BigInsights development team. He has 20 years of experience working in information management including MPP RDBMS, federated database technology, tooling and big data. Simon now specializes in SQL over Hadoop technologies.
Abhayan Sundararajan: Abhayan is a Performance Analyst on IBM BigInsights with a focus on Big SQL. He has held a variety of roles within the IBM DB2 team, including functional verification testing and a brief foray into development, before joining the performance team to work on DB2 BLU.
John Poelman: John joined the BigInsights performance team in 2011. While at IBM, John has worked as a developer or a performance engineer on a variety of database, business intelligence, and now big data software products.
Matthew Emmerton: Matt Emmerton is a Senior Software Developer in Information Management at the IBM Toronto Software Lab. He has over 10 years of expertise in database performance analysis and scalability testing. He has participated in many successful world-record TPC and SAP benchmarks. His interests include exploring and exploiting key hardware and operating system technologies in DB2, and developing extensible test suites for standard benchmark workloads.
Special thanks
Special thanks to the following people for their contribution to the benchmark and content:
Berni Schiefer – Distinguished Engineer, Information Management Performance and Benchmarks, DB2 LUW, Big Data, MDM, Optim Data Studio Performance Tools
Adriana Zubiri – Program Director, Big Data Development
Mike Ahern – BigInsights Performance
Mi Shum – Senior Performance Manager, Big Data
Cindy Saracco - Solution Architect, IM technologies - Big Data
Avrilia Floratou – IBM Research
Fatma Özcan – IBM Research
Glen Sheffield – Big Data Competitive Analyst
Gord Sissons – BigInsights Product Marketing Manager
Stewart Tate – Senior Technical Staff Member, Information Management Performance Benchmarks and Solutions
Jo A Ramos - Executive Solutions Architect - Big Data and Analytics
Introduction
Performance benchmarks are an integral part of software and systems development, as they evaluate system performance in an objective way. They have also become highly visible components of the competitive world of marketing SQL over Hadoop solutions.
IBM has constructed and used the Hadoop Decision Support (Hadoop-DS) benchmark, which was modeled on the industry-standard Transaction Processing Performance Council Benchmark DS (TPC-DS) [1] and validated by a TPC certified auditor. While adapting the workload to the nature of a Hadoop system, IBM worked to ensure the essential attributes of both typical customer requirements and the benchmark were preserved. TPC-DS was released in January 2012 and most recently revised (Revision 1.2.0) in September 2014 [2].
The Hadoop-DS benchmark is a decision support benchmark. It consists of a suite of business-oriented ad hoc queries. The queries and the data populating the database have been chosen to have broad, industry-wide relevance while remaining straightforward to implement. This benchmark illustrates decision support systems that:
Examine large volumes of data;
Execute queries with a high degree of complexity;
Give answers to critical business questions.
Benchmark results are highly dependent upon workload, specific application requirements, and system design and implementation. Relative system performance will vary as a result of these and other factors. Therefore, Hadoop-DS should not be used as a substitute for specific customer application benchmarking when critical capacity planning and/or product evaluation decisions are contemplated.
[1] TPC Benchmark and TPC-DS are trademarks of the Transaction Processing Performance Council (TPC).
[2] The latest revision of the TPC-DS specification can be found at http://www.tpc.org/tpcds/default.asp
Motivation
Good benchmarks reflect, in a practical way, an abstraction of the essential elements of real customer workloads. Consequently, the aim of this project was to create a benchmark for SQL over Hadoop products that reflects a scenario common to many organizations adopting the technology today. The most common scenario we see involves moving subsets of workloads from the traditional relational data warehouse to SQL over Hadoop solutions (a process commonly referred to as Warehouse Augmentation). For this reason our Hadoop-DS workload was modeled on the existing relational TPC-DS benchmark.
The TPC-DS benchmark uses relational database management systems (RDBMSs) to model a decision support system that examines large volumes of data and gives answers to real-world business questions by executing queries of various complexity (such as ad-hoc, reporting, OLAP and data mining type queries). It is therefore an ideal fit to mimic the experience of an organization porting parts of its workload from a traditional warehouse housed on an RDBMS to a SQL over Hadoop technology. As highlighted in IBM's research paper "Benchmarking SQL-on-Hadoop Systems: TPC or not TPC?" [3], SQL over Hadoop solutions are in the "wild west" of benchmarking. Many vendors use the data generators and queries of existing TPC benchmarks, but cherry-pick the parts of the benchmark most likely to highlight their own strengths – thus making comparison between results impossible and meaningless.
To reflect real-world situations, Hadoop-DS does not cherry-pick the parts of the TPC-DS workload that highlight Big SQL's strengths. Instead, we included all parts of the TPC-DS workload that are appropriate for SQL over Hadoop solutions: data loading, single-user performance and multi-user performance. Since TPC-DS is a benchmark designed for relational database engines, some aspects of the benchmark are not applicable to SQL over Hadoop solutions – broadly speaking, the "Data Maintenance" and "Data Persistence" sections of the benchmark. Consequently these sections were omitted from our Hadoop-DS workload.
The TPC-DS benchmark also defines restrictions related to real-life situations – such as preventing the vendor from changing the queries to include additional predicates based on a customized partitioning schema, employing query-specific tuning mechanisms (such as optimizer hints), or making configuration changes between the single and multi-user tests. We endeavored to stay within the bounds of these restrictions for the Hadoop-DS workload and conducted the comparison with candor and due diligence. To validate our candor, we retained the services of Infosizing [4], an established and respected benchmark auditing firm with multiple TPC certified auditors, including one with TPC-DS certification, to review and audit the benchmarking results.
It is important to note that this is not an official TPC-DS benchmark result, since aspects of the standard benchmark that do not apply to SQL over Hadoop solutions were not implemented. However, the independent review of the environment and results by an official auditor shows IBM's commitment to openness and fair play in this arena. All deviations from the TPC-DS standard benchmark are noted in the attached auditor's attestation letter in Appendix F. In addition, all the information required to reproduce the environment and the Hadoop-DS workload is published in the various Appendices of this document – thus allowing any vendor or third party to independently execute the benchmark and verify the results.
[3] "Benchmarking SQL-on-Hadoop Systems: TPC or not TPC?" http://researcher.ibm.com/researcher/files/us-aflorat/BenchmarkingSQL-on-Hadoop.pdf
[4] Infosizing: http://www.infosizing.com
Benchmark Methodology
In this section we provide a high-level overview of the Hadoop-DS benchmark process.
There are three key stages in the Hadoop-DS benchmark (which reflect similar stages in the TPC-DS benchmark). They are:
1. Data load
2. Query generation and validation
3. Performance test
The flow diagram in Figure 1 outlines these three key stages when conducting the Hadoop-DS benchmark:
These stages are outlined below. For a detailed description of each phase refer to the “Design and Implementation” section of this document.
1. Data Load
The load phase of the benchmark includes all operations required to bring the System Under Test (SUT) to a state where the Performance Test phase can begin. This includes all hardware provisioning and configuration, storage setup, software installation (including the OS), verifying cluster operation, generating the raw data and copying it to HDFS, cluster tuning, and all steps required to create and populate the database in order to bring the system into a state ready to accept queries. All desired tuning of the SUT must be completed before the end of the LOAD phase.
Once the tables are created and loaded with the raw data, relationships between tables such as primary-foreign key relationships and corresponding referential integrity constraints can be defined. Finally, statistics are collected. These statistics help the Big SQL query optimizer generate efficient access plans during the performance test. The SUT is now ready for the Performance Test phase.
2. Query Generation and Validation
Before the Performance Test phase can begin, the queries must be generated and validated by executing each query against a qualification database and comparing the results with a pre-defined answer set.
There are 99 queries in the Hadoop-DS benchmark. Queries are automatically generated from query templates which contain substitution parameters. Specific parameter values depend on both the context in which the query is run (scale factor, single- or multi-stream) and the seed for the random number generator. The seed used was the end time of the timed LOAD phase of the benchmark.
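The queries themselves are produced by the dsqgen utility from the standard TPC-DS toolkit. The exact invocation used for this benchmark is not reproduced here; the following is an illustrative sketch, in which the seed value, SQL dialect and directory names are assumptions:

# illustrative dsqgen invocation; -STREAMS 5 yields stream 0 (single-user)
# plus the 4 multi-user streams, and -RNGSEED carries the LOAD-phase end time
./dsqgen -DIRECTORY query_templates \
         -INPUT query_templates/templates.lst \
         -SCALE 30000 \
         -STREAMS 5 \
         -RNGSEED 1016210230 \
         -DIALECT db2 \
         -OUTPUT_DIR generated_queries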
Figure 1: High Level Procedure
Figure 2: Load Phase
3. Performance Test
This is the timed phase of the Hadoop-DS benchmark. This phase consists of a single-stream performance test followed immediately by a multi-stream performance test.
In the single-stream performance test, a single query stream is executed against the database, and the total elapsed time for all 99 queries is measured.
In the multi-stream performance test, multiple query streams are executed against the database, and the total elapsed time from the start of the first query to the completion of the last query is measured.
The multi-stream performance test is started immediately following the completion of the single-stream test. There can be no modifications to the system under test, or components restarted between these performance tests.
The following steps are used to implement the performance test:
Single-Stream Test
1. Stream 0 Execution
Multi-Stream (all steps conducted in parallel)
1. Stream 1 Execution
2. Stream 2 Execution
3. Stream 3 Execution
4. Stream 4 Execution
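A minimal sketch of this sequencing is shown below; run_stream.sh is a hypothetical wrapper that submits the 99 generated queries for one stream and is not part of the published benchmark kit:

./run_stream.sh 0            # single-stream test: stream 0 runs alone
for s in 1 2 3 4
do
    ./run_stream.sh ${s} &   # multi-stream test: streams 1-4 run in parallel
done
wait                         # elapsed time spans first query start to last query end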
Hadoop-DS uses the "Hadoop-DS Qph" metric to report query performance. The Hadoop-DS Qph metric is the effective query throughput, measured as the number of queries executed over a period of time. A primary factor in the metric is the scale factor (SF) – the size of the data set – which is used to scale the raw performance numbers. Scaling the metric to the database size guards against the fact that cluster hardware does not always scale linearly, and helps differentiate large clusters from small ones (since performance is typically a function of cluster size).
A Hadoop-DS Qph metric is calculated for each of the single and multi-user runs using the following formula:
Hadoop-DS Qph @ SF = ( (SF/100) * Q * S ) / T
Where:
• SF is the scale factor used in GB (30,000 in our benchmark). SF is divided by 100 in order to normalize the results using 100GB as the baseline.
• Q is the total number of queries successfully executed
• S is the number of streams (1 for the single user run)
• T is the duration of the run measured in hours (with a resolution up to one second)
Hadoop-DS Qph metrics are reported at a specific scale factor. For example, 'Hadoop-DS Qph @ 30TB' represents the effective throughput of the SQL over Hadoop solution against a 30TB database.
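As a worked example using the elapsed times reported in Figure 4: the single-stream run completed Q = 99 queries (S = 1) in 29:00:45, i.e. T ≈ 29.0125 hours, and the multi-stream run completed S = 4 streams of 99 queries each in 52:13:13, i.e. T ≈ 52.2203 hours:

Single-stream: ( (30,000/100) * 99 * 1 ) / 29.0125 ≈ 1,023 Hadoop-DS Qph @ 30TB
Multi-stream: ( (30,000/100) * 99 * 4 ) / 52.2203 ≈ 2,274 Hadoop-DS Qph @ 30TB

These reproduce the figures reported in the Results section, and the ratio of the elapsed times (52.2203 / 29.0125 ≈ 1.8) is the multi-user scalability factor highlighted in Chart 2.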
Design and Implementation
This section provides a more detailed description of the configuration used in this implementation of the benchmark.
Hardware
The benchmark was conducted on a 17 node cluster with each node being an IBM x3650BD server. A complete specification of the hardware used can be found in Appendix A: Cluster Topology and Hardware Configuration.
Physical Database Design
In line with Big SQL best practices, a single ext4 filesystem was created on each disk used to store data, on all nodes in the cluster (including the master). Once mounted, three directories were created on each filesystem: one for the HDFS data directory, one for the Map-Red cache and one for the Big SQL data directory. This configuration simplifies the disk layout and evenly spreads the I/O of all components across all available disks – and consequently provides good performance. For detailed information, refer to the "Installation options" and "OS storage" sections of Appendix C.
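A minimal sketch of this layout is shown below; the mount points and directory names are assumptions, and the actual values are documented in Appendix C:

# one ext4 filesystem per data disk; three directories on each so that
# HDFS, the Map-Red cache and Big SQL spread their I/O across all disks
for disk in /data1 /data2 /data3 /data4 /data5 /data6 /data7 /data8 /data9
do
    mkdir -p ${disk}/hdfs ${disk}/mapred-cache ${disk}/bigsql
done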
The default HDFS replication factor of 3 was used to replicate HDFS blocks between nodes in the cluster. No other replication (at the filesystem or database level) was used.
Parquet was chosen as the storage format for the Big SQL tables. Parquet is the optimal format for Big SQL 3.0 – both in terms of performance and disk space consumption. The Parquet storage format has Snappy compression enabled by default in Big SQL.
Logical Database Design
Big SQL's Parquet storage format does not support the DATE or TIME data types, so VARCHAR(10) and VARCHAR(16) were used respectively. As a result, a small number of queries required these columns to be CAST to the appropriate DATE or TIME types in order for date arithmetic to be performed upon them.
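A hypothetical fragment illustrating this kind of modification is shown below; the literal date is illustrative, and the actual modified queries appear in Appendix E:

-- d_date is stored as VARCHAR(10) in this schema, so it is cast to DATE
-- before the 30-day date arithmetic is applied
select d_date_sk
from date_dim
where cast(d_date as date) between cast('2000-01-27' as date)
                               and cast('2000-01-27' as date) + 30 days;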
Other than the natural scatter partitioning provided by HDFS, no other explicit horizontal or vertical partitioning was implemented.
For detailed information on the DDL used to create the tables, see Appendix B.
Data Generation
Although data generation is not a timed operation, generation of the raw data and its copy to HDFS were parallelized across the data nodes in the cluster to improve efficiency. After generation, a directory exists on HDFS for each table in the schema. This directory is used as the source for the LOAD command.
Database Population
Tables were populated using the data generated and stored on HDFS during the data generation phase. The tables were loaded sequentially, one after the other. Tables were populated using the Big SQL LOAD command, which uses Map-Reduce jobs to read the source data and populate the target file using the specified storage format (Parquet in our case). The number of Map-Reduce tasks used to load each of the large fact tables was individually tuned via the num.map.tasks property in order to improve load performance.
Details of the LOAD command used can be found in Appendix B.
Referential Integrity / Informational Constraints
In a traditional data warehouse, referential integrity ("RI") constraints are often employed to ensure that relationships between tables are maintained over time, as additional data is loaded and/or existing data is refreshed.
While today's "big data" products support RI constraints, they do not have the maturity required to enforce them, as ingest and lookup capabilities are optimized for large parallel scans, not singleton lookups. As a result, care should be taken to enforce RI at the source, or during the ETL process. A good example is moving data from an existing data warehouse to a big data platform: if the RI constraints existed in the source RDBMS, then it is safe to create equivalent informational constraints on the big data platform, because the constraints were already enforced by the RDBMS.
The presence of these constraints in the schema provides valuable information to the query compiler / optimizer, so these RI constraints are still created, but are left unenforced. Unenforced RI constraints are often called informational constraints ("IC").
Big SQL supports the use of Informational Constraints, and ICs were created for every PK and FK relationship in the TPC-DS schema. Appendix B provides full details on all informational constraints created.
Statistics
As Big SQL uses a cost-based query optimizer, the presence of accurate statistics is essential. Statistics can be gathered in various forms, ranging from simple cardinality statistics on tables, to distribution statistics for single columns and groups of columns within a table, to “statistical views” that provide cardinality and distribution statistics across join products.
For this test, a combination of all of these methods was employed.
a. cardinality statistics collected on all tables
b. distribution statistics collected on all columns
c. group distribution statistics collected for composite PKs in all 7 fact tables
d. statistical views created on a combination of join predicates
See Appendix B for full details of the statistics collected for the Hadoop-DS benchmark.
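For illustration only, the sketch below shows the general shape of these operations; the table, column and view names are examples, and the exact statements used for the benchmark are those listed in Appendix B:

-- cardinality and distribution statistics on selected columns
analyze table HADOOPDS30000G_PARQ.store_sales
    compute statistics for columns ss_item_sk, ss_ticket_number, ss_sold_date_sk;

-- a statistical view over a common join, enabled for the optimizer
create view sv_store_sales_date as
    select d.*
    from store_sales ss, date_dim d
    where ss.ss_sold_date_sk = d.d_date_sk;
alter view sv_store_sales_date enable query optimization;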
Query Generation and Validation
Since there are many variations of SQL dialects, the specification allows the sponsor to make pre-defined minor modifications to the queries so they can be successfully compiled and executed. In Big SQL, 87 of the 99 queries worked directly from the generated query templates. The other 12 queries required only minor modifications (mainly type casts), which took less than one hour to complete. Chart 1 shows the query breakdown.
Full query text for all 99 queries used during the single-stream run can be found in Appendix E.
Performance Test
Once the data is loaded, statistics gathered and queries generated, the performance phase of the benchmark can commence. During the performance phase, a single-stream run is executed, followed immediately by a multi-stream run. For the multi-stream run, four query streams were used.
Chart 1: Big SQL 3.0 query breakdown at 30TB
Results
Figures 3 and 4 summarize the results for executing the Hadoop-DS benchmark against Big SQL using a scale factor of 30TB.
Figure 3: Big SQL Results for Hadoop-DS @ 30TB

IBM System x3650BD with IBM BigInsights Big SQL v3.0
Hadoop-DS (*)                                           October 24, 2014

Single-Stream Performance:  1,023 Hadoop-DS Qph @ 30TB
Multi-Stream Performance:   2,274 Hadoop-DS Qph @ 30TB

Database Size: 30 TB
Query Engine: IBM BigInsights Big SQL v3.0
Operating System: Red Hat Enterprise Linux Server Release 6.4

System Components (per cluster node)
Component                    Quantity   Description
Processors/Cores/Threads     2/20/40    Intel Xeon E5-2680 v2, 2.80GHz, 25MB L3 Cache
Memory                       8          16GB ECC DDR3 1866MHz LRDIMM
Disk Controllers             1          IBM ServeRAID-M5210 SAS/SATA Controller
Disk Drives                  10         2TB SATA 3.5” HDD (HDFS)
                             1          128GB SATA 2.5” SSD (Swap)
Network                      1          Onboard dual-port GigE Adapter

Cluster topology: 1 Master Node and 16 Data Nodes on a 10GigE network

This implementation of the Hadoop-DS benchmark was audited by Francois Raab of Infosizing (www.sizing.com)
(*) The Hadoop-DS benchmark is derived from TPC Benchmark DS (TPC-DS) and is not comparable to published TPC-DS results.
TPC Benchmark is a trademark of the Transaction Processing Performance Council.
IBM System x3650BD with IBM BigInsights Big SQL v3.0
Hadoop-DS (*)                                           November 14, 2014

Start and End Times
Test             Start Date   Start Time   End Date   End Time   Elapsed Time
Single Stream    10/15/14     16:01:45     10/16/14   21:02:30   29:00:45
Multi-Stream     10/16/14     21:41:50     10/19/14   01:55:03   52:13:13

Number of Query Streams for Multi-Stream Test: 4

Figure 4: Big SQL Elapsed Times for Hadoop-DS @ 30TB
Of particular note is the fact that 4 concurrent query streams (and therefore 4 times as many queries) take only 1.8x longer than a single query stream. Chart 2 highlights Big SQL's impressive multi-user scalability.
Chart 2: Big SQL Multi-user Scalability using 4 Query Streams @30TB
Benchmark Audit
Auditor
This implementation of the IBM Hadoop-DS Benchmark was audited by Francois Raab of Infosizing.
Further information regarding the audit process may be obtained from:
InfoSizing
531 Crystal Hills Boulevard
Manitou Springs, CO 80829
Telephone: (719) 473-7555
Attestation Letter
The auditor's attestation letter can be found in Appendix F.
Summary
The Hadoop-DS benchmark takes a data warehousing workload from an RDBMS and ports it to IBM's SQL over Hadoop solution – namely Big SQL 3.0. Porting workloads from existing warehouses to SQL over Hadoop is a common scenario for organizations seeking to reduce the cost of their existing data warehousing platforms. For this reason the Hadoop-DS workload was modeled on the Transaction Processing Performance Council Benchmark DS (TPC-DS). The services of a TPC approved auditor were secured to review the benchmark process and results.
The results of this benchmark demonstrate the ease with which existing data warehouses can be augmented with IBM Big SQL 3.0. They highlight how Big SQL is able to deliver rich SQL with outstanding performance, on a large data set with multiple concurrent users.
These findings will be compelling to organizations augmenting data warehouse environments with Hadoop-based technologies. Strict SQL compliance can translate into significant cost savings by allowing customers to leverage existing investments in databases, applications and skills, and to take advantage of SQL over Hadoop with minimal disruption to existing environments. Enterprise customers cannot afford to have different dialects of SQL across different data management platforms. This benchmark shows that IBM's Big SQL demonstrates a high degree of SQL language compatibility with existing RDBMS workloads.
Not only is IBM Big SQL compatible with existing RDBMSs, it also demonstrates very good performance and scalability for a SQL over Hadoop solution. This means that customers can realize business results faster, ask more complex questions, and achieve greater efficiency per unit of investment in infrastructure. All of these factors help provide a competitive advantage.
The performance and SQL language richness demonstrated throughout this paper show that IBM Big SQL is the industry-leading SQL over Hadoop solution available today.
Appendix A: Cluster Topology and Hardware Configuration
Figure 5: IBM System x3650BD M4
The measured configuration was a cluster of 17 identical IBM System x3650BD M4 servers with 1 master node and 16 data nodes. Each contained:
CPU: 2 x Intel Xeon E5-2680 v2 @ 2.80GHz (10 cores per socket, Hyper-Threading enabled), for a total of 40 logical CPUs
Memory: 128 GB RAM at 1866 MHz
Storage: 10 x 2TB 3.5” SATA HDD, 7200RPM. One disk for the OS, 9 for data
Storage: 4 x 128GB SSD. A single SSD was used for OS swap; the other SSDs were not used
Network: Dual-port 10 Gb Ethernet
OS: Red Hat Enterprise Linux Server release 6.4 (Santiago)
Appendix B: Create and Load Tables
Create Flat files
Scripts:
001.gen-data-v3-tpcds.sh
002.gen-data-v3-tpcds-forParquet.sh
These scripts are essentially wrappers around dsdgen that generate the TPC-DS flat files. They support parallel data generation, as well as generating the data directly in HDFS (through the use of named pipes) rather than first staging the flat files on a local disk.
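The scripts themselves are not reproduced in this report. The sketch below illustrates the named-pipe technique for one chunk of one table; the paths, option values and part-file naming are assumptions, with the FIFO standing in for dsdgen's parallel output file <table>_<child>_<parallel>.dat:

# dsdgen writes its chunk of the table into a FIFO instead of a flat file,
# and the FIFO is streamed straight into HDFS with no local staging
TABLE=store_sales    # table to generate
CHILD=$1             # this chunk's number (1..PARALLEL)
PARALLEL=160         # total number of parallel chunks
FIFO=/tmp/${TABLE}_${CHILD}_${PARALLEL}.dat

mkfifo ${FIFO}
./dsdgen -SCALE 30000 -TABLE ${TABLE} -PARALLEL ${PARALLEL} -CHILD ${CHILD} \
         -DIR /tmp -FORCE Y &
cat ${FIFO} | hadoop fs -put - /HADOOPDS30000G_PARQ/${TABLE}/part_${CHILD}
rm -f ${FIFO}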
Create and Load Tables
040.create-tables-parquet.jsq:
The following DDL was used to create the tables in Big SQL:
set schema $schema;
create hadoop table call_center
(
cc_call_center_sk bigint not null,
cc_call_center_id varchar(16) not null,
cc_rec_start_date varchar(10) ,
cc_rec_end_date varchar(10) ,
cc_closed_date_sk bigint ,
cc_open_date_sk bigint ,
cc_name varchar(50) ,
cc_class varchar(50) ,
cc_employees bigint ,
cc_sq_ft bigint ,
cc_hours varchar(20) ,
cc_manager varchar(40) ,
cc_mkt_id bigint ,
cc_mkt_class varchar(50) ,
cc_mkt_desc varchar(100) ,
cc_market_manager varchar(40) ,
cc_division bigint ,
cc_division_name varchar(50) ,
cc_company bigint ,
cc_company_name varchar(50) ,
cc_street_number varchar(10) ,
cc_street_name varchar(60) ,
cc_street_type varchar(15) ,
cc_suite_number varchar(10) ,
cc_city varchar(60) ,
cc_county varchar(30) ,
cc_state varchar(2) ,
cc_zip varchar(10) ,
cc_country varchar(20) ,
cc_gmt_offset double ,
cc_tax_percentage double
)
STORED AS PARQUETFILE;
create hadoop table catalog_page
(
cp_catalog_page_sk bigint not null,
cp_catalog_page_id varchar(16) not null,
cp_start_date_sk bigint ,
cp_end_date_sk bigint ,
cp_department varchar(50) ,
cp_catalog_number bigint ,
cp_catalog_page_number bigint ,
cp_description varchar(100) ,
cp_type varchar(100)
)
STORED AS PARQUETFILE;

-- ... (the create statements for the remaining tables are not reproduced
-- in this excerpt; the script resumes part-way through the web_sales table) ...
ws_coupon_amt double ,
ws_ext_ship_cost double ,
ws_net_paid double ,
ws_net_paid_inc_tax double ,
ws_net_paid_inc_ship double ,
ws_net_paid_inc_ship_tax double ,
ws_net_profit double
)
STORED AS PARQUETFILE;
create hadoop table web_site
(
web_site_sk bigint not null,
web_site_id varchar(16) not null,
web_rec_start_date varchar(10) ,
web_rec_end_date varchar(10) ,
web_name varchar(50) ,
web_open_date_sk bigint ,
web_close_date_sk bigint ,
web_class varchar(50) ,
web_manager varchar(40) ,
web_mkt_id bigint ,
web_mkt_class varchar(50) ,
web_mkt_desc varchar(100) ,
web_market_manager varchar(40) ,
web_company_id bigint ,
web_company_name varchar(50) ,
web_street_number varchar(10) ,
web_street_name varchar(60) ,
web_street_type varchar(15) ,
web_suite_number varchar(10) ,
web_city varchar(60) ,
web_county varchar(30) ,
web_state varchar(2) ,
web_zip varchar(10) ,
web_country varchar(20) ,
web_gmt_offset double ,
web_tax_percentage double
)
STORED AS PARQUETFILE;
commit;
045.load-tables.jsq:
The following script was used to load the flat files into Big SQL in Parquet format:
set schema $schema;
load hadoop using file url '/HADOOPDS30000G_PARQ/call_center' with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true') into table call_center overwrite WITH LOAD PROPERTIES ('num.map.tasks'='1');
load hadoop using file url '/HADOOPDS30000G_PARQ/catalog_page' with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true') into table catalog_page overwrite WITH LOAD PROPERTIES ('num.map.tasks'='1');
load hadoop using file url '/HADOOPDS30000G_PARQ/catalog_returns' with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true') into table catalog_returns overwrite WITH LOAD PROPERTIES ('num.map.tasks'='425');
load hadoop using file url '/HADOOPDS30000G_PARQ/catalog_sales' with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true') into table catalog_sales overwrite WITH LOAD PROPERTIES ('num.map.tasks'='4250');
load hadoop using file url '/HADOOPDS30000G_PARQ/customer' with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true') into table customer overwrite WITH LOAD PROPERTIES ('num.map.tasks'='1');
load hadoop using file url '/HADOOPDS30000G_PARQ/customer_address' with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true') into table customer_address overwrite WITH LOAD PROPERTIES ('num.map.tasks'='1');
load hadoop using file url '/HADOOPDS30000G_PARQ/customer_demographics' with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true') into table customer_demographics overwrite WITH LOAD PROPERTIES ('num.map.tasks'='1');
load hadoop using file url '/HADOOPDS30000G_PARQ/date_dim' with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true') into table date_dim overwrite WITH LOAD PROPERTIES ('num.map.tasks'='1');
load hadoop using file url '/HADOOPDS30000G_PARQ/household_demographics' with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true') into table household_demographics overwrite WITH LOAD PROPERTIES ('num.map.tasks'='1');
load hadoop using file url '/HADOOPDS30000G_PARQ/income_band' with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true') into table income_band overwrite WITH LOAD PROPERTIES ('num.map.tasks'='1');
load hadoop using file url '/HADOOPDS30000G_PARQ/inventory' with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true') into table inventory overwrite WITH LOAD PROPERTIES ('num.map.tasks'='160');
load hadoop using file url '/HADOOPDS30000G_PARQ/item' with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true') into table item overwrite WITH LOAD PROPERTIES ('num.map.tasks'='1');
load hadoop using file url '/HADOOPDS30000G_PARQ/promotion' with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true') into table promotion overwrite WITH LOAD PROPERTIES ('num.map.tasks'='1');
load hadoop using file url '/HADOOPDS30000G_PARQ/reason' with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true') into table reason overwrite WITH LOAD PROPERTIES ('num.map.tasks'='1');
load hadoop using file url '/HADOOPDS30000G_PARQ/ship_mode' with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true') into table ship_mode overwrite WITH LOAD PROPERTIES ('num.map.tasks'='1');
load hadoop using file url '/HADOOPDS30000G_PARQ/store' with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true') into table store overwrite WITH LOAD PROPERTIES ('num.map.tasks'='1');
load hadoop using file url '/HADOOPDS30000G_PARQ/store_returns' with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true') into table store_returns overwrite WITH LOAD PROPERTIES ('num.map.tasks'='700');
load hadoop using file url '/HADOOPDS30000G_PARQ/store_sales' with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true') into table store_sales overwrite WITH LOAD PROPERTIES ('num.map.tasks'='5500');
load hadoop using file url '/HADOOPDS30000G_PARQ/time_dim' with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true') into table time_dim overwrite WITH LOAD PROPERTIES ('num.map.tasks'='1');
load hadoop using file url '/HADOOPDS30000G_PARQ/warehouse/' with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true') into table warehouse overwrite WITH LOAD PROPERTIES ('num.map.tasks'='1');
load hadoop using file url '/HADOOPDS30000G_PARQ/web_page' with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true') into table web_page overwrite WITH LOAD PROPERTIES ('num.map.tasks'='1');
load hadoop using file url '/HADOOPDS30000G_PARQ/web_returns' with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true') into table web_returns overwrite WITH LOAD PROPERTIES ('num.map.tasks'='200');
load hadoop using file url '/HADOOPDS30000G_PARQ/web_sales' with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true') into table web_sales overwrite WITH LOAD PROPERTIES ('num.map.tasks'='2000');
load hadoop using file url '/HADOOPDS30000G_PARQ/web_site' with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true') into table web_site overwrite WITH LOAD PROPERTIES ('num.map.tasks'='1');
046.load-files-individually.sh:
Since the benchmark uses 1GB block sizes for the Parquet files, any table which is smaller than 16GB but still significant in size (at least 1GB) will not be spread across all data nodes in the cluster when using the load script above: with 1GB blocks, such a table produces fewer blocks than the 16 data nodes. For this reason, the flat files for customer, customer_address and inventory were generated in several pieces (one file per data node). Each file was then loaded individually in order to spread the files and blocks across as many of the data nodes as possible. This allows Big SQL to fully parallelize a table scan against these tables across all data nodes.
The following script was used to achieve this distribution for the 3 tables mentioned:
FLATDIR=$1
TABLE=$2
FILE="046.load-files-individually-${TABLE}.jsq"
i=0
schema=HADOOPDS30000G_PARQ

rm -rf ${FILE}
echo "set schema $schema;" >> ${FILE}
echo >> ${FILE}

# list the flat files for this table and emit one LOAD statement per file:
# the first file overwrites the table, subsequent files append to it
hadoop fs -ls ${FLATDIR} | grep -v Found | awk '{print $8}' |
while read f
do
    if [[ $i == 0 ]] ; then
        echo "load hadoop using file url '$f' with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true') into table ${TABLE} overwrite ;" >> ${FILE}
        i=1
    else
        echo "load hadoop using file url '$f' with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true') into table ${TABLE} append ;" >> ${FILE}
    fi
done
055.ri.jsq:
Primary Key (PK) and Foreign Key (FK) constraints cannot be enforced by Big SQL, but "not enforced" constraints can be used to give the optimizer added information when it is considering access plans.
The following informational constraints were used in the environment (all PK + FK relationships outlined in the TPC-DS specification):
set schema $schema;
------------------------------------------------------------
-- primary key definitions
------------------------------------------------------------
alter table call_center
add primary key (cc_call_center_sk)
not enforced enable query optimization;
commit work;
alter table catalog_page
add primary key (cp_catalog_page_sk)
not enforced enable query optimization;
commit work;
alter table catalog_returns
add primary key (cr_item_sk, cr_order_number)
not enforced enable query optimization;
commit work;
alter table catalog_sales
add primary key (cs_item_sk, cs_order_number)
not enforced enable query optimization;
commit work;
alter table customer
add primary key (c_customer_sk)
not enforced enable query optimization;
commit work;
alter table customer_address
add primary key (ca_address_sk)
not enforced enable query optimization;
commit work;
alter table customer_demographics
add primary key (cd_demo_sk)
not enforced enable query optimization;
commit work;
alter table date_dim
add primary key (d_date_sk)
not enforced enable query optimization;
commit work;
alter table household_demographics
add primary key (hd_demo_sk)
not enforced enable query optimization;
commit work;
alter table income_band
add primary key (ib_income_band_sk)
not enforced enable query optimization;
commit work;
alter table inventory
add primary key (inv_date_sk, inv_item_sk, inv_warehouse_sk)
not enforced enable query optimization;
commit work;
alter table item
add primary key (i_item_sk)
not enforced enable query optimization;
commit work;
alter table promotion
add primary key (p_promo_sk)
not enforced enable query optimization;
commit work;
alter table reason
add primary key (r_reason_sk)
not enforced enable query optimization;
commit work;
alter table ship_mode
add primary key (sm_ship_mode_sk)
not enforced enable query optimization;
commit work;
alter table store
add primary key (s_store_sk)
not enforced enable query optimization;
commit work;
alter table store_returns
add primary key (sr_item_sk, sr_ticket_number)
not enforced enable query optimization;
commit work;
alter table store_sales
add primary key (ss_item_sk, ss_ticket_number)
not enforced enable query optimization;
commit work;
alter table time_dim
add primary key (t_time_sk)
not enforced enable query optimization;
commit work;
alter table warehouse
add primary key (w_warehouse_sk)
not enforced enable query optimization;
commit work;
alter table web_page
add primary key (wp_web_page_sk)
not enforced enable query optimization;
commit work;
alter table web_returns
add primary key (wr_item_sk, wr_order_number)
not enforced enable query optimization;
commit work;
alter table web_sales
add primary key (ws_item_sk, ws_order_number)
not enforced enable query optimization;
commit work;
alter table web_site
add primary key (web_site_sk)
not enforced enable query optimization;
commit work;
------------------------------------------------------------
-- foreign key definitions
------------------------------------------------------------
-- tables with no FKs
-- customer_address
-- customer_demographics
-- item
-- date_dim
-- warehouse
-- ship_mode
-- time_dim
-- reason
-- income_band
alter table promotion
add constraint fk1 foreign key (p_start_date_sk)
references date_dim (d_date_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table promotion
add constraint fk2 foreign key (p_end_date_sk)
references date_dim (d_date_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table promotion
add constraint fk3 foreign key (p_item_sk)
references item (i_item_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table store
add constraint fk foreign key (s_closed_date_sk)
references date_dim (d_date_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table call_center
add constraint fk1 foreign key (cc_closed_date_sk)
references date_dim (d_date_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table call_center
add constraint fk2 foreign key (cc_open_date_sk)
references date_dim (d_date_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table customer
add constraint fk1 foreign key (c_current_cdemo_sk)
references customer_demographics (cd_demo_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table customer
add constraint fk2 foreign key (c_current_hdemo_sk)
references household_demographics (hd_demo_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table customer
add constraint fk3 foreign key (c_current_addr_sk)
references customer_address (ca_address_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table customer
add constraint fk4 foreign key (c_first_shipto_date_sk)
references date_dim (d_date_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table customer
add constraint fk5 foreign key (c_first_sales_date_sk)
references date_dim (d_date_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_site
add constraint fk1 foreign key (web_open_date_sk)
references date_dim (d_date_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_site
add constraint fk2 foreign key (web_close_date_sk)
references date_dim (d_date_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_page
add constraint fk1 foreign key (cp_start_date_sk)
references date_dim (d_date_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_page
add constraint fk2 foreign key (cp_end_date_sk)
references date_dim (d_date_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table household_demographics
add constraint fk foreign key (hd_income_band_sk)
references income_band (ib_income_band_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_page
add constraint fk1 foreign key (wp_creation_date_sk)
references date_dim (d_date_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_page
add constraint fk2 foreign key (wp_access_date_sk)
references date_dim (d_date_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_page
add constraint fk3 foreign key (wp_customer_sk)
references customer (c_customer_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table store_sales
add constraint fk1 foreign key (ss_sold_date_sk)
references date_dim (d_date_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table store_sales
add constraint fk2 foreign key (ss_sold_time_sk)
references time_dim (t_time_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table store_sales
add constraint fk3a foreign key (ss_item_sk)
references item (i_item_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table store_sales
add constraint fk4 foreign key (ss_customer_sk)
references customer (c_customer_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table store_sales
add constraint fk5 foreign key (ss_cdemo_sk)
references customer_demographics (cd_demo_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table store_sales
add constraint fk6 foreign key (ss_hdemo_sk)
references household_demographics (hd_demo_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table store_sales
add constraint fk7 foreign key (ss_addr_sk)
references customer_address (ca_address_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table store_sales
add constraint fk8 foreign key (ss_store_sk)
references store (s_store_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table store_sales
add constraint fk9 foreign key (ss_promo_sk)
references promotion (p_promo_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table store_returns
add constraint fk1 foreign key (sr_returned_date_sk)
references date_dim (d_date_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table store_returns
add constraint fk2 foreign key (sr_return_time_sk)
references time_dim (t_time_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table store_returns
add constraint fk3a foreign key (sr_item_sk)
references item (i_item_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table store_returns
add constraint fk3b foreign key (sr_item_sk, sr_ticket_number)
references store_sales (ss_item_sk, ss_ticket_number) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table store_returns
add constraint fk4 foreign key (sr_customer_sk)
references customer (c_customer_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table store_returns
add constraint fk5 foreign key (sr_cdemo_sk)
references customer_demographics (cd_demo_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table store_returns
add constraint fk6 foreign key (sr_hdemo_sk)
references household_demographics (hd_demo_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table store_returns
add constraint fk7 foreign key (sr_addr_sk)
references customer_address (ca_address_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table store_returns
add constraint fk8 foreign key (sr_store_sk)
references store (s_store_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table store_returns
add constraint fk9 foreign key (sr_reason_sk)
references reason (r_reason_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_sales
add constraint fk1 foreign key (cs_sold_date_sk)
references date_dim (d_date_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_sales
add constraint fk2 foreign key (cs_sold_time_sk)
references time_dim (t_time_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_sales
add constraint fk3 foreign key (cs_ship_date_sk)
references date_dim (d_date_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_sales
add constraint fk4 foreign key (cs_bill_customer_sk)
references customer (c_customer_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_sales
add constraint fk5 foreign key (cs_bill_cdemo_sk)
references customer_demographics (cd_demo_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_sales
add constraint fk6 foreign key (cs_bill_hdemo_sk)
references household_demographics (hd_demo_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_sales
add constraint fk7 foreign key (cs_bill_addr_sk)
references customer_address (ca_address_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_sales
add constraint fk8 foreign key (cs_ship_customer_sk)
references customer (c_customer_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_sales
add constraint fk9 foreign key (cs_ship_cdemo_sk)
references customer_demographics (cd_demo_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_sales
add constraint fk10 foreign key (cs_ship_hdemo_sk)
references household_demographics (hd_demo_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_sales
add constraint fk11 foreign key (cs_ship_addr_sk)
references customer_address (ca_address_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_sales
add constraint fk12 foreign key (cs_call_center_sk)
references call_center (cc_call_center_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_sales
add constraint fk13 foreign key (cs_catalog_page_sk)
references catalog_page (cp_catalog_page_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_sales
add constraint fk14 foreign key (cs_ship_mode_sk)
references ship_mode (sm_ship_mode_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_sales
add constraint fk15 foreign key (cs_warehouse_sk)
references warehouse (w_warehouse_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_sales
add constraint fk16a foreign key (cs_item_sk)
references item (i_item_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_sales
add constraint fk17 foreign key (cs_promo_sk)
references promotion (p_promo_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_returns
add constraint fk1 foreign key (cr_returned_date_sk)
references date_dim (d_date_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_returns
add constraint fk2 foreign key (cr_returned_time_sk)
references time_dim (t_time_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_returns
add constraint fk3 foreign key (cr_item_sk, cr_order_number)
references catalog_sales (cs_item_sk, cs_order_number) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_returns
add constraint fk4 foreign key (cr_item_sk)
references item (i_item_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_returns
add constraint fk5 foreign key (cr_refunded_customer_sk)
references customer (c_customer_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_returns
add constraint fk6 foreign key (cr_refunded_cdemo_sk)
references customer_demographics (cd_demo_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_returns
add constraint fk7 foreign key (cr_refunded_hdemo_sk)
references household_demographics (hd_demo_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_returns
add constraint fk8 foreign key (cr_refunded_addr_sk)
references customer_address (ca_address_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_returns
add constraint fk9 foreign key (cr_returning_customer_sk)
references customer (c_customer_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_returns
add constraint fk10 foreign key (cr_returning_cdemo_sk)
references customer_demographics (cd_demo_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_returns
add constraint fk11 foreign key (cr_returning_hdemo_sk)
references household_demographics (hd_demo_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_returns
add constraint fk12 foreign key (cr_returning_addr_sk)
references customer_address (ca_address_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_returns
add constraint fk13 foreign key (cr_call_center_sk)
references call_center (cc_call_center_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_returns
add constraint fk14 foreign key (cr_catalog_page_sk)
references catalog_page (cp_catalog_page_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_returns
add constraint fk15 foreign key (cr_ship_mode_sk)
references ship_mode (sm_ship_mode_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_returns
add constraint fk16 foreign key (cr_warehouse_sk)
references warehouse (w_warehouse_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table catalog_returns
add constraint fk17 foreign key (cr_reason_sk)
references reason (r_reason_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_sales
add constraint fk1 foreign key (ws_sold_date_sk)
references date_dim (d_date_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_sales
add constraint fk2 foreign key (ws_sold_time_sk)
references time_dim (t_time_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_sales
add constraint fk3 foreign key (ws_ship_date_sk)
references date_dim (d_date_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_sales
add constraint fk4a foreign key (ws_item_sk)
references item (i_item_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_sales
add constraint fk5 foreign key (ws_bill_customer_sk)
references customer (c_customer_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_sales
add constraint fk6 foreign key (ws_bill_cdemo_sk)
references customer_demographics (cd_demo_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_sales
add constraint fk7 foreign key (ws_bill_hdemo_sk)
references household_demographics (hd_demo_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_sales
add constraint fk8 foreign key (ws_bill_addr_sk)
references customer_address (ca_address_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_sales
add constraint fk9 foreign key (ws_ship_customer_sk)
references customer (c_customer_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_sales
add constraint fk10 foreign key (ws_ship_cdemo_sk)
references customer_demographics (cd_demo_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_sales
add constraint fk11 foreign key (ws_ship_hdemo_sk)
references household_demographics (hd_demo_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_sales
add constraint fk12 foreign key (ws_ship_addr_sk)
references customer_address (ca_address_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_sales
add constraint fk13 foreign key (ws_web_page_sk)
references web_page (wp_web_page_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_sales
add constraint fk14 foreign key (ws_web_site_sk)
references web_site (web_site_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_sales
add constraint fk15 foreign key (ws_ship_mode_sk)
references ship_mode (sm_ship_mode_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_sales
add constraint fk16 foreign key (ws_warehouse_sk)
references warehouse (w_warehouse_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_sales
add constraint fk17 foreign key (ws_promo_sk)
references promotion (p_promo_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_returns
add constraint fk1 foreign key (wr_returned_date_sk)
references date_dim (d_date_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_returns
add constraint fk2 foreign key (wr_returned_time_sk)
references time_dim (t_time_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_returns
add constraint fk3a foreign key (wr_item_sk)
references item (i_item_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_returns
add constraint fk3b foreign key (wr_item_sk, wr_order_number)
references web_sales (ws_item_sk, ws_order_number) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_returns
add constraint fk4 foreign key (wr_refunded_customer_sk)
references customer (c_customer_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_returns
add constraint fk5 foreign key (wr_refunded_cdemo_sk)
references customer_demographics (cd_demo_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_returns
add constraint fk6 foreign key (wr_refunded_hdemo_sk)
references household_demographics (hd_demo_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_returns
add constraint fk7 foreign key (wr_refunded_addr_sk)
references customer_address (ca_address_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_returns
add constraint fk8 foreign key (wr_returning_customer_sk)
references customer (c_customer_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_returns
add constraint fk9 foreign key (wr_returning_cdemo_sk)
references customer_demographics (cd_demo_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_returns
add constraint fk10 foreign key (wr_returning_hdemo_sk)
references household_demographics (hd_demo_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_returns
add constraint fk11 foreign key (wr_returning_addr_sk)
references customer_address (ca_address_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_returns
add constraint fk12 foreign key (wr_web_page_sk)
references web_page (wp_web_page_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table web_returns
add constraint fk13 foreign key (wr_reason_sk)
references reason (r_reason_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table inventory
add constraint fk1 foreign key (inv_date_sk)
references date_dim (d_date_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table inventory
add constraint fk2 foreign key (inv_item_sk)
references item (i_item_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
alter table inventory
add constraint fk3 foreign key (inv_warehouse_sk)
references warehouse (w_warehouse_sk) NOT ENFORCED ENABLE QUERY OPTIMIZATION;
commit work;
Collect Statistics
060.analyze-withCGS.jsq:
The following script was used to collect statistics for the database.
Distribution statistics were collected for every column in the database, and group distribution statistics were collected for the composite primary keys of the seven fact tables.
set schema $schema;
ANALYZE TABLE call_center COMPUTE STATISTICS FOR COLUMNS cc_call_center_sk, cc_call_center_id, cc_rec_start_date, cc_rec_end_date, cc_closed_date_sk, cc_open_date_sk, cc_name, cc_class, cc_employees, cc_sq_ft, cc_hours, cc_manager, cc_mkt_id, cc_mkt_class, cc_mkt_desc, cc_market_manager, cc_division, cc_division_name, cc_company, cc_company_name, cc_street_number, cc_street_name, cc_street_type, cc_suite_number, cc_city, cc_county, cc_state, cc_zip, cc_country, cc_gmt_offset, cc_tax_percentage;
ANALYZE TABLE catalog_page COMPUTE STATISTICS FOR COLUMNS cp_catalog_page_sk, cp_catalog_page_id, cp_start_date_sk, cp_end_date_sk, cp_department, cp_catalog_number, cp_catalog_page_number, cp_description, cp_type;
ANALYZE TABLE catalog_returns COMPUTE STATISTICS FOR COLUMNS cr_returned_date_sk, cr_returned_time_sk, cr_item_sk, cr_refunded_customer_sk, cr_refunded_cdemo_sk, cr_refunded_hdemo_sk, cr_refunded_addr_sk, cr_returning_customer_sk, cr_returning_cdemo_sk, cr_returning_hdemo_sk, cr_returning_addr_sk, cr_call_center_sk, cr_catalog_page_sk, cr_ship_mode_sk, cr_warehouse_sk, cr_reason_sk, cr_order_number, cr_return_quantity, cr_return_amount, cr_return_tax, cr_return_amt_inc_tax, cr_fee, cr_return_ship_cost, cr_refunded_cash, cr_reversed_charge, cr_store_credit, cr_net_loss, (cr_item_sk, cr_order_number);
ANALYZE TABLE catalog_sales COMPUTE STATISTICS FOR COLUMNS cs_sold_date_sk, cs_sold_time_sk, cs_ship_date_sk, cs_bill_customer_sk, cs_bill_cdemo_sk, cs_bill_hdemo_sk, cs_bill_addr_sk, cs_ship_customer_sk, cs_ship_cdemo_sk, cs_ship_hdemo_sk, cs_ship_addr_sk, cs_call_center_sk, cs_catalog_page_sk, cs_ship_mode_sk, cs_warehouse_sk, cs_item_sk, cs_promo_sk, cs_order_number, cs_quantity, cs_wholesale_cost, cs_list_price, cs_sales_price, cs_ext_discount_amt, cs_ext_sales_price, cs_ext_wholesale_cost, cs_ext_list_price, cs_ext_tax, cs_coupon_amt, cs_ext_ship_cost, cs_net_paid, cs_net_paid_inc_tax, cs_net_paid_inc_ship, cs_net_paid_inc_ship_tax, cs_net_profit, (cs_item_sk, cs_order_number);
ANALYZE TABLE customer COMPUTE STATISTICS FOR COLUMNS c_customer_sk, c_customer_id, c_current_cdemo_sk, c_current_hdemo_sk, c_current_addr_sk, c_first_shipto_date_sk, c_first_sales_date_sk, c_salutation, c_first_name, c_last_name, c_preferred_cust_flag, c_birth_day, c_birth_month, c_birth_year, c_birth_country, c_login, c_email_address, c_last_review_date;
ANALYZE TABLE customer_address COMPUTE STATISTICS FOR COLUMNS ca_address_sk, ca_address_id, ca_street_number, ca_street_name, ca_street_type, ca_suite_number, ca_city, ca_county, ca_state, ca_zip, ca_country, ca_gmt_offset, ca_location_type;
ANALYZE TABLE customer_demographics COMPUTE STATISTICS FOR COLUMNS cd_demo_sk, cd_gender, cd_marital_status, cd_education_status, cd_purchase_estimate, cd_credit_rating, cd_dep_count, cd_dep_employed_count, cd_dep_college_count;
web_street_name, web_street_type, web_suite_number, web_city, web_county, web_state, web_zip, web_country, web_gmt_offset, web_tax_percentage;
064.statviews.sh:
This script was used to create “statviews” and collect statistics about them.
The statviews give Big SQL's optimizer more information about joins on PK-FK columns. Only a subset of the joins is modeled.
DBNAME=$1
schema=$2
db2 connect to ${DBNAME}
db2 -v set schema ${schema}
# workaround for bug with statviews
# need to select from any random table at the beginning of the connection
# or we'll get a -901 during runstats on CS_GVIEW or CR_GVIEW
db2 -v "select count(*) from date_dim"
db2 -v "drop view cr_gview"
db2 -v "drop view cs_gview"
db2 -v "drop view sr_gview"
db2 -v "drop view ss_gview"
db2 -v "drop view wr_gview"
db2 -v "drop view ws_gview"
db2 -v "drop view c_gview"
db2 -v "drop view inv_gview"
db2 -v "drop view sv_date_dim"
db2 -v "create view CR_GVIEW (c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13, c14, c15, c16, c17, c18, c19, c20, c21, c22, c23, c24, c25, c26, c27, c28, c29, c30, c31, c32, c33, c34, c35, c36, c37, c38, c39, c40, c41, c42, c43, c44, c45, c46, c47, c48, c49, c50, c51, c52, c53, c54, c55, c56, c57, c58, c59, c60, c61, c62, c63, c64, c65, c66, c67, c68, c69, c70, c71, c72, c73, c74, c75, c76, c77, c78, c79, c80, c81, c82, c83,c84, c85, c86, c87, c88, c89, c90, c91, c92, c93, c94, c95, c96, c97, c98, c99, d_d_date) as
(
select T2.*, T3.*, T4.*, T5.*, T6.*, T7.*, DATE(T5.D_DATE) as D_D_DATE
from CATALOG_RETURNS as T1,
CATALOG_PAGE as T2, CUSTOMER_ADDRESS as T3, CUSTOMER as T4,
DATE_DIM as T5, CUSTOMER_ADDRESS as T6, CUSTOMER as T7
where T1.CR_CATALOG_PAGE_SK = T2.CP_CATALOG_PAGE_SK and
T1.CR_REFUNDED_ADDR_SK = T3.CA_ADDRESS_SK and
T1.CR_REFUNDED_CUSTOMER_SK = T4.C_CUSTOMER_SK and
T1.CR_RETURNED_DATE_SK = T5.D_DATE_SK and
T1.CR_RETURNING_ADDR_SK = T6.CA_ADDRESS_SK and
T1.CR_RETURNING_CUSTOMER_SK = T7.C_CUSTOMER_SK
)"
db2 -v "create view CS_GVIEW (c1, c2, c3, c4, c5,c6, c7, c8, c9, c10, c11, c12, c13, c14, c15, c16, c17, c18, c19, c20, c21, c22, c23, c24, c25, c26, c27, c28, c29, c30, c31, c32, c33, c34, c35, c36, c37, c38, c39, c40, c41, c42, c43, c44, c45, c46, c47, c48, c49, c50, c51, c52, c53, c54, c55, c56, c57, c58, c59, c60, c61, c62, c63, c64, c65, c66, c67, c68, c69, c70, c71, c72, c73, c74, c75, c76, c77, c78, c79, c80, c81, c82, c83,c84, c85, c86, c87, c88, c89, c90, c91, c92, c93, c94, c95, c96, c97, c98, c99, c100, c101, d_d_date1, d_d_date2) as
(
select T2.*, T3.*, T4.*, T5.*, T6.*, DATE(T4.D_DATE) as D_D_DATE1, DATE(T6.D_DATE) as D_D_DATE2
from CATALOG_SALES as T1,
CUSTOMER as T2, CATALOG_PAGE as T3, DATE_DIM as T4,
CUSTOMER as T5, DATE_DIM as T6
where T1.CS_BILL_CUSTOMER_SK = T2.C_CUSTOMER_SK and
T1.CS_CATALOG_PAGE_SK = T3.CP_CATALOG_PAGE_SK and
T1.CS_SHIP_DATE_SK = T4.D_DATE_SK and
T1.CS_SHIP_CUSTOMER_SK = T5.C_CUSTOMER_SK and
T1.CS_SOLD_DATE_SK = T6.D_DATE_SK
)"
db2 -v "create view SR_GVIEW as
(
select T2.*, T3.*, T4.*, T5.*, DATE(T3.D_DATE) as D_D_DATE
from STORE_RETURNS as T1,
CUSTOMER as T2, DATE_DIM as T3, TIME_DIM as T4, STORE as T5
where T1.SR_CUSTOMER_SK = T2.C_CUSTOMER_SK and
T1.SR_RETURNED_DATE_SK = T3.D_DATE_SK and
T1.SR_RETURN_TIME_SK = T4.T_TIME_SK and
T1.SR_STORE_SK = T5.S_STORE_SK
)"
db2 -v "create view SS_GVIEW as
(
select T2.*, T3.*, T4.*, DATE(T2.D_DATE) as D_D_DATE
from STORE_SALES as T1,
DATE_DIM as T2, TIME_DIM as T3, STORE as T4
where T1.SS_SOLD_DATE_SK = T2.D_DATE_SK and
T1.SS_SOLD_TIME_SK = T3.T_TIME_SK and
T1.SS_STORE_SK = T4.S_STORE_SK
)"
db2 -v "create view WR_GVIEW (c1, c2, c3, c4, c5,c6, c7, c8, c9, c10, c11, c12, c13, c14, c15, c16, c17, c18, c19, c20, c21, c22, c23, c24, c25, c26, c27, c28, c29, c30, c31, c32, c33, c34, c35, c36, c37, c38, c39, c40, c41, c42, c43, c44, c45, c46, c47, c48, c49, c50, c51, c52, c53, c54, c55, c56, c57, c58, c59, c60, c61, c62, c63, c64, c65, c66, c67, c68, c69, c70, c71, c72, c73, c74, c75, c76, c77, c78, c79, c80, c81, c82, c83,c84, c85, c86, c87, c88, c89, c90, c91, c92, c93, c94, c95, c96, c97, c98, c99, c100, c101, c102, c103, c104, c105, c106, c107, c108, D_D_DATE) as
(
select T2.*, T3.*, T4.*, T5.*, T6.*, T7.*, T8.*, DATE(T5.D_DATE) as D_D_DATE
from WEB_RETURNS as T1,
CUSTOMER_ADDRESS as T2, CUSTOMER_DEMOGRAPHICS as T3, CUSTOMER as T4, DATE_DIM as T5,
CUSTOMER_ADDRESS as T6, CUSTOMER_DEMOGRAPHICS as T7, CUSTOMER as T8
where T1.WR_REFUNDED_ADDR_SK = T2.CA_ADDRESS_SK and
T1.WR_REFUNDED_CDEMO_SK = T3.CD_DEMO_SK and
T1.WR_REFUNDED_CUSTOMER_SK = T4.C_CUSTOMER_SK and
T1.WR_RETURNED_DATE_SK = T5.D_DATE_SK and
T1.WR_RETURNING_ADDR_SK = T6.CA_ADDRESS_SK and
T1.WR_RETURNING_CDEMO_SK = T7.CD_DEMO_SK and
T1.WR_RETURNING_CUSTOMER_SK = T8.C_CUSTOMER_SK
)"
db2 -v "create view WS_GVIEW (c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13, c14, c15, c16, c17, c18, c19, c20, c21, c22, c23, c24, c25, c26, c27, c28, c29, c30, c31, c32, c33, c34, c35, c36, c37, c38, c39, c40, c41, c42, c43, c44, c45, c46, c47, c48, c49, c50, c51, c52, c53, c54, c55, c56, c57, c58, c59, c60, c61, c62, c63, c64, c65, c66, c67, c68, c69, c70, c71, c72, c73, c74, c75, c76, c77, c78, c79, c80, c81, c82, c83,c84, c85, c86, c87, c88, c89, c90, c91, c92, D_D_DATE, E_D_DATE) as
(
select T2.*, T3.*, T4.*, T5.*, DATE(T3.D_DATE) as D_D_DATE, DATE(T5.D_DATE) as E_D_DATE
from WEB_SALES as T1,
CUSTOMER as T2, DATE_DIM as T3, CUSTOMER as T4, DATE_DIM as T5
where T1.WS_BILL_CUSTOMER_SK = T2.C_CUSTOMER_SK and
T1.WS_SHIP_CUSTOMER_SK = T4.C_CUSTOMER_SK and
T1.WS_SHIP_DATE_SK = T3.D_DATE_SK and
T1.WS_SOLD_DATE_SK = T5.D_DATE_SK
)"
db2 -v "create view C_GVIEW (c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13, c14, c15, c16, c17, c18, c19, c20, c21, c22, c23, c24, c25, c26, c27, c28, c29, c30, c31, c32, c33, c34, c35, c36, c37, c38, c39, c40, c41, c42, c43, c44, c45, c46, c47, c48, c49, c50, c51, c52, c53, c54, c55, c56, c57, c58, c59, c60, c61, c62, c63, c64, c65, c66, c67, c68, c69, c70, c71, c72, c73, c74, c75, c76, c77, c78, D_D_DATE, E_D_DATE) as
(
select T2.*, T3.*, T4.*, T5.*, DATE(T4.D_DATE) as D_D_DATE, DATE(T5.D_DATE) as E_D_DATE
from CUSTOMER as T1,
CUSTOMER_ADDRESS as T2, CUSTOMER_DEMOGRAPHICS as T3, DATE_DIM as T4, DATE_DIM as T5
where T1.C_CURRENT_ADDR_SK = T2.CA_ADDRESS_SK and
T1.C_CURRENT_CDEMO_SK = T3.CD_DEMO_SK and
T1.C_FIRST_SALES_DATE_SK = T4.D_DATE_SK and
T1.C_FIRST_SHIPTO_DATE_SK = T5.D_DATE_SK
)"
db2 -v "create view INV_GVIEW as (select T2.*, DATE(T2.D_DATE) as D_D_DATE from INVENTORY as T1, DATE_DIM as T2 where T1.INV_DATE_SK=T2.D_DATE_SK)"
db2 -v "create view SV_DATE_DIM as (select date(d_date) as d_d_date from DATE_DIM)"
db2 -v "alter view CR_GVIEW enable query optimization"
db2 -v "alter view CS_GVIEW enable query optimization"
db2 -v "alter view SR_GVIEW enable query optimization"
db2 -v "alter view SS_GVIEW enable query optimization"
db2 -v "alter view WR_GVIEW enable query optimization"
db2 -v "alter view WS_GVIEW enable query optimization"
db2 -v "alter view C_GVIEW enable query optimization"
db2 -v "alter view INV_GVIEW enable query optimization"
db2 -v "alter view SV_DATE_DIM enable query optimization"
# workaround for bug with statviews
# need to run the first runstats twice or we don't actually get any stats
time db2 -v "runstats on table SV_DATE_DIM with distribution"
time db2 -v "runstats on table SV_DATE_DIM with distribution"
time db2 -v "runstats on table CR_GVIEW with distribution tablesample BERNOULLI(1)"
time db2 -v "runstats on table CS_GVIEW with distribution tablesample BERNOULLI(1)"
time db2 -v "runstats on table SR_GVIEW with distribution tablesample BERNOULLI(1)"
time db2 -v "runstats on table SS_GVIEW with distribution tablesample BERNOULLI(1)"
time db2 -v "runstats on table WR_GVIEW with distribution tablesample BERNOULLI(1)"
time db2 -v "runstats on table WS_GVIEW with distribution tablesample BERNOULLI(1)"
time db2 -v "runstats on table C_GVIEW with distribution tablesample BERNOULLI(1)"
time db2 -v "runstats on table INV_GVIEW with distribution tablesample BERNOULLI(1)"
db2 commit
db2 terminate
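As a quick verification (again, not part of the original kit), the collected statistics can be inspected from the catalog; a minimal sketch, assuming runstats populates CARD and STATS_TIME for statistical views just as it does for tables:
db2 -v "select tabname, card, stats_time from syscat.tables where tabname like '%GVIEW' or tabname = 'SV_DATE_DIM'"
A NULL STATS_TIME (or a CARD of -1) would indicate the runstats workaround above did not take effect.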
Appendix C: Tuning
Installation options:
During installation, the following Big SQL properties were set. The node resource percentage was set to 90% in order to dedicate as much of the cluster's resources as possible to Big SQL:
Big SQL administrator user: bigsql
Big SQL FCM start port: 62000
Big SQL 1 server port: 7052
Scheduler service port: 7053
Scheduler administration port: 7054
Big SQL server port: 51000
Node resources percentage: 90%
The following disk layout is in accordance with current BigInsights and Big SQL 3.0 best practices, which recommend distributing all I/O for the Hadoop cluster across all disks:
BigSQL2 data directory: /data1/db2/bigsql,/data2/db2/bigsql,/data3/db2/bigsql,/data4/db2/bigsql,/data5/db2/bigsql,/data6/db2/bigsql,/data7/db2/bigsql,/data8/db2/bigsql,/data9/db2/bigsql
Cache directory: /data1/hadoop/mapred/local,/data2/hadoop/mapred/local,/data3/hadoop/mapred/local,/data4/hadoop/mapred/local,/data5/hadoop/mapred/local,/data6/hadoop/mapred/local,/data7/hadoop/mapred/local,/data8/hadoop/mapred/local,/data9/hadoop/mapred/local
DataNode data directory: /data1/hadoop/hdfs/data,/data2/hadoop/hdfs/data,/data3/hadoop/hdfs/data,/data4/hadoop/hdfs/data,/data5/hadoop/hdfs/data,/data6/hadoop/hdfs/data,/data7/hadoop/hdfs/data,/data8/hadoop/hdfs/data,/data9/hadoop/hdfs/data
Big SQL tuning options:
## Configured for 128 GB of memory per node
## 30 GB bufferpool
## 3.125 GB sortheap / 50 GB sheapthres_shr
## reader memory: 20% of total memory by default (user can raise it to 30%)
##
## other useful conf changes:
## mapred-site.xml
## mapred.tasktracker.map.tasks.maximum=20
## mapred.tasktracker.reduce.tasks.maximum=6
## mapreduce.map.java.opts="-Xmx3000m ..."
## mapreduce.reduce.java.opts="-Xmx3000m ..."
##
## bigsql-conf.xml
## dfsio.num_scanner_threads=12
## dfsio.read_size=4194304
## dfsio.num_threads_per_disk=2
## scheduler.client.request.timeout=600000
DBNAME=$1
db2 connect to ${DBNAME}
db2 -v "call syshadoop.big_sql_service_mode('on')"
db2 -v "alter bufferpool IBMDEFAULTBP size 891520 "
db2 -v "alter tablespace TEMPSPACE1 no file system caching"
db2 -v "update db cfg for ${DBNAME} using sortheap 819200 sheapthres_shr 13107200"
db2 -v "update db cfg for ${DBNAME} using dft_degree 8"
db2 -v "update dbm cfg using max_querydegree ANY"
db2 -v "update dbm cfg using aslheapsz 15"
db2 -v "update dbm cfg using cpuspeed 1.377671e-07"
db2 -v "update dbm cfg using INSTANCE_MEMORY 85"
db2 -v "update dbm cfg using CONN_ELAPSE 18"
## Disable auto maintenance
db2 -v "update db cfg for bigsql using AUTO_MAINT OFF AUTO_TBL_MAINT OFF AUTO_RUNSTATS OFF AUTO_STMT_STATS OFF"
db2 terminate
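The sizing comments at the top of the script can be reconciled with these values: sortheap and sheapthres_shr are expressed in 4 KB pages, so 819200 × 4 KB = 3.125 GB of sort memory per sort and 13107200 × 4 KB = 50 GB of shared sort memory. Assuming IBMDEFAULTBP uses a 32 KB page size (an assumption; the page size is not shown in the script), 891520 pages × 32 KB comes to roughly 29 GB, i.e. the ~30 GB bufferpool noted above.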
BigInsights mapred-site.xml tuning:
The following changes were made to the Hadoop mapred-site.xml file to tune the number of map-reduce slots and the maximum memory allocated to those slots (the tuned values appear below alongside the commented-out defaults). In Big SQL, Map-Reduce is used only for the LOAD and ANALYZE commands, not for query execution; the properties were tuned to get the best possible performance from these commands.
<property>
<!-- The maximum number of map tasks that will be run simultaneously by a
task tracker. Default: 2. Recommendations: set relevant to number of
CPUs and amount of memory on each data node. -->
<name>mapred.tasktracker.map.tasks.maximum</name>
<!--value><%= Math.max(2, Math.ceil(0.66 * Math.min(numOfDisks, numOfCores, totalMem/1000) * 1.75) - 2) %></value-->
<value>20</value>
</property>
<property>
<!-- The maximum number of reduce tasks that will be run simultaneously by
a task tracker. Default: 2. Recommendations: set relevant to number of
CPUs and amount of memory on each data node, note that reduces usually
take more memory and do more I/O than maps. -->
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<!--value><%= Math.max(2, Math.ceil(0.33 * Math.min(numOfDisks, numOfCores, totalMem/1000) * 1.75) - 2)%></value-->
<value>6</value>
</property>
<property>
<!-- Max heap of child JVM spawned by tasktracker. Ideally as large as the
task machine can afford. The default -Xmx200m is usually too small. -->
<name>mapreduce.map.java.opts</name>
<value>-Xmx3000m -Xms1000m -Xmn100m -Xtune:virtualized -Xshareclasses:name=mrscc_%g,groupAccess,cacheDir=/var/ibm/biginsights/hadoop/tmp,nonFatal -Xscmx20m -Xdump:java:file=/var/ibm/biginsights/hadoop/tmp/javacore.%Y%m%d.%H%M%S.%pid.%seq.txt -Xdump:heap:file=/var/ibm/biginsights/hadoop/tmp/heapdump.%Y%m%d.%H%M%S.%pid.%seq.phd</value>
</property>
<property>
<!-- Max heap of child JVM spawned by tasktracker. Ideally as large as the
task machine can afford. The default -Xmx200m is usually too small. -->
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx3000m -Xms1000m -Xmn100m -Xtune:virtualized -Xshareclasses:name=mrscc_%g,groupAccess,cacheDir=/var/ibm/biginsights/hadoop/tmp,nonFatal -Xscmx20m -Xdump:java:file=/var/ibm/biginsights/hadoop/tmp/javacore.%Y%m%d.%H%M%S.%pid.%seq.txt -Xdump:heap:file=/var/ibm/biginsights/hadoop/tmp/heapdump.%Y%m%d.%H%M%S.%pid.%seq.phd</value>
</property>
Big SQL dfs reader options:
The following properties were changed in the Big SQL bigsql-conf.xml file to tune dfs reader properties:
<property>
<!-- Number of threads reading from each disk.
Set this to 0 to use default values. -->
<name>dfsio.num_threads_per_disk</name>
<value>2</value>
<!--value>0</value-->
</property>
<property>
<!-- Read Size (in bytes) - Size of the reads sent to Hdfs (i.e., also the max I/O read buffer size).
Default is 8*1024*1024 = 8388608 bytes -->
<name>dfsio.read_size</name>
<value>4194304</value>
<!--value>8388608</value-->
</property>
…..
<property>
<!-- (Advanced) Cap on the number of scanner threads that will be created.
If set to 0, the system decides. -->
<name>dfsio.num_scanner_threads</name>
<value>12</value>
</property>
Big SQL dfs logging:
The minLogLevel property was changed in the Big SQL glog-dfsio.properties file to reduce the amount of logging by the dfs readers:
glog_enabled=true
log_dir=/var/ibm/biginsights/bigsql/logs
log_filename=bigsql-ndfsio.log
# 0 - INFO
# 1 - WARN
# 2 - ERROR
# 3 - FATAL
minloglevel=3
OS Storage:
The following script was used to create ext4 filesystems on all of the data disks, on all nodes in the cluster (including the master), in line with Big SQL best practices.
Note: a single SSD was used for swap during the test; the remaining SSDs were unused.
#!/bin/bash
# READ / WRITE Performance tests for EXT4 file systems
# Author - Stewart Tate, tates@us.ibm.com
# Copyright (C) 2013, IBM Corp. All rights reserved.:
#################################################################
# the following is server-unique and MUST be adjusted!      #
#################################################################
drives=(b g h i j k l m n)
SSDdrives=(c d e f)
echo "Create EXT4 file systems, version 130213b"
echo " "
pause()
{
sleep 2
}
# make ext4 file systems on HDDs
echo "Create EXT4 file systems on HDDs"
for dev_range in ${drives[@]}
do
echo "y" | mkfs.ext4 -b 4096 -O dir_index,extent /dev/sd$dev_range
done
for dev_range in ${drives[@]}
do
parted /dev/sd$dev_range print
done
pause
# make ext4 file systems on SSDs
echo "Create EXT4 file systems on SSDs"
for dev_range in ${SSDdrives[@]}
do
echo "y" | mkfs.ext4 -b 4096 -O dir_index,extent /dev/sd$dev_range
done
for dev_range in ${SSDdrives[@]}
do
parted /dev/sd$dev_range print
echo "Partitions aligned(important for performance) if following returns 0:"
blockdev --getalignoff /dev/sd$dev_range
done
exit
The filesystems are then mounted using the following script:
#!/bin/bash
# READ / WRITE Performance tests for EXT4 file systems
# Author - Stewart Tate, tates@us.ibm.com
# Copyright (C) 2013, IBM Corp. All rights reserved.:
#################################################################
# the following is server-unique and MUST be adjusted!      #
#################################################################
drives=(b g h i j k l m n)
SSDdrives=(c d e f)
echo "Mount EXT4 file systems, version 130213b"
echo " "
pause()
{
sleep 2
}
j=0
echo "Create EXT4 mount points for HDDs"
for i in ${drives[@]}
do
let j++
mkdir /data$j
mount -vs -t ext4 -o nobarrier,noatime,nodiratime,nobh,nouser_xattr,data=writeback,commit=100 /dev/sd$i /data$j
done
j=0
echo "Create EXT4 mount points for SSDs"
for i in ${SSDdrives[@]}
do
let j++
mkdir /datassd$j
mount -vs -t ext4 -o nobarrier,noatime,nodiratime,discard,nobh,nouser_xattr,data=writeback,commit=100 /dev/sd$i /datassd$j
done
echo "Done."
exit
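Note that mounts made this way do not persist across reboots. A hypothetical /etc/fstab entry for the first HDD (an illustration only; no fstab entries appear in the original report, and the device and mount point must match the arrays above):
/dev/sdb  /data1  ext4  nobarrier,noatime,nodiratime,nobh,nouser_xattr,data=writeback,commit=100  0 0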
OS kernel changes:
echo 0 > /proc/sys/vm/swappiness
echo "net.ipv6.conf.all.disable_ipv6 = 1" >> /etc/sysctl.conf
Active Hadoop components:
To free up valuable resources on the cluster, only the following BigInsights components were started during the single- and multi-stream runs: bigsql, Hadoop, hive, catalog, zookeeper, and console.
Appendix D: Scaling and Database Population
The following table details the cardinality of each table along with its on-disk size in the Parquet format.
Table                      Cardinality    Size on disk (in bytes)
call_center                         60                      13373
catalog_page                     46000                    2843783
catalog_returns             4319733388               408957106444
catalog_sales              43198299410              4094116501072
customer                      80000000                 4512791675
customer_address              40000000                  986555270
customer_demographics          1920800                    7871742
date_dim                         73049                    1832116
household_demographics            7200                      30851
income_band                         20                        692
inventory                   1627857000                 8646662210
item                            462000                   46060058
promotion                         2300                     116753
reason                              72                       1890
ship_mode                           20                       1497
store                             1704                     154852
store_returns               8639847757               656324500809
store_sales                86400432613              5623520800788
time_dim                         86400                    1134623
warehouse                           27                       3782
web_page                          4602                      86318
web_returns                 2160007345               205828528963
web_sales                  21600036511              2017966660709
web_site                            84                      15977
TOTAL                                              13020920276247
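For context, the 13,020,920,276,247-byte total works out to roughly 13.0 TB of Parquet data for the nominal 30 TB scale factor, i.e. a compression ratio of roughly 2.3:1 relative to the raw generated data.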