My presentation slides from Hadoop Summit, San Jose, June 28, 2016. See live video at http://www.makedatauseful.com/vid-solving-performance-problems-hadoop/ and follow along for context.
Moving analytic workloads into production: specific technical challenges and best practices for engineering SQL-in-Hadoop solutions, highlighting the next-generation engineering approaches behind the secret sauce we have implemented in the Actian VectorH database.
Strata San Jose 2017 - Ben Sharma Presentation (Zaloni)
Learn about the promise of data lakes:
- Store all types of data in its raw format
- Create refined, standardized, trusted datasets for various use cases
- Store data for longer periods of time to enable historical analysis
- Query and access the data using a variety of methods
- Manage streaming and batch data in a converged platform
- Provide shorter time-to-insight with proper data management and governance
Big Data at Geisinger Health System: Big Wins in a Short Time (DataWorks Summit)
Geisinger Health System is well known in the healthcare community as a pioneer in data and analytics. We have had an Electronic Health Record (EHR) since 1996, and an Electronic Data Warehouse (EDW) since 2008. Much of daily and weekly operational reporting, as well as an abundance of ad hoc analytics, come from the EDW.
Approximately 18 months ago, the Data Management team implemented Hadoop in the Hortonworks Data Platform (HDP), and successes in implementation and development have proven to the organization that we should abandon the traditional EDW in favor of the Big Data (HDP) platform.
In less than 18 months, we stood up the platform, created a data ingestion pipeline, duplicated all source feeds from the EDW into HDP, and had several analytics developed with HDP and Tableau. Furthermore, we have exploited the new capabilities of the platform, where we use Natural Language Processing (NLP) to interrogate valuable (but previously hidden) clinical notes. The new platform has data that is modeled and governed, setting the stage to push Geisinger Health System from a pioneer to a leader in Big Data and Analytics.
This session will focus on Hortonworks Data Platform, covering data architecture, security, data process flow, and development. It is geared toward Data Architects, Data Scientists, and Operations/I.T. audiences.
Processing transactions is at the core of any bank’s business. Danske Bank’s journey started with recognising the value that could be gleaned from generating insights from the data to improve customer behaviour analytics. Today, the company streams large volumes of transactional data in near-real time onto its Hortonworks Data Platform to improve fraud detection and customer marketing. In this session, Nadeem will outline the bank’s vision, how it was socialised across the executive board and the resulting sponsorship, the technological path, the challenges overcome, and the results, which have not only improved the customer experience but also delivered quantifiable fraud metrics and opened new revenue streams. Furthermore, Nadeem will cover future use cases around maintenance and operations.
Not Just a necessary evil, it’s good for business: implementing PCI DSS contr... (DataWorks Summit)
For firms in the financial industry, especially within regulated organizations such as credit card processors and banks, PCI DSS compliance has become a business and operational necessity. Although the blueprint of a PCI-compliant architecture varies from organization to organization, the mixture of modern Hadoop-based data lakes and legacy systems are a common theme.
In this talk, we will discuss recent updates to PCI DSS and how significant portions of PCI DSS compliance controls can be achieved using the open source Hadoop security stack and technologies for the Hadoop ecosystem. We will provide a broad overview of implementing key aspects of the PCI DSS standards at WorldPay, such as encryption management, data protection with anonymization, separation of duties, and deployment considerations regarding securing the Hadoop clusters at the network layer, from a practitioner’s perspective. The talk will provide patterns and practices that map current Hadoop security capabilities to the security controls that a PCI-compliant environment requires.
Speaker
David Walker, Enterprise Data Platform Programme Director, Worldpay
Srikanth Venkat, Senior Director Product Management, Hortonworks
Big Data: Architecture and Performance Considerations in Logical Data Lakes (Denodo)
This presentation explains in detail what a Data Lake Architecture looks like and how data virtualization fits into the Logical Data Lake, and goes over some performance tips. It also includes an example demonstrating this model's performance.
This presentation is part of the Fast Data Strategy Conference, and you can watch the video here goo.gl/9Jwfu6.
This deck covers Microsoft Analytics Platform System (APS), formerly known as Parallel Data Warehouse (PDW). It is based on massively parallel processing technology and can typically reduce your OLAP workloads by 98%.
APS AU3 is a phenomenal technology based on SQL Server 2014 and costs a fraction of a comparable Netezza or Teradata system.
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse (DataWorks Summit)
Yahoo Mail has 200+ million users a month and generates hundreds of terabytes of data per day, which continues to grow steadily. The nature of email messages has also evolved: for example, today the majority of them are generated by machines, consisting of newsletters, social media notifications, purchase invoices, travel bookings, and the like, which drove innovations in product development to help users organize their inboxes.
Since 2014, the Yahoo Mail Data Engineering team took on the task of revamping the Mail data warehouse and analytics infrastructure in order to drive the continued growth and evolution of Yahoo Mail. Along the way we have built a 50 PB Hadoop warehouse, and surrounding analytics and machine learning programs that have transformed the way data plays in Yahoo Mail.
In this session we will share our experience from this three-year journey, from the system architecture and the analytics systems built to the lessons learned from development and the drive for adoption.
It Takes a Village: Organizational Alignment to Deliver Big Data Value in Hea... (DataWorks Summit)
The business and technology teams within a health insurer must align the company’s central data platform with its data strategy. That requires substantial organizational alignment. Hear the firsthand perspective from Health Care Service Corporation (HCSC), the largest customer-owned health insurance company in the United States. The speaker will cover how they integrated membership information, regulatory compliance, and the general ledger, to improve overall healthcare management. At HCSC, the strong alignment between executive leadership, business portfolio direction, architectural strategy, technology delivery, and program management have helped create leading-edge capabilities which help the company respond nimbly to a quickly evolving healthcare industry.
Insights into Real World Data Management Challenges (DataWorks Summit)
Data is your most valuable business asset, and it's also your biggest challenge. This challenge and opportunity means we continually face significant roadblocks on the way to becoming a data driven organisation. From the management of data, to the bubbling open source frameworks, to the limited industry skills, to mounting time and cost pressures, our challenge in data is big.
We all want and need a “fit for purpose” approach to the management of data, especially Big Data, and overcoming the ongoing challenges around the ‘3Vs’ means we get to focus on the most important V: ‘Value’. Come along and join the discussion on how Oracle Big Data Cloud provides value in the management of data and supports your move toward becoming a data driven organisation.
Speaker
Noble Raveendran, Principal Consultant, Oracle
Multi-tenant Hadoop - the challenge of maintaining high SLAs (DataWorks Summit)
In a shared configuration, the same Hadoop environment supports many applications. Each has specific requirements and criticality (SLA), yet they all rely on an assembly of shared application building blocks.
At the same time, the life cycle of a cluster is not static: it evolves horizontally, with the arrival of new applications, but also vertically, as the applications grow in load or evolve in functionality.
With this in mind, a multi-tenant production cluster presents several challenges, including but not limited to:
- Maintaining a high level of SLA for a set of use cases with heterogeneous needs
- Planning and implementing the architecture evolution of a cluster in production to ensure the maintenance of SLAs throughout the integration of new use cases
EDF will present how it manages this heterogeneity of SLAs, inherent in any Big Data cluster, focusing on how it is renovating its cluster, its organization, its processes, and its approach in order to deliver a platform with strong SLAs throughout its life cycle.
Speaker
Edouard Rousseaux, Tech Lead, EDF
Big data ingest frameworks ship with an array of connectors for common data origins and destinations, such as flat files, S3, HDFS, Kafka, etc., but sometimes you need to send data to, or receive data from, a system that's not on the list. StreamSets includes template code for building your own connectors and processors; we'll walk through the process of building a simple destination that sends data to a REST web service and show how it can be extended to target more sophisticated systems such as Salesforce Wave Analytics.
Tools and approaches for migrating big datasets to the cloud (DataWorks Summit)
This presentation describes the journey taken by the Hotels.com big data platform team when tasked with migrating big data sets and pipelines from on-premises clusters to cloud based platforms. We present two open source tools that we built to overcome the unexpected challenges we faced.
The first of these is Circus Train—a dataset replication tool that copies Hive tables between clusters and clouds. We will also discuss various other options for dataset replication and what unique features Circus Train has. The second tool is Waggle Dance—a federated Hive query service that enables querying of data stored across multiple Hive metastores. We will demonstrate the differences between Waggle Dance and existing federated SQL query engine tools and what use cases it enables. Giving real-world examples, we will describe how we've used these tools to successfully build a petabyte-scale platform that is now also being used by other brands within the Expedia organisation. We focus on actual problems and solutions that have arisen in a huge, organically grown corporation, rather than idealised architectures.
Speakers
Adrian Woodhead, Principal Engineer, Hotels.com
Elliot West, Senior Engineer, Hotels.com
Traditional data storage and analytic tools no longer provide the agility and flexibility required to deliver relevant business insights. That’s why organizations are shifting to a data lake architecture. This approach allows you to store massive amounts of data in a central location so it's readily available to be categorized, processed, analyzed, and consumed by diverse organizational groups. In this session, we’ll assemble a data lake using services such as Amazon S3, Amazon Kinesis, Amazon Athena, Amazon EMR, and AWS Glue.
Presentation from Data Science Conference 2.0 held in Belgrade, Serbia. The focus of the talk was to address the challenges of deploying a Data Lake infrastructure within the organization.
Pivotal Big Data Suite: A Technical Overview (VMware Tanzu)
How and why companies like Uber, Netflix, and Airbnb are so successful, what you need to do in order to become successful in the same way that they are, and how Pivotal can help you with that.
Speaker: Les Klein, EMEA CTO Data, Pivotal
Driving Real Insights Through Data Science (VMware Tanzu)
Major changes in industries have been brought about by the emergence of data-driven discoveries and applications. Many organizations are bringing together their data and looking to drive change. But the ability to generate new insights in real time from massive sets of data is still far from commonplace.
At this event, data technology experts and data scientists from Pivotal provided the latest business perspective on how data science and engineering can be used to accelerate the generation of new insights.
For information about upcoming Pivotal events, please visit: http://pivotal.io/news-events/#events
Troubleshooting App Health and Performance with PCF Metrics 1.2 (VMware Tanzu)
Join Allen Duet and Pieter Humphrey from Pivotal, to learn how PCF Metrics enhances the developer experience on Pivotal Cloud Foundry, with a simple and powerful way to troubleshoot app health and performance issues. You will see how, with a single, unified interface for events, logs, and metrics, app devs can easily navigate graphs to identify problems and then view logs for that time slice.
Supercharging Smart Meter BIG DATA Analytics with Microsoft Azure Cloud - SRP ... (Mike Rossi)
Explosive growth of Smart Meter (SM) deployments has presented key infrastructure challenges across the utility industry. The huge volumes of smart meter data have led the industry to a tipping point that requires investments in modernizing existing data warehouses. Typical modernization efforts lead to huge capital expenditures for DW appliances and storage. Sizing this new infrastructure is tricky and can lead to underutilized or poorly performing hardware.
The Cloud is the catalyst to solving these Big Data challenges.
Utilizing a Cloud architecture delivers huge benefits by:
Maximizing use of existing architecture
Minimizing new CapEx
Lowering overall storage costs
Enabling scale on demand
Real life use cases from across Europe (Walid Aoudi - Cognizant)
This presentation will share some of Cognizant's Big Data clients' experiences from continental Europe and the UK. The main focus will be on use cases, presented through the business drivers behind these projects. Key highlights of the big data architectures and solution approaches will be presented. Finally, the business outcomes, in terms of the ROI provided by the implemented solutions, will be discussed.
Why Your Data Science Architecture Should Include a Data Virtualization Tool ... (Denodo)
Watch full webinar here: https://bit.ly/35FUn32
Presented at CDAO New Zealand
Advanced data science techniques, like machine learning, have proven to be an extremely useful tool for deriving valuable insights from existing data. Platforms like Spark, and complex libraries for R, Python, and Scala, put advanced techniques at the fingertips of data scientists.
However, most architectures laid out to enable data scientists miss two key challenges:
- Data scientists spend most of their time looking for the right data and massaging it into a usable format
- Results and algorithms created by data scientists often stay out of the reach of regular data analysts and business users
Watch this session on-demand to understand how data virtualization offers an alternative to address these issues and can accelerate data acquisition and massaging. It also includes a customer story on the use of machine learning with data virtualization.
SAP HANA | SAP HANA Database | Introduction to SAP HANA (James L. Lee)
SAP HANA, sap hana implementation scenarios, sap hana deployment scenarios, SAP HANA Implementations, sap hana implementation and modeling, sap hana implementation cost, sap hana implementation partners, Applications based on SAP HANA, SAP HANA Databases.
Many organizations focus on the licensing cost of Hadoop when considering migrating to a cloud platform. But other costs should be considered, as well as the biggest impact, which is the benefit of having a modern analytics platform that can handle all of your use cases. This session will cover lessons learned in assisting hundreds of companies to migrate from Hadoop to Databricks.
Slides: Success Stories for Data-to-Cloud (DATAVERSITY)
Companies are finding accessing data from a variety of sources can be labor-intensive and costly. Oftentimes these companies are looking to cloud solutions, but are then finding the traditional architecture brittle when trying to move data to the cloud, which can drain organizations of time and resources.
Join this webinar to hear several company success stories, the data-to-cloud issues they were encountering, and the steps these companies took to bring their cloud architecture to a successful, real-time analytic solution unlocking massive amounts of fresh enterprise-wide data on a continuous basis.
In addition, you will learn how to:
• Modernize the ETL process to one that’s fast, flexible, and scalable
• Supply users with up-to-date, accurate, trusted data
• Accelerate time to value with data in the cloud
• Best practices on how to minimize resource overhead
From Data to Services at the Speed of BusinessAli Hodroj
From Data to Services at the Speed of Business: applying the cloud-native paradigm to combine fast data analytics with a microservices architecture for hybrid workloads.
Big Data Expo 2015 - Talend Delivering Real Time (BigDataExpo)
Pioneers like Mint in the financial sector, Amazon in retail or Netflix in media proved that turning Big Data into actions and insights at the customer touch points delivers measurable outcomes – increased transformation rate, larger share of wallet, better customer acquisition, just in time fraud detection, etc. They showcased that it is possible today to put in place a platform for the management of customer data that is able to integrate and deliver information in real time, regardless of the interaction channel being used… and as a result establish the foundation to disrupt a whole industry with data driven processes.
Now, this Customer Data Platform is reaching the mainstream through affordable technologies such as Hadoop and Spark, if empowered with embedded data and application integration, data governance, master data management, analytics, and real-time data processing. This platform, sometimes referred to as a Customer Data Platform (CDP) or a Data Management Platform (DMP), allows organizations to reconstruct the entire customer journey by centralizing and cross-referencing interactional or internal data such as purchase history, preferences, satisfaction, and loyalty with social or external data that can uncover customer intention as well as broader habits and tastes.
In this presentation, attendees will learn about the key components of the platform, how to implement it, and how to run it in the context of enterprise marketing activities.
Agile Big Data Analytics Development: An Architecture-Centric Approach (SoftServe)
Presented at The Hawaii International Conference on System Sciences by Hong-Mei Chen and Rick Kazman (University of Hawaii), Serge Haziyev (SoftServe).
Data Con LA 2018 - Populating your Enterprise Data Hub for Next Gen Analytics... (Data Con LA)
Syncsort's data integration and data quality solutions on Hadoop can help accelerate the process of populating your Enterprise Data Hub with data from multiple disparate data sources like legacy systems, databases, ERPs, CRMs, etc. Standardizing and cleansing the data before it is ingested into the data lake will dramatically increase the analytics value proposition.
How Hewlett Packard Enterprise Gets Real with IoT Analytics (Arcadia Data)
Learn how HPE uses visual analytics within a data lake to create an “Industrial Internet of Things” model that solves their data analytics problem at scale.
Digital Business Transformation in the Streaming Era (Attunity)
Enterprises are rapidly adopting stream computing backbones, in-memory data stores, change data capture, and other low-latency approaches for end-to-end applications. As businesses modernize their data architectures over the next several years, they will begin to evolve toward all-streaming architectures. In this webcast, Wikibon, Attunity, and MemSQL will discuss how enterprise data professionals should migrate their legacy architectures in this direction. They will provide guidance for migrating data lakes, data warehouses, data governance, and transactional databases to support all-streaming architectures for complex cloud and edge applications. They will discuss how this new architecture will drive enterprise strategies for operationalizing artificial intelligence, mobile computing, the Internet of Things, and cloud-native microservices.
Link to the Wikibon report - wikibon.com/wikibons-2018-big-data-analytics-trends-forecast
Link to Attunity Streaming CDC Book Download - http://www.bit.ly/cdcbook
Link to MemSQL's Free Data Pipeline Book - http://go.memsql.com/oreilly-data-pipelines
Top SAP Online Training Institute in Hyderabad (AadhyaKrishnan)
ERP Tech is one of the top SAP training institutes in Hyderabad. We offer training on all SAP modules, like BPC Embedded, BPC Classic, and HANA, at reasonable prices.
OpenWorld: 4 Real-world Cloud Migration Case Studies (Datavail)
In this presentation, get answers to these questions and more by exploring four different successful real-world Oracle EPM Cloud migration and implementation case studies for Oracle Enterprise Planning and Budgeting Cloud Service, Oracle Financial Consolidation and Close Cloud Service, and Oracle Account Reconciliation Cloud Service. Attendees get a bird's-eye view into the practicalities of moving to the cloud and making the business case for their own company.
Similar to Solving Performance Problems on Hadoop
Adjusting primitives for graph: SHORT REPORT / NOTES (Subhajit Sahu)
Graph algorithms, like PageRank, typically operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
The Building Blocks of QuestDB, a Time Series Database (javier ramirez)
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone through over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... (pchutichetpong)
M Capital Group (“MCG”) expects to see changing demand and an evolving supply, facilitated through institutional investment rotation out of offices and into work from home (“WFH”), alongside the ever-expanding need for data storage as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Quantitative Data Analysis - Reliability Analysis (Cronbach Alpha), Common Method... (2023240532)
Quantitative data Analysis
Overview
Reliability Analysis (Cronbach Alpha)
Common Method Bias (Harman Single Factor Test)
Frequency Analysis (Demographic)
Descriptive Analysis
4. Actian at a Glance
400+ employees; 10,000+ customers
8 countries; 7 US cities; HQ in Palo Alto
3 businesses: Data Management, Data Integration, Big Data Analytics
Customers across banking, insurance, telecom and media
6. Accidental Hadoop Tourist – Brief History
(pipeline diagram: Business / Data → Data Capture → Data Management & Integration → Analytics (Query & Analyze) → Solutions → Problem Solved)
7. Accidental Hadoop Tourist – Brief History
(same pipeline, with the outcome now in question: Business / Data → Data Capture → Data Management & Integration → Analytics → Solutions → ??????)
8. Accidental Hadoop Tourist – Brief History
(same pipeline, with both Analytics and Solutions now in question: Business / Data → Data Capture → Data Management & Integration → Analytics → ??? → Solutions → ???)
9. Modern, best-in-class analytic database technology provides:
- Measurable business impact: monetize Big Data to grow revenue, reduce cost, mitigate risk, enable new business
- The ability to make data-driven business decisions using a massively scalable platform
- A decisive reduction in the cost of high-performance analytics at scale
- Performance that can meet all SLAs
- Full leverage of existing SQL skills while deploying a modern analytic infrastructure
(sidebar: Grow Revenue | Reduce Cost | Mitigate Risk | Create New Business; Business Solution Architecture Challenges)
10. Wide Range of Use Cases
- Financial Services: advanced credit risk analytics across billions of data points
- Internet-Scale Application: predictive analytics across hundreds of millions of customers
- Media: data science and discovery across trillions of IoT events
- Dept of Defense: cyber-security, network intrusion models every second
- Credit Card Processing: fraud detection every millisecond
12. 3 Essential Big Data Concepts
0. Take nothing for granted
1. Partitioning vs data skew (see the sketch after this list)
2. Data types matter
3. Maximize memory / minimize bottlenecks
4. Take nothing for granted
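The "partitioning vs data skew" point is easy to illustrate. Below is a minimal, self-contained C++ sketch (not from the deck itself; the partition count, key distribution, and hashing are all assumptions) showing how a single hot key can pile most rows onto one partition, making that node the straggler that dominates query time.

```cpp
// Minimal sketch: hash-partition synthetic rows across 8 partitions; one "hot" key
// (e.g. a single very active customer) concentrates ~60% of the rows on one partition.
#include <cstdio>
#include <cstdint>
#include <functional>
#include <vector>

int main() {
    const int kPartitions = 8;
    std::vector<uint64_t> rows_per_partition(kPartitions, 0);

    // 1,000,000 synthetic rows: 60% carry the same hot key, the rest are spread evenly.
    for (uint64_t i = 0; i < 1'000'000; ++i) {
        uint64_t key = (i % 10 < 6) ? 42 : i;                 // key 42 is the hot key
        size_t p = std::hash<uint64_t>{}(key) % kPartitions;
        ++rows_per_partition[p];
    }

    for (int p = 0; p < kPartitions; ++p)
        std::printf("partition %d: %llu rows\n", p,
                    (unsigned long long)rows_per_partition[p]);
    // One partition holds ~60% of the data, so a better partition key
    // (or salting the hot key) is needed before that node stops being the bottleneck.
    return 0;
}
```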
17. Customer 360: Understanding Experience, Driving Revenue
Telecom Challenge
- A vast and growing repository of proprietary click data, customer records, service call records, smartphone and device data, GPS location, web server, telephone, and network usage.
- Queries took minutes or hours, and sometimes never returned at all.
- Critical business analysis on a consolidated customer 360 data lake was grinding to a halt.
- The ability to gain deeper market insights, visualization, and the desired data management and operational optimization was at risk.
18. Customer 360: Initial Architecture
Development System
• 300+ node cluster
• Hive access
• SQL-based BI / data science
• Pre-processed, as performance was unacceptable
• Views taking days to return snapshot views
19. Customer 360: Technical Improvements
Production Prototype
• 30 node cluster (10% of the Hive cluster)
• Actian Vector on Hadoop solution
• SQL-based BI / data science
• No materialized view building required
• Join on demand faster than aggregate tables in Hive
• Reduced storage requirements
• 91 TB – two years of data, 1,100 columns when joined
20. Customer 360: Understanding Experience, Driving Revenue
Results
- Customer 360 across prior data silos
- Leveraged for customer retention strategies
- Predict and take proactive, tailored responses
- Enables next-gen data-driven troubleshooting, impact analysis and root cause analysis
Impact
• Accelerated operations intelligence
• Improved customer experience
• Reduced customer churn
21. Financial Risk: Upgrading Legacy to Meet SLA
Challenge
The legacy single-purpose risk application took 3 hours to generate the end-of-day risk report and failed to meet changing SLAs for reporting risk.
In deciding to replace the risk application, the bank opted to build a multi-purpose risk application addressing multiple business requirements.
22. Financial Risk: Upgrading Legacy to Meet SLA
Legacy System
• Single-server architecture, MS SSAS, Oracle – ~30 applications
• Pre-processing of desired measures exploding data volumes
• Cube and analysis engines being maxed out as they exceed the 1.5 TB range
• Unable to scale to the desired range of > 200 GB/day of new data
• Impala attempt failed
• Heavily invested in apps built on Analysis Services
23. Financial Risk: Upgrading Legacy to Meet SLA
New Possibilities
• Clustered solution – Hadoop, 5 and 10 nodes
• No pre-processing cubes; SSAS partly kept
• Tested solutions 1 TB -> 20 TB at a time
• Produced interactive queries across large datasets
• Focused query results in 2 s or less
• Processing all data in the database: 6 s – 80 s
• 2x nodes ~ 200% speed improvement
24. Financial Risk: Upgrading Legacy to Meet SLA
Results
- Increased data analyzed by 100x: 2–200B rows / 1–20 TB
- Report run in 28 seconds vs. 3 hours
- Use of the application for:
  • Intra-day reporting (surveillance)
  • End-of-day reporting (compliance)
  • Overnight float investment options
  • Annual CCAR analysis
(chart legend: Goal vs. Actual)
29. Technical Benchmarks: VectorH - SQL on Hadoop
TPC-H SF1000 *
VectorH vs. other platforms: faster by how much?
Tuned platforms, identical hardware **
* Not an official TPC result
** 10 nodes, each with 2 x Intel 3.0 GHz E5-2690v2 CPUs, 256 GB RAM, 24 x 600 GB HDD, 10 Gb Ethernet, Hadoop 2.6.0
30. Actian VectorH Delivers a More Efficient File Format
Better compression & functionality
Vector advantages:
• skip blocks via MinMax indexes
• sophisticated query processing
• efficient block format, especially for 64-bit integers
31. Summary
Conscientious data handling & next-gen engineering take SQL in Hadoop to new levels.
All Hadoop users can move from development into production while delivering compelling business results.
32. Delivering the Results With Better Engineering
VectorH v5 – Spark integration, external table support, and more
1: We use vectorized processing to exploit modern CPU architecture. We execute one operation at a time on a vector of data, which allows for tight inner code loops without branching. This way, we can use SIMD instructions and, because of the lack of branching, make sure the CPU pipelines are not thrashed.
A vector is typically 1024 rows of a single column, so it’s a manageable amount of data while the overhead per row is still negligible.
2: A vector will fit in the CPU cache together with the code for a particular operation, so all execution is in-cache.
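A minimal C++ sketch of this vector-at-a-time idea (illustrative only, not VectorH source; the vector size constant, primitive names, and the branch-free selection-vector trick are assumptions based on the description above): each primitive runs a tight loop over roughly 1024 column values, which auto-vectorizes well and stays in cache.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kVectorSize = 1024;    // rows processed per primitive call

// Primitive: multiply a vector of values by a constant (e.g. price * (1 - discount)).
void mul_const(const double* in, double* out, double c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] * c;                  // tight loop, no branches: SIMD-friendly
}

// Primitive: filter without branching, producing a selection vector of qualifying row ids.
std::size_t select_greater(const double* in, uint32_t* sel, double threshold, std::size_t n) {
    std::size_t out = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sel[out] = static_cast<uint32_t>(i);
        out += (in[i] > threshold);          // data-dependent increment instead of an if
    }
    return out;                              // number of rows that passed the predicate
}
```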
3: To feed this engine with enough data, we’re also applying the vectorized paradigm to the storage subsystem. First of all, we’re using a column store, so only relevant columns are read from disk. Data is stored in blocks of typically 512MB and a single block contains only data from a single column (there are exceptions). Blocks of different columns can be interleaved per block, but typically more than one block of the same column is grouped.
To keep the stable storage fast and defragmented, we use in-memory overlays to store updates to the data. These overlays are automatically flushed to stable storage when needed.
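A sketch of how such an in-memory overlay can sit on top of immutable column data (an assumed design for illustration, not the actual storage code): reads prefer the overlay value when one exists, and a flush folds the overlay back into stable storage.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// One column: immutable "stable" values (stand-in for on-disk blocks) plus an
// in-memory overlay of updates keyed by row id.
struct ColumnWithOverlay {
    std::vector<int64_t> stable;
    std::unordered_map<uint64_t, int64_t> overlay;

    void update(uint64_t row, int64_t value) { overlay[row] = value; }

    int64_t read(uint64_t row) const {
        auto it = overlay.find(row);
        return it != overlay.end() ? it->second : stable[row];  // overlay wins if present
    }

    // "Flush": merge the overlay into stable storage and clear it.
    void flush() {
        for (const auto& [row, value] : overlay) stable[row] = value;
        overlay.clear();
    }
};
```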
4: The blocks are stored compressed on-disk. We’ve got a number of lightweight compression algorithms and the most efficient one is chosen per block, depending on the data characteristics. The decompression takes place per vector and can be done in the CPU cache, which neatly ties in with our in-cache execution.
We have a buffer manager that predicts what blocks are needed when and makes sure no blocks that will be used in the near future are evicted from the buffer cache.
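A toy version of "pick the cheapest lightweight scheme per block" (hypothetical; the real engine has several schemes and richer statistics): here we only compare run-length encoding against storing the block raw and keep whichever is smaller.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Runs = std::vector<std::pair<int64_t, uint32_t>>;   // (value, run length)

Runs rle_encode(const std::vector<int64_t>& block) {
    Runs runs;
    for (int64_t v : block) {
        if (!runs.empty() && runs.back().first == v) ++runs.back().second;
        else runs.push_back({v, 1});
    }
    return runs;
}

enum class Scheme { Raw, Rle };

// Decide per block, based on the data itself, which encoding to write to disk.
Scheme choose_scheme(const std::vector<int64_t>& block) {
    const std::size_t raw_bytes = block.size() * sizeof(int64_t);
    const std::size_t rle_bytes =
        rle_encode(block).size() * (sizeof(int64_t) + sizeof(uint32_t));
    return rle_bytes < raw_bytes ? Scheme::Rle : Scheme::Raw;
}
```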
5: We have min-max indexes on the disk blocks, so when data is not completely random we can narrow down the ranges of blocks we need to read from disk, per column.
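The min-max (zone map) idea in a few lines, assuming each block records the smallest and largest value it contains: a range predicate only touches blocks whose [min, max] interval can overlap it.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct BlockStats { int64_t min; int64_t max; };   // kept per on-disk block, per column

// Return the block ids that must be read for "value BETWEEN lo AND hi";
// every other block is skipped without touching disk.
std::vector<std::size_t> blocks_to_read(const std::vector<BlockStats>& stats,
                                        int64_t lo, int64_t hi) {
    std::vector<std::size_t> needed;
    for (std::size_t b = 0; b < stats.size(); ++b)
        if (stats[b].max >= lo && stats[b].min <= hi)  // ranges overlap, block may qualify
            needed.push_back(b);
    return needed;
}
```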
6: Multi-core parallelism - the well-tuned query optimizer takes into account the query sequencing, data partitioning, and HDFS block locality to leverage the number of threads available to produce results in parallel, balancing the workload across system resources to improve throughput and response time
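And a bare-bones illustration of per-partition parallelism (a sketch only; the real optimizer also weighs query sequencing and HDFS block locality): each thread aggregates its own partition and the partial results are combined at the end.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Sum one column that has been split into per-core (or per-node) partitions.
int64_t parallel_sum(const std::vector<std::vector<int64_t>>& partitions) {
    std::vector<int64_t> partial(partitions.size(), 0);
    std::vector<std::thread> workers;
    for (std::size_t p = 0; p < partitions.size(); ++p)
        workers.emplace_back([&partitions, &partial, p] {
            partial[p] = std::accumulate(partitions[p].begin(), partitions[p].end(), int64_t{0});
        });
    for (auto& t : workers) t.join();                        // wait for all partitions
    return std::accumulate(partial.begin(), partial.end(), int64_t{0});
}
```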
All in all, the execution engine is able to do about 1.5GB/s per core, and high-end I/O subsystems are able to keep up with this.