The document summarizes a presentation about data vault automation at a Dutch department store chain called de Bijenkorf. It discusses the project objectives of having a single source of reports and integrating with production systems. An architectural overview is provided, including the use of AWS services, a Snowplow event tracker, and Vertica data warehouse. Automation was implemented for loading data from over 250 source tables into the data vault and then into information marts. This reduced ETL development time and improved auditability. The data vault supports customer analysis, personalization, and business intelligence uses at de Bijenkorf. Drivers of the project's success included the AWS infrastructure, automation approach, and Pentaho ETL framework.
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat... (Hortonworks)
How do you turn data from many different sources into actionable insights and manufacture those insights into innovative information-based products and services?
Industry leaders are accomplishing this by adding Hadoop as a critical component in their modern data architecture to build a data lake. A data lake collects and stores data across a wide variety of channels, including social media, clickstream data, server logs, customer transactions and interactions, videos, and sensor data from equipment in the field. A data lake cost-effectively scales to collect and retain massive amounts of data over time, and converts all of this data into actionable information that can transform your business.
Join Hortonworks and Informatica as we discuss:
- What is a data lake?
- The modern data architecture for a data lake
- How Hadoop fits into the modern data architecture
- Innovative use-cases for a data lake
Cloud Storage Spring Cleaning: A Treasure Hunt (Steven Moy)
This is a talk by Zach and me on how to analyze your S3 storage access patterns and save storage cost by lifecycling objects to the right cost tier at the right time.
AWS Cloud Kata 2013 | Singapore - Getting to Scale on AWS (Amazon Web Services)
This session will focus on how to get from 'Minimum Viable Product' (MVP) to scale. It will also explain how to deal with unpredictable demand and how to build a scalable business. Attend this session to learn how to:
Scale web servers and app services with Elastic Load Balancing and Auto Scaling on Amazon EC2
Scale your storage on Amazon S3 and S3 Reduced Redundancy Storage
Scale your database with Amazon DynamoDB, Amazon RDS, and Amazon ElastiCache
Scale your customer base by reaching customers globally in minutes with Amazon CloudFront
Many organizations focus on the licensing cost of Hadoop when considering migrating to a cloud platform. But other costs should be considered, as well as the biggest impact, which is the benefit of having a modern analytics platform that can handle all of your use cases. This session will cover lessons learned in assisting hundreds of companies to migrate from Hadoop to Databricks.
Choosing technologies for a big data solution in the cloud (James Serra)
Has your company been building data warehouses for years using SQL Server? And are you now tasked with creating or moving your data warehouse to the cloud and modernizing it to support “Big Data”? What technologies and tools should you use? That is what this presentation will help you answer. First we will cover what questions to ask concerning data (type, size, frequency), reporting, performance needs, on-prem vs cloud, staff technology skills, OSS requirements, cost, and MDM needs. Then we will show you common big data architecture solutions and help you to answer questions such as: Where do I store the data? Should I use a data lake? Do I still need a cube? What about Hadoop/NoSQL? Do I need the power of MPP? Should I build a "logical data warehouse"? What is this lambda architecture? Can I use Hadoop for my DW? Finally, we’ll show some architectures of real-world customer big data solutions. Come to this session to get started down the path to making the proper technology choices in moving to the cloud.
Microsoft Data Platform - What's included (James Serra)
The pace of Microsoft product innovation is so fast that even though I spend half my days learning, I struggle to keep up. And as I work with customers I find they are often in the dark about many of the products that we have since they are focused on just keeping what they have running and putting out fires. So, let me cover what products you might have missed in the Microsoft data platform world. Be prepared to discover all the various Microsoft technologies and products for collecting data, transforming it, storing it, and visualizing it. My goal is to help you not only understand each product but also understand how they all fit together and their proper use cases, allowing you to build the appropriate solution that can incorporate any data in the future no matter the size, frequency, or type. Along the way we will touch on technologies covering NoSQL, Hadoop, and open source.
As part of this session, I will be giving an introduction to Data Engineering and Big Data. It covers up-to-date trends.
* Introduction to Data Engineering
* Role of Big Data in Data Engineering
* Key Skills related to Data Engineering
* Overview of Data Engineering Certifications
* Free Content and ITVersity Paid Resources
Don't worry if you miss the session - you can use the link below to go through the video after the scheduled time.
https://youtu.be/dj565kgP1Ss
* Upcoming Live Session - Overview of Big Data Certifications (Spark Based) - https://www.meetup.com/itversityin/events/271739702/
Relevant Playlists:
* Apache Spark using Python for Certifications - https://www.youtube.com/playlist?list=PLf0swTFhTI8rMmW7GZv1-z4iu_-TAv3bi
* Free Data Engineering Bootcamp - https://www.youtube.com/playlist?list=PLf0swTFhTI8pBe2Vr2neQV7shh9Rus8rl
* Join our Meetup group - https://www.meetup.com/itversityin/
* Enroll for our labs - https://labs.itversity.com/plans
* Subscribe to our YouTube Channel for Videos - http://youtube.com/itversityin/?sub_confirmation=1
* Access Content via our GitHub - https://github.com/dgadiraju/itversity-books
* Lab and Content Support using Slack
Big Data 2.0: ETL & Analytics: Implementing a next generation platform (Caserta)
In our most recent Big Data Warehousing Meetup, we learned about transitioning from Big Data 1.0 on Hadoop 1.x and its nascent technologies to Hadoop 2.x with YARN, which enables distributed ETL, SQL, and analytics solutions. Caserta Concepts Chief Architect Elliott Cordo and an Actian engineer covered the complete data value chain of an enterprise-ready platform, including data connectivity, collection, preparation, optimization, and analytics with end-user access.
Access additional slides from this meetup here:
http://www.slideshare.net/CasertaConcepts/big-data-warehousing-meetup-january-20
For more information on our services or upcoming events, please visit http://www.actian.com/ or http://www.casertaconcepts.com/.
This is a 200-level run-through of the Microsoft Azure Big Data Analytics for the Cloud data platform, based on the Cortana Intelligence Suite offerings.
Agile Methods and Data Warehousing (2016 update) (Kent Graziano)
This presentation takes a look at the Agile Manifesto and the 12 Principles of Agile Development and discusses how these apply to Data Warehousing and Business Intelligence projects. Several examples and details from my past experience are included. Includes more details on using Data Vault as well. (I gave this presentation at OUGF14 in Helsinki, Finland and again in 2016 for TDWI Nashville.)
Every business today wants to leverage data to drive strategic initiatives with machine learning, data science and analytics — but runs into challenges from siloed teams, proprietary technologies and unreliable data.
That’s why enterprises are turning to the lakehouse because it offers a single platform to unify all your data, analytics and AI workloads.
Join our How to Build a Lakehouse technical training, where we’ll explore how to use Apache Spark™, Delta Lake, and other open source technologies to build a better lakehouse. This virtual session will include concepts, architectures and demos.
Here’s what you’ll learn in this 2-hour session:
How Delta Lake combines the best of data warehouses and data lakes for improved data reliability, performance and security
How to use Apache Spark and Delta Lake to perform ETL processing, manage late-arriving data, and repair corrupted data directly on your lakehouse
O'Reilly ebook: Operationalizing the Data Lake (Vasu S)
Best practices for building a cloud data lake operation—from people and tools to processes
https://www.qubole.com/resources/ebooks/ebook-operationalizing-the-data-lake
Making Data Timelier and More Reliable with Lakehouse Technology (Matei Zaharia)
Enterprise data architectures usually contain many systems—data lakes, message queues, and data warehouses—that data must pass through before it can be analyzed. Each transfer step between systems adds a delay and a potential source of errors. What if we could remove all these steps? In recent years, cloud storage and new open source systems have enabled a radically new architecture: the lakehouse, an ACID transactional layer over cloud storage that can provide streaming, management features, indexing, and high-performance access similar to a data warehouse. Thousands of organizations including the largest Internet companies are now using lakehouses to replace separate data lake, warehouse and streaming systems and deliver high-quality data faster internally. I’ll discuss the key trends and recent advances in this area based on Delta Lake, the most widely used open source lakehouse platform, which was developed at Databricks.
Demystifying Data Warehouse as a Service (DWaaS) (Kent Graziano)
This is from the talk I gave at the 30th Anniversary NoCOUG meeting in San Jose, CA.
We all know that data warehouses and best practices for them are changing dramatically today. As organizations build new data warehouses and modernize established ones, they are turning to Data Warehousing as a Service (DWaaS) in hopes of taking advantage of the performance, concurrency, simplicity, and lower cost of a SaaS solution or simply to reduce their data center footprint (and the maintenance that goes with that).
But what is a DWaaS really? How is it different from traditional on-premises data warehousing?
In this talk I will:
• Demystify DWaaS by defining it and its goals
• Discuss the real-world benefits of DWaaS
• Discuss some of the coolest features in a DWaaS solution as exemplified by the Snowflake Elastic Data Warehouse.
Should I move my database to the cloud? (James Serra)
So you have been running on-prem SQL Server for a while now. Maybe you have taken the step to move it from bare metal to a VM, and have seen some nice benefits. Ready to see a TON more benefits? If you said “YES!”, then this is the session for you, as I will go over the many benefits gained by moving your on-prem SQL Server to an Azure VM (IaaS). Then I will really blow your mind by showing you even more benefits by moving to Azure SQL Database (PaaS/DBaaS). And for those of you with a large data warehouse, I have also got you covered with Azure SQL Data Warehouse. Along the way I will talk about the many hybrid approaches so you can take a gradual approach to moving to the cloud. If you are interested in cost savings, additional features, ease of use, quick scaling, improved reliability and ending the days of upgrading hardware, this is the session for you!
Data Lakehouse Symposium | Day 1 | Part 2 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
More and more organizations are moving their ETL workloads to a Hadoop-based ELT grid architecture. Hadoop's inherent capabilities, especially its ability to do late binding, address some of the key challenges with traditional ETL platforms. In this presentation, attendees will learn the key factors, considerations and lessons around ETL for Hadoop: pros and cons of different extract and load strategies, best ways to batch data, buffering and compression considerations, leveraging HCatalog, data transformation, integration with existing data transformations, advantages of different ways of exchanging data, and leveraging Hadoop as a data integration layer. This is an extremely popular presentation around ETL and Hadoop.
Hadoop Integration into Data Warehousing Architectures (Humza Naseer)
This presentation explains the research work done on the topic of 'Hadoop integration into data warehouse architectures'. It explains where Hadoop fits into data warehouse architecture. Furthermore, it proposes a BI assessment model to determine the capability of the current BI program and how to define a roadmap for its maturity.
Modern Data Warehousing with the Microsoft Analytics Platform System (James Serra)
The traditional data warehouse has served us well for many years, but new trends are causing it to break in four different ways: data growth, fast query expectations from users, non-relational/unstructured data, and cloud-born data. How can you prevent this from happening? Enter the modern data warehouse, which is able to handle and excel with these new trends. It handles all types of data (Hadoop), provides a way to easily interface with all these types of data (PolyBase), and can handle “big data” and provide fast queries. Is there one appliance that can support this modern data warehouse? Yes! It is the Analytics Platform System (APS) from Microsoft (formerly called Parallel Data Warehouse, or PDW), which is a Massively Parallel Processing (MPP) appliance that has been recently updated (v2 AU1). In this session I will dig into the details of the modern data warehouse and APS. I will give an overview of the APS hardware and software architecture, identify what makes APS different, and demonstrate the increased performance. In addition I will discuss how Hadoop, HDInsight, and PolyBase fit into this new modern data warehouse.
Building an Effective Data Warehouse Architecture (James Serra)
Why use a data warehouse? What is the best methodology to use when creating a data warehouse? Should I use a normalized or dimensional approach? What is the difference between the Kimball and Inmon methodologies? Does the new Tabular model in SQL Server 2012 change things? What is the difference between a data warehouse and a data mart? Is there hardware that is optimized for a data warehouse? What if I have a ton of data? During this session James will help you to answer these questions.
Architecting for Real-Time Big Data Analytics (Rob Winters)
Slides from a talk given on 7 April 2016 on best practices for building a federated data platform to support data science, reporting, and business analytics.
• A strong relationship with the founder of Data Vault for over 3 years now.
• Supporting your business with 40+ certified consultants.
• Incorporated as the preferred Enterprise Data Warehouse modelling paradigm in the Logica BI Framework.
• Satisfied customers in many countries and industry sectors.
In this session, you get an overview of Amazon Redshift, a fast, fully-managed, petabyte-scale data warehouse service. We'll cover how Amazon Redshift uses columnar technology, optimized hardware, and massively parallel processing to deliver fast query performance on data sets ranging in size from hundreds of gigabytes to a petabyte or more. We'll also discuss new features, architecture best practices, and share how customers are using Amazon Redshift for their Big Data workloads.
In this presentation, you will get a look under the covers of Amazon Redshift, a fast, fully-managed, petabyte-scale data warehouse service for less than $1,000 per TB per year. Learn how Amazon Redshift uses columnar technology, optimized hardware, and massively parallel processing to deliver fast query performance on data sets ranging in size from hundreds of gigabytes to a petabyte or more. We'll also walk through techniques for optimizing performance, and you'll hear from a specific customer about their use case, taking advantage of fast performance on enormous datasets while leveraging economies of scale on the AWS platform.
Speakers:
Ian Meyers, AWS Solutions Architect
Toby Moore, Chief Technology Officer, Space Ape
IBM Integration Bus
WebSphere Message Broker v9.0
Introduction:
IBM Integration Bus v9, known in previous releases as WebSphere Message Broker, is a lightweight, advanced enterprise service bus (ESB) that provides a broad range of integration capabilities that enable companies to rapidly integrate internal applications and connect to partner applications. Messages from business applications can be transformed, augmented and routed to other business applications. The types and complexity of the integration required will vary by company, application types, and a number of other factors.
The product supports a wide range of protocols: WebSphere® MQ, JMS 1.1, HTTP and HTTPS, Web Services (SOAP and REST), File, Enterprise Information Systems (including SAP and Siebel), and TCP/IP.
It supports a broad range of data formats: binary formats (C and COBOL), XML, and industry standards (including SWIFT, EDI, and HIPAA). You can also define your own data formats.
It supports many operations, including routing, transforming, filtering, enriching, monitoring, distribution, collection, correlation, and detection.
Introduction to EAI, SOA Architecture
Introduction to MQ
Course content:
Day 1
• WMQ Introduction
• Creation of WMQ objects through GUI and MQSC commands
• Distributed queuing through GUI and MQSC commands
Day 2
• Clustering through GUI and through MQSC commands
• Overview of all the concepts in WMQ
• Working with RFHUTIL
Day 3
• Introduction to IIB
• Creating message flows and deploying the message flow to the Integration Node
Day 5
• Tree structure
• Introduction to message sets
• Message transformation scenario to convert CSV to XML using an MRM message set
Day 6
• Message transformation scenario to convert BLOB to XML using an MRM message set (using XSD)
• Message transformation scenario to convert XML to BLOB using an MRM message set (using a COBOL copybook)
• Routing scenarios using Flow Order, Filter and Route to Label
Day 7
• Routing scenarios using propagation and aggregation
Day 8
• Working with the Database node
Day 10
• Exception handling
• Logging
Day 11
• Web services – SOAP nodes
Day 12
• Web services – HTTP nodes
• Java Compute node
Day 13
• Transformation using Mapping and XSLT nodes
Day 14
• IIB 9 – Embedded business rules
• Workload management
Day 15
• Web visualization and analytics
• Database services and analysis
• Service discovery and cataloguing
Project Work
Courses Offerings
• Abinitio
• Android
• AIX Administration
• Business Analyst
• CCNA, CCNP Security, CCIE
• Citrix XenApp
• Cogno
Billions of Rows, Millions of Insights, Right Now (Rob Winters)
Presentation from Tableau Customer Conference 2013 on building a real time reporting/analytics platform. Topics discussed include definitions of big data and real time, technology choices and rationale, use cases for real time big data, architecture, and pitfalls to avoid.
Big Data Expo 2015 - Infotopics: Zien, Begrijpen, Doen! (BigDataExpo)
Visualizing data is the only way to understand Big Data. Infotopics is the specialist in data visualization and is continuously looking for the best instruments to provide insight into data through self-service BI. See how we apply this in practice with Tableau & Alteryx.
Semantic Technology for the Data Warehousing Practitioner (Thomas Kelly, PMP)
Semantic Technology for the Data Warehousing Practitioner -- Shattering Traditional DW/BI Best Practices to Drive Intelligent Analytics
Current DW/BI best practices optimize technologies that were conceived two to three decades ago. To successfully leverage semantic technology, DW/BI professionals will change (even reverse) many of these practices.
Many organizations use data warehousing and business intelligence to monitor their operations and guide tactical and strategic decision making. Data warehouses continue to have challenges in:
- keeping the data organized in sync with the organization's analytics needs,
- delivering data to decision makers in a timely manner,
- managing constantly-evolving data quality requirements,
- integrating new data sets into the data warehouse,
- reusing expert knowledge that is embedded in end-user analytics, and
- organizing internal data assets, data from cloud applications, and data from business partners into a common access method.
Many of today's DW/BI practices were developed to optimize technologies that were conceived in the 1970s, '80s, and '90s. This presentation examines key features of semantic technology and how DW/BI practices are likely to change to successfully deliver intelligent DW/BI projects.
Rundeck is a robust, reliable and easy-to-use open source application. It helps automate operational routines in remote environments, and it gives authorized users access to node servers.
How a Data Mesh is Driving our Platform | Trey Hicks, Gloo (Hosted by Confluent)
At Gloo.us, we face a challenge in providing platform data to heterogeneous applications in a way that eliminates access contention, avoids high latency ETLs, and ensures consistency for many teams. We're solving this problem by adopting Data Mesh principles and leveraging Kafka, Kafka Connect, and Kafka streams to build an event driven architecture to connect applications to the data they need. A domain driven design keeps the boundaries between specialized process domains and singularly focused data domains clear, distinct, and disciplined. Applying the principles of a Data Mesh, process domains assume the responsibility of transforming, enriching, or aggregating data rather than relying on these changes at the source of truth -- the data domains. Architecturally, we've broken centralized big data lakes into smaller data stores that can be consumed into storage managed by process domains.
This session covers how we’re applying Kafka tools to enable our data mesh architecture. This includes how we interpret and apply the data mesh paradigm, the role of Kafka as the backbone for a mesh of connectivity, the role of Kafka Connect to generate and consume data events, and the use of KSQL to perform minor transformations for consumers.
Choosing the Right Business Intelligence Tools for Your Data and Architectura... (Victor Holman)
Watch video presentation and get a FREE performance management kit at
http://www.lifecycle-performance-pros.com
This presentation takes you through the steps of understanding your business intelligence needs and identifying the right tools for you. We discuss the different types of BI tools, the criteria for selecting each type of tool, popular Business Intelligence vendors and how to rate them, and the job functions and responsibilities for a typical BI implementation.
DevOps is changing today's software development world by helping us build better software, faster. However most of the knowledge and experience with DevOps is based around application software and ignores the database. We will examine how the concepts and principles of DevOps can be applied to database development by looking at both automated comparison analysis as well as migration script management. Automated building, testing, and deployment of database changes will be shown.
About the Presenter
Steve Jones is a Microsoft SQL Server MVP and has been working with SQL Server since version 4.2 on OS/2. After working as a DBA and developer for a variety of companies, Steve co-founded the community website SQLServerCentral.com in 2001. Since 2004, Steve has been the full-time editor of the site, ensuring it continues to be a great resource for SQL Server professionals. Over the last decade, Steve has written hundreds of articles about SQL Server for SQLServerCentral.com, SQL Server Standard magazine, SQL Server Magazine, and Database Journal.
Delivering Changes for Applications and Databases (Miguel Alho)
Presentation used at PortoData (28/07/2016) about managing database changes in application projects. Demoed Flyway migrations for script packages, and a strategy for using DbUp and LocalDB for integration testing at the data layer.
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ... (Amazon Web Services)
Organizations need to gain insight and knowledge from a growing number of Internet of Things (IoT), application programming interfaces (API), clickstreams, unstructured and log data sources. However, organizations are also often limited by legacy data warehouses and ETL processes that were designed for transactional data. Building scalable big data pipelines with automated extract-transform-load (ETL) and machine learning processes can address these limitations. JustGiving is the world’s largest social platform for online giving. In this session, we describe how we created several scalable and loosely coupled event-driven ETL and ML pipelines as part of our in-house data science platform called RAVEN. You learn how to leverage AWS Lambda, Amazon S3, Amazon EMR, Amazon Kinesis, and other services to build serverless, event-driven, data and stream processing pipelines in your organization. We review common design patterns, lessons learned, and best practices, with a focus on serverless big data architectures with AWS Lambda.
Fishbowl Solutions' Administration Suite combines our most effective and popular tools for WebCenter administrators. Learn more about how these tools can automate many daily tasks and simplify processes!
What is Data Warehousing?
Who needs Data Warehousing?
Why is a Data Warehouse required?
Types of Systems
OLTP
OLAP
Maintenance of Data Warehouse
Data Warehousing Life Cycle
DevOps has been an emerging trend in the software development world for the past several years. While the term is relatively new, it is really a convergence of a number of practices that have been evolving for decades. Unfortunately, database development has been left out of much of this movement, but that's starting to change. As database professionals, we all need to understand what this important change is about, how we fit in, and how to best work database development practices into the established DevOps practices.
One of the cornerstones of the DevOps methodology is source control. When most people think of source control, they picture a tool - either a traditional, centralized system like TFS, or a newer, distributed system like Git. Source control is more than a tool, though; human processes and practices also play a critical role in an effective source control (and DevOps) implementation. In this session, we'll talk in depth about both types of source control systems and how you can effectively use source control for your databases.
AppSphere 15 - Is the database affecting your critical business transactions? (AppDynamics)
Databases are fundamental to every application, and reading or writing data in a quick and reliable way is critical to ensuring happy users. Your shiny new application may have worked well in the beginning, but with more and more people using the app the response times may take a nose dive. Or, what if you push a great new feature, guaranteed to make customers happy, but then it doesn’t scale or it actually degrades the performance of existing functionality? What do you do? How do you diagnose and resolve these issues in a timely and cost efficient manner?
Whilst many organizations employ a team of database administrators (DBAs) to manage database performance, it’s often a group separated from development and operational support with their own tools, scripts, and procedures. This creates inefficiency in the root cause analysis process. We want to empower customers by including deep-dive database monitoring as part of end-to-end APM, thereby providing immediate visibility of DB metrics to all groups within IT.
This session covers:
- Why it makes sense to include the database as part of APM
- What are some real-world problems affecting database performance
- What is a good methodology for diagnosing and understanding database performance
This deck was originally presented at AppSphere 2015.
Similar to Data Vault Automation at the Bijenkorf
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
As Europe's leading economic powerhouse and the fourth-largest #economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like #Russia and #China, #Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in #cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to #AdvancedPersistentThreats (#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
2. Presentation agenda
◦ Project objectives
◦ Architectural overview
◦ The data warehouse data model
◦ Automation in the data warehouse
◦ Successes and failures
◦ Conclusions
3. About the presenters
Rob Winters
Head of Data Technology, the Bijenkorf
Project role:
◦ Project Lead
◦ Systems architect and administrator
◦ Data modeler
◦ Developer (ETL, predictive models, reports)
◦ Stakeholder manager
◦ Joined project September 2014
Andrei Scorus
BI Consultant, Incentro
Project role:
◦ Main ETL Developer
◦ ETL Developer
◦ Modeling support
◦ Source system expert
◦ Joined project November 2014
4. Project objectives
◦ Information requirements
◦ Have one place as the source for all reports
◦ Security and privacy
◦ Information management
◦ Integrate with production
◦ Non-functional requirements
◦ System quality
◦ Extensibility
◦ Scalability
◦ Maintainability
◦ Security
◦ Flexibility
◦ Low Cost
Technical Requirements
• One environment to quickly generate customer insights
• Then feed those insights back to production
• Then measure the impact of those changes in near real time
5. Source system landscape
Source Type | Number of Sources | Examples | Load Frequency | Data Structure
Oracle DB | 2 | Virgo ERP | 2x/hour | Partial 3NF
MySQL | 3 | Product DB, Web Orders, DWH | 10x/hour | 3NF (Web Orders), improperly normalized
Event bus | 1 | Web/email events | 1x/minute | Tab delimited with JSON fields
Webhook | 1 | Transactional emails | 1x/minute | JSON
REST APIs | 5+ | GA, DotMailer | 1x/hour - 1x/day | JSON
SOAP APIs | 5+ | AdWords, Pricing | 1x/day | XML
7. DWH internal architecture
• Traditional three-tier DWH
• ODS generated automatically from staging (a hedged ODS-view sketch follows below)
• Ops mart reflects data in original source form
◦ Helps offload queries from source systems
• Business marts materialized exclusively from vault
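As a hedged illustration (the actual generated code is not shown in the deck), an automatically generated ODS object can be as simple as a view over the append-only staging table that exposes only the most recent version of each record, so reporting queries can be offloaded from the source system. The table and column names reuse the stg_oms.customer example shown later in the deck:
-- Minimal sketch, assuming stg_oms.customer as defined in the staging example:
-- expose only the latest row per business key as the ODS representation.
CREATE VIEW ods_oms.customer AS
SELECT customerId,
       customerName,
       customerAddress,
       loadTs,
       sourceFile
FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (PARTITION BY s.customerId ORDER BY s.loadTs DESC) AS rn
    FROM stg_oms.customer s
) latest
WHERE rn = 1;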
8. Bijenkorf Data Vault overview
Data volumes
• ~1 TB base volume
• 10-12 GB daily
• ~250 source tables
Aligned to Data Vault 2.0
• Hash keys
• Hashes used for CDC (see the hash-key sketch after this list)
• Parallel loading
• Maximum utilization of available resources
• Data unchanged into the vault
Some statistics
• 18 hubs (34 loading scripts)
• 27 links (43 loading scripts)
• 39 satellites (43 loading scripts)
• 13 reference tables (1 script per table)
Model contains
• Sales transactions
• Customer and corporate locations
• Customers
• Products
• Payment methods
• E-mail
• Phone
• Product grouping
• Campaigns
• deBijenkorf card
• Social media
Excluded from the vault
◦ Event streams
◦ Server logs
◦ Unstructured data
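To make the DV 2.0 hashing concrete, here is a minimal, hypothetical sketch (assumed MD5 hashing plus assumed table and column names; the deck does not show the actual load statements) of a hub load and a satellite load that uses a hash diff for change detection:
-- Hypothetical hub load: the hub key is an MD5 hash of the business key.
INSERT INTO dv.hub_customer (hub_customer_hk, customer_id, load_ts, record_source)
SELECT DISTINCT
       MD5(UPPER(TRIM(CAST(s.customerId AS VARCHAR)))) AS hub_customer_hk,
       s.customerId,
       s.loadTs,
       s.sourceFile
FROM stg_oms.customer s
LEFT JOIN dv.hub_customer h
  ON h.hub_customer_hk = MD5(UPPER(TRIM(CAST(s.customerId AS VARCHAR))))
WHERE h.hub_customer_hk IS NULL;

-- Hypothetical satellite load: a hash diff over the descriptive attributes
-- is compared against the current satellite row to detect changes (CDC).
INSERT INTO dv.sat_customer (hub_customer_hk, load_ts, record_source, hash_diff,
                             customer_name, customer_address)
SELECT MD5(UPPER(TRIM(CAST(s.customerId AS VARCHAR)))) AS hub_customer_hk,
       s.loadTs,
       s.sourceFile,
       MD5(COALESCE(s.customerName, '') || '|' || COALESCE(s.customerAddress, '')) AS hash_diff,
       s.customerName,
       s.customerAddress
FROM stg_oms.customer s
LEFT JOIN (
    SELECT hub_customer_hk, hash_diff,
           ROW_NUMBER() OVER (PARTITION BY hub_customer_hk ORDER BY load_ts DESC) AS rn
    FROM dv.sat_customer
) cur
  ON  cur.hub_customer_hk = MD5(UPPER(TRIM(CAST(s.customerId AS VARCHAR))))
  AND cur.rn = 1
WHERE cur.hash_diff IS NULL
   OR cur.hash_diff <> MD5(COALESCE(s.customerName, '') || '|' || COALESCE(s.customerAddress, ''));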
11. Challenges encountered during data modeling
Challenge: Source issues
• Issue: Source systems and original data unavailable for most information; data often transformed 2-4 times before access was available; business keys (e.g. SKU) typically replaced with sequences
• Resolution: Business keys rebuilt in staging prior to vault loading (sketch below)
Challenge: Modeling returns
• Issue: Retail returns can appear in the ERP in 1-3 ways across multiple tables with inconsistent keys; online returns appear as a state change on the original transaction and may or may not appear in the ERP
• Resolution: The original model showed sale state on the line item satellite; the revised model recorded "negative sale" transactions and used a new link to connect to the original sale when possible
Challenge: Fragmented knowledge
• Issue: Information about the systems was being held by multiple people; documentation was out of date
• Resolution: Talking to as many people as possible and testing hypotheses on the data
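A minimal sketch of what "rebuilding business keys in staging" can look like, assuming a hypothetical ERP order-line table that carries only a surrogate article sequence and a lookup table that maps it back to the SKU (names are illustrative, not the actual Bijenkorf schema):
-- Hypothetical: recover the real business key (SKU) before vault loading.
CREATE VIEW stg_erp.order_line_keyed AS
SELECT ol.orderId,
       ol.articleSeq,
       art.sku,            -- rebuilt business key, later used for hub/link hashing
       ol.quantity,
       ol.lineAmount,
       ol.loadTs,
       ol.sourceFile
FROM stg_erp.order_line ol
JOIN stg_erp.article art
  ON art.articleSeq = ol.articleSeq;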
12. Targeted benefits of DWH automation
Objective | Achievements
Speed of development | Integration of new sources or data from existing sources takes 1-2 steps; adding a new vault dependency takes one step
Simplicity | Five jobs handle all ETL processes across the DWH
Traceability | Every record/source file is traced in the database and every row is automatically identified by source file in the ODS
Code simplification | Replaced most common key definitions with dynamic variable replacement
File management | Every source file automatically archived to Amazon S3 in appropriate locations sorted by source, table, and date; entire source systems, periods, etc. can be replayed in minutes
13. Source loading automation
o Design of the loader focused on process abstraction, traceability, and minimization of “moving parts”
o Final process consisted of two base jobs working in tandem: one for generating incremental extracts from source systems, one for loading flat files from all sources to staging tables
o Replication was desired but rejected due to limited access to source systems
Workflow of source integration:
1. Source tables duplicated in staging with the addition of loadTs and sourceFile columns
2. Metadata for the source file added
3. Loader automatically generates the ODS and begins tracking source files for duplication and data quality
4. Query generator automatically executes a full duplication on first execution and incrementals afterward
Example: add an additional table from an existing source (a hedged sketch of a generated load statement follows the example)
CREATE TABLE stg_oms.customer
(
customerId int
, customerName varchar(500)
, customerAddress varchar(5000)
, loadTs timestamp NOT NULL
, sourceFile varchar(255) NOT NULL
)
ORDER BY customerId
PARTITION BY date(loadTs)
;
INSERT INTO meta.source_to_stg_mapping
(targetSchema, targetTable, sourceSystem, fileNamePattern, delimiter, nullField)
VALUES
('stg_oms','customer','OMS','OMS_CUSTOMER','TAB','NULL')
;
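As a hedged illustration of what the flat-file loading job might generate from that metadata row (the generator itself is not shown in the deck; the file path, timestamp, and COPY options are assumptions), a Vertica COPY for one incoming OMS_CUSTOMER file could look like:
-- Hypothetical generated statement for a single file matching the OMS_CUSTOMER pattern.
-- loadTs and sourceFile are stamped by the loader so every row stays traceable.
COPY stg_oms.customer (customerId, customerName, customerAddress,
                       loadTs AS '2015-06-01 06:00:00'::timestamp,
                       sourceFile AS 'OMS_CUSTOMER_20150601.tsv')
FROM '/data/incoming/OMS/OMS_CUSTOMER_20150601.tsv'
DELIMITER E'\t'
NULL 'NULL'
ABORT ON ERROR;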
14. Vault loading automation
o Loader is fully metadata driven with a focus on horizontal scalability and management simplicity
o To support speed of development and performance, variable-driven SQL templates are used throughout (a hedged template sketch follows this slide)
Load flow:
1. All staging tables checked for changes
• New sources automatically added
• Last change epoch based on load stamps, advanced each time all dependencies execute successfully
2. List of dependent vault loads identified
• Dependencies declared at time of job creation
• Load prioritization possible but not utilized
3. Loads planned in hub, link, sat order
• Jobs parallelized across tables but serialized per job
• Dynamic job queueing ensures appropriate execution order
4. Loads executed
• Variables automatically identified and replaced
• Each load records performance statistics and error messages
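A minimal sketch of what a variable-driven vault-load template can look like (the ${...} variable names, table names, and keys are assumptions for illustration, not the actual Bijenkorf templates); the loader substitutes the variables and records runtime statistics per execution:
-- Hypothetical link-load template; ${...} placeholders are replaced by the loader.
INSERT INTO dv.${LINK_TABLE} (link_hk, hub_order_hk, hub_customer_hk, load_ts, record_source)
SELECT DISTINCT
       MD5(${ORDER_BK} || '|' || ${CUSTOMER_BK}) AS link_hk,
       MD5(${ORDER_BK})                          AS hub_order_hk,
       MD5(${CUSTOMER_BK})                       AS hub_customer_hk,
       s.loadTs,
       s.sourceFile
FROM ${STAGING_TABLE} s
LEFT JOIN dv.${LINK_TABLE} l
  ON l.link_hk = MD5(${ORDER_BK} || '|' || ${CUSTOMER_BK})
WHERE l.link_hk IS NULL
  AND s.loadTs > ${LAST_CHANGE_EPOCH};  -- only rows staged since the last successful run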
15. Design goals for mart loading automation
Requirement | Solution | Benefit
Simple, standardized models | Metadata-driven Pentaho PDI | Easy development using parameters and variables
Easily extensible | Plugin framework | Rapid integration of new functionality
Rapid new job development | Recycle standardized jobs and transformations | Limited moving parts, easy modification
Low administration overhead | Leverage built-in logging and tracking | Mart loading reporting easily integrated with other ETL reports
16. Information mart automation flow
Retrieve commands
• Each dimension and fact is processed independently
Get dependencies
• Based on the defined transformation, get all related vault tables: links, satellites or hubs
Retrieve changed data
• From the related tables, build a list of unique keys that have changed since the last update of the fact or dimension (sketch below)
• Store the data in the database until further processing
Execute transformations
• Multiple Pentaho transformations can be processed per command using the data captured in previous steps
Maintenance
• Logging happens throughout the whole process
• Cleanup after all commands have been processed
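A hedged sketch of the "retrieve changed data" step, assuming hypothetical vault and metadata table names (a satellite dv.sat_customer and a meta.mart_refresh table holding the last refresh timestamp per mart object):
-- Hypothetical delta detection for a customer dimension: collect the hub keys
-- whose satellite rows changed since the dimension was last refreshed.
INSERT INTO work.dim_customer_changed_keys (hub_customer_hk)
SELECT DISTINCT s.hub_customer_hk
FROM dv.sat_customer s
WHERE s.load_ts > (SELECT lastRefreshTs
                   FROM meta.mart_refresh
                   WHERE martObject = 'dim_customer');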
17. Primary uses of Bijenkorf DWH
Customer analysis
• Provided first unified data model of customer activity
• 80% reduction in unique customer keys
• Allowed for segmentation of customers based on a combination of in-store and online activity
Personalization
• DV drives the recommendation engine and customer recommendations (updated nightly)
• Data pipeline supports near real time updating of customer recommendations based on web activity
Business intelligence
• DV-based marts replace joining dozens of tables across multiple sources with single facts/dimensions
• IT-driven reporting being replaced with self-service BI
18. Biggest drivers of success
AWS Infrastructure
• Cost: entire infrastructure for less than one server in the data center
• Toolset: most services available off the shelf, minimizing administration
• Freedom: no dependency on IT for development support
• Scalability: systems automatically scaled to match DWH demands
Automation
• Speed: enormous time savings after the initial investment
• Simplicity: able to run and monitor 40k+ queries per day with minimal effort
• Auditability: enforced tracking and archiving without developer involvement
PDI framework
• Ease of use: adding new commands takes at most 45 minutes
• Agile: building the framework took 1 day
• Low profile: average memory usage of 250 MB
19. Biggest mistakes along the way
Reliance on documentation and requirements over expert users
• Initial integration design was based on provided documentation and models, which were rarely accurate
• Current users of the sources should have been engaged earlier to explain undocumented caveats
Late utilization of templates and variables
• Variables were utilized late in development, slowing progress significantly and creating consistency issues
• Good initial design of templates will significantly reduce development time in the mid/long run
Aggressive overextension of resources
• We attempted to design and populate the entire data vault prior to focusing on customer deliverables like reports (in addition to other projects)
• We have shifted focus to continuous release of new information rather than waiting for completeness
20. Primary takeaways
◦ Sources are like cars: the older they are, the more idiosyncrasies. Be cautious with design automation!
◦ Automation can enormously simplify/accelerate data warehousing. Don’t be afraid to roll your own
◦ Balance stateful versus stateless and monolithic versus fragmented architecture design
◦ Cloud-based architecture built on column-store DBs is extremely scalable, cheap, and highly performant
◦ A successful vault can create a new problem: getting IT to think about business processes rather than system keys!
One of the focus points will be the return satellite; perhaps the whole relationship to the return location and customer should have been modeled as a link?
Return satellite is an active satellite