Harness the Power of Data in a Big Data Lake discusses strategies for ingesting and processing data in a data lake. It describes how to design a data ingestion framework that accounts for factors like data format, source, size, and location. The document contrasts ETL vs ELT approaches and discusses techniques for batched and change data capture ingestion of both structured and unstructured data. It also provides an overview of tools like Sqoop that can be used to ingest data from relational databases into a data lake.
Achieve data democracy in data lake with data integration (Saurabh K. Gupta)
Saurabh K. Gupta discusses achieving data democratization through effective data integration. He outlines key considerations for data lake architecture and data ingestion frameworks to implement a data lake that empowers data democratization. The document provides an overview of data lake styles, principles for data ingestion, and techniques for batched and streaming data integration like Apache Sqoop, Apache Flume, and change data capture.
The document discusses creating a modern data architecture using a data lake approach. It describes the key components of a data lake including different zones for landing raw data, refining it into trusted datasets, and using sandboxes. It also summarizes challenges of data lakes and how an integrated data lake management platform can help with ingestion, governance, security, and enabling self-service analytics. Finally, it briefly discusses considerations for implementing cloud-based and hybrid data lakes.
Strata San Jose 2017 - Ben Sharma Presentation (Zaloni)
The document discusses creating a modern data architecture using a data lake. It describes Zaloni as a provider of data lake management solutions, including a data lake management and governance platform and self-service data platform. It outlines key features of a data lake such as storing different types of data, creating standardized datasets, and providing shorter time to insights. The document also discusses Zaloni's data lake maturity model and reference architecture.
Hadoop-based data lakes have become increasingly popular within today's modern data architectures for their scalability, ability to handle data variety, and low cost. Many organizations start their data lake initiatives slowly, but as the lakes grow bigger they run into challenges with data consistency, quality, and security, and lose confidence in those initiatives.
This talk will discuss the need for good data governance mechanisms for Hadoop data lakes, its relationship with productivity, and how it helps organizations meet regulatory and compliance requirements. The talk advocates adopting a different mindset for designing and implementing flexible governance mechanisms on Hadoop data lakes.
This document discusses how to build a successful data lake by focusing on the right data, platform, and interface. It emphasizes the importance of saving raw data to analyze later, organizing the data lake into zones with different governance levels, and providing self-service tools to find, understand, provision, prepare, and analyze data. It promotes the use of a smart data catalog like Waterline Data to automate metadata tagging, enable data discovery and collaboration, and maximize business value from the data lake.
My presentation slides from Hadoop Summit, San Jose, June 28, 2016. See live video at http://www.makedatauseful.com/vid-solving-performance-problems-hadoop/ and follow along for context.
Moving analytic workloads into production - specific technical challenges and best practices for engineering SQL in Hadoop solutions. Highlighting the next generation engineering approaches to the secret sauce we have implemented in the Actian VectorH database.
Pradeep Varadan, Verizon's Wireline OSS Data Science Lead and Scott Gidley, Zaloni's VP, Product Management discuss the benefits of augmenting your DW with a data lake in this webinar presentation.
Pervasive analytics through data & analytic centricity (Cloudera, Inc.)
Cloudera and Teradata discuss the best-in-class solution enabling companies to put data and analytics at the center of their strategy, achieve the highest forms of agility, while reducing the costs and complexity of their current environment.
1) The document discusses big data strategies and technologies including Oracle's big data solutions. It describes Oracle's big data appliance which is an integrated hardware and software platform for running Apache Hadoop.
2) Key technologies that enable deeper analytics on big data are discussed including advanced analytics, data mining, text mining and Oracle R. Use cases are provided in industries like insurance, travel and gaming.
3) An example use case of a "smart mall" is described where customer profiles and purchase data are analyzed in real-time to deliver personalized offers. The technology pattern for implementing such a use case with Oracle's real-time decisions and big data platform is outlined.
This document discusses deploying a governed data lake using Hadoop and Waterline Data Inventory. It begins by outlining the benefits of a data lake and differences between data lakes and data warehouses. It then discusses using Hadoop as the platform for the data lake and some challenges around governance, scale, and usability. The document proposes a three phase approach using Waterline Data Inventory to organize, inventory, and open up the data lake. It provides screenshots and descriptions of Waterline's key capabilities like metadata discovery, data profiling, sensitive data identification, governance tools, and self-service catalog. It also includes an overview of Waterline Data as a company.
5 Things that Make Hadoop a Game Changer
Webinar by Elliott Cordo, Caserta Concepts
There is much hype and mystery surrounding Hadoop's role in analytic architecture. In this webinar, Elliott presented, in detail, the services and concepts that make Hadoop a truly unique solution - a game changer for the enterprise. He talked about the real benefits of a distributed file system, the multi-workload processing capabilities enabled by YARN, and the three other important things you need to know about Hadoop.
To access the recorded webinar, visit the event site: https://www.brighttalk.com/webcast/9061/131029
For more information on the services and solutions that Caserta Concepts offers, please visit http://casertaconcepts.com/
Data Lakehouse Symposium | Day 1 | Part 2 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
This document compares data warehouses and data lakes. A data warehouse stores transformed and structured data to enable generating reports for strategic decision making. A data lake stores vast amounts of raw data in its native format until needed. Major differences are that data warehouses remove insignificant data while data lakes retain all data types. Data lakes also empower exploring data in novel ways. Key benefits of data lakes over data warehouses include greater scalability, supporting more data sources and advanced analytics, and deferring schema development until a business need is identified.
Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha... (Seeling Cheung)
Citizens Bank was implementing a BigInsights Hadoop Data Lake with PureData System for Analytics to support all internal data initiatives and improve the customer experience. Testing BigInsights on the ViON Hadoop Appliance yielded the productivity, maintenance, and performance Citizens was looking for. Citizens Bank moved some analytics processing from Teradata to Netezza for better cost and performance, implemented BigInsights Hadoop for a data lake, and avoided large capital expenditures for additional Teradata capacity.
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa... (Zaloni)
When building your data stack, the architecture could be your biggest challenge. Yet it could also be the best predictor for success. With so many elements to consider and no proven playbook, where do you begin to assemble best practices for a scalable data architecture? Ben Sharma, thought leader and coauthor of Architecting Data Lakes, offers lessons learned from the field to get you started.
Federated data architecture involves integrating data from multiple disparate sources to provide a logically integrated view. It allows existing systems to continue operating while being modernized. The US Air Force implemented a federated data solution to manage its $40 billion budget across 100 global locations. It integrated financial data from over 20 legacy systems and provided 15,000 users with real-time access and ad hoc querying capabilities while maintaining high performance.
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat... (Hortonworks)
How do you turn data from many different sources into actionable insights and manufacture those insights into innovative information-based products and services?
Industry leaders are accomplishing this by adding Hadoop as a critical component in their modern data architecture to build a data lake. A data lake collects and stores data across a wide variety of channels including social media, clickstream data, server logs, customer transactions and interactions, videos, and sensor data from equipment in the field. A data lake cost-effectively scales to collect and retain massive amounts of data over time, and convert all this data into actionable information that can transform your business.
Join Hortonworks and Informatica as we discuss:
- What is a data lake?
- The modern data architecture for a data lake
- How Hadoop fits into the modern data architecture
- Innovative use-cases for a data lake
Agile Big Data Analytics Development: An Architecture-Centric Approach (SoftServe)
Presented at The Hawaii International Conference on System Sciences by Hong-Mei Chen and Rick Kazman (University of Hawaii), Serge Haziyev (SoftServe).
The Future of Data Warehousing: ETL Will Never be the Same (Cloudera, Inc.)
Traditional data warehouse ETL has become too slow, too complicated, and too expensive to address the torrent of new data sources and new analytic approaches needed for decision making. The new ETL environment is already looking drastically different.
In this webinar, Ralph Kimball, founder of the Kimball Group, and Manish Vipani, Vice President and Chief Architect of Enterprise Architecture at Kaiser Permanente will describe how this new ETL environment is actually implemented at Kaiser Permanente. They will describe the successes, the unsolved challenges, and their visions of the future for data warehouse ETL.
Designing the Next Generation Data Lake (Robert Chong)
This document contains a presentation by George Trujillo on designing the next generation data lake. It discusses how analytic platforms need to change to keep up with business demands. New technologies like cloud, object storage, and self-driving databases are allowing for more flexible and scalable data architectures. This is shifting analytics platforms from tightly coupled storage and compute to independent, elastic models. These changes will impact how organizations build projects, careers, and skills in the future by focusing more on innovation and delivering results faster.
Big data architectures and the data lake (James Serra)
The document provides an overview of big data architectures and the data lake concept. It discusses why organizations are adopting data lakes to handle increasing data volumes and varieties. The key aspects covered include:
- Defining top-down and bottom-up approaches to data management
- Explaining what a data lake is and how Hadoop can function as the data lake
- Describing how a modern data warehouse combines features of a traditional data warehouse and data lake
- Discussing how federated querying allows data to be accessed across multiple sources
- Highlighting benefits of implementing big data solutions in the cloud
- Comparing shared-nothing, massively parallel processing (MPP) architectures to symmetric multi-processing (SMP) architectures
More and more organizations are moving their ETL workloads to a Hadoop-based ELT grid architecture. Hadoop's inherent capabilities, especially its ability to do late binding, address some of the key challenges of traditional ETL platforms. In this presentation, attendees will learn the key factors, considerations, and lessons around ETL for Hadoop: pros and cons of different extract and load strategies, best ways to batch data, buffering and compression considerations, leveraging HCatalog, data transformation, integration with existing data transformations, the advantages of different ways of exchanging data, and leveraging Hadoop as a data integration layer. This is an extremely popular presentation around ETL and Hadoop.
Big Data 2.0: ETL & Analytics: Implementing a next generation platform (Caserta)
In our most recent Big Data Warehousing Meetup, we learned about transitioning from Big Data 1.0, built on Hadoop 1.x and nascent technologies, to Hadoop 2.x with YARN, which enables distributed ETL, SQL, and analytics solutions. Caserta Concepts Chief Architect Elliott Cordo and an Actian engineer covered the complete data value chain of an enterprise-ready platform, including data connectivity, collection, preparation, optimization, and analytics with end-user access.
Access additional slides from this meetup here:
http://www.slideshare.net/CasertaConcepts/big-data-warehousing-meetup-january-20
For more information on our services or upcoming events, please visit http://www.actian.com/ or http://www.casertaconcepts.com/.
Understanding Metadata: Why it's essential to your big data solution and how ... (Zaloni)
This document discusses the importance of metadata for big data solutions and data lakes. It begins with introductions of the two speakers, Ben Sharma and Vikram Sreekanti. It then discusses how metadata allows you to track data in the data lake, improve change management and data visibility. The document presents considerations for metadata such as integration with enterprise solutions and automated registration. It provides examples of using metadata for data lineage, quality, and cataloging. Finally, it discusses using metadata across storage tiers for data lifecycle management and providing elastic compute resources.
Extending Data Lake using the Lambda Architecture June 2015 (DataWorks Summit)
The document discusses using a Lambda architecture to extend a data lake with real-time capabilities. It describes considerations for choosing a real-time architecture and common use cases. Specific examples discussed include using real-time architectures for patient critical care in healthcare and customer engagement in marketing.
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake... (NoSQLmatters)
Come to this deep dive on how Pivotal's Data Lake Vision is evolving by embracing next generation in-memory data exchange and compute technologies around Spark and Tachyon. Did we say Hadoop, SQL, and what's the shortest path to get from past to future state? The next generation of data lake technology will leverage the availability of in-memory processing, with an architecture that supports multiple data analytics workloads within a single environment: SQL, R, Spark, batch and transactional.
A data warehouse is a collection of data integrated from multiple sources to support decision making. It contains subject-oriented, integrated, time-variant, and non-volatile data stored in a way that makes it readily available for analysis. Data marts can be dependent on the warehouse or independent subsets designed for specific departments. Successful implementation requires identifying data sources and governance, planning data quality and modeling, selecting ETL and database tools, and supporting end users. Key challenges include unrealistic expectations, technical issues, and ensuring ongoing value.
A data warehouse is a collection of integrated data from multiple sources organized to support management decision making. It contains subject-oriented, integrated, time-variant and non-volatile data stored in a way that is optimized for query and analysis. There are different types of data warehouses including data marts, operational data stores and enterprise data warehouses. Key components of a data warehouse include data sources, extraction, loading, a comprehensive database, metadata and middleware tools.
The Shifting Landscape of Data Integration (DATAVERSITY)
This document discusses the shifting landscape of data integration. It begins with an introduction by William McKnight, who is described as the "#1 Global Influencer in Data Warehousing". The document then discusses how challenges in data integration are shifting from dealing with volume, velocity and variety to dealing with dynamic, distributed and diverse data in the cloud. It also discusses IDC's view that this shift is occurring from the traditional 3Vs to the 3Ds. The rest of the document discusses Matillion, a vendor that provides a modern solution for cloud data integration challenges.
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture (DATAVERSITY)
Whether to take data ingestion cycles off the ETL tool and the data warehouse or to facilitate competitive Data Science and building algorithms in the organization, the data lake – a place for unmodeled and vast data – will be provisioned widely in 2020.
Though it doesn’t have to be complicated, the data lake has a few key design points that are critical, and it does need to follow some principles for success. Avoid building the data swamp, but not the data lake! The tool ecosystem is building up around the data lake and soon many will have a robust lake and data warehouse. We will discuss policy to keep them straight, send data to its best platform, and keep users’ confidence up in their data platforms.
Data lakes will be built in cloud object storage. We’ll discuss the options there as well.
Get this data point for your data lake journey.
This document provides an overview of data warehousing. It defines a data warehouse as a subject-oriented, integrated collection of data used to support management decision making. The benefits of data warehousing include high returns on investment and increased productivity. A data warehouse differs from an OLTP system in its design for analytics rather than transactions. The typical architecture includes data sources, an operational data store, warehouse manager, query manager and end user tools. Key components are extracting, cleaning, transforming and loading data, and managing metadata. Data flows include inflows from sources and upflows of summarized data to users.
ADV Slides: Building and Growing Organizational Analytics with Data Lakes (DATAVERSITY)
Data lakes are providing immense value to organizations embracing data science.
In this webinar, William will discuss the value of having broad, detailed, and seemingly obscure data available in cloud storage for purposes of expanding Data Science in the organization.
Data Lakehouse, Data Mesh, and Data Fabric (r1) (James Serra)
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. I’ll include use cases so you can see what approach will work best for your big data needs.
- Data warehousing aims to help knowledge workers make better decisions by integrating data from multiple sources and providing historical and aggregated data views. It separates analytical processing from operational processing for improved performance.
- A data warehouse contains subject-oriented, integrated, time-variant, and non-volatile data to support analysis. It is maintained separately from operational databases. Common schemas include star schemas and snowflake schemas.
- Online analytical processing (OLAP) supports ad-hoc querying of data warehouses for analysis. It uses multidimensional views of aggregated measures and dimensions. Relational and multidimensional OLAP are common architectures. Measures are metrics like sales, and dimensions provide context like products and time periods.
Data Lake Acceleration vs. Data Virtualization - What’s the difference? (Denodo)
Watch full webinar here: https://bit.ly/3hgOSwm
Data Lake technologies have been in constant evolution in recent years, with each iteration promising to fix what previous ones failed to accomplish. Several data lake engines are hitting the market with better ingestion, governance, and acceleration capabilities that aim to create the ultimate data repository. But isn't that the promise of a logical architecture with data virtualization too? So, what’s the difference between the two technologies? Are they friends or foes? This session will explore the details.
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha... (DATAVERSITY)
Thirty years is a long time for a technology foundation to be as active as relational databases. Are their replacements here? In this webinar, we say no.
Databases have not sat around while Hadoop emerged. The Hadoop era generated a ton of interest and confusion, but is it still relevant as organizations are deploying cloud storage like a kid in a candy store? We’ll discuss what platforms to use for what data. This is a critical decision that can dictate two to five times additional work effort if it’s a bad fit.
Drop the herd mentality. In reality, there is no “one size fits all” right now. We need to make our platform decisions amidst this backdrop.
This webinar will distinguish these analytic deployment options and help you platform 2020 and beyond for success.
Data Lakehouse, Data Mesh, and Data Fabric (r2) (James Serra)
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a modern data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. They all may sound great in theory, but I'll dig into the concerns you need to be aware of before taking the plunge. I’ll also include use cases so you can see what approach will work best for your big data needs. And I'll discuss Microsoft's version of the data mesh.
Architect’s Open-Source Guide for a Data Mesh Architecture (Databricks)
Data Mesh is an innovative concept addressing many data challenges from an architectural, cultural, and organizational perspective. But is the world ready to implement Data Mesh?
In this session, we will review the importance of core Data Mesh principles, what they can offer, and when it is a good idea to try a Data Mesh architecture. We will discuss common challenges with implementation of Data Mesh systems and focus on the role of open-source projects for it. Projects like Apache Spark can play a key part in standardized infrastructure platform implementation of Data Mesh. We will examine the landscape of useful data engineering open-source projects to utilize in several areas of a Data Mesh system in practice, along with an architectural example. We will touch on what work (culture, tools, mindset) needs to be done to ensure Data Mesh is more accessible for engineers in the industry.
The audience will leave with a good understanding of the benefits of Data Mesh architecture, common challenges, and the role of Apache Spark and other open-source projects for its implementation in real systems.
This session is targeted for architects, decision-makers, data-engineers, and system designers.
The document discusses evolving data warehousing strategies and architecture options for implementing a modern data warehousing environment. It begins by describing traditional data warehouses and their limitations, such as lack of timeliness, flexibility, quality, and findability of data. It then discusses how data warehouses are evolving to be more modern by handling all types and sources of data, providing real-time access and self-service capabilities for users, and utilizing technologies like Hadoop and the cloud. Key aspects of a modern data warehouse architecture include the integration of data lakes, machine learning, streaming data, and offering a variety of deployment options. The document also covers data lake objectives, challenges, and implementation options for storing and analyzing large amounts of diverse data sources.
An overview of modern scalable web development (Tung Nguyen)
The document provides an overview of modern scalable web development trends. It discusses the motivation to build systems that can handle large amounts of data quickly and reliably. It then summarizes the evolution of software architectures from monolithic to microservices. Specific techniques covered include reactive system design, big data analytics using Hadoop and MapReduce, machine learning workflows, and cloud computing services. The document concludes with an overview of the technologies used in the Septeni tech stack.
The document discusses Data Vault 2.0, a data modeling technique for integrating historical data from multiple sources. It then summarizes Pivotal Greenplum, an open source massively parallel processing database, and provides examples of reference cases using these technologies for data integration and analytics projects.
Various Applications of Data Warehouse.ppt (RafiulHasan19)
The document discusses various applications of data warehousing. It begins by describing problems with traditional transactional systems and how data warehouses address these issues. It then defines key components of a data warehouse including the extraction, transformation, and loading of data from various sources. The document outlines how online analytical processing (OLAP) tools, metadata repositories, and data mining techniques analyze and explore the collected data. Finally, it weighs the benefits of a data warehouse against the costs of implementation and maintenance.
Big Data in the Cloud with Azure Marketplace Images (Mark Kromer)
The document discusses strategies for modern data warehousing and analytics on Azure including using Hadoop for ETL/ELT, integrating streaming data engines, and using lambda and hybrid architectures. It also describes using data lakes on Azure to collect and analyze large amounts of data from various sources. Additionally, it covers performing real-time stream analytics, machine learning, and statistical analysis on the data and discusses how Azure provides scalability, speed of deployment, and support for polyglot environments that incorporate many data processing and storage options.
Big Data is the reality of modern business: from big companies to small ones, everybody is trying to find their own benefit. Big Data technologies are not meant to replace traditional ones, but to be complementary to them. In this presentation you will hear what is Big Data and Data Lake and what are the most popular technologies used in Big Data world. We will also speak about Hadoop and Spark, and how they integrate with traditional systems and their benefits.
The Data Lake and Getting Businesses the Big Data Insights They Need (Dunn Solutions Group)
Do terms like "Data Lake" confuse you? You’re not alone. With all of the technology buzzwords flying around today, it can become a task to keep up with and clearly understand each of them. However a data lake is definitely something to dedicate the time to understand. Leveraging data lake technology, companies are finally able to keep all of their disparate information and streams of data in one secure location ready for consumption at any time – this includes structured, unstructured, and semi-structured data. For more information on our Big Data Consulting Services, don’t hesitate to visit us online at: http://bit.ly/2fvV5rR
1. Harness the Power of Data in a
Big Data Lake
Saurabh K. Gupta
Manager, Data & Analytics, GE
www.amazon.com/author/saurabhgupta
@saurabhkg
2. Disclaimer:
“This report has been prepared by the Authors who are part of GE. The opinions expressed herein by the Authors
and the information contained herein are in good faith, and the Authors and GE disclaim any liability for the content in
the report. The report is the property of GE and GE is the holder of the copyright or any intellectual property over
the report. No part of this document may be reproduced in any manner without the written permission of GE. This
Report also contains certain information available in the public domain, created and maintained by private and public
organizations. GE does not control or guarantee the accuracy, relevance, timeliness or completeness of such
information. This Report constitutes a view as on the date of publication and is subject to change. GE does not
warrant or solicit any kind of act or omission based on this Report.”
3. About me
• Manager, Data & Analytics at GE Digital
• 10+ years of experience in data architecture,
engineering, analytics.
• Authored a couple of books
• Oracle Advanced PL/SQL Developer Professional
Guide (2012)
• Advanced Oracle PL/SQL Developer’s Guide
(2016)
• Current session - excerpt from an “untitled book”
(TBD)
• Speaker at AIOUG, IOUG, NASSCOM
• Twitter: @saurabhkg, Blog: sbhoracle.wordpress.com
AIOUG Sangam’17
4. Why I’m here?
• Data lake is a relatively new term compared with the fancier ones coined since the industry
realized the potential of data. The industry is planning its way toward adopting the big data lake
as the key data store, but what challenges it is the traditional approach. Traditional
approaches to data pipelines, data processing, and data security still hold good, but
architects do need to leap an extra mile while designing a big data lake.
• This session will focus on this shift in approaches. We will explore the roadblocks in
setting up a data lake and how to size the key milestones. The health and
efficiency of a data lake largely depend on two factors: data ingestion and data
processing. Attend this session to learn key practices of data ingestion under different
circumstances; data processing for a variety of scenarios will be covered as well.
AIOUG Sangam’17
5. Agenda
• Data Lake – the evolution
• Data Lake architecture
• Data Lake ingestion strategy
• Structured
• Unstructured
• Change data capture
• Principles
• Data Processing techniques
6. Data Explosion
Future data trends:
• Data as a Service
• Cybersecurity
• Augmented Analytics
• Machine Intelligence
• “Fast Data” and “Actionable Data”
7. Evolution of the Data Lake
• James Dixon’s “time machine” vision of data
• Leads the Data-as-an-Asset strategy
“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
– James Dixon
8. What is a Data Lake? And why?
What?
• A data lake represents the state of the enterprise at a given point in time
• Stores all data, in full detail, in one place
• Empowers business analytics applications and predictive and deep learning models with one “time machine” data store
• Unifies data discovery, data science, and enterprise BI in an organization
Why?
• Scalable framework to store massive volumes of data and churn out analytical insights
• The data ingestion and processing framework becomes the cornerstone, rather than just the storage
• Warrants data discovery platforms that can absorb data trends at horizontal scale and produce visual insights
10. Data Lake vs Data Warehouse
Data Lake
• Gets the data in its raw format, unprocessed and untouched, to build the mirror layer; follows a schema-on-read strategy
• A war room for data scientists, analysts, and data engineering specialists from multiple domains to run analytical models
• Primarily hosted on Hadoop, which is open source and comes with free community support
Data Warehouse
• Pulls data from the source and takes it through a processing layer, i.e., schema-on-write
• Targets management staff and business leaders who expect structured analyst reports at the end of the day
• Runs on relatively expensive storage to withstand high-scale data processing
11. Data Lake architecture
[Architecture diagram] Source data systems (RDBMS, web, IoT, files) feed a Data Acquisition Layer (source-to-data-lake ETL), which lands data in the Enterprise Data Lake: a mirror layer, optional intermediate layers, and an analytics layer. Data is consumed through data visualization, data science and exploration, data discovery/exploration, and deep learning models.
12. Data Lake operational pillars
• Data-as-an-Asset
• Data Management
• Data Ingestion
• Data Engineering
• Data Visualization
14. Understand the data
• Structured data is an organized piece of information
• Aligns strongly with relational standards
• Defined metadata
• Easy ingestion, retrieval, and processing
• Unstructured data lacks structure and metadata
• Not so easy to ingest
• Complex retrieval
• Complex processing
15. Understand the data sources
• OLTP systems and data warehouses – structured data from typical relational data stores
• Data management systems – documents and text files
• Legacy systems – essential for historical and regulatory analytics
• Sensors and IoT devices – devices installed in healthcare, home, and mobile appliances and on large machines can upload logs to the data lake at periodic intervals or from within a secure network region
• Web content – data from the web (retail sites, blogs, social media)
• Geographical data – data flowing from location data, maps, and geo-positioning systems
16. Data Ingestion Framework
Design considerations
• Data format – what format is the data to be ingested in?
• Data change rate – critical for CDC design and streaming data. Performance is a derivative of throughput and latency.
• Data location and security –
• Whether data is located on-premise or on public cloud infrastructure. While fetching data from cloud instances, network bandwidth plays an important role.
• If the data source is enclosed within a security layer, the ingestion framework should be able to establish a secure tunnel to collect data for ingestion.
• Transfer data size (file compression and file splitting) – what will be the average and maximum size of a block or object in a single ingestion operation?
• Target file format – data from a source system needs to be ingested in a Hadoop-compatible file format (the sketch below captures these decisions as a configuration object).
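To make these considerations concrete, here is a minimal sketch of how an ingestion framework might record them as a declarative, per-source configuration. The field names and values are hypothetical illustrations, not part of any specific framework.

from dataclasses import dataclass

@dataclass
class IngestionConfig:
    """Hypothetical per-source ingestion settings, one field per design consideration."""
    source_format: str    # data format: e.g. "csv", "json", "jdbc"
    change_rate: str      # "batch", "cdc", or "streaming"
    location: str         # "on-premise" or "cloud"
    secure_tunnel: bool   # True if the source sits behind a security layer
    max_transfer_mb: int  # maximum block/object size per ingestion operation
    target_format: str    # Hadoop-compatible target: e.g. "parquet", "orc", "avro"

# Example: a CDC feed from an on-premise database, landed as Parquet.
orders_feed = IngestionConfig(
    source_format="jdbc",
    change_rate="cdc",
    location="on-premise",
    secure_tunnel=True,
    max_transfer_mb=256,
    target_format="parquet",
)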
17. ETL vs ELT for a Data Lake
ETL
• Heavy transformation may restrict the data surface area available for exploration
• Brings down data agility
• Transformation on huge volumes of data may introduce latency between the data source and the data lake
ELT
• Loads raw data first, then builds a curated layer to empower analytical models
18. Batched data ingestion principles
Structured data
• The data collector fires a SELECT query (also known as a filter query) on the source to pull incremental records or a full extract
• Query performance and source workload determine how efficient the data collector is
• Robust and flexible
Ingestion techniques
• Change-track flag – flag rows with the operation code
• Incremental extraction – pull all the changes after a certain timestamp (see the sketch below)
• Full extraction – refresh the target on every ingestion run
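As an illustration of timestamp-based incremental extraction, here is a minimal Python sketch. The table and column names are hypothetical, the watermark store is simplified to a local file, and sqlite3 stands in for any relational source.

import sqlite3  # stand-in for any JDBC/ODBC source; keeps the sketch self-contained

WATERMARK_FILE = "orders.watermark"

def read_watermark() -> str:
    """Return the last successfully ingested timestamp, or the epoch on first run."""
    try:
        with open(WATERMARK_FILE) as f:
            return f.read().strip()
    except FileNotFoundError:
        return "1970-01-01 00:00:00"

def incremental_extract(conn: sqlite3.Connection):
    """Pull only rows changed after the stored watermark (the 'filter query')."""
    last_ts = read_watermark()
    rows = conn.execute(
        "SELECT order_id, product, sales, updated_at "
        "FROM orders WHERE updated_at > ? ORDER BY updated_at",
        (last_ts,),
    ).fetchall()
    if rows:
        # Persist the new high-water mark (in a real pipeline, only after the batch lands).
        with open(WATERMARK_FILE, "w") as f:
            f.write(rows[-1][-1])
    return rows

# Example: an in-memory source with one changed row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INT, product TEXT, sales REAL, updated_at TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 'pump', 500.0, '2017-11-01 10:00:00')")
print(incremental_extract(conn))  # first run pulls everything after the epoch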
19. Change Data Capture
• A log-mining process
• Captures changed data from the source system’s transaction logs and integrates it with the target
• Eliminates the need to run SQL queries on the source system, so it incurs no load overhead on a transactional source
• Achieves near-real-time replication between source and target (the sketch below illustrates the apply side)
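To illustrate the apply side of log-based CDC, here is a minimal sketch that replays mined change events (insert/update/delete operation codes) against an in-memory target. The event shape is hypothetical and stands in for whatever a log-mining tool emits.

# Each mined event carries an operation code, the primary key, and the row image.
events = [
    {"op": "I", "pk": 101, "row": {"product": "pump", "sales": 500}},
    {"op": "U", "pk": 101, "row": {"product": "pump", "sales": 750}},
    {"op": "D", "pk": 101, "row": None},
]

target = {}  # stand-in for the target table, keyed by primary key

def apply_event(event: dict) -> None:
    """Replay one transaction-log event against the target, in commit order."""
    if event["op"] in ("I", "U"):
        target[event["pk"]] = event["row"]   # insert or overwrite
    elif event["op"] == "D":
        target.pop(event["pk"], None)        # delete if present

for e in events:
    apply_event(e)

print(target)  # {} – the row was inserted, updated, then deleted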
20. Change Data Capture
Design considerations
• The source database must be enabled for logging
• Commercial tools – Oracle GoldenGate, HVR, Talend CDC, custom replicators
• Keys are extremely important for replication
• They help the capture job establish the uniqueness of a record in the changed data set
• The source primary key ensures the changes are applied to the correct record on the target
• If no PK is available, establish uniqueness based on composite columns (sketched below)
• Establishing uniqueness based on a unique constraint is terrible design!
• Trigger-based CDC
• An event on a table triggers the change to be captured in a change-log table
• The change-log table is merged with the target
• Works when source transaction logs are not available
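As a sketch of the no-PK case above: when the source lacks a primary key, the merge from a trigger-populated change-log table has to key records on a composite of columns. The column names here are hypothetical.

from typing import Iterable

# Composite columns chosen to stand in for a missing primary key.
KEY_COLUMNS = ("customer_id", "order_date", "line_no")

def composite_key(row: dict) -> tuple:
    """Build the record identity from the composite columns."""
    return tuple(row[c] for c in KEY_COLUMNS)

def merge_change_log(target: dict, change_log: Iterable[dict]) -> None:
    """Merge trigger-captured changes into the target, keyed on the composite."""
    for change in change_log:
        key = composite_key(change["row"])
        if change["op"] == "D":
            target.pop(key, None)
        else:  # "I" or "U" – upsert the latest row image
            target[key] = change["row"]

# Example change-log rows, as a trigger might have captured them.
log = [
    {"op": "I", "row": {"customer_id": 7, "order_date": "2017-12-01", "line_no": 1, "sales": 120}},
    {"op": "U", "row": {"customer_id": 7, "order_date": "2017-12-01", "line_no": 1, "sales": 150}},
]
tbl: dict = {}
merge_change_log(tbl, log)  # tbl now holds the updated row image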
21. LinkedIn Databus
CDC capture pipeline
• The Relay is responsible for pulling the most recently committed transactions from the source
• Relays are implemented through the Tungsten Replicator
• The Relay stores the changes in logs or in a cache, in compressed format
• Consumers pull the changes from the Relay
• Bootstrap component – a snapshot of the data source on a temporary instance, kept consistent with the changes captured by the Relay
• If any consumer falls behind and can’t find the changes in the Relay, the bootstrap component transforms and packages the changes for the consumer
• A new consumer, with the help of the client library, can apply all the changes from the bootstrap component up to a point in time; the client library then points the consumer to the Relay to continue pulling the most recent changes (see the sketch below)
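Here is a minimal sketch of that catch-up logic: a consumer first asks the relay and falls back to the bootstrap snapshot when its requested offset has aged out of the relay's buffer. This illustrates the pattern only; it is not Databus's actual API.

class Relay:
    """Keeps only the most recent changes, like an in-memory change buffer."""
    RETENTION = 1000

    def __init__(self):
        self.changes = {}   # offset -> change event
        self.latest = 0

    def fetch(self, after: int):
        oldest = max(0, self.latest - self.RETENTION)
        if after < oldest:
            return None     # aged out – consumer must bootstrap
        return [self.changes[o] for o in range(after + 1, self.latest + 1)]

class Bootstrap:
    """Snapshot of the source, kept consistent with the relay's stream."""
    def __init__(self, relay):
        self.relay = relay

    def snapshot(self):
        # Returns (full state, offset the snapshot is consistent with).
        return {"rows": "full snapshot"}, self.relay.latest

class Consumer:
    def __init__(self, relay, bootstrap):
        self.relay, self.bootstrap, self.offset = relay, bootstrap, 0

    def poll(self):
        batch = self.relay.fetch(self.offset)
        if batch is None:                         # fell too far behind
            state, self.offset = self.bootstrap.snapshot()
            self.apply_snapshot(state)            # catch up from the snapshot
            batch = self.relay.fetch(self.offset) or []
        self.apply_changes(batch)                 # then resume streaming
        self.offset = self.relay.latest

    def apply_snapshot(self, state): ...
    def apply_changes(self, batch): ...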
22. Apache Sqoop
• A native member of the Hadoop tech stack for data ingestion
• Batched ingestion only; no CDC support
• A Java-based utility (with a web interface in Sqoop2) that spawns map jobs from the MapReduce engine to store data in HDFS
• Provides full-extract as well as incremental import modes (a sample invocation is sketched below)
• Runs on an HDFS cluster and can populate tables in Hive and HBase
• Can establish a data integration layer between NoSQL stores and HDFS
• Can be integrated with Oozie to schedule import/export tasks
• Supports connectors to multiple relational databases such as Oracle, SQL Server, and MySQL
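As a sketch of an incremental import, here is how the sqoop command line might be assembled and launched from Python. The --incremental, --check-column, --last-value, and --password-file flags are standard Sqoop 1 options; the connection string, table, and paths are hypothetical.

import subprocess

# --incremental append pulls only rows whose check column exceeds the last value.
cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",
    "--username", "etl_user",
    "--password-file", "/user/etl/.db_password",  # avoid passwords on the command line
    "--table", "orders",
    "--incremental", "append",
    "--check-column", "order_id",
    "--last-value", "1000",
    "--target-dir", "/data/lake/raw/orders",
    "--num-mappers", "4",
]
subprocess.run(cmd, check=True)  # raises if the import fails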
23. Sqoop architecture
• Uses mapper jobs from the MapReduce processing layer in Hadoop
• By default, a sqoop job has four mappers
• Rule of split
• Values of the --split-by column must be evenly distributed across the mappers
• The --split-by column should ideally be a primary key (see the split sketch below)
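To show what the rule of split means in practice, here is a sketch of how Sqoop-style splitting divides the --split-by column's range among mappers: a min/max boundary query, then an even slice per mapper. This is why skewed key values leave some mappers with far more rows than others.

def compute_splits(min_val: int, max_val: int, num_mappers: int):
    """Divide [min_val, max_val] into num_mappers contiguous ranges,
    mimicking how a boundary query becomes per-mapper WHERE clauses."""
    span = (max_val - min_val + 1) / num_mappers
    splits = []
    for i in range(num_mappers):
        lo = int(min_val + i * span)
        hi = int(min_val + (i + 1) * span) - 1 if i < num_mappers - 1 else max_val
        splits.append((lo, hi))
    return splits

# Boundary query result: SELECT MIN(order_id), MAX(order_id) FROM orders
for lo, hi in compute_splits(1, 1_000_000, 4):
    print(f"WHERE order_id >= {lo} AND order_id <= {hi}")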
24. Sqoop
Design considerations - I
• Mappers
• --num-mappers [n] argument
• Mappers run in parallel within Hadoop; there is no formula, so the count needs to be set judiciously
• Cannot split?
• Use --autoreset-to-one-mapper to perform an unsplit extraction
• Source has no PK
• Split based on a natural or surrogate key
• Source has character keys
• Divide and conquer! Create manual partitions and run one mapper per partition
• If the key value is an integer, no worries
25. Sqoop
Design considerations - II
• If only a subset of columns is required from the source table, specify the column list in the --columns argument.
• For example, --columns "orderId,product,sales"
• If only some rows need to be “sqooped”, specify the --where argument with a predicate.
• For example, --where "sales > 1000"
• If the result of a structured query needs to be imported, use the --query argument.
• For example, --query 'select orderId, product, sales from orders where sales > 1000'
• Use the --hive-partition-key and --hive-partition-value arguments to create partitions on a column key during the import
• Delimiters can be handled in either of these ways:
• Specify --hive-drop-import-delims to remove delimiters during the import
• Specify --hive-delims-replacement to replace delimiters with an alternate character
Several of these options are combined into one command in the sketch below.
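Putting the options together, a column- and row-filtered import into a partitioned Hive table might be assembled like this. All flags shown are standard Sqoop 1 options; the connection details, table, and names are hypothetical.

import subprocess

cmd = [
    "sqoop", "import",
    "--connect", "jdbc:oracle:thin:@dbhost:1521/SALES",
    "--username", "etl_user",
    "--password-file", "/user/etl/.db_password",
    "--table", "ORDERS",
    "--columns", "orderId,product,sales",      # subset of columns
    "--where", "sales > 1000",                 # subset of rows
    "--hive-import",                           # land the data in a Hive table
    "--hive-table", "lake.orders_high_value",
    "--hive-partition-key", "load_date",       # static partition for this run
    "--hive-partition-value", "2017-12-01",
    "--hive-drop-import-delims",               # strip \n, \r, \01 from string fields
]
subprocess.run(cmd, check=True)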
26. Oracle copyToBDA
• Licensed under Oracle Big Data SQL
• Requires the Oracle BDA, Exadata, and InfiniBand stack
• Helps load Oracle database tables into Hadoop by:
• Dumping the table data in Data Pump format
• Copying the dump files into HDFS
• Full extract and load only
• When source data changes, rerun the utility to refresh the Hive tables
27. Greenplum’s GPHDFS
• Set up on all segment nodes of a Greenplum cluster
• All segments concurrently push their local copies of data splits to the Hadoop cluster
• Cluster segments yield the power of parallelism
[Diagram] Each Greenplum segment writes through its own writable external table into the Hadoop cluster.
28. Ingest unstructured data using Flume
• A distributed system to capture and load large volumes of log data from different source systems into the data lake
• Collects and aggregates streaming data as events
[Diagram] A client PUTs incoming events into a Flume agent’s source; the source writes them to a channel within a source transaction, and a sink TAKEs them from the channel within a sink transaction before emitting the outgoing data. The sketch below mimics this source–channel–sink flow.
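To make the PUT/TAKE semantics concrete, here is a minimal Python sketch of an agent-like pipeline with a bounded channel. It illustrates the pattern only; it is not Flume's API, and the event shape is hypothetical.

from queue import Queue, Full, Empty

class Channel:
    """Bounded buffer between source and sink, like a MEMORY channel."""
    def __init__(self, capacity: int):
        self.q = Queue(maxsize=capacity)

    def put(self, event) -> bool:
        try:
            self.q.put_nowait(event)       # source-side PUT
            return True
        except Full:
            return False                   # backpressure: channel is full

    def take_batch(self, batch_size: int) -> list:
        batch = []
        try:
            while len(batch) < batch_size:
                batch.append(self.q.get_nowait())  # sink-side TAKE
        except Empty:
            pass
        return batch

channel = Channel(capacity=10_000)

# Source: a client PUTs incoming events into the channel.
for i in range(100):
    channel.put({"event_id": i, "body": f"log line {i}"})

# Sink: TAKE events in batches and emit them downstream (here, just count them).
while (batch := channel.take_batch(batch_size=25)):
    print(f"sink wrote {len(batch)} events")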
29. Apache Flume
Design considerations - I
• Channel type
• MEMORY – events are read from the source into memory
• Good performance, but volatile and not cost-effective
• FILE – events are read from the source into the file system
• Controllable performance; persistent, with a transactional guarantee
• JDBC – events are read and stored in a Derby database
• Slow performance
• KAFKA – events are stored in a Kafka topic
• Event batch size – the maximum number of events that a source or sink can batch in a single transaction
• The fatter the better – but not for the FILE channel
• A stable number ensures data consistency
30. Apache Flume
Design considerations - II
• Channel capacity and transaction capacity
• For a MEMORY channel, channel capacity is limited by RAM size
• For a FILE channel, channel capacity is limited by disk size
• Transaction capacity should not exceed the batch size configured for the sinks
• Channel selector (see the sketch below)
• An event can either be replicated or multiplexed
• Replicating is the default; multiplexing delivers conditionally, based on event headers
• Handling high-throughput systems
• Use a tiered architecture to handle the event flow
• Aggregate-and-push approach
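As an illustration of replicating vs multiplexing selection, here is a small sketch. Routing on an event header mirrors how a multiplexing selector picks a destination channel; the header names and channels here are hypothetical.

def replicate(event: dict, channels: list) -> None:
    """Replicating selector: every channel receives a copy of the event."""
    for ch in channels:
        ch.append(event)

def multiplex(event: dict, routes: dict, default: list) -> None:
    """Multiplexing selector: route by a header value, falling back to a default."""
    routes.get(event.get("severity"), default).append(event)

app_channel, audit_channel, default_channel = [], [], []

# Replicate one event everywhere; multiplex others on a hypothetical header.
replicate({"severity": "INFO", "body": "service started"}, [app_channel, audit_channel])
for ev in ({"severity": "ERROR", "body": "disk full"}, {"severity": "DEBUG", "body": "tick"}):
    multiplex(ev, routes={"ERROR": audit_channel}, default=default_channel)

print(len(app_channel), len(audit_channel), len(default_channel))  # 1 2 1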