This document discusses big data concepts and applications. It begins by defining big data characteristics including volume, velocity, and variety. It then outlines common big data applications in business intelligence and transactions. Different big data architectures like MapReduce, massively parallel processing databases, and in-memory databases are described along with their strengths and limitations. The document concludes with a demonstration of exploring millions of US patent pages in real-time using various big data technologies.
This document discusses the two main approaches to data warehouse architecture: Bill Inmon's approach and Ralph Kimball's approach. Bill Inmon advocates for a single, large integrated data warehouse schema with a top-down design. This takes longer and is more expensive but provides a more complex, stable solution. Ralph Kimball prefers multiple smaller subject-oriented data marts with dimensional modeling. This is quicker to deliver and implement but requires later integration into a full data warehouse. The document also stresses the importance of understanding user needs and involving users throughout the process.
The document discusses the goals and requirements for building a data warehouse for the SF Goodwill Retail organization. The data warehouse would provide a single place for sales and inventory reports, allow automated reporting available from any location, and pull data from POS systems for consolidated performance reporting and comparisons to goals. It would also standardize the design and development process using common tools like SQL, HTML and PHP. A variety of standardized reports would be available through a web interface, including high-level summaries, drill-down details, filtering and exporting capabilities.
The document provides an overview of data warehousing concepts. It defines a data warehouse as a subject-oriented, integrated, time-variant, and non-volatile collection of data. It discusses the differences between OLTP and OLAP systems. It also covers data warehouse architectures, components, and processes. Additionally, it explains key concepts like facts and dimensions, star schemas, normalization forms, and metadata.
Data Warehouse Design on Cloud, A Big Data Approach, Part One - Panchaleswar Nayak
This document discusses data warehouse design on the cloud using a big data approach. It covers topics such as business intelligence, data warehousing, data marts, data mining, ETL architecture, data warehouse design methodologies, Bill Inmon's top-down approach, Ralph Kimball's bottom-up approach, and addressing the new challenges of volume, velocity and variety of big data with Hadoop. The document proposes an architecture for next generation data warehousing using Hadoop to handle these new big data challenges.
The document discusses operational analytics and its performance on Informix, including what operational analytics is, how it can be implemented on Informix, and performance analysis of Informix on Intel platforms. It provides an overview of operational analytics and its challenges, how it can leverage Informix for the complete lifecycle, and benchmarks showing Informix's scaling on Intel's Xeon platforms for operational analytics workloads.
This document discusses an agile approach to developing a data warehouse. It advocates using an Agile Enterprise Data Model to provide vision and guidance. The "Spock Approach" is described, which uses an operational data store, dimensional data warehouse, and iterative development of data marts. Data visualization techniques like data hexes are recommended to improve planning and visibility. Leadership, version control, adaptability, refinement, and refactoring are identified as important ongoing processes for an agile data warehouse project.
This document discusses data warehousing and OLAP (online analytical processing) technology. It defines a data warehouse as a subject-oriented, integrated, time-variant, and nonvolatile collection of data to support management decision making. It describes how data warehouses use a multi-dimensional data model with facts and dimensions to organize historical data from multiple sources for analysis. Common data warehouse architectures like star schemas and snowflake schemas are also summarized.
This document provides an introduction to data warehousing. It defines key concepts like data, databases, information and metadata. It describes problems with heterogeneous data sources and fragmented data management in large enterprises. The solution is a data warehouse, which provides a unified view of data from various sources. A data warehouse is defined as a subject-oriented, integrated collection of historical data used for analysis and decision making. It differs from operational databases in aspects like data volume, volatility, and usage. The document outlines the extract-transform-load process and common architecture of data warehousing.
A data warehouse is a central repository for storing historical and integrated data from multiple sources to be used for analysis and reporting. It contains a single version of the truth and is optimized for read access. In contrast, operational databases are optimized for transaction processing and contain current detailed data. A key aspect of data warehousing is using a dimensional model with fact and dimension tables. This allows for analyzing relationships between measures and dimensions in a multi-dimensional structure known as a data cube.
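To make the fact/dimension idea concrete, here is a tiny illustrative sketch in Python with pandas; the table and column names (fact_sales, dim_product, dim_date, amount) are hypothetical and not taken from the summarized document:

import pandas as pd

# Two small dimension tables and one fact table keyed to them (a star schema).
dim_product = pd.DataFrame({"product_id": [1, 2], "category": ["Books", "Music"]})
dim_date = pd.DataFrame({"date_id": [10, 11], "quarter": ["Q1", "Q2"]})
fact_sales = pd.DataFrame({"product_id": [1, 1, 2, 2],
                           "date_id": [10, 11, 10, 11],
                           "amount": [100, 150, 80, 120]})

# Joining the fact table to its dimensions and pivoting gives one face of the
# multi-dimensional "data cube": the amount measure by category and quarter.
cube_slice = (fact_sales
              .merge(dim_product, on="product_id")
              .merge(dim_date, on="date_id")
              .pivot_table(values="amount", index="category",
                           columns="quarter", aggfunc="sum"))
print(cube_slice)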
This document provides an overview of data warehousing. It defines a data warehouse as a subject-oriented, integrated collection of data used to support management decision making. The benefits of data warehousing include high returns on investment and increased productivity. A data warehouse differs from an OLTP system in its design for analytics rather than transactions. The typical architecture includes data sources, an operational data store, warehouse manager, query manager and end user tools. Key components are extracting, cleaning, transforming and loading data, and managing metadata. Data flows include inflows from sources and upflows of summarized data to users.
The document presents information on data warehousing. It defines a data warehouse as a repository for integrating enterprise data for analysis and decision making. It describes the key components, including operational data sources, an operational data store, and end-user access tools. It also outlines the processes of extracting, cleaning, transforming, loading and accessing the data, as well as common management tools. Data marts are discussed as focused subsets of a data warehouse tailored for a specific department.
Webinar: Achieving Customer Centricity and High Margins in Financial Services... - MongoDB
It is imperative that Financial Services firms align the organization around providing maximum value to customers across all channels and products with the agility to capitalize on new opportunities. They must do this at the same time as cutting costs, improving operational efficiency, and complying with current and future regulations. This effort is commonly referred to as Industrialization, or streamlining people, process, and technology for maximum customer value, service, and efficiency.
MongoDB can help you in this initiative by allowing you to centralize data management no matter how it is structured across channels and products and make it easy to aggregate data from multiple systems, while lowering TCO and delivering applications faster. MetLife publicly announced that they used MongoDB to enable a single view of the customer in 3 months across 70+ existing systems. We will explore case studies demonstrating these capabilities to help you industrialize your firm.
Key takeaways:
Unique capabilities, brought to you by MongoDB
Concrete use cases that help industrialization
Implementation case studies, to pave the way
This document provides an overview of data warehousing. It defines a data warehouse as a central database that includes information from several different sources and keeps both current and historical data to support management decision making. The document describes key characteristics of a data warehouse including being subject-oriented, integrated, time-variant, and non-volatile. It also discusses common data warehouse architectures and applications.
MariaDB AX: Analytical Solution with ColumnStore - MariaDB plc
MariaDB ColumnStore is a high performance columnar storage engine that provides fast and efficient analytics on large datasets in distributed environments. It stores data column-by-column for high compression and read performance. Queries are processed in parallel across nodes for scalability. MariaDB ColumnStore is used for real-time analytics use cases in industries like healthcare, life sciences, and telecommunications to gain insights from large datasets for applications like customer behavior analysis, genome research, and call data monitoring.
MariaDB AX: Analytics with MariaDB ColumnStore - MariaDB plc
MariaDB ColumnStore is a high performance columnar storage engine that provides fast and efficient analytics on large datasets in distributed environments. It stores data column-by-column for high compression and read performance. Queries are processed in parallel across nodes for scalability. MariaDB ColumnStore is used for real-time analytics use cases in industries like healthcare, life sciences, and telecommunications to gain insights from large datasets.
Hadoop and SQL: Delivery Analytics Across the Organization - Seeling Cheung
This document summarizes a presentation given by Nicholas Berg of Seagate and Adriana Zubiri of IBM on delivering analytics across organizations using Hadoop and SQL. Some key points discussed include Seagate's plans to use Hadoop to enable deeper analysis of factory and field data, the evolving Hadoop landscape and rise of SQL, and a performance comparison showing IBM's Big SQL outperforming Spark SQL, especially at scale. The document provides an overview of Seagate and IBM's strategies and experiences with Hadoop.
Concept to production: Nationwide Insurance BigInsights Journey with Telematics - Seeling Cheung
This document summarizes Nationwide Insurance's use of IBM BigInsights to process telematics data from their SmartRide program. It discusses the architecture used, which included 6 management nodes and 16 data nodes of IBM BigInsights. It also describes the various phases of data processing, including acquiring raw trip files from HDFS, standardizing the data, scrubbing and calculating events, and summarizing the data for loading into HBase. Key benefits included improving processing performance and enabling customers to access insights about their driving through a web portal.
1) A data warehouse is a collection of data from multiple sources used to enable informed decision making. It contains data, metadata, dimensions, facts and aggregates.
2) The typical processes in a data warehouse are extract and load, data cleaning and transformation, user queries, and data archiving.
3) The key components that manage these processes are the load manager, warehouse manager and query manager. The load manager extracts, loads and does simple transformations on the data. The warehouse manager performs more complex transformations, integrity checks and generates summaries. The query manager directs user queries to the appropriate data.
SAP HANA utilizes cutting-edge in-memory computing technology to provide the enterprise with real-time data and gain a competitive edge.
I created this presentation in 2013 for an SAP HANA workshop that I conducted. Before jumping into HANA, the audience needs a smooth transition: an understanding of why the workshop is necessary in the first place.
Five tuning tips for data warehouses are presented:
1. Partition tables for improved performance, manageability, and availability.
2. Use data segment compression to reduce storage requirements while improving performance.
3. Make optimal use of PGA memory for queries.
4. Be aware of how temporal data can affect query optimization.
5. Monitor query execution to identify optimization opportunities.
Data Warehouses & Deployment - Ankita Dubey
This document contains notes about data warehouses and the life cycle of a data warehouse deployment project. It can be useful for students or working professionals who want to gain basic knowledge about data warehouses.
IBM's Big Data platform provides tools for managing and analyzing large volumes of data from various sources. It allows users to cost effectively store and process structured, unstructured, and streaming data. The platform includes products like Hadoop for storage, MapReduce for processing large datasets, and InfoSphere Streams for analyzing real-time streaming data. Business users can start with critical needs and expand their use of big data over time by leveraging different products within the IBM Big Data platform.
This document discusses data warehousing concepts and technologies. It defines a data warehouse as a subject-oriented, integrated, non-volatile, and time-variant collection of data used to support management decision making. It describes the data warehouse architecture including extract-transform-load processes, OLAP servers, and metadata repositories. Finally, it outlines common data warehouse applications like reporting, querying, and data mining.
The document discusses data warehousing, including its history, types, security, applications, components, architecture, benefits and problems. A data warehouse is defined as a subject-oriented, integrated, time-variant collection of data to support management decision making. In the 1990s, organizations needed timely data but traditional systems were too slow. Data warehouses now provide competitive advantages through improved decision making and productivity. They integrate data from multiple sources to support applications like customer analysis, stock control and fraud detection.
This document discusses optimizing the analytics process for a Brazilian e-commerce company called Olist. It begins with an overview of the client scenario and scattered data. The goals are to create a normalized database, optimize the ETL process, and automate analytics insights. It describes plans to normalize the data across multiple tables, extract data from CSV files, transform and clean the data, and load it into a PostgreSQL database. Analytical procedures and dashboard benefits are discussed for various business roles. Instructions are provided for building metrics, reviewing performance, and improving the process.
The Next-Generation Enterprise-Class Architecture - Massimo Brignoli, Data Driven Innovation
The birth of data lakes - Companies today are flooded with data, and the classic data warehouse struggles to churn through it because of its sheer quantity and variety. Many have started looking at architectures called data lakes, with Hadoop as the reference technology. But is this solution right for everything? Come learn how to operationalize data lakes to build modern data management architectures.
Big data primarily refers to data sets that are too large or complex to be dealt with by traditional data-processing application software. Data with many entries (rows) offer greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate.[2] Though used sometimes loosely partly due to a lack of formal definition, the best interpretation is that it is a large body of information that cannot be comprehended when used in small amounts only.
AWS re:Invent 2016 | DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr... - Amazon Web Services
In this session, you will learn the key differences between a relational database management system (RDBMS) and non-relational (NoSQL) databases like Amazon DynamoDB. You will learn about suitable and unsuitable use cases for NoSQL databases, and strategies for migrating from an RDBMS to DynamoDB through a 5-phase, iterative approach. See how Sony migrated an on-premises MySQL database to the cloud with Amazon DynamoDB, and see the results of this migration.
Dynamics CRM high volume systems - lessons from the field - Stéphane Dorrekens
Three field stories from companies describe their experiences with high volume CRM implementations: a financial institution with 8,000 users and 350GB of data across two implementations; a financial institution with 2,000 users, 2,500GB of data across two implementations; and a financial institution with 1,000 users and over 450GB of data across six implementations, with 50GB added per month for the largest one. The document discusses lessons learned from these implementations regarding infrastructure design, functional design, and performance testing to support high volume systems.
Can data virtualization uphold performance with complex queries? - Denodo
Watch full webinar here: https://bit.ly/2JzypTx
There are myths about data virtualization that are based on misconceptions and even falsehoods. These myths can confuse and worry people who - quite rightly - look at data virtualization as a critical technology for a modern, agile data architecture.
We've decided that we need to set the record straight, so we put together this webinar series. It's time to bust a few myths!
In the first webinar of the series, we’ll be busting the 'performance' myth. “What about performance?” is usually the first question that we get when talking to people about data virtualization. After all, the data virtualization layer sits between you and your data, so how does this affect the performance of your queries? Sometimes the myth is perpetuated by people with alternative solutions…the ‘Put all your data in our Cloud and everything will be fine. Data virtualization? Nah, you don’t need that! It can't handle big queries anyway,’ type of thing.
Join us for this webinar to look at the basis of the 'performance' myth and examine whether there is any underlying truth to it.
The majority of cloud-based DWHs provide a wide range of tools for migrating from an in-house DWH. However, I believe that cloud migration success rests not only on reducing infrastructure maintenance costs, but also on the additional performance gained from a tailored data model.
I am going to prove that copying star or snowflake schemas as-is will not lead to the maximum performance boost in DWHs such as Amazon Redshift and Google BigQuery. Moreover, this approach may cause additional cloud expenses.
We will discuss why data models should be different for each particular database, and how to get maximum performance out of each database's peculiarities.
Most performance tuning techniques for cloud-based DWHs amount to adding extra nodes to the cluster, but in some cases this leads to performance degradation as well as an extra cost burden. Sometimes a tailored model can extract maximum speed from the current hardware configuration, or even from less expensive servers.
I will show examples from production projects that achieved extra performance on lower-spec hardware, and edge cases such as a huge, wide fact table with fully denormalized dimensions instead of a classical star schema.
1. Hadoop is a software platform that allows for the distributed storage and processing of extremely large datasets across clusters of commodity hardware.
2. It addresses problems like parallel processing, fault tolerance, and scalability to reliably handle data at the petabyte scale.
3. Using Hadoop's core components - the Hadoop Distributed File System (HDFS) for storage and MapReduce for processing - it can efficiently distribute data and computation across large clusters to enable analysis of big data.
When to Use MongoDB...and When You Should Not... - MongoDB
MongoDB is well-suited for applications that require:
- A flexible data model to handle diverse and changing data sets
- Strong performance on mixed workloads involving reads, writes, and updates
- Horizontal scalability to grow with increasing user needs and data volume
Some common use cases that leverage MongoDB's strengths include mobile apps, real-time analytics, content management, and IoT applications involving sensor data. However, MongoDB is less suited for tasks requiring full collection scans under load, high write availability, or joins across collections.
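As a minimal sketch of the flexible data model point above (the connection string, database, and field names are placeholders, not taken from the summarized talk), the standard pymongo driver lets documents in one collection carry different fields:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
products = client["shop"]["products"]

# Documents in the same collection need not share the same schema.
products.insert_many([
    {"sku": "A1", "name": "Kettle", "price": 25.0},
    {"sku": "B7", "name": "Phone", "price": 299.0,
     "specs": {"storage_gb": 64, "colors": ["black", "blue"]}},
])

# Query on a nested field that only some documents have.
print(products.find_one({"specs.storage_gb": {"$gte": 64}}))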
Prof. Bellur discusses the concepts of big data and fast data. Big data is characterized by volume, variety, and velocity, with large amounts of data coming from many sources at a high speed that is difficult to process using traditional tools. Fast data must be processed in real-time from continuous streams as it arrives, with no ability to revisit data. This presents challenges like limited memory, noise, and requiring rapid responses. Standards are emerging to help with adoption of solutions for processing both big and fast data across various domains.
1. Big data refers to large and complex datasets that are difficult to process using traditional database and software techniques.
2. Hadoop is an open-source software platform that allows distributed processing of large datasets across clusters of computers. It solves the problems of big data by dividing it across nodes and processing it in parallel using MapReduce.
3. Hadoop provides reliable and scalable storage of big data using HDFS and efficient parallel processing of that data using MapReduce, allowing organizations to gain insights from large and diverse datasets.
This document provides an overview of big data concepts and technologies. It discusses the growth of data, characteristics of big data including volume, variety and velocity. Popular big data technologies like Hadoop, MapReduce, HDFS, Pig and Hive are explained. NoSQL databases like Cassandra, HBase and MongoDB are introduced. The document also covers massively parallel processing databases and column-oriented databases like Vertica. Overall, the document aims to give the reader a high-level understanding of the big data landscape and popular associated technologies.
Development of concurrent services using In-Memory Data Grids - jlorenzocima
This presentation was created as part of OTN Tour 2014. It covers the basics of an IMDG (In-Memory Data Grid) solution, explains how it works and how it can be used within an architecture, and shows some use cases. Enjoy.
The document discusses DeepDB, a storage engine plugin for MySQL that aims to address MySQL's performance and scaling limitations for large datasets and heavy indexing. It does this through techniques like a Cache Ahead Summary Index Tree, Segmented Column Store, Streaming I/O, Extreme Concurrency, and Intelligent Caching. The document provides examples showing DeepDB significantly outperforming MySQL's InnoDB storage engine for tasks like data loading, transactions, queries, backups and more. It positions DeepDB as a drop-in replacement for InnoDB that can scale MySQL to support billions of rows and queries 2x faster while reducing data footprint by 50%.
The document discusses data warehousing concepts including:
1) A data warehouse is a subject-oriented, integrated, and non-volatile collection of data used for decision making. It stores historical and current data from multiple sources.
2) The architecture of a data warehouse is typically three-tiered, with an operational data tier, data warehouse/data mart tier for storage, and client access tier. OLAP servers allow analysis of stored data.
3) ROLAP and MOLAP refer to relational and multidimensional approaches for OLAP. ROLAP dynamically generates data cubes from relational databases, while MOLAP pre-calculates and stores aggregated data in multidimensional structures.
This document summarizes a summer training seminar on BigData Hadoop that was attended. The training was provided by LinuxWorld Informatics Pvt Ltd, which offers open source and commercial training programs. The attendee learned about Hadoop, MapReduce, single and multi-node clusters, Docker, and Ansible. Big data challenges related to volume, variety, velocity, and veracity of data were also covered. Hadoop and its core components HDFS and MapReduce were explained as solutions for storing and processing large datasets in a distributed manner across commodity hardware. Docker containers were introduced as a lightweight alternative to virtual machines.
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture - DATAVERSITY
Whether to take data ingestion cycles off the ETL tool and the data warehouse or to facilitate competitive Data Science and building algorithms in the organization, the data lake – a place for unmodeled and vast data – will be provisioned widely in 2020.
Though it doesn’t have to be complicated, the data lake has a few key design points that are critical, and it does need to follow some principles for success. Avoid building the data swamp, but not the data lake! The tool ecosystem is building up around the data lake and soon many will have a robust lake and data warehouse. We will discuss policy to keep them straight, send data to its best platform, and keep users’ confidence up in their data platforms.
Data lakes will be built in cloud object storage. We’ll discuss the options there as well.
Get this data point for your data lake journey.
This document discusses hardware provisioning best practices for MongoDB. It covers key concepts like bottlenecks, working sets, and replication vs sharding. It also presents two case studies where these concepts were applied: 1) For a Spanish bank storing logs, the working set was 4TB so they provisioned servers with at least that much RAM. 2) For an online retailer storing products, testing found the working set was 270GB, so they recommended a replica set with 384GB RAM per server to avoid complexity of sharding. The key lessons are to understand requirements, test with a proof of concept, measure resource usage, and expect that applications may become bottlenecks over time.
Data Engineer's Lunch #85: Designing a Modern Data Stack - Anant Corporation
What are the design considerations that go into architecting a modern data warehouse? This presentation will cover some of the requirements analysis, design decisions, and execution challenges of building a modern data lake/data warehouse.
[db tech showcase Tokyo 2017] C37: MariaDB ColumnStore analytics engine : use... - Insight Technology, Inc.
MariaDB ColumnStore is the analytics engine for MariaDB. This talk will introduce the product, use cases, and also introduce the new features coming in the next major release 1.1.
Learn about recent advances in MongoDB in the area of In-Memory Computing (Apache Spark Integration, In-memory Storage Engine), and how these advances can enable you to build a new breed of applications, and enhance your Enterprise Data Architecture.
Similar to Big Data presentation at GITPRO 2013 (20)
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf - Malak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many features that provide convenience and capability sacrifice security. This best practices guide outlines steps users can take to better protect personal devices and information.
Securing your Kubernetes cluster_ a step-by-step guide to success! - KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Pushing the limits of ePRTC: 100ns holdover for 100 days - Adtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
How to Get CNIC Information System with Paksim Ga.pptx - danishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Building RAG with self-deployed Milvus vector database and Snowpark Container... - Zilliz
This talk will give hands-on advice on building RAG applications with an open-source Milvus database deployed as a docker container. We will also introduce the integration of Milvus with Snowpark Container Services.
Removing Uninteresting Bytes in Software Fuzzing - Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf - Paige Cruz
Monitoring and observability aren't traditionally found in software curricula, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company's observability stack.
While the dev and ops silo continues to crumble, many organizations still treat monitoring and observability as the purview of ops, infra, and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on:
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ... - James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Full-RAG: A modern architecture for hyper-personalization - Zilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
Communications Mining Series - Zero to Hero - Session 1 - DianaGray10
This session provides an introduction to UiPath Communications Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communications Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack - shyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Essentials of Automations: The Art of Triggers and Actions in FME - Safe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Essentials of Automations: The Art of Triggers and Actions in FME
Big Data presentation at GITPRO 2013
1. Big Data – GITPRO 2013
By - Sameer Wadkar
Co-Founder & Big Data Architect / Data Scientist at Axiomine
2. Agenda
• What is Big Data
• Big Data Characteristics
• Big Data and Business Intelligence Applications
• Big Data and Transactional Applications
• Demo
3. What is Big Data?
Volume, Velocity, Variety – the three Vs of Big Data.
• 12 Terabytes of Tweets are monitored each day to improve product sentiment analysis (source: IBM)
• Amazon and PayPal use Big Data for real-time fraud detection (source: McKinsey)
• In 15 of the US economy's 17 sectors, companies with upward of 1,000 employees store, on average, more information than the Library of Congress (source: McKinsey)
Most Big Data applications are based around the Volume dimension
4. Visualizing Big Data
• 1 Petabyte is roughly 54,000 movies in digital format
• Reading 1 Terabyte of data sequentially from a single disk drive takes about 3 hours
• Typical sequential read speed from a hard disk is about 80 MB/sec
• Traversing 1 Terabyte of data randomly over 1 disk (a typical database access scenario) takes orders of magnitude longer
• Disk transfer rates are significantly higher than disk seek rates, so random access, not raw throughput, is the bottleneck
Single node processing capacity will drown in the face of Big Data
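As a rough check on the figures above, using the quoted 80 MB/sec sequential rate: 1 TB ≈ 1,000,000 MB, and 1,000,000 MB ÷ 80 MB/sec ≈ 12,500 seconds, which is roughly 3.5 hours.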
5. Big Data vs. Traditional
In Big Data architectures the application moves to the data. Why? Compare the two flows below.
Three Tier Architecture (Application Tier and Data Tier are separate):
1. User requests a report
2. App Tier requests data from the Data Tier
3. Data Tier sends the data to the App Tier
4. App Tier processes the data
5. App Tier sends the report
Big Data Architecture (Application & Data Tier live on the same nodes, coordinated by a Master Node):
1. User launches a batch job
2. Master distributes the application to the nodes
3. Master launches the app on the nodes; all nodes process the data held on their own node
4. User downloads the results
6. Why is Big Data hard?
Dividing the data and conquering it in place is the core Big Data strategy
• The goal is to divide the data across multiple nodes and conquer by processing the data in place on each node
• Real-world processing cannot always be divided into smaller sub-problems (divide and conquer is not always feasible)
• Data has dependencies
• Normalization vs. Denormalization
• There are processing dependencies: a later phase of the process may require results of an earlier phase
• Single pass vs. Multi-pass
7. Big Data Characteristics
Scale-out, Fault Tolerance & Graceful Recovery are essential features
• Big Data systems must scale out
• Adding more nodes should lead to greater parallelization
• Big Data systems must be resilient to partial failure
• If one part of the system fails, other parts should continue to function
• Big Data systems must be able to self-recover from partial failure
• If any part of the system fails, another part of the system will attempt to recover from the failure
• Data must be replicated on separate nodes
• Loss of any node loses neither data nor processing
• Recovery should be transparent to the end user
8. Big Data Applications
Big Data design is dictated by the nature of the applications
• Business Intelligence applications
• Read-only systems
• ETL systems
• Query massive data to generate reports or to perform large-scale transformations and import into a destination data source
• Transactional applications
• One part of the system updates data while another part reads the data
• Example systems: imagine running an online store at Amazon.com scale
9. BI - Sample Use-Case
A very simple query, but size makes all the difference
• SELECT YEAR, SUM(SALES_AMT) FROM SALES WHERE STATE = 'MD' GROUP BY YEAR ORDER BY YEAR
• Find the total sales revenue by year for Maryland and order the results by year
• What if the SALES table has billions of rows spanning 20 years?
Input: Sales Transactions Table → Big Data Reporting → Output:
Year | Sales Revenue
1980 | 11 Million
1981 | 13 Million
...  | ...
2010 | 10 Billion
10. BI Big Data Flavors
We discuss three flavors in increasing order of scale-out capability
Big Data Flavor | Products
In-Memory Databases | Oracle Exalytics, SAP HANA
Massively Parallel Processing (MPP) | Greenplum, Netezza
MapReduce | Hadoop
11. In Memory Databases
If STATE='VA' is the next query and the cache is only big enough to hold one state, the slow path below must run again.
Simplified version: data is partitioned randomly across all nodes. Data Nodes feed a Processing Node that holds an in-memory, terabyte-scale cache with a SQL interface, which the user queries.
Selection Phase
1. Each data node contains fast memory (SSD) and a mechanism to apply the WHERE clause
2. Only the necessary data ('MD' records) is passed over the expensive network I/O to the processing node
Processing Phase
1. The processing node computes SUM(sales_amt) by year
2. Orders the results
3. Places them in the in-memory cache
Fetch Phase
The user is served the results from the cache through the familiar SQL interface.
• The first execution of the query is slow.
• Subsequent executions are very fast (almost real-time) because the cache is hot.
• The cache has a SQL interface, so the user experiences "real time"!
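The caching behaviour described above can be summarized in a minimal Python sketch. This is illustrative only; the node layout, field names and cache are hypothetical and say nothing about how products such as Exalytics or HANA are actually implemented.

    from collections import defaultdict

    DATA_NODES = [
        [{"state": "MD", "year": 1980, "sales_amt": 5.0},
         {"state": "VA", "year": 1980, "sales_amt": 2.0}],
        [{"state": "MD", "year": 1981, "sales_amt": 7.0}],
    ]
    CACHE = {}  # in-memory cache keyed by the query parameter

    def sales_by_year(state):
        if state in CACHE:                  # hot cache: "real-time" response
            return CACHE[state]
        totals = defaultdict(float)
        for node in DATA_NODES:             # selection phase: each node applies the WHERE clause
            for row in node:
                if row["state"] == state:    # only matching rows cross the network
                    totals[row["year"]] += row["sales_amt"]
        result = sorted(totals.items())      # processing phase: aggregate and order by year
        CACHE[state] = result                # place the result in the cache
        return result

    print(sales_by_year("MD"))   # first execution: slow path
    print(sales_by_year("MD"))   # second execution: served from the hot cache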
12. In Memory Databases (cont.)
In-Memory DBs provide real-time querying on moderate sized data
Characteristics
• Specialized hardware
• Specialized I/O and flash memory for faster I/O
• Massive in-memory cache (multi-terabyte) with a SQL interface
Pros
• Familiar model (SQL interface)
• Can integrate with standard toolkits and BI solutions
• Unified software/hardware solution
Cons
• Vendor lock-in
• Expensive, in hardware as well as licensing cost
• Typically cannot scale beyond 1-2 TB of data
• Works best when the same data is read often (the cache stays hot)
13. MPP (Typical Architecture)
Data is partitioned horizontally across all slave nodes. Assume "Sale Year" is the distribution key. Secondary indexes on other keys can be added to each slave node. Slave nodes (e.g. one holding 1980 & 1990 data, another holding 1981 & 1991 data, ..., another holding 2000 & 2010 data) report to a Master Node.
Distributed Query Phase
1. Each slave node computes the query for the data contained on its own node
2. Each year's data is held entirely on one node
3. This phase produces partial query results which are complete for each year
Accumulation Phase
1. All slave results are aggregated and sorted on the master node
• Scale out: more nodes means fewer years of data per node.
• Redundancy & failover: each node has a backup node.
• The compatibility of the data distribution strategy with the access patterns determines performance.
• There is enormous network overhead if access patterns do not respect the distribution strategy.
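A minimal Python sketch of the two phases above, assuming (as the slide does) that "Sale Year" is the distribution key so each slave holds complete years; the node names and row layout are hypothetical.

    SLAVES = {
        "slave1": {1980: [("MD", 5.0), ("MD", 6.0)], 1990: [("MD", 9.0)]},
        "slave2": {1981: [("MD", 7.0), ("VA", 3.0)], 1991: [("MD", 4.0)]},
    }

    def slave_query(rows_by_year, state):
        # Distributed query phase: each slave aggregates only the years it owns,
        # so its per-year sums are already final.
        return {year: sum(amt for st, amt in rows if st == state)
                for year, rows in rows_by_year.items()}

    def master_query(state):
        # Accumulation phase: the master merges and sorts the partial results.
        merged = {}
        for rows_by_year in SLAVES.values():
            merged.update(slave_query(rows_by_year, state))
        return sorted(merged.items())

    print(master_query("MD"))   # [(1980, 11.0), (1981, 7.0), (1990, 9.0), (1991, 4.0)]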
14. MPP (cont.)
MPP supports familiar RDBMS paradigm for medium scalability
Characteristics
• Balances throughput with responsiveness
• Some implementations use specialized hardware (e.g. Netezza uses FPGAs)
• Familiar RDBMS (SQL) paradigm
• Can scale to tens of terabytes in most cases
Pros
• Familiar model (SQL interface)
• Can integrate with standard toolkits and BI solutions
Cons
• Vendor lock-in
• Cannot scale for ad-hoc queries
• Queries must respect the data distribution strategy for acceptable performance
15. MapReduce
Data is partitioned randomly and redundantly across all data nodes. Every data node contains sales data for every state and every year. The map nodes (data nodes) are coordinated by a Master Node and feed a Reduce Node.
Map Phase
1. Each data node reads all of its records sequentially
2. It filters out all non-'MD' state records
3. It computes SUM(sales_amt) for each year
Reduce Phase
1. The reduce node receives SUM(sales_amt) for state 'MD' by year from each map node
2. It adds up all map results by year and computes the final SUM(sales_amt) by year for 'MD' sales
3. It orders the results by year
• Data blocks (on the order of 128 MB) are stored and accessed contiguously
• Scales out efficiently and degrades gracefully
• If a task fails, the framework restarts it automatically (on another node if necessary): redundancy and graceful recovery
16. MapReduce (cont.)
MapReduce – how it works
Map Process 1 output:
Year | Sales
1990 | $1M
1982 | $2M
...  | ...
1999 | $20M
Map Process 20 output:
Year | Sales
1998 | $6M
1982 | $5M
...  | ...
2010 | $30M
(... outputs from the other map processes ...)
The Reduce node adds up all the map results and sorts by year to give the final result:
Year | Sales
1980 | $100M
1981 | $102M
...  | ...
2010 | $250M
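A minimal Python sketch of the map and reduce phases walked through above. In practice this would be a Hadoop job; the function and field names here are hypothetical and only illustrate the dataflow.

    from collections import defaultdict

    def map_phase(records, state="MD"):
        # Each map process reads its block sequentially, filters out non-MD rows,
        # and emits a partial SUM(sales_amt) per year.
        partial = defaultdict(float)
        for row in records:
            if row["state"] == state:
                partial[row["year"]] += row["sales_amt"]
        return partial

    def reduce_phase(partials):
        # The reduce node adds up the per-year partial sums and orders by year.
        totals = defaultdict(float)
        for partial in partials:
            for year, amt in partial.items():
                totals[year] += amt
        return sorted(totals.items())

    block1 = [{"state": "MD", "year": 1990, "sales_amt": 1.0},
              {"state": "VA", "year": 1990, "sales_amt": 4.0}]
    block2 = [{"state": "MD", "year": 1990, "sales_amt": 2.0},
              {"state": "MD", "year": 1982, "sales_amt": 5.0}]
    print(reduce_phase([map_phase(block1), map_phase(block2)]))
    # [(1982, 5.0), (1990, 3.0)]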
17. MapReduce (cont.)
MapReduce is general purpose but requires complex skills.
Characteristics
• Batch oriented: maximizes throughput, not responsiveness
Pros
• Simple programming model
• Scales out efficiently
• Failure handling and redundancy built in
• Adapts well to a wide variety of problems
Cons
• Requires custom programming
• Higher-level languages (SQL-like) exist, but programming skills are often still critical
• Requires a complex array of skills to manage and maintain a MapReduce system
18. Summary of BI Apps
Each option has tradeoffs. Choose based on requirements
Big Data Flavor | How much data can it typically handle?
In-Memory Databases | Order of 1 TB
Massively Parallel Databases | Order of 10 TB
MapReduce | Order of 100s of TB into the petabyte range
19. Transactional System - Use-Case
How many items in stock do users A and B see on their second access?
Scenario (a web-based online store backed by a database):
1. User A looks up item X
2. User B looks up item X
3. User C buys item X, which updates the inventory
4. User A looks up item X again
5. User B looks up item X again
20. Context – CAP Theorem
You can get any two but not all three features in any system
Characteristic | Description
Consistency | All nodes (and users) see the same data at the same time
Availability | A guarantee that every request receives a valid response; the site does not go down or appear down under heavy load
Partition Tolerance | The system continues to function regardless of the loss of one of its components
21. CA – Single RDBMS
A single RDBMS instance is both consistent and available
(Same scenario as before: a web-based online store backed by a single RDBMS; users A and B look up item X, user C buys item X and updates the inventory, then users A and B look up item X again.)
• When set up in "Read Committed" isolation, every user sees the same inventory count
• The system responds with the last committed inventory count even during updates
• Consistent
• Available
22. CP – Distributed RDBMS
A Distributed RDBMS is consistent and resilient to failure of nodes
(Same scenario, but the web-based online store is backed by an East Region RDBMS and a West Region RDBMS kept in sync via two-phase commit.)
• Under "Read Committed" mode all users see consistent counts
• If one DB fails, the other one serves all users (partition tolerance)
• During a two-phase commit the system is unavailable
• Consistent
• Partition Tolerant
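A minimal Python sketch of the two-phase commit idea referenced above, showing why the system cannot answer consistently while a commit is in flight. The class and method names are hypothetical, and this is far simpler than any production protocol.

    class Replica:
        def __init__(self, count):
            self.count = count         # last committed inventory count
            self.pending = None        # staged but uncommitted value
        def prepare(self, new_count):
            self.pending = new_count   # vote "yes" by staging the change
            return True
        def commit(self):
            self.count, self.pending = self.pending, None
        def abort(self):
            self.pending = None

    def two_phase_commit(replicas, new_count):
        # Phase 1: every replica must vote yes before anything is applied.
        if all(r.prepare(new_count) for r in replicas):
            for r in replicas:         # Phase 2: apply the change everywhere
                r.commit()
            return True
        for r in replicas:             # any "no" vote aborts the change everywhere
            r.abort()
        return False

    east, west = Replica(10), Replica(10)
    two_phase_commit([east, west], 9)   # user C buys item X
    print(east.count, west.count)       # both regions now report 9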
23. AP – Distributed RDBMS
Eventual Consistency is the key to Big Data Transactional Systems
(Same scenario, with the web-based online store backed by replicas that are not kept in strict sync.)
• Amazon Dynamo and Apache Cassandra work on this principle
• If one DB fails, the other one serves all users (partition tolerance)
• Users will always be able to browse all products, but occasionally some users will see a stale inventory count (eventual consistency)
• Available
• Partition Tolerant
• Eventually Consistent
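A minimal Python sketch of the eventual-consistency behaviour described above. The store, replica layout and replicate() step are hypothetical simplifications, not how Dynamo or Cassandra actually replicate.

    class EventualStore:
        def __init__(self):
            self.replicas = [{"item_x": 10}, {"item_x": 10}]
            self.pending = []                  # replication backlog
        def write(self, replica, key, value):
            self.replicas[replica][key] = value
            self.pending.append((key, value))  # replicate later, not now
        def read(self, replica, key):
            return self.replicas[replica][key]
        def replicate(self):
            for key, value in self.pending:    # background catch-up
                for r in self.replicas:
                    r[key] = value
            self.pending.clear()

    store = EventualStore()
    store.write(0, "item_x", 9)        # user C buys item X via replica 0
    print(store.read(1, "item_x"))     # user B reads replica 1: stale count of 10
    store.replicate()
    print(store.read(1, "item_x"))     # eventually consistent: 9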
24. Hybrid Solution
Big Data Techniques – not an either/or choice!
• Large structured and large unstructured source DBs (100 TB to 1 PB) feed a MapReduce-based ETL layer.
• The ETL layer loads an MPP DB (5-10 TB), which in turn feeds an In-Memory DB (around 1 TB) and a NoSQL DB (a few hundred GB).
• Business users can use familiar SQL-based tools in real time through a familiar BI solution; the In-Memory DB allows that.
• Programmers and system admins with no real-time requirements can use all three techniques through programs and scripts; NoSQL DBs allow technical users to gain real-time benefits in ways which suit their complex needs.
25. Exploring Millions of US Patent Pages at the Speed of Thought
www.axiomine.com/patents/
Demo – US Patent Explorer
26. Patent Explorer Goals
Seamlessly navigate Structured and Unstructured data in real-time
• Navigate 3 million US patents' data (text and metadata) from 1963 to 1999 at the speed of thought
• Data sources
• Patent metadata – National Bureau of Economic Research
• Patent text – bulk download from the Google site
• Each week newly granted patents are published to the Google site as an archive
• Size of uncompressed data
• Structured metadata – approximately 2 GB
• Patent text data – approximately 300 GB
27. Patent Metadata
Cannot answer – What is the title of Patent No 8086905?
Source – National Bureau of Economic Research
http://data.nber.org/patents/
Schema (metadata only): a Patent Master table with one-to-many links to Pairwise Citations and Inventors, plus other master data tables: Company Master, Country Master and Classification Master.
The source contains only metadata; no text data such as the patent title is available. For example, Pairwise Citations contains millions of patent-id pairs.
28. Patent Text
Need to merge both metadata & text
Source – Google
http://www.google.com/googlebooks/uspto.html
(Sample patent text file shown on the slide.)
29. High Level Architecture
Need to merge both metadata & text
Raw Data Tier: patent metadata and patent text.
ETL & Text Analytics Tier: Hadoop merges the two sources and produces text-enhanced citation data.
Search & Visualization Tier: Apache Solr powers navigation, search & text analytics; MongoDB serves the patent details.
Users navigate, search and visualize through Solr and drill down to patent details in MongoDB.
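A minimal Python sketch of the load step implied by the architecture above: index the searchable fields into Solr and store the full document in MongoDB for the drill-down view. The Solr core name, MongoDB database/collection names and field names are hypothetical, and local Solr and MongoDB instances plus the pysolr and pymongo packages are assumed.

    import pysolr
    from pymongo import MongoClient

    solr = pysolr.Solr("http://localhost:8983/solr/patents")            # hypothetical core name
    details = MongoClient("mongodb://localhost:27017")["patents"]["details"]

    patent = {
        "id": "8086905",
        "title": "Example patent title",
        "grant_year": 1999,
        "text": "Full text produced by the Hadoop ETL and text-analytics step...",
    }

    # Solr gets the fields needed for navigation, search and text analytics.
    solr.add([{"id": patent["id"], "title": patent["title"],
               "grant_year": patent["grant_year"], "text": patent["text"]}])
    solr.commit()

    # MongoDB gets the complete document for the drill-down patent-details view.
    details.replace_one({"_id": patent["id"]}, patent, upsert=True)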
30. Big Data Flavors – Summary
Choose a Big Data tool and product based on requirements
Flavor | Characteristics
MapReduce | Massive 100 TB to 1 PB scale ETL; complex analytics on massive data; large-scale unstructured data analysis
Massively Parallel Processing (MPP) | Batch-oriented aggregations; analytics on moderately large structured data with predictable access patterns
In-Memory DB | Similar to MPP but where real-time access patterns are required; rich and interactive Business Intelligence apps
NoSQL databases | Similar to In-Memory DB but with simpler (non-SQL) access patterns; provide fast access to detail data where other techniques serve the summary data
GPGPU | Real-time Value at Risk (financial risk management); compute-intensive analytics, e.g. simulating a hospital waiting room over 1 year