Hadoop has rapidly emerged as a viable platform for Big Data analytics. Many experts believe Hadoop will subsume many of the data warehousing tasks presently done by traditional relational systems. In this presentation, you will learn about the similarities and differences between Hadoop and parallel data warehouses, along with typical best practices. Edmunds will discuss how they increased delivery speed, reduced risk, and achieved faster reporting by combining ELT and ETL. For example, Edmunds ingests raw data into Hadoop and HBase, then reprocesses the raw data in Netezza. You will also learn how Edmunds uses prototyping to work on nearly raw data with the company’s Analytics Team using Netezza.
Apache Drill is an open source engine for interactive analysis of large-scale datasets. It provides low-latency queries using standard SQL and supports nested and hierarchical data. Drill is inspired by Google's Dremel system and provides an alternative to traditional batch processing systems like MapReduce for interactive analysis of big data.
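To make that query model concrete, here is a minimal sketch of querying raw JSON through Drill's JDBC driver. Everything here is illustrative rather than drawn from the talk: it assumes an embedded Drillbit ("zk=local"), the drill-jdbc-all jar on the classpath, and a sample file at /tmp/users.json with a name field and a nested address.city field.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillJsonQuery {
    public static void main(String[] args) throws Exception {
        // "zk=local" targets an embedded Drillbit; point zk= at a ZooKeeper
        // quorum instead to query a distributed cluster.
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
             Statement stmt = conn.createStatement();
             // Drill queries the raw JSON file in place; no table definition
             // or schema registration is required beforehand. Nested fields
             // are reached with dot notation on the table alias.
             ResultSet rs = stmt.executeQuery(
                     "SELECT t.name, t.address.city AS city "
                     + "FROM dfs.`/tmp/users.json` AS t")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + "\t" + rs.getString("city"));
            }
        }
    }
}
```

The key point is the absence of any upfront schema: Drill discovers the structure of the nested JSON at query time, which is what suits it to interactive, ad hoc exploration.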
The document summarizes Yahoo!'s use of Hadoop for grid computing. Some key points:
- Yahoo! operates multiple Hadoop grids with tens of thousands of nodes to support large data processing and storage needs.
- Hadoop provides an on-demand, shared resource pool for computation and storage across the company.
- Yahoo! uses Hadoop MapReduce for parallel processing of large datasets and the Hadoop Distributed File System for petabytes of data storage.
- Additional tools like Hadoop On Demand are used for job scheduling and resource management across the Hadoop clusters.
1) HBase satisfied Facebook's requirements for a real-time data store by providing excellent write performance, horizontal scalability, and features like atomic operations.
2) At Facebook, HBase is used for messaging and user activity tracking applications that involve massive write-throughput and petabytes of data.
3) HBase's integration with HDFS provides fault tolerance and scalability, while its column orientation enables complex queries on user activity data.
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli... - Cloudera, Inc.
Many people refer to Apache Hadoop as their system of choice for big data management, but few actually use just Apache Hadoop. Hadoop has become a proxy for a much larger system with HDFS storage at its core. The Apache Hadoop based "big data stack" has changed dramatically over the past 24 months and will change even more over the next 24 months. This talk covers trends in the evolution of the Hadoop stack, changes in architecture, and changes in the kinds of use cases that are supported. It will also discuss the role of interoperability and cohesion in the Apache Hadoop stack and the role of Apache Bigtop in this regard.
Common and unique use cases for Apache Hadoop - Brock Noland
The document provides an overview of Apache Hadoop and common use cases. It describes how Hadoop is well-suited for log processing due to its ability to handle large amounts of data in parallel across commodity hardware. Specifically, it allows processing of log files to be distributed per unit of data, avoiding bottlenecks that can occur when trying to process a single large file sequentially.
Cloud computing, big data, and mobile technologies are driving major changes in the IT world. Cloud computing provides scalable computing resources over the internet. Big data involves extremely large data sets that are analyzed to reveal business insights. Hadoop is an open-source software framework that allows distributed processing of big data across commodity hardware. It includes tools like HDFS for storage and MapReduce for distributed computing. The Hadoop ecosystem also includes additional tools for tasks like data integration, analytics, workflow management, and more. These emerging technologies are changing how businesses use and analyze data.
HBase is a distributed, scalable, big data store that provides fast lookup capabilities like Google BigTable. It uses a table-like data structure with rows indexed by a key and stores data in columns grouped by families. HBase is designed to operate on top of Hadoop HDFS for scalability and high availability. It allows for fast lookups, full table scans, and range scans across large datasets distributed across clusters of commodity servers.
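As a rough illustration of that data model, the sketch below uses the HBase Java client (the 2.x API) to write a cell, fetch it back by row key, and run a range scan. The table name, column family and row keys are hypothetical, and all values are stored as uninterpreted byte arrays.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRowExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("activity"))) {
            // Write one cell: row key "user42", column family "d", qualifier "clicks".
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("clicks"), Bytes.toBytes("17"));
            table.put(put);

            // Point lookup by row key: the access pattern HBase is optimized for.
            Result result = table.get(new Get(Bytes.toBytes("user42")));
            byte[] clicks = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("clicks"));
            System.out.println("clicks = " + Bytes.toString(clicks));

            // Range scan over a contiguous, sorted slice of row keys.
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("user4"))
                    .withStopRow(Bytes.toBytes("user5"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}
```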
Petabyte scale on commodity infrastructure - elliando dias
This document discusses Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of commodity servers. It describes how Hadoop addresses the need to reliably process huge datasets using a distributed file system and MapReduce processing on commodity hardware. It also provides details on how Hadoop has been implemented and used at Yahoo to process petabytes of data and support thousands of jobs weekly on large clusters.
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli... - lucenerevolution
Presented by M.C. Srivas | MapR. See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012
This session addresses the biggest issue facing Big Data: Search, Discovery and Analytics need to be integrated. Creating and maintaining separate SOLR and Hadoop clusters is time-consuming, error-prone and difficult to keep in sync, yet most Hadoop installations do not integrate SOLR within the same cluster. Find out how to easily integrate these capabilities into a single cluster. The session will also touch on some technical aspects of Big Data Search, including how to protect against the silent index corruption that permeates large distributed clusters, overcome the shard distribution problem by leveraging Hadoop to ensure accurate distributed search results, and provide real-time indexing for distributed search, including support for streaming data capture. Srivas will also share relevant experiences from his days at Google, where he ran one of the major search infrastructure teams and GFS, BigTable and MapReduce were used extensively.
Offline processing with Hadoop allows for scalable, simplified batch processing of large datasets across distributed systems. It enables increased innovation by supporting complex analytics over large data sets without strict schemas. Hadoop adoption is moving beyond legacy roles to focus on data processing and value creation through scalable and customizable systems like Cascading.
Cloudera Sessions - Clinic 1 - Getting Started With Hadoop - Cloudera, Inc.
If you are interested in Hadoop and its capabilities, but you are not sure where to begin, this is the session for you. Learn the basics of Hadoop, see how to spin up a development cluster in the cloud or on-premise, and start exploring ETL processing with SQL and other familiar tools.
Hadoop World 2011: Mike Olson Keynote Presentation - Cloudera, Inc.
Now in its fifth year, Apache Hadoop has firmly established itself as the platform of choice for organizations that need to efficiently store, organize, analyze, and harvest valuable insight from the flood of data that they interact with. Since its inception as an early, promising technology that inspired curiosity, Hadoop has evolved into a widely embraced, proven solution used in production to solve a growing number of business problems that were previously impossible to address. In his opening keynote, Mike will reflect on the growth of the Hadoop platform, driven by the innovative work of a vibrant developer community, and on the rapid adoption of the platform among large enterprises. He will show how enterprises have transformed themselves into data-driven organizations, highlighting compelling use cases across vertical markets. He will also discuss Cloudera’s plans to stay at the forefront of Hadoop innovation and its role as the trusted solution provider for Hadoop in the enterprise. He will share Cloudera’s view of the road ahead for Hadoop and Big Data and discuss the vital roles of the key constituents across the Hadoop community, ecosystem and enterprises.
A Survey of Petabyte Scale Databases and Storage Systems Deployed at Facebook - BigDataCloud
At Facebook, we use various types of databases and storage systems to satisfy the needs of different applications. The solutions built around these data store systems have a common set of requirements: they have to be highly scalable, maintenance costs should be low, and they have to perform efficiently. We use a sharded MySQL+memcache solution to support real-time access to tens of petabytes of data, and we use TAO to provide consistency of this web-scale database across geographical distances. We use the Haystack datastore for storing the 3 billion new photos we host every week. We use Apache Hadoop to mine intelligence from 100 petabytes of click logs and combine it with the power of Apache HBase to store all Facebook Messages.
This talk describes the reasons why each of these databases is appropriate for its workload and the design decisions and tradeoffs that were made while implementing these solutions. We cover the consistency, availability and partition tolerance of each of these solutions, and discuss why some of these systems need ACID semantics while others do not. We also briefly touch upon our plans for big-data deployments across geographical locations and our requirements for a new breed of pure-memory and pure-SSD based transactional databases.
Facing enterprise specific challenges – utility programming in hadoop - fann wu
This document discusses managing large Hadoop clusters through various automation tools like SaltStack, Puppet, and Chef. It describes how to use SaltStack to remotely control and manage a Hadoop cluster, and how Puppet can be used to deploy Hadoop on hundreds of servers within an hour through Hadooppet. The document also covers Hadoop security concepts like Kerberos and folder permissions. It provides examples of monitoring tools like Ganglia, Nagios, and Splunk that can be used to track cluster metrics and debug issues. Common processes like datanode decommissioning and tools like the HBase Canary tool are also summarized. Lastly, it discusses testing Hadoop on AWS using EMR and techniques to reduce EMR costs.
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera - Mark Kerzner
The document discusses Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. It describes Hadoop's core components - the Hadoop Distributed File System (HDFS) for scalable data storage, and MapReduce for distributed processing of large datasets in parallel. Typical problems suited for Hadoop involve complex data from multiple sources that need to be consolidated, stored inexpensively at scale, and processed in parallel across the cluster.
This document provides an overview of an advanced Big Data hands-on course covering Hadoop, Sqoop, Pig, Hive and enterprise applications. It introduces key concepts like Hadoop and large data processing, demonstrates tools like Sqoop, Pig and Hive for data integration, querying and analysis on Hadoop. It also discusses challenges for enterprises adopting Hadoop technologies and bridging the skills gap.
Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unloc... - Cloudera, Inc.
This talk will cover which tools and techniques work well, and which don't, for data scientists working on Hadoop today; how to leverage the lessons learned by the experts to increase your productivity; and what to expect from the future of data science on Hadoop. We will draw on insights from the top data scientists working on big data systems at Cloudera, as well as experience running big data systems at Facebook, Google, and Yahoo.
Hadoop World 2011: Building Realtime Big Data Services at Facebook with Hadoo... - Cloudera, Inc.
Facebook has one of the largest Apache Hadoop data warehouses in the world, primarily queried through Apache Hive for offline data processing and analytics. However, the need for realtime analytics and end-user access has led to the development of several new systems built using Apache HBase. This talk will cover specific use cases and the work done at Facebook around building large scale, low latency and high throughput realtime services with Hadoop and HBase. This includes several significant contributions to existing projects as well as the release of new open source projects.
Hadoop clusters can be provisioned quickly and easily on virtual infrastructure using techniques like linked clones and thin provisioning. This allows Hadoop to leverage capabilities of virtualization like high availability, resource controls, and re-using spare resources. Shared storage like SAN is useful for VM images and metadata, while local disks provide scalable bandwidth for HDFS data. Virtualizing Hadoop simplifies operations and enables flexible, on-demand provisioning of Hadoop clusters.
Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris... - Cloudera, Inc.
Hadoop is rapidly emerging as a viable platform for big data analytics. Thanks to early adoption by organizations like Yahoo and Facebook, and an active open source community, we have seen significant innovation around this platform. With support for relational constructs and a SQL-like query interface, many experts believe that Hadoop will subsume some data warehousing tasks at some point in the future. Even though Hadoop and parallel databases have some architectural similarities, they are designed to solve different problems. In this session, you will be introduced to Hadoop architecture, its salient differences from Netezza, and typical use cases. You will learn about common co-existence deployment models that have been put into practice by Netezza's customers, who have leveraged benefits from both technologies. You will also come to understand Netezza's current support for Hadoop and its future strategy. If you have deployed Hadoop within your organization, or are in the early stages of learning and evaluating it, you will benefit from attending this session. It will give you an opportunity to interact with practitioners and industry experts who have successfully deployed Hadoop and Netezza within their organizations.
Presentation: Overview of Kognitio, Kognitio Cloud and the Kognitio Analytical Platform
Kognitio is driving the convergence of Big Data, in-memory analytics and cloud computing. Having delivered the first in-memory analytical platform in 1989, Kognitio designed its software from the ground up to provide the highest amount of scalable compute power, allowing rapid execution of complex analytical queries without the administrative overhead of manipulating data. Kognitio software runs on industry-standard x86 servers, as an appliance, or in Kognitio Cloud, a ready-to-use analytical platform. Kognitio Cloud is a secure, private or public cloud Platform-as-a-Service (PaaS), leveraging the cloud computing model to make the Kognitio Analytical Platform available on a subscription basis. Clients span industries including market research, consumer packaged goods, retail, telecommunications, financial services, insurance, gaming, media and utilities.
To learn more, visit www.kognitio.com and follow us on Facebook, LinkedIn and Twitter.
The document discusses YapMap, a visual search technology focused on threaded conversations. It was built using Hadoop to handle massive scales of data. The presentation covers YapMap's approach to crawling forums and message boards to build a searchable index, its distributed processing pipeline in Hadoop to reconstruct threads from individual posts and generate pre-indexed sub-threads, and how it presents search results with contextual threads and posts.
This document discusses Replication Server - Real Time Loading (RTL) for replicating data from a source database in real time to Sybase IQ for analytics purposes. It provides dial-in numbers and a passcode for a presentation on the topic. The presentation will cover the limitations of pre-RS 15.5 replication solutions to IQ, an overview of RTL, and the new RTL update capabilities in RS.
This webinar discusses tools for making big data easy to work with. It covers MetaScale Expertise, which provides Hadoop expertise and case studies. Kognitio Analytics is discussed as a way to accelerate Hadoop for organizations. The webinar agenda includes an introduction, presentations on MetaScale and Kognitio, and a question and answer session. Rethinking data strategies with Hadoop and using in-memory analytics are presented as ways to gain insights from large, diverse datasets.
Hadoop and its Ecosystem Components in Action - Andrew Brust
This document provides an overview of Andrew Brust's presentation on Hadoop and its ecosystem components. The presentation introduces key concepts like MapReduce, HDFS, Hive, Pig, HBase, Zookeeper and Mahout. It also provides instructions on setting up and using Hadoop on Amazon Elastic MapReduce and Microsoft Azure HDInsight. The document includes examples of commands for working with HDFS, MapReduce, Hive, Pig, HBase and Mahout.
MapReduce is a framework for processing large datasets in a distributed manner. It involves two functions: map and reduce. The map function processes individual elements to generate intermediate key-value pairs, and the reduce function merges all intermediate values that share the same key. Hadoop is an open-source implementation of MapReduce that uses HDFS for storage. A typical MapReduce job in Hadoop involves defining map and reduce functions, configuring the job, and submitting it to the JobTracker, which schedules tasks across nodes and monitors execution.
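The canonical word-count job shows this shape end to end. Below is the standard sketch against the org.apache.hadoop.mapreduce API, with input and output paths passed on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: emit (word, 1) for every token in this task's input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: the framework groups values by key, so we just sum the counts.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```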
Extracting Big Value From Big Data in Digital Media - An Executive Webcast wi... - Krishnan Parasuraman
This document summarizes an executive webcast about extracting value from big data in digital media. It discusses how big data analytics can help digital marketers achieve their goals of a single customer view, increased targeting precision, improved relevance, and higher campaign profitability. The challenges of fragmented customer data from multiple online and offline sources are also outlined. It promotes IBM's Netezza big data platform and analytic solutions for consolidating, segmenting, matching, and optimizing large amounts of structured and unstructured customer data in real time to drive better marketing outcomes.
Automated Trading Summit 2012, Amsterdam
Big Data impacts the way we think about managing, processing and analyzing marketing data. It is the foundational element for building Digital Marketing solutions such as Audience Optimization, Channel Optimization, Content Optimization and Yield Optimization.
Recent research and studies provide some fascinating insights into how:
(a) CMOs view Big Data as their biggest area of "under-preparedness",
(b) Organizations view Advanced Analytics as a competitive advantage, and
(c) Digital Marketers view Big Data as an enabling platform for all their future initiatives.
Big Data Forum at Salt River Fields (the spring training field for the Arizona Diamondbacks). Krishnan Parasuraman discusses how companies are using big data and analytics to transform their business.
Big Data Journeys: Review of roadmaps taken by early adopters to achieve thei... - Krishnan Parasuraman
Implementing a Big Data program can be a long and arduous journey. Each organization has its own unique business drivers and technical considerations that drive its big data adoption roadmap. Whatever your organization's specific big data driver may be - managing a rapid surge of data, implementing a new set of analytic capabilities, incorporating unstructured data into your enterprise data platform, or accessing real-time information for actionable intelligence - the approach and roadmap that you put in place to reach that end goal become all the more critical in a space where early success stories are relatively rare, skill sets are hard to find and technologies are still evolving.
In this session we will chronicle the journeys of four different organizations that were early adopters of big data. Each of them charted a different path to achieve its big data goals. We will look at the key drivers behind their respective approaches, and at what worked and what did not work for them.
This document discusses building a scalable data science platform with R. It describes R as a popular statistical programming language with over 2.5 million users. It notes that while R is widely used, its open source nature means it lacks enterprise capabilities for large-scale use. The document then introduces Microsoft R Server as a way to bring enterprise capabilities like scalability, efficiency, and support to R in order to make it suitable for production use on big data problems. It provides examples of using R Server with Hadoop and HDInsight on the Azure cloud to operationalize advanced analytics workflows from data cleaning and modeling to deployment as web services at scale.
Summary of recent progress on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
The document provides an introduction to Apache Drill, an open source SQL query engine for analysis of large-scale datasets across Hadoop, NoSQL and cloud storage systems. It discusses Tomer Shiran's role in Apache Drill, provides an agenda for the talk, describes the need for interactive analysis of big data and how existing solutions are limited. It then outlines Apache Drill's architecture, key features like full SQL support, optional schemas and support for nested data formats.
Hadoop makes data storage and processing at scale available as a lower-cost, open solution. If you ever wanted to get your feet wet but found the elephant intimidating, fear no more.
We will explore several integration considerations from a Windows application perspective, such as accessing HDFS content, writing streaming jobs, and using the .NET SDK, as well as HDInsight on premises or on Azure.
This document discusses big data and Hadoop. It provides an overview of Hadoop, including what it is, how it works, and its core components like HDFS and MapReduce. It also discusses what Hadoop is good for, such as processing large datasets, and what it is not as good for, like low-latency queries or transactional systems. Finally, it covers some best practices for implementing Hadoop, such as infrastructure design and performance considerations.
Brad Anderson presented on NOSQL databases and CouchDB. He discussed how relational databases do not scale well and are rigid, and how NOSQL databases like CouchDB are a better fit for large, growing datasets. CouchDB is a document-oriented database written in Erlang that uses a REST API and supports views and incremental replication. It can be deployed on a cloud platform to improve scalability, redundancy and query distribution.
This document provides an overview and introduction to Hadoop, an open-source framework for storing and processing large datasets in a distributed computing environment. It discusses what Hadoop is, common use cases like ETL and analysis, key architectural components like HDFS and MapReduce, and why Hadoop is useful for solving problems involving "big data" through parallel processing across commodity hardware.
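To ground the storage half of that picture, here is a small sketch using Hadoop's Java FileSystem API to copy a local file into HDFS and list a directory. The paths are placeholders, and the configuration is assumed to come from a core-site.xml on the classpath (e.g. fs.defaultFS pointing at the NameNode).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBasics {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml, e.g. hdfs://namenode:8020
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            // Copy a local file into the distributed file system.
            fs.copyFromLocalFile(new Path("/tmp/events.log"),
                                 new Path("/data/raw/events.log"));
            // List the directory; each file is stored as large, replicated blocks.
            for (FileStatus status : fs.listStatus(new Path("/data/raw"))) {
                System.out.printf("%s\t%d bytes\treplication=%d%n",
                        status.getPath(), status.getLen(), status.getReplication());
            }
        }
    }
}
```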
This document summarizes Apache Drill, an open source SQL query engine for analysis of data stored in Hadoop and other data sources. It was inspired by Google's Dremel query engine. The document outlines Drill's architecture, which uses distributed Drillbit processes that can retrieve and process data in parallel, and which supports SQL queries over nested data with an optional schema. At the time of the summary, logical query planning was available along with a basic SQL parser and demo, with ongoing work to add fuller SQL support and distributed execution capabilities.
Understanding the Value and Architecture of Apache Drill - DataWorks Summit
This document summarizes Apache Drill, an open source SQL query engine for interactive analysis of large-scale datasets. It was inspired by Google's Dremel and allows for interactive, ad-hoc queries across data sources using standard SQL. The key features highlighted are its support for nested data, optional schemas, extensibility points, and full ANSI SQL 2003 compatibility. An overview of Drill's architecture is provided, including its use of distributed Drillbit processes and a coordinator node.
This document provides an introduction to relational databases, NoSQL databases, and data in general. It includes the following:
- An overview of relational databases and their ACID properties. Relational databases are best for structured, centralized data and scale vertically.
- A survey of several popular NoSQL databases like MongoDB, Cassandra, Redis, and HBase. NoSQL databases are best for unstructured, large quantities of data and scale horizontally.
- General advice that the data and query models, durability needs, scalability needs, and consistency requirements should determine the best database choice. Trying different options is recommended.
The document provides an overview of common and unique use cases for Apache Hadoop. It begins with an introduction of what Hadoop is and its origins. It then discusses how Hadoop is well suited for tasks like log processing due to its ability to handle large amounts of data in parallel across clusters of machines. Specific examples are given around using Hadoop for log processing to efficiently perform tasks like grepping, calculating metrics, and investigating user sessions from large log files.
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget - Cloudera, Inc.
This document discusses YapMap, a visual search platform built on Hadoop and HBase. It summarizes how YapMap interfaces with HBase data, uses HBase as a data processing pipeline with checkpoints, and had to adjust schemas and migrate data as the system evolved. It also covers how YapMap constructs search indexes in shards based on HBase regions and stores them on HDFS. The document concludes with some lessons learned around optimizing HBase operations.
The document introduces MongoDB as an open source, high performance database that is a popular NoSQL option. It discusses how MongoDB stores data as JSON-like documents, supports dynamic schemas, and scales horizontally across commodity servers. MongoDB is seen as a good alternative to SQL databases for applications dealing with large volumes of diverse data that need to scale.
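A minimal sketch of that document model, using the official MongoDB Java driver (mongodb-driver-sync), is shown below; the database, collection and field names are invented for illustration.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.util.Arrays;

import static com.mongodb.client.model.Filters.eq;

public class MongoQuickstart {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> users =
                    client.getDatabase("demo").getCollection("users");
            // Documents are JSON-like maps: no schema is declared up front,
            // and fields (including arrays) can vary between documents.
            users.insertOne(new Document("name", "ada")
                    .append("tags", Arrays.asList("admin", "beta"))
                    .append("visits", 3));
            // Query by field value; an index on "name" would serve this lookup.
            Document found = users.find(eq("name", "ada")).first();
            System.out.println(found != null ? found.toJson() : "not found");
        }
    }
}
```

Because the inserted fields need no prior schema declaration, this is the dynamic-schema flexibility the summary above refers to.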
The document discusses The Apache Way Done Right and the success of Hadoop. It provides an overview of Apache Hadoop, including that it is a set of open source projects that transforms commodity hardware into a reliable system for storing and analyzing large amounts of data. It also discusses how Hadoop originated from the Nutch project and was adopted by early users like Yahoo, Facebook, and Twitter to handle big data challenges. Examples are given of how Yahoo used Hadoop for applications like the Webmap and personalized homepages.
The document discusses big data architectures and technologies. It introduces concepts like Hadoop, HDFS, MapReduce, Spark, Storm and Kafka. It proposes a reference architecture using these technologies with data sources like databases, user tracking, logs and streaming data. The architecture includes an event broker to handle streaming data which is then processed via Spark, Storm or Hadoop and stored in data warehouses or search indexes. It also provides examples of using these technologies for analytics, machine learning and graph processing.
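As one hypothetical slice of such an architecture, the sketch below uses Spark's Java API to aggregate event counts from files landed on HDFS. In the reference architecture the input would typically arrive via the event broker (e.g. a Kafka topic); the paths, the tab-separated record format, and the local master setting are all assumptions for a runnable demo.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkEventCounts {
    public static void main(String[] args) {
        // local[*] runs on all local cores; on a cluster the master is set by the launcher.
        SparkConf conf = new SparkConf().setAppName("event-counts").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // In a full pipeline this input would be fed from the event broker
            // or from raw files landed on HDFS by the ingest layer.
            JavaRDD<String> events = sc.textFile("hdfs:///data/raw/events.log");
            JavaPairRDD<String, Integer> countsByType = events
                    .map(line -> line.split("\t")[0])       // event type assumed in field 0
                    .mapToPair(type -> new Tuple2<>(type, 1))
                    .reduceByKey(Integer::sum);             // aggregate across the cluster
            countsByType.saveAsTextFile("hdfs:///data/agg/event-counts");
        }
    }
}
```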
NoSQL is not a buzzword anymore. The array of non-relational technologies has found wide-scale adoption even in non-Internet-scale focus areas. With the advent of the Cloud, the churn has increased even more, yet there is no crystal-clear guidance on adoption techniques and architectural choices surrounding the plethora of options available. This session initiates you into the whys and wherefores, architectural patterns, caveats and techniques that will augment your decision-making process and boost your perception of architecting scalable, fault-tolerant and distributed solutions.
This document discusses building big data solutions using Microsoft's HDInsight platform. It provides an overview of big data and Hadoop concepts like MapReduce, HDFS, Hive and Pig. It also describes HDInsight and how it can be used to run Hadoop clusters on Azure. The document concludes by discussing some challenges with Hadoop and the broader ecosystem of technologies for big data beyond just Hadoop.
HadoopDB is a system that combines the performance of parallel database systems with the flexibility and fault tolerance of Hadoop. It uses Hadoop as the communication layer between multiple single-node database instances running on cluster nodes. Benchmark results showed that HadoopDB's performance was close to parallel databases for structured queries and similar to Hadoop for unstructured queries, while also providing Hadoop's ability to operate in heterogeneous environments and tolerate faults.
This document discusses common use cases for MongoDB and why it is well-suited for them. It describes how MongoDB can handle high volumes of data feeds, operational intelligence and analytics, product data management, user data management, and content management. Its flexible data model, high performance, scalability through sharding and replication, and support for dynamic schemas make it a good fit for applications that need to store large amounts of data, handle high throughput of reads and writes, and have low latency requirements.
Similar to Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models (20)
Digital Marketing Trends in 2024 | Guide for Staying AheadWask
https://www.wask.co/ebooks/digital-marketing-trends-in-2024
Feeling lost in the digital marketing whirlwind of 2024? Technology is changing, consumer habits are evolving, and staying ahead of the curve feels like a never-ending pursuit. This e-book is your compass. Dive into actionable insights to handle the complexities of modern marketing. From hyper-personalization to the power of user-generated content, learn how to build long-term relationships with your audience and unlock the secrets to success in the ever-shifting digital landscape.
Driving Business Innovation: Latest Generative AI Advancements & Success StorySafe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxSitimaJohn
Ocean Lotus cyber threat actors represent a sophisticated, persistent, and politically motivated group that poses a significant risk to organizations and individuals in the Southeast Asian region. Their continuous evolution and adaptability underscore the need for robust cybersecurity measures and international cooperation to identify and mitigate the threats posed by such advanced persistent threat groups.
Your One-Stop Shop for Python Success: Top 10 US Python Development Providersakankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU und die Lizenzen nach dem CCB- und CCX-Modell sind für viele in der HCL-Community seit letztem Jahr ein heißes Thema. Als Notes- oder Domino-Kunde haben Sie vielleicht mit unerwartet hohen Benutzerzahlen und Lizenzgebühren zu kämpfen. Sie fragen sich vielleicht, wie diese neue Art der Lizenzierung funktioniert und welchen Nutzen sie Ihnen bringt. Vor allem wollen Sie sicherlich Ihr Budget einhalten und Kosten sparen, wo immer möglich. Das verstehen wir und wir möchten Ihnen dabei helfen!
Wir erklären Ihnen, wie Sie häufige Konfigurationsprobleme lösen können, die dazu führen können, dass mehr Benutzer gezählt werden als nötig, und wie Sie überflüssige oder ungenutzte Konten identifizieren und entfernen können, um Geld zu sparen. Es gibt auch einige Ansätze, die zu unnötigen Ausgaben führen können, z. B. wenn ein Personendokument anstelle eines Mail-Ins für geteilte Mailboxen verwendet wird. Wir zeigen Ihnen solche Fälle und deren Lösungen. Und natürlich erklären wir Ihnen das neue Lizenzmodell.
Nehmen Sie an diesem Webinar teil, bei dem HCL-Ambassador Marc Thomas und Gastredner Franz Walder Ihnen diese neue Welt näherbringen. Es vermittelt Ihnen die Tools und das Know-how, um den Überblick zu bewahren. Sie werden in der Lage sein, Ihre Kosten durch eine optimierte Domino-Konfiguration zu reduzieren und auch in Zukunft gering zu halten.
Diese Themen werden behandelt
- Reduzierung der Lizenzkosten durch Auffinden und Beheben von Fehlkonfigurationen und überflüssigen Konten
- Wie funktionieren CCB- und CCX-Lizenzen wirklich?
- Verstehen des DLAU-Tools und wie man es am besten nutzt
- Tipps für häufige Problembereiche, wie z. B. Team-Postfächer, Funktions-/Testbenutzer usw.
- Praxisbeispiele und Best Practices zum sofortigen Umsetzen
Best 20 SEO Techniques To Improve Website Visibility In SERPPixlogix Infotech
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
Building Production Ready Search Pipelines with Spark and MilvusZilliz
Spark is the widely used ETL tool for processing, indexing and ingesting data to serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to Milvus vector database for search serving.
Generating privacy-protected synthetic data using Secludy and MilvusZilliz
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
Unlock the Future of Search with MongoDB Atlas: Vector Search Unleashed - Malak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
Main news related to the CCS TSI 2023 (2023/1695) - Jakub Marek
An English 🇬🇧 translation of the presentation accompanying the talk I gave on the main changes introduced by CCS TSI 2023 at the largest Czech conference on railway communications and signalling systems, held at the Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). It was attended by around 500 participants and 200 online followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
Programming Foundation Models with DSPy - Meetup Slides - Zilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack - shyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...) - Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
How to Get CNIC Information System with Paksim Ga - danishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deployment Models
1. Krishnan Parasuraman (Netezza), Greg Rokita (Edmunds.com)
Building Scalable Data Platforms
Hadoop and Netezza Deployment Models
2. Talking Points
• Building scalable data platforms
– Architectural considerations
• Hadoop and Massively Parallel Databases
– Similarities and differences
– Usage patterns
• Practitioner’s View Point
– Edmunds.com data warehouse platform
3. Building scalable data platforms
Typical Digital Media Information Processing Pipeline
Diagram: event sources (clicks, visits, page views, likes, tweets, impressions, locations) feed a Real-Time Decision Engine (scoring, yield optimization, audience analytics, display ads, recommendations, personalized content) and a Data Processing stage (correlate, structure, consolidate, aggregate, summarize) that drives Reporting (ad-hoc analysis).
4. Building scalable data platforms
Diagram: the same sources, Real-Time Decision Engine, Data Processing, and Reporting, now drawn on top of a shared DATA PLATFORM.
5. Building scalable data platforms
Diagram: the pipeline stages annotated with their characteristics (the slide's four-column table, reconstructed):
• Real-Time Decision Engine – Workloads: real time, high concurrency, transactional. Data: structured, unstructured, key-value pairs. Capability: stream processing, memory resident, key-based lookups.
• Data Processing – Workloads: high velocity, linearly scalable, disk bound, high throughput. Data: structured, unstructured, machine generated. Capability: low disk I/O, fast processing, low cost/TB.
• Analytics – Workloads: compute intensive, full table scans, disk bound. Data: mostly structured, some unstructured. Capability: in-DB computation, SQL and MR, analytic libraries.
• Reporting – Workloads: cached queries, low latency, high concurrency. Data: structured, relational. Capability: OLAP, columnar.
6. Building scalable data platforms
Diagram: the same four-column table, now with candidate technologies mapped onto the stages: NoSQL, in-memory, and graph databases for the Real-Time Decision Engine; Hadoop for Data Processing; a massively parallel DB for Analytics; and a "plain ole' DB on steroids" for Reporting.
7. Myth: A single technology will meet all the considerations for our scalable data platform needs
Best Practices
Workloads scale differently – Monolithic architectures don’t work
Minimize components – Data movement is painful
Understand tradeoffs – Performance, Price, Effort
Start with the core architecture and work in the edge cases
8. Massively parallel data warehouses
SQL And MR
Diagram: host controllers (hosts) connect over a network fabric to massively parallel compute nodes, each pairing an FPGA and a CPU with memory, on top of distributed storage.
9. Hadoop
Map Reduce
Diagram: a master node running the NameNode and JobTracker connects over a network fabric to parallel compute nodes, each running a TaskTracker and a DataNode, on top of distributed storage.
10. There are striking similarities….
Map Reduce
Diagram: the same master/worker layout (NameNode and JobTracker over TaskTracker/DataNode pairs), annotated with the traits the two systems share: massive parallelism; execute code and algorithms next to the data; scalable; highly available; Map Reduce.
11. But also key differences
Hadoop (JobTracker and NameNode over TaskTracker/DataNode pairs):
• Schema on read – data loading is fast (data loading = file copy: "Look Ma, no ETL")
• Batch-mode data access
• Lower cost of data storage
• Processes unstructured data
Netezza:
• Optimized for performance – real-time access, random reads, query optimizer, co-located joins
• Hardware-accelerated queries
• SQL and Map Reduce
12. These differences lead to opportunities for co-existence for Hadoop in a Netezza environment
1. Scalable ETL engine
– Complex data
– Relationships not defined
– Evolving schema
2. Queryable Archive
– Moving computation is cheaper than moving data
3. Analytics sandbox
– Exploratory analysis
13. Netezza-Hadoop: Deployment Patterns
Diagram of deployment patterns:
• Semi-structured data: parse and aggregate in Hadoop, then analyze and report in Netezza
• Unstructured data: create context in Hadoop (classification, text mining), then analyze in Netezza
• Structured data: active archival in Hadoop for long-running queries, with analysis and reporting in Netezza
14. Pattern 1: Data Processing Engine (ETL)
Diagram: raw weblogs flow into a Hadoop cluster (a NameNode/JobTracker master coordinating DataNode/TaskTracker workers), which parses and aggregates them before loading the results into the Netezza environment.
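To make Pattern 1 concrete, here is a minimal sketch of the kind of MapReduce job that could sit in this picture: it parses raw weblog lines and aggregates page views per URL into tab-separated output that a bulk loader could then push into Netezza. The log layout (URL at field 7) and all names are assumptions for illustration, not Edmunds' actual job.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WeblogAggregator {

  // Assumed log layout: space-separated fields with the request URL at index 6.
  public static class ParseMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text url = new Text();

    @Override
    protected void map(LongWritable key, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split(" ");
      if (fields.length > 6) {          // skip malformed lines instead of failing the job
        url.set(fields[6]);
        ctx.write(url, ONE);
      }
    }
  }

  // Sums page views per URL; output is tab-separated, ready for a bulk loader.
  public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text url, Iterable<LongWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable c : counts) sum += c.get();
      ctx.write(url, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "weblog-aggregate");
    job.setJarByClass(WeblogAggregator.class);
    job.setMapperClass(ParseMapper.class);
    job.setCombinerClass(SumReducer.class);   // pre-aggregate on the map side
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```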
16. Pattern 3: Queryable Archive
Diagram: a numbered three-step flow between the data sources, the Netezza environment, and a Hadoop archive that remains queryable.
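As a rough illustration of the archival step (not the talk's actual tooling), this sketch pushes an exported flat file into an HDFS archive directory with the standard FileSystem API; the paths are placeholders. Once the file sits in HDFS, MapReduce jobs can query it in place, which is the point of the pattern: the computation moves to the data.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ArchiveToHdfs {
  public static void main(String[] args) throws Exception {
    // args[0]: local export file (e.g. an aged fact-table partition dumped to CSV)
    // args[1]: HDFS archive directory, e.g. hdfs://namenode/archive/facts/2011-10
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(new Path(args[1]).toUri(), conf);
    fs.mkdirs(new Path(args[1]));
    // copyFromLocalFile(delSrc=false, overwrite=true, src, dst)
    fs.copyFromLocalFile(false, true, new Path(args[0]), new Path(args[1]));
    System.out.println("Archived " + args[0] + " to " + args[1]);
    // The archived files remain queryable by MapReduce jobs over this directory.
  }
}
```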
17. Edmunds.com and Scale
o Premier online resource for automotive information, launched in 1995 as the first automotive information Web site
o 15 million unique visitors
o 210 million page views
o 1 million+ new inventory items per day
o 2 TB of new data every month
o 40-node Hadoop cluster aggregating logs, advertising, vehicle, pricing, inventory, and other data sets
18. Edmunds Proposition
We have developed an iterative approach to data warehouse development that has dropped the time it takes for us to deliver reports to our users from months to weeks.
19. How did we do it?
o Process
o Technology
o Understanding of Value
20. Process: agile approach
o Continuous and fast delivery of new features
o Collaboration between users and developers
o Make new data available quickly and inexpensively
o Quick problem resolution
o No development cycle is wasted if the data turns out not to be useful
o Encouragement of exploration and creation of new applications
21. Process
Pre-process:
• Complete
• Raw
• Modeled as source data
• Generically loaded
• Quick turn-around
• Low retention
• Slower performance
Post-process:
• Filtered
• Transformed
• Modeled as star schema
• Optimized
• Slow turn-around
• High retention
• Fast performance
22. Post-Process Sandbox
Flowchart: use pre-process data and load it in an ad-hoc manner, then prototype. Does the data have value? If no, discard it: this prevents shadow production, little effort is lost, and users or developers can change the schema freely. If yes, enhance it and check whether the schema is stable; once the data is confirmed to be useful, the effort is warranted and an optimized pipeline is developed.
23. Technology
Publishing System:
• All data
• Generic
• Thrift IDL with versioning
• Replica of RDBMS in files
Hadoop Stack:
• HBase raw data
• Oozie job coordinator
• HDFS storage of pre-optimized data
Netezza:
• All data loaded from Hadoop in batch
• Analysis and data exploration – use the speed and power
• Report generation
24. Edmunds Publishing System
25. Generic flow for pre-process
Diagram: producers (Inventory, Pricing, Vehicle, Dealer, Leads) publish to a broker; a generic consumer persists the messages to HBase; Map-Reduce jobs process the HBase data and load it into Netezza, which drives downstream actions.
26. What architecture enables the generic consumer?
o Thrift: message serialization and versioning
o ActiveMQ: message delivery, persistence, durability
o Camel: retries, throttling, routing, monitoring
27. Flexibility for Producers and Consumers:
Support for Topologies
Field | Example Values | Purpose
Environment | PROD, TEST, DEV | Promotion cycle of deployment units
Index | Blue, Green, Stage | Environment index
Data Center | LAX1, EC2 | The data center where the deployment unit is located
Site | Edmunds, Insideline | Company's product
Application | HBase, Digital Asset Manager | Deployment unit
28. Producer-Consumer matching
Diagram: a producer and a consumer each declare who they are and what they publish or receive. The producer (I am: Prod, Lax, Edmunds) publishes Inventory and sends to Prod/Test, Lax/EC2, Edmunds, Dealer; the consumer (I am: Test, EC2, Edmunds) receives from Prod, Lax/EC2, Edmunds, Inventory. The broker's destination interceptor composes virtual queue topic names from these fields and pairs compatible producers and consumers: Match!
29. HBase: how to handle data generically
Column families and their roles (reconstructed from the slide's table):
• Binary – columns: the serialized Thrift object and a hashcode of the Thrift object; role: system of record, and checking whether updates are necessary (an optimization).
• Discrete – columns: one column per Thrift object field (field 1, field 2, field 3, ...); role: versioning at the most granular level, for lookups.
• Type 2 – columns: start date, end date, list of fields; role: versioning for optimized dimension tables.
30. Generic Thrift Persistence in HBase
Column Name -> Value
[ModelYear]|F:id|T:long|I:0 -> 1368
[ModelYear]|F:midYear|T:boolean|I:1 -> false
[ModelYear]|F:year|T:int|I:2 -> 1993
[ModelYear]|F:name|T:java.lang.String|I:4 -> Celica
[ModelYear]#[attributes][0]|F:_key|T:java.lang.Long -> 64
[ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][0]|F:value|T:java.lang.String|I:1 -> Standard
[ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][1]|F:value|T:java.lang.String|I:1 -> Sport GT-S 2dr Hatchback
[ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][1]|F:id|T:long|I:2 -> 441
[ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][3]|F:value|T:java.lang.String|I:1 -> GT-S
31. Netezza: Time is Money
Compared to Oracle | Business Value
Up to 12x faster load times | Can reload data more frequently; failed workflows are no longer a big problem; helps in the transition to a real-time system: we can now create intraday reports for Leads!
Up to 400x faster query times | More productive Business Intelligence; queries that could 'never' finish in Oracle are now providing business value
32. Generic and reusable Oozie actions for Netezza
• Oozie Load and Remove Action
• Apache CLI
• nzload and nzsql (provisioned on worker nodes using Chef)
33. Value
o The data warehouse proves product value both internally and to our customers
o Failing fast and quick turnaround let us know when we are building the right reporting and analytical products, without a large up-front investment
o By combining all data in a single system, we are enabling new products that we previously could not develop
34. Krishnan Parasuraman (@kparasuraman), Greg Rokita (Edmunds.com)
Building Scalable Data Platforms
Hadoop and Netezza Deployment Models