Apache HBase is an open source distributed data-store capable of managing billions of rows of semi-structured data across large clusters of commodity hardware. HBase provides real-time random read-write access as well as integration with Hadoop MapReduce, Hive, and Pig for batch analysis. In this talk, Todd will provide an introduction to the capabilities and characteristics of HBase, comparing and contrasting it with traditional database systems. He will also introduce its architecture and data model, and present some example use cases.
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce (Cloudera, Inc.)
Most developers are familiar with the topic of “database design”. In the relational world, normalization is the name of the game. How do things change when you’re working with a scalable, distributed, non-SQL database like HBase? This talk will cover the basics of HBase schema design at a high level and give several common patterns and examples of real-world schemas to solve interesting problems. The storage and data access architecture of HBase (row keys, column families, etc.) will be explained, along with the pros and cons of different schema decisions.
Hadoop World 2011: Advanced HBase Schema Design - Lars George, Cloudera (Cloudera, Inc.)
While running a simple key/value-based solution on HBase usually requires an equally simple schema, it is less trivial to operate a different application that has to insert thousands of records per second.
This talk will address the architectural challenges that HBase imposes when designing for either read or write performance. It will include examples of real-world use cases and how they can be implemented on top of HBase, using schemas that optimize for the given access patterns.
From: DataWorks Summit 2017 - Munich - 20170406
HBase has established itself as the backend for many operational and interactive use cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features, HBase has come a long way, offering advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tunable write-ahead logging, and so on. This talk is based on the research for the upcoming second release of the speaker's HBase book, correlated with practical experience in medium to large HBase projects around the world. You will learn how to plan for HBase, starting with the selection of matching use cases, moving on to determining the number of servers needed, and leading into performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
Apache Sqoop efficiently transfers bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop helps offload certain tasks (such as ETL processing) from the EDW to Hadoop for efficient execution at a much lower cost. Sqoop can also be used to extract data from Hadoop and export it into external structured datastores. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB.
This is the presentation I gave at JavaDay Kiev 2015 on the architecture of Apache Spark. It covers the memory model, the shuffle implementations, data frames, and some other high-level stuff, and can be used as an introduction to Apache Spark.
Apache HBase™ is the Hadoop database, a distributed, scalable, big data store. It is a column-oriented database management system that runs on top of HDFS.
Apache HBase is an open source NoSQL database that provides real-time read/write access to large data sets. ... HBase is natively integrated with Hadoop and works seamlessly alongside other data access engines through YARN.
Building robust CDC pipeline with Apache Hudi and Debezium (Tathastu.ai)
We will cover the need for CDC and the benefits of building a CDC pipeline. We will compare various CDC streaming and reconciliation frameworks. We will also cover the architecture and the challenges we faced while running this system in production. Finally, we will conclude the talk by covering Apache Hudi, Schema Registry, and Debezium in detail, along with our contributions to the open-source community.
In this presentation, you will get a look under the covers of Amazon Redshift, a fast, fully-managed, petabyte-scale data warehouse service for less than $1,000 per TB per year. Learn how Amazon Redshift uses columnar technology, optimized hardware, and massively parallel processing to deliver fast query performance on data sets ranging in size from hundreds of gigabytes to a petabyte or more. We'll also walk through techniques for optimizing performance, and you’ll hear from a customer about a use case that takes advantage of fast performance on enormous datasets and the economies of scale of the AWS platform.
The Parquet Format and Performance Optimization Opportunities (Databricks)
The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads.
As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and Parquet in general.
This talk serves both as an approachable refresher on columnar storage and as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
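The min/max skipping mentioned in this abstract can be illustrated with a small sketch. This is a toy model in plain Python, not the Parquet library or its API: `RowGroup` and `scan_greater_than` are hypothetical names, and real Parquet files keep such statistics in file metadata rather than computing them at scan time as done here for brevity.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for one row group holding a single integer column."""
    values: List[int]

    @property
    def max_value(self) -> int:
        # A real file would read this statistic from metadata written at write time.
        return max(self.values)


def scan_greater_than(row_groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return the values greater than threshold and how many row groups were read."""
    matches: List[int] = []
    groups_read = 0
    for group in row_groups:
        # If the largest value cannot satisfy the predicate, skip the whole group.
        if group.max_value <= threshold:
            continue
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


groups = [RowGroup([1, 4, 7]), RowGroup([10, 12, 15]), RowGroup([20, 22, 29])]
result, read = scan_greater_than(groups, 14)
print(result, read)
```

In this toy run the first row group is never decoded, because its maximum already rules out a match. The benefit in practice depends on how well the data is sorted or clustered on the filtered column.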
This presentation briefly describes key features of Apache Cassandra. It was held at the Apache Cassandra Meetup in Vienna in January 2014. You can access the meetup here: http://www.meetup.com/Vienna-Cassandra-Users/
This talk delves into the many ways a user can use HBase in a project. Lars will look at many practical examples based on real applications in production, for example at Facebook and eBay, and at the right approach for those wanting to find their own implementation. He will also discuss advanced concepts such as counters, coprocessors, and schema design.
This presentation covers the following topics:
1. HBase versions and origins
2. HBase core concepts
3. HBase vs. RDBMS
4. Data Modeling
5. HBase architecture
6. HBase Master and Region Servers
7. Column Families and Regions
8. HBase Internals: Bloom Filters and Block Indexes
9. Write Pipeline / Read Pipeline
10. Compactions
11. Learning Resources
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon (Cloudera, Inc.)
OpenTSDB was built on the belief that, through HBase, a new breed of monitoring systems could be created, one that can store and serve billions of data points forever without the need for destructive downsampling, one that could scale to millions of metrics, and where plotting real-time graphs is easy and fast. In this presentation we’ll review some of the key points of OpenTSDB’s design, some of the mistakes that were made, how they were or will be addressed, and what were some of the lessons learned while writing and running OpenTSDB as well as asynchbase, the asynchronous high-performance thread-safe client for HBase. Specific topics discussed will be around the schema, how it impacts performance and allows concurrent writes without need for coordination in a distributed cluster of OpenTSDB instances.
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado... (Edureka!)
** Hadoop Training: https://www.edureka.co/hadoop **
This Edureka PPT on Sqoop Tutorial explains the fundamentals of Apache Sqoop. It also gives a brief overview of the Sqoop architecture and closes with a demo of data transfer between MySQL and Hadoop.
The following topics are covered in this video:
1. Problems with RDBMS
2. Need for Apache Sqoop
3. Introduction to Sqoop
4. Apache Sqoop Architecture
5. Sqoop Commands
6. Demo to transfer data between MySQL and Hadoop
Yahoo has long been involved in HBase and its community. In 2013, HBase was offered as a hosted service at Yahoo. Since then, adoption has grown rapidly, and today HBase is used by numerous teams across the company, helping to enable a diverse set of use cases ranging from near real-time processing to data warehousing.
This was made possible thanks to HBase along with some enhancements to support multi-tenancy and scale. As our clusters continue to grow and use cases become more demanding we are working towards supporting a million regions in a single cluster.
In this keynote, we’ll paint a picture of where Yahoo! is today and the enhancements we have been working on to reach today’s scale as well as supporting a million regions and beyond.
HBase can be an intimidating beast for someone considering its adoption. For what kinds of workloads is it well suited? How does it integrate into the rest of my application infrastructure? What are the data semantics upon which applications can be built? What are the deployment and operational concerns? In this talk, I'll address each of these questions in turn. As supporting evidence, both high-level application architecture and internal details will be discussed. This is an interactive talk: bring your questions and your use-cases!
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database (Edureka!)
NoSQL encompasses a wide range of database technologies that were developed in response to the surging volume of stored data. Relational databases struggle to cope with this volume and face agility challenges. This is where NoSQL databases come into play; they are popular because of their features. The session covers the following topics to help you choose the right NoSQL database:
Traditional databases
Challenges with traditional databases
CAP Theorem
NoSQL to the rescue
A BASE system
Choose the right NoSQL database
Hourglass: a Library for Incremental Processing on Hadoop (Matthew Hayes)
Slides from my talk at IEEE BigData 2013 presenting our paper "Hourglass: a Library for Incremental Processing on Hadoop"
Abstract:
Hadoop enables processing of large data sets through its relatively easy-to-use semantics. However, jobs are often written inefficiently for tasks that could be computed incrementally due to the burdensome incremental state management for the programmer. This paper introduces Hourglass, a library for developing incremental monoid computations on Hadoop. It runs on unmodified Hadoop and provides an accumulator-based interface for programmers to store and use state across successive runs; the framework ensures that only the necessary subcomputations are performed. It is successfully used at LinkedIn, one of the largest online social networks, for many use cases in dashboarding and machine learning. Hourglass is open source and freely available.
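The idea of an incremental monoid computation can be sketched in a few lines. This is a plain-Python illustration of the general technique and does not use the Hourglass API: `IncrementalCounter`, `summarize_day`, and `merge` are hypothetical names, and the per-day event lists are made-up sample data.

```python
from collections import Counter
from typing import Dict, List, Set


def summarize_day(events: List[str]) -> Counter:
    """Reduce one day's raw events to a partial result."""
    return Counter(events)


def merge(state: Counter, partial: Counter) -> Counter:
    # Counter addition is associative and Counter() is its identity element,
    # so partial results form a monoid and can be folded in any grouping.
    return state + partial


class IncrementalCounter:
    """Keeps state across runs so each day is processed only once."""

    def __init__(self) -> None:
        self.state: Counter = Counter()
        self.processed_days: Set[str] = set()

    def run(self, days: Dict[str, List[str]]) -> Counter:
        for day in sorted(days):
            if day in self.processed_days:
                continue  # already folded into the state by an earlier run
            self.state = merge(self.state, summarize_day(days[day]))
            self.processed_days.add(day)
        return self.state


job = IncrementalCounter()
first = job.run({"2013-10-01": ["a", "b", "a"]})
second = job.run({"2013-10-01": ["a", "b", "a"], "2013-10-02": ["a", "c"]})
print(first, second)
```

The second run only summarizes the new day and merges it into the saved state, which is the kind of work avoidance the abstract describes.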
Sept 17 2013 - THUG - HBase a Technical Introduction (Adam Muise)
HBase Technical Introduction. This deck includes a description of memory design, write path, read path, some operational tidbits, SQL on HBase (Phoenix and Hive), as well as HOYA (HBase on YARN).
Apache HBase - Introduction & Use Cases (Data Con LA)
Apache HBase™ is the Hadoop database, a distributed, scalable, big data store. Use Apache HBase when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable
This talk will introduce Apache HBase and give you an overview of columnar databases. We will also talk about how Facebook currently uses HBase, and about HBase security, Apache Phoenix, and Apache Slider.
Speaker: Jesse Anderson (Cloudera)
As optional pre-conference prep for attendees who are new to HBase, this talk will offer a brief Cliff's Notes-level talk covering architecture, API, and schema design. The architecture section will cover the daemons and their functions; the API section will cover HBase's GET, PUT, and SCAN classes; and the schema design section will cover how HBase differs from an RDBMS and the amount of effort to place on schema and row-key design.
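To make the get, put, and scan operations concrete, here is a toy in-memory model of their semantics. It is not the HBase client API: `ToyTable` and its methods are hypothetical, and the sketch ignores column families, versions, and regions. It only shows that rows are kept sorted by byte-wise row key and that a scan covers a key range with an inclusive start and an exclusive stop.

```python
import bisect
from typing import Dict, Iterator, List, Tuple


class ToyTable:
    """Hypothetical sorted key-value table that mimics get/put/scan semantics."""

    def __init__(self) -> None:
        self._keys: List[bytes] = []                 # row keys in sorted order
        self._rows: Dict[bytes, Dict[str, str]] = {}  # row key -> {column: value}

    def put(self, row_key: bytes, column: str, value: str) -> None:
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)  # keep the key list sorted
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key: bytes) -> Dict[str, str]:
        # Point lookup by exact row key; an unknown key gives an empty row.
        return dict(self._rows.get(row_key, {}))

    def scan(self, start_row: bytes, stop_row: bytes) -> Iterator[Tuple[bytes, Dict[str, str]]]:
        # Range read over sorted keys: start is inclusive, stop is exclusive.
        lo = bisect.bisect_left(self._keys, start_row)
        hi = bisect.bisect_left(self._keys, stop_row)
        for key in self._keys[lo:hi]:
            yield key, dict(self._rows[key])


table = ToyTable()
table.put(b"user3", "info:name", "Cy")
table.put(b"user1", "info:name", "Al")
table.put(b"user2", "info:name", "Bo")
scanned = [key for key, _ in table.scan(b"user1", b"user3")]
print(table.get(b"user2"), scanned)
```

Because a scan is just a walk over adjacent keys, the choice of row key decides which reads are cheap, which is why the abstract stresses row-key design.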
Are you new to SlideShare? Are you looking to fine tune your channel plan? Are you using SlideShare but are looking for ways to enhance what you're doing? How can you use SlideShare for content marketing tactics such as lead generation, calls-to-action to other pieces of your content, or thought leadership? Read more from the CMI team in their latest SlideShare presentation on SlideShare.
Breaking with relational DBMS and dating with Hbase [5th IndicThreads.com Con... (IndicThreads)
Session Presented at 5th IndicThreads.com Conference On Java held on 10-11 December 2010 in Pune, India
WEB: http://J10.IndicThreads.com
------------
HBase is an open-source, non-relational, distributed, sparse, column-oriented data store modeled after Google’s Bigtable and written in Java.
In this presentation we will talk about how to migrate an RDBMS-based Java application to an HBase-based application. We will discuss the following points:
• HBase schema design (a paradigm shift from the way we think about data storage right now) compared to RDBMS-based schema design.
• The challenges faced while porting the application to HBase.
• Introduction to HBql for querying data from HBase.
• Monitoring the example application for HBase (JMX APIs exposed) and the machine’s performance with Ganglia.
• Discussion of the Thrift interface and how the REST interface can be used to integrate HBase with non-Java applications.
• Cluster replication and what is coming in the next major release of HBase, 0.90.
• We will end the session with a demo of the ported application.
Takeaways for the Audience
1. When is HBase appropriate and when not?
2. HBase architecture and schema design
3. RDBMS vs HBase
4. Interfacing HBase with applications using Thrift or REST
5. HBase cluster and replication
6. HBase monitoring
Big Data and New Challenges for DBAs (Michael Naumov, LivePerson)
Hadoop has become a popular platform for managing large datasets of structured and unstructured data. It does not replace existing infrastructures, but instead augments them. Most companies will still use relational databases for transactional processing and low-latency queries, but can benefit from Hadoop for reporting, machine learning or ETL. This session will cover:
What is Hadoop and why do I care?
What do people do with Hadoop?
How can SQL Server DBAs add Hadoop to their architecture?
Presentation on big data,
from the workshop "The Big Data Era: Why and How?" at the 22nd Computer Society of Iran Conference, csicc2017.ir
Vahid Amiri
vahidamiry.ir
datastack.ir
This is the introductory presentation on HBase given by Hayden Marchant at the monthly Amobee Tech Talk.
In this session, we'll learn about HBase, a NoSQL database that provides real-time, random read and write access to tables meant to store billions of rows and millions of columns.
HBase is an open-source, non-relational, distributed, column-oriented database that is linearly scalable and designed to run on commodity hardware. HBase clusters can span hundreds or thousands of nodes, serving extraordinary amounts of information. Tight integration with Hadoop allows powerful analytical processing on data residing in HBase.
Similar to Chicago Data Summit: Apache HBase: An Introduction (20)
Cloudera Data Impact Awards 2021 - Finalists (Cloudera, Inc.)
This annual program recognizes organizations who are moving swiftly towards the future and building innovative solutions by making what was impossible yesterday, possible today.
The winning organizations' implementations demonstrate outstanding achievements in fulfilling their mission, technical advancement, and overall impact.
The 2021 Data Impact Awards recognize organizations' achievements with the Cloudera Data Platform in seven categories:
Data Lifecycle Connection
Data for Enterprise AI
Cloud Innovation
Security & Governance Leadership
People First
Data for Good
Industry Transformation
2020 Cloudera Data Impact Awards Finalists (Cloudera, Inc.)
Cloudera is proud to present the 2020 Data Impact Awards Finalists. This annual program recognizes organizations running the Cloudera platform for the applications they've built and the impact their data projects have on their organizations, their industries, and the world. Nominations were evaluated by a panel of independent thought leaders and expert industry analysts, who then selected the finalists and winners. Winners exemplify the most cutting-edge data projects and represent innovation and leadership in their respective industries.
Machine Learning with Limited Labeled Data 4/3/19 (Cloudera, Inc.)
Cloudera Fast Forward Labs’ latest research report and prototype explore learning with limited labeled data. This capability relaxes the stringent labeled data requirement in supervised machine learning and opens up new product possibilities. It is industry invariant, addresses the labeling pain point and enables applications to be built faster and more efficiently.
Data Driven With the Cloudera Modern Data Warehouse 3.19.19 (Cloudera, Inc.)
In this session, we will cover how to move beyond structured, curated reports based on known questions on known data, to an ad-hoc exploration of all data to optimize business processes and into the unknown questions on unknown data, where machine learning and statistically motivated predictive analytics are shaping business strategy.
Introducing Cloudera DataFlow (CDF) 2.13.19 (Cloudera, Inc.)
Watch this webinar to understand how Hortonworks DataFlow (HDF) has evolved into the new Cloudera DataFlow (CDF). Learn about key capabilities that CDF delivers, such as:
- Powerful data ingestion powered by Apache NiFi
- Edge data collection by Apache MiNiFi
- IoT-scale streaming data processing with Apache Kafka
- Enterprise services to offer unified security and governance from edge-to-enterprise
Introducing Cloudera Data Science Workbench for HDP 2.12.19 (Cloudera, Inc.)
Cloudera’s Data Science Workbench (CDSW) is available for Hortonworks Data Platform (HDP) clusters for secure, collaborative data science at scale. During this webinar, we provide an introductory tour of CDSW and a demonstration of a machine learning workflow using CDSW on HDP.
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19 (Cloudera, Inc.)
Join Cloudera as we outline how we use Cloudera technology to strengthen sales engagement, minimize marketing waste, and empower line of business leaders to drive successful outcomes.
Leveraging the cloud for analytics and machine learning 1.29.19 (Cloudera, Inc.)
Learn how organizations are deriving unique customer insights, improving product and services efficiency, and reducing business risk with a modern big data architecture powered by Cloudera on Azure. In this webinar, you see how fast and easy it is to deploy a modern data management platform—in your cloud, on your terms.
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19 (Cloudera, Inc.)
Join us to learn about the challenges of legacy data warehousing, the goals of modern data warehousing, and the design patterns and frameworks that help to accelerate modernization efforts.
Leveraging the Cloud for Big Data Analytics 12.11.18 (Cloudera, Inc.)
Learn how organizations are deriving unique customer insights, improving product and services efficiency, and reducing business risk with a modern big data architecture powered by Cloudera on AWS. In this webinar, you see how fast and easy it is to deploy a modern data management platform—in your cloud, on your terms.
Explore new trends and use cases in data warehousing including exploration and discovery, self-service ad-hoc analysis, predictive analytics and more ways to get deeper business insight. Modern Data Warehousing Fundamentals will show how to modernize your data warehouse architecture and infrastructure for benefits to both traditional analytics practitioners and data scientists and engineers.
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
Cloudera SDX is by no means restricted to just the platform; it extends well beyond. In this webinar, we show you how Bardess Group’s Zero2Hero solution leverages the shared data experience to coordinate Cloudera, Trifacta, and Qlik to deliver complete customer insight.
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
Join Cloudera Fast Forward Labs Research Engineer, Mike Lee Williams, to hear about their latest research report and prototype on Federated Learning. Learn more about what it is, when it’s applicable, how it works, and the current landscape of tools and libraries.
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
451 Research Analyst Sheryl Kingstone, and Cloudera’s Steve Totman recently discussed how a growing number of organizations are replacing legacy Customer 360 systems with Customer Insights Platforms.
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
In this webinar, you will learn how Cloudera and BAH riskCanvas can help you build a modern AML platform that reduces false positive rates, investigation costs, technology sprawl, and regulatory risk.
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
How can companies integrate data science into their businesses more effectively? Watch this recorded webinar and demonstration to hear more about operationalizing data science with Cloudera Data Science Workbench on Cazena’s fully-managed cloud platform.
Chicago Data Summit: Apache HBase: An Introduction
4. Apache HBase: HBase is an open source, distributed, sorted map datastore modeled after Google’s BigTable
8. Sorted Map Datastore (logical view as “records”)
Row key   Data
cutting   info: { ‘height’: ‘9ft’, ‘state’: ‘CA’ }
          roles: { ‘ASF’: ‘Director’, ‘Hadoop’: ‘Founder’ }
tlipcon   info: { ‘height’: ‘5ft7’, ‘state’: ‘CA’ }
          roles: { ‘Hadoop’: ‘Committer’@ts=2010, ‘Hadoop’: ‘PMC’@ts=2011, ‘Hive’: ‘Contributor’ }
Notes on the slide:
The row key is an implicit PRIMARY KEY in RDBMS terms.
Different types of data are separated into different “column families”.
A single cell might have different values at different timestamps.
Different rows may have different sets of columns (the table is sparse).
Useful for *-To-Many mappings.
Data is all byte[] in HBase.
9. Sorted Map Datastore (physical view as “cells”)
Sorted on disk by row key, column key, and descending timestamp. Timestamps are milliseconds since the Unix epoch.

info Column Family
Row key   Column key    Timestamp       Cell value
cutting   info:height   1273516197868   9ft
cutting   info:state    1043871824184   CA
tlipcon   info:height   1273878447049   5ft7
tlipcon   info:state    1273616297446   CA

roles Column Family
Row key   Column key     Timestamp       Cell value
cutting   roles:ASF      1273871823022   Director
cutting   roles:Hadoop   1183746289103   Founder
tlipcon   roles:Hadoop   1300062064923   PMC
tlipcon   roles:Hadoop   1293388212294   Committer
tlipcon   roles:Hive     1273616297446   Contributor
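The sort order on this slide can be reproduced with a small sketch. This is only an in-memory illustration in Python, not how HBase itself is implemented: the function name `cell_sort_key` is made up for the example, and plain strings stand in for the byte arrays HBase actually compares.

```python
# Cells from the "roles" column family on the slide, deliberately shuffled:
# (row key, column key, timestamp in ms since the Unix epoch, cell value).
cells = [
    ("tlipcon", "roles:Hadoop", 1293388212294, "Committer"),
    ("cutting", "roles:Hadoop", 1183746289103, "Founder"),
    ("tlipcon", "roles:Hive", 1273616297446, "Contributor"),
    ("tlipcon", "roles:Hadoop", 1300062064923, "PMC"),
    ("cutting", "roles:ASF", 1273871823022, "Director"),
]


def cell_sort_key(cell):
    """Order by row key, then column key, then newest timestamp first."""
    row, column, timestamp, _value = cell
    # Negating the timestamp turns an ascending sort into a descending one.
    return (row, column, -timestamp)


sorted_cells = sorted(cells, key=cell_sort_key)

for row, column, timestamp, value in sorted_cells:
    print(row, column, timestamp, value)
```

Because the newest version of a cell sorts first, a reader that wants the current value of `tlipcon` / `roles:Hadoop` can stop at the first matching entry.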
13. High Level Architecture
[Diagram components: HBase, HDFS, ZooKeeper, Java Client, MapReduce, Hive/Pig, Thrift/REST Gateway, Your Java Application]
15. Cluster Architecture
[Diagram components: Client, ZK Quorum (three ZK peers), two HMasters, RegionServers, HDFS]
Client finds RegionServer addresses in ZooKeeper.
Client reads and writes rows by directly accessing the RegionServers.
Master assigns regions and achieves load balancing.
16. Cluster Deployment (big cluster)
[Diagram components: HDFS NameNode, Secondary NameNode, MapReduce JobTracker, three ZooKeeper nodes, two HMasters, and slave nodes each running RegionServer, DataNode, and TaskTracker]
3 or 5 ZK nodes
HMaster with one standby
40+ slaves with HBase, HDFS, and MR slave processes
17. Cluster Deployment (small cluster / POC)
[Diagram components: one node running NameNode, SecondaryNameNode, HMaster, JobTracker, and ZooKeeper, plus slave nodes each running RegionServer, DataNode, and TaskTracker]
5+ slaves with HBase, HDFS, and MR slave processes
The proverbial basket full of eggs
19. HBase vs just HDFS
                         Plain HDFS/MR                                    HBase
Write pattern            Append-only                                      Random write, bulk incremental
Read pattern             Full table scan, partition table scan            Random read, small range scan, or table scan
Hive (SQL) performance   Very good                                        4-5x slower
Structured storage       Do-it-yourself / TSV / SequenceFile / Avro / ?   Sparse column-family data model
Max data size            30+ PB                                           ~1PB
If you have neither random write nor random read, stick to HDFS!
20. HBase vs RDBMS
                               RDBMS                          HBase
Data layout                    Row-oriented                   Column-family-oriented
Transactions                   Multi-row ACID                 Single row only
Query language                 SQL                            get/put/scan/etc *
Security                       Authentication/Authorization   Work in progress
Indexes                        On arbitrary columns           Row-key only
Max data size                  TBs                            ~1PB
Read/write throughput limits   1000s queries/second           Millions of queries/second
HBase is a project that solves this problem. In a sentence, HBase is an open source, distributed, sorted map modeled after Google’s BigTable. Open source: Apache HBase is an open source project with an Apache 2.0 license. Distributed: HBase is designed to use multiple machines to store and serve data. Sorted map: HBase stores data as a map, and guarantees that adjacent keys will be stored next to each other on disk. HBase is modeled after BigTable, a system that is used for hundreds of applications at Google. Copyright 2010 Cloudera - Do not distribute
Earlier, I said that HBase is a big sorted map. Here is an example of a table. The map key is (row key + column + timestamp). The value is the cell contents. The rows in the map are sorted by key. In this example, Row1 has 3 columns in the "info" column family. Row2 only has a single column. A column can also be empty. Each cell has a timestamp. By default, the timestamp is set to the current time (in milliseconds since the Unix epoch, January 1st, 1970) when the cell is written. A client can specify a timestamp when inserting or retrieving data, and specify how many versions of each cell should be maintained. Data in HBase is non-typed; everything is an array of bytes. Rows are sorted lexicographically. This order is maintained on disk, so Row1 and Row2 can be read together in just one disk seek.
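The timestamp and version behaviour described in these notes can be sketched with a toy in-memory class. Everything here is an assumption made for illustration: `ToyVersionedTable`, its `max_versions` parameter, and the "newest value at or before the requested timestamp" read rule are simplifications, not the HBase client API.

```python
import time


class ToyVersionedTable:
    """In-memory stand-in for one table; NOT the HBase client API."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # (row, column) -> list of (timestamp, value), newest first
        self._cells = {}

    def put(self, row, column, value, timestamp=None):
        if timestamp is None:
            # Default: current time in milliseconds since the Unix epoch.
            timestamp = int(time.time() * 1000)
        versions = self._cells.setdefault((row, column), [])
        # A write at an existing timestamp replaces that version.
        versions[:] = [tv for tv in versions if tv[0] != timestamp]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        # Keep only the newest max_versions entries.
        del versions[self.max_versions:]

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (latest if None)."""
        for ts, value in self._cells.get((row, column), []):
            if timestamp is None or ts <= timestamp:
                return value
        return None


# Row keys, column keys, and values are all plain bytes, as in the notes.
versioned = ToyVersionedTable(max_versions=2)
versioned.put(b"tlipcon", b"roles:Hadoop", b"Committer", timestamp=1293388212294)
versioned.put(b"tlipcon", b"roles:Hadoop", b"PMC", timestamp=1300062064923)

latest = versioned.get(b"tlipcon", b"roles:Hadoop")
older = versioned.get(b"tlipcon", b"roles:Hadoop", timestamp=1295000000000)
print(latest, older)
```

The two timestamps are the ones shown for `tlipcon` on the physical-view slide; the query timestamp `1295000000000` is an arbitrary value chosen to fall between them.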
Given that HBase stores a large sorted map, the API looks similar to a map. You can get or put individual rows, or scan a range of rows. There is also a very efficient way of incrementing a particular cell; this can be useful for maintaining high-performance counters or statistics. Lastly, it’s possible to write MapReduce jobs that analyze the data in HBase.
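To make the map-like shape of the API concrete, here is a minimal sketch of the four operations mentioned above. `ToySortedTable` and its method signatures are hypothetical and exist only for this example; the real client is a Java API with different types, and real counters are stored as bytes rather than Python integers.

```python
import bisect


class ToySortedTable:
    """Toy sorted map keyed by row key; illustrative only, not the HBase API."""

    def __init__(self):
        self._keys = []   # row keys kept in sorted order
        self._rows = {}   # row key -> {column: value}

    def put(self, row, column, value):
        if row not in self._rows:
            bisect.insort(self._keys, row)
            self._rows[row] = {}
        self._rows[row][column] = value

    def get(self, row):
        """Return a copy of the row's columns (empty dict if the row is absent)."""
        return dict(self._rows.get(row, {}))

    def scan(self, start_row, stop_row):
        """Yield (row, columns) for start_row <= row < stop_row, in key order."""
        lo = bisect.bisect_left(self._keys, start_row)
        hi = bisect.bisect_left(self._keys, stop_row)
        for row in self._keys[lo:hi]:
            yield row, dict(self._rows[row])

    def increment(self, row, column, amount=1):
        """Add `amount` to a counter cell and return the new value."""
        new_value = self.get(row).get(column, 0) + amount
        self.put(row, column, new_value)
        return new_value


toy = ToySortedTable()
toy.put(b"user2", b"info:state", b"NY")
toy.put(b"user1", b"info:state", b"CA")
toy.put(b"user3", b"info:state", b"TX")

# Rows come back in key order regardless of insertion order.
scanned = [row for row, _columns in toy.scan(b"user1", b"user3")]

toy.increment(b"user1", b"stats:visits")
visits = toy.increment(b"user1", b"stats:visits", 5)
print(scanned, visits)
```

The scan only needs two binary searches to find its range because the keys are kept sorted, which is the property the notes rely on.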
One of the interesting things about NoSQL is that the different systems don’t usually compete directly. We all have picked different tradeoffs. HBase is a strongly consistent system, so it does not have as good availability as an eventually consistent system like Cassandra. But we find that availability is good in practice! Since HBase is built on top of Hadoop, it has very good integration. For example, we have a very efficient bulk load feature, and the ability to run MapReduce into or out of HBase tables. HBase’s partitioning is range based, and data is sorted by key on disk. This is different from other systems, which use a hash function to distribute keys. This can be useful for guaranteeing that for a given user account, all of that user’s data can be read with just one disk seek. HBase automatically reshards when necessary, and regions automatically reassign if servers die. Adding more servers is simple: just turn them on. There is no “reshard” step. HBase is not just a key-value store; it is similar to Cassandra in that each row has a sparse set of columns which are efficiently stored.
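Range-based partitioning can be sketched as a binary search over region start keys. The boundaries, server names, and row-key format below are all invented for the example; a real cluster chooses its own split points.

```python
import bisect

# Hypothetical region boundaries: region i holds row keys in
# [region_start_keys[i], region_start_keys[i + 1]); the first region
# starts at the empty key and the last one is open-ended.
region_start_keys = [b"", b"g", b"n", b"t"]
region_servers = ["rs1", "rs2", "rs3", "rs1"]  # hypothetical server names


def find_region(row_key):
    """Return the index of the region whose key range contains row_key."""
    return bisect.bisect_right(region_start_keys, row_key) - 1


# Row keys that share a prefix (here an assumed "user/date" layout) are
# adjacent in sort order, so they fall into the same region.
user_rows = [b"tlipcon/2011-01-01", b"tlipcon/2011-02-01", b"tlipcon/2011-03-01"]
regions_hit = {find_region(key) for key in user_rows}

print(regions_hit, region_servers[find_region(b"cutting")])
```

With hash partitioning the same three keys would generally be scattered across partitions, which is the tradeoff the notes describe.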
Data layout: A traditional RDBMS uses a fixed schema and a row-oriented storage model. This has drawbacks if the number of columns per row could vary drastically. A semi-structured column-oriented store handles this case very well.
Transactions: A benefit that an RDBMS offers is strict ACID compliance with full transaction support. HBase currently offers transactions on a per-row basis. There is work being done to expand HBase's transactional support.
Query language: RDBMSs support SQL, a full-featured language for filtering, joining, aggregating, sorting, etc. HBase does not support SQL*. There are two ways to find rows in HBase: get a row by key or scan a table.
Security: In version 0.20.4, authentication and authorization are not yet available for HBase.
Indexes: In a typical RDBMS, indexes can be created on arbitrary columns. HBase does not have any traditional indexes**. The rows are stored sorted, with a sparse index of row offsets. This means it is very fast to find a row by its row key.
Max data size: Most RDBMS architectures are designed to store GBs or TBs of data. HBase can scale to much larger data sizes.
Read/write throughput limits: Typical RDBMS deployments can scale to thousands of queries/second. There is virtually no upper bound to the number of reads and writes HBase can handle.
* Hive/HBase integration is being worked on
** There are contrib packages for building indexes on HBase tables
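The idea of sorted rows with a sparse index can be illustrated as follows. This is a sketch of the general technique only: the block size, the row keys, and the `lookup` function are assumptions for the example and do not describe HBase's actual file format.

```python
import bisect

# Ten sorted rows split into fixed-size blocks (a block size of 3 is arbitrary).
rows = [(b"row%02d" % i, b"value%02d" % i) for i in range(10)]
BLOCK_SIZE = 3
blocks = [rows[i:i + BLOCK_SIZE] for i in range(0, len(rows), BLOCK_SIZE)]

# Sparse index: only the first row key of each block is kept.
block_first_keys = [block[0][0] for block in blocks]


def lookup(row_key):
    """Binary-search the sparse index, then scan a single block."""
    block_no = bisect.bisect_right(block_first_keys, row_key) - 1
    if block_no < 0:
        return None  # key sorts before the first block
    for key, value in blocks[block_no]:
        if key == row_key:
            return value
    return None


print(lookup(b"row07"), lookup(b"row99"))
```

The index holds one entry per block rather than one per row, so it stays small, while a lookup still touches only one block of data.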
People often want to know “the numbers” about a storage system. I would recommend that you test it yourself, because benchmarks always lie. But here are some general numbers about HBase. The largest cluster I’ve seen is 600 nodes, storing around 600TB. Most clusters are much smaller, only 5-20 nodes, hosting a few hundred gigabytes. Generally, writes take a few milliseconds, and throughput is on the order of thousands of writes per node per second, but of course it depends on the size of the writes. Reads take a few milliseconds if the data is in cache, or 10-30ms if disk seeks are required. Generally we don’t recommend that you store very large values in HBase. It is not efficient if the values stored are more than a few MB.
HBase is currently used in production at a number of companies. Here are a few examples. Facebook is using HBase for a new user-facing product which is going to launch very soon. They are also using HBase for analytics. StumbleUpon hosts large parts of its website from HBase, and also built an advertising platform based on HBase. Mozilla’s crash reporting infrastructure is based on HBase. If your browser crashes and you submit the crash to Mozilla, it is stored in HBase for later analysis by the Firefox developers.
So, if you are interested in Hadoop and HBase, here are some resources. The easiest way to install Hadoop is to use Cloudera’s Distribution for Hadoop from cloudera.com. You can also download the Apache source directly from hadoop.apache.org. You can get started on your laptop, in a VM, or running on EC2. I also recommend the free training videos on our website. The book Hadoop: The Definitive Guide is also really great, and it is available in Japanese translation.
Thanks very much for having me! If you have any questions, please feel free to ask now or send me an email. Also, we’re hiring both in the USA and in Japan, so if you’re interested in working on Hadoop or HBase, please get in touch.