This document discusses HBase, an open-source, non-relational, distributed database built on top of Hadoop. It provides an overview of why HBase is useful, examples of how Navteq uses HBase at scale, and considerations for designing HBase schemas and deploying HBase clusters, including hardware requirements and configuration tuning. The document also outlines some desired future features for HBase like better tools, secondary indexes, and security improvements.
2. Topics
Why HBase?
HBase Use Cases – HBase @ NAVTEQ
Design Considerations
Hardware/Deployment Considerations
Practical Tips (Tuning/Optimization)
Wanted Features
Ravi Veeramachaneni HBase – In Practice 2
3. Hadoop Benefits
• Stores (HDFS) and processes (MapReduce) large amounts of data
• Scales to hundreds and thousands of nodes
• Inexpensive (no license cost, low-cost hardware)
• Fast (1 TB sorted in 62 s, 1 PB in 16.25 h*)
• Availability (failover built into the platform)
• Data recoverability (a failure should not result in any data loss)
• Replication (3-way replication out of the box, configurable)
• Better throughput (time to read the whole dataset matters more than latency in reading the first record)
• Write-once, read-many-times access pattern
• Works well with structured, unstructured, or semi-structured data
*YDN Blog: Jim Gray’s benchmark @ http://developer.yahoo.com/blogs/hadoop/posts/2009/05/hadoop_sorts_a_petabyte_in_162/
4. But …
Not so good at, or does not support:
• Random access
• Updating data or files in place (writes always append at the end of the file)
• Applications that require low-latency access to data
• Lots of small files
• Multiple writers
• Not a solution for every data problem
5. Featuring HBase
HBase scales (runs on top of Hadoop)
HBase provides fast table scans for time ranges and fast key-based lookups
HBase stores null values for free
• Saves both disk space and disk I/O time
HBase supports unstructured/semi-structured data through column families
HBase has built-in version management
MapReduce data input
• Tables are sorted and have unique keys
• Reducer is often optional
• Combiner not needed
Strong community support and wide adoption
6. HBase Use Cases
To solve Big Data problems:
Sparse data (un- or semi-structured)
Cost-effectively scalable
Versioned data
Other features that may interest you:
Linear distribution of data across the data nodes
Rows are stored in byte-lexicographic sorted order
Atomic read/write/update
Data access – random and sequential reads and writes
Automatic replication of data for HA
But not for every data problem
7. NAVTEQ’s Use Case
Content is
– Constantly growing (into the high terabytes)
– Sparse and unstructured
– Provided in multiple data formats
– Ingested, processed, and delivered in transactional and batch modes
Content breadth
– 100s of millions of content records
– 100s of content suppliers + community input
Content depth
– On average, a content record has 120 attributes
– Certain types of content have more than 400 attributes
– Content is classified across 270+ categories
8. Content Processing High-level Overview
[Architecture diagram: bulk content sources, customer/community UGC, and merchant data and media enter through batch and transactional APIs; records receive a Place ID from the Place Registry and a Location ID from Location Referencing; Source & Blended Record Management feeds a Tiered Quality System; publishing delivers content in real time and on demand for bulk delivery, search, and mobile devices.]
9. HBase @ NAVTEQ
Started in 2009, HBase 0.19.x (Apache)
• 8-node VMware sandbox cluster
• Flaky, unstable, region server failures
• Switched to CDH
Early 2010, HBase 0.20.x (CDH2)
• 10-node physical sandbox cluster
• Still had a lot of challenges: RS failures, META corruption
• Cluster expanded significantly with multiple environments
Current (HBase 0.90.3)
• Moved to the CDH3u1 official release
• Multiple teams/projects using the Hadoop/HBase implementation
• Working on Hive/HBase integration, Oozie, Lucene/Solr integration, Cloudera Enterprise, and a few others
10. Measured Business Value
Scalability & deployment
• Spikes are handled by simply adding nodes
• No code changes or new deployment needed
• From 15 to 30 to 60 nodes and more, as data grows
• Deployments are well managed and controlled (from 12–16 hours down to < 2 hours)
Speed to market
• Supports real-time transactions (instead of quarterly updates)
• Batch updates are handled more efficiently (from days to hours)
Faster supplier on-boarding
• Flexible, externally managed business rules
Cheaper than the existing solution
• <$2M vs. $12M (based on projected growth)
11. HBase & Zookeeper
ZK – distributed coordination service
• Coordinates messages sent across the network between nodes (handles network failures, etc.)
HBase depends on ZK and authorizes ZK to manage cluster state
HBase hosts key information in ZK
• Location of the root catalog table
• Address of the current cluster master
• Bootstrapping a client connection to an HBase cluster
Clients connect to the ZK quorum first
• To learn the location of -ROOT-
• Clients consult -ROOT- to find the location of the .META. region
• Clients then do a lookup against the found .META. region to find the hosting user-space region and its location
• Clients cache all of the above for future traversals
12. Design Considerations
Database/schema design
• Transition to a column-oriented or flat schema
Understand your access pattern
Row-key design/implementation
• Sequential keys
• Suffer from poor distribution of load, but make good use of the block cache
• Can be addressed by pre-splitting the regions
• Randomized keys give better distribution
• Achieved through hashing on key attributes – SHA-1 or MD5
• But suffer on range scans
Too many column families (NOT good)
• Initially we had about 30 or so, now reduced to 8
Compression
• LZO or Snappy (20% better than LZO) – block level (default)
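The "hash on key attributes" idea can be sketched as a small helper (hypothetical code, not NAVTEQ's; the bucket count and key format are assumptions):

```python
import hashlib

def salted_key(natural_key: str, buckets: int = 16) -> bytes:
    """Build an HBase row key as <salt>-<natural key>.

    An MD5-derived salt prefix spreads otherwise-sequential keys across a
    fixed number of pre-split regions. The salt is deterministic, so point
    reads can recompute it from the natural key alone.
    """
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets        # deterministic bucket in [0, buckets)
    return f"{bucket:02d}-{natural_key}".encode("utf-8")
```

This buys even write distribution at the cost the slide notes: a range scan over the natural key order now has to run once per salt bucket and merge the results.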
13. Design Considerations
Serialization
• Avro didn’t work well – deserialization issues
• Developed a configurable serialization mechanism that uses JSON for everything except the Date type
Secondary indexes
• Were using ITHBase and IHBase from contrib – they don’t work well
• Redesigned the schema to remove the need for an index
• We still need them, though
Performance
• Several tunable parameters
• Hadoop, HBase, OS, JVM, networking, hardware
Scalability
• Interfacing with real-time (interactive) systems from a batch-oriented system
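A "JSON except Date" serializer of the kind described above could look like this (a hypothetical sketch; the `__type__` tag and ISO-8601 convention are assumptions, not NAVTEQ's actual wire format):

```python
import json
from datetime import datetime

def encode_record(record: dict) -> str:
    """Serialize a record to JSON, tagging datetime values explicitly."""
    def default(value):
        if isinstance(value, datetime):
            return {"__type__": "date", "iso": value.isoformat()}
        raise TypeError(f"unserializable type: {type(value).__name__}")
    return json.dumps(record, default=default)

def decode_record(payload: str) -> dict:
    """Reverse of encode_record: restore tagged dates as datetime objects."""
    def hook(obj):
        if obj.get("__type__") == "date":
            return datetime.fromisoformat(obj["iso"])
        return obj
    return json.loads(payload, object_hook=hook)
```

Tagging the one type JSON cannot represent, instead of switching formats entirely, keeps the stored cell values human-readable while still round-tripping dates losslessly.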
15. Hardware/Deployment Considerations
Hardware (Hadoop+HBase)
• Data node – 24 GB RAM, 8 cores, 4x1TB (64 GB, 24 cores, 8x2TB)
• 6 mappers and 6 reducers per node (16 mappers, 4 reducers)
• Memory allocation by process
• Data Node – 1 GB (2 GB)
• Task Tracker – 1 GB (2 GB)
• Map tasks – 6x1 GB (16x1.5 GB)
• Reduce tasks – 6x1 GB (4x1.5 GB)
• Region Server – 8 GB (24 GB)
• Total allocation: 24 GB (64 GB)
Deployment
• Do not run ZK instances on data nodes; have a separate ZK quorum (3 minimum)
• Do not run the HMaster on the NameNode
• Avoid a SPOF for the HMaster (run additional master(s))
16. HBase Configuration/Tuning
Configuring HBase
• Configuration is the key
• Many moving parts – typos, configs out of sync
• Operating system
• Raise the open-file limit (ulimit) to 32K or even higher (/etc/security/limits.conf)
• Lower vm.swappiness, or set it to 0
• HDFS
• Adjust the block size based on the use case
• Increase xceivers to 2047 (dfs.datanode.max.xceivers)
• Set the socket write timeout to 0 (dfs.datanode.socket.write.timeout)
• HBase
• Needs more memory
• No swapping – the JVM hates it
• GC pauses can cause timeouts or RS failures (read the article posted by Todd Lipcon on avoiding full GCs)
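The OS-level settings above translate to config fragments along these lines (values are the ones quoted on the slide; the `hdfs`/`hbase` user names are assumptions about which accounts run the daemons):

```
# /etc/security/limits.conf – raise the open-file limit to 32K
hdfs   -  nofile  32768
hbase  -  nofile  32768

# /etc/sysctl.conf – discourage the kernel from swapping out the JVM heap
vm.swappiness = 0
```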
17. HBase Configuration/Tuning
HBase
• Per-cluster
• Turn off the block cache if the hit ratio is low (hfile.block.cache.size, default 20%)
• Per-table
• MemStore flush size (hbase.hregion.memstore.flush.size, default 64 MB, and hbase.hregion.memstore.block.multiplier, default 2)
• Max file size (hbase.hregion.max.filesize, default 256 MB)
• Per-CF
• Compression
• Bloom filter
• Per-RS
• Amount of heap in each RS to reserve for all MemStores (hbase.regionserver.global.memstore.upperLimit, default 0.4)
• MemStore flush size
• Max file size
• Per-SF
• Maximum number of StoreFiles per store to allow (hbase.hstore.blockingStoreFiles, default 7)
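The per-CF and per-table knobs above are set from the HBase shell; a sketch (table and CF names are examples, example values rather than the defaults, and attribute syntax varies a little between HBase versions):

```ruby
# Per-CF settings at table creation: compression and a row-level Bloom filter
create 'content', {NAME => 'cf', COMPRESSION => 'SNAPPY', BLOOMFILTER => 'ROW'}

# Per-table settings via table attributes (0.90-era shell requires disabling first)
disable 'content'
alter 'content', METHOD => 'table_att', MAX_FILESIZE => '1073741824'       # 1 GB
alter 'content', METHOD => 'table_att', MEMSTORE_FLUSHSIZE => '134217728'  # 128 MB
enable 'content'
```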
18. HBase Configuration/Tuning
• HBase
• Write (put) optimization (Ryan Rawson’s HUG8 presentation – HBase importing)
– hbase.regionserver.global.memstore.upperLimit=0.3
– hbase.regionserver.global.memstore.lowerLimit=0.15
– hbase.regionserver.handler.count=256
– hbase.hregion.memstore.block.multiplier=8
– hbase.hstore.blockingStoreFiles=25
• Control the number of store files (hbase.hregion.max.filesize)
Security
• Still in flux; robust RBAC is needed
Reliability
• The NameNode is a SPOF
• HBase is sensitive to region server failures
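As an hbase-site.xml fragment, the write-optimization values listed above (taken directly from the slide; tune against your own workload) would read:

```xml
<!-- hbase-site.xml: write-heavy import settings quoted on the slide -->
<property>
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.3</value>
</property>
<property>
  <name>hbase.regionserver.global.memstore.lowerLimit</name>
  <value>0.15</value>
</property>
<property>
  <name>hbase.regionserver.handler.count</name>
  <value>256</value>
</property>
<property>
  <name>hbase.hregion.memstore.block.multiplier</name>
  <value>8</value>
</property>
<property>
  <name>hbase.hstore.blockingStoreFiles</name>
  <value>25</value>
</property>
```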
19. Desired Features
Better operational tools for using Hadoop and HBase
• Job management, backup, restore, user provisioning, general administrative tasks, etc.
Support for secondary indexes
Full-text indexing and search (Lucene/Solr integration?)
HA support for the NameNode
Data replication for HA & DR
Security at the table, CF, and row level
Good documentation (it’s getting better, though – Lars George’s book is now out)