HugeTable is an application-oriented structured data storage system designed to handle huge data volumes across multiple application areas at China Mobile. It is built on Hadoop and HBase and provides SQL support, fast index queries, multi-application support, and CRUD operations. HugeTable has evolved through several versions, adding features such as global indexing, secondary indexing, schema support, and application integration. It uses HBase as its index store and offers a range of APIs, administration tools, and solutions for telecommunications applications at China Mobile that require analytics over large datasets.
3. Motivations
Huge data volumes
- Total data volume: several PB per system
- Daily data volume: several TB per system
- Longer retention periods: several months
- Big growth potential: 200% increase in some areas
Multiple application areas
- Data warehouse, BOSS, BI, NMS, Internet, ...
- Data integration
- Traditional application model + application solutions
Requirements
- Scalable, highly available, reliable, affordable
- SQL support and fast index queries
- Multiple-application support and sensitive data handling
- CRUD support, statistics & reporting
4. Hadoop: Raw Techniques
- HDFS: distributed file system with fault tolerance
- MapReduce: parallel programming environment over HDFS
- Similar to the situation of POSIX API + local FS
High-level toolkits have been initiated:
- Yahoo: Pig/Latin
- Business.com: CloudBase (Hadoop + JDBC)
- China Mobile: BC-PDM
- Facebook: Hive/SQL
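The map/shuffle/reduce split these toolkits build on can be illustrated with a small single-process sketch. This is not Hadoop or HugeTable code — the class and method names are ours — it just shows the programming model the slide names, applied to word counting:

```java
import java.util.*;

// Single-process sketch of the MapReduce model: map -> shuffle/group -> reduce.
// A real Hadoop job would implement Mapper/Reducer and run over HDFS splits.
public class MapReduceSketch {

    // map: one input line -> (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\s+")) {
            if (!w.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(w, 1));
        }
        return out;
    }

    // reduce: (word, [1, 1, ...]) -> total count for that word
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // The framework's job: run map over all inputs, group by key
    // (the "shuffle"), then run reduce once per key.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines)
            for (Map.Entry<String, Integer> kv : map(line))
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = run(Arrays.asList("hadoop stores data", "hadoop processes data"));
        System.out.println(counts); // {data=2, hadoop=2, processes=1, stores=1}
    }
}
```

Because map runs per input split and reduce per key group, both phases parallelize across the cluster with no shared state — which is exactly what makes the raw technique scale.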
5. Hive: A Petabyte-Scale Data Warehouse
Features:
- Schema support
- Pluggable storage engine interface
- SQL-to-MapReduce translation
- xDBC driver
- Tools: HQL console
- Admin: HWI (Hive Web Interface)
Usage scenarios:
- Reporting
- Ad hoc analysis
- Machine learning
- Others: log analysis, trend detection
Facebook runs huge clusters: >1000 nodes.
Source: ICDE 2010 / Facebook
6. HBase: Structured Storage of Sparse Data for Hadoop
Features:
- Column families
- ACID
- Optimized R/W
- BigTable I/F + BU
- Tools: HBase Shell
- Admin: Jetty-based
Usage scenarios:
- Social services
- MapReduce analysis & analytics
- Content repository: web pages, wiki, RSS
- Near-realtime reporting
- ... replacing SQL systems
Source: ApacheCon 2009 / HBase
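The sparse, column-family data model named above can be sketched in a few lines. This toy class is ours (not the real HBase client API); it shows why absent columns cost nothing and why sorted row keys make range scans natural:

```java
import java.util.*;

// Toy in-memory model of HBase's layout: sorted row keys, column
// families, sparse cells. Illustrative only; names are ours.
public class SparseTableSketch {
    // rowKey -> (family:qualifier -> value); TreeMap keeps rows sorted,
    // which is what makes range scans cheap.
    private final NavigableMap<String, Map<String, String>> rows = new TreeMap<>();

    public void put(String rowKey, String family, String qualifier, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(family + ":" + qualifier, value);
    }

    public String get(String rowKey, String family, String qualifier) {
        Map<String, String> row = rows.get(rowKey);
        return row == null ? null : row.get(family + ":" + qualifier);
    }

    // Range scan over [startRow, stopRow): the access pattern that
    // MapReduce analysis and near-realtime reporting both rely on.
    public List<String> scanRowKeys(String startRow, String stopRow) {
        return new ArrayList<>(rows.subMap(startRow, true, stopRow, false).keySet());
    }

    public static void main(String[] args) {
        SparseTableSketch t = new SparseTableSketch();
        t.put("row1", "content", "html", "<html>...</html>");
        t.put("row3", "meta", "lang", "en"); // row2 absent: the table is sparse
        System.out.println(t.scanRowKeys("row1", "row9")); // [row1, row3]
    }
}
```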
7. HugeTable: Application-Oriented Structured Data Storage System
HugeTable addresses the missing blocks:
- Index store & query optimizer
- Client interfaces, tools, and admin
- Access control lists
- Insert, Update and Delete
- HFile with index; CF store
- Web-based administration covering data, config, FM, logs and performance
Build solutions for telco applications:
- Network Management System (NMS)
- Value-Added Services (VAS)
- Business Intelligence (BI)
- Other areas
8. A Brief History of HugeTable
HT-p1 (2008):
1. Connect Hive with HBase
2. Partial xDBC/SQL support
3. Integrate HBase with ZK before its official release
4. Secondary index
5. Schema support
6. ACL support
7. SQL console
HT-p2 (2009):
1. Move to higher versions of Hive, Hadoop and HBase
2. Global indexing
3. Support HFile and CF in Hive
4. Secondary index
5. Multiple-DB support
6. ACL support
7. MR & Scan I/F
8. Loader tools, HT-Client
9. Admin portal
10. JDBC remote console
HT-p3 (2010):
1. HBase-based
2. New storage engine
3. Fruitful external I/F
4. Application solutions, and many other improvements
10. HBase as HugeTable Index Store
- Index DDL: Create Index, Drop Index, Find Index
- Index-aware queries: "Select ... using index xxx" and "Select ... where idxcol ..."
- The query engine finds indexes through the index metadata and reads index data from HBase
- The load service (HT Loader) writes index data to HBase and checks indexes during load
11. Index Store Implementation
- Primary index: index into the data file
- Secondary index: index into the primary index
- Exact match and range scan
- Integrated with Hive QL and other modules
Benchmark (20 nodes, 1 TB/node):

                      Hive                 HT-p1                HT-p2
Memory consumption    No additional cost   8 GB/node per TB     2 GB/node per TB
Load speed            20 MB/s per node     2.5 MB/s per node    >5 MB/s per node
                      (no index)           (primary index)      (primary index)
Index query           N/A                  <10 sec              <10 sec
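The two-level scheme above can be sketched as follows. All names are ours and the structures are in-memory stand-ins (HugeTable keeps them in HBase over data files in HDFS); the point is the lookup chain: secondary index -> primary index -> data file position:

```java
import java.util.*;

// Sketch of the primary/secondary index design: the primary index maps
// a row key to its position in the data file; the secondary index maps
// an indexed column value back to primary-index keys.
public class IndexStoreSketch {
    private final List<String> dataFile = new ArrayList<>();                     // stand-in for HDFS records
    private final NavigableMap<String, Integer> primary = new TreeMap<>();       // key -> file position
    private final NavigableMap<String, List<String>> secondary = new TreeMap<>(); // value -> keys

    public void load(String key, String indexedValue, String record) {
        primary.put(key, dataFile.size());
        secondary.computeIfAbsent(indexedValue, v -> new ArrayList<>()).add(key);
        dataFile.add(record);
    }

    // Exact match via the secondary index: value -> keys -> positions -> records.
    public List<String> queryByValue(String value) {
        List<String> out = new ArrayList<>();
        for (String key : secondary.getOrDefault(value, Collections.emptyList()))
            out.add(dataFile.get(primary.get(key)));
        return out;
    }

    // Range scan on the primary key, also supported per the slide.
    public List<String> rangeScan(String fromKey, String toKey) {
        List<String> out = new ArrayList<>();
        for (int pos : primary.subMap(fromKey, true, toKey, false).values())
            out.add(dataFile.get(pos));
        return out;
    }

    public static void main(String[] args) {
        IndexStoreSketch idx = new IndexStoreSketch();
        idx.load("k1", "beijing", "rec-a");
        idx.load("k2", "shanghai", "rec-b");
        idx.load("k3", "beijing", "rec-c");
        System.out.println(idx.queryByValue("beijing")); // [rec-a, rec-c]
        System.out.println(idx.rangeScan("k1", "k3"));   // [rec-a, rec-b]
    }
}
```

Building the index at load time is what the benchmark's load-speed column pays for: primary-index maintenance roughly halves ingest throughput in HT-p1, recovered partly in HT-p2.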
12. HugeTable IUD Support
Goal: support Insert, Update and Delete (IUD) on application data.
- IUD statements are resolved against the metadata to find the IUD table, and the deltas are written to an IUD table in HBase
- Select statements go through the query engine, which reads the base HT data from HDFS and the IUD data from HBase
- An offline merger folds the IUD table back into the base data on HDFS
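The base-plus-delta design above amounts to merge-on-read. A minimal sketch, with plain maps standing in for the HDFS base data and the HBase IUD table (all names ours):

```java
import java.util.*;

// Sketch of IUD over an append-only store: immutable base data plus a
// delta table of inserts/updates/deletes. Reads merge the two; an
// offline merger periodically folds the deltas back into the base.
public class IudSketch {
    private final Map<String, String> baseData = new HashMap<>(); // stand-in for HDFS (bulk-loaded)
    private final Map<String, String> iudTable = new HashMap<>(); // stand-in for HBase; null = delete

    public void bulkLoad(String key, String value) { baseData.put(key, value); }
    public void upsert(String key, String value)   { iudTable.put(key, value); }
    public void delete(String key)                 { iudTable.put(key, null); }

    // Read path: the delta wins over base; a null delta hides the base row.
    public String read(String key) {
        if (iudTable.containsKey(key)) return iudTable.get(key);
        return baseData.get(key);
    }

    // Offline merger: apply all deltas to base, then clear the IUD table.
    public void merge() {
        for (Map.Entry<String, String> e : iudTable.entrySet()) {
            if (e.getValue() == null) baseData.remove(e.getKey());
            else baseData.put(e.getKey(), e.getValue());
        }
        iudTable.clear();
    }

    public static void main(String[] args) {
        IudSketch t = new IudSketch();
        t.bulkLoad("cdr1", "old");
        t.upsert("cdr1", "new");
        t.delete("cdr2");
        System.out.println(t.read("cdr1")); // new
        t.merge();
        System.out.println(t.read("cdr1")); // new (now in the base data)
    }
}
```

Keeping mutations in a small side table is what lets HugeTable offer update/delete on top of write-once HDFS files without rewriting them on every statement.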
13. HugeTable Access Control
Goal: support multiple users from multiple applications, without mutual trust.
Database privileges:
1. Metadata: Index, Create, Drop
2. User data: IUD
User access levels:
1. System administrator
2. User manager
3. User
Privileges are granted and revoked through the ACL module, which checks privileges against the metadata for DDL/DML statements and for the loader/portal.
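The grant/revoke/check cycle above can be sketched as a small ACL module. The privilege names follow the slide; the API itself is ours, not HugeTable's:

```java
import java.util.*;

// Sketch of the ACL module: privileges per (user, object), checked
// before any DDL/DML statement or loader/portal operation runs.
public class AclSketch {
    public enum Privilege { CREATE, DROP, INDEX, INSERT, UPDATE, DELETE }

    // user -> object (e.g. a table) -> granted privileges
    private final Map<String, Map<String, Set<Privilege>>> acl = new HashMap<>();

    public void grant(String user, String object, Privilege p) {
        acl.computeIfAbsent(user, u -> new HashMap<>())
           .computeIfAbsent(object, o -> EnumSet.noneOf(Privilege.class)).add(p);
    }

    public void revoke(String user, String object, Privilege p) {
        Set<Privilege> s = acl.getOrDefault(user, Collections.emptyMap()).get(object);
        if (s != null) s.remove(p);
    }

    // The check every statement goes through; default-deny, so two
    // applications without mutual trust cannot touch each other's data.
    public boolean check(String user, String object, Privilege p) {
        return acl.getOrDefault(user, Collections.emptyMap())
                  .getOrDefault(object, Collections.emptySet()).contains(p);
    }

    public static void main(String[] args) {
        AclSketch acl = new AclSketch();
        acl.grant("app_a", "cdr_table", Privilege.INSERT);
        System.out.println(acl.check("app_a", "cdr_table", Privilege.INSERT)); // true
        System.out.println(acl.check("app_b", "cdr_table", Privilege.INSERT)); // false
    }
}
```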
14. Administration Portal
Goal: a unified HugeTable management point that decreases management effort.
- Data management: DB/TBL/IDX
- User management: add/delete/modify
- Monitoring & FM: logs/alerts/services
- Configuration: deploy/setup
15. HugeTable Application API
Various kinds of applications are served through three APIs:
JDBC/SQL API
- Migration of traditional database applications
- For SQL developers
- Batch processing & interactive
MapReduce API
- Compatible with the Hadoop MR API
- For data analysis, e.g. data mining
- Works with HT records
- Access control
BigTable API
- BigTable/HBase-style API, on the HFile2 format
- For NoSQL applications
- Range scan, key-value access
- Access control

BigTable API example (scan):

    Table table = new Table("gdr", "admin", "admin");
    String[] families = new String[] {"default"};
    String[] partitions = new String[] {"dt=20100317"};
    int limit = 10;
    TableScannerInterface tsi = table.getScanner(
        new byte[0], new byte[] {Byte.MAX_VALUE}, families, partitions);
    for (int i = 0; i < limit; ++i) {
        GroupValue gv = tsi.next();
        for (String family : families) {
            System.out.println(family + " = " + Bytes.toString(gv.getByteValue(family)));
        }
    }

MapReduce API signatures:

    public void map(LongWritable key, HugeRecord value,
        OutputCollector<HugeRecordRowKey, HugeRecord> output,
        Reporter reporter);

    public void reduce(HugeRecordRowKey key, Iterator<HugeRecord> values,
        OutputCollector<HugeRecordRowKey, HugeRecord> output,
        Reporter reporter);
16. HugeTable-based Telco Application Solutions
Telco applications bring heavy requirements, e.g.:
- Batch processing
- Complex data analysis
- Interactive queries on CDRs
- Statistics and reporting
In the solution architecture, data sources feed a data aggregator. The HugeTable cluster serves as the mass data store for batch processing, statistics, complex analysis and interactive complex queries, with data-mining toolkits on top; a database / data warehouse alongside it serves interactive simple queries. Both feed reporting for the telco application.
17. Future Work
- Column storage engine: file format, compression
- Local index
- Global index
- Query optimization: index-based join optimization
- Load optimization: parallel load
- Application solutions