HBaseCon 2013: Using Apache HBase for Large Matrices

Cloudera, Inc.
HBase for Dealing with
Large Matrices
Who am I?
Leads data team at Dilisim
Researcher at Anadolu University
Machine Learning
Some big problems
Classifying huge text collections
Recommending to millions of users
Predicting links in a social network
Recommender Systems
Recommenders take large sparse matrices as input
How would you store a millions x millions matrix?
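As a rough sketch of why the sparse layout matters: a dense layout needs a slot for every user-item pair, while a sparse layout keeps only the observed ratings. The sizes, the sample ratings, and the dictionary-of-dictionaries layout below are illustrative assumptions, not figures from the talk.

```python
# Illustrative sizes only (assumed, not from the talk).
num_users = 1_000_000
num_items = 1_000_000
bytes_per_value = 8  # one 64-bit float per cell

# Dense storage needs a slot for every (user, item) pair.
dense_bytes = num_users * num_items * bytes_per_value

# Sparse storage keeps only observed ratings: {user: {item: rating}}.
ratings = {
    0: {3: 4.0, 17: 2.0},
    1: {3: 1.0},
    5: {42: 3.0, 99: 5.0, 100: 2.0},
}


def get_rating(matrix, user, item):
    """Return the stored rating, or 0.0 for an unobserved cell."""
    return matrix.get(user, {}).get(item, 0.0)


# Number of cells actually held in memory by the sparse layout.
stored_cells = sum(len(row) for row in ratings.values())
```

With these assumed sizes the dense layout would need 8 * 10^12 bytes, while the sparse layout grows only with the number of observed ratings.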
Recommender Systems
[Figure: input matrix of ratings (values from 0.00 to 4.00) over m users and n items]
Recommender Systems
State-of-the-art recommender systems learn
large models
One factor vector per user and per item
One parameter vector (on side info) per user and per item
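A minimal sketch of how per-user and per-item factor vectors can produce a score, assuming the common matrix-factorization form where the score is the dot product of the two vectors. The slides do not spell out the scoring function, so the dot product, the names, and the sample values here are assumptions.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical models with k = 2 factors per row.
user_factors = {"u1": [0.5, 1.0], "u2": [1.0, 0.0]}
item_factors = {"i1": [2.0, 1.0], "i2": [0.0, 3.0]}


def predict(user, item):
    """Score a user-item pair from one user row and one item row."""
    return dot(user_factors[user], item_factors[item])
```

Scoring a pair reads exactly one row from each model, which is why random access to rows matters for the output model.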
Recommender Systems
[Figure: the input matrix (m users, n items) alongside the learned User Model (m x k) and Item Model (n x k)]
Learning Process
What does a machine learning algorithm need to do with that matrix?
Machine Learning - Techniques
Batch Learning
All parameters are updated once per
iteration
Machine Learning - Techniques
Batch Learning
Updates can be calculated in parallel
using MapReduce
(SequenceFile might be enough)
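The batch pattern can be sketched on a single machine with Python's built-in `map` and `functools.reduce` standing in for the two MapReduce phases. The one-parameter least-squares model, the data, and the function names are assumptions chosen to keep the example short; they are not the recommender model from the talk.

```python
from functools import reduce

# Toy training set: (x, y) pairs generated from y = 2x.
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]


def example_gradient(w, example):
    """Gradient of 0.5 * (w * x - y)^2 with respect to w for one example."""
    x, y = example
    return (w * x - y) * x


def batch_step(w, data, learning_rate):
    """One iteration: every example contributes, then w is updated once."""
    gradients = map(lambda ex: example_gradient(w, ex), data)  # map phase
    total = reduce(lambda a, b: a + b, gradients, 0.0)  # reduce phase
    return w - learning_rate * total / len(data)


w = 0.0
for _ in range(200):
    w = batch_step(w, examples, learning_rate=0.1)
```

Each per-example gradient depends only on the current parameters and that example, so the map phase needs only a sequential pass over the input.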
Machine Learning - Techniques
Batch Learning
Output model should provide random
access to rows
Machine Learning - Techniques
Online Learning
Parameters are updated per training
example
Machine Learning - Techniques
Online Learning
Each update modifies a row
Needs random access while learning
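A minimal sketch of an online update, assuming a plain stochastic gradient step on a dot-product factor model (an assumption; the slides do not name the algorithm). Each training example reads and rewrites only the one user row and the one item row it refers to, which is where the random-access requirement comes from.

```python
def sgd_update(user_factors, item_factors, user, item, rating, learning_rate=0.1):
    """Apply one stochastic gradient step for a single (user, item, rating)."""
    p = user_factors[user]  # random read of one user row
    q = item_factors[item]  # random read of one item row
    error = rating - sum(a * b for a, b in zip(p, q))
    # Both new rows are computed from the old values before writing back.
    new_p = [a + learning_rate * error * b for a, b in zip(p, q)]
    new_q = [b + learning_rate * error * a for a, b in zip(p, q)]
    user_factors[user] = new_p  # random write of one user row
    item_factors[item] = new_q  # random write of one item row
    return error


# Hypothetical starting models with k = 2.
user_factors = {"u1": [0.1, 0.1], "u2": [0.5, 0.5]}
item_factors = {"i1": [0.1, 0.1]}
first_error = sgd_update(user_factors, item_factors, "u1", "i1", rating=1.0)
```

Rows that the example does not mention, such as `u2` here, are never touched.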
Machine Learning - Techniques
Online Learning
Output model should provide random
access to rows
Deployment Process
How do you decide to deploy a machine
learning model in production?
Machine Learning - Deployment
Usual process
Experiment on prototype -> Works well? -> (Y) Deploy in production; (N) back to experimenting
Machine Learning - Deployment
How would you easily turn your prototype into a production system?
Common matrix interface for in-memory and persistent versions
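The idea of one interface with two backends can be sketched as follows. The class and method names are hypothetical and are not Mahout's API; a plain dict stands in for the persistent store so the example runs on its own.

```python
from abc import ABC, abstractmethod


class Matrix(ABC):
    """Hypothetical common interface shared by both backends."""

    @abstractmethod
    def get(self, row, col): ...

    @abstractmethod
    def set(self, row, col, value): ...


class InMemoryMatrix(Matrix):
    """Prototype backend: cells live in a local dict."""

    def __init__(self):
        self._cells = {}

    def get(self, row, col):
        return self._cells.get((row, col), 0.0)

    def set(self, row, col, value):
        self._cells[(row, col)] = value


class StoreBackedMatrix(Matrix):
    """Persistent backend: cells live in any dict-like key-value store."""

    def __init__(self, store):
        self._store = store

    def get(self, row, col):
        raw = self._store.get(f"{row}_{col}")
        return float(raw) if raw is not None else 0.0

    def set(self, row, col, value):
        self._store[f"{row}_{col}"] = repr(value)


def row_sum(matrix, row, num_cols):
    """Algorithm code depends only on the interface, not on the backend."""
    return sum(matrix.get(row, c) for c in range(num_cols))
```

Code such as `row_sum` runs unchanged against either backend, which is the property that lets a prototype move to production without rewriting the algorithm.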
HBase Backed Matrix
Implements Mahout matrix
Dense or sparse
HBase Backed Matrix
Random access to cells
Random access to rows
Iteration over rows
Lazy loading while iterating
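Lazy loading during iteration can be sketched with a generator that fetches each row only when the caller asks for it. The `fetch_row` callback is a hypothetical stand-in for a read against the backing store, and the `fetched` list exists only to make the fetch order visible.

```python
def iterate_rows(fetch_row, num_rows):
    """Yield (index, row) pairs, fetching each row only on demand."""
    for index in range(num_rows):
        yield index, fetch_row(index)


fetched = []  # records which rows have actually been loaded


def fetch_row(index):
    """Hypothetical loader; a real one would read from the store."""
    fetched.append(index)
    return {0: float(index)}


rows = iterate_rows(fetch_row, 1000)
first_index, first_row = next(rows)  # only row 0 has been loaded so far
```

The other 999 rows are not read unless the iteration continues, so memory use stays independent of the number of rows.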
HBase Backed Matrix
Common interface for prototype and production
Easy to deploy (Model already persisted)
HBase Backed Matrix
Matrix operations with the existing mahout-math library
Logical Schema
Composite row keys:
12_0: data:value:0.41
12_9: data:value:0.41
12_22000: data:value:0.41
Logical Schema
Composite row keys:
Row access by scan
Cell access by get
Atomic row updates must be handled in the application
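The composite-key schema can be sketched with a small sorted key-value store standing in for HBase, which keeps rows ordered by key. `SortedStore`, `cell_key`, and `read_row` are hypothetical helpers written for this illustration, and string keys replace byte arrays.

```python
import bisect


class SortedStore:
    """In-memory stand-in for a store that keeps keys in sorted order."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        return self._values.get(key)

    def scan(self, start, stop):
        """Return (key, value) pairs with start <= key < stop."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


def cell_key(row, col):
    """Composite key '<row>_<col>', as in the schema above."""
    return f"{row}_{col}"


def read_row(store, row):
    """Row access by scan: collect every key that starts with '<row>_'."""
    prefix = f"{row}_"
    stop = prefix[:-1] + chr(ord(prefix[-1]) + 1)  # smallest key after the prefix
    return {int(k[len(prefix):]): v for k, v in store.scan(prefix, stop)}


store = SortedStore()
for row, col, value in [(12, 0, 0.41), (12, 9, 0.41), (12, 22000, 0.41),
                        (120, 5, 0.9), (13, 0, 0.7)]:
    store.put(cell_key(row, col), value)
```

A cell is a single `get` on its composite key, while a matrix row spans several store rows, so writing a whole matrix row is several independent puts. Unpadded keys sort lexicographically, so `12_22000` sorts before `12_9`.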
Logical Schema
Row indices as row keys
12: data:0:0.41 data:22000:0.41 data:9:0.41
Logical Schema
Row indices as row keys
Atomic updates are handled
automatically
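A sketch of the second schema, where the matrix row index is the row key and each column index is a qualifier in the same store row. `WideRowStore` is a hypothetical in-memory stand-in; its lock plays the role of the row-level atomicity that HBase provides for mutations to a single row.

```python
import threading


class WideRowStore:
    """Stand-in for a store where one matrix row is one store row."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()  # stands in for row-level atomicity

    def put_row(self, row, cells):
        """Write several cells of one row as a single atomic mutation."""
        with self._lock:
            self._rows.setdefault(row, {}).update(cells)

    def get_row(self, row):
        """Row access is a single read of one store row."""
        with self._lock:
            return dict(self._rows.get(row, {}))

    def get_cell(self, row, col):
        """Cell access reads one qualifier from the row; absent cells are 0.0."""
        return self.get_row(row).get(col, 0.0)


store = WideRowStore()
store.put_row(12, {0: 0.41, 9: 0.41, 22000: 0.41})
```

Because the whole matrix row lives under one row key, the application needs no extra coordination to update it consistently.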
Speed – Cell access/write
[Chart: GET and SET, row index as row key vs. composite row key]
Speed – Row access/write
[Chart: GET and SET, row index as row key vs. composite row key]
Code
github.com/gcapan/mahout/tree/hbase-matrix
Future Work
MatrixInputFormat
Might replace SequenceFile-based MapReduce inputs
Future Work – A little digression
Recommender Systems
Calculating score for a user-item
pair is easy with HBaseMatrix
Future Work – A little digression
Recommender Systems
top-N recommendation?
All candidate items for a user in
the user row as a nested entity
(See Ian Varley's HBase Schema
Design)
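One way the nested-entity idea could look, as a sketch: the user row carries the user's factors together with the candidate items, so a top-N request reads a single row. What the nested entity actually stores is not stated in the slides; holding each candidate's factor vector, and scoring by dot product, are assumptions.

```python
import heapq

# Hypothetical user row: the user's factors plus nested candidate items.
user_row = {
    "factors": [1.0, 0.5],
    "candidates": {
        "i1": [2.0, 0.0],
        "i2": [0.0, 2.0],
        "i3": [1.0, 4.0],
        "i4": [0.5, 0.0],
    },
}


def top_n(row, n):
    """Score every nested candidate and return the n best item ids."""
    user = row["factors"]
    scored = (
        (sum(a * b for a, b in zip(user, factors)), item)
        for item, factors in row["candidates"].items()
    )
    return [item for _, item in heapq.nlargest(n, scored)]
```

`heapq.nlargest` avoids sorting the full candidate list when n is small.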
Thank you!
HBaseCon 2013: Using Apache HBase for Large Matrices

  • 1. HBase for Dealing with Large Matrices
  • 2. Who am I? Leads data team at Dilisim Researcher at Anadolu University
  • 3. Machine Learning Some big problems Classifying huge text collections Recommending to millions of users Predicting links in a social network
  • 4. Recommender Systems Recommenders input large sparse matrices How would you input a millions X millions matrix?
  • 5. Recommender Systems Input: a matrix of m users by n items (figure: a grid of values between 0.00 and 4.00, e.g. first row 3.00 0.00 2.00 0.00 4.00 2.00 3.00 2.00 1.00 3.00 0.00 …)
  • 6. Recommender Systems State-of-the-art recommender systems learn large models: one factor vector per user and per item, and one parameter vector (on side info) per user and per item
  • 7. Recommender Systems Input: a matrix of m users by n items (figure: a grid of values between 0.00 and 4.00). Output: User Model (m x k) and Item Model (n x k), both shown as grids of values between 0 and 1 (e.g. 0.54 0.48 0.83 0.75 0.28 …)
  • 8. Learning Process What does a machine learning algorithm need to do with that matrix?
  • 9. Machine Learning - Techniques Batch Learning All parameters are updated once per iteration
  • 10. Machine Learning - Techniques Batch Learning Updates can be calculated in parallel using MapReduce (SequenceFile might be enough)
  • 11. Machine Learning - Techniques Batch Learning Output model should provide random access to rows
  • 12. Machine Learning - Techniques Online Learning Parameters are updated per training example
  • 13. Machine Learning - Techniques Online Learning Each training example results in updates to a row, so learning needs random access
  • 14. Machine Learning - Techniques Online Learning Output model should provide random access to rows
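The access pattern on slides 12 to 14 can be made concrete with a minimal sketch of one online update for a factor model. The slides do not name a learning algorithm; stochastic gradient descent on squared error, the learning rate `lr`, the regularization constant `reg`, and every name below are assumptions for illustration. Plain dictionaries stand in for the user and item models.

```python
import random

def sgd_update(user_model, item_model, u, i, rating, lr=0.05, reg=0.01):
    """One online step: reads and rewrites exactly one user row and one item row."""
    p = user_model[u]  # random access to one row of the m x k user model
    q = item_model[i]  # random access to one row of the n x k item model
    err = rating - sum(a * b for a, b in zip(p, q))
    # Both new rows are computed from the old p and q before anything is stored.
    new_p = [a + lr * (err * b - reg * a) for a, b in zip(p, q)]
    new_q = [b + lr * (err * a - reg * b) for a, b in zip(p, q)]
    user_model[u] = new_p
    item_model[i] = new_q
    return err

def init_model(rows, k, seed):
    """Small random factor vectors, one per row index (hypothetical initializer)."""
    rng = random.Random(seed)
    return {r: [rng.uniform(0.0, 0.1) for _ in range(k)] for r in range(rows)}

user_model = init_model(3, 2, seed=0)
item_model = init_model(4, 2, seed=1)

# Repeatedly present the same training example (user 0 rated item 2 with 4.0).
errors = [abs(sgd_update(user_model, item_model, 0, 2, 4.0)) for _ in range(200)]
```

Each step touches only row 0 of the user model and row 2 of the item model, which is why a store with cheap random row reads and writes suits online learning better than a sequential file.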
  • 15. Deployment Process How do you decide to deploy a machine learning model in production?
  • 16. Machine Learning - Deployment Usual process: Experiment on prototype → Works well? → Y: Deploy in production; N: Experiment on prototype again
  • 17. Machine Learning - Deployment How would you turn your prototype into production easily? Common matrix interface for in-memory and persistent versions
  • 18. HBase Backed Matrix Implements Mahout matrix Dense or sparse
  • 19. HBase Backed Matrix Random access to cells Random access to rows Iteration over rows Lazy loading while iterating
  • 20. HBase Backed Matrix Common interface for prototype and product Easy to deploy (Model already persisted)
  • 21. HBase Backed Matrix Matrix operations with existing mahout-math library
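The idea of one matrix interface shared by the prototype and the deployed model (slides 17 to 21) might look like the following sketch. This is not Mahout's API (Mahout is a Java library); the class and method names are hypothetical, and a plain dictionary stands in for the persistent table.

```python
from abc import ABC, abstractmethod

class Matrix(ABC):
    """Hypothetical common interface; algorithms are written against this only."""

    @abstractmethod
    def get(self, row, col): ...

    @abstractmethod
    def set(self, row, col, value): ...

    @abstractmethod
    def row(self, row): ...  # returns {col: value} for the non-zero cells

    def dot_rows(self, r, other, s):
        """Sparse dot product of row r of this matrix with row s of another."""
        a, b = self.row(r), other.row(s)
        return sum(v * b.get(c, 0.0) for c, v in a.items())

class InMemoryMatrix(Matrix):
    """Prototype version: everything lives in nested dictionaries."""
    def __init__(self):
        self._rows = {}
    def get(self, row, col):
        return self._rows.get(row, {}).get(col, 0.0)
    def set(self, row, col, value):
        self._rows.setdefault(row, {})[col] = value
    def row(self, row):
        return dict(self._rows.get(row, {}))

class StoreBackedMatrix(Matrix):
    """Persistent version: every cell lives in an external store (a dict here)."""
    def __init__(self, store):
        self._store = store
    def get(self, row, col):
        return self._store.get((row, col), 0.0)
    def set(self, row, col, value):
        self._store[(row, col)] = value
    def row(self, row):
        return {c: v for (r, c), v in self._store.items() if r == row}

def fill(matrix):
    matrix.set(12, 0, 0.5)
    matrix.set(12, 9, 2.0)
    matrix.set(7, 9, 3.0)
    return matrix

prototype = fill(InMemoryMatrix())
persistent = fill(StoreBackedMatrix(store={}))
```

Because `dot_rows` is defined once on the interface, the same call works on both versions, which is the property that lets a prototype move to production without rewriting the algorithm.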
  • 22. Logical Schema Composite row keys: 12_0 → data:value:0.41, 12_9 → data:value:0.41, 12_22000 → data:value:0.41
  • 23. Logical Schema Composite row keys: Row access by scan Cell access by get Atomic row update should be handled in application
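A minimal sketch of the composite-key layout from slides 22 and 23, using a toy sorted table in place of HBase. HBase keeps rows sorted by key bytes, which is what makes the prefix scan work; the `SortedTable` class and the helper names are hypothetical, and no real HBase client call is shown.

```python
import bisect

class SortedTable:
    """Toy stand-in for a table whose rows are kept sorted by key."""
    def __init__(self):
        self._keys = []
        self._values = {}
    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value
    def get(self, key):
        return self._values.get(key)
    def scan(self, start, stop):
        """All (key, value) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]

def cell_key(row, col):
    return f"{row}_{col}"  # composite key, e.g. "12_22000"

def get_cell(table, row, col):
    return table.get(cell_key(row, col))  # cell access by get

def get_row(table, row):
    prefix = f"{row}_"
    # Stop key: the prefix with its last character incremented ("12_" -> "12`").
    stop = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return {int(k[len(prefix):]): v for k, v in table.scan(prefix, stop)}  # row access by scan

table = SortedTable()
for row, col, value in [(12, 0, 0.41), (12, 9, 0.41), (12, 22000, 0.41),
                        (120, 5, 0.7), (13, 1, 0.2)]:
    table.put(cell_key(row, col), value)  # one put per cell

row_12 = get_row(table, 12)
```

Two things are visible here: the scan returns columns in string order (0, 22000, 9) rather than numeric order, and writing a whole row takes one put per cell, so making a row update atomic is left to the application.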
  • 24. Logical Schema Row indices as row keys: 12 → data:0:0.41, data:9:0.41, data:22000:0.41
  • 25. Logical Schema Row indices as row keys Atomic updates are handled automatically
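The alternative layout from slides 24 and 25 stores one matrix row per table row, with the column index as the column qualifier. The sketch below uses a hypothetical `WideRowTable` as a stand-in for HBase; the `put_calls` counter is only there to show that a whole matrix row is written in a single operation, which is the unit HBase applies atomically.

```python
class WideRowTable:
    """Toy stand-in: each row key maps to a dictionary of column -> value."""
    def __init__(self):
        self._rows = {}
        self.put_calls = 0

    def put(self, row_key, columns):
        """One call writes any number of columns of a single row."""
        self.put_calls += 1
        self._rows.setdefault(row_key, {}).update(columns)

    def get(self, row_key, qualifier=None):
        row = self._rows.get(row_key, {})
        return dict(row) if qualifier is None else row.get(qualifier)

def set_row(table, row, values):
    # Row index is the row key; each column index becomes a qualifier in family "data".
    table.put(str(row), {f"data:{col}": v for col, v in values.items()})

def get_row(table, row):
    return {int(q.split(":", 1)[1]): v for q, v in table.get(str(row)).items()}

def get_cell(table, row, col):
    return table.get(str(row), f"data:{col}")

table = WideRowTable()
set_row(table, 12, {0: 0.41, 9: 0.41, 22000: 0.41})
```

Compared with composite keys, a row read becomes a single get instead of a scan, and a row write becomes a single put instead of one put per cell.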
  • 26. Speed – Cell access/write (chart: GET and SET, row index as row key vs. composite row key)
  • 27. Speed – Row access/write (chart: GET and SET, row index as row key vs. composite row key)
  • 29. Future Work MatrixInputFormat Might replace SequenceFile based MapReduce inputs
  • 30. Future Work – A little digression Recommender Systems Calculating score for a user-item pair is easy with HBaseMatrix
  • 31. Future Work – A little digression Recommender Systems top-N recommendation? All candidate items for a user in the user row as a nested entity (See Ian Varley's HBase Schema Design)
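Slides 30 and 31 can be illustrated with a small sketch. The slides do not state the scoring function; the dot product of a user factor vector and an item factor vector is assumed here, as is common for factor models. The row layout, the field names `factors` and `candidates`, and the sample values are all hypothetical.

```python
import heapq

# Each user row carries its factor vector and, as a nested entity,
# the list of candidate items for that user.
user_rows = {
    "u1": {"factors": [1.0, 0.5], "candidates": ["i1", "i2", "i3"]},
}

item_factors = {
    "i1": [2.0, 0.0],
    "i2": [0.0, 2.0],
    "i3": [1.0, 4.0],
    "i4": [9.0, 9.0],  # not a candidate for u1, so never scored
}

def score(user_vec, item_vec):
    """Score of a user-item pair: dot product of the two factor vectors."""
    return sum(a * b for a, b in zip(user_vec, item_vec))

def top_n(user_id, n):
    """Read one user row, score only its candidates, keep the n best."""
    row = user_rows[user_id]
    scored = ((score(row["factors"], item_factors[i]), i) for i in row["candidates"])
    return [item for _, item in heapq.nlargest(n, scored)]

best_two = top_n("u1", 2)
```

Keeping the candidates inside the user row means a top-N request needs one row read for the user plus one random row read per candidate item, instead of a pass over all n items.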