SlideShare a Scribd company logo
Use Hadoop, One Week, No Coding
© 2014 Kenshoo, Inc. Confidential and Proprietary Information
About Me
Noam Hasson
Team Leader for Big Data
• 12 years experience in web development
• Hadoop enthusiast since 2011
• noam.hasson@kenshoo.com
What is Kenshoo?
© 2014 Kenshoo, Inc. Confidential and Proprietary Information
• Benefits of Hadoop:
• Solve big data challenges
• See actual results in less than a week
• Use current RDMBS infrastructure
• No coding experience necessary
• See how with an actual case study
Agenda
© 2014 Kenshoo, Inc. Confidential and Proprietary Information
Kenshoo Data Infrastructure
• 250 MySQL & application server pairs, each pair dedicated to a single client
• Identical schema on all MySQL servers
• Different table size & capacities on each MySQL server
• Total size of more than 100 TB
X 250…
© 2014 Kenshoo, Inc. Confidential and Proprietary Information
The Challenge
• Avoid load-intensive queries on production clusters
• Reduce run-time for data analysis queries
• Run cross-server queries for clients deployed on more than one DB server
• Compare statistical information across several DB servers
© 2014 Kenshoo, Inc. Confidential and Proprietary Information
The Solution
• Use Sqoop to migrate tables to Hive
• Running SQL queries on Hive
• That’s it!
Sqoop - Full Table Import Hadoop
© 2014 Kenshoo, Inc. Confidential and Proprietary Information
Example Code
• Oneline Import
sqoop import --hive-import --connect
jdbc:mysql://ServerAddress/Database -m 10 --table DBTable --hive-
database DBTable --username dbUsername --password dbPassword --
hive-overwrite -z
• Query
hive
use myDatabase
Select * from myTable where field = ‘value’
© 2014 Kenshoo, Inc. Confidential and Proprietary Information
Performance
• Single-node Hardware
• 12 Hard drives each 4TB
• 12 cores, 2 x CPU Sockets
• 32GB memory
• Performance
• Import table of 300M rows took 3 hours
• Select count on 5.5 Billion rows took 90 minutes
• Group By on 5.5 Billion rows, with 1.1 Billion rows took 18 hours
© 2014 Kenshoo, Inc. Confidential and Proprietary Information
What We’ve Learned
• Use partitions
• Import directly to compressed files
• Compare the row count in the source and destination tables
• Import only the columns you need to query
• Use a full table import for easy & quick results
• Refine the number of map tasks used for import
• Adopt Hive for your Map-Reduce jobs
• Increase query speed by avoiding “order by”

More Related Content

What's hot

HashiCorp at Just Eat
HashiCorp at Just EatHashiCorp at Just Eat
HashiCorp at Just Eat
Andrew Brown
 
MongoDB and Amazon Web Services: Storage Options for MongoDB Deployments
MongoDB and Amazon Web Services: Storage Options for MongoDB DeploymentsMongoDB and Amazon Web Services: Storage Options for MongoDB Deployments
MongoDB and Amazon Web Services: Storage Options for MongoDB Deployments
MongoDB
 
Presto on Alluxio Hands-On Lab
Presto on Alluxio Hands-On LabPresto on Alluxio Hands-On Lab
Presto on Alluxio Hands-On Lab
Alluxio, Inc.
 
San Francisco HashiCorp User Group at GitHub
San Francisco HashiCorp User Group at GitHubSan Francisco HashiCorp User Group at GitHub
San Francisco HashiCorp User Group at GitHub
Jon Benson
 
Heap Dump Analysis - AEM: Real World Issues
Heap Dump Analysis - AEM: Real World IssuesHeap Dump Analysis - AEM: Real World Issues
Heap Dump Analysis - AEM: Real World Issues
Kanika Gera
 
Big Data, Big Projects, Big Mistakes: How to Jumpstart and Deliver with Success
Big Data, Big Projects, Big Mistakes: How to Jumpstart and Deliver with SuccessBig Data, Big Projects, Big Mistakes: How to Jumpstart and Deliver with Success
Big Data, Big Projects, Big Mistakes: How to Jumpstart and Deliver with Success
Altoros
 
Hashicorp Nomad
Hashicorp NomadHashicorp Nomad
Hashicorp Nomad
Ivan Glushkov
 
Building Fast SQL Analytics on Anything with Presto, Alluxio
Building Fast SQL Analytics on Anything with Presto, AlluxioBuilding Fast SQL Analytics on Anything with Presto, Alluxio
Building Fast SQL Analytics on Anything with Presto, Alluxio
Alluxio, Inc.
 
Scylla Summit 2022: New AWS Instances Perfect for ScyllaDB
Scylla Summit 2022: New AWS Instances Perfect for ScyllaDBScylla Summit 2022: New AWS Instances Perfect for ScyllaDB
Scylla Summit 2022: New AWS Instances Perfect for ScyllaDB
ScyllaDB
 
WordPress: Performance, Optimization & Scaling
WordPress: Performance, Optimization & ScalingWordPress: Performance, Optimization & Scaling
WordPress: Performance, Optimization & Scaling
Pete Mall
 
Володимир Цап "Constraint driven infrastructure - scale or tune?"
Володимир Цап "Constraint driven infrastructure - scale or tune?"Володимир Цап "Constraint driven infrastructure - scale or tune?"
Володимир Цап "Constraint driven infrastructure - scale or tune?"
Fwdays
 
Magento 2 with Remote Storage
Magento 2 with Remote StorageMagento 2 with Remote Storage
Magento 2 with Remote Storage
Oleg Posyniak
 
GDG Ternopil TechTalks Web #1 2015 - Data storages in Microsoft Azure
GDG Ternopil TechTalks Web #1 2015 - Data storages in Microsoft AzureGDG Ternopil TechTalks Web #1 2015 - Data storages in Microsoft Azure
GDG Ternopil TechTalks Web #1 2015 - Data storages in Microsoft Azure
Andriy Deren'
 
Building Cloud Native Analytical Pipelines on AWS
Building Cloud Native Analytical Pipelines on AWS Building Cloud Native Analytical Pipelines on AWS
Building Cloud Native Analytical Pipelines on AWS
Alluxio, Inc.
 
Primary Storage in CloudStack by Mike Tutkowski
Primary Storage in CloudStack by Mike TutkowskiPrimary Storage in CloudStack by Mike Tutkowski
Primary Storage in CloudStack by Mike Tutkowski
buildacloud
 
Windows Azure Caching
Windows Azure CachingWindows Azure Caching
Windows Azure Caching
Pavel Revenkov
 
Introducing ASP.NET vNext
Introducing ASP.NET vNextIntroducing ASP.NET vNext
Introducing ASP.NET vNext
Bruce Johnson
 
“Kick-off with Scale in Mind” by Yousef Wadi
“Kick-off with Scale in Mind” by Yousef Wadi“Kick-off with Scale in Mind” by Yousef Wadi
“Kick-off with Scale in Mind” by Yousef Wadi
Jordan Open Source Association
 
Redis Labs and SQL Server
Redis Labs and SQL ServerRedis Labs and SQL Server
Redis Labs and SQL Server
Lynn Langit
 
[Pgday.Seoul 2018] PostgreSQL 성능을 위해 개발된 라이브러리 OS 소개 apposha
[Pgday.Seoul 2018]  PostgreSQL 성능을 위해 개발된 라이브러리 OS 소개 apposha[Pgday.Seoul 2018]  PostgreSQL 성능을 위해 개발된 라이브러리 OS 소개 apposha
[Pgday.Seoul 2018] PostgreSQL 성능을 위해 개발된 라이브러리 OS 소개 apposha
PgDay.Seoul
 

What's hot (20)

HashiCorp at Just Eat
HashiCorp at Just EatHashiCorp at Just Eat
HashiCorp at Just Eat
 
MongoDB and Amazon Web Services: Storage Options for MongoDB Deployments
MongoDB and Amazon Web Services: Storage Options for MongoDB DeploymentsMongoDB and Amazon Web Services: Storage Options for MongoDB Deployments
MongoDB and Amazon Web Services: Storage Options for MongoDB Deployments
 
Presto on Alluxio Hands-On Lab
Presto on Alluxio Hands-On LabPresto on Alluxio Hands-On Lab
Presto on Alluxio Hands-On Lab
 
San Francisco HashiCorp User Group at GitHub
San Francisco HashiCorp User Group at GitHubSan Francisco HashiCorp User Group at GitHub
San Francisco HashiCorp User Group at GitHub
 
Heap Dump Analysis - AEM: Real World Issues
Heap Dump Analysis - AEM: Real World IssuesHeap Dump Analysis - AEM: Real World Issues
Heap Dump Analysis - AEM: Real World Issues
 
Big Data, Big Projects, Big Mistakes: How to Jumpstart and Deliver with Success
Big Data, Big Projects, Big Mistakes: How to Jumpstart and Deliver with SuccessBig Data, Big Projects, Big Mistakes: How to Jumpstart and Deliver with Success
Big Data, Big Projects, Big Mistakes: How to Jumpstart and Deliver with Success
 
Hashicorp Nomad
Hashicorp NomadHashicorp Nomad
Hashicorp Nomad
 
Building Fast SQL Analytics on Anything with Presto, Alluxio
Building Fast SQL Analytics on Anything with Presto, AlluxioBuilding Fast SQL Analytics on Anything with Presto, Alluxio
Building Fast SQL Analytics on Anything with Presto, Alluxio
 
Scylla Summit 2022: New AWS Instances Perfect for ScyllaDB
Scylla Summit 2022: New AWS Instances Perfect for ScyllaDBScylla Summit 2022: New AWS Instances Perfect for ScyllaDB
Scylla Summit 2022: New AWS Instances Perfect for ScyllaDB
 
WordPress: Performance, Optimization & Scaling
WordPress: Performance, Optimization & ScalingWordPress: Performance, Optimization & Scaling
WordPress: Performance, Optimization & Scaling
 
Володимир Цап "Constraint driven infrastructure - scale or tune?"
Володимир Цап "Constraint driven infrastructure - scale or tune?"Володимир Цап "Constraint driven infrastructure - scale or tune?"
Володимир Цап "Constraint driven infrastructure - scale or tune?"
 
Magento 2 with Remote Storage
Magento 2 with Remote StorageMagento 2 with Remote Storage
Magento 2 with Remote Storage
 
GDG Ternopil TechTalks Web #1 2015 - Data storages in Microsoft Azure
GDG Ternopil TechTalks Web #1 2015 - Data storages in Microsoft AzureGDG Ternopil TechTalks Web #1 2015 - Data storages in Microsoft Azure
GDG Ternopil TechTalks Web #1 2015 - Data storages in Microsoft Azure
 
Building Cloud Native Analytical Pipelines on AWS
Building Cloud Native Analytical Pipelines on AWS Building Cloud Native Analytical Pipelines on AWS
Building Cloud Native Analytical Pipelines on AWS
 
Primary Storage in CloudStack by Mike Tutkowski
Primary Storage in CloudStack by Mike TutkowskiPrimary Storage in CloudStack by Mike Tutkowski
Primary Storage in CloudStack by Mike Tutkowski
 
Windows Azure Caching
Windows Azure CachingWindows Azure Caching
Windows Azure Caching
 
Introducing ASP.NET vNext
Introducing ASP.NET vNextIntroducing ASP.NET vNext
Introducing ASP.NET vNext
 
“Kick-off with Scale in Mind” by Yousef Wadi
“Kick-off with Scale in Mind” by Yousef Wadi“Kick-off with Scale in Mind” by Yousef Wadi
“Kick-off with Scale in Mind” by Yousef Wadi
 
Redis Labs and SQL Server
Redis Labs and SQL ServerRedis Labs and SQL Server
Redis Labs and SQL Server
 
[Pgday.Seoul 2018] PostgreSQL 성능을 위해 개발된 라이브러리 OS 소개 apposha
[Pgday.Seoul 2018]  PostgreSQL 성능을 위해 개발된 라이브러리 OS 소개 apposha[Pgday.Seoul 2018]  PostgreSQL 성능을 위해 개발된 라이브러리 OS 소개 apposha
[Pgday.Seoul 2018] PostgreSQL 성능을 위해 개발된 라이브러리 OS 소개 apposha
 

Similar to Kenshoo - Use Hadoop, One Week, No Coding

MongoDB MUG Delhi NCR - December 19 2020 (Cloud Security)
MongoDB MUG Delhi NCR - December 19 2020 (Cloud Security)MongoDB MUG Delhi NCR - December 19 2020 (Cloud Security)
MongoDB MUG Delhi NCR - December 19 2020 (Cloud Security)
Shrey Batra
 
Rigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance MeasurementRigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance Measurement
DataWorks Summit
 
Rigorous and Multi-tenant HBase Performance
Rigorous and Multi-tenant HBase PerformanceRigorous and Multi-tenant HBase Performance
Rigorous and Multi-tenant HBase Performance
Cloudera, Inc.
 
Oracle big data appliance and solutions
Oracle big data appliance and solutionsOracle big data appliance and solutions
Oracle big data appliance and solutions
solarisyougood
 
Custom coded projects
Custom coded projectsCustom coded projects
Custom coded projects
Marko Heijnen
 
Virtualization and Containers
Virtualization and ContainersVirtualization and Containers
Virtualization and Containers
Kellyn Pot'Vin-Gorman
 
Should I move my database to the cloud?
Should I move my database to the cloud?Should I move my database to the cloud?
Should I move my database to the cloud?
James Serra
 
PHD Virtual: Optimizing Backups for Any Storage
PHD Virtual: Optimizing Backups for Any StoragePHD Virtual: Optimizing Backups for Any Storage
PHD Virtual: Optimizing Backups for Any Storage
Mark McHenry
 
Postgres Foreign Data Wrappers
Postgres Foreign Data Wrappers  Postgres Foreign Data Wrappers
Postgres Foreign Data Wrappers
EDB
 
The Perils and Triumphs of using Cassandra at a .NET/Microsoft Shop
The Perils and Triumphs of using Cassandra at a .NET/Microsoft ShopThe Perils and Triumphs of using Cassandra at a .NET/Microsoft Shop
The Perils and Triumphs of using Cassandra at a .NET/Microsoft Shop
Jeff Smoley
 
SQL 2014 hybrid platform - Azure and on premise
SQL 2014 hybrid platform - Azure and on premise SQL 2014 hybrid platform - Azure and on premise
SQL 2014 hybrid platform - Azure and on premise
Shy Engelberg
 
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Community
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High Performance
Inderaj (Raj) Bains
 
Austin Scales- Clickstream Analytics at Bazaarvoice
Austin Scales- Clickstream Analytics at BazaarvoiceAustin Scales- Clickstream Analytics at Bazaarvoice
Austin Scales- Clickstream Analytics at Bazaarvoice
bazaarvoice_engineering
 
Hadoop in the cloud – The what, why and how from the experts
Hadoop in the cloud – The what, why and how from the expertsHadoop in the cloud – The what, why and how from the experts
Hadoop in the cloud – The what, why and how from the experts
DataWorks Summit
 
DataCore Case Study on Hyperconverged
DataCore Case Study on HyperconvergedDataCore Case Study on Hyperconverged
DataCore Case Study on Hyperconverged
Advantech Industrial Automation Group
 
Delivering Apache Hadoop for the Modern Data Architecture
Delivering Apache Hadoop for the Modern Data Architecture Delivering Apache Hadoop for the Modern Data Architecture
Delivering Apache Hadoop for the Modern Data Architecture
Hortonworks
 
Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...
Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...
Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...
Amazon Web Services
 
IBM - Introduction to Cloudant
IBM - Introduction to CloudantIBM - Introduction to Cloudant
IBM - Introduction to Cloudant
Francisco González Jiménez
 
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part20812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2
Raul Chong
 

Similar to Kenshoo - Use Hadoop, One Week, No Coding (20)

MongoDB MUG Delhi NCR - December 19 2020 (Cloud Security)
MongoDB MUG Delhi NCR - December 19 2020 (Cloud Security)MongoDB MUG Delhi NCR - December 19 2020 (Cloud Security)
MongoDB MUG Delhi NCR - December 19 2020 (Cloud Security)
 
Rigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance MeasurementRigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance Measurement
 
Rigorous and Multi-tenant HBase Performance
Rigorous and Multi-tenant HBase PerformanceRigorous and Multi-tenant HBase Performance
Rigorous and Multi-tenant HBase Performance
 
Oracle big data appliance and solutions
Oracle big data appliance and solutionsOracle big data appliance and solutions
Oracle big data appliance and solutions
 
Custom coded projects
Custom coded projectsCustom coded projects
Custom coded projects
 
Virtualization and Containers
Virtualization and ContainersVirtualization and Containers
Virtualization and Containers
 
Should I move my database to the cloud?
Should I move my database to the cloud?Should I move my database to the cloud?
Should I move my database to the cloud?
 
PHD Virtual: Optimizing Backups for Any Storage
PHD Virtual: Optimizing Backups for Any StoragePHD Virtual: Optimizing Backups for Any Storage
PHD Virtual: Optimizing Backups for Any Storage
 
Postgres Foreign Data Wrappers
Postgres Foreign Data Wrappers  Postgres Foreign Data Wrappers
Postgres Foreign Data Wrappers
 
The Perils and Triumphs of using Cassandra at a .NET/Microsoft Shop
The Perils and Triumphs of using Cassandra at a .NET/Microsoft ShopThe Perils and Triumphs of using Cassandra at a .NET/Microsoft Shop
The Perils and Triumphs of using Cassandra at a .NET/Microsoft Shop
 
SQL 2014 hybrid platform - Azure and on premise
SQL 2014 hybrid platform - Azure and on premise SQL 2014 hybrid platform - Azure and on premise
SQL 2014 hybrid platform - Azure and on premise
 
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High Performance
 
Austin Scales- Clickstream Analytics at Bazaarvoice
Austin Scales- Clickstream Analytics at BazaarvoiceAustin Scales- Clickstream Analytics at Bazaarvoice
Austin Scales- Clickstream Analytics at Bazaarvoice
 
Hadoop in the cloud – The what, why and how from the experts
Hadoop in the cloud – The what, why and how from the expertsHadoop in the cloud – The what, why and how from the experts
Hadoop in the cloud – The what, why and how from the experts
 
DataCore Case Study on Hyperconverged
DataCore Case Study on HyperconvergedDataCore Case Study on Hyperconverged
DataCore Case Study on Hyperconverged
 
Delivering Apache Hadoop for the Modern Data Architecture
Delivering Apache Hadoop for the Modern Data Architecture Delivering Apache Hadoop for the Modern Data Architecture
Delivering Apache Hadoop for the Modern Data Architecture
 
Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...
Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...
Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...
 
IBM - Introduction to Cloudant
IBM - Introduction to CloudantIBM - Introduction to Cloudant
IBM - Introduction to Cloudant
 
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part20812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2
 

More from MapR Technologies

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscape
MapR Technologies
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & Evaluation
MapR Technologies
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your Data
MapR Technologies
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
MapR Technologies
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
MapR Technologies
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning Logistics
MapR Technologies
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model Management
MapR Technologies
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action
MapR Technologies
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIs
MapR Technologies
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
MapR Technologies
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn Prediction
MapR Technologies
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data Platform
MapR Technologies
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
MapR Technologies
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in Healthcare
MapR Technologies
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and Analytics
MapR Technologies
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017
MapR Technologies
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics
MapR Technologies
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
MapR Technologies
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
MapR Technologies
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQL
MapR Technologies
 

More from MapR Technologies (20)

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscape
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & Evaluation
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your Data
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning Logistics
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model Management
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIs
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn Prediction
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data Platform
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in Healthcare
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and Analytics
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQL
 

Recently uploaded

Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Jeffrey Haguewood
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
Ivanti
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
Wouter Lemaire
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
saastr
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
Operating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptxOperating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptx
Pravash Chandra Das
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Wask
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
saastr
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
SitimaJohn
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!
GDSC PJATK
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Tatiana Kojar
 

Recently uploaded (20)

Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Operating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptxOperating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptx
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
 

Kenshoo - Use Hadoop, One Week, No Coding

  • 1. Use Hadoop, One Week, No Coding
  • 2. © 2014 Kenshoo, Inc. Confidential and Proprietary Information About Me Noam Hasson Team Leader for Big Data • 12 years experience in web development • Hadoop enthusiast since 2011 • noam.hasson@kenshoo.com
  • 4. © 2014 Kenshoo, Inc. Confidential and Proprietary Information • Benefits of Hadoop: • Solve big data challenges • See actual results in less than a week • Use current RDMBS infrastructure • No coding experience necessary • See how with an actual case study Agenda
  • 5. © 2014 Kenshoo, Inc. Confidential and Proprietary Information Kenshoo Data Infrastructure • 250 MySQL & application server pairs, each pair dedicated to a single client • Identical schema on all MySQL servers • Different table size & capacities on each MySQL server • Total size of more than 100 TB X 250…
  • 6. © 2014 Kenshoo, Inc. Confidential and Proprietary Information The Challenge • Avoid load-intensive queries on production clusters • Reduce run-time for data analysis queries • Run cross-server queries for clients deployed on more than one DB server • Compare statistical information across several DB servers
  • 7. © 2014 Kenshoo, Inc. Confidential and Proprietary Information The Solution • Use Sqoop to migrate tables to Hive • Running SQL queries on Hive • That’s it! Sqoop - Full Table Import Hadoop
  • 8. © 2014 Kenshoo, Inc. Confidential and Proprietary Information Example Code • Oneline Import sqoop import --hive-import --connect jdbc:mysql://ServerAddress/Database -m 10 --table DBTable --hive- database DBTable --username dbUsername --password dbPassword -- hive-overwrite -z • Query hive use myDatabase Select * from myTable where field = ‘value’
  • 9. © 2014 Kenshoo, Inc. Confidential and Proprietary Information Performance • Single-node Hardware • 12 Hard drives each 4TB • 12 cores, 2 x CPU Sockets • 32GB memory • Performance • Import table of 300M rows took 3 hours • Select count on 5.5 Billion rows took 90 minutes • Group By on 5.5 Billion rows, with 1.1 Billion rows took 18 hours
  • 10. © 2014 Kenshoo, Inc. Confidential and Proprietary Information What We’ve Learned • Use partitions • Import directly to compressed files • Compare the row count in the source and destination tables • Import only the columns you need to query • Use a full table import for easy & quick results • Refine the number of map tasks used for import • Adopt Hive for your Map-Reduce jobs • Increase query speed by avoiding “order by”