Kenshoo - Use Hadoop, One Week, No Coding

Noam Hasson, team leader for Big Data at Kenshoo, explains what Kenshoo does and how it leverages Hadoop to solve its Big Data challenges.

© 2014 Kenshoo, Inc. Confidential and Proprietary Information
About Me
Noam Hasson
Team Leader for Big Data
• 12 years of experience in web development
• Hadoop enthusiast since 2011
• noam.hasson@kenshoo.com

Agenda
• Benefits of Hadoop:
  • Solve big data challenges
  • See actual results in less than a week
  • Use current RDBMS infrastructure
  • No coding experience necessary
• See how with an actual case study

Kenshoo Data Infrastructure
• 250 MySQL & application server pairs, each pair dedicated to a single client
• Identical schema on all MySQL servers
• Different table size & capacities on each MySQL server
• Total size of more than 100 TB

The Challenge
• Avoid load-intensive queries on production clusters
• Reduce run-time for data analysis queries
• Run cross-server queries for clients deployed on more than one DB server
• Compare statistical information across several DB servers

The Solution
• Use Sqoop to migrate tables to Hive
• Run SQL queries on Hive
• That’s it!
[Diagram: Sqoop full table import into Hadoop]

Example Code
• One-line import
sqoop import --hive-import --connect jdbc:mysql://ServerAddress/Database -m 10 --table DBTable --hive-database myDatabase --username dbUsername --password dbPassword --hive-overwrite -z
• Query
hive
use myDatabase;
select * from DBTable where field = 'value';
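The one-line import above can be repeated for every MySQL server, since all servers share the same schema. The following is a minimal sketch in Python, not Kenshoo's actual tooling: the server names, the partition column `source_server`, the column list, and the password file path are all hypothetical. It builds one Sqoop command per server and loads each server into its own Hive partition, so that a single Hive query can span several servers. It also swaps `--password` for Sqoop's `--password-file` option to keep the password off the command line.

```python
import shlex


def build_sqoop_import(server, database, table, hive_database,
                       username, password_file, mappers=10, columns=None):
    """Return the argument list for one full-table Sqoop import into Hive."""
    args = [
        "sqoop", "import", "--hive-import",
        "--connect", f"jdbc:mysql://{server}/{database}",
        "--table", table,
        "--hive-database", hive_database,
        "--username", username,
        "--password-file", password_file,
        "-m", str(mappers),          # number of parallel map tasks
        "--hive-overwrite", "-z",    # same flags as the one-line import
        # Assumption: one Hive partition per source server.
        "--hive-partition-key", "source_server",
        "--hive-partition-value", server,
    ]
    if columns:
        # Import only the columns that will actually be queried.
        args += ["--columns", ",".join(columns)]
    return args


# Hypothetical server names standing in for the real MySQL servers.
servers = ["db001.example.com", "db002.example.com"]
for server in servers:
    cmd = build_sqoop_import(server, "Database", "DBTable", "myDatabase",
                             "dbUsername", "/user/etl/db.password",
                             columns=["id", "field"])
    print(shlex.join(cmd))
```

The script only prints the commands; running them (for example with `subprocess.run(cmd, check=True)`) requires a machine with Sqoop and Hive installed.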

Performance
• Single-node hardware
  • 12 hard drives, 4 TB each
  • 12 cores, 2 x CPU sockets
  • 32 GB memory
• Performance
  • Importing a table of 300 million rows took 3 hours
  • Select count on 5.5 billion rows took 90 minutes
  • Group By on 5.5 billion rows, producing 1.1 billion rows, took 18 hours
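To put these timings on a common scale, the figures above can be converted to rows per second. This is a small worked calculation in Python that uses only the numbers from the slide; it assumes nothing about the hardware beyond what is listed.

```python
def rows_per_second(rows, hours):
    """Average throughput for a job that processed `rows` in `hours`."""
    return rows / (hours * 3600)


import_rate = rows_per_second(300e6, 3)     # import of 300 million rows in 3 hours
count_rate = rows_per_second(5.5e9, 1.5)    # select count on 5.5 billion rows in 90 minutes
group_rate = rows_per_second(5.5e9, 18)     # group by on 5.5 billion rows in 18 hours

print(f"import:   {import_rate:,.0f} rows/s")   # import:   27,778 rows/s
print(f"count:    {count_rate:,.0f} rows/s")    # count:    1,018,519 rows/s
print(f"group by: {group_rate:,.0f} rows/s")    # group by: 84,877 rows/s
print(f"group by is {count_rate / group_rate:.0f}x slower than count")
```

On this single node, the Group By runs about 12 times slower than the plain count over the same 5.5 billion rows.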

What We’ve Learned
• Use partitions
• Import directly to compressed files
• Compare the row count in the source and destination tables
• Import only the columns you need to query
• Use a full table import for easy & quick results
• Refine the number of map tasks used for import
• Adopt Hive for your Map-Reduce jobs
• Increase query speed by avoiding “order by”
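The advice to compare row counts between source and destination can be sketched as follows. This is an illustrative Python example, not code from the talk: the table names and counts are made up, and the step of actually running the count query against MySQL and Hive is left out. Only the comparison logic is shown.

```python
def count_query(database, table):
    """Row-count SQL that is valid in both MySQL and Hive."""
    return f"SELECT COUNT(*) FROM {database}.{table}"


def find_mismatches(source_counts, hive_counts):
    """Return {table: (source, hive)} for tables whose counts differ.

    A table missing from Hive is reported with a Hive count of None.
    """
    mismatches = {}
    for table, expected in source_counts.items():
        actual = hive_counts.get(table)
        if actual != expected:
            mismatches[table] = (expected, actual)
    return mismatches


# Made-up counts, as if collected by running count_query() on each side.
source = {"DBTable": 300_000_000, "OtherTable": 1_200}
hive = {"DBTable": 300_000_000, "OtherTable": 1_150}

print(count_query("myDatabase", "DBTable"))
print(find_mismatches(source, hive))   # {'OtherTable': (1200, 1150)}
```

An empty result means every imported table has the same number of rows as its source.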
