Qubole is a cloud-based platform that allows customers to easily run Hadoop and Spark clusters on AWS for big data analytics. It optimizes performance and reduces costs through techniques like caching data in S3 for faster access, using spot instances, and directly writing query outputs to S3. The document discusses Qubole's features, capabilities, and how it provides an easier way for more users like data scientists and analysts to access and query big data compared to building and managing Hadoop clusters themselves.
Large companies see an opportunity to replace expensive legacy data warehouse applications with Big Data technologies. But how realistic is the notion of switching from tried and true data warehouse implementations to something that's still maturing, and what are the pitfalls? What will a business user need to learn in order to adapt to the new platform?
Data Warehouse Modernization - Big Data in the Cloud Success with Qubole on O... (Qubole)
The effective use of big data is the key to gaining a competitive advantage and outperforming the competition. This change demands that companies consume and blend enormous amounts of data created from divergent and inherently mismatched sources, which represents a paradigm shift from the traditional data warehouse.
Companies need to modernize their data warehouse, augmenting it with a platform that allows storage, processing, exploration, and analysis of large and diverse datasets without limiting data access or the flexibility to respond to the needs of the business. That’s where Oracle Cloud and Qubole work together, delivering a new breed of data platform, one capable of storing and processing the overwhelming amounts of data that on-premises big data deployments cannot handle.
Watch this on-demand webinar to understand:
- Why deploying big data on-premises is expensive and complex to maintain, and limits your ability to scale across new use cases and data sources
- How Oracle Bare Metal Cloud's predictable, high-performance compute and network services deliver the foundation of a cost-effective, high-performance big data platform
- How Qubole leverages Oracle Bare Metal Cloud to provide a turnkey big data service that optimizes cost, performance, and scale, enabling self-service data exploration.
Qubole delivers a cloud-based, turnkey, self-service big data service that removes the complexity and reduces the cost of doing big data. It leverages Oracle Bare Metal Cloud’s next generation of scalable, inexpensive and performant compute, network and storage public cloud infrastructure to provide a solution that accelerates time to market and reduces the risk of your big data initiatives.
Big Data Challenges and How to Overcome Them with Qubole - a Self-Service Platform for Big Data Analytics built on the Amazon Web Services, Microsoft Azure, and Google clouds. Storing, accessing, and analyzing large amounts of data from diverse sources and making it easily accessible to deliver actionable insights for users can be challenging for data-driven organizations. The solution for customers is to optimize scaling and create a unified interface to simplify analysis. Qubole helps customers simplify their big data analytics with speed and scalability, while providing data analysts and scientists self-service access in the cloud. The platform is fully elastic and automatically scales or contracts clusters based on workload. We will give an overview of the platform's main features, advantages, and drawbacks.
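The workload-based cluster scaling described above can be illustrated with a small sketch. This is a hypothetical policy, not Qubole's actual autoscaling algorithm; all names and thresholds are illustrative.

```python
def desired_cluster_size(pending_tasks, tasks_per_node, min_nodes, max_nodes):
    """Pick a node count that covers the pending workload, clamped to limits.

    A toy stand-in for workload-aware autoscaling; real policies also weigh
    spot pricing, data locality, node spin-up time, and so on.
    """
    # Round up: 25 tasks at 10 tasks/node needs 3 nodes, not 2.
    needed = -(-pending_tasks // tasks_per_node)
    return max(min_nodes, min(max_nodes, needed))
```

With a floor of 2 and a ceiling of 20 nodes, an idle cluster stays at 2 nodes, a moderate backlog grows it just enough to cover the work, and a spike is capped at the configured maximum.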
Presto & differences between popular SQL engines (Spark, Redshift, and Hive) (Holden Ackerman)
This is a presentation given at a Big Data Boulder / Denver Meetup event by Ashish Dubey, a Senior Solutions Architect at Qubole.
The following slides cover a background of Presto and its architecture, and how it differs in both performance and cost from traditional Hadoop/Hive for ad hoc queries, as well as from Spark SQL, Impala, Tez, and Redshift.
There are also several slides about how Qubole has been involved with the open-source Apache Presto project, along with its performance-optimizing contributions.
Qubole is a big data analytics software that has solved many headaches around the traditional model of big data (Hadoop, Spark, Presto) and cloud computing in popular IaaS providers: AWS, Google Cloud, Microsoft Azure, and Oracle BMC.
In this presentation, Zoosk will share its experience in transitioning the Zoosk Big Data Platform from Hive to a Hive/Impala configuration. We will share lessons learned, some guidelines about when to use one or another, and a high level before-and-after view of its architecture.
Hadoop, as we know, is a Java-based, massively scalable distributed framework for processing large data (several petabytes) across clusters of thousands of commodity computers.
The Hadoop ecosystem has grown over the last few years, and there is a lot of jargon around its tools and frameworks.
Many organizations are investing and innovating heavily in Hadoop to make it better and easier to use. The mind map on the next slide should be useful for getting a high-level picture of the ecosystem.
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5 (Cloudera, Inc.)
Inefficient data workloads are all too common across enterprises - causing costly delays, breakages, hard-to-maintain complexity, and ultimately lost productivity. For a typical enterprise with multiple data warehouses, thousands of reports, and hundreds of thousands of ETL jobs being executed every day, this loss of productivity is a real problem. Add to all of this the complex handwritten SQL queries, and there can be nearly a million queries executed every month that desperately need to be optimized, especially to take advantage of the benefits of Apache Hadoop. How can enterprises dig through their workloads and inefficiencies to easily see which are the best fit for Hadoop and what’s the fastest path to get there?
Cloudera Navigator Optimizer is the solution: it analyzes existing SQL workloads to provide instant insights and turns them into an intelligent optimization strategy so you can unlock peak performance and efficiency with Hadoop. As the newest addition to Cloudera's enterprise Hadoop platform, and now available in limited beta, Navigator Optimizer has helped customers profile over 1.5 million queries and ultimately save millions by optimizing for Hadoop.
eBay has one of the largest Hadoop clusters in the industry with many petabytes of data. This talk will give an overview of how Hadoop and HBase have been used within eBay, the lessons we have learned from supporting large-scale production clusters, as well as how we plan to use and improve Hadoop and HBase moving forward. Specific use cases, production issues and platform improvement work will be discussed.
Progress® DataDirect ® Spark SQL ODBC and JDBC drivers deliver the fastest, high-performance connectivity so your existing BI and analytics applications can access Big Data in Apache Spark.
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie... (Cloudera, Inc.)
Hadoop was the first software to permit affordable use of petabytes. In the decade since Hadoop was introduced, many other projects have been created around the Hadoop Distributed File System (HDFS) storage layer and its MapReduce processing engine, forming a rich software ecosystem. In this keynote, Doug Cutting will explain how Apache Spark provides a second-generation processing engine that greatly improves on MapReduce, and why this transition provides an example of an evolutionary pattern in the data ecosystem that gives it long-term strength.
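The MapReduce model the keynote starts from can be sketched in plain Python; this is the classic word-count shape as an illustration of the map/shuffle/reduce phases, not Hadoop's actual API.

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase(["to be or", "not to be"])))
```

Part of Spark's improvement over this model is that intermediate results like the shuffle output can stay in memory across a chain of operations instead of being written to disk between every pair of phases.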
Azure Databricks—Apache Spark as a Service with Sascha Dittmann (Databricks)
Databricks Inc., the driving force behind Apache Spark, and Microsoft have designed a joint service to quickly and easily create big data and advanced analytics solutions. The combination of the comprehensive Databricks Unified Analytics Platform and the powerful capabilities of Microsoft Azure makes it easy to analyse data streams or large amounts of data, as well as to train AI models. In this session, Sascha Dittmann shows how the new Azure service can be set up and used in various real-world scenarios. He also shows how to connect various Azure services to the Azure Databricks service.
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re... (Eric David Benari, PMP)
Advancing Real-Time Responses in Web Applications
Michael Glukhovsky, Co-Founder, RethinkDB
Video of this session at the Database Camp conference at the UN is on http://www.Database.Camp
Introduction to Kudu - StampedeCon 2016 (StampedeCon)
Over the past several years, the Hadoop ecosystem has made great strides in its real-time access capabilities, narrowing the gap compared to traditional database technologies. With systems such as Impala and Spark, analysts can now run complex queries or jobs over large datasets within a matter of seconds. With systems such as Apache HBase and Apache Phoenix, applications can achieve millisecond-scale random access to arbitrarily-sized datasets.
Despite these advances, some important gaps remain that prevent many applications from transitioning to Hadoop-based architectures. Users are often caught between a rock and a hard place: columnar formats such as Apache Parquet offer extremely fast scan rates for analytics, but little to no ability for real-time modification or row-by-row indexed access. Online systems such as HBase offer very fast random access, but scan rates that are too slow for large scale data warehousing workloads.
This talk will investigate the trade-offs between real-time transactional access and fast analytic performance from the perspective of storage engine internals. It will also describe Kudu, the new addition to the open source Hadoop ecosystem that fills the gap described above, complementing HDFS and HBase to provide a new option to achieve fast scans and fast random access from a single API.
Spark and Couchbase: Augmenting the Operational Database with Spark (Matt Ingenthron)
How do NoSQL Document-Oriented Databases like Couchbase fit in with Apache Spark? This set of slides gives a couple of use cases, shows why Couchbase works great with Spark, and sets up a scenario for a demo.
Global Knowledge Collaboration to Cure Cancer: How GPUs Impact Graph & Predictive Analytics
Brad Bebee, CEO of Blazegraph
Video of this session at the Database Camp conference at the UN is on http://www.Database.Camp
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis... (Spark Summit)
Redis accelerates Apache Spark execution by 45 times when used as a shared, distributed in-memory datastore for Spark in analyses such as time-series data range queries. With redis-ml, the Redis module for machine learning, Spark ML models gain a new real-time serving layer that offloads model processing directly into Redis, allows multiple applications to reuse the same models, and speeds up classification and execution of these models by 13x. Join this session to learn more about the Redis Labs connector for Apache Spark, which enhances production implementations of real-time big data processing.
Data Engineer's Lunch #55: Get Started in Data Engineering (Anant Corporation)
In Data Engineer's Lunch #55, CEO of Anant, Rahul Singh, will cover 10 resources every data engineer needs to get started or master their game.
The Fundamentals Guide to HDP and HDInsight (Gert Drapers)
This session will give you an architectural overview and an introduction to the inner workings of HDP 2.0 (http://hortonworks.com/products/hdp-windows/) and HDInsight. The world has embraced the Hadoop toolkit to solve its data problems, from ETL and data warehouses to event processing pipelines. As Hadoop consists of many components, services, and interfaces, understanding its architecture is crucial before you can successfully integrate it into your own environment.
My talk from Database Camp 2016 at the United Nations. I focus on how we can bridge the gap between OLTP and OLAP workloads and discuss a very promising new technology called Apache Kudu.
Hadoop and object stores: Can we do it better? (gvernik)
Strata Data Conference, London, May 2017
Trent Gray-Donald and Gil Vernik explain the challenges of current Hadoop and Apache Spark integration with object stores and discuss Stocator, an open source (Apache License 2.0) object store connector for Hadoop and Apache Spark specifically designed to optimize their performance with object stores. Trent and Gil describe how Stocator works and share real-life examples and benchmarks that demonstrate how it can greatly improve performance and reduce the quantity of resources used.
Comparing the Performance of ETL Pipeline Using Spark and Hive Under Azure ... (Megha Shah)
This presentation aims to compare the performance of ETL pipeline using Spark and Hive under Azure. We will examine the features, strengths, and weaknesses of each tool, and provide recommendations on which one to use based on specific use cases.
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series (Amazon Web Services)
Amazon EMR is a managed Hadoop service that makes it easy for customers to use big data frameworks and applications like Hadoop, Spark, and Presto to analyze data stored in HDFS or on Amazon S3, Amazon's highly scalable object storage service. In this webinar, we will introduce the latest release of Amazon EMR. With Amazon EMR release 5.0, customers can now launch the latest versions of popular open source frameworks, including Apache Spark 2.0, Hive 2.1, Presto 0.151, Tez 0.8.4, and Apache Hadoop 2.7.2. We will walk through a demo to show you how to deploy a Hadoop environment within minutes. We will cover common use cases and best practices to lower costs using Amazon S3 as your data store and Amazon EC2 Spot Instances, which allow you to bid on spare Amazon EC2 computing capacity.
Learning Objectives:
• Describe the new features and updated frameworks in Amazon EMR 5.0
• Learn best practices and real-world applications for Amazon EMR
• Understand how to use EC2 Spot pricing to save costs
• Explain the advantages of decoupling storage and compute with Amazon S3 as storage layer for EMR workloads
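The Spot-pricing savings in the objectives above come down to simple arithmetic. The sketch below uses made-up placeholder prices, not real EC2 quotes, to show the shape of the calculation.

```python
def spot_savings(on_demand_price, spot_price, nodes, hours):
    """Return (dollars saved, fraction saved) from running `nodes` nodes for
    `hours` hours on Spot instead of On-Demand. Prices are $/instance-hour."""
    on_demand_cost = on_demand_price * nodes * hours
    spot_cost = spot_price * nodes * hours
    saved = on_demand_cost - spot_cost
    return saved, saved / on_demand_cost

# Hypothetical prices: $0.40/hr On-Demand vs. $0.12/hr Spot, 20 nodes, 6 hours.
saved, fraction = spot_savings(0.40, 0.12, 20, 6)
```

The catch, of course, is that Spot capacity can be reclaimed, which is why it pairs well with transient, restartable workloads and with S3 as a persistent store that survives the cluster.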
(BDT208) A Technical Introduction to Amazon Elastic MapReduce (Amazon Web Services)
Amazon EMR provides a managed framework which makes it easy, cost-effective, and secure to run data processing frameworks such as Apache Hadoop, Apache Spark, and Presto on AWS. In this session, you learn the key design principles behind running these frameworks on the cloud and the feature set that Amazon EMR offers. We discuss the benefits of decoupling compute and storage and strategies to take advantage of the scale and parallelism that the cloud offers, while lowering costs. Additionally, you hear from AOL's Senior Software Engineer on how they used these strategies to migrate their Hadoop workloads to the AWS cloud and lessons learned along the way.
In this session, you learn the benefits of decoupling storage and compute and allowing them to scale independently; how to run Hadoop, Spark, Presto, and other supported Hadoop applications on Amazon EMR; how to use Amazon S3 as a persistent data store and process data directly from Amazon S3; deployment strategies and how to avoid common mistakes when deploying at scale; and how to use Spot Instances to scale your transient infrastructure effectively.
Similar to Optimizing Big Data to run in the Public Cloud
7 Big Data Challenges and How to Overcome Them (Qubole)
Implementing a big data project is difficult. Hadoop is complex, and data governance is crucial. Learn common big data challenges and how to overcome them.
A recent survey indicated significant growth of big data adoption among enterprise companies. The survey also indicated growing interest in Hadoop in the cloud.
Getting to 1.5M Ads/sec: How DataXu manages Big Data (Qubole)
DataXu sits at the heart of the all-digital world, providing a data platform that manages tens of millions of dollars of digital advertising investments from Global 500 brands. The DataXu data platform evaluates 1.5 million online ad opportunities every second for our customers, allowing them to manage and optimize their marketing investments across all digital channels. DataXu employs a wide range of AWS services: CloudFront, CloudTrail, CloudWatch, Data Pipeline, Direct Connect, DynamoDB, EC2, EMR, Glacier, IAM, Kinesis, RDS, Redshift, Route 53, S3, SNS, SQS, and VPC to run various workloads at scale for the DataXu data platform.
In addition, DataXu uses Qubole Data Service (QDS) to offer a unified analytics interface to DataXu customers. Qubole, a member of the APN, provides self-managing big data infrastructure in the cloud that leverages spot pricing for cost efficiency, delivers fast performance, and, most importantly, offers a streamlined user interface for ease of use.
Attendees will learn how Qubole's self-managing Hadoop clusters in the AWS Cloud accelerated DataXu's batch-oriented analysis jobs, and how Qubole's integration with Amazon Redshift enabled DataXu to perform low-latency, interactive analysis. Further, in the session we'll take a look at how DataXu opened up QDS access to its customers via the QDS user interface, thereby providing them with a single tool for both batch-oriented and interactive analysis. Using the QDS user interface, buyers of the DataXu data service could perform all manner of analysis against the data stored in their AWS S3 buckets.
Speakers:
Scott Ward
Solutions Architect at Amazon Web Services
Ashish Dubey
Solutions Architect at Qubole
Yekesa Kosuru
VP Engineering at DataXu
Whether you are interested in healthcare data analytics or looking to get started with big data and marketing, these fundamental principles from data experts will contribute to your success. http://www.qubole.com/new-series-big-data-tips/
Slide deck from a hands-on workshop covering the following:
1. Learn what sentiment analysis is and how it can be used
2. Perform pre-processing and post-processing of textual data using Hive
3. Use the n-gram language model built into Hive to perform sentiment analysis
4. Learn how to use Hive's extensibility to plug in other language models
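The n-gram step above relies on Hive's built-in `ngrams()` UDAF, which estimates the most frequent n-grams in tokenized text. The same idea in plain Python looks roughly like this; it is a toy approximation for illustration, not the Hive implementation (which uses sketch-based estimation for scale), and the sample texts are made up.

```python
from collections import Counter

def top_ngrams(texts, n, k):
    # Slide a window of n tokens across each text and count the tuples,
    # roughly what Hive's ngrams(sentences(col), n, k) computes at scale.
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts.most_common(k)

top = top_ngrams(["great product great price", "great product indeed"], 2, 1)
```

For sentiment work, the frequent bigrams and trigrams surfaced this way become features or lookup keys against a sentiment lexicon in the post-processing step.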
A session from the Qubole Best Practice Webinar Series, “Big Data Secrets from the Pros”. It covers how to make Apache Hive queries run faster by:
a. Better layout of data on HDFS via partitioning and bucketing
b. Designing test queries by using block and bucket sampling before running the queries on large datasets
c. Using bucket map joins and parallel processing to run queries faster
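In HiveQL, points (a) and (b) above look roughly like the fragment below. The table and column names are made up for illustration; bucket counts and formats would depend on the actual dataset.

```sql
-- (a) Lay out data via partitioning (by date) and bucketing (by user_id):
CREATE TABLE events (user_id BIGINT, action STRING)
PARTITIONED BY (event_date STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;

-- (b) Test a query on a single bucket before running it on the full table:
SELECT action, COUNT(*)
FROM events TABLESAMPLE(BUCKET 1 OUT OF 32 ON user_id)
WHERE event_date = '2016-01-01'
GROUP BY action;
```

Matching bucket counts across tables on the join key is also what makes the bucket map joins in point (c) possible, since corresponding buckets can be joined pairwise in memory.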
Visit www.qubole.com for more information.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring & observability to the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on:
Welcome to the first live UiPath Community Day Dubai! Join us for this unique occasion to meet our local and global UiPath Community and leaders. You will get a full view of the MEA region's automation landscape and the AI Powered automation technology capabilities of UiPath. Also, hosted by our local partners Marc Ellis, you will enjoy a half-day packed with industry insights and automation peers networking.
📕 Curious on our agenda? Wait no more!
10:00 Welcome note - UiPath Community in Dubai
Lovely Sinha, UiPath Community Chapter Leader, UiPath MVPx3, Hyper-automation Consultant, First Abu Dhabi Bank
10:20 A UiPath cross-region MEA overview
Ashraf El Zarka, VP and Managing Director MEA, UiPath
10:35 Customer Success Journey
Deepthi Deepak, Head of Intelligent Automation CoE, First Abu Dhabi Bank
11:15 The UiPath approach to GenAI with our three principles: improve accuracy, supercharge productivity, and automate more
Boris Krumrey, Global VP, Automation Innovation, UiPath
12:15 Discover how Marc Ellis leverages tech-driven solutions in recruitment and managed services.
Brendan Lingam, Director of Sales and Business Development, Marc Ellis
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that leads to closing the deal.
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfPeter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speed up fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
DevOps and Testing slides at DASA ConnectKari Kakkonen
Slides by me and Rik Marselis from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. Afterwards we held a lovely workshop with the participants, trying to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Optimizing Big Data to run in the Public Cloud
1. Optimizing Big Data to run in the Public Cloud
April 23, 2015
NYC Hadoop Meetup
2. A little bit about Qubole
Ashish Thusoo
Founder & CEO
Joydeep Sen Sarma
Founder & CTO
Founded in 2011 by pioneers of “big data” at Facebook and creators of the Apache Hive project.
Based in Mountain View, CA with offices in Bangalore,
India. Investments by Charles River, LightSpeed, Norwest
Ventures.
World class product and engineering team from:
3. Qubole QDS
Qubole works in:
• Adtech
• Media & Entertainment
• Healthcare
• Retail
• eCommerce
Qubole works best when:
• Born in Cloud
• Commitment to Public Cloud
• Data Driven
• Large scale data
• Lack Hadoop Skills
• Analysts & scientists need access
4. Standard Hadoop (on premises)
- JobTrackers
- TaskTrackers
- NameNodes
- DataNodes
= Datacenter, servers, VMs, wires…
How about adding more capacity? Dev/Test environments?
Non-Technical Users? Version upgrades?
5. Hadoop in the Cloud
= Someone else’s datacenter, servers, VMs, wires…
Designed for capacity scaling (and reduction)
Designed for multiple Dev/Test environments
Potential UI for Non-Technical Users
Potential support for version upgrades
Ad hoc queries can spin up a cluster on-demand exactly when needed
= Cost reduction, self-service, custom configurable clusters
*Security (important and production ready, but we won’t focus on it here)
7. Data stored in HDFS requires nodes (EC2 instances) to be kept running continuously. This can be expensive.
Data stored in S3 means it is remote from the compute nodes.
S3 performs well in general, but it’s not uncommon to see
significant variance in performance.
8. Split Computation and File I/O
Multiple map tasks are instantiated and each of these is assigned a split.
Hadoop needs to know the size of input files so that they can be grouped into equal-sized splits.
Input files are spread across many directories.
For example, two years of data, organized into hourly directories, results
in 17520 directories. If each directory contains 6 files, this makes a grand
total of 105,120 files.
Map-Reduce calls the generic Hadoop file listing API against each input
directory to get the size of all files in the directory.
9. Split Computation and File I/O
This is fine on HDFS, but in our example it results in 17,520 API calls, which gives very bad performance on S3.
Every listing call on S3 involves a REST API call and parsing of XML results, which carries very high overhead and latency.
Furthermore, Amazon employs protection mechanisms against high rates of API calls. For certain workloads, split computation becomes a huge bottleneck.
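To make the scale of the problem concrete, here is a small illustrative calculation (ours, not Qubole's code) comparing one listing call per directory with a bulk listing of the parent prefix in pages of 1000 keys, as the S3 LIST API returns them:

```python
import math

# Two years of hourly partition directories, six files in each.
HOURS_PER_YEAR = 8760
directories = 2 * HOURS_PER_YEAR                  # 17,520 input directories
files_per_directory = 6
total_files = directories * files_per_directory   # 105,120 files

# Naive split computation: one listing call per input directory.
naive_calls = directories

# Bulk listing of the common parent prefix: 1000 keys per page.
bulk_calls = math.ceil(total_files / 1000)

print(naive_calls, total_files, bulk_calls)  # 17520 105120 106
```

The 106-call figure is the same number the optimized listing reaches later in the deck.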
10. Data Consistency
From AWS FAQ:
What data consistency model does Amazon S3 employ?
Amazon S3 buckets in the US Standard region provide eventual consistency. Amazon S3 buckets in all other regions provide read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES.
13. Data Consistency
Specify named S3 endpoints instead of US Standard.
For example, replace:
http://mybucket.s3.amazonaws.com/somekey.ext
with:
http://mybucket.s3-external-1.amazonaws.com/somekey.ext
http://docs.aws.amazon.com/redshift/latest/dg/managing-data-consistency.html
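As a minimal sketch, the fix is a pure host-name substitution; the bucket and key are unchanged:

```python
# Rewrite a US Standard S3 URL to the named endpoint that offers
# read-after-write consistency for new objects (per the AWS docs above).

def to_named_endpoint(url: str) -> str:
    # Only the host portion changes; bucket name and key stay the same.
    return url.replace(".s3.amazonaws.com/", ".s3-external-1.amazonaws.com/")

print(to_named_endpoint("http://mybucket.s3.amazonaws.com/somekey.ext"))
# http://mybucket.s3-external-1.amazonaws.com/somekey.ext
```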
14. Qubole DataFlow Diagram
[Diagram: users access Qubole through the Qubole UI in a browser, SDKs, or ODBC. Qubole's AWS account hosts an ephemeral web tier of web servers with an encrypted result cache, an RDS instance holding Qubole user and account configurations (encrypted credentials), and the default Hive metastore. It manages ephemeral Hadoop clusters (master and slave nodes with encrypted HDFS) in the customer's AWS account over the REST API (HTTPS) and SSH. Data flows within the customer's AWS account between the clusters and Amazon S3 with S3 server-side encryption, and optionally to other RDS instances, Redshift, and a custom Hive metastore.]
Encryption Options:
a) Qubole can encrypt the result cache
b) Qubole supports encryption of the ephemeral drives used for HDFS
c) Qubole supports S3 Server Side Encryption
15. QDS Platform Features
Auto-Scaling self-managed Hadoop Clusters in the Cloud
– Including Amazon EC2, Rackspace, Google Compute & OpenStack
The Fastest Hadoop running on the cloud
– Numerous Optimizations that provide 4 to 8 times faster performance than Amazon
Elastic MapReduce (EMR)
Pre-built connectors
– Traditional RDBMS, MongoDB and other NoSQL solutions
– Incremental Data Scrapes
Job Scheduler
– Dependencies, Workflows, Incremental Jobs
Multi-Platform Support
– Supports AWS, Google & Azure Credentials
16. QDS Capabilities
Mix and Match Reserved & Spot instances
– To reduce the cost of compute hours on the cloud
Perform data exploration and analysis on raw multi-structured
data formats.
Integration with data visualization & BI tools via ODBC
– Tableau Software, Pentaho, Excel
All functionality is also available through APIs and toolkits
17. Faster split computations on S3
To solve this problem, we modified split computation to invoke listing at the level of the parent directory.
This call returns all files (and their sizes) in all subdirectories in blocks of 1000.
Some subdirectories and files may not be of interest to the job/query; for example, partition elimination may eliminate some of them.
We take advantage of the fact that file listing is in lexicographic order and perform a modified merge join of the
list of files and list of directories of interest.
This allows us to efficiently identify the sizes of the interesting files. The modified algorithm results in only 106 API calls (each call returns 1000 files), compared to 17,520 API calls in the original implementation. We compared the two approaches using a simple Hive test: take a partitioned table T with 15,000 files but vary the number of partitions (a partition corresponds to a directory), and compare the performance of ‘select count(*) from T’. In the extreme case, this optimization shows a speedup of 8x!
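A sketch of the merge-join idea under simplified assumptions (illustrative code, not Qubole's implementation): both the bulk file listing and the surviving directories are in lexicographic order, so one linear pass pairs each file with a directory of interest:

```python
def merge_join(files, dirs):
    """files: sorted (path, size) pairs; dirs: sorted directory prefixes."""
    matched, i = [], 0
    for path, size in files:
        # Skip directories that sort entirely before this file's path.
        while i < len(dirs) and dirs[i] < path and not path.startswith(dirs[i]):
            i += 1
        if i < len(dirs) and path.startswith(dirs[i]):
            matched.append((path, size))
    return matched

files = [("t/dt=2015-01-01/f0", 10), ("t/dt=2015-01-02/f0", 20),
         ("t/dt=2015-01-03/f0", 30)]
dirs = ["t/dt=2015-01-01/", "t/dt=2015-01-03/"]   # after partition elimination
print(merge_join(files, dirs))
# [('t/dt=2015-01-01/f0', 10), ('t/dt=2015-01-03/f0', 30)]
```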
18. Faster reads from S3
Opening a file takes a significant amount of time – at least 50 milliseconds per file.
This problem becomes pronounced when the input dataset has lots of small files and file open latency forms a significant portion of overall execution time.
To alleviate this problem, we included an optimization wherein we open an S3 file in a background thread a little while before it is actually required by the map task. This hides the file open latency.
One thing to be aware of is that if an S3 file is opened but not read from for a while, S3 returns a RequestTimeout and potentially penalizes the caller.
We tested this optimization with a simple Hive test. Our dataset consisted of 80,000 files, each of size 640KB. We noticed an improvement of 30% in a count(*) query as a result of this optimization.
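The prefetch idea can be sketched as follows (a simplified illustration; open_s3_file is a hypothetical stand-in for the real S3 client, with the ~50 ms open latency simulated by a sleep):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def open_s3_file(key):
    # Hypothetical stand-in for the real S3 open call; the sleep simulates
    # the ~50 ms per-file open latency mentioned above.
    time.sleep(0.05)
    return f"handle:{key}"

_executor = ThreadPoolExecutor(max_workers=1)

def process(keys):
    # Start opening the first file, then always keep the next open in flight
    # while the map task works on the current file.
    future = _executor.submit(open_s3_file, keys[0])
    handles = []
    for i in range(len(keys)):
        handle = future.result()              # usually ready by now
        if i + 1 < len(keys):
            future = _executor.submit(open_s3_file, keys[i + 1])
        handles.append(handle)                # the map task would read here
    return handles

print(process(["a", "b", "c"]))  # ['handle:a', 'handle:b', 'handle:c']
```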
19. Attaching to EBS Volumes
Storage on EC2 Instances.
Instance Type Instance Storage (GB)
c1.xlarge 1680 [4 x 420]
c3.2xlarge 160 [2 x 80 SSD]
c3.4xlarge 320 [2 x 160 SSD]
c3.8xlarge 640 [2 x 320 SSD]
The amount of storage per instance might not be
sufficient for running Hadoop clusters with high
volumes of data.
20. Attaching to EBS Volumes
• AWS offers raw block devices called EBS
volumes which can be attached to EC2
instances.
• It is possible to attach multiple EBS volumes with a size of up to 1 TB per volume. This can easily compensate for the low instance storage available on the new-generation instances. Also, using EBS volumes for storage costs much less than adding extra instances just for their storage capacity.
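As a back-of-the-envelope illustration (our own arithmetic, not a Qubole formula), the number of 1 TB EBS volumes needed per node can be estimated from dataset size, HDFS replication, and local storage:

```python
import math

def ebs_volumes_per_node(dataset_tb, replication, nodes, local_tb_per_node,
                         volume_tb=1):
    # Storage to provide per node once HDFS replication is accounted for,
    # minus what the instance's local disks already supply.
    needed_tb = dataset_tb * replication / nodes - local_tb_per_node
    return max(0, math.ceil(needed_tb / volume_tb))

# 100 TB dataset, 3x replication, 50 c3.8xlarge nodes (0.64 TB local SSD each):
print(ebs_volumes_per_node(100, 3, 50, 0.64))  # 6
```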
21. Attaching to EBS Volumes
• Configurable Reserved Volumes
• On the Qubole platform, using new-generation instances, users have the option to use reserved volumes if the data requirement exceeds the local storage available in the cluster.
• AWS EBS volumes come in various flavors, e.g. magnetic and SSD-backed. Users can select the size and type of EBS volumes based on their data and performance requirements.
[Diagram: MapReduce and HDFS perform disk access against both local SSD and reserved EBS disks.]
22. Protection Against Bad Jobs
A single Hadoop cluster is usually shared across many users.
- It is a common occurrence that a user issues a bad job which degrades the performance of the entire cluster.
Running out of Disk:
- Single mapper issuing too much output.
- A map/reduce job may have a lot of mappers outputting too much map data
- Reducer tasks copying a lot of map output data during the shuffle phase
23. Protection Against Bad Jobs
Qubole’s Hadoop distribution protects clusters against such jobs. Clusters periodically monitor running jobs and kill any job that may be affecting the entire cluster.
Kill job when …
- Total map output of a job is beyond a configurable value
- Any tasks produce more map output than a set disk percent
- A job produces a lot of logs (configurable value)
- Reducers read a lot of map data (configurable value)
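The kill conditions above amount to a set of configurable thresholds; a minimal sketch (metric names and limits are illustrative assumptions, not Qubole's actual configuration keys):

```python
# Illustrative thresholds for killing runaway jobs; names and values are
# our assumptions, standing in for the cluster's configurable limits.
LIMITS = {
    "total_map_output_bytes": 1 << 40,  # total map output beyond ~1 TB
    "task_disk_percent": 90,            # any task using >90% of its disk
    "log_bytes": 10 << 30,              # job logs beyond ~10 GB
    "shuffle_bytes": 2 << 40,           # reducers copying beyond ~2 TB
}

def should_kill(job_stats, limits=LIMITS):
    """Return the first violated limit, or None if the job is healthy."""
    for metric, limit in limits.items():
        if job_stats.get(metric, 0) > limit:
            return metric
    return None

print(should_kill({"total_map_output_bytes": 2 << 40}))  # total_map_output_bytes
```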
24. Direct output commit to S3
Using the default S3 code path involves writing to a temp directory and then moving the temp directory into its final location.
A move on S3 is really a copy followed by a delete.
Instead, write directly into the target directory.
25. Direct output commit to S3 - Hive
Changed the naming scheme for the files Hive creates.
• Instead of 000000 we use names like UUID_000000, where all files generated by a single INSERT INTO use the same prefix.
• This guarantees that a new INSERT INTO will not stomp on data produced by an earlier query.
To support INSERT OVERWRITE with dynamic partitions, the tasks that write into the directory must delete any existing files.
Before the INSERT OVERWRITE begins we generate a UUID to use for this statement. When deleting from a directory, the mappers/reducers delete all files that don't begin with this UUID.
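A sketch of the naming scheme (illustrative helpers, not Hive's actual code): files from one statement share a UUID prefix, so an INSERT OVERWRITE can safely delete everything it did not write:

```python
import uuid

def output_names(statement_uuid, num_tasks):
    # Every file written by one statement carries the statement's UUID prefix.
    return [f"{statement_uuid}_{i:06d}" for i in range(num_tasks)]

def files_to_delete(existing, statement_uuid):
    # INSERT OVERWRITE with dynamic partitions: each writing task deletes
    # every file in the directory that this statement did not produce.
    return [f for f in existing if not f.startswith(statement_uuid)]

old_uuid = "1f6c"                # stand-in for an earlier statement's UUID
new_uuid = uuid.uuid4().hex      # generated once, before the statement starts
directory = output_names(old_uuid, 2) + output_names(new_uuid, 1)
print(files_to_delete(directory, new_uuid))  # ['1f6c_000000', '1f6c_000001']
```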
26. Direct output commit to S3 - MapReduce
By default the MR code also writes to a temp location and then moves it to the final one. The moves are done by listing the temp location and moving all the files found there.
• To avoid this, we track expected file counts for a couple of file formats
• We provide direct committers which avoid the move completely
27. Spot Instances, Placement Policy, and Fallback to on-demand
Spot Instances
Spot Instances allow users to bid on unused Amazon EC2 capacity and run those instances for as
long as their bid exceeds the Spot Price. QDS enables you to realize cost savings of as much as
50% to 60% by supporting the Spot Instance pricing model in addition to the Reserved Instance
pricing.
Use Qubole Placement Policy:
When using spot instances for slaves, this ensures that at least one replica of each HDFS block is
placed on Stable instances. It is recommended to keep this enabled when using spot instances.
Fallback to on-demand:
When upscaling the cluster, sometimes we may not be able to procure Spot Instances because of
low availability or high price. This option specifies that autoscaling should then fall back to procuring
On-Demand instances. This will increase the cost of running the cluster, but ensures that the
processing completes relatively quickly. Enable this if command processing time is important to you.
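The fallback rule can be sketched as a simple procurement step (request_spot and request_on_demand are hypothetical provisioning hooks, not a real QDS API):

```python
def procure(needed, request_spot, request_on_demand, fallback=True):
    # Try the spot market first; whatever it cannot supply is optionally
    # filled with on-demand instances so upscaling still completes.
    granted = request_spot(needed)
    shortfall = needed - granted
    if shortfall > 0 and fallback:
        granted += request_on_demand(shortfall)
    return granted

# Simulated market that can only fill 3 of 10 spot requests:
print(procure(10, lambda n: min(n, 3), lambda n: n))         # 10
print(procure(10, lambda n: min(n, 3), lambda n: n, False))  # 3
```

With fallback disabled the cluster stays cheaper but upscales only as far as the spot market allows, which is exactly the trade-off the slide describes.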
30. Features
S3 Caching: S3 caching utilizes resources more efficiently and
brings up clusters faster – up to 10x faster on some client instances.
Variable Spot Instance Pricing: QDS allows you
to vary the number of spots vs. on-demand nodes,
providing the benefits of spot pricing (up to 90%
less expensive) with the certainty of getting your
job done.
Searchable Queries and Log Files: QDS shows
you all of your jobs, allowing you to compare the
efficiency of queries and avoid having to re-create
queries from scratch.
Built in Job Tracker: QDS provides a job tracker
accessed directly through the UI, allowing you to
identify resources, nodes, and tasks.
Security: QDS has a tight security environment,
with the ability to encrypt data at rest on nodes.
32. Why Qubole?
“Qubole has enabled more users within Pinterest to get to the data and has made the data platform a lot more scalable and stable”
Mohammad Shahangian - Lead, Data Science and Infrastructure
Moved to Qubole from Amazon EMR because
of stability and rapidly expanded big data usage by
giving access to data to users beyond developers.
User and Query Growth
Rapid expansion of big data beyond developers (240 users out of a 600-person company)
Use Cases
Rapid expansion in use cases ranging from ETL and search to ad hoc querying and product analytics
Rock-solid infrastructure sees 50% fewer failures compared to AWS Elastic MapReduce
Enterprise-scale processing and data access
33. Why Qubole?
“We needed something that was reliable and easy to learn,
setup, use and put into production without the risk and high
expectations that comes with committing millions of dollars in
upfront investment. Qubole was that thing.”
Marc Rosen - Sr. Director, Data Analytics
Moved to Big data on the cloud (from internal Oracle
clusters) because getting to analysis was much
quicker than operating infrastructure themselves.
Used to answer client queries and power client
dashboards.
Use Cases
[Chart: # Commands Per Month – number of queries, axis scale 0 to 5,000]
Segment audiences based on their behavior including
such topics as user pathway and multi-dimensional
recency analysis
Build customer profiles (both uni/multivariate) across
thousands of first party (i.e., client CRM files) and third
party (i.e., demographic) segments
Simplify attribution insights showing the effects of upper
funnel prospecting on lower funnel remarketing media
strategies