Qubole Open Data Lake Platform on AWS
Accelerate your data lake adoption, reduce time to value, and lower cloud data lake cost by 50%
Table of Contents
Why do users choose Qubole on AWS?
User Spotlight 1: Malwarebytes
User Spotlight 2: Neustar
User Spotlight 3: Publicis Media
AWS and Qubole Native Integrations
•	 Amazon EC2 Spot
•	 Amazon SageMaker
•	 AWS FSx for Lustre
•	 AWS Glue
•	 Accessing via AWS Marketplace
Qubole is an open, simple, and secure data lake platform for machine learning, streaming, and ad-hoc analytics. Qubole on
AWS provides end-to-end data lake services such as AWS infrastructure management, data management, continuous data
engineering, analytics, and machine learning with near-zero administration. Qubole on AWS delivers:
Unified experience for data science, data engineering, and ad-hoc analytics
Native workbench that includes notebooks, dashboards, and a common interface for all commands and tasks. This enables data engineers and data scientists to collaborate using familiar tools, languages (SQL, Python, R, Scala), and data processing frameworks (Apache Spark, Presto, Hive, and Airflow).
Low cost and high reliability
Workload-aware autoscaling for optimized upscaling, rebalancing, and aggressive downscaling of clusters with a complete context of the workload, SLA, and priority of each job. Includes intelligent policy-based management of On-Demand and Spot nodes.
Enterprise-grade security
Fine-grained predefined or custom identity and access management roles to separate compute and data access. Qubole also offers role-based access controls for secure collaboration in notebooks and commands.
AWS Native Integrations
Native integration with AWS services such as EC2, S3, SageMaker, Redshift, and FSx for Lustre.
Why do users choose Qubole on AWS?
No other platform provides the openness and data workload flexibility of Qubole while radically accelerating data lake adoption,
reducing time to value, and lowering cloud data lake costs by 50 percent.
Users adopt Qubole Open Data Lake platform on AWS for the following reasons:
•	 Automated cluster lifecycle management
•	 Intelligent spot management
•	 Heterogeneous cluster management
•	 Automated platform management
•	 Workload-aware autoscaling
•	 Insights and reporting
•	 Built-in AWS-specific optimizations
•	 Self-service platform for all users
•	 Out-of-box tools for data science, data engineering, and analytics
•	 APIs and pre-built integrations with 3rd party solutions
•	 Single platform for data ingestion, processing, management, and consumption
•	 Open & standard file formats, languages and APIs
•	 Secure and granular access
Reduce Data Lake Cost by more than 50%
Near-Zero Administration
Fast Adoption of Their Data Lakes
Unify on Simple, Open and Secure Platform
About Malwarebytes
Malwarebytes is a cybersecurity company that produces anti-malware software for a variety of platforms. The company offers
consumers free, premium, and enterprise-grade versions of Malwarebytes, which detect, remove, and remediate computer
malware. Malwarebytes uses machine learning (ML) and artificial intelligence (AI) to identify and predict emerging threats before
they infect machines.
Business Problem
To predict, detect, and neutralize emerging threats, Malwarebytes processes billions of threat telemetry records daily. The
company then performs advanced analytics on this data to identify potential threats and runs ML and AI models to determine
what action to take to protect its customers.
Malwarebytes formerly relied on a third party on-premises deployment to ingest and process this data. But this system proved
inadequate. For example, the pipeline took a few days to complete Extract-Transform-Load (ETL) on one data stream alone. And,
queries on the ingested data were painfully slow.
Malwarebytes needed some way to modernize its big data processing to improve turnaround time while also keeping costs down.
Improved Processing Speed and Lowered Costs
Malwarebytes adopted Qubole in concert with Kafka (for ingesting data streams) and an AWS S3 data lake (for data storage). First,
it decoupled compute and storage. Second, “playing by the rules of the game of the cloud,” says Kulkarni, by leveraging things like
autoscaling (scaling out and scaling up, and being elastic and ephemeral in nature), low-cost compute instances (AWS Spot), and
storage (an AWS S3 data lake), significantly improved the efficiency of the data platform. Today, Malwarebytes uses Qubole to
process its data. About 60 to 70% of it is logs, telemetry, and other types of unstructured and semi-structured data.
“Qubole has really mastered the elasticity component of the cloud. Qubole helped us run our ETL at night, spinning up and spinning down clusters when we needed them.”
Manju Vasishta
Director of Data Science and Engineering, Malwarebytes
The platform’s ability to add and remove compute resources on demand, based on the workload or SLA and without human
intervention, in a matter of minutes has greatly increased the speed at which Malwarebytes processes critical data, directly
affecting the company’s ability to detect, predict, and remediate emerging threats.
Qubole aggregates and processes between 20 and 48 terabytes of raw data per day but delivers just 2 to 3 terabytes of meaningful
and actionable data. Qubole provides a single framework for processing data more quickly, whether for use in ML models for
predictions, in BI applications for business reporting, or for GDPR compliance. All of this takes just one full-time administrator,
plus three senior engineers a few times per quarter. The result is more powerful insights because they involve better data.
User Spotlight 1: Malwarebytes
Quick ROI
The platform’s ROI was quickly revealed by the meaningful data it helps to discover, which yields more powerful insights.
These insights drive key decisions that serve the business and its customers well. Examples include predictive insights about
emerging threats; marketing lead conversion propensity using ML algorithms; behavioral clustering of malware; and sentiment
analysis of reviews about Malwarebytes products and features on various social media platforms using advanced natural language
libraries.
Key Takeaway
•	 Greater data-processing capacity at much lower costs
•	 Improved efficiency to produce meaningful data and more
powerful insights
•	 Easy user onboarding resulting in high adoption
•	 Quick, tangible ROI
Business Problem Overview
The Neustar Unified Analytics platform helps marketers understand the impact of marketing on key business outcomes, and
provides tools to enable them to optimize the allocation of their marketing investments. First, it ingests large volumes of client
marketing data from a variety of sources. Then, it applies proprietary algorithms to build a predictive spend attribution model
on top of that data. This reveals how the client’s marketing spend correlates to revenue—enabling marketers to determine which
marketing channels are working, which ones aren’t, and what to do next.
To meet the demands of its growing client roster, the Neustar Unified Analytics team needed to confront the issues of variety,
volume, velocity, and veracity—often called the “four Vs.” At the same time, the team needed to keep operational costs down. For
this, it turned to Qubole.
Ensuring Data Veracity
“Models are only as good as the input data,” said Peterson. “If your data has lots of gaps, then the model won’t be good, no matter
what algorithm you use.” But most data scientists fail to detect “dirty” data until after they run the model—a typical data science
pipeline has data processing, modeling, and scoring stages. Data scientists must then fix the data and rerun many of their
processes, a task that might take weeks or even months, depending on processing speed and capacity. Often, this cycle repeats,
compounding the delay. Indeed, “the main reason these things take a long time is because of the reruns,” said Peterson.
Neustar Unified Analytics is different. Its machine learning models include a series of pre-checks and post-checks to validate
data. “Because we have a very comprehensive set of validation routines that run on Qubole, we’re able to isolate problems earlier
and avoid these reruns,” Peterson explains. As a result, data validation jobs require just one to one and a half run cycles. This
allows Neustar to deliver insights to its clients much faster and with the highest degree of confidence.
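The pre-check idea can be illustrated with a minimal sketch. The source does not describe Neustar's actual validation routines, so the record layout, the field names (`date`, `spend`), and the specific checks below are hypothetical. The point is only that cheap checks run before the expensive modeling stage, so that gaps are caught before a full run is wasted.

```python
from datetime import date, timedelta


def find_data_gaps(records, start, end):
    """Return a list of human-readable problems found in the input records.

    Hypothetical pre-check: every day in [start, end] must have at least one
    record, and every record must carry a non-negative numeric spend.
    """
    problems = []
    seen_days = set()
    for i, rec in enumerate(records):
        spend = rec.get("spend")
        if not isinstance(spend, (int, float)) or spend < 0:
            problems.append(f"record {i}: invalid spend {spend!r}")
        seen_days.add(rec.get("date"))

    # Walk the expected date range and report every day with no records.
    day = start
    while day <= end:
        if day not in seen_days:
            problems.append(f"missing data for {day.isoformat()}")
        day += timedelta(days=1)
    return problems


def run_pipeline(records, start, end, train_model):
    """Run the (expensive) model only when the pre-checks pass."""
    problems = find_data_gaps(records, start, end)
    if problems:
        # Fail fast: fixing the data now avoids a full rerun later.
        raise ValueError("pre-checks failed: " + "; ".join(problems))
    return train_model(records)
```

Here `train_model` stands in for whatever modeling step follows; a post-check would apply the same pattern to the model's output.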
“From a performance aspect, we want to be faster and faster…and Qubole fits right into this.”
Dan Peterson
Vice President of Systems Engineering, Neustar
About Neustar Unified Analytics
Neustar Unified Analytics is an integrated marketing measurement, analytics, and attribution solution from Neustar Information
Services, Inc. Neustar Unified Analytics is not a marketing campaign management tool. Rather, it runs alongside its clients’
campaign management platforms to measure and attribute overall marketing spend across campaigns. More than 90 Fortune
200 companies depend on Neustar Unified Analytics to assess and improve their marketing investments.
User Spotlight 2: Neustar
Keeping Costs Down
In a given month, the heavy compute time needed for most
machine learning jobs is 80 to 90 hours on average for each
customer. The rest of the time is typically consumed running
reports, tuning parameters, and so on—tasks that require
considerably less compute power. For this reason, before
Neustar Unified Analytics partnered with Qubole, its 400-odd
compute nodes per customer were frequently underutilized—
with no adjustment in cost. Now, Qubole aggressively—and
automatically—shuts down excess capacity during slow periods,
efficiently “packing” workloads in fewer nodes. This dramatically
reduces operating costs, without compromising performance or
delivery times.
The Neustar Unified Analytics team has reduced its costs by 85 to 95 percent
compared with its prior use of other vendor tools with
reserved compute instances and administrator-led scaling.
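A rough back-of-the-envelope calculation shows why shutting down idle capacity matters so much here. The 400 nodes and the 80 to 90 heavy-compute hours come from the text; the 720-hour month and the 20 nodes kept running for light work are assumptions made purely for illustration. The sketch counts node-hours only and ignores pricing differences such as Spot versus reserved instances, so it does not reproduce the 85 to 95 percent figure above.

```python
def node_hours_fixed(peak_nodes, hours_in_month):
    """Node-hours billed when the peak-sized cluster runs all month."""
    return peak_nodes * hours_in_month


def node_hours_autoscaled(peak_nodes, heavy_hours, light_nodes, hours_in_month):
    """Node-hours billed when the cluster shrinks outside heavy periods."""
    light_hours = hours_in_month - heavy_hours
    return peak_nodes * heavy_hours + light_nodes * light_hours


HOURS_IN_MONTH = 720  # assumption: 30-day month
PEAK_NODES = 400      # from the text: about 400 nodes per customer
HEAVY_HOURS = 85      # from the text: 80 to 90 hours, midpoint
LIGHT_NODES = 20      # hypothetical: nodes kept for reports and tuning

fixed = node_hours_fixed(PEAK_NODES, HOURS_IN_MONTH)
scaled = node_hours_autoscaled(PEAK_NODES, HEAVY_HOURS, LIGHT_NODES, HOURS_IN_MONTH)
reduction = 1 - scaled / fixed

print(f"fixed cluster:      {fixed} node-hours")
print(f"autoscaled cluster: {scaled} node-hours")
print(f"reduction:          {reduction:.0%}")
```

Under these assumptions the autoscaled cluster uses 46,700 node-hours instead of 288,000, a reduction of roughly 84 percent from downscaling alone.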
“Qubole is cheaper and much more economical than other vendors…but more importantly, it’s much more stable, and much more high-performing. Qubole offered us the best price for performance, and outstanding support.”
Dan Peterson
Vice President of Systems Engineering, Neustar
Key Takeaway:
•	 Decreased machine learning model turnaround from six
months to three weeks, end to end.
•	 Reduced model data validation cycle time by more than 62
percent.
•	 Cloud cost savings of 85 to 95 percent.
About Publicis Media
Publicis Media is one of the four solutions hubs of Publicis Groupe [Euronext Paris FR0000130577, CAC 40], alongside Publicis
Communications, Publicis Sapient, and Publicis Health. Led by Steve King, CEO, Publicis Media and COO, Publicis Groupe, Publicis Media
comprises Starcom, Zenith, Digitas, Spark Foundry, and Performics, powered by digital-first, data-driven global practices that
together deliver client value and business transformation. Publicis Media is committed to helping its clients navigate the modern
media landscape and is present in more than 100 countries with over 23,500 employees worldwide.
Business Problem Overview
A few years ago, multiple teams from the various media agencies merged to form Publicis Media. This merger revealed the need
for a central data and analytics platform. “We wanted our agency teams to be able to mine data, but not to have to deal with the
operational overhead of managing data infrastructure,” explains Darren Smith, who leads the engineering and data teams. “Our
intent was to democratize data.”
According to Smith, the team’s existing data infrastructure “was a bunch of bespoke solutions” that combined AWS Redshift, large
monolithic on-premises servers, and various unwieldy traditional technologies. Offering a central data and analytics platform
would require both a complete overhaul of this infrastructure and some way to tie all of its pieces together.
A Centralized Platform for Democratizing Data
“The focus of our team was to build a data architecture and infrastructure that would allow our agency teams to move forward
in a big data world,” says Joe Tan, director of products at Publicis Media. The resulting infrastructure couples a global data lake
—which stores large volumes of multiple types of data—with a framework to ingest and process data. In addition to building this
data infrastructure, Tan’s team had another job: “to provide tools that allow agency teams to really focus on doing analytics for
their clients instead of having to worry about data ops and data engineering.” Qubole enables agency teams to “work with the
data they’re used to in the tools and languages they’re used to, like Tableau and Presto, or SQL, Python, R, Scala, etc.” says Tan. It
also helps Publicis Media make data available to users with different skill sets. “It even,” says Tan, “gives users the ability to learn
how to do more with minimal additional effort.” As more and more clients have grasped the potential power of Publicis Media’s
platform, Qubole has played a key role in helping increase its adoption.
User Spotlight 3: Publicis Media
“We have had a steady growth rate of one to two agency clients onboarding onto our platform per month,” says Tan. “That might not sound like a lot, but a lot of those teams service multiple clients of their own, so it’s pretty impactful.”
Joe Tan
Director of products, Publicis Media
Scaled with Customer Data Demands
Publicis Media handles lots of data for its agency clients. Its data
lake stores close to a petabyte of it. Agency clients use this data
to run machine learning models for analytics purposes. Scaling
to process larger data sets posed a challenge before Qubole. “I
regularly walked into offices and ran into someone who’d had
a model running for six hours,” recalls Tan. Qubole solved this
problem by enabling agencies to automatically scale up compute
infrastructure for large jobs and to aggressively scale back down
when a job is complete to keep costs low. As a result, jobs that once took
six hours to complete can now be finished in mere minutes,
and the platform handles almost 10,000 queries per month on average. In addition,
Qubole supports multi-region data availability without
latency, further improving the performance and consistency of
Publicis Media’s data globally.
Secure Enterprise-grade Data Lake Platform
Data security is top of mind for Publicis Media. Qubole addresses
its requirements with regard to single sign-on, strict role-based
access control, and agency data isolation, among other security
issues. While both Smith and Tan see these features as “table
stakes,” Smith acknowledges that, “A lot of vendors don’t support
them.”
Key Takeaway:
•	 A central data and analytics platform that democratizes data
•	 Ability to manage nearly 1 petabyte of data
•	 Reduction in the model run time from six hours to mere
minutes, with almost 10,000 queries per month on average
•	 Multi-region data availability without latency
•	 Easy administration with an administrator-to-user ratio of
3:100
•	 Support for robust security and compliance requirements
“Qubole really meshed well with the overall architecture and design of our data lake. I don’t think we could have found a better platform.”
Joe Tan
Director of products, Publicis Media
AWS and Qubole Native Integrations
Amazon EC2 Spot
Qubole Open Data Lake platform provides a policy-based way to automate the spot bidding process, allowing data teams to take
full advantage of spot instances without devoting resources to managing them.
Qubole uses AWS spot nodes when dynamically adding cluster nodes or as part of the core minimum nodes for a cluster. Users
select the maximum bid they are willing to pay for a spot instance, and the platform then automatically places bids for them.
Qubole clusters begin with on-demand instances and rebalance automatically by replacing on-demand instances with spot nodes
when spot availability is higher. This makes it practical to use the platform for advanced provisioning strategies, which come
in three categories:
•	 On-Demand Only: Auto-scaled nodes that are added will only be On-Demand instances.
•	 Spot Instances Only: Auto-scaled nodes that are added will only be Spot nodes.
•	 Hybrid: Auto-scaled nodes combine On-Demand and Spot nodes. Users are able to choose what the maximum percentage
of Spot nodes is.
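The three strategies can be sketched as a small function that decides how a batch of autoscaled nodes is divided. This is an illustration only: the function name, the strategy labels, and the rounding rule are assumptions, not Qubole's implementation or API.

```python
import math


def split_autoscaled_nodes(nodes_to_add, strategy, max_spot_percent=50):
    """Return (on_demand, spot) counts for a batch of autoscaled nodes.

    strategy is one of "on_demand_only", "spot_only", or "hybrid".
    max_spot_percent applies only to the hybrid strategy.
    """
    if nodes_to_add < 0:
        raise ValueError("nodes_to_add must be non-negative")
    if strategy == "on_demand_only":
        return nodes_to_add, 0
    if strategy == "spot_only":
        return 0, nodes_to_add
    if strategy == "hybrid":
        if not 0 <= max_spot_percent <= 100:
            raise ValueError("max_spot_percent must be between 0 and 100")
        # Round down so the Spot share never exceeds the configured cap.
        spot = math.floor(nodes_to_add * max_spot_percent / 100)
        return nodes_to_add - spot, spot
    raise ValueError(f"unknown strategy: {strategy!r}")
```

For example, `split_autoscaled_nodes(10, "hybrid", 30)` yields seven On-Demand nodes and three Spot nodes.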
The platform also has additional built-in intelligence to maximize spot instance usage for the workloads:
•	 Qubole Placement Policy: Qubole has multiple pricing options for stable spot nodes and volatile spot nodes. Via the
placement policy, Qubole spreads out underlying storage across stable and volatile nodes, thereby minimizing the risk of job
loss due to loss of a Spot instance.
•	 Fallback to on-demand instances after a configurable timeout: Qubole can automatically fall back to requesting on-
demand nodes if spot nodes cannot be provisioned within a configurable timeout period.
•	 Intelligent AZ Selection: Spot pricing can vary by AZ (availability zone), sometimes by up to 15-20%. Qubole can automatically
select an optimal AZ based on Spot pricing for the cluster instance type chosen. Currently, AZ selection is only supported
for non-VPC clusters.
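Two of these behaviors, AZ selection and the On-Demand fallback, are simple enough to sketch. The code below is a hedged illustration rather than Qubole's logic: the price table is supplied by the caller, and `request_spot` and `request_on_demand` are hypothetical callables standing in for whatever provisioning calls a real system would make.

```python
def pick_cheapest_az(spot_prices):
    """Return the availability zone with the lowest Spot price.

    spot_prices maps an AZ name to the current hourly Spot price for the
    chosen instance type. Ties are broken by AZ name for determinism.
    """
    if not spot_prices:
        raise ValueError("no Spot prices available")
    return min(spot_prices, key=lambda az: (spot_prices[az], az))


def provision_node(request_spot, request_on_demand, timeout_seconds):
    """Try Spot first; fall back to On-Demand if Spot is not ready in time.

    request_spot(timeout_seconds) is a hypothetical callable that returns a
    node identifier, or None if no Spot node was provisioned within the
    timeout. request_on_demand() is a hypothetical callable that returns a
    node identifier.
    """
    node = request_spot(timeout_seconds)
    if node is not None:
        return node, "spot"
    return request_on_demand(), "on_demand"
```

Keeping the timeout configurable lets a team trade cost against latency: a longer wait gives Spot more chances to succeed, while a shorter one protects job start times.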
Amazon SageMaker
The SageMaker and Qubole integration allows enterprise users to leverage Qubole Notebooks and Apache Spark on Qubole to
explore, clean, and prepare data in the format required for machine learning algorithms. Once the raw data is cleansed and
prepared in Qubole, it is used to train ML algorithms in SageMaker. There are two ways for users to leverage this integration.
•	 Prepare Data and Initiate Training from Qubole
Qubole loads data from multiple data sources such as transactional databases, data warehouses, streaming data, interaction
data such as clickstreams, social media feeds, sensor data, log files, and more. Users read their data into Qubole Spark data
frames and use Qubole Notebooks to transform, cleanse, and prepare the data. Once the data is stored back on Amazon S3, the
users initiate model training from Qubole, using the estimator in the SageMaker Spark library. This initiates ML training
in SageMaker, builds the model, and creates the endpoint to host that model.
•	 Prepare Data in Qubole from SageMaker Notebook
Alternatively, SageMaker users enhance the SageMaker data processing capabilities by connecting a SageMaker Notebook
instance to Qubole. Data scientists use Apache Spark to process and prepare data at scale with Qubole. Qubole Open Data
Lake Platform greatly reduces the cost of computing by consuming less compute and/or consuming cheaper compute. With
this integration, data scientists use Qubole to cleanse and prepare data (transform, featurize, join, etc.) prior to ML training in
Amazon SageMaker.
Qubole High Level Architecture
AWS FSx for Lustre
Amazon FSx for Lustre and Qubole Open Data Lake Platform together reduce users’ compute costs and minimize intermediate data
loss while running workloads. Users do not pay to maintain idle AWS EC2 instances and need not worry about losing intermediate
output (shuffle data) when spot nodes are interrupted. Qubole uses Amazon FSx for Lustre to store and process intermediate data
through its parallel, high-speed file system. By doing so, users no longer need to retain idle EC2 instances to store this intermediate
data. Instead, Amazon FSx for Lustre allows them to re-use the data otherwise normally held within EC2 local storage.
AWS Glue
Qubole and AWS Glue provide users with the flexibility and choice of a unified shared metastore with a data lake platform. Users run
Glue’s data crawlers to scan and classify data, extract schema details, and build the data catalog. Qubole’s platform is configured
with this catalog as the metastore, which is shared across the user’s AWS accounts, applications, and services. With Qubole’s
support for multiple open source frameworks, users run Hive and Presto queries and Spark jobs leveraging this catalog. Alternatively,
users can continue using their existing or Qubole-hosted metastore and synchronize it with the Glue Data Catalog.
Accessing via AWS Marketplace
Qubole makes it easier for users to access, manage, monitor, and govern their data in an S3 data lake with the Open Data Lake Platform.
Users can subscribe to and access the platform through AWS Marketplace with automatic account setup, AWS authentication, and
simplified user onboarding, getting started with their data in less than an hour.
1. Copy Account ID and External ID from QDS
2. Create IAM Policies on AWS
3. Create IAM Roles on AWS
4. Link AWS and QDS accounts
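The account ID and external ID copied in step 1 are typically used in the trust policy of the IAM role created in step 3. The sketch below builds such a trust policy using the standard AWS cross-account pattern. The account ID and external ID shown are placeholders, the function name is hypothetical, and the permissions policies from step 2 are not shown because the source does not list them.

```python
import json


def build_trust_policy(trusted_account_id, external_id):
    """Build an IAM role trust policy for cross-account access.

    The role can be assumed only by the given AWS account, and only when the
    caller supplies the matching external ID.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


# Placeholder values; the real ones are copied from QDS in step 1.
policy = build_trust_policy("123456789012", "example-external-id")
print(json.dumps(policy, indent=2))
```

The external ID condition guards against another party tricking the trusted account into assuming the role on its behalf, which is why it is copied alongside the account ID.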
Learn More
For the latest information about our product and services, please see the following resources:
•	 Qubole Whitepapers
•	 Qubole Case Studies
•	 Qubole Technical Documentation
For more information:
Contact: Try Qubole Open Data Lake Platform for Free:
sales@qubole.com
469 El Camino Real, Suite 205
Santa Clara, CA 95050
(855) 423-6674 | info@qubole.com
WWW.QUBOLE.COM
About Qubole
Qubole is the open data lake company that provides a simple and secure data lake platform for machine learning, streaming, and
ad-hoc analytics. No other platform provides the openness and data workload flexibility of Qubole while radically accelerating
data lake adoption, reducing time to value, and lowering cloud data lake costs by 50 percent. Qubole is trusted by leading brands
such as Expedia, Disney, Oracle, Gannett and Adobe to spur innovation and to transform their businesses for the era of big data.
For more information visit us at www.qubole.com
You can visit the AWS Marketplace anytime
to get up and running with Qubole!
TRY QUBOLE IN AWS TODAY!
Start your 30 Day Free Trial now

More Related Content

What's hot

Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
Amazon Web Services
 
Visualization with Amazon QuickSight
Visualization with Amazon QuickSightVisualization with Amazon QuickSight
Visualization with Amazon QuickSight
Amazon Web Services
 
2016 AWS Big Data Solution Days
2016 AWS Big Data Solution Days2016 AWS Big Data Solution Days
2016 AWS Big Data Solution Days
Amazon Web Services
 
What's New with Amazon Redshift ft. Dow Jones (ANT350-R) - AWS re:Invent 2018
What's New with Amazon Redshift ft. Dow Jones (ANT350-R) - AWS re:Invent 2018What's New with Amazon Redshift ft. Dow Jones (ANT350-R) - AWS re:Invent 2018
What's New with Amazon Redshift ft. Dow Jones (ANT350-R) - AWS re:Invent 2018
Amazon Web Services
 
Building a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarBuilding a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - Webinar
Amazon Web Services
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
Amazon Web Services
 
Building your first Data lake platform
Building your first Data lake platform Building your first Data lake platform
Building your first Data lake platform
Amazon Web Services
 
Preparing Data for the Lake
Preparing Data for the LakePreparing Data for the Lake
Preparing Data for the Lake
Amazon Web Services
 
Log Analytics with AWS
Log Analytics with AWSLog Analytics with AWS
Log Analytics with AWS
Amazon Web Services
 
Big Data on AWS
Big Data on AWSBig Data on AWS
Big Data on AWS
Amazon Web Services
 
Preparing Data for the Lake
Preparing Data for the LakePreparing Data for the Lake
Preparing Data for the Lake
Amazon Web Services
 
Best Practices for Building a Data Lake on AWS
Best Practices for Building a Data Lake on AWSBest Practices for Building a Data Lake on AWS
Best Practices for Building a Data Lake on AWS
Amazon Web Services
 
Preparing Data for the Lake
Preparing Data for the LakePreparing Data for the Lake
Preparing Data for the Lake
Amazon Web Services
 
Building a Server-less Data Lake on AWS - Technical 301
Building a Server-less Data Lake on AWS - Technical 301Building a Server-less Data Lake on AWS - Technical 301
Building a Server-less Data Lake on AWS - Technical 301
Amazon Web Services
 
How Amazon.com uses AWS Analytics
How Amazon.com uses AWS AnalyticsHow Amazon.com uses AWS Analytics
How Amazon.com uses AWS Analytics
Amazon Web Services
 
Build Data Lakes & Analytics on AWS: Patterns & Best Practices
Build Data Lakes & Analytics on AWS: Patterns & Best PracticesBuild Data Lakes & Analytics on AWS: Patterns & Best Practices
Build Data Lakes & Analytics on AWS: Patterns & Best Practices
Amazon Web Services
 
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...
Amazon Web Services
 
Fast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSFast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWS
Amazon Web Services
 
Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...
Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...
Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...
Amazon Web Services
 
Leadership Session: AWS Database and Analytics (DAT206-L) - AWS re:Invent 2018
Leadership Session: AWS Database and Analytics (DAT206-L) - AWS re:Invent 2018Leadership Session: AWS Database and Analytics (DAT206-L) - AWS re:Invent 2018
Leadership Session: AWS Database and Analytics (DAT206-L) - AWS re:Invent 2018
Amazon Web Services
 

What's hot (20)

Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Visualization with Amazon QuickSight
Visualization with Amazon QuickSightVisualization with Amazon QuickSight
Visualization with Amazon QuickSight
 
2016 AWS Big Data Solution Days
2016 AWS Big Data Solution Days2016 AWS Big Data Solution Days
2016 AWS Big Data Solution Days
 
What's New with Amazon Redshift ft. Dow Jones (ANT350-R) - AWS re:Invent 2018
What's New with Amazon Redshift ft. Dow Jones (ANT350-R) - AWS re:Invent 2018What's New with Amazon Redshift ft. Dow Jones (ANT350-R) - AWS re:Invent 2018
What's New with Amazon Redshift ft. Dow Jones (ANT350-R) - AWS re:Invent 2018
 
Building a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarBuilding a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - Webinar
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Building your first Data lake platform
Building your first Data lake platform Building your first Data lake platform
Building your first Data lake platform
 
Preparing Data for the Lake
Preparing Data for the LakePreparing Data for the Lake
Preparing Data for the Lake
 
Log Analytics with AWS
Log Analytics with AWSLog Analytics with AWS
Log Analytics with AWS
 
Big Data on AWS
Big Data on AWSBig Data on AWS
Big Data on AWS
 
Preparing Data for the Lake
Preparing Data for the LakePreparing Data for the Lake
Preparing Data for the Lake
 
Best Practices for Building a Data Lake on AWS
Best Practices for Building a Data Lake on AWSBest Practices for Building a Data Lake on AWS
Best Practices for Building a Data Lake on AWS
 
Preparing Data for the Lake
Preparing Data for the LakePreparing Data for the Lake
Preparing Data for the Lake
 
Building a Server-less Data Lake on AWS - Technical 301
Building a Server-less Data Lake on AWS - Technical 301Building a Server-less Data Lake on AWS - Technical 301
Building a Server-less Data Lake on AWS - Technical 301
 
How Amazon.com uses AWS Analytics
How Amazon.com uses AWS AnalyticsHow Amazon.com uses AWS Analytics
How Amazon.com uses AWS Analytics
 
Build Data Lakes & Analytics on AWS: Patterns & Best Practices
Build Data Lakes & Analytics on AWS: Patterns & Best PracticesBuild Data Lakes & Analytics on AWS: Patterns & Best Practices
Build Data Lakes & Analytics on AWS: Patterns & Best Practices
 
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...
 
Fast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSFast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWS
 
Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...
Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...
Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...
 
Leadership Session: AWS Database and Analytics (DAT206-L) - AWS re:Invent 2018
Leadership Session: AWS Database and Analytics (DAT206-L) - AWS re:Invent 2018Leadership Session: AWS Database and Analytics (DAT206-L) - AWS re:Invent 2018
Leadership Session: AWS Database and Analytics (DAT206-L) - AWS re:Invent 2018
 

Similar to Qubole on AWS - White paper

Estimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics PlatformEstimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics Platform
DATAVERSITY
 
Co 4, session 2, aws analytics services
Qubole on AWS - White paper

  • 1. Qubole Open Data Lake Platform on AWS
    Accelerate your data lake adoption, reduce time to value, and lower cloud data lake cost by 50%
  • 2. Table of Contents
    Why do users choose Qubole on AWS? 3
    User Spotlight 1: Malwarebytes 4
    User Spotlight 2: Neustar 6
    User Spotlight 3: Publicis Media 8
    AWS and Qubole Native Integrations 10
    • Amazon EC2 Spot
    • Amazon SageMaker
    • AWS FSx for Lustre
    • AWS Glue
    • Accessing via AWS Marketplace
  • 3. Why do users choose Qubole on AWS?
    Qubole is an open, simple, and secure data lake platform for machine learning, streaming, and ad-hoc analytics. Qubole on AWS provides end-to-end data lake services such as AWS infrastructure management, data management, continuous data engineering, analytics, and machine learning with near-zero administration. Qubole on AWS delivers:

    Unified experience for data science, data engineering, ad-hoc analytics
    Native workbench that includes notebooks, dashboards, and a common interface for all commands and tasks. This enables data engineers and data scientists to collaborate using familiar tools, languages (SQL, Python, R, Scala), and data processing frameworks (Apache Spark, Presto, Hive, and Airflow).

    Low cost and high reliability
    Workload-aware autoscaling for optimized upscaling, rebalancing, and aggressive downscaling of clusters with a complete context of the workload, SLA, and priority of each job. Includes intelligent policy-based management of On-Demand and Spot nodes.

    Enterprise-grade security
    Fine-grained predefined or custom identity and access management roles to separate compute and data access. Qubole also offers role-based access controls for secure collaboration in notebooks and commands.

    AWS Native Integrations
    Native integration with AWS services like EC2, S3, SageMaker, Redshift, and AWS FSx for Lustre.

    No other platform provides the openness and data workload flexibility of Qubole while radically accelerating data lake adoption, reducing time to value, and lowering cloud data lake costs by 50 percent.

    Users adopt Qubole Open Data Lake platform on AWS for the following reasons:
    • Automated cluster lifecycle management
    • Intelligent spot management
    • Heterogeneous cluster management
    • Automated platform management
    • Workload-aware autoscaling
    • Insights and reporting
    • Built-in AWS-specific optimizations
    • Self-service platform for all users
    • Out-of-box tools for data science, data engineering, and analytics
    • APIs and pre-built integrations with 3rd party solutions
    • Single platform for data ingestion, processing, management, and consumption
    • Open and standard file formats, languages, and APIs
    • Secure and granular access

    Reduce Data Lake Cost by more than 50%
    Near-Zero Administration
    Fast Adoption of Their Data Lakes
    Unify on Simple, Open and Secure Platform
  • 4. User Spotlight 1: Malwarebytes

    About Malwarebytes
    Malwarebytes is a cybersecurity company that produces anti-malware software for a variety of platforms. The company offers consumers free, premium, and enterprise-grade versions of Malwarebytes, which detect, remove, and remediate computer malware. Malwarebytes uses machine learning (ML) and artificial intelligence (AI) to identify and predict emerging threats before they infect machines.

    Business Problem
    To predict, detect, and neutralize emerging threats, Malwarebytes processes billions of threat telemetry records daily. The company then performs advanced analytics on this data to identify potential threats and runs ML and AI models to determine what action to take to protect its customers. Malwarebytes formerly relied on a third-party on-premises deployment to ingest and process this data, but this system proved inadequate. For example, the pipeline took a few days to complete Extract-Transform-Load (ETL) on one data stream alone, and queries on the ingested data were painfully slow. Malwarebytes needed some way to modernize its big data processing to improve turnaround time while also keeping costs down.

    Improved Processing Speed and Lowered Costs
    Malwarebytes adopted Qubole in concert with Kafka (for ingesting data streams) and an AWS S3 data lake (for data storage). First, it decoupled compute and storage. Second, “playing by the rules of the game of the cloud,” says Kulkarni, by leveraging things like autoscaling (scaling out and scaling up, and being elastic and ephemeral in nature), low-cost compute instances (AWS Spot), and storage (an AWS S3 data lake), significantly improved the efficiency of the data platform.

    Today, Malwarebytes uses Qubole to process its data. About 60 to 70% of the data processed in Qubole is logs, telemetry, and other types of unstructured and semi-structured data. The platform’s ability to add and remove compute resources on demand, based on the workload or SLA and without human intervention, in a matter of minutes has greatly increased the speed at which Malwarebytes processes critical data, directly affecting the company’s ability to detect, predict, and remediate emerging threats.

    Qubole aggregates and processes between 20 and 48 terabytes of raw data per day but delivers just 2 to 3 terabytes of meaningful and actionable data. Qubole provides a single framework for processing data more quickly, whether for use in ML models for predictions, in BI applications for business reporting, or for GDPR compliance, all with just one full-time administrator, plus three senior engineers a few times per quarter. The result is more powerful insights because they involve better data.

    “Qubole has really mastered the elasticity component of the cloud. Qubole helped us run our ETL at night, spinning up and spinning down clusters when we needed them.”
    Manju Vasishta, Director of Data Science and Engineering, Malwarebytes
  • 5. Quick ROI
    The platform’s ROI was quickly revealed by the meaningful data it helps to discover, which yields more powerful insights. These insights drive key decisions that serve the business and its customers well. Examples include:
    • Predictive insights about emerging threats
    • Marketing lead conversion propensity using ML algorithms
    • Behavioral clustering of malware
    • Sentiment analysis of reviews about Malwarebytes products and features on various social media platforms using advanced natural language libraries

    Key Takeaway
    • Greater data-processing capacity at much lower costs
    • Improved efficiency to produce meaningful data and more powerful insights
    • Easy user onboarding resulting in high adoption
    • Quick, tangible ROI
  • 6. User Spotlight 2: Neustar

About Neustar Unified Analytics
Neustar Unified Analytics is an integrated marketing measurement, analytics, and attribution solution from Neustar Information Services, Inc. Neustar Unified Analytics is not a marketing campaign management tool. Rather, it runs alongside its clients’ campaign management platforms to measure and attribute overall marketing spend across campaigns. More than 90 Fortune 200 companies depend on Neustar Unified Analytics to assess and improve their marketing investments.

Business Problem Overview
The Neustar Unified Analytics platform helps marketers understand the impact of marketing on key business outcomes and provides tools that enable them to optimize the allocation of their marketing investments. First, it ingests large volumes of client marketing data from a variety of sources. Then, it applies proprietary algorithms to build a predictive spend attribution model on top of that data. This reveals how the client’s marketing spend correlates to revenue, enabling marketers to determine which marketing channels are working, which ones aren’t, and what to do next. To meet the demands of its growing client roster, the Neustar Unified Analytics team needed to confront the issues of variety, volume, velocity, and veracity, often called the “four Vs.” At the same time, the team needed to keep operational costs down. For this, it turned to Qubole.

Ensuring Data Veracity
“Models are only as good as the input data,” said Peterson. “If your data has lots of gaps, then the model won’t be good, no matter what algorithm you use.” A typical data science pipeline has data processing, modeling, and scoring stages, and most data scientists fail to detect “dirty” data until after they run the model. They must then fix the data and rerun many of their processes, a task that might take weeks or even months, depending on processing speed and capacity. Often, this cycle repeats, compounding the delay. Indeed, “the main reason these things take a long time is because of the reruns,” said Peterson.

Neustar Unified Analytics is different. Its machine learning models include a series of pre-checks and post-checks to validate data. “Because we have a very comprehensive set of validation routines that run on Qubole, we’re able to isolate problems earlier and avoid these reruns,” Peterson explains. As a result, data validation jobs require just one to one and a half run cycles. This allows Neustar to deliver insights to its clients much faster and with the highest degree of confidence.

“From a performance aspect, we want to be faster and faster…and Qubole fits right into this.”
Dan Peterson, Vice President of Systems Engineering, Neustar

Keeping Costs Down
In a given month, the heavy compute time needed for most machine learning jobs is 80 to 90 hours on average for each customer. The rest of the time is typically consumed running reports, tuning parameters, and so on, tasks that require considerably less compute power. For this reason, before Neustar Unified Analytics partnered with Qubole, its 400-odd compute nodes per customer were frequently underutilized, with no adjustment in cost. Now, Qubole aggressively and automatically shuts down excess capacity during slow periods, efficiently “packing” workloads into fewer nodes. This dramatically reduces operating costs without compromising performance or delivery times. The Neustar Unified Analytics team has reduced its costs by 85 to 95 percent compared with its prior use of other vendor tools with reserved compute instances and administrator-led scaling.
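The utilization figures in "Keeping Costs Down" can be sanity-checked with some rough arithmetic. The sketch below assumes a month of about 730 hours, which is an assumption of this example and not a figure from Neustar. It only estimates how small a share of the month needs heavy compute.

```python
# Rough utilization arithmetic for the Neustar figures quoted above.
# Assumption (not from the source): an average month has about 730 hours.
HOURS_PER_MONTH = 730


def heavy_compute_share(heavy_hours: float,
                        hours_per_month: float = HOURS_PER_MONTH) -> float:
    """Fraction of the month during which heavy ML compute is needed."""
    return heavy_hours / hours_per_month


# The source quotes 80 to 90 hours of heavy compute per customer per month.
low = heavy_compute_share(80)
high = heavy_compute_share(90)

print(f"Heavy compute is needed {low:.1%} to {high:.1%} of the month")
print(f"Peak capacity sits idle {1 - high:.1%} to {1 - low:.1%} of the month")
```

Under that assumption, peak capacity is needed only a little over a tenth of the time, which makes a fixed fleet of reserved nodes a poor fit. The reported savings of 85 to 95 percent are Neustar's own figure and are not derived from this calculation.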
  • 7.
“Qubole is cheaper and much more economical than other vendors…but more importantly, it’s much more stable, and much more high-performing. Qubole offered us the best price for performance, and outstanding support.”
Dan Peterson, Vice President of Systems Engineering, Neustar

Key Takeaway:
• Decreased machine learning model turnaround from six months to three weeks, end to end.
• Reduced model data validation cycle time by more than 62 percent.
• Cloud cost savings of 85 to 95 percent.
  • 8. User Spotlight 3: Publicis Media

About Publicis Media
Publicis Media is one of the four solutions hubs of Publicis Groupe [Euronext Paris FR0000130577, CAC 40], alongside Publicis Communications, Publicis Sapient and Publicis Health. Led by Steve King, CEO of Publicis Media and COO of Publicis Groupe, Publicis Media comprises Starcom, Zenith, Digitas, Spark Foundry, and Performics, powered by digital-first, data-driven global practices that together deliver client value and business transformation. Publicis Media is committed to helping its clients navigate the modern media landscape and is present in more than 100 countries with over 23,500 employees worldwide.

Business Problem Overview
A few years ago, multiple teams from the various media agencies merged to form Publicis Media. This merger revealed the need for a central data and analytics platform. “We wanted our agency teams to be able to mine data, but not to have to deal with the operational overhead of managing data infrastructure,” explains Darren Smith, who leads the engineering and data teams. “Our intent was to democratize data.” According to Smith, the team’s existing data infrastructure “was a bunch of bespoke solutions” that combined AWS Redshift, large monolithic on-premises servers, and various unwieldy traditional technologies. Offering a central data and analytics platform would require both a complete overhaul of this infrastructure and some way to tie all of its pieces together.

A Centralized Platform for Democratizing Data
“The focus of our team was to build a data architecture and infrastructure that would allow our agency teams to move forward in a big data world,” says Joe Tan, director of products at Publicis Media. The resulting infrastructure couples a global data lake, which stores large volumes of multiple types of data, with a framework to ingest and process data. In addition to building this data infrastructure, Tan’s team had another job: “to provide tools that allow agency teams to really focus on doing analytics for their clients instead of having to worry about data ops and data engineering.” Qubole enables agency teams to “work with the data they’re used to in the tools and languages they’re used to, like Tableau and Presto, or SQL, Python, R, Scala, etc.,” says Tan. It also helps Publicis Media make data available to users with different skill sets. “It even,” says Tan, “gives users the ability to learn how to do more with minimal additional effort.” As more and more clients have grasped the potential power of Publicis Media’s platform, Qubole has played a key role in helping increase its adoption.

“We have had a steady growth rate of one to two agency clients onboarding onto our platform per month,” says Tan. “That might not sound like a lot, but a lot of those teams service multiple clients of their own, so it’s pretty impactful.”
Joe Tan, Director of Products, Publicis Media

Scaled with Customer Data Demands
Publicis Media handles lots of data for its agency clients. Its data lake stores close to a petabyte of it. Agency clients use this data to run machine learning models for analytics purposes. Scaling to process larger data sets posed a challenge before Qubole. “I regularly walked into offices and ran into someone who’d had a model running for six hours,” recalls Tan. Qubole solved this problem by enabling agencies to automatically scale up compute infrastructure for large jobs and to aggressively scale back down when a job is complete to keep costs low. Jobs that once took six hours to complete can now be finished in mere minutes, with almost 10,000 queries per month on average. Qubole also supports multi-region data availability without latency, further improving the performance and consistency of Publicis Media’s data globally.
  • 9. Secure Enterprise-grade Data Lake Platform
Data security is top of mind for Publicis Media. Qubole addresses its requirements with regard to single sign-on, strict role-based access control, and agency data isolation, among other security issues. While both Smith and Tan see these features as “table stakes,” Smith acknowledges that, “A lot of vendors don’t support them.”

Key Takeaway:
• A central data and analytics platform that democratizes data
• Ability to manage nearly 1 petabyte of data
• Reduction in the model run time from six hours to mere minutes, with almost 10,000 queries per month on average
• Multi-region data availability without latency
• Easy administration with an administrator-to-user ratio of 3:100
• Support for robust security and compliance requirements

“Qubole really meshed well with the overall architecture and design of our data lake. I don’t think we could have found a better platform.”
Joe Tan, Director of Products, Publicis Media
  • 10. AWS and Qubole Native Integrations

Amazon EC2 Spot
Qubole Open Data Lake Platform provides a policy-based way to automate the spot bidding process, allowing data teams to take full advantage of spot instances without devoting resources to managing them. Qubole uses AWS spot nodes when dynamically adding cluster nodes or as part of the core minimum nodes for a cluster. Users select the maximum bid they are willing to pay for a spot instance. The platform then automatically places bids for them, making the process easy to use. Qubole clusters begin with on-demand instances and rebalance automatically by swapping on-demand instances for spot nodes when spot availability is higher.

This ease of use makes advanced provisioning strategies practical on the Qubole platform. Those strategies come in three categories:
• On-Demand Only: Auto-scaled nodes that are added will only be On-Demand instances.
• Spot Instances Only: Auto-scaled nodes that are added will only be Spot nodes.
• Hybrid: Auto-scaled nodes combine On-Demand and Spot nodes. Users choose the maximum percentage of Spot nodes.

The platform also has additional built-in intelligence to maximize spot instance usage for the workloads:
• Qubole Placement Policy: Qubole has multiple pricing options for stable spot nodes and volatile spot nodes. Via the placement policy, Qubole spreads out underlying storage across stable and volatile nodes, thereby minimizing the risk of job loss due to loss of a Spot instance.
• Fallback to on-demand instances after a configurable timeout: Qubole can automatically fall back to requesting on-demand nodes if spot nodes cannot be provisioned within a configurable timeout period.
• Intelligent AZ Selection: Spot pricing can vary by AZ (availability zone), sometimes by up to 15-20%. Qubole can automatically select an optimal AZ based on Spot pricing for the cluster instance type chosen. Currently, AZ selection is only supported for non-VPC clusters.
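The provisioning strategies and the fallback and AZ-selection behaviors described above can be illustrated with a small sketch. This is not Qubole's implementation or API. The function names, the `spot_available` parameter (standing in for "spot nodes obtained before the timeout expired"), and the prices are all hypothetical and exist only to make the policy logic concrete.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class NodePlan:
    on_demand: int
    spot: int


def plan_autoscaled_nodes(nodes_needed: int, max_spot_percent: int,
                          spot_available: int) -> NodePlan:
    """Split newly added nodes between Spot and On-Demand (hypothetical helper).

    max_spot_percent = 0   -> "On-Demand Only"
    max_spot_percent = 100 -> "Spot Instances Only"
    anything in between    -> "Hybrid"
    """
    if not 0 <= max_spot_percent <= 100:
        raise ValueError("max_spot_percent must be between 0 and 100")
    # Upper bound on Spot nodes allowed by the policy (rounded down).
    spot_cap = nodes_needed * max_spot_percent // 100
    # Simplified fallback: Spot nodes that could not be obtained in time
    # are requested as On-Demand nodes instead.
    spot = min(spot_cap, spot_available)
    return NodePlan(on_demand=nodes_needed - spot, spot=spot)


def cheapest_az(spot_prices: Dict[str, float]) -> str:
    """Pick the availability zone with the lowest Spot price."""
    return min(spot_prices, key=spot_prices.get)


# Hybrid policy: at most 50% Spot, but only 3 Spot nodes were obtainable.
print(plan_autoscaled_nodes(nodes_needed=10, max_spot_percent=50, spot_available=3))

# Made-up prices for one instance type across three zones.
print(cheapest_az({"us-east-1a": 0.12, "us-east-1b": 0.10, "us-east-1c": 0.11}))
```

A real scheduler would also account for bid prices, node stability, and workload SLAs, which this sketch leaves out.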
Amazon SageMaker
The SageMaker and Qubole integration allows enterprise users to leverage Qubole Notebooks and Apache Spark on Qubole to explore, clean, and prepare data in the format required for machine learning algorithms. Once the raw data is cleansed and prepared in Qubole, it is used to train ML algorithms in SageMaker. There are two ways for users to leverage this integration:
• Prepare Data and Initiate Training from Qubole: Qubole loads data from multiple data sources such as transactional databases, data warehouses, streaming data, interaction data such as clickstreams, social media feeds, sensor data, log files, and more. Users read their data into Qubole Spark data frames and use Qubole Notebooks to transform, cleanse, and prepare the data. Once the data is stored back on Amazon S3, users initiate model training from Qubole using the estimator in the SageMaker Spark library. This initiates ML training in SageMaker, builds the model, and creates the endpoint to host that model.
• Prepare Data in Qubole from SageMaker Notebook: Alternatively, SageMaker users can enhance SageMaker’s data processing capabilities by connecting a SageMaker Notebook instance to Qubole. Data scientists use Apache Spark to process and prepare data at scale with Qubole. Qubole Open Data Lake Platform greatly reduces the cost of computing by consuming less compute, cheaper compute, or both. With this integration, data scientists use Qubole to cleanse and prepare data (transform, featurize, join, etc.) prior to ML training in Amazon SageMaker.
  • 11. Qubole High-Level Architecture

AWS FSx for Lustre
AWS FSx for Lustre and Qubole Open Data Lake Platform together reduce users’ compute costs and minimize intermediate data loss while running workloads. Users do not pay to maintain idle AWS EC2 instances and need not worry about losing intermediate output (shuffle data) when spot nodes are interrupted. Qubole uses Amazon FSx for Lustre to store and process intermediate data through its parallel, high-speed file system. By doing so, users no longer need to retain idle EC2 instances to store this intermediate data. Instead, Amazon FSx for Lustre allows them to reuse the data that would otherwise be held in EC2 local storage.

AWS Glue
Qubole and AWS Glue provide users with the flexibility and choice of a unified, shared metastore for their data lake platform. Users use Glue’s data crawlers to scan and classify data, extract schema details, and build the data catalog. Qubole’s platform is configured with this catalog as the metastore, which is shared across users’ AWS accounts, applications, and services. With Qubole’s support for multiple open source frameworks, users run Hive queries, Presto queries, and Spark jobs that leverage this catalog. Alternatively, users can continue using their existing or Qubole-hosted metastore and synchronize it with the Glue Data Catalog.

Accessing via AWS Marketplace
Qubole makes it easier for users to access, manage, monitor, and govern the data in their S3 data lake with Open Data Lake Platform. Users can subscribe to and access the platform through AWS Marketplace, with automatic account setup, AWS authentication, and simplified user onboarding, and can be working with their data in less than an hour.
1. Copy Account ID and External ID from QDS
2. Create IAM Policies on AWS
3. Create IAM Roles on AWS
4. Link AWS and QDS accounts
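Steps 1 and 3 above follow the usual AWS pattern for cross-account access: the IAM role's trust policy names the external account and requires a matching external ID. The sketch below builds such a trust policy document using the standard IAM policy grammar. The account ID and external ID are placeholders, and whether Qubole's documented setup uses exactly this form is an assumption. The values copied from QDS and the steps in the Qubole documentation take precedence.

```python
import json


def build_trust_policy(qds_account_id: str, external_id: str) -> dict:
    """Build a cross-account IAM trust policy (illustrative helper, not a Qubole API).

    qds_account_id and external_id stand in for the values copied from QDS
    in step 1.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The external AWS account that is allowed to assume this role.
                "Principal": {"AWS": f"arn:aws:iam::{qds_account_id}:root"},
                "Action": "sts:AssumeRole",
                # The caller must present the matching external ID.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


# Placeholder values only; real values come from the QDS account settings.
policy = build_trust_policy("123456789012", "EXAMPLE-EXTERNAL-ID")
print(json.dumps(policy, indent=2))
```

The printed JSON is the kind of document attached to the role as its trust relationship in step 3. The permissions policies from step 2 are separate and are not shown here.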
  • 12. Learn More
For the latest information about our products and services, please see the following resources:
• Qubole Whitepapers
• Qubole Case Studies
• Qubole Technical Documentation

For more information, contact: sales@qubole.com

Try Qubole Open Data Lake Platform for free: you can visit the AWS Marketplace anytime to get up and running with Qubole. Start your 30-day free trial now.

469 El Camino Real, Suite 205, Santa Clara, CA 95050
(855) 423-6674 | info@qubole.com | www.qubole.com

About Qubole
Qubole is the open data lake company that provides a simple and secure data lake platform for machine learning, streaming, and ad-hoc analytics. No other platform provides the openness and data workload flexibility of Qubole while radically accelerating data lake adoption, reducing time to value, and lowering cloud data lake costs by 50 percent. Qubole is trusted by leading brands such as Expedia, Disney, Oracle, Gannett and Adobe to spur innovation and to transform their businesses for the era of big data. For more information, visit us at www.qubole.com.